For a more formal discussion than this article provides of what exactly lexing and parsing are, start with Wikipedia's definitions: Lexing and Parsing.
Also, the word parsing sometimes includes lexing and sometimes doesn't. This can cause confusion, but I'll try to keep them clear. Such situations arise with other words, and our minds usually resolve the specific meaning intended by analysing the context in which the word is used. So, keep your mind in mind.
At this point, an article such as this would normally provide a summary of historical developments in this field, as in 'how we got to here'. I won't do that, especially as I first encountered parsing many years ago, when the only tools (lex, bison, yacc) were so complex to operate I took the pledge to abstain. Nevertheless, it's good to know such tools are still available, so here are a few references:
Flex is a successor to lex, and Bison is a successor to yacc. This article explains why I (still) don't use any of these.
There are many situations where the only path to a solution requires a lexer and a parser.
The lex phase and the parse phase can be combined into a single process, but I advocate always keeping them separate, and I aim below to demonstrate why this is the best policy.
Also, for beginners, note that the phases very conveniently run in alphabetical order: We lex and then we parse.
So, let's consider some typical situations where lexing and parsing are the tools needed:
This is trivial to understand, but not to implement. In order to run a program we need to set up a range of pre-conditions:
After all, it must be syntactically correct before we run it. If not, we display syntax errors.
The real point of this step is to determine the programmer's intention, i.e. what was the reason for writing the code? Not actually to run it, but to obtain its output. And how do we do that?
Then we can gaze at the output which, hopefully, is correct. Or, perhaps, we are confronted by logic errors.
The steps are identical to the above, with HTML replacing Perl, although I can't bring myself to call writing HTML writing a program.
This time, we're asking: What is the web page designer's intention, i.e. how exactly do they want the page to be rendered?
Of course, syntax checking is far looser than with a programming language, but must still be undertaken. For instance, here's an example of clearly-corrupt HTML which can be parsed by Marpa:
<title>Short</title><p>Text</head><head>
See Marpa::R2::HTML for details. The original version of Marpa has been superseded by Marpa::R2. Now I use Marpa::R2 in all my work (which happens to not involve HTML).
Consider this file, written in the DOT language, as used by the Graphviz graph visualizer (teamwork.dot):
digraph Perl
{
    graph [ rankdir="LR" ]
    node  [ fontsize="12pt" shape="rectangle" style="filled, solid" ]
    edge  [ color="grey" ]
    "Teamwork" [ fillcolor="yellow" ]
    "Victory"  [ fillcolor="red" ]
    "Teamwork" -> "Victory" [ label="is the key to" ]
}
Here we have a 'program' and we wish to give effect to the author's intention by rendering this as an image:
What's required to do that? As above, lex, parse, render. Using Graphviz's dot command to carry out these 3 tasks, we would run:
shell> dot -Tsvg teamwork.dot > teamwork.svg
Note: Files used in these examples can be downloaded from http://savage.net.au/Ron/html/graphviz2.marpa/teamwork.tgz.
Now, the above link to the DOT language points to a definition of DOT's syntax, written in a somewhat casual version of BNF: Backus-Naur Form. This is significant, since it's usually straightforward to translate a BNF description of a language into code within a lexer and parser.
Let's say we feel the Graphviz language is too complex, and hence we write a wrapper around it, so end users can code in a simplified version of that language. This has been done, with the original effort available in the now-obsolete Perl module Graph::Easy. Tels, the author, devised his own very clever little language, which he called Graph::Easy. The manual for that is on-line here.
When I took over maintenance of Graph::Easy, I found the code too complex to read, let alone work on, so I wrote another implementation of the lexer and parser, released as Graph::Easy::Marpa. I'll have much more to say about that in the next article in this 2-part series, so for now we'll just examine the above graph re-cast in Graph::Easy (teamwork.easy):
graph {rankdir: LR}
node {fontsize: 12pt; shape: rectangle; style: filled, solid}
edge {color: grey}
[Teamwork]{fillcolor: yellow}
-> {label: is the key to}
[Victory]{fillcolor: red}
Simpler for sure, but how does Graph::Easy::Marpa work? As always: lex, parse, render. More samples of Graph::Easy::Marpa's work are here.
It should be clear by now that lexing and parsing are in fact widespread, although they often operate out of sight, with just their rendered output visible to the average programmer and web surfer.
What all such problems have in common is complex but well-structured source text formats, with a bit of hand-waving over the tacky details available to authors of documents in HTML. And, in each case, it is the responsibility of the programmer writing the lexer and parser to honour the intention of the original text's author.
And we can only do that by recognizing each token in the input as embodying some meaning (e.g. a word such as 'print' means output something of the author's choosing), and bringing that meaning to fruition (make the output visible on a device).
With all that I can safely claim that it's the ubiquity and success of lexing and parsing which justify their recognition as vital constituents in the world of software engineering. And with that we're done answering the question posed above: Why study them?
But there's another - significant - reason to discuss lexing and parsing. And that's to train programmers, without expertise in such matters, to resist the understandable urge to opt for using tools they are already familiar with, with regexps being the 'obvious' choice.
Sure, regexps suit many simple cases, and the old standbys of flex and bison are always available, but now there's a new kid on the block: Marpa.
Marpa is heavily based on theoretical work done over many decades, and comes in various forms:
Hand-crafted in C.
The Perl and C-based interface to the previous version of libmarpa.
The Perl and C-based interface to the most recent version of libmarpa. This is the version I use.
The newest and thinnest interface to libmarpa, which documents how to make Marpa accessible to non-Perl languages.
The problem, of course, is whether or not any of these are a good, or even excellent, choice.
Marpa's advantages are huge, and can be summarized as:
This alone is of great significance.
This means I can specify the task in Perl, and let Marpa handle the details.
Hence, being Open Source, you can see exactly how other people use it.
Once you get used to it, of course! And if you're having trouble, just post on the Google Group.
Actually, if you've ever worked with flex and bison, you'll be astonished at how simple it is to drive Marpa.
This is a bit surprising, since new technology usually needs some time to surpass established technology while delivering the all-important stability.
For instance, recently the author eliminated the dependency on Glib, to improve portability.
What's important is that this work is on-going, and we can expect a series of incremental improvements for some time to come.
So, some awareness of the choice of tools is important long before coding begins.
BTW: I use Marpa in Graph::Easy::Marpa and GraphViz2::Marpa.
However, this is not an article on Marpa (but the next one is), so let's return to discussing lexing and parsing.
As mentioned, the stages, conveniently, run in English alphabetical order, so we lex and then we parse.
Here, I'm using lexing to mean the comparatively simple process of tokenising a stream of text, which means chopping that input stream into discrete tokens, and identifying the type of each token. The output is a new stream, this time of stand-alone tokens. And I say 'comparatively' because I see parsing as complex compared to lexing.
And no, lexing does not do anything more than identify tokens. Therefore questions about the meanings of those tokens, or their acceptable order, are matters for the parser.
So, the lexer will say: I have found another token, and have identified it as being of some type T. Hence, for each recognized token, 2 items will be output:
Since the process happens repeatedly, the output will be an array of token elements, with each element needing at least these 2 components: type and value.
In practice, I prefer to represent these elements as a hashref, like this:
{
    count => $integer, # 1 .. N.
    name  => '',       # Unused.
    type  => $string,  # The type of the token.
    value => $value,   # The value from the input stream.
}
with the array being managed by an object of type Set::Array, which I did not write but which I do now maintain. The advantage of Set::Array over Set::Tiny is that the former preserves the ordering of the elements. See also this report comparing a range of set-handling modules.
The 'count' field, apparently redundant, is sometimes employed in the clean-up phase of the lexer, which may need to combine tokens unnecessarily split by the regexp-based approach. Also, it is available to the parser if needed, so I always include it in the hashref.
The 'name' field really is unused, but gives people who fork or sub-class my code a place to work with their own requirements, without worrying that their edits will affect the fundamental code.
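For instance, here's a minimal sketch of stockpiling such elements in a Set::Array object. The input values and the 'word' type are invented for the example, and I'm assuming Set::Array's push(), length() and at() methods (check its docs):

use strict;
use warnings;

use Set::Array;

my($stack) = Set::Array -> new;
my($count) = 0;

# Pretend these values just came out of the tokenising regexps.

for my $value (qw/digraph Perl { }/)
{
    $stack -> push
    ({
        count => ++$count,
        name  => '',
        type  => 'word', # In real code, the type detected by the lexer.
        value => $value,
    });
}

# The parser (or a test script) can then walk the elements in order.

for my $i (0 .. $stack -> length - 1)
{
    my($element) = $stack -> at($i);

    print "$$element{count}: $$element{type} => $$element{value}\n";
}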
BTW, if you have an application where the output is best stored in a tree, the Perl module Tree::DAG_Node is superb (which I also did not write but now maintain).
The parser then, concerns itself with the context in which each token appears, which is a way of saying it cares about whether or not the sequence and combination of tokens actually detected fits the expected grammar.
Ideally, the grammar is provided in BNF form. This makes it easy to translate into the form acceptable to Marpa. If it's not in that form, your work is (probably) going to be harder, simply because someone else has not done the hard work of formalizing the grammar.
And now it's time to expand on 'grammars'.
An example grammar was mentioned above: DOT. But, how are we to understand a block of text written in BNF? Well, training is of course required when taking on such a task, and to that I'd add what I've gleaned from this work, as follows.
To us beginners eventually comes the realization that grammars, no matter how formally defined or otherwise, contain within them 2 sub-grammars:
One sub-grammar specifies what a token looks like, meaning what range of forms it can assume in the input stream. If an incomprehensible candidate is detected, the lexer can generate an error, or it can activate a strategy called - by Jeffrey Kegler, the author of Marpa - Ruby Slippers (which has no relation to the Ruby programming language).
Put simply, the Ruby Slippers strategy fiddles the current token, or perhaps an even larger section of the input stream, in a way that satisfies the grammar, and restarts processing at the new current token. Marpa is arguably unique in being able to do this.
The other sub-grammar specifies how these tokens are allowed to combine, meaning if they don't conform to the grammar, the code generates a syntax error of some sort.
The plan is this: We will encode the first sub-grammar into the lexer and the second into the parser.
This says that if we know what tokens look like, we can tokenize the input stream, i.e. split it into separate tokens using a lexer. Then we give those tokens to the parser for (more) syntax checking, and for interpretation of what the user presumably intended with that specific input stream (combination of tokens).
And that gives us a clear plan-of-attack when confronted by a new project.
Truly, it just sounds complicated, because I'll be introducing various presumably-new concepts, but don't despair. It's not really that difficult.
We have to, somehow and somewhere, manage the complexity of the question 'Is this a valid document?' for a given grammar.
Recognizing a token with a regex is easy. Keeping track of the context in which we see that token, and the context in which our grammar allows that token, are hard.
Yes, the complexity of setting up and managing a formal grammar and the DFA (see below) seems like a lot of work, but it's a specified and well-understood mechanism, so we don't have to reinvent something every time.
By limiting the code we have to write to two things: a set of rules for how to construct tokens within a grammar, and a set of rules for what happens when we construct a valid combination of tokens, we can focus on the important part of our application - determining what a document which conforms to the grammar means (the author's intention) - and less on the mechanics of verifying that a document matches the grammar.
Here's how it works: The lexer's job is to recognise tokens, and our sub-grammar #1 specifies what they look like. So our lexer will have to examine the input stream, possibly one character at a time, to see if the 'current' input, appended to the immediately preceding input, fits the definition of a token.
Now, a programmer can write a lexer any way they want, of course. The way I do it is with regexps combined with a DFA (Deterministic Finite Automaton) module. This blog entry - More Marpa Madness - discusses using Marpa in the lexer (i.e. as well as in the parser, which is where I use it).
And what is a DFA? Abusing any reasonable definition, let me describe them thusly: The 'Finite' part means the machine has only a limited number of states (and our input stream only contains a limited number of different token types), which simplifies the code, and the 'Automaton' in our case is the software machine we are writing, i.e. the program. For more, see the Wikipedia article on DFAs. BTW: a DFA is commonly specified by an STT (State Transition Table).
And now, what's this State Transition Table (STT)? It is precisely the set of regexps which recognise tokens, together with instructions about what to do when a specific type of token is recognised.
But how do we make this all work? MetaCPAN is your friend! In particular, we'll use Set::FA::Element to drive the process. For candidate alternatives I assembled a list of Perl modules with relevance in the area, while cleaning up the docs for Set::FA. See https://metacpan.org/module/Set::FA#See-Also. I did not write Set::FA, nor Set::FA::Element, but I now maintain them.
Be assured, such a transformation of BNF (or whatever our grammar's source is) into a DFA gives us:
To cast BNF into regexps means we must understand exactly what the grammar definition is saying.
We end up with a spreadsheet which simply and clearly encodes our understanding of tokens.
Spreadsheet? Yes, I store the derived regexps, along with other information, in a spreadsheet, as explained below. Techniques for incorporating this spreadsheet into the source code are discussed shortly.
In practice, we simply read and re-read, many times, the definition of our BNF (here the DOT language), and build up the corresponding set of regexps to handle each 'case'. This is laborious work, no doubt about it.
For example, by using a regexp like /[a-zA-Z_][a-zA-Z_0-9]*/, we can get Perl's regexp engine to intelligently gobble up characters as long as they fit the definition. Here, the regexp is just saying: Start with a letter, upper- or lower-case, or an underline, followed by 0 or more letters, digits or underlines. Look familiar? It's very close to the Perl definition of \w, but disallows leading digits. Actually, DOT disallows them (in certain circumstances), so we have to, but of course it (DOT) does allow pure numbers in certain circumstances.
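As a tiny, self-contained illustration (not code from my lexer), here's that regexp gobbling identifiers out of a string one at a time:

use strict;
use warnings;

my($input) = 'digraph Perl { rankdir = LR }';

# \G anchors each match where the previous one stopped, so each pass
# through the loop gobbles one identifier, skipping leading spaces.

while ($input =~ /\G\s*([a-zA-Z_][a-zA-Z_0-9]*)/gc)
{
    print "Identifier: $1\n";
}

# This prints 'digraph' and then 'Perl', and then stops, because the next
# character, '{', needs a regexp of its own - which is exactly why a real
# lexer ends up with a whole set of regexps, one per kind of token.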
And what do we do with all these hand-crafted regexps? We use them as data to feed into the DFA, along with the input stream. The output of the DFA is basically a flag saying Yes/No, the input stream matches/doesn't match the token definitions specified by the given regexps. Along the way, the DFA calls one of our callback functions each time a token is successfully recognized, so we can stockpile them. Then, at the end of the run, we can output them as a stream of tokens, each with its identifying 'type', as per 'The Lexer's Job Description' above.
A note about callbacks: Sometimes it's easier to design a regexp to capture more than seems appropriate, and to use code in the callback to chop up what's been captured, outputting several token elements as a consequence.
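Here's a hand-wavy sketch of that idea. The attribute syntax and the push_token() helper are invented for this example; they're not part of any module mentioned here:

use strict;
use warnings;

# Suppose one deliberately greedy regexp captured a whole DOT-style
# attribute, e.g. 'color="grey"'. The callback splits it into 3 tokens.

sub attribute_callback
{
    my($match) = @_;

    if ($match =~ /^([a-zA-Z_]\w*)\s*=\s*"(.*)"$/)
    {
        push_token(type => 'attribute_name',  value => $1);
        push_token(type => 'equals',          value => '=');
        push_token(type => 'attribute_value', value => $2);
    }

} # End of attribute_callback.

sub push_token
{
    my(%token) = @_;

    # Real code would push a (count, name, type, value) hashref onto the
    # Set::Array object discussed above. Printing will do for a sketch.

    print "$token{type} => $token{value}\n";

} # End of push_token.

attribute_callback('color="grey"');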
Since developing the STT is such an iterative process, you'll want various test files, and some scripts with short names to run the tests. Short names because you're going to be running these scripts an unbelievable number of times....
So, what are states and why do we care about them?
Well, at any instant our STT (automaton, software machine) is in precisely 1 (one) state. Perhaps it has not yet received even 1 token (it's in the 'start' state), or perhaps it has just finished processing the previous one. Whatever, the code maintains information so as to know exactly what state it is in, and this leads to knowing exactly what set of tokens is now acceptable. I.e. any one of this set will be a legal token in the current state.
This is telling us then that we have to associate each regexp with a specific state, and vice versa, and it's implicitly telling us that we stay in the current state as long as each new input character matches a regexp belonging to the current state, and that we jump (transition) to a new state when that character does not match.
Consider this simplistic code from the synopsis of Set::FA::Element:
my($dfa) = Set::FA::Element -> new
(
    accepting   => ['baz'],
    start       => 'foo',
    transitions =>
    [
        ['foo', 'b', 'bar'],
        ['foo', '.', 'foo'],
        ['bar', 'a', 'foo'],
        ['bar', 'b', 'bar'],
        ['bar', 'c', 'baz'],
        ['baz', '.', 'baz'],
    ],
);
In the transitions parameter the first line says: 'foo' is a state's name, and 'b' is a regexp saying we jump to state 'bar' if the next input char is 'b'. And so on.
This in turn tells us that to use Set::FA::Element we have to prepare a transitions parameter matching this format. Hence the need for states and regexps.
And this is code I've used, taken directly from GraphViz2::Marpa::Lexer::DFA:
Set::FA::Element -> new
(
    accepting   => \@accept,
    actions     => \%actions,
    die_on_loop => 1,
    logger      => $self -> logger,
    start       => $self -> start,
    transitions => \@transitions,
    verbose     => $self -> verbose,
);
Let's discuss these parameters.
This is an arrayref of state names, meaning that, after processing the entire input stream, if we end up in one of these states, then we have 'accepted' that input stream. That just means that every input token matched an appropriate regexp, where 'appropriate' means every char matched the regexp belonging to the current state, whatever the state was at the instant that char was input.
This is a hashref of function names, so we can call a function, optionally, upon entering or leaving any state. That's how we stockpile recognized tokens.
Further, since we write these functions, and we attach each to a particular combination of state and regexp, we encode into the function the knowledge of what 'type' of token has just been matched when the DFA calls a function. And that's how our stockpile will end up with (token, type) pairs to output at the end of the run.
This flag, if true, tells the DFA to stop if none of the regexps belonging to the current state match the current input char, i.e. stop rather than loop for ever.
You might wonder why stopping automatically is not the default, or even mandatory. Well, it's because you might want to try some recovery algorithm in such a situation, before dying.
This is an (optional) logger object.
This is the name of the state in which the STT starts, so the code knows which regexp(s) to try upon inputting the very first char.
This is a potentially large arrayref, listing separately for all states all the regexps which are to be invoked, one at a time, in the current state, while trying to match the current input char.
Specifies how much to report if the logger object is not defined.
So, the next problem is how to prepare the grammar in such a way that it fits into this parameter list, and addressing that must come next.
The coder thus needs to develop regexps etc which can be fed directly into the chosen DFA, here Set::FA::Element, or which can be transformed somehow into a format acceptable to that module. But so far I haven't actually said how I do that. It's time to be explicit....
I use a spreadsheet with 9 columns:
This just contains 1 word, 'Yes', against the name of the state which is the start state.
This contains the word 'Yes' against the name of any state which will be an accepting state (see above).
This is the name of the state.
This is a regexp, which is read to mean: The event fires (is triggered) if the current input char matches the regexp in this column.
Now, since the regexp belongs to a given state, we know the DFA will only process regexps associated with the current state, of which there will be 1 or at most a few.
When there are multiple regexps per state, I leave all other columns empty.
The name of the 'next' state, i.e. the name of the state to which the STT will jump if the current char matches the regexp given on the same line of the spreadsheet (in the current state of course).
The optional name of the function the DFA is to call upon (just before) entry to the (new) state.
The optional name of the function the DFA is to call upon exiting from the current state.
This is a working column, in which I put formulas so that I can refer to them in various places in the 'Event' column. It is not passed to the DFA in the 'transitions' parameter.
Comments to self.
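To make the connection back to Set::FA::Element concrete, here's a sketch of how a few such rows (invented for this example, and trimmed to the Accept, State, Event and Next columns) get re-shaped into the accepting and transitions parameters. Only the [current_state, regexp, next_state] triple format comes from the module's synopsis:

use strict;
use warnings;

my(@stt) =
(
    # [Accept, State, Event (a regexp), Next].
    ['',    'start', '[a-zA-Z]', 'word'],
    ['Yes', 'word',  '[a-zA-Z]', 'word'],
);

my(@accept);
my(@transitions);

for my $row (@stt)
{
    my($accept, $state, $event, $next) = @$row;

    push @accept, $state if ($accept eq 'Yes');
    push @transitions, [$state, $event, $next];
}

# @accept and @transitions now match the accepting and transitions
# parameters used in the call to Set::FA::Element -> new() above.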
The STT for GraphViz2::Marpa is on-line here.
Now, this structure has various advantages:
It is very easy to read, and to work with. Don't forget, to start with you'll be basically switching back and forth between the grammar definition document (hopefully in BNF) and this spreadsheet. This is a way of saying there's no coding done at this stage.
Since I have not yet addressed the question of how the code will read the spreadsheet, I offer these techniques:
There is no problem with this method, except the complexity of the code (in the external module which does the reading of course), and the slowness of loading and running this code.
Actually, I should mention that since I use LibreOffice I can either force end-users to install OpenOffice::OODoc, or export the spreadsheet as an Excel file, in order to avail themselves of this option. I have chosen to not support reading the *.ods file directly in the modules (Graph::Easy::Marpa and GraphViz2::Marpa) I ship.
This way, we can read a CSV file into the DFA fairly quickly, without loading the module which reads spreadsheets.
Be careful here with LibreOffice, since it forces you to use Unicode for the spreadsheet, but exports e.g. double-quotes as the 3 byte sequence 0xe2, 0x80, 0x9c, which when used in a regexp will never match a 'real' double-quote in your input stream. Sigh. Do No Evil. If only. So, when exporting, always choose the ASCII option.
We do this in 2 stages: Export to a CSV file, and just use an editor to append that file to the end of the source code of our module, after the __DATA__ token.
Such in-line data can be accessed effortlessly by the very neat and very fast module Data::Section::Simple. Clearly, since our module has been loaded already, because it's precisely what's being executed, there is essentially no overhead whatsoever in reading data from within it. Don't you just love Perl! And MetaCPAN of course. And a community which contributes such wondrous code.
An advantage of this alternative is that it lets end-users edit the shipped *.csv or *.ods files, after which they can use a command line option on scripts to read their file, over-riding the built-in STT.
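Putting the last two techniques together, here's a minimal, self-contained sketch of reading a CSV version of the STT from the __DATA__ section, using Data::Section::Simple and Text::CSV. The '@@ stt.csv' marker is how Data::Section::Simple names a section, and the sample rows re-use the invented columns from the sketch above:

use strict;
use warnings;

use Data::Section::Simple 'get_data_section';
use Text::CSV;

my($raw) = get_data_section('stt.csv');
my($csv) = Text::CSV -> new({binary => 1}) || die 'Cannot create Text::CSV object';

my(@row);

for my $line (grep {length} split /\n/, $raw)
{
    $csv -> parse($line) || die "Cannot parse: $line";

    push @row, [$csv -> fields];
}

# @row now holds the STT, ready to be validated and re-shaped into the
# parameters which Set::FA::Element demands.

print "@$_\n" for @row;

__DATA__
@@ stt.csv
"","start","[a-zA-Z]","word"
"Yes","word","[a-zA-Z]","word"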
After all this, it's just a matter of code which reads and validates the structure of the STT's data, and then reformats it into what Set::FA::Element demands.
So, enough about the lexer, but we can say that by now the first sub-grammar has been incorporated into the design and code of the lexer, and that the second sub-grammar must now be encoded into the parser, for that's how the parser performs syntax checking.
How we do this depends intimately on which pre-existing module, if any, we choose to use to aid the development of the parser. Since I choose Marpa (currently Marpa::R2), I am orienting this article to that module. However, only in the next article will I deal in depth with Marpa.
Whichever module is chosen, you can think of the process like this: Our input stream is a set of pre-defined tokens (probably but not necessarily output from the lexer), and we must now specify all possible legal combinations of those tokens, i.e. the syntax of the language. This really means the remainder of the syntax, since by now we've lost interest in the definition of a (legal) token. I.e. at this point we are assuming all incoming tokens are legal, which is a way of saying we will not try to parse and run a program containing token-based syntax errors, although it may contain logic errors (even if written in Perl :-).
Then, a combination of tokens which does not match any of the given legal combinations can be immediately rejected as a syntax error. And keep in mind that the friendliest compilers find as many syntax errors as possible per parse.
And since this check takes place on a token-by-token basis, we (ought to) know precisely which token triggered the error, which means we can emit a nice error message, identifying the culprit and its context.
Here's a sample of a Marpa::R2 grammar (adapted from its synopsis):
my($grammar) = Marpa::R2::Grammar -> new
({
    actions => 'My_Actions',
    start   => 'Expression',
    rules   =>
    [
        { lhs => 'Expression', rhs => [qw/Term/] },
        { lhs => 'Term',       rhs => [qw/Factor/] },
        { lhs => 'Factor',     rhs => [qw/Number/] },
        { lhs => 'Term',       rhs => [qw/Term Add Term/],          action => 'do_add' },
        { lhs => 'Factor',     rhs => [qw/Factor Multiply Factor/], action => 'do_multiply' },
    ],
    default_action => 'do_something',
});
We need to understand these parameters before being able to write something like this for our chosen grammar.
Now, despite the differences between this and the calls to Set::FA::Element -> new() above, it's basically the same:
This is the name of a Perl package in which actions such as do_add() and do_multiply() will be looked for. OK, so the lexer has no such option, the 'current' package being the default.
This is the lhs name of the rule to start with, as with the lexer.
This is just an arrayref of rule descriptors defining the syntax of the grammar. It corresponds to the lexer's transitions parameter.
Use this (optional) callback as the action for any rule element which does not explicitly specify its own action.
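Before tackling DOT, here's the synopsis example above fleshed out into something runnable, with the token stream typed in by hand to play the role of the lexer's output. The action subs follow Marpa::R2's calling convention: the first argument is a per-parse object, and the rest are the values of the rule's rhs symbols:

use strict;
use warnings;

use Marpa::R2;

sub My_Actions::do_add       { return $_[1] + $_[3] }
sub My_Actions::do_multiply  { return $_[1] * $_[3] }
sub My_Actions::do_something { return $_[1] }          # Default: pass the first value through.

my($grammar) = Marpa::R2::Grammar -> new
({
    actions => 'My_Actions',
    start   => 'Expression',
    rules   =>
    [
        { lhs => 'Expression', rhs => [qw/Term/] },
        { lhs => 'Term',       rhs => [qw/Factor/] },
        { lhs => 'Factor',     rhs => [qw/Number/] },
        { lhs => 'Term',       rhs => [qw/Term Add Term/],          action => 'do_add' },
        { lhs => 'Factor',     rhs => [qw/Factor Multiply Factor/], action => 'do_multiply' },
    ],
    default_action => 'do_something',
});

$grammar -> precompute;

# Feed in the token stream for '42 * 1 + 7', one (type, value) pair at a time.

my($recce) = Marpa::R2::Recognizer -> new({grammar => $grammar});

$recce -> read('Number', 42);
$recce -> read('Multiply');
$recce -> read('Number', 1);
$recce -> read('Add');
$recce -> read('Number', 7);

print ${$recce -> value}, "\n"; # Prints 49.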
So our real problem is re-casting the syntax from BNF, or whatever, into a set of these (lhs, rhs, action) rule descriptors.
Now, how do we think about such a problem? I suggest comparing and contrasting real code with what the grammar says it must be:
Firstly, let's repeat teamwork.dot from above:
digraph Perl
{
    graph [ rankdir="LR" ]
    node  [ fontsize="12pt" shape="rectangle" style="filled, solid" ]
    edge  [ color="grey" ]
    "Teamwork" [ fillcolor="yellow" ]
    "Victory"  [ fillcolor="red" ]
    "Teamwork" -> "Victory" [ label="is the key to" ]
}
Generalizing, we know a Graphviz (DOT) graph must start like one of these:
strict digraph $id {...} # Case 1. $id is a variable.
strict digraph     {...} # Case 2.
strict graph   $id {...} # Case 3.
strict graph       {...} # Case 4.
digraph        $id {...} # Case 5.
digraph            {...} # Case 6.
graph          $id {...} # Case 7.
graph              {...} # Case 8.
As indeed the real code does, with the graph's id being Perl (i.e. case 5 in that list).
If you've ever noticed that BNFs can be written as a tree (can they?), you'll know what comes next: We're going to start writing rule descriptors from the root down.
Drawing this as a tree gives:
     DOT's Grammar
           |
           V
---------------------
|      strict       |
---------------------
           |
           V
---------------------
| digraph or graph  |
---------------------
           |
           V
---------------------
|       $id         |
---------------------
           |
           V
         {...}
But wait, what's this? I've just said in that tree that the strict is optional. Well, no it's not, in the parser. It is optional in the DOT language, but I designed the lexer, and I therein ensured it would necessarily output strict => no when the author of the graph omitted the strict.
So, by the time we're inside the parser, it's no longer optional, and that was done simply to make the life easier for consumers of the lexer's output stream, such as authors of parsers.
Likewise, for digraph 'v' graph, I designed the lexer to output digraph => 'yes' in one case and digraph => 'no' in the other.
What does that mean? It means, for teamwork.dot, the lexer will output (in some convenient format) the equivalent of:
strict   => no
digraph  => yes
graph_id => Perl
...
graph_id was chosen because the DOT language allows other types of ids, e.g. for nodes, edges, ports and compass points.
This gives us our first 6 Marpa-friendly rules, embedded in an arrayref of rules:
[
    {   # Root-level stuff.
        lhs => 'graph_grammar',
        rhs => [qw/prolog_and_graph/],
    },
    {
        lhs => 'prolog_and_graph',
        rhs => [qw/prolog_definition graph_sequence_definition/],
    },
    {   # Prolog stuff.
        lhs => 'prolog_definition',
        rhs => [qw/strict_definition digraph_definition graph_id_definition/],
    },
    {
        lhs    => 'strict_definition',
        rhs    => [qw/strict/],
        action => 'strict', # <== Callback.
    },
    {
        lhs    => 'digraph_definition',
        rhs    => [qw/digraph/],
        action => 'digraph', # <== Callback.
    },
    {
        lhs    => 'graph_id_definition',
        rhs    => [qw/graph_id/],
        action => 'graph_id', # <== Callback.
    },
    ...
]
That is, we're saying the graph, as a whole, consists of:
And then...
Remember, those names 'prolog_and_graph' etc are just ids I made up.
Next, a prolog consists of:
Which is now not optional, and then...
Which will turn out to match the lexer input of /^(di|)graph$/, and the lexer output of digraph => /^(yes|no)$/, and then...
Which is optional, and then...
Which will be the precise definition of real live graphs, represented above by {...} in the list of the 8 possible formats for the prolog.
But take another look at those rule descriptors. They say nothing about the values of the tokens! For instance, in graph_id => Perl, what happens to ids such as Perl? Nothing. They are ignored. And that's because that's just how these grammars work.
Recall: It was the lexer's job to identify valid graph ids based on the 1st sub-grammar. By the time the data hits the parser, we know we have a valid graph id, and as long as it plugs in to the structure of the grammar in the right place, we are prepared to accept any valid graph id. Hence Marpa::R2 does not even look at the graph id, which is a way of saying this one grammar works with every valid graph id.
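So where did the value of graph_id go? Into the action callback named in the rule descriptor, which receives the token's value. Here's a hedged sketch of such a callback; the package name must match the grammar's actions parameter, the printing is purely for illustration, and in GraphViz2::Marpa the real callbacks do rather more work:

package My_Actions;

use strict;
use warnings;

sub graph_id
{
    my(undef, $id) = @_; # First arg is the per-parse object; $id is the token's value, e.g. 'Perl'.

    print "Graph id: $id\n"; # Real code would stash this for later rendering.

    return $id;

} # End of graph_id.

1;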
This point also raises the tricky discussion of whether a specific implementation of lexer/parser code can or must keep the 2 phases separate, or whether in fact you can roll them into one without falling for the 'premature optimisation' trap. I'll just draw a veil over that discussion, since I've already declared my stance: They're implemented in 2 modules.
But if these rules have to be chained into a tree, how do we handle the root? Well, consider this call to Marpa::R2's new() method:
my($grammar) = Marpa::R2::Grammar -> new(... start => 'graph_grammar', ...);
And graph_grammar is precisely the lhs in the first rule descriptor.
After that, every rule's rhs, including the root's, must be defined later in the list of rule descriptors. This forms the links in the chain, and if drawn you'll see the end result is a tree.
Here's the full Marpa::R2 grammar for DOT (as used in the GraphViz2::Marpa module) as an image: http://savage.net.au/Ron/html/graphviz2.marpa/Marpa.Grammar.svg. This image was created with (you guessed it!) Graphviz via GraphViz2. Numbers have been added to node names in the tree, otherwise Graphviz would regard any 2 identical numberless names as one and the same node.
Here I'll stop building the tree of the grammar (see the next article), and turn to some design issues.
The remainder of this document is to help beginners orient their thinking when confronted with a problem they don't have experience at tackling. Of course, if you're an expert in lexing and parsing, feel free to ignore everything I say.
And, if you think I've misused lexing/parsing terminology here, please let me know.
Yep, this old one again. It has various connotations:
Don't aim to combine the lexer and parser, even though that's what might eventuate.
I.e. wait until the design of each is clear and finalized, before trying to jam them into a single module (or program).
Do make the lexer identify the existence of tokens, but not identify their ultimate role or meaning.
Don't make the lexer do context analysis.
Here I mean make the parser be the one to disambiguate tokens with multiple meanings, by using the context, which at this point are tokens identified by the lexer.
And context analysis for businesses, for example, is probably not what you want either.
Don't make the lexer do syntax checking. This is effectively the same as the last point.
Don't minimize the lexer's output stream. For instance, don't force the code which reads the lexer's output to guess whether or not a variable-length set of tokens has ended. Output a specific token as a set terminator. The point of this token is to tell the parser exactly what's going on. Without such a token, the next token has to do double-duty: Firstly it tells the parser the variable-length part has finished, and secondly, it represents itself. Such overloading is unnecessary.
In the STT, don't try to minimize the number of states, at least not until the code has stabilized (i.e. is no longer under [rapid] development).
I develop my STTs in a spreadsheet, which means a formula (regexp) stored in 1 cell can be referenced in any number of other cells. This is very convenient.
Hmmm, another ancient aphorism. Naturally, these persist precisely because they're telling us something important.
Here, it means study the problem carefully, and deal with each part (lexer, parser) of it separately. 'Nuff said.
Yes, I know you'd never do that.
Anyway, there are Perl modules to help with things like the STT. E.g.: Set::FA::Element. Check its 'See Also' (in Set::FA, actually) for other STT helpers.
Developing the STT takes many iterations:
For each iteration, prepare a separate test case.
Have a tiny script which runs 1 test. Giving it a short, perhaps temporary, name, makes each test just that little bit easier to run.
By temporary name I mean you can give it a meaningful name later, when including it in the distro.
Have a script which runs all tests.
I keep the test data files in the data/ dir, and the scripts in the scripts/ dir. Then, creating tests in the t/ dir can perhaps utilize these two sets of helpers.
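For what it's worth, the 'run everything' script can be as small as this sketch. The data/*.dot layout follows the convention just mentioned, and scripts/lex.pl stands in for whatever your single-test script is called:

use strict;
use warnings;

# Run the single-test script over every sample file, reporting failures.

for my $file (sort glob 'data/*.dot')
{
    print "Processing $file ...\n";

    my($exit) = system($^X, 'scripts/lex.pl', $file);

    print "\tFailed: $file\n" if ($exit != 0);
}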
Since I've only used Marpa::R2 for graphical work, the output of the wrapper is a web page, which makes viewing the results simple. I like to include (short) input or output text files on such a page, beside the *.svg images. That way I can see at a glance what the input was and hence I can tell what the output should be without switching to the editor's window.
There's a little bit of effort initially, but after that it's so easy to check the output of the latest test.
Sample output from my wrapper scripts:
As with the STT, this, at least for me, is very much a trial-and-error process.
Tips:
A good idea is not to start by coding with your editor, but to draw the grammar as a tree, on paper.
This refers to when one of several tokens can appear in the input stream. Learn exactly how to draw that without trying to minimize (see above) the number of branches in the tree.
Of course, you will still need to learn how to code such a construct. Here's a bit of code from Graph::Easy::Marpa which deals with this (note: we're back to the Graph::Easy language from here on!):
{   # Graph stuff.
    lhs => 'graph_definition',
    rhs => [qw/graph_statement/],
},
{
    lhs => 'graph_statement', # 1 of 3.
    rhs => [qw/group_definition/],
},
{
    lhs => 'graph_statement', # 2 of 3.
    rhs => [qw/node_definition/],
},
{
    lhs => 'graph_statement', # 3 of 3.
    rhs => [qw/edge_definition/],
},
This is telling you that a graph thingy can be any one of a group, node or edge. It's Marpa::R2's job to try the 1/2/3 of 3 in order, to see which (if any) matches the input stream.
So, this represents a point in the input stream where one of several alternatives can appear.
The tree would look like:
                graph_definition
                       |
                       V
                graph_statement
                       |
                       V
---------------------------------------------------
|                      |                          |
V                      V                          V
group_definition   node_definition   edge_definition
The comment '3 of 3', for instance, says an edge can stand alone.
But consider the node_definition:
{   # Node stuff.
    lhs => 'node_definition',
    rhs => [qw/node_sequence/],
    min => 0,
},
{
    lhs => 'node_sequence', # 1 of 4.
    rhs => [qw/node_statement/],
},
{
    lhs => 'node_sequence', # 2 of 4.
    rhs => [qw/node_statement daisy_chain_node/],
},
{
    lhs => 'node_sequence', # 3 of 4.
    rhs => [qw/node_statement edge_definition/],
},
{
    lhs => 'node_sequence', # 4 of 4.
    rhs => [qw/node_statement group_definition/],
},
Here the comment '3 of 4' tells you that nodes can be followed by edges.
A realistic sample is: [node_1] -> [node_2], where '[x]' is a node and '->' is an edge. This works because a node can be followed by an edge (applying '3 of 4' above), and that edge can in turn be followed by a node (applying '3 of 4' in the edge_definition below).
So, this (above and below) represents a point in the input stream where one of several specific sequences of tokens are allowed/expected. Here's the edge_definition:
{   # Edge stuff.
    lhs => 'edge_definition',
    rhs => [qw/edge_sequence/],
    min => 0,
},
{
    lhs => 'edge_sequence', # 1 of 4.
    rhs => [qw/edge_statement/],
},
{
    lhs => 'edge_sequence', # 2 of 4.
    rhs => [qw/edge_statement daisy_chain_edge/],
},
{
    lhs => 'edge_sequence', # 3 of 4.
    rhs => [qw/edge_statement node_definition/],
},
{
    lhs => 'edge_sequence', # 4 of 4.
    rhs => [qw/edge_statement group_definition/],
},
{
    lhs => 'edge_statement',
    rhs => [qw/edge_name attribute_definition/],
},
{
    lhs    => 'edge_name',
    rhs    => [qw/edge_id/],
    action => 'edge_id',
},
But, I have to stop somewhere, so...
I hope I've clarified what can be a complex and daunting part of programming, and I also hope I've convinced you that working in Perl, with the help of a spreadsheet, is the modern aka only way to lexer and parser bliss.
Ron Savage .
Home page: http://savage.net.au/index.html
Australian Copyright © 2012 Ron Savage. All rights reserved.
All Programs of mine are 'OSI Certified Open Source Software'; you can redistribute them and/or modify them under the terms of The Artistic License, a copy of which is available at: http://www.opensource.org/licenses/index.html