Lexing and Parsing Continued

Table of contents

Lexing and Parsing Continued
Brief Recap - The Two Grammars
Some Context
Scripts for Testing
Some Modules
Some Notes on the STT
Working With An Incomplete BNF
Understanding the State Transition Table
Line One
Line Two
Line Three
Line Four
More States
The 'graph' State
The 'graph_id' State
The Remaining States
Lexer Actions (Callbacks)
A long Exit-state Function
A tiny Exit-state Function
Using Marpa in the Lexer
The Parser's Structure
Marpa Actions (Callbacks)
The Grammar in Practice
Trees Have Leaves
Where to from here
Wrapping Up and Winding Down
The Lexer and the State Transition Table - Revisited
Sample Output
Author
Licence

Lexing and Parsing Continued

This article is part 2 of a series which started with http://savage.net.au/Ron/html/graphviz2.marpa/Lexing.and.Parsing.Overview.html.

That article aimed to discuss lexing and parsing in general terms, while trying to minimize the detail on how to actually use Marpa::R2 to do the work. In the end, however, it did include quite a few specifics. This article goes into yet more detail on working with both a lexer and a parser. BTW, Marpa's blog is here.

Also, files used in these examples can be downloaded from http://savage.net.au/Ron/html/graphviz2.marpa/teamwork.tgz.

Brief Recap - The Two Grammars

Article 1 defined the first sub-grammar as the one which identifies tokens, and the second sub-grammar as the one which specifies which combinations of tokens are legal within the target language.

And, the lexer implements the first sub-grammar, and the parser implements the second.

Some Context

Consider this image:

It's actually a copy of the image on this page of the manual for Graph::Easy. Note: My module Graph::Easy::Marpa is a complete re-write of Graph::Easy, after I offered to take over maintenance of the latter, and found the code so complex I literally couldn't understand any of it.

There are 3 ways (of interest to us) to specify the contents of this image (formatted for readability):

o As a Perl program using the GraphViz2 module

This is teamwork.pl:

        #!/usr/bin/env perl

        use strict;
        use warnings;

        use GraphViz2;

        # ---------------

        my($graph) = GraphViz2 -> new
                (
                graph  => {rankdir => 'LR'},
                edge   => {color => 'grey'},
                logger => '',
                );

        $graph -> default_node(fontsize => '12pt', shape => 'rectangle', style => 'filled, solid');

        $graph -> add_node(name => 'Teamwork', fillcolor => 'red');
        $graph -> add_node(name => 'Victory',  fillcolor => 'yellow');

        # The dir attribute makes the arrowhead appear.

        $graph -> add_edge(dir => 'forward', from => 'Teamwork', to => 'Victory', label => 'is the key to');

        my($format)      = shift || 'svg';
        my($output_file) = shift || "teamwork.$format";

        $graph -> run(format => $format, output_file => $output_file);

o As a graph definition written in the Graph::Easy little language

Here we use the Graph::Easy language invented by the author (Tels) of the Graph::Easy Perl module. Call this teamwork.easy. It's actually input for Graph::Easy::Marpa:

        graph {rankdir: LR}
        node {fontsize: 12pt; shape: rectangle; style: filled, solid}
        [Teamwork]{fillcolor: yellow}
        -> {label: is the key to}
        [Victory]{fillcolor: red}

Note: In some rare cases, the syntax supported by Graph::Easy::Marpa will not be exactly identical to the syntax supported by the original Graph::Easy.

o As a DOT file

Call this teamwork.dot:

        digraph Perl
        {
        graph [ rankdir="LR" ]
        node  [ fontsize="12pt" shape="rectangle" style="filled, solid" ]
        edge  [ color="grey" ]
        "Teamwork" [ fillcolor="yellow" ]
        "Victory"  [ fillcolor="red" ]
        "Teamwork" -> "Victory" [ label="is the key to" ]
        }

This article is about using GraphViz2::Marpa to parse files of the last type, i.e. in the DOT language.

Of course the Graphviz package itself provides a set of programs which parse DOT files in order to render them into many different formats.

So, you're wondering why anyone would write a new parser for DOT. One reason is to practice your Marpa skills. Or, perhaps, you plan to write an on-line editor for Graphviz files.

Another reason is to provide add-on services to the Graphviz package. For instance, some users might want to find all clusters of nodes, where a cluster is a set of nodes connected to each other, but not connected to any nodes outside the cluster. Yet other users might want to find all paths of a given length emanating from a given node.

In fact, I have written algorithms which provide these last 2 features. See the module GraphViz2::Marpa::PathUtils and the demo page.

But back to using Marpa::R2 from within GraphViz2::Marpa.

Scripts for Testing

The code being developed obviously needs to be tested thoroughly, since any such little language has many ways to get things right, and a horrendously large number of ways to get things slightly wrong, or worse.

Luckily, since graphs specified in DOT can be very brief, it's a simple matter to make up many samples. Further, other more complex samples can be copied from the Graphviz distro itself, where they ship in the graphs/directed/ and graphs/undirected/ directories.

In fact, the GraphViz2::Marpa distro I developed ships with (currently) 86 data/*.gv (input) files, and the corresponding 79 data/*.lex, data/*.parse, data/*.rend and html/*.svg (output) files. The missing files are due to deliberate errors in the input files, which stop those 7 output files from being created.

As well, there are the obvious scripts, lex.pl (lex a file), parse.pl (parse a lexed file), rend.pl (render a parsed file back into DOT), and one named vaguely after the package, g2m.pl, which runs the lexer and the parser.

Why a rend.pl? Simply that if we can't reconstruct the input DOT file, something got lost in translation...

Also, there'll be some scripts which operate on a set of files (data/ and html/ are dirs within the distro):

data/*.gv -> dot2lex.pl -> runs lex.pl once per file -> data/*.lex (CSV files).

data/*.lex -> lex2parse.pl -> runs parse.pl once per file -> data/*.parse (CSV files).

data/*.parse -> parse2rend.pl -> runs rend.pl once per file -> data/*.rend (dot files).

data/*.rend -> rend2svg.pl -> html/*.svg.

Lastly, generate.demo.pl is used to create the demo page.

g2m.pl is normally the only one you want. All others are primarily to help with testing.

See the documentation for more on the scripts, which ship in the scripts/ directory within the distro.

Some Modules

GraphViz2::Marpa::Lexer::DFA is a wrapper around Set::FA::Element. It has various tasks to do:

o Process the State Transition Table (STT)

The STT comes in via GraphViz2::Marpa::Lexer, which got it from within its own source code or an external CSV file.

The lexer has already validated the structure of the STT's data.

The STT is on-line.

o Transform the STT from our form (spreadsheet/CSV file) into what Set::FA::Element expects
o Set up the logger

Actually, it gets one from its caller, GraphViz2::Marpa::Lexer.

o Provide the code for all the functions which handle enter-state and exit-state events

This is the code which can apply checking above and beyond what was built into the set of regexps which came from the spreadsheet.

For example, I could have used Perl's \w instead of /[a-zA-Z_][a-zA-Z_0-9]*/ to find alphanumeric tokens, and at this point rejected those starting with a digit, since that restriction is imposed by DOT. (There's a sketch of such a check just after this list.)

But most importantly of all, this code stockpiles the tokens themselves, and something which identifies what type of token each is. Hence the 2 columns in the sample data/27.lex just below.

o Run the DFA
o Check the result of that run

That is: did we end up in an accepting state or not? Yes is ok, no is an error.
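
About the digit check mentioned under 'Provide the code ...' above: a hypothetical exit function could apply it. This is a sketch in the style of the real exit functions shown later (validate_id() is my name; it's not in the distro):

        sub validate_id
        {
                my($dfa)   = @_;
                my($match) = trim($dfa -> match);

                # Enforce DOT's rule that an ID may not start with a digit,
                # which a loose \w-based regexp would have let through.

                die "Invalid ID '$match': IDs may not start with a digit\n" if ($match =~ /^[0-9]/);

                $myself -> new_item('node_id', $match);

        } # End of validate_id.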

Here is some sample data which ships with GraphViz2::Marpa, formatted for maximum clarity:

o A DOT file, data/27.gv, input to the lexer:
        digraph graph_27
        {
                node_27_1
                [
                        color     = red
                        fontcolor = green
                ]
                node_27_2
                [
                        color     = green
                        fontcolor = red
                ]
                node_27_1 -> node_27_2
        }
o A token file, data/27.lex, output from the lexer:
        "type","value"
        strict          , "no"
        digraph         , "yes"
        graph_id        , "graph_27"
        start_scope     , "1"
        node_id         , "node_27_1"
        open_bracket    , "["
        attribute_id    , "color"
        equals          , "="
        attribute_value , "red"
        attribute_id    , "fontcolor"
        equals          , "="
        attribute_value , "green"
        close_bracket   , "]"
        node_id         , "node_27_2"
        open_bracket    , "["
        attribute_id    , "color"
        equals          , "="
        attribute_value , "green"
        attribute_id    , "fontcolor"
        equals          , "="
        attribute_value , "red"
        close_bracket   , "]"
        node_id         , "node_27_1"
        edge_id         , "->"
        node_id         , "node_27_2"
        end_scope       , "1"

Again, here's the demo page.

Some Notes on the STT

Firstly, note that when reading the input file, whole-line comments, matching m!^(?:#|//)!, are allowed. These lines are discarded when the input file is read, and so do not appear in the STT.
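
In Perl, that filter is a one-liner. A minimal sketch, assuming the file has already been read into @lines (the sample rows here are made up):

        my(@lines) = ('# A comment.', '// Another comment.', 'Yes,,initial,strict,graph,,save_prefix');
        my(@stt)   = grep {! m!^(?:#|//)!} @lines;

        # @stt now holds just the last line.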

Working With An Incomplete BNF

This section discusses a situation where your input language, here DOT, is specified by a BNF (Backus-Naur Form) but there are components in the language (IDs in DOT) not precisely defined by the BNF.

Now for all the bad news about DOT's IDs. IDs can be surrounded by double-quotes, and in some cases must be surrounded by double-quotes. To be more specific, if we regard an attribute declaration to be of the form key=value, both the key and the value can be embedded in double-quotes, and sometimes the value must be.
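
This suggests a simple test for when the double-quotes are mandatory, reusing the ID regexp from the STT. A sketch (needs_quotes() is my name, not something from the distro):

        sub needs_quotes
        {
                my($id) = @_;

                # Alphanumerics (not starting with a digit) and numbers are
                # safe unquoted. HTML-like <...> IDs are ignored in this sketch.

                return $id !~ /^(?:[a-zA-Z_][a-zA-Z_0-9]*|-?(?:\.[0-9]+|[0-9]+(?:\.[0-9])*))$/;

        } # End of needs_quotes.

        # needs_quotes('green') is false; needs_quotes('6,6') is true.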

Even worse, IDs turn up inside attributes: in key=value, both the key and the value are IDs. For instance, you might want your font color to be green (see above). That appears to be simple, but note: attributes must be attached to some component of the graph. Hence (abbreviated): node_27_1 [fontcolor = green].

But here's the pain. In DOT, the thing to which the attribute belongs may be omitted as being 'understood'. I.e. the name of the thing's owner is optional.

For instance, you might want the graph to be 6 inches x 6 inches. Here's how you can specify that requirement:

o graph [size = "6,6"]

The double-quotes are mandatory in this context.

o [size = "6,6"]
o size = "6,6"

But wait, there's more! The value of the attribute can be omitted, if it's 'true'. Hence the distro, and the demo page, have a set of tests, called data/42.*.gv, targeted at testing this part of the code. Grrrr.

All this means is that when the terminator of the attribute declaration (in the input stream) is detected, and we switch states from 'within-attribute' to 'after-attribute', the code which emits output from the lexer has to have some knowledge of what the hell is going on, so that it can pretend it received the 1st of these 3 forms even if it received the 2nd or 3rd form. That is, it must output 'graph' (the graph as a whole) as the owner of the attribute in question.

And, as you've seen, any attribute declaration can contain a set of attribute key=value pairs as in the 'node_27_2 [ color = green fontcolor = red ]' above.

Conclusion: Don't try to solve every problem with regexps. Be prepared to add code in 2 places:

o At switch of state
o After all input has been parsed

Indeed, GraphViz2::Marpa::Lexer::DFA contains a long sub, _clean_up(), which repeatedly fiddles the array of detected tokens, fixing up the list before it's fit to be inflicted on the parser.
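
A hypothetical fragment in the same spirit (this is not _clean_up()'s actual code) shows the kind of fiddling involved - splicing in 'graph' as the owner of an ownerless attribute list:

        my(@items) =
        (
                {type => 'open_bracket',    value => '['},
                {type => 'attribute_id',    value => 'size'},
                {type => 'equals',          value => '='},
                {type => 'attribute_value', value => '6,6'},
                {type => 'close_bracket',   value => ']'},
        );

        for my $i (0 .. $#items)
        {
                if ( ($items[$i]{type} eq 'open_bracket')
                        && ( ($i == 0) || ($items[$i - 1]{type} !~ /^(?:node_id|class_id)$/) ) )
                {
                        # No owner precedes the '[', so pretend 'graph' was seen.

                        splice(@items, $i, 0, {type => 'class_id', value => 'graph'});
                }
        }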

Understanding the State Transition Table

Recall this diagram from the first article:

             DOT's Grammar
                  |
                  V
        ---------------------
        |                   |
     strict                 |
        |                   |
        ---------------------
                  |
                  V
        ---------------------
        |                   |
     digraph     or       graph
        |                   |
        ---------------------
                  |
                  V
        ---------------------
        |                   |
       $id                  |
        |                   |
        ---------------------
                  |
                  V
                {...}

A dot file starts with an optional 'strict', followed by either 'graph' or 'digraph'. (The 'di' stands for directed, meaning edges between nodes have arrowheads on them. And yes, there are many attributes which can be attached to edges. See http://www.graphviz.org/content/attrs, and look for an 'E' (edge) in the second column of the table of attribute names).

The /(di|)graph/ is in turn followed by an optional ID.

Line One

So, examining the first line (after the headings) of the STT (http://savage.net.au/Perl-modules/html/graphviz2.marpa/default.stt.html), you can see how we ended up with:

o A start state flag => 'Yes'

Meaning the state defined on this line is the start state.

o An accept flag => ''

Since the initial state can't be an accepting state, this column - in the row - must be empty.

Further down in the STT there must necessarily be states with 'Yes' in this column.

o A state name => 'initial'

I just chose to call it 'initial'. Many times it's called 'start'.

o An event => 'strict'

This - although it might not yet be clear - is actually a regexp, /(?:strict)/. The DFA adds the prefix '/(?:' and the suffix ')/'.

o A next state => 'graph'

I.e. The state to jump to if indeed we have a match. Here, match means a match between the regexp (event) in the previous column and the head of the input stream.

o An entry function => ''

It's not used here, but in general this says to call a particular function when entering the named state.

o An exit function => 'save_prefix'

Similar to the entry function, this says to call a particular function when exiting the named state.

o A regexp => '(?:"[^"]*"|<\s*<.*?>\s*>|[a-zA-Z_][a-zA-Z_0-9]*|-?(?:\.[0-9]+|[0-9]+(?:\.[0-9])*))'

This means I can save a set of regexps in this column, and use spreadsheet formulas elsewhere to refer to them.

o An interpretation => 'ID'

Notes to self. This one says the regexp in the previous column specifies what an ID must match in the DOT language.

Line Two

The event in the second line, /(?:graph|digraph)/, tells us what to do if the 'strict' is absent from the input stream.

To clarify this point, recall that the DFA matches entries in the 'Event' column one at a time, from the first listed against the name of the state - here /(?:strict)/ - down to the last for the given state - here /(?:\s+)/.

The first regexp to match wins, in that it's the one which triggers the exit state logic, and it's the one whose entry in the 'Next' column specifies which is the next state to be entered. That in turn specifies which (next) state's entry function is called, if any.

So, if 'strict' is not at the head of the input stream, and it can definitely be absent, as is seen in the above diagram, this regexp - /(?:graph|digraph)/ - is the next one tested by the DFA's logic.
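
Here's a sketch of that first-match-wins logic. It's the idea only, not Set::FA::Element's real code:

        my($current_state) = 'initial';
        my($input)         = 'digraph graph_27 {}';
        my(%rules)         =
        (
                initial =>
                [
                        ['strict',        'graph'],
                        ['graph|digraph', 'graph_id'],
                ],
        );

        for my $rule (@{$rules{$current_state}})
        {
                my($event, $next) = @$rule;

                if ($input =~ s/^(?:$event)//) # Rip the match off the head of the stream.
                {
                        $current_state = $next; # First match wins.

                        last;
                }
        }

        # $current_state is now 'graph_id', since /(?:strict)/ did not match.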

Line Three

The hard-to-read regexp \/\*.*\*\/ says to skip C-language-style (/* ... */) comments. The skip takes place because the Next state is 'initial', the current state. In other words the text at the head of the input stream which is gobbled up by this regexp, is simply discarded.

And why is it discarded? Because that's the way Set::FA::Element operates. Looping within a state does not trigger the exit-state and enter-state functions. Hence there is no opportunity to stockpile the matched text. But that's good in this case. We don't want to save it, precisely because it's a comment.

If you think about this, though, it does mean that once a comment (or anything) is discarded, it becomes impossible for the text which is stockpiled to ever exactly replicate the input stream. Hence you only take the decision to discard something when you fully understand the consequences.

Lastly, we can say this regexp is used often, meaning we accept such comments at many places in the input stream.

Line Four

The regexp \s+ says to skip spaces (i.e. in front of or between interesting tokens). As with the previous line, we skip to the very same state.

So, this state has 4 regexps attached to it.

More States

Re-examining the STT shows we have 2 introductory states, for input with and without a (leading) 'strict'. The states are arbitrarily called 'initial' and 'graph'.

This says that if the initial 'strict' is present, state 'initial' handles it (in the exit function), and jumps to state 'graph' to handle what comes next.

If, however, 'strict' is absent, state 'initial' still handles it, but it then jumps to state 'graph_id'.

A (repeated) word of warning about Set::FA::Element. A loop within a state does not trigger the exit-state and enter-state functions. Sometimes this can actually be rather unfortunate.

You can see elsewhere in the STT where I have had to use pairs of almost-identically named states (e.g. statement_list_1 and statement_list_2), and designed the logic to rock the state machine back and forth between them, just to allow it to gobble up certain input sequences. This is a technique you are strongly advised to be aware of.

So, proceeding in this fashion, driven by the BNF of the input language, we construct the whole STT. Each time a new enter-state or exit-state function is needed, we write the code, and run a small demo to test it.

The 'graph' State

We get to this state simply by the absence of a leading 'strict' in the input stream.

And, apart from not bothering to cater for comments (as did the 'initial' state), this state is really the same as the 'initial' state.

Now, only 5 paragraphs back I warned about a feature designed into Set::FA::Element, looping within a state. That fact is why the 'graph' state exists. If the 'initial' state could have looped to itself upon detecting 'strict', and executed the exit or entry functions, there would be no need for the 'graph' state.

The 'graph_id' State

Next, we look for an optional graph id, at the current head of the input stream (because anything which matched previously has been ripped off and handled).

Here's our first use of a formula: Cell H2 contains (?:"[^"]*"|<[^>]*>|[a-zA-Z_][a-zA-Z_0-9]*|-?(?:\.[0-9]+|[0-9]+(?:\.[0-9])*)).

This just means we accept an ID which is double-quoted, or quoted with '<' and '>', or alphanumeric (but not starting with a digit), or a number.
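
To see what that formula accepts, anchor it and throw some candidate strings at it (the candidates here are mine):

        my($id_re) = qr/^(?:"[^"]*"|<[^>]*>|[a-zA-Z_][a-zA-Z_0-9]*|-?(?:\.[0-9]+|[0-9]+(?:\.[0-9])*))$/;

        for my $candidate ('"A label"', '<table/>', 'node_27_1', '-6.6', '.5', '2bad')
        {
                print "$candidate => ", ($candidate =~ $id_re ? 'an ID' : 'not an ID'), "\n";
        }

        # Everything matches except '2bad', which starts with a digit but is not a number.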

If we see such a token, we jump to the 'open_brace' state, meaning the very next non-whitespace character had better (barring comments) be a '{', or we die.

Otherwise, we accept a '{' without an ID and jump to the 'statement_list_1' state, or we discard comments and spaces by looping within the 'graph_id' state.

The Remaining States

What follows in the STT gets complex, but in reality is more of the same. Several things should be clear by now:

o The development of the STT is iterative
o We need lots of tiny but different test data files, to test these steps
o We need quite a lot of patience, which, unfortunately, can't be downloaded from the internet...

Lexer Actions (Callbacks)

Matching something with a DFA only makes sense if the matched text is captured for processing. Hence the use of state-exit and state-entry callback functions.

In these functions, we must decide what text to output for each recognized input token.

To help with this, I use a method called items(), accessed in each function via the global $myself (an alias to $self - see the comment in save_prefix() below). This method manages a stack (array) of items of type Set::Array.

Each element in this array is a hashref:

        {
                count => $integer, # 1 .. N.
                name  => '',       # Unused.
                type  => $string,  # The type of the token.
                value => $value,   # The value from the input stream.
        }

So, whenever a token is recognized, we just push a new item onto the stack, and the value of the 'type' string is the result of the DFA's work identifying the token.

This identification process uses the 1st of the 2 sub-grammars mentioned in the first article.

A long Exit-state Function

The 'save_prefix' function looks like:

        # Warning: This is a function (i.e. not a method).

        sub save_prefix
        {
                my($dfa)   = @_;
                my($match) = trim($dfa -> match);

                # Upon each call, $match will be 1 of:
                # o strict.
                # o digraph.
                # o graph.

                # Note: Because this is a function, $myself is a global alias to $self.

                $myself -> log(debug => "save_prefix($match)");

                # Input     => Output (a new item, i.e. a hashref):
                # o strict  => {type => strict,  value => yes}.
                # o digraph => {type => digraph, value => yes}.
                # o graph   => {type => digraph, value => no}.

                if ($match eq 'strict')
                {
                        $myself -> new_item($match, 'yes');
                }
                else
                {
                        # If the first token is '(di)graph' (i.e. there was no 'strict'),
                        # jam a 'strict' into the output stream.

                        if ($myself -> items -> length == 0) # Output stream is empty.
                        {
                                $myself -> new_item('strict', 'no');
                        }

                        $myself -> new_item('digraph', $match eq 'digraph' ? 'yes' : 'no');
                }

        } # End of save_prefix.

A tiny Exit-state Function

Here's one of the shorter exit functions, attached in the STT to the 'open_brace' and 'start_statement' states:

        sub start_statements
        {
                my($dfa) = @_;

                $myself -> new_item('open_brace', $myself -> increment_brace_count);

        } # End of start_statements.

The code to push a new item onto the stack is just:

        sub new_item
        {
                my($self, $type, $value) = @_;

                $self -> items -> push
                        ({
                                count => $self -> increment_item_count,
                                name  => '',
                                type  => $type,
                                value => $value,
                         });

        } # End of new_item.

Using Marpa in the Lexer

Yes, you can use Marpa in the lexer, as discussed in the first article. I prefer to use a spreadsheet full of regexps.

But enough of the lexer. Let's turn to the parser.

The Parser's Structure

The parser incorporates the 2nd sub-grammar, and gets Marpa::R2 to validate the output from the lexer against this grammar.

The parser's structure is very similar to the lexer's:

o Initialize using the parameters to new()
o Declare the grammar
o Run Marpa
o Save the output
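
Here's a minimal, self-contained sketch of those steps, using the same rules-based Marpa::R2 interface as GraphViz2::Marpa, but with a toy 2-token grammar. This is not the module's actual code:

        #!/usr/bin/env perl

        use strict;
        use warnings;

        use Marpa::R2;

        # The callbacks. Marpa finds them by name in the 'actions' package.

        sub save_strict  {return $_[1]}
        sub save_digraph {return $_[1]}

        # ---------------

        my($grammar) = Marpa::R2::Grammar -> new
                ({
                actions => 'main',
                start   => 'prolog',
                rules   =>
                [
                        {lhs => 'prolog',      rhs => [qw/strict_def digraph_def/]},
                        {lhs => 'strict_def',  rhs => [qw/strict/],  action => 'save_strict'},
                        {lhs => 'digraph_def', rhs => [qw/digraph/], action => 'save_digraph'},
                ],
                });

        $grammar -> precompute;

        my($recognizer) = Marpa::R2::Recognizer -> new({grammar => $grammar});

        # Feed in the lexed items one at a time, as (type, value) pairs.

        $recognizer -> read('strict',  'no');
        $recognizer -> read('digraph', 'yes');

        my($result) = $recognizer -> value; # An exception anywhere above means a syntax error.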

Marpa Actions (Callbacks)

As with the lexer, the parser works via callbacks, which are functions named within the grammar and called by Marpa::R2 whenever the input sequence of lexed items matches some component of the grammar.

Specifically, consider these 4 rule descriptors in the grammar declared in GraphViz2::Marpa::Parser's grammar() method:

        [
        ...
        {   # Prolog stuff.
                lhs => 'prolog_definition',
                rhs => [qw/strict_definition digraph_definition graph_id_definition/],
        },
        {
                lhs    => 'strict_definition',
                rhs    => [qw/strict/],
                action => 'strict', # <== Callback.
        },
        {
                lhs    => 'digraph_definition',
                rhs    => [qw/digraph/],
                action => 'digraph', # <== Callback.
        },
        {
                lhs    => 'graph_id_definition',
                rhs    => [qw/graph_id/],
                action => 'graph_id', # <== Callback. See sub graph_id() just below.
        },
        ...
        ]

In each case the 'lhs' is a name I've chosen so that I can refer to each rule descriptor in other rule descriptors. That's how we chain rules together, ending up with a tree structure. Dealing with the root of this tree was discussed in the first article, under the heading 'Chains and Trees'.

This grammar fragment is saying we expect the input stream of items from the lexer to consist (at the start of the stream, actually) of 3 components: A strict thingy, a digraph thingy, and a graph_id thingy.

But since we wrote the lexer, we can ensure that this is exactly what the lexer pumps out.

To emphasise, the grammar says that these 3 items are the only things we are prepared to accept at this point in the input stream, and that they must be in the given order, and that they must literally consist of the 3 tokens (see 'rhs'): strict, digraph and graph_id.

These latter 3 come from the type key in the array of hashrefs built by the lexer.

The 3 corresponding value keys in those hashrefs would be 'yes' or 'no' for strict, 'yes' or 'no' for digraph, and an id or the empty string for graph_id.
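
For data/27.gv, those first 3 items from the lexer are (types and values as in data/27.lex above, with the other hashref keys omitted):

        {type => 'strict',   value => 'no'      }
        {type => 'digraph',  value => 'yes'     }
        {type => 'graph_id', value => 'graph_27'}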

As with the lexer, when an incoming token (type) matches what we expect, Marpa::R2 triggers a call to an action, here called (for clarity) the same as the 'rhs'.

Let's examine one of those functions:

        sub graph_id
        {
                my($stash, $t1, undef, $t2)  = @_;

                $myself -> new_item('graph_id', $t1);

                return $t1;

        } # End of graph_id.

The parameter list is courtesy of how Marpa::R2 says callbacks must be written.

In fact, $t1 will be the incoming graph id. So in data/27.gv (above), that would be 'graph_27'.

Marpa does not supply the string 'graph_id' to this function, because we don't need it. The design of the grammar is such that this function is only called when the value of the incoming type is 'graph_id', so we know precisely under what circumstances this function was called.

And that in turn is why we can hard-code the string 'graph_id' in the body of the graph_id() function.

The Grammar in Practice

And now you might be thinking: Just a second! That code seems to be doing no more than copying the input token to the output stream. Well, you're right, sort of.

True understanding comes when you realize that that code is called at the appropriate point, by Marpa, precisely because the type 'graph_id' and its value 'graph_27' were found at exactly the right place in the input stream.

And by 'exactly the right place' I mean that the location of the pair:

        (type => value) i.e. ('graph_id' => 'graph_27')

in the input stream was exactly where it had to be to satisfy the grammar which was used to initialize Marpa::R2::Grammar.

And if it had not been there, Marpa would have thrown an exception, which we would recognize as a 'syntax error'. And that means a syntax error in the input stream fed into the lexer, but which is picked up by testing that input stream against the grammar declared in the parser. The role of the lexer as an intermediary is to simplify the logic of the code as a whole, aka divide-and-conquer.
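
In code, you would trap that exception around the read() calls. A sketch, assuming @lexed_items holds the lexer's hashrefs and $recognizer was built as under 'The Parser's Structure':

        for my $item (@lexed_items)
        {
                eval { $recognizer -> read($$item{type}, $$item{value}) };

                die "Syntax error near '$$item{value}': $@" if ($@);
        }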

So, it was no accident that that function was called at a particular point in time during the parser's processing of its input stream.

Now let's turn to another problem which arises as we build up the set of rule descriptors within the grammar.

Trees Have Leaves

Chains and Trees were discussed in the first article. See 'prolog_definition' etc above for the syntax used by Marpa. Briefly, each rule descriptor must be chained to other rule descriptors.

But the astute reader will have already seen a problem: How do we define the meanings of the leaves of this tree, when the chain of definitions must end at each leaf?

Firstly, here's part of that input file, data/27.lex (above):

        "type","value"
        strict          , "no"
        digraph         , "yes"
        graph_id        , "graph_27"
        start_scope     , "1"
        node_id         , "node_27_1"
        open_bracket    , "["
        attribute_id    , "color"
        equals          , "="
        attribute_value , "red"
        attribute_id    , "fontcolor"
        equals          , "="
        attribute_value , "green"
        close_bracket   , "]"
        ...

And what do the corresponding rule descriptors look like? Just this:

        [
        ...
        {
                lhs => 'attribute_statement',
                rhs => [qw/attribute_key has attribute_val/],
        },
        {
                lhs    => 'attribute_key',
                rhs    => [qw/attribute_id/], # <=== This is a terminal.
                min    => 1,
                action => 'attribute_id',
        },
        {
                lhs    => 'has',
                rhs    => [qw/equals/],
                min    => 1,
        },
        {
                lhs    => 'attribute_val',
                rhs    => [qw/attribute_value/], # <=== And so is this.
                min    => 1,
                action => 'attribute_value',
        },
        ...
        ]

The items marked as terminals (standard parsing terminology) are not further defined, meaning that 'attribute_key' and 'attribute_val' are leaves in the tree of rule descriptors.

And what does that mean? It means the terminals 'attribute_id' and 'attribute_value' must appear literally in the input stream.

Switching between 'attribute_key' and 'attribute_id' is just a requirement of Marpa, to avoid ambiguity in the statement of the grammar. Likewise for 'attribute_val' and 'attribute_value'.

The 'min' makes the attributes mandatory. Not in the sense that nodes and edges must have attributes, they don't, but in the sense that if the input stream has an 'attribute_id' token, then it must have an 'attribute_value' token, and vice versa.

Recall now the section above 'Working With An Incomplete BNF'. If the original *.gv file used one of:

        size = "6,6"
        [size = "6,6"]
        graph [size = "6,6"]

Then whichever one was chosen really represents the graph attribute:

        graph [size = "6,6"]

So, to make this work, the design of the lexer must force the output to be:

        "type","value"
        ...
        class_id        , "graph"
        open_bracket    , "["
        attribute_id    , "size"
        equals          , "="
        attribute_value , "6,6"
        close_bracket   , "]"

And this matches what was just explained, in that both 'attribute_id' and 'attribute_value' are present, and their owner, so-to-speak, the 'graph' object itself, is also present, and is identified by the type 'class_id'.

This should reinforce the point that the design of the lexer is intimately tied to the design of the parser.

And by taking decisions like this in the lexer you can standardize its output, hence simplifying the work that needs to be done in the parser.

Where to from here

Recently, a new Perl module, MarpaX::Simple::Rules, was released, which takes a BNF and generates the corresponding grammar in the format expected by Marpa::R2.

Jeffrey Kegler (author of Marpa) has blogged about it.

This is a very interesting development, since it automates the laborious process of converting a BNF into a set of Marpa's rule descriptors.

Consequently, it makes sense for anyone contemplating using Marpa::R2 to investigate how appropriate it would be to do so via MarpaX::Simple::Rules.

Wrapping Up and Winding Down

You've seen samples of lexer output and some parts of the grammar which both define the second sub-grammar of what's expected, and match precisely the input from that lexer.

And if they don't match, it is in fact the parser which issues the dread 'syntax error' message, because only it (i.e. not the lexer) knows which combinations of input tokens are acceptable, and which are not.

And, functions known as callbacks (just as in the lexer) stockpile items which have passed Marpa::R2's attempt to match up input tokens with rule descriptors. In this case we use the callbacks to record exactly which rules fired in which order.

After Marpa::R2 has run to completion, we have a stack of items whose elements are a (lexed and) parsed version of the original file.

It simply remains to output that stack to a file, or to wait for the caller of the parser to ask for the stack to be supplied via RAM as an arrayref.

Finally, a rather earlier (July 2011) article on Marpa::R2 is here.

The Lexer and the State Transition Table - Revisited

The complexity of the STT in GraphViz2::Marpa is what justifies in my mind the decision to split the lexer and the parser into separate modules.

Clearly that will not always be the case. Given a sufficiently simple grammar, the lexer phase may be redundant.

Consider this test data file, data/sample.1.ged, from Genealogy::Gedcom:

        0 HEAD
        1 SOUR Genealogy::Gedcom
        2 NAME Genealogy::Gedcom::Reader
        2 VERS V 1.00
        2 CORP Ron Savage
        3 ADDR Box 3055
        4 STAE Vic
        4 POST 3163
        4 CTRY Australia
        3 EMAIL ron@savage.net.au
        3 WWW http://savage.net.au
        2 DATA
        3 COPR Copyright 2011, Ron Savage
        1 NOTE
        2 CONT This file is based on test data in Paul Johnson's Gedcom.pm
        2 CONT Gedcom.pm is Copyright 1999-2009, Paul Johnson (paul@pjcj.net)
        2 CONT Version 1.16 - 24th April 2009
        2 CONT
        2 CONT Ron's modules under the Genealogy::Gedcom namespace are free
        2 CONT
        2 CONT The latest versions of these modules are available from
        2 CONT my homepage http://savage.net.au and http://metacpan.org
        1 GEDC
        2 VERS 5.5.1-5
        2 FORM LINEAGE-LINKED
        1 DATE 10-08-2011
        1 CHAR ANSEL
        1 SUBM @SUBM1@
        0 TRLR

Each line matches /^(\d+)\s([A-Z]{3,5})(?:\s(.+))?$/, i.e. an integer, a keyword, and an optional string. In this case I'd skip the lexer, and have the parser tokenize the input.
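
A sketch of that parser-side tokenizing, assuming $fh is open on the *.ged file:

        while (my $line = <$fh>)
        {
                chomp $line;

                next if ($line !~ /^(\d+)\s([A-Z]{3,5})(?:\s(.+))?$/);

                my($level, $keyword, $data) = ($1, $2, defined($3) ? $3 : '');

                # Hand ($level, $keyword, $data) straight to the grammar.
        }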

So, horses for courses.

The GEDCOM defining document (for genealogical data) is on-line here.

Sample Output

GraphViz2 (non-Marpa)

GraphViz2::Marpa

GraphViz2::Marpa::PathUtils

Graph::Easy::Marpa

Author

Ron Savage.

Home page: http://savage.net.au/index.html

Licence

Australian Copyright © 2012 Ron Savage. All rights reserved.

        All Programs of mine are 'OSI Certified Open Source Software';
        you can redistribute them and/or modify them under the terms of
        The Artistic License, a copy of which is available at:
        http://www.opensource.org/licenses/index.html