Conditional Preservation of Whitespace when Parsing with Marpa::R2

Table of contents

Conditional Preservation of Whitespace when Parsing with Marpa::R2
MarpaX::Languages::Dash V 1.00
Constructing a Mental Picture of Lexing and Parsing
The Language of the Grammar
The Whitespace Problem
A Sub-grammar for Whitespace
There is Syntax, and then there is Syntax
An Intermediate Summary
Some Sample Input
The Full Grammar of MarpaX::Languages::Dash
The Grammar
A Warning
Handling the Pause on the Perl Side of Things
The Process
Lines 18 .. 23
Lines 29 .. 32
Lines 33 .. 36 - 'attribute_name'
Lines 37 .. 48 - 'attribute_value'
Lines 49 .. 52 - 'directed_edge'
Lines 53 .. 56 - 'end_attributes'
Lines 57 .. 65 - 'end_node'
Lines 66 .. 72 - 'literal_label'
Lines 73 .. 78 - 'node_name'
Lines 79 .. 82 - 'start_attributes'
Lines 83 .. 86 - 'start_node'
Lines 87 .. 90 - 'undirected_edge'
Line 92
A Summary
Some General Notes about the Code
Wrapping Up and Winding Down
Author
Licence

Conditional Preservation of Whitespace when Parsing with Marpa::R2

MarpaX::Languages::Dash V 1.00

The module, MarpaX::Languages::Dash, was written to showcase the material discussed here, and to give you something to play with.

And the grammar from this module will be incorporated into the re-write of GraphViz2::Marpa.

Also, I'm currently (2016-12-25) using Marpa::R2 V 3.000000.

Constructing a Mental Picture of Lexing and Parsing

With the advent of Marpa's SLIF, we need to feel comfortable both with Marpa's approach and with the fact that, classically, lexing and parsing were visibly distinct phases, perhaps even requiring 2 programs.

The following is a quotation from Jeffrey Kegler to the Google Group on Marpa, reformatted slightly by me to suit this article:

"Marpa's SLIF follows the classical separation of lexing and parsing, but it introduces a notation which is friendly to those programmers more accustomed to regular expressions and recursive descent.

So instead of two completely separate modules, lexer and parser, each with its own interface, you have a single DSL source file, and the lexing and parsing statements look similar, with only the different operators ("::=" versus "~") to tell them apart.

Underneath however, in the implementation, the distinction is still there, as sharp as ever."

See Wikipedia for more on DSLs (Domain-specific languages).

And here is another quotation from the author of Marpa:

"Marpa has 3 phases: Grammar pre-processing; Recognition; and Evaluation. That is, Marpa:

o Precomputes a grammar
o Recognizes an input, creating a table
o Turns that table into a tree, and evaluates it.

Events occur during recognition. Actions are executed during the evaluation. In other words, by the time the first action is executed, the last event is ancient history."

The section below, called 'The Full Grammar of MarpaX::Languages::Dash', warns you about a trap which can follow from misunderstanding the last point, i.e. that all events are triggered before any action sub is called.

These special tokens within the grammar's syntax, event and action, are discussed at length below.

And all this takes place very efficiently indeed.

Consequently, Marpa's new design completely eliminates the necessity for you to write code to perform your own lexing.

But keep in mind that Marpa does not forbid you to do your own lexing, and thus to pass specific tokens into Marpa, either instead of letting Marpa do any lexing, or by using a combination of both manual and Marpa-based lexing.

The Language of the Grammar

Marpa now reads your grammar using the syntax of Marpa's own extended BNF. Marpa's docs for this are in Marpa::R2::Scanless::DSL.

I'm going to be referring to this many times, and rather than say 'Marpa-style BNF', or 'Extended BNF', I'll call it an SLIF-DSL, or usually just a BNF.

In practice, this means you express your grammar in a string, and Marpa treats that as a set of rules as to how you want Marpa to treat your input stream.
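
For example, here is a minimal, self-contained sketch (not code from the module) of handing such a string to Marpa. The one-rule grammar, the names attr_key and attr_val, and the input are invented purely for illustration:

        #!/usr/bin/env perl

        use strict;
        use warnings;

        use Marpa::R2;

        # The whole grammar is just a string of SLIF-DSL rules.

        my($dsl) = q{
        :default        ::= action => [values]
        lexeme default  =   latm => 1

        pair            ::= attr_key (':') attr_val

        attr_key        ~ [\w]+
        attr_val        ~ [\w]+

        :discard        ~ whitespace
        whitespace      ~ [\s]+
        };

        my($grammar) = Marpa::R2::Scanless::G -> new({source => \$dsl});
        my($recce)   = Marpa::R2::Scanless::R -> new({grammar => $grammar});
        my($input)   = 'color : red';

        $recce -> read(\$input);

        my($value) = $recce -> value;  # A ref to the parse value. Here, that value is ['color', 'red'].

        print "@{$$value}\n";          # Prints 'color red'. The parentheses hid the ':' from the result.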

The Whitespace Problem

We're all familiar with code which declares a single- or double-quoted string, and expects any whitespace within that string to be preserved, while at the same time expecting arbitrary whitespace in the source code to be discarded. This means we can format the source in different ways, and still have whitespace preserved when it's protected by quotes. This much is simple.

Well, this is precisely what I mean by 'conditional preservation of whitespace': Some is preserved and some is discarded. So, how do we do that using Marpa?

One way would be to have Marpa preserve all whitespace, and commit ourselves to processing it somehow. Here, I'm going to ignore that option, since there's now a rather more sophisticated way to do it.

A Sub-grammar for Whitespace

Firstly, here's a tiny fragment of a grammar, in SLIF-DSL:

        source       => \(<<'END_OF_SOURCE'),
                ...
        :discard     ~ whitespace
        whitespace   ~ [\s]+

        END_OF_SOURCE

The entire grammar is a reference to a string of rules, whose oddish delimiters are \(<<'END_OF_SOURCE') and END_OF_SOURCE.

But, see those 2 rules which mention the reserved pseudo-symbol :discard and the token whitespace? This is Marpa's way of allowing us to say: Discard all whitespace by default. We could have called whitespace anything. It's not a reserved word.

For more on :discard, see the docs: Marpa::R2::Scanless::DSL#Discard-rules. And note, of course, that these 2 rules are optional. Or, you may wish to discard different characters.

In other words, we're ordering Marpa to discard whitespace. Since this conflicts with our stated aim of preserving some whitespace, we need a mechanism which allows us to tell Marpa to override this default under circumstances chosen by us. Enter the pause and event adverbs.

Note that this usage of Marpa's adverbs event and pause should be regarded as an intermediate/advanced technique. For people just beginning to use Marpa, use of the action adverb is the recommended technique.

Here's a bit more of the grammar, given that the regexp a node must match is /^\[.*\]$/:

        node_name        ::= start_node   end_node # Allow for the anonymous node, [].
                               | start_node   node_name   end_node # Allow for [wxyz].

        :lexeme          ~ start_node   pause => before   event => start_node
        start_node       ~ '['

        :lexeme          ~ end_node     pause => before   event => end_node
        end_node         ~ ']'

        :lexeme          ~ node_name    pause => before   event => node_name

        node_name        ~ string_char_set+

        string_char_set  ~ escaped_char
                               | [^;:}\]] # Neither a separator [;:] nor a terminator [}\]].

        escaped_char     ~ '\' [[:print:]]

Pause allows us to specify that at certain locations within the parse, i.e. at certain locations within the input stream, we want Marpa to stop parsing for a while, and pass control of the parsing process over to us.

It's the for a while part that makes the pause adverb so powerful. After a while, i.e. whenever we feel like it, we simply tell Marpa to resume parsing.

And, during that pause, we handle all parsing, and preserve whitespace if we wish. Hence the word conditional in the title of this article. And I say if we wish because there are times when we encounter leading and trailing spaces, for example, and choose to discard them. Likewise, we might choose to compress internal whitespace.
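
As an illustration of that sort of clean-up, here is a hypothetical helper in the spirit of the module's clean_before() and clean_after() methods (this exact sub is not in the module):

        sub tidy_whitespace
        {
            my($s) = @_;

            $s =~ s/^\s+//;     # Discard leading whitespace.
            $s =~ s/\s+$//;     # Discard trailing whitespace.
            $s =~ s/\s+/ /g;    # Compress internal whitespace.

            return $s;

        } # End of tidy_whitespace.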

Event allows us to name pauses, and to query Marpa as to which event triggered any particular pause.
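
To make the pause and event machinery concrete before we reach the real grammar, here is a minimal, self-contained sketch. It is not code from the module; the one-lexeme grammar, the event name 'word' and the input are invented for illustration, but the loop has the same shape as the _process() method shown later:

        #!/usr/bin/env perl

        use strict;
        use warnings;

        use Marpa::R2;

        # A single lexeme, 'word', with a pause and an event named after it.

        my($dsl) = q{
        :default        ::= action => [values]
        lexeme default  =   latm => 1

        sentence        ::= word+

        :lexeme         ~ word    pause => before    event => 'word'
        word            ~ [\w]+

        :discard        ~ whitespace
        whitespace      ~ [\s]+
        };

        my($grammar) = Marpa::R2::Scanless::G -> new({source => \$dsl});
        my($recce)   = Marpa::R2::Scanless::R -> new({grammar => $grammar});
        my($input)   = 'Just a few words';
        my($length)  = length $input;

        for
        (
            my $pos = $recce -> read(\$input);
            $pos < $length;
            $pos = $recce -> resume($pos)
        )
        {
            my($event_name)   = $recce -> events -> [0][0];        # Which event paused us?
            my($start, $span) = $recce -> pause_span;              # Where is the lexeme, and how long is it?
            my($lexeme)       = $recce -> literal($start, $span);  # The lexeme itself, whitespace intact.

            print "Event '$event_name' paused at offset $start. Lexeme: '$lexeme'. \n";

            # The event is named after the lexeme, so its name can be fed straight to lexeme_read().

            $pos = $recce -> lexeme_read($event_name);
        }

        print join(' ', @{${$recce -> value}}), ". \n";

Each lexeme arrives at our code with its event name, its span and its literal, exactly as the lexemes of the full grammar do below.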

There is Syntax, and then there is Syntax

Don't get confused by these 2:

o The syntax for a node

For my application, the regexp a node must match is /^\[.*\]$/.

The name of the node is within those '[' and ']' chars.

The name of the node matches /.*/, meaning it can be zero-length as it happens, but you (the reader) should not attach any significance to the length.

o The syntax for the node's grammar

The grammar a node must match is either 'node_name ::= start_node end_node', or 'node_name ::= start_node node_name end_node', where start_node is '[' and end_node is ']'. (The full grammar below calls this rule node_name_token.)

So what is the node's grammar saying?

o node_name ::= start_node end_node

There is no name between the start and end lexemes. So, when Marpa finds such a case, i.e. no text in the input stream between '[' and ']', it triggers the event start_node, where we ignore the '[', and then the event end_node, where we handle the case of the anonymous node by inserting a Tree::DAG_Node node into the output tree. As it happens, we also ignore the ']'. This all takes place in the method _process().

o node_name ::= start_node node_name end_node

Here, there is a name between '[' and ']', and, as above, Marpa triggers the events (in this order) start_node, node_name, end_node. And, also as above, we happen to ignore the '[' and ']', but for the node_name we trigger our own (simple) processing, via the methods clean_after() and _add_daughter().

Luckily, these 2 cases are fairly simple. Below, where attributes of a node or an edge are discussed, much more complexity needs to be handled.

I should warn you about Marpa's lookahead. There isn't any. That is, when Marpa detects a lexeme such as '[' at, presumably, the start of a node, Marpa is not saying that it has used lookahead to find the corresponding ']', and hence the '[' is provably the start of a node. It's just saying that so far (i.e. at the '['), the text does not contain anything which contradicts the grammar used so far. That is, the '[' triggers the 'start_node' event without implying anything about what follows the '['. So even saying it's the start of a node is an assumption, not yet supported by the facts.

Of course, once an event is triggered you know the offset within the input stream ($pos in my code), so you can perform your own lookahead.
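
For example, a hypothetical peek at the raw input might look like this. It assumes the $string and $recce of the processing loop shown later in _process(), and is not code from the module:

        my($start, $span) = $recce -> pause_span;
        my($ahead)        = substr($string, $start + $span, 10);  # At most the next 10 characters.

        print "After the paused-on lexeme, the input continues: '$ahead'. \n";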

An Intermediate Summary

We've actually touched on a number of topics already, often interrelated ones, so here's a list (just in case your head is spinning):

o The Scanless Interface (SLIF)

This is Marpa's newest interface, and now the only recommended one.

o SLIF-DSL

This is classic BNF enhanced to give us a way to specify various new features:

o We can explicitly define and name lexemes
o We can provide defaults to Marpa covering the whole parse
o We can interact with Marpa in different ways for different lexemes
o See http://savage.net.au/Perl-modules/html/marpa.papers/chapter2.html for much more
o Ways to handle whitespace

In short, these are:

o Preserve all whitespace by default

I.e. Design a grammar where whitespace is never discarded.

o Preserve some whitespace

That is, preserve whitespace by default, while overriding that sometimes, by using pause, to handle the precise way whitespace is conditionally discarded.

o Discard all whitespace by default

I.e. Design a grammar where whitespace is always discarded.

o Discard some whitespace

That is, discard whitespace by default, while overriding that sometimes, by using pause, to handle the precise way whitespace is conditionally preserved.

The last choice is the topic of this article.

o Pseudo-symbols such as :discard

These allow us to specify static aspects of how we want Marpa to act.

o Adverbs such as event

These allow us to name pauses, and to query Marpa as to which event triggered any particular pause.

o Adverbs such as pause

These allow us to specify dynamic aspects of how our code interacts with the parsing process.

Other languages implement the command (verb) yield, which transfers the flow of control back and forth in a vaguely similar fashion.

Also, when we pause on a lexeme, that lexeme does not have to be a fixed string. In the BNF below, the string 'label' is a lexeme, but so are the variable-length strings I've called 'attribute_name' and 'attribute_value'. Likewise for 'node_name'.

o Getting Marpa to resume parsing after the callback

Here we control the manner in which Marpa resumes. We do this by moving the offset within the input stream at which Marpa resumes.

In fact, this could involve moving that pointer backwards or forwards.

Since this determines the next token Marpa sees, it thus directly controls the next rule triggered. Naturally that could be the same rule as the last one triggered.

In this article, I'll always be moving it forwards, I assure you!

The direct implication of this is that after the pointer is moved forward, I am saying these 4 things:

o I've told Marpa I accept responsibility for processing a certain subset of the input stream

This is part of the grammar.

o I respond when Marpa tells me it's reached such a point

This is part of my code.

o I've finished that processing

So I return control back to Marpa.

o I'm telling Marpa to treat that subset as invisible, and to resume parsing after it

This happens because I told Marpa where to resume within the input stream.

Neat, huh?

Some Sample Input

These graphs are copied from the sample data for MarpaX::Languages::Dash. Since I've called them graphs, that means each line below is a complete input stream. Further, all sample lines below can be combined into a single file, another graph, now shipped in the distro as data/graph.04.dash.

o Edges
        ->
        --
o Graphs
        [node]
        [node] ->
        -> {label: Start} -> {color: red} [node.1] {color: green} -> [node.2]
        [node.1] [node.2] [node.3]
o Nodes
        []
        [node.1]
        [node 1]
        [\[node\]]
        ["node"]
        [     From here     ] -> [     To there     ]

The Full Grammar of MarpaX::Languages::Dash

The Grammar

BTW, MarpaX::Grammar::GraphViz2 can produce an SVG image of this grammar. Here is the grammar itself:

    01    :default                ::= action => [values]
    02
    03    lexeme default          =  latm => 1        # Longest Acceptable Token Match.
    04
    05    graph_grammar           ::= graph_definition
    06
    07    # Graph stuff.
    08
    09    graph_definition        ::= node_definition
    10                                | edge_definition
    11    # Node stuff
    12
    13    node_definition         ::= node_statement
    14                                | node_statement graph_definition
    15
    16    node_statement          ::= node_name_token
    17                                | node_name_token attribute_definition
    18                                | node_statement (',') node_statement
    19
    20    node_name_token         ::= start_node end_node     # Allow for the anonymous node.
    21                                | start_node node_name end_node
    22
    23    # Edge stuff
    24
    25    edge_definition         ::= edge_statement
    26                                | edge_statement graph_definition
    27
    28    edge_statement          ::= edge_name
    29                                | edge_name attribute_definition
    30                                | edge_statement (',') edge_statement
    31
    32    edge_name               ::= directed_edge
    33                                | undirected_edge
    34
    35    # Attribute stuff.
    36
    37    attribute_definition    ::= attribute_statement+
    38
    39    attribute_statement     ::= start_attributes string_token_set end_attributes
    40
    41    string_token_set        ::= string_token_pair+
    42
    43    string_token_pair       ::= literal_label
    44                                | attribute_name (':') attribute_value
    45
    46    # Lexemes in alphabetical order.
    47
    48    :lexeme                 ~ attribute_name    pause => before     event => attribute_name
    49
    50    attribute_name          ~ string_char_set+
    51
    52    :lexeme                 ~ attribute_value   pause => before     event => attribute_value
    53
    54    attribute_value         ~ string_char_set+
    55
    56    :lexeme                 ~ directed_edge     pause => before     event => directed_edge      priority => 2
    57    directed_edge           ~ '->'
    58
    59    :lexeme                 ~ end_attributes    pause => before     event => end_attributes     priority => 1
    60    end_attributes          ~ '}'
    61
    62    :lexeme                 ~ end_node          pause => before     event => end_node           priority => 1
    63    end_node                ~ ']'
    64
    65    escaped_char            ~ '\' [[:print:]]
    66
    67    # Use ' here just for the UltraEdit syntax hiliter.
    68
    69    :lexeme                 ~ literal_label     pause => before     event => literal_label      priority => 1
    70    literal_label           ~ 'label'
    71
    72    :lexeme                 ~ node_name         pause => before     event => node_name
    73
    74    node_name               ~ string_char_set+
    75
    76    :lexeme                 ~ start_attributes  pause => before     event => start_attributes
    77    start_attributes        ~ '{'
    78
    79    :lexeme                 ~ start_node        pause => before     event => start_node
    80    start_node              ~ '['
    81
    82    string_char_set         ~ escaped_char
    83                                | [^;:}\]] # Neither a separator [;:] nor a terminator [}\]].
    84
    85    :lexeme                 ~ undirected_edge   pause => before     event => undirected_edge    priority => 2
    86    undirected_edge         ~ '--'
    87
    88    # Lexemes in alphabetical order.
    89
    90    # Boilerplate.
    91
    92    :discard                ~ whitespace
    93    whitespace              ~ [\s]+

A Warning

Now, you might be tempted to change the edge stuff from this:

    56    :lexeme                 ~ directed_edge     pause => before     event => directed_edge      priority => 2
    57    directed_edge           ~ '->'

    85    :lexeme                 ~ undirected_edge   pause => before     event => undirected_edge    priority => 2
    86    undirected_edge         ~ '--'

To this:

    41    edge_name              ::= directed_edge
    42                               | undirected_edge
    43
    44    directed_edge          ::= '->'   action => edge
    45
    46    undirected_edge        ::= '--'   action => edge

It certainly looks simpler, doesn't it?

Then, your Perl would not need to contain code to handle those 2 pauses, since Marpa would just trigger calls to the action sub edge(), which admittedly you must write.

Don't do that!

What happens is that edge() is now called at a different time in the flow of processing than events are with the original grammar.

In the sample I tested, all the calls to edge() were made after all the calls to pause-related code.

Why this happens is explained above in the quotation from the author of Marpa.

And now it's time to pause...

Handling the Pause on the Perl Side of Things

The Process

If Marpa is going to yield processing, we need to explicitly code for this.

BTW, a grammar without any pauses can process a string in one hit:

        my($string) = 'Input stream...';
        my($result) = $recognizer -> read(\$string);
        ...

To handle pauses however, we switch to a loop, since we must call resume() after each pause, until Marpa hits the end of the input stream.

So, here is the _process() method of MarpaX::Languages::Dash:

    01    sub _process
    02    {
    03        my($self)       = @_;
    04        my($string)     = $self -> clean_before($self -> graph_text);
    05        my($length)     = length $string;
    06        my($last_event) = '';
    07        my($format)     = '%-20s    %5s    %5s    %-s';
    08
    09        # We use read()/lexeme_read()/resume() because we pause at each lexeme.
    10
    11        my($event_name, $edge);
    12        my(@fields);
    13        my($lexeme, $literal);
    14        my($span, $start);
    15
    16        $self -> log(debug => sprintf($format, 'Event', 'Start', 'Span', 'Lexeme') );
    17
    18        for
    19        (
    20            my $pos = $self -> recce -> read(\$string);
    21            $pos < $length;
    22            $pos = $self -> recce -> resume($pos)
    23        )
    24        {
    25            $event_name     = $self -> _validate_event;
    26            ($start, $span) = $self -> recce -> pause_span;
    27            $pos            = $self -> recce -> lexeme_read($event_name);
    28            $literal        = substr($string, $start, $pos - $start);
    29            $lexeme         = $self -> recce -> literal($start, $span);
    30
    31            $self -> log(debug => sprintf($format, $event_name, $start, $span, $lexeme) );
    32
    33            if ($event_name eq 'attribute_name')
    34            {
    35                $fields[0] = $self -> clean_after($literal);
    36            }
    37            elsif ($event_name eq 'attribute_value')
    38            {
    39                $literal = $self -> clean_after($literal);
    40
    41                $self -> _add_daughter($fields[0], {value => $literal});
    42
    43                @fields = ();
    44
    45                # Skip the separator.
    46
    47                while ( ($pos < (length($string) - 1) ) && (substr($string, $pos, 1) =~ /[\s;]/) ) { $pos++ };
    48            }
    49            elsif ($event_name eq 'directed_edge')
    50            {
    51                $self -> _add_daughter('edge_id', {value => $self -> clean_after($literal)});
    52            }
    53            elsif ($event_name eq 'end_attributes')
    54            {
    55                $self -> _process_brace($literal);
    56            }
    57            elsif ($event_name eq 'end_node')
    58            {
    59                # Is this the anonymous node?
    60
    61                if ($last_event eq 'start_node')
    62                {
    63                    $self -> _add_daughter('node_id', {value => ''});
    64                }
    65            }
    66            elsif ($event_name eq 'literal_label')
    67            {
    68                push @fields, $literal;
    69
    70                $pos    = $self -> _process_label($self -> recce, \@fields, $string, $length, $pos);
    71                @fields = ();
    72            }
    73            elsif ($event_name eq 'node_name')
    74            {
    75                $literal = $self -> clean_after($literal);
    76
    77                $self -> _add_daughter('node_id', {value => $literal});
    78            }
    79            elsif ($event_name eq 'start_attributes')
    80            {
    81                $self -> _process_brace($literal);
    82            }
    83            elsif ($event_name eq 'start_node')
    84            {
    85                # Do nothing.
    86            }
    87            elsif ($event_name eq 'undirected_edge')
    88            {
    89                $self -> _add_daughter('edge_id', {value => $self -> clean_after($literal)});
    90            }
    91
    92            $last_event = $event_name;
    93        }
    94
    95        if ($self -> recce -> ambiguity_metric > 1)
    96        {
    97            $self -> log(notice => 'Ambiguous parse');
    98        }
    99
    100        if (my $ambiguous_status = $self -> recce -> ambiguous)
    101        {
    102            $self -> log(notice => "Parse is ambiguous: $ambiguous_status.");
    103        }
    104
    105        # Return a defined value for success and undef for failure.
    106
    107        return $self -> recce -> value;
    108
    109    } # End of _process.

Note: In the giant 'if' statement, essentially a big switch, event names are tested in alphabetical order, for ease of reading.

Lines 18 .. 23

These lines are saying that since we put at least one pause in the grammar, we must expect Marpa to exit from the read() before the end of the input stream, and that we must consequently:

o Validate the event, since we handle only one at a time

That's line 25. A sketch of such a check appears just after this list.

o Handle the fact that Marpa paused before the lexeme declaring the pause

That's lines 26 .. 28, which tell Marpa to move past the lexeme with lexeme_read().

o Resume the parse

That's line 93, where we hit the end of the loop, and line 22, where we call resume().
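
The real _validate_event() lives in MarpaX::Languages::Dash and is not reproduced here, but a minimal sketch of such a check, assuming only the module's recce() accessor, might look like this (the name _check_single_event is invented):

        sub _check_single_event
        {
            my($self)   = @_;
            my(@events) = map{$$_[0]} @{$self -> recce -> events};

            die "Expected exactly 1 event. Got: @events. \n" if ($#events != 0);

            return $events[0];

        } # End of _check_single_event.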

So, we just have to explain lines 29 .. 92.

Lines 29 .. 32

These just grab the lexeme for the debug print. Thereafter, $lexeme is not used.

Lines 33 .. 36 - 'attribute_name'

'attribute_name' is not the first event expected in a syntactically valid file. The first would be either 'directed_edge', 'undirected_edge' or 'start_node'.

Also, we only get 'attribute_name' after Marpa parses a '{', which will thus already have triggered a 'start_attributes' event, which calls _process_brace($literal).

Now, 'What is a string?' is a more fundamental question than that ancient saying, 'How long is a piece of string?'.

Recall these lines of the grammar:

    43    string_token_pair       ::= literal_label
    44                                | attribute_name (':') attribute_value

With 'attribute_name' and 'attribute_value' (as with 'node_name'), BNF line 44 says we are going to let Marpa do all the lexing and parsing for these cases. Here's how it works.

In this grammar, I've defined these as being strings:

o An attribute name - BNF line 50
o An attribute value - BNF line 54
o A node name - BNF line 74

And 'string_char_set' is defined in BNF lines 82 .. 83, as either (a set of) escaped characters or certain unescaped characters. The latter characters must belong to a character set which excludes precisely those characters we treat as special, since they indicate the ends or separators of various tokens.

This means one of those special characters, let's say ':', can appear in a string if it's escaped, but if it's not escaped, it is going to be taken as having its special meaning.

Back to escaped characters: BNF line 82 directs us to BNF line 65, which shows how to declare a set of escaped characters, and also shows that Marpa accepts any named character set which Perl itself accepts.

Line 35 says preserve the attribute's name ($literal) in $fields[0], while we wait for Marpa to give us the attribute's value.

Lines 37 .. 48 - 'attribute_value'

These lines add the (attribute_name, attribute_value) pair to the tree of tokens, and reset some things.

If you're wondering how Graphviz deals with, say '\:' (which is not one of Graphviz's acceptable escape sequences), or with characters not in the set [[:print:]], remember that my module is not claiming to perform, duplicate or replace Graphviz's own validation. Rather, I'm claiming that a significant block of code in Graphviz, i.e. the lexer and (much of) the parser, can - at least in theory - be replaced by a combination of Marpa and a grammar very much like this one.

In Graphviz, all attribute names and values are already being syntax-checked, and that would still have to be the case after adopting Marpa.

Lines 49 .. 52 - 'directed_edge'

Here we just add '->' to the tree.

Lines 53 .. 56 - 'end_attributes'

Here we call _process_brace() because there's a bit of work to do:

o Save the '}' in the tree
o Pop the stack of parent tree nodes

Note: Pushing the stack is discussed below. See 'start_attributes'.

Between '{' and '}', parsed tokens (attribute names and values) belong to the last parsed edge or node. But after '}', the next token can't be a child of that edge or node. It must actually be an edge or a node. And this new token will become the parent of any following attributes.

This does mean there is no such thing as an edge or node belonging to another edge or node.

In DOT, sub-graphs can group nodes, but DASH does not support sub-graphs.
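
Here is a hypothetical sketch of that push/pop idea, using Tree::DAG_Node directly. The attach() helper and the sample tokens are invented for illustration; in the module, _process_brace() and _add_daughter() do the real work:

        #!/usr/bin/env perl

        use strict;
        use warnings;

        use Tree::DAG_Node;

        my($root)  = Tree::DAG_Node -> new({name => 'root', attributes => {} });
        my(@stack) = ($root);

        # Attach a new daughter to whatever is currently on top of the stack.

        sub attach
        {
            my($name, $value) = @_;
            my($node)         = Tree::DAG_Node -> new({name => $name, attributes => {value => $value} });

            $stack[-1] -> add_daughter($node);

            return $node;

        } # End of attach.

        my($edge) = attach('edge_id', '->');

        push @stack, $edge;        # Saw '{': attributes now belong to this edge.

        attach('color', 'red');    # From '{color: red}'.

        pop @stack;                # Saw '}': back to the previous parent.

        print "The edge's daughters: ", join(', ', map{$_ -> name} $edge -> daughters), ". \n";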

Lines 57 .. 65 - 'end_node'

These lines handle 2 strings in the input stream, '[]' and '[node_name]'. The corresponding event sequences are:

o 'start_node', 'end_node'

Since, at 'end_node', the very last event was not 'node_name', there cannot be a node name, so we know we are dealing with the anonymous node.

o 'start_node', 'node_name', 'end_node'

This time we have a node name to add to the tree.

Lines 66 .. 72 - 'literal_label'

This is the start of very complex code.

It is our major transition from Marpa-based lexing and parsing to doing things manually. In many real-world cases, you could write a grammar to handle similar issues, but since the syntax of "DASH" in MarpaX::Languages::Dash allows all of unquoted, single-quoted, double-quoted and HTML-quoted ('<', '>') strings, I could not see how to do that with BNF alone. So, manual parsing it has to be. And that is the basis for this article. Actually, the real complexity is allowing any of those strings to contain escaped characters which are exactly the end-of-string characters.

Firstly, we save 'label' as the name of the current attribute, as per BNF lines 69 .. 70.

Then at line 70 we parse whatever type of string is given as the value of the label.
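
To give a flavour of what that manual scanning involves, here is a hypothetical helper. It is not the module's _process_label(), which is more involved; it merely assumes the value starts at $pos within $string, handles unquoted, single-quoted, double-quoted and HTML-quoted ('<', '>') values, honours backslash escapes in the quoted cases, and returns both the value and the offset at which Marpa should resume:

        sub scan_value
        {
            my($string, $pos, $length) = @_;
            my(%terminator) = ('"' => '"', "'" => "'", '<' => '>');
            my($quote)      = substr($string, $pos, 1);
            my($end_char)   = $terminator{$quote};
            my($value)      = '';

            if (! defined $end_char)
            {
                # Unquoted: run up to an attribute separator or terminator.
                # For brevity, escapes are only honoured in the quoted branch below.

                while ( ($pos < $length) && (substr($string, $pos, 1) !~ /[;}]/) )
                {
                    $value .= substr($string, $pos, 1);
                    $pos++;
                }

                return ($value, $pos);
            }

            $pos++;    # Step over the opening quote.

            while ($pos < $length)
            {
                my($char) = substr($string, $pos, 1);

                if ($char eq '\\')    # Keep an escaped char, even if it is the end-of-string char itself.
                {
                    $value .= substr($string, $pos, 2);
                    $pos   += 2;
                }
                elsif ($char eq $end_char)
                {
                    $pos++;           # Step over the closing quote.
                    last;
                }
                else
                {
                    $value .= $char;
                    $pos++;
                }
            }

            return ($value, $pos);

        } # End of scan_value.

The caller would then pass the returned offset back as $pos, so that the loop's call to resume($pos) continues after the quoted text, which Marpa itself never sees.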

Lines 73 .. 78 - 'node_name'

This is easy. We're using Marpa to handle node names.

Lines 79 .. 82 - 'start_attributes'

This is where we push the current edge or node onto a stack, since this (tree) node must be the parent of any following attributes.

See the discussion above, "Lines 53 .. 56 - 'end_attributes'", for details.

Lines 83 .. 86 - 'start_node'

We do nothing because we're letting Marpa find both the node's name and the terminating ']'.

Lines 87 .. 90 - 'undirected_edge'

As with 'directed_edge', we just add '--' to the tree.

Line 92

We save the name of the last event, in case we're dealing with an input of '[]', when we wish to know whether the last event was 'node_name' or 'start_node', as discussed above.

A Summary

All this must take place in a choreographed manner, so that the events triggered are in the correct chronological order.

This order (at least for my app, and I'm sure for any typical app) must correspond exactly to the order in which Marpa detects tokens of interest.

And this order is of course the order in which those tokens appear in the input stream.

Some General Notes about the Code

It's not obvious from the code shown, but sub _process() is called (from sub run()) within a try ... catch structure, handled by the fine module Try::Tiny.
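
A sketch of the kind of wrapper meant here (not the module's exact run() code; it assumes only the module's _process() and log() methods) might be:

        use Try::Tiny;

        my($result);

        try
        {
            $result = $self -> _process;
        }
        catch
        {
            $self -> log(error => "Parse failed: $_");
        };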

When _process() exits successfully, the tree of tokens can be analyzed in various ways.

In the distro for MarpaX::Languages::Dash, you'll see scripts/render.pl passes this tree to the default renderer, for conversion into a real, live, Graphviz DOT file.

Basically though, it's just printed in a formatted manner if that's what the user asked for.

Wrapping Up and Winding Down

I hope I've made it easier to approach the dread topic of lexing and parsing. The simple fact is that with Marpa's new SLIF we can pass over to Marpa almost all the work, and just write code to handle precisely targeted places of interest within the input stream.

Overall, this should make clear the astonishing power of Marpa::R2 to relieve the programmer of what in the past was a huge component of such work.

This is combined with an almost-always clean interface, allowing our code to interact with Marpa neatly and efficiently.

And I have not even mentioned its other features, such as the precise pin-pointing of errors, etc. These are spelled out in Chapter 2 of The Marpa Papers.

Author

Ron Savage.

Marpa's homepage: http://savage.net.au/Marpa.html

Homepage: http://savage.net.au/index.html

Licence

Australian Copyright © 2013 Ron Savage. All rights reserved.

        All Programs of mine are 'OSI Certified Open Source Software';
        you can redistribute them and/or modify them under the terms of
        The Artistic License, a copy of which is available at:
        http://www.opensource.org/licenses/index.html