Conditional Preservation of Whitespace when Parsing with Marpa::R2

Table of contents

Conditional Preservation of Whitespace when Parsing with Marpa::R2
Some Background
Constructing a Mental Picture of Lexing and Parsing
MarpaX::Demo::StringParser V 1.04 and Graph::Easy::Marpa V 2.00
The Language of the Grammar
The Whitespace Problem
A Sub-grammar for Whitespace
An Intermediate Summary
A Word on Strings and Invisibility
Some Sample Input
The Full Grammar of MarpaX::Demo::StringParser
Handling the Pause on the Perl Side of Things
The Process
Lines 16 .. 21 - The Loop
Lines 25 .. 76 - The Context and The Code
A Summary
Some General Notes about the Code
Wrapping Up and Winding Down
Author
Licence

Conditional Preservation of Whitespace when Parsing with Marpa::R2

Some Background

I've previously written a 2-part series on lexing and parsing with Marpa::R2:

http://www.perl.com/pub/2012/10/an-overview-of-lexing-and-parsing.html and http://www.perl.com/pub/2013/01/lexing-and-parsing-continued.html.

There are a few typos in those, so here are my copies:

http://savage.net.au/Ron/html/graphviz2.marpa/Lexing.and.Parsing.Overview.html and http://savage.net.au/Ron/html/graphviz2.marpa/Lexing.and.Parsing.Continued.html.

However, I need to say they were written when Marpa had only one interface, now known by the cutesy acronym NAIF.

So, what those articles have to say on lexing and parsing is fine for previous versions of Marpa, but the code should be ignored, because the NAIF interface is deprecated. In short, I need to re-write them.

Marpa::R2 now makes available an interface called Scanless (SLIF), which all new code should use. I'm using V 2.064000 as of this writing (2013-07-25).

Constructing a Mental Picture of Lexing and Parsing

With the advent of Marpa's SLIF, we need to feel comfortable with both Marpa's approach and the fact that, classically, lexing and parsing were often such visibly distinct phases, perhaps even requiring 2 programs.

The following is from quotation from Jeffrey Kegler to the Google Group on Marpa, reformatted slightly by me to suit this article:

"Marpa's SLIF follows the classical separation of lexing and parsing, but it introduces a notation which is friendly to those programmers more accustomed to regular expressions and recursive descent.

So instead of two completely separate modules, lexer and parser, each with its own interface, you have a single DSL source file, and the lexing and parsing statements look similar, with only the different operators ("::=" versus "~") to tell them apart.

Underneath however, in the implementation, the distinction is still there, as sharp as ever."

See Wikipedia for more on DSLs (Domain-specific languages).

And here is another quotation from him:

"Marpa has 3 phases: Grammar pre-processing; Recognition; and Evaluation. That is, Marpa:

o Precomputes a grammar
o Recognizes an input, creating a table
o Turns that table into a tree, and evaluates it.

Events occur during recognition. Actions are executed during the evaluation. In other words, by the time the first action is executed, the last event is ancient history."

The section below, called 'The Full Grammar of MarpaX::Demo::StringParser', warns you about a trap which can ensue after a mis-understanding of the last point, i.e. that all events are triggered before any action sub is called.

These special tokens within the grammar's syntax, event and action, are discussed at length below.

And all this takes place in a single pass of the input stream, i.e. very efficiently indeed.

Consequently, Marpa's new design completely eliminates the necessity for you to write code to perform your own lexing.

MarpaX::Demo::StringParser V 1.04 and Graph::Easy::Marpa V 2.00

A new module, MarpaX::Demo::StringParser, was written to showcase the material discussed here, and to give you something to play with.

Also, the above articles refer to the module Graph::Easy::Marpa V 1.*. I am rewriting it, as V 2.00, to use the SLIF.

The Language of the Grammar

Marpa now reads your grammar using the syntax for Marpa's own extended BNF. Marpa's docs for this are here.

I'm going to be referring to this many times, and rather than say 'Marpa-style BNF', or 'Extended BNF', I'll call it SLIF-DSL.

In practice, this means you express your grammar in a string, and Marpa treats that as a set of rules as to how you want Marpa to treat your input stream.

The Whitespace Problem

We're all familiar with code which declares a single- or double-quoted string, and expects any whitespace within that string to be preserved, while at the same time expecting arbitrary whitespace in the source code to be discarded. This means we can format the source in different ways, and still have whitespace preserved because it's protected by those quotes. This much is simple.

Well, this is precisely what I mean by 'conditional preservation of whitespace': Some is preserved and some is discarded. So, how do we do that using Marpa?

One way would be to have Marpa preserve all whitespace, and commit ourselves to processing it somehow. Here, I'm going to ignore that option, since there's now a rather more sophisticated way to do it.

A Sub-grammar for Whitespace

Firstly, here's a tiny fragment of a grammar, in SLIF-DSL:

        source       => \(<<'END_OF_SOURCE'),
                ...
        :discard     ~ whitespace
        whitespace   ~ [\s]+

        END_OF_SOURCE

The entire grammar is a reference to a string of rules, whose oddish delimiters are \(<<'END_OF_SOURCE') and END_OF_SOURCE.

But, see those 2 rules which mention the reserved pseudo-symbol ':discard' and the word 'whitespace'? This is Marpa's way of allowing us to say: Discard all whitespace by default.

For more on :discard, see the docs: Marpa::R2::Scanless::DSL#Discard-rules. And note, of course, that these 2 rules are optional. Or, you may wish to discard different characters.

In other words, we're ordering Marpa to discard whitespace. Since this conflicts with our stated aim of preserving some whitespace, we need a mechanism which allows us to tell Marpa to override this default under circumstances chosen by us. Enter the pause and event adverbs.

Here's a bit more of the grammar, given the syntax that nodes are declared as /^\[.*\]$/:

        node_name   ::= start_node end_node

        :lexeme     ~ start_node   pause => before   event => start_node
        start_node  ~ '['

        :lexeme     ~ end_node
        end_node    ~ ']'

I'll explain below why those rules contain no reference to the content between the '[' and ']'. See 'A Word on Strings and Invisibility'.

Pause allows us to specify that at certain locations within the parse, i.e. at certain locations within the input stream, we want Marpa to stop parsing for a while, and pass control of the parsing process over to us.

It's the for a while part that makes the pause adverb appropriate. After a while, i.e. whenever we feel like it, we simply tell Marpa to resume parsing.

And, during that pause, we handle all parsing, and preserve whitespace if we wish. Hence the word conditional in the title of this article.

Event allows us to name pauses, and to query Marpa as to which event triggered any particular pause.

An Intermediate Summary

We've actually touched on a number of topics already, ofter interrelated ones, so here's a list (just in case your head is spinning):

o The Scanless Interface (SLIF)

This is Marpa's newest and (only) recommended interface.

o SLIF-DSL

This is classic BNF enhanced to give us programmers a way to specify various mechanism whereby we can interact with Marpa during its parse.

o Ways to handle whitespace

In short, these are:

o Preserve all whitespace by default

I.e. Design a grammar where whitespace is never discarded.

o Preserve some whitespace

That is, preserve whitespace by default, while overriding that sometimes, by using pause, to handle the precise way whitespace is conditionally discarded.

o Discard all whitespace by default

I.e. Design a grammar where whitespace is always discarded.

o Discard some whitespace

That is, discard whitespace by default, while overriding that sometimes, by using pause, to handle the precise way whitespace is conditionally preserved.

This mechanism is the topic of this article.

o Pseudo-symbols such as :discard

These allow us to specify static aspects of how we want Marpa to act.

o Adverbs such as event

These allow us to name pauses, and to query Marpa as to which event triggered any particular pause.

o Adverbs such as pause

These allow us to specify dynamic aspects of how our code interacts with the parsing process.

Other languages implement the command (verb) yield, which transfers the flow of control back and forth in a vaguely similar fashion.

o Getting Marpa to resume parsing after the callback

Here we control the manner in which Marpa resumes. We do this by moving the offset within the input stream at which Marpa resumes.

Clearly this could involve moving that pointer backwards or forwards.

Since this determines the next token Marpa sees, it thus directly controls the next rule triggered. Naturally that could be the same rule as the last one triggered.

In this article, I'll always be moving it forwards, I assure you!

The direct implication of this is that after the pointer is moved forward, I am saying these 4 things:

o I've told Marpa I accept responsibility for processing a certain subset of the input stream

This is part of the grammar.

o I respond when Marpa tells me it's reached such a point

This is part of my code.

o I've finished that processing

So I return control back to Marpa.

o I'm telling Marpa to treat that subset as invisible, and to resume parsing after it

This happens because I told Marpa where to resume within the input stream.

Neat, huh?

A Word on Strings and Invisibility

Don't get confused by these 2:

o The syntax for a node

For my application, the regexp a node must match is /^\[.*\]$/.

The name of the node is within those '[' and ']' chars.

The name of the node matches /.*/, meaning it can be zero-length as it happens, but you (the reader) should not attach any significance to the length.

o The syntax for the node's grammar

The grammar a node must match is 'node_name ::= start_node end_node ...', as above, where start_node is '[' and end_node is ']'.

The grammar for the node says Marpa is only allowed to see the '[' and the ']'. The node's name is hidden from Marpa by the design of the grammar, which forces processing to switch into the code I've written to handle processing between those 2 tokens.

This makes the node's name invisible to Marpa, and explains why the grammar ('...start_node end_node...') says that - as far as Marpa is concerned - there is literally nothing between the '[' and the ']'.

Some Sample Input

These graphs are copied from the docs for MarpaX::Demo::StringParser. Since I've called them graphs, that means each line below is a complete input stream. Further, all sample lines below can be combined into a single file, now shipped (with V 1.06) as data/graph.04.ge.

o Edges
        ->
        --
o Graphs
        [node]
        [node] ->
        -> {label: Start} -> {color: red} [node.1] {color: green} -> [node.2]
        [node.1] [node.2] [node.3]
o Nodes
        []
        [node.1]
        [node 1]
        [[node\]]
        ["[node]"]
        [     From here     ] -> [     To there     ]

The Full Grammar of MarpaX::Demo::StringParser

Before discussing the Perl code for handling pauses, I want to mention a trap in grammar design.

Here's the grammar, with an SVG version from a new but unfinished module:

    01    source                 => \(<<'END_OF_SOURCE'),
    02
    03    :default               ::= action => [values]
    04
    05    # Overall stuff.
    06
    07    :start                 ::= graph_grammar
    08
    09    graph_grammar          ::= graph_definition    action => graph
    10
    11    # Graph stuff.
    12
    13    graph_definition       ::= node_definition
    14                               | edge_definition
    15    # Node stuff
    16
    17    node_definition        ::= node_statement
    18                               | node_statement graph_definition
    19
    20    node_statement         ::= node_name
    21                               | node_name attribute_definition
    22                               | node_statement (',') node_statement
    23
    24    node_name              ::= start_node end_node
    25
    26    :lexeme                ~ start_node        pause => before        event => start_node
    27    start_node             ~ '['
    28
    29    :lexeme                ~ end_node
    30    end_node               ~ ']'
    31
    32    # Edge stuff
    33
    34    edge_definition        ::= edge_statement
    35                               | edge_statement graph_definition
    36
    37    edge_statement         ::= edge_name
    38                               | edge_name attribute_definition
    39                               | edge_statement (',') edge_statement
    40
    41    edge_name              ::= directed_edge
    42                               | undirected_edge
    43
    44    :lexeme                ~ directed_edge      pause => before        event => directed_edge
    45    directed_edge          ~ '->'
    46
    47    :lexeme                ~ undirected_edge    pause => before        event => undirected_edge
    48    undirected_edge        ~ '--'
    49
    50    # Attribute stuff.
    51
    52    attribute_definition   ::= attribute_statement*
    53
    54    attribute_statement    ::= start_attributes end_attributes
    55
    56    :lexeme                ~ start_attributes   pause => before        event => start_attributes
    57    start_attributes       ~ '{'
    58
    59    :lexeme                ~ end_attributes
    60    end_attributes         ~ '}'
    61
    62    # Boilerplate.
    63
    64    :discard               ~ whitespace
    65    whitespace             ~ [\s]+
    66
    67    END_OF_SOURCE

You might be tempted to change the edge stuff from this:

    41    edge_name              ::= directed_edge
    42                               | undirected_edge
    43
    44    :lexeme                ~ directed_edge      pause => before        event => directed_edge
    45    directed_edge          ~ '->'
    46
    47    :lexeme                ~ undirected_edge    pause => before        event => undirected_edge
    48    undirected_edge        ~ '--'

To this:

    41    edge_name              ::= directed_edge
    42                               | undirected_edge
    43
    44    directed_edge          ::= '->'   action => edge
    45
    46    undirected_edge        ::= '--'   action => edge

It certainly looks simpler, doesn't it?

Then, your Perl would not need to contain code to handle those 2 pauses, since Marpa would just trigger calls to the action sub edge().

Don't do that!

What happens is that edge() is now called at a different time in the flow of processing than it is with the original grammar.

In the sample I tested, all the calls to edge() were then made after all the calls to pause-related code.

Why this happens is explained above in the quotation from the author of Marpa.

And now it's time to pause...

Handling the Pause on the Perl Side of Things

The Process

If Marpa is going to yield processing, we need to explicitly code for this.

BTW, a grammar without any pauses can process a string in one hit:

        my($string) = 'Input stream...';
        my($result) = $recognizer -> read(\$string);
        ...

To handle pauses however, we switch to a loop, since we must call resume() after each pause, until Marpa hits the end of the input stream.

So, here is the process() method of MarpaX::Demo::StringParser:

    01    sub process
    02    {
    03        my($self)   = @_;
    04        my($string) = $self -> graph_text;
    05        my($length) = length $string;
    06
    07        # We use read()/lexeme_read()/resume() because we pause at each lexeme.
    08
    09        my($attribute_list);
    10        my($do_lexeme_read);
    11        my(@event, $event_name);
    12        my($lexeme_name, $lexeme);
    13        my($node_name);
    14        my($span, $start);
    15
    16        for
    17        (
    18            my $pos = $self -> recce -> read(\$string);
    19            $pos < $length;
    20            $pos = $self -> recce -> resume($pos)
    21        )
    22        {
    23            print "read() => pos: $pos\n" if ($self -> verbose > 1);
    24
    25            $do_lexeme_read = 1;
    26            @event          = @{$self -> recce -> events};
    27            $event_name     = ${$event[0]}[0];
    28            ($start, $span) = $self -> recce -> pause_span;
    29            $lexeme_name    = $self -> recce -> pause_lexeme;
    30            $lexeme         = $self -> recce -> literal($start, $span);
    31
    32            print "pause_span($lexeme_name) => start: $start. span: $span. " .
    33                "lexeme: $lexeme. event: $event_name\n" if ($self -> verbose > 1);
    34
    35            if ($event_name eq 'start_attributes')
    36            {
    37                # Read the attribute_start lexeme, but don't do lexeme_read()
    38                # at the bottom of the for loop, because we're just about
    39                # to fiddle $pos to skip the attributes.
    40
    41                $pos            = $self -> recce -> lexeme_read($lexeme_name);
    42                $pos            = $self -> find_terminator(\$string, qr/}/, $start);
    43                $attribute_list = substr($string, $start + 1, $pos - $start - 1);
    44                $do_lexeme_read = 0;
    45
    46                print "index() => attribute list: $attribute_list\n" if ($self -> verbose > 1);
    47
    48                $self -> attribute_list($attribute_list);
    49            }
    50            elsif ($event_name eq 'start_node')
    51            {
    52                # Read the node_start lexeme, but don't do lexeme_read()
    53                # at the bottom of the for loop, because we're just about
    54                # to fiddle $pos to skip the node's name.
    55
    56                $pos            = $self -> recce -> lexeme_read($lexeme_name);
    57                $pos            = $self -> find_terminator(\$string, qr/]/, $start);
    58                $node_name      = substr($string, $start + 1, $pos - $start - 1);
    59                $do_lexeme_read = 0;
    60
    61                print "index() => node name: $node_name\n" if ($self -> verbose > 1);
    62
    63                $self -> node($node_name);
    64            }
    65            elsif ($event_name eq 'directed_edge')
    66            {
    67                $self -> edge($lexeme);
    68            }
    69            elsif ($event_name eq 'undirected_edge')
    70            {
    71                $self -> edge($lexeme);
    72            }
    73            else
    74            {
    75                die "Unexpected lexeme '$lexeme_name' with a pause\n";
    76            }
    77
    78            $pos = $self -> recce -> lexeme_read($lexeme_name) if ($do_lexeme_read);
    79
    80            print "lexeme_read($lexeme_name) => $pos\n" if ($self -> verbose > 1);
    81        }
    82
    83        # Return a defined value for success and undef for failure.
    84
    85        return $self -> recce -> value;
    86
    87    } # End of process.

Lines 16 .. 21 - The Loop

Lines 16 .. 21 are saying that since we put at least one pause in the grammar, we must expect Marpa to exit from the read() before the end of the input stream, and that we must immediately:

o Handle the pause

That's lines 25 .. 76.

o Handle the fact that Marpa paused before the lexeme declaring the pause

That's lines 41, 56 and 78, which tell Marpa to move past the lexeme with lexeme_read() (discarding it in the process). Then, at the top of the loop, we call resume().

So, we just have to explain lines 25 .. 76.

And 78 actually, since the call to lexeme_read() is dependent on something via the $do_lexeme_read variable.

Lines 25 .. 76 - The Context and The Code

Here we stockpile values for use below:

o Line 25: $do_lexeme_read = 1

We set a flag so that by default we read and hence skip the lexeme where Marpa paused. See line 78.

We reset this flag in some cases, as explained next. See lines 44 and 59.

The structure of this particular grammar has a feature you might not be using: Each pause is at the start of a pair of delimiters. That is, the expected input includes '[ ... ]' and '{ ... }'.

So, the code now has to do 2 things:

o Skip the lexeme Marpa paused at, i.e. the opening delimiter

And that's the first token of each pair, '[' or '{'. This is easy, since Marpa tells us where it paused.

See the calls to lexeme_read() at lines 41 and 56.

But, and it's a huge but, in these 2 cases we can't use the lexeme_read() at line 78, because we need its return value, $pos, at lines 43, 58 and, later, 20.

Hence we call lexeme_read() at lines 41 and 56, and we use $do_lexeme_read to disable the call at line 78.

o Skip the lexeme which is the closing delimiter of each pair

That's just the closing ']' or '}'. But this just happens automatically when we return to the top of the loop.

Well, not quite automatically. The calls to find_terminator(), at lines 42 and 57, do what the method name says, and also return $pos, which we use in lines 43 and 58 (discussed below), and which we pass back to Marpa at line 20 when we call resume().

This ties in with the grammar, lines 29 .. 30 and 59 .. 60, where we do not specify any pause for the closing delimiters, so Marpa just continues right on past them.

o Line 26: @event = @{$self -> recce -> events}

... and ...

o Line 27: $event_name = ${$event[0]}[0]

We ask Marpa which event triggered the pause.

That allows us to handle each case in the multi-branched 'if' starting at line 35.

Since the 'if's do simple string comparisons, there is no gain from using a smart-match module such as match::simple.

o Line 28: ($start, $span) = $self -> recce -> pause_span()

This tells us the offset within the input stream at which Marpa paused. So, we use this as the starting point for our scan of the input, looking for the closing delimiters matching the opening ones found by Marpa.

See lines 42 and 57.

o Line 29: $lexeme_name = $self -> recce -> pause_lexeme()

This tells us the name of lexeme at which Marpa paused. We use it to tell Marpa to step over it before resuming.

See lines 41, 56, and, for the other cases, 78.

o Line 30: $lexeme = $self -> recce -> literal($start, $span)

This is the string we need in those cases where we don't need to search between delimiters for our text.

We just pass this to the sub edge(), to be pushed on to a stack of interesting things.

See lines 67 and 71, and the matching grammar at lines 44 .. 48.

o Line 42: $pos = $self -> find_terminator(\$string, qr/}/, $start);

... and ...

o Line 57: $pos = $self -> find_terminator(\$string, qr/]/, $start);

These set $pos, i.e. the offset within the input stream, to the closing delimiter matching the opening one found by Marpa (as stated just above).

We do this because back at the top of the loop, $pos is passed to resume(), telling Marpa where exactly we want it to resume parsing.

o Line 43: $attribute_list = substr($string, $start + 1, $pos - $start - 1)

... and ...

o Line 48: $self -> attribute_list($attribute_list)

These, together with the matching pair, lines 58 and 63, preserve all our work by pushing it on to the stack mentioned above.

A Summary

The stack built is due to Marpa pausing, and then we call subs directly, rather than have Marpa call action subs.

All this must take place in a choreographed manner, so that the calls to those subs are in the correct chronological order.

This order (at least for my app, and I'm sure for any typical app) must correspond exactly to the order in which Marpa detects tokens of interest.

And this order is of course the order in which those tokens appear in the input stream.

Some General Notes about the Code

It's not obvious from the code shown, but sub process() is called (from sub run()) within a try ... catch structure, handled by the fine module Try::Tiny.

When process() exits successfully, the stack is analyzed in various ways.

Firstly, it's printed in a formatted manner if that's what the user asked for.

Secondly, also optionally, it's written to a CSV file for further processing at the user's leisure.

Wrapping Up and Winding Down

I hope I've made it easier to approach the dread topic of lexing and parsing. The simple fact is that with Marpa's new SLIF we can pass over to Marpa almost all the work, and just write code to handle precisely targeted places of interest within the input stream.

Overall, this should make clear the astonishing power of Marpa::R2 to relieve the programmer of what in the past was a huge component of such work.

This is combined with an almost-always clean interface, allowing our code to interact with Marpa neatly and efficiently.

And I have not even mentioned its other features, such as the precise pin-pointing of errors, etc.

Author

Ron Savage .

Home page: http://savage.net.au/index.html

Licence

Australian Copyright © 2013 Ron Savage. All rights reserved.

        All Programs of mine are 'OSI Certified Open Source Software';
        you can redistribute them and/or modify them under the terms of
        The Artistic License, a copy of which is available at:
        http://www.opensource.org/licenses/index.html