The module, MarpaX::Languages::Dash, was written to showcase the material discussed here, and to give you something to play with.
And the grammar from this module will be incorporated into the re-write of GraphViz2::Marpa.
Also, I'm currently (2016-12-25) using Marpa::R2 V 3.000000.
With the advent of Marpa's SLIF, we need to feel comfortable with both Marpa's approach and the fact that, classically, lexing and parsing were often such visibly distinct phases, perhaps even requiring 2 programs.
The following is a quotation from Jeffrey Kegler to the Google Group on Marpa, reformatted slightly by me to suit this article:
"Marpa's SLIF follows the classical separation of lexing and parsing, but it introduces a notation which is friendly to those programmers more accustomed to regular expressions and recursive descent.
So instead of two completely separate modules, lexer and parser, each with its own interface, you have a single DSL source file, and the lexing and parsing statements look similar, with only the different operators ("::=" versus "~") to tell them apart.
Underneath however, in the implementation, the distinction is still there, as sharp as ever."
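To make that concrete, here is a tiny, hypothetical SLIF fragment, invented for this article (it is not taken from the module's grammar), in which the parsing rule uses '::=' while the lexing rules use '~':

```
:default    ::= action => [values]

greeting    ::= salutation name  # '::=' introduces a parsing (G1) rule.

salutation  ~ 'Hello'            # '~' introduces lexing (L0) rules.
name        ~ [\w]+

:discard    ~ whitespace
whitespace  ~ [\s]+
```

One source string, two kinds of statements, and underneath Marpa still keeps the two phases sharply separate.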
See Wikipedia for more on DSLs (Domain-specific languages).
And here is another quotation from the author of Marpa:
"Marpa has 3 phases: Grammar pre-processing; Recognition; and Evaluation. That is, Marpa:
Events occur during recognition. Actions are executed during the evaluation. In other words, by the time the first action is executed, the last event is ancient history."
The section below, called 'The Full Grammar of MarpaX::Languages::Dash', warns you about a trap which can ensue from a misunderstanding of the last point, i.e. that all events are triggered before any action sub is called.
These special tokens within the grammar's syntax, event and action, are discussed at length below.
And all this takes place very efficiently indeed.
Consequently, Marpa's new design completely eliminates the necessity for you to write code to perform your own lexing.
But keep in mind that Marpa does not forbid you to do your own lexing, and thus to pass specific tokens into Marpa, either instead of letting Marpa do any lexing, or by using a combination of both manual and Marpa-based lexing.
Marpa now reads your grammar using the syntax for Marpa's own extended BNF. Marpa's docs for this are here.
I'm going to be referring to this many times, and rather than say 'Marpa-style BNF' or 'Extended BNF', I'll call it an SLIF-DSL, or usually just a BNF.
In practice, this means you express your grammar in a string, and Marpa treats that as a set of rules as to how you want Marpa to treat your input stream.
We're all familiar with code which declares a single- or double-quoted string, and expects any whitespace within that string to be preserved, while at the same time expecting arbitrary whitespace in the source code to be discarded. This means we can format the source in different ways, and still have whitespace preserved when it's protected by quotes. This much is simple.
Well, this is precisely what I mean by 'conditional preservation of whitespace': Some is preserved and some is discarded. So, how do we do that using Marpa?
One way would be to have Marpa preserve all whitespace, and commit ourselves to processing it somehow. Here, I'm going to ignore that option, since there's now a rather more sophisticated way to do it.
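Before turning to Marpa, the familiar behaviour is easy to see in plain Perl. This sketch (the sub name is mine, and the pattern deliberately ignores escapes and nested quotes) splits on unquoted whitespace while preserving quoted whitespace:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Split input on unquoted whitespace, but keep whitespace which is
# protected by single quotes. A minimal sketch: no escapes, no nesting.
sub tokenize
{
	my($text)  = @_;
	my(@token) = $text =~ /'([^']*)'|(\S+)/g;

	# In list context, each match yields both capture groups,
	# one of which is undef. Keep the defined one.

	return grep { defined } @token;
}

my(@token) = tokenize(q(a b 'c  d' e));

print join('|', @token), "\n"; # Prints a|b|c  d|e.
```

The whitespace inside the quotes survives; the rest is discarded. That is exactly the conditional behaviour we want Marpa to give us.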
Firstly, here's a tiny fragment of a grammar, in SLIF-DSL:
source => \(<<'END_OF_SOURCE'),
...
:discard   ~ whitespace
whitespace ~ [\s]+
END_OF_SOURCE
The entire grammar is a reference to a string of rules, whose oddish delimiters are \(<<'END_OF_SOURCE') and END_OF_SOURCE.
But, see those 2 rules which mention the reserved pseudo-symbol :discard and the token whitespace? This is Marpa's way of allowing us to say: Discard all whitespace by default. We could have called whitespace anything. It's not a reserved word.
For more on :discard, see the docs: Marpa::R2::Scanless::DSL#Discard-rules. And note, of course, that these 2 rules are optional. Or, you may wish to discard different characters.
In other words, we're ordering Marpa to discard whitespace. Since this conflicts with our stated aim of preserving some whitespace, we need a mechanism which allows us to tell Marpa to override this default under circumstances chosen by us. Enter the pause and event adverbs.
Note that this usage of Marpa's adverbs event and pause should be regarded as an intermediate/advanced technique. For people just beginning to use Marpa, use of the action adverb is the recommended technique.
Here's a bit more of the grammar, given the regexp that node names must match is /^\[.*\]$/:
node_name       ::= start_node end_node            # Allow for the anonymous node, [].
                  | start_node node_name end_node  # Allow for [wxyz].

:lexeme         ~ start_node  pause => before  event => start_node
start_node      ~ '['

:lexeme         ~ end_node    pause => before  event => end_node
end_node        ~ ']'

:lexeme         ~ node_name   pause => before  event => node_name
node_name       ~ string_char_set+

string_char_set ~ escaped_char
                | [^;:}\]]  # Neither a separator [;:] nor a terminator [}\]].

escaped_char    ~ '\' [[:print:]]
Pause allows us to specify that at certain locations within the parse, i.e. at certain locations within the input stream, we want Marpa to stop parsing for a while, and pass control of the parsing process over to us.
It's the for a while part that makes the pause adverb so powerful. After a while, i.e. whenever we feel like it, we simply tell Marpa to resume parsing.
And, during that pause, we handle all parsing, and preserve whitespace if we wish. Hence the word conditional in the title of this article. And I say if we wish because there are times when we encounter leading and trailing spaces, for example, and choose to discard them. Likewise, we might choose to compress internal whitespace.
Event allows us to name pauses, and to query Marpa as to which event triggered any particular pause.
Don't get confused by these 2:
For my application, the regexp a node must match is /^\[.*\]$/.
The name of the node is within those '[' and ']' chars.
The name of the node matches /.*/, meaning it can be zero-length as it happens, but you (the reader) should not attach any significance to the length.
The grammar a node must match is either 'node_name ::= start_node end_node', or 'node_name ::= start_node node_name end_node', where start_node is '[' and end_node is ']'.
So what is the node's grammar saying?
There is no name between the start and end lexemes, so when Marpa finds such a case, i.e. no text in the input stream between '[' and ']', it triggers the event start_node, where we ignore the '[', and it triggers the event end_node, where we handle the case of the anonymous node by inserting a Tree::DAG_Node node into the output tree. As it happens, we also ignore the ']'. This all takes place in the method _process().
Here, there is a name between '[' and ']', and as above Marpa triggers the events (in this order): start_node, node_name, end_node. And, also as above, we happen to ignore the '[' and ']', but for the node_name we trigger our own (simple) processing, via the methods clean_after() and _add_daughter().
Luckily, these 2 cases are fairly simple. Below, where attributes of a node or an edge are discussed, much more complexity needs to be handled.
I should warn you about Marpa's lookahead. There isn't any. That is, when Marpa detects a lexeme such as '[' at, presumably, the start of a node, Marpa is not saying that it has used lookahead to find the corresponding ']', and hence the '[' is provably the start of a node. It's just saying that so far (i.e. at the '['), the text does not contain anything which contradicts the grammar used so far. That is, the '[' triggers the 'start_node' event without implying anything about what follows the '['. So even saying it's the start of a node is an assumption, not yet supported by the facts.
Of course, once an event is triggered you know the offset within the input stream ($pos in my code), so you can perform your own lookahead.
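Here is a plain-Perl sketch of that idea (the sub name is hypothetical): given $pos, peek ahead in the raw input for a ']' before committing to any interpretation of the '[':

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Given the input string and the offset of a '[' lexeme, peek ahead
# for a ']'. Purely illustrative: real code would also have to honour
# escaped characters within the node name.
sub node_closes_ahead
{
	my($string, $pos) = @_;

	return index($string, ']', $pos) >= 0;
}

my($input) = '[node.1] -> [node.2';

print node_closes_ahead($input, 0)  ? "yes\n" : "no\n"; # yes: the first node closes.
print node_closes_ahead($input, 12) ? "yes\n" : "no\n"; # no: the second never closes.
```

Marpa itself makes no such check at the '['; if you need one, you write it.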
We've actually touched on a number of topics already, often interrelated ones, so here's a list (just in case your head is spinning):
This is Marpa's newest and (only) recommended interface.
This is classic BNF enhanced to give us a way to specify various new features:
In short, these are:
I.e. Design a grammar where whitespace is never discarded.
That is, preserve whitespace by default, while overriding that sometimes, by using pause, to handle the precise way whitespace is conditionally discarded.
I.e. Design a grammar where whitespace is always discarded.
That is, discard whitespace by default, while overriding that sometimes, by using pause, to handle the precise way whitespace is conditionally preserved.
The last choice is the topic of this article.
These allow us to specify static aspects of how we want Marpa to act.
These allow us to name pauses, and to query Marpa as to which event triggered any particular pause.
These allow us to specify dynamic aspects of how our code interacts with the parsing process.
Other languages implement the command (verb) yield, which transfers the flow of control back and forth in a vaguely similar fashion.
Also, when we pause on a lexeme, that lexeme does not have to be a fixed string. In the BNF below, the string 'label' is a lexeme, but so are the variable-length strings I've called 'attribute_name' and 'attribute_value'. Likewise for 'node_name'.
Here we control the manner in which Marpa resumes. We do this by moving the offset within the input stream at which Marpa resumes.
In fact, this could involve moving that pointer backwards or forwards.
Since this determines the next token Marpa sees, it thus directly controls the next rule triggered. Naturally that could be the same rule as the last one triggered.
In this article, I'll always be moving it forwards, I assure you!
The direct implication of this is that after the pointer is moved forward, I am saying these 4 things:
This is part of the grammar.
This is part of my code.
So I return control back to Marpa.
This happens because I told Marpa where to resume within the input stream.
Neat, huh?
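The pointer adjustment being described can be sketched in isolation. This is plain Perl, modelled on the separator-skipping loop inside _process() (the sub name is mine); the returned offset is the kind of value eventually handed to resume():

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Advance $pos past whitespace and ';' separators: exactly the sort of
# adjustment made before handing $pos back to resume(). A sketch only.
sub skip_separators
{
	my($string, $pos) = @_;

	$pos++ while ( ($pos < length $string) && (substr($string, $pos, 1) =~ /[\s;]/) );

	return $pos;
}

my($string) = 'red;  [node.2]';

print skip_separators($string, 3), "\n"; # Prints 6, the offset of the '['.
```

Since resume() starts lexing at the offset we supply, the next token Marpa sees is the '[', and the separators are never re-examined.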
These graphs are copied from the sample data for MarpaX::Languages::Dash. Since I've called them graphs, that means each line below is a complete input stream. Further, all sample lines below can be combined into a single file, another graph, now shipped in the distro as data/graph.04.dash.
-> ->
[node] [node] -> -> {label: Start} -> {color: red} [node.1] {color: green} -> [node.2] [node.1] [node.2] [node.3]
[] [node.1] [node 1] [\[node\]] ["node"] [ From here ] -> [ To there ]
BTW, here is an SVG of this grammar, produced by MarpaX::Grammar::GraphViz2:
01  :default              ::= action => [values]
02
03  lexeme default        =   latm => 1  # Longest Acceptable Token Match.
04
05  graph_grammar         ::= graph_definition
06
07  # Graph stuff.
08
09  graph_definition      ::= node_definition
10                          | edge_definition
11  # Node stuff
12
13  node_definition       ::= node_statement
14                          | node_statement graph_definition
15
16  node_statement        ::= node_name_token
17                          | node_name_token attribute_definition
18                          | node_statement (',') node_statement
19
20  node_name_token       ::= start_node end_node  # Allow for the anonymous node.
21                          | start_node node_name end_node
22
23  # Edge stuff
24
25  edge_definition       ::= edge_statement
26                          | edge_statement graph_definition
27
28  edge_statement        ::= edge_name
29                          | edge_name attribute_definition
30                          | edge_statement (',') edge_statement
31
32  edge_name             ::= directed_edge
33                          | undirected_edge
34
35  # Attribute stuff.
36
37  attribute_definition  ::= attribute_statement+
38
39  attribute_statement   ::= start_attributes string_token_set end_attributes
40
41  string_token_set      ::= string_token_pair+
42
43  string_token_pair     ::= literal_label
44                          | attribute_name (':') attribute_value
45
46  # Lexemes in alphabetical order.
47
48  :lexeme               ~ attribute_name  pause => before  event => attribute_name
49
50  attribute_name        ~ string_char_set+
51
52  :lexeme               ~ attribute_value  pause => before  event => attribute_value
53
54  attribute_value       ~ string_char_set+
55
56  :lexeme               ~ directed_edge  pause => before  event => directed_edge  priority => 2
57  directed_edge         ~ '->'
58
59  :lexeme               ~ end_attributes  pause => before  event => end_attributes  priority => 1
60  end_attributes        ~ '}'
61
62  :lexeme               ~ end_node  pause => before  event => end_node  priority => 1
63  end_node              ~ ']'
64
65  escaped_char          ~ '\' [[:print:]]
66
67  # Use ' here just for the UltraEdit syntax hiliter.
68
69  :lexeme               ~ literal_label  pause => before  event => literal_label  priority => 1
70  literal_label         ~ 'label'
71
72  :lexeme               ~ node_name  pause => before  event => node_name
73
74  node_name             ~ string_char_set+
75
76  :lexeme               ~ start_attributes  pause => before  event => start_attributes
77  start_attributes      ~ '{'
78
79  :lexeme               ~ start_node  pause => before  event => start_node
80  start_node            ~ '['
81
82  string_char_set       ~ escaped_char
83                        | [^;:}\]]  # Neither a separator [;:] nor a terminator [}\]].
84
85  :lexeme               ~ undirected_edge  pause => before  event => undirected_edge  priority => 2
86  undirected_edge       ~ '--'
87
88  # Lexemes in alphabetical order.
89
90  # Boilerplate.
91
92  :discard              ~ whitespace
93  whitespace            ~ [\s]+
Now, you might be tempted to change the edge stuff from this:
56  :lexeme          ~ directed_edge  pause => before  event => directed_edge  priority => 2
57  directed_edge    ~ '->'

85  :lexeme          ~ undirected_edge  pause => before  event => undirected_edge  priority => 2
86  undirected_edge  ~ '--'
To this:
41  edge_name        ::= directed_edge
42                     | undirected_edge
43
44  directed_edge    ::= '->'  action => edge
45
46  undirected_edge  ::= '--'  action => edge
It certainly looks simpler, doesn't it?
Then, your Perl would not need to contain code to handle those 2 pauses, since Marpa would just trigger calls to the action sub edge(), which admittedly you must write.
Don't do that!
What happens is that edge() is now called at a different time in the flow of processing than events are with the original grammar. In the sample I tested, all the calls to edge() were made after all the calls to pause-related code.
Why this happens is explained above in the quotation from the author of Marpa.
And now it's time to pause...
If Marpa is going to yield processing, we need to explicitly code for this.
BTW, a grammar without any pauses can process a string in one hit:
my($string) = 'Input stream...';
my($result) = $recognizer -> read(\$string);
...
To handle pauses, however, we switch to a loop, since we must call resume() after each pause, until Marpa hits the end of the input stream.
So, here is the _process() method of MarpaX::Languages::Dash:
01  sub _process
02  {
03      my($self)       = @_;
04      my($string)     = $self -> clean_before($self -> graph_text);
05      my($length)     = length $string;
06      my($last_event) = '';
07      my($format)     = '%-20s %5s %5s %-s';
08
09      # We use read()/lexeme_read()/resume() because we pause at each lexeme.
10
11      my($event_name, $edge);
12      my(@fields);
13      my($lexeme, $literal);
14      my($span, $start);
15
16      $self -> log(debug => sprintf($format, 'Event', 'Start', 'Span', 'Lexeme') );
17
18      for
19      (
20          my $pos = $self -> recce -> read(\$string);
21          $pos < $length;
22          $pos = $self -> recce -> resume($pos)
23      )
24      {
25          $event_name     = $self -> _validate_event;
26          ($start, $span) = $self -> recce -> pause_span;
27          $pos            = $self -> recce -> lexeme_read($event_name);
28          $literal        = substr($string, $start, $pos - $start);
29          $lexeme         = $self -> recce -> literal($start, $span);
30
31          $self -> log(debug => sprintf($format, $event_name, $start, $span, $lexeme) );
32
33          if ($event_name eq 'attribute_name')
34          {
35              $fields[0] = $self -> clean_after($literal);
36          }
37          elsif ($event_name eq 'attribute_value')
38          {
39              $literal = $self -> clean_after($literal);
40
41              $self -> _add_daughter($fields[0], {value => $literal});
42
43              @fields = ();
44
45              # Skip the separator.
46
47              while ( ($pos < (length($string) - 1) ) && (substr($string, $pos, 1) =~ /[\s;]/) ) { $pos++ };
48          }
49          elsif ($event_name eq 'directed_edge')
50          {
51              $self -> _add_daughter('edge_id', {value => $self -> clean_after($literal)});
52          }
53          elsif ($event_name eq 'end_attributes')
54          {
55              $self -> _process_brace($literal);
56          }
57          elsif ($event_name eq 'end_node')
58          {
59              # Is this the anonymous node?
60
61              if ($last_event eq 'start_node')
62              {
63                  $self -> _add_daughter('node_id', {value => ''});
64              }
65          }
66          elsif ($event_name eq 'literal_label')
67          {
68              push @fields, $literal;
69
70              $pos = $self -> _process_label($self -> recce, \@fields, $string, $length, $pos);
71              @fields = ();
72          }
73          elsif ($event_name eq 'node_name')
74          {
75              $literal = $self -> clean_after($literal);
76
77              $self -> _add_daughter('node_id', {value => $literal});
78          }
79          elsif ($event_name eq 'start_attributes')
80          {
81              $self -> _process_brace($literal);
82          }
83          elsif ($event_name eq 'start_node')
84          {
85              # Do nothing.
86          }
87          elsif ($event_name eq 'undirected_edge')
88          {
89              $self -> _add_daughter('edge_id', {value => $self -> clean_after($literal)});
90          }
91
92          $last_event = $event_name;
93      }
94
95      if ($self -> recce -> ambiguity_metric > 1)
96      {
97          $self -> log(notice => 'Ambiguous parse');
98      }
99
100     if (my $ambiguous_status = $self -> recce -> ambiguous)
101     {
102         $self -> log(notice => "Parse is ambiguous: $ambiguous_status.");
103     }
104
105     # Return a defined value for success and undef for failure.
106
107     return $self -> recce -> value;
108
109 } # End of _process.
Note: In the giant 'if' statement, essentially a big switch, event names are tested in alphabetical order, for ease of reading.
These lines are saying that since we put at least one pause in the grammar, we must expect Marpa to exit from read() before the end of the input stream, and that we must consequently:
That's line 25.
That's lines 26 .. 28, which tell Marpa to move past the lexeme with lexeme_read().
That's line 93, where we hit the end of the loop, and line 22, where we call resume().
So, we just have to explain lines 29 .. 92.
These just grab the lexeme for the debug print. Thereafter, $lexeme is not used.
'attribute_name' is not the first event expected in a syntactically valid file. The first would be either 'directed_edge', 'undirected_edge' or 'start_node'.
Also, we only get 'attribute_name' after Marpa parses a '{', which will have thus already triggered a 'start_attributes' event, which calls _process_brace($literal).
Now, 'What is a string?' is a more fundamental question than that ancient saying, 'How long is a piece of string?'.
Recall these lines of the grammar:
43  string_token_pair  ::= literal_label
44                       | attribute_name (':') attribute_value
With 'attribute_name' and 'attribute_value' (as with 'node_name'), BNF line 44 says we are going to let Marpa do all the lexing and parsing for these cases. Here's how it works.
In this grammar, I've defined these as being strings:
And 'string_char_set' is defined in BNF lines 82 .. 83, as either escaped characters or certain unescaped characters. The latter must belong to a character set which excludes precisely those characters we treat as special, since they indicate the ends or separators of various tokens.
This means one of those special characters, let's say ':', can appear in a string if it's escaped, but if it's not escaped, it is going to be taken as having its special meaning.
Back to escaped characters: BNF line 82 directs us to BNF line 65, which shows how to declare a set of escaped characters, and also shows that Marpa accepts any named character set which Perl itself accepts.
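The same matching and unescaping can be sketched in plain Perl (the sub name is hypothetical; the patterns mirror BNF lines 65 and 82 .. 83): match a run of escaped or safe characters, then strip the backslashes:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Match a string made of escaped printable characters or of characters
# which are neither separators [;:] nor terminators [}\]], then drop
# the escapes. Returns undef if the candidate contains an unescaped
# special character.
sub unescape_string
{
	my($candidate) = @_;

	return undef unless $candidate =~ /^(?:\\[[:print:]]|[^;:}\]])+$/;

	(my $clean = $candidate) =~ s/\\(.)/$1/g;

	return $clean;
}

print unescape_string('node\:1'), "\n";                            # Prints node:1.
print defined unescape_string('node:1') ? "defined\n" : "undef\n"; # undef: unescaped ':'.
```

So an escaped ':' survives into the token's value, while an unescaped ':' keeps its special meaning as a separator.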
Line 35 says preserve the attribute's name ($literal) in $fields[0], while we wait for Marpa to give us the attribute's value.
These lines add the (attribute name, attribute_value) pair to the tree of tokens, and reset some things.
If you're wondering how Graphviz deals with, say '\:' (which is not one of Graphviz's acceptable escape sequences), or with characters not in the set [[:print:]], remember that my module is not claiming to perform, duplicate or replace Graphviz's own validation. Rather, I'm claiming that a significant block of code in Graphviz, i.e. the lexer and (much of) the parser, can - at least in theory - be replaced by a combination of Marpa and a grammar very much like this one.
In Graphviz, all attribute names and values are already being syntax-checked, and that would still have to be the case after adopting Marpa.
Here we just add '->' to the tree.
Here we call _process_brace() because there's a bit of work to do:
Note: Pushing the stack is discussed below. See 'start_attributes'.
Between '{' and '}', parsed tokens (attribute names and values) belong to the last parsed edge or node. But after '}', the next token can't be a child of that edge or node. It must actually be an edge or a node. And this new token will become the parent of any following attributes.
This does mean there is no such thing as an edge or node belonging to another edge or node.
In DOT, sub-graphs can group nodes, but DASH does not support sub-graphs.
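That parent-tracking can be sketched with plain hashes (all names here are mine; the module itself builds a Tree::DAG_Node tree): each new edge or node replaces the previous one as the owner of subsequent attributes:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Track which edge or node owns the attributes seen between '{' and '}'.
# A sketch only: hashes stand in for the module's Tree::DAG_Node nodes.
my(@tree);
my($parent);

sub add_item       # A new edge or node becomes the attribute parent.
{
	my($name) = @_;

	$parent = {name => $name, attributes => {}};

	push @tree, $parent;  # Items never own items, so @tree stays flat.
}

sub add_attribute  # Attach to whatever item is currently the parent.
{
	my($key, $value) = @_;

	$$parent{attributes}{$key} = $value;
}

add_item('node.1');
add_attribute(color => 'green');  # From '{color: green}'.
add_item('->');                   # After '}' comes an edge or node...
add_attribute(label => 'Start');  # ... which owns the next attributes.

print $tree[1]{name}, "\n"; # Prints ->.
```

Note how the tree stays flat at the item level: an attribute always hangs off the most recent edge or node, never off another attribute.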
These lines handle 2 strings in the input stream, '[]' and '[node_name]'. The corresponding event sequences are:
Since, at 'end_node', the very last event was not 'node_name', there cannot be a node name, so we know we are dealing with the anonymous node.
This time we have a node name to add to the tree.
This is the start of very complex code.
It is our major transition from Marpa-based lexing and parsing to doing things manually. In many real-world cases, you could write a grammar to handle similar issues, but since the syntax of "DASH" in MarpaX::Languages::Dash allows all of unquoted, single-quoted, double-quoted and HTML-quoted ('<', '>') strings, I could not see how to do that with BNF alone. So, manual parsing it has to be. And that is the basis for this article. Actually, the real complexity is allowing any of those strings to contain escaped characters which are exactly the end-of-string characters.
Firstly, we save 'label' as the name of the current attribute, as per BNF lines 69 .. 70.
Then at line 70 we parse whatever type of string is given as the value of the label.
This is easy. We're using Marpa to handle node names.
This is where we push the current edge or node into a stack, since this (tree) node must be the parent of any following attributes.
See the discussion above, "Lines 53 .. 56 - 'end_attributes'", for details.
We do nothing because we're letting Marpa find both the node's name and the terminating ']'.
As with 'directed_edge', we just add '--' to the tree.
We save the name of the last event, in case we're dealing with an input of '[]', when we wish to know whether the last event was 'node_name' or 'start_node', as discussed above.
All this must take place in a choreographed manner, so that the events triggered are in the correct chronological order.
This order (at least for my app, and I'm sure for any typical app) must correspond exactly to the order in which Marpa detects tokens of interest.
And this order is of course the order in which those tokens appear in the input stream.
It's not obvious from the code shown, but sub _process() is called (from sub run()) within a try ... catch structure, handled by the fine module Try::Tiny.
When _process() exits successfully, the tree of tokens can be analyzed in various ways.
In the distro for MarpaX::Languages::Dash, you'll see scripts/render.pl passes this tree to the default renderer, for conversion into a real, live, Graphviz DOT file.
Basically though, it's just printed in a formatted manner if that's what the user asked for.
I hope I've made it easier to approach the dread topic of lexing and parsing. The simple fact is that with Marpa's new SLIF we can pass over to Marpa almost all the work, and just write code to handle precisely targeted places of interest within the input stream.
Overall, this should make clear the astonishing power of Marpa::R2 to relieve the programmer of what in the past was a huge component of such work.
This is combined with an almost-always clean interface, allowing our code to interact with Marpa neatly and efficiently.
And I have not even mentioned its other features, such as the precise pin-pointing of errors, etc. These are spelled out in Chapter 2 of The Marpa Papers.
Ron Savage.
Marpa's homepage: http://savage.net.au/Marpa.html
Homepage: http://savage.net.au/index.html
Australian Copyright © 2013 Ron Savage. All rights reserved.
All Programs of mine are 'OSI Certified Open Source Software'; you can redistribute them and/or modify them under the terms of The Artistic License, a copy of which is available at: http://www.opensource.org/licenses/index.html