The Marpa FAQ

On-line docs

The Marpa FAQ

The Marpa Papers

Github repos

The Marpa FAQ repo

The Marpa Papers repo

Table of Contents

1 Error: Parse failed. value() called when recognizer is not in tree mode

The culprit is Marpa::R2::Scanless::R's ambiguous() method.

Forest mode is something most users don't care about. It is used for Marpa's Abstract Syntax Forests. See Marpa::R2::ASF if you want to know more about those. The usual mode is 'tree mode', and most of Marpa's calls, such as value(), require you to be in 'tree mode'.

ambiguous() uses forest mode to print a detailed description of the ambiguity. (By the way, try getting that out of another parser.) And ambiguous() leaves you in forest mode, which means you cannot do much of the usual processing, such as calling value(). So how do you get around this problem?

There are two ways, depending on what you are trying to do. Most often, you regard an ambiguity in your parse as an error, so you want to print a helpful message and abend if the parse is ambiguous, and go on to do normal processing if it is not. What you should do in this case is call ambiguity_metric() first. ambiguity_metric() returns 1 for an unambiguous parse and 2 or more for an ambiguous parse. Errors return something <= 0. And ambiguity_metric() does not put you into forest mode. See also Q4 Error message: Marpa exited....

Less often, you want to both print out a message describing the parse's ambiguity, and then go on to deal with it in 'tree mode'. In this case, you can call the series_restart() method. series_restart() puts you back into 'tree mode'.
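That sequence can be sketched as follows, assuming $recce holds a Marpa::R2::Scanless::R object whose read() has already completed (this is an illustration, not code from any distro):

```perl
# Assumption: $recce is a Marpa::R2::Scanless::R object which has
# finished reading its input.
if ($recce -> ambiguity_metric() > 1)
{
    # ambiguous() prints a description of the ambiguity, and leaves
    # the recognizer in forest mode.
    print $recce -> ambiguous();

    # series_restart() puts the recognizer back into tree mode.
    $recce -> series_restart();
}

# Safe now, because we are back in tree mode.
my $value = $recce -> value;
```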

And here is some code copied directly from V 2 of Genealogy::Gedcom::Date. The method decode_result() is listed below, and Dumper() is from Data::Dumper::Concise. Of course the calling code checks that the return from ambiguity_metric() is > 1:

    my($count) = 0;

    while (my $value = $self -> recce -> value)
    {
        $value = $self -> decode_result($$value); # Returns an arrayref, hence the 'for' below.

        $self -> log(debug => "Solution @{[++$count]}: \n" . Dumper($_) ) for @$value;
    }

For more on ambiguity_metric(), see Q4.

But remember, to handle ambiguity, you don't need to use this code. You can take a completely different route with Marpa::R2::ASF.

And there are still more choices: Leave the ambiguity as an error, re-work the grammar, or perhaps switch to events attached to lexemes.

In the latter case, if 2 or more events are fired by the same lexeme, you can write code which chooses 1 of those events, and then tell Marpa to continue with that choice. For code using this technique, see GraphViz2::Marpa, Text::Balanced::Marpa or Text::Delimited::Marpa.

2 Error: A lexeme in G1 is not a lexeme in any of the lexers?

Marpa automatically determines which symbols are 'lexemes'. To do this, it checks for all the symbols that the G1 grammar considers to be lexemes, and all the symbols that the L0 grammar considers to be lexemes. It requires these two lists to match.

A symbol is a G1 lexeme if it appears on the RHS of a G1 rule, but never appears on the LHS of any G1 rule. A symbol is an L0 lexeme if it appears on the LHS of an L0 rule, but never appears on the RHS of any L0 rule.
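For example, in this illustrative fragment (names invented for this FAQ), the symbol 'word' satisfies both conditions, so it is a lexeme in both lists:

```
:default ::= action => ::first

pair     ::= word word  # 'word' is on a G1 RHS, and on no G1 LHS ...
word     ~ [\w]+        # ... and on an L0 LHS, and on no L0 RHS.
```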

3 Error: A lexeme in lexer L0 is not a lexeme in G1

See the previous FAQ question.

4 Error message: Marpa exited at (line, column) = ($line, $column) within the input string

In the module Genealogy::Gedcom::Date, this text is part of the error message when a call to Marpa ambiguity_metric() returned a value <= 0.

Consider the possibility that the parse ends without a successful parse, but the input is the prefix of some input that can lead to a successful parse.

Marpa is not reporting a problem during the read(), because you can add more to the input string, and Marpa does not know that you do not plan to do this.
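An invented example may help. Given this grammar:

```
:start ::= pair
pair   ::= 'a' 'b'
```

reading the input 'a' succeeds, because 'a' is a prefix of the valid input 'a b', and more input could still arrive. The failure only surfaces afterwards, when value() or ambiguity_metric() finds no completed parse.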

101 Where do I go for help?

The IRC channel may get the quickest response: irc.freenode.net#marpa.

The Google Group has message threading, and so is good for complex or extended discussions.

Marpa's homepage.

102 What is Libmarpa?

Libmarpa is the core Marpa library, written in C, which implements the Marpa parse engine - the Marpa algorithm itself. It is low-level, but it is documented and can be installed and used separately.

Homepage.

Formal docs.

103 Is Marpa deterministic?

No. Marpa is non-deterministic, unlike traditional parsers. Marpa tries everything, all at once.

This may take some getting used to, since what you're probably used to is a deterministic parser.

If two alternatives (in a prioritized rule say, separated by '|' or '||') happen to match some source, Marpa produces 2 parse trees.

The parser, then, catches all ambiguities, but the SLIF lexer does not, whether it is LTM or LATM. LTM is Longest Token Match, and LATM is Longest Acceptable Token Match.

The SLIF lexer does a longest tokens (plural) match. So it does recognize ambiguous tokens if they are longest tokens.

Ambiguous tokens which are shorter than the matched ones (whether the discipline is LATM or LTM) are ignored by the lexer.

Consequently, ambiguous matched tokens must all be the same length.
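A tiny invented example of such same-length ambiguity:

```
item ::= word | name
word ~ [a-z]+
name ~ 'marpa'
```

Given the input 'marpa', both 'word' and 'name' match, and both matches are 5 characters long, so the lexer offers both tokens and the parse is ambiguous.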

See also the next 2 questions.

104 What is the difference between ambiguity and non-determinism?

Ambiguous means there is more than one parse.

Non-deterministic means that, in the course of discovering the parse or parses, multiple paths may be pursued at the same time.

A deterministic parser must always produce an unambiguous parse, if only by ignoring the other alternatives, as in PEG.

A non-deterministic parser, like Marpa, can produce both ambiguous and unambiguous parses.

Example code.

107 Why doesn't my ambiguous grammar return more than 1 parse tree?

Probably because your grammar includes

lexeme default = latm => 1 # Activate the Longest Acceptable Token Match option.

Try commenting that line out after, perhaps, re-reading the previous question's answer.

See also the next 2 questions.

108 How do I implement disambiguation logic?

See under Bailing-out-of-parse-evaluation in Marpa::R2::Semantics, for how to short-circuit semantic analysis.

See also this gist.

See also the previous and next questions.

109 What is the meaning of L0 and G1?

L0 defines exactly what tokens are acceptable in the input stream. So, these are the rules defining the syntax of the tokens, but not the syntax of how these tokens can be combined. G1 defines the structure of the language - the ways in which the tokens can be combined. In other words, G1 is structural and therefore a 'parser' in the strictest sense, and L0 is its lexer. ('L' for lexer.)

One distinction between G1 and L0 is very important in designing a grammar/lexer combination: G1 rules have a semantics. L0 rules do not.

L0 and G1 are discussed in Marpa::R2::Scanless::DSL and, very briefly, in Marpa::R2::Semantics. In both cases, search for 'L0'.

110 How do I parse comments?

GraphViz2::Marpa implements comment skipping for C and C++ comments, and for #-prefix style comments. Search the source of Marpa.pm for 'C style comment', 'Cplusplus style comment' and 'hash style comment'.
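The usual mechanism is an L0 rule for the comment, plus a ':discard' statement. A minimal sketch for #-style comments (rule names invented):

```
:discard     ~ hash_comment
hash_comment ~ '#' comment_text
comment_text ~ [^\n]*
```

The comment is then dropped by the lexer, and never reaches the G1 level.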

111 How do I parse numbers?

Chapter 2 of Peter Stuifzand's Marpa Guide discusses this issue at length. And if you fork that project on github, you'll find the directory examples/ contains about a dozen sample programs.

Alternately, download MarpaX::Languages::SVG::Parser, which contains similar logic (in data/*.bnf). Further, this latter distro includes sample code (float.pl and number.pl) in the scripts/ directory. The last-named script includes grammars for binary, octal and hex numbers which I did not need in the SVG parser.

112 How do I represent choice in the grammar?

Briefly, use '|'. Sample code from the above named module (data/d.bnf):

coordinate  ::= float      action => float
                | integer  action => integer

Also note that '||' is available to control the relative priorities of the alternatives.

This issue is expanded upon in the next 2 items in this FAQ.

113 Is there any other way of representing choice?

Sure. Consider these L0 rules (also from data/d.bnf):

sign_maybe  ~ [+-]
sign_maybe  ~

digit       ~ [0-9]
digit_any   ~ digit*
digit_many  ~ digit+

E           ~ [Ee] sign_maybe digit_many
E_maybe     ~ E
E_maybe     ~

:lexeme     ~ float
float       ~ sign_maybe digit_many E
              | sign_maybe digit_any '.' digit_many E_maybe
              | sign_maybe digit_many '.' E_maybe
              | sign_maybe non_zero digit_any

This is saying:

  • The sign is optional after E or e (sign_maybe)

  • The E is optional (E_maybe)

  • The sign is optional before a digit (the first alternative for 'float')

  • And so on

114 What's the difference between '|' and '||' in grammar definitions?

'|' expresses a simple alternative, while '||' separates alternatives at different levels of precedence. For example:

E ::=    E '*' E
      |  E '/' E
      || E '+' E
      |  E '-' E

This describes syntax for a simple 4-operation calculator, where multiplication and division take precedence over addition and subtraction.

This construct forces alternatives before the '||' to have higher precedence than the alternatives after that token.

Alternatives separated by '|' are all at the same precedence.

In Mark Dominus's 'Higher-Order Perl' book, he describes how he handles precedence in his recursive descent parsers for a similar calculator, beginning on p. 394. This is a good description of how the same situation would be handled with pre-Marpa techniques.

If precedence is not an issue for you, just use single bar alternation ('|').

Note also that Marpa supports a 'priority' adverb, discussed in the Marpa::R2::Scanless::DSL docs mentioned above.

115 What does it mean to hide (mask) tokens?

It's possible to design a grammar such that we can say certain tokens are hidden. This is discussed under RHS-alternatives in Marpa::R2::Scanless::DSL.

Here is an example (from MarpaX::Languages::Dash):

node_definition  ::= node_statement
                 | node_statement graph_definition

node_statement   ::= node_name
                 | node_name attribute_definition
                 | node_statement (',') node_statement

node_name        ::= start_node end_node

:lexeme          ~ start_node  pause => before  event => start_node
start_node       ~ '['

:lexeme          ~ end_node
end_node         ~ ']'

The comma is hidden. But what does this mean, exactly?

It means the comma may appear in the input stream (in this context), but I am not interested in it. So, Marpa will not return this token to me when an action is called. It has become invisible to the semantics.

Semantics are documented in Marpa::R2::Semantics.

Close study of the above grammar will show that another path is possible through the grammar, in the case of 2 nodes side-by-side (i.e. without a comma between them). Thus the grammar accepts both of these input streams as valid:

Stream 1: [node.a], [node.b]
Stream 2: [node.a]  [node.b]

Hence, from the end-user's viewpoint, the comma is defined - at least in this particular grammar - to be optional.

116 How do I implement recursion in a grammar?

See the next item.

117 Why does using '+' or '*' in a rule only work sometimes?

I assume you're referring to cases like this:

Case 1 (fails):

node_definition  ::= node_statement+
                 | node_statement graph_definition

Case 2 (succeeds):

coordinate_list  ::= coordinate+

Case 3 (succeeds):

whitespace       ~ [\s]+

Briefly, quantified rules (here, using the '+') are only allowed when the right-hand side consists of a single symbol, so alternation ('|') is not available. Thus case 1 fails due to the combination of '+' and '|'.

This is documented under 'Quantified rule' in Marpa::R2::Scanless::DSL.

So we must re-write rule 1 (with some context):

graph_definition  ::= node_definition
                  | edge_definition
                  | subgraph_definition

node_definition  ::= node_statement
                 | node_statement graph_definition

node_statement   ::= node_name
                 | node_name attribute_definition
                 | node_statement (',') node_statement

Now the rule 'node_definition ::= node_statement' allows a node definition to consist of a single node statement.

And the alternative (via the '|') 'node_definition ::= node_statement graph_definition' allows a node definition to consist of a node statement followed by a graph definition, which just happens to include the line 'graph_definition ::= node_definition'!

So we've managed to write the grammar using indirect recursion, thus allowing an indefinite list of node definitions to follow one another. And the last rule allows the input stream to separate them with commas as well as the usual spaces.

In other words: Why does the SLIF require quantifiers to be separate statements?

It's different from regular expressions, which everyone is accustomed to, and seems awkward.

The reason is that, with a parser, you expect to attach a semantics to the rules, while with regular expressions there is no such expectation.

So if Marpa allowed ' A ::= B+ (C | D) exp*', how would you specify how the arguments are passed to the semantics?

If they were passed just as a flat list, they would have to be re-parsed all over again.

This problem can be solved, but the easiest way around it is to write the rules out as separate statements, with the quantifiers by themselves; the semantics can then be attached to those statements.

118 Can a lexeme have length == 0?

No.

Marpa allows L0 rules that might be zero-length (for example, 'baZ ~ [Z]*') on the idea that it might be convenient to write them that way. But no zero-length lexeme will ever be accepted as a result of such rules.

If you think through the conceptual issues, you will see that nulled lexemes are 'fraught' with them. At any given point, which null lexemes should you recognize? And how many of them? A parser could guess what you really mean here, but to my (Jeffrey's) mind it is just as well to force you to be explicit on this issue.

Note: I (Ron) think Jeffrey uses pseudo in the name 'nulled pseudo-lexeme' because, being effectively of 0 length, they can't actually be lexemes.

For sample code which shows a way of handling such an issue see null.lexeme.demo.pl and null.lexeme.demo.1.pl.

119 How do I find out where I am within the input stream?

See the line_column() method in Marpa::R2::Scanless::R.

120 How do I find out which rule applies at some point in the input stream (e.g. during a pause)?

See under Action-context in Marpa::R2::Semantics, for access to internal variables in Marpa which do what you want.

121 What is the meaning of the line number in a Marpa error message?

It is the line number within the BNF of the rule being processed.

Note: If your grammar uses \r, that will affect the line count.

Further, see the line_column() method in Marpa::R2::Scanless::R for various special characters some of which may affect this line count.

122 Can I switch grammars during a parse?

No - You can't just change Marpa's grammar object during a parse (by calling Marpa::R2::Scanless::G -> new() with a new BNF), and expect everything else to keep working.

Nevertheless, there are techniques which can probably do what you want.

  • 1: Combine the grammars

  • 2: Use 'parse before' and switch grammars twice

  • 3: Use an action, change the grammar, and re-parse

In a bit more detail:

1: 'Combine the grammars': Perhaps you are convinced this can't be done, but you are urged to try it. Focus on the problem of defining the lexemes.

You may find your solution involves switching lexers rather than switching grammars.

2: 'Use parse before and switch grammars twice': This technique is used in GraphViz2::Marpa.

The idea is to pause at the start of a substring you want to parse using a different grammar. Then, parse just that substring using a different grammar object and a different recognizer object.

At the end of the substring, tell the first grammar to resume at the first character after the end of the substring.
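In outline, and with every variable name assumed, that dance looks like this:

```perl
# Sketch only. The outer recognizer paused (via 'pause => before')
# at the start of the substring, and $end is assumed to be the
# position just after the substring, which you must determine.
my($start, $length) = $outer_recce -> pause_span;
my $substring       = substr($input, $start, $end - $start);

# Parse just the substring with the second grammar, using its own
# recognizer object.
my $inner_recce = Marpa::R2::Scanless::R -> new({grammar => $inner_grammar});
$inner_recce -> read(\$substring);
my $inner_result = $inner_recce -> value;

# Tell the first grammar's recognizer to resume after the substring.
$outer_recce -> resume($end);
```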

This works in my (Ron's) code because the second grammar ends with a known, matching, delimiter. When Marpa encounters the next character, it generates the infamous error: 'Parsing complete but lexemes remain'. Luckily for me, this just means there are (and there must be) tokens in the input stream after the end of the substring which needed to be parsed using the second grammar. So it's just a matter of asking Marpa whether that particular error arose, processing the output from the second grammar, and returning to the parse which is using the first grammar.

In the code referenced, search for '$self -> grammar' and '$self -> grammar4html'.

Your problem will obviously be: What happens at the end of the substring? I.e. How do you know it's time to switch back to the previous grammar?

Also, this approach creates the problem of having the semantics defined in 2 different places. So combining them may add complexity to your code.

3: 'Use an action, change the grammar, and re-parse': This technique is used in MarpaX::Languages::SQL2003::AST.

See this IRC backlog entry for a discussion by the author, Jean-Damien Durand, for why he does things this way.

Briefly, the parse of a substring (call it 'a') requires knowledge of what follows that substring (call the second substring 'b'). So, parse it all ('a + b'), and when the action is triggered for 'b', use the information found to change the grammar used to parse 'a', and re-parse 'a'.

The effect (in Jean-Damien's case at least) is to use the result of a sub-grammar as a token value.

You may safely assume this latter method is a very advanced technique!

123 Where is Ruby Slippers parsing documented?

Start with Marpa::R2::Scanless::R's terminals_expected(). This document has several references to Ruby Slippers.

See also Jeffrey's articles: Ruby Slippers parsing, and the series beginning with parsing HTML part 1.

124 Where can I find code using Ruby Slippers?

See the script scripts/match.parentheses.02.pl, which ships with the module MarpaX::Demo::SampleScripts.

125 How do I associate an action class with my grammar?

This discussion assumes you're using Marpa::R2 V 3.0 or later. Prior to that, there were complications using the per-parse parameter to value(), and elsewhere.

See also Q126 Declaring an Action Class and Q136 How does Marpa pass parameters to action subs?.

References:

1: The per-parse argument to value()

2: The per-parse constructor

3: Calling value(...)

Here's how I declare a recognizer in one package:

use MarpaX::Languages::SVG::Parser::Actions;

# And later...

$self -> recce
(
    Marpa::R2::Scanless::R -> new
    ({
        grammar           => $self -> grammar,
        semantics_package => 'MarpaX::Languages::SVG::Parser::Actions',
    })
);

Now actions referenced in the grammar, as in:

coordinate  ::= float      action => float
                | integer  action => integer

must be declared as functions in the named semantics package, because that's where Marpa looks for them.

126 Declaring an Action Class

Firstly, See also Q136 on how Marpa passes parameters to action subs.

Now, the previous item in this FAQ discussed at length various issues regarding how to cleanly generate data within action subs, and how to pass that data back to the main-line code which called Marpa.

Here I explain the standard solution as used in a module, MarpaX::Languages::SVG::Parser, on CPAN. Briefly:

  • Each action sub returns a hashref to Marpa

This is because Marpa is the one calling the subs and gathering the results. And the hashrefs of course suit my code, whereas you may prefer to use a different data structure.

  • After a successful parse, the result of the parse is processed to recover those hashrefs

This result is returned by the value() method of the recognizer class. See value() in Marpa::R2::Scanless::R.

There is one basic problem to solve along the way: The data structure Marpa makes available as the result of the parse can be a deeply nested set of arrayrefs, depending on how deeply within the grammar the action sub is named.

So, we have to unravel the arrayrefs to recover the hashrefs.

Consider this fragment of the SVG grammar from MarpaX::Languages::SVG::Parser:

curve_to            ::= Cc curve_to_arguments  action => command

curve_to_arguments  ::= coordinate_triple
                        | coordinate_triple curve_to_arguments
...
coordinate_triple   ::= coordinate coordinate coordinate
...
coordinate          ::= float  action => float
...
Cc                  ~ [Cc]

Thus, given a curve - say the string 'C 10,11 20,21 30,31' - Marpa will call both float() and command().

(Elsewhere the grammar says commas are to be discarded).

What happens in practice? Firstly, the code:

sub command
{
    my($hashref, $t1, @t2) = @_;
    $param_count = 0;

    return
    {
        count => ++$item_count,
        name  => $t1,
        type  => 'command',
        value => [@t2],
    };

} # End of command.

# -----------------

sub float
{
    my($hashref, $t1) = @_;

    return
    {
        count => ++$item_count,
        name  => ++$param_count,
        type  => 'float',
        value => $t1,
    };

} # End of float.

Things to note:

  • We don't count the floats, to ensure there are 6, or 8, ...

This is because the input stream is deemed to have been validated as being SVG.

  • A global variable counts output hashrefs

  • A global variable counts parameters per command

  • float() is called for every number

This follows from the fact that the grammar says there are at least 3 coordinates (float pairs) per curve command.

This is not a problem, just a fact. We handle it with a Perl array, as explained next.

  • command() is called at the end of the string

Again, this is not a problem. On the contrary - it is an inescapable part of the solution!

But it does mean that we need to be aware of what Marpa does with the results of calling float(), i.e. where are they, when Marpa calls command().

It is simple. The results of the 6+ (in this grammar) calls to float(), 6+ hashrefs, are passed as the trailing parameters in the call to command(). That explains the @t2 parameter in the first line of command().

Likewise, we can say that since, in the grammar, nothing is subservient to float, no action subs can possibly be called in the processing of a float. So, when float() is called, it cannot have any such trailing parameters.

Where does this leave us? It means that the @t2 parameter to command() contains a set of 6+ floats packaged as hashrefs inside an unknown number of arrayrefs.

The same logic applies to the output of command() within the context of the output of parsing the whole input stream.

Our final task then will be to recover the result of the parse and unravel and discard the arrayrefs. This will leave us with a set of hashrefs, which was the point of the exercise. I should repeat: This structure suits my purpose. Adapt as required.

The remainder of the code:

my($result) = $self -> recce -> value;

die "Marpa's parse failed\n" if (! defined $result);

# Unravel whatever Marpa gives us.

for my $item (@{$self -> decode_result($$result)})
{
    # If it's a command, unravel the 'value => [@t2]' component.

    if ($$item{type} eq 'command')
    {
        $self -> new_item($$item{type}, $$item{name}, '-');

        for my $param (@{$self -> decode_result($$item{value})})
        {
            $self -> new_item($$param{type}, $$param{name}, $$param{value});
        }
    }
    else
    {
        $self -> new_item($$item{type}, $$item{name}, $$item{value});
    }
}

Lastly, the workhorse sub:

sub decode_result
{
    my($self, $result) = @_;
    my(@worklist) = $result;

    my($obj);
    my($ref_type);
    my(@stack);

    do
    {
        $obj      = shift @worklist;
        $ref_type = ref $obj;

        if ($ref_type eq 'ARRAY')
        {
            unshift @worklist, @$obj; # Unravel Marpa's handiwork.
        }
        elsif ($ref_type eq 'HASH')
        {
            push @stack, {%$obj}; # Build array of results.
        }
        else
        {
            die "Unsupported object type $ref_type\n" if ($ref_type);
        }
    } while (@worklist);

    return [@stack]; # Return an arrayref of hashrefs.

} # End of decode_result.

I must acknowledge the work done by the authors of Data::Find, Data::Recursive::Encode and MarpaX::Languages::C::AST. They all use similar mechanisms for unravelling data structures.

127 Where can I find code using events?

There are many Perl modules which use events.

Try MarpaX::Languages::Perl::PackUnpack or GraphViz2::Marpa.

128 Where can I find code which builds its grammar on-the-fly?

A simple case: Text::Balanced::Marpa.

A complex case: MarpaX::Languages::SQL2003::AST.

129 Other Resources

130 Papers on Parsing Theory

Aria's list.

131 Are there C/Lua/Python/etc bindings for libmarpa?

Various language bindings for libmarpa

132 What are the Marpa Papers?

They are a small book of Marpa-related matters, and are highly recommended reading for (a) Getting Started, (b) Marpa's Advantages, and (c) Advanced Usage.

133 What grammar do I use when eol ("\n", new-line) is significant?

There is no definitive answer to this, because different contexts will have different interpretations of how they need to handle eol.

It would be nice, however, to have some specific samples of grammar available.

134 When I use the case-insensitive suffix ':i' my program dies

This is probably because you're using actions on rules, and you changed a lexeme from, say, "and ~ 'and'" to "and ~ 'and':i". After you add the ':i', the parameter passed in to your action sub will have an extra arrayref around it, compared to what it was without the ':i'. Just dereference it with $t1 = $$t1[0].
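A pure-Perl illustration of that fix (the wrapped shape shown here is an assumption, based on the behaviour described above):

```perl
# Before ':i', the parameter passed to the action sub might be the
# plain string:
my $t1 = 'and';

# After adding ':i', the same value arrives with an extra arrayref
# around it:
$t1 = ['and'];

# Dereference it to recover the plain value:
$t1 = $$t1[0];

print "$t1\n";    # Prints 'and'.
```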

135 When do I use actions and when do I use events?

Actions are attached to rules, so when an action is fired, you know exactly which rule is being triggered by the 'current' text. And the data structure Marpa passes to your action sub contains the results of that rule's subordinate rules, down to the lexeme level.

With events attached to lexemes, you know the lexeme but not the rule. Of course you can interrogate Marpa about which rule is currently firing. For that, see Q120.

This choice is yet one more way in which Marpa differs so hugely from traditional parsers.

Also, see this article for an explanation of how Marpa passes parameters into action subs.

136 How does Marpa pass parameters to action subs?

See this stand-alone article: How Marpa passes arguments to your action subs.

See also Q125 How do I associate an action class with my grammar? and Q126 Declaring an Action Class, and Q134 When I use the case-insensitive suffix ':i' my program dies.

137 SPPF parsing

Marpa uses SPPF parsing, but not under that name. Jeffrey Kegler calls his version a 'parse bocage', which he invented independently of Elizabeth Scott.

For details of her work, see here.

138 Parse exhaustion

See the 'exhaustion' setting in Marpa::R2::Scanless::R.

139 What is the difference between prediction events and pre-lexeme events?

A prediction event will be triggered when its lexeme is predicted (i.e. expected) at some point in the input stream. A pre-lexeme event can only be triggered at the start of a lexeme which has been explicitly detected in the input stream.
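Sketches of the two declaration styles (symbol and event names invented):

```
# A prediction event, fired when a number becomes expected:
event 'number_expected' = predicted number

# A pre-lexeme event, fired when a number lexeme has actually been
# detected in the input:
:lexeme ~ number pause => before event => 'before_number'

number ~ [\d]+
```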

140 Precedence, associativity, priority and rank

This answer was added after a discussion on IRC, starting here.

Briefly, try tackling this issue in this order:

  • Re-write the grammar

  • Consider using Marpa's 'priority' adverb, discussed here, and its 'rank' adverb, discussed here.

  • Observe that there is another way of prioritizing rules.

  • Lastly, parse order can be influenced by rule ranking.

141 I want to write my own version of the Marpa/Earley/Leo algorithm in language X. What's the best way to proceed?

This is Jeffrey speaking: You'll probably want to work your way through many of my blog posts, especially this one: The Marpa Algorithm.

Experience suggests writing the parser in this order:

  • A basic Earley's parser.

  • The Aycock & Horspool grammar rewrite, to fix issues with nulls.

  • Joop Leo's improvement.

  • Elizabeth Scott's SPPF, or some other binarization of the output.

Marpa was not actually developed in this order -- it would have been much easier if it had been.

The Aycock & Horspool fix for null terminals: Null terminals and rules were a nasty and persistent problem with Earley's until A&H found the fix.

Joop Leo's improvement: This is hard but necessary if you want to avoid having your parser being regarded as second class. Leo improved Earley's algorithm to be linear for all deterministic context-free grammars, which is a big, big deal. Without Leo's improvement, for example, your parser won't really be able to handle true second-order languages -- languages which are the output of other programs, including languages produced as the output of other languages.

The Marpa ordering: The Marpa algorithm takes pains to have the parse engine complete all work in one Earley set before doing any work in the next one. This allows the parser to do better debugging; to use advanced methods like the Ruby Slippers; to be used on an event-driven basis; and to be paired with custom procedural logic.

Earley hacks you may want to skip: Many of the other improvements IMHO you will want to skip, including the LR(0) states in the Aycock & Horspool paper. In general, I suggest you be skeptical of those improvements which do not produce better results in big-O terms, but instead merely reduce the number of Earley items by a constant amount. These do not seem to produce measurable speed-ups at parsing time, but they must be undone at evaluation time. They will make it very hard to add an event mechanism to your parser, and will make debugging and tracing more complex.

142 I used a precedenced statement for my expression, so why did my parse come out ambiguous anyway?

There is a section in the Marpa::R2 documents that deals specifically with this.

But that section contains no examples. So here is one. Let your input be:

1 2 + 3 4

Consider this precedenced statement:

E ::= Number || E E || E '+' E

The parse is unambiguous. Here is the only parse:

((1,2),+,(3,4))

But suppose we change the grammar to add unary plus, so that the precedenced statement is now:

E ::= Number || '+' E || E E || E '+' E

Now the parse is ambiguous -- these are the two parses:

(((1,2),(+,3)),4) and ((1,2),+,(3,4))

Notice that both of these parses obey precedence. Precedence can be seen as removing ambiguity, but it does not remove all ambiguity. It only eliminates those parses which do not obey the precedence.

143 How do I contribute to this FAQ?

The FAQ is on github. Just raise an issue. If you're not familiar with github, just click on '(!) Issues 0'. The 0 is the # of open issues, so it may be > 0.

144 How do I use external parsing?

External parsing is where you write your own code to do some parsing, while getting Marpa to handle various parts of the work.

One advantage of external parsing is that you can employ the full power of Perl's regexp capabilities, thereby allowing you to do more than is built into Marpa's SLIF (Scanless Interface, aka BNF).

Note that you can switch back and forth between letting Marpa do the parsing and doing some of it yourself.

So why not do all of it one way or t'other? The obstacle is that Perl regexes are deterministic, while Marpa is non-deterministic.

See 'External scanning' for a discussion in the Marpa docs.

Jeffrey has blogged about it here and here.

rns has a gist demonstrating such code. Note specifically the 'for' loop starting at line 59. Your code jumps in and out of Marpa based on the event declared on line 36.

In brief, when Marpa detects input matching whatever you have attached an event to, it (Marpa) exits and your code - within the loop - takes over. Normally the first step is to ask Marpa which events triggered the exit. And yes, that was 'events' plural since several events can be triggered simultaneously at the same place in the input stream.
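The shape of such a loop, in outline (variable names assumed; not runnable as-is):

```perl
# Sketch only: leave Marpa at each event, handle it, then resume.
for
(
    my $pos = $recce -> read(\$input);
    $pos < length $input;
    $pos = $recce -> resume($pos)
)
{
    # Several events can fire at the same place in the input stream.
    for my $event (@{$recce -> events})
    {
        my($name) = @$event;

        # Your external parsing goes here: e.g. run a Perl regex
        # against substr($input, $pos), then feed the token back
        # with $recce -> lexeme_read(...).
    }
}
```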

Ron Savage uses this technique in a number of modules, such as GraphViz2::Marpa, Text::Balanced::Marpa and Text::Delimited::Marpa.

See also the next question.

145 How do I distinguish between identifiers and keywords?

A good way is to use lexeme priorities, perhaps by combining them with external parsing.

See the previous question for sample code.

Jean-Damien Durand has written several full language parsers, including one for ECMAScript.

More sample code is in this article on Stackoverflow.

146 How do I write the BNF to specify exceptions to a pattern, or the complement of a pattern?

Currently there is no way to specify that directly in Marpa::R2, because context free grammars (CFGs) are not closed under complement. That is: If L1 and L2 are both BNF grammars, it is not necessarily possible to write L1 - L2 in BNF.

Nevertheless, Jean-Damien Durand has done something similar with his C implementation of a wrapper around Marpa, which uses his version of BNF.

This approach consists of building an ASF - Abstract Syntax Forest of parse results and then processing the trees in that forest.

147 How do I write the BNF when the first token is optional?

Note: GraphViz2::Marpa requires this technique, since it's working from this BNF.

Make sure you use enough tokens in your BNF. Do not try to compress the BNF when dealing with this issue.

Sample code.

148 Is there a tabular view of the DSL?

Yes: A Tabular View of Marpa::R2's DSL.

149 The Libmarpa API and missing valuator steps

When working with libmarpa, if you notice things like missing instructions of types MARPA_STEP_RULE and MARPA_STEP_NULLING_SYMBOL, make sure you are calling marpa_g_force_valued() immediately upon grammar creation.

For details see this Marpa Google Group post.