The Marpa FAQ

Version 2.04. 2024-01-21T17:41:34.

On-line docs

The Marpa FAQ Intro

The Marpa Papers

Github repos

The Marpa FAQ repo

The Marpa Papers repo

Table of Contents grouped by Topic

A: About Marpa

B: Error Messages

C: Problems

D: Tricks and techniques

E: Details on how Marpa works

F: Links and resources

Answers grouped by Topic

A: About Marpa

1 What is Libmarpa?

Libmarpa is the core Marpa library, written in C, that implements the Marpa parse engine - the Marpa algorithm itself. It is low-level, but it is documented and can be installed and used separately.

Homepage.

Index of Terms.

Formal docs.

2 Marpa parses ambiguous grammars. So what?

If you want to avoid ambiguity in your grammar, Marpa is the best tool for it. Marpa can tell you exactly where the ambiguity occurs in an ambiguous parse. (Marpa does this by taking two of the ambiguous parses and comparing them. This is a capability that parsers do not have unless they can handle ambiguous grammars.)

Marpa also allows you to prioritize dealing with the ambiguity. You can take advantage of Marpa's ability to handle ambiguous grammars by working on other aspects of your project, and deal with the ambiguity later.

Almost all of the parsers in practical use that refuse to allow ambiguous grammars also refuse to parse a lot of very useful unambiguous grammars. They kill all your enemies, but they also kill a lot of friends.

On the other hand, you may not want to reject ambiguity so quickly. Humans, given the choice, prefer to communicate in ambiguous languages. For example, English, the language you are reading now, is highly ambiguous. Ambiguity can be used as a feature, instead of being rejected as a bug. For more on this, see Jeffrey Kegler's blog post on the topic.

B: Error Messages

3 Error: Parse failed. value() called when recognizer is not in tree mode

The culprit is Marpa::R2::Scanless::R's ambiguous() method.

Forest mode is something most users don't care about. It is used for Marpa's Abstract Syntax Forests. See Marpa::R2::ASF if you want to know more about those. The usual mode is 'tree mode', and most of Marpa's calls, such as value(), require you to be in 'tree mode'.

ambiguous() uses forest mode to print a detailed description of the ambiguity. (By the way, try getting that out of another parser.) And ambiguous() leaves you in forest mode, which means you cannot do much of the usual processing, such as calling value(). So how do you get around this problem?

There are two ways, depending on what you are trying to do. Most often, you regard an ambiguity in your parse as an error, so you want to print a helpful message and abend if the parse is ambiguous, and go on to do normal stuff if it is not. What you should do in this case is call ambiguity_metric() first. ambiguity_metric() returns 1 for an unambiguous parse and 2+ for an ambiguous parse. Errors return something <= 0. And ambiguity_metric() does not put you into forest mode. See also Q 6.

Less often, you want to both print out a message describing the parse's ambiguity, and then go on to deal with it in 'tree mode'. In this case, you can call the series_restart() method. series_restart() puts you back into 'tree mode'.
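Here is a minimal sketch combining both methods, assuming $recce holds your Marpa::R2::Scanless::R object:

    my($metric) = $recce -> ambiguity_metric;

    die "Parse failed\n" if ($metric <= 0);

    if ($metric > 1)
    {
        # Describe the ambiguity. This call leaves the recognizer in forest mode.

        print $recce -> ambiguous;

        # Return to tree mode so that value() etc work again.

        $recce -> series_restart;
    }

    my($value) = $recce -> value;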

And here is some code copied directly from V 2 of Genealogy::Gedcom::Date. The method decode_result() is listed below, and Dumper() is from Data::Dumper::Concise. Of course the calling code checks that the return from ambiguity_metric() is > 1:

    my($count) = 0;

    while (my $value = $self -> recce -> value)
    {
        $value = $self -> decode_result($$value); # Returns an arrayref, hence the 'for' below.

        $self -> log(debug => "Solution @{[++$count]}: \n" . Dumper($_) ) for @$value;
    }

For more on ambiguity_metric(), see Q 6.

But remember, to handle ambiguity, you don't need to use this code. You can take a completely different route with Marpa::R2::ASF.

And there are still more choices: Leave the ambiguity as an error, re-work the grammar, or perhaps switch to events attached to lexemes.

In the latter case, if 2 or more events are fired by the same lexeme you can write code which chooses 1 of those events, and you then tell Marpa to continue with that choice. For code using this technique, see GraphViz2::Marpa, Text::Balanced::Marpa or Text::Delimited::Marpa.

4 Error: A lexeme in G1 is not a lexeme in any of the lexers

Marpa automatically determines which symbols are 'lexemes'. To do this, it checks for all the symbols that the G1 grammar considers to be lexemes, and all the symbols that the L0 grammar considers to be lexemes. It requires these two lists to match.

Lexemes in G1 appear in rules defined using '::=', and lexemes in L0 appear in rules defined using '~'.

A symbol is a G1 lexeme if it appears on the RHS of a G1 rule, but never appears on the LHS of any G1 rule. A symbol is an L0 lexeme if it appears on the LHS of an L0 rule, but never appears on the RHS of any L0 rule.
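For instance, this fragment (symbol names invented) triggers the error, because digits is a G1 lexeme - it appears only on the RHS of a G1 rule - but L0 never defines it:

number  ::= digits

Adding an L0 rule for the symbol makes it an L0 lexeme as well, and the error goes away:

digits  ~ [0-9]+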

See also Q 8.

5 Error: A lexeme in lexer L0 is not a lexeme in G1

See Q 4.

6 Error message: Marpa exited at (line, column) = ($line, $column) within the input string

In the module Genealogy::Gedcom::Date, this text is part of the error message when a call to Marpa ambiguity_metric() returned a value <= 0.

Consider the possibility that the parse ends without a successful parse, but the input is the prefix of some input that can lead to a successful parse. For example, the input '2 +' fails on its own, but it is a prefix of '2 + 2', which might parse successfully.

Marpa is not reporting a problem during the read(), because you can add more to the input string, and Marpa does not know that you do not plan to do this.

7 Error in SLIF parse: No lexeme found at line 1, column 5

The full error text will look like:

Error in SLIF parse: No lexeme found at line 1, column 5
* String before error: (?^:
* The error was at line 1, column 5, and at character 0x0029 ')', ...
* here: )

The part which says 'No lexeme found' means no lexeme according to your grammar. The actual character found is listed here as ')', but your BNF does not define that character as being a lexeme in that context.

8 Error: An L0 lexeme cannot appear on the RHS of an L0 rule

Consider these rules, which define lexemes in L0 (they are in L0 because ~ is used):

X ~ Y
Z ~ X
Y ~ 'y'

The usage of X in Z ~ X will trigger the error.

This means a lexeme in L0 (as X is defined by the tilde in the rule X ~ Y) cannot be used to define another lexeme (Z as in Z ~ X). The first occurrence of X is on the LHS of the first rule and the second occurrence of X is on the RHS of the second rule. You can't do that in Marpa::R2.

So the error message can be read as saying (and should say):

'An L0 lexeme (X in the rule X ~ Y) cannot appear on the RHS of another L0 rule (Z ~ X)'.

See also Q 4.

C: Problems

9 Why doesn't my ambiguous grammar return more than 1 parse tree?

Probably because your grammar includes

lexeme default = latm => 1 # Activate the Longest Acceptable Token Match option.

Try commenting that line out after, perhaps, re-reading the previous question's answer.

See also Q 14 and Q 47.

10 Why does using '+' or '*' in a rule only work sometimes?

I assume you're referring to cases like this:

Case 1 (fails):

node_definition  ::= node_statement+
                 | node_statement graph_definition

Case 2 (succeeds):

coordinate_list  ::= coordinate+

Case 3 (succeeds):

whitespace       ~ [\s]+

Briefly, quantified rules (here, using the '+') are only allowed when the right-hand side contains a single alternative. Thus case 1 fails due to the combination of '+' and '|'.

This is documented under 'Quantified rule' in Marpa::R2::Scanless::DSL.

So we must re-write rule 1 (with some context):

graph_definition  ::= node_definition
                  | edge_definition
                  | subgraph_definition

node_definition  ::= node_statement
                 | node_statement graph_definition

node_statement   ::= node_name
                 | node_name attribute_definition
                 | node_statement (',') node_statement

Now the rule 'node_definition ::= node_statement' allows a node definition to consist of a single node statement.

And the alternative (via the '|') 'node_definition ::= node_statement graph_definition' allows a node definition to consist of a node statement followed by a graph definition, which just happens to include the line 'graph_definition ::= node_definition'!

So we've managed to write the grammar using indirect recursion, thus allowing an indefinite list of node definitions to follow one another. And the last rule allows the input stream to separate them with commas as well as the usual spaces.

In other words: Why does the SLIF require quantifiers to be separate statements?

It's different from regular expressions, which everyone is accustomed to, and seems awkward.

The reason is that, with a parser, you expect to attach a semantics to all the rules - with regular expressions there is no such expectation.

So if Marpa allowed ' A ::= B+ (C | D) exp*', how would you specify how the arguments are passed to the semantics?

If they were passed just as a flat list, they would have to be re-parsed all over again.

This problem can be solved, but the easiest way around it is to write it out as separate statements, with the quantifiers by themselves, and the semantics can then be attached to the statements.
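For example, the hypothetical combined rule above can be split up so that each quantifier has its own statement, and an action can then be attached to each statement (rule and action names invented):

A         ::= B_list C_or_D exp_list    action => do_A
B_list    ::= B+                        action => do_B_list
C_or_D    ::= C | D                     action => do_C_or_D
exp_list  ::= exp*                      action => do_exp_list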

11 Can a lexeme have length == 0?

No.

Marpa allows L0 rules that might be zero-length (for example, 'baZ ~ [Z]*') on the idea that it might be convenient to write them that way. But no zero-length lexeme will ever be accepted as a result of such rules.

If you think through the conceptual issues, you might see that nulled lexemes are 'fraught' with them. At any given point, which null lexemes should you recognize? And how many of them? A parser could guess what you really mean here, but to my (Jeffrey's) mind it is just as well to force you to be explicit on this issue.

Note: I (Ron) think Jeffrey uses pseudo in the name 'nulled pseudo-lexeme' because, being effectively of 0 length, they can't actually be lexemes.

For sample code which shows a way of handling such an issue see null.lexeme.demo.pl and null.lexeme.demo.1.pl.

12 I used a precedenced statement for my expression, so why did my parse come out ambiguous anyway?

There is a section in the Marpa::R2 documents that deals specifically with this.

But that section contains no examples. So here is one. Let your input be:

1 2 + 3 4

Consider this precedenced statement:

E ::= Number || E E || E '+' E

The parse is unambiguous. Here is the only parse:

((1,2),+,(3,4))

But suppose we change the grammar to add unary plus, so that the precedenced statement is now:

E ::= Number || '+' E || E E || E '+' E

Now the parse is ambiguous -- these are the two parses:

(((1,2),(+,3)),4) and ((1,2),+,(3,4))

Notice that both of these parses obey precedence. Precedence can be seen as removing ambiguity, but it does not remove all ambiguity. It only eliminates those parses which do not obey the precedence.

13 The Libmarpa API and missing valuator steps

When working with libmarpa, if you notice that valuator steps of types MARPA_STEP_RULE and MARPA_STEP_NULLING_SYMBOL are missing, make sure you are calling marpa_g_force_valued() immediately upon grammar creation.

For details see this Marpa Google Group post. That group is now closed, so instead see:

Marpa's homepage.

That page contains a link to an archive of the group. The archive is called Marpa.Google.Group.2023.12.06.zip.

See also Q 50 for another reference to this archive.

D: Tricks and techniques

14 How do I implement disambiguation logic?

See under Bailing-out-of-parse-evaluation in Marpa::R2::Semantics, for how to short-circuit semantic analysis.

See also this gist.

See also Q 9 and Q 47.

15 How do I parse comments?

GraphViz2::Marpa implements comment skipping for C and C++ comments, and for #-prefix style comments. Search the source of Marpa.pm for 'C style comment', 'Cplusplus style comment' and 'hash style comment'.
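As a minimal sketch, #-prefix comments can also be skipped directly in the grammar via a :discard rule (rule names invented here):

:discard           ~ hash_comment
hash_comment       ~ '#' hash_comment_body
hash_comment_body  ~ [^\n]*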

16 How do I parse numbers?

Chapter 2 of Peter Stuifzand's Marpa Guide discusses this issue at length. And if you fork that project on github, you'll find the directory examples/ contains about a dozen sample programs.

Alternatively, download MarpaX::Languages::SVG::Parser, which contains similar logic (in data/*.bnf). Further, this latter distro includes sample code (float.pl and number.pl) in the scripts/ directory. The last-named script includes grammars for binary, octal and hex numbers which I did not need in the SVG parser.

17 How do I represent choice in the grammar?

Briefly, use '|'. Sample code from the above named module (data/d.bnf):

coordinate  ::= float      action => float
                | integer  action => integer

Also note that '||' is available to control the relative priorities of the alternatives.

See also Q 18 and Q 19.

18 Is there any other way of representing choice?

Sure. Consider these L0 rules (also from data/d.bnf):

sign_maybe  ~ [+-]
sign_maybe  ~

digit       ~ [0-9]
digit_any   ~ digit*
digit_many  ~ digit+

E           ~ [Ee] sign_maybe digit_many
E_maybe     ~ E
E_maybe     ~

:lexeme     ~ float
float       ~ sign_maybe digit_many E
              | sign_maybe digit_any '.' digit_many E_maybe
              | sign_maybe digit_many '.' E_maybe
              | sign_maybe non_zero digit_any

This is saying:

  • The sign is optional after E or e (sign_maybe)

  • The E is optional (E_maybe)

  • The sign is optional before a digit (the first alternative for 'float')

  • And so on

See also Q 17 and Q 19.

19 What's the difference between '|' and '||' in grammar definitions?

'|' expresses a simple alternative, while '||' separates alternatives at different levels of precedence. For example:

E ::=    E '*' E
      |  E '/' E
      || E '+' E
      |  E '-' E

This describes syntax for a simple 4-operation calculator, where multiplication and division take precedence over addition and subtraction.

This construct forces alternatives before the '||' to have higher precedence than the alternatives after that token.

Each of the alternatives separated by '|' are at the same precedence.

In Mark Dominus's 'Higher-Order Perl' book, he describes how he handles precedence in his recursive descent parsers for a similar calculator, beginning on p. 394. This is a good description of how the same situation would be handled with pre-Marpa techniques.

If precedence is not an issue for you, just use single bar alternation ('|').

Note also that Marpa supports a 'priority' adverb, discussed in the Marpa::R2::Scanless::DSL docs mentioned above.

See also Q 17 and Q 18.

20 What does it mean to hide (mask) tokens?

It's possible to design a grammar such that we can say certain tokens are hidden. This is discussed under RHS-alternatives in Marpa::R2::Scanless::DSL.

Here is an example (from MarpaX::Languages::Dash):

node_definition  ::= node_statement
                 | node_statement graph_definition

node_statement   ::= node_name
                 | node_name attribute_definition
                 | node_statement (',') node_statement

node_name        ::= start_node end_node

:lexeme          ~ start_node  pause => before  event => start_node
start_node       ~ '['

:lexeme          ~ end_node
end_node         ~ ']'

The comma is hidden. But what does this mean, exactly?

It means the comma may appear in the input stream (in this context), but I am not interested in it. So, Marpa will not return this token to me when an action is called. It has become invisible to the semantics.

Semantics are documented in Marpa::R2::Semantics.
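To see the effect in the semantics, imagine attaching a hypothetical action to the comma alternative above. The hidden comma never appears among the action's arguments:

# Hypothetical: node_statement ::= node_statement (',') node_statement  action => node_pair

sub node_pair
{
    my($per_parse, $left, $right) = @_;

    # Only the two node_statement values arrive here. The ',' was masked.

    return [$left, $right];
}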

Close study of the above grammar will show that another path is possible through the grammar, in the case of 2 nodes side-by-side (i.e. without a comma between them). Thus the grammar accepts both these input streams as valid:

Stream 1: [node.a], [node.b]
Stream 2: [node.a]  [node.b]

Hence, from the end-user's viewpoint, the comma is defined - at least in this particular grammar - to be optional.

21 How do I implement recursion in a grammar?

See Q 10.

22 How do I find out where I am within the input stream?

See the line_column() method in Marpa::R2::Scanless::R.
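A minimal sketch, assuming $recce holds your Marpa::R2::Scanless::R object:

# Line and column of the current location in the input stream.

my($line, $column) = $recce -> line_column;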

23 How do I find out which rule applies at some point in the input stream (e.g. during a pause)?

See under Action-context in Marpa::R2::Semantics, for access to internal variables in Marpa which do what you want.
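For example, inside an action sub Marpa::R2 localizes context variables identifying the rule being evaluated. A minimal sketch:

sub my_action
{
    my($per_parse, @children) = @_;

    # Documented under 'Action context' in Marpa::R2::Semantics.

    my($rule_id) = $Marpa::R2::Context::rule;    # ID of the rule being evaluated.
    my($grammar) = $Marpa::R2::Context::grammar; # The grammar being used.

    return [@children];
}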

24 What is the meaning of the line number in a Marpa error message?

It is the line number within the BNF of the rule being processed.

Note: If your grammar uses \r, that will affect the line count.

Further, see the line_column() method in Marpa::R2::Scanless::R for various special characters some of which may affect this line count.

25 Can I switch grammars during a parse?

No - You can't just change Marpa's grammar object during a parse (by calling Marpa::R2::Scanless::G -> new() with a new BNF), and expect everything else to keep working.

Nevertheless, there are techniques which can probably do what you want.

  • 1: Combine the grammars

  • 2: Use 'parse before' and switch grammars twice

  • 3: Use an action, change the grammar, and re-parse

In a bit more detail:

1: 'Combine the grammars': Perhaps you are convinced this can't be done, but you are urged to try it. Focus on the problem of defining the lexemes.

You may find your solution involves switching lexers rather than switching grammars.

2: 'Use parse before and switch grammars twice': This technique is used in GraphViz2::Marpa.

The idea is to pause at the start of a substring you want to parse using a different grammar. Then, parse just that substring using a different grammar object and a different recognizer object.

At the end of the substring, tell the first grammar to resume at the first character after the end of the substring.

This works in my (Ron's) code because the second grammar ends with a known, matching, delimiter. When Marpa encounters the next character, it generates the infamous error: 'Parsing complete but lexemes remain'. Luckily for me, this just means there are (and must be) tokens in the input stream after the end of the substring which needed to be parsed using the second grammar. So it's just a matter of asking Marpa if that particular error arose, processing the output from the second grammar, and returning to the parse which uses the first grammar.

In the code referenced, search for '$self -> grammar' and '$self -> grammar4html'.

Your problem will obviously be: What happens at the end of the substring? I.e. How do you know it's time to switch back to the previous grammar?

Also, this approach creates the problem of having the semantics defined in 2 different places. So combining them may add complexity to your code.
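Here is a bare-bones sketch of technique 2. All variable names are hypothetical, and error handling (including the 'lexemes remain' case above) is omitted:

# Pause (via an event declared with 'pause => before') at the start of the substring.

my($pos) = $outer_recce -> read(\$input);

# Parse just the substring, using a second grammar and a second recognizer.

my($inner_recce) = Marpa::R2::Scanless::R -> new({grammar => $inner_grammar});

$inner_recce -> read(\$input, $pos);

my($inner_result) = $inner_recce -> value;

# Resume the outer parse at the first character after the substring.

$outer_recce -> resume($inner_recce -> pos);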

3: 'Use an action, change the grammar, and re-parse': This technique is used in MarpaX::Languages::SQL2003::AST.

See this IRC backlog entry for a discussion by the author, Jean-Damien Durand, for why he does things this way.

Briefly, the parse of a substring (call it 'a') requires knowledge of what follows that substring (call the second substring 'b'). So, parse it all ('a + b'), and when the action is triggered for 'b', use the information found to change the grammar used to parse 'a', and re-parse 'a'.

The effect (in Jean-Damien's case at least) is to use the result of a sub-grammar as a token value.

You may safely assume this latter method is a very advanced technique!

26 Where is Ruby Slippers parsing documented?

Start with Marpa::R2::Scanless::R's terminals_expected(). This document has several references to Ruby Slippers.

See also Jeffrey's articles: Ruby Slippers parsing, and the series beginning with parsing HTML part 1.

27 Where can I find code using Ruby Slippers?

See the script scripts/match.parentheses.02.pl, which ships with the module MarpaX::Demo::SampleScripts.

28 How do I associate an action class with my grammar?

This discussion assumes you're using Marpa::R2 V 3.0 or later. Prior to that, there were complications using the per-parse parameter to value(), and elsewhere.

See also Q 29 and Q 35.

References:

1: The per-parse argument to value()

2: The per-parse constructor

3: Calling value(...)

Here's how I declare a recognizer in one package:

use MarpaX::Languages::SVG::Parser::Actions;

# And later...

$self -> recce
(
    Marpa::R2::Scanless::R -> new
    ({
        grammar           => $self -> grammar,
        semantics_package => 'MarpaX::Languages::SVG::Parser::Actions',
    })
);

Now actions referenced in the grammar, as in:

coordinate  ::= float      action => float
                | integer  action => integer

must be declared as functions in the named semantics package, because that's where Marpa looks for them.

29 Declaring an Action Class

Firstly, see also Q 35.

Now, the previous item in this FAQ discussed at length various issues regarding how to cleanly generate data within action subs, and how to pass that data back to the main-line code which called Marpa.

Here I explain the standard solution as used in a module, MarpaX::Languages::SVG::Parser, on CPAN. Briefly:

  • Each action sub returns a hashref to Marpa

This is because Marpa is the one calling the subs and gathering the results. And the hashrefs of course suit my code, whereas you may prefer to use a different data structure.

  • After a successful parse, the result of the parse is processed to recover those hashrefs

This result is returned by the value() method of the recognizer class. See value() in Marpa::R2::Scanless::R.

There is one basic problem to solve along the way: The data structure Marpa makes available as the result of the parse can be a deeply nested set of arrayrefs, depending on how deeply within the grammar the action sub is named.

So, we have to unravel the arrayrefs to recover the hashrefs.

Consider this fragment of the SVG grammar from MarpaX::Languages::SVG::Parser:

curve_to            ::= Cc curve_to_arguments  action => command

curve_to_arguments  ::= coordinate_triple
                        | coordinate_triple curve_to_arguments
...
coordinate_triple   ::= coordinate coordinate coordinate
...
coordinate          ::= float  action => float
...
Cc                  ~ [Cc]

Thus, given a curve - say the string 'C 10,11 20,21 30,31' - Marpa will call both float() and command().

(Elsewhere the grammar says commas are to be discarded).

What happens in practice? Firstly, the code:

sub command
{
    my($hashref, $t1, @t2) = @_;
    $param_count = 0;

    return
    {
        count => ++$item_count,
        name  => $t1,
        type  => 'command',
        value => [@t2],
    };

} # End of command.

# -----------------

sub float
{
    my($hashref, $t1) = @_;

    return
    {
        count => ++$item_count,
        name  => ++$param_count,
        type  => 'float',
        value => $t1,
    };

} # End of float.

Things to note:

  • We don't count the floats, to ensure there are 6, or 8, ...

This is because the input stream is deemed to have been validated as being SVG.

  • A global variable counts output hashrefs

  • A global variable counts parameters per command

  • float() is called for every number

This includes the fact that the grammar says there are at least 3 coordinates (float pairs) per curve command.

This is not a problem, just a fact. We handle it with a Perl array, as explained next.

  • command() is called at the end of the string

Again, this is not a problem. On the contrary - it is an inescapable part of the solution!

But it does mean that we need to be aware of what Marpa does with the results of calling float(), i.e. where are they, when Marpa calls command().

It is simple. The results of the 6+ (in this grammar) calls to float(), 6+ hashrefs, are passed as the trailing parameters in the call to command(). That explains the @t2 parameter in the first line of command().

Likewise, we can say that since, in the grammar, nothing is subservient to float, no action subs can possibly be called in the processing of a float. So, when float() is called, it cannot have any such trailing parameters.

Where does this leave us? It means that the @t2 parameter to command() contains a set of 6+ floats packaged as hashrefs inside an unknown number of arrayrefs.

The same logic applies to the output of command() within the context of the output of parsing the whole input stream.

Our final task then will be to recover the result of the parse and unravel and discard the arrayrefs. This will leave us with a set of hashrefs, which was the point of the exercise. I should repeat: This structure suits my purpose. Adapt as required.

The remainder of the code:

my($result) = $self -> recce -> value;

die "Marpa's parse failed\n" if (! defined $result);

# Unravel whatever Marpa gives us.

for my $item (@{$self -> decode_result($$result)})
{
    # If it's a command, unravel the 'value => [@t2]' component.

    if ($$item{type} eq 'command')
    {
        $self -> new_item($$item{type}, $$item{name}, '-');

        for my $param (@{$self -> decode_result($$item{value})})
        {
            $self -> new_item($$param{type}, $$param{name}, $$param{value});
        }
    }
    else
    {
        $self -> new_item($$item{type}, $$item{name}, $$item{value});
    }
}

Lastly, the workhorse sub:

sub decode_result
{
    my($self, $result) = @_;
    my(@worklist) = $result;

    my($obj);
    my($ref_type);
    my(@stack);

    do
    {
        $obj      = shift @worklist;
        $ref_type = ref $obj;

        if ($ref_type eq 'ARRAY')
        {
            unshift @worklist, @$obj; # Unravel Marpa's handiwork.
        }
        elsif ($ref_type eq 'HASH')
        {
            push @stack, {%$obj}; # Build array of results.
        }
        else
        {
            die "Unsupported object type $ref_type\n" if ($ref_type);
        }
    } while (@worklist);

    return [@stack]; # Return an arrayref of hashrefs.

} # End of decode_result.

I must acknowledge the work done by the authors of Data::Find, Data::Recursive::Encode and MarpaX::Languages::C::AST. They all use similar mechanisms for unravelling data structures.

30 Where can I find code using events?

There are many Perl modules which use events.

Try MarpaX::Languages::Perl::PackUnpack or GraphViz2::Marpa.

31 Where can I find code which builds its grammar on-the-fly?

A simple case: Text::Balanced::Marpa.

A complex case: MarpaX::Languages::SQL2003::AST.

32 What grammar do I use when eol ("\n", new-line) is significant?

There is no definitive answer to this, because different contexts will have different interpretations of how they need to handle eol.

It would be nice, however, to have some specific samples of grammar available.

33 When I use the case-insensitive suffix ':i' my program dies

This is probably because you're using actions on rules, and you changed a lexeme from, say, "and ~ 'and'" to "and ~ 'and':i". After you add the ':i', the parameter passed in to your action function will have an extra arrayref around it, compared to what it was without the ':i'. Just dereference it with $t1 = $$t1[0].
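A minimal sketch of the fix, inside a hypothetical action sub:

sub my_action
{
    my($per_parse, $t1) = @_;

    # With the ':i' suffix, $t1 arrives wrapped in an arrayref, e.g. ['and'].

    $t1 = $$t1[0] if (ref($t1) eq 'ARRAY');

    return $t1;
}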

34 When do I use actions and when do I use events?

Actions are attached to rules, so when an action is fired, you know exactly which rule is being triggered by the 'current' text. And the data structure Marpa passes to your action sub contains the results of that rule's subordinate rules, down to the lexeme level.

With events attached to lexemes, you know the lexeme but not the rule. Of course you can interrogate Marpa about which rule is currently firing. For that, see Q 23.

This choice is yet one more way in which Marpa differs so hugely from traditional parsers.

Also, see this article for an explanation of how Marpa passes parameters into action subs.

35 How does Marpa pass parameters to action subs?

See this stand-alone article: How Marpa passes arguments to your action subs.

See also Q 28 How do I associate an action class with my grammar?, Q 29 Declaring an Action Class, and Q 33 When I use the case-insensitive suffix ':i' my program dies.

36 How do I use external parsing?

External parsing is where you write your own code to do some parsing, while getting Marpa to handle various parts of the work.

One advantage of external parsing is that you can employ the full power of Perl's regexp capabilities, thereby allowing you to do more than is built into Marpa's SLIF (Scanless Interface aka BNF).

Note that you can switch back and forth between letting Marpa do the parsing and doing some of it yourself.

So why not do all of it one way or t'other? The obstacle is that Perl regexes are deterministic, while Marpa is non-deterministic.

See 'External scanning' for a discussion in the Marpa docs.

Jeffrey has blogged about it here and here.

rns has a gist demonstrating such code. Note specifically the 'for' loop starting at line 59. Your code jumps in and out of Marpa based on the event declared on line 36.

In brief, when Marpa detects input matching whatever you have attached an event to, it (Marpa) exits and your code - within the loop - takes over. Normally the first step is to ask Marpa which events triggered the exit. And yes, that was 'events' plural since several events can be triggered simultaneously at the same place in the input stream.
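Here is a minimal sketch of that loop, closely following the pause/resume pattern in the Marpa::R2::Scanless::R docs. The lexeme name 'word' is invented:

my($length) = length $input;

for
(
    my $pos = $recce -> read(\$input);
    $pos < $length;
    $pos = $recce -> resume
)
{
    # Ask Marpa which events fired. There may be several at the same location.

    my(@event_names) = map {$$_[0]} @{$recce -> events};

    # Your own code takes over here: find the paused lexeme and feed it back to Marpa.

    my($start, $span) = $recce -> pause_span;
    my($lexeme)       = $recce -> literal($start, $span);

    $recce -> lexeme_read('word', $start, $span, $lexeme) // die "lexeme_read() rejected '$lexeme'\n";
}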

Ron Savage uses this technique in a number of modules.

See also Q 37.

37 How do I distinguish between identifiers and keywords?

A good way is to use lexeme priorities, perhaps by combining them with external parsing.
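A minimal sketch of the lexeme priority part (lexeme names invented; note the same-length caveat in Q 44):

# Both lexemes can match 'if'. At equal length, the higher priority wins,
# so 'if' becomes a keyword while 'iffy' stays an identifier (it is longer).

:lexeme     ~ keyword_if  priority => 1
keyword_if  ~ 'if'

:lexeme     ~ identifier
identifier  ~ [_A-Za-z] ident_tail
ident_tail  ~ [_A-Za-z0-9]*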

See the previous question for sample code.

Jean-Damien Durand has written several full language parsers, including one for ECMAScript.

More sample code is in this article on Stackoverflow.

See also Q 36.

38 How do I write the BNF to specify exceptions to a pattern, or the complement of a pattern?

Currently there is no way to specify that directly in Marpa::R2, because context free grammars (CFGs) are not closed under complement. That is: If L1 and L2 are both BNF grammars, it is not necessarily possible to write L1 - L2 in BNF.

Nevertheless, Jean-Damien Durand has done something similar with this C implementation of a wrapper around Marpa, which uses his version of BNF. See:

This approach consists of building an ASF - Abstract Syntax Forest of parse results and then processing the trees in that forest.

39 How do I write the BNF when the first token is optional?

Note: GraphViz2::Marpa requires this technique, since it's working from this BNF.

Make sure you use enough tokens in your BNF. Do not try to compress the BNF when dealing with this issue.

Sample code.

40 Is there a tabular view of the DSL?

Yes: A Tabular View of Marpa::R2's DSL.

41 How can I trace what the L0 parser is doing?

Use the trace_terminals parameter to Marpa::R2::Scanless::R's constructor.

See GraphViz2::Marpa's usage for sample code.

trace_terminals takes values from 0 to 99, with higher numbers requesting more verbose output. Start with 1 and then jump to 99.
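A minimal sketch of turning it on:

my($recce) = Marpa::R2::Scanless::R -> new
({
    grammar         => $grammar,
    trace_terminals => 99, # 0 .. 99. Higher is more verbose.
});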

42 How do I define lexemes for identifiers where the 1st char is from a subset of the other chars?

Assume the identifier is something like $head$tail, where $head must belong to

[_A-Za-z]

but $tail is allowed to belong to

[_A-Za-z0-9]*

meaning $tail may be empty or many chars long. Try this:

:lexeme            ~ capture_name  pause => before  event => capture_name
capture_name       ~ capture_name_head capture_name_tail
capture_name_head  ~ [_A-Za-z]

capture_name_tail  ~ [_A-Za-z0-9]*

And no, you do not need to use events. (I do, in Regexp::Parsertron :-).

43 What is the difference between prediction events and pre-lexeme events?

A prediction event will be triggered when its lexeme is predicted (i.e. expected) at some point in the input stream. A pre-lexeme event can only be triggered at the start of a lexeme which has been explicitly detected in the input stream.
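A small sketch of the two declarations (symbol and event names invented):

# Prediction event: fires when the symbol is merely expected at the current location.

event 'expression_expected' = predicted expression

# Pre-lexeme event: fires only when the lexeme has actually been found in the input,
# just before Marpa reads it.

:lexeme ~ word  pause => before  event => word_found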

44 Precedence, associativity, priority and rank

This answer was added after a discussion on IRC, starting here.

Try tackling these 4 issues in this order:

  • Re-write the grammar

Yes, seriously.

Keep this in mind: Lexeme priority only works if the lexemes are exactly the same length, and start at the same location. This is a limitation in R2 which will be lifted in R3.

E: Details on how Marpa works

45 Is Marpa deterministic?

No. Marpa is non-deterministic, unlike traditional parsers. Marpa tries everything, all at once.

This may take some getting used to, in the sense that probably what you're used to is a deterministic parse.

If two alternatives (in a prioritized rule say, separated by '|' or '||') happen to match some source, Marpa produces 2 parse trees.

The parser, then, catches all ambiguities, but the SLIF lexer does not, whether it is LTM or LATM. LTM is Longest Token Match, and LATM is Longest Acceptable Token Match.

The SLIF lexer does a longest tokens (plural) match. So it does recognize ambiguous tokens if they are longest tokens.

Ambiguous tokens which are shorter than the matched one (whether the discipline is LATM or LTM) are ignored by the lexer.

Consequently, this means the ambiguous matched tokens must all be the same length.

See also Q 46.

46 What is the difference between ambiguity and non-determinism?

Ambiguous means there is more than one parse.

Non-deterministic means that, in the course of discovering the parse or parses, multiple paths may be pursued at the same time.

A deterministic parser must always produce an unambiguous parse, if only by ignoring the other alternatives, as in PEG.

A non-deterministic parser, like Marpa, can produce both ambiguous and unambiguous parses.

Example code.

47 What is the meaning of L0 and G1?

L0 defines exactly what tokens are acceptable in the input stream. So, these are the rules defining the syntax of the tokens, but not the syntax of how these tokens can be combined. G1 defines the structure of the language - the ways in which the tokens can be combined. In other words, G1 is structural and therefore a 'parser' in the strictest sense, and L0 is its lexer. ('L' for lexer.)

One distinction between G1 and L0 is very important in designing a grammar/lexer combination: G1 rules have a semantics. L0 rules do not.

L0 and G1 are discussed in Marpa::R2::Scanless::DSL, and, very briefly, in Marpa::R2::Semantics. In both cases, search for 'L0'.

See also Q 9 and Q 14.

48 SPPF parsing

Marpa uses SPPF parsing, but not under that name. Jeffrey Kegler calls his version a 'parse bocage', which he invented independently of Elizabeth Scott.

For details of her work, see here.

49 Parse exhaustion

See this setting.

F: Links and resources

50 Where do I go for help?

The IRC channel may get the quickest response: irc.freenode.net#marpa.

The website manpages.org has good docs on the theoretical details. Just type marpa into the search box and hit Enter.

The Google Group has message threading, and so is good for complex or extended discussions. That group is now closed, so instead see:

Marpa's homepage.

That page contains a link to an archive of the group. The archive is called Marpa.Google.Group.2023.12.06.zip.

See also Q 13 for another use for this archive.

51 Other Resources

52 Papers on Parsing Theory

Aria's list.

53 Are there C/Perl/Rust/etc bindings for libmarpa?

The original bindings are Libmarpa, a C library, and Marpa::R2, a Perl module. Marpa::R2 has been in stable release for many years, and is based on Libmarpa.

A Rust wrapper exists which has proved useful in development. It is also available as a crate.

Other bindings are being prototyped, or are in early development.

54 What are the Marpa Papers?

They are a small book of Marpa-related matters, and are highly recommended reading for (a) Getting Started, (b) Marpa's Advantages, and (c) Advanced Usage.

55 I want to write my own version of the Marpa/Earley/Leo algorithm in language X. What's the best way to proceed?

This is Jeffrey speaking: You'll probably want to work your way through many of my blog posts, especially this one: The Marpa Algorithm.

Experience suggests writing the parser in this order:

  • A basic Earley's parser.

  • The Aycock & Horspool grammar rewrite, to fix issues with nulls.

  • Joop Leo's improvement.

  • Q 48 re Elizabeth Scott's SPPF, or some other binarization of the output.

Marpa was not actually developed in this order -- it would have been much easier if it had been.

The Aycock & Horspool fix for null terminals: Null terminals and rules were a nasty and persistent problem with Earley's until A&H found the fix.

Joop Leo's improvement: This is hard but necessary if you want to avoid having your parser being regarded as second class. Leo improved Earley's algorithm to be linear for all deterministic context-free grammars, which is a big, big deal. Without Leo's improvement, for example, your parser won't really be able to handle true second-order languages -- languages which are the output of other programs, including languages produced as the output of other languages.

The Marpa ordering: The Marpa algorithm takes pains to have the parse engine complete all work in one Earley set before doing any work in the next one. This allows the parser to do better debugging; to use advanced methods like the Ruby Slippers; to be used on an event-driven basis; and to be paired with custom procedural logic.

Earley hacks you may want to skip: Many of the other improvements IMHO you will want to skip, including the LR(0) states in the Aycock & Horspool paper. In general, I suggest you be skeptical of those improvements which do not produce better results in big-O terms, but instead merely reduce the number of Earley items by a constant amount. These do not seem to produce measurable speed-ups at parsing time, but they must be undone at evaluation time. They will make it very hard to add an event mechanism to your parser, and will make debugging and tracing more complex.

56 How do I contribute to this FAQ?

The FAQ is on github. Just raise an issue. If you're not familiar with github, click on '(!) Issues 0'. The 0 is the # of open issues, so it may be > 0.

57 Are there any comparisons of Marpa against other parsers for speed and practicality?

MarpaX::ESLIF is an optimized version of Marpa::R2. Its 2022 announcement on the Marpa IRC channel discusses their relative speeds.

Jeffrey Kegler and Terence Parr (author of ANTLR) had a brief exchange on Reddit in 2014.

Lars Dieckow (daxim) performed a "shootout" among various parsers, including Perl 6, Marpa and ANTLR, and reported his results in a 2017 talk.

This paper describes both ANTLR and Marpa as "industrial grade" and pits them against each other in a performance benchmark. The paper is in the refereed literature and its authors (Henriksen, Bilardi and Pingali) are independent of the Marpa and ANTLR communities.

58 Are there any on-line tools for testing grammars?

Yes. See Grammophone. Also, specifically for testing LL, see this LL(k) tester.

59 What do I do when the IRC logbot stops?

Firstly, here is the logbot's HTML report.

To contact the admin, log on to irc.freenode.net's channel #irclogger and send a message to 'feb'.

60 Where can I find a timeline (history) of parsing?

See Jeffrey Kegler's timeline.