Chapter 1
Marpa is a library written in C, with a Perl wrapper. It is dedicated to parsing. That is, people who wish to write parsers incorporate Marpa into their programs.
You start using Marpa by providing it with 2 inputs, a grammar, and an input stream which may or may not conform to the given grammar.
This means Marpa starts by parsing the grammar and storing it in RAM in some convenient form. Such a grammar specifies both what tokens are acceptable, and the order in which they may appear. Together, these determine the syntax of the language defined by the grammar.
In practice, this means you store your grammar in a string, and Marpa treats that as a set of rules as to how you want Marpa to treat your input stream.
Next, Marpa parses your input stream and tests each token it finds there against the grammar you provided. This includes determining both if the token matches any of the definitions of acceptable tokens as specified by the grammar, and whether or not the token appears at an acceptable position within the grammar, and hence at an acceptable position within the input stream.
It is in fact doing exactly the same as you are doing reading this text. For us humans, the grammar (e.g. English) is pre-loaded in our minds, and the input stream is any text we are reading or hearing (for instance: text, music or noise).
We test the latter against the grammar, and declare that it either matches what we expect, or fails to match.
Upon failure, we could switch the internal grammar we're using, from say English to Spanish or Chinese. If that fails we may well give up. That is, we conclude the input stream is grammatically incorrect, i.e. it contains one of more syntax errors, or, at the worst, it is visual (textual) or audible noise.
But if it matches, we conclude the input stream was grammatically correct, or roughly so, since humans don't always speak in grammatically perfect speech of course. Indeed, in songs and advertisments, correct grammar is often noticably distorted or even absent.
Likewise, Marpa attempts to verify whether or not the input stream does in fact match the given grammar.
This is described as: The input was 'recognized' (or not) by Marpa. Hence the (Perl) variable $recce in the docs. There is more on recognition in the next section.
We can express this as: The input stream conforms, or doesn't conform, to the given grammar.
If it doesn't, and we're dealing with a fixed text which we want to match, then we modify the grammar step-by-step, until it Just Works. This tells us that the point of test data is to test the grammar.
Deciding that work on the grammar is finished is equivalent to saying:
Every grammatically correct input stream will result in Marpa returning some sort of valid result, and
Every grammatically incorrect input stream will result in Marpa returning some sort of error result.
Further, a grammar may be deliberately constructed such that some inputs match the grammar in more than one way. These grammars are called ambiguous. Marpa handles these cases easily.
Marpa's grammars are written in what we call an SLIF-DSL. Here, SLIF stands for Scanless Interface, and DSL is Domain-specific Language.
The SLIF's parsing should be thought of as two stages. The lexer looks at everything first, and G1 (your grammar defining the structure of the language) only sees what the lexer tells it about.
Many programmers will have heard of BNF. Well, Marpa's SLIF-DSL is an extended BNF. That is, it includes special tokens which only make sense within the context of a Marpa grammar. Hence the 'Domain Specific' part of the name.
An example (later) includes such extensions - the pseudo-rule ':start'.
Marpa's docs for this SLIF-DSL are here.
BNF can be thought of as a generator for DSLs.
With the advent of Marpa's SLIF, we need to feel comfortable with both Marpa's approach and the fact that, classically, lexing and parsing were often such visibly distinct phases, perhaps even requiring 2 programs.
Marpa's SLIF follows the classical separation of lexing and parsing, but it introduces a notation which is friendly to those programmers more accustomed to regular expressions and recursive descent.
So instead of two completely separate modules, lexer and parser, each with its own interface, you have a single DSL source file, and the lexing and parsing statements look similar, with only the different operators ('::=' versus '~') to tell them apart.
Underneath however, in the implementation, the distinction is still there, as sharp as ever.
Marpa has 3 phases: Grammar pre-processing; Recognition; and Evaluation. That is, Marpa:
Precomputes a grammar.
Recognizes an input, creating a table.
Turns that table into a tree, and evaluates it.
Events occur during the recognition phase. Actions are executed during the evaluation phase. In other words, by the time the first action is executed, the last event is ancient history. Events and Actions are part of Marpa's SLIF-DSL.
And all this takes place in a single pass of the input stream, i.e. very efficiently indeed.
Consequently, Marpa's new design completely eliminates the necessity for you to write code to perform your own lexing.
The previous section is more of a descriptive explanation of Marpa's internal process. It's reasonable to ask: Is there another way to view this?
Sure. Take the first of the 3 points above: 'Precomputes a grammar'.
Marpa creates various data structures from such a grammar during the phase called 'Precomputes a grammar'. A simplistic view of this is that a grammar expressed as a multi-line string is converted into the same grammar expressed as a tree.
And these structures, combined with the input stream, become the basis for the recognition and evaluation phases.
For more on this, see Wikipeida's article on Abstract Syntax Trees.
Briefly, the parsing process converted a grammar as a tree, plus input, into another structure, representable as a tree, which is evaluated (analyzed) as per instructions embedded in the grammar. A example of such an instruction is given below as 'action => graph', whose exact effect will be discussed later.
But why not use regular expressions?
Regular expressions are a brilliant invention, and a perfectly suited for many applications.
But the requirement to process structured grammars, and solutions such as Marpa, are what distinguises parsing from simpler techniques such as lexing and regular expressions.
Lexers and regular expressions recognize strings, but do not assign any structure to them.
When captures are added to regular expressions, you begin to see bits and pieces of a structure, but regular expressions and captures have a hard time with a structure of any complexity. Lastly, when the structure is recursive they break down completely.
You can see such a case in the grammar fragment which follows - specifially in the last rule - where 'node_statement' is defined to be a series of 'node_statement's separated by commas. Marpa handles such cases effortlessly.
:start ::= graph_grammar
graph_grammar ::= graph_definition action => graph
graph_definition ::= node_definition
| edge_definition
node_definition ::= node_statement
| node_statement graph_definition
node_statement ::= node_name
| node_name attribute_definition
| node_statement (',') node_statement
...
The complete set of rules can be represented graphically as a tree, rooted at the start rule. See this SVG version, as produced by MarpaX::Grammar::GraphViz2, which graphs grammars. The grammar itself actually comes from the module MarpaX::Demo::StringParser.
Alternately, the grammar fragment can be represented as a textural tree:
statements
|---:start
| |---::=
| |---graph_grammar
|---graph_grammar
| |---::=
| |---graph_definition
| |---action
| |---=>
| |---graph
|---graph_definition
| |---::=
| |---node_definition
| |---|
| |---edge_definition
|---node_definition
| |---::=
| |---node_statement
| |---|
| |---node_statement
| |---graph_definition
|---node_statement
| |---::=
| |---node_name
| |---|
| |---node_name
| |---attribute_definition
| |---|
| |---node_statement
| |---(',')
| |---node_statement
...
This tree (for the whole grammar) is shipped as share/stringparser.cooked.tree in MarpaX::Demo::StringParser.
The simplest way is to copy a suitable grammar, but if that's not possible, we must write it ourselves.
And this leads to developing a set of test files, whose purpose is to test the grammar, rule by rule. This includes tests that are deliberately faulty. That is, some tests must cause Marpa to report a syntax error in the input stream. Whether or not the grammar or the input stream is the real culprit is of course dependent on the user's expertise.
It is written in C for speed.
It can precisely identify the location (within the input stream) of syntax errors.
It has extensive documentation. Start here if you wish to delve into the docs.
It extends the BNF concept in various ways.
It has an exceptionally straight-forward interface via Perl.
It has a low-level ('thin') interface as well.
It has convenient tracing and reporting features.
It ships as a stand-alone library module, for inclusion into end-user code.
It ships with a broad test suite.
It has a homepage, maintained by library's author.
It has a Google Group and email list (marpa-parser@googlegroups.com).
Intriguingly, Marpa's own grammar (for its DSL) ships in a file (metag.bnf) which has the same format as end-user grammars.
See Chapter 2 for a detailed list of Marpa's advantages over old-style lexers and parsers.
Marpa is a Perl library. We will use Marpa::R2 for the examples in this guide.
If you have Perl already installed you can install Marpa with cpan
or cpanm
.
$ cpan Marpa::R2
or
$ cpanm Marpa::R2
This will install Marpa::R2.
In the rest of this guide we'll assume you have installed Marpa::R2. The examples
that end in '.pl' can be executed using perl
.