A New Marpa-based Parser for Graphviz files

Author: Ron Savage
Table of Contents
Introduction
Interfaces to Marpa
The Grammar in BNF
Event-driven Processing of Lexemes
Sample Input
Sample Output
Download
TODO
Last updated: 2014-09-18
Introduction
Top

I've been developing another Marpa-based parser for DOT files, in a re-write of my Perl module This time I'm using the SLIF interface (see below)

The new code will be in the next public release of that module. The current version of GraphViz2::Marpa (V 1.13) uses the obsolete NAIF.

Marpa's homepage

Marpa's Perl docs

An Overview/FAQ for Marpa

My other graph-oriented modules.

Marpa is written in C, although I do all my development using its Perl SLIF interface. Marpa compiles into libmarpa.

Even though using Perl as a wrapper around these interfaces might make you think it's slow, Marpa is so fast that the slowdown is simpy not detectable when I run programs from the command line.

And ultimately, everything I write in Perl could be transcribed into C in order to use the THIF directly, but discussion of that process has only just begun on Marpa's irc channel irc.freenode.net#marpa. See the backlog re a SLIF-to-C compiler starting here.

Note: All software discussed here is Open Source.

See also the related document Conditional Preservation of Whitespace.

Interfaces to Marpa
Top
Language you use Name of Interface Abbreviation of Interface Status
C - - Supported
Perl - Marpa::R2 Thin interface THIF Supported. Throws Perl exceptions
Perl - Marpa::R2 Native/Naive interface NAIF Obsolete but supported, and discouraged. Throws Perl exceptions
Perl - Marpa::R2 Scanless interface SLIF Supported. Used in GraphViz2::Marpa. Throws Perl exceptions

The Grammar in BNF
Top

And yes, Marpa can parse the BNF-like syntax below. It is in fact BNF as extended by Jeffrey Keger, the author of Marpa.

Marpa's docs for its version of BNF - a Domain-Specific Language

The eye-popping similarity (revealed by very close inspection) between this grammar and the docs for DOT is surprising but by no means accidental.

Marpa's adoption of the Scanless Interface is specifically designed to bridge the gap between developing a complete grammar in BNF and implementing a parser which can literally use the very same BNF for lexing and parsing.

Note: I have not written a validating parser, meaning that nonsensical constructs are not checked for.

That means attribute names are not cross-checked against their values, which would ensure statements like 'node_1 [color = -1 penwidth = red]' are rejected.

That's a way of saying I assume the input file has been validated by 'dot'. Of course such validation could be added, and the C code for it must already exist in Graphviz.

Nevertheless, the input absolutely must conform to the given grammar in order to stop Marpa issuing a syntax error.

Note: $self is an object of type GraphViz2::Marpa, so $self -> grammar() means I'm initializing the object's attribute called 'grammar', which I then pass into the constructor of Marpa's recognizer object 'Marpa::R2::Scanless::R' in the code fragment which follows.



    $self -> grammar
    (
        Marpa::R2::Scanless::G -> new
        ({
source                  => \(<<'END_OF_GRAMMAR'),

:default                ::= action => [values]

lexeme default          = latm => 1     # Longest Acceptable Token Match.

:start                  ::= graph_definition

# The prolog to the graph.

graph_definition        ::= graph_body
                            | comments graph_body
                            | graph_body comments
                            | comments graph_body comments

comments                ::= comment+

comment                 ::= <C style comment>
                            | <Cplusplus style comment>
                            | <hash style comment>

graph_body              ::= prolog_tokens graph_statement_tokens

prolog_tokens           ::= strict_token graph_type global_id_type

strict_token            ::=
strict_token            ::= strict_literal

graph_type              ::= digraph_literal
                            | graph_literal

global_id_type          ::=
global_id_type          ::= generic_id_token

# The graph proper.

graph_statement_tokens  ::= open_brace statement_list close_brace

statement_list          ::= statement_token*

statement_token         ::= statement statement_terminator
                            | comments statement_token

statement_terminator    ::= semicolon_literal
statement_terminator    ::=

statement               ::= node_statement
                            | edge_statement
                            | attribute_statement
                            | assignment_statement
                            | subgraph_statement
                            | comments
# Node stuff.

node_statement          ::= generic_id_token attribute_tokens

generic_id_token        ::= generic_id
                            | ('<') generic_id ('>')
                            | number
                            | '"' string '"'
                            | empty_string

number                  ::= float
                            | integer

# Edge stuff.

edge_statement          ::= edge_segment attribute_tokens

edge_segment            ::= edge_tail edge_literal edge_head

edge_tail               ::= edge_node_token
                            | subgraph_statement

edge_head               ::= edge_tail
                            |  edge_tail edge_literal edge_head

edge_node_token         ::= generic_id_token
                            | node_port_id
                            | node_port_compass_id

node_port_id            ::= generic_id_token<colon>generic_id_token

node_port_compass_id    ::= node_port_id<colon>generic_id_token

# Attribute stuff.
# These have no body between the '[]' because they is parsed manually in order to
# preserve whitespace in double-quoted strings, which is otherwise discarded by this grammar.
# See _process_attributes(). See also the same method for the handling of attribute terminators: [;,].

attribute_tokens        ::=
attribute_tokens        ::= open_bracket close_bracket statement_terminator

attribute_statement     ::= node_statement # Search for 'class'! It's in _process().

# Assignment stuff.
# There is nothing after 'equals_literal' because it is parsed manually in order to
# preserve whitespace in double-quoted strings, which is otherwise discarded by this grammar.
# See _attribute_field().

assignment_statement    ::= generic_id equals_literal generic_id_token

# Subgraph stuff.
# Subgraphs are handled by the statement type 'graph_statement_tokens'.
# Hence subgraph sub_1 {...} and subgraph {...} and sub_1 {...} and {...} are all output as:
# o The optional literal 'subgraph', which is classified as a literal.
# o The optional subgraph id, which is classified as a node_id.
# o The literal '{'.
# o The body of the subgraph.
# o The literal '}'.

subgraph_statement      ::= subgraph_prefix subgraph_id_token graph_statement_tokens

subgraph_prefix         ::=
subgraph_prefix         ::= subgraph_literal

subgraph_id_token       ::=
subgraph_id_token       ::= generic_id_token

# Lexeme-level stuff, in alphabetical order.

:lexeme                 ~ close_brace       pause => before     event => close_brace

close_brace             ~ '}'

:lexeme                 ~ close_bracket     pause => before     event => close_bracket

close_bracket           ~ ']'

:lexeme                 ~ colon             pause => before     event => colon

colon                   ~ ':'

digit                   ~ [0-9]
digit_any               ~ digit*
digit_many              ~ digit+

:lexeme                 ~ digraph_literal   pause => before     event => digraph_literal

digraph_literal         ~ 'digraph':i

:lexeme                 ~ edge_literal      pause => before     event => edge_literal

edge_literal            ~ '->'
edge_literal            ~ '--'

:lexeme                 ~ empty_string      pause => before     event => string

empty_string            ~ '""'

:lexeme                 ~ equals_literal    pause => before     event => equals_literal

equals_literal          ~ '='

escaped_char            ~ '\' string_char
escaped_quote           ~ '\"'

:lexeme                 ~ float             pause => before     event => float

float                   ~ sign_maybe digit_any '.' digit_many
                            | sign_maybe digit_many '.' digit_any

:lexeme                 ~ generic_id        pause => before     event => generic_id

generic_id_prefix       ~ [a-zA-Z\200-\377_]
generic_id_suffix       ~ [a-zA-Z\200-\377_0-9]*
generic_id              ~ <generic_id_prefix>generic_id_suffix

:lexeme                 ~ graph_literal     pause => before     event => graph_literal

graph_literal           ~ 'graph':i

:lexeme                 ~ integer           pause => before     event => integer

integer                 ~ sign_maybe non_zero digit_any
                            | zero

non_zero                ~ [1-9]

:lexeme                 ~ open_brace        pause => before     event => open_brace

open_brace              ~ '{'

:lexeme                 ~ open_bracket      pause => before     event => open_bracket

open_bracket            ~ '['

semicolon_literal       ~ ';'

sign_maybe              ~ [+-]
sign_maybe              ~

:lexeme                 ~ strict_literal    pause => before     event => strict_literal

strict_literal          ~ 'strict':i

:lexeme                 ~ string            pause => before     event => string

string                  ~ string_char+

string_char             ~ escaped_char | escaped_quote | [^"]

# Comment with " matching 2nd last \", for the UltraEdit syntax highlighter.

:lexeme                 ~ subgraph_literal  pause => before     event => subgraph_literal

subgraph_literal        ~ 'subgraph':i

zero                    ~ '0'

# C and C++ comment handling copied from MarpaX::Languages::C::AST.

<C style comment>                   ~ '/*' <comment interior> '*/'

<comment interior>                  ~ <optional non stars> <optional star prefixed segments> <optional pre final stars>

<optional non stars>                ~ [^*]*
<optional star prefixed segments>   ~ <star prefixed segment>*
<star prefixed segment>             ~ <stars> [^/*] <optional star free text>
<stars>                             ~ [*]+
<optional star free text>           ~ [^*]*
<optional pre final stars>          ~ [*]*

<Cplusplus style comment>           ~ '//' <Cplusplus comment interior>
<Cplusplus comment interior>        ~ [^\n]*

# Hash comment handling copied from Marpa::R2's metag.bnf.

<hash style comment>                ~ <terminated hash comment>
                                        | <unterminated final hash comment>

<terminated hash comment>           ~ '#' <hash comment body> <vertical space char>

<unterminated final hash comment>   ~ '#' <hash comment body>

<hash comment body>                 ~ <hash comment char>*

<vertical space char>               ~ [\x{A}\x{B}\x{C}\x{D}\x{2028}\x{2029}]

<hash comment char>                 ~ [^\x{A}\x{B}\x{C}\x{D}\x{2028}\x{2029}]

# White space.

:discard                ~ whitespace

whitespace              ~ [\s]*

END_OF_GRAMMAR
        })
    );

    $self -> recce
    (
        Marpa::R2::Scanless::R -> new
        ({
            grammar          => $self -> grammar,
            #trace_terminals => 99,
        })
    );

				

Event-driven Processing of Lexemes
Top

One point in the grammar I'd like to mention explicitly, is the construct you see in, e.g., 'before' could (in a different grammar) have been 'after', while 'event' declares the event's name, as used in the giant 'if' statement in the code below.

Since, astonishingly, Marpa's lexer/parser can be exited and re-entered at any time, a grammar can request a return to the calling code at the start (as here) or end of any given lexeme, giving the caller the opportunity to use various Marpa methods to get and stash the lexeme, and its length, etc. Then, the parser is re-entered with a call to resume($pos).

This also means the $pos variable can be fiddled. And that's how I parse attribute names and values manually, in order to preserve the whitespace within strings, as mentioned in the gammar's comments. Having done that, I stash them, and advance $pos past the end of the last such value, thereby letting Marpa continue without ever actually seeing inside a [...] style attribute declararion.

Further, each attribute value could then have its own standalone grammar, if you wanted to use Marpa to validate each type of attributes' values separately.

I should add that the multiple entries into, and exits from, the parser as used here are probably not how the majority of Marpa-using code is written. That is, the input string is generally processed in one go. Of course it's nice to have a choice, and demonstrates one of the many ways in which Marpa distinguishes itself from traditional parsers.

Output (for my Perl code) is in the form of a tree. There is an example below. This tree is managed by the Perl module Tree::DAG_Node, I didn't write Tree::DAG_Node, but I do now maintain it.



sub _process
{
    my($self)          = @_;
    my($string)        = $self -> graph_text;
    my($length)        = length $string;
    my($literal_token) = qr/(?:colon|edge_literal|equals_literal|strict_literal|subgraph_literal)/;
    my($prolog_token)  = qr/(?:digraph_literal|graph_literal)/;
    my($simple_token)  = qr/(?:float|integer)/;

    $self -> log(debug => "Input:\n$string");

    # We use read()/lexeme_read()/resume() because we pause at each lexeme.

    my($attribute_list, $attribute_value);
    my(@event, $event_name);
    my($generic_id);
    my($lexeme_name, $lexeme, $literal);
    my($node_name);
    my($span, $start, $s);
    my($type);

    for
    (
        my $pos = $self -> recce -> read(\$string);
        $pos < $length;
        $pos = $self -> recce -> resume($pos)
    )
    {
        $self -> log(debug => "read() => pos: $pos. ");

        @event          = @{$self -> recce -> events};
        $event_name     = ${$event[0]}[0];
        ($start, $span) = $self -> recce -> pause_span;
        $lexeme_name    = $self -> recce -> pause_lexeme;
        $lexeme         = $self -> recce -> literal($start, $span);

        $self -> log(debug => "pause_span($lexeme_name) => start: $start. span: $span. " .
            "lexeme: $lexeme. event: $event_name. ");

        if ($event_name =~ $simple_token)
        {
            $pos = $self -> _process_lexeme(\$string, $start, $event_name, $event_name, $lexeme_name, $event_name);
        }
        elsif ($event_name =~ $literal_token)
        {
            $node_name = ($event_name eq 'edge_literal')
                            ? 'edge'
                            : ($event_name eq 'equals_literal')
                            ? 'equals'
                            : 'literal';
            $pos       = $self -> _process_lexeme(\$string, $start, $event_name, $node_name, $lexeme_name, undef);
        }
        elsif ($event_name eq 'close_brace')
        {
            $pos     = $self -> recce -> lexeme_read($lexeme_name);
            $literal = substr($string, $start, $pos - $start);

            $self -> log(debug => "close_brace => '$literal'");
            $self -> _process_brace($literal);
        }
        elsif ($event_name eq 'close_bracket')
        {
            $pos     = $self -> recce -> lexeme_read($lexeme_name);
            $literal = substr($string, $start, $pos - $start);

            $self -> log(debug => "close_bracket => '$literal'");
            $self -> _process_bracket($literal);
        }
        elsif ($event_name eq 'generic_id')
        {
            $pos        = $self -> recce -> lexeme_read($lexeme_name);
            $generic_id = substr($string, $start, $pos - $start);
            $type       = 'node_id';

            if ($generic_id =~ /^(?:edge|graph|node)$/i)
            {
                $type = 'class';
            }
            elsif ($generic_id eq 'subgraph')
            {
                $type = 'literal';
            }

            $self -> log(debug => "generic_id => '$generic_id'. type => $type");
            $self -> _process_token($type, $generic_id);
        }
        elsif ($event_name eq 'open_brace')
        {
            $pos     = $self -> recce -> lexeme_read($lexeme_name);
            $literal = substr($string, $start, $pos - $start);

            $self -> log(debug => "open_brace => '$literal'");
            $self -> _process_brace($literal);
        }
        elsif ($event_name eq 'open_bracket')
        {
            $pos     = $self -> recce -> lexeme_read($lexeme_name);
            $literal = substr($string, $start, $pos - $start);

            $self -> log(debug => "open_bracket => '$literal'");
            $self -> _process_bracket($literal);

            $pos = $self -> _find_terminator(\$string, $start);

            $attribute_list = substr($string, $start + 1, $pos - $start - 1);

            $self -> log(debug => "index() => attribute list: $attribute_list");
            $self -> _process_attributes($attribute_list);
        }
        elsif ($event_name eq 'string')
        {
            $pos     = $self -> recce -> lexeme_read($lexeme_name);
            $literal = substr($string, $start, $pos - $start);

            $self -> log(debug => "string => '$literal'");
            $self -> _process_token('node_id', $literal);
        }
        # From here on are the low-frequency events.
        elsif ($event_name =~ $prolog_token)
        {
            $pos     = $self -> recce -> lexeme_read($lexeme_name);
            $literal = lc substr($string, $start, $pos - $start);

            $self -> log(debug => "$event_name => '$literal'");
            $self -> _process_digraph_graph($event_name, $literal);
        }
        else
        {
            die "Unexpected lexeme '$lexeme_name' with a pause\n";
        }
    }

    # Return a defined value for success and undef for failure.

    return $self -> recce -> value;

} # End of _process.

				

Sample Input
Top


/* C comment. */

// C++ comment.

# Hash comment.

STRICT digraph graph_16
{
    fontsize = 16.0
    label    = "\"Standard\"\rSyntax\lTest"
    size     = "5,6"

    node
    [
        shape = "record",
    ];

    edge
    [
        color = "red"
        penwidth = 3,
    ];

    node_16_1
    [
        label    = "<p11> left|<p12> middle|<p13> right"
        pencolor = blue
    ]

    node_16_2
    [
        pencolor = green
        label    = "<p21> one|<p22> two"
    ]

    node_16_1:p11 -> node_16_2:p22:s
    [
        arrowhead = "odiamond";
        arrowtail = "odot",
        color     = red
        dir       = both;
    ];

    subgraph subgraph_16_1
    {
        node [shape = square]

        label = ""

        node_16_3 -> { node [shape = circle] node_16_4 }
        [
            arrowhead = "empty",
            arrowtail = "halfopen"
            color     = green
            dir       = "both",
        ]

        node_16_5 -> node_16_6
        [
            arrowhead = "halfopen",
            arrowtail = "empty"
            color     = blue
            dir       = "both",
        ]
    }
}
				

Sample Output
Top


root. Attributes: {uid => "0"}
   |---prolog. Attributes: {uid => "1"}
   |   |---literal. Attributes: {uid => "3", value => "strict"}
   |   |---literal. Attributes: {uid => "4", value => "digraph"}
   |---graph. Attributes: {uid => "2"}
       |---node_id. Attributes: {uid => "5", value => "graph_16"}
       |---literal. Attributes: {uid => "6", value => "{"}
       |   |---fontsize. Attributes: {type => "float", uid => "7", value => "16.0"}
       |   |---label. Attributes: {type => "string", uid => "10", value => "\"Standard\"\rSyntax\lTest"}
       |   |---size. Attributes: {type => "string", uid => "13", value => "5,6"}
       |   |---class. Attributes: {uid => "16", value => "node"}
       |   |   |---literal. Attributes: {uid => "17", value => "["}
       |   |   |---shape. Attributes: {uid => "18", value => "record"}
       |   |   |---literal. Attributes: {uid => "19", value => "]"}
       |   |---class. Attributes: {uid => "20", value => "edge"}
       |   |   |---literal. Attributes: {uid => "21", value => "["}
       |   |   |---color. Attributes: {uid => "22", value => "red"}
       |   |   |---penwidth. Attributes: {uid => "23", value => "3"}
       |   |   |---literal. Attributes: {uid => "24", value => "]"}
       |   |---node_id. Attributes: {uid => "25", value => "node_16_1"}
       |   |   |---literal. Attributes: {uid => "26", value => "["}
       |   |   |---label. Attributes: {uid => "27", value => "<p11> left|<p12> middle|<p13> right"}
       |   |   |---pencolor. Attributes: {uid => "28", value => "blue"}
       |   |   |---literal. Attributes: {uid => "29", value => "]"}
       |   |---node_id. Attributes: {uid => "30", value => "node_16_2"}
       |   |   |---literal. Attributes: {uid => "31", value => "["}
       |   |   |---pencolor. Attributes: {uid => "32", value => "green"}
       |   |   |---label. Attributes: {uid => "33", value => "<p21> one|<p22> two"}
       |   |   |---literal. Attributes: {uid => "34", value => "]"}
       |   |---node_id. Attributes: {uid => "35", value => "node_16_1:p11"}
       |   |---edge. Attributes: {uid => "38", value => "->"}
       |   |   |---literal. Attributes: {uid => "44", value => "["}
       |   |   |---arrowhead. Attributes: {uid => "45", value => "odiamond"}
       |   |   |---arrowtail. Attributes: {uid => "46", value => "odot"}
       |   |   |---color. Attributes: {uid => "47", value => "red"}
       |   |   |---dir. Attributes: {uid => "48", value => "both"}
       |   |   |---literal. Attributes: {uid => "49", value => "]"}
       |   |---node_id. Attributes: {uid => "39", value => "node_16_2:p22:s"}
       |   |---literal. Attributes: {uid => "50", value => "subgraph"}
       |   |---node_id. Attributes: {uid => "51", value => "subgraph_16_1"}
       |   |---literal. Attributes: {uid => "52", value => "{"}
       |   |   |---class. Attributes: {uid => "53", value => "node"}
       |   |   |   |---literal. Attributes: {uid => "54", value => "["}
       |   |   |   |---shape. Attributes: {uid => "55", value => "square"}
       |   |   |   |---literal. Attributes: {uid => "56", value => "]"}
       |   |   |---label. Attributes: {type => "string", uid => "57", value => """"}
       |   |   |---node_id. Attributes: {uid => "60", value => "node_16_3"}
       |   |   |---edge. Attributes: {uid => "61", value => "->"}
       |   |   |   |---literal. Attributes: {uid => "69", value => "["}
       |   |   |   |---arrowhead. Attributes: {uid => "70", value => "empty"}
       |   |   |   |---arrowtail. Attributes: {uid => "71", value => "halfopen"}
       |   |   |   |---color. Attributes: {uid => "72", value => "green"}
       |   |   |   |---dir. Attributes: {uid => "73", value => "both"}
       |   |   |   |---literal. Attributes: {uid => "74", value => "]"}
       |   |   |---literal. Attributes: {uid => "62", value => "{"}
       |   |   |   |---class. Attributes: {uid => "63", value => "node"}
       |   |   |   |   |---literal. Attributes: {uid => "64", value => "["}
       |   |   |   |   |---shape. Attributes: {uid => "65", value => "circle"}
       |   |   |   |   |---literal. Attributes: {uid => "66", value => "]"}
       |   |   |   |---node_id. Attributes: {uid => "67", value => "node_16_4"}
       |   |   |---literal. Attributes: {uid => "68", value => "}"}
       |   |   |---node_id. Attributes: {uid => "75", value => "node_16_5"}
       |   |   |---edge. Attributes: {uid => "76", value => "->"}
       |   |   |   |---literal. Attributes: {uid => "78", value => "["}
       |   |   |   |---arrowhead. Attributes: {uid => "79", value => "halfopen"}
       |   |   |   |---arrowtail. Attributes: {uid => "80", value => "empty"}
       |   |   |   |---color. Attributes: {uid => "81", value => "blue"}
       |   |   |   |---dir. Attributes: {uid => "82", value => "both"}
       |   |   |   |---literal. Attributes: {uid => "83", value => "]"}
       |   |   |---node_id. Attributes: {uid => "77", value => "node_16_6"}
       |   |---literal. Attributes: {uid => "84", value => "}"}
       |---literal. Attributes: {uid => "85", value => "}"}
Parse result: 0 (0 is success)

				

Download
Top

The current code can be downloaded from

To run:

shell> tar xvzf GraphViz2-Marpa-2.00.rc3.tgz
shell> cd GraphViz2-Marpa
shell> perl Makefile.PL; make (To install the pre-reqs)
shell> scripts/test.sh
shell> scripts/g2m.sh data/16.gv -max info (As per the example above)
And likewise for any file data/*.gv.

TODO
Top
  1. Of course, the long (and nice-to-have) explanations on the Graphviz web site regarding various aspects of the precise interpretation of DOT constucts are not addressed by this code.
  2. If the input contains 'node [color = red] node [shape = hexagon]', these 2 lines are output as 2 nodes in the tree, and could be folded into one. This would not be difficult, and indeed some post-processing is already done (moving edge attributes).
  3. These things are not handled:
  4. Copy all sample graphs shipped with Graphviz, i.e. graphs/directed/*.gv and graphs/undirected/*.gv, into the module's test suite, and test them.
  5. And, as stated in the introduction: Attribute names are not cross-checked against their values, which would ensure statements like 'node_1 [color = -1 penwidth = red]' are rejected. Unfortunately, this is a massive task.


My homepage
Debian V 7.6
Perl V 5.18.2
Marpa V 2.091001
GraphViz2::Marpa V 2.00 (unreleased)