Fancy Matching of Delimited Text

Table of contents

Fancy Matching of Delimited Text
Extreme Simplicity
Major Features
Controlling the Parse
Processing the Output
Letting the Code Print Your Tree
Writing Your Own Tree-walking Code

Fancy Matching of Delimited Text

If the problem is matching strings inside arbitrary delimiters, then, up until January 2015, Perl programmers have had modules like Regexp::Common and Text::Balanced available to them.

These modules, and others like them, may well do what you want. Still, as I scrolled back and forth through their docs, what struck me was the amount of work the programmer has to do to help those modules do their work. That even included worrying about void/scalar/list context.

But if you're comfortable with Marpa::R2, you'll be very tempted to say: Is there another way?

Well now, yes there is: Text::Balanced::Marpa. Of course, I can't guarantee my new module will solve every problem addressed by other modules (hint: it can't), but I will claim it is very easy to use.

One problem handled by Text::Balanced which I do not even consider is HERE documents.

Nevertheless, in this article I hope to convince you that Text::Balanced::Marpa will serve many of your requirements.

Extreme Simplicity

To use Text::Balanced::Marpa all you need do is specify open and close delimiters, and of course the input string.

Yes, that's it! Three measly little strings! No regexps, no options, nothing superfluous to the task.

But see "Controlling the Parse" if you have a desperate need for options anyway.
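As a minimal sketch of what that looks like in practice, the program below uses only calls that appear elsewhere in this article: new() with open and close, parse() with a reference to the text, and the tree's traverse(), is_root() and meta() methods. The input string and the choice of ('{', '}') are just illustrative, and the calling convention for parse() follows the code quoted in this article, so check it against the docs of the version you have installed.

```perl
use strict;
use warnings;

use Text::Balanced::Marpa;

# The three ingredients: open delimiters, close delimiters, input text.
my($parser) = Text::Balanced::Marpa -> new
(
    open  => ['{'],
    close => ['}'],
);
my($text) = 'a {b {c} d} e';

# parse() returns 0 for success, as in the sample output in this article.
if ($parser -> parse(\$text) == 0)
{
    for my $node ($parser -> tree -> traverse)
    {
        next if ($node -> is_root);

        # Each node stores its piece of the input under the 'text' key.
        print ${$node -> meta}{text}, "\n";
    }
}
else
{
    print 'Error ', $parser -> error_number, ': ', $parser -> error_message, "\n";
}
```

No options are passed here, so the parser runs with its defaults.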

Major Features

In what follows, all programs mentioned - scripts/*.pl and t/*.t - are shipped in the distro.

o UTF8 support is built-in

This code from t/utf8.t shows this:

        my($parser) = Text::Balanced::Marpa -> new
        (
                open    => ['Δ'],
                close   => ['δ'],
                options => overlap_is_fatal,
        );

o You may specify one or more opening delimiters

This code from t/multiple.quotes.t shows this:

        my($parser) = Text::Balanced::Marpa -> new
        (
                open    => ['<', '{', '[', '(', '"', "'"],
                close   => ['>', '}', ']', ')', '"', "'"],
                options => overlap_is_fatal,
        );

All delimiters are treated equally. That is, there is no hierarchy of delimiters.

Overlapping usage of an (open, close) pair, or nesting of same, is discussed below.

o Each opening delimiter must have a corresponding closing delimiter

Thus the number of opening delimiters must be the same as the number of closing delimiters.

And as you have seen just above, the closing delimiter may or may not be the same as the corresponding opening delimiter.

o Delimiters may be any number of characters long

Here, from t/html.t and scripts/traverse.pl, is a silly example:

        my($parser) = Text::Balanced::Marpa -> new
        (
                open =>
                [
                        '<html>',
                        '<head>',
                        '<title>',
                        '<body>',
                        '<h1>',
                        '<table>',
                        '<tr>',
                        '<td>',
                ],
                close =>
                [
                        '</html>',
                        '</head>',
                        '</title>',
                        '</body>',
                        '</h1>',
                        '</table>',
                        '</tr>',
                        '</td>',
                ],
        );

The docs discuss why using this module to parse HTML is not a good idea, but yes, it's sort-of possible.

And t/angle.brackets.t shows yet again why Marpa::R2::HTML is really what you want for HTML.

o Delimiters do not have to be commonly-recognized pairs

So, your pairs of (open, close) could be (']', '!') and ('=>', 'q}') in the same run.

No, I can't see the point of that either, but it will work.

o The escape character defaults to backslash

That is not surprising, but consider ...

o You may define your own escape character when calling new()

Now that's flexibility plus.

o You only need escape the 1st character of any multi-character delimiter

I guess that could be called 'syntactic sugar'.

o Delimiters can actually be part of another delimiter

This code in t/perl.delimiters.t shows clearly what I mean:

        my($parser) = Text::Balanced::Marpa -> new
        (
                open  => ['qw/', 'qr/', 'q|', 'qq|'],
                close => [  '/',   '/',  '|',   '|'],
        );

So, both 'q|' and 'qq|' can be opening delimiters, and '|' can appear in various places within the set of delimiters used in a single run.

o You can specify any offset into the input text at which to start parsing

o The last error number and error message are accessible via methods

Both warning and error messages are returned by the methods error_number() and error_message().
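The last two features can be sketched together. This is an illustration rather than code from the distro. It assumes the constructor parameters for the starting offset and the number of characters to parse are named pos and length, which is what the sample output of scripts/tiny.pl later in this article suggests. Verify both names against the module's docs before relying on them.

```perl
use strict;
use warnings;

use Text::Balanced::Marpa;

my($prefix) = q{I've already parsed up to here ->};
my($text)   = $prefix . 'a {b {c} d} e';

# Assumption: 'pos' and 'length' are the parameter names for the starting
# offset and for the number of characters to parse from that offset.
my($parser) = Text::Balanced::Marpa -> new
(
    open   => ['{'],
    close  => ['}'],
    pos    => length($prefix),
    length => length($text) - length($prefix),
);

my($result) = $parser -> parse(\$text);

# error_number() and error_message() report the last warning or error.
print "Parse result: $result\n";
print 'Last message: ', $parser -> error_number, '. ', $parser -> error_message, "\n";
```

Computing the offset from the prefix, rather than hard-coding it, keeps the two values in step if the leading text changes.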

Controlling the Parse

Various options are available to control the behaviour of the parser:

o nothing_is_fatal

This is the default.

o debug

Print lots of extra diagnostic output while parsing.

o print_warnings

Print any warnings triggered.

o overlap_is_fatal

Turn the overlap warning into a fatal error.

Overlap is when - in the input text - a delimiter is opened but, before it is closed, a different delimiter is closed.

o nesting_is_fatal

Turn the nested warning into a fatal error.

Nesting is when a delimiter is opened but, before it is closed, the same delimiter is opened again.

o ambiguity_is_fatal

When, presumably due to the precise delimiters you specify, Marpa finds that the resultant grammar is ambiguous, the result can be either a warning or fatal error.

o exhaustion_is_fatal

When the input text matches the grammar, but at the end of the parse there are lexemes left over in the input, you can choose to regard that as fatal or non-fatal.
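To make the definitions of overlap and nesting concrete, here is a small stand-alone sketch in plain Perl. It is not how Text::Balanced::Marpa works internally (that is a Marpa grammar), and classify() is a hypothetical helper invented for this illustration. It assumes every open and close delimiter is a distinct string, so quote-like pairs such as ('"', '"') are out of scope.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Balanced::Marpa: scan $text with a
# stack and report which of the two conditions (nesting, overlap) occur.
sub classify
{
    my($text, $open, $close) = @_;

    my(%close_for);
    @close_for{@$open} = @$close; # open delimiter => its close delimiter

    # Try longer delimiters first so a long one is never split by a short one.
    my($alternatives) = join('|', map { quotemeta($_) } sort { length($b) <=> length($a) } @$open, @$close);
    my(@stack, %seen);

    while ($text =~ /($alternatives)/g)
    {
        my($token) = $1;

        if (exists $close_for{$token})
        {
            # Nesting: the same delimiter is opened again before it is closed.
            $seen{nesting} = 1 if (grep { $_ eq $token } @stack);

            push @stack, $token;
        }
        elsif (@stack)
        {
            # Overlap: a different delimiter is closed before the most
            # recently opened one.
            $seen{overlap} = 1 if ($close_for{$stack[-1]} ne $token);

            pop @stack;
        }
    }

    return join(',', sort keys %seen) || 'none';
}

print classify('a {b {c} d} e', ['{'], ['}']), "\n";                          # nesting
print classify('a [% b <: c :> d %] e', ['<:', '[%'], [':>', '%]']), "\n";    # none
print classify('a [% b <: c %] d :> e', ['<:', '[%'], [':>', '%]']), "\n";    # overlap
```

The second call shows that putting one kind of delimiter inside a different kind counts as neither nesting nor overlap under these definitions.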

Processing the Output

By now I hope you're seriously wondering how the parsed text is made available to you after a parse.

Well, it's held in a tree, managed by the little module Tree.

Letting the Code Print Your Tree

Here is part of the output from scripts/tiny.pl (the delimiters are ('{', '}')):

        Parsing |I've already parsed up to here ->a {b {c} d} e|. pos: 33. length: 13
        root. Attributes: {}
           |--- text. Attributes: {text => "a "}
           |--- open. Attributes: {text => "{"}
           |   |--- text. Attributes: {text => "b "}
           |   |--- open. Attributes: {text => "{"}
           |   |   |--- text. Attributes: {text => "c"}
           |   |--- close. Attributes: {text => "}"}
           |   |--- text. Attributes: {text => " d"}
           |--- close. Attributes: {text => "}"}
           |--- text. Attributes: {text => " e"}
        Parse result: 0 (0 is success)

You can see the nesting: The outer ('{', '}') pair has a set of daughters for all contained text, and the same structure is evident for the inner pair too, although in this case there is just the one tree node holding 'c'.

Another example, this one from scripts/synopsis.pl (the delimiters are ('<:', ':>') and ('[%', '%]')):

        Parsing |a [% b <: c :> d %] e|
        root. Attributes: {}
           |--- text. Attributes: {text => "a "}
           |--- open. Attributes: {text => "[%"}
           |   |--- text. Attributes: {text => " b "}
           |   |--- open. Attributes: {text => "<:"}
           |   |   |--- text. Attributes: {text => " c "}
           |   |--- close. Attributes: {text => ":>"}
           |   |--- text. Attributes: {text => " d "}
           |--- close. Attributes: {text => "%]"}
           |--- text. Attributes: {text => " e"}
        Parse result: 0 (0 is success)

Writing Your Own Tree-walking Code

Here, from scripts/traverse.pl, is how it's done:

        if ($parser -> parse(\$text) == 0)
        {
                my($attributes);
                my($indent);
                my($text);

                for my $node ($parser -> tree -> traverse)
                {
                        next if ($node -> is_root);

                        $attributes = $node -> meta;
                        $text       = $$attributes{text};
                        $text       =~ s/^\s+//;
                        $text       =~ s/\s+$//;
                        $indent     = $node -> depth - 1;

                        print "\t" x $indent, "$text\n" if (length($text) );
                }
        }

This produces:

        <html>
            <head>
                <title>
                    A Title
                </title>
            </head>
            <body>
                <h1>
                    A H1 Heading
                </h1>
                <table>
                    <tr>
                        <td>
                            A table cell
                        </td>
                    </tr>
                </table>
            </body>
        </html>