Fancy Matching of Delimited Text
Extreme Simplicity
Major Features
Controlling the Parse
Processing the Output
Letting the Code Print Your Tree
Writing Your Own Tree-walking Code
If the problem is matching strings inside arbitrary delimiters, then, up until January 2015, Perl programmers have had modules such as Regexp::Common and Text::Balanced available to them.
These modules, and others like them, may well do what you want. But as I scrolled back and forth through their docs, what struck me was how much work the programmer has to do to help those modules do their job. That even includes worrying about void/scalar/list context.
But if you're comfortable with Marpa::R2, you'll be very tempted to say: Is there another way?
Well now, yes there is: Text::Balanced::Marpa. Of course, I can't guarantee my new module will solve every problem addressed by other modules (hint: it can't), but I will claim it is very easy to use.
One problem handled by Text::Balanced which I do not even consider is HERE documents.
Nevertheless, in this article I hope to convince you that Text::Balanced::Marpa will serve many of your requirements.
To use Text::Balanced::Marpa all you need do is specify open and close delimiters, and of course the input string.
Yes, that's it! Three measly little strings! No regexps, no options, nothing superfluous to the task.
But see "Controlling the Parse" if you have a desperate need for options anyway.
In what follows, all programs mentioned - scripts/*.pl and t/*.t - are shipped in the distro.
This code from t/utf8.t shows UTF-8 delimiters in use:
my($parser) = Text::Balanced::Marpa -> new
(
    open    => ['Δ'],
    close   => ['δ'],
    options => overlap_is_fatal,
);
This code from t/multiple.quotes.t shows several sets of delimiters, quotes included, in a single run:
my($parser) = Text::Balanced::Marpa -> new
(
    open    => ['<', '{', '[', '(', '"', "'"],
    close   => ['>', '}', ']', ')', '"', "'"],
    options => overlap_is_fatal,
);
All delimiters are treated equally. That is, there is no hierarchy of delimiters.
Overlapping usage of an (open, close) pair, or nesting of same, is discussed below.
Delimiters are paired by position in the open and close lists, so the # of opening delimiters must be the same as the # of closing delimiters.
And as you have seen just above, the closing delimiter may or may not be the same as the corresponding opening delimiter.
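To make the pairing rule concrete, here is a minimal sketch in plain Perl. It assumes pairs are formed by position, as the examples in this article suggest. The sub pair_delimiters() is a hypothetical helper written for this illustration; it is not part of Text::Balanced::Marpa.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Balanced::Marpa): pair the open
# and close lists by position, refusing lists of unequal length.
sub pair_delimiters
{
    my($open, $close) = @_;

    die 'open has ' . scalar(@$open) . ' items but close has ' . scalar(@$close) . "\n"
        if (scalar(@$open) != scalar(@$close) );

    # Each pair is an arrayref: [opening delimiter, closing delimiter].
    return map { [$open->[$_], $close->[$_] ] } 0 .. $#$open;
}

my(@pairs) = pair_delimiters(['<', '{', '"'], ['>', '}', '"']);

# Print one pair per line, opening delimiter first.
print "$_->[0] ... $_->[1]\n" for @pairs;
```

Note that the third pair uses the same string for both ends, which the pairing step does not care about.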
Here, from t/html.t and scripts/traverse.pl, is a silly example:
my($parser) = Text::Balanced::Marpa -> new
(
    open =>
    [
        '<html>', '<head>', '<title>', '<body>',
        '<h1>', '<table>', '<tr>', '<td>',
    ],
    close =>
    [
        '</html>', '</head>', '</title>', '</body>',
        '</h1>', '</table>', '</tr>', '</td>',
    ],
);
The docs discuss why using this module to parse HTML is not a good idea, but yes, it's sort-of possible.
And t/angle.brackets.t shows yet again why Marpa::R2::HTML is really what you want for HTML.
So, your pairs of (open, close) could be (']', '!') and ('=>', 'q}') in the same run.
No, I can't see the point of that either, but it will work.
That is not surprising, but consider ...
Now that's flexibility plus.
I guess that could be called 'syntactic sugar'.
This code in t/perl.delimiters.t shows clearly what I mean:
my($parser) = Text::Balanced::Marpa -> new
(
    open  => ['qw/', 'qr/', 'q|', 'qq|'],
    close => [  '/',   '/',  '|',   '|'],
);
So, both 'q|' and 'qq|' can be opening delimiters, and '|' can appear in various places within the set of delimiters used in a single run.
Both warning and error messages are returned by the methods error_number() and error_message().
Various options are available to control the behaviour of the parser:
This is the default.
Print lots of stuff.
Print any warnings triggered.
Turn the overlap warning into a fatal error.
Overlap is when - in the input text - a delimiter is opened but, before it is closed, a different delimiter is closed.
Turn the nested warning into a fatal error.
Nesting is when a delimiter is opened but, before it is closed, the same delimiter is opened again.
When, presumably due to the precise delimiters you specify, Marpa finds that the resultant grammar is ambiguous, the result can be either a warning or a fatal error, as you choose.
When the input text matches the grammar, but at the end of the parse there are lexemes left over in the input, you can choose to regard that as fatal or non-fatal.
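The definitions of overlap and nesting given above can be illustrated with a small stack-based classifier in plain Perl. This is a toy written for this article, not how the module detects these conditions (that is the job of the Marpa grammar). The sub classify() and its message strings are hypothetical, and the sketch assumes the input has already been reduced to a list of delimiter tokens and that each opening delimiter differs from its closing delimiter.

```perl
use strict;
use warnings;

# Toy classifier, not the module's implementation. Assumes each opening
# delimiter differs from its closing delimiter.
sub classify
{
    my($close_for, @tokens) = @_;
    my(%is_close) = map { ($_ => 1) } values %$close_for;

    my(@stack, @found);

    for my $token (@tokens)
    {
        if (exists $close_for->{$token})
        {
            # Nesting: this delimiter is already open somewhere on the stack.
            push @found, "nesting of $token" if (grep { $_ eq $token } @stack);
            push @stack, $token;
        }
        elsif ($is_close{$token})
        {
            # Overlap: the innermost open delimiter is not the one being closed.
            push @found, "overlap at $token"
                if (@stack && ($close_for->{$stack[-1]} ne $token) );

            # Discard the most recent open delimiter which this token closes.
            for my $i (reverse 0 .. $#stack)
            {
                if ($close_for->{$stack[$i]} eq $token)
                {
                    splice(@stack, $i, 1);
                    last;
                }
            }
        }
    }

    return @found;
}

my(%close_for) = ('{' => '}', '[' => ']');

print join(', ', classify(\%close_for, qw/{ [ } ]/) ), "\n"; # An overlap.
print join(', ', classify(\%close_for, qw/{ { } }/) ), "\n"; # A nesting.
```

Whether either finding is merely reported or stops the parse is exactly what the overlap and nesting options decide.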
By now I hope you're seriously wondering how the parsed text is made available to you after a parse.
Well, it's held in a tree, managed by the little module Tree.
Here is part of the output from scripts/tiny.pl (the delimiters are ('{', '}')):
Parsing |I've already parsed up to here ->a {b {c} d} e|. pos: 33. length: 13
root. Attributes: {}
   |--- text. Attributes: {text => "a "}
   |--- open. Attributes: {text => "{"}
   |   |--- text. Attributes: {text => "b "}
   |   |--- open. Attributes: {text => "{"}
   |   |   |--- text. Attributes: {text => "c"}
   |   |--- close. Attributes: {text => "}"}
   |   |--- text. Attributes: {text => " d"}
   |--- close. Attributes: {text => "}"}
   |--- text. Attributes: {text => " e"}
Parse result: 0 (0 is success)
You can see the nesting: The outer ('{', '}') pair has a set of daughters for all contained text, and the same structure is evident for the inner pair too, although in this case there is just the one tree node, holding 'c'.
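The shape of that tree can be reproduced with a toy scanner in plain Perl, which may help when reading the module's output. This is only a sketch: outline() is a hypothetical sub written for this article, it handles a single ('{', '}') pair, it assumes the input is balanced, and it has nothing to do with how the module actually parses (which is via a Marpa grammar).

```perl
use strict;
use warnings;

# Toy scanner for one ('{', '}') pair. Assumes balanced input.
sub outline
{
    my($text)  = @_;
    my($depth) = 0;
    my(@lines);

    # The capturing parens make split() return the delimiters too.
    for my $piece (split(/([{}])/, $text) )
    {
        next if (length($piece) == 0);

        if ($piece eq '{')
        {
            # Contained text becomes daughters of the open node.
            push @lines, ('  ' x $depth) . "open  $piece";
            $depth++;
        }
        elsif ($piece eq '}')
        {
            # The close node sits at the same level as its open node.
            $depth--;
            push @lines, ('  ' x $depth) . "close $piece";
        }
        else
        {
            push @lines, ('  ' x $depth) . "text  \"$piece\"";
        }
    }

    return @lines;
}

print "$_\n" for outline('a {b {c} d} e');
```

As in the tree above, the text 'c' ends up one level deeper than the inner open and close, which in turn are one level deeper than the outer pair.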
Another example, this from scripts/synopsis.pl (the delimiters are ('<:', ':>') and ('[%', '%]')):
Parsing |a [% b <: c :> d %] e|
root. Attributes: {}
   |--- text. Attributes: {text => "a "}
   |--- open. Attributes: {text => "[%"}
   |   |--- text. Attributes: {text => " b "}
   |   |--- open. Attributes: {text => "<:"}
   |   |   |--- text. Attributes: {text => " c "}
   |   |--- close. Attributes: {text => ":>"}
   |   |--- text. Attributes: {text => " d "}
   |--- close. Attributes: {text => "%]"}
   |--- text. Attributes: {text => " e"}
Parse result: 0 (0 is success)
Here, from scripts/traverse.pl, is how it's done:
if ($parser -> parse(\$text) == 0)
{
    my($attributes);
    my($indent);
    my($text);

    for my $node ($parser -> tree -> traverse)
    {
        # The root node holds no text.
        next if ($node -> is_root);

        $attributes = $node -> meta;
        $text       = $$attributes{text};

        # Trim leading and trailing whitespace.
        $text =~ s/^\s+//;
        $text =~ s/\s+$//;

        # Indent each node according to its depth in the tree.
        $indent = $node -> depth - 1;

        print "\t" x $indent, "$text\n" if (length($text) );
    }
}
This produces:
<html>
    <head>
        <title>
            A Title
        </title>
    </head>
    <body>
        <h1>
            A H1 Heading
        </h1>
        <table>
            <tr>
                <td>
                    A table cell
                </td>
            </tr>
        </table>
    </body>
</html>