Understanding Marpa-style action subs parameters

At Home with Marpa

The BNF aka the DSL

The Grammar

My Rules Rule

The Question

The Problem

An Aside about Events

Action Subs

Parameters

Sample Grammar + Its Action Sub

First Step in Understanding

Second Step in Understanding

The rules which constitute a date

The Third Step in Understanding

Wrapping Up and Winding Down

References

Understanding Marpa-style action subs parameters

You will need to have at least a basic idea about Marpa before making much sense of what follows.

At Home with Marpa

Marpa's homepage.

The FAQ.

Introductory Material.

The BNF aka the DSL

The BNF or Domain-specific Language used by Marpa is what we use to write the grammar rules to be input to Marpa::R2.

The Grammar

For this article, I'm going to base everything on the grammar in Genealogy::Gedcom::Date V 2, which I've just released.

That module contains the grammar for dates as defined in the GEDCOM (GEnealogical Data Communication) spec, p 45.

My Rules Rule

You can see the rules here, but the rendition does not respect the fact that I use tabs:

https://metacpan.org/source/RSAVAGE/Genealogy-Gedcom-Date-2.00/lib/Genealogy/Gedcom/Date.pm#L141.

The Question

Given this grammar, the question is "How can we understand the parameters passed to the action subs declared in https://metacpan.org/source/RSAVAGE/Genealogy-Gedcom-Date-2.01/lib/Genealogy/Gedcom/Date/Actions.pm"?

The Problem

For those of you unfamiliar with Marpa::R2, that question may well not make sense.

Put briefly, when we declare a Marpa grammar, we can either attach action subs to rules or attach events to the L0 lexemes. And yes, it's possible to do both, but I've never seen a use case for that.

L0? That's the lexeme level of the grammar, i.e. the rules declared with '~' rather than with '::='.

With events, you ask Marpa to (roughly speaking) process a lexeme at a time, and Marpa triggers an event if there is a rule specifying that the lexeme just encountered in the input stream has that event attached to it.

You may wonder how you test for events. Marpa has the astonishing ability to effortlessly and endlessly exit from whatever it's doing to execute caller code and to then return - yes, into Marpa - at precisely the point where it briefly left to run the caller's code.

So, during that interval in which caller code is running you check for any and all events triggered, make decisions, and then re-enter Marpa.

Amazingly, those decisions can include your own parsing, called external parsing (i.e. external to Marpa), which can even change the point at which Marpa's internal parsing continues.

That means you can now process the lexemes your external parser has just found, and tell Marpa to skip them when it resumes. Further, you can tell Marpa to step forward or backward within the input stream before resuming. Try that with any other parser.

This is how I handled whitespace differently depending on whether it was inside or outside quotes when I wrote the code which uses Marpa to parse Graphviz DOT files.

And here's the article I wrote on that.
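To make the pause-and-resume idea concrete, here is a minimal sketch of an event loop. It assumes Marpa::R2 is installed. The grammar, the event name 'word_seen' and the input are all made up for illustration; none of them come from Genealogy::Gedcom::Date. The sketch only shows Marpa handing control back to the caller after each lexeme. It does no external parsing of its own.

```perl
use strict;
use warnings;

use Marpa::R2;

# Hypothetical grammar: a list of words, with an event after each word.
my $dsl = <<'END_OF_DSL';
:default ::= action => [values]
lexeme default = latm => 1

words ::= word+

:lexeme ~ word pause => after event => 'word_seen'
word    ~ [\w]+

:discard   ~ whitespace
whitespace ~ [\s]+
END_OF_DSL

my $grammar = Marpa::R2::Scanless::G->new({ source => \$dsl });
my $recce   = Marpa::R2::Scanless::R->new({ grammar => $grammar });
my $input   = 'Int 21 Jun';
my @seen;

# read() returns as soon as an event fires, or at the end of the input.
my $pos = $recce->read(\$input);

while (1)
{
    # We are now in caller code. Check which events were triggered.
    for my $event (@{ $recce->events })
    {
        my ($name)           = @{$event};
        my ($start, $length) = $recce->pause_span;

        push @seen, [$name, $recce->literal($start, $length)];
    }

    last if $pos >= length $input;

    # Re-enter Marpa at the point where it paused.
    $pos = $recce->resume;
}

print "$$_[0]: $$_[1]\n" for @seen;
```

Each pass through the loop is one of those intervals in which caller code runs, and resume() is the re-entry into Marpa.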

But, enough of the fun stuff. Back to the action subs.

Action Subs

So here we are discussing action subs, which Marpa calls when rules match text. Specifically, we're talking about the parameters passed to those subs.

Parameters

Golden rule: All parameters passed to action subs are arrayrefs. And why is that?

Because:

o An arrayref means it's the address of a data structure: It is not the data structure itself, so the parameter is much smaller than most data structures.
o An array can hold an arbitrary number of elements: So, it allows Marpa to pass a tiny thing, an arrayref, which - potentially - represents a huge data structure.
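A small pure-Perl sketch of that point, with no Marpa involved. The sub name inspect_parameter and the 10,000-element array are hypothetical; the sub just follows the ($cache, $t1) convention used by the action subs in this article.

```perl
use strict;
use warnings;

# Hypothetical action sub: it only inspects what it was given.
sub inspect_parameter
{
    my ($cache, $t1) = @_;

    my %report =
    (
        type  => ref $t1,        # 'ARRAY' when $t1 is an arrayref.
        count => scalar @{$t1},  # Number of elements it leads to.
    );

    return \%report;
}

my @big_structure = (1 .. 10_000);

# Only the reference is passed, not a copy of the 10,000 elements.
my $report = inspect_parameter({}, \@big_structure);

print "Type: $$report{type}. Elements: $$report{count}\n";
```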

Sample Grammar + Its Action Sub

To be concrete, let's consider the date 'Int 21 Jun 1950 (Comment)' as the input stream.

In the case of these rules:

        date_value        ::= date_period
                            | date_range                                           # Action sub
                            | approximated_date                                    # parameters
                            | interpreted_date         action => interpreted_date  # ($t1)
                            | ('(') date_phrase (')')  action => date_phrase       # ($t1)

        interpreted_date  ::= interpreted date ('(') date_phrase (')')

        date_phrase       ::= date_text

The action sub I wish to focus on is:

        sub interpreted_date
        {
                my($cache, $t1) = @_;
                my($t2)      = $$t1[1][1][0];
                $$t2{flag}   = 'INT';
                $$t2{phrase} = "($$t1[2][0])";

                return [$$t1[1][0], $t2];

        } # End of interpreted_date.

And when the module is given that date, this sub is called with one (1) parameter after the per-parse variable $cache. That parameter, $t1, looks like this (as displayed by Data::Dumper::Concise):

        [
          "Int",
          [
            [],
            [
              {
                day => 21,
                kind => "Date",
                month => "Jun",
                type => "Gregorian",
                year => 1950
              }
            ]
          ],
          [
            "Comment"
          ]
        ]

Let's analyze this structure. It's an arrayref obviously. And what's in it? The components:

o A string, "Int"

This is clearly just the first lexeme in the input stream.

o A complex arrayref

Details below.

o A simple arrayref

Here, the only element within the array is what the GEDCOM spec calls a Date Phrase, the 'Comment'. And yes, the '(' and ')' have been discarded, which is in accordance with the grammar, which says:

        interpreted_date  ::= interpreted date ('(') date_phrase (')')

That is, '(' and ')' are discarded using Marpa's convention that anything in parentheses is flagged by the author (of the grammar) as not worth preserving.
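One way to get a feel for the indexing is to call the action sub by hand, outside Marpa. The sketch below copies interpreted_date from above and feeds it a hand-built copy of the dumped structure. The empty hashref passed as $cache is an assumption made for this test; the sub does not use it.

```perl
use strict;
use warnings;

# The action sub, as shown above.
sub interpreted_date
{
    my($cache, $t1) = @_;
    my($t2)      = $$t1[1][1][0];    # The date hashref.
    $$t2{flag}   = 'INT';
    $$t2{phrase} = "($$t1[2][0])";   # Put the parentheses back.

    return [$$t1[1][0], $t2];
}

# A hand-built copy of the structure Marpa passed in.
my $t1 =
[
    'Int',
    [
        [],
        [
            {day => 21, kind => 'Date', month => 'Jun', type => 'Gregorian', year => 1950},
        ],
    ],
    ['Comment'],
];

my $result = interpreted_date({}, $t1);

# Element 0 is whatever the calendar escape produced (here, an empty arrayref).
# Element 1 is the date hashref, now with 2 extra keys.
print 'Escape is a(n) ', ref $$result[0], ' with ', scalar @{$$result[0]}, " elements\n";
print "$_ => $$result[1]{$_}\n" for sort keys %{$$result[1]};
```

Note that the sub modifies the date hashref in place, via the reference, rather than building a new one.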

First Step in Understanding

To repeat the structure of $t1:

        [
          "Int",
          [
            [],
            [
              {
                day => 21,
                kind => "Date",
                month => "Jun",
                type => "Gregorian",
                year => 1950
              }
            ]
          ],
          [
            "Comment"
          ]
        ]

And to repeat the rule:

        interpreted_date  ::= interpreted date ('(') date_phrase (')')

It consists of 3 parts (worth preserving):

o interpreted

Which is defined to be:

        interpreted  ~ 'int':i
                         | 'interpreted':i

So the first element of the parameter to the action sub, $t1, will always be one of these 2 (case-insensitive) words.

o date

The second element is a date. But it's another arrayref! No matter, we'll get to that. See "Second Step in Understanding" below.

o date_phrase

And the 3rd and last element is not just the phrase, but another arrayref holding the phrase. But why is it an arrayref, and not just a string like 'Int'?

Have another look at the grammar:

        interpreted_date  ::= interpreted date ('(') date_phrase (')')

        date_phrase       ::= date_text

Neither of these rules declares an action sub, but that's OK. At the very top of the original grammar I used:

        :default ::= action => [values]

which you can think of as meaning "Let a rule without an action sub return an arrayref holding the values of the symbols on its right-hand side".

Unlike interpreted, the Date Phrase is not a lexeme declared with '~'. It is declared with '::=', and its right-hand side is another symbol, date_text, so the default action wraps the value of date_text in an arrayref.

So the nesting of:

          [
            "Comment"
          ]

reflects the nesting of the 2 rules just above.
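Here is a cut-down grammar which reproduces that nesting, as a sketch. It assumes Marpa::R2 is installed. The lexemes year and date_text are simplified stand-ins I made up for this example (the real grammar's date is far richer), so treat it as an illustration of ':default ::= action => [values]' and of the parenthesized symbols, not as the module's grammar.

```perl
use strict;
use warnings;

use Data::Dumper;
use Marpa::R2;

my $dsl = <<'END_OF_DSL';
:default ::= action => [values]
lexeme default = latm => 1

interpreted_date ::= interpreted year ('(') date_phrase (')')
date_phrase      ::= date_text

interpreted ~ 'int':i
year        ~ [\d]+
date_text   ~ [\w]+

:discard   ~ whitespace
whitespace ~ [\s]+
END_OF_DSL

my $grammar = Marpa::R2::Scanless::G->new({ source => \$dsl });
my $recce   = Marpa::R2::Scanless::R->new({ grammar => $grammar });
my $input   = 'Int 1950 (Comment)';

$recce->read(\$input);

my $value_ref = $recce->value;

die "Parse failed\n" if !defined $value_ref;

my $value = ${$value_ref};

# interpreted and year are lexemes, so they arrive as plain strings.
# date_phrase is a '::=' rule, so its value arrives wrapped in an arrayref.
# The '(' and ')' are hidden, so they do not arrive at all.
local $Data::Dumper::Indent = 1;

print Dumper($value);
```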

Second Step in Understanding

We need to examine this thing called date. So, we turn to a study of which rules are triggered when a date is detected in the input stream.

Before we start, here again is that array element:

        [
          [],
          [
            {
              day => 21,
              kind => "Date",
              month => "Jun",
              type => "Gregorian",
              year => 1950
            }
          ]
        ],

It is an arrayref in its own right, and has 2 elements:

o An empty arrayref

Which needs explaining.

o Another arrayref

Which contains a single element, a hashref.

Since Marpa only ever auto-generates arrayrefs, the mere fact that it's a hashref tells us it was returned by an action sub. And whoever wrote those action subs must have decided that a hashref was the best structure to represent whatever they wanted to return from that sub. And that's exactly what I did.

The rules which constitute a `date`

Here they are:

        date  ::= calendar_escape calendar_date

Yep. That's it. One rule.

And since the input stream did not contain a calendar escape, we now know this:

o The rule contains 2 symbols on the right-hand side

o There are 2 entries in the array element

o The input stream did not contain a calendar escape

o The first element of the arrayref is empty

This raises the question: What happens if we add a calendar escape to the input stream?

Try 'Int Gregorian 21 Jun 1950 (Comment)' and we get:

  [
    {
      kind => "Calendar",
      type => "Gregorian"
    },
    [
      {
        day => 21,
        kind => "Date",
        month => "Jun",
        type => "Gregorian",
        year => 1950
      }
    ]
  ],

Ah, ha! The first element is no longer an empty arrayref. Putting a calendar escape in the input stream has triggered a call to the calendar action sub, which is:

        sub calendar_name
        {
                my($cache, $t1) = @_;
                $t1 =~ s/\@\#d(.+)\@/$1/; # Zap gobbledegook if present.
                $t1 = ucfirst lc $t1;

                return
                {
                        kind => 'Calendar',
                        type => $t1,
                };

        } # End of calendar_name.

And look! It returns a hashref, exactly corresponding to the hashref we got.

So that's why we initially got an arrayref, and why it was empty.
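The two cases can be told apart in code with ref(). The sketch below copies calendar_name from above and adds a hypothetical helper, escape_type, which is not part of the module. It is run by hand here, outside Marpa, with an empty hashref standing in for $cache.

```perl
use strict;
use warnings;

# The action sub, as shown above.
sub calendar_name
{
    my($cache, $t1) = @_;
    $t1 =~ s/\@\#d(.+)\@/$1/; # Zap gobbledegook if present.
    $t1 = ucfirst lc $t1;

    return {kind => 'Calendar', type => $t1};
}

# Hypothetical helper: report the calendar escape in a date element, if any.
sub escape_type
{
    my($date)   = @_;
    my($escape) = $$date[0];

    # A hashref means an action sub ran. An arrayref means Marpa built it.
    return ref $escape eq 'HASH' ? $$escape{type} : 'none';
}

my $date_hash    = {day => 21, kind => 'Date', month => 'Jun', type => 'Gregorian', year => 1950};
my $without      = [ [], [$date_hash] ];
my $with         = [ calendar_name({}, 'Gregorian'), [$date_hash] ];
my $with_escaped = [ calendar_name({}, '@#dJULIAN@'), [$date_hash] ];

print 'Without escape: ', escape_type($without), "\n";
print 'With escape:    ', escape_type($with), "\n";
print 'With @#d...@:   ', escape_type($with_escaped), "\n";
```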

The Third Step in Understanding

We're left with the task of understanding the origin of the 2nd element in the arrayref:

        [
          {
            day => 21,
            kind => "Date",
            month => "Jun",
            type => "Gregorian",
            year => 1950
          }
        ]

Well, the grammar said:

        date  ::= calendar_escape calendar_date

So obviously the 2nd element represents the result of calendar_date.

Now, calendar_date has no action sub of its own, but it participates in this rule:

        calendar_date  ::= gregorian_date  action => gregorian_date  # ($t1)
                       | julian_date       action => julian_date     # ($t1)
                       | french_date       action => french_date     # ($t1)
                       | german_date       action => german_date     # ($t1)
                       | hebrew_date       action => hebrew_date     # ($t1)

And each of these has just one symbol on the right-hand side, suggesting that each of these subs will take one parameter. And what they return - of course - depends, as always, on the author of those subs. Let's examine gregorian_date (the case we're dealing with):

        sub gregorian_date
        {
                my($cache, $t1) = @_;

                # Is it a BCE date? If so, it's already a hashref.

                if (ref($$t1[0]) eq 'HASH')
                {
                        return $$t1[0];
                }

                my($day);
                my($month);
                my($year);

                # Check for year, month, day.

                if ($#$t1 == 0)
                {
                        $year = $$t1[0];
                }
                elsif ($#$t1 == 1)
                {
                        $month = $$t1[0];
                        $year  = $$t1[1];
                }
                else
                {
                        $day   = $$t1[0];
                        $month = $$t1[1];
                        $year  = $$t1[2];
                }

                my($result) =
                {
                        kind  => 'Date',
                        type  => 'Gregorian',
                        year  => $year,
                };

                # Check for /00.

                if ($year =~ m|/|)
                {
                        ($$result{year}, $$result{suffix}) = split(m|/|, $year);
                }

                $$result{month} = $month if (defined $month);
                $$result{day}   = $day   if (defined $day);
                $result         = [$result];

                return $result;

        } # End of gregorian_date.

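And, as expected, it returns a hashref, although not always directly. There are actually 2 places which return. The first returns $$t1[0], which is already a hashref (the BCE case). The 2nd returns an arrayref holding a single hashref, because of the line '$result = [$result]', and that hashref starts life as: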

        my($result) =
        {
                kind  => 'Date',
                type  => 'Gregorian',
                year  => $year,
        };

And that's augmented with various bits depending on what was found in the input stream.

So what was found? A day (21), a month (Jun) and a year (1950). And that's what appears in the structure returned:

        [
          {
            day => 21,
            kind => "Date",
            month => "Jun",
            type => "Gregorian",
            year => 1950
          }
        ]
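As before, calling the sub by hand shows how its result depends on the number of elements in $t1. The sketch below copies gregorian_date from above (with the blank lines squeezed out) and runs it outside Marpa. The 3 sample inputs are my own, and the empty hashref stands in for $cache.

```perl
use strict;
use warnings;

# The action sub, as shown above.
sub gregorian_date
{
    my($cache, $t1) = @_;

    # Is it a BCE date? If so, it's already a hashref.
    return $$t1[0] if (ref($$t1[0]) eq 'HASH');

    my($day, $month, $year);

    # Check for year, month, day.
    if    ($#$t1 == 0) { $year = $$t1[0]; }
    elsif ($#$t1 == 1) { ($month, $year) = @$t1; }
    else               { ($day, $month, $year) = @$t1; }

    my($result) = {kind => 'Date', type => 'Gregorian', year => $year};

    # Check for /00.
    if ($year =~ m|/|)
    {
        ($$result{year}, $$result{suffix}) = split(m|/|, $year);
    }

    $$result{month} = $month if (defined $month);
    $$result{day}   = $day   if (defined $day);

    return [$result];
}

my $full      = gregorian_date({}, [21, 'Jun', 1950]);
my $year_only = gregorian_date({}, [1950]);
my $dual_year = gregorian_date({}, ['1699/00']);

for my $case ($full, $year_only, $dual_year)
{
    my $hash = $$case[0]; # Unwrap the arrayref.

    print join(', ', map { "$_ => $$hash{$_}" } sort keys %$hash), "\n";
}
```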

Wrapping Up and Winding Down

I hope this article has helped clarify the possibly mystifying way in which parameters are passed into action subs.

Further, I hope it explains the way that each such parameter will be structured as an arrayref, and the way these arrayrefs can be nested to any depth.

And that the nesting arises because the value of each right-hand side symbol carries the results of all the rules on which that symbol depends.

Note: If the list of parameters to your action sub seems to be missing elements, check the start of your BNF! You may have used ':default ::= action => ::first', which causes only the first symbol returned by a rule to be passed to your action sub. Instead, consider using ':default ::= action => [values]', or ':default ::= action => [name, values]'. See here for details. And don't forget the default for ':default'. Yep, you guessed it, it's '::first', so you should always specify ':default' explicitly, to avoid this particular problem, especially if you are writing for an audience containing people new to Marpa.
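To see the difference between those 3 settings, here is a sketch which parses the same input under each of them. It assumes Marpa::R2 is installed. The helper value_with_default and the tiny 'pair' grammar are made up for this comparison.

```perl
use strict;
use warnings;

use Marpa::R2;

# Hypothetical helper: parse 'Jun 1950' using the given ':default' action.
sub value_with_default
{
    my($default_action) = @_;

    my $dsl = ":default ::= action => $default_action\n" . <<'END_OF_DSL';
pair ::= word word

word ~ [\w]+

:discard   ~ whitespace
whitespace ~ [\s]+
END_OF_DSL

    my $grammar = Marpa::R2::Scanless::G->new({ source => \$dsl });
    my $recce   = Marpa::R2::Scanless::R->new({ grammar => $grammar });
    my $input   = 'Jun 1950';

    $recce->read(\$input);

    my $value_ref = $recce->value;

    die "Parse failed\n" if !defined $value_ref;

    return ${$value_ref};
}

my $first  = value_with_default('::first');         # Just the first word.
my $values = value_with_default('[values]');        # Both words.
my $named  = value_with_default('[name, values]');  # Rule name, then both words.

print "::first:        $first\n";
print '[values]:       ', join(', ', @$values), "\n";
print '[name, values]: ', join(', ', @$named), "\n";
```

With '::first', the 1950 never reaches the caller, which is exactly the "missing elements" symptom described in the note.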

Also, each of the rules and sub-rules is free to return any structure; I just happen to like returning hashrefs. This in turn reinforces the fact that the author of the action subs decides what each action sub returns.

Lastly, a completely different way of returning data is via the per-parse variable, which is always the first parameter passed to an action sub. For example code, see https://github.com/ronsavage/Image-Magick-CommandParser. This module may not yet be on CPAN, hence the link to github.

Briefly, we do this:

        # Start the BNF:

        :default ::= action => [values]
        ...

        # Set up a stack, $self -> items, and a logger.

        $self -> items(Set::Array -> new);
        $self -> logger(Log::Handler -> new);
        $self -> logger -> add(...);

And then, later:

        # Pass the above to value(), which calls the action subs:

        my($cache) =
        {
                items  => $self -> items,
                logger => $self -> logger,
        };

        $self -> recce -> value($cache);

        # An action sub from Image::Magick::CommandParser,
        # where I choose to push hashrefs onto the stack:

        sub action_set
        {
                my($cache, @param) = @_;
                my($name)          = 'action_set';

                $$cache{logger} -> log(debug => $name);
                $$cache{items} -> push
                ({
                        param => decode_result(\@param),
                        rule  => $name,
                });

                return $param[0]; # We don't care what's returned!

        } # End of action_set.

        # Back in the calling code just after calling value(), loop over the stack.

        for my $item ($self -> items -> print)
        {
                # Process $$item{param} and $$item{rule}.
        }
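For anyone who wants to run something end to end, here is a self-contained sketch of the same technique. It assumes Marpa::R2 is installed, and it swaps the Set::Array stack and the logger for a plain Perl array to keep the dependencies down. The grammar, the package My::Actions and the sub record_setting are all hypothetical; they are not taken from Image::Magick::CommandParser.

```perl
use strict;
use warnings;

use Marpa::R2;

package My::Actions; # Hypothetical semantics package.

sub record_setting
{
    my($cache, $name, $number) = @_;

    # Return data via the per-parse variable, not via the return value.
    push @{$$cache{items}}, {param => [$name, $number], rule => 'record_setting'};

    return $name; # We don't care what's returned!
}

package main;

my $dsl = <<'END_OF_DSL';
:default ::= action => [values]
lexeme default = latm => 1

settings ::= setting+
setting  ::= name ('=') number  action => record_setting

name   ~ [a-z]+
number ~ [\d]+

:discard   ~ whitespace
whitespace ~ [\s]+
END_OF_DSL

my $grammar = Marpa::R2::Scanless::G->new({ source => \$dsl });
my $recce   = Marpa::R2::Scanless::R->new
({
    grammar           => $grammar,
    semantics_package => 'My::Actions',
});
my $input = 'size=10 depth=8';

$recce->read(\$input);

# The per-parse variable: a plain hashref holding a plain array.
my $cache     = {items => []};
my $value_ref = $recce->value($cache);

die "Parse failed\n" if !defined $value_ref;

# Back in the calling code, loop over the stack.
for my $item (@{$$cache{items}})
{
    print "$$item{rule}: ", join(' = ', @{$$item{param}}), "\n";
}
```

Notice that this action sub is attached to a rule with 2 visible right-hand side symbols, so it receives 2 parameters after $cache, one per symbol.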

References

Marpa's homepage

Ron's homepage
