The Genealogy::Gedcom Namespace

Table of contents

The Genealogy::Gedcom Namespace
FAQ
Which version of GEDCOM is the default for these modules?
Where does all this leave Paul Johnson's Gedcom.pm?
What will the new modules do?
Yes, all well and good, but what's the point of re-writing Gedcom.pm?
Why do states in that STT appear to be doubled?
How do these new modules handle non-standard tags?
What about extensions to GEDCOM 5.5.1?
References
The Gedcom Mailing List

The Genealogy::Gedcom Namespace

This document is about the new set of modules I (Ron Savage) am writing (2011-08-12):

A) Genealogy::Gedcom
B) Genealogy::Gedcom::Reader
C) Genealogy::Gedcom::Reader::Lexer
D) Genealogy::Gedcom::Reader::Lexer::DFA
E) Genealogy::Gedcom::Reader::Parser
F) Genealogy::Gedcom::Writer

The basic code, and hence the directory structure, of this set is copied directly from Graph::Easy::Marpa, so yes, the parser will use Marpa.

FAQ

Which version of GEDCOM is the default for these modules?

DRAFT Release 5.5.1, in Ged551-5.pdf. See "References" for downloading details.

Where does all this leave Paul Johnson's Gedcom.pm?

People using that module should continue to use it indefinitely.

I do not see Genealogy::Gedcom as being a drop-in replacement for Gedcom.pm.

Nevertheless, if and when G::G reaches V 1.00 (what I call production quality), it might be a consideration for new code, since the two modules deal with identical input and will naturally have a number of methods in common.

What will the new modules do?

A) Genealogy::Gedcom is a dummy module, which will one day have methods to directly manipulate data at the individual and family level.

The following modules all operate at a lower level.

B) Genealogy::Gedcom::Reader is a wrapper which calls both the lexer and the parser.
C) Genealogy::Gedcom::Reader::Lexer is called by Genealogy::Gedcom::Reader, and can be called from any module.

It reads the GEDCOM file, and lexes it, meaning it identifies tokens in the input stream, but does not assign meaning to those tokens. The parser assigns meaning to them.

Experts in the field are free to disagree with my casual definitions of lexing and parsing.

I call the GEDCOM file 'raw' data, so the CSV file output by the lexer (see below) is called 'cooked' data. This terminology helps name some of the many command line options available.

Outputs supported by the lexer:

o A RAM-based array (of tokens), to be passed to the parser, or to any other module.

Note: In this array, of type Set::Array, each element is a hashref with these key => value pairs:

        count      => $myself -> _count,
        data       => defined($field[2]) ? $field[2] : '', # To allow for $field[2] eq '0'.
        level      => $field[0],
        line_count => $myself -> line_count,
        tag        => $field[1],
        type       => $type,

where @field = split(/\s+/, $input_record, 3).

The code reading the file removes leading and trailing spaces, and this usage of split handles indented GEDCOM files.
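Putting those pieces together, here is a self-contained sketch of the tokenizing step. It is illustrative only: in particular, the real lexer derives 'type' from context via the DFA, whereas this sketch hard-codes it.

        #!/usr/bin/env perl
        use strict;
        use warnings;

        my(@tokens);
        my($count)      = 0;
        my($line_count) = 0;

        while (my $input_record = <DATA>)
        {
            chomp $input_record;

            $line_count++;

            $input_record =~ s/^\s+//; # Handles indented GEDCOM files.
            $input_record =~ s/\s+$//; # But see the caveat below, re trailing spaces.

            next if ($input_record eq ''); # Ignore (but still count) blank lines.

            my(@field) = split(/\s+/, $input_record, 3);

            push @tokens,
            {
                count      => ++$count,
                data       => defined($field[2]) ? $field[2] : '', # To allow for $field[2] eq '0'.
                level      => $field[0],
                line_count => $line_count,
                tag        => $field[1],
                type       => 'individual', # Hard-coded here. The real lexer uses the DFA.
            };
        }

        print "$$_{count}: level=$$_{level}, tag=$$_{tag}, data='$$_{data}'\n" for @tokens;

        __DATA__
        0 @I1@ INDI
        1 BIRT
        2 DATE 20 Dec 1775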

The removal of trailing spaces may cause inconvenience when the file deliberately contains a trailing space. See p 10 of the GEDCOM document, where the CONC and CONT tags are discussed. Consequently, the removal of trailing spaces may be dropped from the code, or made optional.

        count is the token count, 1 .. N. Hence it's just the array index + 1.
        data is '20 Dec 1775', in the GEDCOM record '2     DATE 20 Dec 1775'.
        level is '2', in that record.
        line_count is that record's line number within the input file. This helps identify errors in the file.
        tag is 'DATE', within that record.
        type is the context of that record. If the parent of that record is 'BIRT', and BIRT's parent is '0 @I1@ INDI', then type is 'individual'.

In fact, the record '0 @I1@ INDI', and all its child records, have type 'individual'.

I have to use $myself, since the DFA calls functions, and $self is not available within these functions.

In fact, Set::FA::Element calls these functions with only 1 parameter, the object of type Set::FA::Element itself.

So, $myself is a global variable within the lexer, a copy of $self, and in this way I circumvent the fact that the DFA only calls functions.
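The pattern looks like this sketch (the package and function names here are mine, purely for illustration, and I am assuming Set::FA::Element's match() method, which returns the most recently matched text):

        package My::Lexer; # Hypothetical name, for illustration only.

        use strict;
        use warnings;

        use Set::FA::Element;

        our $myself; # Package-scoped copy of $self, visible to the functions below.

        sub new { return bless({tokens => []}, shift) }

        sub run
        {
            my($self) = @_;
            $myself   = $self; # The DFA calls plain functions, so smuggle $self in.

            # ... Build the Set::FA::Element-based DFA here, registering functions
            # such as save_token() as entry/exit actions, then feed it the input ...
        }

        # Set::FA::Element calls functions like this one with a single parameter:
        # the Set::FA::Element object itself. $self is unavailable, hence $myself.

        sub save_token
        {
            my($dfa) = @_;

            push @{$$myself{tokens} }, $dfa -> match;
        }

        1;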

Lastly, another note on the 'count' key. It is not the line number in the input stream because the code ignores (but still counts) blank lines.

o A CSV file (of tokens).

This file too can be passed to the parser, or to any other module.

By default, the CSV file is not produced.
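Writing such a 'cooked' file might look like this sketch, using Text::CSV. The column order is my assumption, not a documented format.

        use Text::CSV;

        # Dump the array of token hashrefs as CSV, one row per token.

        sub write_cooked_file
        {
            my($file_name, @tokens) = @_;
            my($csv)                = Text::CSV -> new({binary => 1, eol => "\n"});
            my(@columns)            = (qw/count level tag data line_count type/);

            open(my $fh, '>', $file_name) || die "Can't open $file_name: $!";

            $csv -> print($fh, [@columns]);
            $csv -> print($fh, [@$_{@columns} ]) for @tokens;

            close $fh;
        }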

o A pretty-printed report (of tokens), written to the log.

A logger option can suppress this report.

The default logger is Log::Handler, whose default output goes to the screen.
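For reference, building such a logger by hand is standard Log::Handler usage. This sketch is mine; the exact options the lexer passes are not documented here.

        use Log::Handler;

        my($logger) = Log::Handler -> new;

        # Send levels 'error' up to 'debug' to the screen.

        $logger -> add
        (
            screen =>
            {
                log_to   => 'STDOUT',
                maxlevel => 'debug',
                minlevel => 'error',
            }
        );

        $logger -> debug('The pretty-printed report of tokens goes here');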

To do its work, Genealogy::Gedcom::Reader::Lexer calls the next module.

D) Genealogy::Gedcom::Reader::Lexer::DFA, where DFA stands for Deterministic Finite Automaton (informally: a state machine). This module calls Set::FA::Element, a module which I did not write but which I now maintain.

Genealogy::Gedcom::Reader::Lexer will perform some validation on the input tokens.

E) Genealogy::Gedcom::Reader::Parser is also called by Genealogy::Gedcom::Reader, and can be called from any module.

It too will perform some validation on the input tokens, which can come from RAM or a 'cooked' file.

F) Genealogy::Gedcom::Writer will simply output the array of tokens, enabling round-tripping of the input stream.
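Round-tripping could then be as simple as this sketch. This is my assumption about the eventual implementation; among other things, real output would need to respect the CONC/CONT issues mentioned above.

        # Regenerate GEDCOM lines from the array of tokens.

        sub write_gedcom
        {
            my($file_name, @tokens) = @_;

            open(my $fh, '>', $file_name) || die "Can't open $file_name: $!";

            for my $token (@tokens)
            {
                print $fh join(' ', grep {length} $$token{level}, $$token{tag}, $$token{data}), "\n";
            }

            close $fh;
        }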

Later, there will probably be other modules in the series.

Yes, all well and good, but what's the point of re-writing Gedcom.pm?

Ahhh - I thought you'd never ask.

The point is that this design minimizes the effort of future support and maintenance.

The DFA takes a State Transition Table (STT) as a parameter to new(), and this STT can come from various sources:

o A copy of the default STT is stored within the source code of Genealogy::Gedcom::Reader::Lexer, after the __DATA__ token.

This makes it very fast to access (using Data::Section::Simple), and hence this is the default source. A sketch of this appears just after this list.

o The STT can be read in from any CSV file.
o The STT can be read in from any LibreOffice (née OpenOffice) *.ods file.
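For the first of these, reading the default STT out of the module's own source might look like this sketch ('stt' is an assumed section name, for illustration only):

        use Data::Section::Simple 'get_data_section';

        # Read the default STT, shipped inside the module itself after __DATA__.

        my($stt) = get_data_section('stt');

        # ... parse $stt as CSV, validate it, and pass it to the DFA ...

        __DATA__
        @@ stt
        (CSV version of the STT goes here)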

Obviously all forms of the STT have to be in the expected format, which is validated before being used.

You can see a copy of this STT here. The last 2 columns have not been used yet.

The distro ships with scripts/stt2html.pl, which will convert the CSV STT into HTML for easy viewing.

In fact, I work on the STT in LibreOffice, and export it to a CSV file for testing. Obviously, I can save it as an *.ods file too. The CSV file can then be incorporated into the lexer's source code.

This means anyone can easily experiment with patches to the STT, supporting any extension to GEDCOM they dream up, simply by editing text files.

I should say one reason I love this approach is that after examining the source code for Gedcom.pm, I couldn't really understand how it performs its magic, and didn't want to spend too much time studying that technique. I assure you that I in no way mean to disparage Paul's work: Gedcom.pm is a very cleverly written module, which works very well indeed.

The advantage of my code is the text-based STT combined with the output of an array of lexed and parsed tokens, which anyone can siphon off for their own dark purposes.

Clearly this also means files exported from other genealogy programs can be imported into G::G by judicious editing of the STT.

Why do states in that STT appear to be doubled?

Because the design of Set::FA::Element demands it.

When text in the input stream is consumed (by matching a regexp in the STT), what happens when the 'current' state and the 'next' state are the same?

Set::FA::Element has adopted the convention that such an event is a no-op (after the matching input is consumed). That means the DFA does not execute the state's exit and entry functions, even though that would sometimes be convenient. So, in such a situation, the STT is designed to rock back and forth between 2 effectively identical states.
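A fragment of the transitions might therefore look like this sketch (state names and regexps invented, purely to show the doubling):

        # 'date' and 'date_2' are one logical state, doubled so that each match
        # is a real transition, and the entry/exit functions always fire.

        my(@transitions) =
        (
            ['date',   'a_date_regexp', 'date_2'],
            ['date_2', 'a_date_regexp', 'date'],
        );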

How do these new modules handle non-standard tags?

Currently, the lexer accepts valid tags which have suffixes. Hence both INDI and INDIVIDUAL are accepted. This will change when validation is implemented.
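The effect is roughly this sketch (the real check lives in the STT's regexps, not in standalone code like this):

        # Accept a standard tag plus any suffix, e.g. both INDI and INDIVIDUAL.

        my($tag) = 'INDIVIDUAL';

        print "Accepted: $tag\n" if ($tag =~ /^INDI/);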

What about extensions to GEDCOM 5.5.1?

Any extension to the GEDCOM format has to be discussed (if only among Perl programmers), and documented in some manner compatible with the original document.

I have no such suggestions, but I definitely encourage those who do to use the Gedcom mailing list to elicit responses to their ideas.

My code's design simplifies adoption of such extensions. Other code may be just as easy to extend.

References

o The original Perl Gedcom
o GEDCOM
o GEDCOM Specification
o GEDCOM Validation
o GEDCOM Tags
o Usage of non-standard tags
o http://www.tamurajones.net/FTWTEXT.xhtml

This is apparently the worst offender she's seen. Search that page for 'tags'.

o http://www.tamurajones.net/GenoPro2011.xhtml
o http://www.tamurajones.net/GenoPro2007.xhtml
o http://www.tamurajones.net/TheFTWTEXTProblem.xhtml
o Other articles on Tamura's site
o http://www.tamurajones.net/FiveFreakyFeaturesYourGenealogySoftwareShouldNotHave.xhtml
o http://www.tamurajones.net/TwelveOrdinaryMustHaveGenealogySoftwareFeatures.xhtml
o Other projects

Many of these are discussed on Tamura's site.

o http://bettergedcom.wikispaces.com/
o http://www.ngsgenealogy.org/cs/GenTech_Projects
o http://gdmxml.fugal.net/
o http://www.cosoft.org/genxml/
o http://www.sunflower.com/~billk/GEDC/
o http://ancestorsnow.blogspot.com/2011/07/vged.html
o http://www.tamurajones.net/GEDCOMValidation.xhtml
o http://webtrees.net/
o http://swoodbridge.com/Genealogy/lifelines/
o http://deadendssoftware.blogspot.com/
o http://www.legacyfamilytree.com/
o https://devnet.familysearch.org/docs/api-overview

The Gedcom Mailing List

Contact perl-gedcom-help@perl.org.