|The Genealogy::Gedcom Namespace|
|Which version of GEDCOM is the default for these modules?|
|Where does all this leave Paul Johnson's Gedcom.pm?|
|What will the new modules do?|
|Yes, all well and good, but what's the point of re-writing Gedcom.pm?|
|Why do states in that STT appear to be doubled?|
|How do these new modules handle non-standard tags?|
|What about extensions to GEDCOM 5.5.1?|
|The Gedcom Mailing List|
This document is about the new set of modules I (Ron Savage) am writing (2011-08-12):
The basic code, and hence the (dir) structure, of this set is copied directly from Graph::Easy::Marpa, so yes, the parser will use Marpa.
DRAFT Release 5.5.1, in Ged551-5.pdf. See "References" for downloading details.
People using that module should continue to use it indefinitely.
I do not see Genealogy::Gedcom as being a drop-in replacement for Gedcom.pm.
Nevertheless, if and when G::G reaches V 1.00 (what I call production quality), it might be a consideration for new code, since, dealing with identical input, they will obviously have a number of methods in common.
The following modules all operate at a lower level.
It reads the GEDCOM file and lexes it, meaning it identifies tokens in the input stream but does not assign meaning to those tokens. The parser assigns meaning to them.
Experts in the field are free to disagree with my casual definitions of lexing and parsing.
I call the GEDCOM file 'raw' data, so this CSV file is called 'cooked'. This terminology helps name some of the many command line options available.
Outputs supported by the lexer are an array of tokens and, optionally, a 'cooked' CSV file.
Note: In this array, of type Set::Array, each element is a hashref with these key => value pairs:
    count      => $myself -> _count,
    data       => defined($field[2]) ? $field[2] : '', # To allow for $field[2] eq '0'.
    level      => $field[0],
    line_count => $myself -> line_count,
    tag        => $field[1],
    type       => $type,
where @field = split(/\s+/, $input_record, 3).
The code reading the file removes leading and trailing spaces, and this usage of split handles indented GEDCOM files.
The removal of trailing spaces may cause inconvenience when the file deliberately contains a trailing space. See p 10 of the GEDCOM document, where it discusses the CONC and CONT tags. This means the removal of trailing spaces may be dropped from the code, or made optional.
count is the token count, 1 .. N. Hence it's just the array index + 1.
data is '20 Dec 1775', in the GEDCOM record '2 DATE 20 Dec 1775'.
level is '2', in that record.
line_count is that record's line number within the input file. This helps identify errors in the file.
tag is 'DATE', within that record.
type is the context of that record. If the parent of that record is 'BIRT', and BIRT's parent is '0 @I1@ INDI', then type is 'individual'.
In fact, the record '0 @I1@ INDI', and all its child records, have type 'individual'.
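As an illustration, here is a minimal sketch (not the module's actual code) of lexing one record into such a hashref. The count, line count and type values are hard-coded purely for this example:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Split one GEDCOM record into (level, tag, data), as described above,
# and build a token hashref from the pieces.

my $input_record = '2 DATE 20 Dec 1775';
my @field        = split(/\s+/, $input_record, 3);

my $token =
{
	count      => 1,                                   # Array index + 1.
	data       => defined($field[2]) ? $field[2] : '', # Allows for '0'.
	level      => $field[0],
	line_count => 1,                                   # Line # in the file.
	tag        => $field[1],
	type       => 'individual',                        # Context of record.
};

print "$token->{tag} => $token->{data}\n"; # Prints: DATE => 20 Dec 1775
```

The 3-argument split is what makes indented files work: everything after the tag, spaces included, lands unsplit in $field[2].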
I have to use $myself, since the DFA calls plain functions, and $self is not available within them.
In fact, Set::FA::Element calls these functions with only 1 parameter: the object of type Set::FA::Element itself.
So, $myself is a global variable within the lexer, and is a copy of $self; in this way I circumvent the fact that the DFA only calls functions.
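The workaround just described can be sketched like this. The names (new, dfa_callback) are illustrative, not the module's own:

```perl
use strict;
use warnings;

# The DFA calls plain functions, passing only the Set::FA::Element object,
# so the lexer object is stashed in a file-scoped variable at construction
# time, where the callbacks can reach it.

my($myself); # File-global copy of $self.

sub new
{
	my($class) = @_;
	my($self)  = bless({line_count => 0}, $class);
	$myself    = $self; # The DFA's callbacks can now reach the lexer.

	return $self;
}

sub dfa_callback
{
	my($dfa) = @_; # Only parameter: the Set::FA::Element object.

	$myself->{line_count}++;

	return $myself->{line_count};
}

my $lexer = main->new;

print dfa_callback(undef), "\n"; # Prints: 1
```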
Lastly, another note on the 'count' key. It is not the line number in the input stream because the code ignores (but still counts) blank lines.
This file too can be passed to the parser, or to any other module.
By default, the CSV file is not produced.
A logger option can suppress this report.
The default logger is Log::Handler, whose default output goes to the screen.
To do its work, Genealogy::Gedcom::Reader::Lexer calls the next module.
Genealogy::Gedcom::Reader::Lexer will perform some validation on the input tokens.
It too will perform some validation on the input tokens, which can come from RAM or a 'cooked' file.
Later, there will probably be other modules in the series.
Ahhh - I thought you'd never ask.
The point is that this design minimizes the effort of future support and maintenance.
The DFA takes a State Transition Table as a parameter to new(), and this STT can come from various sources:
Storing the STT within the module's source code (where it is accessed using Data::Section::Simple) makes it very fast to access, and hence this is the default source.
Obviously all forms of the STT have to be in the expected format, which is validated before being used.
You can see a copy of this STT here. The last 2 columns have not been used yet.
The distro ships with scripts/stt2html.pl, which will convert the CSV STT into HTML for easy viewing.
In fact, I work on the STT in LibreOffice, and export it to a CSV file for testing. Naturally, I can save it as *.ods too. Then the CSV file can be incorporated into the lexer's source code.
This means anyone can easily experiment with patches to the STT, supporting any extension to GEDCOM they dream up, simply by editing text files.
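To show why a text-based STT is so easy to patch, here is a generic, self-contained sketch of a regexp-driven transition table. This is not Set::FA::Element's actual API, and the states and regexps are invented for the example; the point is that supporting a new tag means adding a row:

```perl
use strict;
use warnings;

# A toy state transition table: each row is
# [current state, regexp to match, next state].

my(@stt) =
(
	['level', qr/^\d+/,      'tag' ],
	['tag',   qr/^[A-Z_@]+/, 'data'],
	['data',  qr/^.+/,       'done'],
);

sub run
{
	my($input) = @_;
	my($state) = 'level';
	my(@token);

	for my $row (@stt)
	{
		my($from, $regexp, $to) = @$row;

		next if ($state ne $from);

		if ($input =~ s/($regexp)\s*//)
		{
			push @token, $1;

			$state = $to;
		}
	}

	return ($state, @token);
}

my($final, @token) = run('2 DATE 20 Dec 1775');

print "Final state: $final. Tokens: @token\n";
# Prints: Final state: done. Tokens: 2 DATE 20 Dec 1775
```

Editing @stt (or, in the real module, the CSV file) changes what the lexer accepts, without touching any other code.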
I should say one reason I love this approach is that after examining the source code for Gedcom.pm, I couldn't really understand how it performs its magic, and didn't want to spend too much time studying that technique. I assure you that I in no way mean to disparage Paul's work: Gedcom.pm is a very cleverly written module, which works very well indeed.
The advantage of my code is the text-based STT combined with the output of an array of lexed and parsed tokens, which anyone can siphon off for their own dark purposes.
Clearly this also means files exported from other genealogy programs can be imported into G::G by judicious editing of the STT.
Because the design of Set::FA::Element demands it.
When text in the input stream is consumed (by matching a regexp in the STT), what happens when the 'current' state and the 'next' state are the same?
Set::FA::Element has adopted the convention that such an event is a noop (after the matching input is consumed). That means the DFA does not execute the state's exit and entry functions, even though that would sometimes be convenient. So, in such a situation, the STT is designed to rock back and forth between 2 effectively identical states.
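The doubled-state convention can be sketched as follows. The state names, regexp, and table structure here are illustrative, not the module's actual STT:

```perl
use strict;
use warnings;

# Two states with identical rules. Since a state-to-state self-transition
# is a noop (entry/exit functions are skipped), the table alternates
# between 'item_1' and 'item_2' so the entry function fires per record.

my(@transitions) =
(
	['item_1', qr/^\d+\s+\w+/, 'item_2'], # Same regexp, but the 'next'
	['item_2', qr/^\d+\s+\w+/, 'item_1'], # state differs, so entry fires.
);

my($state) = 'item_1';

for my $record ('0 HEAD', '1 SOUR', '1 GEDC')
{
	for my $row (@transitions)
	{
		my($from, $regexp, $to) = @$row;

		if ( ($state eq $from) && ($record =~ $regexp) )
		{
			print "Matched '$record': $from -> $to\n";

			$state = $to;

			last;
		}
	}
}
# Prints:
# Matched '0 HEAD': item_1 -> item_2
# Matched '1 SOUR': item_2 -> item_1
# Matched '1 GEDC': item_1 -> item_2
```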
Currently, the lexer accepts valid tags which have suffixes. Hence both INDI and INDIVIDUAL are accepted. This will change when validation is implemented.
Any extension to the GEDCOM format has to be discussed (if only among Perl programmers), and documented in some manner compatible with the original document.
I have no such suggestions, but I definitely encourage those who do to use the Gedcom mailing list to elicit responses to their ideas.
My code's design simplifies adoption of such extensions. Other code may be just as easy to extend.
This is apparently the worst offender she's seen. Search that page for 'tags'.
Many of these are discussed on Tamura's site.