Introduction

Table of contents

Introduction
Have any files in a directory changed?
Which directories do I back up?
POD = Plain Old Documentation
POM = POD Object Model
Converting text into a web site - 1
Converting text into a web site - 2
Converting text between various formats (SDF)
Converting text between various formats (Programs)
Converting text into images
Converting images between formats (ImageMagick)
The Magick scripting language (MSL)
GEnealogical Data COMmunication (GEDCOM)
GEDCOM on the Web - 1
GEDCOM on the Web - 2
Mailbox (MBX) files
Browser Favourites
Executable Text Files
Writing and Reading eval-able strings
Creating and Editing images with text
Nested Parsers for HTML - 1 of 4
Nested Parsers for HTML - 2 of 4
Nested Parsers for HTML - 3 of 4
Nested Parsers for HTML - 4 of 4
Resources
Author
Licence

Introduction

This lecture is called SDF - Simple Document Format.

I wish to talk about using simple text files for various purposes.

Why not PDF? The advantages of PDF:

OS agnostic
Viewable with many browsers
Viewer is free
Can create with freeware

Disadvantages:

Proprietary format
Adobe Acrobat costs $
It is only 1 format!

Have any files in a directory changed?

We can determine this by calculating a cryptographic digest (checksum) of all the files in the directory. The algorithm I use is called MD5, so I store the results in MD5.txt.

Then, later, we re-calculate the digests, and compare them with the values read in from MD5.txt. If at least 1 file has changed, the directory needs to be backed up.

My program, md5.pl, automatically handles sub-directories, and skips files with names like MD5.txt and *.bak. Of course it skips the directories '.' and '..'.

The format of MD5.txt looks like:

        D:/Dir/CPAN.pl: d3599a54751df65a693e34b9c1e1eec6
        D:/Dir/FlockDir.pm: f13e3843df50b8bfc868a56bf1f61306
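
The digest-and-compare idea can be sketched as follows. This is not md5.pl itself, only a minimal illustration using the core modules Digest::MD5, File::Find and File::Temp. The function names digest_dir, has_changed and write_file are made up for this example, and it compares two in-memory sets of digests rather than reading and writing MD5.txt.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;
use File::Find;
use File::Temp qw(tempdir);

# Map each file under $dir to its MD5 hex digest, skipping MD5.txt and *.bak.
# (digest_dir is a hypothetical name, not part of md5.pl.)
sub digest_dir
{
    my($dir) = @_;
    my(%digest);
    find(
    {
        no_chdir => 1, # Keep $File::Find::name usable as a path.
        wanted   => sub
        {
            my($path) = $File::Find::name;
            return if (! -f $path);
            return if ($path =~ m{(?:^|/)MD5\.txt$} || $path =~ /\.bak$/);
            open(my $fh, '<', $path) || die("Can't open($path): $!");
            binmode($fh);
            $digest{$path} = Digest::MD5 -> new() -> addfile($fh) -> hexdigest();
            close($fh);
        },
    }, $dir);
    return %digest;
}

# True if a file was added, removed or modified between two digest sets.
sub has_changed
{
    my($old, $new) = @_;
    return 1 if (keys(%$old) != keys(%$new));
    for my $path (keys %$new)
    {
        return 1 if (! defined($$old{$path}) || $$old{$path} ne $$new{$path});
    }
    return 0;
}

# Write a small file; used only by the demonstration below.
sub write_file
{
    my($path, $text) = @_;
    open(my $fh, '>', $path) || die("Can't open(> $path): $!");
    print $fh $text;
    close($fh);
}

# Demonstration on a temporary directory.
my($dir) = tempdir(CLEANUP => 1);
write_file("$dir/a.txt", 'hello');
my(%before) = digest_dir($dir);
write_file("$dir/a.txt", 'hello, world');
my(%after) = digest_dir($dir);
print has_changed(\%before, \%after) ? "Back up $dir\n" : "No change in $dir\n";
```

Comparing the number of keys first catches added and deleted files, which a digest-by-digest comparison of the new set alone would miss.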

Which directories do I back up?

md5.pl is driven by another program, backup.pl. backup.pl reads a strategy file that tells it which directories to run md5.pl on.

Here are the first few lines of the strategy file:

        # D:\My Documents\Backup\strategy.txt.
        backupDirectory=D:\Backup
        scriptDirectory=D:\Scripts\General
        # Tar name = Directory name
        Apache=D:\My Documents\Apache
        Apache-cgi-bin=D:\Apache\cgi-bin

The backupDirectory option tells backup.pl where to write output to.

The scriptDirectory option tells backup.pl where to find auxiliary scripts, ie md5.pl.

Lines like Apache=D:\My Documents\Apache mean:

Run md5.pl on D:\My Documents\Apache
If anything has changed, run tar and gzip on that dir
Call the output file Apache.tgz
Save Apache.tgz into backupDirectory = D:\Backup

Now I need only back up 1 directory: D:\Backup.
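
Reading such a strategy file might look like the sketch below. It is not backup.pl, and parse_strategy is a name invented for this example. It assumes that lines starting with '#' are comments, that backupDirectory and scriptDirectory are the only options, and that every other name=value line is a tar name and the directory to check. It only prints what would be done; it does not run md5.pl, tar or gzip.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split strategy lines into options and backup jobs.
# (parse_strategy is a hypothetical name, not part of backup.pl.)
sub parse_strategy
{
    my(@line) = @_;
    my(%option, @job);
    for my $line (@line)
    {
        $line =~ s/^\s+|\s+$//g;
        next if ($line eq '' || $line =~ /^#/);
        my($name, $value) = split(/=/, $line, 2);
        next if (! defined($value));
        if ($name =~ /^(?:backupDirectory|scriptDirectory)$/)
        {
            $option{$name} = $value;
        }
        else
        {
            push(@job, {name => $name, dir => $value});
        }
    }
    return (\%option, \@job);
}

# The sample lines from the strategy file above.
my($text) = <<'EOS';
# D:\My Documents\Backup\strategy.txt.
backupDirectory=D:\Backup
scriptDirectory=D:\Scripts\General
# Tar name = Directory name
Apache=D:\My Documents\Apache
Apache-cgi-bin=D:\Apache\cgi-bin
EOS

my($option, $job) = parse_strategy(split(/\n/, $text));
for my $item (@$job)
{
    print "$$item{name}: check $$item{dir}, write $$option{backupDirectory}\\$$item{name}.tgz\n";
}
```

Splitting on the first '=' only (the limit of 2) keeps any later '=' characters in a directory name intact.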

POD = Plain Old Documentation

Perl has its own documentation language called POD.

The name is modelled on POTS = Plain Old Telephone System.

POD files are text files.

Many programs exist which read POD files and reformat them.

Perl ships with:

pod2html
pod2latex
pod2man
pod2text
pod2usage
podchecker
podlint
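
The same conversions are available from inside a Perl program. As a small sketch, the core module Pod::Text (the module behind pod2text) can turn a POD string into plain text. The POD fragment here is invented for the example and is not taken from sdf.pod.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Pod::Text;

# A tiny POD document. Paragraphs and commands are separated by blank lines.
my($pod) = join("\n\n",
    '=head1 Introduction',
    'POD files are text files.',
    '=over 4',
    '=item * pod2html',
    '=item * pod2text',
    '=back') . "\n";

# Render the POD as plain text into $text rather than to STDOUT.
my($text);
my($parser) = Pod::Text -> new(width => 60);
$parser -> output_string(\$text);
$parser -> parse_string_document($pod);
print $text;
```

Building the POD with join() rather than a here-document keeps the example's own source free of lines that begin with '=', so POD tools run over the script itself are not confused.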

POM = POD Object Model

There are also various Perl modules dedicated to processing POD files, eg: Pod::POM.

The name is modelled on DOM = Document Object Model.

I'm writing this lecture directly into a POD file called sdf.pod.

I can get away with using only 4 POD commands to format this file.

Converting text into a web site - 1

Here is the XML defining one page on a web site (it's part of a larger file):

        <page name = 'Services' logo = './assets/images/butterflies.png'
                toc = 'Services' order = '04'>
        <header>
                <heading>Services Available</heading>
                <heading_pix>./assets/images/Services.png</heading_pix>
                <page_background>./assets/images/pa000131-pale.png</page_background>
        </header>
        <publication toc_id = 'Training'>
                <text>Small classes only.</text>
                <text>Please - only one dog per person.</text>
                <text>Next class: 1st June, 10:30am.</text>
        </publication>
        <publication toc_id = 'Dietary Advice'>
                <text>We can advise if a special diet is required.</text>
        </publication>
        </page>

I've written a program, web-site.pl, which converts XML like this into a set of HTML pages.

Converting text into a web site - 2

http://savage.net.au/Ron/html/sdf.1.png

web-site.pl is used to generate 3 web sites at the moment: mine, one for my neighbour who's a gardener, and another one for a dog training school.

Converting text between various formats (SDF)

SDF is a freely available documentation system designed and developed by Ian Clatworthy, with help from many others. SDF is written in Perl.

SDF uses text files as input. It has its own markup language, or it can use POD.

Output formats directly supported:

HTML
Plain text
POD
MIF (Maker Interchange Format)
SGML
MIMS F6 help
MIMS HTX

Output formats indirectly supported:

Remember pod2latex etc above?
PostScript, FrameViewer, RTF etc via FrameMaker
LaTeX, RTF, GNU Info and LyX via SGML-Tools
PostScript and DVI via LaTeX

Converting text between various formats (Programs)

Here is a mixed bag of conversion utilities:

wp2html: Word Perfect and MS Word 7 and 97 to HTML
catwpd: Word Perfect to text
catdoc: MS Word to text
rtf2html: RTF to HTML
pdftotext: Adobe PDF to text
ps2ascii: Adobe PostScript to text
xlHtml: MS Excel and MS Powerpoint to HTML
pptHtml: MS Powerpoint to HTML
xls2csv: MS Excel to CSV (Comma-Separated Values, ie text)
swfparse: Extracts links from Shockwave files
ht://Dig (yes, that's its name!): HTML to text

ht://Dig is actually designed to help speed up web site search engines by indexing HTML pages.

Clearly, any package which outputs HTML can be used to index a file with ht://Dig. Eg: Your MS Excel spreadsheets can be converted to HTML with xlHtml, and then the HTML can be indexed with ht://Dig.

By combining these utilities, and the others mentioned herein, you are well on the way to file conversion utopia.

Converting text into images

Firstly, here I wish to discuss converting text (PS files) into images (SVG/JPG/BMP etc).

pstoedit converts PS files into PDF/SVG/CGM/EMF/MIF/RTF/etc. Of course, some of these are text output files, and some are image formats.

In fact, the range of output from pstoedit is too large to include here.

Secondly, some utilities read HTML and output images.

html2jpg, Personal edition, can output to either JPG or BMP, and the Enterprise edition can even handle pages which require horizontal scrolling.

This package captures web pages, ie not just the HTML code, but also some images, the output from CGI scripts, etc. Batch mode processing is possible.

html2ps is a Perl program, which supports multiple documents and following hyperlinks. If the output is converted to PDF, the hyperlinks are maintained.

Converting images between formats (ImageMagick)

ImageMagick is a collection of utilities for image format conversion. About 70 formats are supported, although not all can be both read and written.

Note: Some ImageMagick features require external programs, which ImageMagick calls 'delegates'.

ImageMagick can use the FreeType libraries to annotate with TrueType fonts. It requires GhostScript to read (Encapsulated) PostScript and PDF files. These can then be converted into GIF, PNG etc images. html2ps can be used to read HTML files.

ImageMagick can handle files, eg EPS and PS, which contain multiple pages, and can convert each page into a separate image.

ImageMagick can create blank images with special backgrounds, and then annotate them with any text you choose.

ImageMagick consists of command line programs, and has an interface via Perl too. You can use the C/C++ interface if you wish to program in those languages.

For CGI scripts, I have written a Perl CGI ImageMagick FAQ.

So, why discuss ImageMagick?

Two reasons:

It can read files such as EPS, PS and PDF
It can be driven from a text (XML) file

The Magick scripting language (MSL)

An MSL file is simply an XML file recognized by the ImageMagick utility conjure.

An MSL file looks like (call it convert.msl):

        <?xml version="1.0" encoding="UTF-8"?>

        <image size="400x400" >
                <read filename="image.gif" />
                <get width="base-width" height="base-height" />
                <resize geometry="%[dimensions]" />
                <get width="width" height="height" />
                <print output=
                        "Image sized from %[base-width]x%[base-height]
                        to %[width]x%[height].\n" />
                <write filename="image.png" />
        </image>

and to run this, with image.gif in the current directory, and output going to image.png, I tested it with:

        D:\TEMP>conjure -dimensions 400x400 convert.msl

where the string '%[dimensions]' in the MSL file is set at run-time with the '-dimensions 400x400' option.

GEnealogical Data COMmunication (GEDCOM)

GEDCOM is a text file format used for documenting family trees.

There is a range of family tree programs around, and they can use GEDCOM to communicate with each other.

The best of these, I believe, is PAF = Personal Ancestral File.

A GEDCOM file entry for 1 person looks like:

        0 @I1@ INDI
        1 NAME Ronald /Savage/ Mr
        2 SURN Savage
        2 GIVN Ronald
        2 NSFX Mr
        1 SEX M
        1 BIRT
        2 DATE 21 Jun 1950
        2 PLAC Victoria H, Launceston, Tas, Aust
        1 REFN 1
        1 _UID 502BB18A5386D5119E640000000000007772
        1 FAMS @F1@
        1 FAMC @F2@
        1 NOTE BC: 764. Born 2:30 pm

Various programs are available which can read a GEDCOM file output by a family tree program, and can convert it into HTML, for example.
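
To show how little is needed to read this format, here is a minimal sketch of a GEDCOM line parser. It assumes each line has the shape 'level [@xref@] tag [value]', as in the entry above, and uses the level numbers to nest sub-records. The names parse_gedcom and child_value are invented for this example. Real GEDCOM files have more rules (continuation lines, character sets) which are ignored here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse GEDCOM lines into a list of level-0 records. Each record is a hash
# with keys level, xref, tag, value and children (an array of sub-records).
sub parse_gedcom
{
    my(@line) = @_;
    my(@record, @stack);
    for my $line (@line)
    {
        next if ($line !~ /^\s*(\d+)\s+(?:(\@[^\@\s]+\@)\s+)?(\S+)(?:\s+(.*?))?\s*$/);
        my($node) = {level => $1, xref => $2, tag => $3,
            value => defined($4) ? $4 : '', children => []};
        # Unwind to the nearest record with a lower level: that is the parent.
        pop(@stack) while (@stack && $stack[-1]{level} >= $$node{level});
        if (@stack)
        {
            push(@{$stack[-1]{children} }, $node);
        }
        else
        {
            push(@record, $node);
        }
        push(@stack, $node);
    }
    return @record;
}

# Follow a path of tags down from $node and return the value found there.
sub child_value
{
    my($node, @tag) = @_;
    for my $tag (@tag)
    {
        ($node) = grep{$$_{tag} eq $tag} @{$$node{children} };
        return if (! $node);
    }
    return $$node{value};
}

# Part of the sample entry shown above.
my($gedcom) = <<'EOS';
0 @I1@ INDI
1 NAME Ronald /Savage/ Mr
2 SURN Savage
2 GIVN Ronald
1 SEX M
1 BIRT
2 DATE 21 Jun 1950
2 PLAC Victoria H, Launceston, Tas, Aust
EOS

my(@person) = parse_gedcom(split(/\n/, $gedcom));
for my $person (@person)
{
    print "$$person{xref}: ", child_value($person, 'NAME'),
        ', born ', child_value($person, 'BIRT', 'DATE'), "\n";
}
```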

GEDCOM on the Web - 1

How do we display data from a GEDCOM file in a web page?

The answer is a Perl CGI script called ged-view.cgi, which I wrote. It uses a Cascading Style Sheet (CSS) to format the data in a nice way. A part of the CSS looks like:

        #person_center {
                padding: 10px;
                margin-top: 20px;
                margin-bottom: 20px;
                margin-right: auto;
                margin-left: auto;      /* opera does not like 'margin:20px auto' */
                background-color: #80c0ff;
                border: 5px solid #ccc;
                text-align:left; /* part 2 of 2 centering hack */
                width: 600px; /* ie5win fudge begins */
                voice-family: "\"}\"";
                voice-family:inherit;
                width: 570px;
                }

        html>body #person_center {
                width: 570px; /* ie5win fudge ends */
                }

This is yet another way a simple (:-) text file can exercise remote control over something else. Here it controls the appearance of web pages.

GEDCOM on the Web - 2

But how do we design a web page which can be controlled by a CSS?

The answer is HTML templates. In ged-view.cgi I have used a Perl module called the Template Toolkit. A part of a template, as used by the Template Toolkit, looks like:

        [% SET spouse_count = 0 %]
        [% FOREACH spouse = person.spouse %]
                [% SET spouse_count = spouse_count + 1 %]
                <br>
                <div id = 'person_center'>
                        <p>Spouse [% spouse_count %]:
                                <a href = '[% href(spouse.xref) %]'>
                                [% spouse.name %]</a> ([% spouse.sex %])</p>
                        <p>Marriage: [% spouse.marriage %]</p>
                        [% SET child_count = 0 %]
                        [% FOREACH child = spouse.children %]
                                [% SET child_count = child_count + 1 %]
                                <hr>
                                <p>Child [% child_count %]:
                                        <a href = '[% href(child.xref) %]'>
                                        [% child.name %]</a> ([% child.sex %])</p>
                                <p>Birth: [% child.birth %]</p>
                                <p>Death: [% child.death %]</p>
                        [% END %]
                </div>
        [% END %]

Look closely. See <div id = 'person_center'>? That connects the HTML to the CSS.

Also notice: There are absolutely no HTML tables in this set of templates.

Mailbox (MBX) files

Some email clients, such as MS Outlook Express (now called Virus Express) store email in a proprietary format.

However, high quality email clients use an industry standard format - MBX.

MBX files are text files, with individual emails simply concatenated one after the other, into one long file. Each mailbox you define in your email client will have all its emails stored in this one file.

Of course, many people have written programs, eg in Perl, which can read MBX files and process them. I use a Perl module called Mail::Header to process my MBX files.
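
As a rough sketch of what such a program does, the code below splits a mailbox into messages and pulls out the header fields using only core Perl. It assumes the common mbox convention that each message starts with a line beginning 'From ', which may not hold for every client's files. It does not use Mail::Header, and the names split_mailbox and parse_header and the sample messages are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split mailbox text into messages, one per 'From ' separator line.
sub split_mailbox
{
    my($text) = @_;
    return grep{length} split(/^(?=From )/m, $text);
}

# Return the header fields of one message as a hash (names lower-cased).
sub parse_header
{
    my($message) = @_;
    my($head) = split(/\r?\n\r?\n/, $message, 2); # Headers end at the first blank line.
    $head =~ s/\r?\n[ \t]+/ /g;                   # Unfold continuation lines.
    my(%field);
    for my $line (split(/\r?\n/, $head) )
    {
        $field{lc $1} = $2 if ($line =~ /^([\w-]+):\s*(.*)$/);
    }
    return %field;
}

# Two made-up messages.
my($mailbox) = <<'EOS';
From alice@example.com Mon May 27 10:00:00 2002
From: alice@example.com
Subject: Lecture notes

See attached.

From bob@example.com Tue May 28 11:30:00 2002
From: bob@example.com
Subject: Re: Lecture
 notes

Thanks.
EOS

my(@message) = split_mailbox($mailbox);
for my $message (@message)
{
    my(%field) = parse_header($message);
    print "$field{from}: $field{subject}\n";
}
```

An index of topics or a search tool can be built on top of the same two steps: split the file, then look at the header fields of each message.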

Why would you want to do this?

To find stuff when you can't quite remember it
To create indexes of topics
To delete attachments!

Yes, it is this last one I do. My email client is called PocoMail. Poco is the Spanish word for 'small', and PocoMail - which costs $25 - is a small program, dedicated to managing email.

However, when you delete emails, it does not delete the corresponding attachments. So, I've written a little program to find the attachments I wish to keep, and to delete the rest.

Browser Favourites

Because Internet Explorer became slower and slower as I added more and more Favourites, I now keep my favourites in a text file.

The file is structured. This format allows me to add annotations to links if I choose, and to keep related links together, thus emulating folders.

I can sort the file anytime I wish.

I use my editor to search for anything interesting. Since my editor runs all day, I can keep this text file permanently loaded.

It does mean I have to copy-and-paste links, rather than just clicking on them, but I feel the convenience outweighs this extra effort.

Simple!

Executable Text Files

There are basically 2 types of text files which we can execute:

Scripts, eg Perl scripts
Strings which are eval-able (executable) in some language

'eval' is the command which executes a string, and is available in various languages. Indeed, it is often a test of a language's power as to whether or not the language supports 'eval'.

Executable text follows, and the executor is Perl.

        #!/usr/bin/perl
        print "Hello, World\n";

A string, eval-able by Perl (in a file) will look like:

        %hash = (
                'key_1' => 'Value one',
                'key_2' => 'Value two'
        );

Writing and Reading eval-able strings

We can write and read such a string with:

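        #!/usr/bin/perl
        # Writer: dump %hash to string.txt as Perl source.
        use strict;
        use warnings;
        use Data::Dumper;
        my(%hash) =
        (
                key_1 => 'Value one',
                key_2 => 'Value two',
        );
        open(OUT, '> string.txt') || die("Can't open(> string.txt): $!");
        print OUT Data::Dumper -> Dump([\%hash], ['*hash']);
        close(OUT);

        #!/usr/bin/perl
        # Reader: slurp string.txt and eval it to repopulate %hash.
        use strict;
        use warnings;
        my(%hash);
        open(INX, 'string.txt') || die("Can't open(string.txt): $!");
        local $/;       # Undefine the record separator so <INX> returns the whole file.
        eval <INX>;
        die("Can't eval string.txt: $@") if ($@);
        close(INX);
        print map{"$_ => $hash{$_}. \n"} sort keys %hash;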

Creating and Editing images with text

Here I'll just mention a few image/text manipulation packages.

ImageMagick: See above
PhotoLine 32: Image Editing for Windows and Mac
DDTitle: Sophisticated text-on-image software

Nested Parsers for HTML - 1 of 4

By this I mean parsing an HTML file to extract a string of text, and then parsing that extracted text using another parser, all in 1 go.

Here is a screen-capture of the original HTML:

http://savage.net.au/Ron/html/sdf.2.png

Nested Parsers for HTML - 2 of 4

Here is the Perl code which parses the HTML file and extracts the table:

        use DBI;
        use Parse::RecDescent;

        # $method_grammar is the grammar string shown in part 3 of 4.
        my($table_name) = 'method';
        my($table_type) = 'HTMLtable';
        my($table_file) = 'perl.html';
        my($row_names)  = 'Method,Parameters,Description';
        my($dbh)        = DBI -> connect('dbi:AnyData(RaiseError=>1):');
        # Register the HTML table in perl.html as the SQL table 'method'.
        $dbh -> func($table_name, $table_type, $table_file,
                {col_names => $row_names, count => 8}, 'ad_catalog');
        my($sql) = "select * from $table_name";
        my($sth) = $dbh -> prepare($sql) || die("Can't prepare: $sql: $DBI::errstr");
        $sth -> execute() || die("Can't execute: $sql: $DBI::errstr");
        $|                 = 1;
        my($method_parser) = Parse::RecDescent -> new($method_grammar) || die("Invalid method grammar");
        my($data, $text, %method);
        while ($data = $sth -> fetchrow_hashref() )
        {
                # Clean up the Parameters cell before handing it to the second parser.
                $text   = $$data{'Parameters'};
                $text   =~ tr/\cM\cJ/ /;
                $text   =~ s/([a-z])\s+([a-z])/${1}_$2/g; # Make things like 'stroke => color name' easy.
                $text   =~ tr/ \t/ /s;
                if ( (! $text) || ($text eq ' ') )
                {
                        $method{$$data{'Method'} } = '';
                        next;
                }
                $text = $method_parser -> Option($text);
                die("Cannot parse text") if (! defined($text) );
                $method{$$data{'Method'} } = $text;
        }
        die("Can't fetch: $sql: $DBI::errstr") if ($DBI::errstr);
        $dbh -> disconnect();

Nested Parsers for HTML - 3 of 4

Here is the Parse::RecDescent grammar which parses the text extracted from the HTML:

        my($method_grammar) =
        q{
        Token: /^@?\w[-\w]*/
        NextToken: Token '=>'
        Comma: /,/
        Option: <leftop: Pair Comma Pair>
                                {
                                        $return = join($;, sort grep{! /,/} @{$item[1]});
                                }
        Pair:   Token '=>' '{' <commit> <leftop: Token Comma Token> '}'
                                {
                                        $return = "$item[1] => " . join(' | ', sort grep{! /,/} @{$item[5]});
                                }
                        | Token '=>' Token Comma ...!NextToken Token
                                {
                                        $return = "$item[1] => $item[3] & $item[6]";
                                }
                        | Token '=>' Token
                                {
                                        $return = "$item[1] => $item[3]";
                                }
                        | 'string'
                                {
                                        $return = $item[1];
                                }
                        | <error: Invalid option specification>
        };

Nested Parsers for HTML - 4 of 4

Here is the output of the program:

        D:\scripts\ImageMagick-Conjure>perl conjure-text-2-xml.pl -help
        Method: AddNoise. Parameters:
        noise = Gaussian | Impulse | Laplacian | Multiplicative | Poisson | Uniform

        Method: AffineTransform. Parameters:
        affine = array_of_float_values
        rotate = float
        scale = float & float
        skewX = float
        skewY = float
        translate = float & float

        Method: Annotate. Parameters:
        affine = array_of_float_values
        antialias = false | true
        density = geometry
        family = string
        fill = color_name
        font = string
        geometry = geometry
        gravity = Center | East | North | NorthEast | NorthWest | South | SouthEast | SouthWest | West
        pointsize = integer
        rotate = float
        scale = float & float
        stretch = Condensed | Expanded | ExtraCondensed | ExtraExpanded | Normal | SemiCondensed | SemiExpanded | UltraCondensed | UltraExpanded

Resources

The lectures

CGI scripting (lecture 2001)

Simple Document Format (lecture 2002)

Web Servers (lecture 2002)

The slides for Simple Document Format (*.tgz)

The slides for Simple Document Format (*.zip)

CSS: http://www.thenoodleincident.com/tutorials/box_lesson/index.html
DDTitle: http://www.softlab-nsk.com/ddtitle/
Gedcom: http://theoryx5.uwinnipeg.ca/CPAN/cpan-search.html
ged-view.cgi: http://savage.net.au/Perl.html#ged_view_cgi
ht://Dig: http://www.htdig.org/
html2jpg: http://www.html2jpg.com/
html2ps: http://www.tdb.uu.se/~jan/html2ps.html
ImageMagick Home Page: http://www.imagemagick.org/
ImageMagick for Windows: http://www.dylanbeattie.net/magick/
PAF: http://www.familysearch.org
PhotoLine 32: http://www.ciebv.com
PocoMail: http://www.pocomail.com
pod2ps: http://www.cpan.org/modules/by-authors/Jim_Pravetz/
pstoedit: http://www.pstoedit.com/
SDF: http://theoryx5.uwinnipeg.ca/CPAN/cpan-search.html

Author

Ron Savage.

Home page: http://savage.net.au/index.html

Version: 1.01 01-Jun-2006

This version disguises my email address.

Version: 1.00 27-May-2002

Original version.

Licence

Australian Copyright © 2002 Ron Savage. All rights reserved.

        All Programs of mine are 'OSI Certified Open Source Software';
        you can redistribute them and/or modify them under the terms of
        The Artistic License, a copy of which is available at:
        http://www.opensource.org/licenses/index.html