Introduction

This lecture is called SDF - Simple Document Format. I wish to talk about using simple text files for various purposes.

Why not PDF?

The advantages of PDF:

Disadvantages:

Have any files in a directory changed?

We can determine this by calculating a cryptographic digest (checksum) of each file in the directory. The algorithm I use is called MD5, so I store the results in MD5.txt. Later, we re-calculate the digests and compare them with the values read in from MD5.txt. If at least 1 file has changed, the directory needs to be backed up.

My program, md5.pl, automatically handles sub-directories, and skips files with names like MD5.txt and *.bak. Of course, it also skips the directories '.' and '..'.

The format of MD5.txt looks like:

D:/Dir/CPAN.pl: d3599a54751df65a693e34b9c1e1eec6
D:/Dir/FlockDir.pm: f13e3843df50b8bfc868a56bf1f61306

Which directories do I back up?

md5.pl is driven by another program, backup.pl. backup.pl reads a strategy file that tells it which directories to run md5.pl on. Here are the first few lines of the strategy file:

# D:\My Documents\Backup\strategy.txt.
backupDirectory=D:\Backup
scriptDirectory=D:\Scripts\General
# Tar name = Directory name
Apache=D:\My Documents\Apache
Apache-cgi-bin=D:\Apache\cgi-bin

The backupDirectory option tells backup.pl where to write its output. The scriptDirectory option tells backup.pl where to find auxiliary scripts, ie md5.pl. Lines like Apache=D:\My Documents\Apache mean:
Now I need only back up 1 directory: D:\Backup.

POD = Plain Old Documentation

Perl has its own documentation language called POD. The name is modelled on POTS = Plain Old Telephone System. POD files are text files. Many programs exist which read POD files and reformat them. Perl ships with:

POM = POD Object Model

There are also various Perl modules dedicated to processing POD files, eg: Pod::POM. The name is modelled on DOM = Document Object Model.

Converting text into a web site - 1

Here is the XML defining one page on a web site (it's part of a larger file):

<page name = 'Services' logo = './assets/images/butterflies.png' toc = 'Services' order = '04'>
  <header>
    <heading>Services Available</heading>
    <heading_pix>./assets/images/Services.png</heading_pix>
    <page_background>./assets/images/pa000131-pale.png</page_background>
  </header>
  <publication toc_id = 'Training'>
    <text>Small classes only.</text>
    <text>Please - only one dog per person.</text>
    <text>Next class: 1st June, 10:30am.</text>
  </publication>
  <publication toc_id = 'Dietary Advice'>
    <text>We can advise if a special diet is required.</text>
  </publication>
</page>

I've written a program, web-site.pl, which converts XML like this into a set of HTML pages.

Converting text into a web site - 2

[Image: sdf-slide-8.png]

web-site.pl is used to generate 3 web sites at the moment: mine, one for my neighbour who's a gardener, and another one for a dog training school.

Converting text between various formats (SDF)

SDF is a freely available documentation system designed and developed by Ian Clatworthy, with help from many others. SDF is written in Perl. SDF uses text files as input. It has its own markup language, or it can use POD.

Output formats directly supported:

Output formats indirectly supported:
Converting text between various formats (Programs)

Here is a mixed bag of conversion utilities:
By combining these utilities, and the others mentioned herein, you are well on the way to file conversion utopia.

Converting text into images

Firstly, I wish to discuss converting text (PS files) into images (SVG/JPG/BMP etc). pstoedit converts PS files into PDF/SVG/CGM/EMF/MIF/RTF/etc. Of course, some of these are text output files, and some are image formats. In fact, the range of output from pstoedit is too large to include here.

Secondly, some utilities read HTML and output images. html2jpg, Personal edition, can output to either JPG or BMP, and the Enterprise edition can even handle pages which require horizontal scrolling. This package captures web pages, ie not just the HTML code, but also some images, the output from CGI scripts, etc. Batch mode processing is possible.

html2ps is a Perl program, which supports multiple documents and following hyperlinks. If the output is converted to PDF, the hyperlinks are maintained.

Converting images between formats (ImageMagick)

ImageMagick is a collection of utilities for image format conversion. About 70 formats are supported, although not all can be both read and written.

Note: Some ImageMagick features require external programs, which ImageMagick calls 'delegates'. ImageMagick can use the FreeType libraries to annotate with TrueType fonts. It requires GhostScript to read (Encapsulated) PostScript and PDF files. These can then be converted into GIF, PNG etc images. html2ps can be used to read HTML files.

ImageMagick can handle files, eg EPS and PS, which contain multiple pages, and can convert each page into a separate image. ImageMagick can create blank images with special backgrounds, and then annotate them with any text you choose.

ImageMagick consists of command line programs, and has an interface via Perl too. You can use the C/C++ interface if you wish to program in those languages. For CGI scripts, I have written a Perl CGI ImageMagick FAQ.

So, why discuss ImageMagick?
Two reasons:

The Magick scripting language (MSL)

An MSL file is simply an XML file recognized by the ImageMagick utility conjure. An MSL file looks like (call it convert.msl):

<?xml version="1.0" encoding="UTF-8"?>
<image size="400x400" >
  <read filename="image.gif" />
  <get width="base-width" height="base-height" />
  <resize geometry="%[dimensions]" />
  <get width="width" height="height" />
  <print output=
    "Image sized from %[base-width]x%[base-height] to %[width]x%[height].\n" />
  <write filename="image.png" />
</image>

and to run this, with image.gif in the current directory, and output going to image.png, I tested it with:

D:\TEMP>conjure -dimensions 400x400 convert.msl

where the string '%[dimensions]' in the MSL file is set at run-time with the '-dimensions 400x400' option.

GEnealogical Data COMmunication (GEDCOM)

GEDCOM is a text file format used for documenting family trees. There is a range of family tree programs around, and they can use GEDCOM to communicate with each other. The best of these, I believe, is PAF = Personal Ancestral File. A GEDCOM file entry for 1 person looks like:

0 @I1@ INDI
1 NAME Ronald /Savage/ Mr
2 SURN Savage
2 GIVN Ronald
2 NSFX Mr
1 SEX M
1 BIRT
2 DATE 21 Jun 1950
2 PLAC Victoria H, Launceston, Tas, Aust
1 REFN 1
1 _UID 502BB18A5386D5119E640000000000007772
1 FAMS @F1@
1 FAMC @F2@
1 NOTE BC: 764. Born 2:30 pm

Various programs are available which can read a GEDCOM file output by a family tree program, and can convert it into HTML, for example.

GEDCOM on the Web - 1

How do we display data from a GEDCOM file in a web page? The answer is a Perl CGI script called ged-view.cgi, which I wrote. It uses a Cascading Style Sheet (CSS) to format the data in a nice way. A part of the CSS looks like:

#person_center {
padding: 10px;
margin-top: 20px;
margin-bottom: 20px;
margin-right: auto;
margin-left: auto; /* opera does not like 'margin:20px auto' */
background-color: #80c0ff;
border: 5px solid #ccc;
text-align:left; /* part 2 of 2 centering hack */
width: 600px; /* ie5win fudge begins */
voice-family: "\"}\"";
voice-family:inherit;
width: 570px;
}
html>body #person_center {
width: 570px; /* ie5win fudge ends */
}
This is yet another way a simple (:-) text file can exercise remote control over something else. Here it controls the appearance of web pages.

GEDCOM on the Web - 2

But how do we design a web page which can be controlled by a CSS? The answer is HTML templates. In web-ged.cgi I have used a Perl module called the Template Toolkit. A part of a template, as used by the Template Toolkit, looks like:

[% SET spouse_count = 0 %]
[% FOREACH spouse = person.spouse %]
  [% SET spouse_count = spouse_count + 1 %]
  <br>
  <div id = 'person_center'>
    <p>Spouse [% spouse_count %]: <a href = '[% href(spouse.xref) %]'>
    [% spouse.name %]</a> ([% spouse.sex %])</p>
    <p>Marriage: [% spouse.marriage %]</p>
    [% SET child_count = 0 %]
    [% FOREACH child = spouse.children %]
      [% SET child_count = child_count + 1 %]
      <hr>
      <p>Child [% child_count %]: <a href = '[% href(child.xref) %]'>
      [% child.name %]</a> ([% child.sex %])</p>
      <p>Birth: [% child.birth %]</p>
      <p>Death: [% child.death %]</p>
    [% END %]
  </div>
[% END %]

Look closely. See <div id = 'person_center'>? That connects the HTML to the CSS. Also notice: There are absolutely no HTML tables in this set of templates.

Mailbox (MBX) files

Some email clients, such as MS Outlook Express (now called Virus Express), store email in a proprietary format. However, high quality email clients use an industry standard format - MBX. MBX files are text files, with individual emails simply concatenated one after the other into one long file. Each mailbox you define in your email client will have all its emails stored in this one file.

Of course, many people have written programs, eg in Perl, which can read MBX files and process them. I use a Perl module called Mail::Header to process my MBX files. Why would you want to do this?

Yes, it is this last one I do. My email client is called PocoMail. Poco is the Spanish word for 'small', and PocoMail - which costs $25 - is a small program, dedicated to managing email.
However, when you delete emails, it does not delete the corresponding attachments. So, I've written a little program to find the attachments I wish to keep, and to delete the rest.

Browser Favourites

Because Internet Explorer became slower and slower as I added more and more Favourites, I now keep my favourites in a text file. The file is structured. This format allows me to add annotations to links if I choose, and to keep related links together, thus emulating folders. I can sort the file anytime I wish. I use my editor to search for anything interesting. Since my editor runs all day, I can keep this text file permanently loaded. It does mean I have to copy-and-paste links, rather than just clicking on them, but I feel the convenience outweighs this extra effort. Simple!

Executable Text Files

There are basically 2 types of text files which we can execute:
Executable text follows, and the executor is Perl.

#!/usr/bin/perl
print "Hello, World\n";

A string, eval-able by Perl (in a file) will look like:

%hash = ( 'key_1' => 'Value one', 'key_2' => 'Value two' );

Writing and Reading eval-able strings

We can write and read such a string with:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my(%hash) =
(
key_1 => 'Value one',
key_2 => 'Value two',
);
open(OUT, '> string.txt') || die("Can't open(> string.txt): $!");
print OUT Data::Dumper -> Dump([\%hash], ['*hash']);
close(OUT);

# Second script: read the string back in and eval it.
#!/usr/bin/perl
use strict;
use warnings;
my(%hash);
open(INX, 'string.txt') || die("Can't open(string.txt): $!");
local $/; # Slurp mode: read the whole file in one go.
eval <INX>;
die("Can't eval string.txt: $@") if ($@);
close(INX);
print map{"$_ => $hash{$_}. \n"} sort keys %hash;
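Because eval runs whatever code the file contains, this technique is only appropriate for strings you wrote yourself. What follows is a minimal sketch of the same round trip done in memory, in a single script, with the result of eval checked. The sub names hash_to_string and string_to_hash are hypothetical, and are not part of the scripts above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: serialise a hash as an eval-able string.
sub hash_to_string
{
    my($hash_ref) = @_;
    local $Data::Dumper::Sortkeys = 1; # Stable key order in the output.
    return Data::Dumper -> Dump([$hash_ref], ['*hash']);
}

# Hypothetical helper: eval the string and return the hash it defines.
# Only use this on strings from a trusted source, since eval executes them.
sub string_to_hash
{
    my($string) = @_;
    my(%hash);      # The eval-ed string assigns to this lexical.
    eval $string;
    die("Can't eval string: $@") if ($@);
    return %hash;
}

my($string) = hash_to_string({key_1 => 'Value one', key_2 => 'Value two'});
my(%copy)   = string_to_hash($string);
print map{"$_ => $copy{$_}. \n"} sort keys %copy;
```

The name '*hash' passed to Dump is what makes the string start with '%hash = (', so the reader must declare a variable of exactly that name before calling eval.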
Creating and Editing images with text

Here I'll just mention a few image/text manipulation packages.
Nested Parsers for HTML - 1 of 4

By this I mean parsing an HTML file to extract a string of text, and then parsing that extracted text using another parser, all in 1 go. Here is a screen-capture of the original HTML:

[Image: sdf-slide-21.png]

Nested Parsers for HTML - 2 of 4

Here is the Perl code which parses the HTML file and extracts the table:

use DBI;               # Database interface; the AnyData driver reads the HTML table.
use Parse::RecDescent; # Parser for the text extracted from each table cell.
my($table_name) = 'method';
my($table_type) = 'HTMLtable';
my($table_file) = 'perl.html';
my($row_names) = 'Method,Parameters,Description';
my($dbh) = DBI -> connect('dbi:AnyData(RaiseError=>1):');
$dbh -> func($table_name, $table_type, $table_file,
{col_names => $row_names, count => 8}, 'ad_catalog');
my($sql) = "select * from $table_name";
my($sth) = $dbh -> prepare($sql) || die("Can't prepare: $sql: $DBI::errstr");
$sth -> execute() || die("Can't execute: $sql: $DBI::errstr");
$| = 1;
my($method_parser) = Parse::RecDescent -> new($method_grammar) || die("Invalid method grammar");
my($data, $text, %method);
while ($data = $sth -> fetchrow_hashref() )
{
$text = $$data{'Parameters'};
$text =~ tr/\cM\cJ/ /;
$text =~ s/([a-z])\s+([a-z])/${1}_$2/g; # Make things like 'stroke => color name' easy.
$text =~ tr/ \t/ /s;
if ( (! $text) || ($text eq ' ') )
{
$method{$$data{'Method'} } = '';
next;
}
$text = $method_parser -> Option($text);
die("Cannot parse text") if (! defined($text) );
$method{$$data{'Method'} } = $text;
}
die("Can't fetch: $sql: $DBI::errstr") if ($DBI::errstr);
$dbh -> disconnect();
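The three clean-up steps applied to each Parameters cell before it reaches the parser can be tried on their own. This is a minimal sketch: the sub name normalise_parameters and the sample string are made up for illustration, and only the tr/// and s/// statements come from the loop above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper wrapping the clean-up statements from the loop above.
sub normalise_parameters
{
    my($text) = @_;
    $text =~ tr/\cM\cJ/ /;                  # Carriage returns and line feeds become spaces.
    $text =~ s/([a-z])\s+([a-z])/${1}_$2/g; # 'color name' becomes 'color_name'.
    $text =~ tr/ \t/ /s;                    # Runs of spaces and tabs collapse to 1 space.
    return $text;
}

# Made-up sample of a table cell whose text is wrapped over 2 lines.
my($cell) = "stroke => color name,\r\n  fill => color name";
print normalise_parameters($cell), "\n";
```

Note that the second statement joins any 2 lowercase letters separated by whitespace, so it relies on the pairs being separated by commas and '=>' rather than by whitespace alone.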
Nested Parsers for HTML - 3 of 4

Here is the Parse::RecDescent grammar which parses the text extracted from the HTML:

my($method_grammar) =
q{
Token: /^@?\w[-\w]*/
NextToken: Token '=>'
Comma: /,/
Option: <leftop: Pair Comma Pair>
{
$return = join($;, sort grep{! /,/} @{$item[1]});
}
Pair: Token '=>' '{' <commit> <leftop: Token Comma Token> '}'
{
$return = "$item[1] => " . join(' | ', sort grep{! /,/} @{$item[5]});
}
| Token '=>' Token Comma ...!NextToken Token
{
$return = "$item[1] => $item[3] & $item[6]";
}
| Token '=>' Token
{
$return = "$item[1] => $item[3]";
}
| 'string'
{
$return = $item[1];
}
| <error: Invalid option specification>
};
Nested Parsers for HTML - 4 of 4

Here is the output of the program:

D:\scripts\ImageMagick-Conjure>perl conjure-text-2-xml.pl -help
Method: AddNoise. Parameters:
noise = Gaussian | Impulse | Laplacian | Multiplicative | Poisson | Uniform
Method: AffineTransform. Parameters:
affine = array_of_float_values
rotate = float
scale = float & float
skewX = float
skewY = float
translate = float & float
Method: Annotate. Parameters:
affine = array_of_float_values
antialias = false | true
density = geometry
family = string
fill = color_name
font = string
geometry = geometry
gravity = Center | East | North | NorthEast | NorthWest | South | SouthEast | SouthWest | West
pointsize = integer
rotate = float
scale = float & float
stretch = Condensed | Expanded | ExtraCondensed | ExtraExpanded | Normal | SemiCondensed | SemiExpanded | UltraCondensed | UltraExpanded

Resources
Author

Ron Savage
Home page: http://savage.net.au/index.html

Licence

Australian Copyright © 2002 Ron Savage. All rights reserved. All Programs of mine are 'OSI Certified Open Source Software'; you can redistribute them and/or modify them under the terms of The Artistic License, a copy of which is available at: http://www.opensource.org/licenses/index.html