HTML::Parser::Simple - Parse nice HTML files without needing a compiler
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::Parser::Simple;
# -------------------------
# Method 1:
my($p) = HTML::Parser::Simple -> new
(
    {
        input_dir  => '/source/dir',
        output_dir => '/dest/dir',
    }
);
$p -> parse_file('in.html', 'out.html');
# Method 2:
$p = HTML::Parser::Simple -> new(); # Re-use $p, since a 2nd my($p) would mask the 1st.
$p -> parse('<html>...</html>');
$p -> traverse($p -> get_root() );
print $p -> result();
HTML::Parser::Simple is a pure Perl module.
It parses HTML V 4 files, and generates a tree of nodes, one per HTML tag.
The data associated with each node is documented in the FAQ.
This module is available as a Unix-style distro (*.tgz).
See http://savage.net.au/Perl-modules.html for details.
See http://savage.net.au/Perl-modules/html/installing-a-module.html for help on unpacking and installing.
new(...) returns an object of type HTML::Parser::Simple.
This is the class's constructor.
Usage: HTML::Parser::Simple -> new().
This method takes a hashref of options.
Call new() as new({option_1 => value_1, option_2 => value_2, ...}).
Available options:
input_dir takes the path where the input file is to be read from.
The default value is '' (the empty string).
output_dir takes the path where the output file is to be written.
The default value is '' (the empty string).
verbose takes either a 0 or a 1.
0 means write fewer progress messages to STDERR, and 1 means write more.
The default value is 0.
Note: Currently, setting verbose does nothing.
xhtml takes either a 0 or a 1.
0 means do not accept an XML declaration, such as <?xml version="1.0" encoding="UTF-8"?>, at the start of the input file, nor some other XHTML features.
1 means accept it.
The default value is 0.
Warning: The only XHTML changes to this code, so far, are:
Accept the XML declaration. E.g.: <?xml version="1.0" standalone='yes'?>.
Accept attribute names containing the ':' character. E.g.: <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">.
Returns the Tree::Simple object which the parser calls the current node.
Returns the nesting depth of the current tag.
It's just there in case you need it.
Returns the input_dir parameter, as passed in to new().
Returns the output_dir parameter, as passed in to new().
Returns the type of the most recently created node, 'global', 'head', or 'body'.
See the first question in the FAQ for details.
Returns the result so far of the parse.
Returns the node which the parser calls the root of the tree of nodes.
Returns the verbose parameter, as passed in to new().
Returns the xhtml parameter, as passed in to new().
Print $msg to STDERR if new() was called as new({verbose => 1}), or if $p -> set_verbose(1) was called.
Otherwise, print nothing.
Parses the string of HTML in $html, and builds a tree of nodes.
After calling $p -> parse(), you must call $p -> traverse($p -> get_root() ) before calling $p -> result().
Alternately, call $p -> parse_file(), which calls all these methods for you.
Note: parse() may be called directly or via parse_file().
Parses the HTML in the input file, and writes the result to the output file.
Returns the result so far of the parse.
Sets the node which the parser calls the current node.
Returns undef.
Sets the nesting depth of the current node.
Returns undef.
It's just there in case you need it.
Sets the input_dir parameter, as though it was passed in to new().
Returns undef.
Sets the output_dir parameter, as though it was passed in to new().
Returns undef.
Sets the type of the next node to be created, 'global', 'head', or 'body'.
See the first question in the FAQ for details.
Returns undef.
Sets the node which the parser calls the root of the tree of nodes.
Returns undef.
Sets the verbose parameter, as though it was passed in to new().
Returns undef.
Sets the xhtml parameter, as though it was passed in to new().
Returns undef.
The data of each node is a hash ref. The keys/values of this hash ref are:
This is the string of HTML attributes associated with the HTML tag.
So, <table align = 'center' summary = 'Body'> will have an attributes string of " align = 'center' summary = 'Body'".
Note the leading space.
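Since the attributes arrive as one raw string, code which needs individual values has to split that string itself. Here is a minimal sketch, assuming every attribute value is quoted. parse_attributes() is a hypothetical helper written for this example, and is not part of HTML::Parser::Simple.

```perl
use strict;
use warnings;

# Hypothetical helper: split a raw attribute string, as stored in a node,
# into a hash ref of name => value. Unquoted values are not handled.
sub parse_attributes
{
    my($string) = @_;
    my(%attr);

    # Match name = 'value' or name = "value", with optional spaces.
    while ($string =~ /([\w:-]+)\s*=\s*(["'])(.*?)\2/g)
    {
        $attr{$1} = $3;
    }

    return \%attr;
}

my($attr) = parse_attributes(" align = 'center' summary = 'Body'");

print "$_ => $attr->{$_}\n" for sort keys %$attr;
```

The leading space in the input is harmless here, because the regexp skips anything which is not part of a name/value pair.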
This is an array ref of bits and pieces of content.
Consider this fragment of HTML:
<p>I did <i>not</i> say I <i>liked</i> debugging.</p>
When parsing 'I did ', the number of child nodes (of <p>) is 0, since <i> has not yet been detected.
So, 'I did ' is stored in the 0th element of the array ref.
Likewise, 'not' is stored in the 0th element of the array ref belonging to the node 'i'.
Next, ' say I ' is stored in the 1st element of the array ref, because it follows the 1st child node (<i>).
Likewise, ' debugging' is stored in the 2nd element.
This way, the input string can be reproduced by successively outputting the elements of the array ref of content interspersed with the contents of the child nodes (processed recursively).
Note: If you are processing this tree, never forget that there can be content after the last child node has been closed, but before the current node is closed.
Note: The DOCTYPE declaration is stored as the 0th element of the content of the root node.
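The interleaving of content and child nodes described above can be sketched with plain hash refs. This is only an illustration: the 'children' key and the render() function are assumptions made for this example, whereas the module itself keeps child nodes in the tree object.

```perl
use strict;
use warnings;

# Stand-in node: a hash ref with name, attributes and content, plus a
# 'children' key which exists only in this sketch.
sub render
{
    my($node)    = @_;
    my(@content) = @{$node->{content} || []};
    my(@child)   = @{$node->{children} || []};
    my($max)     = @child > @content ? scalar @child : scalar @content;
    my($html)    = '<' . $node->{name} . $node->{attributes} . '>';

    for my $i (0 .. $max - 1)
    {
        # Element $i of the content precedes child $i. When $i is past
        # the last child, it is the text before the closing tag.
        $html .= $content[$i] if (defined $content[$i]);
        $html .= render($child[$i]) if ($i < @child);
    }

    return $html . '</' . $node->{name} . '>';
}

my($p) =
{
    name       => 'p',
    attributes => '',
    content    => ['I did ', ' say I ', ' debugging.'],
    children   =>
    [
        {name => 'i', attributes => '', content => ['not'],   children => []},
        {name => 'i', attributes => '', content => ['liked'], children => []},
    ],
};

print render($p), "\n";
```

The loop runs to whichever is longer, the content or the list of children, so text after the last child is not lost.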
The nesting depth of the tag within the document.
The root is at depth 0, '<html>' is at depth 1, '<head>' and '<body>' are at depth 2, and so on.
It's just there in case you need it.
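As a small illustration of that numbering, this sketch walks a hand-built tree of plain hash refs and reports the depth of each node. The 'children' key and the depths() function are assumptions made for this example, and are not the module's own data layout or API.

```perl
use strict;
use warnings;

# Collect "depth name" strings for a stand-in tree, in document order.
sub depths
{
    my($node, $depth, $seen) = @_;
    $depth ||= 0;
    $seen  ||= [];

    push @$seen, "$depth $node->{name}";

    depths($_, $depth + 1, $seen) for @{$node->{children} || []};

    return $seen;
}

my($tree) =
{
    name     => 'root',
    children =>
    [{
        name     => 'html',
        children =>
        [
            {name => 'head', children => [{name => 'title'}]},
            {name => 'body', children => [{name => 'p'}]},
        ],
    }],
};

print "$_\n" for @{depths($tree)};
```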
The name is the name of the HTML tag. So, the tag '<html>' will mean the name is 'html'.
The root of the tree is called 'root', and holds the DOCTYPE, if any, as content.
The root has the node 'html' as the only child, of course.
This holds 'global' before '<head>' and between '</head>' and '<body>', and after '</body>'.
It holds 'head' for all nodes from '<head>' to '</head>', and holds 'body' from '<body>' to '</body>'.
It's just there in case you need it.
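The rule for 'global', 'head' and 'body' can be sketched as a tiny state machine over a list of tags. This is a simplified model written for this example, not the module's code. It assumes the '<head>' and '<body>' nodes themselves are typed 'head' and 'body', and node_types() is a hypothetical function.

```perl
use strict;
use warnings;

# Return a "name type" string for each start tag in the list.
sub node_types
{
    my(@token) = @_;
    my($type)  = 'global';
    my(@result);

    for my $token (@token)
    {
        if ($token =~ m|^<(\w+)>$|)
        {
            # Entering head or body changes the type, starting with
            # the head or body node itself (an assumption).
            $type = $1 if ($1 eq 'head' or $1 eq 'body');

            push @result, "$1 $type";
        }
        elsif ($token eq '</head>' or $token eq '</body>')
        {
            # Leaving head or body reverts the type to global.
            $type = 'global';
        }
    }

    return @result;
}

print "$_\n" for node_types(qw(<html> <head> <title> </title> </head> <body> <p> </p> </body> </html>) );
```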
They are treated as content. This includes the prefix '<!--' and the suffix '-->'.
It is treated as content belonging to the root of the tree.
It is treated as content belonging to the root of the tree.
No, never.
Up to V 4.
Make yourself a nice cup of tea, and then fix your page.
No.
For example, if you feed in an HTML page without the title tag, this module does not care.
By installing HTML::Revelation, of course!
Sample output:
http://savage.net.au/Perl-modules/html/CreateTable.html
Suggested steps:
Note: There are quite a few files involved. Proceed with caution.
Call this input.html.
Reveal.pl ships with HTML::Revelation.
Call the output file output.1.html.
Parse.html.pl ships with HTML::Parser::Simple.
Call the output file parsed.html.
Call the output file output.2.html.
If they match, or even if they don't match, you're finished.
No, never.
Help with quirks:
http://www.quirksmode.org/sitemap.html
Yes. If your HTML file is not nice, the interpretation of tag nesting will not match your preconceptions.
In such cases, do not seek to fix the code. Instead, fix your (faulty) preconceptions, and fix your HTML file.
The 'a' tag, for example, is defined to be an inline tag, but the 'div' tag is a block-level tag.
I don't define 'a' to be inline, others do, e.g. http://www.w3.org/TR/html401/ and hence HTML::Tagset.
Inline means:
<a href = "#NAME"><div class = 'global_toc_text'>NAME</div></a>
will not be parsed as an 'a' containing a 'div'.
The 'a' tag will be closed before the 'div' is opened. So, the result will look like:
<a href = "#NAME"></a><div class = 'global_toc_text'>NAME</div>
To achieve what was presumably intended, use 'span':
<a href = "#NAME"><span class = 'global_toc_text'>NAME</span></a>
Some people (*cough* *cough*) have had to redo their entire websites due to this very problem.
Of course, this is just one of a vast set of possible problems.
You have been warned.
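The way the 'a' is closed before the 'div' is opened can be modelled with a stack of open tags. This is a simplified sketch of the rule only, not the module's code. The two tag lists are tiny hand-written assumptions, and renest() is a hypothetical function.

```perl
use strict;
use warnings;

# Assumed, deliberately short, lists of block-level and inline tags.
my(%block)  = map{($_ => 1)} qw(div p table ul);
my(%inline) = map{($_ => 1)} qw(a b em i span);

sub renest
{
    my(@token) = @_;
    my(@stack, @out);

    for my $token (@token)
    {
        if ($token =~ m|^<(\w+)|) # Start tag.
        {
            my($name) = $1;

            if ($block{$name})
            {
                # A block-level tag closes any open inline tags first.
                push @out, '</' . pop(@stack) . '>' while (@stack && $inline{$stack[-1]});
            }

            push @stack, $name;
            push @out, $token;
        }
        elsif ($token =~ m|^</(\w+)>|) # End tag.
        {
            my($name) = $1;

            # Ignore an end tag whose start tag is no longer open.
            next if (! grep{$_ eq $name} @stack);

            # Close any tags still open inside the one being closed.
            push @out, '</' . pop(@stack) . '>' while ($stack[-1] ne $name);

            pop @stack;
            push @out, $token;
        }
        else # Text.
        {
            push @out, $token;
        }
    }

    return join('', @out);
}

print renest(q|<a href = "#NAME">|, q|<div class = 'global_toc_text'>|, 'NAME', '</div>', '</a>'), "\n";
print renest(q|<a href = "#NAME">|, q|<span class = 'global_toc_text'>|, 'NAME', '</span>', '</a>'), "\n";
```

In the first call the original '</a>' arrives after the 'a' has already been closed, so it is dropped. In the second call 'span' is inline, so the nesting survives unchanged.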
During testing, Tree::Fast crashed, so I replaced it with Tree and everything worked. Spooky.
Late news: Tree does not cope with an array ref stored in the metadata, so I've switched to Tree::DAG_Node.
Stop press: As an experiment I switched to Tree::Simple. Since it also works I'll just keep using it.
That name sounds like a pure Perl version of the same API as used by HTML::Parser.
But the APIs are not, and are not meant to be, compatible.
Some people might falsely assume HTML::Parser can automatically fall back to HTML::Parser::PurePerl in the absence of a compiler.
As always with OO code, sub-class! In this case, you write a new version of the traverse() method.
Alternately, implement another method in your sub-class, e.g. process(), which recurses like traverse(). Then call parse() and process().
Yes. See: git://github.com/ronsavage/html--parser--simple.git
I edit with Emacs, using the default formatting for Perl.
That means, in general, leading 4-space tabs. Hashrefs use a leading tab and then a space.
All vertical alignment within lines is done manually with spaces.
Perl::Critic is off the agenda.
This Perl HTML parser has been converted from a JavaScript one written by John Resig.
http://ejohn.org/files/htmlparser.js
Well done John!
Note also the comments published here:
http://groups.google.com/group/envjs/browse_thread/thread/edd9033b9273fa58
HTML::Parser::Simple was written by Ron Savage in 2009.
Home page: http://savage.net.au/index.html
Australian copyright © 2009 Ron Savage.
All Programs of mine are 'OSI Certified Open Source Software';
you can redistribute them and/or modify them under the terms of
The Artistic License, a copy of which is available at:
http://www.opensource.org/licenses/index.html