|CGI Scripting - An Introduction|
|URI = URL + URN|
|Web Server Start Up|
|Web Server Request Loop|
|Web Server Directory Structure|
|Web Server Configuration|
|A Perl Web Client|
|Web Client Requests|
|Action = Script|
|Web Page Content|
|Digression: HTML 'v' XML|
|Re Action: A Tale of 2 Scripts|
|Combining Perl and HTML|
|A Detour - SDF|
|Inside a Script: Who's Calling?|
|Data per URI - Page Design|
|Data per URI - Page Content|
|Data per URI - Page Content Revisited|
This article is about CGI scripting in Perl.
Of course, scripting can take place anywhere, not just in the context of the web. And many languages can be used to write CGI scripts.
Perl has been ported to about 70 operating systems. The most recent I know of (February 2001) is Windows CE. This makes Perl widely available.
In this article I will make a lot of simplifications and generalizations. Here's the first...
There are 2 types of CGI scripts:
Invariably, a script of this second type, having processed the data, will output an HTML page or form to let the user continue, or at least know what happened, or it will do a CGI redirect to another script which outputs something.
I am splitting scripts into 2 types just to emphasize that there are differences between processing forms and processing in the absence of forms.
There are Web Servers, and Web Clients. Some web clients are browsers.
There are programs and scripts. Once upon a time, programs were compiled and scripts were interpreted. Hence the 2 names.
But today, this is 'a distinction without a difference'. My attitude is that the 2 words, program and script, are interchangeable.
Program and process, however, are different.
Program means a program on disk.
Process means a program which has been loaded by the operating system into memory, and is being executed. This means a single program on disk can be loaded and run several times simultaneously, in which case it is 1 program and several processes.
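Here is a minimal sketch of '1 program, several processes'. It starts the Perl interpreter itself (whose path Perl keeps in $^X) 3 times, and asks each copy to print its process id. The subroutine name start_copies is my own invention, and the list form of pipe open used here is assumed to be available (under Windows that requires a reasonably recent Perl).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Start the same program on disk several times. $^X is the path of the
# perl executable which is running this script.
sub start_copies {
    my ($count) = @_;
    my @handles;
    for (1 .. $count) {
        # Run "perl -e 'print $$'" and arrange to read what it prints.
        # In the child, $$ is that child's own process id.
        open(my $fh, '-|', $^X, '-e', 'print $$')
            or die "Cannot start $^X: $!";
        push @handles, $fh;
    }

    # All the copies have been started. Now collect the id each printed.
    my @pids;
    for my $fh (@handles) {
        my $pid = <$fh>;
        close $fh;
        push @pids, $pid;
    }
    return @pids;
}

my @pids = start_copies(3);
print "1 program ($^X), ", scalar(@pids), " processes: @pids\n";
```

The ids differ from run to run, but within one run each copy reports a different id, even though every copy was loaded from the same file on disk.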
Web servers have names like Apache, Zeus, MS IIS and TinyHTTPd. Apache and TinyHTTPd (Tiny HyperText Transfer Protocol Daemon) are Open Source. Zeus and MS IIS (Internet Information Server) are commercial products.
The feeble security of IIS makes it unusable in a commercial environment.
My examples will use Apache as the web server.
Web clients which are browsers have names like Opera, Netscape, IE.
Of course, you can roll your own non-browser web client. We'll do this below.
You'll notice the 3 letters I, L and N are in alphabetical order. That's the way to remember this formula.
When a web server starts running, these are the basic steps taken:
It doesn't matter which web server you are using, and it doesn't matter if the web server is running under Unix or Windows or any other OS. These principles will apply.
The web server request loop, simplified (as always), has several steps:
Pictorially, we have an infinity symbol, ie a figure-of-8 on its side:
+------+  1 -Request--->  +------+  2 -Action-->  +------+
| Web  |  (URI or Submit) | Web  |  (script.pl)   | Perl |
|Client|                  |Server|                |Script|
+------+  <--Response- 4  +------+  <---HTML-- 3  +------+
         (Header and HTML)        (Plain page or CGI form)
Things to note:
In reality, of course, messages 1 and 4 will be wrapped in TCP/IP envelopes, and messages 2 and 3 will be mediated (handled) by the OS.
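The 4 numbered messages in the diagram can be sketched in a few lines of Perl. This is a toy, not a web server: there is no network code, the 'page' and the 'script' live in 2 hashes I have made up for the purpose (%static_page and %script), and handle_request is a hypothetical name. It only shows the shape of message 1 coming in and message 4 going out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for a plain page and for a script.
my %static_page = (
    '/index.html' => '<html><body>Plain page</body></html>',
);
my %script = (
    '/cgi-bin/x.pl' => sub { '<html><body><form>CGI form</form></body></html>' },
);

# Message 1 (the request) comes in, message 4 (the response) goes out.
sub handle_request {
    my ($request) = @_;

    # The first line of a request looks like: GET /index.html HTTP/1.0
    my ($method, $path) = $request =~ m{\A(\w+)\s+(\S+)}
        or return "HTTP/1.0 400 Bad Request\r\n\r\n";

    my $html;
    if (my $action = $script{$path}) {
        $html = $action->();            # Messages 2 and 3: run the script.
    }
    elsif (exists $static_page{$path}) {
        $html = $static_page{$path};    # A plain page. No script involved.
    }
    else {
        return "HTTP/1.0 404 Not Found\r\n\r\n";
    }

    # Message 4: a header, a blank line, and then the HTML.
    return "HTTP/1.0 200 OK\r\n"
         . "Content-Type: text/html\r\n"
         . "\r\n"
         . $html;
}

print handle_request("GET /cgi-bin/x.pl HTTP/1.0\r\n\r\n"), "\n";
```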
Many issues arise here. A brief summary:
But how does the web server know which page to return or which script to run? To answer this we next look at the directory structure on the web server's machine.
Below, Monash and Rusden are the names of university campuses.
monash.edu and rusden.edu will be listed under the 'Virtual Hosts' part of httpd.conf, or, if you are running MS Windows NT/2k, they can be named in the file C:/WinNT/System32/Drivers/Etc/Hosts. Under other versions of MS Windows, the hosts file will be C:/Windows/Hosts.
And a warning about the NT version of this file. Windows Explorer will lie to you about the attributes of this file. You will have to log off as any user and log on as the administrator to be able to save edits into this file.
See http://savage.net.au/Perl/html/configure-apache.html for details.
Assume this directory structure:
- D:/
  - www/
    - cgi-bin/        (Contents can be executed by web server but not viewed by web clients)
      - x.pl
    - conf/           (Contents invisible to web clients)
      - httpd.conf
    - public/         (Contents can be viewed by web clients)
      - index.html
      - monash/
        - index.html
      - monash/staff
        - mug-shots.html
      - rusden/
        - index.html
      - rusden/staff
        - courses.html
Now, the web server can be told, via its configuration file httpd.conf, that:
Hence, a request for http://monash.edu/staff/mug-shots.html returns the disk file D:/www/public/monash/staff/mug-shots.html
Hence a request for http://rusden.edu/staff/courses.html returns the disk file D:/www/public/rusden/staff/courses.html
Did you notice that both virtual hosts use D:/www/cgi-bin?
==============================================================
These 2 hosts have their own document trees, but share scripts
==============================================================
We can service any number of virtual hosts with only one copy of each script. This is a huge maintenance saving.
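The mapping from a URI to a disk file can be sketched like this. It is only an illustration of the idea, not how Apache does it: the hash %document_root and the subroutine uri_to_file are my own names, and I am assuming that scripts are requested under the URI prefix /cgi-bin/.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each virtual host has its own document tree...
my %document_root = (
    'monash.edu' => 'D:/www/public/monash',
    'rusden.edu' => 'D:/www/public/rusden',
);

# ...but every host shares the one script directory.
my $script_dir = 'D:/www/cgi-bin';

# Turn a URI into the name of a disk file, or return nothing if the
# host is unknown or the URI looks wrong.
sub uri_to_file {
    my ($uri) = @_;
    my ($host, $path) = $uri =~ m{\Ahttp://([^/:]+)(?::\d+)?(/.*)?\z}
        or return;
    $path = '/index.html' if !defined $path || $path eq '/';

    # Refuse '..' so a request cannot climb out of the document tree.
    return if $path =~ m{(?:\A|/)\.\.(?:/|\z)};

    my $root = $document_root{lc $host} or return;

    # Assumption: scripts are requested as /cgi-bin/<name>.
    return "$script_dir$1" if $path =~ m{\A/cgi-bin(/.+)\z};
    return "$root$path";
}

print uri_to_file('http://monash.edu/staff/mug-shots.html'), "\n";
print uri_to_file('http://rusden.edu/cgi-bin/x.pl'), "\n";
```

The first line printed is D:/www/public/monash/staff/mug-shots.html and the second is D:/www/cgi-bin/x.pl. A request for http://monash.edu/cgi-bin/x.pl gives the same script file.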
This is the information available to the web server when a request comes in from a web client. So, now let's look at the client side of things.
Here is a real, live, complete, Perl Web Client, which is obviously not a browser:
#!/usr/bin/perl
use LWP::Simple;
print get('http://savage.net.au/index.html');
Yes folks, that's it. The work is managed by the Perl module 'LWP::Simple', and is available thru the command 'get', which that module exports, ie makes public so it can be used in scripts like this one. LWP stands for Library for WWW in Perl.
This code runs identically, from the command line, under Windows and Linux.
The output is 'print'ed to the screen, but not formatted according to the HTML.
It's time now to step thru the web server-web client interaction:
When you type something like 'rusden.edu' into the browser's address field, or pass that string to a web client, and hit Go, here's sort of what happens:
In reality, processing the request and manufacturing the response can be quite complex procedures.
There are 2 types of web pages sent out to web clients:
In such cases, the form must contain a submit button of some sort. You can use a clickable image as a submit button. Or, you may use a standard submit button, whose appearance has perhaps been transformed by a cascading style sheet, as the thing to click.
If you view the source of such a form, you will always find text like: <form method='POST' action='http://some.domain.net.au/script.pl' enctype='application/x-www-form-urlencoded'>.
The 'action' part tells the web server, when the form is submitted, which script to run to process the form's data.
The web server asks the operating system to load and run the script, and then it (the web server) passes the data (from the form) to the script. The script processes the data and outputs a response (which would normally be another form).
I've used './script.pl' to indicate the script is in the 'current' directory, but be warned, the CGI protocol does not specify what the current directory is at any time.
In fact, it does not even specify that any current directory exists. Your scripts must, at all times, know exactly where they are and what they are doing.
Remember, this 'action' is taking place inside (ie from the point of view of) the web server.
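To make 'the data from the form' concrete, here is a rough sketch of what a script receives when the enctype is application/x-www-form-urlencoded, and how it could be decoded by hand. The subroutine names decode_form_data and read_post_data, and the field names in the sample string, are hypothetical. In a real script I would let the CGI module do this work via its param method.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decode one application/x-www-form-urlencoded string, such as
# 'name=Ron+Savage&campus=Monash%2FRusden', into a hash.
sub decode_form_data {
    my ($data) = @_;
    my %field;
    for my $pair (split /[&;]/, $data) {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $value = '' unless defined $value;
        for ($name, $value) {
            tr/+/ /;                              # '+' encodes a space.
            s/%([0-9A-Fa-f]{2})/chr hex $1/ge;    # '%2F' encodes '/'.
        }
        $field{$name} = $value;    # A repeated name keeps its last value.
    }
    return %field;
}

# With method 'POST' the web server supplies the data on STDIN, and the
# length of the data in the environment variable CONTENT_LENGTH.
sub read_post_data {
    my $length = $ENV{CONTENT_LENGTH} || 0;
    my $data   = '';
    read(STDIN, $data, $length) if $length > 0;
    return $data;
}

my %field = decode_form_data('name=Ron+Savage&campus=Monash%2FRusden');
print "$_ = $field{$_}\n" for sort keys %field;
```

Run from the command line, this prints 'campus = Monash/Rusden' and then 'name = Ron Savage'.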
Web pages usually contain data in a combination of languages:
Of course, complex validation often requires access to databases and so on, so sometimes there is no escape from the overhead just listed.
As an aside, here's how HTML compares to XML.
HTML is a rendering language. It indicates how the data is to be displayed.
XML is a meta-language. It indicates the meaning of the data.
HTML: '<h1>25</h1>' tells you how 25 should look, but not what it is. In other words, '<h1>' is a command, telling a web client how to display what follows.
HTML: '<th>Temperature</th><td>25</td>' tells you how to align the 25, but not what it is.
XML: '<temperature>25</temperature>' tells you what 25 is. '<temperature>' is not a command.
XML: '<street_number>25</street_number>' tells you what 25 is.
Hmmm. This would make a marvellous exam question.
So, what happens when a web client requests that a web server run a script?
To answer this, let's look at a web client request for a script-generated form, and how that request is processed.
In fact, the web client is saying to the web server: 'Pretty please, run _your_ script on _my_ data'.
Let's step thru the procedure:
You can see the problem. How does script # 2 know what 'state' script # 1 got up to?
The problem of maintaining state is a big problem.
Chapter 5 in 'Writing Apache Modules with Perl and C' is called 'Maintaining State', and is dedicated to this problem. See 'Resources', below.
A few alternatives, and a very simple discussion of possible drawbacks:
Drawback: A person can simply use the browser's 'View Source' command to see the values. Hidden simply means that these fields are not rendered on the screen. There is absolutely no security in hidden fields.
Drawback: The web client may have disabled cookies. Some banks do this under the false assumption that cookies can contain viruses.
Drawback: If the cookie is written to disk by the web client, the text in the cookie must be encrypted if you want to stop people looking at it or changing it.
Drawback: The data is in the memory of one process, and when the web client logs back in (ie submits the form data) it may be connected to a different process, ie a _copy_ of the process which sent the first response, and this copy will not have access to the memory of the first process.
Here's how: Generate a random number. Write the data into a database, using the random number as the key. Send the random number to the web client, to be returned with the form data.
Drawback: You can't just use the operating system's random number generator, since anyone with the same OS and compiler could predict the numbers, since they aren't truly random.
Drawback: Relative URIs no longer work correctly. However, help is at hand with Perl module Apache::StripSession.
Drawback: Under some circumstances it is possible for the session ID to 'leak' to other sites.
Drawback: How does script # 2 know the name of this file, created by script # 1? It's simple if they are the same script, but they don't have to be.
Drawback: What happens if 2 copies of script # 1 run at the same time?
In each case, you either abandon that alternative, or add complexity to overcome the drawbacks.
There is no one, perfect, solution which fits all cases. You must study the alternatives, study your situation, and choose a course of action.
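As one illustration, the alternative which saves the data under a random key might be sketched like this. It is a sketch under several assumptions, not a recommendation: the hash %session_table is a stand-in for a real database table, the subroutine names are hypothetical, and the key is a SHA-256 digest which mixes in bytes from /dev/urandom where that device exists. Where it does not exist (eg under Windows) the key falls back to the time, the process id and Perl's rand, which is exactly the predictable kind of value warned about above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

my %session_table;    # Hypothetical stand-in for a database table.
my $counter = 0;

# Make a key which is hard to guess.
sub new_session_id {
    my $seed = join '|', time(), $$, rand(), ++$counter;

    # Mix in bytes from the kernel's random device where one exists.
    # WARNING: without it, the seed above is weak and guessable.
    if (open my $fh, '<:raw', '/dev/urandom') {
        my $bytes = '';
        read $fh, $bytes, 32;
        close $fh;
        $seed .= $bytes;
    }
    return sha256_hex($seed);
}

# Script # 1: save the state, and get back the key to send to the client.
sub save_state {
    my ($state) = @_;
    my $id = new_session_id();
    $session_table{$id} = { %$state };    # Copy the data into the 'table'.
    return $id;
}

# Script # 2: use the key returned with the form data to find the state.
sub load_state {
    my ($id) = @_;
    return unless defined $id && $id =~ /\A[0-9a-f]{64}\z/;
    return $session_table{$id};
}

my $id = save_state({ page => 2, campus => 'Monash' });
print "Key sent to the web client: $id\n";
print "State found by the next script: page ", load_state($id)->{page}, "\n";
```

Because the web client only ever sees the key, the data itself cannot be read or changed by using 'View Source'.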
There are 3 basic ways to do this:
Many Perl packages take this approach. Eg: Apache::ASP, Apache::EmbPerl (EmbeddedPerl), Apache::EP (another embedded Perl), Apache::ePerl (yet another embedded Perl), Template::Toolkit (embed a mini non-Perl language in the HTML).
In each case, you need an interpreter to read the combined HTML/Perl/Other and to output pure HTML. In such cases, the interpreter will act as a filter.
This is just the reverse of (1). Thus (tested code):
#!/usr/bin/perl

use strict;
use warnings;

use CGI;

my($q) = CGI -> new();

print $q -> header,
    $q -> start_html(),
    'Weather Report',
    $q -> br(),
    $q -> table
    (
        $q -> Tr ([$q -> th('Temperature') . $q -> td(25)])
    ),
    $q -> end_html();
In this case your script will act as a filter. Your script reads this file, and looks for special strings in the file, which it replaces with HTML generated by, say, reading a database and formatting the output. In other words, the external file contains a combination of:
If you head on over to SDF - Simple Document Format, at http://www.mincom.com/mtr/sdf/ you'll see an example of the 3rd way. SDF is, of course, a Perl-based Open Source answer to PDF.
SDF is also available from CPAN: http://theoryx5.uwinnipeg.ca/CPAN/cpan-search.html.
SDF converts text files into various specific formats. SDF can output, directly or via other software, into these formats: HTML, PostScript, PDF, man pages, POD, LaTeX, SGML, MIMS HTX and F6 help, MIF, RTF, Windows help and plain text.
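The 3rd way of combining Perl and HTML, where the script acts as a filter over an external file, can be sketched as follows. The marker syntax '<!-- name -->' and the names fill_template and %generator are inventions of mine for this example. To keep the sketch self-contained the 'external file' is a here-document, where a real script would read a file from disk.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace each marker of the (hypothetical) form <!-- name --> with the
# HTML generated for that name. Unknown markers are left untouched.
sub fill_template {
    my ($template, $generator) = @_;
    $template =~ s{(<!--\s*(\w+)\s*-->)}
                  {exists $generator->{$2} ? $generator->{$2}->() : $1}ge;
    return $template;
}

# Stand-in for the contents of the external file.
my $template = <<'EOS';
<html><body>
<h1>Weather Report</h1>
<!-- weather -->
</body></html>
EOS

my %generator = (
    # A real script might read a database here. This value is fixed.
    weather => sub { '<table><tr><th>Temperature</th><td>25</td></tr></table>' },
);

print fill_template($template, \%generator);
```

The person who designs the page only has to know where to put the markers, and the person who writes the script only has to know the names of the markers.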
A script can ask the web server the URI used to fire off the script.
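The web server puts information like this into the environment of the script. The variable HTTP_REFERER (yes, mis-spelling included for free) holds the URI of the page from which the web client made the request, provided the web client supplied it.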
So, as a script, I can say I was called by one of:
Now, either 'monash.edu' or 'rusden.edu' is just the value of a string in the script, and so the script can use this string as a key into a database.
In fact, this part of the URI is also in the environment, separately, under the name HTTP_HOST.
From a database table, or any number of tables, the script can retrieve data specific to the host.
This in turn means the script can change its behaviour depending on the URI used to run it.
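Here is a small sketch of a script changing its behaviour according to the host used to run it. The subroutine name calling_domain and the greetings in %greeting are made up for the example, and a hash is standing in for the database lookup.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Find the host part of the URI in the environment the web server set up.
sub calling_domain {
    my ($env) = @_;
    my $host = $env->{HTTP_HOST};
    return unless defined $host;
    $host = lc $host;
    $host =~ s/:\d+\z//;    # 'monash.edu:80' becomes 'monash.edu'.
    return $host;
}

# Hypothetical per-host data. A real script would query a database.
my %greeting = (
    'monash.edu' => 'Welcome to Monash',
    'rusden.edu' => 'Welcome to Rusden',
);

my $domain = calling_domain(\%ENV);
my $text   = defined $domain && exists $greeting{$domain}
           ? $greeting{$domain}
           : 'Unknown host';

print "Content-Type: text/html\n\n";
print "<html><body>$text</body></html>\n";
```

Run from the command line, where HTTP_HOST is normally not set, the page says 'Unknown host'. Run by the web server, the same script greets each campus by name.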
The Open Source database MySQL has a reserved table called 'hosts', so I'll start using the word 'domain'.
Given a domain, I can turn that into a number which can be used as an index into a database table.
Here is a sample 'domain' table:
+=============+=======+
|             |  URI  |
| domain_name | index |
+=============+=======+
| monash.edu  |   4   |
+=============+=======+
| rusden.edu  |   6   |
+=============+=======+
And here is a sample web page 'design' table:
+=======+
|  URI  +===============+===========+===================+
| index | template_name | bkg_color | location_of_links |...
+=======+===============+===========+===================+
|   4   | dark blue     | cream     | down the left     |...
+=======+===============+===========+===================+
|   6   | pale green    | an image  | across the bottom |...
+=======+===============+===========+===================+
Here is a sample web page 'content' table:
+=======+
|  URI  +================+================+
| index | News headlines | Weather        |...
+=======+================+================+
|   4   | -              | www.bom.gov.au |...
+=======+================+================+
|   6   | www.f2.com.au  | www.bom.gov.au |...
+=======+================+================+
f2 = Fairfax, the publisher of 'The Age' newspaper.
bom = Bureau of Meteorology.
Let me give a more commercial example. Here we chain tables:
ProductMap table:

+=======+
|  URI  +==============+============+
| index | Products     | product_id |
+=======+==============+============+
|   4   | Motherboards |     1      |
+=======+==============+============+
|   4   | Printers     |     2      |
+=======+==============+============+
|   4   | CD Writers   |     3      |
+=======+==============+============+
|   6   | CD Writers   |     4      |
+=======+==============+============+
|   6   | Zip Drives   |     5      |
+=======+==============+============+

Product table:

+============+=============+
| product_id | Brands      |
+============+=============+
|     1      | Gigabyte X1 |
+============+=============+
|     1      | Gigabyte X2 |
+============+=============+
|     1      | Intel A     |
+============+=============+
|     1      | Intel B     |
+============+=============+
|     :      |      :      |
+============+=============+
|     5      | Sony        |
+============+=============+
Hence a list of Products for a given URI, ie a given shop, can be turned into an HTML table and inserted into the outgoing web page.
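The chain from domain to URI index to products to HTML can be sketched as below. Perl hashes and arrays stand in for the database tables, so no database is needed to run it. The names (%uri_index, @product_map, %brands, product_table_html) are my own, and only the rows actually shown in the tables above are included. A real script would fetch the rows with DBI and SQL.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# In-memory stand-ins for the 'domain', ProductMap and Product tables.
my %uri_index = ('monash.edu' => 4, 'rusden.edu' => 6);

my @product_map = (
    { uri_index => 4, product => 'Motherboards', product_id => 1 },
    { uri_index => 4, product => 'Printers',     product_id => 2 },
    { uri_index => 4, product => 'CD Writers',   product_id => 3 },
    { uri_index => 6, product => 'CD Writers',   product_id => 4 },
    { uri_index => 6, product => 'Zip Drives',   product_id => 5 },
);

# Only the brands listed in the Product table above are known here.
my %brands = (
    1 => ['Gigabyte X1', 'Gigabyte X2', 'Intel A', 'Intel B'],
    5 => ['Sony'],
);

# Make text safe to place inside HTML.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Domain -> URI index -> products -> brands -> HTML table.
sub product_table_html {
    my ($domain) = @_;
    my $index = $uri_index{$domain};
    return '' unless defined $index;

    my @rows;
    for my $row (grep { $_->{uri_index} == $index } @product_map) {
        my $brand_list = join ', ', @{ $brands{ $row->{product_id} } || [] };
        push @rows, '<tr><th>' . escape_html($row->{product}) . '</th><td>'
                  . escape_html($brand_list) . '</td></tr>';
    }
    return "<table>\n" . join("\n", @rows) . "\n</table>\n";
}

print product_table_html('rusden.edu');
```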
Perl has a number of languages built in to it.
I am writing this document in POD, and running a Perl script to turn it into HTML.
Used for writing Perl eXtenSions in C/C++.
cpan is a shell, shipped with Perl, which is used to download Perl modules.
dbish is a shell, available when you install DBI, which gives you a command line interface to databases.
Yes, Java is a Perl library. In other words, there is an interface between Perl and Java.
Lastly, you can insert Python, C and C++ source code in a Perl program, and have Perl call the appropriate compiler at run time, to build your program on the fly.
For example, if you write a subroutine in C++, you can compile it and then call it from Perl, all in one go!
Most Perl module (library) authors write their documentation in POD, for various reasons:
Perl itself treats all POD as comments when executing the script.
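A tiny example shows POD and code living in the one file. The script name weather.pl and the subroutine report are hypothetical. Everything from the '=pod' line down to the '=cut' line is skipped when the script runs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

=pod

=head1 NAME

weather.pl - a hypothetical script used only to show embedded POD

=head1 DESCRIPTION

Perl skips everything from the '=pod' line down to the '=cut' line.

=cut

# Execution carries on here, as though the POD above was a comment.
sub report {
    return 'Temperature: 25';
}

print report(), "\n";
```

Running the script prints 'Temperature: 25'. Running perldoc or pod2html, both shipped with Perl, over the same file extracts the documentation instead.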
CGI scripting (lecture 2001)
SDF (lecture 2002)
Web Servers (lecture 2002)
The slides (*.tgz)
The slides (*.zip)
The Perl CGI Image::Magick FAQ
Perl is an Open Source package.
Pre-compiled versions are available for various platforms.
Apache is an Open Source package.
Pre-compiled versions are available for various platforms.
MySQL is an Open Source package.
Pre-compiled versions are available for various platforms.
CPAN contains around 745Mb of Open Source library code.
One home among many, of scripts good :-), bad :-| and terrible :-(.
Larry Wall, Tom Christiansen and Jon Orwant, O'Reilly, Third Edition, 2000, 0-596-00027-8
Damian Conway, Manning, 2000, 1-884777-79-1
Lincoln Stein, Wiley, 1998, 0-471-24744-8
Kevin Meltzer, Brent Michalski, Addison-Wesley, 2001, 0-201-71014-5
Alligator Descartes and Tim Bunce, O'Reilly, 2000, 1-56592-699-4
Writing Apache Modules with Perl and C. Lincoln Stein and Doug MacEachern, O'Reilly, 1999, 1-56592-567-X
Ron Savage.
Home page: http://savage.net.au/index.html
This version disguises my email address.
Australian Copyright © 2002 Ron Savage. All rights reserved.
All Programs of mine are 'OSI Certified Open Source Software'; you can redistribute them and/or modify them under the terms of The Artistic License, a copy of which is available at: http://www.opensource.org/licenses/index.html