CGI Scripting - An Introduction

Table of contents

CGI Scripting - An Introduction
Terminology
URI = URL + URN
Web Server Start Up
Web Server Request Loop
Web Server Directory Structure
Web Server Configuration
A Perl Web Client
Web Client Requests
Web Pages
Action = Script
Warning
Web Page Content
JavaScript
Digression: HTML 'v' XML
Re Action: A Tale of 2 Scripts
Maintaining State
Combining Perl and HTML
A Detour - SDF
Inside a Script: Who's Calling?
Data per URI - Page Design
Data per URI - Page Content
Data per URI - Page Content Revisited
Perl's Sub-languages
Writing Documentation
Resources
Author
Licence

CGI Scripting - An Introduction

This article is about CGI scripting in Perl.

Of course, scripting can take place anywhere, not just in the context of the web. And many languages can be used to write CGI scripts.

Perl has been ported to about 70 operating systems. The most recent I know of (February 2001) is Windows CE. This makes Perl widely available.

In this article I will make a lot of simplifications and generalizations. Here's the first...

There are 2 types of CGI scripts:

Those which output HTML pages
Those which process the input from CGI forms

Invariably, this second type, having processed the data, will output an HTML page or form to let the user continue, or at least know what happened, or they will do a CGI redirect to another script which outputs something.

I am splitting scripts into 2 types just to emphasize that there are differences between processing forms and processing in the absence of forms.

Terminology

There are Web Servers, and Web Clients. Some web clients are browsers.

There are programs and scripts. Once upon a time, programs were compiled and scripts were interpreted. Hence the 2 names.

But today, this is 'a distinction without a difference'. My attitude is that the 2 words, program and script, are interchangeable.

Program and process, however, are different.

Program means a program on disk.

Process means a program which has been loaded by the operating system into memory, and is being executed. This means a single program on disk can be loaded and run several times simultaneously, in which case it is 1 program and several processes.

Web servers have names like Apache, Zeus, MS IIS and TinyHTTPd. Apache and TinyHTTPd (Tiny HyperText Transfer Protocol Daemon) are Open Source. Zeus and MS IIS (Internet Information Server) are commercial products.

The feeble security of IIS makes it unusable in a commercial environment.

My examples will use Apache as the web server.

Web clients which are browsers have names like Opera, Netscape, IE.

Of course, you can roll your own non-browser web client. We'll do this below.

URI = URL + URN

You'll notice the 3 letters I, L and N are in alphabetical order. That's the way to remember this formula.

U = Uniform
R = Resource
I = Identifier
L = Locator
N = Name

Web Server Start Up

When a web server starts running, these are the basic steps taken:

Read configuration file. With Apache, you can use Perl to analyze certain parts of the config file, and you can even use Perl inside the config file
Start sub-processes (depending on platform)
Become quiescent, ie wait for requests from web clients

It doesn't matter which web server you are using, and it doesn't matter if the web server is running under Unix or Windows or any other OS. These principles will apply.

Web Server Request Loop

The web server request loop, simplified (as always), has several steps:

Accept request from web client
Process request. This means one of:
Read a disk file containing an HTML page (the file = the page = the response)
Run a script (its output = the response)
Service the request using code within the web server. If you submit this to Apache with server-info enabled, http://127.0.0.1/server-info, Apache itself fabricates the response
The script fabricates an HTML page and writes it to STDOUT. The web server captures STDOUT. This output is the body of the response. The script exits
The web server sends the response body, wrapped in the appropriate headers, to the web client
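
The 'Process request' step above can be sketched in Perl. This is a minimal sketch, not how Apache does it: the sub process_request and the hashes %page and %script are hypothetical names, standing in for disk files and scripts so the code is self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-ins for disk files and scripts. All names here are hypothetical.
my(%page)   = ('/index.html' => '<html><body>Welcome</body></html>');
my(%script) = ('/cgi-bin/x.pl' => sub { return '<html><body>Output of x.pl</body></html>' });

# Decide how to service one request, and return the complete response.
sub process_request
{
    my($path) = @_;
    my($body);

    if (exists $page{$path})
    {
        # The file = the page = the response.
        $body = $page{$path};
    }
    elsif (exists $script{$path})
    {
        # The script's output = the response.
        $body = $script{$path} -> ();
    }
    elsif ($path eq '/server-info')
    {
        # The web server itself fabricates the response.
        $body = '<html><body>Server information</body></html>';
    }
    else
    {
        return "HTTP/1.0 404 Not Found\r\nContent-Type: text/html\r\n\r\n<html><body>Not found</body></html>";
    }

    # Wrap the body in headers before sending it to the web client.
    return "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n$body";
}

print process_request('/cgi-bin/x.pl'), "\n";
```

In each case the web client sees the same thing: headers, a blank line, and a body. It cannot tell which of the 3 branches produced the body.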

Pictorially, we have an infinity symbol, ie a figure-of-8 on its side:

        +------+  1 -Request--->  +------+  2 -Action-->  +------+
        | Web  |  (URI or Submit) | Web  |  (script.pl)   | Perl |
        |Client|                  |Server|                |Script|
        +------+  <--Response- 4  +------+  <---HTML-- 3  +------+
                (Header and HTML)       (Plain page or CGI form)

Things to note:

The interaction starts from the web client
The interaction is a round trip
The web client uses the HyperText Transfer Protocol to format Request 1:
The URI will be sent as text in the message
The data from a submitted form will be sent using the CGI protocol
In both cases, the HTTP will be used to generate an envelope wrapped around the message content

In reality, of course, messages 1 and 4 will be wrapped in TCP/IP envelopes, and messages 2 and 3 will be mediated (handled) by the OS.

Action 2 is a request from the web server to the operating system to load and run a script

Many issues arise here. A brief summary:

Does the web server have permission to run this script? After all, the web server is a program, which means it was loaded and run by some user, often a special user called 'nobody'. So does this 'nobody' have permission to run script.pl?
Does this particular web client have this permission? The web server will check directory access permissions, and may have to ask the web client for a username and a password before proceeding
Does the script have permission to read/write whatever directories it needs to to do its work? For instance, to put a web front-end on CVS requires that 'nobody' have read access to the source code repository, or that the script opens a socket to another script which itself can access the repository
Action 3 is a stream of HTML output by the script and is captured by the web server
Response 4 is the output from the script wrapped in an envelope of headers according to the HTTP
The web client cannot see the source code of the script, only the output of the script. If the web client, eg a browser, offers to download the script and pops up a dialog box asking for the name of a file to save the script in, then clearly, the web server did not execute the script. This means the web server is mis-configured
If the first execution of the script outputs a CGI form, then when the web client submits that form, the script is re-run to process the form's data. That's right, the script would normally be run twice. In other words, the first time the script runs it sees it has no input data, so it outputs an empty form. The second time it runs it sees it has input data, so it reads and processes that data. Yes, they could be 2 separate scripts. When the form is output, the 'action' clause specifies the name of the script which the web server will run to process the form's data

Web Server Directory Structure

But how does the web server know which page to return or which script to run? To answer this we next look at the directory structure on the web server's machine.

Below, Monash and Rusden are the names of university campuses.

monash.edu and rusden.edu will be listed under the 'Virtual Hosts' part of httpd.conf, or, if you are running MS Windows NT/2k, they can be named in the file C:/WinNT/System32/Drivers/Etc/Hosts. Under other versions of MS Windows, the hosts file will be C:/Windows/Hosts.

And a warning about the NT version of this file. Windows Explorer will lie to you about the attributes of this file. You will have to log off as any user and log on as the administrator to be able to save edits into this file.

See http://savage.net.au/Perl/html/configure-apache.html for details.

Assume this directory structure:

        - D:/
        -    www/
        -        cgi-bin/
        -            x.pl
        -        conf/
        -            httpd.conf
        -        public/
        -            index.html
        -            monash/
        -                index.html
        -            monash/staff
        -                mug-shots.html
        -            rusden/
        -                index.html
        -            rusden/staff
        -                courses.html

Note:

D:/www/cgi-bin

Contents can be executed by web server but not viewed by web clients

D:/www/conf

Contents invisible to web clients

D:/www/public

Contents can be viewed by web clients

Web Server Configuration

Now, the web server can be told, via its configuration file httpd.conf, that:

Web client requests using http://monash.edu/ are directed to D:/www/public/monash/

Hence, a request for http://monash.edu/staff/mug-shots.html returns the disk file D:/www/public/monash/staff/mug-shots.html

Web client requests using http://rusden.edu/ are directed to D:/www/public/rusden/

Hence a request for http://rusden.edu/staff/courses.html returns the disk file D:/www/public/rusden/staff/courses.html

Web client requests using http://monash.edu/cgi-bin/ are directed to D:/www/cgi-bin
Web client requests using http://rusden.edu/cgi-bin/ are directed to D:/www/cgi-bin

Did you notice that both virtual hosts use D:/www/cgi-bin?

        ==============================================================
        These 2 hosts have their own document trees, but share scripts
        ==============================================================

We can service any number of virtual hosts with only one copy of each script. This is a huge maintenance saving.
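
The mapping just described can be sketched as a lookup. This is only an illustration of the idea, assuming the directory structure above. The names %document_root, $script_dir and uri2file are mine, not Apache's, and a real web server does much more checking (eg rejecting paths containing '..').

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Document tree per virtual host, as in the directory structure above.
my(%document_root) =
(
    'monash.edu' => 'D:/www/public/monash',
    'rusden.edu' => 'D:/www/public/rusden',
);

# Both hosts share one script directory.
my($script_dir) = 'D:/www/cgi-bin';

# Map a host and a request path to a disk file.
# Returns nothing for an unknown host.
sub uri2file
{
    my($host, $path) = @_;

    return if (! exists $document_root{$host});

    if ($path =~ m|^/cgi-bin/(.*)|)
    {
        return "$script_dir/$1";
    }

    return $document_root{$host} . $path;
}

print uri2file('monash.edu', '/staff/mug-shots.html'), "\n";
print uri2file('rusden.edu', '/cgi-bin/x.pl'), "\n";
```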

This is the information available to the web server when a request comes in from a web client. So, now let's look at the client side of things.

A Perl Web Client

Here is a real, live, complete, Perl Web Client, which is obviously not a browser:

        #!/usr/bin/perl
        use LWP::Simple;
        print get('http://savage.net.au/index.html');

Yes folks, that's it. The work is managed by the Perl module 'LWP::Simple', and is available thru the command 'get', which that module exports, ie makes public so it can be used in scripts like this one. LWP stands for 'Library for WWW in Perl'.

This code runs identically, from the command line, under Windows and Linux.

The output is 'print'ed to the screen, but not formatted according to the HTML.

It's time now to step thru the web server-web client interaction:

Web Client Requests

When you type something like 'rusden.edu' into the browser's address field, or pass that string to a web client, and hit Go, here's sort of what happens:

The web client says 'You're lazy', and prepends the default protocol to the string, resulting in 'http://rusden.edu'
The web client says 'You're lazy', and appends the default directory to the string, resulting in 'http://rusden.edu/'
The web client sends this to the web server, together with some headers. This is the all-important 'Request' (see Web Server Request Loop)
The web server parses it and, using its configuration data, determines which disk directory, if any, this maps to. I say 'if any' because it may refer to a virtual, ie non-existent, directory
If the client asks for a directory, this would normally be converted (by the web server) into a request for a directory listing, or a default file, such as /index.html
If the client asks for a script to be run, the request is processed as described above. Of course, the client may not even know that a script is being run to service the request
The web server determines whether or not you have enough permission to access files in this directory
If so, the web server reads this disk file into memory, or runs the script, and sends the result off to the web client, together with the appropriate headers of course. This is the all-important 'Response'
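
The first 2 steps above, where the web client tidies up what you typed, can be sketched like this. The sub name normalize_address is hypothetical, and real browsers apply many more rules than these 2.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn what the user typed into a full URI.
sub normalize_address
{
    my($address) = @_;

    # Prepend the default protocol if none was given.
    $address = "http://$address" if ($address !~ m|^[a-z][a-z0-9+.-]*://|i);

    # Append the default directory if the address names only a host.
    $address .= '/' if ($address =~ m|^[a-z][a-z0-9+.-]*://[^/]+$|i);

    return $address;
}

print normalize_address('rusden.edu'), "\n";
```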

In reality, processing the request and manufacturing the response can be quite complex procedures.

Web Pages

There are 2 types of web pages sent out to web clients:

Those which contain passive text, which the web client (or human operating a browser) can do no more than look at
Those which contain active text, ie CGI forms, in that the web client (or human) can fill in data entry fields and then submit the form's data back to the web server for processing by a script

In such cases, the form must contain a submit button of some sort. You can use a clickable image as a submit button. Or, you may use a standard submit button, whose appearance has perhaps been transformed by a cascading style sheet, as the thing to click.

Action = Script

If you view the source of such a form, you will always find text like: <form method='POST' action='http://some.domain.net.au/script.pl' enctype='application/x-www-form-urlencoded'>.

The 'action' part tells the web server, when the form is submitted, which script to run to process the form's data.

The web server asks the operating system to load and run the script, and then it (the web server) passes the data (from the form) to the script. The script processes the data and outputs a response (which would normally be another form).
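
What does the script receive? With the encoding 'application/x-www-form-urlencoded', the data is one string of name=value pairs. Under CGI, a POST delivers that string on the script's STDIN (its length is in the environment variable CONTENT_LENGTH), and a GET delivers it in QUERY_STRING. Here is a minimal sketch of decoding such a string. The sub name parse_form is hypothetical, and in real scripts a module such as CGI does this job for you.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decode data of type application/x-www-form-urlencoded into a hash.
# Simplification: if a name is repeated, the last value wins.
sub parse_form
{
    my($data) = @_;
    my(%field);

    for my $pair (split(/&/, $data) )
    {
        next if ($pair eq '');

        my($name, $value) = split(/=/, $pair, 2);
        $value = '' if (! defined $value);

        for ($name, $value)
        {
            tr/+/ /;                               # '+' encodes a space.
            s/%([0-9A-Fa-f]{2})/chr(hex($1) )/eg;  # %XX encodes one byte.
        }

        $field{$name} = $value;
    }

    return %field;
}

my(%field) = parse_form('name=Ron+Savage&campus=rusden&note=a%26b');
print "$_ = $field{$_}\n" for sort keys %field;
```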

Warning

The example above uses a full URI. If you write something like './script.pl' instead, to indicate the script is in the 'current' directory, be warned: the CGI protocol does not specify what the current directory is at any time.

In fact, it does not even specify that any current directory exists. Your scripts must, at all times, know exactly where they are and what they are doing.

Remember, this 'action' is taking place inside (ie from the point of view of) the web server.

Web Page Content

Web pages usually contain data in a combination of languages:

Text: Display this text
Image references: Display this image
HTML: Format the text and images. HTML is a 'rendering' language
XML: Echo and describe the text, eg to 'data mining' page crawlers
JavaScript, for these reasons:
Create special effects (trivial)
Validate form input (somewhat more important)

For security reasons, though, all form input must be validated on the server side, despite any validation done by JavaScript. The JavaScript should simply be seen as a convenience for the person inputting data into the form.
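
Here is a minimal sketch of server-side validation. The field names 'name' and 'student_number', the 8-digit rule, and the sub name validate_enrolment are all made up for illustration. The point is simply that the script checks every field itself, no matter what the JavaScript did.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Check the decoded form fields. Returns a list of error messages,
# which is empty if the data is acceptable.
sub validate_enrolment
{
    my(%field) = @_;
    my(@error);

    if (! defined $field{name} || $field{name} !~ /\S/)
    {
        push @error, 'Name is missing';
    }

    if (! defined $field{student_number} || $field{student_number} !~ /\A[0-9]{8}\z/)
    {
        push @error, 'Student number must be 8 digits';
    }

    return @error;
}

my(@error) = validate_enrolment(name => 'Ron', student_number => '1234');
print "$_\n" for @error;
```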

Yes, scripts can output scripts! Specifically, scripts can output web pages containing JavaScript, etc. There's even a Perl interface to Macromedia's Flash. Where I used to work, some salesmen were obsessed with Flash, because it's all they understand of the software we write :-(. In Flash's defense, you'd have to say it's too trivial to have pretensions.

JavaScript

As a Perl aficionado, you may be tempted to look down on JavaScript, but you shouldn't. It really does have its uses.

When a page contains JavaScript to validate form input, this means quite a bit of a saving for the web client.

Without the JavaScript here's what would happen (call this 'overhead'):

The form's data would have to be sent to the web server. This means one trip across the internet
The web server would have to run the script which is to validate the data
The web server would have to pass the data to the script
The script would have to read and parse the data
The script would have to validate the data
The script would have to send a response to the web server
The web server would have to send a response to the web client. This means a second trip across the internet

All of this takes time. When the JavaScript runs, it runs inside the web client, eg browser, so the web client gets a response much faster.

Of course, complex validation often requires access to databases and so on, so sometimes there is no escape from the overhead just listed.

For example, where I work we noticed some pages were appearing very slowly, and I tracked it down to 3.6Mb (yes!!!) of JavaScript in some pages, which was being used to stop inputting of duplicate data. Naturally this JavaScript was being created by a Perl program :-).

Digression: HTML 'v' XML

As an aside, here's how HTML compares to XML.

HTML is a rendering language. It indicates how the data is to be displayed.

XML is a meta-language. It indicates the meaning of the data.

Examples:

        HTML: '<h1>25</h1>' tells you how 25 should look, but not what it is.
                In other words, '<h1>' is a command, telling a web client how to
                display what follows.

        HTML: '<th>Temperature</th><td>25</td>' tells you how to align the 25,
                but not what it is.

        XML: '<temperature>25</temperature>' tells you what 25 is.
                '<temperature>' is not a command.

        XML: '<street_number>25</street_number>' tells you what 25 is.

Hmmm. This would make a marvellous exam question.

Re Action: A Tale of 2 Scripts

So, what happens when a web client requests that a web server run a script?

To answer this, let's look at a web client request for a script-generated form, and how that request is processed.

In fact, the web client is saying to the web server: 'Pretty please, run _your_ script on _my_ data'.

Let's step thru the procedure:

The web client sends the URI 'http://rusden.edu/cgi-bin/enrol.pl'. This is script # 1
The web server executes the script (# 1), captures its output and sends the output, the form, back to the web client. Script # 1 knows what to output because it sees that it has no input data from a CGI form
The script (# 1) terminates. It is finished, completed, done: gone forever. Trust me: I'm a programmer...
The web client renders the web page
The web client fills in the form and submits it. Being a form, it must contain an 'action' clause naming a script (# 2). Perhaps script # 1 is the same as script # 2
The web server executes the script (# 2), which processes the data. This invocation of script # 2 is independent of the prior invocation of script # 1, even if they are one and the same script. The web server executes 2 separate processes, scripts # 1 and # 2. Script # 2 knows what to do because it sees that it has input data from a CGI form
And so on... Script # 2 may issue another form, in order to continue the interaction
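
When script # 1 and script # 2 are one and the same, the script needs a test to decide which role it is playing. Here is a minimal sketch, assuming the form uses method GET so the data arrives in QUERY_STRING. The sub name run_script and the form's contents are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One script, two roles. Which role depends on whether there is input data.
sub run_script
{
    my(%env)  = @_;
    my($data) = $env{QUERY_STRING};

    if (! defined $data || $data eq '')
    {
        # Role # 1: No input data, so output an empty form.
        return "<form method='GET' action='/cgi-bin/enrol.pl'>"
            . "<input type='text' name='name'>"
            . "<input type='submit' value='Enrol'>"
            . "</form>";
    }

    # Role # 2: Input data is present, so process it.
    my(@pair) = split(/&/, $data);

    return '<p>Received ' . scalar(@pair) . ' field(s)</p>';
}

# Each run is a new process, with nothing left over from the previous run.
print "Content-Type: text/html\n\n", run_script(%ENV), "\n";
```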

You can see the problem. How does script # 2 know what 'state' script # 1 got up to?

Maintaining State

The problem of maintaining state is a big problem.

Chapter 5 in 'Writing Apache Modules with Perl and C' is called 'Maintaining State', and is dedicated to this problem. See 'Resources', below.

A few alternatives, and a very simple discussion of possible drawbacks:

Send data to the web client as 'hidden fields', to be returned with the form data

Drawback: A person can simply use the browser's 'View Source' command to see the values. Hidden simply means that these fields are not rendered on the screen. There is absolutely no security in hidden fields.

Save state in cookies

Drawback: The web client may have disabled cookies. Some banks do this under the false assumption that cookies can contain viruses.

Drawback: If the cookie is written to disk by the web client, the text in the cookie must be encrypted if you want to stop people looking at it or changing it.
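
Here is a sketch of one way to add that complexity. Note that it swaps in a simpler technique than encryption: it signs the cookie's value with a keyed hash (HMAC), using the core module Digest::SHA. Signing stops people changing the value without being detected, but does not stop them looking at it. The secret and the sub names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA qw(hmac_sha256_hex);

# Hypothetical secret, known only to the server. Never sent to the client.
my($secret) = 'change-me';

# Append a signature to the value before sending it as a cookie.
sub sign_value
{
    my($value) = @_;

    return $value . '.' . hmac_sha256_hex($value, $secret);
}

# Check the signature on a returned cookie.
# Returns the original value, or nothing if the cookie was altered.
# Simplification: a production version would compare in constant time.
sub verify_value
{
    my($signed) = @_;
    my($value, $mac) = $signed =~ /\A(.*)\.([0-9a-f]{64})\z/s;

    return if (! defined $mac);
    return if (hmac_sha256_hex($value, $secret) ne $mac);

    return $value;
}

my($cookie) = sign_value('step=2');
print "Cookie: $cookie\n";
print 'Verified: ', verify_value($cookie), "\n";
```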

Save state in web server memory

Drawback: The data is in the memory of one process, and when the web client logs back in (ie submits the form data) it may be connected to a different process, ie a _copy_ of the process which sent the first response, and this copy will not have access to the memory of the first process.

Save state in the URI itself, eg as a session ID

Here's how: Generate a random number. Write the data into a database, using the random number as the key. Send the random number to the web client, to be returned with the form data.

Drawback: You can't just use the ordinary random number generator supplied with the operating system or compiler, eg rand(), since anyone with the same OS and compiler could predict the numbers, since they aren't truly random.

Drawback: Relative URIs no longer work correctly. However, help is at hand with Perl module Apache::StripSession.

Drawback: Under some circumstances it is possible for the session ID to 'leak' to other sites.
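
Here is a minimal sketch of the session ID technique. It makes 2 assumptions which you should check against your own situation: that the script runs on a Unix-like system providing /dev/urandom as a source of unpredictable bytes, and that the hash %session_store is a stand-in for a real database table (a hash in memory vanishes when the script exits). The sub names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for a database table keyed by session ID.
my(%session_store);

# Read 16 unpredictable bytes and return them as 32 hex digits.
# Assumes /dev/urandom exists, ie a Unix-like system.
sub new_session_id
{
    my($bytes);

    open(my $fh, '<:raw', '/dev/urandom') || die "Can't open /dev/urandom: $!";
    my($count) = read($fh, $bytes, 16);
    close($fh);

    die 'Short read from /dev/urandom' if (! defined $count || $count != 16);

    return unpack('H*', $bytes);
}

# Script # 1: Save the state, and return the ID to send to the web client.
sub save_state
{
    my(%state) = @_;
    my($id)    = new_session_id();

    $session_store{$id} = {%state};

    return $id;
}

# Script # 2: Use the ID returned with the form data to recover the state.
sub load_state
{
    my($id) = @_;

    return $session_store{$id};
}

my($id) = save_state(step => 2);
print "Session ID: $id\n";
print 'Step: ', load_state($id) -> {step}, "\n";
```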

Write the data to a temporary file

Drawback: How does script # 2 know the name of this file, created by script # 1? It's simple if they are the same script, but they don't have to be.

Drawback: What happens if 2 copies of script # 1 run at the same time?
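
The second drawback, at least, has a standard answer: let the core module File::Temp choose the file name, since it creates each file under a name not already in use. Here is a minimal sketch. The sub names and the 'state-' prefix are hypothetical, and the first drawback remains, since the file name must still get from script # 1 to script # 2 somehow, and must be checked when it comes back.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Script # 1: Write the state to a new, uniquely named file.
sub save_state_to_file
{
    my($data) = @_;

    # The X's are replaced to form a name not already in use.
    my($fh, $file_name) = tempfile('state-XXXXXXXX', TMPDIR => 1, UNLINK => 0);

    print $fh $data;
    close($fh) || die "Can't close $file_name: $!";

    return $file_name;
}

# Script # 2: Read the state back, given the file's name.
sub load_state_from_file
{
    my($file_name) = @_;

    open(my $fh, '<', $file_name) || die "Can't open $file_name: $!";
    local $/;    # Slurp the whole file.
    my($data) = <$fh>;
    close($fh);

    return $data;
}

my($file_name) = save_state_to_file('step=2');
print load_state_from_file($file_name), "\n";
unlink($file_name);
```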

In each case, you either abandon that alternative, or add complexity to overcome the drawbacks.

There is no one, perfect, solution which fits all cases. You must study the alternatives, study your situation, and choose a course of action.

Combining Perl and HTML

There are 3 basic ways to do this:

Put the code inside the HTML

Many Perl packages take this approach. Eg: Apache::ASP, Apache::EmbPerl (EmbeddedPerl), Apache::EP (another embedded Perl), Apache::ePerl (yet another embedded Perl), Template::Toolkit (embed a mini non-Perl language in the HTML).

In each case, you need an interpreter to read the combined HTML/Perl/Other and to output pure HTML. In such cases, the interpreter will act as a filter.

Put the HTML inside the code

This is just the reverse of (1). Thus (tested code):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use CGI;
        my($q) = CGI -> new();
        print $q -> header,
              $q -> start_html(),
              'Weather Report',
              $q -> br(),
              $q -> table
              (
                 $q -> Tr
                 ([$q -> th('Temperature') . $q -> td(25)])
              ),
              $q -> end_html();

Put the HTML, or XML, or whatever, in a file external to the script

In this case your script will act as a filter. Your script reads this file, and looks for special strings in the file, which it replaces with HTML generated by, say, reading a database and formatting the output. In other words, the external file contains a combination of:

HTML, which your script simply copies to its output stream
HTML comments, like <!-- Some command -->, which your script cuts out and replaces with the results of processing that command
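
Here is a minimal sketch of such a filter. The command name 'weather', the hash %command and the sub fill_template are hypothetical, and to keep the sketch self-contained the template is a string rather than a file. Comments which are not known commands are copied to the output unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each command maps to code which generates the replacement HTML.
my(%command) =
(
    'weather' => sub { return '<table><tr><th>Temperature</th><td>25</td></tr></table>' },
);

# Copy the template, replacing each known command with its output.
sub fill_template
{
    my($template) = @_;

    $template =~ s{(<!--\s*(.*?)\s*-->)}
                  {exists $command{$2} ? $command{$2} -> () : $1}ges;

    return $template;
}

print fill_template('<h1>Weather Report</h1><!-- weather -->'), "\n";
```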

A Detour - SDF

If you head on over to SDF - Simple Document Format, at http://www.mincom.com/mtr/sdf/ you'll see an example of the 3rd way. SDF is, of course, a Perl-based Open Source answer to PDF.

SDF is also available from CPAN: http://theoryx5.uwinnipeg.ca/CPAN/cpan-search.html.

SDF converts text files into various specific formats. SDF can output, directly or via other software, into these formats: HTML, PostScript, PDF, man pages, POD, LaTeX, SGML, MIMS HTX and F6 help, MIF, RTF, Windows help and plain text.

Inside a Script: Who's Calling?

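A script can ask the web server which URI was used to fire off the script.

The web server puts this information into the environment of the script. HTTP_REFERER (yes, mis-spelling included for free) holds the URI of the page which sent the web client here, if the client supplied it, so when a script processes its own form this is the script's own URI. The path of the script itself is under the name SCRIPT_NAME.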

So, as a script, I can say I was called by one of:

http://monash.edu/cgi-bin/enrol.pl
http://rusden.edu/cgi-bin/enrol.pl

Now, either 'monash.edu' or 'rusden.edu' is just the value of a string in the script, and so the script can use this string as a key into a database.

In fact, this part of the URI is also in the environment, separately, under the name HTTP_HOST.

From a database table, or any number of tables, the script can retrieve data specific to the host.

This in turn means the script can change its behaviour depending on the URI used to run it.

Data per URI - Page Design

The Open Source database MySQL has a reserved table called 'hosts', so I'll start using the word 'domain'.

Given a domain, I can turn that into a number which can be used as an index into a database table.

Here is a sample 'domain' table:

        +=============+=======+
        |             |  URI  |
        | domain_name | index |
        +=============+=======+
        | monash.edu  |   4   |
        +=============+=======+
        | rusden.edu  |   6   |
        +=============+=======+

And here is a sample web page 'design' table:

        +=======+
        +  URI  +===============+===========+===================+
        | index | template_name | bkg_color | location_of_links |...
        +=======+===============+===========+===================+
        |   4   |   dark blue   |   cream   |   down the left   |...
        +=======+===============+===========+===================+
        |   6   |   pale green  |  an image | across the bottom |...
        +=======+===============+===========+===================+
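
The 2 lookups, from domain to index and from index to design, can be sketched with hashes standing in for the database tables. The data is copied from the sample tables above. The sub name design_for_host is hypothetical. The port is stripped because HTTP_HOST can arrive as 'monash.edu:80'.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for the 'domain' table.
my(%domain) =
(
    'monash.edu' => 4,
    'rusden.edu' => 6,
);

# Stand-in for the 'design' table.
my(%design) =
(
    4 => {template_name => 'dark blue',  bkg_color => 'cream',    location_of_links => 'down the left'},
    6 => {template_name => 'pale green', bkg_color => 'an image', location_of_links => 'across the bottom'},
);

# Given the script's environment, return the design for the calling domain.
# Returns nothing for an unknown domain.
sub design_for_host
{
    my(%env)  = @_;
    my($host) = lc(defined $env{HTTP_HOST} ? $env{HTTP_HOST} : '');

    $host =~ s/:[0-9]+\z//;    # Strip any port number.

    return if (! exists $domain{$host});

    return $design{$domain{$host} };
}

my($design) = design_for_host(HTTP_HOST => 'rusden.edu');
print "Links go $$design{location_of_links}\n";
```

In a real script you would pass in %ENV, and the hashes would be replaced by database queries.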

Data per URI - Page Content

Here is a sample web page 'content' table:

        +=======+
        +  URI  +================+================+
        | index | News headlines |    Weather     |...
        +=======+================+================+
        |   4   |        -       | www.bom.gov.au |...
        +=======+================+================+
        |   6   | www.f2.com.au  | www.bom.gov.au |...
        +=======+================+================+

f2 = Fairfax, the publisher of 'The Age' newspaper.

bom = Bureau of Meteorology.

Data per URI - Page Content Revisited

Let me give a more commercial example. Here we chain tables:

        ProductMap table:
        +=======+
        +  URI  +==============+============+
        | index |   Products   | product_id |
        +=======+==============+============+
        |   4   | Motherboards |     1      |
        +=======+==============+============+
        |   4   | Printers     |     2      |
        +=======+==============+============+
        |   4   | CD Writers   |     3      |
        +=======+==============+============+
        |   6   | CD Writers   |     4      |
        +=======+==============+============+
        |   6   | Zip Drives   |     5      |
        +=======+==============+============+

        Product table:
        +============+=============+
        | product_id |    Brands   |
        +============+=============+
        |     1      | Gigabyte X1 |
        +============+=============+
        |     1      | Gigabyte X2 |
        +============+=============+
        |     1      |   Intel A   |
        +============+=============+
        |     1      |   Intel B   |
        +============+=============+
        |     :      |      :      |
        +============+=============+
        |     5      |    Sony     |
        +============+=============+

Hence a list of Products for a given URI, ie a given shop, can be turned into an HTML table and inserted into the outgoing web page.
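
Here is a minimal sketch of that chaining, with Perl data structures standing in for the 2 tables. Only the brands actually shown in the Product table above are included, and the rows elided there are left empty here. The names @product_map, %brand and product_table are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for the ProductMap table: URI index, product, product_id.
my(@product_map) =
(
    [4, 'Motherboards', 1],
    [4, 'Printers',     2],
    [4, 'CD Writers',   3],
    [6, 'CD Writers',   4],
    [6, 'Zip Drives',   5],
);

# Stand-in for the Product table. Rows for product_ids 2 to 4 are not
# shown in the sample table, so they are not invented here.
my(%brand) =
(
    1 => ['Gigabyte X1', 'Gigabyte X2', 'Intel A', 'Intel B'],
    5 => ['Sony'],
);

# Build an HTML table of the products and brands for one URI index.
sub product_table
{
    my($uri_index) = @_;
    my($html)      = "<table>\n";

    for my $row (grep { $$_[0] == $uri_index } @product_map)
    {
        my($index, $name, $product_id) = @$row;
        my($brands) = join(', ', @{$brand{$product_id} || []});

        $html .= "<tr><th>$name</th><td>$brands</td></tr>\n";
    }

    $html .= "</table>\n";

    return $html;
}

print product_table(6);
```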

Perl's Sub-languages

Perl has a number of languages built in to it.

POD - Plain Old Documentation

I am writing this document in POD, and running a Perl script to turn it into HTML.

'Here' documents
Regular Expressions
XS

Used for writing Perl eXtenSions in C/C++.

cpan

cpan is a shell, shipped with Perl, which is used to download Perl modules.

dbish

dbish is a shell, available when you install DBI, which gives you a command line interface to databases.

Perl has a debugger which can be started with the command line switch -d
JavaScript
Java

Yes, Java is a Perl library. In other words, there is an interface between Perl and Java.

Lastly, you can insert Python, C and C++ source code in a Perl program, and have Perl call the appropriate compiler at run time, to build your program on the fly.

For example, if you write a subroutine in C++, you can compile it and then call it from Perl, all in one go!

Writing Documentation

Most Perl module (library) authors write their documentation in POD, for various reasons:

POD is text, and therefore quick and simple to write
pod2html.* ships with Perl, so anyone can download a Perl module and convert the docs into HTML
When Perl, eg the cpan shell, downloads a Perl module, the docs can be extracted automatically, and inserted into Perl's doc directory tree
All good Perl modules are actually shipped with POD embedded in the source of the module

Perl itself treats all POD as comments when executing the script.
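
Here is a small sketch of a script with POD embedded in it. The sub celsius2fahrenheit is just an example I made up. Each POD command starts in column 1 and follows a blank line, and '=cut' tells Perl the code resumes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

=head1 NAME

celsius2fahrenheit - Convert a temperature from Celsius to Fahrenheit

=head1 SYNOPSIS

    print celsius2fahrenheit(25);

=cut

sub celsius2fahrenheit
{
    my($celsius) = @_;

    return $celsius * 9 / 5 + 32;
}

print celsius2fahrenheit(25), "\n";
```

Running this script prints 77. Running pod2html on the same file produces a web page containing the NAME and SYNOPSIS sections, and none of the code.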

Resources

The lectures

CGI scripting (lecture 2001)

SDF (lecture 2002)

Web Servers (lecture 2002)

The slides (*.tgz)

The slides (*.zip)

http://savage.net.au/ImageMagick/html/im-faq.html

The Perl CGI Image::Magick FAQ

Perl: http://www.cpan.org/ports/index.html

Perl is an Open Source package.

Pre-compiled versions are available for various platforms.

Apache Web Server: http://httpd.apache.org/ or a mirror

Apache is an Open Source package.

Pre-compiled versions are available for various platforms.

MySQL: http://www.mysql.com/ or a mirror

MySQL is an Open Source package.

Pre-compiled versions are available for various platforms.

Perl Database Interface (DBI): http://dbi.symbolstone.org/
CPAN (Comprehensive Perl Archive Network): http://cpan.org/

CPAN contains around 745Mb of Open Source library code.

CGI Programming OpenFAQ: http://www.boutell.com/openfaq/cgi/
CGI Programming MetaFAQ: http://www.perl.org/CGI_MetaFAQ.html
The CGI Resource Index: http://cgi.resourceindex.com/

One home among many, of scripts good :-), bad :-| and terrible :-(.

Articles on Perl: http://wdvl.internet.com/Authoring/Languages/Perl/
Programming Perl

Larry Wall, Tom Christiansen and Jon Orwant, O'Reilly, Third Edition, 2000, 0-596-00027-8

Object Oriented Perl

Damian Conway, Manning, 2000, 1-884777-79-1

Official Guide to Programming with CGI.pm

Lincoln Stein, Wiley, 1998, 0-471-24744-8

Writing CGI Applications with Perl

Kevin Meltzer, Brent Michalski, Addison-Wesley, 2001, 0-201-71014-5

Programming the Perl DBI

Alligator Descartes and Tim Bunce, O'Reilly, 2000, 1-56592-699-4

Writing Apache Modules with Perl and C

Lincoln Stein and Doug MacEachern, O'Reilly, 1999, 1-56592-567-X

Author

Ron Savage.

Home page: http://savage.net.au/index.html

Version: 1.01 01-Jun-2006

This version disguises my email address.

Version: 1.00 18-Feb-2001

Original version.

Licence

Australian Copyright © 2002 Ron Savage. All rights reserved.

        All Programs of mine are 'OSI Certified Open Source Software';
        you can redistribute them and/or modify them under the terms of
        The Artistic License, a copy of which is available at:
        http://www.opensource.org/licenses/index.html