Tutorial # 7: jQuery-style UTF8 and Databases

Table of contents

Tutorial # 7: jQuery-style UTF8 and Databases
The POD for this tutorial ...
ASCII is not Unicode
Unskippable, mandatory and compulsory background reading
Perl
Module preamble
Creating a utf8-friendly database in Postgres
Creating a utf8-friendly database in SQLite
Telling the web client we're using utf8
Populating the database from disk files
Inputting from a web client
Outputting to a web client
Traps
History
Links

Tutorial # 7: jQuery-style UTF8 and Databases

The POD for this tutorial ...

... starts off with:

        =encoding utf8

Because sample Unicode data is included below.

Notes:

# 1: POD is Perl's built-in documentation language, and stands for Plain Old Documentation. This name is modelled on POTS, the Plain Old Telephone Service.

ASCII is not Unicode

Yes, I really do have to say this. For instance, if you're trying to insert a character such as 'é' into a file in your editor, you might make the mistake of looking up that very same glyph in the extended ASCII character set.

Don't do that! You need the Unicode 'é' and not the not-really-ASCII 'é'. And no, your editor won't magically convert it for you.
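Here's a tiny sketch, using Perl's core Encode module, of why the two 'é's differ: the Latin-1 (extended ASCII) encoding of the character is a single byte, while the utf8 encoding is two bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

my($e_acute) = "\x{E9}"; # The Unicode code point U+00E9, i.e. 'é'.

# After encode(), length() counts bytes, not characters.

printf "Latin-1: %d byte. utf8: %d bytes.\n",
        length(encode('iso-8859-1', $e_acute) ), # 1 byte:  0xE9.
        length(encode('utf-8',      $e_acute) ); # 2 bytes: 0xC3 0xA9.
```

So a file containing the single byte 0xE9 is not a valid utf8 file, no matter what glyph your editor draws for it.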

Unskippable, mandatory and compulsory background reading

The Wikibooks article

Joel Spolsky's 'The Absolute Minimum etc' article

Docs for JSON::XS

Tom Christiansen's prescriptions

Tom's Canonical Rule

Perl

I'm using Perl V 5.14.

Module preamble

Every module must start with this preamble:

        use strict;
        use utf8;
        use warnings;
        use warnings  qw(FATAL utf8);    # Fatalize encoding glitches.
        use open      qw(:std :utf8);    # Undeclared streams in UTF-8.
        use charnames qw(:full :short);  # Unneeded in v5.16.

Notes:

# 1: Yes, that's right - every module (that uses utf8, no matter how indirectly)!

# 2: You wouldn't believe the pain of omitting this preamble, especially when you think you don't need it.

# 3: Obviously, if you're using a later Perl, it can be simplified.

# 4: I'm using Moo in this project, hence the 'use strict', which Moose would apply automatically.

# 5: I'm writing a contacts manager, but while the Perl code manipulates Unicode strings, it does not contain any Unicode strings. Hence any other utf8-based pragmas you might have been expecting are just not needed here.

Creating a utf8-friendly database in Postgres

        psql=# create role ron login password 'seekrit';
        psql=# create database contacts owner ron encoding 'UTF8';
        psql=# \q

And what does this mean?

It means we want Postgres to reliably store the utf8-encoded data we give it.

In other words, we're telling Postgres to expect utf8-encoded data.

That in turn means we are committing to only send Postgres data which we have already utf8-encoded.
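For the record, here's a sketch of connecting to that database with DBI and DBD::Pg. The dsn, role and password just echo the psql commands above. The pg_enable_utf8 attribute handles the return trip, telling DBD::Pg to decode data coming back from Postgres (recent versions of DBD::Pg default to detecting this automatically).

```perl
use strict;
use warnings;
use DBI;

my($dbh) = DBI -> connect
(
        'dbi:Pg:dbname=contacts', 'ron', 'seekrit',
        {
                AutoCommit     => 1,
                RaiseError     => 1,
                pg_enable_utf8 => 1, # Decode data coming back from Postgres.
        }
) || die $DBI::errstr;
```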

Creating a utf8-friendly database in SQLite

        my($config) = $self -> module_config;
        my($attr)   =
        {       # These 2 are of course not utf8 related.
                AutoCommit => defined($$config{AutoCommit}) ? $$config{AutoCommit} : 1,
                RaiseError => defined($$config{RaiseError}) ? $$config{RaiseError} : 1,
        };
        $$attr{sqlite_unicode} = 1 if ( ($$config{dsn} =~ /SQLite/i) && $$config{sqlite_unicode});

        $self -> simple(DBIx::Simple -> connect($$config{dsn}, $$config{username}, $$config{password}, $attr) );

And frankly, you probably want this too:

        $self -> simple -> query('PRAGMA foreign_keys = ON') if ($$config{dsn} =~ /SQLite/i);

Notes:

# 1: Did you notice this line:

        $$attr{sqlite_unicode} = 1 if ( ($$config{dsn} =~ /SQLite/i) && $$config{sqlite_unicode});

It's the bit you really need.

# 2: Ok, so you don't need DBIx::Simple if you're just using DBI, but the point I'm making is the same.
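E.g. a minimal sketch of the same settings with plain DBI (the database file name is made up):

```perl
use strict;
use warnings;
use DBI;

my($dbh) = DBI -> connect
(
        'dbi:SQLite:dbname=contacts.sqlite', '', '',
        {
                AutoCommit     => 1,
                RaiseError     => 1,
                sqlite_unicode => 1, # The bit you really need.
        }
) || die $DBI::errstr;

$dbh -> do('PRAGMA foreign_keys = ON');
```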

Telling the web client we're using utf8

        <html>
                <head>
                        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
                ...
                </head>
                <body>
                ...
                </body>
        </html>

This means we are telling the web server to utf8-encode the web page when serializing it for transmission.

Finally we can say that at least something to do with utf8 is easy...

Notes:

# 1: You can read more about 'charset' here:

http://www.w3.org/html/wg/drafts/html/master/document-metadata.html#the-meta-element.

# 2: Actually, somebody, somewhere has to utf8-encode the outgoing data.

More below, where Ajax is discussed. Search this document for 'autocomplete' if you're in a hurry.

Populating the database from disk files

Step 1: Overview:

There are 2 specific ways of populating the database that I wish to address in this tutorial. Here, we start with reading utf8-encoded data from text files. Later I'll discuss getting data from a web client.

It should be self-evident that the data file itself must be utf8-encoded. How do we arrange that? Step-by-step: we create the data in Unicode, and encode that data as utf8.

I use UltraEdit (for Unix), so when I create a new, empty buffer, into which the data will be typed/copied before being written to disk, I can see the text 'ISO-8859-1' on the bottom (status) line.

This is the default file type, and clicking on the menu (arrow-heads) beside this text opens up the menu, with the top item being Unicode. But this in turn points to another menu! Why is that?

It's telling you that you're not going to be creating a file in (just) Unicode, but that your Unicode characters will be auto-encoded into utf8. This is good, since now you can just save the buffer to disk, and it's in precisely the form you want, for when you later read it back in using Perl.

I hope that's clear: Unicode is a character set, and utf8 is a transformation of your Unicode characters into a binary string. See the docs (above) for e.g. JSON::XS, where this is clearly spelled out.
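In Perl terms, that transformation is exactly what the core Encode module's encode() and decode() do. A sketch:

```perl
use strict;
use warnings;
use utf8; # Because this source file is itself utf8-encoded.
use Encode qw(decode encode);

my($name)  = 'Léon';                  # A string of Unicode characters.
my($bytes) = encode('utf-8', $name);  # Characters -> a binary (octet) string.
my($back)  = decode('utf-8', $bytes); # Octets -> characters again.

print $back eq $name ? "Round trip ok\n" : "Round trip failed\n";
```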

Step 2: Reading the text file:

This is the data file 'fake.people.txt' (fake because I'm populating the database with test data). It is shipped with my utf8-aware, but still under development, App::Office::Contacts V 2.00:

        visibility_id, communication_type_id, creator_id, deleted, facebook_tag, gender_id, role_id, title_id, twitter_tag, date_of_birth, given_names, homepage, name, preferred_name, surname
        2, 2, 0, 0, n/a, 3, 4, 2, n/a, 1950-06-21, Ron, http://savage.net.au/, Ron Savage, Ron, Savage
        1, 4, 0, 0, n/a, 2, 3, 4, n/a, 1989-07-06, Zoe, http://zephyr.net.au/, Zoe Savage, Zoe, Savage
        2, 2, 0, 0, n/a, 3, 4, 2, n/a, 1999-09-09, Léon, http://astray.com, Léon Brocard, Léon, Brocard

Note the acute accent on Léon's name. I do something similar with the names of fake organizations (e.g. MégaCorp Pty Ltd).

So how do we actually read in such a file? With Text::CSV::Encoded. Here's the code I use. It returns an arrayref of hashrefs:

        sub read_a_file
        {
                my($self, $input_file_name) = @_;
                $input_file_name = "$FindBin::Bin/../data/$input_file_name";
                my($csv)         = Text::CSV::Encoded -> new({allow_whitespace => 1, encoding_in => 'utf8'});

                open my $io, '<', $input_file_name or die "Can't open $input_file_name: $!";
                $csv -> column_names($csv -> getline($io) );
                my($data) = $csv -> getline_hr_all($io);
                close $io;

                return $data;

        } # End of read_a_file.

Step 3: Writing to the database:

But how do we write that to a database? Thanx - I thought you'd never ask:

This is the code to write the arrayref of hashrefs, returned from read_a_file(), to a database:

        sub populate_people_table
        {
                my($self)       = @_;
                my($table_name) = 'people';
                my($data)       = $self -> read_a_file("fake.$table_name.txt");

                # Each element of @$data is a hashref with these keys:
                # visibility_id, communication_type_id, creator_id, deleted,
                # facebook_tag, gender_id, role_id, title_id, twitter_tag,
                # date_of_birth, given_names, homepage, name, preferred_name, surname.
                # read_a_file() decoded the data, so we can use uc() but then have to call encode().

                for (@$data)
                {
                        # Setting upper_given_names is for the search code.
                        # See App::Office::Contacts::Controller::Exporter::Search.

                        $$_{upper_given_names} = encode('utf8', uc $$_{given_names});
                        $$_{upper_name}        = encode('utf8', uc $$_{name});
                        $$_{timestamp}         = localstamp;

                        $self -> db -> simple -> insert($table_name, $_);
                }

                $self -> logger -> log(debug => "Populated table $table_name");
                $self -> dump($table_name); # This checks 'verbose' re writing the data to the screen.

        }       # End of populate_people_table.

Notes:

# 1: Admittedly it's very confusing. If read_a_file() decodes, why don't we have to encode all fields, not just the 'name' field?

What happens if we try encoding another field, say 'given_names'? I tried that, and it made no difference.

# 2: What happens if we use binmode in read_a_file()?

        Old: open my $io, '<', $input_file_name;

        New: open my $io, '<', $input_file_name;
                 binmode $io;

No change.

Inputting from a web client

You can probably guess we follow the canonical rule and utf8-decode the incoming data as soon as it hits the run mode. And indeed, here's the Search module's display() run mode:

        sub display
        {
                my($self)   = @_;
                my($key)    = $self -> query -> param('search_name') || '';
                my($uc_key) = encode('utf8', uc decode('utf8', $key) );

                $self -> param('db') -> simple -> begin_work;
                $self -> add_header('-Status' => 200, '-Content-Type' => 'text/xml; charset=utf-8');

                my($response);

                try
                {
                        $self -> log(debug => "Controller::Exporter::Search.display($uc_key)");

                        # Here we get:
                        # o Organizations whose names match.
                        # o People whose names match.
                        # o Organizations whose email addresses match.
                        # o People whose email addresses match.
                        # o Organizations whose phone numbers match.
                        # o People whose phone numbers match.
                        # Then we winnow those sets, i.e. we remove duplicates.

                        my($user_id)       = $self -> param('user_id');
                        my($organizations) = $self -> param('db') -> organization -> get_organizations($user_id, $uc_key);
                        my($people)        = $self -> param('db') -> person -> get_people($user_id, $uc_key);
                        my(@emails)        = $self -> param('db') -> email_address -> get_organizations_and_people($user_id, $uc_key);
                        my(@phones)        = $self -> param('db') -> phone_number -> get_organizations_and_people($user_id, $uc_key);
                        $organizations     = $self -> remove_duplicates($organizations, $emails[1], $phones[1]);
                        $people            = $self -> remove_duplicates($people, $emails[0], $phones[0]);
                        my($row)           =
                        [
                                # We put people before organizations. Do not use 'sort' here because
                                # of the way we've formatted multiple entries for each person/organization.

                                @{$self -> param('view') -> person -> format_search_result($uc_key, $people)},
                                @{$self -> param('view') -> organization -> format_search_result($uc_key, $organizations)},
                        ];

                        $response = $self -> param('db') -> library -> build_ok_xml
                                (
                                        $self -> param('view') -> search -> display($uc_key, $row)
                                );

                        $self -> param('db') -> simple -> commit;
                        $self -> log(debug => "Final search count: @{[scalar @$row]}");
                }
                catch
                {
                        my($error) = $_;

                        $self -> param('db') -> simple -> rollback;

                        # Try to log the error despite the error.

                        $self -> log(error => "System error: $error");

                        $response = $self -> param('system_error');
                };

                return encode('utf8', $response);

        } # End of display.

Notes:

# 1: $uc_key is encoded because the (Perl) database code passes this string to Postgres, as seen in the code for get_people() just below.

# 2: Formatting the output.

See the line: $response = $self -> param('view') -> search -> display($uc_key, $row);

This calls App::Office::Contacts::View::Search's display() method, which does not have to utf8-encode the outgoing data.

This confusing issue is discussed shortly, in the Notes to Step 2 in the next section.

# 3: Code in the JS receiving this looks for /^Error/.

Clearly, the outgoing error message is too simple at the moment. I'm just fixing the code so that email addresses and phone numbers (think 'X' as in '1122334455 X 789') are utf8-friendly.

Then more sophisticated error messages will be implemented.

# 4: A successful search.

http://savage.net.au/jQuery-tutorials/pod/search.result.png.

Outputting to a web client

Step 1: Reading from the database:

Here's how the person part of the search feature (in App::Office::Contacts) works:

        sub get_people
        {
                my($self, $user_id, $uc_key) = @_;

                $self -> db -> logger -> log(debug => "Database::Person.get_people($user_id, $uc_key)");

                my($result) = $self -> db -> simple -> query("select * from people where upper_name like ? or upper_given_names like ? order by name", "\%$uc_key\%", "\%$uc_key\%")
                                                || die $self -> db -> simple -> error;
                $result     = $self -> build_person_record($user_id, $self -> db -> library -> decode_hashref_list($result -> hashes) );

                $self -> db -> logger -> log(debug => "Final people count: @{[scalar @$result]}");

                return $result;

        } # End of get_people.

Notes:

# 1: The call to query() sets up $result, and on the next line I call $result -> hashes() because I want N1 columns from N2 rows, and hashes() returns a list (N2) of hashrefs.

# 2: That list is immediately passed to decode_hashref_list(), which looks like:

        sub decode_hashref_list
        {
                my($self, @list) = @_;
                @list            = () if ($#list < 0);

                my(@result);

                for my $item (@list)
                {
                        $$item{$_} = decode('utf8', $$item{$_} || '') for keys %$item;

                        push @result, $item;
                }

                return [@result];

        } # End of decode_hashref_list.

So, as always, we decode as soon as the data is read in from the database.

Step 2: Formatting the output

I'll skip the code, but you can view the result here.

Notes:

# 1: A big question arises: How exactly do I encode the outgoing data? The answer is: I don't.

Now I hear you scream: But above you said!

        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

And!

        This means we are telling the web server to utf8-encode the web page when serializing it for transmission.

Yes, true. But still, I don't encode the data being sent from the server.

As it happens (i.e. in this particular application's design) the search results are returned to the web client in the response to an Ajax call.

# 2: Do I have to encode data being sent via Ajax to the server?

No. The jQuery API docs say:

"Default is "application/x-www-form-urlencoded; charset=UTF-8", which is fine for most cases.

# 3: jQuery's Autocomplete needs JSON!

Here's some autocomplete-prepping Javascript:

        $(function()
        {
                $("#add_staff_person_name").autocomplete
                ({
                        source: "/AutoComplete?type=person_name&sid=" + $("#update_org_sid").val(),
                        minLength: 2,
                        select: function(event, ui)
                        {
                                $("#add_staff_person_name").val(ui.item ? ui.item.value : "-");
                        }
                });
        });

This code governs when data is sent (after 2 keystrokes), what gets sent to which url as part of 'autocomplete', and which id the returned menu of choices is attached to.

I do not specify anything about the expected response format. And - it just works, and perfectly at that.

But I did have to specify JSON-encoding for the response, which I do in the Perl with:

        # Warning: Do not use ... new -> utf8 -> ...

        $response = JSON::XS -> new -> encode($response);

And yes, here you must explicitly avoid the utf8() call!

I hope that's clear.
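To see why, here's a sketch of the double-encoding trap. With utf8() enabled, encode() returns octets, so if the framework (or your own code) then utf8-encodes the response as well, any multi-byte characters arrive mangled:

```perl
use strict;
use warnings;
use utf8; # Because this source file is itself utf8-encoded.
use Encode qw(encode);
use JSON::XS;

my($data)    = {name => 'Léon'};
my($json)    = JSON::XS -> new -> encode($data);         # A character string.
my($octets)  = JSON::XS -> new -> utf8 -> encode($data); # utf8-encoded octets.
my($mangled) = encode('utf-8', $octets);                 # Encoded twice. Bad.

print length($json), ', ', length($octets), ', ', length($mangled), "\n"; # 15, 16, 18.
```

The 18-byte string is what ends up on the wire if you encode twice, and the client then displays 'Léon'-style gibberish.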

# 4: What about outputting XML?

I have discussed Ajax calls returning XML in tutorial # 6.

Manual utf8-encoding of the output text is not needed. Specifically, that tutorial just says:

        Since I'm using a sub-class of CGI::Snapp, I can call its add_header(...) method.

        $self -> add_header('-Status' => 200, '-Content-Type' => 'text/xml; charset=utf-8');

So my XML-using run modes just return a string ($html) as:

        return
        qq|<response>
                <error></error>
                <org_id>$organization_id</org_id>
                <person_id>$person_id</person_id>
                <$type><![CDATA[$html]]></$type>
        </response>
        |;

Traps

None today. Please call back tomorrow, when I'm sure there'll be plenty in stock :-(.

History

V 1.01 2013-07-01. Update after releasing App::Office::Contacts V 2.01.

V 1.00 2013-05-07. Original version.

Links

My home page

All tutorials (Includes references)

POD source for this tutorial