splitbrain.org

electronic brain surgery since 2001

Surviving the Perl/UTF-8 Madness

This post is basically a survival guide for myself, should I ever again be up against the evil called “Unicode Support” in Perl 5.8+.

Perl's idea of UTF-8 support is that scalars now have an internal flag that determines whether their contents are UTF-8 or not. When the flag is off, the content is nominally assumed to be 7-bit ASCII, but basically it is just treated as bytes.

Now what happens if you, for example, concatenate two differently flagged scalars is that Perl will first convert the “off-flagged” scalar to UTF-8, treating each of its bytes as one Latin-1 character. And this is when it gets ugly, because this off-flagged string might very well contain UTF-8 encoded text already, just without the flag set. Its multi-byte sequences then get encoded a second time.
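A minimal sketch of that double-encoding trap, assuming only the core Encode module. The helper names naive_concat and safe_concat are made up for this illustration, and the strings are written with hex escapes so the example does not depend on the encoding of the script file itself.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# A character string with the flag on: WHITE SMILING FACE plus a space.
my $flagged = "\x{263A} ";

# The two bytes of a UTF-8 encoded "ü" (0xC3 0xBC), flag off,
# as they would arrive from a source nobody decoded.
my $raw = "\xC3\xBC";

# Hypothetical helper: concatenates without decoding first.
# Perl upgrades $bytes byte by byte, so 0xC3 and 0xBC become two characters.
sub naive_concat {
    my ($chars, $bytes) = @_;
    return $chars . $bytes;
}

# Hypothetical helper: decodes the bytes before concatenating.
sub safe_concat {
    my ($chars, $bytes) = @_;
    return $chars . decode_utf8($bytes);
}

# Render a string as a list of code points, so nothing wide is printed.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $string;
}

print "naive: ", code_points(naive_concat($flagged, $raw)), "\n";
print "safe:  ", code_points(safe_concat($flagged, $raw)), "\n";
```

The naive variant ends up with four characters (the “ü” has turned into “Ã¼”), the decoded variant with the expected three.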

But it doesn't stop there. No Perl script is an island: there's always input and output. How does Perl know if it is UTF-8 or not? The sad thing is that it tries to guess. And as we all know, things get really ugly when software tries to guess.

So here is the main trick to stay sane with Perl's UTF-8 support: never let it guess, always make sure everything is encoded and flagged as UTF-8 internally!

Internal Scalars

You sometimes need a simple string containing non-ASCII chars. The simplest way to achieve that is to write your code in UTF-8. Use a UTF-8-capable editor and terminal and just write your string in your natural language. But tell Perl about it!

To do so, just use the “use utf8” pragma:

use utf8; #this script is written in UTF-8
 
my $string = "Äußerst süße Töchter aus Ölüberschußländern in Übergröße";

Standard File Handles

Again, in theory Perl should guess whether your terminal provides UTF-8 or not and recode input and output accordingly. For me that never works reliably. So just tell Perl what encoding your streams use with binmode:

# treat all input and output as UTF-8 and set the flags correctly
binmode STDOUT, ":utf8";
binmode STDERR, ":utf8";
binmode STDIN,  ":utf8";

The above should of course also work with other encodings, for example the layer ":encoding(iso-8859-1)". Perl will then recode them to UTF-8 internally.
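To see what such a layer actually does to the bytes, here is a small sketch that prints to an in-memory filehandle instead of STDOUT, so the result can be inspected. The helper name encode_via_handle is made up for this example; binmode is applied exactly as it would be on the standard handles.

```perl
use strict;
use warnings;

# Hypothetical helper: print $text through a handle that has the given
# encoding layer applied with binmode, and return the bytes that came out.
sub encode_via_handle {
    my ($text, $encoding) = @_;
    my $buffer = '';
    open my $fh, '>', \$buffer
        or die "Cannot open in-memory handle: $!";
    binmode $fh, ":encoding($encoding)"
        or die "Cannot set layer for $encoding: $!";
    print {$fh} $text;
    close $fh;
    return $buffer;
}

# One character, LATIN SMALL LETTER U WITH DIAERESIS (U+00FC).
my $text = "\x{FC}";

for my $encoding ('iso-8859-1', 'UTF-8') {
    my $bytes = encode_via_handle($text, $encoding);
    printf "%-10s -> %d byte(s): %s\n",
        $encoding, length $bytes,
        join ' ', map { sprintf '%02X', ord } split //, $bytes;
}
```

The same character leaves the script as one byte under the Latin-1 layer and as two bytes under the UTF-8 layer.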

Reading from Files

Even if your terminal, editor and script are all in UTF-8, the files you read might not be. Telling Perl the correct encoding will again automatically recode them and set the UTF-8 flag:

# read from a latin1 encoded file
open my $fh, "<:encoding(iso-8859-1)", 'test.latin1.txt'
    or die "Cannot open test.latin1.txt: $!";
my $latin1 = <$fh>;
close $fh;
 
# read from a UTF-8 encoded file
open $fh, "<:encoding(utf8)", 'test.utf8.txt'
    or die "Cannot open test.utf8.txt: $!";
my $utf8 = <$fh>;
close $fh;

Both scalars, $utf8 and $latin1, now contain valid UTF-8 encoded text with Perl's internal flag enabled.
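A way to convince yourself of this is to write the same word in both encodings to temporary files and read them back. The following is only a sketch: write_bytes and read_first_line are helper names invented here, and File::Temp stands in for the real test files.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write raw bytes to a temporary file, return its path.
sub write_bytes {
    my ($bytes) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode $fh;                      # raw bytes, no layer
    print {$fh} $bytes;
    close $fh or die "Cannot close $path: $!";
    return $path;
}

# Hypothetical helper: read the first line through an encoding layer.
sub read_first_line {
    my ($path, $encoding) = @_;
    open my $fh, "<:encoding($encoding)", $path
        or die "Cannot open $path: $!";
    my $line = <$fh>;
    close $fh;
    chomp $line if defined $line;
    return $line;
}

# The word "süß", once as ISO-8859-1 bytes and once as UTF-8 bytes.
my $latin1_path = write_bytes("s\xFC\xDF\n");
my $utf8_path   = write_bytes("s\xC3\xBC\xC3\x9F\n");

my $latin1 = read_first_line($latin1_path, 'iso-8859-1');
my $utf8   = read_first_line($utf8_path,   'UTF-8');

printf "latin1: %d chars, utf8: %d chars, %s\n",
    length $latin1, length $utf8,
    $latin1 eq $utf8 ? 'identical' : 'different';
```

Although the files differ on disk (3 versus 5 bytes for the word), both scalars hold the same three characters once the layers have done their work.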

MySQL Databases

You might think you know the answer here: SET NAMES utf8. Yes and no. Sending this will switch your MySQL connection to UTF-8, and when you pass Perl scalars with the UTF-8 flag enabled, their values will be inserted correctly (usually). However, everything you read from the database will be UTF-8 encoded but missing the UTF-8 flag.

Luckily DBD::mysql has a cure for that – an option called mysql_enable_utf8. You need to pass it in the connect method.

use DBI;
 
$DBH = DBI->connect("DBI:mysql:database=foo;host=localhost",
                    'user',
                    'password',
                    {
                         mysql_enable_utf8 => 1
                    });

This option will also take care of sending SET NAMES utf8 for you.

Manually setting the UTF-8 flag

If you have some data from other sources (e.g. non-MySQL databases), you can switch on the UTF-8 flag with Encode::decode_utf8. The name “decode” is a bit confusing, but it decodes the UTF-8 bytes into Perl's internal UTF-8 format.

use Encode;
 
# $line contains UTF-8 encoded text but the flag isn't set, yet
$line = Encode::decode_utf8($line); # set the flag
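If you want to see the flag change, utf8::is_utf8 reports it. This is a small sketch with a hard-coded byte string standing in for data from some external source; the variable names are chosen just for this example.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# "süß" as raw UTF-8 bytes, the way an undecoded source would deliver it.
my $line = "s\xC3\xBC\xC3\x9F";

printf "before: flag %s, length %d\n",
    (utf8::is_utf8($line) ? 'on' : 'off'), length $line;

# Decode the bytes into characters; the result carries the UTF-8 flag.
my $decoded = decode_utf8($line);

printf "after:  flag %s, length %d\n",
    (utf8::is_utf8($decoded) ? 'on' : 'off'), length $decoded;

# Going the other way gives the original bytes back.
my $bytes_again = encode_utf8($decoded);
print "round trip ", ($bytes_again eq $line ? 'ok' : 'failed'), "\n";
```

Before decoding, length counts five bytes; afterwards it counts three characters, which is what string functions such as length, substr and regular expressions should be working on.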

That's it. Once you figure it all out, it is somewhat bearable. Personally I prefer PHP's UTF-8 support: just treat everything as single bytes and provide a library for multibyte operations.

This post was originally published on cosmocode.de
Tags:
perl, programming, utf-8, unicode