PHP, Unicode, and MySQL Character Sets

This post could be subtitled: “When importing your old data breaks your character encoding,” but even that doesn’t quite capture the frustration felt when unexpected UTFမ8 (or UTF-16) characters are scattered throughout your data.

Historically, PHP and MySQL have shared mutually beneficial positions within the web services ecosystem. It’s no surprise then that the two have more or less evolved together, benefiting from the other’s success in what those with a background in the natural sciences might consider a symbiotic relationship. To say that the two began life in a world where latin1 (ISO-8859-1 and its more location-specific derivatives or “code pages”) was the de facto standard encoding might be an understatement, but it is also conveniently ignores a little piece of history. Things change, people change, and I suppose it could be said that sometimes standards change. Or, more correctly, they’re superseded by better standards, cast away into the forgotten reaches of history by the marauding armies of deprecation. This realm is also periodically referred to in technical parlance as the IETF archives.

Sometimes, but not always, old and clunky standards linger on where they refuse to die, because doing things the right way is actually quite difficult. Not to mention that if the old way of doing something has always worked, it usually has enough managerial or influential inertia to carry it on into a future that’s very different from what its developers envisioned.

Many years ago, PHP6 was announced as a planned upgrade path from PHP5.x. Many promises were made: Namespaces, closures, and unicode support (in the form of UTF-16) to name a few. But the process stalled as developers were bogged down by the difficulty of converting everything to Unicode. Namespaces and closures were eventually migrated into the PHP5.3 branch, and it seemed that language-level unicode support would have to wait. It also didn’t help that many users complained loudly about potential breakage; alas, sacrifices must occasionally be made when moving forward, and in our industry, it’s often the users themselves who most fiercely resist change. Admittedly, the entrenchment of PHP in the web services sector probably hasn’t helped much to this end…

The current metrics are nevertheless quite promising in spite of the delays. As of present, the current statistics indicate that the core conversion of PHP to unicode is about 70% complete (accessed May 2nd, 2012). Of course, what progress has been made–if it’s anything short of complete–is of little use to those who have an immediate need for complete unicode support. For others, like myself, unicode is a nice-to-have, but for a majority of the work-related data I’ve seen, it’s a matter of dealing with systems that were written years ago by naive developers.

Continuing the story: MySQL eventually added mostly-working support for UTF-8. I say mostly, because it like everything that came before suffered from occasional breakage and the weirdness one might expect in a system that saw such a drastic change. However, even in the early days it worked well for systems that were careful not to mangle incoming unicode and for those that were properly configured. Indeed, a fresh install of WordPress will configure tables to use UTF-8 as the default character set (assuming your server supports it, of course!), and if you’re careful, you can muck with incoming data via PHP without disturbing anything important. UTF-8, PHP, nor MySQL are fundamentally the problem; character set conversion, however, is.

With one of our latest site conversions, we noticed a flurry of unusual artifacts scattered throughout most of the articles we imported into WordPress from a CMS that shall remain unnamed. Things like “Ä”, which was supposed to be the UTF-8 non-breaking space, to “’”, which was supposed to be a stylized right-apostrophe (’), were scattered throughout almost every article. Evidently, the original site owners had interesting copy/paste habits (they used Word), but the site worked well for them and I can’t bring myself to judge anyone for putting together a system that worked for their specific needs.

I researched character encodings for a couple days, usually between fixing other things I had broken (oops!), and couldn’t come up with a definitive solution. I’m pretty sure I tried just about everything I could think of, including some unorthodox suggestions like examining the bin2hex() output, running an str_replace() on it, and then re-assembling it back into a binary string. But whereas an unorthodox–and unnecessarily complex–process might work, it isn’t often the right way to do something. Heck, I even tried various incantations of iconv and mb_string‘s conversion functions before giving up, left with the impression that whatever dark magics I thought I had once possessed regarding UTF-8 mangling had long since wafted away not unlike the wisps of smoke from a witch doctor’s extinguished incense burner.

After puzzling over the matter for a few hours the following morning an epiphany struck. The answer needn’t be so complex! Instead, perhaps it was the encoding PHP was using when it connected to the MySQL server that was the source of the problem. When I’d pull that data and save it over to the database that was UTF-8 encoded, that’s when it would be mangled. There was a character translation going on somewhere, and I was certain this was it.

To understand the confusion that lead up to my frantic Google searches and the discovery of puffer-fish-extract-like medicinal remedies such as running str_replace() on bin2hex() output, a little bit of knowledge regarding the MySQL tool set is helpful. First, two of the most common tools I use when importing test data into a database are mysqldump and the standard mysql client (for restoration). Interestingly, one only needs to reach as far as the man page for either utility to discover the source of the problem:

--default-character-set=charset_name

Use charset_name as the default character set. See Section 9.5, “Character Set Configuration”. If no character set is specified, mysqldump uses utf8, and earlier versions use latin1.

In retrospect, I should have considered this first. The old database was configured to use latin1 on all of its tables, and it undoubtedly survived through many early MySQL versions. The new database was set up on a very recent version of MySQL, as was my workstation’s test database, and was therefore using UTF-8 as the default character set. The only difference between the two was that my workstation has everything configured to use UTF-8–the server, however, does not. Contributing to the confusion was the behavior of the import: When I would run a mysqldump on one of the data tables and import it into my workstation’s database, I noticed three extra characters in place of the apostrophes. Yet the server only displayed a single unexpected character.

Something was afoot.

When I discovered the encoding defaults in MySQL’s tool chain, it occurred to me that I should have exported the source data as latin1 from the old database, and imported it as latin1 into my test harness. It worked, or at least it looked like it worked, and then I ran my import utility.

…and it broke again.

Somewhat furious that I couldn’t quite figure out what the solution was, I paused for a moment to reflect on what happened. Then it occurred to me that PHP was probably using whatever default the MySQL driver was configured to–namely, UTF-8. I added a line to my importer (using PDO):

$exportDb->query('SET CHARACTER SET latin1');

Then I re-ran my importer, waited for some sample data to complete, and checked the article. It worked. Oh, how gloriously it worked! Perhaps, I mused, the simplicity of the solution was at least partially to blame for its evasiveness.

Over-analysis, particularly in IT, can be a problematic hurdle that can only be avoided or straddled with sufficient experience. Oftentimes, the avoidance of our tendency to over-think a problem can only come from a lifetime of experiences that work together to undo all of the nonsense models we’ve established in our heads, and that’s why I sometimes feel so astounded whenever someone’s kid makes a profoundly deep, yet exceptionally simple observation. You know the sort: You’re left, probably in shock, thinking about what a damn genius that little bastard is going to be. But then, years down the road, that little genius goes to school and is taught “critical thinking skills,” goes into IT, and then sits up late one night writing a blog post about what a damn genius someone else’s kid is for thinking outside the box.

Maybe one of the lessons I should have taken away from this is that the best solution is often to take a few steps back and let the obvious fixes (at the time) flutter away on the wind. There’s usually an easier, better way to do it. Unfortunately, seeing that takes practice.

The shortened version of this can be condensed into the following: First, most of the time, converting one database to another will work out of the box and you won’t need to do anything more. Second, sometimes even with custom tools, database conversions will go smoothly even when dealing with different character sets. But, finally, sometimes you’ll encounter a data set that’s just old enough to be a small thorn in your side because it happened to have persisted through a major architectural change in whatever system it calls home. If that’s the case, don’t listen to anyone else–and especially don’t try mangling your data, because you’ll only make it worse!–check your client and server encoding and alter them accordingly. If you’re dumping data from a set of tables that use latin1, make sure your export tools also dump that same data in latin1; if you’re using mysql or mysqldump, that means using the --default-character-set option, and if you’re using PHP directly, configure the database driver accordingly. If you’re importing UTF-8 characters that were originally (or accidentally) stored in latin1, don’t panic. As long as you make sure to pull that source data in latin1 (or whatever encoding it was originally stored in), you should be OK: The conversion–or pass-through–to the destination can typically occur to UTF-8 without anything being lost.

You don’t even have to whisper arcane incantations to yourself at night.

No comments.
***

Annoyances: vBulletin 4 Template Hooks

I’ve been doing commercial software development for vBulletin 3.x (and, by extension, 4.x) off and on now for a couple of years. While there are some things that irritate the crap out of me about both of these products, vBulletin (both versions) have features that just aren’t found in other bulletin board packages. Admittedly, many of these exclusive features are provided by an extensive library of 3rd party software, but the point still stands–as much as I hate to admit it. Few other message boards have a plugin system that’s easy to develop for, and fewer still have the vast library of plugins available. phpBB doesn’t even come close. vBulletin still has its shortcomings for developers, but I’ll save my complaints for a later installment.

What I’m going to write about tonight is something that bit me, and I know it’s going to bite someone else out there: Template hooks have been the bane of my existence in vB4 for the majority of this weekend, and once you start adding a few yourself, you’ll grow to appreciate the manic schizophrenia that is the vBulletin 4.x template system in all its unadulterated glory. I hope to save you from the onset of severe insanity, so keep reading for my story and my solution.

Side note: You might also want to make this a summer project, because you’ll be bald by the time you’re finished, and I understand that bald heads get cold quite quickly. If you’re already bald, accept my apologies and tear something else out–like the upholstery stuffing in your desk chair. Don’t have a chair? Reach for the carpet. Don’t have carpet? Well, you’re on your own.

Template Hooks: They Work–But not When You Want Them To

I’ve written a couple of plugins that rely on the various forumhome_wgo_pos* template hooks for both vBulletin 3.x and 4.x. These hooks work perfectly for most use cases, regardless of when your plugin fires, and are almost foolproof. Don’t be lulled into a false sense of security, though. The moment you do anything unusual with template hooks in vBulletin 4.x, you’ll be bitten by the what-the-heck-happened-to-my-output surprise.

To reproduce the ailment that has been afflicting my sanity for the better part of this last Sunday, I direct you to a simple test:

  1. Create a new product, complete with its very own plugin.
  2. Set the plugin to fire on the global_start plugin hook.
  3. Add the following code to the plugin:
    $template_hook['footer_test_hook'] = '<b>Hi!</b>';
  4. Add the following code to your footer template:
    {vb:raw template_hook.footer_test_hook}
  5. Run it!

You should notice that you now have a nice, shiny string containing Hi! at the bottom of your page in the footer code. Now, let’s break it:

  1. Add a template, such as break_my_footer to your product XML (optional; you could use any other template if you like)
  2. Call this template from your plugin using something like:
    $tpl = vB_Template::create('mytemplate');

    Or, if you decided to use an existing (small) template:

    $tpl = vB_Template::create('option');
  3. Then modify your template hook code appropriately:
    $template_hook['footer_test_hook'] = htmlentities($tpl->render());
  4. Watch in horror as nothing appears in your footer.

Try as I might, I spent a good hour or two trawling various vBulletin support sites for answers. Rather than make a post somewhere and risk having one of their ill-tempered devs explain “Well, this is how it’s supposed to work, didn’t you use the search?” when the built in search generally sucks and Google doesn’t always pick up their help threads, I decided that this issue became personal. That is to say, this code insulted my mother, my father, my nonexistent siblings, and each of my ancestors going back 1,500 years.

After performing various blind tests I concluded that somehow the call to the vB_Template::create() factory method was effectively wiping the contents of $template_hook–or ignoring it, or purging it, or performing an exorcism on it with tremendous glee while I steamed with fury in front of my monitor. I then decided that I’d had enough, and so I searched for the footer template to determine where it was being called, prepared, and possibly rendered in the code. My hunch was that the footer was being generated separately from the forumhome cruft that so happily seemed to work no matter where I used it or what I did with it (and indeed it is generated separately). Yet my own template hook refused to work.

Then I came across this code in includes/class_bootstrap.php:

 $templater = vB_Template::create('footer');
                        $templater->register('admincpdir', $admincpdir);
                        $templater->register('ad_location', $ad_location);
                        $templater->register('cronimage', $cronimage);
                        $templater->register('languagechooserbits', $languagechooserbits);
                        $templater->register('modcpdir', $modcpdir);
                        $templater->register('quickchooserbits', $quickchooserbits);
                        $templater->register('template_hook', $template_hook);
                        $templater->register('facebook_footer', $facebook_footer);
                $footer = $templater->render();

Pay careful attention to the line $template->register('template_hook', $template_hook);. Clearly, the footer is processing the template hook here–so I thought to myself, perhaps there’s a nearby hook that I could attach my plugin to so I can guarantee I know that the content of $template_hook won’t be interfered with.

I scrolled up and found a hook that probably should have been fairly obvious to me from the start. But hey, it’s the weekend. What more can you expect?

($hook = vBulletinHook::fetch_hook('parse_templates')) ? eval($hook) : false;

Sheepishly, I changed my plugin to use the parse_templates plugin hook instead of global_start, and it worked! So the upshot is: If you’re going to try using custom template hooks and you discover that they won’t work the moment you load a template, try changing the plugin hook to parse_templates. It might just fix the problem.

Now, this was admittedly all my fault for not realizing that parse_templates may be the correct solution; I really should have examined the vBulletin sources more closely. Shame on me. In my defense, though, the vBulletin documentation is pretty poor, much of it is outdated, and even less of it focuses on issues specific to 4.x. However, I have one particular bone to pick: It’s puzzling to me that whatever is in $template_hook will work fine up until the moment you decide to call vB_Template::create(). There’s a comment under the create() method that indicates something to do with $template_hook and treating it as a special case for the purpose of various globals or some such, along with a reference to a bug tracker ID. I think that’s more coincidental than anything else, and certainly if I wanted to find out what was happening, I could run a trace with XDebug, but I’m not that desperate–or bored (yet). My guess is that, somehow, subsequent calls to vB_Template::create() clobber the contents of $template_hook by the time vBulletin gets around to rendering the footer; I may be wrong–I probably am–but this is an example of bizarre code suffering from manic schizophrenia.

Frankly, the vBulletin sources are so stupidly convoluted it’s a miracle the software works as well as it does. I’ll save that for another rant much later this week or next. In short, remember: If you’re toying with custom template hooks, you might just break your code. If you do, try changing the plugin you’re writing for template rendering purposes to hook into parse_templates. You’re almost guaranteed to have little to no interference with the contents of $template_hook and the parse_templates hook is nearest to the templates that are most likely to be affected.

Toodles for now. Expect to see a whiny rant soon!

No comments.
***

Java and Multiple Desktops

If you’re using a single monitor, this article won’t be of much use to you. For the rest of you who have two (or more) monitors plugged into your box, you may be able to glean something of useful from this post. I’m sure the topic has been written to death elsewhere, and while there’s a few different ways to accomplish the same thing, this is my solution.

An example and the wrong solution, told as a story

First, the problem. If you’re written any GUI applications, I know you’ve done this at least once before: You create your GUI, you start attaching widgets to it, you launch it, you debug it, and about two or three hours into the project, you’re growing increasingly more annoyed with the window positioning. Maybe the window is attaching itself to the upper left corner, or maybe your window manager is genuinely trying hard to do the Right Thing thus leaving your application to appear randomly around the screen in a feeble attempt to cascade against something that doesn’t exist. Frustrated, you do something like this (SWT example):

// This assumes that display has already been created elsewhere.
// app is a class representing the main window for the application.
private void centerOnScreen ()
{
    Rectangle bounds = new Rectangle(0, 0, 0, 0);
    Rectangle desktopBounds = display.getBounds();
 
    bounds.x = (desktopBounds.width / 2) - (app.DEFAULT_WIDTH / 2);
    bounds.y = (desktopBounds.height / 2) - (app.DEFAULT_HEIGHT / 2);
    bounds.width = app.DEFAULT_WIDTH;
    bounds.height = app.DEFAULT_HEIGHT;
 
    shell.setBounds(bounds); // That'll teach 'em.
}

Your application is now centering itself on the screen. Great! No longer do you have to hunt around for the silly thing during debugging to drag it around, resize it, or otherwise mutter unsavory curses under your breath.

You then plug in a second monitor, and immediately those same unsavory words you uttered earlier have grown significantly worse. Now, with a second monitor, your nifty centerOnScreen() method centers the application between windows. “Curses,” you say, “I though I had the damn thing fixed!” Since your boss was kind enough to purchase two identical monitors, you figure you’ll draft up a quick fix:

// This assumes that display has already been created elsewhere.
// app is a class representing the main window for the application.
private void centerOnScreen ()
{
    Rectangle bounds = new Rectangle(0, 0, 0, 0);
    Rectangle desktopBounds = display.getBounds();
 
    bounds.x = (desktopBounds.width / 2) - (app.DEFAULT_WIDTH / 2);
    bounds.y = (desktopBounds.height / 2) - (app.DEFAULT_HEIGHT / 2);
    bounds.width = app.DEFAULT_WIDTH;
    bounds.height = app.DEFAULT_HEIGHT;
 
    // XXX: No one will ever have a monitor greater than 1920 pixels wide.
    if (desktopBounds.width > 1920)
        bounds.x = (desktopBounds.width / 2 / 2) - (app.DEFAULT_WIDTH / 2);
 
    shell.setBounds(bounds); // That'll teach 'em.
}

After such harrowing labor, your application is now back to normal and centering itself on the left-most monitor. You smile smugly, commit the changes, and go home for the day.

The next morning, you get an e-mail from one of the other developers in the office. He’s not particularly happy:

From: Bob Jones <bobj@megacorp.com>
To: UI Design Team <webreakthings@megacorp.com>
Subject: what’s wrong with the ui?

Hey guys, I just noticed that some changes made since yesterday have the application appearing kind of off my left screen. It spills over a bit onto the right monitor.

For what it’s worth, my monitors are of two different sizes so it’s kinda funky.

Oops. Dividing the desktop size in two (you have two monitors of identical dimensions) and then dividing that number in two doesn’t quite work when individual attached screens differ in width. Worse, what happens when someone buys a monitor with a horizontal resolution greater than 1920?

From: Milton Pencilpicker <miltonpen@megacorp.com>
To: UI Design Team <webreakthings@megacorp.com>
Subject: App keeps appearing waaaaaaay off to the side

Guys,

Did you say the app was supposed to start centering after the changes made yesterday? It’s still broken. It centers vertically just fine, but it’s about a quarter of the way over to the left. Just thought I’d let you know.

Yeah, this solution isn’t going to work. You need to account for resolutions on a per screen basis. So what do you do?

Probe some screens

Java exposes individuals screens through the GraphicsEnvironment singleton. The advantage of this method over getBounds() (or the Swing equivalent) is that you can easily determine the default device. Here’s one such example:

private GraphicsDevice defaultDevice;
private int defaultDeviceOffset = 0;
private ArrayList<Dimension> screens;
private int totalWidth = 0;
 
/**
 * Probe attached displays.
 * This method collects data related to all attached displays.
 * For illustrative purposes, we're recording the dimension of each
 * attached screen and recording it in the local screens arraylist.
 * We then add up the total screen width.
 */
private void probeDisplays ()
{
    GraphicsDevice[] devices = GraphicsEnvironment
        .getLocalGraphicsEnvironment()
        .getScreenDevices();
 
    defaultDevice = GraphicsEnvironment
        .getLocalGraphicsEnvironment()
        .getDefaultScreenDevice();
 
    for (int i = 0; i < devices.length; i++) {
        if (devices[i].equals(defaultDevice))
            // Do something when we encounter the default device.
            // One example would be to calculate the total screen
            // width thusfar. For our example purposes, we're going
            // to record the default device offset versus all other
            // attached screens.
            defaultDeviceOffset = i; // Mostly meaningless; sample purposes only.
 
        DisplayMode dm = devices[i].getDisplayMode();
        Dimension d = new Dimension(dm.getWidth(), dm.getHeight());
        screens.add(d);
        totalWidth += dm.getWidth();
    }
}

By recording individual screens, our centerOnScreen method can now be performed on a screen-by-screen basis:

/**
 * Revised centerOnScreen.
 */
private void centerOnScreen ()
{
    // Center the window based upon the default device dimensions.
 
    Rectangle bounds = new Rectangle(0, 0, 0, 0);
    int widthSoFar = 0;
 
    for (int i = 0; i <= defaultDeviceOffset; i++) {
        widthSoFar += screens.get(i).getWidth();
    }
 
    bounds.x = (screens.get(i).getWidth() / 2) - (app.DEFAULT_WIDTH / 2) + widthSoFar;
    bounds.y = (screens.get(i).getHeight() / 2) - (app.DEFAULT_HEIGHT / 2);
    bounds.width = app.DEFAULT_WIDTH;
    bounds.height = app.DEFAULT_HEIGHT;
 
    shell.setBounds(bounds);
}

By collecting metrics from the GraphicsEnvironment singleton, we’ve established what 1) the default device dimensions and and 2) have established code that will automatically correct for centering the application on the default device regardless of whether it is to the left or right of the other monitor or monitors.

Some Improvements

I want to keep this article short, so I’ll offer some suggestions for further improvements.

  • If you wanted to restore the window position after it has been closed and restarted, just record the current window boundaries. Assuming monitor positions don’t change between launches, the application will restart from the last position it was closed at.
  • It would be trivial to examine the window positions just prior to restoring the previous session to determine if it will be drawn within the boundaries of the desktop. You could either examine each screen individually (more accurate) or by gathering the desktop boundaries regardless of display (less code, less accurate). The best option is certainly to compare the window’s last position with the dimensions and structure of the probed screen devices. You can correct for changes in resolution, missing monitors, and a few other unexpected situations.
  • This example only corrects for the most common scenario where desktop monitors are side-by-side. This is the default on Windows without 3rd party software and is generally a safe assumption. For platforms using Xorg (or similar), individual monitors can be to the left, right, top, or bottom of the primary display. NVIDIA’s drives provide fairly fine-grained control over monitor positioning, so if you’re targeting a multi-monitor environment under *nix, you may wish to do further testing and examine window positions based on height.
  • This code can easily be adapted to Swing.
  • While the sample code won’t work out of the box with JFace, it shouldn’t be too difficult to modify the sample code so it’ll work. Be sure to use getShell() to obtain window locations and dimensions but be mindful that this will probably break if you try to access the shell object during a close action. I’ll be making another post to demonstrate one possible solution.

Should you find bugs or would like to offer corrections to the example code in this article, please feel free to post. Obviously, corrections or suggestions that extend beyond the spirit of the code won’t be included. If you would like to offer criticism, please post full code snippets (even if you’re copying part of what I’ve written in this article) for clarity; should someone new to Java come across my blog, you would be doing them a fantastic service by limiting the amount of vertical scrolling necessary to read the example code and your critiques. :)

I waive all rights to the code in this article and hereby place it into the public domain.

No comments.
***
Page 1 of 812345...Last »