PHP, Unicode, and MySQL Character Sets

This post could be subtitled: “When importing your old data breaks your character encoding,” but even that doesn’t quite capture the frustration felt when unexpected UTFမ8 (or UTF-16) characters are scattered throughout your data.

Historically, PHP and MySQL have shared mutually beneficial positions within the web services ecosystem. It’s no surprise then that the two have more or less evolved together, benefiting from the other’s success in what those with a background in the natural sciences might consider a symbiotic relationship. To say that the two began life in a world where latin1 (ISO-8859-1 and its more location-specific derivatives or “code pages”) was the de facto standard encoding might be an understatement, but it is also conveniently ignores a little piece of history. Things change, people change, and I suppose it could be said that sometimes standards change. Or, more correctly, they’re superseded by better standards, cast away into the forgotten reaches of history by the marauding armies of deprecation. This realm is also periodically referred to in technical parlance as the IETF archives.

Sometimes, but not always, old and clunky standards linger on where they refuse to die, because doing things the right way is actually quite difficult. Not to mention that if the old way of doing something has always worked, it usually has enough managerial or influential inertia to carry it on into a future that’s very different from what its developers envisioned.

Many years ago, PHP6 was announced as a planned upgrade path from PHP5.x. Many promises were made: Namespaces, closures, and unicode support (in the form of UTF-16) to name a few. But the process stalled as developers were bogged down by the difficulty of converting everything to Unicode. Namespaces and closures were eventually migrated into the PHP5.3 branch, and it seemed that language-level unicode support would have to wait. It also didn’t help that many users complained loudly about potential breakage; alas, sacrifices must occasionally be made when moving forward, and in our industry, it’s often the users themselves who most fiercely resist change. Admittedly, the entrenchment of PHP in the web services sector probably hasn’t helped much to this end…

The current metrics are nevertheless quite promising in spite of the delays. As of present, the current statistics indicate that the core conversion of PHP to unicode is about 70% complete (accessed May 2nd, 2012). Of course, what progress has been made–if it’s anything short of complete–is of little use to those who have an immediate need for complete unicode support. For others, like myself, unicode is a nice-to-have, but for a majority of the work-related data I’ve seen, it’s a matter of dealing with systems that were written years ago by naive developers.

Continuing the story: MySQL eventually added mostly-working support for UTF-8. I say mostly, because it like everything that came before suffered from occasional breakage and the weirdness one might expect in a system that saw such a drastic change. However, even in the early days it worked well for systems that were careful not to mangle incoming unicode and for those that were properly configured. Indeed, a fresh install of WordPress will configure tables to use UTF-8 as the default character set (assuming your server supports it, of course!), and if you’re careful, you can muck with incoming data via PHP without disturbing anything important. UTF-8, PHP, nor MySQL are fundamentally the problem; character set conversion, however, is.

With one of our latest site conversions, we noticed a flurry of unusual artifacts scattered throughout most of the articles we imported into WordPress from a CMS that shall remain unnamed. Things like “Ä”, which was supposed to be the UTF-8 non-breaking space, to “’”, which was supposed to be a stylized right-apostrophe (’), were scattered throughout almost every article. Evidently, the original site owners had interesting copy/paste habits (they used Word), but the site worked well for them and I can’t bring myself to judge anyone for putting together a system that worked for their specific needs.

I researched character encodings for a couple days, usually between fixing other things I had broken (oops!), and couldn’t come up with a definitive solution. I’m pretty sure I tried just about everything I could think of, including some unorthodox suggestions like examining the bin2hex() output, running an str_replace() on it, and then re-assembling it back into a binary string. But whereas an unorthodox–and unnecessarily complex–process might work, it isn’t often the right way to do something. Heck, I even tried various incantations of iconv and mb_string‘s conversion functions before giving up, left with the impression that whatever dark magics I thought I had once possessed regarding UTF-8 mangling had long since wafted away not unlike the wisps of smoke from a witch doctor’s extinguished incense burner.

After puzzling over the matter for a few hours the following morning an epiphany struck. The answer needn’t be so complex! Instead, perhaps it was the encoding PHP was using when it connected to the MySQL server that was the source of the problem. When I’d pull that data and save it over to the database that was UTF-8 encoded, that’s when it would be mangled. There was a character translation going on somewhere, and I was certain this was it.

To understand the confusion that lead up to my frantic Google searches and the discovery of puffer-fish-extract-like medicinal remedies such as running str_replace() on bin2hex() output, a little bit of knowledge regarding the MySQL tool set is helpful. First, two of the most common tools I use when importing test data into a database are mysqldump and the standard mysql client (for restoration). Interestingly, one only needs to reach as far as the man page for either utility to discover the source of the problem:

--default-character-set=charset_name

Use charset_name as the default character set. See Section 9.5, “Character Set Configuration”. If no character set is specified, mysqldump uses utf8, and earlier versions use latin1.

In retrospect, I should have considered this first. The old database was configured to use latin1 on all of its tables, and it undoubtedly survived through many early MySQL versions. The new database was set up on a very recent version of MySQL, as was my workstation’s test database, and was therefore using UTF-8 as the default character set. The only difference between the two was that my workstation has everything configured to use UTF-8–the server, however, does not. Contributing to the confusion was the behavior of the import: When I would run a mysqldump on one of the data tables and import it into my workstation’s database, I noticed three extra characters in place of the apostrophes. Yet the server only displayed a single unexpected character.

Something was afoot.

When I discovered the encoding defaults in MySQL’s tool chain, it occurred to me that I should have exported the source data as latin1 from the old database, and imported it as latin1 into my test harness. It worked, or at least it looked like it worked, and then I ran my import utility.

…and it broke again.

Somewhat furious that I couldn’t quite figure out what the solution was, I paused for a moment to reflect on what happened. Then it occurred to me that PHP was probably using whatever default the MySQL driver was configured to–namely, UTF-8. I added a line to my importer (using PDO):

$exportDb->query('SET CHARACTER SET latin1');

Then I re-ran my importer, waited for some sample data to complete, and checked the article. It worked. Oh, how gloriously it worked! Perhaps, I mused, the simplicity of the solution was at least partially to blame for its evasiveness.

Over-analysis, particularly in IT, can be a problematic hurdle that can only be avoided or straddled with sufficient experience. Oftentimes, the avoidance of our tendency to over-think a problem can only come from a lifetime of experiences that work together to undo all of the nonsense models we’ve established in our heads, and that’s why I sometimes feel so astounded whenever someone’s kid makes a profoundly deep, yet exceptionally simple observation. You know the sort: You’re left, probably in shock, thinking about what a damn genius that little bastard is going to be. But then, years down the road, that little genius goes to school and is taught “critical thinking skills,” goes into IT, and then sits up late one night writing a blog post about what a damn genius someone else’s kid is for thinking outside the box.

Maybe one of the lessons I should have taken away from this is that the best solution is often to take a few steps back and let the obvious fixes (at the time) flutter away on the wind. There’s usually an easier, better way to do it. Unfortunately, seeing that takes practice.

The shortened version of this can be condensed into the following: First, most of the time, converting one database to another will work out of the box and you won’t need to do anything more. Second, sometimes even with custom tools, database conversions will go smoothly even when dealing with different character sets. But, finally, sometimes you’ll encounter a data set that’s just old enough to be a small thorn in your side because it happened to have persisted through a major architectural change in whatever system it calls home. If that’s the case, don’t listen to anyone else–and especially don’t try mangling your data, because you’ll only make it worse!–check your client and server encoding and alter them accordingly. If you’re dumping data from a set of tables that use latin1, make sure your export tools also dump that same data in latin1; if you’re using mysql or mysqldump, that means using the --default-character-set option, and if you’re using PHP directly, configure the database driver accordingly. If you’re importing UTF-8 characters that were originally (or accidentally) stored in latin1, don’t panic. As long as you make sure to pull that source data in latin1 (or whatever encoding it was originally stored in), you should be OK: The conversion–or pass-through–to the destination can typically occur to UTF-8 without anything being lost.

You don’t even have to whisper arcane incantations to yourself at night.

No comments.
***

PEAR + PHPUnit on Windows

Windows? Yes, I know. No one does real development under Windows unless they’re either insane or one of those weird types who likes (and I mean really likes) Powershell. Sure, I’ll grant them that .NET is reasonably well done, but I’d counter that anyone who’s used a proper shell (I see Powershell as a hybrid REPL/shell for .NET) and done it the Unix Way would soon find that Windows feels quite anemic. I’m biased; I confess that I essentially use Windows only for games these days, and my Real Development occurs under Arch Linux (with apologies to Gentoo). That said, there are some useful tools and utilities that can coexist under both but often with some unnecessary frustration on the Windows front unless you install 3rd party package managers. There are a few good ones available for basic needs, but it’s something a stock Windows install sorely lacks–unless, of course, you have a strong affinity for Microsoft products. And no, before you ask, I don’t think the promise of an app store with Windows 8 will bridge the gap. Windows 8 is a whole ‘nother can of worms…

PHPUnit is probably the de facto standard unit testing suite for PHP5.2+. It’s also easiest to install using PEAR, but as of this writing, PEAR is not installed with the PHP for Windows distribution. Worse, a cursory Google search for running PEAR on Windows and, more importantly, installing PEAR apps like PHPUnit yields mostly unhelpful results. Sure, there’s a few moderately helpful links here and there, but most of them seem to link to other blogs, probably with the intent to siphon off some Google juju for themselves. There’s even one such result that appears to be a domain that was once hijacked and taken over by spammers. Also, the Stack Overflow results are oddly unimpressive.

Bummer.

So, here’s my answer to the doldrums, but mostly so that I can reference this instructional write up in the future should the need arise. Indeed it probably will given my terrible memory.

Install Pear

First, don’t listen to anyone else. Second, listen to your instinct. It is possible to install pear, and it is possible to get the PEAR script running perfectly fine such that you can use it from cmd.exe, just as you were suspecting all along. It’s also exceptionally easy; simply follow the standard installation instructions to install pear (but with some minor twists!). Also, if you’re afraid something could go wrong and bork your PHP installation, make a backup of your PHP installation folder prior to installing PEAR. I emphasize this point because it most certainly will happen. Remember: The necessity of a backup increases asymptotically as the difficulty of restoring such data also increases, up to and including the point where there is no such backup.

Here’s what you’ll need to do:

  • Download and install go-pear.phar from pear.php.net. You may also need to modify your php.ini in order to get PHARs working, but I’m fairly certain it ought to work from a stock PHP installation.
  • When the installer runs, correct the paths it prompts for as appropriate for your particular situation; be aware that if you don’t want pear.ini in your Windows directory, you will need to change this! For the most part, it should automatically locate everything it needs.
  • Update your environment variables as hinted by the installer.

Now, for that last bit, you can either merge the PEAR_ENV.reg registry entries as hinted by PEAR’s installer or, if you’d rather do it by hand, you can add these–modifying where appropriate:

  • PHP_PEAR_SYSCONF_DIR=C:\php – Should be the root of your PHP install.
  • PHP_PEAR_INSTALL_DIR=C:\php\pear – Should be the root of your PEAR install.
  • PHP_PEAR_DOC_DIR=C:\php\docs – Presumably, this is where PEAR apps will dump their documentation, if requested or required.
  • PHP_PEAR_BIN_DIR=C:\php – Path containing your pear.bat script.
  • PHP_PEAR_DATA_DIR=C:\php\data – PEAR’s data directory. Various odds and ends get lumped in here, including licenses.
  • PHP_PEAR_PHP_BIN=C:\php\php.exe – Obviously, your PHP binary.
  • PHP_PEAR_TEST_DIR=C:\php\test – PEAR applications will stick their unit tests into this directory.

Now, obviously, you could probably have a working PEAR install without setting all of these. I wouldn’t recommend it, though. You’re apt to encounter a package that really wants to distribute documentation (or license text), and if you’ve installed PHP in a location that PEAR doesn’t expect as a reasonable default, well, you can guess that this probably won’t work.

At this point, you’re probably curious to see whether or not PEAR is working. To check this, fire up a command prompt and run pear help or similar. If it runs, you should be OK. If it fails, you probably skipped a step. Don’t worry if you do. It should be fairly easy to fix, but you might have to remove every directory except dev and ext from your PHP install and start from the beginning. If you’re paranoid, make a backup of your PHP installation folder prior to installing PEAR. I said this already, didn’t I?

Installing PHPUnit

Here’s the fun part. If you’ve successfully managed to install PEAR and pear help hasn’t griped, this step will either be exceptionally informative and hint that you skipped something or it will work. If you encounter trouble installing PHPUnit via PEAR, you most likely forgot to configure the appropriate environment variables; check these first, ensure they’re correct, and close/reopen all command prompts.

Otherwise, simply running:

pear config-set auto_discover 1
pear install pear.phpunit.de/PHPUnit

Should work, and you’ll have a PHPUnit install that also works. Try it out with: phpunit –help!

At this point, you should be able to continue forward, run your unit tests under Windows, fix any breakage that results (trust me, you’ll probably have some if you weren’t developing with Windows in mind–and sometimes even if you were), and commit changes where appropriate. Believe me when I say it: Some things should just work out of the box, but there are others that will spectacularly fail even if you used DIRECTORY_SEPARATOR everywhere in your paths (though it shouldn’t matter) and were careful to avoid non-Windows-only extensions.

But hey, that’s what you wrote those unit tests for, right?

No comments.
***

Merry Christmas!

Merry Christmas all!

Once this holiday season is over, I’ll be getting back to the LotW. It’s been a month, I know. I haven’t bothered to take the time to assemble more links.

I also have some updates for the Atom-based media center (HTPC) project that will be worthwhile, too.

1 comment.
***
Page 2 of 41234