Musing about… Prepared Statements

A spurt of curiosity this evening–more specifically, one of those circumstances we each have from time to time wherein a handful of unrelated thoughts flutter about the conscious mind like a pair of butterflies flitting from flower to flower–consumed me sufficiently that I decided to do a brief Google search on prepared statements. I’m unsure where such a motive originated, but I’m fairly convinced that it was at least tangentially related to some misinformation I’ve heard of late related to web programming advice and also possibly due to my surprise that few commercial PHP bulletin board packages actually use prepared statements.

Before I begin, let’s consider for a moment that last and most disconcerting statement: Few commercial PHP forums use prepared statements. To the uninitiated, this might seem to be a matter of nick-picking unimportant to the real world. To the rest of you, it may come as a sad commentary on the state of modern programming and commercial software (perhaps, fittingly, as a commentary on the average run-of-the-mill PHP programmer). Prepared statements certainly aren’t new, and while they’ve been a part of PHP for a number of years now, it’s infuriating that they hardly see common use.

PHP first shipped PDO with PHP 5.1 (available as a PECL extension for PHP 5.0, circa 2004-2005). Intriguingly, for systems that don’t provide PDO support (or the appropriate drivers for PDO), the MySQLi and PostgreSQL functions and classes have provided prepared statements for quite some time, and the SQLite 3 drivers have provided a prepare() method since PHP 5.3. Commercial bulletin boards, like vBulletin and IPB, have seen many revisions in the years since, and several free/open source packages including phpBB have been part of similarly major overhauls. Yet the overwhelming majority of them still make no use of prepared statements. Humorously, as of this writing, vBulletin does provide a misleadingly-named sql_prepare method in its database class, but it doesn’t emulate prepared statements–it simply provides an escape wrapper with data type introspection and casting.

PDO has been available for nearly 8 years and many RDBMS drivers for PHP have offered prepared statements for at least that long (longer in the case of PostgreSQL if memory serves correctly). Yet every year or two, new major versions of popular PHP message boards are released, and every major release sees the same legacy database code under the hood. Perhaps it’s intentional. Perhaps the developers still want to support PHP 4.x in spite of the fact that it went EOL in 2008. Perhaps they just don’t know any better. Who knows!

Why Prepared Statements?

A prepared statement or parameterized statement, as it is occasionally known by in DBA parlance, is a specially-formatted SQL string that utilizes placeholders, either question marks (?), special named parameters (such as “:name”), or other database-specific strings, to indicate to the database or the driver where data is to be inserted. This has the benefit that, in theory at least, any data managed by a prepared statement is unlikely to serve as a vector for SQL injection attacks. The reason this works is because most drivers dispatch the prepared statement and its data separately on the wire and process them independently providing a certain degree of isolation. But wait, there’s more! Because of the implementation nature of prepared statements on most platforms, the query planner can often optimize and partially compile the statement such that, if it runs again, much of the legwork has already been completed and the query can run faster. Software like forums or blogs often execute the same query multiple times–with different data–so one might think it would be a natural fit. If it’s such a good thing, why do so many popular packages forgo such a benefit?

While I can’t answer for many developers, I think I know what at least part of the answer might be. First, for enormous code bases like vBulletin (and phpBB to a lesser extent), virtually no effort is made to separate the application logic from the underlying model. I’ll be fair in my distinction: The presentation layer is thankfully separated from the mess in the form of templates, but the remaining code is a bowl of spaghetti not unlike that of many of the very first PHP applications (and Perl!) that first graced the Internet over a decade ago. Because the model (and, by extension, the SQL) is so deeply entrenched in the functional logic of the application, reworking it to use prepared statements–and consider, also, that many of these queries are generated programmatically–would be a tremendous undertaking of many man-hours. Cleaning up the code properly such that it is more of a structurally sound framework (think MVC) is most certainly out of the picture. It isn’t impossible, of course, but when you consider that some functions in many of these software packages have persisted since the dawn of time, such refactoring becomes the thing of fairy tales.

To illustrate some of my displeasure, vBulletin version 4.2 still provides an iif function which is little more than a wrapper for the ternary operator (?:) in PHP. The ternary operator has been around since at least PHP4, yet there it is, in all its glory, a legacy function still available from the early days of PHP3 when such a beast didn’t exist!

One might think that it would simply be a matter of adding some logging code to old function calls, tracing the source that called them, and then reworking the culprit code to use built in language features. It might even take less than an afternoon.

The Drawbacks Programmer Mistakes

While prepared statements (parameterized queries for those of you who are embarrassingly excited by more elaborate verbiage) aren’t a panacea (I did it again) for all things SQL injection-like, they’re a good mitigation strategy, but it’s important to use them with caution. As Jason Lam states on the ISC Diary, “I still remember 4-5 years ago when SQL injection just started to become popular, the common mitigation suggested [was] to use prepared statement [sic] as if [they were] a magic bullet. As we [now] understand the SQL injection problem better, we realize that even prepared statement can be vulnerable to SQL injection as well.”

Well, yeah. This is where I smack my forehead. Maybe I’m being overly critical as I re-read a post from 5 years ago, because I’ve had the tremendously good fortune of witnessing some magnificently terrible code in my time as a web application developer.

Mr. Lam goes on to explain the insertion of unchecked user input, but I can’t shake the feeling that there is an implicit overtone in the article that it is somehow the fault of prepared statements. Perhaps more accurately, the article is faulting most of us for having championed prepared statements as a welcome solution to a very common and widespread problem. Realistically, though, it’s not an issue with prepared statements–they work just fine. It’s an issue with developers inappropriately using the tools at their disposal and doing so in a manner that simply transfers the vulnerability from query() to prepare() by forgetting to properly manage incoming data. Though, I should say that I’m inclined to suggest that programmatically assembling a prepared statement is somewhat counter-productive. More on this later.

Ironically, while doing some research for this article, I ran across a couple of posts on Stack Overflow that presented this problem of unchecked user input as one of the primary drawbacks of prepared statements. Really? Drawbacks? If you’re not using named parameters or placeholders for your query data, you’re probably not using prepared statements correctly! But drawbacks? Gee, maybe we were a little too vigilant in telling people to use prepared statements–so much so that they did a find/replace for query and swapped it with prepare. (I’m being facetious; so, to head off any comments to the contrary, it’s not at all possible to simply swap some text, because prepared statements do require a little more work.)

The problem I have with labeling unchecked user input as a drawback of prepared statements is that it is no longer a “real” prepared statement whenever such data is concatenated with the resulting query. Yes, it is still a prepared statement, insofar as calling prepare() on the driver’s end, but it’s no longer being used like a prepared statement. Here’s a hint to new developers, particularly PHP developers since a huge percentage of them are guilty of doing stupid things like this: Never concatenate unchecked input in any query–prepared or otherwise. If you’re using a prepared statement, use it like a damned prepared statement. The moment you start piping data into the query string itself, it’s no longer going to have the benefits of a prepared statement. (I’ll give you a special exception if you’re using LIMIT and concatenating integers with your queries since not all of you may be running MySQL 5.0.7 or later.)

Will the Real Prepared Statement Please Step Forward?

In my mind, and trust me, it’s a very strange place in here, a prepared statement is one that may contain parameters and is “prepared” ahead of time for reuse (that is, compiled) by the driver or the RDBMS (usually the RDBMS). Nothing more, nothing less. The instant some unfiltered data is slapped on to the end of the query, it’s no longer a pure prepared statement; instead, it becomes a mistake. Again: Prepared statements are parameterized queries that are usually compiled by the backend for a little extra speed. A query can contain anything else that the programmer adds into it, but fundamentally, a prepared statement is something that dictates a very specific structure. It certainly cannot overcome the mistakes of a naive developer who, believing that a prepared statement will magically fix all of their (singular they, sorry linguistic purists) security-related woes, use such a tool in addition to dangerous techniques like concatenating unchecked input. Another way to look at it is thus: If prepared statements are prepared (that is, compiled) by the database for reuse, and the developer is concatenating a dynamic value to the statement, the entire benefit of preparing (compiling) that statement is immediately lost, because the RDBMS has to re-compile the statement every single time it’s sent along the wire. Please, don’t do this.

Of course, there may be reasons not to use prepared statements all the time. For one, prepared statements in MySQL versions prior to 5.1 can no longer be managed by the query cache which may impact performance (High Performance MySQL, 2nd ed., p229). DBMSs that don’t support prepared statements can, in PDO at least, can have them emulated by the PDO driver at the cost of some pre-processing performance, and using older PHP functions like the popular-but-now-deprecated mysql_* ones just outright don’t support anything but basic queries (they also don’t use the binary interface, making them somewhat slower). If you’re using only a single query with absolutely no intention of reusing it, prepared statements may incur some overhead since the query must be compiled. Furthermore, for MySQL at least, if you’re not using stored procedures, the database has no way to share compiled prepared statements among multiple connections. Yet, while a prepared statement is no substitute for caution–particularly with programmatically-generated queries–it is a useful tool in the developer’s arsenal to protect against attacks like SQL injection. Plus, if you make it a habit to use PDO (you should), not only do you get emulated prepared statements for databases that don’t support them, you also get to use the modern MySQL APIs under the hood and some consistency, which says a lot in the PHP world.

Tangentially, this is also why it boggles my mind that many sites (banks, news agencies, airlines, and even some game companies) limit what characters the user can enter for their password, and how so many companies with an online presence often have draconian limits of less than 16 characters, inclusive. Seriously: If you’re correctly storing a secure hash of the password (HMAC, bcrypt, scrypt, or at least SHA256 or similar), you don’t need to store the password directly, nor does it matter if the password is 5 or 500 characters. It’s going to be comprised of a fixed length of some limited set of ASCII characters representing hexadecimal numbers which can be stored without much fuss. The 1990s were over two decades away. I think it’s time we stopped writing code like Y2K is still a viable problem.

Also, let’s start using prepared statements a little more often.

1 comment.
***

PHP, Unicode, and MySQL Character Sets

This post could be subtitled: “When importing your old data breaks your character encoding,” but even that doesn’t quite capture the frustration felt when unexpected UTFမ8 (or UTF-16) characters are scattered throughout your data.

Historically, PHP and MySQL have shared mutually beneficial positions within the web services ecosystem. It’s no surprise then that the two have more or less evolved together, benefiting from the other’s success in what those with a background in the natural sciences might consider a symbiotic relationship. To say that the two began life in a world where latin1 (ISO-8859-1 and its more location-specific derivatives or “code pages”) was the de facto standard encoding might be an understatement, but it is also conveniently ignores a little piece of history. Things change, people change, and I suppose it could be said that sometimes standards change. Or, more correctly, they’re superseded by better standards, cast away into the forgotten reaches of history by the marauding armies of deprecation. This realm is also periodically referred to in technical parlance as the IETF archives.

Sometimes, but not always, old and clunky standards linger on where they refuse to die, because doing things the right way is actually quite difficult. Not to mention that if the old way of doing something has always worked, it usually has enough managerial or influential inertia to carry it on into a future that’s very different from what its developers envisioned.

Many years ago, PHP6 was announced as a planned upgrade path from PHP5.x. Many promises were made: Namespaces, closures, and unicode support (in the form of UTF-16) to name a few. But the process stalled as developers were bogged down by the difficulty of converting everything to Unicode. Namespaces and closures were eventually migrated into the PHP5.3 branch, and it seemed that language-level unicode support would have to wait. It also didn’t help that many users complained loudly about potential breakage; alas, sacrifices must occasionally be made when moving forward, and in our industry, it’s often the users themselves who most fiercely resist change. Admittedly, the entrenchment of PHP in the web services sector probably hasn’t helped much to this end…

The current metrics are nevertheless quite promising in spite of the delays. As of present, the current statistics indicate that the core conversion of PHP to unicode is about 70% complete (accessed May 2nd, 2012). Of course, what progress has been made–if it’s anything short of complete–is of little use to those who have an immediate need for complete unicode support. For others, like myself, unicode is a nice-to-have, but for a majority of the work-related data I’ve seen, it’s a matter of dealing with systems that were written years ago by naive developers.

Continuing the story: MySQL eventually added mostly-working support for UTF-8. I say mostly, because it like everything that came before suffered from occasional breakage and the weirdness one might expect in a system that saw such a drastic change. However, even in the early days it worked well for systems that were careful not to mangle incoming unicode and for those that were properly configured. Indeed, a fresh install of WordPress will configure tables to use UTF-8 as the default character set (assuming your server supports it, of course!), and if you’re careful, you can muck with incoming data via PHP without disturbing anything important. UTF-8, PHP, nor MySQL are fundamentally the problem; character set conversion, however, is.

With one of our latest site conversions, we noticed a flurry of unusual artifacts scattered throughout most of the articles we imported into WordPress from a CMS that shall remain unnamed. Things like “Ä”, which was supposed to be the UTF-8 non-breaking space, to “’”, which was supposed to be a stylized right-apostrophe (’), were scattered throughout almost every article. Evidently, the original site owners had interesting copy/paste habits (they used Word), but the site worked well for them and I can’t bring myself to judge anyone for putting together a system that worked for their specific needs.

I researched character encodings for a couple days, usually between fixing other things I had broken (oops!), and couldn’t come up with a definitive solution. I’m pretty sure I tried just about everything I could think of, including some unorthodox suggestions like examining the bin2hex() output, running an str_replace() on it, and then re-assembling it back into a binary string. But whereas an unorthodox–and unnecessarily complex–process might work, it isn’t often the right way to do something. Heck, I even tried various incantations of iconv and mb_string‘s conversion functions before giving up, left with the impression that whatever dark magics I thought I had once possessed regarding UTF-8 mangling had long since wafted away not unlike the wisps of smoke from a witch doctor’s extinguished incense burner.

After puzzling over the matter for a few hours the following morning an epiphany struck. The answer needn’t be so complex! Instead, perhaps it was the encoding PHP was using when it connected to the MySQL server that was the source of the problem. When I’d pull that data and save it over to the database that was UTF-8 encoded, that’s when it would be mangled. There was a character translation going on somewhere, and I was certain this was it.

To understand the confusion that lead up to my frantic Google searches and the discovery of puffer-fish-extract-like medicinal remedies such as running str_replace() on bin2hex() output, a little bit of knowledge regarding the MySQL tool set is helpful. First, two of the most common tools I use when importing test data into a database are mysqldump and the standard mysql client (for restoration). Interestingly, one only needs to reach as far as the man page for either utility to discover the source of the problem:

--default-character-set=charset_name

Use charset_name as the default character set. See Section 9.5, “Character Set Configuration”. If no character set is specified, mysqldump uses utf8, and earlier versions use latin1.

In retrospect, I should have considered this first. The old database was configured to use latin1 on all of its tables, and it undoubtedly survived through many early MySQL versions. The new database was set up on a very recent version of MySQL, as was my workstation’s test database, and was therefore using UTF-8 as the default character set. The only difference between the two was that my workstation has everything configured to use UTF-8–the server, however, does not. Contributing to the confusion was the behavior of the import: When I would run a mysqldump on one of the data tables and import it into my workstation’s database, I noticed three extra characters in place of the apostrophes. Yet the server only displayed a single unexpected character.

Something was afoot.

When I discovered the encoding defaults in MySQL’s tool chain, it occurred to me that I should have exported the source data as latin1 from the old database, and imported it as latin1 into my test harness. It worked, or at least it looked like it worked, and then I ran my import utility.

…and it broke again.

Somewhat furious that I couldn’t quite figure out what the solution was, I paused for a moment to reflect on what happened. Then it occurred to me that PHP was probably using whatever default the MySQL driver was configured to–namely, UTF-8. I added a line to my importer (using PDO):

$exportDb->query('SET CHARACTER SET latin1');

Then I re-ran my importer, waited for some sample data to complete, and checked the article. It worked. Oh, how gloriously it worked! Perhaps, I mused, the simplicity of the solution was at least partially to blame for its evasiveness.

Over-analysis, particularly in IT, can be a problematic hurdle that can only be avoided or straddled with sufficient experience. Oftentimes, the avoidance of our tendency to over-think a problem can only come from a lifetime of experiences that work together to undo all of the nonsense models we’ve established in our heads, and that’s why I sometimes feel so astounded whenever someone’s kid makes a profoundly deep, yet exceptionally simple observation. You know the sort: You’re left, probably in shock, thinking about what a damn genius that little bastard is going to be. But then, years down the road, that little genius goes to school and is taught “critical thinking skills,” goes into IT, and then sits up late one night writing a blog post about what a damn genius someone else’s kid is for thinking outside the box.

Maybe one of the lessons I should have taken away from this is that the best solution is often to take a few steps back and let the obvious fixes (at the time) flutter away on the wind. There’s usually an easier, better way to do it. Unfortunately, seeing that takes practice.

The shortened version of this can be condensed into the following: First, most of the time, converting one database to another will work out of the box and you won’t need to do anything more. Second, sometimes even with custom tools, database conversions will go smoothly even when dealing with different character sets. But, finally, sometimes you’ll encounter a data set that’s just old enough to be a small thorn in your side because it happened to have persisted through a major architectural change in whatever system it calls home. If that’s the case, don’t listen to anyone else–and especially don’t try mangling your data, because you’ll only make it worse!–check your client and server encoding and alter them accordingly. If you’re dumping data from a set of tables that use latin1, make sure your export tools also dump that same data in latin1; if you’re using mysql or mysqldump, that means using the --default-character-set option, and if you’re using PHP directly, configure the database driver accordingly. If you’re importing UTF-8 characters that were originally (or accidentally) stored in latin1, don’t panic. As long as you make sure to pull that source data in latin1 (or whatever encoding it was originally stored in), you should be OK: The conversion–or pass-through–to the destination can typically occur to UTF-8 without anything being lost.

You don’t even have to whisper arcane incantations to yourself at night.

No comments.
***

Symfony 1.3/1.4 and Suhosin

I read about Symfony some time back when I was working on a project in CakePHP. Symfony struck my interests primarily because it uses Doctrine (by default) as its ORM, and I’ve used Doctrine in a couple of other projects. This weekend, I elected to give Symfony a try. After reading through the documentation–which I’ll usually do for a few hours before taking the final plunge–I was satisfied that Symfony would be a great framework to learn next.

Updated March 2nd, 2010. Click here to view my updated thoughts.

Following the initial guide was easy until I attempted to view the example application. Then I encountered this:

Fatal error: SUHOSIN – Use of preg_replace() with /e modifier is forbidden by configuration in /home/bshelton/phpapps/sf-test/lib/symfony/lib/response/sfWebResponse.class.php(409) : regexp code on line 409

Uh oh! This struck me as odd, considering Symfony prides itself on being a secure-by-default framework and it triggers a Suhosin warning from the stock install? Great. But don’t be too alarmed as the offending code isn’t a security hole (newlines were inserted by me to increase readability on the web):

return preg_replace('/\-(.)/e',
  "'-'.strtoupper('\\1')",
  strtr(ucfirst(strtolower($name)), '_', '-'));

The Fix

While the /e modifier does have the potential to execute PHP code, the offending line merely takes the first character immediately following a “-” and uppercases it (the strtoupper call is the only eval’d code). However, rather than disable parts of Suhosin–which could be bad if third party code is included that also does something similar and requires auditing–I decided to fix the issue. According to the Suhosin docs, the fix is fairly simple. Simply change line 409 in symfony/lib/response/sfWebResponse.class.php to:

PHP 5.2 and below:

return preg_replace_callback('/\-(.)/',
        create_function('$matches', 'return \'-\'.strtoupper($matches[1]);'),
        strtr(ucfirst(strtolower($name)),
        '_', '-'));

PHP 5.3 and up:

return preg_replace_callback('/\-(.)/',
        function ($matches) { return '-'.strtoupper($matches[1]); },
        strtr(ucfirst(strtolower($name)),
        '_', '-'));

In this case, rather than evaluating PHP code as part of the replacement, a callback function is created that performs the same approximate test. Obviously, the solution for PHP 5.2 and below could be equally as dangerous, but if we’re careful, we can create a function exactly as we intended.

Performance Concerns

After I discovered this solution by browsing the PHP documentation, it occurred to me that callbacks in preg_replace_callback are slightly slower than performing a call to preg_replace without callbacks. I couldn’t recall precisely how slow, so I elected to perform a benchmark using several different ideas at tackling this particular problem. This section highlights these benchmarks along with each potential solution. Ironically, using preg_replace_callback is about 30% slower than performing character-by-character replacement.

The Methodology

The methodology I used to test various solutions consisted of the following steps:

  1. Generate a large data set to aid in characterizing performance differences and elevate measurable performance discrepancies above the noise floor
  2. Write each solution such that it consumes the data generated in step 1
  3. Compare the performance of each method against the stock Symfony solution

Since the stock Symfony code is designed to “normalize” (their words) headers generated and sent to the client, the source data is created by a script that picks a random number of characters (all lowercase) and joins them together using a dash (-) or an underscore (_) in order to emulate strings like “content-type” or “content_type.” The same data set is used for all subsequent tests. No more than one dash or underscore is used per string; although this is unlikely cause for much concern as most headers are unlikely to have more than one or two separators.

You can download the script that generates this data, along with the source data itself, in the archive toward the bottom of this post.

Note: These solutions were tested on PHP 5.2 only. You’ll likely discover differences when testing on other PHP versions.

The Results

Before we examine each solution, let’s take a look at the results:

Benchmark Plot

As you can tell from this chart, the Suhosin-recommended solution of using preg_replace_callback is the slowest. The stock Symfony code performs rather well but it still slower than simplified string replacement calls. We’ll examine each solution in the following section.

The Code

The code for each solution is provided below along with a brief description of what it does. The actual benchmarks (and benchmark data) is also provided. Remember that each of these solutions is modified to use our source data.

Solution 1: Stock Symfony Solution
<?php
 
include_once 'source-data.php';
 
foreach ($source as $s) {
    preg_replace('/\-(.)/e', "'-'.strtoupper('\\1')", strtr(ucfirst(strtolower($s)), '_', '-'));
}

The stock Symfony solution to the normalization problem is to first convert the string to lowercase, capitalize the first character, and translate all underscores (_) into dashes (-). preg_replace is then run and evaluates the replacement, which converts the first character immediately following a dash to its uppercase equivalent.

Solution 2: preg_replace_callback
<?php
 
include_once 'source-data.php';
 
foreach ($source as $s) {
    preg_replace_callback('/\-(.)/',
        create_function('$matches', 'return \'-\'.strtoupper($matches[1]);'),
        strtr(ucfirst(strtolower($name)),
        '_', '-'));
}

As hinted by the Suhosin documentation, preg_replace_callback is recommended over evaluating PCRE replacements. Unfortunately, this solution is also the slowest. In structure, it is most similar to the stock Symfony code but replaces the evaluated code with an anonymous function. I have not tested true anonymous as presented in PHP 5.3, however. Given that create_function likely has to create an additional interpreter instance in PHP 5.2, this solution might be slightly faster in later versions of PHP when using this code:

return preg_replace_callback('/\-(.)/',
        function ($matches) { return '-'.strtoupper($matches[1]); },
        strtr(ucfirst(strtolower($name)),
        '_', '-'));
Solution 3: Array and String Manipulation Only
<?php
 
include_once 'source-data.php';
 
foreach ($source as $s) {
    $s = strtr(strtolower($s), '_', '-');
    $tmp = explode('-', $s);
    foreach ($tmp as &$t) {
        $t = ucfirst($t);
    }
    $buf = implode('-', $tmp);
}

This solution was among the fastest but its performance remains within the margin of error and may therefore be tied with Solution 4. As with solutions 1 and 2, this solution translates all characters to lowercase and replaces all underscores (_) with dashes (-). However, unlike the previous solution, this one splits each header along the dash boundary, loops through the remaining items, converts them in-place using ucfirst, and then joins them together again with a dash.

This solution is not optimal, and I expected it to be among the slowest. I was surprised to discover that this solution performed faster than the others, and I initially assumed the overhead of preg_replace was due in no small part to the requirement of loading the PCRE engine. It is also likely that increasing the number of split points in the header (by way of more dashes) will likewise increase the amount of time this solution requires to run.

My presumed explanations for the performance of this code segment were challenged with the next test.

Solution 4: Two Replacements
<?php
 
include_once 'source-data.php';
 
foreach ($source as $s) {
    $s = strtr(ucfirst(strtolower($s)), '_', '-');
    preg_match('/\-./', $s, $matches);
    str_replace($matches[0], strtoupper($matches[0]), $s);
}

I confess that the file name for this test is slightly misleading, and I’ll correct it in the posted archive. The initial intention was to utilize two preg_replace statements, but I elected (at the last minute, no less), to use only one in conjunction with an str_replace. In this solution, a dash (-) followed by any single character is captured, capitalized, and then replaced into the final product. As with Solution 3, this is among the fastest of the 5 tested. This solution may also scale slightly better than Solution 3 for headers containing more than a single dash.

Note that the performance of this solution hints that the slightly slower behavior of solutions 1 and 2 are likely due to code evaluation rather than the overhead of loading PCRE.

Solution 5: Character-by-Character Replacement
<?php
 
include_once 'source-data.php';
 
foreach ($source as $s) {
    $s = ucfirst(strtolower($s));
    $buf = '';
    for ($i = 0; $i < strlen($s); $i++) {
        if ($s[$i-1] == '-')
            $buf .= strtoupper($s[$i]);
        else
            $buf .= $s[$i];
    }
}

I initially assumed this would be the slowest performing benchmark of the 5. I was surprised to discover that it is the second slowest. Indeed, this solution performed approximately as well compared to preg_replace_callback as the stock Symfony code did in contrast with this solution.

Conclusion

The Symfony stock code is quite fast but still appears to create instances (at least in sfWebResponse.class.php) where partial evaluation of PHP code is necessary. eval‘d code isn’t necessarily evil, and while it most certainly is a vector for exploitation, it is the manner in which code is evaluated that makes it dangerous. Regardless, I think it is appropriate to evaluate (I made a pun!) circumstances in which code itself is eval‘d and question whether such calls are necessary. While these benchmarks are admittedly highly artificial, it is fairly obvious that in some situations, less code does not necessarily translate to better performance. More importantly, highly compressed code can be difficult for new maintainers to understand and therefore become more error prone or more likely to encounter breakage than simpler but more verbose statements.

The benchmarks listed in this article may be downloaded here, along with the spreadsheet used to chart the data points (apologies that it is in .xlsx format; I’ve been testing the Office 2010 beta–I’ll post a version with an .ods later!). Please recall that this benchmark isn’t scientific. The tests I conducted are highly artificial and are presented for only an exceedingly small subset of data variation. It is likely that other methods are faster, more efficient, and more appropriate for the given solution. Indeed, the entire purpose for this exercise was two-fold: 1) To avoid having to make any configurational changes to Suhosin and 2) surprise myself with how many unique solutions I could create for a single problem. I think I succeeded on both counts.

If you post commentary, please keep in mind that this was conducted for my own purposes and curiosity only. It is neither intended to be malicious to the Symfony project nor issued as a correction to their sources (though a patch that eliminates the use of /e in preg_replace would be nice!). I’ve been highly impressed by the Symfony sources. Unlike most PHP code, they’re clean, concise, easy to understand, and well-documented. It’s a breath of fresh air compared to a vast majority of PHP-based projects. My thanks to the Symfony project in general and Fabien Potencier in particular. Please feel free to post corrections to my methods but be polite about it!

Recommendations? I’d suggest using solution #4 if you’re encountering issues with Suhosin and Symfony.

Updates: March 2nd, 2010

I’m growing increasingly less impressed with Symfony and its internal design. It appears the developers have a strong affinity for preg_replace‘s /e modifier. Yes, I understand, you audit your code and it’s secure. But do you know what the most significant problem is with suggesting users disable Suhosin’s suhosin.executor.disable_emodifier flag is? Users are likely to do it globally–rather than in an .htaccess file somewhere–and as a consequence, they’ll likely open their systems up for (admittedly unlikely but possible) far worse things. I haven’t tried Symfony 2.0 yet, but I hope they’ve resolved this. It’s stupid. And it pisses me off.

Anyway, if you’re looking for some quick fixes and are going through the Symfony tutorial and do not want to disable any part of Suhosin, here’s what you’re going to need to do.

First, change line 409 in symfony/lib/response/sfWebResponse.class.php to:

    $name = strtr(ucfirst(strtolower($name)), '_', '-');
    if (preg_match('/\-./', $name, $matches))
        return str_replace($matches[0], strtoupper($matches[0]), $name);
    return $name;

Then change line 281 in symfony/lib/form/addon/sfFormObject.class.php to:

    if (preg_match_all('#(?:/|_|-)+.#', $text, $matches)) {
        foreach ($matches[0] as $match)
            $text = str_replace($match, strtoupper($match), $text);
        return ucfirst(strtr($text, array('/' => '::', '_' => '', '-' => '')));
    }
    return ucfirst($text);

Naturally, you’re probably better off listening to the advice of the developers directly, but this is my solution. Your mileage may vary.

Note to the developers: If you stumble upon this post in the near future (obviously this doesn’t apply to versions of Symfony beyond 1.4), feel free to add your commentary. Be mindful that some of us actually want to leave suhosin.executor.disable_emodifier set to on, including for your product. I will admit that the lines of code you’ve written look reasonably safe and do not appear to be accepting tainted input, but I’m not going to risk that. I’ve been running several versions of phpBB–and we all know the sorts of holes that software has–along with WordPress and a few other PHP-based web apps without ever having encountered this issue before. Seriously, it makes me kind of worried about the rest of your code! Let’s take a look at the comments in the Suhosin INI file:

; The /e modifier inside preg_replace() allows code execution. Often it is the
; cause for remote code execution exploits. It is wise to deactivate this
; feature and test where in the application it is used. The developer using the
; /e modifier should be made aware that he should use preg_replace_callback()
; instead.

‘Nuff said.


Here’s a good source on preg_replace, why you should always use single quotes, common mistakes, and why you should really just avoid using /e in the first place.

No comments.
***
Page 1 of 3123