2014 February - Benjamin Shelton's Musings

SQL and PHP

Friday February 21st, 2014

We have a few things to talk about regarding SQL and PHP. It’s short and sweet, so I’ll be brief.

I’ll start with saying that I’d like to put this tactfully, but unfortunately, I don’t know how else to say this:

It’s 2014 and we’re still not using parameterized statements.

Seriously.

I ran across a project that was linked from one of the various “discover Github” mailings I’m subscribed to that looks very promising. It’s called Lychee, it looks great, and the demo functions well. I can’t think of any other open source gallery systems at the moment that look as good as Lynchee (or as simple). There’s just one problem: It’s an absolute haven for SQL injection attacks. Gosh, I don’t even know where to begin. Nearly every single query that accepts some form of external input is a potential landmine of shame, sadness, and roasted pandas. Worse, the last ticket addressing this particular issue was posted about a month ago. If there’s no activity this weekend, I might just go through and convert it to use parameterized statements myself and offer a pull request.

The other problem I have with it is the code formatting. It’s pretty ugly. It certainly wouldn’t pass PSR-2, and it reminds me vaguely of the ad hoc formatting found in most other popular PHP products (commercial and otherwise, so it’s not like Lychee is unique). But, that’s okay. Code formatting is mostly subjective, as long as you’re following some standard (even your own) and remain consistent. While some of us would rather split hairs over various stupid things, it’s mostly incidental because we care about appearance. Appearance does matter, because care and consideration for how the code looks probably reflects how the code operates. Pragmatically, though, appearance doesn’t matter that much. It can be fixed with automated tools. No biggy.

However, the SQL injections are unforgivable.

There are hundreds of resources on the matter, many targeting PHP. So it’s not like there’s much of an excuse. Heck, I’ve written about this in the past, and I’m not even a smart dude. If you haven’t heard of parameterized statements, either you’ve been living under a rock or you’re new at this.

The good news is that if you’re new to the big bad world of web software, don’t fret. The absolute best place to get started if you’re writing PHP is to read PHP the Right Way. It includes dozens of tips and tricks, information on established and emerging standards, and delves into a few advanced topics for larger applications. I would highly encourage you to read the entire thing. In particularly, I would suggest reading the section on databases, because it covers PDO, SQL injections, and parameterized statements (often known as prepared statements). There’s nothing wrong with the MySQLi extensions per se but new PHP code should use PDO instead. PDO even has a lesser known feature to inject queries into objects, which can take care of some of the more mundane work (if you happen to swing that way).

Don’t get me wrong. It’s great to see new projects popping up all the time challenging older, more established ones. It breathes new life into under-served markets and creates a greater body of useful tools for all of us. Sometimes a fresh look into old concepts helps us think outside the box, questioning established design principles, and jarring us back into the reality that there might actually be a better way of doing things. Please don’t feel this post is an outright criticism: It’s not. We all had to start somewhere. If you’ve never heard of SQL injections, well, now you have. Extra information in your arsenal of tools won’t hurt, but it certainly will help combat one of the most common design flaws in software today.

What you should take away from this post is that there are unsavory individuals out there who are deeply interested in easy-to-exploit flaws in software. These folks gain a great deal of enjoyment from imposing misery on others, either by writ of outright damage or embarrassment. Because of them, we have to develop for the web differently than we might have, say, 10 years ago.

So let’s start by teaching the newcomers not to repeat our mistakes!

No comments.

***

nginx v1.4.5 and IPv6

Saturday February 15th, 2014

I recently updated the VPS this blog is sitting on. Coincidentally, this also updated nginx to the latest version and broke everything. I didn’t think much of it at the time, but when I linked a friend to this post over on my fun blog, he was delivered to the default nginx page. Puzzled, I poked around for a while, mostly examining DNS records and server configurations. I couldn’t find anything wrong.

Then I had a eureka moment.

I’m on IPv6 at home. I have this site (and others) configured to use IPv6. It hadn’t occurred to me until then that it might be protocol related. Using curl (curl -4 and curl -6), I confirmed my suspicions. Although the server was listening on TCP and TCP6, it was only serving up the vhosts on IPv6 and not IPv4. IPv4 was receiving the standard welcome page.

I knew that I had configured the server appropriately for both stacks. I’ve read through the docs. I combed through dozens of blog posts documenting the process. I was convinced the server was correctly configured. I must’ve fiddled with it for a good hour or so, reviewing documentation and the likes to no avail.

Infuriating.

Since nginx 1.2 or 1.3 (I can’t remember precisely), it’s been necessary to add ipv6only=off to the listen directives in order to support a dual stack environment. It’s my understanding this trick doesn’t work on some BSDs, but I know for a fact it worked fine under Linux. Or so I thought. I tried it successfully under Arch and Ubuntu with identical results with the exception that I neglected to recall one minor detail: My Arch install updated to nginx 1.4-something well after I had configured my desktop for developing on a dual IPv4/IPv6 stack. I suspect it’s probably broken in the same manner. But, I use it strictly for development, so I’m not particularly concerned whether or not it works on IPv4. I don’t use the protocol much within my network, so why worry, right?

To continue: I decided to take another stab at it and discovered something curious. Previously, all that was required to enable dual-stack support in nginx was to add the following to whatever was configured as the default host

    listen [::]:80 ipv6only=off default_server

And then all subsequent vhosts simply required

    listen[::]:80;

That’s all. It used to work–like magic. But, sadly, magic eventually runs out. This is why electronics stop working once you let all the “magic smoke” escape. Sorry, it’s an old electrical engineering joke my father has oft repeated. I guess it’s brushed off onto me.

Anyway, here’s the solution. You might find it contrary to some of the antiquated information out there lurking on various blogs dating back from 2011 through the middle of 2013. It works for nginx 1.4.5 (and possibly earlier versions), but the trick is to add this to the default vhost configuration

    listen[::]:80 ipv6only=on default_server;
    listen 80 default_server;

And for all subsequent vhosts

    listen[::]:80;
    listen 80;

I should note it works fine without adding the ipv6only=on directive, just like the generic vhost config (above). I believe I’ve read that this is because the default behavior enables ipv6only automatically. However, if you’re running a slightly older version, you might need to keep it. Hence why I’m not going to remove it from my examples. Better safe than sorry, right?

default_server is (hopefully) obvious, but only required if you want to provide a default site (or page) for users hitting your web server’s IP. Or for ancient browsers that haven’t been taught how to use the Host header. Are there any of those left?

So, the trick is that you need two listen directives. Period. Yes, even for TLS/SSL. If you skip these directives on any vhost, the missing protocol binding will be skipped for that vhost. I suspect this is probably documented somewhere. The problem though is that there are literally dozens of blogs pointing the old instructions that used to work. These are now deprecated. Following them will only lead to sadness.

Initial frustration aside, I find meshes well with my preferences. It’s more explicit and there’s no question which protocols nginx will use when binding to the configure port or ports. However, it will cause headaches for IPv6-enabled sites migrating from nginx 1.2. So, if you’re running Ubuntu and have decided to update in order to gain access to newer features (websocket support, SPDY, et al), expect breakage. More importantly, be absolutely certain you’ve independently tested all of your deployed sites using IPv4 and IPv6. Make liberal use of the -4 and -6 switches for curl. It’ll save you from unpleasant surprises.

1 comment.

***

Lies, Damn Lies, and PHP Benchmarks

Wednesday February 5th, 2014

I need to get something off my chest. First, I’d like you to examine the following code:

 
class Test
{
    public function output ()
    {
        return 'Hello, world!';
    }
}
 
function test ()
{
    return 'Hello, world!';
}
 
$start = microtime(true);
for ($i = 0; $i < 10000000; $i++) {
    test();
}
printf("%0.2f seconds for 10 million function calls.\n", microtime(true)-$start);
 
$start = microtime(true);
for ($i = 0; $i < 10000000; $i++) {
    $c = new Test();
    $c->output();
}
printf("%0.2f seconds for 10 million class calls.\n", microtime(true)-$start);
 
// Outputs about 4.9 seconds for the first and 7.9 for the second on my system.

How many times have you seen a benchmark like this passed off as proof of some facet of PHP’s behavior, particularly relating to classes versus functions? Next, ask yourself: What does this “benchmark” prove?

The answer: Absolutely nothing. Most developers are already aware (or should be) that instantiation of a class invokes a fair amount of overhead not present with functions in PHP. Yet time and again, whenever I encounter a discussion relating to PHP best practices or performance, I find comments that allude to any of the following:

PHP is not Java!
${framework} uses classes, therefore ${framework} is slow!
PHP is a templating language. Replicating OOP concepts in PHP is overkill/stupid/slow.

Usually, but not always, one (or more) of the above opinions is presented with an allusion to benchmarks not all that dissimilar from the horrible, horrible sample of code at the beginning of this post. Why?

I don’t have a good answer, but I suspect it’s because PHP is one of the most oft-benchmarked scripting languages of our time, probably because the barrier to entry is so low. The worst part? No one knows how to benchmark. Granted, I’m guilty of the same charge, and I’ve published awful benchmarks in the past somehow “proving” (for some value of proof) that a specific feature is slower than another. So, as penance for my own wrongdoings, I want to make a point, and I want to make it as clear as I possibly can.

Your PHP benchmarks are wrong.

Don’t feel bad, though. Nearly all of them are wrong, including most hardware benchmarks. The reason for this is benchmarks, by their nature, are synthetic, and the nature of a synthetic benchmark is such that it fails to capture real world behavior. If you want to create a realistic benchmark, you’re going to have to put an awful lot of work into emulating a full stack, and the only way to do this correctly is to effectively write a small application. You can’t simply toss a few function calls into a for loop and call it a day. But the problem with this approach is that once you’ve implemented a demonstration application, you’re no longer benchmarking the language–you’re benchmarking your library or application. The only way you can really benchmark a language is to do so in a manner that can be replicated across multiple platforms in a manner that each are roughly equivalent and the benchmark captures the relative performance of each.

If you don’t buy that argument, I’d suggest you look at the TechEmpower framework benchmarks. They’ve put a lot of work into creating a fair, realistic collection of benchmarks for dozens and dozens of frameworks for each of the popular languages. If you’re not willing to invest a similar magnitude of effort into your own benchmarks, you’re not going to prove anything. You won’t prove that a specific language feature is too slow to use, you won’t prove that functions are better than classes, and all you’re going to accomplish is wasting your time–and worse–your readers’ time.

I’ll explain further, but before I delve into greater detail about why looping over a function ten million times proves absolutely nothing, I want to demonstrate by example. Nothing beats a good illustration to broaden one’s horizons, so I’ve fabricated a slightly less terrible benchmark than the code snippet that began this post.

A Slightly Less Terrible Benchmark

This benchmark is intended to illustrate two objectives: 1) That PHP language features don’t differ by much in terms of performance relative to each other and 2) that simple benchmarks are effectively pointless and only prove what the author intends for them to prove. The benchmarks are broken down into six files:

out-class-inheritance.php – Uses class inheritance to output a short string.
out-class-interface-inheritance.php – Uses an interface to define class methods and inheritance to output a short string.
out-class-interface.php – Uses an interface to define class methods to output a short string.
out-class.php – Use a simple class to output a short string.
out-class-static.php – Uses a static method of a class to output a short string.
out-function.php – Uses a single function to output a short string.

Each benchmark was run three times using Apache Bench (ab) on Arch Linux with a concurrency of 10 across 10000 runs. In keeping with the tradition of poorly designed benchmarks, ab was run on the same system as the web server. It’s a terrible idea, I know, but we’re only measuring approximate performance of individual language features with the intent to prove that there isn’t a substantial difference.

Note that I won’t include the sources to this benchmark here. If you want to examine them, please review this Gist. They’re about as simple as the code fragment at the top of this post and only slightly less stupid.

Performance

Unsurprisingly, for all intents and purposes, each test performed approximately the same. I noticed that tabbing between the shell and my browser introduced sharp drops in performance, so a handful of these had to be re-run from the start. So, I suspect that the reductions in performance as seen for the inheritance and interface tests were likely introduced by system variability. Yes, inheritance should be slightly slower (more overhead) but it’s not substantially slower:

out-class-inheritance.php – 15407 req/s, 0.65s total
out-class-interface-inheritance.php – 14854 req/s, 0.674s total
out-class-interface.php – 14568 req/s, 0.688s total
out-class.php – 15367 req/s, 0.651s total
out-class-static.php – 15633 req/s, 0.64s total
out-function.php – 15146 req/s, 0.66s total

Are these results surprising? They shouldn’t be. This is a benchmark and benchmarks lie. My benchmark lies because it isn’t illustrative of real world use, and as such, these numbers mean absolutely nothing. Well, okay, my results are suggestive that if all things are equal, OOP versus functional design is a meaningless argument. But these results don’t matter because no one in their right mind is going to have an application that consists of a small single class or function. Likewise, no one’s going to run the same function or instantiate a class a few million times in their app and call it good. At least, I’d hope not.

What this does illustrate is that a benchmark can be manipulated to produce desired results, and they can be interpreted with greater variability than most religious texts. While I’d like to believe this benchmark demonstrates that there’s little difference between PHP language features, it isn’t exhaustive enough to measure the impact of specific design decisions. Nor should it, because PHP doesn’t exist in a vacuum and naive designs fail to account for realistic implementations.

The problem, then, is that benchmarks running a small sample of code a million times fail to account for the broader design of a full application, which might be comprised of a few dozen functions, classes, hundreds of queries and so forth. No one class is going to be run a million times for every hit (or it shouldn’t be), and neither will any one function. More importantly, as my benchmarks demonstrate, the real bottleneck is going to be network I/O, followed by disk, and with rare exceptions (and a distant third), the CPU. For dissenters, I would wager that you’re going to encounter network limitations well before any hypothetical “classes are slower than functions” condition is met. Moreover, because the PHP VM is rather slow, it won’t matter a great deal how your application is structured anyway. An application written in Go or C++ is going to perform several orders of magnitude faster than you PHP app. For that matter, an application written using Python, Gevent, and Gunicorn would also render your pet PHP constructs a rather embarrassingly distant last place. Save for frameworks written in C (Phalcon) or new clean-room PHP implementations (like HHVM), the entire debate over classes versus functions is a rather silly one, isn’t it?

Benchmarks are Relative

I can’t emphasize how fascinating and well designed the TechEmpower benchmarks are for illustrative relative performance among different languages and frameworks. That said, I must emphasize again that nothing exists in a vacuum. PHP classes or functions aren’t solitary constructs. They’re running in unison with a web server, a database engine of some sort, and possibly dozens of other platforms, each with a certain amount of overhead or latency, and even network topography can impact service behaviors in suboptimal ways.

What burns me the most is that there is this small population of noisy magpie-like developers pushing against efforts to improve PHP development (like the improvements in the PSR series) toward sensible standards. While this population is diminishing, I still see comments from time to time on sites like Hacker News that effectively blame OOP design for PHP’s comparatively poor performance on the web framework benchmark, particularly for frameworks like Symfony. Yet Facebook’s HHVM and, to a lesser extent, the Phalcon PHP framework have demonstrated that this need not be the case. Why is PHP plagued with this sort of nonsense? I have no idea.

Consider for a moment the benefits gleaned from the PSR standards like PSR-0 autoloading and the fairly recent explosion in the number of libraries that, thanks to Composer and Packagist, can be included in any project with little effort. Furthermore, clean design and implementation induces new momentum in the PHP community that will hopefully render spaghetti code like that in projects using archaic practices (vBulletin and IP.Board, I’m looking at you…) a forgotten memory. OOP isn’t a panacea, but when used correctly, it encourages code reuse and generally reduces the time spent on implementation. It’s disappointing that there are musings in the community about whether or not OOP practices have a place in PHP. If you don’t think OOP belongs in PHP, then I suspect you’ve missed that debate by about a decade and a half. Better late than never, right?

Fortunately, as Composer and Packagist have illustrated, such opposition to modern PHP design is constrained to a tiny and ever-dwindling population of the PHP community (I’d actually wager it’s mostly people have have merely dabbled in PHP and don’t currently write much code in it, if any). PHP has its warts, but modern PHP largely abolishes or mitigates some of the worst parts of the language. That isn’t to say PHP is a beautiful language (it’s not), but it doesn’t have to be beautiful to be useful. Perl taught us that lesson years ago.

I think I see an older gentleman in the back with a long gray beard laughing. See? He gets it.

No comments.

***