The Arch Linux Rant

This post is part one in a two-part series (part two can be read here). In this post, I’ll explore some of my thoughts and feelings about Arch from a fairly high-level vantage point–consider it a 50,000′ review. In the next post, I’ll explain why I feel comments like this one demonstrate a certain degree of naïveté, although it’s one of the more benign ones I’ve encountered in my short time partaking in the Arch community.

For those of you who know me or, at the very least, read my (mostly) pointless and fairly infrequent rants, you’ll probably recall my decision to leave Gentoo. Some eight or nine months later, I changed my mind and stuck with it for a while longer. That’s no longer the case, and as of January this year I permanently switched over to Arch Linux. I’ve been holding off on posting this rant primarily to take a “wait and see” approach to determine how well I’d do without Gentoo.

The short version is thus: I haven’t missed Gentoo one bit. I started about a year ago (maybe less), running Arch in a VM, tweaking settings, playing around with KDE 4.7-something, and spent quite some time familiarizing myself with the system. Coming from Gentoo and FreeBSD, the transition wasn’t that bad–I’m used to taking a greater hands-on approach with my systems than most people might be comfortable with–so ultimately, Arch felt like a natural fit.

Then I took the plunge.

For the interested, I should share a little bit of history. My initial foray into the world of Unix and Unix-like operating systems began around 2000ish (actually earlier, but it wasn’t until then that I started tinkering with them for my own purposes) with OpenBSD. I later transitioned to FreeBSD for a variety of reasons–mostly the improved performance and greater compatibility (at that time) with other software. It wasn’t until quite some time after that when I began using the ports collection; prior to then, I made a habit of configuring and compiling most of the software I used by hand. As my software library grew, so too did the time I needed to invest in updating it. I eventually began looking for other solutions, although I kept a FreeBSD system at home for a number of years thereafter.

Sometime later, around late 2003 or early 2004 (possibly as late as 2005), I began experimenting with Gentoo. While I tried other Linux distros, I found most of them too alien in contrast with my beloved *BSDs. Their package managers were strange and sometimes convoluted, and I had little idea how to properly install custom-configured software as I was prone to doing. The BSD way worked, but I knew it wasn’t optimal, since /usr/local was mostly unused on Linux or each distro had its own opinion about its preferred usage. Gentoo was a reasonable fit: Its well-documented nature, a ports-like collection (Portage), and its tendency to stay mostly out of the user’s way beckoned me to take a closer look. Better yet, though the system was compiled mostly from scratch, configuring individual options and customizing packages was simple and very BSD-like.

I’ll keep this as short as I can, but I’ll just say this: I ran Gentoo on my desktop and as a file server for a number of years, probably from 2005-2006 until just last year (2011). Indeed, in 2006, I made the switch to use Gentoo almost exclusively, even while I was going to university. Later that same year, my FreeBSD install on my file server was gone, victim to the spread of Gentoo. Don’t get me wrong: I still love FreeBSD–and Gentoo, too–but for home use FreeBSD didn’t seem cut out for what I wanted to do. Granted, I’d still run it in circumstances where I needed a stable, long-running platform as a web service or similar, but at that time, Ports often lagged behind Gentoo and sometimes the only way you could install software was to download and compile it yourself. Not that there’s anything wrong with that, of course. (N.B.: In 2010, the situation somewhat reversed: Ports was updated with newer software versions while Portage stagnated. Strange what a few years will do.)

So, the transition to Gentoo was made because everything I wanted was just an emerge away. Things remained like this for a long time, too, because Gentoo was at the bleeding edge, and the convenience of a rolling release plus fairly simple package management was seductive. I think that’s why I had a hard time adopting other distros. Frequent new releases of software allowed me to continue sampling what was coming down the turnpike and forced me to stay up to date on current software.

Gentoo was not without its warts, of course. I won’t go into detail here as I’ve linked to my thoughts on the matter earlier in this post–and they should be mostly up to date–but the problems that began as a slow trickle eventually turned into a torrent of disaster. Occasional dependency conflicts requiring manual intervention, lengthy X server compiles (and let’s not get into glibc or a desktop environment like KDE), the lack of package maintainers, and a slow, downward spiral into the abyss led me to look for alternatives. I didn’t want to part with the rolling-release model; indeed, once you’ve sampled such ambrosia, it’s difficult to settle for the meager fare of release-based distros. Around this time (late 2010, early 2011), I was participating fairly regularly in various Slashdot discussions and made a similar offhanded remark about my concerns with Gentoo, and someone suggested trying Arch Linux. So I did.

The thing that struck me the most about Arch was its simplicity. There’s very little that it does for you. System configuration using the standard init scripts is exceptionally simple and straightforward (mostly in /etc/rc.conf), and with the exception of the package manager and build system, Arch makes Gentoo look like an impenetrable fortress of automation. Within a few days of using Arch, I fell in love with the brilliantly minimalistic design.
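To give you an idea of just how little there is, nearly the entire core configuration of an initscripts-era Arch system lived in that one file. Here’s a sketch from memory–treat the values as illustrative rather than prescriptive:

# /etc/rc.conf (abridged)
HOSTNAME="archbox"
TIMEZONE="America/New_York"
KEYMAP="us"

# Static configuration for eth0
interface=eth0
address=192.168.0.2
netmask=255.255.255.0
gateway=192.168.0.1

# Background services, started in order at boot
DAEMONS=(syslog-ng network netfs crond sshd)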

Truth be told, there’s a lot about Arch to love and enjoy. It does take some time to get the system configured to your liking, but the installation process is mostly painless and fairly simple (bugs in the installer notwithstanding). Using an AUR helper like yaourt isn’t necessary, but it’s strongly recommended. It also helps to have at least a passing understanding of how makepkg and pacman work and interact. Of course, it isn’t necessary to use extras like the Arch Build System (ABS) and the Arch User Repository (AUR), but to ignore them is to do yourself a great disservice: There’s so much software available on the AUR that it makes Ubuntu’s impressive repositories and widespread support (through .debs) seem almost anemic.

However, I don’t think it’s possible to use Arch for very long before you’re drawn in by the allure of creating your own PKGBUILDs. I think that’s at least part of what I like most about Arch. Unlike Gentoo’s ebuilds, PKGBUILDs are simple. Armed with just a basic understanding of sh-like syntax (e.g. bash), a moderately skilled user otherwise unfamiliar with PKGBUILDs could put together a custom one in under an hour. More advanced users could piece together PKGBUILDs customizing their favorite software in an evening or two (mostly dependent upon how many packages they want). But here’s the trick: If your PKGBUILD fails for whatever reason, it’s unlikely to break your system save for inappropriate dependencies or “conflicts” statements. Since pacman (Arch’s package manager) examines the contents of packages, including those generated by custom PKGBUILDs, and determines where each file in the archive is to be placed, it takes an exceptionally stupid habit (usually using -f to force) to circumvent pacman’s all-seeing eye.
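For the curious, a PKGBUILD for a simple autotools-based project doesn’t amount to much more than the following sketch. The package name, version, and checksum are placeholders, so treat it as illustrative rather than canonical:

# Maintainer: Your Name <you@example.com>
pkgname=hello
pkgver=2.8
pkgrel=1
pkgdesc="The GNU hello example program"
arch=('i686' 'x86_64')
url="https://www.gnu.org/software/hello/"
license=('GPL3')
depends=('glibc')
source=("ftp://ftp.gnu.org/gnu/hello/hello-$pkgver.tar.gz")
md5sums=('SKIP') # placeholder; generate real sums with makepkg -g

build() {
  cd "$srcdir/$pkgname-$pkgver"
  ./configure --prefix=/usr
  make
}

package() {
  cd "$srcdir/$pkgname-$pkgver"
  make DESTDIR="$pkgdir" install
}

Run makepkg -s alongside it to build the package (resolving dependencies as it goes), then install the result with pacman -U.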

To illustrate: When Arch released GIMP 2.8, I was disappointed by some of the new features. The solution? Create a GIMP 2.6 package! (You can download the PKGBUILD here, but be sure to grab both the gegl and babl PKGBUILDs, too.) Since the Arch project provides all the appropriate PKGBUILDs for the official repos, it’s easy to find the one you want, download it, and then modify it to your liking. You don’t even have to deal with the headaches caused by portage overlays.

The astute reader may have noticed that I haven’t yet addressed the issue of binary package updates with regards to the core system and official repositories. There’s a reason for that. While having the binary packages available is nice, and it’s certainly better than spending two days compiling KDE, it wasn’t my primary reason for switching. Binary releases did impact my decision–and don’t get me wrong, I love being able to download just the updates I need without compiling anything but the few AUR packages I have–but it’s one of those conveniences that’s nice to have. I won’t begrudge compiling the system (or kernel) from scratch, because I spent so many years with stagnant Gentoo systems compiling off and on over a week or two. It would seem a tad bit hypocritical of me to complain about compile times. Besides, if you swing that way, you can do that with Arch, too.

I don’t mean to wax optimistic about the glory of Arch. It’s not without the occasional sharp edge, and there’s a plethora of things that will snag the naive user who hasn’t yet developed the healthy habit of thoroughly reading documentation. I certainly wouldn’t recommend it to someone who’s new to Linux in general (they should use Ubuntu or a similar distribution), but for others who are at least familiar with the shell and want a do-it-yourself distro, Arch is certainly worth a look.

Interestingly, and unexpectedly, Arch has made using Linux fun again. I’m reminded of the days when I first started getting into Gentoo after understanding its quirks and loving every minute of it. The difference, though, is that I spend so much time doing things in Arch and with Arch that I almost don’t do much else (not even games!). My Windows install is probably lonely; it might receive a boot once every three weeks for the occasional update or game that I can’t otherwise get working under Wine, but even then, there’s so much more to explore with Arch.

***

Shell Voodoo, Connected IPs, and Counting Total Connections

I’m posting this mostly as a note to myself, but if you, future visitor, stumble upon this post and have improvements or other things you’d like to share, be my guest. Comments that are overly critical of the methodologies provided by others, or that otherwise add nothing to the discussion, will be removed. This is especially true for those espousing beliefs that PowerShell is superior.

I won’t go into the exact details of why we needed to do this, but the general breakdown is thus:

  • Get a list of connected IP addresses
  • Sort them
  • Count how many connections were made from a single address

Fortunately, the solution turns out to be quite easy. For FreeBSD:

netstat -an -f inet | grep -v 127.0.0.1 | awk '{ print $5 }' | \
grep -E '.*([0-9]{1,4}\.)+.*' | sed 's/\(.*\)\..*/\1/' | \
sort -g -k 1 | uniq -c | sort -n -k 1

And for most derivatives of Linux:

netstat -anW --tcp --udp | grep -v 127.0.0.1 | awk '{ print $5 }' | \
grep --color=never -E '.*[0-9]{1,4}(\.|\:).*' | sed 's/\(.*\)\:.*/\1/' | \
sort -g -k 1 | uniq -c | sort -n -k 1

You may need to modprobe sctp to get the --tcp and --udp netstat flags working. Also, both of these should work with IPv6 addresses, too (though on FreeBSD you’d want -f inet6 rather than -f inet), which is why I’ve tried to keep the sed regex as simple as possible.
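Either way, the end result looks something like this (the addresses and counts here are invented for illustration):

      1 10.0.0.5
      3 192.168.1.20
     17 172.16.4.2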

What the Eff is This?!

Okay, I agree. I’ve probably made some kind of mistake somewhere; I don’t know awk or sed quite as well as I should (easily fixed, if I ever wanted to spend a weekend learning). That said, here’s my understanding of how this should work. First, we’ll deal with the FreeBSD variant, line by line:

FreeBSD

Here is a breakdown for the FreeBSD-specific stuff:

netstat -an -f inet | grep -v 127.0.0.1 | awk '{ print $5 }' | \

As with all platforms I’m aware of, -a shows all connections and -n shows them by their numerical addresses; without -n, netstat prefers to perform a reverse lookup on every address, and this can take some time. The FreeBSD-specific option -f inet restricts the output to IPv4 sockets (substitute -f inet6 for IPv6) and eliminates much of the cruft associated with local Unix domain sockets. Likewise, we trim localhost from the list with grep -v, and we fetch the 5th output column using awk.

grep -E '.*([0-9]{1,4}\.)+.*' | sed 's/\(.*\)\..*/\1/' | \

Moving on to the next line, we fetch only those lines that contain something that vaguely resembles an IP address with grep -E (I prefer to use -E here since it gives us the extended regex syntax), and we pass the results into sed to strip off the trailing remote host’s port number. Alternatively, you could use something like 's/^\([0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\).*/\1/' instead to filter out IPv4 addresses, but since we already know roughly what to expect from the input we can simplify our regex. Furthermore, we also know that the IP address of the remote host in FreeBSD will always have a dot followed by the port number appended, and we can naively remove this.

sort -g -k 1 | uniq -c | sort -n -k 1

Lastly, we sort the list generically (with -g), collapse the duplicate addresses with uniq -c–which prefixes each unique address with a count of how many times it appeared–and then sort numerically by that first column (now containing the count).

Linux

Here is a breakdown for the Linux-specific stuff:

netstat -anW --tcp --udp | grep -v 127.0.0.1 | awk '{ print $5 }' | \

Following in the footsteps of FreeBSD, we use -an to display all connected numeric addresses so we don’t waste time running reverse lookups. However, in most Linux distributions, lengthy columns–and especially IPv6 addresses–will be truncated in netstat’s output. To counter this, we use -W to show the wide listing, and we use --tcp and --udp to restrict the output to those protocols. You may need to modprobe sctp in order to get this to work; even if you can’t, this string of commands might still work. Lastly, we filter connections to localhost with grep -v, and we fetch the 5th column using awk. Easy enough, right?

grep --color=never -E '.*[0-9]{1,4}(\.|\:).*' | sed 's/\(.*\)\:.*/\1/' | \

In this next line, we use the extended regex feature of grep -E to keep only the lines that look somewhat address-y, and we separate the remote host’s address from its port using sed. In this case, Linux appends port numbers using a colon (:), so we have to deviate slightly from the FreeBSD example. Also, since some distros alias grep to grep --color=auto (or always), we use --color=never to avoid feeding ANSI color codes to sed.

sort -g -k 1 | uniq -c | sort -n -k 1

Lastly, we sort by the IP address using a generic sort (-g), collapse duplicate addresses while counting them with uniq -c, and then sort by the count column, which is now tacked onto the front.

Now we can get a fancy list of IP addresses, see how many connections each one is making to us, and sort the results accordingly! Tweaking the grep -v filter can re-introduce localhost or exclude other specific addresses that might not be of interest.
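For example, excluding another chatty-but-uninteresting host from the Linux variant is just a matter of extending that first grep into an alternation (the second address here is hypothetical):

netstat -anW --tcp --udp | grep -vE '127\.0\.0\.1|192\.168\.0\.10' | awk '{ print $5 }' | \
grep --color=never -E '.*[0-9]{1,4}(\.|\:).*' | sed 's/\(.*\)\:.*/\1/' | \
sort -g -k 1 | uniq -c | sort -n -k 1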

***

PHP, Unicode, and MySQL Character Sets

This post could be subtitled: “When importing your old data breaks your character encoding,” but even that doesn’t quite capture the frustration felt when unexpected UTF-8 (or UTF-16) characters are scattered throughout your data.

Historically, PHP and MySQL have shared mutually beneficial positions within the web services ecosystem. It’s no surprise, then, that the two have more or less evolved together, benefiting from each other’s success in what those with a background in the natural sciences might consider a symbiotic relationship. To say that the two began life in a world where latin1 (ISO-8859-1 and its more location-specific derivatives or “code pages”) was the de facto standard encoding might be an understatement, but it also conveniently ignores a little piece of history. Things change, people change, and I suppose it could be said that sometimes standards change. Or, more correctly, they’re superseded by better standards, cast away into the forgotten reaches of history by the marauding armies of deprecation. This realm is also periodically referred to in technical parlance as the IETF archives.

Sometimes, but not always, old and clunky standards linger on where they refuse to die, because doing things the right way is actually quite difficult. Not to mention that if the old way of doing something has always worked, it usually has enough managerial or influential inertia to carry it on into a future that’s very different from what its developers envisioned.

Many years ago, PHP6 was announced as a planned upgrade path from PHP5.x. Many promises were made: Namespaces, closures, and Unicode support (in the form of UTF-16), to name a few. But the process stalled as developers were bogged down by the difficulty of converting everything to Unicode. Namespaces and closures were eventually migrated into the PHP5.3 branch, and it seemed that language-level Unicode support would have to wait. It also didn’t help that many users complained loudly about potential breakage; alas, sacrifices must occasionally be made when moving forward, and in our industry, it’s often the users themselves who most fiercely resist change. Admittedly, the entrenchment of PHP in the web services sector probably hasn’t helped much to this end…

The metrics are nevertheless quite promising in spite of the delays. As of this writing, the statistics indicate that the core conversion of PHP to Unicode is about 70% complete (accessed May 2nd, 2012). Of course, what progress has been made–if it’s anything short of complete–is of little use to those who have an immediate need for complete Unicode support. For others, like myself, Unicode is a nice-to-have, but for a majority of the work-related data I’ve seen, it’s a matter of dealing with systems that were written years ago by naive developers.

Continuing the story: MySQL eventually added mostly-working support for UTF-8. I say mostly because it, like everything that came before it, suffered from occasional breakage and the weirdness one might expect in a system that saw such a drastic change. However, even in the early days it worked well for systems that were careful not to mangle incoming Unicode and for those that were properly configured. Indeed, a fresh install of WordPress will configure tables to use UTF-8 as the default character set (assuming your server supports it, of course!), and if you’re careful, you can muck with incoming data via PHP without disturbing anything important. Neither UTF-8, PHP, nor MySQL is fundamentally the problem; character set conversion, however, is.
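As an aside, if you ever need to see which encodings your server and client have agreed upon, the quickest sanity check I know of is to ask MySQL directly (credentials hypothetical):

mysql -u user -p -e "SHOW VARIABLES LIKE 'character_set%';"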

With one of our latest site conversions, we noticed a flurry of unusual artifacts scattered throughout most of the articles we imported into WordPress from a CMS that shall remain unnamed. They ranged from “Ä”, which was supposed to be the UTF-8 non-breaking space, to “’”, which was supposed to be a stylized right-apostrophe (’), and they appeared in almost every article. Evidently, the original site owners had interesting copy/paste habits (they used Word), but the site worked well for them, and I can’t bring myself to judge anyone for putting together a system that worked for their specific needs.

I researched character encodings for a couple of days, usually between fixing other things I had broken (oops!), and couldn’t come up with a definitive solution. I’m pretty sure I tried just about everything I could think of, including some unorthodox suggestions like examining the bin2hex() output, running an str_replace() on it, and then re-assembling it back into a binary string. But while an unorthodox–and unnecessarily complex–process might work, it isn’t often the right way to do something. Heck, I even tried various incantations of iconv and mbstring‘s conversion functions before giving up, left with the impression that whatever dark magics I thought I had once possessed regarding UTF-8 mangling had long since wafted away, not unlike the wisps of smoke from a witch doctor’s extinguished incense burner.

After puzzling over the matter for a few hours the following morning, an epiphany struck. The answer needn’t be so complex! Instead, perhaps it was the encoding PHP was using when it connected to the MySQL server that was the source of the problem: It was when I pulled the data from the old database and saved it into the UTF-8-encoded one that it would get mangled. There was a character translation going on somewhere, and I was certain this was it.

To understand the confusion that led up to my frantic Google searches and the discovery of puffer-fish-extract-like medicinal remedies such as running str_replace() on bin2hex() output, a little bit of knowledge regarding the MySQL tool set is helpful. First, two of the most common tools I use when importing test data into a database are mysqldump and the standard mysql client (for restoration). Interestingly, one only needs to reach as far as the man page for either utility to discover the source of the problem:

--default-character-set=charset_name

Use charset_name as the default character set. See Section 9.5, “Character Set Configuration”. If no character set is specified, mysqldump uses utf8, and earlier versions use latin1.

In retrospect, I should have considered this first. The old database was configured to use latin1 on all of its tables, and it undoubtedly survived through many early MySQL versions. The new database was set up on a very recent version of MySQL, as was my workstation’s test database, and was therefore using UTF-8 as the default character set. The only difference between the two was that my workstation had everything configured to use UTF-8–the server, however, did not. Contributing to the confusion was the behavior of the import: When I would run a mysqldump on one of the data tables and import it into my workstation’s database, I noticed three extra characters in place of the apostrophes. Yet the server only displayed a single unexpected character.

Something was afoot.

When I discovered the encoding defaults in MySQL’s tool chain, it occurred to me that I should have exported the source data as latin1 from the old database and imported it as latin1 into my test harness–something along these lines (database names, credentials, and file paths hypothetical):
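mysqldump --default-character-set=latin1 -u user -p legacy_cms > dump.sql
mysql --default-character-set=latin1 -u user -p test_db < dump.sql

It worked, or at least it looked like it worked, and then I ran my import utility.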

…and it broke again.

Somewhat furious that I couldn’t quite figure out what the solution was, I paused for a moment to reflect on what happened. Then it occurred to me that PHP was probably using whatever default the MySQL driver was configured to use–namely, UTF-8. I added a line to my importer (using PDO):

$exportDb->query('SET CHARACTER SET latin1');
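For context, that line belongs immediately after the connection to the source database is established–a minimal sketch, with the host, credentials, and database name being hypothetical:

<?php
// Connect to the legacy latin1 database (connection details hypothetical).
$exportDb = new PDO('mysql:host=localhost;dbname=legacy_cms', 'user', 'secret');
// Ask MySQL to hand us data in latin1 rather than the driver's default.
$exportDb->query('SET CHARACTER SET latin1');
// Anything fetched from $exportDb now arrives in its original encoding.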

Then I re-ran my importer, waited for some sample data to complete, and checked the article. It worked. Oh, how gloriously it worked! Perhaps, I mused, the simplicity of the solution was at least partially to blame for its evasiveness.

Over-analysis, particularly in IT, is a hurdle that can only be avoided or straddled with sufficient experience. Oftentimes, overcoming our tendency to over-think a problem can only come from a lifetime of experiences that work together to undo all of the nonsense models we’ve established in our heads, and that’s why I sometimes feel so astounded whenever someone’s kid makes a profoundly deep, yet exceptionally simple observation. You know the sort: You’re left, probably in shock, thinking about what a damn genius that little bastard is going to be. But then, years down the road, that little genius goes to school, is taught “critical thinking skills,” goes into IT, and then sits up late one night writing a blog post about what a damn genius someone else’s kid is for thinking outside the box.

Maybe one of the lessons I should have taken away from this is that the best solution is often to take a few steps back and let the obvious fixes (at the time) flutter away on the wind. There’s usually an easier, better way to do it. Unfortunately, seeing that takes practice.

The short version can be condensed into the following: First, most of the time, converting one database to another will work out of the box, and you won’t need to do anything more. Second, sometimes even with custom tools, database conversions will go smoothly even when dealing with different character sets. But, finally, sometimes you’ll encounter a data set that’s just old enough to be a small thorn in your side because it happened to persist through a major architectural change in whatever system it calls home. If that’s the case, don’t listen to anyone else–and especially don’t try mangling your data, because you’ll only make it worse!–check your client and server encodings and alter them accordingly. If you’re dumping data from a set of tables that use latin1, make sure your export tools also dump that same data in latin1; if you’re using mysql or mysqldump, that means using the --default-character-set option, and if you’re using PHP directly, configure the database driver accordingly. If you’re importing UTF-8 characters that were originally (or accidentally) stored in latin1, don’t panic. As long as you make sure to pull that source data in latin1 (or whatever encoding it was originally stored in), you should be OK: The conversion–or pass-through–to UTF-8 at the destination can typically occur without anything being lost.

You don’t even have to whisper arcane incantations to yourself at night.

***