Distributed Version Control is here to stay, baby

A while ago Jeff and I had Eric Sink on the Stack Overflow Podcast, and we were yammering on about version control, especially the trendy new distributed version control systems, like Mercurial and Git.

In that podcast, I said, “To me, the fact that they make branching and merging easier just means that your coworkers are more likely to branch and merge, and you’re more likely to be confused.”


Well, you know, that podcast is not prepared carefully in advance; it’s just a couple of people shooting the breeze. So what usually happens is that we say things that are, to use the technical term, wrong. Usually they are wrong either in details or in spirit, or in details and in spirit, but this time, I was just plain wrong. Like strawberry pizza. Or jalapeño bagels. WRONG.

Long before this podcast occurred, my team had switched to Mercurial, and the switch really confused me, so I hired someone to check in code for me (just kidding). I did struggle along for a while by memorizing a few key commands, imagining that they were working just like Subversion, but when something didn’t go the way it would have with Subversion, I got confused, and would pretty much just have to run down the hall to get Benjamin or Jacob to help.

And then my team said, hey you know what? This Mercurial bug-juice is really amazing, we want to actually make a code review product that works with it, and, and, what’s more, we think that there’s a big market providing commercial support and hosting for it (Mercurial itself is freely available under GPL, but a lot of corporations want some kind of support before they’ll use something).

And I thought, what do I know? But as you know I don’t really make the decisions around here, because “management is a support function,” so they took all the interns, all six of them, and set off to build a product around Mercurial.

I decided I better figure out what the heck is going on with this “distributed version control” stuff before somebody asks me a question about the products that my company allegedly sells, and I don’t have an answer, and somebody in the blogo-“sphere” writes another article about me jumping the shark.

And I studied, and studied, and finally figured something out. Which I want to share with you.

With distributed version control, the distributed part is actually not the most interesting part.

The interesting part is that these systems think in terms of changes, not in terms of versions.

That’s a very zen-like thing to say, I know. Traditional version control thinks: OK, I have version 1. And now I have version 2. And now I have version 3.

And distributed version control thinks, I had nothing. And then I got these changes. And then I got these other changes.

It’s a different Program Model, so the user model has to change.

In Subversion, you might think, “bring my version up to date with the main version” or “go back to the previous version.”

In Mercurial, you think, “get me Jacob’s change set” or “let’s just forget that change set.”

If you come at Mercurial with a Subversion mindset, things will almost work, but when they don’t, you’ll be confused, unhappy, and unsuccessful, and you’ll hate Mercurial.

Whereas if you free your mind and reimagine version control, and grok the zen of the difference between thinking about managing the versions vs. thinking about managing the changes, you’ll become enlightened and happy and realize that this is the way version control was meant to work.

I know, it’s strange… since 1972 everyone was thinking that we were manipulating versions, but, it turned out, surprisingly, that thinking about the changes themselves as first class solved a very important problem: the problem of merging branched code.

And here is the most important point, indeed, the most important thing that we’ve learned about developer productivity in a decade. It’s so important that it merits a place as the very last opinion piece that I write, so if you only remember one thing, remember this:

When you manage changes instead of managing versions, merging works better, and therefore, you can branch any time your organizational goals require it, because merging back will be a piece of cake.

I can’t tell you how many Subversion users have told me the following story: “We tried to branch our code, and that worked fine. But when it came time to merge back, it was a complete nightmare and we had to practically reapply every change by hand, and we swore never again and we developed a new way of developing software using if statements instead of branches.”

Sometimes they’re even kind of proud of this new, single-trunk invention of theirs. As if it’s a virtue to work around the fact that your version control tool is not doing what it’s meant to do.

With distributed version control, merges are easy and work fine. So you can actually have a stable branch and a development branch, or create long-lived branches for your QA team where they test things before deployment, or you can create short-lived branches to try out new ideas and see how they work.

This is too important to miss out on. This is possibly the biggest advance in software development technology in the ten years I’ve been writing articles here.

Or, to put it another way, I’d go back to C++ before I gave up on Mercurial.

If you are using Subversion, stop it. Just stop. Subversion = Leeches. Mercurial and Git = Antibiotics. We have better technology now.

Because so many people dive into Mercurial without fully understanding the new program model, which can leave them thinking that it’s broken and malicious, I wrote a Mercurial tutorial, HgInit.

Today, when people ask me about that podcast where I dissed DVCS, I tell them that it was just a very carefully planned fake-out of my long time friend and competitor Eric Sink, who makes a non-distributed version control system. Like that time he started selling bug-tracking software, and, to punish him, we sent him a very expensive Fog Creek backpack with a fake form letter that made it look like we were doing so well that expensive backpacks were the standard Christmas gift we were sending every FogBugz customer.

I seem to have run out the clock on this site. It has been an extreme honor to have you reading my essays over the last ten years. I couldn’t ask for a greater group of readers. Whether you’re one of the hundreds of people who volunteered their time to translate articles into over 40 languages, or one of the 22,894 people who have taken the time to send me an email, or the 50,838 people who subscribed to the email newsletter, or the 2,262,348 people per year who visited the website and read some of the 1067 articles I’ve written, I sincerely thank you for your attention.

Why are the Microsoft Office file formats so complicated? (And some workarounds)

Last week, Microsoft published the binary file formats for Office. These formats appear to be almost completely insane. The Excel 97-2003 file format is a 349 page PDF file. But wait, that’s not all there is to it! This document includes the following interesting comment:

Each Excel workbook is stored in a compound file.

You see, Excel 97-2003 files are OLE compound documents, which are, essentially, file systems inside a single file. These are sufficiently complicated that you have to read another 9 page spec to figure that out. And these “specs” look more like C data structures than what we traditionally think of as a spec. It’s a whole hierarchical file system.

If you started reading these documents with the hope of spending a weekend writing some spiffy code that imports Word documents into your blog system, or creates Excel-formatted spreadsheets with your personal finance data, the complexity and length of the spec probably cured you of that desire pretty darn quickly. A normal programmer would conclude that Office’s binary file formats:

  • are deliberately obfuscated
  • are the product of a demented Borg mind
  • were created by insanely bad programmers
  • and are impossible to read or create correctly.

You’d be wrong on all four counts. With a little bit of digging, I’ll show you how those file formats got so unbelievably complicated, why it doesn’t reflect bad programming on Microsoft’s part, and what you can do to work around it.

The first thing to understand is that the binary file formats were designed with very different design goals than, say, HTML.

They were designed to be fast on very old computers. For the early versions of Excel for Windows, 1 MB of RAM was a reasonable amount of memory, and an 80386 at 20 MHz had to be able to run Excel comfortably. There are a lot of optimizations in the file formats that are intended to make opening and saving files much faster:

  • These are binary formats, so loading a record is usually a matter of just copying (blitting) a range of bytes from disk to memory, where you end up with a C data structure you can use. There’s no lexing or parsing involved in loading a file. Lexing and parsing are orders of magnitude slower than blitting.
  • The file format is contorted, where necessary, to make common operations fast. For example, Excel 95 and 97 have something called “Simple Save” which they use sometimes as a faster variation on the OLE compound document format, which just wasn’t fast enough for mainstream use. Word had something called Fast Save. To save a long document quickly, 14 out of 15 times, only the changes are appended to the end of the file, instead of rewriting the whole document from scratch. On the hard drives of the day, this meant saving a long document took one second instead of thirty. (It also meant that deleted data in a document was still in the file. This turned out to be not what people wanted.)
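To make the blitting point a little more concrete, here is a rough JavaScript sketch. The 12-byte record layout and the names writeCell, readCell, and parseCellText are all invented for illustration; this is not the real Excel layout. The point is only that the binary loader reads fields at fixed offsets, while the text loader has to split and convert characters first.

```javascript
// Hypothetical fixed-layout record (an assumption, not the real Excel format):
//   bytes 0-1  = row    (uint16, little-endian)
//   bytes 2-3  = column (uint16, little-endian)
//   bytes 4-11 = value  (float64, little-endian)
function writeCell(row, col, value) {
    var buf = new ArrayBuffer(12);
    var view = new DataView(buf);
    view.setUint16(0, row, true);
    view.setUint16(2, col, true);
    view.setFloat64(4, value, true);
    return buf;
}

// "Blitting" style load: every field sits at a known offset, so reading
// the record is a fixed number of memory reads with no character scanning.
function readCell(buf) {
    var view = new DataView(buf);
    return {
        row: view.getUint16(0, true),
        col: view.getUint16(2, true),
        value: view.getFloat64(4, true)
    };
}

// Text style load, for comparison: the same cell written as "row,col,value"
// has to be split into tokens, and each token converted to a number.
function parseCellText(line) {
    var parts = line.split(",");
    return {
        row: parseInt(parts[0], 10),
        col: parseInt(parts[1], 10),
        value: parseFloat(parts[2])
    };
}

var fromBinary = readCell(writeCell(3, 7, 19.95));
var fromText = parseCellText("3,7,19.95");
console.log(fromBinary, fromText);
```

In C the binary case is even more direct, since the bytes can be copied straight into a struct. The flip side is that anyone reading the file later has to know that layout exactly.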

They were designed to use libraries. If you wanted to write a from-scratch binary importer, you’d have to support things like the Windows Metafile Format (for drawing things) and OLE Compound Storage. If you’re running on Windows, there’s library support for these that makes it trivial… using these features was a shortcut for the Microsoft team. But if you’re writing everything on your own from scratch, you have to do all that work yourself.

Office has extensive support for compound documents, for example, you can embed a spreadsheet in a Word document. A perfect Word file format parser would also have to be able to do something intelligent with the embedded spreadsheet.

They were not designed with interoperability in mind. The assumption, and a fairly reasonable one at the time, was that the Word file format only had to be read and written by Word. That means that whenever a programmer on the Word team had to make a decision about how to change the file format, the only thing they cared about was (a) what was fast and (b) what took the fewest lines of code in the Word code base. The idea of things like SGML and HTML—interchangeable, standardized file formats—didn’t really take hold until the Internet made it practical to interchange documents in the first place; this was a decade later than the Office binary formats were first invented. There was always an assumption that you could use importers and exporters to exchange documents. In fact Word does have a format designed for easy interchange, called RTF, which has been there almost since the beginning. It’s still 100% supported.

They have to reflect all the complexity of the applications. Every checkbox, every formatting option, and every feature in Microsoft Office has to be represented in file formats somewhere. That checkbox in Word’s paragraph menu called “Keep With Next” that causes a paragraph to be moved to the next page if necessary so that it’s on the same page as the paragraph after it? That has to be in the file format. And that means if you want to implement a perfect Word clone that can correctly read Word documents, you have to implement that feature. If you’re creating a competitive word processor that has to load Word documents, it may only take you a minute to write the code to load that bit from the file format, but it might take you weeks to change your page layout algorithm to accommodate it. If you don’t, customers will open their Word files in your clone and all the pages will be messed up.

They have to reflect the history of the applications. A lot of the complexities in these file formats reflect features that are old, complicated, unloved, and rarely used. They’re still in the file format for backwards compatibility, and because it doesn’t cost anything for Microsoft to leave the code around. But if you really want to do a thorough and complete job of parsing and writing these file formats, you have to redo all that work that some intern did at Microsoft 15 years ago. The bottom line is that there are thousands of developer years of work that went into the current versions of Word and Excel, and if you really want to clone those applications completely, you’re going to have to do thousands of years of work. A file format is just a concise summary of all the features an application supports.

Just for kicks, let’s look at one tiny example in depth. An Excel worksheet is a bunch of BIFF records of different types. I want to look at the very first BIFF record in the spec. It’s a record called 1904.

The Excel file format specification is remarkably obscure about this. It just says that the 1904 record indicates “if the 1904 date system is used.” Ah. A classic piece of useless specification. If you were a developer working with the Excel file format, and you found this in the file format specification, you might be justified in concluding that Microsoft is hiding something. This piece of information does not give you enough information. You also need some outside knowledge, which I’ll fill you in on now. There are two kinds of Excel worksheets: those where the epoch for dates is 1/1/1900 (with a leap-year bug deliberately created for 1-2-3 compatibility that is too boring to describe here), and those where the epoch for dates is 1/1/1904. Excel supports both because the first version of Excel, for the Mac, just used that operating system’s epoch because that was easy, but Excel for Windows had to be able to import 1-2-3 files, which used 1/1/1900 for the epoch. It’s enough to bring you to tears. At no point in history did a programmer ever not do the right thing, but there you have it.

Both 1900 and 1904 file types are commonly found in the wild, usually depending on whether the file originated on Windows or Mac. Converting from one to another silently can cause data integrity errors, so Excel won’t change the file type for you. To parse Excel files you have to handle both. That’s not just a matter of loading this bit from the file. It means you have to rewrite all of your date display and parsing code to handle both epochs. That would take several days to implement, I think.
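Just to show how much hides behind that one bit, here is a minimal sketch of turning a date serial number into a calendar date under each system. The function name excelSerialToDate is my own, and the sketch assumes whole-day serials and ignores times and time zones. It treats serial 1 as January 1, 1900 in the 1900 system (with the phantom February 29, 1900 sitting at serial 60) and serial 0 as January 1, 1904 in the 1904 system.

```javascript
var MS_PER_DAY = 24 * 60 * 60 * 1000;

// Convert a whole-day serial number to a JavaScript Date (UTC midnight).
// uses1904 corresponds to the flag stored in the 1904 record.
function excelSerialToDate(serial, uses1904) {
    if (uses1904) {
        // 1904 system: serial 0 is 1 January 1904.
        return new Date(Date.UTC(1904, 0, 1) + serial * MS_PER_DAY);
    }
    // 1900 system: serial 1 is 1 January 1900. Serial 60 is the
    // nonexistent 29 February 1900, so every later serial is one day
    // "too big" and has to be pulled back by one.
    var adjusted = serial > 60 ? serial - 1 : serial;
    return new Date(Date.UTC(1899, 11, 31) + adjusted * MS_PER_DAY);
}

// The same calendar day has different serials in the two systems;
// the two numbers differ by 1462 days.
var in1900 = excelSerialToDate(25569, false);
var in1904 = excelSerialToDate(24107, true);
console.log(in1900.toISOString(), in1904.toISOString());
```

Serial 60 itself has no real date to map to; this sketch quietly turns it into March 1, 1900, which is one of several defensible choices. And this is only the conversion, not the display and parsing code that has to sit on top of it.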

Indeed, as you work on your Excel clone, you’ll discover all kinds of subtle details about date handling. When does Excel convert numbers to dates? How does the formatting work? Why is 1/31 interpreted as January 31 of this year, while 1/50 is interpreted as January 1st, 1950? All of these subtle bits of behavior cannot be fully documented without writing a document that has the same amount of information as the Excel source code.

And this is only the first of hundreds of BIFF records you have to handle, and one of the simplest. Most of them are complicated enough to reduce a grown programmer to tears.

The only possible conclusion is this. It’s very helpful of Microsoft to release the file formats for Microsoft Office, but it’s not really going to make it any easier to import or save to the Office file formats. These are insanely complex and rich applications, and you can’t just implement the most popular 20% and expect 80% of the people to be happy. The binary file specification is, at most, going to save you a few minutes reverse engineering a remarkably complex system.

OK, I promised some workarounds. The good news is that for almost all common applications, trying to read or write the Office binary file formats is the wrong decision. There are two major alternatives you should seriously consider: letting Office do the work, or using file formats that are easier to write.

Let Office do the heavy work for you. Word and Excel have extremely complete object models, available via COM Automation, which allow you to programmatically do anything. In many situations, you are better off reusing the code inside Office rather than trying to reimplement it. Here are a few examples.

  1. You have a web-based application that needs to output existing Word files in PDF format. Here’s how I would implement that: a few lines of Word VBA code loads a file and saves it as a PDF using the built in PDF exporter in Word 2007. You can call this code directly, even from ASP or ASP.NET code running under IIS. It’ll work. The first time you launch Word it’ll take a few seconds. The second time, Word will be kept in memory by the COM subsystem for a few minutes in case you need it again. It’s fast enough for a reasonable web-based application.
  2. Same as above, but your web hosting environment is Linux. Buy one Windows 2003 server, install a fully licensed copy of Word on it, and build a little web service that does the work. Half a day of work with C# and ASP.NET.
  3. Same as above, but you need to scale. Throw a load balancer in front of any number of boxes that you built in step 2. No code required.

This kind of approach would work for all kinds of common Office types of applications you might perform on your server. For example:

  • Opening an Excel workbook, storing some data in input cells, recalculating, and pulling some results out of output cells
  • Using Excel to generate charts in GIF format
  • Pulling just about any kind of information out of any kind of Excel worksheet without spending a minute thinking about file formats
  • Converting Excel file formats to CSV tabular data (another approach is to use Excel ODBC drivers to suck data out using SQL queries).
  • Editing Word documents
  • Filling out Word forms
  • Converting files between any of the many file formats supported by Office (there are importers for dozens of word processor and spreadsheet formats)

In all of these cases, there are ways to tell the Office objects that they’re not running interactively, so they shouldn’t bother updating the screen and they shouldn’t prompt for user input. By the way, if you go this route, there are a few gotchas, and it’s not officially supported by Microsoft, so read their knowledge base article before you get started.

Use a simpler format for writing files. If you merely have to produce Office documents programmatically, there’s almost always a better format than the Office binary formats that you can use which Word and Excel will open happily, without missing a beat.

  • If you simply have to produce tabular data for use in Excel, consider CSV.
  • If you really need worksheet calculation features that CSV doesn’t support, the WK1 format (Lotus 1-2-3) is a heck of a lot simpler than Excel, and Excel will open it fine.
  • If you really, really have to generate native Excel files, find an extremely old version of Excel… Excel 3.0 is a good choice, before all the compound document stuff, and save a minimum file containing only the exact features you want to use. Use this file to see the exact minimum BIFF records that you have to output and just focus on that part of the spec.
  • For Word documents, consider writing HTML. Word will open those fine, too.
  • If you really want to generate fancy formatted Word documents, your best bet is to create an RTF document. Everything that Word can do can be expressed in RTF, but it’s a text format, not binary, so you can change things in the RTF document and it’ll still work. You can create a nicely formatted document with placeholders in Word, save as RTF, and then using simple text substitution, replace the placeholders on the fly. Now you have an RTF document that every version of Word will open happily.
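The RTF placeholder trick might look something like the sketch below. The %%NAME%% placeholder syntax and the function names escapeRtf and fillRtfTemplate are arbitrary choices for this example, and the tiny template string stands in for a file you would really save from Word. The one thing you can’t skip is escaping backslashes and braces in the substituted text, since those characters have special meaning in RTF.

```javascript
// Escape the characters that are special in RTF: backslash and braces.
// (Non-ASCII text needs further handling, which this sketch leaves out.)
function escapeRtf(text) {
    return text.replace(/[\\{}]/g, function (ch) { return "\\" + ch; });
}

// Replace each %%KEY%% placeholder with its escaped value.
// Placeholders with no matching value are left untouched.
function fillRtfTemplate(template, values) {
    return template.replace(/%%([A-Z_]+)%%/g, function (match, key) {
        return Object.prototype.hasOwnProperty.call(values, key)
            ? escapeRtf(values[key])
            : match;
    });
}

// A tiny stand-in for a template that would normally be saved from Word.
var template = "{\\rtf1\\ansi Dear %%NAME%%,\\par Your balance is %%BALANCE%%.\\par}";

var letter = fillRtfTemplate(template, { NAME: "Pat {VIP}", BALANCE: "$42.00" });
console.log(letter);
```

One thing to check with a real template: Word may break a run of text into several pieces when it saves, so it’s worth opening the saved RTF in a text editor to confirm each placeholder survived in one piece.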

Anyway, unless you’re literally trying to create a competitor to Office that can read and write all Office files perfectly, in which case, you’ve got thousands of years of work cut out for you, chances are that reading or writing the Office binary formats is the most labor intensive way to solve whatever problem it is that you’re trying to solve.

Can Your Programming Language Do This?

One day, you’re browsing through your code, and you notice two big blocks that look almost exactly the same. In fact, they’re exactly the same, except that one block refers to “Spaghetti” and one block refers to “Chocolate Moose.”

 // A trivial example: 

alert("I'd like some Spaghetti!"); 
alert("I'd like some Chocolate Moose!");

These examples happen to be in JavaScript, but even if you don’t know JavaScript, you should be able to follow along.

The repeated code looks wrong, of course, so you create a function:

function SwedishChef( food ) 
{
    alert("I'd like some " + food + "!"); 
} 

SwedishChef("Spaghetti"); 
SwedishChef("Chocolate Moose");


OK, it’s a trivial example, but you can imagine a more substantial example. This is better code for many reasons, all of which you’ve heard a million times. Maintainability, Readability, Abstraction = Good!

Now you notice two other blocks of code which look almost the same, except that one of them keeps calling this function called BoomBoom and the other one keeps calling this function called PutInPot. Other than that, the code is pretty much the same.

alert("get the lobster"); 
PutInPot("lobster"); 
PutInPot("water"); 

alert("get the chicken"); 
BoomBoom("chicken"); 
BoomBoom("coconut");

Now you need a way to pass an argument to the function which itself is a function. This is an important capability, because it increases the chances that you’ll be able to find common code that can be stashed away in a function.

function Cook( i1, i2, f ) 
{ 
   alert("get the " + i1); 
   f(i1); 
   f(i2); 
} 

Cook( "lobster", "water", PutInPot ); 
Cook( "chicken", "coconut", BoomBoom );

Look! We’re passing in a function as an argument.

Can your language do this?

Wait… suppose you haven’t already defined the functions PutInPot or BoomBoom. Wouldn’t it be nice if you could just write them inline instead of declaring them elsewhere?

Cook( "lobster",
      "water",
      function(x) { alert("pot " + x); } );

Cook( "chicken",
      "coconut",
      function(x) { alert("boom " + x); } );

Jeez, that is handy. Notice that I’m creating a function there on the fly, not even bothering to name it, just picking it up by its ears and tossing it into a function.

As soon as you start thinking in terms of anonymous functions as arguments, you might notice code all over the place that, say, does something to every element of an array.

var a = [1,2,3];

// double every element
for (var i = 0; i < a.length; i++)
{
    a[i] = a[i] * 2;
}

// show every element
for (var i = 0; i < a.length; i++)
{
    alert(a[i]);
}

Doing something to every element of an array is pretty common, and you can write a function that does it for you:

// Replaces each element of a, in place, with the result of fn.
function map(fn, a)
{
    for (var i = 0; i < a.length; i++)
    {
        a[i] = fn(a[i]);
    }
}

Now you can rewrite the code above as:

map( function(x){return x*2;}, a ); 
map( alert, a );

Another common thing with arrays is to combine all the values of the array in some way.

// Adds up the numbers in a.
function sum(a)
{
    var s = 0;
    for (var i = 0; i < a.length; i++)
        s += a[i];

    return s;
}

// Concatenates the strings in a.
function join(a)
{
    var s = "";
    for (var i = 0; i < a.length; i++)
        s += a[i];

    return s;
}

alert(sum([1,2,3])); 
alert(join(["a","b","c"]));

sum and join look so similar, you might want to abstract out their essence into a generic function that combines elements of an array into a single value:

// Combines the elements of a into one value, starting from init.
function reduce(fn, a, init)
{
    var s = init;
    for (var i = 0; i < a.length; i++)
        s = fn( s, a[i] );

    return s;
}

function sum(a) 
{ 
    return reduce( function(a, b){ return a + b; },  
                   a, 0 ); 
} 

function join(a) 
{ 
    return reduce( function(a, b){ return a + b; }, 
                   a, "" ); 
}
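Once reduce exists, other “combine everything” functions tend to fall out of it for free. Here is a small sketch; product, max, and count are my own examples rather than anything from the code above, and reduce is repeated so the snippet runs on its own.

```javascript
// Same idea as the reduce above, repeated here so this runs standalone.
function reduce(fn, a, init) {
    var s = init;
    for (var i = 0; i < a.length; i++) {
        s = fn(s, a[i]);
    }
    return s;
}

// Multiply all the elements together.
function product(a) {
    return reduce(function (x, y) { return x * y; }, a, 1);
}

// Largest element; -Infinity is the starting value so any number beats it.
function max(a) {
    return reduce(function (x, y) { return y > x ? y : x; }, a, -Infinity);
}

// Number of elements, ignoring what they actually are.
function count(a) {
    return reduce(function (n, item) { return n + 1; }, a, 0);
}

console.log(product([2, 3, 4]), max([3, 9, 2]), count(["a", "b"]));
```

Each new function is one line of real logic. The loop lives in exactly one place.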

Many older languages simply had no way to do this kind of stuff. Other languages let you do it, but it’s hard (for example, C has function pointers, but you have to declare and define the function somewhere else). Object-oriented programming languages aren’t completely convinced that you should be allowed to do anything with functions.

Java required you to create a whole object with a single method called a functor if you wanted to treat a function like a first class object. Combine that with the fact that many OO languages want you to create a whole file for each class, and it gets really klunky fast. If your programming language requires you to use functors, you’re not getting all the benefits of a modern programming environment. See if you can get some of your money back.

How much benefit do you really get out of writing itty bitty functions that do nothing more than iterate through an array doing something to each element?

Well, let’s go back to that map function. When you need to do something to every element in an array in turn, the truth is, it probably doesn’t matter what order you do them in. You can run through the array forward or backwards and get the same result, right? In fact, if you have two CPUs handy, maybe you could write some code to have each CPU do half of the elements, and suddenly map is twice as fast.
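Here is a small sketch of that claim. Unlike the map above, these variants return a new array instead of changing the original, which makes the results easy to compare; the names mapForward, mapBackward, and mapInHalves are mine. The “two CPUs” version only splits the work into two halves and runs them one after the other, so it shows that the split is safe, not that it is faster.

```javascript
// Walk the array front to back.
function mapForward(fn, a) {
    var out = new Array(a.length);
    for (var i = 0; i < a.length; i++) {
        out[i] = fn(a[i]);
    }
    return out;
}

// Walk the array back to front.
function mapBackward(fn, a) {
    var out = new Array(a.length);
    for (var i = a.length - 1; i >= 0; i--) {
        out[i] = fn(a[i]);
    }
    return out;
}

// Split the work in two, the way two CPUs might, then glue the halves
// back together. Here the halves simply run one after the other.
function mapInHalves(fn, a) {
    var mid = Math.floor(a.length / 2);
    var left = mapForward(fn, a.slice(0, mid));
    var right = mapForward(fn, a.slice(mid));
    return left.concat(right);
}

function double(x) { return x * 2; }

var input = [1, 2, 3, 4, 5];
var forward = mapForward(double, input);
var backward = mapBackward(double, input);
var halves = mapInHalves(double, input);
console.log(forward, backward, halves);
```

All three give the same answer because double has no side effects. Pass in something like alert and the order suddenly becomes visible, which is exactly why side-effect-free functions are the ones that parallelize nicely.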

Or maybe, just hypothetically, you have hundreds of thousands of servers in several data centers around the world, and you have a really big array, containing, let’s say, again, just hypothetically, the entire contents of the internet. Now you can run map on thousands of computers, each of which will attack a tiny part of the problem.

So now, for example, writing some really fast code to search the entire contents of the internet is as simple as calling the map function with a basic string searcher as an argument.
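In miniature, and with everything running on one machine, that might look like the sketch below. The pages array, the makeSearcher helper, and the example URLs are all made up, and this is an illustration of the idea only, not a description of how any real search engine is built. The map here returns a new array so its output can be fed to reduce.

```javascript
// A map that returns a new array, so results can be passed along.
function map(fn, a) {
    var out = new Array(a.length);
    for (var i = 0; i < a.length; i++) {
        out[i] = fn(a[i]);
    }
    return out;
}

function reduce(fn, a, init) {
    var s = init;
    for (var i = 0; i < a.length; i++) {
        s = fn(s, a[i]);
    }
    return s;
}

// Build a basic string searcher: given a page, return a list holding
// its URL if the text contains the needle, or an empty list if not.
function makeSearcher(needle) {
    return function (page) {
        return page.text.indexOf(needle) !== -1 ? [page.url] : [];
    };
}

// Hypothetical stand-in for "the entire contents of the internet".
var pages = [
    { url: "a.example", text: "get the lobster and put it in the pot" },
    { url: "b.example", text: "get the chicken and the coconut" },
    { url: "c.example", text: "lobster with coconut" }
];

// Map the searcher over every page, then reduce the per-page hit
// lists into one list of matching URLs.
function search(needle, pages) {
    var hits = map(makeSearcher(needle), pages);
    return reduce(function (all, found) { return all.concat(found); }, hits, []);
}

console.log(search("coconut", pages));
```

Nothing in search cares how map gets its work done. Swap in a map that farms the pages out to other machines and the calling code doesn’t change.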

The really interesting thing I want you to notice, here, is that as soon as you think of map and reduce as functions that everybody can use, and they use them, you only have to get one supergenius to write the hard code to run map and reduce on a global massively parallel array of computers, and all the old code that used to work fine when you just ran a loop still works only it’s a zillion times faster which means it can be used to tackle huge problems in an instant.

Lemme repeat that. By abstracting away the very concept of looping, you can implement looping any way you want, including implementing it in a way that scales nicely with extra hardware.

And now you understand something I wrote a while ago where I complained about CS students who are never taught anything but Java:

Without understanding functional programming, you can’t invent MapReduce, the algorithm that makes Google so massively scalable. The terms Map and Reduce come from Lisp and functional programming. MapReduce is, in retrospect, obvious to anyone who remembers from their 6.001-equivalent programming class that purely functional programs have no side effects and are thus trivially parallelizable. The very fact that Google invented MapReduce, and Microsoft didn’t, says something about why Microsoft is still playing catch up trying to get basic search features to work, while Google has moved on to the next problem: building Skynet^H^H^H^H^H^H the world’s largest massively parallel supercomputer. I don’t think Microsoft completely understands just how far behind they are on that wave.

Ok. I hope you’re convinced, by now, that programming languages with first-class functions let you find more opportunities for abstraction, which means your code is smaller, tighter, more reusable, and more scalable. Lots of Google applications use MapReduce and they all benefit whenever someone optimizes it or fixes bugs.

And now I’m going to get a little bit mushy, and argue that the most productive programming environments are the ones that let you work at different levels of abstraction. Crappy old FORTRAN really didn’t even let you write functions. C had function pointers, but they were ugleeeeee and not anonymous and had to be implemented somewhere else than where you were using them. Java made you use functors, which is even uglier. As Steve Yegge points out, Java is the Kingdom of Nouns.

Correction: The last time I used FORTRAN was 27 years ago. Apparently it got functions. I must have been thinking about GW-BASIC.

Making Wrong Code Look Wrong

Way back in September 1983, I started my first real job, working at Oranim, a big bread factory in Israel that made something like 100,000 loaves of bread every night in six giant ovens the size of aircraft carriers.

The first time I walked into the bakery I couldn’t believe what a mess it was. The sides of the ovens were yellowing, machines were rusting, there was grease everywhere.

“Is it always this messy?” I asked.

“What? What are you talking about?” the manager said. “We just finished cleaning. This is the cleanest it’s been in weeks.”

Oh boy.

It took me a couple of months of cleaning the bakery every morning before I realized what they meant. In the bakery, clean meant no dough on the machines. Clean meant no fermenting dough in the trash. Clean meant no dough on the floors.

Clean did not mean the paint on the ovens was nice and white. Painting the ovens was something you did every decade, not every day. Clean did not mean no grease. In fact there were a lot of machines that needed to be greased or oiled regularly and a thin layer of clean oil was usually a sign of a machine that had just been cleaned.

The whole concept of clean in the bakery was something you had to learn. To an outsider, it was impossible to walk in and judge whether the place was clean or not. An outsider would never think of looking at the inside surfaces of the dough rounder (a machine that rolls square blocks of dough into balls) to see if they had been scraped clean. An outsider would obsess over the fact that the old oven had discolored panels, because those panels were huge. But a baker couldn’t care less whether the paint on the outside of their oven was starting to turn a little yellow. The bread still tasted just as good.

After two months in the bakery, you learned how to “see” clean.

Code is the same way.

When you start out as a beginning programmer or you try to read code in a new language it all looks equally inscrutable. Until you understand the programming language itself you can’t even see obvious syntactic errors.

During the first phase of learning, you start to recognize the things that we usually refer to as “coding style.” So you start to notice code that doesn’t conform to indentation standards and Oddly-Capitalized variables.

It’s at this point you typically say, “Blistering Barnacles, we’ve got to get some consistent coding conventions around here!” and you spend the next day writing up coding conventions for your team and the next six days arguing about the One True Brace Style and the next three weeks rewriting old code to conform to the One True Brace Style until a manager catches you and screams at you for wasting time on something that can never make money, and you decide that it’s not really a bad thing to only reformat code when you revisit it, so you have about half of a True Brace Style and pretty soon you forget all about that and then you can start obsessing about something else irrelevant to making money like replacing one kind of string class with another kind of string class.

As you get more proficient at writing code in a particular environment, you start to learn to see other things. Things that may be perfectly legal and perfectly OK according to the coding convention, but which make you worry.

For example, in C:

char* dest, src;

This is legal code; it may conform to your coding convention, and it may even be what was intended, but when you’ve had enough experience writing C code, you’ll notice that this declares dest as a char pointer while declaring src as merely a char, and even if this might be what you wanted, it probably isn’t. That code smells a little bit dirty.

Even more subtle:

if (i != 0)
    foo(i);

In this case the code is 100% correct; it conforms to most coding conventions and there’s nothing wrong with it, but the fact that the single-statement body of the if statement is not enclosed in braces may be bugging you, because you might be thinking in the back of your head, gosh, somebody might insert another line of code there

if (i != 0)
    bar(i);
    foo(i);

… and forget to add the braces, and thus accidentally make foo(i) unconditional! So when you see blocks of code that aren’t in braces, you might sense just a tiny, wee, soupçon of uncleanliness which makes you uneasy.

OK, so far I’ve mentioned three levels of achievement as a programmer:

1. You don’t know clean from unclean.

2. You have a superficial idea of cleanliness, mostly at the level of conformance to coding conventions.

3. You start to smell subtle hints of uncleanliness beneath the surface and they bug you enough to reach out and fix the code.

There’s an even higher level, though, which is what I really want to talk about:

4. You deliberately architect your code in such a way that your nose for uncleanliness makes your code more likely to be correct.

This is the real art: making robust code by literally inventing conventions that make errors stand out on the screen.

So now I’ll walk you through a little example, and then I’ll show you a general rule you can use for inventing these code-robustness conventions, and in the end it will lead to a defense of a certain type of Hungarian Notation, probably not the type that makes people carsick, though, and a criticism of exceptions in certain circumstances, though probably not the kind of circumstances you find yourself in most of the time.

But if you’re so convinced that Hungarian Notation is a Bad Thing and that exceptions are the best invention since the chocolate milkshake and you don’t even want to hear any other opinions, well, head on over to Rory’s and read the excellent comix instead; you probably won’t be missing much here anyway; in fact in a minute I’m going to have actual code samples which are likely to put you to sleep even before they get a chance to make you angry. Yep. I think the plan will be to lull you almost completely to sleep and then to sneak the Hungarian=good, Exceptions=bad thing on you when you’re sleepy and not really putting up much of a fight.

An Example

Somewhere in Umbria

Right. On with the example. Let’s pretend that you’re building some kind of a web-based application, since those seem to be all the rage with the kids these days.

Now, there’s a security vulnerability called the Cross Site Scripting Vulnerability, a.k.a. XSS. I won’t go into the details here: all you have to know is that when you build a web application you have to be careful never to repeat back any strings that the user types into forms.

So for example if you have a web page that says “What is your name?” with an edit box and then submitting that page takes you to another page that says, Hello, Elmer! (assuming the user’s name is Elmer), well, that’s a security vulnerability, because the user could type in all kinds of weird HTML and JavaScript instead of “Elmer” and their weird JavaScript could do narsty things, and now those narsty things appear to come from you, so for example they can read cookies that you put there and forward them on to Dr. Evil’s evil site.

Let’s put it in pseudocode. Imagine that

s = Request("name")

reads input (a POST argument) from the HTML form. If you ever write this code:

Write "Hello, " & Request("name")

your site is already vulnerable to XSS attacks. That’s all it takes.

Instead you have to encode it before you copy it back into the HTML. Encoding it means replacing " with &quot;, replacing > with &gt;, and so forth. So

Write "Hello, " & Encode(Request("name"))

is perfectly safe.
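To make the encoding step concrete, here is a minimal sketch in Python. It assumes Python's standard html.escape is an acceptable stand-in for the pseudocode's Encode; the function names encode and greeting are made up for illustration and are not part of any framework.

```python
import html


def encode(raw_text: str) -> str:
    """Hypothetical stand-in for the pseudocode's Encode."""
    # html.escape replaces &, < and >; with quote=True it also
    # replaces " and ' so the result is safe inside attributes too.
    return html.escape(raw_text, quote=True)


def greeting(raw_name: str) -> str:
    """Build the 'Hello, ...' line from whatever the user typed."""
    return "Hello, " + encode(raw_name)
```

With this in place, greeting("Elmer") is unchanged, while a name containing markup comes back with its angle brackets and quotes replaced by entities, so the browser displays it instead of running it.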

All strings that originate from the user are unsafe. Any unsafe string must not be output without encoding it.

Let’s try to come up with a coding convention that will ensure that if you ever make this mistake, the code will just look wrong. If wrong code, at least, looks wrong, then it has a fighting chance of getting caught by someone working on that code or reviewing that code.

Possible Solution #1

One solution is to encode all strings right away, the minute they come in from the user:

s = Encode(Request("name"))

So our convention says this: if you ever see Request that is not surrounded by Encode, the code must be wrong.

You start to train your eyes to look for naked Requests, because they violate the convention.

That works, in the sense that if you follow this convention you’ll never have an XSS bug, but that’s not necessarily the best architecture. For example maybe you want to store these user strings in a database somewhere, and it doesn’t make sense to have them stored HTML-encoded in the database, because they might have to go somewhere that is not an HTML page, like to a credit card processing application that will get confused if they are HTML-encoded. Most web applications are developed under the principle that all strings internally are not encoded until the very last moment before they are sent to an HTML page, and that’s probably the right architecture.

We really need to be able to keep things around in unsafe format for a while.

OK. I’ll try again.

Possible Solution #2

What if we made a coding convention that said that when you write out any string you have to encode it?

s = Request("name")

// much later:
Write Encode(s)

Now whenever you see a naked Write without the Encode you know something is amiss.

Well, that doesn’t quite work… sometimes you have little bits of HTML around in your code and you can’t encode them:

If mode = "linebreak" Then prefix = "<br>"

// much later:
Write prefix

This looks wrong according to our convention, which requires us to encode strings on the way out:

Write Encode(prefix)

But now the "<br>", which is supposed to start a new line, gets encoded to &lt;br&gt; and appears to the user as a literal < b r >. That’s not right either.

So, sometimes you can’t encode a string when you read it in, and sometimes you can’t encode it when you write it out, so neither of these proposals works. And without a convention, we’re still running the risk that you do this:

s = Request("name")

...pages later...
name = s

...pages later...
recordset("name") = name // store name in db in a column "name"

...days later...
theName = recordset("name")

...pages or even months later...
Write theName

Did we remember to encode the string? There’s no single place where you can look to see the bug. There’s no place to sniff. If you have a lot of code like this, it takes a ton of detective work to trace the origin of every string that is ever written out to make sure it has been encoded.

The Real Solution

So let me suggest a coding convention that works. We’ll have just one rule:

All strings that come from the user must be stored in variables (or database columns) with a name starting with the prefix “us” (for Unsafe String). All strings that have been HTML encoded or which came from a known-safe location must be stored in variables with a name starting with the prefix “s” (for Safe string).

Let me rewrite that same code, changing nothing but the variable names to match our new convention.

us = Request("name")

...pages later...
usName = us

...pages later...
recordset("usName") = usName

...days later...
sName = Encode(recordset("usName"))

...pages or even months later...
Write sName

The thing I want you to notice about the new convention is that now, if you make a mistake with an unsafe string, you can always see it on some single line of code, as long as the coding convention is adhered to:

s = Request("name")

is a priori wrong, because you see the result of Request being assigned to a variable whose name begins with s, which is against the rules. The result of Request is always unsafe so it must always be assigned to a variable whose name begins with “us”.

us = Request("name")

is always OK.

usName = us

is always OK.

sName = us

is certainly wrong.

sName = Encode(us)

is certainly correct.

Write usName

is certainly wrong.

Write sName

is OK, as is

Write Encode(usName)

Every line of code can be inspected by itself, and if every line of code is correct, the entire body of code is correct.

Eventually, with this coding convention, your eyes learn to see the Write usXXX and know that it’s wrong, and you instantly know how to fix it, too. I know, it’s a little bit hard to see the wrong code at first, but do this for three weeks, and your eyes will adapt, just like the bakery workers who learned to look at a giant bread factory and instantly say, “jay-zuss, nobody cleaned insahd rounduh fo-ah! What the hayl kine a opparashun y’awls runnin’ heey-uh?”

In fact we can extend the rule a bit, and rename (or wrap) the Request and Encode functions to be UsRequest and SEncode… in other words, functions that return an unsafe string or a safe string will start with Us and S, just like variables. Now look at the code:

us = UsRequest("name")
usName = us
recordset("usName") = usName
sName = SEncode(recordset("usName"))
Write sName

See what I did? Now you can look to see that both sides of the equal sign start with the same prefix to see mistakes.

us = UsRequest("name") // ok, both sides start with US
s = UsRequest("name") // bug
usName = us // ok
sName = us // certainly wrong.
sName = SEncode(us) // certainly correct.

Heck, I can take it one step further, by renaming Write to WriteS and renaming SEncode to SFromUs:

us = UsRequest("name")
usName = us
recordset("usName") = usName
sName = SFromUs(recordset("usName"))
WriteS sName

This makes mistakes even more visible. Your eyes will learn to “see” smelly code, and this will help you find obscure security bugs just through the normal process of writing code and reading code.
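For readers who want to try the convention in a real language, here is a rough Python sketch. Everything in it is an assumption made for illustration: the form is modeled as a plain dict, the page output as a list of strings, and us_request, s_from_us, write_s, and render_greeting are hypothetical names that mirror the pseudocode's UsRequest, SFromUs, and WriteS.

```python
import html


def us_request(form: dict, field: str) -> str:
    """Return the raw, unsafe string exactly as the user typed it."""
    return form.get(field, "")


def s_from_us(us: str) -> str:
    """Convert an unsafe string into a safe, HTML-encoded one."""
    return html.escape(us, quote=True)


def write_s(out: list, s: str) -> None:
    """Append a string that is already safe to the page output."""
    out.append(s)


def render_greeting(form: dict) -> str:
    out = []
    us_name = us_request(form, "name")  # us on both sides: looks right
    s_name = s_from_us(us_name)         # s on both sides: looks right
    write_s(out, "Hello, ")             # literal from a known-safe location
    write_s(out, s_name)                # write_s handed an s: looks right
    # write_s(out, us_name) would look wrong: an us handed to write_s.
    return "".join(out)
```

Nothing here is enforced by the interpreter. The prefixes only work because a reader can check each line on its own, which is the whole point of the convention.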

Making wrong code look wrong is nice, but it’s not necessarily the best possible solution to every security problem. It doesn’t catch every possible bug or mistake, because you might not look at every line of code. But it’s sure a heck of a lot better than nothing, and I’d much rather have a coding convention where wrong code, at least, looked wrong. You instantly gain the incremental benefit that every time a programmer’s eyes pass over a line of code, that particular bug is checked for and prevented.

A General Rule

This business of making wrong code look wrong depends on getting the right things close together in one place on the screen. When I’m looking at a string, in order to get the code right, I need to know, everywhere I see that string, whether it’s safe or unsafe. I don’t want that information to be in another file or on another page that I would have to scroll to. I have to be able to see it right there and that means a variable naming convention.

There are a lot of other examples where you can improve code by moving things next to each other. Most coding conventions include rules like:

  • Keep functions short.
  • Declare your variables as close as possible to the place where you will use them.
  • Don’t use macros to create your own personal programming language.
  • Don’t use goto.
  • Don’t put closing braces more than one screen away from the matching opening brace.

What all these rules have in common is that they are trying to get the relevant information about what a line of code really does physically as close together as possible. This improves the chances that your eyeballs will be able to figure out everything that’s going on.

In general, I have to admit that I’m a little bit scared of language features that hide things. When you see the code

i = j * 5;

… in C you know, at least, that j is being multiplied by five and the results stored in i.

But if you see that same snippet of code in C++, you don’t know anything. Nothing. The only way to know what’s really happening in C++ is to find out what types i and j are, something which might be declared somewhere altogether else. That’s because j might be of a type that has operator* overloaded and it does something terribly witty when you try to multiply it. And i might be of a type that has operator= overloaded, and the types might not be compatible so an automatic type coercion function might end up being called. And the only way to find out is not only to check the type of the variables, but to find the code that implements that type, and God help you if there’s inheritance somewhere, because now you have to traipse all the way up the class hierarchy all by yourself trying to find where that code really is, and if there’s polymorphism somewhere, you’re really in trouble because it’s not enough to know what type i and j are declared, you have to know what type they are right now, which might involve inspecting an arbitrary amount of code and you can never really be sure if you’ve looked everywhere thanks to the halting problem (phew!).

When you see i=j*5 in C++ you are really on your own, bubby, and that, in my mind, reduces the ability to detect possible problems just by looking at code.
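The same hazard shows up in any language with operator overloading, not just C++. Here is a small sketch in Python, with a deliberately silly class called Odd invented for the purpose, showing that the line i = j * 5 tells you nothing until you know what j is.

```python
class Odd:
    """Hypothetical type whose * operator does not multiply."""

    def __init__(self, value: int) -> None:
        self.value = value

    def __mul__(self, other: int) -> "Odd":
        # Surprise: "multiplying" actually adds.
        return Odd(self.value + other)


def times_five(j):
    # This line is all a reviewer sees on the screen.
    i = j * 5
    return i
```

Calling times_five(4) multiplies, as you would expect. Calling times_five(Odd(4)) quietly runs Odd.__mul__ instead, and the only way to find that out is to go read a class defined somewhere else.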

None of this was supposed to matter, of course. When you do clever-schoolboy things like override operator*, this is meant to be to help you provide a nice waterproof abstraction. Golly, j is a Unicode String type, and multiplying a Unicode String by an integer is obviously a good abstraction for converting Traditional Chinese to Standard Chinese, right?

The trouble is, of course, that waterproof abstractions aren’t. I’ve already talked about this extensively in The Law of Leaky Abstractions so I won’t repeat myself here.

Scott Meyers has made a whole career out of showing you all the ways they fail and bite you, in C++ at least. (By the way, the third edition of Scott’s book Effective C++ just came out; it’s completely rewritten; get your copy today!)

Okay.

I’m losing track. I better summarize The Story Until Now:

Look for coding conventions that make wrong code look wrong. Getting the right information collocated all together in the same place on screen in your code lets you see certain types of problems and fix them right away.

I’m Hungary

Lugnano, Umbria, Italy

So now we get back to the infamous Hungarian notation.

Hungarian notation was invented by Microsoft programmer Charles Simonyi. One of the major projects Simonyi worked on at Microsoft was Word; in fact he led the project to create the world’s first WYSIWYG word processor, something called Bravo at Xerox PARC.

In WYSIWYG word processing, you have scrollable windows, so every coordinate has to be interpreted as either relative to the window or relative to the page, and that makes a big difference, and keeping them straight is pretty important.

Which, I surmise, is one of the many good reasons Simonyi started using something that came to be called Hungarian notation. It looked like Hungarian, and Simonyi was from Hungary, thus the name. In Simonyi’s version of Hungarian notation, every variable was prefixed with a lower case tag that indicated the kind of thing that the variable contained.

For example, if the variable name is rwCol, rw is the prefix.

I’m using the word kind on purpose, there, because Simonyi mistakenly used the word type in his paper, and generations of programmers misunderstood what he meant.

If you read Simonyi’s paper closely, what he was getting at was the same kind of naming convention as I used in my example above where we decided that us meant “unsafe string” and s meant “safe string.” They’re both of type string. The compiler won’t help you if you assign one to the other and Intellisense won’t tell you bupkis. But they are semantically different; they need to be interpreted differently and treated differently and some kind of conversion function will need to be called if you assign one to the other or you will have a runtime bug. If you’re lucky.

Simonyi’s original concept for Hungarian notation was called, inside Microsoft, Apps Hungarian, because it was used in the Applications Division, to wit, Word and Excel. In Excel’s source code you see a lot of rw and col and when you see those you know that they refer to rows and columns. Yep, they’re both integers, but it never makes sense to assign between them. In Word, I’m told, you see a lot of xl and xw, where xl means “horizontal coordinates relative to the layout” and xw means “horizontal coordinates relative to the window.” Both ints. Not interchangeable. In both apps you see a lot of cb meaning “count of bytes.” Yep, it’s an int again, but you know so much more about it just by looking at the variable name. It’s a count of bytes: a buffer size. And if you see xl = cb, well, blow the Bad Code Whistle, that is obviously wrong code, because even though xl and cb are both integers, it’s completely crazy to set a horizontal offset in pixels to a count of bytes.

In Apps Hungarian prefixes are used for functions, as well as variables. So, to tell you the truth, I’ve never seen the Word source code, but I’ll bet you dollars to donuts there’s a function called YlFromYw which converts from vertical window coordinates to vertical layout coordinates. Apps Hungarian requires the notation TypeFromType instead of the more traditional TypeToType so that every function name could begin with the type of thing that it was returning, just like I did earlier in the example when I renamed Encode SFromUs. In fact in proper Apps Hungarian the Encode function would have to be named SFromUs. Apps Hungarian wouldn’t really give you a choice in how to name this function. That’s a good thing, because it’s one less thing you need to remember, and you don’t have to wonder what kind of encoding is being referred to by the word Encode: you have something much more precise.
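To show what such a conversion function might look like, here is a purely hypothetical sketch in Python. It is not taken from Word. It assumes that a window coordinate becomes a layout coordinate by adding the layout position of the window's top edge, and the names yl_from_yw, yw_from_yl, and yl_top are invented for the example.

```python
def yl_from_yw(yw: int, yl_top: int) -> int:
    """Convert a vertical window coordinate to a layout coordinate.

    yl_top is the layout coordinate of the window's top edge,
    which is an assumption about how scrolling is modeled here.
    """
    return yw + yl_top


def yw_from_yl(yl: int, yl_top: int) -> int:
    """Convert a vertical layout coordinate back to a window coordinate."""
    return yl - yl_top


yl_top = 300                               # document scrolled down 300 units
yw_click = 40                              # click position inside the window
yl_click = yl_from_yw(yw_click, yl_top)    # yl on both sides: looks right
# yl_click = yw_click would look wrong: a yw assigned to a yl.
```

Both values are plain integers, so nothing stops you from mixing them up except the names. The prefix on the function matches the prefix on the variable receiving its result, which is what lets you check each assignment at a glance.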

Apps Hungarian was extremely valuable, especially in the days of C programming where the compiler didn’t provide a very useful type system.

But then something kind of wrong happened.

The dark side took over Hungarian Notation.

Nobody seems to know why or how, but it appears that the documentation writers on the Windows team inadvertently invented what came to be known as Systems Hungarian.

Somebody, somewhere, read Simonyi’s paper, where he used the word “type,” and thought he meant type, like class, like in a type system, like the type checking that the compiler does. He did not. He explained very carefully exactly what he meant by the word “type,” but it didn’t help. The damage was done.

Apps Hungarian had very useful, meaningful prefixes like “ix” to mean an index into an array, “c” to mean a count, “d” to mean the difference between two numbers (for example “dx” meant “width”), and so forth.

Systems Hungarian had far less useful prefixes like “l” for long and “ul” for “unsigned long” and “dw” for double word, which is, actually, uh, an unsigned long. In Systems Hungarian, the only thing that the prefix told you was the actual data type of the variable.

This was a subtle but complete misunderstanding of Simonyi’s intention and practice, and it just goes to show you that if you write convoluted, dense academic prose nobody will understand it and your ideas will be misinterpreted and then the misinterpreted ideas will be ridiculed even when they weren’t your ideas. So in Systems Hungarian you got a lot of dwFoo meaning “double word foo,” and doggone it, the fact that a variable is a double word tells you darn near nothing useful at all. So it’s no wonder people rebelled against Systems Hungarian.

Systems Hungarian was promulgated far and wide; it is the standard throughout the Windows programming documentation; it was spread extensively by books like Charles Petzold’s Programming Windows, the bible for learning Windows programming, and it rapidly became the dominant form of Hungarian, even inside Microsoft, where very few programmers outside the Word and Excel teams understood just what a mistake they had made.

And then came The Great Rebellion. Eventually, programmers who never understood Hungarian in the first place noticed that the misunderstood subset they were using was Pretty Dang Annoying and Well-Nigh Useless, and they revolted against it. Now, there are still some nice qualities in Systems Hungarian, which help you see bugs. At the very least, if you use Systems Hungarian, you’ll know the type of a variable at the spot where you’re using it. But it’s not nearly as valuable as Apps Hungarian.

The Great Rebellion hit its peak with the first release of .NET. Microsoft finally started telling people, “Hungarian Notation Is Not Recommended.” There was much rejoicing. I don’t even think they bothered saying why. They just went through the naming guidelines section of the document and wrote, “Do Not Use Hungarian Notation” in every entry. Hungarian Notation was so doggone unpopular by this point that nobody really complained, and everybody in the world outside of Excel and Word was relieved at no longer having to use an awkward naming convention that, they thought, was unnecessary in the days of strong type checking and Intellisense.

But there’s still a tremendous amount of value to Apps Hungarian, in that it increases collocation in code, which makes the code easier to read, write, debug, and maintain, and, most importantly, it makes wrong code look wrong.

Before we go, there’s one more thing I promised to do, which is to bash exceptions one more time. The last time I did that I got in a lot of trouble. In an off-the-cuff remark on the Joel on Software homepage, I wrote that I don’t like exceptions because they are, effectively, an invisible goto, which, I reasoned, is even worse than a goto you can see. Of course millions of people jumped down my throat. The only person in the world who leapt to my defense was, of course, Raymond Chen, who is, by the way, the best programmer in the world, so that has to say something, right?

Here’s the thing with exceptions, in the context of this article. Your eyes learn to see wrong things, as long as there is something to see, and this prevents bugs. In order to make code really, really robust, when you code-review it, you need to have coding conventions that allow collocation. In other words, the more information about what code is doing is located right in front of your eyes, the better a job you’ll do at finding the mistakes. When you have code that says

dosomething();
cleanup();

… your eyes tell you, what’s wrong with that? We always clean up! But the possibility that dosomething might throw an exception means that cleanup might not get called. And that’s easily fixable, using finally or whatnot, but that’s not my point: my point is that the only way to know that cleanup is definitely called is to investigate the entire call tree of dosomething to see if there’s anything in there, anywhere, which can throw an exception, and that’s ok, and there are things like checked exceptions to make it less painful, but the real point is that exceptions eliminate collocation. You have to look somewhere else to answer a question of whether code is doing the right thing, so you’re not able to take advantage of your eye’s built-in ability to learn to see wrong code, because there’s nothing to see.
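Here is a small Python sketch of that situation. The functions are hypothetical; a list called log stands in for whatever side effects the real code would have, and the fail flag simulates an exception raised somewhere deep in the call tree.

```python
def do_something(log: list, fail: bool) -> None:
    log.append("do_something")
    if fail:
        # Stands in for a failure buried somewhere in the call tree.
        raise RuntimeError("hidden failure")


def cleanup(log: list) -> None:
    log.append("cleanup")


def naive(log: list, fail: bool) -> None:
    do_something(log, fail)
    cleanup(log)  # skipped whenever do_something raises


def guarded(log: list, fail: bool) -> None:
    try:
        do_something(log, fail)
    finally:
        cleanup(log)  # runs whether or not do_something raises
```

The two lines in naive look exactly as innocent as the snippet above. Nothing on the screen tells you that cleanup can be skipped; you only learn it by reading do_something and everything it calls.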

Now, when I’m writing a dinky script to gather up a bunch of data and print it once a day, heck yeah, exceptions are great. I like nothing more than to ignore all possible wrong things that can happen and just wrap up the whole damn program in a big ol’ try/catch that emails me if anything ever goes wrong. Exceptions are fine for quick-and-dirty code, for scripts, and for code that is neither mission critical nor life-sustaining. But if you’re writing an operating system, or a nuclear power plant, or the software to control a high speed circular saw used in open heart surgery, exceptions are extremely dangerous.

I know people will assume that I’m a lame programmer for failing to understand exceptions properly and failing to understand all the ways they can improve my life if only I was willing to let exceptions into my heart, but, too bad. The way to write really reliable code is to try to use simple tools that take into account typical human frailty, not complex tools with hidden side effects and leaky abstractions that assume an infallible programmer.

More Reading

If you’re still all gung-ho about exceptions, read Raymond Chen’s essay Cleaner, more elegant, and harder to recognize. “It is extraordinarily difficult to see the difference between bad exception-based code and not-bad exception-based code… exceptions are too hard and I’m not smart enough to handle them.”

Raymond’s rant about Death by Macros, A rant against flow control macros, is about another case where failing to get information all in the same place makes code unmaintainable. “When you see code that uses [macros], you have to go dig through header files to figure out what they do.”

For background on the history of Hungarian notation, start with Simonyi’s original paper, Hungarian Notation. Doug Klunder introduced this to the Excel team in a somewhat clearer paper. For more stories about Hungarian and how it got ruined by documentation writers, read Larry Osterman’s post, especially Scott Ludwig’s comment, or Rick Schaut’s post.

Foreword to Painless Project Management with FogBugz, by Mike Gunderloy

There’s a restaurant in my New York City neighborhood called Isabella’s that’s always packed.

Downstairs, upstairs, at the sidewalk cafe, it’s mobbed. And there are large crowds of happy yuppies out front, waiting 45 minutes for a table when they can clearly see other perfectly good restaurants right across the street which have plenty of tables.

It doesn’t matter when you go there. For Sunday brunch, it’s packed. Friday night? Packed, of course. But go on a quiet Wednesday night at 11:00 PM. You’ll get a table fairly quickly, but the restaurant is still, basically, packed.

Is it the food? Nah. Ruth Reichl, restaurant reviewer extraordinaire from the New York Times, dismissed it thusly: “The food is not very good.”

The prices? I doubt anyone cares. This is the neighborhood where Jerry Seinfeld bought Isaac Stern’s apartment with views over two parks.

Lack of competition? What, are you serious? This is Manhattan!

Here’s a clue as to why Isabella’s works. In ten years living in this neighborhood, I still go back there. All the time. Because they’ve never given me a single reason not to.

That actually says a lot.

I never go to a certain fake-Italian art-themed restaurant, because once I ate there and the waiter, who had gone beyond rude well into the realm of actual cruelty, mocking our entree choices, literally chased us down the street complaining about the small tip we left him.

I stopped going to another hole-in-the-wall pizza-pasta-bistro because the owner would come sit down at our table while we ate and ask for computer help.

I really, really loved the food at a local curry restaurant with headache-inducing red banquettes and zebra-striped decor. The katori chat was to die for. I was even willing to overlook the noxious smell of ammonia wafting up from the subterranean bathrooms. But the food inevitably took an hour to arrive, even when the place was empty, so I just never went back.

But in ten years I can’t think of a single bad thing that ever happened to me at Isabella’s.

Nothing.

So that’s why it’s so packed. People keep coming back, again and again, because when you dine at Isabella’s, nothing will ever go wrong.

Isabella’s is thoroughly and completely debugged.

It takes you ten years to notice this, because most of the time when you eat at a restaurant, nothing goes wrong. It took a couple of years of going to the curry place before we realized they were always going to make us miss our movie, no matter how early we arrived, and we finally had to write them off.

And so, on the Upper West Side of Manhattan, if you’re a restaurant, and you want to thrive, you have to carefully debug everything.

You have to make sure that there’s always someone waiting to greet guests. This person must learn never to leave the maitre d’ desk to show someone to their table, because otherwise the next person will come in and there will be nobody there to greet them. Instead, someone else needs to show patrons to their tables. And give them their menus, right away. And take their coats and drink orders.

You have to figure out who your best customers are—the locals who come on weekday nights when the restaurant is relatively quiet—and give them tables quickly on Friday night, even if the out-of-towners have to wait a little longer.

You need a procedure so that every water glass is always full.

Every time somebody is unhappy, that’s a bug. Write it down. Figure out what you’re going to do about it. Add it to the training manual. Never make the same mistake twice.

Eventually, Isabella’s became a fabulously profitable and successful restaurant, not because of its food, but because it was debugged. Just getting what we programmers call “the edge cases” right was sufficient to keep people coming back, and telling their friends, and that’s enough to overcome a review where the New York Times calls your food “not very good.”

Great products are great because they’re deeply debugged. Restaurants, software, it’s all the same.

Great software doesn’t crash when you do weird, rare things, because everybody does something weird.

Microsoft developer Larry Osterman, working on DOS 4, once thought he had found a rare bug. “But if that were the case,” he told DOS architect Gordon Letwin, “it’d take a one in a million chance for it to happen.”

Letwin’s reply? “In our business, one in a million is next Tuesday.”

Great software helps you out when you misunderstand it. If you try to drag a file to a button in the taskbar, Windows pops up a message that says, essentially, “You can’t do that!” but then it goes on to tell you how you can accomplish what you’re obviously trying to do (try it!).

Great software pops up messages that show that the designers have thought about the problem you’re working on, probably more than you have. In FogBugz, for example, if you try to reply to an email message but someone else tries to reply to that same email at the same time, you get a warning and your response is not sent until you can check out what’s going on.

Great software works the way everybody expects it to. I’m probably one of the few people left who still closes windows by double clicking in the top left corner instead of clicking on the [x] button. I don’t know why I do that, but it always works, with great software. Some software that I have is not so great. It doesn’t close if you double click in the top left corner. That makes me a little bit frustrated. It probably made a lot of people frustrated, and a lot of those people probably complained, but I’ll bet you that the software developers just didn’t do bug tracking, because they have never fixed that bug and probably never will.

What great software has in common is being deeply debugged and the only way to get software that’s deeply debugged is to keep track of your bugs.

A bug tracking database is not just a memory aid, or a scheduling tool. It doesn’t make it easier to produce great software, it makes it possible to create great software.

With bug tracking, every idea gets into the system. Every flaw gets into the system. Every tester’s possible misinterpretation of the user interface gets into the system. Every possible improvement that anybody thinks about gets into the system.

Bug tracking software captures the cosmic rays that cause the genetic mutations that make your software evolve into something superior.

And as you constantly evaluate, reprioritize, triage, punt, and assign these flaws, the software evolves. It gets better and better. It learns to deal with more and more weird situations, more and more misunderstanding users, and more and more scenarios.

That’s when something magical happens, and your software becomes better than just the sum of its features. Suddenly it becomes reliable. Reliable, meaning, it never screws up. It never makes its users angry. It never makes its customers wish they had purchased something else.

And that magic is the key to success. In restaurants as in software.

Biculturalism

By now, Windows and Unix are functionally more similar than different. They both support the same major programming metaphors, from command lines to GUIs to web servers; they are organized around virtually the same panoply of system resources, from nearly identical file systems to memory to sockets and processes and threads. There’s not much about the core set of services provided by each operating system to limit the kinds of applications you can create.

What’s left is cultural differences. Yes, we all eat food, but over there, they eat raw fish with rice using wood sticks, while over here, we eat slabs of ground cow on bread with our hands. A cultural difference doesn’t mean that American stomachs can’t digest sushi or that Japanese stomachs can’t digest Big Macs, and it doesn’t mean that there aren’t lots of Americans who eat sushi or Japanese who eat burgers, but it does mean that Americans getting off the plane for the first time in Tokyo are confronted with an overwhelming feeling that this place is strange, dammit, and no amount of philosophizing about how underneath we’re all the same, we all love and work and sing and die will overcome the fact that Americans and Japanese can never really get comfortable with each other’s toilet arrangements.

What are the cultural differences between Unix and Windows programmers? There are many details and subtleties, but for the most part it comes down to one thing: Unix culture values code which is useful to other programmers, while Windows culture values code which is useful to non-programmers.

This is, of course, a major simplification, but really, that’s the big difference: are we programming for programmers or end users? Everything else is commentary.

The frequently controversial Eric S. Raymond has just written a long book about Unix programming called The Art of UNIX Programming exploring his own culture in great detail. You can buy the book and read it on paper, or, if Raymond’s politics are just too anti-idiotarian for you to consider giving him money, you can even read it online for free and rest assured that the author will not receive a penny for his hard work.

Let’s look at a small example. The Unix programming culture holds in high esteem programs which can be called from the command line, which take arguments that control every aspect of their behavior, and the output of which can be captured as regularly-formatted, machine readable plain text. Such programs are valued because they can easily be incorporated into other programs or larger software systems by programmers. To take one minuscule example, there is a core value in the Unix culture, which Raymond calls “Silence is Golden,” that a program that has done exactly what you told it to do successfully should provide no output whatsoever. It doesn’t matter if you’ve just typed a 300 character command line to create a file system, or built and installed a complicated piece of software, or sent a manned rocket to the moon. If it succeeds, the accepted thing to do is simply output nothing. The user will infer from the next command prompt that everything must be OK.

This is an important value in Unix culture because you’re programming for other programmers. As Raymond puts it, “Programs that babble don’t tend to play well with other programs.” By contrast, in the Windows culture, you’re programming for Aunt Marge, and Aunt Marge might be justified in observing that a program that produces no output because it succeeded cannot be distinguished from a program that produced no output because it failed badly or a program that produced no output because it misinterpreted your request.
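Here is a minimal sketch of why silent programs compose so easily. Everything in it is made up for illustration: SILENT_OK and NOISY_FAIL are hypothetical stand-ins for real tools, and run_quietly and pipeline are names invented here. The calling program never parses chatter; it only looks at the exit status, and reads stderr when something went wrong.

```python
import subprocess
import sys

def run_quietly(args):
    """Run a command; return (exit_code, stdout, stderr) as text."""
    done = subprocess.run(args, capture_output=True, text=True)
    return done.returncode, done.stdout, done.stderr

# Hypothetical well-behaved tool: does its work, prints nothing, exits 0.
SILENT_OK = [sys.executable, "-c", "pass"]
# Hypothetical failing tool: complains on stderr and exits non-zero.
NOISY_FAIL = [sys.executable, "-c",
              "import sys; sys.stderr.write('disk full\\n'); sys.exit(1)"]

def pipeline(steps):
    """Run steps in order; stop at the first non-zero exit status."""
    for step in steps:
        code, _out, err = run_quietly(step)
        if code != 0:
            return f"failed: {err.strip()}"
    return "ok"
```

Calling pipeline([SILENT_OK, NOISY_FAIL, SILENT_OK]) stops at the second step and reports its complaint. This is exactly the information Aunt Marge never sees at a bare command prompt, which is her side of the argument.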

Similarly, the Unix culture appreciates programs that stay textual. They don’t like GUIs much, except as lipstick painted cleanly on top of textual programs, and they don’t like binary file formats. This is because a textual interface is easier to program against than, say, a GUI interface, which is almost impossible to program against unless some other provisions are made, like a built-in scripting language. Here again, we see that the Unix culture values creating code that is useful to other programmers, something which is rarely a goal in Windows programming.

Which is not to say that all Unix programs are designed solely for programmers. Far from it. But the culture values things that are useful to programmers, and this explains a thing or two about a thing or two.

Suppose you take a Unix programmer and a Windows programmer and give them each the task of creating the same end-user application. The Unix programmer will create a command-line or text-driven core and occasionally, as an afterthought, build a GUI which drives that core. This way the main operations of the application will be available to other programmers who can invoke the program on the command line and read the results as text. The Windows programmer will tend to start with a GUI, and occasionally, as an afterthought, add a scripting language which can automate the operation of the GUI interface. This is appropriate for a culture in which 99.999% of the users are not programmers in any way, shape, or form, and have no interest in being one.

There is one significant group of Windows programmers who are primarily coding for other programmers: the Windows team itself, inside Microsoft. The way they tend to do things is to create an API, callable from the C language, which implements the functionality, and then create GUI applications which call that API. Anything you can do from the Windows user interface can also be accomplished using a programming interface callable from any reasonable programming language. For example, Microsoft Internet Explorer itself is nothing but a tiny 89 KB program which wraps together dozens of very powerful components which are freely available to sophisticated Windows programmers and which are mostly designed to be flexible and powerful. Unfortunately, since programmers do not have access to the source code for those components, they can only be used in ways which were precisely foreseen and allowed for by the component developers at Microsoft, which doesn’t always work out. And sometimes there are bugs, usually the fault of the person calling the API, which are difficult or impossible to debug without the source code. The Unix cultural value of visible source code makes it an easier environment to develop for. Any Windows developer will tell you about the time they spent four days tracking down a bug because, say, they thought that the memory size returned by LocalSize would be the same as the memory size they originally requested with LocalAlloc, or some similar bug they could have fixed in ten minutes if they could see the source code of the library. Raymond invents an amusing story to illustrate this which will ring true to anyone who has ever used a library in binary form.

So you get these religious arguments. Unix is better because you can debug into libraries. Windows is better because Aunt Marge gets some confirmation that her email was actually sent. Actually, one is not better than another, they simply have different values: in Unix making things better for other programmers is a core value and in Windows making things better for Aunt Marge is a core value.

Let’s look at another cultural difference. Raymond says, “Classic Unix documentation is written to be telegraphic but complete… The style assumes an active reader, one who is able to deduce obvious unsaid consequences of what is said, and who has the self-confidence to trust those deductions. Read every word carefully, because you will seldom be told anything twice.” Oy vey, I thought, he’s actually teaching young programmers to write more impossible man pages.

For end users, you’ll never get away with this. Raymond may call it “oversimplifying condescension,” but the Windows culture understands that end users don’t like reading and if they concede to read your documentation, they will only read the minimum amount, and so you have to explain things repeatedly… indeed the hallmark of a good Windows help file is that any single topic can be read by itself by an average reader without assuming knowledge of any other help topic.

How did we get different core values? This is another reason Raymond’s book is so good: he goes deeply into the history and evolution of Unix and brings new programmers up to speed with all the accumulated history of the culture back to 1969. When Unix was created and when it formed its cultural values, there were no end users. Computers were expensive, CPU time was expensive, and learning about computers meant learning how to program. It’s no wonder that the culture which emerged valued things which are useful to other programmers. By contrast, Windows was created with one goal only: to sell as many copies as conceivable at a profit. Scrillions of copies. “A computer on every desktop and in every home” was the explicit goal of the team which created Windows, set its agenda and determined its core values. Ease of use for non-programmers was the only way to get on every desk and in every home and thus usability über alles became the cultural norm. Programmers, as an audience, were an extreme afterthought.

The cultural schism is so sharp that Unix has never really made any inroads on the desktop. Aunt Marge can’t really use Unix, and repeated efforts to make a pretty front end for Unix that Aunt Marge can use have failed, entirely because these efforts were done by programmers who were steeped in the Unix culture. For example, Unix has a value of separating policy from mechanism which, historically, came from the designers of X. This directly led to a schism in user interfaces; nobody has ever quite been able to agree on all the details of how the desktop UI should work, and they think this is OK, because their culture values this diversity, but for Aunt Marge it is very much not OK to have to use a different UI to cut and paste in one program than she uses in another. So here we are, 20 years after Unix developers started trying to paint a good user interface on their systems, and we’re still at the point where the CEO of the biggest Linux vendor is telling people that home users should just use Windows. I have heard economists claim that Silicon Valley could never be recreated in, say, France, because the French culture puts such a high penalty on failure that entrepreneurs are not willing to risk it. Maybe the same thing is true of Linux: it may never be a desktop operating system because the culture values things which prevent it. OS X is the proof: Apple finally created Unix for Aunt Marge, but only because the engineers and managers at Apple were firmly of the end-user culture (which I’ve been imperialistically calling “the Windows Culture” even though historically it originated at Apple). They rejected the Unix culture’s fundamental norm of programmer-centricity. They even renamed core directories — heretical! — to use common English words like “applications” and “library” instead of “bin” and “lib.”

Raymond does attempt to compare and contrast Unix to other operating systems, and this is really the weakest part of an otherwise excellent book, because he really doesn’t know what he’s talking about. Whenever he opens his mouth about Windows he tends to show that his knowledge of Windows programming comes mostly from reading newspapers, not from actual Windows programming. That’s OK; he’s not a Windows programmer; we’ll forgive that. As is typical from someone with a deep knowledge of one culture, he knows what his culture values but doesn’t quite notice the distinction between parts of his culture which are universal (killing old ladies, programs which crash: always bad) and parts of the culture which only apply when you’re programming for programmers (eating raw fish, command line arguments: depends on audience).

There are too many monocultural programmers who, like the typical American kid who never left St. Paul, Minnesota, can’t quite tell the difference between a cultural value and a core human value. I’ve encountered too many Unix programmers who sneer at Windows programming, thinking that Windows is heathen and stupid. Raymond all too frequently falls into the trap of disparaging the values of other cultures without considering where they came from. It’s rather rare to find such bigotry among Windows programmers, who are, on the whole, solution-oriented and non-ideological. At the very least, Windows programmers will concede the faults of their culture and say pragmatically, “Look, if you want to sell a word processor to a lot of people, it has to run on their computers, and if that means we use the Evil Registry instead of elegant ~/.rc files to store our settings, so be it.” The very fact that the Unix world is so full of self-righteous cultural superiority, “advocacy,” and slashdot-karma-whoring sectarianism while the Windows world is more practical (“yeah, whatever, I just need to make a living here”) stems from a culture that feels itself under siege, unable to break out of the server closet and hobbyist market and onto the mainstream desktop. This haughtiness-from-a-position-of-weakness is the biggest flaw of The Art of UNIX Programming, but it’s not really a big flaw: on the whole, the book is so full of incredibly interesting insight into so many aspects of programming that I’m willing to hold my nose during the rare smelly ideological rants because there’s so much to learn about universal ideals from the rest of the book. Indeed I would recommend this book to developers of any culture in any platform with any goals, because so many of the values which it trumpets are universal. When Raymond points out that the CSV format is inferior to the /etc/passwd format, he’s trying to score points for Unix against Windows, but, you know what? He’s right. /etc/passwd is easier to parse than CSV, and if you read this book, you’ll know why, and you’ll be a better programmer.
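A rough illustration of the parsing difference, in Python. The sample records are invented, and the explanation offered here (colon-separated fields need only a split, while CSV needs a parser that understands quoting) is one commonly given reason, not a quotation from the book.

```python
import csv
import io

# Made-up sample records, not taken from any real system.
PASSWD_LINE = "marge:x:1001:1001:Aunt Marge:/home/marge:/bin/sh"
CSV_LINE = 'marge,1001,"Marge, Aunt","She said ""hi"""'

def parse_passwd(line):
    """Seven colon-separated fields; a plain split is enough."""
    return line.rstrip("\n").split(":")

def parse_csv_naively(line):
    """Splitting on commas breaks as soon as a field is quoted."""
    return line.split(",")

def parse_csv_properly(line):
    """A quote-aware parser is needed to get the fields back."""
    return next(csv.reader(io.StringIO(line)))
```

The naive CSV split returns five mangled pieces for a four-field record, because the comma inside the quoted name is treated as a separator and the doubled quotes are left in place. The passwd line comes apart correctly with one call to split.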

Craftsmanship

Making software is not a manufacturing process. In the 1980s everyone was running around terrified that Japanese software companies were setting up “software factories” that could churn out high quality code on an assembly line. It didn’t make any sense then and it doesn’t make sense now. Shoving a lot of programmers into a room and lining them up in neat rows did not really help get the bug counts down.

If writing code is not assembly-line style production, what is it? Some have proposed the label craftsmanship. That’s not quite right, either, because I don’t care what you say: that dialog box in Windows that asks you how you want your help file indexed does not in any way, shape, or form resemble what any normal English speaker would refer to as “craftsmanship.”

Writing code is not production, it’s not always craftsmanship (though it can be), it’s design. Design is that nebulous area where you can add value faster than you add cost. The New York Times magazine has been raving about the iPod and how Apple is one of the few companies that knows how to use good design to add value. But I’ve talked enough about design, I want to talk about craftsmanship for a minute: what it is and how you recognize it.

I’d like to tell you about a piece of code I’ve been rewriting for CityDesk 3.0: the file import code. (Advertainment: CityDesk is my company’s easy-to-use content management product.)

The spec seems about as simple as any snippet of code can be. The user chooses a file using a standard dialog box, and the program copies that file into the CityDesk database.

This turned out to be a great example of one of those places where “the last 1% of the code takes 90% of the time.” The first draft of the code looked like this:

  1. Open the file
  2. Read it all into a big byte array
  3. Store the byte array in a record

Worked great. For reasonable sized existing files it was practically instantaneous. It had a few little bugs, which I worked through one at a time.
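For the curious, the three steps amount to something like the following Python sketch. It is not the CityDesk code: the sqlite3 database, the one-table schema, and the names open_db and import_file are all assumptions made up for the example.

```python
import sqlite3

def open_db(location=":memory:"):
    """Create the hypothetical one-table schema used by this sketch."""
    db = sqlite3.connect(location)
    db.execute("CREATE TABLE IF NOT EXISTS files "
               "(id INTEGER PRIMARY KEY, name TEXT, body BLOB)")
    return db

def import_file(db, path):
    """First-draft import: open, slurp, store. Returns the new row id."""
    with open(path, "rb") as f:          # 1. open the file
        data = f.read()                  # 2. read it all into one big bytes object
    cur = db.execute(                    # 3. store the bytes in a record
        "INSERT INTO files (name, body) VALUES (?, ?)", (path, data))
    db.commit()
    return cur.lastrowid
```

Note that nothing here reports progress or can be interrupted: the call simply blocks until the whole file has been read and written.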

The big bug surfaced when I stress-tested it by dragging a 120 MB file into CityDesk. Now, it is not common by any means for people to post 120 MB files on their web sites. In fact, it’s quite rare. But it’s not impossible, either. The code worked but took almost a minute and provided no visual feedback — the app just froze and appeared to be completely locked up. This is obviously not ideal.

From a UI perspective, what I really wanted was for long operations to bring up a progress bar of some sort, along with a Cancel button. In the ideal world you would be able to continue doing other operations with CityDesk while the file copy proceeded in the background. There were three obvious ways to do this:

  1. From a single thread, polling frequently for input events
  2. By launching a second thread and synchronizing it carefully
  3. By launching a second process and synchronizing it less carefully

My experience with #1 is that it never quite works. It is too hard to ensure that all the code throughout your application can be run safely while a file copy operation is in progress. And Eric S. Raymond has convinced me that threads are usually not as good a solution as separate processes: indeed years of experience have shown me that programming with multiple threads creates much additional complexity and introduces whole new categories of dangerously frightful heisenbugs. #3 seemed like a good solution, especially since our underlying database is multiuser and doesn’t mind lots of processes banging on it at the same time. So that’s what I’m planning to do when I get back from Thanksgiving vacation.

Notice, though, the big picture. We’ve gone from read the file/save it in the database to something significantly more complicated: launch a child process, tell it to read the file and save it in the database, add a progress bar and cancel button to the child process, and then some kind of mechanism so the child can notify the parent when the file has arrived so it can be displayed. There will also be some work passing command line arguments to the child process, and making sure the window focus behaves in an expected manner, and handling the case of the user shutting down their system while a file copy is in progress. I would guesstimate that when all is said and done I’ll have ten times as much code to handle large files gracefully, code that maybe 1% of our users will ever see.
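To give a feel for the extra moving parts, here is a stripped-down Python sketch of the child-process idea. It is only a sketch under stated assumptions: it copies to a plain destination file rather than a database, the child is an inline script (CHILD_SOURCE) rather than a separate executable, and copy_in_child, on_progress, and should_cancel are hypothetical names. Window focus, system shutdown, and cleanup of half-written files are all left out.

```python
import subprocess
import sys

# Source of the hypothetical child program: copy argv[1] to argv[2] in
# chunks, printing "copied total" after each chunk so the parent can draw
# a progress bar.
CHILD_SOURCE = r"""
import os, sys
src, dst = sys.argv[1], sys.argv[2]
total = os.path.getsize(src)
copied = 0
with open(src, "rb") as fin, open(dst, "wb") as fout:
    while True:
        chunk = fin.read(64 * 1024)
        if not chunk:
            break
        fout.write(chunk)
        copied += len(chunk)
        print(copied, total, flush=True)
"""

def copy_in_child(src, dst, on_progress, should_cancel=lambda: False):
    """Copy src to dst in a child process. Returns True if it exited cleanly."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE, src, dst],
        stdout=subprocess.PIPE, text=True)
    for line in child.stdout:            # one line per chunk copied
        copied, total = map(int, line.split())
        on_progress(copied, total)
        if should_cancel():              # e.g. the user hit Cancel
            child.terminate()
            break
    child.stdout.close()
    return child.wait() == 0
```

A caller would pass something like lambda copied, total: bar.update(copied / total) as on_progress, where bar is whatever progress widget the application has. In a real GUI the loop that reads the child's output would have to be driven from the event loop instead of blocking, which is one more piece of the tenfold code growth.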

And of course, a certain type of programmer will argue that my new child-process architecture is inferior to the original. It’s “bloated” (because of all the extra lines of code). It has more potential for bugs, because of all the extra lines of code. It’s overkill. It’s somehow emblematic of why Windows is an inferior operating system, they will say. What’s all this about progress indicators? they sneer. Just hit Ctrl+Z and then “ls -l” repeatedly and watch to see if the file size is growing!

A picture of the new door

The moral of the story is that sometimes fixing a 1% defect takes 500% effort. This is not unique to software, no sirree, now that I’m managing all these construction projects I can tell you that. Last week, our contractor finally put the finishing touches on the new Fog Creek offices. This consisted of installing shiny blue acrylic on the front doors, surrounded by aluminium trim with a screw every 20 cm. If you look closely at the picture, the aluminium trim goes all the way around each door. Where the doors meet, there are two pieces of vertical trim right next to each other. You can’t tell this from the picture, but the screws in the middle strips are almost but not exactly lined up. They are, maybe, 2 millimeters off. The carpenter working on this measured carefully, but he was installing the trim while the doors were on the ground, not mounted in place, and when the doors were mounted, “oops,” it became clear that the screws were not exactly lined up.

This is probably not that uncommon; there are lots of screws in our office that don’t line up perfectly. The problem is that fixing this once the holes are drilled would be ridiculously expensive. Since the correct placement for the screws is only a couple of millimeters away, you can’t just drill new holes in the door; you’d probably have to replace the whole door. It’s just not worth it. Another case where fixing a 1% defect takes 500% effort, and it explains why so many artifacts in our world are 99% good, not 100% good. (Our architect never stops raving about some really, really expensive house in Arizona where every screw lined up.)

It comes down to an attribute of software that most people think of as craftsmanship. When software is built by a true craftsman, all the screws line up. When you do something rare, the application behaves intelligently. More effort went into getting rare cases exactly right than getting the main code working. Even if it took an extra 500% effort to handle 1% of the cases.

Craftsmanship is, of course, incredibly expensive. The only way you can afford it is when you are developing software for a mass audience. Sorry, but internal HR applications developed at insurance companies are never going to reach this level of craftsmanship because there simply aren’t enough users to spread the extra cost out. For a shrinkwrapped software company, though, this level of craftsmanship is precisely what delights users and provides longstanding competitive advantage, so I’ll take the time and do it right. Bear with me.

The Law of Leaky Abstractions

There’s a key piece of magic in the engineering of the Internet which you rely on every single day. It happens in the TCP protocol, one of the fundamental building blocks of the Internet.

TCP is a way to transmit data that is reliable. By this I mean: if you send a message over a network using TCP, it will arrive, and it won’t be garbled or corrupted.

We use TCP for many things like fetching web pages and sending email. The reliability of TCP is why every exciting email from embezzling East Africans arrives in letter-perfect condition. O joy.

By comparison, there is another method of transmitting data called IP which is unreliable. Nobody promises that your data will arrive, and it might get messed up before it arrives. If you send a bunch of messages with IP, don’t be surprised if only half of them arrive, and some of those are in a different order than the order in which they were sent, and some of them have been replaced by alternate messages, perhaps containing pictures of adorable baby orangutans, or more likely just a lot of unreadable garbage that looks like the subject line of Taiwanese spam.

Here’s the magic part: TCP is built on top of IP. In other words, TCP is obliged to somehow send data reliably using only an unreliable tool.

To illustrate why this is magic, consider the following morally equivalent, though somewhat ludicrous, scenario from the real world.

Imagine that we had a way of sending actors from Broadway to Hollywood that involved putting them in cars and driving them across the country. Some of these cars crashed, killing the poor actors. Sometimes the actors got drunk on the way and shaved their heads or got nasal tattoos, thus becoming too ugly to work in Hollywood, and frequently the actors arrived in a different order than they had set out, because they all took different routes. Now imagine a new service called Hollywood Express, which delivered actors to Hollywood, guaranteeing that they would (a) arrive (b) in order (c) in perfect condition. The magic part is that Hollywood Express doesn’t have any method of delivering the actors, other than the unreliable method of putting them in cars and driving them across the country. Hollywood Express works by checking that each actor arrives in perfect condition, and, if he doesn’t, calling up the home office and requesting that the actor’s identical twin be sent instead. If the actors arrive in the wrong order Hollywood Express rearranges them. If a large UFO on its way to Area 51 crashes on the highway in Nevada, rendering it impassable, all the actors that went that way are rerouted via Arizona and Hollywood Express doesn’t even tell the movie directors in California what happened. To them, it just looks like the actors are arriving a little bit more slowly than usual, and they never even hear about the UFO crash.

That is, approximately, the magic of TCP. It is what computer scientists like to call an abstraction: a simplification of something much more complicated that is going on under the covers. As it turns out, a lot of computer programming consists of building abstractions. What is a string library? It’s a way to pretend that computers can manipulate strings just as easily as they can manipulate numbers. What is a file system? It’s a way to pretend that a hard drive isn’t really a bunch of spinning magnetic platters that can store bits at certain locations, but rather a hierarchical system of folders-within-folders containing individual files that in turn consist of one or more strings of bytes.
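The Hollywood Express trick can be sketched as a toy simulation in Python. This is not how a real TCP stack is written: unreliable_send and reliable_send are invented names, each character of the message plays the role of one packet, acknowledgements are assumed to arrive perfectly, and corruption (which real TCP catches with checksums) is not modeled at all.

```python
import random

def unreliable_send(packets, rng, loss=0.3):
    """Stand-in for IP: drops some packets and shuffles the rest."""
    survivors = [p for p in packets if rng.random() > loss]
    rng.shuffle(survivors)
    return survivors

def reliable_send(message, rng, loss=0.3, max_rounds=1000):
    """Stand-in for TCP: number the pieces, resend until all have arrived."""
    packets = list(enumerate(message))        # (sequence number, payload)
    received = {}                             # receiver's reassembly buffer
    for _ in range(max_rounds):
        missing = [p for p in packets if p[0] not in received]
        if not missing:                       # everything has arrived
            # Reassemble in sequence order, whatever order things came in.
            return "".join(received[i] for i in range(len(packets)))
        for seq, payload in unreliable_send(missing, rng, loss):
            received[seq] = payload
    raise TimeoutError("gave up: nothing is getting through")
```

The caller of reliable_send sees only the finished message, in order; the drops, the shuffling, and the repeated resends all stay hidden below, like the UFO crash in Nevada.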

Back to TCP. Earlier for the sake of simplicity I told a little fib, and some of you have steam coming out of your ears by now because this fib is driving you crazy. I said that TCP guarantees that your message will arrive. It doesn’t, actually. If your pet snake has chewed through the network cable leading to your computer, and no IP packets can get through, then TCP can’t do anything about it and your message doesn’t arrive. If you were curt with the system administrators in your company and they punished you by plugging you into an overloaded hub, only some of your IP packets will get through, and TCP will work, but everything will be really slow.

This is what I call a leaky abstraction. TCP attempts to provide a complete abstraction of an underlying unreliable network, but sometimes, the network leaks through the abstraction and you feel the things that the abstraction can’t quite protect you from. This is but one example of what I’ve dubbed the Law of Leaky Abstractions:

All non-trivial abstractions, to some degree, are leaky.

Abstractions fail. Sometimes a little, sometimes a lot. There’s leakage. Things go wrong. It happens all over the place when you have abstractions. Here are some examples.

  • Something as simple as iterating over a large two-dimensional array can have radically different performance if you do it horizontally rather than vertically, depending on the “grain of the wood” — one direction may result in vastly more page faults than the other direction, and page faults are slow. Even assembly programmers are supposed to be allowed to pretend that they have a big flat address space, but virtual memory means it’s really just an abstraction, which leaks when there’s a page fault and certain memory fetches take way more nanoseconds than other memory fetches.
  • The SQL language is meant to abstract away the procedural steps that are needed to query a database, instead allowing you to define merely what you want and let the database figure out the procedural steps to query it. But in some cases, certain SQL queries are thousands of times slower than other logically equivalent queries. A famous example of this is that some SQL servers are dramatically faster if you specify “where a=b and b=c and a=c” than if you only specify “where a=b and b=c” even though the result set is the same. You’re not supposed to have to care about the procedure, only the specification. But sometimes the abstraction leaks and causes horrible performance and you have to break out the query plan analyzer and study what it did wrong, and figure out how to make your query run faster.
  • Even though network libraries like NFS and SMB let you treat files on remote machines “as if” they were local, sometimes the connection becomes very slow or goes down, and the file stops acting like it was local, and as a programmer you have to write code to deal with this. The abstraction of “remote file is the same as local file” leaks. Here’s a concrete example for Unix sysadmins. If you put users’ home directories on NFS-mounted drives (one abstraction), and your users create .forward files to forward all their email somewhere else (another abstraction), and the NFS server goes down while new email is arriving, the messages will not be forwarded because the .forward file will not be found. The leak in the abstraction actually caused a few messages to be dropped on the floor.
  • C++ string classes are supposed to let you pretend that strings are first-class data. They try to abstract away the fact that strings are hard and let you act as if they were as easy as integers. Almost all C++ string classes overload the + operator so you can write s + “bar” to concatenate. But you know what? No matter how hard they try, there is no C++ string class on Earth that will let you type “foo” + “bar”, because string literals in C++ are always char*’s, never strings. The abstraction has sprung a leak that the language doesn’t let you plug. (Amusingly, the history of the evolution of C++ over time can be described as a history of trying to plug the leaks in the string abstraction. Why they couldn’t just add a native string class to the language itself eludes me at the moment.)
  • And you can’t drive as fast when it’s raining, even though your car has windshield wipers and headlights and a roof and a heater, all of which protect you from caring about the fact that it’s raining (they abstract away the weather), but lo, you have to worry about hydroplaning (or aquaplaning in England) and sometimes the rain is so strong you can’t see very far ahead so you go slower in the rain, because the weather can never be completely abstracted away, because of the law of leaky abstractions.
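The first example in the list is easy to try for yourself. The Python sketch below walks the same flat, row-major buffer in two orders; the sizes and function names are arbitrary choices made for illustration. Both functions compute the same sum, so any difference in running time comes purely from the order in which memory is touched. Be warned that in an interpreted language the interpreter overhead tends to drown out the effect; the dramatic gaps show up in compiled code on arrays too big for cache or physical memory.

```python
import time
from array import array

ROWS, COLS = 300, 300
# One flat, row-major buffer standing in for a 2-D array in memory.
grid = array("d", range(ROWS * COLS))

def sum_row_major(a):
    """Walk memory in storage order: consecutive addresses."""
    total = 0.0
    for r in range(ROWS):
        for c in range(COLS):
            total += a[r * COLS + c]
    return total

def sum_column_major(a):
    """Same elements, but each step jumps COLS items ahead."""
    total = 0.0
    for c in range(COLS):
        for r in range(ROWS):
            total += a[r * COLS + c]
    return total

def timed(fn, a):
    """Return (result, seconds taken) for one call of fn(a)."""
    start = time.perf_counter()
    result = fn(a)
    return result, time.perf_counter() - start
```

Comparing timed(sum_row_major, grid) with timed(sum_column_major, grid) gives the two timings. The abstraction says the two loops are interchangeable; the hardware underneath may disagree.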

One reason the law of leaky abstractions is problematic is that it means that abstractions do not really simplify our lives as much as they were meant to. When I’m training someone to be a C++ programmer, it would be nice if I never had to teach them about char*’s and pointer arithmetic. It would be nice if I could go straight to STL strings. But one day they’ll write the code “foo” + “bar”, and the compiler will reject it with a baffling complaint about adding two pointers, and then I’ll have to stop and teach them all about char*’s anyway. Or one day they’ll be trying to call a Windows API function that is documented as having an OUT LPTSTR argument and they won’t be able to understand how to call it until they learn about char*’s, and pointers, and Unicode, and wchar_t’s, and the TCHAR header files, and all that stuff that leaks up.

In teaching someone about COM programming, it would be nice if I could just teach them how to use the Visual Studio wizards and all the code generation features, but if anything goes wrong, they will not have the vaguest idea what happened or how to debug it and recover from it. I’m going to have to teach them all about IUnknown and CLSIDs and ProgIDs and … oh, the humanity!

In teaching someone about ASP.NET programming, it would be nice if I could just teach them that they can double-click on things and then write code that runs on the server when the user clicks on those things. Indeed ASP.NET abstracts away the difference between writing the HTML code to handle clicking on a hyperlink (<a>) and the code to handle clicking on a button. Problem: the ASP.NET designers needed to hide the fact that in HTML, there’s no way to submit a form from a hyperlink. They do this by generating a few lines of JavaScript and attaching an onclick handler to the hyperlink. The abstraction leaks, though. If the end-user has JavaScript disabled, the ASP.NET application doesn’t work correctly, and if the programmer doesn’t understand what ASP.NET was abstracting away, they simply won’t have any clue what is wrong.

The law of leaky abstractions means that whenever somebody comes up with a wizzy new code-generation tool that is supposed to make us all ever-so-efficient, you hear a lot of people saying “learn how to do it manually first, then use the wizzy tool to save time.” Code generation tools which pretend to abstract out something, like all abstractions, leak, and the only way to deal with the leaks competently is to learn about how the abstractions work and what they are abstracting. So the abstractions save us time working, but they don’t save us time learning.

And all this means that paradoxically, even as we have higher and higher level programming tools with better and better abstractions, becoming a proficient programmer is getting harder and harder.

During my first Microsoft internship, I wrote string libraries to run on the Macintosh. A typical assignment: write a version of strcat that returns a pointer to the end of the new string. A few lines of C code. Everything I did was right from K&R — one thin book about the C programming language.

Today, to work on CityDesk, I need to know Visual Basic, COM, ATL, C++, InnoSetup, Internet Explorer internals, regular expressions, DOM, HTML, CSS, and XML. All high level tools compared to the old K&R stuff, but I still have to know the K&R stuff or I’m toast.

Ten years ago, we might have imagined that new programming paradigms would have made programming easier by now. Indeed, the abstractions we’ve created over the years do allow us to deal with new orders of complexity in software development that we didn’t have to deal with ten or fifteen years ago, like GUI programming and network programming. And while these great tools, like modern OO forms-based languages, let us get a lot of work done incredibly quickly, suddenly one day we need to figure out a problem where the abstraction leaked, and it takes 2 weeks. And when you need to hire a programmer to do mostly VB programming, it’s not good enough to hire a VB programmer, because they will get completely stuck in tar every time the VB abstraction leaks.

The Law of Leaky Abstractions is dragging us down.

Five Worlds

Something important is almost never mentioned in all the literature about programming and software development, and as a result we sometimes misunderstand each other.

You’re a software developer. Me too. But we may not have the same goals and requirements. In fact there are several different worlds of software development, and different rules apply to different worlds.

You read a book about UML modeling, and nowhere does it say that it doesn’t make sense for programming device drivers. Or you read an article saying that “the 20MB runtime [required for .NET] is a NON issue” and it doesn’t mention the obvious: if you’re trying to write code for a 32KB ROM on a pager it very much is an issue!

I think there are five worlds here, sometimes intersecting, often not. The five are:

  1. Shrinkwrap
  2. Internal
  3. Embedded
  4. Games
  5. Throwaway

When you read the latest book about Extreme Programming, or one of Steve McConnell’s excellent books, or Joel on Software, or Software Development magazine, you see a lot of claims about how to do software development, but you hardly ever see any mention of what kind of development they’re talking about, which is unfortunate, because sometimes you need to do things differently in different worlds.

Let’s go over the categories briefly.

Shrinkwrap is software that needs to be used “in the wild” by a large number of people. It may be actually wrapped in cellophane and sold at CompUSA, or it may be downloaded over the Internet. It may be commercial or shareware or open source or GNU or whatever — the main point here is software that will be installed and used by thousands or millions of people.

Shrinkwrap has special problems which derive from two special properties:

  • Since it has so many users who often have alternatives, the user interface needs to be easier than average in order to achieve success.
  • Since it runs on so many computers, the code must be unusually resilient to variations between computers. Last week someone emailed me about a bug in CityDesk which only appears in Polish Windows, because of the way that operating system uses Right-Alt to enter special characters. We tested Windows 95, 95OSR2, 98, 98SE, Me, NT 4.0, Win 2000, and Win XP. We tested with IE 5.01, 5.5, or 6.0 installed. We tested US, Spanish, French, Hebrew, and Chinese Windows. But we hadn’t quite gotten around to Polish yet.

There are three major variations of shrinkwrap. Open Source software is often developed without anyone getting paid to develop it, which changes the dynamics a lot. For example, things that are not considered “fun” often don’t get done in an all-volunteer team, and, as Matthew Thomas points out eloquently, this can hurt usability. Development is much more likely to be geographically dispersed, which results in a radically different quality of team communication. It’s rare in the open source world to have a face to face conversation around a whiteboard drawing boxes and arrows, so the kind of design decisions which benefit from drawing boxes and arrows are usually decided poorly on such projects. As a result geographically dispersed teams have done far better at cloning existing software where little or no design is required.

Consultingware is a variant of shrinkwrap which requires so much customization and installation that you need an army of consultants to install it, at outrageous cost. CRM and CMS packages often fall in this category. One gets the feeling that they don’t actually do anything, they are just an excuse to get an army of consultants in the door billing at $300/hour. Although consultingware is disguised as shrinkwrap, the high cost of an implementation means this is really more like internal software.

Commercial web based software such as Salesforce.com or even the more garden variety eBay still needs to be easy to use and run on many browsers. Although the developers have the luxury of (at least) some control over the “deployment” environment — the computers in the data center — they have to deal with a wide variety of web browsers and a large number of users so I consider this basically a variation of shrinkwrap.

Internal software only has to work in one situation on one company’s computers. This makes it a lot easier to develop. You can make lots of assumptions about the environment under which it will run. You can require a particular version of Internet Explorer, or Microsoft Office, or Windows. If you need a graph, let Excel build it for you; everybody in our department has Excel. (But try that with a shrinkwrap package and you eliminate half of your potential customers.)

Here usability is a lower priority, because a limited number of people need to use the software, and they don’t have any choice in the matter, and they will just have to deal with it. Speed of development is more important. Because the value of the development effort is spread over only one company, the amount of development resources that can be justified is significantly less. Microsoft can afford to spend $500,000,000 developing an operating system that’s only worth about $80 to the average person. But when Detroit Edison develops an energy trading platform, that investment must make sense for a single company. To get a reasonable ROI you can’t spend as much as you would on shrinkwrap. So sadly lots of internal software sucks pretty badly.

Embedded Software has the unique property that it goes in a piece of hardware and in almost every case can never be updated. This is a whole different world, here. The quality requirements are much higher, because there are no second chances. You may be dealing with a processor that runs dramatically more slowly than the typical desktop processor, so you may spend a lot of time optimizing. Fast code is more important than elegant code. The input and output devices available to you may be limited. The GPS system in the car I rented last week had such pathetic I/O that the usability was dismal. Have you ever tried to input an address on one of these things? They displayed a “keyboard” on screen and you had to use the directional arrows to choose letters from five small matrices of 9 letters each. (The GPS in my own car has a touch screen which makes the UI dramatically better. But I digress.)

Games are unique for two reasons. First, the economics of game development are hit-oriented. Some games are hits, many more games are failures, and if you want to make money on game software you recognize this and make sure that you have a portfolio of games so that the blockbuster hit makes up for the losses on the failures. This is more like movies than software.

The bigger issue with the development of games is that there’s only one version. Once your users have played through Duke Nukem 3D, they are not going to upgrade to Duke Nukem 3.1D just to get some bug fixes and new weapons. With some exceptions, once somebody has played the game to the end, it’s boring to play it again. So games have the same quality requirements as embedded software and an incredible financial imperative to get it right the first time. Shrinkwrap developers have the luxury of knowing that if 1.0 doesn’t meet people’s needs and doesn’t sell, maybe 2.0 will.

Finally Throwaway code is code that you create temporarily solely for the purpose of obtaining something else, which you never need to use again once you obtain that thing. For example, you might write a little shell script that massages an input file that you got into the format you need it for some other purpose, and this is a one time operation.

There are probably other kinds of software development that I’m forgetting.

Here’s an important thing to know. Whenever you read one of those books about programming methodologies written by a full time software development guru/consultant, you can rest assured that they are talking about internal, corporate software development. Not shrinkwrapped software, not embedded software, and certainly not games. Why? Because corporations are the people who hire these gurus. They’re paying the bill. (Trust me, id software is not about to hire Ed Yourdon to talk about structured analysis.)

Last week Kent Beck made a claim that you don’t really need bug tracking databases when you’re doing Extreme Programming, because the combination of pair programming (with persistent code review) and test driven development (guaranteeing 100% code coverage of the automated tests) means you hardly ever have bugs. That didn’t sound right to me. I looked in our own bug tracking database here at Fog Creek to see what kinds of bugs were keeping it busy.

Lo and behold, I discovered that very few of the bugs in there would have been discovered with pair programming or test driven development. Many of our “bugs” are really what XP calls stories — basically, just feature requests. We’re using the bug tracking system as a way of remembering, prioritizing, and managing all the little improvements and big features we want to implement.

A lot of the other bugs were only discovered after much use in the field. The Polish keyboard thing. There’s no way pair programming was going to find that. And logical mistakes that never occurred to us in the way that different features work together. The larger and more complex a program, the more interactions between the features that you don’t think about. A particular unlikely sequence of characters ({${?, if you must know) that confuses the lexer. Some ftp servers produce an error when you delete a file that doesn’t exist (our ftp server does not complain so this never occurred to us.)

I carefully studied every bug. Out of 106 bugs we fixed for the service pack release of CityDesk, exactly 5 of them could have been prevented through pair programming or test driven design. We actually had more bugs that we knew about and thought weren’t important (only to be corrected by our customers!) than bugs that could have been caught by XP methods.

But Kent is right, for other types of development. For most corporate development applications, none of these things would be considered a bug. Program crashes on invalid input? Run it again, and this time watch your {${?’s! And we only have One Kind of FTP server and nobody in the whole company uses Polish Windows.

Most things in software development are the same no matter what kind of project you’re working on, but not everything. When somebody tells you about methodology, think about how it applies to the work you’re doing. Think about where the person is coming from. Steve McConnell, Steve Maguire, and I all come from a very narrow corner: the world of mass market shrinkwrap spreadsheet applications written in Redmond, Washington. As such we have higher bars for ease of use and lower bars for bugs. Most of the other methodology gurus make their living doing consulting for in house corporate development, and that’s what they’re talking about. In any case, we should all be able to learn something from each other.

Back to Basics

We spend a lot of time on this site talking about exciting Big Picture Stuff like .NET versus Java, XML strategy, Lock-In, competitive strategy, software design, architecture, and so forth. All this stuff is a layer cake, in a way. At the top layer, you’ve got software strategy. Below that, we think about architectures like .NET, and below that, individual products: software development products like Java or platforms like Windows.

Go lower on the cake, please. DLLs? Objects? Functions? No! Lower! At some point you’re thinking about lines of code written in programming languages.

Still not low enough. Today I want to think about CPUs. A little bit of silicon moving bytes around. Pretend you are a beginning programmer. Tear away all that knowledge you’ve built up about programming, software, management, and get back to the lowest level Von Neumann fundamental stuff. Wipe J2EE out of your mind for a moment. Think Bytes.

Why are we doing this? I think that some of the biggest mistakes people make even at the highest architectural levels come from having a weak or broken understanding of a few simple things at the very lowest levels. You’ve built a marvelous palace but the foundation is a mess. Instead of a nice cement slab, you’ve got rubble down there. So the palace looks nice but occasionally the bathtub slides across the bathroom floor and you have no idea what’s going on.

So today, take a deep breath. Walk with me, please, through a little exercise which will be conducted using the C programming language.

Remember the way strings work in C: they consist of a bunch of bytes followed by a null character, which has the value 0. This has two obvious implications:

  1. There is no way to know where the string ends (that is, the string length) without moving through it, looking for the null character at the end.
  2. Your string can’t have any zeros in it. So you can’t store an arbitrary binary blob like a JPEG picture in a C string.
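Here is a rough sketch of both implications in code. The name my_strlen is made up for this illustration; the standard library's strlen has to do the same walk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for strlen: the only way to learn the length
   is to walk the bytes until we hit the zero at the end. */
size_t my_strlen(const char *s)
{
    assert(s != NULL);
    const char *p = s;
    while (*p)          /* one step per byte in the string */
        p++;
    return (size_t)(p - s);
}
```

Hand it "Hello!" and it takes six steps to answer 6. Hand it "JP\0EG" and it answers 2, because the embedded zero looks exactly like the end of the string. That is why a JPEG doesn't fit in a C string.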

Why do C strings work this way? It’s because the PDP-7 minicomputer, on which UNIX and the C programming language were invented, had an ASCIZ string type. ASCIZ meant “ASCII with a Z (zero) at the end.”

Is this the only way to store strings? No, in fact, it’s one of the worst ways to store strings. For non-trivial programs, APIs, operating systems, class libraries, you should avoid ASCIZ strings like the plague. Why?

Let’s start by writing a version of the code for strcat, the function which appends one string to another.

void strcat( char* dest, char* src )
{
     while (*dest) dest++;
     while (*dest++ = *src++);
}

Study the code a bit and see what we’re doing here. First, we’re walking through the first string looking for its null-terminator. When we find it, we walk through the second string, copying one character at a time onto the first string.

This kind of string handling and string concatenation was good enough for Kernighan and Ritchie, but it has its problems. Here’s a problem. Suppose you have a bunch of names that you want to append together in one big string:

char bigString[1000];     /* I never know how much to allocate… */
bigString[0] = '\0';
strcat(bigString,"John, ");
strcat(bigString,"Paul, ");
strcat(bigString,"George, ");
strcat(bigString,"Joel ");

This works, right? Yes. And it looks nice and clean.

What is its performance characteristic? Is it as fast as it could be? Does it scale well? If we had a million strings to append, would this be a good way to do it?

No. This code uses the Shlemiel the painter’s algorithm. Who is Shlemiel? He’s the guy in this joke:

Shlemiel gets a job as a street painter, painting the dotted lines down the middle of the road. On the first day he takes a can of paint out to the road and finishes 300 yards of the road. “That’s pretty good!” says his boss, “you’re a fast worker!” and pays him a kopeck.

The next day Shlemiel only gets 150 yards done. “Well, that’s not nearly as good as yesterday, but you’re still a fast worker. 150 yards is respectable,” and pays him a kopeck.

The next day Shlemiel paints 30 yards of the road. “Only 30!” shouts his boss. “That’s unacceptable! On the first day you did ten times that much work! What’s going on?”

“I can’t help it,” says Shlemiel. “Every day I get farther and farther away from the paint can!”

(For extra credit, what are the real numbers?) This lame joke illustrates exactly what’s going on when you use strcat like I just did. Since the first part of strcat has to scan through the destination string every time, looking for that dang null terminator again and again, this function is much slower than it needs to be and doesn’t scale well at all. Lots of code you use every day has this problem. Many file systems are implemented in a way that it’s a bad idea to put too many files in one directory, because performance starts to drop off dramatically when you get thousands of items in one directory. Try opening an overstuffed Windows recycle bin to see this in action: it takes hours to show up, which is clearly not linear in the number of files it contains. There must be a Shlemiel the Painter’s Algorithm in there somewhere. Whenever something seems like it should have linear performance but it seems to have n-squared performance, look for hidden Shlemiels. They are often hidden by your libraries. Looking at a column of strcats or a strcat in a loop doesn’t exactly shout out “n-squared,” but that is what’s happening.

How do we fix this? A few smart C programmers implemented their own mystrcat as follows:

char* mystrcat( char* dest, char* src )
{
     while (*dest) dest++;
     while (*dest++ = *src++);
     return --dest;
}

What have we done here? At very little extra cost we’re returning a pointer to the end of the new, longer string. That way the code that calls this function can decide to append further without rescanning the string:

char bigString[1000];     /* I never know how much to allocate… */
char *p = bigString;
bigString[0] = '\0';
p = mystrcat(p,"John, ");
p = mystrcat(p,"Paul, ");
p = mystrcat(p,"George, ");
p = mystrcat(p,"Joel ");

This is, of course, linear in performance, not n-squared, so it doesn’t suffer from degradation when you have a lot of stuff to concatenate.
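If you want to see Shlemiel's walk back to the paint can in numbers, here is a minimal sketch. The counting_ functions and the two cost helpers are hypothetical, instrumented copies of the strcat and mystrcat shown above; they add up how many destination bytes get skipped over before any copying starts.

```c
#include <assert.h>
#include <stddef.h>

/* Instrumented strcat: *skipped grows by one for every byte of dest
   we have to walk past to find the terminator. */
void counting_strcat(char *dest, const char *src, size_t *skipped)
{
    assert(dest && src && skipped);
    while (*dest) { dest++; (*skipped)++; }
    while ((*dest++ = *src++) != '\0') { }
}

/* Instrumented mystrcat: same work, but returns the new end of the string. */
char *counting_mystrcat(char *dest, const char *src, size_t *skipped)
{
    assert(dest && src && skipped);
    while (*dest) { dest++; (*skipped)++; }
    while ((*dest++ = *src++) != '\0') { }
    return --dest;
}

/* Append piece n times, rescanning from the start each time. */
size_t shlemiel_cost(char *buf, const char *piece, int n)
{
    size_t skipped = 0;
    buf[0] = '\0';
    for (int i = 0; i < n; i++)
        counting_strcat(buf, piece, &skipped);
    return skipped;
}

/* Append piece n times, always continuing from the returned end pointer. */
size_t linear_cost(char *buf, const char *piece, int n)
{
    size_t skipped = 0;
    char *p = buf;
    buf[0] = '\0';
    for (int i = 0; i < n; i++)
        p = counting_mystrcat(p, piece, &skipped);
    return skipped;
}
```

Appending the two-byte string "ab" 100 times, the first version skips 0 + 2 + 4 + ... + 198 = 9900 bytes; the second skips none, because p is already sitting on the terminator. The caller has to supply a buffer big enough for the result (201 bytes here).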

The designers of Pascal were aware of this problem and “fixed” it by storing a byte count in the first byte of the string. These are called Pascal Strings. They can contain zeros and are not null terminated. Because a byte can only store numbers between 0 and 255, Pascal strings are limited to 255 bytes in length, but because they are not null terminated they occupy the same amount of memory as ASCIZ strings. The great thing about Pascal strings is that you never have to have a loop just to figure out the length of your string. Finding the length of a string in Pascal is one assembly instruction instead of a whole loop. It is monumentally faster.
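A minimal sketch of the idea in C, assuming a made-up PString type with pstr_len and pstr_append helpers (none of these names come from any real library):

```c
#include <assert.h>
#include <string.h>

/* A Pascal-style string: byte 0 holds the length, bytes 1..255 hold the data. */
typedef struct {
    unsigned char bytes[256];
} PString;

/* Finding the length is a single load. No loop. */
unsigned pstr_len(const PString *s)
{
    assert(s != NULL);
    return s->bytes[0];
}

/* Append n raw bytes (zeros are fine). Returns 0 on success,
   -1 if the result would not fit in 255 bytes. */
int pstr_append(PString *dest, const void *src, unsigned n)
{
    assert(dest != NULL && src != NULL);
    unsigned len = dest->bytes[0];
    if (n > 255u - len)
        return -1;                       /* the one-byte count can't go higher */
    memcpy(dest->bytes + 1 + len, src, n);
    dest->bytes[0] = (unsigned char)(len + n);
    return 0;
}
```

Appending never has to hunt for the end of the destination, since the count says exactly where it is. The price is the hard 255-byte ceiling.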

The old Macintosh operating system used Pascal strings everywhere. Many C programmers on other platforms used Pascal strings for speed. Excel uses Pascal strings internally which is why strings in many places in Excel are limited to 255 bytes, and it’s also one reason Excel is blazingly fast.

For a long time, if you wanted to put a Pascal string literal in your C code, you had to write:

char* str = "\006Hello!";

Yep, you had to count the bytes by hand, yourself, and hardcode it into the first byte of your string. Lazy programmers would do this, and have slow programs:

char str[] = "*Hello!";     /* an array, so the first byte is writable */
str[0] = strlen(str) - 1;

Notice in this case you’ve got a string that is null terminated (the compiler did that) as well as a Pascal string. I used to call these fucked strings because it’s easier than calling them null terminated pascal strings but this is a rated-G channel so you will have to use the longer name.

I elided an important issue earlier. Remember this line of code?

char bigString[1000];     /* I never know how much to allocate… */

Since we’re looking at the bits today I shouldn’t have ignored this. I should have done this correctly: figured out how many bytes I needed and allocated the right amount of memory.

Shouldn’t I have?

Because otherwise, you see, a clever hacker will read my code and notice that I’m only allocating 1000 bytes and hoping it will be enough, and they’ll find some clever way to trick me into strcatting a 1100 byte string into my 1000 bytes of memory, thus overwriting the stack frame and changing the return address so that when this function returns, it executes some code which the hacker himself wrote. This is what they’re talking about when they say that a particular program has a buffer overflow susceptibility. It was the number one cause of hacks and worms in the olden days before Microsoft Outlook made hacking easy enough for teenagers to do.
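For illustration only, here is a sketch of what a size-aware append might look like. The name bounded_mystrcat and its "pass me the end of the buffer" convention are assumptions of mine, not a standard function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bounded cousin of mystrcat. 'end' points one past the last
   byte of the destination buffer. Returns a pointer to the new terminator,
   or NULL (leaving the string untouched) if src does not fit. */
char *bounded_mystrcat(char *dest, const char *end, const char *src)
{
    assert(dest != NULL && end != NULL && src != NULL);

    while (dest < end && *dest)          /* find the terminator, but stay in bounds */
        dest++;
    if (dest == end)
        return NULL;                     /* no terminator inside the buffer */

    size_t room = (size_t)(end - dest);  /* bytes left, terminator included */
    size_t need = 0;
    while (src[need] != '\0')
        need++;
    if (need + 1 > room)
        return NULL;                     /* refuse instead of smashing the stack */

    for (size_t i = 0; i <= need; i++)   /* copies the terminator too */
        dest[i] = src[i];
    return dest + need;
}
```

With an 8-byte buffer, appending "John, " succeeds and appending "Paul, " afterwards gets a NULL back instead of scribbling over whatever lives next to the buffer. The caller still has to check for that NULL, which is exactly the kind of thing people forget.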

OK, so all those programmers are just lame-asses. They should have figured out how much memory to allocate.

But really, C does not make this easy on you. Let’s go back to my Beatles example:

char bigString[1000];     /* I never know how much to allocate… */
char *p = bigString;
bigString[0] = '\0';
p = mystrcat(p,"John, ");
p = mystrcat(p,"Paul, ");
p = mystrcat(p,"George, ");
p = mystrcat(p,"Joel ");

How much should we allocate? Let’s try doing this The Right Way.

char* bigString;
int i = 0;
i = strlen("John, ")
     + strlen("Paul, ")
     + strlen("George, ")
     + strlen("Joel ");
bigString = (char*) malloc (i + 1);
/* remember space for null terminator! */
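To finish the thought, here is one way the whole measure-then-copy routine could look when wrapped in a function. join_strings is a hypothetical helper I'm making up for the example; it takes an array of strings rather than four hardcoded Beatles.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Concatenate 'count' strings into a freshly allocated buffer.
   The caller owns the result and must free it. Returns NULL if malloc fails. */
char *join_strings(const char *const parts[], size_t count)
{
    assert(parts != NULL || count == 0);

    size_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += strlen(parts[i]);       /* first pass: just measuring */

    char *result = malloc(total + 1);    /* remember space for null terminator! */
    if (result == NULL)
        return NULL;

    char *p = result;
    for (size_t i = 0; i < count; i++) { /* second pass: actually copying */
        const char *s = parts[i];
        while (*s)
            *p++ = *s++;
    }
    *p = '\0';
    return result;
}
```

Every source byte gets touched twice, once to be counted and once to be copied. That is still linear, but it is twice the walking.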

My eyes glazeth over. You’re probably about ready to change the channel already. I don’t blame you, but bear with me because it gets really interesting.

We have to scan through all the strings once just figuring out how big they are, then we scan through them again concatenating. At least if you use Pascal strings the strlen operation is fast. Maybe we can write a version of strcat that reallocates memory for us.

That opens another whole can of worms: memory allocators. Do you know how malloc works? The nature of malloc is that it has a long linked list of available blocks of memory called the free chain. When you call malloc, it walks the linked list looking for a block of memory that is big enough for your request. Then it cuts that block into two blocks — one the size you asked for, the other with the extra bytes, and gives you the block you asked for, and puts the leftover block (if any) back into the linked list. When you call free, it adds the block you freed onto the free chain. Eventually, the free chain gets chopped up into little pieces and you ask for a big piece and there are no big pieces available the size you want. So malloc calls a timeout and starts rummaging around the free chain, sorting things out, and merging adjacent small free blocks into larger blocks. This takes 3 1/2 days. The end result of all this mess is that the performance characteristic of malloc is that it’s never very fast (it always walks the free chain), and sometimes, unpredictably, it’s shockingly slow while it cleans up. (This is, incidentally, the same performance characteristic of garbage collected systems, surprise surprise, so all the claims people make about how garbage collection imposes a performance penalty are not entirely true, since typical malloc implementations had the same kind of performance penalty, albeit milder.)

Smart programmers minimize the potential disruption of malloc by always allocating blocks of memory that are powers of 2 in size. You know, 4 bytes, 8 bytes, 16 bytes, 18446744073709551616 bytes, etc. For reasons that should be intuitive to anyone who plays with Lego, this minimizes the amount of weird fragmentation that goes on in the free chain. Although it may seem like this wastes space, it is also easy to see how it never wastes more than 50% of the space. So your program uses no more than twice as much memory as it needs to, which is not that big a deal.
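The rounding itself is tiny. A sketch, assuming a helper I'll call round_up_pow2 (not a library function) and a request of at least one byte:

```c
#include <assert.h>
#include <stddef.h>

/* Round a request up to the next power of two.
   Assumes n >= 1 and that the answer fits in a size_t. */
size_t round_up_pow2(size_t n)
{
    assert(n >= 1);
    size_t size = 1;
    while (size < n) {
        assert(size <= (size_t)-1 / 2);  /* doubling must not overflow */
        size *= 2;
    }
    return size;
}
```

Ask for 1000 bytes and you get 1024. The result is always less than twice the request, which is where the "never wastes more than 50%" claim comes from.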

Suppose you wrote a smart strcat function that reallocates the destination buffer automatically. Should it always reallocate it to the exact size needed? My teacher and mentor Stan Eisenstat suggests that when you call realloc, you should always double the size of memory that was previously allocated. That means that you never have to call realloc more than lg n times, which has decent performance characteristics even for huge strings, and you never waste more than 50% of your memory.
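Here is a minimal sketch of that doubling strategy. The GrowString type, the grow_append function, the 16-byte starting capacity, and the reallocs counter are all assumptions made for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char    *data;      /* null-terminated contents, NULL before the first append */
    size_t   len;       /* bytes in use, not counting the terminator */
    size_t   cap;       /* bytes allocated */
    unsigned reallocs;  /* how many times we called realloc, for illustration */
} GrowString;

/* Append src, doubling the buffer whenever it is too small.
   Returns 0 on success, -1 if realloc fails (the string is left intact). */
int grow_append(GrowString *s, const char *src)
{
    assert(s != NULL && src != NULL);
    size_t n = strlen(src);
    size_t need = s->len + n + 1;            /* +1 for the terminator */

    if (need > s->cap) {
        size_t cap = s->cap ? s->cap : 16;   /* assumed starting size */
        while (cap < need)
            cap *= 2;                        /* double, never "exactly enough" */
        char *bigger = realloc(s->data, cap);
        if (bigger == NULL)
            return -1;
        s->data = bigger;
        s->cap = cap;
        s->reallocs++;
    }
    memcpy(s->data + s->len, src, n + 1);    /* we know where the end is: no rescanning */
    s->len += n;
    return 0;
}
```

Start from a zeroed GrowString and append "ab" a thousand times: the buffer ends up at 2048 bytes after only 8 calls to realloc (16, 32, 64, and so on), instead of a thousand. Remember to free data when you're done.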

Anyway. Life just gets messier and messier down here in byte-land. Aren’t you glad you don’t have to write in C anymore? We have all these great languages like Perl and Java and VB and XSLT that never make you think of anything like this, they just deal with it, somehow. But occasionally, the plumbing infrastructure sticks up in the middle of the living room, and we have to think about whether to use a String class or a StringBuilder class, or some such distinction, because the compiler is still not smart enough to understand everything about what we’re trying to accomplish and is trying to help us not write inadvertent Shlemiel the Painter algorithms.


Last week I wrote that you can’t implement the SQL statement SELECT author FROM books fast when your data is stored in XML. Just in case everybody didn’t understand what I was talking about, and now that we’ve been rolling around in the CPU all day, this assertion might make more sense.

How does a relational database implement SELECT author FROM books? In a relational database, every row in a table (e.g. the books table) is exactly the same length in bytes, and every field is always at a fixed offset from the beginning of the row. So, for example, if each record in the books table is 100 bytes long, and the author field is at offset 23, then there are authors stored at byte 23, 123, 223, 323, etc. What is the code to move to the next record in the result of this query? Basically, it’s this:

pointer += 100;

One CPU instruction. Faaaaaaaaaast.
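Fleshed out a little, a scan over such a table might look like the sketch below. The 100-byte record and offset 23 come from the example above; the 40-byte author width, the zero padding, and the function name are assumptions I'm adding so the thing compiles.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum {
    RECORD_SIZE   = 100,  /* from the example above */
    AUTHOR_OFFSET = 23,   /* from the example above */
    AUTHOR_SIZE   = 40    /* assumed width of the author field, zero padded */
};

/* Count the rows whose author field equals 'wanted'. */
size_t count_authors_matching(const unsigned char *table, size_t row_count,
                              const char *wanted)
{
    assert(table != NULL && wanted != NULL);
    size_t hits = 0;
    const unsigned char *end = table + row_count * RECORD_SIZE;

    for (const unsigned char *row = table; row != end; row += RECORD_SIZE) {
        /* row += RECORD_SIZE is the entire "move to the next record" step */
        const char *author = (const char *)(row + AUTHOR_OFFSET);
        if (strncmp(author, wanted, AUTHOR_SIZE) == 0)
            hits++;
    }
    return hits;
}
```

Notice that nothing in the loop depends on what is stored in the other fields, or on how long the previous author's name was.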

Now let’s look at the books table in XML.

<?xml blah blah>
<books>
     <book>
          <title>UI Design for Programmers</title>
          <author>Joel Spolsky</author>
     </book>
     <book>
          <title>The Chop Suey Club</title>
          <author>Bruce Weber</author>
     </book>
</books>

Quick question. What is the code to move to the next record?

Uh…

At this point a good programmer would say, well, let’s parse the XML into a tree in memory so that we can operate on it reasonably quickly. The amount of work that has to be done here by the CPU to SELECT author FROM books will bore you absolutely to tears. As every compiler writer knows, lexing and parsing are the slowest part of compiling. Suffice it to say that it involves a lot of string stuff, which we discovered is slow, and a lot of memory allocation stuff, which we discovered is slow, as we lex, parse, and build an abstract syntax tree in memory. That assumes that you have enough memory to load the whole thing at once. With relational databases, the performance of moving from record to record is fixed and is, in fact, one CPU instruction. That’s very much by design. And thanks to memory mapped files you only have to load the pages of disk that you are actually going to use. With XML, if you preparse, the performance of moving from record to record is fixed but there’s a huge startup time, and if you don’t preparse, the performance of moving from record to record varies based on the length of the record before it and is still hundreds of CPU instructions long.
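To make the "don't preparse" case concrete, here is a deliberately naive sketch. It is not an XML parser: it ignores attributes, comments, CDATA, and entities, and the name next_author is made up. It just scans the text for the next author element.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Find the next <author>...</author> at or after *cursor, copy its text
   into out (truncating if needed), and advance *cursor past it.
   Returns 1 if an author was found, 0 otherwise. */
int next_author(const char **cursor, char *out, size_t out_size)
{
    assert(cursor != NULL && *cursor != NULL && out != NULL && out_size > 0);

    /* strstr has to look at every byte between here and the next tag */
    const char *open = strstr(*cursor, "<author>");
    if (open == NULL)
        return 0;
    const char *text = open + strlen("<author>");
    const char *close = strstr(text, "</author>");
    if (close == NULL)
        return 0;

    size_t n = (size_t)(close - text);
    if (n >= out_size)
        n = out_size - 1;                /* truncate rather than overflow */
    memcpy(out, text, n);
    out[n] = '\0';

    *cursor = close + strlen("</author>");
    return 1;
}
```

Compare this with the fixed-width table. Here, getting to the next author means wading through the whole title of the next book, one byte at a time, so the cost of each step depends on data you didn't even ask for.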

What this means to me is that you can’t use XML if you need performance and have lots of data. If you have a little bit of data, or if what you’re doing doesn’t have to be fast, XML is a fine format. And if you really want the best of both worlds, you have to come up with a way to store metadata next to your XML, something like Pascal strings’ byte count, which give you hints about where things are in the file so that you don’t have to parse and scan for them. But of course then you can’t use text editors to edit the file because that messes up the metadata, so it’s not really XML anymore.

For those three gracious members of my audience who are still with me at this point, I hope you’ve learned something or rethought something. I hope that thinking about boring first-year computer-science stuff like how strcat and malloc actually work has given you new tools to think about the latest, top level, strategic and architectural decisions that you make in dealing with technologies like XML. For homework, think about why Transmeta chips will always feel sluggish. Or why the original HTML spec for TABLES was so badly designed that large tables on web pages can’t be shown quickly to people with modems. Or about why COM is so dang fast but not when you’re crossing process boundaries. Or about why the NT guys put the display driver into kernelspace instead of userspace.

These are all things that require you to think about bytes, and they affect the big top-level decisions we make in all kinds of architecture and strategy. This is why my view of teaching is that first year CS students need to start at the basics, using C and building their way up from the CPU. I am actually physically disgusted that so many computer science programs think that Java is a good introductory language, because it’s “easy” and you don’t get confused with all that boring string/malloc stuff but you can learn cool OOP stuff which will make your big programs ever so modular. This is a pedagogical disaster waiting to happen. Generations of graduates are descending on us and creating Shlemiel The Painter algorithms right and left and they don’t even realize it, since they fundamentally have no idea that strings are, at a very deep level, difficult, even if you can’t quite see that in your perl script. If you want to teach somebody something well, you have to start at the very lowest level. It’s like Karate Kid. Wax On, Wax Off. Wax On, Wax Off. Do that for three weeks. Then Knocking The Other Kid’s Head off is easy.

Emacs