2001/12/30

Made With CityDesk

The People wanted a Made With CityDesk banner. Who am I to argue with The People?

Doc says, “When the software industry is mature, it will look a lot like the construction industry.” This is a common theme. A lot of people complain about how the creation of software suffers from all kinds of problems that mature professions (construction, civil engineering, etc.) don’t. The key problem is always that the software coding process doesn’t seem to be reproducible and predictable. Programmers think things will take 10 weeks and then they take 20 weeks, and they have weird bugs that you would never tolerate in a bridge built by a civil engineer.

Implicit in the claim that the software industry is “immature” is the belief that this is just because we haven’t learned all the tricks yet to getting reproducible results. But this idea rests on a falsehood. The unique thing about software is that it is infinitely clonable. Once you’ve written a subroutine, you can call it as often as you want. This means that almost everything we do as software developers is something that has never been done before. This is very different from what construction workers do. Herman the Handyman, who just installed a tile floor for me, has probably installed hundreds of tile floors. He has to keep installing tile floors again and again as long as new tile floors are needed. We in the software industry would have long since written a Tile Floor Template Library (TFTL) and generating new tile floors would be trivial. (OK, maybe there would be six versions of the library, one for Delphi, one for Perl, etc. And some sick puppy programmers like me would rewrite it. But only once, and I would use it everywhere I needed a tile floor, and I would try to convince my clients that their back lawn would look really nice with tile instead of grass.)

In software, the boring problems are solved. Everything you do is on the cutting edge by definition. So by definition it is unpredictable. That’s why software has more of a science nature than a construction nature.

Getting Things Done When You’re Only a Grunt

This site is supposed to be about software management. But sometimes you don’t have the power to create change in your organization by executive fiat. Obviously, if you’re just a grunt programmer at the bottom of the totem pole, you can’t exactly order people to start creating schedules or bug databases. And in fact even if you’re a manager, you’ve probably discovered that managing developers is a lot like herding cats, only not as fun. Merely saying “make it so” doesn’t make it so.

It can be frustrating when you’re working in an organization that scores low on The Joel Test. No matter how good your code is, your coworkers write such bad code that you’re embarrassed to be associated with the project. Or management is making bad decisions about what code to write, so you’re forced to waste your talent debugging the AS/400 version of a retirement-planning game for kids.

You could just leave, I suppose. But presumably, there’s some reason you’re stuck there. The stock options haven’t quite vested, there’s no better place to work in Podunk, or perhaps your boss is holding someone you love hostage. In any case, dealing with life on a bad team can be infuriating. But there are strategies for improving your team from the bottom, and I’d like to share a few of them.

Strategy 1: Just Do It

A lot can be done to improve the project just by one person doing it. Don’t have a daily build server? Make one. Set your own machine up with a scheduled job to make builds at night and send out email results. Does it take too many steps to make the build? Write the makefile. Nobody does usability tests? Do your own hallway usability tests on the mailroom folks with a piece of paper or a VB prototype.

Strategy 2: Harness the Power of Viral Marketing

Many of the Joel Test strategies can be implemented by a single person on an uncooperative team. Some of them, if done well, will spread to the rest of the team.

For example, suppose nobody on your team can be persuaded to use a bug database. Don’t let it bother you. Just keep your own. Enter bugs that you find in your own code. If you find a bug that somebody else really should fix, assign the bug to them using the bug database. If you have good bug tracking software, this will send them an email. But now, you can keep sending them emails if they don’t fix the bug. Eventually, they’ll see the value of bug tracking and start to use the system as it was intended. If the QA team refuses to input bugs to the bug tracking system, simply refuse to listen to bug reports through any other channel. About the three-thousandth time that you say to people, “listen, I’d love to fix that, but I’m going to forget. Can you enter a bug in the system?” they’ll start using the database.

Nobody on your team wants to use source control? Create your own CVS repository, on your own hard drive if necessary. Even without cooperation, you can check your code in independently from everybody else’s. Then when they have problems that source control can solve (someone accidentally types rm * ~ instead of rm *~), they’ll come to you for help. Eventually, people will realize that they can have their own checkouts, too.

Strategy 3: Create a Pocket of Excellence

The team won’t make schedules? Or specs? Write your own. Nobody’s going to complain if you take a day or two to write a minimal spec and schedule for the work you’re about to do.

Get better people into the team. Get involved in hiring and interviewing, and recruit good candidates to join the team.

Find the people who are willing to improve and capable of it, and get them on your side. Even on poor teams, you’re likely to have some smart people who just don’t have the experience to create great code. Help them out. Set them up to learn. Read their code checkins. If they do something stupid, don’t send them a snooty email explaining what’s stupid about their checkins. That will just make them angry and defensive. Instead, innocently report the bug that you know is the result of the checkin. Let them figure out what’s causing it. When they find the bug for themselves, they’ll remember that lesson a lot better.

Strategy 4: Neutralize the Bozos

Even the best teams can have a bozo or two. The frustrating part about having bad programmers on your team is when their bad code breaks your good code, or good programmers have to spend time cleaning up after the bad programmers.

As a grunt, your goal is damage-minimization, a.k.a. containment. At some point, one of these geniuses will spend two weeks writing a bit of code that is so unbelievably bad that it can never work. You’re tempted to spend the fifteen minutes that it takes to rewrite the thing correctly from scratch. Resist the temptation. You’ve got a perfect opportunity to neutralize this moron for several months. Just keep reporting bugs against their code. They will have no choice but to keep slogging away at it for months until you can’t find any more bugs. Those are months in which they can’t do any damage anywhere else.

Strategy 5: Get Away From Interruptions

All happy work environments are alike (private offices, quiet working conditions, excellent tools, few interruptions and even fewer large meetings). All unhappy work environments are unhappy in their own way.

The bad news is that changing the working environment is almost impossible in virtually any company. Long term leases may mean that even the CEO can’t do anything about it. That’s why so few software developers get private offices. This hurts their companies in at least two ways. First, it makes it harder to recruit top notch developers, who will prefer the firm that gives them cushier conditions (all else being equal). Second, the level of interruptions can dramatically reduce the productivity of developers, who find it impossible to get into the zone and stay in it for any length of time.

Look for ways to get out of this environment. Take a laptop to the company cafeteria, where there are lots of tables that are empty most of the day (and nobody can find you). Book a conference room for the whole day and write code there, and make it clear through the preponderance of checkins just how much more work you get done when you’re in a room by yourself. The next time there’s a crunch on and your manager asks you what you need to Get This Done By Tomorrow, you know what to say. They’ll find you an office for the day. And pretty soon they’ll start wondering what they can do to keep that productive thing going year round.

Come into work late and leave late. Those hours after the rest of the company goes home can be the most productive. Or, if you’re on a team of developers who regularly come in late, get into work at 9 am. You’ll do more work in the two hours before other people come in and start bothering you than you do in the rest of the day.

Don’t keep your email or IM client running. Check your email every hour, if you want, but don’t keep it running.

Strategy 6: Become Invaluable

None of these strategies work if you’re not really an excellent contributor. If you don’t write good code, and lots of it, you’re just going to be resented for messing around with bug databases when you “should be” writing code. There’s nothing more deadly to your career than having a reputation of being so concerned with process that you don’t accomplish anything.

Once, when I started a new job as a grunt programmer at a new company, I discovered that the company was running somewhere around 2 on the Joel Test, and I was determined to fix it. But I also knew that making a good first impression was crucial. So I allocated the first seven hours of every day to just writing code, as was expected of me. There’s nothing like a flurry of checkins to make you look good to the rest of the development team. But I reserved another hour every afternoon before going home for improving the process. I used that time to fix things that made it hard to debug our product. I set up a daily build and a bug database. I fixed all the longstanding annoyances that made development difficult. I wrote specs for the work that I was doing during the day. I wrote a document explaining step-by-step how to create a development machine from scratch. I thoroughly documented an important internal language which had been undocumented. Slowly, the process got better and better. Nobody but me (and my team, when I was put in charge of one) ever did schedules or specs, but other than that we hit about 10 on the Joel Test.

You can make things better, even when you’re not in charge, but you have to be Caesar’s Wife: above suspicion. Otherwise you’ll make enemies as you go along.

Any other ideas?

2001/12/25


WhatCounts Replaces Mailman

I used to use Mailman to send the notification email to the 7568 subscribers who signed up. It had some problems. The biggest one was that every email that went out had to be exactly the same, and there was no way to include a recipient’s email address in the text of the outgoing message. This made it a pain to unsubscribe people who didn’t quite know why they were subscribed (usually they had an old email address forwarding to their current address). In fact I’m embarrassed to admit that under the old system the only way to unsubscribe people was for me to go through all the unsubscribe messages and remove people manually. Why I put up with this for so long is completely beyond me.

Thanks to David Geller at WhatCounts, I have a new system to manage the notification email. Every outgoing message will automatically have a custom return address on the WhatCounts server and a custom link you can click to unsubscribe. I hope this will make list management, finally, a totally painless process. The only problem with the new system is that if you receive one of my notifications and reply to it, I never see the reply. WhatCounts will assume it’s a bounce or an unsubscribe request and deal with it appropriately. If you want to contact me you have to send email, rather than just replying to the message.

2001/12/19

I’m still working my way through a pile of books I bought with the purpose of updating my list of recommended books.  One of these days I’ll get to the bottom of the pile and update the recommended list. All in good time!

FogBUGZ 3.0 Development

The FogBUGZ 3.0 development process is in full swing now. One of our first priorities is to refactor and clean up some of the code. There were a bunch of places with duplicated or almost-duplicated code; our logic wasn’t cleanly separated from the UI; the HTML we generated was based on ancient browsers. It’s the kind of project that would tempt a lesser programmer to “throw it out and start from scratch.” Instead, we’re going through the code one file at a time, cutting and pasting blocks, scrubbing the HTML (to use fully valid XHTML with all formatting isolated in style sheets), and creating new classes for underlying logic that used to be scattered all over the place.

In the ideal world, when you refactor, you simply apply almost mechanical transformations to an existing code base that can be clearly understood to have no effect on the correctness of the code. The simplest example I can think of is changing “if not x then A else B” to “if x then B else A”. You know that you can always do such a thing without breaking the code, and if it’s a little clearer to read, it’s a safe fix.
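Rendered in C, with hypothetical `handle_empty`/`handle_full` stand-ins just so the sketch compiles, that condition-inversion looks like this:

```c
#include <assert.h>

/* Hypothetical handlers, purely to make the example self-contained */
static int handle_empty(void) { return 0; }
static int handle_full(void)  { return 1; }

/* Before: negated condition */
int dispatch_before(int x)
{
    if (!x) return handle_empty(); else return handle_full();
}

/* After: the same logic with the branches swapped -- a purely
   mechanical transformation that cannot change behavior */
int dispatch_after(int x)
{
    if (x) return handle_full(); else return handle_empty();
}
```

Because the transformation is mechanical, you can convince yourself it preserves behavior without understanding what the handlers do.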

Here’s a typical thing that happens. As I go through the code, I’ll find some SQL statement in the middle of some HTML. So I create a class and give it a method that executes the SQL statement, and call that method from the HTML. I’ll start with the old SQL code and move it into my new class unchanged. Then I’ll look at that class for opportunities to apply more simple, bite-sized transformations that make it cleaner or faster. In the principle of eating our own dogfood, all daily changes go up on our internal bug database right away. Because we’re refactoring rather than starting from scratch, at any given time we always have a working version of FogBUGZ checked into CVS. In the worst scenario, we’ve introduced a small bug.

The best time to refactor is at the beginning of a development cycle. That gives you the maximum amount of time to catch any bugs that you introduced accidentally.

In the ideal world, we would have strong unit tests so that we could convince ourselves that nothing was broken that used to work. As we applied transformations, the unit tests would tell us automatically if we had broken anything and we could probably stay at zero bugs throughout the process. This is one of the valuable principles of Extreme Programming. We haven’t created unit tests yet, because all the HTML->XHTML+CSS cleanup that we’re doing makes FogBUGZ’s “correct” output a rapidly moving target. As the output stabilizes, we’ll build unit tests. We did not follow the XP principle of writing unit tests first, because it is so much easier to (a) write the code (b) hand-check the HTML that it produced (c) use that HTML as the basis for the unit test. If you do things in the XP order you have to figure out character-by-character what your output HTML is supposed to look like in advance, which strikes me as mind-numbingly tedious and a serious waste of time.

Humane Programming

Lazy programmers who didn’t read about how users don’t read anything:


  1. Most decent web programmers can always figure out how to design an interaction that won’t fail if you hit back or reload. These techniques are even older than cookies.
  2. Having three paragraphs of text telling you not to hit back or reload isn’t going to help, because people won’t read it.
  3. The programmer who wrote this screen admits that 10% of users “forget this warning.” It’s not because they forgot. It’s because humans are fallible and we’re used to hitting refresh when the screen is messed up, and we’re used to hitting back when we realize we went forward too soon. Don’t argue with me, Yahoo, when you’re the ones quoting statistics to prove it!

The main purpose of this screen seems to be so that users blame themselves when they hit reload and find themselves blown back to step one. Oh darn! I’m so stupid! say the users. Yes, that’s what happens. Watch any usability test where the product is failing – the users inevitably blame “their own stupidity.” Better that 100,000 users should feel stupid than one programmer admit he didn’t do a very good job.

Don’t let anyone tell you that as a programmer you don’t have to make moral or ethical decisions. Every time you decide that making users feel stupid is better than fixing your code, you’re making an ethical decision.

2001/12/13

Michael asks: “I’m interested in anyone’s ideas on either good or bad interface issues that they’ve seen while buying things. Remember that I’m selling software, so order tracking and fulfillment don’t really exist…”

I have really smart readers. Wow.

Google now has 20 years of Usenet online. You can entertain yourself with some scary stuff from my past.

  • my first post — this was rather a difficult trick because Penn didn’t officially allow undergrads to post to Usenet in those days (so I transferred out!) (April 1988)
  • my first Internet lecture — more than a decade before Joel on Software I had things to say. Not very interesting things though. (August 1988)
  • my first contribution to Open Source (September 1988)
  • my first software released under GPL (November 1988) — now the Slashdot kiddies have to shut up; I was writing GPLed software before they were born, bwa ha ha!
  • I was writing Macintosh code way back then, too. (August 1993)
  • First post from New York (September 1994)

Back to Basics

We spend a lot of time on this site talking about exciting Big Picture Stuff like .NET versus Java, XML strategy, Lock-In, competitive strategy, software design, architecture, and so forth. All this stuff is a layer cake, in a way. At the top layer, you’ve got software strategy. Below that, we think about architectures like .NET, and below that, individual products: software development products like Java or platforms like Windows.

Go lower on the cake, please. DLLs? Objects? Functions? No! Lower! At some point you’re thinking about lines of code written in programming languages.

Still not low enough. Today I want to think about CPUs. A little bit of silicon moving bytes around. Pretend you are a beginning programmer. Tear away all that knowledge you’ve built up about programming, software, management, and get back to the lowest level Von Neumann fundamental stuff. Wipe J2EE out of your mind for a moment. Think Bytes.

Why are we doing this? I think that some of the biggest mistakes people make even at the highest architectural levels come from having a weak or broken understanding of a few simple things at the very lowest levels. You’ve built a marvelous palace but the foundation is a mess. Instead of a nice cement slab, you’ve got rubble down there. So the palace looks nice but occasionally the bathtub slides across the bathroom floor and you have no idea what’s going on.

So today, take a deep breath. Walk with me, please, through a little exercise which will be conducted using the C programming language.

Remember the way strings work in C: they consist of a bunch of bytes followed by a null character, which has the value 0. This has two obvious implications:

  1. There is no way to know where the string ends (that is, the string length) without moving through it, looking for the null character at the end.
  2. Your string can’t have any zeros in it. So you can’t store an arbitrary binary blob like a JPEG picture in a C string.
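Both implications fall out of a few lines of C. Here is a minimal `strlen` sketch (the function name `my_strlen` is mine, to avoid shadowing the standard library):

```c
#include <assert.h>
#include <stddef.h>

/* Finding the length means walking every byte until the terminator:
   O(n) work on every single call. */
size_t my_strlen(const char *s)
{
    const char *p = s;
    while (*p)
        p++;
    return p - s;
}
```

And an embedded zero silently truncates: `my_strlen("JPEG\0data")` reports 4, not 9, because the scan stops at the first zero byte.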

Why do C strings work this way? It’s because the PDP-7 minicomputer, on which UNIX was first developed, had an ASCIZ string type. ASCIZ meant “ASCII with a Z (zero) at the end.”

Is this the only way to store strings? No, in fact, it’s one of the worst ways to store strings. For non-trivial programs, APIs, operating systems, class libraries, you should avoid ASCIZ strings like the plague. Why?

Let’s start by writing a version of the code for strcat, the function which appends one string to another.

void strcat( char* dest, char* src )
{
    while (*dest) dest++;
    while (*dest++ = *src++);
}

Study the code a bit and see what we’re doing here. First, we’re walking through the first string looking for its null-terminator. When we find it, we walk through the second string, copying one character at a time onto the first string.

This kind of string handling and string concatenation was good enough for Kernighan and Ritchie, but it has its problems. Here’s a problem. Suppose you have a bunch of names that you want to append together in one big string:

char bigString[1000];  /* I never know how much to allocate */
bigString[0] = '\0';
strcat(bigString,"John, ");
strcat(bigString,"Paul, ");
strcat(bigString,"George, ");
strcat(bigString,"Joel ");

This works, right? Yes. And it looks nice and clean.

What is its performance characteristic? Is it as fast as it could be? Does it scale well? If we had a million strings to append, would this be a good way to do it?

No. This code uses the Shlemiel the painter’s algorithm. Who is Shlemiel? He’s the guy in this joke:

Shlemiel gets a job as a street painter, painting the dotted lines down the middle of the road. On the first day he takes a can of paint out to the road and finishes 300 yards of the road. “That’s pretty good!” says his boss, “you’re a fast worker!” and pays him a kopeck.

The next day Shlemiel only gets 150 yards done. “Well, that’s not nearly as good as yesterday, but you’re still a fast worker. 150 yards is respectable,” and pays him a kopeck.

The next day Shlemiel paints 30 yards of the road. “Only 30!” shouts his boss. “That’s unacceptable! On the first day you did ten times that much work! What’s going on?”

“I can’t help it,” says Shlemiel. “Every day I get farther and farther away from the paint can!”

(For extra credit, what are the real numbers?) This lame joke illustrates exactly what’s going on when you use strcat like I just did. Since the first part of strcat has to scan through the destination string every time, looking for that dang null terminator again and again, this function is much slower than it needs to be and doesn’t scale well at all. Lots of code you use every day has this problem. Many file systems are implemented in a way that it’s a bad idea to put too many files in one directory, because performance starts to drop off dramatically when you get thousands of items in one directory. Try opening an overstuffed Windows recycle bin to see this in action — it takes hours to show up, which is clearly not linear in the number of files it contains. There must be a Shlemiel the Painter’s Algorithm in there somewhere. Whenever something seems like it should have linear performance but it seems to have n-squared performance, look for hidden Shlemiels. They are often hidden by your libraries. Looking at a column of strcats or a strcat in a loop doesn’t exactly shout out “n-squared,” but that is what’s happening.

How do we fix this? A few smart C programmers implemented their own mystrcat as follows:

char* mystrcat( char* dest, char* src )
{
    while (*dest) dest++;
    while (*dest++ = *src++);
    return --dest;
}

What have we done here? At very little extra cost we’re returning a pointer to the end of the new, longer string. That way the code that calls this function can decide to append further without rescanning the string:

char bigString[1000];  /* I never know how much to allocate */
char *p = bigString;
bigString[0] = '\0';
p = mystrcat(p,"John, ");
p = mystrcat(p,"Paul, ");
p = mystrcat(p,"George, ");
p = mystrcat(p,"Joel ");

This is, of course, linear in performance, not n-squared, so it doesn’t suffer from degradation when you have a lot of stuff to concatenate.

The designers of Pascal were aware of this problem and “fixed” it by storing a byte count in the first byte of the string. These are called Pascal Strings. They can contain zeros and are not null terminated. Because a byte can only store numbers between 0 and 255, Pascal strings are limited to 255 bytes in length, but because they are not null terminated they occupy the same amount of memory as ASCIZ strings. The great thing about Pascal strings is that you never have to have a loop just to figure out the length of your string. Finding the length of a string in Pascal is one assembly instruction instead of a whole loop. It is monumentally faster.
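A minimal sketch of a Pascal-style string in C (the `PStr` type and helper names are mine, not from any real Pascal runtime):

```c
#include <assert.h>
#include <string.h>

/* Length byte first, then up to 255 data bytes: no terminator needed,
   and embedded zeros are perfectly legal. */
typedef struct {
    unsigned char len;
    char data[255];
} PStr;

/* O(1): read the stored count instead of scanning for a zero byte */
unsigned char pstr_len(const PStr *s)
{
    return s->len;
}

void pstr_set(PStr *s, const char *cstr)
{
    size_t n = strlen(cstr);
    if (n > 255)
        n = 255;             /* the one-byte count caps length at 255 */
    s->len = (unsigned char)n;
    memcpy(s->data, cstr, n);
}
```

`pstr_len` is the whole point: it is a single load, no loop, no matter how long the string is.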

The old Macintosh operating system used Pascal strings everywhere. Many C programmers on other platforms used Pascal strings for speed. Excel uses Pascal strings internally which is why strings in many places in Excel are limited to 255 bytes, and it’s also one reason Excel is blazingly fast.

For a long time, if you wanted to put a Pascal string literal in your C code, you had to write:

char* str = "\006Hello!";

Yep, you had to count the bytes by hand, yourself, and hardcode it into the first byte of your string. Lazy programmers would do this, and have slow programs:

char str[] = "*Hello!";   /* an array, not a pointer: string literals aren't writable */
str[0] = strlen(str) - 1;

Notice in this case you’ve got a string that is null terminated (the compiler did that) as well as a Pascal string. I used to call these fucked strings because it’s easier than calling them null terminated Pascal strings, but this is a rated-G channel so you will have to use the longer name.

I elided an important issue earlier. Remember this line of code?

char bigString[1000];  /* I never know how much to allocate */

Since we’re looking at the bits today I shouldn’t have ignored this. I should have done this correctly: figured out how many bytes I needed and allocated the right amount of memory.

Shouldn’t I have?

Because otherwise, you see, a clever hacker will read my code and notice that I’m only allocating 1000 bytes and hoping it will be enough, and they’ll find some clever way to trick me into strcatting a 1100 byte string into my 1000 bytes of memory, thus overwriting the stack frame and changing the return address so that when this function returns, it executes some code which the hacker himself wrote. This is what they’re talking about when they say that a particular program has a buffer overflow susceptibility. It was the number one cause of hacks and worms in the olden days before Microsoft Outlook made hacking easy enough for teenagers to do.

OK, so all those programmers are just lame-asses. They should have figured out how much memory to allocate.

But really, C does not make this easy on you. Let’s go back to my Beatles example:

char bigString[1000];  /* I never know how much to allocate */
char *p = bigString;
bigString[0] = '\0';
p = mystrcat(p,"John, ");
p = mystrcat(p,"Paul, ");
p = mystrcat(p,"George, ");
p = mystrcat(p,"Joel ");

How much should we allocate? Let’s try doing this The Right Way.

char* bigString;
int i = 0;
i = strlen("John, ")
+ strlen("Paul, ")
+ strlen("George, ")
+ strlen("Joel ");
bigString = (char*) malloc (i + 1);
/* remember space for null terminator! */
...

My eyes glazeth over. You’re probably about ready to change the channel already. I don’t blame you, but bear with me because it gets really interesting.

We have to scan through all the strings once just figuring out how big they are, then we scan through them again concatenating. At least if you use Pascal strings the strlen operation is fast. Maybe we can write a version of strcat that reallocates memory for us.

That opens another whole can of worms: memory allocators. Do you know how malloc works? The nature of malloc is that it has a long linked list of available blocks of memory called the free chain. When you call malloc, it walks the linked list looking for a block of memory that is big enough for your request. Then it cuts that block into two blocks — one the size you asked for, the other with the extra bytes, and gives you the block you asked for, and puts the leftover block (if any) back into the linked list. When you call free, it adds the block you freed onto the free chain. Eventually, the free chain gets chopped up into little pieces and you ask for a big piece and there are no big pieces available the size you want. So malloc calls a timeout and starts rummaging around the free chain, sorting things out, and merging adjacent small free blocks into larger blocks. This takes 3 1/2 days. The end result of all this mess is that the performance characteristic of malloc is that it’s never very fast (it always walks the free chain), and sometimes, unpredictably, it’s shockingly slow while it cleans up. (This is, incidentally, the same performance characteristic of garbage collected systems, surprise surprise, so all the claims people make about how garbage collection imposes a performance penalty are not entirely true, since typical malloc implementations had the same kind of performance penalty, albeit milder.)

Smart programmers minimize the potential disruption of malloc by always allocating blocks of memory that are powers of 2 in size. You know, 4 bytes, 8 bytes, 16 bytes, 18446744073709551616 bytes, etc. For reasons that should be intuitive to anyone who plays with Lego, this minimizes the amount of weird fragmentation that goes on in the free chain. Although it may seem like this wastes space, it is also easy to see how it never wastes more than 50% of the space. So your program uses no more than twice as much memory as it needs to, which is not that big a deal.

Suppose you wrote a smart strcat function that reallocates the destination buffer automatically. Should it always reallocate it to the exact size needed? My teacher and mentor Stan Eisenstat suggests that when you call realloc, you should always double the size of memory that was previously allocated. That means that you never have to call realloc more than lg n times, which has decent performance characteristics even for huge strings, and you never waste more than 50% of your memory.
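Here is one sketch of what such a self-reallocating append might look like, using the doubling rule (the `StrBuf` type and function names are mine, and error handling for a failed malloc/realloc is omitted for brevity):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A growable string buffer. Doubling the capacity on overflow means
   at most lg n reallocs for n bytes appended, and never more than
   50% wasted space. */
typedef struct {
    char  *buf;
    size_t len;   /* bytes used, not counting the '\0' */
    size_t cap;   /* bytes allocated */
} StrBuf;

void sb_init(StrBuf *sb)
{
    sb->cap = 16;
    sb->len = 0;
    sb->buf = malloc(sb->cap);   /* error handling omitted for brevity */
    sb->buf[0] = '\0';
}

void sb_append(StrBuf *sb, const char *src)
{
    size_t n = strlen(src);
    while (sb->len + n + 1 > sb->cap) {
        sb->cap *= 2;                        /* double, don't grow by n */
        sb->buf = realloc(sb->buf, sb->cap);
    }
    memcpy(sb->buf + sb->len, src, n + 1);   /* append at the end: no rescan */
    sb->len += n;
}
```

The Beatles example becomes four `sb_append` calls: no guessed 1000-byte buffer, no Shlemiel rescans, and no n-squared blowup when the list gets long.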

Anyway. Life just gets messier and messier down here in byte-land. Aren’t you glad you don’t have to write in C anymore? We have all these great languages like Perl and Java and VB and XSLT that never make you think of anything like this, they just deal with it, somehow. But occasionally, the plumbing infrastructure sticks up in the middle of the living room, and we have to think about whether to use a String class or a StringBuilder class, or some such distinction, because the compiler is still not smart enough to understand everything about what we’re trying to accomplish and is trying to help us not write inadvertent Shlemiel the Painter algorithms.


Last week I wrote that you can’t implement the SQL statement SELECT author FROM books fast when your data is stored in XML. Just in case everybody didn’t understand what I was talking about, and now that we’ve been rolling around in the CPU all day, this assertion might make more sense.

How does a relational database implement SELECT author FROM books? In a relational database, every row in a table (e.g. the books table) is exactly the same length in bytes, and every field is always at a fixed offset from the beginning of the row. So, for example, if each record in the books table is 100 bytes long, and the author field is at offset 23, then there are authors stored at byte 23, 123, 223, 323, etc. What is the code to move to the next record in the result of this query? Basically, it’s this:

pointer += 100;

One CPU instruction. Faaaaaaaaaast.
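To make the fixed-layout idea concrete, here's a sketch of how such an engine would find the author of any row — the row size and offset are the invented numbers from above. Finding any field in any record is pure pointer arithmetic, no scanning:

```c
#include <stddef.h>

/* Fixed-size rows: every record is ROW_SIZE bytes, and the author
 * field always lives AUTHOR_OFFSET bytes into its row.  (Both numbers
 * are the made-up ones from the example, not real database constants.) */
#define ROW_SIZE      100
#define AUTHOR_OFFSET  23

const char *author_of(const char *table, size_t row)
{
    /* One multiply and one add -- no parsing, no searching. */
    return table + row * ROW_SIZE + AUTHOR_OFFSET;
}
```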

Now let’s look at the books table in XML.

<?xml blah blah>
<books>
    <book>
        <title>UI Design for Programmers</title>
        <author>Joel Spolsky</author>
    </book>
    <book>
        <title>The Chop Suey Club</title>
        <author>Bruce Weber</author>
    </book>
</books>

Quick question. What is the code to move to the next record?

Uh…

At this point a good programmer would say, well, let’s parse the XML into a tree in memory so that we can operate on it reasonably quickly. The amount of work that has to be done here by the CPU to SELECT author FROM books will bore you absolutely to tears. As every compiler writer knows, lexing and parsing are the slowest part of compiling. Suffice it to say that it involves a lot of string stuff, which we discovered is slow, and a lot of memory allocation stuff, which we discovered is slow, as we lex, parse, and build an abstract syntax tree in memory. That assumes that you have enough memory to load the whole thing at once. With relational databases, the performance of moving from record to record is fixed and is, in fact, one CPU instruction. That’s very much by design. And thanks to memory mapped files you only have to load the pages of disk that you are actually going to use. With XML, if you preparse, the performance of moving from record to record is fixed but there’s a huge startup time, and if you don’t preparse, the performance of moving from record to record varies based on the length of the record before it and is still hundreds of CPU instructions long.
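Here's what "move to the next record" looks like if you don't preparse — a minimal sketch using plain strstr to find the next book element (ignoring, for simplicity, CDATA sections, comments, and everything else a real XML parser has to worry about):

```c
#include <string.h>

/* Advance from one <book> record to the next by scanning forward
 * through the raw text.  The cost is proportional to the length of
 * the record you're skipping over -- not one fixed pointer bump. */
const char *next_book(const char *p)
{
    return strstr(p + 1, "<book>");  /* walks every byte in between */
}
```

Compare that to pointer += 100: the relational version's cost is constant, while this one gets slower the longer each record happens to be.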

What this means to me is that you can’t use XML if you need performance and have lots of data. If you have a little bit of data, or if what you’re doing doesn’t have to be fast, XML is a fine format. And if you really want the best of both worlds, you have to come up with a way to store metadata next to your XML, something like Pascal strings’ byte count, which give you hints about where things are in the file so that you don’t have to parse and scan for them. But of course then you can’t use text editors to edit the file because that messes up the metadata, so it’s not really XML anymore.

For those three gracious members of my audience who are still with me at this point, I hope you’ve learned something or rethought something. I hope that thinking about boring first-year computer-science stuff like how strcat and malloc actually work has given you new tools to think about the latest, top level, strategic and architectural decisions that you make in dealing with technologies like XML. For homework, think about why Transmeta chips will always feel sluggish. Or why the original HTML spec for TABLES was so badly designed that large tables on web pages can’t be shown quickly to people with modems. Or about why COM is so dang fast but not when you’re crossing process boundaries. Or about why the NT guys put the display driver into kernelspace instead of userspace.

These are all things that require you to think about bytes, and they affect the big top-level decisions we make in all kinds of architecture and strategy. This is why my view of teaching is that first year CS students need to start at the basics, using C and building their way up from the CPU. I am actually physically disgusted that so many computer science programs think that Java is a good introductory language, because it’s “easy” and you don’t get confused with all that boring string/malloc stuff but you can learn cool OOP stuff which will make your big programs ever so modular. This is a pedagogical disaster waiting to happen. Generations of graduates are descending on us and creating Shlemiel The Painter algorithms right and left and they don’t even realize it, since they fundamentally have no idea that strings are, at a very deep level, difficult, even if you can’t quite see that in your perl script. If you want to teach somebody something well, you have to start at the very lowest level. It’s like Karate Kid. Wax On, Wax Off. Wax On, Wax Off. Do that for three weeks. Then Knocking The Other Kid’s Head off is easy.
