Back to Basics

We spend a lot of time on this site talking about exciting Big Picture Stuff like .NET versus Java, XML strategy, Lock-In, competitive strategy, software design, architecture, and so forth. All this stuff is a layer cake, in a way. At the top layer, you’ve got software strategy. Below that, we think about architectures like .NET, and below that, individual products: software development products like Java or platforms like Windows.

Go lower on the cake, please. DLLs? Objects? Functions? No! Lower! At some point you’re thinking about lines of code written in programming languages.

Still not low enough. Today I want to think about CPUs. A little bit of silicon moving bytes around. Pretend you are a beginning programmer. Tear away all that knowledge you’ve built up about programming, software, management, and get back to the lowest level Von Neumann fundamental stuff. Wipe J2EE out of your mind for a moment. Think Bytes.

Why are we doing this? I think that some of the biggest mistakes people make even at the highest architectural levels come from having a weak or broken understanding of a few simple things at the very lowest levels. You’ve built a marvelous palace but the foundation is a mess. Instead of a nice cement slab, you’ve got rubble down there. So the palace looks nice but occasionally the bathtub slides across the bathroom floor and you have no idea what’s going on.

So today, take a deep breath. Walk with me, please, through a little exercise which will be conducted using the C programming language.

Remember the way strings work in C: they consist of a bunch of bytes followed by a null character, which has the value 0. This has two obvious implications:

  1. There is no way to know where the string ends (that is, the string length) without moving through it, looking for the null character at the end.
  2. Your string can’t have any zeros in it. So you can’t store an arbitrary binary blob like a JPEG picture in a C string.
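
To make those two implications concrete, here is a minimal sketch of my own (the variable names are invented for illustration):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Implication 1: finding the length means walking the bytes until the 0. */
    const char *name = "Beatles";
    printf("%zu\n", strlen(name));    /* scans 7 bytes before it finds the terminator */

    /* Implication 2: an embedded zero byte ends the string early. */
    const char blob[] = { 'J', 'P', 0, 'E', 'G' };
    printf("%zu\n", strlen(blob));    /* prints 2; everything after the 0 is invisible to str* functions */
    return 0;
}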

Why do C strings work this way? It’s because the PDP-7 minicomputer, on which UNIX and the C programming language were invented, had an ASCIZ string type. ASCIZ meant “ASCII with a Z (zero) at the end.”

Is this the only way to store strings? No, in fact, it’s one of the worst ways to store strings. For non-trivial programs, APIs, operating systems, class libraries, you should avoid ASCIZ strings like the plague. Why?

Let’s start by writing a version of the code for strcat, the function which appends one string to another.

void strcat( char* dest, char* src )
{
    while (*dest) dest++;
    while (*dest++ = *src++);
}

Study the code a bit and see what we’re doing here. First, we’re walking through the first string looking for its null-terminator. When we find it, we walk through the second string, copying one character at a time onto the first string.

This kind of string handling and string concatenation was good enough for Kernighan and Ritchie, but it has its problems. Here’s a problem. Suppose you have a bunch of names that you want to append together in one big string:

char bigString[1000];  /* I never know how much to allocate */
bigString[0] = '\0';
strcat(bigString,"John, ");
strcat(bigString,"Paul, ");
strcat(bigString,"George, ");
strcat(bigString,"Joel ");

This works, right? Yes. And it looks nice and clean.

What is its performance characteristic? Is it as fast as it could be? Does it scale well? If we had a million strings to append, would this be a good way to do it?

No. This code uses the Shlemiel the painter’s algorithm. Who is Shlemiel? He’s the guy in this joke:

Shlemiel gets a job as a street painter, painting the dotted lines down the middle of the road. On the first day he takes a can of paint out to the road and finishes 300 yards of the road. “That’s pretty good!” says his boss, “you’re a fast worker!” and pays him a kopeck.

The next day Shlemiel only gets 150 yards done. “Well, that’s not nearly as good as yesterday, but you’re still a fast worker. 150 yards is respectable,” and pays him a kopeck.

The next day Shlemiel paints 30 yards of the road. “Only 30!” shouts his boss. “That’s unacceptable! On the first day you did ten times that much work! What’s going on?”

“I can’t help it,” says Shlemiel. “Every day I get farther and farther away from the paint can!”

(For extra credit, what are the real numbers?) This lame joke illustrates exactly what’s going on when you use strcat like I just did. Since the first part of strcat has to scan through the destination string every time, looking for that dang null terminator again and again, this function is much slower than it needs to be and doesn’t scale well at all. Lots of code you use every day has this problem. Many file systems are implemented in a way that it’s a bad idea to put too many files in one directory, because performance starts to drop off dramatically when you get thousands of items in one directory. Try opening an overstuffed Windows recycle bin to see this in action — it takes hours to show up, which is clearly not linear in the number of files it contains. There must be a Shlemiel the Painter’s Algorithm in there somewhere. Whenever something seems like it should have linear performance but it seems to have n-squared performance, look for hidden Shlemiels. They are often hidden by your libraries. Looking at a column of strcats or a strcat in a loop doesn’t exactly shout out “n-squared,” but that is what’s happening.
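
To see the hidden Shlemiel spelled out, here is a small sketch (the function and variable names are mine, invented for illustration): every call to strcat walks the whole destination again before copying anything, so appending n short strings costs roughly 1 + 2 + ... + n scans, which is n-squared.

#include <string.h>

/* Sketch of the quadratic pattern: strcat rescans everything appended so far
   on every iteration before it copies the new piece. */
void append_all(char *dest, const char **items, int n)
{
    dest[0] = '\0';
    for (int i = 0; i < n; i++)
        strcat(dest, items[i]);   /* walks strlen(dest) bytes, then copies items[i] */
}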

How do we fix this? A few smart C programmers implemented their own mystrcat as follows:

char* mystrcat( char* dest, char* src )
{
    while (*dest) dest++;
    while (*dest++ = *src++);
    return --dest;
}

What have we done here? At very little extra cost we’re returning a pointer to the end of the new, longer string. That way the code that calls this function can decide to append further without rescanning the string:

char bigString[1000];  /* I never know how much to allocate */
char *p = bigString;
bigString[0] = '\0';
p = mystrcat(p,"John, ");
p = mystrcat(p,"Paul, ");
p = mystrcat(p,"George, ");
p = mystrcat(p,"Joel ");

This is, of course, linear in performance, not n-squared, so it doesn’t suffer from degradation when you have a lot of stuff to concatenate.

The designers of Pascal were aware of this problem and “fixed” it by storing a byte count in the first byte of the string. These are called Pascal Strings. They can contain zeros and are not null terminated. Because a byte can only store numbers between 0 and 255, Pascal strings are limited to 255 bytes in length, but because they are not null terminated they occupy the same amount of memory as ASCIZ strings. The great thing about Pascal strings is that you never have to have a loop just to figure out the length of your string. Finding the length of a string in Pascal is one assembly instruction instead of a whole loop. It is monumentally faster.
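
As a rough illustration (this is my own sketch, not how any particular Pascal compiler laid things out; the struct and function names are invented), a length-prefixed string in C looks something like this:

#include <string.h>

/* Hypothetical length-prefixed string, in the spirit of a Pascal string:
   the count lives in the first byte, so there is no terminator to scan for. */
typedef struct {
    unsigned char len;        /* 0..255, which is why Pascal strings cap at 255 bytes */
    char          data[255];  /* may contain zeros; not null terminated */
} PString;

static unsigned char pstr_len(const PString *s)
{
    return s->len;            /* one load, no loop */
}

static void pstr_set(PString *s, const char *src, unsigned char n)
{
    s->len = n;
    memcpy(s->data, src, n);  /* copies exactly n bytes, zeros and all */
}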

The old Macintosh operating system used Pascal strings everywhere. Many C programmers on other platforms used Pascal strings for speed. Excel uses Pascal strings internally which is why strings in many places in Excel are limited to 255 bytes, and it’s also one reason Excel is blazingly fast.

For a long time, if you wanted to put a Pascal string literal in your C code, you had to write:

char* str = "\006Hello!";

Yep, you had to count the bytes by hand, yourself, and hardcode it into the first byte of your string. Lazy programmers would do this, and have slow programs:

char str[] = "*Hello!";   /* an array, not a pointer to a string literal, so the write below is legal */
str[0] = strlen(str) - 1;

Notice in this case you’ve got a string that is null terminated (the compiler did that) as well as a Pascal string. I used to call these fucked strings because it’s easier than calling them null terminated Pascal strings but this is a rated-G channel so you will have to use the longer name.

I elided an important issue earlier. Remember this line of code?

char bigString[1000];  /* I never know how much to allocate */

Since we’re looking at the bits today I shouldn’t have ignored this. I should have done this correctly: figured out how many bytes I needed and allocated the right amount of memory.

Shouldn’t I have?

Because otherwise, you see, a clever hacker will read my code and notice that I’m only allocating 1000 bytes and hoping it will be enough, and they’ll find some clever way to trick me into strcatting a 1100 byte string into my 1000 bytes of memory, thus overwriting the stack frame and changing the return address so that when this function returns, it executes some code which the hacker himself wrote. This is what they’re talking about when they say that a particular program has a buffer overflow susceptibility. It was the number one cause of hacks and worms in the olden days before Microsoft Outlook made hacking easy enough for teenagers to do.
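
A hedged sketch of the defensive alternative (the buffer capacity handling and the helper name are invented for illustration): know how big the buffer is, and refuse to copy past it.

#include <stddef.h>
#include <string.h>

/* Sketch: append src to dest only if it fits in a buffer of 'cap' bytes
   (including the terminator). Returns 0 on success, -1 if it would overflow. */
static int safe_append(char *dest, size_t cap, const char *src)
{
    size_t used = strlen(dest);
    size_t need = strlen(src);
    if (used + need + 1 > cap)
        return -1;                 /* would overflow: refuse instead of smashing the stack */
    memcpy(dest + used, src, need + 1);
    return 0;
}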

OK, so all those programmers are just lame-asses. They should have figured out how much memory to allocate.

But really, C does not make this easy on you. Let’s go back to my Beatles example:

char bigString[1000];  /* I never know how much to allocate */
char *p = bigString;
bigString[0] = '\0';
p = mystrcat(p,"John, ");
p = mystrcat(p,"Paul, ");
p = mystrcat(p,"George, ");
p = mystrcat(p,"Joel ");

How much should we allocate? Let’s try doing this The Right Way.

char* bigString;
int i = 0;
i = strlen("John, ")
    + strlen("Paul, ")
    + strlen("George, ")
    + strlen("Joel ");
bigString = (char*) malloc (i + 1);
/* remember space for null terminator! */
...

My eyes glazeth over. You’re probably about ready to change the channel already. I don’t blame you, but bear with me because it gets really interesting.

We have to scan through all the strings once just figuring out how big they are, then we scan through them again concatenating. At least if you use Pascal strings the strlen operation is fast. Maybe we can write a version of strcat that reallocates memory for us.

That opens another whole can of worms: memory allocators. Do you know how malloc works? The nature of malloc is that it has a long linked list of available blocks of memory called the free chain. When you call malloc, it walks the linked list looking for a block of memory that is big enough for your request. Then it cuts that block into two blocks — one the size you asked for, the other with the extra bytes, and gives you the block you asked for, and puts the leftover block (if any) back into the linked list. When you call free, it adds the block you freed onto the free chain. Eventually, the free chain gets chopped up into little pieces and you ask for a big piece and there are no big pieces available the size you want. So malloc calls a timeout and starts rummaging around the free chain, sorting things out, and merging adjacent small free blocks into larger blocks. This takes 3 1/2 days. The end result of all this mess is that the performance characteristic of malloc is that it’s never very fast (it always walks the free chain), and sometimes, unpredictably, it’s shockingly slow while it cleans up. (This is, incidentally, the same performance characteristic of garbage collected systems, surprise surprise, so all the claims people make about how garbage collection imposes a performance penalty are not entirely true, since typical malloc implementations had the same kind of performance penalty, albeit milder.)
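
Here is a toy sketch of the walk-and-split idea, just to make the free chain concrete; real allocators are far more sophisticated, and every name here is invented.

#include <stddef.h>

/* Toy free chain: a singly linked list of free blocks. This only sketches the
   first-fit walk-and-split idea; it is nothing like a production malloc. */
typedef struct Block {
    size_t        size;   /* usable bytes in this free block */
    struct Block *next;
} Block;

static Block *free_chain;  /* head of the free list */

static void *toy_malloc(size_t want)
{
    Block *prev = NULL;
    for (Block *b = free_chain; b != NULL; prev = b, b = b->next) {
        if (b->size < want)
            continue;                               /* too small: keep walking the chain */
        if (b->size >= want + sizeof(Block) + 16) { /* big enough to split */
            Block *rest = (Block *)((char *)(b + 1) + want);
            rest->size = b->size - want - sizeof(Block);
            rest->next = b->next;                   /* leftover piece goes back on the chain */
            b->size = want;
            b->next = rest;
        }
        if (prev) prev->next = b->next;             /* unlink the block we are handing out */
        else      free_chain = b->next;
        return b + 1;                               /* payload starts right after the header */
    }
    return NULL;  /* nothing big enough; a real malloc would coalesce or ask the OS for more */
}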

Smart programmers minimize the potential disruption of malloc by always allocating blocks of memory that are powers of 2 in size. You know, 4 bytes, 8 bytes, 16 bytes, 18446744073709551616 bytes, etc. For reasons that should be intuitive to anyone who plays with Lego, this minimizes the amount of weird fragmentation that goes on in the free chain. Although it may seem like this wastes space, it is also easy to see how it never wastes more than 50% of the space. So your program uses no more than twice as much memory as it needs to, which is not that big a deal.

Suppose you wrote a smart strcat function that reallocates the destination buffer automatically. Should it always reallocate it to the exact size needed? My teacher and mentor Stan Eisenstat suggests that when you call realloc, you should always double the size of memory that was previously allocated. That means that you never have to call realloc more than lg n times, which has decent performance characteristics even for huge strings, and you never waste more than 50% of your memory.
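
A sketch of that doubling strategy (the struct and function names are invented; error handling is abbreviated): keep the length and capacity next to the buffer, and grow geometrically so appending never rescans and rarely reallocates.

#include <stdlib.h>
#include <string.h>

/* Sketch of a growable string that doubles its capacity when it runs out. */
typedef struct {
    char  *buf;
    size_t len;   /* bytes used, not counting the terminator */
    size_t cap;   /* bytes allocated */
} GrowStr;

static int growstr_append(GrowStr *s, const char *src)
{
    size_t need = strlen(src);
    if (s->len + need + 1 > s->cap) {
        size_t newcap = s->cap ? s->cap : 16;
        while (s->len + need + 1 > newcap)
            newcap *= 2;                       /* double: at most about lg n reallocs overall */
        char *p = realloc(s->buf, newcap);
        if (!p) return -1;
        s->buf = p;
        s->cap = newcap;
    }
    memcpy(s->buf + s->len, src, need + 1);    /* we already know both lengths: no rescan */
    s->len += need;
    return 0;
}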

Anyway. Life just gets messier and messier down here in byte-land. Aren’t you glad you don’t have to write in C anymore? We have all these great languages like Perl and Java and VB and XSLT that never make you think of anything like this, they just deal with it, somehow. But occasionally, the plumbing infrastructure sticks up in the middle of the living room, and we have to think about whether to use a String class or a StringBuilder class, or some such distinction, because the compiler is still not smart enough to understand everything about what we’re trying to accomplish and is trying to help us not write inadvertent Shlemiel the Painter algorithms.


Last week I wrote that you can’t implement the SQL statement SELECT author FROM books fast when your data is stored in XML. In case that wasn’t clear, now that we’ve been rolling around in the CPU all day, this assertion might make more sense.

How does a relational database implement SELECT author FROM books? In a relational database, every row in a table (e.g. the books table) is exactly the same length in bytes, and every field is always at a fixed offset from the beginning of the row. So, for example, if each record in the books table is 100 bytes long, and the author field is at offset 23, then there are authors stored at byte 23, 123, 223, 323, etc. What is the code to move to the next record in the result of this query? Basically, it’s this:

pointer += 100;

One CPU instruction. Faaaaaaaaaast.
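
In C terms (the field widths here are made up for illustration), the fixed-width layout looks something like this, and “next record” really is just one pointer bump:

#include <stddef.h>
#include <stdio.h>

/* Hypothetical fixed-width row: every record is exactly sizeof(BookRow) bytes,
   so the author field sits at the same offset in every row. */
typedef struct {
    char title[60];
    char author[40];
} BookRow;

static void select_author(const BookRow *rows, size_t count)
{
    for (size_t i = 0; i < count; i++)
        printf("%s\n", rows[i].author);  /* rows + i: one multiply and one add, no scanning */
}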

Now let’s look at the books table in XML.

<?xml blah blah>
<books>
<book>
<title>UI Design for Programmers</title>
<author>Joel Spolsky</author>
</book>
<book>
<title>The Chop Suey Club</title>
<author>Bruce Weber</author>
</book>
</books>

Quick question. What is the code to move to the next record?

Uh…

At this point a good programmer would say, well, let’s parse the XML into a tree in memory so that we can operate on it reasonably quickly. The amount of work that has to be done here by the CPU to SELECT author FROM books will bore you absolutely to tears. As every compiler writer knows, lexing and parsing are the slowest part of compiling. Suffice it to say that it involves a lot of string stuff, which we discovered is slow, and a lot of memory allocation stuff, which we discovered is slow, as we lex, parse, and build an abstract syntax tree in memory. That assumes that you have enough memory to load the whole thing at once. With relational databases, the performance of moving from record to record is fixed and is, in fact, one CPU instruction. That’s very much by design. And thanks to memory mapped files you only have to load the pages of disk that you are actually going to use. With XML, if you preparse, the performance of moving from record to record is fixed but there’s a huge startup time, and if you don’t preparse, the performance of moving from record to record varies based on the length of the record before it and is still hundreds of CPU instructions long.
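
For contrast, here is a tiny sketch of the no-preparse case (the function name is invented): moving to the “next record” is a byte-by-byte search, and its cost depends on how long the previous record happened to be.

#include <string.h>

/* Sketch: without an index or a parse tree, finding the next record means
   scanning characters until the next <book> tag turns up. */
static const char *next_book(const char *xml)
{
    const char *p = strstr(xml, "<book>");   /* examines every byte until it matches */
    return p ? p + strlen("<book>") : NULL;  /* hundreds of instructions, not one */
}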

What this means to me is that you can’t use XML if you need performance and have lots of data. If you have a little bit of data, or if what you’re doing doesn’t have to be fast, XML is a fine format. And if you really want the best of both worlds, you have to come up with a way to store metadata next to your XML, something like Pascal strings’ byte count, which give you hints about where things are in the file so that you don’t have to parse and scan for them. But of course then you can’t use text editors to edit the file because that messes up the metadata, so it’s not really XML anymore.

For those three gracious members of my audience who are still with me at this point, I hope you’ve learned something or rethought something. I hope that thinking about boring first-year computer-science stuff like how strcat and malloc actually work has given you new tools to think about the latest, top level, strategic and architectural decisions that you make in dealing with technologies like XML. For homework, think about why Transmeta chips will always feel sluggish. Or why the original HTML spec for TABLES was so badly designed that large tables on web pages can’t be shown quickly to people with modems. Or about why COM is so dang fast but not when you’re crossing process boundaries. Or about why the NT guys put the display driver into kernelspace instead of userspace.

These are all things that require you to think about bytes, and they affect the big top-level decisions we make in all kinds of architecture and strategy. This is why my view of teaching is that first year CS students need to start at the basics, using C and building their way up from the CPU. I am actually physically disgusted that so many computer science programs think that Java is a good introductory language, because it’s “easy” and you don’t get confused with all that boring string/malloc stuff but you can learn cool OOP stuff which will make your big programs ever so modular. This is a pedagogical disaster waiting to happen. Generations of graduates are descending on us and creating Shlemiel The Painter algorithms right and left and they don’t even realize it, since they fundamentally have no idea that strings are, at a very deep level, difficult, even if you can’t quite see that in your perl script. If you want to teach somebody something well, you have to start at the very lowest level. It’s like Karate Kid. Wax On, Wax Off. Wax On, Wax Off. Do that for three weeks. Then Knocking The Other Kid’s Head off is easy.


A Hard Drill Makes an Easy Battle

I can’t say enough nice things about VMWare. This program has been amazingly helpful during the last few weeks as we tried to get CityDesk to work on every known version of 32-bit Windows. I have set up dozens of virtual machines, everything from a simple DOS partition (helpful as a starting point for installing other OSs), a bunch of combinations of NT 4.0, Chinese and Hebrew Win2K (even though our program is in English and doesn’t do anything fancy, we had various bugs that were revealed on these systems), assorted versions of Win 95/98/Me going all the way back to the August 1995 release, even a small network of machines with a primary domain controller which we used for testing FogBUGZ setup.


Getting code to work on the entire universe of Windows machines is a lot of work. That’s the real appeal of “write once, run anywhere” systems like Java. In theory, if you use the Java Virtual Machine, the burden is on the VM vendor to provide compatibility with all these platforms. In reality, as Java programmers learned, code is just too fragile for this to work very well. When I developed a game with Java I learned that Java’s inability to guarantee exactly when threads would run (a seemingly harmless concession to the fact that CPU scheduling is basically unpredictable) actually meant that on the Macintosh, some threads got starved, um, forever, basically, until the other threads happened to do I/O, which is not what I had assumed, and which made my game not very challenging on Macs. (This was in 1996. Don’t email me with workarounds, fixes, or to say that this bug has been fixed.)

Yesterday’s Bug Du Jour is an example of the kind of thing that trips people up. Michael was allocating some memory using the ancient Windows API GlobalAlloc. Later, he was calling the function GlobalSize to determine the size of that memory. On our development systems (Windows 2000) GlobalSize returns the same size you allocated. Allocate 13 bytes, and GlobalSize will return 13.

We got a bug report that “Copy and Paste don’t work” from a user with Windows 98. As you see from the screenshot above, I’ve got a Windows Me VM set up with VB6. Stepping through the code in the debugger I noticed the GlobalSize call, and remembered from the days of Win 95 that GlobalSize used to return the size of the actual allocated block, which was larger than you allocated and usually a multiple of 64. This led to the bug.

Now, the programmer at Microsoft who changed the behavior of GlobalSize probably thought he wasn’t breaking anything. The documentation for the function “GlobalSize” says clearly, “the size of a memory block may be larger than the size requested when the memory was allocated.” In fact the Microsoftee probably thought it was a minor and harmless improvement to GlobalSize to have it always return the size that you requested. Obviously any old code that doesn’t trust the return value from GlobalSize will continue to work. So why not improve the function?
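
A hedged sketch of the defensive habit this bug suggests (the struct and function names are invented, and this is not the CityDesk code): carry the size you asked for alongside the handle, and never treat GlobalSize as “the number of bytes I stored.”

#include <windows.h>

/* Sketch: remember the requested size yourself instead of asking GlobalSize later,
   because GlobalSize may legally return a larger value than you allocated. */
typedef struct {
    HGLOBAL handle;
    SIZE_T  requested;   /* what we actually asked for */
} TrackedBlock;

static TrackedBlock tracked_alloc(SIZE_T bytes)
{
    TrackedBlock b;
    b.handle    = GlobalAlloc(GMEM_MOVEABLE, bytes);
    b.requested = b.handle ? bytes : 0;
    /* On some Windows versions GlobalSize(b.handle) equals bytes, on others it is
       rounded up; either way, use b.requested when you need the data length. */
    return b;
}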

Not every programmer studies every line of the documentation for every function they use, and if the code is working, they tend to move on to something else. And it’s not like everything is documented all that well — the type of minutiae that I’m talking about here rarely gets discussed in documentation. And this is where you get these issues. The rest of the world discovered this problem, long the bane of WINE programmers, when the second web browser shipped, and suddenly everybody was noticing how the bugs they had counted on to make their pages look kewl weren’t there any more. As we speak zillions of HTMLers are bemoaning the fact that IE6 now adheres to the standard for what <CENTER> does to the text inside a table, and so their pages that looked left-justified look like wedding invitations in IE6.

How did we get into this mess? Larry Wall famously said, “People understand instinctively that the best way for computer programs to communicate with each other is for each of them to be strict in what they emit, and liberal in what they accept.” I think that the evolution of HTML has proven that this isn’t such a great idea. In fact, the stricter the API is about its input, the more likely the code is going to work in funny situations. The designers of Java got it right when they decided that nothing about the Java spec should leave any choice to the compiler developers (at least, not in the gratuitous way that C did, where the size of basic data types was not fixed). A better quote comes from Russian Field Marshal Suvorov: “A hard drill makes an easy battle.” You want your compiler and your development environment to be as strict as possible; you want it to literally generate random return values for GlobalSize so that you don’t get into the habit of counting on something that won’t be there everywhere; you want to use French international settings on Chinese Windows 2000 with an absurd color scheme, DVORAK keyboard, trackball, 640×480 VGA mode, and huge ugly fonts on your development system so that you remember to bake in the code that adjusts for all these things. Then your application will be buff and strong and it will laugh in the face of wimpy problems like people who use commas instead of dots as the decimal separator. Ha. I eat commas for breakfast, your code will say, with a Russian accent.

Anyway, this is what it takes to get software that works on hundreds of millions of computers. Those of you who develop apps that only have to run on one computer or in a controlled environment have it easy, but you’re getting flabby. One of these days you’ll need to get it to run on a second computer and you’ll need to pull an all nighter, installing a complete development environment on that computer and debugging for two hours, before you discover that you didn’t account for the possibility of spaces in the installation file path because the first computer didn’t have them. 

Hopefully in the future the concept of a Virtual Machine, whether it’s Java, .NET, or something else, will alleviate this pain, but we ain’t there yet. For now I’m happy that with ten minutes of debugging, I can make my app work for all the people who like pink on orange text and set up Windows accordingly. We spent probably 3 weeks of work, out of a 1 year development cycle, fixing configuration bugs. A small price to pay to increase the size of our potential customer base from just US Windows 2000 to the entire universe of NT 4, 95, 98, Me, and XP, worldwide. Cool.

Working on CityDesk, Part Five

As promised I’ve been writing a series of articles about the creation of CityDesk. The primary purpose of these articles is to share the reasoning behind some of the decisions we made as a software company, and the techniques we used in developing our flagship software product. Today I want to talk about the five guiding principles that shaped the architecture.

One of the biggest decisions we made about CityDesk was to make a standalone Windows executable. With all the hubbub about web services, not to mention the noisy Internet dotcom boom we just lived through, this was a non-obvious decision. Why is CityDesk one of those old-fashioned programs you download and run, instead of, say, a web site you go to, in the style of Blogger or Atomz? Or a web server you install, like the Big Iron content management systems (Interwoven, Vignette StoryServer, etc.)?

An even more non-obvious decision was not to use XML. People who were a little bit too close to Silicon Valley groupthink looked at us like we were crazy because CityDesk is not based on XML and is not delivered through a web browser.

Now we’re thanking our lucky stars that we didn’t have a bunch of stupid venture capitalists forcing us to copy all the other content management companies, and we’re grateful that we’re not in Silicon Valley where everyone meets at Bucks and Stanford University seminars and copies each other’s bad ideas, because the one thing we’ve heard from everybody who’s tried CityDesk, consistently, is that CityDesk is the easiest content management software they’ve ever seen, full stop. And we got this ease-of-use because we believed certain things about software.

1. You can’t do good UIs in a web browser.

When we started building CityDesk we kept hearing one thing about Big Iron content management systems, which were all web-browser-based. Despite the fact that these systems had web-based editing interfaces and advanced workflow, we kept hearing that in real life, reporters created their content in Microsoft Word, did the workflow the old way (email), and only at the very last minute, a secretary opened the Word file and cut and pasted it into the content management system.

Why?

Because nobody wants to compose in a big TEXTAREA on an HTML page.

There are too many things that are impossible to deliver properly in a web browser. First of all, there’s the latency. Even if the web server is running on your own machine, round trips still take a certain amount of time, so all web applications feel somewhat sluggish. Second, there is no decent text editing widget for web browsers. With CityDesk as a Windows program, I can give you a feature where you can drag a picture from the desktop into an article. There is no way to create that kind of user interface through the web. I can keep a word count in the corner of the screen and update it in the background whenever you stop typing. You can use Ctrl+S to save without losing your place in the document — something many writers have learned to do regularly. (Good luck creating a word processor inside a web browser that doesn’t instantly lose everything, without prompting, if the user closes the browser). CityDesk has menus. Remember menus? And they work exactly like you expected them to, because there’s no web browser with its own menu, full of irrelevant commands, that wants to eat the Alt key. We don’t have to waste any screen real estate on browser geegaws (like “back” buttons and spinning e-globes) that have no meaning for CityDesk. We get our own icon in the taskbar instead of looking like all the other web browsers you have open. I could go on for days about the nice UI things we can do in a Windows application that just can’t be done with a web browser. In fact there’s a whole chapter about it in the printed edition of my book.

“Sure,” you say, “you can’t get as slick an app through the web, but you get ubiquity! People can go to an Internet cafe in Phuket and …”  Yeah, OK, that sounds great for Hotmail. And in the future we will build a typical compromise-laden web interface to CityDesk for those times when you’re at the beach in Thailand. In the meantime, my number one priority is to make an application that’s easy to use so that people like it, so that they tell their friends about it, so that everybody buys it and we have to hire full time employees just to slit open the envelopes and take out the checks.

2. XML is a Dumb Format For Storing Data

I’m not sure why XML got so sexy. It has its advantages; it’s sure a good idea for data interchange or for all those little files you need to store settings. But for real work it just can’t do what a solid, multiuser, relational database can do. The next time some uninformed analyst at Gartner or Giga or Forrester tells you “in the future, everything will be XML,” ask them how to do “SELECT author FROM books” fast with XML. Hint: you can’t. It has to be slow. XML is not the way to store a lot of data. Now tell me how to insert a new book at the beginning of the table without massive bitblts. Of course, I doubt if there is an analyst in one of those companies who would even understand that sentence, but that’s life. In the meantime, CityDesk uses a relational database. Over and out.

3. Most People Don’t Control Their Web Servers

Of all the people in the world who need to put data up on the web, only a tiny percentage of them have any control whatsoever over what software runs on their web server. We decided that CityDesk has to be an all-client application because probably 95% of the people who maintain web sites simply do not have the ability to install a content management system on their web server, even if they wanted to. We reduced the server requirements of CityDesk to (a) any web server that knows how to serve static files and (b) ftp or file copy access to that server. We think this will open up content management to a whole new universe of potential users.

4. If You Can Only Do One Platform, Do Windows.

I’d love to have a Mac version and a Linux version, but they are not good uses of limited resources. Every dollar I invest in CityDesk Windows will earn me 20 times as many sales as a dollar invested in a hypothetical Mac version. Even if you assume that Mac has a higher percentage of creative and home users, I’m still going to sell a heck of a lot more copies on Windows than I could on Mac. And that means that to do a Mac version, the cost had better be under 10% of the cost of a Windows version. Unfortunately, that’s nowhere near true for CityDesk. We benefit from using libraries that are freely available on Windows (like the Jet multiuser ACID database engine and the DHTML edit control) for which there are no equivalents on the Macintosh. So if anything, a Mac port would cost more than the original Windows version. Until somebody does something about this fundamental economic truth, it’s hard to justify Mac versions from a business perspective. (Incidentally, I have said time and time again, if Apple wants to save the Mac, they have to change this equation.)

And don’t get me started about Linux. I don’t know of anyone making money off of Linux desktop software, and without making money, I can’t pay programmers and rent and buy computers and T1s. Despite romantic rhetoric, I really do need to pay the rent, so for now, you’re going to have to rely on college kids and the occasional charitable big company for your Linux software.

5. Content Management Can’t Require Programmers

One of the reasons Big Iron content management costs so much is the team of programmers it takes to figure out the thing and set it up. Even the free and low-cost content management systems are designed by geeks for geeks. People have said to me that there’s no market for CityDesk because “with XML and XSL it’s a solved problem.”

Uh huh. Thanks. Even I can’t figure out XML and XSL, and I’m pretty sharp.

We decided that everything in CityDesk has to be simple enough for web designers and HTMLers, who are not necessarily programmers. And for the end-user, even HTML is too much. To the end-user CityDesk can’t be any more complicated than a WYSIWYG word processor.

The rule of thumb with ease-of-use is that if you make your program 10% easier, you’ll double the potential number of users of your product. In CityDesk we made absolutely no compromises on ease of use. Sure, it’s not the most powerful system out there. We’ll compromise on that. And you can’t update your blog from an Internet Cafe in Bali. Another compromise. But I guarantee you that any HTMLer can create CityDesk sites, and anyone who can use a word processor can manage a site that was created for them by an HTMLer, and you’ll never need a programmer.

Summary

All in all we made a bunch of decisions based on a world view that, frankly, was not very popular during the Internet boom. There was a time when you couldn’t get funding for a company that wasn’t a web pure play. And the lemming VCs thought that they had to do what all the other VCs were doing: pure web interfaces. To which I say, Thank you! Thank you for keeping your dumb overfunded companies out of my way, because now, over here in GUI land, I can’t find any real competition, and now you’re not funding anyone, so it’s smooth sailing for Fog Creek.

Working on CityDesk Part: 1 2 3 4 5

Working on CityDesk, Part Four

Boy, what a terrible weekend. I’ve come down with a cold. Fever. Runny nose. General malaise. And WININET.DLL.

WININET.DLL, you see, is a software file provided by Microsoft. It comes with Internet Explorer and with all versions of Windows since about 1996, and it was the bane of my weekend. And it illustrates a fascinating fact about how much software developers’ day-to-day lives have changed in the last decade.

Here’s what happened. Late on Friday afternoon I was setting up some new web servers designed to handle the increased load we’re expecting when we ship CityDesk. One of the first things I did was try to use CityDesk to copy files to the new servers using the FTP protocol. We had a firewall set up that prevented FTP from going through. CityDesk froze up.

“No problem,” I thought, “I’ll use passive-mode FTP, which can get through that firewall.”

That’s when I noticed CityDesk doesn’t support passive-mode FTP.

“OK, how hard can that be to implement? It’s probably available as an option on the file transfer library we’re using.” One checkbox and I’m done.

But … where’s the checkbox?

No mention of it in the documentation.

Searching the object file itself didn’t turn up anything likely.

I checked Google Groups, formerly known as DejaNews. It’s a complete archive of every UseNet discussion. This is where programmers ask each other questions about the most arcane topics. As I told Babak once: it’s a big world out there. You’re never the first person to have this problem.

After about five minutes searching, the conclusion was inescapable. Our file transfer library, which Microsoft gives away for free with the Visual Basic compiler, can’t do passive-mode FTP.

OK, back to the drawing board. What are my choices? I wrote up a list:

  1. Do without Passive FTP. This would make CityDesk useless to a large number of people, something I wasn’t willing to do.
  2. Purchase a commercial FTP library. Honestly, I’ve always had bad luck with commercial libraries, having discovered one too many times that their code quality is rarely up to the meticulous standards we set for Fog Creek. When I looked around the discussion groups where developers were discussing other FTP libraries, they always seemed to have scary bugs that I couldn’t live with.
  3. Use Microsoft’s other file transfer library, the infamous WININET.DLL. This is actually what Microsoft Internet Explorer uses to transfer files, which, despite its dismal reputation, is used so widely that I thought it had to be reasonably bug free (by now, at least). Anyway, lots of programmers use WININET.DLL and if I run into trouble I’m sure to find ample discussion of the problems on DejaNews, er, Google Groups.

I thought that #3 seemed painless enough. In fact I was already using WININET.DLL somewhere else in our code to import web pages, using the HTTP protocol.

A search of Microsoft’s online knowledge base revealed that you can’t really do FTP with WININET.DLL from Visual Basic code; it does some complicated stuff with threads which means you have to call it from C or C++. I thought the easiest thing to do would be to create my own custom FTP control written in C++, which talked to WININET. I chose to use Microsoft’s ATL library to create the custom control, because it makes the smallest files. ATL is the most complicated programming environment in the world, requiring a brain the size of Colorado and 10 years of solid experience to understand what’s going on. I have studied ATL in depth three, maybe four times in my career and I can never remember all the bizarre template crap that’s going on in there. Nobody can.

Yes, Virginia, it is possible to create a software development environment which is so difficult to use that no human being can do it. ATL and COM+ are my two favorite examples (the latter is so complicated that only one man on Earth, Don Box, actually understands everything that’s going on). C++ itself comes pretty darn close. But most programmers are too macho to admit this.

Luckily, Microsoft provides some wizards with ATL which write all the hard code for you. If you want to do something unusual, you’re on your own, so my motto was Nothing Unusual. No sudden moves. Just add some simple methods and events and get the hell outta there, hopefully with my brains non-exploded.

At one point, after I had written almost half the code, I discovered that one of the checkboxes that I had checked on the wizard which wrote the inscrutable ATL code for me was wrong. But once the code is written, it’s written. For the life of me, I couldn’t figure out what the checkbox had done and where to change it in the code. I searched MSDN (a gigabyte of documentation on programming Windows which I keep on my hard drive), the online knowledge base, and finally the entire Internet using Google, and didn’t find an easy way to change it. So I created two entirely new projects using the wizard, checking the box in one case and not checking it in the other, and then ran WinDiff, which compares two entire directories listing all the differences, to find what the checkbox really changed so I could change it in my code. (Somewhere, I had to change a hardcoded number to 131473, because I wanted my controls to be visible at runtime. A classic example of why COM programming is not for humans.)

The Microsoft documentation for WinInet was pretty decent, as these things go, but not decent enough. In the page documenting FtpOpenFile, you find both of these mutually contradictory quotes:

  1. “No file handle is returned”
  2. “Return value: Returns a handle if successful”

Well, which is it? Empirically, it wasn’t returning a file handle.

The next thing I discovered is that if there is a packet filter (a simple form of firewall) somewhere between you and the server, the code hangs trying to copy files. That’s normal; it’s the way of the Internet; there’s nothing you can do about it. After a minute, WinInet will realize that packets are not getting through and will time out. But the user is likely to get impatient long before a minute is up and hit the Cancel button.

When you hit the Cancel button, my code tells WinInet to give up and close down the connection. But as I discovered, if you do this in one of these packet-filter situations, WinInet will simply crash, bringing your program down with it. It’s clearly a bug in Microsoft’s code. An exhaustive search of all my Internet sources found a couple of people reporting the same crashing behavior, but nobody had a workaround.

How can that be? I thought. Since Internet Explorer uses the exact same code, wouldn’t Internet Explorer crash in the same situation?

I tried it. What I discovered is that Internet Explorer doesn’t crash in this situation — it shows an hourglass and freezes for a couple of minutes, waiting for the time-out it knows it will get. This proves that Microsoft’s programmers knew about this bug in WinInet and worked around it, instead of just fixing the code in the first place. Stupid stupid stupid. For the umpteenth time, I found myself dependent on a code library which had a crashing bug that was unacceptable in code I shipped. What are you supposed to do if you’re the chef at Les Halles and your fishmonger is giving you smelly fish?

Another two hours of investigation and experimentation. Finally I decided that in this case, when the user hits the Cancel button, instead of freezing like Internet Explorer, I will simply hide the file transfer so it looks like the operation has been cancelled. In the background, invisible to the user, I’ll wait around for the time-out to happen.

So, as I said, developers’ lives have changed. All weekend I couldn’t sleep. Tossing and turning in sweat-drenched sheets, I had feverish ATL nightmares. Sunday morning I got up at 3 am and coded for 4 hours just to avoid the bad dreams about Structured Exception Handling.

Ten years ago, to write code, you needed to know a programming language, and you needed to know a library of maybe 50 functions that you used regularly. And those functions worked, every time, although some of them (gets) could not be used without creating security bugs.

Today, you need to know how to work with libraries of thousands of functions, representing buggy code written by other people. You can’t possibly learn them all, and the documentation is never good enough to write solid code, so you learn to use online resources like Google, DejaNews, MSDN. (I became much more productive after a coworker at Google showed me that you’re better off using Google to search Microsoft’s knowledge base rather than the pathetic search engine Microsoft supplies). In this new world, you’re better off using common languages like Visual Basic and common libraries like WinInet, because so many other people are using them it’s easier to find bug fixes and sample code on the Web. Last week, Michael added a feature to CityDesk to check the installed version of Internet Explorer. It’s not hard code to write, but why bother? It only took him a few seconds to find the code, in VB, on the Web and cut and paste it.

We used to write algorithms. Now we call APIs.

Nowadays a good programmer spends a lot of time doing defensive coding, working around other people’s bugs. It’s not uncommon to set up an exception handler to prevent your code from crashing when your crap library crashes.

Times have changed. Welcome to a world where the programmer who knows how to tap into other people’s brains and experience using the Internet has a decisive advantage.

Working on CityDesk Part: 1 2 3 4 5

Working on CityDesk, Part Three

Like all good hackers, I have been programming since junior high, using my dad’s account on the University of New Mexico IBM 360 mainframe. But the first real computer science course I took was during the first year of college. The course covered C and assembler. It had hundreds of students, most of whom, like me, had been fooling around in Pascal or Basic since they were toddlers.

The course proceeded merrily until one day, the professor introduced pointers.


Dread.

Suddenly, the majority of the students in the class were in deep trouble. For some reason, some people just do not seem to be able to write code with pointers in it. They were born without the part of the brain that does indirection, I guess.

Since then, I’ve mentally divided the world into three groups. The largest group of people can’t program at all. There’s another, smaller group of people who can program, but not with pointers. And there’s a tiny group of people who can program, even with pointers. Those elite few can even understand what it means to write CString*& in C++.

My first job at Microsoft was putting a decent scripting language into Excel. Although I was given free rein to implement whatever language I saw fit, we went with Basic for several reasons. (1) We had a Basic compiler team in house. (2) You didn’t need pointers, which meant that (3) more people are comfortable using Basic than any other language. Indeed Visual Basic is the best-selling language product of all time.

Visual Basic is an extremely productive way to write code, especially GUI code. Want bold text on a dialog box? It’s one click in VB. Now try doing it in MFC. You have to create a subclassed control, it’s a big mess, you have to know all about LOGFONTS and Windows window subclassing and a bunch of other things and you need about three lines of code once you have the magic class.

But many VB programs are spaghetti, either because they’re done as quick and dirty one-offs, or because they’re written by hack programmers without training in object oriented programming, or even structured programming.

What I wondered was, what happens if you take top-notch C++ programmers who dream in pointers, and let them code in VB. What I discovered at Fog Creek was that they become super-efficient coding machines. The code looks pretty good, it’s object-oriented and robust, but you don’t waste time using tools that are at a level lower than you need. I’ve spent years writing code for C++/MFC and years writing code in Visual Basic, and let me tell you, VB is just much, much more productive. Michael and I had a good laugh today when we discovered somebody selling a beta crash-reporting product at $5000 for three months that Michael implemented in CityDesk in two days. (And we actually implemented a good part of ours in C++/ATL). And I also guarantee you that our Visual Basic code in CityDesk looks a lot better than most of the code you find written in macho languages like C++, because we’re good programmers, and we write comments, and our variable names are well-chosen, and we do things the simple way, not the clever way, and so forth.

I’ll go out on a limb here. In my years of experience, I have seen many language and programming fads come and go. But there’s only ONE, that’s right, ONE language feature I’ve ever seen that actually improves your productivity significantly. No, it’s not object oriented programming; no, it’s not intentional programming or assertions or programming by example or CASE or UML or XML or Java. The only thing that improves your programming productivity is using managed code – that is, using a language in which memory management is automatic. Java and .NET languages do this with garbage collection; VB does this with reference counting; I don’t care how you do it, just let me concatenate strings without thinking about where the new bigger string will go and I’ll be happy.

One of the things about Visual Basic is that it doesn’t always give you access to the full repertoire of Windows goodies that you need to make a polished application. But what it does do, better than almost any other programming environment, is let you drop into C++ code (or call C APIs) when you’re desperate or when you need that extra speed. For example, you can always get the HWND of a control and do native stuff to it, which is not the case in Java. As another example, a lot of the non-GUI, time-sensitive inner loops, like the word counter, in CityDesk are actually implemented in C or C++ for speed. This ability gave us the confidence to use Visual Basic even though it can’t do everything and it tends to do string processing slowly. But since we’re all C++ programmers, we have no fear of creating a DLL or OCX to count words, or parse script, or call a Windows API. So about 5% of CityDesk is actually in C or C++, and we’ll probably move a little bit more of the code to C++ to speed up a few more inner loops.
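
As a rough illustration of the kind of tiny, time-sensitive inner loop that pays to be in plain C (the function name is invented; this is not CityDesk’s actual word counter):

#include <ctype.h>

/* Sketch of a time-sensitive inner loop in plain C: count whitespace-separated
   words in a buffer. */
static long count_words(const char *text)
{
    long words = 0;
    int in_word = 0;
    for (const char *p = text; *p; p++) {
        if (isspace((unsigned char)*p)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;           /* crossed from whitespace into a word */
            words++;
        }
    }
    return words;
}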

Now, Visual Basic is not the perfect programming language. It’s fairly object oriented but there are little things that you can’t do with it, like have base classes with implementation reuse. It’s limited to Windows. And the worst part about coding in VB is that people think you’re not cool because your code doesn’t have {‘s and }’s. I can live with the shame if it means I’m more productive.

Philosophically, I think that C# has a bright future in the Windows GUI programming world. It’s not as embarrassing as VB, and it uses the type of syntax which C/C++/Java programmers have come to love. VB programmers looking to upgrade can’t upgrade painlessly to VB.NET, because there are so many major differences in the programming environment. Even Microsoft admits that you can’t port from VB to VB.NET, you have to rewrite. And that’s enough of a pain that many VB programmers will use this opportunity to look around at what else is out there. I think many will choose C#, because it’s virtually the same language as VB.NET with slightly different syntax and vastly less stigma attached to it.

What about Java? Yes, I’ve used Java extensively, but unfortunately the language, the code libraries, and especially the GUI libraries are just too primitive for a commercial desktop application. I like the language and I appreciate the benefit of write-once-run-anywhere, but frankly not a lot of desktop software is sold for Sun Solaris and I think that WORA benefits Sun more than it benefits software developers, and I’m not willing to write an app that behaves in an inferior way on 95% of my customers’ computers to benefit the 5% with alternate platforms. Every Java app LOOKS like a Java app, takes forever to launch, and just doesn’t feel completely native. Since CityDesk’s competitive advantage comes from having an excellent GUI, that’s one area where I refuse to skimp.

I am virtually certain that I will now receive a million emails from lovers of tcl/tk, or Delphi, or C++ Builder, or NextStep, or Cocoa, or perl, or Python, or RealBASIC, or some other programming environment which may or may not be suitable for creating a professional Windows GUI. That’s nice. I don’t really want to get into a debate about language features or programming environment features — at some point, you have to stop debating and write code! I’m just trying to explain why we chose to use VB for the GUI part and C++ for time sensitive bits of code. If you do want to have a fun religious war over programming languages, please do so on the discussion group!

Working on CityDesk Part: 1 2 3 4 5

In Defense of Not-Invented-Here Syndrome

Time for a pop quiz.

1. Code Reuse is:

a) Good
b) Bad

2. Reinventing the Wheel is:

a) Good
b) Bad

3. The Not-Invented-Here Syndrome is:

a) Good
b) Bad

Of course, everybody knows that you should always leverage other people’s work. The correct answers are, of course, 1(a) 2(b) 3(b).

Right?

Not so fast, there!

The Not-Invented-Here Syndrome is considered a classic management pathology, in which a team refuses to use a technology that they didn’t create themselves. People with NIH syndrome are obviously just being petty, refusing to do what’s in the best interest of the overall organization because they can’t find a way to take credit. (Right?) The Boring Business History Section at your local megabookstore is rife with stories about stupid teams that spend millions of dollars and twelve years building something they could have bought at Egghead for $9.99. And everybody who has paid any attention whatsoever to three decades of progress in computer programming knows that Reuse is the Holy Grail of all modern programming systems.

Right. Well, that’s what I thought, too. So when I was the program manager in charge of the first implementation of Visual Basic for Applications, I put together a careful coalition of four, count them, four different teams at Microsoft to get custom dialog boxes in Excel VBA. The idea was complicated and fraught with interdependencies. There was a team called AFX that was working on some kind of dialog editor. Then we would use this brand new code from the OLE group which let you embed one app inside another. And the Visual Basic team would provide the programming language behind it. After a week of negotiation I got the AFX, OLE, and VB teams to agree to this in principle.

I stopped by Andrew Kwatinetz’s office. He was my manager at the time and taught me everything I know. “The Excel development team will never accept it,” he said. “You know their motto? ‘Find the dependencies — and eliminate them.’ They’ll never go for something with so many dependencies.”

In-ter-est-ing. I hadn’t known that. I guess that explained why Excel had its own C compiler.

By now I’m sure many of my readers are rolling on the floor laughing. “Isn’t Microsoft stupid,” you’re thinking, “they refused to use other people’s code and they even had their own compiler just for one product.”

Not so fast, big boy! The Excel team’s ruggedly independent mentality also meant that they always shipped on time, their code was of uniformly high quality, and they had a compiler which, back in the 1980s, generated pcode and could therefore run unmodified on Macintosh’s 68000 chip as well as Intel PCs. The pcode also made the executable file about half the size that Intel binaries would have been, which loaded faster from floppy disks and required less RAM.

“Find the dependencies — and eliminate them.” When you’re working on a really, really good team with great programmers, everybody else’s code, frankly, is bug-infested garbage, and nobody else knows how to ship on time. When you’re a cordon bleu chef and you need fresh lavender, you grow it yourself instead of buying it in the farmers’ market, because sometimes they don’t have fresh lavender or they have old lavender which they pass off as fresh.

Indeed during the recent dotcom mania a bunch of quack business writers suggested that the company of the future would be totally virtual — just a trendy couple sipping Chardonnay in their living room outsourcing everything. What these hyperventilating “visionaries” overlooked is that the market pays for value added. Two yuppies in a living room buying an e-commerce engine from company A and selling merchandise made by company B and warehoused and shipped by company C, with customer service from company D, isn’t honestly adding much value. In fact, if you’ve ever had to outsource a critical business function, you realize that outsourcing is hell. Without direct control over customer service, you’re going to get nightmarishly bad customer service — the kind people write about in their weblogs when they tried to get someone, anyone, from some phone company to do even the most basic thing. If you outsource fulfillment, and your fulfillment partner has a different idea about what constitutes prompt delivery, your customers are not going to be happy, and there’s nothing you can do about it, because it took 3 months to find a fulfillment partner in the first place, and in fact, you won’t even know that your customers are unhappy, because they can’t talk to you, because you’ve set up an outsourced customer service center with the explicit aim of not listening to your own customers. That e-commerce engine you bought? There’s no way it’s going to be as flexible as what Amazon does with obidos, which they wrote themselves. (And if it is, then Amazon has no advantage over their competitors who bought the same thing). And no off-the-shelf web server is going to be as blazingly fast as what Google does with their hand-coded, hand-optimized server.

This principle, unfortunately, seems to be directly in conflict with the ideal of “code reuse good — reinventing wheel bad.”

The best advice I can offer:

If it’s a core business function — do it yourself, no matter what.

Pick your core business competencies and goals, and do those in house. If you’re a software company, writing excellent code is how you’re going to succeed. Go ahead and outsource the company cafeteria and the CD-ROM duplication. If you’re a pharmaceutical company, write software for drug research, but don’t write your own accounting package. If you’re a web accounting service, write your own accounting package, but don’t try to create your own magazine ads. If you have customers, never outsource customer service.

If you’re developing a computer game where the plot is your competitive advantage, it’s OK to use a third party 3D library. But if cool 3D effects are going to be your distinguishing feature, you had better roll your own.

The only exception to this rule, I suspect, is if your own people are more incompetent than everyone else, so whenever you try to do anything in house, it’s botched up. Yes, there are plenty of places like this. If you’re in one of them, I can’t help you.


Working on CityDesk, Part Two

I’ve been trying to get a discussion board running for three days now, without much success. Installing server software is just much harder than installing Windows software. There are always multiple complicated steps that involve permissions, accounts, database servers, dependencies (do you have the absolute latest version of perl?), superuser, and web servers. If you’re using free software, nobody wants to volunteer to make a decent setup program (or even documentation, half the time), so you’re generally on your own there. And when you’re using commercial software, the vendor would usually like to sell you a three-week integration consulting project so they can make another $50,000 copying some files and editing some configuration files. (We have a double-click SETUP program for FogBUGZ that works pretty well, but there are still people with funny configurations where it doesn’t set all the permissions correctly.)

[Image: Untitled, Steven Harvey, oil on canvas]

I’m also a bit particular about what discussion software we use. The board will be a place for CityDesk beta testers (and later, users) to ask questions, provide feedback, and share ideas. In designing a UI for anything, the very first question you always want to ask is: who is the user? Specifically: are most users casual, occasional users, or are they people who will be spending all of their time using your program? For casual users, learnability and simplicity are more important than usability and power. In that sentence, by “learnability” I mean the ability for novices to figure out how to get tasks done rapidly. By “usability” I mean only the ability to do tasks in a convenient and ergonomic way, without making mistakes and without needing to do repetitive tasks. A data entry system that minimizes keystrokes by prefilling things and automatically jumping from field to field is more usable for experienced users, but it’s harder to learn because it behaves unexpectedly to a novice.

Most people using the Fog Creek bulletin board will be going there to ask a question or raise an issue. Certainly in the early days, learnability and simplicity are our priority. When I looked at a bunch of discussion boards, I found that most of them have their heritage in the BBS world, where the same people log on to chat for 4 hours every night. Those people love features like a geegaw that lets you put a graphical smiley in your posting, and the ability to upload snapshots of their ugly mugs to appear next to their postings, and the ability to click a button and never again see postings by the blithering idiot who wrote this one. All of these features are neat for power users, but they just clutter up the interface for novices.

[Image: Juno’s Read and Write tabs]
(Longtime Juno email users may have noticed that as time went on, the big Read and Write tabs got smaller and smaller. In the early days, when almost every Juno user was a new Juno user, it was nice to have big giant buttons for reading and writing email, the most common tasks. As time went on, a larger proportion of our users were experienced users, who knew how to “read” and “write” and would rather have more screen real estate available for other features. It’s not uncommon for a program to start out simple and evolve to be more complicated, and you can do this without hurting “average” usability, because your users are getting more experienced on average.)

I spent Wednesday and Friday playing around, installing various buggy BBS systems, some of which required a Linux server, others Windows 2000, playing with DNS to move discuss.fogcreek.com hither and yon, figuring out why DNS caches weren’t flushing, installing and reinstalling database servers, and generally getting frustrated. One of the most popular packages, Discus, actually hardcoded its own URL in so many places that it needed to be reinstalled from scratch just to change the URL. (In fact, it wasn’t enough to reinstall it… you had to redownload it. The download package already had your personal URL hardcoded throughout.) And it had a perfectly terrible UI in which there was a little treeview showing folders (so far so good)… but each folder was actually a command, not a container. A broken metaphor, worse than no metaphor at all. That had to go. Then I tried IdealBB, a decidedly beta package. I wouldn’t ordinarily run software that is advertised as beta, but this thing seemed to have been through many releases and it was alleged to be very close to shipping. Michael and I actually had to roll up our sleeves and debug the thing ourselves to get it sort of running, but then there were too many ASP errors and it had a tendency to crash the server. (Too bad, because it is one of the finest-looking discussion boards, visually.) Finally I flirted with Manila, since we have it running anyway for our weblogs, and we’ve already written a little daemon which watches Manila and restarts it when it crashes (about once every two days). But (as far as we could tell) Manila requires membership to post a message, and in my experience that is enough to turn away 90% of the casual visitors who might otherwise use the discussion board. It would be great for small elite communities of people who all post all the time, but I don’t want anything to get in the way of a beta tester casually reporting a bug.

The system I like best, believe it or not, is Lusenet, by Philip Greenspun, because it’s just super simple. That’s what I’ve been using for the Joel on Software discussion group. There are a few reasons we couldn’t use it here. First, it is really not ready for prime time. There are actually things you can do in Lusenet that still show you the Oracle statement they just executed, as if Philip left in some debug printfs. Second, it’s not hosted on our own servers, so we run the risk of it going away, taking our valuable data with it. Third, hosting it on our own servers would require AOLServer and Oracle, and we don’t have the former and can’t afford the latter.

When I got home, grumpy from all the time I had wasted, I realized that the software I want consists of exactly one table, and that I could write the thing myself in less time than it took me to install some of these packages. Which I did. Two hours of work (ASP, Microsoft Access, and VBScript) and I had banged out a system that did pretty much everything I wanted (which is not much!). Check it out at http://discuss.fogcreek.com/joelonsoftware. (If you have trouble reaching it, that’s because of all my DNS messing around. You’ll have to wait a couple of days for caches to flush.)
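For the curious, here’s roughly the shape of that one table, sketched as a C struct. The field names are my own guesses for the sake of illustration, not the actual schema.

/* Roughly what the discussion board's single table holds.
   Field names are illustrative guesses, not the real schema. */
typedef struct Post
{
    long  postID;       /* primary key */
    long  replyToID;    /* 0 for a new topic, otherwise the post being replied to */
    char  author[80];   /* whatever name the poster typed in; no accounts, no login */
    char  subject[120]; /* one-line subject shown in the topic list */
    char  posted[20];   /* date and time of the post */
    char* body;         /* the text of the post itself */
} Post;

Everything else, like threading and the topic list, would just be a different query over that one table.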

There’s a lesson in here somewhere, but I’m already well past 1000 words. In the past I wouldn’t have cared about word counts because I didn’t know what they were, but now I’m using CityDesk which keeps a running word count in the corner of the screen. So we’ll talk about the lesson tomorrow!


Working on CityDesk, Part One

I’ve been rather quiet lately on this weblog — mainly because we’ve been working so hard at Fog Creek getting ready for the beta of CityDesk, our flagship product. But I’d like to spend some time now talking about the design and development of CityDesk, because it’s a great case study for the kind of software development practices that I’ve been advocating here for more than a year. Over the next few columns, I’ll focus on “The Story of CityDesk,” with a look at some of the behind-the-scenes, day-to-day stories of a real software development project.

We’re launching the beta on October 15th — just three days away! This is officially On Time. We made a schedule for shipping CityDesk way back in June. We figured out estimates for all the remaining tasks and bug fixes, added them up, and got October 15th.

Every couple of weeks, we checked through our task list, revised estimates, and so forth. We’ve found a lot of bugs over that time and added a lot of small features, but we’ve also killed a lot of features that we just won’t have time for. And lo and behold, now we are almost completely done. Most of what we’re doing these days is getting set up to run the beta. Michael added a dialog box to CityDesk for sending feedback. He also wrote some very spiffy code that catches any unhandled exception on any copy of CityDesk running anywhere in the world, prevents the app from crashing, and pushes the exception info up to our FogBUGZ bug tracking database here in New York City. [Screenshot: Michael’s cool crash handler enters diagnostics right into FogBUGZ.] I’ve been fixing bugs, writing the help file, and writing some pages for our corporate web site that will explain what CityDesk is and why you would buy it.

A common misconception, I assume popularized by Hollywood, is that as you get closer to shipping software, activity becomes frenetic as everybody scrambles to finish all the things that need to be done in time for the deadline. In the typical crappy movie, there’s a mad rush of typing in a room full of cool alterna-dressed programmers with found-object earrings and jeans jackets. Somebody stands up and shouts to the room in general “I need the Jiff subroutine! Somebody give me the Jiff subroutine!” A good looking young woman in Vivienne Tam urbanwear throws a floppy disk at him. “Thanks!”

[Photo caption: Yeah, most programmers are as cute as Ryan Phillipe. That’s why I’m in this field.]

As the second hand swoops towards the :00, the whole team waits breathlessly around Ryan Phillipe’s computer and watches the “copy” progress indicator as the final bits are put onto a floppy disk with less than a second to spare before the VC cuts off funding.

I suppose some software shops have last-minute coding frenzies like this. If so, their software is probably marked by incredibly poor quality. No code works the way you expected it to work the first time, and if you’re writing code up until the last minute, it’s going to suck. The first Netscape open source release, documented in the excellent movie Code Rush, demonstrates this.

On good teams, the days before shipping just get quieter and quieter as programmers literally run out of things to do one at a time. (Yesterday I took the day off to explore New York City with my wee niece and nephews.)

[Photo: Times Square]

The number of new bugs being found has decreased to a point at which we feel confident about releasing the beta. It is crucial to get to zero known bugs (what Netscape famously called “Zarro Boogs”) before releasing a beta. If you don’t, you’ll waste a lot of time during the beta reading 200 emails about a bug that you already knew about. And you’ve just used up the time and goodwill of those 200 beta testers, so they may not bother telling you about the next bug they find, something you didn’t know about. Or the bug may stop them from trying other parts of the program that need some pounding. This seems self-evident, but almost every time I’ve been on a real product, everybody starts to think that releasing the beta on time is more important than releasing a Zero Known Bugs beta. (After all, it’s ok to have bugs in the beta, they say. And I agree: it is ok to have bugs in the beta, just not known bugs.)

I’ll keep posting The Story of CityDesk over the next few days; keep an eye out for frequent updates.


Hard-assed Bug Fixin’

Software quality, or the lack thereof, is something everybody loves to gripe about. Now that I have my own company I finally decided to do something about it. Over the last two weeks we stopped everything at Fog Creek to ship a new incremental version of FogBUGZ with the goal of eliminating all known bugs (there were about 30).

As a software developer, fixing bugs is a good thing. Right? Isn’t it always a good thing?

No!

Fixing bugs is only important when the value of having the bug fixed exceeds the cost of fixing it.

These things are hard, but not impossible, to measure. Let me give you an example. Suppose you operate a peanut-butter-and-jelly sandwich factory. Your factory produces 100,000 sandwiches a day. Recently, due to the introduction of some new flavors (garlic peanut butter with spicy Habanero jam), demand for your product has gone through the roof. The factory is operating full-out at 100,000 sandwiches, but the demand is probably closer to 200,000. You just can’t make any more. And each sandwich earns you a profit of 15 cents. So you’re losing $15,000 a day in potential earnings because you don’t have enough capacity.

Building a new factory would cost way too much. You don’t have the capital, and you’re afraid that spicy/garlicky sandwiches are just a fad which will pass, anyway. But you’re still losing that $15,000 a day.

It’s a good thing you hired Jason. Jason is a fourteen-year-old programmer who hacked into the computers that run the factory, and he believes that he has come up with a way to speed up the assembly line by a factor of 2. Something about overclocking that he heard about on slashdot. And it seemed to work in a test run.

There’s only one thing stopping you from rolling it out. There’s a teeny tiny wee little bug that causes a sandwich to be mushed once an hour or so. Jason wants to fix the wee bug. He thinks he can fix it in three days. Do you let him fix it, or do you roll out the software in its bug-addled state?

Rolling out the software three days later will cost you $45,000 in lost profits. And it will save you, um, the cost of raw materials for 72 sandwiches: one mushed sandwich per hour for three days. (In either case Jason will get the bug fixed three days later.) Well, I don’t know how much sandwiches cost on your planet, but here on Earth, they’re a lot less than $625 apiece, which is what they’d have to cost ($45,000 ÷ 72) to make the delay worthwhile.

Where was I. Oh yeah. Sometimes it is not worth fixing a bug. Here’s another bug that’s not worth fixing: the one that totally crashes your program when you open gigantic files, but only happens to your single user who has OS/2 and who, for all you know, doesn’t even use large files. Well, don’t fix it. Worse things have happened at sea. Similarly, I’ve generally given up caring about people with 16-color screens or people running off-the-shelf Windows 95 with no upgrades in 7 years. People like that don’t spend much money on packaged software products. Trust me.

But mostly, it’s worth fixing bugs. Even if they are “harmless” bugs, they may reduce the reputation of your company and your product, which, in the long run, will have a significant impact on your earnings. It’s hard to overcome the reputation of having a buggy product. When you do want to do that .01 release, here are some ideas for finding and fixing the right bugs: the ones that it is economically worth fixing.

Step One: Make Sure You Find Out About The Bugs.

In the case of FogBUGZ, we have two ways of doing that. First, we trap all bugs on our free demo server, capture as much information as we can, and email the whole thing to the development team. That found an awful lot of bugs, which was very cool. For example, we discovered a bunch of people who didn’t enter dates where they were supposed to in the “Fix For” screen. We didn’t even have an error message in that case, we just “crashed” (which, in a web app, just means you got an ugly IIS error instead of what you expected). Oops.

When I worked at Juno, we had an even cooler system in place to collect bugs “from the field” automatically. We installed a handler using TOOLHELP.DLL so that every time Juno crashed, it stayed alive just long enough to dump the stack into a log file before going to its grave. The next time the program connected to the Internet to send mail, it uploaded the log file. During betas, we gathered these log files, collated all the crashes, and entered them into the bug tracking database. This found literally hundreds of crashing bugs. When you have a million users, it is amazing what will crash, often because of severe low memory conditions or severely crappy computers (can you spell Packard Bell?). You could have code like this:

int foo( object& r )
{
    r.Blah();   // crashes here when r is somehow NULL -- "impossible," yet it happens in the field
    return 1;
}

and you would get crashes there because the r reference was NULL, even though that’s completely impossible, there’s no such thing as a NULL reference, C++ guarantees it, and you don’t have to believe me but when you wait long enough and have millions of users and religiously collect their stack dumps, you will find crashes in places like that and you won’t believe your eyes. (And you won’t fix them. Cosmic rays, man. Get a new computer and this time don’t install every cool shareware taskbar lint gizmo you find. Sheesh.)
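If you want to rig up something similar yourself, here’s a minimal sketch of the idea in C. It uses the Win32 SetUnhandledExceptionFilter API rather than the TOOLHELP.DLL machinery we actually used at Juno, and it records only the exception code and address instead of a full stack dump, but the shape is the same: catch the crash, write a log file, and upload it the next time the program connects. The file name crash.log is arbitrary.

#include <windows.h>
#include <stdio.h>

/* Last-chance handler: append a line about the crash to a log file,
   then let the process die quietly. */
static LONG WINAPI CrashLogger(EXCEPTION_POINTERS* info)
{
    FILE* f = fopen("crash.log", "a");
    if (f)
    {
        fprintf(f, "crash: exception 0x%08lX at address %p\n",
                (unsigned long)info->ExceptionRecord->ExceptionCode,
                info->ExceptionRecord->ExceptionAddress);
        fclose(f);
    }
    return EXCEPTION_EXECUTE_HANDLER;
}

int main(void)
{
    SetUnhandledExceptionFilter(CrashLogger);

    /* ... the rest of the program runs normally; the next time it goes
       online, it can upload crash.log to the bug tracking database ... */

    return 0;
}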

The other thing we do is consider each and every tech support call to be evidence of a bug. When we take the call, we try to figure out what we could have done to eliminate it. For example, the old FogBUGZ Setup used to assume that FogBUGZ would run under the anonymous Internet user account. That was a good assumption 95% of the time, and a bad assumption 5% of the time, but every one of those 5% cases ended up in a call to our support line. So we modified Setup to prompt for an account.

Step Two: Make Sure You Get Economic Feedback.

You may not be able to figure out exactly how much it’s worth to fix each bug, but there’s something you can do: charge the “cost” of tech support back to the business unit. In the early nineties there was a financial reorganization at Microsoft under which each product unit was charged for the full cost of all tech support calls. So the product units started insisting that PSS (Microsoft’s tech support) provide lists of Top Ten Bugs regularly. When the development team concentrated on those, product support costs plummeted.

This contradicts, somewhat, the newer trend of making the tech support department pay for its own operation, something that most large companies do. At Juno, tech support was expected to break even by charging people for tech support. By moving the economic burden of bugs onto the users themselves, you lose what limited ability you might have had to detect the damage those bugs were causing. (Instead you get irate users who resent having to pay for your bug, who tell their friends, and you can’t even measure how much that costs you. To be fair to Juno, the product itself was free, so stop yer bitchin.)

One way of resolving the two is to not charge the user when the support call was caused by a bug in your own product. Microsoft does this, and it’s quite nice, and I’ve never paid for a call to Microsoft 🙂 Instead, charge the $245 or whatever one developer incident costs these days back to the product unit. That blows away their profit completely for the product they sold you (several times over), and creates exactly the right economic incentives. Which reminds me of one reason DOS games were a terrible business… to get them to look good and run fast, you usually needed strange video drivers, and a single tech support call about the video drivers would blow away the profit you could make from 20 copies of your product, assuming Egghead and Ingram and the ad on MTV hadn’t already guzzled away all your earnings.

Step Three: Figure Out What It’s Worth To You To Fix Them All.

At Fog Creek Software, well, we’re a tiny company (except in our own minds), and the development team just takes the tech support calls. The cost was running about 1 hour per day, which, based on our consulting rates, is somewhere around $75,000 a year. We were pretty confident that we could get that down to 15 minutes a day by fixing all known bugs.

Using very sloppy numbers here, that means that the net present value of the savings would be about $150,000. That justifies 62 days of work: if you can do it in less than 62 person-days, it’s worth doing.

Using the handy estimation feature built into FogBUGZ, we calculated that it would take 20 person-days (two people for two weeks) to fix everything. That’s $48,000 “spent” for a return of $150,000, which is a great return on investment just on the basis of the tech support savings. (Observe that you could substitute the cost of programmers’ salaries and overhead for our consulting rate and get the same 3:1 result, since it cancels out.)
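If you like seeing the arithmetic spelled out, here it is as a few lines of C. The $2,400 day rate isn’t a number off our books; it’s just the per-person-day cost implied by the figures above.

#include <stdio.h>

int main(void)
{
    double npv_of_savings = 150000.0; /* rough NPV of cutting support from 1 hour to 15 minutes a day */
    double day_rate       = 2400.0;   /* implied cost of one person-day (illustrative, not a quoted rate) */
    double estimated_days = 20.0;     /* the FogBUGZ estimate: two people for two weeks */

    double breakeven_days = npv_of_savings / day_rate;   /* about 62 person-days */
    double cost_to_fix    = estimated_days * day_rate;   /* about $48,000 */

    printf("worth doing if it takes fewer than %.0f person-days\n", breakeven_days);
    printf("estimated cost $%.0f, return %.1f : 1\n",
           cost_to_fix, npv_of_savings / cost_to_fix);
    return 0;
}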

I haven’t even begun to count the value from having a better product, but I can start doing that, too. We had 55 crashes on the demo server during the month of July with the old code, representing 17 distinct users. You have to imagine that at least one of those people decided not to buy FogBUGZ because they thought it was buggy when they ran the demo (although I don’t have real statistics for that). In any case the lost sales were probably costing us somewhere between $7,000 and $100,000 in present value. (If you were serious enough, it wouldn’t be too hard to get a real number.)

Next question. Can you charge more for a less buggy product? That would add a whole bunch of value to debugging. I suspect that at the extremes, bug count does affect price, but I am hard pressed to think of an example from the world of packaged software where this has been the case.

Please Don’t Beat Me Up!

Inevitably people read essays like this and come to silly conclusions, like, Joel doesn’t think you should fix bugs. In fact I think that for most of the kinds of bugs that most people fix, there’s a clear return on investment. But there may be an even higher monetary value to doing something other than fixing every last bug. If you have to decide between fixing the bug for OS/2 guy and adding a new feature that will sell 20,000 copies of your software to General Electric, well, sorry, OS/2 guy. And if you’re dumb enough to think that it’s still more important to fix OS/2 than to add the GE feature, maybe your competitors won’t be and you’ll be out of business.

With all that said, I’m optimistic at heart, and I believe that there is a lot of hidden value in producing very high quality products that is not very easy to capture. Your employees will be prouder. Fewer of your customers will send you back your CD in the mail after microwaving it and chopping it to bits with an ax. So I tend to err on the side of quality (indeed, we fixed every known bug in FogBUGZ, not just the big-bang-for-the-buck ones) and take pride in that, and the complete elimination of errors from the demo server makes me feel confident that we have a rock-solid product.

Good Software Takes Ten Years. Get Used To It.

Have a look at this little chart:

[Chart: installed seats of Lotus Notes, 1989 through 2000. Source: Iris Associates]

This is a chart showing the number of installed seats of the Lotus Notes workgroup software, from the time it was introduced in 1989 through 2000. In fact when Notes 1.0 finally shipped it had been under development for five years. Notice just how dang long it took before Notes was really good enough that people started buying it. Indeed, from the first line of code written in 1984 until the hockey-stick part of the curve where things really started to turn up, about 11 years passed. During this time Ray Ozzie and his crew weren’t drinking piña coladas in St Barts. They were writing code.

The reason I’m telling you this story is that it’s not unusual for a serious software application. The Oracle RDBMS has been around for 22 years now. Windows NT development started 12 years ago. Microsoft Word is positively long in the tooth; I remember seeing Word 1.0 for DOS in high school (that dates me, doesn’t it? It was 1983.)

To experienced software people, none of this is very surprising. You write the first version of your product, a few people use it, they might like it, but there are too many obvious missing features, performance problems, whatever, so a year later, you’ve got version 2.0. Everybody argues about which features are going to go into 2.0, 3.0, 4.0, because there are so many important things to do. I remember from the Excel days how many things we had that we just had to do. Pivot Tables. 3-D spreadsheets. VBA. Data access. When you finally shipped a new version to the waiting public, people fell all over themselves to buy it. Remember Windows 3.1? And it positively, absolutely needed long file names, it needed memory protection, it needed plug and play, it needed a zillion important things that we can’t imagine living without, but there was no time, so those features had to wait for Windows 95.

But that’s just the first ten years. After that, nobody can think of a single feature that they really need. Is there anything you need that Excel 2000 or Windows 2000 doesn’t already do? With all due respect to my friends on the Office team, I can’t help but feel that there hasn’t been a useful new feature in Office since about 1995. Many of the so-called “features” added since then, like the reviled ex-paperclip and auto-document-mangling, are just annoyances and O’Reilly is doing a nice business selling books telling you how to turn them off.

So, it takes a long time to write a good program, but when it’s done, it’s done. Oh sure, you can crank out a new version every year or two, trying to get the upgrade revenues, but eventually people will ask: “why fix what ain’t broken?”


Failure to understand the ten-year rule leads to crucial business mistakes.

Mistake number 1. The Get Big Fast syndrome. This fallacy of the Internet bubble has already been thoroughly discredited elsewhere, so I won’t flog it too much. But an important observation is that the bubble companies that were trying to create software (as opposed to pet food shops) just didn’t have enough time for their software to get good. My favorite example is desktop.com, which had the beginnings of something that would have been great if they had worked on it for 10 years. But the build-to-flip mentality, the huge overstaffing and overspending of the company, and the need to raise VC every ten minutes made it impossible to develop the software over 10 years. And the 1.0 version, like everything, was really morbidly awful, and nobody could imagine using it. But desktop.com 8.0 might have been seriously cool. We’ll never know.

Mistake number 2. The Overhype syndrome. When you release 1.0, you might want to actually keep it kind of quiet. Let the early adopters find it. If you market it and promote it too heavily, when people see what you’ve actually done, they will be underwhelmed. Desktop.com is an example of this; so are Marimba and Groove: they had so much hype on day one that people stopped in and actually looked at the 1.0 release, trying to see what all the excitement was about, but like most 1.0 products, it was about as exciting as watching grass dry. So now there are a million people running around who haven’t looked at Marimba since 1996, and who think it’s still a dorky list box that downloads Java applets, thrown together in about 4 months.

Keeping 1.0 quiet means you have to be able to break even with fewer sales. And that means you need lower costs, which means fewer employees, which, in the early days of software development, is actually a really great idea, because if you can only afford 1 programmer at the beginning, the architecture is likely to be reasonably consistent and intelligent, instead of a big mishmash with dozens of conflicting ideas from hundreds of programmers that needs to be rewritten from scratch (like Netscape, according to the defenders of the decision to throw away all the source code and start over).

Mistake number 3. Believing in Internet Time. Around 1996, the New York Times first noticed that new Netscape web browser releases were coming out every six months or so, much faster than the usual 2 year upgrade cycle people were used to from companies like Microsoft. This led to the myth that there was something called “Internet time” in which “business moved faster.” Which would be nice, but it wasn’t true. Software was not getting created any faster, it was just getting released more often. And in the early stages of a new software product, there are so many important things to add that you can do releases every six months and still add a bunch of great features that people Gotta Have. So you do it. But you’re not writing software any faster than you did before. (I will give the Internet Explorer team credit. With IE versions 3.0 and 4.0 they probably created software about ten times faster than the industry norm. This had nothing to do with the Internet and everything to do with the fact that they had a fantastic, war-hardened team that benefited from 15 years of collective experience creating commercial software at Microsoft.)

Mistake number 4. Running out of upgrade revenues when your software is done. A bit of industry lore: in the early days (late 1980s), the PC industry was growing so fast that almost all software was sold to first-time users. Microsoft generally charged about $30 for an upgrade to their $500 software packages until somebody noticed that the growth from new users was running out, and too many copies were being bought as upgrades to justify the low price. Which got us to where we are today, with upgrades generally costing 50%-60% of the price of the full version and making up the majority of the sales. Now the trouble comes when you can’t think of any new features, so you put in the paperclip, and then you take out the paperclip, and you try to charge people both times, and they aren’t falling for it. That’s when you start to wish that you had charged people for one-year licenses, so you can make your product a subscription and have permission to keep taking their money even when you haven’t added any new features. It’s a neat accounting trick: if you sell a software package for $100, Wall Street will value that at $100. But if you can sell a one-year license for $30, then you can claim that you’re going to get recurring revenue of $30 for the next, say, 10 years, which, even after discounting, is worth something like $200 to Wall Street. Tada! Stock price doubles! (Incidentally, that’s how SAS charges for their software. They get something like 97% renewals every year.)

The trouble is that with packaged software like Microsoft’s, customers won’t fall for it. Microsoft has been trying to get their customers to accept subscription-based software since the early ’90s, and they get massive pushback from their customers every single time. Once people get used to the idea that they “own” the software they bought, and that they don’t have to upgrade if they don’t want the new features, it’s a big problem for a software company trying to sell a product that is already feature-complete.

Mistake number 5. The “We’ll Ship It When It’s Ready” syndrome. Which reminds me. What the hell is going on with Mozilla? I made fun of them more than a year ago because three years had passed and the damn thing was still not out the door. There’s a frequently-obsolete chart on their web site which purports to show that they now think they will ship in Q4 2001. Since they don’t actually have anything like a schedule based on estimates, I’m not sure why they think this. Ah, such is the state of software development in Internet Time Land.

But I’m getting off topic. Yes, software takes 10 years to write, and no, there is no possible way a business can survive if you don’t ship anything for 10 years. By the time you discount that revenue stream from 10 years in the future to today, you get bupkis, especially since business analysts like to pretend that everything past 5 years is just “residual value” when they make their fabricated, fictitious spreadsheets that convince them that investing in sock puppets at a $100,000,000 valuation is a pretty good idea.

Anyway, getting good software over the course of 10 years assumes that for at least 8 of those years, you’re getting good feedback from your customers, and good innovations from your competitors that you can copy, and good ideas from all the people that come to work for you because they believe that your version 1.0 is promising. You have to release early, incomplete versions — but don’t overhype them or advertise them on the Super Bowl, because they’re just not that good, no matter how smart you are.

Mistake number 6. Too-frequent upgrades (a.k.a. the Corel Syndrome). At the beginning, when you’re adding new features and you don’t have a lot of existing customers, you’ll be able to release a new version every 6 months or so, and people will love you for the new features. After four or five releases like that, you have to slow down, or your existing customers will stop upgrading. They’ll skip releases because they don’t want the pain or expense of upgrading. Once they skip a release, they’ll start to convince themselves that, hey, they don’t always need the latest and greatest. I used Corel PhotoPaint 6.0 for 5 years. Yes, I know, it had all kinds of off-by-one bugs, but I knew all the off-by-one bugs and compensated by always dragging the selection one pixel to the right of where I thought it should be.


Make a ten year plan. Make sure you can survive for 10 years, because the software products that bring in a billion dollars a year all took that long. Don’t get too hung up on your version 1 and don’t think, for a minute, that you have any hope of reaching large markets with your first version. Good software, like wine, takes time.