2002/11/19

Good Books

The Goal

A few months ago I read The Goal, by Eliyahu M. Goldratt, mainly because it has become extremely popular at business schools, and it looked fun. It was interesting, and fun. I didn’t understand how the book’s theory, called the Theory of Constraints, could possibly be applied to software development, but it was still interesting enough, and I figured if I ever found myself running a factory again, it would be helpful.

Critical Chain

Last week I discovered his newer book, Critical Chain. This book applies the Theory of Constraints, introduced in The Goal, to project management, and it seems to really make sense.

Let’s say you’re creating Painless Software Schedules. Most people’s intuition is to come up with conservative, padded estimates for each task, yet their schedules still slip. Goldratt shows that the slippage happens precisely because we pad the estimate for each step, which leads to three problems:

  1. “Student Syndrome” – no matter how long you give students to work on something, they will start the night before. Phil Greenspun noticed this: “The first term that we taught 6.916, we gave the students one week to do Problem Set 1. It was pretty tough and some of them worked all night the last two nights. Having watched them still at their terminals when we left the lab at 4:00 am, we wanted to be kinder and gentler the next semester. So we gave them two weeks to do the same homework assignment. The first week went by. The students were working on other classes, playing sports on the lawn, going out with friends. They didn’t start working on the problem set until a few days before it was due and ended up in the lab all night just as before.”
  2. Multitasking, which, as I discuss, makes the lead time for each step dramatically longer, and
  3. the fact that delays accumulate, while advances do not (for example, if you have finished this week’s work on Friday morning, chances are you will waste time on Friday afternoon rather than starting the next week’s work. But if you don’t make it on time, you’ll still leave at 5 o’clock on Friday, accumulating a delay.)

Goldratt’s solution is to choose task estimates that are not padded: each individual task’s estimate should be exactly in the middle of the probability curve, so there is a 50% chance you will finish early and a 50% chance you will be late. You should move all the padding to the end of the project (or milestone) where it won’t do any harm.
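
To see why moving the padding to the end wins, here is a toy Monte Carlo sketch of the argument (mine, not Goldratt’s; the distribution, the numbers, and the assumption that early finishes are entirely wasted on a padded schedule are all invented for illustration):

```cpp
// Toy simulation: padded per-task estimates vs. 50% estimates with one
// shared buffer. If early finishes are wasted (problem 3 above), each
// padded task effectively costs max(actual, padded estimate), while a
// pooled-buffer schedule costs only the sum of the actual durations.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>

int main() {
    std::mt19937 rng(42);
    // Skewed task durations: median ~7 days, with a long right tail.
    std::lognormal_distribution<double> task(std::log(7.0), 0.5);

    const int kTasks = 10, kTrials = 100000;
    const double kPadded = 12.0;  // a "safe" per-task estimate

    double padded = 0, pooled = 0;
    for (int t = 0; t < kTrials; ++t) {
        for (int i = 0; i < kTasks; ++i) {
            double actual = task(rng);
            // Padded schedule: finishing early doesn't help (Student
            // Syndrome, wasted Friday afternoons); overruns still hurt.
            padded += std::max(actual, kPadded);
            // Pooled buffer: hand work off the moment it's done;
            // overruns eat into one buffer at the end of the project.
            pooled += actual;
        }
    }
    std::printf("average project, padded tasks:  %.1f days\n",
                padded / kTrials);
    std::printf("average project, pooled buffer: %.1f days\n",
                pooled / kTrials);
}
```

With these invented numbers the padded schedule comes out roughly half again as long as the pooled one: the safety is the same, but pooled at the end it only gets spent when a task actually overruns.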

I can’t do justice to an entire book in this short post, but if you’re doing any kind of project scheduling or management, I highly recommend that you read both books (read The Goal first no matter what you do, both because it’s more entertaining, and because it teaches you the foundation you need for Critical Chain).

2002/11/14

Bad Spam Filters

Spam is getting worse and worse. My incoming spam ratio is well over 50% by now. SpamAssassin catches and tags most of it; these are automatically shuttled into a “Spam” folder. About once a week, it takes me 15 seconds to make sure there’s nothing important in there and throw it out.

On the other hand, overzealous system administrators are causing serious damage to the connectivity of the Internet by imposing draconian spam filters. The Joel on Software mailing list is operated by a legitimate email delivery company with strong anti-spam policies; it is double-opt-in, of course. Increasingly, emails sent to the mailing list are getting bounced — not tagged — before they even get to the users. In the last half hour, five people tried to sign up, but the confirmation email didn’t even get to them. Apparently my mailing list provider’s IP address is now blacklisted by SpamCop. OK, fair enough. But if you or your ISP is using a spam filter that bounces mail, you’re going to lose stuff that you didn’t want to lose. So don’t do it. Use tagging systems instead — have the spam filter add a tag like “***SPAM***” to the subject line, and let your email client shuttle these off to another folder.

Here’s what I’d like to see: a system that delivers an email for one cent. Nobody has to use it, but if you want to get your messages through, you pay one cent and the system delivers it for you. Every spam filtering system on earth can safely whitelist all email that comes from the one cent server, because no spammer can afford a penny apiece for the 19 million messages they send (that’s $190,000 per run). I would use it for all my email. You could even give three-quarters of a cent to the recipient as a credit to use for sending their own mail, keeping the remaining quarter cent to pay for the servers. Eventually, if it caught on, you wouldn’t need a spam filter: just put all the free email in a suspect folder, and check it once a week in case some old-school holdouts insist on sending you email without paying.

2002/11/13

Joel on Software in Russian is now live with the first 7 articles. Many thanks to Alexander Bashkov, Alexander Shirshov, Alexey Simonov, Dmitriy Mayorov, Marat Zborovskiy, Marianna Evseeva, Michael Zukerman, Petr Gladkikh, Sergey Kalmakov, Simon Hawkin, and Yury Udovichenko who translated and edited these articles. More are on the way!

The Law of Leaky Abstractions

There’s a key piece of magic in the engineering of the Internet which you rely on every single day. It happens in the TCP protocol, one of the fundamental building blocks of the Internet.

TCP is a way to transmit data that is reliable. By this I mean: if you send a message over a network using TCP, it will arrive, and it won’t be garbled or corrupted.

We use TCP for many things like fetching web pages and sending email. The reliability of TCP is why every email arrives in letter-perfect condition. Even if it’s just some dumb spam.

By comparison, there is another method of transmitting data called IP which is unreliable. Nobody promises that your data will arrive, and it might get messed up before it arrives. If you send a bunch of messages with IP, don’t be surprised if only half of them arrive, and some of those are in a different order than the order in which they were sent, and some of them have been replaced by alternate messages, perhaps containing pictures of adorable baby orangutans, or more likely just a lot of unreadable garbage that looks like that spam you get in a foreign language.

Here’s the magic part: TCP is built on top of IP. In other words, TCP is obliged to somehow send data reliably using only an unreliable tool.
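
To make the trick concrete, here is a minimal stop-and-wait sketch (vastly simpler than real TCP; the 30% loss rate and every name in it are made up): number each chunk, deliver only in-order chunks, acknowledge them, and resend until acknowledged.

```cpp
// Reliable, in-order delivery over a channel that silently drops 30%
// of everything: the sender keeps resending a numbered packet until it
// sees an acknowledgment, and the receiver accepts each sequence
// number exactly once, in order.
#include <cstdio>
#include <random>
#include <string>

struct Packet { int seq; char data; };

int main() {
    std::mt19937 rng(7);
    std::bernoulli_distribution lost(0.3);  // 30% of packets vanish

    std::string message = "HELLO", received;
    for (int seq = 0; seq < (int)message.size(); ++seq) {
        Packet p{seq, message[seq]};
        for (;;) {                       // retransmit until acknowledged
            bool packetArrives = !lost(rng);
            bool ackArrives = !lost(rng);
            if (packetArrives && p.seq == (int)received.size())
                received += p.data;      // deliver exactly once, in order
            if (packetArrives && ackArrives)
                break;                   // the ack got back; move on
            std::printf("packet %d lost somewhere; resending\n", seq);
        }
    }
    std::printf("received: %s\n", received.c_str());
}
```

Real TCP adds sliding windows, congestion control, checksums, and much more, but the core bargain is the same: keep retrying over the unreliable layer until the reliable story comes true.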

To illustrate why this is magic, consider the following morally equivalent, though somewhat ludicrous, scenario from the real world.

Imagine that we had a way of sending actors from Broadway to Hollywood that involved putting them in cars and driving them across the country. Some of these cars crashed, killing the poor actors. Sometimes the actors got drunk on the way and shaved their heads or got nasal tattoos, thus becoming too ugly to work in Hollywood, and frequently the actors arrived in a different order than they had set out, because they all took different routes. Now imagine a new service called Hollywood Express, which delivered actors to Hollywood, guaranteeing that they would (a) arrive (b) in order (c) in perfect condition. The magic part is that Hollywood Express doesn’t have any method of delivering the actors, other than the unreliable method of putting them in cars and driving them across the country. Hollywood Express works by checking that each actor arrives in perfect condition, and, if he doesn’t, calling up the home office and requesting that the actor’s identical twin be sent instead. If the actors arrive in the wrong order Hollywood Express rearranges them. If a large UFO on its way to Area 51 crashes on the highway in Nevada, rendering it impassable, all the actors that went that way are rerouted via Arizona and Hollywood Express doesn’t even tell the movie directors in California what happened. To them, it just looks like the actors are arriving a little bit more slowly than usual, and they never even hear about the UFO crash.

That is, approximately, the magic of TCP. It is what computer scientists like to call an abstraction: a simplification of something much more complicated that is going on under the covers. As it turns out, a lot of computer programming consists of building abstractions. What is a string library? It’s a way to pretend that computers can manipulate strings just as easily as they can manipulate numbers. What is a file system? It’s a way to pretend that a hard drive isn’t really a bunch of spinning magnetic platters that can store bits at certain locations, but rather a hierarchical system of folders-within-folders containing individual files that in turn consist of one or more strings of bytes.

Back to TCP. Earlier for the sake of simplicity I told a little fib, and some of you have steam coming out of your ears by now because this fib is driving you crazy. I said that TCP guarantees that your message will arrive. It doesn’t, actually. If your pet snake has chewed through the network cable leading to your computer, and no IP packets can get through, then TCP can’t do anything about it and your message doesn’t arrive. If you were curt with the system administrators in your company and they punished you by plugging you into an overloaded hub, only some of your IP packets will get through, and TCP will work, but everything will be really slow.

This is what I call a leaky abstraction. TCP attempts to provide a complete abstraction of an underlying unreliable network, but sometimes, the network leaks through the abstraction and you feel the things that the abstraction can’t quite protect you from. This is but one example of what I’ve dubbed the Law of Leaky Abstractions:

All non-trivial abstractions, to some degree, are leaky.

Abstractions fail. Sometimes a little, sometimes a lot. There’s leakage. Things go wrong. It happens all over the place when you have abstractions. Here are some examples.

  • Something as simple as iterating over a large two-dimensional array can have radically different performance if you do it horizontally rather than vertically, depending on the “grain of the wood” — one direction may result in vastly more page faults than the other direction, and page faults are slow. Even assembly programmers are supposed to be allowed to pretend that they have a big flat address space, but virtual memory means it’s really just an abstraction, which leaks when there’s a page fault and certain memory fetches take way more nanoseconds than other memory fetches. (See the sketch after this list.)
  • The SQL language is meant to abstract away the procedural steps that are needed to query a database, instead allowing you to specify merely what you want and letting the database figure out the procedural steps to query it. But in some cases, certain SQL queries are thousands of times slower than other logically equivalent queries. A famous example of this is that some SQL servers are dramatically faster if you specify “where a=b and b=c and a=c” than if you only specify “where a=b and b=c” even though the result set is the same. You’re not supposed to have to care about the procedure, only the specification. But sometimes the abstraction leaks and causes horrible performance and you have to break out the query plan analyzer and study what it did wrong, and figure out how to make your query run faster.
  • Even though network libraries like NFS and SMB let you treat files on remote machines “as if” they were local, sometimes the connection becomes very slow or goes down, and the file stops acting like it was local, and as a programmer you have to write code to deal with this. The abstraction of “remote file is the same as local file” leaks. Here’s a concrete example for Unix sysadmins. If you put users’ home directories on NFS-mounted drives (one abstraction), and your users create .forward files to forward all their email somewhere else (another abstraction), and the NFS server goes down while new email is arriving, the messages will not be forwarded because the .forward file will not be found. The leak in the abstraction actually caused a few messages to be dropped on the floor.
  • C++ string classes are supposed to let you pretend that strings are first-class data. They try to abstract away the fact that strings are hard and let you act as if they were as easy as integers. Almost all C++ string classes overload the + operator so you can write s + "bar" to concatenate. But you know what? No matter how hard they try, there is no C++ string class on Earth that will let you type "foo" + "bar", because string literals in C++ are always char*’s, never strings. The abstraction has sprung a leak that the language doesn’t let you plug. (Amusingly, the history of the evolution of C++ over time can be described as a history of trying to plug the leaks in the string abstraction. Why they couldn’t just add a native string class to the language itself eludes me at the moment.)
  • And you can’t drive as fast when it’s raining, even though your car has windshield wipers and headlights and a roof and a heater, all of which protect you from caring about the fact that it’s raining (they abstract away the weather), but lo, you have to worry about hydroplaning (or aquaplaning in England) and sometimes the rain is so strong you can’t see very far ahead so you go slower in the rain, because the weather can never be completely abstracted away, because of the law of leaky abstractions.
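
Here is a small sketch of the first bullet, the “grain of the wood” (the array size is arbitrary, and the exact ratio depends on your machine):

```cpp
// Summing the same 4000x4000 array two ways. C++ lays the array out
// row by row, so the row-major walk touches memory sequentially while
// the column-major walk jumps 16,000 bytes per step, missing the cache
// (and potentially faulting pages) on nearly every access.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int N = 4000;
    std::vector<int> a((size_t)N * N, 1);

    auto run = [&](bool rowMajor) {
        auto start = std::chrono::steady_clock::now();
        long long sum = 0;
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                sum += rowMajor ? a[(size_t)i * N + j]   // with the grain
                                : a[(size_t)j * N + i];  // against it
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start).count();
        std::printf("%s sum=%lld in %lld ms\n",
                    rowMajor ? "row-major:   " : "column-major:", sum, ms);
    };
    run(true);   // sequential walk
    run(false);  // strided walk: same answer, much slower
}
```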

One reason the law of leaky abstractions is problematic is that it means that abstractions do not really simplify our lives as much as they were meant to. When I’m training someone to be a C++ programmer, it would be nice if I never had to teach them about char*’s and pointer arithmetic. It would be nice if I could go straight to STL strings. But one day they’ll write the code “foo” + “bar”, and truly bizarre things will happen, and then I’ll have to stop and teach them all about char*’s anyway. Or one day they’ll be trying to call a Windows API function that is documented as having an OUT LPTSTR argument and they won’t be able to understand how to call it until they learn about char*’s, and pointers, and Unicode, and wchar_t’s, and the TCHAR header files, and all that stuff that leaks up.
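
If you’re curious what “truly bizarre” looks like, here is a minimal sketch of the two faces of + in C++:

```cpp
// The + leak in action: string literals are char pointers, so + on them
// is pointer arithmetic (or an outright compile error), never
// concatenation. Only a real std::string operand makes + concatenate.
#include <iostream>
#include <string>

int main() {
    // std::string oops = "foo" + "bar"; // won't compile: can't add pointers

    const char *p = "foobar" + 3;        // legal! pointer arithmetic:
    std::cout << p << "\n";              // prints "bar", not "foobar3"

    std::string foo = "foo";
    std::cout << foo + "bar" << "\n";    // prints "foobar": one operand
                                         // is a real string, so + works
}
```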

In teaching someone about COM programming, it would be nice if I could just teach them how to use the Visual Studio wizards and all the code generation features, but if anything goes wrong, they will not have the vaguest idea what happened or how to debug it and recover from it. I’m going to have to teach them all about IUnknown and CLSIDs and ProgIDs and … oh, the humanity!

In teaching someone about ASP.NET programming, it would be nice if I could just teach them that they can double-click on things and then write code that runs on the server when the user clicks on those things. Indeed ASP.NET abstracts away the difference between writing the HTML code to handle clicking on a hyperlink (<a>) and the code to handle clicking on a button. Problem: the ASP.NET designers needed to hide the fact that in HTML, there’s no way to submit a form from a hyperlink. They do this by generating a few lines of JavaScript and attaching an onclick handler to the hyperlink. The abstraction leaks, though. If the end-user has JavaScript disabled, the ASP.NET application doesn’t work correctly, and if the programmer doesn’t understand what ASP.NET was abstracting away, they simply won’t have any clue what is wrong.

The law of leaky abstractions means that whenever somebody comes up with a wizzy new code-generation tool that is supposed to make us all ever-so-efficient, you hear a lot of people saying “learn how to do it manually first, then use the wizzy tool to save time.” Code generation tools which pretend to abstract out something, like all abstractions, leak, and the only way to deal with the leaks competently is to learn about how the abstractions work and what they are abstracting. So the abstractions save us time working, but they don’t save us time learning.

And all this means that paradoxically, even as we have higher and higher level programming tools with better and better abstractions, becoming a proficient programmer is getting harder and harder.

During my first Microsoft internship, I wrote string libraries to run on the Macintosh. A typical assignment: write a version of strcat that returns a pointer to the end of the new string. A few lines of C code. Everything I did was right from K&R — one thin book about the C programming language.
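
Here’s a sketch of what that assignment might have looked like (my_strcat is a made-up name, and like the real strcat it trusts the caller to provide a big enough buffer):

```cpp
// Like strcat, but return a pointer to the end of the result, so that
// repeated appends don't rescan the whole string from the start.
#include <cstdio>

char *my_strcat(char *dest, const char *src) {
    while (*dest)                  // walk to the end of dest
        ++dest;
    while ((*dest++ = *src++))     // copy src, including the NUL
        ;
    return dest - 1;               // points at the new terminating NUL
}

int main() {
    char buf[32] = "Hello";
    char *end = my_strcat(buf, ", ");
    my_strcat(end, "world");       // append again without rescanning
    std::printf("%s\n", buf);      // prints: Hello, world
}
```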

Today, to work on CityDesk, I need to know Visual Basic, COM, ATL, C++, InnoSetup, Internet Explorer internals, regular expressions, DOM, HTML, CSS, and XML. All high level tools compared to the old K&R stuff, but I still have to know the K&R stuff or I’m toast.

Ten years ago, we might have imagined that new programming paradigms would have made programming easier by now. Indeed, the abstractions we’ve created over the years do allow us to deal with new orders of complexity in software development that we didn’t have to deal with ten or fifteen years ago, like GUI programming and network programming. And while these great tools, like modern OO forms-based languages, let us get a lot of work done incredibly quickly, suddenly one day we need to figure out a problem where the abstraction leaked, and it takes 2 weeks. And when you need to hire a programmer to do mostly VB programming, it’s not good enough to hire a VB programmer, because they will get completely stuck in tar every time the VB abstraction leaks.

The Law of Leaky Abstractions is dragging us down.

2002/11/08

I thought that I’d be relaxing this week, after shipping FogBUGZ 3.0 on Monday.

I figured I’d come in late, spend a couple of hours posting some articles I have backlogged for Joel on Software, catch up on the translation effort, and maybe take off the afternoons to see movies.

Murphy had other plans.

The number of users trying the free online version of FogBUGZ surged to unheard-of levels, and the server didn’t handle it very well, necessitating a last-minute rearchitecting of the way the trial server creates private databases. Hopefully that is fully in place now and the trial server should be rock solid.

In a perfect world, should we have load-tested the trial server before going live? It seems like “best practices,” right? Maybe not. Let’s assume load testing costs 4 engineer days, which seems about right to me. Fixing the server to handle the load actually did cost 4 engineer days.

I need a table.

COST            do load testing   don’t load test
server OK       4 days            0 days
server NOT OK   8 days            4 days

If we have no information about whether the server is going to survive the load, i.e., there is a 50% chance it will fail, the expected cost with load testing is 6 days (0.5 × 4 + 0.5 × 8), as opposed to 2 days (0.5 × 0 + 0.5 × 4) without load testing.

Hmm, cheaper not to load test. That is, unless the cost of failure is higher than 4 days’ work. The actual cost of failure was that some people couldn’t get into their trial databases for about an hour before we noticed and kicked the server. Probably no big deal; certainly not worth 4 engineer days.

I may have drawn the wrong conclusion; maybe one of those people who lost interest was considering a site license for 300,000 IBM employees. But still, you need some kind of economic model to decide where to spend your limited resources. You can’t make sensible decisions reliably by saying things like “load testing is a no-brainer” or “the server will probably survive.” Those are emotional brain droppings, not analysis. And in the long run we scientists will win.

At Fog Creek we do calculations like this all the time. For example, a lot of our internal utilities and databases are really pretty buggy. The bugginess causes pain, but fixing the bugginess would cost us actual money. It’s not worth spending an engineering day to fix a problem that wastes 30 seconds of someone’s time once a month. This only applies to software for internal consumption. With the software that we sell, those tiny incremental improvements are the whole reason our software is better and can compete in the marketplace.

In other words: with internal software, there are steeply diminishing marginal returns as you fix smaller and smaller bugs, so it is economically rational for internal software systems to be somewhat buggy, as long as they get the job done. Past that point, bug fixes become deadweight.

With commercial software in highly competitive markets (like software project tracking) virtually all your competitive advantage comes from fixing smaller and smaller bugs to attain a higher level of quality. That’s one of the big differences between those two worlds.

FogBUGZ sales are through the roof, which is nice, of course, but it means that the usual 5% of people with weird configuration issues who need help are taking up a lot more time than before.

And no matter how perfect your setup program or your web site, you always get phone calls from potential customers who just want to chat with a human for 10 minutes and then they’ll buy. OK.

All in all, it’s been busy and hectic when I was hoping for a quiet week. Ah well.

2002/11/04

FogBUGZ 3.0 is now shipping! This is a really huge upgrade; FogBUGZ moves up from being a simple bug tracking package to a rather robust management system that handles the entire development process.

One big theme of FogBUGZ 3.0 was “listen to your customers.” If you use FogBUGZ you can get feedback from your customers via email or the web. You can even add a menu item to send feedback in your application, or catch crashes from the field and submit them directly into FogBUGZ like we do with CityDesk. Customer suggestions can become feature items, prioritized and tracked alongside other development tasks.

Here at Fog Creek we started using the email integration component to handle all customer service email that comes into the company. Instead of using a lame single-user email client like Outlook, we now have a web interface that shows everybody messages as they come in, which anyone can assign, track, and prioritize just like bugs. And a complete record of every customer email interaction is kept in FogBUGZ for all to see, so anybody in the company can pick up the thread of a conversation with a customer. If your company-wide email aliases (“info@little-hope-inn.com”) are causing problems, you should look at FogBUGZ, even if you don’t use it for bug tracking at all.

Inspired by the Extreme Programming idea of prioritized user stories, FogBUGZ lets you maintain lists of features, ideas, and bugs, keep them prioritized at all times, and always work off the top of your list. If 3×5 cards or big whiteboards don’t cut it for you, FogBUGZ is a great way to do Extreme Programming. (Read about how the Six Degrees team did it.)

I’m also happy to say that FogBUGZ does not have certain features that seem sensible but which experience shows reduce the overall effectiveness of the product. I’ve already talked about custom fields, which are dangerous if used inappropriately, but there are some other things that FogBUGZ intentionally does not do.

For example, inspired by software testing guru Cem Kaner and of course Dr. Deming, FogBUGZ does not provide individual performance metrics for people. If you want a report showing which programmer makes the most bugs, or the infamous “which programmer has the most bugs that they allegedly ‘fixed’ reopened by testing because they were not really fixed,” FogBUGZ won’t give it to you. Why? Because as soon as you start measuring people and compensating people based on things like this, they are going to start adjusting their behavior to optimize these numbers, and not in the way you intended. Every bug report becomes an argument. Programmers insist on recategorizing bugs as “features.” Or they refuse to check in code until the performance review period is over. Testers are afraid to enter bugs — why antagonize programmers? And pretty soon, the measurements give you what you “wanted”: the number of bugs in the bug tracking system goes down to zero. Of course, there are just as many bugs in the code; bugs are an inevitable part of writing software, and they’re just not being tracked. And the bug tracking software, hijacked as an HR crutch, becomes worthless for what it was intended for.

We long ago decided that to make FogBUGZ successful, it needs to reflect and enforce as much as we know about good software development practices. That’s why I’m very proud of the product we released today. Check it out, there’s a free online demo. If you buy the software this month we’re giving out free copies of Mary Romero Sweeney’s great book. And if you’re using any other bug tracking method — commercial, open source, or Excel files on a server — we’ll let you take 30% off as a competitive upgrade.