I thought that I’d be relaxing this week, after shipping FogBUGZ 3.0 on Monday.

I figured I’d come in late, spend a couple of hours posting some articles I have backlogged for Joel on Software, catch up on the translation effort, and maybe take off the afternoons to see movies.

Murphy had other plans.

The number of users trying the free online version of FogBUGZ surged to unheard-of levels, and the server didn’t handle it very well, necessitating a last-minute rearchitecting of the way the trial server creates private databases. Hopefully that is fully in place now and the trial server will be rock solid.

In a perfect world, should we have load-tested the trial server before going live? It seems like “best practices,” right? Maybe not. Let’s assume load testing costs 4 engineer days, which seems about right to me. Fixing the server to handle the load actually did cost 4 engineer days.

I need a table.

COST             do load testing    don’t load test
server OK        4 days             0 days
server NOT OK    8 days             4 days

If we have no information about whether the server is going to survive the load, i.e., there is a 50% chance it will fail, the expected cost with load testing is 6 days as opposed to 2 days without load testing.
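The arithmetic behind those two numbers can be sketched in a few lines of Python. This is only an illustration of the table above: the function names and the `downtime_days` parameter are made up for the example, and the 4-day figures are the estimates already given.

```python
def expected_cost(p_fail, cost_if_ok, cost_if_fail):
    """Expected cost in engineer days, given the chance the server fails."""
    return (1 - p_fail) * cost_if_ok + p_fail * cost_if_fail


# Figures from the table, in engineer days.
TEST_DAYS = 4  # estimated cost of load testing
FIX_DAYS = 4   # cost of fixing the server once it turns out to be broken


def compare(p_fail, downtime_days=0.0):
    """Return (expected cost with load testing, expected cost without).

    downtime_days is a hypothetical extra cost of failing in production
    (lost trials, annoyed customers). It only applies when we skip testing.
    """
    with_testing = expected_cost(p_fail, TEST_DAYS, TEST_DAYS + FIX_DAYS)
    without_testing = expected_cost(p_fail, 0, FIX_DAYS + downtime_days)
    return with_testing, without_testing


print(compare(0.5))  # (6.0, 2.0)
```

Passing a non-zero `downtime_days`, or a different failure probability, shows how quickly the comparison moves once a public failure starts costing more than the time to fix it.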

Hmm, cheaper not to load test. That is, unless the cost of failure is higher than 4 days’ work. The actual cost of failure was that some people couldn’t get into their trial databases for about an hour before we noticed and kicked the server. Probably no big deal, and a fair trade for the 4 engineer days we saved.

I may have drawn the wrong conclusion; maybe one of those people who lost interest was considering a site license for 300,000 IBM employees. But still, you need some kind of economic model to decide where to spend your limited resources. You can’t make sensible decisions reliably by saying things like “load testing is a no-brainer” or “the server will probably survive.” Those are emotional brain droppings, not analysis. And in the long run we scientists will win.

At Fog Creek we do calculations like this all the time. For example, a lot of our internal utilities and databases are really pretty buggy. The bugginess causes pain, but fixing the bugginess would cost us actual money. It’s not worth spending an engineering day to fix a problem that wastes 30 seconds of someone’s time once a month. This only applies to software for internal consumption. With the software that we sell, those tiny incremental improvements are the whole reason our software is better and can compete in the marketplace.

In other words: with internal software, there are steeply diminishing marginal returns as you fix smaller and smaller bugs, so it is economically rational for internal software systems to be somewhat buggy, as long as they get the job done. After that, bug fixes become deadweight.

With commercial software in highly competitive markets (like software project tracking) virtually all your competitive advantage comes from fixing smaller and smaller bugs to attain a higher level of quality. That’s one of the big differences between those two worlds.

FogBUGZ sales are through the roof, which is nice, of course, but it means that the usual 5% of people with weird configuration issues who need help are taking up a lot more time than usual.

And no matter how perfect your SETUP or your web site, you always get phone calls from potential customers who just want to chat with a human for 10 minutes and then they’ll buy. OK.

All in all, it’s been busy and hectic when I was hoping for a quiet week. Ah well.

About the author.

I'm Joel Spolsky, co-founder of Trello and Fog Creek Software, and CEO of Stack Overflow.