2003/01/28

Binary Search Debugging

Something we had done since the last release of CityDesk somehow caused our publish times to increase by about 100%; on a particular large site we use for stress testing it had gone from about a minute to about two minutes.

The first thing I tried was a profiler: Compuware DevPartner Studio. Indeed this showed me where a lot of bottlenecks are; that data will be useful to speed up our publish times even more, but I really wanted to find the specific bug that I thought we had introduced which was slowing us down.

The next thing I tried was a method I learned from Gabi at Juno: the old binary search method. Before we started work on this release, publishing took 1’04”. Today it takes 1’57”. So I started checking out old versions of the source from CVS by date, rebuilding, and timing how long publishing took with each day’s build. Here’s what I found:

As of May 1: 1’57”
As of April 1: 1’05”
As of April 15: 1’05”
As of April 22: 1’06”
As of April 26: 1’58”
As of April 24: 1’05”
As of April 25: 1’05”
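
The search above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used: `is_slow` stands in for "check out the source as of that date, rebuild, and time a publish", the year is an assumption, and the sketch assumes the slowdown stays in every build once it appears.

```python
from datetime import date, timedelta
from typing import Callable


def find_first_slow_day(good: date, bad: date,
                        is_slow: Callable[[date], bool]) -> date:
    """Return the earliest date whose build is slow.

    Assumes the build as of `good` is fast, the build as of `bad` is slow,
    and that every build after the first slow one is also slow.
    """
    while (bad - good).days > 1:
        # Probe the day halfway between the known-good and known-bad dates.
        mid = good + timedelta(days=(bad - good).days // 2)
        if is_slow(mid):
            bad = mid    # the slowdown was introduced on or before `mid`
        else:
            good = mid   # the slowdown was introduced after `mid`
    return bad


probed = []


def is_slow(day: date) -> bool:
    # Hypothetical stand-in for checkout + rebuild + timed publish.
    # It simulates the result in the timings above: slow from April 26 on.
    probed.append(day)
    return day >= date(2002, 4, 26)  # year is an assumption


first_slow = find_first_slow_day(date(2002, 4, 1), date(2002, 5, 1), is_slow)
print(first_slow, "found after", len(probed), "builds")
```

Each probe halves the range of suspect days, so a month of changes needs only about five rebuilds between the two known endpoints.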

Aha! Now all I had to do was run WinDiff to compare the source tree from April 25th and April 26th, and I discovered four things that were changed that day, one of which was a function that DevPartner had told me was kind of slow, anyway. Within minutes I found the culprit — that function was originally written to cache its results because it’s often called with the same inputs, and I had inadvertently changed the cache key in one place and not another, so we were getting 100% misses instead of 99% hits. Solved! Total elapsed time to find this bug: about an hour. If your source code is much bigger than CityDesk, builds and checkouts may be slow. This is as good a reason as any to keep all your old daily builds around.
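
To show how a cache key mismatch of this kind turns hits into misses, here is a small Python sketch. It is not the CityDesk code; `ResultCache`, `render`, and the key shapes are all hypothetical, chosen only to reproduce the symptom.

```python
class ResultCache:
    """A tiny cache that counts its hits and misses."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, key):
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        return None

    def save(self, key, value):
        self._store[key] = value


def render(name):
    # Hypothetical stand-in for the slow function being cached.
    return name.upper()


def render_cached_buggy(cache, name, version):
    value = cache.lookup((name, version))  # lookup uses the new key shape
    if value is None:
        value = render(name)
        cache.save(name, value)            # save still uses the old key shape
    return value


def render_cached_fixed(cache, name, version):
    key = (name, version)                  # build the key once, use it twice
    value = cache.lookup(key)
    if value is None:
        value = render(name)
        cache.save(key, value)
    return value


buggy, fixed = ResultCache(), ResultCache()
for _ in range(100):
    render_cached_buggy(buggy, "index.html", 1)
    render_cached_fixed(fixed, "index.html", 1)

print("buggy:", buggy.hits, "hits,", buggy.misses, "misses")
print("fixed:", fixed.hits, "hits,", fixed.misses, "misses")
```

The buggy version still returns correct results, which is why a bug like this shows up only as a slowdown. Building the key in exactly one place makes the mismatch impossible.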

2003/01/25

A week of Murphy’s Law gone wild.

Chapter One. The Linux server hosting our CVS repository (all our source code) fails. No big deal, it is automatically mirrored (using rdist) to a remote location. It takes a few hours to compress and transmit the mirrored data. We discover that we forgot the option to rdist that removes deleted files, so the mirror isn’t perfect: it includes files that were deleted. These have to be manually removed.

When this is all done I decide to check out the whole source tree from scratch and compare it to what I already have, as a final sanity check. But I don’t have enough disk space on my laptop to do this. Time to upgrade. I order a 60 GB laptop hard drive and a PCMCIA/hard drive connector that is supposed to allow you to clone the old hard drive onto the new one. This process takes something like 6 hours and fails when it is 50% complete, instructing me to “run scandisk.” Which takes a couple of hours. Start another copy. 6 hours more. At 50%, it fails again. Only now, the original hard drive is toast, taking my entire life with it. It takes a couple of hours of fiddling around, putting the drive into different computers, etc., to discover that it is indeed lost.

OK, not too big a deal, we have daily backups (NetBackup Pro). I put the new 60 GB drive into the laptop, format it, and install Windows XP Pro. I instruct NetBackup Pro to restore that machine to its pre-crash state. I’ll lose a day of work, but it was a day in which I hardly got anything done, anyway. A day of email was lost so if you sent me something this week and I didn’t respond, resend.

NetBackup Pro works for a few hours. I go home, to let it finish overnight. In the morning the system is completely toast and won’t even boot. I hypothesize that it must be because I tried to restore a Win2K image on top of an XP Pro OS. So I start again, this time installing Win 2K (format hard drive: 1 hour; install Win 2K: 1 hour; then install the NetBackup Pro Client). And I start the restore again. Five hours later, it’s only halfway done, and I go home.

The next morning, the system doesn’t quite boot, it blue-screens, but a half hour of fiddling around with Safe Mode and I get it to boot happily. And behold, everything is restored, except, for some reason, a few files which I let Windows encrypt for me (using EFS) are inaccessible. This has something to do with public keys and certificates. When you restore a file that was encrypted I guess you can’t read it. I still haven’t found the solution to this. If you know how to fix this I will be forever indebted to you. [1/26: I fixed this problem after a few hours of tearing out my hair.]

Lesson Learned: This is not the first time that a hard drive failure has led to a series of other problems that wound up wasting days and days of work; in fact, I believe it is the third. Notice that I had a very respectable backup strategy: everything was backed up daily, offsite. Conclusion: backups aren’t good enough. I want RAID mirroring from now on. When a drive dies I want to spend 15 minutes putting in a new drive and resume working exactly where I left off. New policy: all non-laptops at Fog Creek will have RAID mirroring.

Chapter Two. Did you notice that our web server was down? On Friday around noon a fire in a local Verizon switch knocked out all our phone lines and our Internet connectivity. Verizon got the phone lines working in a couple of hours, but the T1 was a bit more problematic. We purchased the T1 from Savvis, which, in turn, hired MCI (now called WorldCom) to run the local loop, and of course WorldCom doesn’t actually run any loops, God forbid they should get their hands dirty; they just buy the local loop from Verizon.

So from Friday at noon until Saturday at midnight, Michael and I, working as a tag team, call Savvis every hour or so to see what’s going on. We’re pushing on Savvis, who, occasionally, push on WorldCom, who have decided that some kind of SQL Server DDOS attack can be blamed for everything, so they kind of ignore Savvis, who don’t tell us that WorldCom is ignoring them, and we push on Savvis again, and they push on WorldCom again, and around the third time WorldCom agrees to call Verizon, who send out a tech who fixes the thing. Honestly, it’s like pushing on a string. Just like the last time Savvis made our T1 go down for a day, the technical problem was relatively trivial and could have been diagnosed and fixed in minutes if we weren’t dealing with so many idiot companies.

Lesson Learned: When you’re buying a service from a company that’s just outsourcing that service, one level deep, it’s difficult to get decent customer service. When there are two levels of outsourcing, it’s nearly impossible. Much as I hate to encourage monopolistic local telcos, the only thing worse than dealing with a local telco directly is dealing with another idiot bureaucratic company that itself has no choice but to deal with the local telco. Our next office space will be wired by Verizon DSL, thank you very much.

Incidentally, none of you would have noticed this outage at all if Dell had delivered our damn server on time. We were supposed to be up and running in a nice Peer1 highly redundant secure colocation facility a month ago. See previous rant. Did I mention that I have a fever? I always get sick when things are going wrong.

Chapter Three. For the thousandth time, the heat on the fourth floor of the Fog Creek brownstone is out. Heat is supplied by hot water pipes running through the walls. These pipes were frozen solid. How did they get a chance to freeze? Oh, that’s because the furnace went off last week, because it was installed by an idiot moron, probably unlicensed, who put in a 25 foot long horizontal chimney segment which prevents ventilation and has, so far, hospitalized one tenant and caused the furnace to switch off dozens of times. Finally someone at the heating company admitted that it was possible to install a draft inducer forcing the chimney to ventilate, which they did, but not before the hot water pipes had frozen. Of course, the pipes are inadequately insulated due to another incompetent in the New York City construction trade, but this wouldn’t have mattered if the furnace had kept running.

Lesson Learned: Weak systems may appear perfectly healthy until neighboring systems break down. People with allergies and back problems may go for months without suffering from either one, but suddenly an attack of hay fever makes them sneeze hard enough to throw out their back. You see this in systems administration all the time. Use these opportunities to fix all the problems at once. Get RAID on all your PCs and do backups, and don’t use EFS, and always get hard drives that are way too large so you’ll never have to stop to upgrade them, and double-check the command line options to rdist. Install the draft inducer and insulate the pipes. Move your important servers to a secure colo facility and switch the office T1 to Verizon.

2003/01/16

Mouth Wide Shut

When Apple releases a new product, they tend to surprise the heck out of people, even the devoted Apple-watchers who have spent the last few months riffling through garbage dumpsters at One Infinite Loop.

Microsoft, on the other hand, can’t stop talking about products that are mere glimmers in someone’s eye. Testers outside the company were using .NET for years before it finally shipped.

So, which is right? Should you talk endlessly about your products under development, in hopes of building buzz, or should you hold off until you’ve got something ready to go?

Fog Creek’s default policy so far has been Absolute Radio Silence. At times, I’ve considered changing that policy. After all, why not open up our complete development process to the world, let everybody peek in the windows and see what’s going on? There’s nothing to hide!

In my personal life, I have a policy lifted from Marlon Brando, playing a mob boss in The Freshman: “Every word I say, by definition, is a promise.” The best way to avoid breaking promises is not to make any, and that’s as good a reason as I need not to talk about future versions of our products. There are four other reasons for this in software development.

Competition. I’m not overly paranoid about competition, but as soon as you have competition that’s paying attention to what you do, if you discuss features that haven’t shipped yet, you’re handing them an opportunity. If you keep your mouth shut until you ship, you will be guaranteed at least half a year without competition before everyone else matches that cool new feature.

Underpromise and overdeliver. When you tell a customer or potential customer you’re going to add a feature, they will inevitably imagine that the feature will do all kinds of things that it may not actually do. When you tell someone about the upcoming clam steaming feature, no matter how careful you are to delimit what it actually does, they will inevitably spin elaborate fantasies about how it cures baldness and warts and has a telepathic user interface. When you finally deliver something, they are bound to be disappointed. This can only hurt.

Flexibility. If you want to keep your promises, you can’t talk about upcoming features and release dates unless you’re willing to lock into them. This may eliminate flexibility that you need later when Murphy strikes. And if you don’t keep your promises, you’ve ruined something that’s very hard to get back: your reputation.

Simplicity. If your policy is Radio Silence, every employee understands it and can follow it. If your policy is in any way complicated, nobody is sure what to do and things leak.

Doesn’t advance buzz and publicity help? I don’t know. A little, but not as much as nonadvance publicity. I’m inclined to think that publicity that comes out when you can’t actually buy the product is 90% wasted. Remember that incredibly big burst of Segway publicity about a year ago? With Jeff Bezos and Steve Jobs talking about how “IT” was going to revolutionize the entire universe? Cities would be reconfigured. OK, so, we all talked about the Segway, but nobody could buy one, so it’s not clear that it was publicity well spent. And it certainly seems like the same amount of publicity would have helped more if it had appeared when every Walmart had Segways in stock.

One purported advantage of talking about your products under development is to get early feedback. But honestly, you don’t need feedback from the entire world. Look at the poor Chandler guys; they started talking about their product before any design was done and immediately got buried under such a deluge of feedback that just managing it all was impossible. Now everybody thinks Chandler is going to be All Things to Everybody. Quite a lot to live up to. I’ve found that you can get just as good feedback from a few hundred carefully selected customers. The other 800,000 people who send you suggestions might be sending good suggestions, but you already heard them, so they aren’t adding value. (By the way, Chandler did exactly the right thing, since they are an open source project. They don’t care if competitors use their ideas, and at this stage it’s worth sifting through everybody’s crazy feature requests if that’s the price of attracting more volunteers to write the code.)

A lot of times customers and potential customers come to me and say, “FogBUGZ (or CityDesk) is almost perfect for my needs. But I need a salad spinner. When are you going to have a salad spinner? Will the next release have a salad spinner?” To these people I have to say, “I don’t know. We might, or we might not. If you really can’t live without a salad spinner, I’m afraid you’ll have to go somewhere else.” OK, maybe we lose a sale or two by refusing to indulge in the vaporware habit. I’m a patient man, and Fog Creek is profitable so I don’t need the cash. Next year we’ll have salad spinners, and I won’t lose the sale. It’s better than selling you something that doesn’t fit your needs and having you get pissed off at me, or getting a reputation as being unable to deliver.

2003/01/15

Local Optimization, or, The Trouble With Dell

One of my core beliefs is that if you pick some aspect of your business to optimize, and focus only on that one number, you tend to ruin other parts of the business that you’re not measuring as carefully, resulting in a local optimization that actually harms your business as a whole.

Dell Computer is a great case in point. In books and magazine articles (see also the January Business 2.0) they brag endlessly about how low their inventory is. Everybody at Dell knows that what makes Mr. Dell happy is reducing the inventory to a bare minimum. Inventory, he says, costs money, especially in the fast moving computer industry where a part on a shelf has a half life of 6 months.

Unfortunately, the dirty little secret about Dell is that all they have really done is push the pain of inventory up to their suppliers and down to their customers. Their suppliers end up building big warehouses right next to the Dell plants where they keep the inventory, which gets reflected in the cost of the goods that Dell consumes. And every time there’s a little hiccup in supplies, Dell customers just don’t get their products. I ordered a new server from them more than a month ago and every week, the day the server is supposed to ship, I get an automated email telling me that my order is delayed by another week and there’s nothing they can do about it. Today the email said that the server will ship January 21 (originally promised January 7th).

I called Dell to ask what was holding up my server. “It looks like it’s the CPU,” the rep told me. “It’s those Xeons.”

“OK, do you have any different CPUs you could put in? Maybe a faster or slower Xeon?”

“Let me check that for you, sir. Nope, there are no Xeons available until January 30th.”

“None? At any speed?” I asked, incredulously. Somehow I don’t believe that Dell can’t get a single Xeon until next month.

“No sir.”

“How about a Pentium 4?” I know that the server I’m buying used to come with Pentium 4s.

“No, we don’t make those servers with Pentium 4s any more.”

Sounds to me like they don’t make those servers at all.

When you need a computer right away, you can’t call Dell: even when everything is working perfectly they have to build the computer for you and it takes a week or two to get it. Compare that to, say, PCConnection, which can overnight a new IBM or HP server to you and you get it the next day, and Dell is at a significant competitive disadvantage. Combine that with the fact that the no-margin-of-error inventory model means that every hiccup in the supply chain automatically results in an angry customer, and you have a pretty serious liability that probably hurts Dell a lot more than not carrying a few days of inventory helps them. But Michael Dell never told his employees to optimize for customer satisfaction or to optimize for delivery time, he told them to optimize for inventory velocity and nothing else, and that is what he got.

2003/01/09

I’ve spent most of the week pounding the pavement, looking at new office space for Fog Creek. We’re focusing on a neighborhood in New York called the Garment District, which used to be home to thousands of small clothing factories. Many of them are still there, but they’re slowly getting pushed out of business, and the old factories are being gentrified: architects, designers, modeling agencies, photographers, and of course software companies are taking over, putting in nice new windows and polished wood floors, replacing the rows of sewing machines with laptop computers, and paying double the rent. It’s great to be a creditworthy tenant looking for office space; there’s so much empty office space in New York that landlords are falling over themselves to do good deals.

Clay Shirky: “Two years and hundreds of millions of dollars later, FedEx pulled the plug on ZapMail, allowing it to vanish without a trace. And the story of ZapMail’s collapse holds a crucial lesson for the telephone companies today.” Excellent. Clay has hit the nail right on the head; everybody’s looking for the huge business opportunities around 802.11 and VoIP and they probably aren’t really there.

2003/01/03

I posted a bit of internal documentation that is likely to be of interest only to people using VB6 with DAO database access. It describes the entity classes that we’re implementing as we refactor CityDesk, in order to isolate database access code and provide a clean, super-easy interface to database tables. CityDesk Entity Classes.
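
For readers who don’t use VB6, the general idea of an entity class can be sketched in Python with the standard `sqlite3` module. This is not the CityDesk design, and it is not DAO; the `articles` table, the `Article` record, and the `ArticleTable` class are all hypothetical. The point is only that every SQL statement for a table lives in one class, and the rest of the program calls plain methods.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class Article:
    """One row of the hypothetical `articles` table."""
    id: Optional[int]
    headline: str


class ArticleTable:
    """All database access for `articles` is isolated in this class."""

    def __init__(self, conn: sqlite3.Connection):
        self._conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS articles ("
            "id INTEGER PRIMARY KEY, headline TEXT NOT NULL)"
        )

    def add(self, headline: str) -> Article:
        cur = self._conn.execute(
            "INSERT INTO articles (headline) VALUES (?)", (headline,)
        )
        return Article(cur.lastrowid, headline)

    def get(self, article_id: int) -> Optional[Article]:
        row = self._conn.execute(
            "SELECT id, headline FROM articles WHERE id = ?", (article_id,)
        ).fetchone()
        return Article(*row) if row else None


# Callers never write SQL themselves.
conn = sqlite3.connect(":memory:")
articles = ArticleTable(conn)
saved = articles.add("Binary Search Debugging")
loaded = articles.get(saved.id)
print(loaded)
```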

Yesterday I switched over the CityDesk setup from Inno Setup to My Inno Setup Extensions. Advantages: Pascal scripting, which we don’t need but might, and conditional compilation, which allows me to use a single source setup script for the main setup and the upgrade. Inspired by Dr. Motulsky I eliminated three unnecessary steps from setup, for example, the stupid screen that might as well just say, “OK, we realize you just clicked next, please click next again now, thank you.” (The famous “ready to install” screen that nobody reads.)

2003/01/02

“With TDD, you create an automated test first, and only then write the minimal amount of code that you can get away with to satisfy that test. Every time someone finds a new bug, it gets added to the fully automated test suite. Then the programmer writes the minimal amount of code to make the new test pass (which makes the bug go away).” —From Test Driving Test Driven Development, a column I’ve written that will appear in the next issue of STQE Magazine. More paper! You can’t read it online, but the nice folks at STQE have offered 15% off the usual subscription price for Joel on Software readers.
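
The cycle described in that quote can be sketched in a few lines of Python. The function, the tests, and the bug report are all invented for illustration; none of them come from the column.

```python
def word_count(text: str) -> int:
    # The minimal code that satisfies the tests below, and nothing more.
    # A hypothetical earlier version used text.split(" "), which counted
    # the empty string as one word; the second test was written for that bug.
    return len(text.split())


def test_counts_words():
    # Written first, before word_count existed.
    assert word_count("publish the site") == 3


def test_bug_empty_text_has_zero_words():
    # Added when the (hypothetical) bug was reported, and kept in the
    # automated suite so the bug cannot quietly come back.
    assert word_count("") == 0


# Run the whole suite; any failure raises AssertionError.
for test in (test_counts_words, test_bug_empty_text_has_zero_words):
    test()
print("all tests passed")
```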

Bookmark this: DLL Help is a complete database of every DLL Microsoft has ever shipped, and which versions of which product it shipped with. I’m trying to figure out which of the 7 versions of scrrun.dll in the wild is causing problems for the occasional CityDesk user.

A good introduction to Dr. Deming’s philosophy of management: Four Days with Dr. Deming summarizes the four day seminars Deming used to give to business leaders. Key insight: you can’t improve your team’s performance just by picking some numeric measurement and then rewarding or punishing people to optimize it. Problem one: the variability in the measurement may be caused by a broken system that only management can change, not by individual performance. Problem two: people may optimize locally to improve that one measurement, even at the cost of hurting the performance of the company as a whole. If you’re in a rut constantly trying to figure out how to rejigger your employees’ incentive systems, this book will get you out of it.