TripIt is awesome

You know what I really like? TripIt.com. It’s amazingly simple. You take all those travel confirmation emails that you get from your travel agent, hotels, car rental agencies, etc., and you just forward them to plans@tripit.com. That’s all you have to do. You don’t have to sign up for an account. You don’t have to log on. You just forward those emails. You can do it right now.

You get a link back by email, with a beautifully organized itinerary, showing all your travel data plus maps, weather reports, and all the confirmation numbers for your flights and addresses of your hotels and so on.

It’s kind of magical. You don’t have to fill out lots of little fields with all the details, because they’ve done a lot of work to parse those confirmation emails correctly… it worked flawlessly for my upcoming trip to Japan.

Think of it this way. Suppose you want to enter a round trip flight on your calendar. The minimum information you need to enter is probably:

  1. the airline
  2. the flight number
  3. four times (departure and arrival, there and back)
  4. four time zones (or else your phone will tell you that your flight is at 5 pm when it’s really at 2 pm)
  5. a confirmation number (for when the airline denies that you exist)
  6. where you’re going

All in all, it takes a few minutes and is very error-prone. Whereas with TripIt, you just take that email from the airline or Orbitz, hit Ctrl+F, type plans@tripit.com, and send. Done.

TripIt is a beautiful example of the Figure It Out school of user interface design. Why should you need to register? TripIt figures out who you are based on your email address. Why should you parse the schedule data? Everyone gets email from the same 4 online travel agencies, 100-odd airlines, 15 hotel chains, 5 car rental chains… it’s pretty easy to just write screen scrapers for each of those to parse out the necessary data.
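To make that concrete, here’s a minimal sketch of what one of those scrapers might look like. This is my own illustration, not TripIt’s actual code: the email template, field names, and regular expressions are all hypothetical. The point is just that a handful of patterns per sender recovers exactly the fields from the list above.

```python
import re
from datetime import datetime

# Hypothetical confirmation email for one airline. In practice you'd pick
# the right scraper based on the sender's address; every airline and agency
# has its own (fairly stable) template.
SAMPLE_EMAIL = """\
Confirmation code: ABC123
Flight: UA 837
Departs: San Francisco (SFO) 2008-03-01 11:45
Arrives: Tokyo Narita (NRT) 2008-03-02 15:25
"""

PATTERNS = {
    "confirmation": re.compile(r"Confirmation code:\s*(\w+)"),
    "flight":       re.compile(r"Flight:\s*([A-Z]{2})\s*(\d+)"),
    "departs":      re.compile(r"Departs:\s*(.+\))\s+([\d-]+ [\d:]+)"),
    "arrives":      re.compile(r"Arrives:\s*(.+\))\s+([\d-]+ [\d:]+)"),
}

def parse_confirmation(text):
    """Pull out the airline, flight number, confirmation number, and the
    departure/arrival places and times (time zones omitted for brevity)."""
    itinerary = {"confirmation": PATTERNS["confirmation"].search(text).group(1)}
    airline, number = PATTERNS["flight"].search(text).groups()
    itinerary["airline"], itinerary["flight_number"] = airline, int(number)
    for leg in ("departs", "arrives"):
        place, when = PATTERNS[leg].search(text).groups()
        itinerary[leg] = (place, datetime.strptime(when, "%Y-%m-%d %H:%M"))
    return itinerary

if __name__ == "__main__":
    print(parse_confirmation(SAMPLE_EMAIL))
```

Multiply that by a few hundred senders, add the time zone handling I skipped, and you have the unglamorous core of the service.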

Anyway, it’s a shame I have to say this, but I have no connection whatsoever to tripit.com.

Microsoft can’t speak straight any more

Here’s how Microsoft says, “SQL Server 2008 will be late:”

“We want to provide clarification on the roadmap for SQL Server 2008. Over the coming months, customers and partners can look forward to significant product milestones for SQL Server. Microsoft is excited to deliver a feature complete CTP during the Heroes Happen Here launch wave and a release candidate (RC) in Q2 calendar year 2008, with final Release to manufacturing (RTM) of SQL Server 2008 expected in Q3. Our goal is to deliver the highest quality product possible and we simply want to use the time to meet the high bar that you, our customers, expect.”

What? Can you understand that? “A feature complete CTP during the Heroes Happen Here launch wave?” What on earth does that mean?

The guy who wrote this, Francois Ajenstat, ought to be ashamed of himself. Have some guts. Just say it’s late. We really don’t care that much. SQL Server 2005 is fine. As Judge Judy says, “Don’t piss on my leg and tell me it’s raining.”

Phil Factor explains.

:CueCat is back!

Google: “2D barcodes are an especially exciting part of this because they allow readers to ‘click’ on interesting print ads with their cellphones and seamlessly connect to relevant online content.”

Years ago, I went out on a limb and dismissed a similar scheme thus: “The number of dumb things going on here exceeds my limited ability to grok all at once. I’m a bit overwhelmed with what a feeble business idea this is.”

OK, more than seven years have passed. Things have changed. People have camera phones with web browsers now. Some things are still the same: typing URLs is not hard, this is a monumental chicken and egg problem, and this doesn’t provide any value to the consumers who are expected to install new software on their phones to go along with this ridonculous scheme.

Sometimes when the elders say to the youngsters, “don’t do that, we tried that, it failed,” it’s just because they’re failing to notice that the world has changed. But sometimes the elders are right, and the youngsters really are too young to know the history of the idea they think that they’ve just invented.

I guess we’ll get to watch to see whether the oldsters or the youngsters will win this one.

Still, it doesn’t say much for the quality of those 150 people Google hires every week that they’re now chasing some of the worst of the bad ideas of the fin de siècle. What’s next, GooglePetFood.com?

Copilot is now free on weekends

Remember Fog Creek Copilot? The app that our 2005 interns built that lets you remote control someone’s computer over the Internet to help them with technical problems?

Well, recently we figured out that we’re paying for a lot of bandwidth over the weekends that we don’t need, so we decided to make Copilot absolutely free on weekends. Yep, that’s right… free as in zero dollars, free, no cost, no credit card, no email address, nothing.

How it works: You go to https://www.copilot.com, enter your name, and get an invitation code. You then download and run a tiny piece of software. Tell your friend the invitation code, they go to copilot.com and enter it, and they download a tiny piece of software. Now you’re controlling their computer. Works with Windows or Macintosh, through almost any firewall.
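The “works with almost any firewall” part deserves a word. The usual trick, and presumably something like what Copilot does, is that both sides only ever make outbound connections to a server in the middle, which pairs them up by invitation code and shuttles bytes back and forth. Here’s a toy sketch of that general reflector idea, my own simplification rather than Copilot’s actual protocol; the port number and code handling are invented:

```python
import socket
import threading

# Toy "reflector": both the helper and the person being helped connect
# outbound to this server and send an invitation code; the server pairs
# the two sockets that present the same code and copies bytes both ways.
HOST, PORT = "0.0.0.0", 9000     # port is made up for illustration
waiting = {}                     # invitation code -> socket of the first peer
lock = threading.Lock()

def pump(src, dst):
    """Copy bytes one way until either side hangs up."""
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    finally:
        src.close()
        dst.close()

def handle(conn):
    code = conn.recv(16).decode().strip()    # e.g. "12-3456"
    with lock:
        other = waiting.pop(code, None)
        if other is None:
            waiting[code] = conn              # first peer: wait for a partner
            return
    # Second peer with the same code: wire the two sockets together.
    threading.Thread(target=pump, args=(conn, other), daemon=True).start()
    threading.Thread(target=pump, args=(other, conn), daemon=True).start()

def main():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((HOST, PORT))
    srv.listen()
    while True:
        conn, _ = srv.accept()
        threading.Thread(target=handle, args=(conn,), daemon=True).start()

if __name__ == "__main__":
    main()
```

Because each participant dials out rather than accepting an inbound connection, NAT routers and corporate firewalls treat the session like any other outgoing traffic.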

Details: Weekend = 8pm EST (GMT-5) Friday night to 2am EST Monday morning. Copilot subscribers can use Copilot free on weekends, too.

ALSO! The Copilot team is still hard at work; Copilot 3.0 is just starting to enter testing. Tyler and Ben want to hire a summer intern in marketing for the Copilot team. If you’re a smart college student who’s more interested in marketing than software development, please apply by emailing your resume to jobs@fogcreek.com.

Five whys

At 3:30 in the morning of January 10th, 2008, a shrill chirping woke up our system administrator, Michael Gorsuch, asleep at home in Brooklyn. It was a text message from Nagios, our network monitoring software, warning him that something was wrong.

He swung out of bed, accidentally knocking over (and waking up) the dog, sleeping soundly in her dog bed, who, angrily, staggered out to the hallway, peed on the floor, and then returned to bed. Meanwhile Michael logged onto his computer in the other room and discovered that one of the three data centers he runs, in downtown Manhattan, was unreachable from the Internet.

This particular data center is in a secure building in downtown Manhattan, in a large facility operated by Peer 1. It has backup generators, several days of diesel fuel, and racks and racks of batteries to keep the whole thing running for a few minutes while the generators can be started. It has massive amounts of air conditioning, multiple high speed connections to the Internet, and the kind of “right stuff” down-to-earth engineers who always do things the boring, plodding, methodical way instead of the flashy cool trendy way, so everything is pretty reliable.

Internet providers like Peer 1 like to guarantee the uptime of their services in terms of a Service Level Agreement, otherwise known as an SLA. A typical SLA might state something like “99.99% uptime.” When you do the math, let’s see, there are 525,949 minutes in a year (or 525,600 if you are in the cast of Rent), so that allows them 52.59 minutes of downtime per year. If they have any more downtime than that, the SLA usually provides for some kind of penalty, but honestly, it’s often rather trivial… like, you get your money back for the minutes they were down. I remember once getting something like $10 off the bill from a T1 provider because of a two-day outage that cost us thousands of dollars. SLAs can be a little bit meaningless that way, and given how low the penalties are, a lot of network providers just started advertising 100% uptime.
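The arithmetic is worth checking once, since it comes up again below. A quick sketch; the percentages are just the usual marketing tiers:

```python
# Allowed downtime per year at various SLA levels.
MINUTES_PER_YEAR = 525_949   # ~365.24 days (525,600 if you're in the cast of Rent)

for sla in (99.9, 99.99, 99.999, 99.9999):
    allowed = MINUTES_PER_YEAR * (1 - sla / 100)
    print(f"{sla}% uptime allows {allowed:.2f} minutes of downtime per year")

# 99.99%   -> 52.59 minutes per year
# 99.9999% ("six nines") -> 0.53 minutes, i.e. about 31.5 seconds
```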

Within 10 minutes everything seemed to be back to normal, and Michael went back to sleep.

Until about 5:00 a.m. This time Michael called the Peer 1 Network Operations Center (NOC) in Vancouver. They ran some tests, started investigating, couldn’t find anything wrong, and by 5:30 a.m. things seemed to be back to normal, but by this point, he was as nervous as a porcupine in a balloon factory.

At 6:15 a.m. the New York site lost all connectivity. Peer 1 couldn’t find anything wrong on their end. Michael got dressed and took the subway into Manhattan. The server seemed to be up. The Peer 1 network connection was fine. The problem was something with the network switch. Michael temporarily took the switch out of the loop, connecting our router directly to Peer 1’s router, and lo and behold, we were back on the Internet.

By the time most of our American customers got to work in the morning, everything was fine. Our European customers had already started emailing us to complain. Michael spent some time doing a post-mortem, and discovered that the problem was a simple configuration problem on the switch. There are several possible speeds that a switch can use to communicate (10, 100, or 1000 megabits/second). You can either set the speed manually, or you can let the switch automatically negotiate the highest speed that both sides can work with. The switch that failed had been set to autonegotiate. This usually works, but not always, and on the morning of January 10th, it didn’t.

Michael knew this could be a problem, but when he installed the switch, he had forgotten to set the speed, so the switch was still in the factory-default autonegotiate mode, which seemed to work fine. Until it didn’t.

Michael wasn’t happy. He sent me an email:

I know that we don’t officially have an SLA for On Demand, but I would like us to define one for internal purposes (at least). It’s one way that I can measure if myself and the (eventual) sysadmin team are meeting the general goals for the business. I was in the slow process of writing up a plan for this, but want to expedite in light of this morning’s mayhem.

An SLA is generally defined in terms of ‘uptime’, so we need to define what ‘uptime’ is in the context of On Demand. Once that is made clear, it’ll get translated into policy, which will then be translated into a set of monitoring / reporting scripts, and will be reviewed on a regular interval to see if we are ‘doing what we say’.

Good idea!

But there are some problems with SLAs. The biggest one is the lack of statistical meaningfulness when outages are so rare. We’ve had, if I remember correctly, two unplanned outages, including this one, since going live with FogBugz on Demand six months ago. Only one was our fault. Most well-run online services will have two, maybe three outages a year. With so few data points, the length of the outage starts to become really significant, and that’s one of those things that’s wildly variable. Suddenly, you’re talking about how long it takes a human to get to the equipment and swap out a broken part. To get really high uptime, you can’t wait for a human to switch out failed parts. You can’t even wait for a human to figure out what went wrong: you have to have previously thought of every possible thing that can possibly go wrong, which is vanishingly improbable. It’s the unexpected unexpecteds, not the expected unexpecteds, that kill you.

Really high availability becomes extremely costly. The proverbial “six nines” availability (99.9999% uptime) means only about 30 seconds of downtime per year. That’s really kind of ridiculous. Even the people who claim that they have built some big multi-million dollar superduper ultra-redundant six nines system are gonna wake up one day, I don’t know when, but they will, and something completely unusual will have gone wrong in a completely unexpected way, three EMP bombs, one at each data center, and they’ll smack their heads and have fourteen days of outage.

Think of it this way: If your six nines system goes down mysteriously just once and it takes you an hour to figure out the cause and fix it, well, you’ve just blown your downtime budget for the next century. Even the most notoriously reliable systems, like AT&T’s long distance service, have had long outages (six hours in 1991) which put them at a rather embarrassing three nines … and AT&T’s long distance service is considered “carrier grade,” the gold standard for uptime.
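That “blown your downtime budget for the next century” line is barely an exaggeration. A quick check:

```python
# How many years of a six-nines downtime budget does one hour of outage consume?
SECONDS_PER_YEAR = 525_949 * 60
budget_per_year = SECONDS_PER_YEAR * (1 - 0.999999)     # about 31.5 seconds
print(60 * 60 / budget_per_year)                        # roughly 114 years
```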

Keeping internet services online suffers from the problem of black swans. Nassim Taleb, who invented the term, defines it thus: “A black swan is an outlier, an event that lies beyond the realm of normal expectations.” Almost all internet outages are unexpected unexpecteds: extremely low-probability outlying surprises. They’re the kind of things that happen so rarely it doesn’t even make sense to use normal statistical methods like “mean time between failure.” What’s the “mean time between catastrophic floods in New Orleans?”

Measuring the number of minutes of downtime per year does not predict the number of minutes of downtime you’ll have the next year. It reminds me of commercial aviation today: the NTSB has done such a great job of eliminating all the common causes of crashes that nowadays, each commercial crash they investigate seems to be a crazy, one-off, black-swan outlier.

Somewhere between the “extremely unreliable” level of service, where it feels like stupid outages occur again and again and again, and the “extremely reliable” level of service, where you spend millions and millions of dollars getting an extra minute of uptime a year, there’s a sweet spot, where all the expected unexpecteds have been taken care of. A single hard drive failure, which is expected, doesn’t take you down. A single DNS server failure, which is expected, doesn’t take you down. But the unexpected unexpecteds might. That’s really the best we can hope for.

To reach this sweet spot, we borrowed an idea from Sakichi Toyoda, the founder of Toyota. He called it the Five Whys. When something goes wrong, you ask why, again and again, until you ferret out the root cause. Then you fix the root cause, not the symptoms.

Since this fit well with our idea of fixing everything two ways, we decided to start using Five Whys ourselves. Here’s what Michael came up with:

  • Our link to Peer 1 NY went down
  • Why? – Our switch appears to have put the port in a failed state
  • Why? – After some discussion with the Peer 1 NOC, we speculate that it was quite possibly caused by an Ethernet speed / duplex mismatch
  • Why? – The switch interface was set to auto-negotiate instead of being manually configured
  • Why? – We were fully aware of problems like this, and have been for many years. But we do not have a written standard and verification process for production switch configurations.
  • Why? – Documentation is often thought of as an aid for when the sysadmin isn’t around or for other members of the operations team, whereas it should really be thought of as a checklist.

“Had we produced a written standard prior to deploying the switch and subsequently reviewed our work to match the standard, this outage would not have occurred,” Michael wrote. “Or, it would occur once, and the standard would get updated as appropriate.”

After some internal discussion we all agreed that rather than imposing a statistically meaningless measurement and hoping that the mere measurement of something meaningless would cause it to get better, what we really needed was a process of continuous improvement. Instead of setting up an SLA for our customers, we set up a blog where we would document every outage in real time, provide complete post-mortems, ask the five whys, get to the root cause, and tell our customers what we’re doing to prevent that problem in the future. In this case, the change is that our internal documentation will include detailed checklists for all operational procedures in the live environment.
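To make “detailed checklists” a bit more concrete, here’s the flavor of what I mean. This is a sketch of my own for illustration; the config format, keys, and standard values are invented, not our actual switch configuration. The idea is that the written standard becomes something a script can verify, rather than something a sysadmin has to remember:

```python
# Hypothetical verification script: compare a switch port's running config
# against the written standard. Keys, values, and config format are invented.
STANDARD = {
    "speed": "1000",     # set explicitly -- never left at "auto"
    "duplex": "full",
}

def parse_running_config(text):
    """Parse simple 'key value' lines from a (hypothetical) config dump."""
    settings = {}
    for line in text.splitlines():
        parts = line.strip().split()
        if len(parts) == 2:
            settings[parts[0]] = parts[1]
    return settings

def audit(running, standard):
    """Return a list of deviations from the written standard."""
    problems = []
    for key, expected in standard.items():
        actual = running.get(key, "<missing>")
        if actual != expected:
            problems.append(f"{key}: expected {expected!r}, found {actual!r}")
    return problems

if __name__ == "__main__":
    # The factory-default state that bit us: autonegotiation left on.
    running = parse_running_config("speed auto\nduplex auto")
    for problem in audit(running, STANDARD):
        print("DEVIATION:", problem)
```

Hook something like that into the monitoring system and a port left in its factory-default state gets flagged long before it wakes anyone up at 3:30 in the morning.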

Our customers can look at the blog to see what caused the problems and what we’re doing to make things better, and, hopefully, they can see evidence of steadily improving quality.

In the meantime, our customer service folks have the authority to credit customers’ accounts if they feel like they were affected by an outage. We let the customer decide how much they want to be credited, up to a whole month, because not every customer is even going to notice the outage, let alone suffer from it. I hope this system will improve our reliability to the point where the only outages we suffer are really the extremely unexpected black swans.

PS. Yes, we want to hire another system administrator so Michael doesn’t have to be the only one to wake up in the middle of the night.

This year’s Business of Software conference

The best conference I went to last year was the Business of Software Conference, organized by Neil Davidson of Red Gate Software over in Cambridge, England. It had a great lineup of speakers [PDF] including Guy Kawasaki, Eric Sink, Tim Lister (coauthor of Peopleware), Rick Chapman, Hugh MacLeod, and others, a great collection of real software businesses in attendance, and almost no fluff. Plus, my first 5.6 earthquake, experienced from the top of a high-rise hotel. Great fun.

This year when Neil approached me about co-sponsoring the conference, I thought, why not? It’s exactly the kind of conference I would organize if I were organizing a conference about the software business, which, thankfully, I’m not, but Neil is, and he’s doing a bang-up job.

So this year, it’s going to be called “Business of Software 2008: A Joel on Software Conference.” It’ll almost certainly be in Boston some time in the fall, but nothing is even remotely final yet. Sign up for the mailing list at that site, and they’ll let you know when a time and place are set.

Undergraduate programming

From CrossTalk, The Journal of Defense Software Engineering: “It is our view that Computer Science (CS) education is neglecting basic skills, in particular in the areas of programming and formal methods. We consider that the general adoption of Java as a first programming language is in part responsible for this decline.”

JavaSchools are not operating in a vacuum: they’re dumbing down their curriculum because they think it’s the only way to keep CS students. The real problem is that these schools are not doing anything positive to attract the kids who are really interested in programming, not computer science. I think the solution would be to create a programming-intensive BFA in Software Development: a Juilliard for programmers. Such a program would consist of a practical studio requirement developing significant works of software on teams with very experienced teachers, with a sprinkling of liberal arts classes for balance. It would be a huge magnet for the talented high school kids who love programming, but can’t get excited about proving theorems.

When I said BFA, Bachelor of Fine Arts, I meant it: software development is an art, and the existing Computer Science education, where you’re expected to learn a few things about NP-completeness and Quicksort, is singularly inadequate for training students to develop software.

Imagine instead an undergraduate curriculum that consists of 1/3 liberal arts and 2/3 software development work. The teachers are experienced software developers from industry. The studio operates like a software company. You might be able to major in Game Development and work on a significant game title, for example, and that’s how you spend most of your time, just like film students spend a lot of time actually making films and dance students spend most of their time dancing.

There are already several programs going in this direction: a lot of Canadian universities, notably Waterloo, have Software Engineering programs, and in Indiana, Rose-Hulman combines a good software engineering program with a co-op program called Rose-Hulman Ventures. These programs have no problem attracting lots of qualified students at a time when the Ivy League CS departments consider themselves lucky if they get a dozen majors a year.

In the meantime, think about how many computer science departments earned their reputation by writing an important piece of code: MIT’s X Window, Athena, and Lisp Machine; CMU’s Andrew File System, Mach, and Lycos; Berkeley’s Unix; the University of Kansas’ Lynx; Columbia’s Kermit. Where are those today? What have the universities given us lately? What’s the best college for a high school senior who really loves programming but isn’t so excited about lambda calculus?

Voting machine bugs

Clive Thompson, writing in The New York Times: “In 2005, the state of California complained that the machines were crashing. In tests, Diebold determined that when voters tapped the final “cast vote” button, the machine would crash every few hundred ballots. They finally intuited the problem: their voting software runs on top of Windows CE, and if a voter accidentally dragged his finger downward while touching “cast vote” on the screen, Windows CE interpreted this as a “drag and drop” command. The programmers hadn’t anticipated that Windows CE would do this, so they hadn’t programmed a way for the machine to cope with it. The machine just crashed.”