Archive for February 2008

Going to Tokyo 06 Feb

I'm heading out to Tokyo tomorrow in advance of next week's Shoeisha Developers Summit.

See you later!

Why are the Microsoft Office file formats so complicated? (And some workarounds) 19 Feb

Last week, Microsoft published the binary file formats for Office. These formats appear to be almost completely insane. The Excel 97-2003 file format is a 349 page PDF file. But wait, that’s not all there is to it! This document includes the following interesting comment:

Each Excel workbook is stored in a compound file.

You see, Excel 97-2003 files are OLE compound documents, which are, essentially, file systems inside a single file. These are sufficiently complicated that you have to read another 9 page spec to figure that out. And these “specs” look more like C data structures than what we traditionally think of as a spec. It's a whole hierarchical file system.

If you started reading these documents with the hope of spending a weekend writing some spiffy code that imports Word documents into your blog system, or creates Excel-formatted spreadsheets with your personal finance data, the complexity and length of the spec probably cured you of that desire pretty darn quickly. A normal programmer would conclude that Office’s binary file formats:

  • are deliberately obfuscated
  • are the product of a demented Borg mind
  • were created by insanely bad programmers
  • and are impossible to read or create correctly.

You’d be wrong on all four counts. With a little bit of digging, I’ll show you how those file formats got so unbelievably complicated, why it doesn’t reflect bad programming on Microsoft’s part, and what you can do to work around it.

The first thing to understand is that the binary file formats were designed with very different design goals than, say, HTML.

They were designed to be fast on very old computers. For the early versions of Excel for Windows, 1 MB of RAM was a reasonable amount of memory, and an 80386 at 20 MHz had to be able to run Excel comfortably. There are a lot of optimizations in the file formats that are intended to make opening and saving files much faster:

  • These are binary formats, so loading a record is usually a matter of just copying (blitting) a range of bytes from disk to memory, where you end up with a C data structure you can use. There’s no lexing or parsing involved in loading a file. Lexing and parsing are orders of magnitude slower than blitting.
  • The file format is contorted, where necessary, to make common operations fast. For example, Excel 95 and 97 have something called “Simple Save” which they use sometimes as a faster variation on the OLE compound document format, which just wasn’t fast enough for mainstream use. Word had something called Fast Save. To save a long document quickly, 14 out of 15 times, only the changes are appended to the end of the file, instead of rewriting the whole document from scratch. On the hard drives of the day, this meant saving a long document took one second instead of thirty. (It also meant that deleted data in a document was still in the file. This turned out to be not what people wanted.)

They were designed to use libraries. If you wanted to write a from-scratch binary importer, you’d have to support things like the Windows Metafile Format (for drawing things) and OLE Compound Storage. If you’re running on Windows, there’s library support for these that makes it trivial... using these features was a shortcut for the Microsoft team. But if you’re writing everything on your own from scratch, you have to do all that work yourself.

Office has extensive support for compound documents, for example, you can embed a spreadsheet in a Word document. A perfect Word file format parser would also have to be able to do something intelligent with the embedded spreadsheet.

They were not designed with interoperability in mind. The assumption, and a fairly reasonable one at the time, was that the Word file format only had to be read and written by Word. That means that whenever a programmer on the Word team had to make a decision about how to change the file format, the only thing they cared about was (a) what was fast and (b) what took the fewest lines of code in the Word code base. The idea of things like SGML and HTML—interchangeable, standardized file formats—didn’t really take hold until the Internet made it practical to interchange documents in the first place; this was a decade later than the Office binary formats were first invented. There was always an assumption that you could use importers and exporters to exchange documents. In fact Word does have a format designed for easy interchange, called RTF, which has been there almost since the beginning. It’s still 100% supported.

They have to reflect all the complexity of the applications. Every checkbox, every formatting option, and every feature in Microsoft Office has to be represented in file formats somewhere. That checkbox in Word’s paragraph menu called “Keep With Next” that causes a paragraph to be moved to the next page if necessary so that it’s on the same page as the paragraph after it? That has to be in the file format. And that means if you want to implement a perfect Word clone than can correctly read Word documents, you have to implement that feature. If you’re creating a competitive word processor that has to load Word documents, it may only take you a minute to write the code to load that bit from the file format, but it might take you weeks to change your page layout algorithm to accommodate it. If you don’t, customers will open their Word files in your clone and all the pages will be messed up.

They have to reflect the history of the applications. A lot of the complexities in these file formats reflect features that are old, complicated, unloved, and rarely used. They’re still in the file format for backwards compatibility, and because it doesn’t cost anything for Microsoft to leave the code around. But if you really want to do a thorough and complete job of parsing and writing these file formats, you have to redo all that work that some intern did at Microsoft 15 years ago. The bottom line is that there are thousands of developer years of work that went into the current versions of Word and Excel, and if you really want to clone those applications completely, you’re going to have to do thousands of years of work. A file format is just a concise summary of all the features an application supports.

Just for kicks, let’s look at one tiny example in depth. An Excel worksheet is a bunch of BIFF records of different types. I want to look at the very first BIFF record in the spec. It’s a record called 1904.

The Excel file format specification is remarkably obscure about this. It just says that the 1904 record indicates “if the 1904 date system is used.” Ah. A classic piece of useless specification. If you were a developer working with the Excel file format, and you found this in the file format specification, you might be justified in concluding that Microsoft is hiding something. This piece of information does not give you enough information. You also need some outside knowledge, which I’ll fill you in on now. There are two kinds of Excel worksheets: those where the epoch for dates is 1/1/1900 (with a leap-year bug deliberately created for 1-2-3 compatibility that is too boring to describe here), and those where the epoch for dates is 1/1/1904. Excel supports both because the first version of Excel, for the Mac, just used that operating system’s epoch because that was easy, but Excel for Windows had to be able to import 1-2-3 files, which used 1/1/1900 for the epoch. It’s enough to bring you to tears. At no point in history did a programmer ever not do the right thing, but there you have it.

Both 1900 and 1904 file types are commonly found in the wild, usually depending on whether the file originated on Windows or Mac. Converting from one to another silently can cause data integrity errors, so Excel won’t change the file type for you. To parse Excel files you have to handle both. That’s not just a matter of loading this bit from the file. It means you have to rewrite all of your date display and parsing code to handle both epochs. That would take several days to implement, I think.

Indeed, as you work on your Excel clone, you'll discover all kinds of subtle details about date handling. When does Excel convert numbers to dates? How does the formatting work? Why is 1/31 interpreted as January 31 of this year, while 1/50 is interpreted as January 1st, 1950? All of these subtle bits of behavior cannot be fully documented without writing a document that has the same amount of information as the Excel source code.

And this is only the first of hundreds of BIFF records you have to handle, and one of the simplest. Most of them are complicated enough to reduce a grown programmer to tears.

The only possible conclusion is this. It's very helpful of Microsoft to release the file formats for Microsoft and Office, but it's not really going to make it any easier to import or save to the Office file formats. These are insanely complex and rich applications, and you can’t just implement the most popular 20% and expect 80% of the people to be happy. The binary file specification is, at most, going to save you a few minutes reverse engineering a remarkably complex system.

OK, I promised some workarounds. The good news is that for almost all common applications, trying to read or write the Office binary file formats is the wrong decision. There are two major alternatives you should seriously consider: letting Office do the work, or using file formats that are easier to write.

Let Office do the heavy work for you. Word and Excel have extremely complete object models, available via COM Automation, which allow you to programmatically do anything. In many situations, you are better off reusing the code inside Office rather than trying to reimplement it. Here are a few examples.

  1. You have a web-based application that’s needs to output existing Word files in PDF format. Here’s how I would implement that: a few lines of Word VBA code loads a file and saves it as a PDF using the built in PDF exporter in Word 2007. You can call this code directly, even from ASP or ASP.NET code running under IIS. It’ll work. The first time you launch Word it’ll take a few seconds. The second time, Word will be kept in memory by the COM subsystem for a few minutes in case you need it again. It’s fast enough for a reasonable web-based application.
  2. Same as above, but your web hosting environment is Linux. Buy one Windows 2003 server, install a fully licensed copy of Word on it, and build a little web service that does the work. Half a day of work with C# and ASP.NET.
  3. Same as above, but you need to scale. Throw a load balancer in front of any number of boxes that you built in step 2. No code required.

This kind of approach would work for all kinds of common Office types of applications you might perform on your server. For example:

  • Opening an Excel workbook, storing some data in input cells, recalculating, and pulling some results out of output cells
  • Using Excel to generate charts in GIF format
  • Pulling just about any kind of information out of any kind of Excel worksheet without spending a minute thinking about file formats
  • Converting Excel file formats to CSV tabular data (another approach is to use Excel ODBC drivers to suck data out using SQL queries).
  • Editing Word documents
  • Filling out Word forms
  • Converting files between any of the many file formats supported by Office (there are importers for dozens of word processor and spreadsheet formats)

In all of these cases, there are ways to tell the Office objects that they’re not running interactively, so they shouldn’t bother updating the screen and they shouldn’t prompt for user input. By the way, if you go this route, there are a few gotchas, and it's not officially supported by Microsoft, so read their knowledge base article before you get started.

Use a simpler format for writing files. If you merely have to produce Office documents programmatically, there’s almost always a better format than the Office binary formats that you can use which Word and Excel will open happily, without missing a beat.

  • If you simply have to produce tabular data for use in Excel, consider CSV.
  • If you really need worksheet calculation features that CSV doesn’t support, the WK1 format (Lotus 1-2-3) is a heck of a lot simpler than Excel, and Excel will open it fine.
  • If you really, really have to generate native Excel files, find an extremely old version of Excel… Excel 3.0 is a good choice, before all the compound document stuff, and save a minimum file containing only the exact features you want to use. Use this file to see the exact minimum BIFF records that you have to output and just focus on that part of the spec.
  • For Word documents, consider writing HTML. Word will open those fine, too.
  • If you really want to generate fancy formatted Word documents, your best bet is to create an RTF document. Everything that Word can do can be expressed in RTF, but it’s a text format, not binary, so you can change things in the RTF document and it’ll still work. You can create a nicely formatted document with placeholders in Word, save as RTF, and then using simple text substitution, replace the placeholders on the fly. Now you have an RTF document that every version of Word will open happily.

Anyway, unless you’re literally trying to create a competitor to Office that can read and write all Office files perfectly, in which case, you’ve got thousands of years of work cut out for you, chances are that reading or writing the Office binary formats is the most labor intensive way to solve whatever problem it is that you’re trying to solve.

Some corrections 21 Feb

A well-meaning, but rushed, journalist named Ben Charny interviewed me this morning about Microsoft's interoperability press release.

He made a ridiculous number of mistakes. That's the nature of the wire reporters on deadline. Here are some corrections.

  • "For want of a few key details from Microsoft Corp. (MSFT), Joel Spolsky tried but failed more than 10 years ago to make a better version of Microsoft's remote office desktop-computer feature."

Not true. Fog Creek Copilot was developed less than three years ago and has been under continous development since then. It has been a profitable product and we're still developing new versions. I believe that what we built is better than Microsoft's Remote Desktop in many ways (it works through firewalls, supports Macintosh and Windows, and is easier to set up for ad-hoc tech support) and it's worse in some ways (it's slower, and uses JPG compression as an optimization which can make the screen blurry). So it's debatable whether I "failed." There was no need for "a few key details from Microsoft" because we don't interoperate with Remote Desktop, we use the open source VNC protocol (incidentally, the client code for Copilot is freely available under the GPL).

  • "Now Project Co-Pilot has gotten new life."

OK, it's not "Project Co-Pilot", it's Fog Creek Copilot, and, like I say, we've been working on it continuously for almost three years and are about to release a major new version, so nothing about Microsoft's announcement granted it "new life." That part is just a fabrication.

  • "On Thursday Spolsky finally located those elusive lines of code tucked inside 30,000 pages made public Thursday by Microsoft. Before they were available only under trade-secrets licenses."

Huh? Lines of code? Ok, I understand, tech journalists may not understand the difference between "lines of code" and protocol specifications. When the press release came out, I was curious to see if it included a spec for the Remote Desktop protocol. We've always known how the protocol works, and how it transmits GDI commands for performance reasons, which is neither rocket science nor a trade secret. I don't know if the concept is patented, but the X Window server worked this way before Windows was even invented, so if there is a patent for this GDI business, it ain't Microsoft's, but that's neither here nor there.

Copilot doesn't use the remote desktop protocol, full stop, and we don't plan to. I just happened to look for that in the spec (in the 15 minutes between Microsoft's press release and the time the journalist called me) because I was curious to see what kind of stuff was in there, and this was an area I had wondered about.

  • By sharing more technical information about its key products on Thursday, Microsoft has jump-started a wave of development destined to unleash software that will compete with many more of Microsoft's own products. One of the first appears to be remote desktop software, which uses a secured Internet connection to remotely access files and features stored on office- desktop computers. Currently, there's no third-party version of the software, save for Citrix Systems Inc. (CTXS), which has a cross-licensing deal with Microsoft. But Microsoft's existing product is a very basic one, making it rife for improvements. Its relative simplicity is demonstrative of how there's been no competitive offerings that would have forced Microsoft to make a better widget.

This part is what he got mostly right, and it's what I said. As far as I know, there are no competitive clients for Windows Remote Desktop (formerly called Terminal Services) except for Citrix's cross-licensed implementation, presumably because the protocol was never publicized. As a result, if you want to use Windows Remote Desktop, you are stuck with the rudimentary clients Microsoft gives you. There are LOTS of great competitive remote desktop solutions that include both the client and the server; besides our own Copilot, there's Bomgar, LogMeIn, GotoMyPC, and the granddaddy PCAnywhere, and another dozen or so options. But I'm pretty sure none of them interoperate with Remote Desktop because the spec has not been available. Everybody, including Copilot, has their own protocol, usually a variation of the RFB protocol [PDF spec].

So I thought this was one example of an area where Microsoft actually stood to gain from publicizing their protocols. It's bound to open up lots of opportunities in hundreds of areas (remote desktop is just a tiny example) where third-party developers will be able to develop better drop-in versions of various pieces of the Microsoft software stack, which helps the Microsoft ecosystem more than it detracts from their business. Microsoft isn't making a dime off of RDC because it's free and built into Windows... a few competive options with more features can only help the Windows business in the long run.

(Updated 2/22) A correction to the correction.

A correction to the correction 22 Feb

Some readers were kind enough to point out in the comments that there are a couple of open source projects that integrate with Remote Desktop. I guess news analysis that appears within hours of a story breaking is never very good. The whole business of remote desktop access is an entire industry, and this is just one tiny fraction of the protocols that were published by Microsoft yesterday, so Charny's story was the moral equivalent of trying to decide who is going to win the next American election by talking to a couple of truckers at a bar. The problem wasn't the accuracy in reporting (although that was sorely lacking), the problem was that the story was too ambitious. When Charny called me I should have said, "The Schleswig-Holstein question is so complicated, only three men in Europe have ever understood it. One was Prince Albert, who is dead. The second was a German professor who became mad. I am the third and I have forgotten all about it."

PS here are the Microsoft Open Protocol Specifications.

Speakers for Business of Software Conference 23 Feb

Neil Davidson has posted a preliminary list of some of the speakers for this year's Business of Software Conference. “So far I've lined up (in no particular order) Joel Spolsky, Eric Sink, Richard Stallman, Dharmesh Shah, Jason Fried, Steve Johnson and Paul Kenny.”

Inspired Misfires 25 Feb

From my February column in Inc. Magazine: “The bureaucrats in Washington had forgotten Newton's first law: An object in motion tends to remain in motion.”

Pressed uniforms, mid-winter tans, and morale 28 Feb

March's Inc. column is online: “Don't tell your star salespeople to take the bus and stay with relatives when they make that call in St. Louis, even though that's what you did when you started the company.”

Current News >>

Historical Archive

1111 posts over 14 years. Everything I’ve ever published is right here.

1999           Dec
2000  MarAprMayJunJulAugSepOctNovDec
2001JanFebMarAprMayJunJulAugSepOctNovDec
2002JanFebMarAprMayJunJulAugSepOctNovDec
2003JanFebMarAprMayJunJulAugSepOctNovDec
2004JanFebMarAprMayJunJulAugSepOctNovDec
2005JanFebMarAprMayJunJulAugSepOctNovDec
2006JanFebMarAprMayJunJulAugSepOctNovDec
2007JanFebMarAprMayJunJulAugSepOctNovDec
2008JanFebMarAprMayJunJulAugSepOctNovDec
2009JanFebMarAprMayJunJulAugSepOctNovDec
2010JanFebMarAprMayJunJulAugSepOct Dec
2011JanFebMarAprMayJun  Sep   
2012JanFebMarApr  Jul     
2013  MarApr  Jul     

Now that you’ve read all that —

There’s a software company in New York City dedicated to doing things the right way and proving that it can be done profitably and successfully.

Fog Creek Software