Today I came across a column posted (in Dutch) on Webwereld entitled "It’s a trap!". The column responds to the recent decision at a LibreOffice/OpenOffice Workshop to put more effort into supporting Microsoft’s proprietary OOXML format.

Perhaps it is because I have been reading Pull. Perhaps it is because I have spent the past year working at a new company that is trying to address some related issues. Either way, this column got me thinking about the essence of what we call a document nowadays, and how that might change in the coming years.


It’s a trap! in short…

For those of you who do not read Dutch, the point of the original column "It’s a trap!" is that today’s word processors are simply the latest step in the evolution of the paper document: from clay tablet to an electronic tool for preparing documents to be printed on static paper. However, the author states, documents are increasingly becoming electronic entities that live primarily on computers, independent of the static form of the paper document. In light of this he is surprised that the LibreOffice community wants to concentrate on adopting the "enemy" document format and challenge Microsoft on its home ground. That just plays to Microsoft’s strength, he argues, when the community could instead try to pull ahead of Microsoft in the area of the location- and medium-independent document that simply lives in the cloud.

Semantic web

As I mentioned earlier, I’ve been working for a new company for the past year. This company is trying to embrace part of the vision of the semantic web as put forth in books like David Siegel’s Pull. As part of this new job I’ve been able to spend time thinking about the vision of Pull and how reversing the current model of interaction between companies and consumers will affect the way we think about personal data and data sharing.

The basic idea behind the semantic web is that it must be possible to interpret the meaning of (textual) resources and the relationships to other resources automatically and unambiguously. To that end these relationships are described formally in a (the?) semantic web, enabling software to reason about these relationships according to preset rules in order to deliver value to human users of the software.

Once you start thinking about it, the consequence of building and using a semantic web is that data is no longer the only thing that is important. In addition to data (which exists for human interpretation and consumption), formally structured data about data becomes very important: the metadata. This metadata, such as a formal description of relationships, describes what data is about in a way that machines can understand. Metadata is what enables automated processing of data, and the richer the metadata becomes and the more of it is linked together in a semantic web, the more it will enable smart(er) software to process data in new and unusual ways. It will enable software to discover data in the cloud and even generate new data by combining existing data through actual understanding of that data.
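To make that a bit more tangible, here is a minimal sketch of the idea in Python. It is not how any real semantic web toolkit works; the file names, people, company and the "publishedBy" rule are all made up for illustration. Facts are stored as (subject, predicate, object) triples, and one preset rule derives a new fact from two existing ones.

```python
# Hypothetical facts stored as (subject, predicate, object) triples.
# All names here are invented for illustration only.
triples = {
    ("report.docx", "author", "alice"),
    ("alice", "worksFor", "acme"),
    ("memo.odt", "author", "bob"),
    ("bob", "worksFor", "acme"),
}


def infer_published_by(facts):
    """Apply one preset rule:
    if X has author A and A worksFor C, then X is publishedBy C."""
    derived = set()
    for doc, predicate, person in facts:
        if predicate != "author":
            continue
        for subject, predicate2, company in facts:
            if predicate2 == "worksFor" and subject == person:
                derived.add((doc, "publishedBy", company))
    return derived


# New data generated purely from existing, machine-readable metadata.
new_facts = infer_published_by(triples)
```

Nobody ever stated who published report.docx, yet the software can work it out, because the relationships were described formally enough to reason over.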

It’s a document, but not as we know it

So when is a datum data and when is it metadata? That’s a good question. The answer is that it depends. Data is intended for humans and metadata for computers… but why shouldn’t humans know who the sender of an email is, just because that information is metadata? Or the other way around, why shouldn’t a computer try to make some sense of data, or try to perform some processing of data, just because the primary audience is human? Search engines do exactly that…
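The email example is easy to try out. The sketch below uses Python’s standard email module on a made-up message (the address, subject and folder scheme are my own assumptions) and reads the very same From header twice: once as data shown to a human, once as metadata that drives automated filing.

```python
from email import message_from_string

# A made-up message; the address and text are purely illustrative.
raw = "From: alice@example.org\nSubject: Lunch?\n\nAre you free at noon?\n"
msg = message_from_string(raw)


def show_to_human(message):
    """Treat the sender as data: something a person wants to read."""
    return f"{message['From']} wrote: {message.get_payload().strip()}"


def file_into_folder(message):
    """Treat the sender as metadata: input for automated processing."""
    domain = message["From"].split("@")[1]
    return f"inbox/{domain}"
```

Same bytes, two roles; which one applies depends only on who is asking.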

The upshot of it is that data can be data and can be metadata interchangeably, depending on the context. In other words, the role of data is not fixed but depends on how you happen to be treating it. Nothing new there, of course. Which is when I came across that column.

The column raises the question of whether LibreOffice should try to follow the OOXML format when the very nature of document storage and presentation is changing. That’s a good point, but in the light of the semantic web it’s not going far enough. Since data can be both data and metadata and metadata enables processing (such as presentation), I say the real question is: what is a document?

The point of a file format like .DOC or .DOCX is that it contains document data (i.e. contents), what users consider metadata (i.e. properties like the author), and application-specific metadata pertaining to the way applications are supposed to present the data. All very nice and useful. But the last of those three categories is only useful to applications that want to present the content exactly the way Word does. An application that wants to apply different formatting, or wants to read the contents out loud, or wants to do something completely different from presenting the contents on-screen or on paper is not interested in the Word instructions at all.
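As a rough sketch of those three categories (this is my own toy model, not the actual structure of .DOC or .DOCX, and every name in it is hypothetical), consider a document object that keeps content, application-independent metadata and presentation hints apart. Two different "applications" then process it, and neither needs the Word-style hints.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Toy model; the field names are assumptions, not a real file format."""
    content: list[str]                   # the paragraphs themselves
    metadata: dict[str, str]             # application-independent properties
    presentation_hints: dict[str, str] = field(default_factory=dict)  # app-specific


def render_plain(doc: Document) -> str:
    """One application: plain text with an upper-cased title."""
    title = doc.metadata.get("title", "Untitled")
    return "\n\n".join([title.upper(), *doc.content])


def render_spoken(doc: Document) -> str:
    """Another application: a script to be read out loud."""
    title = doc.metadata.get("title", "Untitled")
    author = doc.metadata.get("author", "an unknown author")
    return " ".join([f"{title}, by {author}.", *doc.content])


doc = Document(
    content=["First paragraph.", "Second paragraph."],
    metadata={"title": "Quarterly notes", "author": "alice"},
    presentation_hints={"font": "Calibri", "page_size": "A4"},
)
```

Both renderers work from the content and the shared metadata alone; the font and page size are carried along in the file but never consulted.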

So here’s a question: what is really a document? Is it a file on a storage medium? Or is it just the contents and the application-independent metadata? Is it in any way interesting to include presentation hints in a document? After all, if you follow the visions of data pull and the semantic web, real value comes from making data application-agnostic and allowing each application to overlay any presentation that fits its purposes.

And that is the direction I think documents will start following more and more. Even in the original Web 1.0 the intention already existed that data only be described and display be left to the user agent. After having veered off the path in HTML 3 and 4, the web is now morphing into the transport medium for a worldwide system of applications forming (partially) dynamic SOA networks, whose entire possibility of existence is based on being able to process raw data without having to wade through superfluous bytes that are specific to one particular application. Application networks like that are based on location-agnostic data with plenty of application-independent metadata to allow for all sorts of processing without binding to a specific application.

So I think that in the future our notion of a document will change completely. I think that we will move to a separation of content and semantic data from everything else, meaning that we will be divorcing documents from the applications that process them. We will even divorce documents from presentation forms and formats. The distinction between a Word document, a PDF, a spreadsheet, a (NoSQL) database of facts and relations, an MP3 recording of the contents and a steganographic encoding of some piece of abstract art will vanish, and the informational content of the document will depend only on the application doing the processing (note that "informational content" is not the same as "semantics").

Oh, by the way…

The column that inspired this blog post was wondering whether or not the LibreOffice community should invest effort in becoming (more) compatible with the OOXML format and so emulate MS Office. Contrary to what you might be thinking, I feel that the answer given by that column is wrong. Not because I think the goal should be to become fully OOXML-compatible; I do feel the very notion of "document" and "document format" should change. But in the world that I think we are moving towards, the value of an application will not be in its compatibility with one data format but in how much value it creates for a user in the way it interprets and (re)processes the basic data and metadata.

So how does this relate to LibreOffice and its OOXML compatibility? Well, LibreOffice could stand to start offering more value as an office suite. Simply put, LibreOffice sucks as an office suite. It wasn’t so bad when Microsoft Office was still in its pre-2007 releases and documents from MS Office sucked in the same way as those from OpenOffice. But with Office 2007, Microsoft took a huge leap ahead of all the alternatives out there. And the difference isn’t in any technical improvement they made either; Microsoft didn’t add any great new technical functionality, any huge new editing or DTP facilities or anything to do with the document content. But they did invest a lot in a number of document looks and feels (designed by graphic designers) that make it possible to create good-looking documents very simply (i.e. to present the content in an attractive way with little effort). A vast improvement over other office suites (including earlier versions of MS Office), which produced documents whose appearance sucked unless you put in A LOT of effort. In other words, Microsoft put a lot of effort into how they processed and presented the basic content of each file and so added a lot of value for the end user, rather than adding useless functionality that nobody needs like they did in earlier releases. Their concentration on adding this value means MS Office is now a far better document processor than LibreOffice. And in that sense, yes, the LibreOffice community should definitely try to emulate Microsoft Office far more than it is doing now. Not for the document format, but for the value the office suite adds as a document interpreter.