Small, compatible and secure are the watchwords of the new generation of office files.
All the major office software makers plan to adopt XML-based file formats in future. Despite this, there’s no sign of a single standard format emerging.
There are already two rival camps: the Open Document format being promoted by IBM, Sun (Star Office) and Openoffice.org, and Microsoft’s own variant of XML.
Office 2007 will read and write Microsoft’s own Open XML files, but it won’t support Open Document out of the box.
Microsoft has recently relented somewhat with the announcement of its Open XML Translator project, which will let developers create a bridge between the rival formats.
Although this is being presented as a battle of the document formats, both sides are technically quite similar. Both file types have a common basis in XML (Extensible Markup Language).
All alphanumeric document content – presentations, text or tables – is stored in XML files. All other document elements, such as graphics and OLE (Object Linking and Embedding) or VBA (Visual Basic for Applications) objects, are strictly separated from them.
Further XML files belonging to the document can hold supplementary information (known as metadata) about format templates and definitions, comments, paths to linked resources, the author, number of characters and so on.
Open minded
In both Open Document and Open XML, all the constituent parts of the document are kept together in a Zip container file that appears as the actual document to the user.
Both types of file use a compressed archive, which reduces the storage space required. XML is plain text, which compresses particularly well.
Embedded picture files are converted into a space-saving format during saves and then the lossless Zip compression shrinks them further. In our tests, files which were saved in the new format shrank by 50-90 per cent compared with their original size.
Better data integrity is promised through the use of a CRC (Cyclic Redundancy Check) checksum, a familiar component of the Zip file format. A checksum is stored for each file in the archive and verified whenever that file is unpacked.
The CRC is highly sensitive to any modifications to the archived data. But even if part of the Zip archive contains errors, you can still make use of the remaining data.
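As a rough illustration of the CRC check, the following Python sketch uses the standard library’s zipfile module to re-read every part of a document container and compare it against its stored checksum. The function name find_corrupt_part is our own invention, not something either office suite provides, and zipfile opens the container whatever its file extension.

```python
import zipfile


def find_corrupt_part(path_or_file):
    """Return the name of the first part that fails its CRC check, or None.

    Works on any Zip-based document (for example an Odt or DocX file),
    given either a file path or an open binary file object.
    """
    with zipfile.ZipFile(path_or_file) as container:
        # testzip() reads every member and verifies its stored CRC-32.
        return container.testzip()
```

A return value of None means every part passed the check. If a name comes back, only that part is suspect and the other parts of the container can still be read.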
Once a document has been saved from Star Office (Odt, Open Document Text) or Microsoft Word (DocX, Word Open XML), you can rely on the content being stored securely, having been checked by a proven algorithm.
Separate data storage, compression and CRC testing also have other advantages.
Because Windows has its own Zip decompression routine, the choice of Zip is a plus point: if need be, the files can be worked on without any special software.
Simply change the document’s file extension from .odt or .docx to .zip. You can then browse the data container like any compressed folder in Windows Explorer, or open it with a Zip-compatible compression utility such as PKzip for Windows.
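The same inspection can be scripted. The sketch below, again in Python with only the standard zipfile module, lists the parts inside a container together with their original and compressed sizes. The function name list_parts is hypothetical, and no renaming is needed because zipfile ignores the file extension.

```python
import zipfile


def list_parts(path_or_file):
    """Return (name, original_size, compressed_size) for each part of a container."""
    with zipfile.ZipFile(path_or_file) as container:
        return [
            (info.filename, info.file_size, info.compress_size)
            for info in container.infolist()
        ]
```

Called on a text document saved from Openoffice.org, the result would typically include entries such as content.xml, styles.xml and meta.xml. Comparing the two size columns shows how much each part gains from compression.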
In the case of Openoffice.org, all the text is saved in pure XML files. You can use copy and paste to move it to another program – for example, a text editor – without having the suite’s Writer component installed on your PC.
By using a PHP script and the add-on Pclzip, it’s possible to extract the content from large quantities of documents automatically.
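For readers who would rather not use PHP, the same idea can be sketched in Python using only the standard library. This is not the Pclzip approach itself but a stand-in for it. The function names extract_text and extract_folder are our own, and the sketch assumes the main text lives in content.xml (Open Document) or word/document.xml (Open XML). It is a rough illustration: tabs, repeated spaces and nested text frames are not handled.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Part that holds the main text in each container type.
MAIN_PARTS = ("content.xml", "word/document.xml")

# Namespaces used for text elements in the two formats.
ODF_TEXT = "{urn:oasis:names:tc:opendocument:xmlns:text:1.0}"
WORD_MAIN = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
PARAGRAPH_TAGS = {ODF_TEXT + "p", ODF_TEXT + "h", WORD_MAIN + "p"}


def extract_text(path_or_file):
    """Return the paragraphs of an Odt or DocX file as newline-separated text."""
    with zipfile.ZipFile(path_or_file) as container:
        names = set(container.namelist())
        part = next((p for p in MAIN_PARTS if p in names), None)
        if part is None:
            raise ValueError("no known main text part found in container")
        root = ET.fromstring(container.read(part))
    # Collect the text of every paragraph or heading, in document order.
    paragraphs = [
        "".join(element.itertext())
        for element in root.iter()
        if element.tag in PARAGRAPH_TAGS
    ]
    return "\n".join(paragraphs)


def extract_folder(folder):
    """Extract text from every .odt and .docx file directly inside a folder."""
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() in {".odt", ".docx"}:
            results[path.name] = extract_text(path)
    return results
```

extract_folder returns a dictionary that maps each file name to its text, which can then be passed on to an indexing or reporting step.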
The possibility of harvesting all or part of the content from Office files is very attractive for organisations needing to process the data from document management systems, and XML simplifies this process.
