Latest update Dec 6, 2009
To avoid spam, all mail addresses on this page have the "@" replaced by "#".
One person. This project requires some knowledge of HTML.
MS Word can generate HTML versions of its documents. However, the HTML that MS Word generates has a number of drawbacks: the files typically contain a lot of meta-information, which makes them bloated; they contain explicit font directives, typically for Windows-specific fonts, which makes them look ugly when viewed on non-Windows platforms; and they often have many other formatting directives that are not necessary if the document is used for normal browsing. All this goes against the original idea of the web that HTML documents should be "markup only", specifying only content and not formatting, and that the documents should be browsable on any platform.
Very often, one would simply like to have a plain, universally viewable HTML document without any particular formatting. The task of this project is to write a filter that strips an MS Word-generated HTML file of all excess information and returns a "clean" HTML document that is browsable on any platform.
Since Microsoft does not publish any documentation on how it generates HTML, you will need to look at some MS Word-generated HTML files to see what they look like and what should be removed. This means that your filter cannot reasonably be perfect, but it should at least do a decent job on the most common "junk" found in these HTML files.
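As a starting point, the idea of removing the most common "junk" can be sketched with a handful of regular expressions. This is only a minimal sketch in Python, not a complete filter: the patterns below (conditional comments, `<xml>` and `<style>` blocks, Office namespace tags such as `<o:p>`, `font`/`span` wrappers, and `class`/`style`/`lang` attributes) are assumptions about what typically shows up in Word output, and the name `strip_word_junk` is made up for illustration. You should check the patterns against the actual test files.

```python
import re

# Assumed patterns for typical Word "junk"; verify against real Word-generated files.
JUNK_PATTERNS = [
    re.compile(r"<!--\[if.*?<!\[endif\]-->", re.S | re.I),   # conditional comments
    re.compile(r"<xml\b.*?</xml>", re.S | re.I),             # embedded XML metadata
    re.compile(r"<style\b.*?</style>", re.S | re.I),         # embedded style sheets
    re.compile(r"</?(?:o|v|w|st1):[^>]*>", re.I),            # Office namespace tags
    re.compile(r"</?(?:font|span)\b[^>]*>", re.I),           # font and span wrappers
    # class/style/lang attributes, with double-quoted, single-quoted or bare values
    re.compile(r"""\s+(?:class|style|lang)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.I),
]


def strip_word_junk(html: str) -> str:
    """Remove every match of the assumed junk patterns, in order."""
    for pattern in JUNK_PATTERNS:
        html = pattern.sub("", html)
    return html


if __name__ == "__main__":
    sample = ("<p class=MsoNormal style='margin:0cm'>"
              '<span lang=EN-US style="font-family:Arial">Hello<o:p></o:p></span></p>')
    print(strip_word_junk(sample))
```

Note that regular expressions do not understand HTML structure. The attribute pattern, for example, would also fire on body text that happens to look like ` class=...`, so a real filter will probably need a proper parser for anything beyond the simplest cases.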
A crucial part of the project is to decide what information is to be considered "excessive". This is not always immediately clear, and can differ from one situation to another. Whatever choice you make, you should justify it.
It might be good to think in terms of a general "purifier" of HTML code that removes non-essential formatting information.
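One possible way to approach such a general purifier is a whitelist: instead of listing what to remove, list the tags and attributes that are considered essential and drop everything else. The sketch below uses Python's standard `html.parser` module. The whitelist contents and the names `Purifier` and `purify` are illustrative assumptions, not a prescribed design; deciding what belongs on the list is exactly the choice the project asks you to justify.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: what counts as "essential" is a project decision.
ALLOWED_TAGS = {"html", "head", "title", "body", "h1", "h2", "h3", "p", "br",
                "ul", "ol", "li", "a", "b", "i", "em", "strong",
                "table", "tr", "td", "th", "img"}
ALLOWED_ATTRS = {"a": {"href"}, "img": {"src", "alt"}}
SKIP_CONTENT = {"style", "script", "xml"}  # dropped together with their content
VOID_TAGS = {"br", "img"}                  # elements that have no end tag


class Purifier(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside an element from SKIP_CONTENT

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            keep = ALLOWED_ATTRS.get(tag, set())
            attr_text = "".join(
                f' {name}="{escape(value or "")}"'
                for name, value in attrs if name in keep)
            self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Character references were decoded by the parser; re-escape the text.
            self.out.append(escape(data, quote=False))


def purify(html: str) -> str:
    """Return html reduced to whitelisted tags and attributes."""
    parser = Purifier()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(purify('<p class=MsoNormal><span style="color:red">A &amp; B</span></p>'))
```

Comments, including Word's conditional comments, are dropped here simply because the class defines no comment handler. A whitelist errs on the side of removing too much, so some structure that matters in a particular document (for example footnote anchors) may need to be added to the list.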
Test examples of MS Word-generated HTML files are found here (unfiltered), and here (filtered), both generated from the same Word document. Your filter should be able to strip a significant amount of excess information from both.