When citing sources, CiteIt removes all HTML formatting before computing the 500 characters of context before and after each selection.
Removing HTML tags is expedient because it simplifies text extraction, but it would be desirable for the text version to preserve white space and some other formatting.
Suppose you had a series of ordered and unordered lists that mimic the “Step 3: Profit” meme:
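To make the current behaviour concrete, here is a minimal sketch of stripping tags and then taking 500 characters of context on each side of a selection. It is not CiteIt's actual code: the names `strip_tags` and `context_around` are hypothetical, and the regex-based tag removal is an assumption chosen only for illustration.

```python
import re
from html import unescape

CONTEXT_CHARS = 500  # context length mentioned in the question


def strip_tags(html: str) -> str:
    """Naive tag removal: drop everything between < and >, then decode entities."""
    return unescape(re.sub(r"<[^>]+>", "", html))


def context_around(text: str, selection: str, n: int = CONTEXT_CHARS):
    """Return (before, after): up to n characters on each side of the first match."""
    start = text.find(selection)
    if start == -1:
        raise ValueError("selection not found in text")
    end = start + len(selection)
    return text[max(0, start - n):start], text[end:end + n]
```

With this kind of stripping, `strip_tags("<ul><li>a</li><li>b</li></ul>")` returns `"ab"`: the list items run together, which is the problem described here.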
How to get rich with a dot com company:
<li>Sell dog food on the internet.</li>
Simply stripping the formatting would remove the list styling.
It would be preferable to:
- Indent the list,
- Put each item on its own line, and
- Start each ordered-list item with a number and each unordered-list item with an asterisk.
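One possible approach, sketched here with Python's standard-library `html.parser`, is to walk the tags and emit list markers instead of discarding them. This is only an illustration of the three requirements above, not an existing library feature: the class `ListAwareText` and the function `html_to_text` are hypothetical names, and the two-space indent per nesting level is an assumption.

```python
from html.parser import HTMLParser


class ListAwareText(HTMLParser):
    """Convert HTML to plain text while keeping list structure (sketch)."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished output lines
        self.current = []  # text fragments of the line being built
        self.prefix = ""   # indent + marker for the current list item
        self.lists = []    # stack of [tag, item_count] for open lists

    def flush_line(self):
        """Emit the buffered text as one line, collapsing runs of whitespace."""
        text = " ".join("".join(self.current).split())
        if text:
            self.lines.append(self.prefix + text)
            self.prefix = ""
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in ("ol", "ul"):
            self.flush_line()
            self.lists.append([tag, 0])
        elif tag == "li" and self.lists:
            self.flush_line()
            self.lists[-1][1] += 1
            kind, count = self.lists[-1]
            marker = f"{count}." if kind == "ol" else "*"
            # Assumption: indent two spaces per nesting level.
            self.prefix = "  " * len(self.lists) + marker + " "
        elif tag in ("p", "div", "br"):
            self.flush_line()

    def handle_endtag(self, tag):
        if tag in ("ol", "ul"):
            self.flush_line()
            if self.lists:
                self.lists.pop()
        elif tag == "li":
            self.flush_line()
            self.prefix = ""  # do not leak a marker from an empty item
        elif tag in ("p", "div"):
            self.flush_line()

    def handle_data(self, data):
        self.current.append(data)


def html_to_text(html: str) -> str:
    parser = ListAwareText()
    parser.feed(html)
    parser.close()
    parser.flush_line()
    return "\n".join(parser.lines)
```

Fed a paragraph followed by an ordered list that contains a nested unordered list, this sketch puts each item on its own line, numbers the ordered items, marks the unordered ones with an asterisk, and indents by nesting depth. It does not handle tables, `<pre>` blocks, or malformed markup.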
There is an existing Java library called Boilerpipe that runs on Google App Engine. It is similar, but it doesn’t quite do what we want.
Do you know of a library or approach that would make it possible to produce this kind of text version?
A similar process is needed for other document types, such as PDF and DOC.