I tried this script: http://www.techrepublic.com/blog/opensource/how-to-convert-doc-and-odf-files-to-clean-and-lean-html/3708 based on tidy and sed, and it reduced the size to about 150K, but there are still lots of pointless spans.
When I convert them with LibreOffice's “Save as HTML”, the resulting files are huge. For example, a doc file of 112K becomes 450K of HTML, most of it useless FONT and SPAN tags (for some reason, every single punctuation mark is enclosed in its own span!).
People who send content to my website use Word, so I receive a lot of Word documents to convert to HTML. I want to keep only the basic formatting (titles, lists and emphasis), with no images.
I tried copying and pasting into KompoZer, an HTML editor, and then saving as HTML, but it converted all my non-Latin (Hebrew) characters to entities such as “ְ”, which increased the size to 750K!
I tried docvert: https://github.com/holloway/docvert/issues/6 but found that it requires a Python library that in turn requires additional libraries, and so on, which looks like an endless trail of dependencies …
Is there a simple way to create clean HTML from Office documents?
For every tag in the HTML, the specified callback function is called. You can keep any given tag and some or all of its attributes (or modify them), remove the tag but keep the inner content, keep the tag but get rid of the content, modify the content (for closing tags), or get rid of both the tag and its inner content. This approach allows extremely fine-grained control over the most convoluted HTML out there and processes the input in a single pass.
The only downside is that the callback has to keep track of where it is between calls, whereas something like Simple HTML DOM selects elements based on a DOM-like model. That is only a downside if the document being processed has things like 'id's and 'class'es. Most Word/Libre HTML content does not, which means it is a giant blob of unrecognizable/unparseable HTML as far as DOM processing tools go.
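To illustrate the callback idea in a language-neutral way, here is a minimal sketch using Python's standard `html.parser` rather than TagFilter itself. The names `CleanHTML`, `clean_html`, `KEEP` and `DROP_WITH_CONTENT` are made up for this example, and the tag lists are only an assumption about what "basic formatting" means. It keeps a small whitelist of tags without their attributes, drops every other tag but keeps its inner text, and discards `style`/`script` blocks entirely, all in one pass.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tags kept (without attributes) in the output.
KEEP = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "ul", "ol", "li",
        "em", "strong", "b", "i"}
# Tags removed together with everything inside them.
DROP_WITH_CONTENT = {"style", "script"}


class CleanHTML(HTMLParser):
    """Callback-style filter: one handler call per tag or text run."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        # State the callbacks must track between calls:
        # how deep we are inside a dropped element.
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in KEEP:
            self.out.append(f"<{tag}>")  # attributes are discarded
        # Any other tag (span, font, img, ...) is dropped; its text stays.

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in KEEP:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Entities arrive already decoded; re-escape only &, < and >.
            self.out.append(escape(data, quote=False))


def clean_html(source):
    parser = CleanHTML()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)
```

Calling `clean_html(...)` on a Word or LibreOffice export returns a string with only the whitelisted tags left. Because character references are decoded before the text is written out, numeric entities for non-Latin characters come back as the characters themselves. The `skip_depth` counter is exactly the kind of between-call bookkeeping mentioned above.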
Here is a set of PowerShell scripts that will clean Word-filtered HTML and correctly tag super/subscripts about 95% of the time. (No, you can't get better than that; Word is made for print.) https://github.com/suzumakes/replaceit
Instructions are in the ReadMe, and if you come across any additional characters that need to be caught, or make any tweaks/improvements, I'd be happy to see your pull request.
For cleaning up broken HTML, the default options from TagFilter::GetHTMLOptions() are a good starting point. Those options form the basis of valid HTML content and, doing nothing else, will clean up any input data into something that another tool like Simple HTML DOM can correctly parse into a DOM model.
You pass in two things: an array of options and the data to parse as HTML.
I don't know if there is a more convenient way, but this way is 100% free and easy for HTML tag cleanup using Notepad++.
As for converting inline styles to external CSS (which I recommend as the second step after replacing unnecessary tags), try this application … http://inlinecssextractor.com/home.html
Now all you have to do from that point is click Find Next until you reach the tags you want to replace, and then click Replace for each tag that needs to be replaced. Make sure the “Replace with:” box is empty.
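If clicking through each match by hand gets tedious, the same find-and-replace-with-nothing idea can be scripted. The snippet below is only a sketch in Python, not the Notepad++ procedure itself: the pattern and the names `TAG_PATTERN` and `strip_tags` are my own assumptions, and it targets only `span` and `font` tags, so adjust the tag names to whatever you want removed.

```python
import re

# Assumed pattern: any opening or closing <span ...> or <font ...> tag,
# in any letter case. \b stops it from matching longer tag names that
# merely start with "span" or "font".
TAG_PATTERN = re.compile(r"</?(?:span|font)\b[^>]*>", re.IGNORECASE)


def strip_tags(html):
    """Replace every matching tag with nothing, keeping the text inside."""
    return TAG_PATTERN.sub("", html)
```

Unlike the click-by-click approach, this replaces every match at once, so run it on a copy of the file and check the result before overwriting anything.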