Converting HTML markup to an RTF document - python

I have an XML document containing embedded HTML content that I am attempting to convert to an RTF output file. The XML elements are decorated with <li>, <p>, <b> and other HTML markup that I would like to have transferred into the generated RTF.
Here is what works as of now:
Fetch the XML tag content as a string (containing HTML tags for line breaks, paragraph breaks, and lists)
Write the XML tag content to an RTF file.
I am using Python scripts to achieve the conversion, along with ElementTree (to parse the input XML) and PyRTF-NG (to convert from HTML to RTF), a library that handles tables and other special formatting. At the moment I have managed to get everything I need except the 'markdown' of the HTML, i.e. translating HTML formatting tags into actual RTF formatting. To clarify: if my RTF converter encounters an <ol><li> tag, it should create an ordered list in the RTF instead of just spitting the <ol><li> tags out into the RTF.
Does anyone know if Python has any native calls that will allow me to do this, or any other Python libraries that might have what I need to complete the full conversion to RTF?
Thanks!

The best free converter is LibreOffice, and it can be used directly from the command line in a terminal; see
libreoffice --convert-to
The same converter can be called indirectly from Python using the UNO bridge:
http://api.libreoffice.org/
http://software.opensuse.org/package/libreoffice-pyuno
...
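A minimal sketch of driving that command-line conversion from Python with subprocess; the input file name is hypothetical, and it assumes the libreoffice binary is on the PATH:

import subprocess

# Convert input.html to input.rtf in the current directory
# (--headless keeps LibreOffice from opening a window).
subprocess.run(
    ["libreoffice", "--headless", "--convert-to", "rtf", "input.html"],
    check=True,
)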

Related

How to import RichText/rtf and/or HTML content formatting into OpenOffice documents using PyUno

I'm looking to import RichText and/or HTML formatted text into OpenOffice documents, with the PyUno API.
For now, the only solution I've found is to parse the format from the source content and apply it to the document using text formatting (the cursor's CharFontWeight, among other things).
Since OpenOffice already manages text formatting through the clipboard, I would like to know whether there is already a proper function in PyUno for importing RichText or HTML formatted content.
Thanks!
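For reference, here is a minimal sketch of the cursor-based workaround the question describes. It assumes document is a Writer document already obtained through the usual PyUno connection boilerplate, and uses the CharWeight character property with the FontWeight constants; those names are not confirmed by the question, so treat them as assumptions.

from com.sun.star.awt.FontWeight import BOLD, NORMAL

# "document" is assumed to be an open Writer document reached via PyUno.
text = document.getText()
cursor = text.createTextCursor()
cursor.setPropertyValue("CharWeight", BOLD)    # switch the cursor to bold
text.insertString(cursor, "imported bold text", False)
cursor.setPropertyValue("CharWeight", NORMAL)  # back to regular weight
text.insertString(cursor, " followed by plain text.", False)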

Is there any method available to convert the html text to xhtml?

I am trying to store a table, created with PrettyTable, in Confluence. I converted the data to HTML using the method PrettyTable.get_html_string(). I can store this data in a local HTML file as per my requirement without any issues. However, when I tried to upload the data to Confluence using confluence.createPage(), I got XHTML parsing errors, as createPage() accepts only XHTML and the content is not formatted properly. So I would like to convert my HTML data to XHTML so I can push it to Confluence. Is there any method available to convert the PrettyTable data directly to XHTML?
I tried to use PrettyTable._get_formatted_html_string(), but there is no proper information about which arguments I should give to that method at http://www.aplu.ch/classdoc/raspipylib/prettytable-pysrc.html#PrettyTable._get_formatted_html_string
Check the source HTML being created. XHTML needs all tags closed, which is most likely the problem:
a paragraph tag needs a closing paragraph tag, etc.
There is a method available in PrettyTable: get_html_string(xhtml=True), which automatically adds the end tags. Here is the reference: http://www.aplu.ch/classdoc/raspipylib/prettytable-pysrc.html#PrettyTable.get_html_string
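A minimal sketch of that call, with hypothetical table contents:

from prettytable import PrettyTable

table = PrettyTable(["Name", "Count"])
table.add_row(["alpha", 3])
table.add_row(["beta", 5])

# xhtml=True asks PrettyTable for XHTML-style output, which is what
# Confluence's createPage() expects.
xhtml_content = table.get_html_string(xhtml=True)
print(xhtml_content)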
Thank you :)

Parse XML file in python and display it in HTML page

I am doing a digital signage project using a Raspberry Pi. The R-Pi will be connected to an HDMI display and to the internet. There will be one XML file and one self-designed HTML webpage on the R-Pi. The XML file will be frequently updated from a remote terminal.
My idea is to parse the XML file using Python (lxml) and pass the parsed data to my local HTML webpage so that it can be displayed in the R-Pi's web browser. The webpage will be reloaded frequently to reflect the changed data.
I was able to parse the XML file using Python (lxml). But what tools should I use to display this parsed content (mostly strings) in a local HTML webpage?
This question might sound trivial, but I am very new to this field and could not find any clear answer anywhere. Also, there are methods that use PHP to parse the XML and then pass it to an HTML page, but as per my other needs I am bound to use Python.
I think there are 3 steps you need to make it work.
Extract only the data you want from the given XML file.
Use a simple template engine to insert the extracted data into an HTML file.
Use a web server to serve the file created above.
Step 1) You are already using lxml, which is a good library for doing this, so I don't think you need help there.
Step 2) There are many Python templating engines out there, but for a simple purpose you just need an HTML file that was created in advance with some special markup such as {{0}}, {{1}} or whatever works for you. This is your template. Take the data from step 1, do a find-and-replace in the template, and save the output to a new HTML file.
Step 3) To make that file accessible from a browser on a different device or a PC, you need to serve it using a simple HTTP web server. Python provides the http.server library, or you can use a 3rd-party web server and just make sure it can access the file created in step 2.
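A minimal sketch of steps 2 and 3, assuming a hypothetical template.html containing placeholders {{0}} and {{1}}, with example values standing in for the data extracted in step 1:

import http.server
import socketserver

values = ["23.5", "2024-01-01 12:00"]  # example data extracted in step 1

# Step 2: fill the placeholders in the pre-made template.
with open("template.html") as f:
    page = f.read()
for i, value in enumerate(values):
    page = page.replace("{{%d}}" % i, value)
with open("index.html", "w") as f:
    f.write(page)

# Step 3: serve the generated file at http://localhost:8000/index.html
with socketserver.TCPServer(("", 8000), http.server.SimpleHTTPRequestHandler) as httpd:
    httpd.serve_forever()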
Instead of passing the parsed data (parsed from an XML file) to specific components in the HTML page, I wrote Python code that rewrites the entire HTML webpage's code periodically.
Suppose we have an XML file, a Python script, and an HTML webpage.
XML file: contains certain values that are updated periodically and are to be parsed.
Python script: parses the XML file (whenever there are changes in the XML file) and updates the HTML page with the newly parsed values.
HTML webpage: shown on the R-Pi screen and reloaded periodically (to reflect any changes in the browser).
The Python code will have a string (say, str) declared that contains the code of the HTML page, for example the code below.
<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
Then, suppose we would like to update My first paragraph with a value we parsed from the XML; we can use Python's string replace method (strings are immutable, so the result must be assigned back):
str = str.replace("My first paragraph", root[0][1].text)
After the replacement is done, write that entire string (str) into the HTML file. Now the HTML file has the new code, and once it is reloaded, the updated webpage will show up in the browser (of the R-Pi).
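Putting the pieces together, a minimal sketch of this rewrite-the-whole-page approach, assuming the XML file is named data.xml and the value of interest sits at root[0][1] (both hypothetical):

import time
from lxml import etree

TEMPLATE = """<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>"""

while True:
    root = etree.parse("data.xml").getroot()
    # Swap the placeholder text for the freshly parsed value.
    page = TEMPLATE.replace("My first paragraph.", root[0][1].text)
    with open("index.html", "w") as f:
        f.write(page)
    time.sleep(10)  # regenerate the page every 10 seconds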

Python -- Parsing files (docx, pdf and odt) and converting the content into my data model

I'm writing an import/export tool for importing docx, pdf, and odt files in which a book has been written.
We already have a tool for the .epub format, and we'd like to extend the functionality beyond that, so users of the site can have more flexibility.
So far I've looked at PDFMiner and also found out that docx is just based on the openxml format, so the word/document.xml is essentially the file containing the whole thing, and I can parse it with lxml.
The question I have is: I'm hoping to parse the contents of these files, and from that content, extract things like chapter names, images (if any), and chapter text, so that I can fit the content into a data model of:
Book --> o2m --> Chapter --> o2m --> Image
Clearly, PDFMiner has a .get_outlines() function that will return the TOC for me. But it can't link any of the returned tuples (chapter numbers and titles) to the actual pages for that chapter.
Even more problematic: with docx/odt, those are just paragraph elements -- <w:sdt> -- with attributes and child elements.
I'm looking for idea(s) to extrapolate some sense of structure from these filetypes, and if need be, I can apply those ideas (2 or 3) as suggested formats for our users who wish to import a book via one of those file formats.
Textract is the best tool that I have encountered so far for parsing different file formats.
It can parse most file formats.
You can find the project on GitHub.
Here is the official documentation.
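A minimal sketch of using it (the file name is hypothetical); textract.process() returns the extracted text as bytes:

import textract

# Extract the plain text of a manuscript and decode it to a string.
text = textract.process("book.docx").decode("utf-8")
print(text[:200])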
(Python 3 answer)
When I was looking for a tool to read .docx files, I was able to find one here: http://etienned.github.io/posts/extract-text-from-word-docx-simply/
What it does is simply get the text from a .docx file and return it as a string; separate paragraphs are still clearly separate, as there are new lines between them, but all other formatting is lost. I think this may include the loss of endnotes and footnotes, but if you want the body of a text, it works great.
I have tested it on both Windows 10 and on OS X, and it has worked successfully on both. Here is what it imports:
import zipfile

# Use the faster C implementation of ElementTree when it is available,
# falling back to the pure-Python version otherwise.
try:
    from xml.etree.cElementTree import XML
    print("cElementTree")
except ImportError:
    from xml.etree.ElementTree import XML
    print("ElementTree")
EDIT:
If, in the body of the function, you replace 'word/document.xml' with 'word/footnotes.xml' or 'word/endnotes.xml', you can get the footnotes and endnotes, respectively.
The markers for where they were in the text are lost, however.
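For completeness, a minimal sketch of what the body of such a function can look like, assuming the standard WordprocessingML namespace (the helper name and structure are my own, not the linked code):

import zipfile
from xml.etree.ElementTree import XML

WORD_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def get_docx_text(path, part="word/document.xml"):
    """Return the plain text of one XML part of a .docx file."""
    with zipfile.ZipFile(path) as docx:
        tree = XML(docx.read(part))
    paragraphs = []
    for paragraph in tree.iter(WORD_NS + "p"):
        runs = [node.text for node in paragraph.iter(WORD_NS + "t") if node.text]
        if runs:
            paragraphs.append("".join(runs))
    return "\n\n".join(paragraphs)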

parsing wikipedia stopwords html with nltk

Related to this question, I am working on a program to extract the introduction of Wikipedia entities. As you can read in the above link, I already succeeded in querying the API and am now focusing on processing the XML returned by the API call. I use NLTK to process the XML, where I use
wikiwords = nltk.word_tokenize(introtext)
for wikiword in wikiwords:
    wikiword = lemmatizer.lemmatize(wikiword.lower())
    ...
But with this I end up recording words like </, /p, <, ... Since I am not using the structure of the XML, simply ignoring all XML markup would work, I guess. Is there a tool in NLTK for this, or is there a stopwords list available? I would just like to know what the best practice is.
You didn't specify what exact query you are using, but it seems what you have now is HTML, not XML, which you extracted from the XML response.
If you want to strip all HTML tags from the HTML code and leave only the text, you should use an HTML library for that, such as BeautifulSoup.
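A minimal sketch of that stripping step, with a hypothetical HTML extract standing in for the API response:

from bs4 import BeautifulSoup
import nltk

html_extract = "<p>Python is a <b>programming language</b>.</p>"  # hypothetical

# Drop the markup, keep only the text, then tokenize as before.
introtext = BeautifulSoup(html_extract, "html.parser").get_text()
wikiwords = nltk.word_tokenize(introtext)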
