How to use QWebView of PyQt to display XML with XSLT applied - python

According to PythonCentral:
QWebView ... allows you to display web pages from URLs, arbitrary HTML, XML with XSLT stylesheets, web pages constructed as QWebPages, and other data whose MIME types it knows how to interpret
However, the XML content is displayed as if it were interpreted as HTML: the tags are filtered out and the text nodes are shown without line breaks.
The question is: how do I show XML in QWebView with the XSL stylesheet applied?
The same XML file displays fine when opened in any stand-alone web browser. The HTML file produced by transforming the XML (with lxml.etree) also displays well in QWebView.
Here is my (abbreviated) xml file:
<?xml version='1.0' encoding='UTF-8'?>
<?xml-stylesheet type="text/xsl" href="../../page.xsl"?>
<specimen>
...
</specimen>

OK, I found part of the solution. It's a multi-step approach using QXmlQuery:
path = base + "16000-16999/HF16019"
xml = os.path.join(path, "specimen.xml")
xsl = os.path.join(path, "../../page.xsl")
app = QApplication([])
query = QXmlQuery(QXmlQuery.XSLT20)
query.setFocus(QUrl("file:///" + xml))
query.setQuery(QUrl("file:///" + xsl))
out = query.evaluateToString()
win = QWebView()
win.setHtml(out)
win.show()
app.exec_()
Evidently, the xslt is applied this way. What's still wrong is that the css style sheets referenced in the xslt are not applied/found.

I came across your question because I had a similar problem. I thought I'd post the solution I found to the problem as well, because it works without QXmlQuery and is rather simple.
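For my solution, my xml file was also interpreted as HTML, so I just worked with that and replaced every & with &amp;, every < with &lt; and every > with &gt;, as mentioned in this answer. The & has to be replaced first, otherwise the ampersands of the freshly inserted entities would be escaped a second time.
So, for your xmlString, just do:
xmlString = xmlString.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")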
This way, if your xml file gets interpreted as html it will at least show the text correctly with all the tags.
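The same escaping is also available in the standard library as html.escape, which takes care of the ampersand before the angle brackets. A minimal sketch, assuming Python 3; the xml_string value is made up for illustration:

```python
import html

# Hypothetical input; in practice this would be the content of the XML file.
xml_string = '<specimen id="a&b">text</specimen>'

# quote=False leaves " and ' alone and escapes only &, < and >.
escaped = html.escape(xml_string, quote=False)
print(escaped)
```

The escaped string can then be handed to setHtml, for example wrapped in a pre element so that line breaks survive.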

Related

Parsing an xml file with an emphasis tag in it in python

I am currently writing a Python script that extracts all of the text in an XML file. I am using the ElementTree library to interpret the data, but I run into a problem when the data is structured like this...
<Segment StartTime="639.752" EndTime="642.270" Participant="fe016">
But I bet it's a good <Pause/> superset of it.
</Segment>
When I attempt to read out the text, I get the first half of the Segment ("Alright. So what we had") before the pause tag.
What I am trying to figure out is if there is a way to ignore the tags in the data segments and print out all of the text.
Another solution.
from simplified_scrapy import SimplifiedDoc,req,utils
html = '''<Segment StartTime="639.752" EndTime="642.270" Participant="fe016">
But I bet it's a good <Pause/> superset of it.
</Segment>'''
doc = SimplifiedDoc(html)
print(doc.Segment)
print(doc.Segment.text)
Result:
{'StartTime': '639.752', 'EndTime': '642.270', 'Participant': 'fe016', 'tag': 'Segment', 'html': "\n But I bet it's a good <Pause /> superset of it.\n"}
But I bet it's a good superset of it.
Here are more examples. https://github.com/yiyedata/simplified-scrapy-demo/blob/master/doc_examples
xml = '''<Segment StartTime="639.752" EndTime="642.270" Participant="fe016">
But I bet it's a good <Pause/> superset of it.
</Segment>'''
# solution using ETree
from xml.etree import ElementTree as ET
root = ET.fromstring(xml)
pause = root.find('./Pause')
print(root.text + pause.tail)
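The ElementTree snippet above handles exactly one Pause element. A slightly more general sketch, using Element.itertext() from the standard library, collects the text around any number of nested tags; the whitespace normalisation at the end is my own addition, not part of the original answer:

```python
from xml.etree import ElementTree as ET

xml = '''<Segment StartTime="639.752" EndTime="642.270" Participant="fe016">
But I bet it's a good <Pause/> superset of it.
</Segment>'''

root = ET.fromstring(xml)

# itertext() yields the element's own text, then the text and tail of
# every descendant, in document order.
raw = "".join(root.itertext())

# Collapse the newlines and the double space left where <Pause/> was.
text = " ".join(raw.split())
print(text)
```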

python lxml pkg - how to incrementally write to an XML file using etree.xmlfile AND passing in existing elements?

I'm very new to anything XML related, so please bear with me. I'm trying to build some code that converts rasters to KML files for Google Earth.
I've come across the lxml package which has made my life easier, but now am facing an issue.
Let's say I've created an element called kml with namespaces:
from lxml import etree
version = '2.2'
namespace_hdr = {'gx': f'http://www.google.com/kml/ext/{version}',
                 'kml': f'http://www.opengis.net/kml/{version}',
                 'atom': 'http://www.w3.org/2005/Atom'
                 }
kml = etree.Element('kml', nsmap=namespace_hdr)
And I've also created an element called Document:
Document = etree.SubElement(kml, 'Document')
Now, I have a lot of data I want to write and am running into memory issues, so I figured the best approach would be to generate my data on the fly and write it as I go, hence the incremental writing.
The approach I'm using is:
out_file = 'test.kml'
with etree.xmlfile(out_file, encoding='utf-8') as xf:
    xf.write_declaration()
    with xf.element(kml):
        xf.write(Document)
Which returns the error:
TypeError: Argument must be bytes or unicode, got '_Element'
If I change kml to 'kml' it works fine, but obviously does not write the namespaces to the file that I've defined in the kml element.
How is it possible to pass in the kml element instead of a string? Is there a way to do this? Or some other way of incrementally writing to the file?
Any thoughts would be appreciated.
FYI - output when using 'kml' is:
<?xml version='1.0' encoding='utf-8'?>
<kml><Document/>
</kml>
I'm trying to achieve the same but with the namespaces:
<?xml version="1.0" encoding="utf-8"?>
<kml xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
<Document/>
</kml>

Updating an Existing XML Document in Python

I have a large XML file whose structure is approximately as follows:
<GROUNDTRUTH>
<thing fileName="1" attrib="2">
<SUBSUB moreStuff="12" otherStuff="13"/>
</thing>
<thing fileName="2" attrib="2">
<SUBSUB moreStuff="12" otherStuff="13"/>
</thing>
<thing fileName="3" attrib="2">
<SUBSUB moreStuff="12" otherStuff="13"/>
</thing>
</GROUNDTRUTH>
I don't think I was clear enough in the original posting of this question. I have an xml document called GROUNDTRUTH, and inside of that I have several thousand "things". I want to search through all of the things in the document via filename and then change an attribute. So if I was searching for fileName="2", I would change its attribute to attrib=x. And for some thing, perhaps I'd go down to the sub level and change moreStuff.
My plan is to store into a csv file the names of the 'things' I need to change, and what I want to change the value of 'attrib' to. What function or module will provide this kind of functionality? Or am I just missing an easy/obvious approach? Ultimately I'd like to have a working script that will take a csv file with the thing identifier, and value to be updated, and take the xml file to make those changes onto.
Thanks for your help and suggestions!
First, you can transform the original XML file into a new XML file using an XSLT stylesheet, which can modify XML in almost any way: changing, re-structuring, or re-ordering attributes and elements. Note that XSLT is a declarative, special-purpose language for transforming and rendering XML documents.
Then, you can use Python's lxml library to run the transformation:
#!/usr/bin/python
import lxml.etree as ET
dom = ET.parse('originalfile.xml')
xslt = ET.parse('transformfile.xsl')
transform = ET.XSLT(xslt)
newdom = transform(dom)
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True)
# 'wb' overwrites the output; 'ab' would append a second document on every re-run
with open('finalfile.xml', 'wb') as xmlfile:
    xmlfile.write(tree_out)
By the way, PHP, Java, C, VB, or pretty much any language can run such transformations, and so can your everyday browser. To have the browser run it, simply add a stylesheet processing instruction near the top of the XML file, right after the XML declaration:
<?xml-stylesheet type="text/xsl" href="transformfile.xsl"?>
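For the CSV-driven update described in the question, a stylesheet is not strictly required. A minimal sketch using only the standard library's csv and xml.etree.ElementTree is shown here; the two-column CSV layout (fileName, new attrib value), the inline sample data, and the helper name apply_updates are all assumptions made for illustration:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Stand-ins for the real files; the two-column CSV layout is an assumption.
XML_TEXT = """<GROUNDTRUTH>
  <thing fileName="1" attrib="2"><SUBSUB moreStuff="12" otherStuff="13"/></thing>
  <thing fileName="2" attrib="2"><SUBSUB moreStuff="12" otherStuff="13"/></thing>
</GROUNDTRUTH>"""
CSV_TEXT = "2,x\n"


def apply_updates(root, updates):
    """Set attrib on every <thing> whose fileName is a key of updates."""
    changed = 0
    for thing in root.iter("thing"):
        new_value = updates.get(thing.get("fileName"))
        if new_value is not None:
            thing.set("attrib", new_value)
            changed += 1
    return changed


# Map fileName -> new attrib value, skipping empty CSV rows.
updates = {row[0]: row[1] for row in csv.reader(io.StringIO(CSV_TEXT)) if row}
root = ET.fromstring(XML_TEXT)
changed = apply_updates(root, updates)
print(ET.tostring(root, encoding="unicode"))
```

With real files, ET.parse(path) and tree.write(path) would take the place of the inline strings. Building the dictionary first means the XML is walked only once, however many rows the CSV has.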

pypdf not extracting tables from pdf

I am using pypdf to extract text from PDF files. The problem is that the tables in the PDF files are not extracted. I have also tried pdfminer, but I am having the same issue.
The problem is that tables in PDFs are generally made up of absolutely positioned lines and characters, and it is non-trivial to convert this into a sensible table representation.
In Python, PDFMiner is probably your best bet. It gives you a tree structure of layout objects, but you will have to do the table interpreting yourself by looking at the positions of lines (LTLine) and text boxes (LTTextBox). There's a little bit of documentation here.
Alternatively, PDFX attempts this (and often succeeds), but you have to use it as a web service (not ideal, but fine for the occasional job). To do this from Python, you could do something like the following:
import urllib.request  # this was urllib2 in Python 2
import xml.etree.ElementTree as ET
# Make request to PDFX
with open('example.pdf', 'rb') as f:
    pdfdata = f.read()
request = urllib.request.Request('http://pdfx.cs.man.ac.uk', pdfdata, headers={'Content-Type' : 'application/pdf'})
response = urllib.request.urlopen(request).read()
# Parse the response
tree = ET.fromstring(response)
for tbox in tree.findall('.//region[@class="DoCO:TableBox"]'):
    src = ET.tostring(tbox.find('content/table'))
    info = ET.tostring(tbox.find('region[@class="TableInfo"]'))
    caption = ET.tostring(tbox.find('caption'))

What is the best way to change text contained in an XML file using Python?

Let's say I have an existing trivial XML file named 'MyData.xml' that contains the following:
<?xml version="1.0" encoding="utf-8" ?>
<myElement>foo</myElement>
I want to change the text value of 'foo' to 'bar' resulting in the following:
<?xml version="1.0" encoding="utf-8" ?>
<myElement>bar</myElement>
Once I am done, I want to save the changes.
What is the easiest and simplest way to accomplish all this?
Use Python's minidom
Basically you will take the following steps:
Read XML data into DOM object
Use DOM methods to modify the document
Save new DOM object to new XML document
The Python docs should hold your hand rather nicely through this process.
This is what I wrote based on @Ryan's answer:
from xml.dom.minidom import parse
import os
# create a backup of original file
new_file_name = 'MyData.xml'
old_file_name = new_file_name + "~"
os.rename(new_file_name, old_file_name)
# change text value of element
doc = parse(old_file_name)
node = doc.getElementsByTagName('myElement')
node[0].firstChild.nodeValue = 'bar'
# persist changes to new file
xml_file = open(new_file_name, "w")
doc.writexml(xml_file, encoding="utf-8")
xml_file.close()
Not sure if this was the easiest and simplest approach, but it does work. (@Javier's answer has fewer lines of code but requires a non-standard library.)
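For comparison, the same change can be sketched with xml.etree.ElementTree, which is also part of the standard library. This is not one of the original answers; the temporary file below is only there to make the sketch self-contained:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Create a throwaway copy of the example file so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "MyData.xml")
with open(path, "w", encoding="utf-8") as f:
    f.write('<?xml version="1.0" encoding="utf-8" ?>\n<myElement>foo</myElement>')

tree = ET.parse(path)
root = tree.getroot()  # in this trivial file the root is <myElement> itself
root.text = "bar"

# Write the modified tree back, keeping an XML declaration.
tree.write(path, encoding="utf-8", xml_declaration=True)

result = ET.parse(path).getroot().text
print(result)
```

Unlike the minidom version, this overwrites the file in place, so a backup copy would have to be made separately if one is wanted.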
For quick, non-critical XML manipulations, I really like P4X. It lets you write like this:
import p4x
doc = p4x.P4X(open(file).read())
doc.myElement = 'bar'
You also might want to check out Uche Ogbuji's excellent XML Data Binding Library, Amara:
http://uche.ogbuji.net/tech/4suite/amara
(Documentation here:
http://platea.pntic.mec.es/~jmorilla/amara/manual/)
The cool thing about Amara is that it turns an XML document in to a Python object, so you can just do stuff like:
record = doc.xml_create_element(u'Record')
nameElem = doc.xml_create_element(u'Name', content=unicode(name))
record.xml_append(nameElem)
valueElem = doc.xml_create_element(u'Value', content=unicode(value))
record.xml_append(valueElem)
(which creates a Record element that contains Name and Value elements (which in turn contain the values of the name and value variables)).
