How do I get properly escaped XML in python etree untouched? - python

I'm using python version 2.7.3.
test.txt:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<test>The tag &lt;StackOverflow&gt; is good to bring up at parties.</test>
</root>
Result:
>>> import xml.etree.ElementTree as ET
>>> e = ET.parse('test.txt')
>>> root = e.getroot()
>>> print root.find('test').text
The tag <StackOverflow> is good to bring up at parties.
As you can see, the parser must have changed the &lt;'s to <'s etc.
What I'd like to see:
The tag &lt;StackOverflow&gt; is good to bring up at parties.
Untouched, raw text. Sometimes I really like it raw. Uncooked.
I'd like to use this text as-is for display within HTML, therefore I don't want an XML parser to mess with it.
Do I have to re-escape each string or can there be another way?

import xml.etree.ElementTree as ET
e = ET.parse('test.txt')
root = e.getroot()
print(ET.tostring(root.find('test')))
yields
<test>The tag &lt;StackOverflow&gt; is good to bring up at parties.</test>
Alternatively, you could escape the text with saxutils.escape:
import xml.sax.saxutils as saxutils
print(saxutils.escape(root.find('test').text))
yields
The tag &lt;StackOverflow&gt; is good to bring up at parties.
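To make the round trip concrete, here is a minimal sketch that parses the sample from an in-memory string (an assumption, so it runs without test.txt) and compares the decoded text with the re-escaped text. Note that saxutils.escape only handles &, < and > by default; quotes need an extra entities mapping if the text ends up inside an HTML attribute.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Same content as test.txt, inlined so the sketch is self-contained.
source = (
    "<root>"
    "<test>The tag &lt;StackOverflow&gt; is good to bring up at parties.</test>"
    "</root>"
)

root = ET.fromstring(source)

# The parser decodes entity references, so .text holds literal < and >.
parsed_text = root.find("test").text

# Re-escape &, < and > so the text is safe to embed in HTML or XML again.
escaped_text = escape(parsed_text)

print(parsed_text)
print(escaped_text)
```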

Related

Debugging xml ElementTrees in Python

I have an XML string
xml_str = '<Foo><Bar>burp</Bar></Foo>'
I'm parsing it with xml etree
import xml.etree.ElementTree as ET
root_element = ET.fromstring(xml_str)
This creates an Element object (root_element) with tag, tail, text, and attrib values. I can see all of them when debugging. However, I can't see any child elements while debugging. I know the children are there because I can access them in a for loop.
for child in root_element:
    pass  # break point here
Is there a way to see all elements at once while debugging? And is this issue because the XML parser is a JIT parser or something?
It sounds like you want to be able to see the available elements in the XML document you want to parse.
This will list the root element followed by all of its descendants, in document order
all_children = list(root_element.iter())
This will produce
[<Element 'Foo' at 0x11b315908>, <Element 'Bar' at 0x11b315c28>]
This output, however, doesn't respect the 'shape' of the XML.
When I want to parse XML, I find it easier to use ElementTree, but my first experience parsing XML was with BeautifulSoup. I still like the prettify() function.
This code,
from bs4 import BeautifulSoup

pretty = ""
# html.parser lowercases tag names, hence "foo" rather than "Foo"
soup = BeautifulSoup(xml_str, 'html.parser')
for value in soup.find_all("foo"):
    pretty += value.prettify()
produces this output
print(pretty)
<foo>
 <bar>
  burp
 </bar>
</foo>
You can replace the find_all() with specific elements you might be looking for.
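If installing BeautifulSoup just for debugging feels heavy, a small recursive helper built on the standard library can show the nesting as well. This is only a sketch; the outline function is a hypothetical name, not part of ElementTree.

```python
import xml.etree.ElementTree as ET


def outline(element, depth=0):
    """Return one line per element, indented by its nesting depth."""
    text = (element.text or "").strip()
    lines = ["{}{}: {!r}".format("  " * depth, element.tag, text)]
    for child in element:
        # Iterating an Element yields its direct children.
        lines.extend(outline(child, depth + 1))
    return lines


xml_str = '<Foo><Bar>burp</Bar></Foo>'
root_element = ET.fromstring(xml_str)
print("\n".join(outline(root_element)))
```

In a debugger's watch or evaluate window, an expression such as list(root_element) serves a similar purpose for the direct children.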

Django parse XML from a POST

I'm receiving an HTTP POST with a single parameter, xml, which contains an XML document.
The format of this document is:
<?xml version="1.1" encoding="ISO-8859-1"?>
<delivery_receipt>
<version>1.0</version>
<status>Delivered</status>
</delivery_receipt>
I need to get what's in <status> from the POST. How do I parse the parameter and get the 'status'?
Update....
if request.POST:
    from lxml.cssselect import CSSSelector
    from lxml.etree import fromstring
    h = fromstring(request.POST['xml'])
    h.cssselect('delivery_reciept status').text_content()
I'm not sure that request.POST['xml'] will work, though.
You can use CSS selectors with XML documents, provided the parsing task is relatively simple. CSS selectors are clear, easy to read and write, and often more concise than XPath queries.
I suggest getting lxml installed and using its cssselect features.
Your end result might look like this:
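>>> from lxml.etree import fromstring
>>> h = fromstring(b"""<?xml version="1.1" encoding="ISO-8859-1"?>
... <delivery_receipt>
... <version>1.0</version>
... <status>Delivered</status>
... </delivery_receipt>""")
>>> # cssselect() returns a list of matching elements; take the first one's text
>>> h.cssselect('delivery_receipt status')[0].text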
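For comparison, here is a sketch of the same lookup using only the standard library rather than the lxml approach above. The extract_status helper is a hypothetical name, and the sample payload declares XML version 1.0 (an assumption made for the example; the document in the question declares 1.1).

```python
import xml.etree.ElementTree as ET


def extract_status(xml_payload):
    """Return the text of the <status> child, or None if it is absent."""
    root = ET.fromstring(xml_payload)
    # findtext looks for a direct child of the root named "status".
    return root.findtext("status")


# Stand-in for the value of the POSTed "xml" parameter.
sample = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<delivery_receipt>
<version>1.0</version>
<status>Delivered</status>
</delivery_receipt>"""

print(extract_status(sample))
```

In a Django view the call would presumably be extract_status(request.POST['xml']), as in the question.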

Parsing RSS with Elementtree in Python

How do you search for namespace-specific tags in XML using Elementtree in Python?
I have an XML/RSS document like:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.0/"
>
<channel>
<title>sometitle</title>
<pubDate>Tue, 28 Aug 2012 22:36:02 +0000</pubDate>
<generator>http://wordpress.org/?v=2.5.1</generator>
<language>en</language>
<wp:wxr_version>1.0</wp:wxr_version>
<wp:category><wp:category_nicename>apache</wp:category_nicename><wp:category_parent></wp:category_parent><wp:cat_name><![CDATA[Apache]]></wp:cat_name></wp:category>
</channel>
</rss>
But when I try and find all "wp:category" tags by doing:
import xml.etree.ElementTree as xml
tree = xml.parse(fn)
doc = tree.getroot()
categories = doc.findall('channel/wp:category')
I get the error:
SyntaxError: prefix 'wp' not found in prefix map
Searching for any non-namespace specific fields works just fine. What am I doing wrong?
You need to handle the namespace prefixes, either by using iterparse and handling the events directly or by explicitly declaring the prefixes you're interested in before parsing. Depending on what you're trying to do, and I will admit this is for my lazier moments, you can also just strip all the prefixes out with a string replace before parsing the XML.
EDIT: this similar question might help.
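One way to declare the prefix explicitly is to pass a prefix-to-URI mapping to findall. The sketch below uses a trimmed copy of the feed from the question; the namespaces dictionary name is just a local variable, and the URI must match the xmlns:wp declaration in the document.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the RSS document from the question.
rss = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.0/">
<channel>
<title>sometitle</title>
<wp:category><wp:category_nicename>apache</wp:category_nicename></wp:category>
</channel>
</rss>"""

# Map the prefix used in the search path to the namespace URI.
namespaces = {"wp": "http://wordpress.org/export/1.0/"}

doc = ET.fromstring(rss)
categories = doc.findall("channel/wp:category", namespaces)
nicenames = [
    category.findtext("wp:category_nicename", namespaces=namespaces)
    for category in categories
]
print(nicenames)
```

Internally ElementTree stores the tag as {http://wordpress.org/export/1.0/}category, so that fully qualified form also works in the path without any mapping.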

lxml removing <?xml ...> tags when parsing?

I'm currently working with parsing XML documents (adding elements, adding attributes, etc). So I first need to parse the XML before working on it. However, lxml seems to be removing the element <?xml ...>. For example
from lxml import etree
tree = etree.fromstring('<?xml version="1.0" encoding="utf-8"?><dmodule>test</dmodule>', etree.XMLParser())
print etree.tostring(tree)
will result in
<dmodule>test</dmodule>
Does anyone know why the <?xml ...> element is being removed? I thought encoding tags were valid XML. Thanks for your time.
The <?xml ...?> line is an XML declaration, so it's not strictly an element. It just gives info about the XML tree below it.
If you need to print it out with lxml, there is some info here about the xml_declaration=True flag you can use.
http://lxml.de/api.html#serialisation
etree.tostring(tree, xml_declaration=True)
Does anyone know why the <?xml ...> element is being removed?
XML defaults to version 1.0 in UTF-8, so the document is equivalent if you remove the declaration.
You are parsing some XML to a data structure and then converting that data structure back to XML. You will get a representation of that data structure in XML, but it might not be expressed in the same way (so the prolog can be removed and <foo /> can be exchanged with <foo></foo> and so on).
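The same round-trip effect can be observed with the standard library's ElementTree, which makes for a dependency-free sketch (the answers above use lxml; this stdlib variant is an illustration, not their method). Serializing the parsed tree drops the declaration by default, and asking for it explicitly brings one back.

```python
import io
import xml.etree.ElementTree as ET

root = ET.fromstring(
    b'<?xml version="1.0" encoding="utf-8"?><dmodule>test</dmodule>'
)

# Default serialization: only the element tree, no declaration.
without_declaration = ET.tostring(root)

# Request the declaration explicitly when writing the tree out.
buffer = io.BytesIO()
ET.ElementTree(root).write(buffer, encoding="utf-8", xml_declaration=True)
with_declaration = buffer.getvalue()

print(without_declaration)
print(with_declaration)
```

The regenerated declaration is written by the serializer, so its quoting and spacing may differ from the original input.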

simplexml_load_string equivalent Python / Django

I'm trying to find an XML-parsing function (like simplexml_load_string) in Python, but with no success :/
Let's say I have xml in a string
my_xml_string = """
<root>
<content>
<one>A value</one>
<two>Here goes for ...</two>
</content>
</root>"""
To read a value in PHP I would normally do something like this
// read into object
$xml = simplexml_load_string($my_xml_string);
// print some values ($xml already represents the <root> element)
echo $xml->content->one;
echo $xml->content->two;
Is there an equivalent object in Python/Django?
Thanks
The nearest is probably ElementTree, which is part of the Python standard library (or lxml, an extended version).
import xml.etree.ElementTree
element = xml.etree.ElementTree.XML(my_xml_string)
This sets up element, which is of class Element and can be treated as a list of its child elements,
e.g.
# for your example
print(element[0][0].tag)
print(element[0][0].text)
print(element[0][1].text)
You can also search by XPaths if you want to use names.
lxml also has an objectify model that allows access to elements as "if you were dealing with a normal Python object hierarchy", which matches the PHP usage more exactly.
lxml.objectify does exactly what you want
In [1]: from lxml import objectify
In [2]: x = objectify.fromstring("""<response><version>1.2</version><amount>1.01</amount><currency>USD</currency></response>""")
In [3]: x.version
Out[3]: 1.2
In [4]: x.amount
Out[4]: 1.01
In [5]: x.currency
Out[5]: 'USD'
The Python standard library includes several xml parsing modules. Probably the easiest is ElementTree.
from xml.etree import ElementTree as ET
xml = ET.fromstring(my_xml_string)
print(xml.find('.//content/one').text)
print(xml.find('.//content/two').text)
ElementTree is quite common and is probably the best library included in Python (since version 2.5).
However, personally I prefer lxml for both power and flexibility. The "lxml.objectify" method is particularly useful for parsing large XML DOMs into pythonic objects.
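For readers who cannot install lxml, attribute-style access can be approximated on top of the standard library. The following is a minimal sketch, not a replacement for lxml.objectify; the XmlNode class is hypothetical and only handles the first child with a given tag name.

```python
import xml.etree.ElementTree as ET


class XmlNode(object):
    """Hypothetical wrapper giving attribute-style access to child elements."""

    def __init__(self, element):
        self._element = element

    def __getattr__(self, name):
        # Called only when normal lookup fails; skip private names so a
        # missing _element can never trigger infinite recursion.
        if name.startswith("_"):
            raise AttributeError(name)
        child = self._element.find(name)
        if child is None:
            raise AttributeError(name)
        return XmlNode(child)

    def __str__(self):
        return self._element.text or ""


my_xml_string = """
<root>
<content>
<one>A value</one>
<two>Here goes for ...</two>
</content>
</root>"""

doc = XmlNode(ET.fromstring(my_xml_string))
print(doc.content.one)
print(doc.content.two)
```

As with SimpleXML, the wrapper object stands for the root element itself, so access starts at doc.content rather than doc.root.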
from xml.dom.minidom import parseString
my_xml_string = """
<root>
<content>
<one>A value</one>
<two>Here goes for ...</two>
</content>
</root>"""
xml = parseString(my_xml_string)
result = xml.getElementsByTagName('one')[0].firstChild.data
This did the trick, for now!
