Parsing RSS with ElementTree in Python

How do you search for namespace-specific tags in XML using ElementTree in Python?
I have an XML/RSS document like:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.0/"
>
<channel>
<title>sometitle</title>
<pubDate>Tue, 28 Aug 2012 22:36:02 +0000</pubDate>
<generator>http://wordpress.org/?v=2.5.1</generator>
<language>en</language>
<wp:wxr_version>1.0</wp:wxr_version>
<wp:category><wp:category_nicename>apache</wp:category_nicename><wp:category_parent></wp:category_parent><wp:cat_name><![CDATA[Apache]]></wp:cat_name></wp:category>
</channel>
</rss>
But when I try and find all "wp:category" tags by doing:
import xml.etree.ElementTree as xml
tree = xml.parse(fn)
doc = tree.getroot()
categories = doc.findall('channel/wp:category')
I get the error:
SyntaxError: prefix 'wp' not found in prefix map
Searching for any non-namespace specific fields works just fine. What am I doing wrong?

You need to handle the namespace prefixes, either by using iterparse and handling the events directly, or by explicitly declaring the prefixes you're interested in before parsing. I will admit that in my lazier moments, depending on what I'm trying to do, I just strip all the prefixes out with a string replace before parsing the XML.
EDIT: this similar question might help.
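One way to declare the prefixes, sketched here under the assumption that Python 3's standard xml.etree.ElementTree is in use, is to pass a prefix-to-URI dictionary to findall. The trimmed RSS string and the names RSS, NAMESPACES and names are made up for this illustration; only the wp namespace URI comes from the question.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the feed from the question (illustrative only).
RSS = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.0/">
<channel>
<title>sometitle</title>
<wp:category><wp:category_nicename>apache</wp:category_nicename><wp:cat_name><![CDATA[Apache]]></wp:cat_name></wp:category>
</channel>
</rss>"""

# Prefix-to-URI map used by the queries below. The URI must match the
# document; the prefix only has to match the one used in the path strings.
NAMESPACES = {"wp": "http://wordpress.org/export/1.0/"}

root = ET.fromstring(RSS.encode("utf-8"))

# Passing the map lets ElementTree expand "wp:" into the full namespace URI.
categories = root.findall("channel/wp:category", NAMESPACES)
names = [c.findtext("wp:cat_name", namespaces=NAMESPACES) for c in categories]
print(names)
```

The same map can be reused for find, findtext and iterfind, so it only needs to be written once per document type.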

Related

Parsing XML document that includes another XML document embedded in a CDATA section

I'm trying out web scraping for the first time using lxml.etree. The website I want to scrape has an XML feed, which I can read fine, except for a part of its XML which is embedded within a CDATA section:
from lxml import etree
parser = etree.XMLParser(recover=True)
data=b'''<?xml version="1.0" encoding="UTF-8"?>
<feed>
<entry>
<summary type="xhtml"><![CDATA[<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<REMITUrgentMarketMessages>
<UMM>
<messageId>2023-86___________________001</messageId>
<event>
<eventStatus>Active</eventStatus>
<eventType>Other unavailability</eventType>
<eventStart>2023-09-07T06:00:00.000+02:00</eventStart>
<eventStop>2023-09-10T06:00:00.000+02:00</eventStop>
</event>
<unavailabilityType>Planned</unavailabilityType>
<publicationDateTime>2022-10-06T13:42:00.000+02:00</publicationDateTime>
<capacity>
<unitMeasure>mcm/d</unitMeasure>
<unavailableCapacity>9.0</unavailableCapacity>
<availableCapacity>0.0</availableCapacity>
<technicalCapacity>9.0</technicalCapacity>
</capacity>
<unavailabilityReason>Yearly maintenance</unavailabilityReason>
<remarks>Uncertain duration</remarks>
<balancingZone>21Y000000000024I</balancingZone>
<balancingZone>21Y0000000001278</balancingZone>
<balancingZone>21YGB-UKGASGRIDW</balancingZone>
<balancingZone>21YNL----TTF---1</balancingZone>
<balancingZone>37Y701125MH0000I</balancingZone>
<balancingZone>37Y701133MH0000P</balancingZone>
<affectedAsset>
<ns2:name>Dvalin</ns2:name>
</affectedAsset>
<marketParticipant>
<ns2:name>Gassco AS</ns2:name>
<ns2:eic>21X-NO-A-A0A0A-2</ns2:eic>
</marketParticipant>
</UMM>
</REMITUrgentMarketMessages>]]></summary>
</entry>
</feed>
'''
tree = etree.fromstring(data)
block = tree.xpath("/feed/entry/summary")[0]
block_str = "b'''"+block.text+"'''"
tree_in_tree = etree.fromstring(block_str)
The problem is that the XML in the CDATA section is weirdly indented, so if I just put the CDATA content into a string and then read it with etree (as I do above), I get an error message that I believe is caused by the indentation.
This is the message:
XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
As far as I understand, the indentation between the first line of the CDATA section and REMITUrgentMarketMessages is wrong.
Does anyone know how to fix this? :)
Thanks for the help!
This has nothing to do with indentation. The problem is that the xml within the CDATA is not well formed. It uses namespaces (<ns2:name>Gassco AS</ns2:name>, for example), but without a namespace declaration. So when you try to parse block.text you should get
XMLSyntaxError: Namespace prefix ns2 on name is not defined
at least based on the xml in the question. Not sure why you get the error you are showing.
The solution is to ask the source of the feed to fix the xml so it's well formed.
The b prefix is used for bytes literals, but block.text is not a literal. Instead, create the bytes object (representing the embedded XML document) using bytes():
block_str = bytes(block.text, "UTF-8")
Now when the program is run, you will get the following error:
lxml.etree.XMLSyntaxError: Namespace prefix ns2 on name is not defined
That is a serious error, but it can be bypassed by using the parser configured with recover=True:
tree_in_tree = etree.fromstring(block_str, parser)
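The difference between gluing "b'''" onto a string and actually encoding it can be seen in a small sketch. It deliberately uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages; that swap is an assumption of this example, and the stdlib parser has no recover mode and words its errors differently from lxml. The shortened FEED and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened version of the feed from the question (illustrative only).
FEED = b"""<?xml version="1.0" encoding="UTF-8"?>
<feed>
<entry>
<summary type="xhtml"><![CDATA[<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<REMITUrgentMarketMessages>
<UMM>
<marketParticipant>
<ns2:name>Gassco AS</ns2:name>
</marketParticipant>
</UMM>
</REMITUrgentMarketMessages>]]></summary>
</entry>
</feed>
"""

outer = ET.fromstring(FEED)
inner_text = outer.find("entry/summary").text

# Concatenating "b'''" just builds a longer str whose first character is
# the letter b, so a parser cannot find "<" at line 1, column 1.
wrapped = "b'''" + inner_text + "'''"

# Encoding the str produces real bytes that start with the XML declaration.
inner_bytes = inner_text.encode("utf-8")

try:
    ET.fromstring(inner_bytes)
    inner_error = None
except ET.ParseError as exc:
    # The undeclared ns2 prefix is fatal for the stdlib parser.
    inner_error = str(exc)

print(repr(wrapped[:12]), inner_bytes[:5], inner_error)
```

This suggests the "Start tag expected" message came from the literal b and quote characters at the front of the string, while the undeclared ns2 prefix is the problem that remains once the bytes are built correctly.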

Python2 extracting tags from xml

I have an XML document that I need to parse, but I am stuck at the very beginning.
Here is part of xml file.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
I want to print out the element tags only. I do it with this piece of code from the Python docs.
I issue these commands in the Python interpreter.
tree = ET.parse('pom.xml')
root = tree.getroot()
root = ET.fromstring(data)
root.tag
root.tag returns this
{http://maven.apache.org/POM/4.0.0}project
Is expected result just
project
?
Python is parsing your XML in a way that keeps the declared namespaces and thus does not lose data, so the expected result is not just project :)
The {http://maven.apache.org/POM/4.0.0}project you see is a namespace-qualified name for the tag.
Even though the start tag <project does not contain a namespace prefix, the xmlns="http://maven.apache.org/POM/4.0.0" attribute that immediately follows it declares that every tag without an explicit namespace prefix belongs to that namespace.
If you absolutely want a non-namespace-qualified name, you can of course do tag_name = element.tag.split("}", 1)[-1]. (This should be safe for non-namespace-qualified names due to the -1 indexing.)
And of course you can recursively walk an ElementTree tree and replace all tag names with their non-namespace-qualified names using the above expression, if you really, really want to.
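A minimal sketch of that walk, assuming Python 3 and the standard library parser. The child elements in POM (modelVersion, build, plugins) are invented for the example, since the question only shows the opening project tag, and local_name is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened pom.xml; only the root tag and its
# default namespace come from the question.
POM = """<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
<modelVersion>4.0.0</modelVersion>
<build><plugins/></build>
</project>"""


def local_name(tag):
    """Return the tag without its "{namespace-uri}" part, if it has one."""
    return tag.split("}", 1)[-1]


root = ET.fromstring(POM.encode("utf-8"))

# iter() visits the root and every descendant in document order.
for element in root.iter():
    if isinstance(element.tag, str):  # skip anything with a non-string tag
        element.tag = local_name(element.tag)

tags = [element.tag for element in root.iter()]
print(tags)
```

Note that this rewrites the tree in place and discards the namespace information, so two elements with the same local name from different namespaces become indistinguishable.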

How do I get properly escaped XML in python etree untouched?

I'm using python version 2.7.3.
test.txt:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<test>The tag &lt;StackOverflow&gt; is good to bring up at parties.</test>
</root>
Result:
>>> import xml.etree.ElementTree as ET
>>> e = ET.parse('test.txt')
>>> root = e.getroot()
>>> print root.find('test').text
The tag <StackOverflow> is good to bring up at parties.
As you can see, the parser must have changed the &lt;'s to <'s etc.
What I'd like to see:
The tag &lt;StackOverflow&gt; is good to bring up at parties.
Untouched, raw text. Sometimes I really like it raw. Uncooked.
I'd like to use this text as-is for display within HTML, therefore I don't want an XML parser to mess with it.
Do I have to re-escape each string or can there be another way?
import xml.etree.ElementTree as ET
e = ET.parse('test.txt')
root = e.getroot()
print(ET.tostring(root.find('test')))
yields
<test>The tag &lt;StackOverflow&gt; is good to bring up at parties.</test>
Alternatively, you could escape the text with saxutils.escape:
import xml.sax.saxutils as saxutils
print(saxutils.escape(root.find('test').text))
yields
The tag &lt;StackOverflow&gt; is good to bring up at parties.

Django parse XML from a POST

I'm receiving an HTTP POST with one parameter sent: xml
It contains an XML document. The format of this document is:
<?xml version="1.1" encoding="ISO-8859-1"?>
<delivery_receipt>
<version>1.0</version>
<status>Delivered</status>
</delivery_receipt>
I need to get what's in <status> from the POST. How do I parse the parameter and get the 'status'?
Update....
if request.POST:
from lxml.cssselect import CSSSelector
from lxml.etree import fromstring
h = fromstring(request.POST['xml'])
h.cssselect('delivery_reciept status').text_content()
I'm not sure that request.POST['xml'] will work tho
You can (and should) use CSS selectors with XML documents, provided you are doing relatively simple parsing tasks. CSS selectors are clear, easy to read and write, and for simple queries more readable than XPath.
I suggest getting lxml installed, and using their cssselect features.
Your end result might look like this:
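>>> from lxml.etree import fromstring
>>> # bytes input, because lxml rejects str input that carries an encoding declaration
>>> h = fromstring(b"""<?xml version="1.1" encoding="ISO-8859-1"?>
<delivery_receipt>
<version>1.0</version>
<status>Delivered</status>
</delivery_receipt> """)
>>> # cssselect returns a list of elements; take the first match and read its text
>>> h.cssselect('delivery_receipt status')[0].text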
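If installing lxml is not an option, the same lookup can be sketched with the standard library alone. This is an alternative to the CSS selector approach, not what the answer above proposes. extract_status is a hypothetical helper, and the sample omits the XML declaration, so handling of the version="1.1" declaration is not exercised here.

```python
import xml.etree.ElementTree as ET


def extract_status(xml_text):
    """Return the text of the <status> child of the root element, or None."""
    root = ET.fromstring(xml_text)
    # The root is <delivery_receipt>, so "status" is a direct child path.
    return root.findtext("status")


# Stand-in for the value of request.POST['xml'] in a Django view.
sample = """<delivery_receipt>
<version>1.0</version>
<status>Delivered</status>
</delivery_receipt>"""

status = extract_status(sample)
print(status)
```

In a view this would be called as extract_status(request.POST['xml']), with a ParseError handler around it because the payload comes from outside.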

lxml removing <?xml ...> tags when parsing?

I'm currently working with parsing XML documents (adding elements, adding attributes, etc). So I first need to parse the XML in before working on it. However, lxml seems to be removing the element <?xml ...>. For example
from lxml import etree
tree = etree.fromstring('<?xml version="1.0" encoding="utf-8"?><dmodule>test</dmodule>', etree.XMLParser())
print etree.tostring(tree)
will result in
<dmodule>test</dmodule>
Does anyone know why the <?xml ...> element is being removed? I thought encoding tags were valid XML. Thanks for your time.
The <?xml> element is an XML declaration, so it's not strictly an element. It just gives info about the XML tree below it.
If you need to print it out with lxml, the serialisation documentation describes the xml_declaration=True flag you can use:
http://lxml.de/api.html#serialisation
etree.tostring(tree, xml_declaration=True)
Does anyone know why the <?xml ...> element is being removed?
XML defaults to version 1.0 in UTF-8 so the document is equivalent if you remove them.
You are parsing some XML to a data structure and then converting that data structure back to XML. You will get a representation of that data structure in XML, but it might not be expressed in the same way (so the prolog can be removed and <foo /> can be exchanged with <foo></foo> and so on).
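The round trip can be tried without lxml as well. The sketch below assumes Python 3.8 or later, where the standard library's ElementTree.tostring accepts an xml_declaration argument similar to the lxml flag; the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(b'<?xml version="1.0" encoding="utf-8"?><dmodule>test</dmodule>')

# The declaration is not part of the element tree, so a plain
# serialisation of the root element does not include it.
without_decl = ET.tostring(root)

# Asking for it explicitly makes the serialiser write a fresh declaration.
with_decl = ET.tostring(root, encoding="utf-8", xml_declaration=True)

print(without_decl)
print(with_decl)
```

The declaration that comes back is generated by the serialiser from the requested encoding, so it is not a copy of the one in the input.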
