xPath with ElementTree (python) to parse XML from string - python

I'm using ElementTree to parse some XML retrieved from a website, but somehow I can't seem to use ".find" or ".findall". I tried ElementTree and I tried lxml.etree, and neither is working for me. My goal is to retrieve //COURSE from the XML retrieved from a URL.
import requests
import xml.etree.ElementTree as ET
res = requests.get(COURSES_URL).text #Storing the XML into res
XML = ET.fromstring(res)
print(XML.findall('//COURSE'))
COURSES_URL is my own URL from which I am retrieving the XML, and yes, it is working, since I get the XML output that I want (sample):
<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated by Oracle Reports version 11.1.2.1.0 -->
<SYRSPOS_REP>
<LIST_G_PROGRAM>
<G_PROGRAM>
<SPRIDEN_ID>U712214</SPRIDEN_ID>
<STUDENT_NAME>Mark Adam Johns</STUDENT_NAME>
<SMBPOGN_PIDM>98</SMBPOGN_PIDM>
<SMBPOGN_REQUEST_NO>46</SMBPOGN_REQUEST_NO>
<COURSE ID=1411001>PASS</COURSE>
<COURSE ID=1411023>PASS</COURSE>
<COURSE ID=1411136>PASS</COURSE>
</G_PROGRAM>
</LIST_G_PROGRAM>
</SYRSPOS_REP>

Solved:
Apparently I had two issues.
First, findall returns a list of Element objects, so printing it directly does not show their text; I had to loop with for i in XML.findall(...) and print i.text (an attribute, not a method).
Second, I had to add a dot at the start of the path, right after the opening quotation mark, as in ".//COURSE".
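The two fixes can be put together in a minimal sketch. It assumes the ID attribute values are quoted (well-formed XML requires this, unlike the pasted sample) and uses an inline string as a stand-in for the text fetched from COURSES_URL:

```python
import xml.etree.ElementTree as ET

# Inline stand-in for requests.get(COURSES_URL).text.
# Assumption: the ID attribute values are quoted, as well-formed XML requires.
sample = """<SYRSPOS_REP>
  <LIST_G_PROGRAM>
    <G_PROGRAM>
      <SPRIDEN_ID>U712214</SPRIDEN_ID>
      <COURSE ID="1411001">PASS</COURSE>
      <COURSE ID="1411023">PASS</COURSE>
      <COURSE ID="1411136">PASS</COURSE>
    </G_PROGRAM>
  </LIST_G_PROGRAM>
</SYRSPOS_REP>"""

root = ET.fromstring(sample)

# ".//COURSE" searches every descendant of root. The leading dot makes the
# path relative to the element the search starts from.
courses = [(c.get("ID"), c.text) for c in root.findall(".//COURSE")]

# Print the attribute and text of each match instead of the raw list.
for course_id, result in courses:
    print(course_id, result)
```

Collecting the pairs into a list first is only a convenience here; looping over findall directly works just as well.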

Related

Parsing an xml file with an emphasis tag in it in python

I am currently writing a Python script that extracts all of the text in an XML file. I am using the ElementTree library to interpret the data, but I run into a problem when the data is structured like this:
<Segment StartTime="639.752" EndTime="642.270" Participant="fe016">
But I bet it's a good <Pause/> superset of it.
</Segment>
When I attempt to read out the text, I get the first half of the Segment ("Alright. So what we had") before the pause tag.
What I am trying to figure out is if there is a way to ignore the tags in the data segments and print out all of the text.
Another solution.
from simplified_scrapy import SimplifiedDoc,req,utils
html = '''<Segment StartTime="639.752" EndTime="642.270" Participant="fe016">
But I bet it's a good <Pause/> superset of it.
</Segment>'''
doc = SimplifiedDoc(html)
print(doc.Segment)
print(doc.Segment.text)
Result:
{'StartTime': '639.752', 'EndTime': '642.270', 'Participant': 'fe016', 'tag': 'Segment', 'html': "\n But I bet it's a good <Pause /> superset of it.\n"}
But I bet it's a good superset of it.
Here are more examples. https://github.com/yiyedata/simplified-scrapy-demo/blob/master/doc_examples
xml = '''<Segment StartTime="639.752" EndTime="642.270" Participant="fe016">
But I bet it's a good <Pause/> superset of it.
</Segment>'''
# solution using ETree
from xml.etree import ElementTree as ET
root = ET.fromstring(xml)
pause = root.find('./Pause')
print(root.text + pause.tail)
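If the segments can contain other inline tags besides Pause, naming each child becomes awkward. A minimal sketch of a more general variant, assuming the goal is simply all text in document order, uses ElementTree's itertext():

```python
from xml.etree import ElementTree as ET

xml = '''<Segment StartTime="639.752" EndTime="642.270" Participant="fe016">
But I bet it's a good <Pause/> superset of it.
</Segment>'''

root = ET.fromstring(xml)

# itertext() yields the element's text plus the text and tail of every
# descendant, in document order, so no child tag has to be named.
raw_text = ''.join(root.itertext())

# Collapse the newlines and the double space left where <Pause/> was.
clean_text = ' '.join(raw_text.split())
print(clean_text)
```

The whitespace normalization is optional; drop it if the original line breaks matter.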

python lxml pkg - how to incrementally write to an XML file using etree.xmlfile AND passing in existing elements?

I'm very new to anything XML related, so please bear with me. I'm trying to build some code that converts rasters to KML files for Google Earth.
I've come across the lxml package which has made my life easier, but now am facing an issue.
Let's say I've created an element called kml with namespaces:
from lxml import etree
version = '2.2'
namespace_hdr = {'gx': f'http://www.google.com/kml/ext/{version}',
                 'kml': f'http://www.opengis.net/kml/{version}',
                 'atom': f'http://www.w3.org/2005/Atom'
                 }
kml = etree.Element('kml', nsmap=namespace_hdr)
And I've also created an element called Document:
Document = etree.SubElement(kml, 'Document')
Now, I have a lot of data I want to write and am running into memory issues, so I figured the best approach would be to generate the data on the fly and write it as I go, hence the incremental writing.
The approach I'm using is:
out_file = 'test.kml'
with etree.xmlfile(out_file, encoding='utf-8') as xf:
    xf.write_declaration()
    with xf.element(kml):
        xf.write(Document)
Which returns the error:
TypeError: Argument must be bytes or unicode, got '_Element'
If I change kml to 'kml' it works fine, but obviously does not write the namespaces to the file that I've defined in the kml element.
How is it possible to pass in the kml element instead of a string? Is there a way to do this? Or some other way of incrementally writing to the file?
Any thoughts would be appreciated.
FYI - output when using 'kml' is:
<?xml version='1.0' encoding='utf-8'?>
<kml><Document/>
</kml>
I'm trying to achieve the same but with the namespaces:
<?xml version="1.0" encoding="utf-8"?>
<kml xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
<Document/>
</kml>

xmltodict and back again; attributes get their own tag

I'm trying to create something that imports an XML, compares some of the values to values from another XML or values in an Oracle Database, and then write it back again with some values changed. I've tried simply importing the xml and then exporting it again, but that already leads to an issue for me; xml attributes are not shown as attributes within the tag anymore, instead they get their own child tag.
I think it's the same problem as described here, in which the top answer says that the issue has been open for years. I'm hoping you guys know an elegant way to fix this, because the only thing I can think of is doing a replace after the export.
import xmltodict
from dicttoxml import dicttoxml
testfile = '<Testfile><Amt Ccy="EUR">123.45</Amt></Testfile>'
print(testfile)
print('\n')
orgfile = xmltodict.parse(testfile)
print(orgfile)
print('\n')
newfile = dicttoxml(orgfile, attr_type=False).decode()
print(newfile)
Result:
D:\python3 Test.py
<Testfile><Amt Ccy="EUR">123.45</Amt></Testfile>
OrderedDict([('Testfile', OrderedDict([('Amt', OrderedDict([('@Ccy', 'EUR'), ('#
text', '123.45')]))]))])
<?xml version="1.0" encoding="UTF-8" ?><root><Testfile><Amt><key name="@Ccy">EUR
</key><key name="#text">123.45</key></Amt></Testfile></root>
You can see the input tag Amt Ccy="EUR" gets converted to Amt with child tags.
I'm not sure which libraries you're actually using, but xmltodict has an unparse method that does exactly what you want:
import xmltodict
testfile = '<Testfile><Amt Ccy="EUR">123.45</Amt></Testfile>'
print(testfile)
print('\n')
orgfile = xmltodict.parse(testfile)
print(orgfile)
print('\n')
newfile = xmltodict.unparse(orgfile, pretty=False)
print(newfile)
Output:
<Testfile><Amt Ccy="EUR">123.45</Amt></Testfile>
OrderedDict([('Testfile', OrderedDict([('Amt', OrderedDict([('@Ccy', 'EUR'), ('#text', '123.45')]))]))])
<?xml version="1.0" encoding="utf-8"?>
<Testfile><Amt Ccy="EUR">123.45</Amt></Testfile>

XML not parsing correctly using requests and lxml

I'm trying to get content out of XML from an API call. I'm able to use requests to get the xml content, but can't seem to parse it correctly. Here is the code that has been semi-successful so far:
import requests
from lxml import etree
data = requests.get('http://elections.huffingtonpost.com/pollster/api/polls.xml', params={'sort':'updated'})
tree = etree.XML(data.content)
The tree is showing the line breaks from the xml as text, and some of the nodes that are more than 3 levels deep are gone.

Find an element in an XML tree using ElementTree

I am trying to locate a specific element in an XML file, using ElementTree. Here is the XML:
<documentRoot>
<?version="1.0" encoding="UTF-8" standalone="yes"?>
<n:CallFinished xmlns="http://api.callfire.com/data" xmlns:n="http://api.callfire.com/notification/xsd">
<n:SubscriptionId>96763001</n:SubscriptionId>
<Call id="158864460001">
<FromNumber>5129618605</FromNumber>
<ToNumber>15122537666</ToNumber>
<State>FINISHED</State>
<ContactId>125069153001</ContactId>
<Inbound>true</Inbound>
<Created>2014-01-15T00:15:05Z</Created>
<Modified>2014-01-15T00:15:18Z</Modified>
<FinalResult>LA</FinalResult>
<CallRecord id="94732950001">
<Result>LA</Result>
<FinishTime>2014-01-15T00:15:15Z</FinishTime>
<BilledAmount>1.0</BilledAmount>
<AnswerTime>2014-01-15T00:15:06Z</AnswerTime>
<Duration>9</Duration>
</CallRecord>
</Call>
</n:CallFinished>
</documentRoot>
I am interested in the <Created> item. Here is the code I am using:
import xml.etree.ElementTree as ET
calls_root = ET.fromstring(calls_xml)
for item in calls_root.find('CallFinished/Call/Created'):
    print "Found you!"
    call_start = item.text
I have tried a bunch of different XPath expressions, but I'm stumped - I cannot locate the element. Any tips?
You aren't referencing the namespaces that exist in the XML document, so ElementTree can't find the elements in that XPath. You need to tell ElementTree what namespaces you are using.
The following should work:
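import xml.etree.ElementTree as ET
# Namespace URIs in the {uri} form that ElementTree uses in qualified tag names;
# '_' stands for the document's default namespace.
namespaces = {'n': '{http://api.callfire.com/notification/xsd}',
              '_': '{http://api.callfire.com/data}'
              }
calls_root = ET.fromstring(calls_xml)
# find() returns a single element (or None), so test it instead of iterating over it.
item = calls_root.find('{n}CallFinished/{_}Call/{_}Created'.format(**namespaces))
if item is not None:
    print("Found you!")
    call_start = item.text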
Alternatively, lxml implements the ElementTree API and has good support for namespaces without having to worry about string formatting.
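The standard library's find, findall, and iterfind also accept a prefix-to-URI mapping as a second argument, which avoids the string formatting as well. A minimal sketch, assuming a trimmed copy of the document without the stray <?version ...?> line, and using "d" as a hypothetical prefix for the default namespace:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the document from the question. Assumption: the stray
# <?version ...?> line is not part of the real payload, so it is left out.
calls_xml = """<documentRoot>
<n:CallFinished xmlns="http://api.callfire.com/data" xmlns:n="http://api.callfire.com/notification/xsd">
<n:SubscriptionId>96763001</n:SubscriptionId>
<Call id="158864460001">
<State>FINISHED</State>
<Created>2014-01-15T00:15:05Z</Created>
</Call>
</n:CallFinished>
</documentRoot>"""

# The prefixes are local to the query and need not match the document;
# "d" is a hypothetical prefix chosen here for the default namespace.
ns = {
    "n": "http://api.callfire.com/notification/xsd",
    "d": "http://api.callfire.com/data",
}

calls_root = ET.fromstring(calls_xml)

# find() returns the first matching element, or None if nothing matches.
created = calls_root.find("n:CallFinished/d:Call/d:Created", ns)
call_start = created.text if created is not None else None
print(call_start)
```

Elements under the default namespace still need an explicit prefix in the path, which is why Call and Created are written as d:Call and d:Created.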
