Extracting Specific Lines of XML with Python ElementTree

Extracting Specific Lines of XML with Python ElementTree - python

I am a bit stuck on a project I am doing which uses Python -which I am very new to. I have been told to use ElementTree and get specified data out of an incoming XML file. It sounds simple but I am not great at programming. Below is a (very!) tiny example of an incoming file along with the code I am trying to use.
I would like any tips or places to go next with this. I have tried searching and following what other people have done but I can't seem to get the same results. My aim is to get the information contained in the "Active", "Room" and "Direction" but later on I will need to get much more information.
I have tried using XPaths but it does not work too well, especially with the namespaces the xml uses and the fact that an XPath for everything I would need would become too large. I have simplified the example so I can understand the principle to do, as after this it must be extended to gain more information from an "AssetEquipment" and multiple instances of them. Then end goal would be all information from one equipment being saved to a dictionary so I can manipulate it later, with each new equipment in its own separate dictionary.
Example XML:
<AssetData>
<Equipment>
<AssetEquipment ID="3" name="PC960">
<Active>Yes</Active>
<Location>
<RoomLocation>
<Room>23</Room>
<Area>
<X-Area>-1</X-Area>
<Y-Area>2.4</Y-Area>
</Area>
</RoomLocation>
</Location>
<Direction>Positive</Direction>
<AssetSupport>12</AssetSupport>
</AssetEquipment>
</Equipment>
Example Code:
tree = ET.parse('C:\Temp\Example.xml')
root = tree.getroot()
ns = "{http://namespace.co.uk}"
for equipment in root.findall(ns + "Equipment//"):
tagname = re.sub(r'\{.*?\}','',equipment.tag)
name = equipment.get('name')
if tagname == 'AssetEquipment':
print "\tName: " + repr(name)
for attributes in root.findall(ns + "Equipment/" + ns + "AssetEquipment//"):
attname = re.sub(r'\{.*?\}','',attributes.tag)
if tagname == 'Room': #This does not work but I need it to be found while
#in this instance of "AssetEquipment" so it does not
#call information from another asset instead.
room = equipment.text
print "\t\tRoom:", repr(room)

import xml.etree.cElementTree as ET
tree = ET.parse('test.xml')
for elem in tree.getiterator():
if elem.tag=='{http://www.namespace.co.uk}AssetEquipment':
output={}
for elem1 in list(elem):
if elem1.tag=='{http://www.namespace.co.uk}Active':
output['Active']=elem1.text
if elem1.tag=='{http://www.namespace.co.uk}Direction':
output['Direction']=elem1.text
if elem1.tag=='{http://www.namespace.co.uk}Location':
for elem2 in list(elem1):
if elem2.tag=='{http://www.namespace.co.uk}RoomLocation':
for elem3 in list(elem2):
if elem3.tag=='{http://www.namespace.co.uk}Room':
output['Room']=elem3.text
print output

Related

Parsing XML Data with Python

actually I am working on a small project and need to parse public available XML data. My goal is to write the data to an mysql database for further processing.
XML Data Link: http://offenedaten.frankfurt.de/dataset/912fe0ab-8976-4837-b591-57dbf163d6e5/resource/48378186-5732-41f3-9823-9d1938f2695e/download/parkdatendyn.xml
XML structure (example):
<parkingAreaStatus>
<parkingAreaOccupancy>0.2533602</parkingAreaOccupancy>
<parkingAreaOccupancyTrend>stable</parkingAreaOccupancyTrend>
<parkingAreaReference targetClass="ParkingArea" id="2[Zeil]"
version="1.0"/>
<parkingAreaStatusTime>2018-02-
04T01:30:00.000+01:00</parkingAreaStatusTime
</parkingAreaStatus>
<parkingAreaStatus>
<parkingAreaOccupancy>0.34625</parkingAreaOccupancy>
<parkingAreaOccupancyTrend>stable</parkingAreaOccupancyTrend>
<parkingAreaReference targetClass="ParkingArea" id="5[Dom / Römer]"
version="1.0"/>
</parkingAreaStatus>
Using the code
import csv
import pymysql
import urllib.request
url = "http://offenedaten.frankfurt.de/dataset/912fe0ab-8976-4837-b591-57dbf163d6e5/resource/48378186-5732-41f3-9823-9d1938f2695e/download/parkdatendyn.xml"
from lxml.objectify import parse
from lxml import etree
from urllib.request import urlopen
locations_root = parse(urlopen(url)).getroot()
locations = list(locations_root.payloadPublication.genericPublicationExtension.parkingFacilityTableStatusPublication.parkingAreaStatus.parkingAreaReference)
print(*locations)
I expected to get a list of all "parkingAreaReference" entries within the XML document. Unfortunately the list is empty.
Playing arround with some code I got the sentiment that only the first block is parsed, I was able to fill the list with the value of "parkingAreaOccupancy" of the "parkingAreaReference" id="2[Zeil]" block by using the code
locations = list(locations_root.payloadPublication.genericPublicationExtension.parkingFacilityTableStatusPublication.parkingAreaStatus.parkingAreaOccupancy)
print(*locations)
-> 0.2533602
which is not the expected outcome
-> 0.2533602
-> 0.34625
MY question is:
What is the best way to get a matrix i can further work with of all blocks incl. the corresponding values stated in the XML document?
Example output:
A = [[ID:2[Zeil],0.2533602,stable,2018-02-
04T01:30:00.000+01:00],[id="5[Dom / Römer],0.34625,stable,2018-02-
04T01:30:00.000+01:00]]
or in general
A = [parkingAreaOccupancy,parkingAreaOccupancyTrend,parkingAreaStatusTime,....],[parkingAreaOccupancy,parkingAreaOccupancyTrend,parkingAreaStatusTime,.....]
After hours of research I hope for some tips from your site
Thank you in advance,
TR

You can just use etree directly and find interesting elements using XPath1 query. One important thing to note is, that your XML has default namespace declared at the root element :
xmlns="http://datex2.eu/schema/2/2_0"
By definition, element where default namespace is declared and all descendant elements without prefix are belong to this default namespace (unless another default namespace found in one of the descendant elements, which is not the case with your XML). This is why we define a prefix d, which references default namespace URI, in the following code, and we use that prefix to find every elements we need to get information from :
root = etree.parse(urlopen(url)).getroot()
ns = { 'd': 'http://datex2.eu/schema/2/2_0' }
parking_area = root.xpath('//d:parkingAreaStatus', namespaces=ns)
for pa in parking_area:
area_ref = pa.find('d:parkingAreaReference', ns)
occupancy = pa.find('d:parkingAreaOccupancy', ns)
trend = pa.find('d:parkingAreaOccupancyTrend', ns)
status_time = pa.find('d:parkingAreaStatusTime', ns)
print area_ref.get('id'), occupancy.text, trend.text, status_time.text
Below is the output of the demo code above. Instead of print, you can store these information in whatever data structure you like :
2[Zeil] 0.22177419 stable 2018-02-04T05:16:00.000+01:00
5[Dom / Römer] 0.28625 stable 2018-02-04T05:16:00.000+01:00
1[Anlagenring] 0.257889 stable 2018-02-04T05:16:00.000+01:00
3[Mainzer Landstraße] 0.20594966 stable 2018-02-04T05:16:00.000+01:00
4[Bahnhofsviertel] 0.31513646 stable 2018-02-04T05:16:00.000+01:00
1) some references on XPath :
XPath 1.0 spec: The most trustworthy reference on XPath 1.0
XPath syntax: Gentler introduction to basic XPath expressions

How do I extract specific data from xml using python?

I'm relatively new to python. I've been trying to learn python through a hands-on approach (I learnt c/c++ through the doing the euler project).
Right now I'm learning how to extract data from files. I've gotten the hang of extracting data from simple text files but I'm kinda stuck on xml files.
An example of what I was trying to do.
I have my call logs backed up on google drive and they're a lot (about 4000)
Here is the xml file example
<call number="+91234567890" duration="49" date="1483514046018" type="3" presentation="1" readable_date="04-Jan-2017 12:44:06 PM" contact_name="Dad" />
I want to take all the calls to my dad and display them like this
number = 234567890
duration = "49" date="04-Jan-2017 12:44:06 PM"
duration = "x" date="y"
duration = "n" date="z"
and so on like that.
How do you propose I do that?

It's advisable to provide sufficient information in a question so that problem can be recreated.
<?xml version="1.0" encoding="UTF-8"?>
<call number="+91234567890" duration="49" date="1483514046018" type="3"
presentation="1" readable_date="04-Jan-2017 12:44:06 PM"
contact_name="Dad" />
First we need to figure out what elements can we iter on. Since <call ../> is root element over here, we iter over that.
NOTE: if you have tags/element prior to the line provided, you will need to figure out proper root element instead of call.
>>> [i for i in root.iter('call')]
[<Element 'call' at 0x29d3410>]
Here you can see, we can iter on element call.
Then we simply iter over the element and separate out element attribute key and values as per requirements.
Working Code
import xml.etree.ElementTree as ET
data_file = 'test.xml'
tree = ET.parse(data_file)
root = tree.getroot()
for i in root.iter('call'):
print 'duration', "=", i.attrib['duration']
print 'data', "=", i.attrib['date']
Result
>>>
duration = 49
data = 1483514046018
>>>

How to parse .xml file with multiple nested children in python?

I am using python to parse a .xml file which is quite complicated since it has a lot of nested children; accessing some of the values contained in it is quite annoying since the code starts to become pretty bad looking.
Let me first present you the .xml file:
<?xml version="1.0" encoding="utf-8"?>
<Start>
<step1 stepA="5" stepB="6" />
<step2>
<GOAL1>11111</GOAL1>
<stepB>
<stepBB>
<stepBBB stepBBB1="pinco">1</stepBBB>
</stepBB>
<stepBC>
<stepBCA>
<GOAL2>22222</GOAL2>
</stepBCA>
</stepBC>
<stepBD>-NO WOMAN NO CRY
-I SHOT THE SHERIF
-WHO LET THE DOGS OUT
</stepBD>
</stepB>
</step2>
<step3>
<GOAL3 GOAL3_NAME="GIOVANNI" GOAL3_ID="GIO">
<stepB stepB1="12" stepB2="13" />
<stepC>XXX</stepC>
<stepC>
<stepCC>
<stepCC GOAL4="saf12">33333</stepCC>
</stepCC>
</stepC>
</GOAL3>
</step3>
<step3>
<GOAL3 GOAL3_NAME="ANDREA" GOAL3_ID="DRW">
<stepB stepB1="14" stepB2="15" />
<stepC>YYY</stepC>
<stepC>
<stepCC>
<stepCC GOAL4="fwe34">44444</stepCC>
</stepCC>
</stepC>
</GOAL3>
</step3>
</Start>
My goal would be to access the values contained inside of the children named "GOAL" in a nicer way then the one I wrote in my sample code below. Furthermore I would like to find an automated way to find the values of GOALS having the same type of tag belonging to different children having the same name:
Example: GIOVANNI and ANDREA are both under the same kind of tag (GOAL3_NAME) and belong to different children having the same name (<step3>) though.
Here is the code that I wrote:
import xml.etree.ElementTree as ET
data = ET.parse('test.xml').getroot()
GOAL1 = data.getchildren()[1].getchildren()[0].text
print(GOAL1)
GOAL2 = data.getchildren()[1].getchildren()[1].getchildren()[1].getchildren()[0].getchildren()[0].text
print(GOAL2)
GOAL3 = data.getchildren()[2].getchildren()[0].text
print(GOAL3)
GOAL4_A = data.getchildren()[2].getchildren()[0].getchildren()[2].getchildren()[0].getchildren()[0].text
print(GOAL4_A)
GOAL4_B = data.getchildren()[3].getchildren()[0].getchildren()[2].getchildren()[0].getchildren()[0].text
print(GOAL4_B)
and the output that I get is the following:
11111
22222
33333
44444
The output that I would like should be like this:
11111
22222
GIOVANNI
33333
ANDREA
44444
As you can see I am able to read GOAL1 and GOAL2 easily but I am looking for a nicer code practice to access those values since it seems to me too long and hard to read/understand.
The second thing I would like to do is getting GOAL3 and GOAL4 in a automated way so that I do not have to repeat similar lines of codes and make it more readable and understandable.
Note: as you can see I was not able to read GOAL3. If possible I would like to get both the GOAL3_NAME and GOAL3_ID
In order to make the .xml file structure more understandable I post an image of what it looks like:
The highlighted elements are what I am looking for.

here is simple example for iterating from head to tail with a recursive method and cElementTree(15-20x faster), you can than collect the needed information from that
import xml.etree.cElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
def get_tail(root):
for child in root:
print child.text
get_tail(child)
get_tail(root)

import xml.etree.cElementTree as ET
data = ET.parse('test.xml')
for d in data.iter():
if d.tag in ["GOAL1", "GOAL2", "stepCC", "stepCC"]:
print d.text
elif d.tag in ["GOAL3", "GOAL4"]:
print d.attrib.values()[0]

Extracting XML Attributes

I have an XML file with several thousand records in it in the form of:
<custs>
<record cust_ID="B123456#Y1996" l_name="Jungle" f_name="George" m_name="OfThe" city="Fairbanks" zip="00010" current="1" />
<record cust_ID="Q975697#Z2000" l_name="Freely" f_name="I" m_name="P" city="Yellow River" zip="03010" current="1" />
<record cust_ID="M7803#J2323" l_name="Jungle" f_name="Jim" m_name="" city="Fallen Arches" zip="07008" current="0" />
</custs>
# (I know it's not normalized. This is just sample data)
How can I convert this into a CSV or tab-delimited file? I know I can hard-code it in Python using re.compile() statements, but there has to be something easier, and more portable among diff XML file layouts.
I've found a couple threads here about attribs, (Beautifulsoup unable to extract data using attrs=class, Extracting an attribute value with beautifulsoup) and they have gotten me almost there with:
# Python 3.30
#
from bs4 import BeautifulSoup
import fileinput
Input = open("C:/Python/XML Tut/MinGrp.xml", encoding = "utf-8", errors = "backslashreplace")
OutFile = open('C:/Python/XML Tut/MinGrp_Out.ttxt', 'w', encoding = "utf-8", errors = "backslashreplace")
soup = BeautifulSoup(Input, features="xml")
results = soup.findAll('custs', attrs={})
# output = results [0]#[0]
for each_tag in results:
cust_attrb_value = results[0]
# print (cust_attrb_value)
OutFile.write(cust_attrb_value)
OutFile.close()
What's the next (last?) step?

If this data is formatted correctly -- as in, uses canonical XML -- you should consider lxml rather than BeautifulSoup. With lxml, you read the file, then you can apply DOM logic on it, including XPath queries. With your XPath queries, you can then get the lxml objects that represent each node that you're interested in, extract the data from them that you need, and rewrite them into an arbitrary format of your choosing using something like the csv module..
Specifically, in the lxml documentation, check out these tutorials:
Parsing from Strings and Files
The Element Class: Using XPath to Find Text

I (also) wouldn't use BeautifulSoup for this, and though I like lxml, that's an extra install, and if you don't want to bother, this is simple enough to do with the standard lib ElementTree module.
Something like:
import xml.etree.ElementTree as ET
import sys
tree=ET.parse( 'test.xml' )
root=tree.getroot()
rs=root.getchildren()
keys = rs[0].attrib.keys()
for a in keys: sys.stdout.write(a); sys.stdout.write('\t')
sys.stdout.write('\n')
for r in rs:
assert keys == r.attrib.keys()
for k in keys: sys.stdout.write( r.attrib[k]); sys.stdout.write('\t')
sys.stdout.write('\n')
will, from python-3, produce :
zip m_name current city cust_ID l_name f_name
00010 OfThe 1 Fairbanks B123456#Y1996 Jungle George
03010 P 1 Yellow River Q975697#Z2000 Freely I
07008 0 Fallen Arches M7803#J2323 Jungle Jim
Note that with Python-2.7, the order of the attributes will be different.
If you want them to output in a different specific order, you should sort or
order the list "keys" .
The assert is checking that all rows have the same attributes.
If you actually have missing or different attributes in the elements,
then you'll have to remove that and add some code to deal with the differences
and supply defaults for missing values. ( In your sample data, you have a
null value ( m_name="" ), rather than a missing value. You might want to check
that this case is handled OK by the consumer of this output, or else add some
more special handling for this case.

<product product_id='66656432' name='munch'><category>men</category></product>
In beautiful soup,
product=soup.find("product",attrs={})
then use attribute to access data like product["name"]

Pulling the value from parsed XML in Python (only)

I am trying to pull a value (only) from some XML in Python using Beautiful Soup (but I'll gleefully dump it for anything else if recommended). Consider the following bit of code;
global humidity, temperature, weatherdescription, winddescription
query = urllib2.urlopen('http://www.google.com/ig/api?weather="Aberdeen+Scotland"')
weatherxml = query.read()
weathersoup = BeautifulSoup(weatherxml)
query.close()
print weatherxml
This prints out the weather forecast for Aberdeen, Scotland as XML (currently) thusly (much XML removed to prevent giant wall of text syndrome);
<?xml version="1.0"?><xml_api_reply version="1"><weather module_id="0"
tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0"
><forecast_information><city data="Aberdeen, Aberdeen City"/><postal_code data=""Aberdeen Scotland""/><latitude_e6
data=""/><longitude_e6 data=""/><forecast_date
data="2012-07-31"/><current_date_time data="1970-01-01 00:00:00
+0000"/><unit_system data="US"/></forecast_information><current_conditions><condition
data="Clear"/><temp_f data="55"/><temp_c data="13"/><humidity
data="Humidity: 82%"/><icon
data="/ig/images/weather/sunny.gif"/><wind_condition data="Wind: SE at
8 mph"/></current_conditions>
Now I'd like, for example, to be able to populate variables with the values of the weather in this XML, for example make temperature = 13. Parsing it is proving a nightmare.
If I use any of the find functions on weathersoup, I get the entire tag (e.g for temp_c it returns "<temp_c data="13">), various other functions return nothing, or the entire sheet, or parts of it.
How do I simply return the VALUE for any given XML tag, without a mess of "strip"s, or resorting to regex, or basically hacking it?

To access an attribute data in element temp_c:
weathersoup.temp_c['data']

Use lxml, and get friendly with XPath. Some of this example doesn't make sense with the XML you provided, since it doesn't parse correctly... but hopefully it gives you an idea of how powerful XPath can be.
from lxml import etree
# xmlstr is the string of the input XML data
root = etree.fromstring(xmlstr)
# print the text in all current_date_time elements
for elem in root.xpath('//current_date_time'):
print elem.text
# print the values for every data attribute in every temp_c element
for value in root.xpath('//temp_c#data'):
print value
# print the text for only the temp_c elements whose data element is 'Celsius'
for elem in root.xpath('//temp_c[#data="Celsius"]'):
print elem.text
# print the text for only the temp_c elements that are under the temperatures element, which is under the root.
for elem in root.xpath('/temperatures/temp_c'):
print elem.text

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extracting Specific Lines of XML with Python ElementTree - python

Related

Parsing XML Data with Python

How do I extract specific data from xml using python?

How to parse .xml file with multiple nested children in python?

Extracting XML Attributes

Pulling the value from parsed XML in Python (only)

Categories

Resources