Extract multiple xml attributes to pandas dataframe

I have a basic xml file called meals.xml which looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<meals name="Sample Text">
<meal id="1" name="Poached Eggs" type="breakfast"/>
<meal id="2" name="Club Sandwich" type="lunch"/>
<meal id="3" name="Steak" type="dinner"/>
<meal id="4" name="Steak" type="dinner"/>
</meals>
I want to extract both the 'id' and 'name' attributes into a dataframe. I can extract one attribute when I specify one column (e.g., name only), but I can't figure out the syntax for getting multiple attributes in the for loop. This is what I've tried, adding id to 'df_cols' and to the 'attrib.get' call:
import xml.etree.ElementTree as ET
import pandas as pd
root = ET.parse('meals.xml').getroot()
df_cols = ["id", "name"]
rows = []
for node in root:
    value = node.attrib.get('id', 'name')
    rows.append(value)
df = pd.DataFrame(rows, columns = df_cols)
df
Can someone advise how to do this?

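attrib.get('id', 'name') looks up only 'id' and uses 'name' as the fallback value, so it never returns both attributes. Build one dictionary per element instead; the code below may work for you: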
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<meals name="Sample Text">
<meal id="1" name="Poached Eggs" type="breakfast"/>
<meal id="2" name="Club Sandwich" type="lunch"/>
<meal id="3" name="Steak" type="dinner"/>
<meal id="4" name="Steak" type="dinner"/>
</meals>'''
root = ET.fromstring(xml)
data = [{'id': m.attrib['id'], 'name': m.attrib['name']} for m in root.findall('.//meal')]
df = pd.DataFrame(data)
print(df)
output
   id           name
0   1   Poached Eggs
1   2  Club Sandwich
2   3          Steak
3   4          Steak
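
As a variation on the answer above, the list of wanted attributes can be passed in rather than hard-coded. This is a minimal sketch, not the answerer's code: the helper name attributes_to_rows is hypothetical, and it uses Element.get() so that a missing attribute yields None instead of raising KeyError.

```python
import xml.etree.ElementTree as ET

# Sample input copied from the question (XML declaration omitted).
MEALS_XML = """<meals name="Sample Text">
<meal id="1" name="Poached Eggs" type="breakfast"/>
<meal id="2" name="Club Sandwich" type="lunch"/>
<meal id="3" name="Steak" type="dinner"/>
<meal id="4" name="Steak" type="dinner"/>
</meals>"""


def attributes_to_rows(root, tag, wanted):
    """Return one dict per <tag> element, keeping only the wanted attributes.

    Element.get() returns None when an attribute is absent, so a missing
    attribute becomes an empty cell rather than a KeyError.
    """
    return [{name: elem.get(name) for name in wanted} for elem in root.iter(tag)]


root = ET.fromstring(MEALS_XML)
rows = attributes_to_rows(root, "meal", ["id", "name"])
for row in rows:
    print(row)
```

Passing rows to pd.DataFrame(rows) should then give the same two-column dataframe as above.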

Related

Parsing custom xml file using python

I have an XML file in the following format:
<?xml version='1.0' encoding='utf-8'?>
<execute time="0.59">
<exec name="recursive_a" loops="3" fail="2" skipped="0">
<testcase tname="test_a" name="test.cpp" time="0.50">
<pass>
001,test,pass
</pass>
</testcase>
</exec>
</execute>
How can I parse the "recursive_a" string from this XML using Python? (I am using the minidom XML parser.)

With xml.etree.ElementTree and pandas one solution could be:
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse('Code3r.xml')
root = tree.getroot()
for elem in root:
    if elem.tag == "exec":
        # print(elem.attrib) or with pandas
        df = pd.DataFrame.from_dict(elem.attrib, orient='index')
        print(df.T.to_string(index=False))
Output:
name loops fail skipped
recursive_a 3 2 0
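
Since the question mentions minidom, here is a small sketch of the same lookup with xml.dom.minidom from the standard library. The XML is inlined from the question so the snippet runs on its own; with a file on disk, minidom.parse(path) would presumably take the place of parseString.

```python
from xml.dom import minidom

# Sample input copied from the question (XML declaration omitted).
REPORT_XML = """<execute time="0.59">
<exec name="recursive_a" loops="3" fail="2" skipped="0">
<testcase tname="test_a" name="test.cpp" time="0.50">
<pass>
001,test,pass
</pass>
</testcase>
</exec>
</execute>"""

doc = minidom.parseString(REPORT_XML)
# getElementsByTagName searches the whole document, so nesting depth does not matter.
exec_names = [node.getAttribute("name") for node in doc.getElementsByTagName("exec")]
print(exec_names)
```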

How can we convert a nested XML to CSV in Python Dynamically, Nested XML may contain array of values as well?

Sharing a sample XML file. I need to convert this file to CSV without hard-coding tag names, so that the conversion still works if extra tags are added. The XML tag names should be used as the column names in the CSV.
Example Data:
<?xml version="1.0" encoding="UTF-8"?>
<Food>
<Info>
<Msg>Food Store items.</Msg>
</Info>
<store slNo="1">
<foodItem>meat</foodItem>
<price>200</price>
<quantity>1kg</quantity>
<discount>7%</discount>
</store>
<store slNo="2">
<foodItem>fish</foodItem>
<price>150</price>
<quantity>1kg</quantity>
<discount>5%</discount>
</store>
<store slNo="3">
<foodItem>egg</foodItem>
<price>100</price>
<quantity>50 pieces</quantity>
<discount>5%</discount>
</store>
<store slNo="4">
<foodItem>milk</foodItem>
<price>50</price>
<quantity>1 litre</quantity>
<discount>3%</discount>
</store>
</Food>
I tried the code below, but it raises an error.
import xml.etree.ElementTree as ET
import pandas as pd
ifilepath = r'C:\DATA_DIR\feeds\test\sample.xml'
ofilepath = r'C:\DATA_DIR\feeds\test\sample.csv'
root = ET.parse(ifilepath).getroot()
print(root)
with open(ofilepath, "w") as file:
    for child in root:
        print(child.tag, child.attrib)
        # naive example how you could save to csv line wise
        file.write(child.tag+";"+child.attrib)
The code above finds the root node but fails when concatenating the attributes.
I tried one more approach, but it only works for XML nested one level deep. What about XML with 3-4 levels of nested tags? Currently I can print the tag names and text of all elements, but I need to convert them into a relational model (a CSV file).
import xml.etree.ElementTree as ET
tree = ET.parse(ifilepath)
root = tree.getroot()
for member in root.findall('*'):
    print(member.tag,member.attrib)
    for i in (member.findall('*')):
        print(i.tag,i.text)
The example above works well with pandas read_xml (using the lxml parser).
But when we try a similar approach on the XML data below, the indicator id and country id values are missing from the CSV output.
Example Data ::
<?xml version="1.0" encoding="UTF-8"?>
<du:data xmlns:du="http://www.dummytest.org" page="1" pages="200" per_page="20" total="1400" sourceid="5" sourcename="Dummy ID Test" lastupdated="2022-01-01">
<du:data>
<du:indicator id="AA.BB">various, tests</du:indicator>
<du:country id="MM">test again</du:country>
<du:date>2021</du:date>
<du:value>1234567</du:value>
<du:unit />
<du:obs_status />
<du:decimal>0</du:decimal>
</du:data>
<du:data>
<du:indicator id="XX.YY">testing, cases</du:indicator>
<du:country id="DD">coverage test</du:country>
<du:date>2020</du:date>
<du:value>3456223</du:value>
<du:unit />
<du:obs_status />
<du:decimal>0</du:decimal>
</du:data>
</du:data>
Solution Tried ::
import pandas as pd
pd.read_xml(ifilepath, xpath='.//du:data', namespaces= {"du": "http://www.dummytest.org"}).to_csv(ofilepath, sep=',', index=None, header=True)
Output Got ::
indicator,country,date,value,unit,obs_status,decimal
"various, tests",test again,2021,1234567,,,0
"testing, cases",coverage test,2020,3456223,,,0
Expected output ::
indicator id,indicator,country id,country,date,value,unit,obs_status,decimal
AA.BB,"various, tests",MM,test again,2021,1234567,,,0
XX.YY,"testing, cases",DD,coverage test,2020,3456223,,,0
Adding example data that requires two or more XPath expressions.
Looking for ways to convert it using pandas to_csv().
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl'?>
<CATALOG>
<PLANT>
<COMMON>rose</COMMON>
<BOTANICAL>canadensis</BOTANICAL>
<ZONE>4</ZONE>
<LIGHT>Shady</LIGHT>
<PRICE>202</PRICE>
<AVAILABILITY>446</AVAILABILITY>
</PLANT>
<PLANT>
<COMMON>mango</COMMON>
<BOTANICAL>sunny</BOTANICAL>
<ZONE>3</ZONE>
<LIGHT>shady</LIGHT>
<PRICE>301</PRICE>
<AVAILABILITY>569</AVAILABILITY>
</PLANT>
<PLANT>
<COMMON>Marigold</COMMON>
<BOTANICAL>palustris</BOTANICAL>
<ZONE>4</ZONE>
<LIGHT>Sunny</LIGHT>
<PRICE>500</PRICE>
<AVAILABILITY>799</AVAILABILITY>
</PLANT>
<PLANT>
<COMMON>carrot</COMMON>
<BOTANICAL>Caltha</BOTANICAL>
<ZONE>4</ZONE>
<LIGHT>sunny</LIGHT>
<PRICE>205</PRICE>
<AVAILABILITY>679</AVAILABILITY>
</PLANT>
<FOOD>
<NAME>daal fry</NAME>
<PRICE>300</PRICE>
<DESCRIPTION>
Famous daal tadka from surat
</DESCRIPTION>
<CALORIES>60</CALORIES>
</FOOD>
<FOOD>
<NAME>Dhosa</NAME>
<PRICE>350</PRICE>
<DESCRIPTION>
The famous south indian dish
</DESCRIPTION>
<CALORIES>80</CALORIES>
</FOOD>
<FOOD>
<NAME>Khichdi</NAME>
<PRICE>150</PRICE>
<DESCRIPTION>
The famous gujrati dish
</DESCRIPTION>
<CALORIES>40</CALORIES>
</FOOD>
<BOOK>
<AUTHOR>Santosh Bihari</AUTHOR>
<TITLE>PHP Core</TITLE>
<GENER>programming</GENER>
<PRICE>44.95</PRICE>
<DATE>2000-10-01</DATE>
</BOOK>
<BOOK>
<AUTHOR>Shyam N Chawla</AUTHOR>
<TITLE>.NET Begin</TITLE>
<GENER>Computer</GENER>
<PRICE>250</PRICE>
<DATE>2002-17-05</DATE>
</BOOK>
<BOOK>
<AUTHOR>Anci C</AUTHOR>
<TITLE>Dr. Ruby</TITLE>
<GENER>Computer</GENER>
<PRICE>350</PRICE>
<DATE>2001-04-11</DATE>
</BOOK>
</CATALOG>

ElementTree is not really the best tool for what I believe you're trying to do. Since you have well-formed, relatively simple xml, try using pandas:
import pandas as pd
#from here, it's just a one liner
pd.read_xml('input.xml',xpath='.//store').to_csv('output.csv',sep=',', index = None, header=True)
and that should get you your csv file.
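
If pandas or lxml is not available, the same flattening can be sketched with the standard library alone. This is an illustration under stated assumptions, not part of the answer above: it assumes the repeating record is the most frequent child tag under the root, it only flattens one level of children, and the helper name flatten is hypothetical. The sample is shortened from the question.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Shortened copy of the sample data from the question.
FOOD_XML = """<Food>
<Info><Msg>Food Store items.</Msg></Info>
<store slNo="1"><foodItem>meat</foodItem><price>200</price><quantity>1kg</quantity><discount>7%</discount></store>
<store slNo="2"><foodItem>fish</foodItem><price>150</price><quantity>1kg</quantity><discount>5%</discount></store>
</Food>"""


def flatten(record):
    """Merge a record's own attributes with the text of each direct child."""
    row = dict(record.attrib)
    for child in record:
        row[child.tag] = (child.text or "").strip()
    return row


root = ET.fromstring(FOOD_XML)
# Assumption: the most frequent child tag under the root marks the repeating record.
record_tag = Counter(child.tag for child in root).most_common(1)[0][0]
rows = [flatten(rec) for rec in root.findall(record_tag)]

# Collect column names in first-seen order so extra tags become extra columns.
fieldnames = list(dict.fromkeys(key for row in rows for key in row))

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to an io.StringIO keeps the sketch self-contained; an open file handle created with newline="" could be used in its place.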

Because parsing element values together with their attributes involves a second layer of iteration, consider nested list/dict comprehensions with a dictionary merge. Also, use csv.DictWriter to build the CSV from dictionaries:
from csv import DictWriter
import xml.etree.ElementTree as ET
ifilepath = "Input.xml"
tree = ET.parse(ifilepath)
nmsp = {"du": "http://www.dummytest.org"}
data = [
    {
        **{el.tag.split('}')[-1]: (el.text.strip() if el.text is not None else None) for el in d.findall("*")},
        **{f"{el.tag.split('}')[-1]} {k}": v for el in d.findall("*") for k, v in el.attrib.items()},
        **d.attrib
    }
    for d in tree.findall(".//du:data", namespaces=nmsp)
]
dkeys = list(data[0].keys())
with open("DummyXMLtoCSV.csv", "w", newline="") as f:
    dw = DictWriter(f, fieldnames=dkeys)
    dw.writeheader()
    dw.writerows(data)
Output
indicator,country,date,value,unit,obs_status,decimal,indicator id,country id
"various, tests",test again,2021,1234567,,,0,AA.BB,MM
"testing, cases",coverage test,2020,3456223,,,0,XX.YY,DD
The code above appends the attributes as the last columns of the CSV. For a specific ordering, re-order the dictionaries:
data = [ ... ]
cols = ["indicator id", "indicator", "country id", "country", "date", "value", "unit", "obs_status", "decimal"]
data = [
    {k: d[k] for k in cols} for d in data
]
with open("DummyXMLtoCSV.csv", "w", newline="") as f:
    dw = DictWriter(f, fieldnames=cols)
    dw.writeheader()
    dw.writerows(data)
Output
indicator id,indicator,country id,country,date,value,unit,obs_status,decimal
AA.BB,"various, tests",MM,test again,2021,1234567,,,0
XX.YY,"testing, cases",DD,coverage test,2020,3456223,,,0
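
For the CATALOG sample in the question, where several record types (PLANT, FOOD, BOOK) sit under one root, one possible approach is to group records by tag name and build a separate table per type. The following is a rough sketch under that assumption, using a shortened copy of the sample; it is not taken from either answer.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Shortened copy of the CATALOG sample from the question.
CATALOG_XML = """<CATALOG>
<PLANT><COMMON>rose</COMMON><ZONE>4</ZONE><PRICE>202</PRICE></PLANT>
<PLANT><COMMON>mango</COMMON><ZONE>3</ZONE><PRICE>301</PRICE></PLANT>
<FOOD><NAME>Dhosa</NAME><PRICE>350</PRICE><CALORIES>80</CALORIES></FOOD>
<BOOK><AUTHOR>Anci C</AUTHOR><TITLE>Dr. Ruby</TITLE><PRICE>350</PRICE></BOOK>
</CATALOG>"""

root = ET.fromstring(CATALOG_XML)

# One list of rows per record type, keyed by the record's tag name.
tables = defaultdict(list)
for record in root:
    tables[record.tag].append(
        {child.tag: (child.text or "").strip() for child in record}
    )

for tag, rows in tables.items():
    print(tag, rows)
```

Each list of rows could then be written on its own, for example with pd.DataFrame(rows).to_csv(f"{tag}.csv", index=False), giving one CSV file per record type.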

Transform the CSV to XML in python

I have a scenario where data is extracted from Oracle as CSV and must then be transformed into the desired XML format.
Input CSV File:
Id,SubID,Rank,Size
1,123,1,0.1
1,234,2,0.2
2,456,1,0.1
2,123,2,0.2
Expected XML output:
<AA_ITEMS>
<Id ID="1">
<SubId ID="123">
<Rank>1</Rank>
<Size>0.1</Size>
</SubId>
<SubId ID="234">
<Rank>2</Rank>
<Size>0.2</Size>
</SubId>
</Id>
<Id ID="2">
<SubId ID="456">
<Rank>1</Rank>
<Size>0.1</Size>
</SubId>
<SubId ID="123">
<Rank>2</Rank>
<Size>0.2</Size>
</SubId>
</Id>
</AA_ITEMS>
Note: The CSV file is a daily load and contains around 150K to 200K records
Please assist. Thanks in advance

There are a couple of ways to approach it and though some people dislike building xml from a template, I believe it works best:
from itertools import groupby
from lxml import etree
csv_string = """[your csv above]
"""
# first deal with the csv:
# split it into lines and discard the header
lines = csv_string.splitlines()[1:]
# group consecutive lines by the Id field (the first column);
# using the whole field keeps multi-digit Ids intact
grpfunc = lambda x: x.split(',')[0]
grps = [list(group) for key, group in groupby(lines, grpfunc)]
# now convert the whole thing into xml:
xml_string = """
<AA_ITEMS>
"""
for grp in grps:
    elem = f' <Id ID="{grpfunc(grp[0])}">'
    for g in grp:
        entry = g.split(',')
        # create an entry template:
        id_tmpl = f"""
<SubId ID="{entry[1]}">
<Rank>{entry[2]}</Rank>
<Size>{entry[3]}</Size>
</SubId>
"""
        elem += id_tmpl
    # close elem
    elem += """</Id>
"""
    xml_string += elem
# close the xml string
xml_string += """</AA_ITEMS>"""
# finally, show that the output is well formed xml:
print(etree.tostring(etree.fromstring(xml_string)).decode())
The output should be your expected xml.
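
For comparison, here is a sketch that builds the same structure with csv.DictReader and xml.etree.ElementTree instead of string templates. This swaps in a different technique from the answer above: the element tree handles escaping of special characters, and the rows are streamed rather than held as one large string. It assumes the CSV is already sorted by Id, since groupby only merges adjacent rows.

```python
import csv
import io
import xml.etree.ElementTree as ET
from itertools import groupby

# Sample input copied from the question.
CSV_TEXT = """Id,SubID,Rank,Size
1,123,1,0.1
1,234,2,0.2
2,456,1,0.1
2,123,2,0.2
"""

reader = csv.DictReader(io.StringIO(CSV_TEXT))
root = ET.Element("AA_ITEMS")

# groupby only merges adjacent rows, so the input is assumed to be sorted by Id.
for id_value, group in groupby(reader, key=lambda row: row["Id"]):
    id_elem = ET.SubElement(root, "Id", ID=id_value)
    for row in group:
        sub_elem = ET.SubElement(id_elem, "SubId", ID=row["SubID"])
        ET.SubElement(sub_elem, "Rank").text = row["Rank"]
        ET.SubElement(sub_elem, "Size").text = row["Size"]

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

For the daily file, io.StringIO(CSV_TEXT) would be replaced by an open file handle, and ET.ElementTree(root).write(path) could be used to save the result; the output here is unindented.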

Export information from child nodes in xml using Python

I have an xml file called persons.xml in the following format:
<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person id="1" name="John">
<city id="21" name="New York"/>
</person>
<person id="2" name="Mary">
<city id="22" name="Los Angeles"/>
</person>
</persons>
I want to export the list of person names along with the city names to a file:
import pandas as pd
import xml.etree.ElementTree as ET
tree = ET.parse('./persons.xml')
root = tree.getroot()
df_cols = ["person_name", "city_name"]
rows = []
for node in root:
    person_name = node.attrib.get("name")
    rows.append({"person_name": person_name})
out_df = pd.DataFrame(rows, columns = df_cols)
out_df
Obviously this part of the code will only work for obtaining the name as it’s part of the root, but I can’t figure out how to loop through the child nodes too and obtain this info. Do I need to append something to root to iterate over the child nodes?
I can obtain everything using root.getchildren but it doesn’t allow me to return only the child nodes:
children = root.getchildren()
for child in children:
    ElementTree.dump(child)
Is there a good way to get this information?

See below
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person id="1" name="John">
<city id="21" name="New York" />
</person>
<person id="2" name="Mary">
<city id="22" name="Los Angeles" />
</person>
</persons>'''
root = ET.fromstring(xml)
data = []
for p in root.findall('.//person'):
    data.append({'person': p.attrib['name'], 'city': p.find('city').attrib['name']})
df = pd.DataFrame(data)
print(df)
output
   person         city
0    John     New York
1    Mary  Los Angeles
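
The answer above assumes every person has a city child; p.find('city') returns None otherwise, and the attribute lookup would fail. A defensive variant might look like the sketch below. The third person (Alex, with no city) is a hypothetical record added only to exercise that case.

```python
import xml.etree.ElementTree as ET

# Sample from the question plus one hypothetical person without a <city> child.
PERSONS_XML = """<persons>
<person id="1" name="John"><city id="21" name="New York"/></person>
<person id="2" name="Mary"><city id="22" name="Los Angeles"/></person>
<person id="3" name="Alex"/>
</persons>"""

root = ET.fromstring(PERSONS_XML)
rows = []
for person in root.iter("person"):
    city = person.find("city")  # None when the person has no <city> child
    rows.append({
        "person_name": person.get("name"),
        "city_name": city.get("name") if city is not None else None,
    })
print(rows)
```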

Parsing KML file using pyKML

I'm learning how to parse KML files in Python using the pyKML module. The specific file I'm using can be found here and I've also added it at the bottom of this post. I have saved the file on my computer and named it test.kml.
After some research, I managed to extract a specific portion of the test.kml file and save the result to a DataFrame. Here's my code:
from pykml import parser
import pandas as pd
filename = 'test.kml'
with open(filename) as fobj:
    folder = parser.parse(fobj).getroot().Document
plnm = []
for pm in folder.Placemark:
    plnm1 = pm.name
    plnm.append(plnm1.text)
df = pd.DataFrame()
df['name'] = plnm
print(df)
          name
0   Club house
1  By the lake
I would like to add a new column to my DataFrame corresponding to the value of the "holeNumber". I have tried to add the following lines in my for loop but without success.
for pm in folder.Placemark:
    plnm1 = pm.name
    val1 = pm.ExtendedData.holeNumber.value
    plnm.append(plnm1.text)
    val.append(val1.text)
I'm not sure how to access the value from that specific node. The resulting DataFrame I'm looking for is the following:
| name | holeNumber |
|-------------|------------|
| Club house | 1 |
| By the lake | 5 |
Any help would be appreciated.
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<name>My Golf Course Example</name>
<Placemark>
<name>Club house</name>
<ExtendedData>
<Data name="holeNumber">
<value>1</value>
</Data>
<Data name="holeYardage">
<value>234</value>
</Data>
<Data name="holePar">
<value>4</value>
</Data>
</ExtendedData>
<Point>
<coordinates>-111.956,33.5043</coordinates>
</Point>
</Placemark>
<Placemark>
<name>By the lake</name>
<ExtendedData>
<Data name="holeNumber">
<value>5</value>
</Data>
<Data name="holeYardage">
<value>523</value>
</Data>
<Data name="holePar">
<value>5</value>
</Data>
</ExtendedData>
<Point>
<coordinates>-111.95,33.5024</coordinates>
</Point>
</Placemark>
</Document>
</kml>

Here's a quick way to parse the KML.
plnm = []
holeNumber = []
for pm in folder.Placemark:
    plnm1 = pm.name
    # Data[0] is the first <Data> element, which is holeNumber in this file
    val1 = pm.ExtendedData.Data[0].value
    plnm.append(plnm1.text)
    holeNumber.append(val1.text)
df = pd.DataFrame()
df['name'] = plnm
df['holeNumber'] = holeNumber
print(df)
Or
# collect rows first; DataFrame.append was removed in pandas 2.0
rows = []
for pm in folder.Placemark:
    name = pm.name.text
    value = pm.ExtendedData.Data[0].value.text
    rows.append({'name': name, 'holeNumber': value})
df = pd.DataFrame(rows, columns=['name', 'holeNumber'])
print(df)
Output:
          name  holeNumber
0   Club house           1
1  By the lake           5
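
Data[0] depends on holeNumber being the first Data element in every Placemark. As an alternative sketch, the element can be selected by its name attribute. This example swaps pyKML for the standard library's xml.etree.ElementTree so that it runs without extra packages. The KML is shortened from the question, and the Data elements of the first Placemark are deliberately reordered to show that position no longer matters.

```python
import xml.etree.ElementTree as ET

# Shortened from the question; Data order in the first Placemark is swapped on purpose.
KML_TEXT = """<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Placemark>
<name>Club house</name>
<ExtendedData>
<Data name="holeYardage"><value>234</value></Data>
<Data name="holeNumber"><value>1</value></Data>
</ExtendedData>
</Placemark>
<Placemark>
<name>By the lake</name>
<ExtendedData>
<Data name="holeNumber"><value>5</value></Data>
<Data name="holeYardage"><value>523</value></Data>
</ExtendedData>
</Placemark>
</Document>
</kml>"""

NS = {"kml": "http://www.opengis.net/kml/2.2"}
root = ET.fromstring(KML_TEXT)

rows = []
for pm in root.iterfind(".//kml:Placemark", NS):
    rows.append({
        "name": pm.findtext("kml:name", namespaces=NS),
        # Select the <Data> element by its name attribute, not by position.
        "holeNumber": pm.findtext(
            ".//kml:Data[@name='holeNumber']/kml:value", namespaces=NS
        ),
    })
print(rows)
```

Passing rows to pd.DataFrame(rows) should produce the same two-column dataframe as in the output above.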
