Extract variables from XML to Pandas - python

I am working on parsing variables from XML into a pandas dataframe. The XML file looks like this (simplified for the demo):
<Instrm>
<Rcrd>
<FinPpt>
<Id>BT0007YSAWK</Id>
<FullNm>Turbo Car</FullNm>
<Ccy>EUR</Ccy>
<Cmmdty>false</Cmmdty>
</FinPpt>
<Issr>529900M2F7D5795H1A49</Issr>
<Attrbts>
<Authrty>US</Authrty>
<Prd>
<Dt>2002-03-20</Dt>
</Prd>
<Ven>NYSE</Ven>
</Attrbts >
</Rcrd>
</Instrm>
<Instrm>
<Rcrd>
<FinPpt>
<Id>BX0009YNOYK</Id>
<FullNm>Turbo truk</FullNm>
<Ccy>EUR</Ccy>
<Cmmdty>false</Cmmdty>
</FinPpt>
<Issr>58888M2F7D579536J4</Issr>
<Attrbts>
<Authrty>UK</Authrty>
<Prd>
<Dt>2002-04-21</Dt>
</Prd>
<Ven>BOX</Ven>
</Attrbts >
</Rcrd>
</Instrm>
...
I attempted to parse this XML file into a dataframe with the element names as the column names, like this:
Id FullNm Ccy Cmmdty Issr Authrty Dt Ven
BT0007YSAWK Turbo Car EUR false 529900M2F7D5795H1A49 US 2002-03-20 NYSE
BX0009YNOYK Turbo truk EUR false 58888M2F7D579536J4 UK 2002-04-21 BOX
..... ......
but I still don't know how, even after reviewing some posts. All I can do is extract the IDs into a list, like this:
import xml.etree.ElementTree as ET
import pandas as pd
import sys

tree = ET.parse('sample.xml')
root = tree.getroot()
report = root[1][0][0]
records = report.findall('Instrm')

ids = []
for r in records:
    ids.append(r[0][0][0].text)
print(ids[0:100])
Output:
['BT0007YSAWK', 'BX0009YNOYK', ...]
I don't quite understand how to utilize 'nodes' here. Can someone help? Thank you.

Assuming the posted XML is wrapped in a <root> node and has no namespaces, consider building one dictionary per record via a list/dict comprehension, merging sub-dictionaries (dict unpacking with ** is available in Python 3.5+) that parse the needed nodes. Then call the DataFrame() constructor on the returned list of dictionaries.
data = [{**{el.tag: el.text.strip() for el in r.findall('FinPpt/*')},
         **{el.tag: el.text.strip() for el in r.findall('Issr')},
         **{el.tag: el.text.strip() for el in r.findall('Attrbts/*')},
         **{el.tag: el.text.strip() for el in r.findall('Attrbts/Prd/*')}
        } for r in root.findall('Instrm/Rcrd')]
df = pd.DataFrame(data)
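If your actual file really is a bare sequence of <Instrm> elements with no single enclosing root, one way to make this work is to wrap the raw text in a dummy root before parsing. A minimal sketch, assuming the whole file fits in memory, has no XML declaration at the top, and reusing 'sample.xml' from the question:

import xml.etree.ElementTree as ET

# wrap the fragment in a dummy <root> so the parser accepts it;
# 'sample.xml' is the file name from the question - adjust as needed
with open('sample.xml', encoding='utf-8') as f:
    root = ET.fromstring('<root>' + f.read() + '</root>')

# root.findall('Instrm/Rcrd') then works exactly as in the comprehension above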

To get your target data without converting it, use an XML parser (like lxml) and XPath.
Something along these lines (note that you have to wrap your XML with a root element):
string = """
<doc>
[your xml above]
</doc>
"""
from lxml import etree
doc = etree.XML(string)
insts = doc.xpath('//Instrm')
for inst in insts:
    # use './/' so the lookups stay relative to the current Instrm element
    f_nams = inst.xpath('.//FullNm')
    ccys = inst.xpath('.//Ccy')
    cmds = inst.xpath('.//Cmmdty')
    issuers = inst.xpath('.//Issr')
    for a, b, c, d in zip(f_nams, ccys, cmds, issuers):
        print(a.text, b.text, c.text, d.text)
Output:
Turbo Car EUR false 529900M2F7D5795H1A49
Turbo truk EUR false 58888M2F7D579536J4
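If you also want those rows in a pandas DataFrame rather than just printed, here is a minimal sketch building on the doc object above (it assumes every field occurs exactly once per Instrm, as in the sample):

import pandas as pd

rows = []
for inst in doc.xpath('//Instrm'):
    # one dictionary per Instrm; the [0] assumes each field occurs exactly once
    rows.append({field: inst.xpath(f'.//{field}/text()')[0]
                 for field in ('Id', 'FullNm', 'Ccy', 'Cmmdty',
                               'Issr', 'Authrty', 'Dt', 'Ven')})

df = pd.DataFrame(rows)
print(df)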

Related

Parsing subfields in XML and merging with matching columns

This is a follow-up question from here. It got lost due to the high amount of other topics on this forum. Maybe I presented the question in too complicated a way. Since then I have improved and simplified the approach.
To sum up: I'd like to extract data from subfields in multiple XML files and attach those to a new df at matching positions.
This is a sample XML-1:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXmlPrenos>
<Qfl>1808</Qfl>
<fOVE>13.7</fOVE>
<NetoVolumen>613</NetoVolumen>
<Hv>104.2</Hv>
<energenti>
<energent>
<sifra>energy_e</sifra>
<naziv>EE [kWh]</naziv>
<vrednost>238981</vrednost>
</energent>
<energent>
<sifra>energy_to</sifra>
<naziv>Do</naziv>
<vrednost>16359</vrednost>
</energent>
<rei>
<zavetrovanost>2</zavetrovanost>
<cone>
<cona>
<cona_id>1</cona_id>
<cc_si_cona>1110000</cc_si_cona>
<visina_cone>2.7</visina_cone>
<dolzina_cone>14</dolzina_cone>
</cona>
<cona>
<cona_id>2</cona_id>
<cc_si_cona>120000</cc_si_cona>
</cona>
</rei>
</reiXmlPrenos>
This is a sample XML-2:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXmlPrenos>
<Qfl>1808</Qfl>
<fOVE>13.7</fOVE>
<NetoVolumen>613</NetoVolumen>
<Hv>104.2</Hv>
<energenti>
<energent>
<sifra>energy_e</sifra>
<naziv>EE [kWh]</naziv>
<vrednost>424242</vrednost>
</energent>
<energent>
<sifra>energy_en</sifra>
<naziv>Do</naziv>
<vrednost>29</vrednost>
</energent>
<rei>
<zavetrovanost>2</zavetrovanost>
<cone>
<cona>
<cona_id>1</cona_id>
<cc_si_cona>1110000</cc_si_cona>
<visina_cone>2.7</visina_cone>
<dolzina_cone>14</dolzina_cone>
</cona>
<cona>
<cona_id>2</cona_id>
<cc_si_cona>120000</cc_si_cona>
</cona>
</rei>
</reiXmlPrenos>
My code:
import xml.etree.ElementTree as ETree
import pandas as pd

xmldata = r"C:\...\S1.xml"
prstree = ETree.parse(xmldata)
root = prstree.getroot()
# print(root)

store_items = []
all_items = []
for storeno in root.iter('energent'):
    cona_sifra = storeno.find('sifra').text
    cona_vrednost = storeno.find('vrednost').text
    store_items = [cona_sifra, cona_vrednost]
    all_items.append(store_items)

xmlToDf = pd.DataFrame(all_items, columns=['sifra', 'vrednost'])
print(xmlToDf.to_string(index=False))
This results in:
sifra vrednost
energy_e 238981
energy_to 16359
Which is fine for one example. But I have 1,000s of XML files, and the wish is to 1) have all results in one row per XML file and 2) differentiate between the different 'sifra' codes.
There can be e.g. energy_e, energy_en, energy_to.
So ideally the final df would look like this:
xml energy_e energy_en energy_to
xml-1 238981 0 16359
xml-2 424242 29 0
Can it be done?
Simply use pandas.read_xml since the part of the XML you need is a flat part of the document:
energy_df = pd.read_xml("Input.xml", xpath=".//energent") # IF lxml INSTALLED
energy_df = pd.read_xml("Input.xml", xpath=".//energent", parser="etree") # IF lxml NOT INSTALLED
And to bind across many XML files, simply build a list of data frames from a list of XML file paths, adding a column for source file, and then run pandas.concat to row bind all into a single data frame:
xml_files = [...]

energy_dfs = [
    pd.read_xml(f, xpath=".//energent", parser="etree").assign(source=f)
    for f in xml_files
]

energy_long_df = pd.concat(energy_dfs, ignore_index=True)
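One way to build that xml_files list is with the standard-library glob module (a sketch; the folder path here is a placeholder):

from glob import glob

# placeholder directory; point the pattern at wherever the XML files actually live
xml_files = glob(r"C:\path\to\xml_folder\*.xml")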
And to get your desired output, you can then pivot the sifra values into columns with pivot_table:
energy_wide_df = energy_long_df.pivot_table(
    values="vrednost", index="source", columns="sifra", aggfunc="sum"
)
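Since a sifra code that never appears in a given file comes out as NaN after the pivot, you may want to fill those with zeros and move the file name back into a regular column to match the desired layout. A small follow-up sketch:

energy_wide_df = (
    energy_wide_df
    .fillna(0)                   # codes missing in a file become 0, as in the target table
    .reset_index()               # turn the 'source' index back into a regular column
    .rename_axis(None, axis=1)   # drop the leftover 'sifra' column-axis label
)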
If I understand the situation correctly, this can be done - but because of the complexity, I would use lxml here instead of ElementTree.
I'll try to annotate the code a bit, but you'll really have to read up on this.
By the way, the two XML files you posted are not well formed (the closing tags for <energenti> and <cone> are missing), but assuming that is fixed - try this:
from lxml import etree
import pandas as pd

xmls = [xml_1, xml_2]
# note: for simplicity, xml_1 and xml_2 hold the well-formed versions of the XML strings
# in your question; you'll have to use actual file names and paths

energies = ["xml", "energy_e", "energy_en", "energy_to", "whatever"]
# I just made up some names - you'll have to use actual names, of course;
# the first one is for the file identifier - see below

rows = []
for xml in xmls:
    row = []
    id = "xml-" + str(xmls.index(xml) + 1)
    # this creates the file identifier
    row.append(id)
    root = etree.XML(xml.encode())
    # in real life, you'll have to use the parse() method
    for energy in energies[1:]:
        # the '[1:]' skips the first entry; it's only used as the file identifier
        target = root.xpath(f'//energent[./sifra[.="{energy}"]]/vrednost/text()')
        # note the use of f-strings
        row.extend(target if len(target) > 0 else "0")
    rows.append(row)

print(pd.DataFrame(rows, columns=energies))
Output:
xml energy_e energy_en energy_to whatever
0 xml-1 238981 0 16359 0
1 xml-2 424242 29 0 0
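For the real files on disk, the same loop can be driven by etree.parse() instead of etree.XML(). A sketch under the assumption that the files sit in one folder (the folder path is a placeholder, and the file stem is used as the identifier instead of the made-up "xml-1"/"xml-2"):

from pathlib import Path
from lxml import etree
import pandas as pd

energies = ["xml", "energy_e", "energy_en", "energy_to"]
rows = []
for path in sorted(Path(r"C:\path\to\xml_folder").glob("*.xml")):
    row = [path.stem]                       # use the file name as the identifier
    root = etree.parse(str(path)).getroot()
    for energy in energies[1:]:
        target = root.xpath(f'//energent[./sifra[.="{energy}"]]/vrednost/text()')
        row.extend(target if len(target) > 0 else "0")
    rows.append(row)

print(pd.DataFrame(rows, columns=energies))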

Get children elements of multiple instances of the same name tag using ElementTree

I have an xml file looking like this:
<?xml version="1.0" encoding="UTF-8"?>
<data>
<boundary_conditions>
<rot>
<rot_instance>
<name>BC_1</name>
<rpm>200</rpm>
<parts>
<name>rim_FL</name>
<name>tire_FL</name>
<name>disk_FL</name>
<name>center_FL</name>
</parts>
</rot_instance>
<rot_instance>
<name>BC_2</name>
<rpm>100</rpm>
<parts>
<name>tire_FR</name>
<name>disk_FR</name>
</parts>
</rot_instance>
</rot>
</boundary_conditions>
</data>
I actually know how to extract data corresponding to each instance. So I can do this for the names tag as follows:
import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()

names = tree.findall('.//boundary_conditions/rot/rot_instance/name')
for val in names:
    print(val.text)
which gives me:
BC_1
BC_2
But if I do the same thing for the parts tag:
names = tree.findall('.//boundary_conditions/rot/rot_instance/parts/name')
for val in names:
    print(val.text)
It will give me:
rim_FL
tire_FL
disk_FL
center_FL
tire_FR
disk_FR
which lumps together all the data under parts/name. I want output that gives me the 'parts' sub-elements of each instance as separate lists. So this is what I want to get:
instance_BC_1 = ['rim_FL', 'tire_FL', 'disk_FL', 'center_FL']
instance_BC_2 = ['tire_FR', 'disk_FR']
Any help is appreciated,
Thanks.
You've got to first find all parts elements, then from each parts element find all name tags.
Take a look:
parts = tree.findall('.//boundary_conditions/rot/rot_instance/parts')
for part in parts:
    for val in part.findall("name"):
        print(val.text)
    print()

instance_BC_1 = [val.text for val in parts[0].findall("name")]
instance_BC_2 = [val.text for val in parts[1].findall("name")]
print(instance_BC_1)
print(instance_BC_2)
Output:
rim_FL
tire_FL
disk_FL
center_FL
tire_FR
disk_FR
['rim_FL', 'tire_FL', 'disk_FL', 'center_FL']
['tire_FR', 'disk_FR']
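To avoid hard-coding parts[0] and parts[1], you could also key a dictionary by each instance's name. A small sketch using the same tree:

instances = {}
for inst in tree.findall('.//boundary_conditions/rot/rot_instance'):
    bc_name = inst.find('name').text            # e.g. 'BC_1'
    instances[bc_name] = [n.text for n in inst.findall('parts/name')]

print(instances['BC_1'])   # ['rim_FL', 'tire_FL', 'disk_FL', 'center_FL']
print(instances['BC_2'])   # ['tire_FR', 'disk_FR']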

Xpath for ElementTree Reference to XML with Namespace for UK Statute Metadata

New to Python, but I am trying to access the metadata for a UK statute, e.g. https://www.legislation.gov.uk/ukpga/2018/12/part/3/chapter/4/data.xml - Chapter 4 of Part 3 of the UK Data Protection Act.
The problem is that there are two namespaces involved - the UK legislation ukm: and the Dublin Core dc:
<Legislation xmlns="http://www.legislation.gov.uk/namespaces/legislation" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" DocumentURI="http://www.legislation.gov.uk/ukpga/2018/12" IdURI="http://www.legislation.gov.uk/id/ukpga/2018/12" NumberOfProvisions="1103" xsi:schemaLocation="http://www.legislation.gov.uk/namespaces/legislation http://www.legislation.gov.uk/schema/legislation.xsd" SchemaVersion="1.0" RestrictExtent="E+W+S+N.I." RestrictStartDate="2020-02-14">
<ukm:Metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dct="http://purl.org/dc/terms/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:ukm="http://www.legislation.gov.uk/namespaces/metadata">
<dc:identifier>
...
How do I access the Legislation/ukm:Metadata element?
I've tried, unsuccessfully, using:
statute_xml_tree = ET.parse(statute_xmi_doc)
statute_root = statute_xml_tree.getroot()
statute_metadata = statute_root.findall("{http://www.legislation.gov.uk/namespaces/metadata}Metadata")
along the lines of
#All dublin-core "title" tags in the document
root.findall(".//{http://purl.org/dc/elements/1.1/}title")
from https://docs.python.org/3/library/xml.etree.elementtree.html#elementtree-xpath
The answer seems to be that a namespace prefix is required:
dcmi_title = statute_root.find(mm_ns + "Metadata/" + dc_ns + "title").text
but the addition of namespaces produces lengthy XPath strings ...
An alternative approach could be to use lxml along with its xpath() method:
from lxml import etree as et
root = et.parse('ukpga-2018-12-part-3-chapter-4.xml')
title = root.xpath(".//*[local-name()='Metadata']/*[local-name()='title']/text()")
Here the notation *[local-name()='Metadata'] means "every child having local name 'Metadata'", which gives you the ability to ignore namespaces.
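If you would rather stay with ElementTree, find()/findall() also accept a namespaces mapping, which keeps the path strings short without spelling out the {uri} form everywhere. A sketch against the same file (the prefix names in the dict are arbitrary; it looks up the dc:identifier child shown in the posted snippet):

import xml.etree.ElementTree as ET

# only the URIs have to match the document; the prefixes are your choice
ns = {
    'ukm': 'http://www.legislation.gov.uk/namespaces/metadata',
    'dc': 'http://purl.org/dc/elements/1.1/',
}

statute_root = ET.parse('ukpga-2018-12-part-3-chapter-4.xml').getroot()

metadata = statute_root.find('ukm:Metadata', ns)     # the Legislation/ukm:Metadata element
identifier = metadata.find('dc:identifier', ns)      # e.g. the dc:identifier child shown above
print(identifier.text)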

Extract data from ORCID XML files using Python

I am trying to (offline) parse names from ORCID XML files using Python. The downloaded files look like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<record:record xmlns:internal="http://www.orcid.org/ns/internal" xmlns:address="http://www.orcid.org/ns/address" xmlns:email="http://www.orcid.org/ns/email" xmlns:history="http://www.orcid.org/ns/history" xmlns:employment="http://www.orcid.org/ns/employment" xmlns:person="http://www.orcid.org/ns/person" xmlns:education="http://www.orcid.org/ns/education" xmlns:other-name="http://www.orcid.org/ns/other-name" xmlns:personal-details="http://www.orcid.org/ns/personal-details" xmlns:bulk="http://www.orcid.org/ns/bulk" xmlns:common="http://www.orcid.org/ns/common" xmlns:record="http://www.orcid.org/ns/record" xmlns:keyword="http://www.orcid.org/ns/keyword" xmlns:activities="http://www.orcid.org/ns/activities" xmlns:deprecated="http://www.orcid.org/ns/deprecated" xmlns:external-identifier="http://www.orcid.org/ns/external-identifier" xmlns:funding="http://www.orcid.org/ns/funding" xmlns:error="http://www.orcid.org/ns/error" xmlns:preferences="http://www.orcid.org/ns/preferences" xmlns:work="http://www.orcid.org/ns/work" xmlns:researcher-url="http://www.orcid.org/ns/researcher-url" xmlns:peer-review="http://www.orcid.org/ns/peer-review" path="/0000-0001-5006-8001">
<common:orcid-identifier>
<common:uri>http://orcid.org/0000-0001-5006-8001</common:uri>
<common:path>0000-0001-5006-8001</common:path>
<common:host>orcid.org</common:host>
</common:orcid-identifier>
<preferences:preferences>
<preferences:locale>en</preferences:locale>
</preferences:preferences>
<person:person path="/0000-0001-5006-8001/person">
<common:last-modified-date>2016-06-06T15:29:36.952Z</common:last-modified-date>
<person:name visibility="public" path="0000-0001-5006-8001">
<common:created-date>2016-04-15T20:45:16.141Z</common:created-date>
<common:last-modified-date>2016-04-15T20:45:16.141Z</common:last-modified-date>
<personal-details:given-names>Marjorie</personal-details:given-names>
<personal-details:family-name>Biffi</personal-details:family-name>
</person:name>
What I want is to extract given-names and family-name: Marjorie Biffi. I am trying to use this code:
>>> import xml.etree.ElementTree as ET
>>> root = ET.parse('f.xml').getroot()
>>> p=root.findall('{http://www.orcid.org/ns/personal-details}personal-details')
>>> p
[]
I can't figure out how to extract the given name/surname from this XML file. I have also tried to use XPath/selectors, but with no success.
This will get you the results you want, but by climbing down through each level.
p1 = root.find('{http://www.orcid.org/ns/person}person')
name = p1.find('{http://www.orcid.org/ns/person}name')
given_names = name.find('{http://www.orcid.org/ns/personal-details}given-names')
family_name = name.find('{http://www.orcid.org/ns/personal-details}family-name')
print(given_names.text, '', family_name.text)
You could also just go directly to that sublevel with .//
family_name = root.find('.//{http://www.orcid.org/ns/personal-details}family-name')
Also, I just posted here about simpler ways to parse XML if you're doing more basic operations. These include xmltodict (converting to an OrderedDict) or untangle, which is a little inefficient but very quick and easy to learn.
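For completeness, ElementTree's find()/findall() can also take a namespaces mapping, which avoids spelling out the full {uri} prefix each time. A minimal sketch against the posted file ('f.xml', as in the question):

import xml.etree.ElementTree as ET

# the prefix names in this mapping are arbitrary; only the URIs have to match the document
ns = {
    'person': 'http://www.orcid.org/ns/person',
    'details': 'http://www.orcid.org/ns/personal-details',
}

root = ET.parse('f.xml').getroot()
name = root.find('.//person:name', ns)
print(name.find('details:given-names', ns).text,
      name.find('details:family-name', ns).text)   # Marjorie Biffi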

Extract an XML file (TED Europe) to a pandas dataframe

I have an XML file from TED Europe (like this: TED Europa XML Files (login required)). The XML files contain public procurement contracts.
My question now is how I can parse the XML file into a pandas dataframe.
So far I have tried to achieve this using the ElementTree package.
However, since I am still a beginner, I have trouble extracting the information, because the relevant text is only marked with "P" tags.
How can I extract this information for the English translation so that, for example, "TI_MARK" becomes the column header and the information within "TXT_MARK" and its "P" tags becomes the row? The other rows would later be filled with information from other public procurement XML files.
<FORM_SECTION>
<OTH_NOT LG="DA" VERSION="R2.0.8.S03.E01" CATEGORY="TRANSLATION">
<OTH_NOT LG="DE" VERSION="R2.0.8.S03.E01" CATEGORY="TRANSLATION">
<OTH_NOT LG="EN" VERSION="R2.0.8.S03.E01" CATEGORY="ORIGINAL">
<FD_OTH_NOT>
<TI_DOC>
<P>BE-Brussels: IPA - Improved implementation of animal health, food safety and phytosanitary legislation and corresponding information systems</P>
</TI_DOC>
<STI_DOC>
<P>Location — The former Yugoslav Republic of Macedonia</P>
</STI_DOC>
<STI_DOC>
<P>SERVICE CONTRACT NOTICE</P>
</STI_DOC>
<CONTENTS>
<GR_SEQ>
<TI_GRSEQ>
<BLK_BTX/>
</TI_GRSEQ>
<BLK_BTX_SEQ>
<MARK_LIST>
<MLI_OCCUR NO_SEQ="001">
<NO_MARK>1.</NO_MARK>
<TI_MARK>Publication reference</TI_MARK>
<TXT_MARK>
<P>EuropeAid/139253/DH/SER/MK</P>
</TXT_MARK>
</MLI_OCCUR>
<MLI_OCCUR NO_SEQ="002">
<NO_MARK>2.</NO_MARK>
<TI_MARK>Procedure</TI_MARK>
<TXT_MARK>
<P>Restricted</P>
</TXT_MARK>
</MLI_OCCUR>
So far my code is:
import xml.etree.cElementTree as ET

tree = ET.parse('196658_2018.xml')
# Print tree
print(tree)
# tree = ET.ElementTree(file='196658_2018.xml')

root = tree.getroot()
# Print root
print(root)

for element in root.findall('{ted/R2.0.8.S03/publication}FORM_SECTION/'
                            '{ted/R2.0.8.S03/publication}OTH_NOT/'
                            '{ted/R2.0.8.S03/publication}FD_OTH_NOT/'
                            '{ted/R2.0.8.S03/publication}TI_DOC/'
                            '{ted/R2.0.8.S03/publication}P'):
    print(element.text)
Strangely, the extraction only works if I add {ted/R2.0.8.S03/publication} to each path element.
Moving on from that, I have problems writing a function which collects all the paths with their information and appends them to a pandas dataframe. Ideally, only the English translation should be extracted.
For another part of the XML File I used a function like this:
from lxml import etree
import pandas as pd
import xml.etree.ElementTree as ET

def parse_xml_fields(file, base_tag, tag_list, final_list):
    root = etree.parse(file)
    nodes = root.findall("//{}".format(base_tag))
    for node in nodes:
        item = {}
        for tag in tag_list:
            if node.find(".//{}".format(tag)) is not None:
                item[tag] = node.find(".//{}".format(tag)).text.strip()
        final_list.append(item)

# My variables
field_list = ["{ted/R2.0.8.S03/publication}TI_CY",
              "{ted/R2.0.8.S03/publication}TI_TOWN",
              "{ted/R2.0.8.S03/publication}TI_TEXT"]
entities_list = []

parse_xml_fields("196658_2018.xml", "{ted/R2.0.8.S03/publication}ML_TI_DOC",
                 field_list, entities_list)

df = pd.DataFrame(entities_list, columns=field_list)
print(df)

# better column names
df.columns = ['Country', 'Town', 'Text']
df.to_csv("TED_Europa_List.csv", sep=',', encoding='utf-8')
The paths and tags, however, are much easier to work with for that section, because the tags are already named after their content.
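A possible sketch for the original goal (TI_MARK text as column headers, the TXT_MARK text as the row values), building on the {ted/R2.0.8.S03/publication} prefix that already works in the snippets above - the prefix is needed because the document declares a default namespace. The LG='EN' filter and the joining of multiple P paragraphs are my assumptions based on the posted fragment:

import pandas as pd
from lxml import etree

NS = '{ted/R2.0.8.S03/publication}'   # namespace prefix taken from the working paths above


def parse_marks(file):
    """Collect {TI_MARK text: TXT_MARK text} for the English (LG='EN') notice of one file."""
    root = etree.parse(file)
    row = {}
    for oth in root.findall(f".//{NS}OTH_NOT[@LG='EN']"):
        for occur in oth.findall(f".//{NS}MLI_OCCUR"):
            title = occur.find(f"{NS}TI_MARK")
            paragraphs = occur.findall(f"{NS}TXT_MARK/{NS}P")
            if title is not None:
                row[title.text] = ' '.join(p.text.strip() for p in paragraphs if p.text)
    return row


# one row per file; later files can be appended to the list before building the DataFrame
df = pd.DataFrame([parse_marks('196658_2018.xml')])
print(df)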
