Extract variables from XML to Pandas - python

I am working on parsing variables from XML into a pandas dataframe. The XML file looks like this (simplified for the demo):
<Instrm>
<Rcrd>
<FinPpt>
<Id>BT0007YSAWK</Id>
<FullNm>Turbo Car</FullNm>
<Ccy>EUR</Ccy>
<Cmmdty>false</Cmmdty>
</FinPpt>
<Issr>529900M2F7D5795H1A49</Issr>
<Attrbts>
<Authrty>US</Authrty>
<Prd>
<Dt>2002-03-20</Dt>
</Prd>
<Ven>NYSE</Ven>
</Attrbts >
</Rcrd>
</Instrm>
<Instrm>
<Rcrd>
<FinPpt>
<Id>BX0009YNOYK</Id>
<FullNm>Turbo truk</FullNm>
<Ccy>EUR</Ccy>
<Cmmdty>false</Cmmdty>
</FinPpt>
<Issr>58888M2F7D579536J4</Issr>
<Attrbts>
<Authrty>UK</Authrty>
<Prd>
<Dt>2002-04-21</Dt>
</Prd>
<Ven>BOX</Ven>
</Attrbts >
</Rcrd>
</Instrm>
...
I attempted to parse this XML file into a dataframe with the element names as the column names, like this:
Id FullNm Ccy Cmmdty Issr Authrty Dt Ven
BT0007YSAWK Turbo Car EUR false 529900M2F7D5795H1A49 US 2002-03-20 NYSE
BX0009YNOYK Turbo truk EUR false 58888M2F7D579536J4 UK 2002-04-21 BOX
..... ......
but I still don't know how, even after reviewing some posts. All I can do is extract the IDs into a list, like this:
import xml.etree.ElementTree as ET
import pandas as pd
import sys

tree = ET.parse('sample.xml')
root = tree.getroot()
report = root[1][0][0]
records = report.findall('Instrm')

ids = []
for r in records:
    ids.append(r[0][0][0].text)
print(ids[0:100])
Output:
['BT0007YSAWK', 'BX0009YNOYK', ...]
I don't quite understand how to utilize 'nodes' here. Can someone help? Thank you.

Assuming the posted XML is wrapped in a <root> node and has no namespaces, consider building one dictionary per record via a list/dict comprehension, merging sub-dictionaries (dict unpacking with ** is available in Python 3.5+) that parse the needed nodes. Then call the DataFrame() constructor on the returned list of dictionaries.
data = [{**{el.tag: el.text.strip() for el in r.findall('FinPpt/*')},
         **{el.tag: el.text.strip() for el in r.findall('Issr')},
         **{el.tag: el.text.strip() for el in r.findall('Attrbts/*')},
         **{el.tag: el.text.strip() for el in r.findall('Attrbts/Prd/*')}
        } for r in root.findall('Instrm/Rcrd')]
df = pd.DataFrame(data)
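If your actual file really is a bare sequence of <Instrm> elements with no single enclosing root, one way to make this work is to wrap the raw text in a dummy root before parsing. A minimal sketch, assuming the whole file fits in memory, has no XML declaration at the top, and reusing 'sample.xml' from the question:

import xml.etree.ElementTree as ET

# wrap the fragment in a dummy <root> so the parser accepts it;
# 'sample.xml' is the file name from the question - adjust as needed
with open('sample.xml', encoding='utf-8') as f:
    root = ET.fromstring('<root>' + f.read() + '</root>')

# root.findall('Instrm/Rcrd') then works exactly as in the comprehension above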

To get your target data without converting it, use an XML parser (like lxml) and XPath.
Something along these lines (note that you have to wrap your XML with a root element):
string = """
<doc>
[your xml above]
</doc>
"""
from lxml import etree
doc = etree.XML(string)
insts = doc.xpath('//Instrm')
for inst in insts:
    # use './/' so the lookups stay relative to the current Instrm element
    f_nams = inst.xpath('.//FullNm')
    ccys = inst.xpath('.//Ccy')
    cmds = inst.xpath('.//Cmmdty')
    issuers = inst.xpath('.//Issr')
    for a, b, c, d in zip(f_nams, ccys, cmds, issuers):
        print(a.text, b.text, c.text, d.text)
Output:
Turbo Car EUR false 529900M2F7D5795H1A49
Turbo truk EUR false 58888M2F7D579536J4
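If you also want those rows in a pandas DataFrame rather than just printed, here is a minimal sketch building on the doc object above (it assumes every field occurs exactly once per Instrm, as in the sample):

import pandas as pd

rows = []
for inst in doc.xpath('//Instrm'):
    # one dictionary per Instrm; the [0] assumes each field occurs exactly once
    rows.append({field: inst.xpath(f'.//{field}/text()')[0]
                 for field in ('Id', 'FullNm', 'Ccy', 'Cmmdty',
                               'Issr', 'Authrty', 'Dt', 'Ven')})

df = pd.DataFrame(rows)
print(df)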

Related

Parsing subfields in XML and merging with matching columns

This is a follow-up question from here. It got lost due to the high amount of other topics on this forum. Maybe I presented the question in too complicated a way. Since then I have improved and simplified the approach.
To sum up: I'd like to extract data from subfields in multiple XML files and attach those to a new df at matching positions.
This is a sample XML-1:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXmlPrenos>
<Qfl>1808</Qfl>
<fOVE>13.7</fOVE>
<NetoVolumen>613</NetoVolumen>
<Hv>104.2</Hv>
<energenti>
<energent>
<sifra>energy_e</sifra>
<naziv>EE [kWh]</naziv>
<vrednost>238981</vrednost>
</energent>
<energent>
<sifra>energy_to</sifra>
<naziv>Do</naziv>
<vrednost>16359</vrednost>
</energent>
<rei>
<zavetrovanost>2</zavetrovanost>
<cone>
<cona>
<cona_id>1</cona_id>
<cc_si_cona>1110000</cc_si_cona>
<visina_cone>2.7</visina_cone>
<dolzina_cone>14</dolzina_cone>
</cona>
<cona>
<cona_id>2</cona_id>
<cc_si_cona>120000</cc_si_cona>
</cona>
</rei>
</reiXmlPrenos>
This is a sample XML-2:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXmlPrenos>
<Qfl>1808</Qfl>
<fOVE>13.7</fOVE>
<NetoVolumen>613</NetoVolumen>
<Hv>104.2</Hv>
<energenti>
<energent>
<sifra>energy_e</sifra>
<naziv>EE [kWh]</naziv>
<vrednost>424242</vrednost>
</energent>
<energent>
<sifra>energy_en</sifra>
<naziv>Do</naziv>
<vrednost>29</vrednost>
</energent>
<rei>
<zavetrovanost>2</zavetrovanost>
<cone>
<cona>
<cona_id>1</cona_id>
<cc_si_cona>1110000</cc_si_cona>
<visina_cone>2.7</visina_cone>
<dolzina_cone>14</dolzina_cone>
</cona>
<cona>
<cona_id>2</cona_id>
<cc_si_cona>120000</cc_si_cona>
</cona>
</rei>
</reiXmlPrenos>
My code:
import xml.etree.ElementTree as ETree
import pandas as pd

xmldata = r"C:\...\S1.xml"
prstree = ETree.parse(xmldata)
root = prstree.getroot()
# print(root)

store_items = []
all_items = []
for storeno in root.iter('energent'):
    cona_sifra = storeno.find('sifra').text
    cona_vrednost = storeno.find('vrednost').text
    store_items = [cona_sifra, cona_vrednost]
    all_items.append(store_items)

xmlToDf = pd.DataFrame(all_items, columns=['sifra', 'vrednost'])
print(xmlToDf.to_string(index=False))
This results in:
sifra vrednost
energy_e 238981
energy_to 16359
Which is fine for one example. But I have 1,000s of XML files, and the wish is to 1) have all results in one row per XML file and 2) differentiate between the different 'sifra' codes.
There can be e.g. energy_e, energy_en, energy_to.
So ideally the final df would look like this:
xml energy_e energy_en energy_to
xml-1 238981 0 16359
xml-2 424242 29 0
Can it be done?
Simply use pandas.read_xml since the part of the XML you need is a flat part of the document:
energy_df = pd.read_xml("Input.xml", xpath=".//energent") # IF lxml INSTALLED
energy_df = pd.read_xml("Input.xml", xpath=".//energent", parser="etree") # IF lxml NOT INSTALLED
And to bind across many XML files, simply build a list of data frames from a list of XML file paths, adding a column for source file, and then run pandas.concat to row bind all into a single data frame:
xml_files = [...]

energy_dfs = [
    pd.read_xml(f, xpath=".//energent", parser="etree").assign(source=f)
    for f in xml_files
]

energy_long_df = pd.concat(energy_dfs, ignore_index=True)
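One way to build that xml_files list is with the standard-library glob module (a sketch; the folder path here is a placeholder):

from glob import glob

# placeholder directory; point the pattern at wherever the XML files actually live
xml_files = glob(r"C:\path\to\xml_folder\*.xml")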
And to get your desired output, you can then pivot the sifra values into columns with pivot_table:
energy_wide_df = energy_long_df.pivot_table(
    values="vrednost", index="source", columns="sifra", aggfunc="sum"
)
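Since a sifra code that never appears in a given file comes out as NaN after the pivot, you may want to fill those with zeros and move the file name back into a regular column to match the desired layout. A small follow-up sketch:

energy_wide_df = (
    energy_wide_df
    .fillna(0)                   # codes missing in a file become 0, as in the target table
    .reset_index()               # turn the 'source' index back into a regular column
    .rename_axis(None, axis=1)   # drop the leftover 'sifra' column-axis label
)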
If I understand the situation correctly, this can be done - but because of the complexity, I would use lxml here instead of ElementTree.
I'll try to annotate the code a bit, but you'll really have to read up on this.
By the way, the two XML files you posted are not well formed (the closing tags for <energenti> and <cone> are missing), but assuming that is fixed - try this:
from lxml import etree
import pandas as pd

xmls = [xml_1, xml_2]
# note: for simplicity, xml_1 and xml_2 hold the well-formed versions of the XML strings
# in your question; you'll have to use actual file names and paths

energies = ["xml", "energy_e", "energy_en", "energy_to", "whatever"]
# I just made up some names - you'll have to use actual names, of course;
# the first one is for the file identifier - see below

rows = []
for xml in xmls:
    row = []
    id = "xml-" + str(xmls.index(xml) + 1)
    # this creates the file identifier
    row.append(id)
    root = etree.XML(xml.encode())
    # in real life, you'll have to use the parse() method
    for energy in energies[1:]:
        # the '[1:]' skips the first entry; it's only used as the file identifier
        target = root.xpath(f'//energent[./sifra[.="{energy}"]]/vrednost/text()')
        # note the use of f-strings
        row.extend(target if len(target) > 0 else "0")
    rows.append(row)

print(pd.DataFrame(rows, columns=energies))
Output:
xml energy_e energy_en energy_to whatever
0 xml-1 238981 0 16359 0
1 xml-2 424242 29 0 0
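For the real files on disk, the same loop can be driven by etree.parse() instead of etree.XML(). A sketch under the assumption that the files sit in one folder (the folder path is a placeholder, and the file stem is used as the identifier instead of the made-up "xml-1"/"xml-2"):

from pathlib import Path
from lxml import etree
import pandas as pd

energies = ["xml", "energy_e", "energy_en", "energy_to"]
rows = []
for path in sorted(Path(r"C:\path\to\xml_folder").glob("*.xml")):
    row = [path.stem]                       # use the file name as the identifier
    root = etree.parse(str(path)).getroot()
    for energy in energies[1:]:
        target = root.xpath(f'//energent[./sifra[.="{energy}"]]/vrednost/text()')
        row.extend(target if len(target) > 0 else "0")
    rows.append(row)

print(pd.DataFrame(rows, columns=energies))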

Get children elements of multiple instances of the same name tag using ElementTree

I have an xml file looking like this:
<?xml version="1.0" encoding="UTF-8"?>
<data>
<boundary_conditions>
<rot>
<rot_instance>
<name>BC_1</name>
<rpm>200</rpm>
<parts>
<name>rim_FL</name>
<name>tire_FL</name>
<name>disk_FL</name>
<name>center_FL</name>
</parts>
</rot_instance>
<rot_instance>
<name>BC_2</name>
<rpm>100</rpm>
<parts>
<name>tire_FR</name>
<name>disk_FR</name>
</parts>
</rot_instance>
</rot>
</boundary_conditions>
</data>
I actually know how to extract data corresponding to each instance. So I can do this for the names tag as follows:
import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()

names = tree.findall('.//boundary_conditions/rot/rot_instance/name')
for val in names:
    print(val.text)
which gives me:
BC_1
BC_2
But if I do the same thing for the parts tag:
names = tree.findall('.//boundary_conditions/rot/rot_instance/parts/name')
for val in names:
    print(val.text)
It will give me:
rim_FL
tire_FL
disk_FL
center_FL
tire_FR
disk_FR
which lumps together all the data under parts/name. I want output that gives me the 'parts' sub-elements of each instance as separate lists. So this is what I want to get:
instance_BC_1 = ['rim_FL', 'tire_FL', 'disk_FL', 'center_FL']
instance_BC_2 = ['tire_FR', 'disk_FR']
Any help is appreciated,
Thanks.
You've got to first find all parts elements, then from each parts element find all name tags.
Take a look:
parts = tree.findall('.//boundary_conditions/rot/rot_instance/parts')
for part in parts:
    for val in part.findall("name"):
        print(val.text)
    print()

instance_BC_1 = [val.text for val in parts[0].findall("name")]
instance_BC_2 = [val.text for val in parts[1].findall("name")]
print(instance_BC_1)
print(instance_BC_2)
Output:
rim_FL
tire_FL
disk_FL
center_FL
tire_FR
disk_FR
['rim_FL', 'tire_FL', 'disk_FL', 'center_FL']
['tire_FR', 'disk_FR']
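To avoid hard-coding parts[0] and parts[1], you could also key a dictionary by each instance's name. A small sketch using the same tree:

instances = {}
for inst in tree.findall('.//boundary_conditions/rot/rot_instance'):
    bc_name = inst.find('name').text            # e.g. 'BC_1'
    instances[bc_name] = [n.text for n in inst.findall('parts/name')]

print(instances['BC_1'])   # ['rim_FL', 'tire_FL', 'disk_FL', 'center_FL']
print(instances['BC_2'])   # ['tire_FR', 'disk_FR']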

Xpath for ElementTree Reference to XML with Namespace for UK Statute Metadata

New to Python, but I am trying to access the metadata for a UK statute, e.g. https://www.legislation.gov.uk/ukpga/2018/12/part/3/chapter/4/data.xml - Chapter 4 of Part 3 of the UK Data Protection Act.
The problem is that there are two namespaces involved - the UK legislation ukm: and the Dublin Core dc:
<Legislation xmlns="http://www.legislation.gov.uk/namespaces/legislation" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" DocumentURI="http://www.legislation.gov.uk/ukpga/2018/12" IdURI="http://www.legislation.gov.uk/id/ukpga/2018/12" NumberOfProvisions="1103" xsi:schemaLocation="http://www.legislation.gov.uk/namespaces/legislation http://www.legislation.gov.uk/schema/legislation.xsd" SchemaVersion="1.0" RestrictExtent="E+W+S+N.I." RestrictStartDate="2020-02-14">
<ukm:Metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dct="http://purl.org/dc/terms/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:ukm="http://www.legislation.gov.uk/namespaces/metadata">
<dc:identifier>
...
How do I access the Legislation/ukm:Metadata element?
I've tried, unsuccessfully, using:
statute_xml_tree = ET.parse(statute_xmi_doc)
statute_root = statute_xml_tree.getroot()
statute_metadata = statute_root.findall("{http://www.legislation.gov.uk/namespaces/metadata}Metadata")
along the lines of
#All dublin-core "title" tags in the document
root.findall(".//{http://purl.org/dc/elements/1.1/}title")
from https://docs.python.org/3/library/xml.etree.elementtree.html#elementtree-xpath
The answer seems to be that a namespace prefix is required:
dcmi_title = statute_root.find(mm_ns + "Metadata/" + dc_ns + "title").text
but the addition of namespaces produces lengthy XPath strings ...
An alternative approach could be to use lxml along with its xpath() method:
from lxml import etree as et
root = et.parse('ukpga-2018-12-part-3-chapter-4.xml')
title = root.xpath(".//*[local-name()='Metadata']/*[local-name()='title']/text()")
Here the notation *[local-name()='Metadata'] means "every child having local name 'Metadata'", which gives you the ability to ignore namespaces.
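If you would rather stay with ElementTree, find()/findall() also accept a namespaces mapping, which keeps the path strings short without spelling out the {uri} form everywhere. A sketch against the same file (the prefix names in the dict are arbitrary; it looks up the dc:identifier child shown in the posted snippet):

import xml.etree.ElementTree as ET

# only the URIs have to match the document; the prefixes are your choice
ns = {
    'ukm': 'http://www.legislation.gov.uk/namespaces/metadata',
    'dc': 'http://purl.org/dc/elements/1.1/',
}

statute_root = ET.parse('ukpga-2018-12-part-3-chapter-4.xml').getroot()

metadata = statute_root.find('ukm:Metadata', ns)     # the Legislation/ukm:Metadata element
identifier = metadata.find('dc:identifier', ns)      # e.g. the dc:identifier child shown above
print(identifier.text)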

Extract data from ORCID XML files using Python

I am trying to (offline) parse names from ORCID XML files using Python. The downloaded files look like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<record:record xmlns:internal="http://www.orcid.org/ns/internal" xmlns:address="http://www.orcid.org/ns/address" xmlns:email="http://www.orcid.org/ns/email" xmlns:history="http://www.orcid.org/ns/history" xmlns:employment="http://www.orcid.org/ns/employment" xmlns:person="http://www.orcid.org/ns/person" xmlns:education="http://www.orcid.org/ns/education" xmlns:other-name="http://www.orcid.org/ns/other-name" xmlns:personal-details="http://www.orcid.org/ns/personal-details" xmlns:bulk="http://www.orcid.org/ns/bulk" xmlns:common="http://www.orcid.org/ns/common" xmlns:record="http://www.orcid.org/ns/record" xmlns:keyword="http://www.orcid.org/ns/keyword" xmlns:activities="http://www.orcid.org/ns/activities" xmlns:deprecated="http://www.orcid.org/ns/deprecated" xmlns:external-identifier="http://www.orcid.org/ns/external-identifier" xmlns:funding="http://www.orcid.org/ns/funding" xmlns:error="http://www.orcid.org/ns/error" xmlns:preferences="http://www.orcid.org/ns/preferences" xmlns:work="http://www.orcid.org/ns/work" xmlns:researcher-url="http://www.orcid.org/ns/researcher-url" xmlns:peer-review="http://www.orcid.org/ns/peer-review" path="/0000-0001-5006-8001">
<common:orcid-identifier>
<common:uri>http://orcid.org/0000-0001-5006-8001</common:uri>
<common:path>0000-0001-5006-8001</common:path>
<common:host>orcid.org</common:host>
</common:orcid-identifier>
<preferences:preferences>
<preferences:locale>en</preferences:locale>
</preferences:preferences>
<person:person path="/0000-0001-5006-8001/person">
<common:last-modified-date>2016-06-06T15:29:36.952Z</common:last-modified-date>
<person:name visibility="public" path="0000-0001-5006-8001">
<common:created-date>2016-04-15T20:45:16.141Z</common:created-date>
<common:last-modified-date>2016-04-15T20:45:16.141Z</common:last-modified-date>
<personal-details:given-names>Marjorie</personal-details:given-names>
<personal-details:family-name>Biffi</personal-details:family-name>
</person:name>
What I want is to extract given-names and family-name: Marjorie Biffi. I am trying to use this code:
>>> import xml.etree.ElementTree as ET
>>> root = ET.parse('f.xml').getroot()
>>> p=root.findall('{http://www.orcid.org/ns/personal-details}personal-details')
>>> p
[]
I can't figure out how to extract the given name/surname from this XML file. I have also tried to use XPath/selectors, but with no success.
This will get you the results you want, but by climbing down through each level.
p1 = root.find('{http://www.orcid.org/ns/person}person')
name = p1.find('{http://www.orcid.org/ns/person}name')
given_names = name.find('{http://www.orcid.org/ns/personal-details}given-names')
family_name = name.find('{http://www.orcid.org/ns/personal-details}family-name')
print(given_names.text, '', family_name.text)
You could also just go directly to that sublevel with .//
family_name = root.find('.//{http://www.orcid.org/ns/personal-details}family-name')
Also, I just posted here about simpler ways to parse XML if you're doing more basic operations. These include xmltodict (converting to an OrderedDict) or untangle, which is a little inefficient but very quick and easy to learn.
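For completeness, ElementTree's find()/findall() can also take a namespaces mapping, which avoids spelling out the full {uri} prefix each time. A minimal sketch against the posted file ('f.xml', as in the question):

import xml.etree.ElementTree as ET

# the prefix names in this mapping are arbitrary; only the URIs have to match the document
ns = {
    'person': 'http://www.orcid.org/ns/person',
    'details': 'http://www.orcid.org/ns/personal-details',
}

root = ET.parse('f.xml').getroot()
name = root.find('.//person:name', ns)
print(name.find('details:given-names', ns).text,
      name.find('details:family-name', ns).text)   # Marjorie Biffi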

Extract an XML file (TED Europe) to a pandas dataframe

I have an XML file from TED Europe (like this: TED Europa XML Files (login required)). The XML files contain public procurement contracts.
My question now is how I can parse the XML file into a pandas dataframe.
So far I have tried to achieve this using the ElementTree package.
However, since I am still a beginner, I have trouble extracting the information, because the relevant text is only marked with "P" tags.
How can I extract this information for the English translation so that, for example, "TI_MARK" becomes the column header and the information within "TXT_MARK" and its "P" tags becomes the row? The other rows would later be filled with information from other public procurement XML files.
<FORM_SECTION>
<OTH_NOT LG="DA" VERSION="R2.0.8.S03.E01" CATEGORY="TRANSLATION">
<OTH_NOT LG="DE" VERSION="R2.0.8.S03.E01" CATEGORY="TRANSLATION">
<OTH_NOT LG="EN" VERSION="R2.0.8.S03.E01" CATEGORY="ORIGINAL">
<FD_OTH_NOT>
<TI_DOC>
<P>BE-Brussels: IPA - Improved implementation of animal health, food safety and phytosanitary legislation and corresponding information systems</P>
</TI_DOC>
<STI_DOC>
<P>Location — The former Yugoslav Republic of Macedonia</P>
</STI_DOC>
<STI_DOC>
<P>SERVICE CONTRACT NOTICE</P>
</STI_DOC>
<CONTENTS>
<GR_SEQ>
<TI_GRSEQ>
<BLK_BTX/>
</TI_GRSEQ>
<BLK_BTX_SEQ>
<MARK_LIST>
<MLI_OCCUR NO_SEQ="001">
<NO_MARK>1.</NO_MARK>
<TI_MARK>Publication reference</TI_MARK>
<TXT_MARK>
<P>EuropeAid/139253/DH/SER/MK</P>
</TXT_MARK>
</MLI_OCCUR>
<MLI_OCCUR NO_SEQ="002">
<NO_MARK>2.</NO_MARK>
<TI_MARK>Procedure</TI_MARK>
<TXT_MARK>
<P>Restricted</P>
</TXT_MARK>
</MLI_OCCUR>
So far my code is:
import xml.etree.cElementTree as ET

tree = ET.parse('196658_2018.xml')
# Print tree
print(tree)
# tree = ET.ElementTree(file='196658_2018.xml')

root = tree.getroot()
# Print root
print(root)

for element in root.findall('{ted/R2.0.8.S03/publication}FORM_SECTION/'
                            '{ted/R2.0.8.S03/publication}OTH_NOT/'
                            '{ted/R2.0.8.S03/publication}FD_OTH_NOT/'
                            '{ted/R2.0.8.S03/publication}TI_DOC/'
                            '{ted/R2.0.8.S03/publication}P'):
    print(element.text)
Strangely, the extraction only works if I add {ted/R2.0.8.S03/publication} to each path element.
Moving on from that, I have problems writing a function which collects all the paths with their information and appends them to a pandas dataframe. Ideally, only the English translation should be extracted.
For another part of the XML File I used a function like this:
from lxml import etree
import pandas as pd
import xml.etree.ElementTree as ET

def parse_xml_fields(file, base_tag, tag_list, final_list):
    root = etree.parse(file)
    nodes = root.findall("//{}".format(base_tag))
    for node in nodes:
        item = {}
        for tag in tag_list:
            if node.find(".//{}".format(tag)) is not None:
                item[tag] = node.find(".//{}".format(tag)).text.strip()
        final_list.append(item)

# My variables
field_list = ["{ted/R2.0.8.S03/publication}TI_CY",
              "{ted/R2.0.8.S03/publication}TI_TOWN",
              "{ted/R2.0.8.S03/publication}TI_TEXT"]
entities_list = []

parse_xml_fields("196658_2018.xml", "{ted/R2.0.8.S03/publication}ML_TI_DOC",
                 field_list, entities_list)

df = pd.DataFrame(entities_list, columns=field_list)
print(df)

# better column names
df.columns = ['Country', 'Town', 'Text']
df.to_csv("TED_Europa_List.csv", sep=',', encoding='utf-8')
The paths and tags, however, are much easier to work with for that section, because the tags are already named after their content.
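A possible sketch for the original goal (TI_MARK text as column headers, the TXT_MARK text as the row values), building on the {ted/R2.0.8.S03/publication} prefix that already works in the snippets above - the prefix is needed because the document declares a default namespace. The LG='EN' filter and the joining of multiple P paragraphs are my assumptions based on the posted fragment:

import pandas as pd
from lxml import etree

NS = '{ted/R2.0.8.S03/publication}'   # namespace prefix taken from the working paths above


def parse_marks(file):
    """Collect {TI_MARK text: TXT_MARK text} for the English (LG='EN') notice of one file."""
    root = etree.parse(file)
    row = {}
    for oth in root.findall(f".//{NS}OTH_NOT[@LG='EN']"):
        for occur in oth.findall(f".//{NS}MLI_OCCUR"):
            title = occur.find(f"{NS}TI_MARK")
            paragraphs = occur.findall(f"{NS}TXT_MARK/{NS}P")
            if title is not None:
                row[title.text] = ' '.join(p.text.strip() for p in paragraphs if p.text)
    return row


# one row per file; later files can be appended to the list before building the DataFrame
df = pd.DataFrame([parse_marks('196658_2018.xml')])
print(df)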
