Parsing subfields in XML and merging with matching columns

Parsing subfields in XML and merging with matching columns - python

This is a follow-up question from here. it got lost due to high amount of other topic on this forum. Maybe i presented the question too complicated. Since then I improved and simplified the approach.
To sum up: i'd like to extract data from subfields in multiple XML files and attach those to a new df on a matching positions.
This is a sample XML-1:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXmlPrenos>
<Qfl>1808</Qfl>
<fOVE>13.7</fOVE>
<NetoVolumen>613</NetoVolumen>
<Hv>104.2</Hv>
<energenti>
<energent>
<sifra>energy_e</sifra>
<naziv>EE [kWh]</naziv>
<vrednost>238981</vrednost>
</energent>
<energent>
<sifra>energy_to</sifra>
<naziv>Do</naziv>
<vrednost>16359</vrednost>
</energent>
<rei>
<zavetrovanost>2</zavetrovanost>
<cone>
<cona>
<cona_id>1</cona_id>
<cc_si_cona>1110000</cc_si_cona>
<visina_cone>2.7</visina_cone>
<dolzina_cone>14</dolzina_cone>
</cona>
<cona>
<cona_id>2</cona_id>
<cc_si_cona>120000</cc_si_cona>
</cona>
</rei>
</reiXmlPrenos>
his is a sample XML-2:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXmlPrenos>
<Qfl>1808</Qfl>
<fOVE>13.7</fOVE>
<NetoVolumen>613</NetoVolumen>
<Hv>104.2</Hv>
<energenti>
<energent>
<sifra>energy_e</sifra>
<naziv>EE [kWh]</naziv>
<vrednost>424242</vrednost>
</energent>
<energent>
<sifra>energy_en</sifra>
<naziv>Do</naziv>
<vrednost>29</vrednost>
</energent>
<rei>
<zavetrovanost>2</zavetrovanost>
<cone>
<cona>
<cona_id>1</cona_id>
<cc_si_cona>1110000</cc_si_cona>
<visina_cone>2.7</visina_cone>
<dolzina_cone>14</dolzina_cone>
</cona>
<cona>
<cona_id>2</cona_id>
<cc_si_cona>120000</cc_si_cona>
</cona>
</rei>
</reiXmlPrenos>
My code:
import xml.etree.ElementTree as ETree
import pandas as pd
xmldata = r"C:\...\S1.xml"
prstree = ETree.parse(xmldata)
root = prstree.getroot()
# print(root)
store_items = []
all_items = []
for storeno in root.iter('energent'):
cona_sifra = storeno.find('sifra').text
cona_vrednost = storeno.find('vrednost').text
store_items = [cona_sifra, cona_vrednost]
all_items.append(store_items)
xmlToDf = pd.DataFrame(all_items, columns=[
'sifra', 'vrednost'])
print(xmlToDf.to_string(index=False))
This results in:
sifra vrednost
energy_e 238981
energy_to 16359
Which is fine for 1 example. But i have 1,000 of XML files and the wish is to 1) have all results in 1 row for each XML and 2) to differentiate between different 'sifra' codes.
There can be e.g. energy_e, energy_en, energy_to
So ideally the final df would look like this
xml energy_e energy_en energy_to
xml-1 238981 0 16539
xml-2 424242 29 0
can it be done?

Simply use pandas.read_xml since the part of the XML you need is a flat part of the document:
energy_df = pd.read_xml("Input.xml", xpath=".//energent") # IF lxml INSTALLED
energy_df = pd.read_xml("Input.xml", xpath=".//energent", parser="etree") # IF lxml NOT INSTALLED
And to bind across many XML files, simply build a list of data frames from a list of XML file paths, adding a column for source file, and then run pandas.concat to row bind all into a single data frame:
xml_files = [...]
energy_dfs = [
pd.read_xml(f, xpath=".//energent", parser="etree").assign(source=f) for f in xml_files
]
energy_long_df = pd.concat(energy_dfs, ignore_index=True)
And from your desired output, you can then pivot values from sifra columns with pivot_table:
energy_wide_df = energy_long_df.pivot_table(
values="vrednost", index="source", columns="sifra", aggfunc="sum"
)

If I understand the situation correctly, this can be done - but because of the complexity, I would use here lxml, instead of ElementTree.
I'll try to annotate the code a bit, but you'll have to really do read up on this.
By the way, the two xml files you posted are not well formed (closing tags for <energenti> and <cone> are missing), but assuming that is fixed - try this:
from lxml import etree
xmls =[XML-1,XML-2]
#note: For simplicity, I'm using the well formed version of the xml strings in your question; you'll have to use actual file names and paths
energies = ["xml", "energy_e", "energy_en", "energy_to", "whatever"]
#I just made up some names - you'll have to use actual names, of course; the first one is for the file identifier - see below
rows = []
for xml in xmls:
row = []
id = "xml-"+str(xmls.index(xml)+1)
#this creates the file identifier
row.append(id)
root = etree.XML(xml.encode())
#in real life, you'll have to use the parse() method
for energy in energies[1:]:
#the '[1:]' is used to skip the first "energy"; it's only used as the file identifier
target = root.xpath(f'//energent[./sifra[.="{energy}"]]/vrednost/text()')
#note the use of f-strings
row.extend( target if len(target)>0 else "0" )
rows.append(row)
print(pd.DataFrame(rows,columns=energies))
Output:
xml energy_e energy_en energy_to whatever
0 xml-1 238981 0 16359 0
1 xml-2 424242 29 0 0

Related

How to pull a value out of an element in a nested XML document in Python?

I'm asking an API to look up part numbers I get from a user with a barcode scanner. The API returns a much longer document than the below code block, but I trimmed a bunch of unnecessary empty elements, but the structure of the document is still the same. I need to put each part number in a dictionary where the value is the text inside of the <mfgr> element. With each run of my program, I generate a list of part numbers and have a loop that asks the API about each item in my list and each returns a huge document as expected. I'm a bit stuck on trying to parse the XML and get only the text inside of <mfgr> element, then save it to a dictionary with the part number that it belongs to. I'll put my loop that goes through my list below the XML document
<ArrayOfitem xmlns="WhereDataComesFrom.com" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<item>
<associateditem_swap/>
<bulk>false</bulk>
<category>Memory</category>
<clei>false</clei>
<createddate>5/11/2021 7:34:58 PM</createddate>
<description>sample description</description>
<heci/>
<imageurl/>
<item_swap/>
<itemid>1640</itemid>
<itemnumber>**sample part number**</itemnumber>
<listprice>0.0000</listprice>
<manufactureritem/>
<maxavailable>66</maxavailable>
<mfgr>**sample manufacturer**</mfgr>
<minreorderqty>0</minreorderqty>
<noninventory>false</noninventory>
<primarylocation/>
<reorderpoint>0</reorderpoint>
<rep>AP</rep>
<type>Memory </type>
<updateddate>2/4/2022 2:22:51 PM</updateddate>
<warehouse>MAIN</warehouse>
</item>
</ArrayOfitem>
Below is my Python code that loops through the part number list and asks the API to look up each part number.
import http.client
import xml.etree.ElementTree as etree
raw_xml = None
pn_list=["samplepart1","samplepart2"]
api_key= **redacted lol**
def getMFGR():
global raw_xml
for part_number in pn_list:
conn = http.client.HTTPSConnection("api.website.com")
payload = ''
headers = {
'session-token': 'api_key',
'Cookie': 'firstpartofmycookie; secondpartofmycookie'
}
conn.request("GET", "/webapi.svc/MI/XML/GetItemsByItemNumber?ItemNumber="+part_number, payload, headers)
res = conn.getresponse()
data = res.read()
raw_xml = data.decode("utf-8")
print(raw_xml)
print()
getMFGR()
Here is some code I tried while trying to get the mfgr. It will go inside the getMFGR() method inside the for loop so that it saves the manufacturer to a variable with each loop. Once the code works I want to have the dictionary look like this: {"samplepart1": "manufacturer1", "samplepart2": "manufacturer2"}.
root = etree.fromstring(raw_xml)
my_ns = {'root': 'WhereDataComesFrom.com'}
mfgr = root.findall('root:mfgr',my_ns)[0].text
The code above gives me a list index out of range error when I run it. I don't think it's searching past the namespaces node but I'm not sure how to tell it to search further.

This is where an interactive session becomes very useful. Drop your XML data into a file (say, data.xml), and then start up a Python REPL:
>>> import xml.etree.ElementTree as etree
>>> with open('data.xml') as fd:
... raw_xml=fd.read()
...
>>> root = etree.fromstring(raw_xml)
>>> my_ns = {'root': 'WhereDataComesFrom.com'}
Let's first look at your existing xpath expression:
>>> root.findall('root:mfgr',my_ns)
[]
That returns an empty list, which is why you're getting an "index out of range" error. You're getting an empty list because there is no mfgr element at the top level of the document; it's contained in an <item> element. So this will work:
>>> root.findall('root:item/root:mfgr',my_ns)
[<Element '{WhereDataComesFrom.com}mfgr' at 0x7fa5a45e2b60>]
To actually get the contents of that element:
>>> [x.text for x in root.findall('root:item/root:mfgr',my_ns)]
['**sample manufacturer**']
Hopefully that's enough to point you in the right direction.

I suggest use pandas for this structure of XML:
import pandas as pd
# Read XML row into DataFrame
ns = {"xmlns":"WhereDataComesFrom.com", "xmlns:i":"http://www.w3.org/2001/XMLSchema-instance"}
df = pd.read_xml("parNo_plant.xml", xpath=".//xmlns:item", namespaces=ns)
# Print only columns of interesst
df_of_interest = df[['itemnumber', 'mfgr']]
print(df_of_interest,'\n')
#Print the dictionary from DataFrame
print(df_of_interest.to_dict(orient='records'))
# If I understood right, you search this layout:
dictionary = dict(zip(df.itemnumber, df.mfgr))
print(dictionary)
Result (Pandas dataframe or dictionary):
itemnumber mfgr
0 **sample part number** **sample manufacturer**
[{'itemnumber': '**sample part number**', 'mfgr': '**sample manufacturer**'}]
{'**sample part number**': '**sample manufacturer**'}

Get children elements of multiple instances of the same name tag using ElementTree

I have an xml file looking like this:
<?xml version="1.0" encoding="UTF-8"?>
<data>
<boundary_conditions>
<rot>
<rot_instance>
<name>BC_1</name>
<rpm>200</rpm>
<parts>
<name>rim_FL</name>
<name>tire_FL</name>
<name>disk_FL</name>
<name>center_FL</name>
</parts>
</rot_instance>
<rot_instance>
<name>BC_2</name>
<rpm>100</rpm>
<parts>
<name>tire_FR</name>
<name>disk_FR</name>
</parts>
</rot_instance>
</data>
I actually know how to extract data corresponding to each instance. So I can do this for the names tag as follows:
import xml.etree.ElementTree as ET
tree = ET.parse('file.xml')
root = tree.getroot()
names= tree.findall('.//boundary_conditions/rot/rot_instance/name')
for val in names:
print(val.text)
which gives me:
BC_1
BC_2
But if I do the same thing for the parts tag:
names= tree.findall('.//boundary_conditions/rot/rot_instance/parts/name')
for val in names:
print(val.text)
It will give me:
rim_FL
tire_FL
disk_FL
center_FL
tire_FR
disk_FR
Which combines all data corresponding to parts/name together. I want output that gives me the 'parts' sub-element for each instance as separate lists. So this is what I want to get:
instance_BC_1 = ['rim_FL', 'tire_FL', 'disk_FL', 'center_FL']
instance_BC_2 = ['tire_FR', 'disk_FR']
Any help is appreciated,
Thanks.

You've got to first find all parts elements, then from each parts element find all name tags.
Take a look:
parts = tree.findall('.//boundary_conditions/rot/rot_instance/parts')
for part in parts:
for val in part.findall("name"):
print(val.text)
print()
instance_BC_1 = [val.text for val in parts[0].findall("name")]
instance_BC_2 = [val.text for val in parts[1].findall("name")]
print(instance_BC_1)
print(instance_BC_2)
Output:
rim_FL
tire_FL
disk_FL
center_FL
tire_FR
disk_FR
['rim_FL', 'tire_FL', 'disk_FL', 'center_FL']
['tire_FR', 'disk_FR']

Extract data from ORCID XML files using Python

I ma trying to (offline) parse names from ORCID XML files using Python, which is downloaded from :
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<record:record xmlns:internal="http://www.orcid.org/ns/internal" xmlns:address="http://www.orcid.org/ns/address" xmlns:email="http://www.orcid.org/ns/email" xmlns:history="http://www.orcid.org/ns/history" xmlns:employment="http://www.orcid.org/ns/employment" xmlns:person="http://www.orcid.org/ns/person" xmlns:education="http://www.orcid.org/ns/education" xmlns:other-name="http://www.orcid.org/ns/other-name" xmlns:personal-details="http://www.orcid.org/ns/personal-details" xmlns:bulk="http://www.orcid.org/ns/bulk" xmlns:common="http://www.orcid.org/ns/common" xmlns:record="http://www.orcid.org/ns/record" xmlns:keyword="http://www.orcid.org/ns/keyword" xmlns:activities="http://www.orcid.org/ns/activities" xmlns:deprecated="http://www.orcid.org/ns/deprecated" xmlns:external-identifier="http://www.orcid.org/ns/external-identifier" xmlns:funding="http://www.orcid.org/ns/funding" xmlns:error="http://www.orcid.org/ns/error" xmlns:preferences="http://www.orcid.org/ns/preferences" xmlns:work="http://www.orcid.org/ns/work" xmlns:researcher-url="http://www.orcid.org/ns/researcher-url" xmlns:peer-review="http://www.orcid.org/ns/peer-review" path="/0000-0001-5006-8001">
<common:orcid-identifier>
<common:uri>http://orcid.org/0000-0001-5006-8001</common:uri>
<common:path>0000-0001-5006-8001</common:path>
<common:host>orcid.org</common:host>
</common:orcid-identifier>
<preferences:preferences>
<preferences:locale>en</preferences:locale>
</preferences:preferences>
<person:person path="/0000-0001-5006-8001/person">
<common:last-modified-date>2016-06-06T15:29:36.952Z</common:last-modified-date>
<person:name visibility="public" path="0000-0001-5006-8001">
<common:created-date>2016-04-15T20:45:16.141Z</common:created-date>
<common:last-modified-date>2016-04-15T20:45:16.141Z</common:last-modified-date>
<personal-details:given-names>Marjorie</personal-details:given-names>
<personal-details:family-name>Biffi</personal-details:family-name>
</person:name>
What I want is to extract given-names and family-name: Marjorie Biffi. I am trying to use this code:
>>> import xml.etree.ElementTree as ET
>>> root = ET.parse('f.xml').getroot()
>>> p=root.findall('{http://www.orcid.org/ns/personal-details}personal-details')
>>> p
[]
I can't figure out how to extract names/surname from this XML file. I am trying also yo use XPath/Selector, but no succes.

This will get you the results you want, but by climbing through each one.
p1 = root.find('{http://www.orcid.org/ns/person}person')
name = p1.find('{http://www.orcid.org/ns/person}name')
given_names = name.find('{http://www.orcid.org/ns/personal-details}given-names')
family_name = name.find('{http://www.orcid.org/ns/personal-details}family-name')
print(given_names.text, '', family_name.text)
You could also just go directly to that sublevel with .\\
family_name = root.find('.//{http://www.orcid.org/ns/personal-details}family-name')
Also I just posted here about simpler ways to parse through xml if you're doing more basic operations. These include xmltodict (converting to an OrderedDict) or untangle which is a little inefficient but very quick and easy to learn.

Extract an XML file (TED Europe) to a pandas dataframe

I have an XML from TED Europe (like this:TED Europa XML Files (Login required)) Within the XML Files are public procurement contracts.
My question is now how can I parse the XML File to a pandas dataframe.
So far I tried to achieve this using the ElementTree package.
However since I am still a beginner habe trouble extracting the information since the relevant text is marked with only "p" tags.
How can I extract this information for the english translation so that for example "TI_MARK" is the column header and the information within "TXT_MARK" and the "p" tags are the rows? The other rows are later filled with information from other public procurement XML Files.
<FORM_SECTION>
<OTH_NOT LG="DA" VERSION="R2.0.8.S03.E01" CATEGORY="TRANSLATION">
<OTH_NOT LG="DE" VERSION="R2.0.8.S03.E01" CATEGORY="TRANSLATION">
<OTH_NOT LG="EN" VERSION="R2.0.8.S03.E01" CATEGORY="ORIGINAL">
<FD_OTH_NOT>
<TI_DOC>
<P>BE-Brussels: IPA - Improved implementation of animal health, food safety and phytosanitary legislation and corresponding information systems</P>
</TI_DOC>
<STI_DOC>
<P>Location — The former Yugoslav Republic of Macedonia</P>
</STI_DOC>
<STI_DOC>
<P>SERVICE CONTRACT NOTICE</P>
</STI_DOC>
<CONTENTS>
<GR_SEQ>
<TI_GRSEQ>
<BLK_BTX/>
</TI_GRSEQ>
<BLK_BTX_SEQ>
<MARK_LIST>
<MLI_OCCUR NO_SEQ="001">
<NO_MARK>1.</NO_MARK>
<TI_MARK>Publication reference</TI_MARK>
<TXT_MARK>
<P>EuropeAid/139253/DH/SER/MK</P>
</TXT_MARK>
</MLI_OCCUR>
<MLI_OCCUR NO_SEQ="002">
<NO_MARK>2.</NO_MARK>
<TI_MARK>Procedure</TI_MARK>
<TXT_MARK>
<P>Restricted</P>
</TXT_MARK>
</MLI_OCCUR>
So far my code is:
import xml.etree.cElementTree as ET
tree = ET.parse('196658_2018.xml')
#Print Tree
print(tree)
#tree=ET.ElementTree(file='196658_2018.xml')
root = tree.getroot()
#Print root
print(root)
for element in root.findall('{ted/R2.0.8.S03/publication}FORM_SECTION/{ted/R2.0.8.S03/publication}OTH_NOT/{ted/R2.0.8.S03/publication}FD_OTH_NOT/{ted/R2.0.8.S03/publication}TI_DOC/{ted/R2.0.8.S03/publication}P'):
print(element.text)
Strangely the extraction only works if I add {ted/R2.0.8.S03/publication} to each path element.
Moving on from that I have problems writing a function which contains all paths with infos and appends them to a pandas dataframe. Ideally only the english translation should be extracted.
For another part of the XML File I used a function like this:
from lxml import etree
import pandas as pd
import xml.etree.ElementTree as ET
def parse_xml_fields(file, base_tag, tag_list, final_list):
root = etree.parse(file)
nodes = root.findall("//{}".format(base_tag))
for node in nodes:
item = {}
for tag in tag_list:
if node.find(".//{}".format(tag)) is not None:
item[tag] = node.find(".//{}".format(tag)).text.strip()
final_list.append(item)
# My variables
field_list = ["{ted/R2.0.8.S03/publication}TI_CY","{ted/R2.0.8.S03/publication}TI_TOWN", "{ted/R2.0.8.S03/publication}TI_TEXT"]
entities_list = []
parse_xml_fields("196658_2018.xml", "{ted/R2.0.8.S03/publication}ML_TI_DOC", field_list, entities_list)
df = pd.DataFrame(entities_list, columns=field_list)
print(df)
#better column names
df.columns = ['Country', 'Town', 'Text']
df.to_csv("TED_Europa_List.csv", sep=',', encoding='utf-8')
The Path and Tags however are much more distinguishable for this section because the tags already are named after their content and the tags are more distinguishable.

Python Minidom XML parsing interchangable upper and lower case node names

Hoping for help here. I am writing a small script to pull info from a data file. the following is the start of the xml...its quite big.
<?xml version="1.0" encoding="ISO-8859-1"?>
<flatNet Version="1" id="{014852F8-3010-4a5f-8215-8F47B000EA60}" sch="Vbt-mbb-1.4.scs">
<partNumbers>
<PartNumber Name="PN_LIB_NAME" Version="1">
<Properties Ver="1">
<a Key="PARTNAME" Value="PART_NAME"/>
<a Key="ALTPARTREF" Value="PART_REF"/>
My problem is that in some of the files I need to parse, the node is capitalized and in some it is lower case.
How do I get the node name (either "a" or "A") into a variable so I can use it for a function?
Else I am stuck changing it manually every time I want to parse a new file depending on what that file contains.
Thanks heaps in advance!

you will either have to have 2 nodelists and work with them individually
a1 = doc.getElementsByTagName('a')
a2 = doc.getElementsByTagName('A')
#do stuff with a1 and a2
or normalize the tags along the lines of this:
>>> import xml.dom.minidom as minidom
>>> xml = minidom.parseString('<root><head>hi!</head><body>text<a>blah</a><A>blahblah</A></body></root>')
>>> allEls = xml.getElementsByTagName('*')
>>> for i in allEls:
if i.localName.lower() == 'a':
print i.toxml()
<a>blah</a>
<A>blahblah</A>

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parsing subfields in XML and merging with matching columns - python

Related

How to pull a value out of an element in a nested XML document in Python?

Get children elements of multiple instances of the same name tag using ElementTree

Extract data from ORCID XML files using Python

Extract an XML file (TED Europe) to a pandas dataframe

Python Minidom XML parsing interchangable upper and lower case node names

Categories

Resources