Python Minidom XML parsing interchangable upper and lower case node names

Python Minidom XML parsing interchangable upper and lower case node names - python

Hoping for help here. I am writing a small script to pull info from a data file. the following is the start of the xml...its quite big.
<?xml version="1.0" encoding="ISO-8859-1"?>
<flatNet Version="1" id="{014852F8-3010-4a5f-8215-8F47B000EA60}" sch="Vbt-mbb-1.4.scs">
<partNumbers>
<PartNumber Name="PN_LIB_NAME" Version="1">
<Properties Ver="1">
<a Key="PARTNAME" Value="PART_NAME"/>
<a Key="ALTPARTREF" Value="PART_REF"/>
My problem is that in some of the files I need to parse, the node is capitalized and in some it is lower case.
How do I get the node name (either "a" or "A") into a variable so I can use it for a function?
Else I am stuck changing it manually every time I want to parse a new file depending on what that file contains.
Thanks heaps in advance!

you will either have to have 2 nodelists and work with them individually
a1 = doc.getElementsByTagName('a')
a2 = doc.getElementsByTagName('A')
#do stuff with a1 and a2
or normalize the tags along the lines of this:
>>> import xml.dom.minidom as minidom
>>> xml = minidom.parseString('<root><head>hi!</head><body>text<a>blah</a><A>blahblah</A></body></root>')
>>> allEls = xml.getElementsByTagName('*')
>>> for i in allEls:
if i.localName.lower() == 'a':
print i.toxml()
<a>blah</a>
<A>blahblah</A>

Related

Parsing subfields in XML and merging with matching columns

This is a follow-up question from here. it got lost due to high amount of other topic on this forum. Maybe i presented the question too complicated. Since then I improved and simplified the approach.
To sum up: i'd like to extract data from subfields in multiple XML files and attach those to a new df on a matching positions.
This is a sample XML-1:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXmlPrenos>
<Qfl>1808</Qfl>
<fOVE>13.7</fOVE>
<NetoVolumen>613</NetoVolumen>
<Hv>104.2</Hv>
<energenti>
<energent>
<sifra>energy_e</sifra>
<naziv>EE [kWh]</naziv>
<vrednost>238981</vrednost>
</energent>
<energent>
<sifra>energy_to</sifra>
<naziv>Do</naziv>
<vrednost>16359</vrednost>
</energent>
<rei>
<zavetrovanost>2</zavetrovanost>
<cone>
<cona>
<cona_id>1</cona_id>
<cc_si_cona>1110000</cc_si_cona>
<visina_cone>2.7</visina_cone>
<dolzina_cone>14</dolzina_cone>
</cona>
<cona>
<cona_id>2</cona_id>
<cc_si_cona>120000</cc_si_cona>
</cona>
</rei>
</reiXmlPrenos>
his is a sample XML-2:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXmlPrenos>
<Qfl>1808</Qfl>
<fOVE>13.7</fOVE>
<NetoVolumen>613</NetoVolumen>
<Hv>104.2</Hv>
<energenti>
<energent>
<sifra>energy_e</sifra>
<naziv>EE [kWh]</naziv>
<vrednost>424242</vrednost>
</energent>
<energent>
<sifra>energy_en</sifra>
<naziv>Do</naziv>
<vrednost>29</vrednost>
</energent>
<rei>
<zavetrovanost>2</zavetrovanost>
<cone>
<cona>
<cona_id>1</cona_id>
<cc_si_cona>1110000</cc_si_cona>
<visina_cone>2.7</visina_cone>
<dolzina_cone>14</dolzina_cone>
</cona>
<cona>
<cona_id>2</cona_id>
<cc_si_cona>120000</cc_si_cona>
</cona>
</rei>
</reiXmlPrenos>
My code:
import xml.etree.ElementTree as ETree
import pandas as pd
xmldata = r"C:\...\S1.xml"
prstree = ETree.parse(xmldata)
root = prstree.getroot()
# print(root)
store_items = []
all_items = []
for storeno in root.iter('energent'):
cona_sifra = storeno.find('sifra').text
cona_vrednost = storeno.find('vrednost').text
store_items = [cona_sifra, cona_vrednost]
all_items.append(store_items)
xmlToDf = pd.DataFrame(all_items, columns=[
'sifra', 'vrednost'])
print(xmlToDf.to_string(index=False))
This results in:
sifra vrednost
energy_e 238981
energy_to 16359
Which is fine for 1 example. But i have 1,000 of XML files and the wish is to 1) have all results in 1 row for each XML and 2) to differentiate between different 'sifra' codes.
There can be e.g. energy_e, energy_en, energy_to
So ideally the final df would look like this
xml energy_e energy_en energy_to
xml-1 238981 0 16539
xml-2 424242 29 0
can it be done?

Simply use pandas.read_xml since the part of the XML you need is a flat part of the document:
energy_df = pd.read_xml("Input.xml", xpath=".//energent") # IF lxml INSTALLED
energy_df = pd.read_xml("Input.xml", xpath=".//energent", parser="etree") # IF lxml NOT INSTALLED
And to bind across many XML files, simply build a list of data frames from a list of XML file paths, adding a column for source file, and then run pandas.concat to row bind all into a single data frame:
xml_files = [...]
energy_dfs = [
pd.read_xml(f, xpath=".//energent", parser="etree").assign(source=f) for f in xml_files
]
energy_long_df = pd.concat(energy_dfs, ignore_index=True)
And from your desired output, you can then pivot values from sifra columns with pivot_table:
energy_wide_df = energy_long_df.pivot_table(
values="vrednost", index="source", columns="sifra", aggfunc="sum"
)

If I understand the situation correctly, this can be done - but because of the complexity, I would use here lxml, instead of ElementTree.
I'll try to annotate the code a bit, but you'll have to really do read up on this.
By the way, the two xml files you posted are not well formed (closing tags for <energenti> and <cone> are missing), but assuming that is fixed - try this:
from lxml import etree
xmls =[XML-1,XML-2]
#note: For simplicity, I'm using the well formed version of the xml strings in your question; you'll have to use actual file names and paths
energies = ["xml", "energy_e", "energy_en", "energy_to", "whatever"]
#I just made up some names - you'll have to use actual names, of course; the first one is for the file identifier - see below
rows = []
for xml in xmls:
row = []
id = "xml-"+str(xmls.index(xml)+1)
#this creates the file identifier
row.append(id)
root = etree.XML(xml.encode())
#in real life, you'll have to use the parse() method
for energy in energies[1:]:
#the '[1:]' is used to skip the first "energy"; it's only used as the file identifier
target = root.xpath(f'//energent[./sifra[.="{energy}"]]/vrednost/text()')
#note the use of f-strings
row.extend( target if len(target)>0 else "0" )
rows.append(row)
print(pd.DataFrame(rows,columns=energies))
Output:
xml energy_e energy_en energy_to whatever
0 xml-1 238981 0 16359 0
1 xml-2 424242 29 0 0

How to update value between specific xml tags, where input is string, Python?

Consider I have a string that looks like the following below. It's type is string but it will always represents an xml document. I'm researching available python libraries for xml. How can I update a value in between 2 specific tags? What library would I be using for that?
<?xml version="1.0"?>
<PostTelemetryRequest xmlns:ns2="urn:com:onstar:global:common:schema:PostTelemetryData:1">
<ns2:PartnerVehicles>
<ns2:PartnerVehicle>
<ns2:partnerNotificationID>251029655</ns2:partnerNotificationID>
</ns2:PartnerVehicle>
</ns2:PartnerVehicles>
</PostTelemetryRequest>
For instance, if the input is the string above how can I update the value between <ns2:partnerNotificationID> and </ns2:partnerNotificationID> tags to a new value?

This is the base code:
>>> from xml.etree import ElementTree
>>> s = """<?xml version="1.0"?>
<PostTelemetryRequest xmlns:ns2="urn:com:onstar:global:common:schema:PostTelemetryData:1">
<ns2:PartnerVehicles>
<ns2:PartnerVehicle>
<ns2:partnerNotificationID>251029655</ns2:partnerNotificationID>
</ns2:PartnerVehicle>
</ns2:PartnerVehicles>
</PostTelemetryRequest>
"""
>>> root = ElementTree.fromstring(s)
>>> for e in root.iter():
... if e.tag=='{urn:com:onstar:global:common:schema:PostTelemetryData:1}partnerNotificationID':
... e.text='mytext'
...
>>> etree.ElementTree.tostring(root)
b'<PostTelemetryRequest xmlns:ns0="urn:com:onstar:global:common:schema:PostTelemetryData:1">\n <ns0:PartnerVehicles>\n <ns0:PartnerVehicle>\n <ns0:partnerNotificationID>mytext</ns0:partnerNotificationID>\n </ns0:PartnerVehicle>\n </ns0:PartnerVehicles>\n</PostTelemetryRequest>'

Extract data from ORCID XML files using Python

I ma trying to (offline) parse names from ORCID XML files using Python, which is downloaded from :
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<record:record xmlns:internal="http://www.orcid.org/ns/internal" xmlns:address="http://www.orcid.org/ns/address" xmlns:email="http://www.orcid.org/ns/email" xmlns:history="http://www.orcid.org/ns/history" xmlns:employment="http://www.orcid.org/ns/employment" xmlns:person="http://www.orcid.org/ns/person" xmlns:education="http://www.orcid.org/ns/education" xmlns:other-name="http://www.orcid.org/ns/other-name" xmlns:personal-details="http://www.orcid.org/ns/personal-details" xmlns:bulk="http://www.orcid.org/ns/bulk" xmlns:common="http://www.orcid.org/ns/common" xmlns:record="http://www.orcid.org/ns/record" xmlns:keyword="http://www.orcid.org/ns/keyword" xmlns:activities="http://www.orcid.org/ns/activities" xmlns:deprecated="http://www.orcid.org/ns/deprecated" xmlns:external-identifier="http://www.orcid.org/ns/external-identifier" xmlns:funding="http://www.orcid.org/ns/funding" xmlns:error="http://www.orcid.org/ns/error" xmlns:preferences="http://www.orcid.org/ns/preferences" xmlns:work="http://www.orcid.org/ns/work" xmlns:researcher-url="http://www.orcid.org/ns/researcher-url" xmlns:peer-review="http://www.orcid.org/ns/peer-review" path="/0000-0001-5006-8001">
<common:orcid-identifier>
<common:uri>http://orcid.org/0000-0001-5006-8001</common:uri>
<common:path>0000-0001-5006-8001</common:path>
<common:host>orcid.org</common:host>
</common:orcid-identifier>
<preferences:preferences>
<preferences:locale>en</preferences:locale>
</preferences:preferences>
<person:person path="/0000-0001-5006-8001/person">
<common:last-modified-date>2016-06-06T15:29:36.952Z</common:last-modified-date>
<person:name visibility="public" path="0000-0001-5006-8001">
<common:created-date>2016-04-15T20:45:16.141Z</common:created-date>
<common:last-modified-date>2016-04-15T20:45:16.141Z</common:last-modified-date>
<personal-details:given-names>Marjorie</personal-details:given-names>
<personal-details:family-name>Biffi</personal-details:family-name>
</person:name>
What I want is to extract given-names and family-name: Marjorie Biffi. I am trying to use this code:
>>> import xml.etree.ElementTree as ET
>>> root = ET.parse('f.xml').getroot()
>>> p=root.findall('{http://www.orcid.org/ns/personal-details}personal-details')
>>> p
[]
I can't figure out how to extract names/surname from this XML file. I am trying also yo use XPath/Selector, but no succes.

This will get you the results you want, but by climbing through each one.
p1 = root.find('{http://www.orcid.org/ns/person}person')
name = p1.find('{http://www.orcid.org/ns/person}name')
given_names = name.find('{http://www.orcid.org/ns/personal-details}given-names')
family_name = name.find('{http://www.orcid.org/ns/personal-details}family-name')
print(given_names.text, '', family_name.text)
You could also just go directly to that sublevel with .\\
family_name = root.find('.//{http://www.orcid.org/ns/personal-details}family-name')
Also I just posted here about simpler ways to parse through xml if you're doing more basic operations. These include xmltodict (converting to an OrderedDict) or untangle which is a little inefficient but very quick and easy to learn.

Merge two XML files by matching elements by attribute value

I have two XML files that I'm trying to merge. I looked at other previous questions, but I don't feel like I can solve my problem from reading those. What I think makes my situation unique is that I have to find elements by attribute value and then merge to the opposite file.
I have two files. One is an English translation catalog and the second is a Japanese translation catalog. Pleas see below.
In the code below you'll see the XML has three elements which I will be merging children on - MessageCatalogueEntry, MessageCatalogueFormEntry, and MessageCatalogueFormItemEntry. I have hundreds of files and each file has thousands of lines. There may be more elements than the three I just listed, but I know for sure that all the elements have a "key" attribute.
My plan:
Iterate through File 1 and create a list of all the values of the "key" attribute.
In this example, the list would be key_values = [321, 260, 320]
Next, I'll go through the key_value list one by one.
I'll search File 1 for an element with attribute key=321.
Next, grab the child of the element with key=321 from File 1.
Next, In File 2,find the element with key=321 and add the child element I previously grabbed from File 1.
Next I'll continue the same process looping through the key_values list.
Next, I'll write the new xml root to a file being careful to keep the utf8 encoding.
File 1:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE MessageCatalogue []>
<PackageEntry>
<MessageCatalogue designNotes="Undefined" isPrivate="false" lastKey="362" name="AddKMRichSearchEngineAdmin_AutoTranslationCatalogue" nested="false" version="3.12.0">
<MessageCatalogueEntry key="321">
<MessageCatalogueEntry_loc locale="" message="active"/>
</MessageCatalogueEntry>
<MessageCatalogueFormEntry key="260">
<MessageCatalogueFormEntry_loc locale="" shortTitle="Configuration" title="Spider Configuration"/>
</MessageCatalogueFormEntry>
<MessageCatalogueFormItemEntry key="320">
<MessageCatalogueFormItemEntry_loc hintText="" label="Manage Recognised Phrases" locale="" mnemonic="" scriptText=""/>
</MessageCatalogueFormItemEntry>
</MessageCatalogue>
</PackageEntry>
File 2:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE MessageCatalogue[]>
<PackageEntry>
<MessageCatalogue designNotes="Undefined" isPrivate="false" lastKey="362" name="" nested="false" version="3.12.0">
<MessageCatalogueEntry key="321">
<MessageCatalogueEntry_loc locale="ja" message="アクティブ" />
</MessageCatalogueEntry>
<MessageCatalogueFormEntry key="260">
<MessageCatalogueFormEntry_loc locale="ja" shortTitle="設定" title="Spider Configuration/スパイダー設定" />
</MessageCatalogueFormEntry>
<MessageCatalogueFormItemEntry key="320">
<MessageCatalogueFormItemEntry_loc hintText="" label="認識されたフレーズを管理" locale="ja" mnemonic="" scriptText="" />
</MessageCatalogueFormItemEntry>
</MessageCatalogue>
</PackageEntry>
Output:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE MessageCatalogue []>
<PackageEntry>
<MessageCatalogue designNotes="Undefined" isPrivate="false" lastKey="362" name="AddKMRichSearchEngineAdmin_AutoTranslationCatalogue" nested="false" version="3.12.0">
<MessageCatalogueEntry key="321">
<MessageCatalogueEntry_loc locale="" message="active"/>
<MessageCatalogueEntry_loc locale="ja" message="アクティブ" />
</MessageCatalogueEntry>
<MessageCatalogueFormEntry key="260">
<MessageCatalogueFormEntry_loc locale="" shortTitle="Configuration" title="Spider Configuration"/>
<MessageCatalogueFormEntry_loc locale="ja" shortTitle="設定" title="Spider Configuration/スパイダー設定" />
</MessageCatalogueFormEntry>
<MessageCatalogueFormItemEntry key="320">
<MessageCatalogueFormItemEntry_loc hintText="" label="Manage Recognised Phrases" locale="" mnemonic="" scriptText=""/>
<MessageCatalogueFormItemEntry_loc hintText="" label="認識されたフレーズを管理" locale="ja" mnemonic="" scriptText="" />
</MessageCatalogueFormItemEntry>
</MessageCatalogue>
</PackageEntry>
I'm having trouble just even grabbing elements, nevermind grabbing them by key value. For example, I've been playing with the elementtree library and I wrote this code hoping to get just the MessageCatalogueEntry but I'm only getting their children:
from xml.etree import ElementTree as et
tree_japanese = et.parse('C:\\blah\\blah\\blah\\AddKMRichSearchEngineAdmin_AutoTranslationCatalogue_JA.xml')
root_japanese = tree_japanese.getroot()
MC_japanese = root_japanese.findall("MessageCatalogue")
for x in MC_japanese:
messageCatalogueEntry = x.findall("MessageCatalogueEntry")
for m in messageCatalogueEntry:
print et.tostring(m[0], encoding='utf8')
tree_english = et.parse('C:\\blah\\blah\\blah\\AddKMRichSearchEngineAdmin\\AddKMRichSearchEngineAdmin_AutoTranslationCatalogue.xml')
root_english = tree_english.getroot()
MC_english = root_english.findall("MessageCatalogue")
for x in MC_english:
messageCatalogueEntry = x.findall("MessageCatalogueEntry")
for m in messageCatalogueEntry:
print et.tostring(m[0], encoding='utf8')
Any help would be appreciated. I've been at this for a few work days now and I'm not any closer to finishing than I was when I first started!

Actually, you are getting the MessageCatalogEntry's. The problem is in the print statement. An element acts like a list, so m[0] is the first child of the MessageCatalogEntry. In
messageCatalogueEntry = x.findall("MessageCatalogueEntry")
for m in messageCatalogueEntry:
print et.tostring(m[0], encoding='utf8')
change the print to print et.tostring(m, encoding='utf8') to see the right element.
I personally prefer lxml to elementtree. Assuming you want to associate entries by the 'key' attribute, you could use xpath to index one of the docs and then pull them into other doc.
import lxml.etree
tree_english = lxml.etree.parse('english.xml')
tree_japanese = lxml.etree.parse('japanese.xml')
# index the japanese catalog
j_index = {}
for catalog in tree_japanese.xpath('MessageCatalogue/*[#key]'):
j_index[catalog.get('key')] = catalog
# find catalog entries in english and merge the japanese
for catalog in tree_english.xpath('MessageCatalogue/*[#key]'):
j_catalog = j_index.get(catalog.get('key'))
if j_catalog is not None:
print 'found match'
for child in j_catalog:
print 'add one'
catalog.append(child)
print lxml.etree.tostring(tree_english, pretty_print=True, encoding='utf8')

Finding XML data and Converting it into CSV

JUst need some quick help. Let say i have the following xml formatted like so:
<Solution version="1.0">
<DrillHoles total_holes="238">
<description>
<hole hole_id="1">
<hole collar="5720.44, 3070.94, 2642.19" />
<hole toe="5797.82, 3061.01, 2576.29" />
<hole cost="102.12" />
</hole>
........
EDIT: Here is the code i used to create the hole collar..etc.
for row in reader:
if i > 0:
x1,y1,z1,x2,y2,z2,cost = row
if current_group is None or i != current_group.text:
current_group = SubElement(description, 'hole',{'hole_id':"%s"%i})
collar = SubElement (current_group, 'hole',{'collar':', '.join((x1,y1,z1))}),
toe = SubElement (current_group, 'hole',{'toe':', '.join((x2,y2,z2))})
cost = SubElement(current_group, 'hole',{'cost':cost})
i+=1
and so on, how do i obtain the hole collar, hole toe, and hole cost data. Here is my piece of code so far, i think i am really close.
with open(outfile, 'w') as file_:
writer = csv.writer(file_, delimiter="\t")
for a in zip(root.findall("drillholes/hole/hole collar"),
root.findall("drillholes/hole/hole toe"),
root.findall("drillholes/hole/hole cost")):
writer.writerow([x.text for x in a])
Although my program generates an csv file, the csv file is empty which is why i think this piece of code was unable to obtain the data due to some search and find errors. Can anyone help?

You don't specify, but I assume you're using xml.etree.ElementTree. There are a few issues here:
1) XML is case-sensitive. "drillholes" is not the same thing as "DrillHoles".
2) Your path is missing the "description" element found in your XML.
3) To refer to attributes, you don't use a space, but another path element prefixed with an "#", as in "hole/#collar".
Taking those into consideration, the answer should just be this:
for a in zip(root.findall("DrillHoles/description/hole/hole/#collar"),
root.findall("DrillHoles/description/hole/hole/#toe"),
root.findall("DrillHoles/description/hole/hole/#cost")):
writer.writerow([x.text for x in a])
But it's not. Testing this, it seems findall really doesn't like returning attribute nodes, probably because they don't exist as such in etree's API. So you could do this:
for a in zip(root.findall("DrillHoles/description/hole/hole[#collar]"),
root.findall("DrillHoles/description/hole/hole[#toe]"),
root.findall("DrillHoles/description/hole/hole[#cost]")):
writer.writerow([x[0].get('collar'), x[1].get('toe'), x[2].get('cost')])
But if you have to list out the attributes in the statements in the for loop anyway, personally I'd just do away with the zip and do this:
for a in root.findall("DrillHoles/description/hole"):
writer.writerow([a.find("hole[#"+attr+"]").get(attr) for attr in ("collar", "toe", "cost")])

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python Minidom XML parsing interchangable upper and lower case node names - python

Related

Parsing subfields in XML and merging with matching columns

How to update value between specific xml tags, where input is string, Python?

Extract data from ORCID XML files using Python

Merge two XML files by matching elements by attribute value

Finding XML data and Converting it into CSV

Categories

Resources