Extract data from ORCID XML files using Python

I am trying to parse names (offline) from downloaded ORCID XML files using Python. A file starts like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<record:record xmlns:internal="http://www.orcid.org/ns/internal" xmlns:address="http://www.orcid.org/ns/address" xmlns:email="http://www.orcid.org/ns/email" xmlns:history="http://www.orcid.org/ns/history" xmlns:employment="http://www.orcid.org/ns/employment" xmlns:person="http://www.orcid.org/ns/person" xmlns:education="http://www.orcid.org/ns/education" xmlns:other-name="http://www.orcid.org/ns/other-name" xmlns:personal-details="http://www.orcid.org/ns/personal-details" xmlns:bulk="http://www.orcid.org/ns/bulk" xmlns:common="http://www.orcid.org/ns/common" xmlns:record="http://www.orcid.org/ns/record" xmlns:keyword="http://www.orcid.org/ns/keyword" xmlns:activities="http://www.orcid.org/ns/activities" xmlns:deprecated="http://www.orcid.org/ns/deprecated" xmlns:external-identifier="http://www.orcid.org/ns/external-identifier" xmlns:funding="http://www.orcid.org/ns/funding" xmlns:error="http://www.orcid.org/ns/error" xmlns:preferences="http://www.orcid.org/ns/preferences" xmlns:work="http://www.orcid.org/ns/work" xmlns:researcher-url="http://www.orcid.org/ns/researcher-url" xmlns:peer-review="http://www.orcid.org/ns/peer-review" path="/0000-0001-5006-8001">
<common:orcid-identifier>
<common:uri>http://orcid.org/0000-0001-5006-8001</common:uri>
<common:path>0000-0001-5006-8001</common:path>
<common:host>orcid.org</common:host>
</common:orcid-identifier>
<preferences:preferences>
<preferences:locale>en</preferences:locale>
</preferences:preferences>
<person:person path="/0000-0001-5006-8001/person">
<common:last-modified-date>2016-06-06T15:29:36.952Z</common:last-modified-date>
<person:name visibility="public" path="0000-0001-5006-8001">
<common:created-date>2016-04-15T20:45:16.141Z</common:created-date>
<common:last-modified-date>2016-04-15T20:45:16.141Z</common:last-modified-date>
<personal-details:given-names>Marjorie</personal-details:given-names>
<personal-details:family-name>Biffi</personal-details:family-name>
</person:name>
What I want is to extract given-names and family-name: Marjorie Biffi. I am trying to use this code:
>>> import xml.etree.ElementTree as ET
>>> root = ET.parse('f.xml').getroot()
>>> p=root.findall('{http://www.orcid.org/ns/personal-details}personal-details')
>>> p
[]
I can't figure out how to extract the given name and surname from this XML file. I also tried XPath selectors, but with no success.

This will get you the results you want, by climbing down through each level:
p1 = root.find('{http://www.orcid.org/ns/person}person')
name = p1.find('{http://www.orcid.org/ns/person}name')
given_names = name.find('{http://www.orcid.org/ns/personal-details}given-names')
family_name = name.find('{http://www.orcid.org/ns/personal-details}family-name')
print(given_names.text, '', family_name.text)
You could also just go directly to that sublevel with a .// path:
family_name = root.find('.//{http://www.orcid.org/ns/personal-details}family-name')
Also, I just posted here about simpler ways to parse XML if you're doing more basic operations. These include xmltodict (which converts the document to an OrderedDict) or untangle, which is a little inefficient but very quick and easy to learn.
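For example, here is a minimal xmltodict sketch for the ORCID record above (assuming the complete, well-formed file; by default xmltodict keeps the namespace prefixes as dictionary keys):
import xmltodict

# parse the whole record into nested dictionaries; element names keep their
# namespace prefixes as keys
with open('f.xml', 'rb') as f:
    doc = xmltodict.parse(f)

name = doc['record:record']['person:person']['person:name']
print(name['personal-details:given-names'], name['personal-details:family-name'])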

Related

Parsing subfields in XML and merging with matching columns

This is a follow-up question from here. It got lost among the high volume of other topics on this forum; maybe I presented the question in too complicated a way. Since then I have improved and simplified the approach.
To sum up: I'd like to extract data from subfields in multiple XML files and attach them to a new DataFrame at matching positions.
This is a sample XML-1:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXmlPrenos>
<Qfl>1808</Qfl>
<fOVE>13.7</fOVE>
<NetoVolumen>613</NetoVolumen>
<Hv>104.2</Hv>
<energenti>
<energent>
<sifra>energy_e</sifra>
<naziv>EE [kWh]</naziv>
<vrednost>238981</vrednost>
</energent>
<energent>
<sifra>energy_to</sifra>
<naziv>Do</naziv>
<vrednost>16359</vrednost>
</energent>
<rei>
<zavetrovanost>2</zavetrovanost>
<cone>
<cona>
<cona_id>1</cona_id>
<cc_si_cona>1110000</cc_si_cona>
<visina_cone>2.7</visina_cone>
<dolzina_cone>14</dolzina_cone>
</cona>
<cona>
<cona_id>2</cona_id>
<cc_si_cona>120000</cc_si_cona>
</cona>
</rei>
</reiXmlPrenos>
This is a sample XML-2:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXmlPrenos>
<Qfl>1808</Qfl>
<fOVE>13.7</fOVE>
<NetoVolumen>613</NetoVolumen>
<Hv>104.2</Hv>
<energenti>
<energent>
<sifra>energy_e</sifra>
<naziv>EE [kWh]</naziv>
<vrednost>424242</vrednost>
</energent>
<energent>
<sifra>energy_en</sifra>
<naziv>Do</naziv>
<vrednost>29</vrednost>
</energent>
<rei>
<zavetrovanost>2</zavetrovanost>
<cone>
<cona>
<cona_id>1</cona_id>
<cc_si_cona>1110000</cc_si_cona>
<visina_cone>2.7</visina_cone>
<dolzina_cone>14</dolzina_cone>
</cona>
<cona>
<cona_id>2</cona_id>
<cc_si_cona>120000</cc_si_cona>
</cona>
</rei>
</reiXmlPrenos>
My code:
import xml.etree.ElementTree as ETree
import pandas as pd
xmldata = r"C:\...\S1.xml"
prstree = ETree.parse(xmldata)
root = prstree.getroot()
# print(root)
store_items = []
all_items = []
for storeno in root.iter('energent'):
    cona_sifra = storeno.find('sifra').text
    cona_vrednost = storeno.find('vrednost').text
    store_items = [cona_sifra, cona_vrednost]
    all_items.append(store_items)

xmlToDf = pd.DataFrame(all_items, columns=['sifra', 'vrednost'])
print(xmlToDf.to_string(index=False))
This results in:
sifra vrednost
energy_e 238981
energy_to 16359
Which is fine for one example. But I have thousands of XML files, and the wish is to 1) have all results in one row per XML file and 2) differentiate between the different 'sifra' codes.
There can be e.g. energy_e, energy_en, energy_to
So ideally the final df would look like this
xml energy_e energy_en energy_to
xml-1 238981 0 16539
xml-2 424242 29 0
Can it be done?
Simply use pandas.read_xml since the part of the XML you need is a flat part of the document:
energy_df = pd.read_xml("Input.xml", xpath=".//energent") # IF lxml INSTALLED
energy_df = pd.read_xml("Input.xml", xpath=".//energent", parser="etree") # IF lxml NOT INSTALLED
And to bind across many XML files, simply build a list of data frames from a list of XML file paths, adding a column for source file, and then run pandas.concat to row bind all into a single data frame:
xml_files = [...]
energy_dfs = [
    pd.read_xml(f, xpath=".//energent", parser="etree").assign(source=f)
    for f in xml_files
]
energy_long_df = pd.concat(energy_dfs, ignore_index=True)
And for your desired output, you can then pivot the vrednost values into one column per sifra code with pivot_table:
energy_wide_df = energy_long_df.pivot_table(
    values="vrednost", index="source", columns="sifra", aggfunc="sum"
)
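The pivot leaves NaN where a sifra code does not occur in a given file; as a small follow-up (a sketch on the frame built above), you can fill those with 0 and turn the source file back into a regular column to match the desired output:
# replace missing sifra codes with 0 and make the source file a regular column
energy_wide_df = energy_wide_df.fillna(0).reset_index()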
If I understand the situation correctly, this can be done, but because of the complexity I would use lxml here instead of ElementTree.
I'll try to annotate the code a bit, but you'll really have to read up on this.
By the way, the two XML files you posted are not well-formed (the closing tags for <energenti> and <cone> are missing), but assuming that is fixed, try this:
import pandas as pd
from lxml import etree

xmls = [xml_1, xml_2]
# note: for simplicity, I'm using the well-formed versions of the xml strings in your
# question; you'll have to use actual file names and paths
energies = ["xml", "energy_e", "energy_en", "energy_to", "whatever"]
# I just made up some names - you'll have to use the actual ones, of course;
# the first entry is for the file identifier - see below
rows = []
for xml in xmls:
    row = []
    id = "xml-" + str(xmls.index(xml) + 1)
    # this creates the file identifier
    row.append(id)
    root = etree.XML(xml.encode())
    # in real life, you'll have to use the parse() method
    for energy in energies[1:]:
        # the '[1:]' skips the first entry; it's only used as the file identifier
        target = root.xpath(f'//energent[./sifra[.="{energy}"]]/vrednost/text()')
        # note the use of f-strings
        row.extend(target if len(target) > 0 else "0")
    rows.append(row)

print(pd.DataFrame(rows, columns=energies))
Output:
xml energy_e energy_en energy_to whatever
0 xml-1 238981 0 16359 0
1 xml-2 424242 29 0 0

How to update value between specific xml tags, where input is string, Python?

Consider I have a string that looks like the following. Its type is string, but it will always represent an XML document. I'm researching the available Python libraries for XML. How can I update a value between two specific tags? Which library would I use for that?
<?xml version="1.0"?>
<PostTelemetryRequest xmlns:ns2="urn:com:onstar:global:common:schema:PostTelemetryData:1">
<ns2:PartnerVehicles>
<ns2:PartnerVehicle>
<ns2:partnerNotificationID>251029655</ns2:partnerNotificationID>
</ns2:PartnerVehicle>
</ns2:PartnerVehicles>
</PostTelemetryRequest>
For instance, if the input is the string above how can I update the value between <ns2:partnerNotificationID> and </ns2:partnerNotificationID> tags to a new value?
This is the base code:
>>> from xml.etree import ElementTree
>>> s = """<?xml version="1.0"?>
<PostTelemetryRequest xmlns:ns2="urn:com:onstar:global:common:schema:PostTelemetryData:1">
<ns2:PartnerVehicles>
<ns2:PartnerVehicle>
<ns2:partnerNotificationID>251029655</ns2:partnerNotificationID>
</ns2:PartnerVehicle>
</ns2:PartnerVehicles>
</PostTelemetryRequest>
"""
>>> root = ElementTree.fromstring(s)
>>> for e in root.iter():
...     if e.tag == '{urn:com:onstar:global:common:schema:PostTelemetryData:1}partnerNotificationID':
...         e.text = 'mytext'
...
>>> ElementTree.tostring(root)
b'<PostTelemetryRequest xmlns:ns0="urn:com:onstar:global:common:schema:PostTelemetryData:1">\n <ns0:PartnerVehicles>\n <ns0:PartnerVehicle>\n <ns0:partnerNotificationID>mytext</ns0:partnerNotificationID>\n </ns0:PartnerVehicle>\n </ns0:PartnerVehicles>\n</PostTelemetryRequest>'
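Note that ElementTree rewrites the namespace prefix to ns0 when serializing. If you want to keep the original ns2 prefix, one option (a small sketch) is to register it before calling tostring:
>>> ElementTree.register_namespace('ns2', 'urn:com:onstar:global:common:schema:PostTelemetryData:1')
>>> ElementTree.tostring(root)  # the serialized output now uses ns2 instead of ns0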

Appending an xml-node which is read from file breaks pretty_print for adjacent nodes

I'm generating an XML file with Python's etree library (lxml). One node in the generated file is read from an existing XML file. Adding this element breaks pretty_print for the nodes directly before and after it.
import xml.etree.cElementTree as ET
from lxml import etree
root = etree.Element("startNode")
subnode1 = etree.SubElement(root, "SubNode1")
subnode1Child1 = etree.SubElement(subnode1, "subNode1Child1")
etree.SubElement(subnode1Child1, "Child1")
etree.SubElement(subnode1Child1, "Child2")
f = open('/xml_testdata/ext_file.xml','r')
ext_xml = etree.fromstring(f.read())
ext_subnode = ext_xml.find("ExtNode")
subnode1.append(ext_subnode)
subnode1Child2 = etree.SubElement(subnode1, "subNode1Child2")
etree.SubElement(subnode1Child2, "Child1")
etree.SubElement(subnode1Child2, "Child2")
tree = etree.ElementTree(root)
tree.write("testfile.xml", xml_declaration=True, pretty_print=True)
which gives this result:
<startNode>
<SubNode1><subNode1Child1><Child1/><Child2/></subNode1Child1><ExtNode>
<NodeFromExt>
<SubNodeFromExt1/>
</NodeFromExt>
<NodeFromExt>
<SubNodeFromExt2/>
<AnotherSubNodeFromExt2>
<SubSubNode/>
<AllPrettyHere>
<Child/>
</AllPrettyHere>
</AnotherSubNodeFromExt2>
</NodeFromExt>
</ExtNode>
<subNode1Child2><Child1/><Child2/></subNode1Child2></SubNode1>
</startNode>
Not very readable, is it? Even worse when "subNodeChild" contains a lot more subnodes than this example!
Without appending the external elements, it looks like this:
<startNode>
<SubNode1>
<subNode1Child1>
<Child1/>
<Child2/>
</subNode1Child1>
<subNode1Child2>
<Child1/>
<Child2/>
</subNode1Child2>
</SubNode1>
</startNode>
So the problem is caused by appending the external elements!
Is there a way to append the external elements without breaking the pretty_print-output?
You can get nicer pretty-printed output by using a parser object that removes ignorable whitespace when parsing the existing XML file.
Instead of this:
f = open('/xml_testdata/ext_file.xml','r')
ext_xml = etree.fromstring(f.read())
Use this:
f = open('/xml_testdata/ext_file.xml', 'r')
parser = etree.XMLParser(remove_blank_text=True)
ext_xml = etree.fromstring(f.read(), parser)
See also:
http://lxml.de/api/lxml.etree.XMLParser-class.html
http://lxml.de/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output
I've been able to somewhat mitigate the effect by creating "ExtNode" with etree.SubElement and appending the elements inside it:
ext_node = etree.SubElement(subnode1, "ExtNode")
for element in ext_xml.findall("ExtNode/NodeFromExt"):
    ext_node.append(element)
which has this result:
<startNode>
<SubNode1>
<subNode1Child1>
<Child1/>
<Child2/>
</subNode1Child1>
<ExtNode><NodeFromExt>
<SubNodeFromExt1/>
</NodeFromExt>
<NodeFromExt>
<SubNodeFromExt2/>
<AnotherSubNodeFromExt2>
<SubSubNode/>
<AllPrettyHere>
<Child/>
</AllPrettyHere>
</AnotherSubNodeFromExt2>
</NodeFromExt>
</ExtNode>
<subNode1Child2>
<Child1/>
<Child2/>
</subNode1Child2>
</SubNode1>
</startNode>
Not perfect, but at least human readable (which is the whole point of pretty_print, right?).
To satisfy my OCD, I'd still be interested if there is a way to get a flawlessly formatted file!
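For what it's worth, a minimal sketch of the remove_blank_text suggestion from the answer above, applied to the original code, should remove the remaining glitch as well (assuming the same file path):
parser = etree.XMLParser(remove_blank_text=True)
with open('/xml_testdata/ext_file.xml', 'r') as f:
    ext_xml = etree.fromstring(f.read(), parser)
# the parsed elements no longer carry whitespace-only text, so pretty_print
# can re-indent them together with the generated nodes
subnode1.append(ext_xml.find("ExtNode"))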

Python Minidom XML parsing interchangable upper and lower case node names

Hoping for help here. I am writing a small script to pull info from a data file. The following is the start of the XML... it's quite big.
<?xml version="1.0" encoding="ISO-8859-1"?>
<flatNet Version="1" id="{014852F8-3010-4a5f-8215-8F47B000EA60}" sch="Vbt-mbb-1.4.scs">
<partNumbers>
<PartNumber Name="PN_LIB_NAME" Version="1">
<Properties Ver="1">
<a Key="PARTNAME" Value="PART_NAME"/>
<a Key="ALTPARTREF" Value="PART_REF"/>
My problem is that in some of the files I need to parse, the node name is capitalized ("A") and in others it is lower case ("a").
How do I get the node name (either "a" or "A") into a variable so I can use it for a function?
Else I am stuck changing it manually every time I want to parse a new file depending on what that file contains.
Thanks heaps in advance!
You will either have to build two node lists and work with them individually:
a1 = doc.getElementsByTagName('a')
a2 = doc.getElementsByTagName('A')
#do stuff with a1 and a2
or normalize the tags along the lines of this:
>>> import xml.dom.minidom as minidom
>>> xml = minidom.parseString('<root><head>hi!</head><body>text<a>blah</a><A>blahblah</A></body></root>')
>>> allEls = xml.getElementsByTagName('*')
>>> for i in allEls:
...     if i.localName.lower() == 'a':
...         print(i.toxml())
...
<a>blah</a>
<A>blahblah</A>
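If you really want the tag name itself in a variable, as asked, one possible sketch is to check which spelling a given file actually uses and reuse it:
# detect which casing this particular file uses and keep it for later lookups
tag_name = 'a' if doc.getElementsByTagName('a') else 'A'
property_nodes = doc.getElementsByTagName(tag_name)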

How would one remove the CDATA tags from but preserve the actual data in Python using LXML or BeautifulSoup

I have some XML that I am parsing with BeautifulSoup. I pull the CDATA out with the following code, but I only want the data and not the CDATA tags.
myXML = open("c:\myfile.xml", "r")
soup = BeautifulSoup(myXML)
data = soup.find(text=re.compile("CDATA"))
print data
<![CDATA[TEST DATA]]>
What I would like to see is the following output:
TEST DATA
I don't care if the solution is in LXML or BeautifulSoup. Just want the best or easiest way to get the job done. Thanks!
Here is a solution:
from lxml import etree

parser = etree.XMLParser(strip_cdata=False)
root = etree.parse(self.param1, parser)  # self.param1 is the source XML file
data = root.findall('./config/script')
for item in data:  # iterate through the elements containing CDATA
    print(item.text)
Based on the lxml docs:
>>> from lxml import etree
>>> parser = etree.XMLParser(strip_cdata=False)
>>> root = etree.XML('<root><data><![CDATA[test]]></data></root>', parser)
>>> data = root.findall('data')
>>> for item in data:  # iterate through the elements containing CDATA
...     print(item.text)
...
test  # just the text of <![CDATA[test]]>
This might be the best way to get the job done, depending on how amenable your xml structure is to this approach.
Based on BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>> s = '<xml> <MsgType><![CDATA[text]]></MsgType> </xml>'
>>> soup = BeautifulSoup(s, "xml")
>>> soup.MsgType.get_text()
u'text'
>>> soup.MsgType.string
u'text'
>>> soup.MsgType.text
u'text'
As a result, it just prints the text from MsgType.
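Applied to the file from the question, that would look roughly like this (a sketch; the Data tag name is a placeholder, since the question doesn't show which element wraps the CDATA):
from bs4 import BeautifulSoup

# the "xml" parser (backed by lxml) unwraps CDATA sections automatically,
# so only the contained text remains
with open(r"c:\myfile.xml") as myXML:
    soup = BeautifulSoup(myXML, "xml")

# "Data" is a placeholder tag name; use whichever element holds the CDATA
print(soup.find("Data").get_text())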
