Extract data from ORCID XML files using Python

I am trying to parse names (offline) from downloaded ORCID XML files using Python. A file starts like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<record:record xmlns:internal="http://www.orcid.org/ns/internal" xmlns:address="http://www.orcid.org/ns/address" xmlns:email="http://www.orcid.org/ns/email" xmlns:history="http://www.orcid.org/ns/history" xmlns:employment="http://www.orcid.org/ns/employment" xmlns:person="http://www.orcid.org/ns/person" xmlns:education="http://www.orcid.org/ns/education" xmlns:other-name="http://www.orcid.org/ns/other-name" xmlns:personal-details="http://www.orcid.org/ns/personal-details" xmlns:bulk="http://www.orcid.org/ns/bulk" xmlns:common="http://www.orcid.org/ns/common" xmlns:record="http://www.orcid.org/ns/record" xmlns:keyword="http://www.orcid.org/ns/keyword" xmlns:activities="http://www.orcid.org/ns/activities" xmlns:deprecated="http://www.orcid.org/ns/deprecated" xmlns:external-identifier="http://www.orcid.org/ns/external-identifier" xmlns:funding="http://www.orcid.org/ns/funding" xmlns:error="http://www.orcid.org/ns/error" xmlns:preferences="http://www.orcid.org/ns/preferences" xmlns:work="http://www.orcid.org/ns/work" xmlns:researcher-url="http://www.orcid.org/ns/researcher-url" xmlns:peer-review="http://www.orcid.org/ns/peer-review" path="/0000-0001-5006-8001">
<common:orcid-identifier>
<common:uri>http://orcid.org/0000-0001-5006-8001</common:uri>
<common:path>0000-0001-5006-8001</common:path>
<common:host>orcid.org</common:host>
</common:orcid-identifier>
<preferences:preferences>
<preferences:locale>en</preferences:locale>
</preferences:preferences>
<person:person path="/0000-0001-5006-8001/person">
<common:last-modified-date>2016-06-06T15:29:36.952Z</common:last-modified-date>
<person:name visibility="public" path="0000-0001-5006-8001">
<common:created-date>2016-04-15T20:45:16.141Z</common:created-date>
<common:last-modified-date>2016-04-15T20:45:16.141Z</common:last-modified-date>
<personal-details:given-names>Marjorie</personal-details:given-names>
<personal-details:family-name>Biffi</personal-details:family-name>
</person:name>
What I want is to extract given-names and family-name: Marjorie Biffi. I am trying to use this code:
>>> import xml.etree.ElementTree as ET
>>> root = ET.parse('f.xml').getroot()
>>> p=root.findall('{http://www.orcid.org/ns/personal-details}personal-details')
>>> p
[]
I can't figure out how to extract the given name and surname from this XML file. I also tried XPath selectors, but with no success.

This will get you the results you want, by climbing down through each level:
p1 = root.find('{http://www.orcid.org/ns/person}person')
name = p1.find('{http://www.orcid.org/ns/person}name')
given_names = name.find('{http://www.orcid.org/ns/personal-details}given-names')
family_name = name.find('{http://www.orcid.org/ns/personal-details}family-name')
print(given_names.text, '', family_name.text)
You could also just go directly to that sublevel with a .// path:
family_name = root.find('.//{http://www.orcid.org/ns/personal-details}family-name')
Also, I just posted here about simpler ways to parse XML if you're doing more basic operations. These include xmltodict (which converts the document to an OrderedDict) or untangle, which is a little inefficient but very quick and easy to learn.
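For example, here is a minimal xmltodict sketch for the ORCID record above (assuming the complete, well-formed file; by default xmltodict keeps the namespace prefixes as dictionary keys):
import xmltodict

# parse the whole record into nested dictionaries; element names keep their
# namespace prefixes as keys
with open('f.xml', 'rb') as f:
    doc = xmltodict.parse(f)

name = doc['record:record']['person:person']['person:name']
print(name['personal-details:given-names'], name['personal-details:family-name'])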

Related

Parsing subfields in XML and merging with matching columns

This is a follow-up question from here. It got lost among the high volume of other topics on this forum; maybe I presented the question in too complicated a way. Since then I have improved and simplified the approach.
To sum up: I'd like to extract data from subfields in multiple XML files and attach them to a new DataFrame at matching positions.
This is a sample XML-1:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXmlPrenos>
<Qfl>1808</Qfl>
<fOVE>13.7</fOVE>
<NetoVolumen>613</NetoVolumen>
<Hv>104.2</Hv>
<energenti>
<energent>
<sifra>energy_e</sifra>
<naziv>EE [kWh]</naziv>
<vrednost>238981</vrednost>
</energent>
<energent>
<sifra>energy_to</sifra>
<naziv>Do</naziv>
<vrednost>16359</vrednost>
</energent>
<rei>
<zavetrovanost>2</zavetrovanost>
<cone>
<cona>
<cona_id>1</cona_id>
<cc_si_cona>1110000</cc_si_cona>
<visina_cone>2.7</visina_cone>
<dolzina_cone>14</dolzina_cone>
</cona>
<cona>
<cona_id>2</cona_id>
<cc_si_cona>120000</cc_si_cona>
</cona>
</rei>
</reiXmlPrenos>
This is a sample XML-2:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXmlPrenos>
<Qfl>1808</Qfl>
<fOVE>13.7</fOVE>
<NetoVolumen>613</NetoVolumen>
<Hv>104.2</Hv>
<energenti>
<energent>
<sifra>energy_e</sifra>
<naziv>EE [kWh]</naziv>
<vrednost>424242</vrednost>
</energent>
<energent>
<sifra>energy_en</sifra>
<naziv>Do</naziv>
<vrednost>29</vrednost>
</energent>
<rei>
<zavetrovanost>2</zavetrovanost>
<cone>
<cona>
<cona_id>1</cona_id>
<cc_si_cona>1110000</cc_si_cona>
<visina_cone>2.7</visina_cone>
<dolzina_cone>14</dolzina_cone>
</cona>
<cona>
<cona_id>2</cona_id>
<cc_si_cona>120000</cc_si_cona>
</cona>
</rei>
</reiXmlPrenos>
My code:
import xml.etree.ElementTree as ETree
import pandas as pd
xmldata = r"C:\...\S1.xml"
prstree = ETree.parse(xmldata)
root = prstree.getroot()
# print(root)
store_items = []
all_items = []
for storeno in root.iter('energent'):
    cona_sifra = storeno.find('sifra').text
    cona_vrednost = storeno.find('vrednost').text
    store_items = [cona_sifra, cona_vrednost]
    all_items.append(store_items)

xmlToDf = pd.DataFrame(all_items, columns=['sifra', 'vrednost'])
print(xmlToDf.to_string(index=False))
This results in:
sifra vrednost
energy_e 238981
energy_to 16359
Which is fine for one example. But I have thousands of XML files, and the wish is to 1) have all results in one row per XML file and 2) differentiate between the different 'sifra' codes.
There can be e.g. energy_e, energy_en, energy_to
So ideally the final df would look like this
xml energy_e energy_en energy_to
xml-1 238981 0 16539
xml-2 424242 29 0
Can it be done?
Simply use pandas.read_xml since the part of the XML you need is a flat part of the document:
energy_df = pd.read_xml("Input.xml", xpath=".//energent") # IF lxml INSTALLED
energy_df = pd.read_xml("Input.xml", xpath=".//energent", parser="etree") # IF lxml NOT INSTALLED
And to bind across many XML files, simply build a list of data frames from a list of XML file paths, adding a column for source file, and then run pandas.concat to row bind all into a single data frame:
xml_files = [...]
energy_dfs = [
    pd.read_xml(f, xpath=".//energent", parser="etree").assign(source=f)
    for f in xml_files
]
energy_long_df = pd.concat(energy_dfs, ignore_index=True)
And for your desired output, you can then pivot the vrednost values into one column per sifra code with pivot_table:
energy_wide_df = energy_long_df.pivot_table(
    values="vrednost", index="source", columns="sifra", aggfunc="sum"
)
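The pivot leaves NaN where a sifra code does not occur in a given file; as a small follow-up (a sketch on the frame built above), you can fill those with 0 and turn the source file back into a regular column to match the desired output:
# replace missing sifra codes with 0 and make the source file a regular column
energy_wide_df = energy_wide_df.fillna(0).reset_index()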
If I understand the situation correctly, this can be done, but because of the complexity I would use lxml here instead of ElementTree.
I'll try to annotate the code a bit, but you'll really have to read up on this.
By the way, the two XML files you posted are not well-formed (the closing tags for <energenti> and <cone> are missing), but assuming that is fixed, try this:
import pandas as pd
from lxml import etree

xmls = [xml_1, xml_2]
# note: for simplicity, I'm using the well-formed versions of the xml strings in your
# question; you'll have to use actual file names and paths
energies = ["xml", "energy_e", "energy_en", "energy_to", "whatever"]
# I just made up some names - you'll have to use the actual ones, of course;
# the first entry is for the file identifier - see below
rows = []
for xml in xmls:
    row = []
    id = "xml-" + str(xmls.index(xml) + 1)
    # this creates the file identifier
    row.append(id)
    root = etree.XML(xml.encode())
    # in real life, you'll have to use the parse() method
    for energy in energies[1:]:
        # the '[1:]' skips the first entry; it's only used as the file identifier
        target = root.xpath(f'//energent[./sifra[.="{energy}"]]/vrednost/text()')
        # note the use of f-strings
        row.extend(target if len(target) > 0 else "0")
    rows.append(row)

print(pd.DataFrame(rows, columns=energies))
Output:
xml energy_e energy_en energy_to whatever
0 xml-1 238981 0 16359 0
1 xml-2 424242 29 0 0

How to update value between specific xml tags, where input is string, Python?

Consider I have a string that looks like the following. Its type is string, but it will always represent an XML document. I'm researching the available Python libraries for XML. How can I update a value between two specific tags? Which library would I use for that?
<?xml version="1.0"?>
<PostTelemetryRequest xmlns:ns2="urn:com:onstar:global:common:schema:PostTelemetryData:1">
<ns2:PartnerVehicles>
<ns2:PartnerVehicle>
<ns2:partnerNotificationID>251029655</ns2:partnerNotificationID>
</ns2:PartnerVehicle>
</ns2:PartnerVehicles>
</PostTelemetryRequest>
For instance, if the input is the string above how can I update the value between <ns2:partnerNotificationID> and </ns2:partnerNotificationID> tags to a new value?
This is the base code:
>>> from xml.etree import ElementTree
>>> s = """<?xml version="1.0"?>
<PostTelemetryRequest xmlns:ns2="urn:com:onstar:global:common:schema:PostTelemetryData:1">
<ns2:PartnerVehicles>
<ns2:PartnerVehicle>
<ns2:partnerNotificationID>251029655</ns2:partnerNotificationID>
</ns2:PartnerVehicle>
</ns2:PartnerVehicles>
</PostTelemetryRequest>
"""
>>> root = ElementTree.fromstring(s)
>>> for e in root.iter():
...     if e.tag == '{urn:com:onstar:global:common:schema:PostTelemetryData:1}partnerNotificationID':
...         e.text = 'mytext'
...
>>> ElementTree.tostring(root)
b'<PostTelemetryRequest xmlns:ns0="urn:com:onstar:global:common:schema:PostTelemetryData:1">\n <ns0:PartnerVehicles>\n <ns0:PartnerVehicle>\n <ns0:partnerNotificationID>mytext</ns0:partnerNotificationID>\n </ns0:PartnerVehicle>\n </ns0:PartnerVehicles>\n</PostTelemetryRequest>'
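Note that ElementTree rewrites the namespace prefix to ns0 when serializing. If you want to keep the original ns2 prefix, one option (a small sketch) is to register it before calling tostring:
>>> ElementTree.register_namespace('ns2', 'urn:com:onstar:global:common:schema:PostTelemetryData:1')
>>> ElementTree.tostring(root)  # the serialized output now uses ns2 instead of ns0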

Appending an xml-node which is read from file breaks pretty_print for adjacent nodes

I'm generating an XML file with Python's etree library (lxml). One node in the generated file is read from an existing XML file. Adding this element breaks pretty_print for the nodes directly before and after it.
import xml.etree.cElementTree as ET
from lxml import etree
root = etree.Element("startNode")
subnode1 = etree.SubElement(root, "SubNode1")
subnode1Child1 = etree.SubElement(subnode1, "subNode1Child1")
etree.SubElement(subnode1Child1, "Child1")
etree.SubElement(subnode1Child1, "Child2")
f = open('/xml_testdata/ext_file.xml','r')
ext_xml = etree.fromstring(f.read())
ext_subnode = ext_xml.find("ExtNode")
subnode1.append(ext_subnode)
subnode1Child2 = etree.SubElement(subnode1, "subNode1Child2")
etree.SubElement(subnode1Child2, "Child1")
etree.SubElement(subnode1Child2, "Child2")
tree = etree.ElementTree(root)
tree.write("testfile.xml", xml_declaration=True, pretty_print=True)
which gives this result:
<startNode>
<SubNode1><subNode1Child1><Child1/><Child2/></subNode1Child1><ExtNode>
<NodeFromExt>
<SubNodeFromExt1/>
</NodeFromExt>
<NodeFromExt>
<SubNodeFromExt2/>
<AnotherSubNodeFromExt2>
<SubSubNode/>
<AllPrettyHere>
<Child/>
</AllPrettyHere>
</AnotherSubNodeFromExt2>
</NodeFromExt>
</ExtNode>
<subNode1Child2><Child1/><Child2/></subNode1Child2></SubNode1>
</startNode>
Not very readable, is it? Even worse when "subNodeChild" contains a lot more subnodes than this example!
Without appending the external elements, it looks like this:
<startNode>
<SubNode1>
<subNode1Child1>
<Child1/>
<Child2/>
</subNode1Child1>
<subNode1Child2>
<Child1/>
<Child2/>
</subNode1Child2>
</SubNode1>
</startNode>
So the problem is caused by appending the external elements!
Is there a way to append the external elements without breaking the pretty_print-output?
You can get nicer pretty-printed output by using a parser object that removes ignorable whitespace when parsing the existing XML file.
Instead of this:
f = open('/xml_testdata/ext_file.xml','r')
ext_xml = etree.fromstring(f.read())
Use this:
f = open('/xml_testdata/ext_file.xml', 'r')
parser = etree.XMLParser(remove_blank_text=True)
ext_xml = etree.fromstring(f.read(), parser)
See also:
http://lxml.de/api/lxml.etree.XMLParser-class.html
http://lxml.de/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output
I've been able to somewhat mitigate the effect by creating "ExtNode" with etree.SubElement and appending the elements inside it:
ext_node = etree.SubElement(subnode1, "ExtNode")
for element in ext_xml.findall("ExtNode/NodeFromExt"):
    ext_node.append(element)
which has this result:
<startNode>
<SubNode1>
<subNode1Child1>
<Child1/>
<Child2/>
</subNode1Child1>
<ExtNode><NodeFromExt>
<SubNodeFromExt1/>
</NodeFromExt>
<NodeFromExt>
<SubNodeFromExt2/>
<AnotherSubNodeFromExt2>
<SubSubNode/>
<AllPrettyHere>
<Child/>
</AllPrettyHere>
</AnotherSubNodeFromExt2>
</NodeFromExt>
</ExtNode>
<subNode1Child2>
<Child1/>
<Child2/>
</subNode1Child2>
</SubNode1>
</startNode>
Not perfect, but at least human readable (which is the whole point of pretty_print, right?).
To satisfy my OCD, I'd still be interested if there is a way to get a flawlessly formatted file!
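For what it's worth, a minimal sketch of the remove_blank_text suggestion from the answer above, applied to the original code, should remove the remaining glitch as well (assuming the same file path):
parser = etree.XMLParser(remove_blank_text=True)
with open('/xml_testdata/ext_file.xml', 'r') as f:
    ext_xml = etree.fromstring(f.read(), parser)
# the parsed elements no longer carry whitespace-only text, so pretty_print
# can re-indent them together with the generated nodes
subnode1.append(ext_xml.find("ExtNode"))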

Python Minidom XML parsing interchangable upper and lower case node names

Hoping for help here. I am writing a small script to pull info from a data file. The following is the start of the XML... it's quite big.
<?xml version="1.0" encoding="ISO-8859-1"?>
<flatNet Version="1" id="{014852F8-3010-4a5f-8215-8F47B000EA60}" sch="Vbt-mbb-1.4.scs">
<partNumbers>
<PartNumber Name="PN_LIB_NAME" Version="1">
<Properties Ver="1">
<a Key="PARTNAME" Value="PART_NAME"/>
<a Key="ALTPARTREF" Value="PART_REF"/>
My problem is that in some of the files I need to parse, the node name is capitalized ("A") and in others it is lower case ("a").
How do I get the node name (either "a" or "A") into a variable so I can use it for a function?
Else I am stuck changing it manually every time I want to parse a new file depending on what that file contains.
Thanks heaps in advance!
You will either have to build two node lists and work with them individually:
a1 = doc.getElementsByTagName('a')
a2 = doc.getElementsByTagName('A')
#do stuff with a1 and a2
or normalize the tags along the lines of this:
>>> import xml.dom.minidom as minidom
>>> xml = minidom.parseString('<root><head>hi!</head><body>text<a>blah</a><A>blahblah</A></body></root>')
>>> allEls = xml.getElementsByTagName('*')
>>> for i in allEls:
...     if i.localName.lower() == 'a':
...         print(i.toxml())
...
<a>blah</a>
<A>blahblah</A>
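If you really want the tag name itself in a variable, as asked, one possible sketch is to check which spelling a given file actually uses and reuse it:
# detect which casing this particular file uses and keep it for later lookups
tag_name = 'a' if doc.getElementsByTagName('a') else 'A'
property_nodes = doc.getElementsByTagName(tag_name)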

How would one remove the CDATA tags from but preserve the actual data in Python using LXML or BeautifulSoup

I have some XML that I am parsing with BeautifulSoup. I pull the CDATA out with the following code, but I only want the data and not the CDATA tags.
myXML = open("c:\myfile.xml", "r")
soup = BeautifulSoup(myXML)
data = soup.find(text=re.compile("CDATA"))
print data
<![CDATA[TEST DATA]]>
What I would like to see is the following output:
TEST DATA
I don't care if the solution is in LXML or BeautifulSoup. Just want the best or easiest way to get the job done. Thanks!
Here is a solution:
from lxml import etree

parser = etree.XMLParser(strip_cdata=False)
root = etree.parse(self.param1, parser)  # self.param1 is the source XML file
data = root.findall('./config/script')
for item in data:  # iterate through the elements containing CDATA
    print(item.text)
Based on the lxml docs:
>>> from lxml import etree
>>> parser = etree.XMLParser(strip_cdata=False)
>>> root = etree.XML('<root><data><![CDATA[test]]></data></root>', parser)
>>> data = root.findall('data')
>>> for item in data:  # iterate through the elements containing CDATA
...     print(item.text)
...
test  # just the text of <![CDATA[test]]>
This might be the best way to get the job done, depending on how amenable your xml structure is to this approach.
Based on BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>> s = '<xml> <MsgType><![CDATA[text]]></MsgType> </xml>'
>>> soup = BeautifulSoup(s, "xml")
>>> soup.MsgType.get_text()
u'text'
>>> soup.MsgType.string
u'text'
>>> soup.MsgType.text
u'text'
As a result, it just prints the text from MsgType.
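Applied to the file from the question, that would look roughly like this (a sketch; the Data tag name is a placeholder, since the question doesn't show which element wraps the CDATA):
from bs4 import BeautifulSoup

# the "xml" parser (backed by lxml) unwraps CDATA sections automatically,
# so only the contained text remains
with open(r"c:\myfile.xml") as myXML:
    soup = BeautifulSoup(myXML, "xml")

# "Data" is a placeholder tag name; use whichever element holds the CDATA
print(soup.find("Data").get_text())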
