How to extract text from xml file using python

I'm trying to extract the phone number from this XML file, but my code is not working and I don't know why. Please have a look at the XML file and my code below. Thank you in advance :)
<ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:voc="urn:hl7-org:v3/voc" xmlns:sdtc="urn:hl7-org:sdtc" xsi:schemaLocation="CDA.xsd">
<realmCode code="US"/>
<languageCode code="en-US"/>
<recordTarget>
<patientRole>
<addr use="HP">
<streetAddressLine>3345 Elm Street</streetAddressLine>
<city>Aurora</city>
<state>CO</state>
<postalCode>80011</postalCode>
<country>US</country>
</addr>
<telecom value="tel:+1(303)-554-8889" use="HP"/>
<patient>
<name use="L">
<given>Janson</given>
<given>J</given>
<family>Example</family>
</name>
</patient>
</patientRole>
</recordTarget>
</ClinicalDocument>
Here is my Python code:
import xml.etree.ElementTree as ET
tree = ET.parse('country.xml')
root = tree.getroot()
print(root)
for country in root.findall('patientRole'):
    number = country.get('telecom')
    print(number)

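Your XML document declares a default namespace (urn:hl7-org:v3), so every tag name in the search path has to be qualified with it. Note also that telecom is a child element of patientRole, not an attribute, so use find() and then read its value attribute:
for country in tree.findall('.//{urn:hl7-org:v3}patientRole'):
    number = country.find('{urn:hl7-org:v3}telecom').attrib['value']
    print(number)
Output:
tel:+1(303)-554-8889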
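The same lookup can also be written with a prefix-to-URI mapping, which avoids repeating the full namespace in every path. This is a minimal sketch on a trimmed copy of the document; the prefix "hl7" is an arbitrary name chosen for this example, not something defined in the XML.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the document from the question
xml_data = """<ClinicalDocument xmlns="urn:hl7-org:v3">
  <recordTarget>
    <patientRole>
      <telecom value="tel:+1(303)-554-8889" use="HP"/>
    </patientRole>
  </recordTarget>
</ClinicalDocument>"""

root = ET.fromstring(xml_data)

# "hl7" is an arbitrary prefix chosen here; only the URI has to match
ns = {"hl7": "urn:hl7-org:v3"}

# Collect the value attribute of every telecom under a patientRole
numbers = [
    telecom.get("value")
    for telecom in root.findall(".//hl7:patientRole/hl7:telecom", ns)
]
print(numbers)
```

Using get("value") instead of attrib['value'] returns None rather than raising KeyError when the attribute is missing.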

Related

How to change sub element in lxml

My xml file:
<?xml version='1.0' encoding='UTF-8'?>
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pain.001.001.03" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CstmrCdtTrfInitn>
<CtgyPurp>. // ---->i want to change this tag
<Cd>SALA</Cd> //-----> no change
</CtgyPurp> // ----> i want to change this tag
</CstmrCdtTrfInitn>
</Document>
I want to rename a tag in the XML file:
<CtgyPurp></CtgyPurp> should become <newName></newName>
I know how to change the value within a tag, but not how to rename the tag itself with lxml.

Something like this should work - note the treatment of namespaces:
from lxml import etree
ctg = """[your xml above]"""
doc = etree.XML(ctg.encode())
ns = {"xx": "urn:iso:std:iso:20022:tech:xsd:pain.001.001.03"}
target = doc.xpath('//xx:CtgyPurp',namespaces=ns)[0]
target.tag = "newName"
print(etree.tostring(doc).decode())
Output:
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pain.001.001.03" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CstmrCdtTrfInitn>
<newName>. // ---->i want to change this tag
<Cd>SALA</Cd> //-----> no change
</newName> // ----> i want to change this tag
</CstmrCdtTrfInitn>
</Document>
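A related variant, sketched here with the standard library's xml.etree.ElementTree rather than lxml: if the renamed element should stay in the document's namespace, the new tag can be written in {uri}name form. The XML below is a trimmed copy of the question's file, and "newName" is the placeholder name from the question.

```python
import xml.etree.ElementTree as ET

NS = "urn:iso:std:iso:20022:tech:xsd:pain.001.001.03"

xml_data = f"""<Document xmlns="{NS}">
  <CstmrCdtTrfInitn>
    <CtgyPurp>
      <Cd>SALA</Cd>
    </CtgyPurp>
  </CstmrCdtTrfInitn>
</Document>"""

# Serialize NS as the default namespace instead of an auto-generated prefix
ET.register_namespace("", NS)

root = ET.fromstring(xml_data)

# Materialize the matches first, then rename each one
for target in list(root.iter(f"{{{NS}}}CtgyPurp")):
    # Keep the namespace and change only the local name
    target.tag = f"{{{NS}}}newName"

renamed_xml = ET.tostring(root, encoding="unicode")
print(renamed_xml)
```

The child element Cd is untouched; only the tag of the matched element changes.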

Transform the CSV to XML in python

I have a scenario where data is extracted from Oracle as a CSV file and then has to be transformed into the desired XML format.
Input CSV File:
Id,SubID,Rank,Size
1,123,1,0.1
1,234,2,0.2
2,456,1,0.1
2,123,2,0.2
Expected XML output:
<AA_ITEMS>
<Id ID="1">
<SubId ID="123">
<Rank>1</Rank>
<Size>0.1</Size>
</SubId>
<SubId ID="234">
<Rank>2</Rank>
<Size>0.2</Size>
</SubId>
</Id>
<Id ID="2">
<SubId ID="456">
<Rank>1</Rank>
<Size>0.1</Size>
</SubId>
<SubId ID="123">
<Rank>2</Rank>
<Size>0.2</Size>
</SubId>
</Id>
</AA_ITEMS>
Note: The CSV file is a daily load and contains around 150K to 200K records.
Please assist. Thanks in advance

There are a couple of ways to approach it, and though some people dislike building XML from a template, I believe it works best:
from itertools import groupby
from lxml import etree
csv_string = """[your csv above]
"""
# First deal with the CSV:
# split it into lines and discard the header row
lines = csv_string.splitlines()[1:]
# Group the lines by the Id column (the first field, not just the first character).
# groupby only merges adjacent rows, so the input must be sorted by Id.
grpfunc = lambda x: x.split(',')[0]
grps = [(key, list(group)) for key, group in groupby(lines, grpfunc)]
# Now convert the whole thing into XML:
xml_string = """
<AA_ITEMS>
"""
for key, grp in grps:
    elem = f' <Id ID="{key}">'
    for g in grp:
        entry = g.split(',')
        # Create an entry template:
        id_tmpl = f"""
<SubId ID="{entry[1]}">
<Rank>{entry[2]}</Rank>
<Size>{entry[3]}</Size>
</SubId>
"""
        elem += id_tmpl
    # Close elem
    elem += """</Id>
"""
    xml_string += elem
# Close the XML string
xml_string += """</AA_ITEMS>"""
# Finally, parse it to show that the output is well-formed XML:
print(etree.tostring(etree.fromstring(xml_string)).decode())
The output should be your expected XML.
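For comparison, here is a sketch of the other approach: building the tree with the standard library instead of string templates, which also takes care of escaping special characters in values. It assumes the rows arrive sorted by Id, as in the sample; the function name csv_to_xml is made up for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET
from itertools import groupby

csv_string = """Id,SubID,Rank,Size
1,123,1,0.1
1,234,2,0.2
2,456,1,0.1
2,123,2,0.2
"""

def csv_to_xml(csv_file):
    """Build the AA_ITEMS tree from an open CSV file sorted by Id."""
    root = ET.Element("AA_ITEMS")
    reader = csv.DictReader(csv_file)
    # groupby merges only adjacent rows, so rows must already be sorted by Id
    for id_value, rows in groupby(reader, key=lambda row: row["Id"]):
        id_elem = ET.SubElement(root, "Id", ID=id_value)
        for row in rows:
            sub = ET.SubElement(id_elem, "SubId", ID=row["SubID"])
            ET.SubElement(sub, "Rank").text = row["Rank"]
            ET.SubElement(sub, "Size").text = row["Size"]
    return root

root = csv_to_xml(io.StringIO(csv_string))
xml_output = ET.tostring(root, encoding="unicode")
print(xml_output)
```

For the daily file, the io.StringIO object would be replaced by a file opened with open(path, newline=""), where path is wherever the extract is stored.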

Parse XML with childs that have different tags in Python

I am trying to parse the following XML data from a file with Python and print only the entries that have a "zip-code" tag, together with their name attribute.
<response status="success" code="19"><result total-count="1" count="1">
<address>
<entry name="studio">
<zip-code>14407</zip-code>
<description>Nothing</description>
</entry>
<entry name="mailbox">
<zip-code>33896</zip-code>
<description>Nothing</description>
</entry>
<entry name="garage">
<zip-code>33746</zip-code>
<description>Tony garage</description>
</entry>
<entry name="playstore">
<url>playstation.com</url>
<description>game download</description>
</entry>
<entry name="gym">
<zip-code>33746</zip-code>
<description>Getronics NOC subnet 2</description>
</entry>
<entry name="e-cigars">
<url>vape.com/24</url>
<description>vape juices</description>
</entry>
</address>
</result></response>
The Python code that I am trying to run is:
from xml.etree import ElementTree as ET
tree = ET.parse('file.xml')
root = tree.getroot()
items = root.iter('entry')
for item in items:
    zip = item.find('zip-code').text
    names = (item.attrib)
    print(' {} {} '.format(
        names, zip
    ))
However, it fails once it gets to the entries without a "zip-code" tag.
How could I make this run?
Thanks in advance

As @AmitaiIrron suggested, XPath can help here.
This code searches the document for elements named zip-code and steps back up to the parent of each one. From the parent you can read the name attribute and pair it with the text of its zip-code element:
for ent in root.findall(".//zip-code/.."):
    print(ent.attrib.get('name'), ent.find('zip-code').text)
Output:
studio 14407
mailbox 33896
garage 33746
gym 33746
Or, as a dictionary:
{ent.attrib.get('name'): ent.find('zip-code').text
 for ent in root.findall(".//zip-code/..")}
Output:
{'studio': '14407', 'mailbox': '33896', 'garage': '33746', 'gym': '33746'}

Your loop should look like this:
# Find all <entry> tags in the hierarchy
for item in root.findall('.//entry'):
    # Try finding a <zip-code> child
    zipc = item.find('./zip-code')
    # If found a child, print data for it
    if zipc is not None:
        names = (item.attrib)
        print(' {} {} '.format(
            names, zipc.text
        ))
It's all a matter of learning to use xpath properly when searching through the XML tree.
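As one illustration of that point, ElementTree's limited XPath support includes a [tag] predicate that keeps only elements having a child with that name, so the None check can move into the path itself. A minimal sketch on a shortened copy of the response:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the response from the question
xml_data = """<response status="success" code="19"><result total-count="1" count="1">
<address>
  <entry name="studio"><zip-code>14407</zip-code></entry>
  <entry name="playstore"><url>playstation.com</url></entry>
  <entry name="gym"><zip-code>33746</zip-code></entry>
</address>
</result></response>"""

root = ET.fromstring(xml_data)

# [zip-code] keeps only <entry> elements that have a <zip-code> child
zip_codes = {
    entry.get("name"): entry.findtext("zip-code")
    for entry in root.findall(".//entry[zip-code]")
}
print(zip_codes)
```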

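If you don't mind using regular expressions, the following works for this file, provided <zip-code> is the first child of each <entry>:
import re
with open('file.xml', 'r') as f:
    text = f.read()
# Anchor the match to a single <entry> so that an entry without a
# <zip-code> cannot be paired with the zip code of a later entry
pattern = r'<entry name="([^"]*)">\s*<zip-code>(.*?)</zip-code>'
matches = re.findall(pattern, text)
for m in matches:
    print("{} {}".format(m[0], m[1]))
and produces the result:
studio 14407
mailbox 33896
garage 33746
gym 33746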

how to parse unstructured xml file using python?

How can I parse this unstructured XML file? I need to get the data inside the patient tag, as well as the title, using ElementTree.
<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">
<templateId root="2.16.840.1.113883.10.20.22.1.1"/>
<id extension="4b78219a-1d02-4e7c-9870-dc7ce3b8a8fb" root="1.2.840.113619.21.1.3214775361124994304.5.1"/>
<code code="34133-9" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="Summarization of episode note"/>
<title>Summary</title>
<effectiveTime value="20170919160921ddfdsdsdsd31-0400"/>
<confidentialityCode code="N" codeSystem="2.16.840.dwdwddsd1.113883.5.25"/>
<recordTarget>
<patientRole><id extension="0" root="1.2.840.113619.21.1.3214775361124994304.2.1.1.2"/>
<addr use="HP"><streetAddressLine>addd2 </streetAddressLine><city>fgfgrtt</city><state>tr</state><postalCode>121213434</postalCode><country>rere</country></addr>
<patient>
<name><given>fname</given><family>lname</family></name>
<administrativeGenderCode code="F" codeSystem="2.16.840.1.113883.5.1" displayName="Female"/>
<birthTime value="19501025"/>
<maritalStatusCode code="M" codeSystem="2434.16.840.1.143434313883.5.2" displayName="M"/>
<languageCommunication>
<languageCode code="eng"/>
<proficiencyLevelCode nullFlavor="NI"/>
<preferenceInd value="true"/>
</languageCommunication>
</patient>
I want the given name, family name, gender and title.

You can use BeautifulSoup (bs4) with the lxml parser library to extract the XML data:
from bs4 import BeautifulSoup
xml_data = '''<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">
<templateId root="2.16.840.1.113883.10.20.22.1.1"/>
<id extension="4b78219a-1d02-4e7c-9870-dc7ce3b8a8fb" root="1.2.840.113619.21.1.3214775361124994304.5.1"/>
<code code="34133-9" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="Summarization of episode note"/>
<title>Summary</title>
<effectiveTime value="20170919160921ddfdsdsdsd31-0400"/>
<confidentialityCode code="N" codeSystem="2.16.840.dwdwddsd1.113883.5.25"/>
<recordTarget>
<patientRole><id extension="0" root="1.2.840.113619.21.1.3214775361124994304.2.1.1.2"/>
<addr use="HP"><streetAddressLine>addd2 </streetAddressLine><city>fgfgrtt</city><state>tr</state><postalCode>121213434</postalCode><country>rere</country></addr>
<patient>
<name><given>fname</given><family>lname</family></name>
<administrativeGenderCode code="F" codeSystem="2.16.840.1.113883.5.1" displayName="Female"/>
<birthTime value="19501025"/>
<maritalStatusCode code="M" codeSystem="2434.16.840.1.143434313883.5.2" displayName="M"/>
<languageCommunication>
<languageCode code="eng"/>
<proficiencyLevelCode nullFlavor="NI"/>
<preferenceInd value="true"/>
</languageCommunication>
</patient>'''
soup = BeautifulSoup(xml_data, "lxml")
title = soup.find("title")
print(title.text.strip())
patient = soup.find("patient")
given = patient.find("given").text.strip()
family = patient.find("family").text.strip()
gender = patient.find("administrativegendercode")['displayname'].strip()
print(given)
print(family)
print(gender)
Output:
Summary
fname
lname
Female
Install the library dependencies:
pip3 install beautifulsoup4==4.7.1
pip3 install lxml==4.3.3

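Or you can simply use lxml. Here is the tutorial that I used: https://lxml.de/tutorial.html
It should look similar to this:
from lxml import etree
# 'file.xml' is a placeholder path; the file must be well-formed XML
root = etree.parse("file.xml").getroot()
ns = {"hl7": "urn:hl7-org:v3"}
patient = root.find(".//hl7:patient", ns)
print(patient.findtext("hl7:name/hl7:given", namespaces=ns))
print(patient.findtext("hl7:name/hl7:family", namespaces=ns))
print(patient.find("hl7:administrativeGenderCode", ns).get("displayName"))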
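Since the question asks for ElementTree specifically, here is a sketch using only the standard library. It assumes a well-formed document (the snippet in the question is cut off before its closing tags), so the XML below is a trimmed, completed copy; the prefix "hl7" is an arbitrary name chosen for this example.

```python
import xml.etree.ElementTree as ET

# Well-formed, trimmed copy of the document from the question
xml_data = """<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument xmlns="urn:hl7-org:v3">
  <title>Summary</title>
  <recordTarget>
    <patientRole>
      <patient>
        <name><given>fname</given><family>lname</family></name>
        <administrativeGenderCode code="F" displayName="Female"/>
      </patient>
    </patientRole>
  </recordTarget>
</ClinicalDocument>"""

# Pass bytes so the parser applies the declared encoding itself
root = ET.fromstring(xml_data.encode("utf-8"))

ns = {"hl7": "urn:hl7-org:v3"}  # the prefix name is arbitrary

patient = root.find(".//hl7:patient", ns)
record = {
    "title": root.findtext("hl7:title", namespaces=ns),
    "given": patient.findtext("hl7:name/hl7:given", namespaces=ns),
    "family": patient.findtext("hl7:name/hl7:family", namespaces=ns),
    "gender": patient.find("hl7:administrativeGenderCode", ns).get("displayName"),
}
print(record)
```

Unlike the BeautifulSoup version, tag and attribute names keep their original case here, because ElementTree parses the input as XML rather than HTML.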

How to extract data from xml file that is deep down the tag

<?xml version="1.0" encoding="utf-8"?>
<ArrayOfRecord xmlns:i="http://www.w3.org/2001/XMLSchema-instance" i:type="Record">
<AvailableCharts>
<Accelerometer>true</Accelerometer>
<Velocity>false</Velocity>
</AvailableCharts>
<Trics>
<Trick>
<EndOffset>PT2M21.835S</EndOffset>
<Values>
<TrickValue>
<Acceleration>26.505801694441629</Acceleration>
<Rotation>0.023379150593228679</Rotation>
</TrickValue>
</Values>
</Trick>
</Trics>
<Values>
<SensorValue>
<accelx>-3.593643144</accelx>
<accely>7.316485176</accely>
</SensorValue>
<SensorValue>
<accelx>0.31103436</accelx>
<accely>7.70408184</accely>
</SensorValue>
</Values>
</ArrayOfRecord>
I am only interested in the 'accelx' and 'accely' values in this data and need to create a CSV file from them.
Update: The code given below breaks when I replace the second line of the XML with the following. Nothing is displayed because of this:
<ArrayOfRecord xmlns:i="http://www.w3.org/2001/XMLSchema-instance" i:type="Record" xmlns="http://schemas">
The following code works with the original file:
import xml.etree.ElementTree as etree
tree = etree.parse(r"C:\Users\data.xml")
root = tree.getroot()
val_of_interest = root.findall("./Values/SensorValue")
for sensor_val in val_of_interest:
    print(sensor_val.find('accelx').text)
    print(sensor_val.find('accely').text)
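The update most likely fails because xmlns="http://schemas" puts every element into that default namespace, so unqualified paths such as ./Values/SensorValue no longer match anything. A minimal sketch of a namespace-aware version that also produces the CSV, using a shortened copy of the data; the prefix "s" and the file name mentioned in the comment are assumptions made for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened copy of the data, with the default namespace from the update
xml_data = """<ArrayOfRecord xmlns="http://schemas">
  <Values>
    <SensorValue><accelx>-3.593643144</accelx><accely>7.316485176</accely></SensorValue>
    <SensorValue><accelx>0.31103436</accelx><accely>7.70408184</accely></SensorValue>
  </Values>
</ArrayOfRecord>"""

root = ET.fromstring(xml_data)
ns = {"s": "http://schemas"}  # "s" is an arbitrary prefix for the default namespace

# One (accelx, accely) tuple per SensorValue directly under the top-level Values
rows = [
    (sv.findtext("s:accelx", namespaces=ns), sv.findtext("s:accely", namespaces=ns))
    for sv in root.findall("./s:Values/s:SensorValue", ns)
]

# Written to an in-memory buffer here; open("out.csv", "w", newline="") would
# write a real file instead ("out.csv" is a hypothetical name)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["accelx", "accely"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Starting the path at ./s:Values keeps the search to the top-level Values element, so the TrickValue entries nested under Trics are not picked up.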
