how to parse unstructured xml file using python?

how to parse unstructured xml file using python? - python

How can i parse unstructured xml file? i need to get data inside patient tag and title using elementTree.
<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">
<templateId root="2.16.840.1.113883.10.20.22.1.1"/>
<id extension="4b78219a-1d02-4e7c-9870-dc7ce3b8a8fb" root="1.2.840.113619.21.1.3214775361124994304.5.1"/>
<code code="34133-9" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="Summarization of episode note"/>
<title>Summary</title>
<effectiveTime value="20170919160921ddfdsdsdsd31-0400"/>
<confidentialityCode code="N" codeSystem="2.16.840.dwdwddsd1.113883.5.25"/>
<recordTarget>
<patientRole><id extension="0" root="1.2.840.113619.21.1.3214775361124994304.2.1.1.2"/>
<addr use="HP"><streetAddressLine>addd2 </streetAddressLine><city>fgfgrtt</city><state>tr</state><postalCode>121213434</postalCode><country>rere</country></addr>
<patient>
<name><given>fname</given><family>lname</family></name>
<administrativeGenderCode code="F" codeSystem="2.16.840.1.113883.5.1" displayName="Female"/>
<birthTime value="19501025"/>
<maritalStatusCode code="M" codeSystem="2434.16.840.1.143434313883.5.2" displayName="M"/>
<languageCommunication>
<languageCode code="eng"/>
<proficiencyLevelCode nullFlavor="NI"/>
<preferenceInd value="true"/>
</languageCommunication>
</patient>
i want given name , family name , gender and title.

Using BeautifulSoup bs4 and lxml parser library to scrape xml data.
from bs4 import BeautifulSoup
xml_data = '''<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">
<templateId root="2.16.840.1.113883.10.20.22.1.1"/>
<id extension="4b78219a-1d02-4e7c-9870-dc7ce3b8a8fb" root="1.2.840.113619.21.1.3214775361124994304.5.1"/>
<code code="34133-9" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="Summarization of episode note"/>
<title>Summary</title>
<effectiveTime value="20170919160921ddfdsdsdsd31-0400"/>
<confidentialityCode code="N" codeSystem="2.16.840.dwdwddsd1.113883.5.25"/>
<recordTarget>
<patientRole><id extension="0" root="1.2.840.113619.21.1.3214775361124994304.2.1.1.2"/>
<addr use="HP"><streetAddressLine>addd2 </streetAddressLine><city>fgfgrtt</city><state>tr</state><postalCode>121213434</postalCode><country>rere</country></addr>
<patient>
<name><given>fname</given><family>lname</family></name>
<administrativeGenderCode code="F" codeSystem="2.16.840.1.113883.5.1" displayName="Female"/>
<birthTime value="19501025"/>
<maritalStatusCode code="M" codeSystem="2434.16.840.1.143434313883.5.2" displayName="M"/>
<languageCommunication>
<languageCode code="eng"/>
<proficiencyLevelCode nullFlavor="NI"/>
<preferenceInd value="true"/>
</languageCommunication>
</patient>'''
soup = BeautifulSoup(xml_data, "lxml")
title = soup.find("title")
print(title.text.strip())
patient = soup.find("patient")
given = patient.find("given").text.strip()
family = patient.find("family").text.strip()
gender = patient.find("administrativegendercode")['displayname'].strip()
print(given)
print(family)
print(gender)
O/P:
Summary
fname
lname
Female
Install library dependency:
pip3 install beautifulsoup4==4.7.1
pip3 install lxml==4.3.3

Or you can simply use lxml. Here is tutorial that I used: https://lxml.de/tutorial.html
But it should be similar to:
from lxml import etree
root = etree.Element("patient")
print(root.find("given"))
print(root.find("family"))
print(root.find("give"))

Related

filter non-nested tag values from XML

I have an xml that looks like this.
<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
<offer id="11" parent_id="12">
<name>Alpha</name>
<pos>697</pos>
<kat_pis>
<pos kat="2">112</pos>
</kat_pis>
</offer>
<offer id="12" parent_id="31">
<name>Beta</name>
<pos>099</pos>
<kat_pis>
<pos kat="2">113</pos>
</kat_pis>
</offer>
</details>
</main_heading>
I am parsing it using BeautifulSoup. Upon doing this:
soup = BeautifulSoup(file, 'xml')
pos = []
for i in (soup.find_all('pos')):
pos.append(i.text)
I get a list of all POS tag values, even the ones that are nested within the tag kat_pis.
So I get (697, 112, 099. 113).
However, I only want to get the POS values of the non-nested tags.
Expected desired result is (697, 099).
How can I achieve this?

Here is one way of getting those first level pos:
from bs4 import BeautifulSoup as bs
xml_doc = '''<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
<offer id="11" parent_id="12">
<name>Alpha</name>
<pos>697</pos>
<kat_pis>
<pos kat="2">112</pos>
</kat_pis>
</offer>
<offer id="12" parent_id="31">
<name>Beta</name>
<pos>099</pos>
<kat_pis>
<pos kat="2">113</pos>
</kat_pis>
</offer>
</details>
</main_heading>'''
soup = bs(xml_doc, 'xml')
pos = []
for i in (soup.select('offer > pos')):
pos.append(i.text)
print(pos)
Result in terminal:
['697', '099']

I think the best solution would be to abandon BeautifulSoup for an XML parser with XPath support, like lxml. Using XPath expressions, you can ask for only those tos elements that are children of offer elements:
from lxml import etree
with open('data.xml') as fd:
doc = etree.parse(fd)
pos = []
for ele in (doc.xpath('//offer/pos')):
pos.append(ele.text)
print(pos)
Given your example input, the above code prints:
['697', '099']

How to change sub element in lxml

My xml file:
<?xml version='1.0' encoding='UTF-8'?>
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pain.001.001.03" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CstmrCdtTrfInitn>
<CtgyPurp>. // ---->i want to change this tag
<Cd>SALA</Cd> //-----> no change
</CtgyPurp> // ----> i want to change this tag
</CstmrCdtTrfInitn>
</Document>
I want to make a change in the xml file:
<CtgyPurp></CtgyPurp> change in <newName></newName>
I know how to change the value within a tag but not how to change/modify the tag itself with lxml

Something like this should work - note the treatment of namespaces:
from lxml import etree
ctg = """[your xml above"]"""
doc = etree.XML(ctg.encode())
ns = {"xx": "urn:iso:std:iso:20022:tech:xsd:pain.001.001.03"}
target = doc.xpath('//xx:CtgyPurp',namespaces=ns)[0]
target.tag = "newName"
print(etree.tostring(doc).decode())
Output:
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pain.001.001.03" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CstmrCdtTrfInitn>
<newName>. // ---->i want to change this tag
<Cd>SALA</Cd> //-----> no change
</newName> // ----> i want to change this tag
</CstmrCdtTrfInitn>
</Document>

How to extract text from xml file using python

I'm trying to extract text data from this xml file but I don't know why my code not working. How do I get this phone number? Please have a look at this XML file and my code format as well.I'm trying to extract data from this tag Thank you in advance :)
<ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:voc="urn:hl7-org:v3/voc" xmlns:sdtc="urn:hl7-org:sdtc" xsi:schemaLocation="CDA.xsd">
<realmCode code="US"/>
<languageCode code="en-US"/>
<recordTarget>
<patientRole>
<addr use="HP">
<streetAddressLine>3345 Elm Street</streetAddressLine>
<city>Aurora</city>
<state>CO</state>
<postalCode>80011</postalCode>
<country>US</country>
</addr>
<telecom value="tel:+1(303)-554-8889" use="HP"/>
<patient>
<name use="L">
<given>Janson</given>
<given>J</given>
<family>Example</family>
</name>
</patient>
</patientRole>
</recordTarget>
</ClinicalDocument>
Here is my python code
import xml.etree.ElementTree as ET
tree = ET.parse('country.xml')
root = tree.getroot()
print(root)
for country in root.findall('patientRole'):
number = country.get('telecom')
print(number)

Your XML document has namespace specified, so it becomes something like:
for country in tree.findall('.//{urn:hl7-org:v3}patientRole'):
number = country.find('{urn:hl7-org:v3}telecom').attrib['value']
print(number)
Output:
tel:+1(303)-554-8889

Get text inside xml tags by their name

I had a xml code and i want to get text in exact elements(xml tags) using python language .
I have tried couple of solutions and didnt work.
import xml.etree.ElementTree as ET
tree = ET.fromstring(xml)
for node in tree.iter('Model'):
print node
How can i do that ?
Xml Code :
<soap:Envelope
xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<GetVehicleLimitedInfoResponse
xmlns="http://schemas.conversesolutions.com/xsd/dmticta/v1">
<return>
<ResponseMessage xsi:nil="true" />
<ErrorCode xsi:nil="true" />
<RequestId> 2012290007705 </RequestId>
<TransactionCharge>150</TransactionCharge>
<VehicleNumber>GF-0176</VehicleNumber>
<AbsoluteOwner>SIYAPATHA FINANCE PLC</AbsoluteOwner>
<EngineNo>GA15-483936F</EngineNo>
<ClassOfVehicle>MOTOR CAR</ClassOfVehicle>
<Make>NISSAN</Make>
<Model>PULSAR</Model>
<YearOfManufacture>1998</YearOfManufacture>
<NoOfSpecialConditions>0</NoOfSpecialConditions>
<SpecialConditions xsi:nil="true" />
</return>
</GetVehicleLimitedInfoResponse>
</soap:Body>
</soap:Envelope>

Edited and improved answer:
import xml.etree.ElementTree as ET
import re
ns = {"veh": "http://schemas.conversesolutions.com/xsd/dmticta/v1"}
tree = ET.parse('test.xml') # save your xml as test.xml
root = tree.getroot()
def get_tag_name(tag):
return re.sub(r'\{.*\}', '',tag)
for node in root.find(".//veh:return", ns):
print(get_tag_name(node.tag)+': ', node.text)
It should produce something like this:
ResponseMessage: None
ErrorCode: None
RequestId: 2012290007705
TransactionCharge: 150
VehicleNumber: GF-0176
AbsoluteOwner: SIYAPATHA FINANCE PLC
EngineNo: GA15-483936F
ClassOfVehicle: MOTOR CAR
Make: NISSAN
Model: PULSAR
YearOfManufacture: 1998
NoOfSpecialConditions: 0
SpecialConditions: None

python, beautiful soup, xml parsing

How can I get values of latitude and longitude from the following XML:
<?xml version="1.0" encoding="utf-8"?>
<location source="FoundByWifi">
<coordinates latitude="49.7926292" longitude="24.0538406"
nlatitude="49.7935180" nlongitude="24.0552174" />
</location>
I tried to use get_text but it doesn't work in this way(
r = requests.get(url)
soup = BeautifulSoup(r.text)
lat = soup.find('coordinates','latitude').get_text(strip=True)

'latitude' is an attribute within the 'coordinates' tag. Once you found the coordinates, the soup object stores all the attributes in a dict-like key-value store.
So, in your case, after finding the coordinates tag, check the 'latitude' key as so:
lat = soup.find('coordinates')['latitude']
You can even strip the resultant of any extraneous whitespace at the beginning or end:
lat = soup.find('coordinates')['latitude'].strip()

Check online demo
html_doc = """
<?xml version="1.0" encoding="utf-8"?>
<location source="FoundByWifi">
<coordinates latitude="49.7926292" longitude="24.0538406"
nlatitude="49.7935180" nlongitude="24.0552174" />
</location>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
lat = soup.find_all('coordinates')
for i in lat:
print(i.attrs['latitude'])
print(i.attrs['longitude'])

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

how to parse unstructured xml file using python? - python

Or you can simply use lxml. Here is tutorial that I used: https://lxml.de/tutorial.html But it should be similar to: from lxml import etree root = etree.Element("patient") print(root.find("given")) print(root.find("family")) print(root.find("give"))

Related

filter non-nested tag values from XML

How to change sub element in lxml

How to extract text from xml file using python

Get text inside xml tags by their name

python, beautiful soup, xml parsing

Categories

Resources