python, beautiful soup, xml parsing

How can I get values of latitude and longitude from the following XML:
<?xml version="1.0" encoding="utf-8"?>
<location source="FoundByWifi">
<coordinates latitude="49.7926292" longitude="24.0538406"
nlatitude="49.7935180" nlongitude="24.0552174" />
</location>
I tried to use get_text, but it doesn't work this way:
r = requests.get(url)
soup = BeautifulSoup(r.text)
lat = soup.find('coordinates','latitude').get_text(strip=True)

'latitude' is an attribute of the 'coordinates' tag, not text inside it. In find('coordinates', 'latitude'), the second argument is treated as a CSS class filter, so it does not select the attribute. Once you have found the coordinates tag, its attributes are available through a dict-like key-value interface.
So, in your case, after finding the coordinates tag, look up the 'latitude' key like so:
lat = soup.find('coordinates')['latitude']
You can also strip any extraneous whitespace from the beginning or end of the result:
lat = soup.find('coordinates')['latitude'].strip()
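
For comparison, the same attribute lookup can be sketched with the standard library's xml.etree.ElementTree, which is a swapped-in parser and not the Beautiful Soup API discussed above. The sample XML is inlined here instead of being fetched with requests, and the conversion to float assumes the coordinates are needed as numbers.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, inlined instead of fetched over HTTP.
xml_doc = """<location source="FoundByWifi">
<coordinates latitude="49.7926292" longitude="24.0538406"
nlatitude="49.7935180" nlongitude="24.0552174" />
</location>"""

root = ET.fromstring(xml_doc)
coordinates = root.find("coordinates")  # first <coordinates> child of <location>

# Attributes are exposed through a dict-like interface; get() returns None
# for a missing attribute instead of raising KeyError.
latitude = float(coordinates.get("latitude"))
longitude = float(coordinates.get("longitude"))

print(latitude, longitude)
```

Using get() rather than square brackets is a design choice: it makes a missing attribute easy to detect before converting the value.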

Here is a complete example:
html_doc = """
<?xml version="1.0" encoding="utf-8"?>
<location source="FoundByWifi">
<coordinates latitude="49.7926292" longitude="24.0538406"
nlatitude="49.7935180" nlongitude="24.0552174" />
</location>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
lat = soup.find_all('coordinates')
for i in lat:
    print(i.attrs['latitude'])
    print(i.attrs['longitude'])

Related

filter non-nested tag values from XML

I have an xml that looks like this.
<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
<offer id="11" parent_id="12">
<name>Alpha</name>
<pos>697</pos>
<kat_pis>
<pos kat="2">112</pos>
</kat_pis>
</offer>
<offer id="12" parent_id="31">
<name>Beta</name>
<pos>099</pos>
<kat_pis>
<pos kat="2">113</pos>
</kat_pis>
</offer>
</details>
</main_heading>
I am parsing it using BeautifulSoup. Upon doing this:
soup = BeautifulSoup(file, 'xml')
pos = []
for i in soup.find_all('pos'):
    pos.append(i.text)
I get a list of all pos tag values, including the ones nested within the kat_pis tag.
So I get (697, 112, 099, 113).
However, I only want the pos values of the non-nested tags.
The desired result is (697, 099).
How can I achieve this?

Here is one way of getting those first level pos:
from bs4 import BeautifulSoup as bs
xml_doc = '''<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
<offer id="11" parent_id="12">
<name>Alpha</name>
<pos>697</pos>
<kat_pis>
<pos kat="2">112</pos>
</kat_pis>
</offer>
<offer id="12" parent_id="31">
<name>Beta</name>
<pos>099</pos>
<kat_pis>
<pos kat="2">113</pos>
</kat_pis>
</offer>
</details>
</main_heading>'''
soup = bs(xml_doc, 'xml')
pos = []
for i in soup.select('offer > pos'):
    pos.append(i.text)
print(pos)
Result in terminal:
['697', '099']

I think the best solution would be to abandon BeautifulSoup for an XML parser with XPath support, like lxml. Using XPath expressions, you can ask for only those pos elements that are direct children of offer elements:
from lxml import etree
with open('data.xml') as fd:
    doc = etree.parse(fd)
pos = []
for ele in doc.xpath('//offer/pos'):
    pos.append(ele.text)
print(pos)
Given your example input, the above code prints:
['697', '099']
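
If installing lxml is not an option, a similar child-only selection can be sketched with the standard library's xml.etree.ElementTree, which supports a limited XPath subset. This is an alternative under that assumption, not the lxml approach itself, and the document is inlined as a string rather than read from data.xml.

```python
import xml.etree.ElementTree as ET

xml_doc = """<main_heading timestamp="20220113">
<details>
<offer id="11" parent_id="12">
<name>Alpha</name>
<pos>697</pos>
<kat_pis>
<pos kat="2">112</pos>
</kat_pis>
</offer>
<offer id="12" parent_id="31">
<name>Beta</name>
<pos>099</pos>
<kat_pis>
<pos kat="2">113</pos>
</kat_pis>
</offer>
</details>
</main_heading>"""

root = ET.fromstring(xml_doc)

# ".//offer/pos" matches <pos> elements whose parent is an <offer>;
# the <pos> elements inside <kat_pis> are skipped.
pos = [element.text for element in root.findall(".//offer/pos")]
print(pos)
```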

How to change sub element in lxml

My xml file:
<?xml version='1.0' encoding='UTF-8'?>
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pain.001.001.03" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CstmrCdtTrfInitn>
<CtgyPurp>. // ---->i want to change this tag
<Cd>SALA</Cd> //-----> no change
</CtgyPurp> // ----> i want to change this tag
</CstmrCdtTrfInitn>
</Document>
I want to make a change in the XML file:
<CtgyPurp></CtgyPurp> should become <newName></newName>
I know how to change the value within a tag, but not how to rename the tag itself with lxml.

Something like this should work - note the treatment of namespaces:
from lxml import etree
ctg = """[your xml above]"""
doc = etree.XML(ctg.encode())
ns = {"xx": "urn:iso:std:iso:20022:tech:xsd:pain.001.001.03"}
target = doc.xpath('//xx:CtgyPurp',namespaces=ns)[0]
target.tag = "newName"
print(etree.tostring(doc).decode())
Output:
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pain.001.001.03" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CstmrCdtTrfInitn>
<newName>. // ---->i want to change this tag
<Cd>SALA</Cd> //-----> no change
</newName> // ----> i want to change this tag
</CstmrCdtTrfInitn>
</Document>
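
One detail worth spelling out is how the new tag name relates to the document's default namespace. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml (lxml.etree accepts the same "{uri}name" Clark notation for tags) and a reduced copy of the document without the inline notes. The names NS_URI and xml_doc are only illustrative.

```python
import xml.etree.ElementTree as ET

NS_URI = "urn:iso:std:iso:20022:tech:xsd:pain.001.001.03"
ns = {"xx": NS_URI}

# Reduced version of the document from the question, without the inline notes.
xml_doc = f"""<Document xmlns="{NS_URI}">
<CstmrCdtTrfInitn>
<CtgyPurp>
<Cd>SALA</Cd>
</CtgyPurp>
</CstmrCdtTrfInitn>
</Document>"""

root = ET.fromstring(xml_doc)
target = root.find(".//xx:CtgyPurp", ns)

# Clark notation "{namespace-uri}localname" keeps the renamed element in the
# document's namespace; a bare "newName" would leave it without one.
target.tag = f"{{{NS_URI}}}newName"

print(ET.tostring(root, encoding="unicode"))
```

After the rename, a namespaced lookup such as root.find(".//xx:newName", ns) finds the element, and its Cd child is unchanged.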

how to parse unstructured xml file using python?

How can I parse an unstructured XML file? I need to get the data inside the patient tag and the title using ElementTree.
<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">
<templateId root="2.16.840.1.113883.10.20.22.1.1"/>
<id extension="4b78219a-1d02-4e7c-9870-dc7ce3b8a8fb" root="1.2.840.113619.21.1.3214775361124994304.5.1"/>
<code code="34133-9" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="Summarization of episode note"/>
<title>Summary</title>
<effectiveTime value="20170919160921ddfdsdsdsd31-0400"/>
<confidentialityCode code="N" codeSystem="2.16.840.dwdwddsd1.113883.5.25"/>
<recordTarget>
<patientRole><id extension="0" root="1.2.840.113619.21.1.3214775361124994304.2.1.1.2"/>
<addr use="HP"><streetAddressLine>addd2 </streetAddressLine><city>fgfgrtt</city><state>tr</state><postalCode>121213434</postalCode><country>rere</country></addr>
<patient>
<name><given>fname</given><family>lname</family></name>
<administrativeGenderCode code="F" codeSystem="2.16.840.1.113883.5.1" displayName="Female"/>
<birthTime value="19501025"/>
<maritalStatusCode code="M" codeSystem="2434.16.840.1.143434313883.5.2" displayName="M"/>
<languageCommunication>
<languageCode code="eng"/>
<proficiencyLevelCode nullFlavor="NI"/>
<preferenceInd value="true"/>
</languageCommunication>
</patient>
I want the given name, family name, gender and title.

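Use BeautifulSoup (bs4) with the lxml parser to extract the data. Note that "lxml" here selects an HTML parser, which lowercases tag and attribute names; that is why the lookups below use "administrativegendercode" and "displayname".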
from bs4 import BeautifulSoup
xml_data = '''<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">
<templateId root="2.16.840.1.113883.10.20.22.1.1"/>
<id extension="4b78219a-1d02-4e7c-9870-dc7ce3b8a8fb" root="1.2.840.113619.21.1.3214775361124994304.5.1"/>
<code code="34133-9" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="Summarization of episode note"/>
<title>Summary</title>
<effectiveTime value="20170919160921ddfdsdsdsd31-0400"/>
<confidentialityCode code="N" codeSystem="2.16.840.dwdwddsd1.113883.5.25"/>
<recordTarget>
<patientRole><id extension="0" root="1.2.840.113619.21.1.3214775361124994304.2.1.1.2"/>
<addr use="HP"><streetAddressLine>addd2 </streetAddressLine><city>fgfgrtt</city><state>tr</state><postalCode>121213434</postalCode><country>rere</country></addr>
<patient>
<name><given>fname</given><family>lname</family></name>
<administrativeGenderCode code="F" codeSystem="2.16.840.1.113883.5.1" displayName="Female"/>
<birthTime value="19501025"/>
<maritalStatusCode code="M" codeSystem="2434.16.840.1.143434313883.5.2" displayName="M"/>
<languageCommunication>
<languageCode code="eng"/>
<proficiencyLevelCode nullFlavor="NI"/>
<preferenceInd value="true"/>
</languageCommunication>
</patient>'''
soup = BeautifulSoup(xml_data, "lxml")
title = soup.find("title")
print(title.text.strip())
patient = soup.find("patient")
given = patient.find("given").text.strip()
family = patient.find("family").text.strip()
gender = patient.find("administrativegendercode")['displayname'].strip()
print(given)
print(family)
print(gender)
Output:
Summary
fname
lname
Female
Install library dependency:
pip3 install beautifulsoup4==4.7.1
pip3 install lxml==4.3.3

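Or you can simply use lxml. Here is the tutorial that I used: https://lxml.de/tutorial.html
The elements live in the urn:hl7-org:v3 namespace, so the lookups need a namespace mapping. It should be similar to:
from lxml import etree
ns = {"hl7": "urn:hl7-org:v3"}
root = etree.fromstring(xml_data.encode())  # xml_data: the complete, well-formed document as a string
patient = root.find(".//hl7:patient", namespaces=ns)
print(patient.find("hl7:name/hl7:given", namespaces=ns).text)
print(patient.find("hl7:name/hl7:family", namespaces=ns).text)
print(patient.find("hl7:administrativeGenderCode", namespaces=ns).get("displayName"))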
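
Since the question asked for ElementTree specifically, here is a minimal sketch with the standard library's xml.etree.ElementTree. It assumes a shortened copy of the document in which the closing tags missing from the excerpt have been added, because the parser rejects XML that is not well-formed. The prefix "hl7" is an arbitrary label chosen for the namespace mapping.

```python
import xml.etree.ElementTree as ET

ns = {"hl7": "urn:hl7-org:v3"}

# Shortened sample; the closing tags missing from the question's excerpt
# are added here so that the document is well-formed.
xml_doc = """<ClinicalDocument xmlns="urn:hl7-org:v3">
<title>Summary</title>
<recordTarget>
<patientRole>
<patient>
<name><given>fname</given><family>lname</family></name>
<administrativeGenderCode code="F" displayName="Female"/>
</patient>
</patientRole>
</recordTarget>
</ClinicalDocument>"""

root = ET.fromstring(xml_doc)
title = root.find("hl7:title", ns).text
patient = root.find(".//hl7:patient", ns)
given = patient.find("hl7:name/hl7:given", ns).text
family = patient.find("hl7:name/hl7:family", ns).text
# Attribute names are not namespaced here, so no prefix is needed.
gender = patient.find("hl7:administrativeGenderCode", ns).get("displayName")

print(title, given, family, gender)
```

Unlike the HTML-oriented parser, ElementTree preserves the original case of tag and attribute names, so the lookups use administrativeGenderCode and displayName as written in the file.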

How to call variable inside a string?

I am trying to call a variable inside a string but cannot get it to work. I have tried researching this but cannot find a way to get it to work.
How can I use set_drive as the value of Drive in the soup = post(...) call?
set_drive = "ON"
soup = post("""
<?xml version="1.0" encoding="UTF-8"?>
<Packet>
<Command>setRequest</Command>
<DatabaseManager>
<Mnet Group="18" Drive=set_drive Mode="COOL" SetTemp="19" AirDirection="HORIZONTAL" FanSpeed="HIGH" />
</DatabaseManager>
</Packet>
""")

The recommended (and most readable and portable) solution is to use str.format():
set_drive = "ON"
template = """
<?xml version="1.0" encoding="UTF-8"?>
<Packet>
<Command>setRequest</Command>
<DatabaseManager>
<Mnet Group="18" Drive="{set_drive}" Mode="COOL" SetTemp="19" AirDirection="HORIZONTAL" FanSpeed="HIGH" />
</DatabaseManager>
</Packet>
"""
soup = post(template.format(set_drive=set_drive))

Use string formatting. There are many options in Python.
set_drive = "ON"
result = "Drive=%s" % set_drive # traditional way
result = f"Drive={set_drive}" # python 3.6+ way

set_drive = "ON"
query = """<?xml version="1.0" encoding="UTF-8"?>
<Packet>
<Command>setRequest</Command>
<DatabaseManager>
<Mnet Group="18" Drive="{set_drive}" Mode="COOL" SetTemp="19" AirDirection="HORIZONTAL" FanSpeed="HIGH" />
</DatabaseManager>
</Packet>
"""
# format() returns a new string, so keep the result
query = query.format(set_drive=set_drive)
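
The formatting approaches above insert the value verbatim, which produces broken XML if the value ever contains characters such as & or a double quote. A minimal sketch of one way to guard against that uses quoteattr from the standard library's xml.sax.saxutils. The helper build_packet and the trimmed template are hypothetical names for illustration, and post() from the question is left out.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Trimmed version of the packet from the question (illustrative only).
template = """<Packet>
<Command>setRequest</Command>
<DatabaseManager>
<Mnet Group="18" Drive={drive} Mode="COOL" />
</DatabaseManager>
</Packet>"""

def build_packet(drive):
    # quoteattr() escapes the value and adds the surrounding quotes itself,
    # which is why the template has no quotes around {drive}.
    return template.format(drive=quoteattr(drive))

packet = build_packet("ON")
print(packet)

# Round trip through a parser to show that special characters survive.
tricky = ET.fromstring(build_packet('A&B "x"'))
print(tricky.find(".//Mnet").get("Drive"))
```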

How parse xml in python?

I have this xml file:
<?xml version="1.0" encoding="utf-8" ?>
<srl>
<role>V</role><txt>Representava</txt>
<role>A2</role><txt>ela</txt>
<role>A1</role>
<txt>uma jibóia
<role>A0</role><txt>que</txt>
<role>V</role><txt>engolia</txt>
<role>A1</role><txt>uma fera</txt>
</txt>
</srl>
How do I extract just this block in python? I'm using Beautiful Soup.
<txt>uma jibóia
<role>A0</role><txt>que</txt>
<role>V</role><txt>engolia</txt>
<role>A1</role><txt>uma fera</txt>
</txt>
I tried this:
soup = bs(open(xml, 'r'), 'lxml')
texts = soup.find_all('txt')
for t in texts:
    print(t.text)

I solved it:
for t in texts:
    if len(t.contents) > 1:
        print(t)
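
The same idea, keeping only the txt elements that contain other elements, can be sketched with the standard library's xml.etree.ElementTree instead of Beautiful Soup. This is an alternative formulation rather than the code above: it tests for child elements directly, whereas len(t.contents) > 1 also counts text nodes.

```python
import xml.etree.ElementTree as ET

xml_doc = """<srl>
<role>V</role><txt>Representava</txt>
<role>A2</role><txt>ela</txt>
<role>A1</role>
<txt>uma jibóia
<role>A0</role><txt>que</txt>
<role>V</role><txt>engolia</txt>
<role>A1</role><txt>uma fera</txt>
</txt>
</srl>"""

root = ET.fromstring(xml_doc)

# len(element) is the number of child elements, so this keeps only the
# <txt> elements that contain other elements.
nested = [txt for txt in root.iter("txt") if len(txt) > 0]

for txt in nested:
    print(ET.tostring(txt, encoding="unicode"))
```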
