I have this xml file:
<?xml version="1.0" encoding="utf-8" ?>
<srl>
<role>V</role><txt>Representava</txt>
<role>A2</role><txt>ela</txt>
<role>A1</role>
<txt>uma jibóia
<role>A0</role><txt>que</txt>
<role>V</role><txt>engolia</txt>
<role>A1</role><txt>uma fera</txt>
</txt>
</srl>
How do I extract just this block in python? I'm using Beautiful Soup.
<txt>uma jibóia
<role>A0</role><txt>que</txt>
<role>V</role><txt>engolia</txt>
<role>A1</role><txt>uma fera</txt>
</txt>
I tried this:
soup = bs(open(xml, 'r'), 'lxml')
texts = soup.find_all('txt')
for t in texts:
    print(t.text)
I solved it:
for t in texts:
    if len(t.contents) > 1:
        print(t)
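For reference, here is that filter put together as a runnable Python 3 sketch (the sample XML is the one from the question; only a `<txt>` that wraps other tags has more than one node in `.contents`):

```python
from bs4 import BeautifulSoup

xml_doc = """<srl>
<role>V</role><txt>Representava</txt>
<role>A2</role><txt>ela</txt>
<role>A1</role>
<txt>uma jibóia
<role>A0</role><txt>que</txt>
<role>V</role><txt>engolia</txt>
<role>A1</role><txt>uma fera</txt>
</txt>
</srl>"""

soup = BeautifulSoup(xml_doc, "xml")
# Simple <txt> tags hold a single text node; the wrapping block
# holds a mix of text nodes and child tags, so .contents is longer.
for t in soup.find_all("txt"):
    if len(t.contents) > 1:
        print(t)
```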
I have an xml that looks like this.
<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
<offer id="11" parent_id="12">
<name>Alpha</name>
<pos>697</pos>
<kat_pis>
<pos kat="2">112</pos>
</kat_pis>
</offer>
<offer id="12" parent_id="31">
<name>Beta</name>
<pos>099</pos>
<kat_pis>
<pos kat="2">113</pos>
</kat_pis>
</offer>
</details>
</main_heading>
I am parsing it using BeautifulSoup. Upon doing this:
soup = BeautifulSoup(file, 'xml')
pos = []
for i in soup.find_all('pos'):
    pos.append(i.text)
I get a list of all POS tag values, even the ones that are nested within the tag kat_pis.
So I get ['697', '112', '099', '113'].
However, I only want the values of the pos tags that are not nested inside kat_pis.
The desired result is ['697', '099'].
How can I achieve this?
Here is one way of getting those first level pos:
from bs4 import BeautifulSoup as bs
xml_doc = '''<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
<offer id="11" parent_id="12">
<name>Alpha</name>
<pos>697</pos>
<kat_pis>
<pos kat="2">112</pos>
</kat_pis>
</offer>
<offer id="12" parent_id="31">
<name>Beta</name>
<pos>099</pos>
<kat_pis>
<pos kat="2">113</pos>
</kat_pis>
</offer>
</details>
</main_heading>'''
soup = bs(xml_doc, 'xml')
pos = []
for i in soup.select('offer > pos'):
    pos.append(i.text)
print(pos)
Result in terminal:
['697', '099']
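An equivalent sketch without CSS selectors, using find's `recursive=False` so only the direct children of each offer are searched (the sample here is a trimmed-down version of the question's XML):

```python
from bs4 import BeautifulSoup as bs

xml_doc = """<details>
<offer id="11"><name>Alpha</name><pos>697</pos>
<kat_pis><pos kat="2">112</pos></kat_pis></offer>
<offer id="12"><name>Beta</name><pos>099</pos>
<kat_pis><pos kat="2">113</pos></kat_pis></offer>
</details>"""

soup = bs(xml_doc, "xml")
pos = []
for offer in soup.find_all("offer"):
    # recursive=False limits the search to direct children,
    # so the <pos> nested inside <kat_pis> is skipped
    direct = offer.find("pos", recursive=False)
    if direct is not None:
        pos.append(direct.text)
print(pos)  # ['697', '099']
```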
I think the best solution would be to abandon BeautifulSoup for an XML parser with XPath support, like lxml. Using XPath expressions, you can ask for only those pos elements that are direct children of offer elements:
from lxml import etree
with open('data.xml') as fd:
    doc = etree.parse(fd)

pos = []
for ele in doc.xpath('//offer/pos'):
    pos.append(ele.text)
print(pos)
Given your example input, the above code prints:
['697', '099']
I have an XML file which looks like this:
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
>
<channel>
<item>
<title>Label: some_title"</title>
<link>some_link</link>
<pubDate>some_date</pubDate>
<dc:creator><![CDATA[University]]></dc:creator>
<guid isPermaLink="false">https://link.link</guid>
<description></description>
<content:encoded><![CDATA[[vc_row][vc_column][vc_column_text]<strong>some texttext some more text</strong><!--more-->
[caption id="attachment_344" align="aligncenter" width="524"]<img class="-image-" src="link.link.png" alt="" width="524" height="316" /> <em>A screenshot by the people</em>[/caption]
<strong>some more text</strong>
<div class="entry-content">
<em>Leave your comments</em>
</div>
<div class="post-meta wf-mobile-collapsed">
<div class="entry-meta"></div>
</div>
[/vc_column_text][/vc_column][/vc_row][vc_row][vc_column][/vc_column][/vc_row][vc_row][vc_column][dt_quote]<strong><b>RESEARCH | ARTICLE </b></strong>University[/dt_quote][/vc_column][/vc_row]]]></content:encoded>
<excerpt:encoded><![CDATA[]]></excerpt:encoded>
</item>
some more <item> </item>s here
</channel>
I want to extract the raw text within the <content:encoded> section, excluding the tags and URLs. I have tried this with BeautifulSoup and Scrapy, as well as other lxml methods. Most return an empty list.
Is there a way for me to retrieve this information without having to use regex?
Much appreciated.
UPDATE
I opened the XML file using:
content = []
with open(xml_file, "r") as file:
    content = file.readlines()
content = "".join(content)
xml = bs(content, "lxml")
then I tried this with scrapy:
response = HtmlResponse(url=xml_file, encoding='utf-8')
response.selector.register_namespace('content',
'http://purl.org/rss/1.0/modules/content/')
response.xpath('channel/item/content:encoded').getall()
which returns an empty list.
and tried the code in the first answer:
soup = bs(xml.select_one("content:encoded").text, "html.parser")
text = "\n".join(
    s.get_text(strip=True, separator=" ") for s in soup.select("strong"))
print(text)
and get this error: Only the following pseudo-classes are implemented: nth-of-type.
When I opened the file with lxml, I ran this for loop:
data = {}
n = 0
for item in xml.findall('item'):
    id = 'claim_id_' + str(n)
    keys = {}
    title = item.find('title').text
    keys['label'] = title.split(': ')[0]
    keys['claim'] = title.split(': ')[1]
    if item.find('content:encoded'):
        keys['text'] = bs(html.unescape(item.encoded.text), 'lxml')
    data[id] = keys
    print(data)
    n += 1
It saved the label and claim perfectly well, but nothing for the text. Now that I opened the file using BeautifulSoup, it returns this error: 'NoneType' object is not callable
If you only need text inside <strong> tags, you can use my example. Otherwise, only regex seems suitable here:
from bs4 import BeautifulSoup
xml_doc = """
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
>
...the XML from the question...
</rss>
"""
soup = BeautifulSoup(xml_doc, "xml")
soup = BeautifulSoup(soup.select_one("content|encoded").text, "html.parser")
text = "\n".join(
    s.get_text(strip=True, separator=" ") for s in soup.select("strong")
)
print(text)
Prints:
some texttext some more text
some more text
RESEARCH | ARTICLE
I eventually got the text part using regular expressions (regex).
import re
for item in root.iter('item'):
    grandchildren = list(item)  # getchildren() is deprecated; list() works the same
    for grandchild in grandchildren:
        if 'encoded' in grandchild.tag:
            text = grandchild.text
            text = re.sub(r'\[.*?\]', "", text)  # gets rid of square brackets and their content
            text = re.sub(r'<.*?>', "", text)    # gets rid of <> tags and their content
            text = text.replace("\xa0", "")      # gets rid of non-breaking spaces (&nbsp;)
            text = " ".join(text.split())        # normalizes the remaining whitespace
I have an issue converting an XML file to CSV using BeautifulSoup in an Azure App Service. When I run it locally everything is fine; the first difference shows up at the soup line:
Parsing code:
file_path = os.path.join(INPUT_DIRECTORY, "test.xml")
source = open(file_path, encoding="utf8")
soup = BeautifulSoup(source.read(), 'xml')  # doesn't work on the global server
First = [case.text for case in soup.find_all('FirstAtt')]
Second = [case.text for case in soup.find_all('SecondAtt')]
Third = [case.text for case in soup.find_all('ThirdAtt')]
results = list(zip(First, Second))
columns = ['1', '2']
df = pd.DataFrame(results, columns=columns)
df['ID'] = Third[0]
df.to_csv(OUTPUT_DIRECTORY, index=False, encoding='utf-8-sig')
XML:
<?xml version="1.0" encoding="utf-8"?>
<root>
<ThirdAtt>7290027600007</ThirdAtt>
<Items Count="279">
<Item>
<FirstAtt>2021-09-05 08:00</FirstAtt>
<SecondAtt>5411188134985</SecondAtt>
...
</Item>
<Item>
<FirstAtt>2021-09-05 08:00</FirstAtt>
<SecondAtt>5411188135005</SecondAtt>
...
</Item>
...
On a local run the soup line reads the XML file, but on the global server run on Azure, soup doesn't get past the declaration and remains as:
soup = <?xml version="1.0" encoding="utf-8"?>
Any ideas as to how to solve this?
UPDATE:
Thanks to @balderman, I converted my code as recommended:
root = ET.fromstring(xml)
headers = []
for idx, item in enumerate(root.findall('.//Item')):
    data = []
    if idx == 0:
        headers = [x.tag for x in list(item)]
    for h in headers:
        data.append(item.find(h).text)
    First.append(data[0])
    Second.append(data[1])
results = list(zip(First, Second))
...
Is there a way to use generic indexes for the appends, in case the position of data[i] changes?
No need for ANY external lib, just use core Python's ElementTree:
import xml.etree.ElementTree as ET
xml = '''<root>
<ThirdAtt>7290027600007</ThirdAtt>
<Items Count="279">
<Item>
<FirstAtt>2021-09-05 08:00</FirstAtt>
<SecondAtt>5411188134985</SecondAtt>
</Item>
<Item>
<FirstAtt>2021-09-05 08:00</FirstAtt>
<SecondAtt>5411188135005</SecondAtt>
</Item>
</Items>
</root>
'''
root = ET.fromstring(xml)
headers = []
for idx, item in enumerate(root.findall('.//Item')):
    data = []
    if idx == 0:
        headers = [x.tag for x in list(item)]
        print(','.join(headers))
    for h in headers:
        data.append(item.find(h).text)
    print(','.join(data))
output
FirstAtt,SecondAtt
2021-09-05 08:00,5411188134985
2021-09-05 08:00,5411188135005
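As for the follow-up about generic indexes: instead of positional data[0]/data[1], each <Item> can become a dict keyed by tag name, which keeps working if the element order changes (a sketch on a trimmed sample; the swapped order in the second Item is deliberate):

```python
import xml.etree.ElementTree as ET

xml = """<root>
<Items Count="2">
<Item><FirstAtt>2021-09-05 08:00</FirstAtt><SecondAtt>5411188134985</SecondAtt></Item>
<Item><SecondAtt>5411188135005</SecondAtt><FirstAtt>2021-09-05 08:00</FirstAtt></Item>
</Items>
</root>"""

root = ET.fromstring(xml)
# One dict per <Item>, keyed by tag name, so column order no longer matters
rows = [{child.tag: child.text for child in item}
        for item in root.findall(".//Item")]
print([row["SecondAtt"] for row in rows])  # ['5411188134985', '5411188135005']
```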
How can I parse an unstructured (truncated) XML file? I need to get the data inside the patient tag and the title using ElementTree.
<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">
<templateId root="2.16.840.1.113883.10.20.22.1.1"/>
<id extension="4b78219a-1d02-4e7c-9870-dc7ce3b8a8fb" root="1.2.840.113619.21.1.3214775361124994304.5.1"/>
<code code="34133-9" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="Summarization of episode note"/>
<title>Summary</title>
<effectiveTime value="20170919160921ddfdsdsdsd31-0400"/>
<confidentialityCode code="N" codeSystem="2.16.840.dwdwddsd1.113883.5.25"/>
<recordTarget>
<patientRole><id extension="0" root="1.2.840.113619.21.1.3214775361124994304.2.1.1.2"/>
<addr use="HP"><streetAddressLine>addd2 </streetAddressLine><city>fgfgrtt</city><state>tr</state><postalCode>121213434</postalCode><country>rere</country></addr>
<patient>
<name><given>fname</given><family>lname</family></name>
<administrativeGenderCode code="F" codeSystem="2.16.840.1.113883.5.1" displayName="Female"/>
<birthTime value="19501025"/>
<maritalStatusCode code="M" codeSystem="2434.16.840.1.143434313883.5.2" displayName="M"/>
<languageCommunication>
<languageCode code="eng"/>
<proficiencyLevelCode nullFlavor="NI"/>
<preferenceInd value="true"/>
</languageCommunication>
</patient>
I want the given name, family name, gender, and title.
Using the BeautifulSoup (bs4) and lxml parser libraries to scrape the XML data:
from bs4 import BeautifulSoup
xml_data = '''<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">
<templateId root="2.16.840.1.113883.10.20.22.1.1"/>
<id extension="4b78219a-1d02-4e7c-9870-dc7ce3b8a8fb" root="1.2.840.113619.21.1.3214775361124994304.5.1"/>
<code code="34133-9" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="Summarization of episode note"/>
<title>Summary</title>
<effectiveTime value="20170919160921ddfdsdsdsd31-0400"/>
<confidentialityCode code="N" codeSystem="2.16.840.dwdwddsd1.113883.5.25"/>
<recordTarget>
<patientRole><id extension="0" root="1.2.840.113619.21.1.3214775361124994304.2.1.1.2"/>
<addr use="HP"><streetAddressLine>addd2 </streetAddressLine><city>fgfgrtt</city><state>tr</state><postalCode>121213434</postalCode><country>rere</country></addr>
<patient>
<name><given>fname</given><family>lname</family></name>
<administrativeGenderCode code="F" codeSystem="2.16.840.1.113883.5.1" displayName="Female"/>
<birthTime value="19501025"/>
<maritalStatusCode code="M" codeSystem="2434.16.840.1.143434313883.5.2" displayName="M"/>
<languageCommunication>
<languageCode code="eng"/>
<proficiencyLevelCode nullFlavor="NI"/>
<preferenceInd value="true"/>
</languageCommunication>
</patient>'''
soup = BeautifulSoup(xml_data, "lxml")
title = soup.find("title")
print(title.text.strip())
patient = soup.find("patient")
given = patient.find("given").text.strip()
family = patient.find("family").text.strip()
gender = patient.find("administrativegendercode")['displayname'].strip()
print(given)
print(family)
print(gender)
O/P:
Summary
fname
lname
Female
Install library dependency:
pip3 install beautifulsoup4==4.7.1
pip3 install lxml==4.3.3
Or you can simply use lxml. Here is the tutorial I used: https://lxml.de/tutorial.html
Note that the document declares the default namespace urn:hl7-org:v3, so the search paths have to be namespace-qualified. It should be similar to:
from lxml import etree
doc = etree.parse("data.xml")  # assumes the (complete) document is saved as data.xml
ns = {"hl7": "urn:hl7-org:v3"}
print(doc.findtext(".//hl7:title", namespaces=ns))
print(doc.findtext(".//hl7:given", namespaces=ns))
print(doc.findtext(".//hl7:family", namespaces=ns))
How can I get values of latitude and longitude from the following XML:
<?xml version="1.0" encoding="utf-8"?>
<location source="FoundByWifi">
<coordinates latitude="49.7926292" longitude="24.0538406"
nlatitude="49.7935180" nlongitude="24.0552174" />
</location>
I tried to use get_text but it doesn't work this way:
r = requests.get(url)
soup = BeautifulSoup(r.text)
lat = soup.find('coordinates','latitude').get_text(strip=True)
'latitude' is an attribute of the 'coordinates' tag. Once you've found the coordinates element, the soup object stores all of its attributes in a dict-like key-value store.
So, in your case, after finding the coordinates tag, check the 'latitude' key as so:
lat = soup.find('coordinates')['latitude']
You can even strip the resultant of any extraneous whitespace at the beginning or end:
lat = soup.find('coordinates')['latitude'].strip()
A runnable demo:
html_doc = """
<?xml version="1.0" encoding="utf-8"?>
<location source="FoundByWifi">
<coordinates latitude="49.7926292" longitude="24.0538406"
nlatitude="49.7935180" nlongitude="24.0552174" />
</location>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
lat = soup.find_all('coordinates')
for i in lat:
    print(i.attrs['latitude'])
    print(i.attrs['longitude'])
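For completeness, the same attributes can also be read with the standard library's ElementTree, via the element's get method (a minimal sketch on the question's XML, no third-party parser needed):

```python
import xml.etree.ElementTree as ET

xml_doc = """<location source="FoundByWifi">
<coordinates latitude="49.7926292" longitude="24.0538406"
    nlatitude="49.7935180" nlongitude="24.0552174" />
</location>"""

root = ET.fromstring(xml_doc)
coords = root.find("coordinates")
# Attributes are exposed as strings via .get() or the .attrib dict
print(coords.get("latitude"))   # 49.7926292
print(coords.get("longitude"))  # 24.0538406
```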