Why does cElementTree iterparse return None elements? - python

I am trying to parse an xml file with cElementTree.iterparse.
However, I can't understand what is going on because iterparse returns empty elements.
I have an xml file that has the following approximate layout:
<DOCS>
<ID id="1">
<HEAD>title1</HEAD>
<DATE>21.01.2010</DATE>
<TEXT>
<P>some text</P>
<P>some text</P>
<P>some text</P>
</TEXT>
</ID>
<ID id="2">
<HEAD>title2</HEAD>
<DATE>21.01.2010</DATE>
<TEXT>
some text
</TEXT>
</ID>
</DOCS>
I am trying to extract text from TEXT tag or iterate through TEXT tag children (P tags) and extract text from them as well.
Here is my code:
import xml.etree.cElementTree as cet

def parse_docs(xml_data):  # hypothetical wrapper; the original snippet omits the function header
    docs = {}
    id = ''
    for event, elem in cet.iterparse(xml_data, events=('end',)):
        if elem.tag == 'ID':
            id = elem.attrib['id']
        if elem.tag == 'TEXT':
            if list(elem):
                docs[id] = ''.join([p.text for p in elem])
            else:
                docs[id] = elem.text
    #print(docs)
    return docs
When I execute my code I get:
docs[id] = ''.join([p.text for p in elem])
TypeError: sequence item 14: expected str instance, NoneType found
This means that p.text is None for one of the p elements in the list comprehension [p.text for p in elem]. I used print statements to find the text of the preceding p element and checked whether something was wrong with the tags in the xml file. The p element reported as having no text does have a text body in the xml file. Can somebody explain what is going on?

The mistake was forgetting the if event == 'end': check.
An elem object is only guaranteed to be fully populated once its 'end' event has fired; at a 'start' event its text and children may not have been read yet.
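A minimal sketch of that check, assuming a small in-memory document shaped like the one in the question. It uses xml.etree.ElementTree because the cElementTree module was removed in Python 3.9. The pending_text variable is a hypothetical helper, and p.text or "" is an added guard so that an empty element cannot break the join. Because the 'end' event of an ID element arrives after the 'end' event of its TEXT child, the sketch stores the text when ID closes rather than when TEXT closes.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data following the layout described in the question.
xml_data = io.BytesIO(b"""<DOCS>
<ID id="1"><HEAD>title1</HEAD><TEXT><P>some text</P><P>more text</P></TEXT></ID>
<ID id="2"><HEAD>title2</HEAD><TEXT>plain text</TEXT></ID>
</DOCS>""")

docs = {}
pending_text = None  # text of the most recently closed TEXT element

for event, elem in ET.iterparse(xml_data, events=("start", "end")):
    if event != "end":
        # On 'start' the element's text and children may not be parsed yet.
        continue
    if elem.tag == "TEXT":
        if len(elem):
            # TEXT has child elements: join their text, treating None as "".
            pending_text = "".join(p.text or "" for p in elem)
        else:
            pending_text = elem.text or ""
    elif elem.tag == "ID":
        # ID closes after its TEXT child, so its id attribute is paired here.
        docs[elem.attrib["id"]] = pending_text
        pending_text = None

print(docs)
```

Keying the dictionary at the end of ID is a design choice for this sketch; reading the id attribute on the 'start' event of ID would also work, since attributes are available at that point.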

Related

Retrieving text data from <content:encoded> in XML file

I have an XML file which looks like this:
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
>
<channel>
<item>
<title>Label: some_title"</title>
<link>some_link</link>
<pubDate>some_date</pubDate>
<dc:creator><![CDATA[University]]></dc:creator>
<guid isPermaLink="false">https://link.link</guid>
<description></description>
<content:encoded><![CDATA[[vc_row][vc_column][vc_column_text]<strong>some texttext some more text</strong><!--more-->
[caption id="attachment_344" align="aligncenter" width="524"]<img class="-image-" src="link.link.png" alt="" width="524" height="316" /> <em>A screenshot by the people</em>[/caption]
<strong>some more text</strong>
<div class="entry-content">
<em>Leave your comments</em>
</div>
<div class="post-meta wf-mobile-collapsed">
<div class="entry-meta"></div>
</div>
[/vc_column_text][/vc_column][/vc_row][vc_row][vc_column][/vc_column][/vc_row][vc_row][vc_column][dt_quote]<strong><b>RESEARCH | ARTICLE </b></strong>University[/dt_quote][/vc_column][/vc_row]]]></content:encoded>
<excerpt:encoded><![CDATA[]]></excerpt:encoded>
</item>
some more <item> </item>s here
</channel>
I want to extract the raw text within the <content:encoded> section, excluding the tags and urls. I have tried this with BeautifulSoup and Scrapy, as well as other lxml methods. Most return an empty list.
Is there a way for me to retrieve this information without having to use regex?
Much appreciated.
UPDATE
I opened the XML file using:
content = []
with open(xml_file, "r") as file:
    content = file.readlines()
content = "".join(content)
xml = bs(content, "lxml")
then I tried this with scrapy:
response = HtmlResponse(url=xml_file, encoding='utf-8')
response.selector.register_namespace('content',
'http://purl.org/rss/1.0/modules/content/')
response.xpath('channel/item/content:encoded').getall()
which returns an empty list.
and tried the code in the first answer:
soup = bs(xml.select_one("content:encoded").text, "html.parser")
text = "\n".join(
    s.get_text(strip=True, separator=" ") for s in soup.select("strong"))
print(text)
and get this error: Only the following pseudo-classes are implemented: nth-of-type.
When I opened the file with lxml, I ran this for loop:
data = {}
n = 0
for item in xml.findall('item'):
    id = 'claim_id_' + str(n)
    keys = {}
    title = item.find('title').text
    keys['label'] = title.split(': ')[0]
    keys['claim'] = title.split(': ')[1]
    if item.find('content:encoded'):
        keys['text'] = bs(html.unescape(item.encoded.text), 'lxml')
    data[id] = keys
    print(data)
    n += 1
It saved the label and claim perfectly well, but nothing for the text. Now that I opened the file using BeautifulSoup, it returns this error: 'NoneType' object is not callable
If you only need text inside <strong> tags, you can use my example. Otherwise, only regex seems suitable here:
from bs4 import BeautifulSoup
xml_doc = """
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
>
...the XML from the question...
</rss>
"""
soup = BeautifulSoup(xml_doc, "xml")
soup = BeautifulSoup(soup.select_one("content|encoded").text, "html.parser")
text = "\n".join(
    s.get_text(strip=True, separator=" ") for s in soup.select("strong")
)
print(text)
Prints:
some text text some more text
some more text
RESEARCH | ARTICLE
I eventually got the text part using regular expressions (regex).
import re
for item in root.iter('item'):
    grandchildren = list(item)  # Element.getchildren() was removed from ElementTree in Python 3.9
    for grandchild in grandchildren:
        if 'encoded' in grandchild.tag:
            text = grandchild.text or ""  # an empty CDATA section yields None
            text = re.sub(r'\[.*?\]', "", text)  # gets rid of square brackets and their content
            text = re.sub(r'\<.*?\>', "", text)  # gets rid of <> signs and their content
            text = text.replace("&nbsp;", "")  # gets rid of &nbsp;
            text = " ".join(text.split())
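For comparison, here is a minimal sketch that looks up the element by its namespace prefix instead of testing for 'encoded' in the tag name. It assumes the standard-library xml.etree.ElementTree parser and a shortened, hypothetical version of the feed from the question. Note that an Element with no children is falsy, so a test such as if item.find('content:encoded'): skips the element even when it is found; comparing with None avoids that.

```python
import re
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the feed from the question.
xml_doc = """<rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<item>
<title>Label: some_title</title>
<content:encoded><![CDATA[[vc_row][vc_column_text]<strong>some text</strong>
<em>Leave your comments</em>[/vc_column_text][/vc_row]]]></content:encoded>
</item>
</channel>
</rss>"""

# Map the prefix used in the path to the namespace URI declared in the feed.
ns = {"content": "http://purl.org/rss/1.0/modules/content/"}

root = ET.fromstring(xml_doc)
texts = []
for item in root.iter("item"):
    encoded = item.find("content:encoded", ns)
    # Compare with None: a childless Element evaluates as False.
    if encoded is None or encoded.text is None:
        continue
    text = re.sub(r"\[.*?\]", "", encoded.text)  # drop [shortcode] markers
    text = re.sub(r"<.*?>", " ", text)           # replace HTML tags with a space
    texts.append(" ".join(text.split()))         # collapse whitespace

print(texts)
```

The regular expressions are the same idea as in the answer above; only the element lookup differs.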

How do I access elements in an XML when multiple default namespaces are used?

I would expect this code to produce a non-empty list:
import xml.etree.ElementTree as et
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<A
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="a:namespace">
<B xmlns="b:namespace">
<C>"Stuff"</C>
</B>
</A>
'''
namespaces = {'a' : 'a:namespace', 'b' : 'b:namespace'}
xroot = et.fromstring(xml)
res = xroot.findall('b:C', namespaces)
instead, res is an empty array. Why?
When I inspect the contents of xroot I can see that the C item is within b:namespace as expected:
for x in xroot.iter():
    print(x)
# result:
<Element '{a:namespace}A' at 0x7f56e13b95e8>
<Element '{b:namespace}B' at 0x7f56e188d2c8>
<Element '{b:namespace}C' at 0x7f56e188def8>
To check whether something was wrong with my namespacing, I also tried xroot.findall('{b:namespace}C'), but the result was an empty array as well.
Your findall path 'b:C' only searches the direct children of the root element. Change it to './/b:C' so that the tag is found anywhere in the tree, e.g.:
import xml.etree.ElementTree as et
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<A
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="a:namespace">
<B xmlns="b:namespace">
<C>"Stuff"</C>
</B>
</A>
'''
namespaces = {'a' : 'a:namespace', 'b' : 'b:namespace'}
xroot = et.fromstring(xml)
######## changed the xpath to start with .//
res = xroot.findall('.//b:C', namespaces)
print( f"{res=}" )
for x in xroot.iter():
    print(x)
Output:
res=[<Element '{b:namespace}C' at 0x00000222DFCAAA40>]
<Element '{a:namespace}A' at 0x00000222DFCAA9A0>
<Element '{b:namespace}B' at 0x00000222DFCAA9F0>
<Element '{b:namespace}C' at 0x00000222DFCAAA40>
See here for some useful examples of ElementTree xpath support https://docs.python.org/3/library/xml.etree.elementtree.html?highlight=xpath#xpath-support
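The difference between the path forms can be sketched side by side. This is a minimal example on a compact copy of the XML above (the declaration and the unused xsi namespace are left out for brevity); the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

xml = '''<A xmlns="a:namespace">
  <B xmlns="b:namespace">
    <C>"Stuff"</C>
  </B>
</A>'''

namespaces = {'a': 'a:namespace', 'b': 'b:namespace'}
xroot = ET.fromstring(xml)

# 'b:C' only looks at direct children of A, and C is a grandchild.
direct = xroot.findall('b:C', namespaces)

# Spelling out the full path from the root also finds C.
by_path = xroot.findall('b:B/b:C', namespaces)

# './/' searches all descendants, with either the prefix or the {uri} form.
anywhere = xroot.findall('.//b:C', namespaces)
with_braces = xroot.findall('.//{b:namespace}C')

print(direct, by_path, anywhere, with_braces)
```

The braces form needs no namespaces dictionary, which is why the attempt with '{b:namespace}C' in the question failed for the same reason as 'b:C': the path was missing the leading './/'.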

How to change tags with lxml in Python?

I want to rename all <p> tags to <para> using lxml in Python.
Here's an example of what the xml file looks like.
<concept id="id15CDB0Q0Q4G"><title id="id15CDB0R0VHA">General</title>
<conbody><p>This section</p></conbody>
<concept id="id156F7H00GIE"><title id="id15CDB0R0V1W">
System</title>
<conbody><p> </p>
<p>The
</p>
<p>status.</p>
<p>sensors.</p>
And I've been trying to code it like this but it doesn't find the tags with .findall.
from lxml import etree
doc = etree.parse("73-20.xml")
print("\n")
print(etree.tostring(doc, pretty_print=True, xml_declaration=True, encoding="utf-8"))
print("\n")
raiz = doc.getroot()
print(raiz.tag)
children = raiz.getchildren()
print(children)
print("\n")
libros = doc.findall("p")
print(libros)
print("\n")
for i in range(len(libros)):
    if libros[i].find("p").tag == "p" :
        libros[i].find("p").tag = "para"
Any thoughts?
lxml's findall() function can search by path:
libros = raiz.findall(".//p")
for el in libros:
    el.tag = "para"
Here .//p means that lxml will search nested p elements as well.
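A self-contained sketch of the same renaming, with one substitution stated plainly: it uses the standard-library xml.etree.ElementTree instead of lxml so that it runs without extra packages. The calls used here (fromstring, iter, tostring) also exist in lxml.etree. The sample document is a shortened, hypothetical version of the file in the question.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the document from the question.
xml = ("<concept><title>General</title>"
       "<conbody><p>This section</p><p>status.</p></conbody></concept>")

root = ET.fromstring(xml)

# iter("p") walks the whole tree; list() takes a snapshot before renaming.
for el in list(root.iter("p")):
    el.tag = "para"

result = ET.tostring(root, encoding="unicode")
print(result)
```

Only the tag name changes; text, attributes and children of each element stay as they were.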

Extract all attributes of an element from XML in Python

I have multiple XML files containing tweets in a format similar to the one below:
<tweet idtweet='xxxxxxx'>
<topic>#irony</topic>
<date>20171109T03:39</date>
<hashtag>#irony</hashtag>
<irony>1</irony>
<emoji>Laughing with tears</emoji>
<nbreponse>0</nbreponse>
<nbretweet>0</nbretweet>
<textbrut> Some text here <img class="Emoji Emoji--forText" src="source.png" draggable="false" alt="😁" title="Laughing with tears" aria-label="Emoji: Laughing with tears"></img> #irony </textbrut>
<text>Some text here #irony </text>
</tweet>
There is a problem with the way the files were created (the closing tag for img is missing) so I made the choice of closing it as in the above example. I know that in HTML you can close it as
<img **something here** />
but I don't know if this holds for XML, as I didn't see it anywhere.
I'm writing Python code that extracts the topic and the plain text, but I am also interested in all the attributes of img, and I can't seem to get them. Here is what I've tried so far:
top = []
txt = []
emj = []
for article in root:
    topic = article.find('.topic')
    textbrut = article.find('.textbrut')
    emoji = article.find('.img')
    everything = textbrut.attrib
    if topic is not None and textbrut is not None:
        top.append(topic.text)
        txt.append(textbrut.text)
        x = list(everything.items())
        emj.append(x)
Any help would be greatly appreciated.
Element has some useful methods, such as Element.iter(), that iterate recursively over the whole subtree below it (its children, their children, and so on). Here is the solution that worked for me:
new = []  # collects one list of (attribute, value) pairs per img element
for emoji in root.iter('img'):
    print(emoji.attrib)
    everything = emoji.attrib
    x = list(everything.items())
    new.append(x)
For more details read here.
Below
import xml.etree.ElementTree as ET
xml = '''<t><tweet idtweet='xxxxxxx'>
<topic>#irony</topic>
<date>20171109T03:39</date>
<hashtag>#irony</hashtag>
<irony>1</irony>
<emoji>Laughing with tears</emoji>
<nbreponse>0</nbreponse>
<nbretweet>0</nbretweet>
<textbrut> Some text here <img class="Emoji Emoji--forText" src="source.png" draggable="false" alt="😁" title="Laughing with tears" aria-label="Emoji: Laughing with tears"></img> #irony </textbrut>
<text>Some text here #irony </text>
</tweet></t>'''
root = ET.fromstring(xml)
data = []
for tweet in root.findall('.//tweet'):
    data.append({'topic': tweet.find('./topic').text, 'text': tweet.find('./text').text,
                 'img_attributes': tweet.find('.//img').attrib})
print(data)
output
[{'topic': '#irony', 'text': 'Some text here #irony ', 'img_attributes': {'class': 'Emoji Emoji--forText', 'src': 'source.png', 'draggable': 'false', 'alt': '😁', 'title': 'Laughing with tears', 'aria-label': 'Emoji: Laughing with tears'}}]
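On the side question about closing the tag: the self-closing form <img ... /> is valid XML as well, so the files do not need an explicit </img>. The sketch below is a variation under stated assumptions: the two tweets are hypothetical sample data, and it collects the attributes of every img inside a tweet, which also covers tweets that contain no img at all (tweet.find('.//img') would return None there).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the second tweet has no img element.
xml = '''<t>
<tweet idtweet="1"><topic>#irony</topic>
<textbrut>Some text <img class="Emoji" alt="x" title="Laughing with tears"/> #irony</textbrut></tweet>
<tweet idtweet="2"><topic>#other</topic>
<textbrut>No emoji here</textbrut></tweet>
</t>'''

root = ET.fromstring(xml)
rows = []
for tweet in root.iter("tweet"):
    rows.append({
        "topic": tweet.findtext("topic"),
        # One dict of attributes per img found anywhere below this tweet.
        "img_attributes": [dict(img.attrib) for img in tweet.iter("img")],
    })

print(rows)
```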

xml parsing in python: how to capture child's text when it is placed after grandchildren in the xml tree

How can I get the text "it" from this xml sample using xml parser for python?
<EXP ID="2">
<W>
love
<EXP ID="1">
<PTR src="0" />
it
</EXP>
</W>
</EXP>
Here is what I tried:
import xml.etree.ElementTree as ET
r = ET.fromstring(sample)
for c in r:
    print (c.tag, c.attrib, c.text)
    for d in c:
        print (d.tag, d.attrib, d.text)
The output for this:
W {} love
EXP {'ID': '1'}
But it should be:
W {} love
EXP {'ID': '1'} it
I get the expected result if the word "it" was placed before the sub tag:
<EXP ID="2">
<W>
love
<EXP ID="1">
it
<PTR src="0" />
</EXP>
</W>
</EXP>
How can I get the same output from the original xml doc, where the text is placed after the child elements?
In the ElementTree model, a text node that comes after an element (as its following sibling) is stored as the tail of that element. So the text node 'it' in this case can be accessed from the tail of the PTR element:
>>> ptr = r.find('.//PTR')
>>> ptr.tail.strip()
'it'
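A short sketch of how text and tail relate, run on the sample from the question. When the position of the text relative to the child elements does not matter, Element.itertext() yields both text and tail content in document order, which is one way to collect all the words. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

sample = '''<EXP ID="2">
<W>
love
<EXP ID="1">
<PTR src="0" />
it
</EXP>
</W>
</EXP>'''

r = ET.fromstring(sample)
inner = r.find('.//EXP')   # the nested EXP with ID="1"
ptr = inner.find('PTR')

# 'it' follows the PTR element, so it is stored in PTR's tail,
# while inner.text only holds the whitespace before PTR.
tail_word = ptr.tail.strip()

# itertext() walks text and tail values in document order.
inner_words = "".join(inner.itertext()).split()
all_words = "".join(r.itertext()).split()

print(tail_word, inner_words, all_words)
```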
