Extract all attributes of an element from XML in Python

I have multiple XML files containing tweets in a format similar to the one below:
<tweet idtweet='xxxxxxx'>
<topic>#irony</topic>
<date>20171109T03:39</date>
<hashtag>#irony</hashtag>
<irony>1</irony>
<emoji>Laughing with tears</emoji>
<nbreponse>0</nbreponse>
<nbretweet>0</nbretweet>
<textbrut> Some text here <img class="Emoji Emoji--forText" src="source.png" draggable="false" alt="😁" title="Laughing with tears" aria-label="Emoji: Laughing with tears"></img> #irony </textbrut>
<text>Some text here #irony </text>
</tweet>
There is a problem with the way the files were created (the closing tag for img is missing), so I chose to close it as in the example above. I know that in HTML you can close it as
<img **something here** />
but I don't know whether this also holds for XML, as I haven't seen it anywhere.
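For reference, XML does allow self-closing (empty-element) tags, so <img ... /> and <img ...></img> describe the same element; a minimal check with the standard library, using shortened placeholder attributes:

import xml.etree.ElementTree as ET

# both spellings parse to the same empty element
explicit = ET.fromstring('<img src="source.png" title="Laughing with tears"></img>')
self_closed = ET.fromstring('<img src="source.png" title="Laughing with tears"/>')
print(explicit.tag == self_closed.tag and explicit.attrib == self_closed.attrib)  # True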
I'm writing Python code that extracts the topic and the plain text, but I am also interested in all the attributes contained by img, and I don't seem to be able to get them. Here is what I've tried so far:
top = []
txt = []
emj = []

for article in root:
    topic = article.find('.topic')
    textbrut = article.find('.textbrut')
    emoji = article.find('.img')
    everything = textbrut.attrib
    if topic is not None and textbrut is not None:
        top.append(topic.text)
        txt.append(textbrut.text)
        x = list(everything.items())
        emj.append(x)
Any help would be greatly appreciated.

Apparently, Element has some useful methods (such as Element.iter()) that help iterate recursively over the whole sub-tree below it (its children, their children, and so on). So here is the solution that worked for me:
new = []  # collects the attribute lists; root comes from ET.parse(...).getroot()
for emoji in root.iter('img'):
    print(emoji.attrib)
    everything = emoji.attrib
    x = list(everything.items())
    new.append(x)
For more details, read the xml.etree.ElementTree documentation.

Below is a complete, working example:
import xml.etree.ElementTree as ET
xml = '''<t><tweet idtweet='xxxxxxx'>
<topic>#irony</topic>
<date>20171109T03:39</date>
<hashtag>#irony</hashtag>
<irony>1</irony>
<emoji>Laughing with tears</emoji>
<nbreponse>0</nbreponse>
<nbretweet>0</nbretweet>
<textbrut> Some text here <img class="Emoji Emoji--forText" src="source.png" draggable="false" alt="😁" title="Laughing with tears" aria-label="Emoji: Laughing with tears"></img> #irony </textbrut>
<text>Some text here #irony </text>
</tweet></t>'''
root = ET.fromstring(xml)
data = []
for tweet in root.findall('.//tweet'):
    data.append({'topic': tweet.find('./topic').text,
                 'text': tweet.find('./text').text,
                 'img_attributes': tweet.find('.//img').attrib})
print(data)
Output:
[{'topic': '#irony', 'text': 'Some text here #irony ', 'img_attributes': {'class': 'Emoji Emoji--forText', 'src': 'source.png', 'draggable': 'false', 'alt': '😁', 'title': 'Laughing with tears', 'aria-label': 'Emoji: Laughing with tears'}}]
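Since the original question mentions multiple XML files, the same extraction can be wrapped in a loop over all of them; a minimal sketch, assuming the files sit in a tweets/ directory (the glob pattern and output structure are just placeholders):

import glob
import xml.etree.ElementTree as ET

data = []
for path in glob.glob('tweets/*.xml'):        # hypothetical location of the files
    root = ET.parse(path).getroot()
    for tweet in root.findall('.//tweet'):
        img = tweet.find('.//img')            # may be missing in some tweets
        data.append({'topic': tweet.find('./topic').text,
                     'text': tweet.find('./text').text,
                     'img_attributes': img.attrib if img is not None else {}})
print(data)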

Related

Retrieving text data from <content:encoded> in XML file

I have an XML file which looks like this:
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
>
<channel>
<item>
<title>Label: some_title"</title>
<link>some_link</link>
<pubDate>some_date</pubDate>
<dc:creator><![CDATA[University]]></dc:creator>
<guid isPermaLink="false">https://link.link</guid>
<description></description>
<content:encoded><![CDATA[[vc_row][vc_column][vc_column_text]<strong>some texttext some more text</strong><!--more-->
[caption id="attachment_344" align="aligncenter" width="524"]<img class="-image-" src="link.link.png" alt="" width="524" height="316" /> <em>A screenshot by the people</em>[/caption]
<strong>some more text</strong>
<div class="entry-content">
<em>Leave your comments</em>
</div>
<div class="post-meta wf-mobile-collapsed">
<div class="entry-meta"></div>
</div>
[/vc_column_text][/vc_column][/vc_row][vc_row][vc_column][/vc_column][/vc_row][vc_row][vc_column][dt_quote]<strong><b>RESEARCH | ARTICLE </b></strong>University[/dt_quote][/vc_column][/vc_row]]]></content:encoded>
<excerpt:encoded><![CDATA[]]></excerpt:encoded>
</item>
some more <item> </item>s here
</channel>
I want to extract the raw text within the <content:encoded> section, excluding the tags and URLs. I have tried this with BeautifulSoup and Scrapy, as well as various lxml methods. Most return an empty list.
Is there a way for me to retrieve this information without having to use regex?
Much appreciated.
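For what it's worth, ElementTree itself can reach the element once the content namespace is mapped; a hedged sketch (xml_file is the same path variable used in the update below, and the CDATA payload comes back as the element's plain text, markup still included):

import xml.etree.ElementTree as ET

ns = {'content': 'http://purl.org/rss/1.0/modules/content/'}
root = ET.parse(xml_file).getroot()
for item in root.iter('item'):
    encoded = item.find('content:encoded', ns)   # namespace-aware lookup
    if encoded is not None and encoded.text:
        print(encoded.text)                      # raw CDATA payload; tags still need stripping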
UPDATE
I opened the XML file using:
content = []
with open(xml_file, "r") as file:
    content = file.readlines()
    content = "".join(content)
xml = bs(content, "lxml")
then I tried this with scrapy:
response = HtmlResponse(url=xml_file, encoding='utf-8')
response.selector.register_namespace('content',
                                     'http://purl.org/rss/1.0/modules/content/')
response.xpath('channel/item/content:encoded').getall()
which returns an empty list.
and tried the code in the first answer:
soup = bs(xml.select_one("content:encoded").text, "html.parser")
text = "\n".join(
    s.get_text(strip=True, separator=" ") for s in soup.select("strong"))
print(text)
and get this error: Only the following pseudo-classes are implemented: nth-of-type.
When I opened the file with lxml, I ran this for loop:
data = {}
n = 0
for item in xml.findall('item'):
    id = 'claim_id_' + str(n)
    keys = {}
    title = item.find('title').text
    keys['label'] = title.split(': ')[0]
    keys['claim'] = title.split(': ')[1]
    if item.find('content:encoded'):
        keys['text'] = bs(html.unescape(item.encoded.text), 'lxml')
    data[id] = keys
    print(data)
    n += 1
It saved the label and claim perfectly well, but nothing for the text. Now that I opened the file using BeautifulSoup, it returns this error: 'NoneType' object is not callable
If you only need text inside <strong> tags, you can use my example. Otherwise, only regex seems suitable here:
from bs4 import BeautifulSoup
xml_doc = """
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
>
...the XML from the question...
</rss>
"""
soup = BeautifulSoup(xml_doc, "xml")
soup = BeautifulSoup(soup.select_one("content|encoded").text, "html.parser")
text = "\n".join(
    s.get_text(strip=True, separator=" ") for s in soup.select("strong")
)
print(text)
Prints:
some text text some more text
some more text
RESEARCH | ARTICLE
I eventually got the text part using regular expressions (regex).
import re

# root comes from parsing the file, e.g. root = ET.parse(xml_file).getroot()
for item in root.iter('item'):
    grandchildren = item.getchildren()
    for grandchild in grandchildren:
        if 'encoded' in grandchild.tag:
            text = grandchild.text
            text = re.sub(r'\[.*?\]', "", text)  # gets rid of square brackets and their content
            text = re.sub(r'\<.*?\>', "", text)  # gets rid of <> signs and their content
            text = text.replace("&nbsp;", "")    # gets rid of &nbsp; entities
            text = " ".join(text.split())

ElementTree wrong encoding

I've been searching for hours but I can't find the solution online, so I'm asking here now.
I just want to print the content inside an HTML tag in an XML document, but I'm only getting things like &lt;, &gt;, and so on.
It looks like this in the XML document:
<data table="tt_content" elementUid="2490" key="tt_content:NEW/1/2490:bodytext"><![CDATA[<img src="/fileadmin/public/Redaktion/Bilder/Icons/Icon-CE.png" width="28" height="21" class="float-left mt-1 mr-2">
<h4>EU-Baumusterprüfbescheinigung</h4>
When I print it, it looks like this:
<data table="tt_content" elementUid="2490" key="tt_content:NEW/1/2490:bodytext"><img src="/fileadmin/public/Redaktion/Bilder/Icons/Icon-CE.png" width="28" height="21" class="float-left mt-1 mr-2">
<h4>EU-Baumusterprüfbescheinigung</h4>
As you can see, it is very different: not only are the German characters not displayed, but the "CDATA" marker, which is very important to me, is also gone.
They are replaced with &lt; and so on.
And now to my code:
# raw is supposed to hold the element exactly as it appears in the file:
# <data table="tt_content" elementUid="2490" key="tt_content:NEW/1/2490:bodytext"><![CDATA[<img src="/fileadmin/public/Redaktion/Bilder/Icons/Icon-CE.png" width="28" height="21" class="float-left mt-1 mr-2">
# <h4>EU-Baumusterprüfbescheinigung</h4>
raw = ET.tostring(data).decode()
print(raw)  # prints the output shown above
What I've also tried:
# raw = ET.tostring(raw, encoding='unicode', method='xml')
First I iterate to the position where I have the data element I showed you before:
def copy_content():
    for pageGrp in root.findall('pageGrp'):
        for data in pageGrp.iter('data'):
            tag = data.get("key").split(":")[2]
            if tag == "bodytext":
                raw = ET.tostring(data).decode()  # <-- it starts here
                # ET.dump(data)
                # print(raw)
                # file = open('new.xml', 'a')
                # file.write(raw)
                print(raw)
I hope you can help me. Thanks in advance.
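For context, a hedged sketch rather than a confirmed answer: xml.etree folds CDATA sections into plain text and re-escapes <, > and & when it serialises them, and its default us-ascii output also turns umlauts into numeric character references, which matches the symptoms above. Two possible workarounds, shown on a shortened stand-in for the <data> element (the strip_cdata option belongs to lxml, not to xml.etree):

import xml.etree.ElementTree as ET
from xml.sax.saxutils import unescape
from lxml import etree

snippet = ('<data table="tt_content" elementUid="2490" key="tt_content:NEW/1/2490:bodytext">'
           '<![CDATA[<img src="Icon-CE.png"><h4>EU-Baumusterprüfbescheinigung</h4>]]></data>')

# xml.etree: CDATA becomes plain text; serialise as unicode and unescape the entities
data = ET.fromstring(snippet)
raw = ET.tostring(data, encoding='unicode')
print(unescape(raw))                               # &lt;img ...&gt; becomes <img ...> again, CDATA wrapper is lost

# lxml: tell the parser not to strip CDATA, then the section survives serialisation
parser = etree.XMLParser(strip_cdata=False)
data2 = etree.fromstring(snippet, parser)
print(etree.tostring(data2, encoding='unicode'))   # still contains <![CDATA[...]]>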

Replacing a custom "HTML" tag in a Python string

I want to be able to include a custom "HTML" tag in a string, such as: "This is a <photo id="4" /> string".
In this case the custom tag is <photo id="4" />. I would also be fine changing this custom tag to be written differently if it makes it easier, ie [photo id:4] or something.
I want to be able to pass this string to a function that will extract the tag <photo id="4" />, and allow me to transform this to some more complicated template like <div class="photo"><img src="...." alt="..."></div>, which I can then use to replace the tag in the original string.
I'm imagining it would work something like this:
>>> content = 'This is a <photo id="4" /> string'
# Pass the string to a function that returns all the tags with the given name.
>>> tags = parse_tags('photo', content)
>>> print(tags)
[{'tag': 'photo', 'id': 4, 'raw': '<photo id="4" />'}]
# Now I know I need to render a photo with ID 4, so I can pass that to some sort of template thing.
>>> rendered = render_photo(id=tags[0]['id'])
>>> print(rendered)
<div class="photo"><img src="...." alt="..."></div>
>>> content = content.replace(tags[0]['raw'], rendered)
>>> print(content)
This is a <div class="photo"><img src="...." alt="..."></div> string
I think this is a fairly common pattern, for something like putting a photo in a blog post, so I'm wondering if there is a library out there that will do something similar to the example parse_tags function above. Or do I need to write it?
This example of the photo tag is just a single example. I would want to have tags with different names. As a different example, maybe I have a database of people and I want a tag like <person name="John Doe" />. In that case the output I want is something like {'tag': 'person', 'name': 'John Doe', 'raw': '<person name="John Doe" />'}. I can then use the name to look that person up and return a rendered template of the person's vcard or something.
If you're working with HTML5, I would suggest looking into the xml module (etree). It will allow you to parse the whole document into a tree structure and manipulate tags individually (and then turn the result back into an HTML document).
You could also use regular expressions to perform text substitutions. This would likely be faster than loading an XML tree structure if you don't have too many changes to make.
import re

text = """<html><body>some text <photo> and tags <photo id="4"> more text <person name="John Doe"> yet more text"""
tags = ["photo", "person", "abc"]

patterns = "|".join([f"(<{tag} .*?>)|(<{tag}>)" for tag in tags])
matches = list(re.finditer(patterns, text))
for match in reversed(matches):
    tag = text[match.start():match.end()]
    print(match.start(), match.end(), tag)
    # substitute what you need for that tag
    text = text[:match.start()] + "***" + text[match.end():]
print(text)
This will be printed:
64 88 <person name="John Doe">
39 53 <photo id="4">
22 29 <photo>
<html><body>some text *** and tags *** more text *** yet more text
Performing the replacements in reverse order ensures that the ranges found by finditer() remain valid as the text changes with the substitutions.
For this kind of "surgical" parsing (where you want to isolate specific tags instead of creating a full hierarchical document), pyparsing's makeHTMLTags method can be very useful.
See the annotated script below, showing the creation of the parser, and using it for parseTag and replaceTag methods:
import pyparsing as pp

def make_tag_parser(tag):
    # makeHTMLTags returns 2 parsers, one for the opening tag and one for the
    # closing tag - we only need the opening tag; the parser will return parsed
    # fields of the tag itself
    tag_parser = pp.makeHTMLTags(tag)[0]

    # instead of returning parsed bits of the tag, use originalTextFor to
    # return the raw tag as token[0] (specifying asString=False will retain
    # the parsed attributes and tag name as attributes)
    parser = pp.originalTextFor(tag_parser, asString=False)

    # add one more callback to define the 'raw' attribute, copied from t[0]
    def add_raw_attr(t):
        t['raw'] = t[0]
    parser.addParseAction(add_raw_attr)

    return parser

# parseTag to find all the matches and report their attributes
def parseTag(tag, s):
    return make_tag_parser(tag).searchString(s)

content = """This is a <photo id="4" /> string"""

tag_matches = parseTag("photo", content)
for match in tag_matches:
    print(match.dump())
    print("raw: {!r}".format(match.raw))
    print("tag: {!r}".format(match.tag))
    print("id: {!r}".format(match.id))

# transform tag to perform tag->div transforms
def replaceTag(tag, transform, s):
    parser = make_tag_parser(tag)
    # add one more parse action to do transform
    parser.addParseAction(lambda t: transform.format(**t))
    return parser.transformString(s)

print(replaceTag("photo",
                 '<div class="{tag}"><img src="<src_path>/img_{id}.jpg." alt="{tag}_{id}"></div>',
                 content))
Prints:
['<photo id="4" />']
- empty: True
- id: '4'
- raw: '<photo id="4" />'
- startPhoto: ['photo', ['id', '4'], True]
  [0]:
    photo
  [1]:
    ['id', '4']
  [2]:
    True
- tag: 'photo'
raw: '<photo id="4" />'
tag: 'photo'
id: '4'
This is a <div class="photo"><img src="<src_path>/img_4.jpg." alt="photo_4"></div> string

Why does cElementTree iterparse return None elements?

I am trying to parse an xml file with cElementTree.iterparse.
However, I can't understand what is going on because iterparse returns empty elements.
I have an xml file that has the following approximate layout:
<DOCS>
<ID id="1">
<HEAD>title1</HEAD>
<DATE>21.01.2010</DATE>
<TEXT>
<P>some text</P>
<P>some text</P>
<P>some text</P>
</TEXT>
</ID>
<ID id="2">
<HEAD>title2</HEAD>
<DATE>21.01.2010</DATE>
<TEXT>
some text
</TEXT>
</ID>
</DOCS>
I am trying to extract text from TEXT tag or iterate through TEXT tag children (P tags) and extract text from them as well.
Here is my code:
import xml.etree.cElementTree as cet

docs = {}
id = ''
for event, elem in cet.iterparse(xml_data, events=('end',)):
    if elem.tag == 'ID':
        id = elem.attrib['id']
    if elem.tag == 'TEXT':
        if list(elem):
            docs[id] = ''.join([p.text for p in elem])
        else:
            docs[id] = elem.text
#print(docs)
return docs
When I execute my code I get:
docs[id] = ''.join([p.text for p in elem])
TypeError: sequence item 14: expected str instance, NoneType found
This means that one of the p.text values in the list comprehension [p.text for p in elem] is None. OK, I used print statements to find the text of the previous p, to see if there is something wrong with the XML file's tags. Well, the p element that supposedly has no text actually should have it, because it has a text body in the XML file. Can somebody explain what is going on?
Stupid mistake: I forgot the if event == 'end': check.
So, what is going on is that only when event == 'end' do we have a fully populated elem object.
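A minimal sketch of how the corrected loop can look, assuming the real code subscribes to both 'start' and 'end' events (the id attribute is already available at the start event, while the children are only fully populated once the end event has fired); the inline sample data just mirrors the layout above:

import io
import xml.etree.ElementTree as ET  # cElementTree is deprecated; ElementTree behaves the same

xml_data = io.StringIO("""<DOCS>
  <ID id="1"><HEAD>title1</HEAD><TEXT><P>some text</P><P>more text</P></TEXT></ID>
  <ID id="2"><HEAD>title2</HEAD><TEXT>some text</TEXT></ID>
</DOCS>""")

docs = {}
id = ''
for event, elem in ET.iterparse(xml_data, events=('start', 'end')):
    if event == 'start' and elem.tag == 'ID':
        id = elem.attrib['id']                 # attributes exist at the start event
    elif event == 'end' and elem.tag == 'TEXT':
        if list(elem):                         # children are complete only at the end event
            docs[id] = ''.join(p.text or '' for p in elem)
        else:
            docs[id] = elem.text
print(docs)  # {'1': 'some textmore text', '2': 'some text'}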

Parsing HTML with lxml (python)

I'm trying to save the content of an HTML page in a .html file, but I only want to save the content under the "table" tag. In addition, I'd like to remove all empty tags like <b></b>.
I did all these things already with BeautifulSoup:
f = urllib2.urlopen('http://test.xyz')
html = f.read()
f.close()

soup = BeautifulSoup(html)
txt = ""
for text in soup.find_all("table", {'class': 'main'}):
    txt += str(text)

text = BeautifulSoup(text)
empty_tags = text.find_all(lambda tag: tag.name == 'b' and tag.find(True) is None and (tag.string is None or tag.string.strip() == ""))
[empty_tag.extract() for empty_tag in empty_tags]
My question is: is this also possible with lxml? If yes, how would this look, more or less?
Thanks a lot for any help.
import lxml.html

# lxml can download pages directly
root = lxml.html.parse('http://test.xyz').getroot()

# use a CSS selector for class="main",
# or use root.xpath('//table[@class="main"]')
tables = root.cssselect('table.main')

# extract HTML content from all tables
# use lxml.html.tostring(t, method="text", encoding=unicode)
# to get text content without tags
"\n".join([lxml.html.tostring(t) for t in tables])

# removing only specific empty tags, here <b></b> and <i></i>
for empty in root.xpath('//*[self::b or self::i][not(node())]'):
    empty.getparent().remove(empty)

# removing all empty tags (tags that do not have children nodes)
for empty in root.xpath('//*[not(node())]'):
    empty.getparent().remove(empty)

# root does not contain those empty tags anymore
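To finish with what the question actually asks for (writing only the kept tables to an .html file), a minimal sketch building on the answer above; the URL is the question's placeholder and the output filename is just an example:

import lxml.html

root = lxml.html.parse('http://test.xyz').getroot()

# drop empty <b></b> tags first, as above
for empty in root.xpath('//b[not(node())]'):
    empty.getparent().remove(empty)

# serialise only the class="main" tables and write them to a file
html_out = "\n".join(
    lxml.html.tostring(t, encoding='unicode') for t in root.cssselect('table.main')
)
with open('tables.html', 'w', encoding='utf-8') as f:  # 'tables.html' is a placeholder name
    f.write(html_out)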
