ElementTree wrong encoding

I have been searching for hours but cannot find a solution online, so I am asking here.
I just want to print the inner content of an HTML tag stored in an XML document, but I only get escaped entities (&lt;, &gt;, and so on).
This is how it looks in the XML document:
<data table="tt_content" elementUid="2490" key="tt_content:NEW/1/2490:bodytext"><![CDATA[<img src="/fileadmin/public/Redaktion/Bilder/Icons/Icon-CE.png" width="28" height="21" class="float-left mt-1 mr-2">
<h4>EU-Baumusterprüfbescheinigung</h4>
When I print it, it looks like this:
<data table="tt_content" elementUid="2490" key="tt_content:NEW/1/2490:bodytext">&lt;img src="/fileadmin/public/Redaktion/Bilder/Icons/Icon-CE.png" width="28" height="21" class="float-left mt-1 mr-2"&gt;
&lt;h4&gt;EU-Baumusterpr&#252;fbescheinigung&lt;/h4&gt;
As you can see, the output is very different: the German characters are not displayed correctly, and the CDATA section, which is very important to me, is gone.
The angle brackets are replaced with &lt;, &gt; and so on.
Here is my code:
# data is the <data> element shown above (its CDATA section holds the <img> and <h4> markup)
raw = ET.tostring(data).decode()
print(raw)  # prints the output shown above
What I have also tried:
# raw = ET.tostring(raw, encoding='unicode', method='xml')
First I iterate to the position of the data element that I showed above:
def copy_content():
    for pageGrp in root.findall('pageGrp'):
        for data in pageGrp.iter('data'):
            tag = data.get("key").split(":")[2]
            if tag == "bodytext":
                raw = ET.tostring(data).decode()  # the problem starts here
                # ET.dump(data)
                # print(raw)
                # file = open('new.xml', 'a')
                # file.write(raw)
                print(raw)
I hope you can help me. Thanks in advance.
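
One possible reading of this behaviour, offered as a minimal sketch rather than a definitive fix: ElementTree does not keep CDATA markers when it parses. The content of the section simply becomes the element's text, and ET.tostring() escapes that text again on output. Without an encoding argument it also serializes to ASCII bytes, which turns "ü" into a character reference. Reading data.text should therefore give the markup unescaped. The shortened sample string, the root wrapper, and the closing ]]></data> below are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample; the closing ]]></data> is assumed.
sample = (
    '<root><data table="tt_content" elementUid="2490" '
    'key="tt_content:NEW/1/2490:bodytext"><![CDATA[<img src="Icon-CE.png">\n'
    '<h4>EU-Baumusterprüfbescheinigung</h4>]]></data></root>'
)

root = ET.fromstring(sample)
data = root.find("data")

# The parser has already unwrapped the CDATA section into plain text.
inner = data.text

# Serializing escapes the text again; encoding="unicode" returns a str
# and keeps non-ASCII characters such as "ü" as they are.
serialized = ET.tostring(data, encoding="unicode")

print(inner)
print(serialized)
```

If the CDATA wrapper itself has to survive in the written file, the standard library has no switch for that; lxml's etree.CDATA is one option, though that goes beyond this sketch.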

Related

Retrieving text data from <content:encoded> in XML file

I have an XML file which looks like this:
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
>
<channel>
<item>
<title>Label: some_title"</title>
<link>some_link</link>
<pubDate>some_date</pubDate>
<dc:creator><![CDATA[University]]></dc:creator>
<guid isPermaLink="false">https://link.link</guid>
<description></description>
<content:encoded><![CDATA[[vc_row][vc_column][vc_column_text]<strong>some texttext some more text</strong><!--more-->
[caption id="attachment_344" align="aligncenter" width="524"]<img class="-image-" src="link.link.png" alt="" width="524" height="316" /> <em>A screenshot by the people</em>[/caption]
<strong>some more text</strong>
<div class="entry-content">
<em>Leave your comments</em>
</div>
<div class="post-meta wf-mobile-collapsed">
<div class="entry-meta"></div>
</div>
[/vc_column_text][/vc_column][/vc_row][vc_row][vc_column][/vc_column][/vc_row][vc_row][vc_column][dt_quote]<strong><b>RESEARCH | ARTICLE </b></strong>University[/dt_quote][/vc_column][/vc_row]]]></content:encoded>
<excerpt:encoded><![CDATA[]]></excerpt:encoded>
</item>
some more <item> </item>s here
</channel>
I want to extract the raw text within the <content:encoded> section, excluding the tags and URLs. I have tried this with BeautifulSoup and Scrapy, as well as other lxml methods. Most return an empty list.
Is there a way for me to retrieve this information without having to use regex?
Much appreciated.
UPDATE
I opened the XML file using:
content = []
with open(xml_file, "r") as file:
    content = file.readlines()
    content = "".join(content)
    xml = bs(content, "lxml")
Then I tried this with Scrapy:
response = HtmlResponse(url=xml_file, encoding='utf-8')
response.selector.register_namespace('content',
                                     'http://purl.org/rss/1.0/modules/content/')
response.xpath('channel/item/content:encoded').getall()
which returns an empty list.
I also tried the code from the first answer:
soup = bs(xml.select_one("content:encoded").text, "html.parser")
text = "\n".join(
    s.get_text(strip=True, separator=" ") for s in soup.select("strong"))
print(text)
and got this error: Only the following pseudo-classes are implemented: nth-of-type.
When I opened the file with lxml, I ran this for loop:
data = {}
n = 0
for item in xml.findall('item'):
    id = 'claim_id_' + str(n)
    keys = {}
    title = item.find('title').text
    keys['label'] = title.split(': ')[0]
    keys['claim'] = title.split(': ')[1]
    if item.find('content:encoded'):
        keys['text'] = bs(html.unescape(item.encoded.text), 'lxml')
    data[id] = keys
    print(data)
    n += 1
It saved the label and claim perfectly well, but nothing for the text. Now that I open the file using BeautifulSoup, the loop fails with this error: 'NoneType' object is not callable

If you only need text inside <strong> tags, you can use my example. Otherwise, only regex seems suitable here:
from bs4 import BeautifulSoup
xml_doc = """
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
>
...the XML from the question...
</rss>
"""
soup = BeautifulSoup(xml_doc, "xml")
soup = BeautifulSoup(soup.select_one("content|encoded").text, "html.parser")
text = "\n".join(
    s.get_text(strip=True, separator=" ") for s in soup.select("strong")
)
print(text)
Prints:
some text text some more text
some more text
RESEARCH | ARTICLE

I eventually got the text part using regular expressions (regex).
import re
for item in root.iter('item'):
    # Element.getchildren() was removed in Python 3.9; iterate over the element directly
    for child in item:
        # skip empty elements such as <excerpt:encoded>, whose text is None
        if 'encoded' in child.tag and child.text:
            text = child.text
            text = re.sub(r'\[.*?\]', "", text)  # gets rid of square brackets and their content
            text = re.sub(r'<.*?>', "", text)  # gets rid of <> tags and their content
            text = text.replace("&nbsp;", "")  # gets rid of &nbsp; entities
            text = " ".join(text.split())  # collapses runs of whitespace
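
A standard-library variant of the same idea, as a rough sketch under stated assumptions: look up <content:encoded> with its namespace URI through ElementTree, then let html.parser drop the tags. The [shortcode] markers are not HTML, so a small regex for them remains. The shortened xml_doc, the TextCollector class and the plain_text helper are hypothetical names made up for this illustration.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

NS = {"content": "http://purl.org/rss/1.0/modules/content/"}

# Hypothetical, shortened stand-in for the feed in the question.
xml_doc = """<rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>Label: some_title</title>
      <content:encoded><![CDATA[[vc_row][vc_column_text]<strong>some text</strong>
<em>Leave your comments</em>[/vc_column_text][/vc_row]]]></content:encoded>
    </item>
  </channel>
</rss>"""


class TextCollector(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def plain_text(fragment):
    # Drop [shortcodes] first, then let the HTML parser drop the tags.
    fragment = re.sub(r"\[.*?\]", " ", fragment)
    collector = TextCollector()
    collector.feed(fragment)
    collector.close()
    # Collapse the leftover whitespace into single spaces.
    return " ".join(" ".join(collector.parts).split())


root = ET.fromstring(xml_doc)
texts = [
    plain_text(node.text or "")
    for node in root.findall("./channel/item/content:encoded", NS)
]
print(texts)
```

Passing the prefix-to-URI mapping to findall() is what makes the content:encoded lookup work; without it, ElementTree has no way to resolve the prefix.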

ElementTree, .set() and iteration

This is my first post on Stack Overflow, and I am a novice programmer.
I am having trouble using ElementTree and the .set() method. Using an f-string, I am able to assign recipe_id the correct number.
When I try to set the recipe_name attribute, every element gets only the last item in the name_list array. I'm a bit lost! I'm sure it's something in my syntax or in my understanding of how I'm iterating through the items. I just don't understand it, because the recipe_id portion works fine.
Expected output (within the XML):
<recipes recipe_id="1" recipe_name="Apples">
<recipes recipe_id="2" recipe_name="Oranges">
Instead I get:
<recipes recipe_id="1" recipe_name="Oranges">
<recipes recipe_id="2" recipe_name="Oranges">
My code:
#!/usr/bin/env python3
import os
import os.path
import xml.etree.ElementTree as ET
filename = "my_recipes.xml"
xmlTree = ET.parse(filename)
root = xmlTree.getroot()
# change the recipe id in recipes
i = 0
for element in root.iter("recipes"):
    i += 1
    element.set('recipe_id', f"{i}")
# for every tbody in the tree
name_list = []
for tbody in root.iter("tbody"):
    # for every time you find a row
    for row in tbody.findall('row'):
        data = row.find('entry').text
        # get those rows attribs
        rec_name = row.find('entry').attrib
        # if the row is the row that i want (contains the recipe name)...i couldn't figure out a better way to get this value precisely
        if rec_name == {'namest': 'c1', 'nameend': 'c2', 'align': 'left', 'valign': 'bottom'}:
            # yoink name and stick it in name_list
            name_list.append(data)
for recipes in root.findall('recipes'):
    for i in range(len(name_list)):
        recipes.set('recipe_name', F"{name_list[i]}")
xmlTree.write(filename, encoding='UTF-8', xml_declaration=True)
My XML:
<?xml version='1.0' encoding='UTF-8'?>
<Root>
<recipes>
<tbody>
<row>
<entry namest="c1" nameend="c2" align="left" valign="bottom">Apples</entry>
</row>
</tbody>
</recipes>
<recipes>
<tbody>
<row>
<entry namest="c1" nameend="c2" align="left" valign="bottom">Oranges</entry>
</row>
</tbody>
</recipes>
</Root>
FIXED:
The code should be
for i, recipes in enumerate(root.findall('recipes')):
    recipes.set('recipe_name', name_list[i])
and I am just stupid.
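
For anyone hitting the same thing, here is why the original loop misbehaved, as a small sketch: the inner loop set recipe_name once per entry of name_list on every element, so the last name always won. Pairing each element with exactly one name avoids that. The trimmed xml_doc and the variable names buggy and fixed are made up for this illustration, and zip() is swapped in as an alternative to indexing with enumerate().

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question (entry attributes trimmed).
xml_doc = """<Root>
  <recipes><tbody><row><entry namest="c1" nameend="c2">Apples</entry></row></tbody></recipes>
  <recipes><tbody><row><entry namest="c1" nameend="c2">Oranges</entry></row></tbody></recipes>
</Root>"""

root = ET.fromstring(xml_doc)
recipes = root.findall("recipes")
name_list = [r.find("./tbody/row/entry").text for r in recipes]

# Buggy pattern: the inner loop overwrites recipe_name once per name,
# so every element ends up holding the last name in the list.
for r in recipes:
    for name in name_list:
        r.set("recipe_name", name)
buggy = [r.get("recipe_name") for r in recipes]

# Fixed pattern: zip() walks elements and names in step, one name each.
for number, (r, name) in enumerate(zip(recipes, name_list), start=1):
    r.set("recipe_id", str(number))
    r.set("recipe_name", name)
fixed = [(r.get("recipe_id"), r.get("recipe_name")) for r in recipes]

print(buggy)
print(fixed)
```

The recipe_id part of the original code worked because it ran in a single loop, one value per element, which is exactly what the fixed pattern does for the names.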

Extract all attributes of an element from XML in Python

I have multiple XML files containing tweets in a format similar to the one below:
<tweet idtweet='xxxxxxx'>
<topic>#irony</topic>
<date>20171109T03:39</date>
<hashtag>#irony</hashtag>
<irony>1</irony>
<emoji>Laughing with tears</emoji>
<nbreponse>0</nbreponse>
<nbretweet>0</nbretweet>
<textbrut> Some text here <img class="Emoji Emoji--forText" src="source.png" draggable="false" alt="😁" title="Laughing with tears" aria-label="Emoji: Laughing with tears"></img> #irony </textbrut>
<text>Some text here #irony </text>
</tweet>
There is a problem with the way the files were created (the closing tag for img is missing), so I chose to close it as in the example above. I know that in HTML you can close it as
<img **something here** />
but I don't know if this holds for XML, as I haven't seen it anywhere.
I'm writing Python code that extracts the topic and the plain text, but I am also interested in all the attributes of img, and I can't seem to get them. Here is what I've tried so far:
top = []
txt = []
emj = []
for article in root:
    topic = article.find('.topic')
    textbrut = article.find('.textbrut')
    emoji = article.find('.img')
    everything = textbrut.attrib
    if topic is not None and textbrut is not None:
        top.append(topic.text)
        txt.append(textbrut.text)
        x = list(everything.items())
        emj.append(x)
Any help would be greatly appreciated.

Element has some useful methods, such as Element.iter(), that iterate recursively over the whole subtree below it (its children, their children, and so on). Here is the solution that worked for me:
new = []
for emoji in root.iter('img'):
    print(emoji.attrib)
    everything = emoji.attrib
    x = list(everything.items())
    new.append(x)
For more details, see the Element.iter() entry in the xml.etree.ElementTree documentation.

See below:
import xml.etree.ElementTree as ET
xml = '''<t><tweet idtweet='xxxxxxx'>
<topic>#irony</topic>
<date>20171109T03:39</date>
<hashtag>#irony</hashtag>
<irony>1</irony>
<emoji>Laughing with tears</emoji>
<nbreponse>0</nbreponse>
<nbretweet>0</nbretweet>
<textbrut> Some text here <img class="Emoji Emoji--forText" src="source.png" draggable="false" alt="😁" title="Laughing with tears" aria-label="Emoji: Laughing with tears"></img> #irony </textbrut>
<text>Some text here #irony </text>
</tweet></t>'''
root = ET.fromstring(xml)
data = []
for tweet in root.findall('.//tweet'):
    data.append({'topic': tweet.find('./topic').text, 'text': tweet.find('./text').text,
                 'img_attributes': tweet.find('.//img').attrib})
print(data)
Output:
[{'topic': '#irony', 'text': 'Some text here #irony ', 'img_attributes': {'class': 'Emoji Emoji--forText', 'src': 'source.png', 'draggable': 'false', 'alt': '😁', 'title': 'Laughing with tears', 'aria-label': 'Emoji: Laughing with tears'}}]
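
On the side question about closing the tag: <img ... /> is the empty-element syntax of XML itself, so it is valid there too and ElementTree parses it the same way as <img ...></img>. The sketch below uses that form and also tolerates tweets with no img or with several, by collecting a list of attribute dicts per tweet. The two-tweet sample and the tweets wrapper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-tweet sample; the second tweet has no <img> at all.
xml_doc = """<tweets>
  <tweet idtweet="1">
    <topic>#irony</topic>
    <textbrut>Some text here <img class="Emoji" alt="x" title="Laughing with tears" /> #irony</textbrut>
    <text>Some text here #irony</text>
  </tweet>
  <tweet idtweet="2">
    <topic>#irony</topic>
    <textbrut>No emoji here</textbrut>
    <text>No emoji here</text>
  </tweet>
</tweets>"""

root = ET.fromstring(xml_doc)
rows = []
for tweet in root.iter("tweet"):
    rows.append({
        "topic": tweet.findtext("topic"),
        "text": tweet.findtext("text"),
        # One attribute dict per <img>, however deeply nested; [] when none.
        "img_attributes": [dict(img.attrib) for img in tweet.iter("img")],
    })

print(rows)
```

Using tweet.iter("img") instead of tweet.find(".//img").attrib avoids an AttributeError on tweets where find() would return None.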

Using requests and Beautifulsoup to find text in page (With CSS)

I'm making a request to a webpage and trying to retrieve some text from it. The text is split up with span tags like this:
<span class="ed">This</span>
<span class="1">is</span>
<span class="12">jvgviehrgjfne</span>
<span class="dfe">my</span>
<span class="fd">gt4ugirdfgr</span>
<span class="df">string</span>
There are inline style sheets (CSS) that say whether or not each piece of text is displayed, so the gibberish text never appears on screen. This is an example of one of these sheets:
.ed{display:inline}
.1{display:inline}
.12{display:none}
.dfe{display:inline}
.fd{display:none}
.df{display:inline}
but there are more CSS files like this. I don't know if there is a better way to achieve my goal (print the text that is shown on screen and leave out the gibberish that is not displayed).
My script is able to print the text, but all of it (including the gibberish), like this: "This is jvgviehrgjfne my gt4ugirdfgr script!"

If I understood you correctly, you should parse the CSS files with a regex to find the classes associated with display:inline and pass the results to the BeautifulSoup API. Here is one way:
import re
import bs4
page_txt = """
<span class="ed">This</span>
<span class="1">is</span>
<span class="12">jvgviehrgjfne</span>
<span class="dfe">my</span>
<span class="fd">gt4ugirdfgr</span>
<span class="df">string</span>
"""
css_file_read_output = """
.ed{display:inline}
.1{display:inline}
.12{display:none}
.dfe{display:inline}
.fd{display:none}
.df{display:inline}"""
css_file_lines = css_file_read_output.splitlines()
css_lines_text = []
for line in css_file_lines:
    inline_search = re.search(".*inline.*", line)
    if inline_search is not None:
        inline_group = inline_search.group()
        class_name_search = re.search(r"\..*\{", inline_group)  # raw string avoids invalid-escape warnings
        class_name_group = class_name_search.group()
        class_name_group = class_name_group[1:-1]  # getting rid of the last { and first .
        css_lines_text.append(class_name_group)
    else:
        pass
page_bs = bs4.BeautifulSoup(page_txt, "lxml")
wanted_text_list = []
for line in css_lines_text:
    wanted_line = page_bs.find("span", class_=line)
    wanted_text = wanted_line.get_text(strip=True)
    wanted_text_list.append(wanted_text)
wanted_string = " ".join(wanted_text_list)
print(wanted_string)
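
A variation on the same approach, as a sketch only: the loop above emits the words in the order the classes appear in the CSS, which happens to match the page here but need not in general. Walking the page itself and skipping the hidden classes keeps document order. This version uses only the standard library; the VisibleText class is a hypothetical helper, and it assumes flat (non-nested) spans and simple one-class rules like the ones in the question.

```python
import re
from html.parser import HTMLParser

page_txt = """
<span class="ed">This</span>
<span class="1">is</span>
<span class="12">jvgviehrgjfne</span>
<span class="dfe">my</span>
<span class="fd">gt4ugirdfgr</span>
<span class="df">string</span>
"""
css_text = """
.ed{display:inline}
.1{display:inline}
.12{display:none}
.dfe{display:inline}
.fd{display:none}
.df{display:inline}"""

# Classes whose rule sets display:none (assumes simple ".class{...}" rules).
hidden = set(re.findall(r"\.([\w-]+)\s*\{[^}]*display\s*:\s*none", css_text))


class VisibleText(HTMLParser):
    """Keeps the text of <span> elements whose class is not hidden."""

    def __init__(self, hidden_classes):
        super().__init__()
        self.hidden_classes = hidden_classes
        self.skip = False
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            self.skip = any(c in self.hidden_classes for c in classes)

    def handle_endtag(self, tag):
        if tag == "span":
            self.skip = False

    def handle_data(self, data):
        # Ignore whitespace between spans and anything inside a hidden span.
        if not self.skip and data.strip():
            self.words.append(data.strip())


parser = VisibleText(hidden)
parser.feed(page_txt)
parser.close()
visible = " ".join(parser.words)
print(visible)
```

With several CSS files, their contents could be concatenated into css_text before the regex runs; rules that override each other would need a real CSS parser, which this sketch does not attempt.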

Extracting Data from Mysql XML dump with xml.dom.minidom

I exported a mysql database to xml with phpmyadmin and now I would like to parse it with minidom but I'm having trouble getting the content in the form that I need it.
Summary: I need to assign the variable title to the text contained within <column name="news_title">This is the title</column>
The extracted db looks like this:
<pma_xml_export version="1.0" >
<database name="dbname">
<!-- Table newsbox -->
<table name="newsbox">
<column name="news_id">1</column>
<column name="news_title">This is the title</column>
<column name="news_text">This is the news text</column>
<column name="date">Thu, 28 Feb 2008 20:10:30 -0500</column>
<column name="author">author</column>
<column name="category">site_announcement</column>
</table>
</database>
</pma_xml_export>
I am able to extract the text with the following script but it's not in the form that I need:
from xml.dom.minidom import parseString
doc = parseString(document)
pmaexport = doc.getElementsByTagName("pma_xml_export")[0]
columns = pmaexport.getElementsByTagName("column")
for item in columns:
    name = item.getAttribute("name")
    text = item.firstChild.data.strip()
    print(name, text)
What I need is something where I can assign the text contents of these elements to variables which can be passed on e.g.,
for item in columns:
    title = ???
    text = ???
    date = ???
    author = ???
If the db output was in the form of <title>Here's the Title</title> I would have plenty of examples to go off, but I just can't find any reference to something like <column name="news_title">This is the title</column>

It's been a while since I've used xml.dom.minidom but this should work...
columns = [c.firstChild.data for c in pmaexport.getElementsByTagName('column') if c.getAttribute('name') == 'news_title']
Plus, like, list comprehension!
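
If all of the fields are needed rather than just the title, one way to go, sketched here under the assumption that each <table> element holds one database row, is to build a dict per table keyed by the name attribute and then pull the variables out of it. The rows list and the row dict are hypothetical names, not part of the export format.

```python
from xml.dom.minidom import parseString

document = """<pma_xml_export version="1.0">
  <database name="dbname">
    <table name="newsbox">
      <column name="news_id">1</column>
      <column name="news_title">This is the title</column>
      <column name="news_text">This is the news text</column>
      <column name="date">Thu, 28 Feb 2008 20:10:30 -0500</column>
      <column name="author">author</column>
      <column name="category">site_announcement</column>
    </table>
  </database>
</pma_xml_export>"""

doc = parseString(document)
rows = []
for table in doc.getElementsByTagName("table"):
    # Map each column's name attribute to its text; empty columns become "".
    row = {}
    for column in table.getElementsByTagName("column"):
        node = column.firstChild
        row[column.getAttribute("name")] = node.data.strip() if node else ""
    rows.append(row)

for row in rows:
    title = row["news_title"]
    text = row["news_text"]
    date = row["date"]
    author = row["author"]
    print(title, text, date, author)
```

Keeping the whole row in a dict means a new column in the export does not require changing the parsing loop, only the lines that read from row.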
