Keep HTML file structure after modifying it with BeautifulSoup - python

I'm using Python and BeautifulSoup to find and replace some text on an HTML page. My problem is that I need to keep the file structure (indentation, spaces, new lines etc.) unchanged and change only the desired elements. How can I achieve this? Both str(soup) and soup.prettify() alter the source file in many ways.
P.S. sample code:
soup = BeautifulSoup(text)
for element in soup.findAll(text=True):
    if element.parent.name not in ['style', 'script', 'head', 'title', 'pre']:
        element.replaceWith(process(element))
result = str(soup)

I'd say there's no easy way (or no way at all). From BeautifulStoneSoup's doc:
__str__(self, encoding='utf-8', prettyPrint=False, indentLevel=0)
Returns a string or Unicode representation of this tag and
its contents. To get Unicode, pass None for encoding.
NOTE: since Python's HTML parser consumes whitespace, this
method is not certain to reproduce the whitespace present in
the original string.
According to the note, the original whitespace is already lost by the time the document reaches the internal representation.
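One possible workaround, sketched here under some assumptions, is to skip the tree entirely and stream the document through the standard library's html.parser, re-emitting the markup as it arrives and passing only text through the asker's process function. TextRewriter and rewrite_text are hypothetical names, not part of BeautifulSoup. Start tags are copied verbatim, but end tags, comments and declarations are rebuilt, so unusual spellings such as `</P >` would be normalized, and text containing entities reaches process in several chunks.

```python
from html.parser import HTMLParser

# Text inside these elements is copied through untouched (same list as in the question).
SKIP_TAGS = {"style", "script", "head", "title", "pre"}


class TextRewriter(HTMLParser):
    """Re-emit the markup as it arrives and pass only text through `process`."""

    def __init__(self, process):
        # convert_charrefs=False keeps entities such as &amp; exactly as written.
        super().__init__(convert_charrefs=False)
        self.process = process
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        self.out.append(self.get_starttag_text())  # verbatim source of the tag

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Drop everything opened after the matching start tag (e.g. <br>).
            while self.open_tags.pop() != tag:
                pass
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if SKIP_TAGS.intersection(self.open_tags):
            self.out.append(data)
        else:
            self.out.append(self.process(data))

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def rewrite_text(html_source, process):
    parser = TextRewriter(process)
    parser.feed(html_source)
    parser.close()
    return "".join(parser.out)


source = "<html>\n  <head><title>hello</title></head>\n  <body>\n    <p>hello   world</p>\n  </body>\n</html>\n"
result = rewrite_text(source, str.upper)
print(result)
```

In this sketch only the paragraph text changes; the indentation, the newlines and the title are passed through as they were.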

Related

How to convert CSS selected field into normal python string

My scrapy project is giving me a strange encoding for items when using CSS selectors.
Here is the relevant code:
Once the scrapy request is made and the webpage is downloaded, parse_page is called with the response...
def parse_page(self, response):
    # Using Selenium WebDriver to select elements
    records = self.driver.find_elements_by_css_selector('#searchResultsTable > tbody > tr')
    for record in records:
        # Convert selenium object into scrapy.Selector object (necessary to use .add_css methods)
        sel = Selector(text=record.get_attribute('outerHTML'))
        # Instantiate RecordLoader (a custom item loader)
        il = RecordLoader(item=Record(), selector=sel)
        # Select element and pass to example_field's input processor
        il.add_css('example_field', 'td:nth-child(2) > a::text')
il.add_css() passes the result of the CSS selector to example_field's input processor, which for demonstration purposes contains only print statements and shows the issue:
def example_field_input_processor(text_html):
    print(text_html)
    print(type(text_html))
    print(text_html.encode('utf-8'))
Output:
'\xa0\xa004/29/2020 10:50:24 AM,\xa0\xa0\xa0'
<class 'str'>
b'\xc2\xa0\xc2\xa004/29/2020 10:50:24 AM,\xc2\xa0\xc2\xa0\xc2\xa0'
Here are my questions:
1) Why is it that the CSS selector didn't just give me a normal Python string? Does it have to do with the CSS selector casting to text with ::text? Is it because the webpage is in a different encoding? I checked if there was a <meta> tag that specified the site's encoding but there wasn't one.
2) When I force an encoding of 'utf-8' why don't I get a normal python string instead of a bytes string that shows all the Unicode characters?
3) My goal is to have just a normal python string (No prepended b, no weird characters) that I can parse. How?
While scraping, you sometimes have to clean unwanted Unicode characters out of your results.
They are usually the result of spaces, tabs and sometimes span elements (\xa0, for example, is a non-breaking space).
As a common practice, clean all the text you scrape:
def string_cleaner(rouge_text):
    return ("".join(rouge_text.strip()).encode('ascii', 'ignore').decode("utf-8"))
Explanation:
strip() removes the leading and trailing whitespace; the "".join(...) around it leaves the string as it is.
This part of the code: "".join(rouge_text.strip())
Then encode it to ASCII, ignoring errors, and decode it back to a str, which drops every non-ASCII character.
This part of the code: .encode('ascii', 'ignore').decode("utf-8")
How you would apply it in your code:
print(string_cleaner(text_html))
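A gentler alternative, sketched here on the assumption that the value really is a Python 3 str containing U+00A0 (the non-breaking space that &nbsp; decodes to), is to replace only that character instead of discarding everything non-ASCII. clean_text is a hypothetical helper name. The sketch also shows that the \xa0 escapes in the question's output are only how repr() displays the string: it is already a normal str.

```python
def clean_text(raw):
    """Turn non-breaking spaces into plain spaces and collapse whitespace runs."""
    # U+00A0 is the non-breaking space that &nbsp; decodes to.
    return " ".join(raw.replace("\xa0", " ").split())


raw = "\xa0\xa004/29/2020 10:50:24 AM,\xa0\xa0\xa0"
cleaned = clean_text(raw)

print(repr(raw))   # the escapes are only how repr() displays the string
print(cleaned)
```

Unlike the ASCII round trip, this keeps accented letters and other legitimate non-ASCII text intact.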

How can I remove bad data in an XPath element using Python?

I have this short example to demonstrate my problem:
from lxml import html
post = """<p>This a page with URLs.
<a href="http://google.com">This goes to&#xA;Google</a><br/>
<a href="http://yahoo.com">This&#xA;goes to Yahoo!</a><br/>
<a&#xA;href="http://example.com">This is invalid due to that&#xA;line feed character</p>
"""
doc = html.fromstring(post)
for link in doc.xpath('//a'):
    print(link.get('href'))
This outputs:
http://google.com
http://yahoo.com
None
The problem is that my data has &#xA; characters embedded in it. For my last link, one is embedded directly between the anchor and the href attribute. The line feeds outside of the elements are important to me.
doc.xpath('//a') correctly saw the <a&#xA;href="http://example.com"> as a link, but it can't access the href attribute when I do link.get('href').
How can I clean the data if link.get('href') returns None, so that I can still retrieve the discovered href attribute?
I can't strip all of the &#xA; characters from the entire post element as the ones in the text are important.
Module unidecode
Since you need the data outside of the tags, you could try using unidecode. It doesn't tackle Chinese and Korean, but it'll do things like change left and right quotes to ASCII quotes. It should help with these &#xA; characters as well, changing them to spaces instead of non-breaking spaces. Hopefully that's all you need in regards to preserving the other data. str.replace(u"\#xa", u" ") is less heavy handed if the ASCII space is okay.
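import unidecode, urllib2
from lxml import html
# urlopen() returns a file-like object, so read and decode it first (UTF-8 is assumed here)
html_text = urllib2.urlopen("http://www.yourwebsite.com").read().decode("utf-8")
ascii_text = unidecode.unidecode(html_text)
doc = html.fromstring(ascii_text)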
Explanation of issue
There seems to be a known issue with this in several versions of Python, and it affects C# as well. A related closed issue seems to indicate that it was closed because XML attribute tags aren't built to support carriage returns, so escaping them in all XML contexts would be silly. As it turns out, the W3C spec requires that the Unicode character be put in when parsing (see sec. 1):
All line breaks must have been normalized on input to #xA as described in 2.11 End-of-Line Handling, so the rest of this algorithm operates on text normalized in this way.
You may solve your specific problem with:
post = post.replace('&#xA;', '\n')
Resulting test program:
from lxml import html
post = """<p>This a page with URLs.
<a href="http://google.com">This goes to&#xA;Google</a><br/>
<a href="http://yahoo.com">This&#xA;goes to Yahoo!</a><br/>
<a&#xA;href="http://example.com">This is invalid due to that&#xA;line feed character</p>
"""
post = post.replace('&#xA;', '\n')
doc = html.fromstring(post)
for link in doc.xpath('//a'):
    print(link.get('href'))
Output:
http://google.com
http://yahoo.com
http://example.com
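Since the question says the &#xA; references inside the text must survive, a narrower variant is sketched below: it replaces the reference only inside tags and leaves text nodes alone. fix_entities_in_tags and the sample markup are hypothetical, and the regular expression is a heuristic that assumes no attribute value contains a literal < or >.

```python
import re

# Matches one tag, from "<" to the next ">" (heuristic, see the assumptions above).
TAG_RE = re.compile(r"<[^<>]*>")


def fix_entities_in_tags(markup, entity="&#xA;"):
    """Replace `entity` with a space inside tags only; text nodes are untouched."""
    return TAG_RE.sub(lambda m: m.group(0).replace(entity, " "), markup)


post = '<p>Line one&#xA;line two <a&#xA;href="http://example.com">link</a></p>'
fixed = fix_entities_in_tags(post)
print(fixed)
```

The cleaned string can then be handed to html.fromstring() as in the test program above.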

Elementtree and Unicode or UTF-8 confusion

Okay, I feel a bit lost right now. I have some problems with unicode (or utf-8 ?)
I am using Python3.3 on linux (But I have the same problem on windows).
I try to create an XML file with Elementtree.
import xml.etree.ElementTree as ET
item = ET.Element("item")
item_title = ET.SubElement(item, "title")
That is of course not everything, just an example.
So now I want the tag 'title' to have a text like this (replace the ##CONTENT## with random content, it doesn't matter so much):
# That's how I create the text for the tag
item_title.text = u'<![CDATA[##CONTENT##]]>'
# This is how I want it to look
<title><![CDATA[##CONTENT##]]></title>
# That's what I get
<title>&lt;![CDATA[##CONTENT##]]&gt;</title>
# These are some of the things I tried for writing it to an xml file
ET.ElementTree(item).write(myOutputFile, encoding="unicode")
myOutputFile.write(ET.tostring(item, encoding='unicode', method='xml'))
myOutputFile.write(str(ET.tostring(item, encoding='utf-8', method='xml')))
myOutputFile.write(str(ET.tostring(item)))
# Oh and that's how I open the file for writing
myOutputFile = codecs.open(HereIsMyFile, 'w', encoding='utf-8')
I tried to search and found some similar sounding problems (some of the things I tried are from SO already), but none seems to work. They changed some stuff in the output, but never showed the < or >.
I also noticed, if I use utf-8 I have to use str() when writing to the file. That got me also confused about the difference in unicode and utf-8, I tried to read some stuff about that but that didn't really help me in my actual problem.
At this point I don't really know where to look for my error and I would love a hint where to look.
Is it the way I write to the file? How I open it?
Or is it Elementtree causing the error? (I didn't try something else, like lxml, because well, that would mean rewriting a lot of stuff I guess).
I hope you can help me and if something isn't clear I will try to explain it a bit better!
Edit: Oh and I also tried to open the file without codecs, because I somewhere read it is not needed anymore in Python3.x but I wasn't so sure anymore, so I tried it.
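The correct way to write an XML document with ElementTree is:
with open(HereIsMyFile, 'wb') as myOutputFile:
    # root is an ElementTree, e.g. ET.ElementTree(item); write() emits bytes here
    root.write(myOutputFile, encoding='utf-8')
If you specify an encoding for write(), it should normally be one that the XML standard defines, such as utf-8. Unicode itself isn't an encoding, it's a standard; in Python 3, ElementTree accepts encoding="unicode" only as a special value meaning "return a str instead of bytes".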
ElementTree doesn't support CDATA. The effect you're seeing is that ElementTree notices special characters in the text of the node and it escapes them; there is no way to prevent that.
This answer contains the implementation of a CDATA element: How to output CDATA using ElementTree
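For the file-writing part of the question, the following sketch (Python 3, standard library only, with an in-memory buffer standing in for the real file) shows the difference between the two kinds of output. The variable names are illustrative.

```python
import io
import xml.etree.ElementTree as ET

item = ET.Element("item")
title = ET.SubElement(item, "title")
title.text = "caf\u00e9"

# encoding="unicode" is a special value: the result is a str, not bytes.
as_text = ET.tostring(item, encoding="unicode")

# A real encoding name gives bytes, ready for a file opened in binary mode.
as_bytes = ET.tostring(item, encoding="utf-8")

# Writing a whole tree: pass a binary stream and let ElementTree encode.
buffer = io.BytesIO()
ET.ElementTree(item).write(buffer, encoding="utf-8", xml_declaration=True)

print(type(as_text).__name__, type(as_bytes).__name__)
print(buffer.getvalue())
```

This is also why str() was needed around the utf-8 result in the question: tostring() returned bytes there, and str() of a bytes object yields its b'...' representation rather than the decoded text.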
There seem to be a couple of layers of confusion here.
Taking the lower level first: encodings such as UTF-8 convert Unicode characters into bytes. Your problem is that the characters in your generated XML aren’t the ones you want, not with how those characters are stored as bytes, so there isn’t anything to fix there.
Secondly, you seem to be expecting the wrong thing from this line:
item_title.text = u'<![CDATA[##CONTENT##]]>'
This tells ElementTree that you want that literal text in the parsed document. Consider this:
item_title.text = u'I <3 ASCII art.'
ElementTree won’t store that directly in the markup: it’ll turn it into
<title>I &lt;3 ASCII art.</title>
Likewise:
item_title.text = u"This </title> isn’t the end of the title"
becomes
<title>This &lt;/title&gt; isn’t the end of the title</title>
Hopefully you can see the value of this: no matter what text you put in there, it won’t break the element markup, or indeed affect it in any way.
Note that because of this automatic conversion, you very likely don’t need CDATA sections at all.
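A small round trip with the standard library's ElementTree illustrates the point; this is a sketch assuming Python 3, and the variable names are illustrative. The text is escaped on output and comes back unchanged when the document is parsed again.

```python
import xml.etree.ElementTree as ET

title = ET.Element("title")
title.text = "This </title> isn't the end of the title"

# Serializing escapes the markup characters in the text.
serialized = ET.tostring(title, encoding="unicode")

# Parsing the result gives back exactly the text that was stored.
round_tripped = ET.fromstring(serialized).text

print(serialized)
print(round_tripped)
```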
If for some reason you do, though, you can do it by stating it explicitly (using lxml.etree):
import lxml.etree
title = lxml.etree.Element('title')
title.text = lxml.etree.CDATA('###CONTENT###')
print(lxml.etree.tostring(title, encoding='unicode'))
outputs:
<title><![CDATA[###CONTENT###]]></title>

lxml: extracting unicode text from HTML

Update:
My code works fine on most Hebrew pages, but fails on 10% of them. I was unfortunate enough to start with two 'bad' ones.
Here's an example of a 'good' page: http://m.sport5.co.il/Pages/Article.aspx?articleId=154765,
and this is a 'bad' one: http://www.havoda.org.il/Web/Default.aspx.
I still need to deal with the bad ones, and I still don't know how...
Original question:
I'm using lxml.html to parse HTML, and extract only text (to be later used for text classification). I couldn't manage to properly deal with unicode (Hebrew text, in my case).
The tree elements don't seem to be encoded correctly:
When I look at element[i].text , where type(element[i].text) = UnicodeType, I see something like this: "u'\xd7\x9e\xd7\xa9\xd7\x94 \xd7\xa9\xd7\xa8\xd7\xaa (1955-1954)'", and this is not right - this entity cannot be encoded or decoded! (or I haven't found how...) Printing it brings, of course, something like this: "××©× ×©×¨×ª (1955-1954)", and that's not Hebrew...
A workable text string should look like:
1. u'\u05de\u05e9\u05d4 \u05e9\u05e8\u05ea (1955-1954)' - a proper unicode string; or:
2. '\xd7\x9e\xd7\xa9\xd7\x94 \xd7\xa9\xd7\xa8\xd7\xaa (1955-1954)' - unicode encoded into a regular text string; but not:
3. u'\xd7\x9e\xd7\xa9\xd7\x94 \xd7\xa9\xd7\xa8\xd7\xaa (1955-1954)' - a useless hybrid entity ('ascii' codec can't decode byte...)
What do I do to solve it? What am I doing wrong? Here's the code I'm using:
import lxml.html as lh
from types import *
from urllib2 import urlopen
f = urlopen(url)
html = f.read()
root = lh.fromstring(html)
all_elements = root.cssselect('*')
all_text = ''
for i in range(len(all_elements)):
    if all_elements[i].tag not in ['script','style']:
        if type(all_elements[i].text) in [StringType, UnicodeType]:
            all_text = all_text + all_elements[i].text.strip() + ' '
Everything works just fine with pure English (non unicode) html.
Almost all of the answers here refer to lxml.etree, and not lxml.html that I'm using. Do I have to switch? (I don't want to...)
Probably (but it's hard to know for sure without having the data) the page is UTF-8 encoded, but the HTML parser defaults to ISO-8859-1 (as opposed to the XML parser, which defaults to UTF-8).
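The diagnosis can be reproduced without lxml. The sketch below (Python 3, using the sample string from the question) decodes UTF-8 bytes as ISO-8859-1 to obtain exactly the "hybrid" string described above, then reverses the mistake. It assumes the damage is a plain ISO-8859-1 misdecoding with no characters lost on the way.

```python
# The proper Unicode string from the question (variant 1).
hebrew = "\u05de\u05e9\u05d4 \u05e9\u05e8\u05ea (1955-1954)"

# What the page actually sends over the wire (variant 2).
utf8_bytes = hebrew.encode("utf-8")

# Decoding those bytes with the wrong codec yields the hybrid (variant 3).
mojibake = utf8_bytes.decode("iso-8859-1")

# ISO-8859-1 maps every byte to one code point, so the mistake is reversible.
repaired = mojibake.encode("iso-8859-1").decode("utf-8")

print(repr(mojibake))
print(repaired == hebrew)
```

With lxml itself, the usual way to avoid the problem is to state the encoding up front, for example by passing parser=lxml.html.HTMLParser(encoding='utf-8') to fromstring(), provided the page really is UTF-8.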

How can I disable 'output escaping' in minidom

I'm trying to build an xml document from scratch using xml.dom.minidom. Everything was going well until I tried to make a text node with a ® (Registered Trademark) symbol in. My objective is for when I finally hit print mydoc.toxml() this particular node will actually contain a ® symbol.
First I tried:
import xml.dom.minidom as mdom
data = '®'
which gives the rather obvious error of:
File "C:\src\python\HTMLGen\test2.py", line 3
SyntaxError: Non-ASCII character '\xae' in file C:\src\python\HTMLGen\test2.py on line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
I have of course also tried changing the encoding of my python script to 'utf-8' using the opening line comment method, but this didn't help.
So I thought
import xml.dom.minidom as mdom
data = '&#174;' #Both accepted xml encodings for registered trademark
data = '&reg;'
text = mdom.Text()
text.data = data
print data
print text.toxml()
But because the ampersands are escaped when I print text.toxml(), I get this output:
&reg;
&amp;reg;
My question is, does anybody know of a way that I can force the ampersands not to be escaped in the output, so that I can have my special character reference carry through to the XML document?
Basically, for this node, I want print text.toxml() to produce output of &#174; or &reg; in a happy and cooperative way!
EDIT 1:
By the way, if minidom actually doesn't have this capacity, I am perfectly happy using another module that you can recommend which does.
EDIT 2:
As Hugh suggested, I tried using data = u'®' (while also using the # -*- coding: utf-8 -*- Python source tag). This almost helped in the sense that it actually caused the ® symbol itself to be outputted to my xml. This is actually not the result I am looking for. As you may have guessed by now (and perhaps I should have specified earlier) this xml document happens to be an HTML page, which needs to work in a browser. So having ® in the document ends up causing rubbish in the browser (Â® to be precise!).
I also tried:
data = unichr(174)
text.data = data.encode('ascii','xmlcharrefreplace')
print text.toxml()
But of course this led to the same original problem, where all that happens is that the ampersand gets escaped by .toxml().
My ideal scenario would be some way of escaping the ampersand so that the XML printing function won't "escape" it on my behalf for the document (in other words, achieving my original goal of having &#174; or &reg; appear in the document).
Seems like soon I'm going to have to resort to regular expressions!
EDIT 2a:
Or perhaps not. Seems like getting my html meta information correct <META http-equiv="Content-Type" Content="text/html; charset=UTF-8"> could help, but I'm not sure yet how this fits in with the xml structure...
Two options that work, one with the escaping (&#174;) and the other without. It's not really obvious why you want escaping ... it's 6 bytes instead of the 2 or 3 bytes for non-CJK characters.
import xml.dom.minidom as mdom
text = mdom.Text()
# Start with unicode
text.data = u'\xae'
f = open('reg1.html', 'w')
f.write("header saying the file is ascii")
uxml = text.toxml()
bxml = uxml.encode('ascii', 'xmlcharrefreplace')
f.write(bxml)
f.close()
f = open('reg2.html', 'w')
f.write("header saying the file is UTF-8")
xml = text.toxml(encoding='UTF-8')
f.write(xml)
f.close()
If I understand correctly, what you really want is to be able to create a text node from a unicode object (e.g. u'®' or u'\u00ae') and then have toxml() output unicode characters encoded as entities (e.g. &#174;). Looking at the source of minidom.py, however, it seems that minidom doesn't support entity encoding on output except the special cases of &, ", < and >.
You also ask about alternative modules that could help, however. There are several possible candidates, but ElementTree (xml.etree) seems to do the appropriate encoding. For example, if you take the first example from this blog post by Doug Hellmann but replace:
child_with_tail.text = 'This child has regular text.'
... with:
child_with_tail.text = u'This child has regular text \u00ae.'
... and run the script, you should see the output contains:
This child has regular text &#174;.
You could also use the lxml implementation of ElementTree in that example just by replacing the import statement with:
from lxml.etree import Element, SubElement, Comment, tostring
Update: the alternative answer from John Machin takes the nice approach of running .encode('ascii', 'xmlcharrefreplace') on the output from minidom's toxml(), which converts any non-ASCII characters to their equivalent XML numeric character references.
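The same approach in Python 3 looks roughly like this. It is a sketch with illustrative variable names, and the " & co" suffix is added only to show that minidom's own escaping of the ampersand still happens before the character references are produced.

```python
import xml.dom.minidom as mdom

doc = mdom.Document()
text = doc.createTextNode("\u00ae & co")

# minidom escapes only &, <, > and the double quote; the symbol passes through.
xml_str = text.toxml()

# xmlcharrefreplace turns every non-ASCII character into a numeric reference.
ascii_bytes = xml_str.encode("ascii", "xmlcharrefreplace")

print(xml_str)
print(ascii_bytes)
```

Because the result is pure ASCII, a browser displays the symbol correctly whatever charset it assumes for the page.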
Default unescape:
from xml.sax.saxutils import unescape
unescape("&lt; &amp; &gt;")
The result is:
'< & >'
And, to unescape more:
unescape("&apos; &quot;", {"&apos;": "'", "&quot;": '"'})
For details, see https://wiki.python.org/moin/EscapingXml
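As a short illustration of how the two saxutils helpers pair up, here is a round trip through escape() and unescape() with one extra entity; the sample string is made up for the example.

```python
from xml.sax.saxutils import escape, unescape

original = 'Tom & "Jerry" <friends>'

# escape() handles &, < and > by default; extra replacements go in the dict.
escaped = escape(original, {'"': "&quot;"})

# unescape() needs the inverse mapping to undo the extra replacement.
restored = unescape(escaped, {"&quot;": '"'})

print(escaped)
print(restored)
```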
