My scrapy project is giving me a strange encoding for items when using CSS selectors.
Here is the relevant code:
Once the scrapy request is made and the webpage is downloaded, parse_page is called with the response...
def parse_page(self, response):
    # Using Selenium WebDriver to select elements
    records = self.driver.find_elements_by_css_selector('#searchResultsTable > tbody > tr')
    for record in records:
        # Convert the Selenium element into a scrapy.Selector object (necessary to use the .add_css methods)
        sel = Selector(text=record.get_attribute('outerHTML'))
        # Instantiate RecordLoader (a custom item loader)
        il = RecordLoader(item=Record(), selector=sel)
        # Select the element and pass it to example_field's input processor
        il.add_css('example_field', 'td:nth-child(2) > a::text')
il.add_css() passes the result of the CSS selector to example_field's input processor, which for demonstration purposes consists only of print statements that show the issue...
def example_field_input_processor(text_html):
    print(text_html)
    print(type(text_html))
    print(text_html.encode('utf-8'))
Output:
'\xa0\xa004/29/2020 10:50:24 AM,\xa0\xa0\xa0'
<class 'str'>
b'\xc2\xa0\xc2\xa004/29/2020 10:50:24 AM,\xc2\xa0\xc2\xa0\xc2\xa0'
Here are my questions:
1) Why didn't the CSS selector just give me a normal Python string? Does it have to do with ::text casting the selection to text? Is it because the webpage is in a different encoding? I checked for a <meta> tag that specified the site's encoding, but there wasn't one.
2) When I force an encoding of 'utf-8', why don't I get a normal Python string instead of a bytes string that shows all the Unicode characters?
3) My goal is to have just a normal Python string (no prepended b, no weird characters) that I can parse. How?
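For what it's worth, the value above is already a normal Python str; the \xa0 runs are non-breaking-space characters (U+00A0), which repr() displays as escapes. A minimal cleanup sketch, using the sample output from the question:

```python
# The scraped value is already an ordinary Python str; '\xa0' is just how
# repr() displays the non-breaking space character (U+00A0).
raw = '\xa0\xa004/29/2020 10:50:24 AM,\xa0\xa0\xa0'

# Swap non-breaking spaces for ordinary spaces, trim, and drop the
# trailing comma to get a clean, parseable timestamp string.
clean = raw.replace('\xa0', ' ').strip().rstrip(',')
print(clean)  # 04/29/2020 10:50:24 AM
```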
While scraping, you sometimes have to clean your results of stray Unicode characters.
They are usually the result of spaces, tabs, and sometimes <span> elements.
As a common practice, clean all the text you scrape:
def string_cleaner(rouge_text):
    return "".join(rouge_text.strip()).encode('ascii', 'ignore').decode("utf-8")
Explanation:
Use strip() and join() to trim the string and clear it of stray whitespace characters.
This part of the code: "".join(rouge_text.strip())
Then encode it to ASCII, ignoring anything that can't be represented, and decode it back to remove the special characters.
This part of the code: .encode('ascii', 'ignore').decode("utf-8")
How you would apply it in your code
print(string_cleaner(text_html))
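For reference, a quick sketch of the cleaner applied to the sample value from the question (the trailing comma survives, since it is plain ASCII):

```python
def string_cleaner(rouge_text):
    # strip() removes leading/trailing whitespace (including '\xa0'), and
    # encode('ascii', 'ignore') drops any remaining non-ASCII characters.
    return "".join(rouge_text.strip()).encode('ascii', 'ignore').decode("utf-8")

text_html = '\xa0\xa004/29/2020 10:50:24 AM,\xa0\xa0\xa0'
print(string_cleaner(text_html))  # 04/29/2020 10:50:24 AM,
```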
Related
I am scraping a page with Beautiful Soup, and the output contains non-standard Latin characters that are showing up as hex.
I am scraping https://www.archchinese.com. It contains pinyin words, which use non-standard latin characters (ǎ, ā, for example). I've been trying to loop through a series of links that contain pinyin, using the BeautifulSoup .string function along with utf-8 encoding to output these words. The word comes out with hex in the places of non-standard characters. The word "hǎo" comes out as "h\xc7\x8eo". I'm sure I'm doing something wrong with encoding it, but I don't know enough to know what to fix. I tried decoding with utf-8 first, but I'm getting an error that the element has no decode function. Trying to print the string without encoding gives me an error about the characters being undefined, which, I figure, is because they need to be encoded to something first.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re
url = "https://www.archchinese.com/"
driver = webdriver.Chrome() #Set selenium up for opening page with Chrome.
driver.implicitly_wait(30)
driver.get(url)
driver.find_element_by_id('dictSearch').send_keys('好') # This character is hǎo.
python_button = driver.find_element_by_id('dictSearchBtn')
python_button.click() # Look for submit button and click it.
soup=BeautifulSoup(driver.page_source, 'lxml')
div = soup.find(id='charDef') # Find div with the target links.
for a in div.find_all('a', attrs={'class': 'arch-pinyin-font'}):
    print(a.string.encode('utf-8'))  # Loop through all links with pinyin and attempt to encode.
Actual results:
b'h\xc7\x8eo'
b'h\xc3\xa0o'
Expected results:
hǎo
hào
EDIT: The problem seems to be related to the UnicodeEncodeError in Windows. I've tried to install win-unicode-console, but no luck. Thanks to snakecharmerb for the info.
You don't need to encode the values when printing - the print function will take care of this automatically. Right now, you're printing the representation of the bytes that make up the encoded value rather than just the string itself.
>>> s = 'hǎo'
>>> print(s)
hǎo
>>> print(s.encode('utf-8'))
b'h\xc7\x8eo'
Use encode while you are calling BeautifulSoup, not after.
soup=BeautifulSoup(driver.page_source.encode('utf-8'), 'lxml')
div = soup.find(id='charDef') # Find div with the target links.
for a in div.find_all('a', attrs={'class': 'arch-pinyin-font'}):
    print(a.string)
I have this short example to demonstrate my problem:
from lxml import html
post = """<p>This a page with URLs.&#13;
This goes to &#13;
<a href="http://google.com">Google</a><br/>&#13;
This &#13;
goes to <a href="http://yahoo.com">Yahoo!</a><br/>&#13;
<a&#13;
href="http://example.com">This is invalid due to that &#13;
line feed character</p>&#13;
"""
doc = html.fromstring(post)
for link in doc.xpath('//a'):
    print link.get('href')
This outputs:
http://google.com
http://yahoo.com
None
The problem is that my data has &#13; character references embedded in it. For my last link, one is embedded directly between the anchor and the href attribute. The line feeds outside of the elements are important to me.
doc.xpath('//a') correctly saw the <a&#13; href="http://example.com"> as a link, but it can't access the href attribute when I do link.get('href').
How can I clean the data if link.get('href') returns None, so that I can still retrieve the discovered href attribute?
I can't strip all of the &#13; references from the entire post element, as the ones in the text are important.
Module unidecode
Since you need the data outside of the tags, you could try using unidecode. It doesn't tackle Chinese and Korean, but it'll do things like change left and right quotes to ASCII quotes. It should help with the \xa0 characters as well, changing them to plain spaces instead of non-breaking spaces. Hopefully that's all you need in regards to preserving the other data. str.replace(u"\xa0", u" ") is less heavy-handed if an ASCII space is okay.
import unidecode, urllib2
from lxml import html

html_text = urllib2.urlopen("http://www.yourwebsite.com").read()
ascii_text = unidecode.unidecode(html_text)
html.fromstring(ascii_text)
Explanation of issue
There seems to be a known issue with this in several versions of Python, and in C# as well. A related closed issue seems to indicate that it was closed because XML attributes aren't built to support carriage returns, so escaping them in all XML contexts would be silly. As it turns out, the W3C spec requires that line breaks be normalized when parsing (see sec. 1):
All line breaks must have been normalized on input to #xA as described in 2.11 End-of-Line Handling, so the rest of this algorithm operates on text normalized in this way.
You may solve your specific problem with:
post = post.replace('&#13;', '\n')
Resulting test program:
from lxml import html

post = """<p>This a page with URLs.&#13;
This goes to &#13;
<a href="http://google.com">Google</a><br/>&#13;
This &#13;
goes to <a href="http://yahoo.com">Yahoo!</a><br/>&#13;
<a&#13;
href="http://example.com">This is invalid due to that &#13;
line feed character</p>&#13;
"""

post = post.replace('&#13;', '\n')
doc = html.fromstring(post)
for link in doc.xpath('//a'):
    print link.get('href')
Output:
http://google.com
http://yahoo.com
http://example.com
Right now my output to a file is like:
<b>Nov 22–24</b> <b>Nov 29–Dec 1</b> <b>Dec 6–8</b> <b>Dec 13–15</b> <b>Dec 20–22</b> <b>Dec 27–29</b> <b>Jan 3–5</b> <b>Jan 10–12</b> <b>Jan 17–19</b> <b><i>Jan 17–20</i></b> <b>Jan 24–26</b> <b>Jan 31–Feb 2</b> <b>Feb 7–9</b> <b>Feb 14–16</b> <b><i>Feb 14–17</i></b> <b>Feb 21–23</b> <b>Feb 28–Mar 2</b> <b>Mar 7–9</b> <b>Mar 14–16</b> <b>Mar 21–23</b> <b>Mar 28–30</b>
I want to remove all the "Â" characters and the HTML tags (<b>, </b>). I tried using the .remove and .replace functions, but I get an error:
SyntaxError: Non-ASCII character '\xc2' in file -- FILE NAME-- on line 70, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
The output above is in a list, which I get from a webcrawling function:
def getWeekend(item_url):
    dates = []
    href = item_url[:37] + "page=weekend&" + item_url[37:]
    response = requests.get(href)
    soup = BeautifulSoup(response.content, "lxml")  # or BeautifulSoup(response.content, "html5lib")
    date = soup.select('table.chart-wide > tr > td > nobr > font > a > b')
    return date
I write it to a file like so:
for item in listOfDate:
    wr.writerow(item)
How can I remove all the tags so that only the date is left?
I'm not sure, but I think re.sub('toFind', 'toReplace', aString) should work (Python strings don't have a regex_replace method). Either that, or write it to a file and then run sed on it, like: sed -i 's/toFind/toReplace/g'
You already got a working solution, but for the future:
Use get_text() to get rid of the tags. Note that select() returns a list of elements, so call it on each one:
date = [b.get_text() for b in soup.select('table.chart-wide > tr > td > nobr > font > a > b')]
Use .replace(u'\xc2', u'') to get rid of the Â; the u prefix makes u'\xc2' a unicode string. (This might take some futzing around with encoding, but for me get_text() is already returning a unicode object.)
(Additionally, possibly consider .replace(u'\u2013', u'-'), because right now you have an en-dash :P.)
date = [d.replace(u'\xc2', u'').replace(u'\u2013', u'-') for d in date]
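A quick sketch of that replacement chain on a single value (the input string here is hypothetical, modeled on the output above):

```python
# Hypothetical scraped cell: a stray '\xc2' artifact plus an en-dash.
date = u'\xc2Nov 22\u201324'

# Drop the artifact and swap the en-dash for a plain hyphen.
date = date.replace(u'\xc2', u'').replace(u'\u2013', u'-')
print(date)  # Nov 22-24
```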
The problem is that you don't have an ASCII string from the website. You need to convert the non-ASCII text into something Python can understand before manipulating it.
Python will use Unicode when given a chance. There's plenty of information out there if you just have a look. For example, you can find more help from other questions on this website:
Python: Converting from ISO-8859-1/latin1 to UTF-8
python: unicode in Windows terminal, encoding used?
What is the difference between encode/decode?
If your Python 2 source code has literal non-ASCII characters such as Â, then you should declare the source encoding as the error message says. Put this at the top of your Python file:
# -*- coding: utf-8 -*-
Make sure the file is saved using the utf-8 encoding and use Unicode strings to work with the text.
I'm using Python and BeautifulSoup to find and replace some text on an HTML page. My problem is that I need to keep the file structure (indentation, spaces, new lines, etc.) unchanged and change only the desired elements. How can I achieve this? Both str(soup) and soup.prettify() alter the source file in many ways.
P.S. sample code:
soup = BeautifulSoup(text)
for element in soup.findAll(text=True):
    if not element.parent.name in ['style', 'script', 'head', 'title', 'pre']:
        element.replaceWith(process(element))
result = str(soup)
I'd say there's no easy way (or no way at all). From BeautifulStoneSoup's doc:
__str__(self, encoding='utf-8', prettyPrint=False, indentLevel=0)
Returns a string or Unicode representation of this tag and
its contents. To get Unicode, pass None for encoding.
NOTE: since Python's HTML parser consumes whitespace, this
method is not certain to reproduce the whitespace present in
the original string.
According to the note, the original whitespaces are lost to the internal representation.
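One workaround, if your targets can be matched textually: do the substitution on the raw markup with a regex instead of re-serializing the soup, so every byte you don't touch stays exactly as it was. A rough sketch only (the tag names and the uppercase "processing" step are placeholders; it assumes the target text can be matched reliably without a full parse):

```python
import re

text = "<p>Hello   world</p>\n  <pre>keep  this</pre>\n"

# Rewrite only the text between <p>...</p>; the indentation, newlines, and
# the <pre> block pass through to the result byte-for-byte untouched.
result = re.sub(r'(<p>)(.*?)(</p>)',
                lambda m: m.group(1) + m.group(2).upper() + m.group(3),
                text)
print(result)
```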
I am scraping a webpage that contains HTML that looks like this in the browser
<td>LGG® MAX multispecies probiotic consisting of four bacterial trains</td>
<td>LGG® MAX helps to reduce gastro-intestinal discomfort</td>
Taking just the LGG®: in the first instance it is written as a literal ® character; in the second instance, ® is written as the entity &reg; in the source code.
I am using Python 2.7, mechanize and BeautifulSoup.
My difficulty is that the ® is picked up by mechanize, carried through, and is ultimately printed out or written to file.
There are many other special characters. Some are 'converted' on output, and the ® characters are converted to a muddle.
The webpage is declared as UTF-8, and the only reference I make to encoding is when I open my output file, where I've declared UTF-8. If I don't, the write to file bombs on other characters.
I am working on Windows 7. Other details:
>>> sys.getdefaultencoding()
'ascii'
>>> sys.stdout.encoding
'cp850'
>>> locale.getdefaultlocale()
('en_GB', 'cp1252')
>>>
Can anyone give me any tips on the best way to handle the special characters? Or should they be called HTML entities? This must be a fairly common problem but I haven't been able to find any straightforward explanations on the web.
UPDATE: I've made some progress here.
The basic algorithm is:
1) Read the webpage in mechanize.
2) Use Beautiful Soup to do... as I write it down, I have no idea what this pre-processing stage is for, exactly.
3) Use Beautiful Soup to extract information from a table that is orderly other than for the treatment of special characters.
4) Write the information to a file delimited by | to account for punctuation in long cell entries and to allow for importing into Excel etc.
The progress is in stage 3. I've used some regex and the htmlentitydefs module to fix the text cell entry by cell entry. See this blog post.
Remaining difficulty: the code written to file (and printed to screen) is still incorrect but it appears that the problem is now a matter of specifying the coding correctly. The problem seems smaller at least.
To answer the question from the title:
# -*- coding: utf-8 -*-
from BeautifulSoup import BeautifulSoup
html = u"""
<td>LGG® MAX multispecies probiotic consisting of four bacterial trains</td>
<td>LGG&reg; MAX helps to reduce gastro-intestinal discomfort</td>
"""
soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
print(''.join(soup('td', text=True)))
Output
LGG® MAX multispecies probiotic consisting of four bacterial trains
LGG® MAX helps to reduce gastro-intestinal discomfort
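As a side note (not part of the original answer): on Python 3, the standard library's html.unescape does the same entity conversion without BeautifulSoup, which may be handy if you move off BeautifulSoup 3:

```python
from html import unescape

# html.unescape resolves named and numeric entities to their characters.
cell = 'LGG&reg; MAX helps to reduce gastro-intestinal discomfort'
print(unescape(cell))  # LGG® MAX helps to reduce gastro-intestinal discomfort
```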