Beautiful Soup 4 UnicodeEncodeError When Using get_text() - python

I am trying to get all the text from a web page, following this tutorial. However, I cannot seem to get all the text from the web page using get_text(); instead I get the error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u1d90' in
position 2473: character maps to <undefined>
Here is what my source code looks like:
import bs4 as bs
import urllib.request
source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source,'lxml')
print(soup.get_text())
Any ideas on where I am going wrong? I have followed several other answers on Stack Overflow and tried:
soup = bs.BeautifulSoup(source,'lxml').encode('UTF-8')
but get the error:
AttributeError: 'bytes' object has no attribute 'get_text'
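The 'charmap' error comes from print() encoding its output for a Windows console codepage (typically cp1252) that has no mapping for '\u1d90'; encoding the soup object, as tried above, just destroys the BeautifulSoup methods. A minimal stdlib-only sketch of two common workarounds, with a literal string standing in for soup.get_text():

```python
import sys

text = "letter with hook: \u1d90"   # stand-in for soup.get_text()

# Option 1 (Python 3.7+): make stdout emit UTF-8 instead of the console codepage.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")
print(text)

# Option 2: round-trip through the target encoding, replacing unmappable characters.
safe = text.encode("cp1252", errors="replace").decode("cp1252")
print(safe)   # '\u1d90' becomes '?'
```

Either way, the encoding step happens only at the output boundary; the soup object itself stays untouched.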

Related

"'ascii' codec can't encode character" error from BeautifulSoup

Python newbie here. Currently writing a crawler for a lyrics website, and I'm running into this problem when trying to parse the HTML. I'm using BeautifulSoup and requests.
Code right now is (after all imports and whatnot):
import requests as r
from bs4 import BeautifulSoup as bs
def function(artist_name):
    temp = "https://www.lyrics.com/lyrics/"
    if ' ' in artist_name:
        artist_name = artist_name.replace(' ', '%20')
    page = r.get(temp + artist_name.lower()).content
    soup = bs(page, 'html.parser')
    return soup
When I try to test this out, I keep getting the following error:
UnicodeEncodeError: 'ascii' codec can't encode character '\xa0' in position 8767: ordinal not in range(128)
I've tried adding .encode('utf-8') to the end of the soup line, and it got rid of the error but wouldn't let me use any of the BeautifulSoup methods since it returns bytes.
I've taken a look at the other posts on here and tried the solutions they've provided for similar errors. There's still a lot I have to understand about Python and Unicode, but if someone could help out and give some guidance, it would be much appreciated.
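The rule here is the same as elsewhere on this page: keep the soup object intact and encode only when producing output. A stdlib-only sketch of the '\xa0' case from the traceback, with a literal string standing in for text pulled out of the soup:

```python
scraped = "Artist\xa0Name"   # '\xa0' is the non-breaking space from the traceback

# An ASCII-only output stream cannot encode it:
try:
    scraped.encode("ascii")
except UnicodeEncodeError as exc:
    print("ascii failed:", exc.reason)

# UTF-8 can, and 'replace' degrades gracefully when ASCII is required:
print(scraped.encode("utf-8"))                    # b'Artist\xc2\xa0Name'
print(scraped.encode("ascii", errors="replace"))  # b'Artist?Name'
```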

Unicode Encode Error: Charmap cannot encode character \xa9 in Python

Hi there, I am writing scraping code, but when I try to get all the paragraphs from a website it gives me the following error:
UnicodeEncodeError: 'charmap' codec can't encode character '\xa9'
here is my code:
#Loading Libraries
import urllib
from urllib.parse import urlparse
from urllib.parse import urljoin
import urllib.request
from bs4 import BeautifulSoup
#define URL for scraping
newsurl = "http://www.techspot.com/news/67832-netflix-exceeds-growth-expectations-home-abroad-stock-soars.html"
thepage = urllib.request.urlopen(newsurl)
soup = BeautifulSoup(thepage, "html.parser")
article = soup.find_all('div', {'class': 'articleBody'})
for pg in article:
    paragraph = soup.findAll('p')
    ptag = paragraph
    print(ptag)
The error I am getting is the one shown above. Let me know how to remove this error.
soup.findAll() returns a ResultSet object, which is basically a list; a list does not have an encode attribute. You either meant to use .text instead:
text = soup.text
Or, "join" the texts:
text = "".join(soup.findAll(whatever, you, want))
This error can also occur when printing data fetched with Beautiful Soup 4 (bs4) or requests. Try encoding the data in your print statement:
print(myHtmlData.encode("utf-8"))
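One caution about the "join" suggestion above: "".join() over the ResultSet itself raises a TypeError, because the items are Tag objects rather than strings, so each tag's text has to be extracted first. A sketch with an inline HTML snippet (the markup is made up for illustration):

```python
from bs4 import BeautifulSoup

html = "<div class='articleBody'><p>First.</p><p>Second\xa9.</p></div>"
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet of Tag objects; join their text, not the tags.
text = " ".join(p.get_text() for p in soup.find_all("p"))
print(text)                   # First. Second©.
print(text.encode("utf-8"))   # bytes, safe to print on a cp1252 console
```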

Why is urlopen giving me a strange string of characters?

I am trying to scrape the NBA game predictions on FiveThirtyEight. I usually use urllib2 and BeautifulSoup to scrape data from the web. However, the html that is returning from this process is very strange. It is a string of characters such as "\x82\xdf\x97S\x99\xc7\x9d". I cannot encode it into regular text. Here is my code:
from urllib2 import urlopen
html = urlopen('http://projects.fivethirtyeight.com/2016-nba-picks/').read()
This method works on other websites and other pages on 538, but not this one.
Edit: I tried to decode the string using
html.decode('utf-8')
and the method located here, but I got the following error message:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte
That page appears to return gzipped data by default. The following should do the trick:
from urllib2 import urlopen
import zlib

opener = urlopen('http://projects.fivethirtyeight.com/2016-nba-picks/')
if 'gzip' in opener.info().get('Content-Encoding', 'NOPE'):
    html = zlib.decompress(opener.read(), 16 + zlib.MAX_WBITS)
else:
    html = opener.read()
The result went into BeautifulSoup with no issues.
The HTTP headers (returned by the .info() above) are often helpful when trying to deduce the cause of issues with the Python url libraries.
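The wbits value 16 + zlib.MAX_WBITS is what tells zlib to expect a gzip header and trailer around the deflate stream. A self-contained sketch of the same decompression, using locally gzip-compressed bytes instead of a live response:

```python
import gzip
import zlib

html = b"<html><body>NBA picks</body></html>"
compressed = gzip.compress(html)   # simulate a gzip-encoded HTTP body

# 16 + zlib.MAX_WBITS makes zlib accept the gzip framing around the deflate data.
decompressed = zlib.decompress(compressed, 16 + zlib.MAX_WBITS)
print(decompressed == html)   # True
```

In newer code the requests library decompresses gzip responses transparently, so requests.get(url).text sidesteps this issue altogether.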

Unable to extract data from BeautifulSoup object after utf-8 conversion due to 'str' typecasting

I'm trying to build my own web scraper using Python. One of the steps involves parsing an HTML page, for which I am using BeautifulSoup, which is the parser recommended in most tutorials. Here is my code which should extract the page and print it:
import urllib
from bs4 import BeautifulSoup
urlToRead = "http://www.randomjoke.com/topic/haha.php"
handle = urllib.urlopen(urlToRead)
htmlGunk = handle.read()
soup = BeautifulSoup(htmlGunk, "html.parser")
soup = soup.prettify()
print (soup)
However, there seems to be an error when I do soup.prettify() and then print it. The error is:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in
position 16052: ordinal not in range(128)
To resolve this, I googled further and came across this answer on SO, which resolved it. I basically had to set the encoding to 'utf-8', which I did. So here is the modified code (last 2 lines only):
soup = soup.prettify().encode('utf-8')
print (soup)
This works just fine. The problem arises when I try to use the soup.get_text() method as mentioned on a tutorial here. Whenever I do soup.get_text(), I get an error:
AttributeError: 'str' object has no attribute 'get_text'
I think this is expected, since I'm encoding the soup to 'utf-8', which changes it to a str. I tried printing type(soup) before and after the utf-8 conversion and, as expected, before conversion it was an object of the bs4.BeautifulSoup class and after, it was str.
How do I work around this? I'm pretty sure I'm doing something wrong and there's a proper way around this. Unfortunately, I'm not too familiar with Python, so please bear with me.
You should not discard your original soup object. You can call soup.prettify().encode('utf-8') when you need to print it (or save it into a different variable).
import urllib
from bs4 import BeautifulSoup
urlToRead = "http://www.randomjoke.com/topic/haha.php"
handle = urllib.urlopen(urlToRead)
htmlGunk = handle.read()
soup = BeautifulSoup(htmlGunk, "html.parser")
html_code = soup.prettify().encode('utf-8')
text = soup.get_text().encode('utf-8')
print html_code
print "#################"
print text
# a = soup.find()
# l = []
# for i in a.next_elements:
#     l.append(i)
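The same point in runnable form: prettify() and get_text() return new strings, and encoding those strings leaves the soup object untouched, so its methods keep working. A sketch with an inline snippet standing in for the fetched page (Python 3 syntax, unlike the Python 2 answer above):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>joke \xa9 2016</p>", "html.parser")

# Encoding the *result* of prettify() does not replace `soup` itself...
html_code = soup.prettify().encode("utf-8")

# ...so get_text() still works afterwards.
text = soup.get_text()
print(html_code)
print(text)
```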

Error in printing scraped webpage through bs4

Code:
import requests
import urllib
from bs4 import BeautifulSoup
page1 = urllib.request.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)
print(soup.get_text())
print(soup.prettify())
Error:
Traceback (most recent call last):
File "C:\Users\sony\Desktop\Trash\Crawler Try\try2.py", line 9, in <module>
print(soup.get_text())
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u014d' in position 10487: character maps to <undefined>
I think the problem lies mainly with the urllib package. Here I am using the urllib3 package. They changed the urlopen syntax from version 2 to version 3, which may be the cause of the error. That being said, I have included the latest syntax only.
Python version 3.4
Since you are importing requests, you can use it instead of urllib, like this:
import requests
from bs4 import BeautifulSoup
page1 = requests.get("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1.text)
print(soup.get_text())
print(soup.prettify())
Your problem is that python cannot encode the characters from the page that you are scraping. For some more information see here: https://stackoverflow.com/a/16347188/2638310
Since the wikipedia page is in UTF-8, it seems that BeautifulSoup is guessing the encoding incorrectly. Try passing the from_encoding argument in your code like this:
soup = BeautifulSoup(page1.text, from_encoding="UTF-8")
For more on encodings in BeautifulSoup have a look here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings
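One caveat worth knowing: from_encoding only has an effect when BeautifulSoup receives raw bytes; with a str such as page1.text it is ignored (bs4 emits a warning). So to force the encoding, pass page1.content instead. A sketch with local bytes standing in for the response body:

```python
from bs4 import BeautifulSoup

raw = "<p>\u014dtsuka</p>".encode("utf-8")   # stand-in for page1.content (bytes)

# from_encoding applies here because `raw` is bytes, not str.
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")
print(soup.get_text())   # ōtsuka
```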
I am using Python2.7, so I don't have request method inside the urllib module.
#!/usr/bin/python3
# coding: utf-8
import requests
from bs4 import BeautifulSoup
URL = "http://en.wikipedia.org/wiki/List_of_human_stampedes"
soup = BeautifulSoup(requests.get(URL).text)
print(soup.get_text())
print(soup.prettify())
https://www.python.org/dev/peps/pep-0263/
Put those print lines inside a try/except block so that if there is an illegal character you won't get an error.
try:
    print(soup.get_text())
    print(soup.prettify())
except Exception:
    print(str(soup.get_text().encode("utf-8")))
    print(str(soup.prettify().encode("utf-8")))
