I am trying to scrape the NBA game predictions on FiveThirtyEight. I usually use urllib2 and BeautifulSoup to scrape data from the web. However, the HTML returned by this request is very strange: it is a string of characters such as "\x82\xdf\x97S\x99\xc7\x9d" that I cannot decode into regular text. Here is my code:
from urllib2 import urlopen
html = urlopen('http://projects.fivethirtyeight.com/2016-nba-picks/').read()
This method works on other websites and other pages on 538, but not this one.
Edit: I tried to decode the string using
html.decode('utf-8')
and the method located here, but I got the following error message:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte
That page appears to return gzipped data by default; the 0x8b in your error is the second byte of the gzip magic number (\x1f\x8b), which is why the UTF-8 decode fails at position 1. The following should do the trick:
from urllib2 import urlopen
import zlib
opener = urlopen('http://projects.fivethirtyeight.com/2016-nba-picks/')
if 'gzip' in opener.info().get('Content-Encoding', 'NOPE'):
    html = zlib.decompress(opener.read(), 16 + zlib.MAX_WBITS)
else:
    html = opener.read()
The result went into BeautifulSoup with no issues.
The HTTP headers (returned by the .info() above) are often helpful when trying to deduce the cause of issues with the Python url libraries.
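The 16 + zlib.MAX_WBITS argument tells zlib to expect a gzip header and trailer; a quick offline round trip with a synthetic payload (no network needed) confirms it:

```python
import gzip
import zlib

# Simulate a gzip-encoded HTTP body with a synthetic payload
original = b'<html><body>2016 NBA picks</body></html>'
payload = gzip.compress(original)

# 16 + zlib.MAX_WBITS makes zlib accept the gzip header and trailer
html = zlib.decompress(payload, 16 + zlib.MAX_WBITS)
assert html == original
```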
Python newbie here. Currently writing a crawler for a lyrics website, and I'm running into this problem when trying to parse the HTML. I'm using BeautifulSoup and requests.
Code right now is (after all imports and whatnot):
import requests as r
from bs4 import BeautifulSoup as bs
def function(artist_name):
    temp = "https://www.lyrics.com/lyrics/"
    if ' ' in artist_name:
        artist_name = artist_name.replace(' ', '%20')
    page = r.get(temp + artist_name.lower()).content
    soup = bs(page, 'html.parser')
    return soup
When I try to test this out, I keep getting the following error:
UnicodeEncodeError: 'ascii' codec can't encode character '\xa0' in position 8767: ordinal not in range(128)
I've tried adding .encode('utf-8') to the end of the soup line; it got rid of the error, but then I couldn't use any of the BeautifulSoup methods, since encoding returns bytes.
I've looked at the other posts on here and tried the solutions they provided for similar errors. There's still a lot I have to learn about Python and Unicode, but if someone could help out and give some guidance, it would be much appreciated.
I need help encoding a non-ASCII URL into a form suitable for feeding to the urlopen() method. My code scrapes a (non-ASCII) URL from a page and follows it to the next page:
from urllib.request import urlopen
from bs4 import BeautifulSoup
Entrance URL, copy-pasted from the Chrome browser:
url = 'https://www.sheypoor.com/%DA%A9%D9%85%D8%AF-%D9%86%D9%88%D8%AC%D9%88%D8%A7%D9%86-34926671.html'
for i in range(1, 10):
    html = urlopen(url)
    page = BeautifulSoup(html.read(), 'html.parser')
    url_obj = page.findAll('a')[13]['href'].strip()
    print(url_obj)
    url = url_obj
But I got an error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 5-9: ordinal not in range(128)
The traceback points at the urlopen call:
----> 8 html = urlopen(url)
In the first iteration of the loop, urlopen() works with the entrance URL, because it is in the percent-encoded form:
https://www.sheypoor.com/%DA%A9%D9%85%D8%AF-%D9%86%D9%88%D8%AC%D9%88%D8%A7%D9%86-34926671.html
But the problem starts when url_obj, scraped from the BeautifulSoup object, comes back in the form
https://www.sheypoor.com/سرویس-تخت-کمد-نوجوان-44887762.html
and replaces the old URL; this form is not suitable for feeding to urlopen().
I tried to convert my url_obj to the correct URL form, like the entrance URL, but I failed.
I would appreciate your support and guidance in solving this problem.
You could use something like this:
from urllib.request import urlopen
from urllib.parse import quote
persian_url = 'https://www.isna.ir/news/99010100077/' + quote('حواشی-در-آکروباتیک-ژیمناستیک-بالا-گرفت-دبیر-هم-استعفا-کرد')
page = urlopen(persian_url)
the url was : 'https://www.isna.ir/news/99010100077/حواشی-در-آکروباتیک-ژیمناستیک-بالا-گرفت-دبیر-هم-استعفا-کرد'
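To apply the same fix to the URLs scraped in the loop above, one option (a sketch; the helper name is made up, and it assumes only the path and query can contain non-ASCII characters) is to split the URL, percent-encode those parts, and reassemble it before handing it to urlopen():

```python
from urllib.parse import quote, urlsplit, urlunsplit

def ascii_safe(url):
    # Hypothetical helper: percent-encode the path and query of a scraped URL
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme,
        parts.netloc,
        quote(parts.path),             # default safe='/' keeps path separators
        quote(parts.query, safe='=&'),
        parts.fragment,
    ))

safe = ascii_safe('https://www.sheypoor.com/سرویس-تخت-کمد-نوجوان-44887762.html')
# safe is pure ASCII and can be fed to urlopen()
```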
I'm trying to save the page source to a file, so that I don't have to constantly re-run my code every time I want to test something.
I have:
html_source = driver.page_source
soup = BeautifulSoup(html_source, 'lxml') # added `lxml` only b/c I got a warning saying I should
soup = soup.prettify()
with open('pagesource.html', 'wb') as f_out:
    f_out.write(soup)
The error I get is:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xab' in position 223871: ordinal not in range(128)
I also tried f_out.write(str(soup)), which didn't work.
How do I write the content to a file?
BeautifulSoup is for parsing HTML, not for downloading it.
If you can import urllib, try urlretrieve:
import urllib
urllib.urlretrieve("http://www.example.com/test.html", "test.txt")
This works for me:
import urllib2
html = urllib2.urlopen('http://www.example.com').read()
Now html contains the source code of that url.
with open('web.html', 'w') as f:
    f.write(html)
You should now be able to open that with a browser.
From the bs4 documentation:
UnicodeEncodeError: 'charmap' codec can't encode character u'\xfoo' in position bar (or just about any other UnicodeEncodeError) - This is not a problem with Beautiful Soup. This problem shows up in two main situations. First, when you try to print a Unicode character that your console doesn’t know how to display. (See this page on the Python wiki for help.) Second, when you’re writing to a file and you pass in a Unicode character that’s not supported by your default encoding. In this case, the simplest solution is to explicitly encode the Unicode string into UTF-8 with u.encode("utf8").
I got the same error and solved it using:
soup = BeautifulSoup(page.content, 'html.parser', from_encoding="utf8")
with open(file_name_with_path, mode="w", encoding="utf8") as code:
    code.write(soup.prettify())
You should avoid writing in binary mode: use mode="w" instead of mode="wb", and specify that the file is written with utf8 encoding. The error was not caused by bs4 but by the file-writing step, which could not accept UTF-8 text with the default codec.
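The encoding="utf8" argument is the part that matters; a minimal stdlib-only round trip (synthetic markup, no BeautifulSoup needed) shows that text mode with an explicit encoding accepts characters that would crash an ascii or charmap default:

```python
import os
import tempfile

# Synthetic markup with non-ASCII characters (é and a no-break space)
text = '<p>caf\xe9\xa0menu</p>'

path = os.path.join(tempfile.gettempdir(), 'pagesource_demo.html')
with open(path, mode='w', encoding='utf8') as f:
    f.write(text)  # would raise UnicodeEncodeError if the codec were ascii

with open(path, encoding='utf8') as f:
    assert f.read() == text
```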
I use BeautifulSoup and urllib2 to download web pages, but different pages use different encodings, such as utf-8, gb2312, and gbk. I used urllib2 to get sohu's home page, which is encoded with gbk, and in my code I decode its content this way:
self.html_doc = self.html_doc.decode('gb2312','ignore')
But how can I know which encoding a page uses before I use BeautifulSoup to decode it to unicode? Most Chinese websites do not set a charset in the HTTP Content-Type header field.
Using BeautifulSoup you can parse the HTML and access the original_encoding attribute:
import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('http://www.sohu.com').read()
soup = BeautifulSoup(html)
>>> soup.original_encoding
u'gbk'
And this agrees with the encoding declared in the <meta> tag in the HTML's <head>:
<meta http-equiv="content-type" content="text/html; charset=GBK" />
>>> soup.meta['content']
u'text/html; charset=GBK'
Now you can decode the HTML:
decoded_html = html.decode(soup.original_encoding)
but there's not much point, since the HTML is already available as unicode:
>>> soup.a['title']
u'\u641c\u72d0-\u4e2d\u56fd\u6700\u5927\u7684\u95e8\u6237\u7f51\u7ad9'
>>> print soup.a['title']
搜狐-中国最大的门户网站
>>> soup.a.text
u'\u641c\u72d0'
>>> print soup.a.text
搜狐
It is also possible to attempt to detect it using the chardet module (although it is a bit slow):
>>> import chardet
>>> chardet.detect(html)
{'confidence': 0.99, 'encoding': 'GB2312'}
Another solution.
from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = req.get('http://www.sohu.com') # This will automatically help you find the correct encoding
doc = SimplifiedDoc(html)
print (doc.title.text)
I know this is an old question, but I spent a while today puzzling over a particularly problematic website so I thought I'd share the solution that worked for me, which I got from here: http://shunchiubc.blogspot.com/2016/08/python-to-scrape-chinese-websites.html
Requests can detect the actual encoding of the response body for you, meaning you don't have to wrestle with encoding/decoding it yourself (before I found this, I was getting all sorts of errors trying to encode/decode strings/bytes and never getting readable output). The attribute is called apparent_encoding. Here's how it worked for me:
from bs4 import BeautifulSoup
import requests
url = 'http://url_youre_using_here.html'
readOut = requests.get(url)
readOut.encoding = readOut.apparent_encoding #sets the encoding properly before you hand it off to BeautifulSoup
soup = BeautifulSoup(readOut.text, "lxml")
I am using BeautifulSoup to extract news stories (just the titles) from Hacker News and have this much up till now:
import urllib2
from BeautifulSoup import BeautifulSoup
HN_url = "http://news.ycombinator.com"
def get_page():
    page_html = urllib2.urlopen(HN_url)
    return page_html

def get_stories(content):
    soup = BeautifulSoup(content)
    titles_html = []
    for td in soup.findAll("td", {"class": "title"}):
        titles_html += td.findAll("a")
    return titles_html

print get_stories(get_page())
When I run the code, however, it gives an error:
Traceback (most recent call last):
File "terminalHN.py", line 19, in <module>
print get_stories(get_page())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe2' in position 131: ordinal not in range(128)
How do I get this to work?
Because BeautifulSoup works internally with unicode strings. Printing a unicode string to the console makes Python convert it to the default encoding, which is usually ascii, and that conversion generally fails for non-ascii web sites. You can learn the basics about Python and Unicode by googling "python + unicode". Meanwhile, encode your unicode strings to utf-8 before printing:
print some_unicode_string.encode('utf-8')
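A small sketch of the idea with a hypothetical title string (the en dash encodes to the utf-8 bytes \xe2\x80\x93, matching the \xe2 in the traceback); written so it also runs under Python 3:

```python
# Hypothetical story title containing an en dash (non-ascii)
title = u'Show HN \u2013 a tiny scraper'

# Encoding explicitly yields utf-8 bytes that any console or pipe can carry
encoded = title.encode('utf-8')
assert isinstance(encoded, bytes)
assert b'\xe2\x80\x93' in encoded
assert encoded.decode('utf-8') == title
```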
One thing to note about your code is that findAll returns a list (in this case a list of BeautifulSoup objects), while you only want the titles. So use find to grab the single <a> element inside each cell and take its string, rather than printing the BeautifulSoup objects themselves. The following works fine, for example:
import urllib2
from BeautifulSoup import BeautifulSoup

HN_url = "http://news.ycombinator.com"

def get_page():
    page_html = urllib2.urlopen(HN_url)
    return page_html

def get_stories(content):
    soup = BeautifulSoup(content)
    titles = []
    for td in soup.findAll("td", {"class": "title"}):
        a_element = td.find("a")
        if a_element:
            titles.append(a_element.string)
    return titles

print get_stories(get_page())
So now get_stories() returns a list of unicode objects, which prints out as you'd expect.
It works fine; what's broken is the output. Either explicitly encode to your console's charset, or find a different way to run your code (e.g., from within IDLE).