Python + Beautiful Soup: Write HTML source to file

I'm trying to save the page source to a file, so that I don't have to constantly re-run my code every time I want to test something.
I have:
html_source = driver.page_source
soup = BeautifulSoup(html_source, 'lxml') # added `lxml` only b/c I got a warning saying I should
soup = soup.prettify()
with open('pagesource.html', 'wb') as f_out:
    f_out.write(soup)
The error I get is:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xab' in position 223871: ordinal not in range(128)
I also tried f_out.write(str(soup)), which didn't work.
How do I write the content to a file?

BeautifulSoup is for parsing HTML, not for fetching it.
If you can import urllib, try urlretrieve:
import urllib
urllib.urlretrieve("http://www.example.com/test.html", "test.txt")
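On Python 3 (an assumption here; the snippet above is Python 2) the same function lives under urllib.request:
from urllib.request import urlretrieve
# download the raw response body straight to a file, no parsing involved
urlretrieve("http://www.example.com/test.html", "test.txt")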

This works for me:
import urllib2
html = urllib2.urlopen('http://www.example.com').read()
Now html contains the source code of that url.
with open('web.html', 'w') as f:
    f.write(html)
You should now be able to open that with a browser.
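A rough Python 3 equivalent (an assumption; the answer above uses Python 2's urllib2) reads bytes and writes them back in binary mode, which sidesteps any encoding question:
from urllib.request import urlopen
html = urlopen('http://www.example.com').read()  # bytes in Python 3
# write the bytes untouched; the browser picks the charset from the page itself
with open('web.html', 'wb') as f:
    f.write(html)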

From bs4 documentation:
UnicodeEncodeError: 'charmap' codec can't encode character u'\xfoo' in position bar (or just about any other UnicodeEncodeError) - This is not a problem with Beautiful Soup. This problem shows up in two main situations. First, when you try to print a Unicode character that your console doesn’t know how to display. (See this page on the Python wiki for help.) Second, when you’re writing to a file and you pass in a Unicode character that’s not supported by your default encoding. In this case, the simplest solution is to explicitly encode the Unicode string into UTF-8 with u.encode("utf8").
I got the same error and solved it using:
soup = BeautifulSoup(page.content, 'html.parser', from_encoding="utf8")
with open(file_name_with_path, mode="w", encoding="utf8") as code:
    code.write(soup.prettify())
You should avoid writing in binary mode here. Try mode="w" instead of mode="wb", and specify that the file should be written with UTF-8 encoding. The error was not caused by bs4 but by the file-writing step, which could not accept UTF-8 text with your default encoding.
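Applied back to the original Selenium snippet, a minimal sketch of that fix (assuming Python 3, as the open() call in this answer implies, and reusing the asker's existing driver object) would be:
from bs4 import BeautifulSoup
html_source = driver.page_source              # `driver` is the WebDriver from the question
soup = BeautifulSoup(html_source, 'lxml')
# text mode with an explicit UTF-8 encoding instead of 'wb'
with open('pagesource.html', 'w', encoding='utf-8') as f_out:
    f_out.write(soup.prettify())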

Related

Wrong accented characters using Beautiful Soup in Python on a local HTML file

I'm quite familiar with Beautiful Soup in Python; I have always used it to scrape live sites.
Now I'm scraping a local HTML file (link, in case you want to test the code), the only problem is that accented characters are not represented in the correct way (this never happened to me when scraping live sites).
This is a simplified version of the code
import requests, urllib.request, time, unicodedata, csv
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('AH.html'), "html.parser")
tables = soup.find_all('table')
titles = tables[0].find_all('tr')
print(titles[55].text)
which prints the following output
2:22 - Il Destino Ãˆ GiÃ  Scritto (2017 ITA/ENG) [1080p] [BLUWORLD]
while the correct output should be
2:22 - Il Destino È Già Scritto (2017 ITA/ENG) [1080p] [BLUWORLD]
I looked for a solution, read many questions/answers and found this answer, which I implemented in the following way
import requests, urllib.request, time, unicodedata, csv
from bs4 import BeautifulSoup
import codecs
response = open('AH.html')
content = response.read()
html = codecs.decode(content, 'utf-8')
soup = BeautifulSoup(html, "html.parser")
However, it runs the following error
Traceback (most recent call last):
File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
TypeError: a bytes-like object is required, not 'str'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\user\Desktop\score.py", line 8, in <module>
html = codecs.decode(content, 'utf-8')
TypeError: decoding with 'utf-8' codec failed (TypeError: a bytes-like object is required, not 'str')
I guess it's easy to solve the problem, but how do I do it?
Using open('AH.html') decodes the file using a default encoding that may not be the encoding of the file. BeautifulSoup understands the HTML headers, specifically the following content indicates the file is UTF-8-encoded:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Open the file in binary mode and let BeautifulSoup figure it out:
with open("AH.html","rb") as f:
    soup = BeautifulSoup(f, 'html.parser')
Sometimes, websites set the encoding incorrectly. In that case you can specify the encoding yourself if you know what it should be.
with open("AH.html",encoding='utf8') as f:
    soup = BeautifulSoup(f, 'html.parser')
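When BeautifulSoup is handed raw bytes like this, it records the encoding it detected; a quick sanity check (a small sketch using bs4's original_encoding attribute) is:
from bs4 import BeautifulSoup
with open("AH.html", "rb") as f:
    soup = BeautifulSoup(f, 'html.parser')
# the encoding bs4 sniffed from the byte stream / <meta> tag
print(soup.original_encoding)  # expected to report 'utf-8' for this file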
from bs4 import BeautifulSoup
with open("AH.html") as f:
    soup = BeautifulSoup(f, 'html.parser')
tb = soup.find("table")
for item in tb.find_all("tr")[55]:
    print(item.text)
I have to say that your first code is actually fine and should work.
Regarding the second code, you are trying to decode a str, which is the problem: decode() is for bytes objects.
I believe that you are using Windows where the default encoding of it is cp1252 not UTF-8.
Could you please run the following code:
import sys
print(sys.getdefaultencoding())
print(sys.stdin.encoding)
print(sys.stdout.encoding)
print(sys.stderr.encoding)
And check your output if it's UTF-8 or cp1252.
Note that if you are using VSCode with Code Runner, run your code in the terminal as py code.py.
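As a side note (not part of the original answer), the default that open() uses for text files is locale.getpreferredencoding(False) rather than sys.getdefaultencoding(), so it is worth checking that value as well:
import locale
# open() falls back to this when no encoding= argument is passed
print(locale.getpreferredencoding(False))  # often 'cp1252' on Windows, 'UTF-8' elsewhere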
SOLUTIONS (from the chat)
(1) If you are on windows 10
Open Control Panel and change view by Small icons
Click Region
Click the Administrative tab
Click on Change system locale...
Tick the box "Beta: Use Unicode UTF-8..."
Click OK and restart your pc
(2) If you are not on Windows 10 or just don't want to change the previous setting, then in the first code change open("AH.html") to open("AH.html", encoding="UTF-8"), that is write:
from bs4 import BeautifulSoup
with open("AH.html", encoding="UTF-8") as f:
    soup = BeautifulSoup(f, 'html.parser')
tb = soup.find("table")
for item in tb.find_all("tr")[55]:
    print(item.text)

"'ascii' codec can't encode character" error from BeautifulSoup

Python newbie here. Currently writing a crawler for a lyrics website, and I'm running into this problem when trying to parse the HTML. I'm using BeautifulSoup and requests.
Code right now is (after all imports and whatnot):
import requests as r
from bs4 import BeautifulSoup as bs
def function(artist_name):
    temp = "https://www.lyrics.com/lyrics/"
    if ' ' in artist_name:
        artist_name = artist_name.replace(' ', '%20')
    page = r.get(temp + artist_name.lower()).content
    soup = bs(page, 'html.parser')
    return soup
When I try to test this out, I keep getting the following error:
UnicodeEncodeError: 'ascii' codec can't encode character '\xa0' in position 8767: ordinal not in range(128)
I've tried adding .encode('utf-8') to the end of the soup line, and it got rid of the error but wouldn't let me use any of the BeautifulSoup methods since it returns bytes.
I've taken a look at the other posts on here, and tried other solutions they've provided for similar errors. There's still a lot I have to understand about Python and Unicode, but if someone could help out and give some guidance, would be much appreciated.
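A common workaround (a sketch only, not an answer taken from this thread; it assumes Python 3.7+ and that the error occurs when the returned soup is printed to a console that cannot represent '\xa0') is to keep the BeautifulSoup object intact and deal with the console instead:
import sys
import requests as r
from bs4 import BeautifulSoup as bs

def get_artist_page(artist_name):          # hypothetical helper mirroring the question's code
    base = "https://www.lyrics.com/lyrics/"
    page = r.get(base + artist_name.lower().replace(' ', '%20')).content
    return bs(page, 'html.parser')         # return the soup itself, not an encoded copy

# re-point stdout at UTF-8 so printing non-ASCII characters no longer raises (Python 3.7+)
sys.stdout.reconfigure(encoding='utf-8', errors='replace')
soup = get_artist_page("some artist")      # placeholder artist name
print(soup.prettify())                     # soup keeps all of its BeautifulSoup methods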

Why is urlopen giving me a strange string of characters?

I am trying to scrape the NBA game predictions on FiveThirtyEight. I usually use urllib2 and BeautifulSoup to scrape data from the web. However, the html that is returning from this process is very strange. It is a string of characters such as "\x82\xdf\x97S\x99\xc7\x9d". I cannot encode it into regular text. Here is my code:
from urllib2 import urlopen
html = urlopen('http://projects.fivethirtyeight.com/2016-nba-picks/').read()
This method works on other websites and other pages on 538, but not this one.
Edit: I tried to decode the string using
html.decode('utf-8')
and the method located here, but I got the following error message:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte
That page appears to return gzipped data by default. The following should do the trick:
from urllib2 import urlopen
import zlib
opener = urlopen('http://projects.fivethirtyeight.com/2016-nba-picks/')
if 'gzip' in opener.info().get('Content-Encoding', 'NOPE'):
    html = zlib.decompress(opener.read(), 16 + zlib.MAX_WBITS)
else:
    html = opener.read()
The result went into BeautifulSoup with no issues.
The HTTP headers (returned by the .info() above) are often helpful when trying to deduce the cause of issues with the Python url libraries.
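For comparison, the requests library (not used in this thread, so an assumption) negotiates and decompresses gzip transparently, which avoids the problem altogether:
import requests
resp = requests.get('http://projects.fivethirtyeight.com/2016-nba-picks/')
html = resp.text  # requests undoes Content-Encoding: gzip and decodes to text for you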

Unable to extract data from BeautifulSoup object after utf-8 conversion due to 'str' typecasting

I'm trying to build my own web scraper using Python. One of the steps involves parsing an HTML page, for which I am using BeautifulSoup, which is the parser recommended in most tutorials. Here is my code which should extract the page and print it:
import urllib
from bs4 import BeautifulSoup
urlToRead = "http://www.randomjoke.com/topic/haha.php"
handle = urllib.urlopen(urlToRead)
htmlGunk = handle.read()
soup = BeautifulSoup(htmlGunk, "html.parser")
soup = soup.prettify()
print (soup)
However, there seems to be an error when I do soup.prettify() and then print it. The error is:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in
position 16052: ordinal not in range(128)
To resolve this, I googled further and came across an answer on SO which resolved it. I basically had to set the encoding to 'utf-8', which I did. So here is the modded code (last 2 lines only):
soup = soup.prettify().encode('utf-8')
print (soup)
This works just fine. The problem arises when I try to use the soup.get_text() method as mentioned on a tutorial here. Whenever I do soup.get_text(), I get an error:
AttributeError: 'str' object has no attribute 'get_text'
I think this is expected since I'm encoding the soup to 'utf-8' and it's changing it to a str. I tried printing type(soup) before and after utf-8 conversion and as expected, before conversion it was an Object of the bs4.BeautifulSoup class and after, it was str.
How do I work around this? I'm pretty sure I'm doing something wrong and there's a proper way around this. Unfortunately, I'm not too familiar with Python, so please bear with me
You should not discard your original soup object. You can call soup.prettify().encode('utf-8') when you need to print it (or save it into a different variable).
import urllib
from bs4 import BeautifulSoup
urlToRead = "http://www.randomjoke.com/topic/haha.php"
handle = urllib.urlopen(urlToRead)
htmlGunk = handle.read()
soup = BeautifulSoup(htmlGunk, "html.parser")
html_code = soup.prettify().encode('utf-8')
text = soup.get_text().encode('utf-8')
print html_code
print "#################"
print text
# a = soup.find()
# l = []
# for i in a.next_elements:
#     l.append(i)
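For reference, a rough Python 3 version of the same idea (an assumption; the answer's code is Python 2) needs no encode step at all, because str is already Unicode:
from urllib.request import urlopen
from bs4 import BeautifulSoup
urlToRead = "http://www.randomjoke.com/topic/haha.php"
htmlGunk = urlopen(urlToRead).read()
soup = BeautifulSoup(htmlGunk, "html.parser")
print(soup.prettify())   # keep `soup` as a BeautifulSoup object ...
print(soup.get_text())   # ... so get_text() and the other methods still work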

BeautifulSoup findall with class attribute- unicode encode error

I am using BeautifulSoup to extract news stories (just the titles) from Hacker News and have this much up till now:
import urllib2
from BeautifulSoup import BeautifulSoup
HN_url = "http://news.ycombinator.com"
def get_page():
    page_html = urllib2.urlopen(HN_url)
    return page_html
def get_stories(content):
    soup = BeautifulSoup(content)
    titles_html = []
    for td in soup.findAll("td", { "class":"title" }):
        titles_html += td.findAll("a")
    return titles_html
print get_stories(get_page())
When I run the code, however, it gives an error-
Traceback (most recent call last):
File "terminalHN.py", line 19, in <module>
print get_stories(get_page())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe2' in position 131: ordinal not in range(128)
How do I get this to work?
BeautifulSoup works internally with unicode strings. Printing unicode strings to the console makes Python convert them to its default encoding, which is usually ASCII, and that conversion will generally fail for non-ASCII web sites. You can learn the basics about Python and Unicode by googling "python + unicode". In the meantime, convert your unicode strings to UTF-8 when printing:
print some_unicode_string.encode('utf-8')
One thing to note about your code is that findAll returns a list (in this case a list of BeautifulSoup objects), while you only want the titles, so you might want to use find for the link inside each cell. And rather than printing out a list of BeautifulSoup objects, return just the title strings. The following works fine, for example:
import urllib2
from BeautifulSoup import BeautifulSoup
HN_url = "http://news.ycombinator.com"
def get_page():
    page_html = urllib2.urlopen(HN_url)
    return page_html
def get_stories(content):
    soup = BeautifulSoup(content)
    titles = []
    for td in soup.findAll("td", { "class":"title" }):
        a_element = td.find("a")
        if a_element:
            titles.append(a_element.string)
    return titles
print get_stories(get_page())
So now get_stories() returns a list of unicode objects, which prints out as you'd expect.
It works fine; what's broken is the output. Either explicitly encode to your console's charset, or run your code a different way (e.g., from within IDLE).
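A minimal sketch of the first suggestion (explicitly encoding to the console's charset, with a UTF-8 fallback; Python 2, like the rest of the thread):
import sys
encoding = sys.stdout.encoding or 'utf-8'
for title in get_stories(get_page()):
    if title:                                 # a_element.string can be None for nested tags
        print title.encode(encoding, 'replace')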
