Error in printing scraped webpage through bs4 - python

Code:
import requests
import urllib
from bs4 import BeautifulSoup
page1 = urllib.request.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)
print(soup.get_text())
print(soup.prettify())
Error:
Traceback (most recent call last):
File "C:\Users\sony\Desktop\Trash\Crawler Try\try2.py", line 9, in <module>
print(soup.get_text())
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u014d' in position 10487: character maps to <undefined>
I think the problem lies mainly with the urllib package. Here I am using the urllib3 package. The urlopen syntax changed from version 2 to version 3, which may be the cause of the error. That said, I have used only the latest syntax.
Python version 3.4

Since you are importing requests, you can use it instead of urllib like this:
import requests
from bs4 import BeautifulSoup
page1 = requests.get("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1.text)
print(soup.get_text())
print(soup.prettify())
Your problem is that Python cannot encode some characters from the page that you are scraping. For some more information see here: https://stackoverflow.com/a/16347188/2638310
Since the Wikipedia page is in UTF-8, it seems that BeautifulSoup is guessing the encoding incorrectly. Try passing the from_encoding argument. Note that from_encoding only applies to byte input and is ignored for a str, so pass page1.content rather than page1.text:
soup = BeautifulSoup(page1.content, from_encoding="UTF-8")
For more on encodings in BeautifulSoup have a look here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings

I am using Python 2.7, so I don't have the request submodule inside the urllib module.
#!/usr/bin/python3
# coding: utf-8
import requests
from bs4 import BeautifulSoup
URL = "http://en.wikipedia.org/wiki/List_of_human_stampedes"
soup = BeautifulSoup(requests.get(URL).text)
print(soup.get_text())
print(soup.prettify())
https://www.python.org/dev/peps/pep-0263/

Put those print lines inside a try/except block so that an unencodable character does not abort the script. The fallback prints the UTF-8 encoded bytes instead.
try:
    print(soup.get_text())
    print(soup.prettify())
except UnicodeEncodeError:
    print(str(soup.get_text().encode("utf-8")))
    print(str(soup.prettify().encode("utf-8")))
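As an alternative to catching the exception, here is a minimal sketch that replaces unencodable characters before printing. The helper name printable is hypothetical, and cp1252 is used only because it is the codec named in the traceback above; on your machine you would pass sys.stdout.encoding instead.

```python
def printable(text: str, encoding: str) -> str:
    """Return text with characters the target encoding lacks replaced by '?'."""
    # Encode with errors="replace", then decode again so the result is a str.
    return text.encode(encoding, errors="replace").decode(encoding)


sample = "T\u014dky\u014d"  # contains U+014D, which cp1252 cannot represent
print(printable(sample, "cp1252"))
```

This loses the original characters, so it is only suitable when readable console output matters more than exact text.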

Related

Wrong accented characters using Beautiful Soup in Python on a local HTML file

I'm quite familiar with Beautiful Soup in Python, but I have always used it to scrape live sites.
Now I'm scraping a local HTML file (link, in case you want to test the code). The only problem is that accented characters are not displayed correctly (this never happened to me when scraping live sites).
This is a simplified version of the code
import requests, urllib.request, time, unicodedata, csv
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('AH.html'), "html.parser")
tables = soup.find_all('table')
titles = tables[0].find_all('tr')
print(titles[55].text)
which prints the following output
2:22 - Il Destino Ãˆ GiÃ  Scritto (2017 ITA/ENG) [1080p] [BLUWORLD]
while the correct output should be
2:22 - Il Destino È Già Scritto (2017 ITA/ENG) [1080p] [BLUWORLD]
I looked for a solution, read many questions/answers and found this answer, which I implemented in the following way
import requests, urllib.request, time, unicodedata, csv
from bs4 import BeautifulSoup
import codecs
response = open('AH.html')
content = response.read()
html = codecs.decode(content, 'utf-8')
soup = BeautifulSoup(html, "html.parser")
However, it raises the following error
Traceback (most recent call last):
File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
TypeError: a bytes-like object is required, not 'str'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\user\Desktop\score.py", line 8, in <module>
html = codecs.decode(content, 'utf-8')
TypeError: decoding with 'utf-8' codec failed (TypeError: a bytes-like object is required, not 'str')
I guess the problem is easy to solve, but how do I do it?
Using open('AH.html') decodes the file using a default encoding that may not be the encoding of the file. BeautifulSoup understands the HTML headers; specifically, the following content indicates the file is UTF-8-encoded:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Open the file in binary mode and let BeautifulSoup figure it out:
with open("AH.html","rb") as f:
    soup = BeautifulSoup(f, 'html.parser')
Sometimes, websites set the encoding incorrectly. In that case you can specify the encoding yourself if you know what it should be.
with open("AH.html",encoding='utf8') as f:
    soup = BeautifulSoup(f, 'html.parser')
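To see why the encoding passed to open() matters, here is a small sketch that needs no HTML parser. It writes a UTF-8 file and reads it back twice. The file name sample.html is hypothetical, and cp1252 is assumed to be the default encoding open() would pick on the asker's Windows machine.

```python
import os
import tempfile

original = "Il Destino È Già Scritto"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.html")
    # Save the text as UTF-8, like the downloaded page.
    with open(path, "w", encoding="utf-8") as f:
        f.write(original)
    # Reading with the wrong codec silently produces garbled text.
    with open(path, encoding="cp1252") as f:
        wrong = f.read()
    # Reading with the codec the file was written in recovers the text.
    with open(path, encoding="utf-8") as f:
        right = f.read()

# ascii() keeps the output printable on any console.
print(ascii(wrong))
print(ascii(right))
```

Nothing raises an error in the wrong case, which is why the problem only shows up as strange characters in the output.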
from bs4 import BeautifulSoup
with open("AH.html") as f:
    soup = BeautifulSoup(f, 'html.parser')
tb = soup.find("table")
for item in tb.find_all("tr")[55]:
    print(item.text)
I have to say that your first code is actually fine and should work.
Regarding the second code, you are trying to decode a str, which fails because decode is meant for bytes objects.
I believe that you are using Windows, where the default encoding is cp1252, not UTF-8.
Could you please run the following code:
import sys
print(sys.getdefaultencoding())
print(sys.stdin.encoding)
print(sys.stdout.encoding)
print(sys.stderr.encoding)
Then check whether the output is UTF-8 or cp1252.
Note that if you are using VSCode with Code-Runner, you should run your code in the terminal as py code.py
SOLUTIONS (from the chat)
(1) If you are on Windows 10
Open Control Panel and change view by Small icons
Click Region
Click the Administrative tab
Click on Change system locale...
Tick the box "Beta: Use Unicode UTF-8..."
Click OK and restart your PC
(2) If you are not on Windows 10, or just don't want to change the previous setting, then in the first code change open("AH.html") to open("AH.html", encoding="UTF-8"), that is, write:
from bs4 import BeautifulSoup
with open("AH.html", encoding="UTF-8") as f:
    soup = BeautifulSoup(f, 'html.parser')
tb = soup.find("table")
for item in tb.find_all("tr")[55]:
    print(item.text)

How to fix Cyrillic characters while web-scraping with Python

I'm scraping a Cyrillic website with Python using BeautifulSoup, but I'm having some trouble: every word shows up like this:
СилÑановÑка Ðавкова во Ðази
I also tried some other Cyrillic websites, and they work fine.
My code is this:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://').text
soup = BeautifulSoup(source, 'lxml')
print(soup.prettify())
How should I fix it?
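requests fails to detect the encoding as UTF-8. This typically happens when the server sends a text Content-Type header without a charset, in which case requests falls back to ISO-8859-1, so you have to override it: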
from bs4 import BeautifulSoup
import requests
source = requests.get('https://time.mk/') # don't convert to text just yet
# print(source.encoding)
# prints out ISO-8859-1
source.encoding = 'utf-8' # override encoding manually
soup = BeautifulSoup(source.text, 'lxml') # this will now decode utf-8 correctly
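The garbled output can be reproduced without any network access. The sketch below assumes the page bytes are UTF-8 and were decoded as ISO-8859-1; the sample word is made up for illustration and is not taken from the site.

```python
text = "Силјановска"  # hypothetical sample word in Cyrillic

raw = text.encode("utf-8")          # bytes as the server would send them
garbled = raw.decode("iso-8859-1")  # what a wrong ISO-8859-1 guess produces

# ISO-8859-1 maps every byte to exactly one character, so the damage can be
# undone by reversing the wrong step and decoding with the right codec.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repaired == text)
```

Setting the encoding before reading the text, as in the answer above, avoids the round trip entirely and is the cleaner fix.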

Unicode Encode Error: Charmap cannot encode character \xa9 in Python

Hi there, I am writing scraping code, but when I try to get all the paragraphs from a website it gives me the following error:
Unicode Encode Error: Charmap cannot encode character '\xa9'
here is my code:
#Loading Libraries
import urllib
from urllib.parse import urlparse
from urllib.parse import urljoin
import urllib.request
from bs4 import BeautifulSoup
#define URL for scraping
newsurl = "http://www.techspot.com/news/67832-netflix-exceeds-growth-expectations-home-abroad-stock-soars.html"
thepage = urllib.request.urlopen(newsurl)
soup = BeautifulSoup(thepage ,"html.parser")
article = soup.find_all('div', {'class': 'articleBody'})
for pg in article:
    paragraph = soup.findAll('p')
    ptag = paragraph
    print(ptag)
The error I am getting is the one shown above.
Let me know how to remove this error.
soup.findAll() returns a ResultSet object, which is basically a list and does not have an encode attribute. You either meant to use .text instead:
text = soup.text
Or "join" the texts of the matched tags:
text = "".join(tag.text for tag in soup.findAll(whatever, you, want))
This error sometimes occurs when printing data obtained with Beautiful Soup 4 (bs4) or requests. Try encoding the string in your print statement, as below. Note that in Python 3 this prints a bytes literal (b'...') rather than the decoded text.
print(myHtmlData.encode("utf-8"))
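If you would rather keep printing text than bytes, one option is an output stream that escapes whatever it cannot encode. The sketch below simulates a console with an in-memory buffer. The cp437 code page is an assumption (the question does not say which code page the console uses); it is chosen because it has no copyright sign.

```python
import io

# An in-memory stand-in for a console that only understands cp437.
buffer = io.BytesIO()
console = io.TextIOWrapper(
    buffer,
    encoding="cp437",
    errors="backslashreplace",  # escape unencodable characters instead of raising
    newline="\n",
)

print("\u00a9 2017", file=console)
console.flush()

print(buffer.getvalue())
```

On Python 3.7 and later, sys.stdout.reconfigure(errors="backslashreplace") applies the same error handler to the real console.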

How to read html body from web-site using Python version 3x

I would like to connect and receive http response from a specific web site link.
I have the following Python code:
import urllib.request
import os,sys,re,datetime
fp = urllib.request.urlopen("http://www.python.org")
mybytes = fp.read()
mystr = mybytes.decode(encoding=sys.stdout.encoding)
fp.close()
When I pass the response as a parameter to:
BeautifulSoup(str(mystr), 'html.parser')
to get the cleaned HTML text, I get the following error:
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u25bc' in position 1139: character maps to <undefined>.
How can I solve this problem?
Complete code:
import urllib.request
import os,sys,re,datetime
fp = urllib.request.urlopen("http://www.python.org")
mybytes = fp.read()
mystr = mybytes.decode(encoding=sys.stdout.encoding)
fp.close()
from bs4 import BeautifulSoup
soup = BeautifulSoup(str(mystr), 'html.parser')
mystr = soup;
print(mystr.get_text())
BeautifulSoup is perfectly happy to consume the file-like object returned by urlopen:
from urllib.request import urlopen
from bs4 import BeautifulSoup
with urlopen("...") as website:
    soup = BeautifulSoup(website)
print(soup.prettify())
If you use the requests library you can avoid these complications:)
import requests
fp = requests.get("http://www.python.org")
mystr = fp.text
from bs4 import BeautifulSoup
soup = BeautifulSoup(mystr, 'html.parser')
mystr = soup;
print(mystr.get_text())
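If you stay with urllib, note that the question decodes the body with sys.stdout.encoding, which describes the console rather than the page. A minimal sketch of decoding with the charset declared by the server instead is shown below. The helper decode_body and the UTF-8 fallback are assumptions for illustration; with a real response you would pass fp.headers, which offers the same get_content_charset() method.

```python
from email.message import Message


def decode_body(raw: bytes, headers: Message, fallback: str = "utf-8") -> str:
    """Decode raw using the charset from the Content-Type header, if any."""
    charset = headers.get_content_charset() or fallback
    # errors="replace" keeps going even if a few bytes are malformed.
    return raw.decode(charset, errors="replace")


# Stand-ins for a real response so the sketch runs offline.
headers = Message()
headers["Content-Type"] = "text/html; charset=utf-8"
body = "\u25bc menu".encode("utf-8")

html = decode_body(body, headers)
print(ascii(html))
```

This only fixes the decoding step. Printing the result can still fail if the console cannot represent a character such as U+25BC.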

BeautifulSoup findall with class attribute- unicode encode error

I am using BeautifulSoup to extract news stories(just the titles) from Hacker News and have this much up till now-
import urllib2
from BeautifulSoup import BeautifulSoup
HN_url = "http://news.ycombinator.com"
def get_page():
    page_html = urllib2.urlopen(HN_url)
    return page_html
def get_stories(content):
    soup = BeautifulSoup(content)
    titles_html =[]
    for td in soup.findAll("td", { "class":"title" }):
        titles_html += td.findAll("a")
    return titles_html
print get_stories(get_page())
When I run the code, however, it gives an error-
Traceback (most recent call last):
File "terminalHN.py", line 19, in <module>
print get_stories(get_page())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe2' in position 131: ordinal not in range(128)
How do I get this to work?
BeautifulSoup works internally with unicode strings. Printing a unicode string to the console makes Python convert it to its default encoding, which in Python 2 is usually ASCII. This generally fails for web pages that contain non-ASCII characters. You can learn the basics of Python and Unicode by searching for "python unicode". Meanwhile, encode your unicode strings to UTF-8 before printing:
print some_unicode_string.encode('utf-8')
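The direction of the conversion is easy to mix up: text is encoded to bytes, and bytes are decoded to text. The short sketch below is written for Python 3 (unlike the Python 2 code in this thread), and the sample string is made up for illustration.

```python
title = "caf\u00e9"  # text containing one non-ASCII character

encoded = title.encode("utf-8")    # text -> bytes
decoded = encoded.decode("utf-8")  # bytes -> text

# Encoding to ASCII fails for the same reason as in the traceback above.
failure = None
try:
    title.encode("ascii")
except UnicodeEncodeError as exc:
    failure = exc.reason

print(encoded, decoded == title, failure)
```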
One thing to note about your code is that findAll returns a list (in this case a list of BeautifulSoup objects), while you say you just want the titles. You might want to use find instead and, rather than printing the list of BeautifulSoup objects, collect the string of each link. The following works fine, for example:
import urllib2
from BeautifulSoup import BeautifulSoup
HN_url = "http://news.ycombinator.com"
def get_page():
    page_html = urllib2.urlopen(HN_url)
    return page_html
def get_stories(content):
    soup = BeautifulSoup(content)
    titles = []
    for td in soup.findAll("td", { "class":"title" }):
        a_element = td.find("a")
        if a_element:
            titles.append(a_element.string)
    return titles
print get_stories(get_page())
So now get_stories() returns a list of unicode objects, which prints out as you'd expect.
It works fine, what's broken is the output. Either explicitly encode to your console's charset, or find a different way to run your code (e.g., from within IDLE).
