Unicode Encode Error: Charmap cannot encode character \xa9 in Python - python

Hi there I am writing scraping code but when i try to get all paragraph from website it give me following error
Unicode Encode Error: Charmap cannot encode character '\xa9'
here is my code:
#Loading Libraries
import urllib
from urllib.parse import urlparse
from urllib.parse import urljoin
import urllib.request
from bs4 import BeautifulSoup
#define URL for scraping
newsurl = "http://www.techspot.com/news/67832-netflix-exceeds-growth-expectations-home-abroad-stock-soars.html"
thepage = urllib.request.urlopen(newsurl)
soup = BeautifulSoup(thepage ,"html.parser")
article = soup.find_all('div' , {'class','articleBody'})
for pg in article:
paragraph = soup.findAll('p')
ptag = paragraph
print(ptag)
Error I am getting is following:
Let me how to remove this error

soup.findAll() returns a ResultSet object which is basically a list which does not have an attribute encode. You either meant to use .text instead:
text = soup.text
Or, "join" the texts:
text = "".join(soup.findAll(whatever, you, want))

At times this error occurs while using Beautiful soup 4 or bs4 or using getData requests or command . So try using the below mentioned code with your print statement.
print(myHtmlData.encode("utf-8"))

Related

How to fix Cyrillic characters while web-scraping with Python

I'm scraping a Cyrillic website with python using BeautifulSoup, but I'm having some trouble, every word is showing like this:
СилÑановÑка Ðавкова во Ðази
I also tried some other Cyrillic websites, but they are working good.
My code is this:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://').text
soup = BeautifulSoup(source, 'lxml')
print(soup.prettify())
How should I fix it?
requests fails to detect it as utf-8.
from bs4 import BeautifulSoup
import requests
source = requests.get('https://time.mk/') # don't convert to text just yet
# print(source.encoding)
# prints out ISO-8859-1
source.encoding = 'utf-8' # override encoding manually
soup = BeautifulSoup(source.text, 'lxml') # this will now decode utf-8 correctly

Encoding issues request results

im learning python, and im trying to retrieve data from wikipedia, but is giving me encoding issues on special charecters of the links, text, etc:
My code:
import sys
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://pt.wikipedia.org/wiki/Jair_Bolsonaro")
bsObj = BeautifulSoup(html)
for link in bsObj.findAll("a"):
if 'href' in link.attrs:
print(link.attrs['href'])
result:
/wiki/Hamilton_Mour%C3%A3o
/wiki/Michel_Temer
/wiki/C%C3%A2mara_dos_Deputados_do_Brasil
...
Should be:
/wiki/Hamilton_Mourão
/wiki/Michel_Temer
/wiki/Câmara_dos_Deputados_do_Brasil
...
Solution:
import urllib.parse
And in print line changed to:
print(urllib.parse.unquote(link.attrs['href']))

Unable to extract data from BeautifulSoup object after utf-8 conversion due to 'str' typecasting

I'm trying to build my own web scraper using Python. One of the steps involves parsing an HTML page, for which I am using BeautifulSoup, which is the parser recommended in most tutorials. Here is my code which should extract the page and print it:
import urllib
from bs4 import BeautifulSoup
urlToRead = "http://www.randomjoke.com/topic/haha.php"
handle = urllib.urlopen(urlToRead)
htmlGunk = handle.read()
soup = BeautifulSoup(htmlGunk, "html.parser")
soup = soup.prettify()
print (soup)
However, there seems to be an error when I do soup.prettify() and then print it. The error is:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in
position 16052: ordinal not in range(128)
To resolve this, I googled further and came across this answer of SO which resolved it. I basically had to set the encoding to 'utf=8' which I did. So here is the modded code (last 2 lines only):
soup = soup.prettify().encode('utf-8')
print (soup)
This works just fine. The problem arises when I try to use the soup.get_text() method as mentioned on a tutorial here. Whenever I do soup.get_text(), I get an error:
AttributeError: 'str' object has no attribute 'get_text'
I think this is expected since I'm encoding the soup to 'utf-8' and it's changing it to a str. I tried printing type(soup) before and after utf-8 conversion and as expected, before conversion it was an Object of the bs4.BeautifulSoup class and after, it was str.
How do I work around this? I'm pretty sure I'm doing something wrong and there's a proper way around this. Unfortunately, I'm not too familiar with Python, so please bear with me
You should not discard your original soup object. You can call soup.prettify().encode('utf-8') when you need to print it (or save it into a different variable).
import urllib
from bs4 import BeautifulSoup
urlToRead = "http://www.randomjoke.com/topic/haha.php"
handle = urllib.urlopen(urlToRead)
htmlGunk = handle.read()
soup = BeautifulSoup(htmlGunk, "html.parser")
html_code = soup.prettify().encode('utf-8')
text = soup.get_text().encode('utf-8')
print html_code
print "#################"
print text
# a = soup.find()
# l = []
# for i in a.next_elements:
# l.append(i)

How to read html body from web-site using Python version 3x

I would like to connect and receive http response from a specific web site link.
I have many Python codes :
import urllib.request
import os,sys,re,datetime
fp = urllib.request.urlopen("http://www.python.org")
mybytes = fp.read()
mystr = mybytes.decode(encoding=sys.stdout.encoding)
fp.close()
when I pass the response as a parameter to:
BeautifulSoup(str(mystr), 'html.parser')
to get the cleaned html text, I got the following error:
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u25bc' in position 1139: character maps to <undefined>.
The question how can I solve this problem?
complete code :
import urllib.request
import os,sys,re,datetime
fp = urllib.request.urlopen("http://www.python.org")
mybytes = fp.read()
mystr = mybytes.decode(encoding=sys.stdout.encoding)
fp.close()
from bs4 import BeautifulSoup
soup = BeautifulSoup(str(mystr), 'html.parser')
mystr = soup;
print(mystr.get_text())
BeautifulSoup is perfectly happy to consume the file-like object returned by urlopen:
from urllib.request import urlopen
from bs4 import BeautifulSoup
with urlopen("...") as website:
soup = BeautifulSoup(website)
print(soup.prettify())
If you use the requests library you can avoid these complications:)
import requests
fp = requests.get("http://www.python.org")
mystr = fp.text
from bs4 import BeautifulSoup
soup = BeautifulSoup(mystr, 'html.parser')
mystr = soup;
print(mystr.get_text())

Error in printing scraped webpage through bs4

Code:
import requests
import urllib
from bs4 import BeautifulSoup
page1 = urllib.request.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)
print(soup.get_text())
print(soup.prettify())
Error:
Traceback (most recent call last):
File "C:\Users\sony\Desktop\Trash\Crawler Try\try2.py", line 9, in <module>
print(soup.get_text())
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u014d' in position 10487: character maps to <undefined>
I think the problem lies mainly with urlib package. Here I am using urllib3 package. They changed the urlopen syntax from 2 to 3 version, which maybe the cause of error. But that being said I have included the latest syntax only.
Python version 3.4
since you are importing requests you can use it instead of urllib like this:
import requests
from bs4 import BeautifulSoup
page1 = requests.get("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1.text)
print(soup.get_text())
print(soup.prettify())
Your problem is that python cannot encode the characters from the page that you are scraping. For some more information see here: https://stackoverflow.com/a/16347188/2638310
Since the wikipedia page is in UTF-8, it seems that BeautifulSoup is guessing the encoding incorrectly. Try passing the from_encoding argument in your code like this:
soup = BeautifulSoup(page1.text, from_encoding="UTF-8")
For more on encodings in BeautifulSoup have a look here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings
I am using Python2.7, so I don't have request method inside the urllib module.
#!/usr/bin/python3
# coding: utf-8
import requests
from bs4 import BeautifulSoup
URL = "http://en.wikipedia.org/wiki/List_of_human_stampedes"
soup = BeautifulSoup(requests.get(URL).text)
print(soup.get_text())
print(soup.prettify())
https://www.python.org/dev/peps/pep-0263/
Put those print lines inside a Try-Catch block so if there is an illegal character, then you won't get an error.
try:
print(soup.get_text())
print(soup.prettify())
except Exception:
print(str(soup.get_text().encode("utf-8")))
print(str(soup.prettify().encode("utf-8")))

Categories

Resources