str1="khloé kardashian"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 4: ordinal not in range(128)
How can I encode this correctly?
I am trying to substitute this string into a URL in a Flask app. It works fine on the command line but raises the above error in the app:
>>> url ="google.com/q=apple"
>>> url.replace("q=apple", "q={}".format(str1))
'google.com/q=khlo\xc3\xa9 kardashian'
You should use urllib to build the URL correctly. Your URL has other issues as well, e.g. the white space; urllib takes care of those too.
import urllib  # Python 2; in Python 3 use urllib.parse.urlencode
params = {'q': str1}
"google.com/" + urllib.urlencode(params)
#'google.com/q=khlo%C3%A9+kardashian'
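In Python 3 the same approach uses urllib.parse.urlencode, which percent-encodes the value for you (and encodes the space as + via quote_plus). A minimal sketch with an illustrative base URL:

```python
from urllib.parse import urlencode

str1 = "khloé kardashian"
params = {"q": str1}
# urlencode percent-encodes the non-ASCII bytes and turns the space into '+'
url = "google.com/?" + urlencode(params)
print(url)  # google.com/?q=khlo%C3%A9+kardashian
```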
Encode it as UTF-8 instead:
str1="khloé kardashian"
str1.encode("utf-8")
A URL, per the standard, cannot have é in it. You need to use the appropriate URL encoding, which is handled by the built-in urllib package.
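For instance, urllib.parse.quote (Python 3) percent-encodes both the é and the space, a quick sketch:

```python
from urllib.parse import quote

# é becomes %C3%A9 (its UTF-8 bytes), the space becomes %20
print(quote("khloé kardashian"))  # khlo%C3%A9%20kardashian
```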
I use the requests module in Python to fetch a web page. However, I found that if the URL includes the character à, it raises a UnicodeDecodeError:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 27: invalid continuation byte
Strangely, this only happens if I also add a space in the URL. So for example, the following does not issue an error.
requests.get("http://myurl.com/àieou")
However, the following does:
requests.get("http://myurl.com/àienah aie")
Why does it happen and how can I make the request correctly?
Use the urllib library to encode the characters automatically:
import urllib  # Python 2; in Python 3 use urllib.parse.quote_plus
requests.get("http://myurl.com/" + urllib.quote_plus("àieou"))
Use quote_plus().
from urllib.parse import quote_plus
requests.get("http://myurl.com/" + quote_plus("àienah aie"))
You can also URL-encode the value yourself:
requests.get("http://myurl.com/%C3%A0ieou")
The value for à is %C3%A0 once encoded.
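You can verify this by percent-encoding the strings yourself with quote_plus, which should reproduce %C3%A0 for à (Python 3 sketch):

```python
from urllib.parse import quote_plus

# à is 0xC3 0xA0 in UTF-8, so it percent-encodes to %C3%A0;
# quote_plus also turns the space into '+'
print(quote_plus("à"))           # %C3%A0
print(quote_plus("àienah aie"))  # %C3%A0ienah+aie
```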
The following code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
uClient = uReq('http://www.google.com')
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html.decode('utf-8', 'ignore'), 'lxml')
print(page_soup.find_all('p'))
...produces the following error:
C:\>python ws1.py
Traceback (most recent call last):
File "ws1.py", line 10, in <module>
print(page_soup.find_all('p'))
File "C:\Python34\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in position 40
: character maps to <undefined>
I have searched, in vain, for a solution; every post I have read suggests using some specific encoding, none of which has fixed the problem.
Any help would be appreciated.
Thank you.
You're trying to print a Unicode string that contains characters that can't be represented in the encoding used by your console.
It appears you're using the Windows command line, which means your problem could be solved simply by switching to Python 3.6 - it bypasses the console encoding altogether and sends Unicode straight to Windows.
If that's not possible, you can encode the string yourself and specify that unprintable characters should be replaced with an escape sequence. Then you must decode it again so that print will work properly.
import sys

# find_all() returns a ResultSet, not a string, so convert it to str first
bstr = str(page_soup.find_all('p')).encode(sys.stdout.encoding, errors='backslashreplace')
print(bstr.decode(sys.stdout.encoding))
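The effect of errors='backslashreplace' can be reproduced without fetching a page, by encoding to cp437 (the console codec from the traceback) directly. The sample string here is made up:

```python
text = "© example"  # U+00A9 (©) has no mapping in cp437

# Instead of raising UnicodeEncodeError, the unrepresentable character
# is replaced with a literal backslash escape
bstr = text.encode("cp437", errors="backslashreplace")
print(bstr.decode("cp437"))  # \xa9 example
```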
When I use urllib in Python3 to get the HTML code of a web page, I use this code:
from urllib.request import Request, urlopen

def getHTML(url):
    request = Request(url)
    request.add_header('User-Agent', 'Mozilla/5.0')
    html = urlopen(request).read().decode('utf-8')
    print(html)
    return html
However, this fails every time with the error:
Traceback (most recent call last):
File "/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 56, in <module>
getHTML('https://www.hltv.org/team/7900/spirit-academy')
File "/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 53, in getHTML
print(html)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10636-10638: ordinal not in range(128)
[Finished in 1.14s]
The page is in UTF-8 and I am decoding it properly according to the urllib docs. The page is not gzipped or in another charset from what I can tell.
url.info().get_charset() returns None for the page, however the meta tags specify UTF-8. I have no problems viewing the HTML in any program.
I do not want to use any external libraries.
Is there a solution? What is going on? This works fine with the following Python2 code:
def getHTML(url):
    opener = urllib2.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
    response = opener.open(url)
    html = response.read()
    return html
You don't need to decode('utf-8').
The following should return the fetched HTML:
def getHTML(url):
    request = Request(url)
    request.add_header('User-Agent', 'Mozilla/5.0')
    html = urlopen(request).read()
    return html
Found your error. The parsing was done just fine and everything evaluated alright. But when you read the traceback carefully:
Traceback (most recent call last): File
"/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 56, in <module>
getHTML('hltv.org/team/7900/spirit-academy') File
"/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 53, in getHTML
print(html)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10636-10638: ordinal not in range(128)
[Finished in 1.14s]
The error was caused by the print statement; as you can see, print(html) appears in the traceback.
This is a fairly common exception. It just tells you that, with your current system encoding, some of the text cannot be printed to the console. One simple solution is print(html.encode('ascii', 'ignore')), which drops all the unprintable characters. You can still do everything else with html; you just can't print it as-is.
See this if you want a better "fix": https://wiki.python.org/moin/PrintFails
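As an illustration with a made-up snippet, the 'ignore' handler simply drops every character that doesn't fit in ASCII:

```python
html = "<p>café © news</p>"  # hypothetical fetched text

# é and © are silently removed; the ASCII characters pass through
print(html.encode("ascii", "ignore"))  # b'<p>caf  news</p>'
```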
By the way, the re module can search byte strings. Copy this exactly as-is and it will work:
import re
print(re.findall(b'hello', b'hello world'))
I am trying to create an HTML parser in Python 3.4.2 on a MacBook Air (OS X):
plaintext.py:
from html.parser import HTMLParser
import urllib.request, formatter, sys
website = urllib.request.urlopen("http://www.profmcmillan.com")
data = website.read()
website.close()
format = formatter.AbstractFormatter(formatter.DumbWriter(sys.stdout))
ptext = HTMLParser(format)
ptext.feed(data)
ptext.close()
But I get the following error:
Traceback (most recent call last):
File "/Users/deannarobertazzi/Documents/plaintext.py", line 9, in <module>
ptext.feed(data)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/html/parser.py", line 164, in feed
self.rawdata = self.rawdata + data
TypeError: Can't convert 'bytes' object to str implicitly
I looked at the Python documentation and apparently the way you parse HTML data in Python 3 is vastly different from doing such a thing in Python 2. I don't know how to modify my code so that it works for Python 3. Thank you.
In 2.x, implicit conversions only worked if all the bytes were in the ASCII range [0-127]:
>>> u'a' + 'b'
u'ab'
>>> u'a' + '\xca'
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
u'a' + '\xca'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xca in position 0: ordinal not in range(128)
What often happened (and why this behavior was dropped in 3.x) is that code would work when tested with ASCII data, such as Prof. McMillan's site seems to be today, and fail later: for instance, if Prof. McMillan added a title with a non-ASCII character, or if another source were used that was not all-ASCII.
The doc for HTMLParser.feed(data) says that the data must be 'text', which in 3.x means a unicode string. So bytes from the web must be decoded to unicode. Decoding the site with utf-8 works today because ascii is a subset of utf-8. However, the page currently has
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1252">
So if a non-ascii char were to be added, and the encoding not changed, utf-8 would not work. There is really no substitute for paying attention to encoding of bytes. How to discover or guess the encoding of a web page (assuming that there is only one encoding used) is a separate subject.
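A minimal sketch of the decode-then-feed pattern: subclass HTMLParser, decode the bytes with the charset the page declares (windows-1252 here), and feed the resulting str. The byte string and the TitleParser class are made up for illustration:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside <title>."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# 0xe9 is é in windows-1252; decoding these bytes as utf-8 would raise
raw = b"<html><head><title>Caf\xe9</title></head></html>"
parser = TitleParser()
parser.feed(raw.decode("windows-1252"))  # feed() requires str, not bytes
print(parser.title)  # Café
```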
I am more than a bit tired, but here goes:
I am doing some HTML scraping in Python 2.6.5 with BeautifulSoup on an Ubuntu box.
Reason for Python 2.6.5: BeautifulSoup sucks under 3.1.
I try to run the following code:
# data retrieval from HTML files from DETHERM
# -*- coding: utf-8 -*-
import sys,os,re,csv
from BeautifulSoup import BeautifulSoup

sys.path.insert(0, os.getcwd())
raw_data = open('download.php.html','r')
soup = BeautifulSoup(raw_data)

for numdiv in soup.findAll('div', {"id" : "sec"}):
    currenttable = numdiv.find('table',{"class" : "data"})
    if currenttable:
        numrow=0
        numcol=0
        data_list=[]
        for row in currenttable.findAll('td', {"class" : "dataHead"}):
            numrow=numrow+1
        for ncol in currenttable.findAll('th', {"class" : "dataHead"}):
            numcol=numcol+1
        for col in currenttable.findAll('td'):
            col2 = ''.join(col.findAll(text=True))
            if col2.index('±'):
                col2=col2[:col2.index('±')]
            print(col2.encode("utf-8"))
        ref=numdiv.find('a')
        niceref=''.join(ref.findAll(text=True))
Now, due to the ± signs, I get the following error when trying to run the code with:
python code.py
Traceback (most recent call last):
File "detherm-wtest.py", line 25, in
if col2.index('±'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
How do I solve this? Putting a u in front, so that we have '±' -> u'±', results in:
Traceback (most recent call last):
File "detherm-wtest.py", line 25, in
if col2.index(u'±'):
ValueError: substring not found
The current code file encoding is UTF-8.
Thank you.
Byte strings like "±" (in Python 2.x) are encoded in the source file's encoding, which might not be what you want. If col2 is really a Unicode object, you should use u"±" instead, as you already tried. Also note that somestring.index raises an exception if it doesn't find an occurrence, whereas somestring.find returns -1. Therefore, this
if col2.index('±'):
    col2=col2[:col2.index('±')] # this is not indented correctly in the question BTW
print(col2.encode("utf-8"))
should be
if u'±' in col2:
    col2=col2[:col2.index(u'±')]
print(col2.encode("utf-8"))
so that the if statement doesn't lead to an exception.
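Putting it together with a hypothetical cell value, the membership test avoids both the UnicodeDecodeError and the ValueError (and works on Python 2 and 3 alike):

```python
# -*- coding: utf-8 -*-
col2 = u"1.234 ± 0.005"  # hypothetical table cell text

if u'±' in col2:                  # membership test never raises, unlike .index()
    col2 = col2[:col2.index(u'±')]
print(col2.encode("utf-8"))
```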