selenium unicode encode error - python

When retrieving the content of a Google search result page, I get this error:
print driver.find_element_by_tag_name('body').get_attribute('innerHTML')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in position 15663: ordinal not in range(128)
I'm calling the python script from PHP like this
exec('python selenium_scrape.py');
This solves the problem, but then all unicode chars will be encoded twice
print driver.find_element_by_tag_name('body').get_attribute('innerHTML').encode('utf-8')

That's probably because you're printing to a stdout that uses ASCII (7-bit) encoding. Call Python with a locale setting that uses UTF-8, or do some appropriate encoding of the (unicode) HTML content to a 7-bit-safe byte string first.
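One way to keep a single encode step is to wrap sys.stdout so every unicode string the script prints is encoded as UTF-8, whatever locale the calling PHP process passes down; setting PYTHONIOENCODING=utf-8 in the environment before exec() achieves the same without code changes. A minimal Python 2 sketch, assuming the selenium setup from the question (the Firefox driver and search URL are only illustrative):
# selenium_scrape.py -- minimal sketch; only the output handling matters here
import sys
import codecs
from selenium import webdriver

# PHP's exec() usually hands the child an ASCII/POSIX locale, so sys.stdout
# defaults to the 'ascii' codec. Wrapping it with a UTF-8 writer lets us
# print unicode exactly once, with no double encoding.
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

driver = webdriver.Firefox()
driver.get('https://www.google.com/search?q=selenium')
print driver.find_element_by_tag_name('body').get_attribute('innerHTML')
driver.quit()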

Try to encode the text before printing:
print driver.find_element_by_tag_name('body').get_attribute('innerHTML').encode('utf-8')

Related

'ascii' codec can't encode character u'\u2602' in position 438: ordinal not in range(128)

I am running into this problem where, when I try to decode a string, I run into one error, and when I try to encode it, I run into another error (both shown below). Is there a permanent solution for this?
P.S. Please note that you may not be able to reproduce the encoding error with the string I provided, as I couldn't copy/paste some of the offending characters.
text = "sometext"
string = '\n'.join(list(set(text)))
try:
    print "decode"
    text = string.decode('UTF-8')
except Exception as e:
    print e
    text = string.encode('UTF-8')
Errors:
error while using string.decode('UTF-8')
'ascii' codec can't encode character u'\u2602' in position 438: ordinal not in range(128)
Error while using string.encode('UTF-8')
Exception All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
The First Error
The code you have provided will work, as the text there is a bytestring (you are using Python 2). But what you're effectively asking Python to do is convert between UTF-8 and ASCII, which is possible only if the string contains nothing but characters that have an ASCII equivalent (ASCII covers only the first 128 code points). In your case, it's encountering a Unicode character (specifically ☂) which has no ASCII equivalent. You can get around this behaviour by using:
string.decode('UTF-8', 'ignore')
Which will just ignore (i.e. replace with nothing) the characters that cannot be encoded into ASCII.
The Second Error
This error is more interesting. It appears the text you are trying to encode into UTF-8 contains either NULL bytes or certain control characters, which whatever consumes the encoded result does not accept (the "XML compatible" wording suggests the exception comes from an XML library such as lxml rather than from the codec itself). Again, the code that you have actually provided works, but something in the text that you are trying to encode is violating those constraints. You can try the same trick as above:
string.encode('UTF-8', 'ignore')
Which will simply remove the offending characters, or you can look into what it is in your specific text input that is causing the problem.
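To see what the 'ignore' (and the related 'replace') error handler actually does, here is a small Python 2 sketch; the sample text is made up, and the ascii codec is spelled out explicitly because that is the codec named in your traceback:
text = u'forecast \u2602 rain'           # hypothetical text containing the umbrella character

print text.encode('utf-8')               # no error: UTF-8 can represent every code point
print text.encode('ascii', 'ignore')     # 'forecast  rain'  -- the umbrella is silently dropped
print text.encode('ascii', 'replace')    # 'forecast ? rain' -- the umbrella becomes '?'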

URLDecoding requests

I am trying to get the original url from requests. Here is what I have so far:
res = requests.get(...)
url = urllib.unquote(res.url).decode('utf8')
I then get an error that says:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 60-61: ordinal not in range(128)
The original url I requested is:
https://www.microsoft.com/de-at/store/movies/american-pie-pr\xc3\xa4sentiert-nackte-tatsachen/8d6kgwzl63ql
And here is what happens when I try printing:
>>> print '111', res.url
111 https://www.microsoft.com/de-at/store/movies/american-pie-pr%C3%A4sentiert-nackte-tatsachen/8d6kgwzl63ql
>>> print '222', urllib.unquote( res.url )
222 https://www.microsoft.com/de-at/store/movies/american-pie-präsentiert-nackte-tatsachen/8d6kgwzl63ql
>>> print '333', urllib.unquote(res.url).decode('utf8')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 60-61: ordinal not in range(128)
Why is this occurring, and how would I fix this?
UnicodeEncodeError: 'ascii' codec can't encode characters
You are trying to decode a string that is already Unicode. On Python 3 this raises AttributeError (a unicode string has no .decode() method there). On Python 2, the string is first encoded to bytes using sys.getdefaultencoding() ('ascii') before being passed to .decode('utf8'), which is what leads to the UnicodeEncodeError.
In short, do not call .decode() on Unicode strings; use this instead:
print urllib.unquote(res.url.encode('ascii')).decode('utf-8')
Without the .decode() call, the code prints bytes (assuming a bytestring is passed to unquote()), which may produce mojibake if the character encoding used by your environment is not UTF-8. To avoid mojibake, always print Unicode rather than bytes, and do not hardcode your environment's character encoding inside the script; in other words, the .decode() call is necessary here.
There is a bug in urllib.unquote() if you pass it a Unicode string:
>>> print urllib.unquote(u'%C3%A4')
Ã¤
>>> print urllib.unquote('%C3%A4') # utf-8 output
ä
Pass bytestrings to unquote() on Python 2.
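Putting those pieces together, a minimal Python 2 sketch of the whole round trip, reusing the Microsoft Store URL from the question (requests and urllib as above):
import urllib
import requests

res = requests.get('https://www.microsoft.com/de-at/store/movies/'
                   'american-pie-pr%C3%A4sentiert-nackte-tatsachen/8d6kgwzl63ql')

# res.url is a unicode string containing only ASCII (the percent-escapes),
# so encoding it to a bytestring is safe. unquote() then operates on bytes,
# and the final decode turns those UTF-8 bytes back into unicode for printing.
url_bytes = res.url.encode('ascii')
print urllib.unquote(url_bytes).decode('utf-8')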

Unicode error in python program output

I am trying to run a bash command from my Python program, which outputs the result to a file. I am using os.system to execute the bash command, but I am getting the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 793: ordinal not in range(128)
I am not able to understand how to handle it. Please suggest a solution.
Have a look at this blog post:
These messages usually mean that you're either mixing Unicode strings with 8-bit strings, or trying to write Unicode strings to an output file or device that only handles ASCII.
Try the following to properly convert your input data to Unicode. Assuming the string referred to by value is encoded as UTF-8:
value = unicode(value, "utf-8")
You need to encode your string as:
your_string = your_string.encode('utf-8')
For example:
>>> print(u'\u201c'.encode('utf-8'))
“
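A minimal Python 2 sketch combining both steps (decode the input to unicode, encode only at the output boundary); the sample bytes and the output filename are made up:
import io

raw = 'he said \xe2\x80\x9chello\xe2\x80\x9d'   # UTF-8 bytes, e.g. read from a command's output
text = unicode(raw, 'utf-8')                    # bytes -> unicode, as the blog post advises

print text.encode('utf-8')                      # unicode -> bytes for an ASCII-only stdout

# io.open with an explicit encoding accepts unicode directly and does the
# encode step for you when writing the result to a file.
with io.open('output.txt', 'w', encoding='utf-8') as f:
    f.write(text + u'\n')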

unicode issue with scraped data via scrapy

I have been having a hard time for the last two weeks handling some data that I scraped with scrapy. I am using Python 2.7 on Windows 7. This is a small snippet of the data scraped and extracted through a scrapy XPath selector:
{'city': [u'Mangenberger Str.\xa0162', u'42655\xa0Solingen']}
These data are scraped from a page that is UTF-8 encoded, at least according to its headers:
Content-Type: text/html;charset=utf-8
So I believe that I need to decode them in order to get:
Mangenberger Str. 16242655 Solingen
This is what I am getting in my console:
>>> s='Mangenberger Str.\xc2\xa016242655\xc2\xa0Solingen'
>>> s1=s.decode('utf-8')
>>> print s1
Mangenberger Str. 16242655 Solingen
Perfect!
But this is far from what I get when I run my script. I tried both encoding and decoding:
utf-8 encoding
{'city': 'Mangenberger Str.\xc2\xa016242655\xc2\xa0Solingen'}
exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 17:
utf-8-sig encoding
{'city': '\xef\xbb\xbfMangenberger Str.\xc2\xa016242655\xc2\xa0Solingen'}
exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0:
utf-8 decoding
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 17:
utf-8-sig decoding
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 17:
Encode code:
item['city'] = "".join(element.select('//div[@id="bubble_2"]/div/text()').extract()).encode('utf-8')
Decode code:
item['city'] = "".join(element.select('//div[@id="bubble_2"]/div/text()').extract()).decode('utf-8')
From what I understand, the BOM bytes are the problem when I try to decode this string? But then why does it work without problems in my console and fail with an error once I run it through scrapy?
\xa0 in that Python unicode string is the non-breaking space character (U+00A0).
u'Mangenberger Str.\xa0162' and u'42655\xa0Solingen' are perfectly valid unicode strings. Python works with unicode strings wonderfully.
Scrapy XPath selector extract() calls get you a list of unicode strings, and dealing with unicode all along is usually the way to go.
I would NOT recommend encoding the unicode string to something else in your scrapy code.
(and it's encoding you're after, decoding is for non-unicode strings to convert them to unicode strings)
The only step where it makes sense to encode the strings is at the end, when exporting the data (CSV, XML), and even that is already handled for you.
Maybe you can explain what is causing you trouble with these unicode strings.
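For example, a minimal Python 2 sketch of that approach, reusing the city strings from the question (the filename is made up): the field stays unicode inside the spider, and bytes only appear at export time.
import io

# Keep the scraped values as unicode, exactly as extract() returned them.
item = {'city': u''.join([u'Mangenberger Str.\xa0162', u'42655\xa0Solingen'])}

print item['city'].replace(u'\xa0', u' ')   # readable on any console once the NBSPs are spaces

# Bytes only appear at the export boundary.
with io.open('cities.txt', 'a', encoding='utf-8') as f:
    f.write(item['city'] + u'\n')
When you let scrapy's built-in feed exporters write the output instead, that final encode is, as noted above, already handled for you.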

Printing decoded JSON string

I am receiving a JSON string, passing it through json.loads, and ending up with an array of unicode strings. That's all well and good. One of the strings in the array is:
u'\xc3\x85sum'
which should translate into 'Åsum' when decoded using decode('utf8'), but instead I get an error:
UnicodeEncodeError: 'charmap' codec can't encode character u'\x85' in position 1: character maps to <undefined>
To test what's wrong, I did the following:
'Åsum'.encode('utf8')
'\xc3\x85sum'
print '\xc3\x85sum'.decode('utf8')
Åsum
So that worked fine, but if I make it a unicode string, as json.loads does, I get the same error:
print u'\xc3\x85sum'.decode('utf8')
UnicodeEncodeError: 'charmap' codec can't encode character u'\x85' in position 1: character maps to <undefined>
I tried doing json.loads(jsonstring, encoding='utf8') but that changes nothing.
Is there a way to solve this? Can I make json.loads not produce unicode, or make it decode using 'utf8' as I ask it to?
Edit:
The original string I receive looks like this, or at least the part that causes trouble:
"\\u00c3\\u0085sum"
You already have a Unicode value, so trying to decode it forces an encode first, using the default codec.
It looks like you received malformed JSON instead; JSON values are already unicode. If you have UTF-8 data in your Unicode values, the only way to recover is to encode to Latin-1 (which maps the first 256 codepoints to bytes one-to-one), then decode the result as UTF-8:
>>> print u'\xc3\x85sum'.encode('latin1').decode('utf8')
Åsum
The better solution is to fix the JSON source, however; it should not doubly-encode to UTF-8. The correct representation would be:
json.dumps(u'Åsum')
'"\\u00c5sum"'
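If the source cannot be fixed, one way to apply that latin-1/UTF-8 round trip to everything json.loads returns is a small recursive helper. A hedged Python 2 sketch (fix_doubly_encoded is just an illustrative name; it assumes every code point in the strings is below 256, which is exactly what doubly-encoded UTF-8 produces):
import json

def fix_doubly_encoded(value):
    # Undo the extra UTF-8 layer: latin-1 maps code points 0-255 to bytes
    # one-to-one, so this recovers the original UTF-8 byte sequence.
    if isinstance(value, unicode):
        return value.encode('latin1').decode('utf8')
    if isinstance(value, list):
        return [fix_doubly_encoded(v) for v in value]
    if isinstance(value, dict):
        return dict((fix_doubly_encoded(k), fix_doubly_encoded(v))
                    for k, v in value.items())
    return value

data = json.loads('["\\u00c3\\u0085sum"]')   # the malformed payload from the question
print fix_doubly_encoded(data)[0]            # Åsum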
