Encode UTF-8 for list

Encode UTF-8 for list - python

I'm using selenium to retrieve a list from a javascript object.
search_reply = driver.find_element_by_class_name("ac_results")
When trying to write to csv, I get this error:
Traceback (most recent call last):
File "insref_lookup15.py", line 54, in <module>
wr_insref.writerow(instrument_name)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 22: ordinal not in range(128)
I have tried placing .encode("utf-8") on both:
search_reply = driver.find_element_by_class_name("ac_results").encode("utf-8")
and
wr_insref.writerow(instrument_name).encode("utf-8")
but I just get the message
AttributeError: 'xxx' object has no attribute 'encode'

You need to encode the elements in the list:
wr_insref.writerow([v.encode('utf8') for v in instrument_name])
The csv module documentation has an Examples section that covers writing Unicode objects in more detail, including utility classes to handle this automatically.

Related

Emoji support reading from file in python? [duplicate]

I need to analyse a textfile in tamil (utf-8 encoded). Im using nltk package of Python on the interface IDLE. when i try to read the text file on the interface, this is the error i get. how do i avoid this?
corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt').read()
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt').read()
File "C:\Users\Customer\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 33: character maps to <undefined>

Since you are using Python 3, just add the encoding parameter to open():
corpus = open(
r"C:\Users\Customer\Desktop\DISSERTATION\ettuthokai.txt", encoding="utf-8"
).read()

emoji not printing in python shows UnicodeEncodeError

I'm trying to print an emoji on Python
Code:
print("\U0001f600")
I'm trying to print using unicode but after compiling it shows error
Error:
Traceback (most recent call last): File
"C:/Users/RAHUL/Desktop/new.py", line 1, in print("\U0001f600")
UnicodeEncodeError: 'UCS-2' codec can't encode character '\U0001f600'
in position 0: Non-BMP character not supported in Tk
I have searched on the internet but did't found the answer

All of a sudden str.decode('unicode_escape') stopped working [2.7.3]

This snippet is taken from my recent python work. And it used to work just fine
strr = "What is th\u00e9 point?"
print strr.decode('unicode_escape')
But now it throws the unicode decoding error:
Traceback (most recent call last):
File "C:\Users\Lenon\Documents\WorkDir\pyWork\ocrFinale\F1\tests.py", line 49, in <module>
print strr.decode('unicode_escape')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 10: ordinal not in range(128)
What is the possible cause of this?

You either have enabled unicode literals or have created a Unicode object by another means, by mistake.
The strr value is already a unicode object, so in order to decode the value Python first tries to encode to a byte string.
If you have an actual byte string your code works:
>>> strr = "What is th\u00e9 point?"
>>> strr.decode('unicode_escape')
u'What is th\xe9 point?'
but as soon as strr is in fact a Unicode object, you get the error as Python tries to encode the object using the default ASCII codec first:
>>> strr.decode('unicode_escape').decode('unicode_escape')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 10: ordinal not in range(128)
It could be that you enabled unicode_literals, for example:
>>> from __future__ import unicode_literals
>>> strr = "What is th\u00e9 point?"
>>> type(strr)
<type 'unicode'>
>>> strr.decode('unicode_escape')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 10: ordinal not in range(128)

Print succeeds but logging module throws exception

I'm trying to log the contents of a file, but I get some funny behavior from the logging module (and not only that one).
Here is the file contents:
"Testing …"
Testing å¨'æøöä
"Testing å¨'æøöä"
And here is how I open and log it:
with codecs.open(f, "r", encoding="utf-8") as myfile:
script = myfile.read()
log.debug("Script type: {}".format(type(script)))
print(script)
log.debug("{}".format(script.encode("utf8")))
The line where I log the type of the object shows up as follows in my logs:
Script type: <type 'unicode'>
Then the print ... line prints the contents correctly to console, but, the logging module throws an exception:
Traceback (most recent call last):
File "/usr/lib/python2.7/logging/__init__.py", line 882, in emit
stream.write(fs % msg.encode("UTF-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 63: ordinal not in range(128)
When I remove the .encode("utf8") bit from that last line, I get the expected exception:
'ascii' codec can't encode character u'\u2026' in position 9: ordinal not in range(128)
This is just to demonstrate the problem. It's not only the logging module. Rest of my code also throws similar exceptions when dealing with this "unicode" string.
What am I doing wrong?

Logging handles Unicode values just fine:
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>> script = u'"Testing …"'
>>> logging.debug(script)
DEBUG:root:"Testing …"
(Writing to a log file will result in UTF-8 encoded messages).
Where you went wrong is by mixing byte strings and Unicode values while using str.format():
>>> "{}".format(script)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 9: ordinal not in range(128)
If you used a unicode format string you avoid the forced implicit encoding:
>>> u"{}".format(script)
u'"Testing \u2026"'

dictionary data extraction issue

top_100 is a mongodb collection:
the following code:
x=[]
thread=[]
for doc in top_100.find():
x.append(doc['_id'])
db = Connection().test
top_100 = db.top_100_thread
thread = [a["thread"] for a in x]
for doc in thread:
print doc
gives this error:
Traceback (most recent call last):
File "C:\Users\chatterjees\workspace\de.vogella.python.first\src\top_100_thread.py", line 21, in <module>
print doc
File "C:\Python27\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u03b9' in position 10: character maps to <undefined>
what's going on?

Its because your document contains some unicode data.
You need to correctly output unicode data
instead of directly printing it.
see:
python 3.0, how to make print() output unicode?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Encode UTF-8 for list - python

You need to encode the elements in the list: wr_insref.writerow([v.encode('utf8') for v in instrument_name]) The csv module documentation has an Examples section that covers writing Unicode objects in more detail, including utility classes to handle this automatically.

Related

Emoji support reading from file in python? [duplicate]

emoji not printing in python shows UnicodeEncodeError

All of a sudden str.decode('unicode_escape') stopped working [2.7.3]

Print succeeds but logging module throws exception

dictionary data extraction issue

Categories

Resources