Problems with handling Unicode characters in Python

I wrote the code below for writing the contents to the file:
import codecs

with codecs.open(name, "a", "utf8") as myfile:
    myfile.write(str(eachrecord[1]).encode('utf8'))
    myfile.write(" ")
    myfile.write(str(eachrecord[0]).encode('utf8'))
    myfile.write("\n")
The above code does not work properly when writing Unicode characters, even though I am using codecs and doing the encoding. I keep getting the error
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 6: ordinal not in range(128)
Can anyone see where I am going wrong?
Edit:
with codecs.open(name,"a","utf8") as myfile:
myfile.write(unicode(eachrecord[1]))
myfile.write(" ")
myfile.write(unicode(eachrecord[0]))
myfile.write("\n")
This worked. Thanks for all the quick comments and answers; that really helps. I did not realize that Python has a built-in unicode type until you pointed it out.

Remove the str() calls. They are doing an implicit encoding with the default encoding (ascii in this case).
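A minimal sketch of the corrected loop body, assuming eachrecord already holds unicode objects (if it holds UTF-8 byte strings, decode them with .decode('utf8') first):
import codecs

with codecs.open(name, "a", "utf8") as myfile:
    # codecs.open encodes unicode objects to UTF-8 itself,
    # so no str() or .encode() calls are needed here.
    myfile.write(eachrecord[1])
    myfile.write(u" ")
    myfile.write(eachrecord[0])
    myfile.write(u"\n")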

Related

UnicodeDecodeError with nltk

I am working with Python 2.7 and nltk on a large .txt file of content scraped from various websites. However, I am getting various Unicode errors such as
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)
My question is not so much how I can 'fix' this with Python, but rather whether there is anything I can do to the .txt file (as in formatting) before 'feeding' it to Python, such as 'make plain text', to avoid this issue entirely?
Update:
I looked around and found a solution within Python that seems to work perfectly:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
Try opening the file with:
f = open(fname, encoding="ascii", errors="surrogateescape")
Replace "ascii" with the desired encoding.

Can't encode ascii character u'\xe9'

I am getting the message 'ascii' codec can't encode character u'\xe9' when writing a string to my file. Here's how I am writing the file:
my_file = open(output_path, "w")
my_file.write(output_string)
my_file.close()
I have been searching and found answers like this UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 7: ordinal not in range(128). The first one didn't work, and with this one I'm confused about why I am encoding data that I want to be able to read:
import io
f = io.open(filename, 'w', encoding='utf8')
Thanks for the help
As mentioned, you're trying to write non-ASCII characters with the ASCII encoding. Since the built-in open function in Python 2.7 doesn't support the encoding parameter, consider always using io.open (which is what the built-in open has become in Python 3.x).
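A minimal sketch of the io.open approach, reusing the names from the question (the string value is just an example containing u'\xe9'):
import io

output_string = u'r\xe9sum\xe9\n'   # example unicode value with a non-ASCII character

# io.open encodes the unicode string to UTF-8 on write,
# so no UnicodeEncodeError is raised.
with io.open(output_path, 'w', encoding='utf8') as my_file:
    my_file.write(output_string)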

Strange characters in console Python

Reading the word "beyoncè" from a text file, Python is handling it as "beyonc\xc3\xa9".
If I write it to a file it shows correctly, but in the console it shows like that.
Also, if I try to use it in my program I get:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 17: ordinal not in range(128)
How can I make Python read "beyoncè" from a text file as "beyonce" and get rid of this problem?
See if this helps:
f = open('mytextfile.txt', 'w', encoding='utf-8')
f.write(line)
Try:
string = "beyonc\xc3\xa9"
string.decode("utf-8")      # raises if the bytes are not valid UTF-8
foo = open("foo.txt", "wb")
foo.write(string)           # write the raw UTF-8 bytes in binary mode
foo.close()
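For the reading direction, a small Python 2 sketch (reusing the file name from the first answer): decode the file's bytes to unicode, then encode explicitly when printing to a UTF-8 terminal.
import io

# Read the file as UTF-8 so each line is a unicode object rather than raw bytes.
with io.open('mytextfile.txt', encoding='utf-8') as f:
    for line in f:
        print line.encode('utf-8')   # encode explicitly for a UTF-8 console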

Python decode and encode with utf-8

I am trying to encode and decode with UTF-8. What is weird is that I get an error traceback saying that I am using gbk.
oneword.decode("utf-8")
Below is the error traceback:
UnicodeEncodeError: 'gbk' codec can't encode character '\u2769' in position 1: illegal multibyte sequence
Can anyone tell me what to do? It seems that the decode parameter has no effect.
I got it solved.
Actually, I intended to output to a file instead of the console. In that situation, I have to explicitly specify the encoding of the output target file. Instead of using open, I used codecs.open:
import codecs
f = codecs.open(filename, mode='w', encoding='utf-8')
Thanks to @Bakuriu from the comments:
If you are using Python 3 you no longer need to import the codecs module. Just pass the encoding parameter to the built-in open function.
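A quick sketch of the Python 3 equivalent, reusing the filename from above (the written character is the one from the traceback):
# Python 3: the built-in open takes the encoding directly, so codecs is not needed.
with open(filename, mode='w', encoding='utf-8') as f:
    f.write('\u2769')   # the character from the UnicodeEncodeError above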

Python write (iPhone) Emoji to a file

I have been trying to write a simple script that can save user input (originating from an iPhone) to a text file. The issue I'm having is that when a user uses an Emoji icon, it breaks the whole thing.
OS: Ubuntu
Python Version: 2.7.3
My code currently looks like this:
f = codecs.open(path, "w+", encoding="utf8")
f.write("Desc: " + json_obj["description"])
f.close()
When an Emoji character is passed in the description variable, I get the error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-8: ordinal not in range(128)
Any possible help is appreciated.
The most likely problem here is that json_obj["description"] is actually a UTF-8-encoded str, not a unicode. So, when you try to write it to a codecs-wrapped file, Python has to decode it from str to unicode so it can re-encode it. And that's the part that fails, because that automatic decoding uses sys.getdefaultencoding(), which is 'ascii'.
For example:
>>> f = codecs.open('emoji.txt', 'w+', encoding='utf-8')
>>> e = u'\U0001f1ef'
>>> print e
🇯
>>> e
u'\U0001f1ef'
>>> f.write(e)
>>> e8 = e.encode('utf-8')
>>> e8
'\xf0\x9f\x87\xaf'
>>> f.write(e8)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 0: ordinal not in range(128)
There are two possible solutions here.
First, you can explicitly decode everything to unicode as early as possible. I'm not sure where your json_obj is coming from, but I suspect it's not actually the stdlib json.loads, because by default, that always gives you unicode keys and values. So, replacing whatever you're using for JSON with the stdlib functions will probably solve the problem.
Second, you can leave everything as UTF-8 str objects and stay in binary mode. If you know you have UTF-8 everywhere, just open the file instead of codecs.open, and write without any encoding.
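A minimal sketch of both options, keeping the names from the question (raw_json, standing in for wherever the request body comes from, is an assumption):
import io
import json

json_obj = json.loads(raw_json)   # stdlib json: values come back as unicode

# Option 1: keep everything unicode and let io.open encode on write.
with io.open(path, "w+", encoding="utf-8") as f:
    f.write(u"Desc: " + json_obj["description"])

# Option 2: keep UTF-8 encoded str objects and write in binary mode.
with open(path, "wb") as f:
    f.write("Desc: " + json_obj["description"].encode("utf-8"))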
Also, you should strongly consider using io.open instead of codecs.open. It has a number of advantages, including:
Raises an exception instead of doing the wrong thing if you pass it incorrect values.
Often faster.
Forward-compatible with Python 3.
Has a number of bug fixes that will never be back-ported to codecs.
The only disadvantage is that it's not backward compatible with Python 2.5. Unless that matters to you, don't use codecs.
