I'm trying to read a docx file in Python 2.7 with this code:
import docx
document = docx.Document('sim_dir_administrativo.docx')
docText = '\n\n'.join([
    paragraph.text.encode('utf-8') for paragraph in document.paragraphs])
And then I'm trying to decode the string inside the file with this code, because I have some special characters (e.g. ã):
print docText.decode("utf-8")
But, I'm getting this error:
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position
494457: character maps to <undefined>
How can I solve this?
The print statement can only output characters that exist in your local encoding. You can find out what that is with sys.stdout.encoding. To print text with special characters you must first encode it to that local encoding.
# -*- coding: utf-8 -*-
import sys
print sys.stdout.encoding
print u"Stöcker".encode(sys.stdout.encoding, errors='replace')
print u"Стоескер".encode(sys.stdout.encoding, errors='replace')
This code snippet was taken from this Stack Overflow answer.
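Applied to the original docx example, a minimal sketch (assuming the same file name, and that paragraph.text is a unicode string as python-docx returns) could keep the text as unicode and encode only at print time:

# -*- coding: utf-8 -*-
import sys
import docx

document = docx.Document('sim_dir_administrativo.docx')

# Keep the joined text as unicode instead of encoding each paragraph
docText = u'\n\n'.join(paragraph.text for paragraph in document.paragraphs)

# Encode only when printing, replacing anything the console cannot display
print docText.encode(sys.stdout.encoding or 'utf-8', 'replace')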
Related
I have a dataframe containing Chinese characters like this:
I want to save it as a .mat file using:
import scipy.io as scio

datanew = r'data/newmat.mat'
scio.savemat(datanew, {'date': df})
But I got this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
I have tried adding # -*- coding: utf-8 -*- as the first line, and also import sys with default_encoding = 'utf-8', but neither worked.
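A possible workaround (only a sketch, assuming Python 2 with pandas and scipy.io as in the snippet above) is to encode each unicode cell to UTF-8 bytes before handing the data to savemat, so it never has to ASCII-encode the Chinese text itself:

# -*- coding: utf-8 -*-
import scipy.io as scio

# Encode every unicode cell to a UTF-8 byte string first (assumption: df is
# the pandas DataFrame from the question)
df_encoded = df.applymap(
    lambda v: v.encode('utf-8') if isinstance(v, unicode) else v)

datanew = r'data/newmat.mat'
scio.savemat(datanew, {'date': df_encoded.values})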
I have the following string in Python 3:
bytestring = b'Zeer ge\xc3\xafnteresseerd naar iemands verhalen luisteren.'
How do I get this to a string with normal characters? That is:
'Zeer geïnteresseerd naar iemands verhalen luisteren.'
I've already tried decoding it using:
bytestring.decode('utf-8')
But when I try to print that value to the console Python gives me the following error:
UnicodeEncodeError: 'ascii' codec can't encode character '\xef' in position 7: ordinal not in range(128)
Any help appreciated.
SOLUTION
I solved the problem by typing the following in the terminal:
export PYTHONIOENCODING=UTF-8
After that I was able to print the decoded bytestring to the console.
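An alternative that does not rely on the environment variable (a sketch, assuming Python 3) is to re-wrap sys.stdout so that everything printed is encoded as UTF-8 regardless of the terminal's default:

import io
import sys

# Replace stdout with a UTF-8 text wrapper around the underlying byte stream
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

bytestring = b'Zeer ge\xc3\xafnteresseerd naar iemands verhalen luisteren.'
print(bytestring.decode('utf-8'))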
It seems like you are working with raw bytes rather than a normal string. See if this helps: decode with the custom function below, which tries UTF-8 first and falls back to Latin-1, then encode the result to ASCII.
def CustomDecode(mybytes):
    '''Accepts bytes and tries to decode with UTF-8 first and then Latin-1'''
    try:
        decval = mybytes.decode('utf8')
    except UnicodeDecodeError:
        decval = mybytes.decode('latin1')
    return decval

CustomDecode(bytestring).encode('ascii', 'ignore')
Result:
b'Zeer genteresseerd naar iemands verhalen luisteren.'
I'm trying to write Chinese characters into a text file from a SQL output called result.
result looks like this:
[('你好吗', 345re4, '2015-07-20'), ('我很好',45dde2, '2015-07-20').....]
This is my code:
#result is a list of tuples
file = open("my.txt", "w")
for row in result:
    print >> file, row[0].encode('utf-8')
file.close()
row[0] contains Chinese text like this: 你好吗
I also tried:
print >> file, str(row[0]).encode('utf-8')
and
print >> file, 'u'+str(row[0]).encode('utf-8')
but both gave the same error.
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-10: ordinal not in range(128)
I found a simpler solution: instead of encoding and decoding each value, open the file as UTF-8 from the beginning using codecs.
import codecs
file = codecs.open("my.txt", "w", "utf-8")
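A minimal sketch of the full loop, assuming result and row[0] are as in the question (row[0] being a unicode string): with codecs.open the file object encodes on write, so no explicit .encode('utf-8') is needed.

import codecs

file = codecs.open("my.txt", "w", "utf-8")
for row in result:
    # row[0] is written as unicode; codecs encodes it to UTF-8 on the way out
    file.write(row[0] + u"\n")
file.close()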
Don't forget to add the UTF-8 BOM at the beginning of the file if you wish to view it correctly in a text editor:
file = open(...)
file.write("\xef\xbb\xbf")
for row in result:
print >> file, u""+row[0].decode("mbcs").encode("utf-8")
file.close()
I think you'll have to decode from your machine's default encoding to unicode, then encode it as UTF-8.
mbcs represents (at least it did ages ago) the default encoding on Windows, but do not rely on that.
Did you try the codecs module?
I currently use Sublime 2 and run my Python code there.
When I try to run this code, I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
6: ordinal not in range(128)
# -*- coding: utf-8 -*-
s = unicode('abcdefö')
print s
I have been reading the Python documentation on Unicode, and as far as I understand this should work. Or is it the console that's not working?
Edit: Using s = u'abcdefö' as a string produces almost the same result. The result I get is
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in
position 6: ordinal not in range(128)
What happens is that unicode('abcdefö') tries to decode the encoded string to unicode at runtime. The coding: utf-8 line only tells Python that the source file is encoded in UTF-8. When the script runs, the source has been compiled and the string has been stored as an encoded byte string. So when Python tries to decode that string it uses ASCII by default. As the string is actually UTF-8 encoded, this fails.
You can do s = u'abcdefö' which tells the compiler to decode the string with the encoding declared for the file and store it as unicode. s = unicode('abcdefö', 'utf8') or s = 'abcdefö'.decode('utf8') would do the same thing during runtime.
However, that does not necessarily mean that you can print s now. First the internal unicode string has to be encoded in a character set that stdout (the console/editor/IDE) can actually display. Sadly, Python often fails at figuring out the right character set, defaults to ASCII again, and you get an error when the string contains non-ASCII characters. The Python Wiki knows a few ways to set up stdout properly.
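A short sketch of the three equivalent ways to get the unicode string, plus an explicit encode for stdout (assuming the source file really is saved as UTF-8):

# -*- coding: utf-8 -*-
import sys

# All three produce the same unicode object
s1 = u'abcdefö'
s2 = unicode('abcdefö', 'utf8')
s3 = 'abcdefö'.decode('utf8')

# Encode explicitly for the console; fall back to UTF-8 if the
# encoding of stdout cannot be detected
print s1.encode(sys.stdout.encoding or 'utf-8', 'replace')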
You need to mark the string as a unicode string:
s = u'abcdefö'
rather than a plain byte string:
s = 'abcdefö'
Do NOT call unicode() if the string is already unicode; i.e. unicode(s) is wrong in that case.
If type(s) == str but it contains UTF-8 encoded characters, first convert to unicode:
str_val = unicode(s, 'utf-8')
str_val = unicode(s, 'utf-8', 'replace')
Finally, encode back to a byte string before printing:
print str_val.encode('utf-8')
How should I write "mąka" in Python without an exception?
I've tried var= u"mąka" and var= unicode("mąka") etc... nothing helps
I have coding definition in first line in my document, and still I've got that exception:
'utf8' codec can't decode byte 0xb1 in position 0: unexpected code byte
Save the following 2 lines into write_mako.py:
# -*- encoding: utf-8 -*-
open(u"mąka.txt", 'w').write("mąka\n")
Run:
$ python write_mako.py
A mąka.txt file containing the word mąka should be created in the current directory.
If it doesn't work, you can use chardet to detect the actual encoding of the file (see chardet example usage):
import chardet
print chardet.detect(open('write_mako.py', 'rb').read())
In my case it prints:
{'confidence': 0.75249999999999995, 'encoding': 'utf-8'}
The # -*- coding: ... -*- line must specify the encoding the source file is saved in. This error message:
'utf8' codec can't decode byte 0xb1 in position 0: unexpected code byte
indicates you aren't saving the source file in UTF-8. You can save your source file in any encoding that supports the characters you are using in the source code, just make sure you know what it is and have an appropriate coding line.
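For example, the 0xb1 byte in the error suggests the file may actually be saved as ISO-8859-2 (Latin-2), where 0xb1 is ą; in that case the declaration has to match. This is only an assumption about your editor's save encoding:

# -*- coding: iso-8859-2 -*-
# Works only if the file really is saved as ISO-8859-2 on disk
var = u"mąka"
print repr(var)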
What exception are you getting?
You might try saving your source code file as UTF-8, and putting this at the top of the file:
# coding=utf-8
That tells Python that the file’s saved as UTF-8.
This code works for me, saving the file as UTF-8:
v = u"mąka"
print repr(v)
The output I get is:
u'm\u0105ka'
Please copy and paste the exact error you are getting. If you are getting this error:
UnicodeEncodeError: 'charmap' codec can't encode character ... in position ...: character maps to <undefined>
Then you are trying to output the character somewhere that does not support UTF-8 (e.g. your shell's character encoding is set to something other than UTF-8).