Python SyntaxError: Non-ASCII character

I have a string column in my pandas dataframe. I get an error when I try to export this column to an Excel file; I am able to export the dataframe if I delete that column.
I tried to decode all the values in that string column and got the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x81 in position 16: ordinal not in range(128)
I debugged a little and found a value in that particular column that I am not able to decode ('Einen Use case fr Brems-').
I took that value and tried the following code, and hit the error again:
#!/usr/bin/python
# -*- coding: utf-8 -*-
text = 'Einen Use case fr Brems-'
print text.decode()
1) I have tried adding the following lines as well:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
2) I tried this in my system:
import sys
print sys.getdefaultencoding()
I got 'ascii' as output.
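For reference, the failure can be reproduced and worked around in a few lines. This is a sketch under the assumption that the problem byte 0x81 came from a DOS code page such as cp850 (where it maps to 'ü', which would turn 'fr' back into 'für'); the byte literal below is a hypothetical reconstruction, not data from the original dataframe:

```python
# Hypothetical raw bytes for the problem value; 0x81 is not valid ASCII
# (nor a valid UTF-8 start byte), so a plain .decode() fails.
raw = b'Einen Use case f\x81r Brems-'

# If the bytes came from a DOS code page such as cp850, 0x81 is 'ü':
print(raw.decode('cp850'))  # -> Einen Use case für Brems-

# If the source encoding is unknown, an error handler avoids the crash
# by substituting U+FFFD for each undecodable byte:
print(raw.decode('utf-8', errors='replace'))
```

Changing sys.getdefaultencoding() only papers over the problem; decoding explicitly with the correct source encoding is the reliable fix.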

Related

UnicodeEncodeError when using scipy.io.savemat

I have a dataframe containing Chinese characters like this:
I want to save as .mat file using
datanew = r'data/newmat.mat'
scio.savemat(datanew,{'date':df})
But got error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
I have tried adding # -*- coding: utf-8 -*- as the first line, and also import sys with default_encoding = 'utf-8', but neither worked.
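One possible workaround on Python 2 is to encode the unicode values to UTF-8 byte strings before handing them to savemat, so nothing forces an implicit ascii conversion. A minimal sketch with made-up sample values standing in for the dataframe column (the savemat call itself is only shown in a comment):

```python
# Sample values standing in for the dataframe's Chinese-text column:
rows = ['日期', '价格']

# Encode each string to UTF-8 bytes before saving:
encoded = [s.encode('utf-8') for s in rows]
print(encoded)

# scio.savemat(datanew, {'date': encoded})  # call from the question
```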

Python encoding/decoding error

I'm using Python 2.7.3. My operating system is Windows 7 (32-bit).
In cmd, I typed this command:
chcp 1254
which switches the console code page to 1254.
But,
#!/usr/bin/env python
# -*- coding:cp1254 -*-
print "öçışğüÖÇİŞĞÜ"
When I ran the code above, I got this output:
÷²■­³Íæ̺▄
But when I put u in front of the string (print u"öçışğüÖÇİŞĞÜ"), it printed correctly.
When I edited the code like this:
#!/usr/bin/env python
# -*- coding:cp1254 -*-
import os
a = r"C:\\"
b = "ö"
print os.path.join(a, b)
I got this output:
÷
That's why I tried the
print unicode(os.path.join(a, b))
command, and got this error:
print unicode(os.path.join(a, b))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 13: ordinal
not in range(128)
I tried a different way:
print os.path.join(a, b).decode("utf-8").encode(sys.stdout.encoding)
When I ran the code above, I got this error:
print os.path.join(a, b).decode("utf-8").encode(sys.stdout.encoding)
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf6 in position 13: invalid start byte
As a result, I can't get rid of this error. How can I solve it?
Thanks.
When I run your original code but use chcp 857 (the Turkish OEM code page), I can reproduce your issue, so I do not think you were actually running chcp 1254:
÷²■­³Íæ̺▄
If you declare your source encoding as:
# -*- coding:cp1254 -*-
You must save your source code in that encoding. If you don't use Unicode strings, you must also use the same encoding at the console. Then it works correctly.
Example (source declared cp1254, but saved incorrectly as cp1252, and console chcp 1254):
öçisgüÖÇISGÜ
Example (source declared and saved correctly as cp1254, console chcp 1254):
öçışğüÖÇİŞĞÜ
It is important to note that with Unicode strings, you don't have to match the source encoding with the encoding of your console.
Example (declared and saved as UTF-8, with Unicode string):
#!python2
# -*- coding:utf8 -*-
print u"öçışğüÖÇİŞĞÜ"
Output (use any code page that supports Turkish...1254, 857, 1026...):
öçışğüÖÇİŞĞÜ
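The garbled output above can be reproduced offline as a pure encode/decode mismatch, with no console involved. A small illustrative sketch (the round-trip below is mine, not from the original question):

```python
s = 'ö'

# In cp1254 (Windows Turkish), 'ö' is the single byte 0xF6:
assert s.encode('cp1254') == b'\xf6'

# Reading that byte back under a different code page (the OEM code
# page cp857) yields a different character entirely -- mojibake:
garbled = s.encode('cp1254').decode('cp857')
print(garbled)
assert garbled != s
```

This is exactly what happens when the source file's encoding and the console's code page disagree.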

codec can't encode character: character maps to <undefined>

I'm trying to read a docx file in Python 2.7 with this code:
import docx
document = docx.Document('sim_dir_administrativo.docx')
docText = '\n\n'.join([
    paragraph.text.encode('utf-8') for paragraph in document.paragraphs])
And then I'm trying to decode the string inside the file with this code, because I have some special characters (e.g. ã):
print docText.decode("utf-8")
But, I'm getting this error:
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position
494457: character maps to <undefined>
How can I solve this?
The print statement can only print characters that exist in your local encoding. You can find out what that encoding is with sys.stdout.encoding. To print text containing special characters, you must first encode it to your local encoding.
# -*- coding: utf-8 -*-
import sys
print sys.stdout.encoding
print u"Stöcker".encode(sys.stdout.encoding, errors='replace')
print u"Стоескер".encode(sys.stdout.encoding, errors='replace')
This code snippet was taken from another Stack Overflow answer.
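The same error-handler idea applies to the character from the traceback, the en dash (U+2013): when the target encoding cannot represent a character, errors='replace' substitutes '?' instead of raising. A small Python 3 sketch (the sample string is made up):

```python
s = 'pages 494\u2013495'  # contains U+2013, the en dash from the traceback

# ascii cannot represent the en dash; errors='replace' degrades it to '?':
print(s.encode('ascii', errors='replace'))  # -> b'pages 494?495'

# Encoding to UTF-8 keeps it intact:
print(s.encode('utf-8'))
```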

UnicodeDecodeError error writing .xlsx file using xlsxwriter

I am trying to write about 1000 rows to a .xlsx file from my Python application. The data is basically a combination of integers and strings. I am getting an intermittent error when running the wbook.close() command:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 15:
ordinal not in range(128)
My data does not have anything in unicode, so I am wondering why the decoder is being invoked at all. Has anyone noticed this problem?
0xc3 is "Ã" in Latin-1, and a common first byte of a UTF-8 sequence, so your data does contain non-ASCII text. What you need to do is change the encoding, using the decode() method:
string.decode('utf-8')
Also depending on your needs and uses you could add
# -*- coding: utf-8 -*-
at the beginning of your script, but only if you are sure that the encoding will not interfere and break something else.
As Alex Hristov points out, you have some non-ASCII data in your code that needs to be encoded as UTF-8 for Excel.
See the following examples from the docs which each have instructions on handling UTF-8 with XlsxWriter in different scenarios:
Example: Simple Unicode with Python 2
Example: Simple Unicode with Python 3
Example: Unicode - Polish in UTF-8
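The Python 2 pattern behind those examples is to pass unicode objects, not raw UTF-8 byte strings, to write(). The decode step can be sketched on its own in Python 3 syntax (the byte literal is a made-up sample; worksheet.write appears only in a comment):

```python
# UTF-8 bytes such as these are what trigger the library's implicit
# ascii decode on Python 2:
raw = b'Caf\xc3\xa9'

# Decode once at the boundary and pass text onward:
text = raw.decode('utf-8')
print(text)  # -> Café
# worksheet.write(row, col, text)   # hypothetical call site
```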

Parsing unicode input using python json.loads

What is the best way to load JSON Strings in Python?
I want to use json.loads to process unicode like this:
import json
json.loads(unicode_string_to_load)
I also tried supplying 'encoding' parameter with value 'utf-16', but the error did not go away.
Full SSCCE with error:
# -*- coding: utf-8 -*-
import json
value = '{"foo" : "bar"}'
print(json.loads(value)['foo']) #This is correct, prints 'bar'
some_unicode = unicode("degradé")
#last character is latin e with acute ("\xc3\xa9" in UTF-8)
value = '{"foo" : "' + some_unicode + '"}'
print(json.loads(value)['foo']) #incorrect, throws error
Error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
6: ordinal not in range(128)
Supplying 'utf-16' as the encoding only produced a different error:
UnicodeDecodeError: 'utf16' codec can't decode byte 0x38 in
position 6: truncated data
Typecasting the string into a unicode string using 'latin-1' fixed the error:
import json
ustr_to_load = unicode(str_to_load, 'latin-1')
json.loads(ustr_to_load)
and the error is no longer thrown.
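For comparison, Python 3 makes the bytes-versus-text distinction explicit: decode the payload to str first, or (since 3.6) pass UTF-8-encoded bytes directly, and json.loads just works. A minimal sketch:

```python
import json

payload = '{"foo": "degradé"}'.encode('utf-8')  # bytes, as read from a file or socket

# Explicitly decode, then parse:
print(json.loads(payload.decode('utf-8'))['foo'])

# Since Python 3.6, json.loads also accepts UTF-8 bytes directly:
print(json.loads(payload)['foo'])
```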
The OP clarifies (in a comment!)...:
Source data is a huge unicode-encoded string
Then you have to know which of the many unicode encodings it uses -- clearly not 'utf-16', since that failed, but there are so many others -- 'utf-8', 'iso-8859-15', and so forth. You either try them all until one works, or print repr(str_to_load[:80]) and paste what it shows as an edit of your question, so we can guess on your behalf!-).
The simplest way I have found is:
import simplejson as json
That way the rest of your code remains the same:
json.loads(str_to_load)
reference: https://simplejson.readthedocs.org/en/latest/
With Django you can use SimpleJSON, using loads instead of just load:
from django.utils import simplejson
simplejson.loads(str_to_load, "utf-8")
