Python encoding/decoding error

I'm using Python 2.7.3 on Windows 7 (32-bit). In cmd I ran:
chcp 1254
to switch the console code page to 1254. But with this script:
#!/usr/bin/env python
# -*- coding:cp1254 -*-
print "öçışğüÖÇİŞĞÜ"
When I ran the code above, I got this output:
÷²■­³Íæ̺▄
But when I prefixed the string with u (print u"öçışğüÖÇİŞĞÜ"), it printed correctly.
Then I changed the code to:
#!/usr/bin/env python
# -*- coding:cp1254 -*-
import os
a = r"C:\\"
b = "ö"
print os.path.join(a, b)
I got that output:
÷
So then I tried
print unicode(os.path.join(a, b))
and got this error:
print unicode(os.path.join(a, b))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 13: ordinal
not in range(128)
I then tried a different way:
print os.path.join(a, b).decode("utf-8").encode(sys.stdout.encoding)
When I ran that, I got this error:
print os.path.join(a, b).decode("utf-8").encode(sys.stdout.encoding)
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf6 in position 13: invalid start byte
In short, I can't get rid of this error. How can I solve it? Thanks.

When I run your original code but use chcp 857 (the Turkish OEM code page), I can reproduce your issue, so I do not think you were actually running chcp 1254:
÷²■­³Íæ̺▄
If you declare your source encoding as:
# -*- coding:cp1254 -*-
You must save your source code in that encoding. If you don't use Unicode strings, you must also use the same encoding at the console. Then it works correctly.
Example (source declared cp1254, but saved incorrectly as cp1252, and console chcp 1254):
öçisgüÖÇISGÜ
Example (source declared and saved correctly as cp1254, console chcp 1254):
öçışğüÖÇİŞĞÜ
It is important to note that with Unicode strings, you don't have to match the source encoding with the encoding of your console.
Example (declared and saved as UTF-8, with Unicode string):
#!python2
# -*- coding:utf8 -*-
print u"öçışğüÖÇİŞĞÜ"
Output (use any code page that supports Turkish...1254, 857, 1026...):
öçışğüÖÇİŞĞÜ
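To see concretely why the console code page matters for byte strings, note that the same text maps to different bytes under each codec (Python 3 syntax shown for brevity; the byte values are exactly what a Python 2 byte string would hold):

```python
# The same two letters become different byte sequences per codec; the
# console must decode with the codec that produced the bytes it receives.
text = "öç"
print(text.encode("cp1254"))  # b'\xf6\xe7'  (one byte per character)
print(text.encode("utf-8"))   # b'\xc3\xb6\xc3\xa7'  (two bytes each)

# Decoding a cp1254 byte as if it were UTF-8 fails, which is the
# "invalid start byte" error from the question:
try:
    b"\xf6".decode("utf-8")
except UnicodeDecodeError as e:
    print(e)
```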

Related

'ascii' codec can't encode character when redirecting Python script to file through Bash [duplicate]

I have a python script that grabs a bunch of recent tweets from the twitter API and dumps them to screen. It works well, but when I try to direct the output to a file something strange happens and a print statement causes an exception:
> ./tweets.py > tweets.txt
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2018' in position 61: ordinal not in range(128)
I understand that the problem is with a UTF-8 character in one of the tweets that doesn't translate well to ASCII, but what is a simple way to dump the output to a file? Do I fix this in the python script or is there a way to coerce it at the commandline?
BTW, the script was written in Python 2.
Without modifying the script, you can just set the environment variable PYTHONIOENCODING=utf8 and Python will assume that encoding when redirecting to a file.
References:
https://docs.python.org/2.7/using/cmdline.html#envvar-PYTHONIOENCODING
https://docs.python.org/3.3/using/cmdline.html#envvar-PYTHONIOENCODING
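A minimal demonstration of the same idea, with an inline python3 one-liner standing in for the question's tweets.py (assuming a POSIX shell; on Windows cmd you would run set PYTHONIOENCODING=utf8 first):

```shell
# PYTHONIOENCODING overrides the ascii default Python picks when stdout
# is not a terminal, so the redirected output is written as UTF-8.
PYTHONIOENCODING=utf8 python3 -c "print(u'\u2018quoted\u2019')" > tweets.txt
cat tweets.txt
```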
You may need to encode the unicode object with .encode('utf-8').
In your Python file, add this as the first line:
# -*- coding: utf-8 -*-
If your script runs standalone (with a shebang line), put it on the second line instead:
#!/usr/local/bin/python
# -*- coding: utf-8 -*-
Here is the document: PEP 0263

How to solve this encoding issue in with Spyder in Anaconda (Python 3)?

I'm trying to run the following:
import json
path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]
But I get the following error :
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
6987: ordinal not in range(128)
From searching online I found that this usually means the encoding needs to be set to UTF-8, but my default encoding already is UTF-8:
sys.getdefaultencoding()
Out[43]: 'utf-8'
Also, it looks like my file is in UTF-8, so I'm really confused.
Also, the following code works :
In [15]: path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
In [16]: open(path).readline()
Is there a way to solve this? Thanks!
EDIT:
When I run the code in my console it works, but not when I run it in Spyder from Anaconda (https://www.continuum.io/downloads).
Any idea what could be going wrong?
The text file contains non-ASCII characters on a line somewhere. On your setup the default file encoding is evidently ascii instead of utf-8, so specify the file's encoding explicitly:
import json
path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line.strip()) for line in open(path, encoding="utf-8")]
(Doing this is a good idea anyway even when the default works)
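A self-contained sketch of that approach (the file name and contents are made up for the demo; tempfile keeps it tidy):

```python
# Write a small JSON-lines file as UTF-8, then read it back with an
# explicit encoding so the result never depends on the platform default.
import json
import tempfile

with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", encoding="utf-8", delete=False
) as f:
    f.write('{"city": "S\u00e3o Paulo"}\n')  # non-ASCII content
    path = f.name

records = [json.loads(line) for line in open(path, encoding="utf-8")]
print(records[0]["city"])  # São Paulo
```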
I tried running this program with one additional line at the top:
# -*- coding: utf-8 -*-
It fetches the lines and shows the output (with u'-prefixed strings; a conversion may be needed afterwards), and it didn't throw the error you mentioned.

Python 2.7.6 - UnicodeEncodeError in Sublime 2 but NO error in Terminal

I have a script that reads from a website. The website has thai characters.
When I run the script in the Terminal, it prints the text fine.
When I run the script in Sublime Text 2 (cmd+B) I get:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-7: ordinal not in range(128)
I have googled and read but nothing seems to work. Any tips?
The Sublime Text 2 command window apparently encodes Unicode strings as ascii before outputting them if they don't have an encoding attached.
Test case that runs in Terminal, but fails to run under Sublime Cmd+B:
# -*- coding: utf-8 -*-
print u'Hello 漢字!'
Encoding the unicode object when printing it works around this for me:
# -*- coding: utf-8 -*-
print u'Hello 漢字!'.encode('utf-8')
Try File-> Save With Encoding -> UTF-8 and run it again. This should work.

Why does my Python program get UnicodeDecodeError in IntelliJ but is OK from the command line?

I have a simple program that loads a .json file which contains a funny character. The program (see below) runs fine in Terminal but gets this error in IntelliJ:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
2: ordinal not in range(128)
The crucial code is:
with open(jsonFileName) as f:
    jsonData = json.load(f)
If I replace the open with:
with open(jsonFileName, encoding='utf-8') as f:
Then it works in both IntelliJ and Terminal. I'm still new to Python and the IntelliJ plugin, and I don't understand why they're different. I thought sys.path might be different, but the output makes me think that's not the cause. Could someone please explain? Thanks!
Versions:
OS: Mac OS X 10.7.4 (also tested on 10.6.8)
Python 3.2.3 (v3.2.3:3d0686d90f55, Apr 10 2012, 11:25:50) /Library/Frameworks/Python.framework/Versions/3.2/bin/python3.2
IntelliJ: 11.1.3 Ultimate
Files (2):
1. unicode-error-demo.py
#!/usr/bin/python
import json
from pprint import pprint as pp
import sys
def main():
    if len(sys.argv) != 2:
        print(sys.argv[0], "takes one arg: a .json file")
        return
    jsonFileName = sys.argv[1]
    print("sys.path:")
    pp(sys.path)
    print("processing", jsonFileName)
    # with open(jsonFileName) as f:  # OK in Terminal, but BUG in IntelliJ: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128)
    with open(jsonFileName, encoding='utf-8') as f:  # OK in both
        jsonData = json.load(f)
    pp(jsonData)

if __name__ == "__main__":
    main()
2. encode-temp.json
["™"]
The JSON .load() function expects Unicode data, not raw bytes. Python automatically tries to decode the byte string to a Unicode string for you using a default codec (in your case ASCII), and fails. By opening the file with the UTF-8 codec, Python makes an explicit conversion for you. See the open() function, which states:
In text mode, if encoding is not specified the encoding used is platform dependent.
The encoding that would be used is determined as follows:
Try os.device_encoding() to see if there is a terminal encoding.
Use the locale.getpreferredencoding() function, which depends on the environment your code runs in; its do_setlocale argument is set to False.
Use 'ASCII' as a default if both methods have returned None.
This is all done in C, but its Python equivalent would be:
if encoding is None:
    encoding = os.device_encoding()
if encoding is None:
    encoding = locale.getpreferredencoding(False)
if encoding is None:
    encoding = 'ASCII'
So when you run your program in a terminal, os.device_encoding() returns 'UTF-8', but when running under IntelliJ there is no terminal, and if no locale is set either, Python falls back to 'ASCII'.
The Python Unicode HOWTO tells you all about the difference between unicode strings and bytestrings, as well as encodings. Another essential article on the subject is Joel Spolsky's Absolute Minimum Unicode knowledge article.
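That fallback chain can be sketched as a runnable snippet to see what open() would pick on a given system (the exception guard around fileno() is an assumption for environments where stdout is not a real file):

```python
# Reproduce the encoding-selection order described above.
import locale
import os
import sys

encoding = None
try:
    # Terminal encoding, if stdout is attached to a real device.
    encoding = os.device_encoding(sys.stdout.fileno())
except (OSError, ValueError, AttributeError):
    pass  # stdout is not backed by a usable file descriptor
if encoding is None:
    encoding = locale.getpreferredencoding(False)  # locale-based fallback
if encoding is None:
    encoding = "ascii"  # last resort
print(encoding)
```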
Python 2.x has byte strings and unicode strings. The basic strings default to ASCII. ASCII uses only 7 bits per character, which allows it to encode 128 characters, while modern UTF-8 uses up to 4 bytes per character. UTF-8 is backward compatible with ASCII (any ASCII-encoded string is a valid UTF-8 string), but not the other way around.
Apparently, your file contains non-ASCII characters. Python by default tries to read it as a plain ASCII-encoded string, hits a byte that is not ASCII (its first bit is not 0, since it's 0xe2) and reports: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128).
It has nothing to do with Python, but it is still my favourite tutorial about encodings:
http://hektor.umcs.lublin.pl/~mikosmul/computing/articles/linux-unicode.html

Diacritic signs

How can I write "mąka" in Python without getting an exception?
I've tried var = u"mąka" and var = unicode("mąka") etc.; nothing helps.
I have a coding declaration on the first line of my document, and I still get this exception:
'utf8' codec can't decode byte 0xb1 in position 0: unexpected code byte
Save the following 2 lines into write_mako.py:
# -*- encoding: utf-8 -*-
open(u"mąka.txt", 'w').write("mąka\n")
Run:
$ python write_mako.py
A mąka.txt file containing the word mąka should be created in the current directory.
If it doesn't work, you can use chardet to detect the actual encoding of the file (see chardet example usage):
import chardet
print chardet.detect(open('write_mako.py', 'rb').read())
In my case it prints:
{'confidence': 0.75249999999999995, 'encoding': 'utf-8'}
The # -*- coding: -*- line must specify the encoding the source file is saved in. This error message:
'utf8' codec can't decode byte 0xb1 in position 0: unexpected code byte
indicates you aren't saving the source file in UTF-8. You can save your source file in any encoding that supports the characters you are using in the source code, just make sure you know what it is and have an appropriate coding line.
What exception are you getting?
You might try saving your source code file as UTF-8, and putting this at the top of the file:
# coding=utf-8
That tells Python that the file’s saved as UTF-8.
This code works for me, saving the file as UTF-8:
v = u"mąka"
print repr(v)
The output I get is:
u'm\u0105ka'
Please copy and paste the exact error you are getting. If you are getting this error:
UnicodeEncodeError: 'charmap' codec can't encode character ... in position ...: character maps to <undefined>
Then you are trying to output the character somewhere that does not support UTF-8 (e.g. your shell's character encoding is set to something other than UTF-8).
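The distinction can be reproduced directly (cp437 is chosen here as an example of a code page that lacks Polish letters):

```python
# Encoding a character into a code page that lacks it raises the
# 'charmap ... character maps to <undefined>' error quoted above,
# while a codec that has the character succeeds.
try:
    u"m\u0105ka".encode("cp437")  # 'ą' is not in code page 437
except UnicodeEncodeError as e:
    print(e)

print(u"m\u0105ka".encode("utf-8"))  # b'm\xc4\x85ka'
```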
