UnicodeEncodeError when using the compile function - python

Using python 3.2 in Windows 7 I am getting the following in IDLE:
>>> compile('pass', r'c:\temp\工具\module1.py', 'exec')
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character
Can anybody explain why the compile statement tries to convert the unicode filename using mbcs? I know that sys.getfilesystemencoding returns 'mbcs' in Windows, but I thought that this is not used when unicode file names are provided.
for example:
f = open(r'c:\temp\工具\module1.py')
works.
For a more complete test save the following in a utf8 encoded file and run it using the standard python.exe version 3.2
# -*- coding: utf8 -*-
fname = r'c:\temp\工具\module1.py'
# I do have a file named fname, but you can comment out the following two lines
f = open(fname)
print('ok')
cmp = compile('pass', fname, 'exec')
print(cmp)
Output:
ok
Traceback (most recent call last):
File "module8.py", line 6, in <module>
cmp = compile('pass', fname, 'exec')
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character

From Python issue 10114, it seems that the logic is that all filenames used by Python should be valid for the platform where they are used. It is encoded using the filesystem encoding to be used in the C internals of Python.
I agree that it probably shouldn't throw an error on Windows, because any Unicode filename is valid. You may wish to file a bug report with Python for this. But be aware that the necessary changes might not be trivial, because any C code using the filename has to have something to do if it can't be encoded.
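To see ahead of time whether a given filename would hit this kind of error, you can try the encoding step yourself. This is only a minimal sketch for Python 3: can_encode is a hypothetical helper (it is not part of Python), and passing sys.getfilesystemencoding() only approximates what the C internals do.

```python
import sys

def can_encode(filename, encoding=None):
    """Return True if `filename` can be encoded with `encoding`.

    Hypothetical helper for illustration; defaults to the interpreter's
    filesystem encoding.
    """
    encoding = encoding or sys.getfilesystemencoding()
    try:
        filename.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

fname = r'c:\temp\工具\module1.py'
print(sys.getfilesystemencoding())
print(can_encode(fname, 'ascii'))   # the path contains non-ASCII characters
print(can_encode(fname, 'utf-8'))
```

If the check fails for the filesystem encoding, the filename is the kind that triggers the error in the question.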

Here is a solution that worked for me: Issue 427: UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-6: ordinal not in range (128):
If you look at the PyScripter help file in the topic "Encoded Python
Source Files" (last paragraph) it tells you how to configure Python to
support other encodings by modifying the site.py file. This file is
in the lib subdirectory of the Python installation directory. Find
the function setencoding and make sure that support for locale-aware
default string encodings is on. (see below)
def setencoding():
    """Set the string encoding used by the Unicode implementation. The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:  # <<<--- set this to 1
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !

I think you could try changing the "\" in the file path to "/", that is, replace
compile('pass', r'c:\temp\工具\module1.py', 'exec')
with
compile('pass', r'c:/temp/工具/module1.py', 'exec')
I ran into a problem just like yours and this is how I solved it. I hope it works for you too.

Related

How to solve this encoding issue in with Spyder in Anaconda (Python 3)?

I'm trying to run the following:
import json
path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]
But I get the following error :
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
6987: ordinal not in range(128)
From the internet I've found that it should be because the encoding needs to be set to utf-8, but my issue is that it's already in utf-8.
sys.getdefaultencoding()
Out[43]: 'utf-8'
Also, it looks like my file is in utf-8, so I'm really confused
Also, the following code works :
In [15]: path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
In [16]: open(path).readline()
Is there a way to solve this ?
Thanks !
EDIT:
When I run the code in my console it works, but not when I run it in Spyder provided by Anaconda (https://www.continuum.io/downloads)
Do you know what can go wrong ?
The text file contains some non-ascii characters on a line somewhere. Somehow on your setup the default file encoding is set to ascii instead of utf-8 so do the following and specify the file's encoding explicitly:
import json
path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line.strip()) for line in open(path, encoding="utf-8")]
(Doing this is a good idea anyway even when the default works)
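As a self-contained illustration of the same idea (Python 3), the sketch below writes a small file of JSON lines and reads it back with an explicit encoding. The sample records and the file name are made up for the example and merely stand in for the bitly file.

```python
import json
import os
import tempfile

# Hypothetical sample data standing in for the file from the question.
sample = [{"city": "Z\u00fcrich"}, {"city": "S\u00e3o Paulo"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    # Write one JSON document per line, explicitly as UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        for rec in sample:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    # Read it back with the same explicit encoding, independent of the locale.
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]

print(records == sample)
```

Because both open() calls name the encoding, the result should not depend on whether the code runs in a console or inside an IDE such as Spyder.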
I tried running this program with one additional line at the top:
# -*- coding: utf-8 -*-
It fetches the lines and shows the output (with u' prefixed strings; probably a conversion might be required after this). But, it didn't throw any error as you mentioned.

How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte"

as3:~/ngokevin-site# nano content/blog/20140114_test-chinese.mkd
as3:~/ngokevin-site# wok
Traceback (most recent call last):
  File "/usr/local/bin/wok", line 4, in <module>
    Engine()
  File "/usr/local/lib/python2.7/site-packages/wok/engine.py", line 104, in __init__
    self.load_pages()
  File "/usr/local/lib/python2.7/site-packages/wok/engine.py", line 238, in load_pages
    p = Page.from_file(os.path.join(root, f), self.options, self, renderer)
  File "/usr/local/lib/python2.7/site-packages/wok/page.py", line 111, in from_file
    page.meta['content'] = page.renderer.render(page.original)
  File "/usr/local/lib/python2.7/site-packages/wok/renderers.py", line 46, in render
    return markdown(plain, Markdown.plugins)
  File "/usr/local/lib/python2.7/site-packages/markdown/__init__.py", line 419, in markdown
    return md.convert(text)
  File "/usr/local/lib/python2.7/site-packages/markdown/__init__.py", line 281, in convert
    source = unicode(source)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 1: ordinal not in range(128). -- Note: Markdown only accepts unicode input!
How to fix it?
In some other python-based static blog apps, Chinese post can be published successfully.
Such as this app: http://github.com/vrypan/bucket3. In my site http://bc3.brite.biz/, Chinese post can be published successfully.
tl;dr / quick fix
Don't decode/encode willy nilly
Don't assume your strings are UTF-8 encoded
Try to convert strings to Unicode strings as soon as possible in your code
Fix your locale: How to solve UnicodeDecodeError in Python 3.6?
Don't be tempted to use quick reload hacks
Unicode Zen in Python 2.x - The Long Version
Without seeing the source it's difficult to know the root cause, so I'll have to speak generally.
UnicodeDecodeError: 'ascii' codec can't decode byte generally happens when you try to convert a Python 2.x str that contains non-ASCII to a Unicode string without specifying the encoding of the original string.
In brief, Unicode strings are an entirely separate type of Python string that does not contain any encoding. They only hold Unicode code points and therefore can hold any code point from across the entire spectrum. Strings (str) contain encoded text, be it UTF-8, UTF-16, ISO-8859-1, GBK, Big5, etc. Strings are decoded to Unicode and Unicodes are encoded to strings. Files and text data are always transferred in encoded strings.
The Markdown module authors probably use unicode() (where the exception is thrown) as a quality gate to the rest of the code - it will convert ASCII or re-wrap existing Unicodes strings to a new Unicode string. The Markdown authors can't know the encoding of the incoming string so will rely on you to decode strings to Unicode strings before passing to Markdown.
Unicode strings can be declared in your code using the u prefix to strings. E.g.
>>> my_u = u'my ünicôdé strįng'
>>> type(my_u)
<type 'unicode'>
Unicode strings may also come from file, databases and network modules. When this happens, you don't need to worry about the encoding.
Gotchas
Conversion from str to Unicode can happen even when you don't explicitly call unicode().
The following scenarios cause UnicodeDecodeError exceptions:
# Explicit conversion without encoding
unicode('€')
# New style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: {}".format('€')
# Old style format string into Unicode string
# Python will try to convert value string to Unicode first
u'The currency is: %s' % '€'
# Append string to Unicode
# Python will try to convert string to Unicode first
u'The currency is: ' + '€'
Examples
In the following diagram, you can see how the word café has been encoded in either "UTF-8" or "Cp1252" encoding depending on the terminal type. In both examples, caf is just regular ASCII. In UTF-8, é is encoded using two bytes. In "Cp1252", é is 0xE9 (which also happens to be the Unicode code point value; that is no coincidence). The correct decode() is invoked and conversion to a Python Unicode is successful:
In this diagram, decode() is called with ascii (which is the same as calling unicode() without an encoding given). As ASCII can't contain bytes greater than 0x7F, this will throw a UnicodeDecodeError exception:
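The two cases described for the diagrams can be reproduced in a few lines. This sketch uses Python 3 names, where bytes plays the role of the Python 2 str and str plays the role of unicode; the variable names are only illustrative.

```python
word = 'caf\u00e9'   # café, written with an escape to keep the source ASCII

utf8_bytes = word.encode('utf-8')      # é becomes two bytes: 0xC3 0xA9
cp1252_bytes = word.encode('cp1252')   # é becomes one byte: 0xE9

# Decoding with the matching codec recovers the original text.
assert utf8_bytes.decode('utf-8') == word
assert cp1252_bytes.decode('cp1252') == word

# Decoding as ASCII fails, because ASCII has no bytes above 0x7F.
error = None
try:
    utf8_bytes.decode('ascii')
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```

The exception message names the offending byte and its position, which is usually the quickest clue to where the non-ASCII data entered.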
The Unicode Sandwich
It's good practice to form a Unicode sandwich in your code, where you decode all incoming data to Unicode strings, work with Unicodes, then encode to strs on the way out. This saves you from worrying about the encoding of strings in the middle of your code.
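A tiny sketch of the sandwich, written for Python 3 (bytes in, text in the middle, bytes out). normalise_city and its parameters are hypothetical names made up for this example, and the UTF-8 output default is an assumption.

```python
def normalise_city(raw, in_encoding, out_encoding='utf-8'):
    """Decode at the edge, work on text, encode on the way out."""
    text = raw.decode(in_encoding)      # bread: decode incoming bytes
    text = text.strip().title()         # meat: plain text operations only
    return text.encode(out_encoding)    # bread: encode outgoing bytes

# Latin-1 input bytes for ' zürich ' become clean UTF-8 output bytes.
result = normalise_city(b'  z\xfcrich \n', 'latin-1')
print(result)
```

The middle line never needs to know which encoding the data arrived in or will leave in.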
Input / Decode
Source code
If you need to bake non-ASCII into your source code, just create Unicode strings by prefixing the string with a u. E.g.
u'Zürich'
To allow Python to decode your source code, you will need to add an encoding header to match the actual encoding of your file. For example, if your file was encoded as 'UTF-8', you would use:
# encoding: utf-8
This is only necessary when you have non-ASCII in your source code.
Files
Usually non-ASCII data is received from a file. The io module provides a TextIOWrapper that decodes your file on the fly, using a given encoding. You must use the correct encoding for the file - it can't be easily guessed. For example, for a UTF-8 file:
import io
with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
    my_unicode_string = my_file.read()
my_unicode_string would then be suitable for passing to Markdown. If you get a UnicodeDecodeError from the read() line, then you've probably used the wrong encoding value.
CSV Files
The Python 2.7 CSV module does not support non-ASCII characters 😩. Help is at hand, however, with https://pypi.python.org/pypi/backports.csv.
Use it like above but pass the opened file to it:
from backports import csv
import io

def read_rows(path):
    # yield is only valid inside a function, so wrap the loop in a generator
    with io.open(path, "r", encoding="utf-8") as my_file:
        for row in csv.reader(my_file):
            yield row
Databases
Most Python database drivers can return data in Unicode, but usually require a little configuration. Always use Unicode strings for SQL queries.
MySQL
In the connection string add:
charset='utf8',
use_unicode=True
E.g.
>>> db = MySQLdb.connect(host="localhost", user='root', passwd='passwd', db='sandbox', use_unicode=True, charset="utf8")
PostgreSQL
Add:
psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
HTTP
Web pages can be encoded in just about any encoding. The Content-type header should contain a charset field to hint at the encoding. The content can then be decoded manually against this value. Alternatively, Python-Requests returns Unicodes in response.text.
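If you do decode manually, the charset can be pulled out of the header with the standard library. A minimal sketch for Python 3: charset_from_content_type and decode_body are hypothetical helper names, and falling back to UTF-8 when no charset is given is an assumption, not a rule.

```python
from email.message import Message

def charset_from_content_type(value, default='utf-8'):
    """Return the charset parameter of a Content-Type header value."""
    msg = Message()
    msg['Content-Type'] = value
    # get_content_charset() lower-cases the value and returns the
    # fallback when the header has no charset parameter.
    return msg.get_content_charset(default)

def decode_body(body, content_type):
    """Decode raw response bytes using the charset named in the header."""
    return body.decode(charset_from_content_type(content_type))

print(charset_from_content_type('text/html; charset=ISO-8859-1'))
print(ascii(decode_body(b'Z\xfcrich', 'text/plain; charset=ISO-8859-1')))
```

The header is only a hint, so a UnicodeDecodeError at this point usually means the server declared the wrong charset.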
Manually
If you must decode strings manually, you can simply do my_string.decode(encoding), where encoding is the appropriate encoding. Python 2.x supported codecs are given here: Standard Encodings. Again, if you get UnicodeDecodeError then you've probably got the wrong encoding.
The meat of the sandwich
Work with Unicodes as you would normal strs.
Output
stdout / printing
print writes through the stdout stream. Python tries to configure an encoder on stdout so that Unicodes are encoded to the console's encoding. For example, if a Linux shell's locale is en_GB.UTF-8, the output will be encoded to UTF-8. On Windows, you will be limited to an 8bit code page.
An incorrectly configured console, such as corrupt locale, can lead to unexpected print errors. PYTHONIOENCODING environment variable can force the encoding for stdout.
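The effect of PYTHONIOENCODING is easy to observe by running a child interpreter and looking at the raw bytes it writes. This is a rough sketch for Python 3; run_child is a hypothetical helper and the one-line child script is made up for the demonstration.

```python
import os
import subprocess
import sys

def run_child(io_encoding):
    """Run a child interpreter that prints 'café' and return its raw stdout."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, '-c', "print('caf\\u00e9')"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.strip()

print(run_child('utf-8'))     # é arrives as two bytes
print(run_child('latin-1'))   # é arrives as one byte
```

The same text comes out as different bytes depending only on the environment variable, which is why a misconfigured console can break print.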
Files
Just like input, io.open can be used to transparently convert Unicodes to encoded byte strings.
Database
The same configuration for reading will allow Unicodes to be written directly.
Python 3
Python 3 is no more Unicode capable than Python 2.x is, however it is slightly less confused on the topic. E.g the regular str is now a Unicode string and the old str is now bytes.
The default encoding is UTF-8, so if you .decode() a byte string without giving an encoding, Python 3 uses UTF-8 encoding. This probably fixes 50% of people's Unicode problems.
Further, open() operates in text mode by default, so returns decoded str (Unicode ones). The encoding is derived from your locale, which tends to be UTF-8 on Un*x systems or an 8-bit code page, such as windows-1251, on Windows boxes.
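Both defaults can be checked directly. A short Python 3 sketch; note that locale.getpreferredencoding(False) reports what open() typically falls back to when no encoding is passed, and that value depends on your system.

```python
import locale

data = b'caf\xc3\xa9'

# With no argument, bytes.decode() assumes UTF-8.
text = data.decode()
print(ascii(text))

# The locale-derived encoding that text-mode open() typically uses by default.
print(locale.getpreferredencoding(False))
```

If the second line does not print a UTF-8 variant, pass encoding= explicitly whenever you open a file.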
Why you shouldn't use sys.setdefaultencoding('utf8')
It's a nasty hack (there's a reason you have to use reload) that will only mask problems and hinder your migration to Python 3.x. Understand the problem, fix the root cause and enjoy Unicode zen.
See Why should we NOT use sys.setdefaultencoding("utf-8") in a py script? for further details
Finally I got it:
as3:/usr/local/lib/python2.7/site-packages# cat sitecustomize.py
# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')
Let me check:
as3:~/ngokevin-site# python
Python 2.7.6 (default, Dec 6 2013, 14:49:02)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.getdefaultencoding()
'utf8'
>>>
The above shows the default encoding of python is utf8. Then the error is no more.
This is the classic "unicode issue". I believe that completely explaining what is happening is beyond the scope of a StackOverflow answer.
It is well explained here.
In very brief summary, you have passed something that is being interpreted as a string of bytes to something that needs to decode it into Unicode characters, but the default codec (ascii) is failing.
The presentation I pointed you to provides advice for avoiding this. Make your code a "unicode sandwich". In Python 2, the use of from __future__ import unicode_literals helps.
Update: how can the code be fixed:
OK - in your variable "source" you have some bytes. It is not clear from your question how they got in there - maybe you read them from a web form? In any case, they are not encoded with ascii, but python is trying to convert them to unicode assuming that they are. You need to explicitly tell it what the encoding is. This means that you need to know what the encoding is! That is not always easy, and it depends entirely on where this string came from. You could experiment with some common encodings - for example UTF-8. You tell unicode() the encoding as a second parameter:
source = unicode(source, 'utf-8')
In some cases, when you check your default encoding (print sys.getdefaultencoding()), it reports ASCII. Changing it to UTF-8 may still not work, depending on the content of your variable.
I found another way:
import sys
reload(sys)
sys.setdefaultencoding('Cp1252')
I was searching to solve the following error message:
unicodedecodeerror: 'ascii' codec can't decode byte 0xe2 in position 5454: ordinal not in range(128)
I finally got it fixed by specifying 'encoding':
f = open('../glove/glove.6B.100d.txt', encoding="utf-8")
Wish it could help you too.
"UnicodeDecodeError: 'ascii' codec can't decode byte"
Cause of this error: input_string must be unicode but str was given
"TypeError: Decoding Unicode is not supported"
Cause of this error: trying to convert unicode input_string into unicode
So first check that your input_string is str and convert to unicode if necessary:
if isinstance(input_string, str):
    input_string = unicode(input_string, 'utf-8')
Secondly, the above just changes the type but does not remove non ascii characters. If you want to remove non-ascii characters:
if isinstance(input_string, str):
    input_string = input_string.decode('ascii', 'ignore').encode('ascii') #note: this removes the character and encodes back to string.
elif isinstance(input_string, unicode):
    input_string = input_string.encode('ascii', 'ignore')
In order to resolve this on an operating system level in an Ubuntu installation check the following:
$ locale charmap
If you get
locale: Cannot set LC_CTYPE to default locale: No such file or directory
instead of
UTF-8
then set LC_CTYPE and LC_ALL like this:
$ export LC_ALL="en_US.UTF-8"
$ export LC_CTYPE="en_US.UTF-8"
I find the best is to always convert to unicode - but this is difficult to achieve because in practice you'd have to check and convert every argument to every function and method you ever write that includes some form of string processing.
So I came up with the following approach to either guarantee unicodes or byte strings, from either input. In short, include and use the following lambdas:
# guarantee unicode string
_u = lambda t: t.decode('UTF-8', 'replace') if isinstance(t, str) else t
_uu = lambda *tt: tuple(_u(t) for t in tt)
# guarantee byte string in UTF8 encoding
_u8 = lambda t: t.encode('UTF-8', 'replace') if isinstance(t, unicode) else t
_uu8 = lambda *tt: tuple(_u8(t) for t in tt)
Examples:
text='Some string with codes > 127, like Zürich'
utext=u'Some string with codes > 127, like Zürich'
print "==> with _u, _uu"
print _u(text), type(_u(text))
print _u(utext), type(_u(utext))
print _uu(text, utext), type(_uu(text, utext))
print "==> with u8, uu8"
print _u8(text), type(_u8(text))
print _u8(utext), type(_u8(utext))
print _uu8(text, utext), type(_uu8(text, utext))
# with % formatting, always use _u() and _uu()
print "Some unknown input %s" % _u(text)
print "Multiple inputs %s, %s" % _uu(text, text)
# but with string.format be sure to always work with unicode strings
print u"Also works with formats: {}".format(_u(text))
print u"Also works with formats: {},{}".format(*_uu(text, text))
# ... or use _u8 and _uu8, because string.format expects byte strings
print "Also works with formats: {}".format(_u8(text))
print "Also works with formats: {},{}".format(*_uu8(text, text))
Here's some more reasoning about this.
Got a same error and this solved my error. Thanks!
Python 2 and Python 3 handle unicode differently, which makes pickled files quite incompatible to load across versions. So use the encoding argument of pickle.load. The link below helped me solve a similar problem when I was trying to open pickled data in Python 3.7 that was originally saved with Python 2.x.
https://blog.modest-destiny.com/posts/python-2-and-3-compatible-pickle-save-and-load/
I copied the load_pickle function into my script and called load_pickle(pickle_file) when loading my input data, like this:
input_data = load_pickle("my_dataset.pkl")
The load_pickle function is here:
import pickle

def load_pickle(pickle_file):
    try:
        with open(pickle_file, 'rb') as f:
            pickle_data = pickle.load(f)
    except UnicodeDecodeError as e:
        # Python 2 pickle with non-ASCII byte strings: retry as latin1
        with open(pickle_file, 'rb') as f:
            pickle_data = pickle.load(f, encoding='latin1')
    except Exception as e:
        print('Unable to load data ', pickle_file, ':', e)
        raise
    return pickle_data
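To see why the fallback is needed, here is a small Python 3 sketch. The pickle bytes are written out by hand to imitate a protocol-0 pickle of the Python 2 byte string '\xe9'; they are an assumption standing in for a real Python 2 file.

```python
import pickle

# Hand-written protocol-0 pickle of a Python 2 str containing byte 0xE9.
py2_pickle = b"S'\\xe9'\n."

try:
    pickle.loads(py2_pickle)            # default encoding is ASCII
    failed = False
except UnicodeDecodeError:
    failed = True

as_text = pickle.loads(py2_pickle, encoding='latin1')   # decode to str
as_bytes = pickle.loads(py2_pickle, encoding='bytes')   # keep raw bytes
print(failed, ascii(as_text), as_bytes)
```

latin1 never fails because every byte maps to a code point, but it is only correct text if the data really was Latin-1; encoding='bytes' leaves that decision to you.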
Encode converts a unicode object into a string object. I think you are trying to encode a string object. First convert your result into a unicode object and then encode that unicode object into 'utf-8'.
for example
result = yourFunction()
result.decode().encode('utf-8')
This worked for me:
file = open('docs/my_messy_doc.pdf', 'rb')
I had the same error, with URLs containing non-ascii chars (bytes with values > 127), my solution:
url = url.decode('utf8').encode('utf-8')
Note: utf-8 and utf8 are simply aliases. Using either 'utf8' or 'utf-8' should work in the same way.
This worked for me in Python 2.7. I suppose this assignment changed 'something' in the str internal representation--i.e., it forces the right decoding of the backing byte sequence in url and finally puts the string into a utf-8 str with all the magic in the right place.
Unicode in Python is black magic for me.
Hope this is useful.
I had the same problem but it didn't work for Python 3. I followed this and it solved my problem:
enc = sys.getdefaultencoding()
file = open(menu, "r", encoding = enc)
You have to set the encoding when you are reading/writing the file.
I got the same problem with the string "Pastelería Mallorca" and I solved with:
unicode("Pastelería Mallorca", 'latin-1')
In short, to ensure proper unicode handling in Python 2:
use io.open for reading/writing files
use from __future__ import unicode_literals
configure other data inputs/outputs (e.g., databases, network) to use unicode
if you cannot configure outputs to utf-8, convert your output for them print(text.encode('ascii', 'replace').decode())
For explanations, see @Alastair McCormack's detailed answer.
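For the last point, the error handler you pass to encode() decides what happens to characters the target encoding cannot represent. A short sketch in Python 3 syntax, using a sample string from elsewhere in this thread:

```python
text = 'Pasteler\u00eda Mallorca'   # Pastelería Mallorca

replaced = text.encode('ascii', 'replace').decode()
ignored = text.encode('ascii', 'ignore').decode()
escaped = text.encode('ascii', 'backslashreplace').decode()

print(replaced)   # unencodable characters become '?'
print(ignored)    # unencodable characters are dropped
print(escaped)    # unencodable characters become escape sequences
```

'backslashreplace' is the only one of the three that keeps enough information to recover the original character later.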
In a Django (1.9.10)/Python 2.7.5 project I have frequent UnicodeDecodeError exceptions, mainly when I try to feed unicode strings to logging. I made a helper function that formats arbitrary objects as 8-bit ascii strings, replacing any characters not in the table with '?'. I think it's not the best solution, but since the default encoding is ascii (and I don't want to change it) it will do:
from collections import Iterable

def encode_for_logging(c, encoding='ascii'):
    if isinstance(c, basestring):
        return c.encode(encoding, 'replace')
    elif isinstance(c, Iterable):
        c_ = []
        for v in c:
            c_.append(encode_for_logging(v, encoding))
        return c_
    else:
        # pass the encoding along so the caller's choice is respected
        return encode_for_logging(unicode(c), encoding)
This error occurs when there are some non ASCII characters in our string and we are performing any operations on that string without proper decoding.
This helped me solve my problem.
I am reading a CSV file with columns ID,Text and decoding characters in it as below:
import pandas as pd

train_df = pd.read_csv("Example.csv")
train_data = train_df.values
for i in train_data:
    print("ID :" + i[0])
    text = i[1].decode("utf-8", errors="ignore").strip().lower()
    print("Text: " + text)
Here is my solution, just add the encoding.
with open(file, encoding='utf8') as f:
And because reading the glove file takes a long time, I recommend converting the glove file to a numpy file. The next time you read the embedding weights, it will save you time.
import numpy as np
from tqdm import tqdm
def load_glove(file):
    """Loads GloVe vectors in numpy array.
    Args:
        file (str): a path to a glove file.
    Return:
        dict: a dict of numpy arrays.
    """
    embeddings_index = {}
    with open(file, encoding='utf8') as f:
        for i, line in tqdm(enumerate(f)):
            values = line.split()
            word = ''.join(values[:-300])
            coefs = np.asarray(values[-300:], dtype='float32')
            embeddings_index[word] = coefs
    return embeddings_index

# EMBEDDING_PATH = '../embedding_weights/glove.840B.300d.txt'
EMBEDDING_PATH = 'glove.840B.300d.txt'
embeddings = load_glove(EMBEDDING_PATH)
np.save('glove_embeddings.npy', embeddings)
Gist link: https://gist.github.com/BrambleXu/634a844cdd3cd04bb2e3ba3c83aef227
Specify: # encoding= utf-8 at the top of your Python File, It should fix the issue
I experienced this error with Python2.7. It happened to me while trying to run many python programs, but I managed to reproduce it with this simple script:
#!/usr/bin/env python
import subprocess
import sys
result = subprocess.Popen([u'svn', u'info'])
if not callable(getattr(result, "__enter__", None)) and not callable(getattr(result, "__exit__", None)):
    print("foo")
print("bar")
On success, it should print out 'foo' and 'bar', and probably an error message if you're not in a svn folder.
On failure, it should print 'UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 39: ordinal not in range(128)'.
After trying to regenerate my locales and many other solutions posted in this question, I learned the error was happening because I had a special character (ĺ) encoded in my PATH environment variable. After fixing the PATH in '~/.bashrc', and exiting my session and entering again, (apparently sourcing '~/.bashrc' didn't work), the issue was gone.
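A quick way to look for this kind of problem is to scan the environment for values containing non-ASCII characters. This is only a sketch in Python 3 syntax, and non_ascii_vars is a hypothetical helper name:

```python
import os

def non_ascii_vars(env):
    """Return the names of variables whose value contains non-ASCII characters."""
    return sorted(
        name for name, value in env.items()
        if any(ord(ch) > 127 for ch in value)
    )

# Check the real environment, then a made-up one for comparison.
print(non_ascii_vars(os.environ))
print(non_ascii_vars({'PATH': '/usr/bin:/home/\u013a/bin', 'HOME': '/root'}))
```

Any name it reports is worth checking before you start regenerating locales.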

UnicodeDecodeError during encode?

We're running into a problem (which is described http://wiki.python.org/moin/UnicodeDecodeError) -- read the second paragraph '...Paradoxically...'.
Specifically, we're trying to up-convert a string to unicode and we are receiving a UnicodeDecodeError.
Example:
>>> unicode('\xab')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 0: ordinal not in range(128)
But of course, this works without any problems
>>> unicode(u'\xab')
u'\xab'
Of course, this code only demonstrates the conversion problem. In our actual code, we are not using string literals and we cannot just prepend the unicode 'u' prefix; instead we are dealing with strings returned from os.walk(), and the file name includes the above value. Since we cannot coerce the value to unicode without calling the unicode() constructor, we're not sure how to proceed.
One really horrible hack that occurs is to write our own str2uni() method, something like:
def str2uni(val):
    r"""brute force coercion of str -> unicode"""
    try:
        return unicode(val)
    except UnicodeDecodeError:
        pass
    res = u''
    for ch in val:
        res += unichr(ord(ch))
    return res
But before we do this -- wanted to see if anyone else had any insight?
UPDATED
I see everyone is getting focused on HOW I got to the example I posted, rather than the result. Sigh -- ok, here's the code that caused me to spend hours reducing the problem to the simplest form I shared above.
for _,_,files in os.walk('/path/to/folder'):
    for fname in files:
        filename = unicode(fname)
That piece of code tosses a UnicodeDecodeError exception when the filename has the following value '3\xab Floppy (A).link'
To see the error for yourself, do the following:
>>> unicode('3\xab Floppy (A).link')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 1: ordinal not in range(128)
UPDATED
I really appreciate everyone trying to help. And I also appreciate that most people make some pretty simple mistakes related to string/unicode handling. But I'd like to underline the reference to the UnicodeDecodeError exception. We are getting this when calling the unicode() constructor!!!
I believe the underlying cause is described in the aforementioned Wiki article http://wiki.python.org/moin/UnicodeDecodeError. Read from the second paragraph on down about how "Paradoxically, a UnicodeDecodeError may happen when encoding...". The Wiki article very accurately describes what we are experiencing -- but while it elaborates on the causes, it makes no suggestions for resolutions.
As a matter of fact, the third paragraph starts with the following astounding admission "Unlike a similar case with UnicodeEncodeError, such a failure cannot be always avoided...".
Since I am not used to "can't get there from here" information as a developer, I thought it would be interesting to cast about on Stack Overflow for the experiences of others.
I think you're confusing Unicode strings and Unicode encodings (like UTF-8).
os.walk(".") returns the filenames (and directory names etc.) as strings that are encoded in the current codepage. It will silently remove characters that are not present in your current codepage (see this question for a striking example).
Therefore, if your file/directory names contain characters outside of your encoding's range, then you definitely need to use a Unicode string to specify the starting directory, for example by calling os.walk(u"."). Then you don't need to (and shouldn't) call unicode() on the results any longer, because they already are Unicode strings.
If you don't do this, you first need to decode the filenames (as in mystring.decode("cp850")) which will give you a Unicode string:
>>> "\xab".decode("cp850")
u'\xbd'
Then you can encode that into UTF-8 or any other encoding.
>>> _.encode("utf-8")
'\xc2\xbd'
If you're still confused why unicode("\xab") throws a decoding error, maybe the following explanation helps:
"\xab" is an encoded string. Python has no way of knowing which encoding that is, but before you can convert it to Unicode, it needs to be decoded first. Without any specification from you, unicode() assumes that it is encoded in ASCII, and when it tries to decode it under this assumption, it fails because \xab isn't part of ASCII. So either you need to find out which encoding is being used by your filesystem and call unicode("\xab", encoding="cp850") or whatever, or start with Unicode strings in the first place.
for fname in files:
    filename = unicode(fname)
The second line will complain if fname is not ASCII. If you want to convert the string to Unicode, instead of unicode(fname) you should do fname.decode('<the encoding here>').
I would suggest the encoding, but you don't tell us what \xab is in your .link file name. You can search for the encoding online anyway, so it would look like this:
for fname in files:
    filename = fname.decode('<encoding>')
UPDATE: For example, IF the encoding of your filesystem's names is ISO-8859-1 then \xab char would be "«". To read it into python you should do:
for fname in files:
    filename = fname.decode('latin1')  # which is a synonym for ISO-8859-1
Hope this helps!
As I understand it your issue is that os.walk(unicode_path) fails to decode some filenames to Unicode. This problem is fixed in Python 3.1+ (see PEP 383: Non-decodable Bytes in System Character Interfaces):
File names, environment variables, and command line arguments are
defined as being character data in POSIX; the C APIs however allow
passing arbitrary bytes - whether these conform to a certain encoding
or not. This PEP proposes a means of dealing with such irregularities
by embedding the bytes in character strings in such a way that allows
recreation of the original byte string.
Windows provides Unicode API to access filesystem so there shouldn't be this problem.
Python 2.7 (utf-8 filesystem on Linux):
>>> import os
>>> list(os.walk("."))
[('.', [], ['\xc3('])]
>>> list(os.walk(u"."))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/os.py", line 284, in walk
if isdir(join(top, name)):
File "/usr/lib/python2.7/posixpath.py", line 71, in join
path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: \
ordinal not in range(128)
Python 3.3:
>>> import os
>>> list(os.walk(b'.'))
[(b'.', [], [b'\xc3('])]
>>> list(os.walk(u'.'))
[('.', [], ['\udcc3('])]
Your str2uni() function tries (it introduces ambiguous names) to solve the same issue as "surrogateescape" error handler on Python 3. Use bytestrings for filenames on Python 2 if you are expecting filenames that can't be decoded using sys.getfilesystemencoding().
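The surrogateescape round trip can be shown without touching the filesystem. A minimal Python 3 sketch using the same undecodable name as in the session above:

```python
raw = b'\xc3('   # not valid UTF-8: 0xC3 starts a sequence that never completes

# The bad byte is smuggled into the str as the lone surrogate U+DCC3.
name = raw.decode('utf-8', 'surrogateescape')

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode('utf-8', 'surrogateescape')

print(ascii(name), restored == raw)
```

Unlike the str2uni() fallback, this mapping is unambiguous, because lone surrogates cannot come from validly decoded text.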
'\xab'
Is a byte, number 171.
u'\xab'
Is a character, U+00AB Left-pointing double angle quotation mark («).
u'\xab' is a short-hand way of saying u'\u00ab'. It's not the same (not even the same datatype) as the byte '\xab'; it would probably have been clearer to always use the \u syntax in Unicode string literals IMO, but it's too late to fix that now.
To go from bytes to characters is known as a decode operation. To go from characters to bytes is known as an encode operation. For either direction, you need to know which encoding is used to map between the two.
>>> unicode('\xab')
UnicodeDecodeError
unicode is a character string, so there is an implicit decode operation when you pass bytes to the unicode() constructor. If you don't tell it which encoding you want you get the default encoding which is often ascii. ASCII doesn't have a meaning for byte 171 so you get an error.
>>> unicode(u'\xab')
u'\xab'
Since u'\xab' (or u'\u00ab') is already a character string, there is no implicit conversion in passing it to the unicode() constructor - you get an unchanged copy.
res = u''
for ch in val:
    res += unichr(ord(ch))
return res
The encoding that maps each input byte to the Unicode character with the same ordinal value is ISO-8859-1. Consequently you could replace this loop with just:
return unicode(val, 'iso-8859-1')
(However note that if Windows is in the mix, then the encoding you want is probably not that one but the somewhat-similar windows-1252.)
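A quick Python 3 sketch of both points: the per-character loop is equivalent to an ISO-8859-1 decode, and windows-1252 differs from it only in the 0x80-0x9F range. The sample bytes are made up for illustration.

```python
# Made-up sample: 0xE9 is the same in both encodings, 0x93/0x94 are not.
data = b'caf\xe9 \x93quoted\x94'

# The character-by-character loop gives the same result as an ISO-8859-1 decode.
by_loop = ''.join(chr(byte) for byte in data)
as_latin1 = data.decode('iso-8859-1')

# windows-1252 maps 0x93/0x94 to curly quotes instead of C1 control characters.
as_cp1252 = data.decode('windows-1252')

print(by_loop == as_latin1)
print(ascii(as_latin1))
print(ascii(as_cp1252))
```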
One really horrible hack that occurs is to write our own str2uni() method
This isn't generally a good idea. UnicodeErrors are Python telling you you've misunderstood something about string types; ignoring that error instead of fixing it at source means you're more likely to hide subtle failures that will bite you later.
filename = unicode(fname)
So this would be better replaced with: filename = unicode(fname, 'iso-8859-1') if you know your filesystem is using ISO-8859-1 filenames. If your system locales are set up correctly then it should be possible to find out the encoding your filesystem is using, and go straight to that:
filename = unicode(fname, sys.getfilesystemencoding())
Though actually if it is set up correctly, you can skip all the encode/decode fuss by asking Python to treat filesystem paths as native Unicode instead of byte strings. You do that by passing a Unicode character string into the os filename interfaces:
for _, _, files in os.walk(u'/path/to/folder'):  # note u'' string
    for fname in files:
        filename = fname  # nothing more to do!
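The same rule carries over to Python 3, where the type of the path you pass in decides the type of the names you get back. A minimal sketch using a throwaway temporary directory (the file name example.txt is made up):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as top:
    # Create one file so there is something to list.
    with open(os.path.join(top, 'example.txt'), 'w') as handle:
        handle.write('x')

    # str path in -> str names out
    text_names = [name for _, _, files in os.walk(top) for name in files]

    # bytes path in -> bytes names out
    byte_names = [name
                  for _, _, files in os.walk(os.fsencode(top))
                  for name in files]

print(text_names, byte_names)
```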
PS. The character in 3″ Floppy should really be U+2033 Double Prime, but there is no encoding for that in ISO-8859-1. Better in the long term to use UTF-8 filesystem encoding so you can include any character.
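To check that claim about U+2033, a short Python 3 sketch:

```python
double_prime = '\u2033'  # DOUBLE PRIME

# ISO-8859-1 only covers U+0000..U+00FF, so this character does not fit.
try:
    double_prime.encode('iso-8859-1')
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

# UTF-8 can represent it (as three bytes).
utf8_bytes = double_prime.encode('utf-8')

print(fits_latin1, utf8_bytes)
```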

Python: Ascii characters from file display wrong

Here's my code:
import sys, os
print("█████") #<-- Those are solid blocks.
f = open('file.txt')
for line in f:
    print(line)
In file.txt is this:
hay hay, guys
████████████
But the output is this:
██████
hay hay, guys <----- ***Looks like it output this correctly!***
Traceback (most recent call last):
File "echofile.py", line 6, in <module>
print(line)
File "C:\python33\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1-2: character maps to <undefined> <------ ***But not from the file!***
Anybody have any suggestions as to why it is doing this? I wrote the code in IDLE, tried editing the file.txt in both Programmer's Notepad and IDLE. The file is ASCII / ANSI.
I'm using Python 3, by the way. 3.3 alpha win-64 if it matters.
This is clearly an issue with character encodings.
In Python 3.x, all strings are Unicode. But when reading or writing a file, it will be necessary to translate the Unicode to some specific encoding.
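Here is a small sketch of that translation step, assuming UTF-8 as the file encoding. The file is written to a temporary directory so nothing on your disk is touched; the contents mimic the file from the question.

```python
import os
import tempfile

text = 'hay hay, guys\n\u2588\u2588\u2588\n'  # three FULL BLOCK characters

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'file.txt')

    # Writing: str -> bytes, using the encoding named here.
    with open(path, 'w', encoding='utf-8', newline='') as f:
        f.write(text)

    # What actually sits on disk is bytes, not characters.
    with open(path, 'rb') as f:
        raw = f.read()

    # Reading: bytes -> str, which only works if the same encoding is named.
    with open(path, encoding='utf-8', newline='') as f:
        round_trip = f.read()

print(len(text), len(raw), round_trip == text)
```

Each block character takes three bytes in UTF-8, which is why the byte count is larger than the character count.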
By default, a Python source file is handled as UTF-8. I don't know exactly what characters you pasted into your source file for the blocks, but whatever it is, Python reads it as UTF-8 and it seems to work. Maybe your text editor converted to valid UTF-8 when you inserted those?
The backtrace suggests that Python is treating the input file as "Code Page 437" or the original IBM PC 8-bit character set. Is that correct?
This link shows how to set a specific decoder to handle a particular file encoding on input:
http://lucumr.pocoo.org/2010/2/11/porting-to-python-3-a-guide/
EDIT: I found a better resource:
http://docs.python.org/release/3.0.1/howto/unicode.html
And based on that, here's some sample code:
with open('mytextfile.txt', encoding='utf-8') as f:
    for line in f:
        print(line, end='')
Originally I had the above set to "cp437" but in a comment you said "utf-8" was correct, so I made that change to this example. I'm specifying end='' here because the input lines from the file already have a newline on the end, so we don't need print() to supply another newline.
EDIT: I found a short discussion of default encodings here:
http://docs.python.org/release/3.0.1/whatsnew/3.0.html
The important bit: "There is a platform-dependent default encoding, which on Unixy platforms can be set with the LANG environment variable (and sometimes also with some other platform-specific locale-related environment variables). In many cases, but not all, the system default is UTF-8; you should never count on this default."
So, I had thought that Python defaulted to UTF-8, but not always, it seems. Actually, from your stack backtrace, I think on your system with your LANG environment setting you are getting "cp437" as your default.
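One plausible reading of the traceback, offered as an assumption rather than a diagnosis: the file holds UTF-8 bytes, it gets decoded with a Windows default such as cp1252, and the resulting garbage cannot be encoded for a cp437 console. The sketch below reproduces that chain with explicit codecs, so it behaves the same on any machine. The cp1252 step is my guess; only cp437 appears in the traceback.

```python
import locale

# The encoding open() would fall back to on this machine (varies by system).
default_file_encoding = locale.getpreferredencoding(False)

block = '\u2588'                     # FULL BLOCK, the character in file.txt
utf8_bytes = block.encode('utf-8')   # what a UTF-8 file stores for it

# Assumption: the UTF-8 bytes are decoded with cp1252 instead of UTF-8.
misread = utf8_bytes.decode('cp1252')

# Printing to a cp437 console means encoding the misread text to cp437.
try:
    misread.encode('cp437')
    failure = None
except UnicodeEncodeError as exc:
    failure = exc

# The block character itself is perfectly encodable in cp437.
block_in_cp437 = block.encode('cp437')

print(default_file_encoding)
print(ascii(misread))
print(failure)
print(block_in_cp437)
```

If this is what happened, it would also explain why the blocks typed into the source file printed fine while the ones read from the file did not.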
So, I learned something too by answering your question!
P.S. I changed the code example above to specify utf-8 since that is what you needed.
Try making that string unicode:
print(u"█████")
      ^ Add this

python unichr problem

I've got some problem with unichr() on my server. Please see below:
On my server (Ubuntu 9.04):
>>> print unichr(255)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in position 0: ordinal not in range(128)
On my desktop (Ubuntu 9.10):
>>> print unichr(255)
ÿ
I'm fairly new to python so I don't know how to solve this. Anyone care to help? Thanks.
When you use the "print" statement, you write to the sys.stdout output stream. In Python 2, sys.stdout can only display a Unicode string if its characters can be encoded in the stream's encoding; when that encoding is ASCII or unknown, only ASCII characters get through.
You'll need to encode to your OS's terminal encoding when printing to be able to do this.
The locale module can sometimes detect the encoding of the output console:
import locale
print unichr(0xff).encode(locale.getdefaultlocale()[1], 'replace')
but it's usually better to just specify the encoding yourself, as python often gets it wrong:
print unichr(0xff).encode('latin-1', 'replace')
I think UTF-8 or latin-1 is what most modern Linux distros use.
If you know the encoding of your console, the lines below will encode Unicode strings automatically when you use "print":
import sys
import codecs
sys.stdout = codecs.getwriter(ENCODING)(sys.stdout)
If the encoding is ascii or something similar, you may need to change the console encoding of your OS to be able to display that character.
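To see what such a writer does without touching the real sys.stdout, here is a Python 3 sketch that wraps an in-memory byte stream instead. ENCODING is set to 'utf-8' purely as an assumption; use whatever your console expects.

```python
import codecs
import io

ENCODING = 'utf-8'  # assumption: stand-in for the console's real encoding

# An in-memory byte stream stands in for the byte stream behind sys.stdout.
raw_stream = io.BytesIO()
writer = codecs.getwriter(ENCODING)(raw_stream)

# Text goes in, encoded bytes come out the other side.
writer.write('\xff\n')
written = raw_stream.getvalue()

print(written)
```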
See also: http://wiki.python.org/moin/PrintFails
The terminal settings on your server are different, probably set to 7-bit US ASCII.
It's not really unichr() related. The problem is the locale setting in your server environment: it's probably set to something like en_US, which is not Unicode-aware.
Consider using an explicit encoding when printing unicode strings where OS settings are not uniform.
unicode.encode([encoding[, errors]])
Return an encoded version of the string. Default encoding is the current default string encoding. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' and any other name registered via codecs.register_error(), see section Codec Base Classes. For a list of possible encodings, see section Standard Encodings.
For example,
>>> print unichr(0xff).encode('iso8859-1')
ÿ
>>>
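To compare the error handlers listed in the documentation excerpt above, here is a Python 3 sketch that encodes the same character to ASCII with each of them (Python 3's str.encode takes the same errors argument):

```python
char = '\xff'  # the character from unichr(255)

# Encode to ASCII, which cannot represent it, with each non-strict handler.
results = {
    handler: char.encode('ascii', errors=handler)
    for handler in ('ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace')
}

# The default handler, 'strict', raises instead of substituting.
try:
    char.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# An encoding that does contain the character needs no handler at all.
latin1 = char.encode('iso8859-1')

for handler, encoded in results.items():
    print(handler, encoded)
print(strict_failed, latin1)
```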
