python utf-8 encoding throws UnicodeDecodeError despite "errors = 'replace' " - python

I'm trying to write out some text and encode it as utf-8 where possible, using the following code:
outf.write((lang_name + "," + (script_name or "") + "\n").encode("utf-8", errors='replace'))
I'm getting the following error:
File "C:\Python27\lib\encodings\cp1252.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 6: character maps to <undefined>
I thought the errors='replace' part of my encode call would handle that?
fwiw, I'm just opening the file with
outf = open(outfile, 'w')
without explicitly declaring the encoding.
print repr(outf)
produces:
<open file 'myfile.csv', mode 'w' at 0x000000000315E930>
I separated out the write statement into a separate concatenation, encoding, and file write:
outstr = lang_name + "," + (script_name or "") + "\n"
encoded_outstr = outstr.encode("utf-8", errors='replace')
outf.write(encoded_outstr)
It is the concatenation that throws the exception.
The string are, via print repr(foo)
lang_name: 'G\xc4\x81ndh\xc4\x81r\xc4\xab'
script_name: u'Kharo\u1e63\u1e6dh\u012b'
Further detective work reveals that I can concatenate either one of those with a plain ascii string without any difficulty - it's putting them both into the same string that is breaking things.

So, the problem is that you are concatenating the bytestring 'G\xc4\x81ndh\xc4\x81r\xc4\xab' and the Unicode string u'Kharo\u1e63\u1e6dh\u012b'.
To be able to do that, Python 2.7 tries to decode the bytestring using its default encoding, to turn it into Unicode. Your default encoding is cp1252 instead of ASCII, for reasons I can't know from here, but anyway it fails just like it would had it been ASCII because that string is UTF8.
Your best solution is probably to make sure that this doesn't happen, by changing the way the variables get those values in the first place.
If you can't, since you are encoding to UTF8 on the next line anyway, it's probably easiest to only encode script_name:
encoded_outstr = lang_name + b"," + (script_name or u"").encode('utf-8') + b"\n"
Note that I used b"," to explicitly make those string literals bytestrings and not Unicode strings; if you are using from __future__ import unicode_literals for Python 3 compatibility, then they are Unicode by default and the problem would just occur again.

When you concatenate a byte string and a Unicode string, Python 2 attempts to convert the byte string to Unicode first. If the byte string contains any non-ASCII characters in the range of \x80 to \xff, the automatic conversion will fail with the error you show. Notice that it says can't decode, not can't encode - this shows that the error did not occur in your call to encode.
The solution is to decode the byte string into Unicode yourself, using the proper code page, so that all the inputs to the concatenation are Unicode strings.
outstr = lang_name.decode("utf-8") + u"," + (script_name or u"") + u"\n"
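For readers on Python 3, where the two types are called bytes and str and mixing them raises a TypeError rather than triggering an implicit decode, the same fix can be sketched roughly as below. The sample values are the ones from the question; the explicit decode("ascii") is only there to mimic what Python 2 attempts behind the scenes when its default encoding is ASCII.

```python
# Python 3 sketch: bytes plays the role of Python 2's str, str the role of unicode.
lang_name = b'G\xc4\x81ndh\xc4\x81r\xc4\xab'    # UTF-8 encoded bytes
script_name = 'Kharo\u1e63\u1e6dh\u012b'        # text (Unicode) string

# Mimic the implicit conversion: decoding UTF-8 bytes as ASCII fails.
try:
    lang_name.decode("ascii")
    implicit_decode_failed = False
except UnicodeDecodeError:
    implicit_decode_failed = True

# Decode explicitly, concatenate text with text, and encode once at the end.
outstr = lang_name.decode("utf-8") + "," + (script_name or "") + "\n"
encoded_outstr = outstr.encode("utf-8")
```

The point is that the only encode call happens after every input is already text.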

Related

Python unable to decode byte string

I am having problem with decoding byte string that I have to send from one computer to another. File is format PDF. I get error that goes:
fileStrings[i] = fileStrings[i].decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 648: invalid continuation byte
Any ideas of how to remove b' ' marking? I need to compile file back up, but i also need to know its size in bytes before sending it and I figured I will know it by decoding each byte string (Works for txt files but not for pdf ones..)
Code is:
with open(inputne, "rb") as file:
    while 1:
        readBytes= file.read(dataMaxSize)
        fileStrings.append(readBytes)
        if not readBytes:
            break
        readBytes= ''
filesize=0
for i in range(0, len(fileStrings)):
    fileStrings[i] = fileStrings[i].decode()
    filesize += len(fileStrings[i])
Edit: For anyone having the same issue, len() will give you the size without the b''.
In Python, bytestrings are for raw binary data, and strings are for textual data. decode tries to decode it as utf-8, which is valid for txt files, but not for pdf files, since they can contain random bytes. You should not try to get a string, since bytestrings are designed for this purpose. You can get the length of bytestrings like normal, with len(data). Many of the string operations also apply to bytestrings, such as concatenation and slicing (data1 + data2 and data[1:3]).
As a side note, the b'' when you print it is just because the __str__ method for bytestrings is equivalent to repr. It's not in the data itself.
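A minimal Python 3 sketch of that advice, assuming the data is read in fixed-size chunks as in the question. The helper name read_chunks and the sample bytes are made up for illustration; an in-memory io.BytesIO stands in for the PDF file.

```python
import io

def read_chunks(stream, chunk_size):
    """Read a binary stream into a list of bytes chunks without decoding."""
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:          # empty bytes means end of stream
            break
        chunks.append(chunk)
    return chunks

# Hypothetical binary payload containing bytes that are not valid UTF-8.
data = b"%PDF-1.4\n\xda\xff\x00binary"
chunks = read_chunks(io.BytesIO(data), 8)

# len() on bytes counts bytes, so no decode step is needed for the size.
total_size = sum(len(chunk) for chunk in chunks)
reassembled = b"".join(chunks)
```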

encode hash in utf-8

I want to substitute a substring with a hash. The substring contains non-ASCII characters, so I tried to encode it to UTF-8.
result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)', lambda m: m.group(1) + m.group(2) + m.group(3) + hashlib.sha512(m.group(4).encode()).hexdigest(), line.encode('utf-8'))
I am not really sure why this doesn't work; I thought that with line.encode('utf-8') the whole string would get encoded.
I also tried to encode my m.groups to UTF-8, but I got the same UnicodeDecodeError.
[unicodedecodeerror: 'ascii' codec can't decode byte in position
ordinal not in range(128)]
Sample input:
Start: myUsername: myÜsername:
What am I missing ?
EDIT_
Traceback (most recent call last):
File "C:/Users/Peter/Desktop/coding/filter.py", line 26, in <module>
encodeline = line.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 112: ordinal not in range(128)
Based on your symptoms, you're running on Python 2. Calling encode on a Python 2 str is almost always nonsensical.
You have two problems; one you're hitting now, and one you'll hit if you fix your current code.
Your first problem is that line is already a str of (apparently) UTF-8 encoded bytes, not unicode, so encoding it implicitly decodes with Python's default encoding (ASCII; this isn't locale specific to my knowledge, and it's a rare Python 2 install that uses anything else), then re-encodes with the specified codec (or the default if not specified). Basically, line was already UTF-8 encoded and you told it to encode again as UTF-8; that's nonsensical, so Python tried to decode as ASCII first, and failed before it even tried to encode as you instructed.
The solution to this problem is to just not encode line at all; it's already UTF-8 encoded, so you're already golden.
Your second problem (which you haven't encountered yet, but you will) is that you're calling encode on the group(4) result. But of course, since the input was a str, the group is a str too, and you'll encounter the same problem trying to encode a str; since the group came from raw UTF-8 encoded bytes, the non-ASCII parts of it cause a UnicodeDecodeError during the implicit decode step before the encode.
The reason:
import sys
reload(sys)
sys.setdefaultencoding('UTF8')
works is that it (dangerously) changes the implicit decode step to use UTF-8, so all your encode calls now perform the implicit decode with UTF-8 instead of ASCII; the decode and encode is mostly pointless, since all it does is return the original str after confirming it's legal UTF-8 by decoding it as such, otherwise acting as an expensive no-op.
To fix the second problem, just change:
m.group(4).encode()
to:
m.group(4)
That leaves your final code as:
result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)',
lambda m: m.group(1) + m.group(2) + m.group(3) + hashlib.sha512(m.group(4)).hexdigest(),
line)
Optionally, if you want to confirm your expectation that line is in fact UTF-8 encoded bytes already, add the following above that re.sub line:
try:
    line.decode('utf-8')
except Exception as e:
    sys.exit("line (of type {!r}) not decodable as UTF-8: {}".format(line.__class__.__name__, e))
which will cause the program to exit immediately if the data given is not legal UTF-8 (and will also let you know what type line is, so you can confirm for sure if it's really str or unicode, since str implies you chose the wrong codec, while unicode means your inputs aren't of the expected type).
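For comparison, here is a rough Python 3 sketch of the same idea, where hashlib only accepts bytes and the regex therefore has to be a bytes pattern. The sample line is the one from the question; the name hash_last_field is hypothetical.

```python
import hashlib
import re

# Bytes pattern, because the line is kept as UTF-8 encoded bytes throughout.
PATTERN = re.compile(rb'(Start:\s*)([^:]+)(:\s*)([^:]+)')

def hash_last_field(match):
    # group(4) is already bytes, so it can be hashed without any encode call.
    digest = hashlib.sha512(match.group(4)).hexdigest().encode("ascii")
    return match.group(1) + match.group(2) + match.group(3) + digest

line = "Start: myUsername: my\u00dcsername:".encode("utf-8")
result = PATTERN.sub(hash_last_field, line)
```

The data is never decoded or re-encoded on the way to the hash, which is the property the answer above is after.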
I found .. in my eyes a workaround.
Doesn't feel right though, but it does the job.
import sys
reload(sys)
sys.setdefaultencoding('UTF8')
I thought it could be done with .encode('utf-8')
file = 'xyz'
res = hashlib.sha224(str(file).encode('utf-8')).hexdigest()
Because a unicode object must be encoded to a string before hashing.

How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte"

as3:~/ngokevin-site# nano content/blog/20140114_test-chinese.mkd
as3:~/ngokevin-site# wok
Traceback (most recent call last):
File "/usr/local/bin/wok", line 4, in <module>
Engine()
File "/usr/local/lib/python2.7/site-packages/wok/engine.py", line 104, in __init__
self.load_pages()
File "/usr/local/lib/python2.7/site-packages/wok/engine.py", line 238, in load_pages
p = Page.from_file(os.path.join(root, f), self.options, self, renderer)
File "/usr/local/lib/python2.7/site-packages/wok/page.py", line 111, in from_file
page.meta['content'] = page.renderer.render(page.original)
File "/usr/local/lib/python2.7/site-packages/wok/renderers.py", line 46, in render
return markdown(plain, Markdown.plugins)
File "/usr/local/lib/python2.7/site-packages/markdown/__init__.py", line 419, in markdown
return md.convert(text)
File "/usr/local/lib/python2.7/site-packages/markdown/__init__.py", line 281, in convert
source = unicode(source)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 1: ordinal not in range(128). -- Note: Markdown only accepts unicode input!
How to fix it?
In some other python-based static blog apps, Chinese post can be published successfully.
Such as this app: http://github.com/vrypan/bucket3. In my site http://bc3.brite.biz/, Chinese post can be published successfully.
tl;dr / quick fix
Don't decode/encode willy nilly
Don't assume your strings are UTF-8 encoded
Try to convert strings to Unicode strings as soon as possible in your code
Fix your locale: How to solve UnicodeDecodeError in Python 3.6?
Don't be tempted to use quick reload hacks
Unicode Zen in Python 2.x - The Long Version
Without seeing the source it's difficult to know the root cause, so I'll have to speak generally.
UnicodeDecodeError: 'ascii' codec can't decode byte generally happens when you try to convert a Python 2.x str that contains non-ASCII to a Unicode string without specifying the encoding of the original string.
In brief, Unicode strings are an entirely separate type of Python string that does not contain any encoding. They only hold Unicode code points and can therefore hold any code point from across the entire spectrum. Strings contain encoded text, be it UTF-8, UTF-16, ISO-8859-1, GBK, Big5 etc. Strings are decoded to Unicode and Unicodes are encoded to strings. Files and text data are always transferred in encoded strings.
The Markdown module authors probably use unicode() (where the exception is thrown) as a quality gate to the rest of the code - it will convert ASCII or re-wrap existing Unicode strings into a new Unicode string. The Markdown authors can't know the encoding of the incoming string, so they rely on you to decode strings to Unicode strings before passing them to Markdown.
Unicode strings can be declared in your code using the u prefix to strings. E.g.
>>> my_u = u'my ünicôdé strįng'
>>> type(my_u)
<type 'unicode'>
Unicode strings may also come from file, databases and network modules. When this happens, you don't need to worry about the encoding.
Gotchas
Conversion from str to Unicode can happen even when you don't explicitly call unicode().
The following scenarios cause UnicodeDecodeError exceptions:
# Explicit conversion without encoding
unicode('€')
# New style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: {}".format('€')
# Old style format string into Unicode string
# Python will try to convert value string to Unicode first
u'The currency is: %s' % '€'
# Append string to Unicode
# Python will try to convert string to Unicode first
u'The currency is: ' + '€'
Examples
In the following diagram, you can see how the word café has been encoded in either "UTF-8" or "Cp1252" encoding depending on the terminal type. In both examples, caf is just regular ASCII. In UTF-8, é is encoded using two bytes. In "Cp1252", é is 0xE9 (which also happens to be its Unicode code point value; this is no coincidence). The correct decode() is invoked and conversion to a Python Unicode is successful:
In this diagram, decode() is called with ascii (which is the same as calling unicode() without an encoding given). As ASCII can't contain bytes greater than 0x7F, this will throw a UnicodeDecodeError exception:
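The byte values involved can be checked with a short Python 3 sketch (in Python 3 the text type is str and the encoded type is bytes, but the codecs behave the same way):

```python
word = "caf\u00e9"                        # the text "café"

utf8_bytes = word.encode("utf-8")         # é becomes two bytes: 0xC3 0xA9
cp1252_bytes = word.encode("cp1252")      # é becomes one byte: 0xE9

# Decoding with the matching codec recovers the original text.
from_utf8 = utf8_bytes.decode("utf-8")
from_cp1252 = cp1252_bytes.decode("cp1252")

# Decoding as ASCII fails, because ASCII has no bytes above 0x7F.
try:
    utf8_bytes.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc
```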
The Unicode Sandwich
It's good practice to form a Unicode sandwich in your code, where you decode all incoming data to Unicode strings, work with Unicodes, then encode to strs on the way out. This saves you from worrying about the encoding of strings in the middle of your code.
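As a minimal sketch of the sandwich shape (written for Python 3, with a made-up function name and UTF-8 assumed as the encoding at both edges):

```python
def shout_utf8(raw_bytes, encoding="utf-8"):
    """Toy example of a Unicode sandwich around a single text operation."""
    text = raw_bytes.decode(encoding)    # bottom slice: decode on the way in
    text = text.upper()                  # the meat: work only with text
    return text.encode(encoding)         # top slice: encode on the way out

incoming = "z\u00fcrich".encode("utf-8")
outgoing = shout_utf8(incoming)
```

Nothing between the decode and the encode needs to know which encoding the data arrived in.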
Input / Decode
Source code
If you need to bake non-ASCII into your source code, just create Unicode strings by prefixing the string with a u. E.g.
u'Zürich'
To allow Python to decode your source code, you will need to add an encoding header to match the actual encoding of your file. For example, if your file was encoded as 'UTF-8', you would use:
# encoding: utf-8
This is only necessary when you have non-ASCII in your source code.
Files
Usually non-ASCII data is received from a file. The io module provides a TextIOWrapper that decodes your file on the fly, using a given encoding. You must use the correct encoding for the file - it can't be easily guessed. For example, for a UTF-8 file:
import io
with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
    my_unicode_string = my_file.read()
my_unicode_string would then be suitable for passing to Markdown. If you get a UnicodeDecodeError from the read() line, then you've probably used the wrong encoding value.
CSV Files
The Python 2.7 CSV module does not support non-ASCII characters 😩. Help is at hand, however, with https://pypi.python.org/pypi/backports.csv.
Use it like above but pass the opened file to it:
from backports import csv
import io
with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
    for row in csv.reader(my_file):
        yield row
Databases
Most Python database drivers can return data in Unicode, but usually require a little configuration. Always use Unicode strings for SQL queries.
MySQL
In the connection string add:
charset='utf8',
use_unicode=True
E.g.
>>> db = MySQLdb.connect(host="localhost", user='root', passwd='passwd', db='sandbox', use_unicode=True, charset="utf8")
PostgreSQL
Add:
psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
HTTP
Web pages can be encoded in just about any encoding. The Content-type header should contain a charset field to hint at the encoding. The content can then be decoded manually against this value. Alternatively, Python-Requests returns Unicodes in response.text.
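One way to do the manual step, sketched in Python 3 with only the standard library: let email.message.Message parse the header value and read the charset parameter from it. The function name, the sample header and the UTF-8 fallback are assumptions for illustration, not something the HTTP specification mandates.

```python
from email.message import Message

def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset named in a Content-Type header value, or a fallback."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default

body = b"caf\xe9"                                    # hypothetical response body
charset = charset_from_content_type("text/html; charset=ISO-8859-1")
text = body.decode(charset)
```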
Manually
If you must decode strings manually, you can simply do my_string.decode(encoding), where encoding is the appropriate encoding. Python 2.x supported codecs are given here: Standard Encodings. Again, if you get UnicodeDecodeError then you've probably got the wrong encoding.
The meat of the sandwich
Work with Unicodes as you would normal strs.
Output
stdout / printing
print writes through the stdout stream. Python tries to configure an encoder on stdout so that Unicodes are encoded to the console's encoding. For example, if a Linux shell's locale is en_GB.UTF-8, the output will be encoded to UTF-8. On Windows, you will be limited to an 8bit code page.
An incorrectly configured console, such as one with a corrupt locale, can lead to unexpected print errors. The PYTHONIOENCODING environment variable can force the encoding for stdout.
Files
Just like input, io.open can be used to transparently convert Unicodes to encoded byte strings.
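A small round-trip sketch, assuming UTF-8 output and a temporary file whose name is made up for the example. io.open with these arguments behaves the same on Python 2.7 and Python 3; the text literal uses escapes that only Python 3 (or a u prefix) treats as Unicode.

```python
import io
import os
import tempfile

text = u"Z\u00fcrich caf\u00e9"
path = os.path.join(tempfile.mkdtemp(), "out.txt")   # hypothetical output file

# Writing: the file object encodes the text to UTF-8 bytes for us.
with io.open(path, "w", encoding="utf-8") as out_file:
    out_file.write(text)

# Reading in binary mode shows the encoded bytes that landed on disk.
with io.open(path, "rb") as raw_file:
    raw = raw_file.read()

# Reading in text mode with the same encoding gives the text back.
with io.open(path, "r", encoding="utf-8") as in_file:
    round_tripped = in_file.read()
```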
Database
The same configuration for reading will allow Unicodes to be written directly.
Python 3
Python 3 is no more Unicode capable than Python 2.x is, however it is slightly less confused on the topic. E.g the regular str is now a Unicode string and the old str is now bytes.
The default encoding is UTF-8, so if you .decode() a byte string without giving an encoding, Python 3 uses UTF-8 encoding. This probably fixes 50% of people's Unicode problems.
Further, open() operates in text mode by default, so returns decoded str (Unicode ones). The encoding is derived from your locale, which tends to be UTF-8 on Un*x systems or an 8-bit code page, such as windows-1251, on Windows boxes.
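Both defaults can be inspected directly. This is a Python 3 sketch; the value of the locale-derived encoding depends on the machine, so no particular result is assumed for it.

```python
import locale

raw = "Z\u00fcrich".encode("utf-8")

# bytes.decode() with no argument uses UTF-8.
decoded_default = raw.decode()
decoded_explicit = raw.decode("utf-8")

# The locale's preferred encoding, which open() falls back to in text mode
# when no encoding argument is given.
default_file_encoding = locale.getpreferredencoding(False)
```

Passing encoding= explicitly to open() avoids depending on that machine-specific value.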
Why you shouldn't use sys.setdefaultencoding('utf8')
It's a nasty hack (there's a reason you have to use reload) that will only mask problems and hinder your migration to Python 3.x. Understand the problem, fix the root cause and enjoy Unicode zen.
See Why should we NOT use sys.setdefaultencoding("utf-8") in a py script? for further details
Finally I got it:
as3:/usr/local/lib/python2.7/site-packages# cat sitecustomize.py
# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')
Let me check:
as3:~/ngokevin-site# python
Python 2.7.6 (default, Dec 6 2013, 14:49:02)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.getdefaultencoding()
'utf8'
>>>
The above shows the default encoding of python is utf8. Then the error is no more.
This is the classic "unicode issue". I believe that completely explaining what is happening is beyond the scope of a StackOverflow answer.
It is well explained here.
In very brief summary, you have passed something that is being interpreted as a string of bytes to something that needs to decode it into Unicode characters, but the default codec (ascii) is failing.
The presentation I pointed you to provides advice for avoiding this. Make your code a "unicode sandwich". In Python 2, the use of from __future__ import unicode_literals helps.
Update: how can the code be fixed:
OK - in your variable "source" you have some bytes. It is not clear from your question how they got in there - maybe you read them from a web form? In any case, they are not encoded with ascii, but python is trying to convert them to unicode assuming that they are. You need to explicitly tell it what the encoding is. This means that you need to know what the encoding is! That is not always easy, and it depends entirely on where this string came from. You could experiment with some common encodings - for example UTF-8. You tell unicode() the encoding as a second parameter:
source = unicode(source, 'utf-8')
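If you really have to experiment, a sketch like the following (Python 3, where str(source, 'utf-8') or source.decode('utf-8') replaces unicode(source, 'utf-8')) tries a list of candidate encodings in order. The function name and the candidate list are assumptions; note that a permissive codec such as latin-1 never fails, so a successful decode does not prove the guess was right.

```python
def decode_with_candidates(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")

chinese_text, chinese_encoding = decode_with_candidates(b"\xe8\xaf\x95")
western_text, western_encoding = decode_with_candidates(b"caf\xe9")
```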
In some cases, when you check your default encoding (print sys.getdefaultencoding()), it reports ASCII. Changing it to UTF-8 may still not work, depending on the content of your variable.
I found another way:
import sys
reload(sys)
sys.setdefaultencoding('Cp1252')
I was searching to solve the following error message:
unicodedecodeerror: 'ascii' codec can't decode byte 0xe2 in position 5454: ordinal not in range(128)
I finally got it fixed by specifying 'encoding':
f = open('../glove/glove.6B.100d.txt', encoding="utf-8")
Wish it could help you too.
"UnicodeDecodeError: 'ascii' codec can't decode byte"
Cause of this error: input_string must be unicode but str was given
"TypeError: Decoding Unicode is not supported"
Cause of this error: trying to convert unicode input_string into unicode
So first check that your input_string is str and convert to unicode if necessary:
if isinstance(input_string, str):
    input_string = unicode(input_string, 'utf-8')
Secondly, the above just changes the type but does not remove non ascii characters. If you want to remove non-ascii characters:
if isinstance(input_string, str):
    input_string = input_string.decode('ascii', 'ignore').encode('ascii') #note: this removes the character and encodes back to string.
elif isinstance(input_string, unicode):
    input_string = input_string.encode('ascii', 'ignore')
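A rough Python 3 counterpart, where the two input types are bytes and str. The function name is made up, and returning text in both cases is a design choice of this sketch rather than what the Python 2 snippet does.

```python
def strip_non_ascii(value):
    """Drop every non-ASCII character (or byte) and return ASCII-only text."""
    if isinstance(value, bytes):
        # Undecodable bytes are skipped by the 'ignore' error handler.
        return value.decode("ascii", "ignore")
    # Unencodable characters are skipped, then the result is turned back to text.
    return value.encode("ascii", "ignore").decode("ascii")

from_text = strip_non_ascii("Z\u00fcrich")
from_bytes = strip_non_ascii("Z\u00fcrich".encode("utf-8"))
```

Keep in mind this silently loses data; it is only appropriate when the non-ASCII characters genuinely do not matter.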
In order to resolve this on an operating system level in an Ubuntu installation check the following:
$ locale charmap
If you get
locale: Cannot set LC_CTYPE to default locale: No such file or directory
instead of
UTF-8
then set LC_CTYPE and LC_ALL like this:
$ export LC_ALL="en_US.UTF-8"
$ export LC_CTYPE="en_US.UTF-8"
I find the best is to always convert to unicode - but this is difficult to achieve because in practice you'd have to check and convert every argument to every function and method you ever write that includes some form of string processing.
So I came up with the following approach to either guarantee unicodes or byte strings, from either input. In short, include and use the following lambdas:
# guarantee unicode string
_u = lambda t: t.decode('UTF-8', 'replace') if isinstance(t, str) else t
_uu = lambda *tt: tuple(_u(t) for t in tt)
# guarantee byte string in UTF8 encoding
_u8 = lambda t: t.encode('UTF-8', 'replace') if isinstance(t, unicode) else t
_uu8 = lambda *tt: tuple(_u8(t) for t in tt)
Examples:
text='Some string with codes > 127, like Zürich'
utext=u'Some string with codes > 127, like Zürich'
print "==> with _u, _uu"
print _u(text), type(_u(text))
print _u(utext), type(_u(utext))
print _uu(text, utext), type(_uu(text, utext))
print "==> with u8, uu8"
print _u8(text), type(_u8(text))
print _u8(utext), type(_u8(utext))
print _uu8(text, utext), type(_uu8(text, utext))
# with % formatting, always use _u() and _uu()
print "Some unknown input %s" % _u(text)
print "Multiple inputs %s, %s" % _uu(text, text)
# but with string.format be sure to always work with unicode strings
print u"Also works with formats: {}".format(_u(text))
print u"Also works with formats: {},{}".format(*_uu(text, text))
# ... or use _u8 and _uu8, because string.format expects byte strings
print "Also works with formats: {}".format(_u8(text))
print "Also works with formats: {},{}".format(*_uu8(text, text))
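For code that has already moved to Python 3, the same pair of guards might look roughly like this. The names ensure_text and ensure_bytes are hypothetical, and UTF-8 with the 'replace' error handler is assumed, as in the lambdas above.

```python
def ensure_text(value, encoding="utf-8"):
    """Return value as str, decoding bytes and replacing undecodable bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding, "replace")
    return value

def ensure_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding str and replacing unencodable characters."""
    if isinstance(value, str):
        return value.encode(encoding, "replace")
    return value

as_text = ensure_text(b"Z\xc3\xbcrich")
as_bytes = ensure_bytes("Z\u00fcrich")
damaged = ensure_text(b"\xff")      # invalid UTF-8 becomes U+FFFD
```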
Here's some more reasoning about this.
Got a same error and this solved my error. Thanks!
Python 2 and Python 3 differ in their Unicode handling, which makes pickled files quite incompatible to load. So use the encoding argument of pickle.load. The link below helped me solve a similar problem when I was trying to open pickled data in Python 3.7, while the file was originally saved with Python 2.x.
https://blog.modest-destiny.com/posts/python-2-and-3-compatible-pickle-save-and-load/
I copy the load_pickle function in my script and called the load_pickle(pickle_file) while loading my input_data like this:
input_data = load_pickle("my_dataset.pkl")
The load_pickle function is here:
import pickle

def load_pickle(pickle_file):
    try:
        with open(pickle_file, 'rb') as f:
            pickle_data = pickle.load(f)
    except UnicodeDecodeError as e:
        # Python 2 pickles may contain non-ASCII byte strings; retry with latin1
        with open(pickle_file, 'rb') as f:
            pickle_data = pickle.load(f, encoding='latin1')
    except Exception as e:
        print('Unable to load data ', pickle_file, ':', e)
        raise
    return pickle_data
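To see why the fallback is needed, here is a small Python 3 sketch. The payload is hand-written to mimic what a Python 2 protocol 0 pickle of the byte string 'caf\xe9' looks like, so treat it as an illustrative assumption rather than a file produced by a real Python 2 run.

```python
import pickle

# Hand-written stand-in for a Python 2 pickle of the str 'caf\xe9'.
py2_style_pickle = b"S'caf\\xe9'\n."

# Default behaviour: Python 3 decodes Python 2 str objects as ASCII and fails.
try:
    pickle.loads(py2_style_pickle)
    default_failed = False
except UnicodeDecodeError:
    default_failed = True

# latin1 maps every byte to a code point, so it never fails.
latin1_value = pickle.loads(py2_style_pickle, encoding="latin1")

# encoding="bytes" leaves Python 2 str objects as bytes instead.
bytes_value = pickle.loads(py2_style_pickle, encoding="bytes")
```

Whether latin1 gives the right text depends on how the original data was encoded; if it was UTF-8, loading as bytes and decoding explicitly is the safer route.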
Encode converts a unicode object in to a string object. I think you are trying to encode a string object. first convert your result into unicode object and then encode that unicode object into 'utf-8'.
for example
result = yourFunction()
result.decode().encode('utf-8')
This worked for me:
file = open('docs/my_messy_doc.pdf', 'rb')
I had the same error, with URLs containing non-ASCII chars (bytes with values > 127). My solution:
url = url.decode('utf8').encode('utf-8')
Note: utf-8, utf8 are simply aliases . Using only 'utf8' or 'utf-8' should work in the same way
In my case, worked for me, in Python 2.7, I suppose this assignment changed 'something' in the str internal representation--i.e., it forces the right decoding of the backed byte sequence in url and finally puts the string into a utf-8 str with all the magic in the right place.
Unicode in Python is black magic for me.
Hope useful
I had the same problem but it didn't work for Python 3. I followed this and it solved my problem:
enc = sys.getdefaultencoding()
file = open(menu, "r", encoding = enc)
You have to set the encoding when you are reading/writing the file.
I got the same problem with the string "Pastelería Mallorca" and I solved with:
unicode("Pastelería Mallorca", 'latin-1')
In short, to ensure proper unicode handling in Python 2:
use io.open for reading/writing files
use from __future__ import unicode_literals
configure other data inputs/outputs (e.g., databases, network) to use unicode
if you cannot configure outputs to utf-8, convert your output for them: print(text.encode('ascii', 'replace').decode())
For explanations, see @Alastair McCormack's detailed answer.
In a Django (1.9.10)/Python 2.7.5 project I have frequent UnicodeDecodeError exceptions; mainly when I try to feed unicode strings to logging. I made a helper function for arbitrary objects to basically format to 8-bit ascii strings and replacing any characters not in the table to '?'. I think it's not the best solution but since the default encoding is ascii (and i don't want to change it) it will do:
from collections import Iterable  # Python 2 location of Iterable

def encode_for_logging(c, encoding='ascii'):
    if isinstance(c, basestring):
        return c.encode(encoding, 'replace')
    elif isinstance(c, Iterable):
        c_ = []
        for v in c:
            c_.append(encode_for_logging(v, encoding))
        return c_
    else:
        return encode_for_logging(unicode(c))
This error occurs when there are some non ASCII characters in our string and we are performing any operations on that string without proper decoding.
This helped me solve my problem.
I am reading a CSV file with columns ID,Text and decoding characters in it as below:
train_df = pd.read_csv("Example.csv")
train_data = train_df.values
for i in train_data:
    print("ID :" + i[0])
    text = i[1].decode("utf-8",errors="ignore").strip().lower()
    print("Text: " + text)
Here is my solution, just add the encoding.
with open(file, encoding='utf8') as f:
And because reading the GloVe file takes a long time, I recommend converting the GloVe file to a NumPy file. The next time you read the embedding weights, it will save you time.
import numpy as np
from tqdm import tqdm

def load_glove(file):
    """Loads GloVe vectors in numpy array.
    Args:
        file (str): a path to a glove file.
    Return:
        dict: a dict of numpy arrays.
    """
    embeddings_index = {}
    with open(file, encoding='utf8') as f:
        for i, line in tqdm(enumerate(f)):
            values = line.split()
            word = ''.join(values[:-300])
            coefs = np.asarray(values[-300:], dtype='float32')
            embeddings_index[word] = coefs
    return embeddings_index

# EMBEDDING_PATH = '../embedding_weights/glove.840B.300d.txt'
EMBEDDING_PATH = 'glove.840B.300d.txt'
embeddings = load_glove(EMBEDDING_PATH)
np.save('glove_embeddings.npy', embeddings)
Gist link: https://gist.github.com/BrambleXu/634a844cdd3cd04bb2e3ba3c83aef227
Specify: # encoding= utf-8 at the top of your Python File, It should fix the issue
I experienced this error with Python2.7. It happened to me while trying to run many python programs, but I managed to reproduce it with this simple script:
#!/usr/bin/env python
import subprocess
import sys
result = subprocess.Popen([u'svn', u'info'])
if not callable(getattr(result, "__enter__", None)) and not callable(getattr(result, "__exit__", None)):
    print("foo")
print("bar")
On success, it should print out 'foo' and 'bar', and probably an error message if you're not in a svn folder.
On failure, it should print 'UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 39: ordinal not in range(128)'.
After trying to regenerate my locales and many other solutions posted in this question, I learned the error was happening because I had a special character (ĺ) encoded in my PATH environment variable. After fixing the PATH in '~/.bashrc', and exiting my session and entering again, (apparently sourcing '~/.bashrc' didn't work), the issue was gone.

Reading UTF8 encoded CSV and converting to UTF-16

I'm reading in a CSV file that has UTF8 encoding:
ifile = open(fname, "r")
for row in csv.reader(ifile):
    name = row[0]
    print repr(row[0])
This works fine, and prints out what I expect it to print out; a UTF8 encoded str:
> '\xc3\x81lvaro Salazar'
> '\xc3\x89lodie Yung'
...
Furthermore, when I simply print the str (as opposed to repr()) the output displays ok (which I don't understand either - shouldn't this cause an error?):
> Álvaro Salazar
> Élodie Yung
but when I try to convert my UTF8 encoded strs to unicode:
ifile = open(fname, "r")
for row in csv.reader(ifile):
    name = row[0]
    print unicode(name, 'utf-8') # or name.decode('utf-8')
I get the infamous:
Traceback (most recent call last):
File "scripts/script.py", line 33, in <module>
print unicode(fullname, 'utf-8')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc1' in position 0: ordinal not in range(128)
So I looked at the unicode strings that are created:
ifile = open(fname, "r")
for row in csv.reader(ifile):
    name = row[0]
    unicode_name = unicode(name, 'utf-8')
    print repr(unicode_name)
and the output is
> u'\xc1lvaro Salazar'
> u'\xc9lodie Yung'
So now I'm totally confused as these seem to be mangled hex values. I've read this question:
Reading a UTF8 CSV file with Python
and it appears I am doing everything correctly, leading me to believe that my file is not actually UTF8, but when I initially print out the repr values of the cells, they appear to be correct UTF8 hex values. Can anyone either point out my problem or indicate where my understanding is breaking down (as I'm starting to get lost in the jungle of encodings)?
As an aside, I believe I could use codecs to open the file and read it directly into unicode objects, but the csv module doesn't support unicode natively so I can't use this approach.
Your default encoding is ASCII. When you try to print a unicode object, the interpreter therefore tries to encode it using the ASCII codec, which fails because your text includes characters that don't exist in ASCII.
The reason that printing the UTF-8 encoded bytestring doesn't produce an error (which seems to confuse you, although it shouldn't) is that this simply sends the bytes to your terminal. It will never produce a Python error, although it may produce ugly output if your terminal doesn't know what to do with the bytes.
To print a unicode, use print some_unicode.encode('utf-8'). (Or whatever encoding your terminal is actually using).
As for the u'\xc1lvaro Salazar', nothing here is mangled. The character Á is at the Unicode code point C1 (which has nothing to do with its UTF-8 representation, but happens to be the same value as in Latin-1), and Python uses \x hex escapes instead of \u unicode codepoint notation for codepoints that would have 00 as the most significant byte, to save space (it could also have displayed this as \u00c1).
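The relationship between the code point and its encoded forms can be checked with a few lines (a Python 3 sketch, where every string literal is already a Unicode string):

```python
a_acute = "\u00c1"                       # the character Á

# \xc1 and \u00c1 are two spellings of the same one-character string.
same_character = ("\xc1" == a_acute)
code_point = ord(a_acute)                # 0xC1

# The UTF-8 form is two different bytes; the Latin-1 form is the single byte 0xC1.
utf8_form = a_acute.encode("utf-8")
latin1_form = a_acute.encode("latin-1")

# Decoding the bytes printed in the question gives the expected name.
name = b"\xc3\x81lvaro Salazar".decode("utf-8")
```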
To get a good overview of how Unicode works in Python, I suggest http://nedbatchelder.com/text/unipain.html

Figuring out unicode: 'ascii' codec can't decode

I currently use Sublime 2 and run my python code there.
When I try to run this code. I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
6: ordinal not in range(128)
# -*- coding: utf-8 -*-
s = unicode('abcdefö')
print s
I have been reading the python documentation on unicode and as far as I understand this should work, or is it the console that's not working
Edit: Using s = u'abcdefö' as a string produces almost the same result. The result I get is
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in
position 6: ordinal not in range(128)
What happens is that unicode('abcdefö') tries to decode the encoded string to unicode at runtime. The coding: utf-8 line only tells Python that the source file is encoded in UTF-8. By the time the script runs, it has been compiled and the string has been stored as an encoded string. So when Python tries to decode the string it uses ASCII by default. As the string is actually UTF-8 encoded, this fails.
You can do s = u'abcdefö' which tells the compiler to decode the string with the encoding declared for the file and store it as unicode. s = unicode('abcdefö', 'utf8') or s = 'abcdefö'.decode('utf8') would do the same thing during runtime.
However, this does not necessarily mean that you can print s now. First the internal unicode string has to be encoded in a character set that stdout (the console/editor/IDE) can actually display. Sadly, Python often fails at figuring out the right character set and defaults to ASCII again, and you get an error when the string contains non-ASCII characters. The Python Wiki knows a few ways to set up stdout properly.
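The encoding step that print performs can be imitated without a real console. The Python 3 sketch below wraps an in-memory byte buffer in io.TextIOWrapper, which is the same kind of object that sits behind sys.stdout; the helper name render is made up for the example.

```python
import io

def render(text, encoding, errors="strict"):
    """Return the bytes a text stream with this encoding would emit for text."""
    buffer = io.BytesIO()
    wrapper = io.TextIOWrapper(buffer, encoding=encoding, errors=errors)
    wrapper.write(text)
    wrapper.flush()
    wrapper.detach()          # keep the buffer usable after the wrapper is gone
    return buffer.getvalue()

s = "abcdef\u00f6"
utf8_output = render(s, "utf-8")
ascii_replaced = render(s, "ascii", errors="replace")

# A stream that believes it is ASCII-only rejects the same text.
try:
    render(s, "ascii")
    ascii_strict_failed = False
except UnicodeEncodeError:
    ascii_strict_failed = True
```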
You need to mark the string as a unicode string:
s = u'abcdefö'
s = 'abcdefö'
DO NOT TRY unicode() if string is already in unicode. i.e. unicode(s) is wrong.
IF type(s) == str but contains unicode characters:
First convert to unicode
str_val = unicode(s,'utf-8')
str_val = unicode(s,'utf-8','replace')
Finally encode to string
str_val.encode('utf-8')
Now you can print:
print s
