I want to substitute a substring with a hash. The substring contains non-ASCII characters, so I tried to encode it to UTF-8.
result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)', lambda m: m.group(1) + m.group(2) + m.group(3) + hashlib.sha512(m.group(4).encode()).hexdigest(), line.encode('utf-8'))
I am not really sure why this doesn't work; I thought that with line.encode('utf-8') the whole string would get encoded.
I also tried to encode my m.group results to UTF-8, but I got the same UnicodeDecodeError:
UnicodeDecodeError: 'ascii' codec can't decode byte in position ...: ordinal not in range(128)
Sample input:
Start: myUsername: myÜsername:
What am I missing?
EDIT:
Traceback (most recent call last):
File "C:/Users/Peter/Desktop/coding/filter.py", line 26, in <module>
encodeline = line.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 112: ordinal not in range(128)
Based on your symptoms, you're running on Python 2. Calling encode on a Python 2 str is almost always nonsensical.
You have two problems; one you're hitting now, and one you'll hit if you fix your current code.
Your first problem is that line is already a str of (apparently) UTF-8 encoded bytes, not unicode, so encoding it implicitly decodes it with Python's default encoding (ASCII; this isn't locale-specific to my knowledge, and it's a rare Python 2 install that uses anything else), then re-encodes with the specified codec (or the default if not specified). Basically, line was already UTF-8 encoded, and you told it to encode again as UTF-8; that's nonsensical, so Python tried to decode it as ASCII first, and failed before it even tried to encode as you instructed.
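A minimal Python 2 session showing that failure mode (the sample value is illustrative; Ü is the two UTF-8 bytes \xc3\x9c):
>>> line = 'Start: myUsername: my\xc3\x9csername:'  # str of UTF-8 bytes, not unicode
>>> line.encode('utf-8')  # the implicit ASCII decode runs first, and fails
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 21: ordinal not in range(128)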
The solution to this problem is to just not encode line at all; it's already UTF-8 encoded, so you're already golden.
Your second problem (which you haven't encountered yet, but you will) is that you're calling encode on the group(4) result. But of course, since the input was a str, the group is a str too, and you'll encounter the same problem trying to encode a str; since the group came from raw UTF-8 encoded bytes, the non-ASCII parts of it cause a UnicodeDecodeError during the implicit decode step before the encode.
The reason:
import sys
reload(sys)
sys.setdefaultencoding('UTF8')
works is that it (dangerously) changes the implicit decode step to use UTF-8, so all your encode calls now perform the implicit decode with UTF-8 instead of ASCII; the decode-and-encode is mostly pointless, since all it does is return the original str after confirming it's legal UTF-8 by decoding it as such, otherwise acting as an expensive no-op.
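Concretely, after that hack (a sketch; the byte values are illustrative):
s = '\xc3\x9c'       # the UTF-8 bytes for Ü
s.encode('utf-8')    # really s.decode('UTF8').encode('utf-8'), which hands back '\xc3\x9c' unchanged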
To fix the second problem, just change:
m.group(4).encode()
to:
m.group(4)
That leaves your final code as:
result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)',
                lambda m: m.group(1) + m.group(2) + m.group(3) + hashlib.sha512(m.group(4)).hexdigest(),
                line)
Optionally, if you want to confirm your expectation that line is in fact UTF-8 encoded bytes already, add the following above that re.sub line:
try:
    line.decode('utf-8')
except Exception as e:
    sys.exit("line (of type {!r}) not decodable as UTF-8: {}".format(line.__class__.__name__, e))
which will cause the program to exit immediately if the data given is not legal UTF-8 (and will also let you know what type line is, so you can confirm for sure if it's really str or unicode, since str implies you chose the wrong codec, while unicode means your inputs aren't of the expected type).
I found what is, in my eyes, a workaround.
It doesn't feel right, but it does the job.
import sys
reload(sys)
sys.setdefaultencoding('UTF8')
I thought it could be done with .encode('utf-8')
file = 'xyz'
res = hashlib.sha224(str(file).encode('utf-8')).hexdigest()
This is because a unicode object must be encoded to a byte string before it can be hashed.
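For comparison, a short Python 2 sketch of both cases (the names and values are illustrative):
import hashlib

u = u'my\xdcsername'                                  # unicode object (Ü is U+00DC)
print hashlib.sha224(u.encode('utf-8')).hexdigest()   # encode to UTF-8 bytes, then hash

b = 'my\xc3\x9csername'                               # str already holding UTF-8 bytes
print hashlib.sha224(b).hexdigest()                   # hash the bytes directly; no encode needed
Both calls print the same digest, since they hash the same bytes.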
Related
So, I'm having issues with Python 3 encoding. I have a few bytes that I want to treat as strings. (long story)
In a few words, this works:
a = "\x85".encode()
print(a.decode())
But this doesn't
b = (0x85).to_bytes(1,"big")
print(b.decode())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 0:
invalid start byte
I have read a handful of articles on the subject, but they insist that 'Python 3 is broken' or that 'you shouldn't be using strings for that'. Plenty of articles on Stack Overflow just use workarounds (such as 'replace on error' or 'use utf-16').
Could anyone tell me where the difference lies and why the first version works while the second one doesn't? Shouldn't both of them work identically? Why can't the utf-8 codec decode the byte in the second attempt?
In the first case '\x85'.encode() encodes the Unicode code point U+0085 in the Python 3 default encoding of UTF-8. So the output is the correct two-byte UTF-8 encoding of that code point:
>>> '\x85'.encode()
b'\xc2\x85'
Decode then works because it was correctly encoded in UTF-8 to begin with:
>>> b'\xc2\x85'.decode()
'\x85'
The second case is a complicated way of creating a single byte string:
>>> (0x85).to_bytes(1,'big')
b'\x85'
This byte string is not correctly encoded as UTF-8, so it fails to decode:
>>> b'\x85'.decode()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 0: invalid start byte
Python 3 is definitely not "broken". It cleanly separates byte data from text.
If you have raw bytes, work with them as bytes. Raw data in Python 3 is intended to be manipulated in byte strings or byte arrays. Unicode strings are for text. Decode bytes to text to manipulate it, then encode back to bytes to serialize to file, socket, database, etc.
If for some reason you feel the need to use Unicode strings for raw data, the first 256 code points of Unicode correspond to the latin1 codec for 1:1 mapping of one to the other.
>>> '\x85'.encode('latin1')
b'\x85'
>>> b'\x85'.decode('latin1')
'\x85'
This is often used to correct programming errors due to encoding/decoding with the wrong encodings.
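For example, this latin1 round trip is the usual way to repair mojibake from bytes that were decoded with the wrong codec (Python 3; the sample value is illustrative):
bad = 'Ã©'                                    # UTF-8 bytes for 'é', mistakenly decoded as latin1
fixed = bad.encode('latin1').decode('utf-8')  # recover the original bytes, then decode correctly
print(fixed)                                  # é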
I'm trying to write out some text and encode it as utf-8 where possible, using the following code:
outf.write((lang_name + "," + (script_name or "") + "\n").encode("utf-8", errors='replace'))
I'm getting the following error:
File "C:\Python27\lib\encodings\cp1252.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 6: character maps to <undefined>
I thought the errors='replace' part of my encode call would handle that?
fwiw, I'm just opening the file with
outf = open(outfile, 'w')
without explicitly declaring the encoding.
print repr(outf)
produces:
<open file 'myfile.csv', mode 'w' at 0x000000000315E930>
I separated out the write statement into a separate concatenation, encoding, and file write:
outstr = lang_name + "," + (script_name or "") + "\n"
encoded_outstr = outstr.encode("utf-8", errors='replace')
outf.write(encoded_outstr)
It is the concatenation that throws the exception.
The strings are, via print repr(foo):
lang_name: 'G\xc4\x81ndh\xc4\x81r\xc4\xab'
script_name: u'Kharo\u1e63\u1e6dh\u012b'
Further detective work reveals that I can concatenate either one of those with a plain ascii string without any difficulty - it's putting them both into the same string that is breaking things.
So, the problem is that you are concatenating the bytestring 'G\xc4\x81ndh\xc4\x81r\xc4\xab' and the Unicode string u'Kharo\u1e63\u1e6dh\u012b'.
To be able to do that, Python 2.7 tries to decode the bytestring using its default encoding, to turn it into Unicode. Your default encoding is cp1252 rather than ASCII, for reasons I can't determine from here, but either way it fails just as it would have with ASCII, because that string is UTF-8.
Your best solution is probably to make sure that this doesn't happen, by changing the way the variables get those values in the first place.
If you can't, since you are encoding to UTF8 on the next line anyway, it's probably easiest to only encode script_name:
encoded_outstr = lang_name + b"," + (script_name or u"").encode('utf-8') + b"\n"
Note that I used b"," to explicitly make those string literals bytestrings and not Unicode strings; if you are using from __future__ import unicode_literals for Python 3 compatibility, then they are Unicode by default and the problem would just occur again.
When you concatenate a byte string and a Unicode string, Python 2 attempts to convert the byte string to Unicode first. If the byte string contains any non-ASCII characters in the range of \x80 to \xff, the automatic conversion will fail with the error you show. Notice that it says can't decode, not can't encode - this shows that the error did not occur in your call to encode.
The solution is to decode the byte string into Unicode yourself, using the proper code page, so that all the inputs to the concatenation are Unicode strings.
outstr = lang_name.decode("utf-8") + u"," + (script_name or u"") + u"\n"
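A quick interactive check of that promotion rule, with shortened values (Python 2 with the usual ASCII default; the asker's cp1252 default fails the same way):
>>> 'G\xc4\x81ndh' + u'\u1e63'                   # byte string + Unicode string: implicit decode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 1: ordinal not in range(128)
>>> 'G\xc4\x81ndh'.decode('utf-8') + u'\u1e63'   # decode first: both sides are Unicode
u'G\u0101ndh\u1e63'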
I have been trying to write a simple script that can save user input (originating from an iPhone) to a text file. The issue I'm having is that when a user uses an Emoji icon, it breaks the whole thing.
OS: Ubuntu
Python Version: 2.7.3
My code currently looks like this
f = codecs.open(path, "w+", encoding="utf8")
f.write("Desc: " + json_obj["description"])
f.close()
When an Emoji character is passed in the description variable, I get the error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-8: ordinal not in range(128)
Any possible help is appreciated.
The most likely problem here is that json_obj["description"] is actually a UTF-8-encoded str, not a unicode. So, when you try to write it to a codecs-wrapped file, Python has to decode it from str to unicode so it can re-encode it. And that's the part that fails, because that automatic decoding uses sys.getdefaultencoding(), which is 'ascii'.
For example:
>>> f = codecs.open('emoji.txt', 'w+', encoding='utf-8')
>>> e = u'\U0001f1ef'
>>> print e
🇯
>>> e
u'\U0001f1ef'
>>> f.write(e)
>>> e8 = e.encode('utf-8')
>>> e8
'\xf0\x9f\x87\xaf'
>>> f.write(e8)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 0: ordinal not in range(128)
There are two possible solutions here.
First, you can explicitly decode everything to unicode as early as possible. I'm not sure where your json_obj is coming from, but I suspect it's not actually the stdlib json.loads, because by default, that always gives you unicode keys and values. So, replacing whatever you're using for JSON with the stdlib functions will probably solve the problem.
Second, you can leave everything as UTF-8 str objects and stay in binary mode. If you know you have UTF-8 everywhere, just open the file with the plain builtin open instead of codecs.open, and write without any encoding.
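A sketch of that second option, staying in byte strings end to end (this assumes json_obj["description"] really is a UTF-8 str):
f = open(path, 'w')                          # plain byte-oriented Python 2 file
f.write('Desc: ' + json_obj['description'])  # str + str: no implicit decode anywhere
f.close()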
Also, you should strongly consider using io.open instead of codecs.open. It has a number of advantages, including:
Raises an exception instead of doing the wrong thing if you pass it incorrect values.
Often faster.
Forward-compatible with Python 3.
Has a number of bug fixes that will never be back-ported to codecs.
The only disadvantage is that it's not backward compatible to Python 2.5. Unless that matters to you, don't use codecs.
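With io.open, the original snippet would look something like this (a sketch, assuming json_obj["description"] is unicode):
import io

f = io.open(path, 'w+', encoding='utf-8')
f.write(u'Desc: ' + json_obj['description'])  # io insists on unicode here rather than guessing
f.close()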
We're running into a problem which is described at http://wiki.python.org/moin/UnicodeDecodeError -- read the second paragraph, '...Paradoxically...'.
Specifically, we're trying to up-convert a string to unicode and we are receiving a UnicodeDecodeError.
Example:
>>> unicode('\xab')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 0: ordinal not in range(128)
But of course, this works without any problems
>>> unicode(u'\xab')
u'\xab'
Of course, this code is just to demonstrate the conversion problem. In our actual code we are not using string literals, so we cannot just prepend the unicode u prefix; instead we are dealing with strings returned from os.walk(), and the file names include the above value. Since we cannot coerce the value to unicode without calling the unicode() constructor, we're not sure how to proceed.
One really horrible hack that occurs is to write our own str2uni() method, something like:
def str2uni(val):
    r"""brute force coercion of str -> unicode"""
    try:
        return unicode(src)
    except UnicodeDecodeError:
        pass
    res = u''
    for ch in val:
        res += unichr(ord(ch))
    return res
But before we do this -- wanted to see if anyone else had any insight?
UPDATED
I see everyone is getting focused on HOW I got to the example I posted, rather than the result. Sigh -- ok, here's the code that caused me to spend hours reducing the problem to the simplest form I shared above.
for _, _, files in os.walk('/path/to/folder'):
    for fname in files:
        filename = unicode(fname)
That piece of code tosses a UnicodeDecodeError exception when the filename has the following value '3\xab Floppy (A).link'
To see the error for yourself, do the following:
>>> unicode('3\xab Floppy (A).link')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 1: ordinal not in range(128)
UPDATED
I really appreciate everyone trying to help. And I also appreciate that most people make some pretty simple mistakes related to string/unicode handling. But I'd like to underline the reference to the UnicodeDecodeError exception. We are getting this when calling the unicode() constructor!!!
I believe the underlying cause is described in the aforementioned Wiki article http://wiki.python.org/moin/UnicodeDecodeError. Read from the second paragraph on down about how "Paradoxically, a UnicodeDecodeError may happen when encoding...". The Wiki article very accurately describes what we are experiencing -- but while it elaborates on the causes, it makes no suggestions for resolutions.
As a matter of fact, the third paragraph starts with the following astounding admission "Unlike a similar case with UnicodeEncodeError, such a failure cannot be always avoided...".
Since, as a developer, I am not used to "can't get there from here" answers, I thought it would be interesting to cast about on Stack Overflow for the experiences of others.
I think you're confusing Unicode strings and Unicode encodings (like UTF-8).
os.walk(".") returns the filenames (and directory names etc.) as strings that are encoded in the current codepage. It will silently remove characters that are not present in your current codepage (see this question for a striking example).
Therefore, if your file/directory names contain characters outside of your encoding's range, then you definitely need to use a Unicode string to specify the starting directory, for example by calling os.walk(u"."). Then you don't need to (and shouldn't) call unicode() on the results any longer, because they already are Unicode strings.
If you don't do this, you first need to decode the filenames (as in mystring.decode("cp850")) which will give you a Unicode string:
>>> "\xab".decode("cp850")
u'\xbd'
Then you can encode that into UTF-8 or any other encoding.
>>> _.encode("utf-8")
'\xc2\xbd'
If you're still confused why unicode("\xab") throws a decoding error, maybe the following explanation helps:
"\xab" is an encoded string. Python has no way of knowing which encoding that is, but before you can convert it to Unicode, it needs to be decoded first. Without any specification from you, unicode() assumes that it is encoded in ASCII, and when it tries to decode it under this assumption, it fails because \xab isn't part of ASCII. So either you need to find out which encoding is being used by your filesystem and call unicode("\xab", encoding="cp850") or whatever, or start with Unicode strings in the first place.
for fname in files:
    filename = unicode(fname)
The second line will complain if fname is not ASCII. If you want to convert the string to Unicode, then instead of unicode(fname) you should do fname.decode('<the encoding here>').
I would suggest an encoding, but you don't tell us what \xab is supposed to be in your .link file. You can search Google for the encoding anyway, so it would look like this:
for fname in files:
    filename = fname.decode('<encoding>')
UPDATE: For example, if the encoding of your filesystem's names is ISO-8859-1, then the \xab char would be «. To read it into Python you should do:
for fname in files:
    filename = fname.decode('latin1')  # which is a synonym for ISO-8859-1
Hope this helps!
As I understand it your issue is that os.walk(unicode_path) fails to decode some filenames to Unicode. This problem is fixed in Python 3.1+ (see PEP 383: Non-decodable Bytes in System Character Interfaces):
File names, environment variables, and command line arguments are
defined as being character data in POSIX; the C APIs however allow
passing arbitrary bytes - whether these conform to a certain encoding
or not. This PEP proposes a means of dealing with such irregularities
by embedding the bytes in character strings in such a way that allows
recreation of the original byte string.
Windows provides a Unicode API to access the filesystem, so there shouldn't be this problem there.
Python 2.7 (utf-8 filesystem on Linux):
>>> import os
>>> list(os.walk("."))
[('.', [], ['\xc3('])]
>>> list(os.walk(u"."))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/os.py", line 284, in walk
if isdir(join(top, name)):
File "/usr/lib/python2.7/posixpath.py", line 71, in join
path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: \
ordinal not in range(128)
Python 3.3:
>>> import os
>>> list(os.walk(b'.'))
[(b'.', [], [b'\xc3('])]
>>> list(os.walk(u'.'))
[('.', [], ['\udcc3('])]
Your str2uni() function tries (it introduces ambiguous names) to solve the same issue as "surrogateescape" error handler on Python 3. Use bytestrings for filenames on Python 2 if you are expecting filenames that can't be decoded using sys.getfilesystemencoding().
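For reference, the Python 3 surrogateescape round trip that replaces such hand-rolled coercion (a sketch reusing the byte string from above):
name = b'\xc3('.decode('utf-8', 'surrogateescape')  # '\udcc3(' - the undecodable byte becomes a lone surrogate
name.encode('utf-8', 'surrogateescape')             # b'\xc3(' - the original bytes come back losslessly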
'\xab'
Is a byte, number 171.
u'\xab'
Is a character, U+00AB Left-pointing double angle quotation mark («).
u'\xab' is a short-hand way of saying u'\u00ab'. It's not the same (not even the same datatype) as the byte '\xab'; it would probably have been clearer to always use the \u syntax in Unicode string literals IMO, but it's too late to fix that now.
To go from bytes to characters is known as a decode operation. To go from characters to bytes is known as an encode operation. For either direction, you need to know which encoding is used to map between the two.
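For example, with the UTF-8 encoding in Python 2:
>>> u'\xab'.encode('utf-8')     # characters -> bytes (encode)
'\xc2\xab'
>>> '\xc2\xab'.decode('utf-8')  # bytes -> characters (decode)
u'\xab'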
>>> unicode('\xab')
UnicodeDecodeError
unicode is a character string, so there is an implicit decode operation when you pass bytes to the unicode() constructor. If you don't tell it which encoding you want you get the default encoding which is often ascii. ASCII doesn't have a meaning for byte 171 so you get an error.
>>> unicode(u'\xab')
u'\xab'
Since u'\xab' (or u'\u00ab') is already a character string, there is no implicit conversion in passing it to the unicode() constructor - you get an unchanged copy.
res = u''
for ch in val:
    res += unichr(ord(ch))
return res
The encoding that maps each input byte to the Unicode character with the same ordinal value is ISO-8859-1. Consequently you could replace this loop with just:
return unicode(val, 'iso-8859-1')
(However note that if Windows is in the mix, then the encoding you want is probably not that one but the somewhat-similar windows-1252.)
One really horrible hack that occurs is to write our own str2uni() method
This isn't generally a good idea. UnicodeErrors are Python telling you you've misunderstood something about string types; ignoring that error instead of fixing it at source means you're more likely to hide subtle failures that will bite you later.
filename = unicode(fname)
So this would be better replaced with: filename = unicode(fname, 'iso-8859-1') if you know your filesystem is using ISO-8859-1 filenames. If your system locales are set up correctly then it should be possible to find out the encoding your filesystem is using, and go straight to that:
filename = unicode(fname, sys.getfilesystemencoding())
Though actually if it is set up correctly, you can skip all the encode/decode fuss by asking Python to treat filesystem paths as native Unicode instead of byte strings. You do that by passing a Unicode character string into the os filename interfaces:
for _, _, files in os.walk(u'/path/to/folder'):  # note the u'' string
    for fname in files:
        filename = fname  # nothing more to do!
PS. The character in 3″ Floppy should really be U+2033 Double Prime, but there is no encoding for that in ISO-8859-1. Better in the long term to use UTF-8 filesystem encoding so you can include any character.
I currently use Sublime 2 and run my python code there.
When I try to run this code, I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
6: ordinal not in range(128)
# -*- coding: utf-8 -*-
s = unicode('abcdefö')
print s
I have been reading the Python documentation on Unicode, and as far as I understand this should work. Or is it the console that's not working?
Edit: Using s = u'abcdefö' as the string instead produces almost the same result; what I get is:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in
position 6: ordinal not in range(128)
What happens is that unicode('abcdefö') tries to decode the encoded string to unicode at runtime. The coding: utf-8 line only tells Python that the source file is encoded in UTF-8. By the time the script runs, it has been compiled and the string is stored as an encoded byte string. So when Python tries to decode the string, it uses ASCII by default. As the string is actually UTF-8 encoded, this fails.
You can do s = u'abcdefö' which tells the compiler to decode the string with the encoding declared for the file and store it as unicode. s = unicode('abcdefö', 'utf8') or s = 'abcdefö'.decode('utf8') would do the same thing during runtime.
However, that does not necessarily mean that you can print s now. First, the internal unicode string has to be encoded in a character set that stdout (the console/editor/IDE) can actually display. Sadly, Python often fails at figuring out the right character set, defaults to ASCII again, and you get an error when the string contains non-ASCII characters. The Python Wiki knows a few ways to set up stdout properly.
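One of those ways, sketched for a terminal known to expect UTF-8 (the terminal encoding here is an assumption):
import sys

s = u'abcdef\xf6'                           # ö is U+00F6
sys.stdout.write(s.encode('utf-8') + '\n')  # encode explicitly instead of trusting the default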
You need to mark the string as a unicode string:
s = u'abcdefö'
s = 'abcdefö'
DO NOT TRY unicode() if the string is already unicode, i.e. unicode(s) is wrong.
IF type(s) == str but contains unicode characters:
First convert to unicode
str_val = unicode(s, 'utf-8')
str_val = unicode(s, 'utf-8', 'replace')
Finally, encode back to a byte string:
str_val.encode('utf-8')
Now you can print:
print str_val.encode('utf-8')