I think I'm just fundamentally confused about character sets that are not ASCII.
I have a python file that I have declared at the top to be # -*- coding: cp1252 -*-.
In the file I have question = "what is your borther’s name", for example.
type(question)
>> str
question
>> 'what is your borther\xe2\x80\x99s name'
And I cannot convert it to unicode at this point, presumably because you can't go from ASCII to Unicode:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 20: ordinal not in range(128)
If I declare it as unicode to begin with:
question = u"what is your borther’s name"
>> u'what is your borther\u2019s name'
How do I get "what is your borther’s name" back? Or is this just how the Python interpreter displays unicode strings, and will it in fact encode correctly when I pass it to a Unicode-aware application (in this case, Office)?
I need to preserve the special characters, but I still need to do a string comparison using the Levenshtein library (pip install python-Levenshtein).
Levenshtein.ratio takes str or unicode for both of its arguments, but not mixed.
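What I want is something like this minimal sketch (assuming the python-Levenshtein package; the typed variable is made up), where both arguments end up unicode:
# -*- coding: cp1252 -*-
import Levenshtein

question = u"what is your borther’s name"  # unicode literal
typed = "what is your borther's name"      # byte string (made-up input)

# Decode the byte string with its (assumed) encoding before comparing,
# so Levenshtein.ratio gets two unicode arguments rather than a mix.
print Levenshtein.ratio(question, typed.decode('cp1252'))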
I also have a plain text file that I have declared at the top to be # -*- coding: cp1252 -*-.
That does nothing: the coding declaration only applies to Python source files, not to data files. Open the file with an explicit encoding instead:
import codecs

with codecs.open(..., encoding='cp1252') as fp:
    ...
Related
I want to substitute a substring with a hash; said substring contains non-ASCII characters, so I tried to encode it to UTF-8.
result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)',
                lambda m: m.group(1) + m.group(2) + m.group(3) +
                          hashlib.sha512(m.group(4).encode()).hexdigest(),
                line.encode('utf-8'))
I am not really sure why this doesn't work; I thought that with line.encode('utf-8') the whole string would get encoded.
I also tried to encode my m.groups to UTF-8, but I got the same UnicodeDecodeError.
UnicodeDecodeError: 'ascii' codec can't decode byte in position ...: ordinal not in range(128)
Sample input:
Start: myUsername: myÜsername:
What am I missing?
EDIT:
Traceback (most recent call last):
File "C:/Users/Peter/Desktop/coding/filter.py", line 26, in <module>
encodeline = line.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 112: ordinal not in range(128)
Based on your symptoms, you're running on Python 2. Calling encode on a Python 2 str is almost always nonsensical.
You have two problems; one you're hitting now, and one you'll hit if you fix your current code.
Your first problem is that line is already a str of (apparently) UTF-8 encoded bytes, not unicode, so encoding it implicitly decodes with Python's default encoding (ASCII; this isn't locale-specific to my knowledge, and it's a rare Python 2 install that uses anything else), then re-encodes with the specified codec (or the default if not specified). Basically, line was already UTF-8 encoded, and you told it to encode again as UTF-8; that's nonsensical, so Python tried to decode as ASCII first, and failed before it even tried to encode as you instructed.
The solution to this problem is to just not encode line at all; it's already UTF-8 encoded, so you're already golden.
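Here is a short Python 2 sketch of that implicit decode (the variable name is made up):
s = '\xc3\xbc'           # UTF-8 bytes for u'ü', stored in a Python 2 str
u'\xfc'.encode('utf-8')  # fine: unicode -> bytes, returns '\xc3\xbc'
s.encode('utf-8')        # raises UnicodeDecodeError: 'ascii' codec can't
                         # decode byte 0xc3 in position 0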
Your second problem (which you haven't encountered yet, but you will) is that you're calling encode on the group(4) result. But of course, since the input was a str, the group is a str too, and you'll encounter the same problem trying to encode a str; since the group came from raw UTF-8 encoded bytes, the non-ASCII parts of it cause a UnicodeDecodeError during the implicit decode step before the encode.
The reason:
import sys
reload(sys)
sys.setdefaultencoding('UTF8')
works is that it (dangerously) changes the implicit decode step to use UTF-8, so all your encode calls now perform the implicit decode with UTF-8 instead of ASCII. The decode-and-encode is mostly pointless; all it does is return the original str after confirming it's legal UTF-8 by decoding it as such, otherwise acting as an expensive no-op.
To fix the second problem, just change:
m.group(4).encode()
to:
m.group(4)
That leaves your final code as:
result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)',
lambda m: m.group(1) + m.group(2) + m.group(3) + hashlib.sha512(m.group(4)).hexdigest(),
line)
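Run against the sample input from the question, a quick Python 2 check (assuming the script itself is saved as UTF-8) would look like this:
# -*- coding: utf-8 -*-
import hashlib
import re

line = 'Start: myUsername: myÜsername:'  # UTF-8 encoded str
result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)',
                lambda m: m.group(1) + m.group(2) + m.group(3) +
                          hashlib.sha512(m.group(4)).hexdigest(),
                line)
print result  # Start: myUsername: <128 hex digits>: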
Optionally, if you want to confirm your expectation that line is in fact UTF-8 encoded bytes already, add the following above that re.sub line:
import sys

try:
    line.decode('utf-8')
except Exception as e:
    sys.exit("line (of type {!r}) not decodable as UTF-8: {}".format(line.__class__.__name__, e))
which will cause the program to exit immediately if the data given is not legal UTF-8 (and will also let you know what type line is, so you can confirm for sure if it's really str or unicode, since str implies you chose the wrong codec, while unicode means your inputs aren't of the expected type).
I found what is, in my eyes, a workaround.
It doesn't feel right, but it does the job.
import sys
reload(sys)
sys.setdefaultencoding('UTF8')
I thought it could be done with .encode('utf-8'):
import hashlib

file = 'xyz'
res = hashlib.sha224(str(file).encode('utf-8')).hexdigest()
This is because a unicode object must be encoded to a byte string before hashing.
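For instance, a minimal Python 2 sketch (the username value is made up):
# -*- coding: utf-8 -*-
import hashlib

name = u'myÜsername'  # unicode object
# Encode to bytes first; hashing a unicode object with non-ASCII
# characters would trigger the implicit ASCII encode and fail.
print hashlib.sha224(name.encode('utf-8')).hexdigest()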
I am trying to write about 1000 rows to a .xlsx file from my Python application. The data is basically a combination of integers and strings. I am getting an intermittent error while running the wbook.close() command. The error is the following:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 15:
ordinal not in range(128)
My data does not have anything in unicode. I am wondering why the decoder is being invoked at all. Has anyone noticed this problem?
0xc3 is "Ã" in Latin-1, which is outside the ASCII range. So what you need to do is change the encoding. Use the decode() method.
string.decode('utf-8')
Also depending on your needs and uses you could add
# -*- coding: utf-8 -*-
at the beginning of your script, but only if you are sure that the encoding will not interfere and break something else.
As Alex Hristov points out, you have some non-ASCII data in your code that needs to be encoded as UTF-8 for Excel.
See the following examples from the docs which each have instructions on handling UTF-8 with XlsxWriter in different scenarios:
Example: Simple Unicode with Python 2
Example: Simple Unicode with Python 3
Example: Unicode - Polish in UTF-8
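A minimal sketch in the spirit of those examples (assumes the xlsxwriter package; the file name and data are made up), decoding byte strings to unicode before writing on Python 2:
# -*- coding: utf-8 -*-
import xlsxwriter

workbook = xlsxwriter.Workbook('demo.xlsx')
worksheet = workbook.add_worksheet()

row = ['André', 42]  # a UTF-8 byte string and an integer
for col, value in enumerate(row):
    if isinstance(value, str):         # Python 2 byte string
        value = value.decode('utf-8')  # hand XlsxWriter unicode
    worksheet.write(0, col, value)

workbook.close()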
I have a dummie Python module with the utf-8 header that looks like this:
# -*- coding: utf-8 -*-
a = "á"
print type(a), a
Which prints:
<type 'str'> á
But I thought that all string literals inside a Python module declared as utf-8 would automatically be of type unicode, instead of str. Am I missing something, or is this the correct behaviour?
In order to get a as a unicode string I use:
a = u"á"
But this doesn't seem very "polite", nor practical. Is there a better option?
# -*- coding: utf-8 -*-
doesn't make the string literals Unicode. Take this example: I have a UTF-8 encoded file with an Arabic comment and string:
# هذا تعليق عربي
print type('نص عربي')
If I run it, it throws a SyntaxError exception:
SyntaxError: Non-ASCII character '\xd9' in file file.py
on line 2, but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
So to allow this, I have to add that line to tell the interpreter that the file is UTF-8 encoded:
# -*-coding: utf-8 -*-
# هذا تعليق عربي
print type('نص عربي')
Now it runs fine, but it still prints <type 'str'> unless I make the string Unicode:
# -*-coding: utf-8 -*-
# هذا تعليق عربي
print type(u'نص عربي')
No, the codec declaration at the top only informs Python how to interpret the source code; Python uses that codec to interpret Unicode literals. It does not turn literal bytestrings into unicode values. As PEP 263 states:
This PEP proposes to introduce a syntax to declare the encoding of
a Python source file. The encoding information is then used by the
Python parser to interpret the file using the given encoding. Most
notably this enhances the interpretation of Unicode literals in
the source code and makes it possible to write Unicode literals
using e.g. UTF-8 directly in an Unicode aware editor.
Emphasis mine.
Without the codec declaration, Python has no idea how to interpret non-ASCII characters:
$ cat /tmp/test.py
example = '☃'
$ python2.7 /tmp/test.py
File "/tmp/test.py", line 1
SyntaxError: Non-ASCII character '\xe2' in file /tmp/test.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
If Python behaved the way you expect it to, you would not be able to write literal bytestring values that contain non-ASCII byte values either.
If your terminal is configured to display UTF-8 values, then printing a UTF-8 encoded byte string will look 'correct', but only by virtue of luck that the encodings match.
The correct way to get unicode values, is by using unicode literals or by otherwise producing unicode (decoding from byte strings, converting integer codepoints to unicode characters, etc.):
unicode_snowman = '\xe2\x98\x83'.decode('utf8')  # decode UTF-8 bytes
unicode_snowman = unichr(0x2603)                 # from an integer codepoint
In Python 3, the codec also applies to how variable names are interpreted, as you can use letters and digits outside of the ASCII range in names. The default codec in Python 3 is UTF-8, as opposed to ASCII in Python 2.
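For instance (Python 3; the variable name is made up):
snömän = '☃'    # identifiers and str literals may contain non-ASCII characters
print(snömän)   # str is unicode in Python 3, no u prefix needed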
No, this is just the source code encoding. Please see http://www.python.org/dev/peps/pep-0263/
To define a source code encoding, a magic comment must
be placed into the source files either as first or second
line in the file, such as:
# coding=<encoding name>
or (using formats recognized by popular editors)
#!/usr/bin/python
# -*- coding: <encoding name> -*-
or
#!/usr/bin/python
# vim: set fileencoding=<encoding name> :
This doesn't make all literals unicode; it just specifies how Unicode literals should be decoded.
One should use the unicode() function or the u prefix to make a literal unicode.
N.B. In Python 3 all strings are unicode.
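A third common option on Python 2 is the __future__ import, which makes every literal in the module unicode; a minimal sketch:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

a = 'á'   # unicode, thanks to the __future__ import
b = u'á'  # explicit prefix, identical result
print type(a), type(b)  # <type 'unicode'> <type 'unicode'>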
I currently use Sublime 2 and run my python code there.
When I try to run this code, I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
6: ordinal not in range(128)
# -*- coding: utf-8 -*-
s = unicode('abcdefö')
print s
I have been reading the Python documentation on unicode, and as far as I understand this should work. Or is it the console that's not working?
Edit: Using s = u'abcdefö' as a string produces almost the same result. The result I get is
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in
position 6: ordinal not in range(128)
What happens is that unicode('abcdefö') tries to decode the encoded string to unicode at runtime. The coding: utf-8 line only tells Python that the source file is encoded in UTF-8. When the script runs, it has been compiled and the string has been stored as an encoded string. So when Python tries to decode the string, it uses ASCII by default. As the string is actually UTF-8 encoded, this fails.
You can do s = u'abcdefö' which tells the compiler to decode the string with the encoding declared for the file and store it as unicode. s = unicode('abcdefö', 'utf8') or s = 'abcdefö'.decode('utf8') would do the same thing during runtime.
However, this does not necessarily mean that you can print s now. First the internal unicode string has to be encoded in a character set that stdout (the console/editor/IDE) can actually display. Sadly, Python often fails at figuring out the right character set, defaults to ASCII again, and you get an error when the string contains non-ASCII characters. The Python Wiki knows a few ways to set up stdout properly.
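One of those approaches, sketched for Python 2 (treat it as one option, not the only fix):
import codecs
import sys

# Wrap stdout so unicode strings are encoded as UTF-8 on the way out.
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

s = u'abcdefö'
print s  # no UnicodeEncodeError, provided the console displays UTF-8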
You need to mark the string as a unicode string:
s = u'abcdefö'
not:
s = 'abcdefö'
DO NOT TRY unicode() if the string is already unicode, i.e. unicode(s) is wrong.
IF type(s) == str but it contains unicode characters:
First convert to unicode:
str_val = unicode(s, 'utf-8')
str_val = unicode(s, 'utf-8', 'replace')
Finally encode to a byte string for output:
str_val.encode('utf-8')
Now you can print:
print str_val.encode('utf-8')
I am getting
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 104: ordinal not in range(128)
I am using IntegerProperty, StringProperty, and DateTimeProperty.
That's because 0xb0 (decimal 176) is not a valid character code in ASCII (which defines only values between 0 and 127).
Check where you got that string from and use the proper encoding.
If you need further help, post the code.
You are trying to put Unicode data (probably text with accents) into an ASCII string.
You can use Python's codecs module to open a text file with UTF-8 encoding and write the Unicode data to it.
The .encode method may also help (u"õ".encode('utf-8') for example)
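For example, a minimal sketch (the file name is made up):
# -*- coding: utf-8 -*-
import codecs

# codecs.open returns a file object that encodes unicode to UTF-8 on write.
with codecs.open('out.txt', 'w', encoding='utf-8') as fp:
    fp.write(u'õ\n')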
Python defaults to ASCII encoding - if you are dealing with chars outside of the ASCII range, you need to specify that in your code.
One way to do this is to define the encoding at the top of your code.
This snippet at the top of the file sets the encoding to Latin-1 (which includes 0xb0):
#!/usr/bin/python
# -*- coding: latin-1 -*-
import os, sys
...
See PEP 263 for more info on encoding.
When I write my foreign-language "flashcard" programs, I always use Python 3.x, as its native encoding is UTF-8. Your encoding problems will generally be far less frequent.
If you're working on a program that many people will share, however, you may want to consider using encode and decode with Python 2.x, but only when storing and retrieving data elements in persistent storage. encode your non-ASCII characters, silently manipulate those representations in memory, and save them encoded. Finally, use decode when fetching unicode strings from persistent storage, but for end-user display only. This will eliminate the need to constantly encode and decode your strings in your program.
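A sketch of that encode-at-the-boundary pattern (Python 2; the file name is made up):
# -*- coding: utf-8 -*-
word = u'wörd'  # keep unicode in memory

# Save: encode once, at the storage boundary.
with open('cards.txt', 'wb') as fp:
    fp.write(word.encode('utf-8'))

# Load: decode once, then work with unicode again.
with open('cards.txt', 'rb') as fp:
    loaded = fp.read().decode('utf-8')

print loaded == word  # True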
@jcoon also has a pretty standard response to this problem.