Python String Comparison--Problems With Special/Unicode Characters

Python String Comparison--Problems With Special/Unicode Characters - python

I'm writing a Python script to process some music data. It's supposed to merge two separate databases by comparing their entries and matching them up. It's almost working, but fails when comparing strings containing special characters (i.e. accented letters). I'm pretty sure it's a ASCII vs. Unicode encoding issue, as I get the error:
"Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal"
I realize I could use regular expressions to remove the offending characters, but I'm processing a lot of data and relying too much on regexes makes my program grindingly slow. Is there a way to have Python properly compare these strings? What is going on here--is there a way to tell whether it's storing my strings as ASCII or Unicode?
EDIT 1: I'm using Python v2.6.6. After checking the types, I've discovered that one database spits out me Unicode strings and one gives ASCII. So that's probably the problems. I'm trying to convert the ASCII strings from the second database to Unicode with a line like
line = unicode(f.readline().decode(latin_1).encode(utf_8))
but this gives an error like:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 41: ordinal not in range(128)
I'm not sure why the 'ascii' codec is complaining since I'm trying to decode from ASCII. Can anyone help?

Unicode vs Bytes
First, some terminology. There are two types of strings, encoded and decoded:
Encoded. This is what's stored on disk. To Python, it's a bunch of 0's and 1's that you might treat like ASCII, but it could be anything -- binary data, a JPEG image, whatever. In Python 2.x, this is called a "string" variable. In Python 3.x, it's more accurately called a "bytes" variable.
Decoded. This is a string of actual characters. They could be encoded to 8-bit ASCII strings, or it could be encoded to 32-bit Chinese characters. But until it's time to convert to an encoded variable, it's just a Unicode string of characters.
What this means to you
So here's the thing. You said you were getting one ASCII variable and one Unicode variable. That's actually not true.
You have one variable that's a string of bytes -- ones and zeros, presumably in sets of 8. This is the variable you assumed, incorrectly, to be ASCII.
You have another variable that's Unicode data -- numbers, letters, and symbols.
Before you compare the string of bytes to a Unicode string of characters, you have to make some assumptions. In your case, Python (and you) assumed that the string of bytes was ASCII encoded. That worked fine until you came across a character that wasn't ASCII -- a character with an accent mark.
So you need to find out what that string of bytes is encoded as. It might be latin1. If it is, you want to do this:
if unicode_variable == string_variable.decode('latin1')
Latin1 is basically ASCII plus some extended characters like Ç and Â.
If your data is in Latin1, that's all you need to do. But if your string of bytes is encoded in something else, you'll need to figure out what encoding that is and pass it to decode().
The bottom line is, there's no easy answer, unless you know (or make some assumptions) about the encoding of your input data.
What I would do
Try running var.decode('latin1') on your string of bytes. That will give you a Unicode variable. If that works, and the data looks correct (ie, characters with accent marks look like they belong), roll with it.
Oh, and if latin1 doesn't parse or doesn't look right, try utf8 -- another common encoding.

You might need to preprocess the databases and convert everything into UTF-8. My guess is that you've got Latin-1 accented characters in some entries.
As to your question, the only way to know for sure is to look. Have your script spit out those that don't compare, and look up the character codes. Or just try string.decode('latin1').encode('utf8') and see what happens.

Converting both to unicode should help:
if unicode(str1) == unicode(str2):
print "same"

To find out whether YOU (not it) are storing your strings as str objects or unicode objects, print type(your_string).
You can use print repr(your_string) to show yourself (and us) unambiguously what is in your string.
By the way, exactly what version of Python are you using, on what OS? If Python 3.x, use ascii() instead of repr().

Related

Python – How do I convert an ASCII string into UTF-8?

I am using a package in python that returns a string using ASCII characters as opposed to unicode (eg. returns 'serÃ©' as opposed to seré).
Given this is python 3.8, the string is actually encoded in unicode, the package just seems to output it as if it were ASCII. As such, when I try to perform x.decode('utf-8') or x.encode('ascii'), neither work. Is there a way to make python treat the string as if it were ASCII, such that I can decode it to unicode? Or is there a package that can serve this purpose.
I am relatively new to python so I apologise if my explanation is unclear. I am happy to clarify things if needed.
Code
from spanishconjugator import Conjugator as c
verb = c().conjugate('pasar', 'preterite', 'indicative', 'yo')
print(verb)
This returns the string 'pasÃ©' where it should return 'pasé'.
Update
From further searching and from your answers, it appears to be an issue to do with single 2-byte UTF-8 (é) characters being literally interpreted as two 1-byte latin-1 (Ã©) characters (nothing to do with ASCII, my mistake).
Managed to fix it with:
verb.encode('latin-1').decode('utf-8')
Thank you to those that commented.

If the input string contains the raw byte ordinals (such as \xc3\xa9/Ã© instead of é) use latin1 to encode it to bytes verbatim, then decode with the desired encoding.
>>> "pasÃ©".encode('latin1').decode()
'pasé'

Unicode (Cyrillic) character indexing, re-writing in python

I am working with Russian words written in the Cyrillic orthography. Everything is working fine except for how many (but not all) of the Cyrillic characters are encoded as two characters when in an str. For instance:
>>>print ["ё"]
['\xd1\x91']
This wouldn't be a problem if I didn't want to index string positions or identify where a character is and replace it with another (say "e", without the diaeresis). Obviously, the 2 "characters" are treated as one when prefixed with u, as in u"ё":
>>>print [u"ё"]
[u'\u0451']
But the strs are being passed around as variables, and so can't be prefixed with u, and unicode() gives a UnicodeDecodeError (ascii codec can't decode...).
So... how do I get around this? If it helps, I am using python 2.7

There are two possible situations here.
Either your str represents valid UTF-8 encoded data, or it does not.
If it represents valid UTF-8 data, you can convert it to a Unicode object by using mystring.decode('utf-8'). After it's a unicode instance, it will be indexed by character instead of by byte, as you have already noticed.
If it has invalid byte sequences in it... You're in trouble. This is because the question of "which character does this byte represent?" no longer has a clear answer. You're going to have to decide exactly what you mean when you say "the third character" in the presence of byte sequences that don't actually represent a particular Unicode character in UTF-8 at all...
Perhaps the easiest way to work around the issue would be to use the ignore_errors flag to decode(). This will entirely discard invalid byte sequences and only give you the "correct" portions of the string.

These are actually different encodings:
>>>print ["ё"]
['\xd1\x91']
>>>print [u"ё"]
[u'\u0451']
What you're seeing is the __repr__'s for the elements in the lists. Not the __str__ versions of the unicode objects.
But the strs are being passed around as variables, and so can't be
prefixed with u
You mean the data are strings, and need to be converted into the unicode type:
>>> for c in ["ё"]: print repr(c)
...
'\xd1\x91'
You need to coerce the two-byte strings into double-byte width unicode:
>>> for c in ["ё"]: print repr(unicode(c, 'utf-8'))
...
u'\u0451'
And you'll see with this transform they're perfectly fine.

To convert bytes into Unicode, you need to know the corresponding character encoding and call bytes.decode:
>>> b'\xd1\x91'.decode('utf-8')
u'\u0451'
The encoding depends on the data source. It can be anything e.g., if the data comes from a web page; see A good way to get the charset/encoding of an HTTP response in Python
Don't use non-ascii characters in a bytes literal (it is explicitly forbidden in Python 3). Add from __future__ import unicode_literals to treat all "abc" literals as Unicode literals.
Note: a single user-perceived character may span several Unicode codepoints e.g.:
>>> print(u'\u0435\u0308')
ё

How to convert a unicode character representation from string to unicode in python?

Ok I've found a lot of threads about how to convert a string from something like "/xe3" to "ã" but how the hell am I supposed to do it the other way around?
My concrete problem: I am using an API and everything works great except I provide some strings which then result in a json object. The result is sorted after the names (strings) I provided however they are returned as their unicode representation and as json APIs always work in pure strings. So all I need is a way to get from "ã" to "/xe3" but it can't for the love of god get it to work.
Every type of encoding or decoding I try either defaults back to a normal string, a string without that character, a string with a plain A or an unicode error that ascii can't decode it. (<- this was due to a horrible shell setup. Yay for old me.)
All I want is the plain encoded string!
(yea no not at all past me. All you want is the unicode representation of a character as string)
PS: All in python if that wasn't obvious from the title already.
Edit: Even though this is quite old I wanted to update this to not completely embarrass myself in the future.
The issue was an API which provided unicode representations of characters as string as a response. All I wanted to do was checking if they are the same however I had major issues getting python to interpret the string as unicode especially since those characters were just some inside of a longer text partially with backslashes.
This did help but I just stumbled across this horribly written question and just couldn't leave it like that.

"\xe3" in python is a string literal that represents a single byte with value 227:
>>> print len("\xe3")
1
>>> print ord("\xe3")
227
This single byte represents the 'ã' character in the latin-1 encoding (http://en.wikipedia.org/wiki/ISO/IEC_8859-1).
"ã" in python is a string literal consisting of two bytes: 0xC3, 0xA3 (195, 163):
>>> print len("ã")
2
>>> print ord("ã"[0])
195
>>> print ord("ã"[1])
163
This byte sequence is the UTF-8 encoding of the character "ã".
So, to go from "ã" in python to "\xe3", you first need to decode the utf-8 byte sequence into a python unicode string:
>>> "ã".decode("utf-8")
u'\xe3'
Now, you can take that unicode string and encode it however you like (e.g. into latin-1):
>>> "ã".decode("utf-8").encode("latin-1")
'\xe3'

Please read http://www.joelonsoftware.com/articles/Unicode.html . You should realize tehre is no such a thing as "a plain encoded string". There is "an encoded string in a given text encoding". So you are really in need to understand the better the concepts of Unicode.
Among other things, this is plain wrong: "The result is sorted after the names (strings) I provided however they are returned in encoded form." JSON uses Unicode, so you get the string in a decoded form.

Since I assume you are, perhaps unknowingly, working with UTF-8, you should be aware that \xe3 is the Unicode code point for the character ã. Not to be mistaken for the actual bytes that UTF-8 uses to reference that code point:
http://hexutf8.com/?q=U+e3
I.e. UTF-8 maps the byte sequence c3 a3 to the code point U+e3 which represents the character ã.
UTF-16 maps a different byte sequence, 00 e3 to that exact same code point. (Note how much simpler, but less space efficient the UTF-16 encoding is...)

Python: what does "...".encode("utf8") fix?

I wanted to url encode a python string and got exceptions with hebrew strings.
I couldn't fix it and started doing some guess oriented programming.
Finally, doing mystr = mystr.encode("utf8") before sending it to the url encoder saved the day.
Can somebody explain what happened? What does .encode("utf8") do? My original string was a unicode string anyways (i.e. prefixed by a u).

My original string was a unicode string anyways (i.e. prefixed by a u)
...which is the problem. It wasn't a "string", as such, but a "Unicode object". It contains a sequence of Unicode code points. These code points must, of course, have some internal representation that Python knows about, but whatever that is is abstracted away and they're shown as those \uXXXX entities when you print repr(my_u_str).
To get a sequence of bytes that another program can understand, you need to take that sequence of Unicode code points and encode it. You need to decide on the encoding, because there are plenty to choose from. UTF8 and UTF16 are common choices. ASCII could be too, if it fits. u"abc".encode('ascii') works just fine.
Do my_u_str = u"\u2119ython" and then type(my_u_str) and type(my_u_str.encode('utf8')) to see the difference in types: The first is <type 'unicode'> and the second is <type 'str'>. (Under Python 2.5 and 2.6, anyway).
Things are different in Python 3, but since I rarely use it I'd be talking out of my hat if I tried to say anything authoritative about it.

You original string was a unicode object containing raw Unicode code points, after encoding it as UTF-8 it is a normal byte string that contains UTF-8 encoded data.
The URL encoder seems to expect a byte string, so that it can URL-encode one byte after another and doesn't have to deal with Unicode code points. When you give it a unicode object, it tries to convert it to a byte string using some default encoding, probably ASCII. For Hebrew characters that cannot be represented as ASCII, this will lead to errors.

What does .encode("utf8") do?
It depends on which version of Python you're using:
In Python 3.x, it converts a str object (encoded in UTF-16 or UTF-32) into a bytes object containing the UTF-8 representation of the string.
In Python 2.x, it converts a unicode object into a str object encoded in UTF-8. But str has an encode method too, and writing '...'.encode('UTF-8') is equivalent to writing '...'.decode('ascii').encode('UTF-8').
Since you mentioned the "u" prefix, you must be using 2.x. If you don't require any 2.x-only libraries, I'd recommend switching to 3.x, which has a nice clear distinction between text and binary data.
Dive into Python 3 has a good explanation of the issue.
Can somebody explain what happened?
It would help if you told us what the error message was.
The urllib.quote function expects a str object. It also happens to work with unicode objects that contain only ASCII characters, but not when they contain Hebrew letters.
In Python 3.x, urllib.parse.quote accepts both str (=Python 2.x unicode) and bytes objects. Strings are automatically encoded in UTF-8.

"...".encode("utf-8") transforms the in-memory representation of the string into an UTF-8 -encoded string.
url encoder likely expected a bytestring, that is, string representation where each character is represented with a single byte.

It returns a UTF-8 encoded version of the Unicode string, mystr. It is important to realize that UTF-8 is simply 1 way of encoding Unicode. Python can work with many other encodings (eg. mystr.encode("utf32") or even mystr.encode("ascii")).

The link that balpha posted explains it all. In short:
The fact that your string was prefixed with "u" just means it's composed of Unicode characters (or code points). UTF-8 is an encoding of this string into a sequence of bytes.

"Broken" unicode strings encoded in UTF-8?

I have been studying unicode and its Python implementation now for two days, and I think I'm getting a glimpse of what it is about. Just to get confident, I'm asking if my assumptions for my current problems are correct.
In Django, forms give me unicode strings which I suspect to be "broken". Unicode strings in Python should be encoded in UTF-8, is that right? After entering the string "fähre" into a text field, the browser sends the string "f%c3%a4hre" in the POST request (checked via wireshark). When I retrieve the value via form.cleaned_data, I'm getting the string u'f\xa4hre' (note it is a unicode string), though. As far as I understand that, that is ISO-8859-1-encoded unicode string, which is incorrect. The correct string should be u'f\xc3\xa4hre', which would be a UTF-8-encoded unicode string. Is that a Django bug or is there something wrong with my understanding of it?
To fix the issue, I wrote a function to apply it to any text input from Django forms:
def fix_broken_unicode(s):
return unicode(s.encode(u'utf-8'), u'iso-8859-1')
which does
>>> fix_broken_unicode(u'f\xa4hre')
u'f\xc3\xa4hre'
That doesn't seem very elegant to me, but setting Django's settings.DEFAULT_CHARSET to 'utf-8' didn't help, nor did anything else. I am trying to work with unicode throughout the whole application so I won't get any weird errors later on, but it obviously does not suffice to mark all strings with u'...'.
Edit: Considering the answers from Dirk and sth, I will now save the strings to the database as they are. The real problem was that I was trying to urlencode these kinds of strings to use them as input for the Twitter API etc. In GET or POST requests, though, UTF-8 encoding is obviously expected which the standard urllib.urlencode() function does not process correctly (throws exceptions). Take a look at my solution in the pastebin and feel free to comment on it also.

u'f\xa4hre'is a unicode string, not encoded as anything. The unicode codepoint 0xa4 is the character ä. It is not really important that ä would also be encoded as byte 0xa4 in ISO-8859-1.
The unicode string can contain any unicode characters without encoding them in some way. For example 轮渡 would be represented as u'\u8f6e\u6e21', which are simply two unicode codepoints. The UTF-8 encoding would be the much longer '\xe8\xbd\xae\xe6\xb8\xa1'.
So there is no need to fix the encoding, you are just seeing the internal representation of the unicode string.

Not exactly: after having been decoded, the unicode string is unicode which means, it may contain characters with codes beyond 255. How the interpreter represents these depends on the platform, but usually nowadays it uses character elements with a width of at least 16 bits. ISO-8859-1 is a proper subset of unicode. Thus, the string u'f\xa4hre' is actually proper -- the \xa4 is a rendering artifact, since Python doesn't know if (and when) it is safe to include characters with codes beyond a certain range on the console.
UTF-8 is a transport encoding that is, a special way to write unicode data such, that it can be stored in "channels" with an element width of 8 bits per character/byte. In order to compute the proper "external" (or transport) encoding of a unicode string, you'd use the encode method, passing the desired representation. It returns a properly encoded byte string (as opposed to a unicode character string).
The reverse transformation is decode which takes a byte string and an encoding name and yields a unicode character string.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.