I have a string originating from an email. I'm not sure of the string's original encoding; however, in an email client it shows up like so:
'Somebody LastNáme'
I assumed that it was UTF-8 encoded. When I decode it from UTF-8, ...
'Somebody LastNáme'.decode('utf-8')
... I get the following unicode string in a python shell:
u'Somebody LastN\xe1me'
I also tried Latin-1, after browsing the Unicode docs and seeing that an acute accent can be encoded as Latin-1. Decoding with that codec, the result shows up the same way, i.e. the byte that cannot be represented as ASCII is shown as the escape \xe1 instead.
I would like to know 1) if it is possible to get python to show me accented characters when I am looking at non-ascii strings in a terminal (instead of escaped), and 2) whether (or how I can make sure that) the string will be rendered properly in a browser when I subsequently make use of it.
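For reference, a quick Python 2 shell sketch of point 1), using a hypothetical variable name: the escaped form is just the interactive repr, while print shows the accented character, assuming the terminal encoding (e.g. UTF-8 or Latin-1) can display it.
>>> name = u'Somebody LastN\xe1me'
>>> name                 # the interactive echo shows the repr, with escapes
u'Somebody LastN\xe1me'
>>> print name           # print encodes for the terminal and shows the accent
Somebody LastNáme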
I'm using Python 3.7 and Django 2.0. I want to strip out non-UTF-8 characters from a string that I'm obtaining by reading this CSV file. I tried this ...
web_site = row['website'].strip().encode("utf-8", 'ignore').decode("utf-8")
but this doesn't seem to be doing the job, since I have a resulting string that looks like ...
web_site: "wbez.org<200e>"
Whatever this "<200e>" thing is, it is evidently a non-UTF-8 string, because when I try to insert it into a MySQL database (deployed as a Docker image), I get the following error ...
web_1 | django.db.utils.OperationalError: Problem installing fixture '/app/maps/fixtures/seed_data.yaml': Could not load maps.Coop(pk=191): (1366, "Incorrect string value: '\\xE2\\x80\\x8E' for column 'web_site' at row 1")
Your row['website'] is already a Unicode string. UTF-8 can represent all valid Unicode code points, so .encode('utf8', 'ignore') typically has nothing to ignore: it encodes the entire string to UTF-8, and .decode('utf8') simply turns it back into the same Unicode string.
If you simply want to strip non-ASCII characters, use the following to filter only ASCII characters and ignore the rest.
row['website'].encode('ascii','ignore').decode('ascii')
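Applied to the value from the question, this would look roughly like the following (Python 3; the "<200e>" is the character U+200E, LEFT-TO-RIGHT MARK, which is exactly what the \xE2\x80\x8E bytes in the MySQL error encode):
>>> web_site = "wbez.org\u200e"        # trailing LEFT-TO-RIGHT MARK
>>> web_site.encode('ascii', 'ignore').decode('ascii')
'wbez.org'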
I think you are confusing the encodings.
Python has a standard character set: Unicode
UTF-8 is just an encoding of Unicode. All Unicode characters can be encoded in UTF-8, and all valid UTF-8 byte sequences can be decoded back into Unicode characters.
So you are just encoding and then decoding a Unicode string, and the code should do nothing. (There are a few exceptional cases: Python strings are effectively a superset of Unicode, so your code would remove non-Unicode code points; see surrogateescape. That is an extremely rare situation which you will usually only encounter when reading sys.argv or os.environ.)
In any case, I think you are going about this the wrong way. Search this site for the general question (e.g. "remove non-ASCII characters"). It is often better to decompose first (normalization form NFKD, the K standing for compatibility), then remove the accents, and only then remove the remaining non-ASCII characters, so that more characters get translated instead of dropped. There are various functions for creating slugs which do a better job, and there are also libraries that translate many more characters into "nearly equivalent" ASCII characters (Unicode has various representations of the letter A, and you may also want to translate Alpha and Aleph and so on into A; that is better than discarding them, especially for text in a foreign language, where otherwise you might discard nearly everything).
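A rough sketch of that decompose-then-strip idea (Python 3, standard library only; the helper name is made up and this is not the only way to do it):
import unicodedata

def to_ascii(text):
    # NFKD decomposition splits accented letters into base letter plus
    # combining mark; the combining marks are then dropped by the ASCII
    # encode, so 'á' survives as 'a' instead of being discarded outright.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii('LastNáme, Fähre'))   # -> 'LastName, Fahre'
For slugs or broader transliteration (Greek, Hebrew, etc.), a dedicated slugify or transliteration library will translate far more characters than this sketch does.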
I first tried typing in a Unicode character, encoding it in UTF-8, and decoding it back. Python happily gives back the original character.
I took a look at the encoded string; it is b'\xe6\x88\x91'. I don't understand what this is; it looks like 3 hex numbers.
Then I did some research and I found that the CJK set starts from 4E00, so now I want Python to show me what this character looks like. How do I do that? Do I need to convert 4E00 to the form of something like the one above?
The text b'\xe6\x88\x91' is the representation of the bytes that are the UTF-8 encoding of the Unicode code point \u6211, which is the character 我. So there is no need to convert anything, other than back to a Unicode string with .decode('utf-8').
You'll need to decode it using the UTF-8 encoding:
>>> print(b'\xe6\x88\x91'.decode('UTF-8'))
我
By decoding it you're turning the bytes (which is what b'...' is) into a Unicode string and that's how you can display / use the text.
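For the second part of the question, there is no need to go through the byte form at all: in Python 3 you can build the character directly from the code point 4E00.
>>> chr(0x4E00)            # from the integer code point
'一'
>>> '\u4e00'               # or as a Unicode escape in a string literal
'一'
>>> '\u4e00'.encode('UTF-8')
b'\xe4\xb8\x80'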
I read a text file which has some characters like '\260' (it means '°'), and then I add the data to a DB (sqlite3).
After that, I try to get the information back from the DB, but the SQL query gets built with '\xb0' (which also means '°'), because I get this information from an XML file.
I tried to replace the hex characters with octal characters, text = text.replace(r'\xb0', '\260'), but it doesn't work. Why? I cannot build a correct SQL query.
Maybe there are some solutions for this problem, e.g. encode, decode, etc.
\260 is the same thing as \xb0:
>>> '\xb0'
'\xb0'
>>> '\260'
'\xb0'
You probably want to decode your input to unicode and insert that instead. If your data is encoded as Latin-1, then decode it:
>>> print '\xb0'.decode('latin1')
°
sqlite3 can handle unicode just fine, and by decoding you make sure you are handling text values, not byte values, which can differ from codec to codec.
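A minimal sketch of that (Python 2, with a made-up table and column name; the '\xb0' byte is assumed to be Latin-1, as in the question):
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE readings (label TEXT)')

raw = 'temperature \xb0C'          # byte string read from the text file
conn.execute('INSERT INTO readings VALUES (?)', (raw.decode('latin1'),))

print repr(conn.execute('SELECT label FROM readings').fetchone()[0])
# -> u'temperature \xb0C', i.e. the degree sign stored as a proper unicode value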
I'm writing a Python script to process some music data. It's supposed to merge two separate databases by comparing their entries and matching them up. It's almost working, but it fails when comparing strings containing special characters (i.e. accented letters). I'm pretty sure it's an ASCII vs. Unicode encoding issue, as I get the error:
"Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal"
I realize I could use regular expressions to remove the offending characters, but I'm processing a lot of data and relying too much on regexes makes my program grindingly slow. Is there a way to have Python properly compare these strings? What is going on here--is there a way to tell whether it's storing my strings as ASCII or Unicode?
EDIT 1: I'm using Python v2.6.6. After checking the types, I've discovered that one database spits out Unicode strings and the other gives ASCII. So that's probably the problem. I'm trying to convert the ASCII strings from the second database to Unicode with a line like
line = unicode(f.readline().decode('latin_1').encode('utf_8'))
but this gives an error like:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 41: ordinal not in range(128)
I'm not sure why the 'ascii' codec is complaining, since I'm trying to decode from Latin-1, not ASCII. Can anyone help?
Unicode vs Bytes
First, some terminology. There are two types of strings, encoded and decoded:
Encoded. This is what's stored on disk. To Python, it's a bunch of 0's and 1's that you might treat like ASCII, but it could be anything -- binary data, a JPEG image, whatever. In Python 2.x, this is called a "string" variable. In Python 3.x, it's more accurately called a "bytes" variable.
Decoded. This is a string of actual characters. They could be encoded to 8-bit ASCII strings, or to 32-bit Chinese characters. But until it's time to convert to an encoded variable, it's just a Unicode string of characters.
What this means to you
So here's the thing. You said you were getting one ASCII variable and one Unicode variable. That's actually not true.
You have one variable that's a string of bytes -- ones and zeros, presumably in sets of 8. This is the variable you assumed, incorrectly, to be ASCII.
You have another variable that's Unicode data -- numbers, letters, and symbols.
Before you compare the string of bytes to a Unicode string of characters, you have to make some assumptions. In your case, Python (and you) assumed that the string of bytes was ASCII encoded. That worked fine until you came across a character that wasn't ASCII -- a character with an accent mark.
So you need to find out what that string of bytes is encoded as. It might be latin1. If it is, you want to do this:
if unicode_variable == string_variable.decode('latin1')
Latin1 is basically ASCII plus some extended characters like Ç and Â.
If your data is in Latin1, that's all you need to do. But if your string of bytes is encoded in something else, you'll need to figure out what encoding that is and pass it to decode().
The bottom line is, there's no easy answer, unless you know (or make some assumptions) about the encoding of your input data.
What I would do
Try running var.decode('latin1') on your string of bytes. That will give you a Unicode variable. If that works, and the data looks correct (i.e., characters with accent marks look like they belong), roll with it.
Oh, and if latin1 doesn't parse or doesn't look right, try utf8 -- another common encoding.
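A rough sketch of that try-one-then-the-other idea (Python 2; the helper name is made up). Note that latin1 maps every possible byte to some character and never raises, so the stricter utf8 has to be tried first:
def to_unicode(byte_string):
    try:
        return byte_string.decode('utf8')
    except UnicodeDecodeError:
        return byte_string.decode('latin1')

print repr(to_unicode('caf\xc3\xa9'))   # UTF-8 bytes   -> u'caf\xe9'
print repr(to_unicode('caf\xe9'))       # Latin-1 byte  -> u'caf\xe9'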
You might need to preprocess the databases and convert everything into UTF-8. My guess is that you've got Latin-1 accented characters in some entries.
As to your question, the only way to know for sure is to look. Have your script spit out the entries that don't compare equal, and look up the character codes. Or just try string.decode('latin1').encode('utf8') and see what happens.
Converting both to unicode should help:
if unicode(str1) == unicode(str2):
    print "same"
To find out whether YOU (not it) are storing your strings as str objects or unicode objects, print type(your_string).
You can use print repr(your_string) to show yourself (and us) unambiguously what is in your string.
By the way, exactly what version of Python are you using, on what OS? If Python 3.x, use ascii() instead of repr().
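For example, in a Python 2.6 shell:
>>> s = 'LastN\xc3\xa1me'              # byte string (str)
>>> u = u'LastN\xe1me'                 # unicode string
>>> type(s), type(u)
(<type 'str'>, <type 'unicode'>)
>>> print repr(s)
'LastN\xc3\xa1me'
>>> print repr(u)
u'LastN\xe1me'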
I have been studying unicode and its Python implementation now for two days, and I think I'm getting a glimpse of what it is about. Just to get confident, I'm asking if my assumptions for my current problems are correct.
In Django, forms give me unicode strings which I suspect to be "broken". Unicode strings in Python should be encoded in UTF-8, is that right? After entering the string "fähre" into a text field, the browser sends the string "f%c3%a4hre" in the POST request (checked via wireshark). When I retrieve the value via form.cleaned_data, however, I'm getting the string u'f\xa4hre' (note that it is a unicode string). As far as I understand it, that is an ISO-8859-1-encoded unicode string, which is incorrect. The correct string should be u'f\xc3\xa4hre', which would be a UTF-8-encoded unicode string. Is that a Django bug, or is there something wrong with my understanding of it?
To fix the issue, I wrote a function to apply it to any text input from Django forms:
def fix_broken_unicode(s):
    return unicode(s.encode(u'utf-8'), u'iso-8859-1')
which does
>>> fix_broken_unicode(u'f\xa4hre')
u'f\xc3\xa4hre'
That doesn't seem very elegant to me, but setting Django's settings.DEFAULT_CHARSET to 'utf-8' didn't help, nor did anything else. I am trying to work with unicode throughout the whole application so I won't get any weird errors later on, but it obviously does not suffice to mark all strings with u'...'.
Edit: Considering the answers from Dirk and sth, I will now save the strings to the database as they are. The real problem was that I was trying to urlencode these kinds of strings to use them as input for the Twitter API etc. In GET or POST requests, though, UTF-8 encoding is obviously expected, which the standard urllib.urlencode() function does not handle correctly (it throws exceptions). Take a look at my solution in the pastebin and feel free to comment on it as well.
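For reference, a minimal sketch of that urlencode workaround (Python 2, with a made-up parameter name): urllib.urlencode() effectively calls str() on its values, which raises an exception for non-ASCII unicode, so each value is encoded to UTF-8 bytes first.
import urllib

params = {'status': u'f\xe4hre'}
encoded = dict((k, v.encode('utf-8')) for k, v in params.items())
print urllib.urlencode(encoded)        # -> status=f%C3%A4hre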
u'f\xa4hre' is a unicode string, not encoded as anything. The unicode codepoint 0xa4 is the character ¤. It is not really important that ¤ would also be encoded as byte 0xa4 in ISO-8859-1.
The unicode string can contain any unicode characters without encoding them in some way. For example 轮渡 would be represented as u'\u8f6e\u6e21', which are simply two unicode codepoints. The UTF-8 encoding would be the much longer '\xe8\xbd\xae\xe6\xb8\xa1'.
So there is no need to fix the encoding, you are just seeing the internal representation of the unicode string.
Not exactly: after having been decoded, the unicode string is just Unicode, which means it may contain characters with codes beyond 255. How the interpreter represents these depends on the platform, but usually nowadays it uses character elements with a width of at least 16 bits. ISO-8859-1 is a proper subset of Unicode. Thus, the string u'f\xa4hre' is actually proper -- the \xa4 is a rendering artifact, since Python doesn't know if (and when) it is safe to include characters with codes beyond a certain range on the console.
UTF-8 is a transport encoding, that is, a special way to write Unicode data such that it can be stored in "channels" whose elements are 8 bits (bytes) wide. In order to compute the proper "external" (or transport) encoding of a unicode string, you'd use the encode method, passing the desired representation. It returns a properly encoded byte string (as opposed to a unicode character string).
The reverse transformation is decode which takes a byte string and an encoding name and yields a unicode character string.
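A minimal round trip of those two operations in a Python 2 shell:
>>> u'f\xe4hre'.encode('utf-8')        # encode: characters -> UTF-8 bytes
'f\xc3\xa4hre'
>>> 'f\xc3\xa4hre'.decode('utf-8')     # decode: bytes -> characters
u'f\xe4hre'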