Convert or strip out "illegal" Unicode characters

Convert or strip out "illegal" Unicode characters - python

I've got a database in MSSQL that I'm porting to SQLite/Django. I'm using pymssql to connect to the database and save a text field to the local SQLite database.
However for some characters, it explodes. I get complaints like this:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 1916: ordinal not in range(128)
Is there some way I can convert the chars to proper unicode versions? Or strip them out?

Once you have the string of bytes s, instead of using it as a unicode obj directly, convert it explicitly with the right codec, e.g.:
u = s.decode('latin-1')
and use u instead of s in the code that follows this point (presumably the part that writes to sqlite). That's assuming latin-1 is the encoding that was used to make the byte string originally -- it's impossible for us to guess, so try to find out;-).
As a general rule, I suggest: don't process in your applications any text as encoded byte strings -- decode them to unicode objects right after input, and, if necessary, encode them back to byte strings right before output.

When you decode, just pass 'ignore' to strip those characters
there is some more way of stripping / converting those are
'replace': replace malformed data with a suitable replacement marker, such as '?' or '\ufffd'
'ignore': ignore malformed data and continue without further notice
'backslashreplace': replace with backslashed escape sequences (for encoding only)
Test
>>> "abcd\x97".decode("ascii")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 4: ordinal not in range(128)
>>>
>>> "abcd\x97".decode("ascii","ignore")
u'abcd'

Related

Python string cent symbol conversion

The string I have is
u'3.4\xa2 / each'
The '\xa2' is the "cent" symbol, and I want to show it that way.
I tried
i= "3.4\xa2 / each"
print unicode(i, errors='replace')
In the result, the cent symbol is shown as a question mark inside a solid circle.
I also tried
i= "3.4\xa2 / each"
print i.encode('utf-8')
I get
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa2 in position 3: ordinal not in range(128)
So what is the right way to accomplish this?

'\xa2' is a byte. It may be interpreted as a cent symbol, but only if you specify the right codec. By specifying the right codec can you decode it to the Unicode codepoint equivalent. Latin-1 would do:
>>> print '\xa2'.decode('latin1')
¢
There are a whole series of encodings that encode the ¢ cent codepoint as A2, however.
Alternatively, start with a Unicode string to begin with. \xa2 in a Unicode string expression is the same thing as \u00a2, which happens to be the right codepoint:
>>> print u'\xa2'
¢
>>> print u'\u00a2'
¢
That's because the first 256 codepoints of the Unicode standard happen to match the Latin-1 (ISO-8859-1) standard.
You may have trouble printing; if you are using a terminal or console, print is supposed to automatically encode Unicode data to match your terminal or console configuration, but that may not always be correct or be set to a codec that can handle the characters you are trying to print!
Note that I decoded. If you encode, Python tries to be helpful and decode the bytes to a Unicode object first, so that it can be encoded afterwards. Because \xa2 in not a valid ASCII byte, that decoding failed.
You may want to read:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
The Python Unicode HOWTO
before continuing.

A few points:
encode is a method to convert unicode strings to bytes. If you call encode on a byte string, Python2 will first try to decode it with ASCII and then encode it. That's where your error is coming from.
Your string cannot be decoded with UTF-8, because not every sequence of bytes is valid UTF-8.
Demo:
>>> "3.4\xa2 / each".decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 3: invalid start byte
You can use the latin-1 encoding here, because it maps every byte to the corresponding unicode ordinal.
Demo:
>>> print("3.4\xa2 / each".decode('latin-1'))
3.4¢ / each

You can try:
print "3.4" + u"\u00A2" +"each"
Works for me.

Python 2.7.6 + unicode_literals - UnicodeDecodeError: 'ascii' codec can't decode byte

I'm trying to print the following unicode string but I'm receiving a UnicodeDecodeError: 'ascii' codec can't decode byte error. Can you please help form this query so it can print the unicode string properly?
>>> from __future__ import unicode_literals
>>> ts='now'
>>> free_form_request='[EXID(이엑스아이디)] 위아래 (UP&DOWN) MV'
>>> nick='me'
>>> print('{ts}: free form request {free_form_request} requested from {nick}'.format(ts=ts,free_form_request=free_form_request.encode('utf-8'),nick=nick))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 6: ordinal not in range(128)
Thank you very much in advance!

Here's what happen when you construct this string:
'{ts}: free form request {free_form_request} requested from {nick}'.format(ts=ts,free_form_request=free_form_request.encode('utf-8'),nick=nick)
free_form_request is encode-d into a byte string using utf-8 as the encoding. This works because utf-8 can represent [EXID(이엑스아이디)] 위아래 (UP&DOWN) MV.
However, the format string ('{ts}: free form request {free_form_request} requested from {nick}') is a unicode string (because of imported from __future__ import unicode_literals).
You can't use byte strings as format arguments for a unicode string, so Python attempts to decode the byte string created in 1. to create a unicode string (which would be valid as an format argument).
Python attempts the decode-ing using the default encoding, which is ascii, and fails, because the byte string is a utf-8 byte string that includes byte values that don't make sense in ascii.
Python throws a UnicodeDecodeError.
Note that while the code is obviously doing something here, this would actually not throw an exception on Python 3, which would instead substitute the repr of the byte string (the repr being a unicode string).
To fix your issue, just pass unicode strings to format.
That is, don't do step 1. where you encoded free_form_request as a byte string: keep it as a unicode string by removing .encode(...):
'{ts}: free form request {free_form_request} requested from {nick}'.format(
ts=ts,
free_form_request=free_form_request,
nick=nick)
Note Padraic Cunningham's answer in the comments as well.

What happens when you call str() on a unicode string?

I'm wondering what happens internally when you call str() on a unicode string.
# coding: utf-8
s2 = str(u'hello')
Is s2 just the unicode byte representation of the str() arg?

It will try to encode it with your default encoding. On my system, that's ASCII, and if there's any non-ASCII characters, it will fail:
>>> str(u'あ')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u3042' in position 0: ordinal not in range(128)
Note that this is the same error you'd get if you called encode('ascii') on it:
>>> u'あ'.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u3042' in position 0: ordinal not in range(128)
As you might imagine, str working on some arguments and failing on others makes it easy to write code that on first glance seems to work, but stops working once you throw some international characters in there. Python 3 avoids this by making the problem blatantly obvious: you can't convert Unicode to a byte string without an explicit encoding:
>>> bytes(u'あ')
TypeError: string argument without an encoding

string encoding and decoding?

Here are my attempts with error messages. What am I doing wrong?
string.decode("ascii", "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in
position 37: ordinal not in range(128)
string.encode('utf-8', "ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
37: ordinal not in range(128)

You can't decode a unicode, and you can't encode a str. Try doing it the other way around.

Guessing at all the things omitted from the original question, but, assuming Python 2.x the key is to read the error messages carefully: in particular where you call 'encode' but the message says 'decode' and vice versa, but also the types of the values included in the messages.
In the first example string is of type unicode and you attempted to decode it which is an operation converting a byte string to unicode. Python helpfully attempted to convert the unicode value to str using the default 'ascii' encoding but since your string contained a non-ascii character you got the error which says that Python was unable to encode a unicode value. Here's an example which shows the type of the input string:
>>> u"\xa0".decode("ascii", "ignore")
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
u"\xa0".decode("ascii", "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
In the second case you do the reverse attempting to encode a byte string. Encoding is an operation that converts unicode to a byte string so Python helpfully attempts to convert your byte string to unicode first and, since you didn't give it an ascii string the default ascii decoder fails:
>>> "\xc2".encode("ascii", "ignore")
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
"\xc2".encode("ascii", "ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

Aside from getting decode and encode backwards, I think part of the answer here is actually don't use the ascii encoding. It's probably not what you want.
To begin with, think of str like you would a plain text file. It's just a bunch of bytes with no encoding actually attached to it. How it's interpreted is up to whatever piece of code is reading it. If you don't know what this paragraph is talking about, go read Joel's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets right now before you go any further.
Naturally, we're all aware of the mess that created. The answer is to, at least within memory, have a standard encoding for all strings. That's where unicode comes in. I'm having trouble tracking down exactly what encoding Python uses internally for sure, but it doesn't really matter just for this. The point is that you know it's a sequence of bytes that are interpreted a certain way. So you only need to think about the characters themselves, and not the bytes.
The problem is that in practice, you run into both. Some libraries give you a str, and some expect a str. Certainly that makes sense whenever you're streaming a series of bytes (such as to or from disk or over a web request). So you need to be able to translate back and forth.
Enter codecs: it's the translation library between these two data types. You use encode to generate a sequence of bytes (str) from a text string (unicode), and you use decode to get a text string (unicode) from a sequence of bytes (str).
For example:
>>> s = "I look like a string, but I'm actually a sequence of bytes. \xe2\x9d\xa4"
>>> codecs.decode(s, 'utf-8')
u"I look like a string, but I'm actually a sequence of bytes. \u2764"
What happened here? I gave Python a sequence of bytes, and then I told it, "Give me the unicode version of this, given that this sequence of bytes is in 'utf-8'." It did as I asked, and those bytes (a heart character) are now treated as a whole, represented by their Unicode codepoint.
Let's go the other way around:
>>> u = u"I'm a string! Really! \u2764"
>>> codecs.encode(u, 'utf-8')
"I'm a string! Really! \xe2\x9d\xa4"
I gave Python a Unicode string, and I asked it to translate the string into a sequence of bytes using the 'utf-8' encoding. So it did, and now the heart is just a bunch of bytes it can't print as ASCII; so it shows me the hexadecimal instead.
We can work with other encodings, too, of course:
>>> s = "I have a section \xa7"
>>> codecs.decode(s, 'latin1')
u'I have a section \xa7'
>>> codecs.decode(s, 'latin1')[-1] == u'\u00A7'
True
>>> u = u"I have a section \u00a7"
>>> u
u'I have a section \xa7'
>>> codecs.encode(u, 'latin1')
'I have a section \xa7'
('\xa7' is the section character, in both
Unicode and Latin-1.)
So for your question, you first need to figure out what encoding your str is in.
Did it come from a file? From a web request? From your database? Then the source determines the encoding. Find out the encoding of the source and use that to translate it into a unicode.
s = [get from external source]
u = codecs.decode(s, 'utf-8') # Replace utf-8 with the actual input encoding
Or maybe you're trying to write it out somewhere. What encoding does the destination expect? Use that to translate it into a str. UTF-8 is a good choice for plain text documents; most things can read it.
u = u'My string'
s = codecs.encode(u, 'utf-8') # Replace utf-8 with the actual output encoding
[Write s out somewhere]
Are you just translating back and forth in memory for interoperability or something? Then just pick an encoding and stick with it; 'utf-8' is probably the best choice for that:
u = u'My string'
s = codecs.encode(u, 'utf-8')
newu = codecs.decode(s, 'utf-8')
In modern programming, you probably never want to use the 'ascii' encoding for any of this. It's an extremely small subset of all possible characters, and no system I know of uses it by default or anything.
Python 3 does its best to make this immensely clearer simply by changing the names. In Python 3, str was replaced with bytes, and unicode was replaced with str.

That's because your input string can’t be converted according to the encoding rules (strict by default).
I don't know, but I always encoded using directly unicode() constructor, at least that's the ways at the official documentation:
unicode(your_str, errors="ignore")

Convert hash.digest() to unicode

import hashlib
string1 = u'test'
hashstring = hashlib.md5()
hashstring.update(string1)
string2 = hashstring.digest()
unicode(string2)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in position 1: ordinal
not in range(128)
The string HAS to be unicode for it to be any use to me, can this be done?
Using python 2.7 if that helps...

Ignacio just gave the perfect answer. Just a complement: when you convert some string from an encoding which has chars not found in ASCII to unicode, you have to pass the encoding as a parameter:
>>> unicode("órgão")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> unicode("órgão", "UTF-8")
u'\xf3rg\xe3o'
If you cannot say what is the original encoding (UTF-8 in my example) you really cannot convert to Unicode. It is a signal that something is not pretty correct in your intentions.
Last but not least, encodings are pretty confusing stuff. This comprehensive text about them can make them clear.

The result of .digest() is a bytestring¹, so converting it to Unicode is pointless. Use .hexdigest() if you want a readable representation.
¹ Some bytestrings can be converted to Unicode, but the bytestrings returned by .digest() do not contain textual data. They can contain any byte including the null byte: they're usually not printable without using escape sequences.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Convert or strip out "illegal" Unicode characters - python

Related

Python string cent symbol conversion

Python 2.7.6 + unicode_literals - UnicodeDecodeError: 'ascii' codec can't decode byte

What happens when you call str() on a unicode string?

string encoding and decoding?

Convert hash.digest() to unicode

Categories

Resources