I'm dealing with unknown data and trying to insert into a MySQL database using Python/Django. I'm getting some errors that I don't quite understand and am looking for some help. Here is the error.
Incorrect string value: '\xEF\xBF\xBDs m...'
My guess is that the string is not being properly converted to unicode? Here is my code for unicode conversion.
s = unicode(content, "utf-8", errors="replace")
Without the above unicode conversion, the error I get is
'utf8' codec can't decode byte 0x92 in position 31: unexpected code byte. You passed in 'Fabulous home on one of Decatur\x92s most
Any help is appreciated!
What is the original encoding? I'm assuming "cp1252", from pixelbeat's answer. In that case, you can do
>>> orig # Byte string, encoded in cp1252
'Fabulous home on one of Decatur\x92s most'
>>> uni = orig.decode('cp1252')
>>> uni # Unicode string
u'Fabulous home on one of Decatur\u2019s most'
>>> s = uni.encode('utf8')
>>> s # Correct byte string encoded in utf-8
'Fabulous home on one of Decatur\xe2\x80\x99s most'
0x92 is the right single quotation mark (a curly quote) in the Windows cp1252 encoding.
\xEF\xBF\xBD is the UTF-8 encoding of the Unicode replacement character
(which was inserted in place of the byte that isn't valid UTF-8).
So it looks like your database is not accepting valid UTF-8 data?
2 options:
1. Perhaps you should be using unicode(content,"cp1252")
2. If you want to insert UTF-8 into the DB, then you'll need to configure it appropriately; a rough sketch follows. I'll leave the details to others more knowledgeable.
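If you're using Django, here is a minimal sketch of that configuration, assuming the MySQL backend and a database created with a UTF-8 character set (the names and values below are assumptions, not from the question):

# settings.py (sketch)
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'mydb',        # hypothetical database name
        'USER': 'myuser',      # hypothetical credentials
        'PASSWORD': 'secret',
        'OPTIONS': {'charset': 'utf8'},  # have MySQLdb talk UTF-8 to the server
    }
}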
The "Fabulous..." string doesn't look like utf-8: 0x92 is above 128 and as such should be a continuation of a multi-byte character. However, in that string it appears on its own (apparently representing an apostrophe).
Related
I am running into this problem where, when I try to decode a string, I get one error, and when I try to encode it, I get another; the errors are below. Is there a permanent solution for this?
P.S. Please note that you may not be able to reproduce the encoding error with the string I provided, as I couldn't copy/paste some of the errors.
text = "sometext"
string = '\n'.join(list(set(text)))
try:
print "decode"
text = string.decode('UTF-8')
except Exception as e:
print e
text = string.encode('UTF-8')
Errors:
Error while using string.decode('UTF-8'):
'ascii' codec can't encode character u'\u2602' in position 438: ordinal not in range(128)
Error while using string.encode('UTF-8'):
Exception All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
The First Error
The code you have provided will work, as the text is a bytestring (you are using Python 2). But what you're trying to do is decode from a UTF-8 string to
an ASCII one, which is possible only if the Unicode string contains nothing but characters that have an ASCII equivalent (you can see the list of ASCII characters here). In your case, it's encountering a Unicode character (specifically ☂) which has no ASCII equivalent. You can get around this behaviour by using:
string.decode('UTF-8', 'ignore')
Which will just ignore (i.e. replace with nothing) the characters that cannot be encoded into ASCII.
The Second Error
This error is more interesting. It appears the text you are trying to encode contains either NULL bytes or certain control characters, which the library raising the exception ("All strings must be XML compatible") does not allow. Again, the code that you have provided works on its own, but something in the text that you are trying to encode violates that restriction. You can try the same trick as above:
string.encode('UTF-8', 'ignore')
Which will simply remove the offending characters, or you can look into what it is in your specific text input that is causing the problem.
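If you'd rather remove the offending characters explicitly instead of relying on 'ignore', here's a minimal Python 2 sketch (the regex and the helper name are mine, not from your code):

import re

# XML 1.0 forbids NULL and most C0 control characters
# (tab, newline and carriage return are allowed).
_XML_ILLEGAL = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def strip_xml_illegal(text):
    # Expects a unicode object; returns it with the control characters removed.
    return _XML_ILLEGAL.sub(u'', text)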
I am trying to load a text file, which contains some German letters, with
content=open("file.txt","r").read()
which results in this error message
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 26: ordinal not in range(128)
if I modify the file to contain only ASCII characters everything works as expected.
Apparently, using
content=open("file.txt","rb").read()
or
content=open("file.txt","r",encoding="utf-8").read()
both do the job.
Why is it possible to read with "binary" mode and get the same result as with utf-8 encoding?
In Python 3, using 'r' mode without specifying an encoding uses a platform-dependent default encoding, which in this case is ASCII. Using 'rb' mode reads the file as bytes and makes no attempt to interpret it as a string of characters.
ASCII is limited to characters in the range [0,128). If you try to decode a byte outside that range, you get that error.
When you read the string in as bytes, you're "widening" the acceptable range of characters to [0,256). So your 0xc3 byte (Ã in Latin-1) is now read in without error. But despite seeming to work, it's still not "correct".
If your strings are indeed unicode encoded, then the possibility exists that one will contain a multibyte character, that is, a character whose byte representation actually spans multiple bytes.
It is in this case where the difference between reading a file as a byte string and properly decoding it will be quite apparent.
A character like this: č
Will be read in as two bytes, but properly decoded, will be one character:
b = bytes('č', encoding='utf-8')   # renamed to avoid shadowing the built-in 'bytes'
print(len(b))                      # 2: two bytes in UTF-8
print(len(b.decode('utf-8')))      # 1: one character once decoded
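Applied to the file from the question, the difference looks like this (a sketch; file.txt and its UTF-8 contents are assumed):

# Read as text: bytes are decoded into characters.
with open("file.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Read as bytes: raw and uninterpreted.
with open("file.txt", "rb") as f:
    data = f.read()

# Decoding the bytes yourself yields the same characters; the byte
# count and character count differ once multi-byte characters appear.
assert data.decode("utf-8") == text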
Here are my attempts with error messages. What am I doing wrong?
string.decode("ascii", "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 37: ordinal not in range(128)
string.encode('utf-8', "ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 37: ordinal not in range(128)
You can't decode a unicode, and you can't encode a str. Try doing it the other way around.
Guessing at all the things omitted from the original question, but assuming Python 2.x, the key is to read the error messages carefully: in particular, notice where you call 'encode' but the message says 'decode' (and vice versa), and also the types of the values included in the messages.
In the first example, string is of type unicode and you attempted to decode it, which is an operation converting a byte string to unicode. Python helpfully attempted to convert the unicode value to str using the default 'ascii' encoding but, since your string contained a non-ASCII character, you got the error which says that Python was unable to encode a unicode value. Here's an example which shows the type of the input string:
>>> u"\xa0".decode("ascii", "ignore")
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
u"\xa0".decode("ascii", "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
In the second case you do the reverse, attempting to encode a byte string. Encoding is an operation that converts unicode to a byte string, so Python helpfully attempts to convert your byte string to unicode first and, since you didn't give it an ASCII string, the default ASCII decoder fails:
>>> "\xc2".encode("ascii", "ignore")
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
"\xc2".encode("ascii", "ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
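Doing it the other way around, as suggested, works with these same values (a sketch; I'm assuming the 0xc2 in your input starts the UTF-8 sequence for the non-breaking space u'\xa0' seen in the first error):

>>> u"\xa0".encode("utf-8")      # encode the unicode value
'\xc2\xa0'
>>> "\xc2\xa0".decode("utf-8")   # decode the byte string
u'\xa0'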
Aside from getting decode and encode backwards, I think part of the answer here is actually don't use the ascii encoding. It's probably not what you want.
To begin with, think of str like you would a plain text file. It's just a bunch of bytes with no encoding actually attached to it. How it's interpreted is up to whatever piece of code is reading it. If you don't know what this paragraph is talking about, go read Joel's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets right now before you go any further.
Naturally, we're all aware of the mess that created. The answer is to have, at least in memory, a standard encoding for all strings. That's where unicode comes in. I'm having trouble tracking down exactly what encoding Python uses internally, but it doesn't really matter here. The point is that you know it's a sequence of bytes that are interpreted a certain way. So you only need to think about the characters themselves, and not the bytes.
The problem is that in practice, you run into both. Some libraries give you a str, and some expect a str. Certainly that makes sense whenever you're streaming a series of bytes (such as to or from disk or over a web request). So you need to be able to translate back and forth.
Enter codecs: it's the translation library between these two data types. You use encode to generate a sequence of bytes (str) from a text string (unicode), and you use decode to get a text string (unicode) from a sequence of bytes (str).
For example:
>>> s = "I look like a string, but I'm actually a sequence of bytes. \xe2\x9d\xa4"
>>> codecs.decode(s, 'utf-8')
u"I look like a string, but I'm actually a sequence of bytes. \u2764"
What happened here? I gave Python a sequence of bytes, and then I told it, "Give me the unicode version of this, given that this sequence of bytes is in 'utf-8'." It did as I asked, and those bytes (a heart character) are now treated as a whole, represented by their Unicode codepoint.
Let's go the other way around:
>>> u = u"I'm a string! Really! \u2764"
>>> codecs.encode(u, 'utf-8')
"I'm a string! Really! \xe2\x9d\xa4"
I gave Python a Unicode string, and I asked it to translate the string into a sequence of bytes using the 'utf-8' encoding. So it did, and now the heart is just a bunch of bytes it can't print as ASCII; so it shows me the hexadecimal instead.
We can work with other encodings, too, of course:
>>> s = "I have a section \xa7"
>>> codecs.decode(s, 'latin1')
u'I have a section \xa7'
>>> codecs.decode(s, 'latin1')[-1] == u'\u00A7'
True
>>> u = u"I have a section \u00a7"
>>> u
u'I have a section \xa7'
>>> codecs.encode(u, 'latin1')
'I have a section \xa7'
('\xa7' is the section character, in both Unicode and Latin-1.)
So for your question, you first need to figure out what encoding your str is in.
Did it come from a file? From a web request? From your database? Then the source determines the encoding. Find out the encoding of the source and use that to translate it into a unicode.
s = [get from external source]
u = codecs.decode(s, 'utf-8') # Replace utf-8 with the actual input encoding
Or maybe you're trying to write it out somewhere. What encoding does the destination expect? Use that to translate it into a str. UTF-8 is a good choice for plain text documents; most things can read it.
u = u'My string'
s = codecs.encode(u, 'utf-8') # Replace utf-8 with the actual output encoding
[Write s out somewhere]
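For instance, if the destination is a file (a sketch; out.txt is a hypothetical path):

import codecs

s = codecs.encode(u'My string', 'utf-8')
with open('out.txt', 'wb') as f:   # 'wb' because s is a byte string
    f.write(s)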
Are you just translating back and forth in memory for interoperability or something? Then just pick an encoding and stick with it; 'utf-8' is probably the best choice for that:
u = u'My string'
s = codecs.encode(u, 'utf-8')
newu = codecs.decode(s, 'utf-8')
In modern programming, you probably never want to use the 'ascii' encoding for any of this. It's an extremely small subset of all possible characters, and no system I know of uses it by default.
Python 3 does its best to make this immensely clearer simply by changing the names. In Python 3, str was replaced with bytes, and unicode was replaced with str.
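In Python 3, the same round trip reads like this (a minimal sketch):

# str (text) -> bytes and back; no implicit conversions anywhere.
b = "I'm a string! Really! \u2764".encode('utf-8')
s = b.decode('utf-8')
assert s == "I'm a string! Really! \u2764"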
That's because your input string can’t be converted according to the encoding rules (strict by default).
I don't know, but I always decode using the unicode() constructor directly; at least, that's the way the official documentation shows it:
unicode(your_str, errors="ignore")
I am a newbie in python.
I have a unicode in Tamil.
When I use the sys.getdefaultencoding() I get the output as "Cp1252"
The problem is that when I use text = testString.decode("utf-8") I get the error "UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-8: character maps to <undefined>"
When I use the sys.getdefaultencoding() I get the output as "Cp1252"
Two comments on that: (1) it's "cp1252", not "Cp1252". Don't type from memory. (2) Whoever caused sys.getdefaultencoding() to produce "cp1252" should be told politely that that's not a very good idea.
As for the rest, let me guess. You have a unicode object that contains some text in the Tamil language. You try, erroneously, to decode it. Decode means to convert from a str object to a unicode object. Unfortunately you don't have a str object, and even more unfortunately you get bounced by one of the very few awkish/perlish warts in Python 2: it tries to make a str object by encoding your unicode string using the system default encoding. If that's 'ascii' or 'cp1252', the encoding will fail. That's why you get a UnicodeEncodeError instead of a UnicodeDecodeError.
Short answer: do text = testString.encode("utf-8"), if that's what you really want to do. Otherwise please explain what you want to do, and show us the result of print repr(testString).
Add this as the first line of your code:
# -*- coding: utf-8 -*-
later in your code...
text = unicode(testString,"UTF-8")
You need to know which character encoding testString is using; if it's not UTF-8, an error will occur when using decode('utf8').
Ok, I have a hardcoded string I declare like this
name = u"Par Catégorie"
I have a # -*- coding: utf-8 -*- magic header, so I am guessing it's converted to utf-8
Down the road it's outputted to xml through
xml_output.toprettyxml(indent='....', encoding='utf-8')
And I get a
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
Most of my data is in French and is output correctly in CDATA nodes, but that one hardcoded string keeps ... I don't see why an ASCII codec is called.
What's wrong?
The coding header in your source file tells Python what encoding your source is in. It's the encoding Python uses to decode the source of the unicode string literal (u"Par Catégorie") into a unicode object. The unicode object itself has no encoding; it's raw unicode data. (Internally, Python will use one of two encodings, depending on how it was configured, but Python code shouldn't worry about that.)
The UnicodeDecodeError you get means that somewhere, you are mixing unicode strings and bytestrings (normal strings.) When mixing them together (concatenating, performing string interpolation, et cetera) Python will try to convert the bytestring into a unicode string by decoding the bytestring using the default encoding, ASCII. If the bytestring contains non-ASCII data, this will fail with the error you see. The operation being done may be in a library somewhere, but it still means you're mixing inputs of different types.
Unfortunately the fact that it'll work just fine as long as the bytestrings contain just ASCII data means this type of error is all too frequent even in library code. Python 3.x solves that problem by getting rid of the implicit conversion between unicode strings (just str in 3.x) and bytestrings (the bytes type in 3.x.)
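A common way to avoid the mix (a sketch; the helper name is my own): decode bytestrings once, at the boundary where text enters your program, so that everything you hand to the XML code is already unicode.

def ensure_unicode(value, encoding='utf-8'):
    # Python 2: decode bytestrings on the way in; pass unicode through untouched.
    if isinstance(value, str):
        return value.decode(encoding)
    return value

name = ensure_unicode(u"Par Cat\xe9gorie")      # already unicode: unchanged
other = ensure_unicode("Par Cat\xc3\xa9gorie")  # UTF-8 bytes: decoded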
Wrong parameter name? From the doc, I can see the keyword argument name is supposed to be encoding and not coding.