Python: unwanted unicode type

I've got unicode strings (from an API query) that should have been encoded as regular ascii strings (as they already contain unicode representations). How can I change the encoding without actually changing the characters being encoded?
To wit:
string = '165\xc2\xba F' # What I want
print(string)
my_string = u'165\xc2\xba F' # What I have
print(my_string)
PS I realize \xc2\xba is actually the ordinal indicator and not the degree sign (\xc2\xb0), but that's what I got.

What you have is not really "unicode": it is the byte sequence for the UTF-8 encoding of the string you want, stored code point by code point in a unicode string.
You can retrieve the text by using the "latin-1" codec to transparently transport your byte sequence into a byte string (from your unicode string) and
then decode it normally from UTF-8:
In[]: u'165\xc2\xba F'.encode("latin1").decode("utf-8")
Out[]: u'165º F'
Why the latin-1 codec is special and works in this case is described in the second paragraph of this section: https://docs.python.org/3/library/codecs.html#encodings-and-unicode
When you've got some minutes to spare, it is worth reading this nice article on Unicode to learn what codecs are and what "text in unicode" means.
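As a minimal sketch (Python 2, assuming the mis-decoded strings always follow this pattern; fix_mojibake is a hypothetical helper name, not from the original post), you could wrap the round trip in a small function:

def fix_mojibake(text):
    # The code points of `text` are really UTF-8 bytes in disguise,
    # so map them 1:1 onto bytes with latin-1 and decode as UTF-8.
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mis-decoded after all; leave the string alone.
        return text

my_string = u'165\xc2\xba F'
print(fix_mojibake(my_string))  # 165º F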

Related

Replacing non-ascii characters in an ascii encoded string

I have this code fragment (Python 2.7):
from bs4 import BeautifulSoup
content = '&nbsp; foo bar'
soup = BeautifulSoup(content, 'html.parser')
w = soup.get_text()
At this point w has a byte with value 160 in it, but its encoding is ASCII.
How do I replace all of the \xa0 bytes with another character?
I've tried:
w = w.replace(chr(160), ' ')
w = w.replace('\xa0', ' ')
but I am getting the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
And why does BS return an ASCII encoded string with an invalid character in it?
Is there a way to convert w to a 'latin1' encoded string?
At this point w has a byte with value 160 in it, but its encoding is 'ascii'.
You have a unicode string:
>>> w
u'\xa0 foo bar'
>>> type(w)
<type 'unicode'>
How do I replace all of the \xa0 bytes with another character?
>>> x = w.replace(u'\xa0', ' ')
>>> x
u' foo bar'
And why does BS return an 'ascii' encoded string with an invalid character in it?
As mentioned above, it is not an ascii encoded string, but a Unicode string instance.
Is there a way to convert w to a 'latin1' encoded string?
Sure:
>>> w.encode('latin1')
'\xa0 foo bar'
(Note this last string is an encoded byte string, not a unicode object, and its representation is not prefixed with u' like the previous unicode objects.)
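As a side note, here is a minimal sketch (Python 2 REPL, reconstructing the failure from the question) of why the original replace calls raised UnicodeDecodeError: chr(160) and '\xa0' are byte strings, so Python tries to decode them with the default ascii codec before comparing them to the unicode string:

>>> w = u'\xa0 foo bar'
>>> w.replace('\xa0', ' ')      # byte string argument: implicit ascii decode of '\xa0' fails
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
>>> w.replace(u'\xa0', u' ')    # unicode arguments: no implicit decoding, works fine
u' foo bar'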
Notes (edited):
If you are typing strings into your source files, note that encoding of source files matters. Python will assume your source files are ASCII. The command line interpreter, on the other hand, will assume you are entering strings in your default system encoding. Of course you can override all this.
Avoid latin1; use UTF-8 if possible, i.e. w.encode('utf8').
When encoding and decoding you can tell Python to ignore errors, or to replace characters that cannot be encoded with some marker character. I don't recommend ignoring encoding errors (at least without logging them), except for the hopefully rare cases when you know there are encoding errors, or when you need to encode text into a more reduced character set and must replace the code points that cannot be represented (e.g. if you need to encode 'España' into ASCII, you definitely should replace the 'ñ'). But for these cases there are IMHO better alternatives and you should look into the magical unicodedata module (see https://stackoverflow.com/a/1207479/401656); a small sketch follows after these notes.
There is a Python Unicode HOWTO: https://docs.python.org/2/howto/unicode.html
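A minimal sketch of those options (Python 2, assuming the goal is ASCII-only output; the NFKD trick is the approach from the linked answer, not something specific to this question):

# -*- coding: utf-8 -*-
import unicodedata

text = u'España'

print(text.encode('ascii', 'ignore'))    # 'Espaa'  - drops the offending character
print(text.encode('ascii', 'replace'))   # 'Espa?a' - marker character instead

# Decompose accented characters, then drop the combining marks.
ascii_text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
print(ascii_text)                        # 'Espana'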

How to convert unicode to its original character in Python

I first tried typing in a Unicode character, encoding it in UTF-8, and decoding it back. Python happily gives back the original character.
I took a look at the encoded string; it is b'\xe6\x88\x91'. I don't understand what this is; it looks like 3 hex numbers.
Then I did some research and I found that the CJK set starts from 4E00, so now I want Python to show me what this character looks like. How do I do that? Do I need to convert 4E00 to the form of something like the one above?
The text b'\xe6\x88\x91' is the representation of the bytes that are the UTF-8 encoding of the unicode code point \u6211, which is the character 我. So there is no need to convert anything, other than decoding it to a unicode string with .decode('utf-8').
You'll need to decode it using the UTF-8 encoding:
>>> print(b'\xe6\x88\x91'.decode('UTF-8'))
我
By decoding it you're turning the bytes (which is what b'...' is) into a Unicode string and that's how you can display / use the text.
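For the second part of the question (showing what code point 4E00 looks like), a minimal sketch, assuming a Python 3 interpreter as in the b'...' output above (in Python 2 you would use unichr and a u'...' literal instead):

>>> print('\u4e00')        # \u escape takes the hex code point directly
一
>>> print(chr(0x4E00))     # chr() turns an integer code point into the character
一
>>> '\u4e00'.encode('utf-8')
b'\xe4\xb8\x80'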

How do I check whether I have encoded in utf-8 successfully

Given a string
u = 'abc'
which syntax is the right one to encode into utf8?
u.encode('utf-8')
or
u.encode('utf8')
And how do I know that I have already encoded it in utf-8?
First of all you need to make a distinction if you're talking about Python 2 or Python 3 because unicode handling is one of the biggest differences between the two versions.
Python 2
unicode type contains text characters
str contains sequences of 8-bit bytes, sometimes representing text in some unspecified encoding
s.decode(encoding) takes a sequence of bytes and builds a text string out of it, once given the encoding used by the bytes. It goes from str to unicode; for example "Citt\xe0".decode("iso8859-1") will give you the text "Città" (Italian for city), and the same will happen for "Citt\xc3\xa0".decode("utf-8"). The encoding may be omitted, and in that case it means "use the default encoding".
u.encode(encoding) takes a text string and builds the byte sequence representing it in the given encoding, thus reversing the process of decode. It goes from unicode to str. As above, the encoding can be omitted.
Part of the confusion when handling unicode with Python is that the language tries to be a bit too smart and does things automatically.
For example you can call encode also on a str object, and the meaning is "first decode these bytes using the default encoding, then encode the resulting text using the specified encoding (or the default encoding if none is given)".
Similarly you can also call decode on a unicode object, meaning "first encode this text using the default encoding, then decode the resulting bytes using the specified encoding".
For example if I write
u"Citt\u00e0".decode("utf-8")
Python gives as error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
NOTE: the error is about encoding that failed, while I asked for decoding. The reason is that I asked to decode text (nonsense, because that is already "decoded"... it's text) and Python decided to first encode it using the "ascii" encoding, and that failed. IMO it would have been much better to just not have decode on unicode objects and not have encode on str objects: the error messages would have been clearer.
More confusion comes from the fact that in Python 2 str is used for raw bytes, but it's also used everywhere for text; for example string literals are str objects.
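The mirror image of the error above is calling encode on a byte string that isn't plain ASCII: Python 2 first tries an implicit ascii decode and fails. A small sketch of that gotcha (the byte string is the iso8859-1 "Città" from earlier):

>>> "Citt\xe0".encode("utf-8")
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 4: ordinal not in range(128)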
Python 3
To solve some of these issues, Python 3 made a few key changes:
str is for text and contains unicode characters, string literals are unicode text
unicode type doesn't exist any more
bytes type is used for 8-bit bytes sequences that may represent text in some unspecified encoding
For example in Python 3
'Città'.encode('iso8859-1') → b'Citt\xe0'
'Città'.encode('utf-8') → b'Citt\xc3\xa0'
Also, you cannot call decode on text strings, and you cannot call encode on byte sequences.
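A small sketch of what that looks like in a Python 3 interpreter (the exact wording of the AttributeError message may vary between versions):

>>> 'Città'.decode('utf-8')
Traceback (most recent call last):
  ...
AttributeError: 'str' object has no attribute 'decode'
>>> b'Citt\xc3\xa0'.encode('utf-8')
Traceback (most recent call last):
  ...
AttributeError: 'bytes' object has no attribute 'encode'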
Failures
Sometimes encoding text into bytes may fail, because the specified encoding cannot handle all of unicode. For example iso8859-1 cannot handle Chinese. These errors can be processed in a few ways like raising an exception (default), or replacing characters that cannot be encoded with something else.
The utf-8 encoding, however, is able to encode any unicode character, and thus encoding to utf-8 never fails. So it doesn't make sense to ask how to know if encoding text into utf-8 was done correctly: it always succeeds.
Also decoding may fail, because the sequence of bytes may make no sense in the specified encoding. For example the sequence of bytes 0x43 0x69 0x74 0x74 0xE0 cannot be interpreted as utf-8, because the byte 0xE0 is the start of a multi-byte sequence and must be followed by continuation bytes, which are missing here.
There are encodings, like iso8859-1, for which decoding cannot fail, because any byte 0..255 has a meaning as a character. Most "local encodings" are of this type: they map all 256 possible 8-bit values to some character, but only cover a tiny fraction of the unicode characters.
Decoding using iso8859-1 will never raise an error (any byte sequence is valid), but of course it can give you nonsense text if the bytes were written using another encoding.
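A small sketch of both failure modes, using the "Città" bytes from above (Python 3 syntax; in Python 2 drop the b prefixes):

>>> b'Citt\xe0'.decode('utf-8')          # invalid utf-8 sequence
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 4: unexpected end of data
>>> b'Citt\xe0'.decode('iso8859-1')      # never fails
'Città'
>>> b'Citt\xc3\xa0'.decode('iso8859-1')  # never fails, but gives nonsense for utf-8 bytes
'CittÃ\xa0'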
First solution (Python 2):
isinstance(u, unicode)
Second solution (check whether a byte string is valid UTF-8):
try:
    u.decode('utf-8')
    print "string is UTF-8, length %d bytes" % len(u)
except UnicodeError:
    print "string is not UTF-8"

Why does chardet say my UTF-8-encoded string (originally decoded from ISO-8859-1) is ASCII?

I'm trying to convert ascii characters to utf-8. This little example below still returns ascii characters:
chunk = chunk.decode('ISO-8859-1').encode('UTF-8')
print chardet.detect(chunk[0:2000])
It returns:
{'confidence': 1.0, 'encoding': 'ascii'}
How come?
Quoting from Python's documentation:
UTF-8 has several convenient properties:
It can handle any Unicode code point.
A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes.
A string of ASCII text is also a valid UTF-8 text.
All ASCII texts are also valid UTF-8 texts. (UTF-8 is a superset of ASCII)
To make it clear, check out this console session:
>>> s = 'test'
>>> s.encode('ascii') == s.encode('utf-8')
True
>>>
However, not every UTF-8-encoded string is a valid ASCII string:
>>> foreign_string = u"éâô"
>>> foreign_string.encode('utf-8')
'\xc3\xa9\xc3\xa2\xc3\xb4'
>>> foreign_string.encode('ascii') #This won't work, since it's invalid in ASCII encoding
Traceback (most recent call last):
File "<pyshell#9>", line 1, in <module>
foreign_string.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>>
So, chardet is still right. It could only tell that the data is not ascii-encoded if it contained a character outside the ASCII range.
Hope this simple explanation helps!
UTF-8 is a superset of ASCII. This means that every valid ASCII file (one that only uses the first 128 characters, not the extended characters) is also a valid UTF-8 file. Since the encoding is not stored explicitly but guessed each time, chardet defaults to the simpler character set. However, if you were to encode anything beyond the basic 128 characters (like foreign text and such) in UTF-8, it would be very likely to guess the encoding as UTF-8.
This is the reason why you got ascii:
https://github.com/erikrose/chardet/blob/master/chardet/universaldetector.py#L135
If all characters in the sequence are ASCII symbols, chardet considers the string's encoding to be ascii.
N.B.
The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as ASCII, making valid ASCII text valid UTF-8-encoded Unicode as well.
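A small demonstration of this behaviour (the sample strings are made up for illustration; chardet should typically report these encodings, though its guesses can vary for very short inputs):

>>> import chardet
>>> chardet.detect(u'hello world, plain ascii text'.encode('utf-8'))['encoding']
'ascii'
>>> chardet.detect(u"ceci n'est pas de l'ascii: déjà vu, café, naïveté".encode('utf-8'))['encoding']
'utf-8'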

Python, Encoding output to UTF-8

I have a definition that builds a string composed of UTF-8 encoded characters. The output files are opened using the 'w+' and 'utf-8' arguments.
However, when I try x.write(string) I get: UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 1: ordinal not in range(128)
I assume this is because normally, for example, you would do print(u'something'), but I need to use a variable, and the quotation marks in u'_' negate that...
Any suggestions?
EDIT: Actual code here:
source = codecs.open("actionbreak/" + target + '.csv','r', "utf-8")
outTarget = codecs.open("actionbreak/" + newTarget, 'w+', "utf-8")
x = str(actionT(splitList[0], splitList[1]))
outTarget.write(x)
Essentially, all this is supposed to be doing is building a large number of strings that look similar to this:
[日木曜 Deliverables]= CASE WHEN things = 11
THEN C ELSE 0 END
Are you using codecs.open()? Python 2.7's built-in open() does not support a specific encoding, meaning you have to manually encode non-ascii strings (as others have noted), but codecs.open() does support that and would probably be easier to drop in than manually encoding all the strings.
As you are actually using codecs.open(), going by your added code, and after a bit of looking things up myself, I suggest opening the input and/or output file with the encoding "utf-8-sig", which will automatically handle the BOM for UTF-8 (see http://docs.python.org/2/library/codecs.html#encodings-and-unicode, near the bottom of the section). I would think that would only matter for the input file, but if none of those combinations (utf-8-sig/utf-8, utf-8/utf-8-sig, utf-8-sig/utf-8-sig) work, then the most likely situation is that your input file is encoded in a different Unicode format with a BOM, as Python's default UTF-8 codec interprets BOMs as regular characters, so the input would not raise an issue but the output could.
Just noticed this, but... when you use codecs.open(), it expects a Unicode string, not an encoded one; try x = unicode(actionT(splitList[0], splitList[1])).
Your error can also occur when attempting to decode a unicode string (see http://wiki.python.org/moin/UnicodeEncodeError), but I don't think that should be happening unless actionT() or your list-splitting does something to the Unicode strings that causes them to be treated as non-Unicode strings.
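Putting those suggestions together, a minimal sketch (Python 2; the file names are placeholders, and actionT()/splitList are the asker's own names, so this only shows the open/write pattern):

# -*- coding: utf-8 -*-
import codecs

# 'utf-8-sig' strips a UTF-8 BOM on read instead of handing you u'\ufeff'.
source = codecs.open('actionbreak/input.csv', 'r', 'utf-8-sig')
outTarget = codecs.open('actionbreak/output.csv', 'w+', 'utf-8')

for line in source:
    # codecs file objects read and write unicode objects,
    # so keep the data as unicode (no str() call, no manual .encode()).
    outTarget.write(line)

source.close()
outTarget.close()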
In Python 2.x there are two types of string: byte strings and unicode strings. The first contains bytes and the second contains unicode code points. It is easy to determine which type a string is: a unicode string starts with u:
# byte string
>>> 'abc'
'abc'
# unicode string:
>>> u'abc абв'
u'abc \u0430\u0431\u0432'
The 'abc' characters are the same, because they are in the ASCII range. \u0430 is a unicode code point; it is outside the ASCII range. Code points are Python's internal representation of unicode characters, and they can't be saved to a file directly; they need to be encoded to bytes first. Here is what an encoded unicode string looks like (as it is encoded, it becomes a byte string):
>>> s = u'abc абв'
>>> s.encode('utf8')
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
This encoded string now can be written to file:
>>> s = u'abc абв'
>>> with open('text.txt', 'w+') as f:
... f.write(s.encode('utf8'))
Now, it is important to remember what encoding we used when writing to the file, because to be able to read the data we need to decode the content. Here is what the data looks like without decoding:
>>> with open('text.txt', 'r') as f:
... content = f.read()
>>> content
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
You see, we've got encoded bytes, exactly the same as in s.encode('utf8'). To decode them we need to provide the codec name:
>>> content.decode('utf8')
u'abc \u0430\u0431\u0432'
After decoding, we've got our unicode string with unicode code points back.
>>> print content.decode('utf8')
abc абв
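As an alternative to encoding and decoding by hand, a small sketch using codecs.open so the file object does the conversion for you (same idea as the codecs.open answer above):

>>> import codecs
>>> f = codecs.open('text.txt', 'w+', 'utf8')
>>> f.write(u'abc абв')           # accepts unicode, encodes on write
>>> f.close()
>>> f = codecs.open('text.txt', 'r', 'utf8')
>>> print f.read()                # returns unicode, already decoded
abc абв
>>> f.close()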
xgord is right, but for further edification it's worth noting exactly what \ufeff means. It's known as a BOM, or byte order mark, and basically it's a throwback to the early days of unicode, when people couldn't agree which way they wanted their bytes to go. Many unicode documents are therefore prefaced with a \ufeff, which, if the bytes are read in the wrong order, shows up as \ufffe.
If you hit an error on that character in the first position, you can be fairly sure the issue is how the file is being decoded (the BOM is being kept as a character instead of being handled, e.g. by utf-8-sig), and the file itself is probably still fine.
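A small sketch of the difference, using codecs.BOM_UTF8 to fake data that starts with a BOM (Python 2 REPL):

>>> import codecs
>>> data = codecs.BOM_UTF8 + 'hello'
>>> data.decode('utf-8')        # plain utf-8 keeps the BOM as a character
u'\ufeffhello'
>>> data.decode('utf-8-sig')    # utf-8-sig strips it
u'hello'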
