I am trying to load a text file, which contains some German letters with
content=open("file.txt","r").read()
which results in this error message
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 26: ordinal not in range(128)
if I modify the file to contain only ASCII characters everything works as expected.
Apparently, using
content=open("file.txt","rb").read()
or
content=open("file.txt","r",encoding="utf-8").read()
both do the job.
Why is it possible to read with "binary" mode and get the same result as with utf-8 encoding?
In Python 3, opening a file in 'r' mode without specifying an encoding uses the platform default (what locale.getpreferredencoding(False) returns), which in your case is ASCII. Opening in 'rb' mode reads the file as raw bytes and makes no attempt to interpret them as a string of characters.
ASCII is limited to characters in the range [0, 128). If you try to decode a byte outside that range, you get exactly that error.
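You can check which default your platform falls back to when no encoding is given (the exact value varies by system; 'ANSI_X3.4-1968' is one spelling of ASCII, and most modern systems report 'UTF-8'):
>>> import locale
>>> locale.getpreferredencoding(False)
'ANSI_X3.4-1968'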
When you read the file in as bytes, you're "widening" the acceptable range of values to [0, 256). So the byte 0xc3 is now read in without error. But despite seeming to work, this is still not "correct": you have a byte string, not text.
If your file is indeed UTF-8 encoded, it may well contain a multibyte character, that is, a character whose byte representation spans multiple bytes.
It is in this case that the difference between reading a file as a byte string and properly decoding it becomes quite apparent.
A character like this: č
Will be read in as two bytes, but properly decoded, will be one character:
data = bytes('č', encoding='utf-8')
print(len(data))                  # 2 -- two bytes
print(len(data.decode('utf-8')))  # 1 -- one character
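To answer the question directly: binary mode hands you raw bytes, and decoding those bytes yourself gives you exactly what text mode with encoding="utf-8" would have given you. A minimal sketch, assuming file.txt is valid UTF-8 with \n line endings (text mode additionally translates \r\n newlines, which is another difference from binary mode):
# Reading bytes and decoding manually...
raw = open("file.txt", "rb").read()
text_manual = raw.decode("utf-8")
# ...is equivalent to letting open() decode for you:
text_auto = open("file.txt", "r", encoding="utf-8").read()
assert text_manual == text_auto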
Related
I want to read the lines of a text file that include emojis and non-ASCII characters and finally print them out. The problem is that I can either print the emoji glyph correctly or the non-ASCII character (e.g. ü), but not both.
Line in text-file (with UTF-8 format):
I am tired. - Ich bin müde \U0001F4A4
Code to read:
with open(path_txt,"r", encoding="unicode_escape") as file:
content = file.readlines()
print(content[0])
With encoding="unicode_escape" I get the sleep-emoji and some cryptic character for "ü".
With encoding="utf-8" (or default) it prints the unicode sequence \U0001F4A4 for the emoji and the correct "ü".
In the second case \U... gets double escaped to \U. I thougt str.replace("\U", "\U") could be a workaround but ERROR:
'unicodeescape' codec can't decode bytes in position 0-1: truncated \UXXXXXXXX escape
I also tried encoding="raw_unicode_escape".
As a beginner I don't understand the whole Unicode topic. Thanks for your help/workarounds!!
Similar/Same Problem here (04/2014): https://bugs.python.org/issue21331
It seems that the content is a mixture of escape sequences (for the emoji) and UTF-8-encoded characters (for "ü").
It's not entirely clear from your post, but I assume that if you read the file in binary mode (open(path, 'rb')) and printed the first line, you would see this:
b'm\xc3\xbcde \\U0001f4a4'
This means that "ü" was encoded with UTF-8, but the emoji was escaped.
Note: you see escape sequences for "ü" too (\xc3\xbc), but that's just the repr of the byte string.
Try len(b'\xc3') and you'll see that this is actually a length-1 byte string. b'\\U0001f4a4' on the other hand is really an escape sequence with length 10.
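Concretely:
>>> len(b'\xc3')
1
>>> len(b'\\U0001f4a4')
10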
Now the "unicode-escape" sequence does not expect exactly this format.
It interprets unescaped non-ASCII characters as Latin-1 – that's why you see garbled characters instead of "ü" when using this codec:
>>> b'm\xc3\xbcde \\U0001f4a4'.decode('unicode-escape')
'mÃ¼de 💤'
But if "unicode-escape" wants Latin-1, we can give it!
First, we decode with UTF-8 to get "ü" right:
>>> b'm\xc3\xbcde \\U0001f4a4'.decode('utf8')
'müde \\U0001f4a4'
This doesn't touch the emoji escape, since it's all ASCII.
Characters from the ASCII range are encoded identically for Latin-1 and UTF-8 (and ASCII).
Now we encode with Latin-1:
>>> b'm\xc3\xbcde \\U0001f4a4'.decode('utf8').encode('latin1')
b'm\xfcde \\U0001f4a4'
and this is something the "unicode-escape" codec understands:
>>> b'm\xc3\xbcde \\U0001f4a4'.decode('utf8').encode('latin1').decode('unicode-escape')
'müde 💤'
In your setup, you can defer the first decode step to the internal processing of open():
with open(path_txt, "r", encoding="utf-8") as file:
for line in file:
line = line.encode('latin1').decode('unicode-escape')
# do something with line
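Putting it all together as a runnable sketch (the file name and sample content here are assumptions for illustration):
# Write the sample line: "ü" as real text, the emoji as a literal escape sequence
sample = 'I am tired. - Ich bin m\u00fcde \\U0001F4A4'
with open("sample.txt", "w", encoding="utf-8") as f:
    f.write(sample + "\n")

# Read it back and undo the escaping
with open("sample.txt", "r", encoding="utf-8") as f:
    for line in f:
        line = line.encode('latin1').decode('unicode-escape')
        print(line)  # I am tired. - Ich bin müde 💤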
I am running into a problem where, when I try to decode a string, I get one error, and when I try to encode it, I get another (errors below). Is there a permanent solution for this?
P.S. Please note that you may not be able to reproduce the encoding error with the string I provided, as I couldn't copy/paste some of the characters.
text = "sometext"
string = '\n'.join(list(set(text)))
try:
    print "decode"
    text = string.decode('UTF-8')
except Exception as e:
    print e
    text = string.encode('UTF-8')
Errors:
Error while using string.decode('UTF-8'):
'ascii' codec can't encode character u'\u2602' in position 438: ordinal not in range(128)
Error while using string.encode('UTF-8'):
Exception All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
The First Error
The code you have provided will work, as text is a byte string (you are using Python 2). But what you're effectively trying to do is convert between a UTF-8 string and an ASCII one, which is only possible if the Unicode string contains nothing but characters that have an ASCII equivalent (you can see the list of ASCII characters here). In your case, it encounters a Unicode character (specifically ☂, u'\u2602') which has no ASCII equivalent. You can get around this behaviour by using:
string.decode('UTF-8', 'ignore')
which will simply ignore (i.e. drop) the characters that cannot be converted.
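For illustration, this is how errors='ignore' behaves (shown here in Python 3 syntax, though the question is about Python 2):
>>> 'caf\u2602'.encode('ascii', 'ignore')
b'caf'
>>> b'caf\xe2\x98\x82'.decode('ascii', 'ignore')
'caf'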
The Second Error
This error is more interesting. It appears that the text you are trying to encode into UTF-8 contains either NULL bytes or certain control characters, which whatever is consuming the string does not allow (note that the message complains about XML compatibility, so the restriction comes from that consumer rather than from UTF-8 itself). Again, the code you have actually provided works, but something in the text you are trying to encode violates that constraint. You can try the same trick as above:
string.encode('UTF-8', 'ignore')
which will simply remove the offending characters, or you can look into what it is in your specific input that is causing the problem.
I try to write a "string" to a file and get the following error message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xcd' in position 6: ordinal not in range(128)
I tried the following methods:
print >>f, txt
print >>f, txt.decode('utf-8')
print >>f, txt.encode('utf-8')
None of them work. I have the same error message.
What is the idea behind encoding and decoding? If I have a unicode object, can I write it to the file directly, or do I need to transform it into a string?
How can I find out what encoding is used? How can I know whether it is utf-8 or ascii or something else?
ADDED
I think I have just managed to save a string into a file. print >>f, txt as well as print >>f, txt.decode('utf-8') did not work, but print >>f, txt.encode('utf-8') does. I get no error message and I see Chinese characters in my file.
I recently posted another answer that addresses this very issue. Key quote:
For a good overview of the difference, read one of Joel's articles, but the gist is that bytes are, well, bytes (groups of 8 bits without any further meaning attached), whereas characters are the things that make up strings of text. Encoding turns characters into bytes, and decoding turns bytes back into characters.
In Python 2, unicode objects are character strings. Regular str objects can be either character strings or byte strings. (Pro tip: use Python 3, it makes keeping track a lot easier.)
You should be passing character strings (not byte strings) to print, but you will need to be sure that those character strings can be encoded by the codec (such as ASCII or UTF-8) associated with the destination file object f. As part of the output process, Python encodes the string for you. If the string contains characters that cannot be encoded by the file object's codec, you will get errors like the one you're seeing.
Without knowing what is in your txt object I can't be more specific.
I think you need to use the codecs module:
import codecs

f = codecs.open("test.txt", "w", "utf-8")  # a file object that encodes on write
f.write(u'\xcd')
f.close()
Works fine.
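For reference, in Python 3 the built-in open() accepts an encoding directly, so codecs.open() is no longer needed:
# Python 3 equivalent: open() takes an encoding parameter itself
with open("test.txt", "w", encoding="utf-8") as f:
    f.write('\xcd')  # 'Í' is written as the UTF-8 bytes 0xC3 0x8D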
The Story of Encoding/Decoding:
In the past, only about a hundred characters were available on computers (upper-case and lower-case letters + digits + some special characters), so one byte was enough to assign a unique number to each of them. Assigning numbers to characters for storage in memory is called encoding. The one-byte encoding that Python 2 uses by default is named ASCII.
As computers spread around the world, we needed more letters and characters. One byte was no longer enough, so different encoding schemes appeared; Unicode (with encodings such as UTF-8) is the most famous. The character you are trying to store in your file is a Unicode character that needs 2 bytes in UTF-8, so you must explicitly tell Python that you don't want to use the default encoding, i.e. ASCII (because you need 2 bytes for this character).
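You can verify the two-byte claim directly (Python 3 syntax):
>>> '\xcd'.encode('utf-8')
b'\xc3\x8d'
>>> len('\xcd'.encode('utf-8'))
2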
Given a string
u = 'abc'
which syntax is the right one to encode into utf8?
u.encode('utf-8')
or
u.encode('utf8')
And how do I know that I have already encoded it in utf-8?
First of all, you need to make a distinction between Python 2 and Python 3, because unicode handling is one of the biggest differences between the two versions. (As for the spelling: 'utf-8' and 'utf8' are aliases for the same codec, so the two calls are equivalent.)
Python 2
unicode type contains text characters
str contains sequences of 8-bit bytes, sometimes representing text in some unspecified encoding
s.decode(encoding) takes a sequence of bytes and builds a text string out of it, given the encoding used by those bytes. It goes from str to unicode; for example "Citt\xe0".decode("iso8859-1") will give you the text u"Città" (Italian for "city"), and the same will happen for "Citt\xc3\xa0".decode("utf-8"). The encoding may be omitted, in which case it means "use the default encoding".
u.encode(encoding) takes a text string and builds the byte sequence representing it in the given encoding, thus reversing the processing of decode. It goes from unicode to str. As above, the encoding can be omitted (see the session below).
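For example, in a Python 2 session:
>>> "Citt\xc3\xa0".decode("utf-8")
u'Citt\xe0'
>>> u"Citt\xe0".encode("iso8859-1")
'Citt\xe0'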
Part of the confusion when handling unicode with Python is that the language tries to be a bit too smart and does things automatically.
For example, you can call encode on a str object too, and the meaning is "first decode these bytes using the default encoding, then encode the resulting text using the specified encoding (or the default if none is given)".
Similarly, you can call decode on a unicode object, meaning "first encode this text using the default encoding, then decode the resulting bytes using the specified encoding".
For example if I write
u"Citt\u00e0".decode("utf-8")
Python raises this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 3: ordinal not in range(128)
NOTE: the error is about encoding having failed, even though I asked for decoding. The reason is that I asked to decode text (nonsense, because that is already "decoded": it's text), so Python decided to first encode it using the "ascii" codec, and that failed. IMO it would have been much better to simply not have decode on unicode objects nor encode on str objects: the error messages would have been clearer.
Adding to the confusion, in Python 2 str is used for unencoded bytes, but it's also used everywhere for text; for example, string literals are str objects.
Python 3
To solve some of these issues, Python 3 made a few key changes:
str is for text and contains unicode characters, string literals are unicode text
unicode type doesn't exist any more
bytes type is used for 8-bit bytes sequences that may represent text in some unspecified encoding
For example in Python 3
'Città'.encode('iso8859-1') → b'Citt\xe0'
'Città'.encode('utf-8') → b'Citt\xc3\xa0'
Also, you cannot call decode on text strings, and you cannot call encode on byte sequences.
Failures
Sometimes encoding text into bytes may fail, because the specified encoding cannot handle all of Unicode. For example, iso8859-1 cannot handle Chinese. These errors can be handled in a few ways, such as raising an exception (the default) or replacing the characters that cannot be encoded with something else.
The utf-8 encoding, however, is able to encode any Unicode character, so encoding to utf-8 never fails. Thus it doesn't make sense to ask how to know whether encoding text into utf-8 was done correctly: it always succeeds.
Decoding may also fail, because the sequence of bytes may make no sense in the specified encoding. For example, the byte sequence 0x43 0x69 0x74 0x74 0xE0 cannot be interpreted as utf-8, because 0xE0 starts a multi-byte sequence and must be followed by continuation bytes, which are missing here.
There are encodings, like iso8859-1, where decoding cannot fail, because every byte 0..255 has a meaning as a character. Most "local encodings" are of this type: they map all 256 possible 8-bit values to some character, but cover only a tiny fraction of the Unicode characters.
Decoding with iso8859-1 will therefore never raise an error (any byte sequence is valid), but of course it can give you nonsense text if the bytes were written using another encoding.
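Both behaviors in one Python 3 session:
>>> b'\x43\x69\x74\x74\xe0'.decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 4: unexpected end of data
>>> b'\x43\x69\x74\x74\xe0'.decode('iso8859-1')
'Città'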
First solution:
isinstance(u, unicode)
Second solution:
try:
    u.decode('utf-8')
    print "string is UTF-8, length %d bytes" % len(u)
except UnicodeError:
    print "string is not UTF-8"
I'm dealing with unknown data and trying to insert it into a MySQL database using Python/Django. I'm getting some errors that I don't quite understand and am looking for some help. Here is the error:
Incorrect string value: '\xEF\xBF\xBDs m...'
My guess is that the string is not being properly converted to unicode? Here is my code for the unicode conversion:
s = unicode(content, "utf-8", errors="replace")
Without the above unicode conversion, the error I get is
'utf8' codec can't decode byte 0x92 in position 31: unexpected code byte. You passed in 'Fabulous home on one of Decatur\x92s most
Any help is appreciated!
What is the original encoding? I'm assuming "cp1252", from pixelbeat's answer. In that case, you can do
>>> orig # Byte string, encoded in cp1252
'Fabulous home on one of Decatur\x92s most'
>>> uni = orig.decode('cp1252')
>>> uni # Unicode string
u'Fabulous home on one of Decatur\u2019s most'
>>> s = uni.encode('utf8')
>>> s # Correct byte string encoded in utf-8
'Fabulous home on one of Decatur\xe2\x80\x99s most'
0x92 is the right single curly quote in the Windows cp1252 encoding.
\xEF\xBF\xBD is the UTF-8 encoding of the Unicode replacement character (which was inserted in place of the erroneous cp1252 byte).
So it looks like your database is not accepting the (valid) UTF-8 data?
Two options:
1. Perhaps you should be using unicode(content, "cp1252").
2. If you want to insert UTF-8 into the DB, then you'll need to configure it appropriately. I'll leave that answer to others more knowledgeable.
The "Fabulous..." string doesn't look like utf-8: 0x92 is above 128 and as such should be a continuation of a multi-byte character. However, in that string it appears on its own (apparently representing an apostrophe).