How to encode Unicode surrogate pairs to write to a file - Python

I'm creating a compression/encryption program that utilises (nearly) all Unicode characters and then writes the data to a file. To write to the file, I need to encode the characters to bytes, but when I do so, it gives this error:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udd7d' in position 1323: surrogates not allowed
I've tried all of the built-in Python codecs, and none of them work except for 'utf-7'; however, that simply encodes the Unicode to base64, which defeats the purpose of what I'm trying to achieve.
file = open(str(file_name.capitalize()) + ".Unicode_File", "wb")
file.write(unicode_madness.encode("utf-8"))
file.close()
I expect it to write the variable unicode_madness to the file, which it does; however, sometimes the string contains a surrogate Unicode character.
To resolve this, I either need to avoid the surrogate characters (while keeping the compression lossless), or I need to find out which Unicode characters are surrogates so I can adjust the program accordingly.
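For reference, the surrogate code points occupy the range U+D800 through U+DFFF. Below is a minimal sketch of both options; the 'surrogatepass' error handler is only one possibility to consider, since it produces a file that is not strictly valid UTF-8:

unicode_madness = "A\udd7dB"  # example string containing a lone surrogate

def is_surrogate(ch):
    # True for code points reserved as UTF-16 surrogates (U+D800-U+DFFF)
    return 0xD800 <= ord(ch) <= 0xDFFF

# Option 1: drop (or remap) surrogate code points before encoding.
# Note: simply dropping them is not lossless.
safe = ''.join(ch for ch in unicode_madness if not is_surrogate(ch))
print(safe.encode("utf-8"))  # b'AB'

# Option 2: force the surrogates through; the result is not valid UTF-8
data = unicode_madness.encode("utf-8", errors="surrogatepass")
print(data)  # b'A\xed\xb5\xbdB'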
Thanks for any help!

Related

Write unmapped characters to file?

For example: the character \x80, or 128 in decimal, has no UTF-8 character assigned to it. But if I understand text files correctly, I should still be able to create a file that contains that character, even if nothing can display it. However, when I try to print an array that contains one of these characters, it writes as '\x80', and when I try to write it directly as a chr, I get the error "UnicodeEncodeError: 'charmap' codec can't encode character '\x80' in position 0: character maps to <undefined>". Am I doing something fundamentally wrong, or is there a fix I just don't know about here?
The bytes type in Python is what I should have been using for this. Though I didn't quite understand it when I was posting the question, I needed a list of single-byte variables. This is exactly what the bytes object does, and even better, it can be used exactly like a string.
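A minimal sketch of that approach (the filename is just an example):

data = bytes([0x80, 0x41, 0xFF])   # raw byte values, including 0x80
with open("raw.bin", "wb") as f:   # binary mode skips character encoding entirely
    f.write(data)
print(data[0])    # 128 -- indexing a bytes object yields an int
print(data[1:2])  # b'A' -- slicing yields a bytes object, string-like behaviour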

Python read non-ASCII text file

I am trying to load a text file, which contains some German letters with
content=open("file.txt","r").read()
which results in this error message
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 26: ordinal not in range(128)
If I modify the file to contain only ASCII characters, everything works as expected.
Apparently, using
content=open("file.txt","rb").read()
or
content=open("file.txt","r",encoding="utf-8").read()
both do the job.
Why is it possible to read with "binary" mode and get the same result as with utf-8 encoding?
In Python 3, using 'r' mode and not specifying an encoding just uses a default encoding, which in this case is ASCII. Using 'rb' mode reads the file as bytes and makes no attempt to interpret it as a string of characters.
ASCII is limited to characters in the range [0, 128). If you try to decode a byte outside that range, you get that error.
When you read the file in as bytes, you're "widening" the acceptable range of values to [0, 256). So your \xc3 byte is now read in without error. But despite it seeming to work, it's still not "correct".
If your file is indeed UTF-8 encoded, then the possibility exists that it will contain a multibyte character, that is, a character whose byte representation actually spans multiple bytes.
It is in this case where the difference between reading a file as a byte string and properly decoding it will be quite apparent.
A character like this: č
Will be read in as two bytes, but properly decoded, will be one character:
data = bytes('č', encoding='utf-8')  # renamed to avoid shadowing the built-in bytes
print(len(data))                  # 2
print(len(data.decode('utf-8')))  # 1
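To make the mode difference concrete, a short self-contained sketch (writing file.txt first so both reads can be shown):

with open("file.txt", "w", encoding="utf-8") as f:
    f.write("Bär")       # a German word with one non-ASCII letter

with open("file.txt", "rb") as f:
    print(f.read())      # b'B\xc3\xa4r' -- 4 raw bytes, no interpretation

with open("file.txt", "r", encoding="utf-8") as f:
    print(f.read())      # Bär -- 3 characters, properly decoded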

How to write a unicode object into a file in Python?

I try to write a "string" to a file and get the following error message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xcd' in position 6: ordinal not in range(128)
I tried the following methods:
print >>f, txt
print >>f, txt.decode('utf-8')
print >>f, txt.encode('utf-8')
None of them work. I have the same error message.
What is the idea behind encoding and decoding? If I have a unicode object, can I write it to the file directly, or do I need to transform it into a string?
How can I find out what encoding is used? How can I know whether it is utf-8 or ascii or something else?
ADDED
I think I have just managed to save a string into a file. print >>f, txt as well as print >>f, txt.decode('utf-8') did not work but print >>f, txt.encode('utf-8') works. I get no error message and I see Chinese characters in my file.
I recently posted another answer that addresses this very issue. Key quote:
For a good overview of the difference, read one of Joel's articles, but the gist is that bytes are, well, bytes (groups of 8 bits without any further meaning attached), whereas characters are the things that make up strings of text. Encoding turns characters into bytes, and decoding turns bytes back into characters.
In Python 2, unicode objects are character strings. Regular str objects can be either character strings or byte strings. (Pro tip: use Python 3, it makes keeping track a lot easier.)
You should be passing character strings (not byte strings) to print, but you will need to be sure that those character strings can be encoded by the codec (such as ASCII or UTF-8) associated with the destination file object f. As part of the output process, Python encodes the string for you. If the string contains characters that cannot be encoded by the file object's codec, you will get errors like the one you're seeing.
Without knowing what is in your txt object I can't be more specific.
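A minimal sketch of that advice (Python 2, using io.open so the file object itself encodes the character string; the filename and contents are just examples):

# -*- coding: utf-8 -*-
import io

txt = u'\xcd\u4e2d\u6587'   # a character string: Í followed by two Chinese characters
with io.open('out.txt', 'w', encoding='utf-8') as f:
    f.write(txt + u'\n')    # write unicode; the file object encodes it to UTF-8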
I think you need to use the codecs library:
import codecs
file = codecs.open("test.txt", "w", "utf-8")
file.write(u'\xcd')
file.close()
Works fine.
The Story of Encoding/Decoding:
In the past, there were only about a hundred characters available in computers (including upper-case and lower-case letters + numbers + some special characters), so one byte was enough to assign a unique number to each one. Assigning numbers to characters for storing in memory is called encoding. The encoding that Python 2 uses by default is named ASCII.
With the growth of computing around the world, we need many more letters and characters, so one byte is not enough and different encoding schemes appeared; Unicode, usually stored via encodings like UTF-8, is the famous one. The character you are trying to store in your file is outside the ASCII range and needs two bytes in UTF-8, so you must explicitly indicate to Python that you don't want to use the default encoding, i.e. ASCII.
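A two-line illustration (Python 2, using the character from the question):

print repr(u'\xcd'.encode('utf-8'))   # '\xc3\x8d' -- two bytes, encodes fine
# u'\xcd'.encode('ascii') would raise UnicodeEncodeError: ordinal not in range(128)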

Python String Comparison--Problems With Special/Unicode Characters

I'm writing a Python script to process some music data. It's supposed to merge two separate databases by comparing their entries and matching them up. It's almost working, but it fails when comparing strings containing special characters (i.e. accented letters). I'm pretty sure it's an ASCII vs. Unicode encoding issue, as I get the error:
"Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal"
I realize I could use regular expressions to remove the offending characters, but I'm processing a lot of data and relying too much on regexes makes my program grindingly slow. Is there a way to have Python properly compare these strings? What is going on here--is there a way to tell whether it's storing my strings as ASCII or Unicode?
EDIT 1: I'm using Python v2.6.6. After checking the types, I've discovered that one database spits out Unicode strings and one gives ASCII. So that's probably the problem. I'm trying to convert the ASCII strings from the second database to Unicode with a line like
line = unicode(f.readline().decode('latin_1').encode('utf_8'))
but this gives an error like:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 41: ordinal not in range(128)
I'm not sure why the 'ascii' codec is complaining, since I never ask for ASCII anywhere. Can anyone help?
Unicode vs Bytes
First, some terminology. There are two types of strings, encoded and decoded:
Encoded. This is what's stored on disk. To Python, it's a bunch of 0's and 1's that you might treat like ASCII, but it could be anything -- binary data, a JPEG image, whatever. In Python 2.x, this is called a "string" variable. In Python 3.x, it's more accurately called a "bytes" variable.
Decoded. This is a string of actual characters. They could be encoded to 8-bit ASCII strings, or to 32-bit Chinese characters. But until it's time to convert to an encoded variable, it's just a Unicode string of characters.
What this means to you
So here's the thing. You said you were getting one ASCII variable and one Unicode variable. That's actually not true.
You have one variable that's a string of bytes -- ones and zeros, presumably in sets of 8. This is the variable you assumed, incorrectly, to be ASCII.
You have another variable that's Unicode data -- numbers, letters, and symbols.
Before you compare the string of bytes to a Unicode string of characters, you have to make some assumptions. In your case, Python (and you) assumed that the string of bytes was ASCII encoded. That worked fine until you came across a character that wasn't ASCII -- a character with an accent mark.
So you need to find out what that string of bytes is encoded as. It might be latin1. If it is, you want to do this:
if unicode_variable == string_variable.decode('latin1')
Latin1 is basically ASCII plus some extended characters like Ç and Â.
If your data is in Latin1, that's all you need to do. But if your string of bytes is encoded in something else, you'll need to figure out what encoding that is and pass it to decode().
The bottom line is, there's no easy answer, unless you know (or make some assumptions) about the encoding of your input data.
What I would do
Try running var.decode('latin1') on your string of bytes. That will give you a Unicode variable. If that works, and the data looks correct (i.e., characters with accent marks look like they belong), roll with it.
Oh, and if latin1 doesn't parse or doesn't look right, try utf8 -- another common encoding.
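That experiment might look like this (Python 2; the byte string here is assumed to really be UTF-8 data):

byte_str = 'Caf\xc3\xa9'          # raw bytes from the second database (actually UTF-8)
print byte_str.decode('latin1')   # decodes without error, but displays as CafÃ©
print byte_str.decode('utf8')     # displays as Café, so utf8 was the right guess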
You might need to preprocess the databases and convert everything into UTF-8. My guess is that you've got Latin-1 accented characters in some entries.
As to your question, the only way to know for sure is to look. Have your script spit out those that don't compare, and look up the character codes. Or just try string.decode('latin1').encode('utf8') and see what happens.
Converting both to unicode should help:
if unicode(str1) == unicode(str2):
    print "same"
To find out whether YOU (not it) are storing your strings as str objects or unicode objects, print type(your_string).
You can use print repr(your_string) to show yourself (and us) unambiguously what is in your string.
By the way, exactly what version of Python are you using, on what OS? If Python 3.x, use ascii() instead of repr().
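For instance (Python 2; the values are made up for illustration):

s = 'caf\xc3\xa9'        # a byte string (str)
u = u'caf\xe9'           # a character string (unicode)
print type(s), repr(s)   # <type 'str'> 'caf\xc3\xa9'
print type(u), repr(u)   # <type 'unicode'> u'caf\xe9'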

Decoding not reversing unicode encoding in Django/Python

Ok, I have a hardcoded string I declare like this
name = u"Par Catégorie"
I have a # -*- coding: utf-8 -*- magic header, so I am guessing it's converted to utf-8.
Down the road it's outputted to xml through
xml_output.toprettyxml(indent='....', encoding='utf-8')
And I get a
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
Most of my data is in French and is output correctly in CDATA nodes, but that one hardcoded string keeps ... I don't see why an ascii codec is called.
What's wrong?
The coding header in your source file tells Python what encoding your source is in. It's the encoding Python uses to decode the source of the unicode string literal (u"Par Catégorie") into a unicode object. The unicode object itself has no encoding; it's raw unicode data. (Internally, Python will use one of two encodings, depending on how it was configured, but Python code shouldn't worry about that.)
The UnicodeDecodeError you get means that somewhere, you are mixing unicode strings and bytestrings (normal strings.) When mixing them together (concatenating, performing string interpolation, et cetera) Python will try to convert the bytestring into a unicode string by decoding the bytestring using the default encoding, ASCII. If the bytestring contains non-ASCII data, this will fail with the error you see. The operation being done may be in a library somewhere, but it still means you're mixing inputs of different types.
Unfortunately the fact that it'll work just fine as long as the bytestrings contain just ASCII data means this type of error is all too frequent even in library code. Python 3.x solves that problem by getting rid of the implicit conversion between unicode strings (just str in 3.x) and bytestrings (the bytes type in 3.x.)
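A small sketch of that failure mode (Python 2; the byte string stands in for whatever non-ASCII bytestring reaches the operation):

name = u"Par Cat\xe9gorie"        # a unicode string
extra = "cat\xc3\xa9gorie"        # a UTF-8 encoded byte string
try:
    combined = name + extra       # triggers an implicit ASCII decode of the bytes
except UnicodeDecodeError as e:
    print e                       # 'ascii' codec can't decode byte 0xc3 ...
combined = name + extra.decode('utf-8')   # decoding explicitly works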
Wrong parameter name? From the doc, I can see the keyword argument name is supposed to be encoding and not coding.
