Working with foreign symbols in Python

I'm parsing a JSON feed in Python and it contains this character, causing it not to validate.
Is there a way to handle these symbols? Can they be converted, or is there a tidy way to remove them?
I don't even know what this symbol is called or what causes it, otherwise I would research it myself.
EDIT: Stack Overflow is stripping the character, so here:
http://files.getdropbox.com/u/194177/symbol.jpg
It's that [?] symbol in "Classic 80s"

That probably means the text you have is in some sort of encoding, and you need to figure out which encoding it is and convert it to Unicode with a thetext.decode('encoding') call.
I'm not sure, but it could possibly be the [?] replacement character, meaning that the display you are looking at also doesn't know how to render it. That would probably mean the data is incorrect, and that there is a character in there that doesn't exist in the encoding you are supposed to use. To handle that, call decode like this: thetext.decode('encoding', 'ignore'). There are other options than 'ignore', such as 'replace', 'xmlcharrefreplace' and more.
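For example, assuming the feed's bytes are in cp1252 (just a guess for illustration; substitute whatever encoding the feed really uses), a session might look like this:

thetext = 'Classic 80\x92s'                   # byte string; 0x92 is cp1252's right single quote
fixed = thetext.decode('cp1252')              # -> u'Classic 80\u2019s'
cleaned = thetext.decode('utf-8', 'ignore')   # drops bytes that are not valid UTF-8
marked = thetext.decode('utf-8', 'replace')   # substitutes U+FFFD for bad bytes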

JSON must be encoded in one of UTF-8, UTF-16, or UTF-32. If a JSON file contains bytes which are illegal in its current encoding, it is garbage.
If you don't know which encoding it's using, you can try parsing using my jsonlib library, which includes an encoding-detector. JSON parsed using jsonlib will be provided to the programmer as Unicode strings, so you don't have to worry about encoding at all.
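If you just want to see the idea rather than pull in a library, RFC 4627 describes a simple sniffing trick: the first two characters of a JSON text are always ASCII, so the pattern of NUL bytes at the start reveals the encoding. A rough sketch of that heuristic (not jsonlib's actual code; the file name is only an example):

def sniff_json_encoding(data):
    # data is the raw byte string of the JSON document
    if data.startswith('\x00\x00\x00'):
        return 'utf-32-be'
    if data.startswith('\x00'):
        return 'utf-16-be'
    if len(data) >= 4 and data[1:4] == '\x00\x00\x00':
        return 'utf-32-le'
    if len(data) >= 2 and data[1] == '\x00':
        return 'utf-16-le'
    return 'utf-8'

raw = open('feed.json', 'rb').read()
text = raw.decode(sniff_json_encoding(raw))   # unicode from here on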

Related

Problems with unicode, beautifulsoup, cld2, and python [duplicate]

ASCII as default encoding in python instead of utf-8

I only code in English, but I have to deal with Python unicode all the time.
Sometimes it's hard to remove unicode characters from a dict.
How can I change Python's default character encoding to ASCII?
That would be the wrong thing to do. As in very wrong. To start with, it would only give you a UnicodeDecodeError instead of removing the characters. Learn to decode and encode to/from unicode properly so that you can filter out the values with rules like errors="ignore".
You can't just ignore the characters that are part of your data just because you "dislike" them. It is text, and in an interconnected world, text is not composed of only 26 glyphs.
I'd suggest you get started by reading this document: http://www.joelonsoftware.com/articles/Unicode.html
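If you really do need plain ASCII somewhere, strip or replace the offending characters explicitly instead of touching the default encoding. A small illustration (the sample string is made up):

text = u'Mot\xf6rhead \u2013 Ace of Spades'   # u'Motörhead – Ace of Spades'
text.encode('ascii', 'ignore')                # 'Motrhead  Ace of Spades'
text.encode('ascii', 'replace')               # 'Mot?rhead ? Ace of Spades'
text.encode('ascii', 'xmlcharrefreplace')     # 'Mot&#246;rhead &#8211; Ace of Spades'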

Bulletproof work with encoding in Python

This question is about unicode in Python 2.
As I understand it, I should always decode everything I read from outside (files, the network). decode converts incoming bytes to internal Python strings using the charset given as a parameter. So decode("utf8") means that outside bytes are a unicode string and they will be decoded to Python strings.
Also, I should always encode everything I write to the outside. I specify the encoding as a parameter of encode and it converts to the proper encoding and writes it out.
These statements are right, aren't they?
But sometimes when I parse HTML documents I get decode errors. As I understand it, the document is in another encoding (for example cp1252) and the error happens when I try to decode it as utf8. So the question is: how do I write a bulletproof application?
I found that there is a good library for guessing encodings, chardet, and that this is the only way to write bulletproof applications. Right?
... decode("utf8") means that outside bytes are unicode string and they will be decoded to python strings.
...
These statements are right, ain't they?
No, outside bytes are binary data, they are not a unicode string. So <str>.decode("utf8") will produce a Python unicode object by interpreting the bytes in <str> as UTF-8; it may raise an error if the bytes cannot be decoded as UTF-8.
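A quick interpreter session shows both outcomes (the byte strings are just examples; traceback abridged):

>>> 'f\xc3\xa4hre'.decode('utf8')       # valid UTF-8 bytes
u'f\xe4hre'
>>> 'f\xe4hre'.decode('utf8')           # cp1252/latin-1 bytes, not valid UTF-8
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 1: invalid continuation byte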
Determining the encoding of any given document is not necessarily a simple task. You either need to have some external source of information that tells you the encoding, or you need to know something about what is in the document. For example, if you know that it is an HTML document with its encoding specified internally, then you can parse the document using an algorithm like the one outlined in the HTML Standard to find the encoding, and then use that encoding to parse the document (it's a two-pass operation). However, just because an HTML document specifies an encoding does not mean it can actually be decoded with that encoding. You may still get errors if the data is corrupt or if the document was not encoded properly in the first place.
There are libraries such as chardet (I see you mentioned it already) that will try to guess the encoding of a document for you (it's only a guess, not necessarily correct). But they can have their own issues such as performance, and they may not recognize the encoding of your document.
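If you do fall back to chardet, the usual pattern is to hand it the raw bytes and decode with whatever it guesses (the file name and the fallback choice here are only illustrative):

import chardet

raw = open('page.html', 'rb').read()
guess = chardet.detect(raw)               # e.g. {'encoding': 'windows-1252', 'confidence': 0.87}
encoding = guess['encoding'] or 'utf-8'   # chardet may return None if it has no idea
text = raw.decode(encoding, 'replace')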
Try wrapping the decode in try/except:
Try decoding as utf-8; if that raises an exception, it isn't utf-8, so catch it and try the next encoding; and so on.
Make it a function that returns a unicode string when (and if) it finds an encoding that doesn't raise, and returns None or an empty string when it exhausts its list of encodings and the last exception is raised (a sketch follows below).
Like the others said, the encoding should be recorded somewhere, so check that first.
Not efficient, and frankly due to my skill level, may be way off, but to my newbie mind, it may alleviate some of the problems when dealing with unknown or undocumented encoding.
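A minimal sketch of that idea (the list of fallback encodings is only an example):

def decode_with_fallbacks(data, encodings=('utf-8', 'cp1252', 'latin-1')):
    for enc in encodings:
        try:
            return data.decode(enc)      # first encoding that works wins
        except UnicodeDecodeError:
            continue
    return None                          # every candidate raised
# Note: latin-1 accepts any byte sequence, so with it in the list the
# function never actually reaches the None case.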
Convert to unicode from cp437. That way you get your bytes to unicode and back intact.
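That works because cp437 assigns a character to every one of the 256 byte values, so decoding never fails and the round trip is exact (the bytes below are arbitrary):

>>> data = '\x8e\x99\xff'
>>> data.decode('cp437').encode('cp437') == data
True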

Python, ascii can't encode byte error

Hi, I have an abstract that I download from a website. Sometimes it doesn't contain special characters and it doesn't give me any problem when I try to store it in the db...
Other times I get an error, a Python "can't encode byte" error, blah blah blah...
I use .encode('ascii', 'ignore') in order to avoid those issues, but then those special characters simply don't get saved... what should I use to store every character without problems?
As Alex mentioned, you first need to determine how the file is encoded (response.encoding in scrapy). Assuming UTF-8, you'd simply .encode('utf-8'), or any of Python's other supported encodings.
You also need to make sure that the table in which you are trying to store the data accepts that particular encoding. See this question for examples on how to enable UTF-8 storage. You can also use any of the other character sets that MySQL supports in the same manner.
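Roughly, assuming a scrapy response and a MySQLdb-style cursor (the table and column names are made up):

abstract = response.body.decode(response.encoding, 'replace')   # bytes -> unicode
cursor.execute("INSERT INTO abstracts (body) VALUES (%s)",
               (abstract.encode('utf-8'),))                     # unicode -> UTF-8 bytes
# and the table itself must use a UTF-8 character set, e.g.:
#   ALTER TABLE abstracts CONVERT TO CHARACTER SET utf8mb4;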

"Broken" unicode strings encoded in UTF-8?

I have been studying unicode and its Python implementation for two days now, and I think I'm getting a glimpse of what it is about. Just to gain confidence, I'm asking whether my assumptions about my current problem are correct.
In Django, forms give me unicode strings which I suspect are "broken". Unicode strings in Python should be encoded in UTF-8, is that right? After entering the string "fähre" into a text field, the browser sends the string "f%c3%a4hre" in the POST request (checked via Wireshark). When I retrieve the value via form.cleaned_data, though, I get the string u'f\xa4hre' (note that it is a unicode string). As far as I understand it, that is an ISO-8859-1-encoded unicode string, which is incorrect. The correct string should be u'f\xc3\xa4hre', which would be a UTF-8-encoded unicode string. Is that a Django bug, or is there something wrong with my understanding?
To fix the issue, I wrote a function to apply it to any text input from Django forms:
def fix_broken_unicode(s):
    return unicode(s.encode(u'utf-8'), u'iso-8859-1')
which does
>>> fix_broken_unicode(u'f\xa4hre')
u'f\xc3\xa4hre'
That doesn't seem very elegant to me, but setting Django's settings.DEFAULT_CHARSET to 'utf-8' didn't help, nor did anything else. I am trying to work with unicode throughout the whole application so I won't get any weird errors later on, but it obviously does not suffice to mark all strings with u'...'.
Edit: Considering the answers from Dirk and sth, I will now save the strings to the database as they are. The real problem was that I was trying to urlencode these kinds of strings to use them as input for the Twitter API etc. In GET or POST requests, though, UTF-8 encoding is obviously expected which the standard urllib.urlencode() function does not process correctly (throws exceptions). Take a look at my solution in the pastebin and feel free to comment on it also.
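The pastebin isn't reproduced here, but the usual workaround is simply to encode each unicode value to UTF-8 bytes before handing the dict to urllib.urlencode:

import urllib

params = {'status': u'f\xe4hre'}
safe = dict((k, v.encode('utf-8')) for k, v in params.items())
print urllib.urlencode(safe)      # status=f%C3%A4hre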
u'f\xa4hre' is a unicode string, not encoded as anything; it simply contains the codepoint U+00A4. It is not really important that the same value would also show up as the byte 0xa4 in an ISO-8859-1-encoded byte string.
The unicode string can contain any unicode characters without encoding them in some way. For example 轮渡 would be represented as u'\u8f6e\u6e21', which are simply two unicode codepoints. The UTF-8 encoding would be the much longer '\xe8\xbd\xae\xe6\xb8\xa1'.
So there is no need to fix the encoding, you are just seeing the internal representation of the unicode string.
Not exactly: after having been decoded, the string is unicode, which means it may contain characters with codes beyond 255. How the interpreter represents these internally depends on the platform, but nowadays it usually uses character elements at least 16 bits wide. ISO-8859-1 is a proper subset of unicode, so the string u'f\xa4hre' is actually fine -- the \xa4 escape is just a rendering artifact, since Python doesn't know whether it is safe to print characters beyond a certain range on your console.
UTF-8 is a transport encoding, that is, a special way to write unicode data so that it can be stored in "channels" whose elements are 8 bits wide. To compute the proper "external" (or transport) encoding of a unicode string, you use the encode method, passing the desired representation as a parameter. It returns a properly encoded byte string (as opposed to a unicode character string).
The reverse transformation is decode, which takes a byte string and an encoding name and yields a unicode character string.
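In interpreter terms:

>>> u = u'f\xe4hre'               # unicode character string
>>> b = u.encode('utf-8')         # transport encoding -> byte string
>>> b
'f\xc3\xa4hre'
>>> b.decode('utf-8') == u        # reverse transformation
True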
