Problems with unicode, beautifulsoup, cld2, and python [duplicate] - python

A question about unicode in Python 2.
As far as I understand, I should always decode everything I read from outside (files, the network). decode converts the outside bytes to internal Python strings using the charset specified as a parameter. So decode("utf8") means the outside bytes are a unicode string and they will be decoded to Python strings.
I should also always encode everything I write to the outside. I specify the encoding as a parameter of the encode function, and it converts the string to the proper encoding before writing.
Are these statements right?
But sometimes when I parse HTML documents I get decode errors. As I understand it, the document is in another encoding (for example cp1252) and the error happens when I try to decode it using the utf8 codec. So the question is: how do I write a bulletproof application?
I found that chardet is a good library for guessing the encoding, and that this seems to be the only way to write a bulletproof application. Right?

... decode("utf8") means the outside bytes are a unicode string and they will be decoded to Python strings.
...
Are these statements right?
No, outside bytes are binary data, they are not a unicode string. So <str>.decode("utf8") will produce a Python unicode object by interpreting the bytes in <str> as UTF-8; it may raise an error if the bytes cannot be decoded as UTF-8.
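For illustration, here is a minimal Python 2 sketch of both the success case and the failure case (the byte strings are made-up examples):
raw = 'f\xc3\xa4hre'            # bytes read from a file or socket, UTF-8 encoded
text = raw.decode('utf8')       # -> u'f\xe4hre', a Python unicode object
print type(text)                # <type 'unicode'>
bad = 'f\xe4hre'                # the same word encoded as cp1252/latin-1 instead
try:
    bad.decode('utf8')          # raises, because 0xe4 is not valid UTF-8 here
except UnicodeDecodeError as e:
    print e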
Determining the encoding of any given document is not necessarily a simple task. You either need to have some external source of information that tells you the encoding, or you need to know something about what is in the document. For example, if you know that it is an HTML document with its encoding specified internally, then you can parse the document using an algorithm like the one outlined in the HTML Standard to find the encoding and then use that encoding to parse the document (it's a two-pass operation). However, just because an HTML document specifies an encoding does not mean that it can be decoded with that encoding. You may still get errors if the data is corrupt or if the document was not encoded properly in the first place.
There are libraries such as chardet (I see you mentioned it already) that will try to guess the encoding of a document for you (it's only a guess, not necessarily correct). But they can have their own issues such as performance, and they may not recognize the encoding of your document.
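A rough sketch of that kind of guessing with chardet (the file name is just a placeholder):
import chardet

raw = open('page.html', 'rb').read()
guess = chardet.detect(raw)          # e.g. {'encoding': 'windows-1252', 'confidence': 0.87}
if guess['encoding'] is not None:
    text = raw.decode(guess['encoding'], 'replace')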

Try wrapping the decode calls in try/except blocks:
Try decoding as utf-8;
catch the exception if it is not utf-8;
if an exception is raised, try the next encoding;
and so on.
Make it a function that returns the decoded string when (and if) it finds an encoding that doesn't raise, and returns None or an empty string when it exhausts its list of encodings and the last exception is raised. A sketch of that idea follows below.
Like the others said, the encoding should be recorded somewhere, so check that first.
This is not efficient, and frankly, given my skill level, it may be way off, but to my newbie mind it may alleviate some of the problems when dealing with unknown or undocumented encodings.
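A minimal sketch of that approach (the function name and the list of encodings to try are just placeholders):
def decode_with_fallbacks(data, encodings=('utf-8', 'cp1252', 'latin-1')):
    # Try each encoding in turn; return a unicode string, or None if all fail.
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    return None
Note that latin-1 accepts every possible byte sequence, so putting it last guarantees some result, although possibly a wrong one.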

Convert to unicode from cp437. Because cp437 assigns a character to every byte value from 0 to 255, decoding never fails; this way you get your bytes to unicode and back unchanged.
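A quick illustration of that round trip (Python 2, arbitrary example bytes):
raw = '\x00\xab\xff'                  # any bytes at all
text = raw.decode('cp437')            # always succeeds
assert text.encode('cp437') == raw    # re-encoding restores the original bytes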

Related

Bulletproof work with encoding in Python


dealing with multiple charset in python 3

I'm using python 3.3.0 in Windows 8.
requrl = urllib.request.Request(url)
response = urllib.request.urlopen(requrl)
source = response.read()
source = source.decode('utf-8')
It works fine if the website uses the utf-8 charset, but what if it uses iso-8859-1 or any other charset? In other words, I may have different website urls with different charsets.
So, how do I deal with multiple charsets?
Here is what I tried in order to resolve this issue:
b1 = b'charset=iso-8859-1'
b1 = b1.decode('iso-8859-1')
if b1 in source:
    source = source.decode('iso-8859-1')
It gave me an error: TypeError: Type str doesn't support the buffer API
So I'm assuming that it's treating b1 as a str, and that this is not the correct way! :(
Please don't tell me to manually change the charset in the source code, or ask whether I have read the Python docs!
I have already tried to dig into the Python 3 docs but still have no luck, or maybe I'm not picking the right modules/sections to read.
In Python 3, a str is actually a sequence of unicode characters (equivalent to u'mystring' syntax in Python 2). What you get back from response.read() is a byte string (a sequence of bytes).
The reason your b1 in source fails is you are trying to find a unicode character sequence inside a byte string. This makes no sense, so it fails. If you take out the line b1.decode('iso-8859-1'), it should work because you are now comparing two byte sequences.
Now back to your real underlying issue. To support multiple charsets, you need to determine the character set so you can decode it to a Unicode string. This is tricky to do. Normally you can examine the Content-Type header of the response (see the rules below). However, so many websites declare the wrong encoding in the header that we have had to develop other complicated encoding-sniffing rules for HTML. Please read that link so you realize what a difficult problem this is!
I recommend you either:
Use the requests library instead of urllib, because it automatically takes care of most unicode conversions properly. (It's also much easier to use; see the sketch after this list.) If conversion to unicode at this layer fails:
Try to pass the bytes directly to an underlying library you are using (e.g. lxml or html5lib) and let them deal with determining the encoding. They often implement the right charset-sniffing algorithms for the document type.
If neither of these work, you can get more aggressive and use libraries like chardet to detect the encoding, but in my experience people who serve their web pages this incorrectly are so incompetent that they produce mixed-encoding documents, so you will end up with garbage characters no matter what you do!
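A rough sketch of the requests route (the URL is just an example; apparent_encoding falls back to requests' bundled chardet-style detector):
import requests

url = 'http://example.com/'              # whatever page you are fetching
resp = requests.get(url)
if resp.encoding is None:                # no usable charset in the Content-Type header
    resp.encoding = resp.apparent_encoding
html = resp.text                         # a str (unicode) in Python 3
raw = resp.content                       # the undecoded bytes, if you prefer to hand them to lxml/html5lib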
Here are the rules for interpreting the charset declared in a content-type header.
With no explicit charset declared:
text/* (e.g., text/html) is in ASCII.
application/* (e.g. application/json, application/xhtml+xml) is utf-8.
With an explicit charset declared:
if type is text/html and charset is iso-8859-1, it's actually win-1252 (==CP1252)
otherwise use the charset declared.
(Note that the html5 spec willfully violates the w3c specs by looking for UTF-8 and UTF-16 byte order marks in preference to the Content-Type header. Please read that encoding detection algorithm link and see why we can't have nice things...)
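Expressed as code, the header rules above look roughly like this (a sketch, not a full header parser, and ignoring the BOM caveat in the note; the function name is just for illustration):
def charset_from_content_type(content_type):
    mime, _, params = content_type.partition(';')
    mime = mime.strip().lower()
    charset = None
    for param in params.split(';'):
        name, _, value = param.partition('=')
        if name.strip().lower() == 'charset':
            charset = value.strip().strip('"').lower()
    if charset is None:
        return 'ascii' if mime.startswith('text/') else 'utf-8'
    if mime == 'text/html' and charset == 'iso-8859-1':
        return 'cp1252'
    return charset

charset_from_content_type('text/html; charset=iso-8859-1')   # -> 'cp1252'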
The big problem here is that in many cases you can't be sure about the encoding of a webpage, even if it declares a charset. I've seen enough pages declaring one charset but actually being in another, or having a different charset in their Content-Type header than in their meta tag or xml declaration.
In such cases chardet can be helpful.
You're checking whether a str is contained within a bytes object:
>>> 'df' in b'df'
Traceback (most recent call last):
  File "<pyshell#107>", line 1, in <module>
    'df' in b'df'
TypeError: Type str doesn't support the buffer API
So, yes, it considers b1 a str, because you've decoded the bytes object into a str object with a certain encoding. Instead, you should check against the original (bytes) value of b1; it's not clear why you call .decode on it at all.
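In other words, a sketch of the fix, keeping everything as bytes until you know the encoding:
source = response.read()                  # bytes
if b'charset=iso-8859-1' in source:       # compare bytes with bytes
    text = source.decode('iso-8859-1')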
Have a look at the HTML standard, Parsing HTML documents, Determine character set (HTML5 is sufficient for our purposes).
There is an algorithm to follow. For your purposes it boils down to the following:
Check for identifying sequences for UTF-16 or UTF-8 (see provided link)
Use the character set supplied by HTTP (via the Content-Type header)
Apply the algorithm described a little later in "Prescan a byte stream to determine its encoding". This is basically searching for "charset=" in the document and extracting the value; a simplified sketch follows below.
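A very simplified version of that prescan (the real algorithm in the spec handles many more cases; the function name is just for illustration):
import re

def prescan_charset(head_bytes):
    if head_bytes.startswith(b'\xef\xbb\xbf'):
        return 'utf-8'                              # UTF-8 byte order mark
    if head_bytes[:2] in (b'\xfe\xff', b'\xff\xfe'):
        return 'utf-16'                             # UTF-16 byte order marks
    m = re.search(br'charset\s*=\s*["\']?\s*([a-zA-Z0-9._-]+)', head_bytes)
    return m.group(1).decode('ascii').lower() if m else None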

Splitting results from chardet output to collect encoding type

I am testing chardet in one of my scripts. I wanted to identify the encoding type of a result variable and chardet seems to do fine here.
So this is what I am doing:
myvar1 <-- gets its value from other functions
myvar2 = chardet.detect(myvar1) <-- to detect the encoding type of myvar1
Now when I do a print myvar2, I receive the output:
{'confidence': 1.0, 'encoding': 'ascii'}
Question 1: Can someone give a pointer on how to collect only the encoding value out of this, i.e. ascii?
Edit:
The scenario is as follows:
I am using unicode(myvar1) to write all input as unicode. But as soon as myvar1 gets a value like 0xab, unicode(myvar1) fails with the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position xxx: ordinal not in range(128)
Therefore, I am trying to:
first identify the encoding type of the input which comes in myvar1,
take the encoding type in myvar2,
decode the input (myvar1) with this encoding (myvar2) using decode() [?]
pass it on to unicode.
The input coming in is variable and not in my control.
I am sure there are other ways to do this, but I am new to this. And I am open to trying.
Any pointer please.
Many Thanks.
print myvar2['encoding']
Now for the added related info: chardet is an attempt of detecting the encoding. It isn't 100% reliable and fails sometimes. However it's the best you've got since reliable encoding detection is impossible. Just provide a way for your users to specify encoding if chardet fails for them.
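A sketch of the detect-then-decode flow described in the question (the 'replace' error handler and the utf-8 fallback are just one possible choice):
import chardet

guess = chardet.detect(myvar1)             # myvar1 is the incoming byte string
encoding = guess['encoding']               # e.g. 'ascii', 'windows-1252', ...
if encoding is None:
    encoding = 'utf-8'                     # last-resort assumption when detection fails
myvar1_unicode = myvar1.decode(encoding, 'replace')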
You can't read a text you don't know the specific encoding of. It is impossible, because the same byte sequence can mean different chars in different encodings. In other words, encodings are ambiguous. chardet is just a guess; it can and will fail in the wild. The best and only reliable way is to ask whoever generated the string which encoding was used in the first place.
EDIT:
For your scenario, the only way to stay sane is to ask whoever generated the string what encoding was used. You said that
"The input coming in is variable and not in my control."
If that's true, then you can't correctly read the input. You can't read a text input from a bunch of bytes without knowing beforehand which encoding it used. It's impossible. By definition.
Please ask whoever is generating the bytestrings to provide you the encoding used to generate the bytestrings, together with the bytestrings themselves, so you can make sense of them. Without the encoding, a bytestring is just a chunk of bytes and you can't know which chars are there. It's like having a bunch of data but not knowing how to interpret them.
Where do those bytes come from? Why don't you have control over which encoding was used to generate the data? Does the data provider know that the data they're providing is useless since you can't correctly interpret it?
I will repeat once more to make it really clear: You can't correctly, reliably read a bunch of bytes as text without knowing the encoding used to generate the bytes. There's no way it will work reliably. You need some kind of agreement with the producer so you'll know the encoding.
Second problem: as the traceback says, aBuf is an int but chardet is expecting a string. You need to find out why.
Uhhh... just worked it out: you are feeding it a single byte expressed as an integer (0xab) instead of a string ('\xab'). In any case, chardet needs much more than one byte to be able to guess an encoding; feeding any charset detector a single byte is utterly pointless.

"Broken" unicode strings encoded in UTF-8?

I have been studying unicode and its Python implementation now for two days, and I think I'm getting a glimpse of what it is about. Just to get confident, I'm asking if my assumptions for my current problems are correct.
In Django, forms give me unicode strings which I suspect to be "broken". Unicode strings in Python should be encoded in UTF-8, is that right? After entering the string "fähre" into a text field, the browser sends the string "f%c3%a4hre" in the POST request (checked via wireshark). When I retrieve the value via form.cleaned_data, I'm getting the string u'f\xa4hre' (note it is a unicode string), though. As far as I understand it, that is an ISO-8859-1-encoded unicode string, which is incorrect. The correct string should be u'f\xc3\xa4hre', which would be a UTF-8-encoded unicode string. Is that a Django bug or is there something wrong with my understanding of it?
To fix the issue, I wrote a function to apply it to any text input from Django forms:
def fix_broken_unicode(s):
    return unicode(s.encode(u'utf-8'), u'iso-8859-1')
which does
>>> fix_broken_unicode(u'f\xa4hre')
u'f\xc3\xa4hre'
That doesn't seem very elegant to me, but setting Django's settings.DEFAULT_CHARSET to 'utf-8' didn't help, nor did anything else. I am trying to work with unicode throughout the whole application so I won't get any weird errors later on, but it obviously does not suffice to mark all strings with u'...'.
Edit: Considering the answers from Dirk and sth, I will now save the strings to the database as they are. The real problem was that I was trying to urlencode these kinds of strings to use them as input for the Twitter API etc. In GET or POST requests, though, UTF-8 encoding is obviously expected which the standard urllib.urlencode() function does not process correctly (throws exceptions). Take a look at my solution in the pastebin and feel free to comment on it also.
u'f\xa4hre' is a unicode string, not encoded as anything. The unicode codepoint 0xa4 is the character ¤; it is not really important that the same character would also be encoded as the byte 0xa4 in ISO-8859-1.
The unicode string can contain any unicode characters without encoding them in some way. For example 轮渡 would be represented as u'\u8f6e\u6e21', which are simply two unicode codepoints. The UTF-8 encoding would be the much longer '\xe8\xbd\xae\xe6\xb8\xa1'.
So there is no need to fix the encoding, you are just seeing the internal representation of the unicode string.
Not exactly: after having been decoded, the unicode string is unicode, which means it may contain characters with codes beyond 255. How the interpreter represents these depends on the platform, but usually nowadays it uses character elements with a width of at least 16 bits. ISO-8859-1 is a proper subset of unicode. Thus the string u'f\xa4hre' is actually proper; the \xa4 is a rendering artifact, since Python doesn't know if (and when) it is safe to include characters with codes beyond a certain range on the console.
UTF-8 is a transport encoding, that is, a special way to write unicode data such that it can be stored in "channels" with an element width of 8 bits per character/byte. In order to compute the proper "external" (or transport) encoding of a unicode string, you'd use the encode method, passing the desired representation. It returns a properly encoded byte string (as opposed to a unicode character string).
The reverse transformation is decode which takes a byte string and an encoding name and yields a unicode character string.
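A short Python 2 sketch of that encode/decode round trip:
text = u'f\xe4hre'              # a unicode character string ("fähre")
data = text.encode('utf-8')     # transport encoding -> byte string 'f\xc3\xa4hre'
back = data.decode('utf-8')     # byte string + encoding name -> unicode again
assert back == text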

Working with foreign symbols in python

I'm parsing a JSON feed in Python and it contains this character, causing it not to validate.
Is there a way to handle these symbols? Can they be converted, or is there a tidy way to remove them?
I don't even know what this symbol is called or what causes them, otherwise I would research it myself.
EDIT: Stack Overflow is stripping the character, so here:
http://files.getdropbox.com/u/194177/symbol.jpg
It's that [?] symbol in "Classic 80s"
That probably means the text you have is in some sort of encoding, and you need to figure out which encoding and convert it to Unicode with a thetext.decode('encoding') call.
I'm not sure, but it could possibly be the [?] replacement character, meaning that the display you have there also doesn't know how to display it. That would probably mean that the data you have is incorrect, and that there is a character in there that doesn't exist in the encoding that you are supposed to use. To handle that, you call decode like this: thetext.decode('encoding', 'ignore'). There are other options than 'ignore', such as 'replace'; when encoding there are more, like 'xmlcharrefreplace'.
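A small Python 2 sketch of those error-handler options (the byte string is a made-up example containing a cp1252 curly apostrophe):
raw = 'Classic 80\x92s'
print raw.decode('utf-8', 'ignore')     # drops the bad byte: u'Classic 80s'
print raw.decode('utf-8', 'replace')    # substitutes U+FFFD: u'Classic 80\ufffds'
print raw.decode('cp1252')              # or use the right codec: u'Classic 80\u2019s'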
JSON must be encoded in one of UTF-8, UTF-16, or UTF-32. If a JSON file contains bytes which are illegal in its current encoding, it is garbage.
If you don't know which encoding it's using, you can try parsing using my jsonlib library, which includes an encoding-detector. JSON parsed using jsonlib will be provided to the programmer as Unicode strings, so you don't have to worry about encoding at all.
