Dealing with multiple charsets in Python 3

I'm using Python 3.3.0 on Windows 8.
import urllib.request

requrl = urllib.request.Request(url)
response = urllib.request.urlopen(requrl)
source = response.read()
source = source.decode('utf-8')
This works fine if the website uses the utf-8 charset, but what if it uses iso-8859-1 or any other charset? In other words, I may have different website URLs with different charsets.
So, how do I deal with multiple charsets?
Now let me tell you what I have tried to resolve this issue:
b1 = b'charset=iso-8859-1'
b1 = b1.decode('iso-8859-1')
if b1 in source:
    source = source.decode('iso-8859-1')
It gave me an error like TypeError: Type str doesn't support the buffer API
So, I'm assuming it's treating b1 as a string, and that this is not the correct way! :(
Please don't tell me to manually change the charset in the source code, or ask whether I have read the Python docs!
I have already dug into the Python 3 docs but still have no luck, or maybe I'm not picking the correct modules/content to read!

In Python 3, a str is actually a sequence of unicode characters (equivalent to u'mystring' syntax in Python 2). What you get back from response.read() is a byte string (a sequence of bytes).
The reason your b1 in source check fails is that you are trying to find a unicode character sequence inside a byte string. This makes no sense, so it fails. If you take out the line b1 = b1.decode('iso-8859-1'), it should work, because you are then comparing two byte sequences.
Now back to your real underlying issue. To support multiple charsets, you need to determine the character set so you can decode it to a Unicode string. This is tricky to do. Normally you can examine the Content-Type header of the response. (See the rules below.) However, so many websites declare the wrong encoding in the header that complicated encoding-sniffing rules have had to be developed for HTML. Please read that link so you realize what a difficult problem this is!
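A minimal sketch of the header-based approach, assuming the server declares a charset at all (the fallback to utf-8 is my own choice, not part of the original answer; url is whatever URL you are fetching, as in the question):
import urllib.request

response = urllib.request.urlopen(url)
raw = response.read()
# get_content_charset() parses the charset= parameter out of the Content-Type header
declared = response.headers.get_content_charset()
source = raw.decode(declared or 'utf-8')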
I recommend you either:
Use the requests library instead of urllib, because it automatically takes care of most unicode conversions properly (see the sketch after this list). It's also much easier to use. If conversion to unicode at this layer fails:
Try to pass the bytes directly to an underlying library you are using (e.g. lxml or html5lib) and let them deal with determining the encoding. They often implement the right charset-sniffing algorithms for the document type.
If neither of these work, you can get more aggressive and use libraries like chardet to detect the encoding, but in my experience people who serve their web pages this incorrectly are so incompetent that they produce mixed-encoding documents, so you will end up with garbage characters no matter what you do!
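For the first option, a rough sketch with requests (a third-party library; the attributes used below are part of its documented API, but treat this as an illustration rather than the original poster's code):
import requests

r = requests.get(url)
print(r.encoding)   # the encoding requests inferred from the response headers
text = r.text       # decoded for you using that encoding
raw = r.content     # the undecoded bytes, if you want to hand them to lxml/html5lib instead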
Here are the rules for interpreting the charset declared in a content-type header.
With no explicit charset declared:
text/* (e.g., text/html) is in ASCII.
application/* (e.g. application/json, application/xhtml+xml) is utf-8.
With an explicit charset declared:
if type is text/html and charset is iso-8859-1, it's actually win-1252 (==CP1252)
otherwise use the charset declared.
(Note that the html5 spec willfully violates the w3c specs by looking for UTF8 and UTF16 byte markers in preference to the Content-Type header. Please read that encoding detection algorithm link and see why we can't have nice things...)
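If you do end up interpreting the Content-Type header yourself, the rules above might look roughly like this (a hedged sketch; the helper name and return values are mine, not from any standard library):
def charset_from_content_type(mime_type, declared_charset=None):
    # Rules as listed above: adjust iso-8859-1 for text/html, otherwise trust the
    # declared charset, and fall back to per-type defaults when nothing is declared.
    if declared_charset:
        if mime_type == 'text/html' and declared_charset.lower() == 'iso-8859-1':
            return 'windows-1252'
        return declared_charset
    if mime_type.startswith('text/'):
        return 'ascii'
    if mime_type.startswith('application/'):
        return 'utf-8'
    return None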

The big problem here is that in many cases you can't be sure about the encoding of a webpage, even if it declares a charset. I've seen enough pages declaring one charset but actually being in another, or having a different charset in their Content-Type header than in their meta tag or XML declaration.
In such cases chardet can be helpful.
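A minimal example of that fallback (chardet is a third-party package; raw_bytes stands for whatever byte string you read, and the confidence threshold is an arbitrary illustration):
import chardet

guess = chardet.detect(raw_bytes)   # e.g. {'encoding': 'windows-1252', 'confidence': 0.84, ...}
if guess['encoding'] and guess['confidence'] > 0.5:
    text = raw_bytes.decode(guess['encoding'], errors='replace')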

You're checking whether a str is contained within a bytes object:
>>> 'df' in b'df'
Traceback (most recent call last):
File "<pyshell#107>", line 1, in <module>
'df' in b'df'
TypeError: Type str doesn't support the buffer API
So, yes, it considers b1 a str, because you've decoded the bytes object into a str object with a certain encoding. Instead, you should check against the original (bytes) value of b1; it's not clear why you call .decode on it at all.
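In other words, compare bytes against bytes (a sketch of the corrected check only, not a general solution to the charset problem):
source = response.read()                 # bytes
if b'charset=iso-8859-1' in source:      # bytes-in-bytes comparison works fine
    text = source.decode('iso-8859-1')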

Have a look at the HTML standard, Parsing HTML documents, Determine character set (HTML5 is sufficient for our purposes).
There is an algorithm to follow. For your purposes it boils down to the following:
Check for identifying sequences for UTF-16 or UTF-8 (see provided link)
Use the character set supplied by HTTP (via the Content-Type header)
Apply the algorithm described a little later in "Prescan a byte stream to determine its encoding". This is basically searching for "charset=" in the document and extracting the value (a rough sketch follows).
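A very rough sketch of that last step (the real prescan algorithm in the spec handles quoting, meta attributes and more; this only illustrates the idea):
import re

def prescan_charset(raw, limit=1024):
    # Look for a charset= declaration near the start of the byte stream.
    m = re.search(rb'charset\s*=\s*["\']?([A-Za-z0-9_-]+)', raw[:limit], re.IGNORECASE)
    return m.group(1).decode('ascii').lower() if m else None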

Related

Problems with unicode, beautifulsoup, cld2, and python [duplicate]

This question is about unicode in Python 2.
As far as I know, I should always decode everything I read from outside (files, the net). decode converts the outer bytes to internal Python strings using the charset specified in the parameters. So decode("utf8") means that the outside bytes are a unicode string and they will be decoded to Python strings.
Also, I should always encode everything I write to the outside. I specify the encoding in the parameters of the encode function and it converts to the proper encoding and writes it.
These statements are right, aren't they?
But sometimes when I parse HTML documents I get decode errors. As I understand it, the document is in another encoding (for example cp1252) and the error happens when I try to decode it using the utf8 encoding. So the question is: how do I write a bulletproof application?
I found that there is a good library for guessing the encoding, chardet, and that this is the only way to write bulletproof applications. Right?
... decode("utf8") means that outside bytes are unicode string and they will be decoded to python strings.
...
These statements are right, ain't they?
No, outside bytes are binary data, they are not a unicode string. So <str>.decode("utf8") will produce a Python unicode object by interpreting the bytes in <str> as UTF-8; it may raise an error if the bytes cannot be decoded as UTF-8.
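For example (Python 2, since that's what the question is about; the byte values are just an illustration):
data = '\xe2\x82\xac'          # three bytes: the UTF-8 encoding of the euro sign
text = data.decode('utf8')     # -> u'\u20ac', a unicode object
# data.decode('ascii')         # would raise UnicodeDecodeError: these bytes are not valid ASCII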
Determining the encoding of any given document is not necessarily a simple task. You either need to have some external source of information that tells you the encoding, or you need to know something about what is in the document. For example, if you know that it is an HTML document with its encoding specified internally, then you can parse the document using an algorithm like the one outlined in the HTML Standard to find the encoding and then use that encoding to parse the document (it's a two-pass operation). However, just because an HTML document specifies an encoding, it does not mean that it can be decoded with that encoding. You may still get errors if the data is corrupt or if the document was not encoded properly in the first place.
There are libraries such as chardet (I see you mentioned it already) that will try to guess the encoding of a document for you (it's only a guess, not necessarily correct). But they can have their own issues such as performance, and they may not recognize the encoding of your document.
Try wrapping your decode calls in try/except blocks:
Try decoding as utf-8.
Catch the exception if it's not utf-8.
If an exception is raised, try the next encoding,
etc., etc...
Make it a function that returns a str when (and if) it finds an encoding that doesn't raise, and returns None or an empty str when it exhausts its list of encodings and the last exception is raised (a sketch follows below).
Like the others said, the encoding should be recorded somewhere, so check that first.
Not efficient, and, frankly, given my skill level it may be way off, but to my newbie mind it may alleviate some of the problems when dealing with unknown or undocumented encodings.
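A sketch of that idea (the list of candidate encodings is purely illustrative; put whatever you actually expect first):
def decode_with_fallbacks(raw, encodings=('utf-8', 'cp1252', 'iso-8859-1')):
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Note: iso-8859-1 can decode any byte sequence, so as written the loop never
    # actually reaches this point; drop it from the list if you want a real failure.
    return None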
Convert to unicode from cp437. Since cp437 assigns a character to every possible byte value, decoding never fails, and you can round-trip your bytes to unicode and back.

Python - How to get accented characters correct? (BeautifulSoup)

I've written some Python code with BeautifulSoup to fetch HTML, but I can't figure out how to get accented characters to come out correctly.
The charset of the HTML is this:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
I've this python code:
some_text = soup_ad.find("span", { "class" : "h1_span" }).contents[0]
some_text.decode('iso-8859-1','ignore')
And I get this:
Calções
What am I doing wrong here? Any clues?
Best Regards,
The question here is about where you "get this".
If that's the output you see in your terminal, it may well be that your terminal expects a different encoding!
You can try this when using print:
import sys
# Encode to whatever the terminal expects, falling back to the filesystem encoding:
outenc = sys.stdout.encoding or sys.getfilesystemencoding()
print t.decode("iso-8859-1").encode(outenc)
As bernie points out, BS uses Unicode internally.
For BS3:
Beautiful Soup Gives You Unicode, Dammit
By the time your document is parsed, it has been transformed into Unicode. Beautiful Soup stores only Unicode strings in its data structures.
For BS4, the docs explain a bit more clearly when this happens:
You can pass in a string or an open filehandle… First, the document is converted to Unicode, and HTML entities are converted to Unicode characters…
In other words, it decodes the data immediately. So, if you're getting mojibake, you have to fix it before it gets into BS, not after.
The input to the BeautifulSoup constructor can take 8-bit byte strings or files, and try to figure out the encoding. See Encodings for details. You can check whether it guessed right by printing out soup.original_encoding. If it didn't guess ISO-8859-1 or a synonym, your only option is to make it explicit: decode the string before passing it in, open the file in Unicode mode with an encoding, etc.
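A short sketch of that, assuming BS4 (soup.original_encoding and the from_encoding argument are documented BS4 features; 'page.html' is just a placeholder):
from bs4 import BeautifulSoup

raw = open('page.html', 'rb').read()
soup = BeautifulSoup(raw)
print(soup.original_encoding)   # what Unicode, Dammit guessed

# If the guess was wrong, make the encoding explicit instead:
soup = BeautifulSoup(raw, from_encoding='iso-8859-1')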
The results that come out of any BS object, and anything you pass as an argument to any method, will always be UTF-8 (if they're byte strings). So, calling decode('iso-8859-1') on something you got out of BS is guaranteed to break stuff if it's not already broken.
And you don't want to do this anyway. As you said in a comment, "I'm outputting to an SQLite3 database." Well, sqlite3 always uses UTF-8. (You can change this with a pragma at runtime, or change the default at compile time, but that basically breaks the Python interface, so… don't.) And the Python interface only allows UTF-8 in Py2 str (and of course in Py2 unicode/Py3 str, there is no encoding.) So, if you try to encode the BS data into Latin-1 to store in the database, you're creating problems. Just store the Unicode as-is, or encode it to UTF-8 if you must (Py2 only).
If you don't want to figure all of this out, just use Unicode everywhere after the initial call to BeautifulSoup and you'll never go wrong.

URL encoding/decoding with Python

I am trying to encode, store, and decode arguments in Python and I'm getting lost somewhere along the way. Here are my steps:
1) I use google toolkit's gtm_stringByEscapingForURLArgument to convert an NSString properly for passing into HTTP arguments.
2) On my server (python), I store these string arguments as something like u'1234567890-/:;()$&#".,?!\'[]{}#%^*+=_\\|~<>\u20ac\xa3\xa5\u2022.,?!\'' (note that these are the standard keys on an iphone keypad in the "123" view and the "#+=" view, the \u and \x chars in there being some monetary prefixes like pound, yen, etc)
3) I call urllib.quote(myString,'') on that stored value, presumably to %-escape them for transport to the client so the client can unpercent escape them.
The result is that I am getting an exception when I try to log the result of % escaping. Is there some crucial step I am overlooking that needs to be applied to the stored value with the \u and \x format in order to properly convert it for sending over http?
Update: The suggestion marked as the answer below worked for me. I am providing some updates to address the comments below to be complete, though.
The exception I received cited an issue with \u20ac. I don't know if it was a problem with that specifically, rather than the fact that it was the first unicode character in the string.
That \u20ac char is the unicode for the 'euro' symbol. I basically found I'd have issues with it unless I used the urllib2 quote method.
url encoding a "raw" unicode doesn't really make sense. What you need to do is .encode("utf8") first so you have a known byte encoding and then .quote() that.
The output isn't very pretty but it should be a correct uri encoding.
>>> s = u'1234567890-/:;()$&#".,?!\'[]{}#%^*+=_\|~<>\u20ac\xa3\xa5\u2022.,?!\''
>>> urllib2.quote(s.encode("utf8"))
'1234567890-/%3A%3B%28%29%24%26%40%22.%2C%3F%21%27%5B%5D%7B%7D%23%25%5E%2A%2B%3D_%5C%7C%7E%3C%3E%E2%82%AC%C2%A3%C2%A5%E2%80%A2.%2C%3F%21%27'
Remember that you will need to both unquote() and decode() this to print it out properly if you're debugging or whatever.
>>> print urllib2.unquote(urllib2.quote(s.encode("utf8")))
1234567890-/:;()$&#".,?!'[]{}#%^*+=_\|~<>â‚¬Â£Â¥â€¢.,?!'
>>> # oops, nasty â‚¬ means we've got a utf8 byte stream being treated as an ascii stream
>>> print urllib2.unquote(urllib2.quote(s.encode("utf8"))).decode("utf8")
1234567890-/:;()$&#".,?!'[]{}#%^*+=_\|~<>€£¥•.,?!'
This is, in fact, what the django functions mentioned in another answer do.
The functions django.utils.http.urlquote() and django.utils.http.urlquote_plus() are versions of Python's standard urllib.quote() and urllib.quote_plus() that work with non-ASCII characters. (The data is converted to UTF-8 prior to encoding.)
Be careful if you are applying any further quotes or encodings not to mangle things.
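A minimal usage sketch of the Django helpers quoted above (Python 2 era; these helpers existed in older Django releases and were later deprecated in favour of urllib.parse.quote):
from django.utils.http import urlquote, urlquote_plus

# urlquote() encodes the unicode string to UTF-8 first, then percent-escapes it.
print urlquote(u'price: \u20ac5')        # -> price%3A%20%E2%82%AC5
print urlquote_plus(u'price: \u20ac5')   # -> price%3A+%E2%82%AC5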
I want to second pycruft's remark. Web protocols have evolved over decades, and dealing with the various sets of conventions can be cumbersome. Now, URLs happen to be explicitly defined not in terms of characters, but only of bytes (octets). As a historical coincidence, URLs are one of the places where you can only assume, but not enforce or safely expect, an encoding to be present. However, there is a convention to prefer latin-1 and utf-8 over other encodings here. For a while it looked like 'unicode percent escapes' would be the future, but they never caught on.
It is of paramount importance to be pedantically picky in this area about the difference between unicode objects and octet strings (unicode vs. str in Python < 3.0; confusingly, these are str objects and bytes/bytearray objects in Python >= 3.0). Unfortunately, in my experience it is, for a number of reasons, pretty difficult to cleanly separate the two concepts in Python 2.x.
Even more OT: when you want to receive third-party HTTP requests, you cannot absolutely rely on URLs being sent as percent-escaped, utf-8-encoded octets. There may be the occasional %uxxxx escape in there, and at least Firefox 2.x used to encode URLs as latin-1 where possible, and as utf-8 only where necessary.
You are out of luck with the stdlib: urllib.quote doesn't work with unicode. If you are using Django, you can use django.utils.http.urlquote, which works properly with unicode.

python appengine form-posted utf8 file issue

I am trying to form-post a SQL file that consists of many INSERTs, e.g.
INSERT INTO `TABLE` VALUES ('abcdé', 2759);
Then I use re.search to parse it and extract the fields to put into my own datastore. The problem is that, although the file contains accented characters (note that the e is an é), once uploaded they get lost and it either errors or stores a bytestring representation of them.
Here's what I am currently using (and I have tried loads of alternatives):
form = cgi.FieldStorage()
uFile = form['sql']
uSql = uFile.file.read()
lineX = uSql.split("\n") # to get each line
and so on.
Has anyone got a robust way of making this work? Remember I am on App Engine, so access to some libraries is restricted/forbidden.
You mention utf8 in the Q's title but then never again: what are you doing (in terms of setting headers and checking them) to verify what encoding is in use? There should be headers of the form
Content-Type: text/plain; charset=utf-8
and the charset= part is where the encoding is specified. So what are the values upon sending and receiving this? If charset is erroneous, you may have to manually perform some encoding and decoding. To help us gauge what the encoding seems to be, besides the headers, what's the ord value of that accented-e? E.g., if the encoding was actually iso-8859-1, that ord value would be 233 (in decimal; 0xE9 in hex).
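A quick way to do that check on App Engine (a Python 2 sketch reusing the question's own variables; 'abcd' mirrors the example row above and the offsets are illustrative):
uSql = uFile.file.read()
i = uSql.find('abcd') + len('abcd')   # position of the byte(s) that encode the accented e
print repr(uSql[i]), ord(uSql[i])     # 233 (0xE9) suggests iso-8859-1; 195 (0xC3) is the first byte of UTF-8 '\xc3\xa9'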
