URL encoding/decoding with Python

I am trying to encode, store, and decode arguments in Python and am getting lost somewhere along the way. Here are my steps:
1) I use google toolkit's gtm_stringByEscapingForURLArgument to convert an NSString properly for passing into HTTP arguments.
2) On my server (Python), I store these string arguments as something like u'1234567890-/:;()$&@".,?!\'[]{}#%^*+=_\\|~<>\u20ac\xa3\xa5\u2022.,?!\'' (note that these are the standard keys on an iPhone keypad in the "123" view and the "#+=" view, the \u and \x chars being currency symbols such as the pound and yen signs)
3) I call urllib.quote(myString,'') on that stored value, presumably to %-escape it for transport to the client so the client can un-percent-escape it.
The result is that I am getting an exception when I try to log the result of % escaping. Is there some crucial step I am overlooking that needs to be applied to the stored value with the \u and \x format in order to properly convert it for sending over http?
Update: The suggestion marked as the answer below worked for me. To be complete, though, I am providing some updates that address the comments below.
The exception I received cited an issue with \u20ac. I don't know if the problem was with that character specifically, or just the fact that it was the first unicode character in the string.
That \u20ac char is the unicode for the 'euro' symbol. I basically found I'd have issues with it unless I used the urllib2 quote method.

URL-encoding a "raw" unicode string doesn't really make sense. What you need to do is .encode("utf8") first, so you have a known byte encoding, and then .quote() that.
The output isn't very pretty but it should be a correct uri encoding.
>>> s = u'1234567890-/:;()$&@".,?!\'[]{}#%^*+=_\|~<>\u20ac\xa3\xa5\u2022.,?!\''
>>> urllib2.quote(s.encode("utf8"))
'1234567890-/%3A%3B%28%29%24%26%40%22.%2C%3F%21%27%5B%5D%7B%7D%23%25%5E%2A%2B%3D_%5C%7C%7E%3C%3E%E2%82%AC%C2%A3%C2%A5%E2%80%A2.%2C%3F%21%27'
Remember that you will need to both unquote() and decode() this to print it out properly if you're debugging or whatever.
>>> print urllib2.unquote(urllib2.quote(s.encode("utf8")))
1234567890-/:;()$&@".,?!'[]{}#%^*+=_\|~<>€£¥•.,?!'
>>> # careful: if this comes out as mojibake, we've got a utf8 byte stream being treated as an ascii stream
>>> print urllib2.unquote(urllib2.quote(s.encode("utf8"))).decode("utf8")
1234567890-/:;()$&@".,?!'[]{}#%^*+=_\|~<>€£¥•.,?!'
This is, in fact, what the django functions mentioned in another answer do.
The functions django.utils.http.urlquote() and django.utils.http.urlquote_plus() are versions of Python's standard urllib.quote() and urllib.quote_plus() that work with non-ASCII characters. (The data is converted to UTF-8 prior to encoding.)
Be careful if you are applying any further quotes or encodings not to mangle things.

I want to second pycruft's remark. Web protocols have evolved over decades, and dealing with the various sets of conventions can be cumbersome. Now, URLs happen to be explicitly defined not in terms of characters, but only in terms of bytes (octets). As a historical coincidence, URLs are one of the places where you can only assume, but not enforce or safely expect, an encoding to be present. However, there is a convention to prefer latin-1 and utf-8 over other encodings here. For a while, it looked like 'unicode percent escapes' would be the future, but they never caught on.
It is of paramount importance to be pedantically picky in this area about the difference between unicode objects and octet strings (unicode vs. str in Python < 3.0; that's, confusingly, str unicode objects vs. bytes/bytearray objects in Python >= 3.0). Unfortunately, in my experience it is, for a number of reasons, pretty difficult to cleanly separate the two concepts in Python 2.x.
Even more OT: when you want to receive third-party HTTP requests, you cannot absolutely rely on URLs being sent as percent-escaped, utf-8-encoded octets: there may be the occasional %uxxxx escape in there, and at least Firefox 2.x used to encode URLs as latin-1 where possible, and as utf-8 only where necessary.

You are out of luck with the stdlib: urllib.quote doesn't work with unicode. If you are using Django you can use django.utils.http.urlquote, which works properly with unicode.
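A small sketch of that helper in use (assuming a Django 1.x/2.x-era install, where django.utils.http.urlquote still exists; the sample string is only an example):
from django.utils.http import urlquote

# urlquote() UTF-8-encodes and then percent-escapes in one step
print urlquote(u'f\xe4hre \u20ac')   # f%C3%A4hre%20%E2%82%AC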

Related


Python 2: Handling unicode in a string when you don't have control of the list

So I'm using the TwitterSearch library. The task is simple: print Twitter search results.
Here's the trouble. The tweet text is handed to you by TwitterSearch in this dict (or list, whichever it actually is):
tweet['text']
And if Python 2.7 hits a unicode character it can't handle, BOOM, the program errors out.
So I tried to handle it like this:
a=unicode(tweet['text'], errors='ignore')
print a
The intent is to convert the unicode to a string while ignoring anything unresolved along the way (that is what I understood from the documentation; I may have misunderstood it and come up with this code as a result).
Instead I got this cute error message:
TypeError: decoding Unicode is not supported
My questions:
1: Why? Isn't this Unicode stuff part of the standard Python library?
2: What should I do so the unicode gets converted to a string, with anything unresolved ignored in the process?
PS: This is my first unicode issue and this is the best I can do at this point. Don't kill me.
You need to understand the distinction between Unicode objects and byte strings. In Python 2.7, instances of the unicode class are Unicode objects; they consist of characters as defined by the Unicode standard. From the evidence you've provided, your tweet['text'] is already a unicode instance.
You can verify this by printing type(tweet['text']):
>>> print type(tweet['text'])
<type 'unicode'>
Now unicode objects are a high-level representation of a concept that does not have a single defined representation in computer memory. They are very useful because they let you use characters outside the ASCII range, which is limited to basic Latin letters and numbers. But a Unicode character is not stored by the computer as its shape; instead, each character is identified by a number assigned by the standard, referred to as its code point.
On the other hand pretty much every part of your computer operates using bytes. Network protocols transfer bytes, input and output streams transfer bytes. To be able to send a Unicode string across the network or even print it on a device such as your terminal you need to use a protocol that both communicating parties (e.g. your program and the terminal) understand. We call these encodings.
>>> u'żółw'.encode('utf-8')
'\xc5\xbc\xc3\xb3\xc5\x82w'
>>> print type(u'żółw'.encode('utf-8'))
<type 'str'>
There are many encodings, and a single unicode object can often be encoded into many different byte strings depending on the encoding you choose. Picking the correct one requires knowledge of the context in which you want to use the resulting string. If your terminal understands UTF-8, then all unicode objects require encoding to UTF-8 before being sent to the output stream. If it only understands ASCII, then you might need to drop some of the characters.
>>> print u'żółw'.encode('utf-8')
żółw
So if Python's default output encoding is either incorrect or cannot handle all the characters you're trying to print, you can always encode the object manually and output the resulting str instead. But before you do, please read all of the documents linked to in comments directly under your question.
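In practice that usually means something like the following sketch (assuming tweet['text'] is a unicode object, as above, and that your terminal speaks UTF-8; the last line shows the lossy fallback the question asked about):
text = tweet['text']                    # already a unicode object
print text.encode('utf-8')              # explicit encode for a UTF-8 terminal
print text.encode('ascii', 'ignore')    # lossy fallback: silently drop non-ASCII characters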

Bulletproof work with encoding in Python

A question about unicode in Python 2.
As I understand it, I should always decode everything I read from outside (files, the network). decode converts external bytes to internal Python strings, using the charset given as its parameter. So decode("utf8") means that the outside bytes are a unicode string and they will be decoded to Python strings.
Also, I should always encode everything I write to the outside. I specify the encoding as a parameter of the encode function, and it converts the data to that encoding and writes it.
These statements are right, aren't they?
But sometimes when I parse HTML documents I get decode errors. As I understand it, the document is in another encoding (for example cp1252), and the error happens when I try to decode it as UTF-8. So the question is: how do I write a bulletproof application?
I found that chardet is a good library for guessing the encoding. Is that the only way to write bulletproof applications?
... decode("utf8") means that the outside bytes are a unicode string and they will be decoded to Python strings.
...
These statements are right, aren't they?
No, outside bytes are binary data, they are not a unicode string. So <str>.decode("utf8") will produce a Python unicode object by interpreting the bytes in <str> as UTF-8; it may raise an error if the bytes cannot be decoded as UTF-8.
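A quick illustration in the Python 2 REPL (the byte values are just examples):
>>> '\xc3\xa4'.decode("utf8")    # valid UTF-8 bytes for the character ä
u'\xe4'
>>> '\xe4'.decode("utf8")        # a lone 0xe4 byte is not valid UTF-8
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 0: unexpected end of data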
Determining the encoding of any given document is not necessarily a simple task. You either need to have some external source of information that tells you the encoding, or you need to know something about what is in the document. For example, if you know that it is an HTML document with its encoding specified internally, then you can parse the document using an algorithm like the one outlined in the HTML Standard to find the encoding and then use that encoding to parse the document (it's a two-pass operation). However, just because an HTML document specifies an encoding does not mean that it can actually be decoded with that encoding. You may still get errors if the data is corrupt or if the document was not encoded properly in the first place.
There are libraries such as chardet (I see you mentioned it already) that will try to guess the encoding of a document for you (it's only a guess, not necessarily correct). But they can have their own issues such as performance, and they may not recognize the encoding of your document.
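As a rough sketch of how chardet is typically used (chardet is a third-party package, and the variable names here are only illustrative):
import chardet

guess = chardet.detect(raw_bytes)             # e.g. {'encoding': 'windows-1252', 'confidence': 0.73, ...}
encoding = guess['encoding'] or 'utf-8'       # detect() may return None for the encoding
text = raw_bytes.decode(encoding, 'replace')  # 'replace' avoids raising on stray bad bytes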
Try wrapping your decoding in try/except blocks:
try decoding as utf-8;
catch the exception if it is not utf-8;
if an exception is raised, try the next encoding;
and so on.
Make it a function that returns the decoded string when (and if) it finds an encoding that doesn't raise, and returns None or an empty string when it exhausts its list of encodings and the last exception is raised (a rough sketch follows below).
Like the others said, the encoding should be recorded somewhere, so check that first.
Not efficient, and frankly, given my skill level, it may be way off, but to my newbie mind it may alleviate some of the problems of dealing with an unknown or undocumented encoding.
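A rough sketch of that fallback idea (the candidate encodings listed are only examples):
def decode_with_fallback(raw_bytes, encodings=('utf-8', 'cp1252')):
    """Return a unicode string using the first encoding that works, else None."""
    for enc in encodings:
        try:
            return raw_bytes.decode(enc)
        except UnicodeDecodeError:
            continue
    return None  # exhausted the list; the caller decides what to do next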
Convert to unicode from cp437. This way you get your bytes right to unicode and back.

using extended ascii characters for wikimedia api

I am writing a simple search algorithm for Wikipedia. I am having trouble when I send a query containing accented characters and other characters not seen in regular English. Queries that return an error are:
http://en.wikipedia.org/w/api.php?action=query&titles=Albrecht%20Dürer&prop=links&pllimit=33&format=xml
http://en.wikipedia.org/w/api.php?action=query&titles=Ancien%20Régime&prop=links&pllimit=33&format=xml
http://en.wikipedia.org/w/api.php?action=query&titles=Feigenbaum-Cvitanović&prop=links&pllimit=33&format=xml
http://en.wikipedia.org/w/api.php?action=query&titles=Banach–Tarski%20paradox&prop=links&pllimit=33&format=xml
http://en.wikipedia.org/w/api.php?action=query&titles=Grundzüge%20der%20Mengenlehre&prop=links&pllimit=33&format=xml
http://en.wikipedia.org/w/api.php?action=query&titles=Grundzüge%20einer%20Theorie%20der%20geordneten%20Mengen&prop=links&pllimit=33&format=xml
http://en.wikipedia.org/w/api.php?action=query&titles=Karl%20Bögel&prop=links&pllimit=33&format=xml
But the query works fine with simple characters, such as "Fractals". How should I change the format of the query to make this work?
My code is open sourced at: http://code.google.com/p/wikipediafoundation/source/browse/. Please look at hg/src/list.py.
I don't see any trace in your Python source of how you're encoding the non-ASCII characters you send in the query. For URLs (including their query strings) that use anything beyond ASCII, you need to make the strings unicode if they aren't already, encode them as UTF-8, and percent-escape the result. For the percent-escaping, use the function urllib.quote_plus from the standard library module urllib; for the encoding, use the unicode string's .encode('utf8') method. If you need to build a unicode string from a differently encoded byte string, use that byte string's .decode('latin-1') (or whatever the name of its encoding actually is, of course;-).
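A minimal sketch of that approach (the title is only an example; the URL layout mirrors the ones in the question):
import urllib

title = u'Albrecht D\xfcrer'                          # a unicode title, i.e. u'Albrecht Dürer'
quoted = urllib.quote_plus(title.encode('utf8'))      # UTF-8 encode first, then percent-escape
url = ('http://en.wikipedia.org/w/api.php?action=query&titles=' + quoted +
       '&prop=links&pllimit=33&format=xml')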

"Broken" unicode strings encoded in UTF-8?

I have been studying unicode and its Python implementation now for two days, and I think I'm getting a glimpse of what it is about. Just to get confident, I'm asking if my assumptions for my current problems are correct.
In Django, forms give me unicode strings which I suspect to be "broken". Unicode strings in Python should be encoded in UTF-8, is that right? After entering the string "fähre" into a text field, the browser sends the string "f%c3%a4hre" in the POST request (checked via wireshark). When I retrieve the value via form.cleaned_data, I'm getting the string u'f\xe4hre' (note it is a unicode string), though. As far as I understand it, that is an ISO-8859-1-encoded unicode string, which is incorrect. The correct string should be u'f\xc3\xa4hre', which would be a UTF-8-encoded unicode string. Is that a Django bug or is there something wrong with my understanding of it?
To fix the issue, I wrote a function to apply it to any text input from Django forms:
def fix_broken_unicode(s):
    return unicode(s.encode(u'utf-8'), u'iso-8859-1')
which does
>>> fix_broken_unicode(u'f\xe4hre')
u'f\xc3\xa4hre'
That doesn't seem very elegant to me, but setting Django's settings.DEFAULT_CHARSET to 'utf-8' didn't help, nor did anything else. I am trying to work with unicode throughout the whole application so I won't get any weird errors later on, but it obviously does not suffice to mark all strings with u'...'.
Edit: Considering the answers from Dirk and sth, I will now save the strings to the database as they are. The real problem was that I was trying to urlencode these kinds of strings to use them as input for the Twitter API etc. In GET or POST requests, though, UTF-8 encoding is obviously expected which the standard urllib.urlencode() function does not process correctly (throws exceptions). Take a look at my solution in the pastebin and feel free to comment on it also.
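(For reference, a rough sketch of the kind of pre-encoding that avoids those urlencode() exceptions; this is only an illustration, not the pastebin solution mentioned above, and the parameter name is made up:)
import urllib

params = {'status': u'f\xe4hre'}                                   # example data only
safe_params = dict((k, v.encode('utf-8')) for k, v in params.items())
print urllib.urlencode(safe_params)                                # status=f%C3%A4hre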
u'f\xe4hre' is a unicode string, not encoded as anything. The unicode codepoint 0xe4 is the character ä. It is not really important that ä would also be encoded as byte 0xe4 in ISO-8859-1.
The unicode string can contain any unicode characters without encoding them in some way. For example 轮渡 would be represented as u'\u8f6e\u6e21', which are simply two unicode codepoints. The UTF-8 encoding would be the much longer '\xe8\xbd\xae\xe6\xb8\xa1'.
So there is no need to fix the encoding, you are just seeing the internal representation of the unicode string.
Not exactly: after having been decoded, the unicode string is unicode, which means it may contain characters with codes beyond 255. How the interpreter represents these depends on the platform, but usually nowadays it uses character elements with a width of at least 16 bits. ISO-8859-1 is a proper subset of unicode. Thus, the string u'f\xe4hre' is actually proper -- the \xe4 is a rendering artifact, since Python doesn't know if (and when) it is safe to show characters with codes beyond a certain range on the console.
UTF-8 is a transport encoding, that is, a special way to write unicode data such that it can be stored in "channels" with an element width of 8 bits per character/byte. In order to compute the proper "external" (or transport) encoding of a unicode string, you'd use the encode method, passing the desired encoding name. It returns a properly encoded byte string (as opposed to a unicode character string).
The reverse transformation is decode which takes a byte string and an encoding name and yields a unicode character string.
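A quick round trip in the Python 2 REPL, illustrating both directions (using the word from the question):
>>> u'f\xe4hre'.encode('utf-8')      # unicode characters -> UTF-8 byte string
'f\xc3\xa4hre'
>>> 'f\xc3\xa4hre'.decode('utf-8')   # UTF-8 byte string -> unicode characters
u'f\xe4hre'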
