Python encoding of Latin american characters - python

I'm trying to allow users to sign up for my service, and I'm noticing errors whenever somebody signs up with Latin American characters in their name. I tried reading several SO posts/websites, as listed below:
Python regex against Latin-1 character encoding?
http://www.w3.org/TR/2009/WD-html5-20090423/infrastructure.html#character-encodings-0
http://docs.python.org/2/library/json.html
https://pypi.python.org/pypi/anyjson
but I was still unable to solve it. My code example is below:
>>> val = json.dumps({"name":"Déjà"}, encoding="ISO-8859-1")
>>> val
'{"name": "D\\u00c3\\u00a9j\\u00c3\\u00a0"}'
Is there any way to force the encoding to work in this case, for both serializing and deserializing? Any help is appreciated!
EDIT
The client is Android and iPhone applications. I'm using the following libraries to encode the json on the clients:
http://loopj.com/android-async-http/ (android)
https://github.com/AFNetworking/AFNetworking (ios)
EDIT 2
The same text was received by the server from the Android client as per below:
{"NAME":"D\ufffdj\ufffd"}
I was using anyjson to deserialize that and it said:
File "/usr/local/lib/python2.7/dist-packages/anyjson/__init__.py", line 135, in loads
return implementation.loads(value)
File "/usr/local/lib/python2.7/dist-packages/anyjson/__init__.py", line 99, in loads
return self._decode(s)
File "/usr/local/lib/python2.7/dist-packages/simplejson/__init__.py", line 454, in loads
return _default_decoder.decode(s)
File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 374, in decode
obj, end = self.raw_decode(s)
File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 393, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
ValueError: ('utf8', "D\xe9j\xe0", 1, 2, 'invalid continuation byte')

JSON should almost always be in Unicode (when encoded), and if you're writing a webserver, UTF-8. The following, in Python 3, is basically correct:
In [1]: import json
In [2]: val = json.dumps({"name":"Déjà"})
In [3]: val
Out[3]: '{"name": "D\\u00e9j\\u00e0"}'
A closer look:
'{"name": "D\\u00e9j\\u00e0"}'
            ^^^^^^^
The text \u00e9, which in JSON means "é".
The slash is doubled because we're looking at a repr of a str.
You can then send val to the client, and in Javascript, JSON.parse should give you the right result.
Because you mentioned, "when somebody signs up": that implies data coming from the client (web browser) to you. How is that data being sent? What library/libraries are you writing a webserver in?
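The two symptoms in the question can be reproduced in a few lines, which may help show where the bytes go wrong. This is a minimal Python 3 sketch, not the asker's actual server code; the variable names are only illustrative.

```python
import json

name = "Déjà"

# UTF-8 bytes misread as ISO-8859-1 give the doubled escapes from the question.
misread = name.encode("utf-8").decode("iso-8859-1")
doubled = json.dumps({"name": misread})

# ISO-8859-1 bytes are not valid UTF-8, which is the error the server reported.
latin1_bytes = name.encode("iso-8859-1")
try:
    latin1_bytes.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc.reason

# When both sides agree on UTF-8, the round trip is lossless.
wire = json.dumps({"name": name}).encode("utf-8")
round_tripped = json.loads(wire.decode("utf-8"))["name"]

print(doubled)
print(latin1_bytes, decode_error)
print(round_tripped)
```

In other words, the JSON library is usually not the culprit; the bytes and the declared charset have to match on both ends.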

Turns out this was mainly an issue in how I was doing the encoding from the Android side.
I am now setting the StringEntity this way in Android and it's working now:
StringEntity se = new StringEntity(obj.toString(), "UTF-8");
se.setContentType("application/json;charset=UTF-8");
se.setContentEncoding( new BasicHeader(HTTP.CONTENT_TYPE, "application/json"));
Also, I was using anyjson on the server which was using simplejson. This was creating errors at times as well. I switched to using the json library for Python.

Related

Convert base64 encoded google service account key to JSON file using Python

Hello, I'm trying to convert a Google service account JSON key (contained in a base64-encoded field named privateKeyData in the file foo.json; more context here) into the actual JSON file. I need that format because Ansible only accepts that.
The foo.json file is obtained using this Google Python API method.
What I'm trying to do (though I am using Python) is also described in this thread, which by the way does not work for me (tried on OS X and Linux).
#!/usr/bin/env python3
import json
import base64
with open('/tmp/foo.json', 'r') as f:
    ymldict = json.load(f)
b64encodedCreds = ymldict['privateKeyData']
b64decodedBytes = base64.b64decode(b64encodedCreds,validate=True)
outputStr = b64decodedBytes
print(outputStr)
#issue
outputStr = b64decodedBytes.decode('UTF-8')
print(outputStr)
yields
./test.py
b'0\x82\t\xab\x02\x01\x030\x82\td\x06\t*\x86H\x86\xf7\r\x01\x07\x01\xa0\x82\tU\x04\x82\tQ0\x82\tM0\x82\x05q\x06\t*\x86H\x86\xf7\r\x01\x07\x01\xa0\x82\x05b\x04\x82\x05^0\x82\x05Z0\x82\x05V\x06\x0b*\x86H\x86\xf7\r\x01\x0c\n\x01\x02\xa0\x82\x#TRUNCATING HERE
Traceback (most recent call last):
File "./test.py", line 17, in <module>
outputStr = b64decodedBytes.decode('UTF-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 1: invalid start byte
I think I have run out of ideas and spent now more than a day on this :(
what am I doing wrong?
Your base64 decoding logic looks fine to me. The problem you are facing is probably due to a character encoding mismatch. The response body you received after calling create (your foo.json file) is probably not encoded with UTF-8. Check out the response header's Content-Type field. It should look something like this:
Content-Type: text/javascript; charset=Shift_JIS
Try to decode your base64 decoded string with the encoding used in the content type
b64decodedBytes.decode('Shift_JIS')
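One more thing that may be worth checking: the bytes printed in the question begin with 0x30 0x82, which is also how a DER-encoded ASN.1 structure (for example a PKCS#12 key file) starts, so the payload could be binary rather than text in some other charset. The following is only a sketch under that assumption; describe_key_payload is a hypothetical helper name, not part of any Google library.

```python
import base64
import json


def describe_key_payload(b64_text):
    """Guess whether a base64 payload holds JSON text or binary DER data."""
    raw = base64.b64decode(b64_text, validate=True)
    try:
        # A JSON credentials file decodes cleanly as UTF-8 and parses.
        return "json", json.loads(raw.decode("utf-8"))
    except ValueError:
        # UnicodeDecodeError and json.JSONDecodeError both subclass ValueError.
        pass
    if raw[:1] == b"\x30":
        # 0x30 is the tag of an ASN.1 SEQUENCE, the usual first byte of DER data.
        return "der", raw
    return "unknown", raw


json_payload = base64.b64encode(b'{"type": "service_account"}').decode("ascii")
der_payload = base64.b64encode(b"\x30\x82\x09\xab\x02\x01\x03").decode("ascii")

print(describe_key_payload(json_payload)[0])
print(describe_key_payload(der_payload)[0])
```

If the payload turns out to be binary, it would need to be written with open(path, 'wb') instead of being decoded to a string.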

Getting "UnicodeDecodeError at /admin/login/" when following the Django documentation tutorial

I have a problem with my Django admin site, 127.0.0.1:8000/admin/. I started the server by running python manage.py runserver. I am now at part 2 of the Django docs tutorial. I followed it, and when I open the admin site I get this error:
UnicodeDecodeError at /admin/login/
'utf-8' codec can't decode byte 0xee in position 394374: invalid continuation byte
Request Method: GET
Request URL: http://127.0.0.1:8000/admin/login/?next=/admin/
Django Version: 1.8.3
Exception Type: UnicodeDecodeError
Exception Value:
'utf-8' codec can't decode byte 0xee in position 394374: invalid continuation byte
Exception Location: c:\Python34\lib\codecs.py in decode, line 319
Python Executable: c:\Python34\python.exe
Python Version: 3.4.3
Python Path:
['C:\\Python34\\Scripts\\mypoll',
'C:\\WINDOWS\\system32\\python34.zip',
'c:\\Python34\\DLLs',
'c:\\Python34\\lib',
'c:\\Python34',
'c:\\Python34\\lib\\site-packages']
Server time: Sat, 1 Aug 2015 22:19:09 +0700
The Traceback :
Traceback:
File "c:\Python34\lib\site-packages\django\core\handlers\base.py" in get_response
164. response = response.render()
File "c:\Python34\lib\site-packages\django\template\response.py" in render
158. self.content = self.rendered_content
File "c:\Python34\lib\site-packages\django\template\response.py" in rendered_content
133. template = self._resolve_template(self.template_name)
File "c:\Python34\lib\site-packages\django\template\response.py" in _resolve_template
88. new_template = self.resolve_template(template)
File "c:\Python34\lib\site-packages\django\template\response.py" in resolve_template
80. return loader.get_template(template, using=self.using)
File "c:\Python34\lib\site-packages\django\template\loader.py" in get_template
35. return engine.get_template(template_name, dirs)
File "c:\Python34\lib\site-packages\django\template\backends\django.py" in get_template
30. return Template(self.engine.get_template(template_name, dirs))
File "c:\Python34\lib\site-packages\django\template\engine.py" in get_template
167. template, origin = self.find_template(template_name, dirs)
File "c:\Python34\lib\site-packages\django\template\engine.py" in find_template
141. source, display_name = loader(name, dirs)
File "c:\Python34\lib\site-packages\django\template\loaders\base.py" in __call__
13. return self.load_template(template_name, template_dirs)
File "c:\Python34\lib\site-packages\django\template\loaders\base.py" in load_template
17. template_name, template_dirs)
File "c:\Python34\lib\site-packages\django\template\loaders\app_directories.py" in load_template_source
39. return fp.read(), filepath
File "c:\Python34\lib\codecs.py" in decode
319. (result, consumed) = self._buffer_decode(data, self.errors, final)
Exception Type: UnicodeDecodeError at /admin/login/
Exception Value: 'utf-8' codec can't decode byte 0xee in position 394374: invalid continuation byte
At the bottom of the page, it states :
You're seeing this error because you have DEBUG = True in your Django settings file. Change that to False, and Django will display a standard page generated by the handler for this status code.
Then I changed DEBUG = True to False. The server then stopped, because the command prompt reported: CommandError: You must set settings.ALLOWED_HOSTS if DEBUG is False.
Now I can't do anything. What should I do? Is there a solution? I have searched Google and Stack Overflow, but I can't find anything similar to my problem. I hope someone can help me with this.
Thanks
Breakdown:
The exception is raised by Python itself. It happens when attempting to decode a raw data stream into a string. If you are new to Python, you should know that Python 3 makes a clear distinction between strings, aka str (which contain characters), and raw data, aka bytes (which just contain bytes, potentially binary data).
The exception raised here means that, for some reason, Python was asked to decode some bytes into text using the UTF-8 encoding, yet the data is not valid UTF-8-encoded text.
Assuming you come from a western country, my bet is the text is using ANSI or ISO-8859-1 and has an “î” in it. That gets encoded as 0xee in ANSI, but should be encoded as 0xC3 0xAE in UTF-8.
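To make the byte values concrete, here is a small Python 3 sketch. The sample word "dîner" is just an illustration; it is not taken from the asker's template.

```python
ch = "î"

# The same character produces different bytes in each encoding.
latin1 = ch.encode("iso-8859-1")
utf8 = ch.encode("utf-8")

# Reading ISO-8859-1 bytes as UTF-8 fails the same way the admin page did.
sample = "dîner".encode("iso-8859-1")
try:
    sample.decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason

print(latin1, utf8, reason)
```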
There are several reasons this could happen. Here, from the traceback, it happened while rendering a template. More specifically, while rendering a template from an app's directory. So you have in one of your apps a template that's not properly encoded.
How did it happen? Well, I see you are running a Windows box. The Windows environment is somewhat of a mess when it comes to text encoding. Every program comes with its own opinion of what to use as default (when it can be changed). For instance, Notepad still encodes in ANSI by default, which in Western Europe is Windows-1252, a close relative of ISO-8859-1.
It is very likely that one of the programs you use for editing your templates is saving your files in some other encoding. You have two options from here:
Check the options of your tools and make sure they are all configured to use UTF-8 encoding.
Or configure Django to use the same encoding as your tools. You would do that by adding a FILE_CHARSET='iso-8859-1' line to your settings, or whatever encoding your tools use.
In any case, you must be sure that all of your tools agree on the encoding used, or you will either have other decoding errors, or some characters will get mangled (and show as strange î or ? symbols).
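If it is not obvious which file is the offender, a short script can look for it. This is a rough sketch, assuming the templates live under a directory you pass in; find_non_utf8_files and the extension list are hypothetical and not part of Django.

```python
import os


def find_non_utf8_files(root, extensions=(".html", ".txt")):
    """Return (path, offset) pairs for files under root that are not valid UTF-8."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            # Read raw bytes so that no implicit decoding takes place.
            with open(path, "rb") as fh:
                data = fh.read()
            try:
                data.decode("utf-8")
            except UnicodeDecodeError as exc:
                # exc.start is the offset of the first undecodable byte.
                bad.append((path, exc.start))
    return bad


if __name__ == "__main__":
    for path, offset in find_non_utf8_files("."):
        print(path, offset)
```

The reported offset can be compared with the position given in the error message to confirm that the right file was found.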
Not useful for Django tutorial, but worth reading at some point in your python life: Unicode handling in Python

Flask: flask.request.args.get replacing '+' with space in url

I am trying to use a flask server for an api that takes image urls through the http get parameters.
I am using this url example, which is very long (on pastebin) and contains many +'s in the URL. I have the following route set up in my Flask server:
@webapp.route('/example', methods=['GET'])
def process_example():
    imageurl = flask.request.args.get('imageurl', '')
    url = StringIO.StringIO(urllib.urlopen(imageurl).read())
    ...
but the issue I get is
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/aly/anaconda/lib/python2.7/urllib.py", line 87, in urlopen
return opener.open(url)
File "/Users/aly/anaconda/lib/python2.7/urllib.py", line 208, in open
return getattr(self, name)(url)
File "/Users/aly/anaconda/lib/python2.7/urllib.py", line 597, in open_data
data = base64.decodestring(data)
File "/Users/aly/anaconda/lib/python2.7/base64.py", line 321, in decodestring
return binascii.a2b_base64(s)
binascii.Error: Incorrect padding
Upon further inspection (i.e. printing the imageurl that flask gets) it would appear that the + characters are being replaced by literal spaces which seems to be screwing things up.
Is there an option for the flask.request.args.get function that can handle this?
You need to encode your query parameters correctly; in URL query parameter encoding, spaces are encoded to +, while + itself is encoded to %2B.
Flask cannot be told to treat specific data differently; you cannot reliably detect what data was correctly encoded and what wasn't. You could extract the parameters from query string manually, however, by using request.query_string.
The better approach is to escape your parameters correctly (in JavaScript, use encodeURIComponent(), for example). The + character is not the only problematic character in a Base64-encoded value; the format also uses / and =, both of which carry meaning in a URL, which is why there is a URL-safe variant.
In fact, it is probably the = character at the end of that data: URL that is missing, being the more direct cause of the Incorrect padding error message. If you added it back you'd next indeed have problems with all the + characters having been decoded to ' '.
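The effect is easy to reproduce with the standard library. The sketch below uses Python 3's urllib.parse (the question itself is on Python 2) and a made-up Base64 value, so treat it as an illustration rather than the asker's exact data.

```python
import base64
from urllib.parse import parse_qs, urlencode

payload = "ab+c/d=="  # hypothetical Base64 text containing '+', '/' and '='

# Sent unescaped, '+' is read back as a space.
unescaped = parse_qs("imageurl=" + payload)["imageurl"][0]

# urlencode() percent-escapes the value, so it survives the round trip.
query = urlencode({"imageurl": payload})
escaped = parse_qs(query)["imageurl"][0]

# The URL-safe Base64 alphabet uses '-' and '_' in place of '+' and '/'.
raw = b"\xfb\xff"
standard = base64.b64encode(raw).decode("ascii")
url_safe = base64.urlsafe_b64encode(raw).decode("ascii")

print(repr(unescaped), query, repr(escaped))
print(standard, url_safe)
```

Note that the URL-safe variant still ends in '=' padding, so escaping the parameter remains the more robust fix.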

How to handle Python 3.x UnicodeDecodeError in Email package?

I try to read an email from a file, like this:
import email
with open("xxx.eml") as f:
    msg = email.message_from_file(f)
and I get this error:
Traceback (most recent call last):
File "I:\fakt\real\maildecode.py", line 53, in <module>
main()
File "I:\fakt\real\maildecode.py", line 50, in main
decode_file(infile, outfile)
File "I:\fakt\real\maildecode.py", line 30, in decode_file
msg = email.message_from_file(f) #, policy=mypol
File "C:\Python33\lib\email\__init__.py", line 56, in message_from_file
return Parser(*args, **kws).parse(fp)
File "C:\Python33\lib\email\parser.py", line 55, in parse
data = fp.read(8192)
File "C:\Python33\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1920: character maps to <undefined>
The file contains a multipart email, where the part is encoded in UTF-8. The file's content or encoding might be broken, but I have to handle it anyway.
How can I read the file, even if it has Unicode errors? I cannot find the policy object compat32, and there seems to be no way to handle an exception and let Python continue right where the exception occurred.
What can I do?
To parse an email message in Python 3 without unicode errors, read the file in binary mode and use the email.message_from_binary_file(f) (or email.message_from_bytes(f.read())) method to parse the content (see the documentation of the email.parser module).
Here is code that parses a message in a way that is compatible with Python 2 and 3:
import email
with open("xxx.eml", "rb") as f:
    try:
        msg = email.message_from_binary_file(f)  # Python 3
    except AttributeError:
        msg = email.message_from_file(f)  # Python 2
(tested with Python 2.7.13 and Python 3.6.0)
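Once the message has been parsed from bytes, the body can be decoded leniently, so a stray byte does not abort processing. A minimal Python 3 sketch follows; the inline message is invented for illustration, and falling back to UTF-8 when no charset is declared is an assumption.

```python
import email

# A tiny made-up message whose body contains one byte (0x81) that is invalid UTF-8.
raw = (
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"caf\xc3\xa9 \x81\r\n"
)

msg = email.message_from_bytes(raw)

# get_payload(decode=True) returns the body as bytes, undoing any transfer encoding.
payload = msg.get_payload(decode=True)
charset = msg.get_content_charset() or "utf-8"

# errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
text = payload.decode(charset, errors="replace")
print(repr(text))
```

For a multipart message the same decoding would be applied to each part returned by msg.walk().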
I can't test on your message, so I don't know if this will actually work, but you can do the string decoding yourself:
import email
with open("xxx.eml", encoding='utf-8', errors='replace') as f:
    text = f.read()
# parse the decoded text, not the file object
msg = email.message_from_string(text)
That's going to get you a lot of replacement characters if the message isn't actually in UTF-8. But if it's got \x81 in it, UTF-8 is my guess.
with open('email.txt', 'rb') as f:
    # the file yields bytes, so decode (not encode); needs Python 3.5+
    ascii_text = f.read().decode('ascii', 'backslashreplace')
with open('email.txt', 'w') as f:
    f.write(ascii_text)
# now do your processing stuff
I doubt it is the best way to handle this ... but it's at least a way ...
A method which works on Python 3, which finds the encoding and reloads with the correct one.
msg = email.message_from_file(open('file.eml', errors='replace'))
codes = [x for x in msg.get_charsets() if x is not None]
if len(codes) >= 1:
    msg = email.message_from_file(open('file.eml', encoding=codes[0]))
I have tried with msg.get_charset(), but it sometimes answers None while another encoding is available, hence the slightly involved encoding detection

How to use simplejson to decode following data?

I grabbed some data from a URL and, after searching online, found out that the data is in JSON format, but when I try to use simplejson.loads(data), it raises an exception.
This is my first time dealing with JSON data. Any suggestions on how to decode it?
Thanks
=================
result = simplejson.loads(data, encoding="utf-8")
File "F:\My Documents\My Dropbox\StockDataDownloader\simplejson\__init__.py", line 401, in loads
return cls(encoding=encoding, **kw).decode(s)
File "F:\My Documents\My Dropbox\StockDataDownloader\simplejson\decoder.py", line 402, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "F:\My Documents\My Dropbox\StockDataDownloader\simplejson\decoder.py", line 420, in raw_decode
raise JSONDecodeError("No JSON object could be decoded", s, idx)
simplejson.decoder.JSONDecodeError: No JSON object could be decoded: line 1 column 0 (char 0)
============================
data = "{identifier:'ID', label:'As at Wed 4 Aug 2010 05:05 PM',items:[{ID:0,N:'2ndChance',NC:'528',R:'NONE',I:'NONE',M:'-',LT:0.335,C:0.015,VL:51.000,BV:20.000,B:0.330,S:0.345,SV:20.000,O:0.335,H:0.335,L:0.335,V:17085.000,SC:'4',PV:0.320,P:4.6875,P_:'X',V_:''},{ID:1,N:'8Telecom',NC:'E25',R:'NONE',I:'NONE',M:'-',LT:0.190,C:0.000,VL:965.000,BV:1305.000,B:0.185,S:0.190,SV:641.000,O:0.185,H:0.190,L:0.185,V:179525.000,SC:'2',PV:0.190,P:0.0,P_:'X',V_:''},{ID:2,N:'A-Sonic',NC:'A53',R:'NONE',I:'NONE',M:'-',LT:0.090,C:0.005,VL:1278.000,BV:17.000,B:0.090,S:0.095,SV:346.000,O:0.090,H:0.090,L:0.090,V:115020.000,SC:'A',PV:0.085,P:5.882352734375,P_:'X',V_:''},{ID:3,N:'AA Grp',NC:'5GZ',R:'NONE',I:'NONE',M:'t',LT:0.000,C:0.000,VL:0.000,BV:100.000,B:0.050,S:0.060,SV:50.000,O:0.000,H:0.000,L:0.000,V:0.000,SC:'2',PV:0.050,P:0.0,P_:'X',V_:''}]}"
You're using simplejson correctly, but the site that gave you that data isn't using JSON format properly. Look at json.org, which uses simple syntax diagrams to show what is JSON: in the object diagram, after { (unless the object is empty, in which case a } immediately follows), JSON always has a string -- and as you see in that diagram, this means something that starts with a double quote. So, the very start of the string:
{identifier:
tells you that's incorrect JSON -- no double quotes around the word identifier.
Working around this problem is not as easy as recognizing it's there, but I wanted to reassure you, at least, about your code. Sigh, it does seem that broken websites, such a great tradition in the old HTML days, are with us to stay, no matter how modern the technology they break is... :-(
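For data shaped like the sample above, one possible workaround is to patch the text before parsing. This is a fragile sketch, not a general parser: loosely_fix is a hypothetical helper, and it assumes keys are bare identifiers and that string values contain no quotes and nothing that looks like ", key:". The standard json module is used here; simplejson.loads should behave the same way.

```python
import json
import re

# A shortened excerpt in the same style as the data from the question.
data = "{identifier:'ID', items:[{ID:0,N:'2ndChance',LT:0.335,V_:''}]}"


def loosely_fix(text):
    """Turn JavaScript-style object text into JSON, under the stated assumptions."""
    # Wrap bare keys that directly follow '{' or ',' in double quotes.
    text = re.sub(r"([{,]\s*)([A-Za-z_][A-Za-z0-9_]*)\s*:", r'\1"\2":', text)
    # Swap single-quoted strings for double-quoted ones.
    text = re.sub(r"'([^']*)'", r'"\1"', text)
    return text


try:
    json.loads(data)
    strict_ok = True
except ValueError:
    # The unquoted key right after '{' is rejected by a strict parser.
    strict_ok = False

parsed = json.loads(loosely_fix(data))
print(strict_ok, parsed["items"][0]["N"])
```

If the site changes its output even slightly, the regular expressions may silently produce wrong results, so checking the parsed structure afterwards is advisable.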
