Get valid python list from string (javascript array) - python

I'm trying to get a valid Python list from a server response like the one below:
window.__search.list=[{"order":"1","base":"LAW","n":"148904","access":{"css":"avail_yes","title":"\u042
2\u0435\u043a\u0441\u0442\u0434\u043e\u043a\u0443\u043c\u0435\u043d\u0442\u0430\u0434\u043e\u0441\u0442\u0443\u043f\u0435\u043d"},"title":"\"\u0410\u0440\u0431\u0438\u0442\u0440\u0430\u0436\u043d\u044b\u0439\u043f\u0440\u043e\u0446\u0435\u0441\u0441\u0443\u0430\u043b\u044c\u043d\u044b\u0439\u043a\u043e\u0434\u0435\u043a\u0441\u0420\u043e\u0441\u0441\u0438\u0439\u0441\u043a\u043e\u0439\u0424\u0435\u0434\u0435\u0440\u0430\u0446\u0438\u0438\" \u043e\u0442 24.07.2002 N 95-\u0424\u0417 (\u0440\u0435\u0434. \u043e\u0442 02.07.2013) (\u0441 \u0438\u0437\u043c. \u0438 \u0434\u043e\u043f.,\u0432\u0441\u0442\u0443\u043f\u0430 \u044e\u0449\u0438\u043c\u0438\u0432 \u0441\u0438\u043b\u0443 \u0441 01.08.2013)"}, ... }];
I did it by cutting "window.__search.list=" and ";" off the string with data = json.loads(re.search(r"(?=\[)(.*?)\s*(?=\;)", url).group(1)), and the result then looked like standard JSON:
[{u'access': {u'css': u'avail_yes', u'title': u'\u0422\u0435\u043a\u0441\u0442\u0434\u043e\u043a\u04
43\u043c\u0435\u043d\u0442\u0430 \u0434\u043e\u0441\u0442\u0443\u043f\u0435\u043d'},u'title': u'"\u0410\u0440\u0431\u0438\u0442\u0440\u0430\u0436\u043d\u044b\u0439\u043f\u0440\u043e\u0446\u0435\u0441\u0441\u0443\u0430\u043b\u044c\u043d\u044b\u0439\u043a\u043e\u0434\u0435\u043a\u0441\u0420\u043e\u0441\u0441\u0438\u0439\u0441\u043a\u043e\u0439\u0424\u0435\u0434\u0435\u0440\u0430\u0446\u0438\u0438" \u043e\u0442 24.07.2002 N 95-\u0424\u0417 (\u04
40\u0435\u0434. \u043e\u0442 02.07.2013) (\u0441 \u0438\u0437\u043c. \u0438 \u0434\u043e
\u043f.,\u0432\u0441\u0442\u0443\u043f\u0430\u044e\u0449\u0438\u043c\u0438 \u0432 \u0441
\u0438\u043b\u0443 \u0441 01.08.2013)', u'base': u'LAW', u'order': u'1', u'n': u'148904'}, ... }]
But sometimes, while iterating over other URLs, I get an error like this:
File "/Developer/Python/test.py", line 123, in order_search
data = json.loads(re.search(r"(?=\[)(.*?)\s*(?=\;)", url).group(1))
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 326, in loads
return _default_decoder.decode(s)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 382, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Invalid \uXXXX escape: line 1 column 20235 (char 20235)
How can I fix this, or is there another way to get valid JSON (preferably using only the standard library)?

Your regular expression has probably found a ';' character somewhere in the middle of the response. The lazy match stops at the first ';', so you end up with an incomplete, cropped response, and that is why it cannot be parsed as JSON.
Yes, I agree with user RickyA that code using native tools is sometimes easier to read than a hand-built regex. But here I would still use a regular expression, something like this:
data = re.search(r'(?=\[)(.*?)[\;]*$', response).group(1)
/(?=\[)(.*?)[\;]*$/
(?=\[) Positive Lookahead
\[ Literal [
1st Capturing group (.*?)
. 0 to infinite times [lazy] Any character (except newline)
Char class [\;] 0 to infinite times [greedy] matches:
\; The character ;
$ End of string
I assume the variable 'url' actually holds the server's response; if so, 'response' would be a better name for it than 'url'.
And if you have trouble with regular expressions, I advise using a regex editor such as RegEx 101. It is an online regular expression editor that explains each block of the expression you enter.
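To make the difference between the two patterns concrete, here is a small Python 3 sketch. The sample response is made up for illustration (it only mimics the shape of the real one), with a ';' placed inside a string value on purpose:

```python
import json
import re

# Hypothetical response: note the ';' inside the "title" value.
response = 'window.__search.list=[{"title":"a;b"}];'

# The question's pattern: the lazy group stops at the FIRST ';'.
lazy = re.search(r"(?=\[)(.*?)\s*(?=\;)", response).group(1)

# The pattern suggested above: only semicolons at the end are dropped.
anchored = re.search(r"(?=\[)(.*?)[\;]*$", response).group(1)

print(lazy)                  # cropped text, not valid JSON
print(json.loads(anchored))  # parses as a list with one dict
```

Note that '.' does not match newlines by default, so a multi-line response would presumably also need the re.DOTALL flag.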

What about:
response = response.strip()  # get rid of surrounding whitespace
response = response[response.find("["):]  # trim everything before the first '['
if response[-1:] == ";":  # if the last char is ';'
    response = response[:-1]  # trim it
Doing this with a regex seems like overkill.
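The same idea can be wrapped in a helper that slices from the first '[' to the last ']' and hands the result to json.loads. This is only a sketch: the function name extract_js_array is invented here, and it assumes the response contains exactly one top-level array.

```python
import json


def extract_js_array(response):
    """Parse the JSON array embedded in text such as 'name=[...];'.

    Hypothetical helper: assumes a single top-level array in the text.
    """
    start = response.find("[")   # first opening bracket
    end = response.rfind("]")    # last closing bracket
    if start == -1 or end < start:
        raise ValueError("no JSON array found in response")
    return json.loads(response[start:end + 1])


# A ';' inside a string value is harmless, because no splitting on ';' happens.
items = extract_js_array('window.__search.list=[{"order":"1","title":"a;b"}];\n')
print(items[0]["title"])
```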

Related

Python -- get at JSON info that's written like XML

In Python, I usually do simple JSON with this sort of template:
import json
import urllib2
url = "url"
response = urllib2.urlopen(url)
raw = response.read()  # don't call this variable "json": that would shadow the module
parsed = json.loads(raw)
and then get at the variables with calls like:
parsed[obj name][value name]
But, this works with JSON that's formatted roughly like:
{'object':{'index':'value', 'index':'value'}}
The JSON I just encountered is formatted like:
{'index':'value', 'index':'value'},{'index':'value', 'index':'value'}
so there are no names for me to reference the different blocks. Of course the blocks give different info, but have the same "keys" -- much like XML is usually formatted. Using my method above, how would I parse through this JSON?
The following is not valid JSON:
{'index':'value', 'index':'value'},{'index':'value', 'index':'value'}
whereas, once wrapped in square brackets (and written with double-quoted strings, which JSON requires),
[{"index":"value", "index":"value"},{"index":"value", "index":"value"}] is valid JSON.
The Python traceback shows this:
import json
json_string = '{"a":"value", "b":"value"},{"a":"value", "b":"value"}'
parsed_json = json.loads(json_string)
print parsed_json
Traceback (most recent call last):
File "/Users/tron/Desktop/test3.py", line 3, in <module>
parsed_json = json.loads(json_string)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 369, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column 27 - line 1 column 54 (char 26 - 53)
[Finished in 0.0s with exit code 1]
whereas if you do
json_string = '[{"a":"value", "b":"value"},{"a":"value", "b":"value"}]'
everything works fine.
In that case, the parsed result is a list of objects: parsed_json[0] is the first object, parsed_json[1] is the second, and so on.
Otherwise, if you think this is an issue that you "just have to deal with", here is one option:
Think of the ways the JSON can be malformed and write a simple class to account for them. For the case above, here is a hacky way to deal with it.
import json
json_string = '{"a":"value", "b":"value"},{"a":"value", "b":"value"}'
def parseJson(string):
    parsed_json = None
    try:
        parsed_json = json.loads(string)
        print parsed_json
    except ValueError as e:
        print string, "didn't parse"
        if "Extra data" in str(e.args):
            # several top-level values: wrap them in a list and retry
            newString = "[" + string + "]"
            print newString
            return parseJson(newString)
    return parsed_json
You could add more if/else to deal with various things you run into. I have to admit, this is very hacky and I don't think you can ever account for every possible mutation.
Good luck
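As an alternative to matching on the error message, the standard library's json.JSONDecoder.raw_decode can read one value at a time and report where it stopped. The sketch below is written for Python 3 and is not part of the answer above; the function name parse_concatenated is invented, and it assumes the values are separated only by commas and whitespace.

```python
import json


def parse_concatenated(text):
    """Parse JSON values separated by commas/whitespace into a list.

    Hypothetical helper built on json.JSONDecoder.raw_decode.
    """
    decoder = json.JSONDecoder()
    values = []
    idx = 0
    while idx < len(text):
        # Skip the separators between top-level values.
        while idx < len(text) and text[idx] in " \t\r\n,":
            idx += 1
        if idx >= len(text):
            break
        # raw_decode returns the value and the index just past it.
        value, idx = decoder.raw_decode(text, idx)
        values.append(value)
    return values


blocks = parse_concatenated('{"a":"value", "b":"value"},{"a":"value", "b":"value"}')
print(blocks[1]["a"])
```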
The result must be a list of dicts:
[{'index1':'value1', 'index2':'value2'},{'index1':'value1', 'index2':'value2'}]
so you can reference its items by index: item[1]['index1']

Flask: flask.request.args.get replacing '+' with space in url

I am trying to use a Flask server for an API that takes image URLs through HTTP GET parameters.
I am using this example URL, which is very long (on pastebin) and contains many +'s. I have the following route set up in my Flask server:
@webapp.route('/example', methods=['GET'])
def process_example():
    imageurl = flask.request.args.get('imageurl', '')
    url = StringIO.StringIO(urllib.urlopen(imageurl).read())
    ...
but the issue I get is
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/aly/anaconda/lib/python2.7/urllib.py", line 87, in urlopen
return opener.open(url)
File "/Users/aly/anaconda/lib/python2.7/urllib.py", line 208, in open
return getattr(self, name)(url)
File "/Users/aly/anaconda/lib/python2.7/urllib.py", line 597, in open_data
data = base64.decodestring(data)
File "/Users/aly/anaconda/lib/python2.7/base64.py", line 321, in decodestring
return binascii.a2b_base64(s)
binascii.Error: Incorrect padding
Upon further inspection (i.e. printing the imageurl that flask gets) it would appear that the + characters are being replaced by literal spaces which seems to be screwing things up.
Is there an option for the flask.request.args.get function that can handle this?
You need to encode your query parameters correctly; in URL query parameter encoding, spaces are encoded to +, while + itself is encoded to %2B.
Flask cannot be told to treat specific data differently; you cannot reliably detect what data was correctly encoded and what wasn't. You could, however, extract the parameters from the query string manually by using request.query_string.
The better approach is to escape your parameters correctly (in JavaScript, use encodeURIComponent(), for example). The + character is not the only problematic character in a Base64-encoded value; the format also uses / and =, both of which carry meaning in a URL, which is why there is a URL-safe variant.
In fact, it is probably the = character at the end of that data: URL that is missing, being the more direct cause of the Incorrect padding error message. If you added it back you'd next indeed have problems with all the + characters having been decoded to ' '.
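The encoding rule can be checked without Flask, using only urllib.parse from the Python 3 standard library. This is a sketch under assumptions: the sample value is made up, and unquote_plus stands in for the form-style decoding that a query string normally goes through on the server.

```python
from urllib.parse import quote, unquote_plus

# Hypothetical Base64-like value containing '+', '/' and '='.
value = "ab+c/d=="

# Sent unescaped, form-style decoding turns '+' into a space.
decoded_raw = unquote_plus(value)

# Percent-encode every reserved character before building the URL.
escaped = quote(value, safe="")

# The escaped form survives the same decoding step unchanged.
round_tripped = unquote_plus(escaped)

print(repr(decoded_raw))
print(escaped)
print(round_tripped == value)
```

On the client side, JavaScript's encodeURIComponent() plays the role that quote(value, safe="") plays here.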

converting string to dictionary using json.loads

I'm trying to parse some JSON data extracted from a JavaScript file.
I have the following variable in my Python code. I get the string from file.read(). I know the text below becomes a dict if pasted into Python code as is.
resultStr = {"inst":{"summary":{"statistics":[],"wa_recursive":"100.000%","files":11,"dus":11}},"du":{"summary":{"statistics":[{"type":"stmt","data":"Statement Coverage","status":"covered","weight":1,"rhits":"100.000%","rtotal":"100.000%"},{"data":"Statements","rhits":86.000,"rtotal":86.000},{"data":"Subprograms","rhits":0.000,"rtotal":0.000},{"type":"branch","data":"Branch Coverage","status":"covered","weight":1,"rhits":"100.000%","rtotal":"100.000%"},{"data":"Branch paths","rhits":42.000,"rtotal":42.000},{"data":"Branches","rhits":21.000,"rtotal":21.000},{"type":"toggle","data":"Toggle Coverage","status":"uncovered","weight":1,"rhits":"94.410%","rtotal":"100.000%"},{"data":"Toggle bins","rhits":304.000,"rtotal":322.000},{"data":"Signal bits","rhits":150.000,"rtotal":161.000}],"wa_recursive":"98.137%","files":11,"dus":11}}};
When I pass this string into the JSON loader
json.loads(resultStr)
I get the following exception
File "C:\Python34\lib\json\__init__.py", line 318, in loads
return _default_decoder.decode(s)
File "C:\Python34\lib\json\decoder.py", line 346, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column 825 - line 1 column 826 (char 824 - 825)
To simplify, it's failing on the last part of the string:
"wa_recursive":"98.137%","files":11,"dus":11}}};
I've tried entering it manually, and it is recognized as a dictionary in the Python code.
I can't seem to find any fault with it, so some assistance would be appreciated :)
Thank you :)
The following works fine for me. Did you keep the semicolon in the string?
import json
resultStr = '{"inst":{"summary":{"statistics":[],"wa_recursive":"100.000%","files":11,"dus":11}},"du":{"summary":{"statistics":[{"type":"stmt","data":"Statement Coverage","status":"covered","weight":1,"rhits":"100.000%","rtotal":"100.000%"},{"data":"Statements","rhits":86.000,"rtotal":86.000},{"data":"Subprograms","rhits":0.000,"rtotal":0.000},{"type":"branch","data":"Branch Coverage","status":"covered","weight":1,"rhits":"100.000%","rtotal":"100.000%"},{"data":"Branch paths","rhits":42.000,"rtotal":42.000},{"data":"Branches","rhits":21.000,"rtotal":21.000},{"type":"toggle","data":"Toggle Coverage","status":"uncovered","weight":1,"rhits":"94.410%","rtotal":"100.000%"},{"data":"Toggle bins","rhits":304.000,"rtotal":322.000},{"data":"Signal bits","rhits":150.000,"rtotal":161.000}],"wa_recursive":"98.137%","files":11,"dus":11}}}'
decodedData = json.loads(resultStr)
print(decodedData)
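To see for yourself what the parser is objecting to, you can catch the exception and look at the position it reports. A minimal sketch, assuming Python 3.5 or later, where json raises json.JSONDecodeError with msg and pos attributes (on 3.4, as in the question, only a plain ValueError is available); the function name explain_failure and the shortened sample string are made up for illustration:

```python
import json


def explain_failure(text):
    """Return (message, position, offending text), or None if text parses.

    Hypothetical helper; relies on json.JSONDecodeError (Python 3.5+).
    """
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return err.msg, err.pos, text[err.pos:err.pos + 20]
    return None


# Shortened stand-in for the real data, with the trailing ';' kept.
sample = '{"a": {"dus": 11}};'
print(explain_failure(sample))

# Dropping the JavaScript statement terminator makes it parse.
cleaned = sample.rstrip().rstrip(";")
print(json.loads(cleaned))
```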

Python encoding of Latin american characters

I'm trying to allow users to sign up to my service, and I'm noticing errors whenever somebody signs up with Latin American characters in their name. I tried reading several SO posts/websites, listed below:
Python regex against Latin-1 character encoding?
http://www.w3.org/TR/2009/WD-html5-20090423/infrastructure.html#character-encodings-0
http://docs.python.org/2/library/json.html
https://pypi.python.org/pypi/anyjson
but was still unable to solve it. My code example is as per below:
>>> val = json.dumps({"name":"Déjà"}, encoding="ISO-8859-1")
>>> val
'{"name": "D\\u00c3\\u00a9j\\u00c3\\u00a0"}'
Is there any way to force the encoding to work in this case, both for that and for deserializing? Any help is appreciated!
EDIT
The client is Android and iPhone applications. I'm using the following libraries to encode the json on the clients:
http://loopj.com/android-async-http/ (android)
https://github.com/AFNetworking/AFNetworking (ios)
EDIT 2
The same text was received by the server from the Android client as per below:
{"NAME":"D\ufffdj\ufffd"}
I was using anyjson to deserialize that and it said:
File "/usr/local/lib/python2.7/dist-packages/anyjson/__init__.py", line 135, in loads
return implementation.loads(value)
File "/usr/local/lib/python2.7/dist-packages/anyjson/__init__.py", line 99, in loads
return self._decode(s)
File "/usr/local/lib/python2.7/dist-packages/simplejson/__init__.py", line 454, in loads
return _default_decoder.decode(s)
File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 374, in decode
obj, end = self.raw_decode(s)
File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 393, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
ValueError: ('utf8', "D\xe9j\xe0", 1, 2, 'invalid continuation byte')
JSON should almost always be in Unicode (when encoded), and if you're writing a webserver, UTF-8. The following, in Python 3, is basically correct:
In [1]: import json
In [2]: val = json.dumps({"name":"Déjà"})
In [3]: val
Out[3]: '{"name": "D\\u00e9j\\u00e0"}'
A closer look:
'{"name": "D\\u00e9j\\u00e0"}'
            ^^^^^^^
The text \u00e9, which in JSON means "é".
The backslash is doubled because we're looking at the repr of a str.
You can then send val to the client, and in Javascript, JSON.parse should give you the right result.
Because you mentioned, "when somebody signs up": that implies data coming from the client (web browser) to you. How is that data being sent? What library/libraries are you writing a webserver in?
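The round trip can be sketched in Python 3 as follows. It is only an illustration of the point above, not the code of either side in this thread: it shows the ASCII-escaped form, a UTF-8 round trip, and the kind of failure you get when Latin-1 bytes are decoded as UTF-8.

```python
import json

record = {"name": "D\u00e9j\u00e0"}  # "Déjà", written with escapes

# Default output is pure ASCII: non-ASCII characters become \uXXXX escapes.
ascii_json = json.dumps(record)

# Alternatively keep the characters and encode the text as UTF-8 bytes.
utf8_bytes = json.dumps(record, ensure_ascii=False).encode("utf-8")
decoded = json.loads(utf8_bytes.decode("utf-8"))

# If the sender encodes as Latin-1 but the receiver assumes UTF-8,
# decoding fails before the JSON parser is even involved.
latin1_bytes = json.dumps(record, ensure_ascii=False).encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True

print(ascii_json)
print(decoded == record, mismatch_detected)
```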
Turns out this was mainly an issue in how I was doing the encoding from the Android side.
I am now setting the StringEntity this way in Android and it's working now:
StringEntity se = new StringEntity(obj.toString(), "UTF-8");
se.setContentType("application/json;charset=UTF-8");
se.setContentEncoding( new BasicHeader(HTTP.CONTENT_TYPE, "application/json"));
Also, I was using anyjson on the server which was using simplejson. This was creating errors at times as well. I switched to using the json library for Python.

How to use simplejson to decode following data?

I grab some data from a URL, and after searching online I found out that the data is in JSON format, but when I try simplejson.loads(data), it raises an exception.
This is my first time dealing with JSON data. Any suggestions on how to decode it?
Thanks
=================
result = simplejson.loads(data, encoding="utf-8")
File "F:\My Documents\My Dropbox\StockDataDownloader\simplejson\__init__.py", line 401, in loads
return cls(encoding=encoding, **kw).decode(s)
File "F:\My Documents\My Dropbox\StockDataDownloader\simplejson\decoder.py", line 402, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "F:\My Documents\My Dropbox\StockDataDownloader\simplejson\decoder.py", line 420, in raw_decode
raise JSONDecodeError("No JSON object could be decoded", s, idx)
simplejson.decoder.JSONDecodeError: No JSON object could be decoded: line 1 column 0 (char 0)
============================
data = "{identifier:'ID', label:'As at Wed 4 Aug 2010 05:05 PM',items:[{ID:0,N:'2ndChance',NC:'528',R:'NONE',I:'NONE',M:'-',LT:0.335,C:0.015,VL:51.000,BV:20.000,B:0.330,S:0.345,SV:20.000,O:0.335,H:0.335,L:0.335,V:17085.000,SC:'4',PV:0.320,P:4.6875,P_:'X',V_:''},{ID:1,N:'8Telecom',NC:'E25',R:'NONE',I:'NONE',M:'-',LT:0.190,C:0.000,VL:965.000,BV:1305.000,B:0.185,S:0.190,SV:641.000,O:0.185,H:0.190,L:0.185,V:179525.000,SC:'2',PV:0.190,P:0.0,P_:'X',V_:''},{ID:2,N:'A-Sonic',NC:'A53',R:'NONE',I:'NONE',M:'-',LT:0.090,C:0.005,VL:1278.000,BV:17.000,B:0.090,S:0.095,SV:346.000,O:0.090,H:0.090,L:0.090,V:115020.000,SC:'A',PV:0.085,P:5.882352734375,P_:'X',V_:''},{ID:3,N:'AA Grp',NC:'5GZ',R:'NONE',I:'NONE',M:'t',LT:0.000,C:0.000,VL:0.000,BV:100.000,B:0.050,S:0.060,SV:50.000,O:0.000,H:0.000,L:0.000,V:0.000,SC:'2',PV:0.050,P:0.0,P_:'X',V_:''}]}"
You're using simplejson correctly, but the site that gave you that data isn't using JSON format properly. Look at json.org, which uses simple syntax diagrams to show what is JSON: in the object diagram, after { (unless the object is empty, in which case a } immediately follows), JSON always has a string -- and as you see in that diagram, this means something that starts with a double quote. So, the very start of the string:
{identifier:
tells you that's incorrect JSON -- no double quotes around the word identifier.
Working around this problem is not as easy as recognizing it's there, but I wanted to reassure you, at least, about your code. Sigh... it does seem that broken websites, such a great tradition in the old HTML days, are with us to stay, no matter how modern the technology they break is... :-(
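For data shaped like the sample above, one possible workaround is to rewrite the text into strict JSON before parsing. The sketch below is a fragile heuristic, not a general JavaScript-literal parser: the function name loosely_parse is invented, and it assumes that string values contain no quotes or backslashes and never contain text that looks like ", key:".

```python
import json
import re


def loosely_parse(text):
    """Heuristically convert a JavaScript-style object literal to JSON.

    Hypothetical helper; only safe for simple data like the sample above.
    """
    # 'single-quoted' strings -> "double-quoted" strings
    text = re.sub(r"'([^'\"\\]*)'", r'"\1"', text)
    # bare keys after '{' or ',' -> quoted keys, e.g. {ID:0 -> {"ID":0
    text = re.sub(r'([{,]\s*)([A-Za-z_][A-Za-z0-9_]*)\s*:', r'\1"\2":', text)
    return json.loads(text)


# Shortened version of the data from the question.
data = "{identifier:'ID', items:[{ID:0,N:'2ndChance',LT:0.335,V_:''}]}"
result = loosely_parse(data)
print(result["items"][0]["N"])
```

If the input gets any more irregular than this, a dedicated parser for JavaScript literals is likely a better fit than regular expressions.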
