Using Python's json.loads with JSON containing regex value, breaks? - python

TL,DR; How can JSON containing a regex with escaped backslahes, be loaded using Python's JSON decoder?
Detail; The regular expression \\[0-9]\\ will match (for example):
\2\
The same regular expression could be encoded as a JSON value:
{
"pattern": "\\[0-9]\\"
}
And in turn, the JSON value could be encoded as a string in Python (note the single quotes):
'{"pattern": "\\[0-9]\\"}'
When loading the JSON in Python, a JSONDecodeError is raised:
import json
json.loads('{"pattern": "\\[0-9]\\"}')
The problem is caused by the regular expression escaping the blackslashes:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid \escape: line 1 column 14 (char 13)
>>> json.loads('{"pattern": "\\[0-9]\\"}')
This surprised me since each step seems reasonable (i.e. valid regex, valid JSON, and valid Python).
How can JSON containing a regex with escaped backslahes, be loaded using Python's JSON decoder?

What's happening is that Python is first escaping the input to loads as a string literal, making it '{"pattern": "\[0-9]\"}' (double backslash -> single backslash). Then, loads now attempts to escape \[, which is invalid. To fix, escape the backslashes again. However, it's easier and more practical to specify it as a raw string:
>>> import json
>>> json.loads('{"pattern": "\\[0-9]\\"}')
json.decoder.JSONDecodeError: Invalid \escape: line 1 column 14 (char 13)
>>> json.loads(r'{"pattern": "\\[0-9]\\"}')
{'pattern': '\\[0-9]\\'} # No error
Note that this problem won't apply if loading from a file.
test.json:
{"pattern": "\\[0-9]\\"}
Python:
import json
with open('test.json', 'r') as infile:
json.load(infile) # no problem
Basically, the problem arises with the fact that you're passing in a string literal, but ironically, your string literal isn't being taken literally.

The r means that the string is to be treated as a raw string, which means all escape codes will be ignored:
json.loads(r'{"pattern": "\\[0-9]\\"}')

Related

What's the most Pythonic way to parse out this value from a JSON-like blob?

See below. Given a well-known Google URL, I'm trying to retrieve data from that URL. That data will provide me another Google URL from which I can retrieve a list of JWKs.
>>> import requests, json
>>> open_id_config_url = 'https://ggp.sandbox.google.com/.well-known/openid-configuration'
>>> response = requests.get(open_id_config_url)
>>> r.status_code
200
>>> response.text
u'{\n "issuer": "https://www.stadia.com",\n "jwks_uri": "https://www.googleapis.com/service_accounts/v1/jwk/stadia-jwt#system.gserviceaccount.com",\n "claims_supported": [\n "iss",\n "aud",\n "sub",\n "iat",\n "exp",\n "s_env",\n "s_app_id",\n "s_gamer_tag",\n "s_purchase_country",\n "s_current_country",\n "s_session_id",\n "s_instance_ip",\n "s_restrict_text_chat",\n "s_restrict_voice_chat",\n "s_restrict_multiplayer",\n "s_restrict_stream_connect",\n ],\n "id_token_signing_alg_values_supported": [\n "RS256"\n ],\n}'
Above I have successfully retrieved the data from the first URL. I can see the entry jwks_uri contains the second URL I need. But when I try to convert that blob of text to a python dictionary, it fails.
>>> response.json()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/saqib.ali/saqib-env-99/lib/python2.7/site-packages/requests/models.py", line 889, in json
self.content.decode(encoding), **kwargs
File "/usr/local/Cellar/python#2/2.7.16/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/local/Cellar/python#2/2.7.16/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/local/Cellar/python#2/2.7.16/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 382, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
>>> json.loads(response.text)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/python#2/2.7.16/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/local/Cellar/python#2/2.7.16/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/local/Cellar/python#2/2.7.16/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 382, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
The only way I can get out the JWKs URL is by doing this ugly regular expression parsing:
>>> re.compile('(?<="jwks_uri": ")[^"]+').findall(response.text)[0]
u'https://www.googleapis.com/service_accounts/v1/jwk/stadia-jwt#system.gserviceaccount.com'
Is there a cleaner, more Pythonic way to extract this string?
I really wish Google would send back a string that could be cleanly JSON-ified.
The returned json string is incorrect because last item of the dictionary ends with ,, which json cannot parse.
": [\n "RS256"\n ],\n}'
^^^
But ast.literal_eval can do that (as python parsing accepts lists/dicts that end with a comma). As long as you don't have booleans or null values, it is possible and pythonic
>>> ast.literal_eval(response.text)["jwks_uri"]
'https://www.googleapis.com/service_accounts/v1/jwk/stadia-jwt#system.gserviceaccount.com'
Your JSON is invalid because it has an extra comma after the last value in the claims_supported array.
I wouldn't necessarily recommend it, but you could use the similarity of JSON and Python syntax to parse this directly, since Python is much less picky:
ast.literal_eval(response.tezt)
As suggested in this answer use yaml to parse json. It will tolerate the trailing comma as well as other deviations from the json standard.
import yaml
d = yaml.load(response.text)

How to change type of list to str? [duplicate]

This question already has answers here:
Reading a list stored in a text file [duplicate]
(2 answers)
Closed 4 years ago.
I am extracting a list from html of a web page in the format
lst = '["a","b","c"]' # (type <str>)
The data type of the above is str and I want it to convert to a python list type , somthing like this
lst = ["a","b","c"] #(type <list>)
I can obtain the above by
lst = lst[1:-1].replace('"','').split(',')
But as the actual value of a,b &c is quite long and complex(contains long html text), I can't be dependent on the above method.
I also tried doing it with json module and using json.loads(lst) , that is giving the below exception
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/local/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/local/lib/python2.7/json/decoder.py", line 382, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Any way of converting to list in Python?
Edit: The actual value of the list is:
['reqlistitem.no','reqlistitem.applyonlinejobdesc','reqlistitem.no','reqlistitem.referjobdesc','reqlistitem.applyemailsubjectapplication','reqlistitem.applyemailjobdesc','reqlistitem.no','reqlistitem.addedtojobcart','reqlistitem.displayjobcartactionjobdesc','reqlistitem.shareURL','reqlistitem.title','reqlistitem.shareable','reqlistitem.title','reqlistitem.contestnumber','reqlistitem.contestnumber','reqlistitem.description','reqlistitem.description','reqlistitem.primarylocation','reqlistitem.primarylocation','reqlistitem.otherlocations','reqlistitem.jobschedule','reqlistitem.jobschedule','reqlistitem.jobfield','reqlistitem.jobfield','reqlistitem.displayreferfriendaction','reqlistitem.no','reqlistitem.no','reqlistitem.applyonlinejobdesc','reqlistitem.no','reqlistitem.referjobdesc','reqlistitem.applyemailsubjectapplication','reqlistitem.applyemailjobdesc','reqlistitem.no','reqlistitem.addedtojobcart','reqlistitem.displayjobcartactionjobdesc','reqlistitem.shareURL','reqlistitem.title','reqlistitem.shareable']
i think you are looking for literal_eval:
import ast
string = '["a","b","c"]'
print ast.literal_eval(string) # ['a', 'b', 'c']
The problem in your sample string is the single quotes. The JSON standard requires double quotes.
If you change the single quotes to double quotes, it will work. An easy way is to use str.replace():
import json
s = "['reqlistitem.no','reqlistitem.applyonlinejobdesc','reqlistitem.no']"
json.loads(s.replace("'", '"'))
#[u'reqlistitem.no', u'reqlistitem.applyonlinejobdesc', u'reqlistitem.no']

Python 3 not reading a JSON file right

I have some json files created by powershell using the ConvertTo-Json command. The content of the json file looks like
{
"Key1": "Value1",
"Key2": "Value2"
}
I ran the python interpreter to see if I could read the file but I get this weird output
>>> f=open('test.json', 'r')
>>> f.read()
'ÿ\xfe{\x00\n\x00\n\x00 \x00 \x00 \x00 \x00"\x00K\x00e\x00y\x001\x00"\x00:\x00 \x00 \x00"\x00V\x00a\x00l\x00u\x00e\x001\x00"\x00,\x00\n\x00\n\x00 \x00 \x00 \x00 \x00"\x00K\x00e\x00y\x002\x00"\x00:\x00 \x00 \x00"\x00V\x00a\x00l\x00u\x00e\x002\x00"\x00\n\x00\n\x00}\x00\n\x00\n\x00'
For some reason all the characters are escaped byte characters and there's the weird ÿ at the begninning (powershell error?).
The weird thing is this:
>>> f=open('test.json', 'r')
>>> str=f.read()
>>> type(str)
<class 'str'>
>>> json.loads(str)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Rutvik_Choudhary\AppData\Local\Programs\Python\Python35-32\lib\json\__init__.py", line 319, in loads
return _default_decoder.decode(s)
File "C:\Users\Rutvik_Choudhary\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Users\Rutvik_Choudhary\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
So the input is a string, but the json module can't parse it (json.load(f) return the same error). What is causing this error? Is it a python thing, a powershell thing, a json thing?
As pointed out by jwodder, PowerShell has encoded your json using UTF-16LE. To get this data into json correctly, you need to open the file using the correct encoding. eg.
with open("test.json", "r", encoding="utf16") as f:
json_string = f.read()
my_dict = json.loads(json_string)
You don't need to tell Python which variant of UTF-16 is being used. This is the purpose of the first two bytes of the text file. It's called a Byte Order Mark (BOM). It lets a program know if UTF-16LE or UTF-16BE has been used to encode the text file.
It seems that you have a BOM at the start of your file. You can verify it in a hex editor or with a good text editor (Notepad++ shows if BOM is present).
If you want to load text files with Unicode BOM headers, like yours you should better use to codecs.open functions instead of open as the default open is not able to interpret the BOM.
Or you can have a look at tendo.unicode - a small library that I wrote that can improve life for people that are not used to Unicode texts.

Unable to load json containing escape sequences

I'm being passed some Json and am having trouble parsing it.
The object is currently simple with a single key/value pair. The key works fine but the value \d causes issues.
This is coming from an html form, via javascript. All of the below are literals.
Html: \d
Javascript: {'Key': '\d'}
Json: {"Key": "\\d"}
json.loads() doesn't seem to like Json in this format. A quick sanity check that I'm not doing anything silly works fine:
>>> import json
>>> json.loads('{"key":"value"}')
{'key': 'value'}
Since I'm declaring this string in Python, it should escape it down to a literal of va\\lue - which, when parsed as Json should be va\lue.
>>> json.loads('{"key":"va\\\\lue"}')
{'key': 'va\\lue'}
In case python wasn't escaping the string on the way in, I thought I'd check without the doubling...
>>> json.loads('{"key":"va\\lue"}')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python33\lib\json\__init__.py", line 319, in loads
return _default_decoder.decode(s)
File "C:\Python33\lib\json\decoder.py", line 352, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Python33\lib\json\decoder.py", line 368, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Invalid \escape: line 1 column 11 (char 10)
but it fails, as expected.
I can't see any way to parse Json field that should contain a single backslash after all the unescaping has taken place.
How can I get Python to deserialize this string literal {"a":"val\\ue"} (which is valid Json) into the appropriate python representation: {'a': 'val\ue'}?
As an aside, it doesn't help that PyDev is inconsistent with what representation of a string it uses. The watch window shows double backslashes, the tooltip of the variable shows quadruple backslashes. I assume that's the "If you were to type the string, this is what you'd have to use for it to escape to the original" representation, but it's by no means clear.
Edit to follow on from #twalberg's answer...
>>> input={'a':'val\ue'}
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec cant decode bytes in position 3-5: truncated \uXXXX escape
>>> input={'a':'val\\ue'}
>>> input
{'a': 'val\\ue'}
>>> json.dumps(input)
'{"a": "val\\\\ue"}'
>>> json.loads(json.dumps(input))
{'a': 'val\\ue'}
>>> json.loads(json.dumps(input))['a']
'val\\ue'
Using json.dumps() to see how json would represent your target string:
>>> orig = { 'a' : 'val\ue' }
>>> jstring = json.dumps(orig)
>>> print jstring
{"a": "val\\ue"}
>>> extracted = json.loads(jstring)
>>> print extracted
{u'a': u'val\\ue'}
>>> print extracted['a']
val\ue
>>>
This was in Python 2.7.3, though, so it may be only partially relevant to your Python 3.x environment. Still, I don't think JSON has changed that much...

json.JSONDecoder().decode() can not work well

code is simple, but it can not work. I don't know the problem
import json
json_data = '{text: \"tl4ZCTPzQD0k|rEuPwudrAfgBD3nxFIsSbb4qMoYWA=\", key: \"MPm0ZIlk9|ADco64gjkJz2NwLm6SWHvW\"}'
my_data = json.JSONDecoder().decode(json_data)
print my_data
throw exption behinde:
Traceback (most recent call last):
File "D:\Python27\project\demo\digSeo.py", line 4, in <module>
my_data = json.JSONDecoder().decode(json_data)
File "D:\Python27\lib\json\decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "D:\Python27\lib\json\decoder.py", line 382, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 1 column 2 (char 1)
Your json_data is not valid JSON.
In JSON, property names need to be in double quotes ("). Also, the double quotes terminating the string values don't need to be ecaped since you're already using single quotes (') for the string.
Example:
json_data = '{"text": "tl4ZCTPzQD0k|rEuPwudrAfgBD3nxFIsSbb4qMoYWA=", "key": "MPm0ZIlk9|ADco64gjkJz2NwLm6SWHvW"}'
The json module in Python standard library can work well, that's what a lot of people are using for their applications.
However these few lines of code that use this module have a small issue. The problem is that your sample data is not a valid JSON. The keys (text and key) should be quoted like this:
json_data = '{"text": \"tl4ZCTPzQD0k|rEuPwudrAfgBD3nxFIsSbb4qMoYWA=\", "key": \"MPm0ZIlk9|ADco64gjkJz2NwLm6SWHvW\"}'

Categories

Resources