I am trying to read the following string into a dictionary using Python json package
However, under one of the subfield 'name' there is a description with a nested double quote. My json is unable to read the string that way
import json
string1 =
'{"id":17033,"project_id":17033,"state":"active","state_changed_at":1488054590,"name":"a.k.a.:\xa0"The Sunshine Makers""'
json.loads(string1)
A ERROR was raised
JSONDecodeError: Expecting ',' delimiter: line 1 column 96 (char 95)
I know that the reason for this error was due to the nested double quote around "The Sunshine Makers"
How to I get rid of this double quote?
More examples of string that cause error
string2 = '{"id":960066,"project_id":960066,"state":"active","state_changed_at":1502049940,"name":"New J. Lye Album - Behind The Lyes","blurb":"I am working on my new project titled "Behind The Lyes" which is coming out fall of 2017."'
#The problem with this string comes from the nested double quote around the pharse "Behind The Lyes inside" the 'blurb' subfield
Note that your string has more than one issue making it invalid JSON:
The error you're seeing is the \xa0 (a non-breaking space). That needs to be addressed before the "" issue becomes a problem.
Your string is missing a closing }.
That said, for the string you've cited first, one approach to fixing your issues would be to use .replace():
string1 = '{"id":17033,"project_id":17033,"state":"active","state_changed_at":1488054590,"name":"a.k.a.:\xa0"The Sunshine Makers""'.replace('\xa0"', "'").replace('""', "'\"") + '}'
For example, the following handles the double quoting and other issues in your two sample strings:
import json
fixes = [('\xa0', ' '),('"',"'"),("{'",'{"'),("','", '","'),(",'", ',"'),("':'", '":"'),("':", '":'),("''", '\'\"'), ("'}",'"}')]
print(fixes)
string1 = '{"id":17033,"project_id":17033,"state":"active","state_changed_at":1488054590,"name":"a.k.a.:\xa0"The Sunshine Makers""'
string2 = '{"id":960066,"project_id":960066,"state":"active","state_changed_at":1502049940,"name":"New J. Lye Album - Behind The Lyes","blurb":"I am working on my new project titled "Behind The Lyes" which is coming out fall of 2017."'
strings = [string1, string2]
for string in strings:
print(string)
string = string + '}'
for fix in fixes:
string = string.replace(*fix)
print(string)
print(json.loads(string)['name'])
It would be helpful if you could fill out your question with the code or file from which you are retrieving these strings. That would make it possible to give a more comprehensive answer.
Related
import json
a_json = '{"some_body":"
[{"someId":"189353945391","EId":"09358039485","someUID":10,"LegalId":"T743","cDate":"202452","rAmount":{"aPa":{"am":1500,"currId":"UD"},"cost":{"amount":1000,"currId":"US"},"lPrice":{"amount":100,"currId":"DD"}},"tes":{"ant":0,"currId":"US"},"toount":{"amnt":0,"currId":"US"},"toount":{"amt":210,"currId":"US"},"bry":"US","pay":[{"pId":"7111","axt":{"amt":2000,"currId":"US"},"mKey":"CSD"}],"oItems":[{"iIndex":0,"rId":"69823","provId":"001","segEntityId":"C001","per":{"vae":1,"ut":"MOS"},"pct":{"prod":"748"},"revType":"REW","rAmount":{"aPaid":{"amt":90000,"currId":"US"},"xt":{"amt":0,"currId":"USD"},"lPrice":{"amt":90000,"currId":"US"}},"stion":{"sLocal":"094u5304","eLocal":"3459340"},"tx":{"adt":{"adet":0,"currId":"US"},"era":"werTIC"}}}]"}'
Above is the JSON I am trying to iterate through. Kind of a weird JSON. I want to take that JSON string and turn it into a python dictionary, so I can better iterate it. I put a single quote mark on either side and used json.loads to turn it into a python dictionary.
loaded_body = json.loads(a_json)
And I am calling the only key from the beginning "some_body" which will get the contents of the entire body. When I run the following command print(loaded_body) I get back the error:
expecting ',' delimiter: line 1 column 18 (char 17)
Ultimately once that error is resolved, this is what I am trying to access:
desired_item =loaded_body['oItems'][0]['rAmount']['aPa']['currId']
Thanks
It seems that you're treating the content of some_body as a string since it's enclosed with double quotes. But inside of that string there's also quotation marks and now it's interpreted that the content of some_body is [{ and then it breaks because directly after that is someId rather than a comma. Thus the error:
expecting ',' delimiter: line 1 column 18 (char 17)
If the content of some_body was actually meant to be a string then all the double quotes inside of it should be preceded by a double backslash (\\) although in this case you'd have to parse the JSON twice - first the entire a_json string and then the content of some_body. However I think it would be easier to just remove the double quotes around the content of some_body.
a_json = '{"some_body":
[{"someId":"189353945391","EId":"09358039485","someUID":10,"LegalId":"T743","cDate":"202452","rAmount":{"aPa":{"am":1500,"currId":"UD"},"cost":{"amount":1000,"currId":"US"},"lPrice":{"amount":100,"currId":"DD"}},"tes":{"ant":0,"currId":"US"},"toount":{"amnt":0,"currId":"US"},"toount":{"amt":210,"currId":"US"},"bry":"US","pay":[{"pId":"7111","axt":{"amt":2000,"currId":"US"},"mKey":"CSD"}],"oItems":[{"iIndex":0,"rId":"69823","provId":"001","segEntityId":"C001","per":{"vae":1,"ut":"MOS"},"pct":{"prod":"748"},"revType":"REW","rAmount":{"aPaid":{"amt":90000,"currId":"US"},"xt":{"amt":0,"currId":"USD"},"lPrice":{"amt":90000,"currId":"US"}},"stion":{"sLocal":"094u5304","eLocal":"3459340"},"tx":{"adt":{"adet":0,"currId":"US"},"era":"werTIC"
}}]}]}'
Also note that there's a missing closing square bracket near the end.
In your original string:
"werTIC"}}}]"}
Which should probably be:
"werTIC"}}]}]}
Now the desired item should look like this (I assume that 'aPa' was meant to be 'aPaid'):
desired_item = loaded_body['some_body'][0]['oItems'][0]['rAmount']['aPaid']['currId']
print(desired_item)
US
I'm trying to parse a JSON string that contains double quotes in it:
import json
x = '''{"key":"Value \"123\" "}'''
When I try to load this JSON using the following statement
y = json.loads(x)
It raises the following exception:
Expecting ',' delimiter: line 1 column 15 (char 14)
As per my understanding, it is due to the double quotes around 123 in JSON. Also, I tried replacing the \" (backslash quote) with some other stuff as well but all in vain
x.replace('\"',"'")
as it also replaced the double quotes that are present around key and value as well
'''{"key": "Value \"123\" "}''' ---Replacing--> '''{'key':'Value '123' '}''')
I can not change anything in the input string. That's coming from an API.
Any help would be highly appreciated, I'm stuck with this for quite a time now. Thanks in advance...
\" within a string is simply a double quotation mark. You need to add another backslash:
x = '''{"key":"Value \\"123\\" "}'''
While trying to parse JSON from an AJAX request, the string returned contains invalid JSON.
Although the best practice would be to change the server to reply with valid JSON, as suggested in multiple related answers, this is not an option.
Trying to solve this problem using python, I looked at regular expressions.
The main problem is elements as follows (which I currently use as a test string:
testStr = '{"KEY1":"THIS IS "AN" ELEMENT","KEY2":"""THIS IS ANOTHER "ELEMENT""}'
I currently use the following code:
jsonString = re.sub(r'(?<=\w)\"(?=[^\(\:\}\,])','\\"',testStr)
jsonString = re.sub(r'\"\"(?![,}:])','\"\\\"',jsonString)
with very limited success.
If I was using C, I would parse the string, and simply escape all double quotes within the element (i.e between all double quotes which are preceded by [:{},] )
There must be a pythonic way to parse, without resorting to a for loop and looking ahead, and keeping history.
EDIT:
Assuming that strings do not contain: [ : { } ]
And also assuming that the unescaped double quotes are only within the value, and not in the key,
Then I assume that the following (or something similar should solve the problem:
import re
re.sub(r'(?<![\[\:])\"(?![,\}),'\"',testString)
But it still does not work.
Seems I needed a break to solve this.
The following regular expression seems to replace only doublequotes that are contained within the element string. (With the assumptions I stated in the question)
output = re.sub(r'(?<![\[\:\{\,])\"(?![\:\}\,])','\\\"', stringName)
I have created a sandbox here: https://repl.it/vNK
Example Output:
Original String:
{"KEY1":"THIS IS "AN" ELEMENT","KEY2":"""THIS IS ANOTHER "ELEMENT""}
Modified String:
{"KEY1":"THIS IS \"AN\" ELEMENT","KEY2":"\"\"THIS IS ANOTHER \"ELEMENT\""}
Parsed JSON:
{
"KEY1": "THIS IS \"AN\" ELEMENT",
"KEY2": "\"\"THIS IS ANOTHER \"ELEMENT\""
}
Any suggestions are welcome.
While trying to parse JSON from an AJAX request, the string returned contains invalid JSON.
Although the best practice would be to change the server to reply with valid JSON, as suggested in multiple related answers, this is not an option.
Trying to solve this problem using python, I looked at regular expressions.
The main problem is elements as follows (which I currently use as a test string:
testStr = '{"KEY1":"THIS IS "AN" ELEMENT","KEY2":"""THIS IS ANOTHER "ELEMENT""}'
I currently use the following code:
jsonString = re.sub(r'(?<=\w)\"(?=[^\(\:\}\,])','\\"',testStr)
jsonString = re.sub(r'\"\"(?![,}:])','\"\\\"',jsonString)
with very limited success.
If I was using C, I would parse the string, and simply escape all double quotes within the element (i.e between all double quotes which are preceded by [:{},] )
There must be a pythonic way to parse, without resorting to a for loop and looking ahead, and keeping history.
EDIT:
Assuming that strings do not contain: [ : { } ]
And also assuming that the unescaped double quotes are only within the value, and not in the key,
Then I assume that the following (or something similar should solve the problem:
import re
re.sub(r'(?<![\[\:])\"(?![,\}),'\"',testString)
But it still does not work.
Seems I needed a break to solve this.
The following regular expression seems to replace only doublequotes that are contained within the element string. (With the assumptions I stated in the question)
output = re.sub(r'(?<![\[\:\{\,])\"(?![\:\}\,])','\\\"', stringName)
I have created a sandbox here: https://repl.it/vNK
Example Output:
Original String:
{"KEY1":"THIS IS "AN" ELEMENT","KEY2":"""THIS IS ANOTHER "ELEMENT""}
Modified String:
{"KEY1":"THIS IS \"AN\" ELEMENT","KEY2":"\"\"THIS IS ANOTHER \"ELEMENT\""}
Parsed JSON:
{
"KEY1": "THIS IS \"AN\" ELEMENT",
"KEY2": "\"\"THIS IS ANOTHER \"ELEMENT\""
}
Any suggestions are welcome.
I am expecting users to upload a CSV file of max size 1MB to a web form that should fit a given format similar to:
"<String>","<String>",<Int>,<Float>
That will be processed later. I would like to verify the file fits a specified format so that the program that shall later use the file doesnt receive unexpected input and that there are no security concerns (say some injection attack against the parsing script that does some calculations and db insert).
(1) What would be the best way to go about doing this that would be fast and thorough? From what I've researched I could go the path of regex or something more like this. I've looked at the python csv module but that doesnt appear to have any built in verification.
(2) Assuming I go for a regex, can anyone direct me to towards the best way to do this? Do I match for illegal characters and reject on that? (eg. no '/' '\' '<' '>' '{' '}' etc.) or match on all legal eg. [a-zA-Z0-9]{1,10} for the string component? I'm not too familiar with regular expressions so pointers or examples would be appreciated.
EDIT:
Strings should contain no commas or quotes it would just contain a name (ie. first name, last name). And yes I forgot to add they would be double quoted.
EDIT #2:
Thanks for all the answers. Cutplace is quite interesting but is a standalone. Decided to go with pyparsing in the end because it gives more flexibility should I add more formats.
Pyparsing will process this data, and will be tolerant of unexpected things like spaces before and after commas, commas within quotes, etc. (csv module is too, but regex solutions force you to add "\s*" bits all over the place).
from pyparsing import *
integer = Regex(r"-?\d+").setName("integer")
integer.setParseAction(lambda tokens: int(tokens[0]))
floatnum = Regex(r"-?\d+\.\d*").setName("float")
floatnum.setParseAction(lambda tokens: float(tokens[0]))
dblQuotedString.setParseAction(removeQuotes)
COMMA = Suppress(',')
validLine = dblQuotedString + COMMA + dblQuotedString + COMMA + \
integer + COMMA + floatnum + LineEnd()
tests = """\
"good data","good2",100,3.14
"good data" , "good2", 100, 3.14
bad, "good","good2",100,3.14
"bad","good2",100,3
"bad","good2",100.5,3
""".splitlines()
for t in tests:
print t
try:
print validLine.parseString(t).asList()
except ParseException, pe:
print pe.markInputline('?')
print pe.msg
print
Prints
"good data","good2",100,3.14
['good data', 'good2', 100, 3.1400000000000001]
"good data" , "good2", 100, 3.14
['good data', 'good2', 100, 3.1400000000000001]
bad, "good","good2",100,3.14
?bad, "good","good2",100,3.14
Expected string enclosed in double quotes
"bad","good2",100,3
"bad","good2",100,?3
Expected float
"bad","good2",100.5,3
"bad","good2",100?.5,3
Expected ","
You will probably be stripping those quotation marks off at some future time, pyparsing can do that at parse time by adding:
dblQuotedString.setParseAction(removeQuotes)
If you want to add comment support to your input file, say a '#' followed by the rest of the line, you can do this:
comment = '#' + restOfline
validLine.ignore(comment)
You can also add names to these fields, so that you can access them by name instead of index position (which I find gives more robust code in light of changes down the road):
validLine = dblQuotedString("key") + COMMA + dblQuotedString("title") + COMMA + \
integer("qty") + COMMA + floatnum("price") + LineEnd()
And your post-processing code can then do this:
data = validLine.parseString(t)
print "%(key)s: %(title)s, %(qty)d in stock at $%(price).2f" % data
print data.qty*data.price
I'd vote for parsing the file, checking you've got 4 components per record, that the first two components are strings, the third is an int (checking for NaN conditions), and the fourth is a float (also checking for NaN conditions).
Python would be an excellent tool for the job.
I'm not aware of any libraries in Python to deal with validation of CSV files against a spec, but it really shouldn't be too hard to write.
import csv
import math
dataChecker = csv.reader(open('data.csv'))
for row in dataChecker:
if len(row) != 4:
print 'Invalid row length.'
return
my_int = int(row[2])
my_float = float(row[3])
if math.isnan(my_int):
print 'Bad int found'
return
if math.isnan(my_float):
print 'Bad float found'
return
print 'All good!'
Here's a small snippet I made:
import csv
f = csv.reader(open("test.csv"))
for value in f:
value[0] = str(value[0])
value[1] = str(value[1])
value[2] = int(value[2])
value[3] = float(value[3])
If you run that with a file that doesn't have the format your specified, you'll get an exception:
$ python valid.py
Traceback (most recent call last):
File "valid.py", line 8, in <module>
i[2] = int(i[2])
ValueError: invalid literal for int() with base 10: 'a3'
You can then make a try-except ValueError to catch it and let the users know what they did wrong.
There can be a lot of corner-cases for parsing CSV, so you probably don't want to try doing it "by hand". At least start with a package/library built-in to the language that you're using, even if it doesn't do all the "verification" you can think of.
Once you get there, then examine the fields for your list of "illegal" chars, or examine the values in each field to determine they're valid (if you can do so). You also don't even need a regex for this task necessarily, but it may be more concise to do it that way.
You might also disallow embedded \r or \n, \0 or \t. Just loop through the fields and check them after you've loaded the data with your csv lib.
Try Cutplace. It verifies that tabluar data conforms to an interface control document.
Ideally, you want your filtering to be as restrictive as possible - the fewer things you allow, the fewer potential avenues of attack. For instance, a float or int field has a very small number of characters (and very few configurations of those characters) which should actually be allowed. String filtering should ideally be restricted to only what characters people would have a reason to input - without knowing the larger context it's hard to tell you exactly which you should allow, but at a bare minimum the string match regex should require quoting of strings and disallow anything that would terminate the string early.
Keep in mind, however, that some names may contain things like single quotes ("O'Neil", for instance) or dashes, so you couldn't necessarily rule those out.
Something like...
/"[a-zA-Z' -]+"/
...would probably be ideal for double-quoted strings which are supposed to contain names. You could replace the + with a {x,y} length min/max if you wanted to enforce certain lengths as well.