How to parse JSON files with double quotes inside strings in Python?

I'm trying to read a JSON file in Python. Some of the lines have strings with double quotes inside:
{"Height__c": "8' 0\"", "Width__c": "2' 8\""}
Using a raw string literal produces the right output:
json.loads(r"""{"Height__c": "8' 0\"", "Width__c": "2' 8\""}""")
{u'Width__c': u'2\' 8"', u'Height__c': u'8\' 0"'}
But my string comes from a file, i.e.:
s = f.readline()
Where:
>>> print repr(s)
'{"Height__c": "8\' 0"", "Width__c": "2\' 8""}'
And json throws the following exception:
json.loads(s) # s = """{"Height__c": "8' 0\"", "Width__c": "2' 8\""}"""
ValueError: Expecting ',' delimiter: line 1 column 21 (char 20)
Also,
>>> s = """{"Height__c": "8' 0\"", "Width__c": "2' 8\""}"""
>>> json.loads(s)
Fails, but assigning the raw literal works:
>>> s = r"""{"Height__c": "8' 0\"", "Width__c": "2' 8\""}"""
>>> json.loads(s)
{u'Width__c': u'2\' 8"', u'Height__c': u'8\' 0"'}
Do I need to write a custom Decoder?

The data file you have does not escape the nested quotes correctly; this can be hard to repair.
If the nested quotes follow a pattern, e.g. they always follow a digit and are the last character in each string, you can use a regular expression to fix them up. Given your sample data, if all you have is measurements in feet and inches, that's certainly doable:
import re
from functools import partial
repair_nested = partial(re.compile(r'(\d)""').sub, r'\1\\""')
json.loads(repair_nested(s))
Demo:
>>> import json
>>> import re
>>> from functools import partial
>>> s = '{"Height__c": "8\' 0"", "Width__c": "2\' 8""}'
>>> repair_nested = partial(re.compile(r'(\d)""').sub, r'\1\\""')
>>> json.loads(s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/decoder.py", line 365, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/decoder.py", line 381, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting , delimiter: line 1 column 21 (char 20)
>>> json.loads(repair_nested(s))
{u'Width__c': u'2\' 8"', u'Height__c': u'8\' 0"'}
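Since the question reads the data with f.readline(), the same repair can be applied to each line as it is read. A minimal sketch, assuming a file where every line holds one such object (the filename measurements.txt is made up for illustration):
import json
import re
from functools import partial

# Escape a double quote that follows a digit and closes the string
# (the inches mark in the sample data), exactly as in the demo above.
repair_nested = partial(re.compile(r'(\d)""').sub, r'\1\\""')

records = []
with open('measurements.txt') as f:
    for line in f:
        if line.strip():
            records.append(json.loads(repair_nested(line)))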

Related

Convert String Representation Of Array To Actual Array In Python

I am quite new to Python and am struggling to make this work.
I need to convert a string representation of an array, e.g.
["ladksa dlskdlsd slkd sldks lskds lskds","skkdsdl skdsl lskds lsdk sldk sdslkds"]
into an actual array, like this:
[0] -> "ladksa dlskdlsd slkd sldks lskds lskds"
[1] -> "skkdsdl skdsl lskds lsdk sldk sdslkds"
I have tried the following:
json.load(array) -> but it gave me a parse error
literal_eval(x) -> read about it somewhere (don't know why it doesn't work)
Error:
custom_classes = json.loads(element.custom_classes)
  File "C:\Users\Shri\AppData\Local\Programs\Python\Python38-32\lib\json\__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "C:\Users\Shri\AppData\Local\Programs\Python\Python38-32\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\Shri\AppData\Local\Programs\Python\Python38-32\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 2 (char 1)
Also, I am using the JS code below to send the data:
'custom_classes':'\''+JSON.stringify(finalized_classes)+'\'',
The data is stored as shown in the attached image.
arr_element is an array of objects for a specific class, each of which has a property custom_classes that is stored as shown in that image.
And I am doing:
for element in arr_element:
    string_obj = json.loads(str(element.custom_classes))  # the error is raised here
Please help.
import json
s = '["ladksa dlskdlsd slkd sldks lskds lskds","skkdsdl skdsl lskds lsdk sldk sdslkds"]'
json.loads(s)
# gives ['ladksa dlskdlsd slkd sldks lskds lskds', 'skkdsdl skdsl lskds lsdk sldk sdslkds']
After you use json.loads(), you can use str.split() to turn the strings into arrays, and then extend one of the arrays with the other if you need to.
foo = ['this is a sentence', 'this is a different sentence']
bar = foo[0].split(' ')
biz = foo[1].split(' ')
bar.extend(biz)
print(bar)
Result: ['this', 'is', 'a', 'sentence', 'this', 'is', 'a', 'different', 'sentence']

Why does simplejson think this string is invalid json? [duplicate]

This question already has answers here:
Using number as "index" (JSON)
(7 answers)
Closed 2 years ago.
I'm trying to convert a simple JSON string {\n 100: {"a": "b"}\n} to a Python object, but it's giving me this error: Expecting property name enclosed in double quotes
Why is it insisting that the name of the attribute be a string?
>>> import simplejson
>>> my_json = simplejson.loads('{\n 100: {"a": "b"}\n}')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/myuser/myenv/lib/python2.7/site-packages/simplejson/__init__.py", line 525, in loads
return _default_decoder.decode(s)
File "/Users/myuser/myenv/lib/python2.7/site-packages/simplejson/decoder.py", line 370, in decode
obj, end = self.raw_decode(s)
File "/Users/myuser/myenv/lib/python2.7/site-packages/simplejson/decoder.py", line 400, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 5 (char 6)
The value {100: {"a": "b"}} is invalid JSON; you need {"100": {"a": "b"}}.
The property name 100 must be enclosed in double quotes, i.e. "100".
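Once the key is quoted, the same call succeeds; a quick check (the stdlib json module behaves the same way):
import simplejson

data = simplejson.loads('{\n "100": {"a": "b"}\n}')
print(data["100"]["a"])  # -> b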
Why is it insisting that the name of the attribute be a string?
That's how JSON is.
You may be used to being able to write {100: {"a": "b"}} in JavaScript or another language (without double-quoting the property name), but you'll still get a parsing error if you try to parse it as JSON in JavaScript, e.g.:
JSON.parse('{100: {"a": "b"}}')
SyntaxError: JSON.parse: expected property name or '}'
at line 1 column 2 of the JSON data

ValueError: Unknown string format in Python?

I'm trying to parse a basic iso formatted datetime string in Python, but I'm having a hard time doing that. Consider the following example:
>>> import json
>>> from datetime import datetime, date
>>> import dateutil.parser
>>> date_handler = lambda obj: obj.isoformat()
>>> the_date = json.dumps(datetime.now(), default=date_handler)
>>> print the_date
"2017-02-18T22:14:09.915727"
>>> print dateutil.parser.parse(the_date)
Traceback (most recent call last):
File "<input>", line 1, in <module>
print dateutil.parser.parse(the_date)
File "/usr/local/lib/python2.7/site-packages/dateutil/parser.py", line 1168, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "/usr/local/lib/python2.7/site-packages/dateutil/parser.py", line 559, in parse
raise ValueError("Unknown string format")
ValueError: Unknown string format
I've also tried parsing this using the regular strptime:
>>> print datetime.strptime(the_date, '%Y-%m-%dT%H:%M:%S')
# removed rest of the error output
ValueError: time data '"2017-02-18T22:11:58.125703"' does not match format '%Y-%m-%dT%H:%M:%S'
>>> print datetime.strptime(the_date, '%Y-%m-%dT%H:%M:%S.%f')
# removed rest of the error output
ValueError: time data '"2017-02-18T22:11:58.125703"' does not match format '%Y-%m-%dT%H:%M:%S.%f'
Does anybody know how on earth I can parse this fairly simple datetime format?
Note the error message:
ValueError: time data '"2017-02-18T22:11:58.125703"'
There are single quotes and double quotes, which means the string itself contains double quotes. That's because JSON serialization adds double quotes around strings.
You may want to strip the quotes from your string:
datetime.strptime(the_date.strip('"'), '%Y-%m-%dT%H:%M:%S.%f')
or, maybe less "hacky", de-serialize using json.loads:
datetime.strptime(json.loads(the_date), '%Y-%m-%dT%H:%M:%S.%f')
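For completeness, after json.loads the dateutil call from the question should work as well; a small sketch:
import json
import dateutil.parser

the_date = '"2017-02-18T22:14:09.915727"'      # what json.dumps produced
parsed = dateutil.parser.parse(json.loads(the_date))
print(parsed)                                    # 2017-02-18 22:14:09.915727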

How to base64 encode/decode a variable with string type in Python 3?

It gives me an error saying that the value to encode needs to be bytes, not str/dict.
I know that adding a "b" before the text will solve that and print the encoded result.
import base64
s = base64.b64encode(b'12345')
print(s)
b'MTIzNDU='
But how do I encode a variable?
such as
import base64
s = "12345"
s2 = base64.b64encode(s)
print(s2)
It gives me an error both with and without the b added. I don't understand.
I'm also trying to encode/decode a dictionary with base64.
You need to encode the unicode string. If it's just normal characters, you can use ASCII. If it might have other characters in it, or just for general safety, you probably want utf-8.
>>> import base64
>>> s = "12345"
>>> s2 = base64.b64encode(s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ". . . /lib/python3.3/base64.py", line 58, in b64encode
raise TypeError("expected bytes, not %s" % s.__class__.__name__)
TypeError: expected bytes, not str
>>> s2 = base64.b64encode(s.encode('ascii'))
>>> print(s2)
b'MTIzNDU='
>>>
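The question also asks about a dictionary. base64 only operates on bytes, so a dictionary has to be serialized to a string first; one common approach is to go through json (a sketch, the sample dictionary is just for illustration):
import base64
import json

d = {"user": "alice", "id": 12345}   # illustrative data

# dict -> JSON text -> bytes -> base64
encoded = base64.b64encode(json.dumps(d).encode('utf-8'))

# and back: base64 -> bytes -> JSON text -> dict
decoded = json.loads(base64.b64decode(encoded).decode('utf-8'))
assert decoded == d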

Retrieving JSON objects from a text file (using Python)

I have thousands of text files containing multiple JSON objects, but unfortunately there is no delimiter between the objects. Objects are stored as dictionaries and some of their fields are themselves objects. Each object might have a variable number of nested objects. Concretely, an object might look like this:
{field1: {}, field2: "some value", field3: {}, ...}
and hundreds of such objects are concatenated without a delimiter in a text file. This means that I can neither use json.load() nor json.loads().
Any suggestions on how I can solve this problem? Is there a known parser that does this?
This decodes your "list" of JSON Objects from a string:
from json import JSONDecoder

def loads_invalid_obj_list(s):
    decoder = JSONDecoder()
    s_len = len(s)
    objs = []
    end = 0
    while end != s_len:
        obj, end = decoder.raw_decode(s, idx=end)
        objs.append(obj)
    return objs
The bonus here is that you play nicely with the parser, so it keeps telling you exactly where it found an error.
Examples
>>> loads_invalid_obj_list('{}{}')
[{}, {}]
>>> loads_invalid_obj_list('{}{\n}{')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "decode.py", line 9, in loads_invalid_obj_list
obj, end = decoder.raw_decode(s, idx=end)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 376, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting object: line 2 column 2 (char 5)
Clean Solution (added later)
import json
import re

# shameless copy-paste from json/decoder.py
FLAGS = re.VERBOSE | re.MULTILINE | re.DOTALL
WHITESPACE = re.compile(r'[ \t\n\r]*', FLAGS)

class ConcatJSONDecoder(json.JSONDecoder):
    def decode(self, s, _w=WHITESPACE.match):
        s_len = len(s)
        objs = []
        end = 0
        while end != s_len:
            obj, end = self.raw_decode(s, idx=_w(s, end).end())
            end = _w(s, end).end()
            objs.append(obj)
        return objs
Examples
>>> print json.loads('{}', cls=ConcatJSONDecoder)
[{}]
>>> print json.load(open('file'), cls=ConcatJSONDecoder)
[{}]
>>> print json.loads('{}{} {', cls=ConcatJSONDecoder)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 339, in loads
return cls(encoding=encoding, **kw).decode(s)
File "decode.py", line 15, in decode
obj, end = self.raw_decode(s, idx=_w(s, end).end())
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 376, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting object: line 1 column 5 (char 5)
Sebastian Blask has the right idea, but there's no reason to use regexes for such a simple change.
objs = json.loads("[%s]"%(open('your_file.name').read().replace('}{', '},{')))
Or, more legibly
raw_objs_string = open('your_file.name').read() #read in raw data
raw_objs_string = raw_objs_string.replace('}{', '},{') #insert a comma between each object
objs_string = '[%s]'%(raw_objs_string) #wrap in a list, to make valid json
objs = json.loads(objs_string) #parse json
How about something like this:
import re
import json

jsonstr = open('test.json').read()
p = re.compile(r'}\s*{')
jsonstr = p.sub('}\n{', jsonstr)
jsonarr = jsonstr.split('\n')
for jsonstr in jsonarr:
    jsonobj = json.loads(jsonstr)
    print json.dumps(jsonobj)
Solution
As far as I know, }{ does not appear in valid JSON outside of string values, so the following should be safe when trying to get strings for the separate objects that were concatenated (txt is the content of your file). It does not require any imports (not even the re module):
retrieved_strings = map(lambda x: '{'+x+'}', txt.strip('{}').split('}{'))
or if you prefer list comprehensions (as David Zwicker mentioned in the comments), you can use it like that:
retrieved_strings = ['{'+x+'}' for x in txt.strip('{}').split('}{')]
It will result in retrieved_strings being a list of strings, each containing a separate JSON object. See proof here: http://ideone.com/Purpb
Example
The following string:
'{field1:"a",field2:"b"}{field1:"c",field2:"d"}{field1:"e",field2:"f"}'
will be turned into:
['{field1:"a",field2:"b"}', '{field1:"c",field2:"d"}', '{field1:"e",field2:"f"}']
as proven in the example I mentioned.
Why don't you load the file as a string, replace all }{ with },{ and surround the whole thing with []? Something like:
re.sub(r'\}\s*?\{', '}, {', string_read_from_a_file)
Or use a simple string replace if you are sure you always have }{ with no whitespace in between.
If you expect }{ to occur inside strings as well, you could also split on }{ and evaluate each fragment with json.loads; if you get an error, the fragment wasn't complete and you have to add the next one to it, and so forth.
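A rough sketch of that retry idea: it re-joins fragments until json.loads accepts them, so a }{ inside a string value does not break it (loads_concatenated is just an illustrative name):
import json

def loads_concatenated(text):
    # Split on '}{'; each split point eats one '}' and one '{', so they
    # are restored while re-assembling the fragments.
    pieces = text.split('}{')
    objs = []
    buffer = ''
    for i, piece in enumerate(pieces):
        if i > 0:
            piece = '{' + piece
        if i < len(pieces) - 1:
            piece = piece + '}'
        buffer += piece
        try:
            objs.append(json.loads(buffer))
            buffer = ''
        except ValueError:
            # incomplete fragment (the '}{' we split on was inside a string
            # value); keep accumulating and retry with the next piece
            continue
    if buffer:
        raise ValueError('could not parse trailing data: %r' % buffer)
    return objs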
import json

file1 = open('filepath', 'r')
data = file1.readlines()
for line in data:
    values = json.loads(line)
    # Now you can access all the objects using values.get('key')
How about reading through the file, incrementing a counter every time a { is found and decrementing it when you come across a }? When your counter reaches 0, you'll know that you've come to the end of the first object, so send that through json.loads and start counting again. Then just repeat to completion.
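A rough sketch of that counter idea; note that it assumes { and } never appear inside string values, which is not guaranteed for arbitrary JSON (iter_objects is just an illustrative name):
import json

def iter_objects(text):
    # Track brace depth; naive, as it ignores braces inside string values.
    depth = 0
    start = 0
    for i, ch in enumerate(text):
        if ch == '{':
            if depth == 0:
                start = i
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                yield json.loads(text[start:i + 1])

# usage: objs = list(iter_objects(open('thefile').read()))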
Suppose you added a [ to the start of the text in a file, and used a version of json.load() which, when it detects the error of finding a { instead of an expected comma (or hits the end of the file), spits out the just-completed object?
To fix up a file with that junk in it, in place:
$ sed -i -e 's;}{;}, {;g' foo
Do it on the fly in Python:
junkJson.replace('}{', '}, {')
