I have thousands of text files containing multiple JSON objects, but unfortunately there is no delimiter between the objects. Objects are stored as dictionaries and some of their fields are themselves objects. Each object might have a variable number of nested objects. Concretely, an object might look like this:
{field1: {}, field2: "some value", field3: {}, ...}
and hundreds of such objects are concatenated without a delimiter in a text file. This means that I can neither use json.load() nor json.loads().
Any suggestion on how I can solve this problem? Is there a known parser that can do this?
This decodes your "list" of JSON Objects from a string:
from json import JSONDecoder

def loads_invalid_obj_list(s):
    decoder = JSONDecoder()
    s_len = len(s)
    objs = []
    end = 0
    while end != s_len:
        obj, end = decoder.raw_decode(s, idx=end)
        objs.append(obj)
    return objs
The bonus here is that you play nice with the parser. Hence it keeps telling you exactly where it found an error.
Examples
>>> loads_invalid_obj_list('{}{}')
[{}, {}]
>>> loads_invalid_obj_list('{}{\n}{')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "decode.py", line 9, in loads_invalid_obj_list
obj, end = decoder.raw_decode(s, idx=end)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 376, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting object: line 2 column 2 (char 5)
Clean Solution (added later)
import json
import re

# shameless copy-paste from json/decoder.py
FLAGS = re.VERBOSE | re.MULTILINE | re.DOTALL
WHITESPACE = re.compile(r'[ \t\n\r]*', FLAGS)

class ConcatJSONDecoder(json.JSONDecoder):
    def decode(self, s, _w=WHITESPACE.match):
        s_len = len(s)
        objs = []
        end = 0
        while end != s_len:
            obj, end = self.raw_decode(s, idx=_w(s, end).end())
            end = _w(s, end).end()
            objs.append(obj)
        return objs
Examples
>>> print json.loads('{}', cls=ConcatJSONDecoder)
[{}]
>>> print json.load(open('file'), cls=ConcatJSONDecoder)
[{}]
>>> print json.loads('{}{} {', cls=ConcatJSONDecoder)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 339, in loads
return cls(encoding=encoding, **kw).decode(s)
File "decode.py", line 15, in decode
obj, end = self.raw_decode(s, idx=_w(s, end).end())
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 376, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting object: line 1 column 5 (char 5)
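If the files are too large to read into memory in one go, the same raw_decode idea also works incrementally. A minimal sketch (Python 3; the function name, chunk size, and trailing-data check are my own assumptions, not part of the answer above):

import json

def iter_json_objects(path, bufsize=65536):
    # Stream objects out of a large file without slurping it whole.
    decoder = json.JSONDecoder()
    buf = ''
    with open(path, encoding='utf-8') as f:
        for chunk in iter(lambda: f.read(bufsize), ''):
            buf += chunk
            while True:
                buf = buf.lstrip()  # raw_decode rejects leading whitespace
                if not buf:
                    break
                try:
                    obj, end = decoder.raw_decode(buf)
                except ValueError:
                    break  # object probably spans the chunk boundary; read more
                yield obj
                buf = buf[end:]
    if buf.strip():
        raise ValueError('unparseable trailing data')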
Sebastian Blask has the right idea, but there's no reason to use regexes for such a simple change.
objs = json.loads("[%s]"%(open('your_file.name').read().replace('}{', '},{')))
Or, more legibly
raw_objs_string = open('your_file.name').read()          # read in the raw data
raw_objs_string = raw_objs_string.replace('}{', '},{')   # insert a comma between each object
objs_string = '[%s]' % raw_objs_string                   # wrap in a list to make valid JSON
objs = json.loads(objs_string)                           # parse the JSON
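One caveat worth keeping in mind for both variants: if }{ can ever occur inside a string value, the blind replacement silently corrupts that value. A contrived demonstration:

import json

tricky = '{"a": "}{"}{"b": 1}'  # the first object's value contains "}{"
print(json.loads("[%s]" % tricky.replace('}{', '},{')))
# [{'a': '},{'}, {'b': 1}]  -- the value was silently mangled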
How about something like this:
import re
import json

jsonstr = open('test.json').read()
p = re.compile(r'}\s*{')
jsonstr = p.sub('}\n{', jsonstr)
jsonarr = jsonstr.split('\n')  # assumes the objects themselves contain no newlines

for jsonstr in jsonarr:
    jsonobj = json.loads(jsonstr)
    print json.dumps(jsonobj)
Solution
As far as I know, }{ does not appear between objects in valid JSON (it can, however, occur inside string values, in which case this breaks), so the following should be safe when trying to get strings for the separate objects that were concatenated (txt is the content of your file). Note also that str.strip('{}') removes all leading and trailing braces, so an object that ends with a nested object (...}}) will lose a brace. It does not require any import (not even the re module) to do that:
retrieved_strings = map(lambda x: '{'+x+'}', txt.strip('{}').split('}{'))
or if you prefer list comprehensions (as David Zwicker mentioned in the comments), you can use it like that:
retrieved_strings = ['{'+x+'}' for x in txt.strip('{}').split('}{')]
It will result in retrieved_strings being a list of strings, each containing a separate JSON object. See proof here: http://ideone.com/Purpb
Example
The following string:
'{field1:"a",field2:"b"}{field1:"c",field2:"d"}{field1:"e",field2:"f"}'
will be turned into:
['{field1:"a",field2:"b"}', '{field1:"c",field2:"d"}', '{field1:"e",field2:"f"}']
as proven in the example I mentioned.
Why don't you load the file as a string, replace all }{ with },{ and surround the whole thing with []? Something like:
re.sub(r'\}\s*\{', '}, {', string_read_from_a_file)
Or use a simple string replace if you are sure you always have }{ without whitespace in between.
In case you expect }{ to occur in strings as well, you could also split on }{ and try to parse each fragment with json.loads; if you get an error, the fragment wasn't complete, so append the next fragment to it and try again, and so forth.
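A sketch of that accumulate-and-retry idea (my own reconstruction, not part of the answer above: it restores the braces the split consumed, and a sufficiently pathological string value could still fool the early-success check):

import json

def split_with_retry(text):
    fragments = text.split('}{')
    objs = []
    buf = ''
    for i, frag in enumerate(fragments):
        if i > 0:
            frag = '{' + frag        # restore the brace split() consumed
        if i < len(fragments) - 1:
            frag = frag + '}'
        buf += frag
        try:
            objs.append(json.loads(buf))
            buf = ''
        except ValueError:
            pass                     # fragment incomplete; keep accumulating
    return objs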
import json

file1 = open('filepath', 'r')
data = file1.readlines()
for line in data:
    values = json.loads(line)  # assumes one JSON object per line
    # now you can access all the objects using values.get('key')
How about reading through the file, incrementing a counter every time a { is found and decrementing it when you come across a }? When your counter reaches 0 you'll know that you've come to the end of the first object, so send that through json.loads and start counting again. Then just repeat to completion.
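A minimal sketch of that idea, with the big assumption stated loudly: it presumes { and } never occur inside string values; a robust version would also track quote and escape state.

import json

def split_objects(text):
    depth = 0
    start = 0
    for i, ch in enumerate(text):
        if ch == '{':
            if depth == 0:
                start = i       # first brace of a new object
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                yield json.loads(text[start:i + 1])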
Suppose you added a [ to the start of the text in a file, and used a version of json.load() which, when it detects a { where it expects a comma (or hits the end of the file), spits out the just-completed object?
To fix a file with that junk in it, in place:
$ sed -i -e 's;}{;}, {;g' foo
Do it on the fly in Python:
junkJson = junkJson.replace('}{', '}, {')
Related
What I want to do is get the start index of certain words within a string.
For example,
context = "abcd e f g ( $ 150 )"
answer = "g($150)"
I want to get the start index of answer from context which should be "9".
I tried something like this,
answer = ' ?'.join(answer)
try:
    answer = re.sub('[$]', r'\$', answer)
    answer = re.sub('[(]', r'\(', answer)
    answer = re.sub('[)]', r'\)', answer)
except:
    pass
start_point = re.search(answer, context).span()[0]
Because there are answers with regex metacharacters and answers without them, I used try/except.
And I used this kind of code,
answer = re.sub('[(]', r'\(', answer)
because if I don't use it, I found that re.search(answer, context) can't find my answer in context.
Then I get this error:
Traceback (most recent call last):
File "mc_answer_v2.py", line 42, in <module>
match = re.search(spaced_answer_text, mc_paragraph_text)
File "/home/hyelchung/data1/envs/albert/lib/python3.6/re.py", line 182, in search
return _compile(pattern, flags).search(string)
File "/home/hyelchung/data1/envs/albert/lib/python3.6/re.py", line 301, in _compile
p = sre_compile.compile(pattern, flags)
File "/home/hyelchung/data1/envs/albert/lib/python3.6/sre_compile.py", line 562, in compile
p = sre_parse.parse(p, flags)
File "/home/hyelchung/data1/envs/albert/lib/python3.6/sre_parse.py", line 855, in parse
p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
File "/home/hyelchung/data1/envs/albert/lib/python3.6/sre_parse.py", line 416, in _parse_sub
not nested and not items))
File "/home/hyelchung/data1/envs/albert/lib/python3.6/sre_parse.py", line 619, in _parse
source.tell() - here + len(this))
sre_constants.error: multiple repeat at position 3
How do I fix it and is there any other good way to get the start index?
It seems possible to do this by sticking \s* (a variable amount of whitespace) after each escaped character of the answer string.
import re

def findPosition(context, answer):
    regex = r"\s*"
    regAnswer = regex.join([re.escape(w) for w in answer]) + regex
    # print(regAnswer)
    return re.search(regAnswer, context).start()

context = "abcd e f g ( $ 150 )"
answer = "g($150)"
print(findPosition(context, answer))
Use map to escape each character.
Regex-replace the spaced-out match with the target string.
The string find method then looks for the target string; if the target string does not exist, it returns -1 rather than raising.
>>> import re
>>> context = 'abcd e f g ( $ 150 )'
>>> answer = 'g($150)'
>>> findSpacing = lambda target, src: re.sub(r"\s*".join(map(re.escape, target)), target, src).find(target)
>>> findSpacing(answer, context)
9
>>> findSpacing("FLAG", context)
-1
>>>
I have the following Python code:
array_to_return = dict()
response_json_object = json.loads(responsestring)
for section in response_json_object:
    if section["requestMethod"] == "getPlayerResources":
        array_to_return["resource_list"] = json.dumps(section["responseData"]["resources"])
        break
array_to_return["requests_duration"] = time.time() - requests_start_time
array_to_return["python_duration"] = time.time() - python_start_time
Which returns the following content into a PHP script:
{'resource_list': '{"aaa": 120, "bbb": 20, "ccc": 2138, "ddd": 8}', 'requests_duration': '7.30', 'python_duration': 41.0}
I'm then trying to decode this string and convert it into something usable in PHP. My code is the following:
$cmd = "$python $pyscript";
exec("$cmd", $output);
echo 'output: ';
var_dump($output[0]);
$json_output = json_decode($output[0], true);
echo 'json_output: ';
var_dump($json_output, json_last_error_msg());
$output[0] is a string but json_last_error_msg() returns Syntax Error
I'm well aware that my string is not a valid JSON string, but how can I convert it properly (either in Python or in PHP)? I'm probably doing something wrong in my Python script...
UPDATE 1:
I actually found out that responsestring is a valid JSON string (with double quotes), but json.loads switches the double quotes to single quotes; thus response_json_object has single quotes.
If I comment out the line with json.loads, I get an error:
TypeError: 'int' object is not subscriptable
UPDATE 2:
I managed to get around it by removing the associative list in Python, not exactly what I was hoping for but this works for now...
array_to_return = json.dumps(section["responseData"]["resources"])
#No longer using the following
#array_to_return["requests_duration"] = time.time() - requests_start_time
#array_to_return["python_duration"] = time.time() - python_start_time
If a working solution with associative list is suggested, I will accept that one.
The ' character is not a legal quote character in JSON; strings must be delimited with ".
Your JSON should look like this:
{
    "resource_list": "{\"aaa\": 120, \"bbb\": 20, \"ccc\": 2138, \"ddd\": 8}",
    "requests_duration": "7.30",
    "python_duration": 41.0
}
Instead of json.dumps-ing the individual values of array_to_return, json.dumps the whole dictionary:
array_to_return = dict()
response_json_object = json.loads(responsestring)
for section in response_json_object:
    if section["requestMethod"] == "getPlayerResources":
        array_to_return["resource_list"] = section["responseData"]["resources"]
        break
array_to_return["requests_duration"] = time.time() - requests_start_time
array_to_return["python_duration"] = time.time() - python_start_time
print(json.dumps(array_to_return))  # print it so the PHP exec() call captures valid JSON
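For illustration, a self-contained version with hypothetical stand-in values (responsestring and the timing variables are not reproduced here), showing the double-quoted output that PHP's json_decode() accepts:

import json

array_to_return = {
    "resource_list": {"aaa": 120, "bbb": 20, "ccc": 2138, "ddd": 8},
    "requests_duration": 7.30,
    "python_duration": 41.0,
}
print(json.dumps(array_to_return))  # emits double-quoted JSON on stdout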
I'm trying to read a JSON file in Python. Some of the lines have strings with double quotes inside:
{"Height__c": "8' 0\"", "Width__c": "2' 8\""}
Using a raw string literal produces the right output:
json.loads(r"""{"Height__c": "8' 0\"", "Width__c": "2' 8\""}""")
{u'Width__c': u'2\' 8"', u'Height__c': u'8\' 0"'}
But my string comes from a file, i.e.:
s = f.readline()
Where:
>>> print repr(s)
'{"Height__c": "8\' 0"", "Width__c": "2\' 8""}'
And json throws the following exception:
json.loads(s) # s = """{"Height__c": "8' 0\"", "Width__c": "2' 8\""}"""
ValueError: Expecting ',' delimiter: line 1 column 21 (char 20)
Also,
>>> s = """{"Height__c": "8' 0\"", "Width__c": "2' 8\""}"""
>>> json.loads(s)
Fails, but assigning the raw literal works:
>>> s = r"""{"Height__c": "8' 0\"", "Width__c": "2' 8\""}"""
>>> json.loads(s)
{u'Width__c': u'2\' 8"', u'Height__c': u'8\' 0"'}
Do I need to write a custom Decoder?
The data file you have does not escape the nested quotes correctly; this can be hard to repair.
If the nested quotes follow a pattern, e.g. always follow a digit and are the last character in each string, you can use a regular expression to fix these up. Given your sample data, if all you have is measurements in feet and inches, that's certainly doable:
import json
import re
from functools import partial

repair_nested = partial(re.compile(r'(\d)""').sub, r'\1\\""')
json.loads(repair_nested(s))
Demo:
>>> import json
>>> import re
>>> from functools import partial
>>> s = '{"Height__c": "8\' 0"", "Width__c": "2\' 8""}'
>>> repair_nested = partial(re.compile(r'(\d)""').sub, r'\1\\""')
>>> json.loads(s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/decoder.py", line 365, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/decoder.py", line 381, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting , delimiter: line 1 column 21 (char 20)
>>> json.loads(repair_nested(s))
{u'Width__c': u'2\' 8"', u'Height__c': u'8\' 0"'}
I have been trying to parse a file with xml.etree.ElementTree:
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import ParseError

def analyze(xml):
    it = ET.iterparse(file(xml))
    count = 0
    last = None
    try:
        for (ev, el) in it:
            count += 1
            last = el
    except ParseError:
        print("catastrophic failure")
        print("last successful: {0}".format(last))
    print('count: {0}'.format(count))
This is of course a simplified version of my code, but this is enough to break my program. I get this error with some files if I remove the try/except block:
Traceback (most recent call last):
File "<pyshell#22>", line 1, in <module>
from yparse import analyze; analyze('file.xml')
File "C:\Python27\yparse.py", line 10, in analyze
for (ev, el) in it:
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1258, in next
self._parser.feed(data)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1624, in feed
self._raiseerror(v)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1488, in _raiseerror
raise err
ParseError: reference to invalid character number: line 1, column 52459
The results are deterministic though, if a file works it will always work. If a file fails, it always fails and always fails at the same point.
The strangest thing is I'm using the trace to find out if I have any malformed XML that's breaking the parser. I then isolate the node that caused the failure. But when I create an XML file containing that node and a few of its neighbors, the parsing works!
This doesn't seem to be a size problem either. I have managed to parse much larger files with no problems.
Any ideas?
Here are some ideas:
(0) Explain "a file" and "occasionally": do you really mean it works sometimes and fails sometimes with the same file?
Do the following for each failing file:
(1) Find out what is in the file at the point that it is complaining about:
text = open("the_file.xml", "rb").read()
err_col = 52459
print repr(text[err_col-50:err_col+100]) # should include the error text
print repr(text[:50]) # show the XML declaration
(2) Throw your file at a web-based XML validation service e.g. http://www.validome.org/xml/ or http://validator.aborla.net/
and edit your question to display your findings.
Update: Here is the minimal xml file that illustrates your problem:
[badcharref.xml]
<a>&#0;</a>
[Python 2.7.1 output]
>>> import xml.etree.ElementTree as ET
>>> it = ET.iterparse(file("badcharref.xml"))
>>> for ev, el in it:
... print el.tag
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\python27\lib\xml\etree\ElementTree.py", line 1258, in next
self._parser.feed(data)
File "C:\python27\lib\xml\etree\ElementTree.py", line 1624, in feed
self._raiseerror(v)
File "C:\python27\lib\xml\etree\ElementTree.py", line 1488, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: reference to invalid character number: line 1, column 3
>>>
Not all valid Unicode characters are valid in XML. See the XML 1.0 Specification.
You may wish to examine your files using regexes like r'&#([0-9]+);' and r'&#x([0-9A-Fa-f]+);', convert the matched text to an int ordinal and check against the valid list from the spec i.e. #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
... or maybe the numeric character reference is syntactically invalid, e.g. not terminated by a ';', '&#not-a-digit', etc.
Update 2: I was wrong; the number in the ElementTree error message counts Unicode code points, not bytes. See the code below and snippets from the output from running it over the two bad files.
# coding: ascii
# Find numeric character references that refer to Unicode code points
# that are not valid in XML.
# Get byte offsets for seeking etc in undecoded file bytestreams.
# Get unicode offsets for checking against the ElementTree error message,
# **IF** your input file is small enough.
BYTE_OFFSETS = True
import sys, re, codecs
fname = sys.argv[1]
print fname
if BYTE_OFFSETS:
    text = open(fname, "rb").read()
else:
    # Assumes the file is encoded in UTF-8.
    text = codecs.open(fname, "rb", "utf8").read()
rx = re.compile("&#([0-9]+);|&#x([0-9a-fA-F]+);")
endpos = len(text)
pos = 0
while pos < endpos:
    m = rx.search(text, pos)
    if not m: break
    mstart, mend = m.span()
    target = m.group(1)
    if target:
        num = int(target)
    else:
        num = int(m.group(2), 16)
    # #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    if not (num in (0x9, 0xA, 0xD) or 0x20 <= num <= 0xD7FF
            or 0xE000 <= num <= 0xFFFD or 0x10000 <= num <= 0x10FFFF):
        print mstart, m.group()
    pos = mend
Output:
comments.xml
6615405
10205764
10213901
10213936
10214123
13292514
...
155656543
155656564
157344876
157722583
posts.xml
7607143
12982273
12982282
12982292
12982302
12982310
16085949
16085955
...
36303479
36303494 <<=== whoops
38942863
...
785292911
801282472
848911592
As John Machin suggested, the files in question do have dubious numeric entities in them, though the error messages seem to be pointing at the wrong place in the text. Perhaps the streaming nature and buffering are making it difficult to report accurate positions.
In fact, all of the offending entities do appear in the text; they are references to control characters, which is why they render as blanks and bare whitespace here.
Most are not allowed. It looks like this parser is quite strict; you'll need to find another that is not so strict, or pre-process the XML. A sketch of the pre-processing route follows.
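A minimal sketch of such pre-processing (my own illustration, reusing the validity test from the script above): drop any numeric character reference whose code point is not allowed by the XML 1.0 spec.

import re

def strip_invalid_char_refs(xml_text):
    def check(m):
        num = int(m.group(2), 16) if m.group(2) else int(m.group(1))
        ok = (num in (0x9, 0xA, 0xD) or 0x20 <= num <= 0xD7FF
              or 0xE000 <= num <= 0xFFFD or 0x10000 <= num <= 0x10FFFF)
        return m.group() if ok else ''  # drop disallowed references
    return re.sub(r'&#([0-9]+);|&#x([0-9a-fA-F]+);', check, xml_text)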
I'm not sure if this answers your question, but if you want to use an exception with the ParseError raised by element tree, you would do this:
except ET.ParseError:
    print("catastrophic failure")
    print("last successful: {0}".format(last))
Source: http://effbot.org/zone/elementtree-13-intro.htm
It might also be worth noting that you can catch the error, and avoid having to stop your program completely, by placing the statement you're already using later in the function,
it = ET.iterparse(file(xml))
inside a try/except block:
try:
    it = ET.iterparse(file(xml))
except:
    print('iterparse error')
Of course, this will not fix your XML file or pre-processing technique, but could help in identifying which file (if you're parsing lots) is causing your error.
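One caveat (my addition, not part of the answer above): iterparse is lazy, so the ParseError is typically raised while iterating, not by the iterparse() call itself. A sketch that catches it around the loop, so a batch run can report the failing file and move on:

import xml.etree.ElementTree as ET

def try_parse(path):
    try:
        for event, elem in ET.iterparse(path):
            pass  # parsing happens here, incrementally
    except ET.ParseError as err:
        print('{0}: {1}'.format(path, err))
        return False
    return True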
Why would this
if 1 \
and 0:
    pass
simplest of code choke on a tokenize/untokenize cycle?
import tokenize
import cStringIO

def tok_untok(src):
    f = cStringIO.StringIO(src)
    return tokenize.untokenize(tokenize.generate_tokens(f.readline))

src = '''if 1 \\
and 0:
    pass
'''
print tok_untok(src)
It throws:
Traceback (most recent call last):
  File "/mnt/home/anushri/untitled-1.py", line 13, in <module>
    print tok_untok(src)
  File "/mnt/home/anushri/untitled-1.py", line 6, in tok_untok
    tokenize.untokenize(tokenize.generate_tokens(f.readline))
  File "/usr/lib/python2.6/tokenize.py", line 262, in untokenize
    return ut.untokenize(iterable)
  File "/usr/lib/python2.6/tokenize.py", line 198, in untokenize
    self.add_whitespace(start)
  File "/usr/lib/python2.6/tokenize.py", line 187, in add_whitespace
    assert row <= self.prev_row
AssertionError
Is there a workaround without modifying the src to be tokenized (it seems \ is the culprit)?
Another example where it fails is when there is no newline at the end, e.g. src = 'if 1:pass' fails with the same error.
Workaround:
It seems that using untokenize a different way works:
def tok_untok(src):
    f = cStringIO.StringIO(src)
    tokens = [t[:2] for t in tokenize.generate_tokens(f.readline)]
    return tokenize.untokenize(tokens)
i.e. do not pass back the whole token tuple, only t[:2], even though the Python documentation says any extra elements are skipped:
Converts tokens back into Python source code. The iterable must return sequences with at least two elements, the token type and the token string. Any additional sequence elements are ignored.
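A quick check (Python 2, as in the question; it simply exercises the workaround above, assuming it behaves as reported) that the two-element form round-trips the original source:

import tokenize
import cStringIO

src = '''if 1 \\
and 0:
    pass
'''
f = cStringIO.StringIO(src)
tokens = [t[:2] for t in tokenize.generate_tokens(f.readline)]
print tokenize.untokenize(tokens)  # completes without the AssertionError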
Yes, it's a known bug and there is interest in a cleaner patch than the one attached to that issue. Perfect time to contribute to a better Python ;)