I have a multi-gigabyte JSON file. The file is made up of JSON objects that are no more than a few thousand characters each, but there are no line breaks between the records.
Using Python 3 and the json module, how can I read one JSON object at a time from the file into memory?
The data is in a plain text file. Here is an example of a similar record. The actual records contain many nested dictionaries and lists.
Record in readable format:
{
    "results": {
        "__metadata": {
            "type": "DataServiceProviderDemo.Address"
        },
        "Street": "NE 228th",
        "City": "Sammamish",
        "State": "WA",
        "ZipCode": "98074",
        "Country": "USA"
    }
}
Actual format. New records start one after the other without any breaks.
{"results": { "__metadata": {"type": "DataServiceProviderDemo.Address"},"Street": "NE 228th","City": "Sammamish","State": "WA","ZipCode": "98074","Country": "USA" } }{"results": { "__metadata": {"type": "DataServiceProviderDemo.Address"},"Street": "NE 228th","City": "Sammamish","State": "WA","ZipCode": "98074","Country": "USA" } }{"results": { "__metadata": {"type": "DataServiceProviderDemo.Address"},"Street": "NE 228th","City": "Sammamish","State": "WA","ZipCode": "98074","Country": "USA" } }
Generally speaking, putting more than one JSON object into a file makes that file invalid, broken JSON. That said, you can still parse data in chunks using the JSONDecoder.raw_decode() method.
The following will yield complete objects as the parser finds them:
from json import JSONDecoder
from functools import partial
def json_parse(fileobj, decoder=JSONDecoder(), buffersize=2048):
    buffer = ''
    for chunk in iter(partial(fileobj.read, buffersize), ''):
        buffer += chunk
        while buffer:
            try:
                result, index = decoder.raw_decode(buffer)
                yield result
                buffer = buffer[index:].lstrip()
            except ValueError:
                # Not enough data to decode, read more
                break
This function reads the given file object in chunks of buffersize characters and has the decoder object parse whole JSON objects from the buffer. Each parsed object is yielded to the caller.
Use it like this:
with open('yourfilename', 'r') as infh:
    for data in json_parse(infh):
        # process object
        ...
Use this only if your JSON objects are written to the file back-to-back, with no newlines in between. If you do have newlines, and each JSON object is limited to a single line, you have a JSON Lines document, in which case see "Loading and parsing a JSON file with multiple JSON objects in Python" instead.
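For the JSON Lines case mentioned above, a minimal sketch might look like the following. The function name iter_json_lines is hypothetical, and the code assumes exactly one JSON value per non-blank line:

```python
import io
import json

def iter_json_lines(fileobj):
    """Yield one parsed object per non-blank line (assumes JSON Lines input)."""
    for line in fileobj:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)

# Demo on an in-memory file; a file object returned by open() is iterated the same way.
sample = io.StringIO('{"a": 1}\n\n{"b": [2, 3]}\n')
records = list(iter_json_lines(sample))
print(records)
```

Because the file is iterated line by line, only one record needs to be held in memory at a time.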
Here is a slight modification of Martijn Pieters' solution, which will handle JSON strings separated with whitespace.
import functools
import json
def json_parse(fileobj, decoder=json.JSONDecoder(), buffersize=2048,
               delimiters=None):
    remainder = ''
    for chunk in iter(functools.partial(fileobj.read, buffersize), ''):
        remainder += chunk
        while remainder:
            try:
                # Strip only the left side: trailing whitespace may belong
                # to a string that continues in the next chunk
                stripped = remainder.lstrip(delimiters)
                result, index = decoder.raw_decode(stripped)
                yield result
                remainder = stripped[index:]
            except ValueError:
                # Not enough data to decode, read more
                break
For example, if data.txt contains JSON strings separated by a space:
{"business_id": "1", "Accepts Credit Cards": true, "Price Range": 1, "type": "food"} {"business_id": "2", "Accepts Credit Cards": true, "Price Range": 2, "type": "cloth"} {"business_id": "3", "Accepts Credit Cards": false, "Price Range": 3, "type": "sports"}
then
In [47]: list(json_parse(open('data.txt')))
Out[47]:
[{u'Accepts Credit Cards': True,
u'Price Range': 1,
u'business_id': u'1',
u'type': u'food'},
{u'Accepts Credit Cards': True,
u'Price Range': 2,
u'business_id': u'2',
u'type': u'cloth'},
{u'Accepts Credit Cards': False,
u'Price Range': 3,
u'business_id': u'3',
u'type': u'sports'}]
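Both versions above stop quietly if the file ends while the buffer still holds text that cannot be decoded, for example a truncated last record. The following is a sketch of one way to surface that; it is not part of either answer, and the name json_parse_strict and the choice to raise ValueError are assumptions made for illustration:

```python
import io
import json
from functools import partial

def json_parse_strict(fileobj, decoder=json.JSONDecoder(), buffersize=2048):
    """Yield concatenated JSON values; raise if undecodable text is left at EOF."""
    buffer = ''
    for chunk in iter(partial(fileobj.read, buffersize), ''):
        buffer = (buffer + chunk).lstrip()
        while buffer:
            try:
                result, index = decoder.raw_decode(buffer)
            except ValueError:
                break  # probably an incomplete value; read more data
            yield result
            buffer = buffer[index:].lstrip()
    if buffer:
        # Nothing more to read, yet the buffer still does not decode.
        raise ValueError('trailing data could not be decoded: %r' % buffer[:40])

# A tiny buffersize forces records to span several chunks.
good = list(json_parse_strict(io.StringIO('{"a": 1} {"b": 2}'), buffersize=4))
print(good)
```

Whether leftover data should be an error or merely logged depends on how the file was produced, so treat the exception as one possible policy.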
If your JSON document contains a list of objects and you want to read them one at a time, you can use the iterative JSON parser ijson for the job. It will only read more content from the file when it needs to decode the next object.
Note that you should use it with the YAJL library, otherwise you will likely not see any performance increase.
That being said, unless your file is really big, reading it completely into memory and then parsing it with the normal JSON module will probably still be the best option.
I have a large file (about 3GB) that looks like JSON but isn't, because it lacks commas (,) between the "observations", or JSON objects (I have about 2 million of these "objects" in my data file).
For example, this is what I have:
{
    "_id": {
        "$id": "fh37fc3huc3"
    },
    "messageid": "4757724838492485088139042828",
    "attachments": [],
    "usernameid": "47284592942",
    "username": "Alex",
    "server": "475774810304151552",
    "text": "Must watch",
    "type": "462050823720009729",
    "datetime": "2018-08-05T21:20:20.486000+00:00",
    "type": {
        "$numberLong": "0"
    }
}

{
    "_id": {
        "$id": "23453532dwq"
    },
    "messageid": "232534",
    "attachments": [],
    "usernameid": "273342",
    "usernameid": "Alice",
    "server": "475774810304151552",
    "text": "https://www.youtube.com/",
    "type": "4620508237200097wd29",
    "datetime": "2018-08-05T21:20:11.803000+00:00",
    "type": {
        "$numberLong": "0"
    }
}
And this is what I want (the comma between "observations"):
{
    "_id": {
        "$id": "fh37fc3huc3"
    },
    "messageid": "4757724838492485088139042828",
    "attachments": [],
    "username": "Alex",
    "server": "475774810304151552",
    "type": {
        "$numberLong": "0"
    }
},
{
    "_id": {
        "$id": "23453532dwq"
    },
    "messageid": "232534",
    "attachments": [],
    "usernameid": "Alice",
    "server": "475774810304151552",
    "type": {
        "$numberLong": "0"
    }
}
This is what I tried but it doesn't give me a comma where I need it:
import re
with open('dataframe.txt', 'r') as input, open('out.txt', 'w') as output:
    output.write("[")
    for line in input:
        line = re.sub('', '},{', line)
        output.write(' '+line)
    output.write("]")
What can I do so that I can add a comma between each JSON object in my datafile?
This solution assumes that none of the field values in the JSON contain { or }.
If we assume that there is at least one blank line between JSON dictionaries, here is an idea: keep a count of unclosed curly brackets ({) in unclosed_count, and when we meet an empty line while that count is zero, add the comma once.
Like this:
with open('test.json', 'r') as input_f, open('out.json', 'w') as output_f:
    output_f.write("[")
    unclosed_count = 0
    comma_after_zero_added = True
    for line in input_f:
        unclosed_count_change = line.count('{') - line.count('}')
        unclosed_count += unclosed_count_change
        if unclosed_count_change != 0:
            comma_after_zero_added = False
        if line.strip() == '' and unclosed_count == 0 and not comma_after_zero_added:
            output_f.write(",\n")
            comma_after_zero_added = True
        else:
            output_f.write(line)
    output_f.write("]")
Assuming sufficient memory, you can parse such a stream one object at a time using json.JSONDecoder.raw_decode directly, instead of using json.loads.
>>> import json
>>> x = '{"a": 1}\n{"b":2}\n' # Hypothetical output of open("dataframe.txt").read()
>>> decoder = json.JSONDecoder()
>>> decoder.raw_decode(x)
({'a': 1}, 8)
>>> decoder.raw_decode(x, 9)
({'b': 2}, 16)
The output of raw_decode is a tuple containing the first JSON value decoded and the position in the string where the remaining data starts. (Note that json.loads just creates an instance of JSONDecoder, and calls the decode method, which just calls raw_decode and artificially raises an exception if the entire input isn't consumed by the first decoded value.)
A little extra work is involved; note that you can't start decoding with whitespace, so you'll have to use the returned index to detect where the next value starts, following any additional whitespace at the returned index.
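The extra work described above could be sketched roughly as follows. The helper name iter_json_values is hypothetical; it passes the start index to raw_decode and skips whitespace by hand before each call:

```python
import json

def iter_json_values(text):
    """Yield each JSON value in `text`, skipping whitespace between values."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while True:
        # raw_decode rejects leading whitespace, so advance past it first.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos == end:
            return
        # raw_decode returns the value and the index just past it.
        value, pos = decoder.raw_decode(text, pos)
        yield value

values = list(iter_json_values('{"a": 1}\n{"b":2}\n'))
print(values)
```

Passing the index instead of slicing the string avoids copying the remaining text after every record.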
Another way to view your data is that you have multiple json records separated by whitespace. You can use the stdlib JSONDecoder to read each record, then strip whitespace and repeat til done. The decoder reads a record from a string and tells you how far it got. Apply that iteratively to the data until all is consumed. This is far less risky than making a bunch of assumptions about what data is contained in the json itself.
import json
def json_record_reader(filename):
    with open(filename, encoding="utf-8") as f:
        txt = f.read().lstrip()
    decoder = json.JSONDecoder()
    result = []
    while txt:
        data, pos = decoder.raw_decode(txt)
        result.append(data)
        txt = txt[pos:].lstrip()
    return result
print(json_record_reader("data.json"))
Considering the size of your file, a memory mapped text file may be the better option.
If you're sure that the only place you will find a blank line is between two dicts, then you can go ahead with your current idea, after you fix its execution. For every line, check if it's empty. If it isn't, write it as-is. If it is, write a comma instead.
with open('dataframe.txt', 'r') as input_file, open('out.txt', 'w') as output_file:
    output_file.write("[")
    for line in input_file:
        if line.strip():
            output_file.write(line)
        else:
            output_file.write(",")
    output_file.write("]")
If you cannot guarantee that any blank line must be replaced by a comma, you need a different approach.
You want to replace a close-bracket, followed by an empty line (or multiple whitespace), followed by an open-bracket, with },{.
You can keep track of the previous two lines in addition to the current line, and if these are "}", "", and "{" in that order, then write a comma before writing the "{".
from collections import deque
with open('dataframe.txt', 'r') as input_file, open('out.txt', 'w') as output_file:
    last_two_lines = deque(maxlen=2)
    output_file.write("[")
    for line in input_file:
        line_s = line.strip()
        if line_s == "{" and list(last_two_lines) == ["}", ""]:
            output_file.write("," + line)
        else:
            output_file.write(line)
        last_two_lines.append(line_s)
    # Close the list opened before the loop
    output_file.write("]")
Alternatively, if you want to stick with regex, then you could do
import re
with open('dataframe.txt') as input_file:
    file_contents = input_file.read()
repl_contents = re.sub(r'\}(\s+)\{', r'},\1{', file_contents)
with open('out.txt', 'w') as output_file:
    output_file.write(repl_contents)
Here, the regex r"\}(\s+)\{" matches the pattern we're looking for (\s+ matches one or more whitespace characters and captures them in group 1, which we then use in the replacement string as \1).
Note that this reads the entire file into memory and runs re.sub over all of it, which will be slow for a file of this size.
Can someone tell me what I am doing wrong? I am getting the error shown below.
I went through earlier posts about similar errors but couldn't understand them.
import json
import re
import requests
import subprocess
res = requests.get('https://api.tempura1.com/api/1.0/recipes', auth=('12345','123'), headers={'App-Key': 'some key'})
data = res.text
extracted_recipes = []
for recipe in data['recipes']:
    extracted_recipes.append({
        'name': recipe['name'],
        'status': recipe['status']
    })
print extracted_recipes
TypeError: string indices must be integers
data contains the following:
{
    "recipes": {
        "47635": {
            "name": "Desitnation Search",
            "status": "SUCCESSFUL",
            "kitchen": "eu",
            "active": "YES",
            "created_at": 1501672231,
            "interval": 5,
            "use_legacy_notifications": false
        },
        "65568": {
            "name": "Validation",
            "status": "SUCCESSFUL",
            "kitchen": "us-west",
            "active": "YES",
            "created_at": 1522583593,
            "interval": 5,
            "use_legacy_notifications": false
        },
        "47437": {
            "name": "Gateday",
            "status": "SUCCESSFUL",
            "kitchen": "us-west",
            "active": "YES",
            "created_at": 1501411588,
            "interval": 10,
            "use_legacy_notifications": false
        }
    },
    "counts": {
        "total": 3,
        "limited": 3,
        "filtered": 3
    }
}
You are not parsing the JSON text into a Python object. Try
data = json.loads(res.text)
or
data = res.json()
Apart from that, you probably need to change the for loop to iterate over the values instead of the keys. Change it to something like the following:
for recipe in data['recipes'].values():
There are two problems with your code, which you could have found out by yourself by doing a very minimal amount of debugging.
The first problem is that you don't parse the response contents from json to a native Python object. Here:
data = res.text
data is a string (JSON formatted, but still a string). You need to parse it to turn it into its Python representation (in this case a dict). You can do it using the stdlib's json.loads() (general solution) or, since you're using python-requests, just by calling the Response.json() method:
data = res.json()
Then you have this:
for recipe in data['recipes']:
    # ...
Now that we have turned data into a proper dict, we can access the data['recipes'] subdict, but iterating directly over a dict actually iterates over the keys, not the values, so in your above for loop recipe will be a string ( "47635", "65568" etc). If you want to iterate over the values, you have to ask for it explicitly:
for recipe in data['recipes'].values():
    # now `recipe` is the dict you expected
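Putting both fixes together, a self-contained sketch could look like this. It avoids the network call by using a hard-coded string as a stand-in for res.text (a trimmed copy of the response shown in the question), so the variable text is an assumption for illustration:

```python
import json

# Stand-in for res.text: a trimmed copy of the response from the question.
text = '''
{"recipes": {
    "47635": {"name": "Desitnation Search", "status": "SUCCESSFUL"},
    "65568": {"name": "Validation", "status": "SUCCESSFUL"}
 },
 "counts": {"total": 2}}
'''

data = json.loads(text)  # fix 1: parse the JSON string into a dict

# Fix 2: iterate over the values of the "recipes" dict, not its keys.
extracted_recipes = [
    {'name': recipe['name'], 'status': recipe['status']}
    for recipe in data['recipes'].values()
]
print(extracted_recipes)
```

With a real response object, res.json() would take the place of json.loads(text).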
I have a list of json objects that I would like to write to a json file. Example of my data is as follows:
{
    "_id": "abc",
    "resolved": false,
    "timestamp": "2017-04-18T04:57:41 366000",
    "timestamp_utc": {
        "$date": 1492509461366
    },
    "sessionID": "abc",
    "resHeight": 768,
    "time_bucket": ["2017-year", "2017-04-month", "2017-16-week", "2017-04-18-day", "2017-04-18 16-hour"],
    "referrer": "Standalone",
    "g_event_id": "abc",
    "user_agent": "abc"
} {
    "_id": "abc",
    "resolved": false,
    "timestamp": "2017-04-18T04:57:41 366000",
    "timestamp_utc": {
        "$date": 1492509461366
    },
    "sessionID": "abc",
    "resHeight": 768,
    "time_bucket": ["2017-year", "2017-04-month", "2017-16-week", "2017-04-18-day", "2017-04-18 16-hour"],
    "referrer": "Standalone",
    "g_event_id": "abc",
    "user_agent": "abc"
}
I would like to write this to a JSON file. Here's the code that I am using for this purpose:
with open("filename", 'w') as outfile1:
    for row in data:
        outfile1.write(json.dumps(row))
But this gives me a file with only 1 long row of data. I would like to have a row for each json object in my original data. I know there are some other StackOverflow questions that are trying to address somewhat similar situation (by externally inserting '\n' etc.), but it hasn't worked in my case for some reason. I believe there has to be a pythonic way to do this.
How do I achieve this?
The format of the file you are trying to create is called JSON Lines.
It seems you are asking why the JSON objects are not separated by newlines. That is because the write method does not append a newline.
If you want implicit newlines, use the print function instead:
with open("filename", 'w') as outfile1:
    for row in data:
        print(json.dumps(row), file=outfile1)
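An equivalent way to get the same output is to append the newline yourself when calling write. Here is a small sketch with made-up records, using io.StringIO as a stand-in for the output file so the result can be checked by reading it back:

```python
import io
import json

# Made-up records standing in for the list of objects in the question.
data = [{"_id": "abc", "resolved": False}, {"_id": "def", "resolved": True}]

outfile = io.StringIO()  # stands in for open("filename", "w")
for row in data:
    outfile.write(json.dumps(row) + "\n")  # one JSON object per line

# Read the text back line by line to confirm each line is valid JSON.
lines = outfile.getvalue().splitlines()
round_trip = [json.loads(line) for line in lines]
print(lines)
```

json.dumps without indent never emits a newline, even for strings that contain one, so each record stays on a single line.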
Use the indent argument to output json with extra whitespace. The default is to not output linebreaks or extra spaces.
with open('filename.json', 'w') as outfile1:
    json.dump(data, outfile1, indent=4)
https://docs.python.org/3/library/json.html#basic-usage
After contacting a server I get the following string as the response:
{"kind": "t2", "data": {"has_mail": null, "name": "shadyabhi", "created": 1273919273.0, "created_utc": 1273919273.0, "link_karma": 1343, "comment_karma": 301, "is_gold": false, "is_mod": false, "id": "425zf", "has_mod_mail": null}}
which is stored as type 'str' in my script.
Now, when I try to decode it using json.dumps(mystring, sort_keys=True, indent=4), I get this.
"{\"kind\": \"t2\", \"data\": {\"has_mail\": null, \"name\": \"shadyabhi\", \"created\": 1273919273.0, \"created_utc\": 1273919273.0, \"link_karma\": 1343, \"comment_karma\": 301, \"is_gold\": false, \"is_mod\": false, \"id\": \"425zf\", \"has_mod_mail\": null}}"
which should really be like this
shadyabhi#archlinux ~ $ echo '{"kind": "t2", "data": {"has_mail": "null", "name": "shadyabhi", "created": 1273919273.0, "created_utc": 1273919273.0, "link_karma": 1343, "comment_karma": 299, "is_gold": "false", "is_mod": "false", "id": "425zf", "has_mod_mail": "null"}}' | python2 -mjson.tool
{
    "data": {
        "comment_karma": 299,
        "created": 1273919273.0,
        "created_utc": 1273919273.0,
        "has_mail": "null",
        "has_mod_mail": "null",
        "id": "425zf",
        "is_gold": "false",
        "is_mod": "false",
        "link_karma": 1343,
        "name": "shadyabhi"
    },
    "kind": "t2"
}
shadyabhi#archlinux ~ $
So, what is it that's going wrong?
You need to load it before you can dump it. Try this:
data = json.loads(returnFromWebService)
json.dumps(data, sort_keys=True, indent=4)
To add a bit more detail: you're receiving a string, and then asking the json library to dump it to a string. That doesn't make a great deal of sense. What you need to do first is put the data into a more meaningful container. By calling loads you take the string value of the return and parse it into an actual Python dictionary. Then, you can pass that data to dumps, which outputs a string using your requested formatting.
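As a concrete illustration of that round trip, here is a short sketch using a shortened version of the response string from the question (the shortening is only to keep the example small):

```python
import json

# Shortened version of the response string from the question.
mystring = '{"kind": "t2", "data": {"name": "shadyabhi", "is_gold": false, "has_mail": null}}'

# Dumping the str directly just produces a quoted, escaped JSON string.
escaped = json.dumps(mystring)

data = json.loads(mystring)                          # str -> dict
pretty = json.dumps(data, sort_keys=True, indent=4)  # dict -> formatted str
print(pretty)
```

After loads, JSON null and false become Python None and False, and dumps turns them back into null and false in the formatted output.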
You have things backwards. If you want to convert a string to a data structure you need to use json.loads(thestring). json.dumps() is for converting a data structure to a json encoded string.
You are supposed to dump an object (like a dictionary), which then becomes a string, not the other way round.
Use json.loads() instead.
You want json.loads. The dumps method is for going the other way (dumping an object to a json string).