Python: storing/retrieving/updating a large number of arbitrary objects

I have several million records that I want to store, retrieve, and delete fairly frequently. Each record has a "key", but the "value" is not easily translatable to a dictionary, as it is an arbitrary Python object returned from a module method that I didn't write (I understand that hierarchical data structures like JSON work better as dictionaries, and I'm not sure JSON is the preferred storage format in any case).
I am thinking to pickle each entry in a separate file. Is there a better way?

Use the shelve module.
You can use it like a dictionary, but the values are stored on disk using pickle.
From the official Python docs:
import shelve

d = shelve.open(filename)   # open -- file may get suffix added by low-level
                            # library

d[key] = data               # store data at key (overwrites old data if
                            # using an existing key)
data = d[key]               # retrieve a COPY of data at key (raise KeyError
                            # if no such key)
del d[key]                  # delete data stored at key (raises KeyError
                            # if no such key)
flag = key in d             # true if the key exists
klist = list(d.keys())      # a list of all existing keys (slow!)

# as d was opened WITHOUT writeback=True, beware:
d['xx'] = list(range(4))    # this works as expected, but...
d['xx'].append(5)           # *this doesn't!* -- d['xx'] is STILL [0, 1, 2, 3]!

# having opened d without writeback=True, you need to code carefully:
temp = d['xx']              # extracts the copy
temp.append(5)              # mutates the copy
d['xx'] = temp              # stores the copy right back, to persist it

# or, d = shelve.open(filename, writeback=True) would let you just code
# d['xx'].append(5) and have it work as expected, BUT it would also
# consume more memory and make the d.close() operation slower.
d.close()                   # close it

I would evaluate the use of a key/value database like Berkeley DB, Kyoto Cabinet or others. This will give you all the fancy features plus better handling of disk space. On a filesystem with a 4096-byte block size, one million files occupy about 4 GB regardless of how small your objects are (that is a lower bound; if the objects are larger than 4096 bytes, the size increases accordingly).
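If you want to stay in the standard library, a similar single-file key/value approach can be sketched with the built-in dbm module plus pickled values (a minimal sketch, not from the original answer; the filename and example record are made up for illustration):
import dbm
import pickle

# One on-disk key/value store instead of millions of small files.
db = dbm.open("records.db", "c")   # "c" creates the file if it doesn't exist

# Store: dbm keys and values must be bytes, so pickle the arbitrary object.
db[b"record-42"] = pickle.dumps({"anything": ["the", "module", "returned"]})

# Retrieve: unpickle the stored bytes back into the original object.
obj = pickle.loads(db[b"record-42"])

# Delete, then close.
del db[b"record-42"]
db.close()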

Related

Default Dict using a lot of memory Python

I have a server with 32 GB of memory, and my script is consuming all of it and then some, so it's getting killed. I just want to ask what I can do in this scenario. I have a CSV file with 320,000 rows which I am iterating over in Pandas. It has three columns called date, location and value. My goal is to store all locations and values per date, so date is my key. I'm going to store it to S3 as a JSON file. My appending code looks like this:
from collections import defaultdict

appends = defaultdict(list)
for i, row in df.iterrows():
    appends[row["date"]].append(dict(
        location=row['location'],
        value=row['value'],
    ))
Then I will write it to S3 like this:
for key in appends.keys():
    try:
        append = appends[key]
        obj = s3.Object(bucket_storage, f"{key}/data.json")
        obj.put(Body=json.dumps(append, cls=DjangoJSONEncoder))
    except IntegrityError:
        pass
But this is consuming all the memory. I've read somewhere that defaultdict can be memory-hungry, but I'm not sure what my other options are. I don't want to involve a database here either.
Basically I need all the data to be grouped in the dict before saving it to S3, but the problem is that it seems it can't handle all the data, or I am doing something wrong here. Thanks
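Since the goal is just "all locations and values per date", one way to avoid building a second copy of everything in a defaultdict is to let pandas group the frame and upload each group as it is produced. A minimal sketch under the question's own assumptions (df, s3, bucket_storage and DjangoJSONEncoder are taken from the post; this is only an illustration, not an accepted answer):
import json

# Group once by date; each group already contains only that date's rows.
for date, group in df.groupby("date"):
    payload = group[["location", "value"]].to_dict(orient="records")
    s3.Object(bucket_storage, f"{date}/data.json").put(
        Body=json.dumps(payload, cls=DjangoJSONEncoder)
    )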

Streaming JSON Objects From a Large Compressed File

I am working on a personal project which involves reading in large files of JSON objects, consisting of potentially millions of entries, which are compressed using gzip. The problem I am having is determining how to efficiently parse these objects line by line and store them in memory in such a way that they do not use up all of the RAM on my system. It must be possible to access or construct these objects at a later time for analysis. What I have attempted thus far is as follows:
import gzip
import json
from io import BytesIO

def parse_data(file):
    accounts = []
    with gzip.open(file, mode='rb') as accounts_data:
        for line in accounts_data:
            # if line is not empty
            if len(line.strip()) != 0:
                account = BytesIO(line)
                accounts.append(account)
    return accounts

def getaccounts(accounts, idx):
    account = json.load(accounts[idx])
    # creates account object using fields in account dict
    return account_from_dict(account)
A major problem with this implementation is that I am unable to access the same object in accounts twice without a JSONDecodeError being raised. I am also not sure whether this is the most compact way I could be doing this.
Any assistance would be much appreciated.
Edit: The format of the data stored in these files are as follows:
{JSON Object 1}
{JSON Object 2}
...
{JSON Object n}
Edit: It is my intention to use the information stored in these JSON account entries to form a graph of similarities or patterns in account information.
Here's how to randomly access JSON objects in the gzipped file by first uncompressing it into a temporary file and then using tell() and seek() to retrieve them by index, so it requires only enough memory to hold the offsets of each one.
I'm posting this primarily because you asked me for an example of doing it in the comments, which I wouldn't have done otherwise, because it's not quite the same thing as streaming data. The major difference is that, unlike streaming, it gives access to all the data, including the ability to randomly access any of the objects at will.
Uncompressing the entire file first does introduce some additional overhead, so unless you need to access the JSON objects more than once, it probably wouldn't be worth it. The implementation shown could probably be sped up by caching previously loaded objects, but without knowing precisely what the access patterns will be, it's hard to say for sure.
import collections.abc
import gzip
import json
import random
import tempfile


class GZ_JSON_Array(collections.abc.Sequence):
    """ Allows objects in gzipped file of JSON objects, one-per-line, to be
        treated as an immutable sequence of JSON objects.
    """
    def __init__(self, gzip_filename):
        self.tmpfile = tempfile.TemporaryFile('w+b')
        # Decompress a gzip file into a temp file and save offsets of the
        # start of each line in it.
        self.offsets = []
        with gzip.open(gzip_filename, mode='rb') as gzip_file:
            for line in gzip_file:
                line = line.rstrip().decode('utf-8')
                if line:
                    self.offsets.append(self.tmpfile.tell())
                    self.tmpfile.write(bytes(line + '\n', encoding='utf-8'))

    def __len__(self):
        return len(self.offsets)

    def __iter__(self):
        for index in range(len(self)):
            yield self[index]

    def __getitem__(self, index):
        """ Return a JSON object at offsets[index] in the given open file. """
        if index not in range(len(self.offsets)):
            raise IndexError
        self.tmpfile.seek(self.offsets[index])
        try:
            size = self.offsets[index+1] - self.offsets[index]  # Difference with next.
        except IndexError:
            size = -1  # Last one - read all remaining data.
        return json.loads(self.tmpfile.read(size).decode())

    def __del__(self):
        try:
            self.tmpfile.close()  # Allow it to auto-delete.
        except Exception:
            pass


if __name__ == '__main__':

    gzip_filename = 'json_objects.dat.gz'
    json_array = GZ_JSON_Array(gzip_filename)

    # Randomly access some objects in the JSON array.
    for index in random.sample(range(len(json_array)), 3):
        obj = json_array[index]
        print('object[{}]: {!r}'.format(index, obj))
Hi, perhaps use an incremental JSON reader such as ijson. That does not require loading the entire structure into memory at once.
Based on your answers in the comments, it seems like you just need to scan through the objects:
def evaluate_accounts(file):
    results = {}
    with gzip.open(file) as records:
        for json_rec in records:
            if json_rec.strip():
                account = json.loads(json_rec)
                results[account['id']] = evaluate_account(account)
    return results

Best Way To Save/Load Data To/From A .CSV File

I'm trying to save and load a list of tuples of 2 ndarrays and an int to and from a .csv file.
In my current implementation, when I save and load a list l, there is some error in the recovered list on the order of 10^-10. Is there a way to save and recover values more precisely? I would also appreciate comments on my code in general. Thanks!
This is what I have now:
def save_l(l, path):
    tup = ()
    for X in l:
        u = X[0].reshape(784*9)
        v = X[2]*np.ones(1)
        w = np.concatenate((u, X[1], v))
        tup += (w,)
    L = np.row_stack(tup)
    df = pd.DataFrame(L)
    df.to_csv(path)

def load_l(path):
    df = pd.read_csv(path)
    L = df.values
    l = []
    for v in L:
        tup = ()
        for i in range(784):
            tup += (v[9*i+1:9*(i+1)+1],)
        T = np.row_stack(tup)
        Q = v[9*784+1:10*784+1]
        i = v[7841]
        l.append((T, Q, i))
    return(l)
It may be possible that the issue you are experiencing is due to the absence of .csv file protection during save and load.
A good way to make sure that your file is locked until all data are saved/loaded completely is to use a context manager. That way you won't lose any data if your system stops execution for whatever reason, because all results are saved at the moment they become available.
I recommend using the with statement, whose primary use is an exception-safe cleanup of the object used inside it (in this case your .csv). In other words, with makes sure that files are closed, locks released, contexts restored, and so on.
with open("myfile.csv", "a") as reference: # Drop to csv w/ context manager
df.to_csv(reference, sep = ",", index = False) # Same goes for read_csv
# As soon as you are here, reference is closed
If you try this and still see your error, it's not due to save/load issues.
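Separately from the locking point above, if the round-trip error really needs to disappear entirely, one option (not part of either post) is to skip the text conversion that CSV implies and store the arrays in NumPy's binary .npz format, which preserves float64 values bit for bit. A minimal sketch, assuming each list entry is a tuple (T, Q, i) of two arrays and a number as in the question; the function names here are made up:
import numpy as np

def save_l_npz(l, path):
    # path should end in ".npz"; binary storage, so no rounding on the way out.
    arrays = {}
    for n, (T, Q, i) in enumerate(l):
        arrays[f"T{n}"] = T
        arrays[f"Q{n}"] = Q
        arrays[f"i{n}"] = np.array(i)
    np.savez(path, **arrays)

def load_l_npz(path):
    with np.load(path) as data:
        count = len(data.files) // 3
        return [(data[f"T{n}"], data[f"Q{n}"], float(data[f"i{n}"]))
                for n in range(count)]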

What is the preferred method for storing application persistent data, a flat file or a database?

What is the preferred way to store application-specific parameters (persistent data) for my Python program?
I'm creating a Python program, which needs to store some parameters: "project_name", "start_year", "max_value", ...
I don't know what the best way is to store this data (I must reuse it when making calculations and reports): local TXT files, a tiny, very simple DB (does such a thing exist in Python? should I use SQLite?), ...
Thank you very much in advance.
SQLite. Very easy to set up, and you gain a number of built-in DB functions. You also won't have to handle file reads/writes and parsing yourself.
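A minimal sketch of what that could look like with the standard-library sqlite3 module (the database filename, table name and parameters below are made up for illustration):
import sqlite3

conn = sqlite3.connect("settings.db")
conn.execute("CREATE TABLE IF NOT EXISTS params (key TEXT PRIMARY KEY, value TEXT)")

# Store (or overwrite) a few parameters; "with conn" commits the transaction.
with conn:
    conn.execute("INSERT OR REPLACE INTO params VALUES (?, ?)", ("project_name", "foo"))
    conn.execute("INSERT OR REPLACE INTO params VALUES (?, ?)", ("start_year", "2000"))

# Read one back.
row = conn.execute("SELECT value FROM params WHERE key = ?", ("project_name",)).fetchone()
print(row[0])  # 'foo'
conn.close()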
pickle it:
import pickle

options = {
    'project_name': 'foo',
    'start_year': 2000
}

with open('config.pickle', 'wb') as config:
    pickle.dump(options, config)
The pickle module lets you dump most Python objects into a file and read them back again.
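Reading the options back is the mirror image (a small sketch continuing the example above):
import pickle

with open('config.pickle', 'rb') as config:
    options = pickle.load(config)

print(options['project_name'])  # 'foo'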
You can use the shelve library. From the shelve documentation:
A "shelf" is a persistent, dictionary-like object. The difference with dbm databases is that the values (not the keys!) in a shelf can be essentially arbitrary Python objects -- anything that the "pickle" module can handle
import shelve

d = shelve.open(filename)   # open, with (g)dbm filename -- no suffix

d[key] = data               # store data at key (overwrites old data if
                            # using an existing key)
data = d[key]               # retrieve a COPY of the data at key (raise
                            # KeyError if no such key) -- NOTE that this
                            # access returns a *copy* of the entry!
del d[key]                  # delete data stored at key (raises KeyError
                            # if no such key)
flag = key in d             # true if the key exists
klist = list(d.keys())      # a list of all existing keys (slow!)

d.close()
If the schema is fixed, an SQL database such as SQLite is the best choice, possibly with memcached as a cache.
If the relationships change often, I think flexible data may be better stored in files (hash indexed).

Is there a memory efficient and fast way to load big JSON files?

I have some JSON files of 500 MB.
If I use the "trivial" json.load() to load their content all at once, it will consume a lot of memory.
Is there a way to read the file partially? If it were a plain-text, line-delimited file, I would be able to iterate over the lines. I am looking for an analogy to that.
There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.
Update:
I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:
import ijson

for prefix, the_type, value in ijson.parse(open(json_file_name)):
    print(prefix, the_type, value)
where prefix is a dot-separated index into the JSON tree (what happens if your key names have dots in them? I guess that would be bad for JavaScript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object, or None if the_type is an event like starting/ending a map/array.
The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.
So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:
Modularize your code. Do something like:
for json_file in list_of_files:
    process_file(json_file)
If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.
Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another Python process that calls your script via subprocess.Popen (see the sketch after this answer). This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
Hope this helps.
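A small sketch of that second suggestion (the per-file parser script parse_one.py is hypothetical; list_of_files is whatever file list you already have):
import subprocess
import sys

list_of_files = ["a.json", "b.json", "c.json"]

for path in list_of_files:
    # Each file is handled in a fresh interpreter, so all of its memory is
    # released when the child process exits.
    subprocess.run([sys.executable, "parse_one.py", path], check=True)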
Yes.
You can use jsonstreamer, a SAX-like push parser that I have written, which will allow you to parse arbitrarily sized chunks; you can get it here and check out the README for examples. It's fast because it uses the C yajl library.
This can be done by using ijson. How ijson works has been explained very well by Jim Pivarski in the answer above. The code below will read a file and print each JSON object from the list. For example, the file content is as below:
[{"name": "rantidine", "drug": {"type": "tablet", "content_type": "solid"}},
{"name": "nicip", "drug": {"type": "capsule", "content_type": "solid"}}]
You can print every element of the array using the method below:
import ijson

def extract_json(filename):
    with open(filename, 'rb') as input_file:
        jsonobj = ijson.items(input_file, 'item')
        jsons = (o for o in jsonobj)
        for j in jsons:
            print(j)
Note: 'item' is the default prefix given by ijson.
If you want to access only specific JSON objects based on a condition, you can do it in the following way:
def extract_tabtype(filename):
    with open(filename, 'rb') as input_file:
        objects = ijson.items(input_file, 'item.drug')
        tabtype = (o for o in objects if o['type'] == 'tablet')
        for prop in tabtype:
            print(prop)
This will print only those records whose drug type is tablet.
On your mention of running out of memory I must question if you're actually managing memory. Are you using the "del" keyword to remove your old object before trying to read a new one? Python should never silently retain something in memory if you remove it.
Update
See the other answers for advice.
Original answer from 2010, now outdated
Short answer: no.
Properly dividing a json file would take intimate knowledge of the json object graph to get right.
However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.
For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array.
You would have to do some string content parsing to get the chunking of the json file right.
I don't know what generates your JSON content. If possible, I would consider generating a number of manageable files, instead of one huge file.
Another idea is to try load it into a document-store database like MongoDB.
It deals with large blobs of JSON well. Although you might run into the same problem loading the JSON - avoid the problem by loading the files one at a time.
If this path works for you, then you can interact with the JSON data via its client and potentially not have to hold the entire blob in memory.
http://www.mongodb.org/
"the garbage collector should free the memory"
Correct.
Since it doesn't, something else is wrong. Generally, the problem with infinite memory growth is global variables.
Remove all global variables.
Make all module-level code into smaller functions.
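A tiny illustration of that advice (the names here are made up): keep the parsed data inside a function instead of at module level, so it can be collected as soon as the function returns.
import json

def process_one(path):
    # 'data' lives only inside this function; once it returns, nothing
    # references the parsed object and it can be garbage-collected.
    with open(path) as f:
        data = json.load(f)
    return summarize(data)  # hypothetical: return only the small result you need

def main(paths):
    return [process_one(p) for p in paths]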
In addition to #codeape's answer:
I would try writing a custom JSON parser to help you figure out the structure of the JSON blob you are dealing with. Print out the key names only, etc. Make a hierarchical tree and decide (yourself) how you can chunk it. This way you can do what #codeape suggests: break the file up into smaller chunks, etc.
You can parse the JSON file into a CSV file, processing it record by record as it streams:
import ijson
import csv

def convert_json(file_path):
    did_write_headers = False
    headers = []
    row = []
    iterable_json = ijson.parse(open(file_path, 'r'))
    with open(file_path + '.csv', 'w', newline='') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"',
                                quoting=csv.QUOTE_MINIMAL)
        for prefix, event, value in iterable_json:
            if event == 'end_map':
                if not did_write_headers:
                    csv_writer.writerow(headers)
                    did_write_headers = True
                csv_writer.writerow(row)
                row = []
            if event == 'map_key' and not did_write_headers:
                headers.append(value)
            if event == 'string':
                row.append(value)
Simply using json.load() will take a lot of time. Instead, you can load the JSON data line by line, building a dictionary from each line's key/value pairs, appending that dictionary to a final dictionary, and converting the result to a pandas DataFrame, which will help with further analysis:
import json
import pandas as pd

def get_data():
    with open('Your_json_file_name', 'r') as f:
        for line in f:
            yield line

data = get_data()
data_dict = {}
for i, line in enumerate(data):
    each = {}
    # k and v are the key and value pair
    for k, v in json.loads(line).items():
        # print(f'{k}: {v}')
        each[f'{k}'] = f'{v}'
    data_dict[i] = each

Data = pd.DataFrame(data_dict)
# Data gives you the dictionary data as a DataFrame (table format), but it will
# be in transposed form, so finally transpose the DataFrame:
Data_1 = Data.T
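For what it's worth, if the file really is newline-delimited JSON as this answer assumes, pandas can also do the whole thing in one call (a short alternative sketch, not from the original answer):
import pandas as pd

# Each line of the file is parsed as one JSON record and becomes one row.
Data_1 = pd.read_json('Your_json_file_name', lines=True)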
