I am facing the following problem: I create a big data set (several tens of GB) of Python objects. I want to create an output file in YAML format containing an entry for each object, with information about the object saved as a nested dictionary. However, I never hold all of the data in memory at the same time.
The output data should be stored in a dictionary mapping an object name to the saved values. A simple version would look like this:
object_1:
    value_1: 42
    value_2: 23
object_2:
    value_1: 17
    value_2: 13
[...]
object_a_lot:
    value_1: 47
    value_2: 11
To keep a low memory footprint, I would like to write the entry for each object and immediately delete it after writing. My current approach is as follows:
from yaml import dump

[...]  # initialize huge_object_list. Here it is still small
with open("output.yaml", "w") as yaml_file:
    for my_object in huge_object_list:
        my_object.compute()  # this blows up the size of the object
        # create a single entry for the top-level dict
        object_entry = dump(
            {my_object.name: my_object.get_yaml_data()},
            default_flow_style=False,
        )
        yaml_file.write(object_entry)
        my_object.delete_big_stuff()  # delete the memory-consuming stuff in the object, keep other information which is needed later
Basically I am writing several dictionaries, but each only has one key, and since the object names are unique this does not blow up. This works, but it feels like a bit of a hack, and I would like to ask if someone knows of a better or more proper way to do this.
Is there a way to write a big dictionary to a YAML file, one entry at a time?
If you want to write out a YAML file in stages, you can do it the way you describe.
If your keys are not guaranteed to be unique, then I would recommend using a sequence (i.e. a list at the top level, even with one item) instead of a mapping.
This doesn't solve the problem of re-reading the file: PyYAML will try to read the file as a whole, and that is not going to load quickly. Keep in mind that the memory overhead PyYAML requires for loading a file can easily be over 100x (a hundred times) the file size. My ruamel.yaml is somewhat better with respect to memory, but still requires several tens of times the file size in memory.
You can of course cut up a file based on "leading" spaces; a new key (or a dash for an item, in case you use sequences) can easily be found that way. You can also look at storing each key-value pair in its own document within one file; that vastly reduces the overhead during loading, if you combine the key-value pairs of the single documents yourself, as sketched below.
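A minimal sketch of that multi-document idea, reusing the names from the question (huge_object_list, get_yaml_data, delete_big_stuff) and assuming plain PyYAML with plain dict/scalar data: each pair becomes its own document, and safe_load_all lets you iterate the documents one at a time and combine them yourself on re-reading.
import yaml

with open("output.yaml", "w") as yaml_file:
    for my_object in huge_object_list:
        my_object.compute()
        # each key-value pair is emitted as a separate YAML document ("--- ...")
        yaml.dump(
            {my_object.name: my_object.get_yaml_data()},
            yaml_file,
            explicit_start=True,
            default_flow_style=False,
        )
        my_object.delete_big_stuff()

# re-reading: iterate the documents lazily and merge the single-key dicts yourself
with open("output.yaml") as yaml_file:
    combined = {}
    for doc in yaml.safe_load_all(yaml_file):
        combined.update(doc)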
In similar situations I have stored individual YAML "objects" in different files, using the filenames as keys to the "object" values. This requires an efficient filesystem (e.g. with tail packing) and depends on what is available on the OS your system runs on.
I would like to load the items of my JSON file one by one. The file could be up to 3 GB, so loading it in advance and looping over it is not an option.
My JSON file is basically a dictionary of key and value pairs (hundreds of pairs), and there is nothing I want to discard (ijson).
I just want to load one pair at a time to work with it. Is there any way to do that?
So basically I found out from this answer how to do it in a much simpler way:
https://stackoverflow.com/a/17326199/2933485
Using ijson, it looks like you can loop over the file without loading it, by opening the file and using ijson's parse function over it. This is the example I found:
import ijson
for prefix, the_type, value in ijson.parse(open(json_file_name)):
    print(prefix, the_type, value)
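If the installed ijson release provides kvitems (an assumption about the version you have), iterating directly over the key-value pairs of the top-level object is even more convenient than handling the raw parse events:
import ijson

with open(json_file_name, "rb") as f:
    for key, value in ijson.kvitems(f, ""):
        handle_pair(key, value)  # hypothetical per-pair handler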
Why don't you populate a SQLite table with the data once and query the data using the record's primary key? See https://docs.python.org/3.7/library/sqlite3.html
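A minimal sketch of that suggestion, with hypothetical file and column names; the one-off population step still has to parse the JSON once (possibly with an incremental parser), but afterwards individual pairs can be fetched by key without reloading the file:
import json
import sqlite3

con = sqlite3.connect("pairs.db")
con.execute("CREATE TABLE IF NOT EXISTS pairs (key TEXT PRIMARY KEY, value TEXT)")

# one-off population step
with open("data.json") as f:
    for key, value in json.load(f).items():
        con.execute("INSERT OR REPLACE INTO pairs VALUES (?, ?)", (key, json.dumps(value)))
con.commit()

# later: fetch one pair at a time, without holding the whole file in memory
row = con.execute("SELECT value FROM pairs WHERE key = ?", ("some_key",)).fetchone()
if row is not None:
    value = json.loads(row[0])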
OK, so JSON is a nested format, which means each repeating block (dict or list object) is surrounded by start and end characters. Normally, you read the entire file, and in doing so you can confirm the well-formedness, structure and "closedness" of each object; in other words, it is verifiable that all objects are legally structured. When you load a JSON file into memory using the json library, part of that process is the validation.
If you want to do that for an extra large file - you have to forgo the normal library and roll your own, loading in a line (or chunk) at a time, and processing that under the assumption that validation will retrospectively succeed.
That's achievable (assuming you're able to put your faith in such an assumption) but it's probably something you'll have to write yourself.
One strategy might be to read a line at a time, splitting on the colon : character, with commas as record delimiters, which is a crude approximation of how key-value pairs are coded within json. Following this method, you're going to be able to process all but the first and final key-value pairs cleanly in sequence.
That just leaves you to write some special conditions for properly parsing the first and final records, which will come through garbled using this strategy.
Crudely then, call something like this (referencing the csv library) and treat the JSON like a massive, unusually formatted CSV file.
import csv

# note: open() only accepts '', '\n', '\r' or '\r\n' as newline values, so commas
# cannot literally serve as record separators here; rows are split on real newlines
with open('big.json', newline='') as csv_json_franken_file:
    jsonreader = csv.reader(csv_json_franken_file, delimiter=':', quotechar='"')
    for row in jsonreader:  # this bit reads in a "row" at a time, until finished
        print(', '.join(row))
Then do some edge-case treatment of the first and last rows (more or less depending on the structure of your json) to repair the garbling caused by what is a fairly blatant hack. It's not clean, and it's not robust to changes in the content - but sometimes, you just have to play the hand you've been dealt.
To be honest, generating json files of 3GB in size is a little irresponsible, so if anyone comes asking, you've got that in your corner.
As a research project I'm currently writing a document-oriented database from scratch in Python. Like MongoDB, the database supports the creation of indexes on arbitrary document keys. These indexes are currently implemented using two simple dictionaries: the first contains as keys the (possibly hashed) values of the indexed field, and as values the store keys of all documents associated with that field value, which allows the DB to locate the documents on disk. The second dictionary contains the inverse of that, i.e. as keys the store keys of the documents, and as values the (hashed) values of the indexed field (which makes removing documents from the index more efficient). An example:
doc1 = {'foo' : 'bar'} # store-key : doc1
doc2 = {'foo' : 'baz'} # store-key : doc2
doc3 = {'foo' : 'bar'} # store-key : doc3
For the foo field, the index dictionaries for these documents would look like this:
foo_index = {'bar' : ['doc1','doc3'],'baz' : ['doc2']}
foo_reverse_index = {'doc1' : ['bar'],'doc2' : ['baz'], 'doc3' : ['bar']}
(Please note that the reverse index also consists of lists of values [and not single values] to accommodate the indexing of list fields, in which case each element of the list field would be contained in the index separately.)
During normal operation, the index resides in memory and is updated in real time after each insert/update/delete operation. To persist it, it gets serialized (e.g. as a JSON object) and stored to disk, which works reasonably well for index sizes up to a few hundred thousand entries. However, as the database grows, the index loading time at program startup becomes problematic, and committing changes to disk in real time becomes nearly impossible, since each write of the index incurs a large overhead.
Hence I'm looking for an implementation of a persistent index which allows for efficient incremental updates, or, expressed differently, does not require rewriting the whole index when persisting it to disk. What would be a suitable strategy for approaching this problem? I thought about using a linked-list to implement an addressable storage space to which objects could be written but I'm not sure if this is the right approach.
My suggestion is limited to updating the index for persistence; the extra time at program startup is not a major issue and cannot really be avoided.
One approach is to use a preallocation of disk space for the index (possibly for other collections too). In the preallocation, you define an empirical size associated with each entry of the index as well as the total size of the index on the disk. For example 1024 bytes for each entry of the index and a total of 1000 entries.
The strategy allows for direct access to each entry of the index on disk. You just have to store the position on the disk along with the index in memory. Any time you update an entry of the index in memory, you point directly to its exact location on the disk and rewrite only a single entry.
If it happens that the first index file is full, just create a second file; always preallocate the space for your files on disk (1024 * 1000 bytes). You should also preallocate the space for your other data, and choose to use multiple fixed-size files instead of a single large file.
If it happens that some entries of the index require more than 1024 bytes, simply create an extra index file for larger entries; for example 2048 bytes per entry and a total of 100 entries.
The most important thing is to use fixed-size index entries for direct access.
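A rough sketch of this scheme, assuming 1024-byte slots serialized as JSON (the sizes and file layout are illustrative assumptions): the slot number maps directly to a byte offset, so a single entry can be rewritten in place without touching the rest of the file.
import json

ENTRY_SIZE = 1024   # empirical size per index entry
NUM_ENTRIES = 1000  # total preallocated slots per file

def create_index_file(path):
    # preallocate the full file so every slot already exists on disk
    with open(path, "wb") as f:
        f.write(b"\x00" * ENTRY_SIZE * NUM_ENTRIES)

def write_entry(path, slot, entry):
    data = json.dumps(entry).encode("utf-8")
    if len(data) > ENTRY_SIZE:
        raise ValueError("entry too large; put it in a larger-slot file")
    with open(path, "r+b") as f:
        f.seek(slot * ENTRY_SIZE)
        f.write(data.ljust(ENTRY_SIZE, b"\x00"))  # pad to the fixed width

def read_entry(path, slot):
    with open(path, "rb") as f:
        f.seek(slot * ENTRY_SIZE)
        raw = f.read(ENTRY_SIZE).rstrip(b"\x00")
        return json.loads(raw) if raw else None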
I hope this helps.
I have a program which imports a text file through standard input and aggregates the lines into a dictionary. However, the input file is very large (on the order of 1 TB) and I won't have enough space to store the whole dictionary in memory (running on a 64 GB RAM machine). Currently I've got a very simple clause which outputs the dictionary once it has reached a certain length (in this case 100) and clears the memory. The output can then be aggregated at a later point.
So I want to output the dictionary once memory is full. What is the best way of managing this? Is there a function which gives me the current memory usage? Is it costly to keep checking? Am I using the right tactic?
import sys

X_dic = dict()

# Used to print the dictionary in the required format
def print_dic(dic):
    for key, value in dic.items():
        print("{0}\t{1}".format(key, value))

for line in sys.stdin:
    value, key = line.strip().split(",")
    if key not in X_dic:
        X_dic[key] = []
    X_dic[key].append(value)
    # Limit the size of the dict.
    if len(X_dic) == 100:
        print_dic(X_dic)  # Print and clear the dictionary
        X_dic = dict()

# Now output whatever is left
print_dic(X_dic)
The resource module provides some information on how much of various resources (memory, etc.) you are using. See here for a nice little usage example.
On a Linux system (I don't know where you are) you can watch the contents of the file /proc/meminfo. As part of the proc file system it is updated automatically.
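For illustration, a small sketch of both suggestions (Unix/Linux only). Note that resource reports the peak resident set size (in kilobytes on Linux, in bytes on macOS), so for a "dump when memory gets tight" check, the current figure from the per-process /proc/self/status is usually the more useful one; you could call current_rss_kb() inside the input loop and dump/clear X_dic whenever it crosses a threshold of your choosing.
import resource

def peak_rss_kb():
    # peak resident set size so far; kilobytes on Linux, bytes on macOS
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def current_rss_kb():
    # current resident set size, read from the proc filesystem
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # reported in kB
    return 0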
But I actually object to the whole strategy of monitoring memory and using as much of it as possible. I'd rather propose to dump the dictionary regularly (after 1M entries have been added, or similar). It will probably speed up your program to keep the dict small rather than as large as possible; it will also presumably have advantages for later processing if all dumps are of similar size. If you dump a huge dict which fit into your whole memory only while nothing else was using that memory, then you will later have trouble re-reading that dict if something else is currently using some of your memory. You would then have to create a situation in which nothing else is using memory (e.g. a reboot or similar). Not very convenient.
I'm using multiprocessing with a large (~5 GB) read-only dict used by the worker processes. I started by passing the whole dict to each process, but ran into memory constraints, so I changed to using a multiprocessing Manager dict (after reading this: How to share a dictionary between multiple processes in python without locking).
Since the change, performance has dived. What alternatives are there for a faster shared data store? The dict has a 40-character string key and a 2-element tuple of small strings as the value.
Use a memory mapped file. While this might sound insane (performance wise), it might not be if you use some clever tricks:
Sort the keys so you can use binary search in the file to locate a record
Try to make each line of the file the same length ("fixed width records")
If you can't use fixed width records, use this pseudo code:
Read 1KB in the middle (or enough to be sure the longest line fits *twice*)
Find the first new line character
Find the next new line character
Get a line as a substring between the two positions
Check the key (first 40 bytes)
If the key is too big, repeat with a 1KB block in the first half of the search range, else in the upper half of the search range
If the performance isn't good enough, consider writing an extension in C.
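As a concrete illustration of the fixed-width-record variant above, here is a hedged sketch (the record length, file name and key padding are assumptions): the file holds sorted records of the form "<40-byte key><payload>", all padded to the same length, so a record index maps directly to a byte offset and a lookup costs only O(log n) comparisons.
import mmap

KEY_LEN = 40      # the 40-character string key from the question
RECORD_LEN = 128  # assumed fixed record width: key + padded payload

def lookup(path, key):
    key_bytes = key.encode("ascii").ljust(KEY_LEN)
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            lo, hi = 0, len(mm) // RECORD_LEN
            while lo < hi:
                mid = (lo + hi) // 2
                offset = mid * RECORD_LEN
                record_key = mm[offset:offset + KEY_LEN]
                if record_key == key_bytes:
                    return mm[offset + KEY_LEN:offset + RECORD_LEN].rstrip(b" ")
                elif record_key < key_bytes:
                    lo = mid + 1
                else:
                    hi = mid
    return None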
I have a large object I'd like to serialize to disk. I'm finding marshal works quite well and is nice and fast.
Right now I'm creating my large object and then calling marshal.dump. I'd like to avoid holding the large object in memory if possible; I'd like to dump it incrementally as I build it. Is that possible?
The object is fairly simple, a dictionary of arrays.
The bsddb module's 'hashopen' and 'btopen' functions provide a persistent dictionary-like interface. Perhaps you could use one of these, instead of a regular dictionary, to incrementally serialize the arrays to disk?
import bsddb
import marshal
db = bsddb.hashopen('file.db')
db['array1'] = marshal.dumps(array1)
db['array2'] = marshal.dumps(array2)
...
db.close()
To retrieve the arrays:
db = bsddb.hashopen('file.db')
array1 = marshal.loads(db['array1'])
...
If all your object has to do is be a dictionary of lists, then you may be able to use the shelve module. It presents a dictionary-like interface where the keys and values are stored in a database file instead of in memory. One limitation which may or may not affect you is that keys in Shelf objects must be strings. Value storage will be more efficient if you specify protocol=-1 when creating the Shelf object, so that it uses a more efficient binary pickle representation.
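A brief sketch of that idea (the file name and the producer of key/array pairs are hypothetical): the shelf behaves like a dict whose values live in a file, so each array can be stored and then dropped from memory right away.
import shelve

with shelve.open("big_object.shelf", protocol=-1) as db:
    for key, array in generate_arrays():  # hypothetical producer of (key, list) pairs
        db[key] = array                   # pickled to disk, not kept in RAM

# later: reopen and read back individual arrays by key
with shelve.open("big_object.shelf") as db:
    first_array = db["some_key"]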
This very much depends on how you are building the object. Is it an array of sub-objects? Then you could marshal/pickle each array element as you build it. Is it a dictionary? The same idea applies (marshal/pickle the keys).
If it is just a big, complex, hairy object, you might want to marshal-dump each piece of the object, and then apply whatever your 'building' process is when you read it back in.
You should be able to dump the item piece by piece to the file. The two design questions that need settling are:
How are you building the object when you're putting it in memory?
How do you need your data when it comes out of memory?
If your build process populates the entire array associated with a given key at a time, you might just dump the key:array pair in a file as a separate dictionary:
big_hairy_dictionary['sample_key'] = pre_existing_array
with open('central_file', 'ab') as f:
    marshal.dump({'sample_key': big_hairy_dictionary['sample_key']}, f)  # marshal.dump needs a file object, not a file name
Then on update, each call to marshal.load on an open 'central_file' handle will return a dictionary that you can use to update a central dictionary. But this is really only going to be helpful if, when you need the data back, you want to handle reading 'central_file' once per key.
Alternately, if you are populating arrays element by element in no particular order, maybe try:
big_hairy_dictionary['sample_key'].append(single_element)
with open('marshaled_files/' + 'sample_key', 'ab') as f:
    marshal.dump(single_element, f)  # again, dump to a file object
Then, when you load it back, you don't necessarily need to build the entire dictionary to get back what you need; you just keep calling marshal.load on an open 'marshaled_files/sample_key' handle until it raises EOFError, and then you have everything associated with the key.
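For example, reading back everything dumped for one key might look like this minimal sketch, assuming the per-key file was written by repeated marshal.dump calls as above:
import marshal

def load_all(path):
    # collect every object that was appended to the per-key file
    items = []
    with open(path, "rb") as f:
        while True:
            try:
                items.append(marshal.load(f))
            except EOFError:
                return items

elements = load_all('marshaled_files/' + 'sample_key')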