Creating an efficient file-based index that can be incrementally updated - python

As a research project I'm currently writing a document-oriented database from scratch in Python. Like MongoDB, the database supports the creation of indexes on arbitrary document keys. These indexes are currently implemented using two simple dictionaries: the first maps the (possibly hashed) value of the indexed field to the store keys of all documents associated with that value, which allows the DB to locate the documents on disk. The second dictionary contains the inverse, i.e. it maps the store key of a given document to the (hashed) value(s) of the indexed field (which makes removing documents from the index more efficient). An example:
doc1 = {'foo' : 'bar'} # store-key : doc1
doc2 = {'foo' : 'baz'} # store-key : doc2
doc3 = {'foo' : 'bar'} # store-key : doc3
For the foo field, the index dictionaries for these documents would look like this:
foo_index = {'bar' : ['doc1','doc3'],'baz' : ['doc2']}
foo_reverse_index = {'doc1' : ['bar'],'doc2' : ['baz'], 'doc3' : ['bar']}
(Please note that the reverse index also consists of lists of values [and not single values] to accommodate the indexing of list fields, in which case each element of the list field is indexed separately.)
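For illustration, a minimal sketch of the in-memory bookkeeping described above (the function names are made up for this example and are not part of the actual implementation):
foo_index = {}          # field value -> list of store keys
foo_reverse_index = {}  # store key -> list of field values

def index_add(store_key, value):
    foo_index.setdefault(value, []).append(store_key)
    foo_reverse_index.setdefault(store_key, []).append(value)

def index_remove(store_key):
    # the reverse index tells us which forward entries to clean up
    for value in foo_reverse_index.pop(store_key, []):
        foo_index[value].remove(store_key)
        if not foo_index[value]:
            del foo_index[value]

index_add('doc1', 'bar')
index_add('doc2', 'baz')
index_add('doc3', 'bar')
index_remove('doc3')    # foo_index is now {'bar': ['doc1'], 'baz': ['doc2']}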
During normal operation, the index resides in memory and is updated in real time after each insert/update/delete operation. To persist it, it gets serialized (e.g. as a JSON object) and stored to disk, which works reasonably well for index sizes up to a few hundred thousand entries. However, as the database grows, the index loading times at program startup become problematic, and committing changes to disk in real time becomes nearly impossible since each write of the index incurs a large overhead.
Hence I'm looking for an implementation of a persistent index which allows for efficient incremental updates, or, expressed differently, does not require rewriting the whole index when persisting it to disk. What would be a suitable strategy for approaching this problem? I thought about using a linked-list to implement an addressable storage space to which objects could be written but I'm not sure if this is the right approach.

My suggestion is limited to updating the index for persistence; the extra loading time at program startup is not a major issue and cannot really be avoided.
One approach is to preallocate disk space for the index (and possibly for other collections too). In the preallocation, you define an empirical size for each entry of the index as well as the total number of entries on disk, for example 1024 bytes per entry and a total of 1000 entries.
The strategy allows for direct access to each entry of the index on disk. You just have to store the position on the disk along with the index in memory. Any time you update an entry of the index in memory, you point directly to its exact location on the disk and rewrite only a single entry.
If the first index file fills up, just create a second file; always preallocate the space for each file on disk (1024*1000 bytes). You should also preallocate the space for your other data, and prefer multiple fixed-size files over a single large file.
If it happens that some entries of the index require more than 1024 bytes, simply create an extra index file for larger entries; for example 2048 bytes per entry and a total of 100 entries.
The most important point is to use fixed-size index entries to allow direct access.
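For example, a minimal sketch of the fixed-size-slot idea (the slot size, file layout and JSON serialization are arbitrary choices for illustration):
import json

SLOT_SIZE = 1024   # empirical size per index entry
NUM_SLOTS = 1000   # total preallocated entries per file

def create_index_file(path):
    # preallocate the whole file so every slot has a fixed position
    with open(path, 'wb') as f:
        f.write(b'\x00' * SLOT_SIZE * NUM_SLOTS)

def write_entry(path, slot, field_value, store_keys):
    data = json.dumps({field_value: store_keys}).encode('utf-8')
    if len(data) > SLOT_SIZE:
        raise ValueError('entry too large for this slot size')
    with open(path, 'r+b') as f:
        f.seek(slot * SLOT_SIZE)                  # jump straight to the slot
        f.write(data.ljust(SLOT_SIZE, b'\x00'))   # rewrite only this slot

def read_entry(path, slot):
    with open(path, 'rb') as f:
        f.seek(slot * SLOT_SIZE)
        raw = f.read(SLOT_SIZE).rstrip(b'\x00')
    return json.loads(raw) if raw else None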
I hope it helps

Related

make function memory efficient or store data somewhere else to avoid memory error

I currently have a for loop which is finding and storing combinations in a list. The number of possible combinations is very large and I need to be able to access the combos.
Can I use an empty relational DB like SQLite to store my list on disk instead of using list = []?
Essentially, what I am asking is whether there is a DB equivalent to list = [] that I can use to store the combinations generated via my script.
Edit:
SQLite is not a must. Anything will work if it can accomplish my task.
Here is the exact function that is causing me so much trouble. Maybe there is a better solution in general.
Idea - Could I insert the list into the database on each loop and then empty the list? Basically, create a list on each loop, send that list to PostgreSQL and then empty the list in Python to keep the RAM usage down?
from itertools import combinations

def permute(set1, set2):
    set1_combos = list(combinations(set1, 2))
    set2_combos = list(combinations(set2, 8))
    full_sets = []
    for i in set1_combos:
        for j in set2_combos:
            full_sets.append(i + j)
    return full_sets
Ok, a few ideas
My first thought was, why do you explode the combinations objects into lists? But of course, since we have two nested for loops, the iterator in the inner loop is consumed at the first iteration of the outer loop if it is not converted to a list.
However, you don't need to explode both objects: you can explode just the smaller one. For instance, if both our sets are made of 50 elements, the combinations of 2 elements number 1225 with a memory size (if the items are integers) of about 120 bytes each, i.e. 147 KB, while the combinations of 8 elements number 5.36e+08 with a memory size of about 336 bytes each, i.e. 180 GB. So the first thing is, keep the larger combo set as a combinations object and iterate over it in the outer loop. By the way, this will also be considerably faster.
Now the database part. I assume a relational DBMS, be it SQLite or anything.
You want to create a table with a single column defined. Each row of your table will contain one final combination. Instead of appending each combination to a list, you will insert it in the table.
Now the question is, how do you need to access the data you created? Do you just need to iterate over the final combos sequentially, or do you need to query them, for instance finding all the combos which contain one specific value?
In the latter case, you'll want to define your column as the Primary Key, so your queries will be efficient; otherwise, you will save space on disk using an auto-incrementing integer as the PK (SQLite will create it for you if you don't explicitly define a PK, as will a few other DBMSs).
One final note: the insert phase may be painfully slow if you don't take some specific measures: check this very interesting SO post for details. In short, with a few optimizations they were able to go from 85 to over 96K inserts per second.
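Putting the two parts together, a sketch of the approach with sqlite3 (the table and column names are made up, and the tuples are stored as their text representation, which is just one possible encoding):
import sqlite3
from itertools import combinations

def store_combos(set1, set2, db_path='combos.db', batch_size=10000):
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE IF NOT EXISTS combos (combo TEXT)')
    set1_combos = list(combinations(set1, 2))   # the smaller one, exploded
    batch = []
    for j in combinations(set2, 8):             # the larger one stays a lazy iterator
        for i in set1_combos:
            batch.append((repr(i + j),))
            if len(batch) >= batch_size:
                conn.executemany('INSERT INTO combos VALUES (?)', batch)
                batch.clear()
    if batch:
        conn.executemany('INSERT INTO combos VALUES (?)', batch)
    conn.commit()
    conn.close()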
EDIT: iterating over the saved data
Once we have the data in the DB, iterating over them could be as simple as:
mycursor.execute('SELECT * FROM <table> WHERE <conditions>')
for combo in mycursor.fetchall():
    print(combo)  # or do what you need
But if your conditions don't filter away most of the rows you will meet the same memory issue we started with. A first step could be using fetchmany() or even fetchone() instead of fetchall() but still you may have a problem with the size of the query result set.
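For reference, the fetchmany() variant could look like this sketch (the table and condition placeholders follow the same convention as above):
mycursor.execute('SELECT * FROM <table> WHERE <conditions>')
while True:
    rows = mycursor.fetchmany(1000)
    if not rows:
        break
    for combo in rows:
        print(combo)  # or do what you need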
So you will probably need to read from the DB a chunk of data at a time, exploiting the LIMIT and OFFSET parameters in your SELECT. The final result may be something like:
chunk_size = 1000  # or whatever number fits your case
chunk_count = 0
chunk = mycursor.execute(
    f'SELECT * FROM <table> WHERE <conditions> '
    f'ORDER BY <primarykey> LIMIT {chunk_size}').fetchall()
while chunk:
    for combo in chunk:
        print(combo)  # or do what you need
    chunk_count += 1
    chunk = mycursor.execute(
        f'SELECT * FROM <table> WHERE <conditions> ORDER BY <primarykey> '
        f'LIMIT {chunk_size} OFFSET {chunk_size * chunk_count}').fetchall()
Note that you will usually need the ORDER BY clause to ensure rows are returned as you expect them, and not in a random manner.
I don't believe SQLite has a built-in array data type. Other DBMSs, such as PostgreSQL, do.
For SQLite, a good recommendation by another user on this site to obtain an array in SQLite can be found here: How to store array in one column in Sqlite3?
Another solution can be found here: https://sqlite.org/forum/info/99a33767e8a07e59
In either case, yes it is possible to have a DBMS like SQLite store an array (list) type. However, it may require a little setup depending on the DBMS.
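For illustration, one simple pattern (an assumption on my part, not taken from the linked posts) is to serialize the list to a JSON string and store it in a single TEXT column:
import json
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE items (data TEXT)')

my_list = [1, 2, 3, 8, 13]
conn.execute('INSERT INTO items VALUES (?)', (json.dumps(my_list),))

row = conn.execute('SELECT data FROM items').fetchone()
restored = json.loads(row[0])   # back to a Python list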
Edit: If you're having memory issues, have you thought about storing your data as a string and accessing the portions of the string you need when you need it?

Stepwise creation of a YAML file

I am facing the following problem: I create a big data set (several tens of GB) of Python objects. I want to create an output file in YAML format containing an entry for each object with information about the object saved as a nested dictionary. However, I never hold all of the data in memory at the same time.
The output data should be stored in a dictionary mapping an object name to the saved values. A simple version would look like this:
object_1:
  value_1: 42
  value_2: 23
object_2:
  value_1: 17
  value_2: 13
[...]
object_a_lot:
  value_1: 47
  value_2: 11
To keep a low memory footprint, I would like to write the entry for each object and immediately delete it after writing. My current approach is as follows:
from yaml import dump
[...] # initialize huge_object_list. Here it is still small
with open("output.yaml", "w") as yaml_file:
    for my_object in huge_object_list:
        my_object.compute()  # this blows up the size of the object
        # create a single entry for the top level dict
        object_entry = dump(
            {my_object.name: my_object.get_yaml_data()},
            default_flow_style=False,
        )
        yaml_file.write(object_entry)
        my_object.delete_big_stuff()  # delete the memory consuming stuff in the object, keep other information which is needed later
Basically I am writing several dictionaries, but each only has one key, and since the object names are unique this does not blow up. This works, but it feels like a bit of a hack, and I would like to ask if someone knows of a way to do this better/more properly.
Is there a way to write a big dictionary to a YAML file, one entry at a time?
If you want to write out a YAML file in stages, you can do it the way you describe.
If your keys are not guaranteed to be unique, then I would recommend using a sequence (i.e. a list at the top level, even with one item per entry) instead of a mapping.
This doesn't solve the problem of re-reading the file: PyYAML will try to read the file as a whole, and that is not going to load quickly. Keep in mind that the memory overhead PyYAML requires for loading a file can easily be over 100x (a hundred times) the file size. My ruamel.yaml is somewhat better with respect to memory, but still requires several tens of times the file size in memory.
You can of course cut up the file yourself based on "leading" spaces; a new key (or a dash for an item, in case you use sequences) is easy to find that way. You can also look at storing each key-value pair in its own document within one file; that vastly reduces the overhead during loading, as long as you combine the key-value pairs of the single documents yourself.
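A sketch of that multi-document variant with PyYAML (generate_entries() is a hypothetical generator standing in for your object loop): each entry is dumped as its own document with an explicit --- start marker, and on reading, safe_load_all() parses one small document at a time.
import yaml

# writing: one document per object, appended as we go
with open('output.yaml', 'w') as f:
    for name, data in generate_entries():   # hypothetical generator
        f.write(yaml.dump({name: data}, explicit_start=True,
                          default_flow_style=False))

# reading: documents are parsed one by one, then merged
result = {}
with open('output.yaml') as f:
    for doc in yaml.safe_load_all(f):
        result.update(doc)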
In similar situations I have stored individual YAML "objects" in different files, using the filenames as keys to the "object" values. This requires a reasonably efficient filesystem (e.g. one with tail packing) and depends on what is available for the OS your system runs on.

Redis / Dictionaries / sqlite3 on millions of pairs

I have (key, value) pairs where the key is a string and the value is an int. I am trying to build an index from a large text corpus, so I store each string and an identifier. For every term I read from the corpus, I have to check the index to see if it already exists, so I need fast lookups (O(1) if possible). I was using a Python dictionary to create the index. The problem is that I run out of RAM (16 GB). My alternative was to use the dictionary and, when RAM usage reached 90%, move the pairs into an sqlite3 database on disk. But now the problem is that lookups take too much time (first check the dict; on a miss, check the database on disk).
I am thinking of switching to Redis. My question is: should I store the keys as strings, or should I hash them and then store the hashes? (Keys are strings of 2~100 chars.) And what about the values, should I do anything with them (values are int32 numbers)?
edit:
I want to store every term and its identifier (unique pairs), and if I read a term that already exists in the index, skip it.
edit2:
I tried using Redis but it seems to go really slow (?). I use the same code, just with Redis SET & GET instead of the dictionary, which are supposed to have O(1) complexity, but building the index is far too slow. Any advice?
A Python dictionary can be simulated with C hash tables quite easily. GLib provides a working hash implementation that is not difficult to use with some C training. The advantage is that it will be faster and (much) less memory-hungry than the Python dictionary:
https://developer.gnome.org/glib/2.40/glib-Hash-Tables.html
GLib Hash Table Loop Problem
You can also add some tricks to improve performance, for example storing a compressed key.
Even easier, you can segment your large text corpus into sections, create an independent index for each section and then "merge" the indexes.
So for example index 1 will look like:
key1 -> page 1, 3, 20
key2 -> page 2, 7
...
index 2:
key1 -> page 50, 70
key2 -> page 65
...
Then you can merge index 1 and 2:
key1 -> page 1, 3, 20, 50, 70
key2 -> page 2, 7, 65
...
You can even parallelize this across N machines.
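A minimal sketch of the merge step in Python, assuming each per-section index is a plain dict mapping a key to its list of pages:
from collections import defaultdict

def merge_indexes(*indexes):
    merged = defaultdict(list)
    for index in indexes:
        for key, pages in index.items():
            merged[key].extend(pages)
    return dict(merged)

index1 = {'key1': [1, 3, 20], 'key2': [2, 7]}
index2 = {'key1': [50, 70], 'key2': [65]}
print(merge_indexes(index1, index2))
# {'key1': [1, 3, 20, 50, 70], 'key2': [2, 7, 65]}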
should I store the key values as strings or should I hash them and then store them? [...] what about the values?
The most naive way to use Redis in your case is to perform a SET for every unique pair, e.g SET foo 1234 and so on.
As demonstrated by Instagram (x) what you can do instead is using Redis Hashes that feature transparent memory optimizations behind the scenes:
Hashes [...] when smaller than a given number of elements, and up to a maximum element size, are encoded in a very memory efficient way that uses up to 10 times less memory
(see Redis memory optimization documentation for more details).
As suggested by Instagram what you can do is:
hash every key with a 64-bit hash function: n = hash(key)
compute the corresponding bucket: b = n/1000 (with 1,000 elements per bucket)
store the hash, value (= i) pair in this bucket: HSET b n i
Note: keep your integer value i as-is since behind the scenes integers are encoded using a variable number of bytes in ziplists.
Of course, configure Redis with hash-max-ziplist-entries 1000 so that every hash stays memory-optimized (xx).
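A sketch of that bucketing scheme with redis-py (the choice of the first 8 bytes of an MD5 digest as a 64-bit hash, and the 'bucket:' key prefix, are arbitrary stand-ins):
import hashlib
import redis

r = redis.Redis()

def hash64(key):
    # any stable 64-bit hash will do; an MD5 prefix is just an example
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], 'big')

def put(key, value):
    n = hash64(key)
    r.hset(f'bucket:{n // 1000}', n, value)   # b = n/1000 as described above

def get(key):
    n = hash64(key)
    return r.hget(f'bucket:{n // 1000}', n)

put('some term', 1234)
print(get('some term'))   # b'1234'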
To speed up your initial insertion, you may want to use the raw Redis protocol via mass insertion.
(x) Storing hundreds of millions of simple key-value pairs in Redis.
Edit:
(xx) even though in practice most (if not all) of your hashes will contain a single element because of the sparsity of the hash function. In other words, since your keys are hashed strings and not monotonically increasing IDs as in the Instagram example, this approach may NOT be as interesting in terms of memory savings (all your ziplists will contain a single pair). You may want to load your dataset and see what it does with real data in comparison to the basic SET key (= string) value (= integer) approach.

How to create a memoryview for a non-contiguous memory location?

I have a fragmented structure in memory and I'd like to access it as a contiguous-looking memoryview. Is there an easy way to do this or should I implement my own solution?
For example, consider a file format that consists of records. Each record has a fixed-length header that specifies the length of the content of the record. A higher-level logical structure may spread over several records. It would make implementing the higher-level structure easier if it could see its own fragmented memory locations as a simple contiguous array of bytes.
Update:
It seems that Python supports this 'segmented' buffer type internally, at least based on this part of the documentation. But this is only the C API.
Update2:
As far as I can see, the referenced C API - called old-style buffers - does what I need, but it's now deprecated and unavailable in newer versions of Python (3.x). The new buffer protocol - specified in PEP 3118 - offers a new way to represent buffers. This API is more usable in most use cases (among them, use cases where the represented buffer is not contiguous in memory), but does not support this specific one, where a one-dimensional array may be laid out completely freely (multiple differently sized chunks) in memory.
First - I am assuming you are just trying to do this in pure Python rather than in a C extension. So I am assuming you have loaded the different records you are interested in into a set of Python objects, and your problem is that you want to see the higher-level structure that is spread across these objects with bits here and there throughout the objects.
So can you not simply load each of the records into a bytearray? You can then use Python slicing to create a new array that has just the data for the high-level structure you are interested in. You will then have a single bytearray with just the data you are interested in and can print it out or manipulate it in any way you want.
So something like:
a = bytearray(b"Hello World") # put your records into byte arrays like this
b = bytearray(b"Stack Overflow")
complexStructure = bytearray(a[0:6]+b[0:]) # Slice and join arrays to form
# new array with just data from your
# high level entity
print complexStructure
Of course you will still need to know where within the records your high-level structure is in order to slice the arrays correctly, but you would need to know this anyway.
EDIT:
Note that taking a slice of a list does not copy the data in the list; it just creates a new set of references to the data:
>>> a = [1,2,3]
>>> b = a[1:3]
>>> id(a[1])
140268972083088
>>> id(b[0])
140268972083088
However, changes to the list b will not change a, as b is a new list. To have changes automatically propagate to the original list, you would need to make a more complicated object that holds the lists of the original records and hides them in such a way that it can decide which list, and which element of that list, to change or view when a user modifies/views the complex structure. So something like:
class ComplexStructure:
    def __init__(self):
        self.listofrecords = []

    def add_records(self, record):
        self.listofrecords.append(record)

    def get_value(self, position):
        listnum, posinlist = ...  # formula to figure out which list and where in
                                  # that list the element of the complex structure is
        return self.listofrecords[listnum][posinlist]

    def set_value(self, position, value):
        listnum, posinlist = ...  # formula to figure out which list and where in
                                  # that list the element of the complex structure is
        self.listofrecords[listnum][posinlist] = value
Granted this is not the simple way of doing things you were hoping for but it should do what you need.

C data structures

Is there a C data structure equivalent to the following Python structure?
data = {'X': 1, 'Y': 2}
Basically I want a structure where I can give it a pre-defined string and have it come out with an integer.
The data-structure you are looking for is called a "hash table" (or "hash map"). You can find the source code for one here.
A hash table is a mutable mapping of an integer (usually derived from a string) to another value, just like the dict from Python, which your sample code instantiates.
It's called a "hash table" because it performs a hash function on the string to return an integer result, and then directly uses that integer to point to the address of your desired data.
This system makes it extremely quick to access and change your information, even if you have tons of it. It also means that the data is unordered, because a hash function returns an effectively random result and puts your data unpredictably all over the map (in a perfect world).
Also note that if you're doing a quick one-off hash, like a static hash of two or three strings for some lookup, look at gperf, which generates a perfect hash function and simple code for that hash.
The above data structure is a dict type.
In C/C++ parlance, a hashmap should be the equivalent; Google for a hashmap implementation.
There's nothing built into the language or standard library itself but, depending on your requirements, there are a number of ways to do it.
If the data set will remain relatively small, the easiest solution is probably to just have an array of structures along the lines of:
typedef struct {
    char *key;
    int val;
} tElement;
then use a sequential search to look them up. Have functions which insert keys, delete keys and look up keys so that, if you need to change it in future, the API itself won't change. Pseudo-code:
def init:
    create g.key[100] as string
    create g.val[100] as integer
    set g.size to 0

def add (key,val):
    if find(key) != not_found:
        return already_exists
    if g.size == 100:
        return no_space
    g.key[g.size] = key
    g.val[g.size] = val
    g.size = g.size + 1
    return okay

def del (key):
    pos = find (key)
    if pos == not_found:
        return no_such_key
    if pos < g.size - 1:
        g.key[pos] = g.key[g.size-1]
        g.val[pos] = g.val[g.size-1]
    g.size = g.size - 1

def find (key):
    for pos goes from 0 to g.size-1:
        if g.key[pos] == key:
            return pos
    return not_found
Insertion means ensuring it doesn't already exist then just tacking an element on to the end (you'll maintain a separate size variable for the structure). Deletion means finding the element then simply overwriting it with the last used element and decrementing the size variable.
Now this isn't the most efficient method in the world, but you need to keep in mind that it usually only makes a difference as your dataset gets much larger. The difference between a binary tree or hash and a sequential search is irrelevant for, say, 20 entries. I've even used bubble sort for small data sets where a more efficient one wasn't available. That's because it's massively quick to code up and the performance is irrelevant.
Stepping up from there, you can remove the fixed upper size by using a linked list. The search is still relatively inefficient since you're doing it sequentially but the same caveats apply as for the array solution above. The cost of removing the upper bound is a slight penalty for insertion and deletion.
If you want a little more performance and a non-fixed upper limit, you can use a binary tree to store the elements. This gets rid of the sequential search when looking for keys and is suited to somewhat larger data sets.
If you don't know how big your data set will be getting, I would consider this the absolute minimum.
A hash is probably the next step up from there. This performs a function on the string to get a bucket number (usually treated as an array index of some sort). This is O(1) lookup but the aim is to have a hash function that only allocates one item per bucket, so that no further processing is required to get the value.
A degenerate case of "all items in the same bucket" is no different to an array or linked list.
For maximum performance, and assuming the keys are fixed and known in advance, you can actually create your own hashing function based on the keys themselves.
Knowing the keys up front, you have extra information that allows you to fully optimise a hashing function to generate the actual value so you don't even involve buckets - the value generated by the hashing function can be the desired value itself rather than a bucket to get the value from.
I had to put one of these together recently for converting textual months ("January", etc) in to month numbers. You can see the process here.
I mention this possibility because of your "pre-defined string" comment. If your keys are limited to "X" and "Y" (as in your example) and you're using a character set with contiguous {W,X,Y} characters (which even covers EBCDIC as well as ASCII though not necessarily every esoteric character set allowed by ISO), the simplest hashing function would be:
char *s = "X";
int val = *s - 'W';
Note that this doesn't work well if you feed it bad data. These are ideal for when the data is known to be restricted to certain values. The cost of checking data can often swamp the saving given by a pre-optimised hash function like this.
C doesn't have any collection classes. C++ has std::map.
You might try searching for C implementations of maps, e.g. http://elliottback.com/wp/hashmap-implementation-in-c/
A 'trie' or a 'hashmap' should do. The simplest implementation is an array of struct { char *s; int i; } pairs.
Check out 'trie' in 'include/nscript.h' and 'src/trie.c' here: http://github.com/nikki93/nscript . Change the 'trie_info' type to 'int'.
Try a Trie for strings, or a Tree of some sort for integer/pointer types (or anything that can be compared as "less than" or "greater than" another key). Wikipedia has reasonably good articles on both, and they can be implemented in C.
