I have (key, value) pairs where the key is a string and the value is an int. I am trying to build an index from a large text corpus, so I store each string together with an identifier. For every term I read from the corpus I have to check the index to see if it already exists, so I need fast lookups (O(1) if possible). I was using a Python dictionary to build the index. The problem is that I run out of RAM (16GB). My alternative was to use the dictionary and, when RAM usage reached 90%, spill the pairs to a sqlite3 database on disk. But now the lookup time is too long (first check the dict, and on a miss go to the database on disk).
I am thinking of switching to Redis. My question is: should I store the keys as strings, or should I hash them first and store the hashes? (The keys are strings of 2~100 characters.) And should I do anything with the values (they are int32 numbers)?
edit:
I want to store every term and its identifier (unique pairs), and if I read a term that already exists in the index, skip it.
edit2:
I tried using Redis, but it seems really slow (?). I use the same code, only with Redis SET and GET instead of the dictionary, which are supposed to have O(1) complexity, but building the index is far too slow. Any advice?
A Python dictionary can be simulated with C hashes quite easily. GLib provides a working hash-table implementation that is not difficult to use with some C experience. The advantage is that it will be faster and (much) less memory-hungry than the Python dictionary:
https://developer.gnome.org/glib/2.40/glib-Hash-Tables.html
GLib Hash Table Loop Problem
You can also add some tricks to improve performance, for example storing a compressed key.
Even easier, you can segment your large text corpus in sections, create an independent index for each section and then "merge" the indexes.
So, for example, index 1 will look like:
key1 -> page 1, 3, 20
key2 -> page 2, 7
...
index 2:
key1 -> page 50, 70
key2 -> page 65
...
Then you can merge index 1 and 2:
key1 -> page 1, 3, 20, 50, 70
key2 -> page 2, 7, 65
...
You can even parallelize this across N machines.
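Here is a minimal sketch of that merge step in Python, assuming each partial index is a plain dict mapping a key to a list of page numbers (as in the example above):

def merge_indexes(*indexes):
    # merge several partial {key: [pages]} indexes into one
    merged = {}
    for index in indexes:
        for key, pages in index.items():
            merged.setdefault(key, []).extend(pages)
    # keep the page lists sorted and free of duplicates
    for key in merged:
        merged[key] = sorted(set(merged[key]))
    return merged

index1 = {'key1': [1, 3, 20], 'key2': [2, 7]}
index2 = {'key1': [50, 70], 'key2': [65]}
print(merge_indexes(index1, index2))
# {'key1': [1, 3, 20, 50, 70], 'key2': [2, 7, 65]}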
should I store the key values as strings or should I hash them and then store them? [...] what about the values?
The most naive way to use Redis in your case is to perform a SET for every unique pair, e.g. SET foo 1234, and so on.
As demonstrated by Instagram (x), what you can do instead is use Redis hashes, which feature transparent memory optimizations behind the scenes:
Hashes [...] when
smaller than a given number of elements, and up to a maximum element
size, are encoded in a very memory efficient way that uses up to 10
times less memory
(see Redis memory optimization documentation for more details).
As suggested by Instagram, what you can do is:
hash every key with a 64-bit hash function: n = hash(key)
compute the corresponding bucket: b = n/1000 (with 1,000 elements per bucket)
store the hash, value (= i) pair in this bucket: HSET b n i
Note: keep your integer value i as-is since behind the scenes integers are encoded using a variable number of bytes in ziplists.
Of course make sure to configure Redis with hash-max-ziplist-entries 1000 to make sure every hash will be memory optimized (xx).
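Here is a rough sketch of this bucketing scheme with the redis-py client, assuming a local Redis instance; the md5-based 64-bit hash and the idx: key prefix are illustrative choices, not requirements:

import hashlib
import redis

r = redis.Redis()

def hash64(key):
    # 64-bit hash derived from md5; any decent 64-bit hash will do
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")

def store(key, value):
    n = hash64(key)
    b = n // 1000                       # 1,000 fields per hash at most
    r.hset("idx:%d" % b, n, value)      # HSET b n i

def lookup(key):
    n = hash64(key)
    v = r.hget("idx:%d" % (n // 1000), n)
    return int(v) if v is not None else None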
To speed up your initial insertion, you may want to use the raw Redis protocol via mass insertion.
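If generating the raw protocol by hand is more than you need, a redis-py pipeline already removes most of the per-command round trips. A rough sketch, reusing r and hash64 from the snippet above and with terms standing in for your stream of unique terms:

pipe = r.pipeline(transaction=False)
for i, term in enumerate(terms):
    n = hash64(term)
    pipe.hset("idx:%d" % (n // 1000), n, i)
    if i % 10000 == 9999:               # flush in batches of 10,000 commands
        pipe.execute()
pipe.execute()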
(x) Storing hundreds of millions of simple key-value pairs in Redis.
Edit:
(xx) even though in practice most (if not all) of your hashes will contain a single element because of the sparsity of the hash function. In other words, since your keys are hashed strings and not monotonically increasing IDs as in the Instagram example, this approach may NOT be as interesting in terms of memory savings (all your ziplists will contain a single pair). You may want to load your dataset and see how it behaves with real data compared to the basic SET key (= string) value (= integer) approach.
I have a dict with 50,000,000 keys (strings) mapped to a count of that key (which is a subset of one with billions).
I also have a series of objects with a class set member containing a few thousand strings that may or may not be in the dict keys.
I need the fastest way to find the intersection of each of these sets.
Right now, I do it like in the code snippet below:
for block in self.blocks:
    # a block is a Python object containing the set in the thousands range
    # block.get_kmers() returns the set
    count = sum([kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts)])
    # kmerCounts is the dict mapping millions of strings to ints
From my tests so far, this takes about 15 seconds per iteration. Since I have around 20,000 of these blocks, I am looking at half a week just to do this. And that is for the 50,000,000 items, not the billions I need to handle...
(And yes I should probably do this in another language, but I also need it done fast and I am not very good at non-python languages).
There's no need to do a full intersection, you just want the matching elements from the big dictionary if they exist. If an element doesn't exist you can substitute 0 and there will be no effect on the sum. There's also no need to convert the input of sum to a list.
count = sum(kmerCounts.get(x, 0) for x in block.get_kmers())
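For example, with toy data (names here are made up), a k-mer missing from the dict simply contributes 0 to the sum:

kmerCounts = {'AAA': 5, 'AAT': 2, 'CGT': 7}
block_kmers = {'AAA', 'CGT', 'GGG'}     # 'GGG' is not in the dict
print(sum(kmerCounts.get(x, 0) for x in block_kmers))   # 12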
Remove the square brackets around your list comprehension to turn it into a generator expression:
sum(kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts))
That will save you some time and some memory, which may in turn reduce swapping, if you're experiencing that.
There is a lower bound to how much you can optimize here. Switching to another language may ultimately be your only option.
I have a scenario where I need to compare two dictionaries based on a set of keys, i.e.:
TmpDict = {}
TmpDict2 = {}
for line in reader:
    line = line.strip()
    TmpArr = line.split('|')
    TmpDict[TmpArr[2], TmpArr[3], TmpArr[11], TmpArr[12], TmpArr[13], TmpArr[14]] = line
for line in reader2:
    line = line.strip()
    TmpArr = line.split('|')
    TmpDict2[TmpArr[2], TmpArr[3], TmpArr[11], TmpArr[12], TmpArr[13], TmpArr[14]] = line
This works fine for comparing two dictionaries with exactly the same keys, but there is a tolerance I need to consider: TmpArr[12] and TmpArr[14] are a time and a duration where the tolerance needs to be checked. Please see the example below.
Example:
dict1={(111,12,23,12:22:30,12:23:34,64): 4|1994773966623|773966623|754146741|\N|359074037474030|413025600032728|}
dict2={(111,12,23,12:22:34,12:23:34,60) :4|1994773966623|773966623|754146741|\N|359074037474030|413025600032728|}
Let's assume I have two dictionaries of length 1 each and the tolerance is 4 seconds; the above keys should be considered matching lines even though the time and the duration differ by 4 seconds.
I know a dictionary key lookup is O(1) irrespective of the dictionary's size; how could I achieve this scenario with the same performance?
Thanks
You have at least these 4 options:
1. Store all keys within the tolerance (consumes memory).
2. Look up all keys within the tolerance. Notice that if the tolerance is defined and constant, this is a constant number C of O(1) lookups, which is still O(1).
3. Combine the previous two: compress the keys with some scheme, say round down to a multiple of 4 and up to a multiple of 4, store the value under those compressed keys in the dictionary, and then verify against the exact value whether it really matches (see the sketch below this list).
4. Or do not use a dictionary but some kind of tree structure instead; notice that you can still store the exact part of the key in a dictionary.
As such, you do not provide enough information to decide which of these is best. However, I would personally go for option 3.
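Here is a rough sketch of option 3, assuming for simplicity that the tolerance only applies to the last element of each key tuple (the duration); the time field could be handled the same way after converting it to seconds. TmpDict is the dictionary built in the question:

TOLERANCE = 4

def rounded_keys(key):
    # the duration rounded down and up to a multiple of the tolerance
    fixed, duration = key[:-1], int(key[-1])
    down = duration - duration % TOLERANCE
    return [fixed + (down,), fixed + (down + TOLERANCE,)]

index = {}
for key, line in TmpDict.items():
    for rk in rounded_keys(key):
        index.setdefault(rk, []).append((key, line))   # keep the exact key for verification

def lookup(key):
    for rk in rounded_keys(key):
        for exact_key, line in index.get(rk, []):
            if abs(int(exact_key[-1]) - int(key[-1])) <= TOLERANCE:
                return line
    return None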
If you can use more memory and keep the performance of the code above, you can insert more than one entry for each element. For example, "111,12,23,12:22:30,12:23:34,60", "111,12,23,12:22:30,12:23:34,61", ..., "111,12,23,12:22:30,12:23:34,68" would all be inserted for the single key "111,12,23,12:22:30,12:23:34,64".
If you do not want to waste memory but still keep O(1) performance, you can check 8 keys (the 4 before and the 4 after) for each key. That is 8 times more comparisons than the code above, but it is still O(1).
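A minimal sketch of that probing approach, again assuming the tolerance only applies to the duration and that the duration is stored as an integer in the key tuple:

TOLERANCE = 4

def find_with_tolerance(d, key):
    # the exact key plus the 4 durations before and the 4 after:
    # 2 * TOLERANCE + 1 dictionary lookups, still O(1) for a fixed tolerance
    fixed, duration = key[:-1], key[-1]
    for delta in range(-TOLERANCE, TOLERANCE + 1):
        candidate = fixed + (duration + delta,)
        if candidate in d:
            return d[candidate]
    return None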
As a research project I'm currently writing a document-oriented database from scratch in Python. Like MongoDB, the database supports the creation of indexes on arbitrary document keys. These indexes are currently implemented using two simple dictionaries: the first contains as key the (possibly hashed) value of the indexed field and as value the store keys of all documents associated with that field value, which allows the DB to locate the documents on disk. The second dictionary contains the inverse of that, i.e. as key the store key of a given document and as value the (hashed) value of the indexed field (which makes removing documents from the index more efficient). An example:
doc1 = {'foo' : 'bar'} # store-key : doc1
doc2 = {'foo' : 'baz'} # store-key : doc2
doc3 = {'foo' : 'bar'} # store-key : doc3
For the foo field, the index dictionaries for these documents would look like this:
foo_index = {'bar' : ['doc1','doc3'],'baz' : ['doc2']}
foo_reverse_index = {'doc1' : ['bar'],'doc2' : ['baz'], 'doc3' : ['bar']}
(Please note that the reverse index also contains lists of values [and not single values] to accommodate indexing of list fields, in which case each element of the list field is contained in the index separately.)
During normal operation, the index resides in memory and is updated in real time after each insert/update/delete operation. To persist it, it gets serialized (e.g. as JSON object) and stored to disk, which works reasonably well for index sizes up to a few 100k entries. However, as the database size grows the index loading times at program startup become problematic, and committing changes in realtime to disk becomes nearly impossible since writing of the index incurs a large overhead.
Hence I'm looking for an implementation of a persistent index which allows for efficient incremental updates, or, expressed differently, does not require rewriting the whole index when persisting it to disk. What would be a suitable strategy for approaching this problem? I thought about using a linked-list to implement an addressable storage space to which objects could be written but I'm not sure if this is the right approach.
My suggestion is limited to updating the index for persistence; the extra time at program startup is not a major issue and cannot really be avoided.
One approach is to use a preallocation of disk space for the index (possibly for other collections too). In the preallocation, you define an empirical size associated with each entry of the index as well as the total size of the index on the disk. For example 1024 bytes for each entry of the index and a total of 1000 entries.
The strategy allows for direct access to each entry of the index on disk. You just have to store the position on the disk along with the index in memory. Any time you update an entry of the index in memory, you point directly to its exact location on the disk and rewrite only a single entry.
If the first index file fills up, just create a second file; always preallocate the space for your files on disk (1024*1000 bytes). You should also preallocate the space for your other data, and choose to use multiple fixed-size files instead of a single large file.
If it happens that some entries of the index require more than 1024 bytes, simply create an extra index file for larger entries; for example 2048 bytes per entry and a total of 100 entries.
The most important point is to use fixed-size index entries for direct access.
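Here is a minimal sketch of that preallocation idea in Python, assuming each index entry serializes to at most 1024 bytes; the slot number kept alongside the in-memory index gives direct access to the entry on disk:

ENTRY_SIZE = 1024          # empirical size reserved per index entry
NUM_ENTRIES = 1000         # entries preallocated per index file

def create_index_file(path):
    # preallocate a fixed-size index file filled with zero bytes
    with open(path, "wb") as f:
        f.write(b"\x00" * ENTRY_SIZE * NUM_ENTRIES)

def write_entry(path, slot, payload):
    # rewrite a single fixed-size slot in place
    if len(payload) > ENTRY_SIZE:
        raise ValueError("entry larger than slot, use the large-entry file")
    with open(path, "r+b") as f:
        f.seek(slot * ENTRY_SIZE)
        f.write(payload.ljust(ENTRY_SIZE, b"\x00"))

def read_entry(path, slot):
    with open(path, "rb") as f:
        f.seek(slot * ENTRY_SIZE)
        return f.read(ENTRY_SIZE).rstrip(b"\x00")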
I hope it helps
I need to find identical sequences of characters in a collection of texts. Think of it as finding identical/plagiarized sentences.
The naive way is something like this:
from collections import defaultdict

ht = defaultdict(int)
for s in sentences:
    ht[s] += 1
I usually use Python, but I'm beginning to think that Python is not the best choice for this task. Am I wrong about that? Is there a reasonable way to do it with Python?
If I understand correctly, Python dictionaries use open addressing, which means that the key itself is also saved in the table. If this is indeed the case, a Python dictionary allows efficient lookups but is VERY bad in memory usage: if I have millions of sentences, they are all stored in the dictionary, which is horrible since it exceeds the available memory, making a Python dictionary an impractical solution.
Can someone confirm the previous paragraph?
One solution that comes to mind is to explicitly use a hash function (either the built-in hash function, one you implement yourself, or one from the hashlib module) and, instead of ht[s] += 1, do:
ht[hash(s)] += 1
This way the key stored in the array is an int (that will be hashed again) instead of the full sentence.
Will that work? Should I expect collisions? any other Pythonic solutions?
Thanks!
Yes, a dict stores its keys in memory. If your data fits in memory, this is the easiest approach.
Hashing should work. Try MD5: it gives a 16-byte value, so collisions are unlikely.
Try BerkeleyDB for a disk based approach.
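For the in-memory case, a minimal sketch of the MD5 idea, assuming sentences is your iterable of strings; the 16-byte digest replaces the full sentence as the dictionary key:

import hashlib
from collections import defaultdict

ht = defaultdict(int)
for s in sentences:
    key = hashlib.md5(s.encode("utf-8")).digest()   # 16 bytes instead of the whole sentence
    ht[key] += 1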
Python dicts are indeed memory monsters. You can hardly operate on millions of keys when storing anything larger than integers. Consider the following code:

import random

d = {}
for x in xrange(5000000):  # that's 5 million
    d[x] = random.getrandbits(BITS)

For BITS = 64 it takes 510MB of my RAM, for BITS = 128 550MB, for BITS = 256 650MB, for BITS = 512 830MB. Increasing the number of iterations to 10 million roughly doubles the memory usage. However, consider this snippet:

for x in xrange(5000000):  # it's 5 million
    d[x] = (random.getrandbits(64), random.getrandbits(64))

It takes 1.1GB of my memory. Conclusion? If you want to keep two 64-bit integers, use one 128-bit integer, like this:

for x in xrange(5000000):  # it's still 5 million
    d[x] = random.getrandbits(64) | (random.getrandbits(64) << 64)

It will cut the memory usage roughly in half.
It depends on your actual memory limit and the number of sentences, but you should be safe using dictionaries with 10-20 million keys when the keys are just integers. You have a good idea with hashes, but you probably want to keep a pointer to each sentence, so that in case of a collision you can investigate (compare the sentences character by character and probably print them out). You could encode the pointer as an integer, for example by packing the file number and the offset into it. If you don't expect a massive number of collisions, you can simply set up another dictionary for storing only the collisions, for example:
hashes = {}
collisions = {}

for s in sentences:
    ptr_value = pointer(s)   # make it an integer
    hash_value = hash(s)     # make it an integer
    if hash_value in hashes:
        collisions.setdefault(hashes[hash_value], []).append(ptr_value)
    else:
        hashes[hash_value] = ptr_value
So at the end you will have a collisions dictionary where the key is a pointer to a sentence and the value is an array of pointers that collide with it. It sounds pretty hacky, but working with integers is just fine (and fun!).
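For example, the pointer() placeholder above could pack the file number and the byte offset into a single integer (the 40-bit offset is an arbitrary illustrative choice):

OFFSET_BITS = 40   # supports files up to about 1 TB

def make_pointer(file_number, offset):
    return (file_number << OFFSET_BITS) | offset

def unpack_pointer(ptr):
    return ptr >> OFFSET_BITS, ptr & ((1 << OFFSET_BITS) - 1)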
Perhaps pass the keys through md5 first: http://docs.python.org/library/md5.html
I'm not sure exactly how large the data set you are comparing is, but I would recommend looking into Bloom filters (be careful of false positives): http://en.wikipedia.org/wiki/Bloom_filter. Another avenue to consider would be something simple like cosine similarity or edit distance between documents, but if you are trying to compare one document against many, I would suggest looking into Bloom filters; you can encode them however you find most efficient for your problem.
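Here is a toy Bloom filter sketch that derives its k hash positions from salted MD5 digests; the bit-array size and the number of hashes are arbitrary and should be tuned to your expected number of sentences and acceptable false-positive rate:

import hashlib

class BloomFilter(object):
    def __init__(self, size_bits=8 * 1000 * 1000, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.md5(("%d:%s" % (i, item)).encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # may report a false positive, never a false negative
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))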
Is there a C data structure equivalent to the following Python structure?
data = {'X': 1, 'Y': 2}
Basically I want a structure where I can give it a pre-defined string and have it come out with an integer.
The data-structure you are looking for is called a "hash table" (or "hash map"). You can find the source code for one here.
A hash table is a mutable mapping of an integer (usually derived from a string) to another value, just like the dict from Python, which your sample code instantiates.
It's called a "hash table" because it performs a hash function on the string to return an integer result, and then directly uses that integer to point to the address of your desired data.
This system makes it extremely quick to access and change your information, even if you have tons of it. It also means that the data is unordered, because a hash function returns a uniformly distributed result and scatters your data unpredictably all over the map (in a perfect world).
Also note that if you're doing a quick one-off hash, like a static two- or three-entry lookup: look at gperf, which generates a perfect hash function and simple code for it.
The above data structure is a dict type.
In C/C++ parlance, a hashmap would be the equivalent; Google for a hashmap implementation.
There's nothing built into the language or standard library itself but, depending on your requirements, there are a number of ways to do it.
If the data set will remain relatively small, the easiest solution is probably to just have an array of structures along the lines of:
typedef struct {
    char *key;
    int val;
} tElement;
then use a sequential search to look them up. Have functions which insert keys, delete keys and look up keys so that, if you need to change it in future, the API itself won't change. Pseudo-code:
def init:
    create g.key[100] as string
    create g.val[100] as integer
    set g.size to 0

def add (key, val):
    if lookup(key) != not_found:
        return already_exists
    if g.size == 100:
        return no_space
    g.key[g.size] = key
    g.val[g.size] = val
    g.size = g.size + 1
    return okay

def del (key):
    pos = lookup(key)
    if pos == not_found:
        return no_such_key
    if pos < g.size - 1:
        g.key[pos] = g.key[g.size-1]
        g.val[pos] = g.val[g.size-1]
    g.size = g.size - 1

def lookup (key):
    for pos goes from 0 to g.size-1:
        if g.key[pos] == key:
            return pos
    return not_found
Insertion means ensuring it doesn't already exist then just tacking an element on to the end (you'll maintain a separate size variable for the structure). Deletion means finding the element then simply overwriting it with the last used element and decrementing the size variable.
Now this isn't the most efficient method in the world, but you need to keep in mind that it usually only makes a difference as your dataset gets much larger. The difference between a binary tree or hash and a sequential search is irrelevant for, say, 20 entries. I've even used bubble sort for small data sets where a more efficient sort wasn't readily available, because it's massively quick to code up and the performance is irrelevant.
Stepping up from there, you can remove the fixed upper size by using a linked list. The search is still relatively inefficient since you're doing it sequentially but the same caveats apply as for the array solution above. The cost of removing the upper bound is a slight penalty for insertion and deletion.
If you want a little more performance and a non-fixed upper limit, you can use a binary tree to store the elements. This gets rid of the sequential search when looking for keys and is suited to somewhat larger data sets.
If you don't know how big your data set will be getting, I would consider this the absolute minimum.
A hash is probably the next step up from there. This performs a function on the string to get a bucket number (usually treated as an array index of some sort). This is O(1) lookup but the aim is to have a hash function that only allocates one item per bucket, so that no further processing is required to get the value.
A degenerate case of "all items in the same bucket" is no different to an array or linked list.
For maximum performance, and assuming the keys are fixed and known in advance, you can actually create your own hashing function based on the keys themselves.
Knowing the keys up front, you have extra information that allows you to fully optimise a hashing function to generate the actual value so you don't even involve buckets - the value generated by the hashing function can be the desired value itself rather than a bucket to get the value from.
I had to put one of these together recently for converting textual months ("January", etc) in to month numbers. You can see the process here.
I mention this possibility because of your "pre-defined string" comment. If your keys are limited to "X" and "Y" (as in your example) and you're using a character set with contiguous {W,X,Y} characters (which even covers EBCDIC as well as ASCII though not necessarily every esoteric character set allowed by ISO), the simplest hashing function would be:
char *s = "X";
int val = *s - 'W';
Note that this doesn't work well if you feed it bad data. These are ideal for when the data is known to be restricted to certain values. The cost of checking data can often swamp the saving given by a pre-optimised hash function like this.
C doesn't have any collection classes. C++ has std::map.
You might try searching for C implementations of maps, e.g. http://elliottback.com/wp/hashmap-implementation-in-c/
A 'trie' or a 'hashmap' should do. The simplest implementation is an array of struct { char *s; int i; } pairs.
Check out 'trie' in 'include/nscript.h' and 'src/trie.c' here: http://github.com/nikki93/nscript . Change the 'trie_info' type to 'int'.
Try a Trie for strings, or a Tree of some sort for integer/pointer types (or anything that can be compared as "less than" or "greater than" another key). Wikipedia has reasonably good articles on both, and they can be implemented in C.