Efficient way to check existence in a large set of strings - python

I have a set of 100+ million strings, each up to 63 characters long. I have lots of disk space and very little memory (512 MB). I need to query for existence alone, and store no additional metadata.
My de facto solution is a BDB btree. Are there any preferable alternatives? I'm aware of leveldb and Kyoto Cabinet, but not familiar enough to identify advantages.

If false positives are acceptable, then one possible solution would be to use a Bloom filter. Bloom filters are similar to hash tables, but instead of using one hash value to index a table of buckets, they use multiple hashes to index a bit array. The bits corresponding to those indices are set. Then, to test whether a string is in the filter, the string is hashed again, and if all the corresponding bits are set, the string is "in" the filter.
It doesn't store any information about the strings, so it uses very little memory -- but if there's a collision between two strings, no collision resolution is possible. This means there may be false positives (because a string not in the filter may hash to the same indices as a string that is in the filter). However, there can be no false negatives; any string that really is in the set will be found in the bloom filter.
There are a few Python implementations. It's also not hard to roll your own; I recall coding up a quick-and-dirty bloom filter myself once using bitarrays that performed pretty well.
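As a rough illustration of the idea (not a tuned implementation -- the class name, sizing and double-hashing scheme are just assumptions), a Bloom filter can be built on a plain bytearray:

import hashlib

class BloomFilter:
    """Minimal Bloom filter backed by a plain bytearray."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _indices(self, item):
        # Double hashing: derive k bit positions from the two halves of an MD5 digest.
        digest = hashlib.md5(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        return all(self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indices(item))

bf = BloomFilter(num_bits=10 * 100_000_000, num_hashes=7)
bf.add("drew sears")
"drew sears" in bf        # True; a string never added is almost always False

At roughly 10 bits per key and 7 hashes, 100 million strings need on the order of 125 MB of bits for about a 1% false-positive rate, which fits comfortably in 512 MB of RAM.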

You said you have lots of disk, huh? One option would be to store the strings as filenames in nested subdirectories. You could either use the strings directly:
Store "drew sears" in d/r/e/w/ sears
or by taking a hash of the string and following a similar process:
MD5('drew sears') = 'f010fe6e20d12ed895c10b93b2f81c6e'
Create an empty file named f0/10/fe/6e/20d12ed895c10b93b2f81c6e.
Think of it as an OS-optimized, hash-table-based, indexed NoSQL database.
Side benefits:
You could change your mind at a later time and store data in the file.
You can replicate your database to another system using rsync.
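A minimal sketch of the hashed-filename variant (the root directory name and the 2-character split are just illustrative choices):

import hashlib
from pathlib import Path

ROOT = Path("stringdb")   # illustrative root directory

def _path_for(s):
    h = hashlib.md5(s.encode("utf-8")).hexdigest()
    # 'drew sears' -> f0/10/fe/6e/20d12ed895c10b93b2f81c6e
    return ROOT / h[0:2] / h[2:4] / h[4:6] / h[6:8] / h[8:]

def add(s):
    p = _path_for(s)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.touch()

def exists(s):
    return _path_for(s).is_file()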

Related

Fastest way to check if an image sequence string actually exists on disk

I have a potentially big list of image sequences from Nuke. The format of the string can be:
/path/to/single_file.ext
/path/to/img_seq.###[.suffix].ext
/path/to/img_seq.%0id[.suffix].ext, where i is an integer value and the parts between [] are optional.
The question is: given this string, that can represent a sequence or a still image, check if at least one image on disk corresponds to that string in the fastest way possible.
There is already some code that checks whether these files exist, but it's quite slow.
First it checks whether the folder exists; if not, it returns False.
Then it checks whether the file exists with os.path.isfile; if it does, it returns True.
Then, if no % or # is found in the path and os.path.isfile fails, it returns False.
All of this is quite fast.
But then it uses an internal library (a bit faster than pyseq) to try to find an image sequence, and does a few more operations depending on whether start_frame == end_frame or not.
But it still takes a large amount of time to work out whether something is an image sequence, especially on some sections of the network and for big image sequences.
For example, for a 2500-image sequence, the analysis takes between 1 and 3 seconds.
If I take a very naive approach, just check whether a frame exists by replacing #### with %04d, loop up to 10000 and break as soon as a frame is found, it takes less than 0.02 seconds to find a frame with os.path.isfile(f), especially if the first frame is between 1 and 3000.
Of course I cannot guarantee what the start frame will be, and that approach is not perfect, but in practice many of the sequences do begin between 1 and 3000, and I could return True if a frame is found and fall back to the sequence approach if nothing is found (it would still be quicker in most cases).
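A rough sketch of that naive probe (the function name is just illustrative):

import os

def naive_sequence_exists(pattern, max_frame=10000):
    # e.g. '/path/to/img_seq.####.exr' -> '/path/to/img_seq.%04d.exr'
    printf_pattern = pattern.replace("####", "%04d")
    for frame in range(1, max_frame + 1):
        if os.path.isfile(printf_pattern % frame):
            return True
    return False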
I'm not sure what the best approach for this is. I've already made it multithreaded when searching for many image sequences, so it's faster than before, but I'm sure there is room for improvement.
You should probably not loop for candidates using os.path.isfile(), but use glob.glob() or os.listdir() and check the returned lists for matching your file patterns, i.e. prefer memory operations over disk accesses.
If there are potentially so many files that you're worried about wasting memory for a dictionary that holds them all, you could just store a single key for each img_seq.###[.suffix].ext pattern, removing the sequence number as you scan the directory. Then a single lookup will suffice. The values in the dictionary could either be "dummy" booleans because the existence of the key is the only thing you care about, or counters in case you ever want to know how many files you have for a certain sequence.
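A minimal sketch of that idea, assuming (purely for illustration) that the frame number sits directly before the extension:

import os
import re

# Collapse 'img_seq.0042.exr' into the key 'img_seq.#.exr'.
FRAME_RE = re.compile(r"\.(\d+)(\.\w+)$")

def sequence_index(directory):
    # One os.listdir() call, then a dict keyed by the collapsed pattern.
    index = {}
    for name in os.listdir(directory):
        key = FRAME_RE.sub(r".#\2", name)
        index[key] = index.get(key, 0) + 1   # a counter, or just True
    return index

# A single in-memory lookup then replaces thousands of isfile() calls:
# 'img_seq.#.exr' in sequence_index('/path/to')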

Implementing python list and Binary search tree

I am working on making an index for a text file. The index will be a list of every word and symbol like (~!##$%^&*()_-{}:"<>?/.,';[]1234567890|), counting the number of times each token occurs in the text file, and printing all of this in ascending ASCII order.
I am going to read a .txt file, split it into words and special characters, and store them in a list. Can anyone throw me an idea on how to use binary search in this case?
If your lookup table is small (say, up to 1000 records), then you can use a dict; you can either pickle() it, or write it out to a text file. The overhead (for this size) is fairly small anyway.
If your lookup table is bigger, or there are a small number of lookups per run, I would suggest using a key/value database (e.g. dbm).
If it is too complex to use a dbm, use a SQL (e.g. sqlite3, MySQL, Postgres) or NoSQL database. Depending on your application, you can get a huge benefit from the extra features these provide.
In either case, all the hard work is done for you, much better than you can expect to do it yourself. These formats are all standard, so you will get simple-to-use tools to read the data.
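For instance, a rough sketch of the dbm route for the token counts above (the database file name and the placeholder token list are assumptions):

import dbm

tokens = ["foo", "bar", "foo"]                # placeholder for your word/symbol splitting step

with dbm.open("token_counts", "c") as db:
    for token in tokens:
        key = token.encode("utf-8")
        db[key] = str(int(db.get(key, b"0")) + 1).encode("utf-8")

    # Print in ascending ASCII/byte order, as the question asks.
    for key in sorted(db.keys()):
        print(key.decode("utf-8"), int(db[key]))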

Data structure options for efficiently storing sets of integer pairs on disk?

I have a bunch of code that deals with document clustering. One step involves calculating the similarity (for some unimportant definition of "similar") of every document to every other document in a given corpus, and storing the similarities for later use. The similarities are bucketed, and I don't care what the specific similarity is for purposes of my analysis, just what bucket it's in. For example, if documents 15378 and 3278 are 52% similar, the ordered pair (3278, 15378) gets stored in the [0.5,0.6) bucket. Documents sometimes get added to or removed from the corpus after initial analysis, so corresponding pairs get added to or removed from the buckets as needed.
I'm looking at strategies for storing these lists of ID pairs. We found a SQL database (where most of our other data for this project lives) to be too slow and too large disk-space-wise for our purposes, so at the moment we store each bucket as a compressed list of integers on disk (originally zlib-compressed, but now using lz4 instead for speed). Things I like about this:
Reading and writing are both quite fast
After-the-fact additions to the corpus are fairly straightforward to add (a bit less so for lz4 than for zlib because lz4 doesn't have a framing mechanism built in, but doable)
At both write and read time, data can be streamed so it doesn't need to be held in memory all at once, which would be prohibitive given the size of our corpora
Things that kind of suck:
Deletes are a huge pain, and basically involve streaming through all the buckets and writing out new ones that omit any pairs that contain the ID of a document that's been deleted
I suspect I could still do better both in terms of speed and compactness with a more special-purpose data structure and/or compression strategy
So: what kinds of data structures should I be looking at? I suspect that the right answer is some kind of exotic succinct data structure, but this isn't a space I know very well. Also, if it matters: all of the document IDs are unsigned 32-bit ints, and the current code that handles this data is written in C, as Python extensions, so that's probably the general technology family we'll stick with if possible.
How about using one hash table or B-tree per bucket?
On-disk hash tables are standard. Maybe the BerkeleyDB libraries (available in stock Python) will work for you; but be advised that since they come with transactions they can be slow, and may require some tuning. There are a number of choices (gdbm, tdb), all of which you should give a try. Just make sure you check out the API and initialize them with an appropriate size. Some will not resize automatically, and if you feed them too much data their performance just drops a lot.
Anyway, you may want to use something even more low-level, without transactions, if you have a lot of changes.
A pair of 32-bit ints packs into a single 64-bit value, and most databases should accept that as a key; in fact, many will accept arbitrary byte sequences as keys.
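For example, a quick sketch of packing an ordered pair of unsigned 32-bit ids into one 8-byte key (the helper names are just illustrative):

import struct

def pair_key(doc_a, doc_b):
    # Normalize the order so (3278, 15378) and (15378, 3278) map to the same key.
    lo, hi = sorted((doc_a, doc_b))
    return struct.pack("<II", lo, hi)

def unpack_key(key):
    return struct.unpack("<II", key)

# pair_key(15378, 3278) -> an 8-byte key usable with dbm, a B-tree, etc.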
Why not just store a table containing stuff that was deleted since the last re-write?
This table could be the same structure as your main bucket, maybe with a Bloom filter for quick membership checks.
You can re-write the main bucket data without the deleted items either when you were going to re-write it anyway for some other modification, or when the ratio of deleted items:bucket size exceeds some threshold.
This scheme could work either by storing each deleted pair alongside each bucket, or by storing a single table for all deleted documents: I'm not sure which is a better fit for your requirements.
With a single table, it's hard to know when you can drop an entry from it unless you know how many buckets it affects, short of just re-writing all buckets whenever the deletion table gets too large. That could work, but it's a bit stop-the-world.
You also have to do two checks for each pair you stream in (i.e., for (3278, 15378), you'd check whether either 3278 or 15378 has been deleted, instead of just checking whether the pair (3278, 15378) has been deleted).
Conversely, the per-bucket table of each deleted pair would take longer to build, but be slightly faster to check, and easier to collapse when re-writing the bucket.
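A small sketch of the single-deletion-table variant, filtering pairs while streaming a bucket (the names are assumptions, and a Bloom filter could stand in for the plain set):

import struct

def live_pairs(bucket_bytes, deleted_ids):
    # bucket_bytes: the decompressed bucket, a flat sequence of uint32 pairs.
    # deleted_ids:  a set (or Bloom filter) of deleted document ids.
    for a, b in struct.iter_unpack("<II", bucket_bytes):
        if a in deleted_ids or b in deleted_ids:
            continue
        yield a, b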
You are trying to reinvent what already exists in new-age NoSQL data stores.
There are two very good candidates for your requirements:
Redis
MongoDB
Both support data structures like dictionaries, lists, and queues. Operations like append, modify, and delete are also available in both, and are very fast.
The performance of both is driven by the amount of data that can reside in RAM.
Since most of your data is integer-based, that should not be a problem.
My personal suggestion is to go with Redis, with a good persistence configuration (i.e. the data should periodically be saved from RAM to disk).
Here is a brief introduction to Redis data structures:
http://redis.io/topics/data-types-intro
Redis is a lightweight installation, and a Python client is available.
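As a sketch of how this might look with the redis-py client (the key names and pair encoding are just illustrative, and a local server is assumed):

import redis   # pip install redis; assumes a Redis server on localhost

r = redis.Redis(host="localhost", port=6379)

# One Redis set per similarity bucket; members encode the ordered pair.
r.sadd("bucket:0.5", "3278:15378")

r.sismember("bucket:0.5", "3278:15378")   # membership test
r.srem("bucket:0.5", "3278:15378")        # delete a pair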

Mongodb with Python's "set()" type

I am building a web app with MongoDB as the backend. Some of the documents need to store a collection of items in some sort of list, and the system will then need to frequently check whether a specified item is present in that list. Using Python's 'in' operator on a list takes O(n) time, n being the size of the list. Since these lists can get quite large, I want something faster than that. Python's 'set' type does this operation in constant time (and enforces uniqueness, which is good in my case), but is considered an invalid data type to put in MongoDB.
So what's the best way to do this? Is there some way to just use a regular list and exploit Mongo's indexing features? Again, I want to know, for a given document in a collection, whether a list inside that document contains a particular element.
You can represent a set using a dictionary. Your elements become the keys, and all the values can be set to a constant such as 1. The in operator checks for the existence of a key.
EDIT. MongoDB stores a dict as a BSON document, where the keys must be strings (with some additional restrictions), so the above advice is of limited use.
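A minimal sketch of the dict-as-set idea mapped onto that key-string restriction, using pymongo (the collection, document id and item values are assumptions, and the items must be valid BSON key strings):

from pymongo import MongoClient   # assumes a running MongoDB instance

coll = MongoClient().mydb.docs            # database and collection names are illustrative
doc_id, item = "doc-42", "blue-widget"    # illustrative values

# Store the "set" as a subdocument whose keys are the items themselves.
coll.update_one({"_id": doc_id},
                {"$set": {"items." + item: 1}},
                upsert=True)

# Existence check done server-side instead of scanning a Python list:
found = coll.find_one({"_id": doc_id, "items." + item: {"$exists": True}}) is not None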

merging dictionaries in python

Sorry for the very general title but I'll try to be as specific as possible.
I am working on a text mining application. I have a large number of key-value pairs of the form ((word, corpus) -> occurrence_count) (everything is an integer) which I am storing in multiple Python dictionaries (tuple -> int). These values are spread across multiple files on disk (I pickled them). To make any sense of the data, I need to aggregate these dictionaries. Basically, I need to figure out a way to find all the occurrences of a particular key in all the dictionaries, and add them up to get a total count.
If I load more than one dictionary at a time, I run out of memory, which is the reason I had to split them in the first place. When I tried, I ran into performance issues. I am currently trying to store the values in a DB (MySQL), processing multiple dictionaries at a time, since MySQL provides row-level locking, which is both good (since it means I can parallelize this operation) and bad (since it slows down the insert queries).
What are my options here? Is it a good idea to write a partially disk-based dictionary so I can process the dicts one at a time? With an LRU replacement strategy? Is there something that I am completely oblivious to?
Thanks!
A disk-based dictionary-like object exists -- see the shelve module. Keys into a shelf must be strings, but you could simply use str on your tuples to obtain equivalent string keys; plus, I read your Q as meaning that you want only word as the key, so that's even easier (either str -- or, for vocabularies < 4 GB, a struct.pack -- will be fine).
A good relational engine (especially PostgreSQL) would serve you well, but processing one dictionary at a time to aggregate each word's occurrences over all corpora into a shelf object should also be OK (not quite as fast, but simpler to code, since a shelf is so similar to a dict except for the type constraint on keys, and a caveat for mutable values -- but as your values are ints, that need not concern you).
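A rough sketch of that shelve-based aggregation (the shelf and pickle file names are assumptions):

import pickle
import shelve

filenames = ["counts_0.pkl", "counts_1.pkl"]      # your pickled (word, corpus) -> count dicts

with shelve.open("word_totals") as totals:
    for fn in filenames:                          # one dict in memory at a time
        with open(fn, "rb") as f:
            data_dict = pickle.load(f)
        for (word, corpus), count in data_dict.items():
            key = str(word)                       # shelve keys must be strings
            totals[key] = totals.get(key, 0) + count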
Something like this, if I understand your question correctly:
from collections import defaultdict
import pickle

result = defaultdict(int)
for fn in filenames:
    with open(fn, "rb") as f:          # pickles must be read in binary mode
        data_dict = pickle.load(f)
    for (word, corpus), count in data_dict.items():
        result[(word, corpus)] += count
If I understood your question correctly and you have integer ids for the words and corpora, then you can gain some performance by switching from a dict to a list, or even better, a numpy array. This may be annoying!
Basically, you need to replace the tuple with a single integer, which we can call the newid. You want each newid to correspond to a (word, corpus) pair, so I would count the words in each corpus and then have, for each corpus, a starting newid. The newid of (word, corpus) will then be word + start_newid[corpus].
If I misunderstood you and you don't have such ids, then I think this advice might still be useful, but you will have to manipulate your data to get it into the tuple-of-ints format.
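A tiny sketch of that newid mapping with numpy (the corpus sizes are made up):

import numpy as np

num_words = np.array([50000, 42000, 61000])              # words per corpus (illustrative)
start_newid = np.concatenate(([0], np.cumsum(num_words)[:-1]))

def newid(word, corpus):
    return start_newid[corpus] + word

# Counts live in one flat array instead of a dict keyed by tuples:
counts = np.zeros(num_words.sum(), dtype=np.int64)
counts[newid(word=17, corpus=1)] += 3                    # e.g. word 17 of corpus 1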
Another thing you could try is rechunking the data.
Let's say that you can only hold 1.1 of these monsters in memory. Then, you can load one, and create a smaller dict or array that only corresponds to the first 10% of (word,corpus) pairs. You can scan through the loaded dict, and deal with any of the ones that are in the first 10%. When you are done, you can write the result back to disk, and do another pass for the second 10%. This will require 10 passes, but that might be OK for you.
If you chose your previous chunking based on what would fit in memory, then you will have to arbitrarily break your old dicts in half so that you can hold one in memory while also holding the result dict/array.
