I am currently storing int-int key-value pairs. For fast access I am using a BTrees.IIBTree structure held in memory. It is not stored on disk at all, since we need the most recent data.
However, the current solution barely fits into memory, so I am looking for a database or data structure that is more efficient in terms of access time. If it is kept in memory, it also needs to be efficient in terms of memory space.
One idea would be to replace the BTrees.IIBTree structure with an int-byte hash table written in C as a Python extension; the data would still be lost if the machine fails (not a terrible thing in our case).
What are your suggestions?
In Python, I'm reading in a very large 2D grid of data that consists of around 200,000,000 data points in total. Each data point is a tuple of 3 floats. Reading all of this data into a two-dimensional list frequently causes MemoryErrors. To get around this, I would like to read the data into some sort of table on the hard drive that can be accessed efficiently given a grid coordinate, e.g. harddrive_table.get(300, 42).
So far in my research I've come across PyTables, which is an implementation of HDF5 and seems like overkill, and the built-in shelve library, which uses a dictionary-like interface to access saved data; but shelve keys have to be strings, and converting hundreds of millions of grid coordinates to strings could be too much of a performance hit for my use case.
Are there any libraries that allow me to store a 2D table of data on the hard drive with efficient access for a single data point?
This table of data is only needed while the program is running, so I don't care about its interoperability or how it stores the data on the hard drive, as it will be deleted after the program has run.
HDF5 isn't really overkill if it works. In addition to PyTables there's the somewhat simpler h5py.
Numpy lets you mmap a file directly into a numpy array. The values will be stored in the disk file in the minimum-overhead way, with the numpy array shape providing the mapping between array indices and file offsets. mmap uses the same underlying OS mechanisms that power the disk cache to map a disk file into virtual memory, meaning that the whole thing can be loaded into RAM if memory permits, but parts can be flushed to disk (and reloaded later on demand) if it doesn't all fit at once.
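A minimal sketch of the memmap approach; the file name and grid shape here are illustrative (a 1,000 x 1,000 grid of float triples), and would be adjusted to the real 200M-point data set:

```python
import numpy as np

# Create (or open) a disk-backed array; mode='w+' creates the file.
shape = (1_000, 1_000, 3)
table = np.memmap('grid.dat', dtype=np.float64, mode='w+', shape=shape)

# Indexing works like a normal numpy array; pages are faulted in from
# disk on demand rather than the whole file being loaded at once.
table[300, 42] = (1.0, 2.0, 3.0)
point = table[300, 42]      # the stored float triple

table.flush()               # push dirty pages back to the file
```

Reopening later with mode='r+' and the same shape gives the same table without any load step.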
So, I have written an autocomplete and autocorrect program in Python 2. I wrote the autocorrect program using the approach described in Peter Norvig's blog post on how to write a spell checker, link.
Now, I am using a trie data structure implemented with nested lists. I am using a trie because it can give me all words starting with a particular prefix. At the leaf is a tuple with the word and a value denoting the frequency of the word. For example, the words bad, bat and cat would be saved as:
['b', ['a', ['d', ('bad', 4), 't', ('bat', 3)]], 'c', ['a', ['t', ('cat', 4)]]]
where 4, 3 and 4 are the number of times the words have been used, i.e. the frequency values. Similarly, I have made a trie of about 130,000 words of the English dictionary and stored it using cPickle.
Now it takes about 3-4 seconds for the entire trie to be read each time. The problem is that each time a word is encountered its frequency value has to be incremented, and then the updated trie needs to be saved again. As you can imagine, waiting 3-4 seconds to read and about as long again to save the updated trie each time is a big problem. I will need to perform a lot of update operations each time the program is run and save them.
Is there a faster or more efficient way to store a large data structure that will be updated repeatedly? How are the data structures of the autocorrect programs in IDEs and mobile devices saved and retrieved so fast? I am open to different approaches as well.
A few things come to mind.
1) Split the data. Say, use 26 files, each storing the tries starting with a certain character. You can refine this by splitting on a longer prefix. This way the amount of data you need to write is smaller.
2) Don't reflect everything to disk. If you need to perform a lot of operations, do them in RAM and write them out at the end. If you're afraid of data loss, you can checkpoint your computation after some time X or after a number of operations.
3) Multi-threading. Unless your program only does spellchecking, it's likely there are other things it needs to do. Have a separate thread do the loading and writing, so that disk IO doesn't block everything else. Multi-threading in Python is a bit tricky, but it can be done.
4) Custom structure. Part of the time spent in serialization is invoking serialization functions. Since you have a dictionary for everything, that's a lot of function calls. In the perfect case you have a memory representation that matches the disk representation exactly: you then simply read one large string into your custom class (and write that string back to disk when needed). This is a bit more advanced, and the benefits will likely not be huge, especially since Python is not efficient at playing with bits, but if you need to squeeze out the last bit of speed, this is the way to go.
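Suggestion 1 can be sketched as follows; the directory layout, file names, and the dict-based toy trie are made up for illustration:

```python
import os
import pickle

def save_split(trie, directory):
    """Pickle each top-level branch of the trie to its own file, so
    updating words starting with 'b' only rewrites b.pkl."""
    os.makedirs(directory, exist_ok=True)
    for letter, subtrie in trie.items():
        with open(os.path.join(directory, letter + '.pkl'), 'wb') as f:
            pickle.dump(subtrie, f)

def load_branch(directory, letter):
    """Load just the branch for one starting letter."""
    with open(os.path.join(directory, letter + '.pkl'), 'rb') as f:
        return pickle.load(f)

# Toy trie keyed by first letter (a dict here rather than nested lists).
trie = {'b': {'ad': 4, 'at': 3}, 'c': {'at': 4}}
save_split(trie, 'trie_parts')
branch = load_branch('trie_parts', 'b')
```

Each update then touches only one small file instead of re-serializing the whole 130,000-word structure.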
I would suggest moving serialization to a separate thread and running it periodically. You don't need to re-read your data each time, because you already have the latest version in memory. This way your program stays responsive to the user while the data is being saved to disk. The saved version on disk may lag behind, and the latest updates may be lost if the program crashes, but this shouldn't be a big issue for your use case, I think.
It depends on the particular use case and environment, but I think most programs that have local data sets sync them using multi-threading.
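A sketch of the periodic-save thread; the class, file name, and interval are made up for illustration:

```python
import pickle
import threading

class PeriodicSaver:
    """Snapshot `data` to `path` every `interval` seconds in a
    background thread, so the main thread never blocks on disk IO."""

    def __init__(self, data, path, interval=5.0):
        self.data = data
        self.path = path
        self.interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # wait() returns False on timeout (save again) and True on stop.
        while not self._stop.wait(self.interval):
            self.save()

    def save(self):
        with open(self.path, 'wb') as f:
            pickle.dump(self.data, f)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
        self.save()           # final snapshot on shutdown

freqs = {'bad': 4, 'bat': 3}
saver = PeriodicSaver(freqs, 'freqs.pkl', interval=0.1)
saver.start()
freqs['cat'] = 1              # updates happen in memory and never block
saver.stop()
```

A crash loses at most one interval's worth of updates, which matches the trade-off described above.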
I have a bunch of flat files that basically store millions of paths and their corresponding info (name, atime, size, owner, etc.).
I would like to compile a full list of all the paths stored collectively in the files. For duplicate paths, only the largest needs to be kept.
There are roughly 500 files and approximately a million paths per text file. The files are also gzipped. So far I've been able to do this in Python, but the solution is not optimized: for each file it basically takes an hour to load and compare against the current list.
Should I go for a database solution? sqlite3? Is there a data structure or better algorithm for this in Python? Thanks for any help!
So far I've been able to do this in python but the solution is not optimized as for each file it basically takes an hour to load and compare against the current list.
If "the current list" implies that you're keeping track of all the paths seen so far in a list, and then doing if newpath in list_of_paths: for each line, then each one of those searches takes linear time. If you have 500M total paths, of which 100M are unique, you're doing O(500M*100M) comparisons.
Just changing that list to a set, and changing nothing else in your code (well, you need to replace .append with .add, and you can probably remove the in check entirely… but without seeing your code it's hard to be specific), makes each one of those checks take constant time. So you're doing O(500M) comparisons, i.e. 100M times faster.
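A minimal sketch of the set-based approach; the file names and line format (path in the first whitespace-separated field) are assumptions:

```python
import gzip

def unique_paths(filenames):
    """Stream gzipped listing files and collect each path once."""
    seen = set()
    for name in filenames:
        with gzip.open(name, 'rt') as f:
            for line in f:             # streams; never loads a whole file
                path = line.split()[0]
                seen.add(path)         # O(1) insert/membership, vs O(n) for a list
    return seen
```

Picking the largest entry per duplicate path would replace the set with a dict mapping path to the best record seen so far, with the same constant-time lookups.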
Another potential problem is that you may not have enough memory. On a 64-bit machine, you've got enough virtual memory to hold almost anything you want… but if there's not enough physical memory available to back that up, eventually you'll spend more time swapping data back and forth to disk than doing actual work, and your program will slow to a crawl.
There are actually two potential sub-problems here.
First, you might be reading each entire file in at once (or, worse, all of the files at once) when you don't need to (e.g., by decompressing the whole file instead of using gzip.open, or by using f = gzip.open(…) but then doing f.readlines() or f.read(), or whatever). If so… don't do that. Just iterate over the lines in each GzipFile, for line in f:.
Second, maybe even a simple set of however many unique lines you have is too much to fit in memory on your computer. In that case, you probably want to look at a database. But you don't need anything as complicated as sqlite. A dbm acts like a dict (except that its keys and values have to be byte strings), but it's stored on disk, caching things in memory where appropriate, instead of stored in memory, paging to disk randomly, which means it will go a lot faster in this case. (And it'll be persistent, too.) Of course you want something that acts like a set, not a dict… but that's easy. You can model a set as a dict whose keys are always ''. So instead of paths.add(newpath), it's just paths[newpath] = ''. Yeah, that wastes a few bytes of disk space over building your own custom on-disk key-only hash table, but it's unlikely to make any significant difference.
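The dbm-as-set idea looks like this (using Python 3's dbm module here; the file name is illustrative):

```python
import dbm

# Open (or create, with 'c') an on-disk table that acts like a dict of bytes.
with dbm.open('paths.db', 'c') as paths:
    # Model a set: the value is always empty, only the key matters.
    for newpath in [b'/usr/bin/python', b'/etc/hosts', b'/usr/bin/python']:
        paths[newpath] = b''

    unique = len(paths.keys())      # duplicates collapse to one key
    seen = b'/etc/hosts' in paths   # membership test without loading everything
```

The table persists between runs, so a crashed job can resume instead of restarting from the first file.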
Let's suppose that I have a lot of (hundreds of) big Python dictionaries. The pickled file size is about 2 MB. I want to draw a chart using data from these dictionaries, so I have to load them all. What is the most efficient way (first speed, second memory) to store my data? Maybe I should use another caching tool? This is how I am solving the task now:
Pickle each dictionary. Just pickle(dict).
Load the pickled string into redis. redis.set(key, dict)
When a user needs a chart, I create an array and fill it with the unpickled data from redis. Just like that:
array = []
for i in range(iteration_count):
    array.append(pickle.loads(redis.get(keys[i])))
Now I have both problems. Memory: my array is very big, but that's not important and is easy to solve. The main problem is speed: a lot of the objects take more than 0.3 seconds to unpickle, and I even have bottlenecks with more than 1 second of unpickling time. Getting the string from redis is also rather expensive (more than 0.01 sec). When I have lots of objects, my users have to wait many seconds.
If it can be assumed that you are asking in the context of a web application and that you are displaying your charts in a browser, I would definitely recommend storing your dictionaries as JSON in redis.
Again, you have not provided many details about your application, but I have implemented charting over very large data sets before (hundreds of thousands of sensor data points per second over several minutes of time). To help rendering performance, I stored each type of data in its own dictionary or 'series'. This strategy allows you to render only the portions of the data that are required.
Perhaps if you share more about your particular application we may be able to provide more help.
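The JSON approach could look like the sketch below; a plain dict stands in for the redis connection here, and redis.set/redis.get would take its place in the real application:

```python
import json

store = {}   # stand-in for a redis connection in this sketch

def save_series(key, series):
    # json.dumps is cheap for plain dicts of numbers, and the browser
    # can consume the payload directly, with no unpickling step.
    store[key] = json.dumps(series)

def load_series(key):
    return json.loads(store[key])

save_series('cpu', {'t0': 0.31, 't1': 0.42})
series = load_series('cpu')
```

A browser-side chart library can then parse the stored string itself, so the server never pays the deserialization cost at all.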
I have a bunch of code that deals with document clustering. One step involves calculating the similarity (for some unimportant definition of "similar") of every document to every other document in a given corpus, and storing the similarities for later use. The similarities are bucketed, and I don't care what the specific similarity is for the purposes of my analysis, just which bucket it's in. For example, if documents 15378 and 3278 are 52% similar, the ordered pair (3278, 15378) gets stored in the [0.5, 0.6) bucket. Documents are sometimes added to or removed from the corpus after the initial analysis, so corresponding pairs get added to or removed from the buckets as needed.
I'm looking at strategies for storing these lists of ID pairs. We found a SQL database (where most of our other data for this project lives) to be too slow and too large disk-space-wise for our purposes, so at the moment we store each bucket as a compressed list of integers on disk (originally zlib-compressed, but now using lz4 instead for speed). Things I like about this:
Reading and writing are both quite fast
After-the-fact additions to the corpus are fairly straightforward to add (a bit less so for lz4 than for zlib because lz4 doesn't have a framing mechanism built in, but doable)
At both write and read time, data can be streamed so it doesn't need to be held in memory all at once, which would be prohibitive given the size of our corpora
Things that kind of suck:
Deletes are a huge pain, and basically involve streaming through all the buckets and writing out new ones that omit any pairs that contain the ID of a document that's been deleted
I suspect I could still do better both in terms of speed and compactness with a more special-purpose data structure and/or compression strategy
So: what kinds of data structures should I be looking at? I suspect that the right answer is some kind of exotic succinct data structure, but this isn't a space I know very well. Also, if it matters: all of the document IDs are unsigned 32-bit ints, and the current code that handles this data is written in C, as Python extensions, so that's probably the general technology family we'll stick with if possible.
How about using one hash table or B-tree per bucket?
On-disk hash tables are standard. Maybe the BerkeleyDB libraries (available in stock Python) will work for you, but be advised that since they come with transactions they can be slow, and may require some tuning. There are a number of other choices, such as gdbm and tdb, that you should give a try. Just make sure you check out the API and initialize them with an appropriate size: some will not resize automatically, and if you feed them too much data their performance drops a lot.
Anyway, you may want to use something even more low-level, without transactions, if you have a lot of changes.
A pair of ints is a long - and most databases should accept a long as a key; in fact many will accept arbitrary byte sequences as keys.
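Packing a pair of 32-bit document IDs into a single 8-byte key could be sketched like this (the function names are illustrative):

```python
import struct

def pair_key(a, b):
    """Pack an ordered pair of unsigned 32-bit ints into 8 bytes.
    Big-endian, so byte-wise key order matches (a, b) order."""
    return struct.pack('>II', a, b)

def unpack_key(key):
    return struct.unpack('>II', key)

key = pair_key(3278, 15378)   # an 8-byte string usable as a db key
```

The big-endian layout matters for B-tree-style stores, since it makes range scans over all pairs with the same first ID possible.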
Why not just store a table containing stuff that was deleted since the last re-write?
This table could be the same structure as your main bucket, maybe with a Bloom filter for quick membership checks.
You can re-write the main bucket data without the deleted items either when you were going to re-write it anyway for some other modification, or when the ratio of deleted items:bucket size exceeds some threshold.
This scheme could work either by storing each deleted pair alongside each bucket, or by storing a single table for all deleted documents: I'm not sure which is a better fit for your requirements.
Keeping a single table, it's hard to know when you can remove an item unless you know how many buckets it affects, without just re-writing all buckets whenever the deletion table gets too large. This could work, but it's a bit stop-the-world.
You also have to do two checks for each pair you stream in (i.e., for (3278, 15378), you'd check whether either 3278 or 15378 has been deleted), instead of just checking whether the pair (3278, 15378) has been deleted.
Conversely, the per-bucket table of each deleted pair would take longer to build, but be slightly faster to check, and easier to collapse when re-writing the bucket.
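The quick membership check could be fronted by a small Bloom filter, along these lines; the sizes here are illustrative, not tuned to real corpus sizes:

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter over integer IDs. False positives are
    possible (fall through to the real deletion table), false
    negatives are not (a miss means definitely not deleted)."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from salted hashes of the item.
        for i in range(self.k):
            h = hashlib.sha256(b'%d:%d' % (i, item)).digest()
            yield int.from_bytes(h[:8], 'big') % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

deleted = BloomFilter()
deleted.add(3278)
```

Streaming a bucket then only consults the on-disk deletion table for the rare IDs the filter flags.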
You are trying to reinvent what already exists in new age NoSQL data stores.
There are two very good candidates for your requirements:
Redis
MongoDB
Both support data structures like dictionaries, lists and queues. Operations like append, modify and delete are also available in both, and are very fast.
The performance of both is driven by the amount of data that can reside in RAM.
Since most of your data is integer-based, that should not be a problem.
My personal suggestion is to go with Redis, with a good persistence configuration (i.e. the data should periodically be saved from RAM to disk).
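The snapshotting thresholds live in redis.conf; the stock defaults look like this, where each line means "snapshot after N seconds if at least M keys changed":

```
save 900 1
save 300 10
save 60 10000
```

For stronger durability, the append-only file (appendonly yes) can be enabled in addition to, or instead of, these snapshots, at some write-throughput cost.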
Here is a brief overview of the redis data structures:
http://redis.io/topics/data-types-intro
The redis database is a lightweight installation, and a client is available for Python.