Let's suppose that I have a lot of (hundreds of) big Python dictionaries. The pickled file size is about 2 MB. I want to draw a chart using data from these dictionaries, so I have to load them all. What is the most efficient (first in terms of speed, second in terms of memory) way to store my data? Maybe I should use another caching tool? This is how I am solving the task now:
Pickle each dictionary: just pickle.dumps(d).
Store the pickled string in Redis: redis.set(key, pickled).
When the user needs a chart, I create an array and fill it with the unpickled data from Redis, like this:
import pickle

array = []
for key in keys:  # one Redis key per pickled dictionary
    array.append(pickle.loads(redis.get(key)))
Now I have both problems. Memory is an issue because my array is very big, but that is not critical and is easy to solve. The main problem is speed: many objects take more than 0.3 seconds to unpickle, and I even hit bottlenecks of more than 1 second per unpickle. Getting the string from Redis is also rather expensive (more than 0.01 sec). When I have lots of objects, my users have to wait many seconds.
If it can be assumed that you are asking in the context of a web application and that you are displaying your charts in a browser, I would definitely recommend storing your dictionaries as JSON in redis.
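For example, a minimal sketch of the JSON approach with the redis-py client, assuming a local Redis server; the key names and sample dictionary are made up:

import json
import redis

r = redis.Redis()

my_dict = {"x": [1, 2, 3], "y": [4.2, 5.1, 6.3]}   # stand-in for one chart's data

# Store each dictionary as JSON instead of a pickle.
r.set("chart:1", json.dumps(my_dict))

# When building a chart, fetch several values in one round trip and decode.
raw = r.mget(["chart:1", "chart:2"])
data = [json.loads(v) for v in raw if v is not None]

Since the browser can consume JSON directly, this also skips a server-side conversion step when the chart is rendered.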
Again, you have not provided too many details about your application, but I have implemented charting over very large data sets before (hundreds of thousands of sensor data points per second over several minutes). To help rendering performance, I stored each type of data in its own dictionary or 'series'. This strategy allows you to render only the portions of the data that are required.
Perhaps if you share more about your particular application we may be able to provide more help.
I am looking for some high level advice about a project that I am attempting.
I want to write a PyQt application (following the model-view pattern) to read in images from a directory one by one and process them. Typically there will be a few thousand .png images (each around 1 megapixel, 16-bit grayscale) in the directory. After being read in, the application will process the integer pixel values of each image in some way, and crucially the result will be a matrix of floats for each. Once processed, the user should then be able to go back and explore any of the matrices they choose (or multiple at once), and possibly apply further processing.
My question is regarding a sensible way to store the matrices in memory and access them when needed. After reading in the raw .png files and obtaining the corresponding matrices of floats, I can see the following options for handling the result:
Simply store each matrix as a numpy array and have every one of them stored in a class attribute. That way they will all be easily accessible to the code when requested by the user, but will this be poor in terms of RAM required?
After processing each, write out the matrix to a text file, and read it back in from the text file when requested by the user.
I have seen examples (see here) of people using SQLite databases to store data for a GUI application (using the MVC pattern), querying the database whenever access to the data is needed. This seems to have the advantage that the data is not held in RAM by the "model" part of the application (as it would be in option 1), and it is possibly more storage-efficient than option 2, but is it suitable given that my data are matrices?
I have seen examples (see here) of people using something called HDF5 for storing application data, and it sounds like this might be similar to using a SQLite database? Again, is it suitable for matrices? (A rough sketch of this option follows the list below.)
Finally, I see that PyQt has the classes QImage and QPixmap. Do these make sense for solving the problem I have described?
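From what I can tell, the HDF5 option would look roughly like this; a minimal sketch using h5py, where the file and dataset names are made up and processed_matrix stands in for one result:

import h5py
import numpy as np

processed_matrix = np.random.rand(1000, 1000).astype(np.float64)  # stand-in result

# Write each processed matrix to its own dataset in one HDF5 file.
with h5py.File("results.h5", "a") as f:
    f.create_dataset("image_0001", data=processed_matrix, compression="gzip")

# Later, when the user wants to explore that matrix again:
with h5py.File("results.h5", "r") as f:
    matrix = f["image_0001"][()]   # reads only this dataset into memory

If I understand correctly, only the datasets I actually index get read into RAM, which would address the concern from option 1.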
I am a little lost with all the options, and don't want to spend too much time investigating each of them in detail, so I would appreciate some general advice. If someone could offer comments on each of the options I have described (as well as letting me know if any can be ruled out in this situation), that would be great!
Thank you
Thanks for hearing me out.
I have a dataset that is a matrix of shape 75000x10000 filled with float values. Think of it like a heatmap/correlation matrix. I want to store this in a SQLite database (SQLite because I am modifying an existing Django project). The source data file is 8 GB in size and I am trying to use Python to carry out my task.
I have tried to use pandas chunking to read the file into Python, transform it into unstacked, pairwise-indexed data, and write it out to a JSON file. But this approach is eating up my compute budget: for a chunk of size 100x10000 it generates a 200 MB JSON file.
This json file will be used as a fixture to form the SQLite database in Django backend.
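For concreteness, the chunked read-and-unstack step I am describing looks roughly like this (the file name, separator, and chunk size are placeholders):

import pandas as pd

# Read the 75000x10000 matrix in row chunks and unstack each chunk into
# (row, col, value) records before writing it out.
reader = pd.read_csv("matrix.tsv", sep="\t", header=None, chunksize=100)
for i, chunk in enumerate(reader):
    records = chunk.stack().reset_index()
    records.columns = ["row", "col", "value"]
    records.to_json("fixture_chunk_%04d.json" % i, orient="records")

I suspect the same loop could append straight into the database with records.to_sql(...) instead of producing JSON fixtures, but I'm not sure that is any better at this scale.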
Is there a better (faster/smarter) way to do this? I don't think writing out a roughly 90 GB JSON file over a full day is the way to go, and I'm not even sure Django's database layer can take this load.
Any help is appreciated!
SQLite is quite impressive for what it is, but it's probably not going to give you the performance you are looking for at that scale, so even though your existing project is Django on SQLite I would recommend simply writing a Python wrapper for a different data backend and just using that from within Django.
More importantly, forget about using Django models for something like this; they are an abstraction layer built for convenience (mapping database records to Python objects), not for performance. Django would very quickly choke trying to build 100s of millions of objects since it doesn't understand what you're trying to achieve.
Instead, you'll want to use a database type / engine that's suited to the type of queries you want to make:
If a typical query consists of a hundred point queries to get the data in particular 'cells', a key-value store might be ideal.
If you're typically pulling ranges of values from individual 'rows' or 'columns', then that's the access pattern to optimize for.
If your queries typically involve taking sub-matrices and performing predictable operations on them, then you might improve performance significantly by precalculating certain cumulative values.
And if you want to use the full dataset to train machine learning models, you're probably better off not using a database for your primary storage at all (since databases by nature sacrifice fast retrieval of the full raw data for fast calculations on interesting subsets), especially if your ML models can be parallelised using something like Spark.
No DB will handle everything well, so it would be useful if you could elaborate on the workload you'll be running on top of that data: what kinds of questions do you want to ask of it?
I basically have a large (multi-terabyte) dataset of text (it's in JSON but I could change it to dict or dataframe). It has multiple keys, such as "group" and "user".
Right now I'm filtering the data by reading through the entire text for these keys. It would be far more efficient to have a structure where I could look up by key and read only the matching records.
Doing the above would be trivial if it fit in memory, and I could use standard dict/pandas methods and hash tables. But it doesn't fit in memory.
There must be an off-the-shelf system for this. Can anyone recommend one?
There are discussions about this, but some of the better ones are old. I'm looking for the simplest off-the-shelf solution.
I suggest you split your large file into multiple small files using readlines(CHUNK), and then process them one by one.
I worked with large JSON files, and at the beginning processing took 45 seconds per file and my program ran for 2 days; but once I split the files up, it finished in only 4 hours.
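A minimal sketch of that splitting step, assuming a newline-delimited source file; the file names and the ~64 MB read hint are made up:

# Split a huge newline-delimited file into smaller parts that can be
# processed one at a time.
part = 0
with open("huge_dataset.json", "r") as source:
    while True:
        lines = source.readlines(64 * 1024 * 1024)  # read roughly 64 MB of lines
        if not lines:
            break
        with open("part_%05d.json" % part, "w") as out:
            out.writelines(lines)
        part += 1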
So, I have written an autocomplete and autocorrect program in Python 2. I have written the autocorrect part using the approach described in Peter Norvig's blog on how to write a spell checker, link.
Now, I am using a trie data structure implemented using nested lists. I am using a trie because it can give me all words starting with a particular prefix. At a leaf there is a tuple with the word and a value denoting the frequency of the word. For example, the words bad, bat, and cat would be saved as:
['b'['a'['d',('bad',4),'t',('bat',3)]],'c'['a'['t',('cat',4)]]]
Where 4, 3, 4 are the number of times the words have been used, i.e. their frequency values. Similarly, I have made a trie of about 130,000 words of the English dictionary and stored it using cPickle.
Now, it takes about 3-4 seconds for the entire trie to be read each time. The problem is that every time a word is encountered, its frequency value has to be incremented and then the updated trie has to be saved again. As you can imagine, it is a big problem to wait 3-4 seconds for the read and then roughly as long again for the save, every single time. I need to perform a lot of update operations each time the program is run and save them.
Is there a faster or more efficient way to store a large data structure that will be updated repeatedly? How are the data structures of the autocorrect programs in IDEs and on mobile devices saved and retrieved so quickly? I am open to different approaches as well.
A few things come to mind.
1) Split the data. Say, use 26 files, each storing the tries for words starting with a certain character; you can refine this by keying on longer prefixes. This way the amount of data you need to write per update is smaller (see the sketch after this list).
2) Don't reflect everything to disk. If you need to perform a lot of operations, do them in RAM (memory) and write them out at the end. If you're afraid of data loss, you can checkpoint your computation after some amount of time X or after some number of operations.
3) Multi-threading. Unless your program only does spellchecking, it's likely there are other things it needs to do. Have a separate thread do the loading and writing so that disk IO doesn't block everything else. Multi-threading in Python is a bit tricky, but it can be done.
4) Custom structure. Part of the time spent in serialization is invoking serialization functions. Since you have a dictionary for everything, that's a lot of function calls. In the perfect case you would have a memory representation that matches the disk representation exactly: you would then simply read one large string into your custom class (and write that string back to disk when you need to). This is a bit more advanced and the benefits will likely not be huge, especially since Python is not very efficient at playing with bits, but if you need to squeeze out the last bit of speed, this is the way to go.
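A rough sketch of points 1) and 2) combined: keep one trie shard per first letter, update it in memory, and only write a shard back after a batch of updates. The file layout, helper names, and checkpoint interval are assumptions, and the actual frequency increment is left as a placeholder:

import os
import pickle

SHARD_DIR = "trie_shards"
CHECKPOINT_EVERY = 1000          # updates per shard before writing it back

shards = {}                      # letter -> in-memory trie shard
dirty = {}                       # letter -> updates since the last save

def shard_path(letter):
    return os.path.join(SHARD_DIR, letter + ".pkl")

def load_shard(letter):
    if letter not in shards:
        if os.path.exists(shard_path(letter)):
            with open(shard_path(letter), "rb") as f:
                shards[letter] = pickle.load(f)
        else:
            shards[letter] = []  # same nested-list trie as in the question
        dirty[letter] = 0
    return shards[letter]

def save_shard(letter):
    if not os.path.isdir(SHARD_DIR):
        os.makedirs(SHARD_DIR)
    with open(shard_path(letter), "wb") as f:
        pickle.dump(shards[letter], f, pickle.HIGHEST_PROTOCOL)
    dirty[letter] = 0

def record_use(word):
    letter = word[0].lower()
    trie = load_shard(letter)
    # ... increment the word's frequency inside `trie` here ...
    dirty[letter] += 1
    if dirty[letter] >= CHECKPOINT_EVERY:
        save_shard(letter)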
I would suggest moving serialization to a separate thread and running it periodically. You don't need to re-read your data each time because you already have the latest version in memory. This way your program stays responsive to the user while the data is being saved to disk. The saved version on disk may lag behind, and the latest updates may be lost if the program crashes, but this shouldn't be a big issue for your use case, I think.
It depends on the particular use case and environment, but I think most programs that keep local data sets sync them to disk from a background thread.
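A minimal sketch of that background-save idea, assuming the trie lives in memory as trie; the interval and file name are placeholders:

import pickle
import threading
import time

SAVE_INTERVAL = 30               # seconds between snapshots (assumption)
TRIE_PATH = "trie.pkl"

trie = {}                        # the in-memory structure being updated
trie_lock = threading.Lock()

def save_periodically():
    while True:
        time.sleep(SAVE_INTERVAL)
        with trie_lock:          # take a consistent snapshot
            snapshot = pickle.dumps(trie, pickle.HIGHEST_PROTOCOL)
        with open(TRIE_PATH, "wb") as f:
            f.write(snapshot)

saver = threading.Thread(target=save_periodically)
saver.daemon = True              # don't block program exit
saver.start()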
I have a bunch of code that deals with document clustering. One step involves calculating the similarity (for some unimportant definition of "similar") of every document to every other document in a given corpus, and storing the similarities for later use. The similarities are bucketed, and I don't care what the specific similarity is for purposes of my analysis, just what bucket it's in. For example, if documents 15378 and 3278 are 52% similar, the ordered pair (3278, 15378) gets stored in the [0.5,0.6) bucket. Documents sometimes get added to or removed from the corpus after the initial analysis, so the corresponding pairs get added to or removed from the buckets as needed.
I'm looking at strategies for storing these lists of ID pairs. We found a SQL database (where most of our other data for this project lives) to be too slow and too large disk-space-wise for our purposes, so at the moment we store each bucket as a compressed list of integers on disk (originally zlib-compressed, but now using lz4 instead for speed). Things I like about this:
Reading and writing are both quite fast
After-the-fact additions to the corpus are fairly straightforward to add (a bit less so for lz4 than for zlib because lz4 doesn't have a framing mechanism built in, but doable)
At both write and read time, data can be streamed so it doesn't need to be held in memory all at once, which would be prohibitive given the size of our corpora
Things that kind of suck:
Deletes are a huge pain, and basically involve streaming through all the buckets and writing out new ones that omit any pairs that contain the ID of a document that's been deleted
I suspect I could still do better both in terms of speed and compactness with a more special-purpose data structure and/or compression strategy
So: what kinds of data structures should I be looking at? I suspect that the right answer is some kind of exotic succinct data structure, but this isn't a space I know very well. Also, if it matters: all of the document IDs are unsigned 32-bit ints, and the current code that handles this data is written in C, as Python extensions, so that's probably the general technology family we'll stick with if possible.
How about using one hash table or B-tree per bucket?
On-disk hash tables are standard. Maybe the BerkeleyDB libraries (available in stock Python) will work for you; but be advised that since they come with transactions, they can be slow and may require some tuning. There are a number of alternatives, such as gdbm and tdb, that you should give a try. Just make sure you check out the API and initialize them with an appropriate size; some will not resize automatically, and if you feed them too much data their performance drops a lot.
Anyway, you may want to use something even more low-level, without transactions, if you have a lot of changes.
A pair of 32-bit ints fits in a single 64-bit long, and most databases will accept a long as a key; in fact, many will accept arbitrary byte sequences as keys.
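A small sketch of that idea using the standard-library dbm front end (which picks gdbm or another available backend); the file name is made up, and presence of a key is all that is stored:

import dbm
import struct

# One on-disk hash table per bucket; each pair of 32-bit document IDs is
# packed into a single 8-byte key.
db = dbm.open("bucket_0.5_0.6", "c")

def pair_key(a, b):
    lo, hi = (a, b) if a < b else (b, a)   # order-independent key
    return struct.pack("<II", lo, hi)

db[pair_key(3278, 15378)] = b"1"           # membership is all that matters

if pair_key(3278, 15378) in db:
    del db[pair_key(3278, 15378)]

db.close()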
Why not just store a table containing stuff that was deleted since the last re-write?
This table could be the same structure as your main bucket, maybe with a Bloom filter for quick membership checks.
You can re-write the main bucket data without the deleted items either when you were going to re-write it anyway for some other modification, or when the ratio of deleted items:bucket size exceeds some threshold.
This scheme could work either by storing each deleted pair alongside each bucket, or by storing a single table for all deleted documents: I'm not sure which is a better fit for your requirements.
With a single table, it's hard to know when you can drop an entry from it unless you know how many buckets it affects, short of just re-writing all buckets whenever the deletion table gets too large. This could work, but it's a bit stop-the-world.
You also have to do two checks for each pair you stream in (i.e., for (3278, 15378), you'd check whether either 3278 or 15378 has been deleted, instead of just checking whether the pair (3278, 15378) has been deleted).
Conversely, the per-bucket table of each deleted pair would take longer to build, but be slightly faster to check, and easier to collapse when re-writing the bucket.
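A rough sketch of the single-table variant, with a plain set standing in for the Bloom filter; the rewrite threshold and the write_bucket callable are assumptions:

REWRITE_THRESHOLD = 10000
deleted_docs = set()

def mark_deleted(doc_id):
    deleted_docs.add(doc_id)

def live_pairs(pairs):
    """Yield only the pairs whose documents have not been deleted."""
    for a, b in pairs:
        if a in deleted_docs or b in deleted_docs:
            continue
        yield (a, b)

def maybe_rewrite_all(buckets, write_bucket):
    # With one global deletion table, it is only safe to clear it after
    # *all* buckets have been rewritten (the stop-the-world step above).
    if len(deleted_docs) >= REWRITE_THRESHOLD:
        for name, pairs in buckets.items():
            write_bucket(name, live_pairs(pairs))
        deleted_docs.clear()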
You are trying to reinvent what already exists in new age NoSQL data stores.
There are 2 very good candidates for your requirements.
Redis
MongoDB
Both support data structures like dictionaries, lists, and queues. Operations like append, modify, and delete are available in both, and they are very fast.
The performance of both is driven by the amount of data that can reside in RAM.
Since most of your data is integer based, that should not be a problem.
My personal suggestion is to go with Redis, with a good persistence configuration (i.e. the data should periodically be saved from RAM to disk).
Here is a brief overview of Redis data structures:
http://redis.io/topics/data-types-intro
The Redis database is a lightweight installation, and a client is available for Python.
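For illustration, one way the buckets could map onto Redis with the redis-py client, using one set per bucket; the key naming scheme is an assumption:

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def member(a, b):
    lo, hi = sorted((a, b))
    return "%d:%d" % (lo, hi)

# Add the pair (3278, 15378) to the [0.5, 0.6) bucket.
r.sadd("bucket:0.5", member(3278, 15378))

# Deleting a document means scanning each bucket's set for pairs that
# mention it; sscan_iter streams members without loading the whole set.
def purge_document(bucket_key, doc_id):
    for m in r.sscan_iter(bucket_key):
        lo, hi = m.decode().split(":")
        if int(lo) == doc_id or int(hi) == doc_id:
            r.srem(bucket_key, m)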