I'm working on a project where I crawl and re-organize a huge amount of data into a result text file. Previously I used a dictionary to store temporary data, but as the data volume increased, memory usage slowed the process down and the dictionary became unusable.
Since processing speed is not that important in my case, I'm trying to replace the dictionary with a file, but I'm not sure how to easily move the file pointer to the appropriate position and read the required data. With a dictionary I can easily refer to any piece of data; I would like to achieve the same with a file.
I'm thinking of using mmap and writing my own functions to move the file pointer where I want. Does Python have a built-in or third-party module for such operations?
Any other theoretical approach is welcome.
I think you are now trying to reinvent a key-value database.
Maybe the easiest thing would be to check whether the sqlite3 module offers what you need. Using a ready-made database is easier than rolling your own!
Of course, sqlite3 is not a key-value DB (on the surface), so if you need something even simpler, have a look at LMDB and its Python bindings: http://lmdb.readthedocs.org/en/release/
It is as lightweight and fast as it gets. It is probably close to the fastest way to achieve what you want.
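For reference, a minimal sketch of how the lmdb Python bindings are typically used (the path and map_size below are just placeholders):

import lmdb

env = lmdb.open('/tmp/mydata.lmdb', map_size=1024 * 1024 * 1024)  # placeholder path, 1 GB max size

# write, roughly like d[key] = value
with env.begin(write=True) as txn:
    txn.put(b'some-key', b'some-value')

# read, roughly like d.get(key)
with env.begin() as txn:
    value = txn.get(b'some-key')  # returns None if the key is absent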
It should be noted that there is no such thing as an optimal key-value database. There are several aspects to consider. At least:
Do you read a lot or write a lot?
What are the key and value sizes?
Do you need transactions / crash safety?
Do you have duplicate keys (one key, several values)?
Do you want to have sorted keys?
Do you want to read the data out in the same order it is inserted?
What is your database size (MB, GB, TB, PB)?
Are you constrained on IO or CPU?
For example, the LMDB I suggested above is very good for read-intensive tasks, not so much for write-intensive ones. It offers transactions, keeps keys in sorted order and is crash-proof (limited only by the underlying file system). However, if you need to write to your database often, LMDB may not be the best choice.
On the other hand, SQLite is not the perfect choice for this task, theoretically speaking. In practice, it is built into the standard Python distribution and is thus easy to use. It may provide adequate performance, and may therefore be the best choice.
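For the dictionary-replacement use case above, a minimal sketch of treating sqlite3 as a key-value store could look like this (the file name and table layout are only illustrative):

import sqlite3

conn = sqlite3.connect('cache.db')  # illustrative file name
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)')

# insert or overwrite, roughly like d[key] = value
conn.execute('INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)', ('spam', 'eggs'))
conn.commit()

# lookup, roughly like d.get(key)
row = conn.execute('SELECT value FROM kv WHERE key = ?', ('spam',)).fetchone()
value = row[0] if row else None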
There are numerous high-quality databases out there. By not mentioning them I do not mean to give the impression that the DBs mentioned in this answer are the only good alternatives. Most database managers have a very good reason for their existence. While some are a bit outdated, most have their own sweet spots in the application area.
The field is constantly changing: completely new databases appear and old database systems keep being updated. This should be kept in mind when reading old benchmarks. The hardware used also has an impact; a computer with an SSD, a cloud computing instance, and a traditional computer with an HDD behave quite differently performance-wise.
Related
I am working with a network simulator. After making some extensions to it, I need to run a lot of different simulations and tests. I need to record:
simulation scenario configurations
values of some parameters (e.g. buffer sizes, signal qualities, position) per device per time unit t
final results computed from those recorded values
The second kind of data is needed to perform some visualization after the simulation has run (simple animations, showing some statistics over time).
I am using Python with matplotlib etc. for post-processing the data and for writing a proper app (currently considering PyQt or Django, but this is not the topic of the question). Now I am wondering what the best way to store this data would be.
My first guess was to use XML files, but the XML syntax adds a lot of overhead (I mean, the files can grow to very big sizes, especially for the second kind of data). So I tried to design a database... but that also does not seem to be the proper way to me... Maybe a mix of both?
I have tried to find some clues on Google, but found nothing special. Have you ever needed to store such data? How did you do it? Is there a "design pattern" for this?
Separate concerns:
Apart from pondering the technology to use for storing data (DBMS, CSV, or maybe one of the specific formats for scientific data), note that you have three very different kinds of data to manage:
Simulation scenario configurations: these are (typically) rather small, but they need to be simple to edit, simple to re-use, and should make it possible to reproduce a simulation run. Here, text or code files seem to be a good choice, and they should also be version-controlled (see the sketch after this list).
Raw simulation data: this is where you should be really careful if you are concerned with simulation performance, because writing 3 GB of data during a run can take a huge amount of time if implemented badly. One way to proceed would be to use existing file formats for this purpose (see below) and see if they work for you. If not, you can still use a DBMS. Also, it is usually a good idea to include a description of the scenario that generated the data (or at least a reference), as this helps you manage the results.
Data for post-processing: how to store this mostly depends on the post-processing tools. For example, if you already have a class structure for your visualization application, you could define a file format that makes it easy to read in the required data.
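For the scenario configurations, a minimal sketch of keeping them as plain, version-controllable JSON files (all field names are invented for illustration):

import json

# hypothetical scenario description; adjust the fields to your simulator
scenario = {
    'name': 'two_hop_relay',
    'duration_s': 600,
    'buffer_size_bytes': 65536,
    'nodes': [{'id': 1, 'x': 0.0, 'y': 0.0}, {'id': 2, 'x': 50.0, 'y': 10.0}],
}

with open('scenario_two_hop_relay.json', 'w') as f:
    json.dump(scenario, f, indent=2, sort_keys=True)

# later, reproduce the run from exactly the same file
with open('scenario_two_hop_relay.json') as f:
    scenario = json.load(f)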
Look for existing solutions:
The problem you face (How to manage simulation data?) is fundamental and there are many potential solutions, each coming with certain trade-offs. As you are working in network simulation, check out what capabilities other tools used in your community provide. It could be that their developers ran into problems you are not even anticipating yet (regarding reproducibility etc.), and already found a good solution. For example, you could check out how OMNeT++ is handling simulation output: the simulation configurations are defined in a separate file, results are written to vec and sca files (depending on their nature). As far as I understood your problems with hierarchical data, this is supported as well (vectors get unique IDs and are associated with an attribute of some model entity).
Additional tools already work with these file formats, e.g. to convert them to other formats like CSV/MATLAB files, so you could even think of creating files in the same format (documented here) and using the existing tools/converters for post-processing.
Many other simulation tools will have similar features, so take a look at what would work best for you.
It sounds like you need to record more or less the same kinds of information for each case, so a relational database sounds like a good fit -- why do you think it's "not the proper way"?
If your data fits in a collection of CSV files, you're most of the way to a relational database already! Just store in database tables instead, and you have support for foreign keys and queries. If you go on to implement an object-oriented solution, you can initialize your objects from the database.
If your data structures are well-known and stable AND you need some of the SQL querying / computation features, then a lightweight relational DB like SQLite might be the way to go (just make sure it can handle your eventual 3+ GB of data).
Otherwise - i.e., if each simulation scenario might need a dedicated data structure to store its results - and you don't need any SQL features, then you might be better off with a more free-form solution (a document-oriented database, an OO database, filesystem + CSV, whatever).
Note that you can still use a SQL db in the second case, but you'll have to dynamically create tables for each resultset, and of course dynamically create the relevant SQL queries too.
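As a rough illustration of that second case, here is a sketch that builds a per-scenario table and INSERT statement from a column list (all names are invented; table and column names cannot be bound as SQL parameters, so only interpolate trusted names like this):

import sqlite3

conn = sqlite3.connect('results.db')                   # invented file name
table = 'scenario_42'                                  # invented, one table per result set
columns = ['t', 'buffer_size', 'signal_quality']       # invented column list

col_defs = ', '.join('%s REAL' % c for c in columns)
conn.execute('CREATE TABLE IF NOT EXISTS %s (%s)' % (table, col_defs))

placeholders = ', '.join('?' for _ in columns)
insert_sql = 'INSERT INTO %s (%s) VALUES (%s)' % (table, ', '.join(columns), placeholders)
conn.executemany(insert_sql, [(0.0, 65536, 0.91), (1.0, 65536, 0.87)])
conn.commit()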
I'm developing a Python command line utility that potentially involves rather large queries against a set of files. It's a reasonably finite list of queries (think indexed DB columns). To improve in-process performance I can generate sorted/structured lists, maps and trees once and hit those repeatedly, rather than hitting the file system each time.
However, these caches are lost when the process ends, and need to be rebuilt every time the script runs, which dramatically increases the runtime of my program. I'd like to identify the best way to share this data between multiple executions of my command, which may be concurrent, one after another, or with significant delays between executions.
Requirements:
Must be fast - any sort of per-execution processing should be minimized; this includes disk IO and object construction.
Must be OS agnostic (or at least be able to hook into similar underlying behaviors on Unix/Windows, which is more likely).
Must allow reasonably complex querying / filtering - I don't think a key/value map will be good enough.
Does not need to be up-to-date - (briefly) stale data is perfectly fine; this is just a cache, and the actual data is being written to disk separately.
Can't use a heavyweight daemon process, like MySQL or MemCached - I want to minimize installation costs, and asking each user to install these services is too much.
Preferences:
I'd like to avoid any sort of long-running daemon process at all, if possible.
While I'd like to be able to update the cache quickly, rebuilding the whole cache on update isn't the end of the world; fast reads are much more important than fast writes.
In my ideal fantasy world, I'd be able to directly keep Python objects around between executions, sort of like Java threads (like Tomcat requests) sharing singleton data store objects, but I realize that may not be possible. The closer I can get to that though, the better.
Candidates:
SQLite in memory
SQLite on its own doesn't seem fast enough for my use case, since it's backed by disk and therefore has to read from the file on every execution. Perhaps this isn't as bad as it seems, but it seems necessary to keep the database persistently in memory. SQLite allows DBs to use memory as storage, but those DBs are destroyed upon program exit and cannot be shared between instances.
Flat file database loaded into memory with mmap
On the opposite end of the spectrum, I could write the caches to disk, then load them into memory with mmap and share the same memory space between separate executions. It's not clear to me what happens to the mmap if all processes exit, however. It's OK if the mmap is eventually flushed from memory, but I'd want it to stick around for a little bit (30 seconds? a few minutes?) so a user can run commands one after another and the cache can be reused. This example seems to imply that there needs to be an open mmap handle, but I haven't found any exact description of when memory-mapped files get dropped from memory and need to be reloaded from disk.
I think I could implement this, if mmap objects do stick around after exit, but it feels very low level, and I imagine someone's already got a more elegant solution implemented. I'd hate to start building this only to realize I've been rebuilding SQLite. On the other hand, it feels like it would be very fast, and I could make optimizations given my specific use case.
Share Python objects between processes using Processing
The Processing package indicates "Objects can be shared between processes using ... shared memory". Looking through the rest of the docs, I didn't see any further mention of this behavior, but that sounds very promising. Can anyone direct me to more information?
Store data on a RAM disk
My concern here is OS-specific capabilities, but I could create a RAM disk and then simply read/write to it as I please (SQLite?). The fs.memoryfs package seems like a promising alternative to work with multiple OSs, but the comments imply a fair number of limitations.
I know pickle is an efficient way to store Python objects, so it might have speed advantages over any sort of manual data storage. Can I hook pickle into any of the above options? Would that be better than flat files or SQLite?
I know there's a lot of questions related to this, but I did a fair bit of digging and couldn't find anything directly addressing my question with regards to multiple command line executions.
I fully admit, I may be way overthinking this. I'm just trying to get a feel for my options, and if they're worthwhile or not.
Thank you so much for your help!
I would just do the simplest thing that might possibly work. ...which in your case would likely just be to dump to a pickle file. If you find it's not fast enough, try something more involved (like memcached or SQLite). Donald Knuth says "Premature optimization is the root of all evil"!
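A minimal sketch of that simplest-possible approach, with an invented cache file name and an invented stand-in for the expensive index building:

import os
import pickle

CACHE_FILE = 'query_cache.pickle'   # invented location

def load_cache():
    # reuse the cache from a previous run if it exists
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE, 'rb') as f:
            return pickle.load(f)
    return None

def save_cache(cache):
    with open(CACHE_FILE, 'wb') as f:
        pickle.dump(cache, f, pickle.HIGHEST_PROTOCOL)

cache = load_cache()
if cache is None:
    cache = build_indexes_from_files()   # hypothetical: whatever expensive work you do today
    save_cache(cache)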
I'm working on a program in Python that needs to store a persistent "set" data structure containing many fixed-size hash values (SHA256, but that's not important). The critical operations are insert and lookup. Delete is not needed for regular operation. The set will grow over time and eventually may not all fit in memory.
I have considered:
a set stored on disk using pickle (slow [several seconds] to write new file to disk, eventually won't fit in memory)
a SQLite database (additional dependency not available by default)
custom disk-based balanced tree structure, such as B-tree or similar
Ideally, there would be a built-in Python module that provides something that can support these operations. What's a good option here?
After I composed this I found Fast disk-based hashtables? which has some good ideas. I like the mmap/bucket accepted answer there.
(This is for a rewrite of shaback if you're curious.)
Another option is to use shelve. I know it's the same as pickle under the hood, but I think it's a good option (and one I didn't see in your list of options :-)). Or, if you don't mind using a third-party lib, you can take a look at shove (it's like a shelve++).
I think this is what databases like sqlite are made for. Is there a reason you can't use it?
You could use a DBM-style database. I'm doing a similar thing with dbm, just storing all the keys with a value of '1'. The dbhash module (Berkeley DB) should work, but it's deprecated and gone in Python 3, so it's not a great idea for long-term use. Otherwise, use gdbm (dbm.gnu in Python 3) or ndbm (dbm.ndbm in Python 3). There's also dumbdbm (dbm.dumb in Python 3), which is pure Python and always works, but is a bit slower. Also, if you are going to have multiple simultaneous reads and writes, definitely do not use dumbdbm.
The various dbm modules all work just like a Python dictionary, except that the keys and values need to be strings. You can use the "in" keyword just as you would with a set or a dict.
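A minimal sketch of that dict-like usage with the Python 3 dbm module (the file name is invented; bytes keys are used to stay safe across backends):

import dbm

# 'c' opens the database for reading and writing, creating it if needed
with dbm.open('seen_hashes', 'c') as db:
    db[b'somehash'] = b'1'
    if b'somehash' in db:
        print('already seen')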
dbm, with the value set to an arbitrary '1' as Brian Minton suggested, is a convenient solution. cPickle is good too.
However, you should also consider using json. Check Google, but AFAIK the json parser seems to be faster than pickle/cPickle (e.g., http://kovshenin.com/2010/pickle-vs-json-which-is-faster/).
I have a relatively large dictionary. How do I know its size? Well, when I save it using cPickle, the file grows to approx. 400 MB. cPickle is supposed to be much faster than pickle, but loading and saving this file just takes a lot of time. I have a dual-core 2.6 GHz laptop with 4 GB RAM running Linux. Does anyone have any suggestions for faster saving and loading of dictionaries in Python? Thanks.
Use the protocol=2 option of cPickle. The default protocol (0) is much slower, and produces much larger files on disk.
If you just want to work with a larger dictionary than memory can hold, the shelve module is a good quick-and-dirty solution. It acts like an in-memory dict, but stores itself on disk rather than in memory. shelve is based on cPickle, so be sure to set your protocol to anything other than 0.
The advantages of a database like sqlite over cPickle will depend on your use case. How often will you write data? How many times do you expect to read each datum that you write? Will you ever want to perform a search of the data you write, or load it one piece at a time?
If you're doing write-once, read-many, and loading one piece at a time, by all means use a database. If you're doing write once, read once, cPickle (with any protocol other than the default protocol=0) will be hard to beat. If you just want a large, persistent dict, use shelve.
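For the write-once, read-once case on Python 2, a minimal sketch using the binary protocol (the file name is invented and my_dict stands in for your data):

import cPickle

# saving: protocol 2 is the binary protocol, far faster and smaller than the default 0
with open('big_dict.pkl', 'wb') as f:
    cPickle.dump(my_dict, f, 2)

# loading
with open('big_dict.pkl', 'rb') as f:
    my_dict = cPickle.load(f)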
I know it's an old question, but just as an update for those still looking for an answer:
The protocol argument has been updated in Python 3, and there are now even faster and more efficient options (i.e. protocol=3 and protocol=4) which do not work under Python 2.
You can read about it more in the reference.
In order to always use the best protocol supported by the Python version you're using, you can simply use pickle.HIGHEST_PROTOCOL. The following example is taken from the reference:
import pickle
# ...
with open('data.pickle', 'wb') as f:
    # Pickle the 'data' dictionary using the highest protocol available.
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
Sqlite
It might be worthwhile to store the data in an SQLite database. Although there will be some development overhead when refactoring your program to work with SQLite, it also becomes much easier and more performant to query the data.
You also get transactions, atomicity, serialization, etc. for free.
Depending on what version of Python you're using, you might already have sqlite built-in.
I have tried this for many projects and concluded that shelve is faster than pickle at saving data. Both perform the same when loading data.
Shelve is in fact a dirty solution.
That is because you have to be very careful with it. If you do not close a shelve file after opening it, or if for any reason your code is interrupted somewhere between opening and closing it, the shelve file has a high chance of getting corrupted (resulting in frustrating KeyErrors). That is really annoying, given that those of us using shelve do so precisely to store LARGE dict files that clearly also took a long time to construct.
And that is why shelve is a dirty solution... It's still faster though. So!
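One way to reduce that corruption risk is to make sure the shelf always gets closed, e.g. via a context manager (supported by shelve on Python 3.4+; the file name is invented):

import shelve

with shelve.open('big_data.shelf') as db:   # closed automatically, even on exceptions
    db['some_key'] = {'lots': 'of', 'nested': 'data'}

with shelve.open('big_data.shelf') as db:
    print(db['some_key'])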
You could try compressing your dictionary (with some restrictions, see this post); it will be effective if disk access is the bottleneck.
That is a lot of data...
What kind of contents does your dictionary have? If it is only primitive or fixed data types, maybe a real database or a custom file format would be the better option?
I'm starting on a new scientific project which has a lot of data (millions of entries) I'd like to store in an easily and quickly accessible format. I've come across a number of different potential options, but I'm not sure how to pick amongst them. My data can probably just be stored as a dictionary, or potentially a dictionary of dictionaries. Some potential considerations:
Speed. I can't load all the data off disk every time I start a new script, and I'd like access to random entries to be as quick as possible.
Ease-of-use. This is python. The storage should feel like python.
Stability/maturity. I'd like something that's currently supported, although something that works well but is still in development would be fine.
Ease of installation. My sysadmin should be able to get this running on our cluster.
I don't really care that much about the size of the storage, but it could be a consideration if an option is really terrible on this front. Also, if it matters, I'll most likely be creating the database once, and thereafter only reading from it.
Some potential options that I've started looking at (see this post):
pyTables
ZopeDB
shove
shelve
redis
durus
Any suggestions on which of these might be better for my purposes? Any better ideas? Some of these have a back-end; any suggestions on which file-system back-end would be best?
You might want to give MongoDB a shot - the PyMongo library works with dictionaries and supports most Python types. It's easy to install, very performant and scalable. MongoDB (and PyMongo) is also used in production at some big names.
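For illustration, a minimal sketch with pymongo, assuming a local mongod and a reasonably recent pymongo (database, collection and field names are invented):

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
collection = client['science']['entries']   # invented database and collection names

collection.insert_one({'entry_id': 42, 'values': [0.1, 0.2, 0.3], 'label': 'run-1'})
doc = collection.find_one({'entry_id': 42})
print(doc['values'])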
An RDBMS.
Nothing is more reliable than using tables in a well-known RDBMS. PostgreSQL comes to mind.
That automatically gives you some options for the future, like clustering. You also automatically get a lot of tools to administer your database, and you can use it from other software written in virtually any language.
It is really fast.
In the "feel like python" point, I might add that you can use an ORM. A strong name is sqlalchemy. Maybe with the elixir "extension".
Using sqlalchemy you can leave your user/sysadmin choose which database backend he wants to use. Maybe they already have MySql installed - no problem.
RDBMSs are still the best choice for data storage.
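A minimal sketch of that backend-swapping point with SQLAlchemy (the connection strings are just examples):

from sqlalchemy import create_engine, text

# the rest of the code only sees the engine, so changing backends is a one-line change
engine = create_engine('sqlite:///results.db')
# engine = create_engine('postgresql://user:password@localhost/results')
# engine = create_engine('mysql://user:password@localhost/results')

with engine.connect() as conn:
    print(conn.execute(text('SELECT 1')).fetchall())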
I'm working on such a project and I'm using SQLite.
SQLite stores everything in one file and is part of Python's standard library. Hence, installation and configuration are virtually free (ease of installation).
You can easily manage the database file with small Python scripts or via various tools. There is also a Firefox plugin (ease of installation / ease-of-use).
I find it very convenient to use SQL to filter/sort/manipulate/... the data, although I'm not an SQL expert. (ease-of-use)
I'm not sure whether SQLite is the fastest DB system for this job, and it lacks some features you might need, e.g. stored procedures.
Anyway, SQLite works for me.
If you really just need dictionary-like storage, some of the new key/value or column stores like Cassandra or MongoDB might provide a lot more speed than you'd get with a relational database. Of course, if you decide to go with an RDBMS, SQLAlchemy is the way to go (disclaimer: I am its creator), but your desired feature list seems to lean in the direction of "I just want a dictionary that feels like Python" - if you aren't interested in relational queries or strong ACIDity, those facets of an RDBMS will probably feel cumbersome.
SQLite -- it comes with Python, is fast, widely available and easy to maintain.
If you only need simple (dict-like) access mechanisms and need efficiency for processing a lot of data, then HDF5 might be a good option. If you are going to be using numpy, then it is really worth considering.
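A minimal sketch of that with h5py and numpy (PyTables from the list above works along the same lines; the file and dataset names are invented):

import numpy as np
import h5py

data = np.random.random((1000000, 3))       # stand-in for the real entries

with h5py.File('entries.h5', 'w') as f:
    f.create_dataset('positions', data=data, compression='gzip')

with h5py.File('entries.h5', 'r') as f:
    subset = f['positions'][1000:2000]      # reads only the requested slice from disk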
Go with an RDBMS: it is reliable, scalable and fast.
If you need an even more scalable solution and don't need the features of an RDBMS, you can go with a key-value store like CouchDB, which has a good Python API.
The NEMO collaboration (building a cosmic neutrino detector underwater) had many of the same problems, and they used MySQL and PostgreSQL without major issues.
It really depends on what you're trying to do. An RDBMS is designed for relational data, so if your data is relational, then use one of the various SQL options. But it sounds like your data is more oriented towards a key-value store with very fast random GET operations. If that's the case, compare the benchmarks of the various key-stores, focusing on the GET speed. The ideal key-value store will keep or cache requests in memory, and be able to handle many GET requests concurrently. You may actually want to create your own benchmark suite so you can effectively compare random concurrent GET operations.
Why do you need a cluster? Is the size of each value very large? If not, you shouldn't need a cluster to handle storage of a million entries. But if you're storing large blobs of data, that matters, and you may need something that easily supports read slaves and/or transparent partitioning. Some of the key-value stores are document-oriented and/or optimized for storing larger values. Redis is technically less storage-efficient for larger values due to the indexing overhead required for fast GETs, but that doesn't necessarily mean it's slower; in fact, the extra indexing makes lookups faster.
You're the only one that can truly answer this question, and I strongly recommend putting together a custom benchmark suite to test available options with actual usage scenarios. The data you get from that will give you more insight than anything else.
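A minimal sketch of such a benchmark for random GETs, with the backend-specific lookup left as a stub to fill in per candidate store (run several copies in threads or processes to test concurrency):

import random
import time

def get(key):
    # stub: replace with a lookup against the store you are evaluating
    raise NotImplementedError

def benchmark_random_gets(keys, n=100000):
    sample = [random.choice(keys) for _ in range(n)]
    start = time.time()
    for key in sample:
        get(key)
    elapsed = time.time() - start
    print('%d GETs in %.2f s (%.0f ops/s)' % (n, elapsed, n / elapsed))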