Can you have multiple read/writes to SQLite database simultaneously? - python

I am using Python with SQLite currently and wondering if it is safe to have multiple threads reading and writing to the database simultaneously. Does SQLite handle data coming in as a queue or have sort of mechanism that will stop the data from getting corrupt?

This is my issue too. SQLite using some kind of locking mechanism which prevent you doing concurrency operation on a DB. But here is a trick which i use when my db are small. You can select all your tables data into memory and operate on it and then update the original table.
As i said this is just a trick and it does not always solve the problem.
I advise to create your trick.

SQLite has a number of robust locking mechanisms to ensure the data doesn't get corrupted, but the problem with that is if you have a number of threads reading and writing to it simultaneously you'll suffer pretty badly in terms of performance as they all trip over the others. It's not intended to be used this way, even if it does work.
You probably want to look at using a shared database server of some sort if this is your intended usage pattern. They have much better support for concurrent operations.

Related

Bundling reads or caching collections with Pymongo

I'm currently running into an issue in integrating ElasticSearch and MongoDB. Essentially I need to convert a number of Mongo Documents into searchable documents matching my ElasticSearch query. That part is luckily trivial and taken care of. My problem though is that I need this to be fast. Faster than network time, I would really like to be able to index around 100 docs/second, which simply isn't possible with network calls to Mongo.
I was able to speed this up a lot by using ElasticSearch's bulk indexing, but that's only half of the problem. Is there any way to either bundle reads or cache a collection (a manageable part of a collection, as this collection is larger than I would like to keep in memory) to help speed this up? I was unable to really find any documentation about this, so if you can point me towards relevant documentation I consider that a perfectly acceptable answer.
I would prefer a solution that uses Pymongo, but I would be more than happy to use something that directly talks to MongoDB over requests or something similar. Any thoughts on how to alleviate this?
pymongo is thread safe, so you can run multiple queries in parallel. (I assume that you can somehow partition your document space.)
Feed the results to a local Queue if processing the result needs to happen in a single thread.

Creating an in-memory cache that persists between executions

I'm developing a Python command line utility that potentially involves rather large queries against a set of files. It's a reasonably finite list of queries (think indexed DB columns) To improve performance in-process I can generated sorted/structured lists, maps and trees once, and hit those repeatedly, rather than hit the file system each time.
However, these caches are lost when the process ends, and need to be rebuilt every time the script runs, which dramatically increases the runtime of my program. I'd like to identify the best way to share this data between multiple executions of my command, which may be concurrent, one after another, or with significant delays between executions.
Requirements:
Must be fast - any sort of per-execution processing should be minimized, this includes disk IO and object construction.
Must be OS agnostic (or at least be able to hook into similar underlying behaviors on Unix/Windows, which is more likely).
Must allow reasonably complex querying / filtering - I don't think a key/value map will be good enough
Does not need to be up-to-date - (briefly) stale data is perfectly fine, this is just a cache, the actual data is being written to disk separately.
Can't use a heavyweight daemon process, like MySQL or MemCached - I want to minimize installation costs, and asking each user to install these services is too much.
Preferences:
I'd like to avoid any sort long running daemon process at all, if possible.
While I'd like to be able to update the cache quickly, rebuilding the whole cache on update isn't the end of the world, fast reads are much more important than fast writes.
In my ideal fantasy world, I'd be able to directly keep Python objects around between executions, sort of like Java threads (like Tomcat requests) sharing singleton data store objects, but I realize that may not be possible. The closer I can get to that though, the better.
Candidates:
SQLite in memory
SQLite on it's own doesn't seem fast enough for my use case, since it's backed by disk and therefore will have to read from the file on every execution. Perhaps this isn't as bad as it seems, but it seems necessary to persistently store the database in memory. SQLite allows for DBs to use memory as storage but these DBs are destroyed upon program exit, and cannot be shared between instances.
Flat file database loaded into memory with mmap
On the opposite end of the spectrum, I could write the caches to disk, then load them into memory with mmap, can share the same memory space between separate executions. It's not clear to me what happens to the mmap if all processes exit however. It's ok if the mmap is eventually flushed from memory, but I'd want it to stick around for a little bit (30 seconds? a few minutes?) so a user can run commands one after another, and the cache can be reused. This example seems to imply that there needs to be an open mmap handle, but I haven't found any exact description of when memory mapped files get dropped from memory and need to be reloaded from disk.
I think I could implement this, if mmap objects do stick around after exit, but it feels very low level, and I imagine someone's already got a more elegant solution implemented. I'd hate to start building this only to realize I've been rebuilding SQLite. On the other hand, it feels like it would be very fast, and I could make optimizations given my specific use case.
Share Python objects between processes using Processing
The Processing package indicates "Objects can be shared between processes using ... shared memory". Looking through the rest of the docs, I didn't see any further mention of this behavior, but that sounds very promising. Can anyone direct me to more information?
Store data on a RAM disk
My concern here is OS-specific capabilities, but I could create a RAM disk and then simply read/write to it as I please (SQLite?). The fs.memoryfs package seems like a promising alternative to work with multiple OSs, but the comments imply a fair number of limitations.
I know pickle is an efficient way to store Python objects, so it might have speed advantages over any sort of manual data storage. Can I hook pickle into any of the above options? Would that be better than flat files or SQLite?
I know there's a lot of questions related to this, but I did a fair bit of digging and couldn't find anything directly addressing my question with regards to multiple command line executions.
I fully admit, I may be way overthinking this. I'm just trying to get a feel for my options, and if they're worthwhile or not.
Thank you so much for your help!
I would just do the simplest thing that might possibly work. ...which in your case would likely just be to dump to a pickle file. If you find it's not fast enough, try something more involved (like memcached or SQLite). Donald Knuth says "Premature optimization is the root of all evil"!

Pros and cons of using sqlite3 vs custom table implementation

I noticed that a significant part of my (pure Python) code deals with tables. Of course, I have class Table which supports the basic functionality, but I end up adding more and more features to it, such as queries, validation, sorting, indexing, etc.
I to wonder if it's a good idea to remove my class Table, and refactor the code to use a regular relational database that I will instantiate in-memory.
Here's my thinking so far:
Performance of queries and indexing would improve but communication between Python code and the separate database process might be less efficient than between Python functions. I assume that is too much overhead, so I would have to go with sqlite which comes with Python and lives in the same process. I hope this means it's a pure performance gain (at the cost of non-standard SQL definition and limited features of sqlite).
With SQL, I will get a lot more powerful features than I would ever want to code myself. Seems like a clear advantage (even with sqlite).
I won't need to debug my own implementation of tables, but debugging mistakes in SQL are hard since I can't put breakpoints or easily print out interim state. I don't know how to judge the overall impact of my code reliability and debugging time.
The code will be easier to read, since instead of calling my own custom methods I would write SQL (everyone who needs to maintain this code knows SQL). However, the Python code to deal with database might be uglier and more complex than the code that uses pure Python class Table. Again, I don't know which is better on balance.
Any corrections to the above, or anything else I should think about?
SQLite does not run in a separate process. So you don't actually have any extra overhead from IPC. But IPC overhead isn't that big, anyway, especially over e.g., UNIX sockets. If you need multiple writers (more than one process/thread writing to the database simultaneously), the locking overhead is probably worse, and MySQL or PostgreSQL would perform better, especially if running on the same machine. The basic SQL supported by all three of these databases is the same, so benchmarking isn't that painful.
You generally don't have to do the same type of debugging on SQL statements as you do on your own implementation. SQLite works, and is fairly well debugged already. It is very unlikely that you'll ever have to debug "OK, that row exists, why doesn't the database find it?" and track down a bug in index updating. Debugging SQL is completely different than procedural code, and really only ever happens for pretty complicated queries.
As for debugging your code, you can fairly easily centralize your SQL calls and add tracing to log the queries you are running, the results you get back, etc. The Python SQLite interface may already have this (not sure, I normally use Perl). It'll probably be easiest to just make your existing Table class a wrapper around SQLite.
I would strongly recommend not reinventing the wheel. SQLite will have far fewer bugs, and save you a bunch of time. (You may also want to look into Firefox's fairly recent switch to using SQLite to store history, etc., I think they got some pretty significant speedups from doing so.)
Also, SQLite's well-optimized C implementation is probably quite a bit faster than any pure Python implementation.
You could try to make a sqlite wrapper with the same interface as your class Table, so that you keep your code clean and you get the sqlite performences.
If you're doing database work, use a database, if your not, then don't. Using tables, it sound's like you are. I'd recommend using an ORM to make it more pythonic. SQLAlchemy is the most flexible (though it's not strictly just an ORM).

Comparing persistent storage solutions in python

I'm starting on a new scientific project which has a lot of data (millions of entries) I'd like to store in an easily and quickly accessible format. I've come across a number of different potential options, but I'm not sure how to pick amongst them. My data can probably just be stored as a dictionary, or potentially a dictionary of dictionaries. Some potential considerations:
Speed. I can't load all the data off disk every time I start a new script, and I'd like as quick access to random entries as possible.
Ease-of-use. This is python. The storage should feel like python.
Stability/maturity. I'd like something that's currently supported, although something that works well but is still in development would be fine.
Ease of installation. My sysadmin should be able to get this running on our cluster.
I don't really care that much about the size of the storage, but it could be a consideration if an option is really terrible on this front. Also, if it matters, I'll most likely be creating the database once, and thereafter only reading from it.
Some potential options that I've started looking at (see this post):
pyTables
ZopeDB
shove
shelve
redis
durus
Any suggestions on which of these might be better for my purposes? Any better ideas? Some of these have a back-end; any suggestions on which file-system back-end would be best?
Might want to give mongodb a shot - the PyMongo library works with dictionaries and supports most Python types. Easy to install, very performant + scalable. MongoDB (and PyMongo) is also used in production at some big names.
A RDBMS.
Nothing is more realiable than using tables on a well known RDBMS. Postgresql comes to mind.
That automatically gives you some choices for the future like clustering. Also you automatically have a lot of tools to administer your database, and you can use it from other software written in virtually any language.
It is really fast.
In the "feel like python" point, I might add that you can use an ORM. A strong name is sqlalchemy. Maybe with the elixir "extension".
Using sqlalchemy you can leave your user/sysadmin choose which database backend he wants to use. Maybe they already have MySql installed - no problem.
RDBMSs are still the best choice for data storage.
I'm working on such a project and I'm using SQLite.
SQLite stores everything in one file and is part of Python's standard library. Hence, installation and configuration is virtually for free (ease of installation).
You can easily manage the database file with small Python scripts or via various tools. There is also a Firefox plugin (ease of installation / ease-of-use).
I find it very convenient to use SQL to filter/sort/manipulate/... the data. Although, I'm not an SQL expert. (ease-of-use)
I'm not sure if SQLite is the fastes DB system for this work and it lacks some features you might need e.g. stored procedures.
Anyway, SQLite works for me.
if you really just need dictionary-like storage, some of the new key/value or column stores like Cassandra or MongoDB might provide a lot more speed than you'd get with a relational database. Of course if you decide to go with RDBMS, SQLAlchemy is the way to go (disclaimer: I am its creator), but your desired featurelist seems to lean in the direction of "I just want a dictionary that feels like Python" - if you aren't interested in relational queries or strong ACIDity, those facets of RDBMS will probably feel cumbersome.
Sqlite -- it comes with python, fast, widely availible and easy to maintain
If you only need simple (dict like) access mechanisms and need efficiency for processing a lot of data, then HDF5 might be a good option. If you are going to be using numpy then it is really worth considering.
Go with a RDBMS is reliable scalable and fast.
If you need a more scalabre solution and don't need the features of RDBMS, you can go with a key-value store like couchdb that has a good python api.
The NEMO collaboration (building a cosmic neutrino detector underwater) had much of the same problems, and they used mysql and postgresql without major problems.
It really depends on what you're trying to do. An RDBMS is designed for relational data, so if your data is relational, then use one of the various SQL options. But it sounds like your data is more oriented towards a key-value store with very fast random GET operations. If that's the case, compare the benchmarks of the various key-stores, focusing on the GET speed. The ideal key-value store will keep or cache requests in memory, and be able to handle many GET requests concurrently. You may actually want to create your own benchmark suite so you can effectively compare random concurrent GET operations.
Why do you need a cluster? Is the size of each value very large? If not, you shouldn't need a cluster to handle storage of a million entries. But if you're storing large blobs of data, that matters, and you may need something easily supports read slaves and/or transparent partitioning. Some of the key-value stores are document oriented and/or optimized for storing larger values. Redis is technically more storage efficient for larger values due to the indexing overhead required for fast GETs, but that doesn't necessarily mean it's slower. In fact, the extra indexing makes lookups faster.
You're the only one that can truly answer this question, and I strongly recommend putting together a custom benchmark suite to test available options with actual usage scenarios. The data you get from that will give you more insight than anything else.

Using CSV as a mutable database?

Yes, this is as stupid a situation as it sounds like. Due to some extremely annoying hosting restrictions and unresponsive tech support, I have to use a CSV file as a database.
While I can use MySQL with PHP, I can't use it with the Python backend of my program because of install issues with the host. I can't use SQLite with PHP because of more install issues, but can use it as it's a Python builtin.
Anyways, now, the question: is it possible to update values SQL-style in a CSV database? Or should I keep on calling the help desk?
Don't walk, run to get a new host immediately. If your host won't even get you the most basic of free databases, it's time for a change. There are many fish in the sea.
At the very least I'd recommend an xml data store rather than a csv. My blog uses an xml data provider and I haven't had any issues with performance at all.
Take a look at this: http://www.netpromi.com/kirbybase_python.html
Keep calling on the help desk.
While you can use a CSV as a database, it's generally a bad idea. You would have to implement you own locking, searching, updating, and be very careful with how you write it out to make sure that it isn't erased in case of a power outage or other abnormal shutdown. There will be no transactions, no query language unless you write your own, etc.
I couldn't imagine this ever being a good idea. The current mess I've inherited writes vital billing information to CSV and updates it after projects are complete. It runs horribly and thousands of dollars are missed a month. For the current restrictions that you have, I'd consider finding better hosting.
You can probably used sqlite3 for more real database. It's hard to imagine hosting that won't allow you to install it as a python module.
Don't even think of using CSV, your data will be corrupted and lost faster than you say "s#&t"
"Anyways, now, the question: is it possible to update values SQL-style in a CSV database?"
Technically, it's possible. However, it can be hard.
If both PHP and Python are writing the file, you'll need to use OS-level locking to assure that they don't overwrite each other. Each part of your system will have to lock the file, rewrite it from scratch with all the updates, and unlock the file.
This means that PHP and Python must load the entire file into memory before rewriting it.
There are a couple of ways to handle the OS locking.
Use the same file and actually use some OS lock module. Both processes have the file open at all times.
Write to a temp file and do a rename. This means each program must open and read the file for each transaction. Very safe and reliable. A little slow.
Or.
You can rearchitect it so that only Python writes the file. The front-end reads the file when it changes, and drops off little transaction files to create a work queue for Python. In this case, you don't have multiple writers -- you have one reader and one writer -- and life is much, much simpler.
I'd keep calling help desk. You don't want to use CSV for data if it's relational at all. It's going to be nightmare.
I agree. Tell them that 5 random strangers agree that you being forced into a corner to use CSV is absurd and unacceptable.
If I understand you correctly: you need to access the same database from both python and php, and you're screwed because you can only use mysql from php, and only sqlite from python?
Could you further explain this? Maybe you could use xml-rpc or plain http requests with xml/json/... to get the php program to communicate with the python program (or the other way around?), so that only one of them directly accesses the db.
If this is not the case, I'm not really sure what the problem.
It's technically possible. For example, Perl has DBD::CSV that provides a driver that runs SQL queries on the CSV file.
That being said, why not run off a SQLite database on your server?
What about postgresql? I've found that quite nice to work with, and python supports it well.
But I really would look for another provider unless it's really not an option.
Disagreeing with the noble colleagues, I often use DBD::CSV from Perl. There are good reasons to do it. Foremost is data update made simple using a spreadsheet. As a bonus, since I am using SQL queries, the application can be easily upgraded to a real database engine. Bear in mind these were extremely small database in a single user application.
So rephrasing the question: Is there a python module equivalent to Perl's DBD:CSV

Categories

Resources