Writing in SQLite multiple Threads in Python - python

I've got a sqlite3 database and I want to write in it from multiple threads. I've got multiple ideas but I'm not sure which I should implement.
create multiple connection, detect and waif if the DB is locked
use one connection and try to make use of Serialized connections (which don't seem to be implemented in python)
have a background process with a single connection, which collects the queries from all threads and then executes them on their behalft
forget about SQlite and use something like Postgresql
What are the advances of these different approaches and which is most likely to be fruitful? Are there any other possibilities?

Try to use https://pypi.python.org/pypi/sqlitedict
A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.
But take into account "Concurrent requests are still serialized internally, so this "multithreaded support" doesn't give you any performance benefits. It is a work-around for sqlite limitations in Python."
PostgreSQL, MySQL, etc. give you the better performance for several connections in one time

I used method 1 before. It is the easiest in coding. Since that project has a small website, each query take only several milliseconds. All the users requests can be processed promptly.
I also used method 3 before. Because when the query take longer time, it is better to queue the queries since frequent "detect and wait" makes no sense here. And would require a classic consumer-producer model. It would require more time to code.
But if the query is really heavy and frequent. I suggest look to other db like MS SQL/MySQL.

Related

Persistant MySQL connection in Python for social media harvesting

I am using Python to stream large amounts of Twitter data into a MySQL database. I anticipate my job running over a period of several weeks. I have code that interacts with the twitter API and gives me an iterator that yields lists, each list corresponding to a database row. What I need is a means of maintaining a persistent database connection for several weeks. Right now I find myself having to restart my script repeatedly when my connection is lost, sometimes as a result of MySQL being restarted.
Does it make the most sense to use the mysqldb library, catch exceptions and reconnect when necessary? Or is there an already made solution as part of sqlalchemy or another package? Any ideas appreciated!
I think the right answer is to try and handle the connection errors; it sounds like you'd only be pulling in a much a larger library just for this feature, while trying and catching is probably how it's done, whatever level of the stack it's at. If necessary, you could multithread these things since they're probably IO-bound (i.e. suitable for Python GIL threading as opposed to multiprocessing) and decouple the production and the consumption with a queue, too, which would maybe take some of the load off of the database connection.

GAE: Aggregate work results from tasks? (for a GAE query performance issue)

Which of those options is best to span out work on GAE (to be completed within a reuest timeframe)?
Use of tasks, store the results in memcache, periodically query memcache in the request and hope the tasks complete in time
Use of urlfetch to get results of tasks, error handling and security will be a pain though.
Use of backend instances? (seems insane)
Or a JAVA instance (seems totally insane)
Background:
It´s ridiculous to even have to do this. I need to deliver 10k datastore items as a JSON. Apparently the issue is that Python takes a lot of time to process the datastore results (Java seems much faster). This is well covered:
25796142, 11509368 and
21941954
Approach:
As there is nothing to optimize on the Software side (can´t re-write GAE), the approach would be to span work out over multiple instances and to aggregate the results.
Querying keys only and getting query cursors for chunks of 2k items performs reasonably well and there tasks could be spun off to get the results in 2k chunks. The question is about how to best aggregate the results.
It is not "ridiculous" to have to do this: it is an accepted consequence of the scalability offered by GAE. If you don't like the tradeoffs made to enable that scalability, you should choose another platform.
It's also unclear why you think using backend instances is "insane". Using Java would indeed be strange, but only because there's no reason to think it would perform any better.
However, there is a perfectly good way to do this which does not involve any of the hacks you mention, and that is to use the mapreduce framework, which is expressly made for collecting large quantities of data.

Sqlite3 and Python: Handling a locked database

I sometimes run python scripts that access the same database concurrently. This often causes database lock errors. I would like the script to then retry ASAP as the database is never locked for long.
Is there a better way to do this than with a try except inside a while loop and does that method have any problems?
Increase the timeout parameter in your calls to the connect function:
db = sqlite3.connect("myfile.db", timeout = 20)
If you are looking for concurrency, SQlite is not the answer. The engine doesn't perform good when concurrency is needed, especially when writing from different threads, even if the tables are not the same.
If your scripts are accessing different tables, and they have no relationships at DB level (i.e. declared FK's), you can separate them in different databases and then your concurrency issue will be solved.
If they are linked, but you can link them in the app level (script), you can separate them as well.
The best practice in those cases is implementing a lock mechanism with events, but honestly I have no idea how to implement such in phyton.

ubuntu django run managements much faster( i tried renice by setting -18 priority to python process pid)

I am using ubuntu. I have some management commands which when run, does lots of database manipulations, so it takes nearly 15min.
My system monitor shows that my system has 4 cpu's and 6GB RAM. But, this process is not utilising all the cpu's . I think it is using only one of the cpus and that too very less ram. I think, if I am able to make it to use all the cpu's and most of the ram, then the process will be completed in very less time.
I tried renice , by settings priority to -18 (means very high) but still the speed is less.
Details:
its a python script with loop count of nearly 10,000 and that too nearly ten such loops. In every loop, it saves to postgres database.
If you are looking to make this application run across multiple cpu's then there are a number of things you can try depending on your setup.
The most obvious thing that comes to mind is making the application make use of threads and multiprocesses. This will allow the application to "do more" at once. Obviously the issue you might have here is concurrent database access so you might need to use transactions (at which point you might loose the advantage of using multiprocesses in the first place).
Secondly, make sure you are not opening and closing lots of database connections, ensure your application can hold the connection open for as long as it needs.
thirdly, Ensure the database is correctly indexed. If you are doing searches on large strings then things are going to be slow.
Fourthly, Do everything you can in SQL leaving little manipulation to python, sql is horrendously quick at doing data manipulation if you let it. As soon as you start taking data out of the database and into code then things are going to slow down big time.
Fifthly, make use of stored procedures which can be cached and optimized internally within the database. These can be a lot quicker than application built queries which cannot be optimized as easily.
Sixthly, dont save on each iteration of a program. Try to produce a batch styled job whereby you alter a number of records then save all of those in one batch job. This will reduce the amount of IO on each iteration and speed up the process massivly.
Django does support the use of a bulk update method, there was also a question on stackoverflow a while back about saving multiple django objects at once.
Saving many Django objects with one big INSERT statement
Django: save multiple object signal once
Just in case, did you run the command renice -20 -p {pid} instead of renice --20 -p {pid}? In the first case it will be given the lowest priority.

SQLite3 and Multiprocessing

I noticed that sqlite3 isn´t really capable nor reliable when i use it inside a multiprocessing enviroment. Each process tries to write some data into the same database, so that a connection is used by multiple threads. I tried it with the check_same_thread=False option, but the number of insertions is pretty random: Sometimes it includes everything, sometimes not. Should I parallel-process only parts of the function (fetching data from the web), stack their outputs into a list and put them into the table all together or is there a reliable way to handle multi-connections with sqlite?
First of all, there's a difference between multiprocessing (multiple processes) and multithreading (multiple threads within one process).
It seems that you're talking about multithreading here. There are a couple of caveats that you should be aware of when using SQLite in a multithreaded environment. The SQLite documentation mentions the following:
Do not use the same database connection at the same time in more than
one thread.
On some operating systems, a database connection should
always be used in the same thread in which it was originally created.
See here for a more detailed information: Is SQLite thread-safe?
I've actually just been working on something very similar:
multiple processes (for me a processing pool of 4 to 32 workers)
each process worker does some stuff that includes getting information
from the web (a call to the Alchemy API for mine)
each process opens its own sqlite3 connection, all to a single file, and each
process adds one entry before getting the next task off the stack
At first I thought I was seeing the same issue as you, then I traced it to overlapping and conflicting issues with retrieving the information from the web. Since I was right there I did some torture testing on sqlite and multiprocessing and found I could run MANY process workers, all connecting and adding to the same sqlite file without coordination and it was rock solid when I was just putting in test data.
So now I'm looking at your phrase "(fetching data from the web)" - perhaps you could try replacing that data fetching with some dummy data to ensure that it is really the sqlite3 connection causing you problems. At least in my tested case (running right now in another window) I found that multiple processes were able to all add through their own connection without issues but your description exactly matches the problem I'm having when two processes step on each other while going for the web API (very odd error actually) and sometimes don't get the expected data, which of course leaves an empty slot in the database. My eventual solution was to detect this failure within each worker and retry the web API call when it happened (could have been more elegant, but this was for a personal hack).
My apologies if this doesn't apply to your case, without code it's hard to know what you're facing, but the description makes me wonder if you might widen your considerations.
sqlitedict: A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.
If I had to build a system like the one you describe, using SQLITE, then I would start by writing an async server (using the asynchat module) to handle all of the SQLITE database access, and then I would write the other processes to use that server. When there is only one process accessing the db file directly, it can enforce a strict sequence of queries so that there is no danger of two processes stepping on each others toes. It is also faster than continually opening and closing the db.
In fact, I would also try to avoid maintaining sessions, in other words, I would try to write all the other processes so that every database transaction is independent. At minimum this would mean allowing a transaction to contain a list of SQL statements, not just one, and it might even require some if then capability so that you could SELECT a record, check that a field is equal to X, and only then, UPDATE that field. If your existing app is closing the database after every transaction, then you don't need to worry about sessions.
You might be able to use something like nosqlite http://code.google.com/p/nosqlite/

Categories

Resources