Python SQLite concurrent access

I have a scenario where a third-party application scans a folder and triggers my Python script/generated EXE once for each file in the folder (yes, that many separate processes).
My script/application writes the path of the file to a local SQLite database, calls the next application and exits.
My script/application takes care that it calls only one instance of the next application, but nothing can be done about the third-party application that calls my script.
The ISSUE
Sometimes more than 1000 instances of my script/application can be launched at the same time, resulting in almost 1000 concurrent connections to the local SQLite database.
Due to the limited number of concurrent connections possible with SQLite, some of the processes get a "database is locked" exception. This results in some of the file names NOT being written to the database.
We came up with a workaround for this: we write to the database in an infinite loop, and on encountering the exception we make the thread sleep for, say, 50 milliseconds and try again until the write succeeds. I know this is not a clean approach.
Is there a better way of doing this? How do I handle 1000, maybe 10,000 or even more, concurrent connections and still have every script succeed?

Your workaround is correct, but you can make the database do most of the work by setting a busy timeout. (For 1000 connections, this needs to be set extremely high, essentially infinite, like you're already doing.)
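For example (a minimal sketch using Python's built-in sqlite3 module; the file name, table and inserted path are made up), the timeout argument of sqlite3.connect() is exactly this busy timeout:

    import sqlite3

    # The timeout argument is sqlite3's built-in busy handler: a locked write
    # is retried internally for up to this many seconds before raising
    # "database is locked". For ~1000 writers it has to be very large.
    conn = sqlite3.connect("files.db", timeout=600)   # illustrative path and value
    with conn:                                        # commits on success
        conn.execute("CREATE TABLE IF NOT EXISTS files (path TEXT)")
        conn.execute("INSERT INTO files (path) VALUES (?)", ("/some/file",))
    conn.close()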
But this still results in random wait times.
SQLite does not wait until one writer's transaction has finished and then signal the next, because there is no portable API for this. On Windows, however, you could use a named mutex object (which requires some trickery to access from Python).
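A rough sketch of that mutex idea via ctypes (Windows only; the mutex name and write_path_to_database are placeholders for your own code):

    import ctypes

    kernel32 = ctypes.windll.kernel32
    kernel32.CreateMutexW.restype = ctypes.c_void_p
    kernel32.WaitForSingleObject.argtypes = [ctypes.c_void_p, ctypes.c_uint32]
    kernel32.ReleaseMutex.argtypes = [ctypes.c_void_p]
    kernel32.CloseHandle.argtypes = [ctypes.c_void_p]

    MUTEX_NAME = "Global\\my_sqlite_writer_lock"    # every instance must use the same name
    INFINITE = 0xFFFFFFFF

    handle = kernel32.CreateMutexW(None, False, MUTEX_NAME)
    kernel32.WaitForSingleObject(handle, INFINITE)  # queue up behind the other writers
    try:
        write_path_to_database()                    # placeholder for the existing insert
    finally:
        kernel32.ReleaseMutex(handle)
        kernel32.CloseHandle(handle)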

postgres database: When does a job get killed

I am using a Postgres database with SQLAlchemy and Flask. I have a couple of jobs that have to run through the entire database to update entries. When I do this on my local machine I get very different behavior compared to the server.
For example, there seems to be an upper limit on how many entries I can get from the database:
on my local machine I just query all elements, while on the server I have to query 2000 entries at a time.
If I have too many entries, the server gives me the message 'Killed'.
I would like to know
1. Who is killing my jobs (SQLAlchemy, Postgres)?
2. Since this behaves differently on my local machine, there must be a way to control this. Where would that be?
thanks
carl
Just the message "killed" appearing in the terminal window usually means the kernel was running out of memory and killed the process as an emergency measure.
Most libraries that connect to PostgreSQL will read the entire result set into memory by default. But some libraries have a way to tell them to process the results row by row, so they aren't all read into memory at once. I don't know whether Flask has this option or not.
Perhaps your local machine has more available RAM than the server does (or fewer demands on the RAM it does have), or perhaps your local machine is configured to read from the database row by row rather than all at once.
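If it is the all-at-once read, something along these lines is worth trying with SQLAlchemy (a sketch only; Entry, session and process are placeholders for your own model, session and per-row work):

    # Ask the driver for a server-side cursor and keep only a window of rows
    # (here ~1000 ORM objects) in memory instead of the whole result set.
    query = (
        session.query(Entry)
        .execution_options(stream_results=True)
        .yield_per(1000)
    )
    for entry in query:
        process(entry)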
Most likely the kernel is killing your Python script. Python can have horrible memory usage.
I have a feeling you are trying to do these 2000-entry batches in a loop in one Python process. Python does not release all used memory, so the memory usage grows until the process gets killed. (You can watch this with the top command.)
You should try adapting your script to process 2000 records in a step and then quit. If you run it multiple times, it should continue where it left off. Or, a better option, try using multiprocessing and run each job in a separate worker. Run the jobs serially and let them die when they finish; this way they will release the memory back to the OS when they exit.
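A sketch of that multiprocessing variant (the batch size, row count and the body of process_batch are placeholders):

    from multiprocessing import Pool

    BATCH_SIZE = 2000
    TOTAL_ROWS = 100_000      # placeholder: however many rows the job covers

    def process_batch(offset):
        # Runs in its own worker process, so its memory goes back to the OS
        # when that process exits. Placeholder for the real work: open a fresh
        # DB session here and update rows [offset, offset + BATCH_SIZE).
        print(f"processed rows {offset}..{offset + BATCH_SIZE}")

    if __name__ == "__main__":
        # maxtasksperchild=1 gives every batch a fresh worker, so no process
        # lives long enough to accumulate memory.
        with Pool(processes=1, maxtasksperchild=1) as pool:
            pool.map(process_batch, range(0, TOTAL_ROWS, BATCH_SIZE))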

Persistent MySQL connection in Python for social media harvesting

I am using Python to stream large amounts of Twitter data into a MySQL database. I anticipate my job running over a period of several weeks. I have code that interacts with the twitter API and gives me an iterator that yields lists, each list corresponding to a database row. What I need is a means of maintaining a persistent database connection for several weeks. Right now I find myself having to restart my script repeatedly when my connection is lost, sometimes as a result of MySQL being restarted.
Does it make the most sense to use the mysqldb library, catch exceptions and reconnect when necessary? Or is there an already made solution as part of sqlalchemy or another package? Any ideas appreciated!
I think the right answer is to try to handle the connection errors; it sounds like you'd be pulling in a much larger library just for this one feature, while try/except and reconnect is probably how it's done anyway, at whatever level of the stack it lives. If necessary, you could multithread these things since they're probably IO-bound (i.e. suitable for Python GIL threading, as opposed to multiprocessing) and decouple the production and the consumption with a queue, too, which would maybe take some of the load off the database connection.
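A sketch of the catch-and-reconnect idea with MySQLdb (the credentials, table, columns and retry/back-off values are placeholders):

    import time
    import MySQLdb

    def get_connection():
        # Placeholder credentials.
        return MySQLdb.connect(host="localhost", user="user", passwd="pw", db="tweets")

    conn = get_connection()

    def insert_row(row):
        # Insert one row, rebuilding the connection if MySQL dropped it.
        global conn
        for attempt in range(5):
            try:
                cur = conn.cursor()
                cur.execute("INSERT INTO tweets (user, text) VALUES (%s, %s)", row)
                conn.commit()
                return
            except MySQLdb.OperationalError:
                time.sleep(2 ** attempt)   # back off, then reconnect and retry
                conn = get_connection()
        raise RuntimeError("could not reach MySQL after several retries")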

Ubuntu: run Django management commands much faster (I tried renice, setting -18 priority on the Python process PID)

I am using Ubuntu. I have some management commands which, when run, do lots of database manipulation, so they take nearly 15 minutes.
My system monitor shows that my system has 4 CPUs and 6 GB RAM. But this process is not utilising all the CPUs: I think it is using only one of them, and very little RAM. I think that if I can make it use all the CPUs and most of the RAM, the process will complete in much less time.
I tried renice, setting the priority to -18 (which means very high), but the speed is still low.
Details:
It's a Python script with about ten loops of nearly 10,000 iterations each, and every iteration saves to the Postgres database.
If you are looking to make this application run across multiple CPUs then there are a number of things you can try, depending on your setup.
The most obvious thing that comes to mind is making the application use threads and multiple processes. This will allow the application to "do more" at once. The obvious issue here is concurrent database access, so you might need to use transactions (at which point you might lose the advantage of using multiple processes in the first place).
Secondly, make sure you are not opening and closing lots of database connections; ensure your application can hold the connection open for as long as it needs it.
Thirdly, ensure the database is correctly indexed. If you are doing searches on large strings then things are going to be slow.
Fourthly, do everything you can in SQL, leaving as little manipulation to Python as possible; SQL is extremely quick at data manipulation if you let it be. As soon as you start taking data out of the database and into code, things slow down big time.
Fifthly, make use of stored procedures, which can be cached and optimized internally by the database. These can be a lot quicker than application-built queries, which cannot be optimized as easily.
Sixthly, don't save on each iteration of the program. Try to produce a batch-style job whereby you alter a number of records and then save all of them in one batch. This will reduce the amount of IO on each iteration and speed up the process massively.
Django does support the use of a bulk update method, there was also a question on stackoverflow a while back about saving multiple django objects at once.
Saving many Django objects with one big INSERT statement
Django: save multiple object signal once
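For reference, the batch idea looks roughly like this with Django's bulk_create() (Record, compute_value and the batch size are placeholders for the real management command):

    batch = []
    for i in range(10_000):
        batch.append(Record(value=compute_value(i)))   # build objects in memory only
        if len(batch) >= 500:
            Record.objects.bulk_create(batch)          # one INSERT for ~500 rows
            batch = []
    if batch:
        Record.objects.bulk_create(batch)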
Just in case, did you run the command renice -20 -p {pid} instead of renice --20 -p {pid}? In the first case it will be given the lowest priority.

SQLite3 and Multiprocessing

I noticed that sqlite3 isn't really capable nor reliable when I use it inside a multiprocessing environment. Each process tries to write some data into the same database, so that a connection is used by multiple threads. I tried it with the check_same_thread=False option, but the number of insertions is pretty random: sometimes it includes everything, sometimes not. Should I parallel-process only parts of the function (fetching data from the web), stack their outputs into a list and put them into the table all together, or is there a reliable way to handle multiple connections with sqlite?
First of all, there's a difference between multiprocessing (multiple processes) and multithreading (multiple threads within one process).
It seems that you're talking about multithreading here. There are a couple of caveats that you should be aware of when using SQLite in a multithreaded environment. The SQLite documentation mentions the following:
Do not use the same database connection at the same time in more than one thread.
On some operating systems, a database connection should always be used in the same thread in which it was originally created.
See here for more detailed information: Is SQLite thread-safe?
I've actually just been working on something very similar:
multiple processes (for me a processing pool of 4 to 32 workers)
each process worker does some stuff that includes getting information from the web (a call to the Alchemy API for mine)
each process opens its own sqlite3 connection, all to a single file, and each process adds one entry before getting the next task off the stack
At first I thought I was seeing the same issue as you, then I traced it to overlapping and conflicting issues with retrieving the information from the web. Since I was right there, I did some torture testing on sqlite and multiprocessing and found I could run MANY process workers, all connecting and adding to the same sqlite file without coordination, and it was rock solid when I was just putting in test data.
So now I'm looking at your phrase "(fetching data from the web)": perhaps you could try replacing that data fetching with some dummy data to ensure that it is really the sqlite3 connection causing your problems. At least in my tested case (running right now in another window) I found that multiple processes were able to add rows through their own connections without issues, but your description exactly matches the problem I'm having when two processes step on each other while going for the web API (a very odd error, actually) and sometimes don't get the expected data, which of course leaves an empty slot in the database. My eventual solution was to detect this failure within each worker and retry the web API call when it happened (it could have been more elegant, but this was for a personal hack).
My apologies if this doesn't apply to your case, without code it's hard to know what you're facing, but the description makes me wonder if you might widen your considerations.
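For what it's worth, the torture test mentioned above boils down to something like this (a sketch; the file name, table, worker count and dummy payload are made up), and with a per-process connection plus a generous timeout it held up fine:

    import sqlite3
    from multiprocessing import Pool

    DB = "shared.db"   # placeholder path

    def worker(task_id):
        # Each worker opens its OWN connection; the timeout rides out
        # "database is locked" while another process is writing.
        conn = sqlite3.connect(DB, timeout=30)
        with conn:
            conn.execute("INSERT INTO results (task_id, payload) VALUES (?, ?)",
                         (task_id, f"dummy data {task_id}"))
        conn.close()

    if __name__ == "__main__":
        conn = sqlite3.connect(DB)
        with conn:
            conn.execute("CREATE TABLE IF NOT EXISTS results (task_id INTEGER, payload TEXT)")
        conn.close()
        with Pool(8) as pool:
            pool.map(worker, range(1000))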
sqlitedict: A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.
If I had to build a system like the one you describe, using SQLite, then I would start by writing an async server (using the asynchat module) to handle all of the SQLite database access, and then I would write the other processes to use that server. When there is only one process accessing the db file directly, it can enforce a strict sequence of queries so that there is no danger of two processes stepping on each other's toes. It is also faster than continually opening and closing the db.
In fact, I would also try to avoid maintaining sessions; in other words, I would try to write all the other processes so that every database transaction is independent. At minimum this would mean allowing a transaction to contain a list of SQL statements, not just one, and it might even require some if/then capability so that you could SELECT a record, check that a field is equal to X, and only then UPDATE that field. If your existing app is closing the database after every transaction, then you don't need to worry about sessions.
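A much-simplified sketch of that single-owner idea, using the standard library's socketserver instead of asynchat for brevity (the file name, port and line protocol are all invented): clients send one JSON object per line, {"sql": ..., "params": [...]}, and get the resulting rows back as JSON.

    import json
    import sqlite3
    import threading
    import socketserver

    DB_PATH = "shared.db"              # placeholder
    db_lock = threading.Lock()         # serialises all access to the one connection
    db = sqlite3.connect(DB_PATH, check_same_thread=False)

    class Handler(socketserver.StreamRequestHandler):
        def handle(self):
            for line in self.rfile:
                request = json.loads(line)
                with db_lock, db:      # one statement at a time, committed on success
                    rows = db.execute(request["sql"], request.get("params", [])).fetchall()
                self.wfile.write((json.dumps(rows) + "\n").encode())

    if __name__ == "__main__":
        socketserver.ThreadingTCPServer(("127.0.0.1", 9999), Handler).serve_forever()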
You might be able to use something like nosqlite http://code.google.com/p/nosqlite/

A daemon to call a function every 2 minutes with start and stop capabilities

I am working on a django web application.
A function 'xyz' (it updates a variable) needs to be called every 2 minutes.
I want one HTTP request to start the daemon, which keeps calling xyz every 2 minutes, until I send another HTTP request to stop it.
Appreciate your ideas.
Thanks
Vishal Rana
There are a number of ways to achieve this. Assuming the correct server resources, I would write a Python script that calls function xyz "outside" of your Django directory (although importing the necessary stuff) and that only runs if /var/run/django-stuff/my-daemon.run exists. Get cron to run this every two minutes.
Then, for your Django functions, your start function creates the above-mentioned file if it doesn't already exist and the stop function destroys it.
As I say, there are other ways to achieve this. You could have a Python script looping and waiting approximately 2 minutes, etc. In either case, you're up against the fact that two Python scripts running in two different invocations of CPython (no idea whether this is the case with mod_wsgi) cannot communicate through common variables, so IPC between Python scripts is not simple and you need some sort of formal IPC (semaphores, files, etc.) rather than just shared variables (which won't work).
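A sketch of this cron-plus-flag-file approach (the crontab line, paths and the xyz import are placeholders):

    # call_xyz.py -- run by cron every two minutes, e.g.:
    #   */2 * * * * /usr/bin/python /path/to/call_xyz.py
    # It only does any work while the flag file exists.
    import os

    FLAG = "/var/run/django-stuff/my-daemon.run"

    if os.path.exists(FLAG):
        from myapp.tasks import xyz    # placeholder import for the real function
        xyz()

    # views.py -- the start/stop HTTP endpoints just create and remove the flag.
    from django.http import HttpResponse

    def start(request):
        open(FLAG, "w").close()
        return HttpResponse("started")

    def stop(request):
        if os.path.exists(FLAG):
            os.remove(FLAG)
        return HttpResponse("stopped")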
Probably a little hacked but you could try this:
Set up a crontab entry that runs a script every two minutes. This script will check for some sort of flag (file existence, contents of a file, etc.) on disk to decide whether to run a given Python module. The problem with this is that it could take up to 1:59 to run the function for the first time after it is started.
I think if you started a daemon in the view function it would keep the httpd worker process alive, as well as the connection, unless you figure out how to send a connection close without terminating the Django view function. This could be very bad if you want to be able to do this in parallel for different users. Also, to kill the function this way, you would have to somehow know which python and/or httpd process you want to kill later so you don't kill all of them.
The real way to do it would be to code an actual daemon in whatever language and just make a system call to "/etc/init.d/daemon_name start" and "... stop" in the Django views. For this, you need to make sure your web server user has permission to execute the daemon.
If the easy solutions (a loop in a script, a crontab signalled by a temp file) are too fragile for your intended usage, you could use Twisted's facilities for process handling, scheduling and networking. Your Django app (using a Twisted client) would simply communicate via TCP (locally) with the Twisted server.
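A sketch of what that might look like (the port, control protocol and xyz itself are invented): a LoopingCall fires xyz every two minutes, and a tiny line-based TCP control port lets the Django views send "start" or "stop".

    from twisted.internet import reactor, task
    from twisted.internet.protocol import Factory
    from twisted.protocols.basic import LineOnlyReceiver

    def xyz():
        print("updating the variable")      # placeholder for the real work

    loop = task.LoopingCall(xyz)

    class Control(LineOnlyReceiver):
        delimiter = b"\n"

        def lineReceived(self, line):
            command = line.strip()
            if command == b"start" and not loop.running:
                loop.start(120)             # call xyz now and then every 120 seconds
            elif command == b"stop" and loop.running:
                loop.stop()

    if __name__ == "__main__":
        reactor.listenTCP(9998, Factory.forProtocol(Control), interface="127.0.0.1")
        reactor.run()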
