Python threading or multiprocessing questions with sqlite3 and matplotlib

I have a python script that I'd like to run using two processes or threads. I am limited to two because I am connecting to an api/link which only has two licenses. I grab a license by importing their module and instantiating their class. Here are my issues:
I need to write to a sqlite3 database. I tried sharing one db connection, passing it to each worker and having the worker create its own cursor, but I get stuck with a "database is locked" message, and no matter how long I keep retrying, the lock doesn't clear. My program spends about 5 minutes loading data from a model, then about a minute processing data and inserting it into the db. At the end, before moving to the next model, it does a commit(). I think I can live with just creating two separate databases, though.
After it writes to the database, I use matplotlib to create some plots and images and save them to files with unique names. I kept getting "QApplication was not created in the main() thread" and "Xlib: unexpected async reply". I figure that switching from threading to multiprocessing may help with this.
I want to make sure only two threads or processes are running at once. What is the best way to accomplish this? With threading, I was doing the following:
import threading
from time import sleep

c1 = load_lib_get_license()
c2 = load_lib_get_license()
prc_list = [...]  # list of models to process
t1 = threading.Thread()  # placeholders so is_alive() works on the first pass
t2 = threading.Thread()
while len(prc_list) > 0:
    if not t1.is_alive():
        t1 = threading.Thread(target=worker, args=(c1, db_connection, prc_list.pop(0)))
        t1.start()
    if not t2.is_alive():
        t2 = threading.Thread(target=worker, args=(c2, db_connection, prc_list.pop(0)))
        t2.start()
    while t1.is_alive() and t2.is_alive():
        sleep(1)

Queue is probably what you're looking for; the link in this previous answer might also help:
Sharing data between threads in Python
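For example, here is a rough sketch of that Queue-based approach, reusing the question's load_lib_get_license(), worker(), and prc_list names; the two database file names are made up, one per thread, so the workers never share a connection:

import queue          # Queue on Python 2
import sqlite3
import threading

def run_worker(license_obj, db_path, model_queue):
    conn = sqlite3.connect(db_path)      # one connection per thread
    while True:
        model = model_queue.get()
        if model is None:                # sentinel: no more work
            model_queue.task_done()
            break
        worker(license_obj, conn, model) # the question's worker()
        model_queue.task_done()
    conn.close()

model_queue = queue.Queue()
for model in prc_list:
    model_queue.put(model)

threads = []
for license_obj, db_path in [(load_lib_get_license(), 'results_1.db'),
                             (load_lib_get_license(), 'results_2.db')]:
    t = threading.Thread(target=run_worker,
                         args=(license_obj, db_path, model_queue))
    t.start()
    threads.append(t)

model_queue.join()                       # wait until every model is processed
for t in threads:
    model_queue.put(None)                # one sentinel per worker
for t in threads:
    t.join()

On the plotting side, switching matplotlib to the non-interactive Agg backend (matplotlib.use('Agg') before importing pyplot) usually avoids the QApplication/Xlib errors when figures are only being saved to files.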

Related

concurrent sqlite writes in python

I've got a python application (Gtk) which uses threads to fetch information from certain sites and writes it to the database.
I've got a thread that checks for new updates at site1; if there are updates, I receive a json object (json1).
I then iterate through json1 and insert the new information into the database; within json1 there is a result I need to use to fetch more information from site2. I receive a json object (json2) from site2 as well.
So the situation is something like this
def get_more_info(name):
    json2 = get(www.site2.com?=name....)
    etc

for information in json1:
    db.insert(information)
    get_more_info(information.name)
From this situation I see that there are a couple of ways of doing this.
Have get_more_info return the json object, so that:

for information in json1:
    db.insert(information)
    json2 = get_more_info(information.name)
    for info in json2:
        db.insert(info)
db.commit()
Have get_more_info do the inserting itself:

for information in json1:
    db.insert(information)
    get_more_info(information.name)
db.commit()
Both of these approaches seem a bit slow, since the main for loop has to wait for get_more_info to complete before carrying on, and both json1 and json2 could be large. There is also the possibility that site2 is unavailable at that moment, causing the whole transaction to fail. The application can still function without json2; that data can be fetched at a later time if needed.
So I was thinking of passing information.name to a queue so that the main loop can continue, and kicking off a thread that monitors that queue and executes get_more_info. Is this the right approach to take?
I know that sqlite does not perform concurrent writes. If I recall correctly, if get_more_info tries to write while the main for loop is busy, sqlite will raise OperationalError: database is locked.
Now what happens to get_more_info at that point, does it get put into some type of write queue or does it wait for the main loop to complete? And what happens to the main for loop when get_more_info is busy writing?
Will there be a need to go to another database engine?
Since you are already using threads, you can use another thread to write to the database. To feed it with data, you should use a globally accessible Queue.Queue() (queue.Queue() in Python 3) instance. Using the instance's get() method with block=True will make the writer thread wait for data to write.
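A rough sketch of that dedicated writer thread, assuming a module-level queue named write_queue; the table layout and the STOP sentinel are illustrative, not from the question:

import queue           # Queue.Queue on Python 2
import sqlite3
import threading

write_queue = queue.Queue()
STOP = object()        # sentinel telling the writer thread to shut down

def db_writer(db_path):
    # The only thread that ever touches sqlite, so writes never collide.
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE IF NOT EXISTS info (name TEXT, value TEXT)')
    while True:
        item = write_queue.get()        # block=True by default: waits for data
        if item is STOP:
            break
        conn.execute('INSERT INTO info VALUES (?, ?)', item)
        conn.commit()
    conn.close()

writer = threading.Thread(target=db_writer, args=('app.db',))
writer.start()

The update threads (and get_more_info) then just hand rows to the writer with write_queue.put((information.name, some_value)), and put STOP on the queue when everything is finished.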

MPI locking for sqlite (python)

I am using mpi4py for a project I want to parallelize. Below is very basic pseudo code for my program:
Load list of data from sqlite database
Based on COMM.Rank and Comm.Size, select chunk of data to process
Process data...
use MPI.Gather to pass all of the results back to root
if root:
    iterate through results and save to sqlite database
I would like to eliminate the call to MPI.Gather by simply having each process write its own results to the database. So I want my pseudo code to look like this:
Load list of data
Select chunk of data
Process data
Save results
This would drastically improve my program's performance. However, I am not entirely sure how to accomplish this. I have tried to find methods through google, but the only thing I could find is MPI-IO. Is it possible to use MPI-IO to write to a database? Specifically using python, sqlite, and mpi4py. If not, are there any alternatives for writing concurrently to a sqlite database?
EDIT:
As #CL pointed out in a comment, sqlite3 does not support concurrent writes to the database. So let me ask my question a little differently: Is there a way to lock writes to the database so that other processes wait till the lock is removed before writing? I know sqlite3 has its own locking modes, but these modes seem to cause insertions to fail rather than block. I know I've seen something like this in Python threading, but I haven't been able to find anything online about doing this with MPI.
I would suggest you pass your results back to the root process, and let the root process write them to the SQLite database. The pseudocode would look something like this:
load list of data
if rank == 0:
    for _ in len(data):
        result = receive from any worker
        save result
else:
    select chunk of data
    process data
    send result(s) to rank 0
The advantage over gathering is that rank 0 can save the results as soon as they are ready. There is an mpi4py example that shows how to spread tasks out over multiple workers when there are lots of tasks and the processing time varies widely.
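For reference, a minimal sketch of that receive-and-save pattern with mpi4py might look like the following; load_list_of_data() and process_item() are illustrative stand-ins, and only rank 0 ever opens the sqlite file:

import sqlite3
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

data = load_list_of_data()                    # illustrative: same list on every rank

if rank == 0:
    conn = sqlite3.connect('results.db')      # only rank 0 touches sqlite
    conn.execute('CREATE TABLE IF NOT EXISTS results (item TEXT, value TEXT)')
    for _ in range(size - 1):                 # one message per worker rank
        rows = comm.recv(source=MPI.ANY_SOURCE)
        conn.executemany('INSERT INTO results VALUES (?, ?)', rows)
        conn.commit()                         # saved as soon as each worker reports
    conn.close()
else:
    chunk = data[rank - 1::size - 1]          # this worker's slice of the data
    rows = [process_item(item) for item in chunk]   # illustrative, returns (item, value) pairs
    comm.send(rows, dest=0)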

Asynchronous listening/iteration of pipes in python

I'm crunching a tremendous amount of data and, since I have a 12-core server at my disposal, I've decided to split the work using the multiprocessing library. The way I'm trying to do this is by having a single parent process that dishes out work evenly to multiple worker processes, and another that acts as a collector/funnel for all the completed work, to be moderately processed for final output. Having done something similar to this before, I'm using Pipes because they are crazy fast in contrast to managed queues.
Sending data out to the workers using the pipes is working fine. However, I'm stuck on efficiently collecting the data from the workers. In theory, the work being handed out will be processed at the same pace and they will all get done at the same time. In practice, this never happens. So, I need to be able to iterate over each pipe to do something, but if there's nothing there, I need it to move on to the next pipe and check if anything is available for processing. As mentioned, it's on a 12 core machine, so I'll have 10 workers funneling down to one collection process.
The workers use the following to read from their pipe (called WorkerRadio)
for Message in iter(WorkerRadio.recv, 'QUIT'):
    # Crunch Numbers & perform tasks here...
    CollectorRadio.send(WorkData)
WorkerRadio.send('Quitting')
So, they sit there looking at the pipe until something comes in. As soon as they get something they start doing their thing. Then fire it off to the data collection process. If they get a quit command, they acknowledge and shut down peacefully.
As for the collector, I was hoping to do something similar but instead of just 1 pipe (radio) there would be 10 of them. The collector needs to check all 10, and do something with the data that comes in. My first try was doing something like the workers...
i = 0
for Message in iter(CollectorRadio[i].recv, 'QUIT'):
    # Crunch Numbers & perform tasks here...
    if i < NumOfRadios:
        i += 1
    else:
        i = 0
CollectorRadio.send('Quitting')
That didn't cut it, and I tried a couple of other ways of manipulating it without success too. I either end up with syntax errors, or, like the above, I get stuck on the first radio because it never changes for some reason. I looked into having all the workers talk over a single pipe, but the Python docs explicitly state that "data in a pipe may become corrupted if two processes (or threads) try to read from or write to the same end of the pipe at the same time."
As I mentioned, I'm also worried about some processes going slower than the others and holding up progress. If at all possible, I would like something that doesn't wait around for data to show up (i.e. check and move on if nothing's there).
Any help on this would be greatly appreciated. I've seen some use of managed queues that might allow this to work; but, from my testing, managed queues are significantly slower than pipes, and I can use as much performance on this as I can muster.
SOLUTION:
Based on pajton's post here's what I did to make it work...
import select

# create list of pipes (labeled as radios)
TheRadioList = [CollectorRadio[i] for i in range(NumberOfRadios)]

while True:
    # check for data on the pipes/radios
    TheTransmission, Junk1, Junk2 = select.select(TheRadioList, [], [])
    # find out who sent the data (which pipe/radio)
    for TheSender in TheTransmission:
        # read the data from the pipe
        TheMessage = TheSender.recv()
        # crunch numbers & perform tasks here...
If you are using standard system pipes, then you can use the select system call to query which descriptors have data available. By default, select will block until at least one of the passed descriptors is ready:
read_pipes = [pipe_fd0, pipe_fd1, ...]

while True:
    read_fds, write_fds, exc_fds = select.select(read_pipes, [], [])
    for read_fd in read_fds:
        # read from read_fd pipe descriptor
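For a self-contained sketch of the same idea with multiprocessing.Pipe: on Unix, Connection objects can be passed straight to select.select() because they expose fileno(); the worker logic below is illustrative only:

import select
from multiprocessing import Pipe, Process

def worker(conn, worker_id):
    for n in range(3):                        # stand-in for real number crunching
        conn.send((worker_id, n * n))
    conn.send('QUIT')                         # tell the collector we are done
    conn.close()

if __name__ == '__main__':
    radios = []
    for wid in range(10):
        parent_end, child_end = Pipe()
        Process(target=worker, args=(child_end, wid)).start()
        radios.append(parent_end)

    while radios:
        ready, _, _ = select.select(radios, [], [])   # block until any pipe has data
        for radio in ready:
            message = radio.recv()
            if message == 'QUIT':
                radios.remove(radio)          # stop watching a finished worker
            else:
                print('got', message)

On Python 3.3+, multiprocessing.connection.wait() offers a similar, more portable way to wait on several Connection objects at once.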

identifying a processor core or worker id parallel python

I am running processes in parallel but need to create a database for each CPU process to write to. I only want as many databases as there are CPUs assigned on each server, so that the 100 jobs are written to 3 databases that can be merged afterwards.
Is there a worker id number or core id that I can use to identify each worker?
def workerProcess(job):
    db_path = 'c:\\temp\\db\\' + workerid   # workerid is what I'm asking about
    if workerDBexist(db_path):
        pass   # process job into this database
    else:
        makeDB(db_path)
        # first time this 'worker/core' is used: make DB, then process

import pp

ppservers = ()
ncpus = 3
job_server = pp.Server(ncpus, ppservers=ppservers)

for work in WorkItems:   # the 100 work items
    job_server.submit(workerProcess, (work,))
As far as I know, pp doesn't have any such feature in its API.
If you used the stdlib modules instead, that would make your life a lot easier—e.g., multiprocessing.Pool takes an initializer argument, which you could use to initialize a database for each process, which would then be available as a variable that each task could use.
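For illustration, a sketch of that stdlib route; the worker_process body, table layout, and the c:\temp\db path are placeholders modeled on the question:

import os
import sqlite3
from multiprocessing import Pool

worker_db = None                       # per-process global set by the initializer

def init_worker(db_dir):
    # Runs once in each of the 3 worker processes; assumes db_dir already exists.
    global worker_db
    path = os.path.join(db_dir, 'worker_{}.db'.format(os.getpid()))
    worker_db = sqlite3.connect(path)
    worker_db.execute('CREATE TABLE IF NOT EXISTS results (job TEXT, value TEXT)')

def worker_process(job):
    # Process the job into this process's own database (illustrative work).
    worker_db.execute('INSERT INTO results VALUES (?, ?)', (str(job), 'done'))
    worker_db.commit()

if __name__ == '__main__':
    pool = Pool(processes=3, initializer=init_worker, initargs=(r'c:\temp\db',))
    pool.map(worker_process, range(100))
    pool.close()
    pool.join()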
However, there is a relatively easy workaround.
Each process has a unique (at least while it's running) process ID.* In Python, you can access the process ID of the current process with os.getpid(). So, in each task, you can do something like this:
dbname = 'database{}'.format(os.getpid())
Then use dbname to open/create the database. I don't know whether by "database" you mean a dbm file, a sqlite3 file, a database on a MySQL server, or what. You may need to, e.g., create a tempfile.TemporaryDirectory in the parent, pass it to all of the children, and have them os.path.join it to the dbname (so after all the children are done, you can grab everything in os.listdir(the_temp_dir)).
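A sketch of that idea, using tempfile.mkdtemp() for simplicity so that a plain path can be handed to the workers; merge_into_main() is a hypothetical merge step:

import os
import sqlite3
import tempfile

the_temp_dir = tempfile.mkdtemp()            # created in the parent, passed to workers

def worker_task(work, temp_dir):
    dbname = os.path.join(temp_dir, 'database{}'.format(os.getpid()))
    conn = sqlite3.connect(dbname)           # re-opens the same file on later tasks
    conn.execute('CREATE TABLE IF NOT EXISTS results (item TEXT)')
    conn.execute('INSERT INTO results VALUES (?)', (str(work),))
    conn.commit()
    conn.close()

# ...after all the children have finished:
for name in os.listdir(the_temp_dir):
    merge_into_main(os.path.join(the_temp_dir, name))   # hypothetical merge step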
The problem with this is that if pp.Server restarts one of the processes, you'll end up with 4 databases instead of 3. Probably not a huge deal, but your code should deal with that possibility. (IIRC, pp.Server usually doesn't restart the processes unless you pass restart=True, but it may do so if, e.g., one of them crashes.)
But what if (as seems to be the case) you're actually running each task in a brand-new process, rather than using a pool of 3 processes? Well, then you're going to end up with as many databases as there are processes, which probably isn't what you want. Your real problem here is that you're not using a pool of 3 processes, which is what you ought to fix. But are there other ways you could get what you want? Maybe.
For example, let's say you created three locks, one for each database, maybe as lockfiles. Then, each task could do this pseudocode:
for i, lockfile in enumerate(lockfiles):
    try:
        with lockfile:
            do stuff with databases[i]
            break
    except AlreadyLockedError:
        pass
else:
    assert False, "oops, couldn't get any of the locks"
If you can actually lock the databases themselves (with flock, or with some API for the relevant database, etc.), things are even easier: just try to connect to them in turn until one of them succeeds.
As long as your code isn't actually segfaulting or the like,** if you're actually never running more than 3 tasks at a time, there's no way all 3 lockfiles could be locked, so you're guaranteed to get one.
* This isn't quite true, but it's true enough for your purposes. For example, on Windows, each process has a unique HANDLE, and if you ask for its pid one will be generated if it didn't already have one. And on some *nixes, each thread has a unique thread ID, and the process's pid is the thread ID of the first thread. And so on. But as far as your code can tell, each of your processes has a unique pid, which is what matters.
** Even if your code is crashing, you can deal with that; it's just more complicated. For example, use pidfiles instead of empty lockfiles. Get a read lock on the pidfile, then try to upgrade to a write lock. If it fails, read the pid from the file and check whether any such process exists (e.g., on *nix, if os.kill(pid, 0) raises, there is no such process), and if so, forcibly break the lock. Either way, now you've got a write lock, so write your pid to the file.
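For the flock variant mentioned above, a Unix-only sketch might look like this; the database and lockfile names are illustrative, and each task simply grabs the first lock it can:

import fcntl
import sqlite3

LOCKFILES = ['db0.lock', 'db1.lock', 'db2.lock']
DATABASES = ['db0.sqlite', 'db1.sqlite', 'db2.sqlite']

def run_task(work):
    for lockfile, dbpath in zip(LOCKFILES, DATABASES):
        fd = open(lockfile, 'w')
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)   # fails fast if held
        except OSError:                  # IOError on Python 2
            fd.close()
            continue                     # another task owns this database
        try:
            conn = sqlite3.connect(dbpath)
            conn.execute('CREATE TABLE IF NOT EXISTS results (item TEXT)')
            conn.execute('INSERT INTO results VALUES (?)', (str(work),))
            conn.commit()
            conn.close()
            return
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
            fd.close()
    raise RuntimeError("oops, couldn't get any of the locks")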

Schema migration on GAE datastore

First off, this is my first post on Stack Overflow, so please forgive any newbish mis-steps. If I can be clearer in terms of how I frame my question, please let me know.
I'm running a large application on Google App Engine, and have been adding new features that are forcing me to modify old data classes and add new ones. In order to clean our database and update old entries, I've been trying to write a script that can iterate through instances of a class, make changes, and then re-save them. The problem is that Google App Engine times out when you make calls to the server that take longer than a few seconds.
I've been struggling with this problem for several weeks. The best solution that I've found is here: http://code.google.com/p/rietveld/source/browse/trunk/update_entities.py?spec=svn427&r=427
I created a version of that code for my own website, which you can see here:
def schema_migration(self, target, batch_size=1000):
    last_key = None
    calls = {"Affiliate": Affiliate, "IPN": IPN, "Mail": Mail, "Payment": Payment, "Promotion": Promotion}
    while True:
        q = calls[target].all()
        if last_key:
            q.filter('__key__ >', last_key)
        q.order('__key__')
        this_batch_size = batch_size
        while True:
            try:
                batch = q.fetch(this_batch_size)
                break
            except (db.Timeout, DeadlineExceededError):
                logging.warn("Query timed out, retrying")
                if this_batch_size == 1:
                    logging.critical("Unable to update entities, aborting")
                    return
                this_batch_size //= 2
        if not batch:
            break
        keys = None
        while not keys:
            try:
                keys = db.put(batch)
            except db.Timeout:
                logging.warn("Put timed out, retrying")
        last_key = keys[-1]
        print "Updated %d records" % (len(keys),)
Strangely, the code works perfectly for classes with between 100 and 1,000 instances, and the script often takes around 10 seconds. But when I try to run it for classes in our database with more like 100K instances, the script runs for 30 seconds, and then I receive this:
"Error: Server Error
The server encountered an error and could not complete your request.
If the problem persists, please report your problem and mention this error message and the query that caused it."
Any idea why GAE is timing out after exactly thirty seconds? What can I do to get around this problem?
Thank you!
Keller
You are hitting the second DeadlineExceededError, by the sound of it. App Engine requests can only run for 30 seconds each. When DeadlineExceededError is raised, it's your job to stop processing and tidy up, as you are running out of time; the next time it is raised, you cannot catch it.
You should look at using the Mapper API to split your migration into batches and run each batch using the Task Queue.
The start of your solution will be to migrate to using GAE's Task Queues. This feature will allow you to queue some more work to happen at a later time.
That won't actually solve the problem immediately, because even task queues are limited to short timeslices. However, you can unroll your loop to process a handful of rows in your database at a time. After completing each batch, the task can check how long it has been running, and if it's been long enough, it can start a new task in the queue to continue where the current task leaves off.
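A rough sketch of that batching pattern using query cursors and the deferred library (this assumes the deferred builtin is enabled in app.yaml, reuses the model names from the question, and uses update_entity() as a stand-in for whatever per-entity change you need):

from google.appengine.ext import db, deferred

BATCH_SIZE = 100   # small enough to finish well inside one request deadline

def migrate_batch(target, cursor=None):
    calls = {"Affiliate": Affiliate, "IPN": IPN, "Mail": Mail,
             "Payment": Payment, "Promotion": Promotion}
    q = calls[target].all()
    if cursor:
        q.with_cursor(cursor)
    batch = q.fetch(BATCH_SIZE)
    if not batch:
        return                               # migration finished
    for entity in batch:
        update_entity(entity)                # illustrative per-entity change
    db.put(batch)
    # Re-enqueue ourselves to continue from where this batch stopped.
    deferred.defer(migrate_batch, target, q.cursor())

# Kick the whole thing off from a request handler:
# deferred.defer(migrate_batch, "Affiliate")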
An alternative solution is to not migrate the data. Change the implementing logic so that each entity knows whether or not it has been migrated. Newly created entities, or old entities that get updated, will take the new format. Since GAE doesn't require that entities have all the same fields, you can do this easily, where on a relational database, that wouldn't be practical.
