MPI locking for sqlite (python)

I am using mpi4py for a project I want to parallelize. Below is very basic pseudo code for my program:
Load list of data from sqlite database
Based on comm.rank and comm.size, select chunk of data to process
Process data...
Use MPI.Gather to pass all of the results back to root
if root:
    iterate through results and save to sqlite database
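In rough mpi4py terms the current version looks something like this, where load_rows, process, and the results table are simplified placeholders for my real code:
from mpi4py import MPI
import sqlite3

comm = MPI.COMM_WORLD
rank, size = comm.rank, comm.size

def load_rows():
    # placeholder for "load list of data from sqlite database"
    return list(range(100))

def process(row):
    # placeholder for the real per-row work
    return (row, row * row)

rows = load_rows()
chunk = rows[rank::size]                  # this rank's slice
results = [process(row) for row in chunk]

# everything is shipped back to rank 0 in one collective call
gathered = comm.gather(results, root=0)

if rank == 0:
    conn = sqlite3.connect("results.db")
    conn.execute("CREATE TABLE IF NOT EXISTS results (row INTEGER, value INTEGER)")
    for worker_results in gathered:
        conn.executemany("INSERT INTO results VALUES (?, ?)", worker_results)
    conn.commit()
    conn.close()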
I would like to eliminate the call to MPI.Gather by simply having each process write its own results to the database. So I want my pseudo code to look like this:
Load list of data
Select chunk of data
Process data
Save results
This would drastically improve my program's performance. However, I am not entirely sure how to accomplish this. I have tried to find methods through google, but the only thing I could find is MPI-IO. Is it possible to use MPI-IO to write to a database? Specifically using python, sqlite, and mpi4py. If not, are there any alternatives for writing concurrently to a sqlite database?
EDIT:
As @CL pointed out in a comment, sqlite3 does not support concurrent writes to the database. So let me ask my question a little differently: Is there a way to lock writes to the database so that other processes wait until the lock is removed before writing? I know sqlite3 has its own locking modes, but these modes seem to cause insertions to fail rather than block. I know I've seen something like this in Python threading, but I haven't been able to find anything online about doing this with MPI.

I would suggest you pass your results back to the root process, and let the root process write them to the SQLite database. The pseudocode would look something like this:
load list of data
if rank == 0:
    for _ in range(len(data)):
        result = receive from any worker
        save result
else:
    select chunk of data
    process data
    send result(s) to rank 0
The advantage over gathering is that rank 0 can save the results as soon as they are ready. There is an mpi4py example that shows how to spread tasks out over multiple workers when there are lots of tasks and the processing time varies widely.
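A minimal sketch of that receive-as-they-arrive pattern with mpi4py, where load_list_of_data, process_chunk, and the two-column results table are placeholders for the real pieces; rank 0 only collects and writes in this version:
from mpi4py import MPI
import sqlite3

comm = MPI.COMM_WORLD
rank = comm.rank

def load_list_of_data():
    # placeholder for reading the task list from the sqlite database
    return list(range(100))

def process_chunk(item):
    # placeholder for the real per-item computation
    return (item, item * item)

data = load_list_of_data()          # every rank loads the same list

if rank == 0:
    conn = sqlite3.connect("results.db")
    conn.execute("CREATE TABLE IF NOT EXISTS results (item INTEGER, value INTEGER)")
    for _ in range(len(data)):      # one message per work item
        # accept results from whichever worker finishes next
        result = comm.recv(source=MPI.ANY_SOURCE, tag=0)
        conn.execute("INSERT INTO results VALUES (?, ?)", result)
        conn.commit()
    conn.close()
else:
    workers = comm.size - 1         # ranks 1..size-1 do the processing
    for item in data[rank - 1::workers]:
        comm.send(process_chunk(item), dest=0, tag=0)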

Related

Why do writes through Cassandra Python Driver add records with a delay?

I'm writing 2.5 million records into Cassandra using a python program. The program finishes quickly, but on querying the data, the records are reflected only after a long time. The number of records gradually increases, and it seems like the database is performing the writes to the tables in a queue fashion. The writes continue until all the records are finished. Why do writes reflect late?
It is customary to provide a minimal code example plus steps to replicate the issue but you haven't provided much information.
My guess is that you've issued a lot of asynchronous writes which means that those queries get queued up because that's how asynchronous programming works. Until they eventually reach the cluster and get processed, you won't be able to immediately see the results.
In addition, you haven't provided information on how you're verifying the data so I'm going to make another guess and say you're doing a SELECT COUNT(*) which requires a full table scan in Cassandra. Given that you've issued millions of writes, chances are the nodes are overloaded and take a while to respond.
For what it's worth, if you are doing a COUNT() you might be interested in this post where I've explained why it's bad to do it in Cassandra -- https://community.datastax.com/questions/6897/. Cheers!
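If that guess is right, the usual fix is to hold on to the futures that execute_async returns and wait for them before treating the load as finished. A rough sketch with the DataStax Python driver, where the contact point, keyspace, table, and stand-in rows are all assumptions:
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])                    # assumed contact point
session = cluster.connect('my_keyspace')            # assumed keyspace

insert = session.prepare(
    "INSERT INTO records (id, payload) VALUES (?, ?)")   # assumed table

records = [(i, 'payload-%d' % i) for i in range(1000)]   # stand-in for your rows

# queue the writes asynchronously, but keep the futures around
futures = [session.execute_async(insert, row) for row in records]

# nothing is guaranteed to be on the cluster until the futures resolve
for future in futures:
    future.result()

cluster.shutdown()
In practice you would cap the number of in-flight futures rather than hold millions of them in memory, but the point stands: don't exit the program, or start counting rows, until the futures have resolved.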

Split search between python processes

I currently have a large database (stored as numpy array) that I'd like to perform a search on, however due to the size I'd like to split the database into pieces and perform a search on each piece before combining the results.
I'm looking for a way to host the split database pieces on separate python processes where they will wait for a query, after which they perform the search, and send the results back to the main process.
I've tried a load of different things with the multiprocessing package, but I can't find any way to (a) keep processes alive after loading the database on them, and (b) send more commands to the same process after initialisation.
Been scratching my head about this one for several days now, so any advice would be much appreciated.
EDIT: My problem is analogous to trying to host 'web' apis in the form of python processes, and I want to be able to send and receive requests at will without reloading the database shards every time.
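One way to keep the shard processes alive between queries is to give each one a request queue and have them all report to a shared result queue; here is a rough multiprocessing sketch in which the database, the shard count, and the nearest-row search are all stand-ins for the real data and search logic:
import multiprocessing as mp
import numpy as np

def shard_worker(shard, request_q, result_q):
    # the shard arrives with the process and stays loaded; serve queries until told to stop
    while True:
        query = request_q.get()
        if query is None:                  # sentinel: shut down
            break
        # stand-in search: index of the closest row in this shard
        scores = np.linalg.norm(shard - query, axis=1)
        result_q.put(int(scores.argmin()))

if __name__ == '__main__':
    database = np.random.rand(1000, 8)     # stand-in for the real database
    shards = np.array_split(database, 4)
    request_qs = [mp.Queue() for _ in shards]
    result_q = mp.Queue()
    workers = [mp.Process(target=shard_worker, args=(s, q, result_q))
               for s, q in zip(shards, request_qs)]
    for w in workers:
        w.start()

    query = np.random.rand(8)
    for q in request_qs:                   # send the query to every shard
        q.put(query)
    partial_results = [result_q.get() for _ in workers]   # combine as needed

    for q in request_qs:                   # tell the workers to exit
        q.put(None)
    for w in workers:
        w.join()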

concurrent sqlite writes in python

I've got a python application (Gtk) which uses threads to fetch information from certain sites and writes them to the database.
I've got a thread that checks for new updates at site1; if there are updates, I receive a json object (json1).
I will then iterate through json1 and insert the new information into the database. Within json1 there is a result I need to use to fetch more information at site2. I will receive a json object (json2) at site2 as well.
So the situation is something like this
def get_more_info(name):
    json2 = get(www.site2.com?=name....)
    etc

for information in json1:
    db.insert(information)
    get_more_info(information.name)
From this situation I see that there are a couple of ways of doing this.
get_more_info to return json object so that

for information in json1:
    db.insert(information)
    json2 = get_more_info(information.name)
    for info in json2:
        db.insert(info)
db.commit()
get_more_info to do the inserting

for information in json1:
    db.insert(information)
    get_more_info(information.name)
db.commit()
Both of these ways seem a bit slow, since the main for loop will have to wait for get_more_info to complete before carrying on, and both json1 and json2 could be large. There is also the possibility that site2 is unavailable at that moment, causing the whole transaction to fail. The application can still function without json2; that data can be fetched at a later time if needed.
So I was thinking of passing information.name to a queue so that the main loop can continue, and kicking off a thread that will monitor that queue and execute get_more_info. Is this the right approach to take?
I know that sqlite does not perform concurrent writes. If I recall correctly, if get_more_info tries to write while the main for loop is busy, sqlite will raise OperationalError: database is locked.
Now what happens to get_more_info at that point? Does it get put into some type of write queue, or does it wait for the main loop to complete? And what happens to the main for loop while get_more_info is busy writing?
Will there be a need to go to another database engine?
Since you are already using threads, you can use another thread to write to the database. In order to feed it with the data you should use a globally accessible Queue.Queue() (queue.Queue() in Python 3) instance. Using the instance's get() method with block=True will make the thread wait for data to write.
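A minimal sketch of that writer thread, assuming a hypothetical two-column results table; the main loop and the get_more_info thread then enqueue rows instead of touching the connection themselves:
import queue
import sqlite3
import threading

write_queue = queue.Queue()

def db_writer(db_path):
    # the only thread that ever touches the sqlite connection
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS results (name TEXT, value TEXT)")
    while True:
        row = write_queue.get()        # blocks until something arrives
        if row is None:                # sentinel: stop
            break
        conn.execute("INSERT INTO results VALUES (?, ?)", row)
        conn.commit()
    conn.close()

writer = threading.Thread(target=db_writer, args=("app.db",))
writer.start()

# the main loop and get_more_info just enqueue rows instead of writing:
write_queue.put(("some name", "some value"))

# at shutdown:
write_queue.put(None)
writer.join()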

Asynchronous listening/iteration of pipes in python

I'm crunching a tremendous amount of data and since I have a 12 core server at my disposal, I've decided to split the work by using the multiprocessing library. The way I'm trying to do this is by having a single parent process that dishes out work evenly to multiple worker processes, then another that acts as a collector/funnel of all the completed work to be moderately processed for final output. Having done something similar to this before, I'm using Pipes because they are crazy fast in contrast to managed queues.
Sending data out to the workers using the pipes is working fine. However, I'm stuck on efficiently collecting the data from the workers. In theory, the work being handed out will be processed at the same pace and they will all get done at the same time. In practice, this never happens. So, I need to be able to iterate over each pipe to do something, but if there's nothing there, I need it to move on to the next pipe and check if anything is available for processing. As mentioned, it's on a 12 core machine, so I'll have 10 workers funneling down to one collection process.
The workers use the following to read from their pipe (called WorkerRadio)
for Message in iter(WorkerRadio.recv, 'QUIT'):
    # crunch numbers & perform tasks here...
    CollectorRadio.send(WorkData)
WorkerRadio.send('Quitting')
So, they sit there looking at the pipe until something comes in. As soon as they get something they start doing their thing. Then fire it off to the data collection process. If they get a quit command, they acknowledge and shut down peacefully.
As for the collector, I was hoping to do something similar but instead of just 1 pipe (radio) there would be 10 of them. The collector needs to check all 10, and do something with the data that comes in. My first try was doing something like the workers...
i = 0
for Message in iter(CollectorRadio[i].recv, 'QUIT'):
    # crunch numbers & perform tasks here...
    if i < NumOfRadios:
        i += 1
    else:
        i = 0
CollectorRadio.send('Quitting')
That didn't cut it, and I tried a couple of other ways of manipulating it without success too. I either end up with syntax errors, or, like the above, I get stuck on the first radio because it never changes for some reason. I looked into having all the workers talk into a single pipe, but the Python site explicitly states that "data in a pipe may become corrupted if two processes (or threads) try to read from or write to the same end of the pipe at the same time."
As I mentioned, I'm also worried about some processes going slower than the others and holding up progress. If at all possible, I would like something that doesn't wait around for data to show up (ie. check and move on if nothing's there).
Any help on this would be greatly appreciated. I've seen some use of managed queues that might allow this to work; but, from my testing, managed queues are significantly slower than pipes and I can use as much performance on this as I can muster.
SOLUTION:
Based on pajton's post here's what I did to make it work...
import select

# create list of pipes (labeled as radios)
TheRadioList = [CollectorRadio[i] for i in range(NumberOfRadios)]

while True:
    # check for data on the pipes/radios
    TheTransmission, Junk1, Junk2 = select.select(TheRadioList, [], [])
    # find out who sent the data (which pipe/radio)
    for TheSender in TheTransmission:
        # read the data from the pipe
        TheMessage = TheSender.recv()
        # crunch numbers & perform tasks here...
If you are using standard system pipes, then you can use the select system call to query which descriptors have data available. By default, select will block until at least one of the passed descriptors is ready:
read_pipes = [pipe_fd0, pipe_fd1, ... ]
while True:
    read_fds, write_fds, exc_fds = select.select(read_pipes, [], [])
    for read_fd in read_fds:
        # read from read_fd pipe descriptor
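The same idea is also available as multiprocessing.connection.wait, which works directly on Pipe connection objects on both Unix and Windows. A self-contained sketch modelled on the standard library documentation, with a squaring worker standing in for the real number crunching:
import multiprocessing as mp
from multiprocessing.connection import wait

def worker(conn, items):
    for item in items:
        conn.send(item * item)          # stand-in for the real number crunching
    conn.close()                        # lets the collector detect completion

if __name__ == '__main__':
    pipes = [mp.Pipe(duplex=False) for _ in range(4)]
    workers = [mp.Process(target=worker, args=(w_end, range(i * 10, (i + 1) * 10)))
               for i, (r_end, w_end) in enumerate(pipes)]
    for p in workers:
        p.start()
    for _, w_end in pipes:
        w_end.close()                   # parent keeps only the read ends

    open_ends = [r_end for r_end, _ in pipes]
    while open_ends:
        # blocks until at least one pipe is readable or has been closed
        for conn in wait(open_ends):
            try:
                message = conn.recv()
            except EOFError:
                open_ends.remove(conn)  # that worker closed its end
            else:
                print(message)          # crunch numbers here instead

    for p in workers:
        p.join()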

SQLite3 and Multiprocessing

I noticed that sqlite3 isn't really capable or reliable when I use it inside a multiprocessing environment. Each process tries to write some data into the same database, so that a connection is used by multiple threads. I tried it with the check_same_thread=False option, but the number of insertions is pretty random: sometimes it includes everything, sometimes not. Should I parallel-process only parts of the function (fetching data from the web), stack their outputs into a list and put them into the table all together, or is there a reliable way to handle multi-connections with sqlite?
First of all, there's a difference between multiprocessing (multiple processes) and multithreading (multiple threads within one process).
It seems that you're talking about multithreading here. There are a couple of caveats that you should be aware of when using SQLite in a multithreaded environment. The SQLite documentation mentions the following:
Do not use the same database connection at the same time in more than one thread.
On some operating systems, a database connection should always be used in the same thread in which it was originally created.
See here for more detailed information: Is SQLite thread-safe?
I've actually just been working on something very similar:
multiple processes (for me a processing pool of 4 to 32 workers)
each process worker does some stuff that includes getting information from the web (a call to the Alchemy API for mine)
each process opens its own sqlite3 connection, all to a single file, and each process adds one entry before getting the next task off the stack
At first I thought I was seeing the same issue as you, then I traced it to overlapping and conflicting issues with retrieving the information from the web. Since I was right there I did some torture testing on sqlite and multiprocessing and found I could run MANY process workers, all connecting and adding to the same sqlite file without coordination and it was rock solid when I was just putting in test data.
So now I'm looking at your phrase "(fetching data from the web)": perhaps you could try replacing that data fetching with some dummy data to confirm that it is really the sqlite3 connection causing your problems. At least in my tested case (running right now in another window) multiple processes were able to add rows through their own connections without issue. Your description, though, exactly matches the problem I'm having when two processes step on each other while going for the web API (a very odd error, actually) and sometimes don't get the expected data, which of course leaves an empty slot in the database. My eventual solution was to detect this failure within each worker and retry the web API call when it happened (it could have been more elegant, but this was for a personal hack).
My apologies if this doesn't apply to your case, without code it's hard to know what you're facing, but the description makes me wonder if you might widen your considerations.
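A stripped-down sketch of that per-process-connection setup; the table, the 8-worker pool, the 5 second busy timeout, and do_work are all assumptions, and the timeout is what makes a briefly locked database wait instead of failing immediately:
import multiprocessing as mp
import sqlite3

DB_PATH = "shared.db"       # assumed single database file

def do_work(task):
    # stand-in for the real per-task work (for me, the web API call)
    return str(task * task)

def worker(task):
    # each process opens its own connection to the same file
    conn = sqlite3.connect(DB_PATH, timeout=5.0)   # wait up to 5 s on a lock
    value = do_work(task)
    conn.execute("INSERT INTO results (task, value) VALUES (?, ?)", (task, value))
    conn.commit()
    conn.close()

if __name__ == '__main__':
    conn = sqlite3.connect(DB_PATH)
    conn.execute("CREATE TABLE IF NOT EXISTS results (task INTEGER, value TEXT)")
    conn.commit()
    conn.close()
    with mp.Pool(8) as pool:
        pool.map(worker, range(100))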
sqlitedict: A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.
If I had to build a system like the one you describe, using SQLITE, then I would start by writing an async server (using the asynchat module) to handle all of the SQLITE database access, and then I would write the other processes to use that server. When there is only one process accessing the db file directly, it can enforce a strict sequence of queries so that there is no danger of two processes stepping on each other's toes. It is also faster than continually opening and closing the db.
In fact, I would also try to avoid maintaining sessions; in other words, I would try to write all the other processes so that every database transaction is independent. At minimum this would mean allowing a transaction to contain a list of SQL statements, not just one, and it might even require some if/then capability so that you could SELECT a record, check that a field is equal to X, and only then UPDATE that field. If your existing app is closing the database after every transaction, then you don't need to worry about sessions.
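A bare-bones sketch of that single-writer idea, using a multiprocessing queue in place of an asynchat server just to keep it short (the table and the transaction format are assumptions): client processes send lists of SQL statements, and only the server process ever opens the database file.
import multiprocessing as mp
import sqlite3

def db_server(db_path, request_q):
    # the only process that ever touches the database file
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS results (name TEXT, value INTEGER)")
    conn.commit()
    while True:
        transaction = request_q.get()
        if transaction is None:             # sentinel: shut down
            break
        # a transaction is a list of (sql, params) pairs, applied atomically
        with conn:
            for sql, params in transaction:
                conn.execute(sql, params)
    conn.close()

def client(request_q, name):
    request_q.put([
        ("INSERT INTO results (name, value) VALUES (?, ?)", (name, 42)),
    ])

if __name__ == '__main__':
    q = mp.Queue()
    server = mp.Process(target=db_server, args=("app.db", q))
    server.start()
    clients = [mp.Process(target=client, args=(q, "task-%d" % i)) for i in range(4)]
    for c in clients:
        c.start()
    for c in clients:
        c.join()
    q.put(None)
    server.join()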
You might be able to use something like nosqlite http://code.google.com/p/nosqlite/
