I started a new project quite recently. It is about processing huge amounts of data: the data is read from a file and should be inserted into a database, and at the same time some calculations are done on the data. Therefore I designed a system that starts each of those tasks in its own process. But I need to share the data in real time between those processes, so I designed the system with
multiprocessing.Manager()
and pass all shared variables as arguments to the functions I execute.
Now I'm running into some issues because the database cannot read the variable, since the read task is always occupying it.
I'm stuck right now because I can't find a proper solution to address this issue.
Thanks for all answers in advance
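Edit: to make the setup concrete, here is a minimal sketch of the layout I mean. The names read_task and db_task, the shared list, and the print standing in for the real database insert are all placeholders for my actual code.

    import multiprocessing as mp
    import time

    def read_task(shared_rows, lock):
        # Placeholder: pretend each line of the file becomes one row.
        for i in range(100):
            row = {"id": i, "value": i * 2}
            with lock:                 # hold the lock only while touching the shared list
                shared_rows.append(row)
            time.sleep(0.01)           # simulate reading/parsing the file

    def db_task(shared_rows, lock, done):
        while not (done.is_set() and len(shared_rows) == 0):
            with lock:                 # grab whatever has arrived so far, then release
                batch = list(shared_rows)
                del shared_rows[:]
            if batch:
                print("would INSERT", len(batch), "rows")   # placeholder for the real DB insert
            time.sleep(0.1)

    if __name__ == "__main__":
        with mp.Manager() as manager:
            shared_rows = manager.list()
            lock = manager.Lock()
            done = manager.Event()
            reader = mp.Process(target=read_task, args=(shared_rows, lock))
            writer = mp.Process(target=db_task, args=(shared_rows, lock, done))
            reader.start()
            writer.start()
            reader.join()
            done.set()                 # tell the DB task the file has been fully read
            writer.join()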
Short context:
Our application has a backend written in Python. It contains a couple of REST API endpoints and message queue handling (RabbitMQ and Pika).
The reason we use Python is that it is a data science/AI project, so a lot of the data processing requires some DS knowledge.
Problem:
Some parts of our application have CPU-heavy calculations, and we are using multiprocessing to add parallelization.
However, we need to be careful, because each process starts a new interpreter and imports everything again.
The environment is Windows, so process creation uses "spawn".
Question:
Is there a best practice for maintaining this? The team is big, so there is a chance that someone will add creation of some big object, or a long-running function, that executes at application boot and is then kept in memory while a pool of processes is being created.
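To illustrate the convention I would like the team to follow, here is a minimal sketch (the names and the fake "big object" are made up): anything heavy lives behind a function call and the entry point sits under the __main__ guard, so a spawned worker that re-imports the module only pays for the imports themselves.

    from multiprocessing import Pool

    def load_big_object():
        # Hypothetical heavy initialisation: only ever called explicitly, never at
        # import time, so a freshly spawned worker that re-imports this module
        # does not pay for it.
        return list(range(5_000_000))

    def cpu_heavy(x):
        return x * x

    def main():
        big_object = load_big_object()            # created once, in the parent process only
        with Pool(processes=4) as pool:
            results = pool.map(cpu_heavy, range(100))
        print(len(results), len(big_object))

    if __name__ == "__main__":
        # Required with "spawn" (Windows): without this guard every worker would
        # re-run the module's top-level code when it re-imports it.
        main()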
I have one Python script which generates data and one which trains a neural network with TensorFlow and Keras on this data. Both need an instance of the neural network.
Since I haven't set the "allow growth" flag, each process takes the full GPU memory. Therefore I simply give each process its own GPU. (Maybe not a good solution for people with only one GPU... yet another unsolved problem.)
The actual problem is as follows: both instances need access to the network's weights file. I recently had a bunch of crashes because both processes tried to access the weights at the same time. A flag or something similar should stop each process from accessing the file while the other one is accessing it. Hopefully this doesn't create a bottleneck.
I tried to come up with a solution like semaphores in C, but today I found this post on Stack Exchange.
The renaming idea seems quite simple and effective to me. Is this good practice in my case? I'll just create the weights file with
self.model.save_weights(filepath='weights.h5$$$')
in the learning process, rename them after saving with
os.rename('weights.h5$$$', 'weights.h5')
and load them in my data-generating process with
self.model.load_weights(filepath='weights.h5')
?
Will this renaming overwrite the old file? And what happens if the other process is currently loading it? I would also appreciate other ideas on how I could multithread/multiprocess my script. I just realized that generating data, learning, generating data, ... in a sequential script is not really performant.
EDIT 1: Forgot to mention that the weights are stored in a .h5 file by Keras' save function.
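For reference, this is roughly the combined pattern I have in mind. From what I've read, os.replace is the cross-platform way to overwrite the destination atomically (plain os.rename raises FileExistsError on Windows if weights.h5 already exists); the helper names are my own.

    import os
    import time

    def save_weights_atomically(model, path='weights.h5'):
        # Write to a temporary name first, then swap it in. The reader never sees
        # a half-written file because the swap is a single rename.
        tmp = path + '.tmp'
        model.save_weights(filepath=tmp)
        os.replace(tmp, path)          # overwrites an existing weights.h5 on POSIX and Windows

    def load_weights_with_retry(model, path='weights.h5', attempts=5):
        # On Windows the swap can collide with the open, so retry briefly instead of crashing.
        for _ in range(attempts):
            try:
                model.load_weights(filepath=path)
                return
            except OSError:
                time.sleep(0.1)
        raise RuntimeError('could not load ' + path)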
The multiprocessing module has an RLock class that you can use to regulate access to a shared resource. This also works for files if you remember to acquire the lock before reading or writing and to release it afterwards. Using a lock implies that some of the time one of the processes cannot read or write the file. How much of a problem this is depends on how often both processes have to access the file.
Note that for this to work, one of the scripts has to start the other script as a Process after creating the lock.
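A rough sketch of that arrangement, with plain open/read/write standing in for Keras' save_weights/load_weights:

    import multiprocessing as mp
    import time

    def trainer(lock, path='weights.h5'):
        for step in range(10):
            # ... a training step would happen here ...
            with lock:                            # acquire before writing, release right after
                with open(path, 'wb') as f:       # stands in for model.save_weights(path)
                    f.write(b'weights for step %d' % step)
            time.sleep(0.5)

    def generator(lock, path='weights.h5'):
        for _ in range(10):
            with lock:                            # acquire before reading
                try:
                    with open(path, 'rb') as f:   # stands in for model.load_weights(path)
                        data = f.read()
                except FileNotFoundError:
                    data = b''
            # ... generate data with the current weights ...
            time.sleep(0.5)

    if __name__ == '__main__':
        lock = mp.RLock()                         # create the lock first ...
        p = mp.Process(target=generator, args=(lock,))
        p.start()                                 # ... then start the other script as a Process
        trainer(lock)
        p.join()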
If the weights are a Python data structure, you could put that under control of a multiprocessing.Manager. That will manage access to the objects under its control for you. Note that a Manager is not meant for use with files, just in-memory objects.
Additionally, on UNIX-like operating systems Python has os.lockf to lock (part of) a file. Note that this is an advisory lock only: a cooperating process that calls lockf is told that the file is already locked, but the lock does not actually prevent it from reading the file.
Note:
Files can be read and written. Only when two processes are reading the same file (read/read) does this work well. Every other combination (read/write, write/read, write/write) can and eventually will result in undefined behavior and data corruption.
Note 2:
Another possible solution involves inter-process communication.
Process 1 writes a new h5 file (with a random filename), closes it, and then sends a message (using a Pipe or Queue) to Process 2: "I've written a new parameter file \path\to\file".
Process 2 then reads the file and deletes it. This can work both ways, but it requires that both processes check for and process messages every so often. It prevents file corruption because the writing process only notifies the reading process after it has finished writing the file.
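Sketched with a Queue (the file writes stand in for Keras' save/load, and uuid just provides the random filenames):

    import multiprocessing as mp
    import os
    import uuid

    def writer(queue):
        for step in range(5):
            # ... train, then dump the weights under a throwaway name ...
            path = f'weights-{uuid.uuid4().hex}.h5'
            with open(path, 'wb') as f:              # stands in for model.save_weights(path)
                f.write(b'parameters %d' % step)
            queue.put(path)                          # only announce the file once it is closed
        queue.put(None)                              # sentinel: no more files coming

    def reader(queue):
        while True:
            path = queue.get()
            if path is None:
                break
            with open(path, 'rb') as f:              # stands in for model.load_weights(path)
                data = f.read()
            os.remove(path)                          # done with this snapshot
            print('loaded', len(data), 'bytes from', path)

    if __name__ == '__main__':
        q = mp.Queue()
        p = mp.Process(target=reader, args=(q,))
        p.start()
        writer(q)
        p.join()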
I currently have a large database (stored as a numpy array) that I'd like to perform a search on; however, due to its size I'd like to split the database into pieces and perform a search on each piece before combining the results.
I'm looking for a way to host the split database pieces on separate python processes where they will wait for a query, after which they perform the search, and send the results back to the main process.
I've tried a load of different things with the multiprocessing package, but I can't find any way to (a) keep processes alive after loading the database on them, and (b) send more commands to the same process after initialisation.
Been scratching my head about this one for several days now, so any advice would be much appreciated.
EDIT: My problem is analogous to trying to host 'web' APIs in the form of Python processes, and I want to be able to send and receive requests at will without reloading the database shards every time.
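To make the goal concrete, here is a rough sketch of the shape I'm after; the random shard and the np.isclose "search" are only stand-ins for my real data and query logic.

    import multiprocessing as mp
    import numpy as np

    def shard_worker(shard_id, task_queue, result_queue):
        # Load this shard once; the process then stays alive waiting for queries.
        shard = np.random.rand(1_000_000)            # placeholder for loading the real shard
        while True:
            query = task_queue.get()                 # blocks until the main process sends work
            if query is None:                        # sentinel: shut down
                break
            hits = np.flatnonzero(np.isclose(shard, query, atol=1e-3))   # placeholder "search"
            result_queue.put((shard_id, hits))

    if __name__ == '__main__':
        n_shards = 4
        result_queue = mp.Queue()
        task_queues, workers = [], []
        for i in range(n_shards):
            tq = mp.Queue()
            w = mp.Process(target=shard_worker, args=(i, tq, result_queue))
            w.start()
            task_queues.append(tq)
            workers.append(w)

        # Send the same query to every shard, then combine the partial results.
        for tq in task_queues:
            tq.put(0.5)
        partial = [result_queue.get() for _ in range(n_shards)]
        print(partial)

        # Shut the workers down cleanly.
        for tq in task_queues:
            tq.put(None)
        for w in workers:
            w.join()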
I have a script in Python which has to be reliable, work 24/7 and preferably never fail. It receives data from a third-party service, processes it and then saves it into a DB. It might receive plenty of data at a time, or it might not.
For the sake of reliability, it first saves the data into files, one new file each time data arrives.
I wonder: should I create a new thread each time it receives data, to speed things up? Or just create one new thread in which I'd save all the data into files?
Or new thread(s) for processing the data and/or saving it into the DB?
I'm thinking about this because new data can arrive while the script is still busy processing or saving a previous portion.
How can I know whether or not I really need that?
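For what it's worth, the simplest version I can picture of "one new thread that saves everything" looks roughly like this; on_data_received and the two save_* helpers are placeholders for my real code.

    import json
    import queue
    import threading

    work = queue.Queue()

    def save_to_file(item):           # stand-in for the real file write
        with open('incoming.log', 'a') as f:
            f.write(json.dumps(item) + '\n')

    def save_to_db(item):             # stand-in for the real DB insert
        pass

    def saver():
        # One long-lived worker thread: it drains the queue and does the slow file/DB
        # work, so the code that receives data never blocks on I/O.
        while True:
            item = work.get()
            if item is None:          # sentinel used for shutdown
                break
            save_to_file(item)
            save_to_db(item)
            work.task_done()

    threading.Thread(target=saver, daemon=True).start()

    def on_data_received(data):
        # Called whenever the third-party service delivers data: hand it off and return.
        work.put(data)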
I noticed that sqlite3 isn't really capable or reliable when I use it inside a multiprocessing environment. Each process tries to write some data into the same database, so a connection ends up being used by multiple threads. I tried it with the check_same_thread=False option, but the number of insertions is pretty random: sometimes everything is included, sometimes not. Should I parallel-process only parts of the function (fetching data from the web), stack the outputs into a list and put them into the table all together, or is there a reliable way to handle multiple connections with sqlite?
First of all, there's a difference between multiprocessing (multiple processes) and multithreading (multiple threads within one process).
It seems that you're talking about multithreading here. There are a couple of caveats that you should be aware of when using SQLite in a multithreaded environment. The SQLite documentation mentions the following:
Do not use the same database connection at the same time in more than one thread.
On some operating systems, a database connection should always be used in the same thread in which it was originally created.
See here for more detailed information: Is SQLite thread-safe?
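A minimal illustration of the "one connection per thread" rule from that quote:

    import sqlite3
    import threading

    DB = 'data.db'

    # Create the table once, in the main thread, before the workers start.
    conn = sqlite3.connect(DB)
    conn.execute('CREATE TABLE IF NOT EXISTS items (thread INTEGER, value INTEGER)')
    conn.commit()
    conn.close()

    def worker(n):
        # Each thread opens its own connection instead of sharing one across threads.
        conn = sqlite3.connect(DB, timeout=30)    # timeout: wait for the write lock if busy
        with conn:                                # commits (or rolls back) the transaction
            conn.executemany('INSERT INTO items VALUES (?, ?)',
                             [(n, i) for i in range(100)])
        conn.close()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()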
I've actually just been working on something very similar:
multiple processes (for me, a processing pool of 4 to 32 workers)
each process worker does some stuff that includes getting information from the web (a call to the Alchemy API, in my case)
each process opens its own sqlite3 connection, all to a single file, and each process adds one entry before getting the next task off the stack
At first I thought I was seeing the same issue as you, but then I traced it to overlapping and conflicting issues with retrieving the information from the web. Since I was right there, I did some torture testing on sqlite and multiprocessing and found I could run MANY process workers, all connecting and adding to the same sqlite file without coordination, and it was rock solid when I was just putting in test data.
So now I'm looking at your phrase "(fetching data from the web)". Perhaps you could try replacing that data fetching with some dummy data to make sure it really is the sqlite3 connection causing your problems. At least in my tested case (running right now in another window) I found that multiple processes could all add rows through their own connections without issues. Your description, however, exactly matches the problem I was having when two processes stepped on each other while going for the web API (a very odd error, actually) and sometimes didn't get the expected data, which of course leaves an empty slot in the database. My eventual solution was to detect this failure within each worker and retry the web API call when it happened, roughly as in the sketch below (it could have been more elegant, but this was for a personal hack).
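A stripped-down sketch of that arrangement (fetch_from_web is a stand-in for the real API call, and the random failure fakes the flakiness I was seeing):

    import multiprocessing as mp
    import random
    import sqlite3
    import time

    DB = 'results.db'

    def fetch_from_web(task):
        # Stand-in for the real web/API call; fails randomly like a flaky endpoint.
        if random.random() < 0.3:
            raise IOError('flaky API')
        return f'result for {task}'

    def worker(task):
        # Retry the web call a few times before giving up; this, not sqlite, was my bottleneck.
        for attempt in range(3):
            try:
                data = fetch_from_web(task)
                break
            except IOError:
                time.sleep(1)
        else:
            return                                    # give up on this task
        # Each worker opens its own connection to the same file, adds one row, closes it.
        conn = sqlite3.connect(DB, timeout=30)
        with conn:
            conn.execute('INSERT INTO results (task, data) VALUES (?, ?)', (task, data))
        conn.close()

    if __name__ == '__main__':
        conn = sqlite3.connect(DB)
        conn.execute('CREATE TABLE IF NOT EXISTS results (task INTEGER, data TEXT)')
        conn.commit()
        conn.close()
        with mp.Pool(8) as pool:
            pool.map(worker, range(100))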
My apologies if this doesn't apply to your case; without code it's hard to know what you're facing, but the description makes me wonder whether you might widen your considerations.
sqlitedict: A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.
If I had to build a system like the one you describe using SQLite, I would start by writing an async server (using the asynchat module) to handle all of the SQLite database access, and then I would write the other processes to use that server. With only one process accessing the db file directly, it can enforce a strict sequence of queries so that there is no danger of two processes stepping on each other's toes. It is also faster than continually opening and closing the db.
In fact, I would also try to avoid maintaining sessions; in other words, I would try to write all the other processes so that every database transaction is independent. At a minimum this would mean allowing a transaction to contain a list of SQL statements, not just one, and it might even require some if-then capability so that you could SELECT a record, check that a field is equal to X, and only then UPDATE that field. If your existing app closes the database after every transaction, then you don't need to worry about sessions.
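Stripped of the asynchat details, the single-writer idea can be sketched with a dedicated process and a multiprocessing.Queue (a simplification, not a real async server; each request is an independent transaction given as a list of statements):

    import multiprocessing as mp
    import sqlite3

    def db_server(requests):
        # A single process owns the database: it executes transactions strictly in the
        # order they arrive, so clients can never step on each other's toes.
        conn = sqlite3.connect('app.db')
        conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)')
        conn.commit()
        while True:
            statements = requests.get()            # each request is a list of (sql, params)
            if statements is None:                 # sentinel: shut down
                break
            with conn:                             # one transaction per request
                for sql, params in statements:
                    conn.execute(sql, params)
        conn.close()

    if __name__ == '__main__':
        requests = mp.Queue()
        server = mp.Process(target=db_server, args=(requests,))
        server.start()

        # Any other process just drops a self-contained transaction onto the queue.
        requests.put([('INSERT OR REPLACE INTO kv VALUES (?, ?)', ('answer', '42'))])

        requests.put(None)
        server.join()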
You might be able to use something like nosqlite http://code.google.com/p/nosqlite/