Good practice for parallel tasks in Python

I have one Python script which generates data and one which trains a neural network with TensorFlow and Keras on this data. Both need an instance of the neural network.
Since I haven't set the "allow growth" flag, each process takes the full GPU memory. Therefore I simply give each process its own GPU. (Maybe not a good solution for people with only one GPU... yet another unsolved problem.)
The actual problem is as follows: both instances need access to the network's weights file. I recently had a bunch of crashes because both processes tried to access the weights at the same time. A flag or something similar should stop each process from accessing the file while the other one is using it. Hopefully this doesn't create a bottleneck.
I tried to come up with a solution like semaphores in C, but today I found this post on Stack Exchange.
The renaming idea seems quite simple and effective to me. Is this good practice in my case? In the learning process, I'll just save the weights with
self.model.save_weights(filepath='weights.h5$$$')
rename the file after saving with
os.rename('weights.h5$$$', 'weights.h5')
and load it in my data-generating process with
self.model.load_weights(filepath='weights.h5')
?
Will this renaming overwrite the old file? And what happens if the other process is currently loading? I would appreciate other ideas on how I could multithread / multiprocess my script. I just realized that generating data, learning, generating data, ... in a sequential script is not really performant.
EDIT 1: Forgot to mention that the weights are stored in a .h5 file by Keras' save function.
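For reference, a minimal sketch of that save-then-rename flow (the helper names are made up; os.replace is used because, unlike os.rename on Windows, it overwrites an existing target, and on POSIX the swap is atomic):

import os

WEIGHTS = 'weights.h5'
TMP = WEIGHTS + '$$$'                      # temporary name used while writing

def publish_weights(model):
    # Writer side (learning process): write the temp file, then swap it in.
    model.save_weights(filepath=TMP)
    os.replace(TMP, WEIGHTS)               # atomically replaces any old weights.h5

def refresh_weights(model):
    # Reader side (data-generating process): only ever sees a complete file.
    if os.path.exists(WEIGHTS):
        model.load_weights(filepath=WEIGHTS)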

The multiprocessing module has an RLock class that you can use to regulate access to a shared resource. This also works for files if you remember to acquire the lock before reading or writing and to release it afterwards. Using a lock implies that some of the time one of the processes cannot read or write the file. How much of a problem this is depends on how much both processes have to access the file.
Note that for this to work, one of the scripts has to start the other script as a Process after creating the lock.
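For illustration, a minimal sketch of that setup, assuming one script creates the lock and then starts the other as a Process; the plain open() calls just stand in for Keras' save/load:

from multiprocessing import Process, RLock

WEIGHTS = 'weights.h5'

def generate_data(lock):
    # Reader: hold the lock only while actually touching the file.
    with lock:
        with open(WEIGHTS, 'rb') as f:     # stands in for model.load_weights(...)
            weights = f.read()
    # ... use the weights to generate data ...

def train(lock):
    # Writer: the same lock object protects the write.
    with lock:
        with open(WEIGHTS, 'wb') as f:     # stands in for model.save_weights(...)
            f.write(b'new weights')

if __name__ == '__main__':
    lock = RLock()                         # create the lock first...
    train(lock)                            # make sure a weights file exists
    worker = Process(target=generate_data, args=(lock,))
    worker.start()                         # ...then start the other script as a Process
    train(lock)                            # keep saving under the same lock
    worker.join()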
If the weights are a Python data structure, you could put that under control of a multiprocessing.Manager. That will manage access to the objects under its control for you. Note that a Manager is not meant for use with files, just in-memory objects.
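If the weights could live as an in-memory object instead of a file, a Manager-based sketch might look like this (the dict layout is just an assumption for illustration):

from multiprocessing import Manager, Process

def trainer(shared):
    # Replace the entry in one assignment; the Manager proxies serialize access.
    shared['weights'] = [0.1, 0.2, 0.3]    # e.g. model.get_weights()

def consumer(shared):
    weights = shared.get('weights')        # e.g. feed into model.set_weights(...)
    print('got', weights)

if __name__ == '__main__':
    with Manager() as manager:
        shared = manager.dict()
        p1 = Process(target=trainer, args=(shared,))
        p2 = Process(target=consumer, args=(shared,))
        p1.start(); p1.join()              # sequential here only to keep the demo short
        p2.start(); p2.join()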
Additionally, on UNIX-like operating systems Python has os.lockf to lock (part of) a file. Note that this is an advisory lock only. That is, if another process calls lockf, it is told that the file is already locked, but nothing actually prevents it from reading the file.
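A small sketch of that advisory locking on a UNIX-like system (note that lockf needs a descriptor opened with write permission, and every cooperating process has to opt in):

import os

def read_with_lock(path):
    # Open read/write because lockf requires write permission on the descriptor,
    # even though we only read here.
    fd = os.open(path, os.O_RDWR)
    try:
        os.lockf(fd, os.F_LOCK, 0)         # block until no other lockf holder
        return os.read(fd, os.fstat(fd).st_size)
    finally:
        os.close(fd)                       # closing the descriptor drops the lock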
Note:
Files can be read and written. Only when two processes are reading the same file (read/read) does this work well. Every other combination (read/write, write/read, write/write) can and eventually will result in undefined behavior and data corruption.
Note2:
Another possible solution involves inter process communication.
Process 1 writes a new h5 file (with a random filename), closes it, and then sends a message (using a Pipe or Queue) to Process 2: "I've written a new parameter file \path\to\file".
Process 2 then reads the file and deletes it. This can work both ways but requires that both processes check for and process messages every so often. This prevents file corruption because the writing process only notifies the reading process after it has finished the file.
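A minimal sketch of that notification scheme using a Queue (file names, message format and the plain file writes are placeholders for the real Keras calls):

import os
import uuid
from multiprocessing import Process, Queue

def writer(queue):
    for step in range(3):
        path = 'weights-%s.h5' % uuid.uuid4().hex   # unique name per update
        with open(path, 'wb') as f:                 # stands in for model.save_weights(path)
            f.write(b'weights for step %d' % step)
        queue.put(path)                             # notify only after the file is complete
    queue.put(None)                                 # sentinel: no more files

def reader(queue):
    while True:
        path = queue.get()
        if path is None:
            break
        with open(path, 'rb') as f:                 # stands in for model.load_weights(path)
            data = f.read()
        os.remove(path)                             # done with this snapshot
        print('loaded %d bytes from %s' % (len(data), path))

if __name__ == '__main__':
    q = Queue()
    p = Process(target=reader, args=(q,))
    p.start()
    writer(q)
    p.join()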

Related

Best Practice to share Data between two Python Processes

I started a new project quite recently; it is about processing huge amounts of data. The data is read from a file and should be inserted into a database, and at the same time there are some calculations that are done on the data. Therefore I designed a system that starts each of those tasks in its own process. But I need to share the data in real time between those processes, so I designed the system with
multiprocessing.Manager()
and pass all shared variables as arguments to the functions I execute.
Now I'm running into some issues because the database task cannot read the variable, since the read task is always occupying it.
I'm stuck right now because I can't find a proper solution to address this issue.
Thanks in advance for all answers.

Pros and cons of using shared Value/Array vs Queue/Pipe in Python multiprocessing

I've been slowly learning to use the multiprocessing library in Python these last few days, and I've come to a point where I'm asking myself this question, and I can't find an answer to this.
I understand that the answer might vary depending on the application, so I'll explain what my application is.
I've created a scheduler in my main process that controls when multiple processes execute (these processes are spawned earlier, loop continuously and execute code when my scheduler raises a flag in a shared Value). Using counters in my scheduler, I can have multiple processes executing code at different frequencies (in the 100-400 Hz range), and they are all perfectly synchronized.
For example, one process executes a dynamic model of a quadcopter (ODE) at 400 Hz and updates the quadcopter's state. My other processes (command generation and trajectory generation) run at lower frequencies (200 Hz and 100 Hz), but require the updated state. I currently see 2 methods of doing this:
With Pipes: This requires separate Pipes for the dynamics/control and dynamics/trajectory connections. Furthermore, I need the control and trajectory processes to use the latest calculated quadcopter's state, so I need to flush the Pipes until the last value in them. This works, but doesn't look very clean.
With a shared Value/Array: I would only need one Array for the state; my dynamics process would write to it, while my other processes would read from it. I would probably have to implement locks (can I read a shared Value/Array from 2 processes at the same time without a lock?). This hasn't been tested yet, but would probably be cleaner.
I've read around that it is bad practice to use shared memory too much (why is that?). Yes, I'll be updating it at 400 Hz and reading it at 200 and 100 Hz, but it's not going to be such a large array (10-ish floats or doubles). However, I've also read that shared memory is faster than Pipes/Queues, and I would like to prioritize speed in my code, if it's not too much of an issue to use shared memory.
Mind you, I'll also have to send generated commands to my dynamics process (another 5-ish floats), and generated desired states to my control process (another 10-ish floats), so that's either more shared Arrays, or more Pipes.
So I was wondering, for my application, what are the pros and cons of both methods. Thanks!
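For reference, a minimal sketch of the shared-Array option described in the question; the writer holds the lock only for the copy in, and readers only for the copy out (sizes and iteration counts are placeholders):

import multiprocessing as mp

STATE_SIZE = 12                                     # ~10 floats of quadcopter state

def dynamics(state, lock):
    # 400 Hz writer: update the shared state under the lock.
    for step in range(1000):
        new_state = [float(step)] * STATE_SIZE      # stands in for one ODE step
        with lock:
            state[:] = new_state                    # short critical section per update

def controller(state, lock):
    # 200 Hz reader: snapshot the latest state, then work on the local copy.
    for _ in range(500):
        with lock:
            snapshot = state[:]                     # copy out, release immediately
        # ... compute commands from snapshot ...

if __name__ == '__main__':
    lock = mp.Lock()
    state = mp.Array('d', STATE_SIZE, lock=False)   # raw shared doubles; locking is explicit
    procs = [mp.Process(target=dynamics, args=(state, lock)),
             mp.Process(target=controller, args=(state, lock))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()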

Using ZooKeeper to manage tasks which are in process or have been processed

I have a python script which periodically scans directories, processing new files. Each file takes a long time to process (many hours). I currently have the script running on a single computer, writing the names of processed files to a local file. Not fancy or robust, but it more or less works. I would like to use multiple worker machines to improve throughput (and robustness). My goals are to keep it as simple as possible. A zookeeper cluster is readily available.
My plan is to have in zookeeper a directory "started_files" with ephemeral nodes with the filename, which is known to be unique. I would have another directory "completed_files" with regular nodes with the filename. In pseudocode,
if filename does not exist in completed files:
    try:
        create ephemeral node filename in started files
        process(filename)
        create node filename in completed files
    except node exists error:
        do nothing, another worker is processing it
My first question is whether or not this is safe. Under any circumstance, can two different machines each create the same node successfully? I don't fully understand the doc. Having a file processed twice won't cause anything ALL that bad, but I would prefer it to be correct out of principle.
Secondly, is this a decent approach? Is there another approach which is clearly better? I will be processing 10's of files per DAY, so performance of this part of the application doesn't really matter to me (I sure wish processing the file was faster). Alternatively, I could have another script with just a single instance (or elect a leader) to scan for files and put them in a queue. I could modify the code which is causing these files to magically appear in the first place. I could use celery or storm. However all of those alternatives grow the scope which I am trying to keep small and simple.
In general your approach should work. It is possible to create znodes in ZooKeeper in such a way that creating the same path a second time fails if it already exists.
For the ephemeral znodes you already found out quite well that these vanish automatically when the client closes its connection to ZooKeeper, which can be especially useful in the case of failing compute nodes.
Other nodes can also monitor the path with the ephemeral znodes in order to figure out when it would be a good idea to scan for new tasks.
It would even be possible to implement a queue on top of ZooKeeper, for instance using the sequencing of znodes; there are possibly better ways.
In general I believe that a message queue system with a publish/subscribe pattern would scale a bit better. In that case you would only need to think about how to reschedule the jobs of failed compute nodes.
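A rough sketch of the create-if-absent pattern from the question, here using the kazoo client purely as an example (the library choice, connection string and path layout are assumptions):

from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

zk = KazooClient(hosts='zk1:2181,zk2:2181,zk3:2181')   # assumed ensemble address
zk.start()

def maybe_process(filename, process):
    if zk.exists('/completed_files/' + filename):
        return                                     # already done
    try:
        # Ephemeral: vanishes automatically if this worker dies or disconnects.
        zk.create('/started_files/' + filename, ephemeral=True, makepath=True)
    except NodeExistsError:
        return                                     # another worker is processing it
    process(filename)
    # Regular (persistent) node records completion.
    zk.create('/completed_files/' + filename, makepath=True)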

Sharing a resource (file) across different python processes using HDFS

So I have some code that attempts to find a resource on HDFS... if it is not there, it will calculate the contents of that file, then write it. The next time it is accessed, the reader can just look at the file. This is to prevent expensive recalculation of certain functions.
However... I have several processes running at the same time on different machines on the same cluster. I SUSPECT that they are trying to access the same resource and I'm hitting a race condition that leads to a lot of errors where I either can't open a file or a file exists but can't be read.
Hopefully this timeline will demonstrate what I believe my issue to be
Process A goes to access resource X
Process A finds resource X does not exist and begins writing it
Process B goes to access resource X
Process A finishes writing resource X
...and so on
Obviously I would want Process B to wait for Process A to be done with Resource X and simply read it when A is done.
Something like semaphores comes to mind, but I am unaware of how to use these across different Python processes on separate machines looking at the same HDFS location. Any help would be greatly appreciated.
UPDATE: To be clear... process A and process B will end up calculating the exact same output (i.e. the same filename, with the same contents, to the same location). Ideally, B shouldn't have to calculate it; B would wait for A to calculate it, then read the output once A is done. Essentially this whole process works like a "long term cache" using HDFS, where a given function has an output signature. Any process that wants the output of a function will first determine the output signature (basically a hash of some function parameters, inputs, etc.). It will then check HDFS to see if it is there. If it's not, it will calculate it and write it to HDFS so that other processes can also read it.
(Setting aside that it sounds like HDFS might not be the right solution for your use case, I'll assume you can't switch to something else. If you can, take a look at Redis, or memcached.)
It seems like this is the kind of thing where you should have a single service that's responsible for computing/caching these results. That way all your processes will have to do is request that the resource be created if it's not already. If it's not already computed, the service will compute it; once it's been computed (or if it already was), either a signal saying the resource is available, or even just the resource itself, is returned to your process.
If for some reason you can't do that, you could try using HDFS for synchronization. For example, you could try creating the resource with a sentinel value inside which signals that process A is currently building this file. Meanwhile process A could be computing the value and writing it to a temporary resource; once it's finished, it could just move the temporary resource over the sentinel resource. It's clunky and hackish, and you should try to avoid it, but it's an option.
You say you want to avoid expensive recalculations, but if process B is waiting for process A to compute the resource, why can't process B (and C and D) be computing it as well for itself/themselves? If this is okay with you, then in the event that a resource doesn't already exist, you could just have each process start computing and writing to a temporary file, then move the file to the resource location. Hopefully moves are atomic, so one of them will cleanly win; it doesn't matter which if they're all identical. Once it's there, it'll be available in the future. This does involve the possibility of multiple processes sending the same data to the HDFS cluster at the same time, so it's not the most efficient, but how bad it is depends on your use case. You can lessen the inefficiency by, for example, checking after computation and before upload to the HDFS whether someone else has created the resource since you last looked; if so, there's no need to even create the temporary resource.
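A rough sketch of that compute-to-a-temporary-path-then-move idea; the hdfs (HdfsCLI) WebHDFS client, namenode URL and path naming here are assumptions, not something from the original post:

import uuid
from hdfs import InsecureClient                    # HdfsCLI; any client with rename works

client = InsecureClient('http://namenode:50070')   # assumed WebHDFS endpoint

def get_or_compute(resource_path, compute):
    # Fast path: someone already produced the resource.
    if client.status(resource_path, strict=False) is not None:
        with client.read(resource_path) as reader:
            return reader.read()

    data = compute()                               # the expensive calculation

    # Re-check before uploading; another process may have finished meanwhile.
    if client.status(resource_path, strict=False) is None:
        tmp_path = resource_path + '.tmp-' + uuid.uuid4().hex
        client.write(tmp_path, data=data)
        try:
            client.rename(tmp_path, resource_path) # HDFS renames are atomic; one racer wins
        except Exception:
            client.delete(tmp_path)                # lost the race; identical content exists
    return data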
TLDR: You can do it with just HDFS, but it would be better to have a service that manages it for you, and it would probably be even better not to use HDFS for this (though you still would possibly want a service to handle it for you, even if you're using Redis or memcached; it depends, once again, on your particular use case).

Downloading a Large Number of Files from S3

What's the fastest way to get a large number of files (relatively small, 10-50 kB) from Amazon S3 from Python? (On the order of 200,000 to a million files.)
At the moment I am using boto to generate Signed URLs, and using PyCURL to get the files one by one.
Would some type of concurrency help? PyCurl.CurlMulti object?
I am open to all suggestions. Thanks!
I don't know anything about Python, but in general you would want to break the task down into smaller chunks so that they can be run concurrently. You could break it down by file type, or alphabetically, or something, and then run a separate script for each portion of the breakdown.
In the case of Python, as this is IO bound, multiple threads will use the CPU, but they will probably only use one core. If you have multiple cores, you might want to consider the new multiprocessing module. Even then you may want to have each process use multiple threads. You would have to do some tweaking of the number of processes and threads.
If you do use multiple threads, this is a good candidate for the Queue class.
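A minimal sketch of the threads-plus-Queue approach for the signed URLs (worker count and output naming are placeholders):

import queue
import threading
import urllib.request

NUM_WORKERS = 20                                   # tune for your bandwidth and file sizes

def worker(q):
    while True:
        item = q.get()
        if item is None:                           # sentinel: no more work
            q.task_done()
            break
        url, dest = item
        urllib.request.urlretrieve(url, dest)      # signed URL needs no extra auth
        q.task_done()

def download_all(url_dest_pairs):
    q = queue.Queue()
    threads = [threading.Thread(target=worker, args=(q,)) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for pair in url_dest_pairs:
        q.put(pair)
    for _ in threads:
        q.put(None)                                # one sentinel per worker
    for t in threads:
        t.join()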
You might consider using s3fs, and just running concurrent file system commands from Python.
I've been using txaws with Twisted for S3 work, though what you'd probably want is just to get the authenticated URL and use twisted.web.client.downloadPage (by default it will happily go from stream to file without much interaction).
Twisted makes it easy to run at whatever concurrency you want. For something on the order of 200,000, I'd probably make a generator and use a cooperator to set my concurrency and just let the generator generate every required download request.
If you're not familiar with Twisted, you'll find the model takes a bit of time to get used to, but it's oh so worth it. In this case, I'd expect it to take minimal CPU and memory overhead, but you'd have to worry about file descriptors. It's quite easy to mix in Perspective Broker and farm the work out to multiple machines should you find yourself needing more file descriptors or if you have multiple connections over which you'd like it to pull down.
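A rough sketch of that generator-plus-cooperator pattern (the concurrency level, file naming and the older downloadPage API are assumptions on my part):

from twisted.internet import defer, reactor, task
from twisted.web.client import downloadPage

CONCURRENCY = 50

def downloads(urls):
    # Generator of Deferreds; each coiterate consumer waits for its current
    # download to finish before pulling the next one.
    for i, url in enumerate(urls):
        yield downloadPage(url.encode('ascii'), 'file_%06d' % i)

def main(urls):
    coop = task.Cooperator()
    work = downloads(urls)
    # Run CONCURRENCY consumers over the same generator, then stop the reactor.
    d = defer.DeferredList([coop.coiterate(work) for _ in range(CONCURRENCY)])
    d.addBoth(lambda _: reactor.stop())
    reactor.run()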
What about threads + a queue? I love this article: Practical threaded programming with Python
Each job can be done with the appropriate tools :)
You want to use Python to stress-test S3 :), so I suggest finding a large-volume downloader program and passing links to it.
On Windows I have experience with installing the ReGet program (shareware, from http://reget.com) and creating download tasks via its COM interface.
Of course other programs with a usable interface may exist.
Regards!
