Let's say I have N tasks to run (~thousands), and each task takes a good amount of time X (a few minutes to hours). Luckily, each of these tasks can be run independently. Each task is a shell command invoked through Python.
Edit: Each task is almost identical, so I don't really need to abstract tasks.
Which is better (in terms of memory required, CPU usage, background task limits, ...)?
Invoke each task as a background process from a single-threaded Python script (using files/redirection for tracking), or
use multiple Python threads, each one calling the shell command. (Rough sketches of both options follow.)
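For concreteness, here is a minimal sketch of both options; the command, log-file names, and task count are made up for illustration:

import subprocess
from concurrent.futures import ThreadPoolExecutor

CMD = ["./my_task.sh", "--some-arg"]          # hypothetical shell command

# Option 1: background processes from a single-threaded script,
# tracked via their Popen handles and per-task log files
procs = []
for i in range(10):
    log = open("task_%d.log" % i, "w")
    procs.append(subprocess.Popen(CMD, stdout=log, stderr=subprocess.STDOUT))
for p in procs:
    p.wait()

# Option 2: a pool of Python threads, each blocking on one shell command
def run_one(i):
    return subprocess.call(CMD)

with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(run_one, range(10)))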
I need Python here primarily to interact with the DB, do some logic, read files, etc.
Is this a tradeoff worth considering or either way is fine?
Ideally, it would be nice if there were some statistics/graphs comparing the two approaches for different N/X values.
PS: Google and SO search didn't give me any leads. I am sorry if there is something like this already.
Thanks!
Related
I'm refactoring a .NET application to Airflow. This .NET application uses multiple threads to extract and process data from a MongoDB (without multiple threads the process takes ~10 hrs; with multiple threads I can reduce this).
Each document in MongoDB has a key named process. This value is used to control which thread processes the document. I'm going to develop an Airflow DAG to optimize this process. My question is about performance and the best way to do this.
Should my application have multiple tasks (I would control the process variable via the input of the Python method), or should I use only one task and use Python multithreading inside that task? The image below illustrates my question.
Multi Task vs Single Task (Multithreading)
I know that using multiple tasks I'm going to do more DB reads (one per task). On the other hand, using Python multithreading I know I'll have to do a lot of control processing inside the task method. What is the best, fastest, and most optimized way to do this?
It really depends on the nature of your processing.
Multi-threading in Python can be limiting because of the GIL (Global Interpreter Lock): some operations require an exclusive lock, and this limits the parallelism you can achieve. Especially if you mix CPU and I/O operations, the effect might be that a lot of time is spent by threads waiting for the lock. But it really depends on what you do - you need to experiment to see whether the GIL affects your multithreading.
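A quick way to run that experiment yourself; the cpu_bound function below is just a made-up CPU-bound stand-in for your real processing:

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n):                              # toy stand-in for your processing
    return sum(i * i for i in range(n))

def timed(executor_cls, label):
    start = time.perf_counter()
    with executor_cls(max_workers=4) as ex:
        list(ex.map(cpu_bound, [2_000_000] * 8))
    print(label, round(time.perf_counter() - start, 2), "s")

if __name__ == "__main__":
    timed(ThreadPoolExecutor, "threads:")      # serialized by the GIL for CPU-bound work
    timed(ProcessPoolExecutor, "processes:")   # one interpreter (and GIL) per process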
Multiprocessing (which is what Airflow's Local Executor uses) is better because each process effectively runs a separate Python interpreter. So each process has its own GIL - at the expense of the resources used (each process has its own memory, sockets, and so on). Each task in Airflow will run in a separate process.
However, Airflow offers a bit more - it also scales across machines. You can run separate workers with X processes on each of Y machines, effectively running up to X*Y processes at a time.
Unfortunately, Airflow is (currently) not well suited to running a dynamic number of parallel tasks of the same type. Specifically, if you would like to split the load into N pieces and run each piece in a separate task, this only really works if N is constant and does not change over time for the same DAG. For example, if you know you have 10 machines with 4 CPUs each, you'd typically want to run 10*4 = 40 tasks at a time, so you'd have to split your job into 40 tasks. And this cannot really change dynamically between runs - you'd have to write your DAG to run 40 parallel tasks every time it runs.
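A rough sketch of that fixed-N pattern in Airflow 2.x style; the DAG id, the piece count, and process_piece are placeholders:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

N_PIECES = 40                                  # fixed up front, as discussed above

def process_piece(piece_id):
    ...                                        # e.g. handle documents whose "process" key == piece_id

with DAG(
    dag_id="split_processing",                 # placeholder name
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    for i in range(N_PIECES):
        PythonOperator(
            task_id=f"process_piece_{i}",
            python_callable=process_piece,
            op_kwargs={"piece_id": i},
        )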
Not sure if I helped, but there is no single "best optimised" answer - you need to experiment and check what works best for your case.
I am running into a very weird problem on an Amazon instance running python and multiprocessing.
The context
I want to use pool.map or something similar (imap_unordered would do the trick too) to apply a CPU-intensive task to an iterable. The iterable isn't that big (a few hundred items), but the task takes a long time.
I'm using the multiprocessing module of Python, in Python 2.7.11
The general structure is:
from multiprocessing import Pool

for outer_item in longer_loop:        # longer loop
    for batch in small_loop:          # small loop
        pool = Pool(processes=18)
        pool.map(f, iterable)
        pool.close()
        pool.join()
The problem
I start the run. I go and look at "top" and see that Python is nicely using all the cores. I go and work on something else. I come back and see that all of a sudden Python is still in the longer loop but now uses only one core, and has completely stopped taking advantage of the multiprocessing. To emphasize: it doesn't hang. Stuff is still happening. But it's happening one item at a time, instead of 18 at a time.
Things I tried (that didn't help)
First instinct: this is a load-balancing issue, since the function takes a long yet slightly variable time, so some cores just finish earlier. So I set chunksize to 1, since the bottleneck is definitely the function being applied, not the creation of lots of chunks. That didn't help.
Second instinct: I vaguely remember that numpy and Python multiprocessing do not gel very well. So I set OMP_NUM_THREADS=1 in the environment variables. While that seemed to help at first (by making everything run faster), a run with a longer execution time (more data than my "let's test this stuff on small things first" runs) still got stuck in the "only one thread" state.
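For reference, if the OpenMP variable is the culprit, it has to take effect before numpy (and its BLAS backend) is loaded; one way to pin it from inside the script is:

import os
os.environ["OMP_NUM_THREADS"] = "1"   # must be set before numpy initializes its threaded BLAS
import numpy as np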
Note: I had the creation of the pool outside the small loop, but that didn't change anything. The actual execution of the map takes the most time, so closing and recreating Pool objects would be insignificant.
More suspicions of what it could be
Currently trying a run where I deal with that core affinity issue, but I feel like if that was the issue then I should see it from the start, not at some undetermined later time.
Is there something weird about the Amazon EC2 instance that says "enough cores for you, fool!" after creating too many processes?
Could it have to do with using too much memory? But then I'd just expect to still see 18 hardworking (and 1 monitoring) Python processes, just now all busy swapping stuff because they're out of memory. But I really just see a single working process (and the 1 monitoring process) toiling away at the loop, as if map (or imap_unordered) decided that 1 was enough now. Which... just shouldn't happen.
Happy for any clues and pointers, and happy to provide more information if required.
I was thinking about adding multiprocessing to one of my scripts to increase performance.
It's nothing fancy; there are 1-2 parameters to the main method.
Is there anything wrong with just running four clones of the same script from the terminal, versus actually adding multiprocessing into the Python code?
Example for four cores:
~$ ./script.py & ./script.py & ./script.py & ./script.py
I've read that Linux/Unix OSes automatically divide the programs amongst the available cores.
Sorry if the stuff I've mentioned above is totally wrong; nothing above was formally learned, it was all just online stuff.
Martijn Pieters' comment hits the nail on the head, I think. If each of your processes only consumes a small amount of memory (so that you can easily have all four running in parallel without running out of RAM) and if your processes do not need to communicate with each other, it is easiest to start all four processes independently, from the shell, as you suggested.
The python multiprocessing module is very useful if you have slightly more complex requirements. You might, for instance, have a program that needs to run in serial at startup, then spawn several copies for the more computationally intensive section, and finally do some post-processing in serial. multiprocessing would be invaluable for this sort of synchronization.
Alternatively, your program might require a large amount of memory (maybe to store a large matrix in scientific computing, or a large database in web programming). multiprocessing lets you share that object among the different processes so that you don't have n copies of the data in memory (using multiprocessing.Value and multiprocessing.Array objects).
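A minimal sketch of that kind of sharing with multiprocessing.Array (the values and the doubling step are arbitrary):

import multiprocessing as mp

def worker(shared, idx):
    shared[idx] = shared[idx] * 2                   # all processes see the same underlying buffer

if __name__ == "__main__":
    shared = mp.Array("d", [1.0, 2.0, 3.0, 4.0])    # 'd' = C double; lives in shared memory
    procs = [mp.Process(target=worker, args=(shared, i)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(list(shared))                             # [2.0, 4.0, 6.0, 8.0]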
Using multiprocessing might also become the better solution if you want to run your script, say, 100 times on only 4 cores; then your terminal-based approach would become pretty nasty.
In this case you might want to use a Pool from the multiprocessing module.
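Something along these lines (run_once is a stand-in for whatever one invocation of your script does):

from multiprocessing import Pool

def run_once(param):
    ...                                        # the work one run of your script would do
    return param

if __name__ == "__main__":
    with Pool(processes=4) as pool:            # 4 workers for 4 cores
        results = pool.map(run_once, range(100))   # 100 jobs queued across the 4 workers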
I am currently working in Python and my program looks like this:
function(1)
function(2)
...
function(100)
Each function call takes ~30 minutes at 100% CPU, so executing the whole program takes a lot of time. The functions read inputs from the same file, do a lot of math, and print the results.
Would introducing multithreading decrease the time the program takes to complete (I am working on a multicore machine)? If so, how many threads should I use?
Thank you!
It depends.
If none of the functions depend on each other at all, you can of course run them on separate threads (or even processes using multiprocessing, to avoid the global interpreter lock). You can either run one process per core, or run 100 processes, or any number in between, depending on the resource constraints of your system. (If you don't own the system, some admins don't like users who spam the process table.)
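For the fully independent case, a minimal process-pool sketch (the body of function here is a placeholder for the question's work):

from multiprocessing import Pool

def function(n):
    ...                                        # the question's CPU-heavy work goes here

if __name__ == "__main__":
    with Pool() as pool:                       # defaults to one worker per CPU core
        pool.map(function, range(1, 101))      # function(1) ... function(100), run in parallel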
If the functions must be run one after the other, then you can't do that. You have to restructure the program to try and isolate independent tasks, or accept that you might have a P-complete (inherently hard to parallelize) problem and move on.
Task is:
I have a task queue stored in a DB, and it keeps growing. I need to process the tasks with a Python script whenever I have the resources for it. I see two ways:
A Python script running all the time. But I don't like this (possible memory leaks).
A Python script called by cron that does a small part of the work each time. But I need to make sure that only one active script is running at a time (to keep the number of active scripts from growing). What is the best way to implement this in Python?
Any ideas on how to solve this problem?
You can use a lockfile to prevent multiple scripts from running out of cron. See the answers to an earlier question, "Python: module for creating PID-based lockfile". This is really just good practice in general for anything that you need to make sure won't have multiple instances running, actually, so you should look into it even if you do have the script running constantly, which I do suggest.
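A minimal sketch of the lockfile idea using flock (the lock path is arbitrary; on a Unix system this is enough to keep cron from piling up instances):

import fcntl
import sys

lock = open("/tmp/task_runner.lock", "w")              # arbitrary path, but the same for every run
try:
    fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)   # non-blocking: fail fast if already locked
except OSError:
    sys.exit("another instance is already running")
# ... process the queue; the lock is released when this process exits ...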
For most things, it shouldn't be too hard to avoid memory leaks, but if you're having a lot of trouble with it (I sometimes do with complex third-party web frameworks, for example), I would suggest instead writing the script with a small, carefully-designed main loop that monitors the database for new jobs, and then uses the multiprocessing module to fork off new processes to complete each task.
When a task is complete, the child process can exit, immediately freeing any memory that isn't properly garbage collected, and the main loop should be simple enough that you can avoid any memory leaks.
This also offers the advantage that you can run multiple tasks in parallel if your system has more than one CPU core, or if your tasks spend a lot of time waiting for I/O.
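A rough sketch of that structure (fetch_next_task and handle_task are placeholders for your DB polling and task logic):

import time
from multiprocessing import Process

def handle_task(task):
    ...                                   # do the work; all memory is freed when this process exits

def main_loop():
    while True:
        task = fetch_next_task()          # placeholder: pull the next job from the DB queue
        if task is None:
            time.sleep(30)                # nothing to do; poll again later
            continue
        worker = Process(target=handle_task, args=(task,))
        worker.start()
        worker.join()                     # or keep several workers running for parallelism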
This is a bit of a vague question. One thing you should remember is that it is very difficult to leak memory in Python, because of the automatic garbage collection. Running the Python script from cron to handle the queue isn't very nice, although it would work fine.
I would use method 1; if you need more power you could make a small Python process that monitors the DB queue and starts new processes to handle the tasks.
I'd suggest using Celery, an asynchronous task queuing system which I use myself.
It may seem a bit heavy for your use case, but it makes it easy to expand later by adding more worker resources if/when needed.
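A minimal Celery sketch (the broker URL and the task body are placeholders):

from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")   # placeholder broker URL

@app.task
def solve(task_id):
    ...                                  # pull the task's data from the DB and process it

# Producer side: solve.delay(42) queues the job; a separate
# "celery -A tasks worker" process picks it up when resources allow.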