Parallel processing from a command queue on Linux (bash, python, ruby... whatever) - python

I have a list/queue of 200 commands that I need to run in a shell on a Linux server.
I only want to have a maximum of 10 processes running (from the queue) at once. Some processes will take a few seconds to complete, other processes will take much longer.
When a process finishes I want the next command to be "popped" from the queue and executed.
Does anyone have code to solve this problem?
Further elaboration:
There's 200 pieces of work that need to be done, in a queue of some sort. I want to have at most 10 pieces of work going on at once. When a thread finishes a piece of work it should ask the queue for the next piece of work. If there's no more work in the queue, the thread should die. When all the threads have died it means all the work has been done.
The actual problem I'm trying to solve is using imapsync to synchronize 200 mailboxes from an old mail server to a new mail server. Some users have large mailboxes and take a long time to sync, others have very small mailboxes and sync quickly.

On the shell, xargs can be used to queue parallel command processing. For example, to always have 3 sleeps running in parallel, each sleeping for 1 second, and to execute 10 sleeps in total, do
echo {1..10} | xargs -d ' ' -n1 -P3 sh -c 'sleep 1s' _
It would take 4 seconds in total. If you have a list of names, and want to pass the names to the commands executed, again running 3 commands in parallel, do
cat names | xargs -n1 -P3 process_name
This would execute the commands process_name alice, process_name bob, and so on.

I would imagine you could do this using make and the make -j xx command.
Perhaps a makefile like this
all : usera userb userc....

usera:
	imapsync usera

userb:
	imapsync userb

....
make -j 10 -f makefile
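Hand-writing 200 targets is impractical, so a small Python sketch could generate the makefile; the users.txt input and Makefile.imapsync output names are assumptions, not part of the original answer:

# Hypothetical helper: generate one imapsync target per user.
# Assumes a plain-text users.txt with one username per line.
with open("users.txt") as f:
    users = [line.strip() for line in f if line.strip()]

with open("Makefile.imapsync", "w") as mk:
    mk.write("all: " + " ".join(users) + "\n\n")
    for user in users:
        mk.write("%s:\n\timapsync %s\n\n" % (user, user))

# then: make -j 10 -f Makefile.imapsync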

GNU Parallel is made exactly for this purpose.
cat userlist | parallel imapsync
One of the beauties of Parallel compared to other solutions is that it makes sure output is not mixed. Doing traceroute in Parallel works fine for example:
(echo foss.org.my; echo www.debian.org; echo www.freenetproject.org) | parallel traceroute

PPSS (Parallel Processing Shell Script) was written for exactly this kind of job. Google the name and you will find it; I won't linkspam.

GNU make (and perhaps other implementations as well) has the -j argument, which governs how many jobs it will run at once. When a job completes, make will start another one.

Well, if they are largely independent of each other, I'd think in terms of:
Initialize an array of jobs pending (queue, ...) - 200 entries
Initialize an array of jobs running - empty
while (jobs still pending and queue of jobs running still has space)
    take a job off the pending queue
    launch it in background
    if (queue of jobs running is full)
        wait for a job to finish
        remove it from the jobs running queue
while (queue of jobs running is not empty)
    wait for a job to finish
    remove it from the jobs running queue
Note that the test at the tail of the main loop body ensures the 'jobs running' queue has space when the while condition is re-evaluated, preventing premature termination of the loop. I think the logic is sound.
I can see how to do that in C fairly easily - it wouldn't be all that hard in Perl, either (and therefore not too hard in the other scripting languages - Python, Ruby, Tcl, etc). I'm not at all sure I'd want to do it in shell - the wait command in shell waits for all children to terminate, rather than for some child to terminate.
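A minimal Python rendering of that loop, assuming the 200 jobs are plain shell command strings; the commands list and the limit of 10 are placeholders:

import subprocess
import time

commands = ["sleep 2", "sleep 1", "echo done"]   # placeholder for the real 200 commands
MAX_RUNNING = 10

pending = list(commands)    # jobs pending
running = []                # jobs running

while pending:
    # fill the running queue from the pending queue
    while pending and len(running) < MAX_RUNNING:
        running.append(subprocess.Popen(pending.pop(0), shell=True))
    # wait until at least one running job finishes, then drop finished jobs
    while all(p.poll() is None for p in running):
        time.sleep(0.2)
    running = [p for p in running if p.poll() is None]

# drain: wait for the remaining jobs to finish
for p in running:
    p.wait()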

In python, you could try:
import Queue, os, threading

# synchronised queue
queue = Queue.Queue(0)  # 0 means no maximum size

# do stuff to initialise queue with strings
# representing os commands
queue.put('sleep 10')
queue.put('echo Sleeping..')
# etc
# or use python to generate commands, e.g.
# for username in ['joe', 'bob', 'fred']:
#     queue.put('imapsync %s' % username)

def go():
    while True:
        try:
            # False here means no blocking: raise exception if queue empty
            command = queue.get(False)
            # Run command. python also has subprocess module which is more
            # featureful but I am not very familiar with it.
            # os.system is easy :-)
            os.system(command)
        except Queue.Empty:
            return

for i in range(10):  # change this to run more/fewer threads
    threading.Thread(target=go).start()
Untested...
(Of course, CPython's GIL means only one thread executes Python bytecode at a time. You should still get the benefit of multiple threads here, though, since they spend most of their time waiting on external commands and IO.)

If you are going to use Python, I recommend using Twisted for this.
Specifically Twisted Runner.

https://savannah.gnu.org/projects/parallel (gnu parallel)
and pssh might help.

Python's multiprocessing module would seem to fit your issue nicely. It's a high-level package that offers a thread-like API backed by separate processes.
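A minimal sketch of that idea with multiprocessing.Pool, assuming the jobs are shell command strings; the command list below is a placeholder:

import subprocess
from multiprocessing import Pool

# Placeholder commands; in the question this would be the 200 imapsync jobs.
commands = ["imapsync alice", "imapsync bob"]

def run(cmd):
    # each worker process runs one shell command and reports its exit code
    return cmd, subprocess.call(cmd, shell=True)

if __name__ == "__main__":
    with Pool(processes=10) as pool:            # at most 10 commands at once
        for cmd, rc in pool.imap_unordered(run, commands):
            print(cmd, "exited with", rc)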

A simple function in zsh to parallelize jobs across no more than 4 subshells, using lock files in /tmp.
The only non-trivial part is the glob flags in the first test:
#q: enable filename globbing in a test
[4]: returns the 4th result only
N: ignore error on empty result
It should be easy to convert it to posix, though it would be a bit more verbose.
Do not forget to escape any quotes in the jobs with \".
#!/bin/zsh
setopt extendedglob
para() {
    lock=/tmp/para_$$_$((paracnt++))
    # sleep as long as the 4th lock file exists
    until [[ -z /tmp/para_$$_*(#q[4]N) ]] { sleep 0.1 }
    # Launch the job in a subshell
    ( touch $lock ; eval $* ; rm $lock ) &
    # Wait for subshell start and lock creation
    until [[ -f $lock ]] { sleep 0.001 }
}
para "print A0; sleep 1; print Z0"
para "print A1; sleep 2; print Z1"
para "print A2; sleep 3; print Z2"
para "print A3; sleep 4; print Z3"
para "print A4; sleep 3; print Z4"
para "print A5; sleep 2; print Z5"
# wait for all subshells to terminate
wait

Can you elaborate on what you mean by "in parallel"? It sounds like you need to implement some sort of locking on the queue so your entries are not selected twice and each command runs only once.
Most queue systems cheat -- they just write a giant to-do list, then select e.g. ten items, work them, and select the next ten items. There's no parallelization.
If you provide some more details, I'm sure we can help you out.

Related

Know if subprocess is not stuck by its prints to stdout

I have a subprocess that I run with:
proc = subprocess.Popen("python -u my_script.py", shell=True)
my_script.py should print regularly to stdout, and another unrelated process listens to this output, so I can't redirect the output anywhere else.
I want to ensure that the process is really printing regularly and hasn't got stuck in some loop, etc. Is there a way to check whether stdout has been written to within some amount of time?
Any other options to reach this goal?
EDIT
I am using Windows
You can create a named pipe with mkfifo and use tee to send your script's output both to the process listening for it and to the pipe:
mkfifo blarg
my_script.py | tee blarg | your_greedy_data_processing_instance
tail -f blarg
Instead of tail you can use an arbitrarily complicated script to study the output and the state of the process generating it (timers, PID checks).
It appears that the access time and modification time of /dev/stdout are updated regularly. Note, however, that /dev/stdout is always a symbolic link to the stdout file descriptor of the process that's checking /dev/stdout; i.e., /dev/stdout links to /proc/self/fd/1.
So it seems that you could check the first file descriptor of your process to see if its modification time has changed, e.g.:
$ stat -c %y -L /proc/10830/fd/1
2021-05-13 02:34:00.367857061
-L means act on the target of the soft link, not the soft link itself; -c %y is just asking for the modification time. This Python script is running as process 10830 on my system right now, and it's occasionally updating the modification time (about every 8 seconds):
>>> import time
>>> while True: time.sleep(1); print("still alive")
still alive
still alive
still alive
....
You should Google this answer to be sure that the behavior I'm seeing is reliable, though, because I've never read anything about it before.
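If that mtime behavior holds on your system, a small Python sketch could poll the same information that stat -L reports; the PID, the threshold, and the Linux-style /proc path are assumptions:

import os
import time

# Hypothetical values: the PID of the monitored process and how long silence
# counts as "stuck". Assumes a Linux-style /proc filesystem.
pid = 10830
threshold = 30   # seconds

fd1 = "/proc/%d/fd/1" % pid

while True:
    mtime = os.stat(fd1).st_mtime      # os.stat follows the symlink, like stat -L
    if time.time() - mtime > threshold:
        print("no stdout activity for %d seconds" % threshold)
        break
    time.sleep(5)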
Alternatively, you could either (a) trust that the script is fine, which it will of course always be (unless it's catching exceptions and refusing to exit even when it can no longer do anything useful, in which case you should change it to die the way it should), or (b) set up a daemon that sends a signal to the script, to which the script replies with a signal of its own saying "I'm still alive." There's really no reason to do that, in my opinion, but how you write your programs is up to you.
So assuming that you want to press forward with this, here's a trivial example of the daemon that would monitor the script you want to make sure isn't stuck in a loop or something:
import time
import signal
import os
import sys

# keep a timestamp of when we receive a response
response_timestamp = time.time()

# add code here to get the process ID of the other script
other_pid = 0

def sig_handler(signum, frame):
    global response_timestamp
    response_timestamp = time.time()

if __name__ == '__main__':
    # make sure that when we receive SIGBREAK, sig_handler() gets called
    signal.signal(signal.SIGBREAK, sig_handler)
    while True:
        # send SIGBREAK to "other_pid"
        os.kill(other_pid, signal.SIGBREAK)
        time.sleep(15)
        if time.time() - 20 > response_timestamp:
            print("the other process is frozen")
            sys.exit(os.EX_SOFTWARE)
Then you add this to the other script that you're monitoring:
import signal
import os

# add code here to get the process ID
other_pid = 0

def sig_handler(signum, frame):
    os.kill(other_pid, signal.SIGBREAK)

...
...
(rest of your script)
Now be aware that the only thing this will do is make sure that the process isn't completely frozen. Regrettably, Windows doesn't offer many options when it comes to signals: SIGBREAK was the best one I saw, but note that it's the signal a process receives when you hit CTRL+BREAK to interrupt the program (so if you manually hit CTRL+BREAK in the window running the Python program, it won't kill it, it will just make it call sig_handler()).
I would also be remiss if I did not inform you that, even though this will probably work just fine, doing almost anything inside a signal handler function is generally considered unsafe. It's bad form and may blow up on you unexpectedly, but in practice it's pretty safe.

Running external commands partly in parallel from python (or bash)

I am running a python script which creates a list of commands which should be executed by a compiled program (proprietary).
The program can split some of the calculations to run independently, and the data will then be collected afterwards.
I would like to run these calculations in parallel, as each is a very time-consuming single-threaded task and I have 16 cores available.
I am using subprocess to execute the commands (inside a class):
def run_local(self):
    p = Popen(["someExecutable"], stdout=PIPE, stdin=PIPE)
    p.stdin.write(self.exec_string)
    p.stdin.flush()
    while p.poll() is None:
        line = p.stdout.readline()
        self.log(line)
Where self.exec_string is a string of all the commands.
This string can be split into an initial part, the part I want parallelised, and a finishing part.
How should I go about this?
Also, it seems the executable will "hang" (waiting for a command, e.g. "exit", which will release the memory) if a naive copy-paste of the current method is used for each part.
Bonus: the executable also has the option to run a bash script of commands, if it is easier/possible to parallelise in bash.
For bash, it could be very simple. Assuming your file looks like this:
## init part##
ls
cd ..
ls
cat some_file.txt
## parallel ##
heavycalc &
heavycalc &
heavycalc &
## finish ##
wait
cat results.txt
With & after a command you tell bash to run that command in the background. wait will then wait for all background jobs to finish, so you can be sure all calculations are done.
I've assumed your input txt file contains plain bash commands.
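The same init / parallel / finish split can also be driven from Python; a minimal sketch with concurrent.futures, assuming each parallel chunk can be fed to its own instance of the executable (the "..." strings stand in for your actual command chunks):

import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_chunk(commands):
    # feed one chunk of commands to its own instance of the executable
    # ("someExecutable" is the proprietary program from the question)
    p = subprocess.Popen(["someExecutable"], stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE, text=True)
    out, _ = p.communicate(commands)
    return out

init_part = "..."                        # the commands every run needs first
parallel_parts = ["...", "...", "..."]   # the independent, heavy chunks
finish_part = "..."                      # collects the data afterwards

# threads are enough here: the heavy work happens in the external processes
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(run_chunk, [init_part + part for part in parallel_parts]))

# run the finishing part once, after all parallel chunks are done
print(run_chunk(init_part + finish_part))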
Using GNU Parallel:
## init
cd foo
cp bar baz
## parallel ##
parallel heavycalc ::: file1 file2 file3 > results.txt
## finish ##
cat results.txt
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new process whenever one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

python program doesn't stop once it's built

I start this program, all.py:
import subprocess
import os
scripts_to_run = ['AppFlatForRent.py','AppForSale.py','CommercialForSale.py','LandForSale.py','MultipleUnitsForSale.py','RentalWanted.py','RentCommercial.py','RoomsForRent.py','RoomsWanted.py','ShortTermDaily.py','ShortTermMonthly.py','VillaHouseForRent.py','VillaHouseForSale.py']
for s in scripts_to_run:
    subprocess.Popen(["python", os.path.join(os.getcwd(), s)])
It's running 13 programs at a time. The problem is that in Sublime, unlike other programs, this particular program doesn't cancel the build; it just keeps running (I know because the program keeps inserting values into the database and doesn't stop doing that).
I want to be able to do this via the terminal.
Any help?
There are two approaches you can take.
The shell approach
If you only want to kill the child processes after the main app has finished but don't want the main app to handle this itself, for whatever reason (mostly it's for debugging purposes), you can do it from the terminal:
kill $(ps aux |grep -E 'python[[:space:]][[:alnum:]]+.py' |awk '{print $2}')
Here ps aux lists all running processes; grep -E 'python[[:space:]][[:alnum:]]+.py' finds all scripts executed by Python, ending with .py (check Note 1 for more details); awk '{print $2}' gets the second column, which is the PID; and kill then terminates those processes.
Note 1: the regular expression in the example above is just for demonstration purposes; it matches only scripts executed with a relative path like python script.py, and does not include processes like python /path/to/script.py. This is just an example, so make sure to adapt the regular expression to your specific needs.
Note 2: this approach is risky because it can kill unwanted applications, make sure you know what you are doing before using it.
The Python approach
The other approach offers more control, and is implemented in the main application itself.
You can make sure that all child processes are exited when the main application ends by keeping track of all the processes you created, and killing them afterwards.
Example usage:
First change your process spawning code to keep the Popen objects of the running processes for later usage:
running_procs = []
for s in scripts_to_run:
    running_procs.append(
        subprocess.Popen(["python", os.path.join(os.getcwd(), s)])
    )
Then define the do_clean() function that will iterate through them and terminate them:
def do_clean():
    for p in running_procs:
        p.kill()
You can call this function manually whenever you wish to do this, or you can use the atexit module to do this when the application is terminating.
The atexit module defines a single function to register cleanup
functions. Functions thus registered are automatically executed upon
normal interpreter termination.
Note: The functions registered via this module are not called when the
program is killed by a signal not handled by Python, when a Python
fatal internal error is detected, or when os._exit() is called.
For example:
import atexit
atexit.register(do_clean)
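Putting those pieces together, a minimal sketch of the main script could look like this (the shortened script list is a placeholder for the 13 scripts in the question):

import atexit
import os
import subprocess

# shortened list; the question uses all 13 scripts
scripts_to_run = ['AppFlatForRent.py', 'AppForSale.py']

running_procs = [
    subprocess.Popen(["python", os.path.join(os.getcwd(), s)])
    for s in scripts_to_run
]

def do_clean():
    # kill any children still alive when the main script exits normally
    for p in running_procs:
        p.kill()

atexit.register(do_clean)

# ... the main script's own work goes here ...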
To stop all child scripts, you could call the .terminate() or .kill() methods:
import sys
import time
from subprocess import Popen

# start child processes
children = [Popen([sys.executable or 'python', scriptname])
            for scriptname in scripts_to_run]

time.sleep(30)  # wait 30 seconds

for p in children:
    p.terminate()  # or p.kill()
for p in children:
    p.wait()  # wait until they exit

Avoid python setup time

The image below shows python spending a lot of time in user space. Is it possible to reduce this time at all?
I will be running a script several hundred times. Is it possible to start python so that it initializes once and doesn't repeat that work on subsequent runs?
I just searched for the same and found this:
http://blogs.gnome.org/johan/2007/01/18/introducing-python-launcher/
Python-launcher does not solve the problem directly, but it points into an interesting direction: If you create a small daemon which you can contact via the shell to fork a new instance, you might be able to get rid of your startup time.
For example get the python-launcher and socat¹ and do the following:
PYTHONPATH="../lib.linux-x86_64-2.7/" python python-launcher-daemon &
echo pass > 1
for i in {1..100}; do
    echo 1 | socat STDIN UNIX-CONNECT:/tmp/python-launcher-daemon.socket &
done
Todo: Adapt it to your program, remove the GTK stuff. Note the & at the end: Closing the socket connection seems to be slow.
The essential trick is to just create a server which opens a socket. Then it reads all the data from the socket. Once it has the data, it forks like the following:
pid = os.fork()
if pid:
    return
signal.signal(signal.SIGPIPE, signal.SIG_DFL)
signal.signal(signal.SIGCHLD, signal.SIG_DFL)
glob = dict(__name__="__main__")
print 'launching', program
execfile(program, glob, glob)
raise SystemExit
Running 100 programs that way took just 0.7 seconds for me.
You might have to switch from forking to just executing the code directly if you want to be really fast.
(That’s what I also do with emacsclient… My emacs takes ~30s to start (due to excessive use of additional libraries I added), but emacsclient -c shows up almost instantly.)
¹: http://www.socat.org
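For illustration, a minimal sketch of such a launcher daemon in plain Python; the socket path and the one-script-path-per-connection protocol are assumptions, not python-launcher's actual interface:

import os
import runpy
import signal
import socket

SOCK = "/tmp/launcher-demo.socket"     # hypothetical socket path

# auto-reap children so finished scripts don't linger as zombies
signal.signal(signal.SIGCHLD, signal.SIG_IGN)

if os.path.exists(SOCK):
    os.remove(SOCK)
server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(SOCK)
server.listen(5)

while True:
    conn, _ = server.accept()
    script = conn.makefile().read().strip()   # one script path per connection
    conn.close()
    if os.fork() == 0:                        # child: interpreter is already warm
        runpy.run_path(script, run_name="__main__")
        os._exit(0)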
Write the "do this several 100 times" logic in your Python script. Call it ONCE from that other language.
Use timeit instead:
http://docs.python.org/library/timeit.html

python: run multiple commands at the same time

Prior to this, I ran two commands in a for loop, like:
for x in $set:
    command
In order to save time, I want to run these two commands at the same time, like the parallel feature in a makefile.
Thanks
Lyn
The threading module won't give you much performance-wise because of the Global Interpreter Lock.
I think the best way to do this is to use the subprocess module and open each command with its own stdout.
import select
import subprocess

processes = {}
for cmd in ['cmd1', 'cmd2', 'cmd3']:
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    processes[p.stdout] = p

while len(processes):
    rfds, _, _ = select.select(processes.keys(), [], [])
    for fd in rfds:
        process = processes[fd]
        print fd.read()
        if process.poll() is not None:
            print "Process {0} returned with code {1}".format(process.pid, process.returncode)
            del processes[fd]
You basically have to use select to see which file descriptors are ready, and check each process's return code to see whether the read hit end-of-file because the process exited. A process stays in the dict until its stdout is closed. If you would like to do other things while you're waiting, you can put a timeout on select.select() so you stop waiting after so long; if the length of rfds is 0, you know the timeout happened.
twisted or select module is probably what you're after.
If all you want to do is run a bunch of batch commands, a shell script, i.e.
#!/bin/sh
for i in command1 command2 command3; do
    $i &
done
might work better. Alternatively, a Makefile like you said.
Look at the threading module.
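A minimal sketch of that approach, assuming the commands are independent shell commands; the GIL is not a concern here because the real work happens in the external processes:

import subprocess
import threading

commands = ["command1", "command2", "command3"]   # placeholders

def run(cmd):
    subprocess.call(cmd, shell=True)

threads = [threading.Thread(target=run, args=(cmd,)) for cmd in commands]
for t in threads:
    t.start()
for t in threads:
    t.join()   # wait for every command to finish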
