bash script launching many processes and blocking computer - python

I have written a bash script with the aim of running a .py template Python script 15,000 times, each time using a slightly modified version of this .py.
After each run of one .py, the bash script logs what happened into a file.
Here is the bash script, which works on my laptop and computes the 15,000 things:
N_SIM=15000
for ((j = 1; j <= $N_SIM; j++))
do
    index=$((j))
    a0=$(awk "NR==${index} { print \$1 }" values_a0_in_Xenon_10_20_10_25_Wcm2_RRon.txt)
    dirname="a0_${a0}"
    mkdir -p $dirname
    cd $dirname
    awk -v s="a0=${a0}" 'NR==6 {print s} 1 {print}' ../integration.py > integrationa0included.py
    mpirun -n 1 python3 integrationa0included.py &> integration_Xenon.log &
    cd ..
done
It launches processes, and the terminal output looks something like this (the numbers are only there for illustration, they are not exact):
[1]
[2]
[3]
...
...
[45]
[1]: exit, success (a message along these lines)
[4]: exit, success
[46]
[47]
[48]
[2]: exit, success
...
This pattern of some launched processes finishing while new ones are continuously launched repeats until all 15,000 processes have been launched and completed.
I want to run this on a very old computer.
The problem is that it almost instantly launches about 300 such processes, and then the computer freezes. It basically crashes: I cannot press CTRL+Z or CTRL+C or type anything. It's frozen.
I want to ask if there is a modification to the bash script that launches only 2 processes, waits for 1 to finish, launches the 3rd, waits for the 2nd to finish, launches the 4th, and so on.
That way there wouldn't be so many processes running at any given time, and maybe the old computer wouldn't lock up.

Inside your loop, add the following code to the beginning of the loop body:
# insert this after `do`
[ "$(jobs -pr | wc -l)" -ge 2 ] && wait -n
If there are already two or more background jobs running, this waits until at least one of them has terminated.
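For context, a sketch of the whole loop with that guard in place might look like the following (note that wait -n requires bash 4.3 or newer; the final wait is only there so the script does not return before the last two runs have finished):
N_SIM=15000
for ((j = 1; j <= $N_SIM; j++))
do
    # with 2 or more runs still going, block until one of them exits
    [ "$(jobs -pr | wc -l)" -ge 2 ] && wait -n
    a0=$(awk "NR==${j} { print \$1 }" values_a0_in_Xenon_10_20_10_25_Wcm2_RRon.txt)
    dirname="a0_${a0}"
    mkdir -p "$dirname"
    cd "$dirname"
    awk -v s="a0=${a0}" 'NR==6 {print s} 1 {print}' ../integration.py > integrationa0included.py
    mpirun -n 1 python3 integrationa0included.py &> integration_Xenon.log &
    cd ..
done
wait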

Related

Make a batch file to run a python script for a certain amount of time and then auto-close it

I need to make a batch file that deletes a file created each time a script starts, then sets a timer to stop the Python script and exit it so another command can run after a certain amount of time. Here's the code; I can't find any solution. The thing is that the Python program is a bit complicated and I can't modify it, so all I can do is create a batch or PowerShell file that does this task indefinitely. Here's the batch file:
cd C:\Users\Admin\Desktop\instabot.py-master
del /f user.session
instabot-py #user
#Here I need a line to stop the instabot-py running on cmd maybe using a condition before launching the script
I feel lost and can't figure out a way to do that. Ideally there would be a loop that runs all of this code again and again, every 5 minutes for example (run the Python program for 5 minutes, close it, then start again with cd, del, launch the program and stop it again).
PS: I'm on Windows 10 x64.
You can use powershell jobs for that:
$myScript = {
    cd c:\some\folder
    python .\py_script.py
}
$timeOut = 10
while ($true) {
    $job = Start-Job -ScriptBlock $myScript
    Start-Sleep $timeOut
    Remove-Item -Path .\some.file
    Stop-Job -Job $job
    Remove-Job -Job $job
}

Running parallel commands in bash

I have a situation where I have a directory "batches" containing several batch files:
one.txt
two.txt
...
seventy.txt
Each of these files needs to be processed by a python script as:
python processor.py --inputFile=batches/one.txt
My current implementation is as such:
for f in batches/*.txt
do
    python processor.py --inputFile="$f"
done
I have hundreds of batches, so running all of them in parallel as
python processor.py --inputFile="$f" &
is not feasible.
However, I think that running ~10 at a time shouldn't be a problem.
I'm aware that the syntax
{
    python processor.py --inputFile=batches/batchOne.txt
    python processor.py --inputFile=batches/batchTwo.txt
} &
{
    python processor.py --inputFile=batches/batchThree.txt
    python processor.py --inputFile=batches/batchFour.txt
}
should give me a result similar to the one I want. However, are there any better solutions? Basically, given a command template, in my case
python processor.py --inputFile=batches/$1
And a list of batches, I'd like to control how many get executed at the same time.
I'm working on Ubuntu Linux.
Try this to run 10 parallel executions:
parallel -j 10 command_line
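Applied to your layout, that could look something like this (just a sketch; {} is GNU parallel's placeholder for each input argument):
parallel -j 10 python processor.py --inputFile={} ::: batches/*.txt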
to install it
sudo apt-get install parallel
parallel is a great tool, but you don't always have the option to install additional packages on a system. You can emulate parallel with bash jobs.
Here is a small example:
#!/usr/bin/env bash
for FILE in /tmp/*.sh
do
    # count only running jobs
    JOBS=$(jobs -r | wc -l)
    while [[ ${JOBS} -ge 3 ]]
    do
        echo "RUNNING JOBS = ${JOBS} => WAIT"
        sleep 5 # too much, just for demo
        JOBS=$(jobs -r | wc -l) # re-count, otherwise this would loop forever
    done
    echo "APPEND ${FILE} TO JOBS QUEUE [JOBS: ${JOBS}]"
    bash "${FILE}" &
done
exit 0
Test:
$ grep '' /tmp/file*.sh
/tmp/file01.sh:sleep 8
/tmp/file02.sh:sleep 10
/tmp/file03.sh:sleep 5
/tmp/file04.sh:sleep 10
/tmp/file05.sh:sleep 8
/tmp/file06.sh:sleep 8
$ ./parallel.sh
APPEND /tmp/file01.sh TO JOBS QUEUE [JOBS: 0]
APPEND /tmp/file02.sh TO JOBS QUEUE [JOBS: 1]
APPEND /tmp/file03.sh TO JOBS QUEUE [JOBS: 2]
RUNNING JOBS = 3 => WAIT
APPEND /tmp/file04.sh TO JOBS QUEUE [JOBS: 2]
RUNNING JOBS = 3 => WAIT
APPEND /tmp/file05.sh TO JOBS QUEUE [JOBS: 1]
APPEND /tmp/file06.sh TO JOBS QUEUE [JOBS: 2]
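Adapted to your batches directory, the same idea would be something like this (a sketch, capping at 10 concurrent runs):
#!/usr/bin/env bash
for f in batches/*.txt
do
    # keep at most 10 copies of processor.py running at once
    while [ "$(jobs -r | wc -l)" -ge 10 ]
    do
        sleep 1
    done
    python processor.py --inputFile="$f" &
done
wait # let the last runs finish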

Running external commands partly in parallel from python (or bash)

I am running a python script which creates a list of commands which should be executed by a compiled program (proprietary).
The program can split some of the calculations to run independently, and the data will then be collected afterwards.
I would like to run these calculations in parallel as each are a very time consuming single threaded task and I have 16 cores available.
I am using subprocess to execute the commands (inside a class):
def run_local(self):
    # assumes: from subprocess import Popen, PIPE
    p = Popen(["someExecutable"], stdout=PIPE, stdin=PIPE)
    p.stdin.write(self.exec_string)
    p.stdin.flush()
    while p.poll() is None:  # loop while the process is still running
        line = p.stdout.readline()
        self.log(line)
Where self.exec_string is a string of all the commands.
This string can be split into an initial part, the part I want parallelised, and a finishing part.
How should I go about this?
Also, it seems the executable will "hang" (waiting for a command, e.g. "exit", which will release the memory) if a naive copy-paste of the current method is used for each part.
Bonus: the executable also has the option to run a bash script of commands, if it is easier/possible to parallelise in bash.
For bash, it could be very simple. Assuming your file looks like this:
## init part##
ls
cd ..
ls
cat some_file.txt
## parallel ##
heavycalc &
heavycalc &
heavycalc &
## finish ##
wait
cat results.txt
With & after a command, you tell bash to run that command as a background job. wait then blocks until all background jobs have finished, so you can be sure all calculations are done.
I've assumed your input txt file contains plain bash commands.
Using GNU Parallel:
## init
cd foo
cp bar baz
## parallel ##
parallel heavycalc ::: file1 file2 file3 > results.txt
## finish ##
cat results.txt
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
If you have 32 different jobs to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new process whenever one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Freeze stdin when in the background, unfreeze it when in the foreground

I am trying to run a script in the background:
nohup script.py > out 2> err < /dev/null &
The script (Python 3.4) does at some point:
answer = input('? ')
(it has a menu running in one of the threads)
And the nohup call is crashing with:
EOFError: EOF when reading a line
Because of the /dev/null redirection of stdin I imagine. If I run it without stdin redirection:
nohup script.py > out 2> err &
It crashes with:
OSError: [Errno 9] Bad file descriptor
If I run it with:
script.py > out 2> err
It works, but blocks my terminal (it is in the foreground)
If I run it with:
script.py > out 2> err &
It runs in the background alright, but it gets stopped as soon as the input call is reached.
What I would like is:
be able to redirect stdout and stderr to the filesystem
be able to put the script in the background
be able to move it to the foreground and interact with the menu normally (so stdin must be enabled somehow). stdout and stderr would still be redirected to the filesystem, but stdin would behave normally.
the script must run fine in the background and in the foreground (of course, the menu is not working in the background, because stdin is "frozen")
Basically, what I would like is that when it is in the background, stdin is kind of "frozen", and whenever it comes to the foreground it works again normally.
Is this possible? The solution does not need to involve nohup.
What you want (given how input works and fails on EOF under Python) with an interactive menu means that you cannot safely redirect stdin from a file when invoking your program. This means your only option is to invoke it like so:
$ script.py > out 2> err &
As a demonstration, this is my script:
from time import sleep
import sys

c = 0
while True:
    sleep(0.001)
    c += 1
    if c % 1000 == 0:
        print(c, flush=True)
    if c % 2000 == 0:
        print(c, file=sys.stderr, flush=True)
    if c % 10000 == 0:
        answer = input('? ')
        print('The answer is %s' % answer, flush=True)
Essentially, every second it will write to stdout, every two seconds it will write to stderr, and every ten seconds it will wait for input. If I run this, wait a couple of seconds (to allow the output to be flushed to disk), and chain it together like so:
$ python script.py > out 2> err & sleep 2.5; cat out err
[1] 32123
1000
2000
2000
$
Wait at least 10 seconds and try cat out err again:
$ cat out err
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
? 2000
4000
6000
8000
10000
[1]+ Stopped python script.py > out 2> err
$
Note that the prompt generated by input is also written to stdout, and the program effectively continued running up to where it is expecting stdin to give it data. You simply have to bring the process back into foreground by %, and start feeding it the required data, then suspend with ^Z (CtrlZ) and keep it running again in the background with %&. Example:
$ %
python script.py > out 2> err
Test input
^Z
[1]+ Stopped python script.py > out 2> err
$ %&
[1]+ python script.py > out 2> err &
$
Now cat out again, after waiting another ten seconds:
$ cat out
1000
...
10000
? The answer is Test input
11000
...
20000
?
[1]+ Stopped python script.py > out 2> err
$
This is essentially a basic crash course in how standard processes typically function in both the foreground and the background; things simply work as intended if the code handles standard IO correctly.
Lastly, you can't really have it both ways. If the application expects stdin and none is provided, the clear outcome is failure. If stdin is provided but the application is sent to the background and keeps running, it will be Stopped as soon as it expects further input. If this stopped behaviour is unwanted, the application is at fault; nothing can be done except changing the application so it does not raise an error when EOF is encountered while running with /dev/null as its stdin. If you want to keep stdin as it is, with the application somehow able to keep running in the background, you cannot use the input function, as it will block when stdin is empty (resulting in the process being stopped).
Now that you have clarified via the comment below that your "interactive prompt" is running in a thread, and since input reads directly from stdin and you seem unwilling to modify your program (you asked for the general case) but expect a utility to do this for you, the simple solution is to run it within a tmux or screen session, as they fully implement a pseudo-tty that is independent of whichever console started them (so you can disconnect, send the session to the background, or start other virtual sessions; see the manual pages), which provides the stdio the program expects.
Finally, if you actually want your application to support this natively, you cannot simply use input as is; you should either check whether input can safely be called (perhaps making use of select), or check whether the process is currently in the foreground or background (an example you could start from is How to detect if python script is being run as a background process, although you might want to check using sys.stdin, maybe) to determine whether input can be safely called (although if the user suspends the task just as input is called, it will still hang while input waits), or use unix sockets for communication.

Parallel processing from a command queue on Linux (bash, python, ruby... whatever)

I have a list/queue of 200 commands that I need to run in a shell on a Linux server.
I only want to have a maximum of 10 processes running (from the queue) at once. Some processes will take a few seconds to complete, other processes will take much longer.
When a process finishes I want the next command to be "popped" from the queue and executed.
Does anyone have code to solve this problem?
Further elaboration:
There's 200 pieces of work that need to be done, in a queue of some sort. I want to have at most 10 pieces of work going on at once. When a thread finishes a piece of work it should ask the queue for the next piece of work. If there's no more work in the queue, the thread should die. When all the threads have died it means all the work has been done.
The actual problem I'm trying to solve is using imapsync to synchronize 200 mailboxes from an old mail server to a new mail server. Some users have large mailboxes and take a long time to sync; others have very small mailboxes and sync quickly.
On the shell, xargs can be used to queue parallel command processing. For example, for having always 3 sleeps in parallel, sleeping for 1 second each, and executing 10 sleeps in total do
echo {1..10} | xargs -d ' ' -n1 -P3 sh -c 'sleep 1s' _
And it would sleep for 4 seconds in total. If you have a list of names and want to pass each name to the command being executed, again running 3 commands in parallel, do
cat names | xargs -n1 -P3 process_name
which would execute the commands process_name alice, process_name bob, and so on.
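Applied to the imapsync case (just a sketch, assuming, as in the makefile answer below, that imapsync takes the mailbox name as its argument and that userlist holds one name per line):
cat userlist | xargs -n1 -P10 imapsync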
I would imagine you could do this using make and the make -j xx command.
Perhaps a makefile like this (recipe lines must be indented with a tab):
all: usera userb userc....
usera:
	imapsync usera
userb:
	imapsync userb
....
and then run it with:
make -j 10 -f makefile
Parallel is made exactly for this purpose.
cat userlist | parallel imapsync
One of the beauties of Parallel compared to other solutions is that it makes sure output is not mixed. Doing traceroute in Parallel works fine for example:
(echo foss.org.my; echo www.debian.org; echo www.freenetproject.org) | parallel traceroute
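To cap the imapsync run above at the 10 concurrent syncs you asked for, -j sets the number of job slots (a sketch):
cat userlist | parallel -j 10 imapsync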
For this kind of job, PPSS (Parallel Processing Shell Script) was written. Google the name and you will find it; I won't linkspam.
GNU make (and perhaps other implementations as well) has the -j argument, which governs how many jobs it will run at once. When a job completes, make will start another one.
Well, if they are largely independent of each other, I'd think in terms of:
Initialize an array of jobs pending (queue, ...) - 200 entries
Initialize an array of jobs running - empty

while (jobs still pending and queue of jobs running still has space)
    take a job off the pending queue
    launch it in background
    if (queue of jobs running is full)
        wait for a job to finish
        remove from jobs running queue

while (queue of jobs running is not empty)
    wait for a job to finish
    remove from jobs running queue
Note that the tail test in the main loop ensures that the 'jobs running' queue has space again by the time the while loop reiterates, preventing premature termination of the loop. I think the logic is sound.
I can see how to do that in C fairly easily - it wouldn't be all that hard in Perl, either (and therefore not too hard in the other scripting languages - Python, Ruby, Tcl, etc). I'm not at all sure I'd want to do it in shell - the wait command in shell waits for all children to terminate, rather than for some child to terminate.
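For what it's worth, bash 4.3 and newer do have wait -n, which waits for a single child to terminate rather than for all of them, so the pseudocode above can be sketched fairly directly in shell (a rough sketch, assuming the 200 commands sit one per line in a file, here called commands.txt purely for illustration):
#!/usr/bin/env bash
MAX_JOBS=10
while IFS= read -r cmd
do
    # if the "running" queue is full, block until any one job finishes
    while [ "$(jobs -pr | wc -l)" -ge "$MAX_JOBS" ]
    do
        wait -n
    done
    # pop the next command off the pending queue and launch it in the background
    bash -c "$cmd" &
done < commands.txt
# drain the running queue
wait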
In python, you could try:
import Queue, os, threading

# synchronised queue
queue = Queue.Queue(0) # 0 means no maximum size

# do stuff to initialise queue with strings
# representing os commands
queue.put('sleep 10')
queue.put('echo Sleeping..')
# etc
# or use python to generate commands, e.g.
# for username in ['joe', 'bob', 'fred']:
#     queue.put('imapsync %s' % username)

def go():
    while True:
        try:
            # False here means no blocking: raise exception if queue empty
            command = queue.get(False)
            # Run command. python also has subprocess module which is more
            # featureful but I am not very familiar with it.
            # os.system is easy :-)
            os.system(command)
        except Queue.Empty:
            return

for i in range(10): # change this to run more/fewer threads
    threading.Thread(target=go).start()
Untested...
(Of course, CPython's GIL means only one thread executes Python code at a time, but you still get the benefit of multiple threads here because they spend most of their time waiting on the spawned commands.)
If you are going to use Python, I recommend using Twisted for this.
Specifically Twisted Runner.
https://savannah.gnu.org/projects/parallel (gnu parallel)
and pssh might help.
Python's multiprocessing module would seem to fit your issue nicely. It's a high-level package that provides a threading-like API while running the work in separate processes.
A simple zsh function to parallelize jobs across no more than 4 subshells, using lock files in /tmp.
The only non-trivial part is the glob flags in the first test:
#q: enable filename globbing in a test
[4]: returns the 4th result only
N: ignore error on empty result
It should be easy to convert it to posix, though it would be a bit more verbose.
Do not forget to escape any quotes in the jobs with \".
#!/bin/zsh
setopt extendedglob
para() {
    lock=/tmp/para_$$_$((paracnt++))
    # sleep as long as the 4th lock file exists
    until [[ -z /tmp/para_$$_*(#q[4]N) ]] { sleep 0.1 }
    # Launch the job in a subshell
    ( touch $lock ; eval $* ; rm $lock ) &
    # Wait for subshell start and lock creation
    until [[ -f $lock ]] { sleep 0.001 }
}
para "print A0; sleep 1; print Z0"
para "print A1; sleep 2; print Z1"
para "print A2; sleep 3; print Z2"
para "print A3; sleep 4; print Z3"
para "print A4; sleep 3; print Z4"
para "print A5; sleep 2; print Z5"
# wait for all subshells to terminate
wait
Can you elaborate on what you mean by "in parallel"? It sounds like you need to implement some sort of locking in the queue so your entries are not selected twice, etc., and so the commands run only once.
Most queue systems cheat -- they just write a giant to-do list, then select e.g. ten items, work them, and select the next ten items. There's no parallelization.
If you provide some more details, I'm sure we can help you out.
