Multiprocessing file reading and output - python

I am trying to process files in over 1000 directories. However, since the directories are very large, I have to run the processing concurrently to save time. The dir[start:end] slice is there so that each call only has 100 directories to process. However, the second Thread does not start until the first Thread has finished.
def Sort(start, end):
    dir = os.listdir(filepath)
    dir = dir[start:end]
    for file in dir:
        # process the file
        print(result)

if __name__ == '__main__':
    Thread(target = Sort(0,100)).start()
    Thread(target = Sort(100,200)).start()
I have tried duplicating the Sort function as Sort2(), but this yields the same result:
Thread(target = Sort(0,100)).start()
Thread(target = Sort2(100,200)).start()
Any help or direction is greatly appreciated.

The problem is that Sort and Sort2 are being evaluated before you start your threads. When you say Sort(...), the () is an operator that executes the Sort function.
You'll have to use the args parameter for the Thread constructor. For example:
Thread(target=Sort, args=(0,100)).start()
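Putting it together, a minimal corrected sketch of the original example might look like this (assuming filepath points at the directory being scanned and the per-file processing is filled in):

import os
from threading import Thread

filepath = '.'  # assumption: the directory being scanned

def Sort(start, end):
    for name in os.listdir(filepath)[start:end]:
        # process the file here
        print(name)

if __name__ == '__main__':
    t1 = Thread(target=Sort, args=(0, 100))
    t2 = Thread(target=Sort, args=(100, 200))
    t1.start()
    t2.start()
    t1.join()
    t2.join()

Both threads now start immediately, because the function object itself is passed as target and the arguments are delivered via args.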

import multiprocessing

def worker_sort(start, end):
    # includes the code of your sorting function
    pass

if __name__ == '__main__':
    jobs = []
    for i in range(2):
        p = multiprocessing.Process(target=worker_sort, args=(i * 100, (i + 1) * 100))
        jobs.append(p)
        p.start()
I'd recommend some reading about the multiprocessing module in the official documentation

Related

Automatically restarting Python sub-processes using identical arguments

I have a python script which calls a series of sub-processes. They need to run "for ever" - but they occasionally die, or get killed. When this happens I need to restart the process using the same arguments as the one which died.
This is a very simplified version:
[edit: this is the less simplified version, which includes "restart" code]
import multiprocessing
import time
import random

def printNumber(number):
    print("starting :", number)
    while random.randint(0, 5) > 0:
        print(number)
        time.sleep(2)

if __name__ == '__main__':
    children = []  # list
    args = {}      # dictionary
    for processNumber in range(10, 15):
        p = multiprocessing.Process(
            target=printNumber,
            args=(processNumber,)
        )
        children.append(p)
        p.start()
        args[p.pid] = processNumber

    while True:
        time.sleep(1)
        for n, p in enumerate(children):
            if not p.is_alive():
                # get the parameters the dead child was started with
                pidArgs = args[p.pid]
                del args[p.pid]
                print("n, args, p: ", n, pidArgs, p)
                children.pop(n)
                # start a new process with the same args
                p = multiprocessing.Process(
                    target=printNumber,
                    args=(pidArgs,)
                )
                children.append(p)
                p.start()
                args[p.pid] = pidArgs
I have updated the example to illustrate how I want the processes to be restarted if one crashes / is killed / etc., keeping track of which pid was started with which args.
Is this the "best" way to do this, or is there a more "Pythonic" way of doing it?
I think I would create a separate thread for each process and use a ProcessPoolExecutor. Executors have a useful method, submit, which returns a Future. You can wait on each Future and re-launch the executor when the Future is done. The arguments to the function are stored as instance attributes of the thread, so restarting is just a simple loop.
import threading
from concurrent.futures import ProcessPoolExecutor
import time
import random
import traceback

def printNumber(number):
    print("starting :", number)
    while random.randint(0, 5) > 0:
        print(number)
        time.sleep(2)

class KeepRunning(threading.Thread):
    def __init__(self, func, *args, **kwds):
        self.func = func
        self.args = args
        self.kwds = kwds
        super().__init__()

    def run(self):
        while True:
            with ProcessPoolExecutor(max_workers=1) as pool:
                future = pool.submit(self.func, *self.args, **self.kwds)
                try:
                    future.result()
                except Exception:
                    traceback.print_exc()

if __name__ == '__main__':
    for process_number in range(10, 15):
        keep = KeepRunning(printNumber, process_number)
        keep.start()

    while True:
        time.sleep(1)
At the end of the program is a loop to keep the main thread running. Without that, the program will attempt to exit while your Processes are still running.
For the example you provided, I would just remove the exit condition from the while loop and change it to True.
As you said though, the actual code is more complicated (why didn't you post that?). So if the process gets terminated by, let's say, an exception, just put the code inside a try/except block. You can then put that block in an infinite loop.
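A minimal sketch of that idea (my own illustration, with do_work standing in for the real sub-process body):

import random
import time

def do_work():
    # stand-in for the real work of the sub-process
    if random.random() < 0.3:
        raise RuntimeError("simulated crash")
    time.sleep(1)

while True:                 # restart forever
    try:
        do_work()
    except Exception as exc:
        print("worker died, restarting:", exc)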
I hope this is what you are looking for; that seems to be the right way to do it given the goal and the information you provided.
Instead of just starting the processes and forgetting about them, you can save the list of processes together with their arguments, and keep a loop (or another process) that checks whether they are still alive.
For example:
if __name__ == '__main__':
    process_list = []
    for processNumber in range(5):
        process = multiprocessing.Process(
            target=printNumber,
            args=(processNumber,)
        )
        process_list.append((process, processNumber))
        process.start()

    while True:
        for running_process, process_args in list(process_list):
            if not running_process.is_alive():
                new_process = multiprocessing.Process(target=printNumber, args=(process_args,))
                process_list.remove((running_process, process_args))  # Remove terminated process
                process_list.append((new_process, process_args))
                new_process.start()
I must say I'm not sure the best way to do this is in Python; you may want to look at scheduler services like Jenkins or something similar.

How can I check that the Process class from Python Multiprocessing has worked?

I've written the following code, which runs a function that performs a stochastic simulation of a series of chemical reactions:
v = range(1, 51)

def parallelfunc(*v):
    gillespie_tau_leaping(start_state, LHS, stoch_rate, state_change_array)

def info(title):
    print(title)
    print('module name:', __name__)
    print('parent process:', os.getppid())
    print('process id:', os.getpid())

if __name__ == '__main__':
    info('main line')
    start = datetime.utcnow()
    p = Process(target=parallelfunc, args=(v))
    p.start()
    p.join()
    end = datetime.utcnow()
    sim_time = end - start
    print(f"Simulation utc time:\n{sim_time}")
I'm using the Process class from the multiprocessing library and am trying to run gillespie_tau_leaping 50 times.
Only I'm not sure if it's working. gillespie_tau_leaping prints out a number of values to the terminal, but these values are only printed once; I'd expect them to be printed 50 times.
I tried using the os.getpid() etc. calls, and this returns the following to the terminal:
main line
module name: __main__
parent process: 6188
process id: 27920
How can I tell if my code has worked, and how can I get it to print the values from gillespie_tau_leaping 50 times to the terminal?
Cheers
Your code is running just one process: the call to Process spawns a new process, but you only do it once (not in a loop).
I would suggest using a multiprocessing Pool.
Your code could be something like this:
from multiprocessing import Pool

def parallelfunc(*args):
    do_something()

def main():
    # create a list of argument lists for the function invocations
    func_args = [['arg1call1', 'arg2call1', 'arg3call1'], ['arg1call2', 'arg2call2', 'arg3call2']]
    with Pool() as p:
        results = p.map(parallelfunc, func_args)
        # do something with results, which is a list of results
By default a multiprocessing Pool creates the same number of processes as you have CPU cores, and it manages the pool until the end of the processing, taking care of all the inter-process communication.
This is really handy, because synchronizing processes can be hard.
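Applied to the question, a minimal sketch might look like this (run_simulation is a hypothetical stand-in for the call to gillespie_tau_leaping with its real arguments):

from multiprocessing import Pool

def run_simulation(run_index):
    # stand-in for gillespie_tau_leaping(start_state, LHS, stoch_rate, state_change_array)
    return run_index * run_index

if __name__ == '__main__':
    with Pool() as pool:
        results = pool.map(run_simulation, range(50))  # 50 independent runs
    print(f"collected {len(results)} results")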
Hope this helps

How to run 10 python programs simultaneously?

I have a_1.py~a_10.py
I want to run 10 python programs in parallel.
I tried:
from multiprocessing import Process
import os

def info(title):
    # I want to execute the python programs here
    pass

def f(name):
    for i in range(1, 11):
        subprocess.Popen(['python3', f'a_{i}.py'])

if __name__ == '__main__':
    info('main line')
    p = Process(target=f)
    p.start()
    p.join()
but it doesn't work
How do I solve this?
I would suggest using the subprocess module instead of multiprocessing:
import os
import subprocess
import sys

MAX_SUB_PROCESSES = 10

def info(title):
    print(title, flush=True)

if __name__ == '__main__':
    info('main line')

    # Create a list of subprocesses.
    processes = []
    for i in range(1, MAX_SUB_PROCESSES + 1):
        pgm_path = f'a_{i}.py'  # Path to Python program.
        command = [sys.executable, pgm_path, os.path.basename(pgm_path)]
        process = subprocess.Popen(command, bufsize=0)
        processes.append(process)

    # Wait for all of them to finish.
    for process in processes:
        process.wait()

    print('Done')
If you just need to call 10 external Python scripts (a_1.py ~ a_10.py) as separate processes, use the subprocess.Popen class:
import subprocess, sys

for i in range(1, 11):
    subprocess.Popen(['python3', f'a_{i}.py'])
# sys.exit()  # optional
It's worth looking at the rich subprocess.Popen signature (you may find some useful parameters/options).
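For example, a couple of commonly useful options (my own sketch, not part of the original answer):

import subprocess, sys

procs = []
for i in range(1, 11):
    p = subprocess.Popen(
        [sys.executable, f'a_{i}.py'],   # sys.executable avoids hard-coding 'python3'
        stdout=subprocess.PIPE,          # capture each script's output
        stderr=subprocess.STDOUT,
    )
    procs.append(p)

for p in procs:
    out, _ = p.communicate()             # wait for the script and collect its output
    print(out.decode())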
You can use a multiprocessing pool to run them concurrently.
import multiprocessing as mp

def worker(module_name):
    """Run a module by importing it in a worker process."""
    __import__(module_name)
    return

if __name__ == "__main__":
    max_processes = 5
    module_names = [f"a_{i}" for i in range(1, 11)]
    print(module_names)
    with mp.Pool(max_processes) as pool:
        pool.map(worker, module_names)
The max_processes variable is the maximum number of workers active at any given time; in other words, it's the number of processes spawned by your program. pool.map(worker, module_names) uses the available processes and calls worker on each item in your module_names list. We don't include the .py because we're running each module by importing it.
Note: this might not work if the code you want to run in your modules is contained inside if __name__ == "__main__" blocks. If that is the case, my recommendation would be to move all the code in the if __name__ == "__main__" blocks of the a_{} modules into a main function, and change the worker to something like:
def worker(module_name):
    module = __import__(module_name)  # roughly like 'import module_name as module'
    module.main()
    return
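For that to work, each a_{i}.py would need a structure along these lines (a hypothetical sketch):

# a_1.py (and similarly for a_2.py ... a_10.py)
def main():
    print("doing the real work of a_1")

if __name__ == "__main__":
    main()   # still runs when the script is executed directly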

Python Multiprocessing; Infinite Processes

I have a function in my main Python file that does some multiprocessing, and it works fine:
if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=len(directories))
    pool.map(worker, directories)
However, I imported a .py file from another directory, in which I try to do exactly the same:
# Main file
import multiprocessing
read_DataFiles.test(os.getcwd())

# Imported file
directories = ["x", "x", "x"]

def worker(sample):
    File = open('test' + sample + '.bat', 'w')
    File.close()
    1 == 1

def test(path):
    if __name__ == 'read_DataFiles':
        pool = multiprocessing.Pool(processes=8)
        print pool.map(worker, directories)
This doesn't stop: it keeps creating new processes. Does anyone see what I'm doing wrong?
The difference is around the if __name__ == ... check. On Windows, multiprocessing works by creating new processes and re-importing your code in each of them. I'm fairly sure you call test(path) from the top level of another module. The check inside this function, if __name__ == 'read_DataFiles':, is pointless: it is always true, which means it will always start a new pool. What you want instead is an if __name__ == '__main__' check in the main script, and to call test(path) only when that is the case.
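In other words, the structure could look roughly like this (a sketch of the suggested fix, with placeholder worker code):

# read_DataFiles.py
import multiprocessing

directories = ["x", "y", "z"]

def worker(sample):
    return sample.upper()   # placeholder for the real per-directory work

def test(path):
    # no __name__ check needed here; this is only called from the guarded main script
    pool = multiprocessing.Pool(processes=2)
    print(pool.map(worker, directories))
    pool.close()
    pool.join()

# main file
import os
import read_DataFiles

if __name__ == '__main__':       # the guard lives in the main script
    read_DataFiles.test(os.getcwd())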

Python parallel threads

Here is the code, which downloads files and does something with them.
But before starting Thread2, it waits until Thread1 has finished. How can I make them run together?
Please give some examples with commentary. Thanks.
import threading
import urllib.request

def testForThread1():
    print('[Thread1]::Started')
    resp = urllib.request.urlopen('http://192.168.85.16/SOME_FILE')
    data = resp.read()
    # Do something with it
    return 'ok'

def testForThread2():
    print('[Thread2]::Started')
    resp = urllib.request.urlopen('http://192.168.85.10/SOME_FILE')
    data = resp.read()
    # Do something with it
    return 'ok'

if __name__ == "__main__":
    t1 = threading.Thread(name="Hello1", target=testForThread1())
    t1.start()
    t2 = threading.Thread(name="Hello2", target=testForThread2())
    t2.start()
    print(threading.enumerate())
    t1.join()
    t2.join()
    exit(0)
You are executing the target function yourself at the moment you create the thread instance.
if __name__ == "__main__":
    t1 = threading.Thread(name="Hello1", target=testForThread1())  # <<-- here
    t1.start()
This is equivalent to:
if __name__ == "__main__":
    result = testForThread1()  # == 'ok', this is the blocking execution
    t1 = threading.Thread(name="Hello1", target=result)
    t1.start()
It's Thread.start()'s job to execute that function in the new thread (note that a plain threading.Thread discards the target's return value, so you need your own mechanism if you want the result back). As you can see, the previous form was executing the blocking function in the main thread, which prevented any parallelism: it had to finish that function before even reaching the line that creates the second thread.
The proper way to set the thread in a non-blocking fashion would be:
if __name__ == "__main__":
    t1 = threading.Thread(name="Hello1", target=testForThread1)  # tell the thread what the target function is
    # notice there are no call parentheses after "testForThread1"
    t1.start()  # tell the thread to execute the target function
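Applied to the original example, the corrected main block might look like this (a sketch; the two testForThread functions stay exactly as they are):

if __name__ == "__main__":
    t1 = threading.Thread(name="Hello1", target=testForThread1)
    t2 = threading.Thread(name="Hello2", target=testForThread2)
    t1.start()   # both downloads now start immediately
    t2.start()
    print(threading.enumerate())
    t1.join()    # wait for both to finish
    t2.join()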
We could also use threading here, but for CPU-bound work like the demo below it isn't efficient: because of the GIL, only one thread executes Python bytecode at a time, so the total time ends up close to the sum of the individual run times.
For CPU-bound work on a multi-core machine, multiprocessing is the better way.
import multiprocessing

def test_function():
    for i in range(199999998):
        pass

if __name__ == '__main__':
    t1 = multiprocessing.Process(target=test_function)
    t2 = multiprocessing.Process(target=test_function)
    t1.start()
    t2.start()
    t1.join()
    t2.join()
This is the fastest solution. You can check it using the following command:
time python3 filename.py
you will get the following output like this:
real 0m6.183s
user 0m12.277s
sys 0m0.009s
Here, user is the CPU time spent executing the Python code and real is the wall-clock time; for a single-threaded run you would expect real to be roughly user + sys.
That relation clearly doesn't hold here: each call to test_function takes about 6.1 seconds of CPU time, so user adds up to roughly 12.3 seconds, yet the real time is only about 6.18 seconds because the two processes ran in parallel. Multiprocessing has roughly halved the total wall-clock time.
You can read more about the real/user/sys output in the documentation of the time command.
