I am trying to run a function in parallel for multiple files and want all of them to finish before a certain point in the program.
For example, there is a loop:
def main():
    for item in list:
        function_x(item)
    function_y(list)
Now what I want is for function_x to run in parallel for all items, but it must have finished for every item before function_y is called.
I am planning to use Celery for this, but I cannot figure out how to do it.
Here is my final test code. All I needed to do was use the multiprocessing library.
from multiprocessing import Process
from time import sleep

Pros = []

def function_x(i):
    for j in range(0, 5):
        sleep(3)
        print i

def function_y():
    print "done"

def main():
    for i in range(0, 3):
        print "Thread Started"
        p = Process(target=function_x, args=(i,))
        Pros.append(p)
        p.start()

    # block until all the threads finish (i.e. block until all function_x calls finish)
    for t in Pros:
        t.join()

    function_y()
You can use threads for this. Thread.join() is the method you need; it blocks until the thread has finished. You can do this:
import threading

threads = []

def main():
    for item in list:
        t = threading.Thread(target=function_x, args=(item,))
        threads.append(t)
        t.start()

    # block until all the threads finish (i.e. until all function_x calls finish)
    for t in threads:
        t.join()

    function_y(list)
You can do this elegantly with Ray, which is a library for writing parallel and distributed Python.
Simply declare function_x with the @ray.remote decorator; it can then be executed in parallel by invoking function_x.remote, and the results can be retrieved with ray.get.
import ray
import time

ray.init()

@ray.remote
def function_x(item):
    time.sleep(1)
    return item

def function_y(list):
    pass

list = [1, 2, 3, 4]

# Process the items in parallel.
results = ray.get([function_x.remote(item) for item in list])

function_y(list)
View the Ray documentation.
Here is the documentation for celery groups, which is what I think you want. Use AsyncResult.get() instead of AsyncResult.ready() to block.
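For what it's worth, here is a minimal sketch of the group approach, assuming an already-configured Celery app; the broker/backend URLs and the placeholder task bodies below are assumptions, not part of the original question:

from celery import Celery, group

# Assumed broker/backend; adjust to your setup.
app = Celery("tasks", broker="redis://localhost", backend="redis://localhost")

@app.task
def function_x(item):
    return item            # placeholder body

def function_y(items):
    print("done", items)   # placeholder body

def main(items):
    job = group(function_x.s(item) for item in items)  # one task per item
    result = job.apply_async()                         # run them in parallel
    result.get()         # blocks until every function_x task has finished
    function_y(items)    # only runs after all items are processed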
#!/bin/env python

import concurrent.futures

def function_x(item):
    return item * item

def function_y(lst):
    return [x * x for x in lst]

a_list = range(10)

if __name__ == '__main__':
    with concurrent.futures.ThreadPoolExecutor(10) as tp:
        future_to_function_x = {
            tp.submit(function_x, item): item
            for item in a_list
        }

        results = {}
        for future in concurrent.futures.as_completed(future_to_function_x):
            item = future_to_function_x[future]
            try:
                res = future.result()
            except Exception as e:
                print('Exception when processing item "%s": %s' % (item, e))
            else:
                results[item] = res

        print('results:', results)

        after = function_y(results.values())
        print('after:', after)
Output:
results: {0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81}
after: [0, 1, 16, 81, 256, 625, 1296, 2401, 4096, 6561]
I have some code in Python and I want to run it with multiprocessing.
import multiprocessing as mp
from multiprocessing.sharedctypes import Value
import time
import math

resault_a = []
resault_b = []
resault_c = []

def make_calculation_one(numbers):
    for number in numbers:
        resault_a.append(math.sqrt(number**3))

def make_calculation_two(numbers):
    for number in numbers:
        resault_a.append(math.sqrt(number**4))

def make_calculation_three(numbers):
    for number in numbers:
        resault_c.append(math.sqrt(number**5))

number_list = list(range(1000000))

if __name__ == "__main__":
    mp.set_start_method("fork")
    p1 = mp.Process(target=make_calculation_one, args=(number_list))
    p2 = mp.Process(target=make_calculation_two, args=(number_list))
    p3 = mp.Process(target=make_calculation_three, args=(number_list))
    start = time.time()
    p1.start()
    p2.start()
    p3.start()
    end = time.time()
    print(end - start)
I get an empty array. Where is the problem?
I also got some errors:
Process Process-1:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
How can I fix it?
Thanks!
There are several issues with your code:
The major problem is that the args argument to the Process initializer requires a tuple or list. You are specifying args=(number_list). The parentheses around number_list do not make this a tuple; without a trailing comma you just have a parenthesized expression, i.e. the list itself. So instead of passing a single argument that is a list, you are passing 1,000,000 arguments, while your "worker" functions only take one argument. You need: args=(number_list,).
Your worker functions are doing calculations but neither printing nor returning the results of these calculations. Assuming you want to return the results, you need a mechanism for doing so. If you are using multiprocessing.Process, then the usual solution is to pass the worker function a multiprocessing.Queue instance to which it can put its results (see below). You can also use a multiprocessing pool (also see below).
Your timing is not quite right. You start the child processes and immediately set end without waiting for the tasks to complete. To measure the actual elapsed time, end should only be set once the child processes have finished producing their results.
Using Process with queues
import multiprocessing as mp
import time
import math

def make_calculation_one(numbers, out_q):
    out_q.put([math.sqrt(number**3) for number in numbers])

def make_calculation_two(numbers, out_q):
    out_q.put([math.sqrt(number**4) for number in numbers])

def make_calculation_three(numbers, out_q):
    out_q.put([math.sqrt(number**5) for number in numbers])

if __name__ == "__main__":
    # We only want one copy of `number_list`, i.e. in our main process.
    # But there is actually no need to convert to a list:
    number_list = range(1000000)

    mp.set_start_method("fork")

    out_q_1 = mp.Queue()
    out_q_2 = mp.Queue()
    out_q_3 = mp.Queue()

    # Create 3 processes:
    p1 = mp.Process(target=make_calculation_one, args=(number_list, out_q_1))
    p2 = mp.Process(target=make_calculation_two, args=(number_list, out_q_2))
    p3 = mp.Process(target=make_calculation_three, args=(number_list, out_q_3))

    start = time.time()
    p1.start()
    p2.start()
    p3.start()

    results = []
    # Get return values (these must be read before joining):
    results.append(out_q_1.get())
    results.append(out_q_2.get())
    results.append(out_q_3.get())

    end = time.time()

    p1.join()
    p2.join()
    p3.join()

    print(end - start)
Using a shared memory array to pass the number list and to return the results
import multiprocessing as mp
import time
import math

def make_calculation_one(numbers, results):
    for idx, number in enumerate(numbers):
        results[idx] = math.sqrt(number**3)

def make_calculation_two(numbers, results):
    for idx, number in enumerate(numbers):
        results[idx] = math.sqrt(number**4)

def make_calculation_three(numbers, results):
    for idx, number in enumerate(numbers):
        results[idx] = math.sqrt(number**5)

if __name__ == "__main__":
    # We only want one copy of `number_list`, i.e. in our main process
    number_list = mp.RawArray('d', range(1000000))

    mp.set_start_method("fork")

    results_1 = mp.RawArray('d', len(number_list))
    results_2 = mp.RawArray('d', len(number_list))
    results_3 = mp.RawArray('d', len(number_list))

    # Create 3 processes:
    p1 = mp.Process(target=make_calculation_one, args=(number_list, results_1))
    p2 = mp.Process(target=make_calculation_two, args=(number_list, results_2))
    p3 = mp.Process(target=make_calculation_three, args=(number_list, results_3))

    start = time.time()
    p1.start()
    p2.start()
    p3.start()

    p1.join()
    p2.join()
    p3.join()

    end = time.time()
    print(end - start)
Using a multiprocessing pool
import multiprocessing as mp
import time
import math

def make_calculation_one(numbers):
    return [math.sqrt(number**3) for number in numbers]

def make_calculation_two(numbers):
    return [math.sqrt(number**4) for number in numbers]

def make_calculation_three(numbers):
    return [math.sqrt(number**5) for number in numbers]

if __name__ == "__main__":
    # We only want one copy of `number_list`, i.e. in our main process
    number_list = range(1000000)

    mp.set_start_method("fork")

    # Create pool of size 3:
    pool = mp.Pool(3)

    start = time.time()
    async_results = []
    async_results.append(pool.apply_async(make_calculation_one, args=(number_list,)))
    async_results.append(pool.apply_async(make_calculation_two, args=(number_list,)))
    async_results.append(pool.apply_async(make_calculation_three, args=(number_list,)))

    # Now wait for results:
    results = [async_result.get() for async_result in async_results]
    end = time.time()

    pool.close()
    pool.join()

    print(end - start)
Conclusion
Since your calculations yield a type readily supported by shared memory, the second code example above should result in the best performance. You could also adapt the multiprocessing pool example to use shared memory.
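For example, here is one possible sketch of that adaptation (not the original poster's code): with the "fork" start method the pool workers inherit the shared RawArrays as globals, so each task only needs to be passed a power and an output index. This assumes Linux/macOS, since "fork" is not available on Windows.

import math
import multiprocessing as mp
import time

N = 1_000_000

def calculate(power, out_index):
    # `number_list` and `results` are globals inherited by the workers via fork.
    out = results[out_index]
    for idx in range(N):
        out[idx] = math.sqrt(number_list[idx] ** power)

if __name__ == "__main__":
    mp.set_start_method("fork")
    number_list = mp.RawArray('d', range(N))
    results = [mp.RawArray('d', N) for _ in range(3)]

    start = time.time()
    with mp.Pool(3) as pool:
        async_results = [pool.apply_async(calculate, args=(power, i))
                         for i, power in enumerate((3, 4, 5))]
        for async_result in async_results:
            async_result.get()   # wait for all three tasks to finish
    print(time.time() - start)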
I'm getting some other error:
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
TypeError: make_calculation_one() takes 1 positional argument but 1000000 were given
but if I change these line accordingly then it works:
p1 = mp.Process(target=make_calculation_one, args=([number_list]))
p2 = mp.Process(target=make_calculation_two, args=([number_list]))
p3 = mp.Process(target=make_calculation_three, args=([number_list]))
The function that is run in a worker Process cannot access data in the parent process.
If the "fork" start method is used, it would have access to the copy of that data in the forked process.
But modifying that would not alter the value in the parent process.
In this case, the easiest thing to do is to create a multiprocessing.Array and pass that to the process to use.
import math
import multiprocessing as mp

def make_calculation_one(numbers, res):
    for idx, number in enumerate(numbers):
        res[idx] = math.sqrt(number**3)

number_list = list(range(10000))

if __name__ == "__main__":
    result_a = mp.Array("d", len(number_list))
    p1 = mp.Process(target=make_calculation_one, args=(number_list, result_a))
    p1.start()
    p1.join()
    print(sum(result_a))
This code prints the value 3999500012.4745193.
I am using this library for parallel list processing: https://github.com/npryce/python-parallelize (a Java fork/join-like implementation).
This code works as expected:
func = lambda x: x is not None
for i in parallelize([1, None, "3", "zsh"]):
    if func(i):
        print(i)
# Output: 1, "3", "zsh"
while this doesn't work:
func = lambda x: x is not None
src = []
for i in parallelize([1, None, "3", "zsh"]):
    if func(i):
        src.append(i)
print(src) # Output: []
The library code looks like this:
import sys
from itertools import islice
import os
import multiprocessing

def per_cpu(seq):
    cpu_count = multiprocessing.cpu_count()
    return (islice(seq, cpu, None, cpu_count) for cpu in range(cpu_count))

def parallelize(seq, fork=per_cpu):
    pids = []
    for slice in fork(seq):
        pid = os.fork()
        if pid == 0:
            yield from slice
            sys.exit(0)
        else:
            pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)
Update: I tried to use a shared list. It works, but multiprocessing raises an assertion error.
from multiprocessing import Manager

manager = Manager()
shared_list = manager.list()

func = lambda x: x is not None
for i in parallelize([1, None, "3", "zsh"]):
    if func(i):
        shared_list.append(i)
print(shared_list)
The error is:
File "/usr/lib/python3.8/multiprocessing/process.py", line 147, in join
assert self._parent_pid == os.getpid(), 'can only join a child process'
AssertionError: can only join a child process
What am I doing wrong?
Thanks in advance!
I'm using multiprocessing.Pool() to launch a bunch of processes, where each process writes to the same file (using a lock).
Each process gets assigned a "task", which is just a tuple of arguments.
One of the arguments is the file handle, and another argument is the lock.
But Python doesn't like me passing either the file handle or the lock.
(I can do it without a multiprocessing.Pool, by simply calling multiprocessing.Process.)
Example.
import multiprocessing as mp
import time
import random

def thr_work00(args):
    arg0 = args[0]
    arg1 = args[1]
    arg2 = args[2]
    arg3 = args[3]
    arg4 = args[4]
    s = random.random()/10
    time.sleep(s)
    print(f'\x1b[92m{arg0} \x1b[32m{s:.3f}\x1b[0m')
    return args

o_file = open('test.txt','w')
o_lock = mp.Lock()

tasks = [
    [0, 0,1, o_file,o_lock],
    [1, 2,3, o_file,o_lock],
    [2, 4,5, o_file,o_lock],
    [3, 6,7, o_file,o_lock],
]

with mp.Pool(2) as pool:
    results = pool.map(thr_work00, tasks)
    for res in results:
        print(res)
When passing the file I get: TypeError: cannot serialize '_io.TextIOWrapper' object.
When passing the lock I get: RuntimeError: Lock objects should only be shared between processes through inheritance.
How can I get around this?
Edit.
So I wonder if this is okay (it seems to be working). The only thing I care about is that each write itself is atomic, but it doesn't matter in which order the writes are done.
import multiprocessing as mp
import time
import random
import os

# ----------------------------------------------------------------
def thr_work00(args):
    arg0 = args[0]
    arg1 = args[1]
    s = random.random()/10
    time.sleep(s)
    txt = 1004*str(arg0)
    with open('test.txt','a') as o_file:
        o_file.write(f'{txt}\n')
    print(f'\x1b[92m{arg0} \x1b[32m{s:.3f}\x1b[0m')
    return args

# ----------------------------------------------------------------
os.remove('test.txt')

tasks = [
    [0, 0xf0],
    [1, 0xf1],
    [2, 0xf2],
    [3, 0xf3],
    [4, 0xf4],
    [5, 0xf5],
    [6, 0xf6],
    [7, 0xf7],
]

with mp.Pool(2) as pool:
    results = pool.map(thr_work00, tasks)
    for res in results:
        print(res)
For both locks and open file descriptors, you should be sharing these through process inheritance, rather than trying to pass them as parameters. A child process inherits all the open file descriptors from its parent, so you can write your code like this:
import multiprocessing as mp
import time
import random

def thr_work00(args):
    global o_lock, o_file
    s = random.randint(0, 5)
    with o_lock:
        time.sleep(s)
        print(f"\x1b[92m{args[0]} \x1b[32m{s}\x1b[0m")
        o_file.write(f"{args[0]} {s}\n")
        o_file.flush()
    return args

with open("test.txt", "w") as o_file:
    o_lock = mp.Lock()

    tasks = [
        [0, 0, 1],
        [1, 2, 3],
        [2, 4, 5],
        [3, 6, 7],
    ]

    with mp.Pool(2) as pool:
        results = pool.map(thr_work00, tasks)
        for res in results:
            print(res)
Alternately, instead of writing to the file in your worker, just perform the writes in the main process as you gather the results (a minimal sketch of this follows). This eliminates the need for the lock, because you no longer have multiple processes writing to the same file.
Or, if you need the writes to happen "live" rather than only at the end, deliver them to a dedicated writer process using a Queue.
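Here is a minimal sketch of the first alternative (the workers only return their data; the parent does all of the writing, so no lock or shared file handle is needed):

import multiprocessing as mp
import random
import time

def thr_work00(args):
    s = random.randint(0, 5)
    time.sleep(s)
    return (args[0], s)          # just return what should be written

if __name__ == "__main__":
    tasks = [[0, 0, 1], [1, 2, 3], [2, 4, 5], [3, 6, 7]]
    with mp.Pool(2) as pool:
        results = pool.map(thr_work00, tasks)   # blocks until all tasks are done
    # A single writer (the parent), so the writes cannot interleave.
    with open("test.txt", "w") as o_file:
        for item, s in results:
            o_file.write(f"{item} {s}\n")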
Here's one example of using a Queue to pass results to a dedicated writer:
import multiprocessing as mp
import time
import random

resultq = mp.Queue()

def thr_work00(args):
    global resultq
    s = random.randint(0, 5)
    print(f"\x1b[92m{args[0]} \x1b[32m{s}\x1b[0m")
    time.sleep(s)
    resultq.put((args[0], s))
    return args

def thr_writer():
    global resultq
    print('writer start')
    with open('test.txt', 'w') as fd:
        while True:
            item = resultq.get()
            if item is None:
                break
            fd.write(f'{item[0]}: {item[1]}\n')
    print('writer exit')

writer = mp.Process(target=thr_writer)
writer.start()

tasks = [
    [0, 0, 1],
    [1, 2, 3],
    [2, 4, 5],
    [3, 6, 7],
]

with mp.Pool(2) as pool:
    results = pool.map(thr_work00, tasks)
    for res in results:
        print(res)

resultq.put(None)
writer.join()
I have a function that hashes a bunch of strings with MD5, and inside it I create a pool.
Main.py
import hashlib
import multiprocessing
from configparser import ConfigParser

config = ConfigParser()
config.read("config.ini")
possibleCharacters = "abcd"

def mapped_loop_digit(args):
    loop_digit(*args, is_pool=True)

def loop_digit(current_str, place, strings, hashes, is_outer=False, is_pool=False):
    if place == config.getint("string_creation", "length_for_new_process"):
        current_strings = list()
    for character in possibleCharacters:
        current_str[place] = character
        if is_outer and config.getboolean("development", "minor_logging"):
            print("Outer character maker at", possibleCharacters.index(character) + 1, "in", len(possibleCharacters))
        elif is_pool and config.getboolean("development", "pool_minor_logging"):
            print("Outest in pool character maker for process", multiprocessing.current_process()._identity[0],
                  "at", possibleCharacters.index(character) + 1, "in", len(possibleCharacters), "with character as",
                  str(character) + ". Current string is", current_str)
        if place == 0:
            string = "".join(_character for _character in current_str)
            hashes.append(hashlib.md5(string.encode()).hexdigest())
            strings.append(string)
        elif place == config.getint("string_creation", "length_for_new_process"):
            current_strings.append(current_str.copy())
        else:
            loop_digit(current_str, place - 1, strings, hashes)
    if place == config.getint("string_creation", "length_for_new_process"):
        args = list()
        print("Starting a new pool")
        for string in current_strings:
            args.append([string, place - 1, strings, hashes])
        with multiprocessing.Pool(processes=config.getint("string_creation", "processes")) as pool:
            pool.map(mapped_loop_digit, args)
            pool.close()
            pool.join()

manager = multiprocessing.Manager()
all_strings = manager.list("")
all_hashes = manager.list("")
loop_digit(["", "", "", ""], 4 - 1, all_strings, all_hashes, is_outer=True)
config.ini
[development]
minor_logging = 1
pool_minor_logging = 1
[string_creation]
processes = 3
length_for_new_process = 3
At the moment I have a list called current_strings that I append to in the middle of the program; at the end I loop through it to build a list of arguments, map that to a separate function, and run the original function again. Is there an easier way to do this, so that I can just append tasks to the pool instead of to the list?
If you create the Pool as
pool = multiprocessing.Pool(5)
without calling pool.close() and pool.join(), then you can use the pool many times in different places (even in different functions).
If you use map_async() instead of map(), you don't have to wait for the tasks to finish, and you can submit more work with another map_async(); the pool will manage all of the tasks together.
You can also use apply_async() to add a single task to an existing pool.
Because map_async() and apply_async() don't wait for the tasks to finish, you have to wait for them explicitly, using wait(), before the program exits:
it1 = pool.map_async(...)
it2 = pool.map_async(...)
it3 = pool.apply_async(...)
# ... code ...
it1.wait()
it2.wait()
it3.wait()
or you have to use both of these at the end:
pool.close()
pool.join()
If you don't, the program may exit before the tasks have finished, and they will be terminated.
Minimal working example
import multiprocessing
import time

def fun(number):
    for x in range(3):
        time.sleep(.2)
        print(number, 'loop:', x)

if __name__ == '__main__':
    pool = multiprocessing.Pool(2)

    print("map [1,2,3]")
    it1 = pool.map_async(fun, [1,2,3])

    print("map ['A', 'B', 'C']")
    it2 = pool.map_async(fun, ['A', 'B', 'C'])

    print("single work X")
    it3 = pool.apply_async(fun, 'X')

    print("single work Y")
    it4 = pool.apply_async(fun, 'Y')

    # wait for the end of processes
    print('wait for the end of processes')

    #it1.wait()
    #it2.wait()
    #it3.wait()
    #it4.wait()

    pool.close()
    pool.join()

    print('exit')
I'm fairly new to multiprocessing and I have written the script below, but the methods are not getting called. I don't understand what I'm missing.
What I want to do is the following:
call two different methods asynchronously.
call one method before the other.
# import all necessary modules
import Queue
import logging
import multiprocessing
import time, sys
import signal

debug = True

def init_worker():
    signal.signal(signal.SIGINT, signal.SIG_IGN)

research_name_id = {}
ids = [55, 125, 428, 429, 430, 895, 572, 126, 833, 502, 404]

# declare all the static variables
num_threads = 2  # number of parallel threads
minDelay = 3  # minimum delay
maxDelay = 7  # maximum delay

# declare an empty queue which will hold the publication ids
queue = Queue.Queue(0)

proxies = []
#print (proxies)

def split(a, n):
    """Function to split data evenly among threads"""
    k, m = len(a) / n, len(a) % n
    return (a[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
            for i in xrange(n))

def run_worker(
        i,
        data,
        queue,
        research_name_id,
        proxies,
        debug,
        minDelay,
        maxDelay):
    """ Function to pull out all publication links from nist
    data - research ids pulled using a different script
    queue - add the publication urls to the list
    research_name_id - dictionary with research id as key and name as value
    proxies - scraped proxies
    """
    print 'getLinks', i
    for d in data:
        print d
        queue.put(d)

def fun_worker(i, queue, proxies, debug, minDelay, maxDelay):
    print 'publicationData', i
    try:
        print queue.pop()
    except:
        pass

def main():
    print "Initializing workers"
    pool = multiprocessing.Pool(num_threads, init_worker)
    distributed_ids = list(split(list(ids), num_threads))
    for i in range(num_threads):
        data_thread = distributed_ids[i]
        print data_thread
        pool.apply_async(run_worker, args=(i + 1,
                                           data_thread,
                                           queue,
                                           research_name_id,
                                           proxies,
                                           debug,
                                           minDelay,
                                           maxDelay,
                                           ))

        pool.apply_async(fun_worker,
                         args=(
                             i + 1,
                             queue,
                             proxies,
                             debug,
                             minDelay,
                             maxDelay,
                         ))

    try:
        print "Waiting 10 seconds"
        time.sleep(10)
    except KeyboardInterrupt:
        print "Caught KeyboardInterrupt, terminating workers"
        pool.terminate()
        pool.join()
    else:
        print "Quitting normally"
        pool.close()
        pool.join()

if __name__ == "__main__":
    main()
The only output that I get is
Initializing workers
[55, 125, 428, 429, 430, 895]
[572, 126, 833, 502, 404]
Waiting 10 seconds
Quitting normally
There are a couple of issues:
You're not using a multiprocessing.Queue; a plain Queue.Queue cannot be shared between processes.
If you want to share a queue with a subprocess via apply_async etc., you need to use a manager (see the sketch below).
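Here is a minimal sketch of that (the worker body is a placeholder). A Manager queue is a proxy object, so unlike a plain multiprocessing.Queue it can be passed as an argument through apply_async:

import multiprocessing

def worker(q, item):
    q.put(item * 2)          # placeholder work

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    q = manager.Queue()      # proxy object, safe to pass to pool workers
    pool = multiprocessing.Pool(2)
    for i in range(5):
        pool.apply_async(worker, args=(q, i))
    pool.close()
    pool.join()
    while not q.empty():
        print(q.get())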
However, you should take a step back and ask yourself what you are trying to do. Is apply_async really the way to go? You have a list of items that you want to map over, applying some long-running, compute-intensive transformations (if they were just blocking on I/O, you might as well use threads). It seems to me that imap_unordered is actually what you want:
pool = multiprocessing.Pool(num_threads, init_worker)
links = pool.imap_unordered(run_worker1, ids)
output = pool.imap_unordered(fun_worker1, links)
run_worker1 and fun_worker1 need to be modified to take a single argument. If you need to share other data, pass it in via the pool's initializer instead of passing it to the subprocesses over and over again.
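To make that concrete, here is a hedged sketch of the pattern; the worker bodies and the data passed to the initializer are placeholders, not the original scraping logic:

import multiprocessing

shared = {}

def init_worker(research_name_id, proxies):
    # Runs once in each worker process; stash the shared data in a module global.
    shared["research_name_id"] = research_name_id
    shared["proxies"] = proxies

def run_worker1(pub_id):
    # Single-argument worker: turn one research id into a link (placeholder).
    return "https://example.org/pub/%d" % pub_id

def fun_worker1(link):
    # Single-argument worker: process one link (placeholder).
    return link.upper()

if __name__ == "__main__":
    ids = [55, 125, 428, 429, 430, 895, 572, 126, 833, 502, 404]
    pool = multiprocessing.Pool(2, init_worker, ({}, []))
    links = pool.imap_unordered(run_worker1, ids)
    output = pool.imap_unordered(fun_worker1, links)
    for item in output:
        print(item)
    pool.close()
    pool.join()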