python multiprocessing multithread a whole code - python

I've not used multiprocessing/multithread in python code.
I have a long code (Over 600 lines) and I need to run it with using multiple cpus.
So I saw the way to use mutiprocessing/thread, but the way to do it for a whole code could not find it.
The code has the form of..
for loop
read csv
do several preprocessing
Mean of the values
Compare it with other values
...
If I have to edit all the code for the multiprocessing, it would take many times so could you please let me know if you know the way to do mutiprocessing the whole code?

To parallelize a function on multiple CPU cores it must generally avoid mutating global state and each function call must be independent of the others. Consider this hypothetical fonction which respects the conditions (comparison with other values is removed):
def f(file: Path) -> Value:
data = read_csv(file)
processed = pre_processing(data)
return mean(processed)
You can easily multithread it with Python using the concurrent integrated package:
from concurrent.futures import ThreadPoolExecutor
files = ["/path/1/", ...] # List of files
with ThreadPoolExecutor() as executor:
values = executor.map(f, files)
# Compare values here
for value in values:
...
You can also use ProcessPoolExecutor for multiprocessing.

Related

Run Pyspark functions in parallel efficiently

I have a pyspark code that have 3 functions. The first function is loads some data and prepares it for other two functions. The other two functions takes this output and do some task and generate respective outputs.
So the code will look something like this,
def first_function():
# load data
# pre-process
# return pre-processed data
def second_function(output_of_first_function):
# tasks for second function
# return output
def third_function(output_of_first_function):
# tasks for third function
# return output
And these functions are called from a main function like this,
def main():
output_from_first_function = first_function()
output_from_second_function = second_function(output_from_first_function)
output_from_third_function = third_function(output_from_first_function)
There is no interdependence among second_function and third_function. I'm looking for a way to run these two functions in parallel at a same time. There are some transforms happening inside these functions. So it may help to help these functions in parallel.
How to run the second_function and third_function in parallel? Should each of these functions create their own spark context or can they share a spark context?
From your problem, it doesn't seems like you really need pyspark. I think you should consider using Python Threads library. As described in this post: How to run independent transformations in parallel using PySpark?

How to have a multi-procsesing function return and store values in python?

I have a function which I will run using multi-processing. However the function returns a value and I do not know how to store that value once it's done.
I read somewhere online about using a queue but I don't know how to implement it or if that'd even work.
cores = []
for i in range(os.cpu_count()):
cores.append(Process(target=processImages, args=(dataSets[i],)))
for core in cores:
core.start()
for core in cores:
core.join()
Where the function 'processImages' returns a value. How do I save the returned value?
In your code fragment you have input dataSets which is a list of some unspecified size. You have a function processImages which takes a dataSet element and apparently returns a value you want to capture.
cpu_count == dataset length ?
The first problem I notice is that os.cpu_count() drives the range of values i which then determines which datasets you process. I'm going to assume you would prefer these two things to be independent. That is, you want to be able to crunch some X number of datasets and you want it to work on any machine, having anywhere from 1 - 1000 (or more...) cores.
An aside about CPU-bound work
I'm also going to assume that you have already determined that the task really is CPU-bound, thus it makes sense to split by core. If, instead, your task is disk io-bound, you would want more workers. You could also be memory bound or cache bound. If optimal parallelization is important to you, you should consider doing some trials to see which number of workers really gives you maximum performance.
Here's more reading if you like
Pool class
Anyway, as mentioned by Michael Butscher, the Pool class simplifies this for you. Yours is a standard use case. You have a set of work to be done (your list of datasets to be processed) and a number of workers to do it (in your code fragment, your number of cores).
TLDR
Use those simple multiprocessing concepts like this:
from multiprocessing import Pool
# Renaming this variable just for clarity of the example here
work_queue = datasets
# This is the number you might want to find experimentally. Or just run with cpu_count()
worker_count = os.cpu_count()
# This will create processes (fork) and join all for you behind the scenes
worker_pool = Pool(worker_count)
# Farm out the work, gather the results. Does not care whether dataset count equals cpu count
processed_work = worker_pool.map(processImages, work_queue)
# Do something with the result
print(processed_work)
You cannot return the variable from another process. The recommended way would be to create a Queue (multiprocessing.Queue), then have your subprocess put the results to that queue, and once it's done, you may read them back -- this works if you have a lot of results.
If you just need a single number -- using Value or Array could be easier.
Just remember, you cannot use a simple variable for that, it has to be wrapped with above mentioned classes from multiprocessing lib.
If you want to use the result object returned by a multiprocessing, try this
from multiprocessing.pool import ThreadPool
def fun(fun_argument1, ... , fun_argumentn):
<blabla>
return object_1, object_2
pool = ThreadPool(processes=number_of_your_process)
async_num1 = pool.apply_async(fun, (fun_argument1, ... , fun_argumentn))
object_1, object_2 = async_num1.get()
then you can do whatever you want.

How can I run the for loop in parallel for a class function in python?

I have a class My_mechanism which contains the function velocity_params which writes the results in a csv file. I need to iterate over some range but the iteration is very slow (only one CPU core is utilised at once). Is there any way to speed up the process? The present code takes around 10 minutes to execute, finally, I need to iterate for all i,j,k,l in My_mechanism(i,j,k,l,2).
from crankshaft import *
import multiprocessing
import time
initial_time = time.time()
for i in range(10,20):
m = My_mechanism(i,50,20,14,2)
try:
m.velocity_params()
except Exception:
continue
print("Processing time : ",time.time()-initial_time,"s")
You can either use a compiled C module such as numpy to do the heavy math computations as numpy will parallelize operations on vectors for you, or you can utilize the multiprocessing library you imported.
import multiprocessing as mp
parallelism = 4
p = mp.pool(parallelism)
res = p.map(my_mechanism, [x for x in range(10,20)])
Although you will need to handle the multiple input parameters in some other way to use the map function, as by default it will only take one.
Generally writing to a csv file takes a long time. I would recommend running all your data processing beforehand (this can be parallelised) and writing the output at the end.
If you want a more detailed answer, please be more specific about what csv writer do you use, what is your data etc

multiprocessing - calling function with different input files

I have a function which reads in a file, compares a record in that file to a record in another file and depending on a rule, appends a record from the file to one of two lists.
I have an empty list for adding matched results to:
match = []
I have a list restrictions that I want to compare records in a series of files with.
I have a function for reading in the file I wish to see if contains any matches. If there is a match, I append the record to the match list.
def link_match(file):
links = json.load(file)
for link in links:
found = False
try:
for other_link in other_links:
if link['data'] == other_link['data']:
match.append(link)
found = True
else:
pass
else:
print "not found"
I have numerous files that I wish to compare and I thus wish to use the multiprocessing library.
I create a list of file names to act as function arguments:
list_files=[]
for file in glob.glob("/path/*.json"):
list_files.append(file)
I then use the map feature to call the function with the different input files:
if __name__ == '__main__':
pool = multiprocessing.Pool(processes=6)
pool.map(link_match,list_files)
pool.close()
pool.join()
CPU use goes through the roof and by adding in a print line to the function loop I can see that matches are being found and the function is behaving correctly.
However, the match results list remains empty. What am I doing wrong?
multiprocessing runs a new instance of Python for each process in the pool - the context is empty (if you use spawn as a start method) or copied (if you use fork), plus copies of any arguments you pass in (either way), and from there they're all separate. If you want to pass data between branches, there's a few other ways to do it.
Instead of writing to an internal list, write to a file and read from it later when you're done. The largest potential problem here is that only one thing can write to a file at a time, so either you make a lot of separate files (and have to read all of them afterwards) or they all block each other.
Continue with multiprocessing, but use a multiprocessing.Queue instead of a list. This is an object provided specifically for your current use-case: Using multiple processes and needing to pass data between them. Assuming that you should indeed be using multiprocessing (that your situation wouldn't be better for threading, see below), this is probably your best option.
Instead of multiprocessing, use threading. Separate threads all share a single environment. The biggest problems here are that Python only lets one thread actually run Python code at a time, per process. This is called the Global Interpreter Lock (GIL). threading is thus useful when the threads will be waiting on external processes (other programs, user input, reading or writing files), but if most of the time is spent in Python code, it actually takes longer (because it takes a little time to switch threads, and you're not doing anything to save time). This has its own queue. You should probably use that rather than a plain list, if you use threading - otherwise there's the potential that two threads accessing the list at the same time interfere with each other, if it switches threads at the wrong time.
Oh, by the way: If you do use threading, Python 3.2 and later has an improved implementation of the GIL, which seems like it at least has a good chance of helping. A lot of stuff for threading performance is very dependent on your hardware (number of CPU cores) and the exact tasks you're doing, though - probably best to try several ways and see what works for you.
When multiprocessing, each subprocess gets its own copy of any global variables in the main module defined before the if __name__ == '__main__': statement. This means that the link_match() function in each one of the processes will be accessing a different match list in your code.
One workaround is to use a shared list, which in turn requires a SyncManager to synchronize access to the shared resource among the processes (which is created by calling multiprocessing.Manager()). This is then used to create the list to store the results (which I have named matches instead of match) in the code below.
I also had to use functools.partial() to create a single argument callable out of the revised link_match function which now takes two arguments, not one (which is the kind of function pool.map() expects).
from functools import partial
import glob
import multiprocessing
def link_match(matches, file): # note: added results list argument
links = json.load(file)
for link in links:
try:
for other_link in other_links:
if link['data'] == other_link['data']:
matches.append(link)
else:
pass
else:
print "not found"
if __name__ == '__main__':
manager = multiprocessing.Manager() # create SyncManager
matches = manager.list() # create a shared list here
link_matches = partial(link_match, matches) # create one arg callable to
# pass to pool.map()
pool = multiprocessing.Pool(processes=6)
list_files = glob.glob("/path/*.json") # only used here
pool.map(link_matches, list_files) # apply partial to files list
pool.close()
pool.join()
print(matches)
Multiprocessing creates multiple processes. The context of your "match" variable will now be in that child process, not the parent Python process that kicked the processing off.
Try writing the list results out to a file in your function to see what I mean.
To expand cthrall's answer, you need to return something from your function in order to pass the info back to your main thread, e.g.
def link_match(file):
[put all the code here]
return match
[main thread]
all_matches = pool.map(link_match,list_files)
the list match will be returned from each single thread and map will return a list of lists in this case. You can then flatten it again to get the final output.
Alternatively you can use a shared list but this will just add more headache in my opinion.

Multiprocessing, pooling and randomness

I am experiencing a strange thing: I wrote a program to simulate economies. Instead of running this simulation one by one on one CPU core, I want to use multiprocessing to make things faster. So I run my code (fine), and I want to get some stats from the simulations I am doing. Then arises one surprise: all the simulations done at the same time yield the very same result! Is there some strange relationship between Pool() and random.seed()?
To be much clearer, here is what the code can be summarized as:
class Economy(object):
def __init__(self,i):
self.run_number = i
self.Statistics = Statistics()
self.process()
def run_and_return(i):
eco = Economy(i)
return eco
collection = []
def get_result(x):
collection.append(x)
if __name__ == '__main__':
pool = Pool(processes=4)
for i in range(NRUN):
pool.apply_async(run_and_return, (i,), callback=get_result)
pool.close()
pool.join()
The process(i) is the function that goes through every step of the simulation, during i steps. Basically I simulate NRUN Economies, from which I get the Statistics that I put in the list collection.
Now the strange thing is that the output of this is exactly the same for the first 4 runs: during the same "wave" of simulations, I get the very same output. Once I get to the second wave, then I get a different output for the next 4 simulations!
All these simulations run well if I use the same program with processes=1: I get different results when I only work on one core, taking simulations one by one... I have tried a few things, but can't get my head around this, hence my post...
Thank you very much for taking the time to read this long post, do not hesitate to ask for more precisions!
All the best,
If you are on Linux then each pool process is made by forking the parent process. This means the process is literally duplicated - this includes the seed any random object may be using.
The random module selects the seed for its default functions on import. Meaning the seed has already been selected before you create the Pool.
To get around this you must use an initialiser for each pool process that sets the random seed to something unique.
A decent way to seed random would be to use the process id and the current time. The process id is bound to be unique on a single run of your program. Whilst using the time will ensure uniqueness over multiple runs in case the same process id is produced. Passing process id and time through as a string will mean that the digest of the string is also used to seed the random number generator -- meaning two similar strings will produce substantially different seeds. Alternatively, you could use the uuid module to generate seeds.
def proc_init():
random.seed(str(os.getpid()) + str(time.time()))
pool = Pool(num_procs, initializer=proc_init)

Categories

Resources