I want to run two different functions in parallel in Python. I have used the code below:
def remove_special_char(data):
    # Change cell values which start with '=', which leads to Excel formula issues
    data['Description'] = data['Description'].apply(lambda val: re.sub(r'^=', "'=", str(val)))
    return data

file_path1 = '.\file1.xlsx'
file_path2 = '.\file2.xlsx'

def method1(file_path1):
    data = pd.read_excel(file_path1)
    data = remove_special_char(data)
    return data

def method2(file_path2):
    data = pd.read_excel(file_path2)
    data = remove_special_char(data)
    return data
I am using the Pool code below, but it's not working.
from multiprocessing import Pool
p = Pool(3)
result1 = p.map(method1(file_path1), args=file_path1)
result2 = p.map(method2(file_path1), args=file_path2)
I want to run both these methods in parallel to save execution time and at the same time get the return value as well.
I don't know why you are defining the same method twice with different parameter names, but in any case Pool's map method takes a function as its first argument and an iterable as its second. map calls the function on each item of the iterable and returns a list with all the results. So what you want is something more like:
from multiprocessing import Pool

# raw strings so the backslash isn't treated as an escape sequence ('\f' is a form feed)
file_paths = (r'.\file1.xlsx', r'.\file2.xlsx')

def method(file_path):
    data = pd.read_excel(file_path)
    data = remove_special_char(data)
    return data

with Pool(3) as p:
    result = p.map(method, file_paths)
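Since map() returns the results in the same order as the input iterable, you can unpack them directly afterwards:

result1, result2 = result  # result[0] is the DataFrame from file1.xlsx, result[1] from file2.xlsx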
I have multiprocessing code, and each process has to analyse the same data differently.
The input data is always the same; it is not changeable.
The input data is a DataFrame with 20 columns and 60k rows.
How do I efficiently 'give' this data to each process?
In a single-process application I used a global variable, but with multiprocessing that doesn't work.
When I try to pass the data as a function argument, only the first element of the table comes through.
Welcome to Stack Overflow. You need to take the time to give reproducible, minimal working examples to get specific answers and to help the community in general.
Anyway, you shouldn't use global variables if you need to change them with each iteration/process/etc.
In rough, easily digestible terms, multiprocessing works like this:
import concurrent.futures
import glob

def manipulate_data_function(data):
    result = torture_data(data)
    return result

# use ProcessPoolExecutor instead of ThreadPoolExecutor for CPU-bound work
with concurrent.futures.ThreadPoolExecutor(max_workers=None) as executor:
    futures = []
    for file in glob.glob('*txt'):
        # `data` is the shared input loaded once in the parent; every task receives the same object
        futures.append(executor.submit(manipulate_data_function, data))
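The futures collected above also carry the return values. Since exiting the with-block waits for all submitted work to finish, a minimal follow-up (not part of the original answer) to gather them is:

results = [f.result() for f in futures]  # in submission order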
thank you for the answer. I don't change this data each iteration; I use the same data for each process. How to process the data is passed through a function argument.
with concurrent.futures.ProcessPoolExecutor() as executor:
    res = executor.map(goal_fcn, p)
    for f in concurrent.futures.as_completed(res):
        fp = res

and next:

def goal_fcn(x):
    return heavy_calculation(x, global_DataFrame, global_String)
EDIT:
it works with:

with concurrent.futures.ProcessPoolExecutor() as executor:
    res = executor.map(goal_fcn, p, [global_DataFrame], [global_String])
    for f in concurrent.futures.as_completed(res):
        fp = res

def goal_fcn(x, DataFrame, String):
    return heavy_calculation(x, DataFrame, String)
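One caveat about that edit: executor.map() zips its argument iterables, so wrapping the constants in one-element lists stops the iteration after the first item of p. A sketch of the same idea that covers every item, using itertools.repeat to supply the constants (names are taken from the snippets above):

import concurrent.futures
from itertools import repeat

with concurrent.futures.ProcessPoolExecutor() as executor:
    # repeat() yields the same DataFrame/String for every element of p, so map()
    # is only limited by the length of p; note each call still pickles the
    # DataFrame, which an initializer-based approach would avoid
    results = list(executor.map(goal_fcn, p, repeat(global_DataFrame), repeat(global_String)))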
I have implemented an Evolutionary Algorithm process in Python 3.8 and am attempting to optimise/reduce its runtime. Due to the heavy constraints on valid solutions, it can take a few minutes to generate valid chromosomes. To avoid spending hours just generating the initial population, I want to use multiprocessing to generate several at a time.
My code at this point in time is:
populationCount = 500

def readDistanceMatrix():
    # code removed

def generateAvailableValues():
    # code removed

def generateAvailableValuesPerColumn():
    # code removed

def generateScheduleTemplate():
    # code removed

def generateChromosome():
    # code removed

if __name__ == '__main__':
    # Data type = DataFrame
    distanceMatrix = readDistanceMatrix()
    # Data type = List of Integers
    availableValues = generateAvailableValues()
    # Data type = List containing Lists of Integers
    availableValuesPerColumn = generateAvailableValuesPerColumn(availableValues)
    # Data type = DataFrame
    scheduleTemplate = generateScheduleTemplate(distanceMatrix)
    # Data type = List containing custom class (with Integer and DataFrame)
    population = []
    while len(population) < populationCount:
        chrmSolution = generateChromosome(availableValuesPerColumn, scheduleTemplate, distanceMatrix)
        population.append(chrmSolution)
The population list is filled in by the while loop at the end. I would like to replace the while loop with a multiprocessing solution that can use up to a pre-set number of cores. For example:
population = []
availableCores = 6
while len(population) < populationCount:
    while usedCores < availableCores:
        # start generating another chromosome as 'chrmSolution'
    population.append(chrmSolution)
However, after reading and watching hours worth of tutorials, I'm unable to get a loop up-and-running. How should I go about doing this?
It sounds like a simple multiprocessing.Pool should do the trick, or at least be a place to start. Here's a simple example of how that might look:
from multiprocessing import Pool, cpu_count

child_globals = {}  # mutable object at the module level acts as a container for globals (constants)

def init_child(availableValuesPerColumn, scheduleTemplate, distanceMatrix):
    # passing variables to the child process every time is inefficient if they're
    # constant, so instead pass them to the initialization function, and let
    # each child re-use them each time generateChromosome is called
    child_globals['availableValuesPerColumn'] = availableValuesPerColumn
    child_globals['scheduleTemplate'] = scheduleTemplate
    child_globals['distanceMatrix'] = distanceMatrix

def child_work(i):
    # child_work simply wraps generateChromosome with inputs, and throws out dummy `i` from `range()`
    return generateChromosome(child_globals['availableValuesPerColumn'],
                              child_globals['scheduleTemplate'],
                              child_globals['distanceMatrix'])

if __name__ == '__main__':
    # ... (build availableValuesPerColumn, scheduleTemplate, distanceMatrix as before)

    with Pool(cpu_count(),
              initializer=init_child,  # init function to stuff some constants into the child's global context
              initargs=(availableValuesPerColumn, scheduleTemplate, distanceMatrix)) as p:
        # imap_unordered doesn't make child processes wait to ensure order is preserved,
        # so it keeps the cpu busy more often. it returns a generator, so we use list()
        # to store the results into a list.
        population = list(p.imap_unordered(child_work, range(populationCount)))
I'm performing analyses of time series of simulations. Basically, the analysis does the same tasks for every time step. As there is a very high number of time steps, and as the analysis of each of them is independent, I wanted to create a function that can multiprocess another function. The latter will have arguments and return a result.
Using a shared dictionary and the concurrent.futures library, I managed to write this:
import multiprocessing as mlp
import concurrent.futures as Cfut

def multiprocess_loop_grouped(function, param_list, group_size, Nworkers, *args):
    # function   : function that is run in parallel
    # param_list : list of items
    # group_size : size of the groups
    # Nworkers   : number of groups/items running at the same time
    # *args      : fixed parameters passed through to `function`
    manager = mlp.Manager()
    dic = manager.dict()
    executor = Cfut.ProcessPoolExecutor(Nworkers)
    futures = [executor.submit(function, param, dic, *args)
               for param in grouper(param_list, group_size)]  # grouper() chunks param_list into groups of group_size
    Cfut.wait(futures)
    return [dic[i] for i in sorted(dic.keys())]
Typically, I can use it like this:
def read_file(files, dictionnary):
    for file in files:
        i = int(file[4:9])
        #print(str(i))
        if 'bz2' in file:
            os.system('bunzip2 ' + file)
            file = file[:-4]
        dictionnary[i] = np.loadtxt(file)
        os.system('bzip2 ' + file)

Map = np.array(multiprocess_loop_grouped(read_file, list_alti, Group_size, N_thread))
or like this:

def autocorr(x):
    result = np.correlate(x, x, mode='full')
    return result[result.size//2:]

def find_lambda_finger(indexes, dic, Deviation):
    for i in indexes:
        #print(str(i))
        # Beach = Deviation[i,:] - np.mean(Deviation[i,:])
        dic[i] = Anls.find_first_max(autocorr(Deviation[i,:]), valmax=True)

args = [Deviation]
Temp = Rescal.multiprocess_loop_grouped(find_lambda_finger, range(Nalti), Group_size, N_thread, *args)
Basically, it works. But it does not work well. Sometimes it crashes. Sometimes it actually launches a number of Python processes equal to Nworkers, and sometimes only 2 or 3 of them are running at a time even though I specified Nworkers = 15.
For example, a classic error I get is described in the following topic I raised: Calling matplotlib AFTER multiprocessing sometimes results in error : main thread not in main loop
What is the most Pythonic way to achieve what I want? How can I improve control of this function? How can I better control the number of running Python processes?
One of the basic concepts for Python multiprocessing is using queues. It works quite well when you have an input list that can be iterated over and that does not need to be altered by the sub-processes. It also gives you good control over all the processes, because you spawn the number you want, and you can run them idle or stop them.
It is also a lot easier to debug. Sharing data explicitly is usually an approach that is much more difficult to set up correctly.
Queues can hold almost anything (any picklable object), so you can fill them with file-path strings for reading files, numbers for doing calculations, or even images for drawing.
In your case a layout could look like this:
import multiprocessing as mp
import numpy as np
import itertools as it

def worker1(in_queue, out_queue):
    # blocks when nothing is available, stops when 'STOP' is seen
    for a in iter(in_queue.get, 'STOP'):
        # do something
        out_queue.put({a: result})  # return your result linked to the input

def worker2(in_queue, out_queue):
    for a in iter(in_queue.get, 'STOP'):
        # do something differently
        out_queue.put({a: result})  # return your result linked to the input

def multiprocess_loop_grouped(function, param_list, group_size, Nworkers, *args):
    # your final result
    result = {}

    in_queue = mp.Queue()
    out_queue = mp.Queue()

    # fill your input
    for a in param_list:
        in_queue.put(a)
    # stop command at end of input
    for n in range(Nworkers):
        in_queue.put('STOP')

    # set up your worker processes doing the task as specified
    process = [mp.Process(target=function,
                          args=(in_queue, out_queue), daemon=True) for x in range(Nworkers)]

    # run processes
    for p in process:
        p.start()

    # collect your results from the calculations
    # (drain the output queue BEFORE joining; joining a process that still has
    # buffered items in a queue can deadlock)
    for a in param_list:
        result.update(out_queue.get())

    # wait for processes to finish
    for p in process:
        p.join()

    return result

temp = multiprocess_loop_grouped(worker1, param_list, group_size, Nworkers, *args)
map = multiprocess_loop_grouped(worker2, param_list, group_size, Nworkers, *args)
It can be made a bit more dynamic when you are afraid that your queues will run out of memory. Then you need to fill and empty the queues while the processes are running. See this example here.
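A minimal sketch of that more dynamic variant, using the same worker signature as above (the helper names feed_queue and multiprocess_loop_bounded are made up for illustration): a feeder thread keeps a bounded in_queue topped up while the workers drain it, and results are collected as they arrive so neither queue grows without bound.

import multiprocessing as mp
import threading

def feed_queue(in_queue, param_list, Nworkers):
    # runs in a parent-side thread; put() blocks whenever the bounded queue is full
    for a in param_list:
        in_queue.put(a)
    for _ in range(Nworkers):
        in_queue.put('STOP')  # one stop token per worker

def multiprocess_loop_bounded(function, param_list, Nworkers, maxsize=100):
    result = {}
    in_queue = mp.Queue(maxsize=maxsize)  # never holds more than `maxsize` items
    out_queue = mp.Queue()

    feeder = threading.Thread(target=feed_queue, args=(in_queue, param_list, Nworkers))
    feeder.start()

    workers = [mp.Process(target=function, args=(in_queue, out_queue), daemon=True)
               for _ in range(Nworkers)]
    for p in workers:
        p.start()

    # drain results as they arrive so out_queue cannot grow without bound either
    for _ in param_list:
        result.update(out_queue.get())

    for p in workers:
        p.join()
    feeder.join()
    return result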
Final words: it is not more Pythonic as you requested. But it is easier to understand for a newbie ;-)
I have code like this:
def plotFrame(n):
    a = data[n, :]
    do_something_with(a)

data = loadtxt(filename)
ids = data[:,0] # some numbers from the first column of data
map(plotFrame, ids)
That worked fine for me. Now I want to try replacing map() with pool.map() as follows:
pools = multiprocessing.Pool(processes=1)
pools.map(plotFrame, ids)
But that won't work, saying:
NameError: global name 'data' is not defined
The question is: what is going on? Why doesn't map() complain about the data variable not being passed to the function, while pool.map() does?
EDIT:
I'm using Linux.
EDIT 2:
Based on @Bill's second suggestion, I now have the following code:
def plotFrame_v2(line):
    plot_with(line)

if __name__ == "__main__":
    ff = np.loadtxt(filename)
    m = int( max(ff[:,-1]) ) # max id
    l = ff.shape[0]
    nfig = 0
    pool = Pool(processes=1)
    for i in range(0, l/m, 50):
        data = ff[i*m:(i+1)*m, :] # data of one frame contains several ids
        pool.map(plotFrame_v2, data)
        nfig += 1
        plt.savefig("figs_bot/%.3d.png"%nfig)
        plt.clf()
That works just as expected. However, now I have another unexpected problem: The produced figures are blank, whereas the above code with map() produces figures with the content of data.
Using multiprocessing.Pool, you are spawning individual processes to work with the shared (global) resource data. Typically, you can allow the processes to work with a shared resource in the parent process by making that resource explicitly global. However, it is better practice to explicitly pass all needed resources to the child processes as function arguments. This is required if you are working on Windows. Check out the multiprocessing guidelines here.
So you could try doing
data = loadtxt(filename)

def plotFrame(n):
    global data
    a = data[n, :]
    do_something_with(a)

ids = data[:,0] # some numbers from the first column of data
pools = multiprocessing.Pool(processes=1)
pools.map(plotFrame, ids)
or even better see this thread about feeding multiple arguments to a function with multiprocessing.pool. A simple way could be
def plotFrameWrapper(args):
    return plotFrame(*args)

def plotFrame(n, data):
    a = data[n, :]
    do_something_with(a)

if __name__ == "__main__":
    from multiprocessing import Pool
    data = loadtxt(filename)
    pools = Pool(1)
    ids = data[:,0]
    results = pools.map(plotFrameWrapper, zip(ids, [data]*len(ids)))
    print(results)
One last thing: since it looks like the only thing you are doing from your example is slicing the array, you can simply slice first then pass the sliced arrays to your function:
def plotFrame(sliced_data):
    do_something_with(sliced_data)

if __name__ == "__main__":
    from multiprocessing import Pool
    data = loadtxt(filename)
    pools = Pool(1)
    ids = data[:,0]
    results = pools.map(plotFrame, data[ids])
    print(results)
To avoid "unexpected" problems, avoid globals.
To reproduce your first code example with builtin map that calls plotFrame:
def plotFrame(n):
    a = data[n, :]
    do_something_with(a)
using multiprocessing.Pool.map, the first thing is to deal with the global data. If do_something_with(a) also uses some global data then it should also be changed.
To see how to pass a numpy array to a child process, see Use numpy array in shared memory for multiprocessing. If you don't need to modify the array then it is even simpler:
import numpy as np
from multiprocessing import Pool

def init(data_): # inherit data
    global data  #NOTE: no other globals in the program
    data = data_

def main():
    data = np.loadtxt(filename)
    ids = data[:,0] # some numbers from the first column of data
    pool = Pool(initializer=init, initargs=[data])
    pool.map(plotFrame, ids)

if __name__=="__main__":
    main()
All arguments should either be explicitly passed as arguments to plotFrame or inherited via init().
Your second code example tries to manipulate global data again (via plt calls):
import matplotlib.pyplot as plt
#XXX BROKEN, DO NOT USE
pool.map(plotFrame_v2, data)
nfig += 1
plt.savefig("figs_bot/%.3d.png"%nfig)
plt.clf()
Unless you draw something in the main process, this code saves blank figures. Either plot in the child processes, or send the data to be plotted back to the parent process explicitly, e.g. by returning it from plotFrame and using the value returned by pool.map(). Here's a code example: how to plot in child processes.
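A minimal sketch of the second option (the helper name compute_frame and the plt.plot call are placeholders, not from the original code; filename is as in the question): each child returns the row it would have plotted, and all matplotlib work happens in the parent process, so the saved figures are no longer blank.

import numpy as np
import matplotlib.pyplot as plt
from multiprocessing import Pool

def compute_frame(args):
    # runs in a child process: no plt calls here, just return the data to plot
    n, data = args
    return data[n, :]

if __name__ == "__main__":
    data = np.loadtxt(filename)
    ids = data[:, 0].astype(int)
    with Pool() as pool:
        lines = pool.map(compute_frame, [(n, data) for n in ids])
    # all plotting and saving happens in the parent process
    for nfig, line in enumerate(lines):
        plt.plot(line)
        plt.savefig("figs_bot/%.3d.png" % nfig)
        plt.clf()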
I'm looking for a simple process-based parallel map for python, that is, a function
parmap(function,[data])
that would run the function on each element of [data] on a different process (well, on a different core, but AFAIK, the only way to run stuff on different cores in Python is to start multiple interpreters), and return a list of results.
Does something like this exist? I would like something simple, so a simple module would be nice. Of course, if no such thing exists, I will settle for a big library :-/
It seems like what you need is the map method of multiprocessing.Pool():
map(func, iterable[, chunksize])
A parallel equivalent of the map() built-in function (it supports only one iterable argument though). It blocks till the result is ready.
This method chops the iterable into a number of chunks which it submits to the process pool as separate tasks. The (approximate) size of these chunks can be specified by setting chunksize to a positive integer.
For example, if you wanted to map this function:
def f(x):
    return x**2
to range(10), you could do it using the built-in map() function:
map(f, range(10))
or using a multiprocessing.Pool() object's method map():
import multiprocessing

pool = multiprocessing.Pool()
print(pool.map(f, range(10)))
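On Python 3, the same example is usually written with a context manager and an import guard; a minimal self-contained version:

from multiprocessing import Pool

def f(x):
    return x**2

if __name__ == '__main__':
    with Pool() as pool:
        print(pool.map(f, range(10)))  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]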
This can be done elegantly with Ray, a system that allows you to easily parallelize and distribute your Python code.
To parallelize your example, you'd need to define your map function with the @ray.remote decorator, and then invoke it with .remote. This ensures that every instance of the remote function is executed in a different process.
import time
import ray

ray.init()

# Define the function you want to apply map on, as a remote function.
@ray.remote
def f(x):
    # Do some work...
    time.sleep(1)
    return x*x

# Define a helper parmap(f, list) function.
# This function executes a copy of f() on each element in "list".
# Each copy of f() runs in a different process.
# Note f.remote(x) returns a future of its result (i.e.,
# an identifier of the result) rather than the result itself.
def parmap(f, list):
    return [f.remote(x) for x in list]

# Call parmap() on a list consisting of the first 5 integers.
result_ids = parmap(f, range(1, 6))

# Get the results
results = ray.get(result_ids)
print(results)
This will print:
[1, 4, 9, 16, 25]
and it will finish in approximately len(list)/p seconds (rounded up to the nearest integer), where p is the number of cores on your machine. Assuming a machine with 2 cores, our example will execute in 5/2 rounded up, i.e., in approximately 3 sec.
There are a number of advantages of using Ray over the multiprocessing module. In particular, the same code will run on a single machine as well as on a cluster of machines. For more advantages of Ray see this related post.
Python3's Pool class has a map() method and that's all you need to parallelize map:
from multiprocessing import Pool

with Pool() as P:
    xtransList = P.map(some_func, a_list)
with Pool() as P creates a process pool and executes each item of the list in parallel. You can provide the number of cores:
with Pool(processes=4) as P:
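Putting the pieces together, a minimal runnable sketch (some_func and a_list here are stand-ins, not from the original answer):

from multiprocessing import Pool

def some_func(x):
    # stand-in for whatever work you want to parallelize
    return x * x

if __name__ == '__main__':
    a_list = list(range(10))
    with Pool(processes=4) as P:
        xtransList = P.map(some_func, a_list)
    print(xtransList)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]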
For those who are looking for a Python equivalent of R's mclapply(), here is my implementation. It is an improvement on the following two examples:
"Parallelize Pandas map() or apply()", as mentioned by @Rafael Valero.
How to apply map to functions with multiple arguments.
It can be applied to map functions with single or multiple arguments.
import numpy as np, pandas as pd
from scipy import sparse
import functools, multiprocessing
from multiprocessing import Pool

num_cores = multiprocessing.cpu_count()

def parallelize_dataframe(df, func, U=None, V=None):
    #blockSize = 5000
    num_partitions = 5 # int( np.ceil(df.shape[0]*(1.0/blockSize)) )
    blocks = np.array_split(df, num_partitions)

    pool = Pool(num_cores)
    if V is not None and U is not None:
        # apply func with multiple arguments to dataframe (i.e. involves multiple columns)
        df = pd.concat(pool.map(functools.partial(func, U=U, V=V), blocks))
    else:
        # apply func with one argument to dataframe (i.e. involves single column)
        df = pd.concat(pool.map(func, blocks))

    pool.close()
    pool.join()
    return df

def square(x):
    return x**2

def test_func(data):
    print("Process working on: ", data.shape)
    data["squareV"] = data["testV"].apply(square)
    return data

def vecProd(row, U, V):
    return np.sum( np.multiply(U[int(row["obsI"]),:], V[int(row["obsJ"]),:]) )

def mProd_func(data, U, V):
    data["predV"] = data.apply( lambda row: vecProd(row, U, V), axis=1 )
    return data

def generate_simulated_data():
    N, D, nnz, K = [302, 184, 5000, 5]
    I = np.random.choice(N, size=nnz, replace=True)
    J = np.random.choice(D, size=nnz, replace=True)
    vals = np.random.sample(nnz)

    sparseY = sparse.csc_matrix((vals, (I, J)), shape=[N, D])

    # Generate parameters U and V which could be used to reconstruct the matrix Y
    U = np.random.sample(N*K).reshape([N,K])
    V = np.random.sample(D*K).reshape([D,K])

    return sparseY, U, V

def main():
    Y, U, V = generate_simulated_data()

    # find row, column indices and observed values for sparse matrix Y
    (testI, testJ, testV) = sparse.find(Y)
    colNames = ["obsI", "obsJ", "testV", "predV", "squareV"]
    dtypes = {"obsI":int, "obsJ":int, "testV":float, "predV":float, "squareV": float}

    obsValDF = pd.DataFrame(np.zeros((len(testV), len(colNames))), columns=colNames)
    obsValDF["obsI"] = testI
    obsValDF["obsJ"] = testJ
    obsValDF["testV"] = testV
    obsValDF = obsValDF.astype(dtype=dtypes)

    print("Y.shape: {!s}, #obsVals: {}, obsValDF.shape: {!s}".format(Y.shape, len(testV), obsValDF.shape))

    # calculate the square of testVals
    obsValDF = parallelize_dataframe(obsValDF, test_func)

    # reconstruct prediction of testVals using parameters U and V
    obsValDF = parallelize_dataframe(obsValDF, mProd_func, U, V)

    print("obsValDF.shape after reconstruction: {!s}".format(obsValDF.shape))
    print("First 5 elements of obsValDF:\n", obsValDF.iloc[:5,:])

if __name__ == '__main__':
    main()
I know this is an old post, but just in case, I wrote a tool to make this super, super easy, called parmapper (I actually call it parmap in my own use, but the name was taken).
It handles a lot of the setup and teardown of processes and adds tons of features. In rough order of importance:
Can take lambda and other unpickleable functions
Supports starmap and other similar calling conventions, making it very easy to use directly.
Can split amongst both threads and/or processes
Includes features such as progress bars
It does incur a small cost but for most uses, that is negligible.
I hope you find it useful.
(Note: It, like map in Python 3+, returns an iterable so if you expect all results to pass through it immediately, use list())