Parallelize operations in Python

I'm populating a matrix using a conditional lookup from a file. The file is extremely large (2,500,000 records) and is saved as a dataframe ('file').
Each matrix row operation (lookup) is independent of the others. Is there any way I could parallelize this process?
I'm working in pandas and Python. My current approach is completely naive:
for r in row:
    for c in column:
        num = file[(file['Unique_Inventor_Number'] == r) & (file['AppYearStr'] == c)]['Citation'].tolist()
        num = len(list(set(num)))
        d.set_value(r, c, num)

For 2.5 million records you should be able to do
res = file.groupby(['Unique_Inventor_Number', 'AppYearStr']).Citation.nunique()
The matrix should be available in
res.unstack(level=1).fillna(0).values
I'm not sure if this is the fastest approach, but it should be significantly faster than your implementation.
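Spelled out end to end, a minimal sketch (using the column names from your snippet; 'file' is your existing dataframe) would be:
# count distinct citations per (inventor, year) pair in one vectorized pass
res = file.groupby(['Unique_Inventor_Number', 'AppYearStr'])['Citation'].nunique()

# pivot AppYearStr into columns; missing (inventor, year) combinations become 0
d = res.unstack(level=1).fillna(0)
matrix = d.values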

[EDIT] As Roland mentioned in a comment, in the standard Python implementation this post does not offer any solution to improve CPU-bound performance.
In the standard Python implementation, threads do not really improve performance on CPU-bound tasks: the Global Interpreter Lock (GIL) ensures that only one thread at a time can execute Python bytecode. This was done to keep the complexity of memory management down.
Have you tried using different threads for the different functions?
Let's say you separate your dataframe into columns and create multiple threads, then assign each thread to apply a function to a column. If you have enough processing power, you might be able to save a lot of time:
from threading import Thread
import pandas as pd
import numpy as np
from queue import Queue
from time import time

# Those will be used afterwards
N_THREAD = 8
q = Queue()
df2 = pd.DataFrame()  # The output of the script

# You create the job that each thread will do
def apply(series, func):
    df2[series.name] = series.map(func)

# You define the context of the jobs
def threader():
    while True:
        worker = q.get()
        apply(*worker)
        q.task_done()

def main():
    # You import your data to a pandas dataframe
    df = pd.DataFrame(np.random.randn(100000, 4), columns=['A', 'B', 'C', 'D'])

    # You create the functions you will apply to your columns
    func1 = lambda x: x < 10
    func2 = lambda x: x == 0
    func3 = lambda x: x >= 0
    func4 = lambda x: x < 0
    func_rep = [func1, func2, func3, func4]

    # You create your threads
    for x in range(N_THREAD):
        t = Thread(target=threader)
        t.daemon = True  # let the program exit even though threader() loops forever
        t.start()

    # Now is the tricky part: you enclose the arguments that
    # will be passed to the function into a tuple which you
    # put into a queue. Then you start the job by "joining"
    # the queue
    for i, func in enumerate(func_rep):
        worker = (df.iloc[:, i], func)
        q.put(worker)
    t0 = time()
    q.join()
    print("Entire job took: {:.3} s.".format(time() - t0))

if __name__ == '__main__':
    main()
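Given the GIL caveat above, if the per-column work is CPU-bound you would want processes rather than threads. A rough process-based sketch of the same column-per-worker idea (not the original answer's code; the functions are defined at module level because lambdas cannot be pickled):
from concurrent.futures import ProcessPoolExecutor
import pandas as pd
import numpy as np

def func1(x):
    return x < 10

def func2(x):
    return x == 0

def func3(x):
    return x >= 0

def func4(x):
    return x < 0

def apply_func(job):
    # job is a (column, function) pair; return the column name and the mapped result
    series, func = job
    return series.name, series.map(func)

def main():
    df = pd.DataFrame(np.random.randn(100000, 4), columns=['A', 'B', 'C', 'D'])
    jobs = [(df.iloc[:, i], f) for i, f in enumerate([func1, func2, func3, func4])]
    with ProcessPoolExecutor(max_workers=4) as executor:
        # each (column, function) pair is handled in a separate process
        df2 = pd.DataFrame(dict(executor.map(apply_func, jobs)))
    print(df2.head())

if __name__ == '__main__':
    main()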

Related

Difference in CPU cores used when using Pool map and Pool starmap

I want to use Pool to split a task among n workers. When I use map with one argument in the task function, I observe that all the cores are used and all tasks are launched simultaneously.
On the other hand, when I use starmap, the tasks launch one by one and I never reach 100% CPU load.
I want to use starmap for my case because I want to pass a second argument, but there's no point if it doesn't take advantage of multiprocessing.
This is the code that works
import numpy as np
from multiprocessing import Pool

# df_a = just a pandas dataframe which I split in n parts and I
#        feed each part to a task. Each one may have a few
#        thousand rows
n_jobs = 16

def run_parallel(df_a):
    dfs_a = np.array_split(df_a, n_jobs)
    print("done split")
    pool = Pool(n_jobs)
    result = pool.map(task_function, dfs_a)
    return result

def task_function(left_df):
    print("in task function")
    # execute task...
    return result

result = run_parallel(df_a)
In this case, "in task function" is printed 16 times, all at the same time.
This is the code that doesn't work
from itertools import repeat

n_jobs = 16

# df_b: a big pandas dataframe (~1.7M rows, ~20 columns) which I
#       want to send to each task as is

def run_parallel(df_a, df_b):
    dfs_a = np.array_split(df_a, n_jobs)
    print("done split")
    pool = Pool(n_jobs)
    result = pool.starmap(task_function, zip(dfs_a, repeat(df_b)))
    return result

def task_function(left_df, right_df):
    print("in task function")
    # execute task
    return result

result = run_parallel(df_a, df_b)
Here, "in task function" is printed sequentially and the processors never reach 100% capacity. I also tried workarounds based on this answer:
https://stackoverflow.com/a/5443941/6941970
but no luck. Even when I used map in this way:
from functools import partial
pool.map(partial(task_function, b=df_b), dfs_a)
considering that maybe repeat(*very big df*) was introducing memory issues, but even so there was no real parallelization.
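One workaround for this kind of situation (a sketch, not from the question) is to ship the big frame to each worker once via a Pool initializer and keep using map with a single argument, so the scheduling behaves like the working case; task_function below is just a stand-in:
from multiprocessing import Pool
import numpy as np

n_jobs = 16
_right_df = None  # set once in each worker process by the initializer

def init_worker(df_b):
    global _right_df
    _right_df = df_b

def task_function(left_df):
    # _right_df is available here without being pickled for every task
    print("in task function")
    return len(left_df), len(_right_df)

def run_parallel(df_a, df_b):
    dfs_a = np.array_split(df_a, n_jobs)
    with Pool(n_jobs, initializer=init_worker, initargs=(df_b,)) as pool:
        return pool.map(task_function, dfs_a)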

Parallelizing a function to write to global variables

I'm trying to figure out the best way to parallelize a program like this:
global_data = some data
global_data2 = some data
data_store1 = np.empty(n)
data_store2 = np.empty(n)
.
.
.
def simulation(global_data):
    retrieve values from global datasets and set elements of global data stores
such that I do something like pass list(enumerate(global_data)) to a multiprocessing function, and each process sets the elements of the global data stores corresponding to the received (index, value) pair. I'm running on a high-performance cluster with 128 cores, so I think parallelization is preferable to threading.
If you use a multiprocessing pool (e.g. a multiprocessing.Pool instance) with its map method, then your worker function, simulation, just needs to return its result back to the main process, which ends up with a list of results in the correct order. This is less expensive than using a managed list to which the worker function adds its results:
import multiprocessing

def simulation(global_data_elem):
    # We are passed a single element of global_data.
    # Do the calculation with global_data_elem and return the result.
    # The CPU resources required to do this calculation must be sufficiently
    # high to justify the additional overhead of multiprocessing (which is
    # not the case for this demo):
    return global_data_elem * 2

def main():
    # global_data is some data (not necessarily at global scope)
    global_data = ['a', 'b', 'c']
    # Create a pool of the correct size
    # (not larger than the number of cores we have nor the number of tasks being submitted):
    pool = multiprocessing.Pool(min(len(global_data), multiprocessing.cpu_count()))
    # Results are returned in the correct order (task submission order):
    results = pool.map(simulation, global_data)
    print(results)

# Required for Windows:
if __name__ == '__main__':
    main()
Prints:
['aa', 'bb', 'cc']
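If you then want the results in the preallocated arrays from the question, you can assign the ordered list returned by pool.map directly, instead of having workers write to shared globals (a small sketch, assuming one numeric result per input element):
import multiprocessing
import numpy as np

def simulation(x):
    # placeholder for the real per-element calculation
    return x * 2.0

def main():
    global_data = np.arange(5, dtype=float)
    data_store1 = np.empty(len(global_data))
    with multiprocessing.Pool() as pool:
        # results come back in submission order, so positional assignment is safe
        data_store1[:] = pool.map(simulation, global_data)
    print(data_store1)

if __name__ == '__main__':
    main()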

Running different Python functions in separate CPUs

Using multiprocessing.pool I can split an input list for a single function to be processed in parallel along multiple CPUs. Like this:
from multiprocessing import Pool

def f(x):
    return x * x

if __name__ == '__main__':
    pool = Pool(processes=4)
    results = pool.map(f, range(100))
    pool.close()
    pool.join()
However, this does not allow running different functions on different processors. If I want to do something like this, in parallel / simultaneously:
foo1(args1) --> Processor1
foo2(args2) --> Processor2
How can this be done?
Edit: After Darkonaut's remarks, I do not care about specifically assigning foo1 to processor number 1. It can be any processor as chosen by the OS. I am just interested in running independent functions in different / parallel processes. So rather:
foo1(args1) --> process1
foo2(args2) --> process2
I usually find it easiest to use the concurrent.futures module for concurrency. You can achieve the same with multiprocessing, but concurrent.futures has (IMO) a much nicer interface.
Your example would then be:
from concurrent.futures import ProcessPoolExecutor

def foo1(x):
    return x * x

def foo2(x):
    return x * x * x

if __name__ == '__main__':
    with ProcessPoolExecutor(2) as executor:
        # these return immediately and are executed in parallel, in separate processes
        future_1 = executor.submit(foo1, 1)
        future_2 = executor.submit(foo2, 2)
        # get results / re-raise exceptions that were thrown in workers
        result_1 = future_1.result()  # contains foo1(1)
        result_2 = future_2.result()  # contains foo2(2)
If you have many inputs, it is better to use executor.map with the chunksize argument instead:
from concurrent.futures import ProcessPoolExecutor

def foo1(x):
    return x * x

def foo2(x):
    return x * x * x

if __name__ == '__main__':
    with ProcessPoolExecutor(4) as executor:
        # these return immediately and are executed in parallel, in separate processes
        future_1 = executor.map(foo1, range(10000), chunksize=100)
        future_2 = executor.map(foo2, range(10000), chunksize=100)
        # executor.map returns an iterator which we have to consume to get the results
        result_1 = list(future_1)  # contains [foo1(x) for x in range(10000)]
        result_2 = list(future_2)  # contains [foo2(x) for x in range(10000)]
Note that the optimal value of chunksize, the number of processes, and whether process-based concurrency actually leads to increased performance all depend on many factors:
The runtime of foo1 / foo2. If they are extremely cheap (as in this example), the communication overhead between processes might dominate the total runtime.
Spawning a process takes time, so the code inside with ProcessPoolExecutor needs to run long enough for this to amortize.
The actual number of physical processors in the machine you are running on.
Whether your application is IO bound or compute bound.
Whether the functions you use in foo are already parallelized (such as some np.linalg solvers, or scikit-learn estimators).

How to update dataframe value in multiprocessing

I want to update a pandas dataframe from multiple processes. I want to compare the speeds of single-core and multi-core pandas dataframe calculations.
The case is as follows: column 'c' in the i-th row is the average of the values of 'a' from row i-9 to row i.
from multiprocessing import Process, Value, Array, Manager
import pandas as pd
import numpy as np
import time

total_num = 1000
df = pd.DataFrame(np.arange(1, total_num*2+1).reshape(total_num, 2),
                  columns=['a', 'b'])
df['c'] = 0
df2 = pd.DataFrame(np.arange(1, total_num*2+1).reshape(total_num, 2),
                   columns=['a', 'b'])
df2['c'] = 0

def Cal(start, end):
    for i in range(end-start-1):
        if i+start < 10:
            df.loc[i+start, 'c'] = df.loc[:i+start, 'c'].mean()
        else:
            df.loc[i+start, 'c'] = df.loc[i-9:i+start, 'c'].mean()

def Cal2(my_df, start, end):
    for i in range(end-start-1):
        if i+start < 10:
            my_df.df.loc[i+start, 'c'] = my_df.df.loc[:i+start, 'c'].mean()
        else:
            my_df.df.loc[i+start, 'c'] = my_df.df.loc[i-9:i+start, 'c'].mean()
    print(my_df)

print('Single core : --->')
start_t = time.time()
Cal(0, total_num+1)
end_t = time.time()
print(end_t-start_t)

print('Multiprocess ---->')
if __name__ == '__main__':
    num = len(df2)
    num_core = 4
    between = num//num_core
    mgr = Manager()
    ns = mgr.Namespace()
    ns.df = df2
    procs = []
    start_t = time.time()
    for index in range(num_core):
        proc = Process(target=Cal2, args=(ns, index*between, (index+1)*between))
        procs.append(proc)
        proc.start()
    for proc in procs:
        proc.join()
    end_t = time.time()
    print(end_t-start_t)
At first I realized that multiprocessing does not share global variables, so I used Manager. However, the 'c' column of df2 did not change.
How do I do what I want to do? :p
You may look at swifter as well; it applies functions using multiprocessing if that results in faster code execution.
In your case it is a terrible idea: 10 rows is a really small amount of data, so distributing it between cores will not help, and the cost of spawning processes will be much higher than the operations themselves.
Furthermore, sharing memory between processes is not a good idea (it is really costly), and that's what you are trying to do here (usually you split the data beforehand and push it to multiprocessing functions like applymap, but once again, the data chunks should be much bigger).
You could use threads, as those may be what you are after, but remember Python's GIL (you can read about threads, processes and the GIL in other answers, e.g. here).
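As an aside, the per-row average described in the question (column 'c' as the mean of 'a' over the previous 10 rows, including the current one) can be expressed as a single vectorized rolling mean, which at 1000 rows will easily beat both the loop and any parallel variant (a sketch, assuming that reading of the requirement):
import numpy as np
import pandas as pd

total_num = 1000
df = pd.DataFrame(np.arange(1, total_num*2+1).reshape(total_num, 2), columns=['a', 'b'])

# trailing 10-row window over 'a'; shorter windows are allowed at the start
df['c'] = df['a'].rolling(window=10, min_periods=1).mean()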

Is there a simple process-based parallel map for python?

I'm looking for a simple process-based parallel map for python, that is, a function
parmap(function,[data])
that would run function on each element of [data] on a different process (well, on a different core, but AFAIK, the only way to run stuff on different cores in python is to start multiple interpreters), and return a list of results.
Does something like this exist? I would like something simple, so a simple module would be nice. Of course, if no such thing exists, I will settle for a big library :-/
It seems like what you need is the map method of multiprocessing.Pool():
map(func, iterable[, chunksize])
A parallel equivalent of the map() built-in function (it supports only
one iterable argument though). It blocks till the result is ready.
This method chops the iterable into a number of chunks which it submits to the
process pool as separate tasks. The (approximate) size of these chunks can be
specified by setting chunksize to a positive integer.
For example, if you wanted to map this function:
def f(x):
    return x**2
to range(10), you could do it using the built-in map() function:
map(f, range(10))
or using a multiprocessing.Pool() object's method map():
import multiprocessing
pool = multiprocessing.Pool()
print(pool.map(f, range(10)))
This can be done elegantly with Ray, a system that allows you to easily parallelize and distribute your Python code.
To parallelize your example, you'd need to define your map function with the @ray.remote decorator, and then invoke it with .remote. This ensures that every instance of the remote function is executed in a different process.
import time
import ray

ray.init()

# Define the function you want to apply map on, as a remote function.
@ray.remote
def f(x):
    # Do some work...
    time.sleep(1)
    return x*x

# Define a helper parmap(f, list) function.
# This function executes a copy of f() on each element in "list".
# Each copy of f() runs in a different process.
# Note f.remote(x) returns a future of its result (i.e.,
# an identifier of the result) rather than the result itself.
def parmap(f, list):
    return [f.remote(x) for x in list]

# Call parmap() on a list consisting of the first 5 integers.
result_ids = parmap(f, range(1, 6))

# Get the results
results = ray.get(result_ids)
print(results)
This will print:
[1, 4, 9, 16, 25]
and it will finish in approximately ceil(len(list)/p) seconds, where p is the number of cores on your machine. Assuming a machine with 2 cores, our example will execute in ceil(5/2), i.e., approximately 3 seconds.
There are a number of advantages of using Ray over the multiprocessing module. In particular, the same code will run on a single machine as well as on a cluster of machines. For more advantages of Ray see this related post.
Python3's Pool class has a map() method and that's all you need to parallelize map:
from multiprocessing import Pool

with Pool() as P:
    xtransList = P.map(some_func, a_list)
Using with Pool() as P creates a process pool that will execute each item in the list in parallel. You can provide the number of processes:
with Pool(processes=4) as P:
For those who are looking for a Python equivalent of R's mclapply(), here is my implementation. It is an improvement on the following two examples:
"Parallelize Pandas map() or apply()", as mentioned by @Rafael Valero.
How to apply map to functions with multiple arguments.
It can be applied to map functions with single or multiple arguments.
import numpy as np, pandas as pd
from scipy import sparse
import functools, multiprocessing
from multiprocessing import Pool

num_cores = multiprocessing.cpu_count()

def parallelize_dataframe(df, func, U=None, V=None):
    #blockSize = 5000
    num_partitions = 5  # int( np.ceil(df.shape[0]*(1.0/blockSize)) )
    blocks = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    if V is not None and U is not None:
        # apply func with multiple arguments to dataframe (i.e. involves multiple columns)
        df = pd.concat(pool.map(functools.partial(func, U=U, V=V), blocks))
    else:
        # apply func with one argument to dataframe (i.e. involves a single column)
        df = pd.concat(pool.map(func, blocks))
    pool.close()
    pool.join()
    return df

def square(x):
    return x**2

def test_func(data):
    print("Process working on: ", data.shape)
    data["squareV"] = data["testV"].apply(square)
    return data

def vecProd(row, U, V):
    return np.sum( np.multiply(U[int(row["obsI"]),:], V[int(row["obsJ"]),:]) )

def mProd_func(data, U, V):
    data["predV"] = data.apply( lambda row: vecProd(row, U, V), axis=1 )
    return data

def generate_simulated_data():
    N, D, nnz, K = [302, 184, 5000, 5]
    I = np.random.choice(N, size=nnz, replace=True)
    J = np.random.choice(D, size=nnz, replace=True)
    vals = np.random.sample(nnz)
    sparseY = sparse.csc_matrix((vals, (I, J)), shape=[N, D])
    # Generate parameters U and V which could be used to reconstruct the matrix Y
    U = np.random.sample(N*K).reshape([N,K])
    V = np.random.sample(D*K).reshape([D,K])
    return sparseY, U, V

def main():
    Y, U, V = generate_simulated_data()

    # find row, column indices and observed values for sparse matrix Y
    (testI, testJ, testV) = sparse.find(Y)
    colNames = ["obsI", "obsJ", "testV", "predV", "squareV"]
    dtypes = {"obsI":int, "obsJ":int, "testV":float, "predV":float, "squareV": float}
    obsValDF = pd.DataFrame(np.zeros((len(testV), len(colNames))), columns=colNames)
    obsValDF["obsI"] = testI
    obsValDF["obsJ"] = testJ
    obsValDF["testV"] = testV
    obsValDF = obsValDF.astype(dtype=dtypes)
    print("Y.shape: {!s}, #obsVals: {}, obsValDF.shape: {!s}".format(Y.shape, len(testV), obsValDF.shape))

    # calculate the square of testVals
    obsValDF = parallelize_dataframe(obsValDF, test_func)

    # reconstruct prediction of testVals using parameters U and V
    obsValDF = parallelize_dataframe(obsValDF, mProd_func, U, V)
    print("obsValDF.shape after reconstruction: {!s}".format(obsValDF.shape))
    print("First 5 elements of obsValDF:\n", obsValDF.iloc[:5,:])

if __name__ == '__main__':
    main()
I know this is an old post, but just in case, I wrote a tool to make this super, super easy called parmapper (I actually call it parmap in my use but the name was taken).
It handles a lot of the setup and teardown of processes and adds tons of features. In rough order of importance:
Can take lambdas and other unpicklable functions
Supports starmap and other similar calling conventions, making it very easy to use directly
Can split work amongst threads and/or processes
Includes features such as progress bars
It does incur a small cost but for most uses, that is negligible.
I hope you find it useful.
(Note: It, like map in Python 3+, returns an iterable so if you expect all results to pass through it immediately, use list())
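For illustration only, a hypothetical usage sketch based on the description above; the exact function name and signature are assumptions, so check the project's README before relying on it:
import parmapper

def work(x):
    return x * x

# assumed entry point: a parallel drop-in for map that returns an iterable
results = list(parmapper.parmap(work, range(100)))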
