I have the following code:
data = [2,5,3,16,2,5]
def f(x):
    return 2*x
f_total = 0
for x in data:
    f_total += f(x)
print(f_total/len(data))
I want to speed up the for loop. (In reality the code is more complex, and I want to run it on a supercomputer with many processing cores.) I have read that I can do this with the multiprocessing library, which lets python3 run different chunks of the loop at the same time, but I am a bit lost with it.
Could you explain to me how to do it with this minimal version of my program?
Thanks!
import multiprocessing
from numpy import random

# Number of workers to run in parallel. Note that multiprocessing.Pool starts
# worker processes, not threads. Choose the number based on the cores in your
# system; 'map' and 'imap' distribute the input values among the workers.
NUM_CORES = 6
data = random.rand(100, 1)

# +2 so that the cores are not left idle in case a worker is waiting for I/O.
# Choose by performing an empirical analysis depending on the function you are
# trying to compute; it could equal NUM_CORES as well. You can also vary the
# chunksize depending on the size of 'data' that you have.
NUM_THREADS = NUM_CORES + 2
CHUNKSIZE = int(len(data) / NUM_THREADS)

def f(x):
    return 2*x

# The __main__ guard is required on platforms that start workers by spawning
# a fresh interpreter, which re-imports this module.
if __name__ == "__main__":
    # This takes care of creating the pool of workers which will be assigned the jobs
    with multiprocessing.Pool(NUM_THREADS) as pool:
        # map vs imap. If the data is large go for imap, else map is also good.
        it = pool.imap(f, data, chunksize=CHUNKSIZE)
        f_total = 0
        # Iterate and sum up the results (each value is a length-1 array row)
        for value in it:
            f_total += sum(value)
    print(f_total/len(data))
Why choose imap over map?
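On imap versus map, here is a minimal sketch using the data from the question. It assumes nothing beyond the standard library; the helper names mean_with_map and mean_with_imap are made up for illustration. The practical difference is that map waits for everything and returns a full list, while imap hands back an iterator that yields results in order as they become available, which helps when the result list would be large.

```python
from multiprocessing import Pool

def f(x):
    return 2 * x

def mean_with_map(pool, data):
    # map blocks until every result is ready and returns them as one list.
    results = pool.map(f, data)
    return sum(results) / len(data)

def mean_with_imap(pool, data, chunksize=2):
    # imap returns an iterator; results are consumed in input order as they
    # arrive, so the parent never has to hold the whole result list.
    total = 0
    for value in pool.imap(f, data, chunksize=chunksize):
        total += value
    return total / len(data)

if __name__ == "__main__":
    data = [2, 5, 3, 16, 2, 5]
    with Pool(4) as pool:
        print(mean_with_map(pool, data))
        print(mean_with_imap(pool, data))
```

For six numbers neither version will beat the plain loop, because starting processes and passing data between them costs far more than doubling an integer. The pattern only pays off when f does substantial work per call.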
I want to use Pool to split a task among n workers. What happens is that when I'm using map with one argument in the task function, I observe that all the cores are used, all tasks are launched simultaneously.
On the other hand, when I'm using starmap, task launch is one by one and I never reach 100% CPU load.
I want to use starmap for my case because I want to pass a second argument, but there's no use if it doesn't take advantage of multiprocessing.
This is the code that works
import numpy as np
from multiprocessing import Pool
# df_a = just a pandas dataframe which I split in n parts and I
# feed each part to a task. Each one may have a few
# thousand rows
n_jobs = 16
def run_parallel(df_a):
    dfs_a = np.array_split(df_a, n_jobs)
    print("done split")
    pool = Pool(n_jobs)
    result = pool.map(task_function, dfs_a)
    return result
def task_function(left_df):
    print("in task function")
    # execute task...
    return result
result = run_parallel(df_a)
In this case, "in task function" is printed 16 times at the same time.
This is the code that doesn't work
from itertools import repeat
n_jobs = 16
# df_b: a big pandas dataframe (~1.7M rows, ~20 columns) which I
# want to send to each task as is
def run_parallel(df_a, df_b):
    dfs_a = np.array_split(df_a, n_jobs)
    print("done split")
    pool = Pool(n_jobs)
    result = pool.starmap(task_function, zip(dfs_a, repeat(df_b)))
    return result
def task_function(left_df, right_df):
    print("in task function")
    # execute task
    return result
result = run_parallel(df_a, df_b)
Here, "in task function" is printed sequentially and the processors never reach 100% capacity. I also tried workarounds based on this answer:
https://stackoverflow.com/a/5443941/6941970
but no luck. Even when I used map in this way:
from functools import partial
pool.map(partial(task_function, right_df=df_b), dfs_a)
I tried this in case repeat(*very big df*) was introducing memory issues, but there still wasn't any real parallelization.
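One possible explanation, offered as an assumption rather than a diagnosis: every task's arguments are pickled in the parent and sent to a worker, so with starmap (or partial) the big df_b is serialized once per task, and the workers start one after another as each copy arrives. A common workaround is to hand the large object to each worker once through the Pool initializer. The sketch below uses plain Python lists and a set as stand-ins for the DataFrames; init_worker, _shared_right and split_evenly are hypothetical names.

```python
from multiprocessing import Pool

# Hypothetical module-level slot that each worker fills in once at startup.
_shared_right = None

def init_worker(right):
    """Runs once per worker process; stores the large object for later tasks."""
    global _shared_right
    _shared_right = right

def task_function(left_chunk):
    """Stand-in task: count items of the chunk that appear in the shared data."""
    return sum(1 for item in left_chunk if item in _shared_right)

def split_evenly(items, n_parts):
    """Split a list into n_parts contiguous pieces whose sizes differ by at most 1."""
    bounds = [i * len(items) // n_parts for i in range(n_parts + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(n_parts)]

def run_parallel(left, right, n_jobs=4):
    chunks = split_evenly(left, n_jobs)
    # 'right' is sent once per worker here, not once per task.
    with Pool(n_jobs, initializer=init_worker, initargs=(right,)) as pool:
        return pool.map(task_function, chunks)

if __name__ == "__main__":
    left = list(range(1000))
    right = set(range(0, 1000, 2))  # plain set standing in for df_b
    print(sum(run_parallel(left, right)))
```

With this layout the tasks only carry the small left-hand chunks, so they can be dispatched quickly. Whether it removes the bottleneck in the original program depends on how long pickling df_b actually takes there, which is worth measuring first.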
We have a batch processing system which we are looking to modify to use multiple threads. The process takes in a delimited file and performs calculations on it via pandas.
I would like to split up the dataframe into N chunks if the total amount of records exceeds a threshold. Each chunk should then be fed to a thread from a threadpool executor to get the calculations done, then at the end I would wait for the threads to sync and concatenate the resulting DFs into one.
The problem is that I'm not sure how to split a pandas DataFrame like this. Let's say there's an arbitrary number of threads, 2 as an example, and I want to start splitting if the number of records is over 200000.
So the idea would be, if I send a file with 200001 records, thread 1 would get 100000, and thread 2 would get 100001. If I send one with 1000000, thread 1 would get 500000 and thread 2 would get 500000.
(If the total records don't exceed this threshold, I'd just execute the process on a single thread)
I have seen related solutions, but none have applied to my case.
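The split arithmetic described above can be sketched without pandas at all. This is only an illustration under the stated rule (split into N near-equal contiguous parts when the record count exceeds the threshold); split_bounds is a hypothetical helper name.

```python
def split_bounds(n_records, n_threads, threshold):
    """Return (start, stop) row ranges: a single range if n_records does not
    exceed threshold, otherwise n_threads contiguous ranges whose sizes
    differ by at most one."""
    if n_records <= threshold:
        return [(0, n_records)]
    edges = [i * n_records // n_threads for i in range(n_threads + 1)]
    return [(edges[i], edges[i + 1]) for i in range(n_threads)]

for start, stop in split_bounds(200001, 2, 200000):
    print(start, stop, stop - start)
# prints:
# 0 100000 100000
# 100000 200001 100001
```

Each (start, stop) pair can then be used to slice the DataFrame by position, for example with df.iloc[start:stop], and the slices passed to the executor.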
def do_something(df):
    if len(df) > some_threshold:
        pivot = len(df)//2
        # first half goes to a new thread, second half is handled here
        threading.Thread(target=do_something, args=(df[:pivot],)).start()
        return do_something(df[pivot:])
    actually_do_something_with_smallish_df(df)
maybe?
Below, I've included example code of how to split. Then, using ThreadPoolExecutor, it will execute the code with eight threads, in my case (you can use the Thread library too). The process_pandas function is just a dummy function; you can use whatever you want:
import pandas as pd
from concurrent.futures import ThreadPoolExecutor as th
threshold = 300
block_size = 100
num_threads = 8
big_list = pd.read_csv('pandas_list.csv',delimiter=';',header=None)
blocks = []
if len(big_list) > threshold:
    for i in range((len(big_list)//block_size)):
        blocks.append(big_list[block_size*i:block_size*(i+1)])
    i=i+1
    if i*block_size < len(big_list):
        blocks.append(big_list[block_size*i:])
else:
    blocks.append(big_list)
def process_pandas(df):
    print('Doing calculations...')
    indexes = list(df.index.values)
    df.loc[indexes[0], 2] = 'changed'
    return df
with th(num_threads) as ex:
    results = ex.map(process_pandas,blocks)
final_dataframe = pd.concat(results, axis=0)
I am trying to use python to process some large data sets from several data stations. My idea is to use multiprocessing.pool to assign each CPU the data from a single station, since the data from each station are independent from each other.
However, it seems that my calculation time does not really go down compared to a single for loop.
Here is part of my code:
#function calculating the square of each data point, and taking the cumulative sum
def get_cumdd(data):
    #if not isinstance(data, list):
    #    data = [data]
    dd = np.zeros((len(data),1))
    cum_dd = np.zeros((len(data),1))
    for i in range(len(data)):
        dd[i] = data[i]**2
    cum_dd=np.cumsum(dd)
    return cum_dd
#parallelization between each station
if __name__ == '__main__':
    n_proc = np.min([mp.cpu_count(),nstation]) #nstation = 10
    p = mp.Pool(processes=int(n_proc))
    result = p.map(get_cumdd,data)
    p.close()
    p.join()
    cum_dd = np.zeros((nstation,len(data[0])))
    for i in range(nstation):
        cum_dd[i] = result[i].T
I do not use chunksize because cum_dd takes the summation of all the previous data^2. I am essentially dividing my data into 10 equal pieces because there is no communication between processes. I wonder if I missed anything here.
My data has 2 million points per station per day, and I need to process years of data.
This doesn't address your multiprocessing question directly, but (as Ugur MULUK and Iguananaut mention) I think your get_cumdd function is inefficient. Numpy provides np.cumsum. Reimplementing your function I get more than 1000x speedup for an array with 10k elements. With 100k elements it's about 7000x faster. With 2M elements I didn't bother to let it finish.
# your function
def cum_dd(data):
    #if not isinstance(data, list):
    #    data = [data]
    dd = np.zeros((len(data),1))
    cum_dd = np.zeros((len(data),1))
    for i in range(len(data)):
        dd[i] = data[i]**2
        # sum through element i (inclusive) so the result matches np.cumsum
        cum_dd[i]=np.sum(dd[0:i+1])
    return cum_dd
# numpy implementation
def cum_dd2(data):
    # adding an axis to match the shape of the output of your cum_dd function
    return np.cumsum(data**2)[:, np.newaxis]
For 2e6 points this implementation takes ~11ms on my computer. I think that's about 30 seconds for 10 years of data for a single station.
NumPy's array operations already run in optimized compiled code on the CPU, and many of them use Single Instruction Multiple Data (SIMD) instructions.
By pooling computations manually around an explicit Python loop, you are giving up that efficiency. You can improve performance by vectorizing your explicit for loop.
See the video below for more information about vectorization.
https://www.youtube.com/watch?v=qsIrQi0fzbY
If you are having difficulties, I will be around for updates or help. Good luck!
Thanks a lot for all the comments and answers! After applying vectorization and pooling, I reduced the calculation time from one hour to 3 seconds (10*1.7 million data points). I have my code here in case anyone is interested:
def get_cumdd(data):
    #if not isinstance(data, list):
    #    data = [data]
    dd = np.zeros((len(data),1))
    for i in range(len(data)):
        dd[i] = data[i]**2
    cum_dd=np.cumsum(dd)
    return dd,cum_dd
if __name__ == '__main__':
    n_proc = np.min([mp.cpu_count(),nstation])
    p = mp.Pool(processes=int(n_proc))
    result = p.map(CC.get_cumdd,d)
    p.close()
    p.join()
I'm not using shared memory Queue because all my processes are independent from each other.
I am using the threading library to speed up calculating each point's neighborhood in a point cloud, by calling the function CalculateAllPointsNeighbors shown at the bottom of the post.
The function receives a search radius, a maximum number of neighbors and a number of threads to split the work across. None of the points are modified, and each point stores its data in its own np.ndarray cell, accessed by its own index.
The following function times how long it takes N number of threads to finish calculating all points neighborhoods:
def TimeFuncThreads(classObj, uptothreads):
    listTimers = []
    startNum = 1
    EndNum = uptothreads + 1
    for i in range(startNum, EndNum):
        print("Current Number of Threads to Test: ", i)
        tempT = time.time()
        classObj.CalculateAllPointsNeighbors(searchRadius=0.05, maxNN=25, maxThreads=i)
        tempT = time.time() - tempT
        listTimers.append(tempT)
    PlotXY(np.arange(startNum, EndNum), listTimers)
The problem is, I've been getting very different results in each run. Here are the plots from 5 subsequent runs of the function TimeFuncThreads. The X axis is number of threads, Y is the runtime. First thing is, they look totally random. And second, there is no significant acceleration boost.
I'm now unsure whether I'm using the threading library wrong, and what explains the behavior I'm getting.
The function that handles the threading and the function that is being called from each thread:
def CalculateAllPointsNeighbors(self, searchRadius=0.20, maxNN=50, maxThreads=8):
    threadsList = []
    pointsIndices = np.arange(self.numberOfPoints)
    splitIndices = np.array_split(pointsIndices, maxThreads)
    for i in range(maxThreads):
        threadsList.append(threading.Thread(target=self.GetPointsNeighborsByID,
                                            args=(splitIndices[i], searchRadius, maxNN)))
    [t.start() for t in threadsList]
    [t.join() for t in threadsList]
def GetPointsNeighborsByID(self, idx, searchRadius=0.05, maxNN=20):
    if isinstance(idx, int):
        idx = [idx]
    for currentPointIndex in idx:
        currentPoint = self.pointsOpen3D.points[currentPointIndex]
        pointNeighborhoodObject = self.GetPointNeighborsByCoordinates(currentPoint, searchRadius, maxNN)
        self.pointsNeighborsArray[currentPointIndex] = pointNeighborhoodObject
        self.__RotatePointNeighborhood(currentPointIndex)
It pains me to be the one to introduce you to the Python GIL (Global Interpreter Lock). It is a very nice feature that makes parallelism using threads in Python a nightmare.
If you really want to improve your code speed, you should be looking at the multiprocessing module.
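As a rough illustration of the process-based approach, here is a sketch that splits point indices across worker processes with concurrent.futures.ProcessPoolExecutor. It is not the poster's class: count_neighbors is a hypothetical CPU-bound stand-in that works on points along a 1-D line, and split_indices is an assumed helper. The important structural change is that each worker returns its results, because worker processes do not share memory with the parent and cannot write into something like self.pointsNeighborsArray directly.

```python
from concurrent.futures import ProcessPoolExecutor

def split_indices(n_items, n_parts):
    """Contiguous index ranges covering 0..n_items-1, sizes differing by at most 1."""
    edges = [i * n_items // n_parts for i in range(n_parts + 1)]
    return [range(edges[i], edges[i + 1]) for i in range(n_parts)]

def count_neighbors(args):
    """Hypothetical stand-in for the neighborhood search: for each index in
    the range, count the other points within 'radius' on a 1-D line."""
    index_range, points, radius = args
    return [sum(1 for q in points if abs(points[i] - q) <= radius) - 1
            for i in index_range]

def neighbors_parallel(points, radius, max_workers=4):
    tasks = [(r, points, radius) for r in split_indices(len(points), max_workers)]
    with ProcessPoolExecutor(max_workers=max_workers) as ex:
        # ex.map preserves task order, so results line up with point indices.
        per_chunk = list(ex.map(count_neighbors, tasks))
    return [count for chunk in per_chunk for count in chunk]

if __name__ == "__main__":
    pts = [i * 0.01 for i in range(2000)]
    print(neighbors_parallel(pts, radius=0.05)[:5])
```

Note that the full points list is sent to every worker here, which is fine for a sketch but can dominate the runtime for large clouds. Timing several repetitions per worker count, rather than a single run, also helps separate real scaling from noise.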
I have a very large list of strings (originally from a text file) that I need to process using python. Eventually I am trying to go for a map-reduce style of parallel processing.
I have written a "mapper" function and fed it to multiprocessing.Pool.map(), but it takes the same amount of time as simply calling the mapper function with the full set of data. I must be doing something wrong.
I have tried multiple approaches, all with similar results.
def initial_map(lines):
    results = []
    for line in lines:
        processed = # process line (O^(1) operation)
        results.append(processed)
    return results
def chunks(l, n):
    for i in xrange(0, len(l), n):
        yield l[i:i+n]
if __name__ == "__main__":
    lines = list(open("../../log.txt", 'r'))
    pool = Pool(processes=8)
    partitions = chunks(lines, len(lines)/8)
    results = pool.map(initial_map, partitions, 1)
So the chunks function makes a list of sublists of the original set of lines to give to the pool.map(), then it should hand these 8 sublists to 8 different processes and run them through the mapper function. When I run this I can see all 8 of my cores peak at 100%. Yet it takes 22-24 seconds.
When I simply run this (single process/thread):
lines = list(open("../../log.txt", 'r'))
results = initial_map(lines)
It takes about the same amount of time. ~24 seconds. I only see one process getting to 100% CPU.
I have also tried letting the pool split up the lines itself and have the mapper function only handle one line at a time, with similar results.
def initial_map(line):
    processed = # process line (O^(1) operation)
    return processed
if __name__ == "__main__":
    lines = list(open("../../log.txt", 'r'))
    pool = Pool(processes=8)
    pool.map(initial_map, lines)
~22 seconds.
Why is this happening? Parallelizing this should result in faster results, shouldn't it?
If the amount of work done in one iteration is very small, you're spending a big proportion of the time just communicating with your subprocesses, which is expensive. Instead, try to pass bigger slices of your data to the processing function. Something like the following:
slices = (data[i:i+100] for i in range(0, len(data), 100))
def process_slice(data):
    return [initial_map(x) for x in data]
pool.map(process_slice, slices)
# and then itertools.chain the output to flatten it
(I don't have my computer with me, so I can't give you a full working solution or verify what I said.)
Edit: or see the 3rd comment on your question by #ubomb.
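For completeness, here is a self-contained sketch of the slicing idea, including the flattening step with itertools.chain. It is written for Python 3 and is not the original poster's code: process_line is a hypothetical cheap per-line operation, and make_slices and parallel_process are assumed helper names.

```python
from itertools import chain
from multiprocessing import Pool

def process_line(line):
    """Hypothetical cheap per-line operation."""
    return line.strip().upper()

def process_slice(lines):
    """Process a whole slice in one worker call to amortize communication cost."""
    return [process_line(line) for line in lines]

def make_slices(lines, size):
    """Yield consecutive slices of at most 'size' lines."""
    for start in range(0, len(lines), size):
        yield lines[start:start + size]

def parallel_process(lines, slice_size=100, processes=4):
    with Pool(processes=processes) as pool:
        nested = pool.map(process_slice, make_slices(lines, slice_size))
    # Flatten the list of per-slice results back into one list.
    return list(chain.from_iterable(nested))

if __name__ == "__main__":
    sample = ["line %d\n" % i for i in range(1000)]
    print(parallel_process(sample)[:3])
```

If the per-line work really is tiny, even sliced input may not beat the single-process version, since every line still has to be pickled to a worker and its result pickled back. The slice size is worth tuning empirically.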