I'm trying to revisit this slightly older question and see if there's a better answer these days.
I'm using python3 and I'm trying to share a large dataframe with the workers in a pool. My function reads the dataframe, generates a new array using data from the dataframe, and returns that array. Example code below (note: in the example below I do not actually use the dataframe, but in my code I do).
import numpy as np
import pandas as pd
from multiprocessing import Pool

def func(i):
    return i * 2

def par_func_dict(mydict):
    values = mydict['values']
    df = mydict['df']   # the dataframe is carried along but not used in this toy example
    return pd.Series([func(i) for i in values])

N = 10000
cores = 3   # number of worker processes (not defined in the original snippet)
arr = list(range(N))
data_split = np.array_split(arr, 3)
df = pd.DataFrame(np.random.randn(10, 10))

pool = Pool(cores)
gen = ({'values': i, 'df': df} for i in data_split)
data = pd.concat(pool.map(par_func_dict, gen), axis=0)
pool.close()
pool.join()
I'm wondering if there's a way to avoid feeding the generator copies of the dataframe, so the workers don't take up so much memory.
The answer to the linked question suggests using multiprocessing.Process(), but from what I can tell it's difficult to use that with functions that return things (you need to incorporate signals / events), and the comments indicate that each process still ends up using a large amount of memory.
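One pattern sometimes suggested for this is to hand the dataframe to each worker once through a Pool initializer and keep it in a module-level global, so only the chunks of values travel through pool.map. A rough sketch of the idea (_init_worker, _shared_df and par_func are made-up names; each worker still holds its own copy of the dataframe, but it is shipped once per worker rather than once per chunk):

import numpy as np
import pandas as pd
from multiprocessing import Pool

_shared_df = None   # set once in each worker by the initializer

def _init_worker(df):
    global _shared_df
    _shared_df = df

def par_func(values):
    # _shared_df is available here without being pickled for every chunk
    return pd.Series([i * 2 for i in values])

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randn(10, 10))
    data_split = np.array_split(list(range(10000)), 3)
    with Pool(3, initializer=_init_worker, initargs=(df,)) as pool:
        data = pd.concat(pool.map(par_func, data_split), axis=0)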
There is this stackoverflow post that really nicely shows a way to calculate the proximity matrix of a RandomForestClassifier().
Proximity Matrix in sklearn.ensemble.RandomForestClassifier
Nevertheless, the for loop in that script is quite slow if you have a large dataframe. I tried to parallelize this for loop, but unsuccessfully: I only get None as an output.
How can I parallelize this for-loop in Spyder 4 running Python 3.8.5 on Windows 10?
# terminals, nTrees, and a (initially terminals[:, 0]) come from the linked post
proxMat = 1 * np.equal.outer(a, a)
for i in range(1, nTrees):
    a = terminals[:, i]
    proxMat += 1 * np.equal.outer(a, a)
Here you want to perform a reduce operation, so parallelization is not obvious.
You did not specify how you tried to parallelize the loop.
A simple way to parallelize it:
import multiprocessing
import numpy as np

def get_outer(i):
    return np.equal.outer(terminals[:, i], terminals[:, i])

pool = multiprocessing.Pool(processes=4)

todo = list(range(1, nTrees))
results = pool.map(get_outer, todo)

proxMat = 1 * np.equal.outer(a, a)
for res in results:
    proxMat += res
I'm not sure this one would help, but you might have fewer pickling problems:
import multiprocessing
import numpy as np

def get_outer(t):
    return np.equal.outer(t, t)

pool = multiprocessing.Pool(processes=4)

# This part might be costly!
terms = [terminals[:, i] for i in range(1, nTrees)]
results = pool.map(get_outer, terms)

proxMat = 1 * np.equal.outer(a, a)
for res in results:
    proxMat += res
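Since the question mentions Spyder on Windows 10, where multiprocessing starts workers with the spawn method, the worker function also needs to be defined in an importable module and the pool should be created under a main guard, roughly like this (untested sketch reusing the names above):

import multiprocessing
import numpy as np

def get_outer(t):
    return np.equal.outer(t, t)

if __name__ == '__main__':
    # terminals, nTrees and a are assumed to be built as in the linked post
    terms = [terminals[:, i] for i in range(1, nTrees)]
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(get_outer, terms)

    proxMat = 1 * np.equal.outer(a, a)
    for res in results:
        proxMat += res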
I use multiprocessing Pool to run in parallel. I first tried 4 cores on an HPC system via a job submission. When it uses 4 cores, the time is reduced about 4 times compared to 1 core. But when I check with qstat, the job sometimes uses 4 cores and other times just 1 core, with exactly the same code.
Could you please give some advice on what is wrong with my code or with the system?
import pandas as pd
import numpy as np
from multiprocessing import Pool
from datetime import datetime

t1 = pd.read_csv("template.csv", header=None)
s1 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_adfr.csv")
s2 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_dock.csv")
s3 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_gemdock.csv")
s4 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_ledock.csv")
s5 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_plants.csv")
s6 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_psovina.csv")
s7 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_quickvina2.csv")
s8 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_smina.csv")
s9 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_vina.csv")
s10 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_vinaxb.csv")

# number of cores and block size
n = 4
m = (len(t1) // n) + 1
g = m * n - len(t1)

# pad t1 with zero rows so it splits evenly into n blocks
for g1 in range(g):
    t1.loc[len(t1)] = 0

results = []

def block_linear(i):
    temp = pd.DataFrame(np.zeros((m, 29)))
    for a in range(0, m):
        sum_matrix = (t1.iloc[a, 0] * s1) + (t1.iloc[a, 1] * s2) + (t1.iloc[a, 2] * s3) + \
                     (t1.iloc[a, 3] * s4) + (t1.iloc[a, 4] * s5) + (t1.iloc[a, 5] * s6) + \
                     (t1.iloc[a, 6] * s7) + (t1.iloc[a, 7] * s8) + (t1.iloc[a, 8] * s9) + \
                     (t1.iloc[a, 9] * s10)
        rank_sum = pd.DataFrame.rank(sum_matrix, axis=0, ascending=True, method='min')  # real-True
        temp.iloc[a, :] = rank_sum.iloc[999].values
    temp['median'] = temp.median(axis=1)
    temp.index = range(i * m, (i + 1) * m)
    return temp

start = datetime.now()
if __name__ == '__main__':
    pool = Pool(processes=n)
    results = pool.map(block_linear, range(0, n))
    print(datetime.now() - start)
    out = pd.concat(results)
    out.drop(out.tail(g).index, inplace=True)
    out.to_csv('test_10dock_4core.csv', index=False)
The main idea is to cut the large table into smaller ones, run the calculations, and combine the results together.
Without a more detailed description of how you are using multiprocessing's Pool, it is really difficult to understand and help. Please note that Pool by itself does not guarantee parallelization: the apply function, for example, only uses one worker of the Pool and blocks until it has finished. You can check out more details about it here and there.
But assuming you are using the library properly, you should make sure your code is fully parallelizable: an I/O operation on disk, for example, can bottleneck your parallelization and thus make your code run in only one process at a time.
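As a quick illustration of that difference, apply submits a single task and blocks until it returns, while map spreads many tasks across the workers (a toy sketch):

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        one = pool.apply(square, (3,))       # one task, one worker, blocks until done
        many = pool.map(square, range(100))  # many tasks, distributed over the 4 workers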
I hope it helped.
[Edit]
Since you provided more details about your problem, I can give more specific tips:
The first thing is that your code is not really parallel: you are just calling the same function N times. This is not how multiprocessing should be used.
Instead, the part that should run in parallel is the work that usually sits inside a for loop, like the one you have inside block_linear().
So, here is what I recommend:
You should change your code to first calculate all your weighted sums, and only after that do the rest of the operations. This will help a lot with parallelization.
So, put this operation in a function:
def weighted_sum(column, df2):
    # t1 and m come from the question's code above
    temp = pd.DataFrame(np.zeros(m))
    for a in range(0, m):
        result = (t1.iloc[a, column] * df2)
        temp.iloc[a] = result
    return temp
Then you use pool.starmap to run the function in parallel for the 10 dataframes you have, something like this:
results = pool.starmap(weighted_sum, [(0, s1), (1, s2), (2, s3), ..., (9, s10)])
PS: pool.starmap is similar to pool.map but accepts a list of argument tuples. You can find more details about it here.
Last but not least, you should combine your results to finish the calculation. Since you will have one weighted sum per column, you can add those partial results together and then compute the rank_sum.
This is not fully runnable code that solves your problem, but a general guide on how you should restructure your code to get an advantage from multiprocessing. I recommend testing it on a subsample of the dataframes just to make sure it works properly before you run it on all your data.
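For concreteness, a rough, untested sketch of how the pieces could fit together (weighted_scores is a hypothetical helper name; t1 and s1 ... s10 are the tables loaded in the question, and the score tables are assumed to have at least 1000 rows since the question keeps row 999). Note that this trades memory for parallelism: each task materializes a (len(t1), rows, cols) array, so if that does not fit in RAM you would parallelize over blocks of t1 rows instead.

from multiprocessing import Pool

import numpy as np
import pandas as pd

def weighted_scores(weights, scores):
    # weights: one column of t1 (one weight per template row)
    # scores: the corresponding score table (s1 ... s10)
    # returns an array of shape (len(t1), n_rows, n_cols): one weighted copy of the table per row of t1
    return weights[:, None, None] * scores.to_numpy()[None, :, :]

if __name__ == '__main__':
    score_dfs = [s1, s2, s3, s4, s5, s6, s7, s8, s9, s10]
    tasks = [(t1.iloc[:, k].to_numpy(), sk) for k, sk in enumerate(score_dfs)]

    with Pool(processes=4) as pool:
        partial = pool.starmap(weighted_scores, tasks)   # the parallel part

    total = sum(partial)             # elementwise sum of the 10 weighted stacks
    rows = []
    for a in range(total.shape[0]):  # the sequential combine / rank step
        ranks = pd.DataFrame(total[a]).rank(axis=0, ascending=True, method='min')
        rows.append(ranks.iloc[999].values)
    out = pd.DataFrame(rows)
    out['median'] = out.median(axis=1)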
I am trying to use Python to process some large data sets from several data stations. My idea is to use multiprocessing.Pool to assign each CPU the data from a single station, since the data from each station are independent from each other.
However, it seems that my calculation time does not really go down compared to a single for loop.
Here is part of my code:
# function calculating the square of each data point and taking the cumulative sum
def get_cumdd(data):
    #if not isinstance(data, list):
    #    data = [data]
    dd = np.zeros((len(data), 1))
    cum_dd = np.zeros((len(data), 1))
    for i in range(len(data)):
        dd[i] = data[i]**2
    cum_dd = np.cumsum(dd)
    return cum_dd

# parallelization across stations
if __name__ == '__main__':
    n_proc = np.min([mp.cpu_count(), nstation])   # nstation = 10
    p = mp.Pool(processes=int(n_proc))
    result = p.map(get_cumdd, data)
    p.close()
    p.join()

    cum_dd = np.zeros((nstation, len(data[0])))
    for i in range(nstation):
        cum_dd[i] = result[i].T
I do not use chunksize because cum_dd depends on the cumulative sum of all the previous data^2 within a station. I am essentially dividing my data into 10 equal pieces, one per station, because there is no communication between processes. I wonder if I missed anything here.
My data has 2 million points per station per day, and I need to process years of data.
This doesn't address your multiprocessing question directly, but (as Ugur MULUK and Iguananaut mention) I think your get_cumdd function is inefficient. NumPy provides np.cumsum. Reimplementing your function, I get more than a 1000x speedup for an array with 10k elements. With 100k elements it's about 7000x faster. With 2M elements I didn't bother to let the original version finish.
# your function
def cum_dd(data):
    #if not isinstance(data, list):
    #    data = [data]
    dd = np.zeros((len(data), 1))
    cum_dd = np.zeros((len(data), 1))
    for i in range(len(data)):
        dd[i] = data[i]**2
        cum_dd[i] = np.sum(dd[0:i])
    return cum_dd

# numpy implementation
def cum_dd2(data):
    # adding an axis to match the shape of the output of your cum_dd function
    return np.cumsum(data**2)[:, np.newaxis]
For 2e6 points this implementation takes ~11ms on my computer. I think that's about 30 seconds for 10 years of data for a single station.
NumPy already implements efficient, low-level array processing on the CPU; its routines take advantage of Single Instruction, Multiple Data (SIMD) instructions.
By splitting the computation across processes manually, you may be reducing efficiency. You can improve performance by vectorizing your explicit for loop.
See the video below for more information about vectorization.
https://www.youtube.com/watch?v=qsIrQi0fzbY
If you are having difficulties, I will be around for updates or help. Good luck!
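For reference, a minimal sketch of the kind of vectorization being suggested; the loop and the one-line version compute the same thing:

import numpy as np

data = np.random.randn(2_000_000)

# loop version: squares each point one at a time in Python
dd = np.zeros(len(data))
for i in range(len(data)):
    dd[i] = data[i]**2
cum_dd_loop = np.cumsum(dd)

# vectorized version: a single NumPy expression, no Python-level loop
cum_dd_vec = np.cumsum(data**2)

assert np.allclose(cum_dd_loop, cum_dd_vec)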
Thanks a lot for all the comments and answers! After applying vectorization and pooling, I reduced the calculation time from one hour to 3 seconds (10 × 1.7 million data points). I have my code here in case anyone is interested:
def get_cumdd(data):
    #if not isinstance(data, list):
    #    data = [data]
    dd = np.zeros((len(data), 1))
    for i in range(len(data)):
        dd[i] = data[i]**2
    cum_dd = np.cumsum(dd)
    return dd, cum_dd

if __name__ == '__main__':
    n_proc = np.min([mp.cpu_count(), nstation])
    p = mp.Pool(processes=int(n_proc))
    result = p.map(CC.get_cumdd, d)
    p.close()
    p.join()
I'm not using a shared-memory Queue because all my processes are independent from each other.
I have a time series dataframe with about 10 columns where I am performing manipulations on the time series to return the results of a strategy. I would like to test 2 parameters, as they may or may not affect each other. When tested independently, each run takes over 10 sec per unit (over 6.5 hours for the total run), and I'm looking to speed this up. I have been reading about dask and it seems that it's the right module to use.
My current code iterates over each parameter range with nested loops. I know it can be parallelized, as the data per day are mutually exclusive.
Here is the code:
amount1 = np.arange(.001, .03, .0005)
amount2 = np.arange(.001, .03, .0005)

def getResults(df, amount1, amount2):
    final_results = []
    for x in tqdm(amount1):
        for y in amount2:
            df1 = None
            df1 = function1(df.copy(), x, y)   # takes about 2 sec
            df1 = function2(df1)               # takes about 2 sec
            df1 = function3(df1)               # takes about 3 sec
            final_results.append([x, y, df1['results'].iloc[-1]])
    return final_results
UPDATE:
So it looks like the improvements should come from adjusting the function to remove the iteration from the calls and to create a list of jobs (my understanding). Here is where I am so far. I probably will need to move my df to a dask dataframe, so that the data can be chunked into smaller pieces. The question is: do I leave function1, function2, and function3 as pandas vector manipulations, or do they need to move to full dask functions?
def getResults(df, amount):
    df1 = None
    df1 = dsk.delayed(function1)(df, amount[0], amount[1])
    df1 = dsk.delayed(function2)(df1)
    df1 = dsk.delayed(function3)(df1)
    return [amount[0], amount[1], df1['results'].iloc[-1]]

# Create a list of delayed tasks from jobs. jobs is a list of tuples that replaces the iteration.
processes = [getResults(df, items) for items in jobs]

# Collect the delayed results into a list
results = []
for i in range(len(processes)):
    results.append(processes[i])
You probably want to use either dask.delayed or the concurrent.futures interface.
Something like the following would probably work well (untested, I recommend that you read the docs referenced above to understand what it's doing).
import dask

def getResults(df, amount1, amount2):
    final_results = []
    for x in amount1:
        for y in amount2:
            df1 = None
            df1 = dask.delayed(function1)(df.copy(), x, y)
            df1 = dask.delayed(function2)(df1)
            df1 = dask.delayed(function3)(df1)
            final_results.append([x, y, df1['results'].iloc[-1]])
    return final_results

out = getResults(df, amount1, amount2)
result = dask.delayed(out).compute()
Also, I would avoid calling df.copy() if you can. Ideally function1 would not mutate its input data.
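For completeness, a rough sketch of the concurrent.futures route mentioned above (untested; run_one is a hypothetical helper, and df, amount1, amount2 and function1/2/3 are assumed to be defined at module level as in the question so the workers can see them):

from concurrent.futures import ProcessPoolExecutor
from itertools import product

def run_one(args):
    x, y = args
    df1 = function1(df, x, y)   # ideally function1 does not mutate df
    df1 = function2(df1)
    df1 = function3(df1)
    return [x, y, df1['results'].iloc[-1]]

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=8) as ex:
        final_results = list(ex.map(run_one, product(amount1, amount2)))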
I have a very large list of strings (originally from a text file) that I need to process using Python. Eventually I am trying to go for a map-reduce style of parallel processing.
I have written a "mapper" function and fed it to multiprocessing.Pool.map(), but it takes the same amount of time as simply calling the mapper function with the full set of data. I must be doing something wrong.
I have tried multiple approaches, all with similar results.
from multiprocessing import Pool

def initial_map(lines):
    results = []
    for line in lines:
        processed = ...  # process line (an O(1) operation)
        results.append(processed)
    return results

def chunks(l, n):
    for i in xrange(0, len(l), n):   # Python 2 (xrange); use range(...) on Python 3
        yield l[i:i+n]

if __name__ == "__main__":
    lines = list(open("../../log.txt", 'r'))
    pool = Pool(processes=8)
    partitions = chunks(lines, len(lines)/8)
    results = pool.map(initial_map, partitions, 1)
So the chunks function generates sublists of the original set of lines to give to pool.map(), which should hand these 8 sublists to 8 different processes and run them through the mapper function. When I run this I can see all 8 of my cores peak at 100%. Yet it takes 22-24 seconds.
When I simply run this (single process/thread):
lines = list(open("../../log.txt", 'r'))
results = initial_map(lines)
It takes about the same amount of time. ~24 seconds. I only see one process getting to 100% CPU.
I have also tried letting the pool split up the lines itself and have the mapper function only handle one line at a time, with similar results.
def initial_map(line):
    processed = ...  # process line (an O(1) operation)
    return processed

if __name__ == "__main__":
    lines = list(open("../../log.txt", 'r'))
    pool = Pool(processes=8)
    pool.map(initial_map, lines)
~22 seconds.
Why is this happening? Parallelizing this should result in faster results, shouldn't it?
If the amount of work done in one iteration is very small, you're spending a big proportion of the time just communicating with your subprocesses, which is expensive. Instead, try to pass bigger slices of your data to the processing function. Something like the following:
slices = (data[i:i+100] for i in range(0, len(data), 100))

def process_slice(data):
    return [initial_map(x) for x in data]

pool.map(process_slice, slices)
# and then itertools.chain the output to flatten it
(I don't have my computer handy, so I can't give you a full working solution or verify what I said.)
Edit: or see the 3rd comment on your question by @ubomb.
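To flatten the per-slice output mentioned in the snippet above, something like this sketch could work (it assumes initial_map is the per-line version of the mapper from the question):

from itertools import chain
from multiprocessing import Pool

def process_slice(slice_of_lines):
    return [initial_map(line) for line in slice_of_lines]

if __name__ == "__main__":
    lines = list(open("../../log.txt", 'r'))
    slices = [lines[i:i+100] for i in range(0, len(lines), 100)]
    pool = Pool(processes=8)
    nested = pool.map(process_slice, slices)       # list of lists, one per slice
    results = list(chain.from_iterable(nested))    # flattened back into one list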