I want to update a pandas DataFrame in parallel with multiprocessing
Hello, I want to compare the speed of single-core and multi-core pandas DataFrame calculations.
The case is the following: column 'c' in the i-th row should be the average of the values of 'a' from row i-9 to row i.
from multiprocessing import Process, Value, Array, Manager
import pandas as pd
import numpy as np
import time

total_num = 1000

df = pd.DataFrame(np.arange(1,total_num*2+1).reshape(total_num,2),
                  columns=['a','b'])
df['c'] = 0

df2 = pd.DataFrame(np.arange(1,total_num*2+1).reshape(total_num,2),
                   columns=['a','b'])
df2['c'] = 0

def Cal(start, end):
    for i in range(end-start-1):
        if i+start < 10:
            df.loc[i+start,'c'] = df.loc[:i+start,'c'].mean()
        else:
            df.loc[i+start,'c'] = df.loc[i-9:i+start,'c'].mean()

def Cal2(my_df, start, end):
    for i in range(end-start-1):
        if i+start < 10:
            my_df.df.loc[i+start,'c'] = my_df.df.loc[:i+start,'c'].mean()
        else:
            my_df.df.loc[i+start,'c'] = my_df.df.loc[i-9:i+start,'c'].mean()
    print(my_df)
print('Single core : --->')
start_t = time.time()
Cal(0,total_num+1)
end_t = time.time()
print(end_t-start_t)
print('Multiprocess ---->')
if __name__=='__main__':
    num = len(df2)
    num_core = 4
    between = num//num_core
    mgr = Manager()
    ns = mgr.Namespace()
    ns.df = df2
    procs = []
    start_t = time.time()
    for index in range(num_core):
        proc = Process(target=Cal2, args=(ns, index*between, (index+1)*between))
        procs.append(proc)
        proc.start()
    for proc in procs:
        proc.join()
    end_t = time.time()
    print(end_t-start_t)
At first I realized that multiprocessing does not share global variables, so I used a Manager. However, the 'c' column of df2 did not change.
How do I do what I want to do? :p
You may look at swifter as well; it applies functions using multiprocessing if that actually makes execution faster.
In your case it is a terrible idea: a window of 10 rows is a really small amount of data, so distributing it between cores will not help, and the cost of spawning processes will be much higher than the operations themselves.
Furthermore, sharing memory between processes is not a good idea (it is really costly), and that's what you are trying to do here. Usually you split the data beforehand and push the chunks to the workers (see the sketch below), but then the data chunks should be much bigger.
You could use threads, which may be what you are after, but remember Python's GIL (you can read more about threads, processes and the GIL in other answers, e.g. here).
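For completeness, here is a rough sketch (my own, not part of the original answer) of the "split the data beforehand and hand the chunks to worker processes" pattern mentioned above. The per-chunk function and the chunk count are placeholders; with data this small the process start-up cost still dominates.

import numpy as np
import pandas as pd
from multiprocessing import Pool

def process_chunk(chunk):
    # Placeholder per-chunk work: each worker receives its own copy of a chunk,
    # computes a result, and returns it; nothing is mutated through shared state.
    chunk = chunk.copy()
    chunk['c'] = chunk['a'] * 2
    return chunk

if __name__ == '__main__':
    df = pd.DataFrame({'a': np.arange(1, 100001), 'b': np.arange(1, 100001)})
    chunks = np.array_split(df, 4)        # split the data beforehand
    with Pool(4) as pool:                 # push the chunks to the workers
        parts = pool.map(process_chunk, chunks)
    result = pd.concat(parts)
    print(result.head())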
Related
I want to use Pool to split a task among n workers. When I use map with a single-argument task function, all the cores are used and all tasks are launched simultaneously.
On the other hand, when I use starmap, the tasks are launched one by one and I never reach 100% CPU load.
I want to use starmap for my case because I want to pass a second argument, but there's no point if it doesn't take advantage of multiprocessing.
This is the code that works
import numpy as np
from multiprocessing import Pool

# df_a = just a pandas dataframe which I split in n parts and I
#        feed each part to a task. Each one may have a few
#        thousand rows

n_jobs = 16

def run_parallel(df_a):
    dfs_a = np.array_split(df_a, n_jobs)
    print("done split")
    pool = Pool(n_jobs)
    result = pool.map(task_function, dfs_a)
    return result

def task_function(left_df):
    print("in task function")
    # execute task...
    return result

result = run_parallel(df_a)
in this case, "in task function" is printed at the same time, 16 times.
This is the code that doesn't work
from itertools import repeat  # missing in the original snippet, needed for repeat() below

n_jobs = 16

# df_b: a big pandas dataframe (~1.7M rows, ~20 columns) which I
#       want to send to each task as is

def run_parallel(df_a, df_b):
    dfs_a = np.array_split(df_a, n_jobs)
    print("done split")
    pool = Pool(n_jobs)
    result = pool.starmap(task_function, zip(dfs_a, repeat(df_b)))
    return result

def task_function(left_df, right_df):
    print("in task function")
    # execute task
    return result

result = run_parallel(df_a, df_b)
Here, "in task function" is printed sequentially and the processors never reach 100% capacity. I also tried workarounds based on this answer:
https://stackoverflow.com/a/5443941/6941970
but no luck. Even when I used map in this way:
from functools import partial
pool.map(partial(task_function, b=df_b), dfs_a)
thinking that maybe repeat(*very big df*) was introducing memory issues. Still, there wasn't any real parallelization.
I am currently generating a nested dictionary that saves some arrays by using a nested for loop. Unfortunately, it takes quite some time; I realized that the server I am working on has a few cores available, so I was wondering if Python's multiprocessing library could be helpful to speed up the creation of the dictionary.
The nested for loop looks something like this (the actual computation is heavier and more complex):
import numpy as np

data_dict = {}
for s in range(1,5):
    data_dict[s] = {}
    for d in range(1,5):
        if s * d > 4:
            data_dict[s][d] = np.zeros((s,d))
        else:
            data_dict[s][d] = np.ones((s,d))
So this is what I tried:
from multiprocessing import Pool
import numpy as np

data_dict = {}

def process():
    #sci=fits.open('{}.fits'.format(name))
    for s in range(1,5):
        data_dict[s] = {}
        for d in range(1,5):
            if s * d > 4:
                data_dict[s][d] = np.zeros((s,d))
            else:
                data_dict[s][d] = np.ones((s,d))

if __name__ == '__main__':
    pool = Pool()  # Create a multiprocessing Pool
    pool.map(process)
But pool.map (last line) requires an iterable, and I'm not sure what to pass there.
In my opinion, the real question is what kind of processing is needed to compute the entries of the dictionary and how many entries there are.
The kind of processing is essential for understanding whether multiprocessing can significantly speed up the creation of the dictionary. If your computation is I/O bound you should use multithreading, while if it is CPU bound you should use multiprocessing. You can find more about this here.
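As a minimal sketch of that rule of thumb (my own illustration, not part of the benchmark below): a thread pool for I/O-bound work, a process pool for CPU-bound work. The URLs and workloads are placeholders.

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import urllib.request

def download(url):
    # I/O bound: the thread mostly waits on the network, so the GIL is released
    return urllib.request.urlopen(url).read()

def crunch(n):
    # CPU bound: pure Python computation, needs separate processes to scale
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    urls = ['https://example.com'] * 4              # placeholder URLs
    with ThreadPoolExecutor(max_workers=4) as tpe:
        pages = list(tpe.map(download, urls))
    with ProcessPoolExecutor(max_workers=4) as ppe:
        sums = list(ppe.map(crunch, [10**6] * 4))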
Assuming that the value of each entry can be computed independently and that this computation is CPU bound, let's benchmark the difference between a single-process and a multiprocess implementation (based on the multiprocessing library).
The following code tests the two approaches in several scenarios, varying the complexity of the computation needed for each entry and the number of entries (for the multiprocess implementation, 7 processes were used).
import timeit
import numpy as np

def some_fun(s, d, n=1):
    """A function with an adaptable complexity"""
    a = s * np.ones(np.random.randint(1, 10, (2,))) / (d + 1)
    for _ in range(n):
        a += np.random.random(a.shape)
    return a

# Code to create dictionary with only one process
setup_simple = "from __main__ import some_fun, n_first_level, n_second_level, complexity"

code_simple = """
data_dict = {}
for s in range(n_first_level):
    data_dict[s] = {}
    for d in range(n_second_level):
        data_dict[s][d] = some_fun(s, d, n=complexity)
"""

# Code to create a dictionary with multiprocessing: we are going to use all the available cores except 1
setup_mp = """import numpy as np
import multiprocessing as mp
import itertools
from functools import partial
from __main__ import some_fun, n_first_level, n_second_level, complexity

n_processes = mp.cpu_count() - 1
# Uncomment if you want to know how many concurrent processes are you going to use
# print(f'{n_processes} concurrent processes')
"""

code_mp = """
with mp.Pool(processes=n_processes) as pool:
    dict_values = pool.starmap(partial(some_fun, n=complexity),
                               itertools.product(range(n_first_level), range(n_second_level)))
data_dict = {
    k: dict(zip(range(n_second_level), dict_values[k * n_second_level: (k + 1) * n_second_level]))
    for k in range(n_first_level)
}
"""

# Time the code with different settings
print('Execution time on 10 repetitions: mean [std]')
for label, complexity, n_first_level, n_second_level in (
        ("TRIVIAL FUNCTION", 0, 10, 10),
        ("TRIVIAL FUNCTION", 0, 500, 500),
        ("SIMPLE FUNCTION", 5, 500, 500),
        ("COMPLEX FUNCTION", 50, 100, 100),
        ("HEAVY FUNCTION", 1000, 10, 10),
):
    print(f'\n{label}, {n_first_level * n_second_level} dictionary entries')
    for l, t in (
            ('Single process', timeit.repeat(stmt=code_simple, setup=setup_simple, number=1, repeat=10)),
            ('Multiprocess', timeit.repeat(stmt=code_mp, setup=setup_mp, number=1, repeat=10)),
    ):
        print(f'\t{l}: {np.mean(t):.3e} [{np.std(t):.3e}] seconds')
These are the results:
Execution time on 10 repetitions: mean [std]

TRIVIAL FUNCTION, 100 dictionary entries
    Single process: 7.752e-04 [7.494e-05] seconds
    Multiprocess: 1.163e-01 [2.024e-03] seconds

TRIVIAL FUNCTION, 250000 dictionary entries
    Single process: 7.077e+00 [7.098e-01] seconds
    Multiprocess: 1.383e+00 [7.752e-02] seconds

SIMPLE FUNCTION, 250000 dictionary entries
    Single process: 1.405e+01 [1.422e+00] seconds
    Multiprocess: 2.858e+00 [5.742e-01] seconds

COMPLEX FUNCTION, 10000 dictionary entries
    Single process: 1.557e+00 [4.330e-02] seconds
    Multiprocess: 5.383e-01 [5.330e-02] seconds

HEAVY FUNCTION, 100 dictionary entries
    Single process: 3.181e-01 [5.026e-03] seconds
    Multiprocess: 1.171e-01 [2.494e-03] seconds
As you can see, assuming a CPU-bound computation, the multiprocess approach achieves better results in most scenarios. Only if the computation for each entry is very light and/or the number of entries is very small should the single-process approach be preferred.
On the other hand, the improvement provided by multiprocessing comes at a cost: for example, if the computation for each entry uses a significant amount of memory, you could run out of memory, which means you would have to rework your code and make it more complex to find the right balance between memory usage and the reduction in execution time. If you look around, there are a lot of questions about memory issues caused by a non-optimal use of multiprocessing. In other words, your code will be less easy to read and maintain.
To sum up, you should judge whether the improvement in execution time is worth it, even when it is achievable.
I have this code which I would like to use multi-processing to speed up:
matrix = []
for i in range(len(datasplit)):
    matrix.append(np.array(np.asarray(datasplit[i].split()), dtype=float))
The variable "datasplit" is a comma-separated list of strings. Each string has around 50 numbers which are separated by a space. For each string, this code adds commas between these numbers instead of spaces, turns the entire string into an array, and turns each individual number into a string. This would now look like a an array of comma-separated strings where each string is 1 of the 50 numbers. The code then turns these strings into floats, so now we have an array of 50 comma separated numbers. After the code has run, printing, "matrix" would give a list of arrays, where each array has 50 comma separated numbers.
Now my problem is that the length of datasplit is huge. It has a length of ~ 10^7. This code takes around 15 minutes to run. I need to run this for 124 other samples of similar size, so I would like to use multiprocessing to speed up the run time.
How exactly would I re-write my code using multiprocessing to get it to run faster?
I appreciate any help.
The Python standard library provides two options for multiprocessing: The modules multiprocessing and concurrent.futures. The second adds a layer of abstraction onto the first. For simple map-scenarios like yours the usage is pretty simple.
Here's something to experiment with:
import numpy as np
from time import time
from os import cpu_count
from multiprocessing import Pool
from concurrent.futures import ProcessPoolExecutor
def string_to_float(string):
return np.array(np.asarray(string.split()), dtype=float)
if __name__ == '__main__':
# Example datasplit
rng = np.random.default_rng()
num_strings = 100000
datasplit = [' '.join(str(n) for n in rng.random(50))
for _ in range(num_strings)]
# Looping (sequential processing)
start = time()
matrix = []
for i in range(len(datasplit)):
matrix.append(np.array(np.asarray(datasplit[i].split()), dtype=float))
print(f'Duration of sequential processing: {time() - start:.2f} secs')
# Setting up multiprocessing
num_workers = int(0.8 * cpu_count())
chunksize = max(1, int(len(datasplit) / num_workers))
# Multiprocessing with Pool
start = time()
with Pool(num_workers) as p:
matrix = p.map(string_to_float, datasplit, chunksize)
print(f'Duration of parallel processing (Pool): {time() - start:.2f} secs')
# Multiprocessing with ProcessPoolExecutor
start = time()
with ProcessPoolExecutor(num_workers) as ppe:
matrix = list(ppe.map(string_to_float, datasplit, chunksize=chunksize))
print(f'Duration of parallel processing (PPE): {time() - start:.2f} secs')
You should play around with num_workers and, more importantly, the chunksize variable. The values I've used here have worked well for me in quite a few situations. You can also let the system decide (those arguments are optional), but the results can be suboptimal, especially when the amount of data to be processed is large.
For 10 million strings (your range) and chunksize=10000 my machine produced the following results:
Duration of sequential processing: 393.78 secs
Duration of parallel processing (Pool): 73.76 secs
Duration of parallel processing (PPE): 85.82 secs
PS: Why do you use np.array(np.asarray(string.split()), dtype=float) instead of np.asarray(string.split(), dtype=float)?
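For reference, a tiny check (my own, assuming plain numeric strings) showing that the two expressions from the PS produce the same array; the second just avoids building an intermediate string array:

import numpy as np

s = "1.0 2.5 3.75"
a = np.array(np.asarray(s.split()), dtype=float)   # builds an intermediate string array first
b = np.asarray(s.split(), dtype=float)             # converts the list of strings directly
assert np.array_equal(a, b)                        # same result either way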
This will split your tasks across multiple cores and should speed things up by at least 4-8x:
from multiprocessing import Pool
import os
import numpy as np

def to_float_array(x):
    # A module-level function is needed here: multiprocessing cannot pickle lambdas
    return np.array(np.asarray(x.split()), dtype=float)

if __name__ == '__main__':
    # Add your data to the datasplit variable below:
    datasplit = []
    pool = Pool(os.cpu_count())
    results = pool.map(to_float_array, datasplit)
    pool.close()
    pool.join()
I use a multiprocessing Pool to run in parallel. I first tried with 4 cores on an HPC system, submitted as a job. When it uses 4 cores, the time is reduced by a factor of 4 compared to 1 core. When I check with qstat, it sometimes uses 4 cores but afterwards only 1 core, with exactly the same code.
Could you please give some advice on what is wrong with my code or the system?
import pandas as pd
import numpy as np
from multiprocessing import Pool
from datetime import datetime

t1 = pd.read_csv("template.csv", header=None)
s1 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_adfr.csv")
s2 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_dock.csv")
s3 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_gemdock.csv")
s4 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_ledock.csv")
s5 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_plants.csv")
s6 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_psovina.csv")
s7 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_quickvina2.csv")
s8 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_smina.csv")
s9 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_vina.csv")
s10 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_vinaxb.csv")

# number of cores and arrays
n = 4
m = (len(t1) // n) + 1
g = m*n - len(t1)
for g1 in range(g):
    t1.loc[len(t1)] = 0

results = []

def block_linear(i):
    temp = pd.DataFrame(np.zeros((m, 29)))
    for a in range(0, m):
        sum_matrix = (t1.iloc[a,0]*s1) + (t1.iloc[a,1]*s2) + (t1.iloc[a,2]*s3) + (t1.iloc[a,3]*s4) + (t1.iloc[a,4]*s5) + (t1.iloc[a,5]*s6) + (t1.iloc[a,6]*s7) + (t1.iloc[a,7]*s8) + (t1.iloc[a,8]*s9) + (t1.iloc[a,9]*s10)
        rank_sum = pd.DataFrame.rank(sum_matrix, axis=0, ascending=True, method='min')  # real-True
        temp.iloc[a,:] = rank_sum.iloc[999].values
    temp['median'] = temp.median(axis=1)
    temp.index = range(i*m, (i+1)*m)
    return temp

start = datetime.now()
if __name__ == '__main__':
    pool = Pool(processes=n)
    results = pool.map(block_linear, range(0, n))
    print(datetime.now() - start)
    out = pd.concat(results)
    out.drop(out.tail(g).index, inplace=True)
    out.to_csv('test_10dock_4core.csv', index=False)
The main idea is to cut the large table into smaller pieces, run the calculations, and combine the results.
Without more detail about how you use multiprocessing's Pool, it is really difficult to understand and help. Note that Pool does not guarantee parallelization: apply, for example, only uses one worker of the Pool and blocks all your executions. You can check out more details about it here and there.
But assuming you are using the library properly, you should make sure your code is fully parallelizable: an I/O operation on disk, for example, can bottleneck your parallelization and thus make your code run in only one process at a time.
I hope it helps.
[Edit]
Since you provided more details about your problem, I can give more specific tips.
The first thing is that your code is not parallel at all: you are just calling the same function N times. This is not how multiprocessing is supposed to work.
Instead, the part that should be parallel is the one that usually sits in a for loop, like the one you have inside block_linear().
So, this is what I recommend:
You should change your code to first compute all the weighted sums and only after that do the rest of the operations. This will help a lot with parallelization.
So, put this operation in a function:
def weighted_sum(column, df2):
    temp = pd.DataFrame(np.zeros(m))
    for a in range(0, m):
        result = (t1.iloc[a, column]*df2)
        temp.iloc[a] = result
    return temp
Then you use pool.starmap to parallelize the function over the 10 dataframes you have, something like this:
results = pool.starmap(weighted_sum, [(0, s1), (1, s2), (2, s3), ..., (9, s10)])
ps: pool.starmap is similar to pool.map but accepts a list of tuple arguments. You can find more details about it here.
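A tiny, self-contained illustration (mine, not the author's) of how starmap unpacks each tuple into positional arguments:

from multiprocessing import Pool

def add(x, y):
    return x + y

if __name__ == '__main__':
    with Pool(2) as pool:
        # Each tuple (x, y) is unpacked into add(x, y)
        print(pool.starmap(add, [(1, 2), (3, 4), (5, 6)]))   # -> [3, 7, 11]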
Last but not least, you should operate over your results to finish the calculation. Since you will have one weighted_sum per column, you can apply a sum over the columns and then compute the rank_sum (a rough sketch of this step follows below).
This is not fully runnable code that solves your problem, but a general guide to how you should restructure your code to benefit from multiprocessing. I recommend you test it on a subsample of the data frames to make sure it works properly before running it on all your data.
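As a rough sketch of that combination step (my own reading of the guide above, reusing the rank parameters from the original code, not tested on the real data):

import pandas as pd

def combine_and_rank(results):
    # results: the list of DataFrames returned by pool.starmap(weighted_sum, ...),
    # one partial weighted sum per score table s1..s10
    sum_matrix = sum(results)                                         # element-wise sum of the parts
    rank_sum = sum_matrix.rank(axis=0, ascending=True, method='min')  # same ranking as the original code
    return rank_sum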
I'm populating a matrix using a conditional lookup from a file. The file is extremely large (2,500,000 records) and is saved as a dataframe ('file').
Each matrix row operation (lookup) is independent of the others. Is there any way I could parallelize this process?
I'm working in pandas and Python. My current approach is naive.
for r in row:
    for c in column:
        num = file[(file['Unique_Inventor_Number']==r) & (file['AppYearStr']==c)]['Citation'].tolist()
        num = len(list(set(num)))
        d.set_value(r, c, num)
For 2.5 million records you should be able to do
res = file.groupby(['Unique_Inventor_Number', 'AppYearStr']).Citation.nunique()
The matrix should be available in
res.unstack(level=1).fillna(0).values
I'm not sure if this is the fastest, but it should be significantly faster than your implementation.
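A small self-contained sketch (with made-up toy data, just to show the shape of the result) of the groupby / unstack approach above:

import pandas as pd

# Toy stand-in for the large lookup dataframe from the question
file = pd.DataFrame({
    'Unique_Inventor_Number': [1, 1, 1, 2, 2],
    'AppYearStr':             ['2000', '2000', '2001', '2000', '2001'],
    'Citation':               ['c1', 'c2', 'c1', 'c3', 'c3'],
})

res = file.groupby(['Unique_Inventor_Number', 'AppYearStr']).Citation.nunique()
matrix = res.unstack(level=1).fillna(0).values
print(matrix)   # rows: inventors, columns: years, values: number of distinct citations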
[Edit] As Roland mentioned in a comment, in the standard Python implementation this post does not offer any solution to improve CPU performance.
In the standard Python implementation, threads do not really improve performance on CPU-bound tasks. There is a "Global Interpreter Lock" that enforces that only one thread at a time can execute Python bytecode. This was done to keep the complexity of memory management down.
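A minimal experiment (my own, numbers will vary by machine) that makes the GIL visible: the same pure-Python loop run twice sequentially and split across two threads typically takes about the same wall-clock time in CPython.

import time
from threading import Thread

def busy(n):
    # Pure Python, CPU-bound work: holds the GIL the whole time
    s = 0
    for i in range(n):
        s += i * i
    return s

if __name__ == '__main__':
    N = 5_000_000

    start = time.time()
    busy(N)
    busy(N)
    print(f'sequential : {time.time() - start:.2f} s')

    start = time.time()
    threads = [Thread(target=busy, args=(N,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f'two threads: {time.time() - start:.2f} s')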
Have you tried to use different Threads for the different functions?
Let's say you separate your dataframe into columns and create multiple threads. Then you assign each thread to apply a function to a column. If you have enough processing power, you might be able to gain a lot of time:
from threading import Thread
import pandas as pd
import numpy as np
from queue import Queue
from time import time

# Those will be used afterwards
N_THREAD = 8
q = Queue()
df2 = pd.DataFrame()  # The output of the script

# You create the job that each thread will do
def apply(series, func):
    df2[series.name] = series.map(func)

# You define the context of the jobs
def threader():
    while True:
        worker = q.get()
        apply(*worker)
        q.task_done()

def main():
    # You import your data to a pandas dataframe
    df = pd.DataFrame(np.random.randn(100000, 4), columns=['A', 'B', 'C', 'D'])

    # You create the functions you will apply to your columns
    func1 = lambda x: x < 10
    func2 = lambda x: x == 0
    func3 = lambda x: x >= 0
    func4 = lambda x: x < 0
    func_rep = [func1, func2, func3, func4]

    for x in range(N_THREAD):  # You create your threads
        t = Thread(target=threader)
        t.daemon = True  # daemon threads let the program exit once the queue is done
        t.start()

    # Now is the tricky part: You enclose the arguments that
    # will be passed to the function into a tuple which you
    # put into a queue. Then you start the job by "joining"
    # the queue
    for i, func in enumerate(func_rep):
        worker = tuple([df.iloc[:, i], func])
        q.put(worker)

    t0 = time()
    q.join()
    print("Entire job took: {:.3} s.".format(time() - t0))

if __name__ == '__main__':
    main()