I have a pandas DataFrame with 387 rows and 26 columns. This DataFrame is then grouped with groupby() and aggregated with agg(), turning into a DataFrame with 1 row and 111 columns. This takes about 0.05s. For example:
frames = frames.groupby(['id']).agg({"bytes": ["count",
                                               "sum",
                                               "median",
                                               "std",
                                               "min",
                                               "max"],
                                     # === add about 70 more lines of this ===
                                     "pixels": "sum"})
All of these use Pandas' built-in Cython functions, e.g. sum, std, min, max, first, etc. I am looking to speed this process up, but is there even a way to do such a thing? Seems like it is already considered 'vectorized' to my understanding. Thus, there isn't anything more to do with Cython is there?
Maybe calculating each column separately without the .agg() would be faster?
Would greatly appreciate any ideas, or confirmation that there is nothing else to be done. Thanks!
Edit!
Here's a working example:
import pandas as pd
import numpy as np
aggs = ["sum", "mean", "std", "min"]
cols = {k:aggs for k in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'}
df = pd.DataFrame(np.random.randint(0,100,size=(387, 26)), columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))
df['id'] = 1
print(df.groupby("id").agg(cols))
cProfile results:
import cProfile
cProfile.run('df.groupby("id").agg(cols)', sort='cumtime')
79825 function calls (78664 primitive calls) in 0.076 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.076 0.076 {built-in method builtins.exec}
1 0.000 0.000 0.076 0.076 <string>:1(<module>)
1 0.000 0.000 0.076 0.076 generic.py:964(aggregate)
1 0.000 0.000 0.075 0.075 apply.py:143(agg)
1 0.000 0.000 0.075 0.075 apply.py:405(agg_dict_like)
1 0.000 0.000 0.062 0.062 apply.py:435(<dictcomp>)
130/26 0.000 0.000 0.059 0.002 generic.py:225(aggregate)
26 0.001 0.000 0.058 0.002 generic.py:278(_aggregate_multiple_funcs)
78 0.001 0.000 0.023 0.000 generic.py:322(_cython_agg_general)
28 0.000 0.000 0.023 0.001 frame.py:573(__init__)
I ran some benchmarks (with 10 columns to aggregate and 6 aggregation functions for each column, and at most 100 unique ids). It seems that the total time to run the aggregation does not change until the number of rows is somewhere between 10k and 100k.
If you know your dataframes in advance, you can concatenate them into a single big DataFrame with a two-level index, run groupby on both levels and get a significant speedup. In a way, this runs the calculation on batches of dataframes.
In my example, it takes around 400ms to process a single DataFrame, and 600ms to process a batch of 100 DataFrames, with an average speedup of around 60x.
Here is the general approach (using 100 columns instead of 10):
import numpy as np
import pandas as pd
num_rows = 100
num_cols = 100
# builds a random df
def build_df():
    # build some df with num_rows rows and num_cols cols to aggregate
    df = pd.DataFrame({
        "id": np.random.randint(0, 100, num_rows),
    })
    for i in range(num_cols):
        df[i] = np.random.randint(0, 10, num_rows)
    return df
agg_dict = {
    i: ["count", "sum", "median", "std", "min", "max"]
    for i in range(num_cols)
}
# get a single small df
df = build_df()
# build 100 random dataframes
dfs = [build_df() for _ in range(100)]
# set the first df to be equal to the "small" df we computed before
dfs[0] = df.copy()
big_df = pd.concat(dfs, keys=range(100))
%timeit big_df.groupby([big_df.index.get_level_values(0), "id"]).agg(agg_dict)
# 605 ms per loop, for 100 dataframes
agg_from_big = big_df.groupby([big_df.index.get_level_values(0), "id"]).agg(agg_dict).loc[0]
%timeit df.groupby("id").agg(agg_dict)
# 417 ms per loop, for one dataframe
agg_from_small = df.groupby("id").agg(agg_dict)
assert agg_from_small.equals(agg_from_big)
Here is the benchmarking code. The timings are comparable until the number of rows increases to 10k to 100k:
def get_setup(n):
    return f"""
import pandas as pd
import numpy as np
N = {n}
num_cols = 10
df = pd.DataFrame({{
    "id": np.random.randint(0, 100, N),
}})
for i in range(num_cols):
    df[i] = np.random.randint(0, 10, N)
agg_dict = {{
    i: ["count", "sum", "median", "std", "min", "max"]
    for i in range(num_cols)
}}
"""
from timeit import timeit
def time_n(n):
    return timeit(
        "df.groupby('id').agg(agg_dict)", setup=get_setup(n), number=100
    )
times = pd.Series({n: time_n(n) for n in [10, 100, 1000, 10_000, 100_000]})
# 10 4.532458
# 100 4.398949
# 1000 4.426178
# 10000 5.009555
# 100000 11.660783
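Another thing that may be worth timing, given that the profile in the question shows 26 calls to _aggregate_multiple_funcs and 78 calls to _cython_agg_general, is to call each aggregation once on the whole grouped frame and then stitch the results together. The sketch below assumes every column gets the same list of aggregations, as in the question's working example; the names parts and wide are only illustrative, and whether it is actually faster depends on the pandas version and the data.

```python
import numpy as np
import pandas as pd

letters = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
aggs = ["sum", "mean", "std", "min"]
df = pd.DataFrame(np.random.randint(0, 100, size=(387, 26)), columns=letters)
df["id"] = 1

grouped = df.groupby("id")

# One whole-frame call per aggregation instead of one agg call per column.
parts = {name: getattr(grouped, name)() for name in aggs}

# Columns come out as (aggregation, column); swap and reorder them so they
# match the (column, aggregation) layout produced by .agg(dict).
wide = pd.concat(parts, axis=1).swaplevel(axis=1)
wide = wide.reindex(columns=pd.MultiIndex.from_product([letters, aggs]))

# Reference result computed the way the question does it.
expected = grouped.agg({k: aggs for k in letters})
print(np.allclose(wide.to_numpy(), expected.to_numpy()))
```

Comparing the two with %timeit on real data is the only reliable way to tell whether this helps.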
I have the following code:
import numpy as np
A = np.random.random((128,4,46,23)) + np.random.random((128,4,46,23)) * 1j
signal = np.random.random((355,256,4)) + np.random.random((355,256,4)) * 1j
# start timing here
Signal = np.fft.fft(signal, axis=1)[:,:128,:]
B = Signal[..., None, None] * A[None,...]
B = B.sum(1)
b = np.fft.ifft(B, axis=1).real
b_squared = b**2
res = b_squared.sum(1)
I need to run this code for two different values of A each time. The problem is that the code is too slow to be used in a real-time application. I tried using np.einsum, and although it did speed things up a little, it wasn't enough for my application.
So now I'm trying to speed things up using a GPU, but I'm not sure how. I looked into OpenCL, and multiplying two matrices seems fine, but I'm not sure how to do it with complex numbers and with arrays of more than two dimensions (I guess by using a for loop to send two axes at a time to the GPU). I also don't know how to do something like array.sum(axis). I have worked with a GPU before using OpenGL.
I would have no problem using C++ to optimize the code if needed, or using something besides OpenCL, as long as it works with more than one GPU manufacturer (so no CUDA).
Edit:
Running cProfile:
import cProfile
def f():
    Signal = np.fft.fft(signal, axis=1)[:,:128,:]
    B = Signal[..., None, None] * A[None,...]
    B = B.sum(1)
    b = np.fft.ifft(B, axis=1).real
    b_squared = b**2
    res = b_squared.sum(1)
cProfile.run("f()", sort="cumtime")
Output:
56 function calls (52 primitive calls) in 1.555 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 1.555 1.555 {built-in method builtins.exec}
1 0.005 0.005 1.555 1.555 <string>:1(<module>)
1 1.240 1.240 1.550 1.550 <ipython-input-10-d4613cd45f64>:3(f)
2 0.000 0.000 0.263 0.131 {method 'sum' of 'numpy.ndarray' objects}
2 0.000 0.000 0.263 0.131 _methods.py:45(_sum)
2 0.263 0.131 0.263 0.131 {method 'reduce' of 'numpy.ufunc' objects}
6/2 0.000 0.000 0.047 0.024 {built-in method numpy.core._multiarray_umath.implement_array_function}
2 0.000 0.000 0.047 0.024 _pocketfft.py:49(_raw_fft)
2 0.047 0.023 0.047 0.023 {built-in method numpy.fft._pocketfft_internal.execute}
1 0.000 0.000 0.041 0.041 <__array_function__ internals>:2(ifft)
1 0.000 0.000 0.041 0.041 _pocketfft.py:189(ifft)
1 0.000 0.000 0.006 0.006 <__array_function__ internals>:2(fft)
1 0.000 0.000 0.006 0.006 _pocketfft.py:95(fft)
4 0.000 0.000 0.000 0.000 <__array_function__ internals>:2(swapaxes)
4 0.000 0.000 0.000 0.000 fromnumeric.py:550(swapaxes)
4 0.000 0.000 0.000 0.000 fromnumeric.py:52(_wrapfunc)
4 0.000 0.000 0.000 0.000 {method 'swapaxes' of 'numpy.ndarray' objects}
2 0.000 0.000 0.000 0.000 _asarray.py:14(asarray)
4 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
2 0.000 0.000 0.000 0.000 {built-in method numpy.array}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 0.000 0.000 0.000 0.000 {built-in method numpy.core._multiarray_umath.normalize_axis_index}
4 0.000 0.000 0.000 0.000 fromnumeric.py:546(_swapaxes_dispatcher)
2 0.000 0.000 0.000 0.000 _pocketfft.py:91(_fft_dispatcher)
Most of the number crunching libraries that interface well with the Python ecosystem have tight dependencies on Nvidia's ecosystem, but this is changing slowly. Here are some things you could try:
Profile your code. The built-in profiler (cProfile) is probably a good place to start, I'm also a fan of snakeviz for looking at performance traces. This will actually tell you if NumPy's FFT implementation is what's blocking you. Is memory being allocated efficiently? Is there some way where you could hand more data off to np.fft.ifft? How much time is Python taking to read the signal from its source and convert it into a Numpy array?
Numba is a JIT which takes Python code and further optimizes it, or compiles it to either CUDA or ROCm (AMD). I'm not sure how far off you are from your performance goals, but perhaps this could help.
Here is a list of C++ libraries to try.
Honestly, I'm kind of surprised that the NumPy build distributed on PyPI isn't fast enough for real-time use. I'll post an update to my comment where I benchmark this code snippet.
UPDATE: Here's an implementation which uses a multiprocessing.Pool, and the np.einsum trick kindly provided by @hpaulj.
import time
import numpy as np
import multiprocessing
NUM_RUNS = 50
A = np.random.random((128, 4, 46, 23)) + np.random.random((128, 4, 46, 23)) * 1j
signal = np.random.random((355, 256, 4)) + np.random.random((355, 256, 4)) * 1j
def worker(signal_chunk: np.ndarray, a_chunk: np.ndarray) -> np.ndarray:
    # signal_chunk is a slice of the already FFT-transformed signal
    return (
        np.fft.ifft(np.einsum("ijk,jklm->iklm", signal_chunk, a_chunk), axis=1).real ** 2
    )
# old code
serial_times = []
for _ in range(NUM_RUNS):
    start = time.monotonic()
    Signal = np.fft.fft(signal, axis=1)[:, :128, :]
    B = Signal[..., None, None] * A[None, ...]
    B = B.sum(1)
    b = np.fft.ifft(B, axis=1).real
    b_squared = b ** 2
    res = b_squared.sum(1)
    serial_times.append(time.monotonic() - start)
parallel_times = []
# My CPU is hyperthreaded, so I'm only spawning workers for the amount of physical cores
with multiprocessing.Pool(multiprocessing.cpu_count() // 2) as p:
    for _ in range(NUM_RUNS):
        start = time.monotonic()
        # Get the FFT of the whole sample before splitting
        transformed = np.fft.fft(signal, axis=1)[:, :128, :]
        a_chunks = np.split(A, A.shape[0] // multiprocessing.cpu_count(), axis=0)
        signal_chunks = np.split(
            transformed, transformed.shape[1] // multiprocessing.cpu_count(), axis=1
        )
        res = np.sum(np.hstack(p.starmap(worker, zip(signal_chunks, a_chunks))), axis=1)
        parallel_times.append(time.monotonic() - start)
print(
    f"ORIGINAL AVG TIME: {np.mean(serial_times):.3f}\t POOLED TIME: {np.mean(parallel_times):.3f}"
)
And here are the results I'm getting on a Ryzen 3700X (8 cores, 16 threads):
ORIGINAL AVG TIME: 0.897 POOLED TIME: 0.315
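One thing to check before relying on a chunked version: the sum over the 128-long axis is linear, so it can be split across workers, but the squaring step is not, so the partial sums have to be added together before the ifft and the square are applied. The sketch below illustrates that ordering on small stand-in arrays and without a pool; the function names serial and chunked and the array sizes are assumptions made for the example, not part of the code above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Small stand-ins for the FFT-transformed signal (355, 128, 4) and A (128, 4, 46, 23).
Signal = rng.random((6, 8, 4)) + 1j * rng.random((6, 8, 4))
A = rng.random((8, 4, 3, 2)) + 1j * rng.random((8, 4, 3, 2))

def serial(Signal, A):
    B = np.einsum("ijk,jklm->iklm", Signal, A)
    return (np.fft.ifft(B, axis=1).real ** 2).sum(1)

def chunked(Signal, A, n_chunks):
    # Each partial product is independent and could be handed to a worker.
    partials = [
        np.einsum("ijk,jklm->iklm", s, a)
        for s, a in zip(np.split(Signal, n_chunks, axis=1),
                        np.split(A, n_chunks, axis=0))
    ]
    # Combine the partial sums first, then apply the nonlinear steps once.
    B = np.sum(partials, axis=0)
    return (np.fft.ifft(B, axis=1).real ** 2).sum(1)

reference = serial(Signal, A)
combined = chunked(Signal, A, n_chunks=4)
print(np.allclose(reference, combined))
```

Comparing a pooled result against the serial one with np.allclose in the same way is a cheap sanity check before comparing timings.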
I'd have loved to offer you an FFT library written in OpenCL, but I'm not sure whether you'd have to write the Python bridge yourself (more code) or whether you'd trust the first implementation you come across on GitHub. If you're willing to give in to CUDA's vendor lock-in, there is an "almost drop-in replacement" for NumPy called CuPy, and it has FFT and IFFT kernels. Hope this helps!
With your arrays:
In [42]: Signal.shape
Out[42]: (355, 128, 4)
In [43]: A.shape
Out[43]: (128, 4, 46, 23)
The B calc takes minutes on my modest machine:
In [44]: B = (Signal[..., None, None] * A[None,...]).sum(1)
In [45]: B.shape
Out[45]: (355, 4, 46, 23)
einsum is much faster:
In [46]: B2=np.einsum('ijk,jklm->iklm',Signal,A)
In [47]: np.allclose(B,B2)
Out[47]: True
In [48]: timeit B2=np.einsum('ijk,jklm->iklm',Signal,A)
1.05 s ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
reworking the dimensions to move the 128 to the classic dot positions:
In [49]: B21=np.einsum('ikj,kjn->ikn',Signal.transpose(0,2,1),A.reshape(128, 4, 46*23).transpose(1,0,2))
In [50]: B21.shape
Out[50]: (355, 4, 1058)
In [51]: timeit B21=np.einsum('ikj,kjn->ikn',Signal.transpose(0,2,1),A.reshape(128, 4, 46*23).transpose(1,0,2)
...: )
1.04 s ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With a bit more tweaking I can use matmul/@ and cut the time in half:
In [52]: B3=Signal.transpose(0,2,1)[:,:,None,:]@(A.reshape(1,128, 4, 46*23).transpose(0,2,1,3))
In [53]: timeit B3=Signal.transpose(0,2,1)[:,:,None,:]@(A.reshape(1,128, 4, 46*23).transpose(0,2,1,3))
497 ms ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [54]: B3.shape
Out[54]: (355, 4, 1, 1058)
In [56]: np.allclose(B, B3[:,:,0,:].reshape(B.shape))
Out[56]: True
Casting the arrays to the matmul format took a fair bit of experimentation. matmul makes optimal use of BLAS-like libraries. You may improve speed by installing better libraries.
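The reshaping behind the matmul call is easier to follow on small arrays. The sketch below uses made-up sizes (5, 8, 4) and (8, 4, 3, 2) in place of the real ones and checks the matmul form against the einsum form; the names lhs, rhs, ref and out are only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Small stand-ins for Signal (355, 128, 4) and A (128, 4, 46, 23).
Signal = rng.random((5, 8, 4)) + 1j * rng.random((5, 8, 4))
A = rng.random((8, 4, 3, 2)) + 1j * rng.random((8, 4, 3, 2))

ref = np.einsum('ijk,jklm->iklm', Signal, A)           # (5, 4, 3, 2)

# Put the summed axis (length 8) last on the left and second-to-last on the
# right, which is where matmul expects the contraction axis.
lhs = Signal.transpose(0, 2, 1)[:, :, None, :]          # (5, 4, 1, 8)
rhs = A.reshape(1, 8, 4, 3 * 2).transpose(0, 2, 1, 3)   # (1, 4, 8, 6)

out = (lhs @ rhs)[:, :, 0, :].reshape(ref.shape)        # back to (5, 4, 3, 2)
print(np.allclose(ref, out))
```

The leading (5, 4) and (1, 4) dimensions are treated as batch dimensions and broadcast against each other, so each batch entry is a (1, 8) by (8, 6) matrix product.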
Is there a faster way to find out whether a number is between two numbers? My current code is below. Thanks.
lists = [2.3, 4, 3,5.5, 6.5, 7.5, 6, 8]
newlist = []
a = 2
b = 7
for i in lists:
    if min(a, b) < i < max(a, b):
        newlist.append(i)
print(newlist)
lists = [2.3, 4, 3,5.5, 6.5, 7.5, 6, 8]
newlist = []
a = 2
b = 7
minimum = min(a, b)
maximum = max(a, b)
for i in lists:
    if minimum < i < maximum:
        newlist.append(i)
print(newlist)
This is faster because the minimum and maximum are computed once, rather than every time the loop body runs and the condition is checked.
Try the following:
a, b = min(a, b), max(a, b)
newlist = [x for x in lists if a < x < b]
With 100000 iterations, I found it 3 times faster than the original code. Using a list comprehension instead of the for loop with if and append helps a little, but most of the improvement comes from computing min and max once, before the list comprehension (or the loop):
0.1944 sec.: list comprehension + min & max predefined
0.2672 sec.: for loop with if + min & max predefined
0.5600 sec.: original (for loop with if + min & max at each iteration)
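A comparison along these lines can be reproduced with the standard timeit module. The sketch below is one way to set it up; the three function names are made up for the example, and the absolute timings will differ from machine to machine.

```python
from timeit import timeit

lists = [2.3, 4, 3, 5.5, 6.5, 7.5, 6, 8]
a, b = 2, 7

def original():
    # min and max are recomputed on every iteration
    newlist = []
    for i in lists:
        if min(a, b) < i < max(a, b):
            newlist.append(i)
    return newlist

def loop_predefined():
    # bounds computed once, explicit loop kept
    lo, hi = min(a, b), max(a, b)
    newlist = []
    for i in lists:
        if lo < i < hi:
            newlist.append(i)
    return newlist

def comprehension_predefined():
    # bounds computed once, list comprehension instead of append
    lo, hi = min(a, b), max(a, b)
    return [x for x in lists if lo < x < hi]

for fn in (original, loop_predefined, comprehension_predefined):
    print(f"{timeit(fn, number=100_000):.4f} sec.: {fn.__name__}")
```

All three functions return the same list, so only the running time differs.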
When looking for where your code is slow, the built-in cProfile module is handy. Here we can use it in the following way:
import cProfile as profile
def func():
    lists = [2.3, 4, 3,5.5, 6.5, 7.5, 6, 8]
    newlist = []
    a = 2
    b = 7
    for i in lists:
        if min(a, b) < i < max(a, b):
            newlist.append(i)
    print(newlist)
profile.run('func()')
output is
[2.3, 4, 3, 5.5, 6.5, 6]
27 function calls in 0.000 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <stdin>:1(func)
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {built-in method builtins.exec}
8 0.000 0.000 0.000 0.000 {built-in method builtins.max}
8 0.000 0.000 0.000 0.000 {built-in method builtins.min}
1 0.000 0.000 0.000 0.000 {built-in method builtins.print}
6 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
As you can see, max and min were each called 8 times, but in this case one call to min and one call to max would be enough. The sample data is too small to say anything useful about the execution time of the individual components. If you wish, you can use more data (longer lists) and look at the results.
I have a df that looks like this:
time volts1 volts2
0 0.000 -0.299072 0.427551
2 0.001 -0.299377 0.427551
4 0.002 -0.298767 0.427551
6 0.003 -0.298767 0.422974
8 0.004 -0.298767 0.422058
10 0.005 -0.298462 0.422363
12 0.006 -0.298767 0.422668
14 0.007 -0.298462 0.422363
16 0.008 -0.301208 0.420227
18 0.009 -0.303345 0.418091
In actuality, the df has >50 columns, but for simplicity, I'm just showing 3.
I want to group this df every n rows, let's say 5. I want to aggregate time with max and the rest of the columns with mean. Because there are so many columns, I'd love to be able to loop over them and not have to do it manually.
I know I can do something like this where I go through and create all new columns manually:
df.groupby(df.index // 5).agg(time=('time', 'max'),
                              volts1=('volts1', 'mean'),
                              volts2=('volts2', 'mean'),
                              ...
                              )
but because there are so many columns, I want to do this in a loop, something like:
df.groupby(df.index // 5).agg(time=('time', 'max'),
# df.time is always the first column
[i for i in df.columns[1:]]=(i, 'mean'),
)
If useful:
print(pd.__version__)
1.0.5
You can use a dictionary:
d = {col: "max" if col == "time" else "mean" for col in df.columns}
#{'time': 'max', 'volts1': 'mean', 'volts2': 'mean'}
df.groupby(df.index // 5).agg(d)
time volts1 volts2
0 0.002 -0.299072 0.427551
1 0.004 -0.298767 0.422516
2 0.007 -0.298564 0.422465
3 0.009 -0.302276 0.419159
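Note that df.index // 5 groups by index label, which is why the output above has four groups for a frame whose index runs 0, 2, ..., 18. If the goal is blocks of five consecutive rows regardless of the labels, one option is to group on row positions instead. The sketch below rebuilds the sample frame from the question; agg_map and by_position are names chosen for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "time": [0.000, 0.001, 0.002, 0.003, 0.004,
                 0.005, 0.006, 0.007, 0.008, 0.009],
        "volts1": [-0.299072, -0.299377, -0.298767, -0.298767, -0.298767,
                   -0.298462, -0.298767, -0.298462, -0.301208, -0.303345],
        "volts2": [0.427551, 0.427551, 0.427551, 0.422974, 0.422058,
                   0.422363, 0.422668, 0.422363, 0.420227, 0.418091],
    },
    index=range(0, 20, 2),  # same even-numbered index as in the question
)

# max for time, mean for every other column
agg_map = {col: "max" if col == "time" else "mean" for col in df.columns}

# np.arange(len(df)) // 5 labels rows 0-4 as group 0, rows 5-9 as group 1, ...
by_position = df.groupby(np.arange(len(df)) // 5).agg(agg_map)
print(by_position)
```

With this sample the positional version yields two groups of five rows each, whereas grouping on df.index // 5 yields four.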
I have the following code that takes 800 ms to execute, even though there is not much data: just a few columns and a few rows.
Is there an opportunity to make it faster? I really don't know where the bottleneck is in this code.
import datetime
import numpy as np
def compute_s_t(df,
                gb=('session_time', 'trajectory_id'),
                params=('t', 's', 's_normalized', 'v_direct', 't_abs', ),
                fps=25, inplace=True):
    if not inplace:
        df = df.copy()
    orig_columns = df.columns.tolist()
    # compute travelled distance
    df['dx'] = df['x_world'].diff()
    df['dy'] = df['y_world'].diff()
    t1 = datetime.datetime.now()
    df['ds'] = np.sqrt(np.array(df['dx'] ** 2 + df['dy'] ** 2, dtype=np.float32))
    df['ds'].iloc[0] = 0 # to avoid NaN returned by .diff()
    df['s'] = df['ds'].cumsum()
    df['s'] = (df.groupby('trajectory_id')['s']
               .transform(subtract_nanmin))
    # compute travelled time
    df['dt'] = df['frame'].diff() / fps
    df['dt'].iloc[0] = 0 # to avoid NaN returned by .diff()
    df['t'] = df['dt'].cumsum()
    df['t'] = (df.groupby('trajectory_id')['t']
               .transform(subtract_nanmin))
    df['t_abs'] = df['frame'] / fps
    # compute velocity
    # why values[:, 0]? why duplicate column?
    df['v_direct'] = df['ds'].values / df['dt'].values
    df.loc[df['t'] == 0, 'v'] = np.nan
    # compute normalized s
    df['s_normalized'] = (df.groupby('trajectory_id')['s']
                          .transform(divide_nanmax))
    # skip intermediate results
    cols = orig_columns + list(params)
    t2 = datetime.datetime.now()
    print((t2 - t1).microseconds / 1000)
    return df[cols]
Here is the profiler output:
18480 function calls (18196 primitive calls) in 0.593 seconds
Ordered by: call count
ncalls tottime percall cumtime percall filename:lineno(function)
11 0.000 0.000 0.580 0.053 frame.py:3105(__setitem__)
11 0.000 0.000 0.000 0.000 frame.py:3165(_ensure_valid_index)
11 0.000 0.000 0.580 0.053 frame.py:3182(_set_item)
11 0.000 0.000 0.000 0.000 frame.py:3324(_sanitize_column)
11 0.000 0.000 0.003 0.000 generic.py:2599(_set_item)
11 0.000 0.000 0.577 0.052 generic.py:2633(_check_setitem_copy)
11 0.000 0.000 0.000 0.000 indexing.py:2321(convert_to_index_sliceable)
As suggested in the comments, I ran a profiler; its output for the function is shown above. The two helper functions are:
def subtract_nanmin(x):
    return x - np.nanmin(x)
def divide_nanmax(x):
    return x / np.nanmax(x)
One thing to do is replace:
df.columns.tolist()
with
df.columns.values.tolist()
This is faster, although the difference is only a few microseconds per call. Here's an experiment with a random 100x100 dataframe:
%timeit df.columns.values.tolist()
output:
1.29 µs ± 19.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
and with the same df:
%timeit df.columns.tolist()
output:
6.91 µs ± 241 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
UPDATE:
What are subtract_nanmin and divide_nanmax?
Instead of
df['ds'].iloc[0] = 0 # to avoid NaN returned by .diff()
df['dt'].iloc[0] = 0 # to avoid NaN returned by .diff()
You can use df.fillna(0) or df['ds'].fillna(0) to get rid of NaNs
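A minimal sketch of that suggestion, on a made-up five-row frame with the same column names as the question: diff() followed by fillna(0) replaces the chained df['ds'].iloc[0] = 0 assignment, and the built-in transform("min") stands in for the subtract_nanmin helper (both skip NaN). The sample values are assumptions for illustration only. The profile in the question attributes 0.577 s of the 0.593 s total to _check_setitem_copy, so the column assignments themselves are worth looking at before micro-optimizing anything else.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "trajectory_id": [1, 1, 1, 2, 2],
    "frame": [0, 1, 2, 10, 11],
    "x_world": [0.0, 3.0, 6.0, 1.0, 1.0],
    "y_world": [0.0, 4.0, 8.0, 1.0, 2.0],
})
fps = 25

# diff() leaves NaN in the first row; fillna(0) removes it in the same expression.
ds = np.hypot(df["x_world"].diff(), df["y_world"].diff()).fillna(0)
dt = (df["frame"].diff() / fps).fillna(0)

# Cumulative distance and time, shifted so each trajectory starts at 0.
s = ds.cumsum()
t = dt.cumsum()
df["s"] = s - s.groupby(df["trajectory_id"]).transform("min")
df["t"] = t - t.groupby(df["trajectory_id"]).transform("min")
print(df)
```

Building the intermediate results as standalone Series and assigning only the final columns also keeps the number of DataFrame assignments down.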
I have a big pandas dataframe (1 million rows), and I need better performance in my code to process this data.
My code is below, and a profiling analysis is also provided.
Header of the dataset:
key_id, date, par1, par2, par3, par4, pop, price, value
For each key, we have a row for each of the 5000 possible dates.
There are 200 key_id * 5000 date = 1000000 rows.
Using different variables val1, ..., val4, I compute a value for each row, and I want to extract the top 20 dates with the best value for each key_id, and then compute the popularity of the set of variables used.
In the end, I want to find the variables that optimize this popularity.
from itertools import product
def compute_value_col(dataset, val1=0, val2=0, val3=0, val4=0):
    dataset['value'] = dataset['price'] + val1 * dataset['par1'] \
        + val2 * dataset['par2'] + val3 * dataset['par3'] \
        + val4 * dataset['par4']
    return dataset
def params_to_score(dataset, top=10, val1=0, val2=0, val3=0, val4=0):
    dataset = compute_value_col(dataset, val1, val2, val3, val4)
    dataset = dataset.sort(['key_id','value'], ascending=True)
    dataset = dataset.groupby('key_id').head(top).reset_index(drop=True)
    return dataset['pop'].sum()
def optimize(dataset, top):
    for i,j,k,l in product(xrange(10),xrange(10),xrange(10),xrange(10)):
        print i, j, k, l, params_to_score(dataset, top, 10*i, 10*j, 10*k, 10*l)
optimize(my_dataset, 20)
I need to improve performance.
Here is the %prun output after running params_to_score 49 times:
ncalls tottime percall cumtime percall filename:lineno(function)
98 2.148 0.022 2.148 0.022 {pandas.algos.take_2d_axis1_object_object}
49 1.663 0.034 9.852 0.201 <ipython-input-59-88fc8127a27f>:150(params_to_score)
49 1.311 0.027 1.311 0.027 {method 'get_labels' of 'pandas.hashtable.Float64HashTable' objects}
49 1.219 0.025 1.223 0.025 {pandas.algos.groupby_indices}
49 0.875 0.018 0.875 0.018 {method 'get_labels' of 'pandas.hashtable.PyObjectHashTable' objects}
147 0.452 0.003 0.457 0.003 index.py:581(is_unique)
343 0.193 0.001 0.193 0.001 {method 'copy' of 'numpy.ndarray' objects}
1 0.136 0.136 10.058 10.058 <ipython-input-59-88fc8127a27f>:159(optimize)
147 0.122 0.001 0.122 0.001 {method 'argsort' of 'numpy.ndarray' objects}
833 0.112 0.000 0.112 0.000 {numpy.core.multiarray.empty}
49 0.109 0.002 0.109 0.002 {method 'get_labels_groupby' of 'pandas.hashtable.Int64HashTable' objects}
98 0.083 0.001 0.083 0.001 {pandas.algos.take_2d_axis1_float64_float64}
49 0.078 0.002 1.460 0.030 groupby.py:1014(_cumcount_array)
I think I could split the big dataframe into small dataframes by key_id to improve the sort time: since I want the top 20 dates with the best value for each key_id, sorting by key only serves to separate the different keys.
But I would welcome any advice: how can I improve the efficiency of this code, given that I need to run params_to_score thousands of times?
EDIT: @Jeff
Thanks a lot for your help!
I tried using nsmallest instead of sort & head, but strangely it is 5-6 times slower, when I benchmark the two following functions:
def to_bench1(dataset):
    dataset = dataset.sort(['key_id','value'], ascending=True)
    dataset = dataset.groupby('key_id').head(50).reset_index(drop=True)
    return dataset['pop'].sum()
def to_bench2(dataset):
    dataset = dataset.set_index('pop')
    dataset = dataset.groupby(['key_id'])['value'].nsmallest(50).reset_index()
    return dataset['pop'].sum()
On a sample of ~100000 rows, to_bench2 runs in 0.5 seconds, while to_bench1 takes only 0.085 seconds on average.
After profiling to_bench2, I notice many more isinstance calls than before, but I do not know where they come from...
The way to make this significantly faster is like this.
Create some sample data
In [148]: df = DataFrame({'A' : range(5), 'B' : [1,1,1,2,2] })
Define the compute_val_column like you have
In [149]: def f(p):
   .....:     return DataFrame({ 'A' : df['A']*p, 'B' : df.B })
   .....:
These are the cases (you probably want a list of tuples here), e.g. the Cartesian product of all of the cases that you want to feed into the above function.
In [150]: parms = [1,3]
Create a new data frame that has the full set of values, keyed by each of the parms. This is basically a broadcasting operation.
In [151]: df2 = pd.concat([ f(p) for p in parms ],keys=parms,names=['parm','indexer']).reset_index()
In [155]: df2
Out[155]:
parm indexer A B
0 1 0 0 1
1 1 1 1 1
2 1 2 2 1
3 1 3 3 2
4 1 4 4 2
5 3 0 0 1
6 3 1 3 1
7 3 2 6 1
8 3 3 9 2
9 3 4 12 2
Here's the magic. Group by whatever columns you want, including parm as the first one (or possibly multiple ones). Then do a partial sort (this is what nlargest does); this is more efficient than sort & head (well, it depends on the group density a bit). Sum at the end (again by the groupers that we care about, as you are doing a 'partial' reduction).
In [153]: df2.groupby(['parm','B']).A.nlargest(2).sum(level=['parm','B'])
Out[153]:
parm B
1 1 3
2 7
3 1 9
2 21
dtype: int64
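The sum(level=...) call above is from an older pandas API. On recent pandas versions the same reduction can be written with groupby(level=...).sum(); the sketch below reruns the toy example that way, under the assumption that a current pandas release is installed. The names top and result are only for this example.

```python
import pandas as pd

df = pd.DataFrame({"A": range(5), "B": [1, 1, 1, 2, 2]})

def f(p):
    # one scaled copy of the frame per parameter value
    return pd.DataFrame({"A": df["A"] * p, "B": df["B"]})

parms = [1, 3]
df2 = pd.concat(
    [f(p) for p in parms], keys=parms, names=["parm", "indexer"]
).reset_index()

# Keep the 2 largest A values per (parm, B) group, then sum them per group.
top = df2.groupby(["parm", "B"])["A"].nlargest(2)
result = top.groupby(level=["parm", "B"]).sum()
print(result)
```

The group keys become index levels of the nlargest result, which is why the final sum groups by level rather than by column.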