Looping a function with irregular index increase - python

I have my function:
import numpy as np

def monte_carlo1(N):
    x = np.random.random(size=N)
    y = np.random.random(size=N)
    dist = np.sqrt(x ** 2 + y ** 2)
    hit = 0
    miss = 0
    for z in dist:
        if z <= 1:
            hit += 1
        else:
            miss += 1
    hit_ratio = hit / N
    return hit_ratio
What I want to do is run this function 100 times for each of 10 different values of N, collecting the data into arrays.
For example, a couple of the data collections could be generated by:
data1 = np.array([monte_carlo1(10) for i in range(100)])
data2 = np.array([monte_carlo1(50) for i in range(100)])
data3 = np.array([monte_carlo1(100) for i in range(100)])
But it would be better if I could write a loop that iterates 10 times and produces 10 arrays of data, instead of having 10 variables data1...data10.
However, I want to increase the value of N passed to monte_carlo1(N) by irregular amounts, so in my loop I can't just add a fixed value to N on each iteration.
Would someone suggest how I might build a loop like this?
Thanks
EDIT:
N_vals = [10, 50, 100, 250, 500, 1000, 3000, 5000, 7500, 10000]
def data_mc():
    for n in N_vals:
        data = np.array([monte_carlo1(n) for i in range(10)])
    return data

I've set up the function like this, but the output is just one array, which suggests I'm doing something wrong and N_vals isn't being cycled through.

Here is a solution using a pandas.DataFrame where the row index is the value of n for that iteration and the columns represent each of the repeat iterations.
import pandas as pd

def calc(n_list, repeat):
    # this will be a simple list of lists, not a numpy array
    result_list = []
    for n in n_list:
        result_list.append([monte_carlo1(n) for _ in range(repeat)])
    return pd.DataFrame(data=result_list, index=n_list)
This would allow you to do some data analysis afterwards:
>>> from my_script import calc
>>> n_list = [10, 50, 100, 250, 500]
>>> df = calc(n_list, 10)
>>> df
0 1 2 3 4 5 6 7 8 9
10 0.600 0.800 0.700 0.800 0.600 1.000 0.800 0.800 0.700 0.900
50 0.840 0.860 0.700 0.860 0.740 0.860 0.780 0.740 0.740 0.820
100 0.770 0.780 0.730 0.790 0.780 0.730 0.760 0.740 0.770 0.690
250 0.784 0.804 0.792 0.768 0.800 0.780 0.792 0.800 0.804 0.764
500 0.798 0.776 0.782 0.786 0.768 0.798 0.786 0.774 0.774 0.796
Now you can calculate statistics per value of n:
>>> import pandas as pd
>>> stats = pd.DataFrame()
>>> stats['mean'] = df.mean(axis=1)
>>> stats['standard_dev'] = df.std(axis=1)
>>> stats
mean standard_dev
10 0.7700 0.125167
50 0.7940 0.061137
100 0.7540 0.030984
250 0.7888 0.014459
500 0.7838 0.010891
This data analysis shows, for example, that your estimates become more consistent (smaller standard deviation) as you increase n.
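If you prefer to stay with plain numpy arrays rather than a DataFrame, a minimal alternative sketch (not from the original answer) is to collect one array per value of n in a dictionary, which also avoids the data1...data10 variables and the overwriting problem from the EDIT:

import numpy as np

def data_mc(n_vals, repeat=100):
    # one array of hit ratios per value of N, keyed by N
    results = {}
    for n in n_vals:
        results[n] = np.array([monte_carlo1(n) for _ in range(repeat)])
    return results

# results = data_mc([10, 50, 100, 250, 500, 1000, 3000, 5000, 7500, 10000])
# results[50].mean(), results[50].std()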


An efficient way to calculate deltas in the DataFrame?

I need to calculate the deltas, and I did it, but I'm using itertuples and I want to avoid that.
Is there a more efficient way to do this? Here is how I did it:
from numpy import append, around, array, float64
from numpy.random import uniform
from pandas import DataFrame
matrix = around(a=uniform(low=1.0, high=50.0, size=(10, 2)), decimals=2)
points = DataFrame(data=matrix, columns=['x', 'y'], dtype='float64')
x_column = points.columns.get_loc('x')
y_column = points.columns.get_loc('y')
x_delta = array(object=[], dtype=float64)
y_delta = array(object=[], dtype=float64)
for row, iterator in enumerate(iterable=points.itertuples(index=False, name='Point')):
    if row == 0:
        x_delta = append(arr=x_delta, values=0.0)
        y_delta = append(arr=y_delta, values=0.0)
    else:
        x_delta = append(arr=x_delta, values=iterator.x / points.iat[row - 1, x_column] - 1)
        y_delta = append(arr=y_delta, values=iterator.y / points.iat[row - 1, y_column] - 1)
x_delta = around(a=x_delta, decimals=2)
y_delta = around(a=y_delta, decimals=2)
points.insert(loc=points.shape[1], column='x_delta', value=x_delta)
points.insert(loc=points.shape[1], column='y_delta', value=y_delta)
print(points)
x y x_delta y_delta
0 26.08 1.37 0.00 0.00
1 8.34 6.82 -0.68 3.98
2 38.42 45.20 3.61 5.63
3 3.59 33.12 -0.91 -0.27
4 42.94 11.06 10.96 -0.67
5 31.99 17.38 -0.26 0.57
6 4.29 17.46 -0.87 0.00
7 19.68 22.28 3.59 0.28
8 27.55 12.98 0.40 -0.42
9 40.23 9.60 0.46 -0.26
Thanks a lot!
Pandas has a pct_change() function which compares each element with the prior element. You can achieve the same result with one line:
points[['x_delta', 'y_delta']] = points[['x', 'y']].pct_change().fillna(0).round(2)
The fillna(0) fixes the first row, which would otherwise be NaN.
Pandas has the built-in .diff() function.
Calculates the difference of a DataFrame element compared with another element in the DataFrame (default is the element in the previous row).
delta_dataframe = original_dataframe.diff()
In this case delta_dataframe will give you the change between rows of original_dataframe.
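For comparison, a minimal sketch (using a few rows of the data above) showing that .diff() gives absolute row-to-row changes, while .pct_change() reproduces the relative changes computed by the itertuples loop:

import pandas as pd

points = pd.DataFrame({'x': [26.08, 8.34, 38.42], 'y': [1.37, 6.82, 45.20]})

abs_delta = points.diff().fillna(0)                 # current - previous
rel_delta = points.pct_change().fillna(0).round(2)  # current / previous - 1

print(abs_delta)
print(rel_delta)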

I am trying to interpolate a value from a pandas dataframe using numpy.interp but it continuously returns a wrong interpolation

import pandas as pd
import numpy as np
# defining a function for interpolation
def interpolate(x, df, xcol, ycol):
    return np.interp([x], df[xcol], df[ycol])

# function call
print(interpolate(0.4, freq_data, 'Percent_cum_freq', 'cum_OGIP'))
Trying a more direct method:
print(np.interp(0.4, freq_data['Percent_cum_freq'], freq_data['cum_OGIP']))
Output:
from function [2.37197912e+10]
from direct 23719791158.266743
For any value of x that I pass (0.4, 0.6, or 0.9), it gives the same result: 2.37197912e+10.
freq_data dataframe
Percent_cum_freq cum_OGIP
0 0.999 4.455539e+07
1 0.981 1.371507e+08
2 0.913 2.777860e+08
3 0.824 4.664612e+08
4 0.720 7.031764e+08
5 0.615 9.879315e+08
6 0.547 1.320727e+09
7 0.464 1.701562e+09
8 0.396 2.130436e+09
9 0.329 2.607351e+09
10 0.285 3.132306e+09
11 0.245 3.705301e+09
12 0.199 4.326336e+09
13 0.167 4.995410e+09
14 0.136 5.712525e+09
15 0.115 6.477680e+09
16 0.085 7.290874e+09
17 0.072 8.152108e+09
18 0.056 9.061383e+09
19 0.042 1.001870e+10
20 0.034 1.102405e+10
21 0.027 1.207745e+10
22 0.022 1.317888e+10
23 0.015 1.432835e+10
24 0.013 1.552587e+10
25 0.010 1.677142e+10
26 0.007 1.806502e+10
27 0.002 1.940665e+10
28 0.002 2.079632e+10
29 0.002 2.223404e+10
30 0.001 2.371979e+10
What is wrong? How can I solve the problem?
Well, I was also surprised by the results when I ran the code you provided. After a little searching in the documentation for np.interp, I found that the x-coordinates must always be increasing.
np.interp(x, list_of_x_coordinates, list_of_y_coordinates)
Where x is the value you want the value of y at.
list_of_x_coordinates is df[xcol] -> this must always be increasing, but since your dataframe is decreasing, it will never give the correct result.
list_of_y_coordinates is df[ycol] -> this must have the same dimension and ordering as df[xcol].
My reproduced code:
import numpy as np
list_1=np.interp([0.1,0.5,0.8],[0.999,0.547,0.199,0.056,0.013,0.001],[4.455539e+07,1.320727e+09,4.326336e+09,9.061383e+09,1.552587e+10, 2.371979e+10])
list_2=np.interp([0.1,0.5,0.8],[0.001,0.013,0.056,0.199,0.547,0.999],[2.371979e+10,1.552587e+10,9.061383e+09,4.326336e+09,1.320727e+09,4.455539e+07])
print("In decreasing order -> as in your case", list_1)
print("In increasing order of x coordinates", list_2)
Output:
In decreasing order -> as in your case [2.371979e+10 2.371979e+10 2.371979e+10]
In increasing order of x coordinates [7.60444546e+09 1.72665695e+09 6.06409705e+08]
As you can see now, you have to sort df[x_col] and reorder df[y_col] accordingly before passing them to np.interp.
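A minimal sketch of that fix (assuming freq_data is the frame shown in the question): sort on the x column once, then interpolate.

import numpy as np

# sort so that Percent_cum_freq is increasing, as np.interp requires
freq_sorted = freq_data.sort_values('Percent_cum_freq')

print(np.interp(0.4, freq_sorted['Percent_cum_freq'], freq_sorted['cum_OGIP']))
print(np.interp(0.6, freq_sorted['Percent_cum_freq'], freq_sorted['cum_OGIP']))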
By default np.interp needs the x values to be sorted. If you do not want to sort your dataframe a workaround is to set the period argument to np.inf:
print(np.interp(0.4, freq_data['Percent_cum_freq'], freq_data['cum_OGIP'], period=np.inf))

how to set a variable dynamically - python, pandas

>>>import pandas as pd
>>>import numpy as np
>>>from pandas import Series, DataFrame
>>>rawData = pd.read_csv('wow.txt')
>>>rawData
time mean
0 0.005 0
1 0.010 258.64
2 0.015 258.43
3 0.020 253.72
4 0.025 0
5 0.030 0
6 0.035 253.84
7 0.040 254.17
8 0.045 0
9 0.050 0
10 0.055 0
11 0.060 254.73
12 0.065 254.90
.
.
.
489 4.180 167.46
I want to apply the formula below and get 'y' when I enter an 'x' value dynamically, in order to plot a graph.
y = y0 + (y1 - y0) * (x - x0) / (x1 - x0)
If the 'mean' value is 0 (for example at indexes 4-5 and 8-10):
1) Ask the question "Do you want to interpolate?"
2) If yes, enter the 'x' value.
3) Calculate using the formula (repeat 1-3 until the answer is no).
4) If the answer is no, finish the program.
time(x-axis) mean(y-axis)
0 0.005 0
1 0.010 258.64
2 0.015 258.43
3 0.020 <--x0 253.72 <-- y0
4 0.025 0
5 0.030 0
6 0.035 <--x1 253.84 <-- y1
7 0.040 <--x0 254.17 <-- y0
8 0.045 0
9 0.050 0
10 0.055 0
11 0.060 <--x1 254.73 <-- y1
12 0.065 254.90
.
.
.
489 4.180 167.46
The variables x0, x1, y0, y1 are taken from the rows located just outside a run of '0' values.
How can I get a variable dynamically and do this calculation?
Do you have any good ideas for designing this program?
for i in df.index:
    if df['mean'][i] == 0:          # note: df.mean is the DataFrame method, so use df['mean']
        answer = input("Do you want to interpolate?")
        if answer == "Y":
            x1 = df.loc[df['mean'] > 0].index[1]
            y1 = df.loc[df['time'] > 0].index[1]
            # interpolate with the formula here
        else:
            # process / finish here
            pass
    else:
        x0 = df['time'][i]
        y0 = df['mean'][i]
Excuse the typos, I'm working on a mobile phone.
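For completeness, a runnable, non-interactive sketch of the same idea (my own reconstruction, not the answer above): for each zero in 'mean', find the surrounding non-zero rows and apply y = y0 + (y1 - y0) * (x - x0) / (x1 - x0).

def interpolate_gaps(df):
    # df has columns 'time' and 'mean'; zeros in 'mean' mark the gaps to fill
    df = df.copy()
    for i in df.index:
        if df.loc[i, 'mean'] != 0:
            continue
        before = df.loc[:i].iloc[:-1]
        before = before[before['mean'] != 0]   # last non-zero row before the gap
        after = df.loc[i:].iloc[1:]
        after = after[after['mean'] != 0]      # first non-zero row after the gap
        if before.empty or after.empty:
            continue                           # cannot interpolate at the edges
        x0, y0 = before.iloc[-1]['time'], before.iloc[-1]['mean']
        x1, y1 = after.iloc[0]['time'], after.iloc[0]['mean']
        x = df.loc[i, 'time']
        df.loc[i, 'mean'] = y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return df

# filled = interpolate_gaps(rawData)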

MUSIC Algorithm Spectrum Python Implementation

I am working on a small radar project that can measure the Doppler shift created by the heart and chest. Since I know the number of sources in advance, I decided to use the MUSIC algorithm for spectral analysis. I am acquiring data and sending it to Python for analysis. However, my Python code says that the power is equal at ALL frequencies for a signal made of two mixed sinusoids of frequency 1 Hz and 2 Hz. My code is below with a sample output:
from scipy import signal
import numpy as np
from numpy import linalg as LA
import matplotlib.pyplot as plt
import cmath
import scipy

N = 5
z = np.linspace(0, 2 * np.pi, num=N)
x = np.sin(2 * np.pi * z) + np.sin(1 * np.pi * z) + np.random.random(N) * 0.3  # sample signal
conj = np.conj(x)
l = len(conj)
sRate = 25  # sampling rate
p = 2
flipped = [0 for h in range(0, l)]
flipped = conj[::-1]
acf = signal.convolve(x, flipped, 'full')
a1 = scipy.linalg.toeplitz(c=np.asarray(acf), r=np.asarray(acf))  # autocorrelation matrix that will be decomposed into eigenvectors
eigenValues, eigenVectors = LA.eig(a1)
idx = eigenValues.argsort()[::-1]
eigenValues = eigenValues[idx]
eigenVectors = eigenVectors[:, idx]
idx = eigenValues.argsort()[::-1]
eigenValues = eigenValues[idx]  # sorting the eigenvalues and eigenvectors from greatest to least eigenvalue
eigenVectors = eigenVectors[:, idx]
signal_eigen = eigenVectors[0:p]  # these vectors make up the signal subspace, using the number of principal components (2) to split the eigenvectors
noise_eigen = eigenVectors[p:len(eigenVectors)]  # noise subspace
for f in range(0, sRate):
    sum1 = 0
    frequencyVector = np.zeros(len(noise_eigen[0]), dtype=np.complex_)
    for i in range(0, len(noise_eigen[0])):
        frequencyVector[i] = np.conjugate(complex(np.cos(2 * np.pi * i * f), np.sin(2 * np.pi * i * f)))  # creating a frequency vector e^(j*2*pi*i*f) and taking the conjugate of each component
    for u in range(0, len(noise_eigen)):
        sum1 += (abs(np.dot(np.asarray(frequencyVector).transpose(), np.asarray(noise_eigen[u])))) ** 2  # summing the squared absolute value of the dot product of each noise eigenvector with the frequency vector
    print(1 / sum1)
    print("\n")
"""
(OUTPUT OF THE ABOVE CODE)
0.120681885992
0
0.120681885992
1
0.120681885992
2
0.120681885992
3
0.120681885992
4
0.120681885992
5
0.120681885992
6
0.120681885992
7
0.120681885992
8
0.120681885992
9
0.120681885992
10
0.120681885992
11
0.120681885992
12
0.120681885992
13
0.120681885992
14
0.120681885992
15
0.120681885992
16
0.120681885992
17
0.120681885992
18
0.120681885992
19
0.120681885992
20
0.120681885992
21
0.120681885992
22
0.120681885992
23
0.120681885992
24
Process finished with exit code 0
"""
Here is the formula for the MUSIC Algorithm:
https://drive.google.com/file/d/0B5EG2FEWlIZwYmkteUludHNXS0k/view?usp=sharing
Mathematically, the problem is that i and f are both integers. Thus, 2*π*i*f is an integral multiple of 2π. Allowing for a tiny bit of round-off error, this gives you a cosine very close to 1.0 and a sine very close to 0.0. These values yield virtually no variation in frequencyVector from one iteration to the next.
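A quick numeric check of that point (a small sketch, not from the original answer), using integer i and f as in the question's loop:

import numpy as np

for f in (1, 2, 3):
    for i in (1, 2, 3):
        # with integer i and f, e^(j*2*pi*i*f) is always ~1 + 0j up to round-off
        print(f, i, complex(np.cos(2 * np.pi * i * f), np.sin(2 * np.pi * i * f)))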
I also see a problem in that you set up your signal_eigen matrix, but never use it. Isn't the signal itself required by this algorithm? As a result, all you're doing is sampling the noise at intervals of 2πi.
Let's try chopping up one cycle into sRate evenly-spaced sampling points. This results in spikes at 0.24 and 0.76 (out of the range 0.0 - 0.99). Does this match your intuition about how this should work?
signal_eigen = eigenVectors[0:p]
noise_eigen = eigenVectors[p:len(eigenVectors)]  # noise subspace
print "Signal\n", signal_eigen
print "Noise\n", noise_eigen

for f_int in range(0, sRate * p + 1):
    sum1 = 0
    frequencyVector = np.zeros(len(noise_eigen[0]), dtype=np.complex_)
    f = float(f_int) / sRate
    for i in range(0, len(noise_eigen[0])):
        # create a frequency vector e^(j*2*pi*i*f) and take the conjugate of each component
        frequencyVector[i] = np.conjugate(complex(np.cos(2 * np.pi * i * f), np.sin(2 * np.pi * i * f)))
        # print f, i, np.pi, np.cos(2 * np.pi * i * f)
    # print frequencyVector
    for u in range(0, len(noise_eigen)):
        # sum the squared dot product of each noise eigenvector and the frequency vector
        sum1 += (abs(np.dot(np.asarray(frequencyVector).transpose(), np.asarray(noise_eigen[u])))) ** 2
    print f, 1/sum1
Output
Signal
[[ -3.25974386e-01 3.26744322e-01 -5.24205744e-16 -1.84108176e-01
-7.07106781e-01 -6.86652798e-17 2.71561652e-01 3.78607948e-16
4.23482344e-01]
[ 3.40976541e-01 5.42419088e-02 -5.00000000e-01 -3.62655793e-01
-1.06880232e-16 3.53553391e-01 -3.89304223e-01 -3.53553391e-01
3.12595284e-01]]
Noise
[[ -3.06261935e-01 -5.16768248e-01 7.82012443e-16 -3.72989138e-01
-3.12515753e-16 -5.00000000e-01 5.19589478e-03 -5.00000000e-01
-2.51205535e-03]
[ 3.21775774e-01 8.19916352e-02 5.00000000e-01 -3.70053622e-01
1.44550753e-16 3.53553391e-01 4.33613344e-01 -3.53553391e-01
-2.54514258e-01]
[ -4.00349040e-01 4.82750272e-01 -8.71533036e-16 -3.42123880e-01
-2.68725150e-16 2.42479504e-16 -4.16290671e-01 -4.89739378e-16
-5.62428795e-01]
[ 3.21775774e-01 8.19916352e-02 -5.00000000e-01 -3.70053622e-01
-2.80456498e-16 -3.53553391e-01 4.33613344e-01 3.53553391e-01
-2.54514258e-01]
[ -3.06261935e-01 -5.16768248e-01 1.08027782e-15 -3.72989138e-01
-1.25036869e-16 5.00000000e-01 5.19589478e-03 5.00000000e-01
-2.51205535e-03]
[ 3.40976541e-01 5.42419088e-02 5.00000000e-01 -3.62655793e-01
-2.64414807e-16 -3.53553391e-01 -3.89304223e-01 3.53553391e-01
3.12595284e-01]
[ -3.25974386e-01 3.26744322e-01 -4.97151703e-16 -1.84108176e-01
7.07106781e-01 -1.62796158e-16 2.71561652e-01 2.06561854e-16
4.23482344e-01]]
0.0 0.115397176866
0.04 0.12355071192
0.08 0.135377011677
0.12 0.136669716901
0.16 0.148772917566
0.2 0.195742574649
0.24 0.237792763699
0.28 0.181921271171
0.32 0.12959840172
0.36 0.121070836044
0.4 0.139075881122
0.44 0.139216853056
0.48 0.117815494324
0.52 0.117815494324
0.56 0.139216853056
0.6 0.139075881122
0.64 0.121070836044
0.68 0.12959840172
0.72 0.181921271171
0.76 0.237792763699
0.8 0.195742574649
0.84 0.148772917566
0.88 0.136669716901
0.92 0.135377011677
0.96 0.12355071192
I'm also unsure of the correct implementation; having more of the paper for formula context would help. I'm not certain about the range and sampling of the f values. When I worked on FFT software, f was swept over the waveform in small increments, typically 2π/sRate.
I'm not getting those distinctive spikes now -- not sure what I did before. I made a small parametrized change, adding a num_slice variable:
num_slice = sRate * N
for f_int in range(0, num_slice + 1):
    sum1 = 0
    frequencyVector = np.zeros(len(noise_eigen[0]), dtype=np.complex_)
    f = float(f_int) / num_slice
You can compute it however you like, of course, but the ensuing loop runs through just the one cycle. Here's my output:
0.0 0.136398199883
0.008 0.136583829848
0.016 0.13711117893
0.024 0.137893463111
0.032 0.138792904453
0.04 0.139633157335
0.048 0.140219450839
0.056 0.140365986349
0.064 0.139926689416
0.072 0.138822121693
0.08 0.137054535152
0.088 0.13470609994
0.096 0.131921188389
0.104 0.128879079596
0.112 0.125765649854
0.12 0.122750994163
0.128 0.119976226317
0.136 0.117549199221
0.144 0.115546862203
0.152 0.114021482029
0.16 0.113008398728
0.168 0.112533730494
0.176 0.112621097254
0.184 0.113296863522
0.192 0.114593615279
0.2 0.116551634665
0.208 0.119218062482
0.216 0.12264326497
0.224 0.126873674308
0.232 0.131940131305
0.24 0.137840727381
0.248 0.144517728837
0.256 0.151830000359
0.264 0.159526062508
0.272 0.167228413981
0.28 0.174444818009
0.288 0.180621604818
0.296 0.185241411664
0.304 0.187943197745
0.312 0.188619481273
0.32 0.187445977812
0.328 0.184829467764
0.336 0.181300320748
0.344 0.177396490666
0.352 0.173576190425
0.36 0.170171993077
0.368 0.167379359825
0.376 0.165265454514
0.384 0.163786582966
0.392 0.16280869726
0.4 0.162130870823
0.408 0.161514399035
0.416 0.160719375729
0.424 0.159546457646
0.432 0.157875982968
0.44 0.155693319037
0.448 0.153091632029
0.456 0.150251065569
0.464 0.147402137481
0.472 0.144785618099
0.48 0.14261932062
0.488 0.141076562538
0.496 0.140275496354
0.504 0.140275496354
0.512 0.141076562538
0.52 0.14261932062
0.528 0.144785618099
0.536 0.147402137481
0.544 0.150251065569
0.552 0.153091632029
0.56 0.155693319037
0.568 0.157875982968
0.576 0.159546457646
0.584 0.160719375729
0.592 0.161514399035
0.6 0.162130870823
0.608 0.16280869726
0.616 0.163786582966
0.624 0.165265454514
0.632 0.167379359825
0.64 0.170171993077
0.648 0.173576190425
0.656 0.177396490666
0.664 0.181300320748
0.672 0.184829467764
0.68 0.187445977812
0.688 0.188619481273
0.696 0.187943197745
0.704 0.185241411664
0.712 0.180621604818
0.72 0.174444818009
0.728 0.167228413981
0.736 0.159526062508
0.744 0.151830000359
0.752 0.144517728837
0.76 0.137840727381
0.768 0.131940131305
0.776 0.126873674308
0.784 0.12264326497
0.792 0.119218062482
0.8 0.116551634665
0.808 0.114593615279
0.816 0.113296863522
0.824 0.112621097254
0.832 0.112533730494
0.84 0.113008398728
0.848 0.114021482029
0.856 0.115546862203
0.864 0.117549199221
0.872 0.119976226317
0.88 0.122750994163
0.888 0.125765649854
0.896 0.128879079596
0.904 0.131921188389
0.912 0.13470609994
0.92 0.137054535152
0.928 0.138822121693
0.936 0.139926689416
0.944 0.140365986349
0.952 0.140219450839
0.96 0.139633157335
0.968 0.138792904453
0.976 0.137893463111
0.984 0.13711117893
0.992 0.136583829848
1.0 0.136398199883
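For reference, a compact vectorized sketch of the same pseudospectrum sweep (my own restatement, assuming noise_eigen holds the noise-subspace vectors as rows, exactly as sliced in the code above):

import numpy as np

def music_pseudospectrum(noise_eigen, freqs):
    # noise_eigen: rows are the noise-subspace vectors used in the loop above
    m = noise_eigen.shape[1]
    k = np.arange(m)
    spectrum = []
    for f in freqs:
        steering = np.exp(-2j * np.pi * k * f)           # conjugated steering vector e^(-j*2*pi*k*f)
        denom = np.sum(np.abs(noise_eigen @ steering) ** 2)
        spectrum.append(1.0 / denom)
    return np.array(spectrum)

# freqs = np.linspace(0.0, 1.0, 200, endpoint=False)  # normalized frequency, one full cycle
# pseudo = music_pseudospectrum(np.asarray(noise_eigen), freqs)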

Improve performance on processing a big pandas dataframe

I have a big pandas dataframe (1 million rows), and I need better performance in my code to process this data.
My code is below, and a profiling analysis is also provided.
Header of the dataset:
key_id, date, par1, par2, par3, par4, pop, price, value
For each key_id, there is a row for every one of the 5000 possible dates.
There are 200 key_id * 5000 dates = 1,000,000 rows.
Using four different variables val1, ..., val4, I compute a value for each row, and I want to extract the top 20 dates with the best value for each key_id, then compute the popularity of the set of variables used.
In the end, I want to find the variables which optimize this popularity.
from itertools import product

def compute_value_col(dataset, val1=0, val2=0, val3=0, val4=0):
    dataset['value'] = dataset['price'] + val1 * dataset['par1'] \
        + val2 * dataset['par2'] + val3 * dataset['par3'] \
        + val4 * dataset['par4']
    return dataset

def params_to_score(dataset, top=10, val1=0, val2=0, val3=0, val4=0):
    dataset = compute_value_col(dataset, val1, val2, val3, val4)
    dataset = dataset.sort(['key_id', 'value'], ascending=True)
    dataset = dataset.groupby('key_id').head(top).reset_index(drop=True)
    return dataset['pop'].sum()

def optimize(dataset, top):
    for i, j, k, l in product(xrange(10), xrange(10), xrange(10), xrange(10)):
        print i, j, k, l, params_to_score(dataset, top, 10*i, 10*j, 10*k, 10*l)

optimize(my_dataset, 20)
I need to improve its performance.
Here is a %prun output, after running params_to_score 49 times:
ncalls tottime percall cumtime percall filename:lineno(function)
98 2.148 0.022 2.148 0.022 {pandas.algos.take_2d_axis1_object_object}
49 1.663 0.034 9.852 0.201 <ipython-input-59-88fc8127a27f>:150(params_to_score)
49 1.311 0.027 1.311 0.027 {method 'get_labels' of 'pandas.hashtable.Float64HashTable' objects}
49 1.219 0.025 1.223 0.025 {pandas.algos.groupby_indices}
49 0.875 0.018 0.875 0.018 {method 'get_labels' of 'pandas.hashtable.PyObjectHashTable' objects}
147 0.452 0.003 0.457 0.003 index.py:581(is_unique)
343 0.193 0.001 0.193 0.001 {method 'copy' of 'numpy.ndarray' objects}
1 0.136 0.136 10.058 10.058 <ipython-input-59-88fc8127a27f>:159(optimize)
147 0.122 0.001 0.122 0.001 {method 'argsort' of 'numpy.ndarray' objects}
833 0.112 0.000 0.112 0.000 {numpy.core.multiarray.empty}
49 0.109 0.002 0.109 0.002 {method 'get_labels_groupby' of 'pandas.hashtable.Int64HashTable' objects}
98 0.083 0.001 0.083 0.001 {pandas.algos.take_2d_axis1_float64_float64}
49 0.078 0.002 1.460 0.030 groupby.py:1014(_cumcount_array)
I think I could split the big dataframe into smaller dataframes by key_id to improve the sort time, since I only want the top 20 dates with the best value for each key_id, so sorting by key is just there to separate the different keys.
But I would appreciate any advice on how to improve the efficiency of this code, as I will need to run params_to_score thousands of times.
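A sketch of that split-by-key idea (my own illustration, assuming a pandas version that provides DataFrame.nsmallest): pre-split the frame once, then score each key's sub-frame with a partial sort instead of a full sort.

# pre-split once; the groupby cost is paid a single time
groups = {key: sub for key, sub in my_dataset.groupby('key_id')}

def params_to_score_split(groups, top=20, val1=0, val2=0, val3=0, val4=0):
    total = 0
    for key, sub in groups.items():
        sub = compute_value_col(sub, val1, val2, val3, val4)
        # partial sort: the 'top' rows with the smallest value, then sum their pop
        total += sub.nsmallest(top, 'value')['pop'].sum()
    return total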
EDIT: @Jeff
Thanks a lot for your help!
I tried using nsmallest instead of sort & head, but strangely it is 5-6 times slower when I benchmark the two following functions:
def to_bench1(dataset):
    dataset = dataset.sort(['key_id', 'value'], ascending=True)
    dataset = dataset.groupby('key_id').head(50).reset_index(drop=True)
    return dataset['pop'].sum()

def to_bench2(dataset):
    dataset = dataset.set_index('pop')
    dataset = dataset.groupby(['key_id'])['value'].nsmallest(50).reset_index()
    return dataset['pop'].sum()
On a sample of ~100,000 rows, to_bench2 runs in 0.5 seconds, while to_bench1 takes only 0.085 seconds on average.
After profiling to_bench2, I notice many more isinstance calls than before, but I do not know where they come from...
The way to make this significantly faster is as follows.
Create some sample data:
In [148]: df = DataFrame({'A' : range(5), 'B' : [1,1,1,2,2] })
Define a function like your compute_value_col:
In [149]: def f(p):
   .....:     return DataFrame({ 'A' : df['A']*p, 'B' : df.B })
   .....:
These are the cases (you probably want a list of tuples here), i.e. the cartesian product of all of the parameter values that you want to feed into the above function:
In [150]: parms = [1,3]
Create a new data frame that has the full set of values, keyed by each of the parms. This is basically a broadcasting operation:
In [151]: df2 = pd.concat([ f(p) for p in parms ],keys=parms,names=['parm','indexer']).reset_index()
In [155]: df2
Out[155]:
parm indexer A B
0 1 0 0 1
1 1 1 1 1
2 1 2 2 1
3 1 3 3 2
4 1 4 4 2
5 3 0 0 1
6 3 1 3 1
7 3 2 6 1
8 3 3 9 2
9 3 4 12 2
Here's the magic. Group by whatever columns you want, including parm as the first one (or possibly several). Then do a partial sort (this is what nlargest does); this is more efficient than sort & head (though it depends a bit on the group density). Sum at the end (again by the groupers we care about, since you are doing a 'partial' reduction).
In [153]: df2.groupby(['parm','B']).A.nlargest(2).sum(level=['parm','B'])
Out[153]:
parm B
1 1 3
2 7
3 1 9
2 21
dtype: int64
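To map that pattern back onto the question's columns, here is a hedged sketch (my own reconstruction, reusing compute_value_col and my_dataset from the question; the grid is kept small because the broadcast multiplies the row count by the number of parameter tuples, so a full 10x10x10x10 grid would need to be processed in chunks):

from itertools import product
import pandas as pd

param_grid = list(product(range(0, 30, 10), repeat=4))   # small illustrative grid of (val1..val4) tuples

def scores_for_all_params(dataset, top=20):
    frames = []
    for n, params in enumerate(param_grid):
        d = compute_value_col(dataset.copy(), *params)[['key_id', 'value', 'pop']].copy()
        d['parm'] = n                                    # integer id of this parameter tuple
        frames.append(d)
    df2 = pd.concat(frames, ignore_index=True)

    # partial sort per (parm, key_id): keep the 'top' smallest values,
    # carrying 'pop' on the index so it survives nsmallest
    top_rows = (df2.set_index('pop')
                    .groupby(['parm', 'key_id'])['value']
                    .nsmallest(top)
                    .reset_index())

    # popularity score per parameter tuple; the best tuple is param_grid[scores.idxmax()]
    return top_rows.groupby('parm')['pop'].sum()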
