Improve performance when processing a big pandas DataFrame - python

I have a big pandas dataframe (1 million rows), and I need better performance in my code to process this data.
My code is below, and a profiling analysis is also provided.
Header of the dataset:
key_id, date, par1, par2, par3, par4, pop, price, value
For each key_id, there is one row for each of the 5,000 possible dates, so there are 200 key_id × 5,000 dates = 1,000,000 rows.
Using four variables val1, ..., val4, I compute a value for each row; I want to extract the top 20 dates with the best value for each key_id, and then compute the popularity of the set of variables used.
In the end, I want to find the variables that optimize this popularity.
from itertools import product

def compute_value_col(dataset, val1=0, val2=0, val3=0, val4=0):
    dataset['value'] = dataset['price'] + val1 * dataset['par1'] \
        + val2 * dataset['par2'] + val3 * dataset['par3'] \
        + val4 * dataset['par4']
    return dataset

def params_to_score(dataset, top=10, val1=0, val2=0, val3=0, val4=0):
    dataset = compute_value_col(dataset, val1, val2, val3, val4)
    dataset = dataset.sort(['key_id', 'value'], ascending=True)
    dataset = dataset.groupby('key_id').head(top).reset_index(drop=True)
    return dataset['pop'].sum()

def optimize(dataset, top):
    for i, j, k, l in product(xrange(10), xrange(10), xrange(10), xrange(10)):
        print i, j, k, l, params_to_score(dataset, top, 10*i, 10*j, 10*k, 10*l)

optimize(my_dataset, 20)
I need to improve performance.
Here is the %prun output after running params_to_score 49 times:
ncalls tottime percall cumtime percall filename:lineno(function)
98 2.148 0.022 2.148 0.022 {pandas.algos.take_2d_axis1_object_object}
49 1.663 0.034 9.852 0.201 <ipython-input-59-88fc8127a27f>:150(params_to_score)
49 1.311 0.027 1.311 0.027 {method 'get_labels' of 'pandas.hashtable.Float64HashTable' objects}
49 1.219 0.025 1.223 0.025 {pandas.algos.groupby_indices}
49 0.875 0.018 0.875 0.018 {method 'get_labels' of 'pandas.hashtable.PyObjectHashTable' objects}
147 0.452 0.003 0.457 0.003 index.py:581(is_unique)
343 0.193 0.001 0.193 0.001 {method 'copy' of 'numpy.ndarray' objects}
1 0.136 0.136 10.058 10.058 <ipython-input-59-88fc8127a27f>:159(optimize)
147 0.122 0.001 0.122 0.001 {method 'argsort' of 'numpy.ndarray' objects}
833 0.112 0.000 0.112 0.000 {numpy.core.multiarray.empty}
49 0.109 0.002 0.109 0.002 {method 'get_labels_groupby' of 'pandas.hashtable.Int64HashTable' objects}
98 0.083 0.001 0.083 0.001 {pandas.algos.take_2d_axis1_float64_float64}
49 0.078 0.002 1.460 0.030 groupby.py:1014(_cumcount_array)
I think I could split the big DataFrame into small DataFrames by key_id to improve the sort time: since I only want the top 20 dates with the best value for each key_id, sorting by key is just a way to separate the different keys.
But I would appreciate any advice on how to improve the efficiency of this code, as I will need to run params_to_score thousands of times.
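Something like the following sketch is what I have in mind (params_to_score_fast, the positions dict, and the use of numpy.argpartition for a partial top-k are illustrative and untested, not part of my current code): the per-key row positions are computed once, and each call only recomputes the value column.

import numpy as np

# Precompute the per-key row positions once; the grouping never changes
# between parameter combinations, only the 'value' column does.
positions = my_dataset.groupby('key_id').indices   # dict: key_id -> row positions

price = my_dataset['price'].values
pars = my_dataset[['par1', 'par2', 'par3', 'par4']].values
pop = my_dataset['pop'].values

def params_to_score_fast(top, val1, val2, val3, val4):
    # Recompute only the value column, as a flat numpy array.
    value = price + np.dot(pars, [val1, val2, val3, val4])
    total = 0.0
    for rows in positions.values():
        vals = value[rows]
        # argpartition finds the 'top' smallest values without a full sort.
        best = rows[np.argpartition(vals, top)[:top]]
        total += pop[best].sum()
    return total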
EDIT: @Jeff
Thanks a lot for your help!
I tried using nsmallest instead of sort & head, but strangely it is 5-6 times slower when I benchmark the following two functions:
def to_bench1(dataset):
    dataset = dataset.sort(['key_id', 'value'], ascending=True)
    dataset = dataset.groupby('key_id').head(50).reset_index(drop=True)
    return dataset['pop'].sum()

def to_bench2(dataset):
    dataset = dataset.set_index('pop')
    dataset = dataset.groupby(['key_id'])['value'].nsmallest(50).reset_index()
    return dataset['pop'].sum()
On a sample of ~100,000 rows, to_bench2 runs in about 0.5 seconds, while to_bench1 takes only 0.085 seconds on average.
After profiling to_bench2, I notice many more isinstance calls than before, but I do not know where they come from...

The way to make this significantly faster is as follows.
Create some sample data
In [148]: df = DataFrame({'A' : range(5), 'B' : [1,1,1,2,2] })
Define the compute_value_col function like you have:
In [149]: def f(p):
   .....:     return DataFrame({ 'A' : df['A']*p, 'B' : df.B })
   .....:
These are the cases (you probably want a list of tuples), i.e. the Cartesian product of all of the cases that you want to feed into the above function:
In [150]: parms = [1,3]
Create a new DataFrame that has the full set of values, keyed by each of the parms. This is basically a broadcasting operation.
In [151]: df2 = pd.concat([ f(p) for p in parms ],keys=parms,names=['parm','indexer']).reset_index()
In [155]: df2
Out[155]:
parm indexer A B
0 1 0 0 1
1 1 1 1 1
2 1 2 2 1
3 1 3 3 2
4 1 4 4 2
5 3 0 0 1
6 3 1 3 1
7 3 2 6 1
8 3 3 9 2
9 3 4 12 2
Here's the magic. Group by whatever columns you want, including parm as the first one (or possibly multiple ones). Then do a partial sort (this is what nlargest does); this is more efficient than sort & head (though it depends on the group density a bit). Sum at the end (again by the groupers that we care about, as you are doing a 'partial' reduction).
In [153]: df2.groupby(['parm','B']).A.nlargest(2).sum(level=['parm','B'])
Out[153]:
parm B
1 1 3
2 7
3 1 9
2 21
dtype: int64
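Mapped onto the original problem, a sketch might look like the following (my_dataset and the column names come from the question; the tiny parms list is an assumption — concatenating the full 10,000-combination grid over 1M rows at once would be far too large, so in practice you would chunk the grid):

import pandas as pd

# Illustrative parameter tuples only; chunk the real 10**4 grid.
parms = [(0, 0, 0, 0), (10, 0, 0, 0), (0, 10, 20, 30)]

def value_frame(vals):
    v1, v2, v3, v4 = vals
    return pd.DataFrame({
        'key_id': my_dataset['key_id'],
        'pop': my_dataset['pop'],
        'value': (my_dataset['price'] + v1 * my_dataset['par1']
                  + v2 * my_dataset['par2'] + v3 * my_dataset['par3']
                  + v4 * my_dataset['par4']),
    })

df2 = pd.concat([value_frame(p) for p in parms],
                keys=parms,
                names=['v1', 'v2', 'v3', 'v4', 'indexer']).reset_index()

# Top 20 smallest values per (parameter combination, key_id), keeping 'pop'
# in the index so it can be summed afterwards (as in to_bench2 above).
top = (df2.set_index('pop')
          .groupby(['v1', 'v2', 'v3', 'v4', 'key_id'])['value']
          .nsmallest(20))

# One popularity score per parameter combination.
scores = top.reset_index().groupby(['v1', 'v2', 'v3', 'v4'])['pop'].sum()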


Printing out effect sizes from an anova_lm model

I have the following code. I am running a linear model on the dataframe 'x', with gender and highest education level achieved as categorical variables.
The aim is to assess how well age, gender and highest level of education achieved can predict 'weighteddistance'.
resultmodeldistancevariation2sleep = smf.ols(formula='weighteddistance ~ age + C(gender) + C(highest_education_level_acheived)',data=x).fit()
summarymodel = resultmodeldistancevariation2sleep.summary()
print(summarymodel)
This gives me the output:
0 1 2 3 4 5 6
0 coef std err t P>|t| [0.025 0.975]
1 Intercept 6.3693 1.391 4.580 0.000 3.638 9.100
2 C(gender)[T.2.0] 0.2301 0.155 1.489 0.137 -0.073 0.534
3 C(gender)[T.3.0] 0.0302 0.429 0.070 0.944 -0.812 0.872
4 C(highest_education_level_acheived)[T.3] 1.1292 0.501 2.252 0.025 0.145 2.114
5 C(highest_education_level_acheived)[T.4] 1.0876 0.513 2.118 0.035 0.079 2.096
6 C(highest_education_level_acheived)[T.5] 1.0692 0.498 2.145 0.032 0.090 2.048
7 C(highest_education_level_acheived)[T.6] 1.2995 0.525 2.476 0.014 0.269 2.330
8 C(highest_education_level_acheived)[T.7] 1.7391 0.605 2.873 0.004 0.550 2.928
However, I want to calculate the main effect of each categorical variable on distance, which is not shown in the model above, so I passed the model fit to an anova using 'anova_lm'.
anovaoutput = sm.stats.anova_lm(resultmodeldistancevariation2sleep)
anovaoutput['PR(>F)'] = anovaoutput['PR(>F)'].round(4)
This gives me the following output, and, as I wanted, it shows the main effect of each categorical variable - gender and highest education level achieved - rather than the different groups within that variable (meaning there is no gender[2.0] or gender[3.0] in the output below).
df sum_sq mean_sq F PR(>F)
C(gender) 2.0 4.227966 2.113983 5.681874 0.0036
C(highest_education_level_acheived) 5.0 11.425706 2.285141 6.141906 0.0000
age 1.0 8.274317 8.274317 22.239357 0.0000
Residual 647.0 240.721120 0.372057 NaN NaN
However, this output no longer shows me the confidence intervals or the coefficients for each variable.
In other words, I would like the bottom anova table to have a 'coef' column and a '[0.025 0.975]' column like the first table.
How can I achieve this?
I would be so grateful for a helping hand!
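For reference, a minimal sketch of how the per-level coefficients and confidence intervals can be pulled from the fitted results via params and conf_int() and viewed next to the anova table (the variable name is reused from the question; note this still gives one row per level of each categorical variable, not a single main-effect coefficient, so it does not fully answer the question):

import pandas as pd

# Assumes resultmodeldistancevariation2sleep is the fitted OLS results from above.
res = resultmodeldistancevariation2sleep

coef_table = pd.concat(
    [res.params.rename('coef'),
     res.conf_int().rename(columns={0: '[0.025', 1: '0.975]'})],
    axis=1,
)
print(coef_table)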

New pandas version: how to groupby all columns with different aggregation statistics

I have a df that looks like this:
time volts1 volts2
0 0.000 -0.299072 0.427551
2 0.001 -0.299377 0.427551
4 0.002 -0.298767 0.427551
6 0.003 -0.298767 0.422974
8 0.004 -0.298767 0.422058
10 0.005 -0.298462 0.422363
12 0.006 -0.298767 0.422668
14 0.007 -0.298462 0.422363
16 0.008 -0.301208 0.420227
18 0.009 -0.303345 0.418091
In actuality, the df has >50 columns, but for simplicity, I'm just showing 3.
I want to group this df every n rows, let's say 5. I want to aggregate time with max and the rest of the columns by mean. Because there are so many columns, I'd love to be able to loop this and not have to do it manually.
I know I can do something like this where I go through and create all new columns manually:
df.groupby(df.index // 5).agg(time=('time', 'max'),
                              volts1=('volts1', 'mean'),
                              volts2=('volts2', 'mean'),
                              ...
                              )
but because there are so many columns, I want to do this in a loop, something like:
df.groupby(df.index // 5).agg(time=('time', 'max'),
                              # df.time is always the first column
                              [i for i in df.columns[1:]]=(i, 'mean'),
                              )
If useful:
print(pd.__version__)
1.0.5
You can use a dictionary:
d = {col: "mean" if not col=='time' else "max" for col in df.columns}
#{'time': 'max', 'volts1': 'mean', 'volts2': 'mean'}
df.groupby(df.index // 5).agg(d)
time volts1 volts2
0 0.002 -0.299072 0.427551
1 0.004 -0.298767 0.422516
2 0.007 -0.298564 0.422465
3 0.009 -0.302276 0.419159
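If you prefer the named-aggregation style from the question, a hedged variant (assuming pandas >= 0.25, where agg accepts keyword arguments for named aggregation) is to unpack a dict comprehension alongside the explicit time rule:

# Mean for every column except 'time', unpacked as keyword arguments.
out = df.groupby(df.index // 5).agg(
    time=('time', 'max'),
    **{col: (col, 'mean') for col in df.columns[1:]},
)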

I am trying to interpolate a value from a pandas dataframe using numpy.interp but it keeps returning a wrong interpolation

import pandas as pd
import numpy as np
# defining a function for interpolation
def interpolate(x, df, xcol, ycol):
    return np.interp([x], df[xcol], df[ycol])
# function call
print(interpolate(0.4, freq_data, 'Percent_cum_freq', 'cum_OGIP'))
Trying a more direct method:
print(np.interp(0.4, freq_data['Percent_cum_freq'], freq_data['cum_OGIP']))
Output:
from function [2.37197912e+10]
from direct 23719791158.266743
For any value of x that I pass (0.4, 0.6, or 0.9), it gives the same result: 2.37197912e+10.
freq_data dataframe
Percent_cum_freq cum_OGIP
0 0.999 4.455539e+07
1 0.981 1.371507e+08
2 0.913 2.777860e+08
3 0.824 4.664612e+08
4 0.720 7.031764e+08
5 0.615 9.879315e+08
6 0.547 1.320727e+09
7 0.464 1.701562e+09
8 0.396 2.130436e+09
9 0.329 2.607351e+09
10 0.285 3.132306e+09
11 0.245 3.705301e+09
12 0.199 4.326336e+09
13 0.167 4.995410e+09
14 0.136 5.712525e+09
15 0.115 6.477680e+09
16 0.085 7.290874e+09
17 0.072 8.152108e+09
18 0.056 9.061383e+09
19 0.042 1.001870e+10
20 0.034 1.102405e+10
21 0.027 1.207745e+10
22 0.022 1.317888e+10
23 0.015 1.432835e+10
24 0.013 1.552587e+10
25 0.010 1.677142e+10
26 0.007 1.806502e+10
27 0.002 1.940665e+10
28 0.002 2.079632e+10
29 0.002 2.223404e+10
30 0.001 2.371979e+10
What is wrong? How can I solve the problem?
Well, I was also surprised by the results when I ran the code you provided. After a little search in the documentation for np.interp, I found that the x-coordinates must always be increasing.
np.interp(x,list_of_x_coordinates,list_of_y_coordinates)
Where x is the value you want the value of y at.
list_of_x_coordinates is df[xcol] -> this must always be increasing. Since your dataframe's column is decreasing, it will never give the correct result.
list_of_y_coordinates is df[ycol] -> this must have the same length as df[xcol] and be in the corresponding order.
My reproduced code:
import numpy as np
list_1=np.interp([0.1,0.5,0.8],[0.999,0.547,0.199,0.056,0.013,0.001],[4.455539e+07,1.320727e+09,4.326336e+09,9.061383e+09,1.552587e+10, 2.371979e+10])
list_2=np.interp([0.1,0.5,0.8],[0.001,0.013,0.056,0.199,0.547,0.999],[2.371979e+10,1.552587e+10,9.061383e+09,4.326336e+09,1.320727e+09,4.455539e+07])
print("In decreasing order -> As in your case",list_1)
print("In increasing order of x xoordinates",list_2)
Output:
In decreasing order -> As in your case [2.371979e+10 2.371979e+10 2.371979e+10]
In increasing order of x coordinates [7.60444546e+09 1.72665695e+09 6.06409705e+08]
As you can now understand, you have to sort df[xcol] and reorder df[ycol] accordingly.
By default, np.interp needs the x values to be sorted. If you do not want to sort your dataframe, a workaround is to set the period argument to np.inf:
print(np.interp(0.4, freq_data['Percent_cum_freq'], freq_data['cum_OGIP'], period=np.inf))
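For comparison, a small sketch of the sorting approach from the first answer: sort the DataFrame once by the x-column so np.interp sees increasing x-coordinates, with the y-column reordered to match.

import numpy as np

freq_sorted = freq_data.sort_values('Percent_cum_freq')

# Now different x values give different interpolated results.
print(np.interp(0.4, freq_sorted['Percent_cum_freq'], freq_sorted['cum_OGIP']))
print(np.interp(0.6, freq_sorted['Percent_cum_freq'], freq_sorted['cum_OGIP']))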

Obtaining the last value that equals or is nearest to a target in a DataFrame column

I have an issue in my code; I am computing cut points.
First, this is my DataFrame column:
In [23]: df['bad%']
0 0.025
1 0.007
2 0.006
3 0.006
4 0.006
5 0.006
6 0.007
7 0.007
8 0.007
9 0.006
10 0.006
11 0.009
12 0.009
13 0.009
14 0.008
15 0.008
16 0.008
17 0.012
18 0.012
19 0.05
20 0.05
21 0.05
22 0.05
23 0.05
24 0.05
25 0.05
26 0.05
27 0.062
28 0.062
29 0.061
...
5143 0.166
5144 0.166
5145 0.166
5146 0.167
5147 0.167
5148 0.167
5149 0.167
5150 0.167
5151 0.05
5152 0.167
5153 0.167
5154 0.167
5155 0.167
5156 0.051
5157 0.052
5158 0.161
5159 0.149
5160 0.168
5161 0.168
5162 0.168
5163 0.168
5164 0.168
5165 0.168
5166 0.168
5167 0.168
5168 0.049
5169 0.168
5170 0.168
5171 0.168
5172 0.168
Name: bad%, Length: 5173, dtype: float64
I used this code to detect the value equal or nearest to 0.05 (a value entered on the console):
error = 100    # margin of error
valuesA = []   # array to save data
pointCut = 0   # identify cut point

for index, row in df.iterrows():
    if abs(row['bad%'] - a) <= error:
        valuesA = row
        error = abs(row['bad%'] - a)
        # Variable "a" is entered on the console, in this case 0.05
        pointCut = index
This code returns the value 0.05 at index 5151, which at first glance looks good, because the 0.05 at index 5151 is the last 0.05.
Out [27]:
5151 0.05
But my objective is to obtain THE LAST VALUE IN THE COLUMN equal or nearest to 0.05; in this case that value is 0.049 at index 5168, and I need to obtain it.
Is there an algorithm that permits this? Any solution or recommendation?
Thanks in advance.
Solutions if at least one value exists:
Use [::-1] to reverse the values and get idxmax for the last matched index value:
a = 0.05
s = df['bad%']
b = s[[(s[::-1] <= a).idxmax()]]
print (b)
5168 0.049
Or:
b = s[(s <= a)].iloc[[-1]]
print (b)
5168 0.049
Name: bad%, dtype: float64
A solution that also works if the value does not exist - then an empty Series is returned:
a = 0.05
s = df['bad%']
m1 = (s <= a)
m2 = m1[::-1].cumsum().eq(1)
b = s[m1 & m2]
print (b)
5168 0.049
Name: bad%, dtype: float64
Sample data:
df = pd.DataFrame({'bad%': {5146: 0.16699999999999998, 5147: 0.16699999999999998, 5148: 0.16699999999999998, 5149: 0.049, 5150: 0.16699999999999998, 5151: 0.05, 5152: 0.16699999999999998, 5167: 0.168, 5168: 0.049, 5169: 0.168}})
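For completeness, a plain-numpy sketch of the same "last value at or below the target" logic used above (the mask/argmax helpers are illustrative, not taken from the pandas solutions):

import numpy as np

a = 0.05
s = df['bad%']

# Positions where the value is at or below the target; take the last one.
mask = (s <= a).values
if mask.any():
    last_pos = len(mask) - 1 - np.argmax(mask[::-1])
    print(s.index[last_pos], s.iloc[last_pos])   # 5168 0.049 on the sample data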

Calculating subtractions of pairs of columns in pandas DataFrame

I work with significantly sized DataFrames (48K rows, up to tens of columns). At a certain point in their manipulation, I need to do pair-wise subtractions of column values, and I was wondering if there is a more efficient way to do so than the one I'm using (see below).
My current code:
# Matrix is the pandas DataFrame containing all the data
comparison_df = pandas.DataFrame(index=matrix.index)
combinations = itertools.product(group1, group2)
for observed, reference in combinations:
    observed_data = matrix[observed]
    reference_data = matrix[reference]
    comparison = observed_data - reference_data
    name = observed + "_" + reference
    comparison_df[name] = comparison
Since the data can be large (I'm using this piece of code also during a permutation test), I'm interested in knowing if it can be optimized a bit.
EDIT: As requested, here's a sample of a typical data set
ID A1 A2 A3 B1 B2 B3
Ku8QhfS0n_hIOABXuE 6.343 6.304 6.410 6.287 6.403 6.279
fqPEquJRRlSVSfL.8A 6.752 6.681 6.680 6.677 6.525 6.739
ckiehnugOno9d7vf1Q 6.297 6.248 6.524 6.382 6.316 6.453
x57Vw5B5Fbt5JUnQkI 6.268 6.451 6.379 6.371 6.458 6.333
A typical result, if the "A" columns are group1 and the "B" columns are group2, would have one column per pairing generated above (e.g., A1_B1, A2_B1, A3_B1, ...), containing the subtraction for each ID row.
Using itertools.combinations() on DataFrame columns
You can create combinations of columns with itertools.combinations() and evaluate subtractions along with new names based on these pairs:
import pandas as pd
from cStringIO import StringIO
import itertools as iter
matrix = pd.read_csv(StringIO('''ID,A1,A2,A3,B1,B2,B3
Ku8QhfS0n_hIOABXuE,6.343,6.304,6.410,6.287,6.403,6.279
fqPEquJRRlSVSfL.8A,6.752,6.681,6.680,6.677,6.525,6.739
ckiehnugOno9d7vf1Q,6.297,6.248,6.524,6.382,6.316,6.453
x57Vw5B5Fbt5JUnQkI,6.268,6.451,6.379,6.371,6.458,6.333''')).set_index('ID')
print 'Original DataFrame:'
print matrix
print
# Create DataFrame to fill with combinations
comparison_df = pd.DataFrame(index=matrix.index)
# Create combinations of columns
for a, b in iter.combinations(matrix.columns, 2):
    # Subtract column combinations
    comparison_df['{}_{}'.format(a, b)] = matrix[a] - matrix[b]
print 'Combination DataFrame:'
print comparison_df
Original DataFrame:
A1 A2 A3 B1 B2 B3
ID
Ku8QhfS0n_hIOABXuE 6.343 6.304 6.410 6.287 6.403 6.279
fqPEquJRRlSVSfL.8A 6.752 6.681 6.680 6.677 6.525 6.739
ckiehnugOno9d7vf1Q 6.297 6.248 6.524 6.382 6.316 6.453
x57Vw5B5Fbt5JUnQkI 6.268 6.451 6.379 6.371 6.458 6.333
Combination DataFrame:
A1_A2 A1_A3 A1_B1 A1_B2 A1_B3 A2_A3 A2_B1 A2_B2 A2_B3 A3_B1 A3_B2 A3_B3 B1_B2 B1_B3 B2_B3
ID
Ku8QhfS0n_hIOABXuE 0.039 -0.067 0.056 -0.060 0.064 -0.106 0.017 -0.099 0.025 0.123 0.007 0.131 -0.116 0.008 0.124
fqPEquJRRlSVSfL.8A 0.071 0.072 0.075 0.227 0.013 0.001 0.004 0.156 -0.058 0.003 0.155 -0.059 0.152 -0.062 -0.214
ckiehnugOno9d7vf1Q 0.049 -0.227 -0.085 -0.019 -0.156 -0.276 -0.134 -0.068 -0.205 0.142 0.208 0.071 0.066 -0.071 -0.137
x57Vw5B5Fbt5JUnQkI -0.183 -0.111 -0.103 -0.190 -0.065 0.072 0.080 -0.007 0.118 0.008 -0.079 0.046 -0.087 0.038 0.125
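If only the group1 × group2 pairings from the question are needed (rather than all column combinations), a hedged numpy-broadcasting sketch (the group1/group2 lists are taken from the question's sample data):

import numpy as np
import pandas as pd

group1 = ['A1', 'A2', 'A3']
group2 = ['B1', 'B2', 'B3']

obs = matrix[group1].values   # shape (n_rows, len(group1))
ref = matrix[group2].values   # shape (n_rows, len(group2))

# Broadcast: diff[i, j, k] = obs[i, j] - ref[i, k], computed in one shot.
diff = obs[:, :, None] - ref[:, None, :]

# Column order matches the nested comprehension: A1_B1, A1_B2, ..., A3_B3.
cols = ['{}_{}'.format(a, b) for a in group1 for b in group2]
comparison_df = pd.DataFrame(diff.reshape(len(matrix), -1),
                             index=matrix.index, columns=cols)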
