How to sum values of a row of a pandas dataframe efficiently - python

I have a pandas DataFrame with 1.5 million rows and 8 columns. I want to combine a few columns and create a new column. I know how to do this, but I wanted to know which approach is faster and more efficient. I am reproducing my code here:
import pandas as pd
import numpy as np
df=pd.DataFrame(columns=['A','B','C'],data=[[1,2,3],[4,5,6],[7,8,9]])
Now, here is what I want to achieve:
df['D']=0.5*df['A']+0.3*df['B']+0.2*df['C']
The alternative is to use the apply functionality of pandas:
df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
I wanted to know which method takes less time when there are 1.5 million rows and 8 columns to combine.

The first method is faster, because it is vectorized:
df=pd.DataFrame(columns=['A','B','C'],data=[[1,2,3],[4,5,6],[7,8,9]])
df = pd.concat([df]*10000).reset_index(drop=True)
print (df)
#[30000 rows x 3 columns]
df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
#similar timings with mul function
#df['D1']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
print (df)
In [54]: %timeit df['D2']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
The slowest run took 10.84 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 950 µs per loop
In [55]: %timeit df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
The slowest run took 4.76 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.2 ms per loop
In [56]: %timeit df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
1 loop, best of 3: 928 ms per loop
Another test, on a 1.5M-row DataFrame, shows that the apply method is very slow:
df = pd.concat([df]*500000).reset_index(drop=True)
#[1500000 rows x 6 columns]
In [62]: %timeit df['D2']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
10 loops, best of 3: 34.8 ms per loop
In [63]: %timeit df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
10 loops, best of 3: 31.5 ms per loop
In [64]: %timeit df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
1 loop, best of 3: 47.3 s per loop
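For the 8-column case from the question, the same vectorized pattern scales directly. A minimal sketch (the column names and weights below are made up for illustration, they are not from the question):
import numpy as np
import pandas as pd
# hypothetical column names and weights for the 8-column case
cols = ['A','B','C','D','E','F','G','H']
weights = [0.2, 0.2, 0.15, 0.15, 0.1, 0.1, 0.05, 0.05]
df8 = pd.DataFrame(np.random.rand(1500000, 8), columns=cols)
# weighted sum of the columns, fully vectorized (no apply)
df8['total'] = df8[cols].mul(weights).sum(axis=1)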

Using @jezrael's setup:
df=pd.DataFrame(columns=['A','B','C'],data=[[1,2,3],[4,5,6],[7,8,9]])
df = pd.concat([df]*30000).reset_index(drop=True)
It is far more efficient to use a dot product:
np.array([[.5, .3, .2]]).dot(df.values.T).T
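To attach the result as a new column, the same dot product can be written against a selected set of columns; a small sketch (not from the original answer):
weights = np.array([0.5, 0.3, 0.2])
# row-wise weighted sum via a single matrix-vector product
df['D'] = df[['A','B','C']].values.dot(weights)
# equivalently, on Python 3.5+: df['D'] = df[['A','B','C']].values @ weights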

Related

Averaging indices of table using pandas and numpy

I have been playing with pandas for a few hours now, and I was wondering whether there is a faster way to add an extra column that contains the average of each row. Currently I am creating a list of the row means and then incorporating it into the data frame.
This is my code:
import numpy as np
import pandas as pd
userdata={"A":[2,5],"B":[4,6]}
tab=pd.DataFrame((userdata), columns=["A","B"])
lst=[np.mean([tab.loc[i,"A"],tab.loc[i,"B"]]) for i in range(len(tab.index))]
tab["Average of A and B"]=pd.DataFrame(lst)
tab
Try df.mean(1) with assign. df.mean(1) tells pandas to calculate the mean along axis=1 (rows); axis=0 is the default.
df.assign(Mean=df.mean(1))
This produces a copy of df with the added column.
To alter the existing dataframe:
df['Mean'] = df.mean(1)
Demo:
tab.assign(Mean=tab.mean(1))
A B Mean
0 2 4 3.0
1 5 6 5.5
A NumPy solution would be to work with the underlying array data for performance -
tab['average'] = tab.values.mean(1)
To choose specific columns, like 'A' and 'B' -
tab['average'] = tab[['A','B']].values.mean(1)
Runtime test -
In [41]: tab = pd.DataFrame(np.random.randint(0,9,(10000,10)))
# @piRSquared's soln
In [42]: %timeit tab.assign(Mean=tab.mean(1))
1000 loops, best of 3: 615 µs per loop
In [43]: tab = pd.DataFrame(np.random.randint(0,9,(10000,10)))
In [44]: %timeit tab['average'] = tab.values.mean(1)
1000 loops, best of 3: 297 µs per loop
In [37]: tab = pd.DataFrame(np.random.randint(0,9,(10000,100)))
# @piRSquared's soln
In [38]: %timeit tab.assign(Mean=tab.mean(1))
100 loops, best of 3: 4.71 ms per loop
In [39]: tab = pd.DataFrame(np.random.randint(0,9,(10000,100)))
In [40]: %timeit tab['average'] = tab.values.mean(1)
100 loops, best of 3: 3.6 ms per loop
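Staying entirely in pandas, the per-row mean over selected columns can also be assigned directly; a small sketch (not part of the answers above):
# column-wise mean computed by pandas itself, no .values needed
tab['average'] = tab[['A','B']].mean(axis=1)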

Is there a more efficient and elegant way to filter pandas index by date?

I often use DatetimeIndex.date, especially in groupby methods. However, DatetimeIndex.date is slow when compared to DatetimeIndex.year/month/day. From what I understand, it is because the .date attribute works with a lambda function over the index and returns a datetime ordered index, while index.year/month/day just return arrays of integers. I have made a small example function that performs a bit better and would speed up some of my code (at least for finding the values in a groupby), but I feel that there must be a better way:
In [217]: index = pd.date_range('2011-01-01', periods=100000, freq='h')
In [218]: data = np.random.rand(len(index))
In [219]: df = pd.DataFrame({'data':data},index)
In [220]: def func(df):
...: groupby = df.groupby([df.index.year, df.index.month, df.index.day]).mean()
...: index = pd.date_range(df.index[0], periods = len(groupby), freq='D')
...: groupby.index = index
...: return groupby
...:
In [221]: df.groupby(df.index.date).mean().equals(func(df))
Out[221]: True
In [222]: df.groupby(df.index.date).mean().index.equals(func(df).index)
Out[222]: True
In [223]: %timeit df.groupby(df.index.date).mean()
1 loop, best of 3: 1.32 s per loop
In [224]: %timeit func(df)
10 loops, best of 3: 89.2 ms per loop
Does the pandas/index have a similar functionality that I am not finding?
You can even improve it a little bit:
In [69]: %timeit func(df)
10 loops, best of 3: 84.3 ms per loop
In [70]: %timeit df.groupby(pd.TimeGrouper('1D')).mean()
100 loops, best of 3: 6 ms per loop
In [84]: %timeit df.groupby(pd.Grouper(level=0, freq='1D')).mean()
100 loops, best of 3: 6.48 ms per loop
In [71]: (func(df) == df.groupby(pd.TimeGrouper('1D')).mean()).all()
Out[71]:
data True
dtype: bool
Another solution: using the DataFrame.resample() method:
In [73]: (df.resample('1D').mean() == func(df)).all()
Out[73]:
data True
dtype: bool
In [74]: %timeit df.resample('1D').mean()
100 loops, best of 3: 6.63 ms per loop
UPDATE: grouping by the date formatted as a string is even slower:
In [75]: %timeit df.groupby(df.index.strftime('%Y%m%d')).mean()
1 loop, best of 3: 2.6 s per loop
In [76]: %timeit df.groupby(df.index.date).mean()
1 loop, best of 3: 1.07 s per loop
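If a proper DatetimeIndex is wanted on the result without resampling, another alternative (not from the original answer) is to floor the index to day precision and group on that; DatetimeIndex.floor has been available since pandas 0.18:
# group on the index floored to midnight; the result keeps a DatetimeIndex
daily = df.groupby(df.index.floor('D')).mean()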

Conditional Iteration over a dataframe

I have a dataframe df which looks like:
id location grain
0 BBG.XETR.AD.S XETR 16.545
1 BBG.XLON.VB.S XLON 6.2154
2 BBG.XLON.HF.S XLON NaN
3 BBG.XLON.RE.S XLON NaN
4 BBG.XLON.LL.S XLON NaN
5 BBG.XLON.AN.S XLON 3.215
6 BBG.XLON.TR.S XLON NaN
7 BBG.XLON.VO.S XLON NaN
In reality this dataframe will be much larger. I would like to iterate over this dataframe, returning the 'grain' value, but I am only interested in the rows that have a value (not NaN) in the 'grain' column. So, as I iterate over the dataframe, I would only return the following values:
16.545
6.2154
3.215
I can iterate over the dataframe using:
for staticidx, row in df.iterrows():
    value = row['grain']
But this returns a value for all rows, including those with a NaN value. Is there a way to either remove the NaN rows from the dataframe or skip the rows where 'grain' is NaN?
Many thanks
You can specify a list of columns in dropna on which to subset the data:
subset : array-like
Labels along other axis to consider, e.g. if you are dropping rows
these would be a list of columns to include
>>> df.dropna(subset=['grain'])
id location grain
0 BBG.XETR.AD.S XETR 16.5450
1 BBG.XLON.VB.S XLON 6.2154
5 BBG.XLON.AN.S XLON 3.2150
This:
df[pd.notnull(df['grain'])]
Or this:
df['grain'].dropna()
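Combining either filter with the iteration from the question, a minimal sketch:
# iterate only over rows whose 'grain' value is not NaN
for staticidx, row in df.dropna(subset=['grain']).iterrows():
    value = row['grain']  # never NaN here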
Let's compare the different methods (on an 800K-row DataFrame):
In [21]: df = pd.concat([df] * 10**5, ignore_index=True)
In [22]: df.shape
Out[22]: (800000, 3)
In [23]: %timeit df.grain[~pd.isnull(df.grain)]
The slowest run took 5.33 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 17.1 ms per loop
In [24]: %timeit df.ix[df.grain.notnull(), 'grain']
10 loops, best of 3: 23.9 ms per loop
In [25]: %timeit df[pd.notnull(df['grain'])]
10 loops, best of 3: 35.9 ms per loop
In [26]: %timeit df.grain.ix[df.grain.notnull()]
100 loops, best of 3: 17.4 ms per loop
In [27]: %timeit df.dropna(subset=['grain'])
10 loops, best of 3: 56.6 ms per loop
In [28]: %timeit df.grain[df.grain.notnull()]
100 loops, best of 3: 17 ms per loop
In [30]: %timeit df['grain'].dropna()
100 loops, best of 3: 16.3 ms per loop

Looping through pandas dataframe for speed

I'm trying to understand the fastest way to loop through a pandas DataFrame. I have read in many places that itertuples is much better than regular looping, and that apply is the best. If this is the case, why do regular loops come out fastest? Maybe I'm not understanding the results: what does "10 loops, best of 3" mean?
%%timeit
xlist= []
for row in toMood.itertuples():
    xlist.append(row[1] + 1)
1 loop, best of 3: 266 ms per loop
In [54]:
%%timeit
zlist = []
for row in toMood['user_id']:
    zlist.append(row + 1)
10 loops, best of 3: 83 ms per loop
In [56]:
%%timeit
tlist = toMood['user_id'].apply(lambda x: x+1)
10 loops, best of 3: 138 ms per loop
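For comparison (not part of the original question), a fully vectorized version avoids the Python-level loop entirely and is usually faster than all three approaches above; a minimal sketch:
# vectorized addition: the whole column is incremented in one NumPy operation
tlist = toMood['user_id'] + 1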

Pandas selecting columns - best habit and performance

There are many different ways to select a column in a pandas.DataFrame (same for rows). I am wondering if it makes any difference and if there are any performance and style recommendations.
E.g., if I have a DataFrame as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.random((10,4)), columns=['a','b','c','d'])
df.head()
There are many different ways to select, e.g., column d:
1) df['d']
2) df.loc[:,'d'] (where df.loc[row_indexer,column_indexer])
3) df.loc[:]['d']
4) df.ix[:]['d']
5) df.ix[:,'d']
Intuitively, I would prefer 2), maybe because I am used to the [row_indexer, column_indexer] style from numpy.
I would use IPython's magic function %timeit to find the most performant method.
The results are:
%timeit df['d']
100000 loops, best of 3: 5.35 µs per loop
%timeit df.loc[:,'d']
10000 loops, best of 3: 44.3 µs per loop
%timeit df.loc[:]['d']
100000 loops, best of 3: 12.4 µs per loop
%timeit df.ix[:]['d']
100000 loops, best of 3: 10.4 µs per loop
%timeit df.ix[:,'d']
10000 loops, best of 3: 53 µs per loop
It turns out that the first method is considerably faster than the others.
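As a side note (not from the original answer), .ix has since been deprecated and removed from pandas, so in current versions only the bracket and .loc/.iloc forms remain; a small sketch of the surviving options:
col = df['d']          # fastest simple form, as measured above
col = df.loc[:, 'd']   # explicit label-based indexing
col = df.iloc[:, 3]    # positional indexing ('d' is the fourth column here)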
