pandas rolling computation with window based on values instead of counts - python

I'm looking for a way to do something like the various rolling_* functions of pandas, but I want the window of the rolling computation to be defined by a range of values (say, a range of values of a column of the DataFrame), not by the number of rows in the window.
As an example, suppose I have this data:
>>> print d
RollBasis ToRoll
0 1 1
1 1 4
2 1 -5
3 2 2
4 3 -4
5 5 -2
6 8 0
7 10 -13
8 12 -2
9 13 -5
If I do something like rolling_sum(d, 5), I get a rolling sum in which each window contains 5 rows. But what I want is a rolling sum in which each window contains a certain range of values of RollBasis. That is, I'd like to be able to do something like d.roll_by(sum, 'RollBasis', 5), and get a result where the first window contains all rows whose RollBasis is between 1 and 5, then the second window contains all rows whose RollBasis is between 2 and 6, then the third window contains all rows whose RollBasis is between 3 and 7, etc. The windows will not have equal numbers of rows, but the range of RollBasis values selected in each window will be the same. So the output should be like:
>>> d.roll_by(sum, 'RollBasis', 5)
1 -4 # sum of elements with 1 <= RollBasis <= 5
2 -4 # sum of elements with 2 <= RollBasis <= 6
3 -6 # sum of elements with 3 <= RollBasis <= 7
4 -2 # sum of elements with 4 <= RollBasis <= 8
# etc.
I can't do this with groupby, because groupby always produces disjoint groups. I can't do it with the rolling functions, because their windows always roll by number of rows, not by values. So how can I do it?

I think this does what you want:
In [1]: df
Out[1]:
RollBasis ToRoll
0 1 1
1 1 4
2 1 -5
3 2 2
4 3 -4
5 5 -2
6 8 0
7 10 -13
8 12 -2
9 13 -5
In [2]: def f(x):
   ...:     ser = df.ToRoll[(df.RollBasis >= x) & (df.RollBasis < x+5)]
   ...:     return ser.sum()
The above function takes a value of RollBasis and uses it to index the ToRoll column of the data frame. The returned series consists of the ToRoll values whose RollBasis falls in the half-open window [x, x+5). Finally, that series is summed and returned.
In [3]: df['Rolled'] = df.RollBasis.apply(f)
In [4]: df
Out[4]:
RollBasis ToRoll Rolled
0 1 1 -4
1 1 4 -4
2 1 -5 -4
3 2 2 -4
4 3 -4 -6
5 5 -2 -2
6 8 0 -15
7 10 -13 -20
8 12 -2 -7
9 13 -5 -5
Code for the toy example DataFrame in case someone else wants to try:
In [1]: from pandas import *
In [2]: import io
In [3]: text = """\
...: RollBasis ToRoll
...: 0 1 1
...: 1 1 4
...: 2 1 -5
...: 3 2 2
...: 4 3 -4
...: 5 5 -2
...: 6 8 0
...: 7 10 -13
...: 8 12 -2
...: 9 13 -5
...: """
In [4]: df = read_csv(io.StringIO(text), header=0, index_col=0, sep=r'\s+')

Based on Zelazny7's answer, I created this more general solution:
def rollBy(what, basis, window, func):
    def applyToWindow(val):
        chunk = what[(val <= basis) & (basis < val + window)]
        return func(chunk)
    return basis.apply(applyToWindow)
>>> rollBy(d.ToRoll, d.RollBasis, 5, sum)
0 -4
1 -4
2 -4
3 -4
4 -6
5 -2
6 -15
7 -20
8 -7
9 -5
Name: RollBasis
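Any aggregation can be dropped in as the func argument. As a quick sketch of my own (not from the original answer), counting the rows that fall in each window instead of summing them:
>>> rollBy(d.ToRoll, d.RollBasis, 5, len)
0    6
1    6
2    6
3    3
4    2
5    2
6    3
7    3
8    2
9    1
Name: RollBasis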
It's still not ideal as it is very slow compared to rolling_apply, but perhaps this is inevitable.

Based on BrenBarn's answer, but sped up by using label-based indexing rather than boolean-based indexing:
def rollBy(what, basis, window, func, *args, **kwargs):
    # note that basis must be sorted in order for this to work properly
    indexed_what = pd.Series(what.values, index=basis.values)
    def applyToWindow(val):
        # using slice_indexer rather than what.loc[val:val+window] allows
        # window limits that are not specifically in the index
        indexer = indexed_what.index.slice_indexer(val, val+window, 1)
        chunk = indexed_what[indexer]
        return func(chunk, *args, **kwargs)
    rolled = basis.apply(applyToWindow)
    return rolled
This is much faster than not using an indexed column:
In [46]: df = pd.DataFrame({"RollBasis":np.random.uniform(0,1000000,100000), "ToRoll": np.random.uniform(0,10,100000)})
In [47]: df = df.sort_values("RollBasis")
In [48]: timeit("rollBy_Ian(df.ToRoll,df.RollBasis,10,sum)",setup="from __main__ import rollBy_Ian,df", number =3)
Out[48]: 67.6615059375763
In [49]: timeit("rollBy_Bren(df.ToRoll,df.RollBasis,10,sum)",setup="from __main__ import rollBy_Bren,df", number =3)
Out[49]: 515.0221037864685
It's worth noting that the index-based solution is O(n), while the logical slicing version is O(n^2) in the average case (I think).
I find it more useful to do this over evenly spaced windows from the min value of Basis to the max value of Basis, rather than at every value of basis. This means altering the function thus:
def rollBy(what, basis, window, func, *args, **kwargs):
    # note that basis must be sorted in order for this to work properly
    windows_min = basis.min()
    windows_max = basis.max()
    window_starts = np.arange(windows_min, windows_max, window)
    window_starts = pd.Series(window_starts, index=window_starts)
    indexed_what = pd.Series(what.values, index=basis.values)
    def applyToWindow(val):
        # using slice_indexer rather than what.loc[val:val+window] allows
        # window limits that are not specifically in the index
        indexer = indexed_what.index.slice_indexer(val, val+window, 1)
        chunk = indexed_what[indexer]
        return func(chunk, *args, **kwargs)
    rolled = window_starts.apply(applyToWindow)
    return rolled
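As a usage sketch of my own (reusing the sorted benchmark frame from above), the evenly spaced variant returns one value per window, indexed by each window's start value rather than by the original rows:
binned_sums = rollBy(df.ToRoll, df.RollBasis, 10, sum)  # one 10-wide window per index value
binned_sums.head()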

Extending @Ian Sudbury's answer, I've bound the method to the DataFrame class so that it can be used directly on a dataframe (there may well be room for speed improvements in my code, because I do not know how to access all the internals of the class).
I've also added support for backward-facing windows and centered windows. They only behave perfectly when you're away from the edges.
import pandas as pd
import numpy as np

def roll_by(self, basis, window, func, forward=True, *args, **kwargs):
    the_indexed = pd.Index(self[basis])
    def apply_to_window(val):
        if forward == True:
            indexer = the_indexed.slice_indexer(val, val+window)
        elif forward == False:
            indexer = the_indexed.slice_indexer(val-window, val)
        elif forward == 'both':
            indexer = the_indexed.slice_indexer(val-window/2, val+window/2)
        else:
            raise RuntimeError('Invalid option for "forward". Can only be True, False, or "both".')
        chunk = self.iloc[indexer]
        return func(chunk, *args, **kwargs)
    rolled = self[basis].apply(apply_to_window)
    return rolled

pd.DataFrame.roll_by = roll_by
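A usage sketch of my own on the toy frame d from the question (the basis column must be sorted for the slicing to behave):
# forward-looking window: rows with RollBasis in [val, val + 5]
d.roll_by('RollBasis', 5, lambda chunk: chunk['ToRoll'].sum())
# backward-looking window: rows with RollBasis in [val - 5, val]
d.roll_by('RollBasis', 5, lambda chunk: chunk['ToRoll'].sum(), forward=False)
# centered window: rows with RollBasis in [val - 2.5, val + 2.5]
d.roll_by('RollBasis', 5, lambda chunk: chunk['ToRoll'].sum(), forward='both')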
For the other tests, I've used the following definitions:
def rollBy_Ian_iloc(what, basis, window, func, *args, **kwargs):
    # note that basis must be sorted in order for this to work properly
    indexed_what = pd.Series(what.values, index=basis.values)
    def applyToWindow(val):
        # using slice_indexer rather than what.loc[val:val+window] allows
        # window limits that are not specifically in the index
        indexer = indexed_what.index.slice_indexer(val, val+window, 1)
        chunk = indexed_what.iloc[indexer]
        return func(chunk, *args, **kwargs)
    rolled = basis.apply(applyToWindow)
    return rolled

def rollBy_Ian_index(what, basis, window, func, *args, **kwargs):
    # note that basis must be sorted in order for this to work properly
    indexed_what = pd.Series(what.values, index=basis.values)
    def applyToWindow(val):
        # using slice_indexer rather than what.loc[val:val+window] allows
        # window limits that are not specifically in the index
        indexer = indexed_what.index.slice_indexer(val, val+window, 1)
        chunk = indexed_what[indexed_what.index[indexer]]
        return func(chunk, *args, **kwargs)
    rolled = basis.apply(applyToWindow)
    return rolled

def rollBy_Bren(what, basis, window, func):
    def applyToWindow(val):
        chunk = what[(val <= basis) & (basis < val+window)]
        return func(chunk)
    return basis.apply(applyToWindow)
Timings and tests:
df = pd.DataFrame({"RollBasis":np.random.uniform(0,100000,10000), "ToRoll": np.random.uniform(0,10,10000)}).sort_values("RollBasis")
In [14]: %timeit rollBy_Ian_iloc(df.ToRoll,df.RollBasis,10,sum)
...: %timeit rollBy_Ian_index(df.ToRoll,df.RollBasis,10,sum)
...: %timeit rollBy_Bren(df.ToRoll,df.RollBasis,10,sum)
...: %timeit df.roll_by('RollBasis', 10, lambda x: x['ToRoll'].sum())
...:
484 ms ± 28.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.58 s ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.12 s ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.48 s ± 45.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Conclusion: the bound method is not as fast as @Ian Sudbury's method, and not as slow as @BrenBarn's, but it allows for more flexibility regarding the functions one can call on it.

Related

How do I create a new variable (column) in the table, using the For loop in python? [duplicate]

I want to apply my custom function (it uses an if-else ladder) to these six columns (ERI_Hispanic, ERI_AmerInd_AKNatv, ERI_Asian, ERI_Black_Afr.Amer, ERI_HI_PacIsl, ERI_White) in each row of my dataframe.
I've tried different methods from other questions but still can't seem to find the right answer for my problem. The critical piece of this is that if the person is counted as Hispanic they can't be counted as anything else. Even if they have a "1" in another ethnicity column, they are still counted as Hispanic, not two or more races. Similarly, if the sum of all the ERI columns is greater than 1, they are counted as two or more races and can't be counted as a unique ethnicity (except for Hispanic).
It's almost like doing a for loop through each row and if each record meets a criterion they are added to one list and eliminated from the original.
From the dataframe below I need to calculate a new column based on the following spec in SQL:
CRITERIA
IF [ERI_Hispanic] = 1 THEN RETURN “Hispanic”
ELSE IF SUM([ERI_AmerInd_AKNatv] + [ERI_Asian] + [ERI_Black_Afr.Amer] + [ERI_HI_PacIsl] + [ERI_White]) > 1 THEN RETURN “Two or More”
ELSE IF [ERI_AmerInd_AKNatv] = 1 THEN RETURN “A/I AK Native”
ELSE IF [ERI_Asian] = 1 THEN RETURN “Asian”
ELSE IF [ERI_Black_Afr.Amer] = 1 THEN RETURN “Black/AA”
ELSE IF [ERI_HI_PacIsl] = 1 THEN RETURN “Haw/Pac Isl.”
ELSE IF [ERI_White] = 1 THEN RETURN “White”
Comment: If the ERI Flag for Hispanic is True (1), the employee is classified as “Hispanic”
Comment: If more than 1 non-Hispanic ERI Flag is true, return “Two or More”
DATAFRAME
lname fname rno_cd eri_afr_amer eri_asian eri_hawaiian eri_hispanic eri_nat_amer eri_white rno_defined
0 MOST JEFF E 0 0 0 0 0 1 White
1 CRUISE TOM E 0 0 0 1 0 0 White
2 DEPP JOHNNY 0 0 0 0 0 1 Unknown
3 DICAP LEO 0 0 0 0 0 1 Unknown
4 BRANDO MARLON E 0 0 0 0 0 0 White
5 HANKS TOM 0 0 0 0 0 1 Unknown
6 DENIRO ROBERT E 0 1 0 0 0 1 White
7 PACINO AL E 0 0 0 0 0 1 White
8 WILLIAMS ROBIN E 0 0 1 0 0 0 White
9 EASTWOOD CLINT E 0 0 0 0 0 1 White
OK, two steps to this - first is to write a function that does the translation you want - I've put an example together based on your pseudo-code:
def label_race(row):
    if row['eri_hispanic'] == 1:
        return 'Hispanic'
    if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1:
        return 'Two Or More'
    if row['eri_nat_amer'] == 1:
        return 'A/I AK Native'
    if row['eri_asian'] == 1:
        return 'Asian'
    if row['eri_afr_amer'] == 1:
        return 'Black/AA'
    if row['eri_hawaiian'] == 1:
        return 'Haw/Pac Isl.'
    if row['eri_white'] == 1:
        return 'White'
    return 'Other'
You may want to go over this, but it seems to do the trick - notice that the parameter going into the function is considered to be a Series object labelled "row".
Next, use the apply function in pandas to apply the function - e.g.
df.apply (lambda row: label_race(row), axis=1)
Note the axis=1 specifier, that means that the application is done at a row, rather than a column level. The results are here:
0 White
1 Hispanic
2 White
3 White
4 Other
5 White
6 Two Or More
7 White
8 Haw/Pac Isl.
9 White
If you're happy with those results, then run it again, saving the results into a new column in your original dataframe.
df['race_label'] = df.apply (lambda row: label_race(row), axis=1)
The resultant dataframe looks like this (scroll to the right to see the new column):
lname fname rno_cd eri_afr_amer eri_asian eri_hawaiian eri_hispanic eri_nat_amer eri_white rno_defined race_label
0 MOST JEFF E 0 0 0 0 0 1 White White
1 CRUISE TOM E 0 0 0 1 0 0 White Hispanic
2 DEPP JOHNNY NaN 0 0 0 0 0 1 Unknown White
3 DICAP LEO NaN 0 0 0 0 0 1 Unknown White
4 BRANDO MARLON E 0 0 0 0 0 0 White Other
5 HANKS TOM NaN 0 0 0 0 0 1 Unknown White
6 DENIRO ROBERT E 0 1 0 0 0 1 White Two Or More
7 PACINO AL E 0 0 0 0 0 1 White White
8 WILLIAMS ROBIN E 0 0 1 0 0 0 White Haw/Pac Isl.
9 EASTWOOD CLINT E 0 0 0 0 0 1 White White
Since this is the first Google result for 'pandas new column from others', here's a simple example:
import pandas as pd
# make a simple dataframe
df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
df
# a b
# 0 1 3
# 1 2 4
# create an unattached column with an index
df.apply(lambda row: row.a + row.b, axis=1)
# 0 4
# 1 6
# do same but attach it to the dataframe
df['c'] = df.apply(lambda row: row.a + row.b, axis=1)
df
# a b c
# 0 1 3 4
# 1 2 4 6
If you get the SettingWithCopyWarning you can do it this way also:
fn = lambda row: row.a + row.b # define a function for the new column
col = df.apply(fn, axis=1) # get column data with an index
df = df.assign(c=col.values) # assign values to column 'c'
Source: https://stackoverflow.com/a/12555510/243392
And if your column name includes spaces you can use syntax like this:
df = df.assign(**{'some column name': col.values})
And here's the documentation for apply, and assign.
The answers above are perfectly valid, but a vectorized solution exists, in the form of numpy.select. This allows you to define conditions, then define outputs for those conditions, much more efficiently than using apply:
First, define conditions:
conditions = [
df['eri_hispanic'] == 1,
df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
df['eri_nat_amer'] == 1,
df['eri_asian'] == 1,
df['eri_afr_amer'] == 1,
df['eri_hawaiian'] == 1,
df['eri_white'] == 1,
]
Now, define the corresponding outputs:
outputs = [
'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
]
Finally, using numpy.select:
res = np.select(conditions, outputs, 'Other')
pd.Series(res)
0 White
1 Hispanic
2 White
3 White
4 Other
5 White
6 Two Or More
7 White
8 Haw/Pac Isl.
9 White
dtype: object
Why should numpy.select be used over apply? Here are some performance checks:
df = pd.concat([df]*1000)
In [42]: %timeit df.apply(lambda row: label_race(row), axis=1)
1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [44]: %%timeit
...: conditions = [
...: df['eri_hispanic'] == 1,
...: df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
...: df['eri_nat_amer'] == 1,
...: df['eri_asian'] == 1,
...: df['eri_afr_amer'] == 1,
...: df['eri_hawaiian'] == 1,
...: df['eri_white'] == 1,
...: ]
...:
...: outputs = [
...: 'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
...: ]
...:
...: np.select(conditions, outputs, 'Other')
...:
...:
3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using numpy.select gives us vastly improved performance, and the discrepancy will only increase as the data grows.
.apply() takes a function as its first parameter; pass in the label_race function like so:
df['race_label'] = df.apply(label_race, axis=1)
You don't need to make a lambda function to pass in a function.
try this,
df.loc[df['eri_white']==1,'race_label'] = 'White'
df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
df['race_label'].fillna('Other', inplace=True)
O/P:
lname fname rno_cd eri_afr_amer eri_asian eri_hawaiian \
0 MOST JEFF E 0 0 0
1 CRUISE TOM E 0 0 0
2 DEPP JOHNNY NaN 0 0 0
3 DICAP LEO NaN 0 0 0
4 BRANDO MARLON E 0 0 0
5 HANKS TOM NaN 0 0 0
6 DENIRO ROBERT E 0 1 0
7 PACINO AL E 0 0 0
8 WILLIAMS ROBIN E 0 0 1
9 EASTWOOD CLINT E 0 0 0
eri_hispanic eri_nat_amer eri_white rno_defined race_label
0 0 0 1 White White
1 1 0 0 White Hispanic
2 0 0 1 Unknown White
3 0 0 1 Unknown White
4 0 0 0 White Other
5 0 0 1 Unknown White
6 0 0 1 White Two Or More
7 0 0 1 White White
8 0 0 0 White Haw/Pac Isl.
9 0 0 1 White White
Use .loc instead of apply.
It improves vectorization.
.loc works in a simple manner: mask rows based on the condition, then assign values to the selected rows.
For more details, see the .loc docs.
Performance metrics:
Accepted Answer:
def label_race(row):
    if row['eri_hispanic'] == 1:
        return 'Hispanic'
    if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1:
        return 'Two Or More'
    if row['eri_nat_amer'] == 1:
        return 'A/I AK Native'
    if row['eri_asian'] == 1:
        return 'Asian'
    if row['eri_afr_amer'] == 1:
        return 'Black/AA'
    if row['eri_hawaiian'] == 1:
        return 'Haw/Pac Isl.'
    if row['eri_white'] == 1:
        return 'White'
    return 'Other'
df=pd.read_csv('dataser.csv')
df = pd.concat([df]*1000)
%timeit df.apply(lambda row: label_race(row), axis=1)
1.15 s ± 46.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
My Proposed Answer:
def label_race(df):
    df.loc[df['eri_white']==1, 'race_label'] = 'White'
    df.loc[df['eri_hawaiian']==1, 'race_label'] = 'Haw/Pac Isl.'
    df.loc[df['eri_afr_amer']==1, 'race_label'] = 'Black/AA'
    df.loc[df['eri_asian']==1, 'race_label'] = 'Asian'
    df.loc[df['eri_nat_amer']==1, 'race_label'] = 'A/I AK Native'
    df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1, 'race_label'] = 'Two Or More'
    df.loc[df['eri_hispanic']==1, 'race_label'] = 'Hispanic'
    df['race_label'].fillna('Other', inplace=True)
df=pd.read_csv('s22.csv')
df = pd.concat([df]*1000)
%timeit label_race(df)
24.7 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If we inspect its source code, apply() is syntactic sugar for a Python for-loop (via the apply_series_generator() method of the FrameApply class). Because it also carries the pandas overhead, it's generally slower than a plain Python loop.
Use optimized (vectorized) methods wherever possible. If you have to use a loop, use the @numba.jit decorator.
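As a rough illustration of that claim (my own simplification, not the actual pandas source), axis=1 apply behaves essentially like this pure-Python loop:
import pandas as pd

def apply_rows(df, func):
    """Roughly what df.apply(func, axis=1) does under the hood (simplified)."""
    results = {}
    for label, row in df.iterrows():  # a new Series object is built for every row
        results[label] = func(row)
    return pd.Series(results)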
1. Don't use apply() for an if-else ladder
df.apply() is just about the slowest way to do this in pandas. As shown in the answers of user3483203 and Mohamed Thasin ah, depending on the dataframe size, np.select() and df.loc may be 50-300 times faster than df.apply() to produce the same output.
As it happens, a loop implementation (not unlike apply()) with the @jit decorator from the numba module is (about 50-60%) faster than df.loc and np.select. [1]
Numba works on numpy arrays, so before using the jit decorator, you need to convert the dataframe into a numpy array. Then fill in values in a pre-initialized empty array by checking the conditions in a loop. Since numpy arrays don't have column names, you have to access the columns by their index in the loop. The most inconvenient part of the if-else ladder in the jitted function over the one in apply() is accessing the columns by their indices. Otherwise it's almost the same implementation.
import numpy as np
import numba as nb

@nb.jit(nopython=True)
def conditional_assignment(arr, res):
    length = len(arr)
    for i in range(length):
        if arr[i][3] == 1:
            res[i] = 'Hispanic'
        elif arr[i][0] + arr[i][1] + arr[i][2] + arr[i][4] + arr[i][5] > 1:
            res[i] = 'Two Or More'
        elif arr[i][0] == 1:
            res[i] = 'Black/AA'
        elif arr[i][1] == 1:
            res[i] = 'Asian'
        elif arr[i][2] == 1:
            res[i] = 'Haw/Pac Isl.'
        elif arr[i][4] == 1:
            res[i] = 'A/I AK Native'
        elif arr[i][5] == 1:
            res[i] = 'White'
        else:
            res[i] = 'Other'
    return res
# the columns with the boolean data
cols = [c for c in df.columns if c.startswith('eri_')]
# initialize an empty array to be filled in a loop
# for string dtype arrays, we need to know the length of the longest string
# and use it to set the dtype
res = np.empty(len(df), dtype=f"<U{len('A/I AK Native')}")
# pass the underlying numpy array of `df[cols]` into the jitted function
df['rno_defined'] = conditional_assignment(df[cols].values, res)
2. Don't use apply() for numeric operations
If you need to create a new column by adding two existing columns, your first instinct may be to write
df['c'] = df.apply(lambda row: row['a'] + row['b'], axis=1)
But instead of this, add row-wise using the sum(axis=1) method (or the + operator if there are only a couple of columns):
df['c'] = df[['a','b']].sum(axis=1)
# equivalently
df['c'] = df['a'] + df['b']
Depending on the dataframe size, sum(1) may be 100s of times faster than apply().
In fact, you will almost never need apply() for numeric operations on a pandas dataframe because it has optimized methods for most operations: addition (sum(1)), subtraction (sub() or diff()), multiplication (prod(1)), division (div() or /), power (pow()), >, >=, ==, %, //, &, | etc. can all be performed on the entire dataframe without apply().
For example, let's say you want to create a new column using the following rule:
IF [colC] > 0 THEN RETURN [colA] * [colB]
ELSE RETURN [colA] / [colB]
Using the optimized pandas methods, this can be written as
df['new'] = df[['colA','colB']].prod(1).where(df['colC']>0, df['colA'] / df['colB'])
the equivalent apply() solution is:
df['new'] = df.apply(lambda row: row.colA * row.colB if row.colC > 0 else row.colA / row.colB, axis=1)
The approach using the optimized methods is 250 times faster than the equivalent apply() approach for dataframes with 20k rows. This gap only increases as the data size increases (for a dataframe with 1 mil rows, it's 365 times faster) and the time difference will become more and more noticeable. [2]
[1]: In the result below, I show the performance of the three approaches using a dataframe with 24 mil rows (this is the largest frame I can construct on my machine). For smaller frames, the numba-jitted function consistently runs at least 50% faster than the other two as well (you can check yourself).
def pd_loc(df):
    df['rno_defined'] = 'Other'
    df.loc[df['eri_nat_amer'] == 1, 'rno_defined'] = 'A/I AK Native'
    df.loc[df['eri_asian'] == 1, 'rno_defined'] = 'Asian'
    df.loc[df['eri_afr_amer'] == 1, 'rno_defined'] = 'Black/AA'
    df.loc[df['eri_hawaiian'] == 1, 'rno_defined'] = 'Haw/Pac Isl.'
    df.loc[df['eri_white'] == 1, 'rno_defined'] = 'White'
    df.loc[df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1) > 1, 'rno_defined'] = 'Two Or More'
    df.loc[df['eri_hispanic'] == 1, 'rno_defined'] = 'Hispanic'
    return df

def np_select(df):
    conditions = [df['eri_hispanic'] == 1,
                  df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
                  df['eri_nat_amer'] == 1,
                  df['eri_asian'] == 1,
                  df['eri_afr_amer'] == 1,
                  df['eri_hawaiian'] == 1,
                  df['eri_white'] == 1]
    outputs = ['Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White']
    df['rno_defined'] = np.select(conditions, outputs, 'Other')
    return df
@nb.jit(nopython=True)
def conditional_assignment(arr, res):
    length = len(arr)
    for i in range(length):
        if arr[i][3] == 1:
            res[i] = 'Hispanic'
        elif arr[i][0] + arr[i][1] + arr[i][2] + arr[i][4] + arr[i][5] > 1:
            res[i] = 'Two Or More'
        elif arr[i][0] == 1:
            res[i] = 'Black/AA'
        elif arr[i][1] == 1:
            res[i] = 'Asian'
        elif arr[i][2] == 1:
            res[i] = 'Haw/Pac Isl.'
        elif arr[i][4] == 1:
            res[i] = 'A/I AK Native'
        elif arr[i][5] == 1:
            res[i] = 'White'
        else:
            res[i] = 'Other'
    return res

def nb_loop(df):
    cols = [c for c in df.columns if c.startswith('eri_')]
    res = np.empty(len(df), dtype=f"<U{len('A/I AK Native')}")
    df['rno_defined'] = conditional_assignment(df[cols].values, res)
    return df
# df with 24mil rows
n = 4_000_000
df = pd.DataFrame({
'eri_afr_amer': [0, 0, 0, 0, 0, 0]*n,
'eri_asian': [1, 0, 0, 0, 0, 0]*n,
'eri_hawaiian': [0, 0, 0, 1, 0, 0]*n,
'eri_hispanic': [0, 1, 0, 0, 1, 0]*n,
'eri_nat_amer': [0, 0, 0, 0, 1, 0]*n,
'eri_white': [0, 0, 1, 1, 0, 0]*n
}, dtype='int8')
df.insert(0, 'name', ['MOST', 'CRUISE', 'DEPP', 'DICAP', 'BRANDO', 'HANKS']*n)
%timeit nb_loop(df)
# 5.23 s ± 45.2 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
%timeit pd_loc(df)
# 7.97 s ± 28.8 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
%timeit np_select(df)
# 8.5 s ± 39.6 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
[2]: In the result below, I show the performance of the two approaches using a dataframe with 20k rows and again with 1 mil rows. For smaller frames, the gap is smaller because the optimized approach has an overhead while apply() is a loop. As the size of the frame increases, the vectorization overhead cost diminishes w.r.t. the overall runtime of the code, while apply() remains a loop over the frame.
n = 20_000 # 1_000_000
df = pd.DataFrame(np.random.rand(n,3)-0.5, columns=['colA','colB','colC'])
%timeit df[['colA','colB']].prod(1).where(df['colC']>0, df['colA'] / df['colB'])
# n = 20000: 2.69 ms ± 23.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# n = 1000000: 86.2 ms ± 441 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.apply(lambda row: row.colA * row.colB if row.colC > 0 else row.colA / row.colB, axis=1)
# n = 20000: 679 ms ± 33.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# n = 1000000: 31.5 s ± 587 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Yet another (easily generalizable) approach, whose cornerstone is pandas.DataFrame.idxmax. First, the easily generalizable preamble.
# Indeed, all your conditions boil down to the following
_gt_1_key = 'two_or_more'
_lt_1_key = 'other'

# The "dictionary-based" if-else statements
labels = {
    _gt_1_key     : 'Two Or More',
    'eri_hispanic': 'Hispanic',
    'eri_nat_amer': 'A/I AK Native',
    'eri_asian'   : 'Asian',
    'eri_afr_amer': 'Black/AA',
    'eri_hawaiian': 'Haw/Pac Isl.',
    'eri_white'   : 'White',
    _lt_1_key     : 'Other',
}

# The output-driving 1-0 matrix
mat = df.filter(regex='^eri_').copy()  # `.copy` to avoid `SettingWithCopyWarning`
... and, finally, in a vectorized fashion:
mat[_gt_1_key] = gt1 = mat.sum(axis=1)
mat[_lt_1_key] = gt1.eq(0).astype(int)
race_label = mat.idxmax(axis=1).map(labels)
where
>>> race_label
0 White
1 Hispanic
2 White
3 White
4 Other
5 White
6 Two Or More
7 White
8 Haw/Pac Isl.
9 White
dtype: object
which is a pandas.Series instance you can easily attach to df, i.e. df['race_label'] = race_label. (This works because idxmax(axis=1) returns, for each row, the label of the first column holding that row's maximum, so the 1-0 matrix picks out the matching key of labels.)
Choosing a method according to the complexity of the criteria
For the examples below - in order to show multiple types of rules for the new column - we will assume a DataFrame with columns 'red', 'green' and 'blue', containing floating-point values ranging 0 to 1.
General case: .apply
As long as the necessary logic to compute the new value can be written as a function of other values in the same row, we can use the .apply method of the DataFrame to get the desired result. Write the function so that it accepts a single parameter, which is a single row of the input:
def as_hex(value):
    # clamp to avoid rounding errors etc.
    return min(max(0, int(value * 256)), 255)

def hex_color(row):
    r, g, b = as_hex(row['red']), as_hex(row['green']), as_hex(row['blue'])
    return f'#{r:02x}{g:02x}{b:02x}'
Pass the function itself (don't write parentheses after the name) to .apply, and specify axis=1 (meaning to supply rows to the categorizing function, so as to compute a column - rather than the other way around). Thus:
df['hex_color'] = df.apply(hex_color, axis=1)
Note that wrapping in lambda is not necessary, since we are not binding any arguments or otherwise modifying the function.
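If you did need to bind an extra argument, apply() forwards keyword arguments to the function, so a wrapper is still optional. A sketch with a purely hypothetical alpha parameter (not part of the example above):
def hex_color_with_alpha(row, alpha):
    # hypothetical variant: append an alpha channel to the hex string
    return hex_color(row) + f'{as_hex(alpha):02x}'

df['hex_rgba'] = df.apply(hex_color_with_alpha, axis=1, alpha=0.5)
# equivalent: df.apply(lambda row: hex_color_with_alpha(row, 0.5), axis=1)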
The .apply step is necessary because the conversion function itself is not vectorized. Thus, a naive approach like df['color'] = hex_color(df) will not work (example question).
This tool is powerful, but inefficient. For best performance, please use a more specific approach where applicable.
Multiple choices with conditions: numpy.select or repeated assignment with df.loc or df.where
Suppose we were thresholding the color values, and computing rough color names like so:
def additive_color(row):
    # Insert here: logic that takes values from the `row` and computes
    # the desired cell value for the new column in that row.
    # The `row` is an ordinary `Series` object representing a row of the
    # original `DataFrame`; it can be indexed with column names, thus:
    if row['red'] > 0.5:
        if row['green'] > 0.5:
            return 'white' if row['blue'] > 0.5 else 'yellow'
        else:
            return 'magenta' if row['blue'] > 0.5 else 'red'
    elif row['green'] > 0.5:
        return 'cyan' if row['blue'] > 0.5 else 'green'
    else:
        return 'blue' if row['blue'] > 0.5 else 'black'
In cases like this - where the categorizing function would be an if/else ladder, or match/case in 3.10 and up - we may get much faster performance using numpy.select.
This approach works very differently. First, compute masks on the data for where each condition applies:
black = (df['red'] <= 0.5) & (df['green'] <= 0.5) & (df['blue'] <= 0.5)
white = (df['red'] > 0.5) & (df['green'] > 0.5) & (df['blue'] > 0.5)
To call numpy.select, we need two parallel sequences - one of the conditions, and another of the corresponding values:
df['color'] = np.select(
    [white, black],
    ['white', 'black'],
    'colorful'
)
The optional third argument specifies a value to use when none of the conditions are met. (As an exercise: fill in the remaining conditions, and try it without a third argument.)
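For reference, one possible completion of that exercise (a sketch of my own, mirroring additive_color above):
red     = (df['red'] > 0.5)  & (df['green'] <= 0.5) & (df['blue'] <= 0.5)
green   = (df['red'] <= 0.5) & (df['green'] > 0.5)  & (df['blue'] <= 0.5)
blue    = (df['red'] <= 0.5) & (df['green'] <= 0.5) & (df['blue'] > 0.5)
yellow  = (df['red'] > 0.5)  & (df['green'] > 0.5)  & (df['blue'] <= 0.5)
magenta = (df['red'] > 0.5)  & (df['green'] <= 0.5) & (df['blue'] > 0.5)
cyan    = (df['red'] <= 0.5) & (df['green'] > 0.5)  & (df['blue'] > 0.5)

# with all eight masks supplied, every row matches exactly one condition,
# so the default third argument is no longer needed
df['color'] = np.select(
    [white, black, red, green, blue, yellow, magenta, cyan],
    ['white', 'black', 'red', 'green', 'blue', 'yellow', 'magenta', 'cyan'],
)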
A similar approach is to make repeated assignments based on each condition. Assign the default value first, and then use df.loc to assign specific values for each condition:
df['color'] = 'colorful'
df.loc[white, 'color'] = 'white'
df.loc[black, 'color'] = 'black'
Alternately, df.where can be used to do the assignments. However, df.where, used like this, assigns the specified value in places where the condition is not met, so the conditions must be inverted:
df['color'] = 'colorful'
df['color'] = df['color'].where(~white, 'white').where(~black, 'black')
Simple mathematical manipulations: built-in mathematical operators and broadcasting
For example, an apply-based approach like:
def brightness(row):
    return row['red'] * .299 + row['green'] * .587 + row['blue'] * .114

df['brightness'] = df.apply(brightness, axis=1)
can instead be written by broadcasting the operators, for much better performance (and is also simpler):
df['brightness'] = df['red'] * .299 + df['green'] * .587 + df['blue'] * .114
As an exercise, here's the first example redone that way:
def as_hex(column):
    scaled = (column * 256).astype(int)
    clamped = scaled.where(scaled >= 0, 0).where(scaled <= 255, 255)
    return clamped.apply(lambda i: f'{i:02x}')
df['hex_color'] = '#' + as_hex(df['red']) + as_hex(df['green']) + as_hex(df['blue'])
I was unable to find a vectorized equivalent to format the integer values as hex strings, so .apply is still used internally here - meaning that the full speed penalty still comes into play. Still, this demonstrates some general techniques.
For more details and examples, see cottontail's answer.

Modify DataFrame based on previous row (cumulative sum with condition based on previous cumulative sum result)

I have a dataframe with one column containing numbers (quantity). Every row represents one day, so the whole dataframe should be treated as sequential data. I want to add a second column that calculates the cumulative sum of the quantity column, but if at any point the cumulative sum is greater than 0, the next row should start counting the cumulative sum from 0.
I solved this problem using iterrows(), but I read that this function is very inefficient, and with millions of rows the calculation takes over 20 minutes. My solution is below:
import pandas as pd
df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1], columns=['quantity'])
for index, row in df.iterrows():
    if index == 0:
        df.loc[index, 'outcome'] = df.loc[index, 'quantity']
    else:
        previous_outcome = df.loc[index-1, 'outcome']
        if previous_outcome > 0:
            previous_outcome = 0
        df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']
print(df)
# quantity outcome
# -1 -1.0
# -1 -2.0
# -1 -3.0
# -1 -4.0
# 15 11.0 <- since this is greater than 0, next line will start counting from 0
# -1 -1.0
# -1 -2.0
# -1 -3.0
# -1 -4.0
# 5 1.0 <- since this is greater than 0, next line will start counting from 0
# -1 -1.0
# 15 14.0 <- since this is greater than 0, next line will start counting from 0
# -1 -1.0
# -1 -2.0
# -1 -3.0
Is there faster (more optimized way) to calculate this?
I'm also not sure if the "if index == 0" block is the best solution and whether this can be solved in a more elegant way. Without this block there is an error, since for the first row there is no "previous row" to use in the calculation.
Iterating over DataFrame rows is very slow and should be avoided. Working with chunks of data is the way to go with pandas.
For your case, looking at your DataFrame column quantity as a numpy array, the code below should speed up the process quite a lot compared to your approach:
import pandas as pd
import numpy as np
df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1], columns=['quantity'])
x = np.array(df.quantity)
y = np.zeros(x.size)
total = 0
for i, xi in enumerate(x):
    total += xi
    y[i] = total
    total = total if total < 0 else 0
df['outcome'] = y
print(df)
Out :
quantity outcome
0 -1 -1.0
1 -1 -2.0
2 -1 -3.0
3 -1 -4.0
4 15 11.0
5 -1 -1.0
6 -1 -2.0
7 -1 -3.0
8 -1 -4.0
9 5 1.0
10 -1 -1.0
11 15 14.0
12 -1 -1.0
13 -1 -2.0
14 -1 -3.0
If you still need more speed, I suggest having a look at numba, as per jezrael's answer.
Edit - Performance test
I got curious about performance and wrote this module with all 3 approaches.
I haven't optimised the individual functions, just copied the code from the OP and jezrael's answer with minor changes.
"""
bench_dataframe.py
Performance test of iteration over DataFrame rows.
Methods tested are `DataFrame.iterrows()`, loop over `numpy.array`,
and same using `numba`.
"""
from numba import njit
import pandas as pd
import numpy as np
def pditerrows(df):
    """Iterate over DataFrame using `iterrows`"""
    for index, row in df.iterrows():
        if index == 0:
            df.loc[index, 'outcome'] = df.loc[index, 'quantity']
        else:
            previous_outcome = df.loc[index-1, 'outcome']
            if previous_outcome > 0:
                previous_outcome = 0
            df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']
    return df

def nparray(df):
    """Convert DataFrame column to `numpy` arrays."""
    x = np.array(df.quantity)
    y = np.zeros(x.size)
    total = 0
    for i, xi in enumerate(x):
        total += xi
        y[i] = total
        total = total if total < 0 else 0
    df['outcome'] = y
    return df

@njit
def f(x, lim):
    result = np.empty(len(x))
    result[0] = x[0]
    for i, j in enumerate(x[1:], 1):
        previous_outcome = result[i-1]
        if previous_outcome > lim:
            previous_outcome = 0
        result[i] = previous_outcome + x[i]
    return result

def numbaloop(df):
    """Convert DataFrame to `numpy` arrays and loop using `numba`.
    See [https://stackoverflow.com/a/69750009/5069105]
    """
    df['outcome'] = f(df.quantity.to_numpy(), 0)
    return df

def create_df(size):
    """Create a DataFrame filled with -1's and 15's, with 90% of
    the entries equal to -1 and 10% equal to 15, randomly
    placed in the array.
    """
    df = pd.DataFrame(
        np.random.choice(
            (-1, 15),
            size=size,
            p=[0.9, 0.1]
        ),
        columns=['quantity'])
    return df
# Make sure all tests lead to the same result
df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1],
columns=['quantity'])
assert nparray(df.copy()).equals(pditerrows(df.copy()))
assert nparray(df.copy()).equals(numbaloop(df.copy()))
Running for a somewhat small array, size = 20_000, leads to:
In: import bench_dataframe as bd
.. df = bd.create_df(size=20_000)
In: %timeit bd.pditerrows(df.copy())
7.06 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In: %timeit bd.nparray(df.copy())
9.76 ms ± 710 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In: %timeit bd.numbaloop(df.copy())
437 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Here numpy arrays were 700+ times faster than iterrows(), and numba was still 22 times faster than numpy.
And for larger arrays, size = 200_000, we get:
In: import bench_dataframe as bd
.. df = bd.create_df(size=200_000)
In: %timeit bd.pditerrows(df.copy())
I gave up and hit Ctrl+C after 10 minutes or so... =P
In: %timeit bd.nparray(df.copy())
86 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In: %timeit bd.numbaloop(df.copy())
3.15 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Making numba again 25+ times faster than numpy arrays for this example, and confirming that you should avoid using iterrows() at all costs for anything more than a couple of hundred rows.
I think numba is the best when working with loops if performance is important:
@njit
def f(x, lim):
    result = np.empty(len(x), dtype=np.int64)
    result[0] = x[0]
    for i, j in enumerate(x[1:], 1):
        previous_outcome = result[i-1]
        if previous_outcome > lim:
            previous_outcome = 0
        result[i] = previous_outcome + x[i]
    return result

df['outcome1'] = f(df.quantity.to_numpy(), 0)
print(df)
quantity outcome outcome1
0 -1 -1.0 -1
1 -1 -2.0 -2
2 -1 -3.0 -3
3 -1 -4.0 -4
4 15 11.0 11
5 -1 -1.0 -1
6 -1 -2.0 -2
7 -1 -3.0 -3
8 -1 -4.0 -4
9 5 1.0 1
10 -1 -1.0 -1
11 15 14.0 14
12 -1 -1.0 -1
13 -1 -2.0 -2
14 -1 -3.0 -3

Update column values in a group based on one row in that group

I have a dataframe from source data that resembles the following:
In[1]: df = pd.DataFrame({'test_group': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                          'test_type': [np.nan, 'memory', np.nan, np.nan, 'visual', np.nan, np.nan,
                                        'auditory', np.nan]})
Out[1]:
test_group test_type
0 1 NaN
1 1 memory
2 1 NaN
3 2 NaN
4 2 visual
5 2 NaN
6 3 NaN
7 3 auditory
8 3 NaN
test_group represents the grouping of the rows, which represent a test. I need to replace the NaNs in column test_type in each test_group with the value of the row that is not a NaN, e.g. memory, visual, etc.
I've tried a variety of approaches including isolating the "real" value in test_type such as
In [4]: df.groupby('test_group')['test_type'].unique()
Out[4]:
test_group
1 [nan, memory]
2 [nan, visual]
3 [nan, auditory]
Easy enough, I can index into each row and pluck out the value I want. This seems to head in the right direction:
In [6]: df.groupby('test_group')['test_type'].unique().apply(lambda x: x[1])
Out[6]:
test_group
1 memory
2 visual
3 auditory
I tried this among many other things but it doesn't quite work (note: apply and transform give the same result):
In [15]: grp = df.groupby('test_group')
In [16]: df['test_type'] = grp['test_type'].unique().transform(lambda x: x[1])
In [17]: df
Out[17]:
test_group test_type
0 1 NaN
1 1 memory
2 1 visual
3 2 auditory
4 2 NaN
5 2 NaN
6 3 NaN
7 3 NaN
8 3 NaN
I'm sure if I looped it I'd be done with things, but loops are too slow as the data set is millions of records per file.
You can use GroupBy.size to get the size of each group, then boolean-index using Series.isna, and finally use Index.repeat with df.reindex:
repeats = df.groupby('test_group').size()
out = df[~df['test_type'].isna()]
out.reindex(out.index.repeat(repeats)).reset_index(drop=True)
test_group test_type
0 1 memory
1 1 memory
2 1 memory
3 2 visual
4 2 visual
5 2 visual
6 3 auditory
7 3 auditory
8 3 auditory
timeit analysis:
Benchmarking dataframe:
df = pd.DataFrame({'test_group': [1]*10_001 + [2]*10_001 + [3]*10_001,
'test_type' : [np.nan]*10_000 + ['memory'] +
[np.nan]*10_000 + ['visual'] +
[np.nan]*10_000 + ['auditory']})
df.shape
# (30003, 2)
Results:
# Ch3steR's answer
In [54]: %%timeit
...: repeats = df.groupby('test_group').size()
...: out = df[~df['test_type'].isna()]
...: out.reindex(out.index.repeat(repeats)).reset_index(drop=True)
...:
...:
2.56 ms ± 73.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# timgeb's answer
In [55]: %%timeit
...: df['test_type'] = df.groupby('test_group')['test_type'].fillna(method='ffill').fillna(method='bfill')
...:
...:
10.1 ms ± 724 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Almost 4X faster. I believe it's because boolean indexing is very fast, and reindex + repeat is lightweight compared to the dual fillna.
Under the assumption that there's a unique non-nan value per group, the following should satisfy your request.
>>> df['test_type'] = df.groupby('test_group')['test_type'].ffill().bfill()
>>> df
test_group test_type
0 1 memory
1 1 memory
2 1 memory
3 2 visual
4 2 visual
5 2 visual
6 3 auditory
7 3 auditory
8 3 auditory
edit:
The original answer used
df.groupby('test_group')['test_type'].fillna(method='ffill').fillna(method='bfill')
but according to schwim's timings, ffill/bfill is significantly faster (for some reason).
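Under the same one-non-NaN-per-group assumption, another option (a sketch of mine, not part of the timed comparisons) is GroupBy.transform with 'first', which broadcasts each group's first non-null value to every row of that group:
df['test_type'] = df.groupby('test_group')['test_type'].transform('first')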

Compute column from multiple previous rows in dataframes with conditionals

I'm starting to believe that pandas dataframes are much less intuitive to handle than Excel, but I'm not giving up yet!
So, I'm JUST trying to check data in the same column but in (various) previous rows using the .shift() method. I'm using the following DF as an example since the original is too complicated to copy into here, but the principle is the same.
counter = list(range(20))
df1 = pd.DataFrame(counter, columns=["Counter"])
df1["Event"] = [True, False, False, False, False, False, True, False,False,False,False,False,False,False,False,False,False,False,False,True]
I'm trying to create sums of the column counter, but only under the following conditions:
If the "Event" = True I want to sum the "Counter" values for the last 10 previous rows before the event happened.
EXCEPT if there is another Event within those 10 previous rows. In this case I only want to sum up the counter values between those two events (without exceeding 10 rows).
To clarify my goal this is the result I had in mind:
My attempt so far looks like this:
for index, row in df1.iterrows():
    if row["Event"] == True:
        counter = 1
        summ = 0
        while counter < 10 and row["Event"].shift(counter) == False:
            summ += row["Counter"].shift(counter)
            counter += 1
        else:
            df1.at[index, "Sum"] = summ
I'm trying to first find Event == True and from there start iterating backwards with a counter and summing up the counters as I go. However it seems to have a problem with shift:
AttributeError: 'bool' object has no attribute 'shift'
Please shatter my believes and show me, that Excel isn't actually superior.
We need to create a subgroup key with cumsum, then do a rolling sum:
n = 10
s = df1.Counter.groupby(df1.Event.iloc[::-1].cumsum()).\
      rolling(n+1, min_periods=1).sum().\
      reset_index(level=0, drop=True).where(df1.Event)
df1['sum'] = (s - df1.Counter).fillna(0)
df1
Counter Event sum
0 0 True 0.0
1 1 False 0.0
2 2 False 0.0
3 3 False 0.0
4 4 False 0.0
5 5 False 0.0
6 6 True 15.0
7 7 False 0.0
8 8 False 0.0
9 9 False 0.0
10 10 False 0.0
11 11 False 0.0
12 12 False 0.0
13 13 False 0.0
14 14 False 0.0
15 15 False 0.0
16 16 False 0.0
17 17 False 0.0
18 18 False 0.0
19 19 True 135.0
Element-wise approach
You definitely can approach a task in pandas the way you would in excel. Your approach needs to be tweaked a bit because pandas.Series.shift operates on whole arrays or Series, not on a single value - you can't use it just to move back up the dataframe relative to a value.
The following loops through the indices of your dataframe, walking back up (up to) 10 spots for each Event:
def create_sum_column_loop(df):
    '''
    Adds a Sum column with the rolling sum of 10 Counters prior to an Event
    '''
    df["Sum"] = 0
    for index in range(df.shape[0]):
        counter = 1
        summ = 0
        if df.loc[index, "Event"]:  # == True is implied
            for backup in range(1, 11):
                # handle case where index - backup is before
                # the start of the dataframe
                if index - backup < 0:
                    break
                # stop counting when we hit another event
                if df.loc[index - backup, "Event"]:
                    break
                # increment by the counter
                summ += df.loc[index - backup, "Counter"]
            df.loc[index, "Sum"] = summ
    return df
This does the job:
In [15]: df1_sum1 = create_sum_column_loop(df1.copy()) # copy to preserve original
In [16]: df1_sum1
Counter Event Sum
0 0 True 0
1 1 False 0
2 2 False 0
3 3 False 0
4 4 False 0
5 5 False 0
6 6 True 15
7 7 False 0
8 8 False 0
9 9 False 0
10 10 False 0
11 11 False 0
12 12 False 0
13 13 False 0
14 14 False 0
15 15 False 0
16 16 False 0
17 17 False 0
18 18 False 0
19 19 True 135
Better: vectorized operations
However, the power of pandas comes in its vectorized operations. Python is an interpreted, dynamically-typed language, meaning it's flexible, user friendly (easy to read/write/learn), and slow. To combat this, many commonly-used workflows, including many pandas.Series operations, are written in optimized, compiled code from other languages like C, C++, and Fortran. Under the hood, they're doing the same thing... df1.Counter.cumsum() does loop through the elements and create a running total, but it does it in C, making it lightning fast.
This is what makes learning a framework like pandas difficult - you need to relearn how to do math using that framework. For pandas, the entire game is learning how to use pandas and numpy built-in operators to do your work.
Borrowing the clever solution from @YOBEN_S:
def create_sum_column_vectorized(df):
    n = 10
    s = (
        df.Counter
        # group by a unique identifier for each event. This is a
        # particularly clever bit, where @YOBEN_S reverses
        # the order of df.Event, then computes a running total
        .groupby(df.Event.iloc[::-1].cumsum())
        # compute the rolling sum within each group
        .rolling(n+1, min_periods=1).sum()
        # drop the group index so we can align with the original DataFrame
        .reset_index(level=0, drop=True)
        # drop all non-event observations
        .where(df.Event)
    )
    # remove the counter value for the actual event
    # rows, then fill the remaining rows with 0s
    df['sum'] = (s - df.Counter).fillna(0)
    return df
We can see that the result is the same as the one above (though the values are suddenly floats):
In [23]: df1_sum2 = create_sum_column_vectorized(df1) # copy to preserve original
In [24]: df1_sum2
The difference comes in the performance. In ipython or jupyter we can use the %timeit command to see how long a statement takes to run:
In [25]: %timeit create_sum_column_loop(df1.copy())
3.21 ms ± 54.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [26]: %timeit create_sum_column_vectorized(df1.copy())
7.76 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For small datasets, like the one in your example, the difference will be negligible or will even slightly favor the pure python loop.
For much larger datasets, the difference becomes apparent. Let's create a dataset similar to your example, but with 100,000 rows:
In [27]: df_big = pd.DataFrame({
...: 'Counter': np.arange(100000),
...: 'Event': np.random.random(size=100000) > 0.9,
...: })
...:
Now, you can really see the performance benefit of the vectorized approach:
In [28]: %timeit create_sum_column_loop(df_big.copy())
13 s ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [29]: %timeit create_sum_column_vectorized(df_big.copy())
5.81 s ± 28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The vectorized version takes less than half the time. This difference will continue to widen as the amount of data increases.
Compiling your own workflows with numba
Note that for specific operations, it is possible to speed up operations further by pre-compiling the code yourself. In this case, the looped version can be compiled with numba:
import numba

@numba.jit(nopython=True)
def _inner_vectorized_loop(counter, event, sum_col):
    for index in range(len(counter)):
        summ = 0
        if event[index]:
            for backup in range(1, 11):
                # handle case where index - backup is before
                # the start of the dataframe
                if index - backup < 0:
                    break
                # stop counting when we hit another event
                if event[index - backup]:
                    break
                # increment by the counter
                summ = summ + counter[index - backup]
            sum_col[index] = summ
    return sum_col

def create_sum_column_loop_jit(df):
    '''
    Adds a Sum column with the rolling sum of 10 Counters prior to an Event
    '''
    df["Sum"] = 0
    df["Sum"] = _inner_vectorized_loop(
        df.Counter.values, df.Event.values, df.Sum.values)
    return df
This beats both pandas and the for loop by a factor of more than 1000!
In [90]: %timeit create_sum_column_loop_jit(df_big.copy())
1.62 ms ± 53.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Balancing readability, efficiency, and flexibility is the constant challenge. Best of luck as you dive in!

Efficient way of doing permutations with pandas over a large DataFrame

Currently I have a pandas DataFrame like this:
ID A1 A2 A3 B1 B2 B3
Ku8QhfS0n_hIOABXuE 6.343 6.304 6.410 6.287 6.403 6.279
fqPEquJRRlSVSfL.8A 6.752 6.681 6.680 6.677 6.525 6.739
ckiehnugOno9d7vf1Q 6.297 6.248 6.524 6.382 6.316 6.453
x57Vw5B5Fbt5JUnQkI 6.268 6.451 6.379 6.371 6.458 6.333
This DataFrame is used with a statistic which then requires a permutation test (EDIT: to be precise, random permutation). The indices of each column need to be shuffled (sampled) 100 times. To give an idea of the size, the number of rows can be around 50,000.
EDIT: The permutation is along the rows, i.e. shuffle the index for each column.
The biggest issue here is one of performance. I want to permute things in a fast way.
An example I had in mind was:
import random
import joblib
def permutation(dataframe):
    return dataframe.apply(random.sample, axis=1, k=len(dataframe))

permute = joblib.delayed(permutation)
pool = joblib.Parallel(n_jobs=-2)  # all cores minus 1
result = pool(permute(dataframe) for item in range(100))
The issue here is that by doing this, the test is not stable: apparently the permutation works, but it is not as "random" as it would without being done in parallel, and thus there's a loss of stability in the results when I use the permuted data in follow-up calculations.
So my only "solution" was to precalculate all indices for all columns prior to doing the paralel code, which slows things down considerably.
My questions are:
Is there a more efficient way to do this permutation? (not necessarily parallel)
Is the parallel approach (using multiple processes, not threads) feasible?
EDIT: To make things clearer, here's what should happen for example to column A1 after one shuffling:
Ku8QhfS0n_hIOABXuE 6.268
fqPEquJRRlSVSfL.8A 6.343
ckiehnugOno9d7vf1Q 6.752
x57Vw5B5Fbt5JUnQk 6.297
(i.e. the row values were moving around).
EDIT2: Here's what I'm using now:
def _generate_indices(indices, columns, nperm):
    random.seed(1234567890)
    num_genes = indices.size
    for item in range(nperm):
        permuted = pandas.DataFrame(
            {column: random.sample(genes, num_genes) for column in columns},
            index=range(genes.size)
        )
        yield permuted
(in short, building a DataFrame of resampled indices for each column)
And later on (yes, I know it's pretty ugly):
# Data is the original DataFrame
# Indices one of the results of that generator
permuted = dict()
for column in data.columns:
    value = data[column]
    permuted[column] = value[indices[column].values].values
permuted_table = pandas.DataFrame(permuted, index=data.index)
How about this:
In [1]: import numpy as np; import pandas as pd
In [2]: df = pd.DataFrame(np.random.randn(50000, 10))
In [3]: def shuffle(df, n):
   ....:     for i in range(n):
   ....:         np.random.shuffle(df.values)
   ....:     return df
In [4]: df.head()
Out[4]:
0 1 2 3 4 5 6 7 8 9
0 0.329588 -0.513814 -1.267923 0.691889 -0.319635 -1.468145 -0.441789 0.004142 -0.362073 -0.555779
1 0.495670 2.460727 1.174324 1.115692 1.214057 -0.843138 0.217075 0.495385 1.568166 0.252299
2 -0.898075 0.994281 -0.281349 -0.104684 -1.686646 0.651502 -1.466679 -1.256705 1.354484 0.626840
3 1.158388 -1.227794 -0.462005 -1.790205 0.399956 -1.631035 -1.707944 -1.126572 -0.892759 1.396455
4 -0.049915 0.006599 -1.099983 0.775028 -0.694906 -1.376802 -0.152225 1.413212 0.050213 -0.209760
In [5]: shuffle(df, 1).head(5)
Out[5]:
0 1 2 3 4 5 6 7 8 9
0 2.044131 0.072214 -0.304449 0.201148 1.462055 0.538476 -0.059249 -0.133299 2.925301 0.529678
1 0.036957 0.214003 -1.042905 -0.029864 1.616543 0.840719 0.104798 -0.766586 -0.723782 -0.088239
2 -0.025621 0.657951 1.132175 -0.815403 0.548210 -0.029291 0.575587 0.032481 -0.261873 0.010381
3 1.396024 0.859455 -1.514801 0.353378 1.790324 0.286164 -0.765518 1.363027 -0.868599 -0.082818
4 -0.026649 -0.090119 -2.289810 -0.701342 -0.116262 -0.674597 -0.580760 -0.895089 -0.663331 0.
In [6]: %timeit shuffle(df, 100)
Out[6]:
1 loops, best of 3: 14.4 s per loop
This does what you need it to. The only question is whether or not it is fast enough.
Update
Per the comments by @Einar I have changed my solution.
In [7]: def shuffle2(df, n):
            ind = df.index
            for i in range(n):
                sampler = np.random.permutation(df.shape[0])
                new_vals = df.take(sampler).values
                df = pd.DataFrame(new_vals, index=ind)
            return df
In [8]: df.head()
Out[8]:
0 1 2 3 4 5 6 7 8 9
0 -0.175006 -0.462306 0.565517 -0.309398 1.100570 0.656627 1.207535 -0.221079 -0.933068 -0.192759
1 0.388165 0.155480 -0.015188 0.868497 1.102662 -0.571818 -0.994005 0.600943 2.205520 -0.294121
2 0.281605 -1.637529 2.238149 0.987409 -1.979691 -0.040130 1.121140 1.190092 -0.118919 0.790367
3 1.054509 0.395444 1.239756 -0.439000 0.146727 -1.705972 0.627053 -0.547096 -0.818094 -0.056983
4 0.209031 -0.233167 -1.900261 -0.678022 -0.064092 -1.562976 -1.516468 0.512461 1.058758 -0.206019
In [9]: shuffle2(df, 1).head()
Out[9]:
0 1 2 3 4 5 6 7 8 9
0 0.054355 0.129432 -0.805284 -1.713622 -0.610555 -0.874039 -0.840880 0.593901 0.182513 -1.981521
1 0.624562 1.097495 -0.428710 -0.133220 0.675428 0.892044 0.752593 -0.702470 0.272386 -0.193440
2 0.763551 -0.505923 0.206675 0.561456 0.441514 -0.743498 -1.462773 -0.061210 -0.435449 -2.677681
3 1.149586 -0.003552 2.496176 -0.089767 0.246546 -1.333184 0.524872 -0.527519 0.492978 -0.829365
4 -1.893188 0.728737 0.361983 -0.188709 -0.809291 2.093554 0.396242 0.402482 1.884082 1.373781
In [10]: %timeit shuffle2(df, 100)
1 loops, best of 3: 2.47 s per loop
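For completeness, here is a sketch of my own (not part of the original answers) that permutes each column independently, which is what the question's EDIT describes, using NumPy's Generator API:
import numpy as np
import pandas as pd

rng = np.random.default_rng(1234567890)  # one seeded generator for reproducibility

def shuffle_columns(df):
    # permute the values of each column independently,
    # keeping the original index and column labels
    return pd.DataFrame(
        {col: rng.permutation(df[col].to_numpy()) for col in df.columns},
        index=df.index,
    )

permutations = [shuffle_columns(df) for _ in range(100)]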
