Compute column from multiple previous rows in dataframes with conditionals - python

I'm starting to believe that pandas dataframes are much less intuitive to handle than Excel, but I'm not giving up yet!
So, I'm JUST trying to check data in the same column but in (various) previous rows using the .shift() method. I'm using the following DF as an example since the original is too complicated to copy into here, but the principle is the same.
counter = list(range(20))
df1 = pd.DataFrame(counter, columns=["Counter"])
df1["Event"] = [True, False, False, False, False, False, True, False,False,False,False,False,False,False,False,False,False,False,False,True]
I'm trying to create sums of the column counter, but only under the following conditions:
If the "Event" = True I want to sum the "Counter" values for the last 10 previous rows before the event happened.
EXCEPT if there is another Event within those 10 previous rows. In this case I only want to sum up the counter values between those two events (without exceeding 10 rows).
To clarify my goal this is the result I had in mind:
My attempt so far looks like this:
for index, row in df1.iterrows():
    if row["Event"] == True:
        counter = 1
        summ = 0
        while counter < 10 and row["Event"].shift(counter) == False:
            summ += row["Counter"].shift(counter)
            counter += 1
        else:
            df1.at[index, "Sum"] = summ
I'm trying to first find Event == True and from there start iterating backwards with a counter and summing up the counters as I go. However, it seems to have a problem with shift:
AttributeError: 'bool' object has no attribute 'shift'
Please shatter my beliefs and show me that Excel isn't actually superior.

We need to create a subgroup key with cumsum, then do a rolling sum:
n = 10
s = df1.Counter.groupby(df1.Event.iloc[::-1].cumsum()).\
        rolling(n+1, min_periods=1).sum().\
        reset_index(level=0, drop=True).where(df1.Event)
df1['sum'] = (s - df1.Counter).fillna(0)
df1
Counter Event sum
0 0 True 0.0
1 1 False 0.0
2 2 False 0.0
3 3 False 0.0
4 4 False 0.0
5 5 False 0.0
6 6 True 15.0
7 7 False 0.0
8 8 False 0.0
9 9 False 0.0
10 10 False 0.0
11 11 False 0.0
12 12 False 0.0
13 13 False 0.0
14 14 False 0.0
15 15 False 0.0
16 16 False 0.0
17 17 False 0.0
18 18 False 0.0
19 19 True 135.0

Element-wise approach
You definitely can approach a task in pandas the way you would in Excel. Your approach needs to be tweaked a bit because pandas.Series.shift operates on whole arrays or Series, not on a single value - you can't use it just to move back up the dataframe relative to a value.
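For example (a quick illustration, separate from the answer code below), shift moves a whole Series at once and returns a new Series; a single cell pulled out of a row is just a plain Python/numpy scalar, so it has no .shift method:

import pandas as pd

s = pd.Series([10, 20, 30])
print(s.shift(1))     # the whole column moved down one row: NaN, 10.0, 20.0
cell = s.iloc[1]      # a plain scalar (20) - calling cell.shift(1) raises AttributeError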
The following loops through the indices of your dataframe, walking back up (up to) 10 spots for each Event:
def create_sum_column_loop(df):
    '''
    Adds a Sum column with the rolling sum of 10 Counters prior to an Event
    '''
    df["Sum"] = 0

    for index in range(df.shape[0]):
        counter = 1
        summ = 0

        if df.loc[index, "Event"]:  # == True is implied
            for backup in range(1, 11):
                # handle case where index - backup is before
                # the start of the dataframe
                if index - backup < 0:
                    break

                # stop counting when we hit another event
                if df.loc[index - backup, "Event"]:
                    break

                # increment by the counter
                summ += df.loc[index - backup, "Counter"]

            df.loc[index, "Sum"] = summ

    return df
This does the job:
In [15]: df1_sum1 = create_sum_column_loop(df1.copy())  # copy to preserve original
In [16]: df1_sum1
Counter Event Sum
0 0 True 0
1 1 False 0
2 2 False 0
3 3 False 0
4 4 False 0
5 5 False 0
6 6 True 15
7 7 False 0
8 8 False 0
9 9 False 0
10 10 False 0
11 11 False 0
12 12 False 0
13 13 False 0
14 14 False 0
15 15 False 0
16 16 False 0
17 17 False 0
18 18 False 0
19 19 True 135
Better: vectorized operations
However, the power of pandas comes in its vectorized operations. Python is an interpreted, dynamically-typed language, meaning it's flexible, user friendly (easy to read/write/learn), and slow. To combat this, many commonly-used workflows, including many pandas.Series operations, are written in optimized, compiled code from other languages like C, C++, and Fortran. Under the hood, they're doing the same thing... df1.Counter.cumsum() does loop through the elements and create a running total, but it does it in C, making it lightning fast.
This is what makes learning a framework like pandas difficult - you need to relearn how to do math using that framework. For pandas, the entire game is learning how to use pandas and numpy built-in operators to do your work.
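As a rough illustration of the same computation done both ways (a sketch, not part of the original answer; exact timings will vary by machine):

import numpy as np
import pandas as pd

s = pd.Series(np.arange(1_000_000))

# interpreted Python: walk over a million elements one at a time
def running_total(values):
    total, out = 0, []
    for v in values:
        total += v
        out.append(total)
    return out

slow = running_total(s)   # pure-python loop
fast = s.cumsum()         # the same running total, computed in compiled code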
Borrowing the clever solution from @YOBEN_S:
def create_sum_column_vectorized(df):
    n = 10

    s = (
        df.Counter
        # group by a unique identifier for each event. This is a
        # particularly clever bit, where @YOBEN_S reverses
        # the order of df.Event, then computes a running total
        .groupby(df.Event.iloc[::-1].cumsum())

        # compute the rolling sum within each group
        .rolling(n+1, min_periods=1).sum()

        # drop the group index so we can align with the original DataFrame
        .reset_index(level=0, drop=True)

        # drop all non-event observations
        .where(df.Event)
    )

    # remove the counter value for the actual event
    # rows, then fill the remaining rows with 0s
    df['sum'] = (s - df.Counter).fillna(0)

    return df
We can see that the result is the same as the one above (though the values are suddenly floats):
In [23]: df1_sum2 = create_sum_column_vectorized(df1.copy())  # copy to preserve original
In [24]: df1_sum2
The difference comes in the performance. In IPython or Jupyter we can use the %timeit magic to see how long a statement takes to run:
In [25]: %timeit create_sum_column_loop(df1.copy())
3.21 ms ± 54.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [26]: %timeit create_sum_column_vectorized(df1.copy())
7.76 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For small datasets, like the one in your example, the difference will be negligible or will even slightly favor the pure python loop.
For much larger datasets, the difference becomes apparent. Let's create a dataset similar to your example, but with 100,000 rows:
In [27]: df_big = pd.DataFrame({
...: 'Counter': np.arange(100000),
...: 'Event': np.random.random(size=100000) > 0.9,
...: })
...:
Now, you can really see the performance benefit of the vectorized approach:
In [28]: %timeit create_sum_column_loop(df_big.copy())
13 s ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [29]: %timeit create_sum_column_vectorized(df_big.copy())
5.81 s ± 28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The vectorized version takes less than half the time. This difference will continue to widen as the amount of data increases.
Compiling your own workflows with numba
Note that for specific operations, it is possible to speed things up further by pre-compiling the code yourself. In this case, the looped version can be compiled with numba:
import numba

@numba.jit(nopython=True)
def _inner_vectorized_loop(counter, event, sum_col):
    for index in range(len(counter)):
        summ = 0

        if event[index]:
            for backup in range(1, 11):
                # handle case where index - backup is before
                # the start of the dataframe
                if index - backup < 0:
                    break

                # stop counting when we hit another event
                if event[index - backup]:
                    break

                # increment by the counter
                summ = summ + counter[index - backup]

            sum_col[index] = summ

    return sum_col

def create_sum_column_loop_jit(df):
    '''
    Adds a Sum column with the rolling sum of 10 Counters prior to an Event
    '''
    df["Sum"] = 0
    df["Sum"] = _inner_vectorized_loop(
        df.Counter.values, df.Event.values, df.Sum.values)
    return df
This beats both pandas and the for loop by a factor of more than 1000!
In [90]: %timeit create_sum_column_loop_jit(df_big.copy())
1.62 ms ± 53.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Balancing readability, efficiency, and flexibility is the constant challenge. Best of luck as you dive in!

Related

Add column to dataframe that has each row's duplicate count value takes too long

I've read SOF posts on how to create a field that contains the number of duplicates that row contains in a pandas DataFrame. Without using any other libraries, I tried writing a function that does this, and it works on small DataFrame objects; however, it takes way too long on larger ones and consumes too much memory.
This is the function:
def count_duplicates(dataframe):
    function = lambda x: dataframe.to_numpy().tolist().count(x.to_list()) - 1
    return dataframe.apply(function, axis=1)
I ran dir() on the numpy array returned by DataFrame.to_numpy, and I didn't see a method quite like list.count. The reason why this takes so long is because for each row, it needs to compare the row with all of the rows in the numpy array. I'd like a much more efficient way to do this, even if it's not using a pandas DataFrame. I feel like there should be a simple way to do this with numpy, but I'm just not familiar enough. I've been testing different approaches for a while and it's resulting in a lot of errors. I'm going to keep testing different approaches, but felt the community might provide a better way.
Thank you for your help.
Here is an example DataFrame:
one two
0 1 1
1 2 2
2 3 3
3 1 1
I'd use it like this:
d['duplicates'] = count_duplicates(d)
The resulting DataFrame is:
one two duplicates
0 1 1 1
1 2 2 0
2 3 3 0
3 1 1 1
The problem is the actual DataFrame will have 1.4 million rows, and each lambda takes an average of 0.148558 seconds, which if multiplied by 1.4 million rows is about 207981.459 seconds or 57.772 hours. I need a much faster way to accomplish this.
Thank you again.
I updated the function which is speeding things up:
def _counter(series_to_count, list_of_lists):
    return list_of_lists.count(series_to_count.to_list()) - 1

def count_duplicates(dataframe):
    df_list = dataframe.to_numpy().tolist()
    return dataframe.apply(_counter, args=(df_list,), axis=1)
This takes only 29.487 seconds. The bottleneck was converting the dataframe on each function call.
I'm still interested in optimizing this. I'd like to get this down to 2-3 seconds if at all possible. It may not be, but I'd like to make sure it is as fast as possible.
Thank you again.
Here is a vectorized way to do this. For 1.4 million rows, with an average of 140 duplicates for each row, it takes under 0.05 seconds. When there are no duplicates at all, it takes about 0.4 second.
d['duplicates'] = d.groupby(['one', 'two'], sort=False)['one'].transform('size') - 1
On your example:
>>> d
one two duplicates
0 1 1 1
1 2 2 0
2 3 3 0
3 1 1 1
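If the real 1.4-million-row frame has more columns than one and two, the same idea generalizes without hard-coding column names (a sketch; the column selected after the groupby is arbitrary, it only supplies a Series for transform to fill):

key_cols = list(d.columns)   # compute the key columns before 'duplicates' is added
d['duplicates'] = d.groupby(key_cols, sort=False)[key_cols[0]].transform('size') - 1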
Speed
Relatively high rate of duplicates:
n = 1_400_000
d = pd.DataFrame(np.random.randint(0, 100, size=(n, 2)), columns='one two'.split())
%timeit d.groupby(['one', 'two'], sort=False)['one'].transform('size') - 1
# 48.3 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# how many duplicates on average?
>>> (d.groupby(['one', 'two'], sort=False)['one'].transform('size') - 1).mean()
139.995841
# (as expected: n / 100**2)
No duplicates
n = 1_400_000
d = pd.DataFrame(np.arange(2 * n).reshape(-1, 2), columns='one two'.split())
%timeit d.groupby(['one', 'two'], sort=False)['one'].transform('size') - 1
# 389 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Is there a way to run this Python snippet faster?

from collections import defaultdict

dct = defaultdict(list)
for n in range(len(res)):
    for i in indices_ordered:
        dct[i].append(res[n][i])
Note that res is a list of 5000 pandas Series, and indices_ordered is a list of 20000 strings. It takes 23 minutes on my Mac (2.3 GHz Intel Core i5 and 16 GB 2133 MHz LPDDR3) to run this code. I am pretty new to Python, but I feel a more clever coding (maybe less looping) would help a lot.
Edit:
Here is an example of how to create data (res and indices_ordered) to be able to run the above snippet (which is slightly changed to access the value directly rather than by field name, since I could not find how to construct a Series with a named field inline)
import random, string, pandas
index_sz = 20000
res_sz = 5000
indices_ordered = [''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10)) for i in range(index_sz)]
res = [pandas.Series([random.randint(0,10) for i in range(index_sz)], index = random.sample(indices_ordered, index_sz)) for i in range(res_sz)]
The issue here is that you iterate over indices_ordered for every single value. Just drop indices_ordered. Stripping it way back in orders of magnitude to test the timings:
import random
import string
import numpy as np
import pandas as pd
from collections import defaultdict

index_sz = 200
res_sz = 50

indices_ordered = [''.join(random.choice(string.ascii_uppercase + string.digits)
                           for _ in range(10)) for i in range(index_sz)]

res = [pd.Series([random.randint(0, 10) for i in range(index_sz)],
                 index=random.sample(indices_ordered, index_sz))
       for i in range(res_sz)]

def your_way(res, indices_ordered):
    dct = defaultdict(list)
    for n in range(len(res)):
        for i in indices_ordered:
            dct[i].append(res[n][i])

def my_way(res):
    dct = defaultdict(list)
    for item in res:
        for string_item, value in item.iteritems():  # .items() in pandas >= 2.0
            dct[string_item].append(value)
Gives:
%timeit your_way(res, indices_ordered)
160 ms ± 5.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit my_way(res)
6.79 ms ± 47.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This reduces the time complexity of the whole approach because you don't keep going through indices_ordered each time and assigning values, so the difference will become much more stark as the size of the data grows.
Just increasing one order of magnitude:
index_sz = 2000
res_sz = 500
Gives:
%timeit your_way(res, indices_ordered)
17.8 s ± 999 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit my_way(res)
543 ms ± 9.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
EDIT: Now that testing data is available, it is clear that the changes below have no effect on run-time. The described techniques are only effective when the inner loop is very efficient (on the order of 5-10 dict lookups), making it more efficient still by removing some of the said lookups. Here the r[i] item lookup dwarfs anything else by orders of magnitude, so the optimizations are simply irrelevant.
Your outer loop takes 5000 iterations, and your inner loop 20000 iterations. This means that you are executing 100 million iterations in 23 minutes, i.e. that each iteration takes 13.8 μs. That is not fast, even in Python.
I would try to cut down the run-time by stripping any unnecessary work from the inner loop. Specifically:
convert for n in range(len(res)) followed by res[n] to for r in res. I don't know how efficient item lookup is in pandas, but it's better to do it in the outer than in the inner loop.
move the score attribute lookup to the outer loop.
get rid of defaultdict and pre-create the lists and use an ordinary dict.
avoid dict stores at all and work on the lists directly, pre-creating them in a sequence. Only create a dictionary at the end.
cache the lookup of the append list method, and prepare in advance the (append, i) pairs that the inner loop needs.
Here is code that implements the above suggestions:
# pre-create the lists
lsts = [[] for _ in range(len(indices_ordered))]

# prepare the pairs (appendfn, i)
fast_append = [(l.append, i)
               for (l, i) in zip(lsts, indices_ordered)]

for r in res:
    # pre-fetch res[n].score
    r_score = r.score
    for append, i in fast_append:
        append(r_score[i])

# finally, create the dict out of the lists
dct = {i: lst for (i, lst) in zip(indices_ordered, lsts)}
You really should use a DataFrame.
Here's a way to create the data directly:
import pandas as pd
import numpy as np
import random
import string
index_sz = 3
res_sz = 10
indices_ordered = [''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(3)) for i in range(index_sz)]
df = pd.DataFrame(np.random.randint(10, size=(res_sz, index_sz)), columns=indices_ordered)
There's no need to sort or index anything. A DataFrame can basically be accessed as an array or as a dict.
It should be much faster than juggling with defaultdicts, lists and Series.
df now looks like:
>>> df
7XQ VTV 38Y
0 6 9 5
1 5 5 4
2 6 0 7
3 0 0 8
4 7 8 9
5 8 6 4
6 2 4 9
7 3 2 2
8 7 6 0
9 8 0 1
>>> df['7XQ']
0 6
1 5
2 6
3 0
4 7
5 8
6 2
7 3
8 7
9 8
Name: 7XQ, dtype: int64
>>> df['7XQ'][:5]
0 6
1 5
2 6
3 0
4 7
Name: 7XQ, dtype: int64
With the original size, this script outputs a 5000 rows × 20000 columns DataFrame
in less than 3 seconds on my laptop.
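If the res list of Series from the question already exists, the same DataFrame can be built from it directly, and the dictionary produced by the original loop falls out of to_dict (a sketch, assuming every Series in res carries the same set of index labels):

df = pd.DataFrame(res)             # one row per Series; columns aligned by index label
dct = df.to_dict(orient='list')    # {label: [res[0][label], res[1][label], ...]}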
Use pandas magic (with 2 lines of code) on your input list of pd.Series objects:
all_data = pd.concat([*res])
d = all_data.groupby(all_data.index).apply(list).to_dict()
Implied actions:
pd.concat([*res]) - concatenates all series into a single one preserving indices of each series object (pandas.concat)
all_data.groupby(all_data.index).apply(list).to_dict() - groups values by index label over all_data.index, puts each group's values into a list with .apply(list), and finally converts the grouped result into a dictionary with .to_dict() (pandas.Series.groupby)
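A tiny illustration of what those two lines produce (made-up data, not from the question):

import pandas as pd

res = [pd.Series([1, 2], index=['a', 'b']),
       pd.Series([3, 4], index=['b', 'a'])]

all_data = pd.concat([*res])          # values 1, 2, 3, 4 with index a, b, b, a
dct = all_data.groupby(all_data.index).apply(list).to_dict()
print(dct)                            # {'a': [1, 4], 'b': [2, 3]}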

In pandas, how do I only do a calculation for a specific subset of indices?

I have a Pandas dataframe X with two columns, 'report_me' and 'n'. I want to get a list (or series) that contains, for each element X where report_me is true, the sum of the n values for the previous two elements of the dataframe (regardless of their report_me values). For instance, if the data frame is:
X = pd.DataFrame({"report_me": [False, False, False, True, False,
                                False, True, False, False, False],
                  "n": range(10)})
then I want the result (3, 9).
One way to do this is:
sums = X['n'].shift(1) + X['n'].shift(2)
display(sums[X["report_me"]])
but this is slow because it computes the values of sums for all the indices, not just the ones that are going to be reported. One could also try filtering by report_me first:
reported = X[X["report_me"]]
display(reported["n"].shift(1) + reported["n"].shift(2))
but this gives the wrong answer because now you are getting rid of the previous values that you would be using to compute sums. Is there a way to do this that doesn't do unnecessary work?
If report_me is sparse, you might gain some speed using a numpy solution as follows:
# find the index where report_me is True
idx = np.where(X.report_me.values)
# find previous two indices when report_me is True, subset the value from n, and sum
X.n.values[idx - np.arange(1,3)[:,None]].sum(axis=0)
You might need some extra logic to handle edge cases, for example an event that occurs in the first two rows of the frame.
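For instance, one way to guard against an event in the first two rows, where a negative position would silently wrap around to the end of the array (a sketch, assuming such events should simply be skipped):

import numpy as np

idx = np.where(X.report_me.values)[0]   # positions of the True rows
idx = idx[idx >= 2]                     # drop events without two full rows of history
sums = X.n.values[idx - np.arange(1, 3)[:, None]].sum(axis=0)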
Timing:
%%timeit
idx = np.where(X.report_me.values)
X.n.values[idx - np.arange(1,3)[:,None]].sum(axis=0)
# 10000 loops, best of 3: 23 µs per loop
%timeit X.rolling(2).n.sum().shift()[X.report_me]
#1000 loops, best of 3: 684 µs per loop
%%timeit
sums = X['n'].shift(1) + X['n'].shift(2)
sums[X["report_me"]]
# 1000 loops, best of 3: 704 µs per loop
X['report_sum'] = (X.loc[X.report_me]
                    .apply(lambda x: X.iloc[[x.name-1, x.name-2]].n.sum(),
                           axis=1))
n report_me report_sum
0 0 False NaN
1 1 False NaN
2 2 False NaN
3 3 True 3.0
4 4 False NaN
5 5 False NaN
6 6 True 9.0
7 7 False NaN
8 8 False NaN
9 9 False NaN
If you just want the non-NaN values, take .values from the right-hand side of the assignment statement.

How can I vectorize a function that uses lagged values of its own output?

I'm sorry for the poor phrasing of the question, but it was the best I could do.
I know exactly what I want, but not exactly how to ask for it.
Here is the logic demonstrated by an example:
Two conditions that take on the values 1 or 0 trigger a signal that also takes on the values 1 or 0. Condition A triggers the signal (if A = 1 then signal = 1) no matter what. Condition B does NOT trigger the signal on its own, but the signal stays triggered if condition B stays equal to 1 after the signal has previously been triggered by condition A. The signal goes back to 0 only after both A and B have gone back to 0. In other words: signal[t] = 1 if A[t] = 1, or if signal[t-1] = 1 and B[t] = 1; otherwise signal[t] = 0.
The input, the desired output (signal_d), a for-loop solution that confirms the logic can be solved iteratively (signal_l), and my attempt using numpy.where() (Signal_v1) are all contained in the reproducible snippet below:
# Settings
import numpy as np
import pandas as pd
import datetime

# Data frame with input and desired output in column signal_d
df = pd.DataFrame({'condition_A':list('00001100000110'),
                   'condition_B':list('01110011111000'),
                   'signal_d':   list('00001111111110')})
colnames = list(df)
df[colnames] = df[colnames].apply(pd.to_numeric)
datelist = pd.date_range(pd.datetime.today().strftime('%Y-%m-%d'), periods=14).tolist()
df['dates'] = datelist
df = df.set_index(['dates'])

# Solution using a for loop with nested ifs in column signal_l
df['signal_l'] = df['condition_A'].copy(deep = True)

i = 0
for observations in df['signal_l']:
    if df.ix[i,'condition_A'] == 1:
        df.ix[i,'signal_l'] = 1
    else:
        # Signal previously triggered by condition_A
        # AND kept "alive" by condition_B:
        if df.ix[i - 1,'signal_l'] & df.ix[i,'condition_B'] == 1:
            df.ix[i,'signal_l'] = 1
        else:
            df.ix[i,'signal_l'] = 0
    i = i + 1

# My attempt with np.where in column signal_v1
df['Signal_v1'] = df['condition_A'].copy()
df['Signal_v1'] = np.where(df.condition_A == 1, 1, np.where( (df.shift(1).Signal_v1 == 1) & (df.condition_B == 1), 1, 0))

print(df)
This is pretty straightforward using a for loop with lagged values and nested if statements, but I can't figure it out using vectorized functions like numpy.where(). And I know this would be much faster for bigger data frames.
Thank you for any suggestions!
I don't think there is a way to vectorize this operation that will be significantly faster than a Python loop. (At least, not if you want to stick with just Python, pandas and numpy.)
However, you can improve the performance of this operation by simplifying your code. Your implementation uses if statements and a lot of DataFrame indexing. These are relatively costly operations.
Here's a modification of your script that includes two functions: add_signal_l(df) and add_lagged(df). The first is your code, just wrapped up in a function. The second uses a simpler function to achieve the same result--still a Python loop, but it uses numpy arrays and bitwise operators.
import numpy as np
import pandas as pd
import datetime

#-----------------------------------------------------------------------
# Create the test DataFrame

# Data frame with input and desired output in column signal_d
df = pd.DataFrame({'condition_A':list('00001100000110'),
                   'condition_B':list('01110011111000'),
                   'signal_d':   list('00001111111110')})
colnames = list(df)
df[colnames] = df[colnames].apply(pd.to_numeric)
datelist = pd.date_range(pd.datetime.today().strftime('%Y-%m-%d'), periods=14).tolist()
df['dates'] = datelist
df = df.set_index(['dates'])

#-----------------------------------------------------------------------

def add_signal_l(df):
    # Solution using a for loop with nested ifs in column signal_l
    df['signal_l'] = df['condition_A'].copy(deep = True)
    i = 0
    for observations in df['signal_l']:
        if df.ix[i,'condition_A'] == 1:
            df.ix[i,'signal_l'] = 1
        else:
            # Signal previously triggered by condition_A
            # AND kept "alive" by condition_B:
            if df.ix[i - 1,'signal_l'] & df.ix[i,'condition_B'] == 1:
                df.ix[i,'signal_l'] = 1
            else:
                df.ix[i,'signal_l'] = 0
        i = i + 1

def compute_lagged_signal(a, b):
    x = np.empty_like(a)
    x[0] = a[0]
    for i in range(1, len(a)):
        x[i] = a[i] | (x[i-1] & b[i])
    return x

def add_lagged(df):
    df['lagged'] = compute_lagged_signal(df['condition_A'].values, df['condition_B'].values)
Here's a comparison of the timing of the two function, run in an IPython session:
In [85]: df
Out[85]:
condition_A condition_B signal_d
dates
2017-06-09 0 0 0
2017-06-10 0 1 0
2017-06-11 0 1 0
2017-06-12 0 1 0
2017-06-13 1 0 1
2017-06-14 1 0 1
2017-06-15 0 1 1
2017-06-16 0 1 1
2017-06-17 0 1 1
2017-06-18 0 1 1
2017-06-19 0 1 1
2017-06-20 1 0 1
2017-06-21 1 0 1
2017-06-22 0 0 0
In [86]: %timeit add_signal_l(df)
8.45 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [87]: %timeit add_lagged(df)
137 µs ± 581 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
As you can see, add_lagged(df) is much faster.
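If even that is not fast enough, the same recurrence can be JIT-compiled with numba, like the numba example earlier on this page (a sketch, assuming numba is installed; the loop body is unchanged):

import numba
import numpy as np

@numba.jit(nopython=True)
def compute_lagged_signal_jit(a, b):
    x = np.empty_like(a)
    x[0] = a[0]
    for i in range(1, len(a)):
        x[i] = a[i] | (x[i-1] & b[i])
    return x

def add_lagged_jit(df):
    df['lagged'] = compute_lagged_signal_jit(df['condition_A'].values,
                                             df['condition_B'].values)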

pandas rolling computation with window based on values instead of counts

I'm looking for a way to do something like the various rolling_* functions of pandas, but I want the window of the rolling computation to be defined by a range of values (say, a range of values of a column of the DataFrame), not by the number of rows in the window.
As an example, suppose I have this data:
>>> print d
RollBasis ToRoll
0 1 1
1 1 4
2 1 -5
3 2 2
4 3 -4
5 5 -2
6 8 0
7 10 -13
8 12 -2
9 13 -5
If I do something like rolling_sum(d, 5), I get a rolling sum in which each window contains 5 rows. But what I want is a rolling sum in which each window contains a certain range of values of RollBasis. That is, I'd like to be able to do something like d.roll_by(sum, 'RollBasis', 5), and get a result where the first window contains all rows whose RollBasis is between 1 and 5, then the second window contains all rows whose RollBasis is between 2 and 6, then the third window contains all rows whose RollBasis is between 3 and 7, etc. The windows will not have equal numbers of rows, but the range of RollBasis values selected in each window will be the same. So the output should be like:
>>> d.roll_by(sum, 'RollBasis', 5)
1 -4 # sum of elements with 1 <= Rollbasis <= 5
2 -4 # sum of elements with 2 <= Rollbasis <= 6
3 -6 # sum of elements with 3 <= Rollbasis <= 7
4 -2 # sum of elements with 4 <= Rollbasis <= 8
# etc.
I can't do this with groupby, because groupby always produces disjoint groups. I can't do it with the rolling functions, because their windows always roll by number of rows, not by values. So how can I do it?
I think this does what you want:
In [1]: df
Out[1]:
RollBasis ToRoll
0 1 1
1 1 4
2 1 -5
3 2 2
4 3 -4
5 5 -2
6 8 0
7 10 -13
8 12 -2
9 13 -5
In [2]: def f(x):
   ...:     ser = df.ToRoll[(df.RollBasis >= x) & (df.RollBasis < x+5)]
   ...:     return ser.sum()
The above function takes a value x, in this case a RollBasis value, and uses it to index the ToRoll column of the data frame. The returned series consists of the ToRoll values whose RollBasis falls in the range [x, x+5). Finally, that series is summed and returned.
In [3]: df['Rolled'] = df.RollBasis.apply(f)
In [4]: df
Out[4]:
RollBasis ToRoll Rolled
0 1 1 -4
1 1 4 -4
2 1 -5 -4
3 2 2 -4
4 3 -4 -6
5 5 -2 -2
6 8 0 -15
7 10 -13 -20
8 12 -2 -7
9 13 -5 -5
Code for the toy example DataFrame in case someone else wants to try:
In [1]: from pandas import *
In [2]: import io
In [3]: text = """\
...: RollBasis ToRoll
...: 0 1 1
...: 1 1 4
...: 2 1 -5
...: 3 2 2
...: 4 3 -4
...: 5 5 -2
...: 6 8 0
...: 7 10 -13
...: 8 12 -2
...: 9 13 -5
...: """
In [4]: df = read_csv(io.BytesIO(text), header=0, index_col=0, sep='\s+')
Based on Zelazny7's answer, I created this more general solution:
def rollBy(what, basis, window, func):
    def applyToWindow(val):
        chunk = what[(val <= basis) & (basis < val + window)]
        return func(chunk)
    return basis.apply(applyToWindow)
>>> rollBy(d.ToRoll, d.RollBasis, 5, sum)
0 -4
1 -4
2 -4
3 -4
4 -6
5 -2
6 -15
7 -20
8 -7
9 -5
Name: RollBasis
It's still not ideal as it is very slow compared to rolling_apply, but perhaps this is inevitable.
Based on BrenBarn's answer, but sped up by using label-based indexing rather than boolean-based indexing:
def rollBy(what, basis, window, func, *args, **kwargs):
    # note that basis must be sorted in order for this to work properly
    indexed_what = pd.Series(what.values, index=basis.values)
    def applyToWindow(val):
        # using slice_indexer rather than what.loc[val:val+window] allows
        # window limits that are not specifically in the index
        indexer = indexed_what.index.slice_indexer(val, val+window, 1)
        chunk = indexed_what[indexer]
        return func(chunk, *args, **kwargs)
    rolled = basis.apply(applyToWindow)
    return rolled
This is much faster than not using an indexed column:
In [46]: df = pd.DataFrame({"RollBasis":np.random.uniform(0,1000000,100000), "ToRoll": np.random.uniform(0,10,100000)})
In [47]: df = df.sort("RollBasis")
In [48]: timeit("rollBy_Ian(df.ToRoll,df.RollBasis,10,sum)",setup="from __main__ import rollBy_Ian,df", number =3)
Out[48]: 67.6615059375763
In [49]: timeit("rollBy_Bren(df.ToRoll,df.RollBasis,10,sum)",setup="from __main__ import rollBy_Bren,df", number =3)
Out[49]: 515.0221037864685
It's worth noting that the index-based solution is O(n), while the logical-slicing version is O(n^2) in the average case (I think).
I find it more useful to do this over evenly spaced windows from the min value of Basis to the max value of Basis, rather than at every value of basis. This means altering the function thus:
def rollBy(what, basis, window, func, *args, **kwargs):
    # note that basis must be sorted in order for this to work properly
    windows_min = basis.min()
    windows_max = basis.max()
    window_starts = np.arange(windows_min, windows_max, window)
    window_starts = pd.Series(window_starts, index=window_starts)
    indexed_what = pd.Series(what.values, index=basis.values)
    def applyToWindow(val):
        # using slice_indexer rather than what.loc[val:val+window] allows
        # window limits that are not specifically in the index
        indexer = indexed_what.index.slice_indexer(val, val+window, 1)
        chunk = indexed_what[indexer]
        return func(chunk, *args, **kwargs)
    rolled = window_starts.apply(applyToWindow)
    return rolled
I've extended the answer of @Ian Sudbury in such a way that one can use it directly on a dataframe by binding the method to the DataFrame class (I expect that there might be some improvements on my code in speed, because I do not know how to access all the internals of the class).
I've also added functionality for backward facing windows and centered windows. They only function perfectly when you're away from the edges.
import pandas as pd
import numpy as np

def roll_by(self, basis, window, func, forward=True, *args, **kwargs):
    the_indexed = pd.Index(self[basis])
    def apply_to_window(val):
        if forward == True:
            indexer = the_indexed.slice_indexer(val, val+window)
        elif forward == False:
            indexer = the_indexed.slice_indexer(val-window, val)
        elif forward == 'both':
            indexer = the_indexed.slice_indexer(val-window/2, val+window/2)
        else:
            raise RuntimeError('Invalid option for "forward". Can only be True, False, or "both".')
        chunck = self.iloc[indexer]
        return func(chunck, *args, **kwargs)
    rolled = self[basis].apply(apply_to_window)
    return rolled

pd.DataFrame.roll_by = roll_by
For the other tests, I've used the following definitions:
def rollBy_Ian_iloc(what, basis, window, func, *args, **kwargs):
    # note that basis must be sorted in order for this to work properly
    indexed_what = pd.Series(what.values, index=basis.values)
    def applyToWindow(val):
        # using slice_indexer rather than what.loc[val:val+window] allows
        # window limits that are not specifically in the index
        indexer = indexed_what.index.slice_indexer(val, val+window, 1)
        chunk = indexed_what.iloc[indexer]
        return func(chunk, *args, **kwargs)
    rolled = basis.apply(applyToWindow)
    return rolled

def rollBy_Ian_index(what, basis, window, func, *args, **kwargs):
    # note that basis must be sorted in order for this to work properly
    indexed_what = pd.Series(what.values, index=basis.values)
    def applyToWindow(val):
        # using slice_indexer rather than what.loc[val:val+window] allows
        # window limits that are not specifically in the index
        indexer = indexed_what.index.slice_indexer(val, val+window, 1)
        chunk = indexed_what[indexed_what.index[indexer]]
        return func(chunk, *args, **kwargs)
    rolled = basis.apply(applyToWindow)
    return rolled

def rollBy_Bren(what, basis, window, func):
    def applyToWindow(val):
        chunk = what[(val <= basis) & (basis < val + window)]
        return func(chunk)
    return basis.apply(applyToWindow)
Timings and tests:
df = pd.DataFrame({"RollBasis":np.random.uniform(0,100000,10000), "ToRoll": np.random.uniform(0,10,10000)}).sort_values("RollBasis")
In [14]: %timeit rollBy_Ian_iloc(df.ToRoll,df.RollBasis,10,sum)
...: %timeit rollBy_Ian_index(df.ToRoll,df.RollBasis,10,sum)
...: %timeit rollBy_Bren(df.ToRoll,df.RollBasis,10,sum)
...: %timeit df.roll_by('RollBasis', 10, lambda x: x['ToRoll'].sum())
...:
484 ms ± 28.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.58 s ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.12 s ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.48 s ± 45.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Conclusion: the bound method is not as fast as the method by @Ian Sudbury and not as slow as that of @BrenBarn, but it does allow for more flexibility regarding the functions one can call on it.
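For completeness: pandas 1.0 and later expose pandas.api.indexers.BaseIndexer, which lets rolling() itself use value-based window bounds, so for plain aggregations like sum none of the apply machinery above is needed. A sketch (the indexer class is mine, not part of pandas; basis must be sorted, and the windows reproduce the [val, val+window) convention of the rollBy functions above, applied to the d frame from the question):

import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer

class ValueRangeIndexer(BaseIndexer):
    # each window covers all rows whose basis value lies in
    # [basis[i], basis[i] + window_size); basis must be sorted ascending
    def get_window_bounds(self, num_values=0, min_periods=None, center=None,
                          closed=None, step=None):
        start = np.searchsorted(self.basis, self.basis, side='left').astype(np.int64)
        end = np.searchsorted(self.basis, self.basis + self.window_size,
                              side='left').astype(np.int64)
        return start, end

d = d.sort_values('RollBasis')
indexer = ValueRangeIndexer(basis=d['RollBasis'].to_numpy(), window_size=5)
d['Rolled'] = d['ToRoll'].rolling(indexer, min_periods=1).sum()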
