pandas dataframe calculate rolling mean using customized window size - python

I'm trying to calculate the rolling mean/std for a column in a dataframe. The pandas and numpy_ext rolling methods seem to require a fixed window size. The dataframe has a column "dates", and I want to decide the window size based on those dates: for example, when calculating the mean/std for rows at day 10, include rows from day 2 to day 6; for rows at day 11, include rows from day 3 to day 7; for rows at day 12, include rows from day 4 to day 8; etc.
I want to know if there is a way to do this other than brute-force coding. Sample data below; "quantity" is the target field for the mean and std.
| dates | material | location | quantity |
|-------|----------|----------|----------|
| 1     | C        | A        | 870      |
| 2     | D        | A        | 920      |
| 3     | C        | A        | 120      |
| 4     | D        | A        | 120      |
| 6     | C        | A        | 120      |
| 8     | D        | A        | 1200     |
| 8     | c        | A        | 720      |
| 10    | D        | A        | 480      |
| 11    | D        | A        | 600      |
| 12    | C        | A        | 720      |
| 13    | D        | A        | 80       |
| 13    | D        | A        | 600      |
| 14    | D        | A        | 1200     |
| 18    | E        | B        | 150      |
For example, for each row I want to get the rolling mean of "quantity" over the previous 3-8 days (if any); the expected output would be:
| dates | material | location | quantity | Mean |
|-------|--------- |----------|----------|-------------------------|
| 1 | C | A | 870 | Nan |
| 2 | D | A | 920 | Nan |
| 3 | C | A | 120 | Nan |
| 4 | D | A | 120 | Nan |
| 6 | C | A | 120 |(870+920)/2 = 895 |
| 8 | D | A | 1200 |(870+920+120+120)/4=507.5|
| 8 | c | A | 720 |(870+920+120+120)/4=507.5|
| 10 | D | A | 480 |(920+120+120+120)/4=320 |
| 11 | D | A | 600 |(120+120+120)/3=120 |
| 12 | C | A | 720 |(120+120+1200+720)/4=540 |
| 13 | D | A | 80 |(120+1200+720)/3=680 |
| 13 | D | A | 600 |(120+1200+720)/3=680 |
| 14 | D | A | 1200 |(120+1200+720+480)/4=630 |
| 18 | E | B | 150|(480+600+720+80+600+1200)/6=613|
A follow-up question:
Is there a way to further filter the window by other columns? For example, when calculating the rolling mean for "quantity" of the previous 3-8 days, the rows in the rolling window must have the same "material" as the corresponding row. So the new expected output would be:
| dates | material | location | quantity | Mean |
|-------|--------- |----------|----------|-----------------------|
| 1 | C | A | 870 | Nan |
| 2 | D | A | 920 | Nan |
| 3 | C | A | 120 | Nan |
| 4 | D | A | 120 | Nan |
| 6 | C | A | 120 |(870)/1 = 870 |
| 8 | D | A | 1200 |(920+120)/2=520 |
| 8 | c | A | 720 |(870+120)/2=495 |
| 10 | D | A | 480 |(920+120)/2=520 |
| 11 | D | A | 600 |(120)/1 = 120 |
| 12 | C | A | 720 |(120+720)/2=420 |
| 13 | D | A | 80 |(1200)/1=1200 |
| 13 | D | A | 600 |(1200)/1=1200 |
| 14 | D | A | 1200 |(1200+480)/2=840 |
| 18 | E | B | 150 |Nan |

Inspired by @PaulS's answer, here is a simple way to select based on conditions from multiple columns:
def get_selection(row):
    dates_mask = (df['dates'] < row['dates'] - 3) & (df['dates'] >= row['dates'] - 8)
    material_mask = df['material'] == row['material']
    return df[dates_mask & material_mask]

df['Mean'] = df.apply(lambda row: get_selection(row)['quantity'].mean(), axis=1)
df['Std'] = df.apply(lambda row: get_selection(row)['quantity'].std(), axis=1)
dates material location quantity Mean Std
0 1 C A 870 NaN NaN
1 2 D A 920 NaN NaN
2 3 C A 120 NaN NaN
3 4 D A 120 NaN NaN
4 6 C A 120 870.0 NaN
5 8 D A 1200 520.0 565.685425
6 8 C A 720 495.0 530.330086
7 10 D A 480 520.0 565.685425
8 11 D A 600 120.0 NaN
9 12 C A 720 420.0 424.264069
10 13 D A 80 1200.0 NaN
11 13 D A 600 1200.0 NaN
12 14 D A 1200 840.0 509.116882
13 18 E B 150 NaN NaN

You can perform your operation with a rolling, but you have to pre- and post-process the DataFrame a bit to generate the shift:
A = 3
B = 8

s = (df
     # de-duplicate by getting the sum/count per identical date
     .groupby('dates')['quantity']
     .agg(['sum', 'count'])
     # reindex to fill missing dates
     .reindex(range(df['dates'].min(),
                    df['dates'].max()+1),
              fill_value=0)
     # compute classical rolling
     .rolling(B-A, min_periods=1).sum()
     # compute mean
     .assign(mean=lambda d: d['sum']/d['count'])
     ['mean'].shift(A+1)
     )

df['Mean'] = df['dates'].map(s)
output:
dates material location quantity Mean
0 1 C A 870 NaN
1 2 C A 920 NaN
2 3 C A 120 NaN
3 4 C A 120 NaN
4 6 C A 120 895.000000
5 8 D A 1200 507.500000
6 8 D A 720 507.500000
7 10 D A 480 320.000000
8 11 D A 600 120.000000
9 12 D A 720 540.000000
10 13 D A 80 680.000000
11 13 D A 600 680.000000
12 14 D A 1200 630.000000
13 18 E B 150 613.333333

Another possible solution:
def f(x):
    return np.arange(np.amax([0, x-8]), np.amax([0, x-3]))

df['Mean'] = df.dates.map(lambda x: df.quantity[df.dates.isin(f(x))].mean())
Output:
dates material location quantity Mean
0 1 C A 870 NaN
1 2 C A 920 NaN
2 3 C A 120 NaN
3 4 C A 120 NaN
4 6 C A 120 895.000000
5 8 D A 1200 507.500000
6 8 D A 720 507.500000
7 10 D A 480 320.000000
8 11 D A 600 120.000000
9 12 D A 720 540.000000
10 13 D A 80 680.000000
11 13 D A 600 680.000000
12 14 D A 1200 630.000000
13 18 E B 150 613.333333
14 19 E B 1416 640.000000
15 20 F B 1164 650.000000
16 21 G B 11520 626.666667

The DataFrame constructor for anyone else to try:
d = {'dates': [1, 2, 3, 4, 6, 8, 8, 10, 11, 12, 13, 13, 14, 18],
     'material': ['C','C','C','C','C','D','D','D','D','D','D','D','D','E'],
     'location': ['A','A','A','A','A','A','A','A','A','A','A','A','A','B'],
     'quantity': [870, 920, 120, 120, 120, 1200, 720, 480, 600, 720, 80, 600, 1200, 150]}
df = pd.DataFrame(d)
df.rolling does accept "a time period of each window. Each window will be a variable sized based on the observations included in the time-period. This is only valid for datetimelike indexes."
So we would have to convert your days to datetimelike (e.g., a pd.Timestamp, or a pd.Timedelta), and set it as index.
But this method won't be able to perform the shift that you want (e.g., for day 14 you want the window to end not at day 14 but at day 10: 4 days before it).
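For illustration, a minimal sketch of that datetimelike route (assuming the df built from the constructor above; the '2020-01-01' origin timestamp is an arbitrary choice of mine). It demonstrates the variable-sized window, but the window still ends at the current row, so the shift is missing:
import pandas as pd

# Map the integer day numbers onto real timestamps so a time-based window works;
# 'dates' must be sorted for this.
tmp = df.set_index(pd.Timestamp('2020-01-01') + pd.to_timedelta(df['dates'], unit='D'))
# A 5-day window ending at the current row - no 4-day backward shift applied.
unshifted_mean = tmp['quantity'].rolling('5D').mean()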
So there is another option, which df.rolling also accepts:
Use a BaseIndexer subclass
There is very little documentation on it and I'm not an expert, but I was able to hack my solution into it. Surely, there must be a better (proper) way to use all its attributes correctly, and hopefully someone will show it in their answer here.
How I did it:
Inside our BaseIndexer subclass, we have to define the get_window_bounds method that returns a tuple of ndarrays: index positions of the starts of all windows, and those of the ends of all windows respectively (index positions like the ones that can be used in iloc - not with loc).
To find them, I used the most efficient method from this answer: np.searchsorted.
Your 'dates' must be sorted for this.
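For intuition, here is what np.searchsorted produces on the sample dates (an illustration added here, computed directly from the question's data):
import numpy as np

days = np.array([1, 2, 3, 4, 6, 8, 8, 10, 11, 12, 13, 13, 14, 18])
starts = np.searchsorted(days, days - 8, side='left')   # first position with date >= current-8
ends = np.searchsorted(days, days - 4, side='right')    # first position with date > current-4
# For day 6 (row 4): starts[4]=0, ends[4]=2 -> rows 0..1 (days 1 and 2), mean 895.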
Any keyword arguments that we pass to the BaseIndexer subclass constructor will be set as its attributes. I will set day_from, day_to and days:
from pandas.api.indexers import BaseIndexer

class CustomWindow(BaseIndexer):
    """
    Indexer that selects the dates.
    It uses the arguments:
    ----------------------
    day_from : int
    day_to : int
    days : np.ndarray
    """
    def get_window_bounds(self,
                          num_values: int,
                          min_periods: int | None,
                          center: bool | None,
                          closed: str | None) -> tuple[np.ndarray, np.ndarray]:
        """
        I'm not using these arguments, but they must be present (not sure why):
        `num_values` is the length of the df,
        `center`: False, `closed`: None.
        """
        days = self.days
        # With `side` I'm making both ends inclusive:
        window_starts = np.searchsorted(days, days + self.day_from, side='left')
        window_ends = np.searchsorted(days, days + self.day_to, side='right')
        return (window_starts, window_ends)

# In my implementation both ends are inclusive:
day_from = -8
day_to = -4
days = df['dates'].to_numpy()

my_indexer = CustomWindow(day_from=day_from, day_to=day_to, days=days)

df[['mean', 'std']] = (df['quantity']
                       .rolling(my_indexer, min_periods=0)
                       .agg(['mean', 'std']))
Result:
dates material location quantity mean std
0 1 C A 870 NaN NaN
1 2 C A 920 NaN NaN
2 3 C A 120 NaN NaN
3 4 C A 120 NaN NaN
4 6 C A 120 895.000000 35.355339
5 8 D A 1200 507.500000 447.911822
6 8 D A 720 507.500000 447.911822
7 10 D A 480 320.000000 400.000000
8 11 D A 600 120.000000 0.000011
9 12 D A 720 540.000000 523.067873
10 13 D A 80 680.000000 541.109970
11 13 D A 600 680.000000 541.109970
12 14 D A 1200 630.000000 452.990066
13 18 E B 150 613.333333 362.803896

Related

column calculation based on column names Python

I have this dataframe with columns like
| LHA_1 | JH_1 | LHA_2 | JH_2 | LHA_3 | JH_3 | LHA_4 | JH_5 | ....
What I would like to do is to have LHA_2 - JH_1, LHA_3 - JH_2, LHA_4 - JH_3, ...
and the final dataframe would look like
| LHA_1 | JH_1 | LHA_2 |LHA_2 - JH_1| JH_2 | LHA_3 |LHA_3 - JH_2| JH_3 | LHA_4 | JH_5 | ....
You can use pd.IndexSlice. Suppose the following dataframe:
>>> df
LHA_1 JH_1 LHA_2 JH_2 LHA_3 JH_3 LHA_4 JH_5 LHA_6
0 9 8 7 6 5 4 3 2 1
1 19 18 17 16 15 14 13 12 11
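For anyone who wants to reproduce it, a frame like the one printed above could be built as follows (a sketch based purely on that output):
import pandas as pd

df = pd.DataFrame([[9, 8, 7, 6, 5, 4, 3, 2, 1],
                   [19, 18, 17, 16, 15, 14, 13, 12, 11]],
                  columns=['LHA_1', 'JH_1', 'LHA_2', 'JH_2',
                           'LHA_3', 'JH_3', 'LHA_4', 'JH_5', 'LHA_6'])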
res = df.loc[:, pd.IndexSlice['LHA_2'::2]].values \
      - df.loc[:, pd.IndexSlice['JH_1'::2]].values
res = pd.DataFrame(res).add_prefix('LHA_JH_')
out = pd.concat([df, res], axis=1)
print(out)
# Output
LHA_1 JH_1 LHA_2 JH_2 LHA_3 JH_3 LHA_4 JH_5 LHA_6 LHA_JH_0 LHA_JH_1 LHA_JH_2 LHA_JH_3
0 9 8 7 6 5 4 3 2 1 -1 -1 -1 -1
1 19 18 17 16 15 14 13 12 11 -1 -1 -1 -1

Calculate the sum and average of a fixed number of iterative consecutive rows in a dataframe

I have a dataframe as below:
| ID | Date | Value |
------------------------------------
| A | 01-01-2020 | 0.4854 |
| A | 02-01-2020 | 0.4856 |
| A | 03-01-2020 | 0.3982 |
---
| A | 29-12-2020 | 0.2139 |
| A | 30-12-2020 | 0.6290 |
| A | 31-12-2020 | 1.3921 |
---
| B | 01-01-2020 | 2.198 |
| B | 02-01-2020 | 1.4856 |
| B | 03-01-2020 | 2.3982 |
---
For a given ID, I need to find the sum and average of "Value" for all 14-day periods and then return them along with the start date and end date. Let's say 01-01-2020 to 14-01-2020 is a 14-day period with a sum of "Value" of 3.27 and an average of 0.4239; then 02-01-2020 to 15-01-2020 is another 14-day period with a sum of 3.34 and an average of 0.4456. Likewise, I need to find the sum and average of all possible consecutive 14-day periods. The 14-day periods need to be consecutive.
My output should look like:
| ID | Start Date | End Date | Sum | Average |
--------------------------------------------------------
| A | 01-01-2020 | 14-01-2020 | 3.2685 | 0.4239 |
| A | 02-01-2020 | 15-01-2020 | 3.3371 | 0.4456 |
| A | 03-01-2020 | 16-01-2020 | 3.1982 | 0.3987 |
---
| B | 01-01-2020 | 14-01-2020 | 4.2685 | 0.6321 |
| B | 02-01-2020 | 15-01-2020 | 5.3371 | 0.7892 |
| B | 03-01-2020 | 16-01-2020 | 4.1982 | 0.6210 |
My approach is to add a new column for the end date to the data frame. I use iterrows() to extract the rows of the data frame one by one and calculate the sum and average.
import pandas as pd
import numpy as np
import random
import datetime

date_rng = pd.date_range('2020-01-01', '2020-01-31', freq='1D')
date_rng = date_rng.append(date_rng)
# value = np.random.uniform(0, 5, 62)
value = np.random.randint(0, 5, (62,))
Id = ['A']*31 + ['B']*31
df = pd.DataFrame({'ID': Id, 'date': date_rng, 'Value': value + value[::-1]})
df['date'] = pd.to_datetime(df['date'])
df['End Date'] = df['date'] + datetime.timedelta(days=13)
df.columns = ['ID', 'Start Date', 'Value', 'End Date']

import itertools

for idx, row in df.iterrows():
    start = row['Start Date']
    end = row['End Date']
    i = row['ID']
    d = df[(df['ID'] == i) & (df['Start Date'] >= df.loc[idx, 'Start Date']) & (df['Start Date'] <= df.loc[idx, 'End Date'])]['Value']
    df.loc[idx, 'sum'] = d.sum()
    df.loc[idx, 'mean'] = d.mean()

df.head(15)
ID Start Date Value End Date sum mean
0 A 2020-01-01 6 2020-01-14 56.0 4.000000
1 A 2020-01-02 7 2020-01-15 53.0 3.785714
2 A 2020-01-03 1 2020-01-16 50.0 3.571429
3 A 2020-01-04 1 2020-01-17 55.0 3.928571
4 A 2020-01-05 0 2020-01-18 57.0 4.071429
5 A 2020-01-06 5 2020-01-19 60.0 4.285714
6 A 2020-01-07 3 2020-01-20 61.0 4.357143
7 A 2020-01-08 8 2020-01-21 61.0 4.357143
8 A 2020-01-09 4 2020-01-22 55.0 3.928571
9 A 2020-01-10 5 2020-01-23 54.0 3.857143
10 A 2020-01-11 6 2020-01-24 53.0 3.785714
11 A 2020-01-12 6 2020-01-25 52.0 3.714286
12 A 2020-01-13 0 2020-01-26 50.0 3.571429
13 A 2020-01-14 4 2020-01-27 51.0 3.642857
14 A 2020-01-15 3 2020-01-28 51.0 3.642857
I have a similar approach to @r-beginners, but this one uses transform and calls a function to calculate the sum and mean values.
import pandas as pd

date1 = '2011-05-03'
df = pd.DataFrame()
df['start_date'] = pd.date_range(date1, periods=100, freq='D')
df['end_date'] = df['start_date'] + pd.to_timedelta(13, unit='D')
df['score'] = range(1, 1+len(df))
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

def sum_scores(d):
    return df[(df['start_date'] <= d) &
              (df['end_date'] >= d)]['score'].sum()

def mean_scores(d):
    return df[(df['start_date'] <= d) &
              (df['end_date'] >= d)]['score'].mean()

df['sum'] = df['end_date'].transform(sum_scores)
df['mean'] = df['end_date'].transform(mean_scores)
print (df)
The results from this are:
start_date end_date score sum mean
0 2011-05-03 2011-05-16 1 105 7.5
1 2011-05-04 2011-05-17 2 119 8.5
2 2011-05-05 2011-05-18 3 133 9.5
3 2011-05-06 2011-05-19 4 147 10.5
4 2011-05-07 2011-05-20 5 161 11.5
5 2011-05-08 2011-05-21 6 175 12.5
6 2011-05-09 2011-05-22 7 189 13.5
7 2011-05-10 2011-05-23 8 203 14.5
8 2011-05-11 2011-05-24 9 217 15.5
9 2011-05-12 2011-05-25 10 231 16.5
10 2011-05-13 2011-05-26 11 245 17.5
11 2011-05-14 2011-05-27 12 259 18.5
12 2011-05-15 2011-05-28 13 273 19.5
13 2011-05-16 2011-05-29 14 287 20.5
14 2011-05-17 2011-05-30 15 301 21.5
15 2011-05-18 2011-05-31 16 315 22.5
16 2011-05-19 2011-06-01 17 329 23.5
17 2011-05-20 2011-06-02 18 343 24.5
18 2011-05-21 2011-06-03 19 357 25.5
19 2011-05-22 2011-06-04 20 371 26.5
You can group by the ID to get this broken down for each ID.
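As a possible alternative (not part of either answer above), pandas provides FixedForwardWindowIndexer for forward-looking windows. A minimal sketch, assuming the toy df from the first answer (columns 'ID', 'Start Date', 'Value') with exactly one row per consecutive day within each ID, so that a 14-day period is exactly 14 rows:
from pandas.api.indexers import FixedForwardWindowIndexer

# Forward window of 14 rows starting at each row (only valid because the toy
# data has one row per consecutive day within each ID).
indexer = FixedForwardWindowIndexer(window_size=14)
df[['sum', 'mean']] = (df.groupby('ID', group_keys=False)['Value']
                         .apply(lambda s: s.rolling(indexer, min_periods=14)
                                           .agg(['sum', 'mean'])))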

How to remove a level from the columns of a dataframe produced by pivot_table?

The issue
I have a dataset that looks like the toy example below.
I need to create a table that sums the value for each combination of item and period, and displays it in a crosstab / pivot table format.
If I use pandas.crosstab() I get the output I want.
If I use pandas.pivot_table I get what seems like a multi-level index for the columns.
How can I get rid of the multi-level index?
Yes, I could just use crosstab, but:
- in general, I want to learn about multi-level indices
- sometimes I don't have the 'raw' data and receive it already in the format produced by pivot_table
What I have tried
I have tried totals_pivot_table.droplevel(0) but it says there is only one level. What does this mean?
dataframe.columns.droplevel() is no longer supported
Example tables
This is the output of pivot_table:
+--------+-------+-------+-------+-------+
| | value | value | value | value |
+--------+-------+-------+-------+-------+
| period | 1 | 2 | 3 | All |
| item | | | | |
| x | 10 | 11 | 12 | 33 |
| y | 13 | 14 | 15 | 42 |
| All | 23 | 25 | 27 | 75 |
+--------+-------+-------+-------+-------+
This is what I need:
+------+----+----+----+-----+
| item | 1 | 2 | 3 | All |
+------+----+----+----+-----+
| x | 10 | 11 | 12 | 33 |
| y | 13 | 14 | 15 | 42 |
| All | 23 | 25 | 27 | 75 |
+------+----+----+----+-----+
Toy code
df = pd.DataFrame()
df['item'] = np.repeat(['x','y'],3)
df['period'] = np.tile([1,2,3],2)
df['value'] = np.arange(10,16)
pivot = df.pivot(index='item', columns='period', values=None)
totals_pivot_table = df.pivot_table(index='item', columns='period', aggfunc='sum', margins=True)
totals_ct = pd.crosstab(df['item'], df['period'], values=df['value'], aggfunc='sum', margins=True)
Better is to specify the values parameter:
totals_pivot_table = df.pivot_table(index='item',
                                    columns='period',
                                    values='value',
                                    aggfunc='sum',
                                    margins=True)
print (totals_pivot_table)
period 1 2 3 All
item
x 10 11 12 33
y 13 14 15 42
All 23 25 27 75
If that is not possible, you can use DataFrame.droplevel, but be careful about duplicated column names:
print (totals_pivot_table.droplevel(0, axis=1))
period 1 2 3 All
item
x 10 11 12 33
y 13 14 15 42
All 23 25 27 75
df = pd.DataFrame()
df['item'] = np.repeat(['x','y'],3)
df['period'] = np.tile([1,2,3],2)
df['value'] = np.arange(10,16)
df['value1'] = np.arange(7,13)
print (df)
item period value value1
0 x 1 10 7
1 x 2 11 8
2 x 3 12 9
3 y 1 13 10
4 y 2 14 11
5 y 3 15 12
totals_pivot_table = df.pivot_table(index='item',
                                    columns='period',
                                    aggfunc='sum',
                                    margins=True)
print (totals_pivot_table)
value value1
period 1 2 3 All 1 2 3 All
item
x 10 11 12 33 7 8 9 24
y 13 14 15 42 10 11 12 33
All 23 25 27 75 17 19 21 57
print (totals_pivot_table.droplevel(0, axis=1))
period 1 2 3 All 1 2 3 All
item
x 10 11 12 33 7 8 9 24
y 13 14 15 42 10 11 12 33
All 23 25 27 75 17 19 21 57
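If dropping the level would leave duplicated column names, as in the two-value frame above, a common alternative is to flatten the MultiIndex instead; a sketch:
flat = totals_pivot_table.copy()
# Join the two column levels into single names like value_1, value_All, value1_1, ...
flat.columns = [f'{value}_{period}' for value, period in flat.columns]
print (flat)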
Use reset_index() when making your df totals_ct.
totals_ct.index
gives:
Index(['x', 'y', 'All'], dtype='object', name='item')
However, using reset_index() when making totals_ct gets rid of the named item index:
totals_ct = pd.crosstab(df['item'], df['period'], values=df['value'], aggfunc='sum', margins=True).reset_index()
check for result:
totals_ct.index
gives:
RangeIndex(start=0, stop=3, step=1)
Maybe this is what you are looking for.
Greetings, Jan

Move row by name to desired location in df

I have a df which looks like this:
a b
apple | 7 | 2 |
google | 8 | 8 |
swatch | 6 | 6 |
merc | 7 | 8 |
other | 8 | 9 |
I want to select a given row by name, say "apple", and move it to a new location, say -1 (second-to-last row).
desired output
a b
google | 8 | 8 |
swatch | 6 | 6 |
merc | 7 | 8 |
apple | 7 | 2 |
other | 8 | 9 |
Are there any functions available to achieve this?
Use Index.difference to remove the value and numpy.insert to add the value at the new position, then use DataFrame.reindex or DataFrame.loc to change the order of rows:
a = 'apple'
idx = np.insert(df.index.difference([a], sort=False), -1, a)
print (idx)
Index(['google', 'swatch', 'merc', 'apple', 'other'], dtype='object')
df = df.reindex(idx)
#alternative
#df = df.loc[idx]
print (df)
a b
google 8 8
swatch 6 6
merc 7 8
apple 7 2
other 8 9
This seems good, I am using pd.Index.insert() and pd.Index.drop_duplicates():
df.reindex(df.index.insert(-1,'apple').drop_duplicates(keep='last'))
a b
google 8 8
swatch 6 6
merc 7 8
apple 7 2
other 8 9
I'm not aware of any built-in function, but one approach would be to manipulate the index only, then use the new index to re-order the DataFrame (assumes all index values are unique):
name = 'apple'
position = -1
new_index = [i for i in df.index if i != name]
new_index.insert(position, name)
df = df.loc[new_index]
Results:
a b
google 8 8
swatch 6 6
merc 7 8
apple 7 2
other 8 9
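For reuse, the last approach can be wrapped in a small helper (a sketch of a hypothetical move_row function, assuming unique index labels):
def move_row(df, name, position):
    # Rebuild the index without `name`, re-insert it at `position`,
    # then reorder the rows with .loc.
    new_index = [i for i in df.index if i != name]
    new_index.insert(position, name)
    return df.loc[new_index]

# move_row(df, 'apple', -1) reproduces the result above.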

How to identify a specific occurrence across two rows and calculate the count

Let's say I have this pandas dataframe:
id | userid | type
1 | 20 | a
2 | 20 | a
3 | 20 | b
4 | 21 | a
5 | 21 | b
6 | 21 | a
7 | 21 | b
8 | 21 | b
I want to obtain the number of times 'b follows a' for each user, and obtain a new dataframe like this:
userid | b_follows_a
20 | 1
21 | 2
I know I can do this using for loop. However, I wonder if there is a more elegant solution to this.
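For reference, the sample frame can be rebuilt like this (a sketch based on the table above):
import pandas as pd

df = pd.DataFrame({
    'id': range(1, 9),
    'userid': [20, 20, 20, 21, 21, 21, 21, 21],
    'type': ['a', 'a', 'b', 'a', 'b', 'a', 'b', 'b'],
})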
You can use shift() to check whether a is followed by b, combine the checks with a vectorized &, and then count the Trues with a sum:
df.groupby('userid').type.apply(lambda x: ((x == "a") & (x.shift(-1) == "b")).sum()).reset_index()
#userid type
#0 20 1
#1 21 2
Creative solution:
In [49]: df.groupby('userid')['type'].sum().str.count('ab').reset_index()
Out[49]:
userid type
0 20 1
1 21 2
Explanation:
In [50]: df.groupby('userid')['type'].sum()
Out[50]:
userid
20 aab
21 ababb
Name: type, dtype: object
