Creating a pandas function equivalent to the Excel OFFSET function - python

Let's say the input is:
import pandas as pd

d = {'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
     'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
     'col3': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
     'offset': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1]}
df = pd.DataFrame(data=d)
I want to create an additional column that looks like this:
df['output'] = [1, 4, 9, 4, 10, 18, 7, 16, 27, 10]
Basically, each number in offset tells you how many columns to sum over (with col1 as the reference point).
Is there a vectorized way to do this without iterating through each value in offset?

You can use np.select. To use it, create each column sum (over 1, 2, 3, ... columns, as needed) as the possible choices, and create a boolean mask for each value in the offset column as the possible conditions.
import numpy as np

# get all possible values from offset
lOffset = df['offset'].unique()

# get the result with np.select
df['output'] = np.select(
    # create a mask for each value in offset
    condlist=[df['offset'].eq(i) for i in lOffset],
    # create the sum over the number of columns per offset value
    choicelist=[df.iloc[:, :i].sum(axis=1) for i in lOffset],
)
print(df)
#    col1  col2  col3  offset  output
# 0     1     1     1       1       1
# 1     2     2     2       2       4
# 2     3     3     3       3       9
# 3     4     4     4       1       4
# 4     5     5     5       2      10
# 5     6     6     6       3      18
# 6     7     7     7       1       7
# 7     8     8     8       2      16
# 8     9     9     9       3      27
# 9    10    10    10       1      10
Note: this assumes that your offset column is the last one, since df.iloc[:, :i] sums over the first i columns.
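If the offset column is not the last one, or you just want to be explicit about which columns are summed, here is a minimal variant of the same idea that names the value columns directly (value_cols is an assumption of this sketch, not part of the original answer):
value_cols = ['col1', 'col2', 'col3']
df['output'] = np.select(
    condlist=[df['offset'].eq(i) for i in lOffset],
    # sum only over the first i value columns, regardless of where offset sits
    choicelist=[df[value_cols[:i]].sum(axis=1) for i in lOffset],
)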

It can also be done with pd.crosstab: mask all 0s to NaN and back-fill along the columns, so that for each row the first offset columns end up as 1; comparing to 1 then gives a boolean mask of exactly the columns that need to be summed.
mask = (pd.crosstab(df.index, df['offset'])
          .mask(lambda x: x == 0)
          .bfill(axis=1)
          .values == 1)
df['new'] = df.filter(like='col').where(mask).sum(axis=1)
Out[710]:
0 1.0
1 4.0
2 9.0
3 4.0
4 10.0
5 18.0
6 7.0
7 16.0
8 27.0
9 10.0
dtype: float64
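A fully vectorized alternative is also possible, sketched here under the assumption that col1..col3 are the value columns and every offset value is a valid 1-based column count: take cumulative sums across the columns and pick, per row, the column indicated by offset.
import numpy as np

cum = df[['col1', 'col2', 'col3']].cumsum(axis=1).to_numpy()
idx = df['offset'].to_numpy()[:, None] - 1  # 0-based column to pick in each row
df['output'] = np.take_along_axis(cum, idx, axis=1).ravel()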

Related

Pandas: Resample a dataframe given a list of indexes that are not evenly spaced

Given a dataframe df: pd.DataFrame and a subset selected_indexes of indexes from df.index, how can I resample df with the max operator applied to each interval (selected_indexes[i], selected_indexes[i+1]]?
For example, given a dataframe:
col
0 5
1 0
2 3
3 3
4 7
5 9
6 3
7 5
8 2
9 4
With a selection of indexes selected_indexes = [0, 5, 6, 9], and applying the maximum on the col column over each interval (keeping the end point and excluding the starting point), we should get:
col
0 5
5 9
6 3
9 5
For example, the line at index 9 was made with max(5, 2, 4) from lines 7, 8, 9, i.e. the indexes in (6, 9].
New interpretation
selected_indexes = [0, 5, 6, 9]
group = (df.index.to_series().shift()  # make groups
           .isin(selected_indexes)     # based on
           .cumsum()                   # previous indices
        )

# get max per group
out = df.groupby(group).max().set_axis(selected_indexes)

# or, for many aggregations (see comments), assuming a frame with col1/col2 columns:
out = (df.groupby(group).agg({'col1': 'max', 'col2': 'min'})
         .set_axis(selected_indexes)
      )
Output:
col
0 5
5 9
6 3
9 5
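For reference, this is what the intermediate group labels look like for the example above (derived from the code, given the 0..9 RangeIndex shown):
print(group.to_list())
# [0, 1, 1, 1, 1, 1, 2, 3, 3, 3]
Each run of equal labels is one interval (start excluded, end included), so taking the max per group yields 5, 9, 3, 5.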
Previous interpretation of the question
You likely need a rolling.max, not resample:
out = df.loc[selected_indexes].rolling(3, center=True).max()
Or, if you want the ±1 to apply to the data before selection:
out = df.rolling(3, center=True).max().loc[selected_indexes]
Example:
np.random.seed(0)
df = pd.DataFrame({'col': np.random.randint(0, 10, 10)})
selected_indexes = [1, 2, 3, 5, 6, 8, 9]
print(df)
col
0 5
1 0
2 3
3 3
4 7
5 9
6 3
7 5
8 2
9 4
out = df.loc[selected_indexes].rolling(3, center=True).max()
print(out)
col
1 NaN
2 3.0
3 9.0
5 9.0
6 9.0
8 4.0
9 NaN
out2 = df.rolling(3, center=True).max().loc[selected_indexes]
print(out2)
col
1 5.0
2 3.0
3 7.0
5 9.0
6 9.0
8 5.0
9 NaN

How to aggregate the previous values of a pandas dataframe based only on historical values? Similarly, how to mean encode only historical groups? [duplicate]

I have a dataframe which looks like this:
pd.DataFrame({'category': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4],
              'order_start': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
              'time': [1, 4, 3, 6, 8, 17, 14, 12, 13, 16]})
Out[40]:
category order_start time
0 1 1 1
1 1 2 4
2 1 3 3
3 2 1 6
4 2 2 8
5 2 3 17
6 3 1 14
7 3 2 12
8 3 3 13
9 4 1 16
I would like to create a new column which contains the mean of the previous times of the same category. How can I create it?
The new column should look like this:
pd.DataFrame({'category': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4],
              'order_start': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
              'time': [1, 4, 3, 6, 8, 17, 14, 12, 13, 16],
              'mean': [np.nan, 1, 2.5, np.nan, 6, 7, np.nan, 14, 13, np.nan]})
Out[41]:
category order_start time mean
0 1 1 1 NaN
1 1 2 4 1.0 = 1 / 1
2 1 3 3 2.5 = (4+1)/2
3 2 1 6 NaN
4 2 2 8 6.0 = 6 / 1
5 2 3 17 7.0 = (8+6) / 2
6 3 1 14 NaN
7 3 2 12 14.0
8 3 3 13 13.0
9 4 1 16 NaN
Note: If it is the first time, the mean should be NaN.
EDIT: as stated by cs95, my question is not really the same as the linked one, since here an expanding window is required.
"create a new column which contains the mean of the previous times of the same category" sounds like a good use case for GroupBy.expanding (and a shift):
df['mean'] = (
    df.groupby('category')['time']
      .apply(lambda x: x.shift().expanding().mean())
)
df
category order_start time mean
0 1 1 1 NaN
1 1 2 4 1.0
2 1 3 3 2.5
3 2 1 6 NaN
4 2 2 8 6.0
5 2 3 17 7.0
6 3 1 14 NaN
7 3 2 12 14.0
8 3 3 13 13.0
9 4 1 16 NaN
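Note: if this assignment misbehaves on a recent pandas (2.0+), it is likely because groupby.apply now keeps the group keys in the result index by default, which breaks the alignment with the original index; passing group_keys=False should restore it (a hedged sketch, check against your pandas version):
df['mean'] = (
    df.groupby('category', group_keys=False)['time']
      .apply(lambda x: x.shift().expanding().mean())
)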
Another way to calculate this is without the apply (chaining two groupby calls):
df['mean'] = (
    df.groupby('category')['time']
      .shift()
      .groupby(df['category'])
      .expanding()
      .mean()
      .to_numpy()  # replace .to_numpy() with .values for pd.__version__ < 0.24
)
df
category order_start time mean
0 1 1 1 NaN
1 1 2 4 1.0
2 1 3 3 2.5
3 2 1 6 NaN
4 2 2 8 6.0
5 2 3 17 7.0
6 3 1 14 NaN
7 3 2 12 14.0
8 3 3 13 13.0
9 4 1 16 NaN
In terms of performance, it really depends on the number and size of your groups.
Inspired by my answer here, one can define a function first:
def mean_previous(df, Category, Order, Var):
    # Order the dataframe first
    df.sort_values([Category, Order], inplace=True)

    # Calculate the ordinary grouped cumulative sum
    # and then subtract the grouped cumulative sum of the last order
    csp = df.groupby(Category)[Var].cumsum() - df.groupby([Category, Order])[Var].cumsum()

    # Calculate the ordinary grouped cumulative count
    # and then subtract the grouped cumulative count of the last order
    ccp = df.groupby(Category)[Var].cumcount() - df.groupby([Category, Order]).cumcount()

    return csp / ccp
And the desired column is
df['mean'] = mean_previous(df, 'category', 'order_start', 'time')
Performance-wise, I believe it's very fast.
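If you want to check that claim on data shaped like yours, here is a minimal timing sketch (the sizes and column contents are arbitrary assumptions; no results are claimed):
import timeit
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
big = pd.DataFrame({'category': rng.integers(0, 1_000, 100_000),
                    'order_start': np.arange(100_000),
                    'time': rng.random(100_000)})

t_expanding = timeit.timeit(
    lambda: big.groupby('category')['time']
               .apply(lambda x: x.shift().expanding().mean()),
    number=3)
t_cumsum = timeit.timeit(
    lambda: mean_previous(big.copy(), 'category', 'order_start', 'time'),  # copy: sort is in-place
    number=3)
print(t_expanding, t_cumsum)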

How to calculate slope of Pandas dataframe column based on previous N rows

I have the following example dataframe:
import pandas as pd
d = {'col1': [2, 5, 6, 5, 4, 6, 7, 8, 9, 7, 5]}
df = pd.DataFrame(data=d)
print(df)
Output:
col1
0 2
1 5
2 6
3 5
4 4
5 6
6 7
7 8
8 9
9 7
10 5
I need to calculate the slope of the previous N rows of col1 and save the slope value in a separate column (call it slope). The desired output may look like the following (the slope values below are just placeholder numbers for the sake of the example):
col1 slope
0 2
1 5
2 6
3 5
4 4 3
5 6 4
6 7 5
7 8 2
8 9 4
9 7 6
10 5 5
So, in the row with the index number 4, the slope is 3 and it is the slope of [2, 5, 6, 5, 4].
Is there an elegant way of doing it without using for loop?
ADDENDUM:
Based on the accepted answer below, in case you get the following error:
TypeError: ufunc 'true_divide' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
This can happen if the index of your dataframe is not numerical. The following modification makes it work:
df['slope'] = df['col1'].rolling(5).apply(lambda s: linregress(range(5), s.values)[0])
You can use rolling+apply and scipy.stats.linregress:
from scipy.stats import linregress
df['slope'] = df['col1'].rolling(5).apply(lambda s: linregress(s.reset_index())[0])
print(df)
output:
col1 slope
0 2 NaN
1 5 NaN
2 6 NaN
3 5 NaN
4 4 0.4
5 6 0.0
6 7 0.3
7 8 0.9
8 9 1.2
9 7 0.4
10 5 -0.5
Let us do it with numpy, using np.polyfit (the coefficients come back highest degree first, so fit[0] is the slope):
import numpy as np

def slope_numpy(x, y):
    # degree-1 fit: fit[0] is the slope, fit[1] the intercept
    fit = np.polyfit(x, y, 1)
    return fit[0]

df.col1.rolling(5).apply(lambda x: slope_numpy(range(5), x))
0     NaN
1     NaN
2     NaN
3     NaN
4     0.4
5     0.0
6     0.3
7     0.9
8     1.2
9     0.4
10   -0.5
Name: col1, dtype: float64
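If rolling.apply is too slow on a large frame, the least-squares slope for a fixed window can also be computed without apply. A minimal vectorized sketch, assuming x within each window is simply 0..N-1 (i.e. rows are evenly spaced) and using slope_fast as a made-up column name:
import numpy as np

N = 5
x = np.arange(N)
# the least-squares slope is a fixed dot product with these weights, because
# sum((x - x.mean()) * (y - y.mean())) == sum((x - x.mean()) * y)
weights = (x - x.mean()) / ((x - x.mean()) ** 2).sum()
# sliding dot product via convolution (kernel reversed to turn it into correlation)
slopes = np.convolve(df['col1'].to_numpy(), weights[::-1], mode='valid')
df['slope_fast'] = np.concatenate([np.full(N - 1, np.nan), slopes])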

Drop specific column and indexes in pandas DataFrame

DataFrame:
A B C
0 1 6 11
1 2 7 12
2 3 8 13
3 4 9 14
4 5 10 15
Is it possible to drop the values from index 2 to 4 in column B, or replace them with NaN?
In this case, the values [8, 9, 10] should be removed.
I tried df.drop(columns=['B'], index=[8, 9, 10]), but then the whole column B is removed.
Dropping individual values does not make sense in a DataFrame (it would leave holes in the table); you can set them to NaN instead, using .loc / .iloc to select the index/columns:
>>> df
A B C
a 1 6 11
b 2 7 12
c 3 8 13
d 4 9 14
e 5 10 15
# By name:
df.loc['c':'e', 'B'] = np.nan
# By position (column B is at position 1):
df.iloc[2:5, 1] = np.nan
Read carefully Indexing and selecting data
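For the frame in the question (default 0..4 RangeIndex), a minimal equivalent; note that .loc slicing is label-based and therefore includes the end point:
import numpy as np

df.loc[2:4, 'B'] = np.nan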
import pandas as pd
data = [
    ['A', 'B', 'C'],
    [1, 6, 11],
    [2, 7, 12],
    [3, 8, 13],
    [4, 9, 14],
    [5, 10, 15],
]
df = pd.DataFrame(data=data[1:], columns=data[0])
df['B'] = df['B'].shift(3)
>>>
A B C
0 1 NaN 11
1 2 NaN 12
2 3 NaN 13
3 4 6.0 14
4 5 7.0 15

