Spread single value in group across all other NaN values in group - python

In this example, we attempt to spread the single computed value in each group and column to all other NaNs in the same group and column.
import pandas as pd
df = pd.DataFrame({'id':[1,1,2,2,3,4,5], 'Year':[2000,2000, 2001, 2001, 2000, 2000, 2000], 'Values': [1, 3, 2, 3, 4, 5,6]})
df['pct'] = df.groupby(['id', 'Year'])['Values'].apply(lambda x: x/x.shift() - 1)
print(df)
id Year Values pct
0 1 2000 1 NaN
1 1 2000 3 2.0
2 2 2001 2 NaN
3 2 2001 3 0.5
4 3 2000 4 NaN
5 4 2000 5 NaN
6 5 2000 6 NaN
I have tried to use .ffill() to fill the NaNs within each group that contains a value. For example, the goal is for the NaN at index 0 to become 2.0, and the NaN at index 2 to become 0.5.
df['pct'] = df.groupby(['id', 'Year'])['pct'].ffill()
print(df)
id Year Values pct
0 1 2000 1 NaN
1 1 2000 3 2.0
2 2 2001 2 NaN
3 2 2001 3 0.5
4 3 2000 4 NaN
5 4 2000 5 NaN
6 5 2000 6 NaN

It should be bfill, since within each (id, Year) group the known value comes after the NaN:
df['pct'] = df.groupby(['id', 'Year'])['pct'].bfill()
df
Out[109]:
id Year Values pct
0 1 2000 1 2.0
1 1 2000 3 2.0
2 2 2001 2 0.5
3 2 2001 3 0.5
4 3 2000 4 NaN
5 4 2000 5 NaN
6 5 2000 6 NaN
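If the group's single known value can sit either before or after the NaNs, a combined fill covers both directions; a minimal sketch (my addition, assuming each (id, Year) group holds at most one non-NaN pct):
# bfill then ffill spreads the one known value to every other row of the group,
# regardless of whether it comes before or after the NaNs.
df['pct'] = (
    df.groupby(['id', 'Year'])['pct']
      .transform(lambda s: s.bfill().ffill()))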

Related

How to aggregate the previous values of a pandas dataframe based only on historical values? Similarly, how to mean encode only historical groups? [duplicate]

I have a dataframe which looks like this:
pd.DataFrame({'category': [1,1,1,2,2,2,3,3,3,4],
              'order_start': [1,2,3,1,2,3,1,2,3,1],
              'time': [1, 4, 3, 6, 8, 17, 14, 12, 13, 16]})
Out[40]:
category order_start time
0 1 1 1
1 1 2 4
2 1 3 3
3 2 1 6
4 2 2 8
5 2 3 17
6 3 1 14
7 3 2 12
8 3 3 13
9 4 1 16
I would like to create a new column which contains the mean of the previous times of the same category. How can I create it?
The new column should look like this:
pd.DataFrame({'category': [1,1,1,2,2,2,3,3,3,4],
              'order_start': [1,2,3,1,2,3,1,2,3,1],
              'time': [1, 4, 3, 6, 8, 17, 14, 12, 13, 16],
              'mean': [np.nan, 1, 2.5, np.nan, 6, 7, np.nan, 14, 13, np.nan]})
Out[41]:
category order_start time mean
0 1 1 1 NaN
1 1 2 4 1.0 = 1 / 1
2 1 3 3 2.5 = (4+1)/2
3 2 1 6 NaN
4 2 2 8 6.0 = 6 / 1
5 2 3 17 7.0 = (8+6) / 2
6 3 1 14 NaN
7 3 2 12 14.0
8 3 3 13 13.0
9 4 1 16 NaN
Note: If it is the first time, the mean should be NaN.
EDIT: as stated by cs95, my question is not really the same as that one, since expanding is required here.
"create a new column which contains the mean of the previous times of the same category" sounds like a good use case for GroupBy.expanding (and a shift):
df['mean'] = (
    df.groupby('category')['time'].apply(lambda x: x.shift().expanding().mean()))
df
category order_start time mean
0 1 1 1 NaN
1 1 2 4 1.0
2 1 3 3 2.5
3 2 1 6 NaN
4 2 2 8 6.0
5 2 3 17 7.0
6 3 1 14 NaN
7 3 2 12 14.0
8 3 3 13 13.0
9 4 1 16 NaN
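Depending on your pandas version, the apply above may prepend the group key to the result's index instead of keeping the original row index, in which case the column assignment will not line up. A hedged workaround (my addition, not part of the original answer) is to ask groupby not to add the keys:
# group_keys=False keeps the original index on the result of apply,
# so the Series lines up with df when assigned back as a column.
df['mean'] = (
    df.groupby('category', group_keys=False)['time']
      .apply(lambda x: x.shift().expanding().mean()))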
Another way to calculate this, without the apply, is to chain two groupby calls:
df['mean'] = (
    df.groupby('category')['time']
      .shift()
      .groupby(df['category'])
      .expanding()
      .mean()
      .to_numpy())  # replace to_numpy() with `.values` for pd.__version__ < 0.24
df
category order_start time mean
0 1 1 1 NaN
1 1 2 4 1.0
2 1 3 3 2.5
3 2 1 6 NaN
4 2 2 8 6.0
5 2 3 17 7.0
6 3 1 14 NaN
7 3 2 12 14.0
8 3 3 13 13.0
9 4 1 16 NaN
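One caveat I would add (not from the original answer): the final .to_numpy() only lines up because df is already sorted by category, so the grouped expanding result comes back in the same row order as the frame. A variant that relies on index alignment instead drops the group level from the result's MultiIndex:
df['mean'] = (
    df.groupby('category')['time']
      .shift()
      .groupby(df['category'])
      .expanding()
      .mean()
      .droplevel(0))  # drop the group level so assignment aligns on the original index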
In terms of performance, it really depends on the number and size of your groups.
Inspired by my answer here, one can define a function first:
def mean_previous(df, Category, Order, Var):
    # Order the dataframe first
    df.sort_values([Category, Order], inplace=True)
    # Calculate the ordinary grouped cumulative sum
    # and then subtract the grouped cumulative sum within the same order
    csp = df.groupby(Category)[Var].cumsum() - df.groupby([Category, Order])[Var].cumsum()
    # Calculate the ordinary grouped cumulative count
    # and then subtract the grouped cumulative count within the same order
    ccp = df.groupby(Category)[Var].cumcount() - df.groupby([Category, Order]).cumcount()
    return csp / ccp
And the desired column is
df['mean'] = mean_previous(df, 'category', 'order_start', 'time')
Performance-wise, I believe it's very fast.
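For completeness, a small self-contained run of mean_previous on the sample data above; this just assembles the pieces already shown, and the expected values are the ones from the question:
import pandas as pd

df = pd.DataFrame({'category': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4],
                   'order_start': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
                   'time': [1, 4, 3, 6, 8, 17, 14, 12, 13, 16]})

df['mean'] = mean_previous(df, 'category', 'order_start', 'time')
# df['mean'] should come out as [NaN, 1.0, 2.5, NaN, 6.0, 7.0, NaN, 14.0, 13.0, NaN]
print(df)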

Remove columns with more than N consecutive NaN's

I have a DataFrame with around 1000 columns; some columns have 0 NaNs, some have 3, some have 400.
What I want to do is remove every column that contains a run of consecutive NaNs longer than some threshold N; the rest I will impute by taking the mean of the nearest neighbors.
df
ColA | ColB | ColC | ColD | ColE
 NaN |    5 |    3 |  NaN |  NaN
 NaN |    6 |  NaN |    4 |    4
 NaN |    7 |    4 |    4 |    4
 NaN |    5 |    5 |  NaN |  NaN
 NaN |    5 |    4 |  NaN |    4
 NaN |    3 |    3 |  NaN |    3
threshold = 2
remove_consecutive_nan(df,threshold)
Which would return
ColB | ColC | ColE
   5 |    3 |  NaN
   6 |  NaN |    4
   7 |    4 |    4
   5 |    5 |  NaN
   5 |    4 |    4
   3 |    3 |    3
How would I write the remove_consecutive_nan function?
You can create groups of consecutive missing values in each column, count the length of each run per column, and finally filter out the columns where any run exceeds the threshold:
def remove_consecutive_nan(df, threshold):
    m = df.notna()
    mask = m.cumsum().mask(m).apply(pd.Series.value_counts).gt(threshold)
    return df.loc[:, ~mask.any(axis=0)]

print(remove_consecutive_nan(df, 2))
ColB ColC ColE
0 5 3.0 NaN
1 6 NaN 4.0
2 7 4.0 4.0
3 5 5.0 NaN
4 5 4.0 4.0
5 3 3.0 3.0
An alternative that counts the missing values within each consecutive run directly:
def remove_consecutive_nan(df, threshold):
    m = df.isna()
    b = m.cumsum()
    mask = b.sub(b.mask(m).ffill().fillna(0)).gt(threshold)
    return df.loc[:, ~mask.any(axis=0)]
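To try either version end to end, here is one way to reconstruct the example frame from the question (my reconstruction, using np.nan for the blanks):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ColA': [np.nan] * 6,
                   'ColB': [5, 6, 7, 5, 5, 3],
                   'ColC': [3, np.nan, 4, 5, 4, 3],
                   'ColD': [np.nan, 4, 4, np.nan, np.nan, np.nan],
                   'ColE': [np.nan, 4, 4, np.nan, 4, 3]})

# With threshold=2, ColA (6 consecutive NaNs) and ColD (a run of 3) are dropped.
print(remove_consecutive_nan(df, 2))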

Is there an easy way to expand/complete a pandas DataFrame to include missing observations with multiple columns?

I have a DataFrame that looks like this:
>>> df = pd.DataFrame({
...     'category1': list('AABAAB'),
...     'category2': list('xyxxyx'),
...     'year': [2000, 2000, 2000, 2002, 2002, 2002],
...     'value': [0, 1, 0, 4, 3, 4]
... })
>>> df
category1 category2 year value
0 A x 2000 0
1 A y 2000 1
2 B x 2000 0
3 A x 2002 4
4 A y 2002 3
5 B x 2002 4
I'd like to expand the data to include missing years in a range. For example, if the range were range(2000, 2003), the expanded DataFrame should look like this:
category1 category2 year value
0 A x 2000 0.0
1 A y 2000 1.0
2 B x 2000 0.0
3 A x 2001 NaN
4 A y 2001 NaN
5 B x 2001 NaN
6 A x 2002 4.0
7 A y 2002 3.0
8 B x 2002 4.0
I have tried an approach using pd.MultiIndex.from_product, but that creates rows that aren't valid combinations of category1 and category2 (for example, B and y shouldn't go together). Using from_product and then filtering is too slow for my actual data, which includes many more combinations.
Is there an easier solution to this that can scale well?
Edit
This is the solution I ended up going with, trying to generalize the problem a bit:
id_cols = ['category1', 'category2']
df_out = (df.pivot_table(index=id_cols, values='value', columns='year')
            .reindex(columns=range(2000, 2003))
            .stack(dropna=False)
            .sort_index(level=-1)
            .reset_index(name='value'))
category1 category2 year value
0 A x 2000 0.0
1 A y 2000 1.0
2 B x 2000 0.0
3 A x 2001 NaN
4 A y 2001 NaN
5 B x 2001 NaN
6 A x 2002 4.0
7 A y 2002 3.0
8 B x 2002 4.0
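A hedged generalization of the same idea, wrapping it in a helper so the id columns, the expanded column and the fill range become parameters (the name expand_frame and its signature are my own, not from the original post):
def expand_frame(df, id_cols, expand_col, value_col, full_range):
    # Pivot so each observed id combination is one row and each year one column,
    # reindex the columns to the full range, then stack the years back into rows.
    return (df.pivot_table(index=id_cols, values=value_col, columns=expand_col)
              .reindex(columns=full_range)
              .stack(dropna=False)
              .sort_index(level=-1)
              .reset_index(name=value_col))

df_out = expand_frame(df, ['category1', 'category2'], 'year', 'value', range(2000, 2003))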
Let us do stack and unstack:
dfout = df.set_index(['year', 'category1', 'category2']).\
           value.unstack(level=0).\
           reindex(columns=range(2000, 2003)).\
           stack(dropna=False).to_frame('value').\
           sort_index(level=2).reset_index()
category1 category2 year value
0 A x 2000 0.0
1 A y 2000 1.0
2 B x 2000 0.0
3 A x 2001 NaN
4 A y 2001 NaN
5 B x 2001 NaN
6 A x 2002 4.0
7 A y 2002 3.0
8 B x 2002 4.0
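A note on the chain above: selecting the column by attribute (.value) only works while the name is a valid identifier that doesn't shadow a DataFrame method; a drop-in variant with bracket selection and parentheses instead of backslashes (same logic, my restatement) is:
dfout = (df.set_index(['year', 'category1', 'category2'])['value']
           .unstack(level=0)
           .reindex(columns=range(2000, 2003))
           .stack(dropna=False)
           .to_frame('value')
           .sort_index(level=2)
           .reset_index())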
# Create a separate dataframe and merge it with df to get the new form
a = ('A', 'x', range(2000, 2003))
b = ('A', 'y', range(2000, 2003))
c = ('B', 'x', range(2000, 2003))
from itertools import product, chain
res = (product(*ent) for ent in (a, b, c))
columns = ['category1', 'category2', 'year']
fake = pd.DataFrame(chain.from_iterable(res), columns=columns)
fake.merge(df, on=columns, how='left').sort_values('year', ignore_index=True)
category1 category2 year value
0 A x 2000 0.0
1 A y 2000 1.0
2 B x 2000 0.0
3 A x 2001 NaN
4 A y 2001 NaN
5 B x 2001 NaN
6 A x 2002 4.0
7 A y 2002 3.0
8 B x 2002 4.0
Alternatively:
fake = df.drop_duplicates(['category1', 'category2']).filter(['category1', 'category2'])
fake.index = [2001] * len(fake)
# Concatenate on the year index; the added 2001 rows get NaN in 'value'
pd.concat((df.set_index('year'), fake)).sort_index()
Update
You can abstract the process with the complete function from pyjanitor: the ("category1", "category2") tuple keeps only the combinations of those columns that already appear in the data, while the dictionary supplies the full set of years to expand against:
# pip install pyjanitor
import janitor
df.complete(
    ("category1", "category2"),
    {"year": [2000, 2001, 2002]},
    sort=True
)
category1 category2 year value
0 A x 2000 0.0
1 A x 2001 NaN
2 A x 2002 4.0
3 A y 2000 1.0
4 A y 2001 NaN
5 A y 2002 3.0
6 B x 2000 0.0
7 B x 2001 NaN
8 B x 2002 4.0

Function to determine window in a rolling function

I have a dataframe in which I want to apply a rolling mean over a column of numbers that arrive in short runs of repeated values, and I only want 4 unique values to go into each mean.
Let's say my dataframe looks like:
Group Column to roll
1 9
2 5
2 5
2 4
2 4
2 4
2 3
2 3
2 3
2 6
2 6
2 6
2 8
Since I want 4 unique values to go into the mean, all with equal weight and all from within the same group, my expected output would be:
Group Output
1 nan
2 nan
2 nan
2 nan
2 nan
2 nan
2 nan
2 nan
2 nan
2 (6+3+4+5)/4
2 (6+3+4+5)/4
2 (6+3+4+5)/4
2 (8+6+3+4)/4
Any ideas how to do this?
You could try something like this:
df['Column to roll'].drop_duplicates().rolling(4).mean().reindex(df.index).ffill()
Output:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 4.50
9 4.50
10 4.50
11 5.25
Name: Column to roll, dtype: float64
Edit: the question changed
df_out = df.groupby('Group')['Column to roll']\
           .apply(lambda x: x.drop_duplicates().rolling(4).mean()).rename('Output')
df.set_index('Group', append=True).swaplevel(0, 1)\
  .join(df_out, how='left').ffill().reset_index(level=1, drop=True)
Output:
Column to roll Output
Group
1 9 NaN
2 5 NaN
2 5 NaN
2 4 NaN
2 4 NaN
2 4 NaN
2 3 NaN
2 3 NaN
2 3 NaN
2 6 4.50
2 6 4.50
2 6 4.50
2 8 5.25
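The same grouped logic can also be written as a single transform; this is a sketch, and like the answer above it assumes repeated values only occur back to back within a group:
def rolling_unique_mean(s, window=4):
    # Rolling mean over the unique values of one group, forward-filled
    # back onto the repeated rows.
    return s.drop_duplicates().rolling(window).mean().reindex(s.index).ffill()

df['Output'] = df.groupby('Group')['Column to roll'].transform(rolling_unique_mean)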
