Pandas: query() groupby() mean() using second column list - python

I'm trying to decipher some inherited pandas code and cannot determine what the list [['DemandRate','DemandRateQtr','AcceptRate']] is doing in this line of code:
plot_data = (my_dataframe.query("quote_date>'2020-02-01'")
             .groupby(['quote_date'])[['DemandRate', 'DemandRateQtr', 'AcceptRate']]
             .mean()
             .reset_index()
             )
Can anyone tell me what the list does?

It filters by column names: only the columns in this list are aggregated.
['DemandRate', 'DemandRateQtr', 'AcceptRate']
Any columns that are not in this list and not in the by list (here ['quote_date']) are omitted from the result:
import pandas as pd

my_dataframe = pd.DataFrame({
    'quote_date': pd.date_range('2020-02-01', periods=3).tolist() * 2,
    'DemandRate': [4, 5, 4, 5, 5, 4],
    'DemandRateQtr': [7, 8, 9, 4, 2, 3],
    'AcceptRate': [1, 3, 5, 7, 1, 0],
    'column': [5, 3, 6, 9, 2, 4]
})
print(my_dataframe)
quote_date DemandRate DemandRateQtr AcceptRate column
0 2020-02-01 4 7 1 5
1 2020-02-02 5 8 3 3
2 2020-02-03 4 9 5 6
3 2020-02-01 5 4 7 9
4 2020-02-02 5 2 1 2
5 2020-02-03 4 3 0 4
plot_data = (my_dataframe.query("quote_date>'2020-02-01'")
             .groupby(['quote_date'])[['DemandRate', 'DemandRateQtr', 'AcceptRate']]
             .mean()
             .reset_index())
print (plot_data)
# note: 'column' is not included in the output
quote_date DemandRate DemandRateQtr AcceptRate
0 2020-02-02 5.0 5.0 2.0
1 2020-02-03 4.0 6.0 2.5
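For reference, the bracketed list is ordinary column selection on the DataFrameGroupBy object: it restricts which columns get aggregated. A minimal sketch, reusing my_dataframe from the example above, showing it is equivalent to aggregating that subset with agg:
cols = ['DemandRate', 'DemandRateQtr', 'AcceptRate']
# selecting the column list on the GroupBy object limits the aggregation to those columns
a = my_dataframe.groupby('quote_date')[cols].mean()
b = my_dataframe.groupby('quote_date')[cols].agg('mean')
assert a.equals(b)   # identical results; 'column' is excluded from both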

Related

Fill NaNs in Df using groupby and rolling mean

I have a dataframe that looks like this:
import numpy as np
import pandas as pd

d = {'date': ['1999-01-01', '1999-01-02', '1999-01-03', '1999-01-04', '1999-01-05', '1999-01-06'],
     'ID': [1, 1, 1, 1, 1, 1],
     'Value': [1, 2, 3, np.nan, 5, 6]}
df = pd.DataFrame(data=d)
date ID Value
0 1999-01-01 1 1
1 1999-01-02 1 2
2 1999-01-03 1 3
3 1999-01-04 1 NaN
4 1999-01-05 1 5
5 1999-01-06 1 6
I would like to fill in the NaNs using a rolling mean (e.g. window 2) and extend that to a dataframe with multiple IDs and dates. I tried something like this, but it takes a very long time and fails with the error "cannot join with no overlapping index names":
df.groupby(['date','ID']).fillna(df.rolling(2, min_periods=1).mean().shift())
or
df.groupby(['date','ID']).fillna(df.groupby(['date','ID']).rolling(2, min_periods=1).mean().shift())
IIUC, here is one way to do it. (If you add the expected output, that will help validate this solution.)
df2=df.fillna(0).groupby('ID')['Value'].rolling(2).mean().reset_index()
df.update(df2, overwrite=False)
df
date ID Value
0 1999-01-01 1 1.0
1 1999-01-02 1 2.0
2 1999-01-03 1 3.0
3 1999-01-04 1 1.5
4 1999-01-05 1 5.0
5 1999-01-06 1 6.0
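A sketch of a variant, assuming the intent is to fill each NaN from the two preceding values within the same ID without first treating the NaN as 0: compute the shifted rolling mean per group with transform and pass it to fillna:
import numpy as np
import pandas as pd

d = {'date': ['1999-01-01', '1999-01-02', '1999-01-03', '1999-01-04', '1999-01-05', '1999-01-06'],
     'ID': [1, 1, 1, 1, 1, 1],
     'Value': [1, 2, 3, np.nan, 5, 6]}
df = pd.DataFrame(data=d)

# rolling mean of up to two previous rows within each ID, used only where Value is missing
fill = df.groupby('ID')['Value'].transform(lambda s: s.shift().rolling(2, min_periods=1).mean())
df['Value'] = df['Value'].fillna(fill)
With this variant the NaN row becomes 2.5 (the mean of the two preceding values 2 and 3) rather than 1.5.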

Duplicate Quantity as new columns

This is my table. I want to add, for each (ShopID, ProductID) row, the Quantity of every other (ShopID, ProductID) combination as new columns named Quantity_ShopID_ProductID.
Following is my code:
from datetime import date
import pandas as pd
df=pd.DataFrame({"Date":[date(2019,10,1),date(2019,10,2),date(2019,10,1),date(2019,10,2),date(2019,10,1),date(2019,10,2),date(2019,10,1),date(2019,10,2)],
"ShopID":[1,1,1,1,2,2,2,2],
"ProductID":[1,1,2,2,1,1,2,2],
"Quantity":[3,3,4,4,5,5,6,6]})
for sid in df.ShopID.unique():
    for pid in df.ProductID.unique():
        col_name='Quantity{}_{}'.format(sid,pid)
        print(col_name)
        df1=df[(df.ShopID==sid) & (df.ProductID==pid)][['Date','Quantity']]
        df1.rename(columns={'Quantity':col_name}, inplace=True)
        display(df1)
        df=df.merge(df1, how="left",on="Date")
        df.loc[(df.ShopID==sid) & (df.ProductID==pid),col_name]=None
print(df)
The problem is that it runs very slowly, as I have over 108 different (ShopID, ProductID) combinations over a 3-year period. Is there any way to make it more efficient?
Method 1: using pivot_table with join (vectorized solution)
We can pivot your Quantity values per (ShopID, ProductID) to columns, and then join them back to your original dataframe. This should be much faster than your for-loops since it is a vectorized approach:
piv = df.pivot_table(index=['ShopID', 'ProductID'], columns=['ShopID', 'ProductID'], values='Quantity')
piv2 = piv.ffill().bfill()
piv3 = piv2.mask(piv2.eq(piv))
final = df.set_index(['ShopID', 'ProductID']).join(piv3).reset_index()
Output
ShopID ProductID dt Quantity (1, 1) (1, 2) (2, 1) (2, 2)
0 1 1 2019-10-01 3 NaN 4.0 5.0 6.0
1 1 1 2019-10-02 3 NaN 4.0 5.0 6.0
2 1 2 2019-10-01 4 3.0 NaN 5.0 6.0
3 1 2 2019-10-02 4 3.0 NaN 5.0 6.0
4 2 1 2019-10-01 5 3.0 4.0 NaN 6.0
5 2 1 2019-10-02 5 3.0 4.0 NaN 6.0
6 2 2 2019-10-01 6 3.0 4.0 5.0 NaN
7 2 2 2019-10-02 6 3.0 4.0 5.0 NaN
Method 2, using GroupBy, mask, where:
We can speed up your code by using GroupBy and mask + where instead of two for-loops:
groups = df.groupby(['ShopID', 'ProductID'])
for grp, data in groups:
    m = df['ShopID'].eq(grp[0]) & df['ProductID'].eq(grp[1])
    values = df['Quantity'].where(m).ffill().bfill()
    df[f'Quantity_{grp[0]}_{grp[1]}'] = values.mask(m)
Output
dt ShopID ProductID Quantity Quantity_1_1 Quantity_1_2 Quantity_2_1 Quantity_2_2
0 2019-10-01 1 1 3 NaN 4.0 5.0 6.0
1 2019-10-02 1 1 3 NaN 4.0 5.0 6.0
2 2019-10-01 1 2 4 3.0 NaN 5.0 6.0
3 2019-10-02 1 2 4 3.0 NaN 5.0 6.0
4 2019-10-01 2 1 5 3.0 4.0 NaN 6.0
5 2019-10-02 2 1 5 3.0 4.0 NaN 6.0
6 2019-10-01 2 2 6 3.0 4.0 5.0 NaN
7 2019-10-02 2 2 6 3.0 4.0 5.0 NaN
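As a quick sanity check on the frame produced by the loop above (a small addition, not from the original answer): each row's own Quantity_{ShopID}_{ProductID} column should be NaN, while the columns for the other combinations are filled:
# df is the frame after running the loop in Method 2
for row in df.itertuples(index=False):
    assert pd.isna(getattr(row, f'Quantity_{row.ShopID}_{row.ProductID}'))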
This is a pivot and merge problem with a little extra:
# somehow merge only works with pandas datetime
df['Date'] = pd.to_datetime(df['Date'])
# define the new column names
df['new_col'] = 'Quantity_'+df['ShopID'].astype(str) + '_' + df['ProductID'].astype(str)
# new data to merge:
pivot = df.pivot_table(index='Date',
                       columns='new_col',
                       values='Quantity')
# merge
new_df = df.merge(pivot, left_on='Date', right_index=True)
# mask
mask = new_df['new_col'].values[:,None] == pivot.columns.values
# set each row's own combination column to NaN:
new_df[pivot.columns] = new_df[pivot.columns].mask(mask)
Output:
Date ShopID ProductID Quantity new_col Quantity_1_1 Quantity_1_2 Quantity_2_1 Quantity_2_2
-- ------------------- -------- ----------- ---------- ------------ -------------- -------------- -------------- --------------
0 2019-10-01 00:00:00 1 1 3 Quantity_1_1 nan 4 5 6
1 2019-10-02 00:00:00 1 1 3 Quantity_1_1 nan 4 5 6
2 2019-10-01 00:00:00 1 2 4 Quantity_1_2 3 nan 5 6
3 2019-10-02 00:00:00 1 2 4 Quantity_1_2 3 nan 5 6
4 2019-10-01 00:00:00 2 1 5 Quantity_2_1 3 4 nan 6
5 2019-10-02 00:00:00 2 1 5 Quantity_2_1 3 4 nan 6
6 2019-10-01 00:00:00 2 2 6 Quantity_2_2 3 4 5 nan
7 2019-10-02 00:00:00 2 2 6 Quantity_2_2 3 4 5 nan
Test data with similar size to your actual data:
import numpy as np

# 3 years of dates
dates = pd.date_range('2015-01-01', '2018-12-31', freq='D')
# 12 shops and 9 products
idx = pd.MultiIndex.from_product((dates, range(1,13), range(1,10)),
                                 names=('Date','ShopID', 'ProductID'))
# the test data
np.random.seed(1)
df = pd.DataFrame({'Quantity':np.random.randint(0,10, len(idx))},
                  index=idx).reset_index()
The above code took about 10 seconds on an i5 laptop :-)

Pandas - Replace NaNs in a column with the mean of specific group

I am working with data like the following. The dataframe is sorted by the date:
category value Date
0 1 24/5/2019
1 NaN 24/5/2019
1 1 26/5/2019
2 2 1/6/2019
1 2 23/7/2019
2 NaN 18/8/2019
2 3 20/8/2019
7 3 1/9/2019
1 NaN 12/9/2019
2 NaN 13/9/2019
I would like to replace the "NaN" values with the previous mean for that specific category.
What is the best way to do this in pandas?
Some approaches I considered:
1) This little riff:
df['mean'] = df.groupby('category')['value'].apply(lambda x: x.shift().expanding().mean())
source
This gets me the correct means, but in another column, and it does not replace the NaNs.
2) This riff replaces the NaNs with the average of the columns:
df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))
Source 2
Both of these do not exactly give what I want. If someone could guide me on this it would be much appreciated!
You can replace the missing values with a new Series built from shift + expanding + mean. A NaN in the first row of a group is not replaced, because there are no previous values to average:
df['Date'] = pd.to_datetime(df['Date'])
s = df.groupby('category')['value'].apply(lambda x: x.shift().expanding().mean())
df['value'] = df['value'].fillna(s)
print (df)
category value Date
0 0 1.0 2019-05-24
1 1 NaN 2019-05-24
2 1 1.0 2019-05-26
3 2 2.0 2019-01-06
4 1 2.0 2019-07-23
5 2 2.0 2019-08-18
6 2 3.0 2019-08-20
7 7 3.0 2019-01-09
8 1 1.5 2019-12-09
9 2 2.5 2019-09-13
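To make the shift + expanding + mean idea concrete, here is a tiny standalone illustration of what it computes for a single category (a sketch, not part of the answer above):
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0])
# shift drops the current value; expanding().mean() averages everything seen before it
print(s.shift().expanding().mean().tolist())
# [nan, 1.0, 1.5] -> a NaN in the third row would be filled with the mean of the first two values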
You can use pandas.Series.fillna to replace NaN values:
df['value']=df['value'].fillna(df.groupby('category')['value'].transform(lambda x: x.shift().expanding().mean()))
print(df)
category value Date
0 0 1.0 24/5/2019
1 1 NaN 24/5/2019
2 1 1.0 26/5/2019
3 2 2.0 1/6/2019
4 1 2.0 23/7/2019
5 2 2.0 18/8/2019
6 2 3.0 20/8/2019
7 7 3.0 1/9/2019
8 1 1.5 12/9/2019
9 2 2.5 13/9/2019

(pandas) Fill NaN based on groupby and column condition

Using 'bfill' or 'ffill' on a groupby element is trivial, but what if you need to fill the NaN in a second column with a specific value, based on a condition in a third column?
For example:
>>> df=pd.DataFrame({'date':['01/10/2017', '02/09/2017', '02/10/2016','01/10/2017', '01/11/2017', '02/10/2016'], 'a':[1,1,1,2,2,2], 'b':[4,np.nan,6, 5, np.nan, 7]})
>>> df
a b date
0 1 4.0 01/10/2017
1 1 NaN 02/09/2017
2 1 6.0 02/10/2016
3 2 5.0 01/10/2017
4 2 NaN 01/11/2017
5 2 7.0 02/10/2016
I need to group by column 'a', and fill the NaN with the column 'b' value where the date for that row is closest to the date in the NaN row.
So the output should look like:
a b date
0 1 4.0 01/10/2017
1 1 6.0 02/09/2017
2 1 6.0 02/10/2016
3 2 5.0 01/10/2017
4 2 5.0 01/11/2017
5 2 7.0 02/10/2016
Assume there is a closest_date() function that takes the NaN date and the list of other dates in that group, and returns the closest date.
I'm trying to find a clean solution that doesn't have to iterate through rows, ideally able to use apply() with lambdas. Any ideas?
This should work:
df['closest_date_by_a'] = df.groupby('a')['date'].apply(closest_date)
df['b'] = df.groupby(['a', 'closest_date_by_a'])['b'].ffill().bfill()
Given a function (closest_date()), you need to apply that function by group so it calculates the closest dates for rows within each group. Then you can group by both the main grouping column (a) and the closest date column (closest_date_by_a) and perform your filling.
Ensure that your date column is actually of datetime type.
df = pd.DataFrame(
    {'date': ['01/10/2017', '02/09/2017', '02/10/2016','01/10/2017', '01/11/2017', '02/10/2016'],
     'a':[1,1,1,2,2,2], 'b':[4,np.nan,6, 5, np.nan, 7]})
df.date = pd.to_datetime(df.date)
print(df)
a b date
0 1 4.0 2017-01-10
1 1 NaN 2017-02-09
2 1 6.0 2016-02-10
3 2 5.0 2017-01-10
4 2 NaN 2017-01-11
5 2 7.0 2016-02-10
Use reindex with method='nearest' after dropping the NaN values:
def fill_with_nearest(df):
    s = df.set_index('date').b
    s = s.dropna().reindex(s.index, method='nearest')
    s.index = df.index
    return s
df.loc[df.b.isnull(), 'b'] = df.groupby('a').apply(fill_with_nearest).reset_index(0, drop=True)
print(df)
a b date
0 1 4.0 2017-01-10
1 1 4.0 2017-02-09
2 1 6.0 2016-02-10
3 2 5.0 2017-01-10
4 2 5.0 2017-01-11
5 2 7.0 2016-02-10
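Another way to express the "nearest date within the group" fill is pd.merge_asof with direction='nearest'. This is a sketch of an alternative technique, not the answer above; merge_asof requires both frames to be sorted by the merge key:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'date': ['01/10/2017', '02/09/2017', '02/10/2016', '01/10/2017', '01/11/2017', '02/10/2016'],
     'a': [1, 1, 1, 2, 2, 2], 'b': [4, np.nan, 6, 5, np.nan, 7]})
df['date'] = pd.to_datetime(df['date'])

known = df.dropna(subset=['b']).sort_values('date')                                # rows that already have b
missing = df[df['b'].isna()].drop(columns='b').sort_values('date').reset_index()   # rows to fill, original index kept

# match each missing row to the nearest-dated known row within the same 'a' group
filled = pd.merge_asof(missing, known, on='date', by='a', direction='nearest').set_index('index')
df.loc[filled.index, 'b'] = filled['b']
print(df)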

Select values in Pandas groupby dataframe that are present in n previous groups

I have a Pandas dataframe groupby object which looks like the following:
ID
2014-11-30 1
2
3
2014-12-31 1
2
3
4
2015-01-31 2
3
4
2015-02-28 1
3
4
5
2015-03-31 1
2
4
5
6
2015-04-30 3
4
5
6
What I want to do is create another dataframe where the values for groupby date x are the values present in each of the groupby dates y(x-1) through y(x-n), where y(x-n) is the group n periods before x. So, for instance, if n=1 and the groupby period x is '2015-04-30', you would check against '2015-03-31'. If n=2 and the groupby date is '2015-02-28', you would check against the groupby dates ['2015-01-31', '2014-12-31'].
The resulting dataframe from the above would look like this for n=1:
ID
2014-12-31 1
2
3
2015-01-31 2
3
4
2015-02-28 3
4
2015-03-31 1
4
5
2015-04-30 4
5
6
The resulting dataframe for n=2 would be:
2015-01-31 2
3
2015-02-28 3
4
2015-03-31 4
2015-04-30 4
5
Looking forward to some pythonic solutions!
This would seem to work:
def filter_unique(df, n):
    data_by_date = df.groupby('date')['ID'].apply(lambda x: x.tolist())
    filtered_data = {}
    previous = []
    for i, (date, data) in enumerate(data_by_date.items()):
        if i >= n:
            if len(previous) == 1:
                filtered_data[date] = list(set(previous[i-n]).intersection(data))
            else:
                filtered_data[date] = list(set.intersection(*[set(x) for x in previous[i-n:]]).intersection(data))
        else:
            filtered_data[date] = data
        previous.append(data)
    result = pd.DataFrame.from_dict(filtered_data, orient='index').stack()
    result.index = result.index.droplevel(1)
    return result

filter_unique(df, 2)
1/31/15 2
1/31/15 3
1/31/15 4
11/30/14 1
11/30/14 2
11/30/14 3
12/31/14 2
12/31/14 3
2/28/15 1
2/28/15 3
3/31/15 1
3/31/15 4
4/30/15 4
4/30/15 5
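For reference, filter_unique above assumes the data is a flat dataframe with 'date' and 'ID' columns rather than the printed groupby view; here is a sketch reconstructing the example input (the column names are an assumption). Because the dates are plain strings, groupby visits them in lexicographic order, which is why the result above is not printed in chronological order:
import pandas as pd

data = {
    '11/30/14': [1, 2, 3],
    '12/31/14': [1, 2, 3, 4],
    '1/31/15': [2, 3, 4],
    '2/28/15': [1, 3, 4, 5],
    '3/31/15': [1, 2, 4, 5, 6],
    '4/30/15': [3, 4, 5, 6],
}
df = pd.DataFrame([(d, i) for d, ids in data.items() for i in ids],
                  columns=['date', 'ID'])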
