Incomprehensible Pandas groupby results - python

Coming from R, where I have mostly worked with the tidyverse, I am trying to understand how pandas groupby and aggregation work. I have this code, and the results are heartbreaking to me.
import pandas as pd
df = pd.read_csv('https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv')
df.rename(columns={'Unnamed: 0':'brand'}, inplace=True)
Now I would like to calculate the average displacement (disp) by number of cylinders (cyl), like this:
df['avg_disp'] = df.groupby('cyl').disp.mean()
Which results in something like:
cyl disp avg_disp
31 4 121.0 NaN
2 4 108.0 NaN
27 4 95.1 NaN
26 4 120.3 NaN
25 4 79.0 NaN
20 4 120.1 NaN
7 4 146.7 NaN
8 4 140.8 353.100000
19 4 71.1 NaN
18 4 75.7 NaN
17 4 78.7 NaN
29 6 145.0 NaN
0 6 160.0 NaN
1 6 160.0 NaN
3 6 258.0 NaN
10 6 167.6 NaN
9 6 167.6 NaN
5 6 225.0 NaN
13 8 275.8 NaN
28 8 351.0 NaN
4 8 360.0 105.136364
24 8 400.0 NaN
23 8 350.0 NaN
22 8 304.0 NaN
21 8 318.0 NaN
6 8 360.0 183.314286
11 8 275.8 NaN
16 8 440.0 NaN
30 8 301.0 NaN
14 8 472.0 NaN
12 8 275.8 NaN
15 8 460.0 NaN
After searching for a while, I discovered the transform function, which gives the correct value for avg_disp by assigning the group mean to each row according to its cyl value.
My point is... why can't it be done easily with the mean function instead of using .transform('mean') on the grouped data frame?

If you want to add the results back to the ungrouped dataframe, you can use .transform, which will (quoting the pandas docs):
... and return a DataFrame having the same indexes as the original object filled with the transformed values.
df['avg_disp'] = df.groupby('cyl').disp.transform('mean')
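For completeness: the NaNs in the question come from index alignment. df.groupby('cyl').disp.mean() returns a Series indexed by the cyl values (4, 6, 8), so assigning it to a new column aligns on the DataFrame's row index, and only the rows labelled 4, 6 and 8 pick up a value. A minimal alternative sketch, equivalent to .transform('mean') here, is to map the per-group means back through the cyl column:
# Series of group means, indexed by cyl (4, 6, 8)
group_means = df.groupby('cyl')['disp'].mean()
# look up each row's cyl value to broadcast its group mean onto every row
df['avg_disp'] = df['cyl'].map(group_means)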

Related

remove certain numbers from two dataframes python

I have two dataframes
dt AAPL AMC AMZN ASO ATH ... SPCE SRNE TH TSLA VIAC WKHS
0 2021-04-12 36 28 6 20 1 ... 5 0 0 50 23 0
1 2021-04-13 46 15 5 16 6 ... 5 0 0 122 12 1
2 2021-04-14 12 4 1 5 2 ... 2 0 0 39 1 0
3 2021-04-15 30 23 3 14 2 ... 15 0 0 101 9 0
dt AAPL AMC AMZN ASO ATH ... SPCE SRNE TH TSLA VIAC WKHS
0 2021-04-12 41 28 4 33 10 ... 5 0 0 56 14 3
1 2021-04-13 76 22 7 12 29 ... 4 0 0 134 8 2
2 2021-04-14 21 15 2 7 16 ... 2 0 0 61 3 0
3 2021-04-15 54 43 9 2 31 ... 16 0 0 83 13 1
I want to remove, from both dataframes, any number lower than 10: if a cell is removed from one dataframe, the same cell should be removed from the other, and vice versa.
Appreciate your help
Use a mask:
## pre-requisite
df1 = df1.set_index('dt')
df2 = df2.set_index('dt')
## processing
mask = df1.lt(10) | df2.lt(10)
df1 = df1.mask(mask)
df2 = df2.mask(mask)
output:
>>> df1
AAPL AMC AMZN ASO ATH SPCE SRNE TH TSLA VIAC WKHS
dt
2021-04-12 36 28.0 NaN 20.0 NaN NaN NaN NaN 50 23.0 NaN
2021-04-13 46 15.0 NaN 16.0 NaN NaN NaN NaN 122 NaN NaN
2021-04-14 12 NaN NaN NaN NaN NaN NaN NaN 39 NaN NaN
2021-04-15 30 23.0 NaN NaN NaN 15.0 NaN NaN 101 NaN NaN
>>> df2
AAPL AMC AMZN ASO ATH SPCE SRNE TH TSLA VIAC WKHS
dt
2021-04-12 41 28.0 NaN 33.0 NaN NaN NaN NaN 56 14.0 NaN
2021-04-13 76 22.0 NaN 12.0 NaN NaN NaN NaN 134 NaN NaN
2021-04-14 21 NaN NaN NaN NaN NaN NaN NaN 61 NaN NaN
2021-04-15 54 43.0 NaN NaN NaN 16.0 NaN NaN 83 NaN NaN
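Note that .mask keeps the values where the condition is False and, by default, fills the rest with NaN, which is also why the integer columns come back as floats: NaN forces a float dtype.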

Pandas: grab positions in dataframe which indexes are listed in another dataframe

Suppose that I have 2 dataframes, with indexes populated so that elements in columns are unique, because in real data they are:
import numpy as np
import pandas as pd

vals = pd.DataFrame(np.random.randint(0, 100, (10, 3)), columns=list('ABC'))
indexes = pd.DataFrame(np.argsort(np.random.randint(0, 10, (10, 3)), axis=0)[:5], columns=list('ABC'))
>>> vals
A B C
0 64 20 48
1 28 60 81
2 5 73 77
3 74 66 86
4 41 39 21
5 65 37 98
6 10 20 73
7 6 70 3
8 36 29 28
9 43 13 12
>>> indexes
A B C
0 4 2 3
1 3 3 8
2 5 1 7
3 9 8 9
4 2 4 0
I would like to retain only those values in vals whose indexes are listed in indexes. I don't care about row integrity or NAs, as I'll use the columns as Series later.
This is what I came up with:
vals_indexes = pd.DataFrame()
for i in range(vals.shape[1]):
    vals_indexes = pd.concat(
        [vals_indexes,
         vals.iloc[[e for e in indexes.iloc[:, i] if e in vals.index], i]],
        axis=1)
>>> vals_indexes
A B C
0 NaN NaN 48.0
1 NaN 60.0 NaN
2 5.0 73.0 NaN
3 74.0 66.0 86.0
4 41.0 39.0 NaN
5 65.0 NaN NaN
7 NaN NaN 3.0
8 NaN 29.0 28.0
9 43.0 NaN 12.0
Which is a bit ugly, but works for me. Question: is there a more effective way to do this?
Use .loc within a loop to replace the rows whose index is not listed in indexes with NaN:
for i in vals.columns:
    vals.loc[~vals.index.isin(indexes[i]), i] = np.nan
print(vals)
      A     B     C
0   NaN   NaN  48.0
1   NaN  60.0   NaN
2   5.0  73.0   NaN
3  74.0  66.0  86.0
4  41.0  39.0   NaN
5  65.0   NaN   NaN
6   NaN   NaN   NaN
7   NaN   NaN   3.0
8   NaN  29.0  28.0
9  43.0   NaN  12.0
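If you want to avoid the loop entirely, a small vectorized sketch (the names mask and vals_indexes are just illustrative) is to build a boolean mask per column and use where:
# True where the row index is listed for that column, False otherwise
mask = pd.DataFrame({c: vals.index.isin(indexes[c]) for c in vals.columns}, index=vals.index)
# keep the listed positions, NaN everywhere else
vals_indexes = vals.where(mask)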

How To Map Column Values where two others match? "Reindexing only valid with uniquely valued Index objects"?

I have one DataFrame, df, with the four columns shown below:
IDP1 IDP1Number IDP2 IDP2Number
1 100 1 NaN
3 110 2 150
5 120 3 NaN
7 140 4 160
9 150 5 190
NaN NaN 6 130
NaN NaN 7 NaN
NaN NaN 8 200
NaN NaN 9 90
NaN NaN 10 NaN
I want to map values from df.IDP1Number into IDP2Number by matching IDP1 to IDP2: whenever an IDP2 value also appears in IDP1, replace IDP2Number with the corresponding IDP1Number; otherwise leave the value in IDP2Number alone.
The error message that appears reads, "Reindexing only valid with uniquely valued Index objects".
The Dataframe below is what I wish to have:
IDP1 IDP1Number IDP2 IDP2Number
1 100 1 100
3 110 2 150
5 120 3 110
7 140 4 160
9 150 5 120
NaN NaN 6 130
NaN NaN 7 140
NaN NaN 8 200
NaN NaN 9 150
NaN NaN 10 NaN
Here's a way to do:
# filter the data and create a mapping dict
maps = df.query("IDP1.notna()")[['IDP1', 'IDP1Number']].set_index('IDP1')['IDP1Number'].to_dict()
# create new column using ifelse condition
df['IDP2Number'] = df.apply(lambda x: maps.get(x['IDP2'], None) if (pd.isna(x['IDP2Number']) or x['IDP2'] in maps) else x['IDP2Number'], axis=1)
print(df)
IDP1 IDP1Number IDP2 IDP2Number
0 1.0 100.0 1 100.0
1 3.0 110.0 2 150.0
2 5.0 120.0 3 110.0
3 7.0 140.0 4 160.0
4 9.0 150.0 5 120.0
5 NaN NaN 6 130.0
6 NaN NaN 7 140.0
7 NaN NaN 8 200.0
8 NaN NaN 9 150.0
9 NaN NaN 10 NaN
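A shorter sketch that avoids the row-wise apply (the name lookup is just illustrative): build a dict from IDP1 to IDP1Number and map it onto IDP2, keeping the existing IDP2Number wherever there is no match:
# lookup dict mapping IDP1 -> IDP1Number (rows with missing IDP1 dropped)
lookup = df.dropna(subset=['IDP1']).set_index('IDP1')['IDP1Number'].to_dict()
# use the mapped value where IDP2 matches an IDP1, otherwise keep the old IDP2Number
df['IDP2Number'] = df['IDP2'].map(lookup).fillna(df['IDP2Number'])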

Problem with merging Pandas Dataframes with Columns that don't line up

I am attempting to transpose and merge two pandas dataframes: one contains the accounts, the segment in which they received their deposit, the deposit information, and the day they received the deposit; the other has the accounts and the withdrawal information. The issue is that, for indexing purposes, the segment information from one dataframe should line up with that of the other, regardless of whether there is a withdrawal or not.
Notes:
There will always be an account for every person
There will not always be a withdrawal for every person
The accounts and data for the withdrawal dataframe only exist if a withdrawal occurs
Account Dataframe Code
from pandas import DataFrame

accounts = DataFrame({'person': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                      'segment': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
                      'date_received': [10, 20, 30, 40, 50, 11, 21, 31, 41, 51],
                      'amount_received': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
accounts = accounts.pivot_table(index=["person"], columns=["segment"])
Account Dataframe
amount_received date_received
segment 1 2 3 4 5 1 2 3 4 5
person
1 1 2 3 4 5 10 20 30 40 50
2 6 7 8 9 10 11 21 31 41 51
Withdrawal Dataframe Code
withdrawals = DataFrame({'person': [1, 1, 1, 2, 2],
                         'withdrawal_segment': [1, 1, 5, 2, 3],
                         'withdraw_date': [1, 2, 3, 4, 5],
                         'withdraw_amount': [10, 20, 30, 40, 50]})
withdrawals = withdrawals.reset_index().pivot_table(index=['index', 'person'], columns=['withdrawal_segment'])
Since segments are unique for each person, each segment number must appear only once in the columns while still holding all of the data, which is why this dataframe looks so different.
Withdrawal Dataframe
withdraw_date withdraw_amount
withdrawal_segment 1 2 3 5 1 2 3 5
index person
0 1 1.0 NaN NaN NaN 10.0 NaN NaN NaN
1 1 2.0 NaN NaN NaN 20.0 NaN NaN NaN
2 1 NaN NaN NaN 3.0 NaN NaN NaN 30.0
3 2 NaN 4.0 NaN NaN NaN 40.0 NaN NaN
4 2 NaN NaN 5.0 NaN NaN NaN 50.0 NaN
Merge
merge = accounts.merge(withdrawals, on='person', how='left')
amount_received date_received withdraw_date withdraw_amount
segment 1 2 3 4 5 1 2 3 4 5 1 2 3 5 1 2 3 5
person
1 1 2 3 4 5 10 20 30 40 50 1.0 NaN NaN NaN 10.0 NaN NaN NaN
1 1 2 3 4 5 10 20 30 40 50 2.0 NaN NaN NaN 20.0 NaN NaN NaN
1 1 2 3 4 5 10 20 30 40 50 NaN NaN NaN 3.0 NaN NaN NaN 30.0
2 6 7 8 9 10 11 21 31 41 51 NaN 4.0 NaN NaN NaN 40.0 NaN NaN
2 6 7 8 9 10 11 21 31 41 51 NaN NaN 5.0 NaN NaN NaN 50.0 NaN
The problem with the merged dataframe is that segments from the withdrawal dataframe aren't lined up with the accounts segments.
The desired dataframe should look something like:
amount_received date_received withdraw_date withdraw_amount
segment 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
person
1 1 2 3 4 5 10 20 30 40 50 1.0 NaN NaN NaN NaN 10.0 NaN NaN NaN NaN
1 1 2 3 4 5 10 20 30 40 50 2.0 NaN NaN NaN NaN 20.0 NaN NaN NaN NaN
1 1 2 3 4 5 10 20 30 40 50 NaN NaN NaN NaN 3.0 NaN NaN NaN NaN 30.0
2 6 7 8 9 10 11 21 31 41 51 NaN 4.0 NaN NaN NaN NaN 40.0 NaN NaN NaN
2 6 7 8 9 10 11 21 31 41 51 NaN NaN 5.0 NaN NaN NaN NaN 50.0 NaN NaN
My problem is that I can't seem to merge across both person and segments. I've thought about inserting a row and column, but because I don't know which segments are and aren't going to have a withdrawal this gets difficult. Is it possible to merge the dataframes so that they line up across both people and segments? Thanks!
Method 1, using reindex:
withdrawals = withdrawals.reindex(
    pd.MultiIndex.from_product([withdrawals.columns.levels[0],
                                accounts.columns.levels[1]]),
    axis=1)
merge = accounts.merge(withdrawals, on='person', how='left')
merge
Out[79]:
amount_received date_received \
segment 1 2 3 4 5 1 2 3 4 5
person
1 1 2 3 4 5 10 20 30 40 50
1 1 2 3 4 5 10 20 30 40 50
1 1 2 3 4 5 10 20 30 40 50
2 6 7 8 9 10 11 21 31 41 51
2 6 7 8 9 10 11 21 31 41 51
withdraw_amount withdraw_date
segment 1 2 3 4 5 1 2 3 4 5
person
1 10.0 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
1 20.0 NaN NaN NaN NaN 2.0 NaN NaN NaN NaN
1 NaN NaN NaN NaN 30.0 NaN NaN NaN NaN 3.0
2 NaN 40.0 NaN NaN NaN NaN 4.0 NaN NaN NaN
2 NaN NaN 50.0 NaN NaN NaN NaN 5.0 NaN NaN
Method 2, using unstack and stack:
merge = accounts.merge(withdrawals, on='person', how='left')
merge.stack(dropna=False).unstack()
Out[82]:
amount_received date_received \
segment 1 2 3 4 5 1 2 3 4 5
person
1 1 2 3 4 5 10 20 30 40 50
1 1 2 3 4 5 10 20 30 40 50
1 1 2 3 4 5 10 20 30 40 50
2 6 7 8 9 10 11 21 31 41 51
2 6 7 8 9 10 11 21 31 41 51
withdraw_amount withdraw_date
segment 1 2 3 4 5 1 2 3 4 5
person
1 10.0 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
1 20.0 NaN NaN NaN NaN 2.0 NaN NaN NaN NaN
1 NaN NaN NaN NaN 30.0 NaN NaN NaN NaN 3.0
2 NaN 40.0 NaN NaN NaN NaN 4.0 NaN NaN NaN
2 NaN NaN 50.0 NaN NaN NaN NaN 5.0 NaN NaN
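Both methods do the same thing conceptually: they force the withdrawal columns to cover the full set of segments 1-5 so they line up with the account columns. Method 1 reindexes the withdrawal columns against the cross product of its measures and the account segments before merging, so the missing segment columns are created as all-NaN; Method 2 merges first and then round-trips the segment column level through stack(dropna=False)/unstack, which materializes the missing segment columns for every measure.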

How to sum certain values in a pandas column DataFrame in a specific date range

I have a large DataFrame that looks something like this:
df =
UPC Unit_Sales Price Price_Change Date
0 22 15 1.99 NaN 2017-10-10
1 22 7 2.19 True 2017-10-12
2 22 6 2.19 NaN 2017-10-13
3 22 7 1.99 True 2017-10-16
4 22 4 1.99 NaN 2017-10-17
5 35 15 3.99 NaN 2017-10-09
6 35 17 3.99 NaN 2017-10-11
7 35 5 4.29 True 2017-10-13
8 35 8 4.29 NaN 2017-10-15
9 35 2 4.29 NaN 2017-10-15
Basically I am trying to record how the sales of a product (UPC) reacted in the 7 days following a price change. I want to create a new column ['Reaction'] which records the sum of the unit sales from the day of the price change through 7 days forward. Keep in mind, sometimes a UPC has more than 2 price changes, so I want a different sum for each price change.
So I want to see this:
UPC Unit_Sales Price Price_Change Date Reaction
0 22 15 1.99 NaN 2017-10-10 NaN
1 22 7 2.19 True 2017-10-12 13
2 22 6 2.19 NaN 2017-10-13 NaN
3 22 7 1.99 True 2017-10-16 11
4 22 4 1.99 NaN 2017-10-19 NaN
5 35 15 3.99 NaN 2017-10-09 NaN
6 35 17 3.99 NaN 2017-10-11 NaN
7 35 5 4.29 True 2017-10-13 15
8 35 8 4.29 NaN 2017-10-15 NaN
9 35 2 4.29 NaN 2017-10-18 NaN
What is difficult is how the dates are set up in my data. Sometimes (like for UPC 35) the dates don't range past 7 days. So I would want it to default to the next nearest date, or however many dates there are (if there are less than 7 days).
Here's what I've tried:
I converted the date column to datetime, and I'm thinking of counting days with the .days attribute.
This is how I'm thinking of setting a code up (rough draft):
x = df.loc[df['Price_Change'] == 'True']
for x in df:
df['Reaction'] = sum(df.Unit_Sales[1day :8days])
Is there an easier way to do this, maybe without a for loop?
You just need ffill with groupby
df.loc[df.Price_Change == True, 'Reaction'] = (
    df.groupby('UPC')
      .apply(lambda x: (x['Price_Change'].ffill() * x['Unit_Sales']).sum())
      .values
)
df
Out[807]:
UPC Unit_Sales Price Price_Change Date Reaction
0 22 15 1.99 NaN 2017-10-10 NaN
1 22 7 2.19 True 2017-10-12 24.0
2 22 6 2.19 NaN 2017-10-13 NaN
3 22 7 2.19 NaN 2017-10-16 NaN
4 22 4 2.19 NaN 2017-10-17 NaN
5 35 15 3.99 NaN 2017-10-09 NaN
6 35 17 3.99 NaN 2017-10-11 NaN
7 35 5 4.29 True 2017-10-13 15.0
8 35 8 4.29 NaN 2017-10-15 NaN
9 35 2 4.29 NaN 2017-10-15 NaN
Update
df['New'] = df.groupby('UPC').apply(lambda x: x['Price_Change'] == True).cumsum().values
v1 = df.groupby(['UPC', 'New']).apply(lambda x: (x['Price_Change'].ffill() * x['Unit_Sales']).sum())
df = df.merge(v1.reset_index())
df[0] = df[0].mask(df['Price_Change'] != True)
df
Out[927]:
UPC Unit_Sales Price Price_Change Date New 0
0 22 15 1.99 NaN 2017-10-10 0 NaN
1 22 7 2.19 True 2017-10-12 1 13.0
2 22 6 2.19 NaN 2017-10-13 1 NaN
3 22 7 1.99 True 2017-10-16 2 11.0
4 22 4 1.99 NaN 2017-10-17 2 NaN
5 35 15 3.99 NaN 2017-10-09 2 NaN
6 35 17 3.99 NaN 2017-10-11 2 NaN
7 35 5 4.29 True 2017-10-13 3 15.0
8 35 8 4.29 NaN 2017-10-15 3 NaN
9 35 2 4.29 NaN 2017-10-15 3 NaN
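To spell out the update: New is a counter that increases by one at every price change, so each (UPC, New) pair marks one price regime. v1 sums Unit_Sales within each regime (the forward-filled Price_Change flags the rows from the change onward), and merging v1 back and masking the non-change rows leaves the per-change sums 13, 11 and 15 only on the rows where the price actually changed, matching the desired output (the column ends up named 0 because v1 has no name).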
