mean per group in a fragmented dataset - python

This is actually an extension of my previous question, but I was asked to post it as a separate question:
Rolling average on previous dates per group
I have the following dataset:
Name Loc Site Date Total
Alex Italy A 12.31.2020 30
Alex Italy B 12.31.2020 20
Alex Italy B 12.30.2020 100
Alex Italy B 12.28.2020 40
Alex Italy A 12.23.2020 80
Alex France A 12.28.2020 10
Alex France B 12.28.2020 20
Alex France B 12.23.2020 10
Alex France A 12.23.2020 100
Alex France B 12.21.2020 25
For each row I want to add the average of Total over an arbitrary time frame before that row's Date, per Name and Loc.
This is the outcome I'm looking for with a window of the previous 5 days (excluding the Date itself):
Name Loc Site Date Total Prv_Avg
Alex Italy A 12.31.2020 30 70
Alex Italy B 12.31.2020 20 70
Alex Italy B 12.30.2020 100 40
Alex Italy B 12.28.2020 40 80
Alex Italy A 12.23.2020 80 NaN
Alex France A 12.28.2020 10 55
Alex France B 12.28.2020 20 55
Alex France B 12.23.2020 10 25
Alex France A 12.23.2020 100 25
Alex France B 12.21.2020 25 NaN
The NaNs are for rows that have no other rows within the previous 5 days in the data.

Use a custom function in GroupBy.transform that replaces non-matching values with NaN and computes the averages with numpy.nanmean:
import numpy as np
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])

def f(x):
    # group dates (the index) and the Total values
    arr = x.index.to_numpy()
    s = x.to_numpy()
    # start of each row's 5-day look-back window
    prev = arr - pd.Timedelta(5, 'day')
    # for row i keep values dated strictly before arr[i] and no older
    # than prev[i], then average whatever is left in that window
    return np.nanmean(np.where((arr[:, None] > arr) &
                               (arr >= prev[:, None]), s, np.nan), axis=1)

df['Prv_Avg'] = (df.set_index('Date')
                   .groupby(['Name','Loc'])['Total']
                   .transform(f)
                   .to_numpy())
print (df)
Name Loc Site Date Total Prv_Avg
0 Alex Italy A 2020-12-31 30 70.0
1 Alex Italy B 2020-12-31 20 70.0
2 Alex Italy B 2020-12-30 100 40.0
3 Alex Italy B 2020-12-28 40 80.0
4 Alex Italy A 2020-12-23 80 NaN
5 Alex France A 2020-12-28 10 55.0
6 Alex France B 2020-12-28 20 55.0
7 Alex France B 2020-12-23 10 25.0
8 Alex France A 2020-12-23 100 25.0
9 Alex France B 2020-12-21 25 NaN
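As an alternative sketch (not part of the original answer), the same per-group look-back can be written with a time-based rolling window; closed='left' makes the window [Date - 5 days, Date), so the current date is excluded just as above:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
tmp = df.sort_values(['Name', 'Loc', 'Date'])   # rolling needs increasing dates per group

prv = (tmp.groupby(['Name', 'Loc'])[['Date', 'Total']]
          .rolling('5D', on='Date', closed='left')   # window [Date - 5 days, Date)
          .mean()
          .reset_index(level=['Name', 'Loc'], drop=True)['Total'])

df['Prv_Avg'] = prv   # prv keeps the original row index, so this aligns correctly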


How to sum the amount for all rows where the current row's close date falls between another row's open and close dates, for each sales rep

I have a dataframe with "close_date", "open_date", "amount", "sales_rep".
sales_rep  open_date(MM/DD/YYYY)  close_date  amount
Jim        1/01/2021              2/05/2021   3
Jim        1/15/2021              4/06/2021   26
Jim        2/01/2021              2/06/2021   7
Jim        2/15/2021              3/14/2021   12
Jim        3/01/2021              4/22/2021   13
Jim        3/15/2021              3/29/2021   5
Jim        4/01/2021              4/20/2021   17
Bob        1/01/2021              1/12/2021   23
Bob        1/15/2021              2/16/2021   12
Bob        2/01/2021              3/04/2021   4
Bob        2/15/2021              4/05/2021   23
Bob        3/01/2021              3/24/2021   12
Bob        3/15/2021              4/15/2021   7
Bob        4/01/2021              5/01/2021   20
I want to create a column that tells me the open amount. Take the second row: the opp was closed on 04/06/2021, and I want to know how many open opps there were on that date. So I check whether the open date of row 5 is before the close date 04/06/2021 and whether the close date of row 5 is after 04/06/2021; in this case it is, so I add its amount to the sum. I also want the current row's value to be included in the sum. This should be done for each sales rep in the dataframe. I have filled in the table with the expected values below.
sales_rep  open_date(MM/DD/YYYY)  close_date  amount  open_amount_sum
Jim        1/01/2021              2/05/2021   3       36
Jim        1/15/2021              4/06/2021   26      56
Jim        2/01/2021              2/06/2021   7       33
Jim        2/15/2021              3/14/2021   12      51
Jim        3/01/2021              4/22/2021   13      13
Jim        3/15/2021              3/29/2021   5       44
Jim        4/01/2021              4/20/2021   17      30
Bob        1/01/2021              1/12/2021   23      23
Bob        1/15/2021              2/16/2021   12      39
Bob        2/01/2021              3/04/2021   4       39
Bob        2/15/2021              4/05/2021   23      50
Bob        3/01/2021              3/24/2021   12      42
Bob        3/15/2021              4/15/2021   7       27
Bob        4/01/2021              5/01/2021   20      20
(The 36 in the first row comes from adding 26 and 7, the only other values that fit the condition, plus the row's own 3.)
Edit: #RJ's solution from the comments is better. Here it is, formatted slightly differently:
df['open_amount_sum'] = df.apply(
    lambda x: df[
        df['sales_rep'].eq(x['sales_rep']) &
        df['open_date'].le(x['close_date']) &
        df['close_date'].ge(x['close_date'])
    ]['amount'].sum(),
    axis=1,
)
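Note that both apply-based snippets compare the date columns as they were read; if open_date and close_date are still strings in MM/DD/YYYY form, the comparison is lexicographic and can misorder months from October onward. Parsing them first is safer (a small addition, not part of either original answer); the merge-based solution further below appears to have done this already, since its output shows datetime-formatted dates.
df['open_date'] = pd.to_datetime(df['open_date'], format='%m/%d/%Y')
df['close_date'] = pd.to_datetime(df['close_date'], format='%m/%d/%Y')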
Here is a solution, but it is slow and kind of ugly. It can definitely be improved.
import pandas as pd
import io
df = pd.read_csv(io.StringIO(
"""
sales_rep,open_date,close_date,amount
Jim,1/01/2021,2/05/2021,3
Jim,1/15/2021,4/06/2021,26
Jim,2/01/2021,2/06/2021,7
Jim,2/15/2021,3/14/2021,12
Jim,3/01/2021,4/22/2021,13
Jim,3/15/2021,3/29/2021,5
Jim,4/01/2021,4/20/2021,17
Bob,1/01/2021,1/12/2021,23
Bob,1/15/2021,2/16/2021,12
Bob,2/01/2021,3/04/2021,4
Bob,2/15/2021,4/05/2021,23
Bob,3/01/2021,3/24/2021,12
Bob,3/15/2021,4/15/2021,7
Bob,4/01/2021,5/01/2021,20
"""
))
sum_df = df.groupby('sales_rep').apply(
    lambda g:
        g['close_date'].apply(
            lambda close:
                g.loc[
                    g['open_date'].le(close) & g['close_date'].ge(close),
                    'amount'
                ].sum())
).reset_index(level=0)
df['close_sum'] = sum_df['close_date']
df
Merge the dataframe onto itself, then filter, before grouping:
(df
 .merge(df, on='sales_rep')
 .query('open_date_y <= close_date_x <= close_date_y')
 .loc(axis=1)['sales_rep', 'open_date_x', 'close_date_x', 'amount_x', 'amount_y']
 .rename(columns=lambda col: col.removesuffix('_x'))
 .rename(columns={'amount_y': 'open_sum_amount'})
 .groupby(['sales_rep', 'open_date', 'close_date', 'amount'],
          sort=False,
          as_index=False)
 .sum()
)
sales_rep open_date close_date amount open_sum_amount
0 Jim 2021-01-01 2021-02-05 3 36
1 Jim 2021-01-15 2021-04-06 26 56
2 Jim 2021-02-01 2021-02-06 7 33
3 Jim 2021-02-15 2021-03-14 12 51
4 Jim 2021-03-01 2021-04-22 13 13
5 Jim 2021-03-15 2021-03-29 5 44
6 Jim 2021-04-01 2021-04-20 17 30
7 Bob 2021-01-01 2021-01-12 23 23
8 Bob 2021-01-15 2021-02-16 12 39
9 Bob 2021-02-01 2021-03-04 4 39
10 Bob 2021-02-15 2021-04-05 23 50
11 Bob 2021-03-01 2021-03-24 12 42
12 Bob 2021-03-15 2021-04-15 7 27
13 Bob 2021-04-01 2021-05-01 20 20

Get column name as new column with the same column value

I have dataframe similar to this one:
name hobby date country 5 10 15 20 ...
Toby Guitar 2020-01-19 Brazil 0.1245 0.2543 0.7763 0.2264
Linda Cooking 2020-03-05 Italy 0.5411 0.2213 Nan 0.3342
Ben Diving 2020-04-02 USA 0.8843 0.2333 0.4486 0.2122
...
I want to take the int columns, duplicate them, and put the int as the new value of the column, something like this:
name hobby date country 5 5 10 10 15 15 20 20...
Toby Guitar 2020-01-19 Brazil 0.1245 5 0.2543 10 0.7763 15 0.2264 20
Linda Cooking 2020-03-05 Italy 0.5411 5 0.2213 10 Nan 15 0.3342 20
Ben Diving 2020-04-02 USA 0.8843 5 0.2333 10 0.4486 15 0.2122 20
...
I'm not sure how to tackle this and looking for ideas
Here is a solution you can try out,
digits_ = pd.DataFrame(
    {col: [int(col)] * len(df) for col in df.columns if col.isdigit()}
)
pd.concat([df, digits_], axis=1)
name hobby date country 5 ... 20 5 10 15 20
0 Toby Guitar 2020-01-19 Brazil 0.1245 ... 0.2264 5 10 15 20
1 Linda Cooking 2020-03-05 Italy 0.5411 ... 0.3342 5 10 15 20
2 Ben Diving 2020-04-02 USA 0.8843 ... 0.2122 5 10 15 20
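If the interleaved column order from the question is needed, one option (a sketch, assuming the first four columns are the identifier columns) is to reorder the concatenated frame by position, since the duplicated labels make label-based selection ambiguous:
out = pd.concat([df, digits_], axis=1)
n_id = 4                      # name, hobby, date, country
n_num = digits_.shape[1]      # number of duplicated numeric columns
# id columns first, then each value column immediately followed by its label column
order = list(range(n_id)) + [p for k in range(n_num)
                             for p in (n_id + k, n_id + n_num + k)]
out = out.iloc[:, order]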
I'm not sure if it is the best way to organise data with duplicated column names. I would recommend stacking (melting) it into long format.
df.melt(id_vars=["name", "hobby", "date", "country"])
Result
name hobby date country variable value
0 Toby Guitar 2020-01-19 Brazil 5 0.1245
1 Linda Cooking 2020-03-05 Italy 5 0.5411
2 Ben Diving 2020-04-02 USA 5 0.8843
3 Toby Guitar 2020-01-19 Brazil 10 0.2543
4 Linda Cooking 2020-03-05 Italy 10 0.2213
5 Ben Diving 2020-04-02 USA 10 0.2333
6 Toby Guitar 2020-01-19 Brazil 15 0.7763
7 Linda Cooking 2020-03-05 Italy 15 NaN
8 Ben Diving 2020-04-02 USA 15 0.4486
9 Toby Guitar 2020-01-19 Brazil 20 0.2264
10 Linda Cooking 2020-03-05 Italy 20 0.3342
11 Ben Diving 2020-04-02 USA 20 0.2122
You could use the pandas insert(...) function combined with a for loop
import numpy as np
import pandas as pd

df = pd.DataFrame([['Toby', 'Guitar', '2020-01-19', 'Brazil', 0.1245, 0.2543, 0.7763, 0.2264],
                   ['Linda', 'Cooking', '2020-03-05', 'Italy', 0.5411, 0.2213, np.nan, 0.3342],
                   ['Ben', 'Diving', '2020-04-02', 'USA', 0.8843, 0.2333, 0.4486, 0.2122]],
                  columns=['name', 'hobby', 'date', 'country', 5, 10, 15, 20])
start_col = 4
for i in range(0, len(df.columns) - start_col):
    dcol = df.columns[start_col + i*2]  # digit col name to duplicate
    df.insert(start_col + i*2 + 1, dcol, [dcol] * len(df.index), True)
results:
name hobby date country 5 ... 10 15 15 20 20
0 Toby Guitar 2020-01-19 Brazil 0.1245 ... 10 0.7763 15 0.2264 20
1 Linda Cooking 2020-03-05 Italy 0.5411 ... 10 NaN 15 0.3342 20
2 Ben Diving 2020-04-02 USA 0.8843 ... 10 0.4486 15 0.2122 20
[3 rows x 12 columns]
I assumed that all your columns from the 5th onward are digits, but if not you could add an if condition in the for loop to guard against this:
start_col = 4
for i in range(0, len(df.columns) - start_col):
    dcol = df.columns[start_col + i*2]  # digit col name to duplicate
    if type(dcol) is int:
        df.insert(start_col + i*2 + 1, dcol, [dcol] * len(df.index), True)
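One caveat (an assumption about the real data, not something stated in the question): if the numeric labels are strings such as '5' rather than Python ints, the type(dcol) is int test never matches; a string-based check inside the loop covers both cases:
    if str(dcol).isdigit():   # handles int labels and digit-string labels alike
        df.insert(start_col + i*2 + 1, dcol, [dcol] * len(df.index), True)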

Python: Dropping rows of a dataframe while keeping a specific group

Let's say that I have this dataframe :
import pandas as pd
Name = ['ID', 'Country', 'IBAN','ID_bal_amt', 'ID_bal_time','Dan_city','ID_bal_mod','Dan_country','ID_bal_type', 'ID_bal_amt', 'ID_bal_time','ID_bal_mod','ID_bal_type' ,'Dan_sex', 'Dan_Age', 'Dan_country','Dan_sex' , 'Dan_city','Dan_country','ID_bal_amt', 'ID_bal_time','ID_bal_mod','ID_bal_type' ]
Value = ['TAMARA_CO', 'GERMANY','FR56', '12','June','Berlin','OPBD', '55','CRDT','432', 'August', 'CLBD','DBT', 'M', '22', 'FRA', 'M', 'Madrid', 'ESP','432','March','FABD','CRDT']
Ccy = ['','','','EUR','EUR','','EUR','','','','EUR','EUR','USD','USD','USD','','CHF', '','DKN','','','USD','CHF']
Group = ['0','0','0','1','1','1','1','1','1','2','2','2','2','2','2','2','3','3','3','4','4','4','4']
df = pd.DataFrame({'Name':Name, 'Value' : Value, 'Ccy' : Ccy,'Group':Group})
print(df)
Name Value Ccy Group
0 ID TAMARA_CO 0
1 Country GERMANY 0
2 IBAN FR56 0
3 ID_bal_amt 12 EUR 1
4 ID_bal_time June EUR 1
5 Dan_city Berlin 1
6 ID_bal_mod OPBD EUR 1
7 Dan_country 55 1
8 ID_bal_type CRDT 1
9 ID_bal_amt 432 2
10 ID_bal_time August EUR 2
11 ID_bal_mod CLBD EUR 2
12 ID_bal_type DBT USD 2
13 Dan_sex M USD 2
14 Dan_Age 22 USD 2
15 Dan_country FRA 2
16 Dan_sex M CHF 3
17 Dan_city Madrid 3
18 Dan_country ESP DKN 3
19 ID_bal_amt 432 4
20 ID_bal_time March 4
21 ID_bal_mod FABD USD 4
22 ID_bal_type CRDT CHF 4
I want to reduce this dataframe. I want to drop the rows whose Name contains the string "bal", except for the group of rows associated with the mode "CLBD". That is, I look for the value "CLBD" under the Name "ID_bal_mod" and then keep all the other names ID_bal_amt, ID_bal_time, ID_bal_mod, ID_bal_type that are in the same group. In this example, those are the names in group 2.
In addition, I want to change their value in the column "Group" to 0.
So at the end I would like to get this new dataframe where the indexing is reset too
Name Value Ccy Group
0 ID TAMARA_CO 0
1 Country GERMANY 0
2 IBAN FR56 0
3 Dan_city Berlin 1
4 Dan_country 55 1
5 ID_bal_amt 432 0
6 ID_bal_time August EUR 0
7 ID_bal_mod CLBD EUR 0
8 ID_bal_type DBT USD 0
9 Dan_sex M USD 2
10 Dan_Age 22 USD 2
11 Dan_country FRA 2
12 Dan_sex M CHF 3
13 Dan_city Madrid 3
14 Dan_country ESP DKN 3
Does anyone have an efficient idea?
Thank you
Let's try your logic:
rows_with_bal = df['Name'].str.contains('bal')
groups_with_CLBD = ((rows_with_bal & df['Value'].eq('CLBD'))
                    .groupby(df['Group']).transform('any'))
# set the `Group` to 0 for `groups_with_CLBD`
df.loc[groups_with_CLBD, 'Group'] = 0
# keep the rows without bal or `groups_with_CLBD`
df = df.loc[(~rows_with_bal) | groups_with_CLBD]
Output:
Name Value Ccy Group
0 ID TAMARA_CO 0
1 Country GERMANY 0
2 IBAN FR56 0
5 Dan_city Berlin 1
7 Dan_country 55 1
9 ID_bal_amt 432 0
10 ID_bal_time August EUR 0
11 ID_bal_mod CLBD EUR 0
12 ID_bal_type DBT USD 0
13 Dan_sex M USD 0
14 Dan_Age 22 USD 0
15 Dan_country FRA 0
16 Dan_sex M CHF 3
17 Dan_city Madrid 3
18 Dan_country ESP DKN 3
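The question also asks for the index to be reset in the result; a reset_index(drop=True) on the last step does that (a small addition to the answer above):
df = df.loc[(~rows_with_bal) | groups_with_CLBD].reset_index(drop=True)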

Python - replace values in dataframe based on another dataframe match

Suppose that I have two pandas dataframes named A and B, as shown below.
How could I replace the column Value in dataframe A based on matches of the columns ID and Month from B?
Any ideas?
Thanks
Dataframe A:
ID Month City Brand Value
1 1 London Unilever 100
1 2 London Unilever 120
1 3 London Unilever 150
1 4 London Unilever 140
2 1 NY JP Morgan 90
2 2 NY JP Morgan 105
2 3 NY JP Morgan 100
2 4 NY JP Morgan 140
3 1 Paris Loreal 60
3 2 Paris Loreal 75
3 3 Paris Loreal 65
3 4 Paris Loreal 80
4 1 Tokyo Sony 100
4 2 Tokyo Sony 90
4 3 Tokyo Sony 85
4 4 Tokyo Sony 80
Dataframe B:
ID Month Value
2 1 100
3 3 80
Use merge with a left join, then replace the missing values with the original values using fillna:
df = df1.merge(df2, on=['ID', 'Month'], how='left', suffixes=('_',''))
df['Value'] = df['Value'].fillna(df['Value_']).astype(int)
df = df.drop('Value_', axis=1)
print (df)
ID Month City Brand Value
0 1 1 London Unilever 100
1 1 2 London Unilever 120
2 1 3 London Unilever 150
3 1 4 London Unilever 140
4 2 1 NY JP Morgan 100
5 2 2 NY JP Morgan 105
6 2 3 NY JP Morgan 100
7 2 4 NY JP Morgan 140
8 3 1 Paris Loreal 60
9 3 2 Paris Loreal 75
10 3 3 Paris Loreal 80
11 3 4 Paris Loreal 80
12 4 1 Tokyo Sony 100
13 4 2 Tokyo Sony 90
14 4 3 Tokyo Sony 85
15 4 4 Tokyo Sony 80
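An alternative sketch (not from the original answers, and using the question's names A and B): align both frames on ['ID', 'Month'] and let DataFrame.update overwrite only the matched rows. Note that update may upcast Value to float, so it is cast back at the end:
A = A.set_index(['ID', 'Month'])
A.update(B.set_index(['ID', 'Month']))   # overwrites Value only where B has a matching row
A = A.reset_index()
A['Value'] = A['Value'].astype(int)      # update can upcast the column to float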
Merge them and then remove the unused fields:
C = pd.merge(A[['ID', 'Month', 'City', 'Brand']],B, on=['ID', 'Month'])
C = C[['ID', 'Month', 'City', 'Brand', 'Value']]
This should work

Pandas Python Groupby Cumulative Sum Reverse

I have found Pandas groupby cumulative sum and found it very useful. However, I would like to determine how to calculate a reverse cumulative sum.
The link suggests the following.
df.groupby(by=['name','day']).sum().groupby(level=[0]).cumsum()
In order to reverse sum, I tried slicing the data, but it fails.
df.groupby(by=['name','day']).ix[::-1, 'no'].sum().groupby(level=[0]).cumsum()
name | day | no | reverse cumsum
Jack | Monday | 10 | 90
Jack | Tuesday | 30 | 80
Jack | Wednesday | 50 | 50
Jill | Monday | 40 | 80
Jill | Wednesday | 40 | 40
EDIT:
Based on feedback, I tried to implement the code and make the dataframe larger:
import pandas as pd
df = pd.DataFrame(
    {'name': ['Jack', 'Jack', 'Jack', 'Jill', 'Jill'],
     'surname': ['Jones','Jones','Jones','Smith','Smith'],
     'car': ['VW','Mazda','VW','Merc','Merc'],
     'country': ['UK','US','UK','EU','EU'],
     'year': [1980,1980,1980,1980,1980],
     'day': ['Monday', 'Tuesday','Wednesday','Monday','Wednesday'],
     'date': ['2016-02-31','2016-01-31','2016-01-31','2016-01-31','2016-01-31'],
     'no': [10,30,50,40,40],
     'qty': [100,500,200,433,222]})
I then try to group on a number of columns, but it fails to apply the grouping as I expect.
df = df.groupby(by=['name','surname','car','country','year','day','date']).sum().iloc[::-1].groupby(level=[0]).cumsum().iloc[::-1].reset_index()
Why is this the case? I expect Jack Jones with the Mazda to be a separate cumulative quantity from Jack Jones with the VW.
You can use double iloc:
df = df.groupby(by=['name','day']).sum().iloc[::-1].groupby(level=[0]).cumsum().iloc[::-1]
print (df)
no
name day
Jack Monday 90
Tuesday 80
Wednesday 50
Jill Monday 80
Wednesday 40
For another column, the solution simplifies to:
df = df.groupby(by=['name','day']).sum()
df['new'] = df.iloc[::-1].groupby(level=[0]).cumsum()
print (df)
no new
name day
Jack Monday 10 90
Tuesday 30 80
Wednesday 50 50
Jill Monday 40 80
Wednesday 40 40
EDIT:
The problem is in the second groupby: more levels need to be appended. level=[0,1,2] means grouping by the first (name), second (surname) and third (car) index levels.
df1 = (df.groupby(by=['name','surname','car','country','year','day','date'])
.sum())
print (df1)
no qty
name surname car country year day date
Jack Jones Mazda US 1980 Tuesday 2016-01-31 30 500
VW UK 1980 Monday 2016-02-31 10 100
Wednesday 2016-01-31 50 200
Jill Smith Merc EU 1980 Monday 2016-01-31 40 433
Wednesday 2016-01-31 40 222
df2 = (df.groupby(by=['name','surname','car','country','year','day','date'])
.sum()
.iloc[::-1]
.groupby(level=[0,1,2])
.cumsum()
.iloc[::-1]
.reset_index())
print (df2)
name surname car country year day date no qty
0 Jack Jones Mazda US 1980 Tuesday 2016-01-31 30 500
1 Jack Jones VW UK 1980 Monday 2016-02-31 60 300
2 Jack Jones VW UK 1980 Wednesday 2016-01-31 50 200
3 Jill Smith Merc EU 1980 Monday 2016-01-31 80 655
4 Jill Smith Merc EU 1980 Wednesday 2016-01-31 40 222
Or it is possible to select by names - see the groupby enhancements in pandas 0.20.1+:
df2 = (df.groupby(by=['name','surname','car','country','year','day','date'])
.sum()
.iloc[::-1]
.groupby(['name','surname','car'])
.cumsum()
.iloc[::-1]
.reset_index())
print (df2)
name surname car country year day date no qty
0 Jack Jones Mazda US 1980 Tuesday 2016-01-31 30 500
1 Jack Jones VW UK 1980 Monday 2016-02-31 60 300
2 Jack Jones VW UK 1980 Wednesday 2016-01-31 50 200
3 Jill Smith Merc EU 1980 Monday 2016-01-31 80 655
4 Jill Smith Merc EU 1980 Wednesday 2016-01-31 40 222
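As a side note (a sketch, not part of the original answer), the same reverse cumulative sum can be written as a per-group transform, reversing only inside the lambda instead of reversing the whole frame twice:
agg = (df.groupby(['name','surname','car','country','year','day','date'], as_index=False)
         .sum())
# reverse each group, take the cumulative sum, then restore the original order
agg[['no', 'qty']] = (agg.groupby(['name', 'surname', 'car'])[['no', 'qty']]
                         .transform(lambda s: s[::-1].cumsum()[::-1]))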
