mean per group in a fragmented dataset - python

This is actually an extension of my previous question, but I was asked to post it as a separate question:
Rolling average on previous dates per group
I have the following dataset:
Name Loc Site Date Total
Alex Italy A 12.31.2020 30
Alex Italy B 12.31.2020 20
Alex Italy B 12.30.2020 100
Alex Italy B 12.28.2020 40
Alex Italy A 12.23.2020 80
Alex France A 12.28.2020 10
Alex France B 12.28.2020 20
Alex France B 12.23.2020 10
Alex France A 12.23.2020 100
Alex France B 12.21.2020 25
For each row I want to add the average of Total over an arbitrary time frame before that row's Date, per Name and Loc.
This is the outcome I'm looking for with a window of the previous 5 days (excluding the Date itself):
Name Loc Site Date Total Prv_Avg
Alex Italy A 12.31.2020 30 70
Alex Italy B 12.31.2020 20 70
Alex Italy B 12.30.2020 100 40
Alex Italy B 12.28.2020 40 80
Alex Italy A 12.23.2020 80 NaN
Alex France A 12.28.2020 10 55
Alex France B 12.28.2020 20 55
Alex France B 12.23.2020 10 25
Alex France A 12.23.2020 100 25
Alex France B 12.21.2020 25 NaN
The NaNs are for rows that have no other rows within the previous 5 days in the data.

Use a custom function in GroupBy.transform that replaces non-matching values with NaN and computes the averages with numpy.nanmean:
import numpy as np
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])

def f(x):
    # group dates (the index) and the Total values
    arr = x.index.to_numpy()
    s = x.to_numpy()
    # start of each row's 5-day look-back window
    prev = arr - pd.Timedelta(5, 'day')
    # for row i keep values dated strictly before arr[i] and no older
    # than prev[i], then average whatever is left in that window
    return np.nanmean(np.where((arr[:, None] > arr) &
                               (arr >= prev[:, None]), s, np.nan), axis=1)

df['Prv_Avg'] = (df.set_index('Date')
                   .groupby(['Name','Loc'])['Total']
                   .transform(f)
                   .to_numpy())
print (df)
Name Loc Site Date Total Prv_Avg
0 Alex Italy A 2020-12-31 30 70.0
1 Alex Italy B 2020-12-31 20 70.0
2 Alex Italy B 2020-12-30 100 40.0
3 Alex Italy B 2020-12-28 40 80.0
4 Alex Italy A 2020-12-23 80 NaN
5 Alex France A 2020-12-28 10 55.0
6 Alex France B 2020-12-28 20 55.0
7 Alex France B 2020-12-23 10 25.0
8 Alex France A 2020-12-23 100 25.0
9 Alex France B 2020-12-21 25 NaN
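As an alternative sketch (not part of the original answer), the same per-group look-back can be written with a time-based rolling window; closed='left' makes the window [Date - 5 days, Date), so the current date is excluded just as above:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
tmp = df.sort_values(['Name', 'Loc', 'Date'])   # rolling needs increasing dates per group

prv = (tmp.groupby(['Name', 'Loc'])[['Date', 'Total']]
          .rolling('5D', on='Date', closed='left')   # window [Date - 5 days, Date)
          .mean()
          .reset_index(level=['Name', 'Loc'], drop=True)['Total'])

df['Prv_Avg'] = prv   # prv keeps the original row index, so this aligns correctly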


How to sum the amount for all rows where the current row's close date falls between another row's open and close dates, for each sales rep

I have a dataframe with "close_date", "open_date", "amount", "sales_rep".
sales_rep  open_date(MM/DD/YYYY)  close_date  amount
Jim        1/01/2021              2/05/2021   3
Jim        1/15/2021              4/06/2021   26
Jim        2/01/2021              2/06/2021   7
Jim        2/15/2021              3/14/2021   12
Jim        3/01/2021              4/22/2021   13
Jim        3/15/2021              3/29/2021   5
Jim        4/01/2021              4/20/2021   17
Bob        1/01/2021              1/12/2021   23
Bob        1/15/2021              2/16/2021   12
Bob        2/01/2021              3/04/2021   4
Bob        2/15/2021              4/05/2021   23
Bob        3/01/2021              3/24/2021   12
Bob        3/15/2021              4/15/2021   7
Bob        4/01/2021              5/01/2021   20
I want to create a column that tells me the open amount. Take the second row: the opp was closed on 04/06/2021, and I want to know how many open opps there were on that date. So I check whether the open date of row 5 is before the close date 04/06/2021 and whether the close date of row 5 is after 04/06/2021; in this case it is, so I add its amount to the sum. I also want the current row's value to be included in the sum. This should be done for each sales rep in the dataframe. I have filled in the table with the expected values below.
sales_rep  open_date(MM/DD/YYYY)  close_date  amount  open_amount_sum
Jim        1/01/2021              2/05/2021   3       36
Jim        1/15/2021              4/06/2021   26      56
Jim        2/01/2021              2/06/2021   7       33
Jim        2/15/2021              3/14/2021   12      51
Jim        3/01/2021              4/22/2021   13      13
Jim        3/15/2021              3/29/2021   5       44
Jim        4/01/2021              4/20/2021   17      30
Bob        1/01/2021              1/12/2021   23      23
Bob        1/15/2021              2/16/2021   12      39
Bob        2/01/2021              3/04/2021   4       39
Bob        2/15/2021              4/05/2021   23      50
Bob        3/01/2021              3/24/2021   12      42
Bob        3/15/2021              4/15/2021   7       27
Bob        4/01/2021              5/01/2021   20      20
(The 36 in the first row comes from adding 26 and 7, the only other values that fit the condition, plus the row's own 3.)
Edit: #RJ's solution from the comments is better. Here it is, formatted slightly differently:
df['open_amount_sum'] = df.apply(
    lambda x: df[
        df['sales_rep'].eq(x['sales_rep']) &
        df['open_date'].le(x['close_date']) &
        df['close_date'].ge(x['close_date'])
    ]['amount'].sum(),
    axis=1,
)
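Note that both apply-based snippets compare the date columns as they were read; if open_date and close_date are still strings in MM/DD/YYYY form, the comparison is lexicographic and can misorder months from October onward. Parsing them first is safer (a small addition, not part of either original answer); the merge-based solution further below appears to have done this already, since its output shows datetime-formatted dates.
df['open_date'] = pd.to_datetime(df['open_date'], format='%m/%d/%Y')
df['close_date'] = pd.to_datetime(df['close_date'], format='%m/%d/%Y')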
Here is a solution, but it is slow and kind of ugly. It can definitely be improved.
import pandas as pd
import io
df = pd.read_csv(io.StringIO(
"""
sales_rep,open_date,close_date,amount
Jim,1/01/2021,2/05/2021,3
Jim,1/15/2021,4/06/2021,26
Jim,2/01/2021,2/06/2021,7
Jim,2/15/2021,3/14/2021,12
Jim,3/01/2021,4/22/2021,13
Jim,3/15/2021,3/29/2021,5
Jim,4/01/2021,4/20/2021,17
Bob,1/01/2021,1/12/2021,23
Bob,1/15/2021,2/16/2021,12
Bob,2/01/2021,3/04/2021,4
Bob,2/15/2021,4/05/2021,23
Bob,3/01/2021,3/24/2021,12
Bob,3/15/2021,4/15/2021,7
Bob,4/01/2021,5/01/2021,20
"""
))
sum_df = df.groupby('sales_rep').apply(
    lambda g:
        g['close_date'].apply(
            lambda close:
                g.loc[
                    g['open_date'].le(close) & g['close_date'].ge(close),
                    'amount'
                ].sum())
).reset_index(level=0)
df['close_sum'] = sum_df['close_date']
df
Merge the dataframe onto itself, then filter, before grouping:
(df
 .merge(df, on='sales_rep')
 .query('open_date_y <= close_date_x <= close_date_y')
 .loc(axis=1)['sales_rep', 'open_date_x', 'close_date_x', 'amount_x', 'amount_y']
 .rename(columns=lambda col: col.removesuffix('_x'))
 .rename(columns={'amount_y': 'open_sum_amount'})
 .groupby(['sales_rep', 'open_date', 'close_date', 'amount'],
          sort=False,
          as_index=False)
 .sum()
)
sales_rep open_date close_date amount open_sum_amount
0 Jim 2021-01-01 2021-02-05 3 36
1 Jim 2021-01-15 2021-04-06 26 56
2 Jim 2021-02-01 2021-02-06 7 33
3 Jim 2021-02-15 2021-03-14 12 51
4 Jim 2021-03-01 2021-04-22 13 13
5 Jim 2021-03-15 2021-03-29 5 44
6 Jim 2021-04-01 2021-04-20 17 30
7 Bob 2021-01-01 2021-01-12 23 23
8 Bob 2021-01-15 2021-02-16 12 39
9 Bob 2021-02-01 2021-03-04 4 39
10 Bob 2021-02-15 2021-04-05 23 50
11 Bob 2021-03-01 2021-03-24 12 42
12 Bob 2021-03-15 2021-04-15 7 27
13 Bob 2021-04-01 2021-05-01 20 20

Get column name as new column with the same column value

I have dataframe similar to this one:
name hobby date country 5 10 15 20 ...
Toby Guitar 2020-01-19 Brazil 0.1245 0.2543 0.7763 0.2264
Linda Cooking 2020-03-05 Italy 0.5411 0.2213 Nan 0.3342
Ben Diving 2020-04-02 USA 0.8843 0.2333 0.4486 0.2122
...
I want to take the int columns, duplicate them, and put the int as the new value of the column, something like this:
name hobby date country 5 5 10 10 15 15 20 20...
Toby Guitar 2020-01-19 Brazil 0.1245 5 0.2543 10 0.7763 15 0.2264 20
Linda Cooking 2020-03-05 Italy 0.5411 5 0.2213 10 Nan 15 0.3342 20
Ben Diving 2020-04-02 USA 0.8843 5 0.2333 10 0.4486 15 0.2122 20
...
I'm not sure how to tackle this and looking for ideas
Here is a solution you can try out,
digits_ = pd.DataFrame(
    {col: [int(col)] * len(df) for col in df.columns if col.isdigit()}
)
pd.concat([df, digits_], axis=1)
name hobby date country 5 ... 20 5 10 15 20
0 Toby Guitar 2020-01-19 Brazil 0.1245 ... 0.2264 5 10 15 20
1 Linda Cooking 2020-03-05 Italy 0.5411 ... 0.3342 5 10 15 20
2 Ben Diving 2020-04-02 USA 0.8843 ... 0.2122 5 10 15 20
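If the interleaved column order from the question is needed, one option (a sketch, assuming the first four columns are the identifier columns) is to reorder the concatenated frame by position, since the duplicated labels make label-based selection ambiguous:
out = pd.concat([df, digits_], axis=1)
n_id = 4                      # name, hobby, date, country
n_num = digits_.shape[1]      # number of duplicated numeric columns
# id columns first, then each value column immediately followed by its label column
order = list(range(n_id)) + [p for k in range(n_num)
                             for p in (n_id + k, n_id + n_num + k)]
out = out.iloc[:, order]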
I'm not sure if it is the best way to organise data with duplicated column names. I would recommend stacking (melting) it into long format.
df.melt(id_vars=["name", "hobby", "date", "country"])
Result
name hobby date country variable value
0 Toby Guitar 2020-01-19 Brazil 5 0.1245
1 Linda Cooking 2020-03-05 Italy 5 0.5411
2 Ben Diving 2020-04-02 USA 5 0.8843
3 Toby Guitar 2020-01-19 Brazil 10 0.2543
4 Linda Cooking 2020-03-05 Italy 10 0.2213
5 Ben Diving 2020-04-02 USA 10 0.2333
6 Toby Guitar 2020-01-19 Brazil 15 0.7763
7 Linda Cooking 2020-03-05 Italy 15 NaN
8 Ben Diving 2020-04-02 USA 15 0.4486
9 Toby Guitar 2020-01-19 Brazil 20 0.2264
10 Linda Cooking 2020-03-05 Italy 20 0.3342
11 Ben Diving 2020-04-02 USA 20 0.2122
You could use the pandas insert(...) function combined with a for loop
import numpy as np
import pandas as pd

df = pd.DataFrame([['Toby', 'Guitar', '2020-01-19', 'Brazil', 0.1245, 0.2543, 0.7763, 0.2264],
                   ['Linda', 'Cooking', '2020-03-05', 'Italy', 0.5411, 0.2213, np.nan, 0.3342],
                   ['Ben', 'Diving', '2020-04-02', 'USA', 0.8843, 0.2333, 0.4486, 0.2122]],
                  columns=['name', 'hobby', 'date', 'country', 5, 10, 15, 20])
start_col = 4
for i in range(0, len(df.columns) - start_col):
    dcol = df.columns[start_col + i*2]  # digit col name to duplicate
    df.insert(start_col + i*2 + 1, dcol, [dcol] * len(df.index), True)
results:
name hobby date country 5 ... 10 15 15 20 20
0 Toby Guitar 2020-01-19 Brazil 0.1245 ... 10 0.7763 15 0.2264 20
1 Linda Cooking 2020-03-05 Italy 0.5411 ... 10 NaN 15 0.3342 20
2 Ben Diving 2020-04-02 USA 0.8843 ... 10 0.4486 15 0.2122 20
[3 rows x 12 columns]
I assumed that all your columns from the 5th onward are digits, but if not you could add an if condition in the for loop to guard against this:
start_col = 4
for i in range(0, len(df.columns) - start_col):
    dcol = df.columns[start_col + i*2]  # digit col name to duplicate
    if type(dcol) is int:
        df.insert(start_col + i*2 + 1, dcol, [dcol] * len(df.index), True)
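One caveat (an assumption about the real data, not something stated in the question): if the numeric labels are strings such as '5' rather than Python ints, the type(dcol) is int test never matches; a string-based check inside the loop covers both cases:
    if str(dcol).isdigit():   # handles int labels and digit-string labels alike
        df.insert(start_col + i*2 + 1, dcol, [dcol] * len(df.index), True)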

Python: Dropping rows of a dataframe while keeping a specific group

Let's say that I have this dataframe :
import pandas as pd
Name = ['ID', 'Country', 'IBAN','ID_bal_amt', 'ID_bal_time','Dan_city','ID_bal_mod','Dan_country','ID_bal_type', 'ID_bal_amt', 'ID_bal_time','ID_bal_mod','ID_bal_type' ,'Dan_sex', 'Dan_Age', 'Dan_country','Dan_sex' , 'Dan_city','Dan_country','ID_bal_amt', 'ID_bal_time','ID_bal_mod','ID_bal_type' ]
Value = ['TAMARA_CO', 'GERMANY','FR56', '12','June','Berlin','OPBD', '55','CRDT','432', 'August', 'CLBD','DBT', 'M', '22', 'FRA', 'M', 'Madrid', 'ESP','432','March','FABD','CRDT']
Ccy = ['','','','EUR','EUR','','EUR','','','','EUR','EUR','USD','USD','USD','','CHF', '','DKN','','','USD','CHF']
Group = ['0','0','0','1','1','1','1','1','1','2','2','2','2','2','2','2','3','3','3','4','4','4','4']
df = pd.DataFrame({'Name':Name, 'Value' : Value, 'Ccy' : Ccy,'Group':Group})
print(df)
Name Value Ccy Group
0 ID TAMARA_CO 0
1 Country GERMANY 0
2 IBAN FR56 0
3 ID_bal_amt 12 EUR 1
4 ID_bal_time June EUR 1
5 Dan_city Berlin 1
6 ID_bal_mod OPBD EUR 1
7 Dan_country 55 1
8 ID_bal_type CRDT 1
9 ID_bal_amt 432 2
10 ID_bal_time August EUR 2
11 ID_bal_mod CLBD EUR 2
12 ID_bal_type DBT USD 2
13 Dan_sex M USD 2
14 Dan_Age 22 USD 2
15 Dan_country FRA 2
16 Dan_sex M CHF 3
17 Dan_city Madrid 3
18 Dan_country ESP DKN 3
19 ID_bal_amt 432 4
20 ID_bal_time March 4
21 ID_bal_mod FABD USD 4
22 ID_bal_type CRDT CHF 4
I want to reduce this dataframe. I want to drop the rows whose Name contains the string "bal", except for the group of rows associated with the mode "CLBD". That is, I look for the value "CLBD" under the Name "ID_bal_mod" and then keep all the other names ID_bal_amt, ID_bal_time, ID_bal_mod, ID_bal_type that are in the same group. In this example, those are the names in group 2.
In addition, I want to change their value in the column "Group" to 0.
So at the end I would like to get this new dataframe where the indexing is reset too
Name Value Ccy Group
0 ID TAMARA_CO 0
1 Country GERMANY 0
2 IBAN FR56 0
3 Dan_city Berlin 1
4 Dan_country 55 1
5 ID_bal_amt 432 0
6 ID_bal_time August EUR 0
7 ID_bal_mod CLBD EUR 0
8 ID_bal_type DBT USD 0
9 Dan_sex M USD 2
10 Dan_Age 22 USD 2
11 Dan_country FRA 2
12 Dan_sex M CHF 3
13 Dan_city Madrid 3
14 Dan_country ESP DKN 3
Does anyone have an efficient idea?
Thank you
Let's try your logic:
rows_with_bal = df['Name'].str.contains('bal')
groups_with_CLBD = ((rows_with_bal & df['Value'].eq('CLBD'))
                    .groupby(df['Group']).transform('any'))
# set the `Group` to 0 for `groups_with_CLBD`
df.loc[groups_with_CLBD, 'Group'] = 0
# keep the rows without bal or `groups_with_CLBD`
df = df.loc[(~rows_with_bal) | groups_with_CLBD]
Output:
Name Value Ccy Group
0 ID TAMARA_CO 0
1 Country GERMANY 0
2 IBAN FR56 0
5 Dan_city Berlin 1
7 Dan_country 55 1
9 ID_bal_amt 432 0
10 ID_bal_time August EUR 0
11 ID_bal_mod CLBD EUR 0
12 ID_bal_type DBT USD 0
13 Dan_sex M USD 0
14 Dan_Age 22 USD 0
15 Dan_country FRA 0
16 Dan_sex M CHF 3
17 Dan_city Madrid 3
18 Dan_country ESP DKN 3
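The question also asks for the index to be reset in the result; a reset_index(drop=True) on the last step does that (a small addition to the answer above):
df = df.loc[(~rows_with_bal) | groups_with_CLBD].reset_index(drop=True)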

Python - replace values in dataframe based on another dataframe match

Suppose that I have two pandas dataframes named A and B, as shown below.
How could I replace the column Value in dataframe A based on matches of the columns ID and Month from B?
Any ideas?
Thanks
Dataframe A:
ID Month City Brand Value
1 1 London Unilever 100
1 2 London Unilever 120
1 3 London Unilever 150
1 4 London Unilever 140
2 1 NY JP Morgan 90
2 2 NY JP Morgan 105
2 3 NY JP Morgan 100
2 4 NY JP Morgan 140
3 1 Paris Loreal 60
3 2 Paris Loreal 75
3 3 Paris Loreal 65
3 4 Paris Loreal 80
4 1 Tokyo Sony 100
4 2 Tokyo Sony 90
4 3 Tokyo Sony 85
4 4 Tokyo Sony 80
Dataframe B:
ID Month Value
2 1 100
3 3 80
Use merge with a left join, then replace the missing values with the original values using fillna:
df = df1.merge(df2, on=['ID', 'Month'], how='left', suffixes=('_',''))
df['Value'] = df['Value'].fillna(df['Value_']).astype(int)
df = df.drop('Value_', axis=1)
print (df)
ID Month City Brand Value
0 1 1 London Unilever 100
1 1 2 London Unilever 120
2 1 3 London Unilever 150
3 1 4 London Unilever 140
4 2 1 NY JP Morgan 100
5 2 2 NY JP Morgan 105
6 2 3 NY JP Morgan 100
7 2 4 NY JP Morgan 140
8 3 1 Paris Loreal 60
9 3 2 Paris Loreal 75
10 3 3 Paris Loreal 80
11 3 4 Paris Loreal 80
12 4 1 Tokyo Sony 100
13 4 2 Tokyo Sony 90
14 4 3 Tokyo Sony 85
15 4 4 Tokyo Sony 80
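An alternative sketch (not from the original answers, and using the question's names A and B): align both frames on ['ID', 'Month'] and let DataFrame.update overwrite only the matched rows. Note that update may upcast Value to float, so it is cast back at the end:
A = A.set_index(['ID', 'Month'])
A.update(B.set_index(['ID', 'Month']))   # overwrites Value only where B has a matching row
A = A.reset_index()
A['Value'] = A['Value'].astype(int)      # update can upcast the column to float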
Merge them and then remove the unused fields:
C = pd.merge(A[['ID', 'Month', 'City', 'Brand']],B, on=['ID', 'Month'])
C = C[['ID', 'Month', 'City', 'Brand', 'Value']]
This should work

Pandas Python Groupby Cumulative Sum Reverse

I have found Pandas groupby cumulative sum and found it very useful. However, I would like to determine how to calculate a reverse cumulative sum.
The link suggests the following.
df.groupby(by=['name','day']).sum().groupby(level=[0]).cumsum()
In order to reverse sum, I tried slicing the data, but it fails.
df.groupby(by=['name','day']).ix[::-1, 'no'].sum().groupby(level=[0]).cumsum()
name | day | no | reverse cumsum
Jack | Monday | 10 | 90
Jack | Tuesday | 30 | 80
Jack | Wednesday | 50 | 50
Jill | Monday | 40 | 80
Jill | Wednesday | 40 | 40
EDIT:
Based on feedback, I tried to implement the code and make the dataframe larger:
import pandas as pd
df = pd.DataFrame(
    {'name': ['Jack', 'Jack', 'Jack', 'Jill', 'Jill'],
     'surname': ['Jones','Jones','Jones','Smith','Smith'],
     'car': ['VW','Mazda','VW','Merc','Merc'],
     'country': ['UK','US','UK','EU','EU'],
     'year': [1980,1980,1980,1980,1980],
     'day': ['Monday', 'Tuesday','Wednesday','Monday','Wednesday'],
     'date': ['2016-02-31','2016-01-31','2016-01-31','2016-01-31','2016-01-31'],
     'no': [10,30,50,40,40],
     'qty': [100,500,200,433,222]})
I then try to group on a number of columns, but it fails to apply the grouping as I expect.
df = df.groupby(by=['name','surname','car','country','year','day','date']).sum().iloc[::-1].groupby(level=[0]).cumsum().iloc[::-1].reset_index()
Why is this the case? I expect Jack Jones with the Mazda to be a separate cumulative quantity from Jack Jones with the VW.
You can use double iloc:
df = df.groupby(by=['name','day']).sum().iloc[::-1].groupby(level=[0]).cumsum().iloc[::-1]
print (df)
no
name day
Jack Monday 90
Tuesday 80
Wednesday 50
Jill Monday 80
Wednesday 40
For another column, the solution simplifies to:
df = df.groupby(by=['name','day']).sum()
df['new'] = df.iloc[::-1].groupby(level=[0]).cumsum()
print (df)
no new
name day
Jack Monday 10 90
Tuesday 30 80
Wednesday 50 50
Jill Monday 40 80
Wednesday 40 40
EDIT:
The problem is in the second groupby: more levels need to be appended. level=[0,1,2] means grouping by the first (name), second (surname) and third (car) index levels.
df1 = (df.groupby(by=['name','surname','car','country','year','day','date'])
.sum())
print (df1)
no qty
name surname car country year day date
Jack Jones Mazda US 1980 Tuesday 2016-01-31 30 500
VW UK 1980 Monday 2016-02-31 10 100
Wednesday 2016-01-31 50 200
Jill Smith Merc EU 1980 Monday 2016-01-31 40 433
Wednesday 2016-01-31 40 222
df2 = (df.groupby(by=['name','surname','car','country','year','day','date'])
.sum()
.iloc[::-1]
.groupby(level=[0,1,2])
.cumsum()
.iloc[::-1]
.reset_index())
print (df2)
name surname car country year day date no qty
0 Jack Jones Mazda US 1980 Tuesday 2016-01-31 30 500
1 Jack Jones VW UK 1980 Monday 2016-02-31 60 300
2 Jack Jones VW UK 1980 Wednesday 2016-01-31 50 200
3 Jill Smith Merc EU 1980 Monday 2016-01-31 80 655
4 Jill Smith Merc EU 1980 Wednesday 2016-01-31 40 222
Or it is possible to select by names - see the groupby enhancements in pandas 0.20.1+:
df2 = (df.groupby(by=['name','surname','car','country','year','day','date'])
.sum()
.iloc[::-1]
.groupby(['name','surname','car'])
.cumsum()
.iloc[::-1]
.reset_index())
print (df2)
name surname car country year day date no qty
0 Jack Jones Mazda US 1980 Tuesday 2016-01-31 30 500
1 Jack Jones VW UK 1980 Monday 2016-02-31 60 300
2 Jack Jones VW UK 1980 Wednesday 2016-01-31 50 200
3 Jill Smith Merc EU 1980 Monday 2016-01-31 80 655
4 Jill Smith Merc EU 1980 Wednesday 2016-01-31 40 222
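As a side note (a sketch, not part of the original answer), the same reverse cumulative sum can be written as a per-group transform, reversing only inside the lambda instead of reversing the whole frame twice:
agg = (df.groupby(['name','surname','car','country','year','day','date'], as_index=False)
         .sum())
# reverse each group, take the cumulative sum, then restore the original order
agg[['no', 'qty']] = (agg.groupby(['name', 'surname', 'car'])[['no', 'qty']]
                         .transform(lambda s: s[::-1].cumsum()[::-1]))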
