Evaluate monthly fraction of yearly data - Python [duplicate]

This question already has answers here:
Pandas percentage of total with groupby
(16 answers)
Closed 10 months ago.
I have a pandas dataframe as:
ID  Date      Value
A   1/1/2000      5
A   2/1/2000     10
A   3/1/2000     20
A   4/1/2000     10
B   1/1/2000    100
B   2/1/2000    200
B   3/1/2000    300
B   4/1/2000    400
How do I compute, as a fourth column, each month's fraction of the total yearly value for each ID?
ID  Date      Value  Fraction
A   1/1/2000      5      0.11
A   2/1/2000     10      0.22
A   3/1/2000     20      0.44
A   4/1/2000     10      0.22
B   1/1/2000    100      0.10
B   2/1/2000    200      0.20
B   3/1/2000    300      0.30
B   4/1/2000    400      0.40
I guess I could use groupby?

I think your sample data needs a second year to be representative, in case your real DataFrame contains more than a single year.
I just added one row for 2001:
import pandas as pd
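# The DataFrame construction was not shown above; this is an assumed
# reconstruction of the sample data with the extra row for 2001 appended.
df = pd.DataFrame({
    'ID':    ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
    'Date':  ['1/1/2000', '2/1/2000', '3/1/2000', '4/1/2000',
              '1/1/2000', '2/1/2000', '3/1/2000', '4/1/2000', '4/1/2001'],
    'Value': [5, 10, 20, 10, 100, 200, 300, 400, 20],
})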
df['Date'] = pd.to_datetime(df['Date'])
print(df)
ID Date Value
0 A 2000-01-01 5
1 A 2000-02-01 10
2 A 2000-03-01 20
3 A 2000-04-01 10
4 B 2000-01-01 100
5 B 2000-02-01 200
6 B 2000-03-01 300
7 B 2000-04-01 400
8 B 2001-04-01 20
If I understood correctly you can do it like this:
df['Fraction'] = (df['Value'] / df.groupby(['ID', df['Date'].dt.year])['Value'].transform('sum')).round(2)
print(df)
ID Date Value Fraction
0 A 2000-01-01 5 0.11
1 A 2000-02-01 10 0.22
2 A 2000-03-01 20 0.44
3 A 2000-04-01 10 0.22
4 B 2000-01-01 100 0.10
5 B 2000-02-01 200 0.20
6 B 2000-03-01 300 0.30
7 B 2000-04-01 400 0.40
8 B 2001-04-01 20 1.00

You can divide the Value column by the result of a groupby.transform sum, followed by round(2) to match your expected output:
df['Fraction'] = df['Value'] / df.groupby('ID')['Value'].transform('sum')
df['Fraction'] = df['Fraction'].round(2)
print(df)
ID Date Value Fraction
0 A 1/1/2000 5 0.11
1 A 2/1/2000 10 0.22
2 A 3/1/2000 20 0.44
3 A 4/1/2000 10 0.22
4 B 1/1/2000 100 0.10
5 B 2/1/2000 200 0.20
6 B 3/1/2000 300 0.30
7 B 4/1/2000 400 0.40
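If the real data spans more than one year, the grouping key likely needs the year as well, as in the first answer. A minimal sketch, assuming Date has been parsed to datetime first:

df['Date'] = pd.to_datetime(df['Date'])
df['Fraction'] = (df['Value'] / df.groupby(['ID', df['Date'].dt.year])['Value'].transform('sum')).round(2)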

Related

Share/percent across a list of columns in a pandas agg

I have this kind of dataframe:
import datetime
import pandas as pd

dat = [{"date": datetime.date(2021,1,1), "c_id" : "a", "var1": 2, "var2": 1, "var3" : 10 },
       {"date": datetime.date(2021,1,1), "c_id" : "b", "var1": 2, "var2": 0, "var3" : 20 },
       {"date": datetime.date(2021,2,1), "c_id" : "a", "var1": 2, "var2": 1, "var3" : 30 },
       {"date": datetime.date(2021,2,1), "c_id" : "b", "var1": 2, "var2": 3, "var3" : 10 },
       {"date": datetime.date(2021,3,1), "c_id" : "a", "var1": 2, "var2": 1, "var3" : 30 },
       {"date": datetime.date(2021,3,1), "c_id" : "b", "var1": 2, "var2": 3, "var3" : 20 },
      ]
df = pd.DataFrame(dat)
>>> df
date c_id var1 var2 var3
0 2021-01-01 a 2 1 10
1 2021-01-01 b 2 0 20
2 2021-02-01 a 2 1 30
3 2021-02-01 b 2 3 10
4 2021-03-01 a 2 1 30
5 2021-03-01 b 2 3 20
I'd like to have the share of these 3 named variables per (date, c_id). So for example...
>>> df
date c_id var1 var2 var3 var1_share var2_share var3_share
0 2021-01-01 a 2 1 10 0.15 0.07 0.76
1 2021-01-01 b 2 0 20 0.09 0.00 0.90
2 2021-02-01 a 2 1 30 0.06 0.03 0.90
3 2021-02-01 b 2 3 10 0.13 0.20 0.66
4 2021-03-01 a 2 1 30 0.06 0.03 0.90
5 2021-03-01 b 2 3 20 0.08 0.12 0.80
While I can do this in kind of a dumb way if I list these out individually...
>>> df.insert(5, "var1_share", df.apply(lambda x: x["var1"] / x[["var1", "var2", "var3"]].sum(), axis=1))
>>> df
date c_id var1 var2 var3 var1_share
0 2021-01-01 a 2 1 10 0.153846
1 2021-01-01 b 2 0 20 0.090909
2 2021-02-01 a 2 1 30 0.060606
3 2021-02-01 b 2 3 10 0.133333
4 2021-03-01 a 2 1 30 0.060606
5 2021-03-01 b 2 3 20 0.080000
What's the pandas magic for iterating this procedure over some list of valid columns, mylist= ["var1", "var2", "var3"]? I suspect there is an apply that can do this in a one-liner?
Also, pandas experts, what would this operation be called across columns of a dataframe? I'm sure this is common, but I'm not sure how I could have searched for it better.
Try this:
cols = pd.Index(['var1', 'var2', 'var3'])
df[cols + '_share'] = df[cols].div(df[cols].sum(axis=1), axis=0)
Output:
date c_id var1 var2 var3 var1_share var2_share var3_share
0 2021-01-01 a 2 1 10 0.153846 0.076923 0.769231
1 2021-01-01 b 2 0 20 0.090909 0.000000 0.909091
2 2021-02-01 a 2 1 30 0.060606 0.030303 0.909091
3 2021-02-01 b 2 3 10 0.133333 0.200000 0.666667
4 2021-03-01 a 2 1 30 0.060606 0.030303 0.909091
5 2021-03-01 b 2 3 20 0.080000 0.120000 0.800000
This uses pandas' intrinsic data alignment: pd.DataFrame.div with axis=0 divides the columns by the row totals produced by pd.DataFrame.sum with axis=1.
You can also do it using sum along the columns:
mylist= ["var1", "var2", "var3"]
df[[f'{c}_share' for c in mylist]] = (df[mylist]/df[mylist].sum(axis=1).to_numpy()[:, None]).round(2)
print(df)
date c_id var1 var2 var3 var1_share var2_share var3_share
0 2021-01-01 a 2 1 10 0.15 0.08 0.77
1 2021-01-01 b 2 0 20 0.09 0.00 0.91
2 2021-02-01 a 2 1 30 0.06 0.03 0.91
3 2021-02-01 b 2 3 10 0.13 0.20 0.67
4 2021-03-01 a 2 1 30 0.06 0.03 0.91
5 2021-03-01 b 2 3 20 0.08 0.12 0.80
Something using numpy broadcasting:
s = df.filter(like = 'var')
out = df.join(s/s.sum(axis=1).values[:,None],rsuffix = '_share')
out
Out[121]:
date c_id var1 var2 var3 var1_share var2_share var3_share
0 2021-01-01 a 2 1 10 0.153846 0.076923 0.769231
1 2021-01-01 b 2 0 20 0.090909 0.000000 0.909091
2 2021-02-01 a 2 1 30 0.060606 0.030303 0.909091
3 2021-02-01 b 2 3 10 0.133333 0.200000 0.666667
4 2021-03-01 a 2 1 30 0.060606 0.030303 0.909091
5 2021-03-01 b 2 3 20 0.080000 0.120000 0.800000
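As for what to call this operation: it is essentially row-wise normalization, dividing a block of columns by their row totals via broadcasting. A minimal sketch of that phrasing, using the same mylist as above:

mylist = ["var1", "var2", "var3"]
shares = df[mylist].div(df[mylist].sum(axis=1), axis=0)   # divide each row by its own total
df[[f"{c}_share" for c in mylist]] = shares.to_numpy()    # attach next to the original columns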

Pandas rolling cumulative sum across two dataframes

I'm looking to create a rolling grouped cumulative sum across two dataframes. I can get the result via iteration, but wanted to see if there was a more intelligent way.
I need the 5-row block of A to roll through the rows of B and accumulate. Think of it as a rolling balance with a block of contributions and rolling returns.
So, here's the calculation for C
A B
1 100.00 1 0.01 101.00
2 110.00 2 0.02 215.22 102.00
3 120.00 3 0.03 345.28 218.36 103.00
4 130.00 4 0.04 494.29 351.89 221.52 104.00
5 140.00 5 0.05 666.00 505.99 358.60 224.70 105.00
6 0.06 684.75 517.91 365.38 227.90 106.00
7 0.07 703.97 530.06 372.25 231.12
8 0.08 723.66 542.43 379.21
9 0.09 743.85 555.04
10 0.10 764.54
C Row 5
Beginning Balance Contribution Return Ending Balance
0.00 100.00 0.01 101.00
101.00 110.00 0.02 215.22
215.22 120.00 0.03 345.28
345.28 130.00 0.04 494.29
494.29 140.00 0.05 666.00
C Row 6
Beginning Balance Contribution Return Ending Balance
0.00 100.00 0.02 102.00
102.00 110.00 0.03 218.36
218.36 120.00 0.04 351.89
351.89 130.00 0.05 505.99
505.99 140.00 0.06 684.75
Here's what the source data looks like:
A B
1 100.00 1 0.01
2 110.00 2 0.02
3 120.00 3 0.03
4 130.00 4 0.04
5 140.00 5 0.05
6 0.06
7 0.07
8 0.08
9 0.09
10 0.10
Here is the desired result:
C
1 NaN
2 NaN
3 NaN
4 NaN
5 666.00
6 684.75
7 703.97
8 723.66
9 743.85
10 764.54
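One way to avoid explicit iteration is to roll a 5-row window over B and fold the block of A through it. This is only a sketch, under the assumption that A and B are held as the Series a and b below:

import pandas as pd

# Assumed containers for the posted data (the names a and b are illustrative).
a = pd.Series([100.0, 110.0, 120.0, 130.0, 140.0], index=range(1, 6))
b = pd.Series([0.01, 0.02, 0.03, 0.04, 0.05,
               0.06, 0.07, 0.08, 0.09, 0.10], index=range(1, 11))

def ending_balance(rates):
    # Add each contribution, then grow the running balance by that period's return.
    balance = 0.0
    for contribution, rate in zip(a, rates):
        balance = (balance + contribution) * (1.0 + rate)
    return balance

# One ending balance per full 5-row window of B; earlier rows come out as NaN.
c = b.rolling(len(a)).apply(ending_balance, raw=True)
print(c.round(2))   # 5 -> 666.00, 6 -> 684.75, ..., 10 -> 764.54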

How to convert wide format to long format in a Pandas dataframe?

I have a dataframe df in the format below:
Date TLRA_CAPE TLRA_Pct B_CAPE B_Pct RC_CAPE RC_Pct
1/1/2000 10 0.20 30 0.40 50 0.60
2/1/2000 15 0.25 35 0.45 55 0.65
3/1/2000 17 0.27 37 0.47 57 0.6
I need to convert it into the format below:
Date Variable CAPE Pct
1/1/2000 TLRA 10 0.20
2/1/2000 TLRA 15 0.25
3/1/2000 TLRA 17 0.27
1/1/2000 B 30 0.40
2/1/2000 B 35 0.45
3/1/2000 B 37 0.47
1/1/2000 RC 50 0.60
2/1/2000 RC 55 0.65
3/1/2000 RC 57 0.6
I am struggling to convert it to the required format. I tried pd.melt and pd.pivot, but those are not working.
After reworking your column names you can do this with wide_to_long. (If your real columns mix PCT and Pct, I assumed that is a typo; if not, normalize them first with df.columns = df.columns.str.upper().)
df = df.set_index('Date')
# Flip 'TLRA_CAPE' -> 'CAPE_TLRA' so the measured quantity becomes the stub name.
df.columns = df.columns.str.split('_').map(lambda x: '_'.join(x[::-1]))
pd.wide_to_long(df.reset_index(), ['CAPE', 'Pct'], i='Date', j='Variable', sep='_', suffix=r'\w+')
Out[63]:
CAPE Pct
Date Variable
1/1/2000 TLRA 10 0.20
2/1/2000 TLRA 15 0.25
3/1/2000 TLRA 17 0.27
1/1/2000 B 30 0.40
2/1/2000 B 35 0.45
3/1/2000 B 37 0.47
1/1/2000 RC 50 0.60
2/1/2000 RC 55 0.65
3/1/2000 RC 57 0.60
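If wide_to_long feels opaque, the same reshape can be sketched by splitting the column names into a MultiIndex and stacking. This assumes df exactly as posted; the row order will differ from the target, but the content is the same:

tmp = df.set_index('Date')
# Split 'TLRA_CAPE' -> ('TLRA', 'CAPE') so the prefix becomes its own column level.
tmp.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split('_')) for c in tmp.columns], names=['Variable', None]
)
out = tmp.stack('Variable').reset_index()   # columns: Date, Variable, CAPE, Pct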

Sorting and ranking by dates, on a group in a pandas df

From the following sort of dataframe I would like to be able to both sort and rank the id field on date:
import pandas as pd

df = pd.DataFrame({
'id':[1, 1, 2, 3, 3, 4, 5, 6,6,6,7,7],
'value':[.01, .4, .2, .3, .11, .21, .4, .01, 3, .5, .8, .9],
'date':['10/01/2017 15:45:00','05/01/2017 15:56:00',
'11/01/2017 15:22:00','06/01/2017 11:02:00','05/01/2017 09:37:00',
'05/01/2017 09:55:00','05/01/2017 10:08:00','03/02/2017 08:55:00',
'03/02/2017 09:15:00','03/02/2017 09:31:00','09/01/2017 15:42:00',
'19/01/2017 16:34:00']})
to effectively rank or index, per id, based on date.
I've used
df.groupby('id')['date'].min()
which allows me to extract the first date (although I don't know how to use this to filter out the rows), but I might not always need the first date: sometimes it will be the second or third, so I need to generate a new column holding a rank index for each id's dates.
Any ideas on this sorting/ranking/labelling?
EDIT
My original example ignored a very prevalent issue.
Some ids feasibly have multiple tests performed on them in parallel, so they show up as multiple rows in the database with matching dates (the date corresponds to when they were logged). These should be counted as the same date and should not increment the date_rank. I've generated an updated example, with the date_rank adjusted, to demonstrate how this would look:
df = pd.DataFrame({
'id':[1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6,6,6,7,7],
'value':[.01, .4, .5, .7, .77, .1,.2, 0.3, .11, .21, .4, .01, 3, .5, .8, .9, .1],
'date':['10/01/2017 15:45:00','10/01/2017 15:45:00','05/01/2017 15:56:00',
'11/01/2017 15:22:00','11/01/2017 15:22:00','06/01/2017 11:02:00','05/01/2017 09:37:00','05/01/2017 09:37:00','05/01/2017 09:55:00',
'05/01/2017 09:55:00','05/01/2017 10:08:00','05/01/2017 10:09:00','03/02/2017 08:55:00',
'03/02/2017 09:15:00','03/02/2017 09:31:00','09/01/2017 15:42:00',
'19/01/2017 16:34:00']})
The counter should then treat those tied dates as a single rank.
You can try sorting the date values in descending order and aggregating within each 'id' group.
@praveen's logic is simpler; extending it, you can use astype('category') to convert the values to categories and retrieve their category codes, but the result will be slightly different from your expected output:
df1 = df.sort_values(['id', 'date'], ascending=[True, False])
df1['date_rank'] = df1.groupby(['id']).apply(lambda x: x['date'].astype('category').cat.codes + 1).values
Out:
date id value date_rank
0 10/01/2017 15:45:00 1 0.01 2
1 10/01/2017 15:45:00 1 0.40 2
2 05/01/2017 15:56:00 1 0.50 1
3 11/01/2017 15:22:00 2 0.70 1
4 11/01/2017 15:22:00 2 0.77 1
5 06/01/2017 11:02:00 3 0.10 2
6 05/01/2017 09:37:00 3 0.20 1
7 05/01/2017 09:37:00 3 0.30 1
8 05/01/2017 09:55:00 4 0.11 1
9 05/01/2017 09:55:00 4 0.21 1
11 05/01/2017 10:09:00 5 0.01 2
10 05/01/2017 10:08:00 5 0.40 1
14 03/02/2017 09:31:00 6 0.80 3
13 03/02/2017 09:15:00 6 0.50 2
12 03/02/2017 08:55:00 6 3.00 1
16 19/01/2017 16:34:00 7 0.10 2
15 09/01/2017 15:42:00 7 0.90 1
But to get your exact output, here I have used a dictionary, reversing its keys and values so that each unique date maps to a 1-based rank:
df1 = df.sort_values(['id', 'date'], ascending=[True, False])
df1['date_rank'] = df1.groupby(['id'])['date'].transform(lambda x: list(map(lambda y: dict(map(reversed, dict(enumerate(x.unique())).items()))[y]+1,x)) )
Out:
date id value date_rank
0 10/01/2017 15:45:00 1 0.01 1
1 10/01/2017 15:45:00 1 0.40 1
2 05/01/2017 15:56:00 1 0.50 2
3 11/01/2017 15:22:00 2 0.70 1
4 11/01/2017 15:22:00 2 0.77 1
5 06/01/2017 11:02:00 3 0.10 1
6 05/01/2017 09:37:00 3 0.20 2
7 05/01/2017 09:37:00 3 0.30 2
8 05/01/2017 09:55:00 4 0.11 1
9 05/01/2017 09:55:00 4 0.21 1
11 05/01/2017 10:09:00 5 0.01 1
10 05/01/2017 10:08:00 5 0.40 2
14 03/02/2017 09:31:00 6 0.80 1
13 03/02/2017 09:15:00 6 0.50 2
12 03/02/2017 08:55:00 6 3.00 3
16 19/01/2017 16:34:00 7 0.10 1
15 09/01/2017 15:42:00 7 0.90 2
You can do this with sort_values, groupby and cumcount:
df['date_rank'] = df.sort_values(['id', 'date'], ascending=[True, False]).groupby(['id']).cumcount() + 1
demo
In [1]: df = pd.DataFrame({
...: 'id':[1, 1, 2, 3, 3, 4, 5, 6,6,6,7,7],
...: 'value':[.01, .4, .2, .3, .11, .21, .4, .01, 3, .5, .8, .9],
...: 'date':['10/01/2017 15:45:00','05/01/2017 15:56:00',
...: '11/01/2017 15:22:00','06/01/2017 11:02:00','05/01/2017 09:37:00',
...: '05/01/2017 09:55:00','05/01/2017 10:08:00','03/02/2017 08:55:00',
...: '03/02/2017 09:15:00','03/02/2017 09:31:00','09/01/2017 15:42:00',
...: '19/01/2017 16:34:00']})
...:
In [2]: df['date_rank'] = df.sort_values(['id', 'date'], ascending=[True, False]).groupby(['id']).cumcount() + 1
...:
In [3]: df
Out[3]:
id value date date_rank
0 1 0.01 10/01/2017 15:45:00 1
1 1 0.40 05/01/2017 15:56:00 2
2 2 0.20 11/01/2017 15:22:00 1
3 3 0.30 06/01/2017 11:02:00 1
4 3 0.11 05/01/2017 09:37:00 2
5 4 0.21 05/01/2017 09:55:00 1
6 5 0.40 05/01/2017 10:08:00 1
7 6 0.01 03/02/2017 08:55:00 3
8 6 3.00 03/02/2017 09:15:00 2
9 6 0.50 03/02/2017 09:31:00 1
10 7 0.80 09/01/2017 15:42:00 2
11 7 0.90 19/01/2017 16:34:00 1
Edit
You can also do this with the rank method:
df.groupby(['id'])['date'].rank(ascending=False, method='dense').astype(int)
demo
In [1]: df['rank'] = df.groupby(['id'])['date'].rank(ascending=False, method='dense').astype(int)
In [2]: df
Out[2]:
id value date rank
0 1 0.01 2017-10-01 15:45:00 1
1 1 0.40 2017-10-01 15:45:00 1
2 1 0.50 2017-05-01 15:56:00 2
3 2 0.70 2017-11-01 15:22:00 1
4 2 0.77 2017-11-01 15:22:00 1
5 3 0.10 2017-06-01 11:02:00 1
6 3 0.20 2017-05-01 09:37:00 2
7 3 0.30 2017-05-01 09:37:00 2
8 4 0.11 2017-05-01 09:55:00 1
9 4 0.21 2017-05-01 09:55:00 1
10 5 0.40 2017-05-01 10:08:00 2
11 5 0.01 2017-05-01 10:09:00 1
12 6 3.00 2017-03-02 08:55:00 3
13 6 0.50 2017-03-02 09:15:00 2
14 6 0.80 2017-03-02 09:31:00 1
15 7 0.90 2017-09-01 15:42:00 1
16 7 0.10 2017-01-19 16:34:00 2
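One caveat for both approaches: date in these frames is a plain string in day-first format, so sorting or ranking the raw strings is not guaranteed to follow chronological order. Parsing the column first is probably what you want; a sketch:

df['date'] = pd.to_datetime(df['date'], dayfirst=True)   # '19/01/2017 16:34:00' -> 2017-01-19 16:34:00
df['date_rank'] = df.groupby('id')['date'].rank(ascending=False, method='dense').astype(int)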

Fill zero values for combinations of unique multi-index values after groupby

To explain my problem better, let's pretend I have a shop with 3 unique customers and my dataframe contains every purchase made by my customers, with weekday, name and paid price.
name price weekday
0 Paul 18.44 0
1 Micky 0.70 0
2 Sarah 0.59 0
3 Sarah 0.27 1
4 Paul 3.45 2
5 Sarah 14.03 2
6 Paul 17.21 3
7 Micky 5.35 3
8 Sarah 0.49 4
9 Micky 17.00 4
10 Paul 2.62 4
11 Micky 17.61 5
12 Micky 10.63 6
The information I would like to get is the average price per unique customer per weekday. What I often do in similar situations is to group by several columns with sum and then take the average over a subset of the columns.
df = df.groupby(['name','weekday']).sum()
price
name weekday
Micky 0 0.70
3 5.35
4 17.00
5 17.61
6 10.63
Paul 0 18.44
2 3.45
3 17.21
4 2.62
Sarah 0 0.59
1 0.27
2 14.03
4 0.49
df = df.groupby(['weekday']).mean()
price
weekday
0 6.576667
1 0.270000
2 8.740000
3 11.280000
4 6.703333
5 17.610000
6 10.630000
Of course this only works if all my unique customers have at least one purchase per day.
Is there an elegant way to get a zero value for all combinations of unique index values that have no sum after the first groupby?
So far my solution has been either to reindex on a MultiIndex built from the unique values of the grouped columns, or the unstack-fillna-stack combination, but neither really satisfies me.
Appreciate your help!
IIUC, let's use unstack and fillna then stack:
df_out = df.groupby(['name','weekday']).sum().unstack().fillna(0).stack()
Output:
price
name weekday
Micky 0 0.70
1 0.00
2 0.00
3 5.35
4 17.00
5 17.61
6 10.63
Paul 0 18.44
1 0.00
2 3.45
3 17.21
4 2.62
5 0.00
6 0.00
Sarah 0 0.59
1 0.27
2 14.03
3 0.00
4 0.49
5 0.00
6 0.00
And then take the mean per weekday:
df_out.groupby('weekday').mean()
Output:
price
weekday
0 6.576667
1 0.090000
2 5.826667
3 7.520000
4 6.703333
5 5.870000
6 3.543333
I think you can use pivot_table to do all the steps at once. I'm not exactly sure what you want but the default aggregation from pivot_table is the mean. You can change it to 'sum'.
df1 = df.pivot_table(index='name', columns='weekday', values='price',
fill_value=0, aggfunc='sum')
weekday 0 1 2 3 4 5 6
name
Micky 0.70 0.00 0.00 5.35 17.00 17.61 10.63
Paul 18.44 0.00 3.45 17.21 2.62 0.00 0.00
Sarah 0.59 0.27 14.03 0.00 0.49 0.00 0.00
And then take the mean of each column.
df1.mean()
weekday
0 6.576667
1 0.090000
2 5.826667
3 7.520000
4 6.703333
5 5.870000
6 3.543333
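For what it's worth, the two steps can be chained into a single expression; a sketch under the same assumptions:

weekday_avg = (df.pivot_table(index='name', columns='weekday', values='price',
                              fill_value=0, aggfunc='sum')
                 .mean())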
