Assuming I have a Pandas dataframe similar to the one below, how would I get the rolling correlation (over 2 days in this example) between 2 specific columns, grouped by the 'ID' column? I am familiar with the Pandas rolling_corr() function but I cannot figure out how to combine it with groupby().
What I have:
ID Date Val1 Val2
A 1-Jan 45 22
A 2-Jan 15 66
A 3-Jan 55 13
B 1-Jan 41 12
B 2-Jan 87 45
B 3-Jan 82 66
C 1-Jan 33 34
C 2-Jan 15 67
C 3-Jan 46 22
What I need:
ID Date Val1 Val2 Rolling_Corr
A 1-Jan 45 22
A 2-Jan 15 66 0.1
A 3-Jan 55 13 0.16
B 1-Jan 41 12
B 2-Jan 87 45 0.15
B 3-Jan 82 66 0.05
C 1-Jan 33 34
C 2-Jan 15 67 0.09
C 3-Jan 46 22 0.11
Thanks!
You can actually start with the simple approach from Pandas Correlation Groupby and then add rolling(3), like this:
df.groupby('ID')[['Val1','Val2']].rolling(3).corr()
I've changed the window from 2 to 3 because you'll only get 1 or -1 with a window size of 2. Unfortunately, that output (not shown) is a bit verbose because it outputs a 2x2 correlation matrix when all you need is a scalar.
But with an additional line you can make the output more concise:
df2 = df.groupby('ID')[['Val1','Val2']].rolling(3).corr()
df2.groupby(level=[0,1]).last()['Val1']
ID
A 0 NaN
1 NaN
2 -0.996539
B 3 NaN
4 NaN
5 0.879868
C 6 NaN
7 NaN
8 -0.985529
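If you want that number back on the original frame as a Rolling_Corr column (the shape asked for above), here is a small sketch, assuming df keeps its default RangeIndex, that drops the ID level and lets pandas align on the row index:
corr = df2.groupby(level=[0, 1]).last()['Val1']
# drop the ID level; what remains is the original row index, so assignment aligns row by row
df['Rolling_Corr'] = corr.droplevel(0)
df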
I need to pivot a long pandas dataframe to wide. The issue is that for some ids there are multiple values for the same parameter, and some parameters are present only in a few ids.
import pandas as pd

df = pd.DataFrame({'indx': [11, 11, 11, 11, 12, 12, 12, 13, 13, 13, 13],
                   'param': ['a', 'b', 'b', 'c', 'a', 'b', 'd', 'a', 'b', 'c', 'c'],
                   'value': [100, 54, 65, 65, 789, 24, 98, 24, 27, 75, 35]})
indx param value
11 a 100
11 b 54
11 b 65
11 c 65
12 a 789
12 b 24
12 d 98
13 a 24
13 b 27
13 c 75
13 c 35
I want to receive something like this:
indx a b c d
11 100 `54,65` 65 None
12 789 None 98 24
13 24 27 `75,35` None
or
indx a b b1 c c1 d
11 100 54 65 65 None None
12 789 None None 98 None 24
13 24 27 None 75 35 None
So, obviously, a plain df.pivot() is not a solution here (it fails on the duplicate index/column pairs).
Thanks for any help.
Option 1:
df.astype(str).groupby(['indx', 'param'])['value'].agg(','.join).unstack()
Output:
param a b c d
indx
11 100 54,65 65 NaN
12 789 24 NaN 98
13 24 27 75,35 NaN
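Note that astype(str) above casts every column to string, including indx, so the resulting index holds strings. A hedged variant that casts only value (keeping indx numeric) could look like this:
df.assign(value=df['value'].astype(str)) \
  .groupby(['indx', 'param'])['value'].agg(','.join).unstack()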
Option 2:
df_out = df.set_index(['indx', 'param', df.groupby(['indx','param']).cumcount()])['value'].unstack([1,2])
df_out.columns = [f'{i}_{j}' if j != 0 else f'{i}' for i, j in df_out.columns]
df_out.reset_index()
Output:
indx a b b_1 c d c_1
0 11 100.0 54.0 65.0 65.0 NaN NaN
1 12 789.0 24.0 NaN NaN 98.0 NaN
2 13 24.0 27.0 NaN 75.0 NaN 35.0
OK, I found a solution (there is a method df.pivot_table for such cases, which allows different aggregation functions):
df.pivot_table(index='indx', columns='param', values='value', aggfunc=lambda x: ','.join(x.astype(str)))
indx a b c d
11 100 54,65 65 NaN
12 789 24 NaN 98
13 24 27 75,35 NaN
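If you would rather keep the raw numbers instead of a joined string, the same pivot_table pattern can collect the duplicates into lists (a sketch, not part of the original answer):
df.pivot_table(index='indx', columns='param', values='value', aggfunc=list)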
I want to get multiple customized percentile values (0.10, 0.20, ..., 0.50) for each type from the following dataframe:
type value
0 a 90
1 a 63
2 a 11
3 a 61
4 a 78
5 a 67
6 a 89
7 a 12
8 a 43
9 a 30
10 b 72
11 b 84
12 b 74
13 b 66
14 b 80
15 b 75
16 b 47
17 b 22
18 b 5
19 b 64
20 b 32
21 b 45
I have proceeded to this step:
df['percentile_rank'] = df.groupby('type').value.rank(pct=True).round(2)
Output:
type value percentile_rank
0 a 90 1.00
1 a 63 0.60
2 a 11 0.10
3 a 61 0.50
4 a 78 0.80
5 a 67 0.70
6 a 89 0.90
7 a 12 0.20
8 a 43 0.40
9 a 30 0.30
10 b 72 0.67
11 b 84 1.00
12 b 74 0.75
13 b 66 0.58
14 b 80 0.92
15 b 75 0.83
16 b 47 0.42
17 b 22 0.17
18 b 5 0.08
19 b 64 0.50
20 b 32 0.25
21 b 45 0.33
But I don't know how to get an expected result like this:
type top10 top20 top30 top40 top50
0 a 89 78 67 63 61
1 b 80 75 72 66 64
In the table above, top10 corresponds to a percentile_rank of 0.90, top20 to 0.80, and so on. If there is no exact percentile value, then we take the closest one; for example, for top10 of type b I use the value 80, whose percentile_rank is 0.92.
Thanks in advance for your help.
Update:
Output from Andy L.'s method; notice the NaNs for top55 and top45:
type top95 top90 top85 top80 top75 top70 top65 top60 top55 top50 \
0 e 40.82 41.81 41.82 42.35 43.85 44.42 44.99 45.92 NaN 45.94
top45 top40 top35 top30 top25 top20 top15 top10 top5
0 NaN 46.04 46.25 46.45 46.85 47.49 48.55 49.82 52.18
Output from YOBEN_S's method:
type top95 top90 top85 top80 top75 top70 top65 top60 top55 top50 \
0 e 40.704 41.82 41.82 42.326 43.7 44.36 44.94 45.94 45.94 45.94
top45 top40 top35 top30 top25 top20 top15 top10 top5
0 45.94 46.04 46.226 46.42 46.82 47.412 48.412 49.776 52.008
We could use quantile:
s = df.groupby('type').value.apply(lambda x: x.quantile([0.9, 0.8, 0.7, 0.6, 0.5])).unstack()
Out[64]:
0.9 0.8 0.7 0.6 0.5
type
a 89.1 80.2 70.3 64.6 62.0
b 79.5 74.8 73.4 69.6 65.0
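If you also want column labels in the question's topNN style, a small follow-up sketch renames the quantile columns (the naming convention is taken from the expected output above):
# 0.9 -> 'top10', 0.8 -> 'top20', etc.
s.columns = [f'top{int(round((1 - q) * 100))}' for q in s.columns]
s = s.reset_index()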
If you want to use the existing values, I guess you may use pd.cut and a groupby max as follows:
bins = [0, 0.5, 0.6, 0.7, 0.8, 0.99]
labels = ['top50', 'top40', 'top30', 'top20', 'top10']
s = pd.cut(df.percentile_rank, bins=bins, labels=labels, right=True)
df_out = df.groupby(['type', s]).value.max().unstack()
Out[57]:
percentile_rank top50 top40 top30 top20 top10
type
a 61 63 67 78 89
b 64 66 72 74 80
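The unstacked columns follow the bin order (top50 … top10); if you prefer the question's top10 … top50 order, you can simply reindex the columns with the labels list reversed, e.g.:
df_out = df_out[labels[::-1]].reset_index()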
I've faced a problem while applying sort_values() and cumsum() within a group.
I have a dataset (built in the code below).
Basically, I need to sort the values within each group, get the cumulative sales, and then select the lines that make up 90% of sales within each region.
I have tried the following, but the last line doesn't work. It returns an error: Cannot access callable attribute 'sort_values' of 'SeriesGroupBy' objects, try using the 'apply' method.
I've tried apply as well.
import pandas as pd
df = pd.DataFrame({'id': ['id_1', 'id_2', 'id_3', 'id_4', 'id_5', 'id_6', 'id_7', 'id_8',
                          'id_1', 'id_2', 'id_3', 'id_4', 'id_5', 'id_6', 'id_7', 'id_8'],
                   'region': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
                   'sales': [54, 34, 23, 56, 78, 98, 76, 34, 27, 89, 76, 54, 34, 45, 56, 54]})
df['%']=df['sales']/df.groupby(df['region'])['sales'].transform('sum')
df['cumul'] = df.groupby(df['region'])['sales'].sort_values(ascending=False).cumsum()
Thank you for any suggestions
You can definitely sort the dataframe first, then do groupby():
df.sort_values(['region', 'sales'], ascending=[True, False], inplace=True)
df['%'] = df['sales'] / df.groupby('region')['sales'].transform('sum')
df['cummul'] = df.groupby('region')['%'].cumsum()
# filter
df[df['cummul'].le(0.9)]
Output:
id region sales % cummul
5 id_6 1 98 0.216336 0.216336
4 id_5 1 78 0.172185 0.388521
6 id_7 1 76 0.167770 0.556291
3 id_4 1 56 0.123620 0.679912
0 id_1 1 54 0.119205 0.799117
1 id_2 1 34 0.075055 0.874172
9 id_2 2 89 0.204598 0.204598
10 id_3 2 76 0.174713 0.379310
14 id_7 2 56 0.128736 0.508046
11 id_4 2 54 0.124138 0.632184
15 id_8 2 54 0.124138 0.756322
13 id_6 2 45 0.103448 0.859770
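One caveat: the le(0.9) filter drops the row that pushes each region past 90%, so the selection stops just short of 90% of sales. If you want to include that boundary row as well, a hedged variant filters on the shifted cumulative share instead:
# keep rows whose cumulative share *before* adding them is still below 90%,
# so the row that crosses the threshold is kept too
mask = df.groupby('region')['cummul'].shift(fill_value=0).lt(0.9)
df[mask]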
First we use your logic to create the % column, but we multiply by 100 and round to whole numbers.
Then we sort by region and %, no need for groupby.
After we sort, we create the cumul column.
And finally we select those within the 90% range with query:
df['%'] = df['sales'].div(df.groupby('region')['sales'].transform('sum')).mul(100).round()
df = df.sort_values(['region', '%'], ascending=[True, False])
df['cumul'] = df.groupby('region')['%'].cumsum()
df.query('cumul.le(90)')
Output:
id region sales % cumul
5 id_6 1 98 22.0 22.0
4 id_5 1 78 17.0 39.0
6 id_7 1 76 17.0 56.0
0 id_1 1 54 12.0 68.0
3 id_4 1 56 12.0 80.0
1 id_2 1 34 8.0 88.0
9 id_2 2 89 20.0 20.0
10 id_3 2 76 17.0 37.0
14 id_7 2 56 13.0 50.0
11 id_4 2 54 12.0 62.0
15 id_8 2 54 12.0 74.0
13 id_6 2 45 10.0 84.0
If you just need the sales data without the percentage, this can easily be done with method chaining:
(
df
.sort_values(by='sales', ascending=False)
.groupby('region')
.apply(lambda x: x[x.sales > x.sales.quantile(.1)])
.reset_index(level=0, drop=True)
)
Output
id region sales
5 id_6 1 98
4 id_5 1 78
6 id_7 1 76
3 id_4 1 56
0 id_1 1 54
1 id_2 1 34
7 id_8 1 34
9 id_2 2 89
10 id_3 2 76
14 id_7 2 56
11 id_4 2 54
15 id_8 2 54
13 id_6 2 45
12 id_5 2 34
This works because keeping the values above the 10th percentile keeps roughly the top 90% of rows in each group (note that this is a count-based cut, not a cut on the cumulative share of sales).
I have a group of data like below
ID Type value_1 value_2
1 A 12 89
2 A 13 78
3 A 11 92
4 A 9 79
5 B 15 83
6 B 34 91
7 B 2 87
8 B 3 86
9 B 7 85
10 C 9 83
11 C 3 85
12 C 2 87
13 C 12 88
14 C 11 82
I want to get the top 3 members of each Type according to value_1. The only solution that occurs to me is: first, get each Type's data into its own dataframe, sort it by value_1 and take the top 3; then merge the results back together.
But is there any simpler method to solve it? For easy discussion, I have the code below:
#coding:utf-8
import pandas as pd
_data = [
["1","A",12,89],
["2","A",13,78],
["3","A",11,92],
["4","A",9,79],
["5","B",15,83],
["6","B",34,91],
["7","B",2,87],
["8","B",3,86],
["9","B",7,85],
["10","C",9,83],
["11","C",3,85],
["12","C",2,87],
["13","C",12,88],
["14","C",11,82]
]
head= ["ID","type","value_1","value_2"]
df = pd.DataFrame(_data, columns=head)
Then we use groupby tail with sort_values:
newdf=df.sort_values(['type','value_1']).groupby('type').tail(3)
newdf
ID type value_1 value_2
2 3 A 11 92
0 1 A 12 89
1 2 A 13 78
8 9 B 7 85
4 5 B 15 83
5 6 B 34 91
9 10 C 9 83
13 14 C 11 82
12 13 C 12 88
Sure! DataFrame.groupby can split a dataframe into different parts by the group fields, and the apply function can apply a UDF to each group.
df.groupby('type', as_index=False, group_keys=False)\
.apply(lambda x: x.sort_values('value_1', ascending=False).head(3))
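A third, equally short option (a sketch along the same lines) is to sort once in descending order and take the first three rows of each group:
df.sort_values('value_1', ascending=False).groupby('type').head(3)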
I am working with a pandas df that looks like this:
ID time
34 43
2 99
2 20
34 8
2 90
What would be the best approach to create a variable that represents the difference from the most recent previous time per ID?
ID time diff
34 43 35
2 99 9
2 20 NA
34 8 NA
2 90 70
Here's one possibility:
df["diff"] = df.sort_values("time").groupby("ID")["time"].diff()
df
ID time diff
0 34 43 35.0
1 2 99 9.0
2 2 20 NaN
3 34 8 NaN
4 2 90 70.0
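An equivalent, slightly more explicit sketch computes the same column by subtracting, within each ID, the previous time in chronological order:
# shift() within each ID after sorting by time gives the previous (most recent earlier) time;
# the result aligns back to df's original row order
df["diff"] = df["time"] - df.sort_values("time").groupby("ID")["time"].shift()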