How to Group by the mean of specific columns in Python

In the dataframe below:
import pandas as pd
import numpy as np
df = {
'Gen':['M','M','M','M','F','F','F','F','M','M','M','M','F','F','F','F'],
'Site':['FRX','FX','FRX','FRX','FRX','FX','FRX','FX','FX','FX','FX','FRX','FRX','FRX','FRX','FRX'],
'Type':['L','L','L','L','L','L','L','L','R','R','R','R','R','R','R','R'],
'AIC':['<1','<1','<1','<1',1,1,1,1,2,2,2,2,'>2','>2','>2','>2'],
'AIC_TRX':[1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
'diff':[-1,-1,-1,-1,0,0,0,0,1,1,1,1,3,3,3,3],
'series':[1,2,4,8,1,2,4,8,1,2,4,8,1,2,4,8],
'Grwth_Time1':[150.78,162.34,188.53,197.69,208.07,217.76,229.48,139.51,146.87,182.54,189.57,199.97,229.28,244.73,269.91,249.19],
'Grwth_Time2':[250.78,262.34,288.53,297.69,308.07,317.7,329.81,339.15,346.87,382.54,369.59,399.97,329.28,347.73,369.91,349.12],
'Grwth_Time3':[240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
'Grwth_Time4':[270.84,282.14,298.53,306.69,318.73,327.47,369.63,389.59,398.75,432.18,449.78,473.55,494.85,509.39,515.52,539.23],
'Grwth_Time5':[25.78,22.34,28.53,27.69,30.07,17.7,29.81,33.15,34.87,32.54,36.59,39.97,29.28,34.73,36.91,34.12],
'Grwth_Time6':[240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
'Grwth_Time7':[27.84,28.14,29.53,30.69,18.73,27.47,36.63,38.59,38.75,24.18,24.78,21.55,13.85,9.39,15.52,39.23],
}
df = pd.DataFrame(df,columns = ['Gen','Site','Type','AIC','AIC_TRX','diff','series','Grwth_Time1','Grwth_Time2','Grwth_Time3','Grwth_Time4','Grwth_Time5','Grwth_Time6','Grwth_Time7'])
df.info()
I want to do the following:
Find the average of each unique series per AIC_TRX for each Grwth_Time (Grwth_Time1, Grwth_Time2,....,Grwth_Time7)
Export all the outputs as one xlsx file (refer to the figure below)
The desired outputs look like the figure below (note: the numbers in this output are not the actual average values, they were randomly generated)
My attempt:
# Select the columns -> AIC_TRX, series, Grwth_Time1,Grwth_Time2,....,Grwth_Time7
df1 = df[['AIC_TRX', 'diff', 'series',
'Grwth_Time1', 'Grwth_Time2', 'Grwth_Time3', 'Grwth_Time4',
'Grwth_Time5', 'Grwth_Time6', 'Grwth_Time7']]
# Below is where I need help: I want to group by 'series' and 'AIC_TRX' for all of Grwth_Time1 to Grwth_Time7
df1.groupby('series').Grwth_Time1.agg(['mean'])
Thanks in advance

You have to group by two columns, ['series', 'AIC_TRX'], and take the mean of each Grwth_Time column:
df.groupby(['series', 'AIC_TRX'])[['Grwth_Time1', 'Grwth_Time2', 'Grwth_Time3',
'Grwth_Time4', 'Grwth_Time5', 'Grwth_Time6', 'Grwth_Time7']].mean().unstack().to_excel("output.xlsx")
Output:
AIC_TRX 1 2 3 4
series
1 150.78 208.07 146.87 229.28
2 162.34 217.76 182.54 244.73
4 188.53 229.48 189.57 269.91
8 197.69 139.51 199.97 249.19
AIC_TRX 1 2 3 4
series
1 250.78 308.07 346.87 329.28
2 262.34 317.70 382.54 347.73
4 288.53 329.81 369.59 369.91
8 297.69 339.15 399.97 349.12
AIC_TRX 1 2 3 4
series
1 240.18 338.07 365.87 429.08
2 232.14 307.74 392.48 448.39
4 258.53 359.16 399.97 465.15
8 276.69 339.25 410.75 469.33
AIC_TRX 1 2 3 4
series
1 270.84 318.73 398.75 494.85
2 282.14 327.47 432.18 509.39
4 298.53 369.63 449.78 515.52
8 306.69 389.59 473.55 539.23
AIC_TRX 1 2 3 4
series
1 25.78 30.07 34.87 29.28
2 22.34 17.70 32.54 34.73
4 28.53 29.81 36.59 36.91
8 27.69 33.15 39.97 34.12
AIC_TRX 1 2 3 4
series
1 240.18 338.07 365.87 429.08
2 232.14 307.74 392.48 448.39
4 258.53 359.16 399.97 465.15
8 276.69 339.25 410.75 469.33
AIC_TRX 1 2 3 4
series
1 27.84 18.73 38.75 13.85
2 28.14 27.47 24.18 9.39
4 29.53 36.63 24.78 15.52
8 30.69 38.59 21.55 39.23
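
If the goal is one workbook with each Grwth_Time table on its own sheet, as in the asker's figure, one way is to write each block of the unstacked result to a separate sheet. This is a minimal sketch, assuming an Excel engine such as openpyxl is installed:
grwth_cols = [f'Grwth_Time{i}' for i in range(1, 8)]
means = df.groupby(['series', 'AIC_TRX'])[grwth_cols].mean().unstack()

with pd.ExcelWriter('output.xlsx') as writer:
    for col in grwth_cols:
        # means[col] is the series-by-AIC_TRX table for this time point
        means[col].to_excel(writer, sheet_name=col)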

Just use the df.apply method to average across the columns of each row, based on the series and AIC_TRX grouping.
result = df1.groupby(['series', 'AIC_TRX']).apply(np.mean, axis=1)
Result:
series AIC_TRX
1 1 0 120.738
2 4 156.281
3 8 170.285
4 12 196.270
2 1 1 122.358
2 5 152.758
3 9 184.494
4 13 205.175
4 1 2 135.471
2 6 171.968
3 10 187.825
4 14 214.907
8 1 3 142.183
2 7 162.849
3 11 196.851
4 15 216.455
dtype: float64
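
The same row-wise means can be computed without apply, which is usually faster. A sketch; note that, like the apply version above, this average includes the AIC_TRX, diff and series columns themselves:
row_means = df1.mean(axis=1)  # mean across all numeric columns, per row
row_means.index = pd.MultiIndex.from_frame(df1[['series', 'AIC_TRX']])
row_means = row_means.sort_index()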

Related

Fastest way to access a dataframe cell by column values?

I have the following dataframe :
time bk1_lvl0_id bk2_lvl0_id pr_ss order_upto_level initial_inventory leadtime1 leadtime2 adjusted_leadtime
0 2020 1000 3 16 18 17 3 0.100000 1
1 2020 10043 3 65 78 72 12 0.400000 1
2 2020 1005 3 0 1 1 9 0.300000 1
3 2020 1009 3 325 363 344 21 0.700000 1
4 2020 102 3 0 1 1 7 0.233333 1
I want a function to get the pr_ss for a given pair, for example (bk1_lvl0_id=1000, bk2_lvl0_id=3).
That's the code I've tried, but it takes time:
def get_safety_stock(df, bk1, bk2):
    ## a function that returns the safety stock for any given (bk1, bk2)
    for index, row in df.iterrows():
        if (row["bk1_lvl0_id"] == bk1) and (row["bk2_lvl0_id"] == bk2):
            return int(row["pr_ss"])
If your dataframe has no duplicate values based on bk1_lvl0_id and bk2_lvl0_id, you can make the function as follows:
def get_safety_stock(df, bk1, bk2):
    return df.loc[df.bk1_lvl0_id.eq(bk1) & df.bk2_lvl0_id.eq(bk2), 'pr_ss'].iloc[0]
Note that this accesses the first value of the filtered Series positionally (.iloc[0] rather than [0], which would look up the row label 0 and fail for most rows), which shouldn't be an issue if there are no duplicates in the data. If you want all of them, just remove the .iloc[0] from the end and it should give you the whole Series. This can be called as follows:
get_safety_stock(df, 1000,3)
>>>16
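
If you do many lookups, it is usually faster still to build a MultiIndex once and look up by label. A sketch, assuming the (bk1_lvl0_id, bk2_lvl0_id) pairs are unique:
# build the index once, then each lookup is a label access instead of a full scan
indexed = df.set_index(['bk1_lvl0_id', 'bk2_lvl0_id']).sort_index()

def get_safety_stock(bk1, bk2):
    return int(indexed.loc[(bk1, bk2), 'pr_ss'])

get_safety_stock(1000, 3)
>>>16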

Pandas: how to select rows in a data frame based on a condition on a specific column [duplicate]

This question already has answers here:
Pandas split DataFrame by column value
(5 answers)
Closed 3 years ago.
I have a given data frame as below example:
0 1 2 3 4 5 6 7 8
0 842517 M 20.57 17.77 132.9 1326 0.08474 0.07864 0.0869
1 84300903 M 19.69 21.25 130 1203 0.1096 0.1599 0.1974
2 84348301 M 11.42 20.38 77.58 386.1 0.1425 0.2839 0.2414
3 843786 M 12.45 15.7 82.57 477.1 0.1278 0.17 0.1578
4 844359 M 18.25 19.98 119.6 1040 0.09463 0.109 0.1127
And I wrote a function that should split the dataset into two data frames, based on comparing the values in a specific column against a given value.
For example, if I have col_idx = 2 and value=18.3 the result should be:
df1 - below the value:
0 1 2 3 4 5 6 7 8
2 84348301 M 11.42 20.38 77.58 386.1 0.1425 0.2839 0.2414
3 843786 M 12.45 15.7 82.57 477.1 0.1278 0.17 0.1578
4 844359 M 18.25 19.98 119.6 1040 0.09463 0.109 0.1127
df2 - above the value:
0 1 2 3 4 5 6 7 8
0 842517 M 20.57 17.77 132.9 1326 0.08474 0.07864 0.0869
1 84300903 M 19.69 21.25 130 1203 0.1096 0.1599 0.1974
The function should look like:
def split_dataset(data_set, col_idx, value):
    below_df = ?
    above_df = ?
    return below_df, above_df
Can anybody complete my script please?
below_df = data_set[data_set[col_idx] < value]
above_df = data_set[data_set[col_idx] > value] # you have to deal with data_set[col_idx] == value though
You can use loc:
def split_dataset(data_set, col_idx, value):
    below_df = data_set.loc[data_set[col_idx] <= value]
    above_df = data_set.loc[data_set[col_idx] >= value]
    return below_df, above_df

df1, df2 = split_dataset(df, '2', 18.3)
Output:
df1
0 1 2 3 4 5 6 7 8
2 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.2839 0.2414
3 843786 M 12.45 15.70 82.57 477.1 0.12780 0.1700 0.1578
4 844359 M 18.25 19.98 119.60 1040.0 0.09463 0.1090 0.1127
df2
0 1 2 3 4 5 6 7 8
0 842517 M 20.57 17.77 132.9 1326.0 0.08474 0.07864 0.0869
1 84300903 M 19.69 21.25 130.0 1203.0 0.10960 0.15990 0.1974
Note:
Note that in this function call the column names are strings ('2'), even though they look like numbers, so check the actual type of your column labels before calling the function.
You should also define what happens when the value used to divide the data frame (value) occurs in the column: with <= and >= above, such rows end up in both frames.
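One way to settle that boundary question, sketched below, is to build a single boolean mask so that rows equal to value land in exactly one frame (here, below_df):
def split_dataset(data_set, col_idx, value):
    mask = data_set[col_idx] <= value   # rows equal to value go below
    return data_set[mask], data_set[~mask]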

Pandas : How to calculate PCT Change for all columns dynamically?

I got the following pandas df by using the command below. How do I get the PCT change dynamically for all the columns, for AAL, AAN, ... and 100 more?
price['AABA_PCT_CHG'] = price.AABA.pct_change()
AABA AAL AAN AABA_PCT_CHG
0 16.120001 9.635592 18.836105 NaN
1 16.400000 8.363149 23.105881 0.017370
2 16.680000 8.460282 24.892321 0.017073
3 17.700001 8.829385 28.275263 0.061151
4 16.549999 8.839100 27.705627 -0.064972
5 15.040000 8.654548 27.754738 -0.091239
Apply pct_change on the whole dataframe:
In [424]: price.pct_change().add_suffix('_PCT_CHG')
Out[424]:
AABA_PCT_CHG AAL_PCT_CHG AAN_PCT_CHG
0 NaN NaN NaN
1 0.017370 -0.132057 0.226680
2 0.017073 0.011614 0.077315
3 0.061151 0.043628 0.135903
4 -0.064972 0.001100 -0.020146
5 -0.091239 -0.020879 0.001773
In [425]: price.join(price.pct_change().add_suffix('_PCT_CHG'))
Out[425]:
AABA AAL AAN AABA_PCT_CHG AAL_PCT_CHG AAN_PCT_CHG
0 16.120001 9.635592 18.836105 NaN NaN NaN
1 16.400000 8.363149 23.105881 0.017370 -0.132057 0.226680
2 16.680000 8.460282 24.892321 0.017073 0.011614 0.077315
3 17.700001 8.829385 28.275263 0.061151 0.043628 0.135903
4 16.549999 8.839100 27.705627 -0.064972 0.001100 -0.020146
5 15.040000 8.654548 27.754738 -0.091239 -0.020879 0.001773
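
If the frame also contains columns that should not be transformed, one option is to compute the change only for a selected subset of columns. A sketch; the tickers list here is hypothetical:
tickers = ['AABA', 'AAL', 'AAN']  # hypothetical subset of the ~100 tickers
price.join(price[tickers].pct_change().add_suffix('_PCT_CHG'))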

Calculate values without looping

I am attempting to do a Monte Carlo-esque projection using pandas on some stock prices. I used numpy to create some random correlated values for the percentage price change; however, I am struggling with how to use those values to create a 'running tally' of the actual asset price. So I have a DataFrame that looks like this:
abc xyz def
0 0.093889 0.113750 0.082923
1 -0.130293 -0.148742 -0.061890
2 0.062175 -0.005463 0.022963
3 -0.029041 -0.015918 0.006735
4 -0.048950 -0.010945 -0.034421
5 0.082868 0.080570 0.074637
6 0.048782 -0.030702 -0.003748
7 -0.027402 -0.065221 -0.054764
8 0.095154 0.063978 0.039480
9 0.059001 0.114566 0.056582
How can I create something like this, where abc_px = previous price * (1 + abc)? I know I could iterate over the rows, but I would rather not for performance reasons.
Something like, assuming the initial price on all of these was 100:
abc xyz def abc_px xyz_px def_px
0 0.093889 0.11375 0.082923 109.39 111.38 108.29
1 -0.130293 -0.148742 -0.06189 95.14 94.81 101.59
2 0.062175 -0.005463 0.022963 101.05 94.29 103.92
3 -0.029041 -0.015918 0.006735 98.12 92.79 104.62
4 -0.04895 -0.010945 -0.034421 93.31 91.77 101.02
5 0.082868 0.08057 0.074637 101.05 99.17 108.56
6 0.048782 -0.030702 -0.003748 105.98 96.12 108.15
7 -0.027402 -0.065221 -0.054764 103.07 89.85 102.23
8 0.095154 0.063978 0.03948 112.88 95.60 106.27
9 0.059001 0.114566 0.056582 119.54 106.56 112.28
Is that what you want?
In [131]: new = df.add_suffix('_px') + 1
In [132]: new
Out[132]:
abc_px xyz_px def_px
0 1.093889 1.113750 1.082923
1 0.869707 0.851258 0.938110
2 1.062175 0.994537 1.022963
3 0.970959 0.984082 1.006735
4 0.951050 0.989055 0.965579
5 1.082868 1.080570 1.074637
6 1.048782 0.969298 0.996252
7 0.972598 0.934779 0.945236
8 1.095154 1.063978 1.039480
9 1.059001 1.114566 1.056582
In [133]: df.join(new.cumprod() * 100)
Out[133]:
abc xyz def abc_px xyz_px def_px
0 0.093889 0.113750 0.082923 109.388900 111.375000 108.292300
1 -0.130293 -0.148742 -0.061890 95.136292 94.808860 101.590090
2 0.062175 -0.005463 0.022963 101.051391 94.290919 103.922903
3 -0.029041 -0.015918 0.006735 98.116758 92.789996 104.622824
4 -0.048950 -0.010945 -0.034421 93.313942 91.774410 101.021601
5 0.082868 0.080570 0.074637 101.046682 99.168674 108.561551
6 0.048782 -0.030702 -0.003748 105.975941 96.123997 108.154662
7 -0.027402 -0.065221 -0.054764 103.071989 89.854694 102.231680
8 0.095154 0.063978 0.039480 112.879701 95.603418 106.267787
9 0.059001 0.114566 0.056582 119.539716 106.556319 112.280631
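
If each asset has its own starting price rather than a flat 100, the same cumprod trick works with a per-column multiply. A sketch; the starting prices here are made up:
start = pd.Series({'abc': 100.0, 'xyz': 50.0, 'def': 75.0})  # hypothetical starts
prices = ((1 + df).cumprod() * start).add_suffix('_px')      # aligns on column names
df.join(prices)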

After groupby and sum, how to get the max value rows in `pandas.DataFrame`?

Here is the df (I updated it with real data):
>TIMESTAMP OLTPSOURCE RNR RQDRECORD
>20150425232836 0PU_IS_PS_44 REQU_51NHAJUV06IMMP16BVE572JM2 17020
>20150128165726 ZFI_DS41 REQU_50P1AABLYXE86KYE3O6EY390M 6925
>20150701144253 ZZZJB_TEXT REQU_52DV5FB812JCDXDVIV9P35DGM 2
>20150107201358 0EQUIPMENT_ATTR REQU_50EVHXSDOITYUQLP4L8UXOBT6 14205
>20150623215202 0CO_OM_CCA_1 REQU_528XSXYWTK6FSJXDQY2ROQQ4Q 0
>20150715144139 0HRPOSITION_TEXT REQU_52I9KQ1LN4ZWTNIP0N1R68NDY 25381
>20150625175157 0HR_PA_0 REQU_528ZS1RFN0N3Y3AEB48UDCUKQ 100020
>20150309153828 0HR_PA_0 REQU_51385K5F3AGGFVCGHU997QF9M 0
>20150626185531 0FI_AA_001 REQU_52BO3RJCOG4JGHEIIZMJP9V4A 0
>20150307222336 0FUNCT_LOC_ATTR REQU_513JJ6I6ER5ZVW5CAJMVSKAJQ 13889
>20150630163419 0WBS_ELEMT_ATTR REQU_52CUPVUFCY2DDOG6SPQ1XOYQ2 0
>20150424162226 6DB_V_DGP_EXPORTDATA REQU_51N1F5ZC8G3LW68E4TFXRGH9I 0
>20150617143720 ZRZMS_TEXT REQU_5268R1YE6G1U7HUK971LX1FPM 6
>20150405162213 0HR_PA_0 REQU_51FFR7T4YQ2F766PFY0W9WUDM 0
>20150202165933 ZFI_DS41 REQU_50QPTCF0VPGLBYM9MGFXMWHGM 6925
>20150102162140 0HR_PA_0 REQU_50CNUT7I9OXH2WSNLC4WTUZ7U 0
>20150417184916 0FI_AA_004 REQU_51KFWWT6PPTI5X44D3MWD7CYU 0
>20150416220451 0FUNCT_LOC_ATTR REQU_51JP3BDCD6TUOBL2GK9ZE35UU 13889
>20150205150633 ZHR_DS09 REQU_50RFRYRADMA9QXB1PW4PRF5XM 6667
>20150419230724 0PU_IS_PS_44 REQU_51LC5XX6VWEERAVHEFJ9K5A6I 22528
>and the relationships between the columns are:
>OLTPSOURCE--RNR:1>n
>RNR--RQDRECORD:1>N
and my requirement is:
sum the RQDRECORD by RNR;
get the max sum result for every OLTPSOURCE;
finally, draw a graph showing the max-sum result of every OLTPSOURCE over time.
Thanks everyone; let me explain my problem further:
if OLTPSOURCE : RNR : RQDRECORD = 1:1:1,
just sum RQDRECORD and return OLTPSOURCE with the summed result;
if OLTPSOURCE : RNR : RQDRECORD = 1:1:N,
just sum RQDRECORD and return OLTPSOURCE with the summed result;
if OLTPSOURCE : RNR : RQDRECORD = 1:N:(N or 1),
first sum RQDRECORD grouped by RNR, then find the max result per OLTPSOURCE and return every OLTPSOURCE with its max RQDRECORD.
So for the above sample data, I eventually want the result as follows
>TIMESTAMP OLTPSOURCE RNR RQDRECORD
>20150623215202 0CO_OM_CCA_1 REQU_528XSXYWTK6FSJXDQY2ROQQ4Q 0
>20150107201358 0EQUIPMENT_ATTR REQU_50EVHXSDOITYUQLP4L8UXOBT6 14205
>20150626185531 0FI_AA_001 REQU_52BO3RJCOG4JGHEIIZMJP9V4A 0
>20150417184916 0FI_AA_004 REQU_51KFWWT6PPTI5X44D3MWD7CYU 0
>20150416220451 0FUNCT_LOC_ATTR REQU_51JP3BDCD6TUOBL2GK9ZE35UU 13889
>20150625175157 0HR_PA_0 REQU_528ZS1RFN0N3Y3AEB48UDCUKQ 100020
>20150715144139 0HRPOSITION_TEXT REQU_52I9KQ1LN4ZWTNIP0N1R68NDY 25381
>20150419230724 0PU_IS_PS_44 REQU_51LC5XX6VWEERAVHEFJ9K5A6I 22528
>20150630163419 0WBS_ELEMT_ATTR REQU_52CUPVUFCY2DDOG6SPQ1XOYQ2 0
>20150424162226 6DB_V_DGP_EXPORTDATA REQU_51N1F5ZC8G3LW68E4TFXRGH9I 0
>20150202165933 ZFI_DS41 REQU_50QPTCF0VPGLBYM9MGFXMWHGM 6925
>20150205150633 ZHR_DS09 REQU_50RFRYRADMA9QXB1PW4PRF5XM 6667
>20150617143720 ZRZMS_TEXT REQU_5268R1YE6G1U7HUK971LX1FPM 6
>20150701144253 ZZZJB_TEXT REQU_52DV5FB812JCDXDVIV9P35DGM 2
Referring to EdChum's approach, I made some adjustments; the results are shown below. Because the amount of data is too big, I filtered with 'RQDRECORD > 100000'; in fact I would like to sort and then take the top 100, but without success:
[1]: http://i.imgur.com/FgfZaDY.jpg "result"
You can take the max over level 0 (the 'clsa' level) of the groupby result; this returns the max count for that level. However, this loses the 'clsb' column, so merge the max back onto the grouped result after calling reset_index on the grouped object; you can then reorder the resulting df columns using fancy indexing:
In [149]:
gp = df.groupby(['clsa','clsb']).sum()
# max per 'clsa' level (equivalent to the older gp.max(level=0))
result = gp.groupby(level=0).max().reset_index().merge(gp.reset_index())
result = result.loc[:, ['clsa','clsb','count']]
result
Out[149]:
clsa clsb count
0 a a1 9
1 b b2 8
2 c c2 10
import matplotlib.pyplot as plt

df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'], format='%Y%m%d%H%M%S')
# sum RQDRECORD per (OLTPSOURCE, RNR)
df_gb = df.groupby(['OLTPSOURCE', 'RNR'], as_index=False).aggregate(sum)
# keep one RNR per OLTPSOURCE and pull its timestamp back in
final = pd.merge(df[['TIMESTAMP', 'OLTPSOURCE', 'RNR']],
                 df_gb.groupby(['OLTPSOURCE'], as_index=False).first(),
                 on=['OLTPSOURCE', 'RNR'], how='right').sort_values('OLTPSOURCE')
final.plot(kind='bar')
plt.show()
print(final)
TIMESTAMP OLTPSOURCE RNR \
3 2015-06-23 21:52:02 0CO_OM_CCA_1 REQU_528XSXYWTK6FSJXDQY2ROQQ4Q
2 2015-01-07 20:13:58 0EQUIPMENT_ATTR REQU_50EVHXSDOITYUQLP4L8UXOBT6
5 2015-06-26 18:55:31 0FI_AA_001 REQU_52BO3RJCOG4JGHEIIZMJP9V4A
11 2015-04-17 18:49:16 0FI_AA_004 REQU_51KFWWT6PPTI5X44D3MWD7CYU
6 2015-03-07 22:23:36 0FUNCT_LOC_ATTR REQU_513JJ6I6ER5ZVW5CAJMVSKAJQ
4 2015-07-15 14:41:39 0HRPOSITION_TEXT REQU_52I9KQ1LN4ZWTNIP0N1R68NDY
10 2015-01-02 16:21:40 0HR_PA_0 REQU_50CNUT7I9OXH2WSNLC4WTUZ7U
13 2015-04-19 23:07:24 0PU_IS_PS_44 REQU_51LC5XX6VWEERAVHEFJ9K5A6I
7 2015-06-30 16:34:19 0WBS_ELEMT_ATTR REQU_52CUPVUFCY2DDOG6SPQ1XOYQ2
8 2015-04-24 16:22:26 6DB_V_DGP_EXPORTDATA REQU_51N1F5ZC8G3LW68E4TFXRGH9I
0 2015-01-28 16:57:26 ZFI_DS41 REQU_50P1AABLYXE86KYE3O6EY390M
12 2015-02-05 15:06:33 ZHR_DS09 REQU_50RFRYRADMA9QXB1PW4PRF5XM
9 2015-06-17 14:37:20 ZRZMS_TEXT REQU_5268R1YE6G1U7HUK971LX1FPM
1 2015-07-01 14:42:53 ZZZJB_TEXT REQU_52DV5FB812JCDXDVIV9P35DGM
RQDRECORD
3 0
2 14205
5 0
11 0
6 13889
4 25381
10 0
13 22528
7 0
8 0
0 6925
12 6667
9 6
1 2
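
Note that .first() keeps the first RNR per OLTPSOURCE in group order, not necessarily the one with the largest sum (compare the 0HR_PA_0 row above, with RQDRECORD 0, against the desired 100020). To pick the max-sum RNR per source instead, one option is to select rows by idxmax before merging. A sketch:
idx = df_gb.groupby('OLTPSOURCE')['RQDRECORD'].idxmax()   # row of the max sum per source
final = pd.merge(df[['TIMESTAMP', 'OLTPSOURCE', 'RNR']], df_gb.loc[idx],
                 on=['OLTPSOURCE', 'RNR'], how='right').sort_values('OLTPSOURCE')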
