Calculating current, min, max, mean monthly growth from pandas dataframe - python

I have a dataset similar to the one below:
product_ID month amount_sold
1 1 23
1 2 34
1 3 85
2 1 47
2 2 28
2 3 9
3 1 73
3 2 84
3 3 12
I want the output to be like this:
For example, for product 1:
-avg_monthly_growth is calculated by ((85-34)/34*100 + (34-23)/23*100)/2 = 98.91%
-lowest_monthly_growth is (34-23)/23*100) = 47.83%
-highest_monthly_growth is (85-34)/34*100) = 150%
-current_monthly_growth is the growth between the lastest two months (in this case, it's the growth from month 2 to month 3, as the month ranges from 1-3 for each product)
product_ID avg_monthly_growth lowest_monthly_growth highest_monthly_growth current_monthly_growth
1 98.91% 47.83% 150% 150%
2 ... ... ... ...
3 ... ... ... ...
I've tried df.loc[df.groupby('product_ID')['amount_sold'].idxmax(), :].reset_index() which gets me the max (and similarly the min), but I'm not too sure how to get the percentage growths.

You can use a pivot_table withh pct_change() on axis=1 , then create a dictionary with desired series and create a df:
m=df.pivot_table(index='product_ID',columns='month',values='amount_sold').pct_change(axis=1)
d={'avg_monthly_growth':m.mean(axis=1)*100,'lowest_monthly_growth':m.min(1)*100,
'highest_monthly_growth':m.max(1)*100,'current_monthly_growth':m.iloc[:,-1]*100}
final=pd.DataFrame(d)
print(final)
avg_monthly_growth lowest_monthly_growth highest_monthly_growth \
product_ID
1 98.913043 47.826087 150.000000
2 -54.141337 -67.857143 -40.425532
3 -35.322896 -85.714286 15.068493
current_monthly_growth
product_ID
1 150.000000
2 -67.857143
3 -85.714286

Related

fastest way to access dataframe cell by colums values?

I have the following dataframe :
time bk1_lvl0_id bk2_lvl0_id pr_ss order_upto_level initial_inventory leadtime1 leadtime2 adjusted_leadtime
0 2020 1000 3 16 18 17 3 0.100000 1
1 2020 10043 3 65 78 72 12 0.400000 1
2 2020 1005 3 0 1 1 9 0.300000 1
3 2020 1009 3 325 363 344 21 0.700000 1
4 2020 102 3 0 1 1 7 0.233333 1
I want a function to get the pr_ss for example for (bk1_lvl0_id=1000,bk2_lvl0_id=3).
that's the code i've tried but it takes time :
def get_safety_stock(df,bk1,bk2):
##a function that returns the safety stock for any given (bk1,bk2)
for index,row in df.iterrows():
if (row["bk1_lvl0_id"]==bk1) and (row["bk2_lvl0_id"]==bk2):
return int(row["pr_ss"])
break
If your dataframe has no duplicate values based on bk1_lvl0_id and bk2_lvl0_id, You can make function as follows:
def get_safety_stock(df,bk1,bk2):
return df.loc[df.bk1_lvl0_id.eq(bk1) & df.bk2_lvl0_id.eq(bk2), 'pr_ss'][0]
Note that its accessing the first value in the Series which shouldnt be an issue if there are no duplicates in data. If you want all of them, just remove the [0] from the end and it should give you the whole series. This can be called as follows:
get_safety_stock(df, 1000,3)
>>>16

Rearranging data from a table in python to correlate columns

I got a table that look like this:
code
year
month
Value A
Value B
1
2020
1
120
100
1
2020
2
130
90
1
2020
3
90
89
1
2020
4
67
65
...
...
...
...
...
100
2020
10
90
90
100
2020
11
115
100
100
2020
12
150
135
I would like to know if there's a way to rearrange the data to find the correlation between A and B for every distinct code.
What I'm thinking is, for example, getting an array for every code, like:
[(A1,A2,A3...,A12),(B1,B2,B3...,B12)]
where A and B is the values for the respective month, and then I could see the correlation between these two columns. Is there a way to make this dynamic?
IIUC, you don't need to re-arrange to get the correlation for each "code". Instead, try with groupby:
>>> df.groupby("code").apply(lambda x: x["Value A"].corr(x["Value B"]))
code
1 0.830163
100 0.977093
dtype: float64

Pandas: calculating mean value of multiple columns using datetime and Grouper removes columns or doesn't return correct Dataframe

As part of a larger task, I want to calculate the monthly mean values for each specific station. This is already difficult to do, but I am getting close.
The dataframe has many columns, but ultimately I only use the following information:
Date Value Station_Name
0 2006-01-03 18 2
1 2006-01-04 12 2
2 2006-01-05 11 2
3 2006-01-06 10 2
4 2006-01-09 22 2
... ... ...
3510 2006-12-23 47 45
3511 2006-12-24 46 45
3512 2006-12-26 35 45
3513 2006-12-27 35 45
3514 2006-12-30 28 45
I am running into two issues, using:
df.groupby(['Station_Name', pd.Grouper(freq='M')])['Value'].mean()
It results in something like:
Station_Name Date
2 2003-01-31 29.448387
2003-02-28 30.617857
2003-03-31 28.758065
2003-04-30 28.392593
2003-05-31 30.318519
...
45 2003-09-30 16.160000
2003-10-31 18.906452
2003-11-30 26.296667
2003-12-31 30.306667
2004-01-31 29.330000
Which I can't seem to use as a regular dataframe, and the datetime is messed up as it doesn't show the monthly mean but gives the last day back. Also the station name is a single index, and not for the whole column. Plus the mean value doesn't have a "column name" at all. This isn't a dataframe, but a pandas.core.series.Series. I can't convert this again because it's not correct, and using the .to_frame() method shows that it is still indeed a Dataframe. I don't get this part.
I found that in order to return a normal dataframe, to use
as_index = False
In the groupby method. But this results in the months not being shown:
df.groupby(['station_name', pd.Grouper(freq='M')], as_index = False)['Value'].mean()
Gives:
Station_Name Value
0 2 29.448387
1 2 30.617857
2 2 28.758065
3 2 28.392593
4 2 30.318519
... ... ...
142 45 16.160000
143 45 18.906452
144 45 26.296667
145 45 30.306667
146 45 29.330000
I can't just simply add the month later, as not every station has an observation in every month.
I've tried using other methods, such as
df.resample("M").mean()
But it doesn't seem possible to do this on multiple columns. It returns the mean value of everything.
Edit: This is ultimately what I would want.
Station_Name Date Value
0 2 2003-01 29.448387
1 2 2003-02 30.617857
2 2 2003-03 28.758065
3 2 2003-04 28.392593
4 2 2003-05 30.318519
... ... ...
142 45 2003-08 16.160000
143 45 2003-09 18.906452
144 45 2003-10 26.296667
145 45 2003-11 30.306667
146 45 2003-12 29.330000
ok , how baout this :
df = df.groupby(['Station_Name',df['Date'].dt.to_period('M')])['Value'].mean().reset_index()
outut:
>>
Station_Name Date Value
0 2 2006-01 14.6
1 45 2006-12 38.2

Get maximum relative difference between row-values and row-mean in new pandas dataframe column

I want to have an extra column with the maximum relative difference [-] of the row-values and the mean of these rows:
The df is filled with energy use data for several years.
The theoretical formula that should get me this is as follows:
df['max_rel_dif'] = MAX [ ABS(highest energy use – mean energy use), ABS(lowest energy use – mean energy use)] / mean energy use
Initial dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014
0 23 22631 21954.0 22314.0 22032 21843
1 43 27456 29654.0 28159.0 28654 2000
2 36 61200 NaN NaN 31895 1600
3 87 87621 86542.0 87542.0 88456 86961
4 90 58951 57486.0 2000.0 0 0
5 98 24587 25478.0 NaN 24896 25461
Desired dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014 max_rel_dif
0 23 22631 21954.0 22314.0 22032 21843 0.02149
1 43 27456 29654.0 28159.0 28654 2000 0.91373
2 36 61200 NaN NaN 31895 1600 0.94931
3 87 87621 86542.0 87542.0 88456 86961 0.01179
4 90 58951 57486.0 2000.0 0 0 1.48870
5 98 24587 25478.0 NaN 24896 25461 0.02065
tried code:
import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [23,43,36,87,90,98],
"y_2010": [22631,27456,61200,87621,58951,24587],
"y_2011": [21954,29654,np.nan,86542,57486,25478],
"y_2012": [22314,28159,np.nan,87542,2000,np.nan],
"y_2013": [22032,28654,31895,88456,0,24896,],
"y_2014": [21843,2000,1600,86961,0,25461]})

print(df)

a = df.loc[:, ['y_2010','y_2011','y_2012','y_2013', 'y_2014']]


# calculate mean
mean = a.mean(1)
# calculate max_rel_dif
df['max_rel_dif'] = (((df.max(axis=1).sub(mean)).abs(),(df.min(axis=1).sub(mean)).abs()).max()).div(mean)
# AttributeError: 'tuple' object has no attribute 'max'
-> I'm obviously doing the wrong thing with the tuple, I just don't know how to get the maximum values
from the tuples and divide them then by the mean in the proper Phytonic way
I feel like the whole function can be
s=df.filter(like='y')
s.sub(s.mean(1),axis=0).abs().max(1)/s.mean(1)
0 0.021494
1 0.913736
2 0.949311
3 0.011800
4 1.488707
5 0.020653
dtype: float64

after groupby and sum,how to get the max value rows in `pandas.DataFrame`?

here the df(i updated by real data ):
>TIMESTAMP OLTPSOURCE RNR RQDRECORD
>20150425232836 0PU_IS_PS_44 REQU_51NHAJUV06IMMP16BVE572JM2 17020
>20150128165726 ZFI_DS41 REQU_50P1AABLYXE86KYE3O6EY390M 6925
>20150701144253 ZZZJB_TEXT REQU_52DV5FB812JCDXDVIV9P35DGM 2
>20150107201358 0EQUIPMENT_ATTR REQU_50EVHXSDOITYUQLP4L8UXOBT6 14205
>20150623215202 0CO_OM_CCA_1 REQU_528XSXYWTK6FSJXDQY2ROQQ4Q 0
>20150715144139 0HRPOSITION_TEXT REQU_52I9KQ1LN4ZWTNIP0N1R68NDY 25381
>20150625175157 0HR_PA_0 REQU_528ZS1RFN0N3Y3AEB48UDCUKQ 100020
>20150309153828 0HR_PA_0 REQU_51385K5F3AGGFVCGHU997QF9M 0
>20150626185531 0FI_AA_001 REQU_52BO3RJCOG4JGHEIIZMJP9V4A 0
>20150307222336 0FUNCT_LOC_ATTR REQU_513JJ6I6ER5ZVW5CAJMVSKAJQ 13889
>20150630163419 0WBS_ELEMT_ATTR REQU_52CUPVUFCY2DDOG6SPQ1XOYQ2 0
>20150424162226 6DB_V_DGP_EXPORTDATA REQU_51N1F5ZC8G3LW68E4TFXRGH9I 0
>20150617143720 ZRZMS_TEXT REQU_5268R1YE6G1U7HUK971LX1FPM 6
>20150405162213 0HR_PA_0 REQU_51FFR7T4YQ2F766PFY0W9WUDM 0
>20150202165933 ZFI_DS41 REQU_50QPTCF0VPGLBYM9MGFXMWHGM 6925
>20150102162140 0HR_PA_0 REQU_50CNUT7I9OXH2WSNLC4WTUZ7U 0
>20150417184916 0FI_AA_004 REQU_51KFWWT6PPTI5X44D3MWD7CYU 0
>20150416220451 0FUNCT_LOC_ATTR REQU_51JP3BDCD6TUOBL2GK9ZE35UU 13889
>20150205150633 ZHR_DS09 REQU_50RFRYRADMA9QXB1PW4PRF5XM 6667
>20150419230724 0PU_IS_PS_44 REQU_51LC5XX6VWEERAVHEFJ9K5A6I 22528
>and the relationships between columns is
>OLTPSOURCE--RNR:1>n
>RNR--RQDRECORD:1>N
and my requirement is:
sum the RQDRECORD by RNR;
get the max sum result of every OLTPSOURCE;
Finally, I would draw a graph showing the results of all
sumed largest OLTPSOURCE by time
Thanks everyone, I further explain my problem:
if OLTPSOURCE:RNR:RQDRECORD= 1:1:1
just sum RQDRECORD,RETURN OLTPSOURCE AND SUM RESULT
if OLTPSOURCE:RNR:RQDRECORD= 1:1:N
just sum RQDRECORD,RETURN OLTPSOURCE AND SUM RESULT
if OLTPSOURCE:RNR:RQDRECORD= 1:N:(N OR 1)
sum RQDRECORD by RNR GROUP first,THEN Find the max result of one OLTPSOURCE,return all the OLTPSOURCE with the max RQDRECORD .
So for the above sample data, I eventually want the result as follows
>TIMESTAMP OLTPSOURCE RNR RQDRECORD
>20150623215202 0CO_OM_CCA_1 REQU_528XSXYWTK6FSJXDQY2ROQQ4Q 0
>20150107201358 0EQUIPMENT_ATTR REQU_50EVHXSDOITYUQLP4L8UXOBT6 14205
>20150626185531 0FI_AA_001 REQU_52BO3RJCOG4JGHEIIZMJP9V4A 0
>20150417184916 0FI_AA_004 REQU_51KFWWT6PPTI5X44D3MWD7CYU 0
>20150416220451 0FUNCT_LOC_ATTR REQU_51JP3BDCD6TUOBL2GK9ZE35UU 13889
>20150625175157 0HR_PA_0 REQU_528ZS1RFN0N3Y3AEB48UDCUKQ 100020
>20150715144139 0HRPOSITION_TEXT REQU_52I9KQ1LN4ZWTNIP0N1R68NDY 25381
>20150419230724 0PU_IS_PS_44 REQU_51LC5XX6VWEERAVHEFJ9K5A6I 22528
>20150630163419 0WBS_ELEMT_ATTR REQU_52CUPVUFCY2DDOG6SPQ1XOYQ2 0
>20150424162226 6DB_V_DGP_EXPORTDATA REQU_51N1F5ZC8G3LW68E4TFXRGH9I 0
>20150202165933 ZFI_DS41 REQU_50QPTCF0VPGLBYM9MGFXMWHGM 6925
>20150205150633 ZHR_DS09 REQU_50RFRYRADMA9QXB1PW4PRF5XM 6667
>20150617143720 ZRZMS_TEXT REQU_5268R1YE6G1U7HUK971LX1FPM 6
>20150701144253 ZZZJB_TEXT REQU_52DV5FB812JCDXDVIV9P35DGM 2
Referring to EdChum's approach, I made some adjustments, the results were as follows, because the amount of data is too big, I did "'RQDRECORD> 100000'" is set, in fact I would like to sort and then take the top 100, but not success
[1]: http://i.imgur.com/FgfZaDY.jpg "result"
You can take the groupby result, call max on this and pass param level=0 or level='clsa' if you prefer, this will return you the max count for that level. However this loses the 'clsb' column so what you can do is merge this back to your grouped result after calling reset_index on the grouped object, you can reorder the resulting df columns by using fancy indexing:
In [149]:
gp = df.groupby(['clsa','clsb']).sum()
result = gp.max(level=0).reset_index().merge(gp.reset_index())
result = result.ix[:,['clsa','clsb','count']]
result
Out[149]:
clsa clsb count
0 a a1 9
1 b b2 8
2 c c2 10
df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'], format='%Y%m%d%H%M%S')
df_gb = df.groupby(['OLTPSOURCE', 'RNR'], as_index=False).aggregate(sum)
final = pd.merge(df[['TIMESTAMP', 'OLTPSOURCE', 'RNR']], df_gb.groupby(['OLTPSOURCE'], as_index=False).first(), on=['OLTPSOURCE', 'RNR'], how='right').sort('OLTPSOURCE')
final.plot(kind='bar')
plt.show()
print final
TIMESTAMP OLTPSOURCE RNR \
3 2015-06-23 21:52:02 0CO_OM_CCA_1 REQU_528XSXYWTK6FSJXDQY2ROQQ4Q
2 2015-01-07 20:13:58 0EQUIPMENT_ATTR REQU_50EVHXSDOITYUQLP4L8UXOBT6
5 2015-06-26 18:55:31 0FI_AA_001 REQU_52BO3RJCOG4JGHEIIZMJP9V4A
11 2015-04-17 18:49:16 0FI_AA_004 REQU_51KFWWT6PPTI5X44D3MWD7CYU
6 2015-03-07 22:23:36 0FUNCT_LOC_ATTR REQU_513JJ6I6ER5ZVW5CAJMVSKAJQ
4 2015-07-15 14:41:39 0HRPOSITION_TEXT REQU_52I9KQ1LN4ZWTNIP0N1R68NDY
10 2015-01-02 16:21:40 0HR_PA_0 REQU_50CNUT7I9OXH2WSNLC4WTUZ7U
13 2015-04-19 23:07:24 0PU_IS_PS_44 REQU_51LC5XX6VWEERAVHEFJ9K5A6I
7 2015-06-30 16:34:19 0WBS_ELEMT_ATTR REQU_52CUPVUFCY2DDOG6SPQ1XOYQ2
8 2015-04-24 16:22:26 6DB_V_DGP_EXPORTDATA REQU_51N1F5ZC8G3LW68E4TFXRGH9I
0 2015-01-28 16:57:26 ZFI_DS41 REQU_50P1AABLYXE86KYE3O6EY390M
12 2015-02-05 15:06:33 ZHR_DS09 REQU_50RFRYRADMA9QXB1PW4PRF5XM
9 2015-06-17 14:37:20 ZRZMS_TEXT REQU_5268R1YE6G1U7HUK971LX1FPM
1 2015-07-01 14:42:53 ZZZJB_TEXT REQU_52DV5FB812JCDXDVIV9P35DGM
RQDRECORD
3 0
2 14205
5 0
11 0
6 13889
4 25381
10 0
13 22528
7 0
8 0
0 6925
12 6667
9 6
1 2

Categories

Resources