difference between two rows pandas - python

I have a dataframe as:
id|amount|date
20|-7|2017:12:25
20|-170|2017:12:26
20|7|2017:12:27
I want to subtract each row from another for the 'amount' column:
the output should be like:
id|amount|date|amount_diff
20|-7|2017:12:25|0
20|-170|2017:12:26|-177
20|7|2017:12:27|-163
I used the code:
df.sort_values(by='date',inplace=True)
df['amount_diff'] = df['invoice_amount'].diff()
and obtained the output as:
id|amount|date|amount_diff
20|-7|2017:12:25|163
20|-170|2017:12:26|-218
20|48|2017:12:27|0

IIUC you need:
df.sort_values(by='date',inplace=True)
df['amount_diff'] = df['amount'].add(df['amount'].shift()).fillna(0)
print (df)
id amount date amount_diff
0 20 -7 2017:12:25 0.0
1 20 -170 2017:12:26 -177.0
2 20 7 2017:12:27 -163.0
Because if you want to subtract, your solution should work:
df.sort_values(by='date',inplace=True)
df['amount_diff1'] = df['amount'].sub(df['amount'].shift()).fillna(0)
df['amount_diff2'] = df['amount'].diff().fillna(0)
print (df)
id amount date amount_diff1 amount_diff2
0 20 -7 2017:12:25 0.0 0.0
1 20 -170 2017:12:26 -163.0 -163.0
2 20 7 2017:12:27 177.0 177.0
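For reference, a minimal, self-contained sketch that reproduces the accepted output from the sample data above (assuming the date strings sort correctly as text):
import pandas as pd

# sample data from the question
df = pd.DataFrame({'id': [20, 20, 20],
                   'amount': [-7, -170, 7],
                   'date': ['2017:12:25', '2017:12:26', '2017:12:27']})

df.sort_values(by='date', inplace=True)
# add the previous row's amount to the current one; the first row has no predecessor, so fill with 0
df['amount_diff'] = df['amount'].add(df['amount'].shift()).fillna(0)
print(df)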

Related

How to calculate the expanding mean of all the columns across the DataFrame and add to DataFrame

I am trying to calculate the means of all previous rows for each column of the DataFrame and add the calculated mean column to the DataFrame.
I am using a set of NBA games data that contains 20+ features (columns) that I am trying to calculate the means for. An example of the dataset is below. (Note: "...." represents the rest of the feature columns.)
Team TeamPoints OpponentPoints.... TeamPoints_mean OpponentPoints_mean
ATL 102 109 .... nan nan
ATL 102 92 .... 102 109
ATL 92 94 .... 102 100.5
BOS 119 122 .... 98.67 98.33
BOS 103 96 .... 103.75 104.25
Example for calculating two of the columns:
dataset = pd.read_csv('nba.games.stats.csv')
df = dataset
df['TeamPoints_mean'] = (df.groupby('Team')['TeamPoints'].apply(lambda x: x.shift().expanding().mean()))
df['OpponentPoints_mean'] = (df.groupby('Team')['OpponentPoints'].apply(lambda x: x.shift().expanding().mean()))
Again, this code only calculates the mean and adds the column to the DataFrame one at a time. Is there a way to get the column means and add them to the DataFrame without doing it one at a time? A for loop? An example of what I am looking for is below.
Team TeamPoints OpponentPoints .... TeamPoints_mean OpponentPoints_mean ... ("..." = mean columns of the rest of the feature columns)
ATL 102 109 .... nan nan
ATL 102 92 .... 102 109
ATL 92 94 .... 102 100.5
BOS 119 122 .... 98.67 98.33
BOS 103 96 .... 103.75 104.25
Try this one:
(0) sample input:
>>> df
col1 col2 col3
0 1.490977 1.784433 0.852842
1 3.726663 2.845369 7.766797
2 0.042541 1.196383 6.568839
3 4.784911 0.444671 8.019933
4 3.831556 0.902672 0.198920
5 3.672763 2.236639 1.528215
6 0.792616 2.604049 0.373296
7 2.281992 2.563639 1.500008
8 4.096861 0.598854 4.934116
9 3.632607 1.502801 0.241920
Then processing:
(1) a side table to get all the means (I didn't find a cumulative mean function, so I went with cumsum + count):
>>> df_side=df.assign(col_temp=1).cumsum()
>>> df_side
col1 col2 col3 col_temp
0 1.490977 1.784433 0.852842 1.0
1 5.217640 4.629801 8.619638 2.0
2 5.260182 5.826184 15.188477 3.0
3 10.045093 6.270855 23.208410 4.0
4 13.876649 7.173527 23.407330 5.0
5 17.549412 9.410166 24.935545 6.0
6 18.342028 12.014215 25.308841 7.0
7 20.624021 14.577855 26.808849 8.0
8 24.720882 15.176708 31.742965 9.0
9 28.353489 16.679509 31.984885 10.0
>>> for el in df.columns:
... df_side["{}_mean".format(el)]=df_side[el]/df_side.col_temp
>>> df_side=df_side.drop([el for el in df.columns] + ["col_temp"], axis=1)
>>> df_side
col1_mean col2_mean col3_mean
0 1.490977 1.784433 0.852842
1 2.608820 2.314901 4.309819
2 1.753394 1.942061 5.062826
3 2.511273 1.567714 5.802103
4 2.775330 1.434705 4.681466
5 2.924902 1.568361 4.155924
6 2.620290 1.716316 3.615549
7 2.578003 1.822232 3.351106
8 2.746765 1.686301 3.526996
9 2.835349 1.667951 3.198489
(2) joining back, on index:
>>> df_final=df.join(df_side)
>>> df_final
col1 col2 col3 col1_mean col2_mean col3_mean
0 1.490977 1.784433 0.852842 1.490977 1.784433 0.852842
1 3.726663 2.845369 7.766797 2.608820 2.314901 4.309819
2 0.042541 1.196383 6.568839 1.753394 1.942061 5.062826
3 4.784911 0.444671 8.019933 2.511273 1.567714 5.802103
4 3.831556 0.902672 0.198920 2.775330 1.434705 4.681466
5 3.672763 2.236639 1.528215 2.924902 1.568361 4.155924
6 0.792616 2.604049 0.373296 2.620290 1.716316 3.615549
7 2.281992 2.563639 1.500008 2.578003 1.822232 3.351106
8 4.096861 0.598854 4.934116 2.746765 1.686301 3.526996
9 3.632607 1.502801 0.241920 2.835349 1.667951 3.198489
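As a side note, pandas' built-in expanding mean can likely replace the cumsum + count trick; a minimal sketch, assuming the same df as above:
# expanding().mean() gives the running mean of each column, equivalent to cumsum divided by a running count
df_means = df.join(df.expanding().mean().add_suffix('_mean'))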
I am trying to calculate the means of all previous rows for each column of the DataFrame
To get all of the columns, you can do:
df_means = df.join(df.cumsum() /
                   df.applymap(lambda x: 1).cumsum(),
                   rsuffix="_mean")
However, if Team is a column rather than the index, you'd want to get rid of it:
df_data = df.drop('Team', axis=1)
df_means = df.join(df_data.cumsum() /
                   df_data.applymap(lambda x: 1).cumsum(),
                   rsuffix="_mean")
You could also do
import numpy as np
df_data = df[[col for col in df.columns
              if np.issubdtype(df[col].dtype, np.number)]]
Or manually define a list of columns that you want to take the mean of, cols_for_mean, and then do
df_data = df[cols_for_mean]
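If the goal is specifically the mean of all previous rows per team, as in the question's shift().expanding().mean() code, a hedged sketch that applies it to every numeric column at once might look like this (the 'Team' column name is taken from the question; the rest is an assumption):
import numpy as np
import pandas as pd

# select only the numeric columns, then compute the per-team running mean of all *previous* rows
num_cols = df.select_dtypes(include=np.number).columns.tolist()
means = df.groupby('Team')[num_cols].transform(lambda s: s.shift().expanding().mean())
df = df.join(means.add_suffix('_mean'))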

Calculating current, min, max, mean monthly growth from pandas dataframe

I have a dataset similar to the one below:
product_ID month amount_sold
1 1 23
1 2 34
1 3 85
2 1 47
2 2 28
2 3 9
3 1 73
3 2 84
3 3 12
I want the output to be like this:
For example, for product 1:
-avg_monthly_growth is calculated by ((85-34)/34*100 + (34-23)/23*100)/2 = 98.91%
-lowest_monthly_growth is (34-23)/23*100) = 47.83%
-highest_monthly_growth is (85-34)/34*100) = 150%
-current_monthly_growth is the growth between the latest two months (in this case, it's the growth from month 2 to month 3, as the months range from 1-3 for each product)
product_ID avg_monthly_growth lowest_monthly_growth highest_monthly_growth current_monthly_growth
1 98.91% 47.83% 150% 150%
2 ... ... ... ...
3 ... ... ... ...
I've tried df.loc[df.groupby('product_ID')['amount_sold'].idxmax(), :].reset_index() which gets me the max (and similarly the min), but I'm not too sure how to get the percentage growths.
You can use a pivot_table with pct_change() on axis=1, then create a dictionary with the desired series and create a df:
m=df.pivot_table(index='product_ID',columns='month',values='amount_sold').pct_change(axis=1)
d={'avg_monthly_growth':m.mean(axis=1)*100,'lowest_monthly_growth':m.min(1)*100,
'highest_monthly_growth':m.max(1)*100,'current_monthly_growth':m.iloc[:,-1]*100}
final=pd.DataFrame(d)
print(final)
avg_monthly_growth lowest_monthly_growth highest_monthly_growth \
product_ID
1 98.913043 47.826087 150.000000
2 -54.141337 -67.857143 -40.425532
3 -35.322896 -85.714286 15.068493
current_monthly_growth
product_ID
1 150.000000
2 -67.857143
3 -85.714286
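If the percent signs from the desired output are needed, the numeric frame can be converted for display at the end; a small sketch reusing final from above:
# purely cosmetic: round to two decimals and append a percent sign to every cell
print(final.round(2).astype(str) + '%')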

How to insert rows with 0 data for missing quarters into a pandas dataframe?

I have a dataframe with specific Quota values for given quarters (YYYY-Qx format), and need to visualize them with some linecharts. However, some of the quarters are missing (as there was no Quota during those quarters).
Period Quota
2017-Q1 500
2017-Q3 600
2018-Q2 700
I want to add them (starting at 2017-Q1 until today, so 2019-Q2) to the dataframe with a default value of 0 in the Quota column. A desired output would be the following:
Period Quota
2017-Q1 500
2017-Q2 0
2017-Q3 600
2017-Q4 0
2018-Q1 0
2018-Q2 700
2018-Q3 0
2018-Q4 0
2019-Q1 0
2019-Q2 0
I tried
df['Period'] = pd.to_datetime(df['Period']).dt.to_period('Q')
And then resampling the df with 'Q' frequency, but I must be doing something wrong, as it doesn't help with anything.
Any help would be much appreciated.
Use:
df.index = pd.to_datetime(df['Period']).dt.to_period('Q')
end = pd.Period(pd.Timestamp.now(), freq='Q')
df = (df['Quota'].reindex(pd.period_range(df.index.min(), end), fill_value=0)
.rename_axis('Period')
.reset_index()
)
df['Period'] = df['Period'].dt.strftime('%Y-Q%q')
print (df)
Period Quota
0 2017-Q1 500
1 2017-Q2 0
2 2017-Q3 600
3 2017-Q4 0
4 2018-Q1 0
5 2018-Q2 700
6 2018-Q3 0
7 2018-Q4 0
8 2019-Q1 0
9 2019-Q2 0
An alternate solution based on a left join:
qtr = ['Q1', 'Q2', 'Q3', 'Q4']
finl = []
for i in range(2017, 2020):
    for j in qtr:
        finl.append(str(i) + '_' + j)
df1 = pd.DataFrame({'year_qtr': finl}).reset_index(drop=True)
df1.head(2)
original_value = ['2017_Q1', '2017_Q3', '2018_Q2']
df_original = pd.DataFrame({'year_qtr': original_value,
                            'value': [500, 600, 700]}).reset_index(drop=True)
final = pd.merge(df1, df_original, how='left',
                 left_on=['year_qtr'], right_on=['year_qtr'])
final.fillna(0)
Output
year_qtr value
0 2017_Q1 500.0
1 2017_Q2 0.0
2 2017_Q3 600.0
3 2017_Q4 0.0
4 2018_Q1 0.0
5 2018_Q2 700.0
6 2018_Q3 0.0
7 2018_Q4 0.0
8 2019_Q1 0.0
9 2019_Q2 0.0
10 2019_Q3 0.0
11 2019_Q4 0.0
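Note that the left-join version generates quarters through 2019_Q4, while the desired output stops at the current quarter (2019-Q2); a hedged sketch of one way to trim df1 before the merge:
# keep only quarters up to "today"; year_qtr strings such as '2017_Q1' are parsed into Periods
current = pd.Period(pd.Timestamp.now(), freq='Q')
periods = df1['year_qtr'].str.replace('_', '-').apply(pd.Period)
df1 = df1[periods <= current]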

after groupby and sum, how to get the max value rows in `pandas.DataFrame`?

Here is the df (I updated it with real data):
>TIMESTAMP OLTPSOURCE RNR RQDRECORD
>20150425232836 0PU_IS_PS_44 REQU_51NHAJUV06IMMP16BVE572JM2 17020
>20150128165726 ZFI_DS41 REQU_50P1AABLYXE86KYE3O6EY390M 6925
>20150701144253 ZZZJB_TEXT REQU_52DV5FB812JCDXDVIV9P35DGM 2
>20150107201358 0EQUIPMENT_ATTR REQU_50EVHXSDOITYUQLP4L8UXOBT6 14205
>20150623215202 0CO_OM_CCA_1 REQU_528XSXYWTK6FSJXDQY2ROQQ4Q 0
>20150715144139 0HRPOSITION_TEXT REQU_52I9KQ1LN4ZWTNIP0N1R68NDY 25381
>20150625175157 0HR_PA_0 REQU_528ZS1RFN0N3Y3AEB48UDCUKQ 100020
>20150309153828 0HR_PA_0 REQU_51385K5F3AGGFVCGHU997QF9M 0
>20150626185531 0FI_AA_001 REQU_52BO3RJCOG4JGHEIIZMJP9V4A 0
>20150307222336 0FUNCT_LOC_ATTR REQU_513JJ6I6ER5ZVW5CAJMVSKAJQ 13889
>20150630163419 0WBS_ELEMT_ATTR REQU_52CUPVUFCY2DDOG6SPQ1XOYQ2 0
>20150424162226 6DB_V_DGP_EXPORTDATA REQU_51N1F5ZC8G3LW68E4TFXRGH9I 0
>20150617143720 ZRZMS_TEXT REQU_5268R1YE6G1U7HUK971LX1FPM 6
>20150405162213 0HR_PA_0 REQU_51FFR7T4YQ2F766PFY0W9WUDM 0
>20150202165933 ZFI_DS41 REQU_50QPTCF0VPGLBYM9MGFXMWHGM 6925
>20150102162140 0HR_PA_0 REQU_50CNUT7I9OXH2WSNLC4WTUZ7U 0
>20150417184916 0FI_AA_004 REQU_51KFWWT6PPTI5X44D3MWD7CYU 0
>20150416220451 0FUNCT_LOC_ATTR REQU_51JP3BDCD6TUOBL2GK9ZE35UU 13889
>20150205150633 ZHR_DS09 REQU_50RFRYRADMA9QXB1PW4PRF5XM 6667
>20150419230724 0PU_IS_PS_44 REQU_51LC5XX6VWEERAVHEFJ9K5A6I 22528
>and the relationships between the columns are:
>OLTPSOURCE--RNR:1>n
>RNR--RQDRECORD:1>N
and my requirement is:
sum the RQDRECORD by RNR;
get the max sum result of every OLTPSOURCE;
Finally, I would like to draw a graph showing the largest summed results of every OLTPSOURCE over time.
Thanks everyone, let me further explain my problem:
if OLTPSOURCE:RNR:RQDRECORD = 1:1:1,
just sum RQDRECORD and return OLTPSOURCE and the sum result;
if OLTPSOURCE:RNR:RQDRECORD = 1:1:N,
just sum RQDRECORD and return OLTPSOURCE and the sum result;
if OLTPSOURCE:RNR:RQDRECORD = 1:N:(N or 1),
sum RQDRECORD by RNR group first, then find the max result per OLTPSOURCE and return all the OLTPSOURCE rows with the max RQDRECORD.
So for the above sample data, I eventually want the result as follows
>TIMESTAMP OLTPSOURCE RNR RQDRECORD
>20150623215202 0CO_OM_CCA_1 REQU_528XSXYWTK6FSJXDQY2ROQQ4Q 0
>20150107201358 0EQUIPMENT_ATTR REQU_50EVHXSDOITYUQLP4L8UXOBT6 14205
>20150626185531 0FI_AA_001 REQU_52BO3RJCOG4JGHEIIZMJP9V4A 0
>20150417184916 0FI_AA_004 REQU_51KFWWT6PPTI5X44D3MWD7CYU 0
>20150416220451 0FUNCT_LOC_ATTR REQU_51JP3BDCD6TUOBL2GK9ZE35UU 13889
>20150625175157 0HR_PA_0 REQU_528ZS1RFN0N3Y3AEB48UDCUKQ 100020
>20150715144139 0HRPOSITION_TEXT REQU_52I9KQ1LN4ZWTNIP0N1R68NDY 25381
>20150419230724 0PU_IS_PS_44 REQU_51LC5XX6VWEERAVHEFJ9K5A6I 22528
>20150630163419 0WBS_ELEMT_ATTR REQU_52CUPVUFCY2DDOG6SPQ1XOYQ2 0
>20150424162226 6DB_V_DGP_EXPORTDATA REQU_51N1F5ZC8G3LW68E4TFXRGH9I 0
>20150202165933 ZFI_DS41 REQU_50QPTCF0VPGLBYM9MGFXMWHGM 6925
>20150205150633 ZHR_DS09 REQU_50RFRYRADMA9QXB1PW4PRF5XM 6667
>20150617143720 ZRZMS_TEXT REQU_5268R1YE6G1U7HUK971LX1FPM 6
>20150701144253 ZZZJB_TEXT REQU_52DV5FB812JCDXDVIV9P35DGM 2
Referring to EdChum's approach, I made some adjustments; the results are shown below. Because the amount of data is too big, I applied a 'RQDRECORD > 100000' filter; in fact I would like to sort and then take the top 100, but without success.
Result screenshot: http://i.imgur.com/FgfZaDY.jpg
You can take the groupby result and call max on it, passing param level=0 (or level='clsa' if you prefer); this will return the max count for that level. However, this loses the 'clsb' column, so what you can do is merge this back to your grouped result after calling reset_index on the grouped object. You can then reorder the resulting df columns by using fancy indexing:
In [149]:
gp = df.groupby(['clsa','clsb']).sum()
result = gp.max(level=0).reset_index().merge(gp.reset_index())
result = result.loc[:,['clsa','clsb','count']]
result
Out[149]:
clsa clsb count
0 a a1 9
1 b b2 8
2 c c2 10
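Translated to the question's column names, the same pattern can also be written with idxmax instead of the max + merge (a sketch; clsa corresponds to OLTPSOURCE, clsb to RNR and count to RQDRECORD):
# sum per (OLTPSOURCE, RNR), then keep the row with the largest sum within each OLTPSOURCE
gp = df.groupby(['OLTPSOURCE', 'RNR'])['RQDRECORD'].sum().reset_index()
result = gp.loc[gp.groupby('OLTPSOURCE')['RQDRECORD'].idxmax()]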
import pandas as pd
import matplotlib.pyplot as plt

df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'], format='%Y%m%d%H%M%S')
df_gb = df.groupby(['OLTPSOURCE', 'RNR'], as_index=False).aggregate(sum)
final = pd.merge(df[['TIMESTAMP', 'OLTPSOURCE', 'RNR']], df_gb.groupby(['OLTPSOURCE'], as_index=False).first(), on=['OLTPSOURCE', 'RNR'], how='right').sort_values('OLTPSOURCE')
final.plot(kind='bar')
plt.show()
print(final)
TIMESTAMP OLTPSOURCE RNR \
3 2015-06-23 21:52:02 0CO_OM_CCA_1 REQU_528XSXYWTK6FSJXDQY2ROQQ4Q
2 2015-01-07 20:13:58 0EQUIPMENT_ATTR REQU_50EVHXSDOITYUQLP4L8UXOBT6
5 2015-06-26 18:55:31 0FI_AA_001 REQU_52BO3RJCOG4JGHEIIZMJP9V4A
11 2015-04-17 18:49:16 0FI_AA_004 REQU_51KFWWT6PPTI5X44D3MWD7CYU
6 2015-03-07 22:23:36 0FUNCT_LOC_ATTR REQU_513JJ6I6ER5ZVW5CAJMVSKAJQ
4 2015-07-15 14:41:39 0HRPOSITION_TEXT REQU_52I9KQ1LN4ZWTNIP0N1R68NDY
10 2015-01-02 16:21:40 0HR_PA_0 REQU_50CNUT7I9OXH2WSNLC4WTUZ7U
13 2015-04-19 23:07:24 0PU_IS_PS_44 REQU_51LC5XX6VWEERAVHEFJ9K5A6I
7 2015-06-30 16:34:19 0WBS_ELEMT_ATTR REQU_52CUPVUFCY2DDOG6SPQ1XOYQ2
8 2015-04-24 16:22:26 6DB_V_DGP_EXPORTDATA REQU_51N1F5ZC8G3LW68E4TFXRGH9I
0 2015-01-28 16:57:26 ZFI_DS41 REQU_50P1AABLYXE86KYE3O6EY390M
12 2015-02-05 15:06:33 ZHR_DS09 REQU_50RFRYRADMA9QXB1PW4PRF5XM
9 2015-06-17 14:37:20 ZRZMS_TEXT REQU_5268R1YE6G1U7HUK971LX1FPM
1 2015-07-01 14:42:53 ZZZJB_TEXT REQU_52DV5FB812JCDXDVIV9P35DGM
RQDRECORD
3 0
2 14205
5 0
11 0
6 13889
4 25381
10 0
13 22528
7 0
8 0
0 6925
12 6667
9 6
1 2
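Regarding the "sort and then take the top 100" part that did not work, a hedged sketch on top of the final frame from this answer:
# sort by the summed record count, largest first, and keep the first 100 rows
top100 = final.sort_values('RQDRECORD', ascending=False).head(100)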

pandas subtracting two grouped dataframes of different size

I have two dataframes:
my stock solutions (df1):
pH salt_conc
5.5 0 23596.0
200 19167.0
400 17052.5
6.0 0 37008.5
200 27652.0
400 30385.5
6.5 0 43752.5
200 41146.0
400 39965.0
and my measurements after I did something (df2):
pH salt_conc id
5.5 0 8 20953.0
11 24858.0
200 3 20022.5
400 13 17691.0
20 18774.0
6.0 0 14 38639.0
200 1 37223.5
2 36597.0
7 37039.0
10 37088.5
15 35968.5
16 36344.5
17 34894.0
18 36388.5
400 9 33386.0
6.5 0 4 41401.5
12 44933.5
200 5 43074.5
400 6 42210.5
19 41332.5
I would like to normalize each measurement in the second dataframe (df2) with its corresponding stock solution from which I took the sample.
Any suggestions?
Figured it out with the help of this post:
SO: Binary operation broadcasting across multiindex
I had to reset the index of both grouped dataframes and set it again.
df_initial = df_initial.reset_index().set_index(['pH','salt_conc'])
df_second = df_second.reset_index().set_index(['pH','salt_conc'])
Now I can do any calculation I want to do.
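For completeness, a sketch of the normalization itself once both frames share the ['pH', 'salt_conc'] index (the 'value' column name is an assumption, since the value columns shown above are unnamed):
# division aligns on the shared MultiIndex, broadcasting each stock value
# across all measurements taken at the same pH and salt_conc
df_second['normalized'] = df_second['value'] / df_initial['value']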
