Categorize the highest value output among different columns - python

In my dataframe I want to find the highest value among columns A, B and C, and then record which of those columns held the highest value in a new column of my output dataframe. I would also like a special condition: if all the values in a row are negative, the output should be N.A.
input df:
A B C
Date
2020-01-05 3.57 5.29 6.23
2020-01-04 4.98 9.64 7.58
2020-01-03 3.79 5.25 6.26
2020-01-02 3.95 5.65 6.61
2020-01-01 -3.10 -7.20 -8.16
output df:
A B C HIGHEST_CAT
Date
2020-01-05 3.57 5.29 6.23 C
2020-01-04 4.98 9.64 7.58 B
2020-01-03 3.79 5.25 6.26 C
2020-01-02 3.95 5.65 6.61 C
2020-01-01 -3.10 -7.20 -8.16 N.A.
How could I achieve this output?
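For reference, a minimal sketch (assuming pandas is imported) to rebuild the input frame so the answers below can be tested directly:
import pandas as pd

df = pd.DataFrame(
    {'A': [3.57, 4.98, 3.79, 3.95, -3.10],
     'B': [5.29, 9.64, 5.25, 5.65, -7.20],
     'C': [6.23, 7.58, 6.26, 6.61, -8.16]},
    index=pd.to_datetime(['2020-01-05', '2020-01-04', '2020-01-03', '2020-01-02', '2020-01-01']),
)
df.index.name = 'Date'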

Use DataFrame.idxmax with a condition that tests whether all values are below 0, built from DataFrame.lt and DataFrame.all, inside numpy.where:
import numpy as np
df['HIGHEST_CAT'] = np.where(df.lt(0).all(axis=1), np.nan, df.idxmax(axis=1))
Or use Series.mask, whose replacement value defaults to np.nan, so it does not need to be specified:
df['HIGHEST_CAT'] = df.idxmax(axis=1).mask(df.lt(0).all(axis=1))
Or with DataFrame.loc (note: this variant assigns a category only when all values are positive, so mixed-sign rows get NaN, unlike the first two options):
df.loc[df.gt(0).all(axis=1), 'HIGHEST_CAT'] = df.idxmax(axis=1)
print (df)
A B C HIGHEST_CAT
Date
2020-01-05 3.57 5.29 6.23 C
2020-01-04 4.98 9.64 7.58 B
2020-01-03 3.79 5.25 6.26 C
2020-01-02 3.95 5.65 6.61 C
2020-01-01 -3.10 -7.20 -8.16 NaN

Use df.where:
In [375]: df['HIGHEST_CAT'] = df.idxmax(axis=1).where(df.gt(0).all(axis=1))
In [376]: df
Out[376]:
A B C HIGHEST_CAT
Date
2020-01-05 3.57 5.29 6.23 C
2020-01-04 4.98 9.64 7.58 B
2020-01-03 3.79 5.25 6.26 C
2020-01-02 3.95 5.65 6.61 C
2020-01-01 -3.10 -7.20 -8.16 NaN

How to groupby in pandas and return dataframe instead of series?

print(df.sample(50)):
match_datetime country league home_team away_team home_odds draw_odds away_odds run_time home_score away_score
72170 2021-10-17 12:30:00 Ukraine Persha Liga Alliance Uzhhorod 1.22 5.62 9.71 2021-10-17 09:22:20.212731 NaN NaN
100398 2021-11-02 14:35:00 Saudi Arabia Division 1 Al Qadisiya Bisha 1.61 3.61 4.94 2021-11-02 09:13:18.768604 2.0 1.0
33929 2021-09-11 23:00:00 Panama LPF Veraguas Plaza Amador 2.75 2.75 2.71 2021-09-10 23:47:54.682982 1.0 1.0
12328 2021-08-15 15:30:00 Poland Ekstraklasa Slask Wroclaw Leczna 1.74 3.74 4.59 2021-08-14 22:44:26.136608 0.0 0.0
81500 2021-10-24 13:00:00 Italy Serie D - Group A Caronnese Saluzzo 1.69 3.60 4.28 2021-10-23 13:37:16.920175 2.0 2.0
143370 2021-12-05 14:00:00 Poland Division 1 Chrobry Glogow Widzew Lodz 3.36 3.17 2.15 2021-11-30 17:40:24.833519 0.0 0.0
175061 2022-01-08 18:00:00 Spain Primera RFEF - Group 1 R. Union Extremadura UD 1.26 4.40 18.00 2022-01-08 17:00:46.662761 0.0 1.0
21293 2021-08-29 16:00:00 Italy Serie B Cittadella Crotone 2.32 3.11 3.31 2021-08-26 18:04:46.221393 4.0 2.0
97427 2021-11-01 17:00:00 Israel Leumit League M. Nazareth Beitar Tel Aviv 1.92 3.26 3.75 2021-10-30 09:40:08.966330 4.0 2.0
177665 2022-01-13 12:30:00 Egypt Division 2 - Group C Said El Mahalla Al Magd 4.12 3.08 1.94 2022-01-12 17:53:33.570126 0.0 0.0
69451 2021-10-17 05:00:00 South Korea K League 1 Gangwon Gwangju FC 2.06 3.38 3.65 2021-10-15 09:55:54.578112 NaN NaN
4742 2021-08-10 20:30:00 Peru Liga 2 Deportivo Coopsol Grau 3.14 3.49 2.06 2021-08-10 18:14:01.996860 0.0 2.0
22266 2021-08-29 13:00:00 France Ligue 1 Angers Rennes 2.93 3.27 2.56 2021-08-27 12:26:34.904374 2.0 0.0
46412 2021-09-26 04:00:00 Japan J2 League Okayama Blaublitz 2.24 2.90 3.63 2021-09-23 09:08:26.979783 1.0 1.0
133207 2021-11-27 21:15:00 Bolivia Division Profesional Palmaflor Blooming 1.51 4.05 5.10 2021-11-25 18:22:28.275844 3.0 0.0
140825 2021-11-28 11:00:00 Spain Tercera RFEF - Group 6 Valencia B Torrellano 1.58 3.56 5.26 2021-11-28 19:54:40.066637 2.0 0.0
226985 2022-03-04 00:30:00 Argentina Copa de la Liga Profesional Central Cordoba Rosario Central 2.36 3.26 2.86 2022-03-02 17:23:10.014424 0.0 1.0
137226 2021-11-28 12:45:00 Greece Super League 2 Apollon Pontou PAOK B 3.37 3.25 2.01 2021-11-27 15:13:05.937815 0.0 3.0
182756 2022-01-22 10:30:00 Turkey 1. Lig Umraniyespor Menemenspor 1.40 4.39 7.07 2022-01-19 17:25:27.128331 2.0 1.0
89895 2021-10-28 16:45:00 Netherlands KNVB Beker Ajax Cambuur 9.10 5.55 1.26 2021-10-27 07:46:56.253996 0.0 5.0
227595 2022-03-06 17:00:00 Israel Ligat ha'Al Ashdod Maccabi Petah Tikva 2.30 3.21 3.05 2022-03-02 17:23:10.014424 NaN NaN
57568 2021-10-02 13:00:00 Estonia Meistriliiga Kalju Legion 1.58 4.10 4.84 2021-10-02 10:55:35.287359 2.0 2.0
227035 2022-03-04 19:00:00 Denmark Superliga FC Copenhagen Randers FC 1.70 3.84 5.06 2022-03-02 17:23:10.014424 NaN NaN
108668 2021-11-07 13:30:00 Germany Oberliga Mittelrhein Duren Freialdenhoven 1.35 5.20 6.35 2021-11-06 17:37:37.629603 2.0 0.0
86270 2021-10-25 18:00:00 Belgium Pro League U21 Lommel SK U21 Lierse K. U21 3.23 3.84 1.92 2021-10-26 01:22:31.111441 0.0 0.0
89437 2021-11-01 02:10:00 Colombia Primera A America De Cali Petrolera 1.86 2.92 4.60 2021-10-27 07:41:24.427246 NaN NaN
13986 2021-08-21 13:00:00 France Ligue 2 Dijon Toulouse 3.92 3.51 1.94 2021-08-16 13:22:02.749887 2.0 4.0
105179 2021-11-06 15:00:00 England NPL Premier Division Atherton South Shields 3.90 3.42 1.82 2021-11-05 10:01:28.567328 1.0 1.0
142821 2021-12-01 12:30:00 Bulgaria Vtora liga Marek Septemvri Simitli 1.79 3.38 4.35 2021-11-30 17:40:24.833519 2.0 2.0
45866 2021-09-24 00:30:00 Venezuela Primera Division Dep. Tachira Portuguesa 1.96 3.60 3.22 2021-09-23 09:08:26.979783 4.0 1.0
76100 2021-10-22 16:30:00 Denmark 1st Division Hvidovre IF Koge 1.91 3.56 3.81 2021-10-21 08:43:12.445245 NaN NaN
115896 2021-11-14 16:00:00 Spain Tercera RFEF - Group 6 Olimpic Xativa Torrellano 2.78 2.89 2.39 2021-11-13 12:21:45.955738 1.0 0.0
156159 2021-12-12 16:00:00 Spain Segunda RFEF - Group 1 Marino de Luanco Coruxo FC 2.19 3.27 3.07 2021-12-10 09:26:45.001977 0.0 0.0
18240 2021-08-21 12:00:00 Germany Regionalliga West Rodinghausen Fortuna Koln 3.25 3.60 2.00 2021-08-21 03:30:43.193978 NaN NaN
184913 2022-01-22 10:00:00 World Club Friendly Zilina B Trinec 3.56 4.14 1.78 2022-01-22 16:44:32.650325 0.0 3.0
16782 2021-08-22 23:05:00 Colombia Primera A Petrolera Dep. Cali 3.01 3.00 2.44 2021-08-19 18:24:24.966505 2.0 3.0
63847 2021-10-10 09:30:00 Spain Tercera RFEF - Group 7 Carabanchel RSD Alcala 4.39 3.42 1.75 2021-10-09 12:03:50.720013 NaN NaN
7254 2021-08-12 16:45:00 Europe Europa Conference League Hammarby Cukaricki 1.72 3.87 4.13 2021-08-11 23:48:31.958394 NaN NaN
82727 2021-10-24 14:00:00 Lithuania I Lyga Zalgiris 2 Neptunas 1.76 3.78 3.35 2021-10-24 12:02:06.306279 1.0 3.0
43074 2021-09-22 18:00:00 Ukraine Super Cup Shakhtar Donetsk Dyn. Kyiv 2.57 3.49 2.59 2021-09-19 09:39:56.624504 NaN NaN
65187 2021-10-11 18:45:00 World World Cup Norway Montenegro 1.56 4.17 6.28 2021-10-11 10:56:09.973470 NaN NaN
120993 2021-11-18 00:00:00 USA NISA Maryland Bobcats California Utd. 2.76 3.23 2.39 2021-11-17 20:36:26.562731 1.0 1.0
201469 2022-02-12 15:00:00 England League One AFC Wimbledon Sunderland 3.30 3.48 2.17 2022-02-10 17:47:36.501159 1.0 1.0
142180 2021-12-01 19:45:00 Scotland Premiership St. Mirren Ross County 2.06 3.25 3.85 2021-11-29 18:28:22.249662 0.0 0.0
4681 2021-08-10 18:30:00 Europe Champions League Young Boys CFR Cluj 1.48 4.29 6.92 2021-08-10 18:14:01.996860 3.0 1.0
67321 2021-10-17 13:00:00 Spain LaLiga Rayo Vallecano Elche 1.78 3.64 4.99 2021-10-13 11:22:34.979378 NaN NaN
27499 2021-09-04 14:00:00 Iceland Inkasso-deildin Kordrengir Fjolnir 2.18 3.66 2.82 2021-09-02 23:28:49.414126 1.0 4.0
48962 2021-09-25 21:00:00 Mexico Liga Premier Serie B Uruapan Lobos Huerta 1.83 3.69 3.70 2021-09-25 13:02:58.238466 NaN NaN
65636 2021-10-16 17:00:00 Switzerland Super League Young Boys Luzern 1.26 6.04 9.43 2021-10-11 10:56:09.973470 NaN NaN
17333 2021-08-21 14:00:00 Finland Kakkonen Group A Atlantis Kiffen 1.57 4.29 4.42 2021-08-20 12:41:03.159846 1.0 1.0
I am trying to get the latest 2 match_datetime values for every run_time and then filter (join) df to keep all the relevant rows, as below:
df['match_datetime'] = pd.to_datetime(df['match_datetime'])
s = (df['match_datetime'].dt.normalize()
       .groupby([df['run_time']])
       .value_counts()
       .groupby(level=0)
       .head(2))
print(s)
run_time match_datetime
2021-08-07 00:04:36.326391 2021-08-07 255
2021-08-06 188
2021-08-07 10:50:34.574040 2021-08-07 649
2021-08-08 277
2021-08-07 16:56:22.322338 2021-08-07 712
This returns a series while I want a DataFrame so I can merge.
To do this:
df_n = df.reset_index().merge(s, how="left",
                              left_on=["match_datetime", "run_time"],
                              right_on=["match_datetime", "run_time"])
I am sure there is a better way to write s, but I am unsure how to do it correctly.
If I understand correctly, you would like to filter the dataframe to retain, for each run_time, the last two rows (or up to two rows) by match_datetime.
Simplified answer
This can be done relatively easily without any join, using GroupBy.tail(). (Note, my original answer was using GroupBy.rank(), but this is simpler, although slower):
out = df.sort_values(
    ['run_time', 'match_datetime']
).groupby('run_time').tail(2)
Minimal example
import numpy as np
import pandas as pd

np.random.seed(0)
n = 10
rt = np.random.choice(pd.date_range('2022-01-01', periods=n//2, freq='D'), n)
df = pd.DataFrame({
    'run_time': rt,
    'match_datetime': rt - pd.to_timedelta(np.random.uniform(size=n), unit='days'),
})
df['match_datetime'] = df['match_datetime'].dt.round('h')
Then:
out = df.sort_values(
    ['run_time', 'match_datetime']
).groupby('run_time').tail(2)
>>> out
run_time match_datetime
9 2022-01-01 2021-12-31 04:00:00
1 2022-01-01 2021-12-31 15:00:00
5 2022-01-02 2022-01-01 02:00:00
7 2022-01-03 2022-01-02 22:00:00
3 2022-01-04 2022-01-03 11:00:00
6 2022-01-04 2022-01-03 22:00:00
0 2022-01-05 2022-01-04 01:00:00
8 2022-01-05 2022-01-05 00:00:00
On the OP's (extended) data
The output is quite verbose. Here is a sample:
>>> out['run_time match_datetime country draw_odds'.split()].head()
run_time match_datetime country draw_odds
4681 2021-08-10 18:14:01.996860 2021-08-10 18:30:00 Europe 4.29
4742 2021-08-10 18:14:01.996860 2021-08-10 20:30:00 Peru 3.49
7254 2021-08-11 23:48:31.958394 2021-08-12 16:45:00 Europe 3.87
12328 2021-08-14 22:44:26.136608 2021-08-15 15:30:00 Poland 3.74
13986 2021-08-16 13:22:02.749887 2021-08-21 13:00:00 France 3.51
Performance
For several million rows the timing difference becomes significant, and using rank is faster. Even faster, you can avoid sorting on run_time (the result is the same, but the rows come out in a different order):
np.random.seed(0)
n = 1_000_000
rt = np.random.choice(pd.date_range('2022-01-01', periods=n//2, freq='min'), n)
df = pd.DataFrame({
    'run_time': rt,
    'match_datetime': rt - pd.to_timedelta(np.random.uniform(size=n), unit='s'),
})
%timeit df.sort_values(['run_time', 'match_datetime']).groupby('run_time').tail(2)
# 981 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.loc[df.groupby('run_time')['match_datetime'].rank(method='first', ascending=False) <= 2]
# 355 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.sort_values(['match_datetime']).groupby('run_time').tail(2)
# 258 ms ± 846 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Check solutions:
a = df.sort_values(['run_time', 'match_datetime']).groupby('run_time').tail(2)
b = df.loc[df.groupby('run_time')['match_datetime'].rank(method='first', ascending=False) <= 2]
c = df.sort_values(['match_datetime']).groupby('run_time').tail(2)
by = ['run_time', 'match_datetime']
>>> a.sort_values(by).equals(b.sort_values(by))
True
>>> b.sort_values(by).equals(c.sort_values(by))
True
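As a side note on the question's original attempt: if the counts Series s is still wanted as a DataFrame for merging, one possible sketch is below (the column names n_matches and match_day are only illustrative, and since s counts normalized days, the merge key on the df side has to be normalized too):
s_df = s.rename('n_matches').reset_index()   # columns: run_time, match_datetime (a whole day), n_matches
df_n = (df.assign(match_day=df['match_datetime'].dt.normalize())
          .merge(s_df.rename(columns={'match_datetime': 'match_day'}),
                 on=['run_time', 'match_day'], how='left'))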

How to merge two DataFrames with DatetimeIndex preserved in pandas?

I have 2 DataFrames, df1, and df2.
df1 has the following contents:
Adj Close Close High Low \
GBTC QQQ GBTC QQQ GBTC QQQ GBTC
Date
2019-01-29 4.02 159.342209 4.02 161.570007 4.07 163.240005 3.93
2019-01-30 4.06 163.395538 4.06 165.679993 4.09 166.279999 4.01
2019-01-31 3.99 165.841370 3.99 168.160004 4.06 168.990005 3.93
2019-02-01 4.02 165.141129 4.02 167.449997 4.07 168.600006 3.93
2019-02-04 3.96 167.192474 3.96 169.529999 4.00 169.529999 3.93
... ... ... ... ... ... ... ...
2019-02-25 4.65 171.127441 4.65 173.520004 4.78 174.660004 4.50
2019-02-26 4.36 171.304947 4.36 173.699997 4.74 174.250000 4.36
2019-02-27 4.30 171.196487 4.30 173.589996 4.50 173.800003 4.30
2019-02-28 4.46 170.802002 4.46 173.190002 4.65 173.809998 4.40
2019-03-01 4.58 171.985443 4.58 174.389999 4.64 174.649994 4.45
Open Volume
QQQ GBTC QQQ GBTC QQQ
Date
2019-01-29 160.990005 3.970 163.199997 975200 30784200
2019-01-30 162.889999 4.035 163.399994 770700 41346500
2019-01-31 166.470001 4.040 166.699997 1108700 37258400
2019-02-01 166.990005 4.000 167.330002 889100 32143700
2019-02-04 167.330002 3.990 167.479996 871800 26718800
... ... ... ... ... ...
2019-02-25 173.399994 4.625 174.210007 2891200 32608800
2019-02-26 172.809998 4.625 173.100006 2000100 21939700
2019-02-27 171.759995 4.400 172.899994 1537000 25162000
2019-02-28 172.699997 4.420 173.050003 1192600 25085500
2019-03-01 173.179993 4.470 174.440002 948500 31431200
[23 rows x 12 columns]
And here's the contents of df2:
Adj Close Close High Low \
GBTC QQQ GBTC QQQ GBTC QQQ GBTC
Date
2019-02-25 4.65 171.127441 4.65 173.520004 4.78 174.660004 4.50
2019-02-26 4.36 171.304947 4.36 173.699997 4.74 174.250000 4.36
2019-02-27 4.30 171.196487 4.30 173.589996 4.50 173.800003 4.30
2019-02-28 4.46 170.802002 4.46 173.190002 4.65 173.809998 4.40
2019-03-01 4.58 171.985443 4.58 174.389999 4.64 174.649994 4.45
... ... ... ... ... ... ... ...
2019-03-28 4.54 176.171432 4.54 178.309998 4.68 178.979996 4.51
2019-03-29 4.78 177.505249 4.78 179.660004 4.83 179.830002 4.55
2019-04-01 4.97 179.856705 4.97 182.039993 5.03 182.259995 4.85
2019-04-02 5.74 180.538437 5.74 182.729996 5.83 182.910004 5.52
2019-04-03 6.19 181.575836 6.19 183.779999 6.59 184.919998 5.93
Open Volume
QQQ GBTC QQQ GBTC QQQ
Date
2019-02-25 173.399994 4.625 174.210007 2891200 32608800
2019-02-26 172.809998 4.625 173.100006 2000100 21939700
2019-02-27 171.759995 4.400 172.899994 1537000 25162000
2019-02-28 172.699997 4.420 173.050003 1192600 25085500
2019-03-01 173.179993 4.470 174.440002 948500 31431200
... ... ... ... ... ...
2019-03-28 177.240005 4.650 178.360001 2104400 30368200
2019-03-29 178.589996 4.710 179.690002 2937400 35205500
2019-04-01 180.770004 4.850 181.509995 2733600 30969500
2019-04-02 181.779999 5.660 182.240005 6062000 22645200
2019-04-03 183.210007 5.930 183.759995 10002400 31633500
[28 rows x 12 columns]
As you can see from the above, df1 and df2 have overlapping Dates.
How can I create a merged DataFrame df that contains dates from 2019-01-29 to 2019-04-03 with no overlapping Date?
I've tried running df = df1.merge(df2, how='outer'). However, this command returns a DataFrame with Date removed, which is not something desirable.
> df
Adj Close Close High Low \
GBTC QQQ GBTC QQQ GBTC QQQ GBTC
0 4.02 159.342209 4.02 161.570007 4.07 163.240005 3.93
1 4.06 163.395538 4.06 165.679993 4.09 166.279999 4.01
2 3.99 165.841370 3.99 168.160004 4.06 168.990005 3.93
3 4.02 165.141129 4.02 167.449997 4.07 168.600006 3.93
4 3.96 167.192474 3.96 169.529999 4.00 169.529999 3.93
.. ... ... ... ... ... ... ...
41 4.54 176.171432 4.54 178.309998 4.68 178.979996 4.51
42 4.78 177.505249 4.78 179.660004 4.83 179.830002 4.55
43 4.97 179.856705 4.97 182.039993 5.03 182.259995 4.85
44 5.74 180.538437 5.74 182.729996 5.83 182.910004 5.52
45 6.19 181.575836 6.19 183.779999 6.59 184.919998 5.93
Open Volume
QQQ GBTC QQQ GBTC QQQ
0 160.990005 3.970 163.199997 975200 30784200
1 162.889999 4.035 163.399994 770700 41346500
2 166.470001 4.040 166.699997 1108700 37258400
3 166.990005 4.000 167.330002 889100 32143700
4 167.330002 3.990 167.479996 871800 26718800
.. ... ... ... ... ...
41 177.240005 4.650 178.360001 2104400 30368200
42 178.589996 4.710 179.690002 2937400 35205500
43 180.770004 4.850 181.509995 2733600 30969500
44 181.779999 5.660 182.240005 6062000 22645200
45 183.210007 5.930 183.759995 10002400 31633500
[46 rows x 12 columns]
It seems that I should find a way to merge df1.index and df2.index. Then add the merged DatetimeIndex to df.
For the convenience of debugging, you can run the following code to get the same data as mine.
import yfinance as yf
symbols = ['QQQ', 'GBTC']
df1 = yf.download(symbols, start="2019-01-29", end="2019-03-01")
df2 = yf.download(symbols, start="2019-02-25", end="2019-04-03")
Taken from the docs:
The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.
So I believe that if you specify the index name in the merge with on='Date', you should be OK.
df1.merge(df2, how='outer', on='Date')
However, for the problem that you are trying to solve, merge is not the right tool. What you need to do is append the dataframes together and then remove the duplicated days:
df1.append(df2).drop_duplicates()
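Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on recent versions the equivalent uses pd.concat. A sketch (drop_duplicates compares column values, like the original; deduplicating on the index is an alternative if the goal is strictly one row per Date):
df = pd.concat([df1, df2]).drop_duplicates()
# alternative: keep one row per Date based on the index
# df = pd.concat([df1, df2])
# df = df[~df.index.duplicated(keep='first')]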

manipulating more than 2 dataframes

I have 6 different dataframes and I would like to append them one after the other.
The only way I have found is to append 2 at a time, although I believe there must be a more efficient way to do this.
After that I would also like to change the index and header names; I know how to do that one by one as well, but again I believe there must be a more efficient way.
The last problem I am facing is how to set the index using the column that is named NaN; how should I refer to it in set_index?
 
df1
 NaN     1      2      3
1   A   17.03   13.41  19.61
7   B   3.42    1.51    5.44
8   C   5.65    2.81    1.89
df2
NaN     1      2      3
1  J   1.60   2.65   1.44
5  H   26.78  27.04  21.06
df3
NaN    1      2      3
1   L   1.20   1.41   2.04
2   M   1.23   1.72   2.47
4   R  66.13  51.49  16.62
5   F     --  46.89  22.35
df4
 NaN    1      2      3
1   A   17.03   13.41  19.61
7   B   3.42    1.51    5.44
8   C   5.65    2.81    1.89
df5
NaN    1      2      3
1  J   1.60   2.65   1.44
5  H   26.78  27.04  21.06
df6
NaN    1      2      3
1   L   1.20   1.41   2.04
2   M   1.23   1.72   2.47
4   R  66.13  51.49  16.62
5   F     --  46.89  22.35
You can use concat. To select the NaN column you can use df.columns[0] with set_index inside a list comprehension:
dfs = [df1,df2, df3, ...]
df = pd.concat([df.set_index(df.columns[0], append=True) for df in dfs])
print (df)
1 2 3
NaN
1 A 17.03 13.41 19.61
7 B 3.42 1.51 5.44
8 C 5.65 2.81 1.89
1 J 1.6 2.65 1.44
5 H 26.78 27.04 21.06
1 L 1.20 1.41 2.04
2 M 1.23 1.72 2.47
4 R 66.13 51.49 16.62
5 F -- 46.89 22.35
EDIT:
It seems the column names, including 'NaN', can be strings:
print (df3.columns)
Index(['NaN', '1', '2', '3'], dtype='object')
dfs = [df1,df2, df3]
df = pd.concat([df.set_index('NaN', append=True) for df in dfs])
print (df)
1 2 3
NaN
1 A 17.03 13.41 19.61
7 B 3.42 1.51 5.44
8 C 5.65 2.81 1.89
1 J 1.6 2.65 1.44
5 H 26.78 27.04 21.06
1 L 1.20 1.41 2.04
2 M 1.23 1.72 2.47
4 R 66.13 51.49 16.62
5 F -- 46.89 22.35
Or, if the column label is an actual np.nan, this also works for me:
#converting to `NaN` if necessary
#df1.columns = df1.columns.astype(float)
#df2.columns = df2.columns.astype(float)
#df3.columns = df3.columns.astype(float)
dfs = [df1,df2, df3]
df = pd.concat([df.set_index(np.nan, append=True) for df in dfs])
print (df)
1.0 2.0 3.0
nan
1 A 17.03 13.41 19.61
7 B 3.42 1.51 5.44
8 C 5.65 2.81 1.89
1 J 1.6 2.65 1.44
5 H 26.78 27.04 21.06
1 L 1.20 1.41 2.04
2 M 1.23 1.72 2.47
4 R 66.13 51.49 16.62
5 F -- 46.89 22.35
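For the renaming part of the question, a possible sketch building on the same list comprehension (the new column and index names here are purely illustrative, and it assumes the string column labels shown in the EDIT above):
new_cols = {'1': 'col_a', '2': 'col_b', '3': 'col_c'}      # illustrative names
df = pd.concat(
    [d.set_index(d.columns[0], append=True).rename(columns=new_cols) for d in dfs]
).rename_axis(['num', 'letter'])                            # illustrative index level names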

pandas: t-test and p-value of month over month mean difference in aggregated dataframe using groupby function

This is my first posted question, so please excuse if it doesn't look good.
I have a source data file which I transform to the following dataframe using pandas groupby aggregation
pd.read_csv('R:/Python ETL/AGG7.csv', sep=',')
Treatment Month stdev n avg
0 AAAA 1/1/2016 1.92 309 7.57
1 AAAA 2/1/2016 1.89 79 7.46
2 AAAA 3/1/2016 2.25 158 7.20
3 AAAA 4/1/2016 2.23 22 7.68
4 BBBB 1/1/2016 2.04 175 7.10
5 BBBB 2/1/2016 1.96 33 7.09
6 BBBB 3/1/2016 2.02 110 7.32
7 BBBB 4/1/2016 1.73 25 7.92
8 CCCC 1/1/2016 2.42 111 7.40
9 CCCC 2/1/2016 1.45 22 7.73
10 CCCC 3/1/2016 2.44 21 6.95
11 CCCC 4/1/2016 2.84 92 6.92
What I need is 2 additional columns with month over month difference (MoM diff) and p-value of T-tests of those differences.
MoM diff pValue
-0.11 0.35
-0.26 0.62
0.48 0.65
-0.01 0.02
0.23 0.44
0.6 0.83
0.33 0.46
-0.78 0.79
-0.03 0.04
The problem is that I cannot compute them on the fly using pandas groupby, either with the scipy.stats ttest_ind function on the original dataset or with the ttest_ind_from_stats function on the aggregated dataframe shown. I have tried many different approaches, but with no success. Can anyone help, please?
You can use df.shift with groupby to have the shifted values:
df[["avg_2", "n_2", "stdev_2"]] = df.groupby("Treatment")["avg", "n", "stdev"].shift()
df
Out[7]:
Treatment Month stdev n avg avg_2 n_2 stdev_2
0 AAAA 2016-01-01 1.92 309 7.57 NaN NaN NaN
1 AAAA 2016-01-02 1.89 79 7.46 7.57 309.0 1.92
2 AAAA 2016-01-03 2.25 158 7.20 7.46 79.0 1.89
3 AAAA 2016-01-04 2.23 22 7.68 7.20 158.0 2.25
4 BBBB 2016-01-01 2.04 175 7.10 NaN NaN NaN
5 BBBB 2016-01-02 1.96 33 7.09 7.10 175.0 2.04
6 BBBB 2016-01-03 2.02 110 7.32 7.09 33.0 1.96
7 BBBB 2016-01-04 1.73 25 7.92 7.32 110.0 2.02
8 CCCC 2016-01-01 2.42 111 7.40 NaN NaN NaN
9 CCCC 2016-01-02 1.45 22 7.73 7.40 111.0 2.42
10 CCCC 2016-01-03 2.44 21 6.95 7.73 22.0 1.45
11 CCCC 2016-01-04 2.84 92 6.92 6.95 21.0 2.44
You can filter out NaN values with pd.notnull:
df2 = df[pd.notnull(df.avg_2)].copy()
And you can get the results of the t-tests with:
import scipy.stats as ss
res = ss.ttest_ind_from_stats(df2.avg, df2.stdev, df2.n, df2.avg_2, df2.stdev_2, df2.n_2, equal_var=False)
If you want the mean differences and p-values in this dataframe:
df2["dif_avg"] = df2.avg - df2.avg_2
df2["p_value"] = res.pvalue
Out[22]:
Month stdev n avg avg_2 n_2 stdev_2 dif_avg p_value
1 2016-01-02 1.89 79 7.46 7.57 309.0 1.92 -0.11 0.646226
2 2016-01-03 2.25 158 7.20 7.46 79.0 1.89 -0.26 0.350814
3 2016-01-04 2.23 22 7.68 7.20 158.0 2.25 0.48 0.353023
5 2016-01-02 1.96 33 7.09 7.10 175.0 2.04 -0.01 0.978808
6 2016-01-03 2.02 110 7.32 7.09 33.0 1.96 0.23 0.559625
7 2016-01-04 1.73 25 7.92 7.32 110.0 2.02 0.60 0.137527
9 2016-01-02 1.45 22 7.73 7.40 111.0 2.42 0.33 0.395806
10 2016-01-03 2.44 21 6.95 7.73 22.0 1.45 -0.78 0.214270
11 2016-01-04 2.84 92 6.92 6.95 21.0 2.44 -0.03 0.961019
Line-by-line:
import csv
import scipy.stats as ss

results = []
treatment1 = ""
with open('R:/Python ETL/AGG7.csv') as f:
    reader = csv.reader(f)
    next(reader, None)
    for line in reader:
        treatment2, stdev2, n2, avg2 = line[0], float(line[2]), int(line[3]), float(line[4])
        if treatment2 == treatment1:
            ttest_res = ss.ttest_ind_from_stats(avg1, stdev1, n1, avg2, stdev2, n2, equal_var=False)
            results.append((avg2 - avg1, ttest_res.pvalue))
        treatment1, stdev1, n1, avg1 = treatment2, stdev2, n2, avg2
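The results list can then be turned into a frame matching the desired output, for example (a sketch; the column names just follow the question):
import pandas as pd
out = pd.DataFrame(results, columns=['MoM diff', 'pValue']).round(2)
print(out)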
Is that what you need?
In [154]: df
Out[154]:
Treatment Month stdev n avg
0 AAAA 1/1/2016 1.92 309 7.57
1 AAAA 2/1/2016 1.89 79 7.46
2 AAAA 3/1/2016 2.25 158 7.20
3 AAAA 4/1/2016 2.23 22 7.68
4 BBBB 1/1/2016 2.04 175 7.10
5 BBBB 2/1/2016 1.96 33 7.09
6 BBBB 3/1/2016 2.02 110 7.32
7 BBBB 4/1/2016 1.73 25 7.92
8 CCCC 1/1/2016 2.42 111 7.40
9 CCCC 2/1/2016 1.45 22 7.73
10 CCCC 3/1/2016 2.44 21 6.95
11 CCCC 4/1/2016 2.84 92 6.92
In [155]: df.stdev.diff()
Out[155]:
0 NaN
1 -0.03
2 0.36
3 -0.02
4 -0.19
5 -0.08
6 0.06
7 -0.29
8 0.69
9 -0.97
10 0.99
11 0.40
Name: stdev, dtype: float64
let's shift it one row up:
In [156]: df.stdev.diff().shift(-1)
Out[156]:
0 -0.03
1 0.36
2 -0.02
3 -0.19
4 -0.08
5 0.06
6 -0.29
7 0.69
8 -0.97
9 0.99
10 0.40
11 NaN
Name: stdev, dtype: float64

Python/Pandas - Sum dataframe items if indexes have the same month

I have these two DataFrames:
Seasonal_Component:
# DataFrame that has the seasonal component of a time series
Date
2014-12 -1.08
2015-01 -0.28
2015-02 0.15
2015-03 0.46
2015-04 0.48
2015-05 0.37
2015-06 0.20
2015-07 0.15
2015-08 0.12
2015-09 -0.02
2015-10 -0.17
2015-11 -0.39
Prediction_df:
# DataFrame with the prediction of the trend of that same time series
Prediction MAPE Score
2015-11-01 7.93 1.83 1
2015-12-01 7.93 1.67 1
2016-01-01 7.92 1.71 1
2016-02-01 7.95 1.84 1
2016-03-01 7.94 1.53 1
2016-04-01 7.87 1.45 1
2016-05-01 7.91 1.53 1
2016-06-01 7.87 1.40 1
2016-07-01 7.84 1.40 1
2016-08-01 7.89 1.77 1
2016-09-01 7.87 1.99 1
What I need to do:
Check which Prediction_df index entries have the same month as the Seasonal_Component index and add the corresponding seasonal component to the prediction, so that Prediction_df looks like this:
Prediction MAPE Score
2015-11-01 7.54 1.83 1
2015-12-01 6.85 1.67 1
2016-01-01 7.64 1.71 1
2016-02-01 8.10 1.84 1
2016-03-01 8.40 1.53 1
2016-04-01 8.35 1.45 1
2016-05-01 8.28 1.53 1
2016-06-01 8.07 1.40 1
2016-07-01 7.99 1.40 1
2016-08-01 8.01 1.77 1
2016-09-01 7.85 1.99 1
Is anyone available to enlighten me on this journey?
I'm already at the "almost mad" stage trying to solve this.
EDIT
Important note to make it clearer: I need to disregard the year and consider only the month when summing. In other words, every time an April appears (no matter whether it is 2006 or 2025) I need to add the April value from the Seasonal_Component frame.
Consider a data frame merge on the date fields (month values), then a simple addition of the two fields. The date fields may require conversion from string values:
import datetime as dt
...
# IF DATES ARE REGULAR COLUMNS
seasonal_component['Date'] = pd.to_datetime(seasonal_component['Date'])
seasonal_component['Month'] = seasonal_component['Date'].dt.month
predict_df['Date'] = pd.to_datetime(predict_df['Date'])
predict_df['Month'] = predict_df['Date'].dt.month
# IF DATES ARE INDICES
seasonal_component.index = pd.to_datetime(seasonal_component.index)
seasonal_component['Month'] = seasonal_component.index.month
predict_df.index = pd.to_datetime(predict_df.index)
predict_df['Month'] = predict_df.index.month
However, think about how you need to join the two data sets (akin to SQL's join clauses):
inner (default) - keeps only records matching both
left - keeps records of predict_df and only those matching seasonal_component where predict_df is first argument
right - keeps records of seasonal_component and only those matching predict_df where predict_df is first argument
outer - keeps all records, those that match and those that don't match
Below assumes an outer join where data on both sides remain with NaNs to fill for missing values.
# MERGING DATA FRAMES
merge_df = pd.merge(predict_df, seasonal_component[['Month', 'SeasonalComponent']],
                    on=['Month'], how='outer')
# ADDING COLUMNS
merge_df['Prediction'] = merge_df['Prediction'] + merge_df['SeasonalComponent']
Outcome (using posted data)
Date Prediction MAPE Score Month SeasonalComponent
0 2015-11-01 7.54 1.83 1 11 -0.39
1 2015-12-01 6.85 1.67 1 12 -1.08
2 2016-01-01 7.64 1.71 1 1 -0.28
3 2016-02-01 8.10 1.84 1 2 0.15
4 2016-03-01 8.40 1.53 1 3 0.46
5 2016-04-01 8.35 1.45 1 4 0.48
6 2016-05-01 8.28 1.53 1 5 0.37
7 2016-06-01 8.07 1.40 1 6 0.20
8 2016-07-01 7.99 1.40 1 7 0.15
9 2016-08-01 8.01 1.77 1 8 0.12
10 2016-09-01 7.85 1.99 1 9 -0.02
11 NaT NaN NaN NaN 10 -0.17
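If the unmatched month (the NaT row above) is not wanted, a small sketch of two options:
# keep only months present in predict_df
merge_df = pd.merge(predict_df, seasonal_component[['Month', 'SeasonalComponent']],
                    on=['Month'], how='left')
# or drop unmatched rows after the outer merge
merge_df = merge_df.dropna(subset=['Prediction'])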
First extract the month from both dataframes, then merge on month. Finally add the required columns and create a new column with the desired output. Here is the code:
import pandas as pd
import numpy as np
from pandas import DataFrame,Series
from numpy.random import randn
Seasonal_Component = DataFrame({
    'Date': ['2014-12','2015-01','2015-02','2015-03','2015-04','2015-05','2015-06','2015-07','2015-08','2015-09','2015-10','2015-11'],
    'Value': [-1.08,-0.28,0.15,0.46,0.48,0.37,0.20,0.15,0.12,-0.02,-0.17,-0.39]
})
Prediction_df = DataFrame({
    'Date': ['2015-11-01','2015-12-01','2016-01-01','2016-02-01','2016-03-01','2016-04-01','2016-05-01','2016-06-01','2016-07-01','2016-08-01','2016-09-01'],
    'Prediction': [7.93,7.93,7.92,7.95,7.94,7.87,7.91,7.87,7.84,7.89,7.87],
    'MAPE': [1.83,1.67,1.71,1.84,1.53,1.45,1.53,1.40,1.40,1.77,1.99],
    'Score': [1,1,1,1,1,1,1,1,1,1,1]
})
def mon_extract(date):
    return date.split('-')[1]

Seasonal_Component['Month'] = Seasonal_Component['Date'].apply(mon_extract)

def mon_extract(date):
    return date.split('-')[1].split('-')[0]

Prediction_df['Month'] = Prediction_df['Date'].apply(mon_extract)
FinalDF=pd.merge(Seasonal_Component,Prediction_df,on='Month',how='right')
FinalDF
FinalDF['PredictionF']=FinalDF['Value']+FinalDF['Prediction']
FinalDF.loc[:,['Date_y','PredictionF','MAPE','Score']]
