Calculate mean on dataframe given a frequency using pandas - python

I have a dataframe where values are measured every 30 minutes, as shown below:
2015-01-01 00:00:00 94.50
2015-01-01 00:30:00 78.75
2015-01-01 01:00:00 85.87
2015-01-01 01:30:00 85.88
2015-01-01 02:00:00 84.75
2015-01-01 02:30:00 87.50
So, each day has 48 values. The first column is the time index, created using:
date= pd.date_range( '1/1/2015', periods=len(series),freq='30min' )
series=series.values.reshape(-1,1)
df=pd.DataFrame(series, index=date)
What I would like to do is to obtain the mean for each time of the day and weekday. Something like this:
My initial idea was to group by weekday and frequency (30 min.) as follows:
df= df.groupby([ df.index.weekday,df.index.freq])
print(df.describe())
count mean std min 25% 50% 75%
0 2015-01-05 00:30:00 1.0 93.75 NaN 93.75 93.75 93.75 93.75
2015-01-05 01:00:00 1.0 110.25 NaN 110.25 110.25 110.25 110.25
2015-01-05 01:30:00 1.0 110.88 NaN 110.88 110.88 110.88 110.88
2015-01-05 02:00:00 1.0 90.12 NaN 90.12 90.12 90.12 90.12
2015-01-05 02:30:00 1.0 91.50 NaN 91.50 91.50 91.50 91.50
2015-01-05 03:00:00 1.0 94.13 NaN 94.13 94.13 94.13 94.13
2015-01-05 03:30:00 1.0 90.62 NaN 90.62 90.62 90.62 90.62
2015-01-05 04:00:00 1.0 91.88 NaN 91.88 91.88 91.88 91.88
2015-01-05 04:30:00 1.0 92.50 NaN 92.50 92.50 92.50 92.50
2015-01-05 05:00:00 1.0 98.12 NaN 98.12 98.12 98.12 98.12
2015-01-05 05:30:00 1.0 105.75 NaN 105.75 105.75 105.75 105.75
2015-01-05 06:00:00 1.0 100.50 NaN 100.50 100.50 100.50 100.50
2015-01-05 06:30:00 1.0 82.25 NaN 82.25 82.25 82.25 82.25
2015-01-05 07:00:00 1.0 81.75 NaN 81.75 81.75 81.75 81.75
2015-01-05 07:30:00 1.0 90.50 NaN 90.50 90.50 90.50 90.50
2015-01-05 08:00:00 1.0 89.50 NaN 89.50 89.50 89.50 89.50
2015-01-05 08:30:00 1.0 89.63 NaN 89.63 89.63 89.63 89.63
2015-01-05 09:00:00 1.0 84.62 NaN 84.62 84.62 84.62 84.62
2015-01-05 09:30:00 1.0 86.63 NaN 86.63 86.63 86.63 86.63
2015-01-05 10:00:00 1.0 96.12 NaN 96.12 96.12 96.12 96.12
2015-01-05 10:30:00 1.0 104.13 NaN 104.13 104.13 104.13 104.13
2015-01-05 11:00:00 1.0 101.12 NaN 101.12 101.12 101.12 101.12
2015-01-05 11:30:00 1.0 85.88 NaN 85.88 85.88 85.88 85.88
2015-01-05 12:00:00 1.0 77.12 NaN 77.12 77.12 77.12 77.12
2015-01-05 12:30:00 1.0 78.88 NaN 78.88 78.88 78.88 78.88
2015-01-05 13:00:00 1.0 76.62 NaN 76.62 76.62 76.62 76.62
2015-01-05 13:30:00 1.0 78.63 NaN 78.63 78.63 78.63 78.63
2015-01-05 14:00:00 1.0 85.37 NaN 85.37 85.37 85.37 85.37
2015-01-05 14:30:00 1.0 103.63 NaN 103.63 103.63 103.63 103.63
2015-01-05 15:00:00 1.0 112.87 NaN 112.87 112.87 112.87 112.87
... ... ... .. ... ... ... ...
6 2016-10-02 09:30:00 1.0 84.75 NaN 84.75 84.75 84.75 84.75
2016-10-02 10:00:00 1.0 60.49 NaN 60.49 60.49 60.49 60.49
2016-10-02 10:30:00 1.0 76.25 NaN 76.25 76.25 76.25 76.25
2016-10-02 11:00:00 1.0 68.13 NaN 68.13 68.13 68.13 68.13
2016-10-02 11:30:00 1.0 54.15 NaN 54.15 54.15 54.15 54.15
2016-10-02 12:00:00 1.0 79.91 NaN 79.91 79.91 79.91 79.91
2016-10-02 12:30:00 1.0 72.79 NaN 72.79 72.79 72.79 72.79
2016-10-02 13:00:00 1.0 77.49 NaN 77.49 77.49 77.49 77.49
2016-10-02 13:30:00 1.0 77.65 NaN 77.65 77.65 77.65 77.65
2016-10-02 14:00:00 1.0 70.44 NaN 70.44 70.44 70.44 70.44
2016-10-02 14:30:00 1.0 82.47 NaN 82.47 82.47 82.47 82.47
2016-10-02 15:00:00 1.0 41.53 NaN 41.53 41.53 41.53 41.53
2016-10-02 15:30:00 1.0 66.65 NaN 66.65 66.65 66.65 66.65
2016-10-02 16:00:00 1.0 55.23 NaN 55.23 55.23 55.23 55.23
2016-10-02 16:30:00 1.0 59.45 NaN 59.45 59.45 59.45 59.45
2016-10-02 17:00:00 1.0 79.92 NaN 79.92 79.92 79.92 79.92
2016-10-02 17:30:00 1.0 58.48 NaN 58.48 58.48 58.48 58.48
2016-10-02 18:00:00 1.0 92.56 NaN 92.56 92.56 92.56 92.56
2016-10-02 18:30:00 1.0 86.92 NaN 86.92 86.92 86.92 86.92
2016-10-02 19:00:00 1.0 88.61 NaN 88.61 88.61 88.61 88.61
2016-10-02 19:30:00 1.0 99.21 NaN 99.21 99.21 99.21 99.21
2016-10-02 20:00:00 1.0 81.02 NaN 81.02 81.02 81.02 81.02
2016-10-02 20:30:00 1.0 84.83 NaN 84.83 84.83 84.83 84.83
2016-10-02 21:00:00 1.0 59.29 NaN 59.29 59.29 59.29 59.29
2016-10-02 21:30:00 1.0 95.99 NaN 95.99 95.99 95.99 95.99
2016-10-02 22:00:00 1.0 76.95 NaN 76.95 76.95 76.95 76.95
2016-10-02 22:30:00 1.0 112.49 NaN 112.49 112.49 112.49 112.49
2016-10-02 23:00:00 1.0 88.85 NaN 88.85 88.85 88.85 88.85
2016-10-02 23:30:00 1.0 122.40 NaN 122.40 122.40 122.40 122.40
2016-10-03 00:00:00 1.0 82.84 NaN 82.84 82.84 82.84 82.84
Looking at this, you can see that it just groups by weekday. So this is not the proper way to group in order to calculate the mean I wanted.

I'd use df.index.weekday and df.index.time
df.groupby([ df.index.weekday,df.index.time]).mean()
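A minimal sketch of the grouping suggested above, run on synthetic data. The column name "value", the two-week range and the unstack step are assumptions made for illustration only, not part of the original post:

```python
import numpy as np
import pandas as pd

# Two weeks of synthetic half-hourly readings (48 per day), values are just 0, 1, 2, ...
idx = pd.date_range("2015-01-01", periods=48 * 14, freq="30min")
df = pd.DataFrame({"value": np.arange(len(idx), dtype=float)}, index=idx)

# One group per (weekday, time-of-day) pair: 7 * 48 groups, each holding 2 readings here
means = df.groupby([df.index.weekday, df.index.time]).mean()

# Optional: reshape into a table with one row per time of day and one column per weekday
table = means["value"].unstack(level=0)
print(table.head())
```

Grouping on df.index.time works because it ignores the date part, whereas df.index.freq is a single offset object shared by the whole index and so cannot split the rows into time slots.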

Related

Pandas fillna() method not filling all missing values

I have rain and temp data sourced from Environment Canada but it contains some NaN values.
start_date = '2015-12-31'
end_date = '2021-05-26'
mask = (data['date'] > start_date) & (data['date'] <= end_date)
df = data.loc[mask]
print(df)
date time rain_gauge_value temperature
8760 2016-01-01 00:00:00 0.0 -2.9
8761 2016-01-01 01:00:00 0.0 -3.4
8762 2016-01-01 02:00:00 0.0 -3.6
8763 2016-01-01 03:00:00 0.0 -3.6
8764 2016-01-01 04:00:00 0.0 -4.0
... ... ... ... ...
56107 2021-05-26 19:00:00 0.0 22.0
56108 2021-05-26 20:00:00 0.0 21.5
56109 2021-05-26 21:00:00 0.0 21.1
56110 2021-05-26 22:00:00 0.0 19.5
56111 2021-05-26 23:00:00 0.0 18.5
[47352 rows x 4 columns]
Find the rows with a NaN value
null = df[df['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
11028 2016-04-04 12:00:00 NaN -6.9
11986 2016-05-14 10:00:00 NaN NaN
11987 2016-05-14 11:00:00 NaN NaN
11988 2016-05-14 12:00:00 NaN NaN
11989 2016-05-14 13:00:00 NaN NaN
... ... ... ... ...
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
[6346 rows x 4 columns]
This is my dataframe I want to use to fill the NaN values
print(rain_df)
date time rain_gauge_value temperature
0 2015-12-28 00:00:00 0.1 -6.0
1 2015-12-28 01:00:00 0.0 -7.0
2 2015-12-28 02:00:00 0.0 -8.0
3 2015-12-28 03:00:00 0.0 -8.0
4 2015-12-28 04:00:00 0.0 -7.0
... ... ... ... ...
48043 2021-06-19 19:00:00 0.6 20.0
48044 2021-06-19 20:00:00 0.6 19.0
48045 2021-06-19 21:00:00 0.8 18.0
48046 2021-06-19 22:00:00 0.4 17.0
48047 2021-06-19 23:00:00 0.0 16.0
[48048 rows x 4 columns]
But when I use the fillna() method, some of the values don't get substituted.
null = null.fillna(rain_df)
null = null[null['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
48057 2020-06-25 09:00:00 NaN NaN
48058 2020-06-25 10:00:00 NaN NaN
48059 2020-06-25 11:00:00 NaN NaN
48060 2020-06-25 12:00:00 NaN NaN
48586 2020-07-17 10:00:00 NaN NaN
48587 2020-07-17 11:00:00 NaN NaN
48588 2020-07-17 12:00:00 NaN NaN
49022 2020-08-04 14:00:00 NaN NaN
49023 2020-08-04 15:00:00 NaN NaN
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
How can I resolve this issue?
When you use fillna, you probably want a fill method, such as filling with the previous/next value or the column mean. For example:
nulls_index = df['rain_gauge_value'].isnull()
df = df.ffill() # forward fill, as an example
nulls_after_fill = df[nulls_index]
Take a look at:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
You need to tell pandas how you want to patch. It may be obvious to you that you want to use the "patch" dataframe's values where the dates and times line up, but it won't be obvious to pandas. See my dummy example:
from datetime import date, time
import numpy as np
import pandas as pd
raw = pd.DataFrame(dict(date=[date(2015,12,28), date(2015,12,28)], time= [time(0,0,0),time(0,0,1)],temp=[1.,np.nan],rain=[4.,np.nan]))
raw
date time temp rain
0 2015-12-28 00:00:00 1.0 4.0
1 2015-12-28 00:00:01 NaN NaN
patch = pd.DataFrame(dict(date=[date(2015,12,28), date(2015,12,28)], time=[time(0,0,0),time(0,0,1)],temp=[5.,5.],rain=[10.,10.]))
patch
date time temp rain
0 2015-12-28 00:00:00 5.0 10.0
1 2015-12-28 00:00:01 5.0 10.0
You need the indexes of raw and patch to correspond to how you want to patch the raw data (in this case, you want to patch based on date and time):
raw.set_index(['date','time']).fillna(patch.set_index(['date','time']))
returns
temp rain
date time
2015-12-28 00:00:00 1.0 4.0
00:00:01 5.0 10.0
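To illustrate why the index matters: when fillna receives a DataFrame, pandas matches rows by index label, not by position or by the contents of a date column. The sketch below uses made-up frames and a hypothetical "when" column. It seems consistent with the question, where the rows left unfilled carry labels (48057 and up) beyond the last label of rain_df (48047).

```python
import numpy as np
import pandas as pd

when = pd.to_datetime(["2020-06-25 09:00", "2020-06-25 10:00"])

# Same timestamps in both frames, but different integer row labels (made-up values)
raw = pd.DataFrame({"when": when, "rain": [np.nan, np.nan]}, index=[48057, 48058])
patch = pd.DataFrame({"when": when, "rain": [0.2, 0.4]}, index=[0, 1])

# Aligned on the integer labels: no label is shared, so nothing is filled
by_label = raw[["rain"]].fillna(patch[["rain"]])

# Aligned on the timestamp: every missing value finds its counterpart
by_time = raw.set_index("when").fillna(patch.set_index("when"))

print(by_label)
print(by_time)
```

Note that with integer labels a match can also be silently wrong: a row labelled 11028 in one frame would be filled from the row labelled 11028 in the other, whatever its date.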

How can I replace specific values in a time-series dataframe in pandas?

I have the data below (date/time is a MultiIndex) and I want to replace the column values between 00:00:00 and 07:00:00 with this numpy array:
[[ 21.63920663 21.62012822 20.9900515 21.23217008 21.19482458
21.10839656 20.89631935 20.79977166 20.99176729 20.91567565
20.87258765 20.76210464 20.50357827 20.55897631 20.38005033
20.38227309 20.54460993 20.37707293 20.08279925 20.09955877
20.02559575 20.12390737 20.2917257 20.20056711 20.1589065
20.41302289 20.48000767 20.55604102 20.70255192]]
date time
2018-01-26 00:00:00 21.65
00:15:00 NaN
00:30:00 NaN
00:45:00 NaN
01:00:00 NaN
01:15:00 NaN
01:30:00 NaN
01:45:00 NaN
02:00:00 NaN
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
03:15:00 NaN
03:30:00 NaN
03:45:00 NaN
04:00:00 NaN
04:15:00 NaN
04:30:00 NaN
04:45:00 NaN
05:00:00 NaN
05:15:00 NaN
05:30:00 NaN
05:45:00 NaN
06:00:00 NaN
06:15:00 NaN
06:30:00 NaN
06:45:00 NaN
07:00:00 NaN
07:15:00 NaN
07:30:00 NaN
07:45:00 NaN
08:00:00 NaN
08:15:00 NaN
08:30:00 NaN
08:45:00 NaN
09:00:00 NaN
09:15:00 NaN
09:30:00 NaN
09:45:00 NaN
10:00:00 NaN
10:15:00 NaN
10:30:00 NaN
10:45:00 NaN
11:00:00 NaN
Name: temp, dtype: float64
<class 'datetime.time'>
How can I do this?
You can use slicers:
idx = pd.IndexSlice
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = 1
Or if second levels are times:
import datetime
idx = pd.IndexSlice
df1.loc[idx[:, datetime.time(0, 0, 0):datetime.time(2, 0, 0)],:] = 1
Sample:
print (df1)
aaa
date time
2018-01-26 00:00:00 21.65
00:15:00 NaN
00:30:00 NaN
00:45:00 NaN
01:00:00 NaN
01:15:00 NaN
01:30:00 NaN
01:45:00 NaN
02:00:00 NaN
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 2.00
00:15:00 NaN
00:30:00 NaN
00:45:00 NaN
01:00:00 NaN
01:15:00 NaN
01:30:00 NaN
01:45:00 NaN
02:00:00 NaN
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
idx = pd.IndexSlice
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = 1
print (df1)
aaa
date time
2018-01-26 00:00:00 1.0
00:15:00 1.0
00:30:00 1.0
00:45:00 1.0
01:00:00 1.0
01:15:00 1.0
01:30:00 1.0
01:45:00 1.0
02:00:00 1.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 1.0
00:15:00 1.0
00:30:00 1.0
00:45:00 1.0
01:00:00 1.0
01:15:00 1.0
01:30:00 1.0
01:45:00 1.0
02:00:00 1.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
EDIT:
To assign an array, it is necessary to use numpy.tile to repeat it once per unique value of the first level:
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = np.tile(np.arange(1, 10),len(df1.index.levels[0]))
print (df1)
aaa
date time
2018-01-26 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
A more general solution, generating the array from the length of the slice:
idx = pd.IndexSlice
len0 = df1.loc[idx[df1.index.levels[0][0], '00:00:00':'02:00:00'],:].shape[0]
len1 = len(df1.index.levels[0])
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = np.tile(np.arange(1, len0 + 1), len1)
Tested with times:
import datetime
idx = pd.IndexSlice
arr =np.tile(np.arange(1, 10),len(df1.index.levels[0]))
df1.loc[idx[:, datetime.time(0, 0, 0):datetime.time(2, 0, 0)],:] = arr
print (df1)
aaa
date time
2018-01-26 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
EDIT:
The problem was finally found: my solution works with a one-column DataFrame, but when working with a Series you need to remove the trailing ",:" (the column slicer):
arr = np.array([[ 21.63920663, 21.62012822, 20.9900515, 21.23217008, 21.19482458, 21.10839656,
20.89631935, 20.79977166, 20.99176729, 20.91567565, 20.87258765, 20.76210464,
20.50357827, 20.55897631, 20.38005033, 20.38227309, 20.54460993, 20.37707293,
20.08279925, 20.09955877, 20.02559575, 20.12390737, 20.2917257, 20.20056711,
20.1589065, 20.41302289, 20.48000767, 20.55604102, 20.70255192]])
import datetime
idx = pd.IndexSlice
df1.loc[idx[:, datetime.time(0, 0, 0): datetime.time(7, 0, 0)]] = arr[0]  # no ",:" after the closing bracket of idx[...]
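As a cross-check, the same assignment can be sketched with a plain boolean mask instead of pd.IndexSlice. This is an alternative technique, not the method used in the answer; the Series, the level names and the five replacement values below are invented for illustration:

```python
import datetime
import numpy as np
import pandas as pd

# Small stand-in for the question's data: 2 dates x 8 quarter-hour slots, all NaN
dates = [datetime.date(2018, 1, 26), datetime.date(2018, 1, 27)]
times = [datetime.time(h, m) for h in range(2) for m in (0, 15, 30, 45)]
index = pd.MultiIndex.from_product([dates, times], names=["date", "time"])
s = pd.Series(np.nan, index=index, name="temp")

start, end = datetime.time(0, 0), datetime.time(1, 0)
values = np.array([21.6, 21.5, 21.4, 21.3, 21.2])  # one value per slot in [start, end]

# True for every row whose time level falls inside the window, on any date
level_times = s.index.get_level_values("time")
mask = np.array([start <= t <= end for t in level_times])

# The window occurs once per date, so repeat the values once per date
s.loc[mask] = np.tile(values, len(dates))
print(s)
```

The tiling relies on the rows being ordered date by date, which holds here because the index was built with MultiIndex.from_product.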

Python Dataframe How to groupby weeks over years

I have a dataset like below :
date =
2012-01-01 NaN NaN NaN
2012-01-02 NaN NaN NaN
2012-01-03 NaN NaN NaN
2012-01-04 0.880 2.981 -0.0179
2012-01-05 0.857 2.958 -0.0261
2012-01-06 0.858 2.959 0.0012
2012-01-07 NaN NaN NaN
2012-01-08 NaN NaN NaN
2012-01-09 0.880 2.981 0.0256
2012-01-10 0.905 3.006 0.0284
2012-01-11 0.905 3.006 0.0000
2012-01-12 0.902 3.003 -0.0033
2012-01-13 0.880 2.981 -0.0244
2012-01-14 NaN NaN NaN
2012-01-15 NaN NaN NaN
2012-01-16 0.858 2.959 -0.0250
2012-01-17 0.891 2.992 0.0385
2012-01-18 0.878 2.979 -0.0146
2012-01-19 0.887 2.988 0.0103
2012-01-20 0.899 3.000 0.0135
2012-01-21 NaN NaN NaN
2012-01-22 NaN NaN NaN
2012-01-23 NaN NaN NaN
2012-01-24 NaN NaN NaN
2012-01-25 NaN NaN NaN
2012-01-26 NaN NaN NaN
2012-01-27 NaN NaN NaN
2012-01-28 NaN NaN NaN
2012-01-29 NaN NaN NaN
2012-01-30 0.892 2.993 -0.0078
... ... ... ...
2016-12-02 1.116 3.417 -0.0124
2016-12-03 NaN NaN NaN
2016-12-04 NaN NaN NaN
2016-12-05 1.111 3.412 -0.0045
2016-12-06 1.111 3.412 0.0000
2016-12-07 1.120 3.421 0.0081
2016-12-08 1.113 3.414 -0.0063
2016-12-09 1.109 3.410 -0.0036
2016-12-10 NaN NaN NaN
2016-12-11 NaN NaN NaN
2016-12-12 1.072 3.373 -0.0334
2016-12-13 1.075 3.376 0.0028
2016-12-14 1.069 3.370 -0.0056
2016-12-15 1.069 3.370 0.0000
2016-12-16 1.073 3.374 0.0037
2016-12-17 NaN NaN NaN
2016-12-18 NaN NaN NaN
2016-12-19 1.071 3.372 -0.0019
2016-12-20 1.067 3.368 -0.0037
2016-12-21 1.076 3.377 0.0084
2016-12-22 1.076 3.377 0.0000
2016-12-23 1.066 3.367 -0.0093
2016-12-24 NaN NaN NaN
2016-12-25 NaN NaN NaN
2016-12-26 1.041 3.372 0.0047
2016-12-27 1.042 3.373 0.0010
2016-12-28 1.038 3.369 -0.0038
2016-12-29 1.035 3.366 -0.0029
2016-12-30 1.038 3.369 0.0029
2016-12-31 1.038 3.369 0.0000
When I do:
in_range_df = Days_Count_Sum["2012-01-01":"2016-12-31"]
print("In range: ",in_range_df)
Week_Count = in_range_df.groupby(in_range_df.index.week)
print("in_range_df.index.week: ",in_range_df.index.week)
print("Group by Week: ",Week_Count.sum())
I found that the result is always a list of weeks 1 to 53.
Printing in_range_df.index.week gives: [52 1 1 ..., 52 52 52]
I realized that the index value is always "52" after the first year (2012) of this range.
How can I group by week over a range of more than one year?
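One possible approach, offered as a sketch rather than as an answer from the original thread: group by the ISO year together with the ISO week, so that week 1 of 2012 and week 1 of 2013 land in different groups. It assumes pandas 1.1 or later, where DatetimeIndex.isocalendar() is available; the frame and the "value" column are synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic daily data covering two calendar years, every value equal to 1
idx = pd.date_range("2012-01-01", "2013-12-31", freq="D")
df = pd.DataFrame({"value": np.ones(len(idx))}, index=idx)

# isocalendar() returns a frame with "year", "week" and "day" columns
iso = idx.isocalendar()
iso_year = iso["year"].astype("int64")
iso_week = iso["week"].astype("int64")

# One group per (ISO year, ISO week) instead of one per week number
weekly = df.groupby([iso_year, iso_week]).sum()
print(weekly.head())
```

2012-01-01 is a Sunday and belongs to ISO week 52 of ISO year 2011, which matches the leading 52 in the printed week numbers. An alternative for week-sized bins that ignore ISO numbering altogether would be df.resample("W").sum().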

Copy nan values from one dataframe to another

I have two dataframes df1 and df2 of the same size and dimensions. Is there a simple way to copy all the NaN values in 'df1' to 'df2' ? The example below demonstrates the output I want from .copynans()
In: df1
Out:
10053802 10053856 10053898 10058054
2012-07-01 00:00:00 100.0 0.353 0.300 0.326
2012-07-01 00:30:00 101.0 0.522 0.258 0.304
2012-07-01 01:00:00 102.0 0.311 0.369 0.228
2012-07-01 01:30:00 103.0 NaN 0.478 0.247
2012-07-01 02:00:00 101.0 NaN NaN 0.259
2012-07-01 02:30:00 102.0 0.281 NaN 0.239
2012-07-01 03:00:00 125.0 0.320 NaN 0.217
2012-07-01 03:30:00 136.0 0.288 NaN 0.283
In: df2
Out:
10053802 10053856 10053898 10058054
2012-07-01 00:00:00 1.0 2.0 3.0 4.0
2012-07-01 00:30:00 1.0 2.0 3.0 4.0
2012-07-01 01:00:00 1.0 2.0 3.0 4.0
2012-07-01 01:30:00 1.0 2.0 3.0 4.0
2012-07-01 02:00:00 1.0 2.0 3.0 4.0
2012-07-01 02:30:00 1.0 2.0 3.0 4.0
2012-07-01 03:00:00 1.0 2.0 3.0 4.0
2012-07-01 03:30:00 1.0 2.0 3.0 4.0
In: df2.copynans(df1)
Out:
10053802 10053856 10053898 10058054
2012-07-01 00:00:00 1.0 2.0 3.0 4.0
2012-07-01 00:30:00 1.0 2.0 3.0 4.0
2012-07-01 01:00:00 1.0 2.0 3.0 4.0
2012-07-01 01:30:00 1.0 NaN 3.0 4.0
2012-07-01 02:00:00 1.0 NaN NaN 4.0
2012-07-01 02:30:00 1.0 2.0 NaN 4.0
2012-07-01 03:00:00 1.0 2.0 NaN 4.0
2012-07-01 03:30:00 1.0 2.0 NaN 4.0
Either
df2.where(df1.notnull())
Or
df2.mask(df1.isnull())
# Use the null cells from df1 as a boolean index to set the corresponding cells in df2 to NaN
df2[df1.isnull()]=np.nan
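A small runnable check of the idea, using made-up frames: the goal is to keep df2's values and blank them wherever df1 is missing, so the mask is built from df1 and applied to df2. The column names "a" and "b" are placeholders.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [100.0, np.nan, 102.0], "b": [0.3, 0.5, np.nan]})
df2 = pd.DataFrame({"a": [1.0, 1.0, 1.0], "b": [2.0, 2.0, 2.0]})

# Keep df2 where df1 has a value, NaN elsewhere (returns a new frame)
result = df2.where(df1.notna())

# Equivalent formulation: hide df2 where df1 is missing
same = df2.mask(df1.isna())

print(result)
```

Both calls leave df2 untouched, unlike the in-place boolean assignment, which modifies df2 directly.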

Pandas: Find first occurrence - on daily basis in a timeseries

I'm struggling with this so any input appreciated. I want to iterate over the values in a dataframe column and return the first instance when a value is seen every day. Groupby looked to be a good option for this but when using df.groupby(grouper).first() with grouper set at daily the following output is seen.
In [95]:
df.groupby(grouper).first()
Out[95]:
test_1
2014-03-04 1.0
2014-03-05 1.0
This only gives the day on which the value was seen in test_1, without resetting first() on a daily basis, which is what I need (see the desired output below).
I want to preserve the time this value was seen in the following format:
This is the input dataframe:
test_1
2014-03-04 09:00:00 NaN
2014-03-04 10:00:00 NaN
2014-03-04 11:00:00 NaN
2014-03-04 12:00:00 NaN
2014-03-04 13:00:00 NaN
2014-03-04 14:00:00 1.0
2014-03-04 15:00:00 NaN
2014-03-04 16:00:00 1.0
2014-03-05 09:00:00 1.0
This is the desired output:
test_1 test_output
2014-03-04 09:00:00 NaN NaN
2014-03-04 10:00:00 NaN NaN
2014-03-04 11:00:00 NaN NaN
2014-03-04 12:00:00 NaN NaN
2014-03-04 13:00:00 NaN NaN
2014-03-04 14:00:00 1.0 1.0
2014-03-04 15:00:00 NaN NaN
2014-03-04 16:00:00 1.0 NaN
2014-03-05 09:00:00 1.0 NaN
I just want to mark the time when an event first occurs in a new column named test_output.
Admins. Please note this question is different from the other marked as a duplicate as this requires a rolling one day first occurrence.
Try this, using this data:
rng = pd.DataFrame( {'test_1': [None, None,None, None, 1,1, 1 , None, None, None,1 , None, None, None,]}, index = pd.date_range('4/2/2014', periods=14, freq='BH'))
rng
test_1
2014-04-02 09:00:00 NaN
2014-04-02 10:00:00 NaN
2014-04-02 11:00:00 NaN
2014-04-02 12:00:00 NaN
2014-04-02 13:00:00 1.0
2014-04-02 14:00:00 1.0
2014-04-02 15:00:00 1.0
2014-04-02 16:00:00 NaN
2014-04-03 09:00:00 NaN
2014-04-03 10:00:00 NaN
2014-04-03 11:00:00 1.0
2014-04-03 12:00:00 NaN
2014-04-03 13:00:00 NaN
2014-04-03 14:00:00 NaN
The output is this:
rng['test_output'] = rng['test_1'].loc[rng.groupby(pd.Grouper(freq='D'))['test_1'].idxmin()]
test_1 test_output
2014-04-02 09:00:00 NaN NaN
2014-04-02 10:00:00 NaN NaN
2014-04-02 11:00:00 NaN NaN
2014-04-02 12:00:00 NaN NaN
2014-04-02 13:00:00 1.0 1.0
2014-04-02 14:00:00 1.0 NaN
2014-04-02 15:00:00 1.0 NaN
2014-04-02 16:00:00 NaN NaN
2014-04-03 09:00:00 NaN NaN
2014-04-03 10:00:00 NaN NaN
2014-04-03 11:00:00 1.0 1.0
2014-04-03 12:00:00 NaN NaN
2014-04-03 13:00:00 NaN NaN
2014-04-03 14:00:00 NaN NaN
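A variant sketch of the same idea that does not rely on idxmin: drop the missing rows first, then take the first remaining row of each calendar day. It follows the per-calendar-day interpretation used in the answer above; the data is made up, and days without any event simply contribute nothing.

```python
import numpy as np
import pandas as pd

index = pd.to_datetime([
    "2014-03-04 13:00", "2014-03-04 14:00", "2014-03-04 15:00", "2014-03-04 16:00",
    "2014-03-05 09:00", "2014-03-05 10:00",
])
df = pd.DataFrame({"test_1": [np.nan, 1.0, np.nan, 1.0, np.nan, 1.0]}, index=index)

# Keep only rows where an event was recorded
valid = df["test_1"].dropna()

# First event of each calendar day, still labelled with its original timestamp
first_per_day = valid.groupby(valid.index.normalize()).head(1)

# Assignment aligns on the index, so every other row gets NaN
df["test_output"] = first_per_day
print(df)
```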
