Finding the maximum stretch of null values and generating a flag - Python

I have a dataframe with a datetime index and two columns. For each particular date, I have to find the maximum stretch of null values in column 'X' and replace those rows with zero in both columns for that date. In addition, I have to create a third column named 'flag', which carries 1 for every zero imputation in the other two columns and 0 otherwise. In the example below, on January 1st the maximum stretch of null values is 3 rows, so I have to replace those with zero. Similarly, I have to repeat the process for January 2nd.
Below is my sample data:
Datetime X Y
01-01-2018 00:00 1 1
01-01-2018 00:05 nan 2
01-01-2018 00:10 2 nan
01-01-2018 00:15 3 4
01-01-2018 00:20 2 2
01-01-2018 00:25 nan 1
01-01-2018 00:30 nan nan
01-01-2018 00:35 nan nan
01-01-2018 00:40 4 4
02-01-2018 00:00 nan nan
02-01-2018 00:05 2 3
02-01-2018 00:10 2 2
02-01-2018 00:15 2 5
02-01-2018 00:20 2 2
02-01-2018 00:25 nan nan
02-01-2018 00:30 nan 1
02-01-2018 00:35 3 nan
02-01-2018 00:40 nan nan
"Below is the result that I am expecting"
Datetime X Y Flag
01-01-2018 00:00 1 1 0
01-01-2018 00:05 nan 2 0
01-01-2018 00:10 2 nan 0
01-01-2018 00:15 3 4 0
01-01-2018 00:20 2 2 0
01-01-2018 00:25 0 0 1
01-01-2018 00:30 0 0 1
01-01-2018 00:35 0 0 1
01-01-2018 00:40 4 4 0
02-01-2018 00:00 nan nan 0
02-01-2018 00:05 2 3 0
02-01-2018 00:10 2 2 0
02-01-2018 00:15 2 5 0
02-01-2018 00:20 2 2 0
02-01-2018 00:25 nan nan 0
02-01-2018 00:30 nan 1 0
02-01-2018 00:35 3 nan 0
02-01-2018 00:40 nan nan 0
This question is an extension of a previous question: Python - Find maximum null values in stretch and replacing with 0
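For reference, here is a minimal sketch that builds this sample frame (an assumption on my part: a 5-minute DatetimeIndex named Datetime; note that the outputs below parse 02-01-2018 month-first, as 2018-02-01):
import numpy as np
import pandas as pd

idx = pd.date_range('2018-01-01 00:00', periods=9, freq='5min').append(
      pd.date_range('2018-02-01 00:00', periods=9, freq='5min'))
df = pd.DataFrame({'X': [1, np.nan, 2, 3, 2, np.nan, np.nan, np.nan, 4,
                         np.nan, 2, 2, 2, 2, np.nan, np.nan, 3, np.nan],
                   'Y': [1, 2, np.nan, 4, 2, 1, np.nan, np.nan, 4,
                         np.nan, 3, 2, 5, 2, np.nan, 1, np.nan, np.nan]},
                  index=idx.rename('Datetime'))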

First, create consecutive groups for each column, filled with unique values (column Y is scaled by the frame length so its group ids never collide with those of X):
df1 = df.isna()
df2 = df1.ne(df1.groupby(df1.index.date).shift()).cumsum().where(df1)
df2['Y'] *= len(df2)
print (df2)
X Y
Datetime
2018-01-01 00:00:00 NaN NaN
2018-01-01 00:05:00 2.0 NaN
2018-01-01 00:10:00 NaN 36.0
2018-01-01 00:15:00 NaN NaN
2018-01-01 00:20:00 NaN NaN
2018-01-01 00:25:00 4.0 NaN
2018-01-01 00:30:00 4.0 72.0
2018-01-01 00:35:00 4.0 72.0
2018-01-01 00:40:00 NaN NaN
2018-02-01 00:00:00 6.0 108.0
2018-02-01 00:05:00 NaN NaN
2018-02-01 00:10:00 NaN NaN
2018-02-01 00:15:00 NaN NaN
2018-02-01 00:20:00 NaN NaN
2018-02-01 00:25:00 8.0 144.0
2018-02-01 00:30:00 8.0 NaN
2018-02-01 00:35:00 NaN 180.0
2018-02-01 00:40:00 10.0 180.0
Then get the group with the maximum count - here group 4.0:
a = df2.stack().value_counts().index[0]
print (a)
4.0
Get a boolean mask of the matching rows, set those rows to 0, and for the Flag column cast the mask to integer (True/False to 1/0 mapping):
mask = df2.eq(a).any(axis=1)
df.loc[mask,:] = 0
df['Flag'] = mask.astype(int)
print (df)
X Y Flag
Datetime
2018-01-01 00:00:00 1.0 1.0 0
2018-01-01 00:05:00 NaN 2.0 0
2018-01-01 00:10:00 2.0 NaN 0
2018-01-01 00:15:00 3.0 4.0 0
2018-01-01 00:20:00 2.0 2.0 0
2018-01-01 00:25:00 0.0 0.0 1
2018-01-01 00:30:00 0.0 0.0 1
2018-01-01 00:35:00 0.0 0.0 1
2018-01-01 00:40:00 4.0 4.0 0
2018-02-01 00:00:00 NaN NaN 0
2018-02-01 00:05:00 2.0 3.0 0
2018-02-01 00:10:00 2.0 2.0 0
2018-02-01 00:15:00 2.0 5.0 0
2018-02-01 00:20:00 2.0 2.0 0
2018-02-01 00:25:00 NaN NaN 0
2018-02-01 00:30:00 NaN 1.0 0
2018-02-01 00:35:00 3.0 NaN 0
2018-02-01 00:40:00 NaN NaN 0
EDIT:
Added a new condition to match only dates from a list:
dates = df.index.floor('d')
filtered = ['2018-01-01','2019-01-01']
m = dates.isin(filtered)
df1 = df.isna() & m[:, None]
df2 = df1.ne(df1.groupby(dates).shift()).cumsum().where(df1)
df2['Y'] *= len(df2)
print (df2)
X Y
Datetime
2018-01-01 00:00:00 NaN NaN
2018-01-01 00:05:00 2.0 NaN
2018-01-01 00:10:00 NaN 36.0
2018-01-01 00:15:00 NaN NaN
2018-01-01 00:20:00 NaN NaN
2018-01-01 00:25:00 4.0 NaN
2018-01-01 00:30:00 4.0 72.0
2018-01-01 00:35:00 4.0 72.0
2018-01-01 00:40:00 NaN NaN
2018-02-01 00:00:00 NaN NaN
2018-02-01 00:05:00 NaN NaN
2018-02-01 00:10:00 NaN NaN
2018-02-01 00:15:00 NaN NaN
2018-02-01 00:20:00 NaN NaN
2018-02-01 00:25:00 NaN NaN
2018-02-01 00:30:00 NaN NaN
2018-02-01 00:35:00 NaN NaN
2018-02-01 00:40:00 NaN NaN
a = df2.stack().value_counts().index[0]
#works also if the filtered rows contain no NaNs
#(prevents IndexError: index 0 is out of bounds)
#a = next(iter(df2.stack().value_counts().index), -1)
mask = df2.eq(a).any(axis=1)
df.loc[mask,:] = 0
df['Flag'] = mask.astype(int)
print (df)
X Y Flag
Datetime
2018-01-01 00:00:00 1.0 1.0 0
2018-01-01 00:05:00 NaN 2.0 0
2018-01-01 00:10:00 2.0 NaN 0
2018-01-01 00:15:00 3.0 4.0 0
2018-01-01 00:20:00 2.0 2.0 0
2018-01-01 00:25:00 0.0 0.0 1
2018-01-01 00:30:00 0.0 0.0 1
2018-01-01 00:35:00 0.0 0.0 1
2018-01-01 00:40:00 4.0 4.0 0
2018-02-01 00:00:00 NaN NaN 0
2018-02-01 00:05:00 2.0 3.0 0
2018-02-01 00:10:00 2.0 2.0 0
2018-02-01 00:15:00 2.0 5.0 0
2018-02-01 00:20:00 2.0 2.0 0
2018-02-01 00:25:00 NaN NaN 0
2018-02-01 00:30:00 NaN 1.0 0
2018-02-01 00:35:00 3.0 NaN 0
2018-02-01 00:40:00 NaN NaN 0

Related

Pandas fillna() method not filling all missing values

I have rain and temp data sourced from Environment Canada but it contains some NaN values.
start_date = '2015-12-31'
end_date = '2021-05-26'
mask = (data['date'] > start_date) & (data['date'] <= end_date)
df = data.loc[mask]
print(df)
date time rain_gauge_value temperature
8760 2016-01-01 00:00:00 0.0 -2.9
8761 2016-01-01 01:00:00 0.0 -3.4
8762 2016-01-01 02:00:00 0.0 -3.6
8763 2016-01-01 03:00:00 0.0 -3.6
8764 2016-01-01 04:00:00 0.0 -4.0
... ... ... ... ...
56107 2021-05-26 19:00:00 0.0 22.0
56108 2021-05-26 20:00:00 0.0 21.5
56109 2021-05-26 21:00:00 0.0 21.1
56110 2021-05-26 22:00:00 0.0 19.5
56111 2021-05-26 23:00:00 0.0 18.5
[47352 rows x 4 columns]
Find the rows with a NaN value
null = df[df['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
11028 2016-04-04 12:00:00 NaN -6.9
11986 2016-05-14 10:00:00 NaN NaN
11987 2016-05-14 11:00:00 NaN NaN
11988 2016-05-14 12:00:00 NaN NaN
11989 2016-05-14 13:00:00 NaN NaN
... ... ... ... ...
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
[6346 rows x 4 columns]
This is the dataframe I want to use to fill the NaN values:
print(rain_df)
date time rain_gauge_value temperature
0 2015-12-28 00:00:00 0.1 -6.0
1 2015-12-28 01:00:00 0.0 -7.0
2 2015-12-28 02:00:00 0.0 -8.0
3 2015-12-28 03:00:00 0.0 -8.0
4 2015-12-28 04:00:00 0.0 -7.0
... ... ... ... ...
48043 2021-06-19 19:00:00 0.6 20.0
48044 2021-06-19 20:00:00 0.6 19.0
48045 2021-06-19 21:00:00 0.8 18.0
48046 2021-06-19 22:00:00 0.4 17.0
48047 2021-06-19 23:00:00 0.0 16.0
[48048 rows x 4 columns]
But when I use the fillna() method, some of the values don't get substituted.
null = null.fillna(rain_df)
null = null[null['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
48057 2020-06-25 09:00:00 NaN NaN
48058 2020-06-25 10:00:00 NaN NaN
48059 2020-06-25 11:00:00 NaN NaN
48060 2020-06-25 12:00:00 NaN NaN
48586 2020-07-17 10:00:00 NaN NaN
48587 2020-07-17 11:00:00 NaN NaN
48588 2020-07-17 12:00:00 NaN NaN
49022 2020-08-04 14:00:00 NaN NaN
49023 2020-08-04 15:00:00 NaN NaN
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
How can I resolve this issue?
With fillna you usually want a strategy, like filling with the previous/next value, the column mean, etc. What we can do is this:
nulls_index = df['rain_gauge_value'].isnull()
df = df.ffill()  # forward-fill as an example; bfill() or interpolate() are alternatives
nulls_after_fill = df[nulls_index]
take a look at:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
You need to inform pandas how you want to patch. It may be obvious to you that you want to use the "patch" dataframe's values when the dates and times line up, but it won't be obvious to pandas. See my dummy example:
from datetime import date, time

import numpy as np
import pandas as pd

raw = pd.DataFrame(dict(date=[date(2015,12,28), date(2015,12,28)],
                        time=[time(0,0,0), time(0,0,1)],
                        temp=[1., np.nan], rain=[4., np.nan]))
raw
date time temp rain
0 2015-12-28 00:00:00 1.0 4.0
1 2015-12-28 00:00:01 NaN NaN
patch = pd.DataFrame(dict(date=[date(2015,12,28), date(2015,12,28)], time=[time(0,0,0),time(0,0,1)],temp=[5.,5.],rain=[10.,10.]))
patch
date time temp rain
0 2015-12-28 00:00:00 5.0 10.0
1 2015-12-28 00:00:01 5.0 10.0
You need the indexes of raw and patch to correspond to how you want to patch the raw data (in this case, based on date and time):
raw.set_index(['date','time']).fillna(patch.set_index(['date','time']))
returns
temp rain
date time
2015-12-28 00:00:00 1.0 4.0
00:00:01 5.0 10.0
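Applied to the question's frames, a minimal sketch along the same lines (assuming df and rain_df both keep the date and time columns shown above):
# align on (date, time) so fillna patches by timestamp, not by row position
patched = (df.set_index(['date', 'time'])
             .fillna(rain_df.set_index(['date', 'time']))
             .reset_index())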

How to find occurrence of consecutive events in python timeseries data frame?

I have got a time series of meteorological observations with date and value columns:
df = pd.DataFrame({'date':['11/10/2017 0:00','11/10/2017 03:00','11/10/2017 06:00','11/10/2017 09:00','11/10/2017 12:00',
'11/11/2017 0:00','11/11/2017 03:00','11/11/2017 06:00','11/11/2017 09:00','11/11/2017 12:00',
'11/12/2017 00:00','11/12/2017 03:00','11/12/2017 06:00','11/12/2017 09:00','11/12/2017 12:00'],
'value':[850,np.nan,np.nan,np.nan,np.nan,500,650,780,np.nan,800,350,690,780,np.nan,np.nan],
'consecutive_hour': [ 3,0,0,0,0,3,6,9,0,3,3,6,9,0,0]})
With this DataFrame, I want a third column consecutive_hour such that if the value at a particular timestamp is less than 1000, the corresponding consecutive_hour value is 3 (hours), and consecutive such occurrences accumulate to 6, 9, ... as above.
Lastly, I want to summarize the table counting consecutive hours occurrence and number of days such that the summary table looks like:
df_summary = pd.DataFrame({'consecutive_hours':[3,6,9,12],
'number_of_day':[2,0,2,0]})
I tried several online solutions and methods like shift(), diff() etc., as mentioned in: How to groupby consecutive values in pandas DataFrame, and more. I spent several days on it but no luck yet.
I would highly appreciate help on this issue.
Thanks!
Input data:
>>> df
date value
0 2017-11-10 00:00:00 850.0
1 2017-11-10 03:00:00 NaN
2 2017-11-10 06:00:00 NaN
3 2017-11-10 09:00:00 NaN
4 2017-11-10 12:00:00 NaN
5 2017-11-11 00:00:00 500.0
6 2017-11-11 03:00:00 650.0
7 2017-11-11 06:00:00 780.0
8 2017-11-11 09:00:00 NaN
9 2017-11-11 12:00:00 800.0
10 2017-11-12 00:00:00 350.0
11 2017-11-12 03:00:00 690.0
12 2017-11-12 06:00:00 780.0
13 2017-11-12 09:00:00 NaN
14 2017-11-12 12:00:00 NaN
The cumcount_reset function is adapted from this answer of @jezrael:
Python pandas cumsum with reset everytime there is a 0
cumcount_reset = \
lambda b: b.cumsum().sub(b.cumsum().where(~b).ffill().fillna(0)).astype(int)
df["consecutive_hour"] = (df.set_index("date")["value"] < 1000) \
.groupby(pd.Grouper(freq="D")) \
.apply(lambda b: cumcount_reset(b)).mul(3) \
.reset_index(drop=True)
Output result:
>>> df
date value consecutive_hour
0 2017-11-10 00:00:00 850.0 3
1 2017-11-10 03:00:00 NaN 0
2 2017-11-10 06:00:00 NaN 0
3 2017-11-10 09:00:00 NaN 0
4 2017-11-10 12:00:00 NaN 0
5 2017-11-11 00:00:00 500.0 3
6 2017-11-11 03:00:00 650.0 6
7 2017-11-11 06:00:00 780.0 9
8 2017-11-11 09:00:00 NaN 0
9 2017-11-11 12:00:00 800.0 3
10 2017-11-12 00:00:00 350.0 3
11 2017-11-12 03:00:00 690.0 6
12 2017-11-12 06:00:00 780.0 9
13 2017-11-12 09:00:00 NaN 0
14 2017-11-12 12:00:00 NaN 0
Summary table
df_summary = df.loc[df.groupby(pd.Grouper(key="date", freq="D"))["consecutive_hour"] \
.apply(lambda h: (h - h.shift(-1).fillna(0)) > 0),
"consecutive_hour"] \
.value_counts().reindex([3, 6, 9, 12], fill_value=0) \
.rename("number_of_day") \
.rename_axis("consecutive_hour") \
.reset_index()
>>> df_summary
consecutive_hour number_of_day
0 3 2
1 6 0
2 9 2
3 12 0
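For intuition, here is the helper on a tiny boolean series (a quick sketch, with the expected output worked out by hand):
import pandas as pd

cumcount_reset = \
lambda b: b.cumsum().sub(b.cumsum().where(~b).ffill().fillna(0)).astype(int)

b = pd.Series([True, True, False, True, True, True])
print(cumcount_reset(b).tolist())
# [1, 2, 0, 1, 2, 3] - running count of consecutive Trues, reset at each False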

Adding empty dataframe rows based on missing datetime values

I am trying to add rows to my pandas dataframe as such:
import pandas as pd
import datetime as dt
d={'datetime':[dt.datetime(2018,3,1,0,0),dt.datetime(2018,3,1,0,10),dt.datetime(2018,3,1,0,40)],
'value':[4.,5.,1.]}
df=pd.DataFrame(d)
Which outputs:
datetime value
0 2018-03-01 00:00:00 4.0
1 2018-03-01 00:10:00 5.0
2 2018-03-01 00:40:00 1.0
What I want to do is add rows from 00:00:00 to 00:40:00, to show every 5 minutes. My desired output looks like this:
datetime value
0 2018-03-01 00:00:00 4.0
1 2018-03-01 00:05:00 NaN
2 2018-03-01 00:10:00 5.0
3 2018-03-01 00:15:00 NaN
4 2018-03-01 00:20:00 NaN
5 2018-03-01 00:25:00 NaN
6 2018-03-01 00:30:00 NaN
7 2018-03-01 00:35:00 NaN
8 2018-03-01 00:40:00 1.0
How do I get there?
You can use pd.DataFrame.resample:
df = df.resample('5Min', on='datetime').first()\
       .drop(columns='datetime').reset_index()
print(df)
datetime value
0 2018-03-01 00:00:00 4.0
1 2018-03-01 00:05:00 NaN
2 2018-03-01 00:10:00 5.0
3 2018-03-01 00:15:00 NaN
4 2018-03-01 00:20:00 NaN
5 2018-03-01 00:25:00 NaN
6 2018-03-01 00:30:00 NaN
7 2018-03-01 00:35:00 NaN
8 2018-03-01 00:40:00 1.0
First, you can create a dataframe with your final datetime index and then assign the second one to it:
import numpy as np

df1 = pd.DataFrame({'value': np.nan},
                   index=pd.date_range('2018-03-01 00:00:00',
                                       periods=9, freq='5min'))
print(df1)
#Output :
value
2018-03-01 00:00:00 NaN
2018-03-01 00:05:00 NaN
2018-03-01 00:10:00 NaN
2018-03-01 00:15:00 NaN
2018-03-01 00:20:00 NaN
2018-03-01 00:25:00 NaN
2018-03-01 00:30:00 NaN
2018-03-01 00:35:00 NaN
2018-03-01 00:40:00 NaN
Now, let's say your dataframe is the second one; you can add this to the code above:
d={'datetime':
[dt.datetime(2018,3,1,0,0),dt.datetime(2018,3,1,0,10),dt.datetime(2018,3,1,0,40)],
'value':[4.,5.,1.]}
df2=pd.DataFrame(d)
df2.datetime = pd.to_datetime(df2.datetime)
df2.set_index('datetime',inplace=True)
print(df2)
#Output
value
datetime
2018-03-01 00:00:00 4.0
2018-03-01 00:10:00 5.0
2018-03-01 00:40:00 1.0
Finally:
df1['value'] = df2['value']
print(df1)
#output
value
2018-03-01 00:00:00 4.0
2018-03-01 00:05:00 NaN
2018-03-01 00:10:00 5.0
2018-03-01 00:15:00 NaN
2018-03-01 00:20:00 NaN
2018-03-01 00:25:00 NaN
2018-03-01 00:30:00 NaN
2018-03-01 00:35:00 NaN
2018-03-01 00:40:00 1.0
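As an aside (a sketch, not from either answer above), DataFrame.asfreq produces the same 5-minute grid once datetime is the index, reusing the d dict from the question:
df = pd.DataFrame(d)
out = (df.set_index('datetime')
         .asfreq('5min')      # inserts the missing 5-minute rows as NaN
         .reset_index())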

How can I replace specific values in a time-series dataframe in pandas?

I have the dataframe below (date/time is a MultiIndex) and I want to replace the column values between 00:00:00 and 07:00:00 with this numpy array:
[[ 21.63920663 21.62012822 20.9900515 21.23217008 21.19482458
21.10839656 20.89631935 20.79977166 20.99176729 20.91567565
20.87258765 20.76210464 20.50357827 20.55897631 20.38005033
20.38227309 20.54460993 20.37707293 20.08279925 20.09955877
20.02559575 20.12390737 20.2917257 20.20056711 20.1589065
20.41302289 20.48000767 20.55604102 20.70255192]]
date time
2018-01-26 00:00:00 21.65
00:15:00 NaN
00:30:00 NaN
00:45:00 NaN
01:00:00 NaN
01:15:00 NaN
01:30:00 NaN
01:45:00 NaN
02:00:00 NaN
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
03:15:00 NaN
03:30:00 NaN
03:45:00 NaN
04:00:00 NaN
04:15:00 NaN
04:30:00 NaN
04:45:00 NaN
05:00:00 NaN
05:15:00 NaN
05:30:00 NaN
05:45:00 NaN
06:00:00 NaN
06:15:00 NaN
06:30:00 NaN
06:45:00 NaN
07:00:00 NaN
07:15:00 NaN
07:30:00 NaN
07:45:00 NaN
08:00:00 NaN
08:15:00 NaN
08:30:00 NaN
08:45:00 NaN
09:00:00 NaN
09:15:00 NaN
09:30:00 NaN
09:45:00 NaN
10:00:00 NaN
10:15:00 NaN
10:30:00 NaN
10:45:00 NaN
11:00:00 NaN
Name: temp, dtype: float64
<class 'datetime.time'>
How can I do this?
You can use slicers:
idx = pd.IndexSlice
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = 1
Or if the second level holds datetime.time objects:
import datetime
idx = pd.IndexSlice
df1.loc[idx[:, datetime.time(0, 0, 0):datetime.time(2, 0, 0)],:] = 1
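The sample df1 used below can be reproduced with something like this (an assumption on my part; the time level is stored as strings here so the string slicer works):
import numpy as np
import pandas as pd

# two dates x 15-minute times from 00:00:00 to 03:00:00, kept as strings
times = [str(td)[-8:] for td in pd.timedelta_range('0h', '3h', freq='15min')]
mi = pd.MultiIndex.from_product([['2018-01-26', '2018-01-27'], times],
                                names=['date', 'time'])
df1 = pd.DataFrame({'aaa': np.nan}, index=mi)
df1.loc[('2018-01-26', '00:00:00'), 'aaa'] = 21.65
df1.loc[('2018-01-27', '00:00:00'), 'aaa'] = 2.0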
Sample:
print (df1)
aaa
date time
2018-01-26 00:00:00 21.65
00:15:00 NaN
00:30:00 NaN
00:45:00 NaN
01:00:00 NaN
01:15:00 NaN
01:30:00 NaN
01:45:00 NaN
02:00:00 NaN
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 2.00
00:15:00 NaN
00:30:00 NaN
00:45:00 NaN
01:00:00 NaN
01:15:00 NaN
01:30:00 NaN
01:45:00 NaN
02:00:00 NaN
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
idx = pd.IndexSlice
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = 1
print (df1)
aaa
date time
2018-01-26 00:00:00 1.0
00:15:00 1.0
00:30:00 1.0
00:45:00 1.0
01:00:00 1.0
01:15:00 1.0
01:30:00 1.0
01:45:00 1.0
02:00:00 1.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 1.0
00:15:00 1.0
00:30:00 1.0
00:45:00 1.0
01:00:00 1.0
01:15:00 1.0
01:30:00 1.0
01:45:00 1.0
02:00:00 1.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
EDIT:
For assigning an array it is necessary to use numpy.tile to repeat it by the number of unique values in the first level:
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = np.tile(np.arange(1, 10),len(df1.index.levels[0]))
print (df1)
aaa
date time
2018-01-26 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
A more general solution generates the array from the length of the slice:
idx = pd.IndexSlice
len0 = df1.loc[idx[df1.index.levels[0][0], '00:00:00':'02:00:00'],:].shape[0]
len1 = len(df1.index.levels[0])
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = np.tile(np.arange(1, len0 + 1), len1)
Tested with times:
import datetime
idx = pd.IndexSlice
arr =np.tile(np.arange(1, 10),len(df1.index.levels[0]))
df1.loc[idx[:, datetime.time(0, 0, 0):datetime.time(2, 0, 0)],:] = arr
print (df1)
aaa
date time
2018-01-26 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
EDIT:
At last the problem was found - my solution works with a one-column DataFrame, but when working with a Series you need to remove the trailing ", :" from the .loc indexer:
arr = np.array([[ 21.63920663, 21.62012822, 20.9900515, 21.23217008, 21.19482458, 21.10839656,
20.89631935, 20.79977166, 20.99176729, 20.91567565, 20.87258765, 20.76210464,
20.50357827, 20.55897631, 20.38005033, 20.38227309, 20.54460993, 20.37707293,
20.08279925, 20.09955877, 20.02559575, 20.12390737, 20.2917257, 20.20056711,
20.1589065, 20.41302289, 20.48000767, 20.55604102, 20.70255192]])
import datetime
idx = pd.IndexSlice
df1.loc[idx[:, datetime.time(0, 0, 0): datetime.time(7, 0, 0)]] = arr[0]
# no trailing ", :" here, because df1 is a Series rather than a DataFrame

Add missing times in dataframe column with pandas

I have a dataframe like so:
df = pd.DataFrame({'time':['23:59:45','23:49:50','23:59:55','00:00:00','00:00:05','00:00:10','00:00:15'],
'X':[-5,-4,-2,5,6,10,11],
'Y':[3,4,5,9,20,22,23]})
As you can see, the times are strings and cross midnight, and the readings come every 5 seconds!
My goal, however, is to add empty rows (filled with NaN, for example) so that there is a row for every second. Finally, the time column should be converted to a timestamp and set as the index.
Could you please suggest a smart and elegant way to achieve my goal?
Here is what the output should look like:
X Y
time
23:59:45 -5.0 3.0
23:59:46 NaN NaN
23:59:47 NaN NaN
23:59:48 NaN NaN
... ... ...
00:00:10 10.0 22.0
00:00:11 NaN NaN
00:00:12 NaN NaN
00:00:13 NaN NaN
00:00:14 NaN NaN
00:00:15 11.0 23.0
Note: I do not need the dates.
Use to_timedelta with reindex by timedelta_range:
df['time'] = pd.to_timedelta(df['time'])
idx = pd.timedelta_range('0', '23:59:59', freq='S', name='time')
df = df.set_index('time').reindex(idx).reset_index()
print (df.head(10))
time X Y
0 00:00:00 5.0 9.0
1 00:00:01 NaN NaN
2 00:00:02 NaN NaN
3 00:00:03 NaN NaN
4 00:00:04 NaN NaN
5 00:00:05 6.0 20.0
6 00:00:06 NaN NaN
7 00:00:07 NaN NaN
8 00:00:08 NaN NaN
9 00:00:09 NaN NaN
If you need to replace the NaNs:
df = df.set_index('time').reindex(idx, fill_value=0).reset_index()
print (df.head(10))
time X Y
0 00:00:00 5 9
1 00:00:01 0 0
2 00:00:02 0 0
3 00:00:03 0 0
4 00:00:04 0 0
5 00:00:05 6 20
6 00:00:06 0 0
7 00:00:07 0 0
8 00:00:08 0 0
9 00:00:09 0 0
Another solution uses resample, but it is possible that some rows are missing at the end:
df = df.set_index('time').resample('S').first()
print (df.tail(10))
X Y
time
23:59:46 NaN NaN
23:59:47 NaN NaN
23:59:48 NaN NaN
23:59:49 NaN NaN
23:59:50 NaN NaN
23:59:51 NaN NaN
23:59:52 NaN NaN
23:59:53 NaN NaN
23:59:54 NaN NaN
23:59:55 -2.0 5.0
EDIT1:
idx1 = pd.timedelta_range('23:59:45', '23:59:59', freq='S', name='time')
idx2 = pd.timedelta_range('0', '00:00:15', freq='S', name='time')
idx = np.concatenate([idx1, idx2])
df['time'] = pd.to_timedelta(df['time'])
df = df.set_index('time').reindex(idx).reset_index()
print (df.head(10))
time X Y
0 23:59:45 -5.0 3.0
1 23:59:46 NaN NaN
2 23:59:47 NaN NaN
3 23:59:48 NaN NaN
4 23:59:49 NaN NaN
5 23:59:50 NaN NaN
6 23:59:51 NaN NaN
7 23:59:52 NaN NaN
8 23:59:53 NaN NaN
9 23:59:54 NaN NaN
print (df.tail(10))
time X Y
21 00:00:06 NaN NaN
22 00:00:07 NaN NaN
23 00:00:08 NaN NaN
24 00:00:09 NaN NaN
25 00:00:10 10.0 22.0
26 00:00:11 NaN NaN
27 00:00:12 NaN NaN
28 00:00:13 NaN NaN
29 00:00:14 NaN NaN
30 00:00:15 11.0 23.0
EDIT:
Another solution - shift times that belong to the next day into 1-day timedeltas:
df['time'] = pd.to_timedelta(df['time'])
a = pd.to_timedelta(df['time'].diff().dt.days.abs().cumsum().fillna(1).sub(1), unit='d')
df['time'] = df['time'] + a
print (df)
X Y time
0 -5 3 0 days 23:59:45
1 -4 4 0 days 23:49:50
2 -2 5 0 days 23:59:55
3 5 9 1 days 00:00:00
4 6 20 1 days 00:00:05
5 10 22 1 days 00:00:10
6 11 23 1 days 00:00:15
idx = pd.timedelta_range(df['time'].min(), df['time'].max(), freq='S', name='time')
df = df.set_index('time').reindex(idx).reset_index()
print (df.head(10))
time X Y
0 23:49:50 -4.0 4.0
1 23:49:51 NaN NaN
2 23:49:52 NaN NaN
3 23:49:53 NaN NaN
4 23:49:54 NaN NaN
5 23:49:55 NaN NaN
6 23:49:56 NaN NaN
7 23:49:57 NaN NaN
8 23:49:58 NaN NaN
9 23:49:59 NaN NaN
print (df.tail(10))
time X Y
616 1 days 00:00:06 NaN NaN
617 1 days 00:00:07 NaN NaN
618 1 days 00:00:08 NaN NaN
619 1 days 00:00:09 NaN NaN
620 1 days 00:00:10 10.0 22.0
621 1 days 00:00:11 NaN NaN
622 1 days 00:00:12 NaN NaN
623 1 days 00:00:13 NaN NaN
624 1 days 00:00:14 NaN NaN
625 1 days 00:00:15 11.0 23.0
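To finish per the question, set the time column as the index; a sketch (note that after this EDIT variant the values are timedeltas, where entries past midnight carry a 1-day component):
df = df.set_index('time')
# optional: convert the timedeltas back to plain clock times
# df.index = (pd.Timestamp('2000-01-01') + df.index).time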
