I have a pandas DataFrame at 1-minute frequency and I want to resample it based on threshold values (there are multiple thresholds in a NumPy array).
Here is an example of my dataset:
2018-01-01 00:01:00 0.867609
2018-01-01 00:02:00 0.544493
2018-01-01 00:03:00 0.958497
2018-01-01 00:04:00 0.371790
2018-01-01 00:05:00 0.470320
2018-01-01 00:06:00 0.757448
2018-01-01 00:07:00 0.198261
2018-01-01 00:08:00 0.666350
2018-01-01 00:09:00 0.392574
2018-01-01 00:10:00 0.627608
2018-01-01 00:11:00 0.414380
2018-01-01 00:12:00 0.120925
2018-01-01 00:13:00 0.559495
2018-01-01 00:14:00 0.260619
2018-01-01 00:15:00 0.982731
2018-01-01 00:16:00 0.996133
2018-01-01 00:17:00 0.410816
2018-01-01 00:18:00 0.366457
2018-01-01 00:19:00 0.927745
2018-01-01 00:20:00 0.626804
2018-01-01 00:21:00 0.223193
2018-01-01 00:22:00 0.007136
2018-01-01 00:23:00 0.245006
2018-01-01 00:24:00 0.491245
2018-01-01 00:25:00 0.215716
2018-01-01 00:26:00 0.932378
2018-01-01 00:27:00 0.366263
2018-01-01 00:28:00 0.522177
2018-01-01 00:29:00 0.614966
2018-01-01 00:30:00 0.670983
threshold=np.array([0.5,0.8,0.9])
What I want is to extract the data where it crosses the threshold values and, where it doesn't cross any threshold, to resample the data at 30 minutes instead.
Sample answer:
Threshold
2018-01-01 00:01:00 0.867609 0.8
2018-01-01 00:02:00 0.544493 0.5
2018-01-01 00:03:00 0.958497 0.9
2018-01-01 00:05:00 0.421055 NA
2018-01-01 00:06:00 0.757448 0.5
2018-01-01 00:07:00 0.198261 NA
2018-01-01 00:08:00 0.666350 0.5
2018-01-01 00:09:00 0.392574 NA
2018-01-01 00:10:00 0.627608 0.5
2018-01-01 00:12:00 0.414380 NA
2018-01-01 00:13:00 0.559495 0.5
2018-01-01 00:14:00 0.260619 NA
2018-01-01 00:15:00 0.982731 0.9
2018-01-01 00:16:00 0.996133 0.9
2018-01-01 00:18:00 0.388636 NA
2018-01-01 00:19:00 0.927745 0.9
2018-01-01 00:20:00 0.626804 0.5
2018-01-01 00:25:00 0.215716 NA
2018-01-01 00:26:00 0.932378 0.9
2018-01-01 00:27:00 0.366263 NA
2018-01-01 00:28:00 0.522177 0.5
2018-01-01 00:29:00 0.614966 0.5
2018-01-01 00:30:00 0.670983 0.5
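For reference, here is a minimal, hedged way to build a frame shaped like this sample (the values are random, so they won't match the numbers shown; the layout, column 0 for the timestamp and column 1 for the value, is assumed from the code further down):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # any seed, purely for repeatability
idx = pd.date_range('2018-01-01 00:01:00', periods=30, freq='1T')
df = pd.DataFrame({0: idx, 1: rng.random(30)})
threshold = np.array([0.5, 0.8, 0.9])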
I got the solution for resampling from @Scott Boston:
df = df.set_index(0)
g = df[1].lt(-22).mul(1).diff().bfill().ne(0).cumsum()
df.groupby(g).apply(lambda x: x.resample('1T', kind='period').mean().reset_index()
                    if (x.iloc[0] < -22).any() else
                    x.resample('30T', kind='period').mean().reset_index())\
  .reset_index(drop=True)
Use pd.cut:
threshold=np.array([0.5,0.8,0.9]).tolist()
pd.cut(df[1],bins=threshold+[np.inf],labels=threshold)
Output:
0 0.8
1 0.5
2 0.9
3 NaN
4 NaN
5 0.5
6 NaN
7 0.5
8 NaN
9 0.5
10 NaN
11 NaN
12 0.5
13 NaN
14 0.9
15 0.9
16 NaN
17 NaN
18 0.9
19 0.5
20 NaN
21 NaN
22 NaN
23 NaN
24 NaN
25 0.9
26 NaN
27 0.5
28 0.5
29 0.5
Name: 1, dtype: category
Categories (3, float64): [0.5 < 0.8 < 0.9]
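One detail worth noting here (my observation, not part of the original answer): pd.cut builds right-closed intervals by default, so the bins are (0.5, 0.8], (0.8, 0.9] and (0.9, inf]. A value exactly equal to a threshold falls into the lower-labelled bin, and anything at or below 0.5 becomes NaN; pass right=False if the opposite convention is wanted.
pd.cut(pd.Series([0.50, 0.51, 0.90, 0.91]), bins=threshold + [np.inf], labels=threshold)
# 0    NaN   <- exactly 0.5 is outside the left-open first bin
# 1    0.5
# 2    0.8   <- exactly 0.9 still falls in (0.8, 0.9]
# 3    0.9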
Now, let's add this to the dataframe and drop the consecutive NaNs, keeping only the first NaN row of each run.
df['Threshold'] = pd.cut(df[1],bins=threshold+[np.inf],labels=threshold)
mask = ~(df.Threshold.isnull() & (df.Threshold.isnull() == df.Threshold.isnull().shift(1)))
df[mask]
Output:
0 1 Threshold
0 2018-01-01 00:01:00 0.867609 0.8
1 2018-01-01 00:02:00 0.544493 0.5
2 2018-01-01 00:03:00 0.958497 0.9
3 2018-01-01 00:04:00 0.371790 NaN
5 2018-01-01 00:06:00 0.757448 0.5
6 2018-01-01 00:07:00 0.198261 NaN
7 2018-01-01 00:08:00 0.666350 0.5
8 2018-01-01 00:09:00 0.392574 NaN
9 2018-01-01 00:10:00 0.627608 0.5
10 2018-01-01 00:11:00 0.414380 NaN
12 2018-01-01 00:13:00 0.559495 0.5
13 2018-01-01 00:14:00 0.260619 NaN
14 2018-01-01 00:15:00 0.982731 0.9
15 2018-01-01 00:16:00 0.996133 0.9
16 2018-01-01 00:17:00 0.410816 NaN
18 2018-01-01 00:19:00 0.927745 0.9
19 2018-01-01 00:20:00 0.626804 0.5
20 2018-01-01 00:21:00 0.223193 NaN
25 2018-01-01 00:26:00 0.932378 0.9
26 2018-01-01 00:27:00 0.366263 NaN
27 2018-01-01 00:28:00 0.522177 0.5
28 2018-01-01 00:29:00 0.614966 0.5
29 2018-01-01 00:30:00 0.670983 0.5
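Note that the mask above keeps the first row of each below-threshold stretch with its original value rather than actually resampling the stretch. If the 30-minute resampling of those stretches is also wanted, as the question describes, a hedged sketch along these lines should get closer; it reuses the pd.cut labels and groups consecutive crossed / not-crossed runs (column 0 is the timestamp and column 1 the value, as in the sample; this is my adaptation, not part of the answer above):
import numpy as np
import pandas as pd

threshold = np.array([0.5, 0.8, 0.9]).tolist()

dfi = df.set_index(0)                                  # datetime index, value in column 1
dfi['Threshold'] = pd.cut(dfi[1], bins=threshold + [np.inf], labels=threshold)

crossed = dfi['Threshold'].notna()
g = crossed.ne(crossed.shift()).cumsum()               # label consecutive crossed / not-crossed runs

out = (dfi.groupby(g, group_keys=False)
          .apply(lambda x: x if x['Threshold'].notna().any()
                 # anchor the 30-minute bins at the start of each stretch
                 else x.resample('30T', origin='start').mean(numeric_only=True)))
print(out)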
Related
I have a dataframe with some data that I'm going to run simulations on. Each row is a datetime and a value. Because of the nature of the problem, I need to keep the original frequency of 1 hour when the value is above a certain threshold. When it's not, I could resample the data and run that part of the simulation on lower frequency data, in order to speed up the simulation.
My idea is to somehow group the dataframe by day (since I've noticed there are many whole days where the value stays below the threshold), check the max value over each group, and if the max is below the threshold then aggregate the data in that group into a single mean value.
Here's a minimal working example:
import pandas as pd
import numpy as np
threshold = 3
idx = pd.date_range("2018-01-01", periods=27, freq="H")
df = pd.Series(np.append(np.ones(26), 5), index=idx).to_frame("v")
print(df)
Output:
v
2018-01-01 00:00:00 1.0
2018-01-01 01:00:00 1.0
2018-01-01 02:00:00 1.0
2018-01-01 03:00:00 1.0
2018-01-01 04:00:00 1.0
2018-01-01 05:00:00 1.0
2018-01-01 06:00:00 1.0
2018-01-01 07:00:00 1.0
2018-01-01 08:00:00 1.0
2018-01-01 09:00:00 1.0
2018-01-01 10:00:00 1.0
2018-01-01 11:00:00 1.0
2018-01-01 12:00:00 1.0
2018-01-01 13:00:00 1.0
2018-01-01 14:00:00 1.0
2018-01-01 15:00:00 1.0
2018-01-01 16:00:00 1.0
2018-01-01 17:00:00 1.0
2018-01-01 18:00:00 1.0
2018-01-01 19:00:00 1.0
2018-01-01 20:00:00 1.0
2018-01-01 21:00:00 1.0
2018-01-01 22:00:00 1.0
2018-01-01 23:00:00 1.0
2018-01-02 00:00:00 1.0
2018-01-02 01:00:00 1.0
2018-01-02 02:00:00 5.0
The desired output of the operation would be this dataframe:
v
2018-01-01 00:00:00 1.0
2018-01-02 00:00:00 1.0
2018-01-02 01:00:00 1.0
2018-01-02 02:00:00 5.0
where the first value is the mean of the first day.
I think I'm getting close
grouped = df.resample("1D")
for name, group in grouped:
    if group["v"].max() <= 3:
        group['v'].agg("mean")
Any help is greatly appreciated.
So I found a solution.
grouped = df.resample("1D")

def conditionalAggregation(x):
    # collapse a whole day to a single row when it never exceeds the threshold
    if x['v'].max() <= 3:
        idx = [x.index[0].replace(hour=0, minute=0, second=0, microsecond=0)]
        # note: this uses the day's max; x['v'].mean() would match the stated goal
        # more literally (both give 1.0 here)
        return pd.DataFrame(x['v'].max(), index=idx, columns=['v'])
    else:
        # keep the original hourly rows for days that cross the threshold
        return x

conditionallyAggregated = grouped.apply(conditionalAggregation)
conditionallyAggregated = conditionallyAggregated.droplevel(level=0)
conditionallyAggregated
This gives the following df:
v
2018-01-01 00:00:00 1.0
2018-01-02 00:00:00 1.0
2018-01-02 01:00:00 1.0
2018-01-02 02:00:00 5.0
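An alternative, hedged sketch that avoids the row-wise apply by using a per-day transform (df and threshold are the ones from the example above): days whose maximum stays at or below the threshold are collapsed to a daily mean, the others keep their hourly rows.
daily_max = df['v'].groupby(pd.Grouper(freq='1D')).transform('max')

low = df.loc[daily_max <= threshold, 'v'].resample('1D').mean().dropna().to_frame('v')
high = df.loc[daily_max > threshold]

result = pd.concat([low, high]).sort_index()
print(result)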
I have a dataframe with a datetime column and two value columns. For each particular date, I have to find the maximum stretch of null values in column 'X' and replace those rows with zero in both columns for that date. In addition, I have to create a third column named 'Flag' which carries a value of 1 for every zero imputation in the other two columns and 0 otherwise. In the example below, on January 1st the maximum stretch of null values is 3 rows, so I have to replace them with zero. Similarly, I have to replicate the process for January 2nd.
Below is my sample data:
Datetime X Y
01-01-2018 00:00 1 1
01-01-2018 00:05 nan 2
01-01-2018 00:10 2 nan
01-01-2018 00:15 3 4
01-01-2018 00:20 2 2
01-01-2018 00:25 nan 1
01-01-2018 00:30 nan nan
01-01-2018 00:35 nan nan
01-01-2018 00:40 4 4
02-01-2018 00:00 nan nan
02-01-2018 00:05 2 3
02-01-2018 00:10 2 2
02-01-2018 00:15 2 5
02-01-2018 00:20 2 2
02-01-2018 00:25 nan nan
02-01-2018 00:30 nan 1
02-01-2018 00:35 3 nan
02-01-2018 00:40 nan nan
"Below is the result that I am expecting"
Datetime X Y Flag
01-01-2018 00:00 1 1 0
01-01-2018 00:05 nan 2 0
01-01-2018 00:10 2 nan 0
01-01-2018 00:15 3 4 0
01-01-2018 00:20 2 2 0
01-01-2018 00:25 0 0 1
01-01-2018 00:30 0 0 1
01-01-2018 00:35 0 0 1
01-01-2018 00:40 4 4 0
02-01-2018 00:00 nan nan 0
02-01-2018 00:05 2 3 0
02-01-2018 00:10 2 2 0
02-01-2018 00:15 2 5 0
02-01-2018 00:20 2 2 0
02-01-2018 00:25 nan nan 0
02-01-2018 00:30 nan 1 0
02-01-2018 00:35 3 nan 0
02-01-2018 00:40 nan nan 0
This question is the extension of previous question. Here is the link Python - Find maximum null values in stretch and replacing with 0
First, create consecutive groups for each column, filled with unique values:
df1 = df.isna()
df2 = df1.ne(df1.groupby(df1.index.date).shift()).cumsum().where(df1)
df2['Y'] *= len(df2)   # offset Y's group labels so they cannot collide with X's when stacked
print (df2)
X Y
Datetime
2018-01-01 00:00:00 NaN NaN
2018-01-01 00:05:00 2.0 NaN
2018-01-01 00:10:00 NaN 36.0
2018-01-01 00:15:00 NaN NaN
2018-01-01 00:20:00 NaN NaN
2018-01-01 00:25:00 4.0 NaN
2018-01-01 00:30:00 4.0 72.0
2018-01-01 00:35:00 4.0 72.0
2018-01-01 00:40:00 NaN NaN
2018-02-01 00:00:00 6.0 108.0
2018-02-01 00:05:00 NaN NaN
2018-02-01 00:10:00 NaN NaN
2018-02-01 00:15:00 NaN NaN
2018-02-01 00:20:00 NaN NaN
2018-02-01 00:25:00 8.0 144.0
2018-02-01 00:30:00 8.0 NaN
2018-02-01 00:35:00 NaN 180.0
2018-02-01 00:40:00 10.0 180.0
Then get the group label with the maximum count - here group 4:
a = df2.stack().value_counts().index[0]
print (a)
4.0
Get a mask of the matching rows, set them to 0, and create the Flag column by casting the mask to integer (True/False to 1/0):
mask = df2.eq(a).any(axis=1)
df.loc[mask,:] = 0
df['Flag'] = mask.astype(int)
print (df)
X Y Flag
Datetime
2018-01-01 00:00:00 1.0 1.0 0
2018-01-01 00:05:00 NaN 2.0 0
2018-01-01 00:10:00 2.0 NaN 0
2018-01-01 00:15:00 3.0 4.0 0
2018-01-01 00:20:00 2.0 2.0 0
2018-01-01 00:25:00 0.0 0.0 1
2018-01-01 00:30:00 0.0 0.0 1
2018-01-01 00:35:00 0.0 0.0 1
2018-01-01 00:40:00 4.0 4.0 0
2018-02-01 00:00:00 NaN NaN 0
2018-02-01 00:05:00 2.0 3.0 0
2018-02-01 00:10:00 2.0 2.0 0
2018-02-01 00:15:00 2.0 5.0 0
2018-02-01 00:20:00 2.0 2.0 0
2018-02-01 00:25:00 NaN NaN 0
2018-02-01 00:30:00 NaN 1.0 0
2018-02-01 00:35:00 3.0 NaN 0
2018-02-01 00:40:00 NaN NaN 0
EDIT:
Added a new condition to match only dates from a list:
dates = df.index.floor('d')
filtered = ['2018-01-01','2019-01-01']
m = dates.isin(filtered)
df1 = df.isna() & m[:, None]
df2 = df1.ne(df1.groupby(dates).shift()).cumsum().where(df1)
df2['Y'] *= len(df2)
print (df2)
X Y
Datetime
2018-01-01 00:00:00 NaN NaN
2018-01-01 00:05:00 2.0 NaN
2018-01-01 00:10:00 NaN 36.0
2018-01-01 00:15:00 NaN NaN
2018-01-01 00:20:00 NaN NaN
2018-01-01 00:25:00 4.0 NaN
2018-01-01 00:30:00 4.0 72.0
2018-01-01 00:35:00 4.0 72.0
2018-01-01 00:40:00 NaN NaN
2018-02-01 00:00:00 NaN NaN
2018-02-01 00:05:00 NaN NaN
2018-02-01 00:10:00 NaN NaN
2018-02-01 00:15:00 NaN NaN
2018-02-01 00:20:00 NaN NaN
2018-02-01 00:25:00 NaN NaN
2018-02-01 00:30:00 NaN NaN
2018-02-01 00:35:00 NaN NaN
2018-02-01 00:40:00 NaN NaN
a = df2.stack().value_counts().index[0]
#solution working also if no NaNs per filtered rows (prevent IndexError: index 0 is out of bounds)
#a = next(iter(df2.stack().value_counts().index), -1)
mask = df2.eq(a).any(axis=1)
df.loc[mask,:] = 0
df['Flag'] = mask.astype(int)
print (df)
X Y Flag
Datetime
2018-01-01 00:00:00 1.0 1.0 0
2018-01-01 00:05:00 NaN 2.0 0
2018-01-01 00:10:00 2.0 NaN 0
2018-01-01 00:15:00 3.0 4.0 0
2018-01-01 00:20:00 2.0 2.0 0
2018-01-01 00:25:00 0.0 0.0 1
2018-01-01 00:30:00 0.0 0.0 1
2018-01-01 00:35:00 0.0 0.0 1
2018-01-01 00:40:00 4.0 4.0 0
2018-02-01 00:00:00 NaN NaN 0
2018-02-01 00:05:00 2.0 3.0 0
2018-02-01 00:10:00 2.0 2.0 0
2018-02-01 00:15:00 2.0 5.0 0
2018-02-01 00:20:00 2.0 2.0 0
2018-02-01 00:25:00 NaN NaN 0
2018-02-01 00:30:00 NaN 1.0 0
2018-02-01 00:35:00 3.0 NaN 0
2018-02-01 00:40:00 NaN NaN 0
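Packaged as a reusable helper, the same steps (including the optional date filter from the EDIT and the IndexError guard mentioned above) might look roughly like this; the function name and signature are mine, not part of the original answer:
def flag_longest_nan_stretch(df, filtered=None):
    # zero out the rows of the single most frequent NaN group label and add a Flag column
    dates = df.index.floor('d')
    df1 = df.isna()
    if filtered is not None:
        df1 = df1 & dates.isin(filtered)[:, None]      # only consider NaNs on the listed dates
    df2 = df1.ne(df1.groupby(dates).shift()).cumsum().where(df1)
    df2['Y'] *= len(df2)                               # keep X and Y group labels distinct
    a = next(iter(df2.stack().value_counts().index), -1)   # most common label, or -1 if no NaNs
    mask = df2.eq(a).any(axis=1)
    out = df.copy()
    out.loc[mask, :] = 0
    out['Flag'] = mask.astype(int)
    return out

print(flag_longest_nan_stretch(df, filtered=['2018-01-01', '2019-01-01']))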
I have a dataframe with missing values. I want to find the runs of consecutive missing values and count how many runs fall into each length range. Below are the sample data and the sample result that I expect.
Sample data
Timestamp X
2018-01-02 00:00:00 6
2018-01-02 00:05:00 6
2018-01-02 00:10:00 4
2018-01-02 00:15:00 nan
2018-01-02 00:20:00 nan
2018-01-02 00:25:00 3
2018-01-02 00:30:00 4
2018-01-02 00:35:00 nan
2018-01-02 00:40:00 nan
2018-01-02 00:45:00 nan
2018-01-02 00:50:00 nan
2018-01-02 00:55:00 nan
2018-01-02 01:00:00 nan
2018-01-02 01:05:00 2
2018-01-02 01:10:00 4
2018-01-02 01:15:00 6
2018-01-02 01:20:00 6
2018-01-02 01:25:00 nan
2018-01-02 01:30:00 nan
2018-01-02 01:35:00 6
2018-01-02 01:40:00 nan
2018-01-02 01:45:00 nan
2018-01-02 01:50:00 6
2018-01-02 01:55:00 6
2018-01-02 02:00:00 nan
2018-01-02 02:05:00 nan
2018-01-02 02:10:00 nan
2018-01-02 02:15:00 3
2018-01-02 02:20:00 4
Expected result
Consecutive missing values range    Cases
0-2                                 3
3-5                                 1
6 and above                         1
First use the solution from Identifying consecutive NaN's with pandas, then filter out 0 values, use cut for the bins, and finally count values with GroupBy.size:
s = df.X.isna().groupby(df.X.notna().cumsum()).sum()
s = s[s!=0]
b = pd.cut(s, bins=[0, 2, 5, np.inf], labels=['0-2','3-5','6 and above'])
out = b.groupby(b).size().reset_index(name='Cases')
print (out)
X Cases
0 0-2 3
1 3-5 1
2 6 and above 1
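A small hedged follow-up in case the output should carry the header from the expected result and list every bin even when it has zero cases (the column name is my own wording):
out = (b.value_counts().sort_index()
         .rename_axis('Consecutive missing values range')
         .reset_index(name='Cases'))
print(out)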
I created a dataframe with only a datetime column, at a 1-second interval for Jan 1, 2018, as shown in the code below.
i = pd.date_range(start='2018-01-01 00:00:00', end='2018-01-01 23:59:00', freq="1S")
ts = pd.DataFrame(index=i)
ts = ts.reset_index()
ts = ts.rename(columns={'index': 'datetime'})
df1:
datetime
0 2018-01-01 00:00:00
1 2018-01-01 00:00:01
2 2018-01-01 00:00:02
3 2018-01-01 00:00:03
4 2018-01-01 00:00:04
5 2018-01-01 00:00:05
6 2018-01-01 00:00:06
7 2018-01-01 00:00:07
8 2018-01-01 00:00:08
9 2018-01-01 00:00:09
10 2018-01-01 00:00:10
11 2018-01-01 00:00:11
12 2018-01-01 00:00:12
13 2018-01-01 00:00:13
14 2018-01-01 00:00:14
15 2018-01-01 00:00:15
16 2018-01-01 00:00:16
17 2018-01-01 00:00:17
18 2018-01-01 00:00:18
19 2018-01-01 00:00:19
20 2018-01-01 00:00:20
21 2018-01-01 00:00:21
22 2018-01-01 00:00:22
23 2018-01-01 00:00:23
24 2018-01-01 00:00:24
25 2018-01-01 00:00:25
26 2018-01-01 00:00:26
27 2018-01-01 00:00:27
28 2018-01-01 00:00:28
29 2018-01-01 00:00:29
I have another dataframe with a datetime column and some other columns:
df2:
datetime a b c d e
0 2018-01-01 00:00:04 0.9
1 2018-01-01 00:00:06 0.6 0.7
2 2018-01-01 00:00:09 0.5 0.7 0.8
3 2018-01-01 00:00:16 2.3 3.6 4.9 5.0
4 2018-01-01 00:00:17 0.9 3.5 5.5
5 2018-01-01 00:00:23 0.1 0.6 0.0 1.7
6 2018-01-01 00:00:29 2.7 5.5 4.3
Now I am trying to merge df1 and df2 on the datetime column using a pandas outer join, and I would like my expected result to look like:
datetime a b c d e
0 2018-01-01 00:00:00
1 2018-01-01 00:00:01
2 2018-01-01 00:00:02
3 2018-01-01 00:00:03
4 2018-01-01 00:00:04 0.9
5 2018-01-01 00:00:05
6 2018-01-01 00:00:06 0.6 0.7
7 2018-01-01 00:00:07
8 2018-01-01 00:00:08
9 2018-01-01 00:00:09 0.5 0.7 0.8
10 2018-01-01 00:00:10
11 2018-01-01 00:00:11
12 2018-01-01 00:00:12
13 2018-01-01 00:00:13
14 2018-01-01 00:00:14
15 2018-01-01 00:00:15
16 2018-01-01 00:00:16 2.3 3.6 4.9 5.0
17 2018-01-01 00:00:17 0.9 3.5 5.5
18 2018-01-01 00:00:18
19 2018-01-01 00:00:19
20 2018-01-01 00:00:20
21 2018-01-01 00:00:21
22 2018-01-01 00:00:22
23 2018-01-01 00:00:23 0.1 0.6 0.0 1.7
24 2018-01-01 00:00:24
25 2018-01-01 00:00:25
26 2018-01-01 00:00:26
27 2018-01-01 00:00:27
28 2018-01-01 00:00:28
29 2018-01-01 00:00:29 2.7 5.5 4.3
but my output looks like this
datetime a b c d e
0 2018-01-01 00:00:00
1 2018-01-01 00:00:01
2 2018-01-01 00:00:02
3 2018-01-01 00:00:03
4 2018-01-01 00:00:04
5 2018-01-01 00:00:05
6 2018-01-01 00:00:06
7 2018-01-01 00:00:07
8 2018-01-01 00:00:08
9 2018-01-01 00:00:09
10 2018-01-01 00:00:10
11 2018-01-01 00:00:11
12 2018-01-01 00:00:12
13 2018-01-01 00:00:13
14 2018-01-01 00:00:14
15 2018-01-01 00:00:15
16 2018-01-01 00:00:16
17 2018-01-01 00:00:17
18 2018-01-01 00:00:18
19 2018-01-01 00:00:19
20 2018-01-01 00:00:20
21 2018-01-01 00:00:21
22 2018-01-01 00:00:22
23 2018-01-01 00:00:23
24 2018-01-01 00:00:24
25 2018-01-01 00:00:25
26 2018-01-01 00:00:26
27 2018-01-01 00:00:27
28 2018-01-01 00:00:28
29 2018-01-01 00:00:29
30 2018-01-01 00:00:04 0.9
31 2018-01-01 00:00:06 0.6 0.7
32 2018-01-01 00:00:09 0.5 0.7 0.8
33 2018-01-01 00:00:16 2.3 3.6 4.9 5.0
34 2018-01-01 00:00:17 0.9 3.5 5.5
35 2018-01-01 00:00:23 0.1 0.6 0.0 1.7
36 2018-01-01 00:00:29 2.7 5.5 4.3
The code I am using to do that operation is:
test = pandas.merge(df1, df2, on=['datetime'], how='outer')
I am not quite sure how to approach this issue, and I would appreciate it if I could get some help.
Keep ts with a DatetimeIndex and use reindex, as @Scott Boston mentioned in the comments (the outer merge most likely stacked the rows because df2's datetime column was not a true datetime dtype, so the keys never matched; converting it with pd.to_datetime first avoids that):
i = pd.date_range(start='2018-01-01 00:00:00', end='2018-01-01 23:59:00', freq="1S")
ts = pd.DataFrame(index=i)
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime').reindex(ts.index)
a b c d e
2018-01-01 00:00:00 NaN NaN NaN NaN NaN
2018-01-01 00:00:01 NaN NaN NaN NaN NaN
2018-01-01 00:00:02 NaN NaN NaN NaN NaN
2018-01-01 00:00:03 NaN NaN NaN NaN NaN
2018-01-01 00:00:04 0.9
2018-01-01 00:00:05 NaN NaN NaN NaN NaN
2018-01-01 00:00:06 0.6 0.7
2018-01-01 00:00:07 NaN NaN NaN NaN NaN
2018-01-01 00:00:08 NaN NaN NaN NaN NaN
2018-01-01 00:00:09 0.5 0.7 0.8
2018-01-01 00:00:10 NaN NaN NaN NaN NaN
2018-01-01 00:00:11 NaN NaN NaN NaN NaN
2018-01-01 00:00:12 NaN NaN NaN NaN NaN
2018-01-01 00:00:13 NaN NaN NaN NaN NaN
2018-01-01 00:00:14 NaN NaN NaN NaN NaN
2018-01-01 00:00:15 NaN NaN NaN NaN NaN
2018-01-01 00:00:16 2.3 3.6 4.9 5.0
2018-01-01 00:00:17 0.9 3.5 5.5
Option 2: concat
pd.concat([ts, df.set_index('datetime')], axis=1)
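A hedged follow-up to either option: both leave the 1-second timestamps in the index, so if the result should have a datetime column again like df1, something along these lines should work (i is the 1-second DatetimeIndex built above, and df is the question's df2 after the pd.to_datetime conversion):
out = (df.set_index('datetime')
         .reindex(i)
         .rename_axis('datetime')
         .reset_index())
print(out.head(10))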
I have the following dataframe in pandas:
Date half_hourly_bucket Value
2018-01-01 00:00:01 - 00:30:00 123
2018-01-01 00:30:01 - 01:00:00 12
2018-01-01 01:00:01 - 01:30:00 122
2018-01-01 02:00:01 - 02:30:00 111
2018-01-01 03:00:01 - 03:30:00 122
2018-01-01 04:00:01 - 04:30:00 111
My desired dataframe would be
Date half_hourly_bucket Value
2018-01-01 00:00:01 - 00:30:00 123
2018-01-01 00:30:01 - 01:00:00 12
2018-01-01 01:00:01 - 01:30:00 122
2018-01-01 01:30:01 - 02:00:00 0
2018-01-01 02:00:01 - 02:30:00 111
2018-01-01 02:30:01 - 03:00:00 0
2018-01-01 03:00:01 - 03:30:00 122
2018-01-01 03:30:01 - 04:00:00 0
2018-01-01 04:00:01 - 04:30:00 111
2018-01-01 04:30:01 - 05:00:00 0
2018-01-01 05:00:01 - 05:30:00 0
2018-01-01 05:30:01 - 06:00:00 0
2018-01-01 06:00:01 - 06:30:00 0
2018-01-01 06:30:01 - 07:00:00 0
2018-01-01 07:00:01 - 07:30:00 0
2018-01-01 07:30:01 - 08:00:00 0
2018-01-01 08:00:01 - 08:30:00 0
2018-01-01 08:30:01 - 09:00:00 0
2018-01-01 09:00:01 - 09:30:00 0
2018-01-01 09:30:01 - 10:00:00 0
2018-01-01 10:00:01 - 10:30:00 0
2018-01-01 10:30:01 - 11:00:00 0
2018-01-01 11:00:01 - 11:30:00 0
2018-01-01 11:30:01 - 12:00:00 0
2018-01-01 12:00:01 - 12:30:00 0
2018-01-01 12:30:01 - 13:00:00 0
2018-01-01 13:00:01 - 13:30:00 0
2018-01-01 13:30:01 - 14:00:00 0
2018-01-01 14:00:01 - 14:30:00 0
2018-01-01 14:30:01 - 15:00:00 0
2018-01-01 15:00:01 - 15:30:00 0
2018-01-01 15:30:01 - 16:00:00 0
2018-01-01 16:00:01 - 16:30:00 0
2018-01-01 16:30:01 - 17:00:00 0
2018-01-01 17:00:01 - 17:30:00 0
2018-01-01 17:30:01 - 18:00:00 0
2018-01-01 18:00:01 - 18:30:00 0
2018-01-01 18:30:01 - 19:00:00 0
2018-01-01 19:00:01 - 19:30:00 0
2018-01-01 19:30:01 - 20:00:00 0
2018-01-01 20:00:01 - 20:30:00 0
2018-01-01 20:30:01 - 21:00:00 0
2018-01-01 21:00:01 - 21:30:00 0
2018-01-01 21:30:01 - 22:00:00 0
2018-01-01 22:00:01 - 22:30:00 0
2018-01-01 22:30:01 - 23:00:00 0
2018-01-01 23:00:01 - 23:30:00 0
2018-01-01 23:30:01 - 00:00:00 0
What I want to check in the Date column is whether any half-hourly bucket (48 buckets in total per day) has missing data, and if a bucket is missing, it has to be added in order with a Value of 0.
How can I do it in pandas?
Solution: break half_hourly_bucket into 2 new columns, process it, and join back:
#create DatetimeIndex
df = df.set_index('Date')
#split to new columns
df[['one','two']] = df['half_hourly_bucket'].str.split(' - ', expand=True)
#add first column to DatetimeIndex
df.index += pd.to_timedelta(df['one'])
#add mising values to DatetimeIndex
one_sec = pd.Timedelta(1, unit='s')
one_day = pd.Timedelta(1, unit='d')
df = df.reindex(pd.date_range(df.index.min().floor('D') + one_sec,
                              df.index.max().floor('D') + one_day - one_sec,
                              freq='30T'))
#recreate column two
df['two'] = df.index + pd.Timedelta(30*60 - 1, unit='s')
#join together
df['half_hourly_bucket'] = (df.index.strftime('%H:%M:%S') + ' - ' +
                            df['two'].dt.strftime('%H:%M:%S'))
#replace missing values
df['Value'] = df['Value'].fillna(0)
df = df.rename_axis('Date').reset_index()
#filter only necessary columns
df = df[['Date','half_hourly_bucket','Value']]
print (df)
Date half_hourly_bucket Value
0 2018-01-01 00:00:01 00:00:01 - 00:30:00 123.0
1 2018-01-01 00:30:01 00:30:01 - 01:00:00 12.0
2 2018-01-01 01:00:01 01:00:01 - 01:30:00 122.0
3 2018-01-01 01:30:01 01:30:01 - 02:00:00 0.0
4 2018-01-01 02:00:01 02:00:01 - 02:30:00 111.0
5 2018-01-01 02:30:01 02:30:01 - 03:00:00 0.0
6 2018-01-01 03:00:01 03:00:01 - 03:30:00 122.0
7 2018-01-01 03:30:01 03:30:01 - 04:00:00 0.0
8 2018-01-01 04:00:01 04:00:01 - 04:30:00 111.0
9 2018-01-01 04:30:01 04:30:01 - 05:00:00 0.0
10 2018-01-01 05:00:01 05:00:01 - 05:30:00 0.0
11 2018-01-01 05:30:01 05:30:01 - 06:00:00 0.0
12 2018-01-01 06:00:01 06:00:01 - 06:30:00 0.0
13 2018-01-01 06:30:01 06:30:01 - 07:00:00 0.0
14 2018-01-01 07:00:01 07:00:01 - 07:30:00 0.0
15 2018-01-01 07:30:01 07:30:01 - 08:00:00 0.0
16 2018-01-01 08:00:01 08:00:01 - 08:30:00 0.0
17 2018-01-01 08:30:01 08:30:01 - 09:00:00 0.0
18 2018-01-01 09:00:01 09:00:01 - 09:30:00 0.0
19 2018-01-01 09:30:01 09:30:01 - 10:00:00 0.0
20 2018-01-01 10:00:01 10:00:01 - 10:30:00 0.0
21 2018-01-01 10:30:01 10:30:01 - 11:00:00 0.0
22 2018-01-01 11:00:01 11:00:01 - 11:30:00 0.0
23 2018-01-01 11:30:01 11:30:01 - 12:00:00 0.0
24 2018-01-01 12:00:01 12:00:01 - 12:30:00 0.0
25 2018-01-01 12:30:01 12:30:01 - 13:00:00 0.0
26 2018-01-01 13:00:01 13:00:01 - 13:30:00 0.0
27 2018-01-01 13:30:01 13:30:01 - 14:00:00 0.0
28 2018-01-01 14:00:01 14:00:01 - 14:30:00 0.0
29 2018-01-01 14:30:01 14:30:01 - 15:00:00 0.0
30 2018-01-01 15:00:01 15:00:01 - 15:30:00 0.0
31 2018-01-01 15:30:01 15:30:01 - 16:00:00 0.0
32 2018-01-01 16:00:01 16:00:01 - 16:30:00 0.0
33 2018-01-01 16:30:01 16:30:01 - 17:00:00 0.0
34 2018-01-01 17:00:01 17:00:01 - 17:30:00 0.0
35 2018-01-01 17:30:01 17:30:01 - 18:00:00 0.0
36 2018-01-01 18:00:01 18:00:01 - 18:30:00 0.0
37 2018-01-01 18:30:01 18:30:01 - 19:00:00 0.0
38 2018-01-01 19:00:01 19:00:01 - 19:30:00 0.0
39 2018-01-01 19:30:01 19:30:01 - 20:00:00 0.0
40 2018-01-01 20:00:01 20:00:01 - 20:30:00 0.0
41 2018-01-01 20:30:01 20:30:01 - 21:00:00 0.0
42 2018-01-01 21:00:01 21:00:01 - 21:30:00 0.0
43 2018-01-01 21:30:01 21:30:01 - 22:00:00 0.0
44 2018-01-01 22:00:01 22:00:01 - 22:30:00 0.0
45 2018-01-01 22:30:01 22:30:01 - 23:00:00 0.0
46 2018-01-01 23:00:01 23:00:01 - 23:30:00 0.0
47 2018-01-01 23:30:01 23:30:01 - 00:00:00 0.0
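An alternative hedged sketch for the same task: build all 48 bucket labels for the day up front and left-merge the original rows onto them. It assumes Date is already a datetime column and covers a single day; names like full and labels are mine.
import pandas as pd

day = pd.Timestamp('2018-01-01')
starts = pd.date_range(day + pd.Timedelta(seconds=1), periods=48, freq='30T')
ends = starts + pd.Timedelta(minutes=30) - pd.Timedelta(seconds=1)
labels = [f'{s:%H:%M:%S} - {e:%H:%M:%S}' for s, e in zip(starts, ends)]

full = pd.DataFrame({'Date': day, 'half_hourly_bucket': labels})
out = (full.merge(df, on=['Date', 'half_hourly_bucket'], how='left')
           .fillna({'Value': 0}))
print(out)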