I have a dataframe with a DatetimeIndex in the format shown below. The raw data should contain one record per hour for a year (each day having 24 records), but some hours/days are missing and were never recorded.
How can I get a list of all the missing hourly DatetimeIndex values?
Example: if the 01 hour is missing, how can I find and print 2012-10-02 01:00:00?
I'm currently able to get the missing days, but not the missing hours.
from datetime import datetime  # needed for strptime

missing_day = pd.date_range(start=mdf.index[0], end=mdf.index[-1]).difference(mdf.index)
missing_day = missing_day.strftime('%Y%m%d')
missing = pd.Series(missing_day).array

for i in missing:
    print(i)
    for x in range(24):
        m = i + f'{x:02d}'  # zero-pad the hour so '%H' always sees two digits
        m = datetime.strptime(m, '%Y%m%d%H')
        print(m)
Output: prints all 24 hours for each missing day.
What would be the best way to list out all of the missing datetimes?
Use a set difference between a complete hourly range and your index to find the missing values:
out = pd.date_range(df.index.min(), df.index.max(), freq='H').difference(df.index)
print(out)
# Output
DatetimeIndex(['2022-01-01 06:00:00', '2022-01-01 12:00:00',
'2022-01-01 14:00:00', '2022-01-01 16:00:00'],
dtype='datetime64[ns]', freq=None)
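If, beyond listing the gaps, you want the frame itself padded out to the full hourly grid (missing rows as NaN), reindexing does it. A minimal sketch on the same df:
full = pd.date_range(df.index.min(), df.index.max(), freq='H')
df_padded = df.reindex(full)  # equivalently: df.asfreq('H')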
Setup:
df = pd.DataFrame({'A':[0]}, index=pd.date_range('2022-01-01', freq='H', periods=24))
df = df.sample(n=20).sort_index()
print(df)
# Output
A
2022-01-01 00:00:00 0
2022-01-01 01:00:00 0
2022-01-01 02:00:00 0
2022-01-01 03:00:00 0
2022-01-01 04:00:00 0
2022-01-01 05:00:00 0
2022-01-01 07:00:00 0
2022-01-01 08:00:00 0
2022-01-01 09:00:00 0
2022-01-01 10:00:00 0
2022-01-01 11:00:00 0
2022-01-01 13:00:00 0
2022-01-01 15:00:00 0
2022-01-01 17:00:00 0
2022-01-01 18:00:00 0
2022-01-01 19:00:00 0
2022-01-01 20:00:00 0
2022-01-01 21:00:00 0
2022-01-01 22:00:00 0
2022-01-01 23:00:00 0
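Note that min()/max() only span the hours that were actually recorded. If the index is supposed to cover a whole year regardless of which hours made it into the file, anchor the range explicitly. A sketch, assuming 2012 is the year in question and mdf is the OP's frame:
full = pd.date_range('2012-01-01 00:00', '2012-12-31 23:00', freq='H')
missing = full.difference(mdf.index)
print(missing)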
Let's suppose we have a pandas dataframe with work shifts:
df_aux = pd.DataFrame({'Worker': ['Alice', 'Alice', 'Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob'],
                       'Shift_start': ['2022-01-01 10:00:00', '2022-01-01 10:30:00', '2022-01-01 11:45:00',
                                       '2022-01-01 12:45:00', '2022-01-01 13:15:00', '2022-01-01 10:30:00',
                                       '2022-01-01 12:00:00', '2022-01-01 13:15:00'],
                       'Shift_end': ['2022-01-01 10:15:00', '2022-01-01 11:45:00', '2022-01-01 12:30:00',
                                     '2022-01-01 13:15:00', '2022-01-01 14:00:00', '2022-01-01 11:30:00',
                                     '2022-01-01 13:10:00', '2022-01-01 14:30:00'],
                       'Position': [1, 1, 2, 2, 2, 1, 2, 3],
                       'Role': ['A', 'B', 'B', 'A', 'B', 'A', 'B', 'A']})
Worker          Shift_start            Shift_end  Position  Role
 Alice  2022-01-01 10:00:00  2022-01-01 10:15:00         1     A
 Alice  2022-01-01 10:30:00  2022-01-01 11:45:00         1     B
 Alice  2022-01-01 11:45:00  2022-01-01 12:30:00         2     B
 Alice  2022-01-01 12:45:00  2022-01-01 13:15:00         2     A
 Alice  2022-01-01 13:15:00  2022-01-01 14:00:00         2     B
   Bob  2022-01-01 10:30:00  2022-01-01 11:30:00         1     A
   Bob  2022-01-01 12:00:00  2022-01-01 13:10:00         2     B
   Bob  2022-01-01 13:15:00  2022-01-01 14:30:00         3     A
The Position column refers to the place where the workers are, while there are two roles, A and B (say, main and auxiliary). I would need to compute the time each worker has been at their current position, regardless of their role, and the time they have been in the same position AND role at the time of certain events. These events are given in a df_main, which records the time and position:
df_main = pd.DataFrame({'Event_time': ['2022-01-01 11:05:00', '2022-01-01 12:35:00', '2022-01-01 13:25:00'],
                        'Position': [1, 2, 2]})
          Event_time  Position
 2022-01-01 11:05:00         1
 2022-01-01 12:35:00         2
 2022-01-01 13:25:00         2
The idea would be to perform a merge between df_main and df_aux to have the following info:
          Event_time  Worker          Shift_start            Shift_end  Position  Role    Time_in_position  Time_in_position_role
 2022-01-01 11:05:00   Alice  2022-01-01 10:30:00  2022-01-01 11:45:00         1     B  1 hours 05 minutes     0 hours 35 minutes
 2022-01-01 11:05:00     Bob  2022-01-01 10:30:00  2022-01-01 13:30:00         1     A  0 hours 35 minutes     0 hours 35 minutes
 2022-01-01 12:35:00     Bob  2022-01-01 12:00:00  2022-01-01 15:10:00         2     B  0 hours 35 minutes     0 hours 35 minutes
 2022-01-01 13:25:00   Alice  2022-01-01 13:15:00  2022-01-01 14:00:00         2     B  1 hours 40 minutes     0 hours 10 minutes
The first row is duplicated, because both Alice and Bob were in that position at the time of the event, but with different roles. I managed to compute the Time_in_position_role column:
df_full = df_main.merge(df_aux, on='Position')
df_full = df_full[(df_full['Event_time']>df_full['Shift_start']) & (df_full['Event_time']<df_full['Shift_end'])]
df_full['Time_in_position_role'] = df_full['Event_time'] - df_full['Shift_start']
But I am unable to do the same for the Time_in_position one. Any ideas?
The logic is:
For each "Worker", find the time period for which they were in a particular position. If there are multiple consecutive rows, merge them.
Join this with your result df and filter with the same logic for "Time_in_position".
# For each "Worker", find the time period for which he was in particular position. If there are multiple rows, then merge them.
def sort_n_rank(g):
df_g = g.apply(pd.Series)
df_g = df_g.sort_values(0)
return (df_g[1] != df_g[1].shift(1)).cumsum()
df_aux["start_position"] = df_aux[["Shift_start", "Position"]].apply(tuple, axis=1)
df_aux["rank"] = df_aux.groupby("Worker")[["start_position"]].transform(sort_n_rank)
df_worker_position = df_aux.groupby(["Worker", "rank"]) \
.agg( \
Shift_start_min = ("Shift_start", "min"),
Shift_end_max = ("Shift_end", "max"),
Position = ("Position", "first")
) \
.reset_index()
df_full = df_full.merge(df_worker_position, on=["Worker", "Position"])
df_full = df_full[(df_full["Event_time"] > df_full["Shift_start_min"]) & (df_full["Event_time"] < df_full["Shift_end_max"])]
df_full["Time_in_position"] = df_full["Event_time"] - df_full["Shift_start_min"]
Output:
Event_time Worker Shift_start Shift_end Position Role Time_in_position Time_in_position_role
0 2022-01-01 11:05:00 Alice 2022-01-01 10:30:00 2022-01-01 11:45:00 1 B 0 days 01:05:00 0 days 00:35:00
1 2022-01-01 11:05:00 Bob 2022-01-01 10:30:00 2022-01-01 11:30:00 1 A 0 days 00:35:00 0 days 00:35:00
2 2022-01-01 12:35:00 Bob 2022-01-01 12:00:00 2022-01-01 13:10:00 2 B 0 days 00:35:00 0 days 00:35:00
3 2022-01-01 13:25:00 Alice 2022-01-01 13:15:00 2022-01-01 14:00:00 2 B 0 days 01:40:00 0 days 00:10:00
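For reference, the rank step can also be computed without the tuple helper column: sort once, then start a new block whenever the worker or the position changes. A sketch on the same df_aux, equivalent to sort_n_rank above:
df_aux = df_aux.sort_values(['Worker', 'Shift_start'])
new_block = ((df_aux['Worker'] != df_aux['Worker'].shift())
             | (df_aux['Position'] != df_aux['Position'].shift()))
df_aux['rank'] = new_block.cumsum()
# grouping by ['Worker', 'rank'] then proceeds exactly as above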
I have the issue below and I feel I'm just a few steps away from solving it, but I'm not experienced enough yet. I've used the business-duration package for this.
I've looked through other similar answers and tried many methods, but this is the closest I have gotten (using this answer). I'm using Anaconda and Spyder, which is the only setup I have on my work laptop at the moment, and I can't install some of the custom business-days packages into Anaconda.
I have a large dataset (~200k rows) which I need to solve this for:
import pandas as pd
import business_duration as bd
import datetime as dt
import holidays as pyholidays

# Specify business working hours (8am - 5pm)
Bus_start_time = dt.time(8, 0, 0)
Bus_end_time = dt.time(17, 0, 0)
holidaylist = pyholidays.ZA()
unit = 'min'

# use a name other than the built-in `list`, which is needed later
data = [[10, '2022-01-01 07:00:00', '2022-01-08 15:00:00'], [11, '2022-01-02 18:00:00', '2022-01-10 15:30:00'],
        [12, '2022-01-01 09:15:00', '2022-01-08 12:00:00'], [13, '2022-01-07 13:00:00', '2022-01-23 17:00:00']]
df = pd.DataFrame(data, columns=['ID', 'Start', 'End'])
print(df)
print(df)
Which gives:
ID Start End
0 10 2022-01-01 07:00:00 2022-01-08 15:00:00
1 11 2022-01-02 18:00:00 2022-01-10 15:30:00
2 12 2022-01-01 09:15:00 2022-01-08 12:00:00
3 13 2022-01-07 13:00:00 2022-01-23 17:00:00
The next step works when testing a single pair of dates (the scalar result is broadcast to every row):
startdate = pd.to_datetime('2022-01-01 00:00:00')
enddate = pd.to_datetime('2022-01-14 23:00:00')
df['TimeAdj'] = bd.businessDuration(startdate,enddate,Bus_start_time,Bus_end_time,holidaylist=holidaylist,unit=unit)
print(df)
Which results in:
ID Start End TimeAdj
0 10 2022-01-01 07:00:00 2022-01-08 15:00:00 5400.0
1 11 2022-01-02 18:00:00 2022-01-10 15:30:00 5400.0
2 12 2022-01-01 09:15:00 2022-01-08 12:00:00 5400.0
3 13 2022-01-07 13:00:00 2022-01-23 17:00:00 5400.0
For some reason I have float values showing up, but I can fix that later.
Next, I need to have this calculation run per row in the dataframe.
I tried replacing the df columns in start date and end date, but got an error:
startdate = df['Start']
enddate = df['End']
print(bd.businessDuration(startdate,enddate,Bus_start_time,Bus_end_time,holidaylist=holidaylist,unit=unit))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I then checked the documentation for business-duration, and adjusted to the below:
from itertools import repeat
df['TimeAdj'] = list(map(bd.businessDuration,startdate,enddate,repeat(Bus_start_time),repeat(Bus_end_time),repeat(holidaylist),repeat(unit)))
AttributeError: 'str' object has no attribute 'date'
I'm hoping to end with the correct values in each row of the TimeAdj column (example figures added).
ID Start End TimeAdj
0 10 2022-01-01 07:00:00 2022-01-08 15:00:00 2300
1 11 2022-01-02 18:00:00 2022-01-10 15:30:00 2830
2 12 2022-01-01 09:15:00 2022-01-08 12:00:00 2115
3 13 2022-01-07 13:00:00 2022-01-23 17:00:00 4800
What do I need to adjust on this?
Use:
from functools import partial
# Convert strings to datetime
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
# Get holidays list
years = range(df['Start'].min().year, df['End'].max().year+1)
holidaylist = pyholidays.ZA(years=years).keys()
# Create a partial function as a shortcut
bduration = partial(bd.businessDuration,
                    starttime=Bus_start_time, endtime=Bus_end_time,
                    holidaylist=holidaylist, unit=unit)
# Compute business duration
df['TimeAdj'] = df.apply(lambda x: bduration(x['Start'], x['End']), axis=1)
Output:
>>> df
ID Start End TimeAdj
0 10 2022-01-01 07:00:00 2022-01-08 15:00:00 2700.0
1 11 2022-01-02 18:00:00 2022-01-10 15:30:00 3150.0
2 12 2022-01-01 09:15:00 2022-01-08 12:00:00 2700.0
3 13 2022-01-07 13:00:00 2022-01-23 17:00:00 5640.0
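As for the float values the OP mentioned: businessDuration returns floats, so if whole minutes are wanted, a final rounding step is enough (a sketch, applied after the computation above):
df['TimeAdj'] = df['TimeAdj'].round().astype(int)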
I have a time series with breaks (times w/o recordings) in between. A simplified example would be:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.rand(13), columns=["values"],
    index=pd.date_range(start='1/1/2020 11:00:00', end='1/1/2020 23:00:00', freq='H'))
df.iloc[4:7] = np.nan
df.dropna(inplace=True)
df
values
2020-01-01 11:00:00 0.100339
2020-01-01 12:00:00 0.054668
2020-01-01 13:00:00 0.209965
2020-01-01 14:00:00 0.551023
2020-01-01 18:00:00 0.495879
2020-01-01 19:00:00 0.479905
2020-01-01 20:00:00 0.250568
2020-01-01 21:00:00 0.904743
2020-01-01 22:00:00 0.686085
2020-01-01 23:00:00 0.188166
Now I would like to split it into intervals that are divided by a certain time span (e.g. 2 h). In the example above this would be:
( values
2020-01-01 11:00:00 0.100339
2020-01-01 12:00:00 0.054668
2020-01-01 13:00:00 0.209965
2020-01-01 14:00:00 0.551023,
values
2020-01-01 18:00:00 0.495879
2020-01-01 19:00:00 0.479905
2020-01-01 20:00:00 0.250568
2020-01-01 21:00:00 0.904743
2020-01-01 22:00:00 0.686085
2020-01-01 23:00:00 0.188166)
I was a bit surprised that I didn't find anything on this, since I thought it was a common problem. My current solution to get the start and end index of each interval is:
from datetime import timedelta

def intervals(data: pd.DataFrame, delta_t: timedelta = timedelta(hours=2)):
    # expects the timestamps in an 'event_timestamp' column
    data = data.sort_values(by=['event_timestamp'], ignore_index=True)
    breaks = (data['event_timestamp'].diff() > delta_t).astype(bool).values
    ranges = []
    start = 0
    end = start
    for i, e in enumerate(breaks):
        if not e:
            end = i
            if i == len(breaks) - 1:
                ranges.append((start, end))
                start = i
                end = start
        elif i != 0:
            ranges.append((start, end))
            start = i
            end = start
    return ranges
Any suggestions on how I could do this in a smarter way? I suspect this should be possible using groupby.
Yes, you can use the very convenient np.split:
dt = pd.Timedelta('2H')
parts = np.split(df, np.where(np.diff(df.index) > dt)[0] + 1)
Which gives, for your example:
>>> parts
[ values
2020-01-01 11:00:00 0.557374
2020-01-01 12:00:00 0.942296
2020-01-01 13:00:00 0.181189
2020-01-01 14:00:00 0.758822,
values
2020-01-01 18:00:00 0.682125
2020-01-01 19:00:00 0.818187
2020-01-01 20:00:00 0.053515
2020-01-01 21:00:00 0.572342
2020-01-01 22:00:00 0.423129
2020-01-01 23:00:00 0.882215]
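If you prefer not to hand a DataFrame to np.split, the same split can be done by slicing at the break points. A sketch with the same df and dt as above:
breaks = np.where(np.diff(df.index) > dt)[0] + 1
bounds = [0, *breaks, len(df)]
parts = [df.iloc[i:j] for i, j in zip(bounds[:-1], bounds[1:])]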
@Pierre, thanks for your input. I now have a solution which is convenient for me:
df['diff'] = df.index.to_series().diff()
max_gap = timedelta(hours=2)
df['gapId'] = 0
df.loc[df['diff'] >= max_gap, ['gapId']] = 1
df['gapId'] = df['gapId'].cumsum()
list(df.groupby('gapId'))
gives:
[(0,
values date diff gapId
0 1.0 2020-01-01 11:00:00 NaT 0
1 1.0 2020-01-01 12:00:00 0 days 01:00:00 0
2 1.0 2020-01-01 13:00:00 0 days 01:00:00 0
3 1.0 2020-01-01 14:00:00 0 days 01:00:00 0),
(1,
values date diff gapId
7 1.0 2020-01-01 18:00:00 0 days 04:00:00 1
8 1.0 2020-01-01 19:00:00 0 days 01:00:00 1
9 1.0 2020-01-01 20:00:00 0 days 01:00:00 1
10 1.0 2020-01-01 21:00:00 0 days 01:00:00 1
11 1.0 2020-01-01 22:00:00 0 days 01:00:00 1
12 1.0 2020-01-01 23:00:00 0 days 01:00:00 1)]
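The helper columns aren't strictly necessary; the same grouping can be derived inline. A sketch equivalent to the steps above:
gap_id = df.index.to_series().diff().gt(pd.Timedelta(hours=2)).cumsum()
parts = [g for _, g in df.groupby(gap_id)]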
I have a dataframe indexed by datetime. I want to filter out rows based on the difference between their index and the index of the previous row.
So, if my criterion is "remove all rows that are more than one hour later than the previous row", the second row in the example below should be removed:
2005-07-15 17:00:00
2005-07-17 18:00:00
While in the following case, both rows stay:
2005-07-17 23:00:00
2005-07-18 00:00:00
You need boolean indexing: take the diff of the index and compare it with a one-hour Timedelta:
dates = ['2005-07-15 17:00:00', '2005-07-17 18:00:00', '2005-07-17 19:00:00',
         '2005-07-17 23:00:00', '2005-07-18 00:00:00']
df = pd.DataFrame({'a': range(5)}, index=pd.to_datetime(dates))
print (df)
a
2005-07-15 17:00:00 0
2005-07-17 18:00:00 1
2005-07-17 19:00:00 2
2005-07-17 23:00:00 3
2005-07-18 00:00:00 4
diff = df.index.to_series().diff().fillna(pd.Timedelta(0))
print (diff)
2005-07-15 17:00:00 0 days 00:00:00
2005-07-17 18:00:00 2 days 01:00:00
2005-07-17 19:00:00 0 days 01:00:00
2005-07-17 23:00:00 0 days 04:00:00
2005-07-18 00:00:00 0 days 01:00:00
dtype: timedelta64[ns]
mask = diff <= pd.Timedelta(1, unit='h')
print (mask)
2005-07-15 17:00:00 True
2005-07-17 18:00:00 False
2005-07-17 19:00:00 True
2005-07-17 23:00:00 False
2005-07-18 00:00:00 True
dtype: bool
df = df[mask]
print (df)
a
2005-07-15 17:00:00 0
2005-07-17 19:00:00 2
2005-07-18 00:00:00 4
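For completeness, the whole filter collapses to one line (a sketch, equivalent to the steps above):
out = df[df.index.to_series().diff().fillna(pd.Timedelta(0)) <= pd.Timedelta(hours=1)]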
I have a sample_data.txt with the following structure:
Precision= Waterdrops
2009-11-17 14:00:00,4.9,
2009-11-17 14:30:00,6.1,
2009-11-17 15:00:00,5.3,
2009-11-17 15:30:00,3.3,
2009-11-17 16:00:00,4.9,
I need to separate out the values bigger than zero and identify changes (events) with a timespan bigger than 2 h. So far I have written:
import numpy as np
import pandas as pd

file_path = 'sample_data.txt'
df = pd.read_csv(file_path,
                 skiprows=[num for (num, line) in enumerate(open(file_path), 2) if 'Precision=' in line][0],
                 parse_dates=True, index_col=0, header=None, sep=',',
                 names=['meteo', 'empty'])
df['date'] = df.index
df = df.drop(['empty'], axis=1)
df = df[df.meteo > 20]
df['diff'] = df.date - df.date.shift(1)
df['sections'] = (df['diff'] > np.timedelta64(2, "h")).astype(int).cumsum()  # was: (diff > ...), a NameError
From the above code I get:
meteo date diff sections
2009-12-15 12:00:00 23.8 2009-12-15 12:00:00 NaT 0
2009-12-15 13:00:00 23.0 2009-12-15 13:00:00 01:00:00 0
If I use:
df.date.iloc[[0, -1]].reset_index(drop=True)
I get:
0 2009-12-15 12:00:00
1 2012-12-05 16:00:00
Name: date, dtype: datetime64[ns]
Which are the start date and finish date of my sample_data.txt.
How can I get .iloc[[0, -1]].reset_index(drop=True) for each df['sections'] category?
I tried with .apply:

def f(s):
    return s.iloc[[0, -1]].reset_index(drop=True)

df.groupby(df['sections']).apply(f)

and I get: IndexError: positional indexers are out-of-bounds
I don't know why you use the reset_index(drop=True) shenanigans. My somewhat more straightforward process would be, starting with
df
   sections                 date      diff
0         0  2009-12-15 12:00:00       NaT
1         0  2009-12-15 13:00:00  01:00:00
0         1  2009-12-15 12:00:00       NaT
1         1  2009-12-15 13:00:00  01:00:00
to do the following (after you ensure with sort_values(['sections', 'date']) that iloc[[0, -1]] actually gives start and end; otherwise just use min() and max()):
def f(s):
    return s.iloc[[0, -1]]['date']

df.groupby('sections').apply(f)

date             0         1
sections
0         12:00:00  13:00:00
1         12:00:00  13:00:00
Or, as a more streamlined approach:
df.groupby('sections')['date'].agg(['max', 'min'])

               max       min
sections
0         13:00:00  12:00:00
1         13:00:00  12:00:00
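If the rows are sorted by date within each section, 'first' and 'last' give the actual start and finish timestamps directly. A sketch assuming the df above:
out = (df.sort_values(['sections', 'date'])
         .groupby('sections')['date']
         .agg(['first', 'last']))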