Compare two dataframes and keep a specific datetime range of another - python

I have two dataframes with timestamps. I want to select the timestamps from df1 that equal the 'start_show' timestamps of df2, but also keep all the df1 timestamps that fall within 2 hours before and 2 hours after each match.
df1:
van_timestamp weekdag
2880 2016-11-19 00:00:00 6
2881 2016-11-19 00:15:00 6
2882 2016-11-19 00:30:00 6
... ... ...
822349 2019-11-06 22:45:00 3
822350 2019-11-06 23:00:00 3
822351 2019-11-06 23:15:00 3
df2:
einde_show start_show
255 2016-01-16 22:00:00 2016-01-16 20:00:00
256 2016-01-23 21:30:00 2016-01-23 19:45:00
257 2016-01-26 21:30:00 2016-01-26 19:45:00
... ... ...
1111 2019-12-29 18:30:00 2019-12-29 17:00:00
1112 2019-12-30 15:00:00 2019-12-30 13:30:00
1113 2019-12-30 18:30:00 2019-12-30 17:00:00
df1 contains a timestamp every 15 minutes of every day whereas df2['start_show'] contains just a single timestamp per day.
So ultimately, what I want to achieve is that for every timestamp in df2['start_show'] I get the corresponding df1 timestamps within ±2 hours.
So far I've tried:
df1['van_timestamp'][df1['van_timestamp'].isin(df2['start_show'])]
This selects the right timestamps. Now I want to select everything from df1 within a range of ± pd.Timedelta(2, unit='h') around those matches.
But I'm not sure how to go about this. Help would be much appreciated!
Thanks!

I got it working (ugly fix). I created a datetime range for each show:
dates = [pd.date_range(start=df2['start_show'].iloc[i] - pd.Timedelta(2, unit='h'),
                       end=df2['start_show'].iloc[i] + pd.Timedelta(2, unit='h'),
                       freq='15T')
         for i in range(len(df2))]
Which I then unlisted:
dates = [i for sublist in dates for i in sublist]
Afterwards I compared the dataframe with this list.
relevant_timestamps = df1[df1['van_timestamp'].isin(dates)]
If anyone else has a better solution, please let me know!
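A cleaner approach, as a sketch (assuming both columns are already datetime64): pd.merge_asof, which needs both frames sorted on their keys, can match each df1 row to the nearest start_show within 2 hours, after which you keep only the rows that found a match.
merged = pd.merge_asof(
    df1.sort_values('van_timestamp'),
    df2[['start_show']].sort_values('start_show'),
    left_on='van_timestamp',
    right_on='start_show',
    direction='nearest',
    tolerance=pd.Timedelta('2h'))
# Rows with no show start within ±2 hours get NaT in 'start_show'
relevant_timestamps = merged[merged['start_show'].notna()]
This avoids materialising the full 15-minute grid for every show and also works if df1 has irregular gaps.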

Related

Pandas: Calculate time in minutes between 2 columns, excluding weekends, public holidays and taking business hours into account

I have the below issue and I feel I'm just a few steps away from solving it, but I'm not experienced enough yet. I've used business-duration for this.
I've looked through other similar answers and tried many methods, and this is the closest I have gotten (using this answer). I'm using Anaconda and Spyder, which is the only setup I have on my work laptop at the moment, so I can't install some of the custom business-day packages into Anaconda.
I have a large dataset (~200k rows) which I need to solve this for:
import pandas as pd
import business_duration as bd
import datetime as dt
import holidays as pyholidays

# Specify business working hours (8am - 5pm)
Bus_start_time = dt.time(8, 0, 0)
Bus_end_time = dt.time(17, 0, 0)
holidaylist = pyholidays.ZA()
unit = 'min'

# Note: named 'data' rather than 'list' so the builtin list() still works below
data = [[10, '2022-01-01 07:00:00', '2022-01-08 15:00:00'],
        [11, '2022-01-02 18:00:00', '2022-01-10 15:30:00'],
        [12, '2022-01-01 09:15:00', '2022-01-08 12:00:00'],
        [13, '2022-01-07 13:00:00', '2022-01-23 17:00:00']]
df = pd.DataFrame(data, columns=['ID', 'Start', 'End'])
print(df)
Which gives:
ID Start End
0 10 2022-01-01 07:00:00 2022-01-08 15:00:00
1 11 2022-01-02 18:00:00 2022-01-10 15:30:00
2 12 2022-01-01 09:15:00 2022-01-08 12:00:00
3 13 2022-01-07 13:00:00 2022-01-23 17:00:00
The next step works in testing single dates:
startdate = pd.to_datetime('2022-01-01 00:00:00')
enddate = pd.to_datetime('2022-01-14 23:00:00')
df['TimeAdj'] = bd.businessDuration(startdate,enddate,Bus_start_time,Bus_end_time,holidaylist=holidaylist,unit=unit)
print(df)
Which results in:
ID Start End TimeAdj
0 10 2022-01-01 07:00:00 2022-01-08 15:00:00 5400.0
1 11 2022-01-02 18:00:00 2022-01-10 15:30:00 5400.0
2 12 2022-01-01 09:15:00 2022-01-08 12:00:00 5400.0
3 13 2022-01-07 13:00:00 2022-01-23 17:00:00 5400.0
For some reason I have float values showing up, but I can fix that later.
Next, I need to have this calculation run per row in the dataframe.
I tried replacing the df columns in start date and end date, but got an error:
startdate = df['Start']
enddate = df['End']
print(bd.businessDuration(startdate,enddate,Bus_start_time,Bus_end_time,holidaylist=holidaylist,unit=unit))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I then checked the documentation for business-duration, and adjusted to the below:
from itertools import repeat
df['TimeAdj'] = list(map(bd.businessDuration, startdate, enddate, repeat(Bus_start_time), repeat(Bus_end_time), repeat(holidaylist), repeat(unit)))
but this raised:
AttributeError: 'str' object has no attribute 'date'
I'm hoping to end with the correct values in each row of the TimeAdj column (example figures added).
ID Start End TimeAdj
0 10 2022-01-01 07:00:00 2022-01-08 15:00:00 2300
1 11 2022-01-02 18:00:00 2022-01-10 15:30:00 2830
2 12 2022-01-01 09:15:00 2022-01-08 12:00:00 2115
3 13 2022-01-07 13:00:00 2022-01-23 17:00:00 4800
What do I need to adjust on this?
Use:
from functools import partial
# Convert strings to datetime
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
# Get holidays list
years = range(df['Start'].min().year, df['End'].max().year+1)
holidaylist = pyholidays.ZA(years=years).keys()
# Create a partial function as a shortcut
bduration = partial(bd.businessDuration,
                    starttime=Bus_start_time, endtime=Bus_end_time,
                    holidaylist=holidaylist, unit=unit)
# Compute business duration
df['TimeAdj'] = df.apply(lambda x: bduration(x['Start'], x['End']), axis=1)
Output:
>>> df
ID Start End TimeAdj
0 10 2022-01-01 07:00:00 2022-01-08 15:00:00 2700.0
1 11 2022-01-02 18:00:00 2022-01-10 15:30:00 3150.0
2 12 2022-01-01 09:15:00 2022-01-08 12:00:00 2700.0
3 13 2022-01-07 13:00:00 2022-01-23 17:00:00 5640.0
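Note that the map-based attempt in the question also works once 'Start' and 'End' are converted with pd.to_datetime (the AttributeError came from passing raw strings). A sketch using the same partial defined above:
# Assumes df['Start'] / df['End'] are already datetime64; avoids the row-wise lambda
df['TimeAdj'] = list(map(bduration, df['Start'], df['End']))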

How to use pd.interpolate to fill gaps with only one missing value

I have time series data for air pollution with several missing gaps, like this:
Date AMB_TEMP CO PM10 PM2.5
2010-01-01 0 8 10 ... 15
2010-01-01 1 10 15 ... 20
...
2010-01-02 0 5 ...
2010-01-02 1 ... 20
...
2010-01-03 1 4 13 ... 34
To specify, here's the data link: shorturl.at/blBN1
The gaps consist of consecutive or non-consecutive NAs, and here are some helpful summary statistics computed in R:
Length of time series: 87648
Number of Missing Values:746
Percentage of Missing Values: 0.85 %
Number of Gaps: 136
Average Gap Size: 5.485294
Longest NA gap (series of consecutive NAs): 32
Most frequent gap size (series of consecutive NAs): 1 (occurring 50 times)
Generally, if I use df.interpolate(limit=1), gaps with more than one missing value get partially interpolated as well.
So I guess a better way to interpolate only the gaps with a single missing value is to first get a gap id.
To do so, I grouped the gaps by size using the following:
cum = df.notna().cumsum()
cum[cum.duplicated()]
and got the result:
PM2.5
2019-01-09 13:00:00 205
2019-01-10 15:00:00 230
2019-01-10 16:00:00 230
2019-01-16 11:00:00 368
2019-01-23 14:00:00 538
...
2019-12-02 10:00:00 7971
2019-12-10 09:00:00 8161
2019-12-16 15:00:00 8310
2019-12-24 12:00:00 8498
2019-12-31 10:00:00 8663
How can I get the index of the first missing value in each gap, together with the gap size, like this?
PM2.5 gap size
2019-01-09 13:00:00 1
2019-01-10 15:00:00 2
2019-01-16 11:00:00 1
2019-01-23 14:00:00 1
...
2019-12-02 10:00:00 1
2019-12-10 09:00:00 1
2019-12-16 15:00:00 1
2019-12-24 12:00:00 1
2019-12-31 10:00:00 1
but when I used cum[cum.duplicated()].groupby(cum[cum.duplicated()]).count(), the datetime index was lost.
Is there a better solution for this, or a way to interpolate case by case? Any help is appreciated.
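One possible sketch, building on the notna().cumsum() idea above (assuming a single 'PM2.5' column with a datetime index; the same logic can be run per column):
s = df['PM2.5']
# Consecutive NaNs share the same cumsum value, so this labels each gap
gap_id = s.notna().cumsum()[s.isna()]
# Gap size at every NaN position
gap_size = gap_id.map(gap_id.value_counts())
# First NaN of each gap, with its gap size (the table asked for above)
first_of_gap = gap_size[~gap_id.duplicated()]
# Interpolate only the gaps of size 1
single = gap_size[gap_size == 1].index
df.loc[single, 'PM2.5'] = s.interpolate().loc[single]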

how to add values to specific date in pandas?

So I have a dataset where every value comes with a specific date. I want to fill these values, according to their date, into an Excel sheet that contains the date range of the whole year. The dates start at 01-01-2020 00:00:00 and end at 31-12-2020 23:45:00 with a frequency of 15 minutes, so there are 35136 date-time values in total (2020 is a leap year).
my data is like:
load date
12 01-02-2020 06:30:00
21 29-04-2020 03:45:00
23 02-07-2020 12:15:00
54 07-08-2020 16:00:00
23 22-09-2020 16:30:00
As you can see, these values are not continuous, but each has a specific date attached, so I want to use these dates as the index, put each value at its particular date in the Excel range, and put zero in the missing values. Can someone please help?
Use DataFrame.reindex with date_range, which inserts 0 for all datetimes that do not exist in the data:
rng = pd.date_range('2020-01-01','2020-12-31 23:45:00', freq='15Min')
df['date'] = pd.to_datetime(df['date'], dayfirst=True)  # dates are day-first, e.g. 29-04-2020
df = df.set_index('date').reindex(rng, fill_value=0)
print (df)
load
2020-01-01 00:00:00 0
2020-01-01 00:15:00 0
2020-01-01 00:30:00 0
2020-01-01 00:45:00 0
2020-01-01 01:00:00 0
...
2020-12-31 22:45:00 0
2020-12-31 23:00:00 0
2020-12-31 23:15:00 0
2020-12-31 23:30:00 0
2020-12-31 23:45:00 0
[35136 rows x 1 columns]
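If the end goal is the Excel sheet itself, a possible follow-up (hypothetical file name; requires openpyxl or another Excel engine) is:
# Turn the datetime index back into a 'date' column and write it out
df.rename_axis('date').reset_index().to_excel('load_2020.xlsx', index=False)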

How to make a time index in dataframe pandas with 15 minutes spacing

How do I make a time index in a pandas dataframe with 15-minute spacing for 24 hours, without the date part (12\4\2020 00:15) and without doing it manually?
For example, all I want as an index is 00:15, 00:30, 00:45, ..., 23:45, 00:00.
You can use pd.date_range to create dummy dates with your desired time frequency, then just extract them:
pd.Series(pd.date_range('1/1/2020', '1/2/2020', freq='15min', closed='left')).dt.time
0 00:00:00
1 00:15:00
2 00:30:00
3 00:45:00
4 01:00:00
...
91 22:45:00
92 23:00:00
93 23:15:00
94 23:30:00
95 23:45:00
Length: 96, dtype: object
You can use to_timedelta with an array of numbers; here I chose minutes.
import numpy as np
pd.to_timedelta(np.arange(0, 24*60, 15), unit='min')
#TimedeltaIndex(['00:00:00', '00:15:00', '00:30:00', '00:45:00', '01:00:00',
# ....
# '23:45:00'],
# dtype='timedelta64[ns]', freq=None)
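A quick usage sketch (hypothetical 96-row df, one row per 15 minutes) attaching the first result as the index:
times = pd.Series(pd.date_range('2020-01-01', periods=96, freq='15min')).dt.time
df.index = times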

Pandas Add one day after midnight

I am trying to add one day to times after midnight.
I have a datetime64 column in a pandas DataFrame.
Originally, my csv file only has times like 12:13:00, 07:12:53, 02:33:27.
I wanted to add a date to the time because the file name contains a date. The thing is that I have to add one day to times after midnight.
Here's an example.
This is original data with the file name mycsv_20180101.csv
time
22:00:00
23:00:00
03:00:00
This is what I want.
time
2018-01-01 22:00:00
2018-01-01 23:00:00
2018-01-02 03:00:00 # this is the point.
Is there any idea for it?
I've thought about it for a while, and my idea is to first add the date, and secondly:
df['time'].apply(lambda x: x + pd.to_timedelta('1d') if x.dt.hour < 6 else False) # before 6am, I assume it's the next day
but it says 'The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().' and I don't know why.
Thank you for your help in advance.
Suppose your dataframe and date from file are like these:
df = pd.DataFrame({'time': ["18:10:00", "19:10:00", "20:10:00", "21:10:00", "22:10:00",
                            "23:10:00", "00:10:00", "01:10:00", "02:10:00", "03:10:00"]})
file_date = '20180101'
You first need to add file_date to your data:
df.time = df.time.apply(lambda x: ' '.join((file_date, x)))
which yields:
time
0 20180101 18:10:00
1 20180101 19:10:00
2 20180101 20:10:00
3 20180101 21:10:00
4 20180101 22:10:00
5 20180101 23:10:00
6 20180101 00:10:00
7 20180101 01:10:00
8 20180101 02:10:00
9 20180101 03:10:00
What you need to do now is convert the column to datetime and add a day when the hour is smaller than 4.
df.time = pd.to_datetime(df.time).apply(lambda x: x + pd.DateOffset(days=1) if x.hour <=3 else x)
which gives your desired output of:
time
0 2018-01-01 18:10:00
1 2018-01-01 19:10:00
2 2018-01-01 20:10:00
3 2018-01-01 21:10:00
4 2018-01-01 22:10:00
5 2018-01-01 23:10:00
6 2018-01-02 00:10:00
7 2018-01-02 01:10:00
8 2018-01-02 02:10:00
9 2018-01-02 03:10:00
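A vectorized alternative, as a sketch (same assumption: hours 00-03 belong to the next day), avoids the row-wise apply:
t = pd.to_datetime(df.time)
# Boolean mask cast to 0/1 days and added to every timestamp
df.time = t + pd.to_timedelta((t.dt.hour <= 3).astype(int), unit='D')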
