Subtract smaller time intervals from a big time interval range - python

I want to obtain the time intervals that remain after subtracting a set of time windows from a timeline. Is there an efficient way to do this using pandas Interval and Period objects?
I looked for a solution using the pandas Period and Interval classes on SO but could not find one, perhaps because Intervals are immutable objects in pandas (ref https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Interval.html#pandas-interval).
I found a relevant solution using a 3rd-party library, Subtract Overlaps Between Two Ranges Without Sets, but it does not specifically deal with datetime or Timestamp objects.
import pandas as pd

start = pd.Timestamp('00:00:00')
end = pd.Timestamp('23:59:00')

# input
big_time_interval = pd.Interval(start, end)
smaller_time_intervals_to_subtract = [
    (pd.Timestamp('01:00:00'), pd.Timestamp('02:00:00')),
    (pd.Timestamp('16:00:00'), pd.Timestamp('17:00:00'))]

# output
_output_time_intervals = [
    (pd.Timestamp('00:00:00'), pd.Timestamp('01:00:00')),
    (pd.Timestamp('02:00:00'), pd.Timestamp('16:00:00')),
    (pd.Timestamp('17:00:00'), pd.Timestamp('23:59:00'))]
output_time_intervals = list(
    map(lambda interval: pd.Interval(*interval), _output_time_intervals))
Any help would be appreciated.
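One possible approach is a simple sweep over the windows sorted by start time; the sketch below uses a hypothetical helper, subtract_intervals, and assumes the windows lie inside the big interval:

import pandas as pd

def subtract_intervals(big, windows):
    # Sweep left to right: emit the gap before each window, then skip past it.
    pieces = []
    cursor = big.left
    for w_start, w_end in sorted(windows):
        if w_start > cursor:
            pieces.append(pd.Interval(cursor, w_start))
        cursor = max(cursor, w_end)
    if cursor < big.right:
        pieces.append(pd.Interval(cursor, big.right))
    return pieces

output_time_intervals = subtract_intervals(
    big_time_interval, smaller_time_intervals_to_subtract)

Sorting first, and advancing the cursor with max, means unordered or overlapping windows are handled without producing negative-width intervals.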

Related

Filter Dask DataFrame rows by specific values of index

Is there an effective way to select specific rows in a Dask DataFrame?
I would like to get only those rows whose index is in a given set (using the isin function is not efficient enough for me).
Are there any more effective solutions than
ddf.loc[ddf.index.isin(list_of_index_values)]
ddf.loc[~ddf.index.isin(list_of_index_values)]
?
You can use the query method. You haven't provided a usable example, but the format would be something like this:
list_of_index_values = [6, 3]
ddf.query('column in @list_of_index_values')
EDIT: Just for fun, I did this in pandas, but I wouldn't expect much variance.
No clue what's stored in the index, but I assumed int.
from random import randint
import pandas as pd
from datetime import datetime as dt

# build huge random dataset
lst = []
for i in range(100000000):
    lst.append(randint(0, 100000))

# build huge random index
index = []
for i in range(1000000):
    index.append(randint(0, 100000))

df = pd.DataFrame(lst, columns=['values'])

isin = dt.now()
df[df['values'].isin(index)]
print(f'total execution time for isin {dt.now()-isin}')

query = dt.now()
df.query('values in @index')
print(f'total execution time for query {dt.now()-query}')
# total execution time for isin 0:01:22.914507
# total execution time for query 0:01:13.794499
If your index is sequential, however:
time = dt.now()
df[df['values']>100000]
print(dt.now()-time)
# 0:00:00.128209
It's not even close. You can even build out a range
time = dt.now()
df[(df['values']>100000) | (df['values'] < 500)]
print(dt.now()-time)
# 0:00:00.650321
Obviously the third method isn't always an option, but it's something to keep in mind if speed is a priority and you just need indices between two values or some such.
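Since the question is about Dask specifically, here is a minimal sketch of the same query idea on a Dask DataFrame (dask.dataframe exposes a query method that runs per partition; the toy frame and the column name idx are assumptions):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'idx': range(10), 'val': range(10)})  # hypothetical toy frame
ddf = dd.from_pandas(pdf, npartitions=2)
# inline the literal list to sidestep @-variable scoping across partitions
result = ddf.query('idx in [6, 3]').compute()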

Interpolating from a pandas DataFrame or Series to a new DatetimeIndex

Let's say I have an hourly series in pandas; it's fine to assume the source is regular, but it is gappy. If I want to interpolate it to 15 min, the pandas API provides resample('15min').interpolate('cubic'). It interpolates to the new times and provides some control over the limits of interpolation. The spline helps to refine the series as well as fill small gaps. To be concrete:
import numpy as np
import pandas as pd

tndx = pd.date_range(start="2019-01-01", end="2019-01-10", freq="H")
tnum = np.arange(0., len(tndx))
signal = np.cos(tnum*2.*np.pi/24.)
signal[80:85] = np.nan      # too wide a gap
signal[160:168:2] = np.nan  # these can be interpolated
df = pd.DataFrame({"signal": signal}, index=tndx)
df1 = df.resample('15min').interpolate('cubic', limit=9)
Now let's say I have an irregular datetime index. In the example below, the first time is a regular time point, the second is in the big gap and the last is in the interspersed brief gaps.
tndx2 = pd.DatetimeIndex(['2019-01-04 00:00', '2019-01-04 10:17', '2019-01-07 16:00'])
How do I interpolate from the original (hourly) series to this irregular series of times?
Is the only option to build a series that includes the original data and the destination data? How would I do this? What is the most economical way to achieve the goals of interpolating to an independent irregular index and imposing a gap limit?
In the case of irregular timestamps, first set the datetime as the index, and then you can use the interpolate method against the index: df1 = df.resample('15min').interpolate('index')
You can find more information here https://pandas.pydata.org/pandas-docs/version/0.16.2/generated/pandas.DataFrame.interpolate.html
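As a minimal sketch of what that looks like on toy data (assuming the irregular timestamps are already the index), method='index' weights by the actual time spacing rather than by row position:

import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(['2019-01-01 00:00', '2019-01-01 00:07',
                        '2019-01-01 00:12', '2019-01-01 00:30'])
s = pd.Series([0.0, np.nan, 1.2, 3.0], index=idx)
filled = s.interpolate(method='index')  # fills the NaN using elapsed time, not row count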
This is an example solution within the pandas interpolate API, which doesn't seem to have a way of using abscissa and values from the source series to interpolate to new times provided by the destination index as a separate data structure. This method solves that by tacking the destination onto the source. It makes use of the limit argument of df.interpolate and can use any interpolation algorithm from that API, but it isn't perfect: the limit is in terms of the number of values, so if there are a lot of destination points in a patch of NaNs, those get counted as well.
import numpy as np
import pandas as pd

tndx = pd.date_range(start="2019-01-01", end="2019-01-10", freq="H")
tnum = np.arange(0., len(tndx))
signal = np.cos(tnum*2.*np.pi/24.)
signal[80:85] = np.nan
signal[160:168:2] = np.nan
df = pd.DataFrame({"signal": signal}, index=tndx)

# Express the destination times as a dataframe and append to the source
tndx2 = pd.DatetimeIndex(['2019-01-04 00:00', '2019-01-04 10:17', '2019-01-07 16:00'])
df2 = pd.DataFrame({"signal": [np.nan, np.nan, np.nan]}, index=tndx2)
big_df = pd.concat([df, df2], sort=True)  # df.append was removed in pandas 2.0

# At this point there are duplicates with NaN values at the bottom of the DataFrame
# representing the destination points. If these are surrounded by lots of NaNs in the
# source frame and we want the limit argument to work in the call to interpolate,
# the frame has to be sorted and duplicates removed.
big_df = big_df.loc[~big_df.index.duplicated(keep='first')].sort_index(axis=0, level=0)

# Extract at destination locations
interpolated = big_df.interpolate(method='cubic', limit=3).loc[tndx2]

Finding the time overlaps in some time ranges in python

I have two time ranges in Python and I would like to find whether there is any overlap between them. I am looking for an algorithm for that.
For instance I have the following time ranges:
r1 = start=(15:30:43), end=(16:30:56)
r2 = start=(15:40:35), end=(15:50:20)
How can I find the overlap between them in Python?
You could use DatetimeIndex objects from the pandas package as follows:
import pandas as pd
# create DatetimeIndex objects with *seconds* resolution
dtidx1 = pd.date_range('15:30:43', '16:30:56', freq='S')
dtidx2 = pd.date_range('15:40:35', '15:50:20', freq='S')
# use the DatetimeIndex.intersection method to get another
# DatetimeIndex object whose values are what you requested
dtidx1.intersection(dtidx2)
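If you only need a yes/no answer rather than the shared timestamps themselves, a lighter-weight sketch is the standard interval test, max(starts) < min(ends) (variable names are hypothetical):

import pandas as pd

r1_start, r1_end = pd.Timestamp('15:30:43'), pd.Timestamp('16:30:56')
r2_start, r2_end = pd.Timestamp('15:40:35'), pd.Timestamp('15:50:20')

# two ranges overlap iff the later start comes before the earlier end
overlap = max(r1_start, r2_start) < min(r1_end, r2_end)
# and when they do overlap, this is the overlapping window itself
window = (max(r1_start, r2_start), min(r1_end, r2_end))

This avoids materializing one index entry per second, which matters for long ranges.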

Downsample timeseries based on another (irregular) timeseries, pandas

I illustrate my question with the following example.
I have two pandas dataframes.
The first has ten-second timesteps and is continuous. Example data for two days:
import pandas as pd
import random

t_10s = pd.date_range(start='1/1/2018', end='1/3/2018', freq='10s')
t_10s = pd.DataFrame(columns=['b'],
                     data=[random.randint(0, 10) for _ in range(len(t_10s))],
                     index=t_10s)
The next dataframe has five-minute timesteps, but there is only data during the daytime, and the logging starts at a different time each morning. Example data for two days, starting at two different times in the morning to resemble the real data:
t_5m1 = pd.date_range(start='1/1/2018 08:08:30', end='1/1/2018 18:03:30', freq='5min')
t_5m2 = pd.date_range(start='1/2/2018 08:10:25', end='1/2/2018 18:00:25', freq='5min')
t_5m = t_5m1.append(t_5m2)
t_5m = pd.DataFrame(columns=['a'],
                    data=[0 for _ in range(len(t_5m))],
                    index=t_5m)
Now what I want to do is, for each datapoint x in t_5m, find the average of the t_10s data in a five-minute window surrounding x.
Now, I have found a way to do this with a list-comprehension as follows:
tstep = pd.to_timedelta(2.5, 'm')
t_5m['avg'] = [t_10s.loc[(t_10s.index >= t_5m.index[i] - tstep) &
                         (t_10s.index < t_5m.index[i] + tstep)].b.mean()
               for i in range(0, len(t_5m))]
However, I want to do this for a timeseries spanning at least two years and for many columns (not just b as here; the current solution is to for-loop over the relevant columns), and the code then gets very slow. Can anyone think of a trick to do this more efficiently? I have thought about using resample or groupby. That would work if I had a regular 5-minute interval, but since it is irregular between days, I cannot make it work. Grateful for any input!
Have looked around some, e.g. here, but couldn't find what I need.
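One possible vectorized approach, sketched under the assumption of pandas >= 1.3 (needed for center=True with a time-based window): compute a centered 5-minute rolling mean on the regular 10-second grid, then pick the grid point nearest each irregular timestamp. This handles all columns at once, and is approximate only in that the window center can sit up to 5 seconds away from each t_5m timestamp:

# centered time-window rolling mean over the regular 10-second series
rolled = t_10s.rolling('5min', center=True).mean()
# align to the irregular 5-minute index by nearest 10-second grid point
t_5m['avg'] = rolled.reindex(t_5m.index, method='nearest')['b']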

Python: Combine two data series with different length corresponding to date

I have two different time series. They both start and end on the same date, but one of them is longer than the other. As I would like to do a regression, I need to combine the time series and find the dates which are missing.
How would I do this in an easy way? At the moment I am trying to use the concatenate function.
A set union will solve your problem of combining two lists of timestamps into one deduplicated list:
>>> from datetime import datetime as dt
>>> theSameDates1 = [dt.now().isoformat()]*3
>>> theSameDates1
['2015-09-09T11:33:59.989000', '2015-09-09T11:33:59.989000', '2015-09-09T11:33:59.989000']
>>> theSameDates2 = [x for x in theSameDates1]*2
>>> list(set(theSameDates1)|set(theSameDates2))
['2015-09-09T11:33:59.989000']
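If the goal is regression-ready alignment rather than a deduplicated list, a pandas-based sketch (hypothetical toy series) aligns both series on the union of their DatetimeIndexes, which exposes the missing dates as NaN:

import pandas as pd

long_idx = pd.date_range('2015-01-01', '2015-01-10', freq='D')
short_idx = long_idx[::2]  # same span, fewer dates
s_long = pd.Series(range(len(long_idx)), index=long_idx)
s_short = pd.Series(range(len(short_idx)), index=short_idx)

# outer-align on the date index; rows missing from the short series become NaN
combined = pd.concat([s_long, s_short], axis=1, keys=['long', 'short'])
missing = long_idx.difference(short_idx)  # the dates absent from the short series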
