Finding the time overlaps in some time ranges in Python

I have two time ranges in Python and I would like to find whether there is any overlap between them. I am looking for an algorithm for that.
For instance I have the following time ranges:
r1 = start=(15:30:43), end=(16:30:56)
r2 = start=(15:40:35), end=(15:50:20)
How can I find the overlap between them in Python?

You could use DatetimeIndex objects from the pandas package as follows:
import pandas as pd
# create DatetimeIndex objects with *seconds* resolution
dtidx1 = pd.date_range('15:30:43', '16:30:56', freq='S')
dtidx2 = pd.date_range('15:40:35', '15:50:20', freq='S')
# use the DatetimeIndex.intersection method to get another
# DatetimeIndex object whose values are what you requested
dtidx1.intersection(dtidx2)
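If all you need is to test whether (and where) the two ranges overlap, rather than materialising every second of the intersection, comparing the endpoints directly is enough. A minimal sketch with plain datetime.time objects (variable names mirror the question):
from datetime import time

r1 = (time(15, 30, 43), time(16, 30, 56))
r2 = (time(15, 40, 35), time(15, 50, 20))

# the overlap, if any, runs from the later start to the earlier end
latest_start = max(r1[0], r2[0])
earliest_end = min(r1[1], r2[1])

if latest_start < earliest_end:
    print("overlap from", latest_start, "to", earliest_end)
else:
    print("no overlap")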

Related

Filter Dask DataFrame rows by specific values of index

Is there an effective solution to select specific rows in a Dask DataFrame?
I would like to get only those rows whose index is in a given set (using the isin function is not efficient enough for me).
Are there any other effective solutions besides
ddf.loc[ddf.index.isin(list_of_index_values)]
ddf.loc[~ddf.index.isin(list_of_index_values)]
?
You can use the query method. You haven't provided a usable example, but the format would be something like this:
list_of_index_values = [6, 3]
ddf.query('column in @list_of_index_values')
EDIT: Just for fun, I did this in pandas, but I wouldn't expect much variance.
No clue what's stored in the index, but I assumed int.
from random import randint
import pandas as pd
from datetime import datetime as dt
# build a huge random dataset
lst = []
for i in range(100000000):
    lst.append(randint(0, 100000))
# build a huge random list of lookup values
index = []
for i in range(1000000):
    index.append(randint(0, 100000))
df = pd.DataFrame(lst, columns=['values'])
isin = dt.now()
df[df['values'].isin(index)]
print(f'total execution time for isin {dt.now()-isin}')
query = dt.now()
df.query('values in @index')
print(f'total execution time for query {dt.now()-query}')
# total execution time for isin 0:01:22.914507
# total execution time for query 0:01:13.794499
If your index is sequential, however:
time = dt.now()
df[df['values']>100000]
print(dt.now()-time)
# 0:00:00.128209
It's not even close. You can even build out a range
time = dt.now()
df[(df['values']>100000) | (df['values'] < 500)]
print(dt.now()-time)
# 0:00:00.650321
Obviously the third method isn't always an option, but it's something to keep in mind if speed is a priority and you just need index values between two bounds, or some such.
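Since the original question was about Dask rather than pandas, it may help to note that dask.dataframe also provides a query method; because the expression is evaluated inside tasks, it is safer to hand the lookup list over explicitly with local_dict than to rely on @ picking it up from the calling scope. A minimal sketch (the file and column names are made up):
import dask.dataframe as dd

list_of_index_values = [6, 3]
ddf = dd.read_csv('data.csv')  # hypothetical input file
subset = ddf.query(
    'some_column in @list_of_index_values',
    local_dict={'list_of_index_values': list_of_index_values},
)
print(subset.compute())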

Subtract smaller time intervals from a big time interval range

I want to obtain the time intervals that remain after subtracting the time windows from the timeline. Is there an efficient way to do this using pandas Interval and Period objects?
I tried looking for a solution using the pandas Period and Interval classes on SO but could not find one, maybe because Intervals are immutable objects in pandas (ref https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Interval.html#pandas-interval).
I found a relevant solution using a third-party library in Subtract Overlaps Between Two Ranges Without Sets, but it does not deal specifically with datetime or Timestamp objects.
import pandas as pd

start = pd.Timestamp('00:00:00')
end = pd.Timestamp('23:59:00')

# input
big_time_interval = pd.Interval(start, end)
smaller_time_intervals_to_subtract = [
    (pd.Timestamp('01:00:00'), pd.Timestamp('02:00:00')),
    (pd.Timestamp('16:00:00'), pd.Timestamp('17:00:00'))]

# output
_output_time_intervals = [
    (pd.Timestamp('00:00:00'), pd.Timestamp('01:00:00')),
    (pd.Timestamp('02:00:00'), pd.Timestamp('16:00:00')),
    (pd.Timestamp('17:00:00'), pd.Timestamp('23:59:00'))]
output_time_intervals = list(
    map(lambda interval: pd.Interval(*interval), _output_time_intervals))
Any help would be appreciated.
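One straightforward approach, assuming the windows to subtract are sortable and do not overlap each other, is a single sweep across the big interval; a minimal sketch of that idea:
import pandas as pd

def subtract_intervals(big, windows):
    """Return the pieces of `big` (a pd.Interval) left after removing
    `windows`, given as (start, end) Timestamp pairs."""
    pieces = []
    cursor = big.left
    for start, end in sorted(windows):
        if start > cursor:
            pieces.append(pd.Interval(cursor, start))
        cursor = max(cursor, end)
    if cursor < big.right:
        pieces.append(pd.Interval(cursor, big.right))
    return pieces

big_time_interval = pd.Interval(pd.Timestamp('00:00:00'), pd.Timestamp('23:59:00'))
windows = [(pd.Timestamp('01:00:00'), pd.Timestamp('02:00:00')),
           (pd.Timestamp('16:00:00'), pd.Timestamp('17:00:00'))]
print(subtract_intervals(big_time_interval, windows))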

DASK: Is there an equivalent of numpy.select for dask?

I'm using Dask to load an 11m-row CSV into a dataframe and perform calculations. I've reached a position where I need conditional logic: if this, then that, else other.
If I were to use pandas, for example, I could do the following, where a numpy select statement is used along with an array of conditions and results. This statement takes about 35 seconds to run - not bad, but not great:
df["AndHeathSolRadFact"] = np.select(
[
(df['Month'].between(8,12)),
(df['Month'].between(1,2) & df['CloudCover']>30) #Array of CONDITIONS
], #list of conditions
[1, 1], #Array of RESULTS (must match conditions)
default=0) #DEFAULT if no match
What I am hoping to do is use dask to do this, natively, in a dask dataframe, without having to first convert my dask dataframe to a pandas dataframe, and then back again.
This allows me to:
- Use multithreading
- Use a dataframe that is larger than available RAM
- Potentially speed up the result.
Sample CSV
Location,Date,Temperature,RH,WindDir,WindSpeed,DroughtFactor,Curing,CloudCover
1075,2019-20-09 04:00,6.8,99.3,143.9,5.6,10.0,93.0,1.0
1075,2019-20-09 05:00,6.4,100.0,93.6,7.2,10.0,93.0,1.0
1075,2019-20-09 06:00,6.7,99.3,130.3,6.9,10.0,93.0,1.0
1075,2019-20-09 07:00,8.6,95.4,68.5,6.3,10.0,93.0,1.0
1075,2019-20-09 08:00,12.2,76.0,86.4,6.1,10.0,93.0,1.0
Full Code for minimum viable sample
# Dataframes implement the Pandas API
import dask.dataframe as dd
import dask.multiprocessing
import dask.threaded
import pandas as pd
import numpy as np
from timeit import default_timer as timer

start = timer()
ddf = dd.read_csv(r'C:\Users\i5-Desktop\Downloads\Weathergrids.csv')
# (assumed intermediate step: materialise as pandas so the numpy logic can run on it)
df = ddf.compute()
#Convert back to a Dask dataframe because we want that juicy parallelism
ddf2 = dd.from_pandas(df, npartitions=4)
del [df]
print(ddf2.head())
#print(ddf.tail())
end = timer()
print(end - start)
#Clean up remaining dataframes
del [[ddf2]]
So, the answer I was able to come up with that was the most performant was:
#Create a helper column where we store the value we want to set the column to later.
ddf['Helper'] = 1
#Create the column where we will be setting values, and give it a default value
ddf['AndHeathSolRadFact'] = 0
#Break the logic out into separate where clauses. Rather than looping we will be selecting those rows
#where the conditions are met and then set the value we want. We are required to use the helper
#column value because we cannot set values directly, but we can match from another column.
#First, a very simple clause. If Temperature is greater than or equal to 8, make
#AndHeathSolRadFact equal to the value in Helper
#Note that at the end, after the comma, we preserve the existing cell value if the condition is not met
ddf['AndHeathSolRadFact'] = (ddf.Helper).where(ddf.Temperature >= 8, ddf.AndHeathSolRadFact)
#A more complex example
#this is the same as the above, but demonstrates how to use a compound select statement where
#we evaluate multiple conditions and then set the value.
ddf['AndHeathSolRadFact'] = (ddf.Helper).where(((ddf.Temperature == 6.8) & (ddf.RH == 99.3)), ddf.AndHeathSolRadFact)
I'm a newbie at this, but I'm assuming this approach counts as being vectorised. It makes full use of the array and evaluates very quickly.
Adding the new column, filling it with 0, evaluating both select statements and replacing the values in the target rows only added 0.2s to the processing time on an 11m row dataset with npartitions = 4.
My former approach, and similar approaches in pandas, took 45 seconds or so.
The only thing left to do is to remove the helper column once we're done. Currently, I'm not sure how to do this.
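For that last step, Dask's DataFrame.drop accepts a columns argument, so something like the following one-liner should remove the helper column (column name as used above):
ddf = ddf.drop(columns=['Helper'])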
It sounds like you're looking for dd.Series.where.
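To sketch what that could look like against the question's logic (this assumes a Month column has already been derived from Date; column names follow the sample data, and the path is illustrative):
import dask.dataframe as dd

# assumes 'Month' has been derived from the 'Date' column beforehand
ddf = dd.read_csv('Weathergrids.csv')

cond = (ddf['Month'].between(8, 12)) | (
    (ddf['Month'].between(1, 2)) & (ddf['CloudCover'] > 30))

# start from the default, then replace it with 1 wherever the condition holds
ddf['AndHeathSolRadFact'] = 0
ddf['AndHeathSolRadFact'] = ddf['AndHeathSolRadFact'].where(~cond, 1)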

Interpolating from a pandas DataFrame or Series to a new DatetimeIndex

Let's say I have an hourly series in pandas; it's fine to assume the source is regular, but it is gappy. If I want to interpolate it to 15min, the pandas API provides resample('15min').interpolate('cubic'). It interpolates to the new times and provides some control over the limits of interpolation. The spline helps to refine the series as well as fill small gaps. To be concrete:
import numpy as np
import pandas as pd

tndx = pd.date_range(start="2019-01-01",end="2019-01-10",freq="H")
tnum = np.arange(0.,len(tndx))
signal = np.cos(tnum*2.*np.pi/24.)
signal[80:85] = np.nan # too wide a gap
signal[160:168:2] = np.nan # these can be interpolated
df = pd.DataFrame({"signal":signal},index=tndx)
df1 = df.resample('15min').interpolate('cubic',limit=9)
Now let's say I have an irregular datetime index. In the example below, the first time is a regular time point, the second is in the big gap and the last is in the interspersed brief gaps.
tndx2 = pd.DatetimeIndex(['2019-01-04 00:00','2019-01-04 10:17','2019-01-07 16:00'])
How do I interpolate from the original (hourly) series to this irregular series of times?
Is the only option to build a series that includes the original data and the destination data? How would I do this? What is the most economical way to achieve the goals of interpolating to an independent irregular index and imposing a gap limit?
In the case of irregular timestamps, first set the datetime as the index, and then you can use the interpolate method with the index-based method: df1 = df.resample('15min').interpolate('index')
You can find more information here: https://pandas.pydata.org/pandas-docs/version/0.16.2/generated/pandas.DataFrame.interpolate.html
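Building on that idea, a minimal sketch of interpolating straight to the irregular destination times, by reindexing onto the union of the two indexes and interpolating against the index values (no gap limit applied here):
import numpy as np
import pandas as pd

tndx = pd.date_range(start="2019-01-01", end="2019-01-10", freq="H")
signal = np.cos(np.arange(0., len(tndx)) * 2. * np.pi / 24.)
df = pd.DataFrame({"signal": signal}, index=tndx)
tndx2 = pd.DatetimeIndex(['2019-01-04 00:00', '2019-01-04 10:17', '2019-01-07 16:00'])

# reindex to the union of source and destination times, interpolate on the
# index values, then keep only the destination rows
result = (df.reindex(df.index.union(tndx2))
            .interpolate(method='index')
            .loc[tndx2])
print(result)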
This is an example solution within the pandas interpolate API, which doesn't seem to have a way of taking the abscissa and values from the source series and interpolating to new times provided by the destination index as a separate data structure. This method works around that by tacking the destination onto the source. It makes use of the limit argument of df.interpolate and can use any interpolation algorithm from that API, but it isn't perfect, because the limit is counted in number of values, and if there are a lot of destination points inside a patch of NaNs those get counted as well.
import numpy as np
import pandas as pd

tndx = pd.date_range(start="2019-01-01",end="2019-01-10",freq="H")
tnum = np.arange(0.,len(tndx))
signal = np.cos(tnum*2.*np.pi/24.)
signal[80:85] = np.nan
signal[160:168:2] = np.nan
df = pd.DataFrame({"signal":signal},index=tndx)
# Express the destination times as a dataframe and append to the source
tndx2 = pd.DatetimeIndex(['2019-01-04 00:00','2019-01-04 10:17','2019-01-07 16:00'])
df2 = pd.DataFrame( {"signal": [np.nan,np.nan,np.nan]} , index = tndx2)
big_df = pd.concat([df, df2], sort=True)  # DataFrame.append was removed in recent pandas
# At this point there are duplicates with NaN values at the bottom of the DataFrame
# representing the destination points. If these are surrounded by lots of NaNs in the source frame
# and we want the limit argument to work in the call to interpolate, the frame has to be sorted and duplicates removed.
big_df = big_df.loc[~big_df.index.duplicated(keep='first')].sort_index(axis=0,level=0)
# Extract at destination locations
interpolated = big_df.interpolate(method='cubic',limit=3).loc[tndx2]

Python: Combine two data series with different length corresponding to date

I have two different time series. They both start and end on the same date, but one of the time series is longer than the other. As I would like to do a regression, I need to combine the time series and find the dates that are missing.
How would I do this in an easy way? At the moment I am trying to use the concatenate function.
A set intersection will solve your problem of reducing the two lists of timestamps to the dates they have in common:
>>> from datetime import datetime as dt
>>> theSameDates1 = [dt.now().isoformat()]*3
>>> theSameDates1
['2015-09-09T11:33:59.989000', '2015-09-09T11:33:59.989000', '2015-09-09T11:33:59.989000']
>>> theSameDates2 = [x for x in theSameDates1]*2
>>> list(set(theSameDates1)&set(theSameDates2))
['2015-09-09T11:33:59.989000']
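If the two series are pandas objects indexed by date, the same idea can be expressed with index set operations, which also keeps the values aligned for the regression; a minimal sketch with made-up data:
import numpy as np
import pandas as pd

# two series over the same span, one with more dates than the other
long_idx = pd.date_range('2015-01-01', '2015-01-10', freq='D')
short_idx = long_idx[::2]
s_long = pd.Series(np.arange(len(long_idx), dtype=float), index=long_idx)
s_short = pd.Series(np.arange(len(short_idx), dtype=float), index=short_idx)

common = s_long.index.intersection(s_short.index)  # dates present in both
missing = s_long.index.difference(s_short.index)   # dates missing from the shorter series

aligned = pd.concat({'long': s_long.loc[common], 'short': s_short.loc[common]}, axis=1)
print(missing)
print(aligned)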
