I've got two dataframes (very long, with hundreds or thousands of rows each). One of them, called df1, contains a timeseries, in intervals of 10 minutes. For example:
date value
2016-11-24 00:00:00 1759.199951
2016-11-24 00:10:00 992.400024
2016-11-24 00:20:00 1404.800049
2016-11-24 00:30:00 45.799999
2016-11-24 00:40:00 24.299999
2016-11-24 00:50:00 159.899994
2016-11-24 01:00:00 82.499999
2016-11-24 01:10:00 37.400003
2016-11-24 01:20:00 159.899994
....
And the other one, df2, contains datetime intervals:
start_date end_date
0 2016-11-23 23:55:32 2016-11-24 00:14:03
1 2016-11-24 01:03:18 2016-11-24 01:07:12
2 2016-11-24 01:11:32 2016-11-24 02:00:00
...
I need to select all the rows in df1 that "fall" into an interval in df2.
With these examples, the result dataframe should be:
date value
2016-11-24 00:00:00 1759.199951 # Fits in row 0 of df2
2016-11-24 00:10:00 992.400024 # Fits in row 0 of df2
2016-11-24 01:00:00 82.499999 # Fits in row 1 of df2
2016-11-24 01:10:00 37.400003 # Fits in row 2 of df2
2016-11-24 01:20:00 159.899994 # Fits in row 2 of df2
....
Using np.searchsorted:
Here's a variation based on np.searchsorted that seems to be an order of magnitude faster than using intervaltree or merge, assuming my larger sample data is correct.
# Ensure the df2 is sorted (skip if it's already known to be).
df2 = df2.sort_values(by=['start_date', 'end_date'])
# Add the end of the time interval to df1.
df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
# Perform the searchsorted and get the corresponding df2 values for both endpoints of df1.
s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
# Build the conditions that indicate an overlap (any True condition indicates an overlap).
cond = [
    df1['date'].values <= s1['end_date'].values,
    df1['date_end'].values <= s2['end_date'].values,
    s1.index.values != s2.index.values
]
# Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
df1 = df1[np.any(cond, axis=0)].drop('date_end', axis=1)
This may need to be modified if the intervals in df2 are nested or overlapping; I haven't fully thought it through in that scenario, but it may still work.
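The core of this approach is `np.searchsorted`'s `side='right'` semantics; a minimal standalone sketch (the two start dates are hypothetical, shaped like df2's):

```python
import numpy as np

# With side='right', searchsorted returns the insertion point *after* any
# equal values, so subtracting 1 yields the index of the last start_date
# that is <= the query time.
starts = np.array(['2016-11-24 00:00:19', '2016-11-24 01:03:18'],
                  dtype='datetime64[s]')
query = np.datetime64('2016-11-24 01:03:18', 's')
idx = np.searchsorted(starts, query, side='right') - 1
print(idx)  # 1 -> the interval starting at 01:03:18 is the candidate to test
```

The candidate interval still has to be checked against its `end_date`, which is what the `cond` list above does.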
Using an Interval Tree
Not quite a pure Pandas solution, but you may want to consider building an Interval Tree from df2, and querying it against your intervals in df1 to find the ones that overlap.
The intervaltree package on PyPI seems to have good performance and an easy-to-use syntax.
from intervaltree import IntervalTree
# Build the Interval Tree from df2.
tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])
# Build the 10 minutes spans from df1.
dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)
# Query the Interval Tree to filter df1.
df1 = df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]
I converted the dates to their integer equivalents for performance reasons. I doubt the intervaltree package was built with pd.Timestamp in mind, so there are probably some intermediate conversion steps that slow things down a bit.
Also, note that intervals in the intervaltree package do not include the end point, although the start point is included. That's why I have the + [0, 1] when creating tree; I'm padding the end point by a nanosecond to make sure the real end point is actually included. It's also the reason why it's fine for me to add pd.offsets.Minute(10) to get the interval end when querying the tree, instead of adding only 9m 59s.
The resulting output for either method:
date value
0 2016-11-24 00:00:00 1759.199951
1 2016-11-24 00:10:00 992.400024
6 2016-11-24 01:00:00 82.499999
7 2016-11-24 01:10:00 37.400003
8 2016-11-24 01:20:00 159.899994
Timings
Using the following setup to produce larger sample data:
# Sample df1.
n1 = 55000
df1 = pd.DataFrame({'date': pd.date_range('2016-11-24', freq='10T', periods=n1), 'value': np.random.random(n1)})
# Sample df2.
n2 = 500
df2 = pd.DataFrame({'start_date': pd.date_range('2016-11-24', freq='18H22T', periods=n2)})
# Randomly shift the start and end dates of the df2 intervals.
shift_start = pd.Series(np.random.randint(30, size=n2)).cumsum().apply(lambda s: pd.DateOffset(seconds=s))
shift_end1 = pd.Series(np.random.randint(30, size=n2)).apply(lambda s: pd.DateOffset(seconds=s))
shift_end2 = pd.Series(np.random.randint(5, 45, size=n2)).apply(lambda m: pd.DateOffset(minutes=m))
df2['start_date'] += shift_start
df2['end_date'] = df2['start_date'] + shift_end1 + shift_end2
Which yields the following for df1 and df2:
df1
date value
0 2016-11-24 00:00:00 0.444939
1 2016-11-24 00:10:00 0.407554
2 2016-11-24 00:20:00 0.460148
3 2016-11-24 00:30:00 0.465239
4 2016-11-24 00:40:00 0.462691
...
54995 2017-12-10 21:50:00 0.754123
54996 2017-12-10 22:00:00 0.401820
54997 2017-12-10 22:10:00 0.146284
54998 2017-12-10 22:20:00 0.394759
54999 2017-12-10 22:30:00 0.907233
df2
start_date end_date
0 2016-11-24 00:00:19 2016-11-24 00:41:24
1 2016-11-24 18:22:44 2016-11-24 18:36:44
2 2016-11-25 12:44:44 2016-11-25 13:03:13
3 2016-11-26 07:07:05 2016-11-26 07:49:29
4 2016-11-27 01:29:31 2016-11-27 01:34:32
...
495 2017-12-07 21:36:04 2017-12-07 22:14:29
496 2017-12-08 15:58:14 2017-12-08 16:10:35
497 2017-12-09 10:20:21 2017-12-09 10:26:40
498 2017-12-10 04:42:41 2017-12-10 05:22:47
499 2017-12-10 23:04:42 2017-12-10 23:44:53
And using the following functions for timing purposes:
def root_searchsorted(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
    # Build the conditions that indicate an overlap (any True condition indicates an overlap).
    cond = [
        df1['date'].values <= s1['end_date'].values,
        df1['date_end'].values <= s2['end_date'].values,
        s1.index.values != s2.index.values
    ]
    # Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
    return df1[np.any(cond, axis=0)].drop('date_end', axis=1)
def root_intervaltree(df1, df2):
    # Build the Interval Tree.
    tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])
    # Build the 10 minutes spans from df1.
    dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)
    # Query the Interval Tree to filter the DataFrame.
    return df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]
def ptrj(df1, df2):
    # The smallest amount of time - handy when using open intervals:
    epsilon = pd.Timedelta(1, 'ns')
    # Lookup series (`asof` works best with series) for `start_date` and `end_date` from `df2`:
    sdate = pd.Series(data=range(df2.shape[0]), index=df2.start_date)
    edate = pd.Series(data=range(df2.shape[0]), index=df2.end_date + epsilon)
    # (filling NaN's with -1)
    l = edate.asof(df1.date).fillna(-1)
    r = sdate.asof(df1.date + (pd.Timedelta(10, 'm') - epsilon)).fillna(-1)
    # (taking `values` here to skip indexes, which are different)
    mask = l.values < r.values
    return df1[mask]
def parfait(df1, df2):
    df1['key'] = 1
    df2['key'] = 1
    df2['row'] = df2.index.values
    # CROSS JOIN
    df3 = pd.merge(df1, df2, on=['key'])
    # DF FILTERING
    return df3[df3['start_date'].between(df3['date'], df3['date'] + dt.timedelta(minutes=9, seconds=59), inclusive=True) | df3['date'].between(df3['start_date'], df3['end_date'], inclusive=True)].set_index('date')[['value', 'row']]
def root_searchsorted_modified(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
    # ---- below is the MODIFIED code ----
    # Filter df1 to only overlapping intervals.
    df1.query('(date <= @s1.end_date.values) |\
               (date_end <= @s1.end_date.values) |\
               (@s1.index.values != @s2.index.values)', inplace=True)
    # Drop the extra 'date_end' column.
    return df1.drop('date_end', axis=1)
I get the following timings:
%timeit root_searchsorted(df1.copy(), df2.copy())
100 loops best of 3: 9.55 ms per loop
%timeit root_searchsorted_modified(df1.copy(), df2.copy())
100 loops best of 3: 13.5 ms per loop
%timeit ptrj(df1.copy(), df2.copy())
100 loops best of 3: 18.5 ms per loop
%timeit root_intervaltree(df1.copy(), df2.copy())
1 loop best of 3: 4.02 s per loop
%timeit parfait(df1.copy(), df2.copy())
1 loop best of 3: 8.96 s per loop
This solution (I believe it works) uses pandas.Series.asof. Under the hood, it's some version of searchsorted, and it's comparable in speed with @root's function.
I assume that all date columns are in the pandas datetime format, sorted, and that df2 intervals are non-overlapping.
The code is pretty short but somewhat intricate (explanation below).
# The smallest amount of time - handy when using open intervals:
epsilon = pd.Timedelta(1, 'ns')
# Lookup series (`asof` works best with series) for `start_date` and `end_date` from `df2`:
sdate = pd.Series(data=range(df2.shape[0]), index=df2.start_date)
edate = pd.Series(data=range(df2.shape[0]), index=df2.end_date + epsilon)
# The main function (see explanation below):
def get_it(df1):
    # (filling NaN's with -1)
    l = edate.asof(df1.date).fillna(-1)
    r = sdate.asof(df1.date + (pd.Timedelta(10, 'm') - epsilon)).fillna(-1)
    # (taking `values` here to skip indexes, which are different)
    mask = l.values < r.values
    return df1[mask]
The advantage of this approach is twofold: sdate and edate are evaluated only once and the main function can take chunks of df1 if df1 is very large.
Explanation
pandas.Series.asof returns the last valid row for a given index. It can take an array as an input and is quite fast.
For the sake of this explanation, let s[j] = sdate.index[j] be the jth date in sdate and x be some arbitrary date (timestamp).
There is always s[sdate.asof(x)] <= x (this is exactly how asof works) and it's not difficult to show that:
j <= sdate.asof(x) if and only if s[j] <= x
sdate.asof(x) < j if and only if x < s[j]
Similarly for edate. Unfortunately, we can't have the same inequalities (either weak or strict) in both 1. and 2.
Two intervals [a, b) and [x, y] intersect iff x < b and a <= y.
(We may think of a, b as coming from sdate.index and edate.index - the interval [a, b) is chosen to be closed-open because of properties 1. and 2.)
In our case x is a date from df1, y = x + 10min - epsilon,
a = s[j], b = e[j] (note that epsilon has been added to edate), where j is some number.
So, finally, the condition equivalent to "[a, b) and [x, y] intersect" is
"sdate.asof(x) < j and j <= edate.asof(y) for some number j". And it roughly boils down to l < r inside the function get_it (modulo some technicalities).
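The primitive can be seen in isolation; a minimal sketch of `Series.asof` using the start dates from the question's df2: it returns the value at the last index label that is less than or equal to the lookup key.

```python
import pandas as pd

# Lookup series mapping each start_date to its row number in df2.
sdate = pd.Series(range(3), index=pd.to_datetime(
    ['2016-11-23 23:55:32', '2016-11-24 01:03:18', '2016-11-24 01:11:32']))
# asof finds the last start_date <= the lookup timestamp.
print(sdate.asof(pd.Timestamp('2016-11-24 01:05:00')))  # 1
```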
This is not exactly straightforward but you can do the following:
First get the relevant date columns from the two dataframes and concatenate them together so that one column is all the dates and the other two are columns representing the indexes from df2. (Note that df2 gets a multiindex after stacking)
dfm = pd.concat((df1['date'],df2.stack().reset_index())).sort_values(0)
print(dfm)
0 level_0 level_1
0 2016-11-23 23:55:32 0.0 start_date
0 2016-11-24 00:00:00 NaN NaN
1 2016-11-24 00:10:00 NaN NaN
1 2016-11-24 00:14:03 0.0 end_date
2 2016-11-24 00:20:00 NaN NaN
3 2016-11-24 00:30:00 NaN NaN
4 2016-11-24 00:40:00 NaN NaN
5 2016-11-24 00:50:00 NaN NaN
6 2016-11-24 01:00:00 NaN NaN
2 2016-11-24 01:03:18 1.0 start_date
3 2016-11-24 01:07:12 1.0 end_date
7 2016-11-24 01:10:00 NaN NaN
4 2016-11-24 01:11:32 2.0 start_date
8 2016-11-24 01:20:00 NaN NaN
5 2016-11-24 02:00:00 2.0 end_date
You can see that the values from df1 have NaN in the right two columns and since we have sorted the dates, these rows fall in between the start_date and end_date rows (from df2).
In order to indicate that the rows from df1 fall between the rows from df2 we can interpolate the level_0 column which gives us:
dfm['level_0'] = dfm['level_0'].interpolate()
0 level_0 level_1
0 2016-11-23 23:55:32 0.000000 start_date
0 2016-11-24 00:00:00 0.000000 NaN
1 2016-11-24 00:10:00 0.000000 NaN
1 2016-11-24 00:14:03 0.000000 end_date
2 2016-11-24 00:20:00 0.166667 NaN
3 2016-11-24 00:30:00 0.333333 NaN
4 2016-11-24 00:40:00 0.500000 NaN
5 2016-11-24 00:50:00 0.666667 NaN
6 2016-11-24 01:00:00 0.833333 NaN
2 2016-11-24 01:03:18 1.000000 start_date
3 2016-11-24 01:07:12 1.000000 end_date
7 2016-11-24 01:10:00 1.500000 NaN
4 2016-11-24 01:11:32 2.000000 start_date
8 2016-11-24 01:20:00 2.000000 NaN
5 2016-11-24 02:00:00 2.000000 end_date
Notice that the level_0 column now contains integers (mathematically, not the data type) for the rows that fall between a start date and an end date (this assumes that an end date will not overlap the following start date).
Now we can just filter out the rows originally in df1:
df_falls = dfm[(dfm['level_0'] == dfm['level_0'].astype(int)) & (dfm['level_1'].isnull())][[0,'level_0']]
df_falls.columns = ['date', 'falls_index']
And merge back with the original dataframe
df_final = pd.merge(df1, right=df_falls, on='date', how='outer')
which gives:
print(df_final)
date value falls_index
0 2016-11-24 00:00:00 1759.199951 0.0
1 2016-11-24 00:10:00 992.400024 0.0
2 2016-11-24 00:20:00 1404.800049 NaN
3 2016-11-24 00:30:00 45.799999 NaN
4 2016-11-24 00:40:00 24.299999 NaN
5 2016-11-24 00:50:00 159.899994 NaN
6 2016-11-24 01:00:00 82.499999 NaN
7 2016-11-24 01:10:00 37.400003 NaN
8 2016-11-24 01:20:00 159.899994 2.0
This is the same as the original dataframe, with the extra column falls_index indicating the index of the row in df2 that each row falls into.
Consider a cross join merge that returns the cartesian product of both sets (all possible row pairings M x N). You can cross join using an all-1's key column in merge's on argument. Then, run a filter on the large returned set using pd.Series.between(). Specifically, between() keeps rows where start_date falls within the 9:59 range after date, or where date falls between start and end times.
However, prior to the merge, create a df1['date'] column equal to the date index so it can be a retained column after merge and used for date filtering. Additionally, create a df2['row'] column to be used as row indicator at the end. For demo, below recreates posted df1 and df2 dataframes:
from io import StringIO
import pandas as pd
import datetime as dt
data1 = '''
date value
"2016-11-24 00:00:00" 1759.199951
"2016-11-24 00:10:00" 992.400024
"2016-11-24 00:20:00" 1404.800049
"2016-11-24 00:30:00" 45.799999
"2016-11-24 00:40:00" 24.299999
"2016-11-24 00:50:00" 159.899994
"2016-11-24 01:00:00" 82.499999
"2016-11-24 01:10:00" 37.400003
"2016-11-24 01:20:00" 159.899994
'''
df1 = pd.read_table(StringIO(data1), sep='\s+', parse_dates=[0], index_col=0)
df1['key'] = 1
df1['date'] = df1.index.values
data2 = '''
start_date end_date
"2016-11-23 23:55:32" "2016-11-24 00:14:03"
"2016-11-24 01:03:18" "2016-11-24 01:07:12"
"2016-11-24 01:11:32" "2016-11-24 02:00:00"
'''
df2 = pd.read_table(StringIO(data2), sep='\s+', parse_dates=[0,1])
df2['key'] = 1
df2['row'] = df2.index.values
# CROSS JOIN
df3 = pd.merge(df1, df2, on=['key'])
# DF FILTERING
df3 = df3[(df3['start_date'].between(df3['date'], df3['date'] + dt.timedelta(minutes=9, seconds=59), inclusive=True)) |
          (df3['date'].between(df3['start_date'], df3['end_date'], inclusive=True))].set_index('date')[['value', 'row']]
print(df3)
# value row
# date
# 2016-11-24 00:00:00 1759.199951 0
# 2016-11-24 00:10:00 992.400024 0
# 2016-11-24 01:00:00 82.499999 1
# 2016-11-24 01:10:00 37.400003 2
# 2016-11-24 01:20:00 159.899994 2
I tried to modify @root's code using the experimental pandas query method. It should be faster than the original implementation for very large DataFrames; for small DataFrames it will definitely be slower.
def root_searchsorted_modified(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
    # ---- below is the MODIFIED code ----
    # Filter df1 to only overlapping intervals.
    df1.query('(date <= @s1.end_date.values) |\
               (date_end <= @s1.end_date.values) |\
               (@s1.index.values != @s2.index.values)', inplace=True)
    # Drop the extra 'date_end' column.
    return df1.drop('date_end', axis=1)
Related
I have a dataframe with one column timestamp (of type datetime) and some other columns whose content doesn't matter. I'm trying to group by 5-minute intervals and count, ignoring the date and caring only about the time of day.
One can generate an example dataframe using this code:
def get_random_dates_df(
    n=10000,
    start=pd.to_datetime('2015-01-01'),
    period_duration_days=5,
    seed=None
):
    if not seed:  # from piR's answer
        np.random.seed(0)
    end = start + pd.Timedelta(period_duration_days, 'd')
    n_seconds = int(period_duration_days * 3600 * 24)
    random_dates = pd.to_timedelta(n_seconds * np.random.rand(n), unit='s') + start
    return pd.DataFrame(data={"timestamp": random_dates}).reset_index()
df = get_random_dates_df()
it would look like this:
   index                      timestamp
0      0  2015-01-03 17:51:27.433696604
1      1  2015-01-04 13:49:21.806272885
2      2  2015-01-04 00:19:53.778462950
3      3  2015-01-03 17:23:09.535054659
4      4  2015-01-03 02:50:18.873314407
I think I have a working solution but it seems overly complicated:
gpd_df = df.groupby(pd.Grouper(key="timestamp", freq="5min")).agg(
count=("index", "count")
).reset_index()
gpd_df["time_of_day"] = gpd_df["timestamp"].dt.time
res_df= gpd_df.groupby("time_of_day").sum()
Output:
count
time_of_day
00:00:00 38
00:05:00 39
00:10:00 48
00:15:00 33
00:20:00 27
... ...
23:35:00 34
23:40:00 38
23:45:00 37
23:50:00 41
23:55:00 41
[288 rows x 1 columns]
Is there a better way to solve this?
You could groupby the floored 5Min datetime's time portion:
df2 = df.groupby(df['timestamp'].dt.floor('5Min').dt.time)['index'].count()
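A small self-contained sketch of that grouping, using made-up timestamps, to show that the date is ignored and only the 5-minute time-of-day bin matters:

```python
import pandas as pd

# Timestamps on different days that land in the same time-of-day bins.
df = pd.DataFrame({'timestamp': pd.to_datetime([
    '2015-01-01 00:02:10', '2015-01-02 00:03:45',  # both floor to 00:00
    '2015-01-01 00:07:00',                          # floors to 00:05
])})
df = df.reset_index()  # provides the 'index' column used in the question
counts = df.groupby(df['timestamp'].dt.floor('5Min').dt.time)['index'].count()
print(counts.tolist())  # [2, 1]
```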
I'd suggest something like this, to avoid trying to merge the results of two groupbys together:
gpd_df = df.copy()
gpd_df["time_of_day"] = gpd_df["timestamp"].apply(lambda x: x.replace(year=2000, month=1, day=1))
gpd_df = gpd_df.set_index("time_of_day")
res_df = gpd_df.resample("5min").size()
It works by setting the year/month/day to fixed values and applying the built-in resampling function.
What about flooring the datetimes to 5min, extracting the time only and using value_counts:
out = (df['timestamp']
.dt.floor('5min')
.dt.time.value_counts(sort=False)
.sort_index()
)
Output:
00:00:00 38
00:05:00 39
00:10:00 48
00:15:00 33
00:20:00 27
..
23:35:00 34
23:40:00 38
23:45:00 37
23:50:00 41
23:55:00 41
Name: timestamp, Length: 288, dtype: int64
I have a dataframe containing time series with hourly measurements with the following structure: name, time, output. For each name the measurements come from more or less the same time period. I am trying to fill in the missing values, such that for each day all 24h appear in the time column.
So I'm expecting a table like this:
name time output
x 2018-02-22 00:00:00 100
...
x 2018-02-22 23:00:00 200
x 2018-02-24 00:00:00 300
...
x 2018-02-24 23:00:00 300
y 2018-02-22 00:00:00 100
...
y 2018-02-22 23:00:00 200
y 2018-02-25 00:00:00 300
...
y 2018-02-25 23:00:00 300
For this I groupby name and then try to apply a custom function that adds the missing timestamps in the corresponding dataframe.
def add_missing_hours(df):
    start_date = df.time.iloc[0].date()
    end_date = df.time.iloc[-1].date()
    dates_range = pd.date_range(start_date, end_date, freq='1H')
    new_dates = set(dates_range) - set(df.time)
    name = df["name"].iloc[0]
    df = df.append(pd.DataFrame({'GSRN': [name]*len(new_dates), 'time': new_dates}))
    return df
For some reason the name column is dropped when I create the DataFrame, but I can't understand why. Does anyone know why or have a better idea how to fill in the missing timestamps?
Edit 1:
This is different than the [question here][1] because they didn't need all 24 values/day -- resampling between 2pm and 10pm will only give the values in between.
Edit 2:
I found a (not great) solution by creating a multi index with all name-timestamps pairs and combining with the table. Code below for anyone interested, but still interested in a better solution:
start_date = datetime.datetime.combine(df.time.min().date(),datetime.time(0, 0))
end_date = datetime.datetime.combine(df.time.max().date(),datetime.time(23, 0))
new_idx = pd.date_range(start_date, end_date, freq = '1H')
mux = pd.MultiIndex.from_product([df['name'].unique(),new_idx], names=('name','time'))
df_complete = pd.DataFrame(index=mux).reset_index().combine_first(df)
df_complete = df_complete.groupby(["name", df_complete.time.dt.date]).filter(lambda g: g["output"].count() > 0)
The last line removes any days that were completely missing for the specific name in the initial dataframe.
Try this: first create a dataframe spanning the min date to the max date with an hourly interval, then join them together.
df.time = pd.to_datetime(df.time)
min_date = df.time.min()
max_date = df.time.max()
dates_range = pd.date_range(min_date, max_date, freq = '1H')
df.set_index('time', inplace=True)
df3=pd.DataFrame(dates_range).set_index(0)
df4 = df3.join(df)
df4:
name output
2018-02-22 00:00:00 x 100.0
2018-02-22 00:00:00 y 100.0
2018-02-22 01:00:00 NaN NaN
2018-02-22 02:00:00 NaN NaN
2018-02-22 03:00:00 NaN NaN
... ... ...
2018-02-25 19:00:00 NaN NaN
2018-02-25 20:00:00 NaN NaN
2018-02-25 21:00:00 NaN NaN
2018-02-25 22:00:00 NaN NaN
2018-02-25 23:00:00 y 300.0
98 rows × 2 columns
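The per-name gaps can also be filled by reindexing each group against a full hourly range. A sketch with made-up sample data (not the asker's dataframe):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['x', 'x', 'y'],
    'time': pd.to_datetime(['2018-02-22 00:00', '2018-02-22 02:00',
                            '2018-02-22 01:00']),
    'output': [100, 150, 120],
})

# For each name, reindex against the full hourly range (00:00-23:00)
# covering that name's days; missing hours appear as NaN rows.
filled = (df.set_index('time')
            .groupby('name')[['output']]
            .apply(lambda g: g.reindex(pd.date_range(
                g.index.min().normalize(),
                g.index.max().normalize() + pd.Timedelta(hours=23),
                freq='H')))
            .rename_axis(['name', 'time'])
            .reset_index())
print(len(filled))  # 48: 24 hours for 'x' + 24 hours for 'y'
```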
I have a dataframe like the following:
Index Diff
2019-03-14 11:32:21.583000+00:00 0
2019-03-14 11:32:21.583000+00:00 2
2019-04-14 11:32:21.600000+00:00 13
2019-04-14 11:32:21.600000+00:00 14
2019-05-14 11:32:21.600000+00:00 19
2019-05-14 11:32:21.600000+00:00 27
What would be the best approach to group by the month and take the difference inside of those months?
Using the .diff() option I am able to find the difference between each row, but I am trying to use the df.groupby(pd.Grouper(freq='M')) with no success.
Expected Output:
Index Diff
0 2019-03-31 00:00:00+00:00 2.0
1 2019-04-30 00:00:00+00:00 1.0
2 2019-05-31 00:00:00+00:00 8.0
Any help would be much appreciated!!
Depending on whether your date is on the index, you may need to skip the df1 = df.reset_index() line. Also check that your index is a DatetimeIndex; if it isn't, convert it with df.index = pd.to_datetime(df.index). Then you can compute the within-month differences with df1.groupby(pd.Grouper(key='Index', freq='M'))['Diff'].diff(), and afterwards aggregate per month over the full dataframe:
input:
import pandas as pd
df = pd.DataFrame({'Diff': {'2019-03-14 11:32:21.583000+00:00': 2,
'2019-04-14 11:32:21.600000+00:00': 14,
'2019-05-14 11:32:21.600000+00:00': 27}})
df.index.name = 'Index'
df.index = pd.to_datetime(df.index)
code:
df1 = df.reset_index()
df1['Diff'] = df1.groupby(pd.Grouper(key='Index', freq='M'))['Diff'].diff()
df1 = df1.groupby(pd.Grouper(key='Index', freq='M'))['Diff'].max().reset_index()
df1
output:
Index Diff
0 2019-03-31 00:00:00+00:00 2.0
1 2019-04-30 00:00:00+00:00 1.0
2 2019-05-31 00:00:00+00:00 8.0
If given a dataframe that's indexed with a datetimeindex, is there an efficient way to normalize the values within a given day? For example I'd like to sum all values for each day, and then divide each columns values by the resulting sum for the day.
I can easily group by date and calculate the divisor (sum of values of each column for each date) but I'm not entirely sure the best way to divide the original dataframe by the resulting sum df.
Example dataframe with datetimeindex and resulting df from sum
I attempted to do something like
df / df.groupby(df.index.to_period('D')).sum()
however it isn't behaving as I would have hoped for.
Instead I'm getting a df with NaN everywhere and Date appended as new indexes.
Toy recreation:
df = pd.DataFrame([[1,2],[3,4],[5,6],[7,8]],columns=['a','b'],
index=pd.to_datetime(['2017-01-01 14:30:00','2017-01-01 14:31:00',
'2017-01-02 14:30:00', '2017-01-02 14:31:00']))
df / df.groupby(df.index.to_period('D')).sum()
results in
a b
2017-01-01 14:30:00 NaN NaN
2017-01-01 14:31:00 NaN NaN
2017-01-02 14:30:00 NaN NaN
2017-01-02 14:31:00 NaN NaN
2017-01-01 NaN NaN
2017-01-02 NaN NaN
You will need to copy and paste your dataframe as text and not an image so I can help further but here is an example:
sample df
df1 = pd.DataFrame(np.random.randn(5,5), columns=list('ABCDE'),
index=pd.date_range('2017-01-03', '2017-01-07'))
df2 = pd.DataFrame(np.random.randn(5,5), columns=list('ABCDE'),
index=pd.date_range('2017-01-03', '2017-01-07'))
df = pd.concat([df1,df2])
A B C D E
2017-01-03 1.393874 1.933301 0.215026 -0.412957 -0.293925
2017-01-04 0.825777 0.315449 2.317292 -0.347617 -2.427019
2017-01-05 -0.372916 -0.931185 0.049707 0.635828 -0.774566
2017-01-06 1.564714 -1.582461 1.455403 0.521305 -2.175344
2017-01-07 1.255747 1.967338 -0.766391 -0.021921 0.672704
2017-01-03 0.620301 -1.521681 -0.352800 -1.394239 -1.206983
2017-01-04 -0.041829 -0.870871 -0.402440 0.268725 1.499321
2017-01-05 -1.098647 1.690136 1.004087 0.304037 1.235310
2017-01-06 0.305645 -0.327096 0.280591 -0.476904 1.652096
2017-01-07 1.251927 0.469697 0.047694 1.838995 -0.258889
then what you are currently doing:
df / df.groupby(df.index).sum()
A B C D E
2017-01-03 0.692032 4.696817 -1.560723 0.228507 0.195831
2017-01-03 0.307968 -3.696817 2.560723 0.771493 0.804169
2017-01-04 1.053357 -0.567944 1.210167 4.406211 2.616174
2017-01-04 -0.053357 1.567944 -0.210167 -3.406211 -1.616174
2017-01-05 0.253415 -1.226937 0.047170 0.676510 -1.681122
2017-01-05 0.746585 2.226937 0.952830 0.323490 2.681122
2017-01-06 0.836585 0.828706 0.838369 11.740853 4.157386
2017-01-06 0.163415 0.171294 0.161631 -10.740853 -3.157386
2017-01-07 0.500762 0.807267 1.066362 -0.012064 1.625615
2017-01-07 0.499238 0.192733 -0.066362 1.012064 -0.625615
Take a look at the first row col A
1.393874 / (1.393874 + 0.620301) = 0.6920322216292031 so your example of df / df.groupby(df.index).sum() is working as expected.
Also be careful if your data contains NaNs because np.nan / a number = nan
update per comment:
df = pd.DataFrame([[1,2],[3,4],[5,6],[7,8]],columns=['a','b'],
index=pd.to_datetime(['2017-01-01 14:30:00','2017-01-01 14:31:00',
'2017-01-02 14:30:00', '2017-01-02 14:31:00']))
# create multiindex with level 1 being just dates
df.set_index(df.index.floor('D'), inplace=True, append=True)
# divide df by the group sum matching the index values of level 1
df.div(df.groupby(level=1).sum(), level=1).reset_index(level=1, drop=True)
a b
2017-01-01 14:30:00 0.250000 0.333333
2017-01-01 14:31:00 0.750000 0.666667
2017-01-02 14:30:00 0.416667 0.428571
2017-01-02 14:31:00 0.583333 0.571429
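A variant that skips the temporary MultiIndex is groupby(...).transform('sum'): transform broadcasts each day's sum back onto the original row index, so the division aligns element-wise. A sketch on the same toy frame:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=['a', 'b'],
                  index=pd.to_datetime(['2017-01-01 14:30:00', '2017-01-01 14:31:00',
                                        '2017-01-02 14:30:00', '2017-01-02 14:31:00']))
# transform('sum') returns a frame the same shape as df (each row holds its
# day's column sums), so the division needs no index gymnastics.
out = df / df.groupby(df.index.floor('D')).transform('sum')
# First row: 1/4 = 0.25 and 2/6 ~ 0.333, matching the output above.
```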
I have two dataframes each with a datetime column:
df_long=
mytime_long
0 00:00:01 1/10/2013
1 00:00:05 1/10/2013
2 00:00:55 1/10/2013
df_short=
mytime_short
0 00:00:02 1/10/2013
1 00:00:03 1/10/2013
2 00:00:06 1/10/2013
The timestamps are unique and can be assumed sorted in each of the two dataframes.
I would like to create a new dataframe that contains, for each row of df_long, the nearest (index, mytime_short) at or after the mytime_long value (hence with a non-negative timedelta)
ex.
0 (0, 00:00:02 1/10/2013)
1 (2, 00:00:06 1/10/2013)
2 (np.nan, pd.NaT)
write a function to get the closest index & timestamp in df_short given a timestamp
def get_closest(n):
    mask = df_short.mytime_short >= n
    ids = np.where(mask)[0]
    if ids.size > 0:
        return ids[0], df_short.mytime_short[ids[0]]
    else:
        return np.nan, np.nan
apply this function over df_long.mytime_long, to get a new data frame with the index & timestamp values in a tuple
df = df_long.mytime_long.apply(get_closest)
df
# output:
0 (0, 2013-01-10 00:00:02)
1 (2, 2013-01-10 00:00:06)
2 (nan, nan)
ilia timofeev's answer reminded me of the pandas.merge_asof function, which is perfect for this type of join:
df = pd.merge_asof(df_long,
df_short.reset_index(),
left_on='mytime_long',
right_on='mytime_short',
direction='forward')[['index', 'mytime_short']]
df
# output:
index mytime_short
0 0.0 2013-01-10 00:00:02
1 2.0 2013-01-10 00:00:06
2 NaN NaT
A little bit ugly, but an effective way to solve the task. The idea is to join them on timestamp and select the first "short" after each "long", if any.
#recreate data
df_long = pd.DataFrame(
pd.to_datetime( ['00:00:01 1/10/2013','00:00:05 1/10/2013','00:00:55 1/10/2013']),
index = [0,1,2],columns = ['mytime_long'])
df_short = pd.DataFrame(
pd.to_datetime( ['00:00:02 1/10/2013','00:00:03 1/10/2013','00:00:06 1/10/2013']),
index = [0,1,2],columns = ['mytime_short'])
#join by time, preserving ids
df_all = df_short.assign(inx_s=df_short.index).set_index('mytime_short').join(
df_long.assign(inx_l=df_long.index).set_index('mytime_long'),how='outer')
#mark all "short" rows with nearest "long" id
df_all['inx_l'] = df_all.inx_l.ffill().fillna(-1)
#select "short" rows
df_short_candidate = df_all[~df_all.inx_s.isnull()].astype(int)
df_short_candidate['mytime_short'] = df_short_candidate.index
#select get minimal "short" time in "long" group,
#join back with long to recover empty intersection
df_res = df_long.join(df_short_candidate.groupby('inx_l').first())
print (df_res)
Out:
mytime_long inx_s mytime_short
0 2013-01-10 00:00:01 0.0 2013-01-10 00:00:02
1 2013-01-10 00:00:05 2.0 2013-01-10 00:00:06
2 2013-01-10 00:00:55 NaN NaT
Performance comparison on sample of 100000 elements:
186 ms to execute this implementation.
1min 3s to execute df_long.mytime_long.apply(get_closest)
UPD: but the winner is @Haleemur Ali's pd.merge_asof with 10ms