I checked this post but couldn't get to my solution.
I have a dataframe that I filtered with df[df.columntype == 'B'] to get the rows below. Also, df.timeframe is of type datetime64[ns].
timeframe columntype
292 2021-05-19 10:17:00 B
293 2021-05-19 10:18:00 B
294 2021-05-19 10:18:00 B
295 2021-05-19 10:18:00 B
296 2021-05-19 10:18:00 B
418 2021-05-25 09:49:00 B
419 2021-05-25 09:49:00 B
420 2021-05-25 09:50:00 B
659 2021-07-08 10:33:00 B
660 2021-07-08 10:33:00 B
661 2021-07-08 10:33:00 B
I want to drop rows where the time difference from the previous row is less than 5 minutes, so I would get:
timeframe columntype
292 2021-05-19 10:17:00 B
418 2021-05-25 09:49:00 B
659 2021-07-08 10:33:00 B
How can I do this?
I would try to do it with the diff method:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.diff.html
Then filter out the rows where df.timeframe.diff() is below your threshold.
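For example, a minimal sketch (assuming the gap is measured against the immediately preceding row, so the first row of each burst is kept):
import pandas as pd
# Gap between each row and the previous one; the very first diff is NaT.
gap = df.timeframe.diff()
# Keep the first row (NaT gap) and every row at least 5 minutes after its predecessor.
out = df[gap.isna() | (gap >= pd.Timedelta(minutes=5))]
print(out)
On the sample above this keeps rows 292, 418 and 659.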
Hope it helps.
I'm trying to solve this issue. I have two dataframes. The first one looks like:
ID   start.date            end.date
272  2007-03-27 10:37:00   2007-03-27 15:09:00
290  2007-04-10 14:12:00   2007-04-10 15:51:00
268  2007-03-23 18:18:00   2007-03-23 18:24:00
264  2007-04-05 06:54:00   2007-04-09 06:45:00
105  2007-04-18 10:51:00   2007-04-18 13:37:00
280  2007-03-30 11:09:00   2007-04-02 06:27:00
99   2007-03-28 12:12:00   2007-03-28 15:22:00
268  2007-03-27 10:41:00   2007-03-27 10:54:00
263  2007-03-28 11:08:00   2007-03-28 12:45:00
264  2007-03-28 07:12:00   2007-03-28 11:08:00
While the second one looks like:
ID   date
266  2007-03-30 17:17:10
272  2007-03-30 14:23:39
268  2007-03-30 09:12:48
264  2007-03-30 18:57:57
276  2007-04-02 14:30:02
106  2007-03-28 11:35:49
276  2007-03-30 13:40:24
82   2007-03-27 17:29:28
104  2007-03-28 17:50:12
264  2007-03-29 14:41:16
I would like to add a column to the first dataframe with the count of rows in the second dataframe that have the same ID and a date value between the start.date and end.date of that row. How can I do it?
You can try apply on the rows (note that with this sample data no df2 date falls inside an interval with a matching ID, so both counts come out 0 everywhere):
df1['start.date'] = pd.to_datetime(df1['start.date'])
df1['end.date'] = pd.to_datetime(df1['end.date'])
df2['date'] = pd.to_datetime(df2['date'])
df1['count'] = df1.apply(lambda row: (df2['ID'].eq(row['ID']) & (row['start.date'] < df2['date']) & (df2['date'] < row['end.date'])).sum(), axis=1)
# or
df1['count2'] = df1.apply(lambda row: (df2['ID'].eq(row['ID']) & df2['date'].between(row['start.date'], row['end.date'], inclusive='neither')).sum(), axis=1)
print(df1)
ID start.date end.date count count2
0 272 2007-03-27 10:37:00 2007-03-27 15:09:00 0 0
1 290 2007-04-10 14:12:00 2007-04-10 15:51:00 0 0
2 268 2007-03-23 18:18:00 2007-03-23 18:24:00 0 0
3 264 2007-04-05 06:54:00 2007-04-09 06:45:00 0 0
4 105 2007-04-18 10:51:00 2007-04-18 13:37:00 0 0
5 280 2007-03-30 11:09:00 2007-04-02 06:27:00 0 0
6 99 2007-03-28 12:12:00 2007-03-28 15:22:00 0 0
7 268 2007-03-27 10:41:00 2007-03-27 10:54:00 0 0
8 263 2007-03-28 11:08:00 2007-03-28 12:45:00 0 0
9 264 2007-03-28 07:12:00 2007-03-28 11:08:00 0 0
Perfect job for numpy broadcasting:
id1, start_date, end_date = [df1[[col]].to_numpy() for col in ["ID", "start.date", "end.date"]]
id2, date = [df2[col].to_numpy() for col in ["ID", "date"]]
# Check every row in df1 against every row in df2 for our criteria:
# matching id, and date between start.date and end.date
match = (id1 == id2) & (start_date < date) & (date < end_date)
df1["count"] = match.sum(axis=1)
I have the below data frame (datetime index with all working days on the US calendar):
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
import random
us_bd = CustomBusinessDay(calendar=USFederalHolidayCalendar())
dt_rng = pd.date_range(start='1/1/2018', end='12/31/2018', freq=us_bd)
n1 = [round(random.uniform(20, 35),2) for _ in range(len(dt_rng))]
n2 = [random.randint(100, 200) for _ in range(len(dt_rng))]
df = pd.DataFrame(list(zip(n1,n2)), index=dt_rng, columns=['n1','n2'])
print(df)
n1 n2
2018-01-02 24.78 197
2018-01-03 23.33 176
2018-01-04 33.19 128
2018-01-05 32.49 110
... ... ...
2018-12-26 31.34 173
2018-12-27 29.72 166
2018-12-28 31.07 104
2018-12-31 33.52 184
[251 rows x 2 columns]
For each row in column n1, how do I get the value from the same column for the same day of the next month? (If the value for that exact day is not available due to weekends or holidays, it should take the value at the next available date.) I tried df.n1.shift(21), but it doesn't work because the number of working days differs from month to month.
Expected output as below
n1 n2 next_mnth_val
2018-01-02 25.97 184 28.14
2018-01-03 24.94 133 27.65 # these three values are the same, because in Feb 2018 the next working day after the 2nd is the 5th
2018-01-04 23.99 143 27.65
2018-01-05 24.69 182 27.65
2018-01-08 28.43 186 28.45
2018-01-09 31.47 104 23.14
... ... ... ...
2018-12-26 29.06 194 20.45
2018-12-27 29.63 158 20.45
2018-12-28 30.60 148 20.45
2018-12-31 20.45 121 20.45
For December, the next-month value should be the last value of the data frame, i.e. the value at index 2018-12-31 (20.45).
Please help.
This is an interesting problem. I would shift the date by 1 month, then shift it again to the next business day:
df1 = df.copy().reset_index()
df1['new_date'] = df1['index'] + pd.DateOffset(months=1) + pd.offsets.BDay()
df.merge(df1, left_index=True, right_on='new_date')
Output (first rows shown):
n1_x n2_x index n1_y n2_y new_date
0 34.82 180 2018-01-02 29.83 129 2018-02-05
1 34.82 180 2018-01-03 24.28 166 2018-02-05
2 34.82 180 2018-01-04 27.88 110 2018-02-05
3 24.89 186 2018-01-05 25.34 111 2018-02-06
4 31.66 137 2018-01-08 26.28 138 2018-02-09
5 25.30 162 2018-01-09 32.71 139 2018-02-12
6 25.30 162 2018-01-10 34.39 159 2018-02-12
7 25.30 162 2018-01-11 20.89 132 2018-02-12
8 23.44 196 2018-01-12 29.27 167 2018-02-13
12 25.40 153 2018-01-19 28.52 185 2018-02-20
13 31.38 126 2018-01-22 23.49 141 2018-02-23
14 30.90 133 2018-01-23 25.56 145 2018-02-26
15 30.90 133 2018-01-24 23.06 155 2018-02-26
16 30.90 133 2018-01-25 24.95 174 2018-02-26
17 29.39 138 2018-01-26 21.28 157 2018-02-27
18 32.94 173 2018-01-29 20.26 189 2018-03-01
19 32.94 173 2018-01-30 22.41 196 2018-03-01
20 32.94 173 2018-01-31 27.32 149 2018-03-01
21 28.09 119 2018-02-01 31.39 192 2018-03-02
22 32.21 199 2018-02-02 28.22 151 2018-03-05
23 21.78 120 2018-02-05 34.82 180 2018-03-06
24 28.25 127 2018-02-06 24.89 186 2018-03-07
25 22.06 189 2018-02-07 32.85 125 2018-03-08
26 33.78 121 2018-02-08 30.12 102 2018-03-09
27 30.79 137 2018-02-09 31.66 137 2018-03-12
28 29.88 131 2018-02-12 25.30 162 2018-03-13
29 20.02 143 2018-02-13 23.44 196 2018-03-14
30 20.28 188 2018-02-14 20.04 102 2018-03-15
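As a hedged side note: BDay() always advances at least one business day, even when the shifted date already falls on one, which is why 2018-01-02 maps to 2018-02-05 rather than 2018-02-02. If the intent is "the same day, or else the next available day", rolling forward only when needed comes closer, and the question's own CustomBusinessDay offset also skips US holidays:
# Roll forward only when the shifted date is not a valid business day.
df1['new_date'] = (df1['index'] + pd.DateOffset(months=1)).map(us_bd.rollforward)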
I have two data-frames as follows:
mydata1:
ID X1 X2 Date1
002 324 634 2016-01-01
002 334 534 2016-01-14
002 354 834 2016-01-30
004 543 843 2017-02-01
004 923 043 2017-04-15
005 032 212 2015-09-01
005 523 843 2017-09-15
005 212 222 2015-10-1
mydata2:
ID Y1 Y2 Date2
002 1224 234 2016-01-04
002 1254 249 2016-01-28
004 321 212 2016-12-01
005 1121 222 2017-09-13
I want to merge these two data-frames on ID where the difference between Date1 (in dataframe 1) and Date2 (in dataframe 2) is less than 15 days. So my desired output data-frame should look like this:
ID  X1  X2  Date1       Y1    Y2   Date2
002 324 634 2016-01-01  nan   nan  nan
002 334 534 2016-01-14  1224  234  2016-01-04
002 354 834 2016-01-30  1254  249  2016-01-28
004 543 843 2017-02-01  321   212  2015-12-01
004 923 043 2017-04-15  nan   nan  nan
005 032 212 2015-09-01  nan   nan  nan
005 523 843 2015-09-15  1121  222  2017-09-13
005 212 222 2015-10-1   nan   nan  nan
So your desired output is slightly wrong since one of the values is 2 years older than the joined value.
First we perform a left join (here df is mydata1 and df1 is mydata2):
f = df.merge(df1, how='left', on='ID')
ID X1 X2 Date1 Y1 Y2 Date2
0 2 324 634 2016-01-01 1224 234 2016-01-04
1 2 334 534 2016-01-14 1224 234 2016-01-04
2 2 354 834 2016-01-30 1224 234 2016-01-04
3 4 543 843 2017-02-01 321 212 2016-12-01
4 4 923 43 2017-04-15 321 212 2016-12-01
5 5 32 212 2015-09-01 1121 222 2015-09-13
6 5 523 843 2015-09-15 1121 222 2015-09-13
7 5 212 222 2015-10-1 1121 222 2015-09-13
Then we create a boolean mask:
mask = (pd.to_datetime(f['Date1'], format='%Y-%m-%d') - pd.to_datetime(f['Date2'], format='%Y-%m-%d')).apply(lambda i: i.days <= 15 and i.days > 0)
0 False
1 True
2 False
3 False
4 False
5 False
6 True
7 False
Then we set the joined columns to NaN where the condition does not hold:
f.loc[~mask, ['Y1', 'Y2', 'Date2']] = np.nan
ID X1 X2 Date1 Y1 Y2 Date2
0 2 324 634 2016-01-01 NaN NaN NaN
1 2 334 534 2016-01-14 1224.0 234.0 2016-01-04
2 2 354 834 2016-01-30 NaN NaN NaN
3 4 543 843 2017-02-01 NaN NaN NaN
4 4 923 43 2017-04-15 NaN NaN NaN
5 5 32 212 2015-09-01 NaN NaN NaN
6 5 523 843 2015-09-15 1121.0 222.0 2015-09-13
7 5 212 222 2015-10-1 NaN NaN NaN
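As a hedged side note, the same mask can be built without the Python-level apply by using the .dt accessor on the timedelta, which is usually faster on large frames:
delta = pd.to_datetime(f['Date1']) - pd.to_datetime(f['Date2'])
mask = (delta.dt.days > 0) & (delta.dt.days <= 15)
Also note that Y1 and Y2 become floats once NaN is introduced; if keeping integers matters, pandas' nullable Int64 dtype can be used instead.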
I want to get the minimal value across multiple timestamp columns. Here's my data:
Id timestamp 1 timestamp 2 timestamp 3
136 2014-08-27 17:29:23 2014-11-05 13:02:18 2014-09-29 22:26:34
245 2015-09-06 15:46:00 NaN NaN
257 2014-09-29 22:26:34 2016-02-02 17:59:54 NaN
258 NaN NaN NaN
480 2016-02-02 17:59:54 2014-11-05 13:02:18 NaN
I want to get the minimal timestamp per row, like this:
Id minimal
136 2014-08-27 17:29:23
245 2015-09-06 15:46:00
257 2014-09-29 22:26:34
258 NaN
480 2014-11-05 13:02:18
Select all columns except the first with iloc, convert them to datetimes, take the minimum per row, and join the result back to the first column:
df = df[['Id']].join(df.iloc[:, 1:].apply(pd.to_datetime).min(axis=1).rename('min'))
print(df)
Id min
0 136 2014-08-27 17:29:23
1 245 2015-09-06 15:46:00
2 257 2014-09-29 22:26:34
3 258 NaT
4 480 2014-11-05 13:02:18
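As a hedged alternative, if all timestamp columns share the 'timestamp' prefix, DataFrame.filter can select them by name instead of position:
df['minimal'] = df.filter(like='timestamp').apply(pd.to_datetime).min(axis=1)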
I have the following dataframe, which I'm calling "sales_df":
Value
Date
2004-01-01 0
2004-02-01 173
2004-03-01 225
2004-04-01 230
2004-05-01 349
2004-06-01 258
2004-07-01 270
2004-08-01 223
... ...
2015-06-01 218
2015-07-01 215
2015-08-01 233
2015-09-01 258
2015-10-01 252
2015-11-01 256
2015-12-01 188
2016-01-01 70
I want to separate its trend from its seasonal component and for that I use statsmodels.tsa.seasonal_decompose through the following code:
import pandas as pd
import statsmodels.api as sm

decomp = sm.tsa.seasonal_decompose(sales_df.Value)
df = pd.concat([sales_df, decomp.trend], axis=1)
df.columns = ['sales', 'trend']
This gives me:
sales trend
Date
2004-01-01 0 NaN
2004-02-01 173 NaN
2004-03-01 225 NaN
2004-04-01 230 NaN
2004-05-01 349 NaN
2004-06-01 258 NaN
2004-07-01 270 236.708333
2004-08-01 223 248.208333
2004-09-01 243 251.250000
... ... ...
2015-05-01 270 214.416667
2015-06-01 218 215.583333
2015-07-01 215 212.791667
2015-08-01 233 NaN
2015-09-01 258 NaN
2015-10-01 252 NaN
2015-11-01 256 NaN
2015-12-01 188 NaN
2016-01-01 70 NaN
Note that there are 6 NaNs at the start and at the end of the trend series.
So I ask: is that right? Why is that happening?
This is expected: when the filt argument is not specified (as in your case), seasonal_decompose estimates the trend with a symmetric moving average, and the frequency is inferred from the time series. For monthly data the inferred period is 12, so the centered filter needs 6 observations on each side of every point; the first and last 6 points therefore have no trend estimate and come out as NaN.
https://searchcode.com/codesearch/view/86129185/
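If the edge NaNs are a problem, newer statsmodels versions accept an extrapolate_trend argument that fills them by extrapolating the trend (a hedged sketch; check the version you have installed):
decomp = sm.tsa.seasonal_decompose(sales_df.Value, extrapolate_trend='freq')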