Conditional selection before certain time of day - Pandas dataframe

I have the above dataframe (snippet) and want to create a new dataframe by conditional selection, keeping only the rows timestamped with a time before 15:00:00.
I'm still somewhat new to Pandas / python and have been stuck on this for a while :(

You can use DataFrame.between_time:
start = pd.to_datetime('2015-02-24 11:00')
rng = pd.date_range(start, periods=10, freq='14h')
df = pd.DataFrame({'Date': rng, 'a': range(10)})
print (df)
Date a
0 2015-02-24 11:00:00 0
1 2015-02-25 01:00:00 1
2 2015-02-25 15:00:00 2
3 2015-02-26 05:00:00 3
4 2015-02-26 19:00:00 4
5 2015-02-27 09:00:00 5
6 2015-02-27 23:00:00 6
7 2015-02-28 13:00:00 7
8 2015-03-01 03:00:00 8
9 2015-03-01 17:00:00 9
df = df.set_index('Date').between_time('00:00:00', '15:00:00')
print (df)
a
Date
2015-02-24 11:00:00 0
2015-02-25 01:00:00 1
2015-02-25 15:00:00 2
2015-02-26 05:00:00 3
2015-02-27 09:00:00 5
2015-02-28 13:00:00 7
2015-03-01 03:00:00 8
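If you need to exclude 15:00:00 itself, pass inclusive='left' (pandas 1.4 and later; older versions used include_end=False, which has since been removed). Apply it to the original df, before 'Date' was set as the index:
df = df.set_index('Date').between_time('00:00:00', '15:00:00', inclusive='left')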
print (df)
a
Date
2015-02-24 11:00:00 0
2015-02-25 01:00:00 1
2015-02-26 05:00:00 3
2015-02-27 09:00:00 5
2015-02-28 13:00:00 7
2015-03-01 03:00:00 8

You can check the hours of the date column and use it for subsetting:
df['date'] = pd.to_datetime(df['date']) # optional if the date column is of datetime type
df[df.date.dt.hour < 15]
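Comparing dt.hour only works for cutoffs on the hour. As a minimal sketch, assuming a column named 'date' as in the answer above and made-up sample rows, the time-of-day component can be compared against a datetime.time value instead, which also handles cutoffs such as 15:30:

```python
from datetime import time

import pandas as pd

# Hypothetical sample frame; the column name 'date' follows the answer above.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2015-02-24 11:00:00",
        "2015-02-25 15:00:00",
        "2015-02-25 15:30:00",
        "2015-02-26 14:59:59",
    ]),
    "a": range(4),
})

# dt.time yields datetime.time objects, which compare element-wise
# against a single datetime.time cutoff.
cutoff = time(15, 0)
before_cutoff = df[df["date"].dt.time < cutoff]
print(before_cutoff)
```

The strict comparison drops the 15:00:00 row; use <= if the cutoff itself should be kept.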

Related

Create a Pandas Dataframe Date Column to Day of Year

I know this should be easy but for some reason, I cannot get the result that I need. I have data that looks like this where 'raw_time' is read into a df in the date format yyyy-mm-dd hh:mm:ss.
It looks like this:
dfdates =
1429029 1992-01-03 02:00:00
1429030 1992-01-03 01:00:00
1429031 1992-01-03 00:00:00
1429032 1992-01-02 23:00:00
1429033 1992-01-02 22:00:00
1429034 1992-01-02 21:00:00
1429035 1992-01-02 20:00:00
1429036 1992-01-02 19:00:00
1429037 1992-01-02 18:00:00
1429038 1992-01-02 17:00:00
1429039 1992-01-02 16:00:00
1429040 1992-01-02 15:00:00
1429041 1992-01-02 14:00:00
1429042 1992-01-02 13:00:00
1429043 1992-01-02 12:00:00
1429044 1992-01-02 11:00:00
I just need to convert each row to day of year. So the result in a new df would look like:
df_doy:
index day_of_year
1429029 3
1429030 3
1429031 3
1429032 2
1429033 2
1429034 2
1429035 2
1429036 2
1429037 2
1429038 2
1429039 2
1429040 2
1429041 2
1429042 2
1429043 2
1429044 2
thank you,

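You can use dt.dayofyear, where col is the name of your date column:
df['day_of_year'] = pd.to_datetime(df[col]).dt.dayofyear
Note that dt.day gives the day of the month, not the day of the year, so use it only if that is what you want:
df['day_of_month'] = pd.to_datetime(df[col]).dt.day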

Assuming dfdates columns are ["index", "date"], you can use dt.dayofyear this way :
df_doy = dfdates.assign(day_of_year = pd.to_datetime(dfdates.pop("date")).dt.dayofyear)
Output :
print(df_doy)
index day_of_year
0 1429029 3
1 1429030 3
2 1429031 3
3 1429032 2
4 1429033 2
.. ... ...
11 1429040 2
12 1429041 2
13 1429042 2
14 1429043 2
15 1429044 2
[16 rows x 2 columns]

Looks like there is a day_of_year variable in Period.
https://pandas.pydata.org/docs/reference/api/pandas.Period.dayofyear.html
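To tie the answers above together, here is a small sketch using three of the timestamps from the question. The column name 'raw_time' comes from the question text; keeping the original index on the result is an assumption about the desired layout.

```python
import pandas as pd

# Three sample rows taken from the question, keeping its integer index.
dfdates = pd.DataFrame(
    {"raw_time": ["1992-01-03 02:00:00",
                  "1992-01-02 23:00:00",
                  "1992-01-02 11:00:00"]},
    index=[1429029, 1429032, 1429044],
)

# Parse the strings, then read the day of year through the .dt accessor.
ts = pd.to_datetime(dfdates["raw_time"])
df_doy = pd.DataFrame({"day_of_year": ts.dt.dayofyear})
print(df_doy)

# The same attribute exists on scalar Timestamp and Period objects.
ts_doy = pd.Timestamp("1992-01-03 02:00:00").dayofyear
period_doy = pd.Period("1992-01-03", freq="D").dayofyear
print(ts_doy, period_doy)
```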

Filtering out another dataframe based on selected hours

I'm trying to filter my dataframe so that it keeps only rows at a 3-hourly frequency, meaning 0000hr, 0300hr, 0600hr, 0900hr, 1200hr, 1500hr, 1800hr, 2100hr, and so on.
A sample of my dataframe would look like this
Time A
2019-05-25 03:54:00 1
2019-05-25 03:57:00 2
2019-05-25 04:00:00 3
...
2020-05-25 03:54:00 4
2020-05-25 03:57:00 5
2020-05-25 04:00:00 6
Desired output:
Time A
2019-05-25 06:00:00 1
2019-05-25 09:00:00 2
2019-05-25 12:00:00 3
...
2020-05-25 00:00:00 4
2020-05-25 03:00:00 5
2020-05-25 06:00:00 6
2020-05-25 09:00:00 6
2020-05-25 12:00:00 6
2020-05-25 15:00:00 6
2020-05-25 18:00:00 6
2020-05-25 21:00:00 6
2020-05-26 00:00:00 6
...

You can build a date range at 3-hour intervals with pd.date_range() and then filter your dataframe with .loc and isin(), as follows:
date_rng_3H = pd.date_range(start=df['Time'].dt.date.min(), end=df['Time'].dt.date.max() + pd.DateOffset(days=1), freq='3h')  # uppercase '3H' is deprecated since pandas 2.2
df_out = df.loc[df['Time'].isin(date_rng_3H)]
Input data:
import numpy as np
import pandas as pd
date_rng = pd.date_range(start='2019-05-25 03:54:00', end='2020-05-25 04:00:00', freq='3min')  # '3T' is deprecated since pandas 2.2
np.random.seed(123)
df = pd.DataFrame({'Time': date_rng, 'A': np.random.randint(1, 6, len(date_rng))})
Time A
0 2019-05-25 03:54:00 3
1 2019-05-25 03:57:00 5
2 2019-05-25 04:00:00 3
3 2019-05-25 04:03:00 2
4 2019-05-25 04:06:00 4
... ... ...
175678 2020-05-25 03:48:00 2
175679 2020-05-25 03:51:00 1
175680 2020-05-25 03:54:00 2
175681 2020-05-25 03:57:00 2
175682 2020-05-25 04:00:00 1
175683 rows × 2 columns
Output:
print(df_out)
Time A
42 2019-05-25 06:00:00 4
102 2019-05-25 09:00:00 2
162 2019-05-25 12:00:00 1
222 2019-05-25 15:00:00 3
282 2019-05-25 18:00:00 5
... ... ...
175422 2020-05-24 15:00:00 1
175482 2020-05-24 18:00:00 5
175542 2020-05-24 21:00:00 2
175602 2020-05-25 00:00:00 3
175662 2020-05-25 03:00:00 3
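An alternative that avoids building the full date range is to test the time components directly. This is only a sketch on a small made-up frame with the question's column names; it ignores sub-second precision, which is an assumption about the data.

```python
import pandas as pd

# Small stand-in for the question's frame (values are made up).
df = pd.DataFrame({
    "Time": pd.to_datetime([
        "2019-05-25 03:00:00",
        "2019-05-25 03:54:00",
        "2019-05-25 06:00:00",
        "2019-05-25 07:00:00",
        "2019-05-25 09:00:30",
        "2019-05-25 21:00:00",
    ]),
    "A": [1, 2, 3, 4, 5, 6],
})

# Keep rows whose hour is a multiple of 3 and whose minute and second are zero.
t = df["Time"].dt
on_grid = (t.hour % 3 == 0) & (t.minute == 0) & (t.second == 0)
df_out = df.loc[on_grid]
print(df_out)
```

The mask needs no knowledge of the first and last date, so it also works when the frame is filtered or extended later.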

how to sort by english date format not american pandas .sort()

symb dates
4 BLK 01/03/2014 09:00:00
0 BBR 02/06/2014 09:00:00
21 HZ 02/06/2014 09:00:00
24 OMNI 02/07/2014 09:00:00
31 NOTE 03/04/2014 09:00:00
65 AMP 03/04/2016 09:00:00
40 RBY 04/07/2014 09:00:00
Here's a sample of the output from (df.sort('date')).
As you can see it uses the days for the months and vice versa. Any idea how to fix this ?

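You can use pandas.to_datetime with the format argument and then sort. The format below reads the strings month-first; swap it to '%d/%m/%Y %H:%M:%S' if your data is day-first. DataFrame.sort was removed in later pandas versions, so use sort_values:
>> df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y %H:%M:%S')
>> df.sort_values('date')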
date symb
0 2014-01-03 09:00:00 BLK
1 2014-02-06 09:00:00 BBR
2 2014-02-06 09:00:00 HZ
3 2014-02-07 09:00:00 OMNI
4 2014-03-04 09:00:00 NOTE
6 2014-04-07 09:00:00 RBY
5 2016-03-04 09:00:00 AMP

You can convert the column with to_datetime and then sort with sort_values:
#format mm/dd/YYYY
df['dates'] = pd.to_datetime(df['dates'])
print (df.sort_values('dates'))
symb dates
4 BLK 2014-01-03 09:00:00
0 BBR 2014-02-06 09:00:00
21 HZ 2014-02-06 09:00:00
24 OMNI 2014-02-07 09:00:00
31 NOTE 2014-03-04 09:00:00
40 RBY 2014-04-07 09:00:00
65 AMP 2016-03-04 09:00:00
#format dd/mm/YYYY
df['dates'] = pd.to_datetime(df['dates'], dayfirst=True)
print (df.sort_values('dates'))
symb dates
4 BLK 2014-03-01 09:00:00
31 NOTE 2014-04-03 09:00:00
0 BBR 2014-06-02 09:00:00
21 HZ 2014-06-02 09:00:00
24 OMNI 2014-07-02 09:00:00
40 RBY 2014-07-04 09:00:00
65 AMP 2016-04-03 09:00:00
Another solution is to use the parse_dates parameter of read_csv; if the format is dd/mm/YYYY, add dayfirst=True:
import pandas as pd
import numpy as np
from io import StringIO  # pandas.compat.StringIO was removed in later pandas versions
temp=u"""symb,dates
BLK,01/03/2014 09:00:00
BBR,02/06/2014 09:00:00
HZ,02/06/2014 09:00:00
OMNI,02/07/2014 09:00:00
NOTE,03/04/2014 09:00:00
AMP,03/04/2016 09:00:00
RBY,04/07/2014 09:00:00"""
#after testing replace 'StringIO(temp)' to 'filename.csv'
df = pd.read_csv(StringIO(temp), parse_dates=['dates'])
print (df)
symb dates
0 BLK 2014-01-03 09:00:00
1 BBR 2014-02-06 09:00:00
2 HZ 2014-02-06 09:00:00
3 OMNI 2014-02-07 09:00:00
4 NOTE 2014-03-04 09:00:00
5 AMP 2016-03-04 09:00:00
6 RBY 2014-04-07 09:00:00
print (df.dtypes)
symb object
dates datetime64[ns]
dtype: object
print (df.sort_values('dates'))
symb dates
0 BLK 2014-01-03 09:00:00
1 BBR 2014-02-06 09:00:00
2 HZ 2014-02-06 09:00:00
3 OMNI 2014-02-07 09:00:00
4 NOTE 2014-03-04 09:00:00
6 RBY 2014-04-07 09:00:00
5 AMP 2016-03-04 09:00:00
#after testing replace 'StringIO(temp)' to 'filename.csv'
df = pd.read_csv(StringIO(temp), parse_dates=['dates'], dayfirst=True)
print (df)
symb dates
0 BLK 2014-03-01 09:00:00
1 BBR 2014-06-02 09:00:00
2 HZ 2014-06-02 09:00:00
3 OMNI 2014-07-02 09:00:00
4 NOTE 2014-04-03 09:00:00
5 AMP 2016-04-03 09:00:00
6 RBY 2014-07-04 09:00:00
print (df.dtypes)
symb object
dates datetime64[ns]
dtype: object
print (df.sort_values('dates'))
symb dates
0 BLK 2014-03-01 09:00:00
4 NOTE 2014-04-03 09:00:00
1 BBR 2014-06-02 09:00:00
2 HZ 2014-06-02 09:00:00
3 OMNI 2014-07-02 09:00:00
6 RBY 2014-07-04 09:00:00
5 AMP 2016-04-03 09:00:00

I am not sure how you are getting the data, but if you are importing it from a source such as a CSV file, you could use pandas.read_csv and pass parse_dates=['dates']. The question is what type the dates column has. You can easily convert the values to datetime objects using dateutil.parser.parse. For example,
import pandas
import dateutil.parser  # the parser submodule must be imported explicitly
data = {'symb': ['BLK', 'BBR', 'HZ', 'OMNI', 'NOTE', 'AMP', 'RBY'],
'dates': ['01/03/2014 09:00:00', '02/06/2014 09:00:00', '02/06/2014 09:00:00',
'02/07/2014 09:00:00', '03/04/2014 09:00:00', '03/04/2016 09:00:00',
'04/07/2014 09:00:00']}
df = pandas.DataFrame.from_dict(data)
# dateutil reads ambiguous dates month-first by default
df.dates = df.dates.apply(dateutil.parser.parse)
print(df.to_string())
# OUTPUT
# 0 2014-01-03 09:00:00 BLK
# 1 2014-02-06 09:00:00 BBR
# 2 2014-02-06 09:00:00 HZ
# 3 2014-02-07 09:00:00 OMNI
# 4 2014-03-04 09:00:00 NOTE
# 5 2016-03-04 09:00:00 AMP
# 6 2014-04-07 09:00:00 RBY
This gets you the ISO 8601 format, which may be preferable to the dd/mm/yyyy format, but if you must have that format you can use the code recommended by @umutto.
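For day-first (dd/mm/yyyy) strings, an explicit format removes any ambiguity. The following is a minimal sketch using four rows from the question and assuming the column is named 'dates' as in the question's output:

```python
import pandas as pd

df = pd.DataFrame({
    "symb": ["BLK", "BBR", "NOTE", "AMP"],
    "dates": ["01/03/2014 09:00:00", "02/06/2014 09:00:00",
              "03/04/2014 09:00:00", "03/04/2016 09:00:00"],
})

# Parse strictly as day/month/year, then sort chronologically.
df["dates"] = pd.to_datetime(df["dates"], format="%d/%m/%Y %H:%M:%S")
df_sorted = df.sort_values("dates")
print(df_sorted)

# If the dd/mm/yyyy text is still wanted for display, format it after sorting.
display_dates = df_sorted["dates"].dt.strftime("%d/%m/%Y %H:%M:%S")
print(display_dates.tolist())
```

Sorting on the parsed column and formatting only for display keeps the order chronological, which sorting the raw strings cannot guarantee.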

Filtering Pandas DataFrame on last n dates

I have a Pandas DF that looks like this:
df
I want to filter the DF using a locally defined int parameter, 'days'. Such as when days = 10, my filtered DF only has the data for the last available 10 dates.
Until now, I have tried the following:
days=10
cutoff_date = df["SeriesDate"][-1:] - datetime.timedelta(days=days)
However, then trying to output the filtered DF using:
df[df['SeriesDate'] > cutoff_date]
I get the following error:
ValueError: Can only compare identically-labeled Series objects
I am still learning Python so will appreciate any help that I can get with this.

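The slice df["SeriesDate"][-1:] returns a one-element Series rather than a scalar, which is what triggers the error. I think you need to select the last value of column SeriesDate with iloc: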
start = pd.to_datetime('2015-02-24')
rng = pd.date_range(start, periods=15, freq='20H')
df = pd.DataFrame({'SeriesDate': rng, 'Value_1': np.random.random(15)})
print (df)
SeriesDate Value_1
0 2015-02-24 00:00:00 0.849160
1 2015-02-24 20:00:00 0.332487
2 2015-02-25 16:00:00 0.687638
3 2015-02-26 12:00:00 0.310326
4 2015-02-27 08:00:00 0.660795
5 2015-02-28 04:00:00 0.354475
6 2015-03-01 00:00:00 0.061312
7 2015-03-01 20:00:00 0.443908
8 2015-03-02 16:00:00 0.708326
9 2015-03-03 12:00:00 0.257419
10 2015-03-04 08:00:00 0.618363
11 2015-03-05 04:00:00 0.121625
12 2015-03-06 00:00:00 0.637324
13 2015-03-06 20:00:00 0.058292
14 2015-03-07 16:00:00 0.047624
days=10
cutoff_date = df["SeriesDate"].iloc[-1] - pd.Timedelta(days=days)
print (cutoff_date)
2015-02-25 16:00:00
df1 = df[df['SeriesDate'] > cutoff_date]
print (df1)
SeriesDate Value_1
3 2015-02-26 12:00:00 0.310326
4 2015-02-27 08:00:00 0.660795
5 2015-02-28 04:00:00 0.354475
6 2015-03-01 00:00:00 0.061312
7 2015-03-01 20:00:00 0.443908
8 2015-03-02 16:00:00 0.708326
9 2015-03-03 12:00:00 0.257419
10 2015-03-04 08:00:00 0.618363
11 2015-03-05 04:00:00 0.121625
12 2015-03-06 00:00:00 0.637324
13 2015-03-06 20:00:00 0.058292
14 2015-03-07 16:00:00 0.047624
Another alternative is to use max (thanks, Pocin):
cutoff_date = df["SeriesDate"].max() - pd.Timedelta(days=days)
print (cutoff_date)
2015-02-25 16:00:00
And if you want to filter by dates only:
days=10
cutoff_date = df["SeriesDate"].dt.date.iloc[-1] - pd.Timedelta(days=days)
print (cutoff_date)
2015-02-25
EDIT:
You can drop weekend dates using dayofweek and then select the remaining rows with isin:
start = pd.to_datetime('2015-02-24')
rng = pd.date_range(start, periods=15)
df = pd.DataFrame({'SeriesDate': rng, 'Value_1': np.random.random(15)})
print (df)
SeriesDate Value_1
0 2015-02-24 0.498387
1 2015-02-25 0.435767
2 2015-02-26 0.299233
3 2015-02-27 0.489286
4 2015-02-28 0.892167
5 2015-03-01 0.507436
6 2015-03-02 0.360427
7 2015-03-03 0.903886
8 2015-03-04 0.718148
9 2015-03-05 0.645489
10 2015-03-06 0.251285
11 2015-03-07 0.139275
12 2015-03-08 0.756845
13 2015-03-09 0.565863
14 2015-03-10 0.148077
days=10
last_day = df["SeriesDate"].dt.date.iloc[-1]
cutoff_date = last_day - pd.Timedelta(days=days)
rng = pd.date_range(cutoff_date, last_day)
rng = rng[rng.dayofweek < 5]  # dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
print (rng)
DatetimeIndex(['2015-03-02', '2015-03-03', '2015-03-04', '2015-03-05',
'2015-03-06', '2015-03-09', '2015-03-10'],
dtype='datetime64[ns]', freq=None)
df1 = df[df['SeriesDate'].isin(rng)]
print (df1)
SeriesDate Value_1
6 2015-03-02 0.360427
7 2015-03-03 0.903886
8 2015-03-04 0.718148
9 2015-03-05 0.645489
10 2015-03-06 0.251285
13 2015-03-09 0.565863
14 2015-03-10 0.148077
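The answers above cut off by calendar days. If "last available 10 dates" instead means the last n distinct dates that actually occur in the data, a sketch along these lines may help. The sample data and the name n_dates are hypothetical, and n_dates is set to 2 only to keep the example short.

```python
import pandas as pd

# Made-up data with gaps: not every calendar day is present.
df = pd.DataFrame({
    "SeriesDate": pd.to_datetime([
        "2015-03-01 09:00", "2015-03-01 17:00",
        "2015-03-04 09:00",
        "2015-03-09 09:00", "2015-03-09 12:00",
        "2015-03-10 09:00",
    ]),
    "Value_1": [1, 2, 3, 4, 5, 6],
})

n_dates = 2  # hypothetical parameter: number of distinct dates to keep

# Strip the time of day, then take the n most recent distinct dates.
day = df["SeriesDate"].dt.normalize()
last_days = day.drop_duplicates().sort_values().tail(n_dates)

# Keep every row that falls on one of those dates.
df_last = df[day.isin(last_days)]
print(df_last)
```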

How to merge two dataframes based on the closest (or most recent) timestamp

Suppose I have a dataframe df1, with columns 'A' and 'B'. A is a column of timestamps (e.g. unixtime) and 'B' is a column of some value.
Suppose I also have a dataframe df2 with columns 'C' and 'D'. C is also a unixtime column and D is a column containing some other values.
I would like to fuzzy merge the dataframes with a join on the timestamp. However, if the timestamps don't match (which they most likely don't), I would like it to merge on the closest entry before the timestamp in 'A' that it can find in 'C'.
pd.merge does not support this, and I find myself converting away from dataframes using to_dict(), and using some iteration to solve this. Is there a way in pandas to solve this?

numpy.searchsorted() finds the appropriate index positions to merge on. I hope the example below gets you closer to what you're looking for:
from datetime import datetime, timedelta
from random import randrange
import numpy as np
import pandas as pd
start = datetime(2015, 12, 1)
df1 = pd.DataFrame({'A': [start + timedelta(minutes=randrange(60)) for i in range(10)], 'B': [1] * 10}).sort_values('A').reset_index(drop=True)
df2 = pd.DataFrame({'C': [start + timedelta(minutes=randrange(60)) for i in range(10)], 'D': [2] * 10}).sort_values('C').reset_index(drop=True)
# label each row of df2 with the position of the first 'A' at or after its 'C'
df2.index = np.searchsorted(df1.A.values, df2.C.values)
print(pd.merge(left=df1, right=df2, left_index=True, right_index=True, how='left'))
A B C D
0 2015-12-01 00:01:00 1 NaT NaN
1 2015-12-01 00:02:00 1 2015-12-01 00:02:00 2
2 2015-12-01 00:02:00 1 NaT NaN
3 2015-12-01 00:12:00 1 2015-12-01 00:05:00 2
4 2015-12-01 00:16:00 1 2015-12-01 00:14:00 2
4 2015-12-01 00:16:00 1 2015-12-01 00:14:00 2
5 2015-12-01 00:28:00 1 2015-12-01 00:22:00 2
6 2015-12-01 00:30:00 1 NaT NaN
7 2015-12-01 00:39:00 1 2015-12-01 00:31:00 2
7 2015-12-01 00:39:00 1 2015-12-01 00:39:00 2
8 2015-12-01 00:55:00 1 2015-12-01 00:40:00 2
8 2015-12-01 00:55:00 1 2015-12-01 00:46:00 2
8 2015-12-01 00:55:00 1 2015-12-01 00:54:00 2
9 2015-12-01 00:57:00 1 NaT NaN

Building on @Stephan's answer and @JohnE's comment, something similar can be done with pandas.merge_asof for pandas>=0.19.0:
>>> import numpy as np
>>> import pandas as pd
>>> from datetime import datetime, timedelta
>>> start = datetime(2015, 12, 1)
>>> a_timestamps = pd.date_range(start, start + timedelta(hours=4.5), freq='30min')
>>> c_timestamps = pd.date_range(start, start + timedelta(hours=9), freq='h')
>>> df1 = pd.DataFrame({'A': a_timestamps, 'B': range(10)})
A B
0 2015-12-01 00:00:00 0
1 2015-12-01 00:30:00 1
2 2015-12-01 01:00:00 2
3 2015-12-01 01:30:00 3
4 2015-12-01 02:00:00 4
5 2015-12-01 02:30:00 5
6 2015-12-01 03:00:00 6
7 2015-12-01 03:30:00 7
8 2015-12-01 04:00:00 8
9 2015-12-01 04:30:00 9
>>> df2 = pd.DataFrame({'C': c_timestamps, 'D': range(10, 20)})
C D
0 2015-12-01 00:00:00 10
1 2015-12-01 01:00:00 11
2 2015-12-01 02:00:00 12
3 2015-12-01 03:00:00 13
4 2015-12-01 04:00:00 14
5 2015-12-01 05:00:00 15
6 2015-12-01 06:00:00 16
7 2015-12-01 07:00:00 17
8 2015-12-01 08:00:00 18
9 2015-12-01 09:00:00 19
>>> pd.merge_asof(left=df1, right=df2, left_on='A', right_on='C')
A B C D
0 2015-12-01 00:00:00 0 2015-12-01 00:00:00 10
1 2015-12-01 00:30:00 1 2015-12-01 00:00:00 10
2 2015-12-01 01:00:00 2 2015-12-01 01:00:00 11
3 2015-12-01 01:30:00 3 2015-12-01 01:00:00 11
4 2015-12-01 02:00:00 4 2015-12-01 02:00:00 12
5 2015-12-01 02:30:00 5 2015-12-01 02:00:00 12
6 2015-12-01 03:00:00 6 2015-12-01 03:00:00 13
7 2015-12-01 03:30:00 7 2015-12-01 03:00:00 13
8 2015-12-01 04:00:00 8 2015-12-01 04:00:00 14
9 2015-12-01 04:30:00 9 2015-12-01 04:00:00 14
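By default merge_asof matches each left row to the most recent right row at or before it, which is what the question asks for. If the closest entry in either direction is wanted instead, the direction and tolerance parameters can be used. The sketch below uses made-up timestamps and an arbitrary 30-minute tolerance; both key columns must already be sorted.

```python
import pandas as pd

df1 = pd.DataFrame({
    "A": pd.to_datetime(["2015-12-01 00:10",
                         "2015-12-01 00:50",
                         "2015-12-01 03:00"]),
    "B": [0, 1, 2],
})
df2 = pd.DataFrame({
    "C": pd.to_datetime(["2015-12-01 00:00", "2015-12-01 01:00"]),
    "D": [10, 11],
})

# direction='nearest' looks both backward and forward in time;
# tolerance leaves rows unmatched when the closest key is too far away.
merged = pd.merge_asof(
    df1, df2,
    left_on="A", right_on="C",
    direction="nearest",
    tolerance=pd.Timedelta("30min"),
)
print(merged)
```

Here the 00:50 row picks up the 01:00 entry because it is closer than 00:00, and the 03:00 row stays unmatched because nothing lies within 30 minutes.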
