pandas resample to specific weekday in month - python

I have a Pandas dataframe where I'd like to resample to every third Friday of the month.
import numpy as np
import pandas as pd

np.random.seed(0)
dates = pd.date_range("2018-01-01", "2018-08-31")
dates_df = pd.DataFrame(data=np.random.random(len(dates)), index=dates)

# requested output:
mask = (dates.weekday == 4) & (14 < dates.day) & (dates.day < 22)
dates_df.loc[mask]
But when a third Friday is missing (e.g. after dropping February's third Friday), I want to get the latest available value instead (so the value as of 2018-02-15). With the mask, February simply drops out rather than falling back to Feb 15:
# remove February third Friday:
dates_df = dates_df.drop([pd.to_datetime("2018-02-16")])
mask = (dates.weekday == 4) & (14 < dates.day) & (dates.day < 22)
dates_df.loc[mask]
Using a monthly resample combined with loffset gives the end-of-month values with the index offset, which is also not what I want:
from pandas.tseries.offsets import WeekOfMonth
dates_df.resample("M", loffset=WeekOfMonth(week=2, weekday=4)).last()
Is there an alternative (preferably using resample) that avoids resampling to daily values first and then applying a mask? That approach takes a long time to complete on my dataframe.

Your second attempt is in the right direction IIUC; you just need to resample using WeekOfMonth as the rule, rather than using it as an offset:
dates_df.resample(WeekOfMonth(week=2, weekday=4)).asfreq().dropna()
This approach will not offset the index, it should just return the data for the third Friday for every month.
Dealing with Missing 3rd Friday:
With the above code, if a 3rd Friday is missing, the whole month will be excluded. But depending on how you want to deal with missing data (bfill, ffill, pad, ...), you can amend the above to the following:
dates_df.resample(rule=WeekOfMonth(week=2,weekday=4)).bfill().asfreq(freq='D').dropna()
The above will bfill the missing 3rd Friday with the next value.
Update: Let's work with a fixed data set instead of np.random:
# create a smaller daterange
dates = pd.date_range("2018-05-01", "2018-08-31")
# create a data with only 1,2,3 values
data = [1,2,3] * int(len(dates)/3)
dates_df = pd.DataFrame(data=data, index=dates)
dates_df.head()
# Output:
2018-05-01 1
2018-05-02 2
2018-05-03 3
2018-05-04 1
2018-05-05 2
Now let's check what the data looks like for the 3rd Friday of each month by selecting it manually:
dates_df.loc[[
    pd.Timestamp('2018-05-18'),
    pd.Timestamp('2018-06-15'),
    pd.Timestamp('2018-07-20'),
    pd.Timestamp('2018-08-17')
]]
Output:
2018-05-18 3
2018-06-15 1
2018-07-20 3
2018-08-17 1
If you don't have any missing 3rd Fridays, running the code provided earlier:
dates_df.resample(rule=WeekOfMonth(week=2,weekday=4)).asfreq().dropna()
will produce the following output:
2018-05-18 3
2018-06-15 1
2018-07-20 3
2018-08-17 1
As you can see, the index has not been shifted here, and it returns the exact values for the 3rd Friday of each month.
Now say you do have some 3rd Fridays missing. Depending on how you want to fill them (previous value: ffill, or next value: bfill):
pad / ffill: propagate the last valid observation forward to the next valid one
backfill / bfill: use the NEXT valid observation to fill the gap
dates_df.drop(index=pd.Timestamp('2018-08-17')).resample(rule=WeekOfMonth(week=2, weekday=4)).ffill().asfreq(freq='D').dropna()
2018-05-18 3
2018-06-15 1
2018-07-20 3
2018-08-17 3
dates_df.drop(index=pd.Timestamp('2018-08-17')).resample(rule=WeekOfMonth(week=2, weekday=4)).bfill().asfreq(freq='D').dropna()
2018-04-20 1
2018-05-18 3
2018-06-15 1
2018-07-20 3
2018-08-17 2
If, say, the whole index were shifted, as in your example:
dates_df.resample(rule='M', loffset=WeekOfMonth(week=2, weekday=4)).asfreq().dropna()
# Output:
2018-06-15 1
2018-07-20 1
2018-08-17 2
2018-09-21 3
What's happening there is that you're resampling by rule 'M' (month end) and then offsetting (shifting forward) the index to the 3rd Friday of each month. Before the offset, this is how it looks:
dates_df.resample(rule='M').asfreq().dropna()
# Output
2018-05-31 1
2018-06-30 1
2018-07-31 2
2018-08-31 3
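A version note (my addition, not from the original answer): resample's loffset argument was deprecated in pandas 1.1 and removed in pandas 2.0. The documented replacement is to shift the resampled index yourself, roughly:
from pandas.tseries.offsets import WeekOfMonth

# Equivalent of resample("M", loffset=WeekOfMonth(...)).last() on newer pandas
res = dates_df.resample("M").last()   # newer versions spell this rule "ME"
res.index = res.index + WeekOfMonth(week=2, weekday=4)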

Related

Add column to dataframe based on date range

I want to add a column to my data frame prod_data based on a range of dates. This is an example of the data in the ['Mount Time'] column that the new column will be derived from:
0 2022-08-17 06:07:00
1 2022-08-17 06:12:00
2 2022-08-17 06:40:00
3 2022-08-17 06:45:00
4 2022-08-17 06:47:00
The new column is named ['Week'] and I want it to run Monday through Sunday, with week 1 starting on 9/5/22 and running through 9/11/22, week 2 the following Monday through Sunday, and so on up to the last week, which would be 53. I would also like weeks prior to 9/5 to have negative week numbers, so 8/29/22 would be the start of week -1, and so on.
The only thing I could think of was to create 2 massive lists and use np.select to define the parameters of the column, but there has to be a cleaner way of doing this, right?
You can use pandas datetime objects to figure out how many days away a date is from your start date, 9/5/2022, and then use floor division to convert that to week numbers. I made the "mount_time" column just to emphasize that the original column should be a datetime object.
prod_data["mount_time"] = pd.to_datetime( prod_data[ "Mount Time" ] )
start_date = pd.to_datetime( "9/5/2022" )
days_away = prod_data.mount_time - start_date
prod_data["Week"] = ( days_away.dt.days // 7 ) + 1
As intended, 9/5/2022 through 9/11/2022 will have a value of 1. 8/29/2022 would start week 0 (not -1 as you wrote) unless you want 9/5/2022 to start as week 0 (in which case just delete the + 1 from the code). Some more examples:
>>> test[ ["date", "Week" ] ]
date Week
0 2022-08-05 -4
1 2022-08-14 -3
2 2022-08-28 -1
3 2022-08-29 0
4 2022-08-30 0
5 2022-09-05 1
6 2022-09-11 1
7 2022-09-12 2
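For reference, a minimal self-contained sketch that reproduces the table above (the test frame here is hypothetical, built from the dates shown):
import pandas as pd

# Hypothetical test frame built from the dates shown above
test = pd.DataFrame({"date": pd.to_datetime([
    "2022-08-05", "2022-08-14", "2022-08-28", "2022-08-29",
    "2022-08-30", "2022-09-05", "2022-09-11", "2022-09-12"])})
start_date = pd.to_datetime("9/5/2022")  # a Monday, the start of week 1
test["Week"] = ((test["date"] - start_date).dt.days // 7) + 1
print(test[["date", "Week"]])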

How can I filter for rows one hour before and after a set timestamp in Python?

I am trying to filter a DataFrame to only show values 1-hour before and 1-hour after a specified time/date, but am having trouble finding the right function for this. I am working in Python with Pandas.
The posts I see regarding masking by date mostly cover the case of masking rows between a specified start and end date, but I am having trouble finding help on how to mask rows based around a single date.
I have time series data as a DataFrame that spans about a year, so thousands of rows. This data is at 1-minute intervals, and so each row corresponds to a row ID, a timestamp, and a value.
Example of DataFrame:
ID timestamp value
0 2011-01-15 03:25:00 34
1 2011-01-15 03:26:00 36
2 2011-01-15 03:27:00 37
3 2011-01-15 03:28:00 37
4 2011-01-15 03:29:00 39
5 2011-01-15 03:30:00 29
6 2011-01-15 03:31:00 28
...
I am trying to create a function that outputs a DataFrame that is the initial DataFrame, but with only the rows from 1 hour before to 1 hour after a specified timestamp, i.e. only rows within this 2-hour window.
To be more clear:
I have a DataFrame that has 1-minute interval data throughout a year (as exemplified above).
I now identify a specific timestamp: 2011-07-14 06:15:00
I now want to output a DataFrame that is the initial input DataFrame, but contains only rows within 1 hour before and 1 hour after 2011-07-14 06:15:00.
Do you know how I can do this? I understand that I could just create a filter where I get rid of all values before 2011-07-14 05:15:00 and after 2011-07-14 07:15:00, but my goal is to have the user simply enter a single date/time (e.g. 2011-07-14 06:15:00) to produce the output DataFrame.
This is what I have tried so far:
hour = pd.DateOffset(hours=1)
date = pd.Timestamp("2011-07-14 06:15:00")
df = df.set_index("timestamp")
df([date - hour: date + hour])
which returns:
File "<ipython-input-49-d42254baba8f>", line 4
df([date - hour: date + hour])
^
SyntaxError: invalid syntax
I am not sure if this is really only a syntax error, or something deeper and more complex. How can I fix this?
Thanks!
You can do it with:
import pandas as pd
import datetime as dt

data = {"date": ["2011-01-15 03:10:00", "2011-01-15 03:40:00",
                 "2011-01-15 04:10:00", "2011-01-15 04:40:00",
                 "2011-01-15 05:10:00", "2011-01-15 07:10:00"],
        "value": [1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S', errors='ignore')

date_search = dt.datetime.strptime("2011-01-15 05:20:00", '%Y-%m-%d %H:%M:%S')
mask = (df['date'] > date_search - dt.timedelta(hours=1)) & \
       (df['date'] <= date_search + dt.timedelta(hours=1))
print(df.loc[mask])
result:
date value
3 2011-01-15 04:40:00 4
4 2011-01-15 05:10:00 5
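As a side note on the asker's attempt (my sketch, not part of the answer above): the bracket expression just needed to go through .loc, which supports label slicing on a sorted DatetimeIndex:
hour = pd.DateOffset(hours=1)
date = pd.Timestamp("2011-07-14 06:15:00")

# .loc replaces the invalid df([start: end]) call; this assumes the
# "timestamp" column from the question, set as a sorted index
df = df.set_index("timestamp").sort_index()
window = df.loc[date - hour : date + hour]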

select a specific day from a data set, else the next working day if not available

I have a large dataset spanning many years and I want to subset this data frame by selecting data based on a specific day of the month using python.
This is simple enough and I have achieved with the following line of code:
df[df.index.day == 12]
This selects data from the 12th of each month for all years in the data set. Great.
The problem I have, however, is that the original data set contains working-day data only. The 12th might actually be a weekend or national holiday and thus not appear in the data set, so nothing is returned for that month.
What I would like to happen is to select the 12th where available, else select the next working day in the data set.
All help appreciated!
Here's a solution that looks at three days from every month (12, 13, and 14), and then picks the minimum. If the 12th is a weekend it won't exist in the original dataframe, and you'll get the 13th. The same goes for the 14th.
Here's the code:
# Create dummy data - initial range
df = pd.DataFrame(pd.date_range("2018-01-01", "2020-06-01"), columns = ["date"])
# Create dummy data - Drop weekends
df = df[df.date.dt.weekday.isin(range(5))]
# get only the 12, 13, and 14 of every month
# group by year and month.
# get the minimum
df[df.date.dt.day.isin([12, 13, 14])].groupby(by=[df.date.dt.year, df.date.dt.month], as_index=False).min()
Result:
date
0 2018-01-12
1 2018-02-12
2 2018-03-12
3 2018-04-12
4 2018-05-14
5 2018-06-12
6 2018-07-12
7 2018-08-13
8 2018-09-12
9 2018-10-12
...
Edit
Per a question in the comments about national holidays: the same solution applies. Instead of picking 3 days (12, 13, 14), pick a larger number (e.g. 12-18). Then, get the minimum of these that actually exists in the dataframe - and that's the first working day starting with the 12th.
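A sketch of that extension (mine, following the same pattern as the answer above):
# Widen the candidate window to a full week (12-18) so weekends plus
# a holiday cannot exhaust it, then keep the earliest date that exists
df[df.date.dt.day.isin(range(12, 19))].groupby(
    by=[df.date.dt.year, df.date.dt.month], as_index=False).min()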
You can backfill the dataframe first to fill the missing values, then select the date you want:
df = df.asfreq('d', method='bfill')
Then you can do df[df.index.day == 12]. Note that the resulting rows are labelled with the 12th even when the value was backfilled from the next working day.
This is my approach; I will explain each line below the code. Please feel free to add a comment if there's something unclear:
!pip install workalendar #Install the module
import pandas as pd #Import pandas
from workalendar.usa import NewYork #Import the required country and city
df = pd.DataFrame(pd.date_range(start='1/1/2018', end='12/31/2018')).rename(columns={0:'Dates'}) #Create a dataframe with dates for the year 2018
cal = NewYork() #Instance the calendar
df['Is_Working_Day'] = df['Dates'].map(lambda x: cal.is_working_day(x)) #Create an extra column, True for working days, False otherwise
df[(df['Dates'].dt.day >= 12) & (df['Is_Working_Day'] == True)].groupby(df['Dates'].dt.month)['Dates'].first()
Essentially, the last line keeps all days with a day-of-month of 12 or higher that are actual working days; we then group by month and return the first day of each group, i.e. the first date where both conditions (day >= 12 and Is_Working_Day == True) are met.
Output:
Dates
1 2018-01-12
2 2018-02-13
3 2018-03-12
4 2018-04-12
5 2018-05-14
6 2018-06-12
7 2018-07-12
8 2018-08-13
9 2018-09-12
10 2018-10-12
11 2018-11-13
12 2018-12-12
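If the goal is to subset the asker's original frame rather than the demo calendar, one hedged way to reuse the result (original_df here is hypothetical, standing in for the asker's frame with its DatetimeIndex):
# First working day on or after the 12th of each month, as above
good_days = df[(df['Dates'].dt.day >= 12) & (df['Is_Working_Day'] == True)].groupby(df['Dates'].dt.month)['Dates'].first()

# original_df is the asker's data frame, indexed by date
subset = original_df[original_df.index.isin(good_days)]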

Calculating moving median within group

I want to compute a rolling median of the price column over the 4 days back; the data will be grouped by date. So basically I want to take the prices for a given day, plus all prices from the 4 days before it, and calculate the median of these values.
Here are the sample data:
id date price
1637027 2020-01-21 7045204.0
280955 2020-01-11 3590000.0
782078 2020-01-28 2600000.0
1921717 2020-02-17 5500000.0
1280579 2020-01-23 869000.0
2113506 2020-01-23 628869.0
580638 2020-01-25 650000.0
1843598 2020-02-29 969000.0
2300960 2020-01-24 5401530.0
1921380 2020-02-19 1220000.0
853202 2020-02-02 2990000.0
1024595 2020-01-27 3300000.0
565202 2020-01-25 3540000.0
703824 2020-01-18 3990000.0
426016 2020-01-26 830000.0
I got close with combining rolling and groupby:
df.groupby('date').rolling(window = 4, on = 'date')['price'].median()
But this keeps one row per original index value, and since a median of medians is not the overall median, I cannot simply merge these rows afterwards to produce one result per date.
Result now looks like this:
date date
2020-01-10 2020-01-10 NaN
2020-01-10 NaN
2020-01-10 NaN
2020-01-10 3070000.0
2020-01-10 4890000.0
...
2020-03-11 2020-03-11 4290000.0
2020-03-11 3745000.0
2020-03-11 3149500.0
2020-03-11 3149500.0
2020-03-11 3149500.0
Name: price, Length: 389716, dtype: float64
It seems it just dropped the first 3 values and then simply echoed the price values.
Is it possible to get one lagged / moving median value per one date?
You can use rolling with a frequency window of 5 days to get today plus the last 4 days, then drop_duplicates to keep the last row per day. First create a copy (if you want to keep the original), sort_values by date, and ensure the date column is datetime:
#sort and change to datetime
df_f = df[['date','price']].copy().sort_values('date')
df_f['date'] = pd.to_datetime(df_f['date'])
#create the column rolling
df_f['price'] = df_f.rolling('5D', on='date')['price'].median()
#drop_duplicates and keep the last row per day
df_f = df_f.drop_duplicates(['date'], keep='last').reset_index(drop=True)
print (df_f)
date price
0 2020-01-11 3590000.0
1 2020-01-18 3990000.0
2 2020-01-21 5517602.0
3 2020-01-23 869000.0
4 2020-01-24 3135265.0
5 2020-01-25 2204500.0
6 2020-01-26 849500.0
7 2020-01-27 869000.0
8 2020-01-28 2950000.0
9 2020-02-02 2990000.0
10 2020-02-17 5500000.0
11 2020-02-19 3360000.0
12 2020-02-29 969000.0
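As a sanity check on one row (my verification, using the sample data): for 2020-01-25 the '5D' window covers 2020-01-21 through 2020-01-25, i.e. six prices:
import numpy as np

# Sample prices dated 2020-01-21 .. 2020-01-25
window = [7045204.0, 869000.0, 628869.0, 5401530.0, 650000.0, 3540000.0]
print(np.median(window))  # 2204500.0, matching the 2020-01-25 row above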
This is a step-by-step process. There are probably more efficient methods of getting what you want. Note that if your dates carry time information, you would need to drop it before grouping by date. Note also that this approach uses the last four dates present in the data, not the last four calendar days.
import pandas as pd
import statistics as stat
import numpy as np
# Replace with you data import
df = pd.read_csv('random_dates_prices.csv')
# Convert your date to a datetime
df['date'] = pd.to_datetime(df['date'])
# Sort your data by date
df = df.sort_values(by = ['date'])
# Create group by object
dates = df.groupby('date')
# Reformat dataframe for one row per day, with prices in a nested list
df = pd.DataFrame(dates['price'].apply(lambda s: s.tolist()))
# Extract price lists to a separate list
prices = df['price'].tolist()
# Initialize list to store past four days of prices for current day
four_days = []
# Loop over the prices list to combine the last four days to a single list
for i in range(3, len(prices), 1):
    x = i - 1
    y = i - 2
    z = i - 3
    four_days.append(prices[i] + prices[x] + prices[y] + prices[z])
# Initialize a list to store median values
medians = []
# Loop through the four_days list and calculate the median of the last four days for the current date
for i in range(len(four_days)):
    medians.append(stat.median(four_days[i]))
# Create dummy zero values so the lists created above line up with the dataframe
four_days.insert(0, 0)
four_days.insert(0, 0)
four_days.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
# Add both new lists to data frames
df['last_four_day_prices'] = four_days
df['last_four_days_median'] = medians
# Replace dummy zeros with np.nan
df[['last_four_day_prices', 'last_four_days_median']] = df[['last_four_day_prices', 'last_four_days_median']].replace(0, np.nan)
# Clean the data frame so you only have a single date and a median value for the past four days
df_clean = df.drop(['price', 'last_four_day_prices'], axis=1)

How to group every date in an uneven pandas datetime series with the closest date 1 year ago in the same series?

I am processing time-series data in a pandas dataframe. The datetime index is incomplete (i.e. some dates are missing).
I want to create a new column holding a datetime series offset by 1 year, but containing only dates present in the original DatetimeIndex. The challenge is that in many cases the exact 1-year match is not present in the index.
Index (Input) 1 year offset (Output)
1/2/2014 None
1/3/2014 None
1/6/2014 None
1/7/2014 None
1/9/2014 None
1/10/2014 None
1/2/2015 1/2/2014
1/5/2015 1/3/2014
1/6/2015 1/6/2014
1/7/2015 1/7/2014
1/8/2015 1/9/2014
1/9/2015 1/10/2014
The requirements are as follows:
Every date as of 1/2/2015 must have a corresponding offset date (no blanks)
Every date within the "offset date" group must also be present in the Index column (i.e. introduction of new dates, like 1/8/2014, is not desired)
All offset dates must be ordered in an ascending way (the sequence of dates must be preserved)
What I have tried so far:
The DateOffset doesn't help, since it is insensitive to dates not present in the index.
The .shift method data["1 year offset (Output)"] = data.Index.shift(365) doesn't help because the number of dates within the index is different across the years.
What I am trying to do now has several steps:
Apply Dateoffset method at first to create "temp 1 year offset"
Remove single dates from "temp 1 year offset" that are not present in datetimeindex using set(list) method and replace cells by NaN
Select dates in datetimeindex whose "temp 1 year offset" is NaN and substract one year
Map the dates from (3) to their closest date in the datetimeindex using argmin
The challenge here is that I am getting duplicate entries as well as a descending order of days in some cases. Those mess up the results in the following way (see the timedeltas between day n and day n+1):
Index (Input) 1 year offset (Output) Timedelta
4/17/2014 4/16/2014 1
4/22/2014 4/17/2014 1
4/23/2014 4/25/2014 8
4/24/2014 None
4/25/2014 4/22/2014 -3
4/28/2014 4/23/2014 1
4/29/2014 4/24/2014 1
4/30/2014 4/25/2014 1
In any case, this last approach seems like overkill given the simplicity of the underlying goal. Is there a faster and simpler way to do it?
How to group every date in an uneven pandas datetime series with the closest date one year ago in the same series?
This would be one way. Note, however, that subtracting a flat 365 days mishandles years with 366 days; see this thread for properly handling a 1-year offset:
Add one year in current date PYTHON
This code therefore needs some small modifications.
import pandas as pd
import datetime
df = pd.DataFrame(dict(dates=[
    '1/3/2014',
    '1/6/2014',
    '1/7/2014',
    '1/9/2014',
    '1/10/2014',
    '1/2/2015',
    '1/5/2015',
    '1/6/2015',
    '1/7/2015',
    '1/8/2015',
    '1/9/2015']))
# Convert column to datetime
df.dates = pd.to_datetime(df.dates)
# Store min(year) as a variable
minyear = min(df.dates).year
# Calculate the day with timedelta -365 days (might fail on 2012?)
df['offset'] = [(i + datetime.timedelta(days=-365)).date()
                if i.year != minyear else None
                for i in df.dates]
df
Returns:
dates offset
0 2014-01-03 None
1 2014-01-06 None
2 2014-01-07 None
3 2014-01-09 None
4 2014-01-10 None
5 2015-01-02 2014-01-02
6 2015-01-05 2014-01-05
7 2015-01-06 2014-01-06
8 2015-01-07 2014-01-07
9 2015-01-08 2014-01-08
10 2015-01-09 2014-01-09
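Since the output above contains dates (e.g. 2014-01-05, 2014-01-08) that are not in the original index, here is a hedged sketch of my own (not from the answer) that snaps each shifted date onto the nearest date actually present, using DatetimeIndex.get_indexer. Note it does not by itself enforce the asker's ascending/no-duplicates requirement, which would need an extra pass:
import pandas as pd

idx = pd.DatetimeIndex(pd.to_datetime([
    '1/2/2014', '1/3/2014', '1/6/2014', '1/7/2014', '1/9/2014',
    '1/10/2014', '1/2/2015', '1/5/2015', '1/6/2015', '1/7/2015',
    '1/8/2015', '1/9/2015']))

# Shift back one year (leap-year aware), then snap onto the closest
# date that is actually present in idx
targets = idx - pd.DateOffset(years=1)
nearest = idx[idx.get_indexer(targets, method='nearest')]

# Rows in the first year have no valid offset; blank them out.
# Caveat: nearest-matching can still yield duplicates or locally
# descending offsets, which would need deduplication/reordering.
offset = nearest.where(idx.year != idx.year.min())
out = pd.DataFrame({'offset': offset}, index=idx)
print(out)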
