I would like to create a running dataframe of trading data for the next four hours from the current time while skipping non-trading hours (5-6pm weekdays, Saturday-6pm Sunday). For example, at 4pm on Friday, I'd like a dataframe that runs from 4pm to 5pm on Friday and then 6pm-9pm on Sunday.
Currently, I am using the following:
time_parameter = pd.Timedelta(hours=4) #Set time difference to four hours
df = df.set_index(['Time'])
for current_time, row in df.iterrows():  # df is the entire trading data df
    future_time = current_time + time_parameter
    temp_df = df.loc[current_time : future_time]
This obviously doesn't skip non-trading hours so I am trying to find an efficient way to do that.
One method I can use is creating a set of non-trading hours, checking if the current time bounds (current_time:future_time) include any, and adding an additional hour for each.
However, since the dataset has about 3.5 million rows and this check would have to run for every row, does anyone know of a faster approach?
In short, looking for a method to add 4 business hours (Sun-Fri 6pm-5pm) to current time. Thanks!
Input Data: This shows the first 19 rows of the trading data
Expected Output Data: This shows the first and last 3 rows from a four hour period starting at 18:00:30 on January 8th, 2017
Solution
Based on the answer by Code Different below, I used the following:
from datetime import time as time_2  # assumed alias; the import is not shown in the original post
import pandas as pd

def last_trading_hour(start_time, time_parameter, periods_parameter):
    # 'h' instead of 'H': newer pandas releases deprecate the uppercase alias
    start_series = pd.date_range(start_time, freq='h', periods=periods_parameter)
    mask = (((start_series.dayofweek == 6) & (time_2(18) <= start_series.time))  # Sunday: after 6pm
            | ((start_series.dayofweek == 4) & (start_series.time < time_2(17)))  # Friday before 5pm
            | ((start_series.dayofweek < 4) & (start_series.time < time_2(17)))  # Mon-Thur before 5pm
            | ((start_series.dayofweek < 4) & (time_2(18) <= start_series.time))  # Mon-Thur after 6pm
            )
    return start_series[mask][time_parameter]

start_time = pd.Timestamp('2019-08-16 13:00:10')
time_parameter = 4  # Adding 4 hours to time
# Max 49 straight hours of no-trades (Fri 5pm-Sun 6pm), plus one period so that
# the returned hour itself is part of the generated range
periods_parameter = 49 + time_parameter + 1
last_trading_hour(start_time, time_parameter, periods_parameter)
Results:
Timestamp('2019-08-18 18:00:10')
If you need the entire series, follow Code Different's method for indexing.
Generate a sufficiently long series of hours then filter for the first 4 that are trading hours:
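from datetime import time
import pandas as pd

start_time = pd.Timestamp('2019-08-16 16:00')
s = pd.date_range(start_time, freq='h', periods=72)  # 'h': the uppercase 'H' alias is deprecated in newer pandas
is_trading_hour = (
    ((s.weekday == 6) & (time(18) <= s.time))       # Sunday from 6pm
    | ((s.weekday == 4) & (s.time < time(17)))      # Friday before 5pm
    | ((s.weekday < 4) & ((s.time < time(17)) | (time(18) <= s.time)))  # Mon-Thu, skipping the 5-6pm break
)
s[is_trading_hour][:4]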
Result:
DatetimeIndex(['2019-08-16 16:00:00', '2019-08-18 18:00:00',
'2019-08-18 19:00:00', '2019-08-18 20:00:00'],
dtype='datetime64[ns]', freq=None)
It's hard to tell from so little information. However, it seems that you're working on hour boundaries. If so, it should be straightforward to set up a look-up table (dict) keyed by day and hour, perhaps (0, 0) for midnight Sun/Mon, (2, 13) for 1pm Wed, and so on. Each entry holds the offset to the end of the 4-hour trading period:
    (0, 0): Timedelta(hours=4),    # 0:00 Mon; normal span, regular trading hours
    (0, 16): Timedelta(hours=5),   # 16:00 Mon; 1 hour of down-time
    (4, 16): Timedelta(hours=53),  # 16:00 Fri; 1 hour trade, 49 hrs down, 3 hrs trade
    (5, 16): Timedelta(hours=30),  # 16:00 Sat; 26 hours down, 4 hours trade
Add the indicated Timedelta to your start time; that gives you the end time of the period. You can write a few loops and if statements to compute these times for you, or just hard-code all 168; they're rather repetitive.
Checking the rows of your data against these bounds remains up to you, since you didn't specify their format or semantics in your posting.
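The 168-entry table described above need not be typed by hand. Here is a minimal sketch that generates it, assuming the trading calendar from the question (closed 17:00-18:00 Monday to Thursday, and from Friday 17:00 until Sunday 18:00) and Monday = 0 as in the examples. The names is_trading_hour and build_span_table are hypothetical, and the standard library's timedelta stands in for pd.Timedelta:

```python
from datetime import timedelta

def is_trading_hour(day, hour):
    """True if the hour starting at (day, hour) is a trading hour; Monday is day 0."""
    if day == 5:                 # Saturday: closed all day
        return False
    if day == 6:                 # Sunday: opens at 18:00
        return hour >= 18
    if day == 4:                 # Friday: closes at 17:00 for the weekend
        return hour < 17
    return hour != 17            # Monday-Thursday: closed 17:00-18:00

def build_span_table(trading_hours=4):
    """Map (day, hour) to the span that covers `trading_hours` hours of trading."""
    table = {}
    for start in range(7 * 24):
        traded, elapsed = 0, 0
        while traded < trading_hours:
            slot = (start + elapsed) % (7 * 24)   # wrap around the week
            if is_trading_hour(slot // 24, slot % 24):
                traded += 1
            elapsed += 1
        table[(start // 24, start % 24)] = timedelta(hours=elapsed)
    return table

span = build_span_table()
print(span[(0, 0)], span[(0, 16)], span[(4, 16)])
```

Looking up span[(ts.weekday(), ts.hour)] and adding it to the start time then costs one dictionary access per row, which is the point of the table approach.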
Related
I am a complete beginner in Python and this is my first question on Stack Overflow. I have tried numerous tutorials on YouTube plus some additional Google searching, but haven't been able to completely solve my task. Briefly, it is as follows:
We have a dataset of futures prices (values) for the next 12-36 months. Each value corresponds to one month in the future. The idea is for the code to take the following input:
starting date in days (like 2nd of Feb 2021 or any other)
duration of given days (say 95 or 150 days or 425 days)
The code has to calculate the number of days from each given month between starting and ending date (which is starting + duration) and then to use appropriate values from corresponding month to calculate an average price for this particular duration in time.
Example:
Starting date is 2nd of Feb 2021 and duration is 95 days (end date 8th of May). Values are Feb - 7750, Mar - 9200, April - 9500, May is 10100.
I have managed to do this in Excel (which was very clumsy and too complicated to use on a daily basis), and the average comes to around 8949 for the example above. But I can't figure out how to code the same split into days per month in Python. All of the articles simply point to the "monthrange" function, but how can that be applied to this task?
Appreciate your understanding of a newbie question, and sorry that I lack the knowledge to explain my thoughts more clearly.
Looking forward to any help relative to above.
You can use pandas' pd.to_datetime() to construct your code. If you need further help, just press Ctrl + Tab within your code to see the inputs and their usage.
You can try the following code.
The input_start_date() function asks for the start date and returns it when called.
Once we have the start date, we read the duration in days.
Then we simply add the two using timedelta.
The distribution of days per month is based on code by Stack Overflow user wwii.
import datetime
from datetime import timedelta

def input_start_date():
    YEAR = int(input('Enter the year : '))
    MONTH = int(input('Enter the month : '))
    DAY = int(input('Enter the day : '))
    DATE = datetime.date(YEAR, MONTH, DAY)
    return DATE

# get the start date:
Start_date = input_start_date()
# get the Duration
Duration = int(input('Enter the duration : '))
print('Start Date : ', Start_date)
print('Duration :', Duration)
# final date.
Final_date = Start_date + timedelta(days=Duration)
print(Final_date)
# credit goes to #wwii -----------------------
one_day = datetime.timedelta(1)
start_dates = [Start_date]
end_dates = []
today = Start_date
while today < Final_date:  # strict '<' avoids an empty trailing segment when Final_date ends a month
    tomorrow = today + one_day
    if tomorrow.month != today.month:
        start_dates.append(tomorrow)
        end_dates.append(today)
    today = tomorrow
end_dates.append(Final_date)
# -----------------------------------------------
print("Distribution : ")
for i in range(len(start_dates)):
    days = (end_dates[i] - start_dates[i]).days + 1  # inclusive day count
    print(start_dates[i], ' to ', end_dates[i], ' = ', days)
'''
Distribution :
2021-02-02 to 2021-02-28 = 27
2021-03-01 to 2021-03-31 = 31
2021-04-01 to 2021-04-30 = 30
2021-05-01 to 2021-05-08 = 8
'''
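The distribution still has to be combined with the monthly prices to get the average the question asks for. Here is a minimal sketch of that day-weighted average, using the four example prices from the question. The function name average_price and the prices dict keyed by (year, month) are hypothetical. It assumes the end date itself is not counted (95 days in total), which is the convention that reproduces the figure of roughly 8949 from the question; counting 8 May as well gives a slightly different value.

```python
from collections import Counter
from datetime import date, timedelta

def average_price(start, duration, prices):
    """Day-weighted average price over `duration` days beginning at `start`.

    `prices` maps (year, month) to that month's price. The day at
    start + duration is not counted (an assumption, see the text above).
    """
    days_per_month = Counter()
    for offset in range(duration):
        day = start + timedelta(days=offset)
        days_per_month[(day.year, day.month)] += 1
    total = sum(prices[month] * n for month, n in days_per_month.items())
    return total / duration, dict(days_per_month)

prices = {(2021, 2): 7750, (2021, 3): 9200, (2021, 4): 9500, (2021, 5): 10100}
avg, split = average_price(date(2021, 2, 2), 95, prices)
print(split)          # days counted per (year, month)
print(round(avg, 2))  # day-weighted average price
```

Walking the days one at a time is slower than arithmetic with calendar.monthrange, but for durations of a few hundred days it is easy to verify and hard to get wrong.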
I want to find the first day of the month that lies, on average, 90 days before a given date. For instance:
December 15 -- returns August 30
December 30 -- returns August 30
December 1st -- returns August 30
I know this can be done with Pandas pd.DateOffset:
print(pd.Timestamp("2019-12-15") - pd.DateOffset(days=90))
but then I'll get something like September 15th.
I know I can count minus 90 days, select the month, subtract 1 and then select last day of the obtained month, but I was wondering if this can be easily done in one line of code, efficiently.
Assume that the date in question is:
dat = pd.Timestamp('2019-12-15')
To compute the date 90 days before, run:
dat2 = dat - pd.DateOffset(days=90)
getting 2019-09-16.
And finally, to get the start of this month, run:
pd.offsets.MonthBegin().rollback(dat2)
getting 2019-09-01. (Do not use dat2 - pd.offsets.MonthBegin(0) here: an offset with n=0 rolls a date that is not on a month start forward, so it returns 2019-10-01.)
To put the whole thing short, run just:
pd.offsets.MonthBegin().rollback(dat - pd.DateOffset(days=90))
A subtle difference becomes visible if you start from a date which,
turned 90 days back, gives exactly the first day of a month. E.g.
dat = pd.Timestamp('2019-11-30')
dat2 = dat - pd.DateOffset(days=90)
gives 2019-09-01.
Then pd.offsets.MonthBegin().rollback(dat2) gives just the same date.
If you want in this case the start date of the previous month, run:
dat2 - pd.offsets.MonthBegin(1)
(note the MonthBegin(1) offset), getting 2019-08-01. For a date that is not on a month start, the same expression returns the start of its own month.
So choose the variant which suits your needs.
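For comparison, the same calculation can be sketched with only the standard library, where replace(day=1) always snaps back to the first day of the date's own month and never rolls forward. The helper name month_start_before is hypothetical:

```python
from datetime import date, timedelta

def month_start_before(d, days=90):
    """First day of the month containing the date `days` days before `d`."""
    return (d - timedelta(days=days)).replace(day=1)

print(month_start_before(date(2019, 12, 15)))  # 90 days back is 2019-09-16
print(month_start_before(date(2019, 11, 30)))  # 90 days back is 2019-09-01
```

Both calls land on the first of September 2019; to get the previous month when the shifted date is already a month start, subtract one more day before calling replace.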
I created an hourly dates dataframe, and now I would like to create a column that flags whether each row (hour) is in Daylight Saving Time or not. For example, in summer hours, the flag should == 1, and in winter hours, the flag should == 0.
# Localized dates dataframe
dates = pd.DataFrame(data=pd.date_range('2018-1-1', '2019-1-1', freq='h', tz='America/Denver'), columns=['date_time'])
# My failed attempt to create the flag column
dates['dst_flag'] = np.where(dates['date_time'].dt.daylight_saving_time == True, 1, 0)
There's a nice link in the comments that at least lets you do this manually. AFAIK, there isn't a vectorized way to do this.
import pandas as pd
import numpy as np
from pytz import timezone

# Generate data (as opposed to index)
date_range = pd.to_datetime(pd.date_range('1/1/2018', '1/1/2019', freq='h', tz='America/Denver'))
date_range = [date for date in date_range]

# Localized dates dataframe
df = pd.DataFrame(data=date_range, columns=['date_time'])

# Map transition times to year for some efficiency gain
tz = timezone('America/Denver')
transition_times = tz._utc_transition_times[1:]
# Caution: these datetimes are naive, and astimezone() on a naive datetime
# interprets it as system local time rather than UTC.
transition_times = [t.astimezone(tz) for t in transition_times]
transition_times_by_year = {}
for start_time, stop_time in zip(transition_times[::2], transition_times[1::2]):
    year = start_time.year
    transition_times_by_year[year] = [start_time, stop_time]

# If the date is in DST, mark true, else false
def mark_dst(dates):
    for date in dates:
        start_dst, stop_dst = transition_times_by_year[date.year]
        yield start_dst <= date <= stop_dst

df['dst_flag'] = [dst_flag for dst_flag in mark_dst(df['date_time'])]

# Do a quick sanity check to make sure we did this correctly for year 2018
# (.iloc selects by position; plain [0] / [-1] would look up index labels)
dst_start = df[df['dst_flag'] == True]['date_time'].iloc[0]   # First dst time 2018
dst_end = df[df['dst_flag'] == True]['date_time'].iloc[-1]    # Last dst time 2018
print(dst_start)
print(dst_end)
this outputs:
2018-03-11 07:00:00-06:00
2018-11-04 06:00:00-07:00
which is likely correct. I didn't do the UTC conversions by hand or anything to check that the hours are exactly right for the given timezone. You can at least verify the dates are correct with a quick google search.
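The dates can also be cross-checked without a time zone database. Here is a small sketch, assuming the US rule in force since 2007 (DST starts on the second Sunday of March and ends on the first Sunday of November). The helpers nth_sunday and us_dst_dates are hypothetical names, and the result says nothing about the hour of the switch:

```python
from datetime import date, timedelta

def nth_sunday(year, month, n):
    """Date of the n-th Sunday (n = 1, 2, ...) of the given month."""
    first = date(year, month, 1)
    days_to_sunday = (6 - first.weekday()) % 7   # weekday(): Monday=0 ... Sunday=6
    return first + timedelta(days=days_to_sunday + 7 * (n - 1))

def us_dst_dates(year):
    """(start, end) dates of US daylight saving time, valid for 2007 onward."""
    return nth_sunday(year, 3, 2), nth_sunday(year, 11, 1)

print(us_dst_dates(2018))
```

For 2018 this gives 11 March and 4 November, the same two dates that appear in the output above.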
Some gotchas:
pd.date_range generates an index, not data. I changed your original code slightly to make it be data as opposed to the index. I assume you have the data already.
There's something goofy about how tz._utc_transition_times is structured. It's start/stop utc DST transition times, but there is some goofy stuff in the early dates. It should be good from 1965 onward though. If you are doing dates earlier than that change tz._utc_transition_times[1:] to tz._utc_transition_times. Note not all years before 1965 are present.
tz._utc_transition_times is "Python private". It is liable to change without warning or notice, and may or may not work for future or past versions of pytz. I'm using pytz version 2017.3. I recommend you run this code to make sure the output matches, and if not, make sure to use version 2017.3.
HTH, good luck with your research/regression problem!
If you are looking for a vectorized way of doing this (which you probably should be), you can use something like the code below.
The fundamental idea behind this is to find the difference between the current time in your timezone and the UTC time. In the winter months, the difference will be one extra hour behind UTC. Whatever the difference is, add what is needed to get to the 1 or 0 for the flag.
In Denver, summer months are UTC-6 and winter months are UTC-7. So, if you take the difference between the tz-aware time in Denver and UTC time, then add 7, you'll get a value of 1 for summer months and a value of 0 for winter months.
import pandas as pd
start = pd.to_datetime('2020-10-30')
end = pd.to_datetime('2020-11-02')
dates = pd.date_range(start=start, end=end, freq='h', tz='America/Denver')
df1 = pd.DataFrame({'dst_flag': 1, 'date1': dates.tz_localize(None)}, index=dates)
# add extra day on each end so that there are no nan's after the join
dates = pd.to_datetime(pd.date_range(start=start - pd.to_timedelta(1, 'd'), end=end + pd.to_timedelta(1, 'd'), freq='h'), utc=True)
df2 = pd.DataFrame({'date2': dates.tz_localize(None)}, index=dates)
out = df1.join(df2)
out['dst_flag'] = (out['date1'] - out['date2']) / pd.to_timedelta(1, unit='h') + 7
out.drop(columns=['date1', 'date2'], inplace=True)
Here is what I ended up doing, and it works for my purposes:
import pandas as pd
import pytz
# Create dates table and flag Daylight Saving Time dates
dates = pd.DataFrame(data=pd.date_range('2018-1-1', '2018-12-31-23', freq='h'), columns=['date_time'])
# Create a list of start and end dates for DST in each year, in UTC time
dst_changes_utc = pytz.timezone('America/Denver')._utc_transition_times[1:]
# Convert to local times from UTC times and then remove timezone information
dst_changes = [pd.Timestamp(i).tz_localize('UTC').tz_convert('America/Denver').tz_localize(None) for i in dst_changes_utc]
flag_list = []
for index, row in dates['date_time'].items():  # items() replaces iteritems(), which pandas 2.0 removed
    # Isolate the start and end dates for DST in each year
    dst_dates_in_year = [date for date in dst_changes if date.year == row.year]
    spring = dst_dates_in_year[0]
    fall = dst_dates_in_year[1]
    if spring <= row < fall:
        flag = 1
    else:
        flag = 0
    flag_list.append(flag)
print(flag_list)
dates['dst_flag'] = flag_list
del flag_list
The following vectorized way seems to work fine.
The idea behind it is the same as Nick Klavoht's: find the difference between the current time in your timezone and the UTC time.
# Localized dates dataframe
df = pd.DataFrame(data=pd.date_range('2018-1-1', '2019-1-1', freq='h', tz='America/Denver'), columns=['date_time'])
df['utc_offset'] = df['date_time'].dt.strftime('%z').str[0:3].astype(float)
df['utc_offset_shifted'] = df['utc_offset'].shift(-1)
df['dst'] = df['utc_offset'] - df['utc_offset_shifted']
df_dst = df[(df['dst'] != 0) & df['dst'].notna()]  # keep only rows where the offset changes; drop the NaN left by shift()
df_dst = df_dst.drop(['utc_offset', 'utc_offset_shifted'], axis=1).reset_index(drop=True)
print(df_dst)
This outputs :
date_time dst
0 2018-03-11 01:00:00-07:00 -1.0
1 2018-11-04 01:00:00-06:00 1.0
If you know what time zone you are dealing with you could use:
dates['dst_flag'] = dates['date_time'].apply(lambda x: x.tzname() == 'CEST')
This would flag all the hours in CET as False and those in CEST as True. I'm not sure I'd want to do that on a huge column.
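A variant that does not depend on knowing the zone abbreviation is to ask each timestamp for its DST offset. The sketch below uses only the standard library (zoneinfo, Python 3.9+, which needs a time zone database on the system), and dst_flag is a hypothetical helper. pandas Timestamp objects inherit the same dst() method from datetime, so the lambda above could call x.dst() instead of comparing tzname(), although that is still a row-by-row operation.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def dst_flag(moment):
    """1 if daylight saving time is in effect at the aware datetime, else 0."""
    # dst() returns a zero timedelta in winter and a non-zero one in summer
    return int(bool(moment.dst()))

denver = ZoneInfo("America/Denver")
for naive in (datetime(2018, 1, 15, 12), datetime(2018, 7, 15, 12)):
    local = naive.replace(tzinfo=denver)
    print(local.isoformat(), dst_flag(local))
```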
I'm working on a script that calculates my salary for each of my work days, since we don't get the schedule sent out electronically.
We earn extra within some time periods.
Every hour I earn 61.86 DKK, but within some time periods I earn extra money, as seen below.
(For simplicity I have written the times on the 24-hour clock, since that's what I am used to.)
Weekdays (18:00 - 06:00) 12.20 DKK
Saturday (15:00 - 24:00) 21.65 DKK
Sunday (whole day) 24.50 DKK
So far I have worked out how to calculate the extra money and the hourly rate fine. My problem is that if I have a shift that starts at 20:00 and ends the next day at 4:00, it gives me an error. I have an if statement that activates if the hour is 18 or later (which is when I get extra on weekdays); then I subtract 18 from the hour to get how many hours earn the extra rate.
if d2.isoweekday() <= 5:  # If it's a weekday
    if d2.hour >= 18:
        extra += ((d2.hour - 18) * weekdaySalary) + ((d2.minute / 60) * weekdaySalary)
How do I detect exactly how many hours fall within a specific period?
For example, if I have two dates
26-12-2014 17:00
27-12-2014 08:00
I need a way to see how many of those work hours are within the time period (18:00-06:00).
How can this be done?
It's like having 2 different date ranges:
1st - for when I get extra.
2nd - for when I actually work.
26-12-2014 17:00
18:00 - extra money period start here
|
|how do I get the time between these two points?
|
06:00 - extra money period ends here
27-12-2014 08:00
it could also be like this
26-12-2014 17:00
18:00 - extra money period start here
|
|how do I get the time between these two points?
|
27-12-2014 04:00
06:00 - extra money period ends here
Every answer is highly appreciated; I've spent so much time trying to figure this out with no real result.
Based on the two ranges you provided, presuming they are when your shift starts and ends, the following will calculate pay from start of shift to end, increasing by basic rate or basic rate plus extra pay based on the time of day:
from datetime import datetime, timedelta

def calculate_ot(t1, t2):
    t1 = datetime.strptime(t1, "%d-%m-%Y %H:%M")
    t2 = datetime.strptime(t2, "%d-%m-%Y %H:%M")
    days = ((t2 - t1).days * 86400)
    hours, rem = divmod((t2 - t1).seconds + days, 3600)
    start_ot = datetime.strptime("18:00 {}".format(t1.date()), "%H:%M %Y-%m-%d")
    end_ot = datetime.strptime("06:00 {}".format(t2.date()), "%H:%M %Y-%m-%d")
    total_pay = 0
    for x in range(hours):  # loop in total hour time difference
        # if we are within the range of extras pay increase by normal rate plus ot
        if start_ot <= t1 < end_ot and t1 < t2:
            total_pay += 62  # or 62.20 you have conflicting examples
        else:
            # else just add basic pay rate
            total_pay += 50  # was `total_pay == 50`, a comparison with no effect
        t1 += timedelta(hours=1)  # keep adding another hour
    return total_pay
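The underlying question, how many hours of a shift fall inside the 18:00-06:00 window, is an interval-overlap problem. Here is a minimal sketch, independent of the answer above: it walks over each calendar day the shift touches and intersects the shift with that night's window. The name night_hours is hypothetical, and the weekday check and the pay rates are deliberately left out.

```python
from datetime import datetime, timedelta

def night_hours(start, end, window_start=18, window_end=6):
    """Hours of the shift [start, end) inside the nightly 18:00-06:00 window."""
    total = timedelta()
    # Begin one day early so a shift starting after midnight still meets
    # the window that opened the previous evening.
    day = start.replace(hour=0, minute=0, second=0, microsecond=0) - timedelta(days=1)
    while day <= end:
        win_open = day + timedelta(hours=window_start)
        win_close = day + timedelta(days=1, hours=window_end)
        overlap = min(end, win_close) - max(start, win_open)
        if overlap > timedelta():       # negative or zero means no overlap
            total += overlap
        day += timedelta(days=1)
    return total.total_seconds() / 3600

fmt = "%d-%m-%Y %H:%M"
first = night_hours(datetime.strptime("26-12-2014 17:00", fmt),
                    datetime.strptime("27-12-2014 08:00", fmt))
second = night_hours(datetime.strptime("26-12-2014 17:00", fmt),
                     datetime.strptime("27-12-2014 04:00", fmt))
print(first, second)
```

For the two examples in the question this gives 12 and 10 hours. Multiplying the result by the extra rate, and the full shift length by the base rate, avoids looping hour by hour and also handles shifts that do not start on a full hour.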
Given a date range how to calculate the number of weekends partially or wholly within that range?
(A few definitions as requested:
take 'weekend' to mean Saturday and Sunday.
The date range is inclusive i.e. the end date is part of the range
'wholly or partially' means that any part of the weekend falling within the date range means the whole weekend is counted.)
To simplify I imagine you only actually need to know the duration and what day of the week the initial day is...
I darn well know it's going to involve doing integer division by 7 and some logic to add 1 depending on the remainder, but I can't quite work out what...
extra points for answers in Python ;-)
Edit
Here's my final code.
Weekends are Friday and Saturday (as we are counting nights stayed) and days are 0-indexed starting from Monday. I used onebyone's algorithm and Tom's code layout. Thanks a lot folks.
def calc_weekends(start_day, duration):
    days_until_weekend = [5, 4, 3, 2, 1, 1, 6]
    adjusted_duration = duration - days_until_weekend[start_day]
    if adjusted_duration < 0:
        weekends = 0
    else:
        weekends = (adjusted_duration // 7) + 1  # integer division, as in the Python 2 original
    if start_day == 5 and duration % 7 == 0:  # Saturday to Saturday is an exception
        weekends += 1
    return weekends

if __name__ == "__main__":
    days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
    for start_day in range(0, 7):
        for duration in range(1, 16):
            print("%s to %s (%s days): %s weekends" % (days[start_day], days[(start_day + duration) % 7], duration, calc_weekends(start_day, duration)))
        print()
General approach for this kind of thing:
For each day of the week, figure out how many days are required before a period starting on that day "contains a weekend". For instance, if "contains a weekend" means "contains both the Saturday and the Sunday", then we have the following table:
Sunday: 8
Monday: 7
Tuesday: 6
Wednesday: 5
Thursday: 4
Friday: 3
Saturday: 2
For "partially or wholly", we have:
Sunday: 1
Monday: 6
Tuesday: 5
Wednesday: 4
Thursday: 3
Friday: 2
Saturday: 1
Obviously this doesn't have to be coded as a table, now that it's obvious what it looks like.
Then, given the day-of-week of the start of your period, subtract[*] the magic value from the length of the period in days (probably end-start+1, to include both fenceposts). If the result is less than 0, it contains 0 weekends. If it is equal to or greater than 0, then it contains (at least) 1 weekend.
Then you have to deal with the remaining days. In the first case this is easy, one extra weekend per full 7 days. This is also true in the second case for every starting day except Sunday, which only requires 6 more days to include another weekend. So in the second case for periods starting on Sunday you could count 1 weekend at the start of the period, then subtract 1 from the length and recalculate from Monday.
More generally, what's happening here for "whole or part" weekends is that we're checking to see whether we start midway through the interesting bit (the "weekend"). If so, we can either:
1) Count one, move the start date to the end of the interesting bit, and recalculate.
2) Move the start date back to the beginning of the interesting bit, and recalculate.
In the case of weekends, there's only one special case which starts midway, so (1) looks good. But if you were getting the date as a date+time in seconds rather than day, or if you were interested in 5-day working weeks rather than 2-day weekends, then (2) might be simpler to understand.
[*] Unless you're using unsigned types, of course.
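The "partially or wholly" recipe above can be sketched as follows, taking Saturday and Sunday as the weekend and numbering days from Monday = 0 (an assumption; the answer does not fix a numbering). The names count_weekends and brute_force are hypothetical; the second function simply walks the days and serves as a cross-check.

```python
# Days needed before a period starting on each weekday touches a weekend.
# Index 0 is Monday; Saturday and Sunday both need just 1 day.
DAYS_NEEDED = [6, 5, 4, 3, 2, 1, 1]

def count_weekends(start_day, length):
    """Weekends partially or wholly inside `length` days starting on `start_day`."""
    if length <= 0:
        return 0
    if start_day == 6:
        # Special case: count the weekend in progress, then restart on Monday.
        return 1 + count_weekends(0, length - 1)
    remaining = length - DAYS_NEEDED[start_day]
    if remaining < 0:
        return 0
    return 1 + remaining // 7    # one more weekend per further 7 days

def brute_force(start_day, length):
    """Same count by walking every day and collecting the weeks whose weekend is hit."""
    weeks = set()
    for offset in range(length):
        day = start_day + offset
        if day % 7 >= 5:          # Saturday (5) or Sunday (6)
            weeks.add(day // 7)
    return len(weeks)

print(count_weekends(0, 6), count_weekends(6, 7), count_weekends(5, 8))
```

Comparing the two functions over every start day and a range of lengths is a cheap way to catch an off-by-one in the table.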
My general approach for this sort of thing: don't start messing around trying to reimplement your own date logic - it's hard, ie. you'll screw it up for the edge cases and look bad. Hint: if you have mod 7 arithmetic anywhere in your program, or are treating dates as integers anywhere in your program: you fail. If I saw the "accepted solution" anywhere in (or even near) my codebase, someone would need to start over. It beggars the imagination that anyone who considers themselves a programmer would vote that answer up.
Instead, use the built in date/time logic that comes with Python:
First, get a list of all of the days that you're interested in:
from datetime import date, timedelta
FRI = 5; SAT = 6  # isoweekday() numbering: Monday is 1, Sunday is 7
# a couple of random test dates
now = date.today()
start_date = now - timedelta(57)
end_date = now - timedelta(13)
print(start_date, '...', end_date)  # debug
days = [date.fromordinal(d) for d in
        range(start_date.toordinal(),
              end_date.toordinal() + 1)]
Next, filter down to just the days which are weekends. In your case you're interested in Friday and Saturday nights, which are 5 and 6 in isoweekday() numbering (weekday() counts from Monday = 0 and would give 4 and 5). (Notice how I'm not trying to roll this part into the previous list comprehension, since that'd be hard to verify as correct).
weekend_days = [d for d in days if d.isoweekday() in (FRI, SAT)]
for day in weekend_days:  # debug
    print(day, day.isoweekday())  # debug
Finally, you want to figure out how many weekends are in your list. This is the tricky part, but there are really only four cases to consider, one for each end for either Friday or Saturday. Concrete examples help make it clearer, plus this is really the sort of thing you want documented in your code:
num_weekends = len(weekend_days) // 2
# if we start on Friday and end on Saturday we're ok,
# otherwise add one weekend
#
# F,S|F,S|F,S ==3 and 3we, +0
# F,S|F,S|F ==2 but 3we, +1
# S|F,S|F,S ==2 but 3we, +1
# S|F,S|F ==2 but 3we, +1
ends = (weekend_days[0].isoweekday(), weekend_days[-1].isoweekday())
if ends != (FRI, SAT):
    num_weekends += 1
print(num_weekends)  # your answer
Shorter, clearer and easier to understand means that you can have more confidence in your code, and can get on with more interesting problems.
To count whole weekends, just adjust the number of days so that you start on a Monday, then divide by seven. (Note that if the start day is a weekday, add days to move to the previous Monday, and if it is on a weekend, subtract days to move to the next Monday since you already missed this weekend.)
days = {"Saturday": -2, "Sunday": -1, "Monday": 0, "Tuesday": 1, "Wednesday": 2, "Thursday": 3, "Friday": 4}

def n_full_weekends(n_days, start_day):
    n_days += days[start_day]
    if n_days <= 0:
        n_weekends = 0
    else:
        n_weekends = n_days // 7
    return n_weekends

if __name__ == "__main__":
    tests = [("Tuesday", 10, 1), ("Monday", 7, 1), ("Wednesday", 21, 3), ("Saturday", 1, 0), ("Friday", 1, 0),
             ("Friday", 3, 1), ("Wednesday", 3, 0), ("Sunday", 8, 1), ("Sunday", 21, 2)]
    for start_day, n_days, expected in tests:
        print(start_day, n_days, expected, n_full_weekends(n_days, start_day))
If you want to know partial weekends (or weeks), just look at the fractional part of the division by seven.
You would need external logic besides raw math. You need a calendar library (or, if you have a decent amount of time, your own implementation) to define what a weekend is, what day of the week you start on, end on, etc.
Take a look at Python's calendar module.
Without a logical definition of days in your code, a purely mathematical method would fail on corner cases, like an interval of 1 day or, I believe, anything shorter than a full week (or shorter than 6 days if you allow partials).
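As a sketch of what leaning on a date library can look like (here datetime rather than calendar), the weekends touched by an inclusive date range can be counted by collecting the ISO weeks of every Saturday or Sunday in it. An ISO week runs Monday to Sunday, so both days of a weekend share the same (year, week) pair. The name weekends_in_range is hypothetical.

```python
from datetime import date, timedelta

def weekends_in_range(start, end):
    """Number of Sat/Sun weekends partially or wholly within [start, end]."""
    seen = set()
    day = start
    while day <= end:
        if day.weekday() >= 5:                  # Saturday is 5, Sunday is 6
            iso_year, iso_week, _ = day.isocalendar()
            seen.add((iso_year, iso_week))
        day += timedelta(days=1)
    return len(seen)

print(weekends_in_range(date(2021, 2, 2), date(2021, 2, 7)))    # Tuesday to Sunday
print(weekends_in_range(date(2021, 2, 7), date(2021, 2, 13)))   # Sunday to Saturday
print(weekends_in_range(date(2021, 2, 8), date(2021, 2, 12)))   # Monday to Friday
```

This is linear in the length of the range, so it is better suited as a reference implementation for testing a closed-form formula than as a replacement for one over very long ranges.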