This is my current code
from datetime import datetime
from dateutil.relativedelta import relativedelta
from dateutil.rrule import rrule, MONTHLY

class TimeSeries:
    def year(self):
        today = datetime.now()
        start_date = today + relativedelta(years=-1)
        mint, maxt = datetime.min.time(), datetime.max.time()
        for st in rrule(MONTHLY, count=24, bymonthday=(1, -1), dtstart=start_date):
            yield datetime.combine(st, mint)
And this is the output:
for y in TimeSeries().year():
    print(y)
2013-01-31 00:00:00
2013-02-01 00:00:00
2013-02-28 00:00:00
2013-03-01 00:00:00
2013-03-31 00:00:00
2013-04-01 00:00:00
2013-04-30 00:00:00
2013-05-01 00:00:00
2013-05-31 00:00:00
2013-06-01 00:00:00
2013-06-30 00:00:00
2013-07-01 00:00:00
2013-07-31 00:00:00
2013-08-01 00:00:00
2013-08-31 00:00:00
2013-09-01 00:00:00
2013-09-30 00:00:00
2013-10-01 00:00:00
2013-10-31 00:00:00
2013-11-01 00:00:00
2013-11-30 00:00:00
2013-12-01 00:00:00
2013-12-31 00:00:00
2014-01-01 00:00:00
The question is: how can I force the count to start from 2013-01-01 00:00:00, with each month ending like 2013-01-31 23:59:59, and so on?
And the loop should end on 2014-01-31 23:59:59 instead of 2014-01-01 00:00:00.
I would also like the start date and end date on one line:
2013-03-01 00:00:00 2013-03-31 23:59:59
2013-04-01 00:00:00 2013-04-30 23:59:59
...
...
2014-01-01 00:00:00 2014-01-31 23:59:59
Any suggestion?
First, are you really sure that you want 2013-03-31 23:59:59? Date intervals are traditionally specified as half-open intervals, just like ranges in Python, because 23:59:59 is not actually the end of a day.
Most obviously, 23:59:59.001 is later than that but on the same day. Python datetime objects include microseconds, so this isn't just a "meh, whatever" problem: if you, e.g., call now(), you can get a time that's incorrectly later than your "end of the day" on the same day.
Less obviously, on a day with a leap second, 23:59:60 is also later but on the same day.
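For instance, a half-open test (start <= t < next_start) covers every instant of the month, while a closed test ending at 23:59:59 silently misses the last second's microseconds. A minimal sketch:

```python
from datetime import datetime

month_start = datetime(2013, 3, 1)
next_month_start = datetime(2013, 4, 1)

# An instant in the last second of March, after 23:59:59.000000
t = datetime(2013, 3, 31, 23, 59, 59, 500000)

# Half-open interval: catches every instant in March
in_march_half_open = month_start <= t < next_month_start

# Closed interval ending at 23:59:59: misses this instant
in_march_closed = month_start <= t <= datetime(2013, 3, 31, 23, 59, 59)
```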
But if you really want this, there are two obvious ways to get it:
You're already iterating dates instead of datetimes and combining in the times manually. And it's obvious whether you're dealing with day 1 vs. day -1, because the date's day member will either be 1 or it won't. So:
from datetime import datetime
from dateutil.relativedelta import relativedelta
from dateutil.rrule import rrule, MONTHLY

class TimeSeries:
    def year(self):
        today = datetime.now()
        start_date = today + relativedelta(years=-1)
        mint, maxt = datetime.min.time(), datetime.max.time()
        for st in rrule(MONTHLY, count=24, bymonthday=(1, -1), dtstart=start_date):
            # first of the month gets 00:00:00; last day gets 23:59:59.999999
            yield datetime.combine(st, mint if st.day == 1 else maxt)
Alternatively, instead of iterating both first and last days, just iterate first days, and subtract a second to get the last second of the previous month:
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
from dateutil.rrule import rrule, MONTHLY

class TimeSeries:
    def year(self):
        today = datetime.now()
        start_date = today + relativedelta(years=-1)
        mint = datetime.min.time()
        for st in rrule(MONTHLY, count=24, bymonthday=(1,), dtstart=start_date):
            dt = datetime.combine(st, mint)
            yield dt - timedelta(seconds=1)  # last second of the previous month
            yield dt
As far as printing these in pairs… well, as written, that's an underspecified problem. The first value in your list is the second value in a pair—except when you run this on the 1st of a month. And likewise, the last date is the first value in a pair, except when you run this on the 31st. So, what do you want to do with them?
If this isn't obvious, look at your example. Your first value is 2013-01-31 00:00:00, but your first pair doesn't start with 2013-01-31.
There are many things you could want here:
Start with the first of the month a year ago, rather than the first first-or-last of the month that happened within the last year. And likewise for the end. So you would have 2013-01-01 in your list, and there would always be pairs.
Start with the first month that started within the last year, and likewise for the end. So you wouldn't get 2013-01-31 in your list, and there would always be pairs.
Use your current rule, and when there's not a pair, use None for the missing value.
etc.
Whatever rule you actually want can be coded up pretty easily. And then you'll probably want to yield in (start, end) tuples, so the print loop can just do this:
for start, end in TimeSeries().year():
    print(start, end)
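As a concrete illustration of the first interpretation (always anchor on the first of the month a year ago, so every month yields a complete pair), something like this would work; month_pairs is an illustrative name, not from the question:

```python
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
from dateutil.rrule import rrule, MONTHLY

def month_pairs(today=None):
    # Anchor on the first of the month one year ago, so the series
    # always starts and ends with a full (start, end) pair.
    today = today or datetime.now()
    start = (today + relativedelta(years=-1)).replace(
        day=1, hour=0, minute=0, second=0, microsecond=0)
    for first in rrule(MONTHLY, count=13, dtstart=start):
        last = first + relativedelta(months=1) - timedelta(seconds=1)
        yield first, last

for start, end in month_pairs(datetime(2014, 1, 15)):
    print(start, end)
```

Run in mid-January 2014, this yields 2013-01-01 00:00:00 / 2013-01-31 23:59:59 through 2014-01-01 00:00:00 / 2014-01-31 23:59:59, matching the desired output in the question.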
Related
This is rather confusing to ask because I understand there are date ranges that create dates between two dates (which isn't what I want).
Instead, what I'm looking for is literally a start to an end. I plan on making a data frame with the index being in the format of YYYY-MM-DD - YYYY-MM-DD.
So, the first row would have an index of 2000-01-01 - 2001-01-01, the second row would have an index of 2000-01-02 - 2001-01-02, the third row would have an index of 2000-01-03 - 2001-01-03, etc. Is there any way to do this (or something similar)? I was considering making a class that paired two datetime.date objects together, but I'm hesitant as I'm sure there's a more fleshed-out manner of doing this.
Apologies if I didn't understand your question properly.
from datetime import datetime, timedelta

def datetime_range(dt_1=None, dt_2=None, span=None):
    for i in range(span + 1):
        yield dt_1 + timedelta(days=i), dt_2 + timedelta(days=i)

dt_1, dt_2 = datetime(2014, 1, 1), datetime(2015, 1, 1)
for date_1, date_2 in datetime_range(dt_1, dt_2, 10):
    print(date_1, date_2)  # .date() if only the date part is needed
Output:
2014-01-01 00:00:00 2015-01-01 00:00:00
2014-01-02 00:00:00 2015-01-02 00:00:00
2014-01-03 00:00:00 2015-01-03 00:00:00
2014-01-04 00:00:00 2015-01-04 00:00:00
2014-01-05 00:00:00 2015-01-05 00:00:00
2014-01-06 00:00:00 2015-01-06 00:00:00
...
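To get the "YYYY-MM-DD - YYYY-MM-DD" index the question describes, the pairs can be joined into strings and used as a dataframe index. A sketch building on the generator above (the value column is a placeholder):

```python
from datetime import datetime, timedelta
import pandas as pd

def datetime_range(dt_1, dt_2, span):
    for i in range(span + 1):
        yield dt_1 + timedelta(days=i), dt_2 + timedelta(days=i)

pairs = datetime_range(datetime(2000, 1, 1), datetime(2001, 1, 1), 2)
index = [f"{a:%Y-%m-%d} - {b:%Y-%m-%d}" for a, b in pairs]
df = pd.DataFrame({'value': [0.0] * len(index)}, index=index)
print(df.index.tolist())
```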
I have the following column
Time
2:00
00:13
1:00
00:24
in object format (strings). This time refers to hours and minutes ago relative to a time that I need to use as a start: 8:00 (it might change; in this example it is 8:00).
Since the times in the column Time are referring to hours/minutes ago, what I would like to expect should be
Time
6:00
07:47
7:00
07:36
calculated as time difference (e.g. 8:00 - 2:00).
However, I am having difficulty doing this calculation and transforming the result into a datetime (keeping only hours and minutes).
I hope you can help me.
Since the Time column contains only hours and minutes, I suggest using timedelta instead of datetime:
df['Time'] = pd.to_timedelta(df.Time+':00')
df['Start_Time'] = pd.to_timedelta('8:00:00') - df['Time']
Output:
Time Start_Time
0 02:00:00 06:00:00
1 00:13:00 07:47:00
2 01:00:00 07:00:00
3 00:24:00 07:36:00
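If only hours and minutes are wanted as text (the question shows 07:47 rather than a full timedelta), the timedelta components accessor can format the result back into an HH:MM string. A sketch assuming the same frame:

```python
import pandas as pd

df = pd.DataFrame({'Time': ['2:00', '00:13', '1:00', '00:24']})
df['Time'] = pd.to_timedelta(df['Time'] + ':00')
start = pd.to_timedelta('8:00:00') - df['Time']

# .dt.components exposes the hours/minutes of each timedelta as columns
df['Start_Time'] = start.dt.components.apply(
    lambda r: f"{r.hours:02d}:{r.minutes:02d}", axis=1)
```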
You can do it using pd.to_datetime:
ref = pd.to_datetime('08:00')  # the reference hour
s = ref - pd.to_datetime(df['Time'])
print(s)
0 06:00:00
1 07:47:00
2 07:00:00
3 07:36:00
Name: Time, dtype: timedelta64[ns]
This returns a Series, which can be changed to a DataFrame with s.to_frame(), for example.
My dataset looks like this:
time Open
2017-01-01 00:00:00 1.219690
2017-01-01 01:00:00 1.688490
2017-01-01 02:00:00 1.015285
2017-01-01 03:00:00 1.357672
2017-01-01 04:00:00 1.293786
2017-01-01 05:00:00 1.040048
2017-01-01 06:00:00 1.225080
2017-01-01 07:00:00 1.145402
...., ....
2017-12-31 23:00:00 1.145402
I want to find the sum between the time-range specified and save it to new dataframe.
let's say,
I want to find the sum between 2017-01-01 22:00:00 and 2017-01-02 04:00:00, i.e. the sum of the 6 hours spanning two days. I want to sum the data in a time range such as 10 PM to 4 AM the next day and put it in a different dataframe, for example df_timerange_sum. Note that the range crosses two different dates.
What did I do?
I used sum() on a time-range selection like this: df[~df['time'].dt.hour.between(10, 4)].sum(), but it gives me the sum of the whole dataframe, not of the time range I specified.
I also tried resample, but I could not find a way to make it work for a specific time range.
df['time'].dt.hour.between(10, 4) is always False because no number is larger than 10 and smaller than 4 at the same time. What you want is to mark between(4,21) and then negate that to get the other hours.
Here's what I would do:
# mark those between 4AM and 10PM
# data we want is where s==False, i.e. ~s
s = df['time'].dt.hour.between(4, 21)
# use s.cumsum() marks the consecutive False block
# on which we will take sum
blocks = s.cumsum()
# again we only care for ~s
(df[~s].groupby(blocks[~s], as_index=False)   # we don't need the blocks as index
       .agg({'time': 'min',                   # time: the beginning of each block
             'Open': 'sum'}))                 # Open: the sum over the block
Output for random data:
time Open
0 2017-01-01 00:00:00 1.282701
1 2017-01-01 22:00:00 2.766324
2 2017-01-02 22:00:00 2.838216
3 2017-01-03 22:00:00 4.151461
4 2017-01-04 22:00:00 2.151626
5 2017-01-05 22:00:00 2.525190
6 2017-01-06 22:00:00 0.798234
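The snippet above assumes df already exists; here is a self-contained version on synthetic hourly data, with Open fixed at 1.0 so the block sums are easy to check by hand (real values would of course differ):

```python
import pandas as pd

# Hourly timestamps over three days; Open = 1.0 so each sum equals a row count
idx = pd.date_range('2017-01-01', '2017-01-03 23:00', freq='h')
df = pd.DataFrame({'time': idx, 'Open': 1.0})

s = df['time'].dt.hour.between(4, 21)   # daytime hours 4:00-21:00
blocks = s.cumsum()                     # constant within each night block
out = (df[~s].groupby(blocks[~s], as_index=False)
             .agg({'time': 'min', 'Open': 'sum'}))
print(out)
```

The first block (midnight to 4 AM on day one) has 4 rows; each full 10 PM to 4 AM block has 6; the trailing block (10-11 PM on the last day) has 2.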
An alternative (in my opinion more straightforward) approach that accomplishes the same thing. There are definitely ways to reduce the code, but I am also relatively new to pandas.
df.set_index(['time'], inplace=True)  # make time the index (not strictly necessary)
df2 = pd.DataFrame(columns=['start_time', 'end_time', 'sum_Open'])  # stores the output plus start/end times if you need them
df2['start_time'] = df[df.index.hour == 22].index  # all start datetimes
df2['end_time'] = df[df.index.hour == 4].index     # all end datetimes
for i, row in df2.iterrows():
    # set_value was removed in pandas 1.0; .at is the modern equivalent
    df2.at[i, 'sum_Open'] = df[(df.index >= row['start_time']) &
                               (df.index <= row['end_time'])]['Open'].sum()
You'd have to add an if statement or similar to handle the last day, which ends at 11 PM.
I have a dataframe which contains prices for a security each minute over a long period of time.
I would like to extract a subset of the prices, 1 per day between certain hours.
Here is an example of brute-forcing it (using hourly for brevity):
dates = pandas.date_range('20180101', '20180103', freq='H')
prices = pandas.DataFrame(index=dates,
                          data=numpy.random.rand(len(dates)),
                          columns=['price'])
I now have a DateTimeIndex for the hours within each day I want to extract:
start = datetime.datetime(2018,1,1,8)
end = datetime.datetime(2018,1,1,17)
day1 = pandas.date_range(start, end, freq='H')
start = datetime.datetime(2018,1,2,9)
end = datetime.datetime(2018,1,2,13)
day2 = pandas.date_range(start, end, freq='H')
days = [ day1, day2 ]
I can then use prices.index.isin with each of my DateTimeIndexes to extract the relevant day's prices:
daily_prices = [ prices[prices.index.isin(d)] for d in days]
This works as expected:
daily_prices[0]
daily_prices[1]
The problem is that as the length of each selection DateTimeIndex increases, and the number of days I want to extract increases, my list-comprehension slows down to a crawl.
Since I know each selection DateTimeIndex is fully inclusive of the hours it encompasses, I tried using loc and the first and last element of each index in my list comprehension:
daily_prices = [ prices.loc[d[0]:d[-1]] for d in days]
Whilst a bit faster, it is still exceptionally slow when the number of days is very large.
Is there a more efficient way to divide up a dataframe into begin and end time ranges like above?
If the hours are consistent from day to day as it seems like they might be, you can just filter the index, which should be pretty fast:
In [5]: prices.loc[prices.index.hour.isin(range(8,18))]
Out[5]:
price
2018-01-01 08:00:00 0.638051
2018-01-01 09:00:00 0.059258
2018-01-01 10:00:00 0.869144
2018-01-01 11:00:00 0.443970
2018-01-01 12:00:00 0.725146
2018-01-01 13:00:00 0.309600
2018-01-01 14:00:00 0.520718
2018-01-01 15:00:00 0.976284
2018-01-01 16:00:00 0.973313
2018-01-01 17:00:00 0.158488
2018-01-02 08:00:00 0.053680
2018-01-02 09:00:00 0.280477
2018-01-02 10:00:00 0.802826
2018-01-02 11:00:00 0.379837
2018-01-02 12:00:00 0.247583
....
EDIT: To your comment, working directly on the index and then doing a single lookup at the end will still probably be fastest even if it's not always consistent from day to day. Single day frames at the end will be easy with a groupby.
For example:
df = prices.loc[[i for i in prices.index if (i.hour in range(8, 18) and i.day in range(1,10)) or (i.hour in range(2,4) and i.day in range(11,32))]]
framelist = [frame for _, frame in df.groupby(df.index.date)]
will give you a list of dataframes with 1 day per list element, and will include 8:00-17:00 for the first 10 days each month and 2:00-3:00 for days 11-31.
I have a dataframe that has entries like this, where the times are in UTC:
start_date_time timezone
1 2017-01-01 14:00:00 America/Los_Angeles
2 2017-01-01 14:00:00 America/Denver
3 2017-01-01 14:00:00 America/Phoenix
4 2017-01-01 14:30:00 America/Los_Angeles
5 2017-01-01 14:30:00 America/Los_Angeles
I need to be able to group by date (local date, not UTC date) and I need to be able to create indicators for whether the event happened between certain times (local times, not UTC times).
I have successfully done the above in R by:
Creating a time variable in each of the timezones
Converting those to strings
Pulling each of the string date/time variables into one column, which one I pull depends on the appropriate timezone
Then, splitting that column to get a string date column and a string time column
I can then convert everything back to datetime objects for comparisons. e.g. now I can say if something happened between 2 and 3pm and it will correctly identify everything that happened between 2 and 3pm locally.
I have tried a bunch in python and have the dates as
2017-01-02 04:30:00-08:00
but I can't figure out how to go from there to
2017-01-01 20:30:00
Thanks!
Your example is incorrect. Your timezone is eight hours behind UTC, which means you need to add eight hours: 4:30 AM local is 12:30 PM UTC.
The datetime method astimezone(...) will do the conversion for you. For ease of use, I recommend pytz.
However in pure python:
import datetime as dt
local_tz = dt.timezone(dt.timedelta(hours=-8))
utc = dt.timezone.utc
d = dt.datetime(2017, 1, 2, 4, 30, 0, 0, local_tz)
print(d, d.astimezone(utc))
Will print:
2017-01-02 04:30:00-08:00 2017-01-02 12:30:00+00:00
Here's an example using pytz to lookup time zones:
import datetime as dt
import pytz
dates = [("2017-01-01 14:00:00", "America/Los_Angeles"),
         ("2017-01-01 14:00:00", "America/Denver"),
         ("2017-01-01 14:00:00", "America/Phoenix"),
         ("2017-01-01 14:30:00", "America/Los_Angeles"),
         ]
for d, tz_str in dates:
    start = dt.datetime.strptime(d, "%Y-%m-%d %H:%M:%S")
    start = start.replace(tzinfo=pytz.utc)
    local_tz = pytz.timezone(tz_str)  # convert to desired timezone
    print(start, local_tz.zone, "\t", start.astimezone(local_tz))
This produces:
2017-01-01 14:00:00+00:00 America/Los_Angeles 2017-01-01 06:00:00-08:00
2017-01-01 14:00:00+00:00 America/Denver 2017-01-01 07:00:00-07:00
2017-01-01 14:00:00+00:00 America/Phoenix 2017-01-01 07:00:00-07:00
2017-01-01 14:30:00+00:00 America/Los_Angeles 2017-01-01 06:30:00-08:00
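Since the original data is in a dataframe, the same conversion can also be done with pandas directly. A sketch using the column names from the question (the row-wise apply is illustrative, not the fastest option):

```python
import pandas as pd

df = pd.DataFrame({
    'start_date_time': pd.to_datetime(['2017-01-01 14:00:00',
                                       '2017-01-01 14:00:00',
                                       '2017-01-01 14:30:00']),
    'timezone': ['America/Los_Angeles', 'America/Denver',
                 'America/Los_Angeles'],
})

# Localize each UTC timestamp, convert to its row's zone, then drop the
# tzinfo so the local wall-clock times can be grouped by date together
df['local'] = pd.to_datetime(df.apply(
    lambda r: (r['start_date_time'].tz_localize('UTC')
               .tz_convert(r['timezone']).tz_localize(None)),
    axis=1))
df['local_date'] = df['local'].dt.date
```

With the naive local column in place, grouping by local date or testing "between 2 and 3 PM locally" becomes ordinary datetime comparison.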