Currently I'm generating a DateTimeIndex using a certain function, zipline.utils.tradingcalendar.get_trading_days. The time series is roughly daily but with some gaps.
My goal is to get the last date in the DateTimeIndex for each month.
.to_period('M') and .to_timestamp('M') don't work, since they give the last calendar day of the month rather than the last date actually present in the index for that month.
As an example, if this is my time series I would want to select '2015-05-29' while the last day of the month is '2015-05-31'.
['2015-05-18', '2015-05-19', '2015-05-20', '2015-05-21',
'2015-05-22', '2015-05-26', '2015-05-27', '2015-05-28',
'2015-05-29', '2015-06-01']
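For concreteness, the example can be reconstructed like this (a minimal sketch; dates merely stands in for the index that get_trading_days would return):

import pandas as pd

dates = pd.DatetimeIndex(['2015-05-18', '2015-05-19', '2015-05-20', '2015-05-21',
                          '2015-05-22', '2015-05-26', '2015-05-27', '2015-05-28',
                          '2015-05-29', '2015-06-01'])
# Desired: '2015-05-29' for May, '2015-06-01' for June.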
Condla's answer came closest to what I needed, except that since my time index stretched over more than a year I needed to group by both year and month and then select the maximum date. Below is the code I ended up with.
# tempTradeDays is the initial DatetimeIndex
import pandas as pd

dateRange = []
tempYear = None
dictYears = tempTradeDays.groupby(tempTradeDays.year)
for yr in dictYears.keys():
    tempYear = pd.DatetimeIndex(dictYears[yr]).groupby(pd.DatetimeIndex(dictYears[yr]).month)
    for m in tempYear.keys():
        dateRange.append(max(tempYear[m]))
dateRange = pd.DatetimeIndex(dateRange).sort_values()  # .order() was removed in modern pandas
Suppose your data frame looks like this:
[image: original dataframe]
Then the following code will give you the last day of each month:
df_monthly = df.reset_index().groupby([df.index.year, df.index.month], as_index=False).last().set_index('index')
[image: transformed dataframe]
This one-line code does its job :)
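As a hedged illustration (the sample frame below is made up, since the original screenshots aren't reproduced here; it assumes a reasonably recent pandas):

import numpy as np
import pandas as pd

# Made-up sample: a business-day index with one value column.
idx = pd.date_range('2015-05-18', '2015-06-01', freq='B')
df = pd.DataFrame({'value': np.arange(len(idx))}, index=idx)

df_monthly = df.reset_index().groupby([df.index.year, df.index.month], as_index=False).last().set_index('index')
print(df_monthly)  # one row per (year, month): the last row of each month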
My strategy would be to group by month and then select the "maximum" of each group:
If "dt" is your DatetimeIndex object:
last_dates_of_the_month = []
dt_month_group_dict = dt.groupby(dt.month)
for month in dt_month_group_dict:
    last_date = max(dt_month_group_dict[month])
    last_dates_of_the_month.append(last_date)
The list "last_date_of_the_month" contains all occuring last dates of each month in your dataset. You can use this list to create a DatetimeIndex in pandas again (or whatever you want to do with it).
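For instance, rebuilding an index from the collected dates (last_dates_index is a name introduced here):

import pandas as pd

# Rebuild a sorted DatetimeIndex from the per-month maxima.
last_dates_index = pd.DatetimeIndex(sorted(last_dates_of_the_month))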
This is an old question, but none of the existing answers is perfect. This is the solution I came up with (assuming that the date is a sorted index); it can even be written in one line, but I split it for readability:
month1 = pd.Series(apple.index.month)
month2 = pd.Series(apple.index.month).shift(-1)
mask = (month1 != month2)
apple[mask.values].head(10)
A few notes here:
Shifting a datetime series requires another pd.Series instance (see here)
Boolean mask indexing requires .values (see here)
By the way, when the dates are business days, it'd be easier to use resampling: apple.resample('BM')
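In recent pandas versions resample returns a resampler object rather than the resampled values, so an aggregation is chained on; a sketch, assuming apple holds business-day data (very recent pandas spells this alias 'BME' instead of 'BM'):

# 'BM' anchors on business month end; .last() keeps each month's final observed row.
monthly_last = apple.resample('BM').last()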
Maybe the answer is not needed anymore, but while searching for an answer to the same question I found what might be a simpler solution:
import pandas as pd
sample_dates = pd.date_range(start='2010-01-01', periods=100, freq='B')
month_end_dates = sample_dates[sample_dates.is_month_end]
Try this to create a new diff column, where the value 1 marks each row on which the month changes (the first row is flagged too, since its diff is NaN):
import numpy as np

df['diff'] = np.where(df['Date'].dt.month.diff() != 0, 1, 0)
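If the goal is instead to flag the last row of each month, a shifted version of the same comparison can be used (a sketch, assuming df['Date'] is sorted chronologically):

# A row is the last of its month when the next row belongs to a different month;
# the final row has no successor (shifted diff is NaN) and is kept as well.
is_month_last = df['Date'].dt.month.diff().shift(-1).ne(0)
month_last_rows = df[is_month_last]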
I have a pandas dataframe and the index column is time with hourly precision. I want to create a new column that compares the value of the column "Sales number" at each hour with the same exact time one week ago.
I know that it can be written using the shift function:
df['compare'] = df['Sales'] - df['Sales'].shift(7*24)
But I wonder how I can take advantage of the date_time format of the index. I mean, are there any alternatives to using shift(7*24) when the index is in date_time format?
Try something like:
df['Sales'].shift(7, freq='D')
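Putting it together for the compare column from the question (a sketch; because freq='D' shifts the index itself, the subtraction aligns on timestamps, and hours with no counterpart seven days earlier come out as NaN):

df['compare'] = df['Sales'] - df['Sales'].shift(7, freq='D')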
I have the following daily dataframe:
import numpy as np
import pandas as pd

daily_index = pd.date_range(start='1/1/2015', end='1/01/2018', freq='D')
random_values = np.random.randint(1, 3, size=(len(daily_index), 1))
daily_df = pd.DataFrame(random_values, index=daily_index, columns=['A']).replace(1, np.nan)
I want to map each value to a dataframe where each day is expanded to multiple 1 minute intervals. The final DF looks like so:
intraday_index = pd.date_range(start='1/1/2015', end='1/01/2018', freq='1min')
intraday_df_full = daily_df.reindex(intraday_index)
# Choose random indices.
drop_indices = np.random.choice(intraday_df_full.index, 5000, replace=False)
intraday_df = intraday_df_full.drop(drop_indices)
In the final dataframe, each day is broken into 1 min intervals, but some are missing (so the minute count on each day is not the same). Some days have a value in the beginning of the day, but nan for the rest.
My question is, only for the days which start with some value in the first minute, how do I front fill for the rest of the day?
I initially tried to simply do the following: daily_df.reindex(intraday_index, method='ffill', limit=1440). But since some rows are missing, this cannot work. Maybe there is a way to limit by time?
Following @Datanovice's comments, this line achieves the desired result:
intraday_df.groupby(intraday_df.index.date).transform('ffill')
where my groupby defines the desired groups on which we want to apply the operation and transform does this without modifying the DataFrame's index.
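A minimal usage sketch (intraday_filled is a name introduced here; transform returns a new object, so assign the result to keep it):

# Forward-fill within each calendar day only; fills never cross midnight.
intraday_filled = intraday_df.groupby(intraday_df.index.date).transform('ffill')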
I can specify the date range of month ends using
import datetime
import pandas as pd

monthend_range = pd.date_range(datetime.date(2017, 12, 10), datetime.date(2018, 2, 2), freq='BM')
Is there a straightforward way to include the middle of the month into the range above to form a middle-and-end-month index? Let's say that the logic we want is to use the successive month ends in the code above and find the business day that is right in the middle between the monthends. If that is not a business day, then try the following day and the following until we get a business day.
The expected output is
['2017-12-29', '2018-01-16', '2018-01-31']
This might seem a bit inconsistent, as 2017-12-15 is a middle of the month that falls within the date range. But the procedure is: get the month ends, then interpolate between successive ends. Unless of course there is a better approach to dealing with this question.
The idea is to create a business-day range for each month end except the first, select the value in the middle, and finally join everything together with Index.union:
a = []
for x in monthend_range[1:]:
    r = pd.date_range(x.to_period('m').to_timestamp(), x, freq='B')
    a.append(r[len(r)//2])

print(a)
[Timestamp('2018-01-16 00:00:00', freq='B')]

out = monthend_range.union(a)
print(out)
DatetimeIndex(['2017-12-29', '2018-01-16', '2018-01-31'], dtype='datetime64[ns]', freq=None)
I have a long time series, eg.
import pandas as pd
index=pd.date_range(start='2012-11-05', end='2012-11-10', freq='1S').tz_localize('Europe/Berlin')
df=pd.DataFrame(range(len(index)), index=index, columns=['Number'])
Now I want to extract all sub-DataFrames for each day, to get the following output:
df_2012-11-05: data frame with all data referring to day 2012-11-05
df_2012-11-06: etc.
df_2012-11-07
df_2012-11-08
df_2012-11-09
df_2012-11-10
What is the most efficient way to do this, avoiding checks like index.date == given_date, which are very slow? Also, the user does not know a priori the range of days in the frame.
Any hint on how to do this with an iterator?
My current solution is this, but it is not so elegant and has two issues defined below:
import numpy as np
import pandas as pd

time_zone = 'Europe/Berlin'

# find all days
a = np.unique(df.index.date)  # this can take a lot of time
a.sort()

results = []
for i in range(len(a) - 1):
    day_now = pd.Timestamp(a[i]).tz_localize(time_zone)
    day_next = pd.Timestamp(a[i+1]).tz_localize(time_zone)
    results.append(df[day_now:day_next])  # how to select if I do not want day_next included?

# last day
results.append(df[day_next:])
This approach has the following problems:
a=np.unique(df.index.date) can take a lot of time
df[day_now:day_next] includes the day_next, but I need to exclude it in the range
If you want to group by date (AKA: year+month+day), then use df.index.date:
result = [group[1] for group in df.groupby(df.index.date)]
Note that df.index.day would use the day of the month (i.e. 1 to 31) for grouping, which could result in undesirable behavior if the input dataframe's dates extend over multiple months.
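If, as in the question, one sub-frame per day is wanted under its own name, a dict keyed by date is a natural container (a sketch; frames is a hypothetical name):

# Map each calendar day to its sub-DataFrame.
frames = {day: sub_df for day, sub_df in df.groupby(df.index.date)}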
Perhaps groupby?
DFList = []
for group in df.groupby(df.index.day):
    DFList.append(group[1])
Should give you a list of data frames where each data frame is one day of data.
Or in one line:
DFList = [group[1] for group in df.groupby(df.index.day)]
Gotta love python!
I'm opening a CSV file with two columns and about 10,000 rows. The first column has a unique date and time stamp (ascending in 30-minute intervals, called 'date_time') and the second column has an integer, 'intnum'. I use the date_time column as my index and then use conditions to sum only the integers that fall into specific date ranges. All of the conditions work perfectly, EXCEPT the last condition is based on matching those dates with the USFederalHolidayCalendar.
Here's the rub: the indexed date is more complex (e.g. '2015-02-16 12:30:00.00000') than the holiday list date (e.g. '2015-02-16', President's Day). So when I run an 'isin' function against the holiday list, it doesn't find all of the integers associated with the whole day, because '2015-02-16 12:30:00.00000' is not equal to '2015-02-16', despite being the same day.
Code snippet:
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar, get_calendar

newcal = get_calendar('USFederalHolidayCalendar')
holidays = newcal.holidays(start='2010-01-01', end='2016-12-31')

filename = "/Users/Me/Desktop/test.csv"
int_array = pd.read_csv(filename, header=0, parse_dates=['date_time'], index_col='date_time')

intnum_total = int(int_array['intnum'][(int_array.index.month >= 2) &
                                       (int_array.index.month <= 3) &
                                       (int_array.index.hour >= 12) &
                                       (int_array.index.isin(holidays) == True)].sum())
print(intnum_total)
Now, I get no errors, so the syntax and functions work "properly", but I know for a fact the holiday match is not working.
Any thoughts?
Thanks ahead of time - this is my first post, so hopefully the formatting and question is clear.
Here are some thoughts...
Say you have a list of holidays for 2016:
from pandas.tseries.holiday import USFederalHolidayCalendar

cal = USFederalHolidayCalendar()
holidays = cal.holidays(start='2016-01-01', end='2016-12-31')
print(holidays.size)
Which yields:
10
So there are 10 holidays in 2016 based on USFederalHolidayCalendar.
You also have your DateTimeIndex, which, let's say is covering 2015 and 2016:
idx = pd.DatetimeIndex(pd.date_range(start='2015-1-1',
                                     end='2016-12-31', freq='30min'))
print(idx.size)
Which shows:
35041
Now if I wanted to see how many holiday timestamps are in my 30-minute based idx, I would take the date part of the DatetimeIndex and compare it to the date part of the holidays:
idx[pd.DatetimeIndex(idx.date).isin(holidays.date)].size
Which would give me:
480
Which is 10 holidays * 24 hours * 2 half-hour intervals per hour.
Does that sound correct?
Note that when you do index.isin(other_index) you get back a boolean array, which is sufficient for indexing; you don't need the extra comparison index.isin(other_index) == True.
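Applied to the snippet from the question, the same idea can also be written with normalize(), which zeroes out the time component of each timestamp (a sketch of one possible fix, not the only one):

# Day-level timestamps compare cleanly against the midnight-anchored holiday list.
is_holiday = int_array.index.normalize().isin(holidays)
intnum_total = int(int_array['intnum'][(int_array.index.month >= 2) &
                                       (int_array.index.month <= 3) &
                                       (int_array.index.hour >= 12) &
                                       is_holiday].sum())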
Can't you just access the date from your timestamp and see if it is in your list of federal holidays? I don't know why you need your second integer index column; I would think a boolean value should suffice (e.g. fed_holiday).
df = pd.DataFrame(pd.date_range(start='2016-1-1', end='2016-12-31', freq='30min', name='ts'))
df['fed_holiday'] = [ts.date() in holidays for ts in df.ts]
>>> df.fed_holiday.sum() / (24 * 2.)
10.0