I have a df with a datetime column. I need to test some function by increasing the data one day at a time. My datetime range goes from the last days of September to the first few days of October. My question is: how do I split the days and "add" more data at each iteration? I'm after something like:
for d in range(0927, 1010):
    fn(df["20100927":d])
So at the beginning, it's only one day of data, after the second iteration it will be two days, etc.
Another way to think about this: I have
t = pd.date_range(start='20100923', end='20101006')
how do I slice df like
for d in t:
    df["20100923":d]
EDIT: I've been able to make it work, but it's not pretty... isn't it possible to just iterate over the date_range object? My solution:
D = ['2010-09-27', '2010-09-28']  # and so on
for d in D:
    fn(df[D[0]:d])
What about:
start_index = df[df['datetime'] == '20100927'].index[0]
days_to_test = 30
for offset in range(1, days_to_test + 1):  # range() was missing: an int is not iterable
    fn(df.iloc[start_index:start_index + offset])  # assumes one row per day
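To address the EDIT directly: a date_range is itself iterable, so a sketch like the following should also work (assuming df has a sorted DatetimeIndex; fn is the function under test from the question):
import pandas as pd

for d in pd.date_range(start='20100927', end='20101010'):
    fn(df.loc['20100927':d])  # one more day of data on each iteration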
I have the following daily dataframe:
import numpy as np
import pandas as pd

daily_index = pd.date_range(start='1/1/2015', end='1/01/2018', freq='D')
random_values = np.random.randint(1, 3, size=(len(daily_index), 1))
daily_df = pd.DataFrame(random_values, index=daily_index, columns=['A']).replace(1, np.nan)
I want to map each daily value onto a dataframe where each day is expanded into multiple 1-minute intervals. The final DF looks like this:
intraday_index = pd.date_range(start='1/1/2015', end='1/01/2018', freq='1min')
intraday_df_full = daily_df.reindex(intraday_index)
# Choose random indices.
drop_indices = np.random.choice(intraday_df_full.index, 5000, replace=False)
intraday_df = intraday_df_full.drop(drop_indices)
In the final dataframe, each day is broken into 1 min intervals, but some are missing (so the minute count on each day is not the same). Some days have a value in the beginning of the day, but nan for the rest.
My question is, only for the days which start with some value in the first minute, how do I front fill for the rest of the day?
I initially tried to simply do the following: daily_df.reindex(intraday_index, method='ffill', limit=1440), but since some rows are missing, this cannot work. Maybe there is a way to limit by time?
Following @Datanovice's comments, this line achieves the desired result:
intraday_df.groupby(intraday_df.index.date).transform('ffill')
where my groupby defines the desired groups on which we want to apply the operation and transform does this without modifying the DataFrame's index.
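For illustration, a tiny self-contained example of that line (toy data of my own: two days of minute bars, where the first day starts with a value and the second day starts empty):
import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(['2015-01-01 00:00', '2015-01-01 00:01', '2015-01-01 00:02',
                        '2015-01-02 00:00', '2015-01-02 00:01'])
intraday_df = pd.DataFrame({'A': [2.0, np.nan, np.nan, np.nan, np.nan]}, index=idx)
# The 2015-01-01 rows fill to 2.0; 2015-01-02 stays NaN, since that day
# has no leading value to fill forward from.
print(intraday_df.groupby(intraday_df.index.date).transform('ffill'))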
I would like to generate a random list of length n based on the dates of, say, September.
So, you have your list like this:
september = ["01/09/2019","02/09/2019",...,"30/09/2019"]
And I would like to generate a list that contains, say, 1000 elements taken randomly from september like this:
dates = ["02/09/2019","02/09/2019","07/09/2019",...,"23/09/2019"]
I could use something like:
dates = np.random.choice(september,1000)
But the catch is that I want dates to be selected based on the probabilities of the days of the week. So for example, I have a dictionary like this:
days = {"Monday":0.1,"Tuesday":0.4,"Wednesday":0.1,"Thursday":0.05,"Friday":0.05,"Saturday":0.2,"Sunday":0.1}
So as "01/09/2019" was a Sunday, I would like to choose this date from september with probability 0.1.
My attempt was to create a list whose first element is the probability of the first date in september and after 7 days this probability repeats and so on, like this:
p1 = [0.1,0.1,0.4,0.1,0.05,0.05,0.2,0.1,0.1,0.4,0.1,0.05,0.05,...]
Obviously this doesn't add to 1, so I would do the following:
p2 = [x/sum(p1) for x in p1]
And then:
dates = np.random.choice(september,1000,p=p2)
However, I am not sure this really works... Can you help me?
Actually I think your approach is fine. But instead of using the dates, first get a list of dates grouped by weekdays:
import numpy as np
import datetime
from collections import defaultdict
days = {"Monday":0.1,"Tuesday":0.4,"Wednesday":0.1,"Thursday":0.05,"Friday":0.05,"Saturday":0.2,"Sunday":0.1}
date_list = [(datetime.datetime(2019, 9, 1) + datetime.timedelta(days=x)) for x in range(30)]
d = defaultdict(list)
for i in date_list:
    d[i.strftime("%A")].append(i)
Now pass this to np.random.choice:
np.random.seed(500)
result = np.random.choice(np.array(list(d.values()), dtype=object),  # dtype=object: the weekday lists have unequal lengths
                          p=[days.get(i) for i in list(d.keys())],
                          size=1000)
You now have a list of lists of weighted datetime objects. Just do another random.choice for the items inside:
final = [np.random.choice(i) for i in result]
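As a quick sanity check of the weighting (my addition, not part of the original answer), counting the weekdays of the drawn dates should roughly reproduce the probabilities, e.g. about 40% Tuesdays:
from collections import Counter

print(Counter(d.strftime('%A') for d in final))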
If I understand correctly, you want to select dates from the days of September, where the probability of selecting each date is proportional to the number of times that the weekday of that date appears in September - and what you need is a way to assign the proper probabilities.
I'll show how to assign the probabilities using pandas (just because it's convenient to me).
First, create the array of relevant dates as a pd.DatetimeIndex (pd.date_range returns one), so the elements of the array (an Index in this case) are pd.Timestamp objects:
import pandas as pd
days_of_september = pd.date_range(start='2019/09/01', end='2019/09/30', freq='1D')
To each date, we assign its weekday (from 0 to 6) using the .weekday() method (this is why a Timestamp or a datetime is convenient here):
days_and_weekdays_of_september = pd.DataFrame(
[(day, day.weekday()) for day in days_of_september], columns=('date', 'weekday'))
Count how many times each weekday appears in the month:
weekday_counts = days_and_weekdays_of_september['weekday'].value_counts()
(No big surprise here - all the values are either 4 or 5.)
Assign a probability relative to that count:
probability = days_and_weekdays_of_september.apply(lambda date: weekday_counts[date['weekday']], axis=1)
probability = probability/probability.sum()
And then, with pandas, you can select based on those probabilities (called "weights" here):
days_and_weekdays_of_september['date'].sample(n=1000, weights=probability, replace=True)
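If you need strings in the question's DD/MM/YYYY format rather than Timestamps, a possible follow-up (my sketch):
sample = days_and_weekdays_of_september['date'].sample(n=1000, weights=probability, replace=True)
dates = sample.dt.strftime('%d/%m/%Y').tolist()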
I am working on a dataset that has some 26 million rows and 13 columns, including two datetime columns, arr_date and dep_date. I am trying to create a new boolean column to check if there are any US holidays between these dates.
I am using apply on the entire dataframe, but the execution time is too slow. The code has been running for more than 48 hours now on Google Cloud Platform (24GB RAM, 4 cores). Is there a faster way to do this?
The dataset looks like this: [sample data screenshot]
The code I am using is -
import pandas as pd
import numpy as np
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
df = pd.read_pickle('dataGT70.pkl')
cal = calendar()
def mark_holiday(df):
    # Assign the result; the original snippet discarded the apply output.
    df['includes_holiday'] = df.apply(  # column name is illustrative
        lambda x: len(cal.holidays(start=x['dep_date'], end=x['arr_date'])) > 0
        and x['num_days'] < 20, axis=1)
    return df
df = mark_holiday(df)
This took me about two minutes to run on a sample dataframe of 30m rows with two columns, start_date and end_date.
The idea is to get a sorted list of all holidays occurring on or after the minimum start date, and then to use bisect_left from the bisect module to determine the next holiday occurring on or after each start date. This holiday is then compared to the end date. If it is less than or equal to the end date, then there must be at least one holiday in the date range between the start and end dates (both inclusive).
from bisect import bisect_left
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
# Create sample dataframe of 10k rows with an interval of 1-19 days.
np.random.seed(0)
n = 10000 # Sample size, e.g. 10k rows.
years = np.random.randint(2010, 2019, n)
months = np.random.randint(1, 13, n)
days = np.random.randint(1, 29, n)
df = pd.DataFrame({'start_date': [pd.Timestamp(*x) for x in zip(years, months, days)],
'interval': np.random.randint(1, 20, n)})
df['end_date'] = df['start_date'] + pd.TimedeltaIndex(df['interval'], unit='d')
df = df.drop('interval', axis=1)
# Get a sorted list of holidays since the first start date.
hols = calendar().holidays(df['start_date'].min())
# Determine if there is a holiday between the start and end dates (both inclusive).
df['holiday_in_range'] = df['end_date'].ge(
df['start_date'].apply(lambda x: bisect_left(hols, x)).map(lambda x: hols[x]))
>>> df.head(6)
start_date end_date holiday_in_range
0 2015-07-14 2015-07-31 False
1 2010-12-18 2010-12-30 True # 2010-12-24
2 2013-04-06 2013-04-16 False
3 2013-09-12 2013-09-24 False
4 2017-10-28 2017-10-31 False
5 2013-12-14 2013-12-29 True # 2013-12-25
So, for a given start_date timestamp (e.g. 2013-12-14), bisect_left(hols, pd.Timestamp('2013-12-14')) would yield 39, and hols[39] is 2013-12-25, the next holiday falling on or after the 2013-12-14 start date. The next holiday is calculated as df['start_date'].apply(lambda x: bisect_left(hols, x)).map(lambda x: hols[x]). This holiday is then compared to the end_date, and holiday_in_range is thus True if the end_date is greater than or equal to this holiday value; otherwise the holiday must fall after this end_date.
Have you already considered using pandas.merge_asof for this?
I could imagine that map and apply with lambda functions cannot be executed that efficiently.
UPDATE: Ah sorry, I just read that you only need a boolean for whether there are any holidays in between, which makes it much easier. If that's enough, you just need to perform steps 1-5, then group the DataFrame that results from step 5 by start/end date and use count as the aggregate function to get the number of holidays in each range. You can join this result to your original dataset similar to step 8 described below, then fill the remaining values with fillna(0) and do something like joined_df['includes_holiday'] = joined_df['joined_count_column'] > 0. After that, you can delete the joined_count_column from your DataFrame again, if you like. A sketch of this boolean variant follows after the steps.
If you use pandas.merge_asof you could work through these steps (steps 6 and 7 are only necessary if you also need all the holidays between start and end in your result DataFrame, not just the booleans):
1. Load your holiday records into a DataFrame and index it on the date. The holidays should be one date per line (storing ranges, like 24th-26th for Christmas, in one row would make it much more complex).
2. Create a copy of your dataframe with just the start and end date columns. UPDATE: every start/end date pair should occur only once in it, e.g. by using groupby.
3. Use merge_asof with a reasonable tolerance value (if you join on the start of the period, use direction='forward'; if you use the end date, use direction='backward'). Note that merge_asof always behaves like a left join, so drop the rows where no holiday was matched afterwards.
4. As a result you have a merged DataFrame with your start and end columns plus the date column from your holiday dataframe. You only keep records for which a holiday was found within the given tolerance, but you can merge this data back with your original DataFrame later. You will probably now have duplicates of your original records.
5. Then check the joined holiday for each record by comparing it with the start and end columns, and remove the holidays that are not in between.
6. Sort the dataframe you obtained from step 5 (use something like df.sort_values(['start', 'end', 'holiday'], inplace=True)). Now insert a number column that numbers the holidays within each period from 1 to n (starting from 1 for each period). This is necessary for using unstack in the next step to get the holidays into columns.
7. Add an index on your dataframe based on the period start date, the period end date and the count column you inserted in step 6, then use df.unstack(level=-1) on the DataFrame you prepared in steps 1-6. What you now have is a condensed DataFrame with your original periods and the holidays arranged columnwise.
8. Now you only have to merge this DataFrame back onto your original data using original_df.merge(df_from_step7, left_on=['start', 'end'], right_index=True, how='left').
The result of this is your original data with the date ranges, where for each date range the holidays that lie within the period are each stored in a separate column behind the data. Loosely speaking, the numbering in step 6 assigns the holidays to the columns and has the effect that the holidays are always assigned from left to right to the columns (you wouldn't have a holiday in column 3 if column 1 is empty).
Step 6 is probably also a bit tricky, but you can do it, for example, by adding a series filled with a range, and then fixing it so the numbering starts at 0 or 1 in each group by using shift, or by grouping by start/end with aggregate({'idcol': 'min'}) and joining the result back to subtract it from the value assigned by the range sequence.
In all, I think it sounds more complicated than it is, and it should perform quite efficiently. Especially if your periods are not that large, because then after step 5 your result set should be much smaller than your original dataframe; but even if that is not the case, it should still be quite efficient, since it can use compiled code.
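For what it's worth, here is a minimal sketch of the boolean variant (steps 2, 3, 5 and 8 above), using the question's dep_date/arr_date column names; the holiday date range and the includes_holiday column name are my assumptions:
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Step 1: one holiday per row (the 2000-2030 range is an assumption).
hols = pd.DataFrame({'holiday': USFederalHolidayCalendar().holidays('2000-01-01', '2030-12-31')})

# Step 2: unique start/end pairs, sorted on the join key as merge_asof requires.
pairs = df[['dep_date', 'arr_date']].drop_duplicates().sort_values('dep_date')

# Step 3: for each period start, find the first holiday on or after it.
merged = pd.merge_asof(pairs, hols, left_on='dep_date', right_on='holiday', direction='forward')

# Step 5 (boolean variant): a holiday lies in the range iff it is <= the end date.
merged['includes_holiday'] = merged['holiday'] <= merged['arr_date']

# Step 8: join the flag back onto the original DataFrame.
df = df.merge(merged[['dep_date', 'arr_date', 'includes_holiday']], on=['dep_date', 'arr_date'], how='left')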
I want to do something like the following:
df['Day'] = df['Day'].apply(lambda x: x + myDict[df['Month']]),
where
myDict={2:3,4:1,6:1,9:1,11:1,1:0,3:0,5:0,7:0,8:0,10:0,12:0}.
What I'm doing is adding a number of days onto the day of the month if it's a certain month. Ex: If it's February and the day of the month is 28, I add 3 to get 31.
But this does not work because I really want to apply myDict to the indexes of df['Month'], not the Month column directly.
Can I do iterrows inline for my command? I think this would perform faster through pandas than a big for loop iterating through the whole dataframe.
Try:
df.Day += df.Month.map(myDict)
Or, because I don't really get what you are doing, maybe:
df.Day += df.index.to_series().map(myDict)
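For illustration, a minimal runnable example of the map-based version (the sample data is made up):
import pandas as pd

myDict = {2: 3, 4: 1, 6: 1, 9: 1, 11: 1, 1: 0, 3: 0, 5: 0, 7: 0, 8: 0, 10: 0, 12: 0}
df = pd.DataFrame({'Month': [2, 4, 7], 'Day': [28, 30, 15]})
df.Day += df.Month.map(myDict)
# df.Day is now [31, 31, 15]: +3 for February, +1 for April, +0 for July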
Currently I'm generating a DateTimeIndex using a certain function, zipline.utils.tradingcalendar.get_trading_days. The time series is roughly daily but with some gaps.
My goal is to get the last date in the DateTimeIndex for each month.
.to_period('M') and .to_timestamp('M') don't work, since they give the last day of the month rather than the last value of the variable in each month.
As an example, if this is my time series I would want to select '2015-05-29' while the last day of the month is '2015-05-31'.
['2015-05-18', '2015-05-19', '2015-05-20', '2015-05-21',
'2015-05-22', '2015-05-26', '2015-05-27', '2015-05-28',
'2015-05-29', '2015-06-01']
Condla's answer came closest to what I needed, except that since my time index stretched over more than a year, I needed to group by both month and year and then select the maximum date. Below is the code I ended up with.
# tempTradeDays is the initial DatetimeIndex
dateRange = []
tempYear = None
dictYears = tempTradeDays.groupby(tempTradeDays.year)
for yr in dictYears.keys():
    tempYear = pd.DatetimeIndex(dictYears[yr]).groupby(pd.DatetimeIndex(dictYears[yr]).month)
    for m in tempYear.keys():
        dateRange.append(max(tempYear[m]))
dateRange = pd.DatetimeIndex(dateRange).sort_values()  # .order() was removed in later pandas
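For reference, a more compact equivalent of the loop above (my own sketch, grouping by monthly periods instead of nested year/month dictionaries):
s = pd.Series(tempTradeDays)
# The maximum timestamp within each (year, month) period is the last trading day:
last_per_month = s.groupby(s.dt.to_period('M')).max()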
Suppose your dataframe looks like this: [original dataframe screenshot]
Then the following code will give you the last day of each month:
df_monthly = df.reset_index().groupby([df.index.year, df.index.month], as_index=False).last().set_index('index')
[transformed dataframe screenshot]
This one-liner does the job :)
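Since the screenshots don't copy over, here is a reproducible sketch of the same idea (toy data of my own):
import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(['2015-05-28', '2015-05-29', '2015-06-01', '2015-06-02'])
df = pd.DataFrame({'A': np.arange(4)}, index=idx)
df_monthly = df.reset_index().groupby([df.index.year, df.index.month], as_index=False).last().set_index('index')
# df_monthly keeps 2015-05-29 and 2015-06-02, the last row present in each month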
My strategy would be to group by month and then select the "maximum" of each group:
If "dt" is your DatetimeIndex object:
last_dates_of_the_month = []
dt_month_group_dict = dt.groupby(dt.month)
for month in dt_month_group_dict:
    last_date = max(dt_month_group_dict[month])
    last_dates_of_the_month.append(last_date)
The list "last_dates_of_the_month" contains all occurring last dates of each month in your dataset. You can use this list to create a DatetimeIndex in pandas again (or whatever else you want to do with it).
This is an old question, but none of the existing answers here is perfect. This is the solution I came up with (assuming that the date is a sorted index), which can even be written in one line, but I split it for readability:
month1 = pd.Series(apple.index.month)
month2 = pd.Series(apple.index.month).shift(-1)
mask = (month1 != month2)
apple[mask.values].head(10)
A few notes here:
Shifting a datetime series requires another pd.Series instance (see here)
Boolean mask indexing requires .values (see here)
By the way, when the dates are business days, it'd be easier to use resampling: apple.resample('BM').last()
Maybe the answer is not needed anymore, but while searching for an answer to the same question, I found what may be a simpler solution:
import pandas as pd
sample_dates = pd.date_range(start='2010-01-01', periods=100, freq='B')
month_end_dates = sample_dates[sample_dates.is_month_end]
Try this to create a new diff column, where the value 1 marks the change from one month to the next:
df['diff'] = np.where(df['Date'].dt.month.diff() != 0, 1, 0)
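For illustration, one way to turn that flag into the actual last-row-of-each-month selection (my addition, assuming df is sorted by 'Date'):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2015-05-28', '2015-05-29', '2015-06-01', '2015-06-02'])})
df['diff'] = np.where(df['Date'].dt.month.diff() != 0, 1, 0)
# A row is the last of its month when the *next* row starts a new month:
last_of_month = df[df['diff'].shift(-1, fill_value=1) == 1]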