Moving x rows up in a dataframe indexed on dates - python

I have a dataframe that has Date as its index. The dataframe has stock market related data so the dates are not continuous. If I want to move lets say 120 rows up in the dataframe, how do I do that. For example:
If I want to get the data starting from 120 trading days before the start of yr 2018, how do I do that below:
df['2018-01-01':'2019-12-31']
Thanks

Try this:
df[df.columns[df.columns.get_loc('2018-01-01'):df.columns.get_loc('2019-12-31')]]
Get location of both Columns in the column array and index them to get the desired.
UPDATE :
Based on your requirement, make some little modifications of above.
Yearly Indexing
>>> df[df.columns[(df.columns.get_loc('2018')).start:(df.columns.get_loc('2019')).stop]]
Above df.columns.get_loc('2018') will give out numpy slice object and will give us index of first element of 2018 (which we index using .start attribute of slice) and similarly we do for index of last element of 2019.
Monthly Indexing
Now consider you want data for First 6 months of 2018 (without knowing what is the first day), the same can be done using:
>>> df[df.columns[(df.columns.get_loc('2018-01')).start:(df.columns.get_loc('2018-06')).stop]]
As you can see above we have indexed the first 6 months of 2018 using the same logic.

Assuming you are using pandas and the dataframe is sorted by dates, a very simple way would be:
initial_date = '2018-01-01'
initial_date_index = df.loc[df['dates']==initial_date].index[0]
offset=120
start_index = initial_date_index-offset
new_df = df.loc[start_index:]

Related

pandas create a column to compare a value with one week ago

I have a pandas dataframe and the index column is time with hourly precision. I want to create a new column that compares the value of the column "Sales number" at each hour with the same exact time one week ago.
I know that it can be written in using shift function:
df['compare'] = df['Sales'] - df['Sales'].shift(7*24)
But I wonder how can I take advantage of the date_time format of the index. I mean, is there any alternatives to using shift(7*24) when the index is in date_time format?
Try something with
df['Sales'].shift(7,freq='D')

Find if there is any holidays between two dates in a large dataset?

I am working on a dataset that has some 26 million rows and 13 columns including two datetime columns arr_date and dep_date. I am trying to create a new boolean column to check if there is any US holidays between these dates.
I am using apply function to the entire dataframe but the execution time is too slow. The code has been running for more than 48 hours now on Goolge Cloud Platform (24GB ram, 4 core). Is there a faster way to do this?
The dataset looks like this:
Sample data
The code I am using is -
import pandas as pd
import numpy as np
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
df = pd.read_pickle('dataGT70.pkl')
cal = calendar()
def mark_holiday(df):
df.apply(lambda x: True if (len(cal.holidays(start=x['dep_date'], end=x['arr_date']))>0 and x['num_days']<20) else False, axis=1)
return df
df = mark_holiday(df)
This took me about two minutes to run on a sample dataframe of 30m rows with two columns, start_date and end_date.
The idea is to get a sorted list of all holidays occurring on or after the minimum start date, and then to use bisect_left from the bisect module to determine the next holiday occurring on or after each start date. This holiday is then compared to the end date. If it is less than or equal to the end date, then there must be at least one holiday in the date range between the start and end dates (both inclusive).
from bisect import bisect_left
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
# Create sample dataframe of 10k rows with an interval of 1-19 days.
np.random.seed(0)
n = 10000 # Sample size, e.g. 10k rows.
years = np.random.randint(2010, 2019, n)
months = np.random.randint(1, 13, n)
days = np.random.randint(1, 29, n)
df = pd.DataFrame({'start_date': [pd.Timestamp(*x) for x in zip(years, months, days)],
'interval': np.random.randint(1, 20, n)})
df['end_date'] = df['start_date'] + pd.TimedeltaIndex(df['interval'], unit='d')
df = df.drop('interval', axis=1)
# Get a sorted list of holidays since the fist start date.
hols = calendar().holidays(df['start_date'].min())
# Determine if there is a holiday between the start and end dates (both inclusive).
df['holiday_in_range'] = df['end_date'].ge(
df['start_date'].apply(lambda x: bisect_left(hols, x)).map(lambda x: hols[x]))
>>> df.head(6)
start_date end_date holiday_in_range
0 2015-07-14 2015-07-31 False
1 2010-12-18 2010-12-30 True # 2010-12-24
2 2013-04-06 2013-04-16 False
3 2013-09-12 2013-09-24 False
4 2017-10-28 2017-10-31 False
5 2013-12-14 2013-12-29 True # 2013-12-25
So, for a given start_date timestamp (e.g. 2013-12-14), bisect_right(hols, '2013-12-14') would yield 39, and hols[39] results in 2013-12-25, the next holiday falling on or after the 2013-12-14 start date. The next holiday calculated as df['start_date'].apply(lambda x: bisect_left(hols, x)).map(lambda x: hols[x]). This holiday is then compared to the end_date, and holiday_in_range is thus True if the end_date is greater than or equal to this holiday value, otherwise the holiday must fall after this end_date.
Have you already considered using pandas.merge_asof for this?
I could imagine that map and apply with lambda functions cannot be executed that efficiently.
UPDATE: Ah sorry, I just read, that you only need a boolean if there are any holidays inbetween, this makes it much easier. If thats enough you just need to perform steps 1-5 then group the DataFrame that is the result of step5 by start/end date and use count as the aggregate function to have the number of holidays in the ranges. This result you can join to your original dataset similar to step 8 described below. Then fill the rest of the values with fillna(0). Do something like joined_df['includes_holiday']= joined_df['joined_count_column']>0. After that, you can delete the joined_count_column again from your DataFrame, if you like.
If you use pandas_merge_asof you could work through these steps (step 6 and 7 are only necessary if you need to have all the holidays inbetween start and end in your result DataFrame as well, not just the booleans):
Load your holiday records in a DataFrame and index it on the date. The holidays should be one date per line (storing ranges like for christmas from 24th-26th in one row, would make it much more complex).
Create a copy of your dataframe with just the start, end date columns. UPDATE: every start, end date should only occur once in it. E.g. by using groupby.
Use merge_asof with a reasonable tolerance value (if you join over the start of the period, use direction='forward', if you use the end date, use direction='backward' and how='inner'.
As a result you have a merged DataFrame with your start, end columns and the date column from your holiday dataframe. You get only records, for which a holiday was found with the given tolerance, but later you can merge this data back with your original DataFrame. You will probably now have duplicates of your original records.
Then check the joined holiday for your records with indexers by comparing them with the start and end column and remove the holidays, which are not inbetween.
Sort the dataframe you obtained form step 5 (use something like df.sort_values(['start', 'end', 'holiday'], inplace=True). Now you should insert a number column that numbers the holidays between your periods (the ones you obtained after step 5) form 1 to ... (for each period starting from 1). This is necesary to use unstack in the next step to get the holidays in columns.
Add an index on your dataframe based on period start date, period end date and the count column you inserted in step 6. Use df.unstack(level=-1) on the DataFrame you prepared in steps 1-7. What you now have, is a condensed DataFrame with your original periods with the holidays arranged columnwise.
Now you only have to merge this DataFrame back to your original data using original_df.merge(df_from_step7, left_on=['start', 'end'], right_index=True, how='left')
The result of this is a file with your original data containing the date ranges and for each date range the holidays that lie inbetween the period are stored in a separte columns each behind the data. Loosely speaking the numbering in step 6 assigns the holidays to the columns and has the effect, that the holidays are always assigned from right to left to the columns (you wouldn't have a holiday in column 3 if column 1 is empty).
Step 6. is probably also a bit tricky, but you can do that for example by adding a series filled with a range and then fixing it, so the numbering starts by 0 or 1 in each group by using shift or grouping by start, end with aggregate({'idcol':'min') and joining the result back to subtract it from the value assigned by the range-sequence.
In all, I think it sounds more complicated, than it is and it should be performed quite efficient. Especially if your periods are not that large, because then after step 5, your result set should be much smaller than your original dataframe, but even if that is not the case, it should still be quite efficient, since it can use compiled code.

How to get the gap between years using python pandas

I am having a data frame like this
Series_id F start end data
3 A 2012 2018 [[2012,0],[2014,0],[2015,1],[2017,3],[2019,0],[2020.1]]
I need Output like this
{series_id:3,start:2013,end:2013,count:1},{series_id:3,start:2016,end:2016,count:1}{series_id:3,start:2018,end:2018,count:1},
I have to iterate over the data frame of each row and find the gap between year in the data column The data column is a list of a list which has the year and some data in each list
I have to check with consecutive and missing years and count.
count example: take series id:3 start=2014 end=2018 on iterating the 2013,2016 and 2018 are missed but in between, 2017 is there so for that I have to get like this
{series_id:3,start:2013,end:2013,count:1},
{series_id:3,start:2016,end:2016,count:1},{series_id:3,start:2018,end:2018,count:1}
How can I achieve this even I not able to iterate it?
Please Need a help
If your question is how to convert a query operation performed on a pandas dataframe into a list. There is a method called as to_json() on pandas.
Find the link below to see examples and implementation.
enter link description here
Use orient = records
Then you can use json.loads() to convert to a list.
Now you can iterate over it.

Select hourly data based on days

I have a time series hourly_df, containing some hourly data:
import pandas as pd
import numpy as np
hourly_index = pd.date_range(start='2018-01-01', end='2018-01-07', freq='H')
hourly_data = np.random.rand(hourly_index.shape[0])
hourly_df = pd.DataFrame(hourly_data, index=hourly_index)
and I have a DatetimeIndex, containing some dates (as days as I wish), e.g.
daily_index = pd.to_datetime(['2018-01-01', '2018-01-05', '2018-01-06'])
I want to select each row of hourly_df, which date of its index is in daily_index, so in my case all hourly data from 1st, 5th and 6th January. What is the best way to do this?
If I naively use hourly_df.loc[daily_index], I only get the rows at 0:00:00 for each of the three days. What I want is the hourly data for the whole day for each of the days in daily_index.
One possibility to solve this, is to create a filter that takes the date of each element in the index of hourly_df and compares whether of not this date is in daily_index.
day_filter = [hour.date() in daily_index.date for hour in hourly_df.index]
hourly_df[day_filter]
This produces the desired output, but it seems the usage of the filter is avoidable and can be done in an expression similar to hourly_df.loc[daily_index.date].
save the daily_index as a dataframe
merge on index using hourly_df.merge(daily_index, how = 'inner', ...)

Get last date in each month of a time series pandas

Currently I'm generating a DateTimeIndex using a certain function, zipline.utils.tradingcalendar.get_trading_days. The time series is roughly daily but with some gaps.
My goal is to get the last date in the DateTimeIndex for each month.
.to_period('M') & .to_timestamp('M') don't work since they give the last day of the month rather than the last value of the variable in each month.
As an example, if this is my time series I would want to select '2015-05-29' while the last day of the month is '2015-05-31'.
['2015-05-18', '2015-05-19', '2015-05-20', '2015-05-21',
'2015-05-22', '2015-05-26', '2015-05-27', '2015-05-28',
'2015-05-29', '2015-06-01']
Condla's answer came closest to what I needed except that since my time index stretched for more than a year I needed to groupby by both month and year and then select the maximum date. Below is the code I ended up with.
# tempTradeDays is the initial DatetimeIndex
dateRange = []
tempYear = None
dictYears = tempTradeDays.groupby(tempTradeDays.year)
for yr in dictYears.keys():
tempYear = pd.DatetimeIndex(dictYears[yr]).groupby(pd.DatetimeIndex(dictYears[yr]).month)
for m in tempYear.keys():
dateRange.append(max(tempYear[m]))
dateRange = pd.DatetimeIndex(dateRange).order()
Suppose your data frame looks like this
original dataframe
Then the following Code will give you the last day of each month.
df_monthly = df.reset_index().groupby([df.index.year,df.index.month],as_index=False).last().set_index('index')
transformed_dataframe
This one line code does its job :)
My strategy would be to group by month and then select the "maximum" of each group:
If "dt" is your DatetimeIndex object:
last_dates_of_the_month = []
dt_month_group_dict = dt.groupby(dt.month)
for month in dt_month_group_dict:
last_date = max(dt_month_group_dict[month])
last_dates_of_the_month.append(last_date)
The list "last_date_of_the_month" contains all occuring last dates of each month in your dataset. You can use this list to create a DatetimeIndex in pandas again (or whatever you want to do with it).
This is an old question, but all existing answers here aren't perfect. This is the solution I came up with (assuming that date is a sorted index), which can be even written in one line, but I split it for readability:
month1 = pd.Series(apple.index.month)
month2 = pd.Series(apple.index.month).shift(-1)
mask = (month1 != month2)
apple[mask.values].head(10)
Few notes here:
Shifting a datetime series requires another pd.Series instance (see here)
Boolean mask indexing requires .values (see here)
By the way, when the dates are the business days, it'd be easier to use resampling: apple.resample('BM')
Maybe the answer is not needed anymore, but while searching for an answer to the same question I found maybe a simpler solution:
import pandas as pd
sample_dates = pd.date_range(start='2010-01-01', periods=100, freq='B')
month_end_dates = sample_dates[sample_dates.is_month_end]
Try this, to create a new diff column where the value 1 points to the change from one month to the next.
df['diff'] = np.where(df['Date'].dt.month.diff() != 0,1,0)

Categories

Resources