I have a file with intraday prices every ten minutes, giving 42 times per day (indexed 0 to 41), so each date is repeated 42 times. The multi-index below should "collapse" the repeated dates into one for all times.
There are 62,035 rows x 3 columns: [date, time, price].
I would like to write a function to get the difference of the ten-minute prices, restricting differences to each unique date.
In other words, 09:30 is the first time of each day and 16:20 is the last, and a difference must never span two days (no 16:20 - 09:30 difference across days). The differences should start at 09:40 - 09:30 and end at 16:20 - 16:10 for each unique date in the dataframe.
Here is my attempt. Any suggestions would be greatly appreciated.
def diffSeries(rounded,data):
    '''This function accepts a column called rounded from 'data'
    The 2nd input 'data' is a dataframe
    '''
    df=rounded.shift(1)
    idf=data.set_index(['date', 'time'])
    data['diff']=['000']
    for i in range(0,length(rounded)):
        for day in idf.index.levels[0]:
            for time in idf.index.levels[1]:
                if idf.index.levels[1]!=1620:
                    data['diff']=rounded[i]-df[i]
                else:
                    day+=1
                    time+=2
    data[['date','time','price','II','diff']].to_csv('final.csv')
    return data['diff']
Then I call:
data=read_csv('file.csv')
rounded=roundSeries(data['price'],5)
diffSeries(rounded,data)
The traceback ends with an AssertionError.
You can use groupby and then apply to achieve what you want:
diffs = data.groupby(lambda idx: idx[0]).apply(lambda row: row - row.shift(1))
For a full example, suppose you create a test data set for 14 Nov to 16 Nov:
import pandas as pd
from numpy.random import randn
from datetime import datetime, time
# Create date range with 10 minute intervals, and filter out irrelevant times
# (pd.datetime has been removed from recent pandas and '10T' is deprecated in favour of '10min')
times = pd.bdate_range(start=datetime(2012,11,14,0,0,0), end=datetime(2012,11,17,0,0,0), freq='10min')
filtered_times = [x for x in times if x.time() >= time(9,30) and x.time() <= time(16,20)]
prices = randn(len(filtered_times))
# Create MultiIndex and data frame matching the format of your CSV
arrays = [[x.date() for x in filtered_times],
          [x.time() for x in filtered_times]]
tuples = list(zip(*arrays))
m_index = pd.MultiIndex.from_tuples(tuples, names=['date', 'time'])
data = pd.DataFrame({'prices': prices}, index=m_index)
You should get a DataFrame a bit like this:
                       prices
date       time
2012-11-14 09:30:00  0.696054
           09:40:00 -1.263852
           09:50:00  0.196662
           10:00:00 -0.942375
           10:10:00  1.915207
As mentioned above, you can then get the differences by grouping by the first index and then subtracting the previous row for each row:
diffs = data.groupby(lambda idx: idx[0]).apply(lambda row: row - row.shift(1))
Which gives you something like:
                       prices
date       time
2012-11-14 09:30:00       NaN
           09:40:00 -1.959906
           09:50:00  1.460514
           10:00:00 -1.139036
           10:10:00  2.857582
Since you are grouping by the date, the function is not applied for 16:20 - 09:30.
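As a hedged alternative to the lambda-based apply, the same per-day differencing can be sketched with groupby(level=...) followed by diff(). The small frame below is made-up illustration data (the dates, times and prices are assumptions, not the asker's file); only the shape of the MultiIndex matches the question.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the asker's data: two dates, three times each.
idx = pd.MultiIndex.from_product(
    [["2012-11-14", "2012-11-15"], ["09:30", "09:40", "09:50"]],
    names=["date", "time"],
)
data = pd.DataFrame({"prices": [1.0, 1.5, 1.2, 2.0, 2.4, 2.1]}, index=idx)

# diff() is computed inside each date group, so the first row of every
# day is NaN and no difference is taken across the day boundary.
diffs = data.groupby(level="date")["prices"].diff()
print(diffs)
```

The first time of each date comes out as NaN, which is the "no 16:20 - 09:30 overlap" behaviour the question asks for.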
You might want to consider using a time series (a Series with a DatetimeIndex) instead of a MultiIndexed DataFrame, because it will give you far greater flexibility with this kind of data. Supposing you have already loaded your DataFrame from the CSV file, you can easily convert it into a time series and perform a similar function to get the differences:
from datetime import datetime
dt_index = pd.DatetimeIndex([datetime.combine(i[0],i[1]) for i in data.index])
# or dt_index = pd.DatetimeIndex([datetime.combine(r.date,r.time) for r in data.itertuples()])
# if you don't have a multi-level index on data yet
ts = pd.Series(data.prices.values, dt_index)
diffs = ts.groupby(lambda idx: idx.date()).apply(lambda row: row - row.shift(1))
However, you would now have access to the built-in time series functions such as resampling. See the pandas time series documentation for more.
@MattiJohn's construction gives a filtered list of length 86,772 when run over 1/3/2007 to 8/30/2012 with 42 times per day (10-minute intervals). Observe the data cleaning issues.
Here the data of prices coming from the csv is length: 62,034.
Hence, simply importing from the .csv, as follows, is problematic:
filtered_times = [x for x in times if x.time() >= time(9,30) and x.time() <= time(16,20)]
DF=pd.read_csv('MR10min.csv')
prices = DF.price
# I.E. rather than the generic: prices = randn(len(filtered_times)) above.
The fact that the real data falls short of the length it "should be" means there are data cleaning issues. Often we do not have the full set of times that bdate_range will generate (half days in the market, holidays, etc.).
Your solution is elegant. But I am not sure how to overcome the mismatch between the actual data and the a priori, prescribed dataframe.
Your second suggestion (the time series) still seems to require constructing a datetime index similar to the first one. For example, if I were to use the following lines to get the actual data of interest:
DF=pd.read_csv('MR10min.csv')
data=DF.set_index(['date','time'])
dt_index = pd.DatetimeIndex([datetime.combine(i[0],i[1]) for i in data.index])
It will generate a:
TypeError: combine() argument 1 must be datetime.date, not str
How does one make a bdate_range array completely informed by the actual data available?
Thank you to @MattiJohn and to anyone with interest in continuing this discussion.
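One possible way around the TypeError, sketched under assumptions: if the date and time columns come out of read_csv as strings, they can be parsed with pd.to_datetime instead of datetime.combine, so the index is built only from the rows that actually exist in the file. The inline CSV text and its 'YYYY-MM-DD' / 'HH:MM:SS' formats below are hypothetical stand-ins for MR10min.csv, whose real layout is not shown in the thread.

```python
import io
import pandas as pd

# Hypothetical stand-in for MR10min.csv (formats are assumptions).
# Note that 2012-11-15 is missing its 09:40 row, as real data might be.
csv_text = """date,time,price
2012-11-14,09:30:00,10.0
2012-11-14,09:40:00,10.5
2012-11-15,09:30:00,11.0
2012-11-15,09:50:00,10.8
"""
DF = pd.read_csv(io.StringIO(csv_text))

# Parse the string columns into timestamps; no a priori date range is needed.
stamps = pd.to_datetime(DF["date"] + " " + DF["time"], format="%Y-%m-%d %H:%M:%S")
ts = pd.Series(DF["price"].to_numpy(), index=pd.DatetimeIndex(stamps))

# Group on the calendar day (midnight-normalized timestamp) and difference within it.
diffs = ts.groupby(ts.index.normalize()).diff()
print(diffs)
```

Because the index comes from the file itself, half days and holidays simply have fewer (or no) rows rather than causing a length mismatch.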
I am trying to drop specific rows in a dataframe where the index is a date with 1hr intervals during specific times of the day. (It is hourly intervals of stock market data).
For instance, 2021-10-26 09:30:00-4:00,2021-10-26 10:30:00-4:00,2021-10-26 11:30:00-4:00, 2021-10-26 12:30:00-4:00 etc.
I want to be able to specify the row to keep by hh:mm (e.g. keep just the 6:30, 10:30 data each day), and drop all the rest.
I'm pretty new to programming so have absolutely no idea how to do this.
If your index holds datetime objects and not strings, you can do something like this
df = pd.DataFrame()
# ...input data, etc...
kept = []
for ts in df.index:
    # keep only the 6:30 and 10:30 rows of each day
    if (ts.hour == 6 or ts.hour == 10) and ts.minute == 30:
        kept.append(ts)
df = df.loc[kept]
For more on working with times in pandas, see about halfway down this tutorial:
https://www.dataquest.io/blog/python-datetime-tutorial/
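A loop-free variant can be sketched by formatting the index as hh:mm strings and testing membership. The hourly frame and the column name close below are invented for illustration; only the idea of keeping rows by hh:mm comes from the question.

```python
import pandas as pd

# Hypothetical hourly data starting at 06:30 (column name is an assumption).
idx = pd.date_range("2021-10-26 06:30", periods=8, freq="60min")
df = pd.DataFrame({"close": range(8)}, index=idx)

# Times of day to keep, written as hh:mm strings.
keep = ["06:30", "10:30"]

# strftime turns each index timestamp into "HH:MM"; isin builds a boolean mask.
filtered = df[df.index.strftime("%H:%M").isin(keep)]
print(filtered)
```

Changing the keep list is all that is needed to select different times of day.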
I have a pandas dataframe with a datetime index and some column, 'value'. I would like to compare the 'value' value at a given time of day to the value at a different time of the same day. E.g. compare the 10am value to the 10pm value.
Right now I can get the value at either side using:
mask = df[(df.index.hour == hour)]
The problem is that this returns a dataframe indexed by the timestamps at that hour, so doing mask1.value - mask2.value returns NaNs since the indexes are different.
I can get around this in a convoluted way:
out = mask.value.loc["2020-07-15"].reset_index() - mask2.value.loc["2020-07-15"].reset_index() #assuming mask2 is the same as the mask call but at a different hour
but this is tiresome to loop over for a dataset that spans years. (Obviously I could timedelta +=1 in the loop to avoid the hard calls).
I don't actually care if some NaNs get into the end result when some values, e.g. at 10am, are missing.
Edit:
Initial dataframe:
index values
2020-05-10T10:00:00 23
2020-05-10T11:00:00 20
2020-05-10T12:00:00 5
.....
2020-05-30T22:00:00 8
2020-05-30T23:00:00 8
2020-05-30T24:00:00 9
Expected dataframe:
index date newval
0 2020-05-10 18
.....
x 2020-05-30 1
where newval is some subtraction of the two different times I described above (eg. the 10am measurement - the 12pm measurement so 23-5 = 18), second entry is made up
it doesn't matter to me if date is a separate column or the index.
A workaround:
mask1 = df[(df.index.hour == hour1)]
mask2 = df[(df.index.hour == hour2)]
out = mask1.values - mask2.values # df.values returns an np array without indices
result_df = pd.DataFrame(index=pd.date_range(start, end), data=out)
It should save you the effort of looping over the dates. Note that it assumes both masks contain exactly one row for every date between start and end.
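If some hours may be missing, a sketch that keeps the index can be more forgiving: re-index each hourly slice by its calendar date, then let pandas align the subtraction. The helper name at_hour and the synthetic hourly data are assumptions made for this example.

```python
import pandas as pd

# Hypothetical hourly data covering three full days.
idx = pd.date_range("2020-05-10 00:00", periods=72, freq="60min")
df = pd.DataFrame({"value": range(72)}, index=idx)

def at_hour(frame, hour):
    """Return the 'value' column at the given hour, indexed by date only."""
    s = frame.loc[frame.index.hour == hour, "value"].copy()
    s.index = s.index.normalize()  # drop the time of day, keep the date
    return s

# Both series are indexed by date, so the subtraction aligns per day;
# a date missing on either side would give NaN instead of shifting rows.
newval = at_hour(df, 10) - at_hour(df, 22)
print(newval)
```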
I am working on a dataset that has some 26 million rows and 13 columns including two datetime columns arr_date and dep_date. I am trying to create a new boolean column to check if there is any US holidays between these dates.
I am using the apply function on the entire dataframe but the execution time is too slow. The code has been running for more than 48 hours now on Google Cloud Platform (24GB RAM, 4 cores). Is there a faster way to do this?
The dataset looks like this:
Sample data
The code I am using is -
import pandas as pd
import numpy as np
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
df = pd.read_pickle('dataGT70.pkl')
cal = calendar()
def mark_holiday(df):
    df.apply(lambda x: True if (len(cal.holidays(start=x['dep_date'], end=x['arr_date']))>0 and x['num_days']<20) else False, axis=1)
    return df
df = mark_holiday(df)
This took me about two minutes to run on a sample dataframe of 30m rows with two columns, start_date and end_date.
The idea is to get a sorted list of all holidays occurring on or after the minimum start date, and then to use bisect_left from the bisect module to determine the next holiday occurring on or after each start date. This holiday is then compared to the end date. If it is less than or equal to the end date, then there must be at least one holiday in the date range between the start and end dates (both inclusive).
from bisect import bisect_left
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
# Create sample dataframe of 10k rows with an interval of 1-19 days.
np.random.seed(0)
n = 10000 # Sample size, e.g. 10k rows.
years = np.random.randint(2010, 2019, n)
months = np.random.randint(1, 13, n)
days = np.random.randint(1, 29, n)
df = pd.DataFrame({'start_date': [pd.Timestamp(*x) for x in zip(years, months, days)],
                   'interval': np.random.randint(1, 20, n)})
df['end_date'] = df['start_date'] + pd.to_timedelta(df['interval'], unit='D')
df = df.drop('interval', axis=1)
# Get a sorted list of holidays since the first start date.
hols = calendar().holidays(df['start_date'].min())
# Determine if there is a holiday between the start and end dates (both inclusive).
df['holiday_in_range'] = df['end_date'].ge(
    df['start_date'].apply(lambda x: bisect_left(hols, x)).map(lambda x: hols[x]))
>>> df.head(6)
  start_date   end_date  holiday_in_range
0 2015-07-14 2015-07-31             False
1 2010-12-18 2010-12-30              True  # 2010-12-24
2 2013-04-06 2013-04-16             False
3 2013-09-12 2013-09-24             False
4 2017-10-28 2017-10-31             False
5 2013-12-14 2013-12-29              True  # 2013-12-25
So, for a given start_date timestamp (e.g. 2013-12-14), bisect_left(hols, pd.Timestamp('2013-12-14')) would yield 39, and hols[39] results in 2013-12-25, the next holiday falling on or after the 2013-12-14 start date. The next holiday is calculated as df['start_date'].apply(lambda x: bisect_left(hols, x)).map(lambda x: hols[x]). This holiday is then compared to the end_date, and holiday_in_range is thus True if the end_date is greater than or equal to this holiday value; otherwise the holiday must fall after the end_date.
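The same "next holiday on or after the start date" lookup can also be sketched without a per-row apply by swapping bisect_left for NumPy's vectorized searchsorted. This is a variation on the answer rather than its original code; the three sample rows are taken from the output shown above, and the end of the holiday range is an assumption chosen so that every start date has a later holiday.

```python
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

df = pd.DataFrame({
    "start_date": pd.to_datetime(["2015-07-14", "2010-12-18", "2013-12-14"]),
    "end_date": pd.to_datetime(["2015-07-31", "2010-12-30", "2013-12-29"]),
})

# Sorted holidays from the earliest start to a year past the latest end,
# so the lookup below never runs off the end of the array.
hols = USFederalHolidayCalendar().holidays(
    df["start_date"].min(), df["end_date"].max() + pd.Timedelta(days=366)
).to_numpy(dtype="datetime64[ns]")

starts = df["start_date"].to_numpy(dtype="datetime64[ns]")
ends = df["end_date"].to_numpy(dtype="datetime64[ns]")

# side="left" gives the position of the first holiday >= start (like bisect_left).
pos = np.searchsorted(hols, starts, side="left")
df["holiday_in_range"] = hols[pos] <= ends
print(df)
```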
Have you already considered using pandas.merge_asof for this?
I could imagine that map and apply with lambda functions cannot be executed that efficiently.
UPDATE: Ah sorry, I just read that you only need a boolean indicating whether there are any holidays in between; this makes it much easier. If that's enough, you just need to perform steps 1-5, then group the DataFrame that results from step 5 by start/end date and use count as the aggregate function to get the number of holidays in each range. You can join this result to your original dataset as in step 8 described below, then fill the rest of the values with fillna(0). Do something like joined_df['includes_holiday'] = joined_df['joined_count_column'] > 0. After that, you can delete the joined_count_column from your DataFrame if you like.
If you use pandas.merge_asof you could work through these steps (steps 6 and 7 are only necessary if you need to have all the holidays in between start and end in your result DataFrame as well, not just the booleans):
1. Load your holiday records into a DataFrame and index it on the date. The holidays should be one date per line (storing ranges, like Christmas from the 24th to the 26th, in one row would make it much more complex).
2. Create a copy of your dataframe with just the start and end date columns. UPDATE: every start/end date pair should occur only once in it, e.g. by using groupby.
3. Use merge_asof with a reasonable tolerance value (if you join over the start of the period, use direction='forward'; if you use the end date, use direction='backward') and how='inner'.
4. As a result you have a merged DataFrame with your start and end columns and the date column from your holiday dataframe. You get only records for which a holiday was found within the given tolerance, but later you can merge this data back with your original DataFrame. You will probably now have duplicates of your original records.
5. Then check the joined holiday for your records with indexers by comparing it with the start and end columns, and remove the holidays which are not in between.
6. Sort the dataframe you obtained from step 5 (use something like df.sort_values(['start', 'end', 'holiday'], inplace=True)). Now you should insert a number column that numbers the holidays within your periods (the ones you obtained after step 5) from 1 upwards, starting again from 1 for each period. This is necessary to use unstack in the next step to get the holidays in columns.
7. Add an index on your dataframe based on period start date, period end date and the count column you inserted in step 6. Use df.unstack(level=-1) on the DataFrame you prepared in steps 1-7. What you now have is a condensed DataFrame with your original periods and the holidays arranged columnwise.
8. Now you only have to merge this DataFrame back to your original data using original_df.merge(df_from_step7, left_on=['start', 'end'], right_index=True, how='left')
The result of this is a file with your original data containing the date ranges, and for each date range the holidays that lie within the period are stored in separate columns after the data. Loosely speaking, the numbering in step 6 assigns the holidays to the columns and has the effect that the columns are always filled in order, without gaps (you wouldn't have a holiday in column 3 if column 1 is empty).
Step 6 is probably also a bit tricky, but you can do it, for example, by adding a series filled with a range and then fixing it so the numbering starts at 0 or 1 in each group, either by using shift or by grouping by start, end with aggregate({'idcol': 'min'}) and joining the result back to subtract it from the value assigned by the range sequence.
In all, I think it sounds more complicated than it is, and it should perform quite efficiently. This is especially so if your periods are not that large, because then after step 5 your result set should be much smaller than your original dataframe; but even if that is not the case, it should still be quite efficient, since it can use compiled code.
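For the boolean-only case, the idea can be sketched roughly as follows. This is a simplified reading of the steps above rather than a literal transcription: it uses a forward merge_asof without a tolerance, keeps unmatched rows, and compares the matched holiday with the end date. The column names start, end, holiday and includes_holiday, the sample periods and the holiday range are all assumptions made for this example.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

periods = pd.DataFrame({
    "start": pd.to_datetime(["2015-07-14", "2010-12-18", "2013-12-14"]),
    "end": pd.to_datetime(["2015-07-31", "2010-12-30", "2013-12-29"]),
}).astype("datetime64[ns]")

# One holiday per row, sorted; cast so both merge keys share one dtype.
hols = pd.DataFrame(
    {"holiday": USFederalHolidayCalendar().holidays("2010-01-01", "2016-12-31")}
)
hols["holiday"] = hols["holiday"].astype("datetime64[ns]")

# merge_asof needs the left key sorted; each period should occur only once.
left = periods.drop_duplicates().sort_values("start")
merged = pd.merge_asof(
    left, hols, left_on="start", right_on="holiday", direction="forward"
)

# The matched holiday is the first one on or after start; it lies inside
# the period exactly when it is not later than end (NaT compares as False).
merged["includes_holiday"] = merged["holiday"] <= merged["end"]

# Join the flag back onto the original rows in their original order.
result = periods.merge(
    merged[["start", "end", "includes_holiday"]], on=["start", "end"], how="left"
)
print(result)
```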
In pandas, I have two data frames. One containing the Holidays of a particular country from http://www.timeanddate.com/holidays/austria and another one containing a date column. I want to calculate the #days after a holiday.
def compute_date_diff(x, y):
    difference = y - x
    differenceAsNumber = (difference/ np.timedelta64(1, 'D'))
    return differenceAsNumber.astype(int)
for index, row in holidays.iterrows():
    secondDF[row['name']+ '_daysAfter'] = secondDF.dateColumn.apply(compute_date_diff, args=(row.day,))
However, this (1) calculates the wrong difference, e.g. more than a year when holidays contains data for more than a year, and (2) is pretty slow.
How could I fix the flaw and increase performance? Is there a parallel apply? Or what about http://pandas.pydata.org/pandas-docs/stable/timeseries.html#holidays-holiday-calendars
As I am new to pandas I am unsure how to obtain the current date/index of the date object whilst iterating through in apply. As far as I know I cannot loop the other way round e.g. over all my rows in secondDF as it was impossible for me to generate feature columns whilst iterating via apply
To do this, join both data frames using a common column and then try this code
import pandas
import numpy as np
# build the frame directly from the two date columns
df = pandas.DataFrame({
    'to': [pandas.Timestamp('2014-01-24'), pandas.Timestamp('2014-01-27'), pandas.Timestamp('2014-01-23')],
    'fr': [pandas.Timestamp('2014-01-26'), pandas.Timestamp('2014-01-27'), pandas.Timestamp('2014-01-24')],
})
# difference in whole days as a float
df['ans'] = (df.fr - df.to) / np.timedelta64(1, 'D')
print(df)
output
          to         fr  ans
0 2014-01-24 2014-01-26  2.0
1 2014-01-27 2014-01-27  0.0
2 2014-01-23 2014-01-24  1.0
I settled for something entirely different:
Now only the number of days to the nearest holiday is calculated.
My function:
def get_nearest_holiday(holidays, pivot):
    # this needs to be converted to an int, but at least the nearest holiday is found efficiently
    return min(holidays, key=lambda x: abs(x - pivot))
is called as a lambda expression on a per-row basis.
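For the original goal (days after the most recent holiday), a vectorized sketch is possible with NumPy's searchsorted on a sorted holiday array. This is an alternative to the per-row min(...) above, not the poster's method; the holiday dates and the sample dates are made up for illustration, and the code assumes every date falls on or after the first holiday in the list.

```python
import numpy as np
import pandas as pd

# Hypothetical sorted holiday dates and the dates to score.
holidays = pd.to_datetime(["2014-01-01", "2014-01-06", "2015-01-01"]).to_numpy()
dates = pd.to_datetime(["2014-01-03", "2014-01-06", "2014-12-31"]).to_numpy()

# side="right" minus one gives the index of the last holiday <= each date.
pos = np.searchsorted(holidays, dates, side="right") - 1
previous_holiday = holidays[pos]

# Dividing a timedelta array by one day yields the day count as floats.
days_after = (dates - previous_holiday) / np.timedelta64(1, "D")
print(days_after)
```

Because each date is only compared with its own preceding holiday, the result can never exceed the gap between two consecutive holidays, which avoids the "more than a year" differences mentioned in the question.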
I've created a pandas dataframe from a 205MB csv (approx 1.1 million rows by 15 columns). It holds a column called starttime that is dtype object (it's more precisely a string). The format is as follows: 7/1/2015 00:00:03.
I would like to create two new dataframes from this pandas dataframe. One should contain all rows corresponding with weekend dates, the other should contain all rows corresponding with weekday dates.
Weekend dates are:
weekends = ['7/4/2015', '7/5/2015', '7/11/2015', '7/12/2015',
            '7/18/2015', '7/19/2015', '7/25/2015', '7/26/2015']
I attempted to convert the string to datetime (pd.to_datetime) hoping that would make the values easier to parse, but when I do it hangs for so long that I ended up restarting the kernel several times.
Then I decided to use df["date"], df["time"] = zip(*df['starttime'].str.split(' ').tolist()) to create two new columns in the original dataframe (one for date, one for time). Next I figured I'd use a boolean test to 'flag' weekend records (according to the new date field) as True and all others False and create another column holding those values, then I'd be able to group by True and False.
For example,
test1 = bikes['date'] == '7/1/2015' returns True for all 7/1/2015 values, but I can't figure out how to iterate over all items in weekends so that I get True for all weekend dates. I tried this and broke Python (hung again):
for i in weekends:
    for k in df['date']:
        test2 = df['date'] == i
I'd appreciate any help (with both my logic and my code).
First, create a DataFrame of string timestamps with 1.1m rows:
df = pd.DataFrame({'date': ['7/1/2015 00:00:03', '7/1/2015 00:00:04'] * 550000})
Next, you can simply convert them to Pandas timestamps as follows:
df['ts'] = pd.to_datetime(df.date)
This operation took just under two minutes. However, it took under seven seconds if you specify the format:
df['ts'] = pd.to_datetime(df.date, format='%m/%d/%Y %H:%M:%S')
Now, it is easy to set up a weekend flag as follows (which took about 3 seconds):
df['weekend'] = [d.weekday() >= 5 for d in df.ts]
Finally, it is easy to subset your DataFrame, which takes virtually no time:
df_weekdays = df.loc[~df.weekend, :]
df_weekends = df.loc[df.weekend, :]
The weekend flag is to help explain what is happening. You can simplify as follows:
df_weekdays = df.loc[df.ts.apply(lambda ts: ts.weekday() < 5), :]
df_weekends = df.loc[df.ts.apply(lambda ts: ts.weekday() >= 5), :]
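The weekday test can also be sketched without a Python-level loop or apply by using the .dt accessor on the parsed column. The four sample timestamps below are invented for illustration; the starttime column name and format string follow the question.

```python
import pandas as pd

# Hypothetical sample in the question's format (Wed, Sat, Sun, Mon).
df = pd.DataFrame({"starttime": [
    "7/1/2015 00:00:03",
    "7/4/2015 10:15:00",
    "7/5/2015 23:59:59",
    "7/6/2015 08:00:00",
]})
df["ts"] = pd.to_datetime(df["starttime"], format="%m/%d/%Y %H:%M:%S")

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
is_weekend = df["ts"].dt.dayofweek >= 5

df_weekends = df[is_weekend]
df_weekdays = df[~is_weekend]
print(df_weekends)
```

This removes the need for a hand-maintained weekends list, since the day of the week is derived from each timestamp.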