Pandas data frame: resample with linear interpolation - python

I am trying to get a fairly basic resampling method to work with a pandas data frame. My data frame df is indexed by datetime entries and contains prices:
price
datetime
2000-08-16 09:29:55.755000 7.302786
2000-08-16 09:30:10.642000 7.304059
2000-08-16 09:30:26.598000 7.304435
2000-08-16 09:30:41.372000 7.304314
2000-08-16 09:30:56.718000 7.304334
I would like to downsample this to 5min. Using
df.resample(rule='5Min',how='last',closed='left')
takes, for each multiple of 5min, the closest point in my data to its left; similarly
df.resample(rule='5Min',how='first',closed='left')
takes the closest point to the right.
However, I would like to take the linear interpolation between the point to the left and right instead, e.g. if my df contains the two consecutive entries
time t1, price p1
time t2, price p2
and
t1<t<t2 where t is a multiple of 5min
then the resampled dataframe should have the entry
time t, price p1+(t-t1)/(t2-t1)*(p2-p1)

Try creating two separate dataframes, call reset_index on each (so they share the same numerical index), fill the gaps with fillna, and then do the arithmetic on df1 and df2. For example:
df1 = df.resample(rule='5Min',how='last',closed='left').reset_index().fillna(method='ffill')
df2 = df.resample(rule='5Min',how='first',closed='left').reset_index().fillna(method='ffill')
dt = df1.datetime - df2.datetime
px_fld = df1.price + ...
Something like that should do the trick.
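The interpolation formula from the question, p1+(t-t1)/(t2-t1)*(p2-p1), can also be sketched without building two frames. This is only one possible approach, not the method from the answer above: it adds the 5-minute grid points to the index, interpolates by elapsed time, and keeps the grid rows. The helper name resample_linear and the sample prices are made up for illustration.

```python
import pandas as pd

def resample_linear(df, rule="5min"):
    """Linearly interpolate df onto a regular time grid (hypothetical helper)."""
    # Grid points that lie inside the observed time span
    grid = pd.date_range(df.index.min().ceil(rule), df.index.max().floor(rule), freq=rule)
    # Add the grid timestamps as NaN rows, interpolate by elapsed time,
    # then keep only the grid rows
    combined = df.reindex(df.index.union(grid))
    return combined.interpolate(method="time").loc[grid]

# Made-up sample data with irregular timestamps
df = pd.DataFrame(
    {"price": [1.0, 3.0, 5.0]},
    index=pd.to_datetime(
        ["2000-08-16 09:28:00", "2000-08-16 09:32:00", "2000-08-16 09:36:00"]
    ),
)
resampled = resample_linear(df)
print(resampled)
```

The sketch assumes a unique, sorted DatetimeIndex. Grid points outside the observed span are not produced, because there is no left or right neighbour to interpolate from.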

Related

Trouble resampling pandas timeseries from 1min to 5min data

I have a 1 minute interval intraday stock data which looks like this:
import yfinance as yf
import pandas as pd
n = yf.download('^nsei', period= '5d', interval= '1m')
I am trying to resample it to '5m' data like this:
n = n.resample('5T').agg(dict(zip(n.columns, ['first', 'max', 'min', 'last', 'last', 'sum'])))
But it tries to resample datetime information which is not in my data. The market data is only available until 03:30 PM, but when I look at the resampled dataframe I find it has tried to resample over the entire 24 hours.
How do I stop the resampling till 03:30PM and move on to the succeeding date?
Right now the dataframe has mostly NaN values due to this. Any suggestions will be welcome.
I am not sure what you are trying to achieve with that agg() function. Assuming 'first' refers to the first quantile and 'last' to the last quantile and you want to calculate some statistics per column, I suggest you do the following:
Get your data:
import yfinance as yf
import pandas as pd
n = yf.download('^nsei', period= '5d', interval= '1m')
Resample your data:
Note: your result is the same as when you resample with n.resample('5T').first() but this means every value in the dataframe equals the first value from the 5 minute interval consisting of 5 values. A more logical resampling method is to use the mean() or sum() function as shown below.
If this is data on stock prices it makes more sense to use mean():
resampled_df = n.resample('5T').mean()
To remove resampled hours that are outside of the working stock hours you have 2 options.
Option 1: drop na values:
filtered_df = resampled_df.dropna()
Note: this will not work if you use sum() since the result won't contain missing values but zeros.
Option 2: filter based on start and end hour:
Get minimum and maximum time of day where data is available as datetime.time object:
start = n.index.min().time() # 09:15 as datetime.time object
end = n.index.max().time() # 15:29 as datetime.time object
Filter dataframe based on start and end times:
filtered_df = resampled_df.between_time(start, end)
Get the statistics:
statistics = filtered_df.describe()
statistics
Note that describe() will not contain the sum, so in order to add it you could do:
statistics = pd.concat([statistics, filtered_df.agg(['sum'])])
statistics
The agg() is to apply individual method of operation for each column, I used this so that I can get to see the 'candlestick' formation as it is called in stock technical analysis.
I was able to fix the issue, by dropping the NaN values.
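For reference, the per-column aggregation plus the NaN drop that resolved the issue could look roughly like the sketch below. It uses synthetic 1-minute bars instead of a real download, and the column names Open, High, Low, Close and Volume are assumptions about the shape of the data. Rows are dropped on a price column because the Volume sum of an empty bin is 0 rather than NaN.

```python
import numpy as np
import pandas as pd

# Synthetic 1-minute bars for two sessions (09:15 to 15:29), standing in for the download
full = pd.date_range("2021-01-04 00:00", "2021-01-05 23:59", freq="1min")
price = np.arange(len(full), dtype=float)
bars = pd.DataFrame(
    {"Open": price, "High": price + 1, "Low": price - 1, "Close": price + 0.5, "Volume": 1},
    index=full,
).between_time("09:15", "15:29")

# One aggregation per column gives the candlestick (OHLC) shape
ohlc = bars.resample("5min").agg(
    {"Open": "first", "High": "max", "Low": "min", "Close": "last", "Volume": "sum"}
)
# Bins outside trading hours have NaN prices, so drop them on a price column
ohlc = ohlc.dropna(subset=["Open"])
print(ohlc.head())
```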

How do I drop rows in a pandas dataframe based on the time of day

I am trying to drop specific rows in a dataframe where the index is a date with 1hr intervals during specific times of the day. (It is hourly intervals of stock market data).
For instance, 2021-10-26 09:30:00-4:00,2021-10-26 10:30:00-4:00,2021-10-26 11:30:00-4:00, 2021-10-26 12:30:00-4:00 etc.
I want to be able to specify the row to keep by hh:mm (e.g. keep just the 6:30, 10:30 data each day), and drop all the rest.
I'm pretty new to programming so have absolutely no idea how to do this.
If your columns are datetime objects and not strings, you can do something like this:
df = pd.DataFrame()
# ...input data, etc...
columns = df.columns
kept = []
for col in columns:
    # Each column label is a Timestamp, so read .hour and .minute directly
    # (the .dt accessor exists only on Series)
    if col.hour in (6, 10) and col.minute == 30:
        kept.append(col)
df = df[kept]
For more on working with time in pandas, see about halfway down this tutorial:
https://www.dataquest.io/blog/python-datetime-tutorial/
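Since the question has the timestamps in the index rather than in the columns, a row filter may be closer to what was asked. The following is a minimal sketch with made-up hourly data and no timezone: it formats each index entry as hh:mm and keeps the rows whose time of day appears in a list.

```python
import pandas as pd

# Made-up hourly data covering two days, starting at 06:30
idx = pd.date_range("2021-10-26 06:30", periods=48, freq="60min")
df = pd.DataFrame({"close": range(48)}, index=idx)

# Times of day to keep, written as hh:mm
keep_times = ["06:30", "10:30"]

# Format each timestamp as hh:mm and test membership
mask = df.index.strftime("%H:%M").isin(keep_times)
kept = df[mask]
print(kept)
```

For a single time of day, df.at_time("10:30") gives the same kind of selection more directly.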

Pandas - Merging rows with time difference (When datetime is index)

I have been working through some tasks to improve my Pandas skills, but I found some unexpected errors in the data files I used. I wanted to fix them myself, but I have no idea how.
Basically, I have an Excel file with the columns PayType, Money, and Date. The PayType column has 4 different types of payment: car rent payment, car service fee payment, and 2 more which are not important. On every car rent payment there is an automatic service fee deduction, which happens at exactly the same time. I used a pivot table with the PayTypes as columns, because I wanted to calculate the percentage of these fees.
[Images omitted: the data before the pivot table, an example of the time difference, and the data after the pivot table.]
import numpy as np
import pandas as pd
import xlrd
from pandas import Series, DataFrame
df = pd.read_excel('C:/Data.xlsx', sheet_name='Sheet1', usecols=['PayType', 'Money', 'Date'])
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d %H:%M:%S.%f')
df = df.pivot_table(index=['Date'], columns=['PayType']).fillna(0)
df = pd.merge_asof(df['Money', 'serviceFee'], df['Money', 'carRenting'], on='Date', tolerance=pd.Timedelta('2s'))
df['Percentage'] = df['Money', 'serviceFee'] / df['Money', 'carRenting'] * 100
df['Percentage'] = df['Percentage'].abs()
df['Charges'] = np.where(df['Percentage'].notna(), np.where(df['Percentage'] > 26, 'Overcharge - 30%', 'Fixed - 25%'), 'Null')
df.to_excel("Finale123.xlsx")
In the pivot table, almost all of the car renting and fee payment entries happened at the same moment, so their times are equal and they share one row. But there are a few mistakes where the times of the car renting and the fee payment differ by 1 or 2 seconds. Because of this time difference, they are split into 2 different rows.
I tried to use merge_asof, but it didn't work.
How can I merge 2 rows whose times differ by at most 2 seconds, given that the time column (Date) is the index of the pivot table?
I had a similar problem. I needed to merge time series data of multiple sensors. The time interval of the sensor measurements is 5 seconds. The time format is yyyy:MM:dd HH:mm:ss. To do the merge, I also needed to sort the column used for the merge.
sensors_livingroom = load(filename_livingroom)
sensors_bedroom = load(filename_bedroom)
sensors_livingroom = sensors_livingroom.set_index("time")
sensors_bedroom = sensors_bedroom.set_index("time")
sensors_livingroom.index = pd.to_datetime(sensors_livingroom.index, dayfirst=True)
sensors_bedroom.index = pd.to_datetime(sensors_bedroom.index, dayfirst=True)
sensors_livingroom.sort_index(inplace=True)
sensors_bedroom.sort_index(inplace=True)
sensors = pd.merge_asof(sensors_bedroom, sensors_livingroom, on='time', direction="nearest")
In my case I wanted to merge to the nearest time value, so I set the parameter direction to nearest. In your case, it seems that the time of one dataframe will always be smaller than the time of the other, so it may be better to set the direction parameter to forward or backward. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html
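Applied to the car renting question, one way to use merge_asof is to split the long table by PayType before pivoting and then match each rent payment to the nearest fee within a tolerance. This is a sketch with invented amounts and timestamps; the column names follow the question, and the frames rent and fee are illustrative names.

```python
import pandas as pd

# Invented sample: the second fee is booked 1 second after its rent payment
df = pd.DataFrame({
    "PayType": ["carRenting", "serviceFee", "carRenting", "serviceFee"],
    "Money": [100.0, -25.0, 200.0, -60.0],
    "Date": pd.to_datetime([
        "2021-01-01 10:00:00", "2021-01-01 10:00:00",
        "2021-01-01 11:00:00", "2021-01-01 11:00:01",
    ]),
})

# One frame per payment type, each sorted by the merge key (required by merge_asof)
rent = (df.loc[df["PayType"] == "carRenting", ["Date", "Money"]]
          .rename(columns={"Money": "carRenting"}).sort_values("Date"))
fee = (df.loc[df["PayType"] == "serviceFee", ["Date", "Money"]]
         .rename(columns={"Money": "serviceFee"}).sort_values("Date"))

# Match each rent payment to the nearest fee no more than 2 seconds away
merged = pd.merge_asof(rent, fee, on="Date", direction="nearest",
                       tolerance=pd.Timedelta("2s"))
merged["Percentage"] = (merged["serviceFee"] / merged["carRenting"] * 100).abs()
print(merged)
```

The result keeps the timestamps of the left frame (rent). Rent payments with no fee inside the tolerance get NaN in serviceFee.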

Select hourly data based on days

I have a time series hourly_df, containing some hourly data:
import pandas as pd
import numpy as np
hourly_index = pd.date_range(start='2018-01-01', end='2018-01-07', freq='H')
hourly_data = np.random.rand(hourly_index.shape[0])
hourly_df = pd.DataFrame(hourly_data, index=hourly_index)
and I have a DatetimeIndex containing some dates (as many days as I wish), e.g.
daily_index = pd.to_datetime(['2018-01-01', '2018-01-05', '2018-01-06'])
I want to select every row of hourly_df whose index date is in daily_index, so in my case all hourly data from the 1st, 5th and 6th of January. What is the best way to do this?
If I naively use hourly_df.loc[daily_index], I only get the rows at 0:00:00 for each of the three days. What I want is the hourly data for the whole day for each of the days in daily_index.
One possibility to solve this is to create a filter that takes the date of each element in the index of hourly_df and checks whether or not this date is in daily_index.
day_filter = [hour.date() in daily_index.date for hour in hourly_df.index]
hourly_df[day_filter]
This produces the desired output, but it seems the usage of the filter is avoidable and can be done in an expression similar to hourly_df.loc[daily_index.date].
Save the daily_index as a dataframe, then
merge on the index using hourly_df.merge(daily_index, how = 'inner', ...)
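A vectorized alternative to the list-comprehension filter, sketched on the data from the question: normalize() sets every timestamp to midnight of its day, so isin can compare the result directly against daily_index. This assumes the entries of daily_index are themselves at midnight. The frequency is written as "60min" here only because that alias is accepted across pandas versions.

```python
import numpy as np
import pandas as pd

hourly_index = pd.date_range(start="2018-01-01", end="2018-01-07", freq="60min")
hourly_df = pd.DataFrame(np.random.rand(len(hourly_index)), index=hourly_index)
daily_index = pd.to_datetime(["2018-01-01", "2018-01-05", "2018-01-06"])

# Truncate each timestamp to its day, then test membership in daily_index
selected = hourly_df[hourly_df.index.normalize().isin(daily_index)]
print(selected.shape)
```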

Python - select certain time range pandas

Python newbie here but I have some data that is intra-day financial data, going back to 2012, so it's got the same hours each day(same trading session each day) but just different dates. I want to be able to select certain times out of the data and check the corresponding OHLC data for that period and then do some analysis on it.
So at the moment it's a CSV file, and I'm doing:
import pandas as pd
data = pd.read_csv('data.csv')
date = data['date']
op = data['open']
high = data['high']
low = data['low']
close = data['close']
volume = data['volume']
The thing is that the date column is in the format "dd/mm/yyyy 00:00:00" as one string, so is it possible to still select between certain times, like between "09:00:00" and "10:00:00"? Or do I have to separate the time part from the date and make it its own column? If so, how?
I believe pandas has a between_time() function, but that seems to need a DataFrame, so how can I convert my data to a DataFrame so that I can use between_time to select the times I want? Also, because there are thousands of days, each with its own "xx:xx:xx" to "xx:xx:xx" range, I want to pull the same time period from every day, not just the first occurrence of "xx:xx:xx" to "xx:xx:xx" in the data, if that makes sense. Thanks!!
Consider the dataframe df
from pandas_datareader import data
df = data.get_data_yahoo('AAPL', start='2016-08-01', end='2016-08-03')
df = df.asfreq('H').ffill()
option 1
convert index to series then dt.hour.isin
slc = df.index.to_series().dt.hour.isin([9, 10])
df.loc[slc]
option 2
numpy broadcasting
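# convert to a NumPy array first; recent pandas versions reject 2-D indexing on an Index
slc = (df.index.hour.to_numpy()[:, None] == [9, 10]).any(axis=1)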
df.loc[slc]
response to comment
To then get a range within that time slot per day, use resample + agg + np.ptp (peak to peak)
df.loc[slc].resample('D').agg(np.ptp)
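For the CSV case in the question, a minimal sketch of the full path from the "dd/mm/yyyy hh:mm:ss" strings to a between_time selection follows. The rows are invented and read from an in-memory string instead of data.csv; the column names are taken from the question.

```python
import io
import pandas as pd

# Invented rows in the question's "dd/mm/yyyy hh:mm:ss" format
csv_text = """date,open,high,low,close,volume
02/01/2012 08:30:00,10,11,9,10.5,100
02/01/2012 09:00:00,10.5,12,10,11,150
02/01/2012 09:30:00,11,11.5,10.5,11.2,120
02/01/2012 10:00:00,11.2,11.8,11,11.5,90
02/01/2012 10:30:00,11.5,12,11.3,11.9,80
03/01/2012 09:15:00,12,12.5,11.8,12.2,110
03/01/2012 11:00:00,12.2,12.4,12,12.1,70
"""

data = pd.read_csv(io.StringIO(csv_text))
# Parse the strings with an explicit day-first format and use them as the index
data["date"] = pd.to_datetime(data["date"], format="%d/%m/%Y %H:%M:%S")
data = data.set_index("date").sort_index()

# between_time needs a DatetimeIndex and applies the same window to every day
session = data.between_time("09:00", "10:00")
print(session)
```

Both endpoints of the window are included by default, so the 10:00:00 row is part of the selection.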
