Merge 1min, 5min and Daily OHLC dataframes WITHOUT upsampling the 1min? - python

To make this simple, let's say I have 3 datasets: a 1min OHLCV dataset, a 5min, and a Daily. (The sample tables for the 1min, 5min, and Daily data, and the desired combined output, are omitted here.) Given these 3 datasets, how can I merge them into a single combined dataframe using Pandas/Python 3?
Going down the 1min dataframe, every time a new 5-minute boundary is hit, the row from the 5min dataframe whose time is <= the 1min time gets added to its respective columns. I've left the blanks in to help visualize what's happening, but for simplicity's sake I can just forward-fill the values. No backfilling, so as not to introduce lookahead bias. The daily value would be the previous day's OHLCV data, except at 4:00 PM, where it would be the current day's data; this is because of how Alpha Vantage structures their dataframes. I also have a 15min and a 60min dataset to go along with this, but I think once the 1min-to-5min merge is done, the same logic can apply.
I have included reproducible code below to get the exact dataframes I'm working with; however, you have to pip install alpha_vantage and get a free API key from here.
SIMPLE DESIRED SOLUTION - What I've outlined above.
ADVANCED DESIRED SOLUTION - Rather than forward-filling the 5min data, ideally those empty spaces would hold the respective running prices. For example, an empty open in the 5min_open column would be the 1min open from the start of that 5-minute bar, the high would be the running max so far, the low the running min, and the volume the running sum. Think of something like:
conversion = {'Open' : 'first', 'High' : 'max', 'Low' : 'min', 'Close' : 'last', 'Volume' : 'sum'}
That way the empty spaces in the 5min columns are updating as they should, but the simple solution would suffice.
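For the advanced variant, here is a minimal sketch (not from the original post) of deriving those running 5-minute values directly from the 1-minute bars themselves. It assumes a datetime-indexed 1-minute frame with lowercase open/high/low/close/volume columns; the one_min frame below is just an illustrative stand-in:
import pandas as pd
import numpy as np

# Illustrative 1-minute frame (a stand-in for the real Alpha Vantage data)
idx = pd.date_range('2021-01-04 09:31', periods=10, freq='T')
one_min = pd.DataFrame({
    'open':   np.arange(10, dtype=float),
    'high':   np.arange(10, dtype=float) + 0.5,
    'low':    np.arange(10, dtype=float) - 0.5,
    'close':  np.arange(10, dtype=float) + 0.2,
    'volume': np.full(10, 100.0),
}, index=idx)

# Group each 1-minute row by the 5-minute bucket it falls into
# (adjust the floor if your vendor stamps bars at the end of the interval)
buckets = one_min.index.floor('5min')
g = one_min.groupby(buckets)

running_5min = pd.DataFrame({
    '5min_open':   g['open'].transform('first'),  # first 1min open of the bucket so far
    '5min_high':   g['high'].cummax(),            # running high within the bucket
    '5min_low':    g['low'].cummin(),             # running low within the bucket
    '5min_close':  one_min['close'],              # the latest close is the running close
    '5min_volume': g['volume'].cumsum(),          # running volume within the bucket
}, index=one_min.index)

print(one_min.join(running_5min))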
I don't want to just upsample the 1min dataframe, because I want to keep all of the history in the daily dataframe; resampling the 1min data up to daily would lose a bunch of past daily data. So I want the solution NOT to involve upsampling, but rather merging the different dataframes by datetime. Thanks!
Windows 10, Python 3:
from alpha_vantage.timeseries import TimeSeries
import pandas as pd
import sys, os
import os.path
from time import sleep

# Make a historical folder if it's not created already
if not os.path.exists("data/historical/"):
    os.makedirs("data/historical/")

api_call_limit_daily = 500
total_api_count = 0

timeframes = ['1min', '5min', '15min', '60min', 'daily']

ts = TimeSeries(key='YOUR_API_KEY_HERE', output_format='csv')

tickers = ['SPY']

for ticker in tickers:
    # Ensure uppercase
    ticker = str(ticker.upper())

    for timeframe in timeframes:
        print("Downloading dataset for", ticker, "-", timeframe, "...")
        if total_api_count > api_call_limit_daily:
            print("Daily API call limit reached. Closing...")
            sys.exit(0)

        if timeframe != 'daily':
            data, meta_data = ts.get_intraday_extended(symbol=ticker, interval=timeframe)
        elif timeframe == 'daily':
            data, meta_data = ts.get_daily(symbol=ticker)

        # Convert the data object to a dataframe, and reverse the order so oldest date's on top
        csv_list = list(data)
        data = pd.DataFrame.from_records(csv_list[1:], columns=csv_list[0])
        data = data.iloc[::-1]
        if timeframe == 'daily':
            data = data.rename(columns={"timestamp": "time"})
        print(data)
        df = data.set_index('time')

        total_api_count += 1

        df.to_csv("data/historical/" + ticker + "_" + timeframe + ".csv", index=True)
        print("Success...")

        # Sleep if we're not through tickers/timeframes yet re: api limits
        sleep(15)

print("Done!")
UPDATE
I have frankenstein'd an iterative process for the simple solution that works for each dataframe except the daily one. Still trying to work that out. Basically, for any time on the 1min rows I want to display YESTERDAY's daily data, unless the time is >= 16:00:00, in which case it can be the current day's. I want that so as not to introduce forward peeking. Anyway, here's the code that accomplishes what I'm looking for iteratively, but I'm hoping there's a faster/cleaner way to do this:
import numpy as np

# Merge all timeframes together into one dataframe
for ticker in tickers:
    # Ensure uppercase
    ticker = str(ticker.upper())

    for timeframe in timeframes:
        # Define the 1min timeframe as the main df we'll add other columns to
        if timeframe == "1min":
            main_df = pd.read_csv("./data/historical/" + ticker + "_" + timeframe + ".csv")
            main_df['time'] = pd.to_datetime(main_df['time'])
            continue

        # Now add in some NaNs for the next timeframe's columns
        main_df[timeframe + "_open"] = np.nan
        main_df[timeframe + "_high"] = np.nan
        main_df[timeframe + "_low"] = np.nan
        main_df[timeframe + "_close"] = np.nan
        main_df[timeframe + "_volume"] = np.nan

        # Read in the next timeframe's dataset
        df = pd.read_csv("./data/historical/" + ticker + "_" + timeframe + ".csv")
        df['time'] = pd.to_datetime(df['time'])

        # Rather than doing a double for loop to iterate through both datasets, just keep a counter
        # of what row we're at in the second dataframe. Used as a row locator
        curr_df_row = 0

        # Do this for all datasets except the daily one
        if timeframe != 'daily':
            # Iterate through the main_df
            for i in range(len(main_df)):
                # If the time in the main df is >= the current row's time in the other timeframe,
                # add the values to their columns (guard against running off the end of df,
                # and use .loc to avoid chained-assignment issues)
                if curr_df_row < len(df) and main_df['time'].iloc[i] >= df['time'].iloc[curr_df_row]:
                    main_df.loc[main_df.index[i], timeframe + "_open"] = df['open'].iloc[curr_df_row]
                    main_df.loc[main_df.index[i], timeframe + "_high"] = df['high'].iloc[curr_df_row]
                    main_df.loc[main_df.index[i], timeframe + "_low"] = df['low'].iloc[curr_df_row]
                    main_df.loc[main_df.index[i], timeframe + "_close"] = df['close'].iloc[curr_df_row]
                    main_df.loc[main_df.index[i], timeframe + "_volume"] = df['volume'].iloc[curr_df_row]
                    curr_df_row += 1

        # Daily dataset logic would go here

print(main_df)
main_df.to_csv("./TEST.csv", index=False)

Related

Timeseries dataframe returns an error when using Pandas Align - ValueError: cannot join with no overlapping index names

My goal:
I have two time-series data frames, one with a time interval of 1m and the other with a time interval of 5m. The 5m data frame is a resampled version of the 1m data. What I'm doing is computing a set of RSI values that correspond to the 5m df using the vectorbt library, then aligning and broadcasting these values to the 1m df using df.align.
The Problem:
When I run this line by line, it works perfectly, and the aligned result looks as expected (output not reproduced here). However, when the same logic runs inside the function, it returns the following error even though the indexes do overlap:
ValueError: cannot join with no overlapping index names
Here's the complete code:
import vectorbt as vbt
import numpy as np
import pandas as pd
import datetime

end_date = datetime.datetime.now()
start_date = end_date - datetime.timedelta(days=3)

btc_price = vbt.YFData.download('BTC-USD',
                                interval='1m',
                                start=start_date,
                                end=end_date,
                                missing_index='drop').get('Close')

def custom_indicator(close, rsi_window=14, ma_window=50):
    close_5m = close.resample('5T').last()
    rsi = vbt.RSI.run(close_5m, window=rsi_window).rsi
    rsi, _ = rsi.align(close, broadcast_axis=0, method='ffill')
    print(rsi)    # to check
    print(close)  # to check
    return

# setting up indicator factory
ind = vbt.IndicatorFactory(
    class_name='Combination',
    short_name='comb',
    input_names=['close'],
    param_names=['rsi_window', 'ma_window'],
    output_names=['value']).from_apply_func(custom_indicator,
                                            rsi_window=14,
                                            ma_window=50,
                                            keep_pd=True)

res = ind.run(btc_price, rsi_window=21, ma_window=50)
print(res)
Thank you for taking the time to read this. Any help would be appreciated!
If you check the columns of both rsi and close,
print('close is', close.columns)
print('rsi is', rsi.columns)
you will find
rsi is MultiIndex([(14, 'Close')],
    names=['rsi_window', None])
close is Index(['Close'], dtype='object')
rsi has two column index levels, so one should be dropped. That can be done with
rsi.columns = rsi.columns.droplevel()
which drops one level of the column index so the two objects can be aligned.
The problem is that the data must be a time series (a pandas Series), not a pandas DataFrame, for align to join the two.
You need to fix the data type:
# Time Series
close = close['Close']
close_5m = close.resample('15min').last()
rsi = vbt.RSI.run(close_5m, window=rsi_window).rsi
rsi, _ = rsi.align(close, broadcast_axis=0, method='ffill', join='right')
When you are aligning the data, make sure to include join='right':
rsi, _ = rsi.align(close, broadcast_axis=0, method='ffill', join='right')
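Putting the two suggestions together, a sketch of a custom_indicator that avoids the error might look like the following (assuming the same vectorbt setup as in the question; this combines both answers and is not a tested drop-in):
import pandas as pd
import vectorbt as vbt

def custom_indicator(close, rsi_window=14, ma_window=50):
    # keep_pd=True hands the function pandas objects; reduce the input to the
    # 'Close' Series so align only has the datetime index to reconcile
    close = close['Close']
    close_5m = close.resample('5T').last()

    rsi = vbt.RSI.run(close_5m, window=rsi_window).rsi
    # if the RSI output comes back with a parameter column level, drop it
    if isinstance(rsi, pd.DataFrame) and rsi.columns.nlevels > 1:
        rsi.columns = rsi.columns.droplevel('rsi_window')

    # broadcast the 5-minute RSI onto the 1-minute index, forward-filling
    rsi, _ = rsi.align(close, broadcast_axis=0, method='ffill', join='right')
    return rsi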

Python pandas rolling computations with custom step size

I have a pandas dataframe with daily data. At the last day of each month, I would like to compute a quantity that depends on the daily data of the previous n months (e.g., n=3).
My current solution is to use the pandas rolling function to compute this quantity for every day, and then, only keep the quantities of the last days of each month (and discard all the other quantities). This however implies that I perform a lot of unnecessary computations.
Does anybody know how I can improve this?
Thanks a lot in advance!
EDIT:
In the following, I add two examples. In both cases, I compute rolling regressions of stock returns. The first (short) example shows the problem described above and is a sub-problem of my actual problem. The second (long) example shows my actual problem. Therefore, I would either need a solution of the first example that can be embedded in my algorithm for solving the second example or a completely different solution of the second example. Note: The dataframe that I'm using is very large, which means that multiple copies of the entire dataframe are not feasible.
Example 1:
import pandas as pd
import numpy as np
import random
import statsmodels.api as sm

# Generate a time index
dates = pd.date_range("2018-01-01", periods=365, freq="D", name='date')
df = pd.DataFrame(index=dates, columns=['Y', 'X']).sort_index()

# Generate Data
df['X'] = np.array(range(0, 365))
df['Y'] = 3.1 * df['X'] - 2.5
df = df.iloc[random.sample(range(365), 280)]         # some days are missing
df.iloc[random.sample(range(280), 20), 0] = np.nan   # some observations are missing
df = df.sort_index()

# Compute Beta
def estimate_beta(ser):
    return sm.OLS(df.loc[ser.index, 'Y'], sm.add_constant(df.loc[ser.index, 'X']),
                  missing='drop').fit().params[-1]

# use the last 60 days and require at least 10 observations
df['beta'] = df['Y'].rolling('60D', min_periods=10).apply(estimate_beta)

# Get last entries per month
df_monthly = df[['beta']].groupby([pd.Grouper(freq='M', level='date')]).agg('last')
df_monthly
Example 2:
import pandas as pd
import numpy as np
from pandas import IndexSlice as idx
import random
import statsmodels.api as sm

# Generate a time index
dates = pd.date_range("2018-01-01", periods=365, freq="D", name='date')
arrays = [dates.tolist() + dates.tolist(), ["10000"] * 365 + ["10001"] * 365]
index = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=["Date", "Stock"])
df = pd.DataFrame(index=index, columns=['Y', 'X']).sort_index()

# Generate Data
df.loc[idx[:, "10000"], 'X'] = X = np.array(range(0, 365)).astype(float)
df.loc[idx[:, "10000"], 'Y'] = 3 * X - 2
df.loc[idx[:, "10001"], 'X'] = X
df.loc[idx[:, "10001"], 'Y'] = -X + 1
df = df.iloc[random.sample(range(365 * 2), 360 * 2)]         # some days are missing
df.iloc[random.sample(range(280 * 2), 20 * 2), 0] = np.nan   # some observations are missing

# Estimate beta
def estimate_beta_grouped(df_in):
    def estimate_beta(ser):
        return sm.OLS(df.loc[ser.index, 'Y'].astype(float),
                      sm.add_constant(df.loc[ser.index, 'X'].astype(float)),
                      missing='drop').fit().params[-1]

    df = df_in.droplevel('Stock').reset_index().set_index(['Date']).sort_index()
    df['beta'] = df['Y'].rolling('60D', min_periods=10).apply(estimate_beta)
    return df[['beta']]

df_beta = df.groupby(level='Stock').apply(estimate_beta_grouped)

# Extract beta at last day per month
df_monthly = df.groupby([pd.Grouper(freq='M', level='Date'), df.index.get_level_values(1)]).agg('last')  # get last observations
df_monthly = df_monthly.merge(df_beta, left_index=True, right_index=True, how='left')  # merge beta on df_monthly
df_monthly
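There is no accepted approach shown here, but a minimal sketch of the "only fit at month-ends" idea for Example 1 could look like this; it assumes the df built in Example 1 and slices the trailing 60-day window once per month-end, so the number of OLS fits drops from one per day to one per month:
# Hedged sketch: fit the regression only at the last available day of each month
month_ends = df.groupby(pd.Grouper(freq='M', level='date')).tail(1).index

betas = {}
for end in month_ends:
    window = df.loc[end - pd.Timedelta('60D'):end]   # roughly the same trailing window as rolling('60D')
    valid = window.dropna(subset=['Y'])
    if len(valid) >= 10:                             # same min_periods requirement
        betas[end] = sm.OLS(valid['Y'], sm.add_constant(valid['X']),
                            missing='drop').fit().params[-1]
    else:
        betas[end] = np.nan

df_monthly = pd.Series(betas, name='beta').rename_axis('date').to_frame()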

Adding a column to pandas dataframe conditionally

I am working on a personal project collecting the data on Covid-19 cases. The data set only shows the total number of Covid-19 cases per state cumulatively. I would like to add a column that contains the new cases added that day. This is what I have so far:
import pandas as pd
from datetime import date
from datetime import timedelta
import numpy as np
#read the CSV from github
hist_US_State = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
#some code to get yesterday's date and the day before which is needed later.
today = date.today()
yesterday = today - timedelta(days = 1)
yesterday = str(yesterday)
day_before_yesterday = today - timedelta(days = 2)
day_before_yesterday = str(day_before_yesterday)
#Extracting yesterday's and the day before cases and combine them in one dataframe
yesterday_cases = hist_US_State[hist_US_State["date"] == yesterday]
day_before_yesterday_cases = hist_US_State[hist_US_State["date"] == day_before_yesterday]
total_cases = pd.DataFrame()
total_cases = day_before_yesterday_cases.append(yesterday_cases)
#Adding a new column called "new_cases" and this is where I get into trouble.
total_cases["new_cases"] = yesterday_cases["cases"] - day_before_yesterday_cases["cases"]
Can you please point out what I am doing wrong?
Because you defined total_cases as a concatenation (via append) of yesterday_cases and day_before_yesterday_cases, its number of rows is equal to the sum of the other two dataframes. It looks like yesterday_cases and day_before_yesterday_cases both have 55 rows, and so total_cases has 110 rows. Thus your last line is trying to assign 55 values to a series of 110 values.
You may either want to reshape your data so that each date is its own column, or compute the difference on the underlying arrays (for example via .values after sorting both frames by state) so the subtraction isn't aligned on mismatched indexes.
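As an illustration of that reshaping idea (a hedged sketch, not the original poster's code), sorting by state and date and letting groupby().diff() compute the day-over-day change sidesteps the index-alignment problem entirely:
import pandas as pd

hist_US_State = pd.read_csv(
    "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv",
    parse_dates=["date"])

# day-over-day change in cumulative cases within each state;
# the first day of each state has no previous value, hence NaN
hist_US_State = hist_US_State.sort_values(["state", "date"])
hist_US_State["new_cases"] = hist_US_State.groupby("state")["cases"].diff()

# the last two dates then give the same "yesterday vs. the day before" comparison
last_two = hist_US_State["date"].drop_duplicates().nlargest(2)
total_cases = hist_US_State[hist_US_State["date"].isin(last_two)]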

Iterate over dates, calculate averages for every 24-hour period

I have a csv file with data roughly every minute over 2 years, and I want to run code that calculates 24-hour averages. Ideally I'd like the code to iterate over the data and calculate the averages, standard deviations, and R^2 between dataA and dataB for every 24-hour period, and then output this new data into a new csv file (with a datestamp and the calculated values for each 24-hour period).
The data has an unusual timestamp, which I think might be tripping me up slightly. I've been trying different for loops to iterate over the data, but I'm not sure how to specify that I want the averages etc. for each 24-hour period.
This is the code I have so far, but I'm not sure how to complete the for loop to achieve what I'm wanting. If anyone can help, that would be great!
import math
import pandas as pd
import os
import numpy as np
from datetime import timedelta, date

# read the file in csv
data = pd.read_csv("Jacaranda_data_HST.csv")

# Extract the data columns from the csv
data_date = pd.to_datetime(data.iloc[:, 1])   # parse the unusual timestamps into datetimes
dataA = data.iloc[:, 2]
dataB = data.iloc[:, 3]

# set the start and end dates of the data
start_date = data_date.iloc[0]
end_date = data_date.iloc[-1]

# for loop to run over every 24 hours of data
day_count = (end_date - start_date).days + 1
for single_date in [d for d in (start_date + timedelta(n) for n in range(day_count)) if d <= end_date]:
    # NOTE: these currently use the whole columns, not just the current 24-hour slice
    print(np.mean(dataA), np.mean(dataB), np.std(dataA), np.std(dataB))

# output new csv file - **unsure how to call the data**
csvfile = "Jacaranda_new.csv"
outdf = pd.DataFrame()
#outdf['dataA_mean'] = ??
#outdf['dataB_mean'] = ??
#outdf['dataA_stdev'] = ??
#outdf['dataB_stdev'] = ??
outdf.to_csv(csvfile, index=False)
A simplified approach could be to group by calendar day in a dict. I don't have much experience with pandas time handling in DataFrames, so this could be an alternative.
You could create a dict where the keys are the dates of the data (without the time part), so you can later calculate the mean of all the data points that fall under each key.
data_date = data.iloc[:, 1]
data_a = data.iloc[:, 2]
data_b = data.iloc[:, 3]

import collections
dd_a = collections.defaultdict(list)
dd_b = collections.defaultdict(list)

for date_str, data_point_a, data_point_b in zip(data_date, data_a, data_b):
    # we split the string by the first space, so we get only the date part
    date_part, _ = date_str.split(' ', maxsplit=1)
    dd_a[date_part].append(data_point_a)
    dd_b[date_part].append(data_point_b)
Now you can calculate the averages:
for date, v_list in dd_a.items():
    if len(v_list) > 0:
        print(date, 'mean:', sum(v_list) / len(v_list))

for date, v_list in dd_b.items():
    if len(v_list) > 0:
        print(date, 'mean:', sum(v_list) / len(v_list))
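For a pandas-native alternative (a sketch that assumes the timestamp column parses with pd.to_datetime and uses illustrative column labels 'A' and 'B'), grouping by calendar day gives the daily means, standard deviations, and the A-vs-B R^2 in one pass and writes them out with a datestamp:
import pandas as pd

data = pd.read_csv("Jacaranda_data_HST.csv")
ts = pd.to_datetime(data.iloc[:, 1])   # assumes the unusual timestamps are parseable
df = pd.DataFrame({'A': data.iloc[:, 2].values, 'B': data.iloc[:, 3].values}, index=ts)

day = df.index.floor('D')
daily = df.groupby(day).agg(
    A_mean=('A', 'mean'), A_std=('A', 'std'),
    B_mean=('B', 'mean'), B_std=('B', 'std'),
)
# R^2 between A and B for each calendar day = squared Pearson correlation
daily['r_squared'] = df.groupby(day).apply(lambda g: g['A'].corr(g['B']) ** 2)

daily.rename_axis('date').to_csv("Jacaranda_new.csv")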

Speeding up pandas array calculation

I have working code that achieves the desired calculation result, but I am currently using an algorithm that iterates over the pandas array. This is obviously slower than pure pandas DataFrame calculations. I would like some advice on how I can use pandas functions to speed up this calculation.
Code to generate dummy data
import datetime
import numpy as np
import pandas as pd

df = pd.DataFrame(index=pd.date_range(start='2014-01-01', periods=365))
df['Month'] = df.index.month
df['MTD'] = (df.index.day + 0.001) / 10000
This is basically a pandas DataFrame with MTD figures for some value. This is purely given so that we have some data to play with.
Needed calculation
What I need is a new DataFrame that has starting (investment) dates as columns, populated with a few beginning-of-month values. The index is all possible dates and the values should be the YTD figure. I am using this DataFrame as a lookup/cache for investment dates.
Pseudocode
YTD = (1 + last MTD figure of month 1) * (1 + last MTD figure of month 2) * ... * (1 + MTD figure on the required date) - 1, taken over all months from the investment date up to the required date. For example, for an investment on Feb 1 valued on Apr 15: (1 + Feb's last MTD) * (1 + Mar's last MTD) * (1 + the MTD on Apr 15) - 1.
Working function
def calculate_YTD(df):  # slow, takes 3.5s on my machine!!!!!!
    YTD_df = pd.DataFrame(index=df.index)
    for investment_date in [datetime.datetime(2014, x + 1, 1) for x in range(12)]:
        YTD_df[investment_date] = 1.0  # pre-populate with dummy floats
        for date in df.index:  # iterate over all dates in period
            h = (df[investment_date:date].groupby('Month')['MTD'].max().fillna(0) + 1).product() - 1
            YTD_df[investment_date][date] = h
    return YTD_df
I have hardcoded the investment dates list to simplify the problem statement. On my machine this code takes 2.5 to 3.5 seconds. Any suggestions on how I can speed it up?
Here's an approach that should be reasonably quick. Quite possible there is something faster/cleaner, but this should be an improvement.
# assuming a fixed number of investment dates, build a list
investment_dates = pd.date_range('2014-1-1', periods=12, freq='MS')

# build a table, by month, which contains the cumulative MTD
# return for each investment date. Still have to loop over the investment dates,
# but don't need to loop over each daily value
running_mtd = []
for date in investment_dates:
    curr_mo = (df[df.index >= date].groupby('Month')['MTD'].last() + 1.).cumprod()
    curr_mo.name = date
    running_mtd.append(curr_mo)

running_mtd_df = pd.concat(running_mtd, axis=1)
running_mtd_df = running_mtd_df.shift(1).fillna(1.)

# merge running mtd returns with the base dataframe
df = df.merge(running_mtd_df, left_on='Month', right_index=True)

# calculate the ytd return for each column / day, by multiplying the running
# monthly return with the current MTD value
for date in investment_dates:
    df[date] = np.where(df.index < date, np.nan, df[date] * (1. + df['MTD']) - 1.)
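Once merged, looking up a YTD figure is just indexing into the frame; a small hedged usage example (assuming the merge above preserves the daily index, with column labels being the Timestamps from investment_dates):
# Hypothetical usage: YTD return for money invested on 2014-03-01, valued on 2014-06-15
invest_date = pd.Timestamp('2014-03-01')
lookup_date = pd.Timestamp('2014-06-15')
print(df.loc[lookup_date, invest_date])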
