Speeding up pandas array calculation - python

I have working code that produces the desired result, but it iterates over the pandas array, which is obviously slower than vectorised pandas DataFrame operations. I would like some advice on how I can use pandas functions to speed up this calculation.
Code to generate dummy data
import pandas as pd

df = pd.DataFrame(index=pd.date_range(start='2014-01-01', periods=365))
df['Month'] = df.index.month
df['MTD'] = (df.index.day + 0.001) / 10000
This is basically a pandas DataFrame with MTD (month-to-date) figures for some value, given purely so that we have some data to play with.
Needed calculation
What I need is a new DataFrame that has starting (investment) dates as columns - populated with a few beginning-of-month values. The index is all possible dates and the values should be the YTD figure. I am using this DataFrame as a lookup/cache for investment dates.
pseudocode
YTD = (1 + last MTD of the first month) * (1 + last MTD of the second month) * ... * (1 + last MTD of the required date's month) - 1, i.e. the product runs over all months from the investment date up to the required date
Working function
import datetime

def calculate_YTD(df):  # slow, takes 3.5s on my machine!!!!!!
    YTD_df = pd.DataFrame(index=df.index)
    for investment_date in [datetime.datetime(2014, x + 1, 1) for x in range(12)]:
        YTD_df[investment_date] = 1.0  # pre-populate with dummy floats
        for date in df.index:  # iterate over all dates in period
            h = (df[investment_date:date].groupby('Month')['MTD'].max().fillna(0) + 1).product() - 1
            YTD_df[investment_date][date] = h
    return YTD_df
I have hardcoded the investment dates list to simplify the problem statement. On my machine this code takes 2.5 to 3.5 seconds. Any suggestions on how I can speed it up?

Here's an approach that should be reasonably quick. It's quite possible there is something faster/cleaner, but this should be an improvement.
import numpy as np

# assuming a fixed number of investment dates, build a list
investment_dates = pd.date_range('2014-1-1', periods=12, freq='MS')

# build a table, by month, which contains the cumulative MTD
# return for each investment date. Still have to loop over the investment dates,
# but don't need to loop over each daily value
running_mtd = []
for date in investment_dates:
    curr_mo = (df[df.index >= date].groupby('Month')['MTD'].last() + 1.).cumprod()
    curr_mo.name = date
    running_mtd.append(curr_mo)

running_mtd_df = pd.concat(running_mtd, axis=1)
running_mtd_df = running_mtd_df.shift(1).fillna(1.)

# merge running mtd returns with base dataframe
df = df.merge(running_mtd_df, left_on='Month', right_index=True)

# calculate ytd return for each column / day, by multiplying the running
# monthly return with the current MTD value
for date in investment_dates:
    df[date] = np.where(df.index < date, np.nan, df[date] * (1. + df['MTD']) - 1.)
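To end up with a lookup table shaped like the slow function's output (daily dates as index, investment dates as columns), the merged frame can simply be restricted to the investment-date columns. A minimal sketch, reusing the df and investment_dates built above; the commented-out comparison against calculate_YTD is only an optional sanity check, and YTD_fast is just an illustrative name:
# keep only the per-investment-date YTD columns; the index is the daily date range
YTD_fast = df[list(investment_dates)].copy()

# optional sanity check against the slow implementation (note the fast version
# has NaN before each investment date, where the slow version has values)
# YTD_slow = calculate_YTD(df[['Month', 'MTD']])
# print((YTD_fast - YTD_slow).abs().max())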

Related

Python pandas rolling computations with custom step size

I have a pandas dataframe with daily data. On the last day of each month, I would like to compute a quantity that depends on the daily data of the previous n months (e.g., n=3).
My current solution is to use the pandas rolling function to compute this quantity for every day and then only keep the values for the last day of each month (discarding all the others). This, however, means I perform a lot of unnecessary computations.
Does anybody know how I can improve on that?
Thanks a lot in advance!
EDIT:
In the following, I add two examples. In both cases, I compute rolling regressions of stock returns. The first (short) example shows the problem described above and is a sub-problem of my actual problem. The second (long) example shows my actual problem. Therefore, I would either need a solution to the first example that can be embedded in my algorithm for solving the second example, or a completely different solution to the second example. Note: the dataframe I'm using is very large, which means that keeping multiple copies of the entire dataframe is not feasible.
Example 1:
import pandas as pd
import numpy as np
import random
import statsmodels.api as sm

# Generate a time index
dates = pd.date_range("2018-01-01", periods=365, freq="D", name='date')
df = pd.DataFrame(index=dates, columns=['Y', 'X']).sort_index()

# Generate Data
df['X'] = np.array(range(0, 365))
df['Y'] = 3.1 * df['X'] - 2.5
df = df.iloc[random.sample(range(365), 280)]  # some days are missing
df.iloc[random.sample(range(280), 20), 0] = np.nan  # some observations are missing
df = df.sort_index()

# Compute Beta
def estimate_beta(ser):
    return sm.OLS(df.loc[ser.index, 'Y'], sm.add_constant(df.loc[ser.index, 'X']), missing='drop').fit().params[-1]

df['beta'] = df['Y'].rolling('60D', min_periods=10).apply(estimate_beta)  # use last 60 days and require at least 10 observations

# Get last entries per month
df_monthly = df[['beta']].groupby([pd.Grouper(freq='M', level='date')]).agg('last')
df_monthly
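One way to avoid the per-day rolling in Example 1 is to loop only over the month-end dates that are actually needed and fit the regression on each trailing 60-day window directly. This is a sketch under that assumption, not from the original post; it reuses df and sm from Example 1, and the names month_ends, betas and df_monthly_fast are illustrative:
# month-end (last available day per month) dates present in the data
month_ends = df.groupby(pd.Grouper(freq='M', level='date')).tail(1).index

betas = {}
for end in month_ends:
    # trailing window of roughly 60 calendar days, ending at the month end
    window = df.loc[end - pd.Timedelta('59D'):end].dropna(subset=['Y'])
    if len(window) >= 10:  # mimic min_periods=10
        betas[end] = sm.OLS(window['Y'], sm.add_constant(window['X']),
                            missing='drop').fit().params.iloc[-1]

df_monthly_fast = pd.Series(betas, name='beta').to_frame()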
Example 2:
import pandas as pd
import numpy as np
from pandas import IndexSlice as idx
import random
import statsmodels.api as sm

# Generate a time index
dates = pd.date_range("2018-01-01", periods=365, freq="D", name='date')
arrays = [dates.tolist() + dates.tolist(), ["10000"] * 365 + ["10001"] * 365]
index = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=["Date", "Stock"])
df = pd.DataFrame(index=index, columns=['Y', 'X']).sort_index()

# Generate Data
df.loc[idx[:, "10000"], 'X'] = X = np.array(range(0, 365)).astype(float)
df.loc[idx[:, "10000"], 'Y'] = 3 * X - 2
df.loc[idx[:, "10001"], 'X'] = X
df.loc[idx[:, "10001"], 'Y'] = -X + 1
df = df.iloc[random.sample(range(365 * 2), 360 * 2)]  # some days are missing
df.iloc[random.sample(range(280 * 2), 20 * 2), 0] = np.nan  # some observations are missing

# Estimate beta
def estimate_beta_grouped(df_in):
    def estimate_beta(ser):
        return sm.OLS(df.loc[ser.index, 'Y'].astype(float), sm.add_constant(df.loc[ser.index, 'X'].astype(float)), missing='drop').fit().params[-1]
    df = df_in.droplevel('Stock').reset_index().set_index(['Date']).sort_index()
    df['beta'] = df['Y'].rolling('60D', min_periods=10).apply(estimate_beta)
    return df[['beta']]

df_beta = df.groupby(level='Stock').apply(estimate_beta_grouped)

# Extract beta at last day per month
df_monthly = df.groupby([pd.Grouper(freq='M', level='Date'), df.index.get_level_values(1)]).agg('last')  # get last observations
df_monthly = df_monthly.merge(df_beta, left_index=True, right_index=True, how='left')  # merge beta on df_monthly
df_monthly
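The same month-end-only idea could be applied per stock in Example 2 by running the loop inside the groupby. Again only a rough sketch under the same assumptions, with month_end_betas and df_monthly_beta as illustrative names:
def month_end_betas(df_in):
    # df_in: one stock's rows, indexed by (Date, Stock)
    d = df_in.droplevel('Stock').sort_index()
    month_ends = d.groupby(pd.Grouper(freq='M')).tail(1).index
    out = {}
    for end in month_ends:
        window = d.loc[end - pd.Timedelta('59D'):end].dropna(subset=['Y'])
        if len(window) >= 10:
            out[end] = sm.OLS(window['Y'].astype(float),
                              sm.add_constant(window['X'].astype(float)),
                              missing='drop').fit().params.iloc[-1]
    return pd.Series(out, name='beta')

df_monthly_beta = df.groupby(level='Stock').apply(month_end_betas)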

Merge 1min, 5min and Daily OHLC dataframes WITHOUT upsampling the 1min?

To make this simple, let's say I have 3 datasets, a 1min OHLCV dataset, a 5min, and a Daily. Given these 3 datasets:
1min
5min
Daily
...how can I merge them and turn them into something like this using Pandas/Python 3?
As you can see, every time a new 5 mins is hit going down the 1min dataframe, the matching time from the 5min chart <= the 1min time gets added to their respective columns. I've left the blanks in there to help visualize what's happening, but for simplicity's sake, I can just forward fill the values. No backfilling so as to not introduce a lookahead bias. The daily value would be the previous day's OHLCV data, except for 4:00 PM, that would be the current day's data. This is because of how Alpha Vantage structures their dataframes. I also have a 15min and 60min dataset to go along with this, but I think once the 1 to 5min merge is done, the same logic could apply.
I have included reproducible code below to get the exact dataframes I'm working with; however, you have to pip install alpha_vantage and get a free API key from here.
SIMPLE DESIRED SOLUTION - What I've outlined above.
ADVANCED DESIRED SOLUTION - Rather than forward filling the 5min data, ideally those empty spaces would consist of the respective running prices. For example, the next empty open price in the 5min_open column would be the same as the 1min_open at the beginning of that 5-minute bar. Think of something like:
conversion = {'Open' : 'first', 'High' : 'max', 'Low' : 'min', 'Close' : 'last', 'Volume' : 'sum'}
That way the empty spaces in the 5min columns are updating as they should, but the simple solution would suffice.
I don't want to just upsample the 1min dataframe, because I want to include all of the previous data from a past daily dataframe; upsampling the 1min to a daily dataset would lose a bunch of past daily data. So I want the solution to NOT involve upsampling, but rather merging of the different df's by datetime. Thanks!
Windows 10, Python 3:
from alpha_vantage.timeseries import TimeSeries
import pandas as pd
import sys, os
import os.path
from time import sleep

# Make a historical folder if it's not created already
if not os.path.exists("data/historical/"):
    os.makedirs("data/historical/")

api_call_limit_daily = 500
total_api_count = 0
timeframes = ['1min', '5min', '15min', '60min', 'daily']

ts = TimeSeries(key='YOUR_API_KEY_HERE', output_format='csv')
tickers = ['SPY']

for ticker in tickers:
    # Ensure uppercase
    ticker = str(ticker.upper())
    for timeframe in timeframes:
        print("Downloading dataset for", ticker, "-", timeframe, "...")
        if total_api_count > api_call_limit_daily:
            print("Daily API call limit reached. Closing...")
            sys.exit(0)
        if timeframe != 'daily':
            data, meta_data = ts.get_intraday_extended(symbol=ticker, interval=timeframe)
        elif timeframe == 'daily':
            data, meta_data = ts.get_daily(symbol=ticker)
        # Convert the data object to a dataframe, and reverse the order so oldest date's on top
        csv_list = list(data)
        data = pd.DataFrame.from_records(csv_list[1:], columns=csv_list[0])
        data = data.iloc[::-1]
        if timeframe == 'daily':
            data = data.rename(columns={"timestamp": "time"})
        print(data)
        df = data.set_index('time')
        total_api_count += 1
        df.to_csv("data/historical/" + ticker + "_" + timeframe + ".csv", index=True)
        print("Success...")
        # Sleep if we're not through tickers/timeframes yet re: api limits
        sleep(15)

print("Done!")
UPDATE
I have frankenstein'd an iterative process for the simple solution that works for each dataframe, except for the daily one. Still trying to work that out. Basically, for any time on the 1min rows I want to display YESTERDAY's daily data, unless the time is >= 16:00:00, in which case it can be the current day's. I want that so as not to introduce lookahead bias. Anyway, here's the code that accomplishes what I'm looking for iteratively, but I'm hoping there's a faster/cleaner way to do this:
import numpy as np

# Merge all timeframes together into one dataframe
for ticker in tickers:
    # Ensure uppercase
    ticker = str(ticker.upper())
    for timeframe in timeframes:
        # Define the 1min timeframe as the main df we'll add other columns to
        if timeframe == "1min":
            main_df = pd.read_csv("./data/historical/" + ticker + "_" + timeframe + ".csv")
            main_df['time'] = pd.to_datetime(main_df['time'])
            continue
        # Now add in some nan's for the next timeframe's columns
        main_df[timeframe + "_open"] = np.nan
        main_df[timeframe + "_high"] = np.nan
        main_df[timeframe + "_low"] = np.nan
        main_df[timeframe + "_close"] = np.nan
        main_df[timeframe + "_volume"] = np.nan
        # read in the next timeframe's dataset
        df = pd.read_csv("./data/historical/" + ticker + "_" + timeframe + ".csv")
        df['time'] = pd.to_datetime(df['time'])
        # Rather than doing a double for loop to iterate through both datasets, just keep a counter
        # of what row we're at in the second dataframe. Used as a row locater
        curr_df_row = 0
        # Do this for all datasets except the daily one
        if timeframe != 'daily':
            # Iterate through the main_df
            for i in range(len(main_df)):
                # If the time in the main df is >= the current timeframe's row's df, add the values to their columns
                if main_df['time'].iloc[i] >= df['time'].iloc[curr_df_row]:
                    main_df[timeframe + "_open"].iloc[i] = df['open'].iloc[curr_df_row]
                    main_df[timeframe + "_high"].iloc[i] = df['high'].iloc[curr_df_row]
                    main_df[timeframe + "_low"].iloc[i] = df['low'].iloc[curr_df_row]
                    main_df[timeframe + "_close"].iloc[i] = df['close'].iloc[curr_df_row]
                    main_df[timeframe + "_volume"].iloc[i] = df['volume'].iloc[curr_df_row]
                    curr_df_row += 1
        # Daily dataset logic would go here

print(main_df)
main_df.to_csv("./TEST.csv", index=False)
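For the intraday part of this, pd.merge_asof may be a faster and cleaner alternative, since it performs exactly this kind of "most recent row at or before each timestamp" join without a Python-level loop. A minimal sketch, assuming the SPY csv files written by the download code above and sorted 'time' columns; the daily "previous day except at 16:00" rule would still need separate handling:
df_1min = pd.read_csv("./data/historical/SPY_1min.csv")
df_1min['time'] = pd.to_datetime(df_1min['time'])
df_1min = df_1min.sort_values('time')

df_5min = pd.read_csv("./data/historical/SPY_5min.csv")
df_5min['time'] = pd.to_datetime(df_5min['time'])
df_5min = df_5min.sort_values('time')
df_5min = df_5min.rename(columns={c: "5min_" + c for c in ['open', 'high', 'low', 'close', 'volume']})

# for each 1min row, attach the most recent 5min row whose time is <= that 1min time
merged = pd.merge_asof(df_1min, df_5min, on='time', direction='backward')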

Python: How to use date as an independent variable when running a regression? [duplicate]

It seems that for OLS linear regression to work well in Pandas, the arguments must be floats. I'm starting with a csv (called "gameAct.csv") of the form:
date, city, players, sales
2014-04-28,London,111,1091.28
2014-04-29,London,100,1100.44
2014-04-28,Paris,87,1001.33
...
I want to perform linear regression of how sales depend on date (as time moves forward, how do sales move?). The problem with my code below seems to be with dates not being float values. I would appreciate help on how to resolve this indexing problem in Pandas.
My current (non-working, but compiling) code:
import pandas as pd
from pandas import DataFrame, Series
import statsmodels.formula.api as sm
df = pd.read_csv('gameAct.csv')
df.columns = ['date', 'city', 'players', 'sales']
city_data = df[df['city'] == 'London']
result = sm.ols(formula = 'sales ~ date', data = city_data).fit()
As I vary the city value, I get R^2 = 1 results, which is wrong. I have also attempted index_col=0, parse_dates=True when defining the dataframe df, but without success.
I suspect there is a better way to read in such csv files to perform basic regression over dates, and also for more general time series analysis. Help, examples, and resources are appreciated!
Note, with the above code, if I convert the dates index (for a given city) to an array, the values in this array are of the form:
'\xef\xbb\xbf2014-04-28'
How does one produce an AIC analysis over all of the non-sales parameters? (e.g. the result might be that sales depend most linearly on date and city).
For this kind of regression, I usually convert the dates or timestamps to an integer number of days since the start of the data.
This does the trick nicely:
import numpy as np

df = pd.read_csv('test.csv')
df['date'] = pd.to_datetime(df['date'])
df['date_delta'] = (df['date'] - df['date'].min()) / np.timedelta64(1, 'D')
city_data = df[df['city'] == 'London']
result = sm.ols(formula='sales ~ date_delta', data=city_data).fit()
The advantage of this method is that you're sure of the units involved in the regression (days), whereas an automatic conversion may implicitly use other units, creating confusing coefficients in your linear model. It also allows you to combine data from multiple sales campaigns that started at different times into your regression (say you're interested in effectiveness of a campaign as a function of days into the campaign). You could also pick Jan 1st as your 0 if you're interested in measuring the day of year trend. Picking your own 0 date puts you in control of all that.
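As a small illustration of that last point, the zero date can be anything you like; a sketch, reusing the df from the snippet above (the new column names and the campaign date are only illustrative):
# days since the start of each row's own year, for a day-of-year trend
year_start = df['date'].dt.to_period('Y').dt.start_time
df['day_of_year'] = (df['date'] - year_start) / np.timedelta64(1, 'D')

# or days since an explicit campaign start date
campaign_start = pd.Timestamp('2014-04-01')
df['days_into_campaign'] = (df['date'] - campaign_start) / np.timedelta64(1, 'D')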
There's also evidence that statsmodels supports timeseries from pandas. You may be able to apply this to linear models as well:
http://statsmodels.sourceforge.net/stable/examples/generated/ex_dates.html
Also, a quick note:
You should be able to read column names directly out of the csv automatically as in the sample code I posted. In your example I see there are spaces between the commas in the first line of the csv file, resulting in column names like ' date'. Remove the spaces and automatic csv header reading should just work.
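Alternatively, pandas can strip that leading whitespace at read time; a small sketch, assuming the csv file name from the question:
# skipinitialspace drops the space after each comma, so the headers come out
# as 'date', 'city', 'players', 'sales' rather than ' date', ' city', ...
df = pd.read_csv('gameAct.csv', skipinitialspace=True)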
get date as floating point year
I prefer a date format that can be understood without context; hence the floating-point year representation.
The nice thing here is that the solution works at the numpy level, so it should be fast.
import numpy as np
import pandas as pd


def dt64_to_float(dt64):
    """Converts numpy.datetime64 to year as float.

    Rounded to days

    Parameters
    ----------
    dt64 : np.datetime64 or np.ndarray(dtype='datetime64[X]')
        date data

    Returns
    -------
    float or np.ndarray(dtype=float)
        Year in floating point representation
    """
    year = dt64.astype('M8[Y]')
    # print('year:', year)
    days = (dt64 - year).astype('timedelta64[D]')
    # print('days:', days)
    year_next = year + np.timedelta64(1, 'Y')
    # print('year_next:', year_next)
    days_of_year = (year_next.astype('M8[D]') - year.astype('M8[D]')
                    ).astype('timedelta64[D]')
    # print('days_of_year:', days_of_year)
    dt_float = 1970 + year.astype(float) + days / days_of_year
    # print('dt_float:', dt_float)
    return dt_float


if __name__ == "__main__":
    dates = np.array([
        '1970-01-01', '2014-01-01', '2020-12-31', '2019-12-31', '2010-04-28'],
        dtype='datetime64[D]')
    df = pd.DataFrame({
        'date': dates,
        'number': np.arange(5)
    })
    df['date_float'] = dt64_to_float(df['date'].to_numpy())
    print('df:', df, sep='\n')
    print()

    dt64 = np.datetime64("2011-11-11")
    print('dt64:', dt64_to_float(dt64))
output
df:
date number date_float
0 1970-01-01 0 1970.000000
1 2014-01-01 1 2014.000000
2 2020-12-31 2 2020.997268
3 2019-12-31 3 2019.997260
4 2010-04-28 4 2010.320548
dt64: 2011.8602739726027
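Tying this back to the original question, the float year can then be used directly as the regressor; a sketch, assuming the gameAct.csv from the question and the dt64_to_float function above:
import statsmodels.formula.api as smf

df = pd.read_csv('gameAct.csv', skipinitialspace=True)
df['date'] = pd.to_datetime(df['date'])
df['date_float'] = dt64_to_float(df['date'].to_numpy())

city_data = df[df['city'] == 'London']
result = smf.ols(formula='sales ~ date_float', data=city_data).fit()
print(result.params)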
I'm not sure about the specifics of statsmodels, but this post lists all the date/time conversions for Python. They aren't always one-to-one, so it's a reference I used often ;-)
(df.date - df.date.min()).dt.total_seconds()
If the data type of your date column is datetime64[ns], subtract a reference date first; the result is a timedelta, on which dt.total_seconds() is defined, and it returns the number of seconds as a float.

How to find sales in previous n months using groupby

I have a dataframe of daily sales:
import pandas as pd
date = ['28-01-2017','29-01-2017','30-01-2017','31-01-2017','01-02-2017','02-02-2017']
sales = [1,2,3,4,1,2]
ym = [201701,201701,201701,201701,201702,201702]
prev_1_ym = [201612,201612,201612,201612,201701,201701]
prev_2_ym = [201611,201611,201611,201611,201612,201612]
df_test = pd.DataFrame({'date': date,'ym':ym,'prev_1_ym':prev_1_ym,'prev_2_ym':prev_2_ym,'sales':sales})
df_test['date'] = pd.to_datetime(df_test['date'],format = '%d-%m-%Y')
I am trying to find the total sales in the previous 1 month, previous 2 months, etc.
My current approach is to use a list comprehension:
df_test['prev_1m_sales'] = [df_test.loc[df_test['ym'] == x, 'sales'].sum() for x in df_test['prev_1_ym']]
However, this proves to be very slow.
Is there a way to speed it up by using .groupby()?
You can use the date column to group your data. First change its data type to pandas Timestamps,
df_test['date'] = pd.to_datetime(df_test['date'])
then you can use it directly in grouping, for example
df_test.groupby(df_test['date'].dt.month)['sales'].sum().cumsum()
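A sketch that stays closer to the question's ym / prev_1_ym columns (an assumption on my part, not from the original answer): group once to get total sales per year-month, then map the previous-month keys onto that lookup.
# total sales per year-month, computed once
sales_by_ym = df_test.groupby('ym')['sales'].sum()

# look up the totals for the previous months; months missing from the data become 0
df_test['prev_1m_sales'] = df_test['prev_1_ym'].map(sales_by_ym).fillna(0)
df_test['prev_2m_sales'] = df_test['prev_2_ym'].map(sales_by_ym).fillna(0)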

How do I make this function iterable (getting indexerror)

I am fairly new to python and coding in general.
I have a big data file that provides daily data for the period 2011-2018 for a number of stock tickers (~300).
The data is a .csv file with circa 150k rows and looks as follows (short example):
Date,Symbol,ShortExemptVolume,ShortVolume,TotalVolume
20110103,AAWW,0.0,28369,78113.0
20110103,AMD,0.0,3183556,8095093.0
20110103,AMRS,0.0,14196,18811.0
20110103,ARAY,0.0,31685,77976.0
20110103,ARCC,0.0,177208,423768.0
20110103,ASCMA,0.0,3930,26527.0
20110103,ATI,0.0,193772,301287.0
20110103,ATSG,0.0,23659,72965.0
20110103,AVID,0.0,7211,18896.0
20110103,BMRN,0.0,21740,213974.0
20110103,CAMP,0.0,2000,11401.0
20110103,CIEN,0.0,625165,1309490.0
20110103,COWN,0.0,3195,24293.0
20110103,CSV,0.0,6133,25394.0
I have a function that allows me to filter for a specific symbol and get 10 observations before and after a specified date (could be any date between 2011 and 2018).
import pandas as pd
from datetime import datetime
import urllib
import datetime

def get_data(issue_date, stock_ticker):
    df = pd.read_csv(r'D:\Project\Data\Short_Interest\exampledata.csv')
    df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
    d = df
    df = pd.DataFrame(d)
    short = df.loc[df.Symbol.eq(stock_ticker)]
    # get the index of the row of interest
    ix = short[short.Date.eq(issue_date)].index[0]
    # get the item row for that row's index
    iloc_ix = short.index.get_loc(ix)
    # get the +/-10 iloc rows (+11 because that is how slices work), basically +10 and -10 trading days
    short_data = short.iloc[iloc_ix-10: iloc_ix+11]
    return [short_data]
I want to create a script that iterates over a list of issue_dates and stock_tickers. The list (a .csv) looks as follows:
ARAY,07/08/2017
ARAY,24/04/2014
ACETQ,16/11/2015
ACETQ,16/11/2015
NVLNA,15/08/2014
ATSG,29/09/2017
ATI,24/05/2016
MDRX,18/06/2013
MDRX,18/06/2013
AMAGX,10/05/2017
AMAGX,14/02/2014
AMD,14/09/2016
To break down my problem and question I would like to know how to do the following:
First, how do I load the inputs?
Second, how do I call the function on each input?
And last, how do I accumulate all the function returns in one dataframe?
To load the inputs and call the function for each row, iterate over the csv file, pass each row's values to the function, and accumulate the results in a list.
I modified your function a bit: removed the DataFrame creation so it is only done once, and added a try/except block to account for missing dates or tickers (your example data didn't match up too well). The dates in the second csv look like they are day/month/year, so I parsed them with that format.
import pandas as pd
import datetime, csv

def get_data(df, issue_date, stock_ticker):
    '''Return a Series for the ticker centered on the issue date.
    '''
    short = df.loc[df.Symbol.eq(stock_ticker)]
    # get the index of the row of interest
    try:
        ix = short[short.Date.eq(issue_date)].index[0]
        # get the item row for that row's index
        iloc_ix = short.index.get_loc(ix)
        # get the +/-10 iloc rows (+11 because that is how slices work), basically +10 and -10 trading days
        short_data = short.iloc[iloc_ix-10: iloc_ix+11]
    except IndexError:
        msg = f'no data for {stock_ticker} on {issue_date}'
        #log.info(msg)
        print(msg)
        short_data = None
    return short_data

df = pd.read_csv(datafile)  # datafile: path to the daily data csv from the question
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")

results = []
with open('issues.csv') as issues:
    for ticker, date in csv.reader(issues):
        day, month, year = map(int, date.split('/'))
        # dt = datetime.datetime.strptime(date, r'%d/%m/%Y')
        date = datetime.date(year, month, day)
        s = get_data(df, date, ticker)
        results.append(s)
        # print(s)
Creating a single DataFrame or table for all that info may be problematic, especially since the date ranges are all different. You should probably ask a separate question regarding that; its mcve should probably just include a few minimal Pandas Series with a couple of different date ranges and tickers.
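That said, if a single stacked table is still wanted, one straightforward option is to label each slice with its ticker and issue date and concatenate them. A rough sketch (an assumption, not part of the original answer), reusing df, get_data and issues.csv from above; IssueDate and IssueTicker are illustrative column names:
labelled = []
with open('issues.csv') as issues:
    for ticker, date in csv.reader(issues):
        day, month, year = map(int, date.split('/'))
        issue_date = datetime.date(year, month, day)
        s = get_data(df, issue_date, ticker)
        if s is not None:
            labelled.append(s.assign(IssueDate=issue_date, IssueTicker=ticker))

# one long dataframe; each slice keeps its own rows, tagged by ticker and issue date
all_data = pd.concat(labelled, ignore_index=True)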
