turning a for loop into a dataframe.apply problem - python

This is my first ever question on here, so please forgive me if I don't explain it clearly, or overexplain. The task is to turn a for loop that contains 2 if statements into dataframe.apply instead of the loop. I thought the way to do it was to turn the if statements inside the for loop into a defined function, then call the function in the .apply line, but I can only get so far. I'm not even sure I am tackling this the right way. I can provide the original for loop code if necessary. Thanks in advance.
The goal is to import a csv of stock prices, compare the prices in one column to a moving average (which needed to be created), and if > MA, buy; if < MA, sell. Keep track of all buys/sells and determine overall wealth/return at the end. It worked as a for loop: for each x in prices, apply the 2 ifs and append prices to a list to determine ending wealth. I think I get to the point where I call the defined function in the .apply line, and it errors out. In my code below there may still be some unnecessary code lingering from the for loop version, but it shouldn't interfere with the .apply attempt; it just makes for messy code until I figure it out.
df2 = pd.read_csv("MSFT.csv", index_col=0, parse_dates=True).sort_index(axis=0, ascending=True)  #could get yahoo to work but not quandl, so imported the csv file from class

buyPrice = 0
sellPrice = 0
maWealth = 1.0
cash = 1
stock = 0
sma = 200

ma = np.round(df2['AdjClose'].rolling(window=sma, center=False).mean(), 2)  #to create the moving average to compare to
n_days = len(df2['AdjClose'])
closePrices = df2['AdjClose']  #to only work with one column from original csv import

buy_data = []
sell_data = []
trade_price = []
wealth = []

def myiffunc(adjclose):
    if closePrices > ma and cash == 1:  # Buy if stock price > MA & if not bought yet
        buyPrice = closePrices[0 + 1]
        buy_data.append(buyPrice)
        trade_price.append(buyPrice)
        cash = 0
        stock = 1
    if closePrices < ma and stock == 1:  # Sell if stock price < MA and if you have a stock to sell
        sellPrice = closePrices[0 + 1]
        sell_data.append(sellPrice)
        trade_price.append(sellPrice)
        cash = 1
        stock = 0
        wealth.append(1 * (sellPrice / buyPrice))

closePrices.apply(myiffunc)

Checking the docs for apply, it seems like you need to use the axis=1 version to process one row at a time, and pass two columns: the moving average and the closing price.
Something like this:
df2 = ...
df2['MovingAverage'] = ...

have_shares = False

def my_func(row):
    global have_shares
    if not have_shares and row['AdjClose'] > row['MovingAverage']:
        # buy shares
        have_shares = True
    elif have_shares and row['AdjClose'] < row['MovingAverage']:
        # sell shares
        have_shares = False
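Hooked up row-wise, the call would then be:
df2.apply(my_func, axis=1)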
However, it's worth pointing out that you can do the comparisons using numpy/pandas as well, just storing the results in another column:
df2['BuySignal'] = (df2.AdjClose > df2.MovingAverage)
df2['SellSignal'] = (df2.AdjClose < df2.MovingAverage)
Then you could .apply() a function that made use of the Buy/Sell signal columns.
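For example, here is a minimal sketch of one way to turn those signal columns into a position and a wealth series with no apply at all (the carry-forward trade logic is my assumption, not the only possible rule):
import numpy as np
import pandas as pd

# 1.0 while holding the stock, 0.0 while in cash; ffill carries the
# last buy/sell signal forward so the position is held between trades
position = pd.Series(np.nan, index=df2.index)
position[df2['BuySignal']] = 1.0
position[df2['SellSignal']] = 0.0
df2['Position'] = position.ffill().fillna(0.0)

# Daily strategy return: yesterday's position times today's price change
df2['Return'] = df2['Position'].shift(1) * df2['AdjClose'].pct_change()
df2['Wealth'] = (1.0 + df2['Return'].fillna(0.0)).cumprod()  # starts at 1.0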

Related

How do you perform conditional operations on different elements in a Pandas DataFrame?

Let's say I have a Pandas Dataframe of the price and stock history of a product at 10 different points in time:
df = pd.DataFrame(index=[np.arange(10)])
df['price'] = 10,10,11,15,20,10,10,11,15,20
df['stock'] = 30,20,13,8,4,30,20,13,8,4
df
price stock
0 10 30
1 10 20
2 11 13
3 15 8
4 20 4
5 10 30
6 10 20
7 11 13
8 15 8
9 20 4
How do I perform operations between specific rows that meet certain criteria?
In my example row 0 and row 5 meet the criteria "stock over 25" and row 4 and row 9 meet the criteria "stock under 5".
I would like to calculate:
df['price'][4] - df['price'][0] and
df['price'][9] - df['price'][5]
but not
df['price'][9] - df['price'][0] or
df['price'][4] - df['price'][5].
In other words, I would like to calculate the price change between the most recent event where stock was under 5 vs the most recent event where stock was over 25; over the whole series.
Of course, I would like to do this over larger datasets where picking them manually is not good.
First, set up data frame and add some calculations:
import pandas as pd
import numpy as np
df = pd.DataFrame(index=[np.arange(10)])
df['price'] = 10,10,11,15,20,10,10,11,15,20
df['stock'] = 30,20,13,8,4,30,20,13,8,4
df['stock_under_5'] = df['stock'] < 5
df['stock_over_25'] = df['stock'] > 25
df['cum_stock_under_5'] = df['stock_under_5'].cumsum()
df['change_stock_under_5'] = df['cum_stock_under_5'].diff()
df['change_stock_under_5'].iloc[0] = df['stock_under_5'].iloc[0]*1
df['next_row_change_stock_under_5'] = df['change_stock_under_5'].shift(-1)
df['cum_stock_over_25'] = df['stock_over_25'].cumsum()
df['change_stock_over_25'] = df['cum_stock_over_25'].diff()
df['change_stock_over_25'].iloc[0] = df['stock_over_25'].iloc[0]*1
df['next_row_change_stock_over_25'] = df['change_stock_over_25'].shift(-1)
df['row'] = np.arange(df.shape[0])
df['next_row'] = df['row'].shift(-1)
df['next_row_price'] = df['price'].shift(-1)
Next we find all windows where either the stock went over 25 or below 5 by grouping over the cumulative marker of those events.
changes = (
    df.groupby(['cum_stock_under_5', 'cum_stock_over_25'])
      .agg({'row': 'first', 'next_row': 'last', 'change_stock_under_5': 'max', 'change_stock_over_25': 'max',
            'next_row_change_stock_under_5': 'max', 'next_row_change_stock_over_25': 'max',
            'price': 'first', 'next_row_price': 'last'})
      .assign(price_change=lambda x: x['next_row_price'] - x['price'])
      .reset_index(drop=True)
)
For each window we find what happened at the beginning of the window: if change_stock_under_5 = 1 it means the window started with the stock going under 5, if change_stock_over_25 = 1 it started with the stock going over 25.
Same for the end of the window using the columns next_row_change_stock_under_5 and next_row_change_stock_over_25
Now, we can readily extract the stock price change in rows where the stock went from being over 25 to being under 5:
from_over_to_below = changes[(changes['change_stock_over_25']==1) & (changes['next_row_change_stock_under_5']==1)]
and the other way around:
from_below_to_over = changes[(changes['change_stock_under_5']==1) & (changes['next_row_change_stock_over_25']==1)]
You can for example calculate the average price change when the stock went from over 25 to below 5:
from_over_to_below.price_change.mean()
To give a better explanation, I will separate the approach into two different functions:
The first one will be the event detection; let's call it detect_event.
The second one will calculate the price change between the current event and the previous one in the list generated by the first function. We will call it calculate_price_change.
Starting with the first function, the key here is to understand very well the goals we want to reach and the constraints/conditions we want to satisfy.
I will leave two of the potential options, given the various interpretations of the question:
A. The first follows my initial understanding.
B. The second is one of the interpretations one could draw from Iyar Lin's comment (I can see more interpretations, but won't consider them in this answer as the approach would be similar).
Within option A, we will create a function to detect where the stock is under 5 or over 25:
def detect_event(df):
    # Create a list of the indexes of the events where stock was under 5 or over 25
    events = []
    # Loop through the dataframe
    for i in range(len(df)):
        # If stock is under 5, add the index to the list
        if df['stock'][i] < 5:
            events.append(i)
        # If stock is over 25, add the index to the list
        elif df['stock'][i] > 25:
            events.append(i)
    # Return the list of indexes of the events where stock was under 5 or over 25
    return events
The comments make it self-explanatory, but, basically, this will return a list of indexes of the rows where stock is under 5 or over 25.
With OP's df this will return
events = detect_event(df)
[Out]:
[0, 4, 5, 9]
Within option B, assuming one wants to know the events where the stock went from under 5 to over 25, or vice-versa, consecutively (there are more ways to interpret this), one can use the following function
def detect_event(df):
    # Create a list of the indexes of the events that satisfy the conditions
    events = []
    for i, stock in enumerate(df['stock']):
        # If the index is 0, add the index of the first event to the list of events
        if i == 0:
            events.append(i)
        # If the index is not 0, check if the stock went from over 25 to under 5 or from under 5 to over 25
        else:
            # If the stock went from over 25 to under 5, add the index of the event to the list of events
            if stock < 5 and df['stock'][i-1] > 25:
                events.append(i)
            # If the stock went from under 5 to over 25, add the index of the event to the list of events
            elif stock > 25 and df['stock'][i-1] < 5:
                events.append(i)
    # Return the list of events
    return events
With OP's df this will return
events = detect_event(df)
[Out]:
[0, 5]
Note that 0 is the index of the first row, which we append by default.
As for the second function: once the conditions are well defined and the first function, detect_event, has been adapted accordingly, we can detect the price changes between the events that satisfy those conditions. For that, one will use a different function: calculate_price_change.
This function takes both the dataframe df and the list events generated by the first function, and returns a list with the price differences.
def calculate_price_change(df, events):
    # Create a list to store the price change between the most recent event where stock was under 5 vs the most recent event where stock was over 25
    price_change = []
    # Loop through the list of indexes of the events
    for i, event in enumerate(events):
        # If the index is 0, the price change is 0
        if i == 0:
            price_change.append(0)
        # If the index is not 0, calculate the price change between the current and past events
        else:
            price_change.append(df['price'][event] - df['price'][events[i-1]])
    return price_change
Now, if one calls this last function with the df and the list created by the first function, detect_event, one gets the following
price_change = calculate_price_change(df, events)
[Out]:
[0, 10, -10, 10]
Notes:
As it is, the question gives room for multiple interpretations. That's why I initially flagged it as "Needs details or clarity". For the future, one might want to review How do I ask a good question? and its hyperlinks.
I understand that sometimes we won't be able to specify everything that we want (as we might not even know, for various reasons), so communication is key. Therefore, I appreciate Iyar Lin's time and contributions, as they helped improve this answer.

Assign Variables Based on Dataframe Values

I have a dataframe in which I am trying to define variables according to the values of particular cells in the dataframe in order to populate a (currently) empty final column based on the relationship between the price targets and current prices of the companies. Currently, the dataframe I’m working with looks like this, with the index being the companies:
Company     Current Price   High      Median    Low       Suggest
Company 1   $296.12         $410.00   $398.00   $365.43
Company 2   $143.18         $212.05   $200.34   $155.12
Company 3   $184.23         $214.09   $192.88   $123.63
How would I assign variables (for example: target_high(company) = the value in the “ticker, target_high” cell)? I don't think I can hard-code it because the list of companies will be constantly changing. So far I’ve tried the following, but it doesn’t seem to work:
for ticker in Company_List:
    target_high(company) = str(Target_Frame.loc[ticker, "High"])
    target_mid(company) = str(Target_Frame.loc[ticker, "Median"])
    target_low(company) = str(Target_Frame.loc[ticker, "Low"])
    current_price = str(Target_Frame.loc[ticker, "Price"])
    if current_price(ticker) > target_high(ticker):
        Target_Frame.loc[[ticker], ['Suggest']] = "Sell"
    elif current_price(ticker) < target_low(ticker):
        Target_Frame.loc[[ticker], ['Suggest']] = "Buy"
    elif target_mid(ticker) < current_price(ticker) < target_high(ticker):
        Target_Frame.loc[[ticker], ['Suggest']] = "Hold"
    elif target_low(ticker) < current_price(ticker) < target_mid(ticker):
        Target_Frame.loc[[ticker], ['Suggest']] = "Consider"
Thank you!
You could use np.where or .map (see this question) with the conditions inside, rather than creating separate variables. So maybe something like this:
Target_Frame["Suggest"] = np.where(
Target_Frame["Price"] > Target_Frame["High"], "Sell", # if above high then sell
np.where(Target_Frame["Price"] < Target_Frame["Low"], "Buy", # if below low then buy
np.where(Target_Frame["Price"].between(
Target_Frame["Median"], Target_Frame["High"]), "Hold", # if between median and high then hold
"Consider"))) # else consider

Pandas- locate a value based on logical statements

I am using this dataset for a project.
I am trying to find the total yield for each inverter over the 34 day duration of the dataset (basically use the final and initial value available for each inverter). I have been able to get the list of inverters using pd.unique() (there are 22 inverters for each solar power plant).
I am having trouble querying the total_yield data for each inverter.
Here is what I have tried:
def get_yields(arr: np.ndarray, df: pd.core.frame.DataFrame) -> np.ndarray:
    delta = np.zeros(len(arr))
    index = 0
    for i in arr:
        initial = df.loc[df["DATE_TIME"]=="15-05-2020 02:00"]
        initial = initial.loc[initial["INVERTER_ID"]==i]
        initial.reset_index(inplace=True, drop=True)
        initial = initial.at[0, "TOTAL_YIELD"]
        final = df.loc[(df["DATE_TIME"]=="17-06-2020 23:45")]
        final = final.loc[final["INVERTER_ID"]==i]
        final.reset_index(inplace=True, drop=True)
        final = final.at[0, "TOTAL_YIELD"]
        delta[index] = final - initial
        index = index + 1
    return delta
Reference: arr is the array of inverters, listed below. df is the generation dataframe for each plant.
The problem is that not every inverter has a data point for each interval. This makes this function only work for the inverters at the first plant, not the second one.
My second approach was to filter by the inverter first, then take the first and last data points. But I get an error- 'Series' objects are mutable, thus they cannot be hashed
Here is the code for that so far:
def get_yields2(arr: np.ndarray, df: pd.core.frame.DataFrame) -> np.ndarray:
    delta = np.zeros(len(arr))
    index = 0
    for i in arr:
        initial = df.loc(df["INVERTER_ID"] == i)
        index += 1
        break
    return delta
List of inverters at plant 1 for reference(labeled as SOURCE_KEY):
['1BY6WEcLGh8j5v7' '1IF53ai7Xc0U56Y' '3PZuoBAID5Wc2HD' '7JYdWkrLSPkdwr4'
'McdE0feGgRqW7Ca' 'VHMLBKoKgIrUVDU' 'WRmjgnKYAwPKWDb' 'ZnxXDlPa8U1GXgE'
'ZoEaEvLYb1n2sOq' 'adLQvlD726eNBSB' 'bvBOhCH3iADSZry' 'iCRJl6heRkivqQ3'
'ih0vzX44oOqAx2f' 'pkci93gMrogZuBj' 'rGa61gmuvPhdLxV' 'sjndEbLyjtCKgGv'
'uHbuxQJl8lW7ozc' 'wCURE6d3bPkepu2' 'z9Y9gH1T5YWrNuG' 'zBIq5rxdHJRwDNY'
'zVJPv84UY57bAof' 'YxYtjZvoooNbGkE']
List of inverters at plant 2:
['4UPUqMRk7TRMgml' '81aHJ1q11NBPMrL' '9kRcWv60rDACzjR' 'Et9kgGMDl729KT4'
'IQ2d7wF4YD8zU1Q' 'LYwnQax7tkwH5Cb' 'LlT2YUhhzqhg5Sw' 'Mx2yZCDsyf6DPfv'
'NgDl19wMapZy17u' 'PeE6FRyGXUgsRhN' 'Qf4GUc1pJu5T6c6' 'Quc1TzYxW2pYoWX'
'V94E5Ben1TlhnDV' 'WcxssY2VbP4hApt' 'mqwcsP2rE7J0TFp' 'oZ35aAeoifZaQzV'
'oZZkBaNadn6DNKz' 'q49J1IKaHRwDQnt' 'rrq4fwE8jgrTyWY' 'vOuJvMaM2sgwLmb'
'xMbIugepa2P7lBB' 'xoJJ8DcxJEcupym']
Thank you very much.
I can't download the dataset to test this; I'm getting a "Too Many Requests" error.
However, you should be able to do this with a groupby.
import pandas as pd
result = df.groupby('INVERTER_ID')['TOTAL_YIELD'].agg(['max','min'])
result['delta'] = result['max']-result['min']
print(result[['delta']])
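Note that max/min only equals last/first here because TOTAL_YIELD is cumulative and therefore non-decreasing. If you want literally the first and last readings per inverter, sort by time and aggregate with 'first'/'last' instead (assuming DATE_TIME has been parsed to datetimes):
result = (df.sort_values('DATE_TIME')
            .groupby('INVERTER_ID')['TOTAL_YIELD']
            .agg(['first', 'last']))
result['delta'] = result['last'] - result['first']
print(result[['delta']])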
So if I'm understanding this right, what you want is the TOTAL_YIELD for each inverter over the time period starting 15-05-2020 02:00 and ending 17-06-2020 23:45. Try this:
# enumerate lets you have an index value along with iterating through the array
for i, code in enumerate(arr):
    # filter the info to between the two dates, but not necessarily assuming that
    # each inverter's data starts and ends at each date
    inverter_df = df.loc[df['DATE_TIME'] >= pd.to_datetime('15-05-2020 02:00:00')]
    inverter_df = inverter_df.loc[inverter_df['DATE_TIME'] <= pd.to_datetime('17-06-2020 23:45:00')]
    inverter_df = inverter_df.loc[inverter_df["INVERTER_ID"] == code]
    # sort by date
    inverter_df = inverter_df.sort_values(by='DATE_TIME')
    # grab TOTAL_YIELD at the first available date
    initial = inverter_df['TOTAL_YIELD'].iloc[0]
    # grab TOTAL_YIELD at the last available date
    final = inverter_df['TOTAL_YIELD'].iloc[-1]
    delta[i] = final - initial

Efficient way to loop through GroupBy DataFrame

Since my last post lacked information, here is an example of my df (the important columns):
deviceID: unique ID of the vehicle. Vehicles send data every X minutes.
mileage: the distance moved since the last message (in km)
position_timestamp_measure: unix timestamp of the time the dataset was created.

deviceID  mileage  position_timestamp_measure
54672     10       1600696079
43423     20       1600696079
42342     3        1600701501
54672     3        1600702102
43423     2        1600702701
My goal is to validate the mileage by comparing it to the vehicle's maximum speed (80 km/h): I calculate the vehicle's speed from the timestamps and the mileage. The result should then be written back into the original dataset.
What I've done so far is the following:
df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
#create new col and set all values to false
df_ori['valid'] = 0

for group_name, group in df:
    #sort group by time
    group = group.sort_values(by='position_timestamp_measure')
    group = group.reset_index()
    #since I can't validate the first point in the group, I set it to valid
    df_ori.loc[df_ori.index == group.dataIndex.values[0], 'validPosition'] = 1
    #iterate through each data point in the group
    for i in range(1, len(group)):
        timeGoneSec = abs(group.position_timestamp_measure.values[i] - group.position_timestamp_measure.values[i-1])
        timeHours = (timeGoneSec/60)/60
        #calculate speed
        if (group.mileage.values[i]/timeHours) < maxSpeedKMH:
            df_ori.loc[df_ori.index == group.dataIndex.values[i], 'validPosition'] = 1

df_ori.validPosition.value_counts()
It definitely works the way I want it to, but performance is poor. The df contains nearly 700k rows (already cleaned). I am still a beginner and can't figure out a better solution. I would really appreciate any help.
If I got it right, no for-loops are needed here. Here is what I've transformed your code into:
df_ori['dataIndex'] = df_ori.index
#create new col and set all values to false
df_ori['valid'] = 0
df_ori = df_ori.sort_values(['position_timestamp_measure'])
# Subtract preceding values from current value, per device
df_ori['timeGoneSec'] = df_ori.groupby('device_id')['position_timestamp_measure'].diff()
# The operation above will produce NaN values for the first row in each group;
# fill 'valid' with 1 there, as in the original code
df_ori.loc[df_ori['timeGoneSec'].isna(), 'valid'] = 1
df_ori['timeHours'] = df_ori['timeGoneSec']/3600  # 60*60 = 3600
df_ori['flag'] = (df_ori['mileage'] / df_ori['timeHours']) <= maxSpeedKMH
df_ori.loc[df_ori['flag'], 'valid'] = 1
# Remove helper columns
df_ori = df_ori.drop(columns=['flag', 'timeHours', 'timeGoneSec'])
The basic idea is to use vectorized operations as much as possible and to avoid for loops (typically row-by-row iteration), which can be insanely slow.
Since I can't get the context of your code, please double check the logic and make sure it works as desired.
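For illustration, here is the groupby-diff pattern on a tiny made-up frame (data invented for the example):
import pandas as pd

toy = pd.DataFrame({
    'device_id': [54672, 43423, 54672, 43423],
    'position_timestamp_measure': [1600696079, 1600696079, 1600702102, 1600702701],
    'mileage': [10, 20, 3, 2],
}).sort_values('position_timestamp_measure')

# seconds elapsed since the previous message of the *same* device
toy['timeGoneSec'] = toy.groupby('device_id')['position_timestamp_measure'].diff()
print(toy)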

How do I avoid a loop with Python/Pandas to build an equity curve?

I am trying to build an equity curve in Python using Pandas. For those not in the know, an equity curve is a cumulative tally of investing profits/losses day by day. The code below works but it is incredibly slow. I've tried to build an alternative using Pandas .iloc and such, but nothing is working. I'm not sure if it is possible to do this outside of a loop given how I have to reference the prior row(s).
for today in range(len(f1)):  #initiate a loop that runs the length of the "f1" dataframe
    if today == 0:  #if the index value is zero (aka first row in the dataframe) then...
        f1.loc[today,'StartAUM'] = StartAUM  #set initial assets
        f1.loc[today,'Shares'] = 0  #dummy placeholder for shares; no trading on day 1
        f1.loc[today,'PnL'] = 0  #dummy placeholder for P&L; no trading day 1
        f1.loc[today,'EndAUM'] = StartAUM  #set ending AUM; should be beginning AUM since no trades
        continue  #and on to the second row in the dataframe
    yesterday = today - 1  #used to reference the rows (see below)
    f1.loc[today,'StartAUM'] = f1.loc[yesterday,'EndAUM']  #today's starting assets are yesterday's ending assets
    f1.loc[today,'Shares'] = f1.loc[yesterday,'EndAUM']//f1.loc[yesterday,'Shareprice']  #today's shares to trade = yesterday's assets // yesterday's share price
    f1.loc[today,'PnL'] = f1.loc[today,'Shares']*f1.loc[today,'Outcome1']  #our P&L is the shares traded (see prior line) multiplied by the outcome for 1 share
    #Note Outcome1 came from the dataframe before this loop >> for the purposes here its value is irrelevant
    f1.loc[today,'EndAUM'] = f1.loc[today,'StartAUM']+f1.loc[today,'PnL']  #ending assets are starting assets + today's P&L
There is a good example here: http://www.pythonforfinance.net/category/basic-data-analysis/ and I know that there is an example in Wes McKinney's book Python for Data Analysis. You might be able to find it here: http://wesmckinney.com/blog/python-for-financial-data-analysis-with-pandas/
Have you tried using iterrows() to construct the for loop?
for index, row in f1.iterrows():
    if index == 0:
        row['StartAUM'] = StartAUM  #set initial assets
        row['Shares'] = 0  #dummy placeholder for shares; no trading on day 1
        row['PnL'] = 0  #dummy placeholder for P&L; no trading day 1
        row['EndAUM'] = StartAUM  #set ending AUM; should be beginning AUM since no trades
        continue  #and on to the second row in the dataframe
    yesterday = index - 1  #used to reference the previous row
    row['StartAUM'] = f1.loc[yesterday, 'EndAUM']  #today's starting assets are yesterday's ending assets
    row['Shares'] = f1.loc[yesterday, 'EndAUM'] // f1.loc[yesterday, 'Shareprice']  #today's shares to trade = yesterday's assets // yesterday's share price
    row['PnL'] = row['Shares'] * row['Outcome1']  #our P&L is the shares traded multiplied by the outcome for 1 share
    #Note Outcome1 came from the dataframe before this loop >> for the purposes here its value is irrelevant
    row['EndAUM'] = row['StartAUM'] + row['PnL']  #ending assets are starting assets + today's P&L
The original code is probably slow because every .loc assignment indexes into f1 from scratch, while iterrows() walks the same dataframe row by row. Note, however, that the row object yielded by iterrows() is a copy, so assignments to it are not written back to f1; to persist the values you would still need f1.loc[index, ...] = ....
See more details about iterrows() here.
You need to vectorize the operations (don't iterate with for loops, but rather compute whole columns at once):
# fill the initial values
f1['StartAUM'] = StartAUM  # set initial assets
f1['Shares'] = 0  # dummy placeholder for shares; no trading on day 1
f1['PnL'] = 0  # dummy placeholder for P&L; no trading day 1
f1['EndAUM'] = StartAUM  # set ending AUM

# do the computations (vectorized)
f1['StartAUM'].iloc[1:] = f1['EndAUM'].iloc[:-1]
f1['Shares'].iloc[1:] = f1['EndAUM'].iloc[:-1] // f1['Shareprice'].iloc[:-1]
f1['PnL'] = f1['Shares'] * f1['Outcome1']
f1['EndAUM'] = f1['StartAUM'] + f1['PnL']
EDIT: this will not work correctly since StartAUM, EndAUM, Shares depend on each other and cannot be computed one without another. I didn't notice that before.
Can you try the following:
#import relevant modules
import pandas as pd
import numpy as np
from pandas_datareader import data
import matplotlib.pyplot as plt
#download data into DataFrame and create moving averages columns
f1 = data.DataReader('AAPL', 'yahoo',start='1/1/2017')
StartAUM = 1000000
#populate DataFrame with starting values
f1['Shares'] = 0
f1['PnL'] = 0
f1['EndAUM'] = StartAUM
#Set shares held to be the previous day's EndAUM divided by the previous day's closing price
f1['Shares'] = f1['EndAUM'].shift(1) / f1['Adj Close'].shift(1)
#Set the day's PnL to be the number of shares held multiplied by the change in closing price from yesterday to today's close
f1['PnL'] = f1['Shares'] * (f1['Adj Close'] - f1['Adj Close'].shift(1))
#Set day's ending AUM to be previous days ending AUM plus daily PnL
f1['EndAUM'] = f1['EndAUM'].shift(1) + f1['PnL']
#Plot the equity curve
f1['EndAUM'].plot()
Does the above solve your issue?
The solution was to use the Numba package. It performs the loop task in a fraction of the time.
https://numba.pydata.org/
The arguments/dataframe can be passed to the numba module/function. I will try to write up a more detailed explanation with code when time permits.
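In the meantime, here is a minimal sketch of the idea (my own illustration, not the exact code used): pass the relevant columns to an @njit-compiled function as NumPy arrays and run the same recurrence there.
import numpy as np
from numba import njit

@njit
def equity_curve(shareprice, outcome1, start_aum):
    n = shareprice.shape[0]
    shares = np.zeros(n)
    pnl = np.zeros(n)
    end_aum = np.empty(n)
    end_aum[0] = start_aum  # day 1: no trades
    for t in range(1, n):
        shares[t] = end_aum[t - 1] // shareprice[t - 1]  # yesterday's assets // yesterday's price
        pnl[t] = shares[t] * outcome1[t]
        end_aum[t] = end_aum[t - 1] + pnl[t]
    return shares, pnl, end_aum

# usage (column names as in the question):
# f1['Shares'], f1['PnL'], f1['EndAUM'] = equity_curve(
#     f1['Shareprice'].to_numpy(float), f1['Outcome1'].to_numpy(float), StartAUM)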
Thanks to all
In case others come across this, you can definitely make an equity curve without loops.
Dummy up some data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (13, 10)
# Some data to work with
np.random.seed(1)
stock = pd.DataFrame(
np.random.randn(100).cumsum() + 10,
index=pd.date_range('1/1/2020', periods=100, freq='D'),
columns=['Close']
)
stock['ma_5'] = stock['Close'].rolling(5).mean()
stock['ma_15'] = stock['Close'].rolling(15).mean()
Holdings: simple long/short based on moving average crossover signals
longs = stock['Close'].where(stock['ma_5'] > stock['ma_15'], np.nan)
shorts = stock['Close'].where(stock['ma_5'] < stock['ma_15'], np.nan)
# Quick plot
stock.plot()
longs.plot(lw=5, c='green')
shorts.plot(lw=5, c='red')
EQUITY CURVE:
Identify which side (long/short) has the first holding (i.e. the first trade; in this case, short), then keep the initial trade price and cumulatively sum the daily changes from there (there would normally be more NaNs in the series if you also have exit rules for when you are out of the market). Finally, forward-fill over the NaN values and fill any remaining NaNs with zeros.
It's basically the same for the second, opposite side's holdings (in this case, long), except you don't keep the starting price. The other important thing is to invert the short daily changes (i.e. declines should be positive contributions to the PnL).
lidx = np.where(longs > 0)[0][0]
sidx = np.where(shorts > 0)[0][0]
startdx = min(lidx, sidx)
# For first holding side, keep first trade price, then calc daily change fwd and ffill nan's
# For second holding side, get cumsum of daily changes, ffill and fillna(0) (make sure short changes are inverted)
if lidx == startdx:
    lcurve = longs.diff()  # get daily changes
    lcurve[lidx] = longs[lidx]  # put back initial starting price
    lcurve = lcurve.cumsum().ffill()  # add daily changes/ffill to build curve
    scurve = -shorts.diff().cumsum().ffill().fillna(0)  # get daily changes (make declines positive changes)
else:
    scurve = -shorts.diff()  # get daily changes (make declines positive changes)
    scurve[sidx] = shorts[sidx]  # put back initial starting price
    scurve = scurve.cumsum().ffill()  # add daily changes/ffill to build curve
    lcurve = longs.diff().cumsum().ffill().fillna(0)  # get daily changes
Add the 2 long/short curves together to get the final equity curve
eq_curve = lcurve + scurve
# quick plot
stock.iloc[:, :3].plot()
longs.plot(lw=5, c='green', label='Long')
shorts.plot(lw=5, c='red', label='Short')
eq_curve.plot(lw=2, ls='dotted', c='orange', label='Equity Curve')
plt.legend()
