Problem Statement
I have a dataframe of position_signal and close (closing prices) for a given asset. You can think of each row as a specific time step (minute, hour, etc.). I want to run a vectorized backtest to calculate my aggregated portfolio return at each point in time, whether I'm long, short, or not holding; the position_signal column is 1, -1, or 0, respectively.
What I've Tried
I've tried a lot... it seems like an easy problem, and there are a lot of tutorials online that claim to do it. However, the problem with all of them is that they either assume returns compound during the trade (so each day's return is multiplied by the next, which isn't right if you're just holding a position) or they don't calculate short returns correctly.
The code for most of them boils down to this:
import numpy as np
import pandas as pd

df['log_returns'] = np.log(df['close']) - np.log(df['close'].shift(1))
df['strategy_returns'] = df['position_signal'] * df['log_returns']
df['cumulative_ret'] = df['strategy_returns'].cumsum().apply(np.exp)
Now, the math is fairly annoying so I don't want to derive it here, but essentially this works fine for longs and breaks down when calculating the returns of shorts. The reason is that it treats the return on a short as just the negative of the loss on an equivalent long position, which is wrong.
Simple Example
Let's say I have a dataframe that represents two trades: one long from price = $1 -> $2, and one short from $2 -> $1. The returns for these would simply be: 100% on the first trade and 50% on the second.
If we held a position of $100 at the start of the session, our first trade (+100%) would turn our $100 to $200, then our second trade (+50%) would turn our $200 to $300. So our final position is $300 (or 300%).
The problem is that the code given in these tutorials gets the long right but then computes the short's return as 100% (because it takes the return to be the negative of a -100% return on the equivalent long).
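To see the discrepancy numerically, here's a quick check (not from the tutorials; just the two formulas applied to the short trade above):

import numpy as np

# True simple return on a short from $2 to $1: a $1 profit on $2 of capital = +50%.
true_short = (2.0 - 1.0) / 2.0                          # 0.5
# The tutorial formula instead negates the log return and exponentiates:
tutorial_short = np.exp(-(np.log(1.0) - np.log(2.0)))   # 2.0, i.e. +100%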
Code for Proof
Here is a code snippet that demonstrates this. A minor thing to note: we enter at the close of the row above the first row where position_signal = 1/-1, which means the position_signal is effectively shifted by 1 to avoid look-ahead bias. It's a semantics issue rather than a real one; you can do it however you'd like as long as the returns come out the same.
import numpy as np
import pandas as pd

df = pd.DataFrame([[ 0.0, 1.000],
                   [ 0.0, 1.200],
                   [ 0.0, 1.500],
                   [ 0.0, 1.000],
                   [ 1.0, 1.200],  # We entered a LONG at the open of this timestep, which is the same as the close of the previous ($1.0)
                   [ 1.0, 1.300],
                   [ 1.0, 2.000],  # We exit at the close of this timestep, so $2.0
                   [ 0.0, 1.700],
                   [ 0.0, 2.000],
                   [-1.0, 1.798],  # We entered a SHORT at the open of this timestep, which is the same as the close of the previous ($2.0)
                   [-1.0, 0.500],
                   [-1.0, 1.300],
                   [-1.0, 1.000],  # We exit at the close of this timestep, so $1.0
                   [ 0.0, 1.500]],
                  columns=['position_signal', 'close'])

df['log_returns'] = np.log(df.close) - np.log(df.close.shift(1))
df['strategy_returns'] = df.log_returns * df.position_signal
df['cumulative_returns'] = df.strategy_returns.cumsum().apply(np.exp)
print(df)
Code Output:
position_signal close log_returns strategy_returns cumulative_returns
0 0.0 1.000 NaN NaN NaN
1 0.0 1.200 0.182322 0.000000 1.000000
2 0.0 1.500 0.223144 0.000000 1.000000
3 0.0 1.000 -0.405465 -0.000000 1.000000
4 1.0 1.200 0.182322 0.182322 1.200000
5 1.0 1.300 0.080043 0.080043 1.300000
6 1.0 2.000 0.430783 0.430783 2.000000
7 0.0 1.700 -0.162519 -0.000000 2.000000
8 0.0 2.000 0.162519 0.000000 2.000000
9 -1.0 1.798 -0.106472 0.106472 2.224694
10 -1.0 0.500 -1.279822 1.279822 8.000000
11 -1.0 1.300 0.955511 -0.955511 3.076923
12 -1.0 1.000 -0.262364 0.262364 4.000000
13 0.0 1.500 0.405465 0.000000 4.000000
As you can see, by the end it reports our return as 4.0 (400%), which is not what we want. It should be 3.0 (300%).
Other Variations:
1 - There's a way to do this if we ignore the intermediate price ticks. If we reduce the df to four rows (enter trade 1, exit trade 1, enter trade 2, exit trade 2), we can mask on position_signal, calculate each trade's return on a single row, and then run cumprod on that (see the sketch after this list). I don't want that; I need to know my portfolio value at the intermediate steps, which is why I added those rows in the example above.
2 - There's also a way to do this by splitting the PnL into a component due to shorts and a component due to longs. But this isn't good either, because if I'm down 50% due to a bad long, that reduces my purchasing power for the next trade (i.e. being up 50% on the next trade doesn't make me break even).
3 - There's also the non-vectorized route, but that is too slow for me.
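For reference, here is a minimal sketch of variation 1 using the entry/exit prices from the example above (per-trade simple returns only, no intermediate portfolio values; the column names are mine):

import pandas as pd

trades = pd.DataFrame({'side':  [1, -1],     # 1 = long, -1 = short
                       'entry': [1.0, 2.0],
                       'exit':  [2.0, 1.0]})

# Simple (not log) return per trade: longs gain as price rises, shorts as it falls
trades['ret'] = trades['side'] * (trades['exit'] - trades['entry']) / trades['entry']
# Compound across trades: 2.0 after trade 1, 3.0 after trade 2 (i.e. 300%)
trades['portfolio'] = (1 + trades['ret']).cumprod()
print(trades)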
I'm new to the Python and analytics world, and I know there are lots of packages out there to solve specific problems. My problem is as follows:
I have some independent variables with a value for each row.
I can calculate some ratios (like payed amount, payed count, bill not payed amount and bill not payed count, in relation to the totals of the dataset).
I have to find the split value for each independent variable that maximizes the ratios of payed amount and count and minimizes the ratios for the not-payed ones.
For example:
amount payed var1 var2 var3
10 1 25 5 10
25 0 21 8 15
30 1 35 3 8
As you can see, in this data set the payed and not-payed ratios for the totals would be:
Payed count: 2/3 ~ 0.67
Payed amount: 40/65 ~ 0.62
Not payed count: 1/3 ~ 0.33
Not payed amount: 25/65 ~ 0.38
I was thinking of comparing each split of the independent variables against the totals and seeing if there is a gain in these calculations; if so, I can take it as a possible solution. For example, for var1 >= 25:
Payed count: 2/2 ~ 1
Payed amount: 40/40 ~ 1
Not payed count: 0/2 ~ 0
Not payed amount: 0/40 ~ 0
These numbers are better than the original ones, as we can see by calculating the gain:
Payed count: 1 - 0.67 = 0.33
payed amount: 1 - 0.62 = 0.38
Not payed count: 0.33 - 0 = 0.33
Not payed amount: 0.38 - 0 = 0.38
Total gain: 0.33 + 0.38 + 0.33 + 0.38 = 1.42
And so on for each independent variable, until we get the best splits with the maximum gain.
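A minimal sketch of that gain calculation on the example dataset (assuming the total gain is simply the sum of the four ratio improvements; the column names come from the table above):

import pandas as pd

df = pd.DataFrame({'amount': [10, 25, 30],
                   'payed':  [1, 0, 1],
                   'var1':   [25, 21, 35],
                   'var2':   [5, 8, 3],
                   'var3':   [10, 15, 8]})

def gain(df, mask):
    sub = df[mask]
    # Ratios over the whole dataset vs. inside the candidate split
    base_cnt, cnt = df['payed'].mean(), sub['payed'].mean()
    base_amt = df.loc[df['payed'] == 1, 'amount'].sum() / df['amount'].sum()
    amt = sub.loc[sub['payed'] == 1, 'amount'].sum() / sub['amount'].sum()
    # Payed ratios should go up, not-payed ratios should go down
    return ((cnt - base_cnt) + (amt - base_amt)
            + ((1 - base_cnt) - (1 - cnt)) + ((1 - base_amt) - (1 - amt)))

print(round(gain(df, df['var1'] >= 25), 2))  # 1.44 (the 1.42 above uses rounded ratios)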
So it seems the best solution would be:
Var1: >= 25
Var2: <= 5
Var3: <= 10
Is there any tool to solve this? I was thinking of an optimization package but any other approach is welcome.
As far as I know, Python loops are slow, so it is preferred to use pandas' built-in functions.
In my problem, one column holds amounts in different currencies, and I need to convert them all to dollars. How can I detect and convert them to dollars using pandas' built-in functions?
My column is as follows:
100Dollar
200Dollar
100Euro
300Euro
184pounds
150pounds
10rupee
30rupee
Note: the amount and the currency name are in the same column.
Note: conversion exchange rate w.r.t dollar {Euro: 1.2, pounds: 1.3, rupee: 0.05}
Note: currency enum is ['Euro', 'Dollar', 'Pounds', 'Rupee']
I would suggest something similar to the below, using the CurrencyConverter package (tested against Google for accuracy):
import pandas as pd
from currency_converter import CurrencyConverter

c = CurrencyConverter()
d = {'Dollar': 'USD', 'Euro': 'EUR', 'pounds': 'GBP', 'rupee': 'INR'}  # mapping dict
# Split each cell into its numeric part and its (mapped) ISO currency code
m = pd.DataFrame(df['column'].replace(d, regex=True).str.findall(r'(\d+|\D+)').tolist())
new_df = df.assign(USD_VALUE=[c.convert(float(a), b, 'USD') for a, b in zip(m[0], m[1])])
column USD_VALUE
0 100Dollar 100.000000
1 200Dollar 200.000000
2 100Euro 110.770000
3 300Euro 332.310000
4 184pounds 242.428366
5 150pounds 197.631820
6 10rupee 0.140999
7 30rupee 0.422996
Use Series.str.extract with a regular expression to extract the amounts and currencies into new columns. Then map the exchange_rate dict onto the Currency column to calculate the amount in dollars:
# Split '100Euro' into Amount='100' and Currency='Euro'
df[['Amount', 'Currency']] = df['column'].str.extract(r'(\d+)(\D+)')
exchange_rate = {'Euro': 1.2, 'pounds': 1.3, 'rupee': 0.05}
# Currencies not in the dict (i.e. Dollar) fall back to a rate of 1
df['Amount_dollar'] = pd.to_numeric(df['Amount']) * df['Currency'].map(exchange_rate).fillna(1)
column Amount Currency Amount_dollar
0 100Dollar 100 Dollar 100.00
1 200Dollar 200 Dollar 200.00
2 100Euro 100 Euro 120.00
3 300Euro 300 Euro 360.00
4 184pounds 184 pounds 239.20
5 150pounds 150 pounds 195.00
6 10rupee 10 rupee 0.50
7 30rupee 30 rupee 1.50
The question "income tax calculation python" asks how to calculate taxes given a marginal tax rate schedule, and its answer provides a function that works (below).
However, it works only for a single value of income. How would I adapt it to work for a list/numpy array/pandas Series of income values? That is, how do I vectorize this code?
from bisect import bisect

rates = [0, 10, 20, 30]   # 0% 10% 20% 30%
brackets = [10000,        # first 10,000
            30000,        # next 20,000
            70000]        # next 40,000
base_tax = [0,            # 10,000 * 0%
            2000,         # 20,000 * 10%
            10000]        # 40,000 * 20% + 2,000

def tax(income):
    i = bisect(brackets, income)
    if not i:
        return 0
    rate = rates[i]
    bracket = brackets[i - 1]
    income_in_bracket = income - bracket
    tax_in_bracket = income_in_bracket * rate / 100
    total_tax = base_tax[i - 1] + tax_in_bracket
    return total_tax
This method implements the vectorized marginal tax calculation using just NumPy, if that's needed.
import numpy as np

def tax(incomes, bands, rates):
    # Broadcast incomes so that we can compute an amount per income, per band
    incomes_ = np.broadcast_to(incomes, (bands.shape[0] - 1, incomes.shape[0]))
    # Find the amount of each income that falls within each band
    amounts_in_bands = np.clip(incomes_.transpose(),
                               bands[:-1], bands[1:]) - bands[:-1]
    # Calculate tax per band
    taxes = rates * amounts_in_bands
    # Sum the tax bands per income
    return taxes.sum(axis=1)
For usage, bands should include the upper limit - in my view this makes it more explicit.
import pandas as pd

incomes = np.array([0, 7000, 14000, 28000, 56000, 77000, 210000])
bands = np.array([0, 12500, 50000, 150000, np.inf])
rates = np.array([0, 0.2, 0.4, 0.45])

df = pd.DataFrame()
df['pre_tax'] = incomes
df['post_tax'] = incomes - tax(incomes, bands, rates)
print(df)
Output:
pre_tax post_tax
0 0 0.0
1 7000 7000.0
2 14000 13700.0
3 28000 24900.0
4 56000 46100.0
5 77000 58700.0
6 210000 135500.0
Two data frames are created, one for the tax parameters and one for the incomes. For each income, we get the corresponding row index in the tax table using the searchsorted method. With that index we create a new table (df_tax.loc[rows]) and concatenate it with the income table, then calculate the taxes and drop the unnecessary columns.
import numpy as np, pandas as pd
# Test data:
df=pd.DataFrame({"name":["Bob","Julie","Mary","John","Bill","George","Andie"], \
"income":[0, 9_000, 10_000, 11_000, 30_000, 69_999, 200_000]})
OUT:
name income
0 Bob 0
1 Julie 9000
2 Mary 10000
3 John 11000
4 Bill 30000
5 George 69999
6 Andie 200000
df_tax=pd.DataFrame({"brackets": [0, 10_000, 30_000, 70_000 ], # lower limits
"rates": [0, .10, .20, .30 ],
"base_tax": [0, 0, 2_000, 10_000 ]} )
rows= df_tax["brackets"].searchsorted(df["income"], side="right") - 1 # aka bisect()
OUT:
[0 0 1 1 2 2 3]
df= pd.concat([df,df_tax.loc[rows].reset_index(drop=True)], axis=1)
df["total_tax"]= df["income"].sub(df["brackets"]).mul(df["rates"]).add(df["base_tax"])
OUT:
name income brackets rates base_tax total_tax
0 Bob 0 0 0.0 0 0.0
1 Julie 9000 0 0.0 0 0.0
2 Mary 10000 10000 0.1 0 0.0
3 John 11000 10000 0.1 0 100.0
4 Bill 30000 30000 0.2 2000 2000.0
5 George 69999 30000 0.2 2000 9999.8
6 Andie 200000 70000 0.3 10000 49000.0
df=df.reindex(columns=["name","income","total_tax"])
OUT:
name income total_tax
0 Bob 0 0.0
1 Julie 9000 0.0
2 Mary 10000 0.0
3 John 11000 100.0
4 Bill 30000 2000.0
5 George 69999 9999.8
6 Andie 200000 49000.0
Edit:
At the beginning, you can calculate the base_tax, too:
df_tax["base_tax"]= df_tax.brackets #edit2
.sub(df_tax.brackets.shift(fill_value=0))
.mul(df_tax.rates.shift(fill_value=0))
.cumsum()
One (probably inefficient) way is to use list comprehension:
def tax_multiple(incomes):
    return [tax(income) for income in incomes]
Adapting kantal's answer to run as a function:
def income_tax(income, brackets, rates):
    df_tax = pd.DataFrame({'brackets': brackets, 'rates': rates})
    df_tax['base_tax'] = df_tax.brackets.\
        sub(df_tax.brackets.shift(fill_value=0)).\
        mul(df_tax.rates.shift(fill_value=0)).cumsum()
    rows = df_tax.brackets.searchsorted(income, side='right') - 1
    income_bracket_df = df_tax.loc[rows].reset_index(drop=True)
    return pd.Series(income).sub(income_bracket_df.brackets).\
        mul(income_bracket_df.rates).add(income_bracket_df.base_tax)
e.g.:
income = [0, 9_000, 10_000, 11_000, 30_000, 69_999, 200_000]
brackets = [0, 10_000, 30_000, 70_000] # Lower limits.
rates = [0, .10, .20, .30]
income_tax(income, brackets, rates).tolist()
# [0.0, 0.0, 0.0, 100.0, 2000.0, 9999.8, 49000.0]
I have a set of 100 random distances ranging from 0.5 to 25; most of them differ from each other by only about 0.01. I want to create a script that:
Reads in a vector of cut-off values
For example, the vector of cut-off values would be: [10.5, 15.2, 17.8, 20.1, 24.3]
The script would ideally create the bins itself by taking each value in the vector as the minimum of a bin and the next value as the max
For example:
bin 1: min = 10.5 and the bin max would be 15.2 - 0.01
bin 2: min = 15.2 and the bin max would be 17.8 - 0.01
bin 3: min = 17.8 and the bin max would be 20.1 - 0.01
bin 4: min = 20.1 and the bin max would be 24.3 - 0.01
The script would ideally take the 100 values and sort them into the bins, so values of 10.8 and 10.99 would be sorted into bin 1, which is between 10.5 and 15.2. Once the script sorts all the values, it would return a list with each bin's min and max values (in other words, the limits of the bin, in this case 10.5 and 15.2) and also how many of the 100 numbers fit in the bin.
For example:
10.51
10.52
10.53
21.1
21.1
Therefore the histogram loop would run through the numbers on the list and return that bin 1 has 3 values in it and bin 4 (20.1 to 24.3) has 2 values in it.
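A minimal sketch of that binning with NumPy, assuming the cut-off vector can be used directly as bin edges (np.histogram bins are half-open [lo, hi) except the last, which effectively gives the "max = next edge - 0.01" behavior for values on a 0.01 grid):

import numpy as np

distances = np.array([10.51, 10.52, 10.53, 21.1, 21.1])  # stand-in for the 100 values
cutoffs = np.array([10.5, 15.2, 17.8, 20.1, 24.3])       # cut-off values = bin edges

counts, edges = np.histogram(distances, bins=cutoffs)
for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print(f"bin [{lo}, {hi}): {n} values")
# bin [10.5, 15.2): 3 values
# bin [15.2, 17.8): 0 values
# bin [17.8, 20.1): 0 values
# bin [20.1, 24.3): 2 values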