Python: how to calculate average length of time when a variable=yes

I have a set of EURUSD data and am looking at arbitrage opportunities. The data is formatted as shown in the photo.
mispricing_1=yes when buy_b_sell_A>0, and mispricing_2=yes when buy_A_sell_B>0.
In the photo there is no data point where exploitable=yes; however, when buy_b_sell_A>6 or buy_A_sell_B>6, we get exploitable=yes.
I want to calculate the average length of time that an exploitable arbitrage opportunity is present, as indicated by exploitable=yes.
How can I calculate the duration of each run of consecutive exploitable=yes rows, so that I can plot a distribution and also calculate the average?

import pandas as pd
df = pd.DataFrame(data={'ts': list(range(1, 14)),
                        'mp': [0,0,1,1,1,0,0,1,1,0,0,1,0]})  # your data
df.loc[df.mp.diff(1) == 1, 'ts1'] = df.ts   # ts1: timestamp where a run of 1s starts
df.loc[df.mp.diff(1) == -1, 'ts2'] = df.ts  # ts2: timestamp of the first 0 after a run
df = df[~(df.ts1.isna()) | ~(df.ts2.isna())]  # keep only rows with changes
df.loc[~df.ts2.isna(), 'delta'] = df.ts2 - df.ts1.shift(1)  # run length: ts2 - ts1
print(df)

If you load this into a pandas DataFrame, let's call it df, you can call df.groupby('exploitable').mean() to get the mean of each column per group.
For the distribution, you could call .hist() on the column of interest.
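To get the length of each run of consecutive exploitable=yes rows, one option is to label the runs and count the rows in each. The sketch below is a minimal example on made-up data: the function name run_lengths and the sample values are illustrative, and it assumes one row per time step, so that the row count equals the duration.

```python
import pandas as pd

def run_lengths(flags: pd.Series) -> pd.Series:
    """Return the length of every run of consecutive 'yes' values."""
    is_yes = flags.eq("yes")
    # A new run starts wherever the value differs from the previous row.
    run_id = is_yes.ne(is_yes.shift()).cumsum()
    # Keep only the 'yes' rows and count the rows in each run.
    return is_yes[is_yes].groupby(run_id[is_yes]).size()

# Hypothetical sample data standing in for the real 'exploitable' column.
exploitable = pd.Series(
    ["no", "no", "yes", "yes", "yes", "no", "no", "yes", "yes", "no", "no", "yes", "no"]
)
lengths = run_lengths(exploitable)
print(lengths.tolist())  # lengths of the individual runs
print(lengths.mean())    # average run length
```

Calling lengths.hist() (with matplotlib installed) would then plot the distribution. If the timestamps are irregular, subtracting the first timestamp of each run from the last is likely a better measure than counting rows.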

Related

How to find AVERAGE drawdown of 7 assets?

I'm currently tasked with finding the average drawdown of 7 assets. This is what I have so far:
import datetime as dt
import numpy as np
import pandas as pd
import yfinance as yf # assumption: yf refers to the yfinance package
end = dt.datetime.today()
start = end - dt.timedelta(365)
tickers = ["SBUX", "MCD", "CMG", "WEN", "DPZ", "YUM", "DENN"]
bench = ['SPY', 'IWM', 'DIA']
table_1 = pd.DataFrame(index=tickers)
data = yf.download(tickers+bench, start, end)['Adj Close']
log_returns = np.log(data/data.shift())
table_1["drawdown"] = (log_returns.min() - log_returns.max() ) / log_returns.max()
However, this only gives me the maximum drawdown, when I actually want the average.
You will need scipy to find the local maxima and minima:
from scipy.signal import argrelextrema
I've defined a function that finds the local minima and maxima of the time series. It then calculates the relative difference between each local maximum and the next local minimum and returns the mean:
def av_dd(series):
    series = series.values # convert to numpy array
    drawdowns = []
    loc_max = argrelextrema(series, np.greater)[0] # indexes of local maxima
    loc_min = argrelextrema(series, np.less)[0] # indexes of local minima
    # prepend the first value of the series if the first local minimum comes before the first local maximum (so that the first drawdown is taken into account)
    if series[0] > series[1]:
        loc_max = np.insert(loc_max, 0, 0)
    # append the last value of the series if the last local maximum comes after the last local minimum (so that the last drawdown is taken into account)
    if len(loc_max) > len(loc_min):
        loc_min = np.append(loc_min, len(series) - 1)
    for i in range(len(loc_max)):
        drawdowns.append(series[loc_min[i]] / series[loc_max[i]] - 1)
    return sum(drawdowns) / len(drawdowns)
Both if statements in the function make sure that the first and last drawdowns are also taken into account, depending on what the local extrema are at the beginning and end of the time series.
You then simply apply this function to your price data (the DataFrame called data in the question):
table_1['drawdown'] = data.apply(av_dd)
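The peak-to-trough idea described above can also be sketched without scipy, by walking through the prices once and closing a drawdown whenever the price turns up again. This is an alternative illustration, not the function from the answer: the name average_drawdown and the sample prices are made up, and the input is assumed to contain no missing values.

```python
import numpy as np

def average_drawdown(prices):
    """Mean relative decline from each local peak to the following trough."""
    p = np.asarray(prices, dtype=float)
    drawdowns = []
    peak = p[0]  # most recent local maximum
    prev = p[0]  # previous price
    for x in p[1:]:
        if x > prev:
            if prev < peak:
                # The price turned up, so a decline just ended at prev.
                drawdowns.append(prev / peak - 1)
            peak = x  # while rising, the peak follows the price
        prev = x
    if prev < peak:
        # The series ends in the middle of a decline.
        drawdowns.append(prev / peak - 1)
    return float(np.mean(drawdowns)) if drawdowns else float("nan")

# Hypothetical prices: a 10% drop from 100 to 90, then a drop from 95 to 80.
sample = [100, 90, 95, 80, 100]
avg = average_drawdown(sample)
print(avg)
```

Because the comparison is made against the previous price only, flat stretches inside a decline are treated as part of that decline.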

Calculate average and normalize GPS data in Python

I have a dataset in json with gps coordinates:
"utc_date_and_time":"2021-06-05 13:54:34", # timestamp
"hdg":"018.0", # heading
"sog":"000.0", # speed
"lat":"5905.3262N", # latitude
"lon":"00554.2433E" # longitude
This data will be imported into a database, with one entry every second for every "vessel".
As you can imagine this is a huge amount of data that provides a level of accuracy I do not need.
My goal:
Create a new entry in the database for every X seconds
If I set X to 60 (one minute) and 10 entries are missing within this period, the remaining 50 entries should be used. Data can be missing for certain periods, and I do not want this to create bogus positions.
Use the timestamp from the last entry in the period.
Use the heading (hdg) that appears most often within the period.
Calculate the average speed within the period.
Latitude and longitude could use the last entry, but I have seen "spikes" that need to be filtered out; alternatively, use the average and remove values that differ too much.
My script is now pushing all the data to the database via a for loop with different data-checks inside it, and this is working.
I am new to python and still learning every day through reading and youtube videos, but it would be great if anyone could point me in the right direction for how to achieve the above goal.
As of now the data is imported into a dictionary. And I am wondering if creating a dictionary where the timestamp is the key is the way to go, but I am a little lost.
Code:
import os
import json
from pathlib import Path
from datetime import datetime, timedelta, date
def generator(data):
    for entry in data:
        yield entry
data = json.load(open("5_gps_2021-06-05T141524.1397180000.json"))["gps_data"]
gps_count = len(data)
start_time = None
new_gps = list()
tempdata = list()
seconds = 60
i = 0
for entry in generator(data):
    i = i+1
    if start_time is None:
        start_time = datetime.fromisoformat(entry['utc_date_and_time'])
    # TODO: Filter out values with too much deviation
    tempdata.append(entry)
    elapsed = (datetime.fromisoformat(entry['utc_date_and_time']) - start_time).total_seconds()
    if (elapsed >= seconds) or (i == gps_count):
        # TODO: Calculate average values etc. instead of using last
        new_gps.append(tempdata)
        tempdata = []
        start_time = None
print("GPS count before:" + str(gps_count))
print("GPS count after:" + str(len(new_gps)))
Output:
GPS count before:1186
GPS count after:20
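One way to approach the goals listed above is to bucket the entries into fixed windows of X seconds and summarise each bucket. The sketch below is only an illustration under stated assumptions: the function name downsample and the sample entries are made up, the entries are assumed to be sorted by time, and the spike filtering for latitude and longitude is left out (the last position in each window is used).

```python
from collections import Counter
from datetime import datetime
from statistics import mean

EPOCH = datetime(1970, 1, 1)

def downsample(entries, seconds=60):
    """Summarise GPS entries in fixed windows of `seconds` seconds."""
    buckets = {}
    for e in entries:
        ts = datetime.fromisoformat(e["utc_date_and_time"])
        # Entries whose timestamps fall in the same window share a key.
        key = int((ts - EPOCH).total_seconds()) // seconds
        buckets.setdefault(key, []).append(e)
    result = []
    for key in sorted(buckets):
        group = buckets[key]  # only the entries that actually exist
        last = group[-1]
        result.append({
            "utc_date_and_time": last["utc_date_and_time"],  # last timestamp
            "hdg": Counter(g["hdg"] for g in group).most_common(1)[0][0],  # most frequent heading
            "sog": mean(float(g["sog"]) for g in group),  # average speed
            "lat": last["lat"],
            "lon": last["lon"],
        })
    return result

# Hypothetical entries in the same shape as the JSON shown above.
sample = [
    {"utc_date_and_time": "2021-06-05 13:54:34", "hdg": "018.0", "sog": "002.0", "lat": "5905.3262N", "lon": "00554.2433E"},
    {"utc_date_and_time": "2021-06-05 13:54:35", "hdg": "018.0", "sog": "004.0", "lat": "5905.3263N", "lon": "00554.2434E"},
    {"utc_date_and_time": "2021-06-05 13:54:50", "hdg": "020.0", "sog": "003.0", "lat": "5905.3264N", "lon": "00554.2435E"},
    {"utc_date_and_time": "2021-06-05 13:55:01", "hdg": "021.0", "sog": "001.0", "lat": "5905.3265N", "lon": "00554.2436E"},
]
result = downsample(sample, seconds=60)
print(result)
```

Missing seconds simply produce smaller groups, so a window with 50 of 60 entries is averaged over those 50, and a window with no entries produces no row at all.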

Python: How to find max value in list of preceding 10 values?

I have a csv file containing wave data (time, tidal elevation, wave period, wave height and wave direction)
I want to know, at a given time, when the previous high tide was and the corresponding wave period, height and direction.
I have this code now which selects the line of the time that I'm looking for:
import csv
with open('Waves_2019.csv') as f:
    reader = csv.reader(f)
    for line_num, content in enumerate(reader):
        if content[0] == '01/03/2019T08:00':
            a = line_num
            print(a)
The next step would then take the previous 12 hours of data to select the highest tidal elevation (0.77 at 01/03/2019T01:00 in example) and then return the other data (period, height, direction).
How could I amend the code so that it looks for the maximum tidal elevation in column 2 within the 12 data points preceding the selected time, and then returns the other data for that high tide?
First, select the index of the row for the time we want to examine (this assumes the CSV has been loaded into a pandas DataFrame df with the column names used below):
selected_index = df.loc[df["time"].eq("01/03/2019T08:00")].index[0]
Then obtain the index of the row with the maximum tidal_elevation in the previous 12 hours (since there are no missing rows in the dataset, we can safely assume that the previous 12 rows cover those 12 hours):
filt = df.loc[selected_index-12: selected_index, "tidal_elevation"].idxmax()
Now select the other parameters for the row with the maximum tidal_elevation:
res = df.loc[filt, ["time", "tidal_elevation", "wave_period", "wave_height", "wave_direction"]]
print(res)
P.S. res = df.loc[filt, "time":"wave_direction"]
would also work if the columns are in that same order, i.e. ["time", "tidal_elevation", "wave_period", "wave_height", "wave_direction"]
Edit:
Taking the average of the values one hour before and after the maximum tidal_elevation (the time column is left out because it is not numeric):
res_avg = df.loc[filt-1:filt+1, "tidal_elevation":"wave_direction"].mean()
print(res_avg)
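The lookup can be tried end to end on a small made-up table. In the sketch below, every value except the 0.77 high tide at 01/03/2019T01:00 mentioned in the question is hypothetical, as is the helper name previous_high_tide. It also clamps the start of the window at the first row, in case the selected time is fewer than 12 rows into the file.

```python
import pandas as pd

# Hypothetical hourly data; only the 0.77 peak at 01:00 comes from the question.
df = pd.DataFrame({
    "time": [f"01/03/2019T{h:02d}:00" for h in range(9)],
    "tidal_elevation": [0.70, 0.77, 0.60, 0.35, 0.05, -0.20, -0.35, -0.15, 0.20],
    "wave_period": [6.1, 6.3, 6.0, 5.8, 5.9, 6.2, 6.4, 6.1, 6.0],
    "wave_height": [1.2, 1.3, 1.1, 1.0, 0.9, 1.0, 1.1, 1.2, 1.3],
    "wave_direction": [270, 275, 272, 268, 265, 270, 273, 271, 269],
})

def previous_high_tide(df, when, window=12):
    """Return the row with the highest tide in the `window` rows up to `when`."""
    selected = df.index[df["time"].eq(when)][0]
    start = max(selected - window, 0)  # do not reach before the first row
    peak = df.loc[start:selected, "tidal_elevation"].idxmax()
    return df.loc[peak]

res = previous_high_tide(df, "01/03/2019T08:00")
print(res)
```

This relies on the default integer index, so that subtracting from the index moves back by whole rows.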

Python Data manipulation: Duplicate and Average row and column values using dates

Hi I have a dataset in the following format:
Code for replicating the data:
import pandas as pd
d1 = {'Year':
['2008','2008','2008','2008','2008','2008','2008','2008','2008','2008'],
'Month':['1','1','2','6','7','8','8','11','12','12'],
'Day':['6','22','6','18','3','10','14','6','16','24'],
'Subject_A':['','30','','','','35','','','',''],
'Subject_B':['','','','','','','','40','',''],
'Subject_C': ['','','','','','65','','50','','']}
d1 = pd.DataFrame(d1)
I entered the numbers as strings so that blank cells can be shown.
The first three columns denote the date (Year, Month and Day) and the following columns represent individuals. (My actual data file consists of about 300 such rows and about 1000 subjects; I present a subset of the data here.)
The column values refer to expenditure on FMCG products.
What I would like to do is the following:
Part 1 (Beginning and end points)
a) For each individual, locate the first observation and duplicate its value for at least the previous six months. For example: Subject C's 1st observation is on the 10th of August 2008. In that case I would want all the rows from June 10, 2008 to be equal to 65 for Subject C (roughly 2/12/2008 is the cutoff date, so we leave the 3rd cell from the top of Subject_C's column blank).
b) Locate the last observation and repeat it for the following 3 months. For example, for Subject_A we repeat 35 twice (until 6th November 2008).
Please refer to the following diagram for the highlighted cell with the solutions.
Part II - (Rows in between)
Next I would like to do two things (I would need to do the following three steps separately, not all at one time):
For individuals like Subject_A, locate two observations that come one after the other (30 and 35).
i) Use the average of the two observations. In this case we would have 32.5 in the four rows without caring about time.
for eg:
ii) Find the total time between two observations and take half of it. For the 1st half of the time period assign the first value and for the 2nd half assign the second value. For example, for Subject_A the total number of days between 01/22/2008 and 08/10/2008 is 201. For the first 201/2 = 100.5 days assign the value of 30 to Subject_A, and for the remaining days assign 35. In this case the columns for Subject_A and Subject_C will look like:
The final dataset will use (a), (b) & (i) or (a), (b) & (ii)
Final data I [using a,b and i]
Final data II [using a,b and ii]
I would appreciate any help with this. Thanks in advance. Please let me know if the steps are unclear.
Follow up question and Issues
Thanks @Juan for the initial answer. Here's my follow-up question. Suppose that Subject_A has more than 2 observations (code for the example data below). Would we be able to extend this code to handle more than 2 observations?
import pandas as pd
d1 = {'Year':
['2008','2008','2008','2008','2008','2008','2008','2008','2008','2008'],
'Month':['1','1','2','6','7','8','8','11','12','12'],
'Day':['6','22','6','18','3','10','14','6','16','24'],
'Subject_A':['','30','','45','','35','','','',''],
'Subject_B':['','','','','','','','40','',''],
'Subject_C': ['','','','','','65','','50','','']}
d1 = pd.DataFrame(d1)
Issues
For the current code, I found an issue for part II (ii). This is the output that I get:
This is actually on the right track. The two cells above 35 do not seem to get updated. Is there something wrong on my end? Also, the same question as before: would we be able to extend it to the case of more than 2 observations?
Here is a code solution for Subject_A. It should work for the other subjects as well:
import numpy as np
import pandas as pd
d1 = {'Year':
      ['2008','2008','2008','2008','2008','2008','2008','2008','2008','2008'],
      'Month':['1','1','2','6','7','8','8','11','12','12'],
      'Day':['6','22','6','18','3','10','14','6','16','24'],
      'Subject_A':['','30','','45','','35','','','',''],
      'Subject_B':['','','','','','','','40','',''],
      'Subject_C': ['','','','','','65','','50','','']}
d1 = pd.DataFrame(d1)
## Create a variable named date
d1['date']= pd.to_datetime(d1['Year']+'/'+d1['Month']+'/'+d1['Day'])
# convert to float, to calculate mean
d1['Subject_A'] = d1['Subject_A'].replace('',np.nan).astype(float)
# index of the not null rows
subja = d1['Subject_A'].notnull()
### max and min index row with notnull value
max_id_subja = d1.loc[subja,'date'].idxmax()
min_id_subja = d1.loc[subja,'date'].idxmin()
### max and min date for Sub A with notnull value
max_date_subja = d1.loc[subja,'date'].max()
min_date_subja = d1.loc[subja,'date'].min()
### value for max and min date
max_val_subja = d1.loc[max_id_subja,'Subject_A']
min_val_subja = d1.loc[min_id_subja,'Subject_A']
#### Cutoffs (pd.DateOffset counts calendar months; recent pandas versions reject unit='M' in pd.Timedelta)
min_cutoff = min_date_subja - pd.DateOffset(months=6)
max_cutoff = max_date_subja + pd.DateOffset(months=3)
## PART I.a
d1.loc[(d1['date']<min_date_subja) & (d1['date']>min_cutoff),'Subject_A'] = min_val_subja
## PART I.b
d1.loc[(d1['date']>max_date_subja) & (d1['date']<max_cutoff),'Subject_A'] = max_val_subja
## PART II
d1_2i = d1.copy()
d1_2ii = d1.copy()
lower_date = min_date_subja
lower_val = min_val_subja.copy()
next_dates_index = d1_2i.loc[(d1['date']>min_date_subja) & subja].index
for N in next_dates_index:
    next_date = d1_2i.loc[N,'date']
    next_val = d1_2i.loc[N,'Subject_A']
    #PART II.i
    d1_2i.loc[(d1['date']>lower_date) & (d1['date']<next_date),'Subject_A'] = np.mean([lower_val,next_val])
    #PART II.ii
    mean_time_a = pd.Timedelta((next_date-lower_date).days/2, unit='d')
    d1_2ii.loc[(d1['date']>lower_date) & (d1['date']<=lower_date+mean_time_a),'Subject_A'] = lower_val
    d1_2ii.loc[(d1['date']>lower_date+mean_time_a) & (d1['date']<=next_date),'Subject_A'] = next_val
    lower_date = next_date
    lower_val = next_val
print(d1_2i)
print(d1_2ii)

How do I avoid a loop with Python/Pandas to build an equity curve?

I am trying to build an equity curve in Python using Pandas. For those not in the know, an equity curve is a cumulative tally of investing profits/losses day by day. The code below works but it is incredibly slow. I've tried to build an alternate using Pandas .iloc and such but nothing is working. I'm not sure if it is possible to do this outside of a loop given how I have to reference the prior row(s).
for today in range(len(f1)): #initiate a loop that runs the length of the "f1" dataframe
    if today == 0: #if the index value is zero (aka first row in the dataframe) then...
        f1.loc[today,'StartAUM'] = StartAUM #Set initial assets
        f1.loc[today,'Shares'] = 0 #dummy placeholder for shares; no trading on day 1
        f1.loc[today,'PnL'] = 0 #dummy placeholder for P&L; no trading day 1
        f1.loc[today,'EndAUM'] = StartAUM #set ending AUM; should be beginning AUM since no trades
        continue #and on to the second row in the dataframe
    yesterday = today - 1 #used to reference the rows (see below)
    f1.loc[today,'StartAUM'] = f1.loc[yesterday,'EndAUM'] #today's starting assets are yesterday's ending assets
    f1.loc[today,'Shares'] = f1.loc[yesterday,'EndAUM']//f1.loc[yesterday,'Shareprice'] #today's shares to trade = yesterday's assets/yesterday's share price
    f1.loc[today,'PnL'] = f1.loc[today,'Shares']*f1.loc[today,'Outcome1'] #Our P&L should be the shares traded (see prior line) multiplied by the outcome for 1 share
    #Note Outcome1 came from the dataframe before this loop >> for the purposes here its value is irrelevant
    f1.loc[today,'EndAUM'] = f1.loc[today,'StartAUM']+f1.loc[today,'PnL'] #ending assets are starting assets + today's P&L
There is a good example here: http://www.pythonforfinance.net/category/basic-data-analysis/ and I know that there is an example in Wes McKinney's book Python for Data Analysis. You might be able to find it here: http://wesmckinney.com/blog/python-for-financial-data-analysis-with-pandas/
Have you tried using iterrows() to construct the for loop?
prev_index = None #index label of yesterday's row
for index, row in f1.iterrows():
    if prev_index is None: #first row in the dataframe
        f1.at[index,'StartAUM'] = StartAUM #Set initial assets
        f1.at[index,'Shares'] = 0 #dummy placeholder for shares; no trading on day 1
        f1.at[index,'PnL'] = 0 #dummy placeholder for P&L; no trading day 1
        f1.at[index,'EndAUM'] = StartAUM #set ending AUM; should be beginning AUM since no trades
    else:
        f1.at[index,'StartAUM'] = f1.at[prev_index,'EndAUM'] #today's starting assets are yesterday's ending assets
        f1.at[index,'Shares'] = f1.at[prev_index,'EndAUM']//f1.at[prev_index,'Shareprice'] #today's shares to trade = yesterday's assets/yesterday's share price
        f1.at[index,'PnL'] = f1.at[index,'Shares']*row['Outcome1'] #P&L is the shares traded multiplied by the outcome for 1 share
        f1.at[index,'EndAUM'] = f1.at[index,'StartAUM']+f1.at[index,'PnL'] #ending assets are starting assets + today's P&L
    prev_index = index
The original code is probably slow because of the repeated .loc lookups on f1. Note that the rows yielded by iterrows() are copies, so the results are written back with f1.at instead of being assigned to row.
See the pandas documentation on iterrows() for more details.
You need to vectorize the operations (don't iterate with for, but rather compute a whole column at once):
# fill the initial values
f1['StartAUM'] = StartAUM # Set initial assets
f1['Shares'] = 0 # dummy placeholder for shares; no trading on day 1
f1['PnL'] = 0 # dummy placeholder for P&L; no trading day 1
f1['EndAUM'] = StartAUM # s
#do the computations (vectorized)
f1['StartAUM'].iloc[1:] = f1['EndAUM'].iloc[:-1]
f1['Shares'].iloc[1:] = f1['EndAUM'].iloc[:-1] // f1['Shareprice'].iloc[:-1]
f1['PnL'] = f1['Shares'] * f1['Outcome1']
f1['EndAUM'] = f1['StartAUM'] + f1 ['PnL']
EDIT: this will not work correctly since StartAUM, EndAUM, Shares depend on each other and cannot be computed one without another. I didn't notice that before.
Can you try the following:
#import relevant modules
import pandas as pd
import numpy as np
from pandas_datareader import data
import matplotlib.pyplot as plt
#download data into DataFrame and create moving averages columns
f1 = data.DataReader('AAPL', 'yahoo',start='1/1/2017')
StartAUM = 1000000
#populate DataFrame with starting values
f1['Shares'] = 0
f1['PnL'] = 0
f1['EndAUM'] = StartAUM
#Set shares held to be the previous day's EndAUM divided by the previous day's closing price
f1['Shares'] = f1['EndAUM'].shift(1) / f1['Adj Close'].shift(1)
#Set the day's PnL to be the number of shares held multiplied by the change in closing price from yesterday to today's close
f1['PnL'] = f1['Shares'] * (f1['Adj Close'] - f1['Adj Close'].shift(1))
#Set day's ending AUM to be previous days ending AUM plus daily PnL
f1['EndAUM'] = f1['EndAUM'].shift(1) + f1['PnL']
#Plot the equity curve
f1['EndAUM'].plot()
Does the above solve your issue?
The solution was to use the Numba package. It performs the loop task in a fraction of the time.
https://numba.pydata.org/
The arguments/dataframe can be passed to the numba module/function. I will try to write up a more detailed explanation with code when time permits.
Thanks to all
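For reference, the recurrence in the original loop can be written as a plain function over NumPy arrays, which is the form that a JIT compiler such as Numba tends to handle well. The sketch below is not the poster's code: the function name equity_curve and the sample numbers are illustrative, and it runs as ordinary Python. Decorating it with numba.njit is one possible way to speed it up, assuming Numba is installed.

```python
import numpy as np

def equity_curve(share_price, outcome, start_aum):
    """Day-by-day equity curve; each day depends on the previous day's ending AUM."""
    n = len(share_price)
    start = np.empty(n)
    shares = np.zeros(n)  # no trading on day 1
    pnl = np.zeros(n)     # no P&L on day 1
    end = np.empty(n)
    start[0] = start_aum
    end[0] = start_aum
    for t in range(1, n):
        start[t] = end[t - 1]                       # today's start is yesterday's end
        shares[t] = end[t - 1] // share_price[t - 1]  # whole shares bought at yesterday's price
        pnl[t] = shares[t] * outcome[t]             # outcome is the P&L for one share
        end[t] = start[t] + pnl[t]
    return start, shares, pnl, end

# Hypothetical inputs standing in for f1['Shareprice'] and f1['Outcome1'].
price = np.array([10.0, 20.0, 25.0])
outcome = np.array([0.0, 1.0, -2.0])
start, shares, pnl, end = equity_curve(price, outcome, 1000.0)
print(end)
```

With a DataFrame, the columns can be passed in as f1['Shareprice'].to_numpy() and f1['Outcome1'].to_numpy(), and the returned arrays assigned back as new columns.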
In case others come across this, you can definitely make an equity curve without loops.
Dummy up some data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (13, 10)
# Some data to work with
np.random.seed(1)
stock = pd.DataFrame(
    np.random.randn(100).cumsum() + 10,
    index=pd.date_range('1/1/2020', periods=100, freq='D'),
    columns=['Close']
)
stock['ma_5'] = stock['Close'].rolling(5).mean()
stock['ma_15'] = stock['Close'].rolling(15).mean()
Holdings: simple long/short based on moving average crossover signals
longs = stock['Close'].where(stock['ma_5'] > stock['ma_15'], np.nan)
shorts = stock['Close'].where(stock['ma_5'] < stock['ma_15'], np.nan)
# Quick plot
stock.plot()
longs.plot(lw=5, c='green')
shorts.plot(lw=5, c='red')
EQUITY CURVE:
Identify which side (long/short) has the first holding (i.e. the first trade; in this case, short). For that side, keep the initial trade price and then cumulatively sum the daily changes (there would normally be more NaNs in the series if you also have exit rules for when you are out of the market). Finally, forward fill over the NaN values and fill any remaining NaNs with zeros. It's basically the same for the opposite holdings (in this case, long), except that you don't keep the starting price. The other important thing is to invert the short daily changes (i.e. negative price changes should be positive for the PnL).
lidx = np.where(longs > 0)[0][0]
sidx = np.where(shorts > 0)[0][0]
startdx = min(lidx, sidx)
# For the first holding side, keep the first trade price, then calc daily changes forward and ffill NaNs
# For the second holding side, get the cumsum of daily changes, ffill and fillna(0) (make sure short changes are inverted)
if lidx == startdx:
    lcurve = longs.diff() # get daily changes
    lcurve.iloc[lidx] = longs.iloc[lidx] # put back initial starting price (lidx is a position, so use iloc)
    lcurve = lcurve.cumsum().ffill() # add daily changes/ffill to build curve
    scurve = -shorts.diff().cumsum().ffill().fillna(0) # get daily changes (make declines positive changes)
else:
    scurve = -shorts.diff() # get daily changes (make declines positive changes)
    scurve.iloc[sidx] = shorts.iloc[sidx] # put back initial starting price (sidx is a position, so use iloc)
    scurve = scurve.cumsum().ffill() # add daily changes/ffill to build curve
    lcurve = longs.diff().cumsum().ffill().fillna(0) # get daily changes
Add the 2 long/short curves together to get the final equity curve
eq_curve = lcurve + scurve
# quick plot
stock.iloc[:, :3].plot()
longs.plot(lw=5, c='green', label='Long')
shorts.plot(lw=5, c='red', label='Short')
eq_curve.plot(lw=2, ls='dotted', c='orange', label='Equity Curve')
plt.legend()
