Improving performance of Python for loop? - python

I am trying to write code that builds a DataFrame of cointegrating pairs of portfolios (the portfolio prices are cointegrated). The stocks in each portfolio are selected from the S&P 500 and are equally weighted.
Also, for economic reasons, both portfolios must cover the same sectors.
For example:
if the stocks in one portfolio are from the [IT] and [Financial] sectors, the second portfolio must also select stocks from the [IT] and [Financial] sectors.
There is no fixed number of stocks per portfolio, so I'm considering about 10 to 20 stocks for each. However, the number of combinations is then on the order of (500 choose 10), so computation time becomes a problem.
Here is my code:
def adf(x, y, xName, yName, pvalue=0.01, beta_lower=0.5, beta_upper=1):
    res=pd.DataFrame()
    regress1, regress2 = pd.ols(x=x, y=y), pd.ols(x=y, y=x)
    error1, error2 = regress1.resid, regress2.resid
    test1, test2 = ts.adfuller(error1, 1), ts.adfuller(error2, 1)
    if test1[1] < pvalue and test1[1] < test2[1] and\
            regress1.beta["x"] > beta_lower and regress1.beta["x"] < beta_upper:
        res[(tuple(xName), tuple(yName))] = pd.Series([regress1.beta["x"], test1[1]])
        res = res.T
        res.columns=["beta","pvalue"]
        return res
    elif test2[1] < pvalue and regress2.beta["x"] > beta_lower and\
            regress2.beta["x"] < beta_upper:
        res[(tuple(yName), tuple(xName))] = pd.Series([regress2.beta["x"], test2[1]])
        res = res.T
        res.columns=["beta","pvalue"]
        return res
    else:
        pass
def coint(dataFrame, nstocks = 2, pvalue=0.01, beta_lower=0.5, beta_upper=1):
    # dataFrame = pandas_dataFrame, in this case, data['Adj Close'], row=time, col = tickers
    # pvalue = level of significance of adf test
    # nstocks = number of stocks considered for adf test (equal weight)
    # if nstocks > 2, coint return cointegration between portfolios
    # beta_lower = lower bound for slope of linear regression
    # beta_upper = upper bound for slope of linear regression
    a=time.time()
    tickers = dataFrame.columns
    tcomb = itertools.combinations(dataFrame.columns, nstocks)
    res = pd.DataFrame()
    sec = pd.DataFrame()
    for pair in tcomb:
        xName, yName = list(pair[:int(nstocks/2)]), list(pair[int(nstocks/2):])
        xind, yind = tickers.searchsorted(xName), tickers.searchsorted(yName)
        xSector = list(SNP.ix[xind]["Sector"])
        ySector = list(SNP.ix[yind]["Sector"])
        if set(xSector) == set(ySector):
            sector = [[(xSector, ySector)]]
            x, y = dataFrame[list(xName)].sum(axis=1), dataFrame[list(yName)].sum(axis=1)
            res1 = adf(x,y,xName,yName)
            if res1 is None:
                continue
            elif res.size==0:
                res=res1
                sec = pd.DataFrame(sector, index = res.index, columns = ["sector"])
                print("added : ", pair)
            else:
                res=res.append(res1)
                sec = sec.append(pd.DataFrame(sector, index = [res.index[-1]], columns = ["sector"]))
                print("added : ", pair)
    res = pd.concat([res,sec],axis=1)
    res=res.sort_values(by=["pvalue"],ascending=True)
    b=time.time()
    print("time taken : ", b-a, "sec")
    return res
When nstocks=2, this takes about 263 seconds, but as nstocks increases, the loop takes a lot of time (more than a day).
I collected 'Adj Close' data from Yahoo Finance using pandas_datareader.data;
the index is time and the columns are the tickers.
Any suggestions or help will be appreciated.

I don't know what computer you have, but I would advise you to use some kind of multiprocessing for the loop. I haven't looked very closely at your code, but as far as I can see, res and sec can be moved into shared-memory objects and the individual iterations run in parallel with multiprocessing.
If you have a decent CPU, it can improve performance 4-6 times. If you have access to some kind of HPC, it can do wonders.
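For a sense of scale, here is a back-of-the-envelope estimate. It is only a sketch: it assumes 500 tickers, assumes the cost per combination stays the same as in the 263-second run reported for nstocks=2, and uses a hypothetical helper name (estimated_seconds). It needs Python 3.8+ for math.comb.

```python
import math

def estimated_seconds(n_tickers, nstocks, secs_per_combo, speedup=1.0):
    """Rough runtime: number of combinations times cost per combination."""
    return math.comb(n_tickers, nstocks) * secs_per_combo / speedup

# Assumption: 500 tickers, and the 263 s reported for nstocks=2
# gives a per-combination cost that stays constant as nstocks grows.
secs_per_combo = 263 / math.comb(500, 2)

for nstocks in (2, 4, 6):
    days = estimated_seconds(500, nstocks, secs_per_combo, speedup=6) / 86400
    print(f"nstocks={nstocks}: about {days:.3g} days with a 6x speedup")
```

Under these assumptions the number of combinations, rather than the per-iteration cost, is what dominates, so parallelism on its own may not be enough for larger portfolios; shrinking the candidate set before enumerating (for example, by sector) is likely to matter more.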

I'd recommend using a profiler to narrow down the most time consuming calls, and the number of loops (does your loop make the expected number of passes?). Python 3 has a profiler in the standard library:
https://docs.python.org/3.6/library/profile.html
You can either invoke it in your code:
import cProfile
cProfile.run('your_function(inputs)')
Or if a script is an easier entrypoint:
python -m cProfile [-o output_file] [-s sort_order] your-script.py
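If you want the report sorted and trimmed from inside your code, a small wrapper around cProfile and pstats can do it. This is a minimal sketch; profile_call is a hypothetical helper name, not part of the standard library.

```python
import cProfile
import io
import pstats

def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()

# Example: profile a cheap built-in call and show the report.
total, report = profile_call(sum, range(1000))
print(report)
```

For the question above, the call would look something like profile_call(coint, dataFrame, nstocks=2), which should show whether the regressions, the ADF tests, or the DataFrame appends take most of the time.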

Related

Python Monte Carlo Simulation end value not looping

I am trying to build a simulation of an investment portfolio where there is flexibility to adjust a couple of manual inputs. I have the returns simulation running fine, but I can't figure out how to loop the ending value for previous period to equal the beginning value for the next period.
def mcs(n_years = 10, n_scenarios=10, mu=0.07, sigma=0.15, steps_per_year=12, balance=100,
        expenses=10, personal_income=5):
    dt = 1/steps_per_year
    n_steps = int(n_years*steps_per_year) + 1
    beginning_value=balance
    ending_value=balance
    for i in range(n_steps):
        rets_mcs = np.random.normal(loc=(1+mu)**dt-1, scale=(sigma*np.sqrt(dt)), size=(n_steps, n_scenarios))
        ending_value = balance + balance*rets_mcs - expenses + personal_income
    df = pd.DataFrame(data=ending_value,index=range(n_steps))
    df.iloc[0]=balance
    return df
I keep ending up with results based on the original value. Any code help or resources would be appreciated.
I managed to solve this problem using the code below. This previous post was helpful (Dataframe with Monte Carlo Simulation calculation next row Problem)
n_years = 10
n_scenarios=2
mu=0.07
sigma=0.15
steps_per_year=12
s_0=100
expenses=10
personal_income=5
inflation=0
wage_growth=0
output = []
dt = 1/steps_per_year
n_steps = int(n_years*steps_per_year) + 1
for i in range(n_steps):
    rets_ngbm = np.random.normal(loc=(1+mu)**dt-1, scale=(sigma*np.sqrt(dt)), size=n_scenarios)
    ending_value = s_0 + s_0*rets_ngbm - expenses + personal_income
    output.append(ending_value)
    s_0 = ending_value
df = pd.DataFrame(data=output,index=range(n_steps),columns=range(n_scenarios))
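The same recurrence can be wrapped in a function so the inputs stay adjustable. This is a sketch under a few assumptions: the name simulate_balances and the seed parameter are not in the original, and it uses NumPy's Generator API instead of np.random.normal so runs can be reproduced.

```python
import numpy as np
import pandas as pd

def simulate_balances(n_years=10, n_scenarios=2, mu=0.07, sigma=0.15,
                      steps_per_year=12, s_0=100.0, expenses=10.0,
                      personal_income=5.0, seed=None):
    """Simulate balances where each period starts from the previous ending value."""
    rng = np.random.default_rng(seed)
    dt = 1 / steps_per_year
    n_steps = int(n_years * steps_per_year) + 1
    # Draw all per-period returns up front: one row per step, one column per scenario.
    rets = rng.normal(loc=(1 + mu) ** dt - 1, scale=sigma * np.sqrt(dt),
                      size=(n_steps, n_scenarios))
    balances = np.empty((n_steps, n_scenarios))
    current = np.full(n_scenarios, float(s_0))
    for step in range(n_steps):
        # The ending value of this period becomes the beginning value of the next.
        current = current + current * rets[step] - expenses + personal_income
        balances[step] = current
    return pd.DataFrame(balances, index=range(n_steps), columns=range(n_scenarios))

df = simulate_balances(n_years=1, seed=42)
print(df.head())
```

Setting sigma=0 makes every return equal to the drift term, which is a convenient way to check by hand that each row really builds on the previous one.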

Trying to calculate correct returns and set a constraints on the max and min invested in each asset using 'quad_form'

I am trying to hack together some code that looks like it should print out the risk and returns of a portfolio, but the first return is 0.00, and that can't be right. Here's the code that I'm testing.
import pandas as pd
# initialize list of lists
data = [[130000, 150000, 190000, 200000], [100000, 200000, 300000, 900000], [350000, 450000, 890000, 20000], [400000, 10000, 500000, 600000]]
# Create the pandas DataFrame
data = pd.DataFrame(data, columns = ['HOSPITAL', 'HOTEL', 'STADIUM', 'SUBWAY'])
# print dataframe.
data
That gives me this data frame.
symbols = data.columns
# convert daily stock prices into daily returns
returns = data.pct_change()
r = np.asarray(np.mean(returns, axis=1))
r = np.nan_to_num(r)
C = np.asmatrix(np.cov(returns))
C = np.nan_to_num(C)
# print expected returns and risk
for j in range(len(symbols)):
    print ('%s: Exp ret = %f, Risk = %f' %(symbols[j],r[j], C[j,j]**0.5))
The result is this.
The hospital risk and return can't be zero. That doesn't make sense. Something is off here, but I'm not sure what.
Finally, I am trying to optimize the portfolio. So, I hacked together this code.
# Number of variables
n = len(data)
# The variables vector
x = Variable(n)
# The minimum return
req_return = 0.02
# The return
ret = r.T*x
# The risk in xT.Q.x format
risk = quad_form(x, C)
# The core problem definition with the Problem class from CVXPY
prob = Problem(Minimize(risk), [sum(x)==1, ret >= req_return, x >= 0])
try:
    prob.solve()
    print ("Optimal portfolio")
    print ("----------------------")
    for s in range(len(symbols)):
        print (" Investment in {} : {}% of the portfolio".format(symbols[s],round(100*x.value[s],2)))
    print ("----------------------")
    print ("Exp ret = {}%".format(round(100*ret.value,2)))
    print ("Expected risk = {}%".format(round(100*risk.value**0.5,2)))
except:
    print ("Error")
It seems to run but I don't know how to add a constraint. I want to invest at least 5% in every asset and don't invest more than 40% in any one asset. How can I add a constraint to do that?
The idea comes from this link.
https://tirthajyoti.github.io/Notebooks/Portfolio_optimization.html
Following the approach in the link, they skip the NaN row of the monthly returns DataFrame and, after converting the returns to a matrix, transpose it. That transpose is the step you are missing, which is why the return and risk come out as 0 for HOSPITAL. You might want to use this line, C = np.asmatrix(np.cov(returns.dropna().transpose())), which also skips the first NaN row. This should give you the correct return and risk values.
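As a rough illustration of that fix, the sketch below computes one expected return and one variance per asset by averaging down the columns and dropping the NaN row. It swaps in pandas' own mean() and cov() rather than np.mean and np.cov, which is an assumption of this example, not what the linked notebook does.

```python
import pandas as pd

data = pd.DataFrame(
    [[130000, 150000, 190000, 200000],
     [100000, 200000, 300000, 900000],
     [350000, 450000, 890000, 20000],
     [400000, 10000, 500000, 600000]],
    columns=["HOSPITAL", "HOTEL", "STADIUM", "SUBWAY"],
)

# pct_change leaves NaN in the first row; drop it before any statistics.
returns = data.pct_change().dropna()

# One mean per asset (column), not one per time step (row).
r = returns.mean(axis=0)

# DataFrame.cov treats columns as variables, so no transpose is needed.
C = returns.cov()

for name in data.columns:
    print(f"{name}: Exp ret = {r[name]:f}, Risk = {C.loc[name, name] ** 0.5:f}")
```

With only three return observations per asset the estimates are very noisy, so the printed numbers are only useful for checking that nothing is zero or NaN.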
As for your second question, I had a quick glance at the class definition of the CVXPY Problem class and did not see a separate method for adding constraints after the problem is built. Constraints are instead passed in the list given to Problem, as your code already does with x >= 0, so bounds such as x >= 0.05 and x <= 0.40 would go in that list.
As a workaround, you might try taking the outputs, capping any investment at 40%, and distributing the remainder among the other ventures proportionally. For example, say it tells you to invest 5%, 80% and 15% of your assets in A, B and C. You could cap the investment in B at 40% and put the remaining 40% into the others: (5/(5+15))*40 = 10% more into A and (15/(5+15))*40 = 30% of your total investment more into C.
DISCLAIMER: I am not an expert in finance; I am just stating my opinion.
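The cap-and-redistribute workaround can be sketched in plain Python. Note that a single pass on the 5/80/15 example gives 15/40/45, which pushes C over the 40% cap, so this sketch repeats the step until nothing exceeds the cap. The function name cap_and_redistribute is hypothetical, it does not enforce the 5% minimum, and the result is generally not the minimum-risk portfolio the optimizer would find.

```python
def cap_and_redistribute(weights, cap, max_rounds=100):
    """Cap each weight at `cap`, spreading the excess over the uncapped
    assets in proportion to their current weights."""
    w = dict(weights)
    capped = set()
    for _ in range(max_rounds):
        over = [k for k, v in w.items() if v > cap + 1e-12]
        if not over:
            break
        excess = sum(w[k] - cap for k in over)
        for k in over:
            w[k] = cap
        capped.update(over)
        # Only assets that have never been capped may receive the excess.
        free = [k for k in w if k not in capped]
        free_total = sum(w[k] for k in free)
        if not free or free_total == 0:
            raise ValueError("cap is too low to hold the full portfolio")
        for k in free:
            w[k] += excess * w[k] / free_total
    return w

print(cap_and_redistribute({"A": 5, "B": 80, "C": 15}, cap=40))
```

On the example from the answer, the first round produces 15/40/45 and the second round moves the extra 5 points from C to A.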

Pandas- locate a value based on logical statements

I am using this dataset for a project.
I am trying to find the total yield for each inverter over the 34-day duration of the dataset (basically, the difference between the final and initial values available for each inverter). I have been able to get the list of inverters using pd.unique() (there are 22 inverters for each solar power plant).
I am having trouble querying the total_yield data for each inverter.
Here is what I have tried:
def get_yields(arr: np.ndarray, df:pd.core.frame.DataFrame) -> np.ndarray:
    delta = np.zeros(len(arr))
    index =0
    for i in arr:
        initial = df.loc[df["DATE_TIME"]=="15-05-2020 02:00"]
        initial = initial.loc[initial["INVERTER_ID"]==i]
        initial.reset_index(inplace=True,drop=True)
        initial = initial.at[0,"TOTAL_YIELD"]
        final = df.loc[(df["DATE_TIME"]=="17-06-2020 23:45")]
        final = final.loc[final["INVERTER_ID"]==i]
        final.reset_index(inplace=True, drop=True)
        final = final.at[0,"TOTAL_YIELD"]
        delta[index] = final - initial
        index = index + 1
    return delta
Reference: arr is the array of inverters, listed below. df is the generation dataframe for each plant.
The problem is that not every inverter has a data point for each interval. This makes this function only work for the inverters at the first plant, not the second one.
My second approach was to filter by the inverter first, then take the first and last data points. But I get an error- 'Series' objects are mutable, thus they cannot be hashed
Here is the code for that so far:
def get_yields2(arr: np.ndarray, df: pd.core.frame.DataFrame) -> np.ndarray:
    delta = np.zeros(len(arr))
    index = 0
    for i in arr:
        initial = df.loc(df["INVERTER_ID"] == i)
        index += 1
        break
    return delta
List of inverters at plant 1 for reference(labeled as SOURCE_KEY):
['1BY6WEcLGh8j5v7' '1IF53ai7Xc0U56Y' '3PZuoBAID5Wc2HD' '7JYdWkrLSPkdwr4'
'McdE0feGgRqW7Ca' 'VHMLBKoKgIrUVDU' 'WRmjgnKYAwPKWDb' 'ZnxXDlPa8U1GXgE'
'ZoEaEvLYb1n2sOq' 'adLQvlD726eNBSB' 'bvBOhCH3iADSZry' 'iCRJl6heRkivqQ3'
'ih0vzX44oOqAx2f' 'pkci93gMrogZuBj' 'rGa61gmuvPhdLxV' 'sjndEbLyjtCKgGv'
'uHbuxQJl8lW7ozc' 'wCURE6d3bPkepu2' 'z9Y9gH1T5YWrNuG' 'zBIq5rxdHJRwDNY'
'zVJPv84UY57bAof' 'YxYtjZvoooNbGkE']
List of inverters at plant 2:
['4UPUqMRk7TRMgml' '81aHJ1q11NBPMrL' '9kRcWv60rDACzjR' 'Et9kgGMDl729KT4'
'IQ2d7wF4YD8zU1Q' 'LYwnQax7tkwH5Cb' 'LlT2YUhhzqhg5Sw' 'Mx2yZCDsyf6DPfv'
'NgDl19wMapZy17u' 'PeE6FRyGXUgsRhN' 'Qf4GUc1pJu5T6c6' 'Quc1TzYxW2pYoWX'
'V94E5Ben1TlhnDV' 'WcxssY2VbP4hApt' 'mqwcsP2rE7J0TFp' 'oZ35aAeoifZaQzV'
'oZZkBaNadn6DNKz' 'q49J1IKaHRwDQnt' 'rrq4fwE8jgrTyWY' 'vOuJvMaM2sgwLmb'
'xMbIugepa2P7lBB' 'xoJJ8DcxJEcupym']
Thank you very much.
I can't download the dataset to test this. I'm getting a "Too Many Requests" error.
However, you should be able to do this with a groupby.
import pandas as pd
result = df.groupby('INVERTER_ID')['TOTAL_YIELD'].agg(['max','min'])
result['delta'] = result['max']-result['min']
print(result[['delta']])
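If TOTAL_YIELD is not guaranteed to be at its minimum on the first reading, a variant of the same groupby idea is to sort by time and take the first and last reading per inverter. The sketch below uses a tiny made-up frame (the IDs and values are invented for illustration), and yield_per_inverter is a hypothetical helper name.

```python
import pandas as pd

def yield_per_inverter(df):
    """Return first reading, last reading and their difference per inverter."""
    ordered = df.sort_values("DATE_TIME")
    result = ordered.groupby("INVERTER_ID")["TOTAL_YIELD"].agg(["first", "last"])
    result["delta"] = result["last"] - result["first"]
    return result

# Made-up sample: inverter B has no reading at the first timestamp.
sample = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(
        ["17-06-2020 23:45", "15-05-2020 02:00", "15-05-2020 02:15", "17-06-2020 23:45"],
        format="%d-%m-%Y %H:%M",
    ),
    "INVERTER_ID": ["A", "A", "B", "B"],
    "TOTAL_YIELD": [150.0, 100.0, 200.0, 260.0],
})

print(yield_per_inverter(sample))
```

Because each inverter is handled within its own group, it does not matter that some inverters are missing readings at particular timestamps.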
So if I'm understanding this right, what you want is the TOTAL_YIELD for each inverter over the period starting 15-05-2020 02:00 and ending 17-06-2020 23:45. Try this:
# assumes DATE_TIME has already been converted with pd.to_datetime
start = pd.to_datetime('15-05-2020 02:00:00', dayfirst=True)
end = pd.to_datetime('17-06-2020 23:45:00', dayfirst=True)
delta = np.zeros(len(arr))
# enumerate lets you have an index value along with iterating through the array
for i, code in enumerate(arr):
    # to filter the info to between the two dates, but not necessarily assuming that
    # each inverter's data starts and ends at each date
    inverter_df = df.loc[(df['DATE_TIME'] >= start) & (df['DATE_TIME'] <= end)]
    inverter_df = inverter_df.loc[inverter_df["INVERTER_ID"] == code]
    # sort by date
    inverter_df = inverter_df.sort_values(by='DATE_TIME')
    # grab TOTAL_YIELD at the first available date
    initial = inverter_df['TOTAL_YIELD'].iloc[0]
    # grab TOTAL_YIELD at the last available date
    final = inverter_df['TOTAL_YIELD'].iloc[-1]
    delta[i] = final - initial

Data analysis optimization

I have a dataset with cars being spotted on two different cameras. I need to calculate the average time it takes to travel from camera 1 to 2. The data looks like this:
"ID","PLATE", "CSPOTID", "INOUT_FLAG","EVENT_TIME"
"33173","xys8","1","0","2020-08-27 08:24:53"
"33174","asd4","1","0","2020-08-27 08:24:58"
"33175","------","2","1","2020-08-27 08:25:03"
"33176","asd4","1","0","2020-08-27 08:25:04"
"33177","ghj1","1","0","2020-08-27 08:25:08"
...
Currently my code works as intended and calculates the average time between different rows. But working with big data and having a flow of incoming data, it takes too much time.
import numpy as np, matplotlib.pyplot as plt, pandas as pd,collections, sys, operator, datetime
df = pd.read_csv('tmetrics_base2.csv', quotechar='"', skipinitialspace=True, delimiter=',', dtype={"ID": int, "PLATE": "string", "CSPOTID": int, "INOUT_FLAG": int,"EVENT_TIME": "string"})
data = df.as_matrix()
#Sort values by PLATE
dfSortedByPlate = df.sort_values(['PLATE', 'EVENT_TIME'])
#List for already tested PLATEs
TestedPlate = []
resultList = []
#Iterate through all rows in db
for i,j in dfSortedByPlate.iterrows():
# If PLATE IS "------" = skip it
if j[1] == "------":
continue
if j[1] in TestedPlate:
continue
TestedPlate.append(j[1])
for ii,jj in dfSortedByPlate.iterrows():
if j[1] != jj[1]:
continue
if j[1] == jj[1]:
dt1 = datetime.datetime.strptime(jj[4],'%Y-%m-%d %H:%M:%S')
dt2 = datetime.datetime.strptime(j[4],'%Y-%m-%d %H:%M:%S')
Travel_time = []
Travel_time.append((dt1 - dt2).total_seconds())
# Discard if greater than 1 hour and less than 3min
if (dt1 - dt2).total_seconds() < 3000 and (dt1 - dt2).total_seconds() > 180:
resultList.append((dt1 - dt2).total_seconds())
#print((dt1 - dt2).total_seconds())
print(sum(resultList) / len(resultList))
placeholdertime = jj[4]
I have sorted the database by plate number so that the comparison should be fairly quick. Any advice or pointers on where I could increase run speed would be greatly appreciated.
Also, I am unsure how long I should expect calculations like these to take. I don't have experience with data at this scale.
You can speed up your code by removing unnecessary for loops. You can use in-built pandas functions that are typically faster than iterating through rows in the df. For instance, you can replace the two for loops by:
#get only relevant plates
df_relevant = dfSortedByPlate[dfSortedByPlate['PLATE'] != "------"]
#test relevant plates
for i,j in df_relevant.iterrows():
df_same_plate_j = df_relevant[df_relevant['PLATE'] == j[1]]
for ii, jj in df_same_plate_j.iterrows():
dt1 = datetime.datetime.strptime(jj[4],'%Y-%m-%d %H:%M:%S')
dt2 = datetime.datetime.strptime(j[4],'%Y-%m-%d %H:%M:%S')
Travel_time = []
Travel_time.append((dt1 - dt2).total_seconds())
# Discard if greater than 1 hour and less than 3min
if (dt1 - dt2).total_seconds() < 3000 and (dt1 - dt2).total_seconds() > 180:
resultList.append((dt1 - dt2).total_seconds())
#print((dt1 - dt2).total_seconds())
print(sum(resultList) / len(resultList))
placeholdertime = jj[4]
df_relevant now contains all plates that you want to test. Then, df_same_plate_j gets the rows in df_relevant that have the same plate as row j. Then you do the rest. This way, the number of items you are iterating over is much less.
Just a few suggestions:
Read only what you need:
df = pd.read_csv('data_raw.csv',
quotechar='"',
skipinitialspace=True,
delimiter=',',
usecols=['PLATE', 'EVENT_TIME'],
index_col=['PLATE'])
Convert the EVENT_TIME column to datetime (you don't have to do that row by row):
df['EVENT_TIME'] = pd.to_datetime(df['EVENT_TIME'])
Sort (you already did that):
df.sort_index(inplace=True)
df.sort_values(by='PLATE', inplace=True)
Fetch the plates, excluding the one that isn't needed:
plates = set(df.index).difference({"------"})
Process the plate-chunks:
for plate in plates:
print(df.loc[plate])
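Going one step further, the per-plate loop can be replaced by a groupby. The sketch below mirrors the question's logic as I read it (elapsed time from each plate's first sighting to every later sighting, keeping values between 180 and 3000 seconds); the function name mean_travel_time and the sample rows are made up for illustration.

```python
import pandas as pd

def mean_travel_time(df, min_s=180, max_s=3000, skip_plate="------"):
    """Average elapsed seconds between a plate's first sighting and its later ones."""
    seen = df[df["PLATE"] != skip_plate].copy()
    seen["EVENT_TIME"] = pd.to_datetime(seen["EVENT_TIME"])
    # First sighting of each plate, broadcast back to every row of that plate.
    first_seen = seen.groupby("PLATE")["EVENT_TIME"].transform("min")
    elapsed = (seen["EVENT_TIME"] - first_seen).dt.total_seconds()
    # Same thresholds as the question's code: keep 180 s < elapsed < 3000 s.
    kept = elapsed[(elapsed > min_s) & (elapsed < max_s)]
    return kept.mean()

# Made-up sample rows in the same layout as the question's data.
sample = pd.DataFrame({
    "PLATE": ["aaa1", "aaa1", "bbb2", "bbb2", "------", "ccc3", "ccc3"],
    "EVENT_TIME": [
        "2020-08-27 08:00:00", "2020-08-27 08:05:00",
        "2020-08-27 08:00:00", "2020-08-27 08:10:00",
        "2020-08-27 08:01:00",
        "2020-08-27 08:00:00", "2020-08-27 08:00:30",
    ],
})

print(mean_travel_time(sample))
```

This avoids both iterrows loops and parses the timestamps once for the whole column. It does not check CSPOTID, so if the direction of travel between the two cameras matters, that filter would still need to be added.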

Combining multiple functions through loops and parameters

UPDATE: My question has been fully answered, and I have applied jarmod's answer to my program. Although the code looks neater, it has not affected how quickly my graph appears (I plot this data using matplotlib). I am a little confused about why my program runs slowly and how I can speed it up (it takes about 30 seconds, and I know this portion of the code is what slows it down). I have shown my real code in the second block of code. The speed is also strongly determined by the Range I set; with a short range it is quite fast.
I have this sample code that shows the calculation I need for forecasting and extracting values. I use for loops to run through a specific range of CSV files that I labeled 1-100. I return a number for each month (1-12) to get the average forecast accuracy for a forecast made a given number of months ahead.
My full code includes 12 functions for a full-year forecast, but I feel the code is inefficient because the functions are nearly identical except for one number, and reading the CSV files so many times slows the program.
Is there a way I can combine these functions, perhaps by adding another parameter? My biggest concern is that it would be hard to return separate numbers and categorize them. Ideally, I would like a single function for all 12 monthly accuracy predictions. The only way I can see to do that is to add another parameter and another loop, but I have no idea how to go about that or whether it is possible. Essentially, I would like to store all the values of onemonthaccuracy (which goes into the file before the current file and compares the predicted value for the date associated with the current file), then all the values of twomonthaccuracy, and so on, so I can later use these variables for graphing and other purposes.
import csv
import pandas as pd
def onemonthaccuracy(basefilenumber):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='latin-1')
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    onemonthread = pd.read_csv(str(basefilenumber-1)+'.csv', encoding='latin-1')
    onemonthvalue = onemonthread.loc[onemonthread['Customer'].str.contains('Customer A', na=False),'Jun-16\nQty']
    onetotal = int(onemonthvalue)/int(basefilevalue)
    return onetotal
def twomonthaccuracy(basefilenumber):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='Latin-1')
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    twomonthread = pd.read_csv(str(basefilenumber-2)+'.csv', encoding = 'Latin-1')
    twomonthvalue = twomonthread.loc[twomonthread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    twototal = int(twomonthvalue)/int(basefilevalue)
    return twototal
onetotal = 0
twototal = 0
onetotallist = []
twototallist = []
for basefilenumber in range(24,36):
    onetotal += onemonthaccuracy(basefilenumber)
    twototal += twomonthaccuracy(basefilenumber)
    onetotallist.append(onemonthaccuracy(basefilenumber))
    twototallist.append(twomonthaccuracy(basefilenumber))
onetotalpermonth = onetotal/12
twototalpermonth = twototal/12
x = [1,2]
y = [onetotalpermonth, twototalpermonth]
z = [1,2]
w = [(onetotallist),(twototallist)]
for ze, we in zip(z, w):
    plt.scatter([ze] * len(we), we, marker='D', s=5)
plt.scatter(x,y)
plt.show()
This is the real block of code I am using in my program, perhaps something is slowing it down that I am unaware of?
#other parts of code
#StartRange = yearvalue+Value
#EndRange = endValue + endyearvalue
#Range = EndRange - StartRange
# Department
#more code....
def nmonthaccuracy(basefilenumber, n):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='Latin-1')
    baseheader = getfileheader(basefilenumber)
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains(Department, na=False), baseheader]
    nmonthread = pd.read_csv(str(basefilenumber-n)+'.csv', encoding = 'Latin-1')
    nmonthvalue = nmonthread.loc[nmonthread['Customer'].str.contains(Department, na=False), baseheader]
    return (1-(int(basefilevalue)/int(nmonthvalue))+1) if int(nmonthvalue) > int(basefilevalue) else int(nmonthvalue)/int(basefilevalue)
N = 13
total = [0] * N
total_by_month_list = [[] for _ in range(N)]
for basefilenumber in range(int(StartRange),int(EndRange)):
    for n in range(N):
        total[n] += nmonthaccuracy(basefilenumber, n)
        total_by_month_list[n].append(nmonthaccuracy(basefilenumber,n))
onetotal=total[1]/ Range
twototal=total[2]/ Range
threetotal=total[3]/ Range
fourtotal=total[4]/ Range
fivetotal=total[5]/ Range #... all the way to 12
onetotallist=total_by_month_list[1]
twototallist=total_by_month_list[2]
threetotallist=total_by_month_list[3]
fourtotallist=total_by_month_list[4]
fivetotallist=total_by_month_list[5] #... all the way to 12
# alot more code after this
Something like this:
def nmonthaccuracy(basefilenumber, n):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='Latin-1')
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Lam DepT', na=False), 'Jun-16\nQty']
    nmonthread = pd.read_csv(str(basefilenumber-n)+'.csv', encoding = 'Latin-1')
    nmonthvalue = nmonthread.loc[nmonthread['Customer'].str.contains('Lam DepT', na=False), 'Jun-16\nQty']
    return int(nmonthvalue)/int(basefilevalue)
N = 2
total_by_month = [0] * N
total_aggregate = 0
for basefilenumber in range(20,30):
    for n in range(N):
        a = nmonthaccuracy(basefilenumber, n)
        total_by_month[n] += a
        total_aggregate += a
In case you are wondering what the following code does:
N = 2
total_by_month = [0] * N
It sets N to the number of months desired (2, but you could make it 12 or another value) and then creates a total_by_month list that can store N results, one per month. The list is initialized to all zeroes (N zeroes) so that each of the N monthly totals starts at zero.
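On the speed issue raised in the update: every call to nmonthaccuracy reads two CSV files, and the real code calls it twice per (file, n) pair, so the same files are parsed many times. One possible remedy, sketched below, is to cache the value extracted from each file so it is read at most once. The helper names (make_cached_reader, accuracy_table) are hypothetical, the reader function is something you would supply, and the ratio shown is the simple one from the answer above rather than the conditional formula in the real code.

```python
from functools import lru_cache

def make_cached_reader(read_fn, maxsize=None):
    """Wrap a one-argument reader so each file number is read at most once."""
    return lru_cache(maxsize=maxsize)(read_fn)

def accuracy_table(file_numbers, n_months, get_value):
    """Return {n: [ratio per base file]} for n = 1..n_months.

    get_value(file_number) must return the numeric value for that file;
    the ratio is value(base - n) / value(base).
    """
    table = {n: [] for n in range(1, n_months + 1)}
    for base in file_numbers:
        base_value = get_value(base)
        for n in table:
            table[n].append(get_value(base - n) / base_value)
    return table

# Hypothetical wiring (not run here): read_value would open str(k) + '.csv'
# with pd.read_csv and return the integer for the chosen customer and column.
#   get_value = make_cached_reader(read_value)
#   table = accuracy_table(range(24, 36), 12, get_value)
```

Each list in the returned dictionary can be averaged for the per-month totals or passed straight to plt.scatter, so the twelve separate variables are no longer needed.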
