Combining multiple functions through loops and parameters - python

UPDATE: My question has been fully answered and I have applied jarmod's answer to my program. Although the code looks neater, it has not affected the speed at which my graph appears (I plot this data using matplotlib). I am a little confused about why my program runs slowly and how I can increase the speed (it takes about 30 seconds, and I know this portion of the code is slowing it down). I have shown my real code in the second block of code below. Also, the speed is strongly determined by the range I set; with a short range it is quite fast.
The sample code below shows the calculation I need for forecasting and extracting values. I use the for loops to run through a specific range of CSV files that I labeled 1-100. I return numbers for each month (1-12) to get the forecasting average for a given number of months.
My full code includes 12 functions for a full-year forecast, but I feel the code is inefficient because the functions are nearly identical except for one number, and reading the CSV files so many times slows the program down.
Is there a way I can combine these functions, perhaps by adding another parameter? My biggest concern was that it would be hard to return separate numbers and categorize them. Ideally I would like only one function for all 12 monthly accuracy predictions; the way I can see to do that would be to add another parameter and another loop, but I have no idea how to go about that or whether it is possible. Essentially, I would like to store all the values of onemonthaccuracy (which goes to the file before the current file and compares the predicted value for the date associated with the current file), then store all the values of twomonthaccuracy, and so on, so I can later use these variables for graphing and other purposes.
import csv
import pandas as pd
import matplotlib.pyplot as plt

def onemonthaccuracy(basefilenumber):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='latin-1')
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    onemonthread = pd.read_csv(str(basefilenumber-1)+'.csv', encoding='latin-1')
    onemonthvalue = onemonthread.loc[onemonthread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    onetotal = int(onemonthvalue)/int(basefilevalue)
    return onetotal

def twomonthaccuracy(basefilenumber):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='Latin-1')
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    twomonthread = pd.read_csv(str(basefilenumber-2)+'.csv', encoding='Latin-1')
    twomonthvalue = twomonthread.loc[twomonthread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    twototal = int(twomonthvalue)/int(basefilevalue)
    return twototal

onetotal = 0
twototal = 0
onetotallist = []
twototallist = []
for basefilenumber in range(24, 36):
    # compute each accuracy once and reuse it; calling the function again
    # for the list (originally with an undefined variable i) doubles the CSV reads
    oneaccuracy = onemonthaccuracy(basefilenumber)
    twoaccuracy = twomonthaccuracy(basefilenumber)
    onetotal += oneaccuracy
    twototal += twoaccuracy
    onetotallist.append(oneaccuracy)
    twototallist.append(twoaccuracy)
onetotalpermonth = onetotal/12
twototalpermonth = twototal/12

x = [1, 2]
y = [onetotalpermonth, twototalpermonth]
z = [1, 2]
w = [onetotallist, twototallist]
for ze, we in zip(z, w):
    plt.scatter([ze] * len(we), we, marker='D', s=5)
plt.scatter(x, y)
plt.show()
This is the real block of code I am using in my program; perhaps something I am unaware of is slowing it down?
#other parts of code
#StartRange = yearvalue+Value
#EndRange = endValue + endyearvalue
#Range = EndRange - StartRange
# Department
#more code....
def nmonthaccuracy(basefilenumber, n):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='Latin-1')
    baseheader = getfileheader(basefilenumber)
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains(Department, na=False), baseheader]
    nmonthread = pd.read_csv(str(basefilenumber-n)+'.csv', encoding='Latin-1')
    nmonthvalue = nmonthread.loc[nmonthread['Customer'].str.contains(Department, na=False), baseheader]
    return (1-(int(basefilevalue)/int(nmonthvalue))+1) if int(nmonthvalue) > int(basefilevalue) else int(nmonthvalue)/int(basefilevalue)

N = 13
total = [0] * N
total_by_month_list = [[] for _ in range(N)]
for basefilenumber in range(int(StartRange), int(EndRange)):
    for n in range(N):
        # call nmonthaccuracy once per (file, n) pair; the original code called it
        # twice (once for the total, once for the list), doubling the CSV reads
        value = nmonthaccuracy(basefilenumber, n)
        total[n] += value
        total_by_month_list[n].append(value)
onetotal = total[1] / Range
twototal = total[2] / Range
threetotal = total[3] / Range
fourtotal = total[4] / Range
fivetotal = total[5] / Range  # ... all the way to 12
onetotallist = total_by_month_list[1]
twototallist = total_by_month_list[2]
threetotallist = total_by_month_list[3]
fourtotallist = total_by_month_list[4]
fivetotallist = total_by_month_list[5]  # ... all the way to 12
# a lot more code after this

Something like this:
def nmonthaccuracy(basefilenumber, n):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='Latin-1')
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Lam DepT', na=False), 'Jun-16\nQty']
    nmonthread = pd.read_csv(str(basefilenumber-n)+'.csv', encoding='Latin-1')
    nmonthvalue = nmonthread.loc[nmonthread['Customer'].str.contains('Lam DepT', na=False), 'Jun-16\nQty']
    return int(nmonthvalue)/int(basefilevalue)

N = 2
total_by_month = [0] * N
total_aggregate = 0
for basefilenumber in range(20, 30):
    for n in range(N):
        a = nmonthaccuracy(basefilenumber, n)
        total_by_month[n] += a
        total_aggregate += a
In case you are wondering what the following code does:
N = 2
total_by_month = [0] * N
It sets N to the number of months desired (2 here, but you could make it 12 or another value) and then creates a total_by_month list of N zeroes, one running total per month, so that each monthly total starts at zero.
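Regarding the UPDATE about speed: each call to nmonthaccuracy reads two CSV files from disk, and the loop calls it for every (file, n) combination, so the same files get re-read many times per run. A minimal sketch of one way around that, assuming the files fit comfortably in memory, is to cache the reads with functools.lru_cache (readfile is a helper introduced here, not part of the original code):

from functools import lru_cache

import pandas as pd

@lru_cache(maxsize=None)
def readfile(filenumber):
    # each distinct CSV is read from disk at most once per run;
    # later calls return the cached DataFrame (so don't mutate it)
    return pd.read_csv(str(filenumber) + '.csv', encoding='Latin-1')

def nmonthaccuracy(basefilenumber, n):
    basefileread = readfile(basefilenumber)
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    nmonthread = readfile(basefilenumber - n)
    nmonthvalue = nmonthread.loc[nmonthread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    return int(nmonthvalue) / int(basefilevalue)

The same @lru_cache trick applies unchanged to the real nmonthaccuracy that uses Department and getfileheader; with it, a run does one disk read per file instead of on the order of 13 × Range reads, which is very likely where most of the 30 seconds is going.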

Related

Python Monte Carlo Simulation end value not looping

I am trying to build a simulation of an investment portfolio where there is flexibility to adjust a couple of manual inputs. I have the returns simulation running fine, but I can't figure out how to carry the ending value of one period forward as the beginning value of the next period.
import numpy as np
import pandas as pd

def mcs(n_years=10, n_scenarios=10, mu=0.07, sigma=0.15, steps_per_year=12, balance=100,
        expenses=10, personal_income=5):
    dt = 1/steps_per_year
    n_steps = int(n_years*steps_per_year) + 1
    beginning_value = balance
    ending_value = balance
    for i in range(n_steps):
        rets_mcs = np.random.normal(loc=(1+mu)**dt-1, scale=(sigma*np.sqrt(dt)), size=(n_steps, n_scenarios))
        ending_value = balance + balance*rets_mcs - expenses + personal_income
    df = pd.DataFrame(data=ending_value, index=range(n_steps))
    df.iloc[0] = balance
    return df
I keep ending up with results based off the original value. Any code help or resources would be appreciated.
Managed to solve this problem using the code below; the key change is assigning s_0 = ending_value at the end of each step, so each period begins where the previous one ended. This previous post was helpful (Dataframe with Monte Carlo Simulation calculation next row Problem).
import numpy as np
import pandas as pd

n_years = 10
n_scenarios = 2
mu = 0.07
sigma = 0.15
steps_per_year = 12
s_0 = 100
expenses = 10
personal_income = 5
inflation = 0
wage_growth = 0
output = []
dt = 1/steps_per_year
n_steps = int(n_years*steps_per_year) + 1
for i in range(n_steps):
    rets_ngbm = np.random.normal(loc=(1+mu)**dt-1, scale=(sigma*np.sqrt(dt)), size=n_scenarios)
    ending_value = s_0 + s_0*rets_ngbm - expenses + personal_income
    output.append(ending_value)
    s_0 = ending_value  # carry the ending value forward as next period's beginning value
df = pd.DataFrame(data=output, index=range(n_steps), columns=range(n_scenarios))
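If you want this back inside a function with the original mcs signature, a minimal sketch of the same carry-forward logic (defaults taken from the code above) might look like this:

import numpy as np
import pandas as pd

def mcs(n_years=10, n_scenarios=2, mu=0.07, sigma=0.15, steps_per_year=12,
        balance=100, expenses=10, personal_income=5):
    dt = 1 / steps_per_year
    n_steps = int(n_years * steps_per_year) + 1
    s_0 = balance
    output = []
    for _ in range(n_steps):
        # one random draw per scenario for this step
        rets = np.random.normal(loc=(1 + mu)**dt - 1, scale=sigma * np.sqrt(dt), size=n_scenarios)
        ending_value = s_0 + s_0 * rets - expenses + personal_income
        output.append(ending_value)
        s_0 = ending_value  # next period starts from this period's ending value
    return pd.DataFrame(data=output, index=range(n_steps), columns=range(n_scenarios))

df = mcs()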

Improving performance of Python for loop?

I am trying to write code to construct a DataFrame consisting of cointegrating pairs of portfolios (the stock prices are cointegrated). In this case, stocks in a portfolio are selected from the S&P 500 and they have equal weights.
Also, for economic reasons, the portfolios must include the same sectors.
For example:
if stocks in one portfolio are from the [IT] and [Financial] sectors, the second portfolio must also select stocks from the [IT] and [Financial] sectors.
There is no correct number of stocks in a portfolio, so I'm considering about 10 to 20 stocks for each of them. However, when it comes to the combinations, this is (500 choose 10), so I have an issue of computation time.
The following is my code:
import itertools
import time

import pandas as pd
import statsmodels.tsa.stattools as ts

def adf(x, y, xName, yName, pvalue=0.01, beta_lower=0.5, beta_upper=1):
    res = pd.DataFrame()
    # note: pd.ols is from older pandas versions (removed in 0.20)
    regress1, regress2 = pd.ols(x=x, y=y), pd.ols(x=y, y=x)
    error1, error2 = regress1.resid, regress2.resid
    test1, test2 = ts.adfuller(error1, 1), ts.adfuller(error2, 1)
    if test1[1] < pvalue and test1[1] < test2[1] and\
            regress1.beta["x"] > beta_lower and regress1.beta["x"] < beta_upper:
        res[(tuple(xName), tuple(yName))] = pd.Series([regress1.beta["x"], test1[1]])
        res = res.T
        res.columns = ["beta", "pvalue"]
        return res
    elif test2[1] < pvalue and regress2.beta["x"] > beta_lower and\
            regress2.beta["x"] < beta_upper:
        res[(tuple(yName), tuple(xName))] = pd.Series([regress2.beta["x"], test2[1]])
        res = res.T
        res.columns = ["beta", "pvalue"]
        return res
    else:
        pass

def coint(dataFrame, nstocks=2, pvalue=0.01, beta_lower=0.5, beta_upper=1):
    # dataFrame = pandas DataFrame, in this case data['Adj Close'], row=time, col=tickers
    # pvalue = level of significance of adf test
    # nstocks = number of stocks considered for adf test (equal weight)
    # if nstocks > 2, coint returns cointegration between portfolios
    # beta_lower = lower bound for slope of linear regression
    # beta_upper = upper bound for slope of linear regression
    a = time.time()
    tickers = dataFrame.columns
    tcomb = itertools.combinations(dataFrame.columns, nstocks)
    res = pd.DataFrame()
    sec = pd.DataFrame()
    for pair in tcomb:
        xName, yName = list(pair[:int(nstocks/2)]), list(pair[int(nstocks/2):])
        xind, yind = tickers.searchsorted(xName), tickers.searchsorted(yName)
        xSector = list(SNP.ix[xind]["Sector"])
        ySector = list(SNP.ix[yind]["Sector"])
        if set(xSector) == set(ySector):
            sector = [[(xSector, ySector)]]
            x, y = dataFrame[list(xName)].sum(axis=1), dataFrame[list(yName)].sum(axis=1)
            res1 = adf(x, y, xName, yName)
            if res1 is None:
                continue
            elif res.size == 0:
                res = res1
                sec = pd.DataFrame(sector, index=res.index, columns=["sector"])
                print("added : ", pair)
            else:
                res = res.append(res1)
                sec = sec.append(pd.DataFrame(sector, index=[res.index[-1]], columns=["sector"]))
                print("added : ", pair)
    res = pd.concat([res, sec], axis=1)
    res = res.sort_values(by=["pvalue"], ascending=True)
    b = time.time()
    print("time taken : ", b-a, "sec")
    return res
When nstocks=2, this takes about 263 seconds, but as nstocks increases, the loop takes a lot of time (more than a day).
I collected 'Adj Close' data from Yahoo Finance using pandas_datareader.data, and the index is time and the columns are different tickers.
Any suggestions or help would be appreciated.
I don't know what computer you have, but I would advise you to use some kind of multiprocessing for the loop. I haven't looked really hard into your code, but as far as I can see, res and sec can be moved into shared memory objects, and the individual loop iterations parallelized with multiprocessing; a sketch follows below.
If you have a decent CPU it can improve the performance 4-6 times. In case you have access to some kind of HPC it can do wonders.
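A minimal sketch of what that might look like, assuming each combination can be tested independently (evaluate_pair is a hypothetical wrapper around the sector check and the adf call, not a function from the original code):

import itertools
from multiprocessing import Pool

def evaluate_pair(pair):
    # hypothetical worker: run the sector check and adf test for one
    # combination, returning a small result or None if the pair fails
    half = len(pair) // 2
    xName, yName = list(pair[:half]), list(pair[half:])
    # ... sector check and adf(x, y, xName, yName) would go here ...
    return (tuple(xName), tuple(yName))  # placeholder result

if __name__ == '__main__':
    tickers = ['AAPL', 'MSFT', 'JPM', 'GS']  # stand-in for dataFrame.columns
    tcomb = itertools.combinations(tickers, 2)
    with Pool() as pool:
        # imap_unordered streams results back as workers finish;
        # a chunksize > 1 cuts inter-process overhead for many small tasks
        results = [r for r in pool.imap_unordered(evaluate_pair, tcomb, chunksize=100)
                   if r is not None]
    print(len(results), "pairs evaluated")

Collecting the per-pair results in the parent and concatenating once at the end also avoids the repeated DataFrame.append calls, which themselves get slower as res grows.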
I'd recommend using a profiler to narrow down the most time-consuming calls, and the number of loops (does your loop make the expected number of passes?). Python 3 has a profiler in the standard library:
https://docs.python.org/3.6/library/profile.html
You can either invoke it in your code:
import cProfile
cProfile.run('your_function(inputs)')
Or if a script is an easier entrypoint:
python -m cProfile [-o output_file] [-s sort_order] your-script.py
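If you stay with the in-code route, the standard-library pstats module can sort and trim the report, for example by cumulative time:

import cProfile
import pstats

cProfile.run('your_function(inputs)', 'profile.out')  # save stats to a file
stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative').print_stats(20)  # top 20 entries by cumulative time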

Student - np.random.choice: How to isolate and tally hit frequency within a np.random.choice range

Currently learning Python and very new to NumPy & Pandas.
I have pieced together a random number generator with a range. It uses NumPy, and I am unable to isolate each individual result to count how many consecutive results fall within a sub-range of my random range.
Goal: Count the iterations of "Random >= 1000" and then add 1 to the appropriate cell that correlates to the tally of iterations. Example in very basic sense:
#Random generator begins... these are first four random generations
Randomiteration0 = 175994 (Random >= 1000)
Randomiteration1 = 1199 (Random >= 1000)
Randomiteration2 = 873399 (Random >= 1000)
Randomiteration3 = 322 (Random < 1000)
#used to +1 to the fourth row of column A in CSV
finalIterationTally = 4
#total times random < 1000 throughout entire session. Placed in cell B1
hits = 1
#Rinse and repeat to custom set generations quantity...
(The logic would then be to +1 to A4 in the spreadsheet. If the iteration tally had been 7, then +1 to A7, etc. So essentially, I am measuring the distance, and the frequency of that distance, between each "hit".)
My current code includes a CSV export portion. I do not need to export each individual random result any longer. I only need to export the frequency of each iteration distance between each hit. This is where I am stumped.
Cheers
import pandas as pd
import numpy as np

# set random generation quantity
generations = int(input("How many generations?\n###:"))

# random range and generator
choices = range(1, 100000)
samples = np.random.choice(choices, size=generations)

# create new column in excel
my_break = 1000000
if generations > my_break:
    n_empty = my_break - generations % my_break
    samples = np.append(samples, [np.nan] * n_empty).reshape((-1, my_break)).T

# export results to CSV
(pd.DataFrame(samples)
 .to_csv('eval_test.csv', index=False, header=False))

# left uncommented if wanting to test 10 generations or so
print(samples)
I believe you are mixing up iterations and generations. It sounds like you want 4 iterations for N generations, but your bottom piece of code does not express the "4" anywhere. If you pull all your variables out to the top of your script it can help you organize better. Pandas is great for parsing complicated CSVs, but for this case you don't really need it. You probably don't even need NumPy.
import numpy as np

THRESHOLD = 1000
CHOICES = 10000
ITERATIONS = 4
GENERATIONS = 100

choices = range(1, CHOICES)
output = np.zeros(ITERATIONS+1)
for _ in range(GENERATIONS):
    samples = np.random.choice(choices, size=ITERATIONS)
    count = sum([1 for x in samples if x > THRESHOLD])
    output[count] += 1
output = map(str, map(int, output.tolist()))
with open('eval_test.csv', 'w') as f:
    f.write(",".join(output)+'\n')
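As a small aside, since samples is already a NumPy array, the per-generation count can also be done with a vectorized comparison instead of the list comprehension; a sketch using the same names as above:

count = int((samples > THRESHOLD).sum())  # boolean hit mask, summed to a count
output[count] += 1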

Efficient netCDF analysis when looping through data

This is a follow-up question related to this question.
Thanks to previous help I have successfully imported a netCDF file (or files with MFDataset) and am able to compare the different times to one another to create another cumulative dataset. Here is a piece of the current code.
from numpy import *
import netCDF4
import os

f = netCDF4.MFDataset('air.2m.1979.nc')
atemp = f.variables['air']
ntimes, ny, nx = atemp.shape
cold_days = zeros((ntimes, ny, nx), dtype=int)

for i in range(ntimes):
    for b in range(ny):
        for c in range(nx):
            if i == 0:  # first time step has no previous day to extend
                if atemp[i, b, c] < 0:
                    cold_days[i, b, c] = 1
                else:
                    cold_days[i, b, c] = 0
            else:
                if atemp[i, b, c] < 0:
                    cold_days[i, b, c] = cold_days[i-1, b, c] + 1
                else:
                    cold_days[i, b, c] = 0
This seems like a brute-force way to get the job done, and though it works it takes a very long time. I'm not sure if that is because I'm dealing with 365 matrices of 349x277 (35,285,645 pixels) or if my old-school brute-force approach is simply slow compared to some built-in Python methods.
Below is an example of what I believe the code is doing. It looks at each time step and increments cold_days if temp < 0; if temp >= 0, then cold_days resets to 0. In the image, the cell at row 2, column 1 increments at each time step that passes, but the cell at row 2, column 2 increments at time 1 and resets to zero at time 2.
Is there a more efficient way to rip through this netCDF dataset to perform this type of operation?
Seems like this is a minor modification -- just write the data out at each time step. Something close to this should work:
from pylab import *
import netCDF4

# open NetCDF input files
f = netCDF4.MFDataset('/usgs/data2/rsignell/models/ncep/narr/air.2m.19??.nc')
# print variables
f.variables.keys()
atemp = f.variables['air']
print(atemp)
ntimes, ny, nx = shape(atemp)
cold_days = zeros((ny, nx), dtype=int)

# create output NetCDF file
nco = netCDF4.Dataset('/usgs/data2/notebook/cold_days.nc', 'w', clobber=True)
nco.createDimension('x', nx)
nco.createDimension('y', ny)
nco.createDimension('time', ntimes)
cold_days_v = nco.createVariable('cold_days', 'i4', ('time', 'y', 'x'))
cold_days_v.units = 'days'
cold_days_v.long_name = 'number of consecutive days below 0 degC'
cold_days_v.grid_mapping = 'Lambert_Conformal'
timeo = nco.createVariable('time', 'f8', ('time',))
lono = nco.createVariable('lon', 'f4', ('y', 'x'))
lato = nco.createVariable('lat', 'f4', ('y', 'x'))
xo = nco.createVariable('x', 'f4', ('x',))
yo = nco.createVariable('y', 'f4', ('y',))
lco = nco.createVariable('Lambert_Conformal', 'i4')

# copy all the variable attributes from original file
for var in ['time', 'lon', 'lat', 'x', 'y', 'Lambert_Conformal']:
    for att in f.variables[var].ncattrs():
        setattr(nco.variables[var], att, getattr(f.variables[var], att))

# copy variable data for time, lon, lat, x and y
timeo[:] = f.variables['time'][:]
lono[:] = f.variables['lon'][:]
lato[:] = f.variables['lat'][:]
xo[:] = f.variables['x'][:]
yo[:] = f.variables['y'][:]

for i in range(ntimes):
    # increment the running streak where temperature (converted to degC) is
    # below zero, and reset the streak to zero everywhere else
    frame = atemp[i, :, :] - 273.15
    cold_days = where(frame < 0, cold_days + 1, 0)
    # write the cold_days data for this time step
    cold_days_v[i, :, :] = cold_days

# copy global attributes from original file
for att in f.ncattrs():
    setattr(nco, att, getattr(f, att))
nco.Conventions = 'CF-1.6'
nco.close()
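To make the increment-and-reset step concrete, here is a tiny self-contained demonstration of the same where-based update on made-up temperatures:

import numpy as np

# three time steps, two grid cells; negative means a cold day
temps = np.array([[-2.0, 1.0],
                  [-1.0, -3.0],
                  [-0.5, 2.0]])
streak = np.zeros(2, dtype=int)
for t in range(temps.shape[0]):
    streak = np.where(temps[t] < 0, streak + 1, 0)  # add 1 where cold, reset elsewhere
    print(streak)  # [1 0], then [2 1], then [3 0]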

How to plot an output of a function in Python?

These three functions give me the progression of the number of customers and their orders from state 0 through the next 365 states (or days). In the function state_evolution, I want to plot the output of the line
custA = float(custA*1.09**(1.0/365))
against the output of the line
A = sum(80 + random.random() * 50 for i in range(ordsA))
and do the same for custB, so I can compare their outputs graphically.
import random

def get_state0():
    """ function gets four columns from base data and finds their state 0 """
    statetype0 = {'custt': {'typeA': 100, 'typeB': 200}}
    orderstype0 = {'orders': {'typeA': 1095, 'typeB': 4380}}
    return {'custtypeA': int(statetype0['custt']['typeA']),
            'custtypeB': int(statetype0['custt']['typeB']),
            'ordstypeA': orderstype0['orders']['typeA'], 'A': 1095, 'B': 4380,
            'ordstypeB': orderstype0['orders']['typeB'],
            'day': 0}

def state_evolution(state):
    """ function takes state 0 and predicts state evolution """
    custA = state['custtypeA']
    custB = state['custtypeB']
    ordsA = state['ordstypeA']
    ordsB = state['ordstypeB']
    A = state['A']
    B = state['B']
    day = state['day']
    # evolve day
    day += 1
    # evolve cust type A
    custA = float(custA*1.09**(1.0/365))
    # evolve cust type B
    custB = float(custB*1.063**(1.0/365))
    # evolve orders cust type A (order_rateA/order_rateB are defined elsewhere)
    ordsA += int(custA * order_rateA(day))
    A = sum(80 + random.random() * 50 for i in range(ordsA))
    # evolve orders cust type B
    ordsB += int(custB * order_rateB(day))
    B = sum(70 + random.random() * 40 for i in range(ordsB))
    return {'custtypeA': custA, 'ordstypeA': ordsA, 'A': A, 'B': B,
            'custtypeB': custB, 'ordstypeB': ordsB, 'day': day}

def show_all_states():
    """ function runs state evolution function to find other states """
    s = get_state0()
    for day in range(365):
        s = state_evolution(s)
        print(day, s)
You should do the following:
Modify your custA function so that it returns a sequence (list, tuple) with, say, 365 items. Alternatively, use custA inside a list comprehension or for loop to get the sequence of 365 results;
Do the same for ordsA function, to get the other sequence.
From now on, you can do different things depending on what you want.
If you want two plots in parallel (superimposed), then:
from matplotlib import pyplot  # needed once at the top

pyplot.plot(custA_result_list)
pyplot.plot(ordsA_result_list)
pyplot.show()
If you want to CORRELATE the data, you can do a scatter plot (faster), or a regular plot with dot markers (slower but more customizable IMO):
pyplot.scatter(custA_result_list, ordsA_result_list)
# or
pyplot.plot(custA_result_list, ordsA_result_list, 'o')
## THIS WILL ONLY WORK IF BOTH SEQUENCES HAVE SAME LENGTH! (e.g. 365 elements each)
At last, if the data were irregularly sampled, you could also provide a sequence for the horizontal axis values. That would allow, for example, to plot only weekdays' results without "collapsing" the weekends (otherwise the gap between friday and monday would look like a single day):
weekdays = [1,2,3,4,5, 8,9,10,11,12, 15,16,17,18,19, 22, ...] # len(weekdays) ~ 260
pyplot.plot(weekdays, custA_result_list);
pyplot.plot(weekdays, ordsA_result_list);
pyplot.show()
Hope this helps!
EDIT: About Excel, if I understand right you ALREADY have a CSV file. Then, you could use the csv Python module, or read it yourself like this:
with open('file.csv') as csv_in:
    content = [line.strip().split(',') for line in csv_in]
Now if you have an actual .xls or .xlsx file, use the xlrd module, which you can install by running pip install xlrd in a command prompt; a small example follows.
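A minimal sketch of reading the first sheet with xlrd, assuming a file named file.xls (note that xlrd 2.0 and later only reads .xls; for .xlsx you would need another reader such as openpyxl):

import xlrd  # pip install xlrd

book = xlrd.open_workbook('file.xls')
sheet = book.sheet_by_index(0)  # first worksheet
# one list per row, with cells already converted to Python values
content = [sheet.row_values(r) for r in range(sheet.nrows)]
print(content[:3])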
