I cannot speed up total calculation time by adding more processors to multiprocessing. It takes just as long to run with 3 processors as with 7 processors.
I've tried chunking the data so each processor works on a much larger set of calculations. Same result.
I've initialized the static data in each process (via the pool initializer) instead of passing it as an argument.
I've tried returning the DataFrame from pool.map vs. writing it out to a file.
I've timed the EvalContract section. With 3 processors, one contract with one scenario takes 35 seconds to complete. With 7 processors, the same contract and scenario takes 55 seconds.
import pandas as pd
import numpy as np
import multiprocessing as mp
import itertools as it
import os
import time
from functools import partial
def initializer(Cont,Scen,RandWD,dicS):
global dfCleanCont
global dfScen
global dfWithdrawalRandom
global dicSensit
dfCleanCont = Cont
dfScen = Scen
dfWithdrawalRandom = RandWD
dicSensit = dicS
def ValueProj(ContScen):
Contract = dfCleanCont.loc[ContScen[0]]
PTS = Contract.name
ProjWDs = dfWithdrawalRandom[Contract['WD_ID']]
dfScenOneSet = dfScen[dfScen["Trial"]==ContScen[1]]
'''Do various projection calculations. All calculation in numpy arrays then converted to DataFrame before returning. Dataframe shape[601,35]'''
return dfContProj
def ReserveProjectionPreprocess(Scen,dfBarclayRates,dicProjOuterSeries,liProjValContract):
Timestep = liProjValContract[0]['Outer_t']
dfInnerLoopScen = SetupInnerLoopScenarios(Timestep,Scen,dicSensit)
BBC = BuildBarclayCurve(Timestep,Scen[Scen['Timestep']==Timestep][dicSensit['irCols']].iloc[0].to_list(),dfBarclayRates.loc[Timestep],dicSensit)
'''Do various inner loop projection calculations, up to 601 timesteps. All calculation in numpy arrays.'''
return pd.Series({'PTS': Contract.name,
'OuterScenNum': ContractValProjOne['OuterScenNum'],
'Outer_t': ContractValProjOne['Outer_t'],
'Reserve': max(PVL-ContractValProjOne['MV']-AssetHaircut,0)})
def EvalContract(liCS):
for CS in liCS:
'''Evaluate single contract with single scenario'''
start_time = time.time()
dfOuterLoop = ValueProj(CS)
Contract = dfCleanCont.loc[CS[0]]
PTS = Contract.name
dfScenOneSet = dfScen[dfScen["Trial"]==CS[1]]
dfOuterLoopCut = dfOuterLoop[dfOuterLoop['BV']!=0][:-1]
MinMVt = dicSensit['ProjectionYrs']*12 if sum(dfOuterLoop[(dfOuterLoop['MV']==0) & (dfOuterLoop['Outer_t']>0)]['Outer_t'])==0 else min(dfOuterLoop[(dfOuterLoop['MV']==0) & (dfOuterLoop['Outer_t']>0)]['Outer_t'])
MinBVt = dicSensit['ProjectionYrs']*12 if sum(dfOuterLoop[(dfOuterLoop['BV']==0) & (dfOuterLoop['Outer_t']>0)]['Outer_t'])==0 else min(dfOuterLoop[(dfOuterLoop['BV']==0) & (dfOuterLoop['Outer_t']>0)]['Outer_t'])
dicProjOuterSeries = {'Contract': Contract,
'BaseLapsePartContribution': dfOuterLoop['BaseLapsePartContribution'].values,
'BaseLapsePartNetTransfer': dfOuterLoop['BaseLapsePartNetTransfer'].values,
'BaseLapsePartWithdrawal': dfOuterLoop['BaseLapsePartWithdrawal'].values,
'PrudentEstDynPartWDPct': dfOuterLoop['PrudentEstDynPartWDPct'].values,
'KnownPutQueueWD': dfOuterLoop['KnownPutQueueWD'].values,
'BaseLapsePlanSponsor': dfOuterLoop['BaseLapsePlanSponsor'].values,
'PrudentEstDynPlanWDPct': dfOuterLoop['PrudentEstDynPlanWDPct'].values,
'MonthlyDefaultCharge': dfOuterLoop['MonthlyDefaultCharge'].values,
'Outer_t_Maturity': min(MinMVt,MinBVt)-1}
liProjValContract=[]
for _,row in dfOuterLoopCut.iterrows():
liProjValContract.append([row])
func=partial(ReserveProjectionPreprocess,dfScenOneSet,dicProjOuterSeries)
dfReserve = pd.concat(map(func,liProjValContract),axis=1,ignore_index=True).T
dfOuterLoopwRes = pd.merge(dfOuterLoop,dfReserve,how='left',on=['PTS','OuterScenNum','Outer_t'])
dfOuterLoopwRes['Reserve'].fillna(value=0,inplace=True)
fname='OuterProjection_{0}_{1}.parquet'.format(PTS,CS[1])
dfOuterLoopwRes.to_parquet(os.path.join(dicSensit['OutputDirOuterLoop'],fname),index=False)
return 1
if __name__ == '__main__':
dfCleanCont = 'DataFrame of 150 contracts. Each row is a contract with various info such as market value, interest rate, maturity date, etc. Identifier index is "PTS". Shape[150,41]'
dfScen = 'DataFrame of interest rate scenarios. 100 scenarios("Trial"). Each scenario has 601 timesteps and 11 interest rate term points. Shape[60100,13]'
liContID = list(dfCleanCont.index)
liScenID = dfScen["Trial"].unique()
liCS = list(it.product(liContID,liScenID))
pool = mp.Pool(7, initializer, (dfCleanCont, dfScen, dfWithdrawalRandom, dicSensit,))
n=10
liCSGroup=[liCS[x:x+n] for x in range(0,len(liCS),n)]
# EvalContract writes each result to parquet and returns a status flag, so just collect the flags
liStatus = pool.map(func=EvalContract, iterable=liCSGroup)
pool.close()
pool.join()
Expecting a significant speed gain with more processors. There's a bottleneck somewhere, but I can't seem to find it. I tried cProfile but get the same cumulative time with 3 or 7 processors.
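One caveat worth noting: cProfile run in the parent process only measures the parent's own work, so time spent inside the pool workers does not show up there. A minimal sketch of profiling inside a worker instead (EvalContractProfiled is just an illustrative wrapper name, not part of the code above):
import cProfile
import io
import pstats

def EvalContractProfiled(liCS):
    # profile one worker task and print its hot spots from inside the worker process
    pr = cProfile.Profile()
    pr.enable()
    result = EvalContract(liCS)
    pr.disable()
    buf = io.StringIO()
    pstats.Stats(pr, stream=buf).sort_stats('cumulative').print_stats(15)
    print(buf.getvalue())
    return result

# then map the wrapper instead of EvalContract:
# pool.map(func=EvalContractProfiled, iterable=liCSGroup)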
Related
I have a series of computations on some data which I'm modelling as a graph with dask delayed, and it works well; however, the graph itself takes longer (or a comparable time) to create than the calculations take to run.
I add data throughout the day, so I would like to be able to change the inputs without recreating the graph. Is there a way to do this?
This is an advanced topic, so I am going to provide only a somewhat-hacky solution:
import dask
from dask.multiprocessing import get
@dask.delayed()
def myfunc(x):
return x+1
nested = 0
for x in range(1, 3):
nested = myfunc(x*nested, dask_key_name=f'{x}')
# 1*0 + 1 = 1 -> 2*1 + 1 = 3
print(nested.compute())
dag_modified = dict(nested.dask)  # materialise the task graph as a plain dict
dag_modified['1'] = dag_modified['1'][0], 2
# 1*2 + 1 = 3 -> 2*3 + 1 = 7
print(get(dag_modified, '2'))
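If you need to do this repeatedly as new data arrives, the same trick can be wrapped in a small helper. This is only a sketch built on the code above, assuming the keys follow the dask_key_name convention used there; rerun_with_input is just an illustrative name:
def rerun_with_input(delayed_result, key, new_value, output_key):
    # copy the low-level task graph and swap the input of one task
    dag = dict(delayed_result.dask)
    dag[key] = dag[key][0], new_value
    return get(dag, output_key)

# 1*5 + 1 = 6 -> 2*6 + 1 = 13
print(rerun_with_input(nested, '1', 5, '2'))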
Here is my code:
import xlwings as xw
import datetime as dt
import numpy as np
import pandas as pd
import threading
import time
#connect to workbook
wb = xw.Book(r'C:\Users\Ryan\AppData\Local\Programs\Python\Python37-32\constituents.xlsx')
sht = wb.sheets['constituents']
#store data in np array, pass to Pandas
a = sht.range('A2:C1760').options(np.array).value
df = pd.DataFrame(a)
df = df.rename(index=str, columns={0: "tickers", 1: "start_dates", 2: "end_dates"})
#initialize variables
start_quarter = 0
start_year = 0
fiscal_dates = []
s1 = pd.date_range(start='1/1/1964', end='12/31/2018', freq='B')
df2 = pd.DataFrame(data=np.ndarray(shape=(len(s1),500), dtype=str), index=s1)
#create list of fiscal quarters
def fiscal_quarters(start_year):
year_count = start_year - 1
quarter_count = 1
for n in range(2019 - start_year):
year_count += 1
for i in range(1,5):
fiscal_dates.append(str(quarter_count) + 'Q'+ str(year_count)[-2:])
quarter_count += 1
quarter_count = 1
#iterate over list of tickers to create self-named spreadsheets
def populate_worksheets():
for n in range(len(fiscal_dates)):
wb.sheets.add(name=fiscal_dates[n])
#populate df2 with appropriate tickers
def populate_tickers():
count = 0
for n in range(len(s1)):
for i in range(len(df['tickers'])):
if df.loc[str(i), 'start_dates'] <= s1[n] and df.loc[str(i), 'end_dates'] > s1[n]:
count += 1
df2.loc[str(s1[n]), str(count)] = df.loc[str(i), 'tickers']
count = 0
#run populate_tickers function with status updates
def pt_thread():
t = threading.Thread(target=populate_tickers)
c = 0
t.start()
while (t.is_alive()):
time.sleep(5)
c += 5
print("Working... " + str(c) + 's')
First, I run fiscal_quarters(1964) in the Python shell, then pt_thread(), which appears to be particularly resource intensive. It's been running for over half an hour at this point on my (admittedly slow) laptop. However, without waiting for it to finish, is there any way to see whether it's working as intended, or at all? It's still printing "Working..." to the shell, which I suppose is a good sign, but I'd like to start troubleshooting if something's wrong rather than waiting an indefinite amount of time before giving up on it.
For reference, the s1 series contains ~17,500 items, while the df['tickers'] column contains ~2,000 items, so there should be somewhere in the neighborhood of 35,000,000 iterations with 4 operations each. Is this a lot, or should a modern PC be able to work through this rather quickly, meaning my program is probably just not working?
If you are running loops that take a long time and want to see what is going on, you could use tqdm. It shows iterations per second and the estimated time remaining. Here is a quick example:
from tqdm import tqdm
def sim(sims):
x = 0
pb = tqdm(total=sims, initial=x)
while x < sims:
x+=1
pb.update(1)
pb.close()
sim(5000000)
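Applied to the populate_tickers loop from the question, a minimal sketch (assuming the same df, df2, and s1 objects) could look like this:
from tqdm import tqdm

def populate_tickers():
    count = 0
    # the bar advances once per business day in s1 and shows rate and ETA
    for n in tqdm(range(len(s1))):
        for i in range(len(df['tickers'])):
            if df.loc[str(i), 'start_dates'] <= s1[n] and df.loc[str(i), 'end_dates'] > s1[n]:
                count += 1
                df2.loc[str(s1[n]), str(count)] = df.loc[str(i), 'tickers']
        count = 0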
I am trying to write code to construct a DataFrame consisting of cointegrating pairs of portfolios (i.e. the portfolio price series are cointegrated). In this case, the stocks in a portfolio are selected from the S&P 500 and they have equal weights.
Also, for an economic reason, the portfolios must cover the same sectors.
For example:
if the stocks in one portfolio are from the [IT] and [Financial] sectors, the second portfolio must also select its stocks from the [IT] and [Financial] sectors.
There is no single correct number of stocks per portfolio, so I'm considering about 10 to 20 stocks for each. However, when it comes to the number of combinations, this is (500 choose 10), so I have a computation-time issue.
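For a sense of scale, the number of combinations can be checked with the standard library (math.comb requires Python 3.8+):
import math

# distinct 10-stock baskets that can be drawn from 500 tickers
print(math.comb(500, 10))  # roughly 2.5e20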
The following is my code:
import time
import itertools
import pandas as pd
import statsmodels.tsa.stattools as ts

def adf(x, y, xName, yName, pvalue=0.01, beta_lower=0.5, beta_upper=1):
res=pd.DataFrame()
regress1, regress2 = pd.ols(x=x, y=y), pd.ols(x=y, y=x)  # pd.ols only exists in pandas < 0.20
error1, error2 = regress1.resid, regress2.resid
test1, test2 = ts.adfuller(error1, 1), ts.adfuller(error2, 1)
if test1[1] < pvalue and test1[1] < test2[1] and\
regress1.beta["x"] > beta_lower and regress1.beta["x"] < beta_upper:
res[(tuple(xName), tuple(yName))] = pd.Series([regress1.beta["x"], test1[1]])
res = res.T
res.columns=["beta","pvalue"]
return res
elif test2[1] < pvalue and regress2.beta["x"] > beta_lower and\
regress2.beta["x"] < beta_upper:
res[(tuple(yName), tuple(xName))] = pd.Series([regress2.beta["x"], test2[1]])
res = res.T
res.columns=["beta","pvalue"]
return res
else:
pass
def coint(dataFrame, nstocks = 2, pvalue=0.01, beta_lower=0.5, beta_upper=1):
# dataFrame = pandas_dataFrame, in this case, data['Adj Close'], row=time, col = tickers
# pvalue = level of significance of adf test
# nstocks = number of stocks considered for adf test (equal weight)
# if nstocks > 2, coint return cointegration between portfolios
# beta_lower = lower bound for slope of linear regression
# beta_upper = upper bound for slope of linear regression
a=time.time()
tickers = dataFrame.columns
tcomb = itertools.combinations(dataFrame.columns, nstocks)
res = pd.DataFrame()
sec = pd.DataFrame()
for pair in tcomb:
xName, yName = list(pair[:int(nstocks/2)]), list(pair[int(nstocks/2):])
xind, yind = tickers.searchsorted(xName), tickers.searchsorted(yName)
xSector = list(SNP.ix[xind]["Sector"])
ySector = list(SNP.ix[yind]["Sector"])
if set(xSector) == set(ySector):
sector = [[(xSector, ySector)]]
x, y = dataFrame[list(xName)].sum(axis=1), dataFrame[list(yName)].sum(axis=1)
res1 = adf(x,y,xName,yName)
if res1 is None:
continue
elif res.size==0:
res=res1
sec = pd.DataFrame(sector, index = res.index, columns = ["sector"])
print("added : ", pair)
else:
res=res.append(res1)
sec = sec.append(pd.DataFrame(sector, index = [res.index[-1]], columns = ["sector"]))
print("added : ", pair)
res = pd.concat([res,sec],axis=1)
res=res.sort_values(by=["pvalue"],ascending=True)
b=time.time()
print("time taken : ", b-a, "sec")
return res
When nstocks=2 this takes about 263 seconds, but as nstocks increases the loop takes a lot of time (more than a day).
I collected 'Adj Close' data from Yahoo Finance using pandas_datareader.data, and the index is time and the columns are the different tickers.
Any suggestions or help will be appreciated.
I don't know what computer you have, but I would advise you to use some kind of multiprocessing for the loop. I haven't looked really hard into your code, but as far as I can see, res and sec can be moved into shared-memory objects, and the individual loop iterations parallelized with multiprocessing; see the sketch below.
If you have a decent CPU it can improve the performance 4-6 times. If you have access to some kind of HPC it can do wonders.
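A rough sketch of that idea, collecting results in the parent instead of using shared-memory objects to keep it short. It assumes a fork-based start method so that dataFrame, SNP and the adf function from the question are inherited by the workers, it drops the sector bookkeeping, and check_pair is just an illustrative name:
import itertools
from multiprocessing import Pool

import pandas as pd

NSTOCKS = 2

def check_pair(pair):
    # same sector check + adf test as the loop body in the question, for one combination
    xName, yName = list(pair[:NSTOCKS // 2]), list(pair[NSTOCKS // 2:])
    tickers = dataFrame.columns
    xind, yind = tickers.searchsorted(xName), tickers.searchsorted(yName)
    if set(SNP.ix[xind]["Sector"]) != set(SNP.ix[yind]["Sector"]):
        return None
    x, y = dataFrame[xName].sum(axis=1), dataFrame[yName].sum(axis=1)
    return adf(x, y, xName, yName)

if __name__ == '__main__':
    tcomb = itertools.combinations(dataFrame.columns, NSTOCKS)
    with Pool() as pool:
        hits = [r for r in pool.imap_unordered(check_pair, tcomb, chunksize=1000) if r is not None]
    res = pd.concat(hits).sort_values(by=["pvalue"]) if hits else pd.DataFrame()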
I'd recommend using a profiler to narrow down the most time-consuming calls and to check the number of loop passes (does your loop make the expected number of iterations?). Python 3 has a profiler in the standard library:
https://docs.python.org/3.6/library/profile.html
You can either invoke it in your code:
import cProfile
cProfile.run('your_function(inputs)')
Or if a script is an easier entrypoint:
python -m cProfile [-o output_file] [-s sort_order] your-script.py
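If you save the results to a file with -o, the standard-library pstats module can sort and trim them afterwards:
import pstats

stats = pstats.Stats('output_file')  # the file written by cProfile -o output_file
stats.strip_dirs().sort_stats('cumulative').print_stats(20)  # 20 most expensive calls by cumulative time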
I have a pandas data object - data - that is stored as a Series of Series. The first series is indexed on ID1 and the second on ID2.
ID1 ID2
1 10259 0.063979
14166 0.120145
14167 0.177417
14244 0.277926
14245 0.436048
15021 0.624367
15260 0.770925
15433 0.918439
15763 1.000000
...
1453 812690 0.752274
813000 0.755041
813209 0.756425
814045 0.778434
814474 0.910647
814475 1.000000
Length: 19726, dtype: float64
I have a function that uses values from this object for further data processing. Here is the function:
#Function
def getData(ID1, randomDraw):
dataID2 = data[ID1]
value = dataID2.index[np.searchsorted(dataID2, randomDraw, side='left').iloc[0]]
return value
I use np.vectorize to apply this function on a DataFrame - dataFrame - that has about 22 million rows.
dataFrame['ID2'] = np.vectorize(getData)(dataFrame['ID1'], dataFrame['RAND'])
where ID1 and RAND are columns with values that are feeding into the function.
The code takes about 6 hours to process everything. A similar implementation in Java takes only about 6 minutes to get through 22 million rows of data.
On running a profiler on my program I find that the most expensive call is the indexing into data and the second most expensive is searchsorted.
Function Name: pandas.core.series.Series.__getitem__
Elapsed inclusive time percentage: 54.44
Function Name: numpy.core.fromnumeric.searchsorted
Elapsed inclusive time percentage: 25.49
Using data.loc[ID1] to get data makes the program even slower. How can I make this faster? I understand that Python cannot achieve the same efficiency as Java but 6 hours compared to 6 minutes seems too much of a difference. Maybe I should be using a different data structure/ functions? I am using Python 2.7 and PTVS IDE.
Adding a minimum working example:
import numpy as np
import pandas as pd
np.random.seed(0)
#Creating a dummy data object - Series within Series
alt = pd.Series(np.array([ 0.25, 0.50, 0.75, 1.00]), index=np.arange(1,5))
data = pd.Series([alt]*1500, index=np.arange(1,1501))
#Creating dataFrame -
nRows = 200000
d = {'ID1': np.random.randint(1500, size=nRows) + 1
,'RAND': np.random.uniform(low=0.0, high=1.0, size=nRows)}
dataFrame = pd.DataFrame(d)
#Function
def getData(ID1, randomDraw):
dataID2 = data[ID1]
value = dataID2.index[np.searchsorted(dataID2, randomDraw, side='left').iloc[0]]
return value
dataFrame['ID2'] = np.vectorize(getData)(dataFrame['ID1'], dataFrame['RAND'])
You may get better performance with this code:
>>> def getData(ts):
... dataID2 = data[ts.name]
... i = np.searchsorted(dataID2.values, ts.values, side='left')
... return dataID2.index[i]
...
>>> dataFrame['ID2'] = dataFrame.groupby('ID1')['RAND'].transform(getData)
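On the minimal working example from the question, the two approaches can be timed side by side. This is only a sketch: getData_rowwise stands for the original function from the question and getData_grouped for the grouped version above, renamed here to avoid the name clash:
import time

t0 = time.time()
id2_rowwise = np.vectorize(getData_rowwise)(dataFrame['ID1'], dataFrame['RAND'])
print('np.vectorize     :', round(time.time() - t0, 2), 's')

t0 = time.time()
id2_grouped = dataFrame.groupby('ID1')['RAND'].transform(getData_grouped)
print('groupby.transform:', round(time.time() - t0, 2), 's')

# both should produce the same ID2 values
print('identical:', (id2_rowwise == id2_grouped.values).all())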
I have a vector that contains stock tickers, like tickers = ['AAPL','XOM','GOOG']. In my "traditional" Python program I would loop over this tickers vector, select one ticker string such as AAPL, import a csv file that contains the AAPL stock returns, use the returns as input to a common function, and finally generate a csv file as output. I have over 4000 tickers, and the function applied to each ticker takes time to process. I have access to a computer cluster with the mpi4py package and about 100 processors per job. I understand well (and was able to implement) this MPI example in Python:
from mpi4py import MPI
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
if rank == 0:
data = [i for i in range(8)]
# dividing data into chunks
chunks = [[] for _ in range(size)]
for i, chunk in enumerate(data):
chunks[i % size].append(chunk)
else:
data = None
chunks = None
data = comm.scatter(chunks, root=0)
print str(rank) + ': ' + str(data)
[cha#cluster] ~/utils> mpirun -np 3 ./mpi.py
2: [2, 5]
0: [0, 3, 6]
1: [1, 4, 7]
So in this example, we have a data vector of size 8 and assign each processor (3 in total) a roughly equal number of its elements. How can I use the example above to assign each processor one stock ticker and apply the function that needs to be run for each ticker? How can I tell Python that once a processor gets free, it should go back to the tickers vector and process a ticker that has not yet been processed?
There's another way to think of this. You have 100 processors processing 4000 chunks of data. One way you can look at this is that each processor gets a block of data on which to operate. Evenly split, each processor will get 40 tickers to process. Processor 1 will get 0-39, processor 2 will get 40-79, etc.
Thinking this way, you don't need to worry about what happens when a processor finishes its tasks. Just have a loop:
block_size = len(tickers) // size  # this will be 40 in your example (4000 tickers / 100 ranks)
for i in range(block_size):
ticker = tickers[rank * block_size + i]
process(ticker)
def process(ticker):
# load data
# process data
# output data
Does this make sense?
[edit]
If you're wanting to read more, this is really just a variation on row-major order indexing, a common method for accessing multidimensional data that's stored in a single dimension of memory.
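Pulled together into a minimal runnable sketch of the block approach described above (the ticker list and the body of process are placeholders; the last rank also absorbs any remainder, which the even 4000/100 split in the example doesn't need):
from mpi4py import MPI

def process(ticker):
    # placeholder: load this ticker's csv, run the expensive function, write the output csv
    pass

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

tickers = ['AAPL', 'XOM', 'GOOG']  # placeholder for the full list of ~4000 tickers

block_size = len(tickers) // size
start = rank * block_size
# the last rank picks up any leftover tickers when len(tickers) % size != 0
stop = len(tickers) if rank == size - 1 else start + block_size

for ticker in tickers[start:stop]:
    process(ticker)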