Optimizing algorithm for large dataset calculations - Python

Once again I find myself stumped with pandas and how best to perform a 'vector operation'. My code works, but it will take a long time to iterate through everything.
The code loops through shapes.csv and determines which shape_pt_sequence corresponds to a stop_id, then assigns the stop_lat and stop_lon to shape_pt_lat and shape_pt_lon, while also marking that shape_pt_sequence as is_stop.
GISTS
stop_times.csv LINK
trips.csv LINK
shapes.csv LINK
Here is my code:
import pandas as pd
from haversine import *

'''
iterate through shapes and match stops along a shape_pt_sequence within
x amount of distance. for shape_pt_sequence that is closest, replace the stop
lat/lon to the shape_pt_lat/shape_pt_lon, and mark is_stop column with 1.
'''

# readability assignments for shapes.csv
shapes = pd.read_csv('csv/shapes.csv')
shapes_index = list(set(shapes['shape_id']))
shapes_index.sort(key=int)
shapes.set_index(['shape_id', 'shape_pt_sequence'], inplace=True)

# readability assignments for trips.csv
trips = pd.read_csv('csv/trips.csv')
trips_index = list(set(trips['trip_id']))
trips.set_index(['trip_id'], inplace=True)

# readability assignments for stop_times.csv
stop_times = pd.read_csv('csv/stop_times.csv')
stop_times.set_index(['trip_id', 'stop_sequence'], inplace=True)
print(len(stop_times.loc[1423492]))

# readability assignments for stops.csv
stops = pd.read_csv('csv/stops.csv')
stops.set_index(['stop_id'], inplace=True)

# for each trip_id
for i in trips_index:
    print('******NEW TRIP_ID******')
    print(i)
    i = i.astype(int)
    # for each stop_sequence in stop_times
    for x in range(len(stop_times.loc[i])):
        stop_lat = stop_times.loc[i, ['stop_lat', 'stop_lon']].iloc[x, [0, 1]][0]
        stop_lon = stop_times.loc[i, ['stop_lat', 'stop_lon']].iloc[x, [0, 1]][1]
        stop_coordinate = (stop_lat, stop_lon)
        print(stop_coordinate)
        # shape_id that matches trip_id
        print('**SHAPE_ID**')
        trips_shape_id = trips.loc[i, ['shape_id']].iloc[0]
        trips_shape_id = int(trips_shape_id)
        print(trips_shape_id)
        smallest = 0
        for y in range(len(shapes.loc[trips_shape_id])):
            shape_lat = shapes.loc[trips_shape_id].iloc[y, [0, 1]][0]
            shape_lon = shapes.loc[trips_shape_id].iloc[y, [0, 1]][1]
            shape_coordinate = (shape_lat, shape_lon)
            haversined = haversine_mi(stop_coordinate, shape_coordinate)
            if smallest == 0 or haversined < smallest:
                smallest = haversined
                smallest_shape_pt_indexer = y
            else:
                pass
            print(haversined)
            print('{0:.20f}'.format(smallest))
        print('{0:.20f}'.format(smallest))
        print(smallest_shape_pt_indexer)
        # mark is_stop as 1
        shapes.iloc[smallest_shape_pt_indexer, [2]] = 1
        # replace coordinate value
        shapes.loc[trips_shape_id].iloc[y, [0, 1]][0] = stop_lat
        shapes.loc[trips_shape_id].iloc[y, [0, 1]][1] = stop_lon

shapes.to_csv('csv/shapes.csv', index=False)

One thing you could do to optimize this code is use multiple worker processes instead of those nested for loops.
I recommend using a Pool of workers, as it is very simple to use.
Instead of:

for i in trips_index:

you could use something like:

from multiprocessing import Pool

pool = Pool(processes=4)
results = pool.map(func, trips_index)

and the function func would look like:

def func(i):
    # code here

You can simply put the body of the for loop inside this function. That makes the work run across 4 subprocesses in this example, giving it a nice improvement.
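A minimal sketch of that structure, assuming the loop body is moved into a function that returns its result (func and its return value are placeholders; worker processes receive copies of the DataFrames, so in-place edits to shapes inside a worker would not propagate back to the parent):

from multiprocessing import Pool

def func(trip_id):
    # body of the outer for loop from the question goes here;
    # return whatever you need (e.g. the matched shape point index per stop)
    # instead of writing into the shared `shapes` DataFrame
    ...

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        # one task per trip_id, spread over 4 worker processes
        results = pool.map(func, trips_index)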

One thing to consider is that a collection of trips will often have the same sequence of stops and the same shape data (the only difference between trips is the timing). So it might make sense to cache the find-closest-point-on-shape operation for (stop_id, shape_id). I bet that would reduce your runtime by an order-of-magnitude.
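A minimal sketch of that caching idea, assuming a hypothetical helper closest_shape_point(stop_id, shape_id) that wraps the inner haversine loop from the question and returns the index of the nearest shape point:

# results keyed on (stop_id, shape_id): trips that share a shape and stop
# pattern reuse the answer instead of re-scanning every shape point
closest_cache = {}

def closest_shape_point_cached(stop_id, shape_id):
    key = (stop_id, shape_id)
    if key not in closest_cache:
        # closest_shape_point is a placeholder for the existing inner loop
        closest_cache[key] = closest_shape_point(stop_id, shape_id)
    return closest_cache[key]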

Related

Vectorize maximum drawdown stop loss logic

I have the following code that works as intended in a for loop:
import pandas as pd
import numpy as np
returns = pd.DataFrame(np.random.normal(scale=0.01,size=[1000,7]))
signal = pd.DataFrame(np.random.choice([1,np.nan],p=(0.1,0.9),size=[1000,7]))
window=20
breach = -0.03
positions = signal.copy(); positions[:] = np.nan
pf_returns = pd.Series(index=positions.index)
max_dd = pf_returns.copy()
for i in range(len(positions)):
    # Positions are just the number of signals divided by the total active ones
    positions.iloc[i] = signal.ffill().shift().div(signal.ffill().shift().sum(axis=1), axis=0).iloc[i]
    pf_returns.iloc[i] = (positions.iloc[i] * returns.iloc[i]).sum()
    equity_line = (pf_returns.iloc[max(i-window, 0):i+1] + 1).iloc[:i+1].cumprod()
    max_dd.iloc[i] = (equity_line/equity_line.cummax() - 1).rolling(window, min_periods=1).min().iloc[-1]
    if max_dd.iloc[i] <= breach and i != len(positions)-1:
        signal.iloc[i+1] = 0
Is it somehow possible to vectorize this? I thought about computing the equity_line all at once (definitely possible without all these ilocs), but then the breach feedback changes the upcoming values. I also thought about a loop that runs in chunks (until the next max_dd breach is found, basically), but I am not sure whether there is any way to vectorize this or otherwise make it more efficient.
My desired outcome is basically to reproduce what the loop does (in terms of positions, if you wish, and consequently pf_returns) in a more efficient way.
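The breach feedback (signal.iloc[i+1] = 0) is what forces the loop to stay sequential, since each row's positions depend on zeros written in earlier iterations. One change that is safe without altering results is computing signal.ffill().shift() once per iteration instead of twice; a minimal sketch of the loop with only that change (setup as in the question):

for i in range(len(positions)):
    # compute the shifted signal once and reuse it within the iteration
    shifted = signal.ffill().shift()
    positions.iloc[i] = shifted.div(shifted.sum(axis=1), axis=0).iloc[i]
    pf_returns.iloc[i] = (positions.iloc[i] * returns.iloc[i]).sum()
    equity_line = (pf_returns.iloc[max(i-window, 0):i+1] + 1).iloc[:i+1].cumprod()
    max_dd.iloc[i] = (equity_line/equity_line.cummax() - 1).rolling(window, min_periods=1).min().iloc[-1]
    if max_dd.iloc[i] <= breach and i != len(positions)-1:
        signal.iloc[i+1] = 0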

Implementing Multiprocessing with the same Function in While Loop

I have implemented an Evolutionary Algorithm process in Python 3.8 and am attempting to optimise/reduce its runtime. Due to the heavy constraints upon valid solutions, it can take a few minutes to generate valid chromosomes. To avoid spending hours just generating the initial population, I want to use multiprocessing to generate multiple at a time.
My code at this point in time is:
populationCount = 500

def readDistanceMatrix():
    # code removed

def generateAvailableValues():
    # code removed

def generateAvailableValuesPerColumn():
    # code removed

def generateScheduleTemplate():
    # code removed

def generateChromosome():
    # code removed

if __name__ == '__main__':
    # Data type = DataFrame
    distanceMatrix = readDistanceMatrix()
    # Data type = List of Integers
    availableValues = generateAvailableValues()
    # Data type = List containing Lists of Integers
    availableValuesPerColumn = generateAvailableValuesPerColumn(availableValues)
    # Data type = DataFrame
    scheduleTemplate = generateScheduleTemplate(distanceMatrix)
    # Data type = List containing custom class (with Integer and DataFrame)
    population = []
    while len(population) < populationCount:
        chrmSolution = generateChromosome(availableValuesPerColumn, scheduleTemplate, distanceMatrix)
        population.append(chrmSolution)
Where the population list is filled in with the while loop at the end. I would like to replace the while loop with a Multiprocessing solution that can use up to a pre-set number of cores. For example:
population = []
availableCores = 6
while len(population) < populationCount:
    while usedCores < availableCores:
        # start generating another chromosome as 'chrmSolution'
        population.append(chrmSolution)
However, after reading and watching hours worth of tutorials, I'm unable to get a loop up-and-running. How should I go about doing this?
It sounds like a simple multiprocessing.Pool should do the trick, or at least be a place to start. Here's a simple example of how that might look:
from multiprocessing import Pool, cpu_count

child_globals = {}  # mutable object at the `module` level acts as container for globals (constants)

if __name__ == '__main__':
    # ...

    def init_child(availableValuesPerColumn, scheduleTemplate, distanceMatrix):
        # passing variables to the child process every time is inefficient if they're
        # constant, so instead pass them to the initialization function, and let
        # each child re-use them each time generateChromosome is called
        child_globals['availableValuesPerColumn'] = availableValuesPerColumn
        child_globals['scheduleTemplate'] = scheduleTemplate
        child_globals['distanceMatrix'] = distanceMatrix

    def child_work(i):
        # child_work simply wraps generateChromosome with inputs, and throws out dummy `i` from `range()`
        return generateChromosome(child_globals['availableValuesPerColumn'],
                                  child_globals['scheduleTemplate'],
                                  child_globals['distanceMatrix'])

    with Pool(cpu_count(),
              initializer=init_child,  # init function to stuff some constants into the child's global context
              initargs=(availableValuesPerColumn, scheduleTemplate, distanceMatrix)) as p:
        # imap_unordered doesn't make child processes wait to ensure order is preserved,
        # so it keeps the cpu busy more often. it returns a generator, so we use list()
        # to store the results into a list.
        population = list(p.imap_unordered(child_work, range(populationCount)))

Creating loop with particular behaviour depending on data length

In my program I have a part of the code that uses an Exponential Moving Average (EMA) 4 times, each time with a different length. The program uses one or more EMAs depending on how much data it gets.
For now the code is not looped, just copy-pasted with minor tweaks. That makes changes difficult because I have to change everything 4 times.
Can somebody help me loop the code in such a way that it won't lose its behaviour pattern? The mock-up code is presented here:
import random
import numpy as np

zakres = [5, 10, 15, 20]
data = []

def SI_sma(data, zakres):
    weights = np.ones((zakres,))/zakres
    smas = np.convolve(data, weights, 'valid')
    return smas

def SI_ema(data, zakres):
    weights_ema = np.exp(np.linspace(-1., 0., zakres))
    weights_ema /= weights_ema.sum()
    ema = np.convolve(data, weights_ema)[:len(data)]
    ema[:zakres] = ema[zakres]
    return ema

while True:
    data.append(random.uniform(0, 100))
    print(len(data))
    if len(data) > zakres[0]:
        smas = SI_sma(data=data, zakres=zakres[0])
        ema = SI_ema(data=data, zakres=zakres[0])
        print(smas[-1])  # calc using smas
        print(ema[-1])  # calc using ema1
    if len(data) > zakres[1]:
        ema2 = SI_ema(data=data, zakres=zakres[1])
        print(ema2[-1])  # calc using ema2
    if len(data) > zakres[2]:
        ema3 = SI_ema(data=data, zakres=zakres[2])
        print(ema3[-1])  # calc using ema3
    if len(data) > zakres[3]:
        ema4 = SI_ema(data=data, zakres=zakres[3])
        print(ema4[-1])  # calc using ema4
    input("press a key")
A variable number of variables is usually a bad idea. As you have found, it can make maintaining code cumbersome and error-prone. Instead, you can define a dict of results and use a for loop to iterate scenarios, defining len(data) just once.
ema = {}

while True:
    data.append(random.uniform(0, 100))
    n = len(data)
    for i, val in enumerate(zakres):
        if n > val:
            if i == 0:  # the SMA is only needed for the shortest window, as in the original
                smas = SI_sma(data=data, zakres=val)
            ema[i] = SI_ema(data=data, zakres=val)
You can then access results via ema[0], ..., ema[3] as required.
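A hypothetical usage sketch (the names come from the question) showing how to read the latest values back out of the dict, mirroring the original prints:

# after the for loop, only windows with enough data have an entry yet
for i in sorted(ema):
    print('EMA({}): {}'.format(zakres[i], ema[i][-1]))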

Vectorize python code for improved performance

I am writing a scientific code in python to calculate the energy of a system.
Here is my function: cte1, cte2, cte3 and cte4 are constants previously computed; pii is np.pi (stored beforehand, since looking it up inside the loop slows it down otherwise). I calculate the 3 components of the total energy, then sum them up.
def calc_energy(diam):
    Energy1 = cte2*((pii*diam**2/4)*t)
    Energy2 = cte4*(pii*diam)*t
    d = diam/t
    u = np.sqrt((d)**2/(1+d**2))
    cc = u**2
    E = sp.special.ellipe(cc)
    K = sp.special.ellipk(cc)
    Id = cte3*d*(d**2+(1-d**2)*E/u-K/u)
    Energy3 = cte*t**3*Id
    total_energy = Energy1+Energy2+Energy3
    return (total_energy, Energy1)
My first idea was to simply loop over all values of the diameter :
start_diam, stop_diam, step_diam = 1e-10, 500e-6, 1e-9 #Diametre
diametres = np.arange(start_diam,stop_diam,step_diam)
totalEnergy, Energy1 = [], []  # result lists (their initialisation is missing from the snippet as posted)
for d in diametres:
    res1, res2 = calc_energy(d)
    totalEnergy.append(res1)
    Energy1.append(res2)
In an attempt to speed up calculations, I decided to use numpy to vectorize, as shown below :
diams = diametres.reshape(-1,1) #If not reshaped, calculations won't run
r1 = np.apply_along_axis(calc_energy,1,diams)
However, the "vectorized" solution does not work out as hoped: timing gives 5 seconds for the first solution and 18 seconds for the second one.
I guess I'm doing something the wrong way but can't figure out what.
With your current approach, you're applying a Python function to each element of your array, which carries additional overhead. Instead, you can pass the whole array to your function and get an array of answers back. Your existing function appears to work fine without any modification.
import numpy as np
from scipy import special

cte = 2
cte1 = 2
cte2 = 2
cte3 = 2
cte4 = 2
pii = np.pi
t = 2

def calc_energy(diam):
    Energy1 = cte2*((pii*diam**2/4)*t)
    Energy2 = cte4*(pii*diam)*t
    d = diam/t
    u = np.sqrt((d)**2/(1+d**2))
    cc = u**2
    E = special.ellipe(cc)
    K = special.ellipk(cc)
    Id = cte3*d*(d**2+(1-d**2)*E/u-K/u)
    Energy3 = cte*t**3*Id
    total_energy = Energy1+Energy2+Energy3
    return (total_energy, Energy1)

start_diam, stop_diam, step_diam = 1e-10, 500e-6, 1e-9  # Diametre
diametres = np.arange(start_diam, stop_diam, step_diam)

a = calc_energy(diametres)  # Pass the whole array
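Since calc_energy returns a tuple, the two result arrays can also be unpacked directly; a small follow-up sketch (the variable names here are just for illustration):

total_energy, energy1 = calc_energy(diametres)  # each has the same shape as diametres
print(total_energy.shape, energy1.shape)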

How do I vectorize the following loop in Numpy?

"""Some simulations to predict the future portfolio value based on past distribution. x is
a numpy array that contains past returns.The interpolated_returns are the returns
generated from the cdf of the past returns to simulate future returns. The portfolio
starts with a value of 100. portfolio_value is filled up progressively as
the program goes through every loop. The value is multiplied by the returns in that
period and a dollar is removed."""
portfolio_final = []
for i in range(10000):
    portfolio_value = [100]
    rand_values = np.random.rand(600)
    interpolated_returns = np.interp(rand_values, cdf_values, x)
    interpolated_returns = np.add(interpolated_returns, 1)
    for j in range(1, len(interpolated_returns)+1):
        portfolio_value.append(interpolated_returns[j-1]*portfolio_value[j-1])
        portfolio_value[j] = portfolio_value[j]-1
    portfolio_final.append(portfolio_value[-1])
print(np.mean(portfolio_final))
I couldn't find a way to write this code using numpy. I was having a look at iterations using nditer but I was unable to move ahead with that.
I guess the easiest way to figure out how you can vectorize your stuff would be to look at the equations that govern your evolution and see how your portfolio actually iterates, finding patterns that could be vectorized instead of trying to vectorize the code you already have. You would have noticed that the cumprod actually appears quite often in your iterations.
Nevertheless you can find the semi-vectorized code below. I included your code as well such that you can compare the results. I also included a simple loop version of your code which is much easier to read and translatable into mathematical equations. So if you share this code with somebody else I would definitely use the simple loop option. If you want some fancy-pants vectorizing you can use the vector version. In case you need to keep track of your single steps you can also add an array to the simple loop option and append the pv at every step.
Hope that helps.
Edit: I have not tested anything for speed. That's something you can easily do yourself with timeit.
import numpy as np
from scipy.special import erf

# Prepare simple return model - normally distributed with mu & sigma = 0.01
x = np.linspace(-10, 10, 100)
cdf_values = 0.5*(1+erf((x-0.01)/(0.01*np.sqrt(2))))

# Prepare setup such that every code snippet uses the same number of steps
# and the same random numbers
nSteps = 600
nIterations = 1
rnd = np.random.rand(nSteps)

# Your code - Gives the (supposedly) correct results
portfolio_final = []
for i in range(nIterations):
    portfolio_value = [100]
    rand_values = rnd
    interpolated_returns = np.interp(rand_values, cdf_values, x)
    interpolated_returns = np.add(interpolated_returns, 1)
    for j in range(1, len(interpolated_returns)+1):
        portfolio_value.append(interpolated_returns[j-1]*portfolio_value[j-1])
        portfolio_value[j] = portfolio_value[j]-1
    portfolio_final.append(portfolio_value[-1])
print(np.mean(portfolio_final))

# Using vectors
portfolio_final = []
for i in range(nIterations):
    portfolio_values = np.ones(nSteps)*100.0
    rcp = np.cumprod(np.interp(rnd, cdf_values, x) + 1)
    portfolio_values = rcp * (portfolio_values - np.cumsum(1.0/rcp))
    portfolio_final.append(portfolio_values[-1])
print(np.mean(portfolio_final))

# Simple loop
portfolio_final = []
for i in range(nIterations):
    pv = 100
    rets = np.interp(rnd, cdf_values, x) + 1
    for i in range(nSteps):
        pv = pv * rets[i] - 1
    portfolio_final.append(pv)
print(np.mean(portfolio_final))
Forget about np.nditer. It does not improve the speed of iterations. Only use it if you intend to go on and write the C version (via Cython).
I'm puzzled about that inner loop. What is it supposed to be doing that is special? Why the loop?
In tests with simulated values, these 2 blocks of code produce the same thing:
interpolated_returns = np.add(interpolated_returns, 1)
for j in range(1, len(interpolated_returns)+1):
    portfolio_value.append(interpolated_returns[j-1]*portfolio[j-1])
    portfolio_value[j] = portfolio_value[j]-1

interpolated_returns = (interpolated_returns+1)*portfolio - 1
portfolio_value = portfolio_value + interpolated_returns.tolist()
I'm assuming that interpolated_returns and portfolio are 1d arrays of the same length.
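A small sketch (with made-up arrays) that checks the claimed equivalence under that assumption:

import numpy as np

rng = np.random.default_rng(0)
interpolated_returns = rng.random(10)
portfolio = rng.random(10)

# loop version
portfolio_value_loop = [100]
ir = interpolated_returns + 1
for j in range(1, len(ir) + 1):
    portfolio_value_loop.append(ir[j-1]*portfolio[j-1])
    portfolio_value_loop[j] = portfolio_value_loop[j] - 1

# vectorized version
portfolio_value_vec = [100] + ((interpolated_returns + 1)*portfolio - 1).tolist()

print(np.allclose(portfolio_value_loop, portfolio_value_vec))  # True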
