Vectorizing code that includes conditional statement - python

I've been using for loops for a script I've been working on, and I think the for loops are making the script run too slow. I think I could pick up a lot of speed with vectorization, but I'm having a little trouble figuring out how to vectorize my code. I'm copying some simple sample code that more or less mimics what I'm trying to accomplish in my actual code to make the question easier to follow. (Sorry if it's not the most elegant or technically sound. I'm still getting experience with Python.)
import numpy as np
def create_combo(input1,input2):
combo = input1 + input2
return combo
def another_combo(monitor,scalar_1,scalar_2):
add_more = monitor + scalar_1 + scalar_2
return add_more
# Initialize monitor
monitor = 0
# Initialize a threshold variable
threshold = 15
# Create input arrays
primary_1 = [1, 3, 5, 7, 9]
primary_2 = [2, 4, 6, 8, 10]
primary_1 = np.array(primary_1)
primary_2 = np.array(primary_2)
storage_vec = []
for i in range(5):
# Create input variables
scalar_1 = 0.5
scalar_2 = 2
# Call the create_combo function
combination = create_combo(primary_1[i],primary_2[i])
# call the another_combo function
add_on = another_combo(monitor,scalar_1,scalar_2)
monitor = combination + add_on
# Check if monitor exceeds threshold
if monitor > threshold:
# Store the index if so
storage_vec.append(i)
# Reset the variable, monitor
monitor = 0
I can understand how it would be easy to vectorize the function named create_combo. And I can also see that it would be simple to make the variable named monitor a vector as well.
What I'm having trouble wrapping my head around is how I could reset that monitor variable at specific points in the vector and continue doing computations with monitor after it has been reset. So, I've been working with monitor as a scalar and doing computations on each element of the inputs to the functions. However, this seems to be way too slow, and I know vectorizing typically speeds things up.
I read that perhaps the np.where() function could help, but I'm also not sure how to apply that in this situation.
Any guidance?

Your example if probably over complicated (or your example is not complete).
But if we want to create the storage_vec easily we can noticed that monitor increase by 21.5 after each step until it reach the threshold.
So the vector storage_vec can simply be computed with:
#Increase value:
inc = primary_1[-1]+primary_2[-1]+scalar_1+scalar_2 # = 21.5
#floor division
fd = threshold//inc
#ceil division
cd = -(-threshold//inc)
#storage_vec
storage_vec = np.r_[fd:10:cd]

Related

Vectorize maximum drawdown stop loss logic

I have the following code that works as intended in a for loop:
import pandas as pd
import numpy as np
returns = pd.DataFrame(np.random.normal(scale=0.01,size=[1000,7]))
signal = pd.DataFrame(np.random.choice([1,np.nan],p=(0.1,0.9),size=[1000,7]))
window=20
breach = -0.03
positions = signal.copy(); positions[:] = np.nan
pf_returns = pd.Series(index=positions.index)
max_dd = pf_returns.copy()
for i in range(len(positions)):
#Positions are just the number of signals divided by the total active ones
positions.iloc[i] = signal.ffill().shift().div(signal.ffill().shift().sum(axis=1),axis=0).iloc[i]
pf_returns.iloc[i] = (positions.iloc[i] * returns.iloc[i]).sum()
equity_line = (pf_returns.iloc[max(i-window,0):i+1]+1).iloc[:i+1].cumprod()
max_dd.iloc[i] = (equity_line/equity_line.cummax()-1).rolling(window, min_periods=1).min().iloc[-1]
if max_dd.iloc[i] <= breach and i != len(positions)-1:
signal.iloc[i+1] = 0
Is it somehow possible to vectorize it? I thought in computing the equity_line at once (definitely possible without all these ilocs), however it then changes the upcoming values. I also thought somehow to a loop that runs in chunks (until next max_dd is found basically) but I am not sure if there is any possibility to vectorize or make this more efficient.
My desired outcome is basically to reproduce what the loops does (in terms of positions, if you wish, and consequently pf_returns) in a more efficient way.

Improve performance of combinations

Hey guys I have a script that compares each possible user and checks how similar their text is:
dictionary = {
t.id: (
t.text,
t.set,
t.compare_string
)
for t in dataframe.itertuples()
}
highly_similar = []
for a, b in itertools.combinations(dictionary.items(), 2):
if a[1][2] == b[1][2] and not a[1][1].isdisjoint(b[1][1]):
similarity_score = fuzz.ratio(a[1][0], b[1][0])
if (similarity_score >= 95 and len(a[1][0]) >= 10) or similarity_score == 100:
highly_similar.append([a[0], b[0], a[1][0], b[1][0], similarity_score])
This script takes around 15 minutes to run, the dataframe contains 120k users, so comparing each possible combination takes quite a bit of time, if I just write pass on the for loop it takes 2 minutes to loop through all values.
I tried using filter() and map() for the if statements and fuzzy score but the performance was worse. I tried improving the script as much as I could but I don't know how I can improve this further.
Would really appreciate some help!
It is slightly complicated to reason about the data since you have not attached it, but we can see multiple places that might provide an improvement:
First, let's rewrite the code in a way which is easier to reason about than using the indices:
dictionary = {
t.id: (
t.text,
t.set,
t.compare_string
)
for t in dataframe.itertuples()
}
highly_similar = []
for a, b in itertools.combinations(dictionary.items(), 2):
a_id, (a_text, a_set, a_compre_string) = a
b_id, (b_text, b_set, b_compre_string) = b
if (a_compre_string == b_compre_string
and not a_set.isdisjoint(b_set)):
similarity_score = fuzz.ratio(a_text, b_text)
if (similarity_score >= 95 and len(a_text) >= 10)
or similarity_score == 100):
highly_similar.append(
[a_id, b_id, a_text, b_text, similarity_score])
You seem to only care about pairs having the same compare_string values. Therefore, and assuming this is not something that all pairs share, we can key by whatever that value is to cover much less pairs.
To put some numbers into it, let's say you have 120K inputs, and 1K values for each value of val[1][2] - then instead of covering 120K * 120K = 14 * 10^9 combinations, you would have 120 bins of size 1K (where in each bin we'd need to check all pairs) = 120 * 1K * 1K = 120 * 10^6 which is about 1000 times faster. And it would be even faster if each bin has less than 1K elements.
import collections
# Create a dictionary from compare_string to all items
# with the same compare_string
items_by_compare_string = collections.defaultdict(list)
for item in dictionary.items():
compare_string = item[1][2]
items_by_compare_string[compare_string].append(items)
# Iterate over each group of items that have the same
# compare string
for item_group in items_by_compare_string.values():
# Check pairs only within that group
for a, b in itertools.combinations(item_group, 2):
# No need to compare the compare_strings!
if not a_set.isdisjoint(b_set):
similarity_score = fuzz.ratio(a_text, b_text)
if (similarity_score >= 95 and len(a_text) >= 10)
or similarity_score == 100):
highly_similar.append(
[a_id, b_id, a_text, b_text, similarity_score])
But, what if we want more speed? Let's look at the remaining operations:
We have a check to find if two sets share at least one item
This seems like an obvious candidate for optimization if we have any knowledge about these sets (to allow us to determine which pairs are even relevant to compare)
Without additional knowledge, and just looking at every two pairs and trying to speed this up, I doubt we can do much - this is probably highly optimized using internal details of Python sets, I don't think it's likely to optimize it further
We a fuzz.ratio computation which is some external function, and I'm going to assume is heavy
If you are using this from the FuzzyWuzzy package, make sure to install python-Levenshtein to get the speedups detailed here
We have some comparisons which we are unlikely to be able to speed up
We might be able to cache the length of a_text by nesting the two loops, but that's negligible
We have appends to a list, which runs on average ("amortized") constant time per operation, so we can't really speed that up
Therefore, I don't think we can reasonably suggest any more speedups without additional knowledge. If we know something about the sets that can help optimize which pairs are relevant we might be able to speed things up further, but I think this is about it.
EDIT: As pointed out in other answers, you can obviously run the code in multi-threading. I assumed you were looking for an algorithmic change that would possibly reduce the number of operations significantly, instead of just splitting these over more CPUs.
Essentially, from python programming side, i see two things that can improve your processing time:
Multi-threads and Vectorized operations
From the fuzzy score side, here is a list of tips you can use to improve your processing time (new anonymous tab to avoid paywall):
https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536
Using multi thread you can speed you operation up to N times, being N the number of threads in you CPU. You can check it with:
import multiprocessing
multiprocessing.cpu_count()
Using vectorized operations you can parallel process your operations in low level with SIMD (single instruction / multiple data) operations, or with gpu tensor operations (like those in tensorflow/pytorch).
Here is a small comparison of results for each case:
import numpy as np
import time
A = [np.random.rand(512) for i in range(2000)]
B = [np.random.rand(512) for i in range(2000)]
high_similarity = []
def measure(i,j,a,b,high_similarity):
d = ((a-b)**2).sum()
if d>12:
high_similarity.append((i,j,d))
start_single_thread = time.time()
for i in range(len(A)):
for j in range(len(B)):
if i<j:
measure(i,j,A[i],B[j],high_similarity)
finis_single_thread = time.time()
print("single thread time:",finis_single_thread-start_single_thread)
out[0] single thread time: 147.64517450332642
running on multi thread:
from threading import Thread
high_similarity = []
def measure(a = None,b= None,high_similarity = None):
d = ((a-b)**2).sum()
if d > 12:
high_similarity.append(d)
start_multi_thread = time.time()
for i in range(len(A)):
for j in range(len(B)):
if i<j:
thread = Thread(target=measure,kwargs= {'a':A[i],'b':B[j],'high_similarity':high_similarity} )
thread.start()
thread.join()
finish_multi_thread = time.time()
print("time to run on multi threads:",finish_multi_thread - start_multi_thread)
out[1] time to run on multi-threads: 11.946279764175415
A_array = np.array(A)
B_array = np.array(B)
start_vectorized = time.time()
for i in range(len(A_array)):
#vectorized distance operation
dists = (A_array-B_array)**2
high_similarity+= dists[dists>12].tolist()
aux = B_array[-1]
np.delete(B_array,-1)
np.insert(B_array, 0, aux)
finish_vectorized = time.time()
print("time to run vectorized operations:",finish_vectorized-start_vectorized)
out[2] time to run vectorized operations: 2.302949905395508
Note that you can't guarantee any order of execution, so will you also need to store the index of results. The snippet of code is just to illustrate that you can use parallel process, but i highly recommend to use a pool of threads and divide your dataset in N subsets for each worker and join the final result (instead of create a thread for each function call like i did).

"Processes" or "Threads" strategy to fill independant blocks into global array?

In python2, I would like to fill a global array by filling with parallel processes (or threads) different sub-arrays (there is a total 16 blocks). I must precise that each block doesn't depend of the others, I mean when I perfom the assignement of each cells of the current block.
1) From what I have found, I would have a great benefit from a CPU multi-cores by using different "processes" but it seems a little bit complicated to share the global array by all others processes.
2) From another point of view, I can use "threads" instead of "processes" since the implementation is less hard. I have found out the libray "ThreadPool" from "multiprocessing.dummy" allows to share this global array by all others concurrent threads.
For example, with python2.7, the following code works :
from multiprocessing.dummy import Pool as ThreadPool
## discretization along x-axis and y-axis for each block
arrayCross_k = np.linspace(kMIN, kMAX, dimPoints)
arrayCross_mu = np.linspace(-1, 1, dimPoints)
# Build all big matrix with N total blocks = dimBlock*dimBlock = 16 here
arrayFullCross = np.zeros((dimBlocks, dimBlocks, arrayCross_k.size, arrayCross_mu.size))
dimBlocks = 4
# Size of dimension along k and mu axis
dimPoints = 100
# dimension along one dimension of global arrayFullCross
dimMatCovCross = dimBlocks*dimPoints
# Build cross-correlation matrix
def buildCrossMatrix_loop(params_array):
# rows indices
xb = params_array[0]
# columns indices
yb = params_array[1]
# Current redshift
z = zrange[params_array[2]]
# Loop inside block
for ub in range(dimPoints):
for vb in range(dimPoints):
# Diagonal blocs
if (xb == yb):
# Fill the (xb,yb) su-block of global array by
arrayFullCross[xb][xb][ub][vb] = 2*P_obs_cross(arrayCross_k[ub], arrayCross_mu[vb] , z, 10**P_m(np.log10(arrayCross_k[ub])),
...
...
# End of function buildCrossMatrix_loop
# Main loop
while i < len(zrange):
def generatorCrossMatrix(index):
for igen in range(dimBlocks):
for lgen in range(dimBlocks):
yield igen, lgen, index
if __name__ == '__main__':
# Use 20 threads
pool = ThreadPool(20)
pool.map(buildCrossMatrix_loop, generatorCrossMatrix(i))
# Increment index "i"
i = i+1
But unfortunately, even by using 20 threads, I realize that the cores of my CPU are not fully running (actually, with 'top' or 'htop' command, I only see a single process at 100%).
3) What is the strategy that I have to chose if I want to full exploit the 16 cores of my CPU (like this is the case with pool.map(function, generator)) but with also the sharing of global array ?
4) some people told me to do I/O for each sub-array (basically, write each block in a file and gather all sub-arrays by reading them and get the full array filled). This solution is handy but I would like to avoid I/O (unless there is really not other solutions).
5) I have practised MPI library with C language and the operation of filling sub-array and finally gather them to build a big array, is not very complicated. However, I wouldn't like to use MPI with Python language (I don't know even if it exists).
6) I tried also to use Process with target equal to my filling function (buildCrossMatrix_loop) like this into while Main loop above :
from multiprocessing import Process
# Main loop on z range
while i < len(zrange):
params_p = []
for ip in range(4):
for jp in range(4):
params_p.append(ip)
params_p.append(jp)
params_p.append(i)
p = Process(target=buildCrossMatrix_loop, args=(params_p,))
params_p = []
p.start()
# Finished : wait everybody
p.join()
...
...
i = i+1
# End of main while loop
But the final 2D global array is filled only of zeros. So I must deduce that Process function doesn't share the array to fill ?
7) So which strategy I have to look for ? :
1. The using of "pool processes" and find a way to share the global array knowing all my 16-cores will be running
2. The using of "Threads" and share the global array but performances, at first sight, seems to be less good than with "pool processes". Maybe there is a way to increase the power of each "Threads", I mean like with "pool processes" ?
I tried to follow the different examples on https://docs.python.org/2/library/multiprocessing.html but without success, this is to say, without relevant performances from a speed-up point of view.
I think that in my case, the major issue is the gathering of all sub-arrays OR the fact that the global array arrayFullCross is not shared by other processes or threads.
If someone had a simple example of the sharing of global variable in a multi-threading context (here an array), this would nice to put it here.
UPDATE 1: I made test with the Threading (and not multiprocessing) but performances remain rather bad. GIL is not apparently unlocked, i.e only one process appears in htop command (maybe the version of Threading library is not the right one).
So I am going to try to handle my issue with using the "return" method.
Naively, I tried to return the whole array at the end of the function on which I apply the map function, like this :
# Build cross-correlation matrix
def buildCrossMatrix_loop(params_array):
# rows indices
xb = params_array[0]
# columns indices
yb = params_array[1]
# Current redshift
z = zrange[params_array[2]]
# Loop inside block
for ub in range(dimPoints):
for vb in range(dimPoints):
# Diagonal blocs
if (xb == yb):
arrayFullCross[xb][xb][ub][vb] = 2*P_obs_cross(arrayCross_k[ub], arrayCross_mu[vb])
...
... #others assignments on arrayFullCross elements
# Return global array to main process
return arrayFullCross
Then, I tried to receive this global array from map like this :
if __name__ == '__main__':
pool = Pool(16)
outputArray = pool.map(buildCrossMatrix_loop, generatorCrossMatrix(i))
pool.terminate()
## Print outputArray
print 'outputArray = ', outputArray
## Reshape 4D outputArray to 2D array
arrayFullCross2D_swap = np.array(outputArray).swapaxes(1,2).reshape(dimMatCovCross,dimMatCovCross)
Unfortunately, when I print the outputArray, I get :
outputArray = [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
This is not the 4D outputArray expected, just a list of 16 None (I think that number of 16 correspond to the number of processes provided by generatorCrossMatrix(i)).
How could I get back the whole 4D array once map is launched and when it has finished ?
First of all I believe multiprocessing.ThreadPool is a private API so you should avoid it. Now multiprocessing.dummy is a useless module. It does not do any multithreading/processing that's why you don't see any benefit. You should use the "plain" multiprocessing module.
The second code does not work because it is using multiple processes. Processes do not share memory, so the changes you do in a subprocess are not reflected in the other subprocesses or the main process. You either want to:
Return the value and combine them together in the main process, for example using multiprocessing.Pool.map
Use threading instead of multiprocessing: just replaceimport multiprocessingwithimport threadingandmultiprocessing.Processwiththreading.Thread` and the code should work.
Note that the threading version will work only because numpy releases the GIL during computations, otherwise it would be stuck at 1 CPU.
You may want to look at this similar question which I answered a couple of minutes ago.

Creating loop with particular behaviour depending on data length

In my program I have a part of code that uses an Estimated Moving Average (EMA) 4 times, but each time with different length. The program uses one or more EMAs depending on how much data it gets.
For now the code is not looped, just copy pasted with minor tweeks. That makes making changes difficult because I have to change everything 4 times.
Can somebody help me loop the code in such a way it wont loose it behaviour pattern. The mock-up code is presented here:
import random
import numpy as np
zakres=[5,10,15,20]
data=[]
def SI_sma(data, zakres):
weights=np.ones((zakres,))/zakres
smas=np.convolve(data, weights, 'valid')
return smas
def SI_ema(data, zakres):
weights_ema = np.exp(np.linspace(-1.,0.,zakres))
weights_ema /= weights_ema.sum()
ema=np.convolve(data,weights_ema)[:len(data)]
ema[:zakres]=ema[zakres]
return ema
while True:
data.append(random.uniform(0,100))
print(len(data))
if len(data)>zakres[0]:
smas=SI_sma(data=data, zakres=zakres[0])
ema=SI_ema(data=data, zakres=zakres[0])
print(smas[-1]) #calc using smas
print(ema[-1]) #calc using ema1
if len(data)>zakres[1]:
ema2=SI_ema(data=data, zakres=zakres[1])
print(ema2[-1]) #calc using ema2
if len(data)>zakres[2]:
ema3=SI_ema(data=data, zakres=zakres[2])
print(ema3[-1]) #calc using ema3
if len(data)>zakres[3]:
ema4=SI_ema(data=data, zakres=zakres[3])
print(ema4[-1]) #calc using ema4
input("press a key")
A variable number of variables is usually a bad idea. As you have found, it can make maintaining code cumbersome and error-prone. Instead, you can define a dict of results and use a for loop to iterate scenarios, defining len(data) just once.
ema = {}
while True:
data.append(random.uniform(0,100))
n = len(data)
for i, val in enumerate(zakres):
if n > val:
if i == 1:
smas = SI_sma(data=data, zakres=val)
ema[i] = SI_ema(data=data, zakres=val)
You can then access results via ema[0], ..., ema[3] as required.

How do I vectorize the following loop in Numpy?

"""Some simulations to predict the future portfolio value based on past distribution. x is
a numpy array that contains past returns.The interpolated_returns are the returns
generated from the cdf of the past returns to simulate future returns. The portfolio
starts with a value of 100. portfolio_value is filled up progressively as
the program goes through every loop. The value is multiplied by the returns in that
period and a dollar is removed."""
portfolio_final = []
for i in range(10000):
portfolio_value = [100]
rand_values = np.random.rand(600)
interpolated_returns = np.interp(rand_values,cdf_values,x)
interpolated_returns = np.add(interpolated_returns,1)
for j in range(1,len(interpolated_returns)+1):
portfolio_value.append(interpolated_returns[j-1]*portfolio_value[j-1])
portfolio_value[j] = portfolio_value[j]-1
portfolio_final.append(portfolio_value[-1])
print (np.mean(portfolio_final))
I couldn't find a way to write this code using numpy. I was having a look at iterations using nditer but I was unable to move ahead with that.
I guess the easiest way to figure out how you can vectorize your stuff would be to look at the equations that govern your evolution and see how your portfolio actually iterates, finding patterns that could be vectorized instead of trying to vectorize the code you already have. You would have noticed that the cumprod actually appears quite often in your iterations.
Nevertheless you can find the semi-vectorized code below. I included your code as well such that you can compare the results. I also included a simple loop version of your code which is much easier to read and translatable into mathematical equations. So if you share this code with somebody else I would definitely use the simple loop option. If you want some fancy-pants vectorizing you can use the vector version. In case you need to keep track of your single steps you can also add an array to the simple loop option and append the pv at every step.
Hope that helps.
Edit: I have not tested anything for speed. That's something you can easily do yourself with timeit.
import numpy as np
from scipy.special import erf
# Prepare simple return model - Normal distributed with mu &sigma = 0.01
x = np.linspace(-10,10,100)
cdf_values = 0.5*(1+erf((x-0.01)/(0.01*np.sqrt(2))))
# Prepare setup such that every code snippet uses the same number of steps
# and the same random numbers
nSteps = 600
nIterations = 1
rnd = np.random.rand(nSteps)
# Your code - Gives the (supposedly) correct results
portfolio_final = []
for i in range(nIterations):
portfolio_value = [100]
rand_values = rnd
interpolated_returns = np.interp(rand_values,cdf_values,x)
interpolated_returns = np.add(interpolated_returns,1)
for j in range(1,len(interpolated_returns)+1):
portfolio_value.append(interpolated_returns[j-1]*portfolio_value[j-1])
portfolio_value[j] = portfolio_value[j]-1
portfolio_final.append(portfolio_value[-1])
print (np.mean(portfolio_final))
# Using vectors
portfolio_final = []
for i in range(nIterations):
portfolio_values = np.ones(nSteps)*100.0
rcp = np.cumprod(np.interp(rnd,cdf_values,x) + 1)
portfolio_values = rcp * (portfolio_values - np.cumsum(1.0/rcp))
portfolio_final.append(portfolio_values[-1])
print (np.mean(portfolio_final))
# Simple loop
portfolio_final = []
for i in range(nIterations):
pv = 100
rets = np.interp(rnd,cdf_values,x) + 1
for i in range(nSteps):
pv = pv * rets[i] - 1
portfolio_final.append(pv)
print (np.mean(portfolio_final))
Forget about np.nditer. It does not improve the speed of iterations. Only use if you intend to go one and use the C version (via cython).
I'm puzzled about that inner loop. What is it supposed to be doing special? Why the loop?
In tests with simulated values these 2 blocks of code produce the same thing:
interpolated_returns = np.add(interpolated_returns,1)
for j in range(1,len(interpolated_returns)+1):
portfolio_value.append(interpolated_returns[j-1]*portfolio[j-1])
portfolio_value[j] = portfolio_value[j]-1
interpolated_returns = (interpolated_returns+1)*portfolio - 1
portfolio_value = portfolio_value + interpolated_returns.tolist()
I assuming that interpolated_returns and portfolio are 1d arrays of the same length.

Categories

Resources