NOTE: I'm trying to rewrite this Python function which is decorated to compile with Pythran into a Python module. It currently uses OpenMP fine but not SIMD. pythran -DUSE_XSIMD my-script.py
So the bounty is for rewriting a bit of Python code to take advantage of XSIMD + OpenMP... I think vectorizing the code segment will allow it to use SIMD so looking for a bit of help here. Here's what I have - you can find GBM stock price simulation code examples all over the internet BTW; this one is a bit tricky since it's multiple assets though:
#pythran export generate_paths(int, int, int, int, float64, float64, float64[:,:] order(C), float64[:,:,:] order(C))
import numpy as np
def generate_paths(N_assets, sims, Simulated_time_steps, timesteps, timestep, drift, CurrentVol, randnums3Dcorr):
Simulated_paths = np.zeros((N_assets, sims, Simulated_time_steps))
#omp parallel for
for i in range(Simulated_time_steps):
randnumscorr = randnums3Dcorr[:,:, i]
dt = np.array((timestep, timestep*2, timestep*3))
drift_component = np.multiply( drift - 0.5*CurrentVol**2, dt.reshape(dt.shape[0],1))
random_component = np.multiply(randnumscorr.T , np.multiply(CurrentVol.reshape(CurrentVol.shape[0],) , np.sqrt(dt) ))
Simulated_paths[:,:,i] = np.exp( drift_component.reshape(drift_component.shape[0],1) + random_component.T)
return Simulated_paths
# [NOTE] uncomment the below if you want to run it in Python...
#
# if __name__ == "__main__":
# N_assets = 3
# sims = 8192
# Simulated_time_steps = 20
# timesteps = 20
# timestep = 1/365
# drift = 0.0
# CurrentVol = np.array([0.5,0.4,0.3])
# CurrentVol = CurrentVol.reshape(CurrentVol.shape[0],1)
# randnums3Dcorr = np.random.rand(N_assets,sims,timesteps)
# Simulated_paths = generate_paths(N_assets, sims, Simulated_time_steps, timesteps, timestep, drift, CurrentVol, randnums3Dcorr)
Related
I'm trying to learn more about the use of shared memory to improve performance in some cuda kernels in Numba, for this I was looking at the Matrix multiplication Example in the Numba documentation and tried to implement to see the gain.
This is my test implementation, I'm aware that the example in the documentation has some issues that I followed from Here, so I copied the fixed example code.
from timeit import default_timer as timer
import numba
from numba import cuda, jit, int32, int64, float64, float32
import numpy as np
from numpy import *
#cuda.jit
def matmul(A, B, C):
"""Perform square matrix multiplication of C = A * B
"""
i, j = cuda.grid(2)
if i < C.shape[0] and j < C.shape[1]:
tmp = 0.
for k in range(A.shape[1]):
tmp += A[i, k] * B[k, j]
C[i, j] = tmp
# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
TPB = 16
#cuda.jit
def fast_matmul(A, B, C):
# Define an array in the shared memory
# The size and type of the arrays must be known at compile time
sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
x, y = cuda.grid(2)
tx = cuda.threadIdx.x
ty = cuda.threadIdx.y
bpg = cuda.gridDim.x # blocks per grid
# Each thread computes one element in the result matrix.
# The dot product is chunked into dot products of TPB-long vectors.
tmp = 0.
for i in range(bpg):
# Preload data into shared memory
sA[ty, tx] = 0
sB[ty, tx] = 0
if y < A.shape[0] and (tx+i*TPB) < A.shape[1]:
sA[ty, tx] = A[y, tx + i * TPB]
if x < B.shape[1] and (ty+i*TPB) < B.shape[0]:
sB[ty, tx] = B[ty + i * TPB, x]
# Wait until all threads finish preloading
cuda.syncthreads()
# Computes partial product on the shared memory
for j in range(TPB):
tmp += sA[ty, j] * sB[j, tx]
# Wait until all threads finish computing
cuda.syncthreads()
if y < C.shape[0] and x < C.shape[1]:
C[y, x] = tmp
size = 1024*4
tpbx,tpby = 16, 16
tpb = (tpbx,tpby)
bpgx, bpgy = int(size/tpbx), int(size/tpby)
bpg = (bpgx, bpgy)
a_in = cuda.to_device(np.arange(size*size, dtype=np.float32).reshape((size, size)))
b_in = cuda.to_device(np.ones(size*size, dtype=np.float32).reshape((size, size)))
c_out1 = cuda.device_array_like(a_in)
c_out2 = cuda.device_array_like(a_in)
s = timer()
cuda.synchronize()
matmul[bpg,tpb](a_in, b_in, c_out1);
cuda.synchronize()
gpu_time = timer() - s
print(gpu_time)
c_host1 = c_out1.copy_to_host()
print(c_host1)
s = timer()
cuda.synchronize()
fast_matmul[bpg,tpb](a_in, b_in, c_out2);
cuda.synchronize()
gpu_time = timer() - s
print(gpu_time)
c_host2 = c_out2.copy_to_host()
print(c_host2)
The time of execution of the above kernels are essentially the same, actually the matmul was making faster for some larger input matrices. I would like to know what I'm missing in order to see the gain as the documentation suggests.
Thanks,
Bruno.
I made a performance mistake in the code I put in that other answer. I've now fixed it. In a nutshell this line:
tmp = 0.
caused numba to create a 64-bit floating point variable tmp. That triggered other arithmetic in the kernel to be promoted from 32-bit floating point to 64-bit floating point. That is inconsistent with the rest of the arithmetic and also inconsistent with the intent of the demonstration in the other answer. This error affects both kernels.
When I change it in both kernels to
tmp = float32(0.)
both kernels get noticeably faster, and on my GTX960 GPU, your test case shows that the shared code runs about 2x faster than the non-shared code (but see below).
The non-shared kernel also has a performance issue related to memory access patterns. Similar to the indices swap in that other answer, for this particular scenario only, we can rectify this problem simply by reversing the assigned indices:
j, i = cuda.grid(2)
in the non-shared kernel. This allows that kernel to perform approximately as well as it can, and with that change the shared kernel runs about 2x faster than the non-shared kernel. Without that additional change to the non-shared kernel, the performance of the non-shared kernel is much worse.
Background: I'm trying to create a simple bootstrap function for sampling means with replacement. I want to parallelize the function since I will eventually be deploying this on data with millions of data points and will want to have sample sizes much larger. I've ran other examples such as the Mandelbrot example. In the code below you'll see that I have a CPU version of the code, which runs fine as well.
I've read several resources to get this up and running:
Random Numbers with CUDA
Writing Kernels in CUDA
The issue: This is my first foray into CUDA programming and I believe I have everything setup correctly. I'm getting this one error that I cannot seem to figure out:
TypingError: cannot determine Numba type of <class 'object'>
I believe the LOC in question is:
bootstrap_rand_gpu[threads_per_block, blocks_per_grid](rng_states, dt_arry_device, n_samp, out_mean_gpu)
Attempts to resolve the issue: I won't go into full detail, but here are the following attempts
Thought it might have something to do with cuda.to_device(). I changed it around and I also called cuda.to_device_array_like(). I've used to_device() for all parameters, and for just a few. I've seen code samples where it's used for every parameter and sometimes not. So I'm not sure what should be done.
I've removed the random number generator for GPUs (create_xoroshiro128p_states) and just used a static value to test.
Explicitly assigning integers with int() (and not). Not sure why I tried this. I read that Numba only supports a limited data types, so I made sure that they were ints
Numba Supported Datatypes
Few other things I don't recall...
Apologies for messy code. I'm a bit at wits' end on this.
Below is the full code:
import numpy as np
from numpy import random
from numpy.random import randn
import pandas as pd
from timeit import default_timer as timer
from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
from numba import *
def bootstrap_rand_cpu(dt_arry, n_samp, boot_samp, out_mean):
for i in range(boot_samp):
rand_idx = random.randint(n_samp-1,size=(50)) #get random array of indices 0-49, with replacement
out_mean[i] = dt_arry[rand_idx].mean()
#cuda.jit
def bootstrap_rand_gpu(rng_states, dt_arry, n_samp, out_mean):
thread_id = cuda.grid(1)
stride = cuda.gridsize(1)
for i in range(thread_id, dt_arry.shape[0], stride):
for k in range(0,n_samp-1,1):
rand_idx_arry[k] = int(xoroshiro128p_uniform_float32(rng_states, thread_id) * 49)
out_mean[thread_id] = dt_arry[rand_idx_arry].mean()
mean = 10
rand_fluc = 3
n_samp = int(50)
boot_samp = int(1000)
dt_arry = (random.rand(n_samp)-.5)*rand_fluc + mean
out_mean_cpu = np.empty(boot_samp)
out_mean_gpu = np.empty(boot_samp)
##################
# RUN ON CPU
##################
start = timer()
bootstrap_rand_cpu(dt_arry, n_samp, boot_samp, out_mean_cpu)
dt = timer() - start
print("CPU Bootstrap mean of " + str(boot_samp) + " mean samples: " + str(out_mean_cpu.mean()))
print("Bootstrap CPU in %f s" % dt)
##################
# RUN ON GPU
##################
threads_per_block = 64
blocks_per_grid = 24
#create random state for each state in the array
rng_states = create_xoroshiro128p_states(threads_per_block * blocks_per_grid, seed=1)
start = timer()
dt_arry_device = cuda.to_device(dt_arry)
out_mean_gpu_device = cuda.to_device(out_mean_gpu)
bootstrap_rand_gpu[threads_per_block, blocks_per_grid](rng_states, dt_arry_device, n_samp, out_mean_gpu_device)
out_mean_gpu_device.copy_to_host()
dt = timer() - start
print("GPU Bootstrap mean of " + str(boot_samp) + " mean samples: " + str(out_mean_gpu.mean()))
print("Bootstrap GPU in %f s" % dt)
You seem to have at least 4 issues:
In your kernel code, rand_idx_arry is undefined.
You can't do .mean() in cuda device code
Your kernel launch config parameters are reversed.
Your kernel had an incorrect range for the grid-stride loop. dt_array.shape[0] is 50, so you were only populating the first 50 locations in your gpu output array. Just like your host code, the range for this grid-stride loop should be the size of the output array (which is boot_samp)
There may be other issues as well, but when I refactor your code like this to address those issues, it seems to run without error:
$ cat t65.py
#import matplotlib.pyplot as plt
import numpy as np
from numpy import random
from numpy.random import randn
from timeit import default_timer as timer
from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
from numba import *
def bootstrap_rand_cpu(dt_arry, n_samp, boot_samp, out_mean):
for i in range(boot_samp):
rand_idx = random.randint(n_samp-1,size=(50)) #get random array of indices 0-49, with replacement
out_mean[i] = dt_arry[rand_idx].mean()
#cuda.jit
def bootstrap_rand_gpu(rng_states, dt_arry, n_samp, out_mean):
thread_id = cuda.grid(1)
stride = cuda.gridsize(1)
for i in range(thread_id, out_mean.shape[0], stride):
my_sum = 0.0
for k in range(0,n_samp-1,1):
my_sum += dt_arry[int(xoroshiro128p_uniform_float32(rng_states, thread_id) * 49)]
out_mean[thread_id] = my_sum/(n_samp-1)
mean = 10
rand_fluc = 3
n_samp = int(50)
boot_samp = int(1000)
dt_arry = (random.rand(n_samp)-.5)*rand_fluc + mean
#plt.plot(dt_arry)
#figureData = plt.figure(1)
#plt.title('Plot ' + str(n_samp) + ' samples')
#plt.plot(dt_arry)
#figureData.show()
out_mean_cpu = np.empty(boot_samp)
out_mean_gpu = np.empty(boot_samp)
##################
# RUN ON CPU
##################
start = timer()
bootstrap_rand_cpu(dt_arry, n_samp, boot_samp, out_mean_cpu)
dt = timer() - start
print("CPU Bootstrap mean of " + str(boot_samp) + " mean samples: " + str(out_mean_cpu.mean()))
print("Bootstrap CPU in %f s" % dt)
#figureMeanCpu = plt.figure(2)
#plt.title('Plot '+ str(boot_samp) + ' bootstrap means - CPU')
#plt.plot(out_mean_cpu)
#figureData.show()
##################
# RUN ON GPU
##################
threads_per_block = 64
blocks_per_grid = 24
#create random state for each state in the array
rng_states = create_xoroshiro128p_states(threads_per_block * blocks_per_grid, seed=1)
start = timer()
dt_arry_device = cuda.to_device(dt_arry)
out_mean_gpu_device = cuda.to_device(out_mean_gpu)
bootstrap_rand_gpu[blocks_per_grid, threads_per_block](rng_states, dt_arry_device, n_samp, out_mean_gpu_device)
out_mean_gpu = out_mean_gpu_device.copy_to_host()
dt = timer() - start
print("GPU Bootstrap mean of " + str(boot_samp) + " mean samples: " + str(out_mean_gpu.mean()))
print("Bootstrap GPU in %f s" % dt)
python t65.py
CPU Bootstrap mean of 1000 mean samples: 10.148048544038735
Bootstrap CPU in 0.037496 s
GPU Bootstrap mean of 1000 mean samples: 10.145088765532936
Bootstrap GPU in 0.416822 s
$
Notes:
I've commented out a bunch of stuff that doesn't seem to be relevant. You might want to do something like that in the future when posting code (remove stuff that is not relevant to your question.)
I've fixed some things about your final GPU printout at the end, also.
I haven't studied your code carefully. I'm not suggesting anything is defect free. I'm just pointing out some issues and providing a guide for how they might be addressed. I can see the results don't match between CPU and GPU, but since I don't know what you're doing, and also because the random number generators don't match between CPU and GPU code, it's not obvious to me that things should match.
trying to construct a large scale quadratic constraint in Pyomo as follows:
import pyomo as pyo
from pyomo.environ import *
scale = 5000
pyo.n = Set(initialize=range(scale))
pyo.x = Var(pyo.n, bounds=(-1.0,1.0))
# Q is a n-by-n matrix in numpy array format, where n equals <scale>
Q_values = dict(zip(list(itertools.product(range(0,scale), range(0,scale))), Q.flatten()))
pyo.Q = Param(pyo.n, pyo.n, initialize=Q_values)
pyo.xQx = Constraint( expr=sum( pyo.x[i]*pyo.Q[i,j]*pyo.x[j] for i in pyo.n for j in pyo.n ) <= 1.0 )
turns out the last line is unbearably slow given the problem scale. tried several things mentioned in PyPSA, Performance of creating Pyomo constraints and pyomo seems very slow to write models. but no luck.
any suggestion (once the model was constructed, Ipopt solving was also slow. but that's independent from Pyomo i guess)?
ps: construct such quadratic constraint directly as follows didnt help either (also unbearably slow)
pyo.xQx = Constraint( expr=sum( pyo.x[i]*Q[i,j]*pyo.x[j] for i in pyo.n for j in pyo.n ) <= 1.0 )
You can get a small speed-up by using quicksum in place of sum. To measure the performance of the last line, I modified your code a little bit, as shown:
import itertools
from pyomo.environ import *
import time
import numpy as np
scale = 5000
m = ConcreteModel()
m.n = Set(initialize=range(scale))
m.x = Var(m.n, bounds=(-1.0, 1.0))
# Q is a n-by-n matrix in numpy array format, where n equals <scale>
Q = np.ones([scale, scale])
Q_values = dict(
zip(list(itertools.product(range(scale), range(scale))), Q.flatten()))
m.Q = Param(m.n, m.n, initialize=Q_values)
t = time.time()
m.xQx = Constraint(expr=sum(m.x[i]*m.Q[i, j]*m.x[j]
for i in m.n for j in m.n) <= 1.0)
print("Time to make QuadCon = {}".format(time.time() - t))
The time I measured with sum was around 174.4 s. With quicksum I got 163.3 seconds.
Not satisfied with such a modest gain, I tried to re-formulate as a SOCP. If you can factorize Q like so: Q= (F^T F), then you could easily express your constraint as a quadratic cone, as shown below:
import itertools
import time
import pyomo.kernel as pmo
from pyomo.environ import *
import numpy as np
scale = 5000
m = pmo.block()
m.n = np.arange(scale)
m.x = pmo.variable_list()
for j in m.n:
m.x.append(pmo.variable(lb=-1.0, ub=1.0))
# Q = (F^T)F factors (eg.: Cholesky factor)
_F = np.ones([scale, scale])
t = time.time()
F = pmo.parameter_list()
for f in _F:
_row = pmo.parameter_list(pmo.parameter(_e) for _e in f)
F.append(_row)
print("Time taken to make parameter F = {}".format(time.time() - t))
t1 = time.time()
x_expr = pmo.expression_tuple(pmo.expression(
expr=sum_product(f, m.x, index=m.n)) for f in F)
print("Time for evaluating Fx = {}".format(time.time() - t1))
t2 = time.time()
m.xQx = pmo.conic.quadratic.as_domain(1, x_expr)
print("Time for quad constr = {}".format(time.time() - t2))
Running on the same machine, I observed a time of around 112 seconds in the preparation of the expression that gets passed to the cone. Actually preparing the cone takes very little time (0.031 s).
Naturally, the only solver that can handle Conic constraints in pyomo is MOSEK. A recent update to the Pyomo-MOSEK interface has also shown promising speed-ups.
You can try MOSEK for free by getting yourself a MOSEK trial license. If you want to read more about Conic reformulations, a quick and thorough guide can be found in the MOSEK modelling cookbook. Lastly, if you are affiliated with an academic institution, then we can offer you a personal/institutional academic license. Hope you find this useful.
I have recently played around with Python's multiprocessing module to speed up the forward-backward algorithm for Hidden Markov Models as forward filtering and backward filtering can run independently. Seeing the run-time halve was awe-inspiring stuff.
I now attempt to include some multiprocessing in my iterative Viterbi algorithm.In this algorithm, the two processes I am trying to run are not independent. The val_max part can run independently but arg_max[t] depends on val_max[t-1]. So I played with the idea that one can run val_max as a separate process and then arg_max also as a separate process which can be fed by val_max.
I admit to be a bit out of my depth here and do not know much about multiprocessing other than watching some basic video's on it as well as browsing blogs. I provide my attempt below, but it does not work.
import numpy as np
from time import time,sleep
import multiprocessing as mp
class Viterbi:
def __init__(self,A,B,pi):
self.M = A.shape[0] # number of hidden states
self.A = A # Transition Matrix
self.B = B # Observation Matrix
self.pi = pi # Initial distribution
self.T = None # time horizon
self.val_max = None
self.arg_max = None
self.obs = None
self.sleep_time = 1e-6
self.output = mp.Queue()
def get_path(self,x):
# returns the most likely state sequence given observed sequence x
# using the Viterbi algorithm
self.T = len(x)
self.val_max = np.zeros((self.T, self.M))
self.arg_max = np.zeros((self.T, self.M))
self.val_max[0] = self.pi*self.B[:,x[0]]
for t in range(1, self.T):
# Indepedent Process
self.val_max[t] = np.max( self.A*np.outer(self.val_max[t-1],self.B[:,obs[t]]) , axis = 0 )
# Dependent Process
self.arg_max[t] = np.argmax( self.val_max[t-1]*self.A.T, axis = 1)
# BACKTRACK
states = np.zeros(self.T, dtype=np.int32)
states[self.T-1] = np.argmax(self.val_max[self.T-1])
for t in range(self.T-2, -1, -1):
states[t] = self.arg_max[t+1, states[t+1]]
return states
def get_val(self):
'''Independent Process'''
for t in range(1,self.T):
self.val_max[t] = np.max( self.A*np.outer(self.val_max[t-1],self.B[:,self.obs[t]]) , axis = 0 )
self.output.put(self.val_max)
def get_arg(self):
'''Dependent Process'''
for t in range(1,self.T):
while 1:
# Process info if available
if self.val_max[t-1].any() != 0:
self.arg_max[t] = np.argmax( self.val_max[t-1]*self.A.T, axis = 1)
break
# Else sleep and wait for info to arrive
sleep(self.sleep_time)
self.output.put(self.arg_max)
def get_path_parallel(self,x):
self.obs = x
self.T = len(obs)
self.val_max = np.zeros((self.T, self.M))
self.arg_max = np.zeros((self.T, self.M))
val_process = mp.Process(target=self.get_val)
arg_process = mp.Process(target=self.get_arg)
# get first initial value for val_max which can feed arg_process
self.val_max[0] = self.pi*self.B[:,obs[0]]
arg_process.start()
val_process.start()
arg_process.join()
val_process.join()
Note: get_path_parallel does not have backtracking yet.
It would seem that val_process and arg_process never really run. Really not sure why this happens. You can run the code on the Wikipedia example for the viterbi algorithm.
obs = np.array([0,1,2]) # normal then cold and finally dizzy
pi = np.array([0.6,0.4])
A = np.array([[0.7,0.3],
[0.4,0.6]])
B = np.array([[0.5,0.4,0.1],
[0.1,0.3,0.6]])
viterbi = Viterbi(A,B,pi)
path = viterbi.get_path(obs)
I also tried using Ray. However, I had no clue what I was really doing there. Can you please help recommend me what to do in order to get the parallel version to run. I must be doing something very wrong but I do not know what.
Your help would be much appreciated.
I have managed to get my code working thanks to #SıddıkAçıl. The producer-consumer pattern is what does the trick. I also realised that the processes can run successfully but if one does not store the final results in a "result queue" of sorts then it vanishes. By this I mean, that I filled in values in my numpy arrays val_max and arg_max by allowing the process to start() but when I called them, they were still np.zero arrays. I verified that they did fill up to the correct arrays by printing them just as the process is about to terminate (at last self.T in iteration). So instead of printing them, I just added them to a multiprocessing Queue object on the final iteration to capture then entire filled up array.
I provide my updated working code below. NOTE: it is working but takes twice as long to complete as the serial version. My thoughts on why this might be so is as follows:
I can get it to run as two processes but don't actually know how to do it properly. Experienced programmers might know how to fix it with the chunksize parameter.
The two processes I am separating are numpy matrix operations. These processes execute so fast already that the overhead of concurrency (multiprocessing) is not worth the theoretical improvement. Had the two processes been the two original for loops (as used in Wikipedia and most implementations) then multiprocessing might have given gains (perhaps I should investigate this). Furthermore, because we have a producer-consumer pattern and not two independent processes (producer-producer pattern) we can only expect the producer-consumer pattern to run as long as the longest of the two processes (in this case the producer takes twice as long as the consumer). We can not expect run time to halve as in the producer-producer scenario (this happened with my parallel forward-backward HMM filtering algorithm).
My computer has 4 cores and numpy already does built-in CPU multiprocessing optimization on its operations. By me attempting to use cores to make the code faster, I am depriving numpy of cores that it could use in a more effective manner. To figure this out, I am going to time the numpy operations and see if they are slower in my concurrent version as compared to that of my serial version.
I will update if I learn anything new. If you perhaps know the real reason for why my concurrent code is so much slower, please do let me know. Here is the code:
import numpy as np
from time import time
import multiprocessing as mp
class Viterbi:
def __init__(self,A,B,pi):
self.M = A.shape[0] # number of hidden states
self.A = A # Transition Matrix
self.B = B # Observation Matrix
self.pi = pi # Initial distribution
self.T = None # time horizon
self.val_max = None
self.arg_max = None
self.obs = None
self.intermediate = mp.Queue()
self.result = mp.Queue()
def get_path(self,x):
'''Sequential/Serial Viterbi Algorithm with backtracking'''
self.T = len(x)
self.val_max = np.zeros((self.T, self.M))
self.arg_max = np.zeros((self.T, self.M))
self.val_max[0] = self.pi*self.B[:,x[0]]
for t in range(1, self.T):
# Indepedent Process
self.val_max[t] = np.max( self.A*np.outer(self.val_max[t-1],self.B[:,obs[t]]) , axis = 0 )
# Dependent Process
self.arg_max[t] = np.argmax( self.val_max[t-1]*self.A.T, axis = 1)
# BACKTRACK
states = np.zeros(self.T, dtype=np.int32)
states[self.T-1] = np.argmax(self.val_max[self.T-1])
for t in range(self.T-2, -1, -1):
states[t] = self.arg_max[t+1, states[t+1]]
return states
def get_val(self,intial_val_max):
'''Independent Poducer Process'''
val_max = intial_val_max
for t in range(1,self.T):
val_max = np.max( self.A*np.outer(val_max,self.B[:,self.obs[t]]) , axis = 0 )
#print('Transfer: ',self.val_max[t])
self.intermediate.put(val_max)
if t == self.T-1:
self.result.put(val_max) # we only need the last val_max value for backtracking
def get_arg(self):
'''Dependent Consumer Process.'''
t = 1
while t < self.T:
val_max =self.intermediate.get()
#print('Receive: ',val_max)
self.arg_max[t] = np.argmax( val_max*self.A.T, axis = 1)
if t == self.T-1:
self.result.put(self.arg_max)
#print('Processed: ',self.arg_max[t])
t += 1
def get_path_parallel(self,x):
'''Multiprocessing producer-consumer implementation of Viterbi algorithm.'''
self.obs = x
self.T = len(obs)
self.arg_max = np.zeros((self.T, self.M)) # we don't tabulate val_max anymore
initial_val_max = self.pi*self.B[:,obs[0]]
producer_process = mp.Process(target=self.get_val,args=(initial_val_max,),daemon=True)
consumer_process = mp.Process(target=self.get_arg,daemon=True)
self.intermediate.put(initial_val_max) # initial production put into pipeline for consumption
consumer_process.start() # we can already consume initial_val_max
producer_process.start()
#val_process.join()
#arg_process.join()
#self.output.join()
return self.backtrack(self.result.get(),self.result.get()) # backtrack takes last row of val_max and entire arg_max
def backtrack(self,val_max_last_row,arg_max):
'''Backtracking the Dynamic Programming solution (actually a Trellis diagram)
produced by Multiprocessing Viterbi algorithm.'''
states = np.zeros(self.T, dtype=np.int32)
states[self.T-1] = np.argmax(val_max_last_row)
for t in range(self.T-2, -1, -1):
states[t] = arg_max[t+1, states[t+1]]
return states
if __name__ == '__main__':
obs = np.array([0,1,2]) # normal then cold and finally dizzy
T = 100000
obs = np.random.binomial(2,0.3,T)
pi = np.array([0.6,0.4])
A = np.array([[0.7,0.3],
[0.4,0.6]])
B = np.array([[0.5,0.4,0.1],
[0.1,0.3,0.6]])
t1 = time()
viterbi = Viterbi(A,B,pi)
path = viterbi.get_path(obs)
t2 = time()
print('Iterative Viterbi')
print('Path: ',path)
print('Run-time: ',round(t2-t1,6))
t1 = time()
viterbi = Viterbi(A,B,pi)
path = viterbi.get_path_parallel(obs)
t2 = time()
print('\nParallel Viterbi')
print('Path: ',path)
print('Run-time: ',round(t2-t1,6))
Welcome to SO. Consider taking a look at producer-consumer pattern that is heavily used in multiprocessing.
Beware that multiprocessing in Python reinstantiates your code for every process you create on Windows. So your Viterbi objects and therefore their Queue fields are not the same.
Observe this behaviour through:
import os
def get_arg(self):
'''Dependent Process'''
print("Dependent ", self)
print("Dependent ", self.output)
print("Dependent ", os.getpid())
def get_val(self):
'''Independent Process'''
print("Independent ", self)
print("Independent ", self.output)
print("Independent ", os.getpid())
if __name__ == "__main__":
print("Hello from main process", os.getpid())
obs = np.array([0,1,2]) # normal then cold and finally dizzy
pi = np.array([0.6,0.4])
A = np.array([[0.7,0.3],
[0.4,0.6]])
B = np.array([[0.5,0.4,0.1],
[0.1,0.3,0.6]])
viterbi = Viterbi(A,B,pi)
print("Main viterbi object", viterbi)
print("Main viterbi object queue", viterbi.output)
path = viterbi.get_path_parallel(obs)
There are three different Viterbi objects as there are three different processes. So, what you need in terms of parallelism is not processes. You should explore the threading library that Python offers.
I am new to Numba and am trying to speed up some calculations that have proved too unwieldy for numpy. The example I've given below compares a function containing a subset of my calculations using a vectorized/numpy and numba versions of the function the latter of which was also tested as pure python by commenting out the #autojit decorator.
I find that the numba and numpy versions give similar speed ups relative to the pure python, both of which are about a factor of 10 speed improvement.
The numpy version was actually slightly faster than my numba function but because of the 4D nature of this calculation I quickly run out of memory when the arrays in the numpy function are sized much larger than this toy example.
This speed up is nice but I have often seen speed ups of >100x on the web when moving from pure python to numba.
I would like to know if there is a general expected speed increase when moving to numba in nopython mode. I would also like to know if there are any components of my numba-ized function that would be limiting further speed increases.
import numpy as np
from timeit import default_timer as timer
from numba import autojit
import math
def vecRadCalcs(slope, skyz, solz, skya, sola):
nloc = len(slope)
ntime = len(solz)
[lenz, lena] = skyz.shape
asolz = np.tile(np.reshape(solz,[ntime,1,1,1]),[1,nloc,lenz,lena])
asola = np.tile(np.reshape(sola,[ntime,1,1,1]),[1,nloc,lenz,lena])
askyz = np.tile(np.reshape(skyz,[1,1,lenz,lena]),[ntime,nloc,1,1])
askya = np.tile(np.reshape(skya,[1,1,lenz,lena]),[ntime,nloc,1,1])
phi1 = np.cos(asolz)*np.cos(askyz)
phi2 = np.sin(asolz)*np.sin(askyz)*np.cos(askya- asola)
phi12 = phi1 + phi2
phi12[phi12> 1.0] = 1.0
phi = np.arccos(phi12)
return(phi)
#autojit
def RadCalcs(slope, skyz, solz, skya, sola, phi):
nloc = len(slope)
ntime = len(solz)
pop = 0.0
[lenz, lena] = skyz.shape
for iiT in range(ntime):
asolz = solz[iiT]
asola = sola[iiT]
for iL in range(nloc):
for iz in range(lenz):
for ia in range(lena):
askyz = skyz[iz,ia]
askya = skya[iz,ia]
phi1 = math.cos(asolz)*math.cos(askyz)
phi2 = math.sin(asolz)*math.sin(askyz)*math.cos(askya- asola)
phi12 = phi1 + phi2
if phi12 > 1.0:
phi12 = 1.0
phi[iz,ia] = math.acos(phi12)
pop = pop + 1
return(pop)
zenith_cells = 90
azim_cells = 360
nloc = 10 # nominallly ~ 700
ntim = 10 # nominallly ~ 200000
slope = np.random.rand(nloc) * 10.0
solz = np.random.rand(ntim) *np.pi/2.0
sola = np.random.rand(ntim) * 1.0*np.pi
base = np.ones([zenith_cells,azim_cells])
skya = np.deg2rad(np.cumsum(base,axis=1))
skyz = np.deg2rad(np.cumsum(base,axis=0)*90/zenith_cells)
phi = np.zeros(skyz.shape)
start = timer()
outcalc = RadCalcs(slope, skyz, solz, skya, sola, phi)
stop = timer()
outcalc2 = vecRadCalcs(slope, skyz, solz, skya, sola)
stopvec = timer()
print(outcalc)
print(stop-start)
print(stopvec-stop)
On my machine running numba 0.31.0, the Numba version is 2x faster than the vectorized solution. When timing numba functions, you need to run the function more than one time because the first time you're seeing the time of jitting the code + the run time. Subsequent runs will not include the overhead of jitting the functions time since Numba caches the jitted code in memory.
Also, please note that your functions are not calculating the same thing -- you want to be careful that you're comparing the same things using something like np.allclose on the results.