I am new to vectorizing code, and I am really psyched about how much faster everything is, but I can't get the high speed out of this particular piece of code...
Here is the housing class...
import numpy as np

class GaussianMixtureModel:
    def __init__(self, image_matrix, num_components, means=None):
        self.image_matrix = image_matrix
        self.num_components = num_components
        if(means is None):
            self.means = np.zeros(num_components)
        else:
            self.means = np.array(means)
        self.variances = np.zeros(num_components)
        self.mixing_coefficients = np.zeros(num_components)
And here is what I've got so far that works:
def likelihood(self):
    def g2(x):
        #N =~ 5
        #self.mixing_coefficients = 1D, N items
        #self.variances = 1D, N items
        #self.means = 1D, N items
        mc = self.mixing_coefficients[:,None,None]
        std = self.variances[:,None,None] ** 0.5
        var = self.variances[:,None,None]
        mean = self.means[:,None,None]
        return np.log((mc*(1.0/(std*np.sqrt(2.0*np.pi)))*(np.exp(-((x-mean)**2.0)/(2.0*var)))).sum())
    f = np.vectorize(g2)
    #self.image_matrix =~ 400x700 2D matrix
    log_likelihood = (f(self.image_matrix)).sum()
    return log_likelihood
And here is what I've got that gives a strange result (note that self.image_matrix is a 2D matrix of grayscale pixel values):
def likelihood(self):
    def g2():
        #N =~ 5
        #self.mixing_coefficients = 1D, N items
        #self.variances = 1D, N items
        #self.means = 1D, N items
        #self.image_matrix =~ 400x700 2D matrix
        mc = self.mixing_coefficients[:,None,None]
        std = self.variances[:,None,None] ** 0.5
        var = self.variances[:,None,None]
        mean = self.means[:,None,None]
        return np.log((mc*(1.0/(std*np.sqrt(2.0*np.pi)))*(np.exp(-((self.image_matrix[0,0]-mean)**2.0)/(2.0*var)))).sum())
    log_likelihood = (g2()).sum()
    return log_likelihood
However, the second version is really fast compared to the first, which takes almost 10 seconds (and speed is really important here, because this is part of a convergence algorithm).
Is there a way to replicate the results of the first version and the speed of the second? (And I'm really not familiar enough with vectorizing to know why the second version isn't working)
The second version is so fast because it only uses the first cell of self.image_matrix:
return np.log((mc*(1.0/(std*np.sqrt(2.0*np.pi)))*(np.exp(-((self.image_matrix[0,0]-mean)**2.0)/(2.0*var)))).sum())
# ^^^^^
This is also why it's completely wrong. It's not actually a vectorized computation over self.image_matrix at all. Don't try to use its runtime as a point of comparison; you can always make wrong code faster than right code.
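To make the shape behavior concrete, here is a small standalone sketch (with made-up sizes and values) showing what the broadcasting produces and why the axis argument to sum matters:

import numpy as np

# Made-up sizes: 5 mixture components and a 400x700 "image"
N, H, W = 5, 400, 700
mean = np.linspace(0, 1, N)[:, None, None]  # shape (5, 1, 1)
image = np.random.rand(H, W)                # shape (400, 700)

diff = image - mean                         # broadcasts to (5, 400, 700)
print(diff.shape)                           # (5, 400, 700)

# sum() collapses every axis into one scalar (one log of one giant sum),
# while sum(axis=0) sums only over the mixture components, leaving a
# (400, 700) array of per-pixel values to take the log of.
print(diff.sum().shape, diff.sum(axis=0).shape)  # () (400, 700)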
By eliminating the use of np.vectorize, you can make the first version much faster, but not as fast as the wrong code. The sum inside the log simply needs the appropriate axis specified:
def likelihood(self):
    def f(x):
        mc = self.mixing_coefficients[:,None,None]
        std = self.variances[:,None,None] ** 0.5
        var = self.variances[:,None,None]
        mean = self.means[:,None,None]
        return np.log((mc*(1.0/(std*np.sqrt(2.0*np.pi)))*(np.exp(-((x-mean)**2.0)/(2.0*var)))).sum(axis=0))
    log_likelihood = (f(self.image_matrix)).sum()
    return log_likelihood
This can be further simplified and optimized in a few ways. For example, the nested function can be eliminated, and multiplying by 1.0/whatever is slower than dividing by whatever, but eliminating np.vectorize is the big thing.
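For instance, one possible shape such a cleanup could take (a sketch assuming the same attribute names as the class above, not a tested drop-in replacement):

def likelihood(self):
    # Reshape the per-component parameters once, up front
    mc = self.mixing_coefficients[:, None, None]
    var = self.variances[:, None, None]
    mean = self.means[:, None, None]
    norm = np.sqrt(2.0 * np.pi * var)   # per-component normalization constant

    # (num_components, H, W) array of weighted component densities:
    # sum over components, take the per-pixel log, then sum over pixels
    densities = mc * np.exp(-(self.image_matrix - mean) ** 2 / (2.0 * var)) / norm
    return np.log(densities.sum(axis=0)).sum()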
I'm writing a program in which I'm trying to see how well a given redshift gets a set of lines detected in a spectrum to match up to an atomic line database. The closer the redshift gets the lines to overlap, the lower the "score" and the higher the chance that the redshift is correct.
I do this by looping over a range of possible redshifts, calculating the score for each. Within that outer loop, I looped over each line in the set of detected lines to calculate its sub_score, and summed that inner loop to get the overall score.
I tried to vectorize the inner loop with numpy, but surprisingly it actually slowed down the execution. In the example given, the nested for loop takes ~2.6 seconds on my laptop to execute, while the single for loop with numpy on the inside takes ~5.3 seconds.
Why would vectorizing the inner loop slow things down? Is there a better way to do this that I'm missing?
import numpy as np
import time

def find_nearest_line(lines, energies):
    # Return the indices (from the lines vector) that are closest to the energies vector supplied
    # Vectorized with help from https://stackoverflow.com/a/53111759
    energies = np.expand_dims(energies, axis=-1)
    idx = np.abs(lines / energies - 1).argmin(axis=-1)
    return idx

def calculate_distance_to_line(lines, energies):
    # Returns the distance between an array of lines and an array of energies
    z = (lines / energies) - 1
    return z
rng = np.random.default_rng(2021)
det_lines = rng.random(1000)
atom_lines = rng.random(10000)
redshifts = np.linspace(-0.1, 0.1, 100)
# loop version
start = time.time()
scores = []
for redshift in redshifts:
    atom_lines_shifted = atom_lines * (1 + redshift)
    score = 0
    for det_line in det_lines:
        closest_atom_line = find_nearest_line(atom_lines_shifted, det_line)
        score += abs(calculate_distance_to_line(atom_lines_shifted[closest_atom_line], det_line))
    scores.append(score)
print(time.time() - start)
print(scores)

# (semi-)vectorized version
start = time.time()
scores = []
for redshift in redshifts:
    atom_lines_shifted = atom_lines * (1 + redshift)
    closest_atom_lines = find_nearest_line(atom_lines_shifted, det_lines)
    score = np.sum(np.abs(calculate_distance_to_line(atom_lines_shifted[closest_atom_lines], det_lines)))
    scores.append(score)
print(time.time() - start)
print(scores)
Numpy code generally creates many temporary arrays; this is the case in your function find_nearest_line, for example. Working on all the items of det_lines simultaneously results in the creation of several relatively big arrays (1000 * 10_000 * 8 = 76 MiB per array). The problem is that big arrays often do not fit in CPU caches, so they have to live in RAM, which has much lower throughput and much higher latency. Moreover, allocating and freeing bigger arrays takes more time and often causes more page faults (due to how most standard allocators are implemented). Using big arrays is sometimes faster anyway, because the overhead of the CPython interpreter is huge, but both strategies are inefficient in practice.
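To make that concrete, here is a quick sketch of the size of one such temporary for the array shapes in the question:

import numpy as np

det_lines = np.random.rand(1000)
atom_lines = np.random.rand(10000)

# One temporary from find_nearest_line: lines / energies after broadcasting
ratio = atom_lines / det_lines[:, None]   # shape (1000, 10000), float64
print(ratio.nbytes / 2**20)               # ~76 MiB, far larger than CPU caches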
The deeper issue is that the algorithm itself is not efficient: you can sort the array and use a binary search to find the closest value much more efficiently. np.searchsorted does most of the work, but it only returns the index of the closest value greater than (or equal to) the target value, so some extra work is needed to get the closest value overall (which may be greater or less than the target). Note that this algorithm does not generate huge arrays, thanks to the binary search.
scores = []
n = atom_lines.size
m = det_lines.size
line_idx = np.arange(m)
for redshift in redshifts:
    atom_lines_shifted = atom_lines * (1 + redshift)
    sorted_atom_lines_shifted = np.sort(atom_lines_shifted)
    close_idx = np.searchsorted(sorted_atom_lines_shifted, det_lines)
    lower_bound = sorted_atom_lines_shifted[np.maximum(close_idx - 1, 0)]
    upper_bound = sorted_atom_lines_shifted[np.minimum(close_idx, n - 1)]
    bounds = np.hstack((lower_bound[:, None], upper_bound[:, None]))
    closest_bound_idx = find_nearest_line(bounds, det_lines)
    close_values = bounds[line_idx, closest_bound_idx]
    score = np.sum(np.abs(calculate_distance_to_line(close_values, det_lines)))
    scores.append(score)
Since atom_lines is not modified and multiplying by the positive factor (1 + redshift) preserves the order, the algorithm can be further optimized by sorting atom_lines once, outside the loop:
scores = []
n = atom_lines.size
m = det_lines.size
line_idx = np.arange(m)
sorted_atom_lines = np.sort(atom_lines)
for redshift in redshifts:
    sorted_atom_lines_shifted = sorted_atom_lines * (1 + redshift)
    close_idx = np.searchsorted(sorted_atom_lines_shifted, det_lines)
    lower_bound = sorted_atom_lines_shifted[np.maximum(close_idx - 1, 0)]
    upper_bound = sorted_atom_lines_shifted[np.minimum(close_idx, n - 1)]
    bounds = np.hstack((lower_bound[:, None], upper_bound[:, None]))
    closest_bound_idx = find_nearest_line(bounds, det_lines)
    close_values = bounds[line_idx, closest_bound_idx]
    score = np.sum(np.abs(calculate_distance_to_line(close_values, det_lines)))
    scores.append(score)
This last implementation is about 300 times faster on my machine.
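Before relying on the speedup, it's worth checking the optimized scores against the nested-loop output. A small sketch, using hypothetical names (each version above stores its results in a list called scores, so rename or copy them first):

import numpy as np

# Hypothetical names: scores_loop is the output of the original
# nested-loop version, scores_fast the output of the sorted version.
assert np.allclose(scores_loop, scores_fast)
print("outputs match")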
How can Python be used for numerical finite difference calculations without using numpy?
For example, I want to evaluate first- and second-order derivatives numerically at multiple points in a given interval, with a step size of 0.05.
Why don't you want to use Numpy? It's a good library and very fast for doing numerical computations because it's written in C (which is generally faster for numerical stuff than pure Python).
If you're curious how these methods work and what they look like in code, here's some sample code:
def linspace(a, b, step):
    if a > b:
        # going backwards? swap the endpoints and negate the step
        if step < 0:
            return linspace(b, a, -1*step)[::-1]
        # step isn't negative, so no points
        return []
    pt = a
    res = []
    while pt <= b:
        res.append(pt)
        pt += step
    return res
def forward(data, step):
    if not data:
        return []
    res = []
    i = 0
    while i+1 < len(data):
        delta = (data[i+1] - data[i])/step
        res.append(delta)
        i += 1
    return res
# example usage
size = 0.1
ts = linspace(0, 1, size)
y = [t*t for t in ts]
dydt = forward(y, size)
d2ydt2 = forward(dydt, size)
Note: this still uses normal floating point numbers, so there are still odd rounding errors, because some numbers don't have an exact binary representation.
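For example, adding 0.1 ten times doesn't land exactly on 1.0, which is why the while loop in linspace can pick up or drop an endpoint:

# 0.1 has no exact binary representation, so the error accumulates
pt = 0.0
for _ in range(10):
    pt += 0.1
print(pt)          # 0.9999999999999999
print(pt == 1.0)   # False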
Another library to check out is mpmath which has a lot of cool math functions like integration and special functions AND it allows you to specify how much precision you want. Of course using 100 digits of precision is going to be a lot slower than normal floats, but it is still a very cool library!
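For instance, here's a tiny sketch of mpmath in action, using its built-in numerical derivative mpmath.diff and 50 digits of working precision:

from mpmath import mp, diff, sin

mp.dps = 50                # work with 50 significant decimal digits
print(diff(sin, 1))        # numerical derivative of sin at x=1, i.e. cos(1)
print(diff(sin, 1, 2))     # second derivative at x=1, i.e. -sin(1)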
I have a pretty simple function which uses Numpy arrays and for loops, but adding the Numba @jit decorator gives absolutely no speed up:
import numpy as np
from numpy import zeros, arange
from numpy.random import randn, randint, rand
from numba import jit

# @jit(float64[:](int32,float64,float64,float64,int32))
@jit
def Ising_model_1D(N=200,J=1,T=1e-2,H=0,n_iter=1e6):
    beta = 1/T
    s = randn(N,1) > 10
    s[N-1] = s[0]
    mag = zeros((n_iter,1))
    aux_idx = randint(low=0,high=N,size=(n_iter,1))
    for i1 in arange(n_iter):
        rnd_idx = aux_idx[i1]
        s_1 = s[rnd_idx]*2 - 1
        s_2 = s[(rnd_idx+1)%(N)]*2 - 1
        s_3 = s[(rnd_idx-1)%(N)]*2 - 1
        delta_E = 2.0*J*(s_2+s_3)*s_1 + 2.0*H*s_1
        if(delta_E < 0):
            s[rnd_idx] = np.logical_not(s[rnd_idx])
        elif(np.exp(-1*beta*delta_E) >= rand()):
            s[rnd_idx] = np.logical_not(s[rnd_idx])
        s[N-1] = s[0]
        mag[i1] = (s*2-1).sum()*1.0/N
    return mag
MATLAB on the other hand takes less than 0.5 seconds to run this!
Why is Numba missing something so basic?
Here is a reworking of your code that runs in about 0.4 seconds on my machine:
def ising_model_1d(N=200,J=1,T=1e-2,H=0,n_iter=1e6):
    n_iter = int(n_iter)
    beta = 1/T
    s = randn(N) > 10
    s[N-1] = s[0]
    mag = zeros(n_iter)
    aux_idx = randint(low=0,high=N,size=n_iter)
    pre_rand = rand(n_iter)
    _ising_jitted(n_iter, aux_idx, s, J, N, H, beta, pre_rand, mag)
    return mag
@jit(nopython=True)
def _ising_jitted(n_iter, aux_idx, s, J, N, H, beta, pre_rand, mag):
    for i1 in range(n_iter):
        rnd_idx = aux_idx[i1]
        s_1 = s[rnd_idx]*2 - 1
        s_2 = s[(rnd_idx+1)%(N)]*2 - 1
        s_3 = s[(rnd_idx-1)%(N)]*2 - 1
        delta_E = 2.0*J*(s_2+s_3)*s_1 + 2.0*H*s_1
        if delta_E < 0:
            s[rnd_idx] = not s[rnd_idx]
        elif np.exp(-1*beta*delta_E) >= pre_rand[i1]:
            s[rnd_idx] = not s[rnd_idx]
        s[N-1] = s[0]
        mag[i1] = (s*2-1).sum()*1.0/N
Please make sure the results are as expected! I changed much of what you had, and can't guarantee that the calculations are correct!
Working with numba requires a little care. Python functions, as well as most numpy functions, cannot be optimized by the compiler. One thing I find helpful is to use the nopython option to @jit. This means that the compiler will complain whenever you give it some code that it can't really optimize. You can then look at the error message and find the line that will likely slow down your code.
The trick, I find, is to write a "gateway" function in Python that does as much of the work as possible using numpy and its vectorized functions. It should create the empty arrays that you'll need to store the results in. It should package all of the data you'll need during the computation. Then it should pass all of these into your jitted function in one big, long argument list.
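In skeleton form, the pattern looks roughly like this (a sketch with placeholder names and a dummy computation, not the Ising code itself):

import numpy as np
from numba import jit

def gateway(data, n_iter):
    n_iter = int(n_iter)
    out = np.empty(n_iter)            # pre-allocate the result array with numpy
    randoms = np.random.rand(n_iter)  # pre-generate anything numba may not support
    _kernel(data, randoms, n_iter, out)
    return out

@jit(nopython=True)
def _kernel(data, randoms, n_iter, out):
    # plain loops and scalar math: this is what nopython mode compiles well
    for i in range(n_iter):
        out[i] = data[i % data.size] * randoms[i]  # dummy placeholder computation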
Case in point: notice how I handle random number generation in the jitted code. In your original code, you called rand():
elif(np.exp(-1*beta*delta_E) >= rand()):
But rand() can't be optimized by numba (in older versions of numba, at least. In newer versions it can, provided that rand is called without arguments). The observation is that you need a single random number for every one of the n_iter iterations. So we simply create a random array using numpy in our wrapper function, then feed this random array to the jitted function. Getting a random number is then as simple as indexing into this array.
Lastly, for a list of the numpy functions that can be optimized by the latest version of the compiler, see here. In my reworking of your code I was aggressive in removing calls to numpy functions so that the code would work over more versions of numba.
I'm currently trying to change a row in a matrix which I created using the scipy.sparse.diags function. However, it returns the following error saying that I cannot assign to this object:
TypeError: 'dia_matrix' object does not support item assignment
Is there any way around this without having to change the original vectors used to form the tridiagonal matrix? The following is my code:
import numpy as np
from scipy.sparse import diags

def Mass_Matrix(x0):
    """Finds the Mass matrix for any non uniform mesh x0"""
    x0 = np.array(x0)
    N = len(x0) - 1
    h = x0[1:] - x0[:-1]
    a = np.zeros(N+1)
    a[0] = h[0]/3
    for j in range(1,N):
        a[j] = h[j-1]/3 + h[j]/3
    a[N] = h[N-1]/3
    b = h/6
    c = h/6
    data = [a.tolist(), b.tolist(), c.tolist()]
    Positions = [0,1,-1]
    Mass_Matrix = diags(data, Positions, (N+1,N+1))
    return Mass_Matrix

def Initial_U(x0): #BC here
    x0 = np.array(x0)
    h = x0[1:] - x0[:-1]
    N = len(x0) - 1
    Mass = Mass_Matrix(x0)
    Mass[0] = 0 #ITEM ASSIGNMENT ERROR
    print(Mass.toarray())
For a sparse matrix defined with your function:
x0=np.arange(10)
mm=Mass_Matrix(x0)
The csr format is the one normally used for calculations such as matrix multiplication and linear-system solves. It does support item assignment, but gives an efficiency warning:
In [29]: mmr=mm.tocsr()
In [30]: mmr[0]=0
/usr/lib/python3/dist-packages/scipy/sparse/compressed.py:690: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
The lil format works fine:
In [31]: mml=mm.tolil()
In [32]: mml[0]=0
Many of the sparse functions and methods convert from one format to another to take advantage of their respective strengths, but the developers haven't implemented all possible combinations. You need to read up on the pros and cons of the various formats, and also note which methods each one supports.
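Putting that advice together with the code above, a common pattern is: build the matrix in whatever format is convenient, convert to lil for the edits, then convert to csr for subsequent computation. A sketch, reusing Mass_Matrix from the question:

import numpy as np

x0 = np.arange(10)
mass = Mass_Matrix(x0)   # dia_matrix, as returned by the question's function

mass_l = mass.tolil()    # lil supports cheap row/item assignment
mass_l[0] = 0            # e.g. zero out the first row for a boundary condition
mass_c = mass_l.tocsr()  # csr is efficient for products and solves
print(mass_c.toarray())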
I am trying to fill an array with values calculated from functions defined earlier in my code. I started with code that has a structure similar to the following:
from numpy import cos, sin, arange, zeros

a = arange(1000)
b = arange(1000)

def defcos(x):
    return cos(x)

def defsin(x):
    return sin(x)

a_len = len(a)
b_len = len(b)
result = zeros((a_len,b_len))
for i in range(b_len):
    for j in range(a_len):
        a_res = defcos(a[j])
        b_res = defsin(b[i])
        result[i,j] = a_res * b_res
I tried using array representations of the functions, which led to the following change in the loop:

a_res = defsin(a)
b_res = defcos(b)
for i in range(b_len):
    for j in range(a_len):
        result[i,j] = a_res[i] * b_res[j]
This is already significantly faster than the first version. But is there a way to avoid the loop entirely? I have encountered these loops a couple of times in the past but never bothered with them, as speed was not critical. But this time it is the core component of something which itself is looped through many more times. :)
Any help would be appreciated, thanks in advance!
Like so:
from numpy import newaxis
a_res = sin(a)
b_res = cos(b)
result = a_res[:, newaxis] * b_res
To understand how this works, have a look at the rules for array broadcasting. And please don't define useless functions like defsin, just use sin itself! Another minor detail, you get i from range(b_len), but you use it to index a_res! This is a bug if a_len != b_len.
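For two 1-D inputs, this broadcasted product is just an outer product, so you can sanity-check the result against np.outer:

import numpy as np
from numpy import sin, cos, newaxis

a = np.arange(5.0)
b = np.arange(7.0)
res_broadcast = sin(a)[:, newaxis] * cos(b)    # (5, 1) * (7,) -> (5, 7)
res_outer = np.outer(sin(a), cos(b))           # same (5, 7) result
print(np.allclose(res_broadcast, res_outer))   # True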