numpy vectorizing a function slows it down? - python

I'm writing a program in which I'm trying to see how well a given redshift gets a set of lines detected in a spectrum to match up with an atomic line database. The closer the redshift brings the lines to overlap, the lower the "score" and the higher the chance that the redshift is correct.
I do this by looping over a range of possible redshifts, calculating the score for each. Within that outer loop, I was looping over each line in the set of detected lines to calculate its sub_score, and summing that inner loop to get the overall score.
I tried to vectorize the inner loop with numpy, but surprisingly it actually slowed down the execution. In the example given, the nested for loop takes ~2.6 seconds on my laptop to execute, while the single for loop with numpy on the inside takes ~5.3 seconds.
Why would vectorizing the inner loop slow things down? Is there a better way to do this that I'm missing?
import numpy as np
import time

def find_nearest_line(lines, energies):
    # Return the indices (from the lines vector) that are closest to the energies vector supplied
    # Vectorized with help from https://stackoverflow.com/a/53111759
    energies = np.expand_dims(energies, axis=-1)
    idx = np.abs(lines / energies - 1).argmin(axis=-1)
    return idx

def calculate_distance_to_line(lines, energies):
    # Returns the distance between an array of lines and an array of energies
    z = (lines / energies) - 1
    return z

rng = np.random.default_rng(2021)
det_lines = rng.random(1000)
atom_lines = rng.random(10000)
redshifts = np.linspace(-0.1, 0.1, 100)

# loop version
start = time.time()
scores = []
for redshift in redshifts:
    atom_lines_shifted = atom_lines * (1 + redshift)
    score = 0
    for det_line in det_lines:
        closest_atom_line = find_nearest_line(atom_lines_shifted, det_line)
        score += abs(calculate_distance_to_line(atom_lines_shifted[closest_atom_line], det_line))
    scores.append(score)
print(time.time() - start)
print(scores)

# (semi-)vectorized version
start = time.time()
scores = []
for redshift in redshifts:
    atom_lines_shifted = atom_lines * (1 + redshift)
    closest_atom_lines = find_nearest_line(atom_lines_shifted, det_lines)
    score = np.sum(np.abs(calculate_distance_to_line(atom_lines_shifted[closest_atom_lines], det_lines)))
    scores.append(score)
print(time.time() - start)
print(scores)

Numpy code generally creates many temporary arrays. This is the case for your function find_nearest_line, for example. Working on all the items of det_lines simultaneously results in the creation of several relatively big arrays (1000 * 10_000 * 8 bytes = 76 MiB per array). The problem is that big arrays often do not fit in CPU caches; they then have to be stored in RAM, which has much lower throughput and much higher latency. Moreover, allocating and freeing bigger arrays takes more time and often causes more page faults (due to the way most default standard allocators are implemented). Using big arrays is sometimes still faster because the overhead of the CPython interpreter is huge, but both strategies are inefficient in practice.
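To make the size of those temporaries concrete, here is a small sketch (not part of the original answer) that builds, with the question's array sizes, the broadcasted intermediate that find_nearest_line creates when given all 1000 detected lines at once:
import numpy as np
rng = np.random.default_rng(2021)
det_lines = rng.random(1000)
atom_lines_shifted = rng.random(10000)
# the (1000, 10000) float64 intermediate built inside find_nearest_line
tmp = np.abs(atom_lines_shifted / det_lines[:, None] - 1)
print(tmp.shape, tmp.nbytes / 2**20)  # (1000, 10000), ~76.3 MiB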
The bigger issue is that the algorithm itself is not efficient. Indeed, you can sort the array and use a binary search to find the closest value much more efficiently. np.searchsorted does most of the work, but it only returns the index of the closest value greater than (or equal to) the target value, so some additional work is needed to get the closest value, which may be either greater or smaller than the target. Note that this algorithm does not generate huge arrays, thanks to the binary search.
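As a small illustration of np.searchsorted's behaviour (a sketch, not part of the original answer):
import numpy as np
a = np.array([0.1, 0.4, 0.7])
# searchsorted returns the insertion index, i.e. the index of the first value >= 0.5
print(np.searchsorted(a, 0.5))  # -> 2
# the closest value may still be a[1] (0.4) rather than a[2] (0.7),
# hence the lower/upper bound handling in the code below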
scores = []
n = atom_lines.size
m = det_lines.size
line_idx = np.arange(m)
for redshift in redshifts:
    atom_lines_shifted = atom_lines * (1 + redshift)
    sorted_atom_lines_shifted = np.sort(atom_lines_shifted)
    close_idx = np.searchsorted(sorted_atom_lines_shifted, det_lines)
    lower_bound = sorted_atom_lines_shifted[np.maximum(close_idx - 1, 0)]
    upper_bound = sorted_atom_lines_shifted[np.minimum(close_idx, n - 1)]
    bounds = np.hstack((lower_bound[:, None], upper_bound[:, None]))
    closest_bound_idx = find_nearest_line(bounds, det_lines)
    close_values = bounds[line_idx, closest_bound_idx]
    score = np.sum(np.abs(calculate_distance_to_line(close_values, det_lines)))
    scores.append(score)
Since atom_lines is not modified and the multiplication preserves the order, the algorithm can be further optimized by sorting atom_lines once, outside the loop:
scores = []
n = atom_lines.size
m = det_lines.size
line_idx = np.arange(m)
sorted_atom_lines = np.sort(atom_lines)
for redshift in redshifts:
    sorted_atom_lines_shifted = sorted_atom_lines * (1 + redshift)
    close_idx = np.searchsorted(sorted_atom_lines_shifted, det_lines)
    lower_bound = sorted_atom_lines_shifted[np.maximum(close_idx - 1, 0)]
    upper_bound = sorted_atom_lines_shifted[np.minimum(close_idx, n - 1)]
    bounds = np.hstack((lower_bound[:, None], upper_bound[:, None]))
    closest_bound_idx = find_nearest_line(bounds, det_lines)
    close_values = bounds[line_idx, closest_bound_idx]
    score = np.sum(np.abs(calculate_distance_to_line(close_values, det_lines)))
    scores.append(score)
This last implementation is about 300 times faster on my machine.

Related

Can I rewrite this code to make it work faster?

Is it actually possible to make this run faster? I need to get half of all possible grids (all elements can be either -1 or 1) of size 4*Lx (for counting energies in the Ising model).
from itertools import product
import numpy as np

def get_grid(Lx):
    a = list()
    count = 0
    t = list(product([1,-1], repeat=Lx))
    for i in range(len(t)):
        for j in range(len(t)):
            for k in range(len(t)):
                for l in range(len(t)):
                    count += 1
                    a.append([t[i], t[j], t[k], t[l]])
                    if count == 2**(Lx*4)/2:
                        return np.array(a)
I tried using Numba, but that didn't work out.
First of all, Numba does not like lists. If you want efficient code, you need to operate on arrays (except when you really do not know the size at runtime and estimating it is hard or slow). Here the size of the output array is already known, so it is better to preallocate it and then fill it. Numba also does not like high-level features like generators much; prefer basic loops, which are fast as long as they are executed in a JIT-compiled function. The Cartesian product can be replaced by efficiently computing each row from the bits of an increasing integer. The whole computation is mainly memory-bound, so it is better to use a small integer datatype like int8, which takes 4 times less space in RAM (and is thus about 4 times faster to fill). Here is the resulting code:
import numpy as np
import numba as nb

@nb.njit('int8[:,:,:](int64,)')
def get_grid_numba(Lx):
    t = np.empty((2**Lx, Lx), dtype=np.int8)
    for i in range(2**Lx):
        for j in range(Lx):
            t[i, Lx-1-j] = 1 - 2 * ((i >> j) & 1)
    outSize = 2**(Lx*4 - 1)
    out = np.empty((outSize, 4, Lx), dtype=np.int8)
    cur = 0
    for i in range(len(t)):
        for j in range(len(t)):
            for k in range(len(t)):
                for l in range(len(t)):
                    out[cur, 0, :] = t[i, :]
                    out[cur, 1, :] = t[j, :]
                    out[cur, 2, :] = t[k, :]
                    out[cur, 3, :] = t[l, :]
                    cur += 1
                    if cur == outSize:
                        return out
    return out
For Lx=4, the initial code takes 66.8 ms while this Numba code takes 0.36 ms on my i5-9600KF processor. It is thus 185 times faster.
Note that the size of the output array grows exponentially very quickly. For Lx=7, the output shape is (134217728, 4, 7) and it takes 3.5 GiB of RAM. The Numba code takes 2.47 s to generate it, that is 1.4 GiB/s. If this is not enough for you, you can write specific implementations for Lx=1 to Lx=8, use loops for the out slice assignment, and even use multiple threads for Lx>=5 (a rough sketch of the multi-threaded idea is shown below). For small arrays, you can precompute them once. This should be an order of magnitude faster.
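A rough sketch (not from the original answer) of how the outer loop could be multi-threaded with Numba's prange. It relies on the fact that the output is cut at outSize = (2**Lx)**4 // 2, which corresponds exactly to the first half of the i values, so each i can be filled independently:
import numpy as np
import numba as nb

@nb.njit('int8[:,:,:](int64,)', parallel=True)
def get_grid_numba_parallel(Lx):
    n = 2**Lx
    t = np.empty((n, Lx), dtype=np.int8)
    for i in range(n):
        for j in range(Lx):
            t[i, Lx-1-j] = 1 - 2 * ((i >> j) & 1)
    outSize = 2**(Lx*4 - 1)
    out = np.empty((outSize, 4, Lx), dtype=np.int8)
    # Only i in [0, n//2) is needed since outSize == n**4 // 2 and each i
    # contributes a contiguous block of n**3 rows, so iterations are independent.
    for i in nb.prange(n // 2):
        for j in range(n):
            for k in range(n):
                for l in range(n):
                    cur = ((i*n + j)*n + k)*n + l
                    out[cur, 0, :] = t[i, :]
                    out[cur, 1, :] = t[j, :]
                    out[cur, 2, :] = t[k, :]
                    out[cur, 3, :] = t[l, :]
    return out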

Improving sum() calculations on evenly spaced list

I have code where I need 10 million evenly spaced numbers between 0 and 1, and a logic function which picks a random index and returns the sum of the numbers from that index to the end of the list.
Thus the code looks like this:
import random
import numpy as np

ten_million = np.linspace(0.0, 1.0, 10000000)

def deep_dive_logic():
    # this pick is derived from good logic, however, let's just use random here for demonstration
    pick = random.randint(0, 10000000)
    return sum(ten_million[pick:])

for _ in range(2500):
    r = deep_dive_logic()
    print(r)
# more logic ahead...
The problem here is that looping sum() over slices of this size takes approx. 1.3 s per result.
Is there any efficient way to reduce the 1.3 s wait per call? I also tried creating a kind of cache dictionary, but deep_dive_logic() runs in a multi-process environment, so this dictionary would need to be cached; neither redis nor a json.dump is an option, because the dictionary is around 236 MB in size and adds overhead in inter-process communication if not cached.
sums_dict = {0: sum(ten_million)}
even_difference = (ten_million[1] - ten_million[0])
for i in range(len(ten_million) - 1):
    sums_dict[i+1] = sums_dict[i] - (even_difference * (i+1))
I need help either with caching the 10-million-entry dictionary, or with an alternative formula that returns the result without using sum(), or with any other out-of-the-box solution.
https://repl.it/repls/HoneydewGoldenShockwave
np.sum(ten_million) does it in about 0.005 seconds, whereas sum(ten_million) takes about 1.5 seconds on my machine.
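For example, the only change needed in the question's function is swapping sum for np.sum (a sketch, reusing ten_million and the imports from the question):
def deep_dive_logic():
    pick = random.randint(0, 10000000)
    return np.sum(ten_million[pick:])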
As for a solution without using any out-of-the-box functions, as suggested in the comments to your question by MrT, you can use the property of arithmetic progressions, which says that the sum of a progression is equal to n(a1+an) / 2, where n is the number of elements (10000000), a1 is your first element (0), and an is your last element (1). In your example, this is 10000000(0+1) / 2 = 5000000.
So, for your deep_dive_logic function, just return that:
def deep_dive_logic():
    pick = random.randint(0, 10000000)
    return (len(ten_million)-pick)*(ten_million[pick]+ten_million[-1]) / 2
This also does the job extremely fast, in fact much faster than np.sum: on average, the arithmetic progression calculation took 1.223e-06 seconds, whereas np.sum took 0.00577 seconds on my machine. That makes sense, seeing how it's just one addition, one multiplication, and one division...
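As a quick sanity check (a sketch, not part of the original answer, reusing ten_million and numpy from the question's setup), the closed form agrees with np.sum for a given pick:
pick = 1234567
closed_form = (len(ten_million) - pick) * (ten_million[pick] + ten_million[-1]) / 2
print(np.isclose(closed_form, np.sum(ten_million[pick:])))  # expected: True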
Do it analytically:
def cumm_sum(start, finish, steps, k):
    step = (finish - start) / steps
    pop = (finish - k) / step
    return (pop + 1) * 0.5 * (k + finish)
and the call would be like:
pick = ten_million[random.randint(0, 10000000)]
result = cumm_sum(0.0, 1.0, 10000000, pick)
Use math to reduce the problem complexity: the sum of an arithmetic progression of consecutive integers from n to m is given by
(m+n)*(m-n+1)*0.5
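A quick check of that formula (a sketch, not part of the original answer):
n, m = 3, 7
assert sum(range(n, m + 1)) == (m + n) * (m - n + 1) * 0.5  # 3+4+5+6+7 = 25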
Use np.vectorize to speed up the array operation:
ten_m = 10000000

def sum10m_py(n):
    return (1+n)*(ten_m-n*ten_m+1)*0.5

sum_np = np.vectorize(sum10m_py)
Pick the elements you want, then apply the vectorized function to them.
mask = np.random.randint(0,ten_m,2500)
sums = sum_np(ten_million[mask])

vectorizing a function of an array of arrays

I am new to vectorizing code, and I am really psyched about how much faster everything is, but I can't get the high speed out of this particular piece of code...
Here is the housing class...
class GaussianMixtureModel:
    def __init__(self, image_matrix, num_components, means=None):
        self.image_matrix = image_matrix
        self.num_components = num_components
        if(means is None):
            self.means = np.zeros(num_components)
        else:
            self.means = np.array(means)
        self.variances = np.zeros(num_components)
        self.mixing_coefficients = np.zeros(num_components)
And here is what I've got so far that works:
def likelihood(self):
    def g2(x):
        #N =~ 5
        #self.mixing_coefficients = 1D, N items
        #self.variances = 1D, N items
        #self.means = 1D, N items
        mc = self.mixing_coefficients[:,None,None]
        std = self.variances[:,None,None] ** 0.5
        var = self.variances[:,None,None]
        mean = self.means[:,None,None]
        return np.log((mc*(1.0/(std*np.sqrt(2.0*np.pi)))*(np.exp(-((x-mean)**2.0)/(2.0*var)))).sum())
    f = np.vectorize(g2)
    #self.image_matrix =~ 400*700 2D matrix
    log_likelihood = (f(self.image_matrix)).sum()
    return log_likelihood
And here is what I've got that gives a strange result (note that self.image_matrix is an nxn matrix of a grayscale image):
def likelihood(self):
    def g2():
        #N =~ 5
        #self.mixing_coefficients = 1D, N items
        #self.variances = 1D, N items
        #self.means = 1D, N items
        #self.image_matrix = 400x700 2D matrix
        mc = self.mixing_coefficients[:,None,None]
        std = self.variances[:,None,None] ** 0.5
        var = self.variances[:,None,None]
        mean = self.means[:,None,None]
        return np.log((mc*(1.0/(std*np.sqrt(2.0*np.pi)))*(np.exp(-((self.image_matrix[0,0]-mean)**2.0)/(2.0*var)))).sum())
    log_likelihood = (g2()).sum()
    return log_likelihood
However, the second version is really fast compared to the first (which takes almost 10 seconds... and speed is really important here, because this is part of a convergence algorithm).
Is there a way to replicate the results of the first version and the speed of the second? (And I'm really not familiar enough with vectorizing to know why the second version isn't working)
The second version is so fast because it only uses the first cell of self.image_matrix:
return np.log((mc*(1.0/(std*np.sqrt(2.0*np.pi)))*(np.exp(-((self.image_matrix[0,0]-mean)**2.0)/(2.0*var)))).sum())
#                                                                            ^^^^^
This is also why it's completely wrong. It's not actually a vectorized computation over self.image_matrix at all. Don't try to use its runtime as a point of comparison; you can always make wrong code faster than right code.
By eliminating the use of np.vectorize, you can make the first version much faster, but not as fast as the wrong code. The sum inside the log simply needs the appropriate axis specified:
def likelihood(self):
    def f(x):
        mc = self.mixing_coefficients[:,None,None]
        std = self.variances[:,None,None] ** 0.5
        var = self.variances[:,None,None]
        mean = self.means[:,None,None]
        return np.log((mc*(1.0/(std*np.sqrt(2.0*np.pi)))*(np.exp(-((x-mean)**2.0)/(2.0*var)))).sum(axis=0))
    log_likelihood = (f(self.image_matrix)).sum()
    return log_likelihood
This can be further simplified and optimized in a few ways. For example, the nested function can be eliminated, and multiplying by 1.0/whatever is slower than dividing by whatever, but eliminating np.vectorize is the big thing.
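A possible shape of that simplification, as a sketch (this is not the answer's own code; it assumes the same class attributes as above):
def likelihood(self):
    mc = self.mixing_coefficients[:, None, None]
    var = self.variances[:, None, None]
    std = np.sqrt(var)
    mean = self.means[:, None, None]
    # per-pixel mixture density, summed over the N components (axis 0)
    pdf = (mc * np.exp(-(self.image_matrix - mean)**2 / (2.0 * var)) / (std * np.sqrt(2.0 * np.pi))).sum(axis=0)
    return np.log(pdf).sum()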

Faster looping with itertools

I have a function
def getSamples():
    p = lambda x : mlab.normpdf(x,3,2) + mlab.normpdf(x,-5,1)
    q = lambda x : mlab.normpdf(x,5,14)
    k = 30
    goodSamples = []
    rightCount = 0
    totalCount = 0
    while(rightCount < 100000):
        z0 = np.random.normal(5, 14)
        u0 = np.random.uniform(0,k*q(z0))
        if(p(z0) > u0):
            goodSamples.append(z0)
            rightCount += 1
        totalCount += 1
    return np.array(goodSamples)
My implementation to generate 100000 samples is taking too long. How can I make it faster with itertools or something similar?
I would say that the secret to making this code faster does not lie in changing the loop syntax. Here are a few points:
np.random.normal has an additional parameter size that lets you get many values at once. I would suggest using an array of say 1E09 elements and then checking your condition on that for how many are good. You can then estimate how likely that is.
To create your uniform samples, why not use sympy for symbolic evaluation of the pdf? (I don't know if this is faster but it could be since you already know the mean and variance.)
Again, for p could you use a symbolic function?
In general, performance problems are caused by doing things the "wrong way". Numpy can be very fast when used as it is designed to be used, that is, by exploiting its vector processing, where vectorized operations are handed off to compiled code. Two bad practices that come from other programming languages/approaches are:
Loops: Whenever you think you need a loop stop and think. Most of the time you do not and in fact do not even want one. It is much faster both to write and run code without loops.
Memory allocation: Whenever you know the size of an object, preallocate space for it. Growing memory, particularly in Python lists, is very slow compared to the alternatives.
In this case it is easy to get (approximately) two orders of magnitude speedup; the tradeoff is more memory usage.
Below is some representative code; it is not meant to be used blindly. I have not even verified that it produces the correct results. It is more or less a direct translation of your routine. It appears you are drawing random numbers from a probability distribution using the rejection method. There may be more efficient algorithms to do this for your probability distribution.
def getSamples2():
    p = lambda x : mlab.normpdf(x,3,2) + mlab.normpdf(x,-5,1)
    q = lambda x : mlab.normpdf(x,5,14)
    k = 30
    N = 100000                  # Total number of samples we want
    Ngood = 0                   # Current number of good samples
    goodSamples = np.zeros(N)   # Storage for the good samples
    while Ngood < N:            # Unfortunately a loop, ....
        z0 = np.random.normal(5, 14, size=N)
        u0 = np.random.uniform(size=N)*k*q(z0)
        ind, = np.where(p(z0) > u0)
        n = min(len(ind), N-Ngood)
        goodSamples[Ngood:Ngood+n] = z0[ind[:n]]
        Ngood += n
    return goodSamples
This generates random numbers in chunks and saves the good ones. I have not tried to optimize the chunk size (here I just use N, the total number we want, in principle this could/should be different and could even be adjusted based on the number we have left to generate). This still uses a loop, unfortunately, but now this will be run "tens" of times instead of 100,000 times. This also uses the where function and array slicing; these are good general tools to be comfortable with.
In one test with %timeit on my machine I found
In [27]: %timeit getSamples() # Original routine
1 loops, best of 3: 49.3 s per loop
In [28]: %timeit getSamples2()
1 loops, best of 3: 505 ms per loop
Here is some itertools "magic", but I'm not sure it can help. It is probably much better for performance to prepare a numpy array (using zeros) and fill it, rather than growing a Python list automatically. Here are both the itertools part and the zeros preparation. (Excuse me in advance for untested code.)
from itertools import count, ifilter, imap, takewhile
import operator

def getSamples():
    p = lambda x : mlab.normpdf(x, 3, 2) + mlab.normpdf(x, -5, 1)
    q = lambda x : mlab.normpdf(x, 5, 14)
    k = 30
    n = 100000
    samples_iter = imap(
        operator.itemgetter(1),
        takewhile(
            lambda i_s: i_s[0] < n,  # takewhile passes the (index, sample) tuple as a single argument
            enumerate(
                ifilter(lambda z: p(z) > np.random.uniform(0, k*q(z)),
                        (np.random.normal(5, 14) for _ in count()))
            )))
    goodSamples = np.zeros(n)
    # set values from iterator, probably there is a better way for that
    for i, sample in enumerate(samples_iter):
        goodSamples[i] = sample
    return goodSamples

Manual fft not giving me same results as fft

import numpy as np
import matplotlib.pyplot as pp

curve = np.genfromtxt('C:\Users\latel\Desktop\kool\Neuro\prax2\data\curve.csv', dtype = 'float', delimiter = ',')
curve_abs2 = np.empty_like(curve)
z = 1j
N = len(curve)

for i in range(0, N-1):
    curve_abs2[i] = 0
    for k in range(0, N-1):
        curve_abs2[i] += (curve[i]*np.exp((-1)*z*(np.pi)*i*((k-1)/N)))

for i in range(0, N):
    curve_abs2[i] = abs(curve_abs2[i])/(2*len(curve_abs2))

#curve_abs = (np.abs(np.fft.fft(curve)))
#pp.plot(curve_abs)
pp.plot(curve_abs2)
pp.show()
The commented-out code gives me 3 values, but my result is just ... different.
Wrong (the output of the code above): http://www.upload.ee/image/3922681/Ex5problem.png
Correct, using numpy.fft.fft(): http://www.upload.ee/image/3922682/Ex5numpyformulas.png
There are several problems:
You are assigning complex values to the elements of curve_abs2, so it should be declared to be complex, e.g. curve_abs2 = np.empty_like(curve, dtype=np.complex128). (And I would recommend using the name, say, curve_fft instead of curve_abs2.)
In python, range(low, high) gives the sequence [low, low + 1, ..., high - 2, high - 1], so instead of range(0, N - 1), you must use range(0, N) (which can be simplified to range(N), if you want).
You are missing a factor of 2 in your formula. You could fix this by using z = 2j.
In the expression that is being summed in the inner loop, you are indexing curve as curve[i], but this should be curve[k].
Also in that expression, you don't need to subtract 1 from k, because the k loop ranges from 0 to N - 1.
Because k and N are integers and you are using Python 2.7, the division in the expression (k-1)/N will be integer division, and you'll get 0 for all k. To fix this and the previous problem, you can change that term to k / float(N).
If you fix those issues, when the first double loop finishes, the array curve_abs2 (now a complex array) should match the result of np.fft.fft(curve). It won't be exactly the same, but the differences should be very small.
You could eliminate that double loop altogether using numpy vectorized calculations, but that is a topic for another question.
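For completeness, a rough sketch of what that vectorized version could look like (not part of the original answer; it reuses curve and N from the script above and assumes the fixes listed here, including the float(N) division):
k = np.arange(N)
# outer product k[:, None] * k gives the i*k term of the DFT exponent
dft = (curve * np.exp(-2j * np.pi * k[:, None] * k / float(N))).sum(axis=1)
curve_fft = np.abs(dft) / (2 * N)
# curve_fft should now closely match np.abs(np.fft.fft(curve)) / (2 * N)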
