Optimizing Calculations with numpy and numba in Python

I am trying to make Python compute standard deviations faster with numba and numpy. The problem is that the for loop is very slow, and I need an alternative so the code runs much faster. I applied numba to the existing numpy version, but there is not much of a performance gain. My original list_ has millions of values in it, so computing the standard deviations takes a very long time. The list_ shown below is a very short numpy array meant as an example for my problem, since I won't be able to post the original numbers. The for loop in the function below calculates the standard deviation of each window of number consecutive values in list_. How can I make this function run faster?
import numpy as np
from numba import njit,jit,vectorize
number = 5
list_= np.array([457.334015,424.440002,394.795990,408.903992,398.821014,402.152008,435.790985,423.204987,411.574005,
404.424988,399.519989,377.181000,375.467010,386.944000,383.614990,375.071991,359.511993,328.865997,
320.510010,330.079010,336.187012,352.940002,365.026001,361.562012,362.299011,378.549011,390.414001,
400.869995,394.773010,382.556000])
Normal code:
def std_():
    std = np.array([list_[i:i+number].std() for i in range(0, len(list_)-number)])
    print(std)

std_()
Numba Code:
jitted_func = njit()(std_)
jitted_func()
Performance results: both versions take roughly the same time (timing output not reproduced here).
You can do this in a vectorised fashion.
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def std_():
    std = np.array([list_[i:i+number].std() for i in range(0, len(list_)-number)])
    return std
std1 = np.std(rolling_window(list_, 5), axis=1)
print(np.allclose(std1[:-1], std_()))
Gives True. The code for rolling_window has been taken from this answer.
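On NumPy 1.20 or newer you can also avoid writing the strides by hand and use the built-in helper np.lib.stride_tricks.sliding_window_view; a minimal equivalent sketch:
import numpy as np

# Requires NumPy >= 1.20; builds the same (n - window + 1, window) view
windows = np.lib.stride_tricks.sliding_window_view(list_, 5)
std2 = windows.std(axis=1)
print(np.allclose(std2[:-1], std_()))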
Comparison with numba -
import numpy as np
from numba import njit,jit,vectorize
number = 5
list_= np.random.rand(10000)
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def std_():
    std = np.array([list_[i:i+number].std() for i in range(0, len(list_)-number)])
    return std
%timeit np.std(rolling_window(list_, 5), axis=1)
%%timeit
jitted_func = njit()(std_)
jitted_func()
Gives
499 µs ± 3.98 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
106 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
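Most of the remaining numba overhead comes from the jitted function still building its result through a list comprehension of per-window calls. Writing the loop explicitly inside the jitted function should close much of the gap; a rough sketch (timings not verified here):
import numpy as np
from numba import njit

@njit
def rolling_std(arr, window):
    # One output per full window; the per-window .std() reduction is compiled by numba
    n = arr.shape[0] - window + 1
    out = np.empty(n)
    for i in range(n):
        out[i] = arr[i:i + window].std()
    return out

rolling_std(list_, 5)  # first call includes compilation time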

Related

Efficient (not DataFrame.apply) way of getting cosine distance for mapped values

Here's some data I've generated:
import numpy as np
import pandas as pd
import scipy
import scipy.spatial
df = pd.DataFrame(
    {
        "item_1": np.random.randint(low=0, high=10, size=1000),
        "item_2": np.random.randint(low=0, high=10, size=1000),
    }
)
embeddings = {item_id: np.random.randn(100) for item_id in range(0, 10)}

def get_distance(item_1, item_2):
    arr1 = embeddings[item_1]
    arr2 = embeddings[item_2]
    return scipy.spatial.distance.cosine(arr1, arr2)
I'd like to apply get_distance to each row. I can do:
df.apply(lambda row: get_distance(row["item_1"], row["item_2"]), axis=1)
But that would be very slow for large datasets.
Is there a way to calculate the cosine similarity of the embeddings corresponding to each row, without using DataFrame.apply?
For the scipy version:
%%timeit
df.apply(lambda row: get_distance(row["item_1"], row["item_2"]), axis=1)
# 38.3 ms ± 84 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
For what it's worth, I added a numba version, at the cost of some extra complication.
With memory in mind (numpy broadcasting allocates temporary arrays), I used explicit for loops.
It is also worth reconsidering how the arguments are passed: you can pass stacked arrays instead of a dictionary (the item_1_embedded / item_2_embedded arrays used below are built that way).
Note that compilation makes the first use slow.
You can also parallelize the loops with numba's prange.
import numpy as np
import numba as nb

@nb.njit((nb.float64[:, ::1], nb.float64[:, ::1]))
def cos(a, b):
    norm_a = np.empty((a.shape[0],), dtype=np.float64)
    norm_b = np.empty((b.shape[0],), dtype=np.float64)
    cos_ab = np.empty((a.shape[0],), dtype=np.float64)
    for i in nb.prange(a.shape[0]):
        sq_norm = 0.0
        for j in range(100):
            sq_norm += a[i][j] ** 2
        norm_a[i] = sq_norm ** 0.5
    for i in nb.prange(b.shape[0]):
        sq_norm = 0.0
        for j in range(100):
            sq_norm += b[i][j] ** 2
        norm_b[i] = sq_norm ** 0.5
    for i in nb.prange(a.shape[0]):
        dot = 0.0
        for j in range(100):
            dot += a[i][j] * b[i][j]
        cos_ab[i] = 1 - dot / (norm_a[i] * norm_b[i])
    return cos_ab
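For the timing below, the per-row embeddings are first stacked into contiguous 2-D float64 arrays (the same construction used in the pure-NumPy answer further down); roughly:
# Stack each row's embedding into contiguous (len(df), 100) arrays
item_1_embedded = np.ascontiguousarray([embeddings[x] for x in df.item_1])
item_2_embedded = np.ascontiguousarray([embeddings[x] for x in df.item_2])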
%%timeit
cos(item_1_embedded, item_2_embedded)
# 218 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Using vectorized numpy operations directly is much faster:
item_1_embedded = np.array([embeddings[x] for x in df.item_1])
item_2_embedded = np.array([embeddings[x] for x in df.item_2])
cos_dist = 1 - np.sum(item_1_embedded * item_2_embedded, axis=1) / (
    np.linalg.norm(item_1_embedded, axis=1) * np.linalg.norm(item_2_embedded, axis=1))
(This version runs in 771 µs on average on my pc, vs 37.4 ms for the DataFrame.apply, which makes the pure numpy version about 50 times faster).
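For readers who prefer einsum, the same row-wise dot products can be spelled with np.einsum; this is just an alternative formulation of the expression above, not part of the original answer:
# Row-wise dot products via einsum, then the same norm-based cosine distance
dots = np.einsum('ij,ij->i', item_1_embedded, item_2_embedded)
norms = (np.linalg.norm(item_1_embedded, axis=1)
         * np.linalg.norm(item_2_embedded, axis=1))
cos_dist = 1 - dots / norms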
You can vectorize the call to cosine with numpy.vectorize. There is a slight gain in speed (34 ms vs 53 ms)
vec_cosine = np.vectorize(scipy.spatial.distance.cosine)
vec_cosine(df['item_1'].map(embeddings),
df['item_2'].map(embeddings))
output:
array([0.90680875, 0.90999454, 0.99212814, 1.12455852, 1.06354469,
0.95542037, 1.07133003, 1.07133003, 0. , 1.00837058,
0. , 0.93961103, 0.8943738 , 1.04872436, 1.21171375,
1.04621226, 0.90392229, 1.0365102 , 0. , 0.90180297,
0.90180297, 1.04516879, 0.94877277, 0.90180297, 0.93713404,
...
1.17548653, 1.11700641, 0.97926805, 0.8943738 , 0.93961103,
1.21171375, 0.91817959, 0.91817959, 1.04674315, 0.88210679,
1.11806218, 1.07816675, 1.00837058, 1.12455852, 1.04516879,
0.93713404, 0.93713404, 0.95542037, 0.93876964, 0.91817959])

Faster way to check if elements in numpy array windows are finite

I have a very long NumPy array with 1_000_000_000 elements and I want to slide a 50 element window across the array and ask if all of the elements within the window are finite. If all elements within a 50 element window are all finite then return True (for that window), otherwise, if one or more elements within the 50 element window are not finite then return False (for that window). Continue this assessment until all windows are assessed. A nice way to do this is:
import numpy as np
def rolling_window(a, window):
    a = np.asarray(a)
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

if __name__ == "__main__":
    a = np.random.rand(100_000_000)  # This is 10x shorter than my real data
    w = 50
    idx = np.random.randint(0, len(a), size=len(a)//10)  # Simulate having np.nan in my array
    a[idx] = np.nan
    print(np.all(rolling_window(np.isfinite(a), w), axis=1))
However, this is slow when my array is of length 1_000_000_000. Is there a faster way to accomplish this that also doesn't require a ton of memory?
Approach #1: Abuse strided windows directly into the isfinite-mask for the assignment -
def strided_allfinite(a, w):
    m = np.isfinite(a)
    p = rolling_window(m, w)
    nmW = ~m[:w]
    if nmW.any():
        m[:np.flatnonzero(nmW).max()] = False
    p[~m[w-1:]] = False
    return m[:-w+1]
Timings on the given sample data:
In [323]: N = 100_000_000
...: w = 50
...:
...: np.random.seed(0)
...: a = np.random.rand(N) # This is 10x shorter than my real data
...: idx = np.random.randint(0, len(a), size=len(a)//10) # Simulate...
...: a[idx] = np.nan
# Original soln
In [324]: %timeit np.all(rolling_window(np.isfinite(a), w), axis=1)
1.61 s ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [325]: %timeit strided_allfinite(a, w)
556 ms ± 87.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Approach #2
We can leverage convolution -
np.convolve(np.isfinite(a), np.ones(w),'valid')==w
Approach #3
With binary-erosion -
from scipy.ndimage.morphology import binary_erosion
m = np.isfinite(a)
out = binary_erosion(m, np.ones(w, dtype=bool))[w//2:len(a)-w+1+w//2]
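A cumulative-sum variant of Approach #2 gives the same per-window counts without building a 2-D view or doing a float convolution; a minimal sketch, reusing a and w from above:
def cumsum_allfinite(a, w):
    # Prefix-sum the finite-mask; a window is all-finite exactly when
    # its count of finite values equals the window length.
    counts = np.concatenate(([0], np.cumsum(np.isfinite(a), dtype=np.int64)))
    return (counts[w:] - counts[:-w]) == w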

Why numpy.var is O(N) space?

I have an array of ~13GB. I call numpy.var on it to compute the variance. However, it allocates another ~13GB to do this. Why does it need O(N) space? Or am I calling numpy.var in a wrong way?
import numpy as np
# data = ...
print('Variance: ', np.var(data))
NumPy will create an intermediate array to compute abs(data - data.mean()) ** 2 in order to compute the variance. You can write your own variance function with a loop and make it fast with Numba:
import numpy as np
import numba as nb
@nb.njit(parallel=True)
def var_nb(a, ddof=0):
    n = len(a)
    s = a.sum()
    m = s / n  # the mean always uses n; ddof only enters the final division
    v = 0.0
    for i in nb.prange(n):
        v += abs(a[i] - m) ** 2
    return v / (n - ddof)
np.random.seed(100)
a = np.random.rand(100_000)
print(np.var(a))
# 0.08349747560941487
print(var_nb(a))
# 0.08349747560941487
%timeit np.var(a)
# 143 µs ± 414 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit var_nb(a)
# 40.2 µs ± 530 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This is faster without parallelization:
import numpy as np

def var(a: np.ndarray, axis: int = 0):
    return np.sum(abs(a - (a.sum(axis=axis) / len(a))) ** 2, axis=axis) / len(a)
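If the concern is really the O(N) temporary rather than raw speed, a chunked two-pass computation keeps the extra memory bounded by the chunk size; a rough sketch, assuming data is a large 1-D float array as in the question:
import numpy as np

def var_chunked(data, chunk=1_000_000):
    mean = data.mean()                           # first pass: no large temporary
    ssq = 0.0
    for start in range(0, data.size, chunk):
        d = data[start:start + chunk] - mean     # only one chunk-sized temporary at a time
        ssq += np.dot(d, d)
    return ssq / data.size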

A tedious loop looking for improvements

In my code I need to repeatedly compute a vector whose values are the means of different patches of another array.
Here is an example showing how I do it, but I found that it runs too inefficiently...
import numpy as np

vector_a = np.zeros(10)
array_a = np.random.random((100,100))
for i in range(len(vector_a)):
    vector_a[i] = np.mean(array_a[:, i+20:i+40])
Is there any way to make it more efficient? Any comments or suggestions are very welcome! Many thanks!
(Yes, the 20 and 40 are fixed.)
EDIT:
Actually you can do this much faster. The previous function can be improved by operating on summed columns like this:
def rolling_means_faster1(array_a, n, first, size):
    # Sum each relevant column
    sum_a = np.sum(array_a[:, first:(first + size + n - 1)], axis=0)
    # Reshape as before
    strides_b = (sum_a.strides[0], sum_a.strides[0])
    array_b = np.lib.stride_tricks.as_strided(sum_a, (n, size), (strides_b))
    # Average
    v = np.sum(array_b, axis=1)
    v /= (len(array_a) * size)
    return v
Another way is to work with accumulated sums, adding and removing as necessary for each output element.
def rolling_means_faster2(array_a, n, first, size):
    # Sum each relevant column
    sum_a = np.sum(array_a[:, first:(first + size + n - 1)], axis=0)
    # Add a zero at the beginning so the next operation works fine
    sum_a = np.insert(sum_a, 0, 0)
    # Sum the initial `size` elements and add and remove partial sums as necessary
    v = np.sum(sum_a[:size]) - np.cumsum(sum_a[:n]) + np.cumsum(sum_a[-n:])
    # Average
    v /= (size * len(array_a))
    return v
Benchmarking against the previous solutions:
import numpy as np
np.random.seed(100)
array_a = np.random.random((1000, 1000))
n = 100
first = 100
size = 200
%timeit rolling_means_orig(array_a, n, first, size)
# 12.7 ms ± 55.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit rolling_means(array_a, n, first, size)
# 5.49 ms ± 43.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit rolling_means_faster1(array_a, n, first, size)
# 166 µs ± 874 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit rolling_means_faster2(array_a, n, first, size)
# 182 µs ± 2.04 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So these last two seem to be very close in performance. It may depend on the relative sizes of the inputs.
This is a possible vectorized solution:
import numpy as np
# Data
np.random.seed(100)
array_a = np.random.random((100, 100))
# Take all the relevant columns
slice_a = array_a[:, 20:40 + 10]
# Make a "rolling window" with stride tricks
strides_b = (slice_a.strides[1], slice_a.strides[0], slice_a.strides[1])
array_b = np.lib.stride_tricks.as_strided(slice_a, (10, 100, 20), (strides_b))
# Take mean
result = np.mean(array_b, axis=(1, 2))
# Original method for testing correctness
vector_a = np.zeros(10)
idv1 = np.arange(10) + 20
idv2 = np.arange(10) + 40
for i in range(len(vector_a)):
    vector_a[i] = np.mean(array_a[:, idv1[i]:idv2[i]])
print(np.allclose(vector_a, result))
# True
Here is a quick benchmark in IPython (with the sizes increased so the differences are easier to see):
import numpy as np
def rolling_means(array_a, n, first, size):
    slice_a = array_a[:, first:(first + size + n)]
    strides_b = (slice_a.strides[1], slice_a.strides[0], slice_a.strides[1])
    array_b = np.lib.stride_tricks.as_strided(slice_a, (n, len(array_a), size), (strides_b))
    return np.mean(array_b, axis=(1, 2))

def rolling_means_orig(array_a, n, first, size):
    vector_a = np.zeros(n)
    idv1 = np.arange(n) + first
    idv2 = np.arange(n) + (first + size)
    for i in range(len(vector_a)):
        vector_a[i] = np.mean(array_a[:, idv1[i]:idv2[i]])
    return vector_a
np.random.seed(100)
array_a = np.random.random((1000, 1000))
n = 100
first = 100
size = 200
%timeit rolling_means(array_a, n, first, size)
# 5.48 ms ± 26.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit rolling_means_orig(array_a, n, first, size)
# 32.8 ms ± 762 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
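For completeness, a quick check (not part of the original benchmark) that the variants defined above agree:
# All implementations should produce the same rolling means (up to float error)
ref = rolling_means_orig(array_a, n, first, size)
print(np.allclose(rolling_means(array_a, n, first, size), ref))
print(np.allclose(rolling_means_faster1(array_a, n, first, size), ref))
print(np.allclose(rolling_means_faster2(array_a, n, first, size), ref))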
This solution works on the assumption that you are trying to compute rolling average of a subset of window of columns.
As an example and ignoring rows, given [0, 1, 2, 3, 4] and a window of 2 the averages are [0.5, 1.5, 2.5, 3.5], and that you might only want the second and third averages.
Your current solution is inefficient as it recomputes the mean of each column for every output element in vector_a. Given that (a / n) + (b / n) == (a + b) / n, we can get away with computing the mean of each column only once, and then combine the column means as needed to produce the final output.
window_first_start = idv1.min() # or idv1[0]
window_last_end = idv2.max() # or idv2[-1]
window_size = idv2[0] - idv1[0]
assert ((idv2 - idv1) == window_size).all(), "sanity check, not needed if assumption holds true"
# a view of the columns we are interested in, no copying is done here
view = array_a[:,window_first_start:window_last_end]
# calculate the means for each column
col_means = view.mean(axis=0)
# cumsum is used to find the rolling sum of means and so the rolling average
# We use an out variable to make sure we have a 0 in the first element of cum_sum.
# This makes life a little easier in the next step.
cum_sum = np.empty(len(col_means) + 1, dtype=col_means.dtype)
cum_sum[0] = 0
np.cumsum(col_means, out=cum_sum[1:])
result = (cum_sum[window_size:] - cum_sum[:-window_size]) / window_size
Having tested this against your own code, the above is significantly faster (increasing with the size of the input array), and slightly faster than the solution provided by jdehesa. With an input array of 1000x1000, it is two orders of magnitude faster than your solution and one order of magnitude faster than jdehesa's.
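As a hypothetical sanity check (not from the original answer), result can be compared against the question's per-window means, reusing the idv1/idv2 ranges assumed above:
expected = np.array([array_a[:, i1:i2].mean() for i1, i2 in zip(idv1, idv2)])
print(np.allclose(result, expected))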
Try this:
import numpy as np
array_a = np.random.random((100,100))
vector_a = [np.mean(array_a[:,i+20:i+40]) for i in range(10)]

Numpy efficient matrix self-multiplication (gram matrix)

I want to multiply B = A @ A.T in numpy. Obviously, the answer would be a symmetric matrix (i.e. B[i, j] == B[j, i]).
However, it is not clear to me how to leverage this easily to cut the computation time down in half (by only computing the lower triangle of B and then using that to get the upper triangle for free).
Is there a way to perform this optimally?
As noted in @PaulPanzer's link, dot can detect this case. Here's the timing proof:
In [355]: A = np.random.rand(1000,1000)
In [356]: timeit A.dot(A.T)
57.4 ms ± 960 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [357]: B = A.T.copy()
In [358]: timeit A.dot(B)
98.6 ms ± 805 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
(See: Numpy dot too clever about symmetric multiplications.)
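If you want to go one step further and explicitly compute only one triangle, the BLAS ?syrk routine exposed through scipy does exactly that; a minimal sketch, assuming scipy is available and that the default uplo leaves the result in the upper triangle:
import numpy as np
from scipy.linalg import blas

A = np.random.rand(1000, 500)
C = blas.dsyrk(1.0, A)            # computes only one triangle of A @ A.T
C = np.triu(C) + np.triu(C, 1).T  # mirror the upper triangle onto the lower
print(np.allclose(C, A @ A.T))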
You can always use sklearn's pairwise_distances.
Usage:
from sklearn.metrics.pairwise import pairwise_distances
gram = pairwise_distances(x, metric=metric)
Where metric is a callable or a string naming one of their implemented metrics (full list in the link above).
But, I wrote this for myself a while back so I can share what I did:
import numpy as np

def computeGram(elements, dist):
    n = len(elements)
    gram = np.zeros([n, n])
    for i in range(n):
        for j in range(i + 1):
            gram[i, j] = dist(elements[i], elements[j])
    upTriIdxs = np.triu_indices(n)
    gram[upTriIdxs] = gram.T[upTriIdxs]
    return gram
Where dist is a callable, in your case np.inner
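For example (a small made-up input just to illustrate the call), using np.inner as the metric reproduces the direct product:
# Small demo: computeGram with np.inner matches A @ A.T
A = np.random.rand(200, 50)
gram = computeGram(A, np.inner)
print(np.allclose(gram, A @ A.T))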
