I have an array x of length n_x, an array all_par1 of length n_par1, and a single parameter par2. I also have two functions, func1 and func2, that take these parameters and x as input.
I want to create an array of shape n_x x (2 * n_par1), where the first half of the columns is populated with values from func1 and the second half with values from func2.
I currently do it like this:
import numpy as np

def func1(x, par1):
    return x / (par1 + x)

def func2(x, par1, par2):
    return -par1 * x / ((par2 + x) ** 2)

def populate_matrix(xvec, par1_vec, par2):
    first_half = np.stack([func1(xvec, par1_i) for par1_i in par1_vec], axis=1)
    second_half = np.stack([func2(xvec, par1_i, par2) for par1_i in par1_vec], axis=1)
    return np.concatenate((first_half, second_half), axis=1)

np.random.seed(0)
all_par1 = [1., 2., 3.]
my_par2 = 5.
n_x = 2
x_variable_length = np.random.rand(n_x)
print(x_variable_length)

mat = populate_matrix(x_variable_length, all_par1, my_par2)
This then gives me, e.g.:
[[ 0.35434447 0.21532117 0.15464704 -0.01782479 -0.03564959 -0.05347438]
[ 0.416974 0.26340313 0.19250415 -0.02189575 -0.0437915 -0.06568725]]
Since n_x is 2, the result has two rows; the first half of the columns is generated with func1 (always positive) and the second half with func2 (always negative).
I need to call this function a lot of times and I am wondering whether this is the most efficient way of doing it. Any ideas?
Not sure whether it is of interest but the actual dimensions are something like 300 x 100.
Here is the vectorized way, giving about a 10x improvement on big arrays (100 x 200 in the tests):
def populate_matrix_v(xvec, par1_vec, par2):
    par1_vec = np.asarray(par1_vec)  # accept a plain list like all_par1
    n, m = xvec.size, par1_vec.size
    res = np.empty((n, 2 * m))
    # Broadcasting (n, 1) against (m,) fills each (n, m) block in one call
    res[:, :m] = func1(xvec[:, None], par1_vec)
    res[:, m:] = func2(xvec[:, None], par1_vec, par2)
    return res
In [377]: %timeit matv = populate_matrix_v(x_variable_length, all_par1, my_par2)
171 µs ± 6.13 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [378]: %timeit mat = populate_matrix(x_variable_length, all_par1, my_par2)
1.88 ms ± 61.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
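As a quick sanity check (my addition, not part of the original answer), the two versions produce the same matrix for the setup above:

matv = populate_matrix_v(x_variable_length, all_par1, my_par2)
mat = populate_matrix(x_variable_length, all_par1, my_par2)
print(np.allclose(mat, matv))  # True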
Here's some data I've generated:
import numpy as np
import pandas as pd
import scipy
import scipy.spatial

df = pd.DataFrame(
    {
        "item_1": np.random.randint(low=0, high=10, size=1000),
        "item_2": np.random.randint(low=0, high=10, size=1000),
    }
)
embeddings = {item_id: np.random.randn(100) for item_id in range(0, 10)}

def get_distance(item_1, item_2):
    arr1 = embeddings[item_1]
    arr2 = embeddings[item_2]
    return scipy.spatial.distance.cosine(arr1, arr2)
I'd like to apply get_distance to each row. I can do:
df.apply(lambda row: get_distance(row["item_1"], row["item_2"]), axis=1)
But that would be very slow for large datasets.
Is there a way to calculate the cosine distance between the embeddings corresponding to each row, without using DataFrame.apply?
For the scipy version:
%%timeit
df.apply(lambda row: get_distance(row["item_1"], row["item_2"]), axis=1)
# 38.3 ms ± 84 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
For what it's worth, I added a numba version, at the cost of some extra complexity. Thinking about memory (numpy broadcasting allocates temporaries), I used explicit for loops. It is also worth considering how you pass the arguments: you can pass stacked arrays of embeddings instead of a dictionary. Note that the first run is slow due to JIT compilation, and that you can make the loops parallel with numba.
import numpy as np
import numba as nb

@nb.njit((nb.float64[:, ::100], nb.float64[:, ::100]))
def cos(a, b):
    norm_a = np.empty((a.shape[0],), dtype=np.float64)
    norm_b = np.empty((b.shape[0],), dtype=np.float64)
    cos_ab = np.empty((a.shape[0],), dtype=np.float64)
    # Row norms of a
    for i in nb.prange(a.shape[0]):
        sq_norm = 0.0
        for j in range(100):
            sq_norm += a[i][j] ** 2
        norm_a[i] = sq_norm ** 0.5
    # Row norms of b
    for i in nb.prange(b.shape[0]):
        sq_norm = 0.0
        for j in range(100):
            sq_norm += b[i][j] ** 2
        norm_b[i] = sq_norm ** 0.5
    # Cosine distance per row
    for i in nb.prange(a.shape[0]):
        dot = 0.0
        for j in range(100):
            dot += a[i][j] * b[i][j]
        cos_ab[i] = 1 - dot / (norm_a[i] * norm_b[i])
    return cos_ab
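The function above assumes the embeddings have already been stacked into two (N, 100) float64 arrays; they can be built the same way as in the vectorized numpy answer below:

item_1_embedded = np.array([embeddings[x] for x in df.item_1])
item_2_embedded = np.array([embeddings[x] for x in df.item_2])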
%%timeit
cos(item_1_embedded, item_2_embedded)
# 218 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Using vectorized numpy operations directly is much faster:
item_1_embedded = np.array([embeddings[x] for x in df.item_1])
item_2_embedded = np.array([embeddings[x] for x in df.item_2])
cos_dist = 1 - np.sum(item_1_embedded * item_2_embedded, axis=1) / (
    np.linalg.norm(item_1_embedded, axis=1) * np.linalg.norm(item_2_embedded, axis=1)
)
(This version runs in 771 µs on average on my pc, vs 37.4 ms for the DataFrame.apply, which makes the pure numpy version about 50 times faster).
You can vectorize the call to cosine with numpy.vectorize. There is a slight gain in speed (34 ms vs 53 ms)
vec_cosine = np.vectorize(scipy.spatial.distance.cosine)
vec_cosine(df['item_1'].map(embeddings),
           df['item_2'].map(embeddings))
output:
array([0.90680875, 0.90999454, 0.99212814, 1.12455852, 1.06354469,
0.95542037, 1.07133003, 1.07133003, 0. , 1.00837058,
0. , 0.93961103, 0.8943738 , 1.04872436, 1.21171375,
1.04621226, 0.90392229, 1.0365102 , 0. , 0.90180297,
0.90180297, 1.04516879, 0.94877277, 0.90180297, 0.93713404,
...
1.17548653, 1.11700641, 0.97926805, 0.8943738 , 0.93961103,
1.21171375, 0.91817959, 0.91817959, 1.04674315, 0.88210679,
1.11806218, 1.07816675, 1.00837058, 1.12455852, 1.04516879,
0.93713404, 0.93713404, 0.95542037, 0.93876964, 0.91817959])
import random

# Generate test data
test = list(range(150))
groups = []
for _ in range(75_000):
    groups.append(random.sample(test, 6))
Set up the variables as numpy arrays:
# Best version
import numpy as np
import random
from numba import jit # Kind of optional see below
# Generate test data
test = list(range(150))
groups = np.array([random.sample(test, 6) for _ in range(75_000)])
# This will change every time but just leaving the same for example
scores_dict = {i: random.uniform(0, 120) for i in range(150)}
scores = np.array(list(scores_dict.items()))
Here's the vectorized version using numpy's sum and take:
def fun1(scores, groups):
    for _ in range(6250):
        c = np.sum(np.take(scores[:, 1], groups), axis=1)
    return c
%timeit fun1(scores, groups) # Takes ~2.5 mins to run
18.6 s ± 625 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you really want to go all out you can try using numba on top of numpy:
@jit(nopython=True)
def fun2(scores, groups):
    for _ in range(6250):
        c = np.sum(np.take(scores[:, 1], groups), axis=1)
    return c
%timeit fun2(scores, groups) # Takes ~1.2 mins to run
10.1 s ± 1.32 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
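For a single evaluation outside the 6250-iteration benchmark loop, the core operation is just one line (my own note: the first call to the numba-jitted fun2 also includes compilation time, so time a second call):

# Look up each group member's score by position and sum per group
group_totals = np.sum(np.take(scores[:, 1], groups), axis=1)  # shape: (75_000,)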
I want to bin the data every time the threshold 10000 is exceeded.
I have tried this with no luck:
# data which is an array of floats
diff = np.diff(np.cumsum(data)//10000, prepend=0)
indices = (np.argwhere(diff > 0)).flatten()
The problem is that not all of the bins contain at least 10000, which was my goal.
Expected output
input_data = [4000, 5000, 6000, 2000, 8000, 3000]
# (4000+5000+6000 >= 10000. Index 2)
# (2000+8000 >= 10000. Index 4)
Output: [2, 4]
I wonder if there is any alternative to a for loop?
Not sure how this could be vectorized, if it even can be, since by taking the cumulative sum you'll be propagating the remainders each time the threshold is surpassed. So probably this is a good case for numba, which will compile the code down to C level, allowing for a loopy but performant approach:
import numpy as np
from numba import njit, int32

@njit('int32[:](int32[:], uintc)')
def windowed_cumsum(a, thr):
    indices = np.zeros(len(a), int32)
    window = 0
    ix = 0
    for i in range(len(a)):
        window += a[i]
        if window >= thr:
            indices[ix] = i
            ix += 1
            window = 0
    return indices[:ix]
The explicit signature implies ahead-of-time compilation, though this enforces specific dtypes on the input array. The inferred dtype for the example array is int32, but if that might not always be the case, or if you want a more flexible solution, you can simply omit the signature; the only consequence is that the function will be compiled on its first execution.
input_data = np.array([4000, 5000, 6000, 2000, 8000, 3000])
windowed_cumsum(input_data, 10000)
# array([2, 4])
Also @jdehesa raises an interesting point, which is that for very long arrays compared to the number of bins, a better option might be to just append the indices to a list. So here is an alternative approach using lists (also in no-python mode), along with timings under different scenarios:
from numba import njit, int32

@njit
def windowed_cumsum_list(a, thr):
    indices = []
    window = 0
    for i in range(len(a)):
        window += a[i]
        if window >= thr:
            indices.append(i)
            window = 0
    return indices
a = np.random.randint(0,10,10_000)
%timeit windowed_cumsum(a, 20)
# 16.1 µs ± 232 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit windowed_cumsum_list(a, 20)
# 65.5 µs ± 623 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit windowed_cumsum(a, 2000)
# 7.38 µs ± 167 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit windowed_cumsum_list(a, 2000)
# 7.1 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
So it seems that under most scenarios the preallocated numpy array is the faster option: even in the second case, with an array of length 10000 and a resulting array of about 20 bin indices, both perform similarly. Still, for memory-efficiency reasons the list-based version might be more convenient in some cases.
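If the actual bins are needed rather than just the split indices, one way (my own addition, not part of the answer) is to split on the returned positions; note that the last chunk may not have reached the threshold:

indices = windowed_cumsum(input_data, 10000)  # array([2, 4])
bins = np.split(input_data, indices + 1)
# [array([4000, 5000, 6000]), array([2000, 8000]), array([3000])]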
Here is how you can do it fairly efficiently with a loop, using np.searchsorted to find bin boundaries fast:
import numpy as np

np.random.seed(0)
bin_size = 10_000
data = np.random.randint(100, size=20_000)

# Naive solution (incorrect, for comparison)
data_f = np.floor(np.cumsum(data) / bin_size).astype(int)
bin_starts = np.r_[0, np.where(np.diff(data_f) > 0)[0] + 1]
# Check bin sizes
bin_sums = np.add.reduceat(data, bin_starts)
# We go over the limit!
print(bin_sums.max())
# 10080

# Better solution with loop
data_c = np.cumsum(data)
ref_val = 0
bin_starts = [0]
while True:
    # Search next split point
    ref_idx = bin_starts[-1]
    # Binary search through remaining cumsum
    next_idx = np.searchsorted(data_c[ref_idx:], ref_val + bin_size, side='right')
    next_idx += ref_idx
    # If we finished the array, stop
    if next_idx >= len(data_c):
        break
    # Add new bin boundary
    bin_starts.append(next_idx)
    ref_val = data_c[next_idx - 1]
# Convert bin boundaries to array
bin_starts = np.array(bin_starts)
# Check bin sizes
bin_sums = np.add.reduceat(data, bin_starts)
# Does not go over the limit
print(bin_sums.max())
# 10000
I have an array of ~13 GB. I call numpy.var on it to compute the variance. However, it allocates another ~13 GB to do this. Why does it need O(N) space? Or am I calling numpy.var in the wrong way?
import numpy as np
# data = ...
print('Variance: ', np.var(data))
NumPy will create an intermediate array to compute abs(data - data.mean()) ** 2 in order to compute the variance. You can write your own variance function with a loop and make it fast with Numba:
import numpy as np
import numba as nb

@nb.njit(parallel=True)
def var_nb(a, ddof=0):
    n = len(a)
    s = a.sum()
    m = s / n  # the mean always divides by n; only the variance divisor uses ddof
    v = 0
    for i in nb.prange(n):
        v += abs(a[i] - m) ** 2
    return v / (n - ddof)
np.random.seed(100)
a = np.random.rand(100_000)
print(np.var(a))
# 0.08349747560941487
print(var_nb(a))
# 0.08349747560941487
%timeit np.var(a)
# 143 µs ± 414 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit var_nb(a)
# 40.2 µs ± 530 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This is faster without parallelization:

import numpy as np

def var(a: np.ndarray, axis: int = 0):
    return np.sum(abs(a - (a.sum(axis=axis) / len(a))) ** 2, axis=axis) / len(a)
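If the main concern is the ~13 GB temporary rather than speed, another option (my own sketch, not from the answers above, and assuming a 1-D real-valued array; the chunk size is arbitrary) is a two-pass computation over chunks, so only one chunk-sized temporary exists at a time:

import numpy as np

def var_chunked(a, ddof=0, chunk=1_000_000):
    # First pass: the mean is a reduction and needs no large temporary
    m = a.mean()
    # Second pass: accumulate squared deviations chunk by chunk
    sq_sum = 0.0
    for start in range(0, len(a), chunk):
        d = a[start:start + chunk] - m  # temporary of at most `chunk` elements
        sq_sum += np.dot(d, d)          # sum of squares of the chunk
    return sq_sum / (len(a) - ddof)

a = np.random.rand(100_000)
print(np.allclose(var_chunked(a), np.var(a)))  # True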
In my code I need to repeatedly compute a vector whose values are the means of different patches of another array.
Here is an example showing how I currently do it, but I found it too inefficient:
import numpy as np

vector_a = np.zeros(10)
array_a = np.random.random((100, 100))
for i in range(len(vector_a)):
    vector_a[i] = np.mean(array_a[:, i + 20:i + 40])
Is there any way to make it more efficient? Any comments or suggestions are very welcome! Many thanks!
Yes, the 20 and 40 are fixed.
EDIT:
Actually you can do this much faster. The previous function can be improved by operating on summed columns like this:
def rolling_means_faster1(array_a, n, first, size):
    # Sum each relevant column
    sum_a = np.sum(array_a[:, first:(first + size + n - 1)], axis=0)
    # Reshape as before
    strides_b = (sum_a.strides[0], sum_a.strides[0])
    array_b = np.lib.stride_tricks.as_strided(sum_a, (n, size), strides_b)
    # Average
    v = np.sum(array_b, axis=1)
    v /= (len(array_a) * size)
    return v
Another way is to work with accumulated sums, adding and removing as necessary for each output element.
def rolling_means_faster2(array_a, n, first, size):
    # Sum each relevant column
    sum_a = np.sum(array_a[:, first:(first + size + n - 1)], axis=0)
    # Add a zero at the beginning so the next operation works fine
    sum_a = np.insert(sum_a, 0, 0)
    # Sum the initial `size` elements and add and remove partial sums as necessary
    v = np.sum(sum_a[:size]) - np.cumsum(sum_a[:n]) + np.cumsum(sum_a[-n:])
    # Average
    v /= (size * len(array_a))
    return v
Benchmarking against the previous solutions:
import numpy as np
np.random.seed(100)
array_a = np.random.random((1000, 1000))
n = 100
first = 100
size = 200
%timeit rolling_means_orig(array_a, n, first, size)
# 12.7 ms ± 55.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit rolling_means(array_a, n, first, size)
# 5.49 ms ± 43.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit rolling_means_faster1(array_a, n, first, size)
# 166 µs ± 874 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit rolling_means_faster2(array_a, n, first, size)
# 182 µs ± 2.04 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So these last two seem to be very close in performance. It may depend on the relative sizes of the inputs.
This is a possible vectorized solution:
import numpy as np

# Data
np.random.seed(100)
array_a = np.random.random((100, 100))

# Take all the relevant columns
slice_a = array_a[:, 20:40 + 10]
# Make a "rolling window" with stride tricks
strides_b = (slice_a.strides[1], slice_a.strides[0], slice_a.strides[1])
array_b = np.lib.stride_tricks.as_strided(slice_a, (10, 100, 20), strides_b)
# Take the mean
result = np.mean(array_b, axis=(1, 2))

# Original method for testing correctness
vector_a = np.zeros(10)
idv1 = np.arange(10) + 20
idv2 = np.arange(10) + 40
for i in range(len(vector_a)):
    vector_a[i] = np.mean(array_a[:, idv1[i]:idv2[i]])

print(np.allclose(vector_a, result))
# True
Here is a quick benchmark in IPython (sizes increased for appreciation):
import numpy as np

def rolling_means(array_a, n, first, size):
    slice_a = array_a[:, first:(first + size + n)]
    strides_b = (slice_a.strides[1], slice_a.strides[0], slice_a.strides[1])
    array_b = np.lib.stride_tricks.as_strided(slice_a, (n, len(array_a), size), strides_b)
    return np.mean(array_b, axis=(1, 2))

def rolling_means_orig(array_a, n, first, size):
    vector_a = np.zeros(n)
    idv1 = np.arange(n) + first
    idv2 = np.arange(n) + (first + size)
    for i in range(len(vector_a)):
        vector_a[i] = np.mean(array_a[:, idv1[i]:idv2[i]])
    return vector_a
np.random.seed(100)
array_a = np.random.random((1000, 1000))
n = 100
first = 100
size = 200
%timeit rolling_means(array_a, n, first, size)
# 5.48 ms ± 26.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit rolling_means_orig(array_a, n, first, size)
# 32.8 ms ± 762 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This solution works on the assumption that you are trying to compute a rolling average over a window of columns, and that you only want a subset of those averages.
For example, ignoring rows, given [0, 1, 2, 3, 4] and a window of 2, the rolling averages are [0.5, 1.5, 2.5, 3.5], and you might only want the second and third of them.
Your current solution is inefficient because it recomputes the mean of the same columns for each output in vector_a. Given that (a / n) + (b / n) == (a + b) / n, we can get away with computing the mean of each column only once, and then combine the column means as needed to produce the final output.
window_first_start = idv1.min()  # or idv1[0]
window_last_end = idv2.max()  # or idv2[-1]
window_size = idv2[0] - idv1[0]
assert ((idv2 - idv1) == window_size).all(), "sanity check, not needed if assumption holds true"

# A view of the columns we are interested in; no copying is done here
view = array_a[:, window_first_start:window_last_end]

# Calculate the means for each column
col_means = view.mean(axis=0)

# cumsum is used to find the rolling sum of means and so the rolling average.
# We use an out variable to make sure we have a 0 in the first element of cum_sum.
# This makes life a little easier in the next step.
cum_sum = np.empty(len(col_means) + 1, dtype=col_means.dtype)
cum_sum[0] = 0
np.cumsum(col_means, out=cum_sum[1:])

result = (cum_sum[window_size:] - cum_sum[:-window_size]) / window_size
Having tested this against your own code, the above is significantly faster (increasing with the size of the input array), and slightly faster than the solution provided by jdehesa. With an input array of 1000x1000, it is two orders of magnitude faster than your solution and one order of magnitude faster than jdehesa's.
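As a quick correctness check (my addition, using the idv1/idv2 index arrays from the original setup), the result above can be compared against the straightforward loop:

expected = np.array([array_a[:, i1:i2].mean() for i1, i2 in zip(idv1, idv2)])
print(np.allclose(result, expected))  # True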
Try this:
import numpy as np
array_a = np.random.random((100,100))
vector_a = [np.mean(array_a[:, i + 20:i + 40]) for i in range(10)]