Assume an n-dimensional array of observations that is reshaped into a 2-D array, with each row being one observation set. Using this reshape approach, np.polyfit can compute 2nd-order fit coefficients for the entire ndarray (vectorized):
fit = np.polynomial.polynomial.polyfit(X, Y, 2)
where Y is shape (304000, 21) and X is a vector. This results in a (304000,3) array of coefficients, fit.
Using an iterator it is possible to call np.polyval(fit, X) for each row. This is inefficient when a vectorized approach may exist. Could the fit result be applied to the entire observation array without iterating? If so, how?
This is along the lines of this SO question.
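For reference, the per-row iteration being replaced might look like the sketch below (the shapes and names follow the question's description and are otherwise assumptions):
# Hypothetical per-row loop (inefficient): evaluate one polynomial per row of `fit`
evaluated = np.empty((fit.shape[0], X.size))
for i, coeffs in enumerate(fit):          # one row of quadratic coefficients per observation set
    evaluated[i] = np.polyval(coeffs, X)  # evaluate that polynomial at every point in X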
np.polynomial.polynomial.polyval takes multidimensional coefficient arrays:
>>> x = np.random.rand(100)
>>> y = np.random.rand(100, 25)
>>> fit = np.polynomial.polynomial.polyfit(x, y, 2)
>>> fit.shape # 25 columns of 3 polynomial coefficients
(3L, 25L)
>>> xx = np.random.rand(50)
>>> interpol = np.polynomial.polynomial.polyval(xx, fit)
>>> interpol.shape # 25 rows, each with 50 evaluations of the polynomial
(25L, 50L)
And of course:
>>> np.all([np.allclose(np.polynomial.polynomial.polyval(xx, fit[:, j]),
... interpol[j]) for j in range(25)])
True
np.polynomial.polynomial.polyval is a perfectly fine (and convenient) approach to efficient evaluation of polynomial fittings.
However, if raw speed is what you are looking for, constructing the polynomial inputs yourself and using plain numpy matrix multiplication is noticeably faster (roughly 3-4x in the timings below).
Setup
Using a setup similar to the one above, we'll create 100 different quadratic fits.
>>> num_samples = 100000
>>> num_lines = 100
>>> x = np.random.randint(0,100,num_samples)
>>> y = np.random.randint(0,100,(num_samples, num_lines))
>>> fit = np.polynomial.polynomial.polyfit(x, y, deg=2)  # ascending-degree coefficients, matching polyval below
>>> xx = np.random.randint(0,100,num_samples*10)
Numpy's polyval Function
res1 = np.polynomial.polynomial.polyval(xx, fit)
Basic Matrix Multiplication
inputs = np.array([np.power(xx,d) for d in range(len(fit))])
res2 = fit.T.dot(inputs)
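As a side note, np.vander can build the same power matrix in a single call; this is an equivalent formulation, not part of the original comparison:
inputs = np.vander(xx, N=len(fit), increasing=True).T   # rows are xx**0, xx**1, xx**2 after the transpose
res2 = fit.T.dot(inputs)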
Timing the functions
Using the same parameters as above...
%timeit _ = np.polynomial.polynomial.polyval(xx, fit)
1 loop, best of 3: 247 ms per loop
%timeit inputs = np.array([np.power(xx, d) for d in range(len(fit))]);_ = fit.T.dot(inputs)
10 loops, best of 3: 72.8 ms per loop
To beat a dead horse...
On average that is a speedup of roughly 3.6x; run-to-run fluctuations most likely come from other processes running in the background.
Here's my problem. I have two matrices A and B, with complex entries, of dimensions (n,n,m,m) and (n,n) respectively.
Below is the operation I perform to get a matrix C -
C = np.sum(B[:,:,None,None]*A, axis=(0,1))
Computing the above once takes about 6-8 seconds. Since I have to compute many such Cs, it takes a lot of time. Is there a faster way to do this? (I'm doing these using JAX NumPy on a multi-core CPU; normal NumPy takes even longer)
n=77 and m=512, if you are wondering. I can parallelize as I'm working on a cluster, but the sheer size of the arrays consumes a lot of memory.
It looks like you want einsum:
C = np.einsum('ijkl,ij->kl', A, B)
With numpy on a Colab CPU I get this:
import numpy as np
x = np.random.rand(50, 50, 500, 500)
y = np.random.rand(50, 50)
def f1(x, y):
return np.sum(y[:,:,None,None]*x, axis=(0,1))
def f2(x, y):
return np.einsum('ijkl,ij->kl', x, y)
np.testing.assert_allclose(f1(x, y), f2(x, y))
%timeit f1(x, y)
# 1 loop, best of 5: 1.52 s per loop
%timeit f2(x, y)
# 1 loop, best of 5: 620 ms per loop
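Since the question mentions JAX, the same contraction can also be jit-compiled there; this is a sketch beyond the original answer (timing it requires block_until_ready because JAX dispatches asynchronously):
import jax
import jax.numpy as jnp

@jax.jit
def f3(x, y):
    # Same contraction as f2, compiled with XLA
    return jnp.einsum('ijkl,ij->kl', x, y)

# xj, yj = jnp.asarray(x), jnp.asarray(y)
# f3(xj, yj).block_until_ready()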
I have the following data structures in numpy:
import numpy as np
a = np.random.rand(267, 173) # dense img matrix
b = np.random.rand(199) # array of probability samples
My goal is to take each entry i in b, find the x,y coordinates/index positions of all values in a that are <= i, then randomly select one of the values in that subset:
from random import randint
for i in b:
l = np.argwhere(a <= i) # list of img coordinates where pixel <= i
sample = l[randint(0, len(l)-1)] # random selection from `l`
This "works", but I'd like to vectorize the sampling operation (i.e. replace the for loop with apply_along_axis or similar). Does anyone know how this can be done? Any suggestions would be greatly appreciated!
You can't exactly vectorize np.argwhere because the subset has a different size every time. What you can do, though, is speed up the computation pretty dramatically with sorting. Sorting the image once creates a single allocation, while masking the image at every step creates a temporary array for the mask and another for the extracted elements. With a sorted image, you can just apply np.searchsorted to get the subset sizes:
a_sorted = np.sort(a.ravel())
indices = np.searchsorted(a_sorted, b, side='right')
You still need a loop to do the sampling, but you can do something like
samples = np.array([a_sorted[np.random.randint(i)] for i in indices])
Getting x-y coordinates instead of sample values is a bit more complicated with this system. You can use np.unravel_index to get the indices, but first you must convert from the reference frame of a_sorted to that of a.ravel(). If you sort using np.argsort instead of np.sort, you can get the indices into the original array. Fortunately, np.searchsorted supports this exact scenario with its sorter parameter:
a_ind = np.argsort(a, axis=None)
indices = np.searchsorted(a.ravel(), b, side='right', sorter=a_ind)
r, c = np.unravel_index(a_ind[[np.random.randint(i) for i in indices]], a.shape)
r and c are the same size as b, and correspond to the row and column indices in a of each selection based on b. The index conversion depends on the strides in your array, so we'll assume that you're using C order, as 90% of arrays will do by default.
Complexity
Let's say b has size M and a has size N.
Your current algorithm does a linear search through each element of a for each element of b. At each iteration, it allocates a mask for the matching elements (N/2 on average), and then a buffer of the same size to hold the masked choices. This means that the time complexity is on the order of O(M * N) and the space complexity is the same.
My algorithm sorts a first, which is O(N log N). Then it searches for M insertion points, which is O(M log N). Finally, it selects M samples. The space it allocates is one sorted copy of the image and two arrays of size M. It is therefore of O((M + N) log N) time complexity and O(M + N) in space.
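Putting the pieces above together, the loop-based sampling can be wrapped in a small helper; the function name and the use of a seeded Generator are assumptions, not part of the original answer:
def sample_coords(a, b, rng=None):
    # Sort once, then draw one random index below each insertion point
    rng = np.random.default_rng() if rng is None else rng
    a_ind = np.argsort(a, axis=None)
    indices = np.searchsorted(a.ravel(), b, side='right', sorter=a_ind)
    # like the original loop, this assumes each b value is >= at least one pixel in a
    picks = [rng.integers(i) for i in indices]
    return np.unravel_index(a_ind[picks], a.shape)

# r, c = sample_coords(a, b, rng=np.random.default_rng(0))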
Here is an alternative approach: argsort b instead, then bin a accordingly using a (tweaked) np.digitize, following the idea from this post:
import numpy as np
from scipy import sparse
from timeit import timeit
import math
def h_digitize(a,bs,right=False):
mx,mn = a.max(),a.min()
asz = mx-mn
bsz = bs[-1]-bs[0]
nbins=int(bs.size*math.sqrt(bs.size)*asz/bsz)
bbs = np.concatenate([[0],((nbins-1)*(bs-mn)/asz).astype(int).clip(0,nbins),[nbins]])
bins = np.repeat(np.arange(bs.size+1), np.diff(bbs))
bbs = bbs[:bbs.searchsorted(nbins)]
bins[bbs] = -1
aidx = bins[((nbins-1)*(a-mn)/asz).astype(int)]
ambig = aidx == -1
aa = a[ambig]
if aa.size:
aidx[ambig] = np.digitize(aa,bs,right)
return aidx
def f_pp():
bo = b.argsort()
bs = b[bo]
aidx = h_digitize(a,bs,right=True).ravel()
aux = sparse.csr_matrix((aidx,aidx,np.arange(aidx.size+1)),
(aidx.size,b.size+1)).tocsc()
ridx = np.empty(b.size,int)
ridx[bo] = aux.indices[np.fromiter(map(np.random.randint,aux.indptr[1:-1].tolist()),int,b.size)]
return np.unravel_index(ridx,a.shape)
def f_mp():
a_ind = np.argsort(a, axis=None)
indices = np.searchsorted(a.ravel(), b, sorter=a_ind, side='right')
return np.unravel_index(a_ind[[np.random.randint(i) for i in indices]], a.shape)
a = np.random.rand(267, 173) # dense img matrix
b = np.random.rand(199) # array of probability samples
# round to test whether equality is handled correctly
a = np.round(a,3)
b = np.round(b,3)
print('pp',timeit(f_pp, number=1000),'ms')
print('mp',timeit(f_mp, number=1000),'ms')
# sanity checks
S = np.max([a[f_pp()] for _ in range(1000)],axis=0)
T = np.max([a[f_mp()] for _ in range(1000)],axis=0)
print(f"inequality satisfied: pp {(S<=b).all()} mp {(T<=b).all()}")
print(f"largest smalles distance to boundary: pp {(b-S).max()} mp {(b-T).max()}")
print(f"equality done right: pp {not (b-S).all()} mp {not (b-T).all()}")
Using a tweaked digitize I'm a bit faster, but this may vary with problem size. Also, @MadPhysicist's solution is much less convoluted. With the standard digitize we are about equal.
pp 2.620121960993856 ms
mp 3.301037881989032 ms
inequality satisfied: pp True mp True
largest smallest distance to boundary: pp 0.0040000000000000036 mp 0.006000000000000005
equality done right: pp True mp True
A slight improvement on @MadPhysicist's algorithm to make it more vectorized:
%%timeit
a_ind = np.argsort(a, axis=None)
indices = np.searchsorted(a.ravel(), b, sorter=a_ind)
r, c = np.unravel_index(a_ind[[np.random.randint(i) for i in indices]], a.shape)
100 loops, best of 3: 6.32 ms per loop
%%timeit
a_ind = np.argsort(a, axis=None)
indices = np.searchsorted(a.ravel(), b, sorter=a_ind)
r, c = np.unravel_index(a_ind[(np.random.rand(indices.size) * indices).astype(int)], a.shape)
100 loops, best of 3: 4.16 ms per loop
@PaulPanzer's solution still rules the field, though I'm not sure what it's caching:
%timeit f_pp()
The slowest run took 14.79 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 1.88 ms per loop
I have a list of points in (x,y) pairs, which represents the positions of a list of agents. For example, given 3 agents, there are 3 pairs of points, which I store as follows:
points = np.array([[x1, y1],
[x2, y2],
[x3, y3]])
I would like to calculate a second array containing the relative position from each agent to every other agent, but NOT to itself. So, using the data above, I would like to generate the array relative_positions from the array points. points can hold N positions (I can have upwards of 50-100 agents at any one time).
So using points described above, I would like to produce the output:
relative_positions = [[x2-x1, y2-y1],
[x3-x1, y3-y1],
[x1-x2, y1-y2],
[x3-x2, y3-y2],
[x1-x3, y1-y3],
[x2-x3, y2-y3]]
For example, given four agent positions stored as a numpy array:
agent_points = np.array([[10, 1],
[30, 3],
[25, 10],
[5, 5]])
I would like to generate the output:
relative_positions = [[30-10, 3-1],
[25-10, 10-1],
[5-10, 5-1],
[10-30, 1-3],
[25-30, 10-3],
[5-30, 5-3],
[10-25, 1-10],
[30-25, 3-10],
[5-25, 5-10],
[10-5, 1-5],
[30-5, 3-5],
[25-5, 10-5]]
How do I effectively go about doing this? I have thought about just calculating every possible difference and removing the zero cases (where it's the relative position from an agent to itself), but I don't think that is a "pure" way to do it, since I could accidentally remove an agent that just happens to be at exactly the same point (or very close to it).
Approach #1
With a as the input array, you can do -
d = (a-a[:,None,:])
valid_mask = ~np.eye(len(a),dtype=bool)
out = d[valid_mask]
Basically, we extend a to 3D so that the first axis is outer-broadcastable, then subtract its 2D version, resulting in an m x m x 2 output, with m being a.shape[0]. Schematically put -
a[:, None, :] : 4 x 1 x 2
a : 4 x 2
output : 4 x 4 x 2
Another way to create valid_mask would be -
r = np.arange(len(a))
valid_mask = r[:,None] != r
Approach #2
We will leverage np.lib.stride_tricks.as_strided to get a no-diagonal view of a 3D array (along the first two axes) and use it here on the differences array d. This is inspired by a 2D array problem posted here, and for the 3D case it would look something like this -
def nodiag_view3D(a):
m = a.shape[0]
p,q,r = a.strides
return np.lib.stride_tricks.as_strided(a[:,1:], shape=(m-1,m,2), strides=(p+q,q,r))
To solve our problem, it would be -
d = (a-a[:,None,:])
out = nodiag_view3D(d).reshape(-1,a.shape[1])
Timings to showcase how Approach #2 improves upon Approach #1
In [96]: a = np.random.rand(5000,2)
In [97]: d = (a-a[:,None,:])
In [98]: %%timeit
...: valid_mask = ~np.eye(len(a),dtype=bool)
...: out = d[valid_mask]
1 loop, best of 3: 763 ms per loop
In [99]: %%timeit
...: r = np.arange(len(a))
...: valid_mask = r[:,None] != r
...: out = d[valid_mask]
1 loop, best of 3: 767 ms per loop
In [100]: %timeit nodiag_view3D(d).reshape(-1,a.shape[1])
10 loops, best of 3: 177 ms per loop
While I don't have a numpy-specific solution (I'm sure one exists), a double for-loop with an identity check can do the trick just fine. It will get slow as points grows, though. As shown below:
points = [
[x1, y1],
[x2, y2],
[x3, y3]
]
relative_positions = []
for point1 in points:
    for point2 in points:
        if id(point1) != id(point2):
            # relative position of point2 with respect to point1
            relative_positions.append([point2[0] - point1[0], point2[1] - point1[1]])
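For what it's worth, a more compact pure-Python version of the same idea can use itertools.permutations, which pairs points by position rather than by value (so two agents at identical coordinates are still kept apart); this is a sketch, not part of the original answer:
from itertools import permutations

relative_positions = [[p2[0] - p1[0], p2[1] - p1[1]]
                      for p1, p2 in permutations(points, 2)]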
I am trying to implement PCA in python.
Currently I am using this code to map the data back to the original dimensions from the low-dimensional representation and the principal components:
sameDimRepresentation = lowDimRepresentation[:, np.newaxis] * principalComponents.T
sameDimRepresentation = sameDimRepresentation.sum(axis=2)
What the code does:
for each row in lowDimRepresentation it computes the product between each element of the row (seen as a scalar) and each of the row vectors of principal components (column vectors of principalComponents.T) and then it sums all these product vectors up (line 2)
lowDimRepresentation: an array of x by 100
principalComponents: an array of 100 by 784
resulting array: x by 784
This method works fine when using x = 10000 but after that I get a memory error.
I know einsum is more memory-efficient; I was trying to rewrite the same code with it, but I did not manage.
Can someone help me with that?
Worst case I just split the 60000 cases into batches of 10000 and I run those bits, but I was hoping to achieve something more elegant.
Thanks a lot!
So there's good news and there's bad news. The good news is that the einsum version is very simple:
np.einsum('ij,jl->il', lowDimRepresentation, principalComponents)
For example:
>>> import numpy as np
>>> x = 1000
>>> lowDimRepresentation = np.random.random((x, 100))
>>> principalComponents = np.random.random((100, 784))
>>> sameDimRepresentation = (lowDimRepresentation[:, np.newaxis] * principalComponents.T).sum(axis=2)
>>> esum_same = np.einsum('ij,jl->il', lowDimRepresentation, principalComponents)
>>> np.allclose(sameDimRepresentation, esum_same)
True
This should also be a little faster:
>>> %timeit sameDimRepresentation = (lowDimRepresentation[:, np.newaxis] * principalComponents.T).sum(axis=2)
1 loops, best of 3: 1.12 s per loop
>>> %timeit esum_same = np.einsum('ij,jl->il', lowDimRepresentation, principalComponents)
10 loops, best of 3: 88.7 ms per loop
The bad news is that when I try applying it to the x=60000 case:
>>> esum_same = np.einsum('ij,jl->il', lowDimRepresentation, principalComponents)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: iterator is too large
So I'm not sure whether it'll actually help with your real problem.
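For what it's worth, this particular contraction is just a matrix product, so a plain dot (or the @ operator) computes the same thing without going through einsum's iterator and may sidestep that size limit; here is a quick check on the small example above, as a suggestion beyond the original answer:
>>> dot_same = lowDimRepresentation.dot(principalComponents)
>>> np.allclose(esum_same, dot_same)
True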
I'm working to implement the following equation:
X = (Y.T * Y + Y.T * C * Y)^-1
Y is a (n x f) matrix and C is (n x n) diagonal one; n is about 300k and f will vary between 100 and 200. As part of an optimization process this equation will be used almost 100 million times so it has to be processed really fast.
Y is initialized randomly and C is a very sparse matrix: only a few of the 300k diagonal entries are different from 0. Since NumPy's diagonal functions create dense matrices, I created C as a sparse CSR matrix. But when trying to compute the first part of the equation:
r = dot(C, Y)
The computer crashes due to memory limits. I then decided to convert Y to a csr_matrix and perform the same operation:
r = dot(C, Ysparse)
and this approach took 1.38 ms. But this solution is somewhat "tricky" since I'm using a sparse matrix to store a dense one, and I wonder how efficient this really is.
So my question is: is there some way of multiplying the sparse C and the dense Y without having to turn Y sparse, while improving performance? If C could somehow be represented as a dense diagonal without consuming tons of memory, maybe that would lead to very efficient performance, but I don't know if this is possible.
I appreciate your help!
The reason the dot product runs into memory issues when computing r = dot(C, Y) is that numpy's dot function has no native support for sparse matrices. What happens is that numpy treats the sparse matrix C as a Python object rather than a numpy array. If you inspect a small-scale example you can see the problem first hand:
>>> from numpy import dot, array
>>> from scipy import sparse
>>> Y = array([[1,2],[3,4]])
>>> C = sparse.csr_matrix(array([[1,0], [0,2]]))
>>> dot(C,Y)
array([[ (0, 0) 1
(1, 1) 2, (0, 0) 2
(1, 1) 4],
[ (0, 0) 3
(1, 1) 6, (0, 0) 4
(1, 1) 8]], dtype=object)
Clearly the above is not the result you are interested in. Instead what you want to do is compute using scipy's sparse.csr_matrix.dot function:
r = sparse.csr_matrix.dot(C, Y)
or more compactly
r = C.dot(Y)
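Continuing the small example above, the sparse-aware product gives the dense result you would expect (a quick sanity check added here, not part of the original answer):
>>> r = C.dot(Y)
>>> np.array(r)
array([[1, 2],
       [6, 8]])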
Try:
import numpy as np
from scipy import sparse
f = 100
n = 300000
Y = np.random.rand(n, f)
Cdiag = np.random.rand(n) # diagonal of C
Cdiag[np.random.rand(n) < 0.99] = 0
# Compute Y.T * C * Y, skipping zero elements
mask = np.flatnonzero(Cdiag)
Cskip = Cdiag[mask]
def ytcy_fast(Y):
Yskip = Y[mask,:]
CY = Cskip[:,None] * Yskip # broadcasting
return Yskip.T.dot(CY)
%timeit ytcy_fast(Y)
# For comparison: all-sparse matrices
C_sparse = sparse.spdiags([Cdiag], [0], n, n)
Y_sparse = sparse.csr_matrix(Y)
%timeit Y_sparse.T.dot(C_sparse * Y_sparse)
My timings:
In [59]: %timeit ytcy_fast(Y)
100 loops, best of 3: 16.1 ms per loop
In [18]: %timeit Y_sparse.T.dot(C_sparse * Y_sparse)
1 loops, best of 3: 282 ms per loop
First, are you really sure you need to perform a full matrix inversion in your problem? Most of the time, one only really needs to compute x = A^-1 y, which is a much easier problem to solve.
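For instance, if what is ultimately needed is x = A^-1 y rather than A^-1 itself, a sketch along these lines avoids the explicit inversion entirely (names follow the question, the sizes are scaled down for a quick demo, and c stands for the diagonal of C; this is an illustration, not the questioner's actual setup):
import numpy as np

n, f = 10000, 150                      # smaller than the question's 300k for a quick demo
Y = np.random.rand(n, f)
c = np.zeros(n)                        # diagonal of the sparse C, mostly zeros
c[np.random.choice(n, 50, replace=False)] = np.random.rand(50)

A = Y.T @ Y + Y.T @ (c[:, None] * Y)   # (f, f), small and dense
y = np.random.rand(f)
x = np.linalg.solve(A, y)              # solves A x = y without ever forming A^-1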
If you really do need the inverse, I would consider computing an approximation of it instead of a full matrix inversion, since exact inversion is really costly. See for example the Lanczos algorithm for an efficient approximation of the inverse matrix. As a bonus, the approximation can be stored sparsely. Plus, it requires only matrix-vector operations, so you don't even have to store the full matrix to invert.
As an alternative, using pyoperators, you can also use the .todense method to compute the matrix to invert using efficient matrix-vector operations. There is a special sparse container for diagonal matrices.
For an implementation of the Lanczos algorithm, you can have a look at pyoperators (disclaimer: I am one of the coauthors of this piece of software).
I don't know if it was possible when the question was asked, but nowadays broadcasting is your friend. An n*n diagonal matrix need only be stored as an array of its diagonal elements to be used in a matrix product:
>>> n, f = 5, 3
>>> Y = np.random.randint(0, 10, (n, f))
>>> C = np.random.randint(0, 10, (n,))
>>> Y.shape
(5, 3)
>>> C.shape
(5,)
>>> np.all(Y.T @ np.diag(C) @ Y == Y.T*C @ Y)
True
Do note that Y.T*C @ Y is non-associative:
>>> Y.T*(C @ Y)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (3,5) (3,)
But Y.T @ (C[:, np.newaxis]*Y) would yield the expected result:
>>> np.all(Y.T*C @ Y == Y.T @ (C[:, np.newaxis]*Y))
True