Python numpy: "Array is too big"

import numpy
from scipy.spatial.distance import pdist
X = numpy.zeros(50000,25)
C = pdist(X, 'euclidian')
I want to compute the pairwise distances, but numpy gives the error: "Array is too big".
I think the problem is the size of C: pdist cannot create a (50000, 50000) array. I don't know why numpy restricts this; I can run the same code in MATLAB. How can I run this code?
I also found possible duplicates, but their array/matrix sizes are even bigger:
Is it possible to create a 1million x 1 million matrix using numpy?
Very large matrices using Python and NumPy

First, there are a couple of typos in your code. It should be:
X = numpy.zeros((50000,25)) # it's a tuple going in
C = pdist(X, 'euclidean') # euclidean with an e
Of course, that does not matter for the question.
The Euclidean pdist is essentially a call to numpy.linalg.norm (http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html). It's a very general function. If it does not work in your case due to memory constraints, you can always compute the distances yourself. Two rows of X take hardly any memory, and this computes one pairwise distance:
np.sqrt(np.sum(np.square(X[0] - X[1])))  # Euclidean distance between rows 0 and 1
And then you only need to loop through the whole thing.
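If you need many (or all) of the distances, a chunked variant keeps memory bounded. A minimal sketch, assuming scipy's cdist and a block size you tune to your available RAM:
import numpy as np
from scipy.spatial.distance import cdist

X = np.zeros((50000, 25))
chunk = 1000  # rows per block; adjust to the memory you have available

for start in range(0, X.shape[0], chunk):
    # distances from rows [start, start+chunk) to all rows; shape (chunk, 50000)
    block = cdist(X[start:start + chunk], X, 'euclidean')
    # ... process `block` here (thresholding, nearest neighbours, ...) before the next block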
Hope it helps,
P

Related

Eigenvalues of a Scipy Sparse Block Diagonal Matrix

I am trying to find the eigenvalues of many small matrices without using a loop, with the intent of moving to CuPy later on.
To that end, I set up a large matrix that has the matrices I want to solve as blocks on its diagonal. This matrix contains a lot of unnecessary zeros, so I use scipy.sparse.
All works well until I want to find the eigenvalues: eigs() computes the full (dense) eigenvectors for the problem, even though most of their entries should also be zero.
import numpy as np
from scipy import sparse as sp
from scipy.sparse.linalg import spsolve, eigs
sigx=np.array([[0, 1],[1, 0]], dtype=np.complex128) # a 2x2 Pauli matrix
karray=np.arange(-np.pi, np.pi, np.pi/100) #200 elements
H_sci=sp.kron(sp.diags(karray), sigx) #The sparse matrix I want to find the eigenvalues to
H_reg=H_sci.toarray() #Converted into a regular numpy array to see the memory difference
print(H_sci.data.nbytes) #12800 = 2*2*200*16, reminder that 16 bytes = 128 bits --> saves 4 arrays of length 200
print(H_reg.nbytes) #2560000 = 2*2*200*200*16 --> saves the entire matrix
E_sci=eigs(H_sci, k=398) #throws an error for k=400 and 399, even though I should have 400 eigenvalues?
print(E_sci[1].data.nbytes) #2547200 --> as much as H_reg
Am I doing something wrong? Is there an alternative approach to solving many matrices (here 2x2, for example) in parallel? I have used Numba to loop over the matrices before, but I would like to try my GPU to see whether I can speed this problem up, because I do not see why I should solve these matrices one after another.
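One possible direction (an editorial sketch, not part of the original question): instead of one block-diagonal sparse matrix, stack the 2x2 blocks into a (200, 2, 2) array and use the batched numpy.linalg.eigvals (NumPy >= 1.8); CuPy provides a batched eigvalsh for the Hermitian case, so the same layout carries over to the GPU.
import numpy as np

sigx = np.array([[0, 1], [1, 0]], dtype=np.complex128)
karray = np.arange(-np.pi, np.pi, np.pi / 100)   # 200 values

blocks = karray[:, None, None] * sigx            # shape (200, 2, 2), one block per k
evals = np.linalg.eigvals(blocks)                # shape (200, 2): all 400 eigenvalues at once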

Calculate a trace of large matrix in python

I have a matrix X and I need to write a function which calculates the trace of the matrix product np.dot(X, X.T).
I wrote the following script:
import numpy as np
def test(matrix):
    return (np.dot(matrix, matrix.T)).trace()
np.random.seed(42)
matrix = np.random.uniform(size=(1000, 1))
print(test(matrix))
It works fine on a small matrix, but when I try it on a large matrix (for example one with shape (50000, 1)), it gives me a memory error.
I tried to find a solution to the problem in other questions on the site, but nothing helped me. I would be grateful for any advice!
The number you're trying to compute is just the sum of the squares of all entries of X. Sum the squares instead of computing a giant matrix product full of entries you don't want:
return (X**2).sum()
Or ravel the matrix and use dot, which is probably faster for contiguous X:
raveled = X.ravel()
return raveled.dot(raveled)
Actually, ravel is probably faster for non-contiguous X, too - even when ravel needs to copy, it's not doing more allocation than (X**2).sum().
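A quick sanity check of the identity (my own example, not from the answer): the sum of squares and the raveled dot product both equal the trace of the full product on a small X:
import numpy as np

np.random.seed(42)
X = np.random.uniform(size=(1000, 1))
expected = np.dot(X, X.T).trace()            # the memory-hungry version
assert np.isclose((X ** 2).sum(), expected)
assert np.isclose(X.ravel().dot(X.ravel()), expected)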

Numpy, avoid loop in 3d array difference nested summation

I have a simple problem for Numpy: I have 3d coordinates and I want to compute the overlap between two distinct configurations with the following function
import numpy as np

def Overlap(rt, r0, a):
    s = 0
    for i in range(len(rt)):
        # pl.norm in the original was pylab's norm; numpy.linalg.norm is equivalent here
        s += ((np.linalg.norm(r0[i] - rt, axis=1) <= a).astype('int')).sum()
    return s
Where rt and r0 represent two m by 3 tables, the configurations.
Practically, it computes the distance between each vector in the first configuration and every vector in the second, checks it against a threshold value a, and returns the total count after looping over all positions. Is there a smart way to avoid the explicit for loop? I have the feeling that the complexity cannot really be reduced, but there may be a way to avoid the slowness of the native for construct.
How about the following:
from scipy.spatial.distance import cdist
import numpy as np
overlap = np.sum(cdist(rt, r0) <= a)
When m is 1000, this is about 9x faster on my machine; the speedup is even larger for small arrays.
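If m grows to the point where the full (m, m) distance matrix no longer fits in memory, a chunked variant (my own sketch, same result as the one-liner above) keeps most of the speedup:
import numpy as np
from scipy.spatial.distance import cdist

def overlap_chunked(rt, r0, a, chunk=2000):
    total = 0
    for start in range(0, len(rt), chunk):
        # only a (chunk, m) block of distances is in memory at a time
        total += np.sum(cdist(rt[start:start + chunk], r0) <= a)
    return total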

Numpy linalg on multidimensional arrays

Is there a way to use numpy.linalg.det or numpy.linalg.inv on an nx3x3 array (a line in a multiband image), for example? Right now I am doing something like:
det = numpy.array([numpy.linalg.det(i) for i in X])
but surely there is a more efficient way. Of course, I could use map:
det = numpy.array(list(map(numpy.linalg.det, X)))  # list() is needed under Python 3
Any other more direct way?
I'm pretty sure there is no substantially more efficient way than what you have. You can save some memory by first creating an empty array for the results and writing all results directly to that array:
res = numpy.empty_like(X)
for i, A in enumerate(X):
    res[i] = numpy.linalg.inv(A)
This won't be any faster, though -- it will only use less memory.
a "normal" determinant is only defined for a matrix (dimension=2), so if that's what you want i don't see another way.
if you really want to compute the determinant of a cube then you could try to implement one of the ways described here:
http://en.wikipedia.org/wiki/Hyperdeterminant
notice that it is not necessarily the same value as the one you're currently computing.
New answer to an old question: since version 1.8.0, numpy supports evaluating a batch of 2D matrices. For a batch of MxM matrices, the input and output now look like:
linalg.det(a)
    Compute the determinant of an array.
    Parameters: a : (..., M, M) array_like
        Input array to compute determinants for.
    Returns: det : (...) array_like
        Determinant of a.
Note the ellipsis. There can be multiple "batch dimensions", so for example you can evaluate determinants on a meshgrid.
https://numpy.org/doc/stable/reference/generated/numpy.linalg.det.html
https://numpy.org/doc/stable/reference/generated/numpy.linalg.inv.html
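A short usage sketch of the batched interface (my own example, assuming NumPy >= 1.8 and an (n, 3, 3) stack X as in the question):
import numpy as np

X = np.random.rand(10, 3, 3)      # a stack of ten 3x3 matrices
det = np.linalg.det(X)            # shape (10,): one determinant per matrix
inv = np.linalg.inv(X)            # shape (10, 3, 3): one inverse per matrix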

Numpy, all pairwise correlations of a 3d array

I have an array of shape (l,m,n). I'm trying to calculate a distance matrix of shape (l,m,n) where entry (i,j,k) is the coefficient between vectors (i,j,:) and (i,:,k). I haven't found anything in numpy or scipy that fits the bill.
I tried using a for loop, iterating along axis 0 and feeding each slice to scipy.spatial.distance.pdist, but that takes a long time, as pdist itself uses a nested for loop. In essence, what I would like to do is perform pdist down axis 0, ideally in a way where pdist doesn't use for loops either...
Any thoughts?
I would personally write a little Cython function to do this ( http://cython.org). Write and test an iterative pure Python version (with for loops), move it to a .pyx Cython file, add type declarations and follow the NumPy integration guide:
http://docs.cython.org/src/tutorial/numpy.html
It might seem like work, but if you're doing computing in Python, some basic Cython skills are well worth cultivating, as they make writing C extensions much easier.
Any thoughts?
First thought is that you cannot compute such distances as long as m != n.
Second thought is that the internal loops of pdist should not bother you, since they are written in C, so the probable bottleneck is not the implementation but the sheer amount of computation needed.
Final thought is that your problem may be solved by numpy.einsum and linear algebra:
Code (which I assume to be optimal):
products = numpy.einsum('ijl, ilk -> ijk', x, x)  # x is your (l, m, n) array; this requires m == n
distances = numpy.einsum('ijj -> ij', products)
distances = distances[:, :, None] + distances[:, None, :] - 2 * products
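A related sketch of my own (not from the answer above): if what you actually want is the pairwise squared Euclidean distances between the rows of each (m, n) slice, i.e. pdist applied down axis 0, the batched Gram-matrix route works for any m and n:
import numpy as np

x = np.random.rand(4, 6, 5)                        # (l, m, n)
sq = np.einsum('ijl, ijl -> ij', x, x)             # squared row norms, shape (l, m)
gram = np.einsum('ijl, ikl -> ijk', x, x)          # batched x @ x.T, shape (l, m, m)
d2 = sq[:, :, None] + sq[:, None, :] - 2 * gram    # squared distances, shape (l, m, m)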
