With the following code snippet, I am trying to generate a vector where each element is drawn from a different normal distribution. The "mean" and "standard deviation" values (the arguments to random.normal) are taken from two numpy vectors, meanVect and varVect. Both vectors have the same shape as the vector to be generated.
I am using a list comprehension as a quick and dirty fix to achieve this. Is there a numpy-specific approach that is more efficient than my current solution?
import numpy as np
from numpy import random
meanVect = np.random.rand(1,100) # using random vectors for MWE
varVect = np.random.rand(1,100) # originally, vectors from a different source are used
# the vectors have shape (1, 100), so index into row 0
newVect = [random.normal(meanVect[0][i], varVect[0][i]) for i in range(len(meanVect[0]))]
Since np.random.normal takes array-like inputs for loc and scale, you can just do:
newVect = np.random.normal(meanVect, varVect)
As long as both input vectors have the same .shape, this should work.
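A minimal sketch of the vectorized call, assuming the second vector really holds standard deviations (the `scale` argument of np.random.normal is a standard deviation, not a variance). The seed and the name `newVectFromVar` are only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)      # seeded only to make the sketch reproducible
meanVect = rng.random((1, 100))     # stand-in data, as in the question
varVect = rng.random((1, 100))

# One call draws every element from its own distribution.
newVect = rng.normal(loc=meanVect, scale=varVect)

# If varVect actually held variances, the square root would be needed:
newVectFromVar = rng.normal(loc=meanVect, scale=np.sqrt(varVect))
```

The result has the same shape as the inputs, here (1, 100).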
I am generating a series of Gaussian arrays given an x vector of length 1400 and arrays for sigma, center and amplitude (amp), all of length 100. I thought the best way to speed this up would be to use numpy and a list comprehension:
g = np.sum([(amp[i]*np.exp(-0.5*(x - (center[i]))**2/(sigma[i])**2)) for i in range(len(center))],axis=0)
Each row is a gaussian along a vector x, and then I sum the columns into a single array of length x.
But this doesn't seem to speed things up at all. I think there is a faster way to do this while avoiding the for loop but I can't quite figure out how.
You should use vectorized computation instead of a comprehension, so that all the loops run at C speed.
In order to do so you have to reshape x to be a column vector. For example you could do x = x.reshape((1400,1)).
Then you can operate directly on the arrays, like this:
v = amp*np.exp(-0.5*(x - center)**2/sigma**2)
This gives an array of shape (1400,100), which you can sum into a vector with np.sum(v, axis=1)
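Here is a small self-contained sketch of the broadcasting approach described above, checked against the original comprehension. The input values are made up for illustration; only the shapes (1400 and 100) come from the question:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 1400)          # made-up grid
amp = rng.random(100)                     # made-up amplitudes
center = rng.uniform(0.0, 10.0, 100)      # made-up centers
sigma = rng.uniform(0.1, 1.0, 100)        # made-up widths

# x as a column (1400, 1) broadcasts against the (100,) parameter arrays.
v = amp * np.exp(-0.5 * (x[:, None] - center) ** 2 / sigma ** 2)  # (1400, 100)
g = v.sum(axis=1)                                                  # (1400,)

# Reference: the list comprehension from the question.
g_loop = np.sum(
    [amp[i] * np.exp(-0.5 * (x - center[i]) ** 2 / sigma[i] ** 2)
     for i in range(len(center))],
    axis=0,
)
```

Both versions should agree up to floating-point rounding, which np.allclose(g, g_loop) can confirm.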
You should try to vectorize all the operations. IMHO the most efficient way is to first convert your input data to numpy arrays (if they were plain Python lists) and then let numpy do the computations:
np_amp = np.array(amp)
np_center = np.array(center)
np_sigma = np.array(sigma)
np_x = np.asarray(x)[:, None]  # column of shape (1400, 1) so it broadcasts against the (100,) arrays
g = np.sum(np_amp*np.exp(-0.5*(np_x - np_center)**2/np_sigma**2), axis=1)
I want to find the covariance of a 10304*280 matrix (i.e. 280 variables, each with 10304 subjects), and I am using the following numpy function to do this.
cov = numpy.cov(matrix)
I expected a 280*280 matrix as a result, but it returned a 10304*10304 matrix.
As suggested in the previous answer, you can change your memory layout.
An easy way to do this in 2d is simply transposing the matrix:
import numpy as np
r = np.random.rand(100, 10)
np.cov(r).shape # is (100,100)
np.cov(r.T).shape # is (10,10)
But you can also specify the rowvar flag, which is described in the numpy.cov documentation:
import numpy as np
r = np.random.rand(100, 10)
np.cov(r).shape # is (100,100)
np.cov(r, rowvar=False).shape # is (10,10)
I think especially for large matrices this might be more performant, since you avoid the swapping/transposing of axes.
I thought about this and wondered if the algorithm is actually different depending on rowvar == True or rowvar == False. Well, as it turns out, if you change the rowvar flag, numpy simply transposes the array itself :P. You can see this in the source of numpy.cov.
So, in terms of performance, nothing will change between the two versions.
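A quick check that the two variants produce the same matrix, using made-up data of the same shape as the examples above:

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.random((100, 10))            # 100 observations of 10 variables

cov_t = np.cov(r.T)                  # transpose explicitly
cov_flag = np.cov(r, rowvar=False)   # let numpy treat columns as variables

same = np.allclose(cov_t, cov_flag)
```

Here `same` should come out True, and both results have shape (10, 10).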
Here is what the numpy.cov(m, y=None, ...) documentation says:
m : array_like A 1-D or 2-D array containing multiple variables and
observations. Each row of m represents a variable, and each column a
single observation of all those variables...
So if your matrix contains 280 variables with 10304 samples each, it is supposed to be a 280*10304 matrix instead of a 10304*280 one. The simple solution is the same as the others suggest:
swap_matrix = numpy.swapaxes(matrix, 0, 1)
cov = numpy.cov(swap_matrix)
Suppose I have a grid of numbers in Python that I have created using
import numpy as np
h = np.linspace(0,20,100)
I am trying to make a random selection within the elements of h in a way that the distribution of the selections follows for example the log-normal distribution, with a given mean and standard deviation. How would I be able to do this?
May be easier to just draw samples from a lognormal distribution
This can be solved quickly. First, you have to find a way to draw random indices following your custom pdf. Once you have done this, you can use these indices (integers from 0 to 99, one per element of h) to return the entries of the array at those positions.
There are a few ways to draw numbers randomly in this way in Python. Once you have drawn your random indices into an array called indices, you can use:
result = h[indices]
to create your desired numpy array.
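One possible way to draw such indices is to weight each grid point by the log-normal density and pass the normalized weights to a weighted choice. This is only a sketch: `mu` and `sigma` are hypothetical parameters of the underlying normal (that is, of log(h)), not the mean and standard deviation of the log-normal itself, and the sample size is arbitrary.

```python
import numpy as np

h = np.linspace(0, 20, 100)
mu, sigma = 1.0, 0.5                 # hypothetical log-space parameters

# Log-normal density on the grid; points with h <= 0 get zero weight.
weights = np.zeros_like(h)
pos = h > 0
weights[pos] = (
    np.exp(-(np.log(h[pos]) - mu) ** 2 / (2 * sigma ** 2))
    / (h[pos] * sigma * np.sqrt(2 * np.pi))
)
p = weights / weights.sum()          # probabilities must sum to 1

rng = np.random.default_rng(0)
indices = rng.choice(len(h), size=1000, p=p)   # weighted index draw
result = h[indices]
```

Every entry of result is an element of h, and values near exp(mu) are picked most often.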
import numpy
from scipy.spatial.distance import pdist
X = numpy.zeros(50000,25)
C = pdist(X, 'euclidian')
I want to find:
And then numpy gives the error: Array is too big.
I think the problem is the size of array C. pdist cannot create a (50000,50000) array. I don't know why numpy imposes this restriction. I can run the same code in MATLAB. How can I run this code using an array?
I also found possible duplicates, but the arrays in those questions are much bigger:
Is it possible to create a 1million x 1 million matrix using numpy?
Very large matrices using Python and NumPy
First, there are a couple of typos in your code. It's:
X = numpy.zeros((50000,25)) # it's a tuple going in
C = pdist(X, 'euclidean') # euclidean with an e
Of course, that does not matter for the question.
The Euclidean pdist is just a call to numpy.linalg.norm (http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html). It's a very general function. If it does not work in your case due to memory constraints, you can always write something yourself. Two rows of X do not take much memory, and this computes one pairwise distance:
np.sqrt(np.sum(np.square(X[0] - X[1])))
And then you only need to loop through the whole thing.
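A rough sketch of such a loop, working on blocks of rows instead of single pairs so that most of the work stays inside SciPy. The helper name `pairwise_blocks` and the `block_size` parameter are made up for this example, and it assumes each block is reduced (here to a running maximum) or written to disk rather than kept, since the full result would still not fit in memory:

```python
import numpy as np
from scipy.spatial.distance import cdist

def pairwise_blocks(X, block_size=1000):
    """Yield (start, stop, D), where D holds the Euclidean distances
    from rows start:stop of X to every row of X."""
    n = X.shape[0]
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        yield start, stop, cdist(X[start:stop], X, 'euclidean')

# Small stand-in for the (50000, 25) array from the question.
rng = np.random.default_rng(0)
X_small = rng.random((200, 25))

max_dist = 0.0
for start, stop, D in pairwise_blocks(X_small, block_size=64):
    max_dist = max(max_dist, D.max())   # keep only a summary of each block
```

Only one block of shape (block_size, n) is held in memory at a time.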
Hope it helps,
I have two M X N matrices which I construct after extracting data from images. Both vectors have a lengthy first row, and after the 3rd row each row contains only the first column.
For example, a raw vector looks like this
Both vectors have a similar pattern, where the first three rows are lengthy and then thin out as it progresses. To do cosine similarity, I was thinking of using a padding technique to add zeros and make these two vectors N X N. I looked at Python options for cosine similarity, but some examples were using a package called numpy. I couldn't figure out how exactly numpy can do this type of padding and carry out a cosine similarity. Any guidance would be greatly appreciated.
If both arrays have the same dimension, I would flatten them using NumPy. NumPy (and SciPy) is a powerful scientific computational tool that makes matrix manipulations way easier.
Here an example of how I would do it with NumPy and SciPy:
import numpy as np
from scipy.spatial import distance
A = np.array([[1,23,2,5,6,2,2,6,2],[12,4,5,5],[1,2,4],[1],[2],[2]], dtype=object )
B = np.array([[1,23,2,5,6,2,2,6,2],[12,4,5,5],[1,2,4],[1],[2],[2]], dtype=object )
Aflat = np.hstack(A)
Bflat = np.hstack(B)
dist = distance.cosine(Aflat, Bflat)
The result here is dist = 1.10e-16 (i.e., 0).
Note that I've used here the dtype=object because that's the only way I know to be able to store different shapes into an array in NumPy. That's why later I used hstack() in order to flatten the array (instead of using the more common flatten() function).
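If the two jagged arrays do not have identical row lengths, the zero-padding idea from the question could look roughly like this. It is a sketch only: `pad_rows` is a made-up helper and the sample data is invented, and it pads to a common rectangular shape rather than to N X N.

```python
import numpy as np
from scipy.spatial import distance

def pad_rows(rows, n_rows, n_cols):
    """Copy jagged rows into a zero-filled (n_rows, n_cols) array."""
    out = np.zeros((n_rows, n_cols))
    for i, row in enumerate(rows):
        out[i, :len(row)] = row
    return out

# Invented jagged data with different row lengths and row counts.
A = [[1, 23, 2, 5], [12, 4], [1], [2]]
B = [[1, 20, 2], [12, 4, 5], [1, 2], [2], [3]]

n_rows = max(len(A), len(B))
n_cols = max(max(len(r) for r in A), max(len(r) for r in B))
A_pad = pad_rows(A, n_rows, n_cols)
B_pad = pad_rows(B, n_rows, n_cols)

# distance.cosine returns a distance, so similarity is 1 minus that value.
similarity = 1.0 - distance.cosine(A_pad.ravel(), B_pad.ravel())
```

Padding this way keeps each value at its original (row, column) position, which plain hstack flattening does not guarantee when row lengths differ.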
I would make them into a scipy sparse matrix (http://docs.scipy.org/doc/scipy/reference/sparse.html) and then run cosine similarity from the scikit learn module.
from scipy import sparse
from sklearn.metrics import pairwise_distances
# your_np_array must be a regular 2-D numeric array
sparse_matrix = sparse.csr_matrix(your_np_array)
distance_matrix = pairwise_distances(sparse_matrix, metric="cosine")
Why can't you just run a nested loop over both jagged lists (presumably), summing each row using the Euclidean/vector dot product and using the result as a similarity measure? This assumes that the jagged dimensions are identical.
That said, I'm not quite sure how you are getting a jagged array from a bitmap image (I would have assumed it would be a proper dense matrix of MxN form), how the jagged array of arrays above is meant to represent MxN matrix/image data, or, therefore, how padding the data with zeros would make sense. If this were a sparse matrix representation, one would expect row/col information annotated with the values.
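For what it's worth, the nested-loop idea could be sketched in plain Python like this. `jagged_cosine` is a made-up name, the data is invented, and it assumes both jagged lists have identical row lengths and are not all zeros:

```python
import math

def jagged_cosine(a, b):
    """Cosine similarity of two jagged lists with identical row lengths."""
    dot = norm_a = norm_b = 0.0
    for row_a, row_b in zip(a, b):
        for u, v in zip(row_a, row_b):
            dot += u * v          # accumulate the dot product
            norm_a += u * u       # and both squared norms
            norm_b += v * v
    return dot / math.sqrt(norm_a * norm_b)

A = [[1, 23, 2, 5], [12, 4], [1], [2]]
B = [[2, 20, 2, 4], [10, 4], [1], [3]]
similarity = jagged_cosine(A, B)
```

Comparing a list with itself gives a similarity of 1 up to rounding.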