I would like to do something that is likely very simple, but is giving me difficulty. I'm trying to draw N samples from a multivariate normal distribution and calculate the probability of each of those randomly drawn samples. Here I attempt to use scipy, but I am open to using np.random.multivariate_normal as well, whichever is easiest.
>>> import numpy as np
>>> from scipy.stats import multivariate_normal
>>> num_samples = 10
>>> num_features = 6
>>> std = np.random.rand(num_features)
# define distribution
>>> mvn = multivariate_normal(mean = np.zeros(num_features), cov = np.diag(std), allow_singular = False, seed = 42)
# draw samples
>>> sample = mvn.rvs(size = num_samples); sample
# determine probability of each drawn sample
>>> prob = mvn.pdf(x = sample)
# print samples
>>> print(sample)
[[ 0.04816243 -0.00740458 -0.00740406 0.04967142 -0.01382643 0.06476885]
...
[-0.00977815 0.01047547 0.03084945 0.10309995 0.09312801 -0.08392175]]
# print probability of all samples
>>> print(prob)
[26861.56848337 17002.29353025 2182.26793265 3749.65049331
42004.63147989 3700.70037411 5569.30332186 16103.44975393
14760.64667235 19148.40325233]
This is confusing for me for a number of reasons:
For the rvs sampling call: I don't pass the keyword arguments mean and cov per the docs, because it seems odd to define a distribution with a mean and covariance in mvn = multivariate_normal(mean = np.zeros(num_features), cov = np.diag(std), allow_singular = False, seed = 42) and then repeat that definition in the rvs call. Am I missing something?
For the mvn.pdf call, the probability density is obviously >> 1, which isn't impossible for a continuous multivariate normal, but I would like to convert these numbers to approximate probabilities at that particular point. How can I do this?
Thanks!
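For reference, a minimal sketch of both points, assuming the same setup as above (the eps value below is an arbitrary choice): a frozen distribution already stores its mean and cov, so rvs and pdf reuse them, and a density only becomes an approximate probability once multiplied by the volume of a small region around the point.
import numpy as np
from scipy.stats import multivariate_normal

num_samples = 10
num_features = 6
variances = np.random.rand(num_features)    # diagonal of the covariance matrix

# the frozen distribution stores mean and cov, so rvs/pdf need no repetition
mvn = multivariate_normal(mean=np.zeros(num_features), cov=np.diag(variances), seed=42)
sample = mvn.rvs(size=num_samples)

# pdf returns a density; an approximate probability needs a region around the point.
# For a small hypercube of side eps centred on each sample: P ~ pdf(x) * eps**num_features
eps = 1e-3                                   # arbitrary edge length for the hypercube
approx_prob = mvn.pdf(sample) * eps**num_features
print(approx_prob)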
I am looking to find the peaks in some Gaussian smoothed data that I have. I have looked at some of the peak detection methods available, but they require an input range over which to search and I want this to be more automated than that. These methods are also designed for non-smoothed data. As my data is already smoothed, I require a much simpler way of retrieving the peaks. My raw and smoothed data is in the graph below.
Essentially, is there a pythonic way of retrieving the max values from the array of smoothed data such that an array like
a = [1,2,3,4,5,4,3,2,1,2,3,2,1,2,3,4,5,6,5,4,3,2,1]
would return:
r = [5,3,6]
SciPy provides a built-in function, argrelextrema, that gets this task done:
import numpy as np
from scipy.signal import argrelextrema
a = np.array([1,2,3,4,5,4,3,2,1,2,3,2,1,2,3,4,5,6,5,4,3,2,1])
# determine the indices of the local maxima
max_ind = argrelextrema(a, np.greater)
# get the actual values using these indices
r = a[max_ind] # array([5, 3, 6])
That gives you the desired output for r.
As of SciPy version 1.1, you can also use find_peaks. Below are two examples taken from the documentation itself.
Using the height argument, one can select all maxima above a certain threshold (in this example, all non-negative maxima; this can be very useful if one has to deal with a noisy baseline). If you want to find minima instead, just multiply your input by -1:
import matplotlib.pyplot as plt
from scipy.misc import electrocardiogram
from scipy.signal import find_peaks
import numpy as np
x = electrocardiogram()[2000:4000]
peaks, _ = find_peaks(x, height=0)
plt.plot(x)
plt.plot(peaks, x[peaks], "x")
plt.plot(np.zeros_like(x), "--", color="gray")
plt.show()
Another extremely helpful argument is distance, which defines the minimum distance between two peaks:
peaks, _ = find_peaks(x, distance=150)
# difference between peaks is >= 150
print(np.diff(peaks))
# prints [186 180 177 171 177 169 167 164 158 162 172]
plt.plot(x)
plt.plot(peaks, x[peaks], "x")
plt.show()
If your original data is noisy, then using statistical methods is preferable, as not all peaks are going to be significant. For your a array, a possible solution is to use double differentials:
peaks = a[1:-1][np.diff(np.diff(a)) < 0]
# peaks = array([5, 3, 6])
>>> import numpy as np
>>> from scipy.signal import argrelextrema
>>> a = np.array([1,2,3,4,5,4,3,2,1,2,3,2,1,2,3,4,5,6,5,4,3,2,1])
>>> argrelextrema(a, np.greater)
(array([ 4, 10, 17]),)
>>> a[argrelextrema(a, np.greater)]
array([5, 3, 6])
If your input represents a noisy distribution, you can try smoothing it with NumPy's convolve function.
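As a rough sketch of that idea (the noisy array and window length below are arbitrary illustration values), a moving average via np.convolve can be applied before the peak search:
import numpy as np
from scipy.signal import argrelextrema

# an arbitrary noisy series, just for illustration
noisy = np.array([1, 2, 1, 3, 4, 5, 4, 3, 4, 2, 1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1], dtype=float)

window = np.ones(5) / 5                         # simple moving-average kernel
smoothed = np.convolve(noisy, window, mode='same')

# peak detection on the smoothed signal
peak_idx = argrelextrema(smoothed, np.greater)
print(smoothed[peak_idx])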
If you can exclude maxima at the edges of the array, you can always check whether an element is bigger than each of its neighbors:
import numpy as np
array = np.array([1,2,3,4,5,4,3,2,1,2,3,2,1,2,3,4,5,6,5,4,3,2,1])
# Check that each element is bigger than both of its neighbors, excluding edges:
is_max = (array[1:-1] > array[:-2]) & (array[1:-1] > array[2:])
# Print these values
print(array[1:-1][is_max])
# Locations of the maxima
print(np.arange(1, array.size-1)[is_max])
I have to implement my own PCA function Y, V = PCA(data, M, whitening) that computes the first M principal components and transforms the data, so that y_n = U^T x_n. The function should further return V, the amount of variance that is explained by the transformation.
I have to reduce the data from dimension D=4 to M=2, given the function skeleton below:
def PCA(data, nr_dimensions=None, whitening=False):
    """ perform PCA and reduce the dimension of the data (D) to nr_dimensions
    Input:
        data... samples, nr_samples x D
        nr_dimensions... dimension after the transformation, scalar
        whitening... False -> standard PCA, True -> PCA with whitening
    Returns:
        transformed data... nr_samples x nr_dimensions
        variance_explained... amount of variance explained by the first nr_dimensions principal components, scalar"""
    if nr_dimensions is not None:
        dim = nr_dimensions
    else:
        dim = 2
What I have done so far is the following:
import numpy as np
import matplotlib.cm as cm
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import scipy.stats as stats
from scipy.stats import multivariate_normal
import pdb
import sklearn
from sklearn import datasets
#covariance matrix
mean_vec = np.mean(data, axis=0)
cov_mat = (data - mean_vec).T.dot((data - mean_vec)) / (data.shape[0] - 1)
print('Covariance matrix \n%s' % cov_mat)
#now the eigendecomposition of the cov matrix
cov_mat = np.cov(data.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('Eigenvectors \n%s' % eig_vecs)
print('\nEigenvalues \n%s' % eig_vals)
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
This is the point where I don't know what to do now and how to reduce dimension.
Any help would be welcome! :)
Here is a simple example for the case where the initial matrix A, which contains the samples and features, has shape=[samples, features]:
from numpy import array
from numpy import mean
from numpy import cov
from numpy.linalg import eig
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# calculate the mean of each column, since I assume each column is a variable/feature
M = mean(A.T, axis=1)
print(M)
# center columns by subtracting column means
C = A - M
print(C)
# calculate covariance matrix of centered matrix
V = cov(C.T)
print(V)
# eigendecomposition of covariance matrix
values, vectors = eig(V)
print(vectors)
print(values)
# project data
P = vectors.T.dot(C.T)
print(P.T)
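To get from here to what the question asks for, a short continuation of the same example can sort the eigenvalues, keep only the first M components, and compute the explained-variance ratio (a sketch, with M = 2 simply mirroring the question):
# sort the eigenvalues in descending order, keep the first M components,
# and compute the fraction of variance they explain
M = 2
order = values.argsort()[::-1]
values, vectors = values[order], vectors[:, order]
Y = C.dot(vectors[:, :M])                    # transformed data, shape (samples, M)
variance_explained = values[:M].sum() / values.sum()
print(Y)
print(variance_explained)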
PCA is essentially a singular value decomposition of the (centered) data, so you can either use numpy.linalg.svd:
import numpy as np
def PCA(U, ndim, whitening=False):
    # economy-size SVD of the data: U = L * diag(G) * R
    L, G, R = np.linalg.svd(U, full_matrices=False)
    if not whitening:
        L = L * G                 # scale each component by its singular value
    Y = L[:, :ndim]               # transformed data, nr_samples x ndim
    return Y, G[:ndim]
If you want to use the eigenvalue problem instead, then, assuming that the number of samples is higher than the number of features (otherwise your problem would be underdetermined), it is inefficient to calculate the spatial correlations (left eigenvectors) directly. Instead, compute the right eigenvectors from the smaller Gram matrix and reconstruct the left ones from them:
def PCA(U, ndim, whitening=False):
    K = U.T @ U                   # Gram matrix; its eigenvectors are the right singular vectors
    G, R = np.linalg.eigh(K)
    G = G[::-1]                   # eigh returns ascending eigenvalues, so reverse
    R = R[:, ::-1]
    L = U @ R                     # reconstruct the left singular vectors
    nrm = np.linalg.norm(L, axis=0, keepdims=True)   # normalizing them
    L /= nrm
    if not whitening:
        L = L * nrm               # the column norms are the singular values, so scale back
    Y = L[:, :ndim]
    return Y, G[:ndim]
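One caveat: both versions operate on zero-mean input, since PCA works on centered data. A minimal usage sketch, assuming data is an (n_samples, n_features) array:
# center each feature before calling either PCA version
Xc = data - data.mean(axis=0)
Y, V = PCA(Xc, ndim=2)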
My goal is to find the nearest x,y point coordinate for every pixel. Based on that, I have to colour the pixel points.
Here is what I have tried.
The below code will draw the points.
import numpy as np
import matplotlib.pyplot as plt
points = np.array([[0,40],[0,0],[5,30],[4,10],[10,25],[20,5],[30,35],[35,3],[50,0],[45,15],[40,22],[50,40]])
print (points)
x1, y1 = zip(*points)
plt.plot(x1,y1,'.')
plt.show()
Now I need to find the nearest point for each pixel.
I found something like this, where I have to manually give each pixel's coordinates to get the nearest point.
from scipy import spatial
import numpy as np
A = np.random.random((10,2))*100
print (A)
pt = np.array([[6, 30],[9,80]])
print (pt)
for each in pt:
    A[spatial.KDTree(A).query(each)[1]]   # <-- the nearest point
    distance, index = spatial.KDTree(A).query(each)
    print(distance)   # <-- The distances to the nearest neighbors
    print(index)      # <-- The locations of the neighbors
    print(A[index])
The output will be like this,
[[1.76886192e+01 1.75054781e+01]
[4.17533199e+01 9.94619127e+01]
[5.30943347e+01 9.73358766e+01]
[3.05607891e+00 8.14782701e+01]
[5.88049334e+01 3.46475520e+01]
[9.86076676e+01 8.98375851e+01]
[9.54423012e+01 8.97209269e+01]
[2.62715747e+01 3.81651805e-02]
[6.59340306e+00 4.44893348e+01]
[6.66997434e+01 3.62820929e+01]]
[[ 6 30]
[ 9 80]]
14.50148095039858
8
[ 6.59340306 44.48933479]
6.124988197559344
3
[ 3.05607891 81.4782701 ]
Instead of giving each point manually, I want to take each pixel from the image and find the nearest blue point. This is my first question.
After that, I want to classify those points into two categories.
Based on pixel and point, I want to colour them; basically I want to do a clustering on them.
This is not in proper form, but at the end I want it like this.
Thanks in advance, guys.
Use cKDTree instead of KDTree, which is faster (see this answer).
You can give the kdtree an array of points to query instead of looping over all of them.
Constructing a kdtree is a costly operation compared to querying it, so construct it once and query many times.
Compare the following two code snippets; in my tests the second one ran about 800 times faster.
import numpy as np
from scipy import spatial
from timeit import default_timer as timer

np.random.seed(0)
A = np.random.random((1000, 2)) * 100
pt = np.random.randint(0, 100, (100, 2))

start1 = timer()
for each in pt:
    A[spatial.KDTree(A).query(each)[1]]
    distance, index = spatial.KDTree(A).query(each)
end1 = timer()
print(end1 - start1)

start2 = timer()
kdt = spatial.cKDTree(A)   # cKDTree + construction outside the loop
distance, index = kdt.query(pt)
A[index]
end2 = timer()
print(end2 - start2)
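To cover every pixel rather than a hand-picked list, a rough sketch (the 51 x 41 grid below is an arbitrary image size) is to build the full grid of pixel coordinates once, query the tree with all of them in a single call, and colour each pixel by the index of its nearest point:
import numpy as np
from scipy import spatial
import matplotlib.pyplot as plt

points = np.array([[0,40],[0,0],[5,30],[4,10],[10,25],[20,5],[30,35],
                   [35,3],[50,0],[45,15],[40,22],[50,40]])

# every (x, y) coordinate of a hypothetical 51 x 41 pixel grid
xs, ys = np.meshgrid(np.arange(51), np.arange(41))
pixels = np.column_stack([xs.ravel(), ys.ravel()])

kdt = spatial.cKDTree(points)
_, nearest = kdt.query(pixels)          # index of the nearest point for every pixel

# colour each pixel by the index of its nearest point
plt.scatter(pixels[:, 0], pixels[:, 1], c=nearest, s=4, cmap='tab20')
plt.plot(points[:, 0], points[:, 1], 'k.')
plt.show()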
You can use scikit-learn for this:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=1)
labels = list(range(len(points)))
neigh.fit(points, labels)
pred = neigh.predict(np.random.random((10,2))*50)
If you want the points themselves rather than their class labels, you can do
points[pred]
When working with healpy, I am able to plot a Healpix map in Mollview using
import healpy
map = healpy.read_map('filename.fits')
healpy.visufunc.mollview(map)
or as in the tutorial
>>> import numpy as np
>>> import healpy as hp
>>> NSIDE = 32
>>> m = np.arange(hp.nside2npix(NSIDE))
>>> hp.mollview(m, title="Mollview image RING")
which outputs a full-sky Mollweide image.
Is there a way to display only certain regions of the map? For example, only the upper hemisphere, or only the left side?
What I have in mind is viewing only small patches of the sky to see small point sources, or something like the "half-sky" projection from LSST
You can use a mask, which is a boolean map of the same size, where 1 means masked and 0 means not masked:
http://healpy.readthedocs.org/en/latest/tutorial.html#masked-map-partial-maps
Example:
import numpy as np
import healpy as hp
NSIDE = 32
m = hp.ma(np.arange(hp.nside2npix(NSIDE), dtype=np.double))
mask = np.zeros(hp.nside2npix(NSIDE), dtype=bool)
pixel_theta, pixel_phi = hp.pix2ang(NSIDE, np.arange(hp.nside2npix(NSIDE)))
mask[pixel_theta > np.pi/2] = 1
m.mask = mask
hp.mollview(m)
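If the goal is small patches around point sources rather than a masked full-sky view, healpy's gnomview projects only a small region around a chosen centre; a minimal sketch (the centre, xsize and reso values here are arbitrary):
# zoom in on a patch centred at (lon, lat) = (0, 0) degrees;
# xsize is in pixels, reso is the pixel size in arcmin
hp.gnomview(m, rot=(0, 0), xsize=400, reso=3, title="Small patch")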