Principal component analysis dimension reduction in python - python

I have to implement my own PCA function function Y,V = PCA(data, M, whitening) that computes the first M principal
components and transforms the data, so that y_n = U^T x_n. The function should further
return V that explains the amount of variance that is explained by the transformation.
I have to reduce the dimension of data D=4 to M=2 > given function below <
def PCA(data,nr_dimensions=None, whitening=False):
""" perform PCA and reduce the dimension of the data (D) to nr_dimensions
Input:
data... samples, nr_samples x D
nr_dimensions... dimension after the transformation, scalar
whitening... False -> standard PCA, True -> PCA with whitening
Returns:
transformed data... nr_samples x nr_dimensions
variance_explained... amount of variance explained by the the first nr_dimensions principal components, scalar"""
if nr_dimensions is not None:
dim = nr_dimensions
else:
dim = 2
what I have done is the following:
import numpy as np
import matplotlib.cm as cm
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import scipy.stats as stats
from scipy.stats import multivariate_normal
import pdb
import sklearn
from sklearn import datasets
#covariance matrix
mean_vec = np.mean(data)
cov_mat = (data - mean_vec).T.dot((data - mean_vec)) / (data.shape[0] - 1)
print('Covariance matrix \n%s' % cov_mat)
#now the eigendecomposition of the cov matrix
cov_mat = np.cov(data.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('Eigenvectors \n%s' % eig_vecs)
print('\nEigenvalues \n%s' % eig_vals)
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
This is the point where I don't know what to do now and how to reduce dimension.
Any help would be welcome! :)

Here is a simple example for the case where the initial matrix A that contains the samples and features has shape=[samples, features]
from numpy import array
from numpy import mean
from numpy import cov
from numpy.linalg import eig
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# calculate the mean of each column since I assume that it's column is a variable/feature
M = mean(A.T, axis=1)
print(M)
# center columns by subtracting column means
C = A - M
print(C)
# calculate covariance matrix of centered matrix
V = cov(C.T)
print(V)
# eigendecomposition of covariance matrix
values, vectors = eig(V)
print(vectors)
print(values)
# project data
P = vectors.T.dot(C.T)
print(P.T)

PCA is actually the same as singular value decomposition, so you can either use numpy.linalg.svd:
import numpy as np
def PCA(U,ndim,whitening=False):
L,G,R=np.linalg.svd(U,full_matrices=False)
if not whitening:
L=L # G
Y=L[:,:ndim] # R[:,:ndim].T
return Y,G[:ndim]
If you want to use the eigenvalue problem, then assuming that the number of samples is higher than the number of features (or your data would be underfit), it is inefficient to calculate the spatial correlations (left eigenvectors) directly. Instead, using SVD use the right eigenfunctions:
def PCA(U,ndim,whitening=False):
K=U.T # U # Calculating right eigenvectors
G,R=np.linalg.eigh(K)
G=G[:,::-1]
R=R[::-1]
L=U # R # reconstructing left ones
nrm=np.linalg.norm(L,axis=0,keepdims=True) #normalizing them
L/=nrm
if not whitening:
L=L # G
Y=L[:,:ndim] # R[:,:ndim].T
return Y,G[:ndim]

Related

Clustering binary image using sparse matrix representation

I have seen this question on StackOverflow, and I do want to solve the problem of clustering binary images such as this one from the same abovementioned question using sparse matrix representation:
I know that there are more simple and efficient methods for clustering (KMeans, Mean-shift, etc...), and I'm aware that this problem can be solved with other solutions such as connected components,
but my aim is to use sparse matrix representation approach.
What I have tried so far is:
Reading the image and defining distance function (Euclidean distance as an example):
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import cv2
import numpy as np
from math import sqrt
from scipy.sparse import csr_matrix
original = cv2.imread("km.png", 0)
img = original.copy()
def distance(p1, p2):
"""
Euclidean Distance
"""
return sqrt((p1[0]-p2[0])**2 + (p1[1]-p2[1])**2)
constructing the Sparse Matrix depending on the nonzero points of the binary image, I'm considering two clusters for simplicity, and later it can be extended to k clusters:
data = np.nonzero(img)
data = tuple(zip(data[0],data[1]))
data = np.asarray(data).reshape(-1,2)
# print(data.shape)
l = data.shape[0] # 68245
# num clusters
k = 2
cov_mat = csr_matrix((l, l, k), dtype = np.int8).toarray()
First Loop to get the centroids as the most far points from each other:
# inefficient loop!
for i in range(l):
for j in range(l):
if(i!=j):
cov_mat[i, j, 0] = distance(data[i], data[j])
# Centroids
# TODO: check if more than two datapoints with same max distance then take first!
ci = cov_mat[...,0].argmax()
Calculating the distance from each centroid:
# inefficient loop!
# TODO: repeat for k clusters!
for i in range(l):
for j in range(l):
if(i!=j):
cov_mat[i, j, 0] = distance(data[i], data[ci[0]])
cov_mat[i, j, 1] = distance(data[i], data[ci[1]])
Clustering depending on min distance:
# Labeling
cov_mat[cov_mat[...,0]>cov_mat[...,1]] = +1
cov_mat[cov_mat[...,0]<cov_mat[...,1]] = -1
# Labeling Centroids
cov_mat[ci[0], ci[0]] = +1
cov_mat[ci[1], ci[1]] = -1
obtaining the indices of the clusters:
# clusters Indicies
cl1 = cov_mat[...,0].argmax()
cl2 = cov_mat[...,0].argmin()
# TODO: pass back the indices to the image to cluster it.
This approach is time costly, can you please tell me how to increase the efficiency please? thanks in advance.
As pointed out in the comments, when working with NumPy arrays, vectorised code should be preferred over scalar code (i.e., code with explicit for loops). Besides, as I see it, in this particular problem a dense NumPy array is a better choice than a Scipy sparse array. If you are not constrained to utilize a sparse array, this code would do:
import numpy as np
from skimage.io import imread
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
n_clusters = 2
palette = [[255, 0, 0], [0, 255, 0]]
img = imread('https://i.stack.imgur.com/v1NDe.png')
rows, cols = img.nonzero()
coords = np.stack([rows, cols], axis=1)
labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(coords)
result = np.zeros(shape=img.shape+(3,), dtype=np.uint8)
for label in range(n_clusters):
idx = (labels == label)
result[rows[idx], cols[idx]] = palette[label]
fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(12, 8))
ax0.imshow(img, cmap='gray')
ax0.set_axis_off()
ax0.set_title('Binary image')
ax1.imshow(result)
ax1.set_axis_off()
ax1.set_title('Cluster labels');

scRNA-seq: How to use TSNE python implementation using precalculated PCA score/load?

Python t-sne implementation from this resource: https://lvdmaaten.github.io/tsne/
Btw I'm a beginner to scRNA-seq.
What I am trying to do: Use a scRNA-seq data set and run t-SNE on it but with using previously calculated PCAs (I have PCA.score and PCA.load files)
Q1: I should be able to use my selected calculated PCAs in the tSNE, but which file do I use the pca.score or pca.load when running Y = tsne.tsne(X)?
Q2: I've tried removing/replacing parts of the PCA calculating code to attempt to remove PCA preprocessing but it always seems to give an error. What should I change for it to properly use my already PCA data and not calculate PCA from it again?
The piece of PCA processing code is this in its raw form:
def pca(X=np.array([]), no_dims=50):
"""
Runs PCA on the NxD array X in order to reduce its dimensionality to
no_dims dimensions.
"""
print("Preprocessing the data using PCA...")
(n, d) = X.shape
X = X - np.tile(np.mean(X, 0), (n, 1))
(l, M) = X #np.linalg.eig(np.dot(X.T, X))
Y = np.dot(X, M[:, 0:no_dims])
return Y
You should use the PCA score.
As for not running pca, you can just comment out this line:
X = pca(X, initial_dims).real
What I did is to add a parameter do_pca and edit the function such:
def tsne(X=np.array([]), no_dims=2, initial_dims=50, perplexity=30.0,do_pca=True):
"""
Runs t-SNE on the dataset in the NxD array X to reduce its
dimensionality to no_dims dimensions. The syntaxis of the function is
`Y = tsne.tsne(X, no_dims, perplexity), where X is an NxD NumPy array.
"""
# Check inputs
if isinstance(no_dims, float):
print("Error: array X should have type float.")
return -1
if round(no_dims) != no_dims:
print("Error: number of dimensions should be an integer.")
return -1
# Initialize variables
if do_pca:
X = pca(X, initial_dims).real
(n, d) = X.shape
max_iter = 50
[.. rest stays the same..]
Using an example dataset, without commenting out that line:
import numpy as np
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import sys
import os
from tsne import *
X,y = load_digits(return_X_y=True,n_class=3)
If we run the default:
res = tsne(X=X,initial_dims=20,do_pca=True)
plt.scatter(res[:,0],res[:,1],c=y)
If we pass it a pca :
pc = pca(X)[:,:20]
res = tsne(X=pc,initial_dims=20,do_pca=False)
plt.scatter(res[:,0],res[:,1],c=y)

How to estimate motion with FTT and Cross-Correlation?

I'm working in the estimation of cloud displacement for wind energy purposes with RGB GOES satellital images. I find the following the methodology from this paper "An Automated Technique for Obtaining Cloud Motion From Geosynchronous Satellite Data Using Cross Correlation" to achieve it. I don't know if this is a good way to compute this. The code bassically gets the cross correlation from the Fourier Transform to calculate cloud displacement between roi_a and roi_b images.
import numpy as np
import cv2 as cv
import matplotlib.pyplot as plt
img_a = cv.imread('2019.1117.1940.goes-16.rgb.tif', 0)
img_b = cv.imread('2019.1117.1950.goes-16.rgb.tif', 0)
roi_a = img_a[700:900, 1900:2100]
roi_b = img_b[700:900, 1900:2100]
def Fou(image):
fft_roi= np.fft.fft2(image)
return fft_roi
def inv_Fou(C_w):
c_t = np.fft.ifft2(C_w)
c_t = np.abs(c_t)
return c_t
#Step 1: gets the FFT
G_t0 = Fou(roi_a)##t_0
fft_roiA_conj = np.conj(G_t0) #Conjugate
G_t1 = Fou(roi_b)##t_1
#Step 2: Compute C(m, v)
prod = np.dot(fft_roiA_conj, G_t1)
#Step 3: Perform the inverse FFT
inv = inv_Fou(prod)
plt.imshow(inv, cmap = 'gray', )
plt.title('C (m,v) --> Cov(p,q)')
plt.xticks([])
plt.yticks([])
plt.show()
#Step 4: Compute cross correlation coefficient and the maximum cross correlation coefficient
def rms(sigma):
"Compute the standar deviation of an image"
rms = np.std(sigma)
return rms
R_t = inv / (rms(roi_a) * rms(roi_b))
This is the first time that I use FFT on images, so I have some questions about it:
I don't add fftshift, is this can affect the result?
What is difference between use np.dot in step 2 and simple '*', like prod = fft_roiA_conj * G_t1
How to interpret the image result (C(m, v) -> Cov (p, q)) from step 3?
How can I obtain the maximum coefficient p' and q' (maximum coefficient of x and y directions) from R_t?
1 - fftshift is a circular rotation, if you have a two sided signal you are computing the correlation is shifted (circularly), what is important is that you map your indices to displacements correctly, with or without fftshift.
2 - numpy.dot is the matrix product (equivalent to # operator for recent python versions), and the * operator does element-wise multiplication, in my understanding you want the element-wise product at step 2.
3 - Once you correct the step 2 you will have an image such that inv[i, j] the correlation of the immage roi_a and the image roi_b rolled by i rows and j columns
To answer the last question I will workout an example.
I will use the image scipy.misc.face, it is a RGB image, so it brings three matrices that are highly correlated.
import scipy
import numpy as np
import matplotlib.pyplot as plt
f = scipy.misc.face()
plt.figure(figsize=(12, 4))
plt.subplot(131), plt.imshow(f[:,:, 0])
plt.subplot(132), plt.imshow(f[:,:, 1])
plt.subplot(133), plt.imshow(f[:,:, 2])
The function img_corrcombine the three steps of the cross correlation (for images of the same size), notice that I am use rfft2 and irfft2, this are the FFT for real data, that take advantage of symmetry in the frequency domain.
def img_corr(foi_a, foi_b):
return np.fft.irfft2(np.fft.rfft2(foi_a) * np.conj(np.fft.rfft2(foi_b)))
C = img_corr(f[:,:,1], f[:,:,2])
plt.figure(figsize=(12, 4))
plt.subplot(121), plt.imshow(C), plt.title('FFT indices')
plt.subplot(122), plt.imshow(np.fft.fftshift(C, (0, 1))), plt.title('fftshift ed version')
To retrieve the position
# this returns the indice in the vector of all pixels
best_corr = np.argmax(C)
# unravel index gives the 2D index
best_pos = np.unravel_index(best_corr, C.shape)
# this get the positions as a fraction of the image size
relative_pos = [np.fft.fftfreq(size)[index] for index, size in zip(best_pos, C.shape)]
I hope this completes the answer.

How to find Eigenspace of a matrix using python

I have a matrix which is I found its Eigenvalues and EigenVectors, but now I want to solve for eigenspace, which is Find a basis for each of the corresponding eigenspaces! and don't know how to start! by finding the null space from scipy or solve for reef(), I tried but didn't work! please help!
this is the code I am using
# import packages
import numpy as np
from numpy import linalg as LA
from scipy.linalg import null_space
# define matrix and vector
M = np.array([[0.82, 0.1],[0.18,0.9]])
v0 = np.array([[15000],[800]])
eigenVal, eigenVec = LA.eig(M)
print(eigenVal)
# Based on the Characteristic polynomial formula
#pol_formula =(A- \lambda I)\mathbf{v} = 0\)
identity = np.identity(2, dtype=float)
lamdbdaI= eigenVal*identity
## Apply the Characteristic polynomial formula using ###M matrix
char_poly = M-lamdbdaI
print(char_poly)
Here I am stuck !
The np.linalg.eig functions already returns the eigenvectors, which are exactly the basis vectors for your eigenspaces. More precisely:
v1 = eigenVec[:,0]
v2 = eigenVec[:,1]
span the corresponding eigenspaces for eigenvalues lambda1 = eigenVal[0] and lambda2 = eigenvVal[1].

Simultaneously fit linearly every line of a 2d numpy array

I am working in Python on image analysis. I have an image (2d numpy array) with some intensity drift in it. I want to level it.
To remove the increasing/decreasing intensity over the width of the image, I want to fit every row of the 2d numpy array with a line. I however do not want to loop through every row index.
MWE:
import numpy as np
import matplotlib.pyplot as plt
width=1500
height=2500
np.random.random((width,height))
fill_fun = lambda x,a,b : a*x+b
play_image = fill_fun(np.tile(np.arange(width),(height,1)),0.15,2)+np.random.random( (height,width) )
#For representation purposes:
#plt.imshow(play_image,cmap='Greys_r')
#plt.show()
#1) Fit every row and kill the intensity decrease/increase tendency
fit_func = lambda p,x: p[0]*x+b
errfunc = lambda p, x, y: abs(fitfunc(p, x) - y) # Distance to the target function
x_axis=np.linspace(0,width,width)
for i in range(height):
row_val=play_image[i,:]
p0=[(row_val[-1]-row_val[0])/float(width),row_val[0]] #guess
p1, success = optimize.leastsq(errfunc, p0[:], args=(x_axis,row_val))
play_image[i,:]-= fit_func(p1,x_axis)-p1[1]
By doing this I effectively level my image intensity horizontally. Is there anyway I can replace the loop by a matrix operation ? To somehow fit all the lines at the same time with a (height,2) parameter vector ?
Thanks for the help
Fitting a line is a simple formula to use directly, which can be done about three short lines in numpy (most of the code below is just making and plotting the data and fits):
import numpy as np
import matplotlib.pyplot as plt
# make the data as sequential sections of a circle
theta = np.linspace(np.pi, 0, 120)
y = np.reshape(np.sin(theta), (10,12))
x = np.repeat(np.arange(12)[None,:], 10, axis=0)
# fit the line
m = lambda x: np.mean(x, axis=1)
beta = ( m(y*x) - m(x)*m(y) )/(m(x*x) - m(x)**2)
alpha = m(y) - beta*m(x)
# plot the data and fits
plt.plot([y[:,i] for i in range(12)], ".") # plot the data
plt.gca().set_color_cycle(None) # reset the color cycle
fits = alpha[:,None] + beta[:,None]*x # make lines from the fits for the plots
plt.plot(fits.T)
plt.show()
You can implement the normal equations and their solution pretty easily. The main challenge is keeping track of the appropriate dimensions so all the vectorized operations work correctly. Here's one method:
import numpy as np
# image size
m = 100
n = 125
# A random image to work with.
np.random.seed(123)
img = np.random.randint(0, 100, size=(m, n))
# X is the design matrix. It is the same for each row. It has shape (n, 2).
X = np.column_stack((np.ones(n), np.arange(n)))
# A is X.T.dot(X), but in this case we can use an explicit formula for each term.
s1 = 0.5*n*(n - 1) # Sum of integers
s2 = n*(n - 0.5)*(n - 1)/3.0 # Sum of squared integers
A = np.array([[n, s1], [s1, s2]])
# Y has shape (2, m). Each column is a vector on the right-hand-side of the
# normal equations.
Y = X.T.dot(img.T)
# Solve the normal equations. beta has shape (2, m). Each column gives the
# coefficients of the linear fit for each row of img.
beta = np.linalg.solve(A, Y)
# Create an array that holds the linear drift for each row.
# X has shape (n, 2) and beta has shape (2, m), so row_drift has shape (m, n),
# the same as img.
row_drift = X.dot(beta).T
# Remove the drift from img.
img2 = img - row_drift

Categories

Resources