In my current approach, I have
from scipy.sparse import csr_matrix
from sklearn.cluster import AgglomerativeClustering
import pandas as pd
s = pd.DataFrame([[0.8, 0. , 3. ],
                  [1. , 1. , 2. ],
                  [0.3, 3. , 4. ]], columns=['dist', 'v1', 'v2'])
N = int(max(s['v1'].max(), s['v2'].max())) + 1  # matrix dimension (61K on the real data)
sparseD = csr_matrix((1 - s['dist'], (s['v1'].values, s['v2'].values)), shape=(N, N))
agg = AgglomerativeClustering(n_clusters=None, affinity='precomputed', linkage='complete', distance_threshold=.25)
agg.fit_predict(sparseD)
The last line raises
TypeError: cannot take a sparse matrix.
If I cast the data with .toarray(), the code works and produces the expected output, but it uses a lot of memory and is slow at the real data size of 61K x 61K.
I am wondering if there is another library (or scikit-learn API) that can do the same linkage clustering on a precomputed, sparse distance matrix -- if there were no entry for a given (element1, element2) pair, the API would simply not link them, and everything else would stay the same.
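For reference, this is the dense workaround mentioned above (just a sketch; it is exactly what becomes prohibitive at 61K x 61K):

# Dense workaround: densify the precomputed distances before clustering.
# Works for small N, but materializes the full N x N array in memory.
labels = agg.fit_predict(sparseD.toarray())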
I am trying to solve the following optimization problem, where:

V: 3x3 matrix of complex number constants
C: scalar complex number constant

The problem is to find a boolean matrix X that minimizes

Residules = cp.norm(cp.sum(cp.multiply(V, X)) - C)
The following code works:
import numpy as np
import cvxpy as cp
V = np.random.random((3, 3))*10 + np.random.random((3, 3))*10 * 1j
C=3+4j
X=cp.Variable((3,3), boolean=True)
Residules = cp.norm(cp.sum(cp.multiply(V, X)) - C)
Objective= cp.Minimize(Residules)
Const1=cp.sum(X,0)<=1
Prob1= cp.Problem(Objective)
Prob1.solve()
X=np.array(X.value)
print(np.round(X))
print(Prob1.value)
The output:
[[ 1. 0. 0.]
[ 1. -0. -0.]
[-0. 1. -0.]]
1.39538277332097
My question:
I want to put a constraint on the problem so that, for each column of matrix X, only one element can be 1 and the rest must be zero -- in each column there is at most one element with the value 1.
I tried :
Const1=cp.sum(X,0)<=1
Prob1= cp.Problem(Objective,[Const1])
Prob1.solve()
The following error occurred:
File "path\Anaconda3\lib\site-packages\cvxpy\reductions\complex2real\complex2real.py", line 95, in invert
    dvars[vid] = solution.dual_vars[cid]
KeyError: 11196
Is there any other way to set this constraint?
I separated the real part from the imaginary part, and I think it works.
import numpy as np
import cvxpy as cp
Vr= np.random.random((3,3))
Vi=np.random.random((3,3))
Cr=3
Ci=4
X=cp.Variable((3,3),boolean=True)
Real=cp.sum(cp.multiply(Vr,X))-Cr
Imag=cp.sum(cp.multiply(Vi,X))-Ci
Residules=cp.norm(cp.hstack([Real, Imag]), 2)
Objective= cp.Minimize(Residules)
const1=[cp.sum(X,axis = 0)<=1]
Prob1= cp.Problem(Objective,const1)
Prob1.solve()
X=np.array(X.value)
print(np.round(X))
print(Prob1.value)
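As a quick sanity check (a small addition, not part of the code above), the column constraint can be verified on the rounded solution:

# Each column of the rounded boolean solution should contain at most one 1.
print(np.round(X).sum(axis=0))  # expected: every entry <= 1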
I was trying to implement the matrix exponential function as in scipy.linalg.expm. I gained inspiration from kaityo256's github repository. I thus wrote down the following.
from scipy.linalg import expm
from scipy.linalg import eigh
from scipy.linalg import inv
from math import exp as math_exp
from numpy import array, zeros
from numpy.random import random_sample
from numpy.testing import assert_allclose
def diag2sqr(x):
    '''Makes a square matrix from a diagonal one.

    Takes a 1d matrix. Determines its data type.
    Finds out the shape of the 1d matrix.
    Makes an empty square matrix with both
    dimensions equal to the largest (nonzero) dimension of
    the 1d matrix. It then fills the elements of the
    1d matrix into diagonal slots of the empty
    square one.

    Parameters
    ----------
    x : ndarray
        ndarray to be converted to a square ndarray

    Returns
    -------
    xsqr : ndarray
        ndarray with diagonal same as that of x,
        all other elements zero,
        dtype same as that of x
    '''
    x_flat = x.ravel()
    xsqr = zeros((x_flat.shape[0], x_flat.shape[0]), dtype=x.dtype)
    # Making the empty matrix
    for i in range(x_flat.shape[0]):
        xsqr[i, i] = x_flat[i]
        # filling up the ith element
    print('xsqr', xsqr)
    return xsqr

def kaityo_expm(x):
    '''Exponentiates an ndarray (kaityo).

    Exponentiates an ndarray in the most naive way.

    Parameters
    ----------
    x : ndarray
        The ndarray to be exponentiated

    Returns
    -------
    kexpm : ndarray
        x after exponentiating
    '''
    rx, ux = eigh(x)
    # Find eigenvalues and eigenvectors
    # eigenvectors composed to form a unitary
    ux_inv = inv(ux)
    # Inverse of the unitary
    # tx = diag([array([math_exp(i) for i in rx]).ravel()])
    # tx = array([math_exp(i) for i in rx])
    tx = diag2sqr(array([math_exp(i) for i in rx]))
    # Constructing the diagonal matrix
    kexpm1 = tx @ ux_inv
    kexpm = ux @ kexpm1
    return kexpm
Afterwards, I tried to test the above code versus scipy.linalg.expm.
x = random_sample((10, 10))
assert_allclose(expm(x), kaityo_expm(x))
This leads to the following output.
AssertionError:
Not equal to tolerance rtol=1e-07, atol=0
Mismatch: 100%
Max absolute difference: 7.04655733
Max relative difference: 0.59875635
x: array([[18.032424, 16.224408, 12.432163, 16.614248, 12.85653 , 13.705387,
15.096966, 10.577946, 18.399573, 17.938062],
[16.352809, 17.525898, 12.79079 , 16.295562, 13.512996, 14.407979,...
y: array([[18.649103, 13.157682, 11.264763, 16.099163, 15.2293 , 17.854499,
11.691586, 13.412066, 15.023189, 15.598455],
[13.157682, 13.612502, 9.628261, 12.659313, 13.559437, 13.382417,..
Obviously, the two implementations differ.
The questions are as follows:
Is it acceptable for them to differ?
Is my implementation wrong?
If my implementation is wrong, how do I fix it?
If my implementation is correct, when is it safe to use scipy.linalg.expm?
I have seen the following questions:
Matrix exponentiation with scipy: expm, expm2 and expm3
From a mathematical point of view, the exponential of a matrix is defined using the Taylor series of the exponential, so:

exp(A) = I + A + A^2/2! + A^3/3! + ... = sum_{k=0..inf} A^k / k!

If A is a diagonal square matrix, this reduces to exponentiating each diagonal entry:

exp(A) = diag(e^a11, e^a22, ..., e^ann)

The problem arises when A is a generic square matrix, so before taking the exponential you need to diagonalize it using its eigenvalues and eigenvectors:

A = U * Lambda * U^-1

with U the matrix of eigenvectors and Lambda the matrix with the eigenvalues on the diagonal. At this point we are close to finding what the exponential of a matrix is:

exp(A) = U * exp(Lambda) * U^-1

where exp(Lambda) is again diagonal, with e^lambda_i on the diagonal.
Now let's implement this result in a simple script:
>>> import numpy as np
>>> import scipy.linalg as ln
>>> A = [[2/3, -4/3, 2],
...      [5/6, 4/3, -2],
...      [5/6, -2/3, 0]]
>>> A = np.matrix(A)
>>> print(A)
[[ 0.66666667 -1.33333333 2. ]
[ 0.83333333 1.33333333 -2. ]
[ 0.83333333 -0.66666667 0. ]]
>>> eigvalue, eigvectors = np.linalg.eig(A)
>>> print("eigvalue: ", eigvalue)
>>> print("eigvectors:")
>>> print(eigvectors)
eigvalue: [ 1. -1. 2.]
eigvectors:
[[ 0.81649658 0.27216553 0.87287156]
[ 0.40824829 -0.68041382 -0.21821789]
[ 0.40824829 -0.68041382 0.43643578]]
>>> e_Lambda = np.eye(np.size(A, 0))*(np.exp(eigvalue))
>>> print(e_Lambda)
[[2.71828183 0. 0. ]
[0. 0.36787944 0. ]
[0. 0. 7.3890561 ]]
>>> e_A = eigvectors*e_Lambda*eigvectors.I
>>> print(e_A)
[[ 2.3265481 -6.22769903 7.01116649]
[ 0.97933433 4.27520659 -3.51559341]
[ 0.97933433 -3.11384951 3.87346269]]
>>> e_A2 = ln.expm(A)
>>> print(e_A2)
[[ 2.3265481 -6.22769903 7.01116649]
[ 0.97933433 4.27520659 -3.51559341]
[ 0.97933433 -3.11384951 3.87346269]]
>>> np.testing.assert_allclose(e_A, e_A2)
>>> print(e_A - e_A2)
[[-1.77635684e-15 1.77635684e-15 -8.88178420e-16]
[ 4.44089210e-16 -1.77635684e-15 8.88178420e-16]
[-2.22044605e-16 0.00000000e+00 4.44089210e-16]]
We see that the result is basically the same, so I think it's safe to use scipy.linalg.expm for matrix exponentiation.
I created a repo with the notebook for further testing.
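Note that scipy.linalg.eigh assumes a symmetric/Hermitian input, while random_sample((10, 10)) is generally not symmetric, which is likely why kaityo_expm in the question disagrees with expm. A minimal sketch of a check on a symmetrized matrix, where the eigh-based reconstruction should agree:

import numpy as np
from scipy.linalg import expm, eigh

x = np.random.random_sample((10, 10))
xs = (x + x.T) / 2                    # symmetrize so eigh's assumption holds

# Rebuild exp(xs) from the eigendecomposition, as kaityo_expm does
w, u = eigh(xs)
expm_from_eigh = u @ np.diag(np.exp(w)) @ u.T  # u is orthogonal, so inv(u) == u.T

np.testing.assert_allclose(expm(xs), expm_from_eigh, rtol=1e-7, atol=1e-10)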
I have a 2D array of size (3, 2) and I have to resample it using nearest-neighbor, linear and bicubic interpolation so that the size becomes (4, 3).
I am using Python, NumPy and SciPy for this.
How can I achieve resampling of the input array?
There is a good tutorial on re-sampling using convolution here.
For integer factor up-scaling:
import numpy
import scipy
from scipy import ndimage, signal
# Scale factor
factor = 2
# Input image
a = numpy.arange(16).reshape((4,4))
# Empty image enlarged by scale factor
b = numpy.zeros((a.shape[0]*factor, a.shape[1]*factor))
# Fill the new array with the original values
b[::factor,::factor] = a
# Define the convolution kernel
kernel_1d = scipy.signal.boxcar(factor)
kernel_2d = numpy.outer(kernel_1d, kernel_1d)
# Apply the kernel by convolution, separately in each axis
c = scipy.signal.convolve(b, kernel_2d, mode="valid")
Note that the factor can be different for each axis, and that you can also apply the convolution sequentially, on each axis. The kernels for bilinear and bicubic are also shown in the link, with the bilinear interpolation making use of a triangular signal (scipy.signal.triang) and bicubic being a piecewise function.
You should also mind which portion of the interpolated image is valid; along the edges there is not sufficient support for the kernel.
Bi-cubic interpolation is the best option of the three, as far as satellite imagery goes.
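For reference, here is a rough sketch of the bilinear variant described above (assuming a triangular kernel of width 2*factor - 1; depending on the SciPy version, triang lives in scipy.signal or scipy.signal.windows):

import numpy
import scipy.signal

factor = 2
a = numpy.arange(16).reshape((4, 4)).astype(float)

# Zero-stuffed image enlarged by the scale factor
b = numpy.zeros((a.shape[0]*factor, a.shape[1]*factor))
b[::factor, ::factor] = a

# A triangular kernel of width 2*factor - 1 performs linear interpolation in 1D;
# its outer product with itself gives the bilinear kernel in 2D.
kernel_1d = scipy.signal.windows.triang(2*factor - 1)
kernel_2d = numpy.outer(kernel_1d, kernel_1d)

# mode="same" keeps the enlarged size; as noted above, the edges lack full support.
c = scipy.signal.convolve(b, kernel_2d, mode="same")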
There is a simpler solution for this: scipy.ndimage.zoom (https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.zoom.html).
Nearest neighbor interpolation is order=0, bilinear interpolation is order=1, and bicubic is order=3 (default).
import numpy as np
import scipy.ndimage
x = np.arange(6).reshape(3,2).astype(float)
z = (4/3, 3/2)
print('Original array:\n{0}\n\n'.format(x))
methods=['nearest-neighbor', 'bilinear', 'biquadratic', 'bicubic']
for o in range(4):
    print('Resampled with {0} interpolation:\n {1}\n\n'.
          format(methods[o], scipy.ndimage.zoom(x, z, order=o)))
This results in:
Original array:
[[0. 1.]
[2. 3.]
[4. 5.]]
Resampled with nearest-neighbor interpolation:
[[0. 1. 1.]
[2. 3. 3.]
[2. 3. 3.]
[4. 5. 5.]]
Resampled with bilinear interpolation:
[[0. 0.5 1. ]
[1.33333333 1.83333333 2.33333333]
[2.66666667 3.16666667 3.66666667]
[4. 4.5 5. ]]
Resampled with biquadratic interpolation:
[[1.04083409e-16 5.00000000e-01 1.00000000e+00]
[1.11111111e+00 1.61111111e+00 2.11111111e+00]
[2.88888889e+00 3.38888889e+00 3.88888889e+00]
[4.00000000e+00 4.50000000e+00 5.00000000e+00]]
Resampled with bicubic interpolation:
[[5.55111512e-16 5.00000000e-01 1.00000000e+00]
[1.03703704e+00 1.53703704e+00 2.03703704e+00]
[2.96296296e+00 3.46296296e+00 3.96296296e+00]
[4.00000000e+00 4.50000000e+00 5.00000000e+00]]
I feel like numpy, scipy, or networkx has a method to do this but I just haven't figured it out yet.
My question is how to create a nonredundant correlation matrix, in the form of a DataFrame, from a redundant correlation matrix for LARGE DATASETS in the MOST EFFICIENT way (in Python)?
I'm using this method on a 7000x7000 matrix and it's taking forever on my MacBook Air with 4GB of RAM (I know, I definitely shouldn't use it for programming, but that's another discussion).
Example of redundant correlation matrix
Example of nonredundant correlation matrix
I gave a pretty naive way of doing it below but there has to be a better way. I like storing my matrices in sparse matrices and converting them to dataframes for storage purposes.
import pandas as pd
import numpy as np
import networkx as nx
from scipy.sparse import csr_matrix
#Example DataFrame
L_test = [[0.999999999999999,
0.374449352805868,
0.000347439531148995,
0.00103026903356954,
0.0011830950375467401],
[0.374449352805868,
1.0,
1.17392596672424e-05,
1.49428208843456e-07,
1.216664263989e-06],
[0.000347439531148995,
1.17392596672424e-05,
1.0,
0.17452569907144502,
0.238497202355299],
[0.00103026903356954,
1.49428208843456e-07,
0.17452569907144502,
1.0,
0.7557000865939779],
[0.0011830950375467401,
1.216664263989e-06,
0.238497202355299,
0.7557000865939779,
1.0]]
labels = ['AF001', 'AF002', 'AF003', 'AF004', 'AF005']
DF_1 = pd.DataFrame(L_test,columns=labels,index=labels)
#Create Nonredundant Similarity Matrix
n,m = DF_1.shape #they will be the same since it's adjacency
#Empty array to fill
A_tmp = np.zeros((n,m))
#Copy the lower triangle (including the diagonal) of the array
for i in range(n):
    for j in range(m):
        A_tmp[i,j] = DF_1.iloc[i,j]
        if j==i:
            break
#Make array sparse for storage
A_csr = csr_matrix(A_tmp)
#Recreate DataFrame
DF_2 = pd.DataFrame(A_csr.todense(),columns=DF_1.columns,index=DF_1.index)
DF_2.head()
I think you can create an array with np.tril and then multiply it by the DataFrame DF_1:
print np.tril(np.ones(DF_1.shape))
[[ 1. 0. 0. 0. 0.]
[ 1. 1. 0. 0. 0.]
[ 1. 1. 1. 0. 0.]
[ 1. 1. 1. 1. 0.]
[ 1. 1. 1. 1. 1.]]
print np.tril(np.ones(DF_1.shape)) * DF_1
AF001 AF002 AF003 AF004 AF005
AF001 1.000000 0.000000e+00 0.000000 0.0000 0
AF002 0.374449 1.000000e+00 0.000000 0.0000 0
AF003 0.000347 1.173926e-05 1.000000 0.0000 0
AF004 0.001030 1.494282e-07 0.174526 1.0000 0
AF005 0.001183 1.216664e-06 0.238497 0.7557 1
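If you also want the sparse storage mentioned in the question, the lower-triangular result converts the same way; a quick sketch:

# Convert the lower-triangular DataFrame to a sparse matrix for storage,
# then back to a DataFrame when needed.
from scipy.sparse import csr_matrix

DF_lower = np.tril(np.ones(DF_1.shape)) * DF_1
A_csr = csr_matrix(DF_lower.values)
DF_2 = pd.DataFrame(A_csr.todense(), columns=DF_1.columns, index=DF_1.index)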
I used scipy's pdist with the correlation metric to construct a correlation matrix, but the values were not matching the ones I obtained from numpy's corrcoef.
I applied pdist to two very simple 1-d arrays with the same values, [1,2,3] and [1,2,3]:
from scipy.spatial.distance import pdist, squareform
import pandas as pd
import numpy as np
df = pd.DataFrame([[1,1],[2,2],[3,3]]).transpose()
print np.corrcoef(df)
print squareform(pdist(df, metric='correlation'))
Instead of outputting a correlation value of 1, I got 2.2E-16 from the pdist:
[[ 1. 1.]
[ 1. 1.]]
[[ 0.00000000e+00 2.22044605e-16]
[ 2.22044605e-16 0.00000000e+00]]
The following is the code I found in scipy for their correlation metric:
umu = u.mean()
vmu = v.mean()
um = u - umu
vm = v - vmu
dist = 1.0 - np.dot(um, vm) / (norm(um) * norm(vm))
"Correlation distance" is not the same as the correlation coefficient. A "distance" between two equal points is supposed to be 0. (If you search for "correlation distance", note that there is yet another concept, the "distance correlation", which is not the same as the "correlation distance".)