scipy linkage format - python

I have written my own clustering routine and would like to produce a dendrogram. The easiest way to do this would be to use the scipy dendrogram function. However, this requires the input to be in the same format that the scipy linkage function produces. I cannot find an example of how the output of this is formatted and was wondering whether someone out there can enlighten me.

I agree with https://stackoverflow.com/users/1167475/mortonjt that the documentation does not fully explain the indexing of intermediate clusters, though I also agree with https://stackoverflow.com/users/1354844/dkar that the format is otherwise precisely explained.
Using the example data from this question: Tutorial for scipy.cluster.hierarchy
import numpy as np
import scipy.cluster.hierarchy as hac

a = np.array([[0.1, 2.5],
              [1.5, 0.4],
              [0.3, 1.0],
              [1.0, 0.8],
              [0.5, 0.0],
              [0.0, 0.5],
              [0.5, 0.5],
              [2.7, 2.0],
              [2.2, 3.1],
              [3.0, 2.0],
              [3.2, 1.3]])
A linkage matrix can be built using single linkage (i.e., merging on the closest pair of points):
z = hac.linkage(a, method="single")
array([[ 7.        ,  9.        ,  0.3       ,  2.        ],
       [ 4.        ,  6.        ,  0.5       ,  2.        ],
       [ 5.        , 12.        ,  0.5       ,  3.        ],
       [ 2.        , 13.        ,  0.53851648,  4.        ],
       [ 3.        , 14.        ,  0.58309519,  5.        ],
       [ 1.        , 15.        ,  0.64031242,  6.        ],
       [10.        , 11.        ,  0.72801099,  3.        ],
       [ 8.        , 17.        ,  1.2083046 ,  4.        ],
       [ 0.        , 16.        ,  1.5132746 ,  7.        ],
       [18.        , 19.        ,  1.92353841, 11.        ]])
As the documentation explains, clusters with indices below n (here: 11) are simply the data points in the original matrix a. The intermediate clusters that follow are indexed successively.
Thus, clusters 7 and 9 (the first merge) are merged into cluster 11, and clusters 4 and 6 into cluster 12. Now observe the third row, which merges cluster 5 (a point from a) and intermediate cluster 12, resulting in a merge distance of 0.5. The single method means that this new distance is the distance between a[5] and the closest point in cluster 12, i.e. the closer of a[4] and a[6]. Let's check:
In [198]: norm([a[5]-a[4]])
Out[198]: 0.70710678118654757
In [199]: norm([a[5]-a[6]])
Out[199]: 0.5
This new cluster is intermediate cluster 13, which is subsequently merged with a[2]. Thus, the new distance should be the smallest distance between a[2] and any of a[4], a[5], a[6].
In [200]: norm([a[2]-a[4]])
Out[200]: 1.019803902718557
In [201]: norm([a[2]-a[5]])
Out[201]: 0.58309518948452999
In [202]: norm([a[2]-a[6]])
Out[202]: 0.53851648071345048
Which, as can be seen, also checks out, and explains how the distances and indices of new intermediate clusters are reported.

This is from the scipy.cluster.hierarchy.linkage() function documentation; I think it's a pretty clear description of the output format:
A (n-1) by 4 matrix Z is returned. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n + i. A cluster with an index less than n corresponds to one of the original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.
Do you need something more?
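If the goal is simply to feed a hand-built matrix of that shape to the dendrogram function, a minimal sketch could look like this (the merge rows and distances below are made up purely for illustration):

import numpy as np
from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt

# 4 observations (indices 0-3); each row of Z is [idx1, idx2, distance, count]
Z = np.array([[0.0, 1.0, 0.4, 2.0],   # merge points 0 and 1 -> cluster 4
              [2.0, 3.0, 0.7, 2.0],   # merge points 2 and 3 -> cluster 5
              [4.0, 5.0, 1.2, 4.0]])  # merge clusters 4 and 5 -> cluster 6
dendrogram(Z)
plt.show()

Any clustering routine that can express its merges in this (n-1) by 4 form can be plotted the same way.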

The scipy documentation is accurate, as dkar pointed out ... but it's a little bit difficult to turn the returned data into something that is usable for further analysis.
In my opinion they should include the ability to return the data in a tree like data structure. The code below will iterate through the matrix and build a tree:
from scipy.cluster.hierarchy import linkage
import numpy as np

a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50,])
centers = np.concatenate((a, b),)

def create_tree(centers):
    clusters = {}
    to_merge = linkage(centers, method='single')
    for i, merge in enumerate(to_merge):
        if merge[0] <= len(to_merge):
            # if it is an original point, read it from the centers array
            a = centers[int(merge[0])]
        else:
            # otherwise read the cluster that has already been created
            a = clusters[int(merge[0])]
        if merge[1] <= len(to_merge):
            b = centers[int(merge[1])]
        else:
            b = clusters[int(merge[1])]
        # the cluster formed at step i is given index n + i by scipy,
        # i.e. len(to_merge) + 1 + i here (since len(to_merge) == n - 1)
        clusters[1 + i + len(to_merge)] = {
            'children': [a, b]
        }
        # ^ you could optionally store other info here (e.g. distances)
    return clusters

print(create_tree(centers))
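Note that scipy can also build the tree for you: scipy.cluster.hierarchy.to_tree converts a linkage matrix into nested ClusterNode objects (with id, dist, count and left/right children). A quick sketch, reusing the centers array from above:

from scipy.cluster.hierarchy import linkage, to_tree

root = to_tree(linkage(centers, method='single'))
print(root.id, root.dist, root.count)   # the final cluster (index 2n - 2)
print(root.left.id, root.right.id)      # its two child clusters
print(root.pre_order())                 # leaf ids in dendrogram order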

Here's another piece of code that performs the same function. This version tracks the distance (merge height) of each cluster (node_id), and confirms the number of members.
This uses the scipy linkage() function, which is the same foundation used for agglomerative clustering.
from scipy.cluster.hierarchy import linkage
import numpy as np
import copy

# data_x is assumed to be the (n_points, n_features) input data
Z = linkage(data_x, 'ward')
n_points = data_x.shape[0]

# start with one singleton cluster per original observation
clusters = [dict(node_id=i, left=i, right=i, members=[i],
                 distance=0, log_distance=0, n_members=1)
            for i in range(n_points)]

for z_i in range(Z.shape[0]):
    row = Z[z_i]
    cluster = dict(node_id=z_i + n_points, left=int(row[0]), right=int(row[1]),
                   members=[], log_distance=np.log(row[2]),
                   distance=row[2], n_members=int(row[3]))
    # the members of the new cluster are the members of its two children
    cluster["members"].extend(copy.deepcopy(clusters[cluster["left"]]["members"]))
    cluster["members"].extend(copy.deepcopy(clusters[cluster["right"]]["members"]))
    clusters.append(cluster)

# lookup tables: how each cluster splits, and what each cluster merges into
on_split = {c["node_id"]: [c["left"], c["right"]] for c in clusters}
up_merge = {c["left"]: {"into": c["node_id"], "with": c["right"]} for c in clusters}
up_merge.update({c["right"]: {"into": c["node_id"], "with": c["left"]} for c in clusters})
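As a quick, hypothetical usage example (using only the names defined above), the lookup tables let you walk the hierarchy in either direction:

leaf = 0                                  # any original point id
print(up_merge[leaf])                     # which cluster absorbs it, and with which sibling
root_id = clusters[-1]["node_id"]         # the last merge is the root
print(on_split[root_id])                  # the two sub-clusters the root splits into
print(clusters[root_id]["n_members"])     # should equal n_points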

(input and output images omitted)
Consider [input] to be the data for which you want to draw a dendrogram.
When you use linkage, it returns a matrix with four columns.
Columns 1 and 2 represent the formation of the clusters, in order:
i.e.
rows 2 and 3 merge first, and that cluster is named 5 (2 and 3 are the indices of the 2nd and 3rd rows);
1 and 5 then form the second cluster, which is named 6.
Column 3 represents the distance between the merged clusters.
Column 4 represents how many data points are involved in making the cluster.
(dendrogram image omitted)
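For a concrete, made-up illustration of that indexing, assuming five 1-D points:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

pts = np.array([[0.0], [0.1], [5.0], [5.1], [10.0]])
Z = linkage(pts, method='single')
print(Z)
# with n = 5 observations, the cluster formed by the first row of Z is
# numbered 5, the next one 6, and so on; dendrogram(Z) draws the result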

Related

Why is np.corrcoef not working as expected with two vectors of dimension two?

I am trying to calculate the Pearson correlation coefficient between two vectors in 2 dimensions using np.corrcoef. When the dimension of the vectors is different from two, it works fine, see for example:
import numpy as np
x = np.random.uniform(-10, 10, 3)
y = np.random.uniform(-10, 10, 3)
print(x, y)
print(np.corrcoef(x,y))
Output:
[-6.59840638 -1.81100446 5.6158669 ] [ 6.7200348 -7.0373677 -2.11395157]
[[ 1. -0.53299763]
[-0.53299763 1. ]]
However, when the dimension is exactly two, the correlation is wrong, with the only possible values being 1 or -1:
import numpy as np
x = np.random.uniform(-10, 10, 2)
y = np.random.uniform(-10, 10, 2)
print(x, y)
print(np.corrcoef(x,y))
Output 1:
[-2.61268708 8.32602293] [6.42020314 3.43806504]
[[ 1. -1.]
[-1. 1.]]
Output 2:
[ 5.04249697 -3.6599369 ] [6.12936665 3.15827974]
[[1. 1.]
[1. 1.]]
Output 3:
[7.33503682 7.7145613 ] [-9.54304108 7.43840944]
[[1. 1.]
[1. 1.]]
Question: What's happening and how to solve it?
There are a couple of misunderstandings leading to your confusion:
I'll use row-major order, as numpy does: "Each row of x represents a variable, and each column a single observation of all those variables."
The Pearson correlation coefficient describes the linear relationship between 2 variables. If you only have 2 data points for each variable, you can always fit a line exactly through them, and after normalization you'll always get 1 or -1.
A covariance or correlation matrix is usually calculated amongst the components of a random vector X = (X1, ..., Xn).T. When you say you want the correlation between 2 vectors, it is unclear whether you want the cross-correlation between X and Y, in which case you need np.correlate.
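To illustrate both points (a small sketch, not part of the original answer): with only two observations per variable, corrcoef always returns +1 or -1 off the diagonal, while np.correlate computes the cross-correlation of the two sequences themselves.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, 2)
y = rng.uniform(-10, 10, 2)
print(np.corrcoef(x, y))     # off-diagonal entries are exactly +1 or -1
print(np.correlate(x, y))    # single-lag cross-correlation: sum(x * y)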

Why is the correlation coefficient of these two lists equal to 1?

I have two lists a and b as follows:
a = [4,4,4,1.1]
b = [4,4,4,1.2]
It is clear that the last value in each list is different, so why do I get a correlation coefficient (from numpy) equal to 1 in the code below?
from numpy import corrcoef
print(corrcoef(a,b))
output:
[[1. 1.]
[1. 1.]]
You assume that just because the last values are different, the correlation coefficient should not be 1. That assumption, however, is flawed.
The important thing to realize is that correlation is calculated only after adjusting for the scales of each list/feature. With that in mind, you only have two unique pairs of datapoints. Given only two datapoints, the data can almost* always be arranged so that the correlation comes out as 1 or -1. This is because the actual values don't matter: they are scaled accordingly before comparison.
For example:
import numpy as np
a = [60, 30]
b = [1050, 490]
print(np.corrcoef(a,b)) #still gives 1.
Compare this to what you essentially passed:
import numpy as np
a = [4, 1.1]
b = [4, 1.2]
print(np.corrcoef(a,b)) #still gives 1.
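To make the scaling explicit, here is a small sketch standardizing the two points by hand: each variable becomes (+1, -1) or (-1, +1) after centering and scaling, so the average of the products is always +1 or -1.

import numpy as np

x = np.array([4.0, 1.1])
y = np.array([4.0, 1.2])
xs = (x - x.mean()) / x.std()    # -> [ 1., -1.]
ys = (y - y.mean()) / y.std()    # -> [ 1., -1.]
print((xs * ys).mean())          # -> 1.0, the Pearson coefficient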
Two datapoints don't contain enough information to show that the correlation can be a specific value that is not equal to 1 or -1.
To see why a correlation of 1 can make sense here, consider a third point that I can add.
a = [6.9, 4, 1.1] #gaps of 2.9
b = [6.8, 4, 1.2] #gaps of 2.8
print(np.corrcoef(a,b)) #still gives 1.
Perhaps this makes it slightly clearer why the correlation can be 1, because the data points in the two lists are still moving together perfectly.
To get a different correlation value with 3 points, we can compare with this:
a = [7, 4, 1.1]
b = [7, 4, 1.2]
print(np.corrcoef(a,b)) #gives 0.99994879
Now we have enough datapoints to show that the correlation is not perfectly 1.
*Regarding the "almost": exceptions are cases where one feature does not change at all, such as a = [0, 0] with b = [0, 1] (the standard deviation of a is zero, so the coefficient is undefined).

How to Create an Evenly spaced Numpy Array from Set Values

I am looking to create an array via numpy that generates equally spaced values from interval to interval, based on the values in a given array.
I understand there is:
np.linspace(min, max, num_elements)
but what I am looking for is this: imagine you have a set of values:
arr = np.array([1, 2, 4, 6, 7, 8, 12, 10])
When I do:
#some numpy function
arr = np.somefunction(arr, 16)
>>>arr
>>> array([1, 1.12, 2, 2.5, 4, 4.5, etc...])
# new array with 16 elements including all the numbers from previous
# array with generated numbers to 'evenly space them out'
So I am looking for the same functionality as linspace(), but one that takes all the elements in an array and creates another array with the desired number of elements, evenly spaced between the set values of the original array. I hope I am making myself clear on this.
What I am actually trying to do with this setup is take existing x,y data and expand it to have more 'control points', in a sense, so I can do calculations further down the line.
Thank you in advance.
import numpy as np

xp = np.arange(len(arr))                     # x coordinates of arr
targets = np.arange(0, len(arr) - 0.5, 0.5)  # x coordinates desired
np.interp(targets, xp, arr)
The above does simple linear interpolation of 8 data points at 0.5 spacing for a total of 15 points (because of fenceposting):
array([ 1. ,  1.5,  2. ,  3. ,  4. ,  5. ,  6. ,  6.5,  7. ,
        7.5,  8. , 10. , 12. , 11. , 10. ])
There are some additional options you can use in numpy.interp to tweak the behavior. You can also generate targets in different ways if you want.
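If you need an exact element count (the question asks for 16), one option, assuming plain linear interpolation between the given values is acceptable, is to generate the target coordinates with np.linspace instead:

import numpy as np

arr = np.array([1, 2, 4, 6, 7, 8, 12, 10])
xp = np.arange(len(arr))                    # original x coordinates
targets = np.linspace(0, len(arr) - 1, 16)  # 16 evenly spaced positions
print(np.interp(targets, xp, arr))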

find Markov steady state with left eigenvalues (using numpy or scipy)

I need to find the steady state of Markov models using the left eigenvectors of their transition matrices using some python code.
It has already been established in this question that scipy.linalg.eig fails to provide actual left eigenvectors as described, but a fix is demonstrated there. The official documentation is mostly useless and incomprehensible as usual.
A bigger problem than the incorrect format is that the eigenvalues produced are not in any particular order (not sorted, and different each time). So if you want to find the left eigenvectors that correspond to the eigenvalue 1, you have to hunt for them, and that poses its own problem (see below). The math is clear, but how to get python to compute this and return the correct eigenvectors is not clear. Other answers to this question, like this one, don't seem to be using the left eigenvectors, so those can't be correct solutions.
This question provides a partial solution, but it doesn't account for the unordered eigenvalues of larger transition matrices. So, just using
leftEigenvector = scipy.linalg.eig(A,left=True,right=False)[1][:,0]
leftEigenvector = leftEigenvector / sum(leftEigenvector)
is close, but doesn't work in general because the entry in the [:,0] position may not be the eigenvector for the correct eigenvalue (and in my case it is usually not).
Okay, but the output of scipy.linalg.eig(A, left=True, right=False) is a pair in which element [0] is an array of the eigenvalues (in no particular order), followed in position [1] by an array whose columns are the eigenvectors, in an order corresponding to those eigenvalues.
I don't know a good way to sort or search that whole thing by the eigenvalues to pull out the correct eigenvectors (all eigenvectors with eigenvalue 1, normalized by the sum of the vector entries). My thinking is to get the indices of the eigenvalues that equal 1, and then pull those columns from the array of eigenvectors. My version of this is slow and cumbersome. First I have a function (that doesn't quite work) to find the positions in a list that match a value:
# Find the positions of the element a in theList
def findPositions(theList, a):
    return [i for i, x in enumerate(theList) if x == a]
Then I use it like this to get the eigenvectors whose eigenvalue equals 1:
M = transitionMatrix( G )
leftEigenvectors = scipy.linalg.eig(M, left=True, right=False)
unitEigenvaluePositions = findPositions(leftEigenvectors[0], 1.000)
steadyStateVectors = []
for i in unitEigenvaluePositions:
    thisEigenvector = leftEigenvectors[1][:, i]
    thisEigenvector = thisEigenvector / sum(thisEigenvector)
    steadyStateVectors.append(thisEigenvector)
print(steadyStateVectors)
But actually this doesn't work. There is one eigenvalue = 1.00000000e+00 +0.00000000e+00j that is not found even though two others are.
My expectation is that I am not the first person to use python to find stationary distributions of Markov models. Somebody who is more proficient/experienced probably has a working general solution (whether using numpy or scipy or not). Considering how popular Markov models are I expected there to be a library just for them to perform this task, and maybe it does exist but I couldn't find one.
You linked to How do I find out eigenvectors corresponding to a particular eigenvalue of a matrix? and said it doesn't compute the left eigenvector, but you can fix that by working with the transpose.
For example,
In [901]: import numpy as np
In [902]: import scipy.sparse.linalg as sla
In [903]: M = np.array([[0.5, 0.25, 0.25, 0], [0, 0.1, 0.9, 0], [0.2, 0.7, 0, 0.1], [0.2, 0.3, 0, 0.5]])
In [904]: M
Out[904]:
array([[ 0.5 , 0.25, 0.25, 0. ],
[ 0. , 0.1 , 0.9 , 0. ],
[ 0.2 , 0.7 , 0. , 0.1 ],
[ 0.2 , 0.3 , 0. , 0.5 ]])
In [905]: eval, evec = sla.eigs(M.T, k=1, which='LM')
In [906]: eval
Out[906]: array([ 1.+0.j])
In [907]: evec
Out[907]:
array([[-0.32168797+0.j],
[-0.65529032+0.j],
[-0.67018328+0.j],
[-0.13403666+0.j]])
In [908]: np.dot(evec.T, M).T
Out[908]:
array([[-0.32168797+0.j],
[-0.65529032+0.j],
[-0.67018328+0.j],
[-0.13403666+0.j]])
To normalize the eigenvector (which you know should be real):
In [913]: u = (evec/evec.sum()).real
In [914]: u
Out[914]:
array([[ 0.18060201],
[ 0.36789298],
[ 0.37625418],
[ 0.07525084]])
In [915]: np.dot(u.T, M).T
Out[915]:
array([[ 0.18060201],
[ 0.36789298],
[ 0.37625418],
[ 0.07525084]])
If you don't know the multiplicity of eigenvalue 1 in advance, see @pv.'s comment showing code using scipy.linalg.eig. Here's an example:
In [984]: M
Out[984]:
array([[ 0.9 , 0.1 , 0. , 0. , 0. , 0. ],
[ 0.3 , 0.7 , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0.25, 0.75, 0. , 0. ],
[ 0. , 0. , 0.5 , 0.5 , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. , 1. ],
[ 0. , 0. , 0. , 0. , 1. , 0. ]])
In [985]: import scipy.linalg as la
In [986]: evals, lvecs = la.eig(M, right=False, left=True)
In [987]: tol = 1e-15
In [988]: mask = abs(evals - 1) < tol
In [989]: evals = evals[mask]
In [990]: evals
Out[990]: array([ 1.+0.j, 1.+0.j, 1.+0.j])
In [991]: lvecs = lvecs[:, mask]
In [992]: lvecs
Out[992]:
array([[ 0.9486833 , 0. , 0. ],
[ 0.31622777, 0. , 0. ],
[ 0. , -0.5547002 , 0. ],
[ 0. , -0.83205029, 0. ],
[ 0. , 0. , 0.70710678],
[ 0. , 0. , 0.70710678]])
In [993]: u = lvecs/lvecs.sum(axis=0, keepdims=True)
In [994]: u
Out[994]:
array([[ 0.75, -0. , 0. ],
[ 0.25, -0. , 0. ],
[ 0. , 0.4 , 0. ],
[ 0. , 0.6 , 0. ],
[ 0. , -0. , 0.5 ],
[ 0. , -0. , 0.5 ]])
In [995]: np.dot(u.T, M).T
Out[995]:
array([[ 0.75, 0. , 0. ],
[ 0.25, 0. , 0. ],
[ 0. , 0.4 , 0. ],
[ 0. , 0.6 , 0. ],
[ 0. , 0. , 0.5 ],
[ 0. , 0. , 0.5 ]])
All right, I had to make some changes while implementing Warren's solution, and I've included those below. It's basically the same, so he gets all the credit, but the realities of numerical approximation with numpy and scipy required more massaging, which I thought would be helpful for others trying to do this in the future. I also changed the variable names to be super noob-friendly.
Please let me know if I got anything wrong or there are further recommended improvements (e.g. for speed).
# in this case my Markov model is a weighted directed graph, so convert that nx.graph (G) into its transition matrix
M = transitionMatrix( G )
#create a list of the left eigenvalues and a separate array of the left eigenvectors
theEigenvalues, leftEigenvectors = scipy.linalg.eig(M, right=False, left=True)
# for stationary distribution the eigenvalues and vectors are always real, and this speeds it up a bit
theEigenvalues = theEigenvalues.real
leftEigenvectors = leftEigenvectors.real
# set how close to 1 an eigenvalue must be to count as exactly 1 ... 1e-15 was too strict to find one of the actual unit eigenvalues
tolerance = 1e-10
# create a filter to collect the eigenvalues that are near enough to 1
mask = abs(theEigenvalues - 1) < tolerance
# apply that filter
theEigenvalues = theEigenvalues[mask]
# keep only the eigenvectors whose eigenvalue is (approximately) 1
leftEigenvectors = leftEigenvectors[:, mask]
# convert all the tiny and negative values to zero to isolate the actual stationary distributions
leftEigenvectors[leftEigenvectors < tolerance] = 0
# normalize each distribution by the sum of the eigenvector columns
attractorDistributions = leftEigenvectors / leftEigenvectors.sum(axis=0, keepdims=True)
# this checks that the vectors are actually the left eigenvectors, but I guess it's not needed for normal usage
#attractorDistributions = np.dot(attractorDistributions.T, M).T
# convert the column vectors into row vectors (lists) for each attractor (the standard output for this kind of analysis)
attractorDistributions = attractorDistributions.T
# a list of the states in any attractor, with the approximate stationary distribution within THAT attractor (e.g. for graph coloring)
theSteadyStates = np.sum(attractorDistributions, axis=0)
Putting that all together in an easy copy-and-paste format:
M = transitionMatrix( G )
theEigenvalues, leftEigenvectors = scipy.linalg.eig(M, right=False, left=True)
theEigenvalues = theEigenvalues.real
leftEigenvectors = leftEigenvectors.real
tolerance = 1e-10
mask = abs(theEigenvalues - 1) < tolerance
theEigenvalues = theEigenvalues[mask]
leftEigenvectors = leftEigenvectors[:, mask]
leftEigenvectors[leftEigenvectors < tolerance] = 0
attractorDistributions = leftEigenvectors / leftEigenvectors.sum(axis=0, keepdims=True)
attractorDistributions = attractorDistributions.T
theSteadyStates = np.sum(attractorDistributions, axis=0)
Using this analysis on a generated Markov model produced one attractor (of three) with a steady state distribution of 0.19835218 and 0.80164782, compared to the mathematically exact values of 0.2 and 0.8. So that's more than 0.1% off, kind of a big error for science. That's not a REAL problem, because if accuracy is important then, now that the individual attractors have been identified, a more accurate analysis of the behavior within each attractor can be done using a matrix subset.
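As a rough sketch of that matrix-subset idea (attractor_states is a hypothetical list of the state indices belonging to one attractor; M is the full row-stochastic transition matrix from above):

import numpy as np
import scipy.linalg

attractor_states = [4, 5]                       # e.g. the states of one attractor
subM = M[np.ix_(attractor_states, attractor_states)]
subM = subM / subM.sum(axis=1, keepdims=True)   # re-normalize rows (a no-op up to numerical noise for a closed attractor)
vals, vecs = scipy.linalg.eig(subM, right=False, left=True)
v = vecs[:, np.isclose(vals, 1)].real[:, 0]
print(v / v.sum())                              # stationary distribution within that attractor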

Why do I get rows of zeros in my 2D fft?

I am trying to replicate the results from a paper.
"Two-dimensional Fourier Transform (2D-FT) in space and time along sections of constant latitude (east-west) and longitude (north-south) were used to characterize the spectrum of the simulated flux variability south of 40degS." - Lenton et al(2006)
The figures published show "the log of the variance of the 2D-FT".
I have tried to create an array consisting of the seasonal cycle of similar data as well as the noise. I have defined the noise as the original array minus the signal array.
Here is the code that I used to plot the 2D-FT of the signal array averaged in latitude:
import numpy as np
from numpy import ma
from matplotlib import pyplot as plt
from Scientific.IO.NetCDF import NetCDFFile
### input directory
indir = '/home/nicholas/data/'
### get the flux data which is in
### [time(5day ave for 10 years),latitude,longitude]
nc = NetCDFFile(indir + 'CFLX_2000_2009.nc','r')
cflux_southern_ocean = nc.variables['Cflx'][:,10:50,:]
cflux_southern_ocean = ma.masked_values(cflux_southern_ocean,1e+20) # mask land
nc.close()
cflux = cflux_southern_ocean*1e08 # change units of data from mmol/m^2/s
### create an array that consists of the seasonal signal for each pixel
year_stack = np.split(cflux, 10, axis=0)
year_stack = np.array(year_stack)
signal_array = np.tile(np.mean(year_stack, axis=0), (10, 1, 1))
signal_array = ma.masked_where(signal_array > 1e20, signal_array) # need to mask
### average the array over latitude(or longitude)
signal_time_lon = ma.mean(signal_array, axis=1)
### do a 2D Fourier Transform of the time/space image
ft = np.fft.fft2(signal_time_lon)
mgft = np.abs(ft)
ps = mgft**2
log_ps = np.log(mgft)
log_mgft= np.log(mgft)
Every second row of the ft consists completely of zeros. Why is this?
Would it be acceptable to add a small random number to the signal to avoid this?
signal_time_lon = signal_time_lon + np.random.randint(0,9,size=(730, 182))*1e-05
EDIT: Adding images and clarifying meaning
The output of rfft2 still appears to be a complex array. Using fftshift shifts the edges of the image to the centre; I still have a power spectrum regardless. I expect that the reason that I get rows of zeros is that I have re-created the timeseries for each pixel. The ft[0, 0] pixel contains the mean of the signal. So the ft[1, 0] corresponds to a sinusoid with one cycle over the entire signal in the rows of the starting image.
Here is the starting image, produced with the following code:
plt.pcolormesh(signal_time_lon); plt.colorbar(); plt.axis('tight')
Here is the result, produced with the following code:
ft = np.fft.rfft2(signal_time_lon)
mgft = np.abs(ft)
ps = mgft**2
log_ps = np.log1p(mgft)
plt.pcolormesh(log_ps); plt.colorbar(); plt.axis('tight')
It may not be clear in the image, but it is only every second row that is completely zero. Every tenth pixel (log_ps[10, 0]) is a high value; the other pixels (log_ps[2, 0], log_ps[4, 0], etc.) have very low values.
Consider the following example:
In [59]: from scipy import absolute, fft
In [60]: absolute(fft([1,2,3,4]))
Out[60]: array([ 10. , 2.82842712, 2. , 2.82842712])
In [61]: absolute(fft([1,2,3,4, 1,2,3,4]))
Out[61]:
array([ 20. , 0. , 5.65685425, 0. ,
4. , 0. , 5.65685425, 0. ])
In [62]: absolute(fft([1,2,3,4, 1,2,3,4, 1,2,3,4]))
Out[62]:
array([ 30. , 0. , 0. , 8.48528137,
0. , 0. , 6. , 0. ,
0. , 8.48528137, 0. , 0. ])
If X[k] = fft(x) and Y[k] = fft([x x]), then Y[2k] = 2*X[k] for k in {0, 1, ..., N-1}, and Y[m] = 0 for odd m.
Therefore, I would look into how your signal_time_lon is being tiled. That may be where the problem lies.
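A quick numerical check of that property (a small sketch, not part of the original answer):

import numpy as np

x = np.random.rand(16)
X = np.fft.fft(x)
Y = np.fft.fft(np.tile(x, 2))        # the signal repeated twice, like a tiled seasonal cycle
print(np.allclose(Y[::2], 2 * X))    # True: even bins are doubled
print(np.max(np.abs(Y[1::2])))       # ~0: odd bins vanish, giving the rows of zeros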
