2D local maxima and minima in Python

I have a data frame, df, representing a correlation matrix, which I have visualized as a heatmap with example extrema marked. Every point has (x, y, value).
I am looking into getting the local extrema. I looked into argrelextrema; I tried it on individual rows and the results were as expected, but it didn't work for 2D. I have also looked into scipy.signal.find_peaks, but that is for a 1D array.
Is there anything in Python that will return the local extrema over/under certain values (a threshold)?
Something like an array of (x, y, value)? If not, can you point me in the right direction?

This is a tricky question, because you need to carefully define the notion of how "big" a maximum or minimum needs to be before it is relevant. For example, imagine that you have a patch containing the following 5x5 grid of pixels:
im = np.array([[0, 0, 0, 0, 0],
               [0, 5, 5, 5, 0],
               [0, 5, 4, 5, 0],
               [0, 5, 5, 5, 0],
               [0, 0, 0, 0, 0]])
This might be looked at as a local minimum, because 4 is less than the surrounding 5s. OTOH, it might be looked at as a local maximum, where the single lone 4 pixel is just "noise", and the 3x3 patch of average 4.89-intensity pixels is actually a single local maximum. This is commonly known as the "scale" at which you are viewing the image.
In any case, you can estimate the local derivative in one direction by using a finite difference in that direction. The x direction might be something like:
k = np.array([[-1, 0, 1],
              [-1, 0, 1],
              [-1, 0, 1]])
Applying this filter to the image patch defined above gives:
>>> cv2.filter2D(im, cv2.CV_64F, k)[1:-1, 1:-1]
array([[  9.,   0.,  -9.],
       [ 14.,   0., -14.],
       [  9.,   0.,  -9.]])
Applying a similar filter in the y direction will transpose this. The only point in here with a 0 in both the x and the y directions is the very middle, which is the 4 that we decided was a local minimum. This is tantamount to checking that the gradient is 0 in both x and y.
This whole process can be extended to find the larger single local maximum that we have identified. You'll use a larger filter, e.g.
k = np.array([[-2, -1, 0, 1, 2],
              [-2, -1, 0, 1, 2], ...
Since the 4 makes the local maximum an approximate thing, you'll need to use some "approximate" logic, i.e. you'll look for values that are "close" to 0. Just how close depends on how fudgy you are willing to allow the local extrema to be. To sum up, the two fudge factors here are (1) the filter size and (2) the "~=0" fudge factor.
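As a minimal sketch of the recipe above (my own illustration, using the same 5x5 patch and 3x3 kernels; the tolerance value is an arbitrary choice): apply the x and y finite-difference filters, ignore the border, and keep the points where both responses are approximately zero:
import numpy as np
import cv2

im = np.array([[0, 0, 0, 0, 0],
               [0, 5, 5, 5, 0],
               [0, 5, 4, 5, 0],
               [0, 5, 5, 5, 0],
               [0, 0, 0, 0, 0]], dtype=np.float64)

kx = np.array([[-1, 0, 1],
               [-1, 0, 1],
               [-1, 0, 1]], dtype=np.float64)
ky = kx.T

gx = cv2.filter2D(im, cv2.CV_64F, kx)   # finite difference in x
gy = cv2.filter2D(im, cv2.CV_64F, ky)   # finite difference in y

tol = 1e-6                              # the "~=0" fudge factor
flat = (np.abs(gx) < tol) & (np.abs(gy) < tol)
flat[0, :] = flat[-1, :] = flat[:, 0] = flat[:, -1] = False  # skip border effects

ys, xs = np.nonzero(flat)
print(list(zip(xs, ys, im[ys, xs])))    # candidate extrema as (x, y, value) -> the middle 4
Swapping in the larger 5x5 kernel and loosening tol moves you to the coarser scale discussed above, where the whole 3x3 patch counts as a single maximum.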

Related

Why are there discrepancies when generating a distance matrix with scipy pdist(metric = 'jaccard') vs scipy jaccard?

I am comparing the Jaccard distance matrix I get when I process a dataset using pdist and a DIY Jaccard distance matrix function. I'm getting different results in my output distance matrices and I'm not sure why.
I think one of the following is the cause:
My implementation of jaccard distance calculation is wrong
scipy.spatial.distance.pdist(metric = 'jaccard') and scipy.spatial.distance.jaccard calculate Jaccard distance in different ways (seems unlikely, as they're both in scipy.spatial.distance)
squareform is doing something to my data, potentially a normalisation
The docs for squareform go a bit over my head, so some form of normalisation might be what's happening. However, the squareform-ed distance matrix does not have the same relative distance magnitudes between cells, which is confusing (e.g. row 0 in my DIY distance matrix is 0, 0.571429, 1, while with pdist it is 0, 1, 1 - the middle value is much higher with pdist).
Can anyone explain why I'm getting a different distance matrix when the data is being analysed with the same metric?
My code:
import numpy as np
from scipy.spatial.distance import jaccard, squareform, pdist

def jaccard_dissimilarity(feature_list1, feature_list2, filler_val):  # binary
    # I don't care about every value in the array for my use case, so don't want to include them in my comparison
    all_features = set([i for i in feature_list1 if i != filler_val])
    all_features.update(set([i for i in feature_list2 if i != filler_val]))
    counts_1 = [1 if feature in feature_list1 else 0 for feature in all_features]
    counts_2 = [1 if feature in feature_list2 else 0 for feature in all_features]
    return jaccard(counts_1, counts_2)

data_array = np.array([[1, 2, 3, 4, 5],
                       [3, 4, 5, 6, 7],
                       [8, 9, 10, 11, 12]])

# =============================================================================
# DIY distance matrix
# =============================================================================
# set filler val to None, so the arrays being compared are equivalent to pdist
dist_diy = np.array([[jaccard_dissimilarity(a, b, None) for a in data_array] for b in data_array])

# =============================================================================
# pdist distance matrix
# =============================================================================
dist_pdist = squareform(pdist(data_array, metric='jaccard'))
Input array:
1 2 3 4 5
3 4 5 6 7
8 9 10 11 12
dist_diy:
0 0.571429 1
0.571429 0 1
1 1 0
dist_pdist:
0 1 1
1 0 1
1 1 0
Looks like pdist considers objects at a given index when comparing arrays, rather than just what objects are present in the array itself - if I change data_array[1] to 3, 4, 5, 4, 5 then the distance matrix changes to reflect the fact that data_array[0][3:5] == data_array[1][3:5]:
0 0.6 1
0.6 0 1
1 1 0
The behaviour is discussed here, but the arrays don't have to be boolean based on the above tests (if the arrays were treated as boolean, the distance matrix would not change, as all the numbers are non-zero and therefore evaluate to True).
The DIY function considered the objects present rather than the index at which those objects were found, hence the discrepancy!
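A quick way to see this (my own check, not part of the original post) is to convert each row to a set-membership encoding first, one boolean column per distinct value; pdist then compares membership flags position by position, which reproduces the DIY numbers:
import numpy as np
from scipy.spatial.distance import pdist, squareform

data_array = np.array([[1, 2, 3, 4, 5],
                       [3, 4, 5, 6, 7],
                       [8, 9, 10, 11, 12]])

# one boolean column per distinct value occurring anywhere in the data
all_vals = np.unique(data_array)
membership = np.array([[val in row for val in all_vals] for row in data_array])

print(squareform(pdist(membership, metric='jaccard')))
# [[0.         0.57142857 1.        ]
#  [0.57142857 0.         1.        ]
#  [1.         1.         0.        ]]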

Measure of Feature Importance in PCA

I am doing Principal Component Analysis (PCA) and I'd like to find out which features contribute the most to the result.
My intuition is to sum up all the absolute values of the individual contribution of the features to the individual components.
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-1, -1, 4, 1], [-2, -1, 4, 2], [-3, -2, 4, 3], [1, 1, 4, 4], [2, 1, 4, 5], [3, 2, 4, 6]])
pca = PCA(n_components=0.95, whiten=True, svd_solver='full').fit(X)
pca.components_
array([[ 0.71417303,  0.46711713,  0.        ,  0.52130459],
       [-0.46602418, -0.23839061, -0.        ,  0.85205128]])
np.sum(np.abs(pca.components_), axis=0)
array([1.18019721, 0.70550774, 0. , 1.37335586])
This yields, in my eyes, a measure of importance of each of the original features. Note that the 3rd feature has zero importance, because I intentionally created a column that is just a constant value.
Is there a better "measure of importance" for PCA?
The measure of importance for PCA is in explained_variance_ratio_. This array provides the percentage of variance explained by each component. It is sorted by importance of the components in descending order and sums to 1 when all the components are used, or otherwise to the minimal possible value above the requested threshold. In your example you set the threshold to 95% (of variance that should be explained), so the array sums to 0.9949522861608583, as the first component explains 92.021143% of the variance and the second 7.474085%, hence the 2 components you receive.
components_ is the array that stores the directions of maximum variance in the feature space. Its dimensions are n_components_ by n_features_. This is what you multiply the data point(s) by when applying transform() to get a reduced-dimensionality projection of the data.
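As a small sketch of that last point (my own check, using a plain non-whitened PCA to keep the arithmetic simple; with whiten=True sklearn additionally rescales each component by the square root of its explained variance), transform() is just centering followed by a dot product with components_:
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1, 4, 1], [-2, -1, 4, 2], [-3, -2, 4, 3],
              [1, 1, 4, 4], [2, 1, 4, 5], [3, 2, 4, 6]])

pca = PCA(n_components=0.95, whiten=False, svd_solver='full').fit(X)
manual = (X - pca.mean_) @ pca.components_.T   # center, then project onto the components
print(np.allclose(manual, pca.transform(X)))   # True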
update
In order to get the percentage contribution of the original features to each of the Principal Components, you just need to normalize components_, as they set the amount the original features contribute to the projection.
r = np.abs(pca.components_.T)
r/r.sum(axis=0)
array([[0.41946155, 0.29941172],
       [0.27435603, 0.15316146],
       [0.        , 0.        ],
       [0.30618242, 0.54742682]])
As you can see, the third feature does not contribute to the PCs.
If you need the total contribution of the original features to the explained variance, you need to take into account each PC contribution (i.e. explained_variance_ratio_):
ev = np.abs(pca.components_.T).dot(pca.explained_variance_ratio_)
ttl_ev = pca.explained_variance_ratio_.sum()*ev/ev.sum()
print(ttl_ev)
[0.40908847 0.26463667 0. 0.32122715]
If you just purely sum the PCs with np.sum(np.abs(pca.components_), axis=0), that assumes all PCs are equally important, which is rarely true. To use PCA for crude feature selection, sum after discarding low-contribution PCs and/or after scaling the PCs by their relative contributions.
Here is a visual example that highlights why a plain sum doesn't work as desired.
Given 3 observations of 20 features (visualized as three 5x4 heatmaps):
>>> print(X.T)
[[2 1 1 1 1 1 1 1 1 4 1 1 1 4 1 1 1 1 1 2]
[1 1 1 1 1 1 1 1 1 4 1 1 1 6 3 1 1 1 1 2]
[1 1 1 2 1 1 1 1 1 5 2 1 1 5 1 1 1 1 1 2]]
These are the resulting PCs:
>>> pca = PCA(n_components=None, whiten=True, svd_solver='full').fit(X.T)
Note that PC3 has high magnitude at (2,1), but if we check its explained variance, it offers ~0 contribution:
>>> print(pca.explained_variance_ratio_)
array([0.6638886943392722, 0.3361113056607279, 2.2971091700327738e-32])
This causes a feature-selection discrepancy when summing the unscaled PCs versus summing the PCs scaled by their explained variance ratios:
>>> unscaled = np.sum(np.abs(pca.components_), axis=0)
>>> scaled = np.sum(pca.explained_variance_ratio_[:, None] * np.abs(pca.components_), axis=0)
With the unscaled sum, the meaningless PC3 is still given 33% weight. This causes (2,1) to be considered the most important feature, but if we look back at the original data, (2,1) offers little discrimination between observations.
With the scaled sum, PC1 and PC2 have 66% and 33% weight respectively. Now (3,1) and (3,2) are the most important features, which actually tracks with the original data.

Random 2D array without specifying min and max

In Python, is there a way to generate a 2d array using numpy with random integer entries without specifying either the low or high?
I tried mat = np.random.randint(size=(3, 4)) but it did not work.
Assuming you don't want to specify the min or max values of the array, you can use numpy.random.normal:
np.random.normal(mean, standard_deviation, (rows, columns))
And then cast it to integers with astype(int) (note that this truncates rather than rounds):
>>> import numpy as np
>>> mat = np.random.normal(1, 3, (3, 4)).astype(int)
>>> print(mat)
[[ 0  0  0 -1]
 [ 0  5  0  0]
 [-5  1  2  2]]
Please note that the output may vary, as the values are random.
If you want to specify the min and max values, there are various ways of doing that, such as
mat = (np.random.random((3, 4)) * 10).astype(int)  # random ints between 0 and 9 (10 exclusive)
or
mat = np.random.randint(1, 5, size=(3, 4))  # random ints between 1 and 4 (high is exclusive)
And more.

Batch operation to all matrices stored in a n-dimensional numpy array [duplicate]

This question already has an answer here:
Compute inverse of 2D arrays along the third axis in a 3D array without loops
I have a numpy array of size (4, 4, 6890), which basically contains 6890 4x4 matrices. I need to invert all of them, and I am currently doing it in a loop, which I know is bad practice:
for i in range(0, T.shape[2]):
    T_inv[:, :, i] = np.linalg.inv(T[:, :, i])
How can I do it with a single call?
np.linalg.inv will do it, but you need to rearrange your axes:
T_inv = np.moveaxis(np.linalg.inv(np.moveaxis(T, -1, 0)), 0, -1)
Might be better to just construct T so that T.shape = (6890, 4, 4). It will help with broadcasting as well.
I'm not sure how to do it with numpy, but check this out:
[ A 0 0 ]   [ A^(-1)    0       0    ]   [ I 0 0 ]
[ 0 B 0 ] * [   0     B^(-1)    0    ] = [ 0 I 0 ]
[ 0 0 C ]   [   0       0     C^(-1) ]   [ 0 0 I ]
A, B, C being matrices of the same size (for example 4x4), and A^(-1), B^(-1), C^(-1) being their inverses. I is the identity matrix.
So, what does this tell us? We can construct a large sparse block-diagonal matrix with all the sub-matrices (4x4) on the diagonal, take the inverse of that large matrix, and just read the sub-matrices' inverses off the diagonal blocks.
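Here is a small sketch of that idea (illustration only, with 3 blocks instead of 6890; for the real problem the batched np.linalg.inv call above is the practical route, since a dense (4*6890) x (4*6890) block-diagonal matrix would be wastefully large):
import numpy as np
from scipy.linalg import block_diag

T = np.random.rand(4, 4, 3) + 4 * np.eye(4)[:, :, None]    # three well-conditioned 4x4 blocks

big = block_diag(*[T[:, :, i] for i in range(T.shape[2])])  # 12x12 block-diagonal matrix
big_inv = np.linalg.inv(big)                                # its inverse is also block-diagonal

# read the per-block inverses back off the diagonal
T_inv = np.stack([big_inv[4*i:4*i+4, 4*i:4*i+4] for i in range(T.shape[2])], axis=-1)
print(np.allclose(T_inv[:, :, 0], np.linalg.inv(T[:, :, 0])))  # True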

build a map out of a matrix with multiple outcomes

I have an input matrix of unknown n x m dimensions that is populated with 1s and 0s.
For example, a 5x4 matrix:
A = array(
[[1, 0, 0, 0],
[1, 0, 0, 0],
[0, 1, 1, 0],
[0, 1, 1, 0],
[1, 0, 1, 1]])
Goal
I need to create a 1 : 1 map between as many columns and rows as possible, where the element at that location is 1.
What I mean by a 1 : 1 map is that each column and row can be used once at most.
The ideal solution has the most mappings possible, i.e. the most rows and columns used. It should also avoid exhaustive combinations or operations that do not scale well with larger matrices (practically, the maximum dimensions should be 100x100, but there is no declared limit, so they could go higher).
Here's a possible outcome of the above
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1.]])
Some more Examples:
input:
0 1 1
0 1 0
0 1 1
output (one of several possible ones):
0 0 1
0 1 0
0 0 0
another (this shows one problem that can arise)
input:
0 1 1 1
0 1 0 0
1 1 0 0
a good output (again, one of several):
0 0 1 0
0 1 0 0
1 0 0 0
a bad output (still valid, but has fewer mappings)
0 1 0 0
0 0 0 0
1 0 0 0
To better show how there can be multiple outputs:
input:
0 1 1
1 1 0
one possible output:
0 1 0
1 0 0
a second possible output:
0 0 1
0 1 0
a third possible output
0 0 1
1 0 0
What have I done?
I have a really dumb way of handling it right now which is not at all guaranteed to work. Basically, I just build a filter matrix out of an identity matrix (because it's the perfect map: every row and every column is used once), then randomly swap its columns (n times) and filter the original matrix with it, recording the filter matrix with the best results.
My [non] solution:
import random
import numpy as np

# this is a starting matrix with random values
A = np.array(
    [[1, 0, 0, 0],
     [1, 0, 0, 0],
     [0, 1, 1, 0],
     [0, 1, 1, 0],
     [1, 0, 1, 1]])

# add dummy column to make it square
new_col = np.zeros([5, 1]) + 1
A = np.append(A, new_col, axis=1)

# make an identity matrix (the perfect map)
imatrix = np.diag([1] * 5)

# randomly swap columns on the identity matrix until they match.
n = 1000

# this will hold the map that works the best
best_map_so_far = np.zeros([1, 1])

for i in range(n):
    a, b = random.sample(range(5), 2)
    t = imatrix[:, a].copy()
    imatrix[:, a] = imatrix[:, b]
    imatrix[:, b] = t
    # is this map better than the previous best?
    if sum(sum(imatrix * A)) > sum(sum(best_map_so_far)):
        best_map_so_far = imatrix
    # could it be? a perfect map??
    if sum(sum(imatrix * A)) == A.shape[0]:
        break
        # jk.

# did we still fail
if sum(sum(imatrix * A)) != 5:
    print('haha')

# drop the dummy column
output = imatrix * A
output[:, :-1]
# ... wow. it actually kind of works.
How about this?
let S be the solution vector, as wide as A, containing row numbers.
let Z be a vector containing the number of zeros in each column.
for each row:
    select the cells which contain 1 in A and no value in S.
    select from those cells those with the highest score in Z.
    select from those cells the first (or a random) cell.
    store the row number in the column of S corresponding to the cell.
Does that give you a sufficient solution? If so, it should be much more efficient than what you have.
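Here is my reading of that procedure as a rough sketch (a greedy heuristic, so it is not guaranteed to find the maximum number of mappings), using the 5x4 example from the question:
import numpy as np

A = np.array([[1, 0, 0, 0],
              [1, 0, 0, 0],
              [0, 1, 1, 0],
              [0, 1, 1, 0],
              [1, 0, 1, 1]])

n_rows, n_cols = A.shape
S = np.full(n_cols, -1)          # S[c] = row assigned to column c, -1 = unused
Z = (A == 0).sum(axis=0)         # number of zeros per column ("hard to place" score)

for r in range(n_rows):
    # columns that have a 1 in this row and are still unassigned
    candidates = [c for c in range(n_cols) if A[r, c] == 1 and S[c] == -1]
    if not candidates:
        continue
    best = max(candidates, key=lambda c: Z[c])   # prefer the column with the most zeros
    S[best] = r

result = np.zeros_like(A)
for c, r in enumerate(S):
    if r != -1:
        result[r, c] = 1
print(result)    # 4 mappings on this input, matching the example outcome above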
Let me give it a go. The algorithm I suggest will not always give the optimal solution, but maybe somebody can improve it.
You can always interchange two columns or two rows without changing the problem. Further, by keeping track of the changes you can always go back to the original problem.
We are going to fill the main diagonal with 1s as far as it will go. Get the first 1 in the upper left corner by interchanging columns, or rows, or both. Now the first row and column are fixed and we don't touch them anymore. We now try to fill in the second element on the diagonal with 1, and then fix the second row and column. And so on.
If the bottom right submatrix is zero, we should try to bring a 1 there by interchanging two columns or two rows using the whole matrix but preserving the existing 1s in the diagonal. (Here lies the problem. It is easy to check efficiently if one interchange can help. But it could be that at least two interchanges are required, or maybe more.)
We stop when no more 1s can be obtained on the diagonal.
So, while the algorithm is not always optimal, maybe it is possible to come up with extra rules for how to interchange columns and rows so as to populate the diagonal with 1s as far as possible.
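A rough sketch of the diagonal-filling part (my interpretation, without the extra "rescue" interchanges discussed above, so it can and does stop short of the optimum on the third example from the question):
import numpy as np

A = np.array([[0, 1, 1, 1],
              [0, 1, 0, 0],
              [1, 1, 0, 0]])

M = A.copy()
row_perm = list(range(M.shape[0]))
col_perm = list(range(M.shape[1]))

k = 0
while k < min(M.shape):
    hits = np.argwhere(M[k:, k:] == 1)          # 1s in the untouched bottom-right submatrix
    if len(hits) == 0:
        break
    r, c = hits[0] + k                          # bring the first one onto the diagonal
    M[[k, r], :] = M[[r, k], :]
    row_perm[k], row_perm[r] = row_perm[r], row_perm[k]
    M[:, [k, c]] = M[:, [c, k]]
    col_perm[k], col_perm[c] = col_perm[c], col_perm[k]
    k += 1

# k ones now sit on M's diagonal; map them back to the original row/column order
result = np.zeros_like(A)
for i in range(k):
    result[row_perm[i], col_perm[i]] = 1
print(result)   # only 2 of the 3 possible mappings here - the caveat above in action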
