I am doing Principal Component Analysis (PCA) and I'd like to find out which features contribute the most to the result.
My intuition is to sum up all the absolute values of the individual contribution of the features to the individual components.
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-1, -1, 4, 1], [-2, -1, 4, 2], [-3, -2, 4, 3], [1, 1, 4, 4], [2, 1, 4, 5], [3, 2, 4, 6]])
pca = PCA(n_components=0.95, whiten=True, svd_solver='full').fit(X)
pca.components_
array([[ 0.71417303,  0.46711713,  0.        ,  0.52130459],
       [-0.46602418, -0.23839061, -0.        ,  0.85205128]])
np.sum(np.abs(pca.components_), axis=0)
array([1.18019721, 0.70550774, 0. , 1.37335586])
This yields, in my eyes, a measure of importance of each of the original features. Note that the 3rd feature has zero importance, because I intentionally created a column that is just a constant value.
Is there a better "measure of importance" for PCA?
The measure of importance for PCA is in explained_variance_ratio_. This array gives the percentage of variance explained by each component. It is sorted in descending order of component importance and sums to 1 when all the components are used, or otherwise to the minimal possible value above the requested threshold. In your example you set the threshold to 95% (of the variance that should be explained), so the array sum will be 0.9949522861608583, since the first component explains 92.021143% and the second 7.474085% of the variance; hence the 2 components you receive.
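Continuing the example, the quoted ratios can be inspected directly (a quick sketch using the pca object fitted in the question; the printed values are those quoted above):
print(pca.explained_variance_ratio_)        # [0.92021143 0.07474085]
print(pca.explained_variance_ratio_.sum())  # 0.9949522861608583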
components_ is the array that stores the directions of maximum variance in the feature space. Its dimensions are n_components_ by n_features_. This is what you multiply the data point(s) by when applying transform() to get the reduced-dimensionality projection of the data.
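As a quick sanity check (my sketch, reusing X and pca from the question; not part of the original answer), transform() is equivalent to centering, projecting onto components_ and, because whiten=True was set, rescaling each component by 1/sqrt of its explained variance:
projected = (X - pca.mean_) @ pca.components_.T / np.sqrt(pca.explained_variance_)
print(np.allclose(projected, pca.transform(X)))  # True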
update
In order to get the percentage contribution of the original features to each of the Principal Components, you just need to normalize components_, since they set the amount each original feature contributes to the projection.
r = np.abs(pca.components_.T)
r/r.sum(axis=0)
array([[0.41946155, 0.29941172],
       [0.27435603, 0.15316146],
       [0.        , 0.        ],
       [0.30618242, 0.54742682]])
As you can see, the third feature does not contribute to the PCs.
If you need the total contribution of the original features to the explained variance, you need to take into account each PC's contribution (i.e. explained_variance_ratio_):
ev = np.abs(pca.components_.T).dot(pca.explained_variance_ratio_)
ttl_ev = pca.explained_variance_ratio_.sum()*ev/ev.sum()
print(ttl_ev)
[0.40908847 0.26463667 0. 0.32122715]
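As a small follow-up (my sketch, not part of the original answer), you can rank the original features by this total contribution; np.argsort is my addition here:
ranking = np.argsort(ttl_ev)[::-1]
print(ranking)  # [0 3 1 2] -> feature 0 contributes the most, feature 2 (the constant column) the least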
If you simply sum the PCs with np.sum(np.abs(pca.components_), axis=0), that assumes all PCs are equally important, which is rarely true. To use PCA for crude feature selection, sum after discarding low-contribution PCs and/or after scaling the PCs by their relative contributions.
Here is a visual example that highlights why a plain sum doesn't work as desired.
Given 3 observations of 20 features (visualized as three 5x4 heatmaps):
>>> print(X.T)
[[2 1 1 1 1 1 1 1 1 4 1 1 1 4 1 1 1 1 1 2]
[1 1 1 1 1 1 1 1 1 4 1 1 1 6 3 1 1 1 1 2]
[1 1 1 2 1 1 1 1 1 5 2 1 1 5 1 1 1 1 1 2]]
These are the resulting PCs (each one a 20-feature vector, visualized as a 5x4 heatmap like the observations above):
>>> pca = PCA(n_components=None, whiten=True, svd_solver='full').fit(X.T)
Note that PC3 has high magnitude at (2,1), but if we check its explained variance, it offers ~0 contribution:
>>> pca.explained_variance_ratio_
array([0.6638886943392722, 0.3361113056607279, 2.2971091700327738e-32])
This causes a feature selection discrepancy when summing the unscaled PCs (left) vs summing the PCs scaled by their explained variance ratios (right):
>>> unscaled = np.sum(np.abs(pca.components_), axis=0)
>>> scaled = np.sum(pca.explained_variance_ratio_[:, None] * np.abs(pca.components_), axis=0)
With the unscaled sum (left), the meaningless PC3 is still given 33% weight. This causes (2,1) to be considered the most important feature, but if we look back to the original data, (2,1) offers low discrimination between observations.
With the scaled sum (right), PC1 and PC2 respectively have 66% and 33% weight. Now (3,1) and (3,2) are the most important features which actually tracks with the original data.
Related
I am comparing the Jaccard distance matrix I get when I process a dataset using pdist and a DIY Jaccard distance matrix function. I'm getting different results in my output distance matrices and I'm not sure why.
I think one of the following is the cause:
My implementation of jaccard distance calculation is wrong
scipy.spatial.distance.pdist(metric = 'jaccard') and scipy.spatial.distance.jaccard calculate jaccard distance in different ways (seems unlikely as they're both in scipy.spatial.distance)
squareform is doing something to my data, potentially a normalisation
The docs for squareform go a bit over my head, so some form of normalisation might be what's happening. However, the squareform-ed distance matrix does not have the same relative distance magnitudes between cells, which is confusing (e.g. row 0 in my DIY distance matrix is 0, 0.571429, 1, and with pdist is 0, 1, 1 - the middle value is nearly twice as high with pdist).
Can anyone explain why I'm getting a different distance matrix when it's being analysed with the same metric?
My code:
import numpy as np
from scipy.spatial.distance import jaccard, squareform, pdist
def jaccard_dissimilarity(feature_list1, feature_list2, filler_val): #binary
    #I don't care about every value in the array for my use case, so don't want to include them in my comparison
    all_features = set([i for i in feature_list1 if i != filler_val])
    all_features.update(set([i for i in feature_list2 if i != filler_val]))
    counts_1 = [1 if feature in feature_list1 else 0 for feature in all_features]
    counts_2 = [1 if feature in feature_list2 else 0 for feature in all_features]
    return jaccard(counts_1, counts_2)

data_array = np.array([[1, 2, 3, 4, 5],
                       [3, 4, 5, 6, 7],
                       [8, 9, 10, 11, 12]])
# =============================================================================
# DIY distance matrix
# =============================================================================
#set filler val to None, so the arrays being compared are equivalent to pdist
dist_diy = np.array([[jaccard_dissimilarity(a,b, None) for a in data_array] for b in data_array])
# =============================================================================
# pdist distance matrix
# =============================================================================
dist_pdist = squareform(pdist(data_array, metric = 'jaccard'))
Input array:
1 2 3 4 5
3 4 5 6 7
8 9 10 11 12
dist_diy:
0 0.571429 1
0.571429 0 1
1 1 0
dist_pdist:
0 1 1
1 0 1
1 1 0
Looks like pdist considers objects at a given index when comparing arrays, rather than just what objects are present in the array itself - if I change data_array[1] to 3, 4, 5, 4, 5 then the distance matrix changes to reflect the fact that data_array[0][3:5] == data_array[1][3:5]:
0 0.6 1
0.6 0 1
1 1 0
The behaviour is discussed here, but the arrays don't have to be boolean based on the above tests (if the arrays were treated as boolean then the distance matrix would not change as all numbers are > 1 and are therefore == True).
The DIY function considered the objects present rather than the index at which those objects were found, hence the discrepancy!
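As an aside, a minimal sketch (not from the original post) of how to make pdist agree with the set-based DIY function: convert each row to a presence/absence vector over the union of observed values first, then apply the boolean Jaccard metric.
import numpy as np
from scipy.spatial.distance import pdist, squareform

data_array = np.array([[1, 2, 3, 4, 5],
                       [3, 4, 5, 6, 7],
                       [8, 9, 10, 11, 12]])

all_values = np.unique(data_array)                                     # union of all observed values
presence = np.array([np.isin(all_values, row) for row in data_array])  # boolean membership matrix
dist_set_based = squareform(pdist(presence, metric='jaccard'))
print(dist_set_based[0])  # [0.         0.57142857 1.        ] - matches row 0 of the DIY matrix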
I have a relatively large dataframe (~24000 rows and 15 columns) which has 2D coordinate data of rat movements, outputted by a neural network (DeepLabCut).
As part of this output data, there is a p-value score that is a measure of how certain the neural network was when applying that label. I'm trying to filter low quality predictions by copying the previous row into its place, each time that a low p-value is encountered, which assumes that the rat remained still for that frame.
Here's my function thus far:
def checkPVals(DataFrame, CutOff):
    for Cols in DataFrame.columns.values:
        if Cols % 3 == 0:
            for Vals in DataFrame.index.values:
                if float(DataFrame[Cols][Vals]) < CutOff:
                    if (Vals != 0):
                        PreviousRow = DataFrame.loc[Vals - 1, Cols - 3:Cols]
                        DataFrame.loc[Vals, Cols - 3:Cols] = PreviousRow
    return(DataFrame)
Here is a sample of the input data frame:
import pandas as pd

pd.DataFrame(data={
    "x": [1, 2, 3, 4],
    "y": [5, 4, 3, 2],
    "likelihood": [1, 1, 0.3, 1]
})
Here is a sample of the desired output:
   x  y  Pval
0  1  5   1.0
1  2  4   1.0
2  2  4   1.0
3  4  2   1.0
With the idea being that row index 2 is replaced with values from row index 1, such that when the inter-frame Euclidean distance between these coordinates is calculated, the distance is 0, implying the label (rat) has not moved.
Clearly, my current implementation is very inefficient. I was looking at iterrows(), but that converts my data into a series and messes with it. My other thought was to convert the p-value columns into np.arrays, iterate through those, take the indices of the p-values below threshold and then swap the rows for the previous one in an iterative manner. However, I feel like that'll take just as long.
Any help is very much appreciated. Thank you!
I'm pretty sure I understood what you are attempting to do. If you could update your question to have a sample output that's paired with your sample input, that would be greatly beneficial.
If I understood correctly, you should be using a vectorized approach instead of explicit looping (this will massively speed up your data wrangling). Essentially you can mask the rows of the dataframe depending on whether or not the "likelihood" column is above a certain value. Once you mask the low likelihoods away (i.e. replace those values with NaN), you can simply forward fill the entire dataframe to fill in the "bad" rows with the previous row's values.
import pandas as pd

df = pd.DataFrame(data={
    "x": [1, 2, 3, 4],
    "y": [5, 4, 3, 2],
    "likelihood": [1, 1, 0.3, 1]
})
cutoff = 0.5
new_df = df.mask(df["likelihood"] < cutoff).ffill()
print(new_df)
     x    y  likelihood
0  1.0  5.0         1.0
1  2.0  4.0         1.0
2  2.0  4.0         1.0
3  4.0  2.0         1.0
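If the real 15-column frame comes in repeated x/y/likelihood triplets (as standard DeepLabCut output does), the same mask-and-ffill idea can be applied per triplet. A hedged sketch with hypothetical column names:
import numpy as np
import pandas as pd

# hypothetical two-body-part frame; the real DeepLabCut output has more triplets
df = pd.DataFrame({
    "bp1_x": [1, 2, 3, 4], "bp1_y": [5, 4, 3, 2], "bp1_likelihood": [1, 1, 0.3, 1],
    "bp2_x": [9, 8, 7, 6], "bp2_y": [0, 1, 2, 3], "bp2_likelihood": [1, 0.2, 1, 1],
})

cutoff = 0.5
for i in range(0, df.shape[1], 3):
    triplet = df.columns[i:i + 3]      # one body part's x, y, likelihood columns
    bad = df[triplet[2]] < cutoff      # frames where this body part is uncertain
    df.loc[bad, triplet] = np.nan      # mask the whole triplet for those frames
df = df.ffill()                        # carry the previous good frame forward
print(df)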
I have a data frame, df, representing a correlation matrix, plotted as a heatmap with some example extrema marked. Every point has, obviously, (x, y, value).
I am looking into getting the local extrema. I looked into argrelextrema, I tried it on individual rows and the results were as expected, but that didn't work for 2D. I have also looked into scipy.signal.find_peaks, but this is for a 1D array.
Is there anything in Python that will return the local extrema over/under certain values (a threshold)?
Something like an array of (x, y, value)? If not, can you point me in the right direction?
This is a tricky question, because you need to carefully define the notion of how "big" a maximum or minimum needs to be before it is relevant. For example, imagine that you have a patch containing the following 5x5 grid of pixels:
import numpy as np
import cv2

im = np.array([[0, 0, 0, 0, 0],
               [0, 5, 5, 5, 0],
               [0, 5, 4, 5, 0],
               [0, 5, 5, 5, 0],
               [0, 0, 0, 0, 0.]])
This might be looked at as a local minimum, because 4 is less than the surrounding 5s. OTOH, it might be looked at as a local maximum, where the single lone 4 pixel is just "noise", and the 3x3 patch of average 4.89-intensity pixels is actually a single local maximum. This is commonly known as the "scale" at which you are viewing the image.
In any case, you can estimate the local derivative in one direction by using a finite difference in that direction. The x direction might be something like:
k = np.array([[-1, 0, 1],
              [-1, 0, 1],
              [-1, 0, 1.]])
Applying this filter to the image patch defined above gives:
>>> cv2.filter2D(im, cv2.CV_64F, k)[1:-1,1:-1]
array([[ 9., 0., -9.],
[ 14., 0., -14.],
[ 9., 0., -9.]])
Applying a similar filter in the y direction will transpose this. The only point in here with a 0 in both the x and the y directions is the very middle, which is the 4 that we decided was a local minimum. This is tantamount to checking that the gradient is 0 in both x and y.
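A small check of that both-zero condition (my sketch, reusing im and k from above; since cv2.filter2D correlates, k.T gives the y-direction finite difference):
gx = cv2.filter2D(im, cv2.CV_64F, k)[1:-1, 1:-1]
gy = cv2.filter2D(im, cv2.CV_64F, k.T)[1:-1, 1:-1]
print(np.argwhere((gx == 0) & (gy == 0)))  # [[1 1]] -> the centre pixel, our local minimum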
This whole process can be extended to find the larger single local maximum that we have identified. You'll use a larger filter, e.g.
k = np.array([[-2, -1, 0, 1, 2],
              [-2, -1, 0, 1, 2], ...
Since the 4 makes the local maximum an approximate thing, you'll need to use some "approximate" logic, i.e. you'll look for values that are "close" to 0. Just how close depends on how fudgy you are willing to allow the local extrema to be. To sum up, the two fudge factors here are 1. the filter size and 2. the ~=0 fudge factor.
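For a more turnkey alternative (a sketch of a different approach, not the one described above): scipy.ndimage's maximum_filter/minimum_filter compare each pixel to its neighbourhood, so local extrema over/under a threshold can be pulled out directly; size plays the role of the scale/filter size discussed above.
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter

im = np.array([[0, 0, 0, 0, 0],
               [0, 5, 5, 5, 0],
               [0, 5, 4, 5, 0],
               [0, 5, 5, 5, 0],
               [0, 0, 0, 0, 0.]])

size = 3                                                                # neighbourhood scale
is_max = (im == maximum_filter(im, size=size)) & (im > 1.0)             # maxima above a threshold
is_min = (im == minimum_filter(im, size=size)) & (im < 4.5) & (im > 0)  # minima below a threshold (ignoring the flat 0 border)
maxima = [(x, y, im[y, x]) for y, x in np.argwhere(is_max)]
minima = [(x, y, im[y, x]) for y, x in np.argwhere(is_min)]
print(maxima)  # the ring of 5s
print(minima)  # [(2, 2, 4.0)] -> the lone 4 in the centre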
I'm quite new to machine learning. I'm trying to match people from SetA with people from SetB based on their interest ratings (1=Low, 10=High). My real data set has 40 features (later I also want to give a higher weighting to certain features, as well as to interests that are less common - I believe this will help me?).
Example dataset:
>>> import numpy as np
>>> import pandas as pd
>>> dfA = pd.DataFrame(np.array([[1, 1, 1], [4, 4, 4], [8, 8, 8]]),
                       columns=['interest1', 'interest2', 'interest3'],
                       index=['personA1', 'personA2', 'personA3'])
>>> dfB = pd.DataFrame(np.array([[4, 4, 3], [2, 2, 1], [1, 2, 2]]),
                       columns=['interest1', 'interest2', 'interest3'],
                       index=['personB1', 'personB2', 'personB3'])
>>> print(dfA, "\n", dfB)
          interest1  interest2  interest3
personA1          1          1          1
personA2          4          4          4
personA3          8          8          8
           interest1  interest2  interest3
personB1          4          4          3
personB2          2          2          1
personB3          1          2          2
I'm using sklearn's nearest neighbors algorithm for this:
from sklearn.neighbors import NearestNeighbors

knn = NearestNeighbors(n_neighbors=2).fit(dfA)
distances, indicies = knn.kneighbors(dfB)
>>> print(distances, "\n \n", indicies)
[[1.         4.69041576]
 [1.41421356 4.12310563]
 [1.41421356 4.12310563]]

[[1 0]
 [0 1]
 [0 1]]
I don't understand the output. I'm aware of a similar question's explanation; however, I don't know how to apply it to this situation, as there are 2 different datasets.
Ultimately, I want a final dataframe for matches like:
SetA SetB
personA1 personB2
personA2 personB1
personA3 personB3
The results that you get are the nearest neighbours of a given person in SetB selected from the people in SetA.
In other words, the first element distances[0] tells you the distances of personB1 from its two nearest neighbours in SetA. indicies[0] tells you the indices of those two persons.
In this example:
indicies[0] = [1, 0] means that personB1's nearest neighbours in SetA are SetA[1] = personA2 and SetA[0] = personA1.
distances[0] = [1. 4.69041576] tells us that the distance between personB1 and personA2 is 1, and that the distance between personB1 and personA1 is 4.69041576 (you can easily check this if you compute the Euclidean distances by hand).
A couple of remarks:
From the description of your problem, it seems that you are interested only in the single nearest neighbour in SetA for each person in SetB (not the 2 nearest neighbours). If that is the case, I would suggest changing n_neighbors=2 to n_neighbors=1 in the knn parameters (see the sketch after these remarks).
Be careful with your indices: in your dataset the labels start from 1 (personA1, personA2, ...), but in knn the indices always start from 0. This can lead to confusion when things get more complicated, since SetA[0]=personA1, so be mindful about it.
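To get from here to the final match table in the question, here is a hedged sketch (my addition, not part of the answer; note that plain kNN is not guaranteed to give a one-to-one pairing - that would need an assignment algorithm such as scipy.optimize.linear_sum_assignment):
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

dfA = pd.DataFrame(np.array([[1, 1, 1], [4, 4, 4], [8, 8, 8]]),
                   columns=['interest1', 'interest2', 'interest3'],
                   index=['personA1', 'personA2', 'personA3'])
dfB = pd.DataFrame(np.array([[4, 4, 3], [2, 2, 1], [1, 2, 2]]),
                   columns=['interest1', 'interest2', 'interest3'],
                   index=['personB1', 'personB2', 'personB3'])

knn = NearestNeighbors(n_neighbors=1).fit(dfA)
_, idx = knn.kneighbors(dfB)                    # nearest SetA person for each SetB person
matches = pd.DataFrame({
    "SetA": dfA.index[idx.ravel()].to_numpy(),  # map 0-based indices back to SetA labels
    "SetB": dfB.index.to_numpy(),
})
print(matches)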
I am working with a 3D matrix in Python. For example, given a matrix like this with size 2x3x4:
[[[1 2 1 4]
[3 2 1 1]
[4 3 1 4]]
[[2 1 3 3]
[1 4 2 1]
[3 2 3 3]]]
My task is to find the entropy of each row in each 2D slice of the matrix. For example, for row 1 of slice 1 above, [1, 2, 1, 4], the normalized values (so that the total sums to 1) are [0.125, 0.25, 0.125, 0.5], and the entropy is calculated by the formula -sum(i*log(i)), where i is a normalized value. The resulting matrix is a 2x3 matrix, with 3 entropy values per slice (because there are 3 rows).
Here is the working example of my code using random matrix each time:
from scipy.stats import entropy
import numpy as np

matrix = np.random.randint(low=1, high=5, size=(2, 3, 4))  # what if the size is (200, 50, 1000)?
entropy_matrix = np.zeros((matrix.shape[0], matrix.shape[1]))
for i in range(matrix.shape[0]):
    normalized = np.array([float(k)/np.sum(j) for j in matrix[i] for k in j]).reshape(matrix.shape[1], matrix.shape[2])
    entropy_matrix[i] = np.array([entropy(m) for m in normalized])
My question is: how do I scale this program up to work with a very large 3D matrix (for example with size 200x50x1000)?
I am using Python on Windows 10 (with the Anaconda distribution).
With a 3D matrix of size 200x50x1000, I get a running time of 290 s on my computer.
Using the definition of entropy for the second part and a broadcasted operation for the first part, one vectorized solution would be -
p1 = matrix/matrix.sum(-1,keepdims=True).astype(float)
entropy_matrix_out = -np.sum(p1 * np.log(p1), axis=-1)
Alternatively, we can use einsum for the second part for further perf. boost -
entropy_matrix_out = -np.einsum('ijk,ijk->ij',p1,np.log(p1),optimize=True)
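A quick sanity check (my sketch, using a simplified form of the question's loop) confirming that the vectorized result matches the original on a small random matrix:
from scipy.stats import entropy
import numpy as np

matrix = np.random.randint(low=1, high=5, size=(2, 3, 4))

# loop version (normalization simplified from the question's list comprehension)
entropy_matrix = np.zeros((matrix.shape[0], matrix.shape[1]))
for i in range(matrix.shape[0]):
    normalized = matrix[i] / matrix[i].sum(axis=1, keepdims=True).astype(float)
    entropy_matrix[i] = np.array([entropy(m) for m in normalized])

# vectorized version from the answer
p1 = matrix / matrix.sum(-1, keepdims=True).astype(float)
entropy_matrix_out = -np.sum(p1 * np.log(p1), axis=-1)

print(np.allclose(entropy_matrix, entropy_matrix_out))  # True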