I am trying to plot boundary lines of Iris data set using LDA in sklearn Python based on this documentation.
For two dimensional data, we can easily plot the lines using LDA.coef_ and LDA.intercept_.
But for multidimensional data that has been reduced to two components, LDA.coef_ and LDA.intercept_ have more dimensions than the plot, and I don't know how to use them to draw the boundary lines in the 2D reduced-dimension plot.
I've tried plotting with only the first two elements of LDA.coef_ and LDA.intercept_, but it didn't work.
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
iris = datasets.load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X, y).transform(X)
x = np.array([-10,10])
y_hyperplane = -1*(lda.intercept_[0]+x*lda.coef_[0][0])/lda.coef_[0][1]
plt.figure()
colors = ['navy', 'turquoise', 'darkorange']
lw = 2
plt.plot(x,y_hyperplane,'k')
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_r2[y == i, 0], X_r2[y == i, 1], alpha=.8, color=color,
                lw=lw,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('LDA of IRIS dataset')
plt.show()
The boundary line produced by lda.coef_[0] and lda.intercept_[0] does not look like it separates any two classes.
I've also tried using np.meshgrid to draw the class regions, but I get this error:
ValueError: X has 2 features per sample; expecting 4
The model expects the 4 dimensions of the original data, not the 2D points from the meshgrid.
Linear discriminant analysis (LDA) can be used as a classifier or for dimensionality reduction.
LDA for dimensionality reduction
Dimensionality reduction techniques reduce the number of features. The Iris dataset has 4 features; let's use LDA to reduce it to 2 features so that we can visualise it.
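import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Standardise the features before fitting LDA
sc = StandardScaler()
X = sc.fit_transform(X)
# Project the 4 features onto 2 discriminant components
lda = LinearDiscriminantAnalysis(n_components=2)
lda_object = lda.fit(X, y)
X = lda_object.transform(X)
# One scatter call per class
for l,c,m in zip(np.unique(y),['r','g','b'],['s','x','o']):
    plt.scatter(X[y==l,0],
                X[y==l,1],
                c=c, marker=m, label=l,edgecolors='black')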
Output:
LDA for multi class classification
LDA does multi-class classification in a one-vs-rest fashion. If you have 3 classes you will get 3 hyperplanes (decision boundaries), one for each class. If there are n features, then each hyperplane is represented by n weights (coefficients) and 1 intercept. In general:
coef_ : shape of (n_classes, n_features)
intercept_ : shape of (n_classes,)
Sample, documented inline
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(13)
# Generate 3 linearly separable dataset of 2 features
X = [[0,0]]*25+[[0,10]]*25+[[10,10]]*25
X = np.array(list(map(lambda x: list(map(lambda y: np.random.randn()+y, x)), X)))
y = np.array([0]*25+[1]*25+[2]*25)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
lda_object = lda.fit(X, y)
# Plot the hyperplanes
for l,c,m in zip(np.unique(y),['r','g','b'],['s','x','o']):
    plt.scatter(X[y==l,0],
                X[y==l,1],
                c=c, marker=m, label=l,edgecolors='black')
x1 = np.array([np.min(X[:,0], axis=0), np.max(X[:,0], axis=0)])
for i, c in enumerate(['r','g','b']):
    b, w1, w2 = lda.intercept_[i], lda.coef_[i][0], lda.coef_[i][1]
    y1 = -(b+x1*w1)/w2
    plt.plot(x1,y1,c=c)
As you can see each decision boundary separates one class from the rest (follow the color of the decision boundary)
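To check how coef_ and intercept_ enter the prediction, here is a minimal sketch on synthetic data (the names scores, manual_pred, w_diff and b_diff are only illustrative). It recomputes the per-class scores by hand, compares them with decision_function and predict, and derives the line on which two classes have equal scores, which is where the predicted label switches between them.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Three synthetic 2D blobs (assumed data, similar to the sample above)
rng = np.random.RandomState(13)
centers = np.array([[0, 0], [0, 10], [10, 10]])
X = np.vstack([c + rng.randn(25, 2) for c in centers])
y = np.repeat([0, 1, 2], 25)

lda = LinearDiscriminantAnalysis().fit(X, y)

# One linear score per class: X @ coef_.T + intercept_
scores = X @ lda.coef_.T + lda.intercept_
# The predicted class is the one with the highest score
manual_pred = lda.classes_[np.argmax(scores, axis=1)]

# Line on which classes i and j have equal scores:
# (w_i - w_j) . x + (b_i - b_j) = 0
i, j = 0, 1
w_diff = lda.coef_[i] - lda.coef_[j]
b_diff = lda.intercept_[i] - lda.intercept_[j]
x1 = np.array([X[:, 0].min(), X[:, 0].max()])
x2 = -(b_diff + w_diff[0] * x1) / w_diff[1]  # points on the tie line
```

Plotting x1 against x2 for each pair of classes gives the pairwise boundaries, which can be compared with the per-class lines drawn in the sample.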
Your case
Your dataset has 4 features, so you cannot visualise the data together with the decision boundary (human visualisation is limited to 3D). One approach is to use LDA to reduce the dimensions to 2D and then use LDA again to classify these 2D features.
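The two-step approach could look like the sketch below. It is one possible reading of the suggestion, not the only way to do it: the second classifier (clf_2d, a name chosen here) is trained on the reduced points, so the meshgrid can stay 2D and the "expecting 4" error does not arise. The grid resolution and padding are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Step 1: reduce the 4 original features to 2 discriminant components
X_2d = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

# Step 2: fit a second LDA classifier on the 2D points
clf_2d = LinearDiscriminantAnalysis().fit(X_2d, y)

# Step 3: predict on a grid that lives in the same 2D space
pad = 1.0
xx, yy = np.meshgrid(
    np.linspace(X_2d[:, 0].min() - pad, X_2d[:, 0].max() + pad, 200),
    np.linspace(X_2d[:, 1].min() - pad, X_2d[:, 1].max() + pad, 200),
)
grid = np.c_[xx.ravel(), yy.ravel()]  # one row per grid point, 2 columns
zz = clf_2d.predict(grid).reshape(xx.shape)

# Step 4: shade the predicted regions and overlay the samples
plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, edgecolors='black')
plt.title('LDA of IRIS dataset (2D decision regions)')
```

Call plt.show() or plt.savefig(...) afterwards to see the figure. The edges between the shaded regions are the decision boundaries of the 2D classifier.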
I have a dataset for banknotes wavelet data of genuine and forged banknotes with 2 features which are:
X axis: Variance of Wavelet Transformed image
Y axis: Skewness of Wavelet Transformed image
I ran K-means on this dataset to identify 2 clusters in the data, which are basically genuine and forged banknotes.
Now I have 3 questions:
How can I count the data points of each cluster?
How can I set the color of each data point based on its cluster?
How do I know, without another feature in the data, whether a data point is genuine or forged? I know the dataset has a "Class" column which shows 1 and 2 for genuine and forged, but can I identify this without the "Class" feature?
My code:
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.patches as patches
import pandas as pd
from sklearn.cluster import KMeans
data = pd.read_csv('Banknote-authentication-dataset-all.csv')
V1 = data['V1']
V2 = data['V2']
bn_class = data['Class']
V1_min = np.min(V1)
V1_max = np.max(V1)
V2_min = np.min(V2)
V2_max = np.max(V2)
normed_V1 = (V1 - V1_min)/(V1_max - V1_min)
normed_V2 = (V2 - V2_min)/(V2_max - V2_min)
V1_mean = normed_V1.mean()
V2_mean = normed_V2.mean()
V1_std_dev = np.std(normed_V1)
V2_std_dev = np.std(normed_V2)
ellipse = patches.Ellipse([V1_mean, V2_mean], V1_std_dev*2, V2_std_dev*2, alpha=0.4)
V1_V2 = np.column_stack((normed_V1, normed_V2))
km_res = KMeans(n_clusters=2).fit(V1_V2)
clusters = km_res.cluster_centers_
plt.xlabel('Variance of Wavelet Transformed image')
plt.ylabel('Skewness of Wavelet Transformed image')
scatter = plt.scatter(normed_V1,normed_V2, s=10, c=bn_class, cmap='coolwarm')
#plt.scatter(V1_std_dev, V2_std_dev,s=400, alpha=0.5)
plt.scatter(V1_mean, V2_mean, s=400, alpha=0.8, c='lightblue')
plt.scatter(clusters[:,0], clusters[:,1],s=3000,c='orange', alpha=0.8)
unique = list(set(bn_class))
plt.text(1.1, 0, 'Kmeans cluster centers', bbox=dict(facecolor='orange'))
plt.text(1.1, 0.11, 'Arithmetic Mean', bbox=dict(facecolor='lightblue'))
plt.text(1.1, 0.33, 'Class 1 - Genuine Notes',color='white', bbox=dict(facecolor='blue'))
plt.text(1.1, 0.22, 'Class 2 - Forged Notes', bbox=dict(facecolor='red'))
plt.savefig('figure.png',bbox_inches='tight')
plt.show()
Appendix image for better visibility
How to count the data points of each cluster
You can do this easily by using fit_predict instead of fit, or by calling predict on your training data after fitting.
Here's a working example:
# fit_predict returns the cluster label of every sample directly
labels = KMeans(...).fit_predict(V1_V2)
clusterCount = np.bincount(labels)
clusterCount now holds how many points are in each cluster. You can just as easily do this with fit followed by predict, although fit_predict should be more efficient:
kM = KMeans(...).fit(V1_V2)
labels = kM.predict(V1_V2)
clusterCount = np.bincount(labels)
To set its color, use kM.labels_ or the output of kM.predict() as a coloring index.
labels = kM.predict(V1_V2)
plt.scatter(normed_V1, normed_V2, s=10, c=labels, cmap='coolwarm') # instead of c=bn_class
For a new data point, notice how your KMeans clusters separate the majority of the two classes quite nicely. This separability means you can actually use the clusters as predictors. Simply call predict on the fitted model:
predictedClass = kM.predict(newDataPoint)
Each cluster is then assigned the class that makes up the majority of its points, or even a percentage chance for each class.
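The three steps (counting, naming clusters by majority class, predicting) can be sketched as follows. The data here is a synthetic stand-in for the two normalised banknote features, and cluster_to_class is a hypothetical helper name. Note that naming the clusters still needs some known class labels; k-means on its own only returns anonymous groups.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the two normalised banknote features (assumed data)
rng = np.random.RandomState(0)
points = np.vstack([rng.normal(0.3, 0.05, size=(60, 2)),
                    rng.normal(0.7, 0.05, size=(40, 2))])
true_class = np.array([1] * 60 + [2] * 40)  # 1 = genuine, 2 = forged

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = km.labels_  # cluster index of every sample

# Number of points per cluster
cluster_count = np.bincount(labels, minlength=2)

# Majority class inside each cluster, used to name the clusters
cluster_to_class = {
    k: int(np.bincount(true_class[labels == k]).argmax())
    for k in range(2)
}

# Predict the class of a new point through its cluster
new_point = np.array([[0.68, 0.72]])
predicted_class = cluster_to_class[int(km.predict(new_point)[0])]
```

Dividing the per-class counts inside a cluster by the cluster size would give the percentage chance mentioned above instead of a hard majority vote.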
Say that I have many vectors, some of them are:
a: [1,2,3,4,3,2,1,0,0,0,0,0]
b: [5,5,5,5,5,10,20,30,5,10]
c: [1,2,3,2,1,0,0,0,0,0,0,0]
We can see similar patterns between vector a and c.
My question is whether it is possible to put these two in the same cluster and b in another cluster.
I'd rather not use algorithms like KMeans, because the values themselves are not interesting, only the patterns are.
Any advice is welcome, especially solutions in Python.
Thanks
You may want to use a Support Vector Classifier, as it produces boundaries between groups based on the patterns (generalized directions) between their points, rather than on the naive distance between points (as KMeans and Spectral Clustering do). You will, however, have to construct the labels Y yourself, as SVC is a supervised method. Here is an example:
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
a = [1,2,3,4,3,2,1,0,0,0,0,0]
b = [5,5,5,5,5,10,20,30,5,10]
c = [1,2,3,2,1,0,0,0,0,0,0,0]
d = [100,2,300,4,100,0,0,0,0,0,0,0]
vectors = [a, b, c]
# Vectors have different lengths. Pad them with zeros to get equal dimensions.
L = max(len(elem) for elem in vectors)
imputed = []
for elem in vectors:
    l = len(elem)
    imputed.append(elem + [0]*(L-l))
print(imputed)
X = np.array(imputed)
print(X)
Y = np.array([0, 1, 0])
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(X, Y)
print(clf.predict(np.array([d])))
I wrote a Python script that uses scikit-learn to fit Gaussian Processes to some data.
IN SHORT: the problem I am facing is that while the Gaussian Processes seem to learn the training dataset very well, the predictions for the testing dataset are off, and it seems to me there is a normalization problem behind this.
IN DETAIL: my training dataset is a set of 1500 time series. Each time series has 50 time components. The mapping learnt by the Gaussian Processes is between a set of three coordinates x,y,z (which represent the parameters of my model) and one time series. In other words, there is a 1:1 mapping between x,y,z and one time series, and the GPs learn this mapping. The idea is that, by giving to the trained GPs new coordinates, they should be able to give me the predicted time series associated to those coordinates.
Here is my code:
from __future__ import division
import numpy as np
from matplotlib import pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
coordinates_training = np.loadtxt(...) # read coordinates training x, y, z from file
coordinates_testing = np.loadtxt(...) # read coordinates testing x, y, z from file
# z-score of the coordinates for the training and testing data.
# Note I am using the mean and std of the training dataset ALSO to normalize the testing dataset
mean_coords_training = np.zeros(3)
std_coords_training = np.zeros(3)
for i in range(3):
    mean_coords_training[i] = coordinates_training[:, i].mean()
    std_coords_training[i] = coordinates_training[:, i].std()
    coordinates_training[:, i] = (coordinates_training[:, i] - mean_coords_training[i])/std_coords_training[i]
    coordinates_testing[:, i] = (coordinates_testing[:, i] - mean_coords_training[i])/std_coords_training[i]
time_series_training = np.loadtxt(...)# reading time series of training data from file
number_of_time_components = np.shape(time_series_training)[1] # 100 time components
# z_score of the time series
mean_time_series_training = np.zeros(number_of_time_components)
std_time_series_training = np.zeros(number_of_time_components)
for i in range(number_of_time_components):
    mean_time_series_training[i] = time_series_training[:, i].mean()
    std_time_series_training[i] = time_series_training[:, i].std()
    time_series_training[:, i] = (time_series_training[:, i] - mean_time_series_training[i])/std_time_series_training[i]
time_series_testing = np.loadtxt(...)# reading test data from file
# the number of time components is the same for training and testing dataset
# z-score of testing data, again using mean and std of training data
for i in range(number_of_time_components):
    time_series_testing[:, i] = (time_series_testing[:, i] - mean_time_series_training[i])/std_time_series_training[i]
# GPs
pred_time_series_training = np.zeros((np.shape(time_series_training)))
pred_time_series_testing = np.zeros((np.shape(time_series_testing)))
# Instantiate a Gaussian Process model
kernel = 1.0 * Matern(nu=1.5)
gp = GaussianProcessRegressor(kernel=kernel)
for i in range(number_of_time_components):
    print("time component", i)
    # Fit to data using Maximum Likelihood Estimation of the parameters
    gp.fit(coordinates_training, time_series_training[:,i])
    # Predict on the training and testing coordinates (ask for the standard deviation as well)
    y_pred_train, sigma_train = gp.predict(coordinates_training, return_std=True)
    y_pred_test, sigma_test = gp.predict(coordinates_testing, return_std=True)
    pred_time_series_training[:,i] = y_pred_train*std_time_series_training[i] + mean_time_series_training[i]
    pred_time_series_testing[:,i] = y_pred_test*std_time_series_training[i] + mean_time_series_training[i]
# plot training
fig, ax = plt.subplots(5, figsize=(10,20))
for i in range(5):
    ax[i].plot(time_series_training[100*i], color='blue', label='Original training')
    ax[i].plot(pred_time_series_training[100*i], color='black', label='GP predicted - training')
# plot testing
fig, ax = plt.subplots(5, figsize=(10,20))
for i in range(5):
    ax[i].plot(time_series_testing[100*i], color='blue', label='Original testing')
    ax[i].plot(pred_time_series_testing[100*i], color='black', label='GP predicted - testing')
Here are examples of performance on the training data.
Here are examples of performance on the testing data.
First, you should use sklearn's preprocessing tools to prepare your data.
from sklearn.preprocessing import StandardScaler
There are other useful tools in that module, but this one specifically normalizes the data.
Second, you should normalize the training set and the test set with the same parameters. The model fits the "geometry" of the data to define its parameters, so training the model on one scale and predicting on another is like using the wrong system of units.
scale = StandardScaler()
training_set = scale.fit_transform(data_train)
test_set = scale.transform(data_test)
This applies the same transformation to both sets.
Finally, you need to normalize the features, not the target. That is, normalize the X inputs and not the Y output. Normalization helps the model find the answer faster by changing the topology of the objective function in the optimization process; the output does not affect this.
I hope this answers your question.
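Put together with a Gaussian Process, the advice above might look like this minimal sketch. The coordinates and the single target column are synthetic stand-ins (assumptions, not the asker's files), and only one time component is fitted to keep it short.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the x, y, z coordinates and one time component
rng = np.random.RandomState(0)
coords_train = rng.uniform(0, 100, size=(80, 3))
coords_test = rng.uniform(0, 100, size=(20, 3))
target_train = np.sin(coords_train[:, 0] / 20.0) + 0.01 * coords_train[:, 1]

# Fit the scaler on the training inputs only, then reuse it for the test inputs
scaler = StandardScaler()
X_train = scaler.fit_transform(coords_train)
X_test = scaler.transform(coords_test)

# The target is left in its original units
gp = GaussianProcessRegressor(kernel=1.0 * Matern(nu=1.5))
gp.fit(X_train, target_train)
pred_test, std_test = gp.predict(X_test, return_std=True)
```

Because the target was never rescaled, pred_test is already in the original units and can be compared with the raw test series without an inverse transform.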
I have performed a PCA on my original dataset, and from the compressed dataset transformed by the PCA I have selected the number of PCs I want to keep (they explain almost 94% of the variance). Now I am struggling to identify the original features that are important in the reduced dataset.
How do I find out which feature is important and which is not among the remaining Principal Components after the dimension reduction?
Here is my code:
from sklearn.decomposition import PCA
pca = PCA(n_components=8)
pca.fit(scaledDataset)
projection = pca.transform(scaledDataset)
Furthermore, I also tried running a clustering algorithm on the reduced dataset but, to my surprise, the score is lower than on the original dataset. How is that possible?
First of all, I assume that you call features the variables and not the samples/observations. In this case, you could do something like the following by creating a biplot function that shows everything in one plot. In this example, I am using the iris data.
Before the example, please note that the basic idea when using PCA as a tool for feature selection is to select variables according to the magnitude (from largest to smallest in absolute values) of their coefficients (loadings). See my last paragraph after the plot for more details.
Overview:
PART1: I explain how to check the importance of the features and how to plot a biplot.
PART2: I explain how to check the importance of the features and how to save them into a pandas dataframe using the feature names.
PART 1:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
import pandas as pd
from sklearn.preprocessing import StandardScaler
iris = datasets.load_iris()
X = iris.data
y = iris.target
#In general a good idea is to scale the data
scaler = StandardScaler()
scaler.fit(X)
X=scaler.transform(X)
pca = PCA()
x_new = pca.fit_transform(X)
def myplot(score,coeff,labels=None):
    xs = score[:,0]
    ys = score[:,1]
    n = coeff.shape[0]
    scalex = 1.0/(xs.max() - xs.min())
    scaley = 1.0/(ys.max() - ys.min())
    plt.scatter(xs * scalex,ys * scaley, c = y)
    for i in range(n):
        plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5)
        if labels is None:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')
        else:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')
    plt.xlim(-1,1)
    plt.ylim(-1,1)
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()
#Call the function. Use only the 2 PCs.
myplot(x_new[:,0:2],np.transpose(pca.components_[0:2, :]))
plt.show()
Visualize what's going on using the biplot
Now, the importance of each feature is reflected by the magnitude of the corresponding values in the eigenvectors (higher magnitude - higher importance)
Let's first see how much variance each PC explains.
pca.explained_variance_ratio_
[0.72770452, 0.23030523, 0.03683832, 0.00515193]
PC1 explains about 73% and PC2 about 23%. Together, if we keep PC1 and PC2 only, they explain about 96%.
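The same sum can be computed for any number of components with a running total. The sketch below is one way to pick how many PCs to keep; the 0.95 threshold and the names cumulative and n_keep are assumptions made for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(datasets.load_iris().data)
pca = PCA().fit(X)

# Running total of the variance explained by PC1, PC1+PC2, ...
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the threshold
threshold = 0.95  # assumed cut-off, pick whatever suits the data
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
```

For the scaled iris data this keeps the first two components, in line with the percentages above.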
Now, let's find the most important features.
print(abs( pca.components_ ))
[[0.52237162 0.26335492 0.58125401 0.56561105]
[0.37231836 0.92555649 0.02109478 0.06541577]
[0.72101681 0.24203288 0.14089226 0.6338014 ]
[0.26199559 0.12413481 0.80115427 0.52354627]]
Here, pca.components_ has shape [n_components, n_features]. Thus, by looking at PC1 (the first principal component), which is the first row, [0.52237162 0.26335492 0.58125401 0.56561105], we can conclude that features 1, 3 and 4 (or Var 1, 3 and 4 in the biplot) are the most important. This is also clearly visible from the biplot (that's why we often use this plot to summarize the information in a visual way).
To sum up, look at the absolute values of the components of the eigenvectors corresponding to the k largest eigenvalues. In sklearn the components are sorted by explained_variance_. The larger these absolute values are, the more a specific feature contributes to that principal component.
PART 2:
The important features are the ones that influence the components the most and thus have a large absolute value/score on the component.
To get the most important features on the PCs with names and save them into a pandas dataframe use this:
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
np.random.seed(0)
# 10 samples with 5 features
train_features = np.random.rand(10,5)
model = PCA(n_components=2).fit(train_features)
X_pc = model.transform(train_features)
# number of components
n_pcs= model.components_.shape[0]
# get the index of the most important feature on EACH component
# LIST COMPREHENSION HERE
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]
initial_feature_names = ['a','b','c','d','e']
# get the names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]
# LIST COMPREHENSION HERE AGAIN
# number the components from 1 so the labels read PC1, PC2, ...
dic = {'PC{}'.format(i + 1): most_important_names[i] for i in range(n_pcs)}
# build the dataframe
df = pd.DataFrame(list(dic.items()))
print(df)
This prints:
     0  1
0  PC1  e
1  PC2  d
So on PC1 the feature named e is the most important, and on PC2 it is d.
Nice article as well here: https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f
The pca library contains this functionality.
pip install pca
A demonstration of how to extract the feature importance follows:
# Import libraries
import numpy as np
import pandas as pd
from pca import pca
# Lets create a dataset with features that have decreasing variance.
# We want to extract feature f1 as most important, followed by f2 etc
f1=np.random.randint(0,100,250)
f2=np.random.randint(0,50,250)
f3=np.random.randint(0,25,250)
f4=np.random.randint(0,10,250)
f5=np.random.randint(0,5,250)
f6=np.random.randint(0,4,250)
f7=np.random.randint(0,3,250)
f8=np.random.randint(0,2,250)
f9=np.random.randint(0,1,250)
# Combine into dataframe
X = np.c_[f1,f2,f3,f4,f5,f6,f7,f8,f9]
X = pd.DataFrame(data=X, columns=['f1','f2','f3','f4','f5','f6','f7','f8','f9'])
# Initialize
model = pca()
# Fit transform
out = model.fit_transform(X)
# Print the top features. The results show that f1 is best, followed by f2 etc
print(out['topfeat'])
# PC feature
# 0 PC1 f1
# 1 PC2 f2
# 2 PC3 f3
# 3 PC4 f4
# 4 PC5 f5
# 5 PC6 f6
# 6 PC7 f7
# 7 PC8 f8
# 8 PC9 f9
Plot the explained variance
model.plot()
Make the biplot. It can be nicely seen that the feature with the most variance (f1) is almost horizontal in the plot, whereas the feature with the second most variance (f2) is almost vertical. This is expected because most of the variance is in f1, followed by f2, etc.
ax = model.biplot(n_feat=10, legend=False)
Biplot in 3d. Here we see the nice addition of the expected f3 in the plot in the z-direction.
ax = model.biplot3d(n_feat=10, legend=False)
import numpy as np
import pandas as pd
from IPython.display import display  # already available inside Jupyter notebooks

# original_num_df is the original numeric dataframe
# pca is the fitted model
def create_importance_dataframe(pca, original_num_df):
    # Change pcs components ndarray to a dataframe
    importance_df = pd.DataFrame(pca.components_)
    # Assign columns
    importance_df.columns = original_num_df.columns
    # Change to absolute values
    importance_df = importance_df.apply(np.abs)
    # Transpose
    importance_df = importance_df.transpose()
    # Change column names again
    ## First get number of pcs
    num_pcs = importance_df.shape[1]
    ## Generate the new column names
    new_columns = [f'PC{i}' for i in range(1, num_pcs + 1)]
    ## Now rename
    importance_df.columns = new_columns
    # Return importance df
    return importance_df
# Call function to create importance df
importance_df = create_importance_dataframe(pca, original_num_df)
# Show first few rows
display(importance_df.head())
# Sort depending on PC of interest
## PC1 top 10 important features
pc1_top_10_features = importance_df['PC1'].sort_values(ascending = False)[:10]
print('\nPC1 top 10 features are\n')
display(pc1_top_10_features)
## PC2 top 10 important features
pc2_top_10_features = importance_df['PC2'].sort_values(ascending = False)[:10]
print('\nPC2 top 10 features are\n')
display(pc2_top_10_features)
I am following the book Building Machine Learning Systems with Python. After loading the dataset from sklearn I need to extract the indices of all features belonging to setosa, but I am unable to do so, probably because I am not using a numpy array. Can someone please help me extract the index numbers? Code below:
from matplotlib import pyplot as plt
from sklearn.datasets import load_iris
import numpy as np
# We load the data with load_iris from sklearn
data = load_iris()
features = data['data']
feature_names = data['feature_names']
target = data['target']
for t,marker,c in zip(range(3),">ox","rgb"):
    # We plot each class on its own to get different colored markers
    plt.scatter(features[target == t,0], features[target == t,1],
                marker=marker, c=c)
plength = features[:, 2]
# use numpy operations to get setosa features
is_setosa = (labels == 'setosa')
# This is the important step:
max_setosa = plength[is_setosa].max()
min_non_setosa = plength[~is_setosa].min()
print('Maximum of setosa: {0}.'.format(max_setosa))
print('Minimum of others: {0}.'.format(min_non_setosa))
Define labels before the problem line.
target_names = data['target_names']
labels = target_names[target]
Now these lines will work fine:
is_setosa = (labels == 'setosa')
setosa_petal_length = plength[is_setosa]
Extra.
The data Bunch from sklearn (data = load_iris()) contains a target array with the numbers 0-2, which line up with the rows of features and encode the species of each flower. Using that, you can extract all features belonging to setosa (where target equals 0) like this:
petal_length = features[:, 2]
setosa_petal_length = petal_length[target == 0]
Combine this with data['target_names'] and you get the two lines at the top, which are the solution to your question. By the way, all arrays in the data are NumPy ndarrays.
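For reference, the pieces above can be assembled into one runnable script. This is a sketch of how the fix fits into the question's code; setosa_indices is an added name that holds the row numbers the question asks for.

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()
features = data['data']
target = data['target']
target_names = data['target_names']

# Turn the numeric targets (0, 1, 2) into species names
labels = target_names[target]

# Boolean mask and the matching row indices of the setosa samples
is_setosa = (labels == 'setosa')
setosa_indices = np.flatnonzero(is_setosa)

# Petal length is the third feature column
plength = features[:, 2]
max_setosa = plength[is_setosa].max()
min_non_setosa = plength[~is_setosa].min()
print('Maximum of setosa: {0}.'.format(max_setosa))
print('Minimum of others: {0}.'.format(min_non_setosa))
```

The boolean mask is_setosa is enough for indexing; np.flatnonzero is only needed when the integer positions themselves are wanted.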