I have a dataset on which I have run scikit-learn's PCA. I scaled the data with StandardScaler() before performing PCA.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

variance_to_retain = 0.99
np_scaled = StandardScaler().fit_transform(df_data)
pca = PCA(n_components=variance_to_retain)
np_pca = pca.fit_transform(np_scaled)
# make a dataframe of the scaled data, restoring the column names for later use
df_scaled = pd.DataFrame(np_scaled, columns=df_data.columns)
num_components = len(pca.explained_variance_ratio_)
cum_variance_explained = np.cumsum(pca.explained_variance_ratio_)
eigenvalues = pca.explained_variance_
eigenvectors = pca.components_
I then ran K-Means clustering on the scaled dataset. I can plot the cluster centers just fine in scaled space.
My question is: how do I transform the locations of the centers back into the original data space? I know that StandardScaler.fit_transform() makes the data have zero mean and unit variance. But given the new points of shape (num_clusters, num_features), can I use inverse_transform(centers) to map the centers back into the range and offset of the original data?
Thanks, David
You can take cluster_centers_ from a fitted KMeans and pass it through pca.inverse_transform and then the scaler's inverse_transform; the transforms are undone in reverse order (PCA first, then scaling).
Here's an example:
import numpy as np
from sklearn import decomposition
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Scale the data, then project it onto the principal components
scal = StandardScaler()
X_t = scal.fit_transform(X)
pca = decomposition.PCA(n_components=3)
X_t = pca.fit_transform(X_t)

# Cluster in PCA space
clf = KMeans(n_clusters=3)
clf.fit(X_t)

# Undo the transforms in reverse order: PCA first, then scaling
scal.inverse_transform(pca.inverse_transform(clf.cluster_centers_))
Note that sklearn offers multiple ways to do the fit/transform. You can do StandardScaler().fit_transform(X), but then you lose the scaler object, so you can't reuse it or use it to compute an inverse.
Alternatively, you can keep the object: scal = StandardScaler(), followed by scal.fit(X) and then scal.transform(X), or scal.fit_transform(X), which combines the fit and transform steps.
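A minimal sketch of the two patterns on toy data, keeping a reference to the scaler so inverse_transform stays available:
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Pattern 1: fit, then transform, as two separate steps
scal = StandardScaler()
scal.fit(X)
X_t = scal.transform(X)

# Pattern 2: fit_transform in one step -- same result, but here the scaler object is lost
X_t2 = StandardScaler().fit_transform(X)

# Because we kept scal, we can invert the transform
X_back = scal.inverse_transform(X_t)  # recovers the original X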
Here I am using SVR to fit the data. Before fitting, I scale the values, and to get the prediction back in the original units I use the inverse_transform function.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Creating separate scaler objects for the independent and dependent variables
ss_X = StandardScaler()
ss_y = StandardScaler()
X = ss_X.fit_transform(X)
y = ss_y.fit_transform(y.reshape(-1, 1)).ravel()  # SVR expects a 1-D target

# Creating a model object and fitting the scaled data
reg = SVR(kernel='rbf')
reg.fit(X, y)

# To make a prediction:
# first transform the input into the scaled space,
# then inverse-transform the result to see the original value
ss_y.inverse_transform(reg.predict(ss_X.transform(np.array([[6.5]]))).reshape(-1, 1))
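Two separate scalers are used because X and y have different means and variances; a single scaler fit on X would give no way to invert predictions back into y's original units.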
I've trained a logistic regression model like this:
reg = LogisticRegression(random_state=40)
cvreg = GridSearchCV(reg,
                     param_grid={'C': [0.05, 0.1, 0.5],
                                 'penalty': ['none', 'l1', 'l2'],
                                 'solver': ['saga']},
                     cv=5)
cvreg.fit(X_train, y_train)
Now to show the feature's importance I've tried this code, but I don't get the names of the coefficients in the plot:
from matplotlib import pyplot
importance = cvreg.best_estimator_.coef_[0]
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()
Obviously, the plot isn't very informative. How do I add the names of the coefficients to the x-axis?
The fitted coefficients are:
cvreg.best_estimator_.coef_
array([[1.10303023e+00, 7.48816905e-01, 4.27705027e-04, 6.01404570e-01]])
The coefficients correspond to the columns of X_train, so pass in the X_train names instead of range(len(importance)).
Assuming X_train is a pandas dataframe:
import matplotlib.pyplot as plt
features = X_train.columns
importance = cvreg.best_estimator_.coef_[0]
plt.bar(features, importance)
plt.show()
Note that if X_train is just a numpy array without column names, you will have to define the features list based on your own data dictionary.
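For example, with a plain array you could supply the names yourself (the names below are hypothetical placeholders):
# Hypothetical feature names -- replace with the ones from your own data dictionary
features = ['age', 'balance', 'duration', 'campaign']
importance = cvreg.best_estimator_.coef_[0]
plt.bar(features, importance)
plt.show()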
I have a very large training dataset: 1050 gestures, each containing 12,000 data points. Feeding our machine learning models this many data points results in very slow performance and poor accuracy. I therefore used PCA to remove irrelevant characteristics from the high-dimensional space and project the most important features into a lower-dimensional subspace, improving classification accuracy and reducing computational time. Using PCA, we reduced the 12,000 data points per gesture to 15 PCs without compromising the information extracted from the data.
In the future, I would like to store my machine learning model on an Arduino, a small chip with roughly 256KB of storage. The training dataset I fit the PCA to takes 225MB of storage, so keeping it on the chip is not possible.
Is there a way to fit PCA to my training dataset so that I can later transform unseen test data on the Arduino without having to store the training dataset there for fitting?
Here is my code to fit my training dataset
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
transposed_normDF.columns = transposed_normDF.columns.map(str)
features = [str(i) for i in range(0,11999)]
x = transposed_normDF.loc[:, features].values
y = df.loc[:,['label']].values
pca = PCA(n_components=0.99)
principalComponents = pca.fit_transform(x)
pc = pca.explained_variance_ratio_.cumsum()
x1 = StandardScaler().fit_transform(principalComponents)
full_newdf = pd.DataFrame(data=x1,
                          columns=[f'pc_stdscaled_{i}' for i in range(len(pc))])
full_finalDf = pd.concat([full_newdf, df[['label']]], axis = 1)
print(full_finalDf)
print(full_newdf.shape)
Here is my code to transform unseen data
pca = PCA(n_components=0.99)
newdata_transformed = pca.transform(in_data)
pc = pca.explained_variance_ratio_.cumsum()
x1 = StandardScaler().fit(principalComponents)
X1 = x1.transform(newdata_transformed)
newdf = pd.DataFrame(data=X1,
                     columns=[f'pc_stdscaled_{i}' for i in range(len(pc))])
newdf.head()
Yes, it is possible to fit PCA on a training set and reuse it later in another program.
You can use pickle to save the model and load it.
Here is a code snippet for that:
from sklearn.decomposition import PCA
import pickle as pk
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=10, centers=3, n_features=20, random_state=0)
pca = PCA(n_components=2)
result = pca.fit_transform(X)  # X has more than 2 features, so PCA reduces it to 2
sample = X[0]  # a single data vector (renamed from `input` to avoid shadowing the built-in)
result = pca.transform([sample])
print(result)  # output: [[ 25.27946068 -2.74478573]]
pk.dump(pca, open("pca.pkl", "wb"))
After saving the fitted PCA, you can reload it in another program and transform new input samples without loading the training data, as follows:
# later reload the pickle file, no training data needed
pca_reloaded = pk.load(open("pca.pkl",'rb'))
result_new = pca_reloaded.transform([sample])  # transform a new data sample with the reloaded PCA
print(result_new) # output: [[ 25.27946068 -2.74478573]]
When you compare result and result_new, you find that they are equal.
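For example, a quick check (assuming both results are available in the same session):
import numpy as np
assert np.allclose(result, result_new)  # identical up to floating-point tolerance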
Source: https://datascience.stackexchange.com/questions/55066/how-to-export-pca-to-use-in-another-program
I have a dataset for regression: (X_train_scaled, y_train) and (X_val_scaled, y_val) for training and validation respectively. The inputs were scaled using StandardScaler.
I create a linear regression model using sklearn.linear_model.LinearRegression like follows:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
linear_reg = LinearRegression()
linear_reg.fit(X_train_scaled, y_train)
y_pred_train = linear_reg.predict(X_train_scaled)
y_pred_val = linear_reg.predict(X_val_scaled)
r2_train = r2_score(y_train, y_pred_train)
r2_val = r2_score(y_val, y_pred_val)
print('r2_train', r2_train)
print('r2_val', r2_val)
After that I do the same but use polynomial features with degree = 1 (which are just the same as the original features but with an additional feature of ones, i.e. x^0, which I ignore).
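For reference, here is a tiny check on toy data of what PolynomialFeatures(1) produces; the leading column of ones is the bias term I drop:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.array([[2.0, 3.0], [4.0, 5.0]])
print(PolynomialFeatures(1).fit_transform(X_demo))
# [[1. 2. 3.]
#  [1. 4. 5.]]  -- first column is x^0 (all ones); the rest are the original features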
from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(1)
X_train_poly = pf.fit_transform(X_train_scaled)[:, 1:] # ignore first col
X_val_poly = pf.transform(X_val_scaled)[:, 1:] # ignore first col
linear_reg = LinearRegression()
linear_reg.fit(X_train_poly, y_train)
y_pred_train = linear_reg.predict(X_train_poly)
y_pred_val = linear_reg.predict(X_val_poly)
r2_train = r2_score(y_train, y_pred_train)
r2_val = r2_score(y_val, y_pred_val)
print('r2_train', r2_train)
print('r2_val', r2_val)
However, I get different results. The first code gives me the following outputs:
r2_train 0.7409525513417043
r2_val 0.7239859358973735
whereas the second code gives this output:
r2_train 0.7410093370149977
r2_val 0.7241725658840452
Why are the outputs different although the dataset and model are the same?
To prove the datasets are the same, I tried the following code:
print(X_train_scaled.shape, X_train_poly.shape)
print(X_val_scaled.shape, X_val_poly.shape)
print((X_train_poly != X_train_scaled).sum())
print((X_val_poly != X_val_scaled).sum())
which has the output:
(802, 9) (802, 9)
(268, 9) (268, 9)
0
0
which indicates that the two datasets are identical.
Also, I use LinearRegression in both cases, which uses the OLS algorithm and has no random operations at all. So it's supposed to do the same calculations on the same data. However, I get different results.
Does anyone have an idea about the reason?
Sklearn's LinearRegression uses ordinary least squares optimization to fit the training data to a linear model, while it is not clear what Sklearn's PolynomialFeatures uses internally. But based on its transform() function's notes:
Prefer CSR over CSC for sparse input (for speed), but CSC is required
if the degree is 4 or higher. If the degree is less than 4 and the
input format is CSC, it will be converted to CSR, have its polynomial
features generated, then converted back to CSC.
(see: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)
Assuming the fit still comes down to ordinary least squares, you would get essentially the same results with a slight difference (just like yours), because the Compressed Sparse Row (CSR) conversion can compromise float values (in other words, truncation/approximation error).
I am working on the following data set:
http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
The data can be found by clicking on the Data Folder link. There are two data sets present, a training and a testing set. The file I am using contains the combined data from both sets.
I am attempting to apply Linear Discriminant Analysis (LDA) to obtain two components; however, when my code runs, it produces just a single component. I also obtain just a single component if I set "n_components = 3".
I just got done testing PCA, which works just fine for any number "n" I provide, such that "n" is less than or equal to the number of features present in the X arrays at the time of the transformation.
I am not sure why LDA seems to behave so strangely. Here is my code:
#Load libraries
import pandas
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
dataset = pandas.read_csv('bank-full.csv', engine="python", delimiter=';')
#Output Basic Dataset Info
print(dataset.shape)
print(dataset.head(20))
print(dataset.describe())
# Split-out validation dataset
X = dataset.iloc[:,[0,5,9,11,12,13,14]] #we are selecting only the "clean data" w/o preprocessing
Y = dataset.iloc[:,16]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_temp = X_train
X_validation = sc_X.transform(X_validation)
'''# Applying PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = 5)
X_train = pca.fit_transform(X_train)
X_validation = pca.transform(X_validation)
explained_variance = pca.explained_variance_ratio_'''
# Applying LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components = 2)
X_train = lda.fit_transform(X_train, Y_train)
X_validation = lda.transform(X_validation)
LDA (at least the implementation in sklearn) can produce at most k-1 components, where k is the number of classes. So if you are dealing with binary classification, you'll end up with only 1 dimension.
Refer to manual for more detail: http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html
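For illustration, a quick sketch on a synthetic binary problem (made-up data; note that how an over-large n_components is handled varies by sklearn version):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Binary problem: k = 2 classes, so at most k - 1 = 1 discriminant axis
X, y = make_classification(n_samples=100, n_features=5, n_classes=2, random_state=0)

lda = LDA(n_components=1)  # 1 is the maximum for two classes
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)  # (100, 1)
# Requesting n_components=2 here is either silently capped to 1 (older versions,
# as in the question) or rejected with a ValueError (newer versions).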
Also related:
Python (scikit learn) lda collapsing to single dimension
LDA ignoring n_components?
I am using the various mechanisms in scikit-learn to create a tf-idf representation of a training data set and a test set consisting of text features. Both data sets are preprocessed to use the same vocabulary, so the features and the number of features are the same. I can create a model on the training data and assess its performance on the test data. I am wondering: if I use SelectPercentile to reduce the number of features in the training set after transformation, how can I identify the same features in the test set to use in prediction?
trainDenseData = trainTransformedData.toarray()
testDenseData = testTransformedData.toarray()

if useFeatureReduction == True:
    reducedTrainData = SelectPercentile(f_regression, percentile=10).fit_transform(trainDenseData, trainYarray)
    clf.fit(reducedTrainData, trainYarray)
    # apply feature reduction to the test data
See code and comments below.
import numpy as np
from sklearn.datasets import make_classification
from sklearn import feature_selection
# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000,
                           n_features=10,
                           n_informative=3,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)
sp = feature_selection.SelectPercentile(feature_selection.f_regression, percentile=30)
sp.fit_transform(X[:-1], y[:-1])  # train on all but the last data vector; the last one serves as the test set
idx = np.arange(0, X.shape[1])            # create an index array
features_to_keep = idx[sp.get_support()]  # index positions of the kept features
x_fs = X[:, features_to_keep]             # prune the X data vectors
x_test_fs = x_fs[-1]                      # the pruned values of the last data vector (the test set)
print(x_test_fs)                          # these are your pruned test set values
You should store the SelectPercentile object, and use it to transform the test data:
select = SelectPercentile(f_regression, percentile=10)
reducedTrainData = select.fit_transform(trainDenseData, trainYarray)
reducedTestData = select.transform(testDenseData)
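A short usage sketch of the rest of the flow, assuming clf is the classifier from the question:
clf.fit(reducedTrainData, trainYarray)
predictions = clf.predict(reducedTestData)  # the same columns were kept for both sets, so shapes line up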