I am working on the following data set:
http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
The data can be found by clicking on the Data Folder link. There are two data sets present, a training and a testing set. The file I am using contains the combined data from both sets.
I am attempting to apply Linear Discriminant Analysis (LDA) to obtain two components; however, when my code runs, it produces just a single component. I also obtain just a single component if I set "n_components = 3".
I just finished testing PCA, which works fine for any number n I provide, as long as n is less than or equal to the number of features present in the X arrays at the time of the transformation.
I am not sure why LDA seems to be behaving so strangely. Here is my code:
#Load libraries
import pandas
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
dataset = pandas.read_csv('bank-full.csv', engine="python", delimiter=';')
#Output Basic Dataset Info
print(dataset.shape)
print(dataset.head(20))
print(dataset.describe())
# Split-out validation dataset
X = dataset.iloc[:,[0,5,9,11,12,13,14]] #we are selecting only the "clean data" w/o preprocessing
Y = dataset.iloc[:,16]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_temp = X_train
X_validation = sc_X.transform(X_validation)
'''# Applying PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = 5)
X_train = pca.fit_transform(X_train)
X_validation = pca.transform(X_validation)
explained_variance = pca.explained_variance_ratio_'''
# Applying LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components = 2)
X_train = lda.fit_transform(X_train, Y_train)
X_validation = lda.transform(X_validation)
LDA (at least the implementation in sklearn) can produce at most k - 1 components, where k is the number of classes. So if you are dealing with binary classification, you'll end up with only 1 dimension.
Refer to the manual for more detail: http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html
Also related:
Python (scikit learn) lda collapsing to single dimension
LDA ignoring n_components?
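For illustration, a minimal sketch (with a synthetic binary target standing in for the bank-marketing "y" column) showing that LDA yields at most n_classes - 1 discriminant axes, so a two-class problem gives a single component no matter how many you ask for:
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

rng = np.random.RandomState(7)
X = rng.rand(200, 7)            # 7 numeric features, like the columns kept in the question
y = rng.randint(0, 2, 200)      # binary target

# Only n_classes - 1 = 1 discriminant axis exists for two classes.
lda = LDA(n_components=1)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)              # (200, 1)

# Requesting n_components=2 here is either silently capped to 1 or rejected with an
# error, depending on the scikit-learn version; PCA has no such class-based limit.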
I'm trying to perform LassoCV feature selection on my miRNA expression dataset, and after finding the 100 best features (miRNAs in this case) I want to build some classification models (like SVM, RF, KNN, etc.) for prediction using those 100 miRNAs. I can use the following code for my data without any problems if I don't do train-test splitting.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel
feature_names = df.columns[0:2565]
clf = LassoCV().fit(X, y)
importance = np.abs(clf.coef_)
idx_third = importance.argsort()[-3]
threshold = importance[idx_third] + 0.01
idx_features = (-importance).argsort()[:100]
name_features = np.array(feature_names)[idx_features]
print('Selected features: {}'.format(name_features))
sfm = SelectFromModel(clf, threshold=threshold)
sfm.fit(X, y)
X = sfm.transform(X)
But my goal is to select the features after the split, and I think I'm having trouble applying the selection to X_train and X_test after fitting LassoCV. Here's the code after train_test_split:
clf = LassoCV().fit(X_train, y_train)
importance = np.abs(clf.coef_)
idx_third = importance.argsort()[-3]
threshold = importance[idx_third] + 0.01
idx_features = (-importance).argsort()[:100]
name_features = np.array(feature_names)[idx_features]
print('Selected features: {}'.format(name_features))
sfm = SelectFromModel(clf, threshold=threshold)
sfm.fit(X_train, y_train)
and the output:
Selected features: ['MIMAT0019071' 'MIMAT0019947' 'MIMAT0005951' 'MIMAT0025458'
'MIMAT0019710' 'MIMAT0005880' 'MIMAT0004810' 'MIMAT0026481'
'MIMAT0016904' 'MIMAT0003340' 'MIMAT0016851' 'MIMAT0019033'
'MIMAT0004508' 'MIMAT0024615' 'MIMAT0022478' 'MIMAT0019004'
'MIMAT0004948' 'MIMAT0005898' 'MIMAT0000064' 'MIMAT0015087'
'MIMAT0005942' 'MIMAT0004602' 'MIMAT0027666' 'MIMAT0003250'
'MIMAT0022289' 'MIMAT0005866' 'MIMAT0004903' 'MIMAT0004592'
'MIMAT0021040' 'MIMAT0003237' 'MIMAT0018954' 'MIMAT0019858'
'MIMAT0003270' 'MIMAT0030416' 'MIMAT0019361' 'MIMAT0018083'
'MIMAT0000440' 'MIMAT0018070' 'MIMAT0016863' 'MIMAT0015066'
'MIMAT0027576' 'MIMAT0017997' 'MIMAT0000421' 'MIMAT0003165'
'MIMAT0027587' 'MIMAT0004603' 'MIMAT0003330' 'MIMAT0019948'
'MIMAT0004978' 'MIMAT0018951' 'MIMAT0016872' 'MIMAT0019203'
'MIMAT0015005' 'MIMAT0003319' 'MIMAT0003316' 'MIMAT0022265'
'MIMAT0011159' 'MIMAT0016898' 'MIMAT0003240' 'MIMAT0004925'
'MIMAT0027580' 'MIMAT0019067' 'MIMAT0018121' 'MIMAT0028112'
'MIMAT0019714' 'MIMAT0000685' 'MIMAT0019742' 'MIMAT0027627'
'MIMAT0003277' 'MIMAT0019737' 'MIMAT0003284' 'MIMAT0020925'
'MIMAT0022929' 'MIMAT0022938' 'MIMAT0020924' 'MIMAT0020603'
'MIMAT0020602' 'MIMAT0020956' 'MIMAT0020601' 'MIMAT0020600'
'MIMAT0022719' 'MIMAT0020300' 'MIMAT0022939' 'MIMAT0022940'
'MIMAT0019984' 'MIMAT0019983' 'MIMAT0019982' 'MIMAT0019981'
'MIMAT0019980' 'MIMAT0019979' 'MIMAT0019978' 'MIMAT0019977'
'MIMAT0019976' 'MIMAT0022941' 'MIMAT0020541' 'MIMAT0019985'
'MIMAT0020958' 'MIMAT0019975' 'MIMAT0021036' 'MIMAT0021037']
SelectFromModel(estimator=LassoCV(), threshold=0.041810456987634005)
So, no problems so far, and we can see the 100 miRNAs to be selected. I then try to select these features by applying sfm.transform() to the split dataset like this:
X_train = sfm.transform(X_train)
X_test = sfm.transform(X_test)
But when I check the X_train.shape and X_test.shape the output is like this:
((164, 0), (55, 0))
So, of course when I try to train my model:
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)
it gives me this error:
ValueError: Found array with 0 feature(s) (shape=(164, 0)) while a minimum of 1 is required.
I'm new to machine learning, especially the feature selection part. If anyone can tell me how to develop models with the selected features in this particular case, I would greatly appreciate it.
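The usual leak-free pattern is to fit the selector on the training split only and then transform both splits with it; a minimal sketch, assuming X_train, X_test, y_train come from an earlier train_test_split, and using max_features=100 with threshold=-np.inf as one hypothetical way to keep exactly the top 100 coefficients instead of the hand-computed threshold:
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel

clf = LassoCV().fit(X_train, y_train)                          # fit on the training split only
sfm = SelectFromModel(clf, threshold=-np.inf, max_features=100, prefit=True)

X_train_sel = sfm.transform(X_train)                           # keep the same 100 columns
X_test_sel = sfm.transform(X_test)                             # in both splits
print(X_train_sel.shape, X_test_sel.shape)                     # (n_train, 100), (n_test, 100)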
I am trying to apply the cosine similarity kernel to text classification with an SVM, using a raw dataset of 1000 words:
# Libraries
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
# Data
x_train, x_test, y_train, y_test = train_test_split(raw_data[:, 0], raw_data[:, 1], test_size=0.33, random_state=42)
# CountVectorizer
c = CountVectorizer(max_features=1000, analyzer = "char")
X_train = c.fit_transform(x_train).toarray()
X_test = c.transform(x_test).toarray()
# Kernel
cosine_X_tr = cosine_similarity(X_train)
cosine_X_tst = cosine_similarity(X_test)
# SVM
svm_model = SVC(kernel="precomputed")
svm_model.fit(cosine_X_tr, y_train)
y_pred = svm_model.predict(cosine_X_tst)
But that code throws the following error:
ValueError: X has 330 features, but SVC is expecting 670 features as input
I've tried the following, but I don't know whether it is mathematically accurate, and I also want to implement other custom kernels that are not available in scikit-learn, like histogram intersection:
cosine_X_tst = cosine_similarity(X_test, X_train)
So, basically, the main problem resides in the dimensions of the matrix SVC receives. Once CountVectorizer is applied to the train and test datasets, both have 1000 features because of the max_features parameter:
Train dataset of shape (670, 1000)
Test dataset of shape (330, 1000)
But after applying cosine similarity they are converted to square matrices:
Train dataset of shape (670, 670)
Test dataset of shape (330, 330)
When SVC is fitted to the training data it learns 670 features and will not be able to predict on the test dataset because it has a different number of features (330). So, how can I solve that problem and be able to use custom kernels with SVC?
Define a function yourself and pass that function to the kernel parameter in SVC(), like SVC(kernel=your_custom_function).
Also, you can use the cosine_similarity kernel like this in your code:
svm_model = SVC(kernel=cosine_similarity)
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)
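The same callable approach extends to other custom kernels, such as the histogram intersection kernel mentioned in the question; a minimal sketch, with histogram_intersection as a hand-rolled helper (not part of scikit-learn), assuming X_train, X_test, y_train come from the CountVectorizer step above:
import numpy as np
from sklearn.svm import SVC

def histogram_intersection(X, Y):
    # Gram matrix K of shape (n_samples_X, n_samples_Y), where
    # K[i, j] = sum over features f of min(X[i, f], Y[j, f])
    K = np.zeros((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        K[i] = np.minimum(x, Y).sum(axis=1)
    return K

svm_model = SVC(kernel=histogram_intersection)
svm_model.fit(X_train, y_train)        # scikit-learn evaluates the kernel on (X_train, X_train)
y_pred = svm_model.predict(X_test)     # and on (X_test, X_train) at prediction time
If you prefer to keep kernel="precomputed", the matrix passed to predict must be the kernel between test and training samples, i.e. cosine_similarity(X_test, X_train) with shape (n_test, n_train), which is exactly the rectangular version the question already tried.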
I have a sample time-series dataset of shape (23, 208), which is a pivot table of 24-hour counts for some users. I was experimenting with different regressors from sklearn, which work fine (except for SGDRegressor()), but the LightGBM Python package gives me a very linear prediction, as follows:
Here is the code I tried:
import pandas as pd
dff = pd.read_csv('ex_data2.csv',sep=',')
dff.set_index("timestamp",inplace=True)
print(dff.shape)
from sklearn.model_selection import train_test_split
trainingSetf, testSetf = train_test_split(dff,
    # target_attribute,
    test_size=0.2,
    random_state=42,
    # stratify=y,
    shuffle=False)
import lightgbm as lgb
from sklearn.multioutput import MultiOutputRegressor
username = 'MMC_HEC_LVP' # select one column for plotting & check regression performance
user_list = []
for column in dff.columns:
    user_list.append(column)
index = user_list.index(username)
X_trainf = trainingSetf.iloc[:,:].values
y_trainf = trainingSetf.iloc[:,:].values
X_testf = testSetf.iloc[:,:].values
y_testf = testSetf.iloc[:,:].values
test_set_copy = y_testf.copy()
model_LGBMRegressor = MultiOutputRegressor(lgb.LGBMRegressor()).fit(X_trainf, y_trainf)
pred_LGBMRegressor = model_LGBMRegressor.predict(X_testf)
test_set_copy[:,[index]] = pred_LGBMRegressor[:,[index]]
#plot the results for selected user/column
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
plt.figure(figsize=(12, 10))
plt.xlabel("Date")
plt.ylabel("Values")
plt.title(f"{username} Plot")
plt.plot(trainingSetf.iloc[:,[index]],label='trainingSet')
plt.plot(testSetf.iloc[:,[index]],"--",label='testSet')
plt.plot(test_set_copy[:,[index]],'b--',label='RF_predict')
plt.legend()
So what am I missing if I use the default (hyper-)parameters?
Short Answer
Your dataset has a very small number of rows, and LightGBM's parameters have default values set to provide good performance on medium-sized datasets.
Set the following parameters to force LightGBM to fit to the provided data.
min_data_in_bin = 1
min_data_in_leaf = 1
Long Answer
Before training, LightGBM does some pre-processing on the input data.
For example:
bundling sparse features
binning continuous features into histograms
dropping features which are guaranteed to be uninformative (for example, features which are constant)
The result of that preprocessing is a LightGBM Dataset object, and running that preprocessing is called Dataset "construction". LightGBM performs boosting on this Dataset object, not raw data like numpy arrays or pandas data frames.
To speed up construction and prevent overfitting during training, LightGBM provides the ability to prevent the creation of histogram bins that are too small (min_data_in_bin) or splits that produce leaf nodes matching too few records (min_data_in_leaf).
Setting those parameters to very low values may be required to train on small datasets.
I created the following minimal, reproducible example using Python 3.8.12, lightgbm==3.3.2, numpy==1.22.2, and scikit-learn==1.0.2, demonstrating this behavior.
from lightgbm import LGBMRegressor
from sklearn.metrics import r2_score
from sklearn.datasets import make_regression
# 20-row input data
X, y = make_regression(
    n_samples=20,
    n_informative=5,
    n_features=5,
    random_state=708
)
# training produces 0 trees, and predicts mean(y)
reg = LGBMRegressor(
    num_boost_round=20,
    verbosity=0
)
reg.fit(X, y)
print(f"r2 (defaults): {r2_score(y, reg.predict(X))}")
# 0.000
# training fits and predicts well
reg = LGBMRegressor(
    min_data_in_bin=1,
    min_data_in_leaf=1,
    num_boost_round=20,
    verbosity=0
)
reg.fit(X, y)
print(f"r2 (small min_data): {r2_score(y, reg.predict(X))}")
# 0.985
If you use LGBMRegressor(min_data_in_bin=1, min_data_in_leaf=1) in the code in the original post, you'll see predictions that fit the provided data more closely.
In this way the model is overfitted!
If you do a random split after creating the dataset and evaluate the model on the test dataset, you will notice that the performance is essentially the same or worse (as in this example).
# SETUP
# =============================================================
from lightgbm import LGBMRegressor
from sklearn.metrics import r2_score
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
X, y = make_regression(
    n_samples=200, n_informative=10, n_features=40, random_state=123
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
# =============================================================
# TEST 1
reg = LGBMRegressor(num_boost_round=20, verbosity=0)
reg.fit(X, y)
print(f"r2 (defaults): {r2_score(y, reg.predict(X))}")
# 0.815
reg = LGBMRegressor(
    min_data_in_bin=1, min_data_in_leaf=1, num_boost_round=20, verbosity=0
)
reg.fit(X, y)
print(f"r2 (small min_data): {r2_score(y, reg.predict(X))}")
# 0.974
# =============================================================
# TEST 2
reg = LGBMRegressor(num_boost_round=20, verbosity=0)
reg.fit(X_train, y_train)
print(f"r2 (defaults): {r2_score(y_train, reg.predict(X_train))}")
# 0.759
reg = LGBMRegressor(
    min_data_in_bin=1, min_data_in_leaf=1, num_boost_round=20, verbosity=0
)
reg.fit(X_train, y_train)
print(f"r2 (small min_data): {r2_score(y_test, reg.predict(X_test))}")
# 0.219
I recently discovered SHAP, an amazing library for ML interpretability. I decided to build a simple gradient boosting classifier using a toy dataset from sklearn and to draw a force_plot.
To understand the plot the library says:
The above explanation shows features each contributing to push the
model output from the base value (the average model output over the
training dataset we passed) to the model output. Features pushing the
prediction higher are shown in red, those pushing the prediction lower
are in blue (these force plots are introduced in our Nature BME
paper).
So it looks to me as if the base_value should be the same as clf.predict(X_train).mean(), which equals 0.637. However, this is not the case when looking at the plot; the number is actually not even within [0, 1]. I tried taking the log in different bases (10, e, 2), assuming it would be some kind of monotonic transformation... but still no luck. How can I get to this base_value?
!pip install shap
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import shap
X, y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(data=X)
y = pd.DataFrame(data=y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train)
print(clf.predict(X_train).mean())
# load JS visualization code to notebook
shap.initjs()
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_train)
# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value, shap_values[0,:], X_train.iloc[0,:])
To get the base_value in raw space (when link="identity") you need to unwind class labels --> probabilities --> raw scores. Note that the default loss is "deviance", so the raw score is the inverse sigmoid of the probability:
import numpy as np

# probabilities
y = clf.predict_proba(X_train)[:,1]
# raw scores, default link="identity"
y_raw = np.log(y/(1-y))
# expected raw score
print(np.mean(y_raw))
print(np.isclose(explainer.expected_value, np.mean(y_raw), 1e-12))
2.065861773054686
[ True]
The relevant plot for 0th data point in raw space:
shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="identity")
Should you wish to switch to sigmoid probability space (link="logit"):
from scipy.special import expit, logit
# probabilities
y = clf.predict_proba(X_train)[:,1]
# expected raw base value
y_raw = logit(y).mean()
# expected probability, i.e. base value in probability space
print(expit(y_raw))
0.8875405774316522
The relevant plot for 0th data point in probability space:
Note that the probability base_value from SHAP's perspective (what they call a baseline probability if no data is available) is not what a reasonable person would define as the baseline in the absence of any independent variables (0.6373626373626373 in this case).
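For comparison, that naive baseline is just the mean training prediction computed at the top of the question:
print(clf.predict(X_train).mean())   # ~0.637, the "no-model" baseline the question expected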
Full reproducible example:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import shap
print(shap.__version__)
X, y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(data=X)
y = pd.DataFrame(data=y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train.values.ravel())
# load JS visualization code to notebook
shap.initjs()
explainer = shap.TreeExplainer(clf, model_output="raw")
shap_values = explainer.shap_values(X_train)
from scipy.special import expit, logit
# probabilities
y = clf.predict_proba(X_train)[:,1]
# expected raw base value
y_raw = logit(y).mean()
# expected probability, i.e. base value in probability space
print("Expected raw score (before sigmoid):", y_raw)
print("Expected probability:", expit(y_raw))
# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="logit")
Output:
0.36.0
Expected raw score (before sigmoid): 2.065861773054686
Expected probability: 0.8875405774316522
I am using the various mechanisms in scikit-learn to create a tf-idf representation of a training data set and a test set consisting of text features. Both data sets are preprocessed to use the same vocabulary, so the features and the number of features are the same. I can create a model on the training data and assess its performance on the test data. I am wondering, if I use SelectPercentile to reduce the number of features in the training set after transformation, how can I identify the same features in the test set to utilise in prediction?
trainDenseData = trainTransformedData.toarray()
testDenseData = testTransformedData.toarray()
if useFeatureReduction == True:
    reducedTrainData = SelectPercentile(f_regression, percentile=10).fit_transform(trainDenseData, trainYarray)
    clf.fit(reducedTrainData, trainYarray)
    # apply feature reduction to the test data
See code and comments below.
import numpy as np
from sklearn.datasets import make_classification
from sklearn import feature_selection
# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000,
                           n_features=10,
                           n_informative=3,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)
sp = feature_selection.SelectPercentile(feature_selection.f_regression, percentile=30)
sp.fit_transform(X[:-1], y[:-1])  # here, training uses all but the last data vector; the last one is the test set
idx = np.arange(0, X.shape[1]) #create an index array
features_to_keep = idx[sp.get_support() == True] #get index positions of kept features
x_fs = X[:,features_to_keep] #prune X data vectors
x_test_fs = x_fs[-1] #take your last data vector (the test set) pruned values
print(x_test_fs)  # these are your pruned test set values
You should store the SelectPercentile object, and use it to transform the test data:
select = SelectPercentile(f_regression,percentile=10)
reducedTrainData = select.fit_transform(trainDenseData,trainYarray)
reducedTestData = select.transform(testDenseData)
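A brief follow-up sketch of how this plugs into the question's clf, reusing the trainDenseData, testDenseData, and trainYarray names from above; get_support is the selector's own way to inspect which tf-idf columns were kept:
clf.fit(reducedTrainData, trainYarray)
predictions = clf.predict(reducedTestData)        # test features were reduced with the same mask

kept_columns = select.get_support(indices=True)   # indices of the retained tf-idf features
print(kept_columns)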