I am using the various mechanisms in scikit-learn to create tf-idf representations of a training data set and a test set, each consisting of text features. Both data sets are preprocessed to use the same vocabulary, so the features and the number of features are the same. I can create a model on the training data and assess its performance on the test data. If I use SelectPercentile to reduce the number of features in the training set after the transformation, how can I identify the same features in the test set to use for prediction?
trainDenseData = trainTransformedData.toarray()
testDenseData = testTransformedData.toarray()

if useFeatureReduction == True:
    reducedTrainData = SelectPercentile(f_regression, percentile=10).fit_transform(trainDenseData, trainYarray)
    clf.fit(reducedTrainData, trainYarray)

    # apply feature reduction to the test data
See code and comments below.
import numpy as np
from sklearn.datasets import make_classification
from sklearn import feature_selection
# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000,
                           n_features=10,
                           n_informative=3,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)
sp = feature_selection.SelectPercentile(feature_selection.f_regression, percentile=30)
sp.fit_transform(X[:-1], y[:-1])  # fit on all but the last sample (999 training vectors); the last sample stands in for the test set
idx = np.arange(0, X.shape[1]) #create an index array
features_to_keep = idx[sp.get_support() == True] #get index positions of kept features
x_fs = X[:,features_to_keep] #prune X data vectors
x_test_fs = x_fs[-1] #take your last data vector (the test set) pruned values
print(x_test_fs)  # these are your pruned test set values
You should store the SelectPercentile object, and use it to transform the test data:
select = SelectPercentile(f_regression,percentile=10)
reducedTrainData = select.fit_transform(trainDenseData,trainYarray)
reducedTestData = select.transform(testDenseData)
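If you also need to know which features were retained (for example, to inspect which terms survived the selection), you can map the selector's mask back to the vectorizer's vocabulary. A minimal sketch, assuming `vectorizer` is the fitted TfidfVectorizer that produced trainDenseData (it is not shown in the question):
import numpy as np

mask = select.get_support()  # boolean mask over the original columns
feature_names = np.asarray(vectorizer.get_feature_names())  # get_feature_names_out() on newer scikit-learn
kept_features = feature_names[mask]  # names of the features kept by SelectPercentile
print(kept_features)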
Background information
I fit a classifier on my training data. When testing my fitted best estimator, I predict the probabilities for one of the classes. I then order both my X_test and my y_test by these probabilities in descending order.
Question
I want to understand which features were important (and to what extent) for the classifier when making the 500 predictions with the highest probability, taken as a whole rather than per prediction. Is the following code correct for this purpose?
y_test_probas = clf.predict_proba(X_test)[:, 1]
explainer = shap.Explainer(clf, X_train) # <-- here I put the X which the classifier was trained on?
top_n_indices = np.argsort(y_test_probas)[-500:]
shap_values = explainer(X_test.iloc[top_n_indices]) # <-- here I put the X I want the SHAP values for?
shap.plots.bar(shap_values)
Unfortunately, the shap documentation (bar plot) does not cover this case. Two things are different there:
They use the data the classifier was trained on (I want to use the data the classifier is tested on)
They use the whole X and not part of it (I want to use only part of the data)
Minimal reproducible example
import numpy as np
import pandas as pd
import shap
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Load the Titanic Survival dataset
data = pd.read_csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv")
# Preprocess the data
data = data.drop(["Name"], axis=1)
data = data.dropna()
data["Sex"] = (data["Sex"] == "male").astype(int)
# Split the data into predictors (X) and response variable (y)
X = data.drop("Survived", axis=1)
y = data["Survived"]
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit a logistic regression classifier
clf = LogisticRegression().fit(X_train, y_train)
# Get the predicted class probabilities for the positive class
y_test_probas = clf.predict_proba(X_test)[:, 1]
# Select the indices of the top 500 test samples with the highest predicted probability of the positive class
top_n_indices = np.argsort(y_test_probas)[-500:]
# Initialize the Explainer object with the classifier and the training set
explainer = shap.Explainer(clf, X_train)
# Compute the SHAP values for the top 500 test samples
shap_values = explainer(X_test.iloc[top_n_indices, :])
# Plot the bar plot of the computed SHAP values
shap.plots.bar(shap_values)
I don't want to know how the classifier decides all of its predictions, only the predictions with the highest probability. Is the code above suitable to answer this question? If not, what would suitable code look like?
I have several logs that are processed by two different TfidfVectorizer objects.
The first one reads the log and splits it into n-grams:
with open("my/log/path.txt", "r") as test:
corpus = [test.read()]
tf = TfidfVectorizer(ngram_range=ngr)
corpus_transformed = tf.fit_transform(corpus)
infile.close()
The resulting data is written into a pandas DataFrame with 4 columns
(score [float], review [n-grams of text], isbad [0/1], kfold [int]);
the initial kfold value is -1.
The DataFrame is built with:
my_df = pd.DataFrame(corpus_transformed.toarray(), index=['score'], columns=tf.get_feature_names()).transpose()
For cross-validation I split the dataset into train and test sets with StratifiedKFold, doing a simple:
ngr = (1, 2)

for fold_ in range(5):
    # 'reviews' column has short sentences expressing an opinion
    train_df = df[df.kfold != fold_].reset_index(drop=True)
    test_df = df[df.kfold == fold_].reset_index(drop=True)

    new_tf = TfidfVectorizer(ngram_range=ngr)
    new_tf.fit(train_df.reviews)

    xtrain = new_tf.transform(train_df.reviews)
    xtest = new_tf.transform(test_df.reviews)
Only after this double tf-idf transformation do I fit my SVC model:
model.fit(xtrain, train_df.isbad) # where 'isbad' column is 0 if negative and 1 if positive
preds = model.predict(xtest)
accuracy = metrics.accuracy_score(test_df.isbad, preds)
So at the end of the day I have a model that classifies reviews into the two classes (negative = 0, positive = 1). I dump the model and both tf-idf vectorizers (tf and new_tf), but when it comes to new data I get an error, even though I do:
with open("never/seen/data.txt", "r") as unseen: # load REAL SAMPLE data
corpus = [unseen.read()]
# to transform the unseen data I use one of the dumped tfidf's obj
corpus_transformed = tf_dump.transform(corpus)
unseen.close()
my_unseen_df = pd.DataFrame(corpus_transformed.toarray(), index=['score'], columns=tf_dump.get_feature_names()).transpose()
my_unseen_df = my_unseen_df.sample(frac=1).reset_index(drop=True) # randomize rows
# to transform reviews' data that are going to be classified I use the new_tf dump, like before
X = new_tf_dump.transform(my_unseen_df.reviews)
# use the previously loaded model and make predictions
res = model_dump.predict(X)
#print(res)
I get ValueError: X has 604,969 features, but SVC is expecting 605,424 as input. How is that possible if I process the data with the same objects? What am I doing wrong here?
I want to use my trained model as a classifier for new, unseen data. Isn't this the right way to go?
Thank you.
I'm currently working on a multilabel text classification problem with 4 labels, represented as 4 dummy variables. I have tried several ways of transforming the data into a form suitable for multilabel classification.
Right now I'm using pipelines, but as far as I can see this doesn't fit one model with all labels included; rather, it fits one model per label. Do you agree with this?
I have tried to use MultiLabelBinarizer and LabelBinarizer, but with no luck.
Do you have a tip on how to solve this so that all the labels are included in a single model, taking the different label combinations into account?
A subset of the data and my code is here:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
# Import data
df = import_data("product_data")
# Define dataframe to only include relevant columns
df = df.loc[:,['text','TV','Internet','Mobil','Fastnet']]
# Define dataframe with labels
df_labels = df.loc[:,['TV','Internet','Mobil','Fastnet']]
# Sum the number of labels per text
sum_column = df["TV"] + df["Internet"] + df["Mobil"] + df["Fastnet"]
df["label_sum"] = sum_column
# Remove texts with no labels
df.drop(df[df['label_sum'] == 0].index, inplace = True)
# Split dataset
train, test = train_test_split(df, random_state=42, test_size=0.2, shuffle=True)
X_train = train.text
X_test = test.text
categories = ['TV','Internet','Mobil','Fastnet']
# Model
LogReg_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word', max_df=0.20)),
    ('clf', LogisticRegression(solver='lbfgs', multi_class='ovr', class_weight='balanced', n_jobs=-1)),
])
for category in categories:
    print('... Processing {}'.format(category))
    LogReg_pipeline.fit(X_train, train[category])
    prediction = LogReg_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
https://www.transfernow.net/dl/20210921NbWDt3eo
Code Analysis
The scikit-learn LogisticRegression classifier using OVR (one-vs-rest) can only predict a single output/label at a time. Since you are training the model in the pipeline on multiple labels one at a time, you will produce one trained model per label. The algorithm itself will be the same for all models, but you would have trained them differently.
Multi-Output Regressor
Multi-output regressors can accept multiple independent labels and generate one prediction for each target.
The output should be the same as what you have, but you only need to maintain a single model and train it once.
To use this approach, wrap your LR model in a MultiOutputRegressor.
Here is a good tutorial on multi-output regression models.
from sklearn.multioutput import MultiOutputRegressor

model = LogisticRegression(solver='lbfgs', multi_class='ovr', class_weight='balanced', n_jobs=-1)

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word', max_df=0.20)),
    ('clf', MultiOutputRegressor(model))])

# fit on the training split's label columns (not the full df_labels, whose rows don't match X_train)
preds = pipeline.fit(X_train, train[categories]).predict(X_test)

df_preds = combine_data(X=X_test, Y=preds, y_cols=categories)
combine_data() merges all data into a single DataFrame for convenience:
def combine_data(X, Y, y_cols):
    """X is a dataframe, Y is a np array, y_cols is a list"""
    df_out = pd.DataFrame(Y, columns=y_cols)
    df_out.index = X.index
    return pd.concat([X, df_out], axis=1).sort_index()
Multinomial Logistic Regression
To use a LogisticRegression classifier on all labels at once, set multi_class='multinomial'.
The softmax function is used to find the predicted probability of a sample belonging to a class.
You'll need to reverse the one-hot encoding on the label to get back the categorical variable (answer here). If you have the original label before one-hot encoding, use that.
Here is a good tutorial on multinomial logistic regression.
label_col=["text_source"]
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs')
model = clf.fit(df_train[input_cols], df_train[label_col])
# Generate a table of probabilities for each class
probs = model.predict_proba(X_test)
df_probs = combine_data(X=X_test, Y=probs, y_cols=label_col)
# Predict the class for a sample, i.e. the one with the highest probability
preds = model.predict(X_test)
df_preds = combine_data(X=X_test, Y=preds, y_cols=label_col)
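To reverse the one-hot encoding mentioned above, here is a minimal sketch, assuming df_labels holds the 0/1 dummy columns from the question (this recovers a single label per row, so it only suits rows that carry exactly one label):
# idxmax(axis=1) returns, for each row, the name of the column holding the 1,
# which recovers one categorical label per row from the dummy columns
single_label = df_labels.idxmax(axis=1)
print(single_label.head())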
I am working on the following data set:
http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
The data can be found by clicking on the Data Folder link. There are two data sets present, a training and a testing set. The file I am using contains the combined data from both sets.
I am attempting to apply Linear Discriminant Analysis (LDA) to obtain two components; however, when my code runs, it produces just a single component. I also get just a single component if I set n_components = 3.
I just finished testing PCA, which works fine for any number n I provide, as long as n is less than or equal to the number of features present in the X arrays at the time of the transformation.
I am not sure why LDA is behaving so strangely. Here is my code:
#Load libraries
import pandas
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
dataset = pandas.read_csv('bank-full.csv', engine="python", delimiter=';')
#Output Basic Dataset Info
print(dataset.shape)
print(dataset.head(20))
print(dataset.describe())
# Split-out validation dataset
X = dataset.iloc[:,[0,5,9,11,12,13,14]] #we are selecting only the "clean data" w/o preprocessing
Y = dataset.iloc[:,16]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_temp = X_train
X_validation = sc_X.transform(X_validation)
'''# Applying PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = 5)
X_train = pca.fit_transform(X_train)
X_validation = pca.transform(X_validation)
explained_variance = pca.explained_variance_ratio_'''
# Applying LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components = 2)
X_train = lda.fit_transform(X_train, Y_train)
X_validation = lda.transform(X_validation)
LDA (at least the implementation in sklearn) can produce at most k-1 components, where k is the number of classes. So if you are dealing with binary classification, you'll end up with only 1 dimension.
Refer to the manual for more detail: http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html
Also related:
Python (scikit learn) lda collapsing to single dimension
LDA ignoring n_components?
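As a quick illustration of the k-1 limit, here is a minimal sketch (not from the original code): with 2 classes, LDA yields a single discriminant axis no matter how many input features there are.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# binary problem: LDA can produce at most n_classes - 1 = 1 component
X, y = make_classification(n_samples=200, n_features=7, n_classes=2, random_state=0)
X_lda = LDA(n_components=1).fit_transform(X, y)
print(X_lda.shape)  # (200, 1) -- a single component despite 7 input features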
import numpy as np
from sklearn import svm
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
I have 3 labels (male, female, na), denoted as follows:
labels = [0,1,2]
Each label is described by 3 features (height, weight, and age) in the training data:
Training data for males:
male_height = np.array([111,121,137,143,157])
male_weight = np.array([60,70,88,99,75])
male_age = np.array([41,32,73,54,35])
males = np.vstack([male_height,male_weight,male_age]).T
Training data for females:
female_height = np.array([91,121,135,98,90])
female_weight = np.array([32,67,98,86,56])
female_age = np.array([51,35,33,67,61])
females = np.vstack([female_height,female_weight,female_age]).T
Training data for not availables:
na_height = np.array([96,127,145,99,91])
na_weight = np.array([42,97,78,76,86])
na_age = np.array([56,35,49,64,66])
nas = np.vstack([na_height,na_weight,na_age]).T
So, the complete training data are:
trainingData = np.vstack([males,females,nas])
Complete labels are:
labels = np.repeat(labels,5)
Now, I want to select the best features, output their names, and apply only those best features for fitting the support vector machine model.
I tried the following, based on the answer from @eickenberg and the comments from @larsmans:
selector = SelectKBest(f_classif, k=keep)
clf = make_pipeline(selector, StandardScaler(), svm.SVC())
clf.fit(trainingData, labels)
selected = trainingData[selector.get_support()]
print(selected)
[[111 60 41]
[121 70 32]]
However, all the selected elements belong to the label 'male', and all three features (height, weight, and age) are still present. I cannot figure out where I am going wrong. Could someone point me in the right direction?
You can use, e.g., SelectKBest as follows:
from sklearn.feature_selection import SelectKBest, f_classif
keep = 2
selector = SelectKBest(f_classif, k=keep)
and place it into your pipeline
pipe = make_pipeline(selector, StandardScaler(), svm.SVC())
pipe.fit(trainingData, labels)
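To output the names of the selected features (which the question also asks for), read the fitted selector's mask back against the feature order used to build trainingData. A minimal sketch; note that get_support() is a mask over columns, not rows:
import numpy as np

feature_names = np.array(['height', 'weight', 'age'])  # order used in np.vstack above
print(feature_names[selector.get_support()])   # names of the k best features
print(trainingData[:, selector.get_support()])  # keep all rows, selected columns only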
To be honest, I have only used the Support Vector Machine model on text classification (which is an entirely different problem). But from that experience, I can confidently say that the more features you have, the better your predictions will be.
To summarize, do not filter out features; the Support Vector Machine will make use of them no matter how little importance they seem to have.
But if this is really necessary, look into scikit-learn's RandomForestClassifier. It can accurately assess which features are more important, using the feature_importances_ attribute.
Here's an example of how I would use it (code not tested):
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()  # tweak the parameters yourself
clf.fit(X, Y)  # if you're passing in a sparse matrix, apply .toarray() to X
print(clf.feature_importances_)
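If you want to see the importances next to the feature names, here is a minimal sketch (assuming feature_names is a list of your column names, in the same order as the columns of X):
import numpy as np

order = np.argsort(clf.feature_importances_)[::-1]  # most important first
for i in order:
    print(feature_names[i], clf.feature_importances_[i])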
Hope that helps.