I have scraped some data from Spotify to see if I can classify the music genre of different songs.
I have split my data into a test set and a remaining set, which I have then further divided into a training and a validation set.
When I run the model (I try to classify between 112 genres) I get 30% accuracy on the validation set. Of course this is not great, but it is to be expected with 112 genres and limited data. What really confuses me is that when I apply the model to the test data, accuracy goes down to 1%.
I am not sure why that is: as far as I can see the validation and test data should be comparable. I train the model on the training data, which should be completely independent.
I must be making some mistake, either allowing the model to peek into the validation data (hence the better performance there) or messing up my test data.
Or maybe applying the model twice messes things up?
Any idea what could be going on or how to debug it?
Thanks a lot!
Franka
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle
# re-read data
track_df = pd.read_csv('track_df_corr.csv')
features = [ 'acousticness', 'speechiness',
'key', 'liveness', 'instrumentalness', 'energy', 'tempo',
'loudness', 'danceability', 'valence',
'duration_mins', 'year', 'genre']
track_df = track_df[features]
#First make a big split of all the data into test and train.
train, test = train_test_split(track_df, test_size=0.2, random_state = 0)
#Then create training and validation data sets from the train data.
# Read the data. Assign train and test data
# "full" is the data before preprocessing
X_full = train
X_test_full = test
# select to be predicted data
y = X_full.genre # the target for the training data
y = pd.factorize(y)[0] # just keep the numeric codes - drop the label names by using [0]; the classifier needs numbers
#Since we later on want to validate our model on the test data, we also need to make sure we have a y_test.
# select to be predicted data
y_test = X_test_full.genre # just the target for the test data
y_test = pd.factorize(y_test)[0] # just keep the number - get rid of name by using [0]
# numbers needed for classifier
# remove to be predicted variable
X_full.drop(['genre'], axis=1, inplace=True) # rest of training free of target, which is now stored in y
X_test_full.drop(['genre'], axis=1, inplace=True) # not sure if necessary but cannot hurt
# Break off validation set from training data (X_full)
# Remember we still have X_test_full as an entirely independent test set.
# Here we just create our training and validation sets from X_full.
X_train_full, X_valid_full, y_train, y_valid = \
train_test_split(X_full, y, train_size=0.8, test_size=0.2, random_state=0)
# General preprocessing steps: take care of categorical data (does not apply here).
categorical_cols = [cname for cname in X_train_full.columns if
X_train_full[cname].nunique() < 10 and
X_train_full[cname].dtype == "object"]
# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if
X_train_full[cname].dtype in ['int64', 'float64']]
# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()
#Time to run the model.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
#Run our model on the TRAINING data
# FRR set up input values that are passed to the Bundle below
# Preprocessing for NUMERICAL data
numerical_transformer = SimpleImputer(strategy='median')
# Preprocessing for CATEGORICAL data
categorical_transformer = Pipeline(steps=[ # FRR Pipeline of transforms; the final step here is the one-hot encoder.
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# FRR Run the numerical_transformer and categorical_transformer defined above here:
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer( # frr Applies transformers to columns of an array or pandas DataFrame.
transformers=[ #frr List of (name,transformer,cols) tuples specifying the transformer objects to
#be applied to subsets of the data.
('num', numerical_transformer, numerical_cols),
('cat', categorical_transformer, categorical_cols)
])
# Define model
model = RandomForestClassifier(n_estimators=100, random_state=0)
# Bundle preprocessing and modeling code in a pipeline
# clf stands for classifier.
# Pipeline can be used to chain multiple estimators into one
# Preprocessing of training data, fit model
clf = Pipeline(steps=[('preprocessor', preprocessor),
('model', model)
])
# "Calling fit on the pipeline is the same as calling *fit* on each estimator (here: prepoc and model)
clf.fit(X_train, y_train)
# --------------------------------------------------------
#Test our model on the VALIDATION data
# Preprocessing of validation data, get predictions
preds = clf.predict(X_valid)
# Return the mean accuracy on the given test data and labels.
clf.score(X_valid, y_valid) # this is correct!
# The code yields a value around 30%.
# --------------------------------------------------------
#Apply our model on the TESTING data
# Preprocessing of test data, get predictions
preds_test = clf.predict(X_test)
clf.score(X_test, y_test)
#The code yields a value around 1%.
The problem that I see is that you're encoding the train and test labels using pd.factorize. Since you're using pd.factorize on y and y_test independently, the resulting encodings will not correspond to one another. You want to use a LabelEncoder, so that when you fit the encoder using the train data, you then transform y_test using the same encoding scheme.
Here's an example to illustrate this:
from sklearn.preprocessing import LabelEncoder
l = [1,4,6,1,4]
le = LabelEncoder()
le.fit(l)
le.transform(l)
# array([0, 1, 2, 0, 1], dtype=int64)
le.transform([1,6,4])
# array([0, 2, 1], dtype=int64)
Here we get consistent encodings. However, if we apply pd.factorize to each list independently, pandas obviously can't guess which encodings should correspond:
pd.factorize(l)[0]
# array([0, 1, 2, 0, 1], dtype=int64)
pd.factorize([1,6,4])[0]
# array([0, 1, 2], dtype=int64)
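Applied to your code, a minimal sketch of the fix (assuming train and test still contain the genre column at that point):
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(train['genre']) # fit the genre-to-number mapping on the training labels only
y_test = le.transform(test['genre']) # reuse the exact same mapping for the test labels
# Note: transform raises a ValueError if the test set contains a genre never
# seen in training; with 112 genres that is worth checking for.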
Background information
I fit a classifier on my training data. When testing my fitted best estimator, I predict the probabilities for one of the classes. I order both my X_test and my y_test by the probabilities in descending order.
Question
I want to understand which features were important (and to what extent) for the classifier to make the 500 predictions with the highest probability, taken as a whole rather than per individual prediction. Is the following code correct for this purpose?
y_test_probas = clf.predict_proba(X_test)[:, 1]
explainer = shap.Explainer(clf, X_train) # <-- here I put the X which the classifier was trained on?
top_n_indices = np.argsort(y_test_probas)[-500:]
shap_values = explainer(X_test.iloc[top_n_indices]) # <-- here I put the X I want the SHAP values for?
shap.plots.bar(shap_values)
Unfortunately, the shap documentation (bar plot) does not cover this case. Two things are different there:
They use the data the classifier was trained on (I want to use the data the classifier is tested on)
They use the whole X and not part of it (I want to use only part of the data)
Minimal reproducible example
import numpy as np
import pandas as pd
import shap
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Load the Titanic Survival dataset
data = pd.read_csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv")
# Preprocess the data
data = data.drop(["Name"], axis=1)
data = data.dropna()
data["Sex"] = (data["Sex"] == "male").astype(int)
# Split the data into predictors (X) and response variable (y)
X = data.drop("Survived", axis=1)
y = data["Survived"]
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit a logistic regression classifier
clf = LogisticRegression().fit(X_train, y_train)
# Get the predicted class probabilities for the positive class
y_test_probas = clf.predict_proba(X_test)[:, 1]
# Select the indices of the top 500 test samples with the highest predicted probability of the positive class
top_n_indices = np.argsort(y_test_probas)[-500:]
# Initialize the Explainer object with the classifier and the training set
explainer = shap.Explainer(clf, X_train)
# Compute the SHAP values for the top 500 test samples
shap_values = explainer(X_test.iloc[top_n_indices, :])
# Plot the bar plot of the computed SHAP values
shap.plots.bar(shap_values)
I don't want to know how the classifier decides all the predictions, but only how it decides the predictions with the highest probability. Is that code suitable to answer this question? If not, what would suitable code look like?
I'm currently working on a multilabel text classification problem, in which I have 4 labels, represented as 4 dummy variables. I have tried several ways to transform the data into a form suitable for multilabel classification.
Right now I'm working with pipelines, but as far as I can see, this doesn't fit one model with all labels included, but rather makes one model per label - do you agree with this?
I have tried to use MultiLabelBinarizer and LabelBinarizer, but with no luck.
Do you have a tip on how I can solve this problem in a way that includes all the labels in one model, taking the different label combinations into account?
A subset of the data and my code is here:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
# Import data
df = import_data("product_data")
# Define dataframe to only include relevant columns
df = df.loc[:,['text','TV','Internet','Mobil','Fastnet']]
# Define dataframe with labels
df_labels = df.loc[:,['TV','Internet','Mobil','Fastnet']]
# Sum the number of labels per text
sum_column = df["TV"] + df["Internet"] + df["Mobil"] + df["Fastnet"]
df["label_sum"] = sum_column
# Remove texts with no labels
df.drop(df[df['label_sum'] == 0].index, inplace = True)
# Split dataset
train, test = train_test_split(df, random_state=42, test_size=0.2, shuffle=True)
X_train = train.text
X_test = test.text
categories = ['TV','Internet','Mobil','Fastnet']
# Model
LogReg_pipeline = Pipeline([
('tfidf', TfidfVectorizer(analyzer = 'word', max_df=0.20)),
('clf', LogisticRegression(solver='lbfgs', multi_class = 'ovr', class_weight = 'balanced', n_jobs=-1)),
])
for category in categories:
    print('... Processing {}'.format(category))
    LogReg_pipeline.fit(X_train, train[category])
    prediction = LogReg_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
https://www.transfernow.net/dl/20210921NbWDt3eo
Code Analysis
The scikit-learn LogisticRegression classifier using OVR (one-vs-rest) can only predict a single output/label at a time. Since you are training the model in the pipeline on multiple labels one at a time, you will produce one trained model per label. The algorithm itself will be the same for all models, but you would have trained them differently.
Multi-Output Regressor
Multi-output regressors can accept multiple independent labels and generate one prediction for each target.
The output should be the same as what you have, but you only need to maintain a single model and train it once.
To use this approach, wrap your LR model in a MultiOutputRegressor (for classifiers such as LogisticRegression, the analogous MultiOutputClassifier from sklearn.multioutput works the same way).
Here is a good tutorial on multi-output regression models.
from sklearn.multioutput import MultiOutputRegressor

model = LogisticRegression(solver='lbfgs', multi_class='ovr', class_weight='balanced', n_jobs=-1)
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word', max_df=0.20)),
    ('clf', MultiOutputRegressor(model))])
# fit on the labels of the training split only (df_labels covers the whole
# dataframe, while X_train holds just the training rows)
preds = pipeline.fit(X_train, train[categories]).predict(X_test)
df_preds = combine_data(X=X_test, Y=preds, y_cols=categories)
combine_data() merges all data into a single DataFrame for convenience:
def combine_data(X, Y, y_cols):
    """X is a dataframe (or series), Y is a np array, y_cols is a list"""
    df_out = pd.DataFrame(Y, columns=y_cols)
    df_out.index = X.index
    return pd.concat([X, df_out], axis=1).sort_index()
Multinomial Logistic Regression
To use a LogisticRegression classifier on all labels at once, set multi_class='multinomial'.
The softmax function is used to find the predicted probability of a sample belonging to a class.
You'll need to reverse the one-hot encoding on the label to get back the categorical variable (answer here). If you have the original label before one-hot encoding, use that.
Here is a good tutorial on multinomial logistic regression.
label_col=["text_source"]
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs')
model = clf.fit(df_train[input_cols], df_train[label_col])
# Generate a table of probabilities for each class
probs = model.predict_proba(X_test)
df_probs = combine_data(X=X_test, Y=probs, y_cols=label_col)
# Predict the class for a sample, i.e. the one with the highest probability
preds = model.predict(X_test)
df_preds = combine_data(X=X_test, Y=preds, y_cols=label_col)
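If you only have the dummy columns, a minimal sketch of the reversal (using the df_labels frame from your code; note this only makes sense for rows with exactly one label, which your label_sum column can filter for):
# idxmax returns the name of the column holding the 1 in each row,
# recovering a single categorical label per text
single_label = df_labels.idxmax(axis=1)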
I am trying to improve my submissions for the Kaggle House Prices Competition found here. I'm working with the Iowa data available here.
I'm trying to train and test my model using a pipeline (sklearn.pipeline.Pipeline), cross-validating with GridSearchCV (sklearn.model_selection.GridSearchCV) and using XGBRegressor (xgboost.XGBRegressor). The selected features had categorical data and NaN values that had to be imputed (sklearn.impute.SimpleImputer).
Initial setup:
import numpy as np
import pandas as pd
from xgboost import XGBRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.impute import SimpleImputer
# Path of the file to read.
iowa_file_path = '../input/train.csv'
original_home_data = pd.read_csv(iowa_file_path)
home_data = original_home_data.copy()
# delete rows where SalePrice is Nan
home_data.dropna(axis=0, subset=['SalePrice'], inplace=True)
# Create a target object and call it y
y = home_data.SalePrice
# Create X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
extra_features = ['OverallCond', 'GarageArea', 'LotFrontage', 'OverallQual', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF', 'GrLivArea', 'MoSold']
categorical_data = ['LotShape', 'MSZoning', 'Neighborhood', 'BldgType', 'HouseStyle', 'Foundation', 'KitchenQual']
features.extend(extra_features)
features.extend(categorical_data)
X = home_data[features]
The categorical data was one hot encoded by:
X = pd.get_dummies(X, prefix='OHE', columns=categorical_data)
Columns with missing values were gathered by:
cols_with_missing = (col for col in X.columns if X[col].isnull().any())
for col in cols_with_missing:
    X[col + '_was_missing'] = X[col].isnull()
The training and validation data were then split:
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1, test_size=0.25)
train_X, val_X = train_X.align(val_X, join='left', axis=1)
The pipeline was then created to impute mean for NaN with the regressor
my_pipeline = Pipeline([('imputer', SimpleImputer()), ('xgbrg', XGBRegressor())])
param_grid = {
'xgbrg__n_estimators': [10, 50, 100, 500, 1000],
'xgbrg__learning_rate': [0.01, 0.04, 0.05, 0.1, 0.5, 1]
}
fit_params = {
'xgbrg__early_stopping_rounds': 10,
'xgbrg__verbose': False,
'xgbrg__eval_set': [(np.array(val_X), val_y)]
}
I then initialized the cross validator:
searchCV = GridSearchCV(my_pipeline, cv=5, param_grid=param_grid, return_train_score=True, scoring='neg_mean_absolute_error')
and fit the model (take note of this next line):
searchCV.fit(X=np.array(train_X), y=train_y, **fit_params)
I then did the same for the test data (one-hot encoding, adding the columns-with-NaN indicators, aligning):
# path to file you will use for predictions
test_data_path = '../input/test.csv'
# read test data file using pandas
test_data = pd.read_csv(test_data_path)
# create test_X which comes from test_data but includes only the columns you used for prediction.
original_test_X = test_data[features]
test_X = original_test_X.copy()
# to one hot encode the data
test_X = pd.get_dummies(test_X, prefix='OHE', columns=categorical_data)
for col in cols_with_missing:
    test_X[col + '_was_missing'] = test_X[col].isnull()
# to align the training and test data and discard columns not in the training data
X, test_X = X.align(test_X, join='inner', axis=1)
I then tried to transform the test data with the average from the training data to impute the NaN values in the test data:
test_X = my_pipeline.named_steps['imputer'].transform(test_X)
I then get this error:
NotFittedError: This SimpleImputer instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
So I can't even use this line for prediction:
test_preds = searchCV.predict(test_X)
What might be wrong here?
How can I use my pipeline to transform another dataset after fitting?
If I try creating a new SimpleImputer() instance for the test data, imputing the NaNs with a fit_transform:
test_pipeline = SimpleImputer()
test_X = test_pipeline.fit_transform(test_X)
and I add and run:
test_preds = searchCV.predict(test_X)
I get the following error:
ValueError: X has 72 features per sample, expected 74
What is wrong here?
I had the same "This SimpleImputer instance is not fitted yet" error when refining my model at the Missing Data stage. After a lot of trial and error, the following did the trick for me:
Prep your test data in the same loop where you are prepping the training data. Basically, the "for col in cols_with_missing" loop should be run for the training and test data simultaneously. I am a newbie in this field as well (just started last week), but I am guessing this error occurs because of a mismatch in the columns if you run that col loop separately for the training and test data.
My code snippet, which worked:
cols_with_missing = (col for col in X_train.columns
                     if X_train[col].isnull().any())
for col in cols_with_missing:
    imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()
    imputed_final_test_plus[col + '_was_missing'] = imputed_final_test_plus[col].isnull()
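A separate note on the NotFittedError itself: GridSearchCV fits clones of the pipeline you pass in, so my_pipeline is never fitted by the search; the fitted copy lives on the search object. A minimal sketch of reaching it (assuming searchCV.fit(...) has already run with refit enabled, the default):
# the best pipeline, refitted on the full training data
fitted_pipeline = searchCV.best_estimator_
# its already-fitted imputer can transform the aligned test data
test_X_imputed = fitted_pipeline.named_steps['imputer'].transform(test_X)
# or simply predict end-to-end and let the pipeline impute internally
test_preds = searchCV.predict(test_X)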
This question already has answers here: Keep same dummy variable in training and testing data (5 answers)
I am using pandas get_dummies to convert categorical variables into dummy/indicator variables, which introduces new features into the dataset. Then we fit/train a model on this dataset.
Since the dimensions of X_train and X_test remain the same, prediction works well with the test data X_test.
Now let's say we have test data in another csv file (with unknown output). When we transform this test data using get_dummies, the resulting dataset may not have the same number of features as the data we trained our model on. When we then use our model on this dataset it fails, because the number of features in the testing set does not match the model's.
Any idea how we can handle this?
Code :
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
# Load the dataset
in_file = 'train.csv'
full_data = pd.read_csv(in_file)
outcomes = full_data['Survived']
features_raw = full_data.drop('Survived', axis = 1)
features = pd.get_dummies(features_raw)
features = features.fillna(0.0)
X_train, X_test, y_train, y_test = train_test_split(features, outcomes,
test_size=0.2, random_state=42)
model = DecisionTreeClassifier(max_depth=50, min_samples_leaf=6, min_samples_split=2)
model.fit(X_train,y_train)
y_train_pred = model.predict(X_train)
#print (X_train.shape)
y_test_pred = model.predict(X_test)
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)
# Doing it again to test another set of data
test_data = 'test.csv'
test_data1 = pd.read_csv(test_data)
test_data2 = pd.get_dummies(test_data1)
test_data3 = test_data2.fillna(0.0)
print(test_data2.shape)
print (model.predict(test_data3))
It seems a similar question has been asked before, but the most efficient/easiest way would be to follow the approach by Thibault Clement described here:
# Get columns that are present in the training set but missing from the test set
missing_cols = set(X_train.columns) - set(X_test.columns)
# Add each missing column to the test set with default value 0
for c in missing_cols:
    X_test[c] = 0
# Ensure the columns in the test set are in the same order as in the train set
X_test = X_test[X_train.columns]
It's also worth noting that your model can only use the features it was trained on, so any additional columns in X_test that are not in X_train have to be removed before predicting; selecting X_train.columns in the last line above already drops them, and the sketch below does both steps at once.
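Both cases (missing and extra columns) can be handled in one step with reindex; a minimal sketch, assuming X_train and X_test are the dummy-encoded frames from above:
# add any train-only columns filled with 0 and drop any test-only columns,
# keeping the training column order
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)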
So, I have this doubt and have been looking for answers. The question is: what happens when I use
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
df = pd.DataFrame({'A':[1,2,3,7,9,15,16,1,5,6,2,4,8,9],'B':[15,12,10,11,8,14,17,20,4,12,4,5,17,19],'C':['Y','Y','Y','Y','N','N','N','Y','N','Y','N','N','Y','Y']})
df[['A','B']] = min_max_scaler.fit_transform(df[['A','B']])
df['C'] = df['C'].apply(lambda x: 0 if x.strip()=='N' else 1)
After this I will train and test the model (A, B as features, C as label) and get some accuracy score. Now my doubt is: what happens when I have to predict the label for a new set of data? Say,
df = pd.DataFrame({'A':[25,67,24,76,23],'B':[2,54,22,75,19]})
Because when I normalize the columns, the values of A and B will be scaled according to the new data, not the data the model was trained on.
So, after the data preparation step below, my data will be:
data[['A','B']] = min_max_scaler.fit_transform(data[['A','B']])
The values of A and B will change with respect to the max and min of the new data, whereas the training data was prepared with respect to the max and min of df[['A','B']].
How can the data preparation be valid when it is relative to different numbers? I don't understand how the prediction can be correct here.
You should fit the MinMaxScaler using the training data and then apply the scaler on the testing data before the prediction.
In summary:
Step 1: fit the scaler on the TRAINING data
Step 2: use the scaler to transform the TRAINING data
Step 3: use the transformed training data to fit the predictive model
Step 4: use the scaler to transform the TEST data
Step 5: predict using the trained model (step 3) and the transformed TEST data (step 4).
Example using your data:
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
#training data
df = pd.DataFrame({'A':[1,2,3,7,9,15,16,1,5,6,2,4,8,9],'B':[15,12,10,11,8,14,17,20,4,12,4,5,17,19],'C':['Y','Y','Y','Y','N','N','N','Y','N','Y','N','N','Y','Y']})
#fit and transform the training data and use them for the model training
df[['A','B']] = min_max_scaler.fit_transform(df[['A','B']])
df['C'] = df['C'].apply(lambda x: 0 if x.strip()=='N' else 1)
# fit the model ("model" stands for whatever classifier you use,
# e.g. model = LogisticRegression())
model.fit(df[['A','B']], df['C'])
#after the model training on the transformed training data define the testing data df_test
df_test = pd.DataFrame({'A':[25,67,24,76,23],'B':[2,54,22,75,19]})
#before the prediction of the test data, ONLY APPLY the scaler on them
df_test[['A','B']] = min_max_scaler.transform(df_test[['A','B']])
#test the model
y_predicted_from_model = model.predict(df_test[['A','B']])
Example using iris data:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
data = datasets.load_iris()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
model = SVC()
model.fit(X_train_scaled, y_train)
X_test_scaled = scaler.transform(X_test)
y_pred = model.predict(X_test_scaled)
Hope this helps.
See also my post here: https://towardsdatascience.com/everything-you-need-to-know-about-min-max-normalization-in-python-b79592732b79
The best way is to train and save the MinMaxScaler model, then load it whenever it's required.
Saving the model:
import pickle
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
df = pd.DataFrame({'A':[1,2,3,7,9,15,16,1,5,6,2,4,8,9],'B':[15,12,10,11,8,14,17,20,4,12,4,5,17,19],'C':['Y','Y','Y','Y','N','N','N','Y','N','Y','N','N','Y','Y']})
df[['A','B']] = min_max_scaler.fit_transform(df[['A','B']])
pickle.dump(min_max_scaler, open("scaler.pkl", 'wb'))
Loading saved model:
scalerObj = pickle.load(open("scaler.pkl", 'rb'))
df_test = pd.DataFrame({'A':[25,67,24,76,23],'B':[2,54,22,75,19]})
df_test[['A','B']] = scalerObj.transform(df_test[['A','B']])
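A variation on this idea, sketched below under the assumption that any sklearn classifier works here (LogisticRegression is just a stand-in): wrapping the scaler and the model in a single Pipeline means one pickle call captures both, and the loaded object can predict on raw features directly.
import pickle
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([('scaler', MinMaxScaler()), ('clf', LogisticRegression())])
pipe.fit(df[['A','B']], df['C']) # scaler and model fitted together on the raw features
pickle.dump(pipe, open("pipeline.pkl", 'wb')) # one artifact to save and load
preds = pickle.load(open("pipeline.pkl", 'rb')).predict(df_test[['A','B']])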