I have a dataset of 3321 rows, which I have divided into train, test, and CV sets.
After dividing the dataset I applied response coding and one-hot encoding, but after one-hot encoding the number of columns differs across the sets, due to which I am getting an error later while predicting.
#response coding for the Gene feature
alpha = 1 #Used for laplace smoothing
train_gene_feature_responseCoding = np.array(get_gv_feature(alpha, "Gene", train_df)) #train gene feature
test_gene_feature_responseCoding = np.array(get_gv_feature(alpha, "Gene", test_df)) #test gene feature
cv_gene_feature_responseCoding = np.array(get_gv_feature(alpha, "Gene", cv_df)) #cv gene feature
#one-hot encoding of Gene Feature
gene_vectorizer = CountVectorizer()
train_gene_feature_onehotCoding = gene_vectorizer.fit_transform(train_df['Gene'])
test_gene_feature_onehotCoding = gene_vectorizer.fit_transform(test_df['Gene'])
cv_gene_feature_onehotCoding = gene_vectorizer.fit_transform(cv_df['Gene'])
train_gene_feature_responseCoding.shape -> (2124, 9)
test_gene_feature_responseCoding.shape -> (665, 9)
cv_gene_feature_responseCoding.shape -> (532, 9)
train_gene_feature_onehotCoding.shape -> (2124, 228)
test_gene_feature_onehotCoding.shape -> (665, 158)
cv_gene_feature_onehotCoding.shape -> (532, 144)
You need to use gene_vectorizer.transform() only on the test and cv dataframes.
gene_vectorizer.transform(test_df['Gene'])
gene_vectorizer.transform(cv_df['Gene'])
In the scikit-learn estimator API:
fit(): learns model parameters from the training data.
transform(): applies the parameters learned by fit() to transform a dataset.
fit_transform(): a combination of fit() and transform() on the same dataset.
So on the test and cv datasets, you just need to use transform() to convert them to the shape the model accepts, as in the sketch below.
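Here is a minimal sketch of that pattern, assuming the same train_df, test_df, and cv_df frames as above: the vectorizer is fitted once on the training data, so all three matrices share the same 228 columns.
from sklearn.feature_extraction.text import CountVectorizer
gene_vectorizer = CountVectorizer()
# learn the vocabulary from the training set only
train_gene_feature_onehotCoding = gene_vectorizer.fit_transform(train_df['Gene'])
# reuse the fitted vocabulary, so test and cv get identical columns
test_gene_feature_onehotCoding = gene_vectorizer.transform(test_df['Gene'])
cv_gene_feature_onehotCoding = gene_vectorizer.transform(cv_df['Gene'])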
Reference: what is the difference between 'transform' and 'fit_transform' in sklearn
I am currently running a logistic regression model using Keras.
I have 1 numeric variable and around 6 categorical variables.
I am currently using a ColumnTransformer for training and testing the model, and it works perfectly (code shown below):
numeric_variables = ["var1"]
cat_variables = ["var2","var3","var4","var5","var6","var7"]
pipeline = ColumnTransformer([('num', StandardScaler(), numeric_variables), ('cat', OneHotEncoder(handle_unknown="ignore"), cat_variables)], remainder="passthrough")
pipeline.fit(X_Train)
pipeline.fit_transform(X_Train)
This works perfectly when I run the train and test datasets.
However, when I deploy the model to get the probability of a customer renewing, I am sending the data as a dataframe with one row.
While fit_transform for X_Train and X_Test gives out an nx17 array (because of the one-hot encoding of the 7 variables), the transform for the predictions only gives nx7.
My theory here is that the pipeline is dropping one-hot encoded fields. For instance, if var2 can take 3 values (say "M", "F", and "O"), X_Train gives out 3 columns for it (isM, isF, and isO), while the transform for the predictions only gives the output for isM if the value of var2 is "M".
How do I address this issue?
I get this error when I run the model.predict on the single customer example:
Input 0 of layer "sequential" is incompatible with the layer: expected shape=(None, 19), found shape=(None, 7)
After the discussion in the comments:
It appears that you are using pipeline.fit_transform(X_test). This means you are fitting your pipeline on X_test before transforming it. This is a problem in your case for two reasons:
You are re-fitting the StandardScaler, which means you will scale your features differently than you did with the train set.
You are re-fitting the OneHotEncoder, so you can miss categories in cat_variables that were present only in the train set. Consequently, your output shape is smaller.
Simply use .transform(X_test) instead.
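A minimal sketch of the intended flow, assuming the pipeline and X_Train/X_Test from the question, plus a hypothetical one-row frame X_new at inference time: the pipeline is fitted once on the training data and only transform() is called afterwards, so every output has the same width.
# fit the scaler and encoder on the training data only
X_train_enc = pipeline.fit_transform(X_Train)
# reuse the fitted pipeline; categories unseen in training become
# all-zero columns thanks to handle_unknown="ignore"
X_test_enc = pipeline.transform(X_Test)
X_new_enc = pipeline.transform(X_new)  # single-customer dataframe
model.predict(X_new_enc)  # the Keras model from the question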
I'm working with an SVM model to classify 5 different classes (N1, N2, N3, W, R).
Feature extraction -> data normalization -> train SVM
When I tested the model (usual 80/20 train-test split), it showed high accuracy.
But when I tried testing with a completely new dataset, with the same method of
feature extraction -> data normalization -> test on the trained SVM model,
it came out really badly.
Let's say the original dataset used in training is A, and the new test dataset is B.
When I trained the model only with A and tested on B, it came out really badly.
First I thought it was model overfitting, so I included both A and B in training and tested with B. It came out badly again...
I think the problem is the normalization process. It eventually worked when I tried a new dataset C, but this time I took the training data A, concatenated A+C for normalization, and then cut only the C part back out. And when I compared that with C normalized alone, the values were different...
I used MinMaxScaler from sklearn.
I mean, mathematically speaking, of course it's different, because every dataset has a different minimum and maximum value, and the normalized data will differ when mixed with other data.
My question is: when you test with a new dataset, is it normal to bring in the train dataset, normalize them together, and then take out only the test part? It's like mixing A (112x12) and B (15x12) -> normalizing (127x12) together -> taking out (15x12).
Or should I start by fixing the code, from feature extraction and SVM training?
(I attached the code; each feature has a 12x1 shape, which means each stage has a 12xN matrix.)
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
# Load training data
N1_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_N1_features")
N2_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_N2_features")
N3_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_N3_features")
W_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_W_features")
R_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_R_features")
# Load test data
N1_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_N1_features")
N2_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_N2_features")
N3_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_N3_features")
W_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_W_features")
R_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_R_features")
# normalize with original raw features and take only test out
N1_scaled_test = features.normalize_together(N1_test, N1_train, "N1")
N2_scaled_test = features.normalize_together(N2_test, N2_train, "N2")
N3_scaled_test = features.normalize_together(N3_test, N3_train, "N3")
W_scaled_test = features.normalize_together(W_test, W_train, "W")
R_scaled_test = features.normalize_together(R_test, R_train, "R")
def normalize_together(test, raw, stage_no):
    together = pd.concat([test, raw], ignore_index=True)
    scaled_test = pd.DataFrame(scaler.fit_transform(together.iloc[:, :-1]))
    scaled_test['label'] = "{}".format(stage_no)
    scaled_test = scaled_test.iloc[0:test.shape[0], :]
    return scaled_test
Test data should remain unseen during training (and that includes preprocessing) - don't use test + train data together to compute a common normalisation factor. Fit the scaler on the training set, then apply that same fitted scaler to the test set.
Why? It's vital to use an unseen test partition to evaluate your trained model. Otherwise you have not tested the ability of your model to generalise - imagine playing a game of cards where you already have prior knowledge of the cards or the order of the deck.
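A minimal sketch of that split, assuming the N1_train/N1_test frames loaded above, with the label in the last column as in normalize_together:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# learn min/max from the training features only
N1_train_scaled = scaler.fit_transform(N1_train.iloc[:, :-1])
# apply the training set's min/max to the unseen test features
N1_test_scaled = scaler.transform(N1_test.iloc[:, :-1])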
I am currently working on the BNP Paribas Cardif Claims Management dataset from Kaggle, and I have finished writing my code in Python (Jupyter notebook) for the train dataset, of which I used 20% for testing. This study requires me to test my model on a completely different dataset, test.csv, and append the predicted probabilities to sample_submission.csv. How do I go about it? What changes would I have to make, since I have made many tweaks to the training dataset using feature selection techniques?
Let's define the following:
Xtrain - training data with shape nxm, where n is the number of records and m the number of features
ytrain - Target for each row in Xtrain
model - The chosen model
ypred_train - model(Xtrain)
I assume you had a dataset data on which you did some cleaning/feature engineering such that Xtrain = clean(data).
Since your model is trained on Xtrain, which has been "transformed" using clean, you'll need to make sure that Xtest = clean(data_test).
You can do this in different ways; the simplest is to define a function clean, e.g.
def clean(X):
    """Cleans the data in X.

    input
    ------
    X: pandas.DataFrame
    """
    X["sum"] = X["feature1"] + X["feature2"]
    X["lower"] = X["string_feature"].str.lower()
    X.drop(columns=["string_feature"], inplace=True)
    return X  # cleaned data
and then you can simply do
Xtrain = clean(data)
Xtest = clean(data_test)
ytest = model(Xtest)
Depending on what you are doing in clean, you can look at pipelines, as in the sketch below.
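A hedged sketch of that idea: the clean function above is wrapped in a FunctionTransformer so it runs automatically on whatever data flows through the pipeline; the LogisticRegression estimator is just an illustrative choice, not the asker's model.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
    ("clean", FunctionTransformer(clean)),  # clean() applied to any input
    ("model", LogisticRegression()),
])
pipe.fit(data, ytrain)                 # clean(data) happens inside fit
proba = pipe.predict_proba(data_test)  # clean(data_test) happens inside predict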
I have looked at similar questions, such as this one, but none of the mentioned solutions worked in my case.
I am trying to build a text classification prediction model.
def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    # predict the labels on the validation dataset
    predictions = classifier.predict(feature_vector_valid)
    if is_neural_net:
        predictions = predictions.argmax(axis=-1)
    return metrics.accuracy_score(predictions, train_label)
# Naive Bayes on word-level TF-IDF vectors
accuracy = train_model(naive_bayes.MultinomialNB(), train_text, train_label, test_text)
print("NB, WordLevel TF-IDF: ", accuracy)
However, the Naive Bayes model returns the error below:
ValueError: Found input variables with inconsistent numbers of samples: [500, 3100]
My training data:
print(train_text.shape) -> (3100, 3013)
type(train_text) -> scipy.sparse.csr.csr_matrix
My training labels:
print(train_label.shape) -> (3100,)
type(train_label) -> numpy.ndarray
My test dataset:
print(test_text.shape) -> (500, 3013)
type(test_text) -> scipy.sparse.csr.csr_matrix
I have tried every possible type of transformation. Can anyone recommend a solution? Thanks.
I guess the problem is in
predictions = classifier.predict(feature_vector_valid)
return metrics.accuracy_score(predictions, train_label)
What is the shape of predictions? Is train_label a global variable inside train_model? Also, does predictions have the same shape as train_label?
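A hedged sketch of the likely fix, assuming a valid_label/test_label array holding the 500 test labels (names not in the original code): score the predictions against the labels of the same 500 validation rows, not the 3100 training labels.
def train_model(classifier, feature_vector_train, label, feature_vector_valid, valid_label, is_neural_net=False):
    classifier.fit(feature_vector_train, label)
    predictions = classifier.predict(feature_vector_valid)
    if is_neural_net:
        predictions = predictions.argmax(axis=-1)
    # 500 predictions compared with the 500 matching labels
    return metrics.accuracy_score(valid_label, predictions)
accuracy = train_model(naive_bayes.MultinomialNB(), train_text, train_label, test_text, test_label)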
I'm using sklearn 0.19.1 with DecisionTree and AdaBoost.
I have a DecisionTree classifier that works fine:
clf = tree.DecisionTreeClassifier()
train_split_perc = 10000
test_split_perc = pdf.shape[0] - train_split_perc
train_pdf_x = pdf[:train_split_perc]
train_pdf_y = YY[:train_split_perc]
test_pdf_x = pdf[-test_split_perc:]
test_pdf_y = YY[-test_split_perc:]
clf.fit(train_pdf_x, train_pdf_y)
pred2 = clf.predict(test_pdf_x)
But when trying to add AdaBoost, it throws an error on the predict function:
treeclf = tree.DecisionTreeClassifier(max_depth=3)
adaclf = AdaBoostClassifier(base_estimator=treeclf, n_estimators=500, learning_rate=0.5)
train_split_perc = 10000
test_split_perc = pdf.shape[0] - train_split_perc
train_pdf_x = pdf[:train_split_perc]
train_pdf_y = YY[:train_split_perc]
test_pdf_x = pdf[-test_split_perc:]
test_pdf_y = YY[-test_split_perc:]
adaclf.fit(train_pdf_x, train_pdf_y)
pred2 = adaclf.predict(test_pdf_x)
Specifically the error says:
ValueError: bad input shape (236821, 6)
The data it seems to be pointing to is train_pdf_y, because that has a shape of (236821, 6), and I don't understand why.
Even from the description of the AdaBoostClassifier in the docs, I can see that the actual classifier that uses the data is the DecisionTree:
An AdaBoost [1] classifier is a meta-estimator that begins by fitting
a classifier on the original dataset and then fits additional copies
of the classifier on the same dataset but where the weights of
incorrectly classified instances are adjusted such that subsequent
classifiers focus more on difficult cases
But still I'm getting this error.
I've followed the code examples I've found, even on sklearn's website showing how to use AdaBoost, and I can't understand what I'm doing wrong.
Any help is appreciated.
It looks like you are trying to perform a multi-output classification problem, given the shape of y; otherwise it does not make sense to feed an n-dimensional y to adaclf.fit(train_pdf_x, train_pdf_y).
Assuming that is the case, the problem is that Scikit-Learn's DecisionTreeClassifier does indeed support multi-output problems, that is, y inputs with shape [n_samples, n_outputs]. However, that is not the case for the AdaBoostClassifier, given that, according to the documentation, the labels must be:
y : array-like of shape = [n_samples]
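Two hedged workarounds, depending on what the 6 columns of train_pdf_y represent (all other names come from the code above):
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
# If the 6 columns are a one-hot encoding of a single label,
# collapse them back into a 1-D class vector first:
y_1d = np.asarray(train_pdf_y).argmax(axis=1)
adaclf.fit(train_pdf_x, y_1d)
# If they are genuinely 6 independent targets, train one
# boosted model per output instead:
multi_adaclf = MultiOutputClassifier(adaclf)
multi_adaclf.fit(train_pdf_x, train_pdf_y)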