I am currently working on the BNP Paribas Cardif Claim Management dataset from Kaggle, and I have finished writing my Python code (Jupyter notebook) for the train dataset, where I held out 20% of it for testing. This study requires me to test my model on a completely different dataset, test.csv, and append the predicted probabilities to sample_submission.csv. How do I go about it? What changes would I have to make, given that I have made many tweaks to the training dataset using feature selection techniques?
Let's define the following:
Xtrain - Training data with shape n×m, where n is the number of records and m the number of features
ytrain - Target for each row in Xtrain
model - The chosen model
ypred_train - model(Xtrain)
I assume you have a dataset data on which you have done some cleaning/feature engineering, such that Xtrain = clean(data).
Since your model is trained on Xtrain, which has been "transformed" using clean, you'll need to make sure that Xtest = clean(data_test).
You can do this in different ways; the simplest is to define a function clean, e.g.
def clean(X):
    """Clean the data in X.

    Parameters
    ----------
    X : pandas.DataFrame
    """
    X["sum"] = X["feature1"] + X["feature2"]
    X["lower"] = X["string_feature"].str.lower()
    X.drop(columns=["string_feature"], inplace=True)
    return X  # cleaned data
and then you can simply do
Xtrain = clean(data)
Xtest = clean(data_test)
ypred_test = model(Xtest)
Depending on what you are doing in clean, you can also look at scikit-learn Pipelines.
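For the submission step from the original question, a minimal sketch could look like the following (the model choice and the probability column name are assumptions, not part of your setup; check the actual header of sample_submission.csv):

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression

# Wrap the same clean() used on the training data so the test data goes
# through identical preprocessing (assumes clean() leaves only numeric
# columns the model can consume)
pipe = Pipeline([
    ("clean", FunctionTransformer(clean)),
    ("model", LogisticRegression(max_iter=1000)),  # assumed model; swap in your own
])
pipe.fit(data, ytrain)

data_test = pd.read_csv("test.csv")
proba = pipe.predict_proba(data_test)[:, 1]  # probability of the positive class

submission = pd.read_csv("sample_submission.csv")
submission["PredictedProb"] = proba  # assumed column name; check your file
submission.to_csv("submission.csv", index=False)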
I'm working with an SVM model to classify 5 different classes (N1, N2, N3, W, R).
Feature extraction -> Data normalization -> train SVM
When I tested the model (the usual 80/20 train-test split), it showed high accuracy.
But when I tried testing with a completely new dataset, with the same method of
Feature extraction -> Data normalization -> test on the trained SVM model
it came out really badly.
Let's say the original dataset used in training is A, and the new test dataset is B.
When I trained the model only with A and tested on B, the results were really bad.
At first I thought it was overfitting, so I included both A and B in training and tested on B. It came out badly again...
I think the problem is the normalization process. It eventually worked when I tried a new dataset C, but this time I took the training data A, concatenated A+C for normalization, and then cut only the C part out of it. When I compared that with C normalized on its own, the values were different.
I used MinMaxScaler from sklearn.
Mathematically speaking, of course it's different, because every dataset has a different minimum and maximum, so the normalized data changes when it is mixed with other data.
My question is: when you test with a new dataset, is it normal to bring in the training dataset, normalize them together, and then take out only the test part? That is, mix A (112x12) and B (15x12) -> normalize the combined (127x12) -> take out the (15x12) part.
Or should I start by fixing the code, from feature extraction through training the SVM?
(I attached the code; each feature has shape 12x1, which means each stage is a 12xN matrix.)
import pandas as pd
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()
# Load training data
N1_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_N1_features")
N2_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_N2_features")
N3_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_N3_features")
W_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_W_features")
R_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_R_features")
# Load test data
N1_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_N1_features")
N2_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_N2_features")
N3_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_N3_features")
W_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_W_features")
R_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_R_features")
# normalize with original raw features and take only test out
N1_scaled_test = features.normalize_together(N1_test, N1_train, "N1")
N2_scaled_test = features.normalize_together(N2_test, N2_train, "N2")
N3_scaled_test = features.normalize_together(N3_test, N3_train, "N3")
W_scaled_test = features.normalize_together(W_test, W_train, "W")
R_scaled_test = features.normalize_together(R_test, R_train, "R")
def normalize_together(test, raw, stage_no):
    together = pd.concat([test, raw], ignore_index=True)
    scaled_test = pd.DataFrame(scaler.fit_transform(together.iloc[:, :-1]))
    scaled_test['label'] = "{}".format(stage_no)
    scaled_test = scaled_test.iloc[0:test.shape[0], :]
    return scaled_test
Test data should remain unseen during training (and that includes preprocessing): don't use both test and train data to compute a common normalisation factor. Fit the scaler on the training set only, then use that same fitted scaler to transform the test set.
Why? It's vital to use an unseen test partition to evaluate your trained model; otherwise you haven't tested your model's ability to generalise. Imagine playing a game of cards where you already have prior knowledge of the cards or the order of the deck.
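As a minimal sketch (train_features / test_features stand in for your own feature matrices, e.g. the N1/N2/... frames without the label column):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Learn the min/max from the training features only
train_scaled = scaler.fit_transform(train_features)
# Reuse the same fitted min/max on the test features -- no refitting
test_scaled = scaler.transform(test_features)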
I have a dataset with many features, many of them categorical. I want to use an embedding layer to convert the categorical data to numerical data for use with other models. But I got an error during training.
Now, my training process is:
Perform label encoding on the categorical features
Split training and testing data with the train_test_split() function
Drop the numerical columns; only send the categorical features and the target y for model training.
And I got this error:
indices[13,0] = 10 is not in [0, 10)
[[node functional_1/embed_6/embedding_lookup (defined at <ipython-input-34-0b6b3ae455d0>:4) ]] [Op:__inference_train_function_3509]
Errors may have originated from an input operation.
Input Source operations connected to node functional_1/embed_6/embedding_lookup:
functional_1/embed_6/embedding_lookup/2395 (defined at /usr/lib/python3.6/contextlib.py:81)
Function call stack:
train_function
After searching, some say the problem is that the vocabulary_size parameter of the embedding layer is wrong, and that enlarging vocabulary_size can solve this problem.
But in my case, I need to map the result back to the original labels.
For example, I have a categorical feature ['dog', 'cat', 'fish']. After label encoding it becomes [0, 1, 2]. An embedding layer for this feature with 3 unique values should output something like
([-0.22748041], [-0.03832678], [-0.16490786]).
Then I can replace 'dog' in the original data with -0.22748041, 'cat' with -0.03832678, and so on.
So I can't change the vocabulary_size, or the output dimension will be wrong.
I guess the problem in my case is that not all of the categorical values go into the training process.
(E.g. only ['dog', 'fish'] are in the training data; ['cat'] appears only in the testing data.) If I set vocabulary_size to 3, it reports an error like the one above. If I experimentally add ['cat'] to the training data, it works fine.
My question is: does the embedding layer have to see all of the unique values during training to do what I want? If there are many categorical features with many unique values, how do I ensure all the unique values appear in the training data when splitting?
Thanks in advance!
Solution
You need to use out-of-vocabulary (OOV) buckets when creating the lookup table.
OOV buckets allow lookup of unknown categories encountered during testing.
What does this solution do?
Setting num_oov_buckets to a suitable number (like 1000) lets you also get ids for categories that were not present in the training vocabulary.
words = tf.constant(vocabulary)
word_ids = tf.range(len(vocabulary), dtype=tf.int64)
# important
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)  # lookup table: category -> id
Then you can encode the training set (I am using the IMDb ratings dataset from TensorFlow Datasets):
def encode_words(X_batch, y_batch):
    """
    Encode the training set, converting words to IDs
    using the lookup table just created.
    """
    return table.lookup(X_batch), y_batch

train_set = datasets["train"].batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)
When creating the model:
vocab_size = 10000      # set this to the size of your vocabulary
embedding_size = 128    # tweakable | hyperparameter

model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embedding_size,
                           input_shape=[None]),
    # usual code follows
])
and fit the data
model.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
history = model.fit(train_set, epochs=5)
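To see the effect on a category the table has never seen (like 'cat' from the question), a tiny standalone check could look like this (the vocabulary here is made up purely for illustration):

import tensorflow as tf

vocabulary = ["dog", "fish"]  # 'cat' deliberately missing, as in the question
words = tf.constant(vocabulary)
word_ids = tf.range(len(vocabulary), dtype=tf.int64)
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 2
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)

print(table.lookup(tf.constant(["dog", "fish", "cat"])))
# 'dog' -> 0, 'fish' -> 1, 'cat' -> hashed into one of the OOV buckets
# (an id >= len(vocabulary)), so the lookup never fails on unseen categories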
The idea of this model is that it learns, through a neural network, to multiply two features, so I created a training dataset of multiplications of random numbers from 0 to 100. Since the idea is that it learns to multiply in any situation, I created test data a) with random numbers up to 100 and b) with random numbers from 1000 to 5000.
I created the neural network below for this; however, it does not fit the test data "b" well.
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(units=2, input_dim=2))
model.add(tf.keras.layers.Dropout(0.1))
model.add(tf.keras.layers.Dense(units=64, activation='relu'))
model.add(tf.keras.layers.Dropout(0.1))
model.add(tf.keras.layers.Dense(units=32, activation='relu'))
model.add(tf.keras.layers.Dense(units=1))
model.compile(optimizer='adam', loss='mean_squared_error')
Compared with the test data "a", the prediction makes sense. But compared with the test data "b", it shows a similar curve, with values that are far off.
[Plots: test data vs. predictions for "a" and "b"]
If you want to see my complete code:
https://colab.research.google.com/drive/1rdAhZnHlxyXHHDF2D_grog05oDwYbXHa?usp=sharing
Could you help me get my model to generalize well to data much larger than the training data?
Thanks!
The scaling you describe in the comments of your notebook results in a different scaling for the training and test data. For example, if the value 100 appears in your training data, its normalized value should be the same in your test data, which is not the case right now. The easiest way to normalize the data in your case is simply to do it from the beginning, e.g. here
df = pd.DataFrame(data=a, columns=['a'])
df['b']= b
df['mult'] = df['a']*df['b']
# Scale your data here
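One way to fill in that scaling step, as a sketch (MinMaxScaler is just one option, and df_b is a hypothetical name for any later data; the point is that later data reuses the same fitted scaler):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Fit once on the training DataFrame (inputs and target together here,
# since the target 'mult' also needs a consistent scale)
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Any later data, e.g. the 1000-5000 test set "b", must be transformed with
# the same fitted scaler rather than refitted:
# df_b_scaled = pd.DataFrame(scaler.transform(df_b), columns=df_b.columns)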
In any case, I am not sure if this would solve the problem.
I have to classify inputs of shape 32x32 into 3 classes using a TF2 Keras model. My training set has 7000 examples:
>>> X_train.shape # (7000, 32, 32)
>>> Y_train.shape # (7000, 3)
The number of examples per class varies (e.g. class_0 has ~2500 examples while class_1 has ~800, etc.).
I want to use the tf.data API to create a dataset object that returns batches of training data with the number of examples from each class specified by [n_0, n_1, n_2].
I would like these n_i samples from each class to be drawn randomly, with replacement, from X_train, Y_train.
For example, if I call get_batch([100, 150, 125]) it should return 100 random samples from class_0, 150 from class_1, and 125 from class_2.
How can I achieve this using the TF2.0 Data API so I could use it for training a Keras model?
One possible approach is to proceed as follows:
Load the data from X_train & Y_train into a single tf.data Dataset, so that each X stays matched with the correct Y
.shuffle(), then split the dataset per class using filter()
Write our get_batch function to take the correct number of samples from each per-class dataset, shuffle() the combined sample, then split it back into X & Y
Something like this:
# 1: Load the data into a Dataset
raw_data = tf.data.Dataset.zip(
    (
        tf.data.Dataset.from_tensor_slices(X_train),
        tf.data.Dataset.from_tensor_slices(Y_train)
    )
).shuffle(7000)

# 2: Split for each category
def get_filter_fn(n):
    def filter_fn(x, y):
        return tf.equal(1.0, y[n])
    return filter_fn

n_0s = raw_data.filter(get_filter_fn(0))
n_1s = raw_data.filter(get_filter_fn(1))
n_2s = raw_data.filter(get_filter_fn(2))

# 3:
def get_batch(n_0, n_1, n_2):
    sample = n_0s.take(n_0).concatenate(n_1s.take(n_1)).concatenate(n_2s.take(n_2))
    shuffled = sample.shuffle(n_0 + n_1 + n_2)
    return shuffled.map(lambda x, y: x), shuffled.map(lambda x, y: y)
So now we can do:
x_batch, y_batch = get_batch(100, 150, 125)
Note that I've used some potentially wasteful operations here in pursuit of an approach I find intuitive and straightforward (specifically, reading the raw_data dataset three times for the filter operations), so I make no claim that this is the most efficient way to accomplish what you need. But for a dataset that fits in memory, like the one you describe, such inefficiencies should be negligible.
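To actually feed the result into model.fit, one option (an assumption on my part, not part of the answer above) is to zip the two datasets back together and batch them:

# Recombine the X and Y datasets and batch them for training
train_ds = tf.data.Dataset.zip((x_batch, y_batch)).batch(32)
model.fit(train_ds, epochs=1)  # 'model' is whatever compiled Keras model you build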
Scikit-learn's train_test_split actually has a parameter for that. While it doesn't let you pick the exact number of samples, it selects them proportionally from the classes.
from sklearn.model_selection import train_test_split

X_train_stratified, X_test_stratified, y_train_strat, y_test_strat = train_test_split(X_train, y_train, test_size=0.2, stratify=y_train)
If you want to do cross-validation, you can also use StratifiedShuffleSplit.
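A quick sketch of the latter with the shapes from the question (the split sizes are placeholders; the argmax is needed because Y_train is one-hot):

from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
for train_idx, val_idx in sss.split(X_train, Y_train.argmax(axis=1)):
    X_tr, X_val = X_train[train_idx], X_train[val_idx]
    Y_tr, Y_val = Y_train[train_idx], Y_train[val_idx]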
I hope I understood your question correctly
I'm trying to implement an NN model with pairwise samples. Details are as follows:
Original data:
X_org with shape of (100, 50) for example, namely 100 samples with 50 features.
Y_org with shape of (100, 1).
Processing these original data for real training:
Select 2 samples from X_org at random (so we have 100*99/2 such combinations) to form a new 'pairwise' sample; the prediction target, i.e. the new y label, is the difference of the two corresponding y_org labels (Y_org_sample1 - Y_org_sample2). Now we have a new X_train and Y_train.
I need an NN model (DNN, CNN, LSTM, whatever ...) into which I can pass the first sub-sample of a pairwise sample from X_train and get one result, and do the same for the second sub-sample. By taking the difference of the two results, I get the prediction for this pairwise sample. This prediction is what gets compared with the corresponding Y label from Y_train.
Overall, I need to train a model (update the weights) after feeding it a 'pairwise' sample (two successive sub-samples). The reason I don't choose a 'two-arm' model (e.g. merging two arms with xxx.sub()) is that I will only feed one sub-sample during the test process; in the end I will just use the model to predict a single sub-sample.
So I will use the data from X_train during the training step, and X_org-like data during the test step. It looks a bit complex.
It seems TensorFlow would be more feasible for this task; if Keras also works, please kindly share your idea.
You can first create a model that will take only one X_org-like element:
#create a model the way you like it, it can be Functional API or Sequential, no problem
xOrgModel = createAModelForXOrgData(...)
Now, let's create a second model, this time necessarily with the functional API, that works with both inputs:
from keras.models import Model
from keras.layers import Input, Subtract
input1 = Input(shapeOfInput)
input2 = Input(shapeOfInput)
output1 = xOrgModel(input1)
output2 = xOrgModel(input2)
output = Subtract()([output1,output2])
pairWiseModel = Model([input1,input2],output)
Now you have two models: xOrgModel and pairWiseModel. You can use either of them depending on the task you are doing at the moment.
Both models share their weights. This means that you can train either of them and the other will be updated as well.
Using the pairwise model
First, organize your data in two separate arrays (because our model uses two inputs):
import numpy as np

L = len(X_org)
x1 = []
x2 = []
y = []

for i in range(L):
    for j in range(i + 1, L):
        x1.append(X_org[i])
        x2.append(X_org[j])
        y.append(Y_org[i] - Y_org[j])

x1 = np.array(x1)
x2 = np.array(x2)
y = np.array(y)
Train and predict with a list of inputs:
pairWiseModel.fit([x1,x2],y,...)
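To complete the picture, a hedged sketch of the training and test-time calls (the loss and epoch/batch values are assumptions; since the pairwise target is a real-valued difference, MSE is one reasonable choice, and X_test_single is a hypothetical name for your X_org-like test data):

# Compile and train on the pairwise data
pairWiseModel.compile(optimizer='adam', loss='mse')
pairWiseModel.fit([x1, x2], y, epochs=10, batch_size=32)

# At test time, the shared single-input model scores individual samples directly
single_predictions = xOrgModel.predict(X_test_single)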