Predicting new values with a model trained on one-hot encoded data

This might look like a trivial problem, but I am stuck on getting predictions from a model. My problem is this:
I have a dataset of shape 1000 x 19 (excluding the target feature), but after one-hot encoding it becomes 1000 x 141.
Since I trained the model on data of shape 1000 x 141, I need data of shape 1 x 141 (at least) for prediction.
I also know that in Python I can make predictions using
model.predict(data)
However, I receive the data from an end user through a web portal, and it has shape 1 x 19. I am confused about how to proceed to make predictions from this user data.
How can I convert data of shape 1 x 19 into 1 x 141 while keeping the same column order as in the train/test data?
Any help in this direction would be highly appreciated.

I am assuming that you are using scikit-learn's OneHotEncoder to create the one-hot encoding. If so, the problem is easy to solve. You fit the one-hot encoder on your training data:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(categories = "auto", handle_unknown = "ignore")
X_train_encoded = encoder.fit_transform(X_train)
In the code above, the encoder is fitted on your training data, so when you get the test data (or a new row from a user), you can transform it into the same encoded layout with this fitted encoder:
test_data = encoder.transform(test_data)
Now your test data will also have shape 1 x 141. You can check the shape using
(pd.DataFrame(test_data.toarray())).shape
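As a minimal sketch of the idea above, here is a small end-to-end example. The column names and values are made up for illustration, and the data is much smaller than the 1000 x 19 case in the question; the point is only that transform reuses the categories and column order learned during fit.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: two categorical columns (illustrative names).
X_train = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": ["S", "M", "L", "M"],
})

# Fit once, on the training data only.
encoder = OneHotEncoder(categories="auto", handle_unknown="ignore")
X_train_encoded = encoder.fit_transform(X_train)

# One row as it might arrive from the web portal, with the same columns.
# "XL" was never seen during training.
user_row = pd.DataFrame({"color": ["green"], "size": ["XL"]})

# transform (not fit_transform) reuses the categories learned during fit.
user_encoded = encoder.transform(user_row)

print(X_train_encoded.shape)   # (4, 6)
print(user_encoded.shape)      # (1, 6)
print(user_encoded.toarray())  # [[0. 1. 0. 0. 0. 0.]]
```

Because handle_unknown is "ignore", the unseen value "XL" is encoded as all zeros for that column instead of raising an error. In a web application the fitted encoder would typically be saved together with the model (for example with joblib.dump) and loaded again at prediction time.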

Related

scaling and encoding with same technique at prediction

I used RobustScaler and LabelEncoder for the numerical and categorical columns, respectively, at training time.
num_features = x_df[numeric_type]
scaler = RobustScaler()
x_df[numeric_type] = scaler.fit_transform(num_features.values)
label_encoder = LabelEncoder()
x_df[categorical_type[i]] = label_encoder.fit_transform(x_df[categorical_type[i]])
This works fine.
I trained my model as well, but when I try to supply new values for prediction, I have to give them in the encoded and scaled form.
For example:
At training:
apple -> 1
At prediction: if I want to pass apple as a value, I want to type apple, but the model requires 1.
Is there a way to encode new values with the same technique that was used in training, and with the same sequence of variables? The variables are user-defined, so I do not know in advance the sequence of variables the user chose for training and prediction.
Thanks in advance.
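One possible approach, sketched under assumptions: keep the fitted scaler, one fitted LabelEncoder per categorical column, and the training column order, then apply them to raw user input. The frame contents and the prepare helper are hypothetical and only illustrate the pattern.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, RobustScaler

# Hypothetical training frame; column names are illustrative only.
x_df = pd.DataFrame({
    "fruit": ["apple", "banana", "cherry", "apple"],
    "weight": [150.0, 120.0, 10.0, 160.0],
})
numeric_type = ["weight"]
categorical_type = ["fruit"]

# Fit once at training time and keep the fitted objects.
scaler = RobustScaler().fit(x_df[numeric_type].values)
label_encoders = {col: LabelEncoder().fit(x_df[col]) for col in categorical_type}
feature_order = list(x_df.columns)  # remember the training column order

def prepare(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the training-time scaling and encoding to raw user input."""
    out = raw[feature_order].copy()  # enforce the training column order
    out[numeric_type] = scaler.transform(out[numeric_type].values)
    for col in categorical_type:
        out[col] = label_encoders[col].transform(out[col])
    return out

# The user types "apple"; the stored encoder maps it to its training-time code.
# The columns are deliberately given in a different order than in training.
new_row = pd.DataFrame({"weight": [150.0], "fruit": ["apple"]})
prepared = prepare(new_row)
print(prepared)
```

Note that LabelEncoder.transform raises an error for a value it did not see during fit, so unseen categories would need separate handling. The fitted objects can be persisted (for example with joblib.dump) so that the prediction service loads exactly what was used in training.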

using ColumnTransformer for predicting values

I am currently running a logistic regression model using keras.
I have 1 numeric variable and around 6 categorical variables.
I am currently using a column transformer for training and testing the model, and it works perfectly (code shown below):
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
numeric_variables = ["var1"]
cat_variables = ["var2", "var3", "var4", "var5", "var6", "var7"]
pipeline = ColumnTransformer([
    ("num", StandardScaler(), numeric_variables),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_variables),
], remainder="passthrough")
pipeline.fit(X_Train)
pipeline.fit_transform(X_Train)
This works perfectly when I run the train and test dataset.
However, when I deploy the model to get the probability of a customer renewing, I am sending the data as a dataframe with one row.
While the fit_transform for X_Train and X_Test gives out a nx17 array (because of the onehotencoding of the 7 factors), the transform of the predictions only gives nx7.
My theory is that the pipeline is dropping one-hot encoded fields. For instance, if var2 can take 3 values (say "M", "F" and "O"), X_Train yields 3 columns (isM, isF and isO), while the transform for the predictions only outputs "isM" when the value of var2 is "M".
How do I address this issue?
I get this error when I run the model.predict on the single customer example:
Input 0 of layer "sequential" is incompatible with the layer: expected shape=(None, 19), found shape=(None, 7)

After the discussion in the comments:
It appears that you are using pipeline.fit_transform(X_test). This means you are re-fitting your pipeline on X_test before transforming it. This is a problem in your case for two reasons:
You are re-fitting the StandardScaler, which means the features are scaled differently from how they were scaled for the train set.
You are re-fitting the OneHotEncoder. Hence, you can miss categories in cat_variables that were present only in the train set, and consequently your output has fewer columns.
Simply use pipeline.transform(X_test) instead, so that the transformers fitted on the train set are reused.
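The difference can be reproduced on a tiny example. This is only a sketch: the data is made up, it uses one numeric and one categorical column instead of the seven in the question, and make_transformer is a hypothetical helper.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def make_transformer():
    """Build an unfitted transformer like the one in the question (two columns only)."""
    return ColumnTransformer([
        ("num", StandardScaler(), ["var1"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["var2"]),
    ])

# Hypothetical training data.
X_train = pd.DataFrame({
    "var1": [1.0, 2.0, 3.0, 4.0],
    "var2": ["M", "F", "O", "M"],
})

pipeline = make_transformer()
train_out = pipeline.fit_transform(X_train)  # fit happens only here

# A single customer at prediction time.
customer = pd.DataFrame({"var1": [2.5], "var2": ["M"]})

# Correct: reuse the transformers fitted on the train set.
good = pipeline.transform(customer)

# Incorrect: re-fitting on the single row learns only the category "M".
bad = make_transformer().fit_transform(customer)

print(train_out.shape)  # (4, 4)
print(good.shape)       # (1, 4)
print(bad.shape)        # (1, 2)
```

The re-fitted version produces fewer columns because the encoder only knows the categories present in that one row, which matches the shape mismatch reported by model.predict.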

fit and transform error on Cross validation and test data

I need help with the code below. I am trying to fit and transform the train data and then transform the cross-validation and test data, but when I do that I get this error: ValueError: X has 24155 features, but Normalizer is expecting 49041 features as input.
Can someone please help me solve this issue?
My code snippet:
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
X_train_price_norm = normalizer.fit_transform(X_train['price'].values.reshape(1,-1))
X_cv_price_norm = normalizer.transform(X_cv['price'].values.reshape(1,-1))
X_test_price_norm = normalizer.transform(X_test['price'].values.reshape(1,-1))
print("After vectorizations")
print(X_train_price_norm.shape, y_train.shape)
print(X_cv_price_norm.shape, y_cv.shape)
print(X_test_price_norm.shape, y_test.shape)
print("="*100)

The transform function expects a 2D array of shape (samples, features).
The error indicates that the second dimension of the reshaped X_train['price'] differs from that of X_cv['price'] or X_test['price'].
As the code shows, you have one feature (price) and many samples. Following the (samples, features) convention above, your input shape should be (n_samples, 1). So change the reshape to (-1,1) instead of (1,-1).
X_train_price_norm = normalizer.fit_transform(X_train['price'].values.reshape(-1,1))
X_cv_price_norm = normalizer.transform(X_cv['price'].values.reshape(-1,1))
X_test_price_norm = normalizer.transform(X_test['price'].values.reshape(-1,1))
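To see why the two reshapes behave differently, here is a small sketch with made-up price arrays of different lengths standing in for the train and cross-validation sets.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical price columns of different lengths (train and CV stand-ins).
train_price = np.array([10.0, 20.0, 30.0, 40.0])
cv_price = np.array([15.0, 25.0])

# reshape(1, -1): one sample whose "features" are all the rows.
print(train_price.reshape(1, -1).shape)  # (1, 4)
print(cv_price.reshape(1, -1).shape)     # (1, 2), so the feature counts differ

# reshape(-1, 1): many samples with one feature each.
normalizer = Normalizer()
train_norm = normalizer.fit_transform(train_price.reshape(-1, 1))
cv_norm = normalizer.transform(cv_price.reshape(-1, 1))
print(train_norm.shape, cv_norm.shape)   # (4, 1) (2, 1)
```

One thing to be aware of: Normalizer rescales each row to unit norm, so with a single column every non-zero value becomes 1 or -1. The reshape to (-1,1) fixes the shape error, but whether the resulting values are what you want depends on the use case; a column-wise scaler such as MinMaxScaler or StandardScaler may be closer to the intent.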

Apply embedding layer for categorical variable with keras

I have a dataset with many features, a lot of them categorical. I want to apply an embedding layer to convert the categorical data to numerical data for use in other models. But I got an error during training.
Now, my training process is:
Perform label encoder to categorical features
Split training and testing data by train_test_split() function
Drop the numerical columns. Only send the categorical features and target y for model training.
And I got this error:
indices[13,0] = 10 is not in [0, 10)
[[node functional_1/embed_6/embedding_lookup (defined at <ipython-input-34-0b6b3ae455d0>:4) ]] [Op:__inference_train_function_3509]
Errors may have originated from an input operation.
Input Source operations connected to node functional_1/embed_6/embedding_lookup:
functional_1/embed_6/embedding_lookup/2395 (defined at /usr/lib/python3.6/contextlib.py:81)
Function call stack:
train_function
After searching, I found suggestions that the vocabulary_size parameter of the embedding layer is wrong and that enlarging it solves the problem.
But in my case, I need to map the result back to the original label.
For example, I have a categorical feature ['dog', 'cat', 'fish']. After label encoding, it becomes [0,1,2]. An embedding layer for this feature with 3 unique values should output something like
([-0.22748041], [-0.03832678], [-0.16490786]).
Then I can replace 'dog' in the original data with -0.22748041, 'cat' with -0.03832678, and so on.
So I can't change vocabulary_size, or the output dimension will be wrong.
I guess the problem in my case is that not all of the categorical values go into the training process.
(E.g., only ['dog', 'fish'] are in the training data, and ['cat'] appears only in the testing data.) If I set vocabulary_size to 3, it reports an error like the one above. If I experimentally add ['cat'] to the training data, it works fine.
My question is: does the embedding layer have to see all of the unique values during training for the application I want? If there is a lot of categorical data with many unique values, how can I ensure that all the unique values appear in the testing data when splitting the data?
Thanks in advance!

Solution
You need to use out-of-vocabulary (OOV) buckets when creating the lookup table.
OOV buckets allow unknown categories to be looked up if they are encountered during testing.
What does the solution do?
Setting num_oov_buckets to a suitable number (like 1000) lets you also get IDs for categories that were not present in the vocabulary.
import tensorflow as tf
from tensorflow import keras
# vocabulary: list of the known categories (words) seen in training
words = tf.constant(vocabulary)
word_ids = tf.range(len(vocabulary), dtype=tf.int64)
# important
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)  # lookup table: category -> id
Then you can encode the training set (I am using the IMDb rating dataset from TensorFlow Datasets):
def encode_words(X_batch, y_batch):
    """
    Encode the training set, converting words to IDs
    using the lookup table just created.
    """
    return table.lookup(X_batch), y_batch
# datasets and preprocess are defined elsewhere (not shown here)
train_set = datasets["train"].batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)
When creating the model:
vocab_size = 10000  # the length of the vocabulary variable
embedding_size = 128  # tunable hyperparameter
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embedding_size,
                           input_shape=[None]),
    # usual code follows
])
And fit the data:
model.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
history = model.fit(train_set, epochs=5)
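To make the out-of-vocabulary idea concrete, here is a pure-Python sketch of what such a lookup does conceptually. It is not TensorFlow's implementation: build_lookup is a hypothetical helper and the crc32 hash is an arbitrary choice for illustration. It reuses the dog/fish/cat example from the question.

```python
import zlib

def build_lookup(vocabulary, num_oov_buckets):
    """Return a function mapping a category to an integer ID.

    Known categories get IDs 0 .. len(vocabulary) - 1. Unknown categories
    are hashed into one of num_oov_buckets extra IDs after the known range.
    """
    known = {word: idx for idx, word in enumerate(vocabulary)}

    def lookup(word):
        if word in known:
            return known[word]
        # crc32 is an arbitrary, deterministic hash chosen for this sketch.
        bucket = zlib.crc32(word.encode("utf-8")) % num_oov_buckets
        return len(known) + bucket

    return lookup

vocabulary = ["dog", "fish"]   # categories seen in training
num_oov_buckets = 1
lookup = build_lookup(vocabulary, num_oov_buckets)

ids = [lookup(w) for w in ["dog", "fish", "cat"]]   # "cat" is unseen
embedding_rows = len(vocabulary) + num_oov_buckets  # rows the embedding needs

print(ids)             # [0, 1, 2]
print(embedding_rows)  # 3
```

Every ID produced this way is smaller than embedding_rows, which is the condition the "is not in [0, 10)" error in the question complains about: an embedding layer can only look up indices below its input dimension.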

Understanding what exactly neural network is predicting in documentation example (MNIST)

I've taken a quick course in neural networks to better understand them and now I'm trying them out for myself in R. I'm following this documentation of Keras.
The way I understand what is happening:
We are inputting a series of images and transforming these images to numerical matrices based on the arrangement of the pixels and the colors in those pixels. We then build a neural network model to learn the pattern of these arrangements, depending on the classification (0 to 9). We then use the model to predict which class an image belongs to. I'll be honest and admit I'm not entirely sure what y_train and x_train are. I simply see them as one training and one validation set, so I'm not sure what the difference between x and y is.
My question:
I've followed the steps to the T and the model runs fine and the predictions look like they do in the documentation. Ultimately, the prediction looks like this:
I take this to mean that observation 1 in x_test is predicted to be a category 7.
However, looking at x_test it looks like this:
There is a 0 in every column and row, also if I scroll further down. This is where I get confused. I'm also not sure how I view the original images to view for myself how well they are predicting them. I would eventually like to draw a number myself in paint or so and then see if the model can predict it, but for that I need to first understand what is going on. I feel I am close but I just need a little nudge!

I think if you read more about the input and output layer's dimensions, that would help.
In your example:
Input layer:
A single training image has two dimensions, 28 x 28, and is converted to a single vector of length 784. This acts as the input layer for the neural network.
So for m training examples, your input layer will have dimensions (m, 784). By analogy with traditional ML systems, you can imagine that each pixel of an image is converted into a feature (x1, x2, ... x784), and your training set is a dataframe with m rows and 784 columns, which is then fed into the neural network to compute y_hat = f(x1,x2,x3,...x784).
Output layer:
As the output of our neural network, we want it to predict which number the image shows, from 0 to 9. So for a single example, the output layer has dimension 10, one entry for each number from 0 to 9, and for n testing examples the output would be a matrix with dimension n*10.
Our y is a vector of length n, something like [1,7,8,2,.....], containing the true value for each testing example. But to match the dimension of the output layer, the y vector is converted using one-hot encoding. Imagine a length-10 vector representing the number 7 by putting a 1 at the position for class 7 (the eighth position, since the classes start at 0) and zeros everywhere else, something like [0,0,0,0,0,0,0,1,0,0].
So in your question, if you wish to see the original image, you should be able to view it before reshaping the examples, with something like image(mnist$test$x[1, , ])
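The shapes described above can be sketched in a few lines. This is written in Python with NumPy rather than R, and the image and the score vector are made up for illustration; they do not come from the MNIST data or from a trained model.

```python
import numpy as np

# A stand-in for one 28 x 28 grayscale image (all zeros except one pixel).
image = np.zeros((28, 28))
image[14, 14] = 255.0

# Flatten to a vector of 784 features, one per pixel.
x = image.reshape(784)

def one_hot(label, num_classes=10):
    """Return a vector with a 1 at index `label` and 0 elsewhere."""
    vec = np.zeros(num_classes)
    vec[label] = 1.0
    return vec

y = one_hot(7)

# A made-up output row of 10 class scores; the predicted class is the
# index of the largest score.
scores = np.array([0.01, 0.01, 0.02, 0.01, 0.0, 0.01, 0.0, 0.9, 0.02, 0.02])
predicted = int(np.argmax(scores))

print(x.shape)    # (784,)
print(y)          # [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
print(predicted)  # 7
```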
Hope this helps!!

y_train holds the labels and x_train holds the training data, which are the images in this example. You need some kind of plotting library to display the images in x. In this example you are probably not expected to input your own drawings, but if you want to, you would need to preprocess them in the same way as the MNIST images and then pass them to the model.
