I am currently running a logistic regression model using keras.
I have 1 numeric variable and around 6 categorical variables.
I am currently using a ColumnTransformer for training and testing the model, and it works perfectly (code shown below):
numeric_variables = ["var1"]
cat_variables = ["var2","var3","var4","var5","var6","var7"]
pipeline = ColumnTransformer(
    [('num', StandardScaler(), numeric_variables),
     ('cat', OneHotEncoder(handle_unknown="ignore"), cat_variables)],
    remainder="passthrough")
pipeline.fit(X_Train)
pipeline.fit_transform(X_Train)
This works perfectly when I run the train and test dataset.
However, when I deploy the model to get the probability of a customer renewing, I am sending the data as a dataframe with one row.
While fit_transform on X_Train and X_Test gives an n x 17 array (because of the one-hot encoding of the 7 factors), the transform of the prediction data only gives n x 7.
My theory is that the pipeline is dropping one-hot encoded fields. For instance, if var2 can take 3 values (say "M", "F" and "O"), X_Train gives 3 columns (isM, isF and isO), while the transform for the prediction data only outputs isM if the value of var2 is "M".
How do I address this issue?
I get this error when I run the model.predict on the single customer example:
Input 0 of layer "sequential" is incompatible with the layer: expected shape=(None, 19), found shape=(None, 7)
After the discussion in the comments:
It appears that you are using pipeline.fit_transform(X_test). This means you are fitting your pipeline with X_test before transforming it. This is a problem in your case for two reasons:
You are re-fitting the StandardScaler, which means you will scale your features differently than what you did with the train set.
You are re-fitting the OneHotEncoder. Hence, you could miss some categories in cat_variables that were present only in the train set. Consequently, your output shape is smaller.
Simply use .transform(X_test) instead.
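For illustration, here is a minimal sketch of that workflow (the names single_customer_df and model are assumed, not taken from the question):
# Fit the preprocessing pipeline on the training data only.
X_train_enc = pipeline.fit_transform(X_Train)
# Re-use the already fitted pipeline for the test set and for the
# single-row dataframe sent at prediction time.
X_test_enc = pipeline.transform(X_Test)  # same width as the training output
single_customer_enc = pipeline.transform(single_customer_df)
probability = model.predict(single_customer_enc)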
I have a dataset with many features, many of which are categorical. I want to apply an embedding layer to transform the categorical data into numerical data for use with other models. But I got an error during training.
Now, my training process is:
Apply a label encoder to the categorical features
Split training and testing data with the train_test_split() function
Drop the numerical columns. Only send the categorical features and the target y for model training.
And I got this error:
indices[13,0] = 10 is not in [0, 10)
[[node functional_1/embed_6/embedding_lookup (defined at <ipython-input-34-0b6b3ae455d0>:4) ]] [Op:__inference_train_function_3509]
Errors may have originated from an input operation.
Input Source operations connected to node functional_1/embed_6/embedding_lookup:
functional_1/embed_6/embedding_lookup/2395 (defined at /usr/lib/python3.6/contextlib.py:81)
Function call stack:
train_function
After searching, some people say the problem is that the vocabulary_size parameter of the embedding layer is wrong, and that enlarging the vocabulary_size can solve this problem.
But in my case, I need to map the result back to original label.
For example, I have a categorical feature ['dog', 'cat', 'fish']. After label encoding it becomes [0, 1, 2]. An embedding layer for this feature with 3 unique values should output something like
([-0.22748041], [-0.03832678], [-0.16490786]).
Then I can replace the 'dog' values in the original data with -0.22748041, replace 'cat' with -0.03832678, and so on.
So, I can't change the vocabulary_size or the output dimension will be wrong.
I guess the problem in my case is that not all of the categorical values go into the training process.
(E.g. only ['dog', 'fish'] are in the training data; ['cat'] only appears in the testing data.) If I set the vocabulary_size to 3, it reports an error like the one above. If I experimentally add ['cat'] to the training data, it works fine.
My question is: does the embedding layer have to see all of the unique values during training to perform the application I want? If there is a lot of categorical data with many unique values, how can I ensure that all the unique values appear in the training data when splitting?
Thanks in advance!
Solution
You need to use out-of-vocabulary buckets when creating the lookup table.
OOV buckets allow unknown categories to be looked up if they are encountered during testing.
What does the solution do?
Setting num_oov_buckets to a suitable number (like 1000) will also give you IDs for categories that were not present in the training vocabulary.
words = tf.constant(vocabulary)
word_ids = tf.range(len(vocabulary), dtype=tf.int64)
# important
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets) # lookup table for category -> id
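As a quick sanity check (the word "cat" below is just an assumed example of an unseen category), looking up an unknown key returns an ID inside one of the OOV buckets instead of raising an error:
unknown = tf.constant(["cat"])      # assumed example: "cat" is not in the vocabulary
unknown_id = table.lookup(unknown)  # falls in [len(vocabulary), len(vocabulary) + num_oov_buckets)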
Then you can encode the training set (I am using the IMDb reviews dataset from TensorFlow Datasets):
def encode_words(X_batch, y_batch):
    """
    Encode the training set, converting words to IDs
    using the lookup table just created.
    """
    return table.lookup(X_batch), y_batch
train_set = datasets["train"].batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)
When creating the model:
vocab_size = 10000       # the length of your vocabulary
embedding_size = 128     # tweakable | hyperparameter
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embedding_size,
                           input_shape=[None]),
    # usual code follows
])
and fit the data
model.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
history = model.fit(train_set, epochs=5)
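If you then need to map the learned vectors back to the original categories (as described in the question), one possible sketch, assuming the model and lookup table defined above, is to read the embedding matrix out of the first layer and index it with the same table:
# The embedding weights have shape (vocab_size + num_oov_buckets, embedding_size);
# row i is the learned vector for the category whose lookup-table id is i.
embedding_matrix = model.layers[0].get_weights()[0]
category_id = table.lookup(tf.constant(["dog"]))  # "dog" is an assumed example key
category_vector = embedding_matrix[int(category_id.numpy()[0])]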
I am wondering how to predict future time series data after model training. I would like to get the values N steps ahead, and I want to know whether the time series has been properly learned and predicted. How do I do this correctly to get the following (next) values? I want to get the next values using model.predict or similar.
I have x_test and x_test[-1] == t. So by the next values I mean t+1, t+2, ..., t+n; in this example I want to get t+1, t+2, ..., t+n.
First
I tried using stock index data
inputs = total_data[len(total_data) - forecast - look_back:]
inputs = scaler.transform(inputs)
X_test = []
for i in range(look_back, inputs.shape[0]):
    X_test.append(inputs[i - look_back:i])
X_test = np.array(X_test)
predicted = model.predict(X_test)
but the result is like below:
The results from X_test[-20:] and the following 20 predictions look the same. I'm wondering whether this is the correct way to train and predict values, and whether the result is correct.
full source
The method I tried first did not work correctly.
Second
I realized something was wrong, so I tried other, official data: I used the time series from the TensorFlow tutorial to practice training the model.
a = y_val[-look_back:]
for i in range(n_steps):  # predict a new value n times (N-step prediction)
    tmp = model.predict(a.reshape(-1, look_back, num_feature))  # predicted value
    a = a[1:]              # remove first
    a = np.append(a, tmp)  # insert predicted value
The predictions came out in an almost linear shape, very different from the real data.
The output is an abnormal, nearly linear curve that is independent of the real data:
full source (After the 25th line is my code.)
I'm really very curious how I can predict the following values of a time series using the TensorFlow predict method.
I'm not wondering if this works or not theoretically. I'm just wondering how to get the following n steps using the predict method.
Thank you for reading the long question. I seek advice about your priceless opinion.
"lstm" is normaly used to predict 3D data => target
which input has same time frame number(n , t ,f)
"n" for data number "t" for frame number , "f" for feature number
what you want to predict is 1D data => target
which is (t ,f )
when you have f = 0 then you can only use F(t) => y
if you can estimate function F then u can get y . NN can't help here
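For illustration of that (n, t, f) shape convention (the numbers below are assumed examples, not values from the question), a Keras LSTM expects its input batched as (samples, timesteps, features):
import numpy as np
from tensorflow import keras

n, t, f = 32, 10, 3  # assumed example: 32 samples, 10 timesteps, 3 features
x = np.random.rand(n, t, f).astype("float32")
y = np.random.rand(n, 1).astype("float32")

model = keras.models.Sequential([
    keras.layers.LSTM(16, input_shape=(t, f)),  # input_shape excludes the sample dimension
    keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer="adam")
model.fit(x, y, epochs=1, verbose=0)

# A single series of shape (t, f) must be reshaped to (1, t, f) before calling predict.
next_value = model.predict(x[-1].reshape(1, t, f))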
This might look like a trivial problem. But I am getting stuck in predicting results from a model. My problem is like this:
I have a dataset of shape 1000 x 19 (except target feature) but after one hot encoding it becomes 1000 x 141.
Since I trained the model on the data which is of shape 1000 x 141, so I need data of shape 1 x 141 (at least) for prediction.
I also know in python, I can make future prediction using
model.predict(data)
But since I am getting the data from an end user through a web portal, it has shape 1 x 19, and I am very confused about how to proceed to make predictions based on the user data.
How can I convert data of shape 1 x 19 into 1 x 141, while maintaining the same column order as the train/test data?
Any help in this direction would be highly appreciated.
I am assuming that to create the one-hot encoding you are using sklearn's OneHotEncoder. If you are using that, then the problem is easily solved, since you fit the one-hot encoder on your training data:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(categories = "auto", handle_unknown = "ignore")
X_train_encoded = encoder.fit_transform(X_train)
In the code above, your encoder is fitted on your training data, so when you get the test data you can transform it into the same encoded form using this fitted encoder:
test_data = encoder.transform(test_data)
Now your test data will also be of 1x141 shape. You can check shape using
(pd.DataFrame(test_data.toarray())).shape
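Since the user data arrives one row at a time through the web portal, a common pattern (sketched below; joblib is an assumed choice for persistence and user_input_dict is a hypothetical name) is to save the fitted encoder next to the model and re-use it at prediction time:
import joblib
import pandas as pd

# At training time: persist the fitted encoder together with the model.
joblib.dump(encoder, "encoder.joblib")

# At prediction time: rebuild a 1 x 19 dataframe with the same column order as X_train,
# then transform it with the saved encoder to get a 1 x 141 input for the model.
encoder = joblib.load("encoder.joblib")
user_row = pd.DataFrame([user_input_dict], columns=X_train.columns)
user_encoded = encoder.transform(user_row).toarray()  # densify the sparse matrix returned by OneHotEncoder
prediction = model.predict(user_encoded)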
I have scraped a lot of ebay titles like this one:
Apple iPhone 5 White 16GB Dual-Core
and I have manually tagged all of them in this way
B M C S NA
where B=Brand (Apple), M=Model (iPhone 5), C=Color (White), S=Size (16GB), NA=Not Assigned (Dual-Core)
Now I need to train a SVM classifier using the libsvm library in python to learn the sequence patterns that occur in the ebay titles.
I need to extract new values for those attributes (Brand, Model, Color, Size) by treating the problem as a classification one. In this way I can predict new models.
I want to consider these features:
* Position
- from the beginning of the title
- to the end of the listing
* Orthographic features
- current word contains a digit
- current word is capitalized
....
I can't understand how I can give all this info to the library. The official doc lacks a lot of information.
My classes are Brand, Model, Size, Color, NA.
What must the input file of the SVM algorithm contain?
How can I create it? Could I have an example of that file, considering the 4 features that I gave as examples in my question? Can I also have an example of the code that I should use to build the input file?
* UPDATE *
I want to represent these features... How should I do it?
Identity of the current word
I think that I can interpret it in this way
0 --> Brand
1 --> Model
2 --> Color
3 --> Size
4 --> NA
If I know that the word is a Brand I will set that variable to 1 (true).
It is OK to do this for the training set (because I have tagged all the words), but how can I do it for the test set? I don't know what the category of a word is (that is exactly what I'm trying to learn :D).
N-gram substring features of current word (N=4,5,6)
No idea, what does it mean?
Identity of 2 words before the current word.
How can I model this feature?
Considering the legend that I created for the 1st feature, I have 5^2 = 25 combinations:
00 10 20 30 40
01 11 21 31 41
02 12 22 32 42
03 13 23 33 43
04 14 24 34 44
How can I convert it to a format that the libsvm (or scikit-learn) can understand?
Membership to the 4 dictionaries of attributes
Again how can I do it?
Having 4 dictionaries (for color, size, model and brand), I think that I must create a bool variable that I will set to true if and only if the current word matches one of the 4 dictionaries.
Exclusive membership to dictionary of brand names
I think that, like for the 4th feature, I must use a bool variable. Do you agree?
Here's a step-by-step guide for how to train an SVM using your data and then evaluate using the same dataset. It's also available at http://nbviewer.ipython.org/gist/anonymous/2cf3b993aab10bf26d5f. At the URL you can also see the output of the intermediate data and the resulting accuracy (it's an IPython notebook).
Step 0: Install dependencies
You need to install the following libraries:
pandas
scikit-learn
From command line:
pip install pandas
pip install scikit-learn
Step 1: Load the data
We will use pandas to load our data.
pandas is a library for easily loading data. For illustration, we first save
sample data to a csv and then load it.
We will train the SVM with train.csv and get test labels with test.csv
import pandas as pd
train_data_contents = """
class_label,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
N,12,0,0,1"""
with open('train.csv', 'w') as output:
    output.write(train_data_contents)
train_dataframe = pd.read_csv('train.csv')
Step 2: Process the data
We will convert our dataframe into numpy arrays which is a format that scikit-
learn understands.
We need to convert the labels "B", "M", "C",... to numbers also because svm does
not understand strings.
Then we will train a linear svm with the data
import numpy as np
train_labels = train_dataframe.class_label
labels = list(set(train_labels))
train_labels = np.array([labels.index(x) for x in train_labels])
train_features = train_dataframe.iloc[:,1:]
train_features = np.array(train_features)
print "train labels: "
print train_labels
print
print "train features:"
print train_features
We see here that the length of train_labels (5) exactly matches how many rows
we have in train_features. Each item in train_labels corresponds to a row.
Step 3: Train the SVM
from sklearn import svm
classifier = svm.SVC()
classifier.fit(train_features, train_labels)
Step 4: Evaluate the SVM on some testing data
test_data_contents = """
class_label,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
N,12,0,0,1
"""
with open('test.csv', 'w') as output:
    output.write(test_data_contents)
test_dataframe = pd.read_csv('test.csv')
test_labels = test_dataframe.class_label
labels = list(set(test_labels))
test_labels = np.array([labels.index(x) for x in test_labels])
test_features = test_dataframe.iloc[:,1:]
test_features = np.array(test_features)
results = classifier.predict(test_features)
num_correct = (results == test_labels).sum()
recall = num_correct / len(test_labels)
print "model accuracy (%): ", recall * 100, "%"
Links & Tips
Example code for how to load LinearSVC: http://scikit-learn.org/stable/modules/svm.html#svm
Long list of scikit-learn examples: http://scikit-learn.org/stable/auto_examples/index.html. I've found these mildly helpful but often confusing myself.
If you find that the SVM is taking a long time to train, try LinearSVC instead: http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
Here's another tutorial on getting familiar with machine learning models: http://scikit-learn.org/stable/tutorial/basic/tutorial.html
You should be able to take this code and replace train.csv with your training data, test.csv with your testing data, and get predictions for your test data, along with accuracy results.
Note that since you're evaluating using the data you trained on the accuracy will be unusually high.
I echo the comment of @MarcoPashkov but will try to elaborate on the LibSVM file format. I find the documentation comprehensive yet hard to find; for the Python lib I recommend the README on GitHub.
An important piece to recognize is that there is a Sparse format, where all features which are 0 are removed, and a Dense format, where features which are 0 are not removed. These are two equivalent examples, each taken from the README.
# Dense data
>>> y, x = [1,-1], [[1,0,1], [-1,0,-1]]
# Sparse data
>>> y, x = [1,-1], [{1:1, 3:1}, {1:-1,3:-1}]
The y variable stores a list of all the categories for the data.
The x variable stores the feature vector.
assert len(y) == len(x), "Both lists should be the same length"
The format found in the Heart Scale Example is a Sparse format where the dictionary key is the feature index and the dictionary value is the feature value while the first value is the category.
The Sparse format is incredibly useful while using a Bag of Words Representation for your feature vector.
As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).
For instance a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.
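As a tiny illustration of why the Sparse format fits bag-of-words data (the vocabulary and counts below are assumed examples, not from your data), only the non-zero word counts need to be stored:
# Assumed example: feature indices start at 1 in LibSVM, and the vocabulary maps
# "apple" -> 1, "iphone" -> 2, "white" -> 3, ... out of ~100,000 words.
# The title "Apple iPhone 5 White" would then be represented sparsely as:
y = [1]                   # category index of the sample
x = [{1: 1, 2: 1, 3: 1}]  # only the three non-zero word counts are stored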
For an example using the feature vector you started with, I trained a basic LibSVM 3.20 model. This code isn't meant to be used but may help in showing how to create and test a model.
from collections import namedtuple
# Using namedtuples for descriptive purposes, in actual code a normal tuple would work fine.
Category = namedtuple("Category", ["index", "name"])
Feature = namedtuple("Feature", ["category_index", "distance_from_beginning", "distance_from_end", "contains_digit", "capitalized"])
# Separate up the set of categories, libsvm requires a numerical index so we associate each with an index.
categories = dict()
for index, name in enumerate("B M C S NA".split(' ')):
    # LibSVM expects index to start at 1, not 0.
    categories[name] = Category(index + 1, name)
categories
Out[0]: {'B': Category(index=1, name='B'),
'C': Category(index=3, name='C'),
'M': Category(index=2, name='M'),
'NA': Category(index=5, name='NA'),
'S': Category(index=4, name='S')}
# Faked set of CSV input for example purposes.
csv_input_lines = """category_index,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
NA,12,0,0,1""".split("\n")
# We just ignore the header.
header = csv_input_lines[0]
# A list of Feature namedtuples, this will be trained as lists.
features = list()
for line in csv_input_lines[1:]:
    split_values = line.split(',')
    # Create a Feature with the values converted to integers.
    features.append(Feature(categories[split_values[0]].index, *map(int, split_values[1:])))
features
Out[1]: [Feature(category_index=1, distance_from_beginning=1, distance_from_end=10, contains_digit=1, capitalized=0),
Feature(category_index=2, distance_from_beginning=10, distance_from_end=1, contains_digit=0, capitalized=1),
Feature(category_index=3, distance_from_beginning=2, distance_from_end=3, contains_digit=0, capitalized=1),
Feature(category_index=4, distance_from_beginning=23, distance_from_end=2, contains_digit=0, capitalized=0),
Feature(category_index=5, distance_from_beginning=12, distance_from_end=0, contains_digit=0, capitalized=1)]
# Y is the category index used in training for each Feature. It is a list (order important) of all the trained indexes.
y = [f.category_index for f in features]
# X is the feature vector; for this we take all of the named tuple's values except the category, which is at index 0.
x = [list(f)[1:] for f in features]
from svmutil import svm_parameter, svm_problem, svm_train, svm_predict
# Barebones defaults for SVM
param = svm_parameter()
# The (Y,X) parameters should be the train dataset.
prob = svm_problem(y, x)
model = svm_train(prob, param)
# For actual accuracy checking, the (Y,X) parameters should be the test dataset.
p_labels, p_acc, p_vals = svm_predict(y, x, model)
Out[3]: Accuracy = 100% (5/5) (classification)
I hope this example helps; it shouldn't be used for your actual training. It is meant as an example only, because it is inefficient.