I am making a classifier based on a CNN model in Keras.
I will use it in an application where the user enters input text, and the model is loaded from saved weights to make predictions.
The model uses GloVe embeddings and takes padded text sequences as input.
I used the Keras tokenizer as follows:
tokenizer = text.Tokenizer(num_words=max_features, lower=True, char_level=False)
tokenizer.fit_on_texts(list(train_x))
train_x = tokenizer.texts_to_sequences(train_x)
test_x = tokenizer.texts_to_sequences(test_x)
train_x = sequence.pad_sequences(train_x, maxlen=maxlen)
test_x = sequence.pad_sequences(test_x, maxlen=maxlen)
I trained the model and predicted on the test data. Now I want to do the same with the saved model, which I have loaded successfully.
My problem is that when I provide a single review, passing it through tokenizer.texts_to_sequences() and padding gives a 2D array of shape (num_chars, maxlen), which leads to num_chars predictions. I need an input of shape (1, maxlen) instead.
I am using the following code for prediction:
review = 'well free phone cingular broke stuck not abl offer kind deal number year contract up realli want razr so went look cheapest one could find so went came euro charger small adpat made fit american outlet, gillett fusion power replac cartridg number count packagemay not greatest valu out have agillett fusion power razor'
xtest = tokenizer.texts_to_sequences(review)
xtest = sequence.pad_sequences(xtest, maxlen=maxlen)
model.predict(xtest)
Output is:
array([[0.29289 , 0.36136267, 0.6205081 ],
[0.362869 , 0.31441122, 0.539749 ],
[0.32059124, 0.3231736 , 0.5552745 ],
...,
[0.34428033, 0.3363668 , 0.57663095],
[0.43134686, 0.33979046, 0.48991954],
[0.22115968, 0.27314988, 0.6188136 ]], dtype=float32)
I need a single prediction here, such as array([0.29289 , 0.36136267, 0.6205081 ]), since I have a single review.
The problem is that you need to pass a list of strings to the texts_to_sequences method. So you need to put the single review in a list like this:
xtest = tokenizer.texts_to_sequences([review])
If you don't do that (i.e. you pass a string rather than a list of strings), then because strings in Python are iterable, the method iterates over the characters of the given string and treats each character, rather than the whole review, as a separate text:
oov_token_index = self.word_index.get(self.oov_token)
for text in texts:  # <-- it would iterate over the string instead
    if self.char_level or isinstance(text, list):
That's why you get an array of shape (num_chars, maxlen) once the return value of the texts_to_sequences method has been padded.
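To see the difference without loading a model, here is a minimal pure-Python sketch. Note that `toy_texts_to_sequences` and `word_index` are hypothetical stand-ins, not the Keras implementation; they only mimic the fact that the method loops over whatever iterable it receives.

```python
# Hypothetical word index, standing in for tokenizer.word_index.
word_index = {"free": 1, "phone": 2, "broke": 3}

def toy_texts_to_sequences(texts):
    """Toy stand-in: returns one output sequence per element of `texts`."""
    sequences = []
    for text in texts:  # a bare string yields one character per iteration
        tokens = text.lower().split()
        sequences.append([word_index[t] for t in tokens if t in word_index])
    return sequences

review = "free phone broke"

as_string = toy_texts_to_sequences(review)    # wrong: iterates over characters
as_list = toy_texts_to_sequences([review])    # right: a batch of one review

print(len(as_string))  # 16, one (empty) sequence per character
print(as_list)         # [[1, 2, 3]]
```

The number of rows in the result is what later becomes the number of predictions, so wrapping the review in a list is what gives a single row.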
I have created an NLP classification model with Keras without problems, and the model shows 83.5% accuracy on evaluation. However, when I use the model to predict on a new tokenized sentence, it returns x arrays, where x is the number of tokens in the sentence I passed in.
Here is the code example:
toPredict = np.array([1,2])
prediction = self.model.predict(toPredict)
print(prediction)
The values 1 and 2 are obviously just token values, but this returns the following output:
[[0.24091144 0.20921658 0.3415633 0.20830865]
[0.20159791 0.46421158 0.19968869 0.13450184]]
I may be missing something, but I thought the output would be a single array classifying the whole tokenized sentence, not one array per word. Am I feeding the model badly formatted input? Please help!
To predict, you should feed the model input of the same shape as the training data. The sequence must therefore be 2-dimensional, with shape (1, maxlen), and padded to the same length you used during training. You could call tf.expand_dims(toPredict, 0) to add the batch dimension and then feed the result into the model.
For instance, here I define a function for prediction:
from tensorflow import keras

def predict_text(input_text, tokenizer, model, maxlen_seq,
                 padding='post', truncating='post'):
    # Wrap the text in a list so the tokenizer treats it as one sample
    text = str(input_text)
    sequence = tokenizer.texts_to_sequences([text])
    # Pad/truncate to the sequence length used in training
    # (padding and truncating should also match the training setup)
    sequence = keras.preprocessing.sequence.pad_sequences(
        sequence, maxlen=maxlen_seq, padding=padding, truncating=truncating)
    predict = model.predict(sequence)
    return predict
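The shape requirement can be checked without a model. Below is a small pure-Python sketch; `pad_to_batch` is a hypothetical helper that imitates the default 'pre' padding and truncating of Keras pad_sequences and adds the batch axis. It is an illustration under those assumptions, not the library code.

```python
def pad_to_batch(token_ids, maxlen, pad_value=0):
    """Return a batch of one sequence, left-padded or left-truncated to maxlen."""
    trimmed = token_ids[-maxlen:]                    # keep the last maxlen tokens
    padding = [pad_value] * (maxlen - len(trimmed))  # zeros go in front ('pre')
    return [padding + trimmed]                       # outer list is the batch axis

to_predict = [1, 2]
batch = pad_to_batch(to_predict, maxlen=5)

print(batch)                       # [[0, 0, 0, 1, 2]]
print(len(batch), len(batch[0]))   # 1 5
```

With one row in the batch, the model returns one prediction for the whole sentence instead of one per token.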
I have a dataset with many features, a lot of them categorical. I want to use an embedding layer to convert the categorical data to numerical data for use in other models. However, I got an error during training.
Now, my training process is:
Perform label encoding on the categorical features.
Split the training and testing data with the train_test_split() function.
Drop the numerical columns, and send only the categorical features and the target y for model training.
And I got this error:
indices[13,0] = 10 is not in [0, 10)
[[node functional_1/embed_6/embedding_lookup (defined at <ipython-input-34-0b6b3ae455d0>:4) ]] [Op:__inference_train_function_3509]
Errors may have originated from an input operation.
Input Source operations connected to node functional_1/embed_6/embedding_lookup:
functional_1/embed_6/embedding_lookup/2395 (defined at /usr/lib/python3.6/contextlib.py:81)
Function call stack:
train_function
After searching, I found someone saying that the problem is a wrong vocabulary_size parameter in the embedding layer, and that enlarging vocabulary_size solves the problem.
But in my case, I need to map the result back to the original label.
For example, I have a categorical feature ['dog', 'cat', 'fish']. After label encoding, it becomes [0,1,2]. An embedding layer for this feature with 3 unique values should output something like
([-0.22748041], [-0.03832678], [-0.16490786]).
Then I can replace the ['dog'] value in the original data with -0.22748041, the ['cat'] value with -0.03832678, and so on.
So I can't change vocabulary_size, or the output dimension will be wrong.
I guess the problem in my case is that not all of the categorical values go into the training process.
(E.g. only ['dog', 'fish'] are in the training data, and ['cat'] appears only in the testing data.) If I set vocabulary_size to 3, it reports an error like the one above. If I experimentally add ['cat'] to the training data, it works fine.
My question is: does the embedding layer have to see all of the unique values during training for the application I want? If there are many categorical features with many unique values, how can I ensure that all the unique values appear in the testing data when splitting the data?
Thanks in advance!
Solution
You need to use out-of-vocabulary (OOV) buckets when creating the lookup table.
OOV buckets allow the lookup of unknown categories if they are found during testing.
What does the solution do?
Setting num_oov_buckets to a suitable number (like 1000) lets you get ids for categories that were not present in the vocabulary built from the training data.
import tensorflow as tf

# vocabulary: list of the known categories (strings), assumed to be defined already
words = tf.constant(vocabulary)
word_ids = tf.range(len(vocabulary), dtype=tf.int64)
# important
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)  # lookup table for category -> id
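As a rough illustration of what the OOV buckets do, here is a pure-Python sketch. The toy vocabulary, the small bucket count and the use of `zlib.crc32` are all assumptions made for readability; the real table uses its own hash function, so actual ids will differ.

```python
import zlib

vocabulary = ["dog", "fish"]          # hypothetical categories seen in training
num_oov_buckets = 4                   # small on purpose, for readability
word_to_id = {w: i for i, w in enumerate(vocabulary)}

def lookup(word):
    """Known words keep their own id; unknown words hash into an OOV bucket."""
    if word in word_to_id:
        return word_to_id[word]
    bucket = zlib.crc32(word.encode("utf-8")) % num_oov_buckets
    return len(vocabulary) + bucket   # OOV ids start right after the vocabulary

ids = [lookup(w) for w in ["dog", "fish", "cat"]]
embedding_rows = len(vocabulary) + num_oov_buckets

print(ids[:2])                                      # [0, 1]
print(len(vocabulary) <= ids[2] < embedding_rows)   # True
```

Every id, known or unknown, stays below `len(vocabulary) + num_oov_buckets`, which is why the embedding layer needs that many rows to avoid an out-of-range index.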
Then you can encode the training set (I am using the IMDb rating dataset from TensorFlow Datasets):
def encode_words(X_batch, y_batch):
    """
    Encode the training set converting words to IDs
    using the lookup table just created
    """
    return table.lookup(X_batch), y_batch

# datasets and preprocess are assumed to be defined earlier (not shown here)
train_set = datasets["train"].batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)
When creating the model:
vocab_size = 10000  # whatever the length of the variable vocabulary is
embedding_size = 128  # tweakable hyperparameter
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embedding_size,
                           input_shape=[None]),
    # usual code follows
])
And fit the data:
model.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
history = model.fit(train_set, epochs=5)
Given the following setup of a dataset:
tweet, number of retweets, genre
I want to build a softmax classifier that predicts the tweet genre(s).
I am struggling to find a way to assign sample importance in Keras WITHOUT repeating the data (tweets).
For example: tweet #1 is retweeted 1000 times for genres 1 and 3. Tweet #2 is retweeted 100 times for genres 1 and 4. How can I incorporate the importance of tweet #1 to genres 1 and 3 without repeating the tweet itself 1000 times in the training data?
model = tf.keras.Sequential()
embedding_layer = tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)
model.add(embedding_layer)
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(LSTM_SIZE)))
model.add(tf.keras.layers.Dense(len(GENRES_LIST)+1,activation=tf.keras.activations.softmax))
m = tf.keras.metrics.SparseTopKCategoricalAccuracy(k=1)
opt = tf.keras.optimizers.Adam(learning_rate=0.01)
model.compile(optimizer=opt,
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=[m])
model.fit(train_data, epochs=50, validation_data=test_data, verbose=1)
If you use .fit(), I think duplicating your data is the only way to achieve what you want.
Alternatively, you can consider writing your own batcher with .train_on_batch() and .test_on_batch(). That way, you can control what you feed to the model.
A simple call to numpy.random.choice() with the parameter p (the per-sample probabilities) should do what you want inside your batcher.
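A minimal sketch of such a batcher's sampling step follows. It uses the standard-library `random.choices` with `weights` as a stand-in for numpy.random.choice with `p` (numpy expects `p` to be normalized to sum to 1, while `random.choices` accepts raw weights). The `sample_batch` helper and the toy data are hypothetical.

```python
import random

tweets = ["tweet #1", "tweet #2"]
retweets = [1000, 100]

def sample_batch(items, weights, batch_size, rng):
    """Draw a batch with replacement, proportionally to `weights`."""
    return rng.choices(items, weights=weights, k=batch_size)

rng = random.Random(0)            # seeded only to make the sketch repeatable
batch = sample_batch(tweets, retweets, batch_size=1100, rng=rng)
share = batch.count("tweet #1") / len(batch)

print(round(share, 2))            # close to 1000 / 1100
```

Each sampled batch could then be handed to .train_on_batch(), so frequently retweeted tweets are seen more often without being stored more than once.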
Sources:
https://keras.io/api/models/model_training_apis/
https://docs.scipy.org/doc//numpy-1.10.4/reference/generated/numpy.random.choice.html
You can simply pass a sample_weight array to the .fit function. Let's assume you have 2 sample tweets and each has its number of retweets (as you described). All you need is to create a 1-D array with one weight per sample (a 1:1 mapping). You can either use the integer values directly or normalize the weights to values between 0 and 1.
Your data will look like this:
X_train = [tweet1, tweet2]
y_train = [[1,0,1,0], [1,0,0,1]]  # multi-hot encoding (genres 1 and 3; genres 1 and 4)
my_sample_weight = np.array([1000, 100])  # one weight per sample
# sample_weight is passed alongside array inputs; with a dataset input, the weights go in as a third element of each batch instead
model.fit(X_train, y_train, epochs=50, validation_data=test_data, sample_weight=my_sample_weight, verbose=1)
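To see why a weight can replace physical repetition, here is a small pure-Python sketch with made-up predicted probabilities. It compares a weighted average of per-sample cross-entropy losses with the plain average over repeated samples. Keras's exact normalization of the weighted loss may differ; the point is only the relative influence of each sample.

```python
import math

def weighted_mean_loss(probs_of_true_class, weights):
    """Weighted average of per-sample cross-entropy losses."""
    losses = [-math.log(p) for p in probs_of_true_class]
    total = sum(w * l for w, l in zip(weights, losses))
    return total / sum(weights)

# Hypothetical predicted probabilities of the correct class for two tweets.
p = [0.9, 0.4]

weighted = weighted_mean_loss(p, [10, 1])
# Same data with the first tweet physically repeated 10 times.
repeated = weighted_mean_loss([0.9] * 10 + [0.4], [1] * 11)

print(math.isclose(weighted, repeated))  # True
```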
I am using the Python BERT models: https://github.com/google-research/bert
My goal is to build a binary classification model to predict if a news headline is relevant to a specific category. I have a training set of data which has news headline sentences as well as binary values to indicate if the headline is valid or invalid.
I tried to run the run_classifier.py script, and the results I obtained do not seem to make sense. The test results file has two columns, with the same two numbers repeated on each row.
Also, in the model parameters I have task_name set to cola. After reading the academic paper for BERT (https://arxiv.org/pdf/1810.04805.pdf), I feel this is not an appropriate task name. The paper lists several other tasks on pages 14 and 15, but none of them seems appropriate for the binary categorization of sentences based on content.
How can I properly use BERT to classify sentences? I tried using this guide.
But it did not yield the results I had expected.
For a binary classification task (I assume you have used the cola processor), BERT's predictions on the test set go to the test_results.tsv file.
To interpret test_results.tsv, you must know its structure.
The file contains as many rows as there are inputs in the test set, and as many columns as there are labels. (Since your task is binary classification, there will be two columns: one for label 0 and one for label 1.)
The value in each column is the softmax value indicating the probability of the given class (or label), so the values across all columns of a given row must sum to 1.
If you look closely at your case, 0.9999991 and 9.12E-6 (9.12*10^(-6)) are not the same number. If you sum them, you get ~1. (This also means the test input is predicted to belong to the class indicated by label 0.)
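A short pure-Python sketch of reading one such row is below. The row values are the two numbers quoted above; the `softmax` helper and the logits passed to it are hypothetical and only illustrate why each row sums to 1.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

row = [0.9999991, 9.12e-6]               # column for label 0, column for label 1
predicted_label = row.index(max(row))    # index of the most probable class

print(predicted_label)                                # 0
print(math.isclose(sum(row), 1.0, abs_tol=1e-4))      # True
print(math.isclose(sum(softmax([7.0, -4.6])), 1.0))   # True
```

Applying the same argmax to every row of test_results.tsv turns the probabilities into predicted labels.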
How can I properly use BERT to classify sentences?
Take a look at this complete working code for sentence classification using IMDB sentiment analysis (binary text classification on Google Colab using a GPU).
Basically, you can use TensorFlow and keras-bert to do that. The steps involved are:
Load and transform your custom data.
Load the pre-trained model and define the network for fine-tuning.
Train/fine-tune the model using the custom data.
Classify using the trained model.
Here is a brief snippet to help.
# config_path, checkpoint_path, SEQ_LEN, LR, EPOCHS, BATCH_SIZE, tokenizer,
# id_to_labels and the train/test arrays are assumed to be defined earlier.
model = load_trained_model_from_checkpoint(
    config_path,
    checkpoint_path,
    training=True,
    trainable=True,
    seq_len=SEQ_LEN,
)
# Put a 2-class softmax head on top of the pre-trained model
inputs = model.inputs[:2]
dense = model.get_layer('NSP-Dense').output
outputs = keras.layers.Dense(units=2, activation='softmax')(dense)
model = keras.models.Model(inputs, outputs)
model.compile(
    RAdam(lr=LR),
    loss='sparse_categorical_crossentropy',
    metrics=['sparse_categorical_accuracy'],
)
history = model.fit(
    train_x,
    train_y,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_split=0.20,
    shuffle=True,
)
predicts = model.predict(test_x, verbose=True).argmax(axis=-1)
texts = [
    "It's a must watch",
    "Can't wait for it's next part!",
    'It fell short of expectations.',
]
for text in texts:
    ids, segments = tokenizer.encode(text, max_len=SEQ_LEN)
    inpu = np.array(ids).reshape([1, SEQ_LEN])  # batch of one sentence
    predicted_id = model.predict([inpu, np.zeros_like(inpu)]).argmax(axis=-1)[0]
    print("%s: %s" % (id_to_labels[predicted_id], text))
Output:
positive: It's a must watch
positive: Can't wait for it's next part!
negative: It fell short of expectations.
Hope that helps.
I'm trying to train a basic text classification NN using Keras. I downloaded 12,500 positive and 12,500 negative movie reviews from a website. However, I'm having trouble processing the data into something Keras can use.
First, I open the 25000 text files and store each file into an array. I then run each array (one positive and one negative) through this function:
from keras.preprocessing.text import hashing_trick, text_to_word_sequence

def process_for_model(textArray):
    '''
    Given a 2D array of the form:
    [[fileLines1],[fileLines2]...[fileLinesN]]
    converts the text into integers
    '''
    result = []
    for file_ in textArray:
        inner = []
        for line in file_:
            # Size the hash space from the number of unique words in the line
            length = len(set(text_to_word_sequence(line)))
            inner.append(hashing_trick(line, round(length*1.3), hash_function='md5'))
        result.append(inner)
    return result
The purpose is to convert the words into numbers, to get them closer to something a Keras model can use.
I then append the converted numbers into a single array, along with appending a 0 or 1 to another array as labels:
training_labels = []
train_batches = []
for i in range(len(positive_encoded)):
    train_batches.append(positive_encoded[i])
    training_labels.append([0])
for i in range(len(negative_encoded)):
    train_batches.append(negative_encoded[i])
    training_labels.append([1])
And finally I convert each array to a np array:
train_batches = array(train_batches)
training_labels = array(training_labels)
However, I'm not really sure where to go from here. Each review is, I believe, 168 words. I don't know how to create an appropriate model for this data or how to properly scale all the numbers to be between 0 and 1 using sklearn.
The things I am most confused on are: how many layers should I have, how many neurons each layer should have, and how many input dimensions should I have for the first layer.
Should I be taking another approach entirely?
Here is quite a good tutorial with Keras and this dataset: https://machinelearningmastery.com/predict-sentiment-movie-reviews-using-deep-learning/
You can also use the official Keras tutorial for text classification.
It basically downloads 50k reviews from the IMDB set, equally balanced (half positive, half negative). They split the data (randomly) into half for training and half for testing, and take 10k (40%) of the training examples as a validation set.
imdb = keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
The reviews are already in their word-dictionary representation (i.e. each review is an array of numbers). The full dictionary has 80k+ words, but they use only the 10k most frequent ones; all other words in a review are mapped to a special unknown token ('<UNK>').
(In the tutorial they create a reversed word dictionary - for the sake of showing you the original reviews. But it's not important.)
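The idea of keeping only the most frequent words can be sketched in pure Python as follows. The toy corpus, the `UNK` id and the id numbering are assumptions for illustration; the ids produced by the real IMDB loader are offset and reserve some values, so they will not match.

```python
from collections import Counter

corpus = ["good movie", "bad movie", "good plot twist"]   # toy data
num_words = 2                      # keep only the 2 most frequent words
UNK = 0                            # hypothetical id shared by all other words

counts = Counter(w for doc in corpus for w in doc.split())
top = [w for w, _ in counts.most_common(num_words)]
word_index = {w: i + 1 for i, w in enumerate(top)}

# Words outside the top num_words all collapse onto the UNK id.
encoded = [[word_index.get(w, UNK) for w in doc.split()] for doc in corpus]
print(encoded)   # [[1, 2], [0, 2], [1, 0, 0]]
```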
Each review is limited to 256 words: they pre-process each review, truncating it if it is longer and padding it with 0 (the <PAD> token) if it is shorter. (Padding is done post, i.e. at the end.)
train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=256)
test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       maxlen=256)
Their NN architecture consists of 4 layers:
Input Embedding layer: takes a batch of reviews, each a vector of 256 integers in [0, 10,000), and learns a 16-dimensional vector to represent each word.
Global Average Pooling layer: averages over all the words (16-D representations) in a review, giving a single 16-dimensional vector that represents the whole review.
Fully connected dense layer of 16 nodes: the 'vanilla' NN layer. They chose a ReLU activation function.
An output layer of 1 node with a sigmoid activation function: gives a number from 0 to 1 that represents the confidence that the review is positive (values near 0 mean negative).
Here is the code for it:
vocab_size = 10000
model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))
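The pooling step is the least obvious of the four layers, so here is a tiny pure-Python sketch of what it computes for one review. The `global_average_pooling` function and the 2-D toy vectors are illustrative assumptions (the tutorial uses 16 dimensions and the Keras layer, which works on whole batches).

```python
def global_average_pooling(sequence_of_vectors):
    """Average word vectors position-wise into one vector per review."""
    n = len(sequence_of_vectors)
    dim = len(sequence_of_vectors[0])
    return [sum(vec[d] for vec in sequence_of_vectors) / n for d in range(dim)]

# A toy "review" of 3 words, each embedded in 2 dimensions.
review = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

print(global_average_pooling(review))   # [3.0, 4.0]
```

Whatever the review length, the output has a fixed size, which is what lets a plain dense layer follow.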
Then they fit the model and run it:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)
In summary, they chose to simplify what could have been a 10k-dimensional vector to only 16 dimensions and then run a one-dense-layer NN, with which they got pretty good results (87%).
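As a rough check on how small this model is, the parameter count implied by the four layers can be worked out by hand. The arithmetic below assumes the layer sizes given above and the usual one-bias-per-unit dense layers.

```python
vocab_size, embed_dim, hidden = 10000, 16, 16

embedding_params = vocab_size * embed_dim     # one 16-D vector per word
pooling_params = 0                            # averaging has no weights
dense_params = embed_dim * hidden + hidden    # weights + biases
output_params = hidden * 1 + 1                # weights + bias

total = embedding_params + pooling_params + dense_params + output_params
print(total)   # 160289
```

Almost all of the parameters sit in the embedding table, which is why the choice of vocabulary size and embedding dimension dominates the model size.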