I am using the Python BERT models: https://github.com/google-research/bert
My goal is to build a binary classification model to predict if a news headline is relevant to a specific category. I have a training set that contains news headline sentences along with binary values indicating whether each headline is valid or invalid.
I tried running the run_classifier.py script, but the results I obtained do not seem to make sense. The test results file has two columns, with the same two numbers repeated on every row:
Also, I have the task_name model parameter set to cola. After reading the BERT paper (https://arxiv.org/pdf/1810.04805.pdf), I feel this is not an appropriate task name. The paper lists several other tasks on pages 14 and 15, but none of them seem appropriate for binary categorization of sentences based on content.
How can I properly use BERT to classify sentences? I tried using this guide, but it did not yield the results I had expected.
For a binary classification task (I assume you used the CoLA processor), BERT's predictions on the test set go to the test_results.tsv file.
In order to interpret test_results.tsv, you must know its structure.
The file contains one row per input in the test set and one column per label. Since your task is binary classification, there are two columns: one for label 0 and one for label 1.
The value in each column is a softmax probability for the corresponding class (or label), so the values across the columns of a given row sum to 1.
In your case, 0.9999991 and 9.12E-6 (9.12 * 10^-6) are not the same number; they sum to approximately 1. This can be interpreted as the test input belonging to the class indicated by label 0.
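For example, here is a minimal sketch of turning test_results.tsv into predicted labels (the column names are made up; the file itself has no header):
import pandas as pd

# one row per test example, one softmax probability per label
results = pd.read_csv("test_results.tsv", sep="\t", header=None, names=["prob_0", "prob_1"])
results["predicted_label"] = results.values.argmax(axis=1)
print(results.head())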
How can I properly use BERT to classify sentences?
Take a look at this complete working example of sentence classification: IMDB sentiment analysis (binary text classification) on Google Colab using a GPU.
Basically, you can use TensorFlow and keras-bert to do that. The steps involved are:
1. Load and transform your custom data.
2. Load the pre-trained model and define the network for fine-tuning.
3. Train/fine-tune the model using the custom data.
4. Classify using the trained model.
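Here is a minimal sketch of step 1, building the tokenizer and turning raw texts into the (token ids, segment ids) pairs the model expects; the vocab path, SEQ_LEN value, and variable names are assumptions:
import numpy as np
from keras_bert import load_vocabulary, Tokenizer

SEQ_LEN = 128
vocab_path = 'uncased_L-12_H-768_A-12/vocab.txt'  # path to the pre-trained BERT vocab file

token_dict = load_vocabulary(vocab_path)
tokenizer = Tokenizer(token_dict)

def encode_texts(texts):
    """Convert a list of raw strings into the two input arrays BERT expects."""
    ids, segments = [], []
    for text in texts:
        token_ids, segment_ids = tokenizer.encode(text, max_len=SEQ_LEN)
        ids.append(token_ids)
        segments.append(segment_ids)
    return [np.array(ids), np.array(segments)]

train_x = encode_texts(raw_train_texts)  # raw_train_texts: your list of headlines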
Here is a brief snippet to help with the remaining steps.
import numpy as np
import keras
from keras_bert import load_trained_model_from_checkpoint
from keras_radam import RAdam

# Load the pre-trained BERT checkpoint and keep it trainable for fine-tuning
model = load_trained_model_from_checkpoint(
    config_path,
    checkpoint_path,
    training=True,
    trainable=True,
    seq_len=SEQ_LEN,
)

# Keep only the token-id and segment-id inputs, and attach a 2-class softmax head
inputs = model.inputs[:2]
dense = model.get_layer('NSP-Dense').output
outputs = keras.layers.Dense(units=2, activation='softmax')(dense)
model = keras.models.Model(inputs, outputs)

model.compile(
    RAdam(lr=LR),
    loss='sparse_categorical_crossentropy',
    metrics=['sparse_categorical_accuracy'],
)

# Fine-tune on the custom data
history = model.fit(
    train_x,
    train_y,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_split=0.20,
    shuffle=True,
)

# Predict on the test set
predicts = model.predict(test_x, verbose=True).argmax(axis=-1)

# Classify a few new sentences
texts = [
    "It's a must watch",
    "Can't wait for it's next part!",
    'It fell short of expectations.',
]

for text in texts:
    ids, segments = tokenizer.encode(text, max_len=SEQ_LEN)
    inpu = np.array(ids).reshape([1, SEQ_LEN])
    predicted_id = model.predict([inpu, np.zeros_like(inpu)]).argmax(axis=-1)[0]
    print("%s: %s" % (id_to_labels[predicted_id], text))
Output:
positive: It's a must watch
positive: Can't wait for it's next part!
negative: It fell short of expectations.
Hope that helps.
Related
I have built a machine learning model using the CatBoost classifier to predict the category name of my result, as per screenshot 1 below. However, if I get an unknown input, or any input the model has not been trained on, I need it to return null.
My idea for approaching this was based on the probability (confidence) score, as per screenshot 2 below (expected output): for known inputs the model would have a high probability score, and for any unknown, unseen input it would have a low confidence score.
How can I achieve this and add a probability column to my predicted results, as per screenshot 2 (expected output)?
Code I am working with
pred = pipe_model_.predict(df_unseen)
predict_proba = pipe_model_.predict_proba(df_unseen)
# Get predicted RawFormulaVal
preds_raw = pipe_model_.predict(df_unseen, prediction_type='RawFormulaVal')
The output of the above code for predict_proba is below.
Sample input training dataframe (screenshot 1).
The expected predicted output is shown below (screenshot 2). The yellow-highlighted row is one the model has never seen before and was not trained on, so its probability is low, and I can write an if condition to omit it as per my requirement.
To summarize your requirements:
Return the probability of the label predicted by the model
If the input (Name) was not part of the training set, null the probability
If this is correct, then for requirement 1, the only step you're missing is the mapping from the .predict_proba() output to the classes. You can call .classes_ to recover the mapping (see the related answer). With this mapping, you can store the prediction as well as the probabilities for each class, and present only the probability for the class that was predicted.
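A minimal sketch of that mapping for requirement 1 (the result column names are made up):
import numpy as np

proba = pipe_model_.predict_proba(df_unseen)   # shape (n_rows, n_classes)
classes = pipe_model_.classes_                 # class order matching the columns of proba

results = df_unseen.copy()
results["predicted_label"] = classes[np.argmax(proba, axis=1)]
results["probability"] = proba.max(axis=1)     # probability of the predicted class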
For requirement 2, you will need to keep a record of all the inputs (Names) you provided in training. You could keep it in a .txt file and load it into a list. Then, after predictions are made, you can exclude any row which had a new or unknown input.
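And a minimal sketch for requirement 2, assuming the training names were saved to a plain text file, one per line (the file name and the "Name" column are assumptions):
# names seen during training, one per line in a plain text file
with open("training_names.txt") as f:
    known_names = {line.strip() for line in f}

# requirement 2: null the probability for rows whose Name was never seen in training
unseen = ~results["Name"].isin(known_names)
results.loc[unseen, "probability"] = None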
2 is an odd requirement though. If you know the Label for each of the Names you have seen before, and you don't want to use the output of the model in the cases where you haven't seen the Name before, the use case may be better served with a hard-coded lookup from Name to Label. The purpose of a model is to predict the Label when you haven't seen Name before, after training it on patterns of Names (e.g., if you get the new Name "Transt," the model would hopefully predict "Logistics" after being trained on "Transit" > "Logistics" and "Transiting" > "Logistics").
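If you do go the lookup route, a minimal sketch (the dataframe and column names are purely illustrative):
# built once from the training data: exact Name -> CategoryName mapping
known_labels = dict(zip(train_df["Name"], train_df["CategoryName"]))

def classify(name):
    if name in known_labels:
        return known_labels[name]  # seen in training: return the known label directly
    return None                    # never seen: no prediction, matching requirement 2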
I have a dataset with many features, many of which are categorical. I want to apply an embedding layer to transform the categorical data into numerical data for use with other models, but I got an error during training.
Now, my training process is:
Apply a label encoder to the categorical features.
Split the training and testing data with the train_test_split() function.
Drop the numerical columns and send only the categorical features and the target y for model training.
And I got this error:
indices[13,0] = 10 is not in [0, 10)
[[node functional_1/embed_6/embedding_lookup (defined at <ipython-input-34-0b6b3ae455d0>:4) ]] [Op:__inference_train_function_3509]
Errors may have originated from an input operation.
Input Source operations connected to node functional_1/embed_6/embedding_lookup:
functional_1/embed_6/embedding_lookup/2395 (defined at /usr/lib/python3.6/contextlib.py:81)
Function call stack:
train_function
After searching, I found that some people say the problem is that the vocabulary_size parameter of the embedding layer is wrong, and that enlarging vocabulary_size can solve the problem.
But in my case, I need to map the result back to the original labels.
For example, say I have a categorical feature ['dog', 'cat', 'fish']. After label encoding it becomes [0, 1, 2]. An embedding layer for this feature with 3 unique values should output something like ([-0.22748041], [-0.03832678], [-0.16490786]).
Then I can replace 'dog' in the original data with -0.22748041, replace 'cat' with -0.03832678, and so on.
So, I can't change the vocabulary_size or the output dimension will be wrong.
I guess the problem in my case is that not all of the categorical values go into the training process. (For example, only 'dog' and 'fish' are in the training data, and 'cat' appears only in the testing data.) If I set vocabulary_size to 3, it reports an error like the one above. If I experimentally add 'cat' to the training data, it works fine.
My question is: does the embedding layer have to see every unique value during training to support the application I want? If there are many categorical features with many unique values, how can I ensure every unique value appears in the training data when splitting?
Thanks in advance!
Solution
You need to use out-of-vocabulary (OOV) buckets when creating the lookup table.
OOV buckets make it possible to look up categories that were never seen during training.
What does the solution do?
Setting the number of OOV buckets to some value (like 1000) allows you to get ids for categories that were not present in the training vocabulary.
import tensorflow as tf

words = tf.constant(vocabulary)
word_ids = tf.range(len(vocabulary), dtype=tf.int64)

# important
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)  # lookup table mapping categories to ids
Then you can encode the training set (here I am using the IMDb reviews dataset from TensorFlow Datasets):
def encode_words(X_batch, y_batch):
    """
    Encode the training set, converting words to IDs
    using the lookup table just created.
    """
    return table.lookup(X_batch), y_batch

train_set = datasets["train"].batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)
When creating the model:
vocab_size = 10000        # the size of your vocabulary
embedding_size = 128      # tweakable hyperparameter

model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embedding_size,
                           input_shape=[None]),
    # usual code follows
])
and fit the data
model.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
history = model.fit(train_set, epochs=5)
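To come back to the mapping concern from the question: a category that was never seen during training still gets a valid id; it simply falls into one of the OOV buckets. A minimal sketch, with purely illustrative category names:
# 'cat' is not in the vocabulary, so it hashes into one of the num_oov_buckets buckets,
# i.e. it gets an id in [len(vocabulary), len(vocabulary) + num_oov_buckets)
ids = table.lookup(tf.constant(["dog", "fish", "cat"]))
print(ids.numpy())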
I have made a CNN using Keras.
Now I want to extract features of my training set from this model. I compiled the model and trained it on the training set first. Then I used predict() to extract features of the training set, with the following lines of code:
train_feature = model.predict(X_TRAIN)
print(train_feature.shape) # (692,10)
692 is the total number of training images. I had 10 classes. What does the 10 represent here?
This isn't called "extracting features", so you shouldn't assign the result to this name:
train_feature = model.predict(X_TRAIN) # I suggest train_output or something
The number of columns, i.e. 10, is the number of categories you have, assuming you built your model properly. Each of the 10 categories gets a probability on every forward pass.
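To make that concrete, a minimal sketch (the variable names are assumptions):
import numpy as np

train_output = model.predict(X_TRAIN)             # shape (692, 10): one probability per class per image
predicted_classes = train_output.argmax(axis=-1)  # shape (692,): index of the most probable class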
I have three columns in a dataset on which I'm doing sentiment analysis (classes 0, 1, 2): text, thing, and sentiment.
The problem is that I can train my data only on either text or thing and get a predicted sentiment. Is there a way to train on both text and thing and then predict the sentiment?
Problem case (say):
   | text  thing   sentiment
 0 | t1    thing1  0
 . |
 . |
54 | t1    thing2  2
This example tells us that the sentiment depends on the thing as well. I could try concatenating the two columns one below the other, but that would be incorrect, as we wouldn't be giving the model any relationship between the two columns.
Also, my test set contains the two columns text and thing, for which I have to predict the sentiment using the model trained on those two columns.
Right now I'm using the tokenizer and then the model below:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
Any pointers on how to proceed, or which model or coding manipulation to use?
You may want to shift to the Keras functional API and train a multi-input model.
According to Keras's creator, François Chollet, in his book Deep Learning with Python [Manning, 2017] (chapter 7, section 1):
Some tasks require multimodal inputs: they merge data coming from different input sources, processing each type of data using different kinds of neural layers. Imagine a deep-learning model trying to predict the most likely market price of a second-hand piece of clothing, using the following inputs: user-provided metadata (such as the item’s brand, age, and so on), a user-provided text description, and a picture of the item. If you had only the metadata available, you could one-hot encode it and use a densely connected network to predict the price. If you had only the text description available, you could use an RNN or a 1D convnet. If you had only the picture, you could use a 2D convnet. But how can you use all three at the same time? A naive approach would be to train three separate models and then do a weighted average of their predictions. But this may be suboptimal, because the information extracted by the models may be redundant. A better way is to jointly learn a more accurate model of the data by using a model that can see all available input modalities simultaneously: a model with three input branches.
I think the Concatenate layer is the way to go in such a case, and the general idea should be as follows. Please tweak it according to your use case.
from keras.layers import Input, Dense, Concatenate
from keras.models import Model

### whatever preprocessing you may want to do
text_input = Input(shape=(1,))
thing_input = Input(shape=(1,))

### now bring them together
merged_inputs = Concatenate(axis=1)([text_input, thing_input])

### sample output layer
output = Dense(3)(merged_inputs)

### pass your inputs and outputs to the model
model = Model(inputs=[text_input, thing_input], outputs=output)
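A minimal sketch of compiling and fitting such a multi-input model; the variable names are placeholders, and it assumes a softmax is added to the output layer with integer labels 0/1/2:
# assumes the final Dense(3) uses activation='softmax' and y_train holds integer labels 0/1/2
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# pass one array per input, in the same order as the `inputs` list above
model.fit([text_array, thing_array], y_train, epochs=5, batch_size=32)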
You have to take multiple columns as a list and then merge them for training, after embedding and preprocessing the raw data.
Example:
import pandas as pd

train = pd.read_csv('COVID19 multifeature Emotion - 50 data.csv', nrows=49)
# This dataset has two text columns and different class labels
X_train_doctor_opinion = train["doctor-opinion"].str.lower()
X_train_patient_opinion = train["patient-opinion"].str.lower()

X_train = list(X_train_doctor_opinion) + list(X_train_patient_opinion)
Then preprocess and embed.
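A minimal sketch of that preprocessing step, assuming the Keras Tokenizer and pad_sequences utilities (the num_words and maxlen values are placeholders):
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=20000, lower=True)
tokenizer.fit_on_texts(X_train)                        # X_train is the combined list built above
sequences = tokenizer.texts_to_sequences(X_train)
X_train_padded = pad_sequences(sequences, maxlen=100)  # ready to feed an Embedding layer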
I am making a classifier based on a CNN model in Keras.
I will use it in an application where the user can load the application and enter input text, and the model will be loaded from its weights and make predictions.
The thing is, I am using GloVe embeddings, and the CNN model uses padded text sequences as well.
I used the Keras tokenizer as follows:
from keras.preprocessing import text, sequence

tokenizer = text.Tokenizer(num_words=max_features, lower=True, char_level=False)
tokenizer.fit_on_texts(list(train_x))

train_x = tokenizer.texts_to_sequences(train_x)
test_x = tokenizer.texts_to_sequences(test_x)

train_x = sequence.pad_sequences(train_x, maxlen=maxlen)
test_x = sequence.pad_sequences(test_x, maxlen=maxlen)
I trained the model and made predictions on the test data, but now I want to do the same with the saved model that I have loaded and that is working.
My problem is that if I provide a single review, it has to be passed through tokenizer.texts_to_sequences(), which returns a 2D array with a shape of (num_chars, maxlength), followed by num_chars predictions, but I need the input to have shape (1, max_length).
I am using the following code for prediction:
review = 'well free phone cingular broke stuck not abl offer kind deal number year contract up realli want razr so went look cheapest one could find so went came euro charger small adpat made fit american outlet, gillett fusion power replac cartridg number count packagemay not greatest valu out have agillett fusion power razor'
xtest = tokenizer.texts_to_sequences(review)
xtest = sequence.pad_sequences(xtest, maxlen=maxlen)
model.predict(xtest)
Output is:
array([[0.29289 , 0.36136267, 0.6205081 ],
[0.362869 , 0.31441122, 0.539749 ],
[0.32059124, 0.3231736 , 0.5552745 ],
...,
[0.34428033, 0.3363668 , 0.57663095],
[0.43134686, 0.33979046, 0.48991954],
[0.22115968, 0.27314988, 0.6188136 ]], dtype=float32)
I need a single prediction here array([0.29289 , 0.36136267, 0.6205081 ]) as I have a single review.
The problem is that you need to pass a list of strings to the texts_to_sequences method, so you need to put the single review in a list, like this:
xtest = tokenizer.texts_to_sequences([review])
If you don't do that (i.e. if you pass a string rather than a list of string(s)), then, since strings in Python are iterable, it would iterate over the characters of the given string and treat the characters, not words, as tokens:
oov_token_index = self.word_index.get(self.oov_token)
for text in texts:  # <-- it would iterate over the string instead
    if self.char_level or isinstance(text, list):
That's why you get an array of shape (num_chars, maxlength) as the return value of texts_to_sequences.
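Putting it together, a minimal sketch of the corrected single-review prediction (the names follow the code above):
xtest = tokenizer.texts_to_sequences([review])        # a list with one string -> one sequence
xtest = sequence.pad_sequences(xtest, maxlen=maxlen)  # shape (1, maxlen)
probs = model.predict(xtest)                          # shape (1, 3): one row of class probabilities
print(probs[0])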