Bert Embedding element #308

Bert Embedding element #308 - python

I've been getting some practice with Bert Embedding. Specifically, I'm using BERT-Base Uncased model with the PyTorch libraries
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
I noticed that for pretty much every word I've looked at the element #308 in the bert embedding 768-long vector is a negative outlier with the values below -2, perhaps half of the time below -4.
It is really weird. I tried to google something about 'bert embedding 308' but couldn't find anything.
I wonder if there is any explanation for this 'phenomenon'.
Here is my routine to extract the embeddings:
def bert_embedding(text, bert_model, bert_tokenizer, layer_number = 0):
marked_text = "[CLS] " + text + " [SEP]"
#the default is the last layer: 0-1 = -1
layer_number -= 1
tokenized_text = bert_tokenizer.tokenize(marked_text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
segments_ids = [1] * len(tokenized_text)
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
with torch.no_grad():
encoded_layers, _ = bert_model(tokens_tensor, segments_tensors)
return encoded_layers[layer_number][0][:]

Related

Word-embedding does not provide expected relations between words

I am trying to train a word embedding to a list of repeated sentences where only the subject changes. I expected that the generated vectors corresponding the subjects provide a strong correlation after training as it is expected from a word embedding. However, the angle between the vectors of subjects is not always larger than the angle between subjects and a random word.
Man is going to write a very long novel that no one can read.
Woman is going to write a very long novel that no one can read.
Boy is going to write a very long novel that no one can read.
The code is based on pytorch tutorial:
import torch
from torch import nn
import torch.nn.functional as F
import numpy as np
class EmbedTrainer(nn.Module):
def __init__(self, d_vocab, d_embed, d_context):
super(EmbedTrainer, self).__init__()
self.embed = nn.Embedding(d_vocab, d_embed)
self.fc_1 = nn.Linear(d_embed * d_context, 128)
self.fc_2 = nn.Linear(128, d_vocab)
def forward(self, x):
x = self.embed(x).view((1, -1)) # flatten after embedding
x = self.fc_2(F.relu(self.fc_1(x)))
x = F.log_softmax(x, dim=1)
return x
text = " ".join(["{} is going to write a very long novel that no one can read.".format(x) for x in ["Man", "Woman", "Boy"]])
text_split = text.split()
trigrams = [([text_split[i], text_split[i+1]], text_split[i+2]) for i in range(len(text_split)-2)]
dic = list(set(text.split()))
tok_to_ids = {w:i for i, w in enumerate(dic)}
tokens_text = text.split(" ")
d_vocab, d_embed, d_context = len(dic), 10, 2
""" Train """
loss_func = nn.NLLLoss()
model = EmbedTrainer(d_vocab, d_embed, d_context)
print(model)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
losses = []
epochs = 10
for epoch in range(epochs):
total_loss = 0
for input, target in trigrams:
tok_ids = torch.tensor([tok_to_ids[tok] for tok in input], dtype=torch.long)
target_id = torch.tensor([tok_to_ids[target]], dtype=torch.long)
model.zero_grad()
log_prob = model(tok_ids)
#if total_loss == 0: print("train ", log_prob, target_id)
loss = loss_func(log_prob, target_id)
total_loss += loss.item()
loss.backward()
optimizer.step()
print(total_loss)
losses.append(total_loss)
embed_map = {}
for word in ["Man", "Woman", "Boy", "novel"]:
embed_map[word] = model.embed.weight[tok_to_ids[word]]
print(word, embed_map[word])
def angle(a, b):
from numpy.linalg import norm
a, b = a.detach().numpy(), b.detach().numpy()
return np.dot(a, b) / norm(a) / norm(b)
print("man.woman", angle(embed_map["Man"], embed_map["Woman"]))
print("man.novel", angle(embed_map["Man"], embed_map["novel"]))

I expected that the generated vectors corresponding the subjects provide a strong correlation after training as it is expected from a word embedding
I don't really think you'll achieve that kind of result with only 3 sentences and like 40 iterations in 10 epochs (plus most of the data in your 40 iterations is repeated).
maybe try downloading a couple of free datasets out there, or try your own data with a proven model like a genism model.
I'll give you the code for training a gensim model, so you can test your dataset on another model and see if the problem comes from your data or from your model.
I've tested similar gensim models on datasets with millions of sentences and it worked like a charm, for smaller datasets you might want to change the parameters.
from gensim.models import Word2Vec
from multiprocessing import cpu_count
corpus_path = 'eachLineASentence.txt'
vecSize = 300
winSize = 5
numWorkers = cpu_count()-1
epochs = 20
minCount = 5
skipGram = False
modelName = f'mymodel.model'
model = Word2Vec(corpus_file=corpus_path,
size=vecSize,
window=winSize,
min_count=minCount,
workers=numWorkers,
iter=epochs,
sg=skipGram)
model.save(modelName)
P.S. I don't think it's a good idea to use the keyword input as a variable in your code.

It's most probably the training size. Training a 128d embedding is definitely overkill. Rule of thumb from the the google developers blog:
Why is the embedding vector size 3 in our example? Well, the following "formula" provides a general rule of thumb about the number of embedding dimensions:
embedding_dimensions = number_of_categories**0.25
That is, the embedding vector dimension should be the 4th root of the number of categories. Since our vocabulary size in this example is 81, the recommended number of dimensions is 3:
3 = 81**0.25

Tensor("dense_2/Softmax:0", shape=(?, 10), dtype=float32) is not an element of this graph

I'm running this program with tensorflow==1.14.0 and Keras==2.3.0, my code works fine with tensorflow==2.2.0 and Keras==2.4.3. However due to some reason, I need to decrease my overall package size to be < 500MB (Heroku: deploying Deep Learning model), therefore I want to use an earlier version of tensorflow instead. However, with tensorflow==1.14.0 and Keras==2.3.0 my program gives ValueError: Tensor Tensor("dense_2/Softmax:0", shape=(?, 10), dtype=float32) is not an element of this graph. whenever I try to make a prediction (inside the predict_class function).
Does anyone know why is that? I've scan though many other problems but they don't seem to be resolving this issue or they're just leading me to another error.
import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
import numpy as np
import pickle
import json
import random
from django.http import JsonResponse
from tensorflow.keras.models import load_model
model = load_model('chatbot/model.h5')
intents = json.loads(open('chatbot/intents.json').read())
words = pickle.load(open('chatbot/words.pkl', 'rb'))
classes = pickle.load(open('chatbot/classes.pkl', 'rb'))
def clean_up_user_input(sentence):
sentence_words = nltk.word_tokenize(sentence)
sentence_words = [
lemmatizer.lemmatize(word.lower()) for word in sentence_words
]
return sentence_words
def get_bag_of_words(sentence, words):
sentence_words = clean_up_user_input(sentence)
bag = [0] * len(words)
for s in sentence_words:
for i, word in enumerate(words):
if word == s:
bag[i] = 1
return np.array(bag)
def predict_class(sentence):
# filter below threshold predictions
p = get_bag_of_words(sentence, words)
res = model.predict(np.array([p]))[0]
ERROR_THRESHOLD = 0.25
results = [[i, r] for i, r in enumerate(res) if r > ERROR_THRESHOLD]
# sorting strength probability
results.sort(key=lambda x: x[1], reverse=True)
return_list = []
for r in results:
return_list.append({"intent": classes[r[0]], "probability": str(r[1])})
if return_list == []:
return_list.append({"intent": 'noanswer', "probability": str(1)})
return return_list
This is how I train the model
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.optimizers import SGD
import random
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
import json
import pickle
words = []
classes = []
documents = []
ignore_letters = ['!', '?', ',', '.']
intents_file = open('intents.json').read()
intents = json.loads(intents_file)
for intent in intents['intents']:
for pattern in intent['patterns']:
word = nltk.word_tokenize(pattern)
words.extend(word)
#add documents in the corpus
documents.append((word, intent['tag']))
# add to our classes list
if intent['tag'] not in classes:
classes.append(intent['tag'])
print(documents)
# lemmaztize and lower each word and remove duplicates
words = [
lemmatizer.lemmatize(w.lower()) for w in words if w not in ignore_letters
]
words = sorted(list(set(words)))
# sort classes
classes = sorted(list(set(classes)))
# documents = combination between patterns and intents
print(len(documents), "documents")
# classes = intents
print(len(classes), "classes", classes)
# words = all words, vocabulary
print(len(words), "unique lemmatized words", words)
pickle.dump(words, open('words.pkl', 'wb'))
pickle.dump(classes, open('classes.pkl', 'wb'))
# create the training data
training = []
# create an empty array for our output
output_empty = [0] * len(classes)
# training set, bag of words for each sentence
for doc in documents:
# initialize our bag of words
bag = []
# list of tokenized words for the pattern
pattern_words = doc[0]
# lemmatize each word - create base word, in attempt to represent related words
pattern_words = [
lemmatizer.lemmatize(word.lower()) for word in pattern_words
]
# create our bag of words array with 1, if word match found in current pattern
for word in words:
bag.append(1) if word in pattern_words else bag.append(0)
# output is a '0' for each tag and '1' for current tag (for each pattern)
output_row = list(output_empty)
output_row[classes.index(doc[1])] = 1
training.append([bag, output_row])
# shuffle our features and turn into np.array
random.shuffle(training)
training = np.array(training)
# create train and test lists. X - patterns, Y - intents
train_x = list(training[:, 0])
train_y = list(training[:, 1])
print("Training data created")
# Create model - 3 layers. First layer 128 neurons, second layer 64 neurons and 3rd output layer contains number of neurons
# equal to number of intents to predict output intent with softmax
# Sequential() allows you to create models layer-by-layer
model = Sequential()
# Dense layer is a regular layer of neurons in a neural network
model.add(Dense(128, input_shape=(len(train_x[0]), ), activation='relu'))
# Dropout is used for prevent overfitting.
# Dropout works by randomly setting the outgoing edges of hidden units (neurons that make up hidden layers) to 0 at each update of the training phase
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))
# Compile model. Stochastic gradient descent with Nesterov accelerated gradient gives good results for this model
# nesterov is an optimal method (in terms of oracle complexity) for smooth convex optimization
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
# loss is the loss function
model.compile(loss='categorical_crossentropy',
optimizer=sgd,
metrics=['accuracy'])
# fitting and saving the model
# By setting verbose 0, 1 or 2 you just say how do you want to 'see' the training progress for each epoch.
# The batch size defines the number of samples that will be propagated through the network
hist = model.fit(np.array(train_x),
np.array(train_y),
epochs=200,
batch_size=5,
verbose=1)
model.save('model.h5', hist)

This problem is solved by adding a few lines of code, I have marked the end of line with '#'
for additional details, pls check out https://github.com/tensorflow/tensorflow/issues/28287
from tensorflow.keras.models import load_model #
from tensorflow.python.keras.backend import set_session #
import tensorflow as tf #
graph = tf.get_default_graph() #
sess = tf.Session() #
set_session(sess) #
model = load_model('chatbot/model.h5')
intents = json.loads(open('chatbot/intents.json').read())
words = pickle.load(open('chatbot/words.pkl', 'rb'))
classes = pickle.load(open('chatbot/classes.pkl', 'rb'))
def clean_up_user_input(sentence):
sentence_words = nltk.word_tokenize(sentence)
sentence_words = [
lemmatizer.lemmatize(word.lower()) for word in sentence_words
]
return sentence_words
def get_bag_of_words(sentence, words):
sentence_words = clean_up_user_input(sentence)
bag = [0] * len(words)
for s in sentence_words:
for i, word in enumerate(words):
if word == s:
bag[i] = 1
return np.array(bag)
def predict_class(sentence):
# filter below threshold predictions
p = get_bag_of_words(sentence, words)
with graph.as_default(): #
set_session(sess) #
res = model.predict(np.array([p]))[0]
ERROR_THRESHOLD = 0.25
results = [[i, r] for i, r in enumerate(res) if r > ERROR_THRESHOLD]
# sorting strength probability
results.sort(key=lambda x: x[1], reverse=True)
return_list = []
for r in results:
return_list.append({"intent": classes[r[0]], "probability": str(r[1])})
if return_list == []:
return_list.append({"intent": 'noanswer', "probability": str(1)})
return return_list

How to add GloVe word embeddings to Keras POS tagger?

I've been trying to research how to use Keras to train a POS tagger; specifically I want it to use an LSTM architecture and to use word embeddings, namely, GloVe. I've taken inspiration from two blogs. One uses an LSTM w/o pretrained embeddings to perform POS; the other uses LSTM w/ word embeddings to classify text.
https://nlpforhackers.io/lstm-pos-tagger-keras/
https://nlpforhackers.io/keras-intro/
The below script "works" in the sense that no errors are triggered, however, overpredicts "padding" cells and underpredicts other tokens. (When following the POS blog verbatim, the accuracy is ~99%.) I don't understand why the addition of word embeddings has hurt performance so bad.
Data preprocessing:
import nltk
tagged_sentences = nltk.corpus.treebank.tagged_sents()
import numpy as np
sentences, sentence_tags =[], []
for tagged_sentence in tagged_sentences:
sentence, tags = zip(*tagged_sentence)
sentences.append(np.array(sentence))
sentence_tags.append(np.array(tags))
from sklearn.model_selection import train_test_split
(train_sentences, test_sentences,
train_tags, test_tags) = train_test_split(sentences, sentence_tags, test_size=0.2)
def assemble_text(array):
return ' '.join([word for word in array])
train_sentences = [assemble_text(arr) for arr in train_sentences]
test_sentences = [assemble_text(arr) for arr in test_sentences]
tags = set([])
for ts in train_tags:
for t in ts:
tags.add(t)
tag2index = {t: i + 1 for i, t in enumerate(list(tags))}
tag2index['-PAD-'] = 0 # The special value used to padding
train_tags_y = []
for s in train_tags:
train_tags_y.append([tag2index[t] for t in s])
test_tags_y= []
for s in test_tags:
test_tags_y.append([tag2index[t] for t in s])
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(binary=True,
lowercase=True, min_df=3, max_df=0.9, max_features=5000)
X_train_onehot = vectorizer.fit_transform(train_sentences)
word2idx = {word: idx for idx, word in enumerate(vectorizer.get_feature_names())}
tokenize = vectorizer.build_tokenizer()
preprocess = vectorizer.build_preprocessor()
def to_sequence(tokenizer, preprocessor, index, text):
words = tokenizer(preprocessor(text))
indexes = [index[word] for word in words if word in index]
return indexes
X_train_sequences = [to_sequence(tokenize, preprocess, word2idx, x) for x in train_sentences]
X_test_sequences = [to_sequence(tokenize, preprocess, word2idx, x) for x in test_sentences]
# Compute the max lenght of a text
MAX_SEQ_LENGHT = len(max(X_train_sequences, key=len))
print("MAX_SEQ_LENGHT=", MAX_SEQ_LENGHT)
from tensorflow.keras.preprocessing.sequence import pad_sequences
N_FEATURES = len(vectorizer.get_feature_names())
from tensorflow.keras.preprocessing.sequence import pad_sequences
X_train_sequences = pad_sequences(X_train_sequences, maxlen=MAX_SEQ_LENGHT, padding='post')
X_test_sequences = pad_sequences(X_test_sequences, maxlen=MAX_SEQ_LENGHT, padding='post')
train_tags_y = pad_sequences(train_tags_y, maxlen=MAX_SEQ_LENGHT, padding='post')
test_tags_y = pad_sequences(test_tags_y, maxlen=MAX_SEQ_LENGHT, padding='post')
def to_categorical(sequences, categories):
cat_sequences = []
for s in sequences:
cats = []
for item in s:
cats.append(np.zeros(categories))
cats[-1][item] = 1.0
cat_sequences.append(cats)
return np.array(cat_sequences)
cat_train_tags_y = to_categorical(train_tags_y, len(tag2index))
cat_test_tags_y = to_categorical(test_tags_y, len(tag2index))
Importing word vectors
import numpy as np
GLOVE_PATH = '/Users/jdmoore7/Downloads/glove.6B/glove.6B.50d.txt'
GLOVE_VECTOR_LENGHT = 50
def read_glove_vectors(path, lenght):
embeddings = {}
with open(path) as glove_f:
for line in glove_f:
chunks = line.split()
assert len(chunks) == lenght + 1
embeddings[chunks[0]] = np.array(chunks[1:], dtype='float32')
return embeddings
GLOVE_INDEX = read_glove_vectors(GLOVE_PATH, GLOVE_VECTOR_LENGHT)
# Init the embeddings layer with GloVe embeddings
embeddings_index = np.zeros((len(vectorizer.get_feature_names()) + 1, GLOVE_VECTOR_LENGHT))
for word, idx in word2idx.items():
try:
embedding = GLOVE_INDEX[word]
embeddings_index[idx+1] = embedding
except:
pass
Model and accuracy metrics
from tensorflow.keras import backend as K
def ignore_class_accuracy(to_ignore=0):
def ignore_accuracy(y_true, y_pred):
y_true_class = K.argmax(y_true, axis=-1)
y_pred_class = K.argmax(y_pred, axis=-1)
ignore_mask = K.cast(K.not_equal(y_pred_class, to_ignore), 'int32')
matches = K.cast(K.equal(y_true_class, y_pred_class), 'int32') * ignore_mask
accuracy = K.sum(matches) / K.maximum(K.sum(ignore_mask), 1)
return accuracy
return ignore_accuracy
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, InputLayer, Bidirectional, TimeDistributed, Embedding, Activation, Dropout
from tensorflow.keras.optimizers import Adam
model = Sequential()
model.add(InputLayer(input_shape=(MAX_SEQ_LENGHT, )))
model.add(Embedding(len(vectorizer.get_feature_names()) + 1,
GLOVE_VECTOR_LENGHT, # Embedding size
weights=[embeddings_index],
input_length=MAX_SEQ_LENGHT,
trainable=False))
model.add(Bidirectional(LSTM(256, activation='relu', return_sequences=True)))
model.add(TimeDistributed(Dense(len(tag2index))))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
optimizer=Adam(0.001),
metrics=['accuracy',ignore_class_accuracy(0)])
model.fit(X_train_sequences, cat_train_tags_y,
epochs=40, batch_size=128, verbose=1,
validation_data=(X_test_sequences, cat_test_tags_y))
def logits_to_tokens(sequences, index):
token_sequences = []
for categorical_sequence in sequences:
token_sequence = []
for categorical in categorical_sequence:
token_sequence.append(index[np.argmax(categorical)])
token_sequences.append(token_sequence)
return token_sequences
import string
def pipe(text):
words = ''.join([char.lower() for char in text if char not in string.punctuation]).split(' ')
arr = [to_sequence(tokenize, preprocess, word2idx, text) ]
arr = pad_sequences(arr, maxlen=MAX_SEQ_LENGHT, padding='post')
pred = model.predict(arr)
values = logits_to_tokens(pred,
{i: t for t, i in tag2index.items()})[0]
return [(w,t) for w,t in zip(words,values)]
pipe('the walk down the hill')
>>>
[('the', '-PAD-'),
('walk', '-PAD-'),
('down', '-PAD-'),
('the', '-PAD-'),
('hill', '-PAD-')]
The accuracy produced in model fitting came out to be 0.00%. So I can only conclude that I've used word embeddings wrong in some way; it's a matter of, is my model architecture flawed? Or is the way I handle the word embeddings, themselves, flawed? Or something else?

Not using recurrent_dropout in colab crashing model?

I'm trying to train a simple tensorflow model to detect the sentiment of tweets. The datatypes and sizes of arrays are consistent and the model trains just fine when the recurrent_dropout is set to some float value. However this disables cuDNN and I'd really like to speed this up (don't we all) but whenever I remove the recurrent dropout argument the model training will crash before the end of the first epoch.
Below is the relevant code, I've left out imports, and loading the csv files. After the relevant code are the final input dimensions and the error code. Additionally, I have figured out why colab seemed to be cutting the training data. Colab displays the number of sequences after it has been split into batches, so with the default batch size of 32 we were getting 859 sequences. The crashing issue when not using the recurrent dropout is still an issue. Side note, this code is a very rough draft with the data cleaning all being done within the same notebook, hence the lack of typical formatting.
def remove_case(X):
removed_case = []
X = X.copy()
for text in X:
text = str(text).lower()
removed_case.append(text)
X = removed_case
return X
def remove_hyperlinks(X):
removed_hyperlinks = []
X = X.copy()
for text in X:
text = str(text)
text = re.sub(r'http\S+', '', text)
text = re.sub(r'https\S+', '', text)
text = re.sub(r'www\S+', '', text)
removed_hyperlinks.append(text)
X = removed_hyperlinks
return X
def remove_punctuation(X):
removed_punc = []
X = X.copy()
for text in X:
text = str(text)
text = "".join([char for char in text if char not in punctuation])
removed_punc.append(text)
X = removed_punc
return X
def split_text(X):
split_tweets = []
X = X.copy()
for text in X:
text = str(text).split()
split_tweets.append(text)
X = split_tweets
return X
def map_sentiment(X, l, m, n):
keys = ['negative', 'neutral', 'positive']
values = [l, m, n]
dictionary = dict(zip(keys, values))
X = X.copy()
X = X.map(dictionary)
return X
# # def sentiment_to_onehot(X):
# sentiment_foofs = []
# X = X.copy()
# for integer in X:
# if integer == "negative": # Negative
# integer = [1, 0, 0]
# elif integer == "neutral": # Neutral
# integer = [0, 1, 0]
# elif integer == "positive": # Positive
# integer = [0, 0, 1]
# else:
# break
# sentiment_foofs.append(integer)
# X = sentiment_foofs
# return X
train_no_punc_lowercase = train.copy()
train_no_punc_lowercase['text'] = remove_case(train_no_punc_lowercase['text'])
train_no_punc_lowercase['text'] = remove_hyperlinks(train_no_punc_lowercase['text'])
train_no_punc_lowercase['text'] = remove_punctuation(train_no_punc_lowercase['text'])
train_no_punc_lowercase['sentiment'] = map_sentiment(train_no_punc_lowercase['sentiment'], 0, 1, 2)
train_no_punc_lowercase.head()
test_no_punc_lowercase = test.copy()
test_no_punc_lowercase['text'] = remove_case(test_no_punc_lowercase['text'])
test_no_punc_lowercase['text'] = remove_hyperlinks(test_no_punc_lowercase['text'])
test_no_punc_lowercase['text'] = remove_punctuation(test_no_punc_lowercase['text'])
test_no_punc_lowercase['sentiment'] = map_sentiment(test_no_punc_lowercase['sentiment'], 0, 1, 2)
features = train.columns.tolist()
features.remove('textID') # all unique, high cardinality feature
features.remove('selected_text') # target
target = 'selected_text'
X_train_no_punc_lowercase = train_no_punc_lowercase[features]
y_train_no_punc_lowercase = train_no_punc_lowercase[target]
X_test_no_punc_lowercase = test_no_punc_lowercase[features]
def stemming_column(df_column):
ps = PorterStemmer()
stemmed_word_list = []
for i, string in enumerate(df_column):
tokens = word_tokenize(string)
new_string = ""
for j, words in enumerate(tokens):
new_string = new_string + ps.stem(words) + " "
stemmed_word_list.append(new_string)
return stemmed_word_list
def create_lookup_table(list1, list2):
main_list = []
lookup_dict = {}
i = 1 # used to create a value in the dictionary
main_list.append(list1)
main_list.append(list2)
for list in main_list:
for string in list:
for word in string.split():
if word not in lookup_dict:
lookup_dict[word] = i
i += 1
return lookup_dict
def encode(input_list, input_dict):
encoded_list = []
for string in input_list:
sentence_list = []
for word in string.split():
sentence_list.append(input_dict[word]) # value lookup from dictionary.. int
encoded_list.append(sentence_list)
return encoded_list
def pad_data(list_of_lists):
padded_data = tf.keras.preprocessing.sequence.pad_sequences(list_of_lists, padding='post')
return padded_data
def create_array_sentiment_integers(list):
sent_int_list = []
for sentiment in list:
sent_int_list.append(sentiment)
return np.asarray(sent_int_list, dtype=np.int32)
X_train_stemmed_list = stemming_column(X_train_no_punc_lowercase['text'])
X_test_stemmed_list = stemming_column(X_test_no_punc_lowercase['text'])
lookup_table = create_lookup_table(X_train_stemmed_list, X_test_stemmed_list)
X_train_encoded_list = encode(X_train_stemmed_list, lookup_table)
X_train_padded_data = pad_data(X_train_encoded_list)
Y_train = create_array_sentiment_integers(train_no_punc_lowercase['sentiment'])
max_features = 3 # 3 choices 0, 1, 2
Y_train_final = np.zeros((Y_train.shape[0], max_features), dtype=np.float32)
Y_train_final[np.arange(Y_train.shape[0]), Y_train] = 1.0
input_dimension = len(lookup_table) + 1
output_dimension = 64
input_length = 33
model = Sequential()
model.add(tf.keras.layers.Embedding(input_dim=input_dimension,
output_dim=output_dimension,
input_length=input_length,
mask_zero=True))
model.add(tf.keras.layers.LSTM(512, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
model.add(tf.keras.layers.Dense(256, activation='sigmoid'))
model.add(tf.keras.layers.Dropout(0.2))
model.add(Dense(3, activation='softmax'))
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model.fit(X_train_padded_data, Y_train_final, validation_split=0.20, epochs=10)
model.save('Tweet_sentiment.model')
Additionally, here are the shapes of the datasets..
x train shape: (27481, 33, 1) x train type: <class 'numpy.ndarray'> y train shape: (27481, 3)
Error code
Epoch 1/3
363/859 [===========>..................] - ETA: 9s - loss: 0.5449 - accuracy: 0.5674
---------------------------------------------------------------------------
UnknownError Traceback (most recent call last)
<ipython-input-103-1d4af3962607> in <module>()
----> 1 model.fit(X_train_padded_data, Y_train_final, epochs=3,)
8 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
58 ctx.ensure_initialized()
59 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 60 inputs, attrs, num_outputs)
61 except core._NotOkStatusException as e:
62 if name is not None:
UnknownError: [_Derived_] CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1496): 'cudnnSetRNNDataDescriptor( data_desc.get(), data_type, layout, max_seq_length, batch_size, data_size, seq_lengths_array, (void*)&padding_fill)'
[[{{node cond_38/then/_0/CudnnRNNV3}}]]
[[sequential_5/lstm_4/StatefulPartitionedCall]] [Op:__inference_train_function_36098]
Function call stack:
train_function -> train_function -> train_function

I see some problems in your code. They are mentioned below:
You are using input_dimension = len(lookup_table) + 1. len(lookup_table) is nothing but the Number of Time Steps. It's value will be very high, at least more than 30,000. It is recommended to use only subset of those Values. So, you can set input_dimension = 10000 or input_dimension = 15000 (you may experiment with this value) it should solve the problem. Having said that, it will not impact the Accuracy of the Model.
Why is setting Recurrent Dropout a Float Value working ==> When we set the Recurrent Dropout, it actually drops the Number of Time Steps, input_dimension in your case, and hence it is not crashing.
You should use return_sequences=True only if you have another LSTM Layer, after an LSTM Layer. Since you have only one LSTM Layer, return_sequences should be set to False
Since you have 3 Classes, you shouldn't use binary_crossentropy. You should use sparse_categorical_crossentropy if you are not One-Hot-Encoding your Target or categorical_crossentropy if you are One-Hot-Encoding your Target.
Are you sure you want to use Masking in Embedding Layer?
Also, I see that you are using Many Functions and Many Lines of Code for Data-Preprocessing like Removing Hyperlinks, Removing Punctuations, Tokenizing, etc..
So, I thought I will provide an End-To-End Tutorial for Text Classification which will help you as well as the Stack Overflow Community. Code for the same is shown below:
#!pip install tensorflow==2.1
#!pip install nltk
#!pip install tika
#!pip install textblob
#!pip3 install --upgrade numpy
#!pip install scikit-learn
# To handle Paths
import os
# To remove Hyperlinks and Dates
import re
# To remove Puncutations
import string
# This helps to remove the unnecessary words from our Text Data
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))
# To Parse the Input Data Files
from tika import parser
from textblob import TextBlob
# In order to use the Libraries of Tensorflow
import tensorflow as tf
# For Preprocessing the Text => To Tokenize the Text
from tensorflow.keras.preprocessing.text import Tokenizer
# If the Two Articles are of different length, pad_sequences will make the length equal
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Package for performing Numerical Operations
import numpy as np
# MatplotLib for Plotting Graphs
import matplotlib.pyplot as plt
# To shuffle the Data
from random import shuffle
# To Partition the Data into Train Data and Test Data
from sklearn.model_selection import train_test_split
# To add Regularizer in order to reduce Overfitting
from tensorflow.keras.regularizers import l2
# Give the Path of our Data
Path_Of_Data = 'Data'
# Extract the Labels from the Folders inside the Path mentioned above
Unique_Labels_List = ['negative', 'neutral', 'positive']
def GetNumericLabel(EachLabel):
if EachLabel=='negative':
return 0
elif EachLabel=='neutral':
return 1
elif EachLabel=='positive':
return 2
def Pre_Process_Data_And_Create_BOW(folder_path):
#creating empty lists in order to Create Resume Text and the respective Label
Resumes_List = []
Labels_List = []
for EachLabel in Unique_Labels_List:
for root, dirs, files in os.walk(os.path.join(folder_path, EachLabel),topdown=False):
for file in files:
i = 0
if file.endswith('.pdf'):
#Access individual file
Full_Resume_Path = os.path.join(root, file)
# Parse the Data inside the file
file_data = parser.from_file(Full_Resume_Path)
# Extract the Content of the File
Resume_Text = file_data['content']
# Below Code removes the Hyperlinks in the Resume, like LinkedIn Profile, Certifications, etc..
HyperLink_Regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\), ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
Text_Without_HL = re.sub(HyperLink_Regex, ' ', Resume_Text, flags=re.MULTILINE)
# Below Code removes the Date from the Resume
Date_regEx = r'(?:\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[a-z\s,.]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)+(?:\d{2,4})+'
CleanedText = re.sub(Date_regEx,' ',Text_Without_HL)
List_Of_All_Punctuations = list(string.punctuation)
Important_Punctuations = ['#', '.', '+' , '-'] #Add more, if any other Punctuation is observed as Important
NewLineChar = '\n'
# Below Set Comprises all the Punctuations, which can be Removed from the Text of Resume
Total_Punct = len(List_Of_All_Punctuations)
for EachImpPunct in Important_Punctuations:
for CountOfPunct in range(Total_Punct):
if CountOfPunct == Total_Punct:
break
elif EachImpPunct == List_Of_All_Punctuations[CountOfPunct]:
del List_Of_All_Punctuations[CountOfPunct]
Total_Punct = Total_Punct - 1
List_Of_All_Punctuations.append(NewLineChar)
for EachPunct in List_Of_All_Punctuations:
CleanedText = CleanedText.replace(EachPunct, " ")
# Below Code converts all the Words in the Resume to Lowercase ======> Check if it has to come after Tokenization if Splitting Code is delet instead of integed
#Final_Cleaned_Resume_Text = Text_Without_Punct.lower()
Final_Cleaned_Resume_Text = CleanedText.lower()
#Code to remove Stopwords from each Resume
for word in STOPWORDS:
#stop_token = ' ' + word + ' '
stop_token = word
Resume_Text = Final_Cleaned_Resume_Text.replace(stop_token, ' ')
#Resume_Text = Resume_Text.replace(' ', ' ')
Resumes_List.append(Resume_Text)
Numeric_Label = GetNumericLabel(EachLabel)
Labels_List.append(Numeric_Label)
#print('Successfully executed for the Folder, ', EachLabel)
#Return Final Lists
return Resumes_List, Labels_List
#calling the function and passing the path
Resumes_List, Labels_List = Pre_Process_Data_And_Create_BOW(Path_Of_Data)
vocab_size = 10000 # This is very important for you
# We want the Output of the Embedding Layer to be 64
embedding_dim = 64
max_length = 800
trunc_type = 'post'
padding_type = 'post'
oov_tok = '<OOV>'
# Taking 80% of the Data as Training Data and remaining 20% will be for Test Data
training_portion = .8
# Size of Train Data is 80% of the Entire Dataset => 0.8 * 2225
Train_Resume_Size = int(len(Resumes_List) * training_portion)
Labels_List = np.asarray(Labels_List)
Train_Resume_Data, Validation_Resume_Data, Train_Labels, Validation_Labels = \
train_test_split(Resumes_List, Labels_List, train_size = training_portion,
shuffle = True
, stratify= Labels_List)
from statistics import mean
print('Average Number of Words in Each Training Resume is {}'.format(mean([len(i) for i in Train_Resume_Data])))
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(Train_Resume_Data)
word_index = tokenizer.word_index
# Convert the Word Tokens into Integer equivalents, before passing it to keras embedding layer
train_sequences = tokenizer.texts_to_sequences(Train_Resume_Data)
train_padded = pad_sequences(train_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
validation_sequences = tokenizer.texts_to_sequences(Validation_Resume_Data)
validation_padded = pad_sequences(validation_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
print(len(validation_sequences))
print(validation_padded.shape)
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
# Check your Data
def decode_article(text):
return ' '.join([reverse_word_index.get(i, '?') for i in text])
print(decode_article(train_padded[10]))
print('-------------------------------------------------------------------------')
print(Train_Resume_Data[10])
Regularizer = l2(0.001)
model = tf.keras.Sequential([
# Add an Embedding layer expecting input vocab of size 5000, and output embedding dimension of size 64 we set at the top
tf.keras.layers.Embedding(vocab_size, embedding_dim,
embeddings_regularizer = Regularizer),
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim, return_sequences=True)),
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
# use ReLU in place of tanh function since they are very good alternatives of each other.
tf.keras.layers.Dense(embedding_dim, activation='relu'),
# Add a Dense layer with 3 units and softmax activation.
# When we have multiple outputs, softmax convert outputs layers into a probability distribution.
tf.keras.layers.Dense(3, activation='softmax')
])
model.summary()
#Using Early Stopping in order to handle Overfitting
ES_Callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)
model.compile(loss = tf.keras.losses.SparseCategoricalCrossentropy(), optimizer='adam', metrics=['accuracy'])
num_epochs = 100
history = model.fit(x = train_padded, y = Train_Labels, epochs=num_epochs,
callbacks=[ES_Callback],
validation_data=(validation_padded, Validation_Labels),
batch_size = 32, shuffle=True, verbose=1)
def plot_graphs(history, string):
plt.plot(history.history[string])
plt.plot(history.history['val_'+string])
plt.xlabel("Epochs")
plt.ylabel(string)
plt.legend([string, 'val_'+string])
plt.show()
plot_graphs(history, "accuracy")
plot_graphs(history, "loss")
version = 1
MODEL_DIR = 'Resume_Classification_Model'
export_path = os.path.join(MODEL_DIR, str(version))
tf.keras.models.save_model(model = model, filepath = export_path)
!ls -l {export_path}
!saved_model_cli show --dir {export_path} --all
For more information, please refer this Beautiful Article.
Hope this solves your issue. Happy Learning!

Keras Accuracy and Loss not changing over a large period of epochs

I am trying to create a Convolutional Neural Network to classify what language a certain "word" is from. There are two files ("english_words.txt" and "spanish_words.txt") which each contain about 60,000 words each. I have converted each word into a 29-dimensional vector where each element is a number between 0 and 1. I am training the model for 500 epochs with the optimizer "adam". However, when I train the model, the loss tends to hover around 0.7 and the accuracy around 0.5, and no matter how long I train it for, these metrics will not improve. Here is the code:
import keras
import numpy as np
from keras.layers import Dense
from keras.models import Sequential
import re
train_labels = []
train_data = []
with open("english_words.txt") as words:
full_words = words.read()
full_words = full_words.split("\n")
# all of the labels are just 1.
# we now need to encode them into 29 dimensional vectors.
vector = []
i = 0
for word in full_words:
train_labels.append([1,0])
for letter in word:
vector.append((ord(letter) - 96) * (1.0 / 26.0))
i += 1
if (i < 29):
for x in range(0, 29 - i):
vector.append(0)
train_data.append(vector)
vector = []
i = 0
with open("spanish_words.txt") as words:
full_words = words.read()
full_words = full_words.replace(' ', '')
full_words = full_words.replace('\n', ',')
full_words = full_words.split(",")
vector = []
for word in full_words:
train_labels.append([0,1])
for letter in word:
vector.append((ord(letter) - 96) * (1.0 / 26.0))
i += 1
if (i < 29):
for x in range(0, 29 - i):
vector.append(0)
train_data.append(vector)
vector = []
i = 0
def shuffle_in_unison(a, b):
assert len(a) == len(b)
shuffled_a = np.empty(a.shape, dtype=a.dtype)
shuffled_b = np.empty(b.shape, dtype=b.dtype)
permutation = np.random.permutation(len(a))
for old_index, new_index in enumerate(permutation):
shuffled_a[new_index] = a[old_index]
shuffled_b[new_index] = b[old_index]
return shuffled_a, shuffled_b
train_data = np.asarray(train_data, dtype=np.float32)
train_labels = np.asarray(train_labels, dtype=np.float32)
train_data, train_labels = shuffle_in_unison(train_data, train_labels)
print(train_data.shape, train_labels.shape)
model = Sequential()
model.add(Dense(29, input_shape=(29,)))
model.add(Dense(60))
model.add(Dense(40))
model.add(Dense(25))
model.add(Dense(2))
model.compile(optimizer="adam",
loss="categorical_crossentropy",
metrics=["accuracy"])
model.summary()
model.fit(train_data, train_labels, epochs=500, batch_size=128)
model.save("language_predictor.model")
For some extra info, I am running python 3.x with tensorflow 1.15 and keras 1.15 on windows x64.

I can see several potential problems with your code.
You added several Dense layers one after another, but you really need to also include a non-linear activation function with the parameter activation= .... In the absence of any non-linear activation functions, all those fully-connected Dense layers will mathematically collapse into one single linear Dense layer incapable of learning a non-linear decision boundary.
In general, if you see your loss and accuracy not making any improvement or even getting worse, then the first thing to try is to reduce your learning rate.
You don't need to necessarily implement your own shuffling function. The Keras fit() function can do it if you use the shuffle=True parameter.

In addition to the points mentioned by stackoverflowuser2010:
I find this a very good read and highly suggest checking the mentioned points: 37 Reasons why your Neural Network is not working
Center your input data: Compute a component-wise mean vector and subtract it from every input.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Bert Embedding element #308 - python

Related

Word-embedding does not provide expected relations between words

Tensor("dense_2/Softmax:0", shape=(?, 10), dtype=float32) is not an element of this graph

How to add GloVe word embeddings to Keras POS tagger?

Not using recurrent_dropout in colab crashing model?

Keras Accuracy and Loss not changing over a large period of epochs

Categories

Resources