Weighted average of embedding layer using Universal sentence encoder

In my dataframe I have two columns, text and score. text is a list of strings, e.g. [table, chair], and score is a list of numbers, e.g. [0.4, 0.2]. I am trying to use the Universal Sentence Encoder to take a weighted average inside the Keras model. The number of texts in the list may differ from row to row of the dataframe.
The weighted average I want looks like this: (0.4 * UniversalEncoder('table') + 0.2 * UniversalEncoder('chair')) / (0.4 + 0.2)
text_input = layers.Input(shape=(1,),name='text')
weight = layers.Input(shape=(1,),name='w')
embedding = layers.Lambda(my_lambda_func)([text_input,weight])
embedding[0].set_shape((None,512))
mul = 0.23
num_neuron = int(512*mul)
text_output=layers.Dense(num_neuron,input_shape=(512,))(embedding[0])
all_inputs = [text_input, weight]
preds = layers.Dense(1, activation="sigmoid")(text_output)
model = Model(inputs=all_inputs, outputs=preds)
lr = 0.0001
model.compile(
    loss="binary_crossentropy",
    optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
    metrics=["AUC"],
)
model.fit([text_train, score_train], df_train["y"], epochs=60)
text_train = np.array(list(x for x in df_train.text))
score_train = np.array(list(x for x in df_train.score))
def UniversalEncoder(x,weight):
    text = x
    weight = weight
    emb_vec = embed(text)
    vec = np.average(emb_vec,axis=0,weights = weight).flatten()
    vec = vec/np.linalg.norm(vec)
    return np.array(vec)
def my_lambda_func(x):
    result = tf.py_function(UniversalEncoder, [x[0],x[1]], [tf.float32])
    return result
embed() is the Universal Sentence Encoder, which returns a 512-dimensional vector for a given string.
I am getting "ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray)" when fitting the model. Please help me with this issue. Thanks in advance.
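One plausible cause of this error is that the rows hold lists of different lengths, so the columns cannot be turned into regular arrays. The sketch below is only an illustration under that assumption: it pads every row to the same length and gives the padded slots a weight of 0, which leaves the weighted average unchanged. The names toy_embed, pad_row and weighted_average are hypothetical, and toy_embed is a stand-in for the real encoder so that the example runs without TensorFlow.

```python
# Hypothetical stand-in for the Universal Sentence Encoder: maps a string to a
# small fixed-length vector so the example runs without TensorFlow Hub.
def toy_embed(text, dim=4):
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


def pad_row(texts, weights, max_len, pad_text=""):
    """Pad one row to max_len; padded slots get weight 0 so they are ignored."""
    n_pad = max_len - len(texts)
    return texts + [pad_text] * n_pad, weights + [0.0] * n_pad


def weighted_average(texts, weights, embed=toy_embed):
    """Return sum(w_i * embed(t_i)) / sum(w_i), component by component."""
    total = sum(weights)
    vectors = [embed(t) for t in texts]
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) / total
            for k in range(dim)]


rows = [(["table", "chair"], [0.4, 0.2]),
        (["lamp"], [0.7])]
max_len = max(len(texts) for texts, _ in rows)
padded = [pad_row(texts, weights, max_len) for texts, weights in rows]

# Every row now has the same length, so the columns can become regular arrays.
for texts, weights in padded:
    print(texts, weights, weighted_average(texts, weights))
```

With rows of equal length, the text and score columns can be stacked into rectangular arrays before being passed to the model. Whether this alone resolves the error depends on how the real inputs are built.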

Related

How to use custom function with tensorflow dataset API?

I am new to TensorFlow's tf.data.Dataset and I am trying to use it on data that I loaded into a pandas dataframe as follows:
Load the input data (df_input):
   id   messages               Label
0  11   I am not driving home  0
1  11   Please pick me up      1
2  103  The car already park   1
3  103  No need for ticket     0
4  104  I will buy a car       1
5  104  I will buy truck       1
Then I preprocess the data and apply text vectorization as follows:
text_vectorizer = layers.TextVectorization(max_tokens=20, output_mode="int", output_sequence_length=6)
text_vectorizer.adapt(df_input.message.values.tolist())
def encode(texts):
    encoded_texts = text_vectorizer(texts)
    return encoded_texts.numpy()
train_data = encode(df_input.message.values) ## This the training data
train_label = tf.keras.utils.to_categorical(df_input.label.values, 2) ## This labels
I then feed the preprocessed data to training through tf.data.Dataset as follows:
train_dataset_df = (
    tf.data.Dataset.from_tensor_slices((train_data, train_label))
    .shuffle(1000)
    .batch(2)
)
My question is how I can transform the data in every training epoch by applying my custom function to the training data. I saw an example of performing the transformation via the .map function in this post:
train_dataset = train_dataset.batch(2).map(lambda x, y: (text_vectorizer(x), y))
My goal is to apply my custom function as follows (which reorders the words in text data):
def order_augment_sent(Sentence):
    words = Sentence.split(" ")
    words.sort()
    newSentence = " ".join(words)
    return newSentence
train_dataset_ds = (
    tf.data.Dataset.from_tensor_slices((train_data, train_label))
    .shuffle(1000)
    .batch(2)
    .map(lambda x, y: (order_augment_sent(x), y))
)
But I am getting this error:
AttributeError: 'Tensor' object has no attribute 'split'
Or, if I apply my other custom function, I get:
TypeError: To be compatible with tf.function, Python functions must return zero or more Tensors or ExtensionTypes or None values; in compilation of <function _tf_if_stmt.<locals>.aug_body at 0124f565>, found return value of type WarningException, which is not a Tensor or ExtensionType.
I am not sure how I can do this and I will appreciate it if you have any idea or solution to help me.

The parameters your lambda function receives are the tokens produced by the vectorizer, so they are integers. If you want to reorder the text data, you need to do it before calling text_vectorizer.
So you should add the TextVectorization layer to your model; that way your map function receives the strings and you can reorder the sentence before TextVectorization is applied.
Here is an almost-working example. You just need to edit the order_augment_sent function with the code you need. I didn't know what kind of sorting you want to do; you will probably have to write a custom sort with numpy. See https://www.tensorflow.org/api_docs/python/tf/py_function
import tensorflow as tf
import numpy as np
train_data = ["I am not driving home", "Please pick me up", "The car already park", " No need for ticket", "I will buy a car", "I will buy truck"]
train_label = [0,1,1,0,1,1]
text_dataset = tf.data.Dataset.from_tensor_slices(train_data)
max_features = 5000 # Maximum vocab size.
max_len = 4 # Sequence length to pad the outputs to.
# Create the layer.
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=max_len)
# Now that the vocab layer has been created, call `adapt` on the text-only
# dataset to create the vocabulary. You don't have to batch, but for large
# datasets this means we're not keeping spare copies of the dataset.
vectorize_layer.adapt(train_data)
# Create the model that uses the vectorize text layer
model = tf.keras.models.Sequential()
# Start by creating an explicit input layer. It needs to have a shape of
# (1,) (because we need to guarantee that there is exactly one string
# input per batch), and the dtype needs to be 'string'.
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
# The first layer in our model is the vectorization layer. After this
# layer, we have a tensor of shape (batch_size, max_len) containing vocab
# indices.
model.add(vectorize_layer)
def apply_order_augment_sent(s):
    Sentence = s.decode('utf-8')
    words = Sentence.split(" ")
    words.sort()
    newSentence = " ".join(words)
    return(newSentence)
def order_augment_sent(x: np.ndarray, y:np.ndarray):
    new_x = []
    for i in range(len(x)):
        new_x.append(np.array([apply_order_augment_sent(x[i])]))
    print('new', new_x, y)
    return(new_x, y)
train_dataset_ds = tf.data.Dataset.from_tensor_slices((train_data, train_label))
train_dataset_ds = train_dataset_ds.shuffle(1000).batch(32)
train_dataset_ds = train_dataset_ds.map(lambda item1, item2: tf.numpy_function(
    order_augment_sent, [item1, item2], [tf.string, tf.int32]))
list(train_dataset_ds.as_numpy_iterator())
model.predict(train_dataset_ds)

Word-embedding does not provide expected relations between words

I am trying to train a word embedding on a list of repeated sentences where only the subject changes. I expected the vectors generated for the subjects to be strongly correlated after training, as one expects from a word embedding. However, the subject vectors are not always closer to each other (by the cosine measure computed below) than a subject vector is to a random word.
Man is going to write a very long novel that no one can read.
Woman is going to write a very long novel that no one can read.
Boy is going to write a very long novel that no one can read.
The code is based on a PyTorch tutorial:
import torch
from torch import nn
import torch.nn.functional as F
import numpy as np
class EmbedTrainer(nn.Module):
    def __init__(self, d_vocab, d_embed, d_context):
        super(EmbedTrainer, self).__init__()
        self.embed = nn.Embedding(d_vocab, d_embed)
        self.fc_1 = nn.Linear(d_embed * d_context, 128)
        self.fc_2 = nn.Linear(128, d_vocab)
    def forward(self, x):
        x = self.embed(x).view((1, -1)) # flatten after embedding
        x = self.fc_2(F.relu(self.fc_1(x)))
        x = F.log_softmax(x, dim=1)
        return x
text = " ".join(["{} is going to write a very long novel that no one can read.".format(x) for x in ["Man", "Woman", "Boy"]])
text_split = text.split()
trigrams = [([text_split[i], text_split[i+1]], text_split[i+2]) for i in range(len(text_split)-2)]
dic = list(set(text.split()))
tok_to_ids = {w:i for i, w in enumerate(dic)}
tokens_text = text.split(" ")
d_vocab, d_embed, d_context = len(dic), 10, 2
""" Train """
loss_func = nn.NLLLoss()
model = EmbedTrainer(d_vocab, d_embed, d_context)
print(model)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
losses = []
epochs = 10
for epoch in range(epochs):
    total_loss = 0
    for input, target in trigrams:
        tok_ids = torch.tensor([tok_to_ids[tok] for tok in input], dtype=torch.long)
        target_id = torch.tensor([tok_to_ids[target]], dtype=torch.long)
        model.zero_grad()
        log_prob = model(tok_ids)
        #if total_loss == 0: print("train ", log_prob, target_id)
        loss = loss_func(log_prob, target_id)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
    print(total_loss)
    losses.append(total_loss)
embed_map = {}
for word in ["Man", "Woman", "Boy", "novel"]:
    embed_map[word] = model.embed.weight[tok_to_ids[word]]
    print(word, embed_map[word])
def angle(a, b):
    from numpy.linalg import norm
    a, b = a.detach().numpy(), b.detach().numpy()
    return np.dot(a, b) / norm(a) / norm(b)
print("man.woman", angle(embed_map["Man"], embed_map["Woman"]))
print("man.novel", angle(embed_map["Man"], embed_map["novel"]))
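Note that the angle helper in the code above returns the cosine of the angle rather than the angle itself, so a larger value means the two vectors are more similar. The following is a small sketch of the distinction in plain Python; the function names cosine_similarity and angle_degrees are illustrative and not part of the original code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def angle_degrees(a, b):
    """Actual angle in degrees; clamp the cosine to guard against rounding error."""
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.degrees(math.acos(cos))


print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
print(angle_degrees([1.0, 0.0], [0.0, 1.0]))
```

When comparing "Man" with "Woman" against "Man" with "novel", related words should show a larger cosine, which corresponds to a smaller angle.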

"I expected that the generated vectors corresponding the subjects provide a strong correlation after training as it is expected from a word embedding"
I don't think you'll achieve that kind of result with only 3 sentences and roughly 40 training steps per epoch over 10 epochs (and most of the data in those 40 steps is repeated).
Maybe try downloading a couple of the free datasets out there, or try your own data with a proven model such as a gensim model.
I'll give you the code for training a gensim model, so you can test your dataset on another model and see whether the problem comes from your data or from your model.
I've tested similar gensim models on datasets with millions of sentences and they worked like a charm. For smaller datasets you might want to change the parameters.
from gensim.models import Word2Vec
from multiprocessing import cpu_count
corpus_path = 'eachLineASentence.txt'
vecSize = 300
winSize = 5
numWorkers = cpu_count()-1
epochs = 20
minCount = 5
skipGram = False
modelName = 'mymodel.model'
model = Word2Vec(corpus_file=corpus_path,
                 vector_size=vecSize,  # named `size` before gensim 4.0
                 window=winSize,
                 min_count=minCount,
                 workers=numWorkers,
                 epochs=epochs,        # named `iter` before gensim 4.0
                 sg=skipGram)
model.save(modelName)
P.S. I don't think it's a good idea to use the built-in name input as a variable in your code.

It's most probably the training size. Training a 128d embedding is definitely overkill. Rule of thumb from the Google Developers Blog:
Why is the embedding vector size 3 in our example? Well, the following "formula" provides a general rule of thumb about the number of embedding dimensions:
embedding_dimensions = number_of_categories**0.25
That is, the embedding vector dimension should be the 4th root of the number of categories. Since our vocabulary size in this example is 81, the recommended number of dimensions is 3:
3 = 81**0.25
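As a rough illustration, the quoted rule of thumb can be applied to the three sentences from the question. The helper name rule_of_thumb_dim is hypothetical, and rounding to the nearest integer is an assumption, since the quoted rule does not say how to round.

```python
def rule_of_thumb_dim(n_categories):
    """Fourth root of the number of categories, rounded, and at least 1."""
    return max(1, round(n_categories ** 0.25))


# Vocabulary of the question's corpus: same tokenization as text.split().
sentences = ["{} is going to write a very long novel that no one can read.".format(s)
             for s in ["Man", "Woman", "Boy"]]
vocab = {tok for sentence in sentences for tok in sentence.split()}

print(len(vocab), rule_of_thumb_dim(len(vocab)))
print(rule_of_thumb_dim(81))
```

This is only a heuristic for choosing a starting size; it says nothing about whether three sentences are enough data to learn meaningful vectors.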

How to debug Keras ValueError: No gradients provided for any variable?

How can I approach this error and what part of my code might be causing this? I tried looking up existing issues and SO threads but they mostly point towards a failing custom loss-calculating layer but I am using Keras' built-in loss.
My code:
dimension = 300
n_neighbors = 10
n_unique_candidate_pos = 500 # Defines number of unique positions a candidate can take in a given invoice. Hyper-parameter
running_first_time = True
tf.keras.backend.set_floatx('float64')
tf.keras.backend.clear_session()
for index, row in data.iterrows():
    print('Epoch {} started.'.format(index))
    image_path = row.pngFileLoc # getting path to image
    image_json = eval(row.json_file) # getting json
    image = Image.open(PATH_TO_DATA_FOLDER + image_path)
    # Getting actual class label
    actual_invoice_date = row.InvDate
    # 1. Getting candidates
    candidate_gen = CandidateGenerator(image, row)
    candidates = candidate_gen.generate_date()
    for c in candidates:
        '''
        Feed each candidate through the model by incrementally training it.
        '''
        # 2. Getting neighbors
        neighbors_gen = Neighbors(image, row, n_neighbors = n_neighbors)
        neighbors = neighbors_gen.get_neighbors(c)
        # 3. vectorizing the neighbors
        vect = Vectorize(dimension=300)
        neighbors_embedded = vect.vectorize(neighbors)
        # 4. getting absolute candidate position
        #absolute_cand_pos = get_absolute_cand_pos(c, image_json)
        candidate_normalized_vertices = get_normalized_cand_vertices(c, image_json, image)
        absolute_candidate_pos = neighbors_gen.get_centroid(candidate_normalized_vertices)
        # 5. is already done
        # 6. Maxpooling the neighbors
        neighborhood_encoding = NeighborhoodEncoding()
        neighbors_encoded = neighborhood_encoding.encode_neighbors(np.array(neighbors_embedded))
        # Trainable layers start here
        # 7. Embedding candidate absolute position
        input_candidate_pos_layer = Input(shape=(2,), name='input_layer')
        embedding_candidate_pos_layer = Embedding(input_dim = n_unique_candidate_pos, output_dim = dimension // 2, name='candidate_position_embedding_layer')(input_candidate_pos_layer)
        #We will get [2, dimension/2] output above as we are using 2d co-ordinates so flatten it out into [dimension]
        flatten_cand_pos_layer = Flatten(name = 'flatten_candidate_position_embedding_layer')(embedding_candidate_pos_layer)
        # 8. Concatenate neighborhood encoding and candidate position embedding
        sliced_flatten_cand_pos_layer = SliceLayer(index=0)(flatten_cand_pos_layer)
        concat_neighbor_candidate = Concatenate(name='concat_neighbors_candpos_layer')([tf.convert_to_tensor(neighbors_encoded, dtype='float64'), sliced_flatten_cand_pos_layer]) #I honestly have no idea why it requires me to slice the tensor
        #reshaping this
        reshape_concat_neighbor = Reshape((1, ), input_shape=concat_neighbor_candidate.shape)(concat_neighbor_candidate)
        transposed_reshape_concat_neighbor = TransposeLayer()(reshape_concat_neighbor)
        # 9. Reduce dimensionality of candidate encoding
        dense_dim_reduce_layer = Dense(units = dimension, activation = 'relu', name='dense_dim_reduc_layer')(transposed_reshape_concat_neighbor)
        flatten_dense_dim_reduce_layer = Flatten(name='flatten_dense_dim_reduc_layer')(dense_dim_reduce_layer)
        # 10. Compute cosine similarity between field_embedding and candidate_encoding and 11. do sigmoid
        sliced_flatten_dense_dim_layer = SliceLayer(index=0)(flatten_dense_dim_reduce_layer)
        cosine_sim_layer = CosineSimilarityLayer(name='cosine_sim_layer')(sliced_flatten_dense_dim_layer, field_embedded[0])
        # 12. Compute loss
        #y_pred = Output(name='output_layer')(tf.convert_to_tensor(cosine_sim_layer, dtype='float64'))
        y_pred = Output(name='output_layer')(tf.convert_to_tensor([cosine_sim_layer], dtype='float64'))
        y_actual = int(actual_invoice_date == c)
        if running_first_time:
            model = Model(inputs = input_candidate_pos_layer, outputs = y_pred)
            model.compile(loss='binary_crossentropy')
            running_first_time = False
            print('model initialized successfully.')
            print(model.summary())
        model.fit([np.asarray([absolute_candidate_pos]).astype('float64'), np.asarray([y_actual]).astype('float64')])
What it does:
I am iterating through each image and for each image I generate a list of 'candidates' that might belong to the actual-class. Each image is an invoice and each candidate for a class 'Invoice Date' could be all of the dates that appear in the image.
The actual class label is calculated dynamically by checking whether the candidate matches the human-annotated data.
Between layers, I apply some functions to the previous layer's output (as per this research paper). I made sure not to use any layers or TensorFlow operations that might break backprop, and this is what the summary of the model looks like after one iteration:
This is the error raised when I call model.fit():
ValueError: No gradients provided for any variable: ['candidate_position_embedding_layer/embeddings:0', 'dense_dim_reduc_layer/kernel:0', 'dense_dim_reduc_layer/bias:0'].

Multiple Input Keras Model

I want to train a Keras model where the input is a vector of size (20, 300).
The problem is that I also need to feed the model a fixed list of vectors that should be used on each training step.
The list of vectors is the same for all training examples. Here's what I've tried:
def create_model(num_filters=64, embedding_dim=300, seq_len=20):
    # input1 Shape (?,20,300)
    input1 = Input(shape=(seq_len,embedding_dim,), dtype='float32') # Input1 taken from the model input
    # input2 Shape (5,20,300)
    input2=get_input2() # Input2: taken from outside the model
    # CNN Encoding of Input 1
    convs = []
    filter_sizes = [1,2,3]
    for fsz in filter_sizes:
        x = Conv1D(num_filters, fsz, activation='relu',padding='same')(input1)
        x = MaxPooling1D()(x)
        convs.append(x)
    output1 = Concatenate(axis=-1)(convs)
    output1 = Flatten()(output1)
    # CNN Encoding of Input 2
    convs1 = []
    filter_sizes = [1,2,3]
    for fsz in filter_sizes:
        x1 = Conv1D(num_filters, fsz, activation='relu',padding='same')(input2)
        x1 = MaxPooling1D()(x1)
        convs1.append(x1)
    output2 = Concatenate(axis=-1)(convs1)
    output2 = Flatten()(output2)
However this implementation throws a value error.
"ValueError: Layer conv1d_60 was called with an input that isn't a
symbolic tensor. Received type: ."
How can this be done in Keras?

one hidden layer sufficient for auto-encoder to have output same as input

I am doing some work with a Theano-based autoencoder with one hidden layer, giving it samples from a mixture of Gaussians as input. I expected the output to be the same as the input, but I am not achieving that. My implementation was inspired by this tutorial. Is an autoencoder with only one hidden layer sufficient to recover an exact replica of the input?
My code looks like this:
def train(self, n_epochs=100, mini_batch_size=1, learning_rate=0.01):
    index = T.lscalar()
    x=T.matrix('x')
    params = [self.W, self.b1, self.b2]
    hidden = self.activation_function(T.dot(x, self.W)+self.b1)
    output = T.dot(hidden,T.transpose(self.W))+self.b2
    output = self.output_function(output)
    # Use mean square error
    L = T.sum((x - output) ** 2)
    cost = L.mean()
    updates=[]
    #Return gradient with respect to W, b1, b2.
    gparams = T.grad(cost,params)
    #Create a list of 2 tuples for updates.
    for param, gparam in zip(params, gparams):
        updates.append((param, param-learning_rate*gparam))
    #Train given a mini-batch of the data.
    train = th.function(inputs=[index], outputs=cost, updates=updates,
                        givens={x:self.X[index:index+mini_batch_size,:]})
    import time
    start_time = time.clock()
    acc_cost = []
    for epoch in xrange(n_epochs):
        #print "Epoch:", epoch
        for row in xrange(0,self.m, mini_batch_size):
            cost = train(row)
        acc_cost.append(cost)
    plt.plot(range(n_epochs), acc_cost)
    plt.ylabel("cost")
    plt.xlabel("epochs")
    plt.show()
    # Format input data for plotable format
    norm_data = self.X.get_value()
    plot_var1 = []
    plot_var1.append(norm_data[:,0])
    plot_var2 = []
    plot_var2.append(norm_data[:,1])
    plt.plot(plot_var1, plot_var2, 'ro')
    # Hidden output
    x=T.dmatrix('x')
    hidden = self.activation_function(T.dot(x,self.W)+self.b1)
    transformed_data = th.function(inputs=[x], outputs=[hidden])
    hidden_data = transformed_data(self.X.get_value())
    #print "hidden_output ", hidden_data[0]
    # final output
    y=T.dmatrix('y')
    W = T.transpose(self.W)
    output = self.activation_function(T.dot(y,W) + self.b2)
    transformed_data = th.function(inputs=[y], outputs=[output])
    output_data = transformed_data(hidden_data[0])[0]
    print "decoded_output ", output_data
    # Format output data for plotable format
    plot_var1 = []
    plot_var1.append(output_data[:,0])
    plot_var2 = []
    plot_var2.append(output_data[:,1])
    plt.plot(plot_var1, plot_var2, 'bo')
    plt.show()

In your code:
params = [self.W, self.b1, self.b2]
hidden = self.activation_function(T.dot(x, self.W)+self.b1)
output = T.dot(hidden,T.transpose(self.W))+self.b2
You are using the same weight matrix for both the encoder and the decoder. What about:
params = [self.W1, self.W2, self.b1, self.b2]
hidden = self.activation_function(T.dot(x, self.W1)+self.b1)
output = T.dot(hidden,self.W2)+self.b2
An autoencoder isn't PCA. If you want to use the same weights, it may be a good idea to constrain the weight matrix to be orthogonal.
Otherwise, making the AE deeper may help. With only one independent weight matrix, the proposed model can hardly behave as a universal function approximator the way a 3-layer MLP can.
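To make the untied-weights suggestion concrete, here is a minimal NumPy sketch (not Theano) of a one-hidden-layer autoencoder with separate encoder and decoder matrices, trained by plain gradient descent on toy data. The blob centers, the tanh hidden activation, the linear output layer and the hyperparameters are all assumptions made for illustration; they are not taken from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two Gaussian blobs in 2-D, loosely mirroring the question's setup.
X = np.vstack([rng.normal(loc=(2.0, 2.0), scale=0.5, size=(100, 2)),
               rng.normal(loc=(-2.0, -2.0), scale=0.5, size=(100, 2))])

n_in, n_hidden = 2, 2
W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))   # encoder weights
W2 = rng.normal(scale=0.1, size=(n_hidden, n_in))   # decoder weights (not W1.T)
b1 = np.zeros(n_hidden)
b2 = np.zeros(n_in)


def forward(X):
    hidden = np.tanh(X @ W1 + b1)
    output = hidden @ W2 + b2          # linear output layer
    return hidden, output


def mse(X):
    # Mean over samples of the summed squared reconstruction error.
    return float(np.mean(np.sum((X - forward(X)[1]) ** 2, axis=1)))


initial_loss = mse(X)
learning_rate = 0.01
for _ in range(2000):
    hidden, output = forward(X)
    d_output = 2.0 * (output - X) / len(X)               # dLoss/d_output
    d_hidden = (d_output @ W2.T) * (1.0 - hidden ** 2)   # back through tanh
    W2 -= learning_rate * hidden.T @ d_output
    b2 -= learning_rate * d_output.sum(axis=0)
    W1 -= learning_rate * X.T @ d_hidden
    b1 -= learning_rate * d_hidden.sum(axis=0)
final_loss = mse(X)

print(initial_loss, final_loss)
```

Comparing the two printed losses shows how much the reconstruction improves on this toy data. An exact replica of the input is still not guaranteed; how close the reconstruction gets depends on the hidden size, the activations and the training budget.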
