Pytorch working with different tensors length - python

I am working with word embeddings, and each phrase has different length.
My dataset contains a list of vectors, one vector of size 300 for each word in the phrase.
Maybe one phrase has 20 words, maybe it has 2, etc.
For instance:
X.iloc[0:10, ]
0 [[-0.51389, -0.55286, -0.28648, -0.18608, -0.0...
1 [[0.33837, -0.13933, -0.096114, 0.40103, 0.041...
2 [[-0.078564, 0.18702, -0.35168, 0.067557, 0.11...
3 [[0.047356, -0.10216, -0.15738, -0.04521, 0.26...
4 [[0.16781, -0.31793, -0.21877, 0.28025, 0.3364...
5 [[-0.4509, 0.077681, -0.058347, 0.2859, -0.369...
6 [[0.018223, -0.012323, 0.035569, 0.24232, -0.1...
7 [[-0.19265, 0.45863, -0.33841, -0.16293, -0.26...
8 [[0.10751, 0.15958, 0.13332, 0.16642, -0.03273...
9 [[0.35259, 0.60833, 0.051335, -0.079285, -0.35...
Name: embedding, dtype: object
len(X.iloc[0])
313
len(X.iloc[1])
2
The targets are just a numeric integer, from 0 to 5.
How can I pad this sequence using pytorch to feed a neural network of fixed size?
I saw something with collate_fn, however I think it only works for batches, and not for the whole dataset.

It seems correct that collate_fn can help in this case. Indeed, it is applied on batch level, but iterating through all batches will process entire dataset.
For padding you can use some of the following functions:
torch.nn.functional.pad
torch.nn.utils.rnn.pad_sequence
Doing it manually in a custom way is also an option.
Here is toy example with torch.nn.functional.pad and random dataset:
import torch
import random
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
MAX_PHRASE_LENGTH = 20
class PhraseDataset(Dataset):
def __len__(self):
return 5
def __getitem__(self, item):
phrase = torch.rand((random.randint(2, MAX_PHRASE_LENGTH), 300), dtype=torch.float)
target = torch.tensor(random.randint(0, 5), dtype=torch.float)
return phrase, target
def collate_fn(batch):
targets = []
phrases = []
for (phrase, target) in batch:
targets.append(target)
padding = MAX_PHRASE_LENGTH - phrase.size()[0]
if padding > 0:
phrases.append(F.pad(phrase, (0, 0, padding, 0), "constant", 0))
else:
phrases.append(phrase)
return torch.stack(phrases), torch.tensor(targets, dtype=torch.float)
dataset = PhraseDataset()
data_loader = DataLoader(dataset, batch_size=5, collate_fn=collate_fn)
for phrases, targets in data_loader:
print('Phrases size:', phrases.size())
Output:
Phrases size: torch.Size([5, 20, 300])
Although padding could actually be applied in a dataset, before collate_fn is called, so collate_fn won't be needed in this case at all.

Related

How to use custom function with tensorflow dataset API?

I am new to TensorFlow's tf.data.Dataset and I am trying to use it on my data that I loaded with pandas dataframe as follows:
Load the input date (df_input):
id messages Label
0 11 I am not driving home 0
1 11 Please pick me up 1
2 103 The car already park 1
3 103 No need for ticket 0
4 104 I will buy a car 1
5 104 I will buy truck 1
And I do preprocess and apply text Vectorization as follows:
text_vectorizer = layers.TextVectorization(max_tokens=20, output_mode="int", output_sequence_length=6)
text_vectorizer.adapt(df_input.message.values.tolist())
def encode(texts):
encoded_texts = text_vectorizer(texts)
return encoded_texts.numpy()
train_data = encode(df_input.message.values) ## This the training data
train_label = tf.keras.utils.to_categorical(df_input.label.values, 2) ## This labels
Then I am using the preprocess data in the training model by using the TensorFlow tf.data.Dataset function as follows:
train_dataset_df = (
tf.data.Dataset.from_tensor_slices((train_data, train_label))
.shuffle(1000)
.batch(2)
)
My question is how I can transform the data in every training epoch by applying my custom function to the training data. I saw a usage example of performing the transformation via .map function from here to this post:
train_dataset = train_dataset.batch(2).map(lambda x, y: (text_vectorizer(x), y))
My goal is to apply my custom function as follows (which reorders the words in text data):
def order_augment_sent(Sentence):
words = Sentence.split(" ")
words.sort()
newSentence = " ".join(words)
return newSentence
train_dataset_ds = (
tf.data.Dataset.from_tensor_slices((train_data, train_label))
.shuffle(1000)
.batch(2)
.map(lambda x, y: (order_augment_sent(x), y))
)
But I am getting error as:
AttributeError: 'Tensor' object has no attribute 'split'
Or if I apply my other cutom function, I am getting as:
TypeError: To be compatible with tf.function, Python functions must return zero or more Tensors or ExtensionTypes or None values; in compilation of <function _tf_if_stmt.<locals>.aug_body at 0124f565>, found return value of type WarningException, which is not a Tensor or ExtensionType.
I am not sure how I can do this and I will appreciate it if you have any idea or solution to help me.
The parameters you get in your lambda function are token from the vectors so they are int. If you want to reorder the text data, you need to do it before the text_vectorizer.
So you should add the TextVectorization layer to your model so your map function will have the string and you can reorder the sentance before calling the TextVectorization.
Here is an almost working exemple, you just need to edit the order_augment_sent function with the code you need, I didn't know what kind of sorting you want to do, probably you will have to write a custom sort with numpy https://www.tensorflow.org/api_docs/python/tf/py_function
import tensorflow as tf
import numpy as np
train_data = ["I am not driving home", "Please pick me up", "The car already park", " No need for ticket", "I will buy a car", "I will buy truck"]
train_label = [0,1,1,0,1,1]
text_dataset = tf.data.Dataset.from_tensor_slices(train_data)
max_features = 5000 # Maximum vocab size.
max_len = 4 # Sequence length to pad the outputs to.
# Create the layer.
vectorize_layer = tf.keras.layers.TextVectorization(
max_tokens=max_features,
output_mode='int',
output_sequence_length=max_len)
# Now that the vocab layer has been created, call `adapt` on the text-only
# dataset to create the vocabulary. You don't have to batch, but for large
# datasets this means we're not keeping spare copies of the dataset.
vectorize_layer.adapt(train_data)
# Create the model that uses the vectorize text layer
model = tf.keras.models.Sequential()
# Start by creating an explicit input layer. It needs to have a shape of
# (1,) (because we need to guarantee that there is exactly one string
# input per batch), and the dtype needs to be 'string'.
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
# The first layer in our model is the vectorization layer. After this
# layer, we have a tensor of shape (batch_size, max_len) containing vocab
# indices.
model.add(vectorize_layer)
def apply_order_augment_sent(s):
Sentence = s.decode('utf-8')
words = Sentence.split(" ")
words.sort()
newSentence = " ".join(words)
return(newSentence)
def order_augment_sent(x: np.ndarray, y:np.ndarray):
new_x = []
for i in range(len(x)):
new_x.append(np.array([apply_order_augment_sent(x[i])]))
print('new', new_x, y)
return(new_x, y)
train_dataset_ds = tf.data.Dataset.from_tensor_slices((train_data, train_label))
train_dataset_ds = train_dataset_ds.shuffle(1000).batch(32)
train_dataset_ds = train_dataset_ds.map(lambda item1, item2: tf.numpy_function(
order_augment_sent, [item1, item2], [tf.string, tf.int32]))
list(train_dataset_ds.as_numpy_iterator())
model.predict(train_dataset_ds)

How can I solve the wrong shape in DataLoader?

I have a text dataset that I want to use for a GAN and it should turn to onehotencode and this is how I Creating a Custom Dataset for my files
class Dataset2(torch.utils.data.Dataset):
def __init__(self, list_, labels):
'Initialization'
self.labels = labels
self.list_IDs = list_
def __len__(self):
'Denotes the total number of samples'
return len(self.list_IDs)
def __getitem__(self, index):
'Generates one sample of data'
# Select sample
mylist = self.list_IDs[index]
# Load data and get label
X = F.one_hot(mylist, num_classes=len(alphabet))
y = self.labels[index]
return X, y
It is working well and every time I call it, it works just fine but the problem is when I use DataLoader and try to use it, its shape is not the same as it just came out of the dataset, this is the shape that came out of the dataset
x , _ = dataset[1]
x.shape
torch.Size([1274, 22])
and this is the shape that came out dataloader
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)
one = []
for epoch in range(epochs):
for i, (real_data, _) in enumerate(dataloader):
one.append(real_data)
one[3].shape
torch.Size([4, 1274, 22])
this 4 is number of samples in my data but it should not be there, how can I fix this problem?
You confirmed you only had four elements in your dataset. You have wrapped your dataset with a data loader with batch_size=64 which is greater than 4. This means the dataloader will only output a single batch containing 4 elements.
In turn, this means you only append a single element per epoch, and one[3].shape is a batch (the only batch of the data loader), shaped (4, 1274, 22).

Word-embedding does not provide expected relations between words

I am trying to train a word embedding to a list of repeated sentences where only the subject changes. I expected that the generated vectors corresponding the subjects provide a strong correlation after training as it is expected from a word embedding. However, the angle between the vectors of subjects is not always larger than the angle between subjects and a random word.
Man is going to write a very long novel that no one can read.
Woman is going to write a very long novel that no one can read.
Boy is going to write a very long novel that no one can read.
The code is based on pytorch tutorial:
import torch
from torch import nn
import torch.nn.functional as F
import numpy as np
class EmbedTrainer(nn.Module):
def __init__(self, d_vocab, d_embed, d_context):
super(EmbedTrainer, self).__init__()
self.embed = nn.Embedding(d_vocab, d_embed)
self.fc_1 = nn.Linear(d_embed * d_context, 128)
self.fc_2 = nn.Linear(128, d_vocab)
def forward(self, x):
x = self.embed(x).view((1, -1)) # flatten after embedding
x = self.fc_2(F.relu(self.fc_1(x)))
x = F.log_softmax(x, dim=1)
return x
text = " ".join(["{} is going to write a very long novel that no one can read.".format(x) for x in ["Man", "Woman", "Boy"]])
text_split = text.split()
trigrams = [([text_split[i], text_split[i+1]], text_split[i+2]) for i in range(len(text_split)-2)]
dic = list(set(text.split()))
tok_to_ids = {w:i for i, w in enumerate(dic)}
tokens_text = text.split(" ")
d_vocab, d_embed, d_context = len(dic), 10, 2
""" Train """
loss_func = nn.NLLLoss()
model = EmbedTrainer(d_vocab, d_embed, d_context)
print(model)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
losses = []
epochs = 10
for epoch in range(epochs):
total_loss = 0
for input, target in trigrams:
tok_ids = torch.tensor([tok_to_ids[tok] for tok in input], dtype=torch.long)
target_id = torch.tensor([tok_to_ids[target]], dtype=torch.long)
model.zero_grad()
log_prob = model(tok_ids)
#if total_loss == 0: print("train ", log_prob, target_id)
loss = loss_func(log_prob, target_id)
total_loss += loss.item()
loss.backward()
optimizer.step()
print(total_loss)
losses.append(total_loss)
embed_map = {}
for word in ["Man", "Woman", "Boy", "novel"]:
embed_map[word] = model.embed.weight[tok_to_ids[word]]
print(word, embed_map[word])
def angle(a, b):
from numpy.linalg import norm
a, b = a.detach().numpy(), b.detach().numpy()
return np.dot(a, b) / norm(a) / norm(b)
print("man.woman", angle(embed_map["Man"], embed_map["Woman"]))
print("man.novel", angle(embed_map["Man"], embed_map["novel"]))
I expected that the generated vectors corresponding the subjects provide a strong correlation after training as it is expected from a word embedding
I don't really think you'll achieve that kind of result with only 3 sentences and like 40 iterations in 10 epochs (plus most of the data in your 40 iterations is repeated).
maybe try downloading a couple of free datasets out there, or try your own data with a proven model like a genism model.
I'll give you the code for training a gensim model, so you can test your dataset on another model and see if the problem comes from your data or from your model.
I've tested similar gensim models on datasets with millions of sentences and it worked like a charm, for smaller datasets you might want to change the parameters.
from gensim.models import Word2Vec
from multiprocessing import cpu_count
corpus_path = 'eachLineASentence.txt'
vecSize = 300
winSize = 5
numWorkers = cpu_count()-1
epochs = 20
minCount = 5
skipGram = False
modelName = f'mymodel.model'
model = Word2Vec(corpus_file=corpus_path,
size=vecSize,
window=winSize,
min_count=minCount,
workers=numWorkers,
iter=epochs,
sg=skipGram)
model.save(modelName)
P.S. I don't think it's a good idea to use the keyword input as a variable in your code.
It's most probably the training size. Training a 128d embedding is definitely overkill. Rule of thumb from the the google developers blog:
Why is the embedding vector size 3 in our example? Well, the following "formula" provides a general rule of thumb about the number of embedding dimensions:
embedding_dimensions = number_of_categories**0.25
That is, the embedding vector dimension should be the 4th root of the number of categories. Since our vocabulary size in this example is 81, the recommended number of dimensions is 3:
3 = 81**0.25

Keras Accuracy and Loss not changing over a large period of epochs

I am trying to create a Convolutional Neural Network to classify what language a certain "word" is from. There are two files ("english_words.txt" and "spanish_words.txt") which each contain about 60,000 words each. I have converted each word into a 29-dimensional vector where each element is a number between 0 and 1. I am training the model for 500 epochs with the optimizer "adam". However, when I train the model, the loss tends to hover around 0.7 and the accuracy around 0.5, and no matter how long I train it for, these metrics will not improve. Here is the code:
import keras
import numpy as np
from keras.layers import Dense
from keras.models import Sequential
import re
train_labels = []
train_data = []
with open("english_words.txt") as words:
full_words = words.read()
full_words = full_words.split("\n")
# all of the labels are just 1.
# we now need to encode them into 29 dimensional vectors.
vector = []
i = 0
for word in full_words:
train_labels.append([1,0])
for letter in word:
vector.append((ord(letter) - 96) * (1.0 / 26.0))
i += 1
if (i < 29):
for x in range(0, 29 - i):
vector.append(0)
train_data.append(vector)
vector = []
i = 0
with open("spanish_words.txt") as words:
full_words = words.read()
full_words = full_words.replace(' ', '')
full_words = full_words.replace('\n', ',')
full_words = full_words.split(",")
vector = []
for word in full_words:
train_labels.append([0,1])
for letter in word:
vector.append((ord(letter) - 96) * (1.0 / 26.0))
i += 1
if (i < 29):
for x in range(0, 29 - i):
vector.append(0)
train_data.append(vector)
vector = []
i = 0
def shuffle_in_unison(a, b):
assert len(a) == len(b)
shuffled_a = np.empty(a.shape, dtype=a.dtype)
shuffled_b = np.empty(b.shape, dtype=b.dtype)
permutation = np.random.permutation(len(a))
for old_index, new_index in enumerate(permutation):
shuffled_a[new_index] = a[old_index]
shuffled_b[new_index] = b[old_index]
return shuffled_a, shuffled_b
train_data = np.asarray(train_data, dtype=np.float32)
train_labels = np.asarray(train_labels, dtype=np.float32)
train_data, train_labels = shuffle_in_unison(train_data, train_labels)
print(train_data.shape, train_labels.shape)
model = Sequential()
model.add(Dense(29, input_shape=(29,)))
model.add(Dense(60))
model.add(Dense(40))
model.add(Dense(25))
model.add(Dense(2))
model.compile(optimizer="adam",
loss="categorical_crossentropy",
metrics=["accuracy"])
model.summary()
model.fit(train_data, train_labels, epochs=500, batch_size=128)
model.save("language_predictor.model")
For some extra info, I am running python 3.x with tensorflow 1.15 and keras 1.15 on windows x64.
I can see several potential problems with your code.
You added several Dense layers one after another, but you really need to also include a non-linear activation function with the parameter activation= .... In the absence of any non-linear activation functions, all those fully-connected Dense layers will mathematically collapse into one single linear Dense layer incapable of learning a non-linear decision boundary.
In general, if you see your loss and accuracy not making any improvement or even getting worse, then the first thing to try is to reduce your learning rate.
You don't need to necessarily implement your own shuffling function. The Keras fit() function can do it if you use the shuffle=True parameter.
In addition to the points mentioned by stackoverflowuser2010:
I find this a very good read and highly suggest checking the mentioned points: 37 Reasons why your Neural Network is not working
Center your input data: Compute a component-wise mean vector and subtract it from every input.

Program hangs on Estimator.evaluate in Tensorflow 1.6

As a learning tool, I am trying to do something simple.
I have two training CSV files:
One file with 36 columns (3500 records) with 0s and 1s. I am envisioning this file as a flattened 6x6 matrix.
I have another CSV file with 1 columnn of ground truth 0 or 1 (3500 records) which indicates if at least 4 of the 6 of elements in the 6x6 matrix's diagonal are 1's.
I also have two test CSV files which are the same structure as the training files except there are 500 records in each.
When I step through the program using the debugger, it appears that the...
estimator.train(
input_fn=lambda: get_inputs(x_paths=[x_train_file], y_paths=[y_train_file], batch_size=32), steps=100)
... runs OK. I see files in the checkpoint directory and see a loss function graph in Tensorboard.
But when the program gets to...
eval_result = estimator.evaluate(
input_fn=lambda: get_inputs(x_paths=[x_test_file], y_paths=[y_test_file], batch_size=32))
... it just hangs.
I have checked the test files and I also tried running the estimator.evaluate using the training files. Still hangs
I am using TensorFlow 1.6, Python 3.6
The following is all of the code:
import tensorflow as tf
import os
import numpy as np
x_train_file = os.path.join('D:', 'Diag', '6x6_train.csv')
y_train_file = os.path.join('D:', 'Diag', 'HasDiag_train.csv')
x_test_file = os.path.join('D:', 'Diag', '6x6_test.csv')
y_test_file = os.path.join('D:', 'Diag', 'HasDiag_test.csv')
model_chkpt = os.path.join('D:', 'Diag', "checkpoints")
def get_inputs(
count=None, shuffle=True, buffer_size=1000, batch_size=32,
num_parallel_calls=8, x_paths=[x_train_file], y_paths=[y_train_file]):
"""
Get x, y inputs.
Args:
count: number of epochs. None indicates infinite epochs.
shuffle: whether or not to shuffle the dataset
buffer_size: used in shuffle
batch_size: size of batch. See outputs below
num_parallel_calls: used in map. Note if > 1, intra-batch ordering
will be shuffled
x_paths: list of paths to x-value files.
y_paths: list of paths to y-value files.
Returns:
x: (batch_size, 6, 6) tensor
y: (batch_size, 2) tensor of 1-hot labels
"""
def x_map(line):
n_dims = 6
columns = [str(i1) for i1 in range(n_dims**2)]
# Decode the line into its fields
fields = tf.decode_csv(line, record_defaults=[[0]] * (n_dims ** 2))
# Pack the result into a dictionary
features = dict(zip(columns, fields))
return features
def y_map(line):
y_row = tf.string_to_number(line, out_type=tf.int32)
return y_row
def xy_map(x, y):
return x_map(x), y_map(y)
x_ds = tf.data.TextLineDataset(x_train_file)
y_ds = tf.data.TextLineDataset(y_train_file)
combined = tf.data.Dataset.zip((x_ds, y_ds))
combined = combined.repeat(count=count)
if shuffle:
combined = combined.shuffle(buffer_size)
combined = combined.map(xy_map, num_parallel_calls=num_parallel_calls)
combined = combined.batch(batch_size)
x, y = combined.make_one_shot_iterator().get_next()
return x, y
columns = [str(i1) for i1 in range(6 ** 2)]
feature_columns = [
tf.feature_column.numeric_column(name)
for name in columns]
estimator = tf.estimator.DNNClassifier(feature_columns=feature_columns,
hidden_units=[18, 9],
activation_fn=tf.nn.relu,
n_classes=2,
model_dir=model_chkpt)
estimator.train(
input_fn=lambda: get_inputs(x_paths=[x_train_file], y_paths=[y_train_file], batch_size=32), steps=100)
eval_result = estimator.evaluate(
input_fn=lambda: get_inputs(x_paths=[x_test_file], y_paths=[y_test_file], batch_size=32))
print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_result))
There are two parameters that are causing this:
tf.data.Dataset.repeat has a count parameter:
count: (Optional.) A tf.int64 scalar tf.Tensor, representing the
number of times the dataset should be repeated. The default behavior
(if count is None or -1) is for the dataset be repeated indefinitely.
In your case, count is always None, so the dataset is repeated indefinitely.
tf.estimator.Estimator.evaluate has the steps parameter:
steps: Number of steps for which to evaluate model. If None, evaluates until input_fn raises an end-of-input exception.
Steps are set for the training, but not for the evaluation, as a result the estimator is running until input_fn raises an end-of-input exception, which, as described above, never happens.
You should set either of those, I think count=1 is the most reasonable for evaluation.

Categories

Resources