I am testing a two-step architecture that is composed of a conventional first section that can be implemented with any standard deep learning architecture and a second section that must be coded manually outside the declaration of the Pytorch graph (while still utilizing numpy-like torch functions).
My problem can be simplified to coding a feed-forward neural network with two hidden layers, where the first is implemented within the Pytorch graph and the second is implemented manually outside the Pytorch graph.
Architecture:
Input
-> Linear(28 * 28, 120) w/ Pytorch
-> ReLU w/ Pytorch
-> Linear(120, 84) w/ Pytorch
-> ReLU w/ Pytorch
-> Linear(84, 10) w/o Pytorch
-> Output
Problem: My implementation below achieves a very low ~74%, while a standard fully Pytorch implementation achieves ~95%. What is causing this disparity?
I believe my problem lies in manually passing back the deltas, although the math looks right, so I am stuck in finding a solution to this.
Implementation of architecture and training on MNIST:
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.fc1 = nn.Linear(28 * 28, 120)
self.fc2 = nn.Linear(120, 84)
def forward(self, x):
x = x.view(-1, 28 * 28)
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
return x
net = Net()
criterion = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.01)
# Initialize weights just as Pytorch does by default:
m = torch.distributions.uniform.Uniform(torch.tensor([-np.sqrt(1.0/84)]),
torch.tensor([np.sqrt(1.0/84)]))
W = m.sample((84, 10)).reshape((84, 10))
# based on https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html
for epoch in range(2): # loop over the dataset multiple times
for i, data in enumerate(trainloader, 0):
# get the inputs
inputs, labels = data
# make one-hot encoding of labels
targets = oneHot(labels)
# zero the parameter gradients
optimizer.zero_grad()
# forward + backward + optimize
pytorch_outputs = net(inputs)
pytorch_outputs = torch.autograd.Variable(pytorch_outputs,
requires_grad=True)
manual_outputs = torch.mm(pytorch_outputs, W)
delta_out = manual_outputs - targets.view(-1,10) # = error_out
dEdW3 = torch.mm(torch.t(pytorch_outputs), delta_out)
W -= 0.01 * dEdW3 # gradient descent
delta_h = torch.autograd.Variable(
torch.t(torch.mm(W, torch.t(delta_out))))
loss = criterion(pytorch_outputs, delta_h)
loss.backward()
optimizer.step()
Full code:
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
transform = transforms.Compose(
[transforms.ToTensor(),
transforms.Normalize((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))])
trainset = torchvision.datasets.MNIST(root='./data', train=True,
download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=1,
shuffle=True, num_workers=1)
testset = torchvision.datasets.MNIST(root='./data', train=False,
download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=1,
shuffle=False, num_workers=1)
classes = ('0', '1', '2', '3',
'4', '5', '6', '7', '8', '9')
def oneHot(a):
b = torch.zeros(10)
b[a] = 1
return b
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.fc1 = nn.Linear(28 * 28, 120)
self.fc2 = nn.Linear(120, 84)
def forward(self, x):
x = x.view(-1, 28 * 28)
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
return x
net = Net()
criterion = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.01)
# Initialize weights just as Pytorch does by default:
m = torch.distributions.uniform.Uniform(torch.tensor([-np.sqrt(1.0/84)]),
torch.tensor([np.sqrt(1.0/84)]))
W = m.sample((84, 10)).reshape((84, 10))
# based on https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html
for epoch in range(2): # loop over the dataset multiple times
for i, data in enumerate(trainloader, 0):
# get the inputs
inputs, labels = data
# make one-hot encoding of labels
targets = oneHot(labels)
# zero the parameter gradients
optimizer.zero_grad()
# forward + backward + optimize
pytorch_outputs = net(inputs)
pytorch_outputs = torch.autograd.Variable(pytorch_outputs,
requires_grad=True)
manual_outputs = torch.mm(pytorch_outputs, W)
delta_out = manual_outputs - targets.view(-1,10) # = error_out
dEdW3 = torch.mm(torch.t(pytorch_outputs), delta_out)
W -= 0.001*dEdW3 # gradient descent
delta_h = torch.autograd.Variable(
torch.t(torch.mm(W, torch.t(delta_out))))
loss = criterion(pytorch_outputs, delta_h)
loss.backward()
optimizer.step()
correct = 0
total = 0
with torch.no_grad():
for data in testloader:
inputs, labels = data
pytorch_outputs = torch.autograd.Variable(net(inputs),
requires_grad=True)
manual_outputs = torch.mm(pytorch_outputs, W)
_, predicted = torch.max(manual_outputs.data, 1)
total += labels.size(0)
correct += (predicted == labels).sum().item()
print('Accuracy of the network on the 10000 test images: %d %%' % (
100 * correct / total))
Related
I'm completely new to neural nets, so I tried to roughly follow some tutorials to create a neural net that can just distinguish if a given binary picture contains a white circle or if it is all black. So, I generated 1000 arrays of size 10000 representing a 100x100 picture with half of them containing a white circle somewhere. The generation of my dataset looks like this:
for i in range(1000):
image = [0] * (IMAGE_SIZE * IMAGE_SIZE)
if random() < 0.5:
dataset.append([image, [[0]]])
else:
#inserts circle in image
#...
dataset.append([image, [[1]]])
np.random.shuffle(dataset)
np.save("testdataset.npy", dataset)
The double list around the classifications is because the net seemed to give that format as an output, so I matched that.
Now since I don't really have any precise idea of how pytorch works, I don't really now which parts of the code are relevant for solving my problem and which aren't. Therefore, I gave the code for the net and the training down below and really hope that someone can explain to me where I went wrong. I'm sorry if it's too much code. The code runs without errors, but if I print the parameters before and after training they didn't change in any way and the net will always just return a 0 for every image/array.
IMAGE_SIZE = 100
EPOCHS = 3
BATCH_SIZE = 50
VAL_PCT = 0.1
class Net(nn.Module):
def __init__(self):
super().__init__()
self.fc1 = nn.Linear(IMAGE_SIZE * IMAGE_SIZE, 64)
self.fc2 = nn.Linear(64, 64)
self.fc3 = nn.Linear(64, 64)
self.fc4 = nn.Linear(64, 1)
def forward(self, x):
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
x = F.relu(self.fc3(x))
x = self.fc4(x)
return F.log_softmax(x, dim = 1)
net = Net()
optimizer = optim.Adam(net.parameters(), lr = 0.01)
loss_function = nn.MSELoss()
dataset = np.load("testdataset.npy", allow_pickle = True)
X = torch.Tensor([i[0] for i in dataset]).view(-1, 10000)
y = torch.Tensor([i[1] for i in dataset])
val_size = int(len(X) * VAL_PCT)
train_X = X[:-val_size]
train_y = y[:-val_size]
test_X = X[-val_size:]
test_y = y[-val_size:]
for epoch in range(EPOCHS):
for i in range(0, len(train_X), BATCH_SIZE):
batch_X = train_X[i:i + BATCH_SIZE].view(-1, 1, 10000)
batch_y = train_y[i:i + BATCH_SIZE]
net.zero_grad()
outputs = net(batch_X)
loss = loss_function(outputs, batch_y)
loss.backward()
optimizer.step()
Instead of net.zero_grad() I would recommend using optimizer.zero_grad() as it's more common and de facto standard. Your training loop should be:
for epoch in range(EPOCHS):
for i in range(0, len(train_X), BATCH_SIZE):
batch_X = train_X[i:i + BATCH_SIZE].view(-1, 1, 10000)
batch_y = train_y[i:i + BATCH_SIZE]
optimizer.zero_grad()
outputs = net(batch_X)
loss = loss_function(outputs, batch_y)
loss.backward()
optimizer.step()
I would recommend you reading a bit about different loss functions. It seems you have a classification problem, for that you should use the logits (binary classification) or cross entropy (multi class) loss. I would make the following changes to the network and loss function:
class Net(nn.Module):
def __init__(self):
super().__init__()
self.fc1 = nn.Linear(IMAGE_SIZE * IMAGE_SIZE, 64)
self.fc2 = nn.Linear(64, 64)
self.fc3 = nn.Linear(64, 64)
self.fc4 = nn.Linear(64, 1)
def forward(self, x):
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
x = F.relu(self.fc3(x))
x = self.fc4(x)
return x
loss_function = nn.BCEWithLogitsLoss()
Check the documentation before using it: https://pytorch.org/docs/stable/nn.html#bcewithlogitsloss
Good luck!
First, It is not ideal to use Neural networks for address this kind of problems. Neural Networks train on highly non-linear data. For this example, You can use average intensities of image to find out a white pixel is present or not
However, A classic logistic regression problem outputs a value from 0 to 1 or probabilities
Softmax function is used when you have multiple classes and convert all the sum of classes equal to 1
log_softmax implementation: log( exp(x_i) / exp(x).sum() ). Here, your output layer consists of only 1 neuron. outputs = net(batch_X) is always 1.
This code attempts to utilize a custom implementation of dropout :
%reset -f
import torch
import torch.nn as nn
# import torchvision
# import torchvision.transforms as transforms
import torch
import torch.nn as nn
import torch.utils.data as data_utils
import numpy as np
import matplotlib.pyplot as plt
import torch.nn.functional as F
num_epochs = 1000
number_samples = 10
from sklearn.datasets import make_moons
from matplotlib import pyplot
from pandas import DataFrame
# generate 2d classification dataset
X, y = make_moons(n_samples=number_samples, noise=0.1)
# scatter plot, dots colored by class value
x_data = [a for a in enumerate(X)]
x_data_train = x_data[:int(len(x_data) * .5)]
x_data_train = [i[1] for i in x_data_train]
x_data_train
y_data = [y[i[0]] for i in x_data]
y_data_train = y_data[:int(len(y_data) * .5)]
y_data_train
x_test = [a[1] for a in x_data[::-1][:int(len(x_data) * .5)]]
y_test = [a for a in y_data[::-1][:int(len(y_data) * .5)]]
x = torch.tensor(x_data_train).float() # <2>
print(x)
y = torch.tensor(y_data_train).long()
print(y)
x_test = torch.tensor(x_test).float()
print(x_test)
y_test = torch.tensor(y_test).long()
print(y_test)
class Dropout(nn.Module):
def __init__(self, p=0.5, inplace=False):
# print(p)
super(Dropout, self).__init__()
if p < 0 or p > 1:
raise ValueError("dropout probability has to be between 0 and 1, "
"but got {}".format(p))
self.p = p
self.inplace = inplace
def forward(self, input):
print(list(input.shape))
return np.random.binomial([np.ones((len(input),np.array(list(input.shape))))],1-dropout_percent)[0] * (1.0/(1-self.p))
def __repr__(self):
inplace_str = ', inplace' if self.inplace else ''
return self.__class__.__name__ + '(' \
+ 'p=' + str(self.p) \
+ inplace_str + ')'
class MyLinear(nn.Linear):
def __init__(self, in_feats, out_feats, drop_p, bias=True):
super(MyLinear, self).__init__(in_feats, out_feats, bias=bias)
self.custom_dropout = Dropout(p=drop_p)
def forward(self, input):
dropout_value = self.custom_dropout(self.weight)
return F.linear(input, dropout_value, self.bias)
my_train = data_utils.TensorDataset(x, y)
train_loader = data_utils.DataLoader(my_train, batch_size=2, shuffle=True)
my_test = data_utils.TensorDataset(x_test, y_test)
test_loader = data_utils.DataLoader(my_train, batch_size=2, shuffle=True)
# Device configuration
device = 'cpu'
print(device)
# Hyper-parameters
input_size = 2
hidden_size = 100
num_classes = 2
learning_rate = 0.0001
pred = []
# Fully connected neural network with one hidden layer
class NeuralNet(nn.Module):
def __init__(self, input_size, hidden_size, num_classes, p):
super(NeuralNet, self).__init__()
# self.drop_layer = nn.Dropout(p=p)
# self.drop_layer = MyLinear()
# self.fc1 = MyLinear(input_size, hidden_size, p)
self.fc1 = MyLinear(input_size, hidden_size , p)
self.relu = nn.ReLU()
self.fc2 = nn.Linear(hidden_size, num_classes)
def forward(self, x):
# out = self.drop_layer(x)
out = self.fc1(x)
out = self.relu(out)
out = self.fc2(out)
return out
model = NeuralNet(input_size, hidden_size, num_classes, p=0.9).to(device)
# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
# Train the model
total_step = len(train_loader)
for epoch in range(num_epochs):
for i, (images, labels) in enumerate(train_loader):
# Move tensors to the configured device
images = images.reshape(-1, 2).to(device)
labels = labels.to(device)
# Forward pass
outputs = model(images)
loss = criterion(outputs, labels)
# Backward and optimize
optimizer.zero_grad()
loss.backward()
optimizer.step()
if (epoch) % 100 == 0:
print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch+1, num_epochs, i+1, total_step, loss.item()))
Custom dropout is implemented as :
class Dropout(nn.Module):
def __init__(self, p=0.5, inplace=False):
# print(p)
super(Dropout, self).__init__()
if p < 0 or p > 1:
raise ValueError("dropout probability has to be between 0 and 1, "
"but got {}".format(p))
self.p = p
self.inplace = inplace
def forward(self, input):
print(list(input.shape))
return np.random.binomial([np.ones((len(input),np.array(list(input.shape))))],1-dropout_percent)[0] * (1.0/(1-self.p))
def __repr__(self):
inplace_str = ', inplace' if self.inplace else ''
return self.__class__.__name__ + '(' \
+ 'p=' + str(self.p) \
+ inplace_str + ')'
class MyLinear(nn.Linear):
def __init__(self, in_feats, out_feats, drop_p, bias=True):
super(MyLinear, self).__init__(in_feats, out_feats, bias=bias)
self.custom_dropout = Dropout(p=drop_p)
def forward(self, input):
dropout_value = self.custom_dropout(self.weight)
return F.linear(input, dropout_value, self.bias)
It seems I've implemented the dropout function incorrectly ? :
np.random.binomial([np.ones((len(input),np.array(list(input.shape))))],1-dropout_percent)[0] * (1.0/(1-self.p))
How to modify in order to correctly utilize dropout ?
These posts were useful in getting to this point :
Hinton's Dropout in 3 Lines of Python :
https://iamtrask.github.io/2015/07/28/dropout/
Making a Custom Dropout Function : https://discuss.pytorch.org/t/making-a-custom-dropout-function/14053/2
It seems I've implemented the dropout function incorrectly?
np.random.binomial([np.ones((len(input),np.array(list(input.shape))))],1 dropout_percent)[0] * (1.0/(1-self.p))
In fact, the above implementation is known as Inverted Dropout. Inverted Dropout is how Dropout is implemented in practice in the various deep learning frameworks.
What is inverted dropout?
Before jump into the inverted dropout, it can be helpful to see how Dropout works for a single neuron:
Since during train phase a neuron is kept on with probability q (=1-p), during the testing phase we have to emulate the behavior of the ensemble of networks used in the training phase. To this end, the authors suggest scaling the activation function by a factor of q during the test phase in order to use the expected output produced in the training phase as the single output required in the test phase (Section 10, Multiplicative Gaussian Noise). Thus:
Inverted dropout is a bit different. This approach consists in the scaling of the activations during the training phase, leaving the test phase untouched. The scale factor is the inverse of the keep probability 1/1-p = 1/q, thus:
Inverted dropout helps to define the model once and just change a parameter (the keep/drop probability) to run train and test on the same model. Direct Dropout, instead, force you to modify the network during the test phase because if you don’t multiply by q the output the neuron will produce values that are higher respect to the one expected by the successive neurons (thus the following neurons can saturate or explode): that’s why Inverted Dropout is the more common implementation.
References:
Dropout Regularization, coursera by Andrew NG
What is inverted dropout?
Dropout: scaling the activation versus inverting the dropout
Analysis of Dropout
How implement inverted dropout Pytorch?
class MyDropout(nn.Module):
def __init__(self, p: float = 0.5):
super(MyDropout, self).__init__()
if p < 0 or p > 1:
raise ValueError("dropout probability has to be between 0 and 1, " "but got {}".format(p))
self.p = p
def forward(self, X):
if self.training:
binomial = torch.distributions.binomial.Binomial(probs=1-self.p)
return X * binomial.sample(X.size()) * (1.0/(1-self.p))
return X
How to implement in Numpy?
import numpy as np
pKeep = 0.8
weights = np.ones([1, 5])
binary_value = np.random.rand(weights.shape[0], weights.shape[1]) < pKeep
res = np.multiply(weights, binary_value)
res /= pKeep # this line is called inverted dropout technique
print(res)
How to implement in Tensorflow?
import tensorflow as tf
tf.enable_eager_execution()
weights = tf.ones(shape=[1, 5])
keep_prob = 0.8
random_tensor = keep_prob
random_tensor += tf.random_uniform(weights.shape)
# 0. if [keep_prob, 1.0) and 1. if [1.0, 1.0 + keep_prob)
binary_tensor = tf.floor(random_tensor)
ret = tf.div(weights, keep_prob) * binary_tensor
print(ret)
Implementation with Torch and bernoulli..
def forward(self, x):
output = x # self.W.t() + self.bias
if self.training:
sample = torch.distributions.bernoulli.Bernoulli(self.keep_prob).sample(output.size())
print(sample)
return output * sample
return output
I am trying to solve the very simple non-linear problem. It is XOR gate.
I my school knowledge. XOR can be solve by using 2 input nodes, 2 hidden layer nodes. And 1 output. It is binary classification problem.
I generate the 1000 of random integer number it is 0 or 1 and then do backpropagation. But for some unknown reason my network has not learned anything. The training accuracy is constant at 50.
# coding: utf-8
import matplotlib
import torch
import torch.nn as nn
from torch.autograd import Variable
matplotlib.use('TkAgg') # My buggy OSX 10.13.6 requires this
import matplotlib.pyplot as plt
from torch.utils.data import Dataset
from tqdm import tqdm
import random
N = 1000
batch_size = 10
epochs = 40
hidden_size = 2
output_size = 1
lr = 0.1
def return_xor(N):
tmp_x = []
tmp_y = []
for i in range(N):
a = (random.randint(0, 1) == 1)
b = (random.randint(0, 1) == 1)
if (a and not b) or (not a and b):
q = True
else:
q = False
input_features = (a, b)
output_class = q
tmp_x.append(input_features)
tmp_y.append(output_class)
return tmp_x, tmp_y
# In[495]:
# Training set
x, y = return_xor(N)
x = torch.tensor(x, dtype=torch.float, requires_grad=True)
y = torch.tensor(y, dtype=torch.float, requires_grad=True)
# Test dataset
x_test, y_test = return_xor(100)
x_test = torch.tensor(x_test)
y_test = torch.tensor(y_test)
class MyDataset(Dataset):
"""Define my own `Dataset` in order to use `Variable` with `autograd`"""
def __init__(self, x, y):
self.x = x
self.y = y
def __getitem__(self, index):
return self.x[index], self.y[index]
def __len__(self):
return len(self.x)
dataset = MyDataset(x, y)
test_dataset = MyDataset(x_test, y_test)
print(dataset.x.shape)
print(dataset.y.shape)
# Make data iterable by loading to a loader. Shuffle, batch_size kwargs put them here in order to remind I myself
train_loader = torch.utils.data.DataLoader(dataset=dataset, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)
print(f"They are {len(train_loader)} batches in the dataset")
shown = 0
for (x, y) in train_loader:
if shown == 1:
break
print(f"{x.shape} {x.dtype}")
print(f"{y.shape} {y.dtype}")
shown += 1
class MyModel(nn.Module):
"""
Binary classification
2 input nodes
2 hidden nodes
1 output node
"""
def __init__(self, input_size, hidden_size, output_size):
super().__init__()
self.fc1 = torch.nn.Linear(input_size, hidden_size)
self.fc2 = torch.nn.Linear(hidden_size, output_size)
self.sigmoid = torch.nn.Sigmoid()
def forward(self, out):
out = self.fc1(out)
out = self.fc2(out)
out = self.sigmoid(out)
return out
# Create my network
net = MyModel(dataset.x.shape[1], hidden_size, output_size)
CUDA = torch.cuda.is_available()
if CUDA:
net = net.cuda()
criterion = torch.nn.BCELoss(reduction='elementwise_mean')
optimizer = torch.optim.SGD(net.parameters(), lr=lr)
# Train the network
correct_train = 0
total_train = 0
for epoch in range(epochs):
for i, (batches, labels) in enumerate(train_loader):
batcesh = Variable(batches.float())
labels = Variable(labels.float())
output = net(batches) # Forward pass
optimizer.zero_grad()
loss = criterion(output, labels.view(10, 1))
loss.backward()
optimizer.step()
total_train += labels.size(0)
correct_train += (predicted == labels.long()).sum()
if (i + 1) % 10 == 0:
print(f"""
Epoch {epoch+1}/{epochs},
Iteration {i+1}/{len(dataset)//batch_size},
Training Loss: {loss.item()},
Training Accuracy: {100*correct_train/total_train}
""")
Solution:
I did initialized weight, Adaptive learning rate
https://github.com/elcolie/nnbootcamp/blob/master/Study-XOR.ipynb
I am not sure what results you are getting, as the code you have posted in the question doesn't work (It gives errors with pytorch 0.4.1 like predicted not defined etc). But syntax issues apart, there are other problems.
Your model is not actually two layer as it does not use non-linearity after the first output. Effectively this is one layer network and to fix that you can modify your model's forward as follows:
def forward(self, out):
out = torch.nn.functional.relu(self.fc1(out))
out = self.fc2(out)
out = self.sigmoid(out)
return out
You can try sigmoid or tanh non-linearity as well... but the non-linearity is a must. This should fix the problem.
I also see that you are using only 2 hidden units. This might be restrictive and you might want to increase that to something like 5 or 10.
Using below code I create 10 instances of each training data of which each has 100 dimensions.
Each of the 100 dimension contains 3 dimensions. Therefore it's shape is : (3, 100, 10). This emulates 10 instances of 100 pixels each with 3 channels to emulate an RGB value
I've set this model to just classify between 1 and 0.
When applying the softmax layer I receive the error :
RuntimeError: Expected tensor for argument #1 'input' to have the same
dimension as tensor for 'result'; but 4 does not equal 3 (while
checking arguments for cudnn_convolution)
I'm using 0.4.0 (checked using print(torch.__version__) )
How to correctly set the dimension for the softmax layer? As I think my dimensions are correct?
%reset -f
import os
import torch
from skimage import io, transform
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, utils
import torch.utils.data as data_utils
import torchvision
import numpy as np
from sklearn.preprocessing import scale
import torch.nn.functional as F
import torch.nn as nn
import torch.nn.functional as F
from random import randint
batch_size_value = 10
train_dataset = []
mu, sigma = 0, 0.1 # mean and standard deviation
num_instances = 10
# Create 3000 instance and reshape to (3 , 100, 10) , this emulates 10 instances of 100 pixels
# each with 3 channels to emulate an RGB value
for i in range(num_instances) :
image = []
image_x = np.random.normal(mu, sigma, 3000).reshape((3 , 100, 10))
train_dataset.append(image_x)
mu, sigma = 100, 0.80 # mean and standard deviation
for i in range(num_instances) :
image = []
image_x = np.random.normal(mu, sigma, 3000).reshape((3 , 100, 10))
train_dataset.append(image_x)
labels_1 = [1 for i in range(num_instances)]
labels_0 = [0 for i in range(num_instances)]
labels = labels_1 + labels_0
print(labels)
x2 = torch.tensor(train_dataset).float()
y2 = torch.tensor(labels).long()
my_train2 = data_utils.TensorDataset(x2, y2)
train_loader2 = data_utils.DataLoader(my_train2, batch_size=batch_size_value, shuffle=False)
# print(x2)
# Device configuration
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print('device' , device)
# device = 'cpu'
# Hyper parameters
num_epochs = 10
num_classes = 2
batch_size = 5
learning_rate = 0.001
# Convolutional neural network (two convolutional layers)
class ConvNet(nn.Module):
def __init__(self):
super(ConvNet, self).__init__()
self.conv1 = nn.Conv2d(3, 6, kernel_size=5)
self.conv2 = nn.Conv2d(6, 16, kernel_size=5)
self.fc1 = nn.Linear(864, 120)
self.fc2 = nn.Linear(120, 84)
self.fc3 = nn.Linear(84, num_classes)
def forward(self, x):
out = F.relu(self.conv1(x))
out = F.max_pool2d(out, 2)
out = F.relu(self.conv2(out))
out = F.max_pool2d(out, 2)
out = out.view(out.size(0), -1)
out = F.relu(self.fc1(out))
out = F.relu(self.fc2(out))
out = self.fc3(out)
model = ConvNet().to(device)
# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
# Train the model
total_step = len(train_loader2)
for epoch in range(num_epochs):
for i, (images, labels) in enumerate(train_loader2):
images = images.to(device)
labels = labels.to(device)
# Forward pass
outputs = model(images)
# print(images)
loss = criterion(outputs, labels)
# Backward and optimize
optimizer.zero_grad()
loss.backward()
optimizer.step()
if (i % 10) == 0:
print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
.format(epoch+1, num_epochs, i+1, total_step, loss.item()))
Update :
Removing these lines :
out = F.relu(self.conv2(out))
out = F.max_pool2d(out, 2)
fixes issue of dimensions being smaller than kernel.
There is one structured problem and one bug in your code. Here is the solution.
class ConvNet(nn.Module):
def __init__(self):
super(ConvNet, self).__init__()
self.conv1 = nn.Conv2d(3, 6, kernel_size=5)
self.conv2 = nn.Conv2d(6, 16, kernel_size=1) # kernel_size 5----> 1
self.fc1 = nn.Linear(384, 120) # FC NodeCount864--> 384
self.fc2 = nn.Linear(120, 84)
self.fc3 = nn.Linear(84, num_classes)
def forward(self, x):
out = F.relu(self.conv1(x))
out = F.max_pool2d(out, 2)
out = F.relu(self.conv2(out))
out = F.max_pool2d(out, 2)
out = out.view(out.size(0), -1)
out = F.relu(self.fc1(out))
out = F.relu(self.fc2(out))
out = self.fc3(out)
return out # Don't forget to return the output
I am doing a sequence to label learning model in PyTorch. I have two sentences and I am classifying whether they are entailed or not (SNLI dataset). I concatenate two 50 word sentences together (sometimes padded) into a vector of length 100. I then send in minibatches into word embeddings -> LSTM -> Linear layer. I am doing cross entropy loss but I need a vector of [mini_batch, C] to go into the CrossEntropyLoss function. Instead I still have the 100 words in my vector as [mini_batch, 100, C]
Here is my model:
class myLSTM(nn.Module):
def __init__(self, h_size=128, v_size=10, embed_d=300, mlp_d=256):
super(myLSTM, self).__init__()
self.embedding = nn.Embedding(v_size, embed_d)
self.lstm = nn.LSTM(embed_d, h_size, num_layers=1, bidirectional=True, batch_first=True)
self.mlp = nn.Linear(mlp_d, 1024)
# Set static embedding vectors
self.embedding.weight.requires_grad = False
#self.sm = nn.CrossEntropyLoss()
def display(self):
for param in self.parameters():
print(param.data.size())
def filter_params(self):
# Might not be compatible with python 3
#self.parameters = filter(lambda p: p.requires_grad, self.parameters())
pass
def init_hidden(self):
# Need to init hidden weights in LSTM
pass
def forward(self, sentence):
print(sentence.size())
embeds = self.embedding(sentence)
print(embeds.size())
out, _ = self.lstm(embeds)
print(out.size())
out = self.mlp(out)
return out
My training sequences with output:
batch_size = 3
SGD_optimizer = optim.SGD(filter(lambda p: p.requires_grad, model.parameters()), lr=0.01, weight_decay=1e-4)
ADM_optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=0.01)
criterion = nn.CrossEntropyLoss()
num_epochs = 50
from torch.autograd import Variable
from torch import optim
for epoch in range(num_epochs):
print("Epoch {0}/{1}: {2}%".format(epoch, num_epochs, float(epoch)/num_epochs))
for start, end in tqdm(batch_index_gen(batch_size, len(n_data))):
# Convert minibatch to numpy
s1, s2, y = convert_to_numpy(n_data[start:end])
# Convert numpy to Tensor
res = np.concatenate((s1,s2), axis=1) # Attach two sentences into 1 input vector
input_tensor = torch.from_numpy(res).type(torch.LongTensor)
target_tensor = torch.from_numpy(y).type(torch.FloatTensor)
data, target = Variable(input_tensor), Variable(target_tensor)
# Zero gradients
SGD_optimizer.zero_grad()
# Forward Pass
output = model.forward(data)
print("Output size: ")
print(output.size())
print("Target size: ")
print(target.size())
# Calculate loss with respect to training labels
loss = criterion(output, target)
# Backprogogate and update optimizer
loss.backward()
SGD_optimizer.step()
#ADAM_optimizer.step()
output:
Epoch 0/50: 0.0%
torch.Size([3, 100])
torch.Size([3, 100, 300])
torch.Size([3, 100, 256])
Output size:
torch.Size([3, 100, 1024])
Target size:
torch.Size([3])
error:
ValueError: Expected 2 or 4 dimensions (got 3)
EDITED -------------------------------------------------------------------
I have now got my model training but I am getting low accuracy. Is there an issue with my LSTM outputs being concatenated and then condensed to a smaller tensor to go through my linear layer?
New Model:
class myLSTM(nn.Module):
def __init__(self, h_size=128, v_size=10, embed_d=300, mlp_d=256, num_classes=3, lstm_layers=1):
super(myLSTM, self).__init__()
self.num_layers = lstm_layers
self.hidden_size = h_size
self.embedding = nn.Embedding(v_size, embed_d)
self.lstm = nn.LSTM(embed_d, h_size, num_layers=lstm_layers, bidirectional=True, batch_first=True)
self.mlp = nn.Linear(2 * h_size * 2, num_classes)
# Set static embedding vectors
self.embedding.weight.requires_grad = False
def forward(self, s1, s2):
# Set initial states
#h0 = Variable(torch.zeros(self.num_layers*2, s1.size(0), self.hidden_size)).cuda() # 2 for bidirection
#c0 = Variable(torch.zeros(self.num_layers*2, s1.size(0), self.hidden_size)).cuda()
batch_size = s1.size()[0]
embeds_1 = self.embedding(s1)
embeds_2 = self.embedding(s2)
_, (h_1_last, _) = self.lstm(embeds_1)#, (h0, c0)) #note the change here. Last hidden state is taken
_, (h_2_last, _) = self.lstm(embeds_2)#, (h0, c0))
concat = torch.cat( (h_1_last, h_2_last), dim=2) #double check the dimension
concat = concat.view(batch_size, -1)
scores = self.mlp(concat)
return scores
New Training
batch_size = 64
SGD_optimizer = optim.SGD(filter(lambda p: p.requires_grad, model.parameters()), lr=0.001, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()
num_epochs = 10
model.train()
if cuda:
model = model.cuda()
criterion = criterion.cuda()
from torch.autograd import Variable
from torch import optim
epoch_losses = []
for epoch in range(num_epochs):
print("Epoch {0}/{1}: {2}%".format(epoch, num_epochs, 100*float(epoch)/num_epochs))
# Batch loss aggregator
losses = []
for start, end in tqdm(batch_index_gen(batch_size, len(n_data))):
# Convert minibatch to numpy
s1, s2, y = convert_to_numpy(n_data[start:end])
# Convert numpy to Tensor
s1_tensor = torch.from_numpy(s1).type(torch.LongTensor)
s2_tensor = torch.from_numpy(s2).type(torch.LongTensor)
target_tensor = torch.from_numpy(y).type(torch.LongTensor)
s1 = Variable(s1_tensor)
s2 = Variable(s2_tensor)
target = Variable(target_tensor)
if cuda:
s1 = s1.cuda()
s2 = s2.cuda()
target = target.cuda()
# Zero gradients
SGD_optimizer.zero_grad()
# Forward Pass
output = model.forward(s1,s2)
# Calculate loss with respect to training labels
loss = criterion(output, target)
losses.append(loss.data[0])
# Backprogogate and update optimizer
loss.backward()
SGD_optimizer.step()
# concat losses to epoch losses
epoch_losses += losses
training with tensor sizes printed:
Epoch 0/10: 0.0%
Batch size: 64
Sentences
torch.Size([64, 50])
torch.Size([64, 50])
torch.Size([64, 50, 300])
torch.Size([64, 50, 300])
Hidden states
torch.Size([2, 64, 128])
torch.Size([2, 64, 128])
Concatenated hidden states
torch.Size([2, 64, 256])
Reshaped tensors for linear layer
torch.Size([64, 512])
Linear propogation
torch.Size([64, 3])
Evaluation
def eval_model(model, mode='dev'):
file_name = 'snli_1.0/snli_1.0_dev.jsonl' if mode == 'dev' else 'snli_1.0/snli_1.0_test.jsonl'
dev_data, _ = obtain_data(file_name)
dev_n_data = vocab.process_data(dev_data)
print("Length of data: {}".format(len(dev_n_data)))
eval_batch_size = 1024
model.eval()
total = len(dev_n_data)
hit = 0
correct = 0
# Batch dev eval
for start, end in batch_index_gen(eval_batch_size, len(dev_n_data)):
s1, s2, y = convert_to_numpy(dev_n_data[start:end])
s1_tensor = torch.from_numpy(s1).type(torch.LongTensor)
s2_tensor = torch.from_numpy(s2).type(torch.LongTensor)
target_tensor = torch.from_numpy(y).type(torch.LongTensor)
s1 = Variable(s1_tensor, volatile=True)
s2 = Variable(s2_tensor, volatile=True)
target = Variable(target_tensor, volatile=True)
if cuda:
s1 = s1.cuda()
s2 = s2.cuda()
target = target.cuda()
output = model.forward(s1,s2)
loss = criterion(output, target)
#print("output size: {}".format(output.size()))
#print("target size: {}".format(target.size()))
pred = output.data.max(1)[1] # get the index of the max log-probability
#print(pred[:5])
#print(output[:])
correct += pred.eq(target.data).cpu().sum()
return correct / float(total)
eval_model(model)
I think there is an issue in a way you are trying to solve an entailment problem.
Maybe you can do it this way:
design your module to accept two sentences as input
embed both of them with your embeddings
encode them using the LSTM module.
now you have two fixed length vector representations of two sentences. Simpliest thing to do is to just concatenate them
together.
Add liner layer on top to evaluate scores for each entailment class (3 I suppose)
apply softmax to get a proper probability distribution
So your model can look like this (double check the dimensions):
class myLSTM(nn.Module):
def __init__(self, h_size=128, v_size=10, embed_d=300, num_classes = 3):
super(myLSTM, self).__init__()
self.embedding = nn.Embedding(v_size, embed_d)
self.lstm = nn.LSTM(embed_d, h_size, num_layers=1, bidirectional=True, batch_first=True)
self.mlp = nn.Linear(2*h_size*2, num_classes) #<- change here
def forward(self, sentence1, sentence2):
embeds_1 = self.embedding(sentence1)
embeds_2 = self.embedding(sentence2)
_, (h_1_last, _) = self.lstm(embeds_1) #note the change here. Last hidden state is taken
_, (h_2_last, _) = self.lstm(embeds_2)
concat = torch.concat([h_1_last, h_2_last], dim=1) #double check the dimension
scores = self.mlp(concat)
probas = F.softmax(scores) #from torch.functional ...
Then you can play around with adding more hidden layers or thinking how combining two sentences can be done in more intelligent way (attention, etc).
Double check what CrossEntropyLoss accepts as input and target and adjust (is it unnormalized class scores or probability distribution). Check http://pytorch.org/docs/master/nn.html#lstm for LSTM module documentation to clarify what LSTM returns (do you need hidden states for every word or just the representation after the last one).