I don't have access to any GPUs, but I want to speed up the training of my model built with PyTorch by using more than one CPU. I will use the most basic model as an example here.
All I want is for this code to run on multiple CPUs instead of just one (the Dataset and Network classes are in the Appendix).
import pandas as pd
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

df = pd.read_pickle('path/to/data')
X = df.drop(columns=['target'])
y = df[['target']]

train_data = CustomDataset(X, y)
train_loader = DataLoader(
    dataset=train_data,
    batch_size=64
)

model = Network(X.shape[-1])
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())

epochs = 10
for e in range(1, epochs + 1):
    train_loss = .0
    model.train()
    for batch_id, (data, labels) in enumerate(train_loader):
        optimizer.zero_grad()
        target = model(data)
        loss = criterion(target, labels)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    print('Epoch {}:\tTrain Loss: {:.4f}'.format(
        e,
        train_loss / len(train_loader)
    ))
Right now, this code uses only 1 CPU at 100% during training. I want it to use 4 CPUs at 100% in the training process. There is a lot of conflicting information out there on the best way to do this, and none of it has really worked for me. I have tried different approaches, for example model = torch.nn.parallel.DistributedDataParallel(model), but none of them worked.
Does someone know how to get the most performance out of this code using 4 CPUs? Thanks in advance!
APPENDIX
class CustomDataset(Dataset):
    def __init__(
        self,
        X,
        y,
    ):
        self.X = torch.Tensor(X.values)
        self.y = torch.Tensor(y.values)

    def __getitem__(
        self,
        index
    ):
        return self.X[index], self.y[index]

    def __len__(self):
        return len(self.X)
class Network(nn.Module):
    def __init__(
        self,
        input_size
    ):
        super(Network, self).__init__()
        self.input_size = input_size
        self.linear_1 = nn.Linear(self.input_size, 32)
        self.linear_2 = nn.Linear(32, 1)

    def forward(self, data):
        output = self.linear_1(data)
        output = self.linear_2(output)
        return output
Torchrun (included with PyTorch) makes this surprisingly easy.
How you want the CPUs to work together is not entirely clear from your question, but I am assuming (because you refer to DistributedDataParallel) that you would like to distribute the data across multiple processes which all do forward and backward passes and synchronize their gradients with each other.
First, see if torch.distributed is available: torch.distributed.is_available().
Torchrun requires your script to have a few tweaks.
To initialize a process group, include
import torch.distributed as dist
dist.init_process_group(backend="gloo")
The backend must be gloo for CPUs.
Torchrun sets the environment variables MASTER_PORT, MASTER_ADDR, WORLD_SIZE, and RANK, which are required for torch.distributed.
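You can read these variables from your script if you want rank-dependent behaviour, for example printing only from one process (a small sketch; the variable names are exactly the ones torchrun sets):

import os

rank = int(os.environ["RANK"])              # global rank of this process
world_size = int(os.environ["WORLD_SIZE"])  # total number of processes

if rank == 0:
    print(f"Training with {world_size} processes")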
Then, so that the gradients are synchronized across processes after every backward pass, wrap your model with

model = torch.nn.parallel.DistributedDataParallel(model)

Note that for CPU training you should not pass device_ids or output_device; those arguments are only for single-device GPU modules.
Every process still gets to see the full dataset. This can be made more efficient with a DistributedSampler, so that each process only works on its own shard of the data:
from torch.utils.data.distributed import DistributedSampler

train_sampler = DistributedSampler(train_data)
train_loader = DataLoader(
    train_data,
    # ... your other arguments (e.g. batch_size) ...
    shuffle=False,      # train_sampler will shuffle for you.
    sampler=train_sampler,
)
for e in range(1, epochs + 1):
    train_sampler.set_epoch(e)
    train(train_loader)

where train(train_loader) does the training steps. Note the train_sampler.set_epoch(e) call: it makes sure that the data is distributed across the processes differently in every epoch.
To start the training, run

torchrun --standalone --nnodes=1 --nproc_per_node=$NUM_TRAINERS YOUR_TRAINING_SCRIPT.py

where $NUM_TRAINERS is the number of processes (4 in your case) that work on improving your model.
It is recommended, though not necessary, to include

from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    # do train
    pass

if __name__ == "__main__":
    main()

to see which process throws which error.
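Putting the pieces together for your example, a minimal sketch of the adapted script could look like this (it assumes the CustomDataset and Network classes from your appendix and is meant as a starting point, not a drop-in replacement):

import pandas as pd
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torch.distributed.elastic.multiprocessing.errors import record


@record
def main():
    dist.init_process_group(backend="gloo")  # gloo is the CPU backend
    rank = dist.get_rank()

    df = pd.read_pickle('path/to/data')
    X = df.drop(columns=['target'])
    y = df[['target']]

    train_data = CustomDataset(X, y)
    train_sampler = DistributedSampler(train_data)
    train_loader = DataLoader(
        train_data,
        batch_size=64,
        shuffle=False,           # the sampler shuffles for us
        sampler=train_sampler,
    )

    model = DistributedDataParallel(Network(X.shape[-1]))  # no device_ids on CPU
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters())

    for e in range(1, 11):
        train_sampler.set_epoch(e)   # re-shard/re-shuffle the data every epoch
        model.train()
        train_loss = 0.0
        for data, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(data), labels)
            loss.backward()          # gradients are averaged across processes here
            optimizer.step()
            train_loss += loss.item()
        if rank == 0:                # print only once, not once per process
            print('Epoch {}:\tTrain Loss: {:.4f}'.format(e, train_loss / len(train_loader)))

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

Launched with torchrun --standalone --nnodes=1 --nproc_per_node=4 YOUR_TRAINING_SCRIPT.py this runs four processes, each working on a quarter of the data per epoch.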
If I misunderstood your question and you only want more workers to load the data, then pass num_workers=N to DataLoader.
This ensures more workers (other than the main process) push the data into RAM while the current batch is being processed.
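For your loader that would be, for example (a small sketch of the same DataLoader call with workers added):

train_loader = DataLoader(
    dataset=train_data,
    batch_size=64,
    num_workers=4,   # four worker processes prepare batches in the background
)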
Related
I have this code:
#!/usr/bin/env python
# coding: utf-8
import torch
from torch_geometric.datasets import TUDataset
from torch_geometric.data import Data, Dataset, DataLoader
dataset = TUDataset(root='data/TUDataset', name='MUTAG')
print()
print(f'Dataset: {dataset}:')
print('====================')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')
data = dataset[0] # Get the first graph object.
print()
print(data)
print('=============================================================')
# Gather some statistics about the first graph.
print(f'Number of nodes: {data.num_nodes}')
print(f'Number of edges: {data.num_edges}')
print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')
#print(f'Has isolated nodes: {data.has_isolated_nodes()}')
#print(f'Has self-loops: {data.has_self_loops()}')
#print(f'Is undirected: {data.is_undirected()}')
torch.manual_seed(12345)
dataset = dataset.shuffle()
train_dataset = dataset[:150]
test_dataset = dataset[150:]
print(f'Number of training graphs: {len(train_dataset)}')
print(f'Number of test graphs: {len(test_dataset)}')
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)
for step, data in enumerate(train_loader):
    print(f'Step {step + 1}:')
    print('=======')
    print(f'Number of graphs in the current batch: {data.num_graphs}')
    print(data)
    print()
from torch.nn import Linear
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.nn import global_mean_pool
class GCN(torch.nn.Module):
    def __init__(self, hidden_channels):
        super(GCN, self).__init__()
        torch.manual_seed(12345)
        self.conv1 = GCNConv(dataset.num_node_features, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)
        self.conv3 = GCNConv(hidden_channels, hidden_channels)
        self.lin = Linear(hidden_channels, dataset.num_classes)

    def forward(self, x, edge_index, batch):
        # 1. Obtain node embeddings
        x = self.conv1(x, edge_index)
        x = x.relu()
        x = self.conv2(x, edge_index)
        x = x.relu()
        x = self.conv3(x, edge_index)

        # 2. Readout layer
        x = global_mean_pool(x, batch)  # [batch_size, hidden_channels]

        # 3. Apply a final classifier
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.lin(x)
        return x
model = GCN(hidden_channels=64)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()
def train():
    model.train()
    for data in train_loader:  # Iterate in batches over the training dataset.
        out = model(data.x, data.edge_index, data.batch)  # Perform a single forward pass.
        loss = criterion(out, data.y)  # Compute the loss.
        loss.backward()  # Derive gradients.
        optimizer.step()  # Update parameters based on gradients.
        optimizer.zero_grad()  # Clear gradients.

def test(loader):
    model.eval()
    correct = 0
    for data in loader:  # Iterate in batches over the training/test dataset.
        out = model(data.x, data.edge_index, data.batch)
        pred = out.argmax(dim=1)  # Use the class with highest probability.
        correct += int((pred == data.y).sum())  # Check against ground-truth labels.
    return correct / len(loader.dataset)  # Derive ratio of correct predictions.
for epoch in range(1, 10000):
    train()
    train_acc = test(train_loader)
    test_acc = test(test_loader)
    print(f'Epoch: {epoch:03d}, Train Acc: {train_acc:.4f}, Test Acc: {test_acc:.4f}')
It runs without error in a Docker container on a Linux server.
I want to be able to run this code in the background, so that I can disconnect from the Linux server that Docker is running on (e.g. if the internet connection to the server goes down), and when I log back in, the script is still running.
To achieve this on the command line without Docker, I would usually log in to the server and run
python script.py &, then log out and back in, and the script would still be running.
How do I achieve this with Docker? When I run the above script with python script.py &, the progress printed to the screen means I cannot get a command line inside Docker (because the printing is happening too quickly) to be able to log out and log back in.
I don't want to just 'not print anything to screen', because I want to see the progress printing at the start.
So is there another way to set a script running in Docker, so that I can disconnect from Docker and the underlying server, then log back in with the Docker container and the script still running?
I would like to train a neural network model (built with TF 2.8 and Keras) on an EC2 instance (CPU only, e.g. m4) from Jupyter.
In order to improve the training runtime, I want to use "multiprocessing" for keras.model.fit().
My code:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.utils import Sequence

class DataGenerator(Sequence):
    def __init__(self, train_file_paths, batch_size, total_data_size, epoch,
                 num_parallel_calls=tf.data.AUTOTUNE):
        def process_samples(input_ds):
            features = tf.io.parse_single_example(input_ds, feature_names)
            labels = tf.io.parse_single_example(input_ds, label_name)['label']
            return (features, labels)

        train_dataset = tf.data.TFRecordDataset(train_file_paths, compression_type='GZIP')
        train_dataset = train_dataset.map(lambda x: process_samples(x), num_parallel_calls=num_parallel_calls)
        train_dataset = train_dataset.shuffle(300)
        train_dataset = train_dataset.repeat(epoch)  # 2588
        train_dataset = train_dataset.batch(batch_size, drop_remainder=True)
        self.train_dataset = train_dataset.prefetch(batch_size)
        self.total_data_size = total_data_size
        self.batch_size = batch_size

    def __len__(self):
        return np.math.ceil(self.total_data_size / self.batch_size)

    def __getitem__(self, idx):
        features = next(iter(self.train_dataset))
        return features
my_train_data_gen = DataGenerator(train_file_paths, batch_size=4, total_data_size=10000, epoch=10)
my_test_data_gen = DataGenerator(test_file_paths, batch_size=4, total_data_size=500, epoch=10)

model.fit(my_train_data_gen,
          epochs=10,
          steps_per_epoch=np.math.floor(train_data_size / 4),
          validation_data=my_test_data_gen,
          validation_steps=np.math.ceil(test_data_size / 4),
          workers=8,
          use_multiprocessing=True)
But the code gets stuck at the beginning of the first epoch and does not move forward. No errors pop up!

DataGenerator, train_dataset type is <class 'tuple'> size is 2
Epoch 1/2

There is no progress at all, and all CPU cores on the EC2 instance are idle, but the Jupyter notebook is still running.
I have checked https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence.
In order to speed up the training process, which would be the better choice: keras.utils.Sequence or tf.data?
Could anybody help me with this?
I am trying to compete in Kaggle's Cornell birdcall detection challenge, which has 23 GB of data in total, mainly composed of mp3 sound files. As you may know, 23 GB of data is impossible to fit into the RAM of Kaggle or Google Colab. Therefore, I tried to write a data generator to fetch mp3 files while training my model and convert them on the fly, in order to prevent out-of-memory issues. However, I am still getting an out-of-memory error after the first few epochs. Below you can find my generator and training code, where I use the del statement to explicitly de-allocate objects from memory, but apparently I did something wrong. Is there any resource you can suggest on this, or any suggestion to improve my code to prevent the memory leak? Calling the garbage collector makes no difference either.
Thanks
My data generator code:
import numpy as np
import tensorflow as tf
from tensorflow import keras
import random
import glob
import gc

class My_Custom_Generator(keras.utils.Sequence):
    def __init__(self, batch_size):
        files = glob.glob("../input/birdsong-recognition/train_audio/*/*.mp3")
        random.shuffle(files)
        self.files = files
        self.batch_size = batch_size

    def __len__(self):
        return (np.ceil(len(self.files) / float(self.batch_size))).astype(np.int)

    def __getitem__(self, idx):
        gc.collect(2)
        batch_x = self.files[idx * self.batch_size : (idx + 1) * self.batch_size]
        # batch_y = self.labels[idx * self.batch_size : (idx + 1) * self.batch_size]

        train_image = []
        train_label = []
        for i in range(0, len(batch_x)):
            image, label = get_data(batch_x[i])
            image = tf.convert_to_tensor(image)
            label_matrix = get_cat_label(label)
            train_image.append(image)
            train_label.append(label_matrix)

        self.train_image = np.array(train_image)
        self.train_label = np.array(train_label)
        del train_image
        del train_label
        return self.train_image, self.train_label
My training loop, which I took from a TensorFlow tutorial and edited:
## Note: Rerunning this cell uses the same model variables

# Keep results for plotting
train_loss_results = []
train_accuracy_results = []

num_epochs = int(len(glob.glob("../input/birdsong-recognition/train_audio/*/*.mp3")) // 8)

for epoch in range(num_epochs):
    epoch_loss_avg = tf.keras.metrics.Mean()
    epoch_accuracy = tf.keras.metrics.CategoricalAccuracy()

    imgs, labels = my_training_batch_generator.__getitem__(epoch)

    # Training loop - using batches of 32
    for i in range(1):
        # Optimize the model
        loss_value, grads = grad(xceptionModel, imgs, labels)
        optimizer.apply_gradients(zip(grads, xceptionModel.trainable_variables))

        # Track progress
        epoch_loss_avg.update_state(loss_value)  # Add current batch loss

        # Compare predicted label to actual label
        # training=True is needed only if there are layers with different
        # behavior during training versus inference (e.g. Dropout).
        epoch_accuracy.update_state(labels, xceptionModel(imgs, training=True))

    del imgs
    del labels

    # End epoch
    train_loss_results.append(epoch_loss_avg.result())
    train_accuracy_results.append(epoch_accuracy.result())

    if epoch % 2 == 0:
        print("Epoch {:03d}: Loss: {:.3f}, Accuracy: {:.3%}".format(epoch,
                                                                    epoch_loss_avg.result(),
                                                                    epoch_accuracy.result()))
I am new to PyTorch. I was trying to build a binary classifier on the Kepler dataset. The following is my dataset class.
class KeplerDataset(Dataset):
    def __init__(self, test=False):
        self.dataframe_orig = pd.read_csv(koi_cumm_path)

        if (test == False):
            self.data = df_numeric[( df_numeric.koi_disposition == 1 ) | ( df_numeric.koi_disposition == 0 )].values
        else:
            self.data = df_numeric[~(( df_numeric.koi_disposition == 1 ) | ( df_numeric.koi_disposition == 0 ))].values

        self.X_data = torch.FloatTensor(self.data[:, 1:])
        self.y_data = torch.FloatTensor(self.data[:, 0])

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.X_data[index], self.y_data[index]
Here, I created a custom classifier class with one hidden layer and a single output unit that produces sigmoidal probability of being in class 1 (planet).
class KOIClassifier(nn.Module):
    def __init__(self, input_dim, out_dim):
        super(KOIClassifier, self).__init__()
        self.linear1 = nn.Linear(input_dim, 32)
        self.linear2 = nn.Linear(32, 32)
        self.linear3 = nn.Linear(32, out_dim)

    def forward(self, xb):
        out = self.linear1(xb)
        out = F.relu(out)
        out = self.linear2(out)
        out = F.relu(out)
        out = self.linear3(out)
        out = torch.sigmoid(out)
        return out
I then created a train_model function to optimize the loss using SGD.
def train_model(X, y):
    criterion = nn.BCELoss()
    optim = torch.optim.SGD(model.parameters(), lr=0.001)
    n_epochs = 100
    losses = []
    for epoch in range(n_epochs):
        y_pred = model.forward(X)
        loss = criterion(y_pred, y)
        losses.append(loss.item())
        optim.zero_grad()
        loss.backward()
        optim.step()

losses = []
for X, y in train_loader:
    losses.append(train_model(X, y))
But after performing the optimization over the train_loader, when I try predicting on the train_loader itself, the prediction values are much worse.
for features, y in train_loader:
    y_pred = model.predict(features)
    break

y_pred
> tensor([[4.5436e-02],
[1.5024e-02],
[2.2579e-01],
[4.2279e-01],
[6.0811e-02],
.....
Why is my model not working properly? Is it a problem with the dataset, or am I doing something wrong in implementing the neural net? I will link my Kaggle notebook, because more context might be helpful. Please help.
You are optimizing many times (100 steps) on the first batch (the first few samples) before moving on to the next batch. This means your model will overfit those few samples before ever seeing the rest, your training will be very non-smooth, it may diverge, and it can end up far from the global optimum.
Usually, in a training loop you should:

1. go over all samples (this is one epoch),
2. shuffle your dataset so you visit the samples in a different order (set your PyTorch training loader accordingly; see the one-line sketch after this list),
3. go back to 1. until you reach the maximum number of epochs.
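For step 2, the shuffling can be delegated to the loader itself (a one-line sketch; I am assuming your dataset object is called train_data and the batch size is just an example):

from torch.utils.data import DataLoader

# shuffle=True re-shuffles the samples at the start of every epoch
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)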
Also, you should not redefine your optimizer (nor your criterion) every time; define them once, outside the training function.
Your training loop should look like this:
criterion = nn.BCELoss()
optim = torch.optim.SGD(model.parameters(), lr=0.001)
n_epochs = 100

def train_model():
    for X, y in train_loader:
        optim.zero_grad()
        y_pred = model.forward(X)
        loss = criterion(y_pred, y)
        loss.backward()
        optim.step()

for epoch in range(n_epochs):
    train_model()
I want to use torch.save() to save a trained model for inference. However, with either model.load_state_dict() or torch.load(), I can't get back the saved model: the loss computed by the loaded model is different from the loss computed by the saved model.
The relevant libraries:
import numpy as np
import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.nn import functional as F
The model:
class nn_block(nn.Module):
    def __init__(self, feats_dim):
        super(nn_block, self).__init__()
        self.linear = nn.Linear(feats_dim, feats_dim)
        self.bn = nn.BatchNorm1d(feats_dim)
        self.softplus1 = nn.Softplus()
        self.softplus2 = nn.Softplus()

    def forward(self, rep_mat):
        transformed_mat = self.linear(rep_mat)
        transformed_mat = self.bn(transformed_mat)
        transformed_mat = self.softplus1(transformed_mat)
        transformed_mat = self.softplus2(transformed_mat + rep_mat)
        return transformed_mat

class test_nn(nn.Module):
    def __init__(self, in_feats, feats_dim, num_conv, num_classes):
        super(test_nn, self).__init__()
        self.linear1 = nn.Linear(in_feats, feats_dim)
        self.convs = [nn_block(feats_dim) for _ in range(num_conv)]
        self.linear2 = nn.Linear(feats_dim, num_classes)
        self.softmax = nn.Softmax()

    def forward(self, rep_mat):
        h = self.linear1(rep_mat)
        for conv_func in self.convs:
            h = conv_func(h)
        h = self.linear2(h)
        h = self.softmax(h)
        return h
Train, save, and reload a model:
# fake a classification task
num_classes = 2; input_dim = 8
one = np.random.multivariate_normal(np.zeros(input_dim), np.eye(input_dim), 20)
two = np.random.multivariate_normal(np.ones(input_dim), np.eye(input_dim), 20)
inputs = np.concatenate([one, two], axis=0)
labels = np.concatenate([np.zeros(20), np.ones(20)])

inputs = Variable(torch.Tensor(inputs))
labels = torch.LongTensor(labels)

# build a model
net = test_nn(input_dim, 5, 2, num_classes)
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)

net.train()
losses = []
best_score = 1e10
for epoch in range(25):
    preds = net(inputs)
    loss = F.cross_entropy(preds, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    state_dict = {'state_dict': net.state_dict()}
    if loss.item() - best_score < -1e-4:
        # save only parameters
        torch.save(state_dict, 'model_params.torch')
        # save the whole model
        torch.save(net, 'whole_model.torch')

    best_score = np.min([best_score, loss.item()])
    losses.append(loss.item())

net_params = test_nn(input_dim, 5, 2, num_classes)
net_params.load_state_dict(torch.load('model_params.torch')['state_dict'])
net_params.eval()
preds_params = net_params(inputs)
loss_params = F.cross_entropy(preds_params, labels)
print('reloaded params %.4f %.4f' % (loss_params.item(), np.min(losses)))

net_whole = torch.load('whole_model.torch')
net_whole.eval()
preds_whole = net_whole(inputs)
loss_whole = F.cross_entropy(preds_whole, labels)
print('reloaded whole %.4f %.4f' % (loss_whole.item(), np.min(losses)))
As you can see by running the code, the losses computed by the two loaded models are different, even though the two loaded models should be exactly the same. Not only are the two losses different from each other, they are also different from the loss computed by the best model that was saved in the first place.
Why can this happen?
The state dict contains every parameter (nn.Parameter) and buffer (similar to a parameter, but one that should not be trained/optimised) that has been registered on the module and all of its submodules. Everything else will not be included in that state dict.
Your test_nn module uses a list for convs, therefore it is not included in the state dict:
self.convs = [nn_block(feats_dim) for _ in range(num_conv)]
Not only are they not contained in the state dict, they are also not visible to net.parameters(), which means they are not trained/optimised at all.
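You can check this yourself with a quick diagnostic (a small sketch run against your net; not part of the training code): the nn_block weights only show up once the blocks are properly registered.

# Keys of the state dict: with the plain Python list, no 'convs.*' entries appear
print(list(net.state_dict().keys()))

# Number of trainable parameters: the nn_block parameters are missing from this
# sum until the blocks are registered (e.g. via nn.ModuleList)
print(sum(p.numel() for p in net.parameters()))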
To register the modules from the list you can wrap it in nn.ModuleList, which is a module that acts like a list, while correctly registering the modules it contains:
self.convs = nn.ModuleList([nn_block(feats_dim) for _ in range(num_conv)])
With that change both models produce the same result.
Since you are calling the convs modules sequentially in the for-loop (the output of one module is the input of the next), you may consider using nn.Sequential, which you can call directly instead of having to use the for-loop. It is used a lot and it just makes things a little simpler; for example, if you want to replace the whole sequence of modules with a single module, you don't need to change anything in the forward method.
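A sketch of what test_nn could look like with nn.Sequential (same behaviour as your version; the dim=1 argument to nn.Softmax is my addition to avoid the implicit-dimension warning):

class test_nn(nn.Module):
    def __init__(self, in_feats, feats_dim, num_conv, num_classes):
        super(test_nn, self).__init__()
        self.linear1 = nn.Linear(in_feats, feats_dim)
        # nn.Sequential registers the blocks and applies them one after another
        self.convs = nn.Sequential(*[nn_block(feats_dim) for _ in range(num_conv)])
        self.linear2 = nn.Linear(feats_dim, num_classes)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, rep_mat):
        h = self.linear1(rep_mat)
        h = self.convs(h)  # replaces the explicit for-loop
        h = self.linear2(h)
        return self.softmax(h)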
Not just the two losses are different, they are also different from the loss computed by the best model that was saved in the first place.
When you are training, you calculate the loss for the current input (batch) and then you optimise the parameters based on that input. This means your parameters differ from the ones used to calculate the loss. Because you are saving the model after that, it will also have a different loss (the one that would occur in the next iteration).
preds = net(inputs)
# Calculating the loss of the current model
loss = F.cross_entropy(preds, labels)
optimizer.zero_grad()
loss.backward()
# Updating the model's parameters based on the loss
optimizer.step()

# State of the model after it has been updated
state_dict = {'state_dict': net.state_dict()}

# Comparing the loss from BEFORE the update
# but saving the model from AFTER the update
if loss.item() - best_score < -1e-4:
    # save only parameters
    torch.save(state_dict, 'model_params.torch')
    # save the whole model
    torch.save(net, 'whole_model.torch')
It's important to evaluate the model after the updates have been made. For this reason a validation set should be used, which is run after each epoch to assess the model's accuracy.
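A minimal sketch of what that could look like in your loop (it reuses inputs/labels from your snippet for brevity; in practice the second forward pass would be on a held-out validation set, and the net.eval()/net.train() toggles matter here because of the BatchNorm layers):

for epoch in range(25):
    net.train()
    preds = net(inputs)
    loss = F.cross_entropy(preds, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Evaluate the model AFTER the update, so the saved checkpoint
    # corresponds to the loss you compare against
    net.eval()
    with torch.no_grad():
        val_loss = F.cross_entropy(net(inputs), labels)
    net.train()

    if val_loss.item() < best_score - 1e-4:
        best_score = val_loss.item()
        torch.save({'state_dict': net.state_dict()}, 'model_params.torch')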