I have two NumPy arrays saved as .npy files. One contains the x_train data and the other contains the y_train data.
The x_train.npy file is 5.7 GB in size, so I can't feed it to training by loading the whole array into memory.
Every time I try to load it into RAM and train the model, Colab crashes before training starts.
Is there a way to feed large NumPy files to model.fit()?
Files I have:
"x_train.npy" (5.7 GB)
"y_train.npy"
Depending on how much RAM your device has, it may not be possible from a hardware point of view.
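One way around the RAM limit is to never load the full arrays at all. A minimal sketch, assuming a compiled tf.keras model and that both arrays share the same leading dimension: memory-map the .npy files with np.load(mmap_mode='r') and wrap them in a tf.data generator, so only one batch is resident in memory at a time.

import numpy as np
import tensorflow as tf

# Memory-map the arrays: the data stays on disk and is paged in on demand
x_train = np.load("x_train.npy", mmap_mode="r")
y_train = np.load("y_train.npy", mmap_mode="r")

batch_size = 32  # placeholder value, pick whatever fits your model

def batch_generator():
    for start in range(0, len(x_train), batch_size):
        end = start + batch_size
        # Slicing a memmap reads only this batch from disk
        yield (x_train[start:end].astype(np.float32),
               y_train[start:end].astype(np.float32))  # cast labels to whatever dtype your loss expects

dataset = tf.data.Dataset.from_generator(
    batch_generator,
    output_signature=(
        tf.TensorSpec(shape=(None,) + x_train.shape[1:], dtype=tf.float32),
        tf.TensorSpec(shape=(None,) + y_train.shape[1:], dtype=tf.float32),
    ),
).prefetch(tf.data.AUTOTUNE)

model.fit(dataset, epochs=10)  # `model` is assumed to be an already compiled tf.keras model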
While training a classification model I pass the input image samples as a NumPy array, but when I try to train on a large number of samples I run into a memory error. I currently have 120 GB of memory, and even with this much I hit the error. I've enclosed a code snippet below:
import numpy as np
from scipy import ndimage  # ndimage.imread was removed in newer SciPy versions

x_train = np.array([np.array(ndimage.imread(image)) for image in image_list])
x_train = x_train.astype(np.float32)
Error traceback:
x_train = x_train.astype(np.float32)
numpy.core._exceptions.MemoryError: Unable to allocate 134. GiB for an array
with shape (2512019, 82, 175, 1) and data type float32
How can I fix this issue without increasing the RAM size? Is there a better way to read the data, for example using a cache or protobuf?
I would load the first half of the dataset, train the model on it, then load the second half and train the model on that. This does not significantly influence the result.
The easiest way to split your dataset is to make a second folder with the same structure containing 50% of the data.
The pseudo-code for that training method would look like this:
load dataset 1
train the model with dataset 1
load dataset 2 into the same variable as the first one, to reuse the memory instead of keeping a second variable alongside the first
train the model with dataset 2
A second option to decrease the memory size of your array is to use np.float16 instead of np.float32, but this can result in a less accurate model. The difference is data-dependent, so it could be 1-2% or even 5-10%; the only option that loses no accuracy is the one described above.
EDIT
I am going to add the actual code.
import os
import cv2  # pip install opencv-python
import numpy as np

# os.listdir() returns bare filenames, so join them with the folder path
part1_of_dataset = [os.path.join("Path_to_your_first_dataset", f) for f in os.listdir("Path_to_your_first_dataset")]
part2_of_dataset = [os.path.join("Path_to_your_second_dataset", f) for f in os.listdir("Path_to_your_second_dataset")]

# First half: load, scale, center, train
x_train = np.array([cv2.imread(image) for image in part1_of_dataset], dtype=np.float32)
x_train /= 255.0
x_train -= np.mean(x_train, axis=0)
model.fit(x_train, y_train)  # not the full training code, just an example; y_train holds the labels for this half

# Second half: reuse the same variable so the first half's memory can be freed
x_train = np.array([cv2.imread(image) for image in part2_of_dataset], dtype=np.float32)
x_train /= 255.0
x_train -= np.mean(x_train, axis=0)
model.fit(x_train, y_train)  # again, y_train here is the labels for this half
This question comes up just as I put the first two 32 GB RAM sticks into my PC today, for pretty much the same reason.
At this point it becomes necessary to handle the data differently.
I am not sure what you are using to do the training, but if it's TensorFlow you can customize your input pipeline.
Anyway, it comes down to correctly analyzing what you want to do with the data and what your environment can handle. If the data is ready for training and you just load it from disk, it should not be a problem to load and train on only a portion of it, then move on to the next portion, and so on.
You can split the data into multiple files or load it partially (there are datatypes/file formats to help with that). You can even optimize this to the point where you read from disk during training and have the next batch ready the moment you need it, as in the sketch below.
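For example, with tf.data you can build a pipeline that decodes files straight from disk and prefetches the next batch while the current one trains. A minimal sketch, assuming a folder of same-sized PNG images with a matching list of integer labels (the path and the labels list are placeholders):

import tensorflow as tf

image_paths = sorted(tf.io.gfile.glob("path/to/images/*.png"))  # placeholder path
labels = [0] * len(image_paths)  # placeholder labels; replace with your real ones

def load_example(path, label):
    # Read and decode one image on the fly; nothing is held in RAM up front
    image = tf.io.decode_png(tf.io.read_file(path), channels=3)
    image = tf.image.convert_image_dtype(image, tf.float32)  # scales pixels to [0, 1]
    return image, label

dataset = (tf.data.Dataset.from_tensor_slices((image_paths, labels))
           .shuffle(len(image_paths))                              # shuffle file names, not pixel data
           .map(load_example, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))                            # next batch is prepared during training

model.fit(dataset, epochs=10)  # `model` is assumed to be a compiled tf.keras model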
I am training a neural network. For that, I read in 182,335 images (PNG files) with the code below.
import glob
import cv2  # pip install opencv-python

folders = glob.glob(r'path\to\images\*')

imagenames_list = []
for folder in folders:
    for f in glob.glob(folder + '/*.png'):
        imagenames_list.append(f)

read_images = []
for image in imagenames_list:
    read_images.append(cv2.imread(image))
After some preprocessing of the data I created a pandas dataframe and saved it as a pickle-file:
df.to_pickle(r'data\data_as_pddataframe.pkl')
df.head()
Because of the huge number of images the pickle file is relatively big (3 GB). As a result it takes quite a while to read the file back in, and it also needs a lot of memory. Furthermore, when I train the network in Google Colab, Colab sometimes crashes because of the sheer amount of data.
So, is there a more efficient way (1) to read in the data and (2) to store the DataFrame?
Thanks!
I would do something like this:
Make sure that the batch size of your model is small enough that the input data and model parameters fit in memory.
Keep the images as image files on disk. Save the non-image data as Parquet, CSV, or whatever (don't use pickle for this), and put the image filenames in that table.
Keep data on disk, don't load it all into memory.
Load your non-image data as a regular DataFrame, and only load images from disk when they are needed for the current batch in SGD (see the sketch below).
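A minimal sketch of that setup, assuming a tf.keras model and a table with 'filename' and 'label' columns (the column names and the Parquet path are placeholders); the Sequence loads only the images needed for the current batch:

import numpy as np
import pandas as pd
import cv2
import tensorflow as tf

class ImageBatchLoader(tf.keras.utils.Sequence):
    # Loads only the images required for the current batch from disk

    def __init__(self, table, batch_size=32):
        super().__init__()
        self.table = table          # DataFrame with image filenames and labels
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.table) / self.batch_size))

    def __getitem__(self, idx):
        # Read just this batch's images from disk
        rows = self.table.iloc[idx * self.batch_size:(idx + 1) * self.batch_size]
        images = np.array([cv2.imread(f) for f in rows['filename']], dtype=np.float32) / 255.0
        return images, rows['label'].to_numpy()

df = pd.read_parquet('data/data_as_table.parquet')  # filenames + labels only, no pixel data
model.fit(ImageBatchLoader(df), epochs=10)          # `model` is assumed to be compiled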
I'm working with the MNIST dataset from a Kaggle challenge and I'm having trouble preprocessing the data. I also don't know what the best practices are and was wondering if you could advise me on that.
Disclaimer: I can't just use torchvision.datasets.mnist because I need to use Kaggle's data for training and submission.
In this tutorial, it was advised to create a Dataset object that loads .pt tensors from files, to fully utilize the GPU. To achieve that, I needed to load the CSV data provided by Kaggle and save it as .pt files:
import pandas as pd
import torch
import numpy as np

# import data
digits_train = pd.read_csv('data/train.csv')

train_tensor = torch.tensor(digits_train.drop('label', axis=1).to_numpy(), dtype=torch.int)
labels_tensor = torch.tensor(digits_train['label'].to_numpy())

for i in range(train_tensor.shape[0]):
    torch.save(train_tensor[i], "data/train-" + str(i) + ".pt")
Each train_tensor[i].shape is torch.Size([1, 784])
However, each such .pt file has a size of about 130 MB.
A tensor of the same size, with randomly generated integers, has a size of 6.6 kB.
Why are these tensors so huge, and how can I reduce their size?
The dataset is 42,000 samples. Should I even bother with batching this data? Should I bother with saving the tensors to separate files, rather than loading them all into RAM and then slicing into batches? What is the optimal approach here?
As explained in this discussion, torch.save() saves the whole underlying storage of the tensor, not just the slice you indexed. You need to explicitly copy the data using clone().
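Applied to the saving loop above, that means cloning each row before writing it:

# Cloning the slice writes only that row's values, not the whole storage
for i in range(train_tensor.shape[0]):
    torch.save(train_tensor[i].clone(), "data/train-" + str(i) + ".pt")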
Don't worry, at runtime the data is only allocated once unless you explicitly create copies.
As general advice: if the data easily fits into your memory, just load it all at once. For MNIST with 130 MB that's certainly the case.
However, I would still batch the data because it converges faster. Look up the advantages of SGD for more details.
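A minimal sketch of that approach in PyTorch, loading everything once and letting a DataLoader hand out shuffled mini-batches (the batch size is a placeholder):

from torch.utils.data import TensorDataset, DataLoader

# Wrap the full tensors once; the DataLoader slices out shuffled mini-batches
train_dataset = TensorDataset(train_tensor.float(), labels_tensor)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

for images, labels in train_loader:
    ...  # forward pass, loss, backward pass go here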
I want to train a neural network model in Python using Keras. The data I have is a bunch of .mat files that contain data associated with ECG signals.
All the .mat files have the same structure but different values. However, some array sizes differ between files. For example, one file contains an array called "P_Signal" with a size of 9000, while another file has it with a size of 18000.
As far as I know, I should prepare a CSV file containing all the necessary attributes plus the label in each column so that Keras can understand it.
Unfortunately, this is impossible due to the large arrays and fields inside the .mat files (I can't load a CSV with more than 16,384 columns, and I have some arrays of 18,000 elements each). So I need another way to load this data into Keras.
Appreciate your help a lot.
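Keras doesn't actually need a CSV; model.fit() accepts NumPy arrays directly, so one option is to read the .mat files with scipy.io.loadmat and pad the signals to a common length. A minimal sketch under those assumptions (the file path, zero-padding choice, and label source are placeholders):

import glob
import numpy as np
from scipy.io import loadmat

mat_files = glob.glob("path/to/mat_files/*.mat")  # placeholder path

# loadmat returns a dict of arrays keyed by variable name
signals = [loadmat(f)["P_Signal"].ravel() for f in mat_files]

# Zero-pad every signal to the length of the longest one so they stack into one array
max_len = max(len(s) for s in signals)
x_train = np.array([np.pad(s, (0, max_len - len(s))) for s in signals], dtype=np.float32)

# y_train = ...  # labels would come from the files or a separate source
# model.fit(x_train, y_train)  # `model` assumed to be a compiled Keras model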
I'm trying to use the Dataset API to load data, and I find that I'm spending the majority of the time loading data into the shuffle buffer. How might I optimize this pipeline to minimize the time spent populating the shuffle buffer?
(tf.data.Dataset.list_files(path)
    .shuffle(num_files)  # number of tfrecord files
    .apply(tf.contrib.data.parallel_interleave(
        lambda f: tf.data.TFRecordDataset(f), cycle_length=num_files))
    .shuffle(num_items)  # number of images in the dataset
    .map(parse_func, num_parallel_calls=8)
    .map(get_patches, num_parallel_calls=8)
    .apply(tf.contrib.data.unbatch())
    # Patch buffer is currently the number of patches extracted per image
    .apply(tf.contrib.data.shuffle_and_repeat(patch_buffer))
    .batch(64)
    .prefetch(1)
    .make_one_shot_iterator())
Since I have at most thousands of images, my solution to this problem was to have a separate TFRecord file per image. That way individual images could be shuffled without having to load them into memory first. This drastically reduced the buffering that needed to occur.
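A rough sketch of that layout in TF 2 style (the serialized_examples iterable, file naming, and buffer sizes are placeholders; parse_func is the same function as in the pipeline above). Because the file names themselves are shuffled, the record-level shuffle buffer can stay small:

import tensorflow as tf

# Writer side: one TFRecord file per image, so shuffling file names shuffles images
for i, serialized_example in enumerate(serialized_examples):
    with tf.io.TFRecordWriter("data/image_%06d.tfrecord" % i) as writer:
        writer.write(serialized_example)

# Reader side: shuffle the (cheap) file list, then interleave reads across files
dataset = (tf.data.Dataset.list_files("data/*.tfrecord", shuffle=True)
           .interleave(tf.data.TFRecordDataset,
                       cycle_length=16,
                       num_parallel_calls=tf.data.AUTOTUNE)
           .map(parse_func, num_parallel_calls=tf.data.AUTOTUNE)
           .shuffle(1024)   # small buffer is enough once files are pre-shuffled
           .batch(64)
           .prefetch(tf.data.AUTOTUNE))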