I have about 0.8 million images of 256x256 in RGB, which amount to over 7GB.
I want to use them as training data in a Convolutional Neural Network, and want to put them in a cPickle file, along with their labels.
Building this pickle takes so much memory that the process starts swapping to my hard drive and nearly fills it up.
Is this a bad idea?
What would be a smarter, more practical way to load the images into the CNN, or to pickle them, without running into memory problems?
This is what the code looks like:
import numpy as np
import cPickle
from PIL import Image
import sys, os

pixels = []
labels = []
traindata = []
data = []

for subdir, dirs, files in os.walk('images'):
    for file in files:
        if file.endswith(".jpg"):
            floc = os.path.join(subdir, file)
            im = Image.open(floc)
            pix = np.array(im.getdata())   # flat array of this image's RGB values
            pixels.append(pix)
            labels.append(1)               # label for this image

pixels = np.array(pixels)
labels = np.array(labels)
traindata.append(pixels)
traindata.append(labels)
traindata = np.array(traindata)
# ... do the same for validation and test data
# ... put all data and labels into the 'data' array
cPickle.dump(data, open('data.pkl', 'wb'))
Is this a bad idea?
Yes, indeed.
You are trying to load 7 GB of compressed image data into memory all at once. Decompressed, 800k 256x256 RGB images come to roughly 157 GB of raw uint8 pixel data (800,000 × 256 × 256 × 3 bytes), and several times that if NumPy stores the values as 64-bit integers, which np.array(im.getdata()) typically produces. This will not work. You have to find a way to feed your CNN image-by-image (or batch-by-batch), saving the model state as you go along.
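For example, here is a minimal sketch (not your code) of a generator that reads and yields small batches lazily instead of pickling everything up front; the directory layout, the constant label, and the batch size are assumptions carried over from your snippet:

import os
import numpy as np
from PIL import Image

def batch_generator(root='images', batch_size=128):
    """Yield (images, labels) batches without holding the whole dataset in RAM."""
    batch_imgs, batch_labels = [], []
    for subdir, dirs, files in os.walk(root):
        for fname in files:
            if fname.endswith('.jpg'):
                img = Image.open(os.path.join(subdir, fname))
                batch_imgs.append(np.asarray(img, dtype=np.uint8))  # 256x256x3 uint8
                batch_labels.append(1)  # replace with the real label for this file
                if len(batch_imgs) == batch_size:
                    yield np.stack(batch_imgs), np.array(batch_labels)
                    batch_imgs, batch_labels = [], []
    if batch_imgs:  # last, possibly smaller, batch
        yield np.stack(batch_imgs), np.array(batch_labels)

# train incrementally, e.g. with a Keras-style model:
# for X_batch, y_batch in batch_generator():
#     model.train_on_batch(X_batch, y_batch)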
Also consider how large your CNN parameter set will be. Pickle is not designed for large amounts of data. If you need to store gigabytes' worth of neural net data, you're much better off using a database. If the neural net parameter set is only a few MB, though, pickle will be fine.
You might also want to take a look at the documentation for pickle.HIGHEST_PROTOCOL, so you are not stuck with an old and unoptimized pickle file format.
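For example (params here is a stand-in name for whatever small object holds your network parameters):

import cPickle   # on Python 3, use the built-in pickle module instead

with open('params.pkl', 'wb') as f:
    cPickle.dump(params, f, cPickle.HIGHEST_PROTOCOL)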
Related
I am processing images with OpenCV and saving them in a NumPy array with numpy.save. I end up with this kind of file: a 600 (number of images) x 5248 (height) x 7936 (width) x 3 (channels) .npy file that weighs around 70 GB.
In another program I then need to load this file and show the images at a fast pace. If I'm not mistaken, loading a 70 GB file all at once is not viable regardless of RAM size.
Hence my question: should I save the images in multiple smaller arrays? And if so, how can I determine the right number of images per array? As for the loading, should I use multiprocessing or multithreading?
Also, is there perhaps a better type of file in which to store the images ?
I have a very large dataset stored as a single .npy file that contains around 1.5m elements, each a 150x150x3 image. The output has 51 columns (51 outputs). Since the dataset can't fit into memory, how do I load it and use it to fit the model? An efficient way is supposedly to use TFRecords and tf.data, but I couldn't understand how to do this. I would appreciate the help. Thank you.
One way is to load your .npy file fragment by fragment (to feed your neural network with) instead of loading it into memory all at once. You can use numpy.load as normal and specify the mmap_mode keyword so that the array is kept on disk and only the necessary parts are loaded into memory on access (more details here):
numpy.load(file, mmap_mode=None, allow_pickle=False, fix_imports=True, encoding='ASCII')
Memory-mapped files are used for accessing small segments of large files on disk without reading the entire file into memory. NumPy's memmaps are array-like objects. This differs from Python's mmap module, which uses file-like objects.
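As a minimal sketch (file names, shapes, and batch size are assumptions based on the question), feeding such a memory-mapped array to the model in batches could look like this:

import numpy as np

X = np.load('images.npy', mmap_mode='r')   # e.g. shape (1500000, 150, 150, 3), stays on disk
y = np.load('labels.npy', mmap_mode='r')   # e.g. shape (1500000, 51)

def batches(X, y, batch_size=64):
    for start in range(0, len(X), batch_size):
        stop = start + batch_size
        # slicing a memmap reads only the requested rows from disk
        yield np.asarray(X[start:stop], dtype=np.float32) / 255.0, np.asarray(y[start:stop])

# for X_batch, y_batch in batches(X, y):
#     model.train_on_batch(X_batch, y_batch)   # Keras-style incremental training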
If you want to know how to create a tfrecords from a numpy array, and then read the tfrecords using the Dataset API, this link provides a good answer.
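If you go the TFRecords route, a rough sketch of writing the arrays once and then streaming them back with tf.data might look like this (shapes, dtypes, and file names are assumptions, not taken from that link):

import numpy as np
import tensorflow as tf

def write_tfrecords(images, labels, path='train.tfrecord'):
    # images: (N, 150, 150, 3) uint8, labels: (N, 51) float32
    with tf.io.TFRecordWriter(path) as writer:
        for img, lab in zip(images, labels):
            example = tf.train.Example(features=tf.train.Features(feature={
                'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img.tobytes()])),
                'label': tf.train.Feature(bytes_list=tf.train.BytesList(value=[lab.tobytes()])),
            }))
            writer.write(example.SerializeToString())

def parse(record):
    feat = tf.io.parse_single_example(record, {
        'image': tf.io.FixedLenFeature([], tf.string),
        'label': tf.io.FixedLenFeature([], tf.string),
    })
    image = tf.reshape(tf.io.decode_raw(feat['image'], tf.uint8), (150, 150, 3))
    label = tf.io.decode_raw(feat['label'], tf.float32)
    return tf.cast(image, tf.float32) / 255.0, label

dataset = tf.data.TFRecordDataset('train.tfrecord').map(parse).batch(32).prefetch(1)
# model.fit(dataset, epochs=...)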
Okay, so the code is like this.
X1 is the loaded hyperspectral image with dimensions (512x512x91).
What I am trying to do is basically crop 64x64x91 sized matrices with a stride of 2. This gives me a total of 49952 images, each of size 64x64x91, but when I run the for loop I get a memory error.
My system has 8 GB of RAM.
n_crops = 49952              # note: the loop bounds below actually produce 224*224 = 50176 crops
r = 64                       # crop size, matching the 64x64x91 crops described above
# 49952 * 64 * 64 * 91 * 8 bytes (float64) is roughly 150 GB -- far more than 8 GB of RAM
data_images_0 = np.zeros((n_crops, r, r, 91))
k = 0
for i in range(0, 512 - r, 2):
    print(k)
    for j in range(0, 512 - r, 2):
        data_images_0[k, :, :, :] = X1[i:i+r, j:j+r, :]   # copy one 64x64x91 crop
        k = k + 1
I have a hyperspectral image loaded from a .mat file, with dimensions (512x512x91). I want to use chunks of this image as the input to my CNN, for example crops of 64x64x91. The problem is that once I create the crops out of the original image, I have trouble loading the data, as loading all the crops at once gives me a memory error.
Is there something I can do to load my cropped data in batches so that I don't run into such a memory error?
Should I convert my data into some other format, or approach the problem in some other way?
You are looking for the matfile function. It allows you to access the array on your hard disk and load only parts of it.
Say your picture is named pic; then you can do something like:
data = matfile("filename.mat");
part = data.pic(1:64,1:64,:);
%Do something
Then only the (1:64,1:64,:) part of the variable pic will be loaded into part.
As always, it should be noted that working from the hard disk is not exactly fast and should be avoided where possible. On the other hand, if your variable is too large to fit in memory, there is no way around it (apart from buying more memory).
I think you might want to use the matfile function, which basically opens a .mat file without pulling its entire content into the RAM. You basically read a header from your .mat file that contains information about the stored elements, like size, data type and so on. Imagine your .mat file hyperspectralimg.mat containing the matrix myImage. You'd have to proceed like this:
filename = 'hyperspectralimg.mat';
img = matfile(filename);
A = doStuff2MyImg(img.myImage(1:64,1:64,:)); % Do stuff to your imageparts
img.myImage(1:64,1:64,:) = A; %Return changes to your file
This is a brief example of how you can use matfile in case you haven't used it before. If you have already used it and it doesn't work, let us know; as a general recommendation, upload code snippets and data samples with your questions, it helps.
A quick comment regarding the tags: if your question concerns MATLAB, don't tag it python and the like.
You can use a NumPy memory map (numpy.memmap). It is the equivalent of MATLAB's matfile.
https://numpy.org/doc/stable/reference/generated/numpy.memmap.html
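For example, assuming you first save the 512x512x91 cube as a .npy file (the file name here is made up), a sketch of pulling 64x64x91 crops on demand could look like this:

import numpy as np

# one-time conversion (X1 loaded from the .mat file, e.g. via scipy.io.loadmat):
# np.save('hyperspectral.npy', X1)

cube = np.load('hyperspectral.npy', mmap_mode='r')   # stays on disk, shape (512, 512, 91)

def crop_batches(cube, crop=64, stride=2, batch_size=32):
    """Yield batches of crops without materialising all ~50k crops at once."""
    batch = []
    for i in range(0, cube.shape[0] - crop, stride):
        for j in range(0, cube.shape[1] - crop, stride):
            batch.append(np.asarray(cube[i:i+crop, j:j+crop, :]))  # reads only this crop
            if len(batch) == batch_size:
                yield np.stack(batch)
                batch = []
    if batch:
        yield np.stack(batch)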
I have a problem that I am not able to solve. I'm studying a Convolutional Neural Network for traffic sign recognition, but I don't understand how Python .p files are organized, what these files contain, and how to create a .p file holding my images and the labels associated with those images. Can anyone help me? Below are the first lines of code that load the data from the dataset. Thanks a lot.
import pickle
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import cv2
import scipy.misc
from PIL import Image
training_file = 'traffic-signs-data/train.p'
validation_file = 'traffic-signs-data/valid.p'
testing_file = 'traffic-signs-data/test.p'
with open(training_file, mode='rb') as f:
    train = pickle.load(f)
with open(validation_file, mode='rb') as f:
    valid = pickle.load(f)
with open(testing_file, mode='rb') as f:
    test = pickle.load(f)

X_train, y_train = train['features'], train['labels']
X_valid, y_valid = valid['features'], valid['labels']
X_test, y_test = test['features'], test['labels']
pickle isn't just one thing, so there's no single answer to your question, but the whole point of pickle is that it shouldn't matter. In general, any python object can be pickled as-is, and unpickled, without any special knowledge of what's being pickled or how. It's simply a way to freeze an in-memory Python object on disk. It's up to you as the developer to know what data and types went into a pickle file and what you should expect back out, or to handle errors appropriately.
There are rare issues with certain types that don't make sense to be pickled (such as an HTTP connection), and also if you attempt to unpickle an old Python object after changing the underlying Python or library versions (e.g., trying to unpickle a Python 3 object in Python 2), but in general it doesn't matter what you're pickling. If you need greater resilience to change, you should use some serialization system that isn't Python- and library-specific.
This may be of interest to you: https://docs.python.org/2/library/pickle.html#what-can-be-pickled-and-unpickled
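A quick round-trip illustration of that point:

import pickle

obj = {'anything': [1, 2, 3], 'nested': {'works': True}}
with open('obj.p', 'wb') as f:
    pickle.dump(obj, f)
with open('obj.p', 'rb') as f:
    assert pickle.load(f) == obj   # the object comes back exactly as it went in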
The thing is, if you do not have the class definitions that were used when pickling, you won't be able to unpickle.
So, your data in your .p files may be totally useless.
However, if you own the full flow (and you will have to create the .p files yourself), know that pickle is just a way to serialize/deserialize data.
So you can have one part of your software that focuses on populating your .p files: try loading your images with Image (use Pillow rather than the old PIL) and then pickle your list of images, as sketched below.
You should then be able to unpickle them in the part of your software that you showed above.
This is just a way to do your preprocessing beforehand and avoid redoing it every time.
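A minimal sketch of that population step, producing the same {'features': ..., 'labels': ...} structure the loading code above expects (the folder name, image size, and label value are just placeholders):

import os
import pickle
import numpy as np
from PIL import Image   # provided by the Pillow package

features, labels = [], []
for fname in sorted(os.listdir('my_signs')):          # hypothetical folder of sign images
    if fname.endswith('.png'):
        img = Image.open(os.path.join('my_signs', fname)).convert('RGB').resize((32, 32))
        features.append(np.asarray(img, dtype=np.uint8))
        labels.append(0)                              # replace with the real class id

train = {'features': np.stack(features), 'labels': np.array(labels)}
with open('traffic-signs-data/train.p', 'wb') as f:
    pickle.dump(train, f, protocol=pickle.HIGHEST_PROTOCOL)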
Another way to do that (for example) would be to dump the data as JSON, with your images base64 encoded/decoded. See here for a quick example of the latter part: https://stackoverflow.com/a/31826470/8933502
Good luck!
I use Python for image analysis. The first step in my code is to load the images from disk into a big 20 GB uint8 array. This step takes a very long time, loading about 10 MB/s, and the CPU is idle during the task.
This seems extremely slow. Am I making an obvious mistake? How can I improve performance? Is it a problem with the numpy array type?
import os
import numpy as np
from PIL import Image

# find all image files in working folder
FileNames = []               # FileNames is a list of image names
workingFolder = 'C:/folder'
for (dirpath, dirnames, filenames) in os.walk(workingFolder):
    FileNames.extend(filenames)
FileNames.sort()             # sorted by image number

imNumber = len(FileNames)    # number of images

# AllImages initialize
img = Image.open(workingFolder + '/' + FileNames[0])
AllImages = np.zeros((img.size[0], img.size[1], imNumber), dtype=np.uint8)

for ii in range(imNumber):
    img = Image.open(workingFolder + '/' + FileNames[ii])
    AllImages[:, :, ii] = img
Thanks a lot for your help.
Since the CPU is idling, it sounds like the disk is the bottleneck. 10 MB/s is somewhat slow, but not so slow that it reminds me of stone-age hard disks. If NumPy were the problem, I'd expect the CPU to be busy running NumPy code rather than sitting idle.
Note that there may be two ways the CPU ends up waiting for the disk. First, of course, you need to read the data from disk; but also, since the data is 20 GB, it may be big enough to get swapped back out to disk. The normal solution to this kind of situation is to memory-map the file (which avoids moving data from disk to swap and back).
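For instance, here is a sketch of building the big array as an on-disk memmap instead of an in-RAM np.zeros, reusing the names from your snippet (the backing file name is made up); the OS then pages parts in and out as needed rather than pushing the whole 20 GB through swap:

import numpy as np
from PIL import Image

# same shape as the original AllImages, but backed by a file instead of RAM
AllImages = np.memmap('allimages.dat', dtype=np.uint8, mode='w+',
                      shape=(img.size[0], img.size[1], imNumber))

for ii in range(imNumber):
    im = Image.open(workingFolder + '/' + FileNames[ii])
    AllImages[:, :, ii] = im          # written through to the backing file

AllImages.flush()                      # make sure everything is on disk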
Try to check whether you can read the files faster by other means. For example, on Linux you could use dd if=/path/to/image of=/tmp/output bs=8k count=10k; rm -f /tmp/output to check the raw read speed into RAM. See this question for more information on checking disk performance.