Converting image folder to numpy array is consuming the entire RAM - python

I am trying to convert the celebA dataset (https://www.kaggle.com/jessicali9530/celeba-dataset) images folder into a numpy array, to be converted later into a .pkl file (so the data can be used as simply as MNIST or CIFAR).
I am looking for a better way of converting, since this method consumes all of the available RAM.
from PIL import Image
import pickle
from glob import glob
import numpy as np

TARGET_IMAGES = "img_align_celeba/*.jpg"

def generate_dataset(glob_files):
    dataset = []
    for _, file_name in enumerate(sorted(glob(glob_files))):
        img = Image.open(file_name)
        pixels = list(img.getdata())
        dataset.append(pixels)
    return np.array(dataset)

celebAdata = generate_dataset(TARGET_IMAGES)
I am rather curious how the MNIST authors did this themselves, but any approach that works is welcome.

You can transform any kind of data on the fly in Keras and load it into memory one batch at a time during training.
See the documentation and search for 'Example of using .flow_from_directory(directory)'.
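A minimal sketch of what that looks like; the directory layout and image size here are assumptions, since flow_from_directory expects one subfolder per class under the directory you pass it:

from keras.preprocessing.image import ImageDataGenerator

# Batches are read from disk lazily, so the whole dataset never sits in RAM.
datagen = ImageDataGenerator(rescale=1.0 / 255)
batches = datagen.flow_from_directory(
    'data',                  # hypothetical parent folder, e.g. data/img_align_celeba/*.jpg
    target_size=(218, 178),  # aligned celebA images are 218x178; adjust as needed
    batch_size=32,
    class_mode=None,         # yield only images, no labels
)
x_batch = next(batches)      # numpy array of shape (32, 218, 178, 3)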

Related

High performance (python) library for reading tiff files?

I am using code to read a .tiff file in order to calculate a fractal dimension. My code looks like this:
import matplotlib.pyplot as plt

raster = plt.imread('xyz.tif')
for i in range(x1, x2):
    for j in range(y1, y2):
        pixel = raster[i][j]
This works, but I have to read a lot of pixels, so I would like it to be fast (and, ideally, energy-efficient). Is there a better library than matplotlib for this purpose? For example, could a library specialized for matrix operations, such as pandas, help? And would another language such as C give better performance than Python?
Edit: @cgohlke in the comments and others have found that cv2 is slower than tifffile for large and/or compressed images. It is best to test the different options on realistic data for your application.
I have found cv2 to be the fastest library for this. Reading 5000 128x128 uint16 tif images gives the following results:
import time
import matplotlib.pyplot as plt

t0 = time.time()
for file in files:
    raster = plt.imread(file)
print(f'{time.time()-t0:.2f} s')
1.52 s
import time
import numpy as np
from PIL import Image

t0 = time.time()
for file in files:
    im = np.array(Image.open(file))
print(f'{time.time()-t0:.2f} s')
1.42 s
import time
import tifffile

t0 = time.time()
for file in files:
    im = tifffile.imread(file)
print(f'{time.time()-t0:.2f} s')
1.25 s
import time
import cv2

t0 = time.time()
for file in files:
    im = cv2.imread(file, cv2.IMREAD_UNCHANGED)
print(f'{time.time()-t0:.2f} s')
0.20 s
cv2 is a computer vision library written in C++, which, as the other commenter mentioned, is much faster than pure Python. Note the cv2.IMREAD_UNCHANGED flag: without it, cv2 converts monochrome images to 8-bit RGB.
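If you want to extend the comparison to more libraries, the four loops above can be folded into one helper so every reader is timed identically; a small sketch, where files is assumed to be the same list of paths used above:

import time

def time_reader(read, files, label):
    # Time a single pass of `read` over every file in the list.
    t0 = time.perf_counter()
    for file in files:
        read(file)
    print(f'{label}: {time.perf_counter() - t0:.2f} s')

# e.g. time_reader(tifffile.imread, files, 'tifffile')
# e.g. time_reader(lambda f: cv2.imread(f, cv2.IMREAD_UNCHANGED), files, 'cv2')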
I am not sure which library is the fastest, but I have had very good experience with Pillow:
from PIL import Image
raster = Image.open('xyz.tif')
then you could convert it to a numpy array:
import numpy
pixels = numpy.array(raster)
I would need to see the rest of the code to recommend any other libraries. As for the language, C or C++ would give better performance, as they are low-level languages, so it depends on how complex your operations are and how much data you need to process; C++ programs have been shown to be 10-200x faster, with the gap increasing with the complexity of the calculations. Hope this helps; if you have any further questions, just ask.

How to load images from memory to numpy using file system

I want to store my image directory in memory, then load the images into a numpy array.
The usual way to load images that are not in memory is as follows:
import PIL.Image
import numpy as np
image = PIL.Image.open("./image_dir/my_image_1.jpg")
image = np.array(image)
However, I am not sure how to do this when the images are in memory. So far, I have been able to set up the following starter code:
import fs
import fs.copy
import fs.memoryfs
import fs.osfs

image_dir = "./image_dir"
mem_fs = fs.memoryfs.MemoryFS()
drv_fs = fs.osfs.OSFS(image_dir)
fs.copy.copy_fs(drv_fs, mem_fs)
print(mem_fs.listdir('.'))
Returns:
['my_image_1.jpg', 'my_image_2.jpg']
How do I load images that are in memory into numpy?
I am also open to alternatives to the fs package.
As per the documentation, Pillow's Image.open accepts a file object instead of a file name, so as long as your in-memory file package provides Python file objects (which it most likely does), you can just use them. If it doesn't, you could even just wrap them in a class that provides the required methods. Assuming you are using PyFilesystem, according to its documentation you should be fine.
So, you want something like:
import numpy as np
import PIL.Image
import fs.memoryfs
import fs.osfs
import fs.copy

mem_fs = fs.memoryfs.MemoryFS()
drv_fs = fs.osfs.OSFS("./image_dir")
fs.copy.copy_file(drv_fs, './my_image_1.jpg', mem_fs, 'test.jpg')

with mem_fs.openbin('test.jpg') as f:
    image = PIL.Image.open(f)
    image = np.array(image)
(Note I used copy_file because I tested with a single file; you can use copy_fs if you need to copy the entire tree. It's the same principle.)
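For completeness, a sketch of the copy_fs variant, reusing the same objects as above and assuming a flat directory containing only images:

# Copy the whole tree into memory, then open each file with openbin.
fs.copy.copy_fs(drv_fs, mem_fs)
images = []
for name in mem_fs.listdir('.'):
    with mem_fs.openbin(name) as f:
        images.append(np.array(PIL.Image.open(f)))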

How to effectively store a very large list in python

Question: I have a big collection of 3D images that I would like to store in one file. How should I do this efficiently?
Background: The dataset contains about 1,000 3D MRI images, each 256 by 256 by 156. To avoid frequently opening and closing files, I was trying to store all of them in one big list and export it.
So far I have tried reading each MRI in as a 3D numpy array and appending it to a list. When I tried to save the list using numpy.save, it consumed all my memory and exited with "Memory Error".
Here is the code I tried:
import numpy as np
import nibabel as nib
import os

data = []
file_list = os.listdir('path/to/files')
for file in file_list:
    mri = nib.load(os.path.join('path/to/files', file))
    mri_array = np.array(mri.dataobj)
    data.append(mri_array)
np.save('imported.npy', data)
Expected Outcome:
Is there a better way to store such dataset without consuming too much memory?
Using the HDF5 file format or Numpy's memmap are the two options I would reach for first if you want to put all your data into one file. Neither option loads all the data into memory.
Python has the h5py package for handling HDF5 files. It has a lot of features, and I would generally lean toward this option. It would look something like this:
import h5py

with h5py.File('data.h5', 'w') as h5file:  # 'w' opens the file for writing
    for n, image in enumerate(mri_images):
        h5file[f'image{n}'] = image
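Reading the file back is just as incremental; a minimal sketch, assuming the file written above:

import h5py

with h5py.File('data.h5', 'r') as h5file:
    names = list(h5file.keys())   # ['image0', 'image1', ...]
    first = h5file['image0'][:]   # only this one image is loaded into memory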
memmap works with raw binary files, so it is not really feature-rich at all. It would look something like this:
import numpy as np

bin_file = np.memmap('data.bin', mode='w+', dtype=int, shape=(1000, 256, 256, 156))
for n, image in enumerate(mri_images):
    bin_file[n] = image
del bin_file  # dumps data to file
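To read it back, re-open the file with the same dtype and shape; nothing is loaded until you slice. A sketch under the same assumptions as above:

import numpy as np

data = np.memmap('data.bin', mode='r', dtype=int, shape=(1000, 256, 256, 156))
first_volume = np.array(data[0])  # copies just one 256x256x156 volume into RAM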

How to load multiple images in a numpy array?

How do I load the pixels of multiple images in a directory into a numpy array? I have loaded a single image into a numpy array, but I cannot figure out how to load multiple images from a directory. Here is what I have done so far:
from PIL import Image
import numpy as np

image = Image.open('bn4.bmp')
nparray = np.array(image)
This loads a 32*32 matrix. I want to load 100 of the images into a numpy array of size 100*32*32. How can I do that? I know the structure would look something like this:
from os import listdir

for filename in listdir("BengaliBMPConvert"):
    if filename.endswith(".bmp"):
        -----------------
    else:
        continue
But I cannot figure out how to load the images into a numpy array.
Getting a list of BMP files
To get a list of BMP files from the directory BengaliBMPConvert, use:
import glob
filelist = glob.glob('BengaliBMPConvert/*.bmp')
On the other hand, if you know the file names already, just put them in a sequence:
filelist = 'file1.bmp', 'file2.bmp', 'file3.bmp'
Combining all the images into one numpy array
To combine all the images into one array:
x = np.array([np.array(Image.open(fname)) for fname in filelist])
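If memory is tight, an alternative is to preallocate the output array and fill it image by image, which avoids building an intermediate list. A sketch, assuming every image is 32x32 greyscale as in the question:

import numpy as np
from PIL import Image

x = np.empty((len(filelist), 32, 32), dtype=np.uint8)  # assumed shape and dtype
for i, fname in enumerate(filelist):
    x[i] = np.array(Image.open(fname))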
Pickling a numpy array
To save a numpy array to file using pickle:
import pickle
pickle.dump( x, filehandle, protocol=2 )
where x is the numpy array to be saved, filehandle is the handle for the pickle file, such as open('filename.p', 'wb'), and protocol=2 tells pickle to use its current format rather than some ancient out-of-date format.
Alternatively, numpy arrays can be pickled using methods supplied by numpy (hat tip: tegan). To dump array x in file file.npy, use:
x.dump('file.npy')
To load array x back in from file (recent numpy versions require allow_pickle=True here, because dump pickles the array):
x = np.load('file.npy', allow_pickle=True)
For more information, see the numpy docs for dump and load.
Use OpenCV's imread() function together with os.listdir(), like
import numpy as np
import cv2
import os

instances = []

# Load in the images
for filepath in os.listdir('images/'):
    instances.append(cv2.imread('images/{0}'.format(filepath), 0))

print(type(instances[0]))
<class 'numpy.ndarray'>
This gives you a list (instances) in which all the greyscale values of the images are stored. For colour images, simply use .format(filepath), 1, i.e. pass 1 as the flag instead of 0.
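If you then want a single array rather than a list, and all the images share one shape, something like this should work:

# Stack the list of 2-D greyscale arrays into one (N, H, W) array.
images = np.stack(instances)
print(images.shape)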
I would just like to share two resources: split_folder, which splits a dataset into train, test and validation sets, and a code snippet from Medium by muskulpesent that creates numpy arrays out of the images residing in the respective folders.

support vector machines for classifying images

I am trying to use SVMs to classify a set of images I have on my computer into 3 categories.
I am just facing the problem of how to load the data, since the following example uses a data set that is already saved:
http://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html
I have all the images saved in png format in a folder on my PC.
You can load data as numpy arrays using Pillow, in this way:
from PIL import Image
import numpy as np
data = np.array(Image.open('yourimg.png')) # .astype(float) if necessary
couple it with os.listdir to read multiple files, e.g.
import os

for file in os.listdir('your_dir/'):
    img = Image.open(os.path.join('your_dir/', file))
    data = np.array(img)
    your_model.train(data)
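To mirror the digits example from the link, each image then has to be flattened into a one-dimensional feature vector before fitting the SVM. A minimal sketch, where the folder names are hypothetical and all images are assumed to share one size:

import os
import numpy as np
from PIL import Image
from sklearn import svm

X, y = [], []
for label in ['class_a', 'class_b', 'class_c']:    # hypothetical class folders
    folder = os.path.join('your_dir', label)
    for name in os.listdir(folder):
        img = Image.open(os.path.join(folder, name)).convert('L')
        X.append(np.array(img).ravel())            # flatten to one feature vector
        y.append(label)

clf = svm.SVC(gamma=0.001)  # same classifier as the scikit-learn digits example
clf.fit(np.array(X), y)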
