I'm a Python novice but have a decent amount of experience in other languages. I'm using this loop to load in a directory of images for some machine learning, which is why I convert them to numpy arrays. It's very slow, so I must be doing something wrong!
My current code:
import glob
import os

import numpy as np
import PIL.Image

def load_images(src):
    files = []  # accept multiple extensions
    for ext in ('*.gif', '*.png', '*.PNG', '*.jpg', '*.jpeg', '*.JPG', '*.JPEG'):
        files.extend(glob.glob(os.path.join(src, ext)))
    images = []
    for each in files:
        print(each)
        img = PIL.Image.open(each)
        img_array = np.asarray(img)
        images.append(img_array)
    return images

# need to convert from list to numpy array
train_images = np.asarray(load_images(READY_IMAGES))
from multiprocessing import Process

# this is the function to be parallelised
def image_load_here(image_path):
    pass

if __name__ == '__main__':
    # start the worker process and provide your dataset
    p = Process(target=image_load_here, args=(['img1', 'img2', 'img3', 'img4'],))
    p.start()
    p.join()
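As a sketch of how the parallel version could look in practice (my addition, assuming the images are local files and that PIL and NumPy are acceptable dependencies; load_one and load_images_parallel are illustrative names), a multiprocessing.Pool can map a loader function over all the paths:

from multiprocessing import Pool

import numpy as np
import PIL.Image

def load_one(path):
    # worker: open a single image and convert it to a NumPy array
    return np.asarray(PIL.Image.open(path))

def load_images_parallel(paths, workers=4):
    # distribute the paths across a pool of worker processes
    with Pool(processes=workers) as pool:
        return pool.map(load_one, paths)

if __name__ == '__main__':
    arrays = load_images_parallel(['img1.png', 'img2.png', 'img3.png', 'img4.png'])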
Original code link: fastest way to load images in python for processing
Further references:
Fastest image reader? Four ways to open a Satellite image in Python
Efficient image loading
Also, if you're using machine learning with Keras/TensorFlow, you can use a generator to really speed up your loading process; it consumes memory on the go, thereby conserving your RAM for other uses.
Here is an excellent article on the function. You can also visit the official documentation.
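For example, a plain Python generator can yield one batch of arrays at a time instead of materialising the whole dataset. This is only a rough sketch (the directory pattern, batch size, and the assumption that all images share one shape are mine):

import glob
import os

import numpy as np
import PIL.Image

def batch_generator(src, batch_size=32):
    # yield batches of image arrays lazily instead of loading everything up front
    paths = glob.glob(os.path.join(src, '*.png'))
    for start in range(0, len(paths), batch_size):
        batch_paths = paths[start:start + batch_size]
        # np.stack assumes all images in the batch have the same shape
        yield np.stack([np.asarray(PIL.Image.open(p)) for p in batch_paths])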
I have a problem and don't know how to solve it:
I'm learning how to analyze DICOM files with Python.
I have a single exam from a single patient: 200 DICOM files, each 512x512, with each file representing a different slice of the scan. I want to turn them into a single .npy archive so I can use it in another tutorial that I found online.
Many tutorials convert them to jpg or png with OpenCV first, but I don't want this, since I'm not interested in a viewer-friendly image right now; I need the array. That step also ruins the image quality.
I already know that using:
medical_image = pydicom.read_file(file_path)
image = medical_image.pixel_array
I can grab the path, turn one slice into a pixel array and then use it, but the thing is, it doesn't work in a for loop.
The for loop I tried was basically this:
image = []  # to create an empty list
for f in glob.iglob('file_path'):
    img = pydicom.dcmread(f)
    image.append(img)
It results in a list containing all the files. Up to here it goes well, but it doesn't seem to be the right way: I can use the list, but I can't find the supposed next steps anywhere, nor answers to the errors I get at this point, so I concluded it was wrong.
The following code snippet lets you read DICOM files from a folder dir_path and store them in a list. Actually, the list does not contain the raw DICOM files; it is filled with NumPy arrays of Hounsfield units (obtained via the apply_modality_lut function).
import os
from pathlib import Path

import pydicom
from pydicom.pixel_data_handlers import apply_modality_lut

dir_path = r"path\to\dicom\files"
dicom_set = []
for root, _, filenames in os.walk(dir_path):
    for filename in filenames:
        dcm_path = Path(root, filename)
        if dcm_path.suffix == ".dcm":
            try:
                dicom = pydicom.dcmread(dcm_path, force=True)
            except IOError as e:
                print(f"Can't import {dcm_path.stem}")
            else:
                hu = apply_modality_lut(dicom.pixel_array, dicom)
                dicom_set.append(hu)
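Since the goal in the question is a single .npy archive, the list of Hounsfield-unit arrays can then be stacked into one volume and saved. This is a small sketch of my own (the output file name is just an example, and it assumes all slices share the same shape):

import numpy as np

# stack the per-slice Hounsfield arrays into one (slices, rows, cols) volume
volume = np.stack(dicom_set, axis=0)

# persist it as a single .npy archive for use in the other tutorial
np.save("patient_volume.npy", volume)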
You were well on your way. You just have to build up a volume from the individual slices that you read in. This code snippet will create a pixelVolume of dimension 512x512x200 if your data is as advertised.
import glob

import numpy
import pydicom

images = []  # to create an empty list

# Read all of the DICOM images from file_path into list "images"
for f in glob.iglob('file_path'):
    image = pydicom.dcmread(f)
    images.append(image)

# Use the first image to determine the number of rows and columns
repImage = images[0]
rows = int(repImage.Rows)
cols = int(repImage.Columns)
slices = len(images)

# This tuple represents the dimensions of the pixel volume
volumeDims = (rows, cols, slices)

# Allocate storage for the pixel volume
pixelVolume = numpy.zeros(volumeDims, dtype=repImage.pixel_array.dtype)

# Fill in the pixel volume one slice at a time
for i, image in enumerate(images):
    pixelVolume[:, :, i] = image.pixel_array

# Use pixelVolume to do something interesting
I don't know if you are a DICOM expert or a DICOM novice, but I am just accepting your claim that your 200 images make sense when interpreted as a volume. There are many ways this may fail: the slices may not be in the expected order, or there may be multiple series in your study. But I am guessing you have a "nice" DICOM dataset, maybe one used for tutorials, and that this code will help you take a step forward.
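If the slice order does turn out to be a problem, one common fix (assuming each file carries the usual InstanceNumber tag) is to sort the list before filling the volume:

# sort the slices by their InstanceNumber tag so the volume is filled in order
images.sort(key=lambda ds: int(ds.InstanceNumber))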
I have a bunch of images in a directory and want to make a single ndarray of those images, like CIFAR-10. I have written a brute-force way to do so, but it gets really slow when the number of images is large.
import os

import cv2
import numpy as np

def dataLoader(data_path):
    data = np.empty([1, 1200, 900, 3])  # size of an image is 1200*900 with RGB
    i = 0
    for filename in os.listdir(data_path):
        if filename.endswith(".jpg") or filename.endswith(".png"):
            target = cv2.imread(os.path.join(data_path, filename))
            data = np.concatenate([data, target.reshape([1, 1200, 900, 3])], axis=0)
            print(i, end='\r')
            i += 1
    return data
I am checking the progress by printing the loop count. It is fairly quick for the first 50 iterations, but it gets slower and slower as it iterates. I suppose it is due to the numpy concatenation. Is there a better way to do this?
I would also really appreciate a way to save the ndarray so that I don't need to rebuild it every time. Currently, I'm using numpy.save.
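A sketch of the usual fix (my addition, not from the original thread): append the per-image arrays to a plain Python list and stack them once at the end, which avoids the repeated copying that np.concatenate causes on every iteration, then persist the result with numpy.save so it can be reloaded with numpy.load:

import os

import cv2
import numpy as np

def dataLoader(data_path):
    arrays = []
    for filename in os.listdir(data_path):
        if filename.endswith(".jpg") or filename.endswith(".png"):
            # collect each image in a list; copying happens only once in np.stack
            arrays.append(cv2.imread(os.path.join(data_path, filename)))
    return np.stack(arrays)  # shape: (num_images, 1200, 900, 3)

data = dataLoader("images/")          # "images/" is a placeholder path
np.save("dataset.npy", data)          # write the array to disk once
data = np.load("dataset.npy")         # reload it later without rebuilding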
I need to run a function on images in less than 1 second. The problem is that, for a 1000x1000 image, just loading it as a matrix into the program takes about 1 second.
The function I use to load it is as follows:
import png

def load(fname):
    with open(fname, mode='rb') as f:
        reader = png.Reader(file=f)
        w, h, png_img, _ = reader.asRGB8()
        img = []
        for line in png_img:
            l = []
            for i in range(0, len(line), 3):
                l += [(line[i], line[i+1], line[i+2])]
            img += [l]
        return img
How can I modify it so that opening the image takes only a few milliseconds?
IMPORTANT NOTE: I cannot import functions other than this one (this is a university exercise, so there are rules -.-), so I have to come up with something myself.
You can use PIL to do this for you; it's highly optimized and fast.
from PIL import Image

def load(path):
    return Image.open(path)
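Note that Image.open on its own only returns a lazy Image object. If the rest of the exercise expects the same nested list of (r, g, b) tuples that the original load produced, a rough equivalent (my addition, not part of the original answer) would be:

from PIL import Image

def load(path):
    # decode the pixels and regroup them into rows of (r, g, b) tuples
    img = Image.open(path).convert('RGB')
    w, h = img.size
    pixels = list(img.getdata())  # flat, row-major list of (r, g, b) tuples
    return [pixels[row * w:(row + 1) * w] for row in range(h)]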
Building the rows pixel by pixel in a Python loop is slow - read about Shlemiel the painter's algorithm. You can replace the inner loop with slicing and zip:
for line in png_img:
    img.append(list(zip(line[0::3], line[1::3], line[2::3])))
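Put back into the original function, this looks roughly like the following (a sketch, still using only the png module the exercise already allows):

import png

def load(fname):
    with open(fname, mode='rb') as f:
        reader = png.Reader(file=f)
        w, h, png_img, _ = reader.asRGB8()
        # build each row with slicing and zip instead of a per-pixel Python loop
        return [list(zip(line[0::3], line[1::3], line[2::3])) for line in png_img]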
I'm not sure it is remotely possible to run a Python script that opens a file, etc., in just a few ms. On my computer, even the simplest program takes several tens of milliseconds.
Without knowing more about the specifics of your problem and the reasons for your constraint, it is hard to answer. You should consider what you are trying to do, in the context of the way your program really works, and then formulate a strategy to achieve your goal.
The total context here is, you're asking the computer to:
run python, load your code and interpret it
load any modules you want to use
find your image file and read it from disk
give those bytes some meaning as an image abstraction - parse them, etc.
do some kind of transform or "work" on the image
export your result in some way
You need to figure out which of those steps is it that really needs to be lightning fast. After that, maybe someone can make a suggestion.
I am trying to convert the celebA dataset (https://www.kaggle.com/jessicali9530/celeba-dataset) images folder into a numpy array, to be converted later into a .pkl file (so the data can be used as simply as MNIST or CIFAR).
I would like to find a better way of converting, since this method consumes all of my RAM.
from PIL import Image
import pickle
from glob import glob
import numpy as np

TARGET_IMAGES = "img_align_celeba/*.jpg"

def generate_dataset(glob_files):
    dataset = []
    for _, file_name in enumerate(sorted(glob(glob_files))):
        img = Image.open(file_name)
        pixels = list(img.getdata())
        dataset.append(pixels)
    return np.array(dataset)

celebAdata = generate_dataset(TARGET_IMAGES)
I am rather curious how the MNIST authors did this themselves, but any approach that works is welcome.
You can transform any kind of data on the fly in Keras and load it into memory one batch at a time during training.
See the documentation and search for 'Example of using .flow_from_directory(directory)'.
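A minimal sketch of that pattern (the directory layout, image size, and batch size here are placeholders; flow_from_directory expects one subfolder per class):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# rescale pixel values and stream batches straight from the directory
datagen = ImageDataGenerator(rescale=1.0 / 255)
train_generator = datagen.flow_from_directory(
    "img_align_celeba_classes/",   # hypothetical folder with one subfolder per class
    target_size=(218, 178),
    batch_size=32,
    class_mode=None,               # no labels, just images
)

# model.fit(train_generator, ...) would then consume one batch at a time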
I want to make a generator that generates batches of images from URLs to train a Keras model. I have another generator that feeds me image URLs.
What I currently do is download the image to disk and then load the image from the disk.
import os
import urllib.request

# "image" here is assumed to be keras.preprocessing.image (see the answer below)
from keras.preprocessing import image

def loadImage(URL):
    with urllib.request.urlopen(URL) as url:
        with open('temp.jpg', 'wb') as f:
            f.write(url.read())
    img_path = 'temp.jpg'
    img = image.load_img(img_path, target_size=(125, 125))
    os.remove(img_path)
    x = image.img_to_array(img)
    return x

def imageGenerator(batch_size):
    i = 0
    batch = []
    for URL in imageUrlGenerator():
        if i > batch_size:
            yield batch
            batch = []
            i = 0
        batch.append(loadImage(URL))
        i += 1
This works, but I wonder if there isn't a faster way to load images from the web without having to write them to disk and read them back.
Assuming you are actually using Keras and that this image.load_img is the method you are calling, it ultimately calls PIL.Image.open. In the documentation for PIL.Image.open, the first argument fp can be a string filename (which is what you are currently passing) or a stream-like object that implements read, seek, and tell. While the object returned by urllib.request.urlopen exposes all three methods, its seek is not actually usable, so it cannot be passed directly. However, the entire buffer can be read into a BytesIO object, which does implement seek, so it can be used. Putting this together, your loadImage function can be reduced to something like the following:
from io import BytesIO

def loadImage(URL):
    with urllib.request.urlopen(URL) as url:
        img = image.load_img(BytesIO(url.read()), target_size=(125, 125))
    return image.img_to_array(img)
This keeps the images downloaded fully in memory.
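A quick usage sketch (my addition; the model and train_on_batch call stand in for whatever consumes the batches):

import numpy as np

# the generator from the question now yields batches built entirely in memory
for batch in imageGenerator(batch_size=32):
    model.train_on_batch(np.array(batch))  # "model" is a hypothetical Keras model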
This is the simplest solution I've seen.
from PIL import Image
from urllib import request
from io import BytesIO
import matplotlib.pyplot as plt

url = "https://github.com/ironmanciti/MachineLearningBasic/blob/master/datasets/TransferLearningData/watch.jpg?raw=true"
res = request.urlopen(url).read()
Sample_Image = Image.open(BytesIO(res)).resize((150, 150))
plt.imshow(Sample_Image)
Got this from Github issues
from io import BytesIO

from PIL import Image
import requests
# img_to_array comes from Keras (assumed, since the surrounding answers use it)
from keras.preprocessing.image import img_to_array

def loadImage(url):
    response = requests.get(url)
    img_bytes = BytesIO(response.content)
    img = Image.open(img_bytes)
    img = img.convert('RGB')
    img = img.resize((250, 250), Image.NEAREST)
    img = img_to_array(img)
    return img
2 Quick fixes:
Consider moving the os.remove(img_path) call. I assume this removes the file from your drive, but you could save that step for the end. Your model just needs all the information as quickly as possible; once it has it, you can either start deleting the files asynchronously or wait until the model is trained and clean up. Deleting them one by one as you do now may be slowing you down.
Use fast storage devices and configurations, SSD, USB 3.x, USB C, etc.
Other fixes:
Is it possible to hold things in a cache?
Could you hold things in an array? I don't think so, but it may be possible.
Do you need the whole image? Could you decrease the quality of the image?
How nested is the image? Parsing for the image likely isn't an issue, but it can't hurt to check.