How to create X_train and y_train from very large datasets? - python

I'm trying to work on the RSNA Screening Mammography Breast Cancer Detection competition on Kaggle. There is a lot of image data. The data is laid out as 11,000 folders, each containing 4 images. Below is the loop I'm using to create X_train and y_train:
import os
import cv2
import numpy as np
import pydicom
from tqdm import tqdm

def read_xray(file_path, img_size=None):
    dicom = pydicom.dcmread(file_path)
    img = dicom.pixel_array
    # MONOCHROME1 stores inverted intensities, so flip them back
    if dicom.PhotometricInterpretation == "MONOCHROME1":
        img = np.max(img) - img
    if img_size:
        img = cv2.resize(img, img_size)
    # Add channel dimension first: (1, H, W)
    img = img[np.newaxis]
    # Scale to [0, 1] and convert to float32
    img = img / np.max(img)
    img = img.astype("float32")
    return img
X_train = []
y_train = []

def create_train_datasets(img_path, df):
    for folder in tqdm(os.listdir(img_path)):
        folder_path = os.path.join(img_path, folder)
        for img in os.listdir(folder_path):
            im_path = os.path.join(folder_path, img)
            image = read_xray(im_path, img_size=(512, 512))
            X_train.append(image)
            # folder names are patient IDs
            label = df.loc[df['patient_id'] == int(folder), 'cancer']
            y_train.append(label)
But this is very slow and would take hours to iterate through the dataset. Eventually I will convert X_train and y_train into PyTorch tensors and create a DataLoader. Is there any way to make this faster? Or is there a better way of creating the training dataset?
I have looked through some of the other notebooks for this competition on Kaggle, but I'm looking for a simpler approach to creating the training dataset.
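One way to avoid both the slowness and the memory blow-up is to not materialize X_train at all: wrap the folder walk in a torch.utils.data.Dataset that reads and preprocesses a single DICOM in __getitem__, and let a DataLoader with several workers decode images in parallel. A minimal sketch, reusing read_xray from above and assuming (as in the loop above) one cancer label per patient_id:

import os
import torch
from torch.utils.data import Dataset, DataLoader

class MammoDataset(Dataset):
    """Reads one DICOM per __getitem__ instead of preloading everything."""
    def __init__(self, img_path, df, img_size=(512, 512)):
        self.samples = []
        self.img_size = img_size
        for folder in os.listdir(img_path):
            folder_path = os.path.join(img_path, folder)
            # folder names are patient IDs; take that patient's label
            label = df.loc[df['patient_id'] == int(folder), 'cancer'].iloc[0]
            for img in os.listdir(folder_path):
                self.samples.append((os.path.join(folder_path, img), label))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        image = read_xray(path, img_size=self.img_size)  # (1, H, W) float32
        return torch.from_numpy(image), torch.tensor(label, dtype=torch.float32)

train_ds = MammoDataset(img_path, df)
train_dl = DataLoader(train_ds, batch_size=32, shuffle=True, num_workers=4)

With num_workers > 0, the DICOM decoding and resizing happen in background worker processes and nothing is held in RAM beyond the batches in flight. If that is still too slow, converting the DICOMs to PNG once up front is a common preprocessing step for this competition.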

Related

How to create train, test & validation splits of tf.data.Dataset in tf 2.1.0

The following code is copied from https://www.tensorflow.org/tutorials/load_data/images. It aims to create a dataset of images downloaded from the web and stored in folders according to their classes; please refer to the link above for the whole context!
list_ds = tf.data.Dataset.list_files(str(data_dir/'*/*'))

for f in list_ds.take(5):
    print(f.numpy())

def get_label(file_path):
    # convert the path to a list of path components
    parts = tf.strings.split(file_path, os.path.sep)
    # The second to last is the class-directory
    return parts[-2] == CLASS_NAMES

def decode_img(img):
    # convert the compressed string to a 3D uint8 tensor
    img = tf.image.decode_jpeg(img, channels=3)
    # Use `convert_image_dtype` to convert to floats in the [0,1] range.
    img = tf.image.convert_image_dtype(img, tf.float32)
    # resize the image to the desired size.
    return tf.image.resize(img, [IMG_WIDTH, IMG_HEIGHT])

def process_path(file_path):
    label = get_label(file_path)
    # load the raw data from the file as a string
    img = tf.io.read_file(file_path)
    img = decode_img(img)
    return img, label

# Set `num_parallel_calls` so multiple images are loaded/processed in parallel.
labeled_ds = list_ds.map(process_path, num_parallel_calls=AUTOTUNE)

for image, label in labeled_ds.take(1):
    print("Image shape: ", image.numpy().shape)
    print("Label: ", label.numpy())

def prepare_for_training(ds, cache=True, shuffle_buffer_size=1000):
    # This is a small dataset, only load it once, and keep it in memory.
    # use `.cache(filename)` to cache preprocessing work for datasets that don't
    # fit in memory.
    if cache:
        if isinstance(cache, str):
            ds = ds.cache(cache)
        else:
            ds = ds.cache()
    ds = ds.shuffle(buffer_size=shuffle_buffer_size)
    # Repeat forever
    ds = ds.repeat()
    ds = ds.batch(BATCH_SIZE)
    # `prefetch` lets the dataset fetch batches in the background while the model
    # is training.
    ds = ds.prefetch(buffer_size=AUTOTUNE)
    return ds

train_ds = prepare_for_training(labeled_ds)
We are finally left with train_ds, a PrefetchDataset object that contains the entire dataset of (image, label) pairs!
How to split train_ds into train, test & validation sets to feed it into a model?
After the ds.repeat() call the dataset is infinite, and splitting an infinite dataset doesn't work very well. Therefore you should split it before the prepare_for_training() call, like this:
labeled_ds = list_ds.map(process_path, num_parallel_calls=AUTOTUNE)
labeled_ds = labeled_ds.shuffle(10000).batch(BATCH_SIZE)
# Size of dataset
n = sum(1 for _ in labeled_ds)
n_train = int(n * 0.8)
n_valid = int(n * 0.1)
n_test = n - n_train - n_valid
train_ds = labeled_ds.take(n_train)
valid_ds = labeled_ds.skip(n_train).take(n_valid)
test_ds = labeled_ds.skip(n_train + n_valid).take(n_test)
The line n = sum(1 for _ in labeled_ds) iterates through the dataset once to get its size (in batches, since batch was applied first); the dataset is then split three ways into 80%/10%/10%.
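One caveat (not part of the original answer): Dataset.shuffle reshuffles on every iteration by default, so take/skip may route different elements into each split on a second pass over the data. Pinning the shuffle order keeps the splits disjoint:

# keep one fixed order so take/skip always see the same elements
labeled_ds = labeled_ds.shuffle(10000, reshuffle_each_iteration=False).batch(BATCH_SIZE)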

How can I properly create my Dataset?

I have the following code:
imagepaths = tf.convert_to_tensor(imagepaths, dtype=tf.string)
labels = tf.convert_to_tensor(labels, dtype=tf.int32)
# Build a TF Queue, shuffle data
image, label = tf.data.Dataset.from_tensor_slices((imagepaths, labels))
and I am getting the following error:
image, label = tf.data.Dataset.from_tensor_slices((imagepaths, labels))
ValueError: too many values to unpack (expected 2)
Shouldn't Dataset.from_tensor_slices see this as the length of the tensor, not the number of inputs? How can I fix this issue or combine the data tensors into the same variable more effectively?
Just for reference:
There are 1800 imagepaths and 1800 labels corresponding to each other. To be clear, the imagepaths are paths to the files where the jpg images are located. My goal after this is to shuffle the dataset and build the neural network model.
That code is right here:
# Read images from disk
image = tf.read_file(image)
image = tf.image.decode_jpeg(image, channels=CHANNELS)
# Resize images to a common size
image = tf.image.resize_images(image, [IMG_HEIGHT, IMG_WIDTH])
# Normalize
image = image * 1.0/127.5 - 1.0
# Create batches
X, Y = tf.train.batch([image, label], batch_size=batch_size,
                      capacity=batch_size * 8,
                      num_threads=4)
Try this:
def transform(entry):
    img = entry[0]
    lbl = entry[1]
    return img, lbl

raw_data = list(zip(imagepaths, labels))
dataset = tf.data.Dataset.from_tensor_slices(raw_data)
dataset = dataset.map(transform)
And if you want to have a look at your dataset you can do it like this:
for e in dataset.take(1):
    print(e)
You can add multiple map functions, and after that you can use shuffle and batch on your dataset to prepare it for training ;)
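As a side note (not part of the original answer), from_tensor_slices also accepts a tuple of tensors directly and slices them in lockstep, which avoids the zip/map step entirely. The ValueError in the question comes from unpacking the returned Dataset object into two variables at construction time; the loop below assumes eager execution (TF 2.x):

# from_tensor_slices returns a single Dataset whose elements are
# (imagepath, label) pairs; don't unpack it at construction time
dataset = tf.data.Dataset.from_tensor_slices((imagepaths, labels))
for image_path, label in dataset.take(1):
    print(image_path.numpy(), label.numpy())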

My input pipeline with tf.data is converting images to negatives. How can I stop it?

I am making a tf.data pipeline that produces triplets, and while testing each dataset, it is producing negatives of the images.
I have already tried to normalize the images, but I don't know what I am doing wrong.
All the code is provided in this gist.
But here is the gist of what I am doing:
def _read_image_and_resize(self, same_img_1, same_img_2, diff_img_1):
    target_size = [self.img_height, self.img_width]
    # read images from disk
    img1_file = tf.io.read_file(same_img_1)
    img2_file = tf.io.read_file(same_img_2)
    img3_file = tf.io.read_file(diff_img_1)
    img1 = tf.image.decode_jpeg(img1_file, channels=3)
    img2 = tf.image.decode_jpeg(img2_file, channels=3)
    img3 = tf.image.decode_jpeg(img3_file, channels=3)
    img1_resized = tf.image.resize_images(img1, target_size)
    img2_resized = tf.image.resize_images(img2, target_size)
    img3_resized = tf.image.resize_images(img3, target_size)
    # return the resized images
    return img1_resized, img2_resized, img3_resized
And here is the negative image I get as well as the positive result I want.
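Without running the gist it's hard to be definitive, but one frequent culprit (an assumption here, since the display code isn't shown above) is the plotting step rather than the pipeline: tf.image.resize_images returns float32 values still in the [0, 255] range, while matplotlib's imshow interprets float input as [0, 1] and clips it, which produces strange-looking renderings. Casting back to uint8 before plotting rules this out:

# render the resized tensor as an ordinary 8-bit image
# (assumes eager execution; evaluate the tensor in a session otherwise)
import matplotlib.pyplot as plt
plt.imshow(img1_resized.numpy().astype("uint8"))
plt.show()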

Loading images and eval() in tensorflow are super slow

X = []
filelist = gfile.ListDirectory(path_imgs)
for filename in filelist:
    path_filename = path_imgs + filename
    image_file = file_io.FileIO(path_filename, 'rb')
    image_raw = image_file.read()
    img = tf.image.decode_image(image_raw)
    img = tf.image.convert_image_dtype(img, tf.float32)
    img = tf.image.resize_image_with_pad(img, img_size, img_size, method=1).eval(session=tf.Session())
    X.append(img)
imgs = np.array(X)
I tried some things with the session, but it didn't work. It should probably be handled differently, but I don't know how. Any ideas?
EDIT:
Yes, I want to train an ANN to segment objects in images.
There are folders with images and their masks. The dataset size is in the 1000s and could be in the 10s of 1000s.
I need a single numpy array of images, which will be saved and used later as the dataset for model training.
def getImageData(fileNameList):
    imageData = []
    for fn in fileNameList:
        testImage = Image.open(fn)
        testImage.show()
        imageData.append(np.array(testImage))
    return np.array(imageData, dtype=np.float32)

imageFn = ("dog.png",)
imageData = getImageData(imageFn)
You must import the following:
import tensorflow as tf
from PIL import Image
import numpy as np
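For completeness: the slowness in the original loop comes from the fact that every iteration adds new decode/resize ops to the default graph and then spins up a fresh tf.Session just to evaluate them, so each pass gets slower and slower. If TensorFlow preprocessing is wanted at all, a tf.data pipeline keeps everything in one runtime. A sketch, assuming TF 2.x and the same path_imgs/img_size variables as above:

import numpy as np
import tensorflow as tf

def load_and_pad(path):
    raw = tf.io.read_file(path)
    img = tf.io.decode_image(raw, expand_animations=False)
    img = tf.image.convert_image_dtype(img, tf.float32)
    # pad-and-resize to a common square size so the results stack cleanly
    return tf.image.resize_with_pad(img, img_size, img_size)

paths = [path_imgs + f for f in filelist]
ds = tf.data.Dataset.from_tensor_slices(paths).map(
    load_and_pad, num_parallel_calls=tf.data.AUTOTUNE)
imgs = np.stack([img.numpy() for img in ds])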

Query a kNN with pictures from outside the original cluster

I'm trying to use the code from a public repository to train a kNN model with a set of images. It originally works by computing the similarity between all the images in the cluster. But I'd like to use a new image (not included in the model) and get the most similar images from the original cluster.
This is the code that trains the original kNN:
for f in os.listdir(path):
    # Process filename
    filename = os.path.splitext(f)         # filename in directory
    filename_full = os.path.join(path, f)  # full path filename
    head, ext = filename[0], filename[1]
    if ext.lower() not in [".jpg", ".jpeg"]:
        continue
    # Read image file
    img = image.load_img(filename_full, target_size=(224, 224))  # load
    imgs.append(np.array(img))   # image
    filename_heads.append(head)  # filename head
    # Pre-process for model input
    img = process_image(img)
    features = model.predict(img).flatten()  # features
    eX.append(features)  # append feature vector

X = np.array(eX)       # feature vectors
imgs = np.array(imgs)  # images

n_neighbours = 5 + 1
knn = kNN()  # kNN model
knn.compile(n_neighbors=n_neighbours, algorithm="brute", metric="cosine")
knn.fit(X)
This is my code to query a new image and find similar ones in the original cluster:
# previously I read the image from a url and put it in the img variable
img = image.load_img('db/temp.jpg', target_size=(224, 224)) # load
img = image.img_to_array(img) # convert to array
img = np.expand_dims(img, axis=0)
img = preprocess_input(img)
img_features = model.predict(img).flatten() # features
distances, indices = knn.predict(img_features)
The problem is that I get an "IndexError: tuple index out of range" error when I run knn.predict(img_features). I've already looked at the shape and type of img_features and they're all the same, so I don't really know why this error appears. Maybe the error is because the kNN used here is not a classifier, but I don't know how to adapt it to make it work.
Full code link just in case you want to check it out.
The problem was that I had to pass the matrix this way:
distances, indices = knn.predict(np.array([img_features]))
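That matches the scikit-learn convention (assuming the repo's kNN is a wrapper around sklearn.neighbors.NearestNeighbors): query points must be a 2-D array of shape (n_queries, n_features), so a single feature vector needs an extra leading dimension. reshape does the same thing:

# one query of n_features values -> shape (1, n_features)
distances, indices = knn.predict(img_features.reshape(1, -1))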
