Query a kNN with pictures from outside the original cluster - python

I'm trying to use the code from a public repository to train a kNN model on a set of images. As written, it computes the similarity between all the images within the cluster. But I'd like to use a new image (not included in the model) and get the most similar images from the original cluster.
This is the code to train the original kNN:
for f in os.listdir(path):
    # Process filename
    filename = os.path.splitext(f)  # filename in directory
    filename_full = os.path.join(path, f)  # full path filename
    head, ext = filename[0], filename[1]
    if ext.lower() not in [".jpg", ".jpeg"]:
        continue
    # Read image file
    img = image.load_img(filename_full, target_size=(224, 224))  # load
    imgs.append(np.array(img))  # image
    filename_heads.append(head)  # filename head
    # Pre-process for model input
    img = process_image(img)
    features = model.predict(img).flatten()  # features
    eX.append(features)  # append feature extractor
    filename_heads.append(head)
X = np.array(eX)  # feature vectors
imgs = np.array(imgs)  # images
n_neighbours = 5 + 1
knn = kNN()  # kNN model
knn.compile(n_neighbors=n_neighbours, algorithm="brute", metric="cosine")
knn.fit(X)
This is my code to query a new image and find similar ones in the original cluster:
# previously I read the image from a URL and put it in the img variable
img = image.load_img('db/temp.jpg', target_size=(224, 224)) # load
img = image.img_to_array(img) # convert to array
img = np.expand_dims(img, axis=0)
img = preprocess_input(img)
img_features = model.predict(img).flatten() # features
distances, indices = knn.predict(img_features)
The problem is that I get an "IndexError: tuple index out of range" error when I run knn.predict(img_features). I've already looked at the shape and type of img_features and they all look the same, so I don't really know why this error appears. Maybe the error occurs because the kNN used here is not a classifier, but I don't know how to adapt it to make it work.
Full code link just in case you want to check it out.

The problem was that I had to pass the features as a 2D matrix (one row per query image), this way:
distances, indices = knn.predict(np.array([img_features]))
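A rough sketch of why the 2D shape matters. The helper `cosine_knn` below is hypothetical: it is a small NumPy stand-in for the repository's kNN class, whose internals are not shown in the question. It only illustrates that a flattened feature vector has shape `(n_features,)`, so any code that reads `shape[1]` fails with the same IndexError, while `np.array([img_features])` (or `reshape(1, -1)`) gives the `(1, n_features)` matrix a neighbour search expects.

```python
import numpy as np

def cosine_knn(X, queries, n_neighbors):
    """Brute-force cosine k-nearest neighbours (hypothetical helper).

    X:       (n_samples, n_features) matrix of stored feature vectors.
    queries: (n_queries, n_features) matrix, one row per query image.
    Returns (distances, indices), each of shape (n_queries, n_neighbors).
    """
    X = np.asarray(X, dtype=float)
    queries = np.asarray(queries, dtype=float)
    if queries.ndim != 2 or queries.shape[1] != X.shape[1]:
        raise ValueError("queries must have shape (n_queries, n_features)")
    # Normalise rows so that a dot product equals the cosine similarity.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Qn = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    dist = 1.0 - Qn @ Xn.T                       # cosine distance matrix
    idx = np.argsort(dist, axis=1)[:, :n_neighbors]
    return np.take_along_axis(dist, idx, axis=1), idx

# Stand-in data: 4 stored "images" with 3 features each.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
img_features = np.array([0.9, 0.1, 0.0])   # 1D, like model.predict(img).flatten()

print(img_features.shape)                   # (3,)  -> there is no second axis
try:
    img_features.shape[1]
except IndexError as err:
    print(err)                              # tuple index out of range

query = img_features.reshape(1, -1)         # same result as np.array([img_features])
print(query.shape)                          # (1, 3)

distances, indices = cosine_knn(X, query, n_neighbors=2)
print(indices)                              # [[0 3]]
```

The returned `indices` refer to rows of `X`, so they can be used to look up the matching entries in `imgs`.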

Related

How to create X_train and y_train from very large datasets?

I'm trying to work on the RSNA Screening Mammography Breast Cancer Detection competition on Kaggle. There is a lot of image data. The data is set up as 11,000 folders, each with 4 images in it. Below is the loop I'm using to create X_train and y_train:
def read_xray(file_path, img_size=None):
    dicom = pydicom.read_file(file_path)
    img = dicom.pixel_array
    if dicom.PhotometricInterpretation == "MONOCHROME1":
        img = np.max(img) - img
    if img_size:
        img = cv2.resize(img, img_size)
    # Add channel dim at First
    img = img[np.newaxis]
    # Converting img to float32
    img = img / np.max(img)
    img = img.astype("float32")
    return img
X_train = []
y_train = []
def create_train_datasets(img_path, df):
    for folder in tqdm(os.listdir(img_path)):
        folder_path = os.path.join(img_path, folder)
        for img in os.listdir(folder_path):
            im_path = os.path.join(folder_path, img)
            image = read_xray(im_path, img_size=(512, 512))
            X_train.append(image)
            # compare with the folder variable, not the literal string 'folder'
            label = df.loc[df['patient_id'] == folder, 'cancer']
            y_train.append(label)
But this is very slow and would take hours to iterate through the dataset. Eventually I will convert X_train and y_train into PyTorch tensors and create a DataLoader. Is there any way to make this faster? Or is there a better way of creating the training dataset?
I have looked through some of the other notebooks for this competition on Kaggle, but I'm looking for a simpler approach to creating the training dataset.
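One possible direction, sketched under assumptions: instead of loading every image up front, index only the file paths and labels, and read a file when a sample is requested. The class `LazyImageIndex`, the `label_for_folder` dictionary and the dummy files below are all hypothetical. The class exposes `__len__` and `__getitem__`, the same protocol PyTorch map-style datasets use, so in a real project it would subclass `torch.utils.data.Dataset` and take `read_xray` as the loader.

```python
import os
import tempfile

class LazyImageIndex:
    """Stores (path, label) pairs and loads a sample only on demand.

    `loader` is any callable that turns a file path into an array
    (for example the read_xray function from the question).
    """

    def __init__(self, img_path, label_for_folder, loader):
        self.loader = loader
        self.samples = []  # (file path, label) pairs; no pixel data yet
        for folder in sorted(os.listdir(img_path)):
            folder_path = os.path.join(img_path, folder)
            if not os.path.isdir(folder_path):
                continue
            for name in sorted(os.listdir(folder_path)):
                self.samples.append(
                    (os.path.join(folder_path, name), label_for_folder[folder])
                )

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        return self.loader(path), label  # the file is read only here

# Tiny demonstration with dummy files instead of DICOM images.
root = tempfile.mkdtemp()
for folder in ("101", "102"):
    os.makedirs(os.path.join(root, folder))
    for name in ("a.dcm", "b.dcm"):
        with open(os.path.join(root, folder, name), "wb") as fh:
            fh.write(b"\x00" * 8)

labels = {"101": 0, "102": 1}   # hypothetical folder -> cancer label mapping
dataset = LazyImageIndex(root, labels, loader=os.path.getsize)
print(len(dataset))             # 4
print(dataset[3])               # (8, 1)
```

Building the index touches only directory listings, so it stays fast; the per-image cost is paid during training, where a DataLoader with several workers can overlap it with computation.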

Problem reading and augmenting images in tf.data API using CSV / pandas DataFrames

I'm trying to (pre)process and augment my data and target variables when reading in the data each epoch/batch using the tf.data API. My unprocessed data is a CSV/pandas DataFrame with the format
index, img_id, c1, ..., c5
where img_id contains the path to an image and c1, ..., c5 are run-length encodings of different defects in the image; both are strings. To increase the amount of data, I want to augment (e.g. flip) the images (and therefore the defect masks as well) with a certain probability for each image when reading it each batch/epoch. I want to read each image from my drive to save memory, and because this seems to still yield good performance within the API (due to prefetching etc.).
I'm familiar with doing this using PyTorch's DataLoader API (version 1.8.1+cu111), but as this is for a course where I have to use TensorFlow (version 2.4.1), I read up on the tf.data API and concluded that I should do the augmentation and the image reading in the map function. However, even reading the images throws various errors. The following is a mix of the code I've tried; most lines for reading the images are commented out, each with a comment on the line above giving the error message it produces.
import tensorflow as tf
test = tf.data.experimental.make_csv_dataset("data/mini_formatted.csv", batch_size=4)
def map_fn(df_):
    img_path = df_["img_id"]
    masks = restore_masks(df_)  # get maps from RLE with same shape as images
    imgs = []
    # has to be declared before loop with correct shape, used for reading imgs later
    img = np.empty(shape=(256,1600,1), dtype=np.float32)
    # produces TypeError: Can't convert object of type 'Tensor' to 'str' for 'filename'
    img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
    for i in img_path:
        # produces TypeError: Can't convert object of type 'Tensor' to 'str' for 'filename'
        #img = cv2.imread(i, cv2.IMREAD_GRAYSCALE)
        # produces AttributeError: 'NoneType' object has no attribute 'shape'
        #img = cv2.imread(str(i), cv2.IMREAD_GRAYSCALE)
        # produces ValueError: 'img' has shape (256, 1600, 1) before the loop, but shape <unknown> after one iteration. Use tf.autograph.experimental.set_loop_options to set shape invariants.
        #img_file = tf.io.read_file(i)
        #img = tf.io.decode_image(img_file, dtype=tf.float32, channels=1)
        #imgs.append(img)
        pass
    # since img_path is a list, this doesn't work either
    # ValueError: Shape must be rank 0 but is rank 1 for '{{node ReadFile}} = ReadFile[](args_6)' with input shapes: [4].
    img_file = tf.io.read_file(img_path)
    img = tf.io.decode_image(img_file, dtype=tf.float32)
    ##########################################
    #
    # DO AUGMENTING PER BATCH HERE
    #
    ##########################################
    # return augmented images and masks
    return imgs, class_masks
proc_ds = test.map(map_fn)
As you can see, reading the image throws different errors that I do not quite understand, especially because reading the image as follows (i.e. with the exact same commands after getting the first batch from the dataset, without applying the map function) works without problems.
it = test.as_numpy_iterator()
x_proc = it.next()
img_files = [tf.io.read_file(i) for i in x_proc["img_id"]]
imgs = [tf.io.decode_image(img_file, dtype=tf.float32, channels=1) for img_file in img_files]
From my understanding, using the map function on a dataset should execute the code on each example once per epoch, but from the example given it seems the function is executed once per batch, which I tried to work around. This doesn't explain to me why the same code doesn't work inside the map function while working fine outside it.
To help understand what I want to do, I've written a short Dataset/DataLoader in torch as an example of what my desired outputs are.
import cv2
import numpy as np
import torch
import pandas as pd
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, df, mode="train", shuffle=True, augment=False, union=False,
                 greyscale=False, normalize=True):
        self.df = df
        self.length = len(df)
        self.mode = mode
        self.shuffle = shuffle
        self.augment = augment
        self.union = union
        self.greyscale = greyscale
        self.normalize = normalize
    def __len__(self):
        return self.length
    def __getitem__(self, idx_):
        # gets called for a single item when added to batch -> one line of the dataframe
        # in the tf example, these are grouped in an OrderedDict with arrays of length (BATCH_SIZE) as values
        df_ = self.df.loc[idx_]
        img = self._load_img(df_["img_id"])
        if self.union:
            masks = build_masks(df_["c1":"c_all"], union_only=True)
        else:
            masks = build_masks(df_["c1":"c_all"])
        # could also add augmentation here instead of in collate_ds
        if self.mode == "train":
            return {"img": img, "masks": masks}
        return {"img": img, "masks": None}
    def _load_img(self, img_path):
        if self.greyscale:
            img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
        else:
            img = cv2.imread(img_path)
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        if self.normalize:
            img = img.astype(np.float32) / 255.
        else:
            img = img.astype(np.float32)
        return img
    def collate_ds(self, batch):
        # gets called with BATCH_SIZE examples that were processed using __getitem__
        imgs = [d["img"] for d in batch]
        masks = [d["masks"] for d in batch]
        if self.augment:
            # augmentation steps for each image
            pass
        imgs = torch.tensor(imgs, dtype=torch.float32)
        masks = torch.tensor(masks, dtype=torch.float32)
        res = (imgs, masks)
        return res
mini_df = pd.read_csv("data/mini_formatted.csv", index_col=0)
torch_ds = MyDataset(mini_df, mode="train", shuffle=True, augment=False, union=False,
                     greyscale=False, normalize=True)
dataloader = torch.utils.data.DataLoader(torch_ds, batch_size=8, shuffle=True,
                                         collate_fn=torch_ds.collate_ds)
batch = next(iter(dataloader))
print(batch[0].shape, batch[1].shape)
# output: (torch.Size([8, 256, 1600, 3]), torch.Size([8, 256, 1600, 5]))
I still don't understand why even reading the images inside the map function doesn't work. With cv2, neither imread(img_path) (TypeError: Can't convert object of type 'Tensor' to 'str' for 'filename') nor imread(str(i)) (AttributeError: 'NoneType' object has no attribute 'shape', i.e. the image wasn't found) works, and the tf.io.* functions work outside the function but throw errors when the exact same code is executed inside it.
I would be very thankful for any help on what I'm misunderstanding/doing wrong using the map function with the tf.data API and how I could achieve the same results as the provided torch dataloader using the tf.data API.
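A minimal sketch of one way this could be structured, under assumptions. The ValueError quoted above ("Shape must be rank 0 but is rank 1 ... input shapes: [4]") suggests that map_fn receives a whole batch of four paths, because make_csv_dataset batches before map runs. The function passed to map is also traced into a graph, so its arguments are symbolic tensors that cv2.imread cannot turn into a filename. The sketch therefore maps over an unbatched dataset of paths (calling `.unbatch()` on the CSV dataset would be another route), uses only tf.io and tf.image ops inside map, and batches afterwards. The tiny PNG files and the thresholded "mask" are placeholders invented for the demonstration; the real masks would come from the RLE columns.

```python
import os
import tempfile

import tensorflow as tf

# Create two tiny grayscale PNG files so the sketch runs on its own.
tmp_dir = tempfile.mkdtemp()
paths = []
for k in range(2):
    pixels = tf.cast(tf.reshape(tf.range(8) * (k + 1), (2, 4, 1)), tf.uint8)
    path = os.path.join(tmp_dir, "img_%d.png" % k)
    tf.io.write_file(path, tf.io.encode_png(pixels))
    paths.append(path)

def load_and_augment(path):
    # `path` is a scalar string tensor here because the dataset is unbatched.
    img = tf.io.decode_png(tf.io.read_file(path), channels=1)
    img = tf.image.convert_image_dtype(img, tf.float32)   # scale to [0, 1]
    mask = tf.cast(img > 0.01, tf.float32)  # placeholder for the decoded RLE masks
    # Draw one random number so image and mask are flipped together.
    flip = tf.random.uniform(()) < 0.5
    img = tf.cond(flip, lambda: tf.image.flip_left_right(img), lambda: img)
    mask = tf.cond(flip, lambda: tf.image.flip_left_right(mask), lambda: mask)
    return img, mask

ds = (tf.data.Dataset.from_tensor_slices(paths)   # one element per file path
      .map(load_and_augment)                      # runs once per example
      .batch(2)                                   # batch after the per-example work
      .prefetch(1))

for imgs, masks in ds:
    print(imgs.shape, masks.shape)   # (2, 2, 4, 1) (2, 2, 4, 1)
```

Because the random draw happens inside the mapped function, each pass over the dataset makes a fresh flip decision per image, which mirrors doing the augmentation in `__getitem__` of the torch Dataset.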

How can I properly get my Dataset to create?

I have the following code:
imagepaths = tf.convert_to_tensor(imagepaths, dtype=tf.string)
labels = tf.convert_to_tensor(labels, dtype=tf.int32)
# Build a TF Queue, shuffle data
image, label = tf.data.Dataset.from_tensor_slices((imagepaths, labels))
and am getting the following error:
image, label = tf.data.Dataset.from_tensor_slices((imagepaths, labels))
ValueError: too many values to unpack (expected 2)
Shouldn't Dataset.from_tensor_slices see this as the length of the tensor, not the number of inputs? How can I fix this issue or combine the data tensors into the same variable more effectively?
Just for reference:
There are 1800 imagepaths and 1800 labels corresponding to each other. And to be clear, the imagepaths are paths to the files where the JPG images are located. My goal after this is to shuffle the dataset and build the neural network model.
That code is right here:
# Read images from disk
image = tf.read_file(image)
image = tf.image.decode_jpeg(image, channels=CHANNELS)
# Resize images to a common size
image = tf.image.resize_images(image, [IMG_HEIGHT, IMG_WIDTH])
# Normalize
image = image * 1.0/127.5 - 1.0
# Create batches
X, Y = tf.train.batch([image, label], batch_size=batch_size,
capacity=batch_size * 8,
num_threads=4)
Try this:
def transform(entry):
    img = entry[0]
    lbl = entry[1]
    return img, lbl
raw_data = list(zip(imagepaths, labels))
dataset = tf.data.Dataset.from_tensor_slices(raw_data)
dataset = dataset.map(transform)
If you want to have a look at your dataset, you can do it like this:
for e in dataset.take(1):
    print(e)
You can chain multiple map functions, and afterwards use shuffle and batch on your dataset to prepare it for training ;)
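For illustration, a small sketch of what from_tensor_slices returns when it is given a tuple, assuming TensorFlow 2.x with eager execution. The call produces a single Dataset object rather than two tensors, which is why unpacking it into `image, label` fails; each element of the dataset is a (path, label) pair, and map receives the two components as separate arguments. The file names and labels below are made up.

```python
import tensorflow as tf

imagepaths = ["a.jpg", "b.jpg", "c.jpg"]   # hypothetical file names
labels = [0, 1, 0]

# One Dataset object whose elements are (path, label) pairs.
dataset = tf.data.Dataset.from_tensor_slices((imagepaths, labels))

def transform(path, label):
    # The tuple components arrive as separate arguments.
    # File reading and decoding would go here; this sketch only passes them on.
    return path, label

dataset = dataset.map(transform).shuffle(buffer_size=3).batch(2)

for paths, lbls in dataset:
    print(paths.numpy(), lbls.numpy())
```

With three examples and a batch size of 2, the loop prints two batches; their order varies between runs because of the shuffle.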

Tensorflow Pipeline for images and numpy files

I am working with TensorFlow 2.0.0 and am trying to set up an efficient pipeline for feeding in ~90,000 PNG images of size (256, 256, 3) and their labels, which are NumPy arrays of size (256, 256), for an image segmentation problem. These images and labels won't fit fully into memory.
The data are stored in a directory like this:
'C:/Users/user/Documents/data/ims/' #png images
'C:/Users/user/Documents/data/masks/' #img labels/masks
The file names are the same save the extension so for example "test1.png" and "test1.npy" are an image/label pair.
The data are not split into training, validation, and test subsets yet.
I need to get to a point in which I have both the images and labels split into train, validation, and testing subsets, and also have a means to feed the data into a model for training.
I was following this guide here but could not figure out how to deal with the numpy files within the get_label function.
I thought I could write a function that splits the data into subsets via file names alone and then on the fly load the batches via the file names provided, but I can't figure out how to do this efficiently.
I'm currently doing the following, which either fails because the data is too big for memory or is too slow because there are so many files to load; neither is a viable solution.
import tensorflow as tf
import numpy as np
import glob2 as glob
from imageio import imread
base = '/mnt/projects/CNN_Data/clean_data/'
image_path = sorted(glob.glob(base + 'ims/*.png'))
label_path = sorted(glob.glob(base + 'masks/*.npy'))
images = [imread(img).astype(np.float32)/255.0 for img in image_path]
labels = [np.load(path) for path in label_path]
Edit to add:
Here is my attempt following the TensorFlow example that I linked above. It runs, but I can't get get_label to do what I want.
import tensorflow as tf
import numpy as np
import os
AUTOTUNE = tf.data.experimental.AUTOTUNE
base = '/mnt/projects/CNN_Data/clean_data/'
list_ds = tf.data.Dataset.list_files(base + 'ims/*')
def get_label(file_path):
    parts = tf.strings.split(file_path, os.path.sep)
    parts[-2] == 'masks'
    fname = tf.strings.split(parts[-1], '.')[0]
    fname = tf.strings.join([fname, '.npy'])
    parts[-1] == fname
    return parts
def decode_img(img):
    img = tf.image.decode_png(img, channels = 3)
    img = tf.image.convert_image_dtype(img, tf.float32)
    return img
def process_path(file_path):
    label = get_label(file_path)
    img = tf.io.read_file(file_path)
    img = decode_img(img)
    return img, label
labeled_ds = list_ds.map(process_path, num_parallel_calls=AUTOTUNE)
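The idea of splitting by file name alone, mentioned earlier in the question, could look roughly like the sketch below. It handles only path strings, so nothing is loaded into memory. The function `pair_and_split`, the 80/10/10 fractions and the example directory names are assumptions made for illustration. The resulting path lists could then be handed to tf.data.Dataset.from_tensor_slices, with the .npy masks loaded inside map, for instance by wrapping np.load in tf.numpy_function.

```python
import os
import random

def pair_and_split(image_names, image_dir, mask_dir,
                   fractions=(0.8, 0.1, 0.1), seed=0):
    """Pair each PNG with its .npy mask by file name and split the pairs.

    Only path strings are handled here; no image or mask is loaded.
    """
    pairs = []
    for name in sorted(image_names):
        stem, ext = os.path.splitext(name)
        if ext.lower() != ".png":
            continue
        pairs.append((os.path.join(image_dir, name),
                      os.path.join(mask_dir, stem + ".npy")))
    random.Random(seed).shuffle(pairs)        # reproducible shuffle
    n_train = int(fractions[0] * len(pairs))
    n_val = int(fractions[1] * len(pairs))
    return (pairs[:n_train],
            pairs[n_train:n_train + n_val],
            pairs[n_train + n_val:])

names = ["test%d.png" % i for i in range(10)]  # stand-in for os.listdir(image_dir)
train, val, test = pair_and_split(names, "data/ims", "data/masks")
print(len(train), len(val), len(test))   # 8 1 1
print(train[0])
```

Fixing the seed keeps the three subsets stable between runs, so the test set never leaks into training when the script is restarted.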

tensorflow classify multiple images

I am using the TensorFlow image classification example (https://www.tensorflow.org/versions/r0.9/tutorials/image_recognition/index.html).
How could I classify multiple images at a time?
EDIT: Ideally, I would just pass in one image and a number (nb) as arguments, and then make the input to be classified nb copies of that image.
The file is classify_image.py, and the important portion is:
def run_inference_on_image(image):
    """Runs inference on an image.
    Args:
      image: Image file name.
    Returns:
      Nothing
    """
    if not tf.gfile.Exists(image):
        tf.logging.fatal('File does not exist %s', image)
    image_data = tf.gfile.FastGFile(image, 'rb').read()
    # Creates graph from saved GraphDef.
    create_graph()
    with tf.Session() as sess:
        # Some useful tensors:
        # 'softmax:0': A tensor containing the normalized prediction across
        #   1000 labels.
        # 'pool_3:0': A tensor containing the next-to-last layer containing 2048
        #   float description of the image.
        # 'DecodeJpeg/contents:0': A tensor containing a string providing JPEG
        #   encoding of the image.
        # Runs the softmax tensor by feeding the image_data as input to the graph.
        softmax_tensor = sess.graph.get_tensor_by_name('softmax:0')
        predictions = sess.run(softmax_tensor,
                               {'DecodeJpeg/contents:0': image_data})
        predictions = np.squeeze(predictions)
        # Creates node ID --> English string lookup.
        node_lookup = NodeLookup()
        top_k = predictions.argsort()[-FLAGS.num_top_predictions:][::-1]
        for node_id in top_k:
            human_string = node_lookup.id_to_string(node_id)
            score = predictions[node_id]
            print('%s (score = %.5f)' % (human_string, score))
def main(_):
    maybe_download_and_extract()
    image = (FLAGS.image_file if FLAGS.image_file else
             os.path.join(FLAGS.model_dir, 'cropped_panda.jpg'))
    run_inference_on_image(image)
The code relevant to you would be this section:
def main(_):
    maybe_download_and_extract()
    image = (FLAGS.image_file if FLAGS.image_file else
             os.path.join(FLAGS.model_dir, 'cropped_panda.jpg'))
    run_inference_on_image(image)
To get predictions for all the png, jpeg or jpg files in an "images" folder, you could do this:
def main(_):
    maybe_download_and_extract()
    # search for files in 'images' dir
    files_dir = os.getcwd() + '/images'
    files = os.listdir(files_dir)
    # loop over files, print prediction if it is an image
    for f in files:
        if f.lower().endswith(('.png', '.jpg', '.jpeg')):
            image_path = files_dir + '/' + f
            # run_inference_on_image prints the predictions itself and returns nothing
            run_inference_on_image(image_path)
This should print out the predictions for all the images in that folder.
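The same loop can be written so that the classifier is passed in and the results are collected instead of printed. This is only a sketch: `classify_directory` is a hypothetical helper, and the dummy lambda stands in for a version of run_inference_on_image that has been changed to return its predictions (the original returns nothing).

```python
import os
import tempfile

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')

def classify_directory(files_dir, classify):
    """Call `classify` on every image file in `files_dir`.

    `classify` is any callable taking a file path and returning a result.
    """
    results = {}
    for f in sorted(os.listdir(files_dir)):
        if f.lower().endswith(IMAGE_EXTENSIONS):
            results[f] = classify(os.path.join(files_dir, f))
    return results

# Demonstration with empty placeholder files and a dummy classifier.
demo_dir = tempfile.mkdtemp()
for name in ("cat.JPG", "dog.png", "notes.txt"):
    open(os.path.join(demo_dir, name), "wb").close()

results = classify_directory(
    demo_dir,
    classify=lambda path: "label for " + os.path.basename(path),
)
print(results)
# {'cat.JPG': 'label for cat.JPG', 'dog.png': 'label for dog.png'}
```

Lower-casing the name before the extension check means files such as cat.JPG are picked up too, while non-image files are skipped.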
