I'm sorry if the question is too basic, but I am just getting started with PyTorch (and Python).
I was trying to follow step by step the instructions here:
https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html
However, I am working with DICOM files, which I keep in two directories (CANCER/NOCANCER). I split them with split-folders so the structure can be used with the ImageFolder dataset (as done in the tutorial).
I am aware that I only need to load the pixel_arrays extracted from the DICOM files, and I wrote some helper functions to:
read all paths of the .dcm files;
read them and extract the pixel_array;
do a little preprocessing.
Here are the outlines of the helper functions:
import os
import pydicom
import cv2
import numpy as np

def createListFiles(dirName):
    print("Fetching all the files in the data directory...")
    lstFilesDCM = []
    for root, dirs, fileList in os.walk(dirName):
        for filename in fileList:
            if ".dcm" in filename.lower():
                lstFilesDCM.append(os.path.join(root, filename))
    return lstFilesDCM
def castHeight(listDCM):
    lstHeight = []
    for filenameDCM in listDCM:
        readfile = pydicom.read_file(filenameDCM)
        lstHeight.append(readfile.pixel_array.shape[0])
    return np.min(lstHeight)

def castWidth(listDCM):
    lstWidth = []
    for filenameDCM in listDCM:
        readfile = pydicom.read_file(filenameDCM)
        lstWidth.append(readfile.pixel_array.shape[1])
    return np.min(lstWidth)
def Preproc1(listDCM):
    new_height, new_width = castHeight(listDCM), castWidth(listDCM)
    ConstPixelDims = (len(listDCM), int(new_height), int(new_width))
    ArrayDCM = np.zeros(ConstPixelDims, dtype=np.float32)
    ## loop through all the DICOM files
    for i, filenameDCM in enumerate(listDCM):
        ## read the file
        ds = pydicom.read_file(filenameDCM)
        mx0 = ds.pixel_array
        ## standardisation
        imgb = mx0.astype('float32')
        imgb_stand = (imgb - imgb.mean(axis=(0, 1), keepdims=True)) / imgb.std(axis=(0, 1), keepdims=True)
        ## normalisation
        imgb_norm = cv2.normalize(imgb_stand, None, 0, 1, cv2.NORM_MINMAX)
        ## make sure the data is saved as a numpy array
        data = np.array(imgb_norm)
        ## resize based on 'ConstPixelDims' and save into ArrayDCM
        ArrayDCM[i, :, :] = cv2.resize(data, (int(new_width), int(new_height)), interpolation=cv2.INTER_CUBIC)
    return ArrayDCM
So now, how do I tell the dataloader to load the data, using the directory structure for labelling, but only after applying this extraction and preprocessing to it?
I'm referencing the "Loading data" part of the tutorial in the documentation, that goes:
# Create training and validation datasets
image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x), data_transforms[x]) for x in ['train', 'val']}
# Create training and validation dataloaders
dataloaders_dict = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=batch_size, shuffle=True, num_workers=4) for x in ['train', 'val']}
If it makes any sense, is it possible to do something along the lines of
image_datasets = {x: datasets.ImageFolder(Preproc1(os.path.join(data_dir, x)), data_transforms[x]) for x in ['train', 'val']}
?
Also, another question I have is: is it worth doing a normalisation step in my preprocessing when the tutorial suggests using transforms.Normalize?
I'm really sorry this sounds so vague; I've been trying to solve this for weeks now, but I can't manage.
It sounds like you will be better off implementing your own custom Dataset.
And indeed, I think it would be better to defer normalization and the other preprocessing to the transforms applied just before the images are fed to the model.
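As a rough, untested sketch of what such a Dataset could look like, reusing your createListFiles helper (the class-per-folder layout, label order, and transform hand-off are assumptions on my part, not tested code):
import torch
from torch.utils.data import Dataset

class DICOMDataset(Dataset):
    """Sketch: expects one subdirectory per class, e.g. CANCER/ and NOCANCER/."""
    def __init__(self, root_dir, transform=None):
        self.classes = sorted(os.listdir(root_dir))  # e.g. ['CANCER', 'NOCANCER']
        self.samples = []
        for label, cls in enumerate(self.classes):
            for path in createListFiles(os.path.join(root_dir, cls)):
                self.samples.append((path, label))
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        pixels = pydicom.read_file(path).pixel_array.astype('float32')
        img = torch.from_numpy(pixels).unsqueeze(0)  # (1, H, W) channel-first tensor
        if self.transform is not None:
            img = self.transform(img)  # do resizing/normalisation here instead of Preproc1
        return img, label
You could then build the datasets per split the same way the tutorial does, e.g. DICOMDataset(os.path.join(data_dir, x), data_transforms[x]) for x in ['train', 'val'], with transforms.Normalize living in data_transforms rather than in your own preprocessing.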
Related
I have a DataFrame constructed like this, with 5 columns, and I have to perform a multi-task (multi-output) classification:
Daytime  Environment  Filename     Weather
day      wet          2018-10.png  light_fog   [example of one row]
My problem is that, after doing the flow from the dataframe, I don't know how to use tf.data.Dataset to build the dataset. Someone suggested I use TFRecords, but I have never used them. How can I do this without them?
train_data_gen = ImageDataGenerator(rescale=1. / 255)
train_gen = train_data_gen.flow_from_dataframe(train_df,
                                               directory=dataset_dir,
                                               x_col="Filename",
                                               y_col=["Daytime", "Weather", "Environment"],
                                               class_mode="multi_output",
                                               target_size=(img_size, img_size),
                                               batch_size=batch_size,
                                               shuffle=True,
                                               seed=SEED)
According to the tutorial (https://www.tensorflow.org/tutorials/load_data/pandas_dataframe#with_tfdata), you should be able to do something similar:
def prepare_image(filepath, features):
    # You can do whatever transformations in here.
    file_content = tf.io.read_file(filepath)
    image = tf.io.decode_png(file_content)
    return image, features

filenames = df.pop('Filename')
feature_names = ["Daytime", "Weather", "Environment"]
features = df[feature_names]
shuffle_buffer_size = len(df.index)  # or whatever shuffle size you want to use

ds = tf.data.Dataset.from_tensor_slices((filenames, features))
ds = ds.shuffle(buffer_size=shuffle_buffer_size, reshuffle_each_iteration=True)
# Performs the function for each pair of a filename and its corresponding features
ds = ds.map(prepare_image)
dataset = ds.batch(batch_size)
# Prefetch is just for optimization
dataset = dataset.prefetch(buffer_size=1)
Note that this should just give you the idea. I did not test it, so it may throw an error or two.
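If it helps, the batched dataset can then be passed straight to Keras; a minimal sketch, assuming `model` is a compiled Keras model and the label columns have already been encoded as numbers:
# Hypothetical usage: Keras accepts tf.data datasets directly in fit().
model.fit(dataset, epochs=10)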
I'm making a speech recognition model with an input shape of (56088, 22050, 1). The whole dataset can be loaded from a .npy file (~5 GB) into memory, but I wanted to figure out a better way. I came across the Keras fit_generator() method, but most examples were based on MNIST and used the ImageDataGenerator() function. I realised that I had to make a custom generator function, but I wasn't really sure how. Following this thread, I referenced its generator function to make something like the code below, but I still have to load the entire dataset into memory, which takes a lot of time. Plus, I'm uncertain whether this program would run at all, because it didn't output anything for the first 20 minutes that I ran it.
Any other way out?
import librosa
import glob
import tensorflow as tf
import os
import numpy as np

class_list, X_train, Y_train = [], [], []
filename = "D:\\SpeechRecognitionData\\train\\audio\\"
class_names = os.listdir(filename)
print(class_names)

for classes in class_names:
    if classes == '_background_noise_':
        continue
    else:
        class_list.append(''.join(filename + classes))
print(class_list, "\n", len(class_list))

def create_X(address):
    wave, sr = librosa.load(address)
    wave.reshape(-1, 1)
    yield wave

def getLabel(filename):
    base_name = os.path.basename(filename)
    return base_name

def onehot(Y_train):
    from sklearn import preprocessing
    enc = preprocessing.OneHotEncoder()
    Y_train = Y_train.reshape(-1, 1)
    enc.fit(Y_train)
    Y_train = enc.transform(Y_train).toarray()
    return Y_train

def execute(X_train, Y_train):
    loop = 0
    for i in class_list:
        c = 0
        loop += 1
        for file in glob.glob("".join(i + "\\*.wav")):  # iterating through each .wav audio file in the directory to create training data
            if np.array(list(create_X(file))).shape[0] == 22050:
                c += 1
                Y_train.append(class_names.index(getLabel(i)))
                X_train.append(create_X(file))
                if c % 100 == 0:
                    print("{} files processed in loop {}".format(c, loop))
    while 1:
        for i in range(1558):  # 36*1558 = 56088
            if i % 125 == 0:
                print("i= " + str(i))
            yield np.array(X_train[i*36:(i+1)*36]).reshape(X_train.shape[0], X_train.shape[1], 1), onehot(np.array(Y_train[i*36:(i+1)*36]))

input_shape = (22050, 1)
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Conv1D(16, activation='relu', input_shape=input_shape, kernel_size=(10)))
model.add(tf.keras.layers.MaxPool1D())
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Conv1D(32, activation='relu', kernel_size=(10)))
model.add(tf.keras.layers.MaxPool1D())
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Conv1D(16, activation='relu', kernel_size=(10)))
model.add(tf.keras.layers.MaxPool1D())
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(30, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

generator = execute(X_train, Y_train)
model.fit_generator(generator, steps_per_epoch=56088//36, shuffle=True)
model.save("model.h5")
So I figured it out by looking at this example: https://github.com/tjh48/keras_generators/blob/master/keras_generator_example.ipynb
If someone comes across this, they can refer to my notebook:
https://github.com/DarshanDeshpande/Speech-Recognition/blob/master/SpeechRecognitionWithGenerators.ipynb
Thanks!
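For anyone landing here without opening the notebooks, the core idea is to keep only file paths (not waveforms) in memory and load the audio per batch inside the generator. A rough, untested sketch of that idea (the paths, label handling, and fixed 22050-sample length are placeholders, not the notebook's exact code):
import numpy as np
import librosa

def audio_batch_generator(file_paths, labels, batch_size=36, num_classes=30):
    # Yields (batch_size, 22050, 1) waveforms and one-hot labels, loading lazily.
    while True:
        idx = np.random.permutation(len(file_paths))
        for start in range(0, len(idx) - batch_size + 1, batch_size):
            batch_idx = idx[start:start + batch_size]
            waves = []
            for i in batch_idx:
                wave, sr = librosa.load(file_paths[i], sr=None)
                waves.append(wave[:22050].reshape(-1, 1))  # assumes each clip has >= 22050 samples
            x = np.stack(waves)
            y = np.eye(num_classes)[[labels[i] for i in batch_idx]]
            yield x, y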
I would like to know how I can use the data loader in PyTorch for my custom file structure. I have gone through the PyTorch documentation, but all the examples use separate folders per class.
My folder structure consists of 2 folders (called training and validation), each with 2 subfolders (called images and json_annotations). Each image in the "images" folder has multiple objects (like cars, cycles, people, etc.), each annotated in a separate JSON file. The standard COCO annotation format is followed. My intention is to make a neural network that can do real-time classification from videos.
Edit 1:
I have done the coding as suggested by Fábio Perez.
class lDataSet(data.Dataset):
    def __init__(self, path_to_imgs, path_to_json):
        self.path_to_imgs = path_to_imgs
        self.path_to_json = path_to_json
        self.img_ids = os.listdir(path_to_imgs)

    def __getitem__(self, idx):
        img_id = self.img_ids[idx]
        img_id = os.path.splitext(img_id)[0]
        img = cv2.imread(os.path.join(self.path_to_imgs, img_id + ".jpg"))
        load_json = json.load(open(os.path.join(self.path_to_json, img_id + ".json")))
        #n = len(load_json)
        #bboxes = load_json['annotation'][n]['segmentation']
        return img, load_json

    def __len__(self):
        return len(self.img_ids)
When I try this
l_data = lDataSet(path_to_imgs = '/home/training/images', path_to_json = '/home/training/json_annotations')
I'm getting l_data, where l_data[idx][0] is the image and l_data[idx][1] is the JSON. Now I'm confused: how will I use it with the finetuning example available in PyTorch? In that example, the dataset and dataloader are created as shown below.
https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html
# Create training and validation datasets
image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x), data_transforms[x]) for x in ['train', 'val']}
# Create training and validation dataloaders
dataloaders_dict = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=batch_size, shuffle=True, num_workers=4) for x in ['train', 'val']}
You should be able to implement your own dataset with data.Dataset. You just need to implement __len__ and __getitem__ methods.
In your case, you can iterate through all images in the image folder (then you can store the image ids in a list in your Dataset). Then, you use the index passed to __getitem__ to get the corresponding image id. With this image id, you can read the corresponding JSON file and return the target data that you need.
Something like this:
class YourDataLoader(data.Dataset):
    def __init__(self, path_to_imgs, path_to_json):
        self.path_to_imgs = path_to_imgs
        self.path_to_json = path_to_json
        self.image_ids = iterate_through_images(path_to_imgs)

    def __getitem__(self, idx):
        img_id = self.image_ids[idx]
        img = load_image(os.path.join(self.path_to_imgs, img_id))
        bboxes = load_bboxes(os.path.join(self.path_to_json, img_id))
        return img, bboxes

    def __len__(self):
        return len(self.image_ids)
In iterate_through_images you get all the ids (e.g. filenames) of images in a directory.
In load_bboxes you read the JSON and get the information you need.
I have a JSON loader implementation here if you want a reference.
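To address the edit: once the Dataset returns (image, target) pairs, it plugs into the same dictionary-of-DataLoaders pattern the finetuning tutorial uses. A rough, untested sketch (the training paths are from your question; the validation paths and everything else are assumptions):
# Hypothetical wiring, mirroring the tutorial's dataloaders_dict:
data_dirs = {'train': ('/home/training/images', '/home/training/json_annotations'),
             'val':   ('/home/validation/images', '/home/validation/json_annotations')}
image_datasets = {x: lDataSet(*data_dirs[x]) for x in ['train', 'val']}
dataloaders_dict = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=batch_size,
                                                   shuffle=True, num_workers=4)
                    for x in ['train', 'val']}
One caveat: the default collate function can only batch tensors and simple structures, so if __getitem__ returns the raw JSON dict you will likely need to extract fixed-size targets (or pass a custom collate_fn) before this works.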
I have a multi-label classification problem. I wrote this custom generator, which reads images and their output labels from disk and returns them in batches of size 32.
def get_input(img_name):
    path = os.path.join("images", img_name)
    img = image.load_img(path, target_size=(224, 224))
    return img

def get_output(img_name, file_path):
    data = pd.read_csv(file_path, delim_whitespace=True, header=None)
    img_id = img_name.split(".")[0]
    img_id = img_id.lstrip("0")
    img_id = int(img_id)
    labels = data.loc[img_id - 1].values
    labels = labels[1:]
    labels = list(labels)
    label_arrays = []
    for i in range(20):
        val = np.zeros((1))
        val[0] = labels[i]
        label_arrays.append(val)
    return label_arrays

def preprocess_input(img_name):
    img = get_input(img_name)
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    return x

def train_generator(batch_size):
    file_path = "train.txt"
    data = pd.read_csv(file_path, delim_whitespace=True, header=None)
    while True:
        for i in range(math.floor(8000/batch_size)):
            x_batch = np.zeros(shape=(32, 224, 224, 3))
            y_batch = np.zeros(shape=(32, 20))
            for j in range(batch_size):
                img_name = data.loc[i * batch_size + j].values
                img_name = img_name[0]
                x = preprocess_input(img_name)
                y = get_output(img_name, file_path)
                x_batch[j, :, :, :] = x
                y_batch[j] = y
            ys = []
            for i in range(20):
                ys.append(y_batch[:, i])
            yield(x_batch, ys)
Had a little problem with labels returned to the model, and got it solved in this question:
training a multi-output keras model
I tested this generator on a single-output problem. This custom generator is very slow: the ETA for a single epoch using it is around 27 hours, while the built-in generator (using flow_from_directory) takes 25 minutes per epoch. What am I doing wrong?
The training process for both tests is identical, except for the generator used. The validation generator is similar to the training generator. I know I will not reach the efficiency of Keras' built-in generator, but this difference in speed is too much.
EDIT
Some guides I read for creating custom generators.
Writing Custom Keras Generators
custom generator for fit_generator() that yields multiple inputs with different shapes
Maybe the built-in generator processes the data on your GPU while your custom generator runs on the CPU, making it significantly slower.
Another guess is that Keras uses Dataset in the background. Your implementation probably relies on feed_dict, which is the slowest possible way to pass information to TensorFlow. The best way to feed data into a model is to use an input pipeline, so the GPU never has to wait for new data to come in.
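One concrete thing visible in the posted code, independent of that guess: get_output calls pd.read_csv on train.txt once per image, so every batch re-parses the whole CSV 32 times; reading it once up front should already help a lot.
As a rough, untested sketch of such an input pipeline (the path list, label matrix, and image size are placeholders for your own data; tf.data.AUTOTUNE needs a reasonably recent TF 2.x):
import tensorflow as tf

def load_example(path, labels):
    # Decode and resize one image inside the tf.data graph, off the Python path.
    img = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    img = tf.image.resize(img, (224, 224)) / 255.0
    return img, labels

# paths: list of image file paths; label_matrix: array of shape (N, 20)
ds = (tf.data.Dataset.from_tensor_slices((paths, label_matrix))
      .shuffle(1000)
      .map(load_example, num_parallel_calls=tf.data.AUTOTUNE)
      .batch(32)
      .prefetch(tf.data.AUTOTUNE))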
I have built a CNN to predict lymph node positivity (has cancer or not). Right now, to load the data, I have a self-defined function that loads a batch of data and feeds it to the model for training.
Instead of loading batches this way, I would love to use the flow_from_directory method. The problem I have is that my data are saved as arrays [#, rows, width, height, PET or CT], not as images (which would later be converted to arrays). For example, [0,:,:,:,0] is a 48x48x32 volume from a CT image.
If I try to use flow_from_directory, I get 0 images with 3 classes, which I expected, since '.mat' is not a recognized file type (https://github.com/keras-team/keras-preprocessing/blob/362fe9f8daf556151328eb5d02bd5ae638c653b8/keras_preprocessing/image.py#L1868). Interestingly, it doesn't raise any errors, but I am stuck indefinitely on 1/150 epochs. I am going to see if I can write my own flow_from_directory. Not sure if someone has run across this problem and could give me pointers.
Illustration of how the data are combined:
train_combo = np.zeros((1, 48, 48, 32, 2))
for fname in fnames:
    data = scipy.io.loadmat(os.path.join(dir_in_train, fname))['roi_patch']
    data_PET = scipy.io.loadmat(os.path.join(dir_in_train_PET, fname))['roi_patch']
    train_combo[0, :, :, :, 0] = data / 4.0950
    train_combo[0, :, :, :, 1] = data_PET / 32.1959
    scipy.io.savemat(fname, {fname: train_combo})
This will create a file, e.g. '1.mat', that has the CT data and PET data in one array.
Then I have code that changes it into .npy files.
Example of the data generator I already have:
# load training data
def load_train_data_batch_generator(self, batch_size=32, rows_in=48, cols_in=48, zs_in=32,
                                    channels_in=2, num_classes=3,
                                    dir_in_train=None, dir_out_train=None):
    # dir_in_train = main_dir + '/test_CT_PET_combo'
    fnames = ['{}.mat'.format(i) for i in range(1, len(os.listdir(dir_in_train)) + 1)]
    y_train = np.zeros((batch_size, num_classes))
    x_train = np.zeros((batch_size, rows_in, cols_in, zs_in, channels_in))
    while True:
        count = 0
        for fname in np.random.choice(fnames, batch_size, replace=False):
            data_label = scipy.io.loadmat(os.path.join(dir_out_train, fname))['output']
            # changing one-hot encoding to integer
            integer_label = np.argmax(data_label[0], axis=0)
            y_train[count, :] = data_label
            # loading the train CT w/ contrast and PET/CT combo that was saved into the new directory
            train_combo = scipy.io.loadmat(os.path.join(dir_in_train, fname))[fname]
            x_train[count, :, :, :, :] = train_combo
            count += 1
        yield(x_train, y_train)
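Since flow_from_directory only understands image files, one way to get the same directory-per-class behaviour for array data is a custom keras.utils.Sequence that indexes .npy files. A rough, untested sketch (the folder layout and volume shape are assumptions based on the question):
import os
import numpy as np
from tensorflow import keras

class NpyVolumeSequence(keras.utils.Sequence):
    # Sketch: one subdirectory per class, each holding (48, 48, 32, 2) .npy volumes.
    def __init__(self, root_dir, batch_size=32):
        self.classes = sorted(os.listdir(root_dir))
        self.samples = [(os.path.join(root_dir, c, f), label)
                        for label, c in enumerate(self.classes)
                        for f in os.listdir(os.path.join(root_dir, c))
                        if f.endswith('.npy')]
        self.batch_size = batch_size

    def __len__(self):
        return len(self.samples) // self.batch_size

    def __getitem__(self, idx):
        batch = self.samples[idx * self.batch_size:(idx + 1) * self.batch_size]
        x = np.stack([np.load(path) for path, _ in batch])
        y = keras.utils.to_categorical([label for _, label in batch],
                                       num_classes=len(self.classes))
        return x, y
A Sequence also gives you safe multiprocessing and batch-order shuffling via fit(..., shuffle=True), which plain generator functions do not.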