I am looking to build a classification model using my own dataset, but I'm having trouble getting the dataset into a usable format. The images are currently in subfolders, with each subfolder named after its class. I want to load my dataset in the same format as the MNIST dataset, but I am unable to do so. For example, for MNIST, we can split the dataset:
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()
And then for example, I could flatten the data:
train_images = train_images.reshape((train_images.shape[0], -1))
test_images = test_images.reshape((test_images.shape[0], -1))
How would I replace tf.keras.datasets.mnist.load_data() with my own dataset but in the same format as the MNIST dataset? I am also doing multi-class classification.
Edit: Added Notes:
To be clear, my main task is to replace the MNIST dataset with my own dataset: My subdirectories are like this for example:
main_directory/
...class_a/
......a_image_1.jpg
......a_image_2.jpg
...class_b/
......b_image_1.jpg
......b_image_2.jpg
...class_c/
......c_image_1.jpg
......c_image_2.jpg
...class_d/
......d_image_1.jpg
......d_image_2.jpg
I tried following this link to make my dataset into a format that tf.keras could use to load the dataset, similar to the way the MNIST dataset is loaded. I have tried generating a tf.data.Dataset.
import pathlib
import PIL.Image
import tensorflow as tf
data_dir = pathlib.Path("/datas")
image_count = len(list(data_dir.glob('*/*.jpg')))
print(image_count)
cats = list(data_dir.glob('cats/*'))
PIL.Image.open(str(cats[0]))
batch_size = 32
img_height = 150
img_width = 150
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
But I am still unable to put it into the right format, to fix the code as described above. Something I saw was about conversion to a numpy array but I'm not sure how to do it.
Some packages have helper functions that make data access easy for users. As you can see in the documentation for the load_data() function, it returns a tuple of NumPy arrays:
(X_train, y_train), (X_test, y_test)
You can get the same structure by arranging your features (X) and targets (y) as NumPy arrays and passing them to scikit-learn's train_test_split() function.
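As a minimal sketch of that idea, the snippet below reads a class-per-subfolder directory into NumPy arrays and splits them with train_test_split(). The helper names load_image_folder and load_data, the 28x28 grayscale size, and the 80/20 split are assumptions chosen to resemble MNIST, and Pillow is assumed for decoding the images.

```python
import pathlib

import numpy as np
from PIL import Image
from sklearn.model_selection import train_test_split


def load_image_folder(main_directory, image_size=(28, 28)):
    """Read main_directory/<class_name>/<image> into (X, y, class_names).

    Hypothetical helper: every subfolder is treated as one class and every
    file inside it is assumed to be an image that Pillow can open.
    """
    root = pathlib.Path(main_directory)
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    images, labels = [], []
    for index, name in enumerate(class_names):
        for path in sorted((root / name).iterdir()):
            with Image.open(path) as img:
                # Grayscale + resize so every sample has the same shape.
                img = img.convert("L").resize(image_size)
                images.append(np.asarray(img, dtype=np.uint8))
            labels.append(index)
    return np.stack(images), np.asarray(labels, dtype=np.uint8), class_names


def load_data(main_directory, test_size=0.2, seed=123):
    """Return ((train_images, train_labels), (test_images, test_labels))."""
    X, y, _ = load_image_folder(main_directory)
    # stratify=y keeps the class proportions similar in both splits.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y)
    return (X_train, y_train), (X_test, y_test)
```

Calling load_data("main_directory") then stands in for tf.keras.datasets.mnist.load_data(), and the reshape lines from the question should work unchanged on the returned arrays.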
I imported a folder that contains 3 subfolders of images, each belonging to one class (for example, cats and dogs). I created a TensorFlow dataset with shuffle=True:
ds = tf.keras.preprocessing.image_dataset_from_directory('/content/data',
                                                         labels='inferred',
                                                         label_mode='int',
                                                         batch_size=32,
                                                         image_size=(256, 256),
                                                         shuffle=True)
Suppose len(ds) = 10.
Is this a correct way of splitting ds into train, validation, and test datasets? I found it on several websites and YouTube videos, including towardsdatascience.com:
train_ds = ds.take(8)
val_ds = ds.skip(8).take(1)
test_ds = ds.skip(8).skip(1)
My question is: when shuffle=True in the image_dataset_from_directory function, each time I call ds.take(n) (where n is any integer less than 10), it returns different outputs. So I conclude that it is highly probable that some data from train_ds also exists in val_ds and test_ds; in other words, there is data leakage.
Am I right?
Read the documentation for image_dataset_from_directory.
When shuffle is True, it shuffles your data; otherwise it sorts the data in ascending (alphanumeric) order. To split the data you can use your code; many people do this, and this does not indicate data leakage.
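Whether the three splits stay disjoint depends on the element order being the same every time the dataset is iterated. Here is a minimal sketch on a toy dataset (not the image loader itself), assuming a fixed seed and reshuffle_each_iteration=False on tf.data.Dataset.shuffle so the order is frozen before take() and skip() are applied:

```python
import tensorflow as tf

# Toy stand-in for the 10 batches in the question (assumption: the real
# dataset would come from image_dataset_from_directory).
ds = tf.data.Dataset.range(10)

# A fixed seed plus reshuffle_each_iteration=False keeps the shuffled order
# identical on every pass, so take()/skip() always see the same sequence.
ds = ds.shuffle(buffer_size=10, seed=42, reshuffle_each_iteration=False)

train_ds = ds.take(8)
val_ds = ds.skip(8).take(1)
test_ds = ds.skip(8).skip(1)

train_items = {int(x) for x in train_ds.as_numpy_iterator()}
val_items = {int(x) for x in val_ds.as_numpy_iterator()}
test_items = {int(x) for x in test_ds.as_numpy_iterator()}

# Each pairwise intersection should be empty if the splits do not overlap.
print(train_items & val_items, train_items & test_items, val_items & test_items)
```

For the image loader itself, passing validation_split and subset with the same seed (as in the other snippets here) is another way to get subsets that do not overlap.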
I've been looking forever on the internet trying to create, train and test my own TensorFlow model, but I have not succeeded. From investigating my code, I think the problem is how I create my dataset of images. Most online tutorials just import a prepared dataset, but my dataset is specifically for use-case diagrams and holds each element within those diagrams. My aim is to train a TensorFlow model to predict each element in a diagram and hopefully the errors too. Here's the code:
def createDataSet(labelList, label, filePath, width, height):
    dataList = []
    for img in os.listdir(filePath):
        filename = str(img)
        # skip the .npy files that are saved into the same folder
        if not filename.endswith("npy"):
            pic = cv.imread(os.path.join(filePath, img))
            pic = cv.cvtColor(pic, cv.COLOR_BGR2RGB)
            pic = cv.resize(pic, (width, height))
            dataList.append(pic)
            labelList.append(label)
    return dataList, labelList
# appending the pics to the training data list
training_dataset, train_labels = createDataSet(train_labels, train_label, path, width, height)
test_dataset, test_labels = createDataSet(test_labels, test_label, path2, width, height)
# converting the lists to NumPy arrays and saving them to files using numpy.save
np.save(os.path.join(path, train_label), np.array(training_dataset))
np.save(os.path.join(path2, test_label), np.array(test_dataset))
# converting the lists to NumPy arrays again (the saved files are not reloaded here)
train_images = np.array(training_dataset)
test_images = np.array(test_dataset)
As of now, the function creates a list which will be saved as a numpy array and that numpy array will be used for my model. But it causes errors like UNIMPLEMENTED: Cast string to float is not supported.
I'm sure that I'm creating the train, test data and the labels for both incorrectly
You can use the tf.keras.utils.image_dataset_from_directory function to read images from a directory and split the data into training and validation sets. Sample code is below.
main_directory/
...class_a/
......a_image_1.jpg
......a_image_2.jpg
...class_b/
......b_image_1.jpg
......b_image_2.jpg
import tensorflow as tf
data_dir = "main_directory"  # path to the folder laid out as shown above
batch_size = 32
img_height = 224
img_width = 224
train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
class_names = train_ds.class_names
val_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
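If you prefer to keep the NumPy arrays from the question, one thing worth checking is the label dtype. An error such as "Cast string to float is not supported" typically appears when string values reach the model, for example class names used directly as labels. That diagnosis is an assumption, since the question does not show the label values. A minimal sketch that maps hypothetical string labels to integer class indices:

```python
import numpy as np

# Hypothetical string labels, one per image (the real values are not shown
# in the question).
train_labels = ["actor", "use_case", "actor", "association"]

# np.unique returns the sorted class names and, with return_inverse=True,
# the integer index of each label within that sorted list.
class_names, train_label_ids = np.unique(train_labels, return_inverse=True)

print(class_names)      # ['actor' 'association' 'use_case']
print(train_label_ids)  # [0 2 0 1]
```

The integer array can then be passed to the model as the labels, with class_names kept around to translate predictions back to names.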
Having a bit of a clueless moment, I'm looking to apply transfer learning to a problem using ResNet50 pre-trained on ImageNet.
I've got the transfer learning process all ready to go, but need my data set in the right form which tf.keras.applications.resnet50.preprocess_input handily does. Except it works on a numpy.array or tf.Tensor and I'm using image_dataset_from_directory to load the data which gives me a tf.data.Dataset.
Is there a simple way to use the provided preprocess_input function to preprocess my data in this form?
Alternatively, the function specifies:
The images are converted from RGB to BGR, then each color channel is zero-centered with respect to the ImageNet dataset, without scaling.
So any other way to achieve this in the data pipeline or as part of the model would also be acceptable.
You could use the map function of tf.data.Dataset to apply the preprocess_input function to every batch of images:
import tensorflow as tf
import pathlib
import matplotlib.pyplot as plt
dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
data_dir = pathlib.Path(data_dir)
batch_size = 32
train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(180, 180),
    batch_size=batch_size)
def display(ds):
    images, _ = next(iter(ds.take(1)))
    image = images[0].numpy()
    image /= 255.0
    plt.imshow(image)
def preprocess(images, labels):
    return tf.keras.applications.resnet50.preprocess_input(images), labels
train_ds = train_ds.map(preprocess)
display(train_ds)
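The quoted description of preprocess_input (RGB to BGR, then zero-centering each channel) can also be written out by hand, which is handy for checking what the map() call produced. This is a sketch in plain NumPy, not the library's implementation. The function name preprocess_rgb_batch is hypothetical, and the mean values are the ImageNet BGR means commonly associated with this style of preprocessing, so treat them as an assumption and verify them against your installed version.

```python
import numpy as np

# Per-channel ImageNet means in BGR order (assumption: check these against
# the Keras version you have installed).
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)


def preprocess_rgb_batch(images):
    """Convert an (N, H, W, 3) RGB batch with values in 0-255 to zero-centered BGR."""
    images = np.asarray(images, dtype=np.float32)
    bgr = images[..., ::-1]          # reverse the channel axis: RGB -> BGR
    return bgr - IMAGENET_MEAN_BGR   # broadcast the mean over every pixel
```

Comparing its output with one batch from the mapped dataset (for example with np.allclose) is a quick sanity check that the pipeline applies the preprocessing exactly once.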
I am trying to use a CNN for image classification in Python, and I have a question about how to load the dataset.
I have a dataset with two directories: one directory contains 50,000 jpg files for training (with IDs 0-49,999) and another directory contains 10,000 jpg files for testing (IDs 0-9,999). There is also a training label CSV file with two columns: the first column is the sample ID and the second column is the sample's label. The labels are between 0 and 9. I know the mapping between the pictures and labels.
How do I import the dataset in Python?
I tried to use the code for the CIFAR-10 dataset to load my dataset. The code is as follows:
# example of loading the cifar10 dataset
from matplotlib import pyplot
from keras.datasets import cifar10
# load dataset
(trainX, trainy), (testX, testy) = cifar10.load_data()
# summarize loaded dataset
print('Train: X=%s, y=%s' % (trainX.shape, trainy.shape))
print('Test: X=%s, y=%s' % (testX.shape, testy.shape))
# plot first few images
for i in range(9):
    # define subplot
    pyplot.subplot(330 + 1 + i)
    # plot raw pixel data
    pyplot.imshow(trainX[i])
# show the figure
pyplot.show()
For the CIFAR-10 dataset, the shapes of the training and test sets are:
Train: X=(50000, 32, 32, 3), y=(50000, 1)
Test: X=(10000, 32, 32, 3), y=(10000, 1)
Can I get similar training and test sets from my directories?
Use tf.keras.utils.image_dataset_from_directory to load images from a directory.
Sample code below:
# img_height, img_width and batch_size must be defined beforehand
data_dir = 'directory/path'
train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
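Note that with inferred labels this function expects one subfolder per class, whereas the question has a flat folder plus a CSV. As an alternative, here is a sketch that builds CIFAR-10-shaped NumPy arrays by hand. The helper name load_labeled_images, the "<id>.jpg" file naming, the CSV header row, and the 32x32 size are all assumptions, since the question does not state them.

```python
import csv
import pathlib

import numpy as np
from PIL import Image


def load_labeled_images(image_dir, label_csv, image_size=(32, 32)):
    """Return (X, y) shaped like CIFAR-10: X is (N, H, W, 3), y is (N, 1).

    Assumptions: files are named "<id>.jpg" and the CSV has a header row
    followed by "id,label" rows.
    """
    image_dir = pathlib.Path(image_dir)
    images, labels = [], []
    with open(label_csv, newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header row
        for sample_id, label in reader:
            with Image.open(image_dir / f"{sample_id}.jpg") as img:
                # Force 3 channels and a common size for every sample.
                img = img.convert("RGB").resize(image_size)
                images.append(np.asarray(img, dtype=np.uint8))
            labels.append(int(label))
    X = np.stack(images)
    y = np.asarray(labels, dtype=np.uint8).reshape(-1, 1)
    return X, y
```

The returned pair plays the role of (trainX, trainy). The unlabeled test folder can be read with the same loop minus the label column.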
I am trying to create a multilabel classification model with Keras. All my images are in one folder, and I have a CSV file containing the path to each image followed by multiple possible labels.
Example of my CSV:
path, x1, x2, x3
img/img_00000001.jpg,1,0,1
img/img_00000002.jpg,0,0,1
...
I am trying to read in my images using flow_from_directory and provide the respective labels via the CSV. My code so far looks like this:
image_path= "C:/user/Images"
data_generator = ImageDataGenerator(rescale=1./255,
                                    validation_split=0.20)
train_generator = data_generator.flow_from_directory(image_path, target_size=(IMAGE_HEIGHT, IMAGE_SIZE), shuffle=True, seed=13,
                                                     class_mode='binary', batch_size=BATCH_SIZE, subset="training")
validation_generator = data_generator.flow_from_directory(image_path, target_size=(IMAGE_HEIGHT, IMAGE_SIZE), shuffle=False, seed=13,
                                                          class_mode='binary', batch_size=BATCH_SIZE, subset="validation")
A solution to a similar problem is suggested here: How to manually specify class labels in keras flow_from_directory? providing this code:
def multiclass_flow_from_directory(flow_from_directory_gen, multiclasses_getter):
    for x, y in flow_from_directory_gen:
        yield x, multiclasses_getter(x, y)
However, I can't figure out how to implement multiclasses_getter() so that it works.
Try using flow_from_dataframe instead of flow_from_directory.
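A rough sketch of how that could look for the CSV in the question, assuming pandas is available. The helper names, the 224x224 target size, and the batch size are placeholders, not values from the question. class_mode="raw" with a list of columns in y_col makes the generator yield the label columns as they are, giving one multi-hot vector per image.

```python
import pandas as pd

LABEL_COLUMNS = ["x1", "x2", "x3"]


def load_label_frame(csv_path):
    """Read the CSV from the question into a DataFrame.

    skipinitialspace=True strips the blank after each comma in the header
    ("path, x1, x2, x3"), so the columns are named "x1" rather than " x1".
    """
    return pd.read_csv(csv_path, skipinitialspace=True)


def make_generators(frame, image_root, image_size=(224, 224), batch_size=32):
    """Build training/validation generators yielding (images, label vectors)."""
    # Imported here so the CSV helper above works without TensorFlow installed.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    data_generator = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.20)
    common = dict(
        dataframe=frame,
        directory=image_root,   # paths in the "path" column are relative to this
        x_col="path",
        y_col=LABEL_COLUMNS,    # several columns -> one label vector per image
        class_mode="raw",       # pass the label values through unchanged
        target_size=image_size,
        batch_size=batch_size,
        seed=13,
    )
    train = data_generator.flow_from_dataframe(subset="training", shuffle=True, **common)
    val = data_generator.flow_from_dataframe(subset="validation", shuffle=False, **common)
    return train, val
```

With labels in this form, a final Dense layer with one sigmoid unit per label column and a binary cross-entropy loss is the usual pairing for multilabel problems.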