Image loading from a local directory - Python

def prepare_data(batch_size):
    (X_train, y_train) = load_data(TRAIN_DIR)
    (X_test, y_test) = load_data(TEST_DIR)
    X_all = np.concatenate([X_train, X_test])
    y_all = np.concatenate([y_train, y_test])
    X_all = X_all.astype(np.float32) / 255
    X_all = X_all.reshape(-1, 28, 28, 1) * 2. - 1.
    y_all = keras.utils.to_categorical(y_all, 10)
    dataset = tf.data.Dataset.from_tensor_slices((X_all, y_all))
    dataset = dataset.shuffle(1024)
    dataset = dataset.batch(BATCH_SIZE, drop_remainder=True).prefetch(1)
    return dataset
This is the script to load the files from the directory using the TRAIN_DIR variable,
but when I call the function with dataset = prepare_data(BATCH_SIZE) it says "too many values to unpack (expected 2)".
Can you share your experience?

Based on the comments, you have a function load_data like this:
def load_data(dir_path, img_size=(100, 100)):
    """Load resized images as np.arrays to workspace."""
    X = []
    y = []
    i = 0
    label = dict()
    # ... (image-loading loop omitted in the question) ...
    X = np.array(X)
    y = np.array(y)
    print(f'{len(X)} images loaded from {dir_path} directory.')
    return X, y, label
which returns two NumPy arrays and one dictionary.
So I would change the beginning of the function prepare_data like so:
def prepare_data(batch_size):
    X_train, y_train, label_train = load_data(TRAIN_DIR)
    X_test, y_test, label_test = load_data(TEST_DIR)
so that the unpacking matches load_data's signature.
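For completeness, a minimal sketch of the corrected prepare_data, assuming TRAIN_DIR, TEST_DIR, BATCH_SIZE and load_data are defined elsewhere (as in the question) and that load_data returns 28x28 images, as the reshape suggests:
import numpy as np
import tensorflow as tf
from tensorflow import keras

def prepare_data(batch_size):
    # load_data returns two arrays and a label dictionary, so unpack three values
    X_train, y_train, label_train = load_data(TRAIN_DIR)
    X_test, y_test, label_test = load_data(TEST_DIR)
    X_all = np.concatenate([X_train, X_test])
    y_all = np.concatenate([y_train, y_test])
    # scale pixels to [-1, 1] and add a channel dimension
    X_all = X_all.astype(np.float32) / 255
    X_all = X_all.reshape(-1, 28, 28, 1) * 2. - 1.
    y_all = keras.utils.to_categorical(y_all, 10)
    dataset = tf.data.Dataset.from_tensor_slices((X_all, y_all))
    dataset = dataset.shuffle(1024)
    # use the batch_size argument instead of the global BATCH_SIZE
    dataset = dataset.batch(batch_size, drop_remainder=True).prefetch(1)
    return dataset

dataset = prepare_data(BATCH_SIZE)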


Too much RAM is required for loading dataset

I’m working on a neural network and my dataset has 42000 images, all of which I have to load. I’m using Google Colab for that, but every time I load the dataset the RAM is insufficient.
I am putting everything into a NumPy array, because I tried to use the ImageDataGenerator method and it didn't work. I'm using the following code to load the data:
import glob
import numpy as np
import tensorflow as tf

# 'class' is a reserved keyword in Python, so the file list needs another name
class_paths = glob.glob(r"/content/drive/MyDrive/DATASET/class/*.*")
data = []
labels = []
for i in class_paths:
    image = tf.keras.preprocessing.image.load_img(i, color_mode='rgb',
                                                  target_size=(336, 336))
    image = np.array(image)
    data.append(image)
    labels.append(0)
data = np.array(data)
labels = np.array(labels)
As ImageDataGenerator is deprecated, you can use a custom Keras Sequence class to load images only when they are needed.
The strategy here is to create a Pandas DataFrame with the path and class of every image, then turn the class into a numeric label with pd.factorize. Once you have X (paths) and y (labels), you can use train_test_split to extract three subsets: train, test and validation. The last step is to convert these collections into datasets compatible with TensorFlow.
Each time TensorFlow processes a batch, the Sequence loads that batch of images into memory, and so on.
Step 0: Imports and constants
import tensorflow as tf
import pandas as pd
import numpy as np
import pathlib
from sklearn.model_selection import train_test_split
INPUT_SHAPE = (336, 336, 3)
BATCH_SIZE = 32
DATA_DIR = pathlib.Path('/content/drive/MyDrive/DATASET/')
Step 1: Load all image paths to a Pandas DataFrame:
# Find images of dataset
data = []
for file in DATA_DIR.glob('**/*.jpg'):
    d = {'class': file.parent.name,
         'path': file}
    data.append(d)

# Create dataframe and select columns
df = pd.DataFrame(data)
df['label'] = pd.factorize(df['class'])[0]
X = df['path']
y = df['label']

# Split into 3 datasets: train, test and validation
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.2, random_state=2023)
X_train, X_valid, y_train, y_valid = \
    train_test_split(X_train, y_train, test_size=0.2, random_state=2023)
Step 2: Create a custom data Sequence
class ImgDataSequence(tf.keras.utils.Sequence):
    """
    Check documentation here: https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence
    """

    def __init__(self, image_set, label_set, batch_size=32, image_size=(256, 256)):
        self.image_set = np.array(image_set)
        self.label_set = np.array(label_set)
        self.batch_size = batch_size
        self.image_size = image_size

    def __get_image(self, image):
        image = tf.keras.preprocessing.image.load_img(image, color_mode='rgb', target_size=self.image_size)
        image_arr = tf.keras.preprocessing.image.img_to_array(image)
        return image_arr

    def __get_data(self, images, labels):
        image_batch = np.asarray([self.__get_image(img) for img in images])
        label_batch = np.asarray(labels)
        return image_batch, label_batch

    def __getitem__(self, index):
        images = self.image_set[index * self.batch_size:(index + 1) * self.batch_size]
        labels = self.label_set[index * self.batch_size:(index + 1) * self.batch_size]
        images, labels = self.__get_data(images, labels)
        return images, labels

    def __len__(self):
        return len(self.image_set) // self.batch_size + (len(self.image_set) % self.batch_size > 0)
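Note that the Sequence above serves the samples in the same order every epoch. If you also want them reshuffled between epochs, tf.keras.utils.Sequence exposes an on_epoch_end hook you could override; a minimal sketch (an addition, not part of the steps above):
class ShuffledImgDataSequence(ImgDataSequence):
    """Same as ImgDataSequence, but reshuffles the samples after every epoch."""

    def on_epoch_end(self):
        # permute image paths and labels together so the pairs stay aligned
        perm = np.random.permutation(len(self.image_set))
        self.image_set = self.image_set[perm]
        self.label_set = self.label_set[perm]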
Step 3: Create datasets
train_ds = ImgDataSequence(X_train, y_train, image_size=INPUT_SHAPE[:2], batch_size=BATCH_SIZE)
valid_ds = ImgDataSequence(X_valid, y_valid, image_size=INPUT_SHAPE[:2], batch_size=BATCH_SIZE)
test_ds = ImgDataSequence(X_test, y_test, image_size=INPUT_SHAPE[:2], batch_size=BATCH_SIZE)
Test the new datasets:
# Take the first batch of our train dataset
>>> imgs, labels = train_ds[0]
# Check the length (BATCH_SIZE)
>>> len(labels)
32
# Check the dimension of one image
>>> imgs[0].shape
(336, 336, 3)
How to use it with TensorFlow?
# train_ds & valid_ds to fit
history = model.fit(train_ds, epochs=10, validation_data=valid_ds)
# test_ds to evaluate
loss, *metrics = model.evaluate(test_ds)

I'm trying to create a Chi-Squared Feature Selection but there is an error when loading the dataset

I'm trying to create a Chi-Squared Feature Selection, but there is an error when loading the dataset. I load the dataset using the pandas library. I'm trying to use the train_test_split() function from scikit-learn and use 67% of the data for training and 33% for testing. The dataset used has the header on row 1. How can I solve this problem?
This is the code used.
# example of chi squared feature selection for categorical data
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from matplotlib import pyplot

# load the dataset
def load_dataset():
    # load the dataset as a pandas DataFrame
    data = read_csv('GDS-and-MMSE-balanced', header=None)
    # retrieve numpy array
    dataset = data.values
    # split into input (X) and output (y) variables
    X = dataset[:, :-1]
    y = dataset[:, -1]
    # format all fields as string
    X = X.astype(str)
    return X, y

# prepare input data
def prepare_inputs(X_train, X_test):
    oe = OrdinalEncoder()
    oe.fit(X_train)
    X_train_enc = oe.transform(X_train)
    X_test_enc = oe.transform(X_test)
    return X_train_enc, X_test_enc

# prepare target
def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc

# feature selection
def select_features(X_train, y_train, X_test):
    fs = SelectKBest(score_func=chi2, k='all')
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs

# load the dataset
X, y = load_dataset('GDS-and-MMSE-balanced.csv')
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc)
# what are scores for the features
for i in range(len(fs.scores_)):
    print('Feature %d: %f' % (i, fs.scores_[i]))
# plot the scores
pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_)
pyplot.show()
Below is the error after the code was executed:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-25-669e661525aa> in <module>()
46
47 # load the dataset
---> 48 X, y = load_dataset('GDS-and-MMSE-balanced.csv')
49 # split into train and test sets
50 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
TypeError: load_dataset() takes 0 positional arguments but 1 was given
You can change your load_dataset() definition as follows:
def load_dataset(filepath):
    # load the dataset as a pandas DataFrame
    data = read_csv(filepath, header=None)
    # retrieve numpy array
    dataset = data.values
    # split into input (X) and output (y) variables
    X = dataset[:, :-1]
    y = dataset[:, -1]
    # format all fields as string
    X = X.astype(str)
    return X, y
Then load your dataset as follows:
# load the dataset
filepath = 'GDS-and-MMSE-balanced.csv'
X, y = load_dataset(filepath)
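One more thing to check, as an assumption based on the question rather than part of the fix above: the question says the file's header is on row 1, but header=None tells pandas to treat that row as data. If the first row really is a header, letting pandas consume it as column names avoids reading it as a sample, for example:
# assumption: only needed if the first row of the CSV is a header
data = read_csv(filepath, header=0)   # row 0 becomes the column names
dataset = data.values                 # the header row is no longer part of the data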

List wrong predictions in the test data! - Python

I'm trying to list all the wrong predictions in a test set, but I'm quite unsure how to do it. I tried Stack Overflow, but might have searched for the wrong "problem". I have these text files from a folder, containing emails. The problem is that my predictions aren't doing too well, and I want to inspect the emails that are predicted wrong. Currently, a snippet of my code looks something like this:
no_head_train_path_0 = 'folder_name'
no_head_train_path_1 = 'folder_name'

def get_data(path):
    text_list = list()
    files = os.listdir(path)
    for text_file in files:
        file_path = os.path.join(path, text_file)
        read_file = open(file_path, 'r+')
        read_text = read_file.read()
        read_file.close()
        cleaned_text = clean_text(read_text)
        text_list.append(cleaned_text)
    return text_list, files

no_head_train_0, temp = get_data(no_head_train_path_0)
no_head_train_1, temp1 = get_data(no_head_train_path_1)
no_head_train = no_head_train_0 + no_head_train_1
no_head_labels_train = ([0] * len(no_head_train_0)) + ([1] * len(no_head_train_1))

def vocabularymat(TEXTFILES, VOC, PLAY, METHOD):
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    if (METHOD == "TDM"):
        voc = CountVectorizer()
        voc.fit(VOC)
        if (PLAY == "TRAIN"):
            TrainMat = voc.transform(TEXTFILES)
            return TrainMat
        if (PLAY == "TEST"):
            TestMat = voc.transform(TEXTFILES)
            return TestMat

TrainMat = vocabularymat(no_head_train, no_head_train, PLAY="TRAIN", METHOD="TDM")
X_train = Featurelearning(Traindata, Method="NMF")
y_train = datalabel

X_train, X_test, y_train, y_test = train_test_split(data, datalabel, test_size=0.33,
                                                    random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
expected = y_test
predicted = model.predict(X_test)
proba = model.predict_proba(X_test)

accuracy = accuracy_score(expected, predicted)
recall = recall_score(expected, predicted, average="binary")
precision = precision_score(expected, predicted, average="binary")
f1 = f1_score(expected, predicted, average="binary")
Is it possible to find the emails/filenames that are predicted wrong, so I can manually inspect them? (Sorry for the long code.)
You can use NumPy to create a Boolean vector indicating which predictions are wrong, and then use that vector to index your array of file names. For example:
import numpy as np
# mock data
files = np.array(['mail1.txt', 'mail2.txt', 'mail3.txt', 'mail4.txt'])
y_test = np.array([0, 0, 1, 1])
predicted = np.array([0, 1, 0, 1])
# create a Boolean index for the wrong classifications
classification_is_wrong = y_test != predicted
# print the file names of the wrongly classified mails
print(files[classification_is_wrong])
Output:
['mail2.txt' 'mail3.txt']
# find the wrong predictions
prediction = model.predict(X_test)
# save the indices of the wrongly predicted samples
wrong_predict = []
for order, value in enumerate(y_test):
    # compare the true label with the predicted label
    # (use prediction[order].argmax() instead if your model outputs class probabilities)
    if value != prediction[order]:
        wrong_predict.append(order)
print(wrong_predict)
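Since the goal is to inspect the original e-mail files, here is a minimal sketch that carries the file names through the split so they stay aligned with the predictions. It assumes data, datalabel and model are the objects from your snippet and that temp + temp1 (the file lists from get_data) are in the same order as the combined texts:
import numpy as np
from sklearn.model_selection import train_test_split

# file names in the same order as the texts/labels built above
all_files = np.array(temp + temp1)

# pass the file names through the same split so they stay aligned
(X_train, X_test,
 y_train, y_test,
 files_train, files_test) = train_test_split(
    data, datalabel, all_files, test_size=0.33, random_state=42)

model.fit(X_train, y_train)
predicted = model.predict(X_test)

# file names of the misclassified e-mails
print(files_test[np.asarray(y_test) != predicted])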

Randomize the splitting of data for training and testing for this function

I wrote a function to split numpy ndarrays x_data and y_data into training and test data based on a percentage of the total size.
Here is the function:
def split_data_into_training_testing(x_data, y_data, percentage_split):
    number_of_samples = x_data.shape[0]
    p = int(number_of_samples * percentage_split)
    x_train = x_data[0:p]
    y_train = y_data[0:p]
    x_test = x_data[p:]
    y_test = y_data[p:]
    return x_train, y_train, x_test, y_test
In this function, the top part of the data goes to the training set and the bottom part of the samples goes to the testing set, based on percentage_split. How can this data split be made more random before being fed to the machine learning model?
Assuming there's a reason you're implementing this yourself instead of using sklearn.model_selection.train_test_split, you can shuffle an array of indices (this leaves the original arrays untouched) and index on that.
def split_data_into_training_testing(x_data, y_data, split, shuffle=True):
    idx = np.arange(len(x_data))
    if shuffle:
        np.random.shuffle(idx)
    p = int(len(x_data) * split)
    x_train = x_data[idx[:p]]
    x_test = x_data[idx[p:]]
    y_train = y_data[idx[:p]]  # similarly for the labels
    y_test = y_data[idx[p:]]
    return x_train, x_test, y_train, y_test
You can select p random positions and use them to index the arrays. I would create those positions by shuffling an array of the available indices:
ind = np.arange(number_of_samples)
np.random.shuffle(ind)
ind_train = np.sort(ind[:p])
ind_test = np.sort(ind[p:])
x_train = x_data[ind_train]
y_train = y_data[ind_train]
x_test = x_data[ind_test]
y_test = y_data[ind_test]
Sorting the indices is only necessary if your original data is monotonically increasing or decreasing in x and you'd like to keep it that way. Otherwise, ind_train = ind[:p] is just fine.

Split into training and testing set in R?

How can I write the following Python code in R?
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
Splitting into training and testing sets with an 80/20 ratio.
Probably the simplest way to do so:
#read in iris dataset
data(iris)
library(caret) #this package has the createDataPartition function
set.seed(123) #randomization
#creating indices
trainIndex <- createDataPartition(iris$Species,p=0.75,list=FALSE)
#splitting data into training/testing data using the trainIndex object
IRIS_TRAIN <- iris[trainIndex,] #training data (75% of data)
IRIS_TEST <- iris[-trainIndex,] #testing data (25% of data)
Using base R you can do the following:
set.seed(12345)
#getting a training index of size 20 (in this case 20 out of 100);
#use one index vector so the x and y subsets stay aligned
train <- sample(1:100, 20)

#simulating random data
x <- rnorm(100)
y <- rnorm(100)

#sub-setting the x data
training.x.data <- x[train]
testing.x.data <- x[-train]

#sub-setting the y data
training.y.data <- y[train]
testing.y.data <- y[-train]
You can do this using caret's createDataPartition function:
library(caret)
# Make example data
X = data.frame(matrix(rnorm(200), nrow = 100))
y = rnorm(100)
#Extract random sample of indices for test data
set.seed(42) #equivalent to python's random_state arg
test_inds = createDataPartition(y = 1:length(y), p = 0.2, list = F)
# Split data into test/train using indices
X_test = X[test_inds, ]; y_test = y[test_inds]
X_train = X[-test_inds, ]; y_train = y[-test_inds]
You could also create test_inds 'from scratch' using test_inds = sample(1:length(y), ceiling(length(y) * 0.2))
Let's take the iris dataset:
# in case you want to use a seed
set.seed(5)
## 75% of the sample size
train_size <- floor(0.75 * nrow(iris))
in_rows <- sample(c(1:nrow(iris)), size = train_size, replace = FALSE)
train <- iris[in_rows, ]
test <- iris[-in_rows, ]
