issues with machine learning scikit learn in python

issues with machine learning scikit learn in python - python

I'm trying to reproduce a tutorial seen
here.
Everything work perfectly until I add the .fit methods with my training set.
Here is a sample of my code :
# TRAINING PART
train_dir = 'pdf/learning_set'
dictionary = make_dic(train_dir)
train_labels = np.zeros(20)
train_labels[17:20] = 1
train_matrix = extract_features(train_dir)
model1 = MultinomialNB()
model1.fit(train_matrix, train_labels)
# TESTING PART
test_dir = 'pdf/testing_set'
test_matrix = extract_features(test_dir)
test_labels = np.zeros(8)
test_labels[4:7] = 1
result1 = model1.predict(test_matrix)
print(confusion_matrix(test_labels, result1))
Here is my Traceback:
Traceback (most recent call last):
File "ML.py", line 65, in <module>
model1.fit(train_matrix, train_labels)
File "/usr/local/lib/python3.6/site-packages/sklearn/naive_bayes.py",
line 579, in fit
X, y = check_X_y(X, y, 'csr')
File "/usr/local/lib/python3.6/site-
packages/sklearn/utils/validation.py", line 552, in check_X_y
check_consistent_length(X, y)
File "/usr/local/lib/python3.6/site-
packages/sklearn/utils/validation.py", line 173, in
check_consistent_length
" samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of
samples: [23, 20]
I would like to know how can I solve this issue ?
I'm working on Ubuntu 16.04, with python 3.6.

ValueError: Found input variables with inconsistent numbers of
samples: [23, 20]
That means you have 23 training Vectors (train_matrix has 23 rows)
but only 20 training labels (train_labels is an array of 20 values)
change train_labels = np.zeros(20)
to train_labels = np.zeros(23)
and it should work.

Related

not able to accurately input a batch of images into a model.fit

My model is designed to train dual images. Since the dataset is very huge I used tf.data.Dataset method to get them as batches as suggested here. However I had a difficulty at properly inputting a batch of images for training. I looked up some possible solutions to no avail. Still, after these modifications:
ds_train = tf.data.Dataset.zip((tr_inputs, tr_labels)).batch(64)
iterator = ds_train.make_one_shot_iterator()
next_batch = iterator.get_next()
result = list()
with tf.Session() as sess:
try:
while True:
result.append(sess.run(next_batch))
except tf.errors.OutOfRangeError:
pass
train_examples = np.array(list(zip(*result))[0]) # tr_examples[0][0].shape (64, 224, 224, 3)
val_examples = np.array(list(zip(*val_result))[0]) # val_examples[0][0].shape (64, 224, 224, 3)
The training code snippet is as follows:
hist = base_model.fit((tr_examples[0][0], tr_examples[0][1]), epochs=epochs, verbose=1,
validation_data=(val_examples[0][0], val_examples[0][1]), shuffle=True)
And the error trace:
Traceback (most recent call last):
File "/home/user/00_files/project/DOUBLE_INPUT/dual_input.py", line 177, in <module>
validation_data=(val_examples[0][0], val_examples[0][1]), shuffle=True)
File "/home/user/.local/lib/python3.5/site-packages/keras/engine/training.py", line 955, in fit
batch_size=batch_size)
File "/home/user/.local/lib/python3.5/site-packages/keras/engine/training.py", line 754, in _standardize_user_data
exception_prefix='input')
File "/home/user/.local/lib/python3.5/site-packages/keras/engine/training_utils.py", line 90, in standardize_input_data
data = [standardize_single_array(x) for x in data]
File "/home/user/.local/lib/python3.5/site-packages/keras/engine/training_utils.py", line 90, in <listcomp>
data = [standardize_single_array(x) for x in data]
File "/home/user/.local/lib/python3.5/site-packages/keras/engine/training_utils.py", line 25, in standardize_single_array
elif x.ndim == 1:
AttributeError: 'tuple' object has no attribute 'ndim'
Looking at the shapes of inputs (in the code snippets' comments), it should work. I guess there is only one step left, but I am not sure what is missing.
I am using python 3.5, keras 2.2.0, tensorflow-gpu 1.9.0 on Ubuntu 16.04.
Help is much appreciated.
EDIT: after correcting the parantheses, it threw this error:
ValueError: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 2 array(s), but instead got the following list of 1 arrays: [array([[[[0.9607844 , 0.9607844 , 0.9607844 ],
[0.9987745 , 0.9987745 , 0.9987745 ],
[0.9960785 , 0.9960785 , 0.9960785 ],
...,
[0.9609069 , 0.9609069 , 0.96017164...
Process finished with exit code 1

hist = base_model.fit((tr_examples[0][0], tr_examples[0][1]), epochs=epochs, verbose=1,
validation_data=(val_examples[0][0], val_examples[0][1]), shuffle=True)
should be:
hist = base_model.fit(tr_examples[0][0], tr_examples[0][1], epochs=epochs, verbose=1,
validation_data=(val_examples[0][0], val_examples[0][1]), shuffle=True)
Note that while the validation_data parameter expects a tuple, the training input/label pair should not be a tuple (i.e., remove the parenthesis).

TensorFlow Addition Error when adding results from binary_crossentropy() together

I was in the middle of training my gan when a very unexpected error came up. I have no idea how to fix it. The error doesn't come right away it happens about 2-3 minutes into my training. Here is the Error
Traceback (most recent call last):
File "gan.py", line 103, in <module>
train(X_train_dataset,200)
File "gan.py", line 80, in train
train_step(images) # takes images and improves both the generator and the discriminator
File "gan.py", line 91, in train_step
discriminator_loss = get_discriminator_loss(real_output,fake_output)
File "gan.py", line 48, in get_discriminator_loss
return fake_loss+real_loss
File "/home/jake/.local/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 1125, in binary_op_wrapper
return func(x, y, name=name)
File "/home/jake/.local/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
return target(*args, **kwargs)
File "/home/jake/.local/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 1447, in _add_dispatch
return gen_math_ops.add_v2(x, y, name=name)
File "/home/jake/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 486, in add_v2
_ops.raise_from_not_ok_status(e, name)
File "/home/jake/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 6843, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [100] vs. [13] [Op:AddV2]
So from I can tell from this call back my error occures during my get_discriminator_loss() so here is that code.
def get_discriminator_loss(real_predictions,fake_predictions):
real_predictions = tf.sigmoid(real_predictions)
fake_predictions = tf.sigmoid(fake_predictions)
real_loss=tf.losses.binary_crossentropy(tf.ones_like(real_predictions),real_predictions)
fake_loss=tf.losses.binary_crossentropy(tf.zeros_like(fake_predictions),fake_predictions)
return fake_loss+real_loss
Does anyone have any ideas? And remember this is after running successfully for about 2-3 minutes. The error doesn't occur in the first many passes.
I've found the source of my error but I don't know why it's occuring.
My real loss at one of the passes has only 13 values instead of the normal 100
How can this be?
Here is my full code.
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import time
import pickle
pickle_in_X = open("X.pickle","rb")
pickle_in_y = open("y.pickle","rb")
X=pickle.load(pickle_in_X)
y = pickle.load(pickle_in_y)
y = np.array(y)
X_train = X[ int(len(X)*.3): ]
y_train = y[ int(len(y)*.3 ): ]
X_test = X[ :int(len(X)*.3) ]
y_test = X[ :int(len(y)*.3) ]
X_train = (X_train-127.5)/127.5
BATCH_SIZE = 100
X_train_dataset = tf.data.Dataset.from_tensor_slices(X_train).batch(BATCH_SIZE)
#creates a discriminator model.
#discriminator will ouput 0-1 which represents the probability that the image is real
def make_discriminator():
model = tf.keras.Sequential()
model.add(tf.keras.layers.Conv2D(7,(3,3),padding="same",input_shape=(40,40,1)))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.LeakyReLU())
model.add(tf.keras.layers.Dense(50,activation="relu"))
model.add(tf.keras.layers.Dense(1))
return model
model_discriminator = make_discriminator()
discriminator_optimizer = tf.optimizers.Adam(1e-3)
#real_loss is the amount of error when trying to guess that the real images are in fact real. i.e loss will be if our discriminator guesses that there is a 100% chance that this real image is real
#fake_loss is the amount of error when trying to guess that the fake images are in fact fake. i.e loss will be zero if our discriminator guesses there is a 0% chance that this fake image is fake
#returns the total of our loss
def get_discriminator_loss(real_predictions,fake_predictions):
real_predictions = tf.sigmoid(real_predictions)
fake_predictions = tf.sigmoid(fake_predictions)
real_loss=tf.losses.binary_crossentropy(tf.ones_like(real_predictions),real_predictions)
fake_loss=tf.losses.binary_crossentropy(tf.zeros_like(fake_predictions),fake_predictions)
return fake_loss+real_loss
#take an input of a random string of numbers. and output either a dog or a cat
def make_generator():
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(10*10*256,input_shape = (100,)))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Reshape((10,10,256)))
model.add(tf.keras.layers.Conv2DTranspose(128,(3,3),padding="same"))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Conv2DTranspose(64,(3,3),strides=(2,2),padding="same"))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Conv2DTranspose(1,(3,3),strides=(2,2),padding="same"))
return model
model_generator = make_generator()
#generator gets rewarded when it fools the discriminator
def get_generator_loss(fake_predictions):
fake_predictions = tf.sigmoid(fake_predictions)
fake_loss=tf.losses.binary_crossentropy(tf.ones_like(fake_predictions),fake_predictions)
return fake_loss
generator_optimizer = tf.optimizers.Adam(1e-3)
#training
def train(X_train_dataset,epochs):
for _ in range(epochs):
for images in X_train_dataset:
images = tf.cast(images,tf.dtypes.float32)
train_step(images) # takes images and improves both the generator and the discriminator
def train_step(images):
fake_image_noise = np.random.randn(BATCH_SIZE,100)#produces 100 random numbers that wll be converted to images
with tf.GradientTape() as generator_gradient, tf.GradientTape() as discriminator_gradient:
generated_images = model_generator(fake_image_noise)
real_output = model_discriminator(images)
fake_output = model_discriminator(generated_images)
generator_loss = get_generator_loss(fake_output)
discriminator_loss = get_discriminator_loss(real_output,fake_output)
gradients_of_generator = generator_gradient.gradient(generator_loss,model_generator.trainable_variables)#gradient of gen loss with respect to trainable variables
gradients_of_discriminator = discriminator_gradient.gradient(discriminator_loss,model_discriminator.trainable_variables)
discriminator_optimizer.apply_gradients(zip(gradients_of_discriminator,model_discriminator.trainable_variables))
generator_optimizer.apply_gradients(zip(gradients_of_generator,model_generator.trainable_variables))
print("generator loss: ", np.mean(generator_loss))
print("discriminator loss: ",np.mean(discriminator_loss))
train(X_train_dataset,200)
model_generator.save('genModel')
model_discriminator.save('discModel')

If the size of your dataset is not a multiple of your batch size, then your last batch will have a smaller number of samples than the other batches. To avoid this, you can force a tf.data.Dataset to drop the last batch if it is smaller than the batch size. See the documentation for more information.
tf.data.Dataset.from_tensor_slices(X_train).batch(BATCH_SIZE, drop_remainder=True)

Custom keras generator __get_item() is called more than len__()

I've built a custom keras generator.function.
It yields an img and associated gt.
It works well for the training phase with predict_generator() function.
To evaluate my model, I use it on a test set containing 592 images. I call it with the predict_generator() function.
So I get the right number of prediction (592). Every time get_item() function is called, I add the GT to the self.gt list.
Then, after running predict_generator(), I compare the predictions with the stored GT.
My problem :
I want to store ground truth array in a list, everytime the generator is called. But at the end, I have more GT_arrays than the 592 predictions.
So I can't build my confusion matrix...
Here is the code of the generator:
class DataGenerator(Sequence):
def __init__(self, data_folders_txt, gen_data_type, batchsize, shuffle=True, classes=None, selected_class=None):
'''
- data_fodlers_txt : txt_file containing all the paths to different folders of data
- gen_type : string : can be either "train", "val" or "test" (correspond to a specific folder)
- shuffle : Shuffle the dataset at each epoch
- classes : dict of classes with associated nb (class nb must match the class position on the class axis of the ground truth one-hot-encoded array)
- selected_class : name of the selected class (128x128x1) in the 128x128x3 ground truth one-hot-encoded array
'''
self.gt = []
self.shuffle = shuffle
self.gen_data_type = gen_data_type
self.batchsize = batchsize
self.data_folders = open(data_folders_txt, "r").readlines()
self.list_IDs = self.tiles_list_creation(self.data_folders)
self.samples = len(self.list_IDs)
self.classes = classes
self.selected_class = selected_class
self.index = 0
self.on_epoch_end()
def tiles_list_creation(self, list_folders):
list_IDs = []
for folder in list_folders:
samples = glob.glob(folder.rstrip() + self.gen_data_type + '3/tile/*')
list_IDs += samples
random.shuffle(list_IDs)
return list_IDs
def __len__(self):
if len(self.list_IDs) % self.batchsize == 0:
return len(self.list_IDs)//self.batchsize
else:
return len(self.list_IDs) // self.batchsize + 1
def __getitem__(self, index):
self.index = index
X = []
y = []
# min(...,...) is for taking all the data without being out of range
for i in range(index*self.batchsize, min(self.samples, (index+1)*self.batchsize)):
tile = np.load(self.list_IDs[i])
#If specific class is specified, just take the right channel of the GT_array corresponding to the wanted class
if self.classes:
gt = np.load(self.list_IDs[i].replace("tile", "gt"))[:, :, self.classes[self.selected_class]]
gt = np.expand_dims(gt, axis=-1)
else:
gt = np.load(self.list_IDs[i].replace("tile", "gt"))
#store ground truth to compare the values between gt and predictions after running predict_generator()
self.gt.append(gt)
X.append(tile)
y.append(gt)
return np.array(X), np.array(y)
def on_epoch_end(self):
if self.shuffle:
random.shuffle(self.list_IDs)
And here is where I call it:
batchsize = 10
model = load_model(model_path, custom_objects={'jaccard_distance': jaccard_distance, 'auc': auc})
test_gen = DataGenerator("/path/to/data/path/written/in/file.txt",
gen_data_type='test',
batchsize=batchsize,
classes=None,
selected_class=None)
y_pred = model.predict_generator(test_gen, steps=None, verbose=1)
y_true = np.array(test_gen.gt)
plot_confusion_matrix(y_true, y_pred, ["Hedgerows", "No Hedgerows"])
Here is the error:
60/60 [==============================] - 4s 71ms/step
Traceback (most recent call last):
File "/work/stages/mathurin/sentinel_segmentation/unet/confusion_matrix.py", line 95, in <module>
plot_confusion_matrix(y_true, y_pred, ["Hedgrows", "No Hedgerows"], normalize=normalization, title=model_path.split('/')[-1].split('.')[0])
File "/work/stages/mathurin/sentinel_segmentation/unet/confusion_matrix.py", line 35, in plot_confusion_matrix
cm = confusion_matrix(y_true, y_pred)
File "/work/tools/anaconda3/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 253, in confusion_matrix
y_type, y_true, y_pred = _check_targets(y_true, y_pred)
File "/work/tools/anaconda3/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 71, in _check_targets
check_consistent_length(y_true, y_pred)
File "/work/tools/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 235, in check_consistent_length
" samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [702, 592]
when I look at the index number of the get_item() function, it is not the expected number... It should be the number returned by the len() function but it is always smaller.
In this example, after making the predictions, the self.index parameter value is 8.
Like if it was exceeding then restarting at 0, 1, 2, etc...
EDIT: more strange !
I just re-run and I get a different number of stored_gt arrays ...
60/60 [==============================] - 6s 100ms/step
Traceback (most recent call last):
File "/work/tools/pycharm-community-2019.1.1/helpers/pydev/pydevd.py", line 1741, in <module>
main()
File "/work/tools/pycharm-community-2019.1.1/helpers/pydev/pydevd.py", line 1735, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "/work/tools/pycharm-community-2019.1.1/helpers/pydev/pydevd.py", line 1135, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/work/tools/pycharm-community-2019.1.1/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/work/stages/mathurin/sentinel_segmentation/unet/confusion_matrix.py", line 95, in <module>
plot_confusion_matrix(y_true, y_pred, ["Hedgrows", "No Hedgerows"], normalize=normalization, title=model_path.split('/')[-1].split('.')[0])
File "/work/stages/mathurin/sentinel_segmentation/unet/confusion_matrix.py", line 35, in plot_confusion_matrix
cm = confusion_matrix(y_true, y_pred)
File "/work/tools/anaconda3/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 253, in confusion_matrix
y_type, y_true, y_pred = _check_targets(y_true, y_pred)
File "/work/tools/anaconda3/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 71, in _check_targets
check_consistent_length(y_true, y_pred)
File "/work/tools/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 235, in check_consistent_length
" samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [682, 592]

There is nothing strange in this, generators are run by keras using multiple processes/threads to improve performance, specially for training, that's why fit_generator and predict_generator have keyword arguments like workers, use_multiprocessing, max_queue_size. So the solution is not to store any kind of ground truth or state in the generator instance.
For your specific case, you can use another kind of prediction loop, by calling the generator manually:
labels = []
preds = []
for step in range(len(generator)):
data, label = generator.__getitem__(step)
pred = model.predict(data)
preds.append(pred)
labels.append(label)
Then using preds and labels to make a confusion matrix.

Multiple labels with tensorflow regression

I'm trying to get a multilabel model going in tensorflow. I saw a related question here: Multiple labels with tensorflow, but couldn't get the solution working.
The code is from a tensorflow tutorial. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/input_fn/boston.py
FEATURES = ["crim", "zn", "indus", "nox", "rm",
"dis", "tax", "ptratio"]
LABELS = ["medv", "age"]
def get_input_fn(data_set, num_epochs=None, shuffle=True):
return tf.estimator.inputs.pandas_input_fn(
x=pd.DataFrame({k: data_set[k].values for k in FEATURES}),
# y=pd.Series(data_set[LABEL].values),
y=list(map(lambda label: data_set[label].values, LABELS)),
num_epochs=num_epochs,
shuffle=shuffle)
In my regression I set the label dimension to 2.
regressor = tf.estimator.DNNRegressor(feature_columns=feature_cols,
label_dimension=2,
hidden_units=[10, 10],
model_dir="/tmp/boston_model")
With my try I get:
Traceback (most recent call last):
File "./boston.py", line 85, in <module>
tf.app.run()
File "/home/jillian/.eb/software/machine-learning/1.00/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "./boston.py", line 67, in main
regressor.train(input_fn=get_input_fn(training_set), steps=5000)
File "./boston.py", line 43, in get_input_fn
shuffle=shuffle)
File "/home/jillian/.eb/software/machine-learning/1.00/lib/python3.6/site-packages/tensorflow/python/estimator/inputs/pandas_io.py", line 87, in pand
as_input_fn
'Index for y: %s\n' % (x.index, y.index))
ValueError: Index for x and y are mismatched.
Index for x: RangeIndex(start=0, stop=400, step=1)
Index for y: <built-in method index of list object at 0x7f6f64a5bb48>
I also tried setting y to a numpy array instead of a list.

Tensorflow error using my own data for text classification

I've been playing with the Tensorflow library doing the tutorials.
I'm using this example. And I changed the parameters in the example from this: n_classes = 15
to this: n_classes = 2 as I have only two classes to classify.
I read data like:
train = pandas.read_csv('tensorflow_feed/test/train_with_abs.csv', header=None)
X_train, y_train = train[1], train[0]
test = pandas.read_csv('tensorflow_feed/test/test_with_abs.csv', header=None)
X_test, y_test = test[1], test[0]
But it gives following error:
Total words: 35
Traceback (most recent call last):
File "/home/sumit/PycharmProjects/experiments/text_classification_save_restore.py", line 94, in <module>
classifier.fit(X_train, y_train)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/base.py", line 160, in fit
monitors=monitors)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 449, in _train_model
train_op, loss_op = self._get_train_ops(features, targets)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 673, in _get_train_ops
_, loss, train_op = self._call_model_fn(features, targets, ModeKeys.TRAIN)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 656, in _call_model_fn
features, targets, mode=mode)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/base.py", line 369, in _model_fn
predictions, loss = model_fn(features, targets)
File "/home/sumit/PycharmProjects/experiments/text_classification_save_restore.py", line 73, in rnn_model
word_list = tf.unpack(word_vectors, axis=1)
TypeError: unpack() got an unexpected keyword argument 'axis'
Process finished with exit code 1

The "axis" parameter was just added to tf.unpack on June 23, and the example you're looking at was changed to use it:
https://github.com/tensorflow/tensorflow/commit/eff93149a6dc8e6826898fd9f9c28c81e21c9836
So I suggest either:
use an older version of the example from before that commit, e.g.:
https://github.com/tensorflow/tensorflow/blob/892ca4ddc12852a7b4633fd08f163941356cb4e6/tensorflow/examples/skflow/text_classification_save_restore.py
build a newer Tensorflow from github HEAD.
I hope that helps!

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

issues with machine learning scikit learn in python - python

Related

not able to accurately input a batch of images into a model.fit

TensorFlow Addition Error when adding results from binary_crossentropy() together

Custom keras generator __get_item() is called more than len__()

Multiple labels with tensorflow regression

Tensorflow error using my own data for text classification

Categories

Resources

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

issues with machine learning scikit learn in python - python

Related

not able to accurately input a batch of images into a model.fit

TensorFlow Addition Error when adding results from binary_crossentropy() together

Custom keras generator __get_item__() is called more than __len__()

Multiple labels with tensorflow regression

Tensorflow error using my own data for text classification

Categories

Resources

Custom keras generator __get_item() is called more than len__()