SVM-handwriting-recognition-master - python

Good day everyone. I am currently working on a project to classify handwriting with an SVM classifier, after downloading and resizing the dataset from NIST Special Database 19. When I try to train my model, I keep getting this error:
File "train_model.py", line 65, in <module>
x_pt = preprocessing.scale(x_pt)
File "C:\Users\Judson_Morgan\anaconda3\lib\site-packages\sklearn\preprocessing\_data.py", line 142, in scale
force_all_finite='allow-nan')
File "C:\Users\Judson_Morgan\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 586, in check_array
context))
ValueError: Found array with 0 sample(s) (shape=(0, 144)) while a minimum of 1 is required by the scale function.
How do I go about resolving this issue?
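The error means x_pt ended up with zero rows, i.e. no samples were loaded before preprocessing.scale was called, so the first thing to check is the loading step. A minimal sanity check, assuming the resized NIST images sit in a single folder as 12x12 grayscale PNGs (the folder name, extension filter, and loading code below are placeholders, not your script):
import os
import numpy as np
from PIL import Image
from sklearn import preprocessing

# Placeholder layout: 12x12 images (12*12 = 144 features) in one folder.
image_dir = "nist_resized"
files = [f for f in os.listdir(image_dir) if f.lower().endswith(".png")]
print("found %d image files" % len(files))  # 0 here means the path or extension filter is wrong

x_pt = np.array([np.asarray(Image.open(os.path.join(image_dir, f)).convert("L"), dtype=np.float64).ravel()
                 for f in files])
if len(x_pt) == 0:
    raise RuntimeError("no samples loaded - fix the dataset path/filter before calling scale()")
x_pt = preprocessing.scale(x_pt)
If the file count prints 0, the path or the extension filter does not match how the resized dataset was actually written to disk.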

Related

How to implement a custom dataset in a PyTorch project

I'd like to train a neural network on a given dataset (every image containing some kind of object, for example a dog). After training, the NN should help me classify my images (downloaded from Instagram) as "image includes a dog (with probability 0.XX)" or "image doesn't include a dog (with probability 0.XX)".
Obviously the Instagram images do not all have the same size (though they all have the same format, .jpg, due to filtering), and the images from my dataset do not all have the same size either.
While testing, I'm getting this error:
Traceback (most recent call last):
File "/venv/nn.py", line 129, in <module>
train(model=globalModel, hardware=hw, train_loader=loader, optimizer=optimizer, epoch=1)
File "/venv/nn.py", line 74, in train
for batch_idx, (data, target) in enumerate(train_loader):
File "\venv\lib\site-packages\torch\utils\data\dataloader.py", line 345, in __next__
data = self._next_data()
File "\venv\lib\site-packages\torch\utils\data\dataloader.py", line 384, in _next_data
index = self._next_index() # may raise StopIteration
File "\venv\lib\site-packages\torch\utils\data\dataloader.py", line 339, in _next_index
return next(self._sampler_iter) # may raise StopIteration
File "\venv\lib\site-packages\torch\utils\data\sampler.py", line 200, in __iter__
for idx in self.sampler:
File "\venv\lib\site-packages\torch\utils\data\sampler.py", line 62, in __iter__
return iter(range(len(self.data_source)))
TypeError: object of type 'type' has no len()
with this code: https://pastebin.com/DcvbeMcq
Does anyone know how to implement a custom dataset correctly?
It looks like the problem is that you're passing the customDataset class itself rather than an instance of it.
Try changing your loader creation code to
loader = torch.utils.data.DataLoader(customDataset(), batch_size=4)
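For completeness, a minimal customDataset sketch that satisfies what DataLoader needs (__len__ and __getitem__); the folder name, fixed label, and transform below are placeholders, not taken from the pastebin code:
import os
import torch
from PIL import Image
from torchvision import transforms

class customDataset(torch.utils.data.Dataset):
    # Minimal example: every image in one folder, all given the same fixed label.
    def __init__(self, root="train_images", label=1):
        self.paths = [os.path.join(root, f) for f in sorted(os.listdir(root))]
        self.label = label
        self.transform = transforms.Compose([transforms.Resize((224, 224)),
                                             transforms.ToTensor()])

    def __len__(self):
        # DataLoader's sampler calls len() on the dataset; the class object itself has no len, hence the error above.
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")
        return self.transform(img), self.label

loader = torch.utils.data.DataLoader(customDataset(), batch_size=4, shuffle=True)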
I fixed all the previous bugs and errors. At the moment I'm trying to label my input data manually via PyTorch:
train_data = torchvision.datasets.ImageFolder(root=TRAIN_DATA_PATH, transform=TRANSFORM)
The folder TRAIN_DATA_PATH contains many pictures of, for example, dogs.
How can I manually label them all as "dogs"?
I tried wrapping the training data in a DataLoader to label them, but it hasn't worked so far.
train_data_loader = torch.utils.data.DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)
Since I just want to get my evaluation data predicted as "dog (1)" or "not dog (0)", I have to label all of my dog images as "1" or "dog".
But how can I do that?
Thanks to every reader!
Updated code (testing level): https://hastebin.com/axuvupihed.py
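To the labelling question: ImageFolder derives the label of every image from the name of its parent folder, so putting all dog pictures in one subfolder and all non-dog pictures in another is usually enough. A rough sketch (folder names and transform are placeholders):
import torch
import torchvision
from torchvision import transforms

# Assumed layout: ImageFolder turns each subfolder of TRAIN_DATA_PATH into one class,
# so the folder name *is* the label.
# TRAIN_DATA_PATH/
#     dog/        -> class index 0
#     not_dog/    -> class index 1
TRAIN_DATA_PATH = "train_data"
TRANSFORM = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

train_data = torchvision.datasets.ImageFolder(root=TRAIN_DATA_PATH, transform=TRANSFORM)
print(train_data.class_to_idx)   # e.g. {'dog': 0, 'not_dog': 1}; classes are assigned in sorted order

train_data_loader = torch.utils.data.DataLoader(train_data, batch_size=4, shuffle=True)
images, labels = next(iter(train_data_loader))
print(labels)                    # integer labels derived from the folder names
If you really need "dog" to map to 1 specifically, either rename the folders so the sorted order matches, or remap the labels in the training loop.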

Python sklearn ValueError: array is too big

I made a simple script in Python (3.7) that classifies a satellite image, but it can only classify a clip of the satellite image. When I try to classify the whole satellite image, it returns this:
Traceback (most recent call last):
File "v0-3.py", line 219, in classification_tool
File "sklearn\cluster\k_means_.py", line 972, in fit
File "sklearn\cluster\k_means_.py", line 312, in k_means
File "sklearn\utils\validation.py", line 496, in check_array
File "numpy\core\_asarray.py", line 85, in asarray
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
I tried using MiniBatchKMeans instead of KMeans (from Sklearn.KMeans : how to avoid Memory or Value Error?), but it still doesn't work. How can I avoid or solve this error? Maybe there are some mistakes in my code?
Oh, I'm an idiot: I used the 32-bit version of Python instead of the 64-bit one.
Maybe reinstalling the 64-bit version of Python will solve your problem.
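For anyone else who hits this, a quick way to check which interpreter is actually running (standard library only, nothing project-specific): a 32-bit process only gets a few GB of address space, which is what numpy's "array is too big" check guards against.
import struct
import sys

print(sys.version)                 # the build string normally mentions 32 bit or 64 bit
print(struct.calcsize("P") * 8)    # pointer size in bits: 32 on a 32-bit build, 64 on a 64-bit build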

TypeError: Input must be a SparseTensor when using Tensorflow's canned RNNEstimator

I am using Tensorflow's Dataset API to create a basic dataset and use it as input to the canned RNNEstimator as follows (note: some lines have been omitted for brevity):
sequence_feature_colums = [tf.contrib.feature_column.sequence_numeric_column("price")]
estimator = tf.contrib.estimator.RNNEstimator(
head=tf.contrib.estimator.regression_head(),
sequence_feature_columns=sequence_feature_colums)
dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
estimator.train(input_fn=lambda: dataset)
But I am seeing the following error when calling train:
...
File "/Users/luke/virtualenvs/smp-rnn/lib/python3.6/site-packages/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column.py", line 497, in _get_sequence_dense_tensor
sp_tensor, default_value=self.default_value)
File "/Users/luke/virtualenvs/smp-rnn/lib/python3.6/site-packages/tensorflow/python/ops/sparse_ops.py", line 1449, in sparse_tensor_to_dense
sp_input = _convert_to_sparse_tensor(sp_input)
File "/Users/luke/virtualenvs/smp-rnn/lib/python3.6/site-packages/tensorflow/python/ops/sparse_ops.py", line 68, in _convert_to_sparse_tensor
raise TypeError("Input must be a SparseTensor.")
TypeError: Input must be a SparseTensor.
What am I doing wrong here? This is a very simple example following the standard steps for using Estimators, so I'm not sure where in my code I have decided not to use a SparseTensor.
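One thing worth trying, though I can't confirm it is the whole fix: the traceback shows the sequence column handing your dense "price" tensor straight to sparse_tensor_to_dense, so it appears to expect a SparseTensor. A sketch of converting the feature inside the input_fn, reusing features, labels and estimator from the snippet above (the batch size and the dense_to_sparse helper are mine, not part of the TF API):
import tensorflow as tf

def dense_to_sparse(dense):
    # Generic helper (not a built-in): keep the non-zero entries as sparse values.
    indices = tf.where(tf.not_equal(dense, 0.0))
    values = tf.gather_nd(dense, indices)
    return tf.SparseTensor(indices, values, tf.shape(dense, out_type=tf.int64))

def input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    dataset = dataset.batch(32)                       # arbitrary batch size
    feats, lbls = dataset.make_one_shot_iterator().get_next()
    feats["price"] = dense_to_sparse(feats["price"])  # hand the sequence column a SparseTensor
    return feats, lbls

estimator.train(input_fn=input_fn)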

Tensorflow Dataset API - .from_tensor_slices() / .from_tensor() - cannot create a tensor proto whose content is larger than 2gb

So I want to use the Dataset API for batching my large dataset (~8 GB), because my GPU sits idle for long stretches while I pass data from Python to TensorFlow using feed_dict.
I followed the tutorial mentioned here:
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/5_DataManagement/tensorflow_dataset_api.py
but when running my simple code:
one_hot_dataset = np.load("one_hot_dataset.npy")
dataset = tf.data.Dataset.from_tensor_slices(one_hot_dataset)
I am getting the error message with TensorFlow 1.8 and Python 3.5:
Traceback (most recent call last):
File "<ipython-input-17-412a606c772f>", line 1, in <module>
dataset = tf.data.Dataset.from_tensor_slices((one_hot_dataset))
File "/anaconda2/envs/tf/lib/python3.5/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 235, in from_tensor_slices
return TensorSliceDataset(tensors)
File "/anaconda2/envs/tf/lib/python3.5/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1030, in __init__
for i, t in enumerate(nest.flatten(tensors))
File "/anaconda2/envs/tf/lib/python3.5/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1030, in <listcomp>
for i, t in enumerate(nest.flatten(tensors))
File "/anaconda2/envs/tf/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1014, in convert_to_tensor
as_ref=False)
File "/anaconda2/envs/tf/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1104, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/anaconda2/envs/tf/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 235, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/anaconda2/envs/tf/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 214, in constant
value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File "/anaconda2/envs/tf/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py", line 496, in make_tensor_proto
"Cannot create a tensor proto whose content is larger than 2GB.")
ValueError: Cannot create a tensor proto whose content is larger than 2GB.
How can I solve this? I think the cause is obvious, but what were the TF developers thinking when they limited the input data to 2 GB?! I really cannot understand this rationale, and what is the workaround when dealing with larger datasets?
I googled quite a lot but I could not find any similar error message. When I use a fifth of the numpy dataset, the steps above work without any issues.
I somehow need to tell TensorFlow that I will actually be loading the data batch by batch, and I probably want to prefetch a few batches in order to keep my GPU busy. But it seems as if it is trying to load the whole numpy dataset at once. So what is the benefit of using the Dataset API? I can reproduce this error by simply trying to load my numpy dataset as a tf.constant into the TensorFlow graph, which obviously does not fit, and I get OOM errors.
Tips and troubleshooting hints appreciated!
This issue is addressed in the tf.data user guide (https://www.tensorflow.org/guide/datasets) in "Consuming NumPy arrays" section.
Basically, create a dataset.make_initializable_iterator() iterator and feed your data at runtime.
If this does not work for some reason, you can write your data to files or create a dataset from a Python generator (https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_generator), where you can put arbitrary Python code, including slicing your numpy array and yielding the slices.
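A rough sketch of that placeholder pattern for the array in the question (the batch size and prefetch depth are arbitrary choices here):
import numpy as np
import tensorflow as tf

one_hot_dataset = np.load("one_hot_dataset.npy")

# Keep the big array out of the graph: define a placeholder with the same dtype/shape
# and feed the real data only when initializing the iterator.
features_placeholder = tf.placeholder(one_hot_dataset.dtype, one_hot_dataset.shape)
dataset = tf.data.Dataset.from_tensor_slices(features_placeholder)
dataset = dataset.batch(128).prefetch(1)

iterator = dataset.make_initializable_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer,
             feed_dict={features_placeholder: one_hot_dataset})
    first_batch = sess.run(next_batch)   # a numpy batch, no giant constant in the graph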

CountVectorizer() in scikit-learn Python gives Memory error when feeding big Dataset. Same code with Smaller dataset works fine, what am I missing?

I am working on a two-class machine learning problem. The training set contains 2 million rows of URLs (strings) with labels 0 and 1. A LogisticRegression() classifier should predict one of the two labels when test data is passed. I get 95% accuracy when I use a smaller dataset, i.e. 78,000 URLs with 0 and 1 as labels.
The problem I am having is that when I feed in the big dataset (2 million rows of URL strings) I get this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 540, in runfile
execfile(filename, namespace)
File "C:/Users/Slim/.xy/startups/start/chi2-94.85 - Copy.py", line 48, in <module>
bi_counts = bi.fit_transform(url_list)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 780, in fit_transform
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 717, in _count_vocab
j_indices.append(vocabulary[feature])
MemoryError
My code, which works for the small dataset with reasonable accuracy, is:
bi = CountVectorizer(ngram_range=(3, 3),binary = True, max_features=9000, analyzer='char_wb')
bi_counts = bi.fit_transform(url_list)
tf = TfidfTransformer(norm='l2', use_idf=True)
X_train_tf = tf.fit_transform(bi_counts)
clf = LogisticRegression(penalty='l1',intercept_scaling=0.5,random_state=True)
clf.fit(train_x2,y)
I tried to keep max_features as small as possible, say max_features=100, but I still got the same result.
Please note:
I am using a Core i5 with 4 GB of RAM.
I tried the same code on 8 GB of RAM, but no luck.
I am using Python 2.7.6 with scikit-learn, NumPy 1.8.1, SciPy 0.14.0, and Matplotlib 1.3.1.
UPDATE:
@Andreas Mueller suggested using HashingVectorizer(). I used it with both the small and the large dataset: the 78,000-row dataset ran successfully, but the 2-million-row dataset gave me the same memory error as shown above. I tried it on 8 GB of RAM, and memory usage was around 30% while processing the big dataset.
IIRC the max_features is only applied after the whole dictionary is computed.
The easiest way out is to use the HashingVectorizer that does not compute a dictionary.
You will lose the ability to get the corresponding token for a feature, but you shouldn't run into memory issues any more.
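A rough sketch of that swap, reusing url_list and y from the question (the n_features value is my own choice, and very old scikit-learn releases may spell some parameters differently):
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# HashingVectorizer never builds a vocabulary, so memory is bounded by n_features.
hv = HashingVectorizer(ngram_range=(3, 3), analyzer='char_wb', n_features=2 ** 18)
X_counts = hv.transform(url_list)        # transform only: there is no vocabulary to fit

tf = TfidfTransformer(norm='l2', use_idf=True)
X_train_tf = tf.fit_transform(X_counts)

clf = LogisticRegression(penalty='l1', intercept_scaling=0.5)
clf.fit(X_train_tf, y)
If even the hashed matrix does not fit in memory, the usual next step is to read the URLs in chunks and train an out-of-core model with partial_fit (e.g. SGDClassifier with a logistic loss), but that means changing the classifier.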
