How to implement a custom dataset in a PyTorch project - python

I’d like to train a NN on a given dataset in which every image contains some kind of object (for example, a dog). After training, the NN should help me classify my images (downloaded from Instagram) as “image includes a dog (with probability: 0.XX)” or “image doesn’t include a dog (with probability: 0.XX)”.
The Instagram images do not all have the same size (though they all have the same format, .jpg, due to filtering), and the images in my dataset do not all have the same size either.
While testing, I'm getting this error:
Traceback (most recent call last):
File "/venv/nn.py", line 129, in <module>
train(model=globalModel, hardware=hw, train_loader=loader, optimizer=optimizer, epoch=1)
File "/venv/nn.py", line 74, in train
for batch_idx, (data, target) in enumerate(train_loader):
File "\venv\lib\site-packages\torch\utils\data\dataloader.py", line 345, in __next__
data = self._next_data()
File "\venv\lib\site-packages\torch\utils\data\dataloader.py", line 384, in _next_data
index = self._next_index() # may raise StopIteration
File "\venv\lib\site-packages\torch\utils\data\dataloader.py", line 339, in _next_index
return next(self._sampler_iter) # may raise StopIteration
File "\venv\lib\site-packages\torch\utils\data\sampler.py", line 200, in __iter__
for idx in self.sampler:
File "\venv\lib\site-packages\torch\utils\data\sampler.py", line 62, in __iter__
return iter(range(len(self.data_source)))
TypeError: object of type 'type' has no len()
I get it with this code: https://pastebin.com/DcvbeMcq
Does anyone know how to implement a custom dataset correctly?

It looks like the problem is that you're passing the customDataset class itself rather than an instance of it.
Try changing your loader creation code to:
loader = torch.utils.data.DataLoader(customDataset(), batch_size=4)
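The difference between passing the class and passing an instance can be reproduced without PyTorch. The sketch below uses a hypothetical CustomDataset with placeholder samples (not the code from the pastebin link). It only implements __len__ and __getitem__, which are the two methods a map-style dataset needs; in a real project the class would subclass torch.utils.data.Dataset.

```python
class CustomDataset:
    """Hypothetical map-style dataset: only __len__ and __getitem__ are needed."""

    def __init__(self):
        # Placeholder (sample, label) pairs standing in for image tensors.
        self.samples = [([0.0, 0.1], 1), ([0.2, 0.3], 0), ([0.4, 0.5], 1)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        return self.samples[index]


# Passing the class itself: len() is called on a type object and fails,
# which is the same TypeError the sampler raises in the traceback.
try:
    len(CustomDataset)
    class_error = None
except TypeError as exc:
    class_error = str(exc)

# Passing an instance: __len__ is found and indexing works.
dataset = CustomDataset()
print(class_error)
print(len(dataset), dataset[0])
```

The sampler in the traceback calls len(self.data_source), so whatever is handed to DataLoader has to be an object that supports len(), i.e. customDataset() with parentheses.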

I fixed all the previous bugs and errors. At the moment I'm trying to label my input data manually in PyTorch:
train_data = torchvision.datasets.ImageFolder(root=TRAIN_DATA_PATH, transform=TRANSFORM)
The folder TRAIN_DATA_PATH contains many pictures of, for example, dogs.
How can I manually label them all as "dogs"?
I tried wrapping the training data in a DataLoader to label them, but it hasn't worked so far:
train_data_loader = torch.utils.data.DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)
Since I just want to get my evaluation data predicted as "dog (1)" or "not dog (0)", I have to label all of my dog images as "1" or "dog".
But how can I do that?
Thanks to every reader!
Updated code (testing level): https://hastebin.com/axuvupihed.py
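For context, ImageFolder does not take labels as an argument: it expects one subdirectory per class under the root (root/class_name/image.jpg) and derives the integer label from the sorted subdirectory names. The sketch below mimics that convention in plain Python so it runs without torchvision. The folder names dog and not_dog and the helper find_classes are assumptions for illustration, not taken from the linked code.

```python
import os
import tempfile


def find_classes(root):
    """Map sorted subdirectory names to integer labels (the ImageFolder convention)."""
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    return {name: idx for idx, name in enumerate(classes)}


with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: TRAIN_DATA_PATH/dog/*.jpg and TRAIN_DATA_PATH/not_dog/*.jpg
    for name in ("dog", "not_dog"):
        os.makedirs(os.path.join(root, name))
    class_to_idx = find_classes(root)

print(class_to_idx)
```

With these names the sorted order gives dog the label 0 and not_dog the label 1. If dog should be 1, one option is to pick folder names that sort the other way round; another is to pass a target_transform to ImageFolder that remaps the label.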

Related

IndexError: index 89 is out of bounds for axis 0 with size 89

I am getting this error but I don't understand how to solve it. Can anyone please help? I am trying to run a pre-implemented project on my own text dataset. I am using the CUB 200-2011 bird dataset, which originally had 11788 bird images, and the author's model was trained and tested with all 11788 images. For some reason many images were missing when I downloaded the dataset, so I have only 2200 images in total, 1629 of them for training. (You can see this in the output as well: the model reads 11788 image filenames from somewhere but finds only 1629.)
After running the training file:
python3 bird_01_pretrain.py
Namespace(audio_model='Davenet', batch_size=128, cfg_file='cfg/Pretrain/bird_train.yml', data_path='data/birds', exp_dir='', gpu_id=0, image_model='VGG16', img_size=256, lr=0.001, lr_decay=50, manualSeed=200, margin=1.0, momentum=0.9, n_epochs=120, n_print_steps=2, optim='adam', pretrained_image_model=False, resume=True, rnn_type='GRU', save_root='outputs/pre_train/birds', simtype='MISA', tasks='extraction', weight_decay=0.001)
Total filenames: 11788 001.Black_footed_Albatross/Black_Footed_Albatross_0046_18.jpg
Load filenames from: data/birds/train/filenames.pickle (1629)
Traceback (most recent call last):
File "bird_01_pretrain.py", line 145, in <module>
dataset = SpeechDataset(cfg.DATA_DIR, 'train',
File "/opt/app/dataset/datasets_pre.py", line 358, in __init__
seq_labels[unique_id[i]-1]=i
IndexError: index 89 is out of bounds for axis 0 with size 89
Code:
# calculate the sequence label for the whole dataset
if cfg.DATASET_NAME == 'birds':
    if self.split == 'train':
        unique_id = np.unique(self.class_id)
        seq_labels = np.zeros(cfg.DATASET_ALL_CLSS_NUM)
        for i in range(cfg.DATASET_TRAIN_CLSS_NUM):
            seq_labels[unique_id[i]-1] = i
        self.labels = seq_labels[np.array(self.class_id)-1]
The error is raised at this line: seq_labels[unique_id[i]-1]=i
datasets_pre.py has around 400 lines of code and calls many other modules as well. I would be happy to share the whole code and the other files if anyone wants to see them, but for now I am trying to give the exact piece of code that causes the error.
Please help me :)
I can also provide a link to the author's open GitHub repo if anyone wants to look deeper.
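One plausible reading of the traceback, offered as an assumption since the full file is not shown: with images missing, only 89 distinct class ids remain in the training split, while cfg.DATASET_TRAIN_CLSS_NUM still holds the original class count, so the loop asks for unique_id[89], which does not exist. The sketch below reproduces that pattern with plain Python lists standing in for the NumPy arrays. The helper build_seq_labels and the sample ids are hypothetical.

```python
def build_seq_labels(class_ids):
    """Map each class id that is actually present to a consecutive index 0..k-1."""
    unique_ids = sorted(set(class_ids))
    id_to_seq = {cid: i for i, cid in enumerate(unique_ids)}
    return [id_to_seq[cid] for cid in class_ids], len(unique_ids)


# Hypothetical class ids of the images that survived the download.
class_ids = [1, 1, 4, 7, 7, 200]
configured_train_classes = 150  # stands in for cfg.DATASET_TRAIN_CLSS_NUM

# Looping over the configured count instead of the classes that are present fails.
unique_ids = sorted(set(class_ids))
try:
    for i in range(configured_train_classes):
        unique_ids[i]
    loop_failed_at = None
except IndexError:
    loop_failed_at = i

# Looping over the ids that exist avoids the out-of-bounds access.
labels, n_present = build_seq_labels(class_ids)
print(loop_failed_at, labels, n_present)
```

If this reading is right, either the class count in the config has to match the classes actually present, or the loop has to run over len(unique_id) instead of the configured constant.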

'Magic number mismatch' error when loading mnist dataset

I'm trying to load the MNIST digit dataset and am routinely getting this error. I'm unable to find any solutions online.
This code:
from mnist import MNIST
m = MNIST(path)
x_train, y_train = m.load_training()
Yields this error:
File "<stdin>", line 1, in <module>
File "C:\Python38\lib\site-packages\mnist\loader.py", line 125, in load_training
ims, labels = self.load(os.path.join(self.path, self.train_img_fname),
File "C:\Python38\lib\site-packages\mnist\loader.py", line 250, in load
raise ValueError('Magic number mismatch, expected 2049,'
ValueError: Magic number mismatch, expected 2049,got 529205256
I'm running python-mnist 0.7.
I've just had the exact same error and I fixed it by renaming the files:
First you download the data as gzip files. They should look like this: "train-labels-idx1-ubyte.gz".
Then you need to extract these files.
After extracting, you will probably get files named like this one: "train-labels.idx1-ubyte". The problem is that the file should be named "train-labels-idx1-ubyte" (hyphen instead of dot).
If you rename the files like that it should work; at least that's what worked for me.
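The renaming step can be scripted, and the magic number can be checked directly. IDX files begin with a big-endian 32-bit magic number, which is 2049 for label files, as in the error message. The sketch below writes a tiny fake label file to a temporary directory, renames it, and reads the magic number back. The helpers read_magic and fix_name are hypothetical and not part of python-mnist.

```python
import struct
import tempfile
from pathlib import Path

LABEL_MAGIC = 2049  # 0x00000801, the magic number of an IDX label file


def read_magic(path):
    """Return the first four bytes of a file as a big-endian unsigned integer."""
    with open(path, "rb") as fh:
        return struct.unpack(">I", fh.read(4))[0]


def fix_name(path):
    """Rename e.g. 'train-labels.idx1-ubyte' to 'train-labels-idx1-ubyte'."""
    path = Path(path)
    fixed = path.with_name(path.name.replace(".idx", "-idx"))
    if fixed != path:
        path.rename(fixed)
    return fixed


with tempfile.TemporaryDirectory() as tmp:
    # Fake label file: magic number, item count, then three label bytes.
    wrong = Path(tmp) / "train-labels.idx1-ubyte"
    wrong.write_bytes(struct.pack(">II", LABEL_MAGIC, 3) + bytes([5, 0, 4]))
    fixed = fix_name(wrong)
    fixed_name = fixed.name
    magic_after_fix = read_magic(fixed)

# The first four bytes of a gzip stream (with a stored file name), read the same way.
gzip_header_as_int = struct.unpack(">I", bytes.fromhex("1f8b0808"))[0]
print(fixed_name, magic_after_fix, gzip_header_as_int)
```

The value 529205256 from the question's error equals the bytes 1f 8b 08 08, which is how a gzip file starts. That suggests the file being read in the question may still have been compressed, so it is worth checking that the extracted file, not the .gz archive, ends up under the expected name.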

TypeError: Input must be a SparseTensor when using Tensorflow's canned RNNEstimator

I am using TensorFlow's Dataset API to create a basic dataset and use it as input to the canned RNNEstimator as follows (note that some lines have been omitted for brevity):
sequence_feature_columns = [tf.contrib.feature_column.sequence_numeric_column("price")]
estimator = tf.contrib.estimator.RNNEstimator(
    head=tf.contrib.estimator.regression_head(),
    sequence_feature_columns=sequence_feature_columns)
dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
estimator.train(input_fn=lambda: dataset)
But I am seeing the following error when calling train:
...
File "/Users/luke/virtualenvs/smp-rnn/lib/python3.6/site-packages/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column.py", line 497, in _get_sequence_dense_tensor
sp_tensor, default_value=self.default_value)
File "/Users/luke/virtualenvs/smp-rnn/lib/python3.6/site-packages/tensorflow/python/ops/sparse_ops.py", line 1449, in sparse_tensor_to_dense
sp_input = _convert_to_sparse_tensor(sp_input)
File "/Users/luke/virtualenvs/smp-rnn/lib/python3.6/site-packages/tensorflow/python/ops/sparse_ops.py", line 68, in _convert_to_sparse_tensor
raise TypeError("Input must be a SparseTensor.")
TypeError: Input must be a SparseTensor.
What am I doing wrong here? This is a very simple example following the standard steps for using Estimators, so I'm not sure where in my code I have decided not to use a SparseTensor.

How to fix 'Error(s) in loading state_dict for AWD_LSTM' when using fast-ai

I am using the fast-ai library to train on a sample of the IMDB reviews dataset. My goal is sentiment analysis, and I wanted to start with a small dataset (this one contains 1000 IMDB reviews). I trained the model in a VM by following this tutorial.
I saved the data_lm and data_clas model, then the encoder ft_enc, and after that the classifier learner sentiment_model. I then copied those 4 files from the VM to my machine and wanted to use the pretrained models to classify sentiment.
This is what I did:
# Use the IMDB_SAMPLE file
path = untar_data(URLs.IMDB_SAMPLE)
# Language model data
data_lm = TextLMDataBunch.from_csv(path, 'texts.csv')
# Sentiment classifier model data
data_clas = TextClasDataBunch.from_csv(path, 'texts.csv',
vocab=data_lm.train_ds.vocab, bs=32)
# Build a classifier using the tuned encoder (tuned in the VM)
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('ft_enc')
# Load the trained model
learn.load('sentiment_model')
After that, I wanted to use the model to predict the sentiment of a sentence. When executing this code, I ran into the following error:
RuntimeError: Error(s) in loading state_dict for AWD_LSTM:
size mismatch for encoder.weight: copying a param with shape torch.Size([8731, 400]) from checkpoint, the shape in current model is torch.Size([8888, 400]).
size mismatch for encoder_dp.emb.weight: copying a param with shape torch.Size([8731, 400]) from checkpoint, the shape in current model is torch.Size([8888, 400]).
And the Traceback is:
Traceback (most recent call last):
File "C:/Users/user/PycharmProjects/SentAn/mainApp.py", line 51, in <module>
learn = load_models()
File "C:/Users/user/PycharmProjects/SentAn/mainApp.py", line 32, in load_models
learn.load_encoder('ft_enc')
File "C:\Users\user\Desktop\py_code\env\lib\site-packages\fastai\text\learner.py", line 68, in load_encoder
encoder.load_state_dict(torch.load(self.path/self.model_dir/f'{name}.pth'))
File "C:\Users\user\Desktop\py_code\env\lib\site-packages\torch\nn\modules\module.py", line 769, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
So the error occurs when loading the encoder. I also tried removing the load_encoder line, but the same error occurred at the next line, learn.load('sentiment_model').
I searched the fast-ai forum and noticed that others had this issue too, but found no solution. In this post the user says that this might have to do with different preprocessing, though I couldn't understand why this would happen.
Does anyone have an idea about what I am doing wrong?
It seems the vocabulary sizes of data_clas and data_lm are different. I guess the problem is caused by different preprocessing being used for data_clas and data_lm. To check my guess I simply added
data_clas.vocab.itos = data_lm.vocab.itos
before the following line:
learn_c = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.3)
This fixed the error.
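Why a vocabulary mismatch shows up as a size mismatch can be illustrated in plain Python: the embedding matrix has one row per vocabulary entry, so two vocabularies of different length give two different weight shapes. The sketch below is not fast-ai code. build_itos, embedding_shape and the toy texts are hypothetical, and only the embedding width of 400 is taken from the error message.

```python
from collections import Counter


def build_itos(texts, specials=("xxunk", "xxpad")):
    """Hypothetical vocab builder: special tokens followed by sorted corpus tokens."""
    counts = Counter(token for text in texts for token in text.lower().split())
    return list(specials) + sorted(counts)


def embedding_shape(itos, emb_size=400):
    """The embedding matrix has one row per vocabulary entry."""
    return (len(itos), emb_size)


lm_texts = ["a great movie", "a dull plot", "great acting"]
clas_texts = ["a great movie", "a dull plot"]

lm_itos = build_itos(lm_texts)
clas_itos = build_itos(clas_texts)
mismatch_shape = embedding_shape(clas_itos)

# Reusing the language-model vocabulary makes the shapes agree again.
clas_itos = lm_itos
shared_shape = embedding_shape(clas_itos)
print(embedding_shape(lm_itos), mismatch_shape, shared_shape)
```

In the question, the checkpoint was saved with 8731 rows while the freshly built model has 8888, which fits the same pattern: the vocabulary rebuilt on the local machine is not the one the weights were trained with.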

CountVectorizer() in scikit-learn Python gives Memory error when feeding big Dataset. Same code with Smaller dataset works fine, what am I missing?

I am working on a two-class machine learning problem. The training set contains 2 million rows of URLs (strings) with labels 0 and 1. The LogisticRegression() classifier should predict one of the two labels when test datasets are passed. I get 95% accuracy when I use a smaller dataset, i.e. 78,000 URLs with 0 and 1 as labels.
The problem is that when I feed in the big dataset (2 million rows of URL strings) I get this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 540, in runfile
execfile(filename, namespace)
File "C:/Users/Slim/.xy/startups/start/chi2-94.85 - Copy.py", line 48, in <module>
bi_counts = bi.fit_transform(url_list)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 780, in fit_transform
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 717, in _count_vocab
j_indices.append(vocabulary[feature])
MemoryError
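My code, which works for small datasets with fair enough accuracy, is:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
bi = CountVectorizer(ngram_range=(3, 3), binary=True, max_features=9000, analyzer='char_wb')
bi_counts = bi.fit_transform(url_list)
# use_idf is a constructor argument, not an argument of fit_transform
tf = TfidfTransformer(norm='l2', use_idf=True)
X_train_tf = tf.fit_transform(bi_counts)
clf = LogisticRegression(penalty='l1', intercept_scaling=0.5, random_state=True)
clf.fit(train_x2, y)  # train_x2 and y are not shown in this snippet
I tried to keep max_features as small as possible, say max_features=100, but got the same result.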
Please note:
I am using a Core i5 with 4GB RAM.
I tried the same code with 8GB RAM, but no luck.
I am using Python 2.7.6 with sklearn, NumPy 1.8.1, SciPy 0.14.0, Matplotlib 1.3.1.
UPDATE:
@Andreas Mueller suggested using HashingVectorizer(). I used it with the small and large datasets: the 78,000-row dataset ran successfully, but the 2-million-row dataset gave me the same memory error as shown above. I tried it with 8GB RAM, and in-use memory was at 30% when processing the big dataset.
IIRC, max_features is only applied after the whole dictionary has been computed.
The easiest way out is to use the HashingVectorizer that does not compute a dictionary.
You will lose the ability to get the corresponding token for a feature, but you shouldn't run into memory issues any more.
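The idea behind hashing instead of building a dictionary can be sketched in plain Python. This is an illustration of the hashing trick under simplifying assumptions, not scikit-learn's implementation: it uses plain character 3-grams (without the word-boundary padding of char_wb), an md5-based bucket function, and a tiny n_features. The helper names and the example URLs are hypothetical.

```python
import hashlib


def char_ngrams(text, n=3):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def hash_features(text, n_features=16, n=3):
    """Binary feature vector of fixed length; no vocabulary is stored."""
    vec = [0] * n_features
    for gram in char_ngrams(text, n):
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % n_features
        vec[bucket] = 1  # presence only, comparable to binary=True
    return vec


urls = ["example.com/login", "example.org/index.html"]
rows = [hash_features(u) for u in urls]
print(rows)
```

Each n-gram is mapped straight to a column index, so memory for the mapping stays constant no matter how many distinct n-grams the 2 million URLs contain. The price, as noted above, is that a column can no longer be traced back to its token. In scikit-learn the corresponding class is HashingVectorizer, whose output width is set with n_features rather than max_features.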
