I'm having a problem with a pytorch-ignite classification model. The code is quite long, so I'd like to first ask if anyone can explain this behavior in theory.
I am doing many classifications in a row. In each iteration, I select a subset of my data randomly and perform classification. My results were quite poor (accuracy ~ 0.6). I realized that in each iteration my training dataset is not balanced. I have a lot more class 0 data than class 1; so in a random selection, there tends to be more data from class 0.
So, I modified the selection procedure: I randomly select N data points from class 1, then select N data points from class 0, and concatenate these two together (so the label order is like [1111111100000000]). Finally, I shuffle this list to mix the labels before feeding it to the network.
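The selection procedure described above could look something like this (a minimal numpy sketch; the function and variable names are my own):

```python
import numpy as np

def balanced_subset(labels, n_per_class, rng):
    # Indices belonging to each class
    idx0 = np.flatnonzero(labels == 0)
    idx1 = np.flatnonzero(labels == 1)
    # N points from each class, concatenated: [1111...0000...]
    pick = np.concatenate([rng.choice(idx1, n_per_class, replace=False),
                           rng.choice(idx0, n_per_class, replace=False)])
    # Shuffle to mix the labels before feeding the network
    rng.shuffle(pick)
    return pick

labels = np.array([0] * 10 + [1] * 5)
sel = balanced_subset(labels, 4, np.random.default_rng(0))
```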
The problem is, with this new data selection, my GPU runs out of memory within seconds. This was odd, since with the first data selection policy the code ran for tens of hours.
I retraced my steps: Turns out, if I do not shuffle my data at the end, meaning, if I keep the [1111111100000000] order, all is well. If I do shuffle the data, I need to reduce my batch_size by a factor of 5 or more so the code doesn't crash from running out of GPU memory.
Any idea what is happening here? Is this to be expected?
I found the solution to my problem. But I don't really understand the details of why it works:
When trying to choose a batch_size at first, I chose 90. 64 was slow, I was worried 128 was going to be too large, and some quick googling led me to believe that sticking to powers of 2 shouldn't matter much.
Turns out, it does matter! At least, when your classification training data is balanced. As soon as I changed my batch_size to a power of 2, there was no memory overflow. In fact, I ran the whole thing on a batch_size of 128 and there was no problem :)
Related
I was training an LSTM network in tensorflow. My model has the following configuration:
time_steps = 1700
Cell size: 120
Number of input features x = 512.
Batch size: 34
Optimizer: AdamOptimizer with learning rate = 0.01
Number of epochs = 20
I have GTX 1080 Ti. And my tensorflow version is 1.8.
Additionally, I have set the random seed through tf.set_random_seed(mseed), and I have set the random seed for every trainable variable's initializer so that I can reproduce the same results after multiple runs.
After training the model multiple times, every time for 20 epochs, I found that I was achieving the exact same loss for the first several epochs (7, 8 or 9) during each run, and then the loss started to differ. I was wondering why this is occurring, and if possible, how someone can totally reproduce the results of any model.
Additionally, in my case I am feeding the whole data set during every iteration. That is, I am doing backpropagation through time (BPTT), not truncated BPTT. In other words, I have 20 iterations in total, which is equal to the number of epochs as well.
The following figure demonstrates my problem. Please note that every row corresponds to one epoch.
Please note that each column corresponds to a different run (I only included 2 columns/runs to demonstrate my point).
Finally, replacing the input features with new features of dimension 100, I get better results, as shown in the following image:
Therefore, I am not sure if this is a hardware issue or not.
Any help is much appreciated!!
Assuming everything else you've done is correct, the likely issue is that Adam itself is not fully reproducible.
But there are other potential sources of non-determinism: finalizing your graph and setting its seed, and operation-level seeds.
Hope this helps! It's hard to be sure everything you've done is correct without seeing the code.
To the best of my knowledge, as you might have tried, tf.set_random_seed(seed=1) (or any other integer seed) could be a possible solution.
I am working with the Gensim library to train some data files using doc2vec. While trying to test the similarity of one of the files using the method model.docvecs.most_similar("file"), I always get all the results above 91% with almost no difference between them, which is not logical, because the files do not have similarities between them. So the results are inaccurate.
Here is the code for training the model
import gensim

model = gensim.models.Doc2Vec(vector_size=300, min_count=0, alpha=0.025,
                              min_alpha=0.00025, dm=1)
model.build_vocab(it)
for epoch in range(100):
    model.train(it, epochs=model.iter, total_examples=model.corpus_count)
    model.alpha -= 0.0002
    model.min_alpha = model.alpha
model.save('doc2vecs.model')

model_d2v = gensim.models.doc2vec.Doc2Vec.load('doc2vecs.model')
sim = model_d2v.docvecs.most_similar('file1.txt')
print(sim)
This is the output:
[('file2.txt', 0.9279470443725586), ('file6.txt', 0.9258157014846802), ('file3.txt', 0.92499840259552), ('file5.txt', 0.9209873676300049), ('file4.txt', 0.9180108308792114), ('file7.txt', 0.9141069650650024)]
What am I doing wrong? How could I improve the accuracy of the results?
What is your it data, and how is it prepared? (For example, what does print(next(iter(it))) do, especially if you call it twice in a row?)
By calling train() 100 times, while also retaining the default model.iter of 5, you're actually making 500 passes over the data. The first 5 passes will use train()'s internal, effective alpha-management to lower the learning rate gradually to your declared min_alpha value. Then your next 495 passes will be at your own clumsily-managed alpha rates, first back up near 0.025 and then lower with each batch-of-5 until you reach 0.005.
None of that is a good idea. You can just call train() once, passing it your desired number of epochs. A typical number of epochs in published work is 10-20. (A bit more might help with a small dataset, but if you think you need hundreds, something else is probably wrong with the data or setup.)
If it's a small amount of data, you won't get very interesting Word2Vec/Doc2Vec results, as these algorithms depend on lots of varied examples. Published results tend to use training sets with tens-of-thousands to millions of documents, and each document at least dozens, but preferably hundreds, of words long. With tinier datasets, sometimes you can squeeze out adequate results by using more training passes, and smaller vectors. Also using the simpler PV-DBOW mode (dm=0) may help with smaller corpuses/documents.
The values reported by most_similar() are not similarity "percentages". They're cosine-similarity values, from -1.0 to 1.0, and their absolute values are less important than the relative ranks of different results. So it shouldn't matter if there are a lot of results with >0.9 similarities – as long as those documents are more like the query document than those lower in the rankings.
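For intuition, cosine similarity is just the normalized dot product, bounded by -1.0 and 1.0 (a small illustration, not gensim's internal code):

```python
import numpy as np

def cos_sim(a, b):
    # Dot product of the vectors divided by the product of their norms
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos_sim([1, 0], [1, 0]))   # 1.0  (identical direction)
print(cos_sim([1, 0], [0, 1]))   # 0.0  (orthogonal)
print(cos_sim([1, 0], [-1, 0]))  # -1.0 (opposite direction)
```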
Looking at the individual documents suggested as most-similar is thus the real test. If they seem like nonsense, it's likely there are problems with your data or its preparation, or training parameters.
For datasets with sufficient, real natural-language text, it's typical for higher min_count values to give better results. Real text tends to have lots of low-frequency words that don't imply strong things without many more examples, so keeping them during training serves as noise that makes the model less strong.
Without knowing the contents of the documents, here are two hints that might help you.
Firstly, 100 epochs will probably be too few for the model to learn the differences.
Secondly, check the contents of the documents against the corpus you are using, and make sure the vocabulary is relevant to your files.
I have been working on creating and training a Deep Learning model for the first time. I did not have any knowledge about the subject prior to the project, and therefore my knowledge is limited even now.
I used to run the model on my own laptop, but after implementing a well-working OHE and SMOTE I simply couldn't run it on my own device anymore due to a MemoryError (8GB of RAM). Therefore I am currently running the model on a 30GB-RAM RDP, which allows me to do so much more, I thought.
My code seems to have some horrible inefficiencies, and I wonder if they can be solved. One example: by using pandas.concat, my model's RAM usage skyrockets from 3GB to 11GB, which seems very extreme. Afterwards I drop a few columns, making the RAM spike to 19GB, though it does return to 11GB after the computation is completed (unlike the concat). I also forced myself to stop using SMOTE for now, just because the RAM usage would go up way too much.
At the end of the code, where the training happens, the model breathes its final breath while trying to fit. What can I do to optimize this?
I have thought about splitting the code into multiple parts (for example preprocessing and training), but to do so I would need to store massive datasets in a pickle, which can only reach 4GB (correct me if I'm wrong). I have also given thought to using pre-trained models, but I truly did not understand how this process works and how to use one in Python.
P.S.: I would also like my SMOTE back if possible
Thank you all in advance!
Let's analyze the steps:
Step 1: OHE
For your OHE, the only dependence there is between data points is that it needs to be clear what categories are there overall. So the OHE can be broken into two steps, both of which do not require that all data points are in RAM.
Step 1.1: determine categories
Stream read your data points, collecting all the categories. It is not necessary to save the data points you read.
Step 1.2: transform data
After step 1.1, each data point can be independently converted. So stream read, convert, stream write. You only need one or very few data points in memory at all times.
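Steps 1.1 and 1.2 can be sketched roughly like this (pure Python, assuming a CSV input; all names here are mine):

```python
import csv

def collect_categories(path, cat_cols):
    # Pass 1: stream through the file, keeping only the category values seen
    cats = {c: set() for c in cat_cols}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            for c in cat_cols:
                cats[c].add(row[c])
    return {c: sorted(vals) for c, vals in cats.items()}

def one_hot_rows(path, cats):
    # Pass 2: stream again, converting one row at a time (constant memory)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            encoded = []
            for c, values in cats.items():
                encoded.extend(1 if row[c] == v else 0 for v in values)
            yield encoded
```

In a real pipeline the yielded rows would be stream-written straight back to disk rather than accumulated in a list.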
Step 1.3: feature selection
It may be worthwhile to look at feature selection to reduce the memory footprint and improve performance. This answer argues it should happen before SMOTE.
Feature selection methods based on entropy depend on all the data. While you can probably also throw together something that streams, one approach that worked well for me in the past is removing features that only one or two data points have, since these features definitely have low entropy and probably don't help the classifier much. This can be done again like Steps 1.1 and 1.2.
Step 2: SMOTE
I don't know SMOTE enough to give an answer, but maybe the problem has already solved itself if you do feature selection. In any case, save the resulting data to disk so you do not need to recompute for every training.
Step 3: training
See if the training can be done in batches or streaming (online, basically), or simply with less sampled data.
With regards to saving to disk: Use a format that can be easily streamed, like csv or some other splittable format. Don't use pickle for that.
Slightly orthogonal to your actual question, if your high RAM usage is caused by having entire dataset in memory for the training, you could eliminate such memory footprint by reading and storing only one batch at a time: read a batch, train on this batch, read next batch and so on.
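As a rough sketch of that idea, NumPy's memory-mapped loading lets you read one batch at a time instead of holding the whole array in RAM (file names, batch size, and the train_on_batch-style API mentioned in the comment are illustrative):

```python
import numpy as np

def batch_stream(x_path, y_path, batch_size):
    # mmap_mode="r" maps the files; only the slices you index are read into RAM
    X = np.load(x_path, mmap_mode="r")
    y = np.load(y_path, mmap_mode="r")
    for start in range(0, len(X), batch_size):
        yield (np.asarray(X[start:start + batch_size]),
               np.asarray(y[start:start + batch_size]))

# Training loop sketch: feed each batch to e.g. model.train_on_batch(xb, yb)
# for xb, yb in batch_stream("x.npy", "y.npy", 256):
#     ...
```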
I extracted 145,185,965 sentences (14GB) from the English Wikipedia dump, and I want to train a Doc2Vec model based on these sentences. Unfortunately I have 'only' 32GB of RAM and get a MemoryError when trying to train. Even if I set the min_count to 50, gensim tells me that it would need over 150GB of RAM. I don't think that further increasing the min_count would be a good idea, because the resulting model would not be very good (just a guess). But anyway, I will try it with 500 to see if memory is sufficient then.
Are there any possibilities to train such a large model with limited RAM?
Here is my current code:
from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument

corpus = TaggedLineDocument(preprocessed_text_file)
model = Doc2Vec(vector_size=300,
                window=15,
                min_count=50,
                workers=16,
                dm=0,
                alpha=0.75,
                min_alpha=0.001,
                sample=0.00001,
                negative=5)
model.build_vocab(corpus)
model.train(corpus,
            epochs=400,
            total_examples=model.corpus_count,
            start_alpha=0.025,
            end_alpha=0.0001)
Are there maybe some obvious mistakes I am doing? Using it completely wrong?
I could also try reducing the vector size, but I think this will result in much worse results as most papers use 300D vectors.
The required model size in addressable memory is largely a function of the number of weights required, which is determined by the number of unique words and unique doc-tags.
With 145,000,000 unique doc-tags, no matter how many words you limit yourself to, just the raw doc-vectors in-training alone will require:
145,000,000 * 300 dimensions * 4 bytes/dimension = 174GB
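That arithmetic checks out:

```python
n_doc_tags = 145_000_000
dims = 300
bytes_per_dim = 4  # float32
total_gb = n_doc_tags * dims * bytes_per_dim / 1e9
print(round(total_gb))  # 174
```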
You could try a smaller data set. You could reduce the vector size. You could get more memory.
I would try one or more of those first, just to verify you're able to get things working and some initial results.
There is one trick, best considered experimental, that may work to allow training larger sets of doc-vectors, at some cost of extra complexity and lower performance: the docvecs_mapfile parameter of Doc2Vec.
Normally, you don't want a Word2Vec/Doc2Vec-style training session to use any virtual memory, because any recourse to slower disk IO makes training extremely slow. However, for a large doc-set, which is only ever iterated over in one order, the performance hit may be survivable if the doc-vectors array is backed by a memory-mapped file. Essentially, each training pass sweeps through the file from front-to-back, reading each section in once and paging it out once.
If you supply a docvecs_mapfile argument, Doc2Vec will allocate the doc-vectors array to be backed by that on-disk file. So you'll have a hundreds-of-GB file on disk (ideally SSD) whose ranges are paged in/out of RAM as necessary.
If you try this, be sure to experiment with this option on small runs first, to familiarize yourself with its operation, especially around saving/loading models.
Note also that if you then ever do a default most_similar() on doc-vectors, another 174GB array of unit-normalized vectors must be created from the raw array. (You can force that to be done in-place, clobbering the existing raw values, by explicitly calling the init_sims(replace=True) call before any other method requiring the unit-normed vectors is called.)
The TensorFlow tutorial for using CNN for the cifar10 data set has the following advice:
EXERCISE: When experimenting, it is sometimes annoying that the first training step can take so long. Try decreasing the number of images that initially fill up the queue. Search for NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN in cifar10.py.
In order to play around with it, I tried decreasing this number by a lot but it doesn't seem to change the training time. Is there anything I can do? I tried even changing it to something as low as 5 and the training session still continued very slowly.
Any help would be appreciated!
Note that this exercise only speeds up the first step time, by skipping the prefetching of a large fraction of the data. It does not speed up the overall training.
That said, the tutorial text needs to be updated. It should read
Search for min_fraction_of_examples_in_queue in cifar10_input.py.
If you lower this number, the first step should be much quicker because the model will not attempt to prefetch the input.