I wish to use TensorFlow to perform classification on medical datasets.
To do this, I am using exactly the same code as that proposed in https://www.tensorflow.org/get_started/estimator; the only difference is that I work with medical datasets rather than the Iris database. So, it is not custom code.
I need help with the following problem: if I run the same code on the same data with the same network configuration (number of layers, number of neurons in each layer, and so on), the results are different. Running the same code ten times gives ten different accuracy values, and they differ widely, ranging from 73% up to 83%.
This means that the subjects classified as suffering from a given disease vary from one run to another. In other words, for a fixed network structure, several subjects are classified as either healthy or sick depending only on the run.
As you can imagine, this lack of repeatability makes the code useless from both scientific and medical viewpoints: another user, running the same configuration over the same data set, would obtain a different model and different results, and so would treat different subjects.
I have noticed that the problem does not seem to occur for the Iris database, where accuracy is always 0.9666. I assume this is because that problem is very easy (linearly separable for all but one item, with a very small data set).
I have searched the internet and found several other people who have noted the same problem. I have also read several proposed solutions and implemented them all, with no result.
Here I add a short list of some suggested remedies that failed in my case:
import os
import random as rn  # assumption: rn is the standard library random module
import numpy as np
import tensorflow as tf

os.environ['PYTHONHASHSEED'] = '0'
np.random.seed(0)
tf.set_random_seed(0)
rn.seed(0)
tf.reset_default_graph()
session_conf = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1
)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
Is there any way to fix this problem? It is a pity that such an excellent tool as TensorFlow cannot guarantee repeatability.
I am using the following:
custom code: no, it is the one in https://www.tensorflow.org/get_started/estimator
system: Apple
OS: Mac OsX 10.13
TensorFlow version: 1.3.0
Python version: 3.6.3
GPU model: AMD FirePro D700 (actually, two such GPUs)
Thank you very much!
Best regards
Ivanoe De Falco
System information
OpenVINO => 2022.1
Operating System / Platform => Intel(R) Core(TM) i5-9400F CPU @ 2.90GHz / Windows 10 64 Bit
I trained the following YoloV5 model:
Model Size: Large
Labels: ['mango', 'apple', 'milk', 'orange', 'grapes'].
batch-size: 4
Img-Size: 512
When I run inference with the trained YoloV5 model, the detections are decent and it is able to detect all 5 labels. The detection confidence is also good, averaging around 90%.
I then optimized the model using OpenVino:
Quantization: FP16, FP32
But the converted model only detects mango, apple, and grapes and completely ignores the remaining labels.
Things I have tried:
Retraining the Yolov5 model with different batch-size.
Tried different quantization while converting to OpenVino.
Tried different (previous) versions of OpenVino like 2020.4.
I have previously faced similar issues while training other models but could never figure out the solution or even the cause of the same. Has anyone else faced similar issues?
It would be ideal if someone can guide me in a direction to help solve it. Other answers that also explain potential causes of the issue are also welcome!
Converting the model to a lower precision has its pros and cons.
Inference is faster, but the trade-off is accuracy.
If your use case involves something like clinical results that must be accurate, a lower precision is not recommended, because you would have to accept reduced accuracy. If instead your use case needs speed more than precision, a lower precision (FP16/INT8) is suitable.
Choose the precision carefully, depending on your use case and your hardware.
This might help you to further understand.
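To make the precision trade-off concrete, here is a minimal sketch using only the Python standard library. It rounds a few values to IEEE 754 half precision (FP16) and back, and prints the rounding error. The sample values are made up for illustration, and this only shows numeric rounding; it is not a claim about why specific labels disappear after conversion.

```python
import struct

def to_fp16_and_back(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value, returned as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Hypothetical values standing in for weights or scores of different magnitudes.
samples = [0.5, 0.1, 0.001, 1234.567]

for s in samples:
    r = to_fp16_and_back(s)
    print(f"{s!r:>10} -> {r!r:<24} abs error = {abs(s - r):.3e}")
```

Values that are exact binary fractions (such as 0.5) survive unchanged, while most others pick up a small error that grows with magnitude.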
Hi everybody, I have a basic question that isn't related to any particular piece of code.
I'm following some papers about using CNNs to assess image quality and, in all of them, the networks being used are the ones offered by keras-applications (VGG16/19, ResNet, ...) with some variations depending on the paper.
Often, in these works, the variations are not tied to a specific base network but apply to the whole paper, meaning the only differences between networks should be the input image size and the image pre-processing function, the latter also provided by keras-applications in the module of each base network.
My question is whether what I'm doing is sufficient. When I try to replicate the results of a paper, some networks underperform compared to the published results (not badly; the network is still training in the right direction), while others don't. In particular, I had problems with ResNet50 and VGG16 but not with MobileNet v1/v2 and InceptionV3: whatever learning rate (or even dropout) I use, the validation loss at the end of training is almost 10% worse than the published results when using ResNet or VGG.
I'm sure the code is correct; as I said, the only difference when changing the base network is which image pre-processing function is selected. At this point I have two possible ideas:
I'm using a different setup from the papers (TensorFlow 1.15) and keras-applications has some kind of bug. In fact, the last published version of that module has a bug in the pre-processing: the code threw an exception when I first tried it. Digging through their Git page, I found that the problem was indeed a bug that had already been fixed in the latest commit (not yet included in a proper release). Sadly, that doesn't mean the rest of its behaviour is bug-free, and I can't verify that myself.
Some networks, such as VGG16 and ResNet50, have more requirements to work properly and I'm missing them. What can you tell me from your experience? Have you found yourself in a situation like mine? Note that I'm not talking about parameters such as dropout or learning rate, because those are given in the papers; I'm wondering whether, for example, the function (and its interpolation) used to load the images could matter in any way.
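One thing that may be worth checking is what each pre-processing function expects as input. The sketch below is an illustrative re-implementation, not the library code: it mirrors the two conventions that, to my knowledge, keras-applications uses ("caffe" style for VGG16/VGG19/ResNet50, "tf" style for MobileNet and InceptionV3). The function names are made up for this example. The caffe-style variant assumes raw 0-255 RGB values and applies no scaling, so input that was already rescaled elsewhere in the pipeline would behave very differently between the two groups.

```python
# Illustrative only: hypothetical helper names, not the keras-applications API.
# ImageNet per-channel means in BGR order, as used by the caffe-style convention.
IMAGENET_BGR_MEAN = (103.939, 116.779, 123.68)

def preprocess_caffe_style(rgb):
    """RGB -> BGR, then subtract the ImageNet channel means; no rescaling."""
    r, g, b = rgb
    bgr = (b, g, r)
    return tuple(c - m for c, m in zip(bgr, IMAGENET_BGR_MEAN))

def preprocess_tf_style(rgb):
    """Scale each channel from [0, 255] to [-1, 1]."""
    return tuple(c / 127.5 - 1.0 for c in rgb)

pixel = (255.0, 0.0, 127.5)           # one raw RGB pixel in the 0-255 range
print(preprocess_caffe_style(pixel))  # channel order is now B, G, R
print(preprocess_tf_style(pixel))
```

Comparing the value range of a batch right before it enters the network against what the chosen convention expects is a cheap sanity check.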
I am trying to apply the word2vec model implemented in gensim 3.6 with Python 3.7 on a Windows 10 machine. After preprocessing, I have a list of sentences (each sentence is a list of words) as input to the model.
I computed the results (the 10 most similar words to a given input word, using model.wv.most_similar) in Anaconda's Spyder and then in the Sublime Text editor.
But I am getting different results for the same source code executed in the two editors.
Which result should I choose, and why?
I am attaching screenshots of the results obtained by running the same code in both Spyder and Sublime Text. The input word for which I need the 10 most similar words is #universe#.
I am really confused about how to choose between the results, and on what basis. Also, I have only recently started learning Word2Vec.
Any suggestion is appreciated.
Results Obtained in Spyder:
Results Obtained using Sublime Text:
The Word2Vec algorithm makes use of randomization internally. Further, when (as is usual for efficiency) training is spread over multiple threads, some additional order-of-presentation randomization is introduced. These mean that two runs, even in the exact same environment, can have different results.
If the training is effective – sufficient data, appropriate parameters, enough training passes – all such models should be of similar quality when doing things like word-similarity, even though the actual words will be in different places. There'll be some jitter in the relative rankings of words, but the results should be broadly similar.
That your results are vaguely related to 'universe' but not impressively so, and that they vary so much from one run to another, suggest there may be problems with your data, parameters, or quantity of training. (We'd expect the results to vary a little, but not that much.)
How much data do you have? (Word2Vec benefits from lots of varied word-usage examples.)
Are you retaining rare words, by making min_count lower than its default of 5? (Such words tend not to get good vectors, and also wind up interfering with the improvement of nearby words' vectors.)
Are you trying to make very large vectors? (Smaller datasets and smaller vocabularies can only support smaller vectors. Too-large vectors allow 'overfitting', where idiosyncrasies of the data are memorized rather than generalized patterns learned. Or, they allow the model to continue improving in many different non-competitive directions, so model end-task/similarity results can be very different from run to run, even though each model is doing about as well as the other on its internal word-prediction tasks.)
Have you stuck with the default epochs=5 even with a small dataset? (A large, varied dataset requires fewer training passes - because all words appear many times, all throughout the dataset, anyway. If you're trying to squeeze results from thinner data, more epochs may help a little – but not as much as more varied data would.)
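To put a number on how much the neighbour lists change between runs, one simple option is the overlap between the two top-N sets. The sketch below is pure Python; the word lists are hypothetical stand-ins for the words returned by model.wv.most_similar in two separate runs (that call returns (word, score) pairs, so you would keep only the words).

```python
def topk_overlap(run_a, run_b):
    """Jaccard overlap between two lists of neighbour words (order is ignored)."""
    a, b = set(run_a), set(run_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical neighbour lists for the same query word in two runs.
run_1 = ["cosmos", "galaxy", "space", "planet", "star"]
run_2 = ["galaxy", "cosmos", "world", "star", "earth"]

print(f"overlap = {topk_overlap(run_1, run_2):.3f}")
```

An overlap close to 1 across repeated runs would suggest only mild jitter; a low overlap is the kind of instability that points back at the data, parameters, or amount of training.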
While tuning the hyperparameters to get my model to perform better, I noticed that the score I get (and hence the model that is created) is different every time I run the code despite fixing all the seeds for random operations. This problem does not happen if I run on CPU.
I googled and found out that this is a common issue when using a GPU to train. Here is a very good/detailed example with short code snippets to verify the existence of that problem.
They pinpointed the non-determinism to the "tf.reduce_sum" function. However, that is not the case for me. It could be because I'm using different hardware (1080 Ti) or a different version of the CUDA libraries or TensorFlow. It seems that many different parts of the CUDA libraries are non-deterministic, and it doesn't seem easy to figure out exactly which part is responsible or how to get rid of it. Also, this must have been by design, so it's likely that there is a sufficient efficiency gain in exchange for non-determinism.
So, my question is:
Since GPUs are popular for training NNs, people in this field must have a way to deal with non-determinism, because I can't see how else you'd be able to reliably tune the hyperparameters. What is the standard way to handle non-determinism when using a GPU?
TL;DR
Non-determinism in a priori deterministic operations comes from concurrent (multi-threaded) implementations.
Despite constant progress on that front, TensorFlow does not currently guarantee determinism for all of its operations. After a quick search on the internet, it seems that the situation is similar to the other major toolkits.
During training, unless you are debugging an issue, it is OK to have fluctuations between runs. Uncertainty is in the nature of training, and it is wise to measure it and take it into account when comparing results – even when toolkits eventually reach perfect determinism in training.
That, but much longer
If you view neural network operations as mathematical operations, you would expect everything to be deterministic. Convolutions, activations, cross-entropy: all of these are mathematical equations and should be deterministic. Even pseudo-random operations such as shuffling, dropout, noise and the like are entirely determined by a seed.
If you view those operations through their computational implementation, on the other hand, they are massively parallelized computations, which can be a source of randomness unless you are very careful.
The heart of the problem is that, when you run operations on several parallel threads, you typically do not know which thread will finish first. This does not matter when threads operate on their own data, so, for example, applying an activation function to a tensor should be deterministic. But when those threads need to synchronize, such as when you compute a sum, the result may depend on the order of the summation (floating-point addition is not associative) and, in turn, on the order in which the threads finish.
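The order dependence of a floating-point sum can be reproduced without any GPU. The sketch below is plain Python: it first shows that regrouping three numbers changes the result, then sums the same made-up data in several shuffled orders with a naive accumulation loop and with math.fsum, which returns the correctly rounded sum and is therefore independent of order. The shuffling is only a stand-in for threads finishing in a different order.

```python
import math
import random

values = [0.1, 0.2, 0.3]
left_first = (values[0] + values[1]) + values[2]
right_first = values[0] + (values[1] + values[2])
print(left_first == right_first)  # False
print(left_first, right_first)

def naive_sum(xs):
    """Plain left-to-right accumulation, like a simple reduction."""
    total = 0.0
    for x in xs:
        total += x
    return total

# Made-up data spanning many orders of magnitude.
rng = random.Random(0)
data = [rng.uniform(-1.0, 1.0) * 10.0 ** rng.randint(-8, 8) for _ in range(10_000)]

naive_totals, exact_totals = set(), set()
for _ in range(20):
    rng.shuffle(data)                  # a different "arrival order" each time
    naive_totals.add(naive_sum(data))
    exact_totals.add(math.fsum(data))  # correctly rounded, order-independent

print(len(naive_totals), "distinct naive totals;", len(exact_totals), "distinct fsum total")
```

The careful variant is deterministic but does more work per element, which is the trade-off between the two options in miniature.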
From there, you have broadly speaking two options:
Keep non-determinism associated with simpler implementations.
Take extra care in the design of your parallel algorithm to reduce or remove non-determinism in your computation. The added constraint usually results in slower algorithms.
Which route does cuDNN take? Well, mostly the deterministic one. In recent releases, deterministic operations are the norm rather than the exception. But it used to offer many non-deterministic operations and, more importantly, it used to lack some operations, such as reduction, that people had to implement themselves in CUDA with varying degrees of attention to determinism.
Some libraries, such as Theano, were ahead on this topic, exposing early on a deterministic flag that the user could turn on or off. But as you can see from its description, it is far from offering any guarantee.
"If more, sometimes we will select some implementations that are more deterministic, but slower. In particular, on the GPU, we will avoid using AtomicAdd. Sometimes we will still use non-deterministic implementation, e.g. when we do not have a GPU implementation that is deterministic. Also, see the dnn.conv.algo* flags to cover more cases."
In TensorFlow, the realization of the need for determinism has been rather late, but it's slowly getting there – helped by the advance of CuDNN on that front also. For a long time, reductions have been non-deterministic, but now they seem to be deterministic. The fact that CuDNN introduced deterministic reductions in version 6.0 may have helped of course.
It seems that currently, the main obstacle for TensorFlow towards determinism is the backward pass of the convolution. It is indeed one of the few operations for which cuDNN offers a non-deterministic algorithm, labeled CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0. This algorithm is still in the list of possible choices for the backward filter in TensorFlow. And since the choice of algorithm seems to be based on performance, it could indeed be picked if it is more efficient. (I am not so familiar with TensorFlow's C++ code, so take this with a grain of salt.)
Is this important?
If you are debugging an issue, determinism is not merely important: it is mandatory. You need to reproduce the steps that led to a problem. This is currently a real issue with toolkits like TensorFlow. To mitigate this problem, your only option is to debug live, adding checks and breakpoints at the correct locations – not great.
Deployment is another aspect of things, where it is often desirable to have a deterministic behavior, in part for human acceptance. While nobody would reasonably expect a medical diagnosis algorithm to never fail, it would be awkward that a computer could give the same patient a different diagnosis depending on the run. (Although doctors themselves are not immune to this kind of variability.)
Those are legitimate motivations to fix non-determinism in neural networks.
For all other aspects, I would say that we need to accept, if not embrace, the non-deterministic nature of neural net training. For all practical purposes, training is stochastic. We use stochastic gradient descent, shuffle data, use random initialization and dropout and, more importantly, the training data is itself only a random sample. From that standpoint, the fact that computers can only generate pseudo-random numbers from a seed is an artifact. When you train, your loss is a value that comes with a confidence interval due to this stochastic nature. Comparing those values to optimize hyper-parameters while ignoring the confidence intervals does not make much sense. It is therefore futile, in my opinion, to spend too much effort fixing non-determinism in that case, and in many others.
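As a rough illustration of taking that variability into account, the sketch below summarizes several runs per configuration by their mean and spread, using only the standard library. The accuracy values are invented, and the interval used (mean plus or minus two standard errors) is a crude assumption; with so few runs, a t-based interval would be wider.

```python
import statistics

def summarize(scores):
    """Return (mean, sample standard deviation, rough interval half-width)."""
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)
    half_width = 2 * stdev / len(scores) ** 0.5  # ~2 standard errors, a rough choice
    return mean, stdev, half_width

def intervals_overlap(scores_a, scores_b):
    """True if the rough intervals of the two configurations overlap."""
    mean_a, _, hw_a = summarize(scores_a)
    mean_b, _, hw_b = summarize(scores_b)
    return abs(mean_a - mean_b) <= hw_a + hw_b

# Hypothetical validation accuracies from five runs (different seeds) per configuration.
config_a = [0.73, 0.78, 0.80, 0.76, 0.83]
config_b = [0.79, 0.81, 0.80, 0.82, 0.78]

for name, scores in (("A", config_a), ("B", config_b)):
    mean, stdev, hw = summarize(scores)
    print(f"config {name}: mean={mean:.3f} stdev={stdev:.3f} interval=+/-{hw:.3f}")
print("intervals overlap:", intervals_overlap(config_a, config_b))
```

If the intervals overlap, as they do for these made-up numbers, a single run of each configuration would not be enough to say which one is better.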
From TF 2.9 onward, if you want your TF models to run deterministically, add the following lines at the beginning of the program.
import tensorflow as tf
tf.keras.utils.set_random_seed(1)
tf.config.experimental.enable_op_determinism()
Important note: tf.keras.utils.set_random_seed(1) sets the random seed for Python, NumPy and TensorFlow. tf.config.experimental.enable_op_determinism() makes each TensorFlow operation deterministic.
To get a MNIST network (https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py) to train deterministically on my GPU (1050Ti):
Set PYTHONHASHSEED='SOMESEED'. I do it before starting the python kernel.
Set seeds for random generators (not sure all are needed for MNIST)
import random as python_random
import numpy as np
python_random.seed(42)
np.random.seed(42)
tf.set_random_seed(42)
Make TF select deterministic GPU algorithms
Either:
import tensorflow as tf
from tfdeterminism import patch
patch()
Or:
import os
os.environ['TF_CUDNN_DETERMINISTIC']='1'  # set before importing TensorFlow
import tensorflow as tf
Note that the resulting loss is repeatable with either method for selecting deterministic algorithms from TF, but the two methods result in different losses. Also, the solution above doesn't make a more complicated model I'm using repeatable.
Check out https://github.com/NVIDIA/framework-determinism for a more current answer.
A side note:
For cuDNN 8.0.1, non-deterministic algorithms exist for:
(from https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html)
cudnnConvolutionBackwardFilter when CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0 or CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3 is used
cudnnConvolutionBackwardData when CUDNN_CONVOLUTION_BWD_DATA_ALGO_0 is used
cudnnPoolingBackward when CUDNN_POOLING_MAX is used
cudnnSpatialTfSamplerBackward
cudnnCTCLoss and cudnnCTCLoss_v8 when CUDNN_CTC_LOSS_ALGO_NON_DETERMINISTIC is used
I'm working on training a neural network model using Python and the Keras library.
My model's test accuracy is very low (60.0%); I tried hard to raise it, but I couldn't. I'm using the DEAP dataset (32 participants in total) to train the model. The split I'm using is fixed: 28 participants for training, 2 for validation, and 2 for testing.
The model I'm using is as follows:
sequential model
Optimizer = Adam
With L2_regularizer, Gaussian noise, dropout, and Batch normalization
Number of hidden layers = 3
Activation = relu
Compile loss = categorical_crossentropy
initializer = he_normal
Now I'm using a train-test split (also fixed) and I get better results. However, I found that some of the participants affect the training accuracy negatively. Is there a way to study the effect of each participant's data on the accuracy (performance) of a model?
Best Regards,
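One possible way to look at per-participant effects is a leave-one-participant-out loop: hold out each participant in turn, train on the rest, and compare the scores. The sketch below shows only the splitting and ranking logic in plain Python. The scoring function is a hypothetical placeholder that returns fixed numbers; in practice it would train and evaluate the Keras model on the given split.

```python
def leave_one_participant_out(participant_ids):
    """Yield (held_out, training_ids) pairs, holding out each participant once."""
    unique_ids = sorted(set(participant_ids))
    for held_out in unique_ids:
        yield held_out, [p for p in unique_ids if p != held_out]

def rank_participants(score_for_split, participant_ids):
    """Score every split with score_for_split(train_ids, test_id); lowest score first."""
    results = {
        held_out: score_for_split(train_ids, held_out)
        for held_out, train_ids in leave_one_participant_out(participant_ids)
    }
    return sorted(results.items(), key=lambda item: item[1])

def placeholder_score(train_ids, test_id):
    """Hypothetical stand-in for 'train on train_ids, return accuracy on test_id'."""
    return 0.5 if test_id in (7, 19) else 0.8

participants = list(range(1, 33))  # 32 participants, as in the DEAP setup described
ranking = rank_participants(placeholder_score, participants)
print(ranking[:3])  # participants the model handles worst come first
```

Participants that score consistently low when held out, across several seeds, are candidates for a closer look at their data.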
This comes from my tutorial "Starting deep learning hands-on: image classification on CIFAR-10", in which I insist on keeping track of both:
global metrics (log-loss, accuracy),
examples (correctly and incorrectly classified cases).
The latter may help us tell which kinds of patterns are problematic, and on numerous occasions it has helped me change the network (or supplement the training data, where that was the issue).
An example of how it works (here with Neptune, though you can do it manually in a Jupyter Notebook, or using the TensorBoard image channel):
And then looking at particular examples, along with the predicted probabilities:
Full disclaimer: I collaborate with deepsense.ai, the creators of Neptune - Machine Learning Lab.
This is, perhaps, more broad an answer than you may like, but I hope it'll be useful nevertheless.
Neural networks are great. I like them. But the vast majority of top-performance, hyper-tuned models are ensembles: they use a combination of stats-on-crack techniques, neural networks among them. One of the main reasons is that some techniques handle some situations better than others. In your case, you've run into a situation for which I'd recommend exploring alternative techniques.
In the case of outliers, rigorous value analyses are the first line of defense. You might also consider using principal component analysis or linear discriminant analysis. You could also try to chase them out with density estimation or nearest neighbors. There are many other techniques for handling outliers, and hopefully you'll find the tools I've pointed to easy to implement (with help from their docs); sklearn tends to readily accept data prepared for Keras.