I'm currently struggling with a Capsule Network (the Keras implementation, CapsNet).
Each time I run more than 2-5 predictions in a row (inside a loop), the results vary a lot. I have tried changing so many things, including switching the optimizer from Adam to SGD, but I simply can't make it 100% stable and thereby reproduce a given run.
How can I make CapsNet 100% reproducible every run?
The answer to this is long and involved. There's a blog post that goes into much more detail than I can here, but I'll try to capture the high-level points.
Set the PYTHONHASHSEED environment variable to 0 before running your Python program.
If you're running calculations on the GPU, floating-point rounding can make results non-reproducible. You can disable the GPU and run all operations on the CPU by setting the CUDA_VISIBLE_DEVICES environment variable to an empty string, in the same manner as before.
CUDA_VISIBLE_DEVICES="" PYTHONHASHSEED=0 python your_program.py
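You can also set CUDA_VISIBLE_DEVICES from inside the script, as long as it happens before TensorFlow is imported. PYTHONHASHSEED, by contrast, only takes effect if it is set before the interpreter starts, which is why the command line above is the safe place for it. A minimal sketch, with the usual seed calls added on top:

import os

# Hide all GPUs; must run before TensorFlow is imported to take effect.
os.environ['CUDA_VISIBLE_DEVICES'] = ''

import random as rn
import numpy as np
import tensorflow as tf

# Seed the remaining sources of randomness as well.
rn.seed(0)
np.random.seed(0)
tf.set_random_seed(0)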
I'm trying to implement an AlphaZero-like board game AI for the game Gomoku (Monte Carlo Tree Search in combination with a CNN that evaluates board positions).
Right now, the MCTS is implemented as a separate component.
Additionally, I have a simple TCP server written in Python that receives positions from the MCTS (in batches of around 50 to 200), converts them to NumPy arrays, passes them to the TF/Keras model by invoking __call__, converts the results back, and sends them to the MCTS via TCP.
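The per-request handling looks roughly like this (a simplified sketch with a single output head; evaluate_batch is just an illustrative name):

import numpy as np

def evaluate_batch(model, positions):
    # positions: list of board encodings received over TCP
    x = np.asarray(positions, dtype=np.float32)   # stack into one [N, ...] batch
    preds = model(x, training=False)              # direct __call__ instead of model.predict
    return preds.numpy()                          # back to NumPy for sending over TCP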
During training, I generate training data (around 5000 boards) by having the AI play against itself, call model.fit once, create a new dataset using the new model weights and so on.
I play multiple matches in parallel, each using its own separate Python/TF server.
Each server loads its own copy of the model (I use tf.config.experimental.set_memory_growth(gpu, True)).
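For completeness, that setup is just the usual memory-growth snippet, run at server startup before any GPU op:

import tensorflow as tf

# Let each server process allocate GPU memory on demand instead of grabbing it all.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)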
A problem I encounter is that, while playing matches, inference gets slower and slower the longer the model has been loaded, until it becomes so painfully slow that I have to restart the training.
After a restart, inference times are back to normal.
This, by the way, also happens if I only play one game at a time with just one model loaded.
I tried to mitigate the problem by restarting the Python server (and therefore the model) after each match.
This seemingly solved it until I started experiencing the same issue after a couple of training iterations.
At first I thought the reason was my less-than-ideal setup (a gaming notebook running Windows), but the problem also occurred on a Linux server at my university.
On my Windows machine, as the model got slower and slower, it was also using less and less memory. This apparently did not occur on Linux.
I had a similar issue, and it was due to having the wrong NVIDIA drivers installed for my card and not having CUDA installed. Hope that helps.
https://developer.nvidia.com/cuda-downloads
I recently built my first TensorFlow model (converted from hand-coded Python). I'm using tensorflow-gpu, but I only want to use the GPU for backprop during training. For everything else I want to use the CPU. I've seen this article showing how to force CPU use on a system that will use the GPU by default. However, you have to specify every single operation where you want to force CPU use. Instead, I'd like to do the opposite: default to CPU use, but specify the GPU just for the backprop that I do during training. Is there a way to do that?
Update
Looks like things are just going to run slower in TensorFlow because of how my model and scenario are built at present. I tried using a different environment that just uses regular (non-GPU) TensorFlow, and it still runs significantly slower than the hand-coded Python. The reason for this, I suspect, is that it's a reinforcement learning model that plays checkers (see below) and makes one single forward prop "prediction" at a time as it plays against a computer opponent. At the time I designed the architecture, that made sense. But it's not very efficient to do predictions one at a time, and less so with whatever overhead TensorFlow adds.
So now I'm thinking that I'm going to need to change the game-playing architecture to play, say, a thousand games simultaneously and run a thousand forward prop moves in a batch, roughly as sketched below. But, man, changing the architecture now is going to be tricky at best.
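Roughly what I have in mind, as a self-contained toy sketch (the network here is a made-up stand-in for my checkers model):

import numpy as np
import tensorflow as tf

# Toy value network: a 32-float board encoding in, one value out.
board_input = tf.placeholder(tf.float32, [None, 32])
hidden = tf.layers.dense(board_input, 64, activation=tf.nn.relu)
value_output = tf.layers.dense(hidden, 1, activation=tf.nn.tanh)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # One forward prop for a whole batch of positions instead of one sess.run per move.
    states = np.random.rand(1000, 32).astype(np.float32)   # stand-in for 1000 encoded boards
    values = sess.run(value_output, feed_dict={board_input: states})
    print(values.shape)   # (1000, 1)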
TensorFlow lets you control device placement with the tf.device context manager.
So, for example, to run some code on the CPU, do:
with tf.device('cpu:0'):
    <your code goes here>
Similarly, use tf.device('gpu:0') to force GPU usage.
Instead of always running your forward pass on the CPU, though, you're better off making two graphs: a forward-only, CPU-only graph to use when rolling out the policy, and a GPU-only forward-and-backward graph to use when training.
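A minimal sketch of the idea in TF 1.x style. For simplicity it shares the weights through a variable scope inside a single graph rather than building two literally separate graphs, and the shapes and layer sizes are just placeholders for your actual network:

import tensorflow as tf

def policy_net(x, reuse):
    # Toy stand-in for the real policy network.
    with tf.variable_scope('policy', reuse=reuse):
        h = tf.layers.dense(x, 128, activation=tf.nn.relu)
        return tf.layers.dense(h, 4)

# Forward-only rollout path pinned to the CPU.
with tf.device('/cpu:0'):
    rollout_state = tf.placeholder(tf.float32, [None, 64])
    rollout_logits = policy_net(rollout_state, reuse=False)

# Forward-and-backward training path pinned to the GPU, sharing the same weights.
with tf.device('/gpu:0'):
    train_state = tf.placeholder(tf.float32, [None, 64])
    train_labels = tf.placeholder(tf.int32, [None])
    train_logits = policy_net(train_state, reuse=True)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=train_labels, logits=train_logits)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)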
I am a beginner in machine learning. Recently, I successfully got a machine learning application running using the TensorFlow Object Detection API.
My dataset is 200 images of the object at 300x300 resolution. However, the training has been running for two days and has yet to complete.
How long should it take to complete the training? At the moment it is at global step 9000; how many global steps are needed to complete the training?
P.S.: the training uses only CPUs.
It depends on your desired accuracy and dataset, of course, but I generally stop training when the loss value gets to around 4 or less. What is your current loss value after 9000 steps?
To me this sounds like your training is not converging.
See the discussion in the comments of this question.
Basically, it is recommended that you run eval.py in parallel and check how it performs there as well.
I wish to use TensorFlow to perform classification on medical datasets.
To do this, I am using exactly the same code as that proposed in https://www.tensorflow.org/get_started/estimator, the only difference being that I use medical datasets rather than the Iris dataset. So, it is not custom code.
I need help with the following problem: if I run the same code on the same data with the same parameters for the network configuration (number of layers, number of neurons in each layer, and so on), the results are different. I mean, I run the same code ten times and obtain ten different values for the accuracy. These values even differ widely, ranging from 73% up to 83%.
This means that the subjects considered to be suffering from a given disease vary from one run to another. Put differently, once the network structure is set, there are several subjects who are classified as either healthy or sick depending only on the run.
As you can imagine, this lack of repeatability makes the code useless from both scientific and medical viewpoints: another user, running the same configuration over the same dataset, would obtain a different model and different results, and so would treat different subjects.
I have noticed that the problem does not seem to occur with the Iris dataset, where the accuracy is always 0.9666. This is probably because that problem is very easy (linearly separable for all but one item, and a very small dataset).
I have searched the internet and found several other people who have noted the same problem. I have also read about several possible solutions and implemented them all, with no result.
Here I add a short list of some suggested remedies that failed in my case:
import os
import random as rn
import numpy as np
import tensorflow as tf

os.environ['PYTHONHASHSEED'] = '0'   # fix Python's hash seed
np.random.seed(0)                    # seed NumPy's RNG
tf.set_random_seed(0)                # TensorFlow graph-level seed
rn.seed(0)                           # seed Python's built-in RNG
tf.reset_default_graph()

# Single-threaded session to avoid nondeterministic op scheduling
session_conf = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1
)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
Is there any way to fix this problem? It is a pity that such an excellent tool as TensorFlow cannot guarantee repeatability.
I am using the following:
custom code: no, it is the one in https://www.tensorflow.org/get_started/estimator
system: Apple
OS: Mac OsX 10.13
TensorFlow version: 1.3.0
Python version: 3.6.3
GPU model: AMD FirePro D700 (actually, two such GPUs)
Thank you very much!
Best regards
Ivanoe De Falco
I'm trying to run the CIFAR10 tutorial with the training code on one GPU and the eval code on the other. I know for sure that I have two GPUs on my computer, and I can test this by running the simple examples here: https://www.tensorflow.org/how_tos/using_gpu/index.html
However, using with tf.device('/gpu:0') does not work for most variables in the CIFAR example. I tried a whole lot of combinations of putting different variables on the GPU vs. the CPU, or all the variables on one or the other. I always get the same error for some variable, something like this:
Cannot assign a device to node 'shuffle_batch/random_shuffle_queue': Could not satisfy explicit device specification '/gpu:0'
Is this possibly a bug in TensorFlow, or am I missing something?
"Could not satisfy explicit device specification" means you do not have the corresponding device. Do you actually have a CUDA-enabled GPU on your machine?
UPDATE: As it turned out in the discussion below, this error is also raised if the particular operation (in this case, RandomShuffleQueue) cannot be executed on the GPU, because it only has a CPU implementation.
If you are fine with TensorFlow choosing a device for you (particularly, falling back to CPU when no GPU implementation is available), consider setting allow_soft_placement in your configuration, as per this article:
sess = tf.Session(config=tf.ConfigProto(
    allow_soft_placement=True, log_device_placement=True))
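For example, a small sketch combining soft placement with an explicit device block, so that ops with a GPU kernel run on the GPU while the rest (such as RandomShuffleQueue) silently fall back to the CPU:

import tensorflow as tf

config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)

with tf.device('/gpu:0'):
    a = tf.random_normal([1000, 1000])
    b = tf.matmul(a, a)

with tf.Session(config=config) as sess:
    sess.run(b)   # the device placement log shows where each op actually ran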