I am fairly new to the internals of TensorFlow. While trying to understand TensorFlow's implementation of AdamOptimizer, I checked the corresponding subgraph in TensorBoard. There appears to be a duplicate subgraph named name + '_1', where name='Adam' by default.
The following MWE produces the graph below. (Note that I have expanded the x node!)
import tensorflow as tf
tf.reset_default_graph()
x = tf.Variable(1.0, name='x')
train_step = tf.train.AdamOptimizer(1e-1, name='MyAdam').minimize(x)
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    with tf.summary.FileWriter('./logs/mwe') as writer:
        writer.add_graph(sess.graph)
I am confused because I would expect the above code to produce just a single namespace inside the graph. Even after examining the relevant source files (namely adam.py, optimizer.py and training_ops.cc), it's not clear to me how/why/where the duplicate is created.
Question: What is the source of the duplicate AdamOptimizer subgraph?
I can think of the following possibilities:
A bug in my code
Some sort of artifact generated in TensorBoard
This is expected behavior (if so, then why?)
A bug in TensorFlow
Edit: Cleanup and clarification
Due to some initial confusion, I cluttered my original question with detailed instructions for how to set up a reproducible environment with TensorFlow/TensorBoard which reproduces this graph. I have now replaced all that with the clarification about expanding the x node.
This is not a bug, just a perhaps questionable way the optimizer has of leaking outside of its own scope.
First, not a bug: The Adam optimizer is not duplicated. As can be seen in your graph, there is a single /MyAdam scope, not two. No problem here.
However, there are two subscopes, MyAdam and MyAdam_1, added under your variable's scope. They correspond respectively to the m and v variables (and their initialization operations) that the Adam optimizer keeps for this variable.
This is where the choices made by the optimizer are debatable. You could indeed reasonably expect the Adam optimizer's operations and variables to be strictly defined within its assigned scope. Instead, it chooses to creep into the optimized variables' scopes to place its statistics variables.
So, debatable choice to say the least, but not a bug, in the sense that the Adam optimizer is indeed not duplicated.
EDIT
Note that this way of locating variables is common across optimizers -- you can observe the same effect with a MomentumOptimizer for example. Indeed, this is the standard way of creating slots for optimizers -- see here:
# Scope the slot name in the namespace of the primary variable.
# Set "primary.op.name + '/' + name" as default name, so the scope name of
# optimizer can be shared when reuse is True. Meanwhile when reuse is False
# and the same name has been previously used, the scope name will add '_N'
# as suffix for unique identifications.
So as I understand it, they chose to locate the statistics of a variable within a subscope of the variable's own scope, so that if the variable is shared/reused, its statistics are also shared/reused and do not need to be recomputed. This is indeed a reasonable thing to do, even if, again, creeping outside of one's scope is somewhat unsettling.
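To see this concretely, here is a minimal sketch (reusing the MWE above, TF 1.x API) that lists the variables minimize() creates; the exact names of Adam's bookkeeping variables vary a bit across TF versions:

import tensorflow as tf

tf.reset_default_graph()
x = tf.Variable(1.0, name='x')
train_step = tf.train.AdamOptimizer(1e-1, name='MyAdam').minimize(x)

# The slot variables live under the *variable's* scope, not under MyAdam/
for v in tf.global_variables():
    print(v.op.name)
# Prints (roughly):
#   x
#   x/MyAdam      <- the 'm' slot
#   x/MyAdam_1    <- the 'v' slot
#   plus Adam's beta1_power/beta2_power bookkeeping variables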
Related
The docs state that:
It is important to realize that scopes are determined textually: the global scope of a function defined in a module is that module’s namespace, no matter from where or by what alias the function is called. On the other hand, the actual search for names is done dynamically, at run time
I understand the first part: scopes are determined textually. But what does it mean that the actual search for names is done dynamically at run time? As opposed to what?
Let's try to compare this to what happens in C for instance, as I understand that this is the opposite of what happens in Python.
In C, consider the following code:
int a = 5;
printf("The value of a is: %d\n", a);
So in C, the actual search for names is done at compile time, meaning the compiled machine code for the printf call will contain a reference to the memory address of a, whereas in Python
a = 5
print(a)
the compiled bytecode for print(a) will contain instructions to look up, at run time, what the name a points to in the namespace dictionary and then access it.
Is that correct?
It means that a name can suddenly start resolving to something else, because it was redefined during the execution of the program. The alternative would be to resolve names when the program is read and parsed, and to stick to this interpretation. (That would be somewhat faster and would allow considerable additional optimization, e.g. by "knowing" things about the default behavior of Python built-in functions; but it is not how the language was designed.)
Here's an example that suddenly changes behavior:
for n in range(3):
    print(max([10, 20, 30]))
    max = min
This loop will print what you expect (30) on the first iteration, but from then on the identifier max no longer resolves to the builtin: it refers to your own variable max, which is now bound to the builtin min(), so the remaining iterations print 10. Silly, but realistic use cases are a different question...
As opposed to being done statically at compile-time.
For instance in a language like C or Rust, by default symbols are looked up at compile-time, and at runtime the code just goes to whatever was resolved during compilation.
In Python however, every time you call a function the interpreter will look for that name in the relevant scope(s), then will use whatever's bound to that name at that point in time. Semantically if not necessarily technically.
So if e.g. you swap the object assigned to that name, the code will call the replacement instead of the original. Even if the replacement is not callable at all.
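A minimal sketch of that late binding (my own example):

def greet():
    return "hello"

def caller():
    # 'greet' is looked up every time caller() runs,
    # not once when caller() is defined
    return greet()

print(caller())   # hello

def greet():      # rebind the name to a different function
    return "goodbye"

print(caller())   # goodbye -- same call site, new resolution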
In TensorFlow, after I use cell.zero_state() to initialize the cell state and hidden state, I still have to initialize the global variables or the RNN cell won't run.
However, I wonder how it initializes the variables (over what range of values?) and which variables it actually initializes (biases? weights? activation functions?).
I think the parameters that need initializing are none other than the weights, biases, and activation functions in each neuron.
What does the global_variables_initializer actually do?
Thanks a lot!
Whenever you create a variable in TensorFlow, the framework takes care of adding this variable to a collection of created variables. Think of a list with pointers to variables.
The default collection of such variables is called GraphKeys.GLOBAL_VARIABLES.
The function tf.global_variables_initializer simply retrieves all these variables from the collection and initializes them.
zero_state does not directly create a variable. It simply returns an all-zero tensor whose shape matches the cell's state.
The range of the initial variable values depends on each variable's initializer.
To sum up: each weight, bias, and hidden-state variable is collected in a special list of created variables, and TensorFlow just initializes each of them, similar to this pseudo-code:
foreach v in GraphKeys.GLOBAL_VARIABLES:
    assign v.value = v.call_initializer()
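For instance, a small TF 1.x snippet (the variable names here are just for illustration) showing the collection and the initializer at work:

import tensorflow as tf

w = tf.Variable(tf.truncated_normal([2, 3]), name='w')
b = tf.Variable(tf.zeros([3]), name='b')

# Both variables were added to the GLOBAL_VARIABLES collection automatically
print(tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES))

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)      # runs every variable's initializer op
    print(sess.run(b))  # [0. 0. 0.]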
I would like to have an example illustrating the use of the function tf.control_dependencies. For example, I want to create two tensors X and Y and if they are equal do or print something.
import tensorflow as tf
session = tf.Session()
X = tf.constant(5)
Y = tf.constant(50)
with tf.control_dependencies([tf.assert_equal(X, Y)]):
    print('X and Y are equal!')
In the code above, X is clearly not equal to Y. What is tf.control_dependencies doing in this case?
control_dependencies is not a conditional. It is a mechanism to add dependencies to whatever ops you create in the with block. More specifically, what you specify in the argument to control_dependencies is ensured to be evaluated before anything you define in the with block.
In your example, you don't add any (TensorFlow) operations in the with block, so the block does nothing.
This answer has an example of how to use control_dependencies, where it is used to make sure the assignments happen before the batchnorm operations are evaluated.
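For completeness, here is a minimal runnable sketch (my own example, not the one from the linked answer) where the dependency actually matters:

import tensorflow as tf

x = tf.Variable(0.0)
assign_x = tf.assign(x, 5.0)

# Because y is *created* inside the block, evaluating y forces
# assign_x to run first
with tf.control_dependencies([assign_x]):
    y = tf.identity(x) + 1.0

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y))  # 6.0 -- the assignment ran before y was read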
This is a follow up question to this one.
I'm still working on the cifar10 example on the file cifar10.py and noticed some strange behavior regarding the creation of variables.
But first a side-question: why are the variables created with a weight decay factor of wd=0.0 and not wd=None? That way you would have fewer vertices in the computation graph.
Next, the strange behavior. I added the following function to make it more convenient to create variables:
def _create_variable(name, shape, initializer, wd=None):
    dtype = tf.float16 if FLAGS.use_fp16 else tf.float32
    with tf.device('/cpu:0'):
        var = tf.get_variable(name, shape, dtype, initializer)
    if wd is not None:
        wd_val = tf.mul(tf.nn.l2_loss(var), wd, name='weight_loss')
        tf.add_to_collection('losses', wd_val)
    return var
When using this function to create the variables (with the original parameters), the logits start out in a range of ±1e13 for the first batch, gradually getting better and reaching ±1.5. The loss, on the other hand, starts at around 400000 and grows until it hits NaN.
When using the original functions to create the variables, the logits are in a range of ±1 right from the beginning, and the loss starts at around 4.5, gradually getting smaller.
Can somebody explain to me what the difference is between my function and the provided ones for variable generation, and why the effect is so huge? I don't see it.
The full code of my modified cifar10.py can be found here. To test it out, simply replace the original file with my version. To then switch between the original function and mine, simply change line 212 to CUSTOM = False.
Thank you in advance.
Stupid me! I used my own function the wrong way and passed the values intended for stddev as the mean, keeping the default stddev of 1.
The curse of not addressing the arguments by their name.
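In other words, something like this (hypothetical numbers, just to illustrate the positional mix-up):

# Signature: tf.truncated_normal_initializer(mean=0.0, stddev=1.0, ...)
bad = tf.truncated_normal_initializer(5e-2)           # 5e-2 lands on mean; stddev stays 1.0
good = tf.truncated_normal_initializer(stddev=5e-2)   # what was intended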
Anyway, why does this cause such a huge loss; sometimes even NaN?
I am running a non-linear optimization problem in OpenMDAO, which I know the optimal solution of (I just want to verify the solution). I am using SLSQP driver configuration of ScipyOptimizer from openmdao.api.
I have 3 design variables A, B and C, their respective design-spaces (Amin to Amax for A and so on) and a single objective function Z. As I said, I know the optimal values for all the three design variables (let's call them Asol, Bsol and Csol) which yield the minimum value of Z (call it Zsol).
When I run this problem, I get a value for Z which is larger than Zsol, signifying that it is not an optimal solution. When I assign Csol to C and run the problem with only A and B as the design variables, I get a value of Z which is much closer to Zsol and which is actually less than what I got earlier (in the 3-design-variable scenario).
Why am I observing this behavior? Shouldn't ScipyOptimizer give the same solution in both the cases?
EDIT: Adding some code..
from openmdao.api import IndepVarComp, Group, Problem
from openmdao.api import ScipyOptimizer

class RootGroup(Group):
    def __init__(self):
        super(RootGroup, self).__init__()
        self.add('desvar_f', IndepVarComp('f', 0.08))
        self.add('desvar_twc', IndepVarComp('tool_wear_compensation', 0.06))
        self.add('desvar_V', IndepVarComp('V', 32.0))
        # Some more config (adding components, connections etc.)

class TurningProblem_singlepart(Problem):
    def __init__(self):
        super(TurningProblem_singlepart, self).__init__()
        self.root = RootGroup()
        self.driver = ScipyOptimizer()
        self.driver.options['optimizer'] = 'SLSQP'
        self.driver.add_desvar('desvar_f.f', lower=0.08, upper=0.28)
        self.driver.add_desvar('desvar_twc.tool_wear_compensation', lower=0.0, upper=0.5)
        self.driver.add_desvar('desvar_V.V', lower=32.0, upper=70.0)
        self.driver.add_objective('Inverse_inst.comp_output')
        # Other config
This code gives me an incorrect result. When I remove desvar_twc from both classes and assign it its optimal value (from the solution I have), I get a fairly correct result, i.e. an objective value that is less than in the previous scenario.
Without seeing your actual model, we can't say anything for sure. However, it is NOT the case that a local optimizer's solution is independent of the starting condition in general. That is only the case if the problem is convex. So I would guess that your problem is not convex, and you're running into local optima.
You can try to get around this by using the COBYLA optimizer instead of SLSQP, which in my experience can manage to jump over some local optima better. But if your problem is really bumpy, then I would suggest you switch to NSGA-II or ALPSO from the pyopt-sparse library. These are heuristic-based optimizers that do a good job of finding the "biggest hill", though they don't always climb all the way to the top of it (they don't converge as tightly). The heuristic algorithms are also generally more expensive than the gradient-based methods.
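For instance, switching your TurningProblem_singlepart to COBYLA should be a one-line change (a sketch against the OpenMDAO 1.x API used in your snippet):

self.driver = ScipyOptimizer()
self.driver.options['optimizer'] = 'COBYLA'   # instead of 'SLSQP'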