I'd like to use an if operation in TensorFlow, and as far as I've learned, tf.cond() should be used here. But I'm confused about what to do if I only want the if branch and no else branch. For example:
import tensorflow as tf

a = tf.constant(10)
b = tf.constant(5)
for i in range(5):
    tmp = tf.greater_equal(a, b)
    result = tf.cond(tmp, lambda: tf.constant(i), None)
    if result is not None:
        return i
As above, I want to do nothing in the else branch, but tf.cond() requires both branches to return a value. I hope someone can provide some help.
To answer the actual question: if you understand TensorFlow's graph execution model, you will see that after the conditional, your execution flow must converge again. That means the operations you perform on the result of tf.cond must be technically possible regardless of which branch was taken; if they depend on the branch, they should be part of the tf.cond itself.
For the convergence to happen, you have to return tensors of the same shape and dtype from both branches. If you "don't want to do anything" in the else branch, you can just return some zero tensors, or a reshaped input, from there. What exactly makes sense depends on your exact use case.
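For instance, a minimal sketch of a "no-op" else branch (written against the TF 1.x graph API used in the question; the constants stand in for your real values):

import tensorflow as tf

a = tf.constant(10)
b = tf.constant(5)

# Both branches must return tensors of matching shape and dtype, so the
# "do nothing" branch returns a zero tensor instead of None.
result = tf.cond(tf.greater_equal(a, b),
                 lambda: tf.constant(1),
                 lambda: tf.constant(0))

with tf.Session() as sess:
    print(sess.run(result))  # 1, since 10 >= 5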
import torch

a = torch.tensor([1, 2], dtype=torch.float, requires_grad=True)  # requires_grad needs a floating dtype
b = a[0] * a[1]
b.backward()
print(a.grad)
But when I use a += 1 to give a another value and do another round of backpropagation, it says that a leaf Variable that requires grad is being used in an in-place operation.
However, a = a + 1 seems to work. What's the difference between a = a + 1 and a += 1?
(Solved)
I find that my doubt about torch actually lies in the following small example:
import torch

a = torch.tensor([1, 2], dtype=torch.float, requires_grad=True)
b = a[0] * a[1]
b.backward()
print(a.grad)           # tensor([2., 1.])
print(a.requires_grad)  # True
c = 2 * a[0] * a[1]
c.backward()
print(a.grad)           # tensor([6., 3.]) -- the gradients accumulate
When I do the first round of backpropagation, I get a.grad = [2, 1].
Then I want to make a fresh start and compute the derivative of c with respect to a; however, the gradients seem to accumulate. How can I eliminate this effect?
The += operator is for in-place operations, while a = a+1 is not in-place (that is, a refers to a different tensor after this operation).
But in your example you don't seem to be using either of these, so it is hard to say what you want to achieve.
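A minimal sketch of the distinction, and of the accumulation noted in the question (assuming recent PyTorch semantics):

import torch

a = torch.tensor([1., 2.], requires_grad=True)

# a += 1 would update a leaf that requires grad in place -> RuntimeError.
# a = a + 1 builds a new tensor; the name a then refers to a non-leaf result.
b = a + 1
print(b.is_leaf)   # False -- b was produced by an operation on a

# Gradients accumulate across backward() calls; resetting a.grad
# (e.g. a.grad = None or a.grad.zero_()) gives the fresh start asked about.
(a[0] * a[1]).backward()
print(a.grad)      # tensor([2., 1.])
a.grad = None
(2 * a[0] * a[1]).backward()
print(a.grad)      # tensor([4., 2.]) -- no leftover from the first pass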
Assume I have
result = mynet1(input)
calculated_value1 = result
calculated_value2 = calculated_value1 * result
and I want backward on calculated_value2 to differentiate through result only where it appears directly, not through the occurrence inside calculated_value1.
Meaning, I want
calculated_value2.backward()
to treat calculated_value1 as a constant, even though it depends on result, so that the gradient depends on result only once, rather than on result^2.
Obviously, result.detach() won't do the trick, because then the entire expression would have no parameters to optimize.
How can I achieve the described behavior?
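To make the setup concrete, here is a runnable toy version (with a hypothetical one-parameter stand-in for mynet1); perhaps detaching only the copy, rather than result itself, is the kind of thing I'm after:

import torch

w = torch.tensor(2.0, requires_grad=True)  # stand-in for mynet1's parameter
x = torch.tensor(3.0)

result = w * x                              # "mynet1(input)"
calculated_value1 = result
calculated_value2 = calculated_value1 * result
calculated_value2.backward()
print(w.grad)   # tensor(36.) = 2*(w*x)*x -- result contributes twice

w.grad = None
result = w * x                              # recompute; the old graph was freed
calculated_value2 = result.detach() * result
calculated_value2.backward()
print(w.grad)   # tensor(18.) = (w*x)*x -- result contributes only once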
Clarification of intent:
I intend to implement section 13.5 of this book, in which delta depends on V but is not differentiated through, while V itself IS differentiated.
Also, if there is a more suitable name for this question, please tell me so that I could change it.
I am using scipy's minimize function, and I'd like one of the parameters to search only over values with two decimals.
def cost(parameters, input, target):
    from sklearn.metrics import mean_squared_error
    output = self.model(parameters=parameters, input=input)
    cost = mean_squared_error(target.flatten(), output.flatten())
    return cost

parameters = [1, 1]  # initial parameters
res = minimize(fun=cost, x0=parameters, args=(input, target))
model_parameters = res.x
Here self.model is a function that performs some matrix manipulation based on the parameters; input and target are two matrices. The function works the way I want, except that I would like parameters[1] to have a constraint. Ideally I'd just like to give it a NumPy array, like np.arange(0, 10, 0.01). Is this possible?
In general this is very hard to do, as smoothness is one of the core assumptions of those optimizers.
Problems where some variables are discrete and some are not are hard, and are usually tackled either by mixed-integer optimization (which works well for MI linear programming and reasonably well for MI convex programming, although there are fewer good solvers) or by global optimization (usually derivative-free).
Depending on the details of your task, I recommend decomposing the problem:
an outer loop over an np.arange(0, 10, 0.01)-like grid that fixes the variable
an inner optimization in which this variable is fixed
returning the model with the best objective (with status == success)
This results in N inner optimizations, where N is the size of the state space of the variable to fix.
Depending on your task/data, it might be a good idea to traverse the fixing space monotonically (e.g. using NumPy's arange) and to use the solution of iteration i as the initial point for problem i+1 (potentially fewer iterations are needed if the guess is good). But this is probably not relevant here; see the next part.
If you really have only 2 parameters, as indicated, this decomposition leads to an inner problem with only 1 variable. Then don't use minimize; use minimize_scalar (faster and more robust; it does not need an initial point).
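A minimal sketch of this decomposition (with a hypothetical stand-in for the question's cost, since self.model is not shown):

import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical stand-in for cost(parameters, input, target); any smooth
# function of (p0, p1) illustrates the idea.
def cost(p0, p1):
    return (p0 - 3.7) ** 2 + (p1 - 4.56) ** 2

best = None
for p1 in np.arange(0, 10, 0.01):                   # outer loop: fix the gridded variable
    res = minimize_scalar(lambda p0: cost(p0, p1))  # inner problem: only 1 free variable
    if res.success and (best is None or res.fun < best[0]):
        best = (res.fun, res.x, p1)

best_fun, best_p0, best_p1 = best
print(best_fun, best_p0, round(best_p1, 2))         # p1 stays on the 0.01 grid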
This is a follow-up question to this one.
I'm still working on the cifar10 example in the file cifar10.py and noticed some strange behavior regarding the creation of variables.
But first a side question: why are the variables created with a weight-decay factor of wd=0.0 rather than wd=None? That way you would have fewer vertices in the computation graph.
Next, the strange behavior. I added the following function to make it more convenient to create variables:
def _create_variable(name, shape, initializer, wd=None):
    dtype = tf.float16 if FLAGS.use_fp16 else tf.float32
    with tf.device('/cpu:0'):
        var = tf.get_variable(name, shape, dtype, initializer)
    if wd is not None:
        wd_val = tf.mul(tf.nn.l2_loss(var), wd, name='weight_loss')
        tf.add_to_collection('losses', wd_val)
    return var
When using this function to create the variables (with the original parameters), the computed logits start in a range of ±1e13 for the first batch and gradually improve, reaching ±1.5. The loss, on the other hand, starts at around 400000 and grows until it hits NaN.
When using the original functions to create the variables, the logits are in a range of ±1 right from the beginning, and the loss starts at around 4.5, gradually getting smaller.
Can somebody explain the difference between my function and the provided one for variable generation, and why the effect is so huge? I don't see it.
The full code of my modified cifar10.py can be found here. To test it out, simply replace the original file with my version. To then switch between the original function and mine, simply change line 212 to CUSTOM = False.
Thank you in advance.
Stupid me! I used my own function the wrong way and passed the value meant for stddev as the mean, leaving the default stddev of 1.
The curse of not addressing arguments by their names.
Anyway, why does this cause such a huge loss, and sometimes even NaN?
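For illustration, here is a hypothetical reconstruction of the mix-up (5e-2 is just an assumed stddev, in the spirit of the example):

import tensorflow as tf

# tf.truncated_normal_initializer(mean=0.0, stddev=1.0, ...):
# a single positional argument lands in the mean slot, not in stddev.
intended = tf.truncated_normal_initializer(stddev=5e-2)  # mean 0.0, stddev 0.05
actual = tf.truncated_normal_initializer(5e-2)           # mean 0.05, stddev 1.0

With a stddev of 1.0, the initial weights are orders of magnitude larger than intended, so the logits blow up, the cross-entropy saturates, and a few oversized gradient steps are enough to overflow to NaN.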
I am trying to use TensorFlow. Here is a very simple piece of code:
train = tf.placeholder(tf.float32, [1], name="train")
W1 = tf.Variable(tf.truncated_normal([1], stddev=0.1), name="W1")
loss = tf.pow(tf.sub(train, W1), 2)
step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
Just ignore the optimization part (the 4th line). It takes a floating-point number and trains W1 so as to reduce the squared difference.
My question is simple: if I just use the minus sign instead of tf.sub, as below, what is different? Will it cause a wrong result?
loss = tf.pow(train-W1, 2)
When I replace it, the result looks the same. If they are the same, why do we need the tf.add/tf.sub functions at all?
Can the built-in backpropagation only be computed through the tf.* functions?
Yes, - and + resolve to tf.sub and tf.add. If you look at the TensorFlow code, you will see that these operators on tf.Variable are overloaded with the tf.* methods.
As to why both exist, I assume the tf.* ones exist for consistency, so that sub and, say, the matmul operation can be used in the same way, while the operator overloading is for convenience.
(tf.sub appears to have been replaced with tf.subtract)
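A quick way to check this in TF 1.x graph mode (the tensors here are just examples):

import tensorflow as tf

a = tf.constant([1.0, 2.0])
b = tf.constant([0.5, 0.5])

print((a - b).op.type)            # 'Sub' -- the overloaded operator
print(tf.subtract(a, b).op.type)  # 'Sub' -- the same underlying op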
The only advantage I see is that you can specify a name of the operation as in:
tf.subtract(train, W1, name='foofoo')
This helps identify the operation causing an error as the name you provide is also shown:
ValueError: Dimensions must be equal, but are 28 and 40 for 'foofoo' (op: 'Sub') with input shapes
It may also help with understanding the graph in TensorBoard. It might be overkill for most people, as Python also shows the line number that triggered the error.