I have been struggling with this for quite some time. All I want is a torch.diff() function, but many matrix operations do not appear to be easily compatible with tensor operations.
I have tried an enormous number of PyTorch operation combinations, yet none of them work.
Since PyTorch hasn't implemented this basic feature, I started by simply trying to subtract element i+1 from element i along a specific axis.
However, you can't simply do this element-wise (due to the tensor limitations), so I tried to construct another tensor with the elements shifted along one axis:
ix_plus_one = [0]+list(range(0,prediction.size(1)-1))
ix_differential_tensor = torch.LongTensor(ix_plus_one)
diff_one_tensor = prediction[:,ix_differential_tensor]
But now we have a different problem: indexing in PyTorch doesn't really mimic NumPy as advertised, so you can't index with a "list-like" tensor like this. I also tried using the tensor scatter functions.
So I'm still stuck with this simple problem of trying to get a gradient on a PyTorch tensor.
All of my searching leads to the marvelous capabilities of PyTorch's autograd, which has nothing to do with this problem.
A 1D convolution with a fixed filter should do the trick:
import numpy as np
import torch

# A fixed, non-trainable 1D convolution that computes x[i+1] - x[i]
filter = torch.nn.Conv1d(in_channels=1, out_channels=1, kernel_size=2, stride=1, padding=1, groups=1, bias=False)
kernel = np.array([-1.0, 1.0], dtype=np.float32)  # float32 to match the layer's weights
kernel = torch.from_numpy(kernel).view(1, 1, 2)
filter.weight.data = kernel
filter.weight.requires_grad = False
Then use filter like you would any other layer in torch.nn.
Also, you might want to change padding to suit your specific needs.
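A minimal usage sketch (the input here is made up for illustration); Conv1d expects a (batch, channels, length) input, so add a channel dimension before applying the layer:

x = torch.randn(4, 10)         # hypothetical batch of 4 sequences of length 10
y = filter(x.unsqueeze(1))     # shape (4, 1, 11), because padding=1 adds an extra element
diffs = y.squeeze(1)           # drop the channel dimension again; trim the ends as needed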
There appears to be a simpler solution to this (I needed something similar myself), referenced here: https://discuss.pytorch.org/t/equivalent-function-like-numpy-diff-in-pytorch/35327/2
diff = x[1:] - x[:-1]
which can be done along different dimensions such as
diff = polygon[:, 1:] - polygon[:, :-1]
I would recommend writing a unit test that verifies identical behavior though.
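A minimal sketch of such a test (names here are just illustrative), checking the slicing approach against numpy.diff along the second dimension:

import numpy as np
import torch

def test_slicing_diff_matches_numpy():
    x = torch.randn(3, 7)
    expected = np.diff(x.numpy(), axis=1)        # reference result from numpy
    actual = (x[:, 1:] - x[:, :-1]).numpy()      # slicing-based diff in torch
    assert np.allclose(expected, actual)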
For all those running into this question after March 2021:
As of torch 1.8 there is torch.diff, which works exactly as the OP expected.
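For example, torch.diff takes an optional dim argument, defaulting to the last dimension:

import torch

x = torch.tensor([1, 3, 6, 10])
print(torch.diff(x))           # tensor([2, 3, 4])

m = torch.tensor([[1, 2, 4], [0, 5, 5]])
print(torch.diff(m, dim=1))    # tensor([[1, 2], [5, 0]])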
Regarding the answer posted here, when I want to use the equations for obtaining the values of the parameters of the transposed convolution, I face some problems. For example, I have a tensor of size [16, 256, 16, 160, 160] and I want to upsample it to size [16, 256, 16, 224, 224]. Based on the equation of the transposed convolution, when I solve the equation for the height with stride = 2 and try to find k (the kernel size), I end up with a kernel size that is large and also negative:
224 = (160 − 1) × 2 + 1 × (k − 1) + 1
What is wrong with my calculations, and how can I find the parameters?
I don't think you applied the formula incorrectly; the issue is primarily that the input and output dimensions you desire are not possible with stride=2.
Transposed or dilated convolutions scale the output really quickly. Let's say, for example, you take these params for your transposed convolution (I'm simplifying the values here to 1D just to make the calculations clear):
Input Size = 160
Stride = 2
Kernel = 1
Padding = 0
Output Padding = 0
Now we apply the formula from the official docs for calculating output shape:
H_out = (H_in − 1) × stride[0] − 2 × padding[0] + dilation[0] × (kernel_size[0] − 1) + output_padding[0] + 1
Or we can simplify the formula a bit:
Output Size = ((Input Size − 1) × Stride) − (2 × Padding) + Filter_Size + Output Padding + 1
Here, Filter_Size = dilation_factor × (kernel_size − 1), just to make the formula seem less scary.
Now let's take our example and put in the values to see what transposed output size we can get with stride=2 and the smallest kernel size possible, that is, kernel=1:
Output_Size = ((160 − 1) × 2) − (2 × 0) + 1 × (1 − 1) + 0 + 1
Output_Size = 318 − 0 + 0 + 0 + 1
Output_Size = 319
So, with the stride you want, you will have an output_size of at least 319, while you want 224; hence the negative kernel_size.
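A quick way to sanity-check this is to let PyTorch compute the output size for you; this is just an illustrative sketch using the 1D simplification with a single-channel input of length 160:

import torch

upsample = torch.nn.ConvTranspose1d(in_channels=1, out_channels=1, kernel_size=1, stride=2)
x = torch.randn(1, 1, 160)
print(upsample(x).shape)   # torch.Size([1, 1, 319]) -- already larger than the desired 224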
I hope that answers your question.
Ref Links to understand Transposed Convolution calculations better with an example:
Paperspace: Transpose Convolution Explained for Up-Sampling Images
Calculating the Output Size of Convolutions and Transpose Convolutions
There is no good constructive answer to this question.
Being in some sense the inverse of conv2d, which downsamples an image by stride times, transposed_conv2d upsamples by stride times. One cannot use it for arbitrary resizing and get uniformly good results; there's torchvision.transforms.Resize or adaptive pooling for this.
torchvision.transforms.Resize is the default choice; it is simple and flexible, and one can feed a PIL image or a torch.Tensor to it. Use the former if input sizes vary dynamically, the latter if not.
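For instance, a minimal sketch with a tensor input (this assumes a torchvision version whose transforms accept tensors with [..., H, W] layout, shown here on a 4-D batch):

import torch
from torchvision import transforms

resize = transforms.Resize((224, 224))
x = torch.randn(16, 256, 160, 160)   # (N, C, H, W) batch with spatial size 160x160
y = resize(x)
print(y.shape)                       # torch.Size([16, 256, 224, 224])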
Adaptive pooling, usually AdaptiveAvgPool2d, is more sophisticated; it is supposed to be part of the architecture. Inserted at the beginning of a network, it works as a (batched) image resize. No magic here: it is usually CPU-implemented, and one will have a hard time implementing it on tensor hardware. In embedded solutions it is typical to have a special image processor for such work.
Well, you could still formally solve the task with transposed_conv2d by playing with padding, but it would just be cutting off part of the image, probably losing information, or inserting a lot of useless spacing.
I am performing regression analysis on some reasonably large vectors (for now, working with numpy and other scientific tools is OK if I leave the computer working overnight), but they will grow by several factors eventually, so I am looking to improve performance by moving the implementation to PyTorch.
The regression is fairly simple. I have 2 matrices, predictions and betas, with dimensions (750, 6340) and (750, 4313) respectively. The least-squares solution I am looking for is predictions * x = betas, where x would have dimensions (6340, 4313), but I have to account for intercepts in the regression. With numpy I solved this by iterating through the second dimension of predictions, creating a matrix with each column plus a column of ones, and passing that as the first argument:
for candidate in range(0, predictions.shape[1]):  # each column is a candidate
    prediction = predictions[:, candidate]
    # allow for an intercept by adding a column with ones
    prediction = np.vstack([prediction, np.ones(prediction.shape[0])]).T
    sol = np.linalg.lstsq(prediction, betas, rcond=-1)
Question number 1 would be: is there a way to avoid iterating over each candidate in order to allow the least squares calculation to account for an intercept? That would improve computation time by a lot.
I tried using statsmodels.regression.linear_model.ols, which allows for this by default (you can add a -1 to the formula if you want it removed), but this approach either forces me to iterate over each candidate (using apply was appealing but didn't noticeably improve computation time), or there is something I'm missing. Question 1.5, then: can I use this tool in such a way, or is that all there is to it?
Similarly in pytorch I would do
t_predictions = torch.tensor(predictions, dtype=torch.float)
t_betas_roi = torch.tensor(betas, dtype=torch.float)
t_sol = torch.linalg.lstsq(t_predictions, t_betas_roi)
And it's fast indeed, but I'm missing the intercept here. I reckon that if I did this with numpy instead of iterating, it would also be much faster, but either way, if question 1 has a solution, I imagine it could be applied here similarly, right?
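For reference, the usual way to fold an intercept into a least-squares solve is to append a column of ones to the design matrix; a minimal sketch of that on the torch variables above (note this treats all columns jointly, which may or may not match the per-candidate setup in the numpy loop):

ones = torch.ones(t_predictions.shape[0], 1)
t_predictions_i = torch.cat([t_predictions, ones], dim=1)   # last column models the intercept
t_sol = torch.linalg.lstsq(t_predictions_i, t_betas_roi)    # solution gains one extra row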
I (think) I understand the basic principle behind embeddings: they're a shortcut to quickly perform the operation (one hot encoded vector) * matrix without actually performing that operation (by utilizing the fact that this operation is equivalent to indexing into the matrix) while still maintaining the same gradient as if that operation was actually performed.
I know in the following example:
e = Embedding(3, 2)
n = e(torch.LongTensor([0, 2]))
n will be a tensor of shape 2, 2.
But, we could also do:
p = nn.Parameter(torch.zeros([3, 2]).normal_(0, 0.01))
p[tensor([0, 2])]
and get the same result without an embedding.
This in and of itself wouldn't be confusing since in the first example n has a grad_fn called EmbeddingBackward whereas in the 2nd example p has a grad_fn called IndexBackward, which is what we expect since we know embeddings simulate a different derivative.
The confusing part is in chapter 8 of the fastbook they use embeddings to compute movie recommendations. But then, they do it without embeddings in basically the same manner & the model still works. I would expect the version without embeddings to fail because the derivative would be incorrect.
Version with embeddings:
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias = Embedding(n_movies, 1)
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
        return sigmoid_range(res, *self.y_range)
Version without:
def create_params(size):
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))

class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors[x[:,0]]
        movies = self.movie_factors[x[:,1]]
        res = (users*movies).sum(dim=1)
        res += self.user_bias[x[:,0]] + self.movie_bias[x[:,1]]
        return sigmoid_range(res, *self.y_range)
Does anyone know what is going on?
Besides the points provided in the comments:
I (think) I understand the basic principle behind embeddings: they're
a shortcut to quickly perform the operation (one hot encoded vector) *
matrix without actually performing that operation
This one is only partially correct. Indexing into a matrix is a fully differentiable operation; it just takes part of the data and propagates the gradient along that path only. The multiplication here would be wasteful and unnecessary.
and get the same result without an embedding.
This one is true
called EmbeddingBackward whereas in the 2nd example p has a grad_fn called IndexBackward, which is what we expect since we know embeddings simulate a different derivative. (emphasis mine)
This one isn't (or often isn't). Embedding also chooses part of the data, just like you did; the grad_fn is different due to "extra functionalities", but in principle it is the same (choosing some vector(s) from the matrix). Think of it as IndexBackward on steroids.
The confusing part is in chapter 8 of the fastbook they use embeddings
to compute movie recommendations. But then, they do it without
embeddings in basically the same manner & the model still works. I
would expect the version without embeddings to fail because the
derivative would be incorrect. (emphasis mine)
In this exact case, both approaches are equivalent (give or take different results from random initialization).
Why nn.Embedding at all?
Easier to understand your intent when reading the code
Common utilities and functionality added if needed (as pointed out in the comments)
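A small sketch illustrating the equivalence (the names here are just for the demonstration): both paths select the same rows and produce gradients only for the rows that were indexed.

import torch
import torch.nn as nn

weight = torch.randn(3, 2)

emb = nn.Embedding(3, 2)
emb.weight.data = weight.clone()
par = nn.Parameter(weight.clone())

idx = torch.tensor([0, 2])
out_emb = emb(idx)
out_idx = par[idx]
print(torch.equal(out_emb, out_idx))            # True -- same forward result

out_emb.sum().backward()
out_idx.sum().backward()
print(torch.equal(emb.weight.grad, par.grad))   # True -- same gradients (row 1 stays zero)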
First of all, this is for practice and comparison; I know there are more efficient ways to tile a state space than with a linear grid.
To run some reinforcement learning algorithms, I would like to tile my state and action spaces linearly. As a result, I want to have every state-action pair in array form. The problem is that there are different (gym) environments with different state- and action-space dimensions, so I don't want any hard-coded variables or dimensions.
So I need to calculate every state-action pair given only the min and max for each.
I've mostly solved the easy problems, but none of the solutions are "pretty".
First, let's compute the state and action spaces. Tile the area with linspace from min to max. I've given the variables for one random test environment.
import numpy as np
NOF_ACTION_SPACE_TILES = 20
NOF_STATE_SPACE_TILES = 10
action_low = np.array([-2])
action_high = np.array([2])        # upper bounds were not shown in the original snippet;
state_low = np.array([-1, -1, -8]) # assumed symmetric to the lows for a runnable example
state_high = np.array([1, 1, 8])
action_space = np.vstack([*[x.flatten() for x in (np.meshgrid(*(np.linspace(action_low, action_high, NOF_ACTION_SPACE_TILES).T)))]]).T
state_space = np.vstack([*[x.flatten() for x in (np.meshgrid(*(np.linspace(state_low, state_high, NOF_STATE_SPACE_TILES).T)))]]).T
That works as intended and gives all the possible combinations for the states and actions on their own. Any way to do this more straightforwardly? I needed to use the *[] unpacking twice, because np.meshgrid returns multiple matrices and I am trying to flatten the vectors.
Now to the funny part...
In the end I want to have every possible state-action pair. Every state with every action. This is coded pretty fast with for loops, but well... numpy and for loops are no speedy friends.
So here's my workaround, which works for a 1D action space:
s_s, a_s = np.meshgrid(state_space, action_space)
state_action_space = np.concatenate((
    s_s.reshape(-1, state_space.shape[1]),
    a_s.reshape(state_space.shape[1], action_space.shape[1], -1)[0].T), axis=1)
With state_space.shape[1] being the dimension of a single state / action.
One problem being that np.meshgrid returns a_s for each of the 3 state-space dimensions, and reshaping it like above does not work, because we need to reshape the states to 3×n and the actions to 1×n.
This is even worse than the code above, but it works for now. Does anyone have suggestions on how to use meshgrid or something else properly and fast?
In the end, for the second step, it's just a combination of every row of the two matrices. There has to be a better way...
Thanks to both answers above, here are my final results.
I still had to use *() to unpack the linspace for meshgrid, but it looks more human-readable now.
The big issue with the state-action code before was that I tried to overcomplicate it. It's just copying the arrays on top of each other. So just copy (or tile, in this case) the state-space array as often as there are different actions in your action space, which is ACTION_SPACE_SIZE^(action dims).
action_space = np.stack(np.meshgrid(*(np.linspace(env.action_space.low, env.action_space.high, ACTION_SPACE_SIZE)).T), -1).reshape(-1, env.action_space.shape[0])
state_space = np.stack(np.meshgrid(*(np.linspace(env.observation_space.low, env.observation_space.high, STATE_SPACE_SIZE)).T), -1).reshape(-1, env.observation_space.shape[0])
state_action_space = np.concatenate((
    np.tile(state_space, (action_space.shape[0])).reshape(-1, state_space.shape[1]),
    np.tile(action_space, (state_space.shape[0], 1))
), axis=1)
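As a quick sanity check (purely illustrative), the result should contain one row per state-action combination, with each row holding a state followed by an action:

n_pairs, pair_dim = state_action_space.shape
assert n_pairs == state_space.shape[0] * action_space.shape[0]
assert pair_dim == state_space.shape[1] + action_space.shape[1]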
In TensorFlow I'd like to apply one of the "fake" quantization functions to one of my tensors. Concretely I'm interested in using this function:
fake_quant_with_min_max_args(
inputs,
min=-6,
max=6,
num_bits=8,
narrow_range=False,
name=None
)
It works just fine out of the box. Ideally, though, I would like to adjust the min and max arguments depending on the input tensor. Concretely, I want to use the 99.7% rule to define that range. In other words, representing the input tensor as a 1-dimensional vector, I want to use the range [mean - 3*std, mean + 3*std], within which 99.7% of its elements will lie.
For this purpose I do the following:
def smartFakeQuantization(tensor):
    # Convert the input tensor to a 1-d tensor
    t_1d_data = tf.reshape(tensor, [tf.size(tensor), 1])
    # get the moments of that tensor. Now mean and var have shape (1,)
    mean, var = tf.nn.moments(t_1d_data, axes=[0])
    # get a tensor containing the std
    std = tf.sqrt(var)
    # < some code to get the values of those tensors at run-time >
    clip_range = np.round([mean_val - 3*std_val, mean_val + 3*std_val], decimals=3)
    return tf.fake_quant_with_min_max_args(tensor, min=clip_range[0], max=clip_range[1])
I know that I could evaluate any tensor in the graph by doing myTensor.eval() or mySession.run(myTensor), but if I add those kinds of lines inside my function above, it will crash when executing the graph. I'll get an error of the form:
tensor <...> has been marked as not fetchable.
Probably the steps I'm following are not the right ones for the "graph" nature of TensorFlow. Any ideas how this could be done? Summarising: I want to use the value of a tensor at run-time to modify another tensor. I'd say this problem is more complex than what can be done with tf.cond().
I don't think there is an easy way of doing what you want. The min and max arguments to fake_quant_with_min_max_args are converted to operation attributes and used in the underlying kernel's construction. They cannot be changed at runtime. There are some ops (seemingly not part of the public API; see LastValueQuantize and MovingAvgQuantize) that adjust their intervals depending on the data they see, but they don't do quite what you want.
You can write your own custom op or if you believe this is something generally valuable, file a feature request on github.
You can use tf.fake_quant_with_min_max_vars, which accepts tensors as min/max arguments:
return tf.fake_quant_with_min_max_vars(tensor, min=mean-3*std, max=mean+3*std)
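Putting that together with the moments computed in the question, a minimal sketch (the function name is simply reused from the question for illustration) could look like:

import tensorflow as tf

def smartFakeQuantization(tensor):
    # flatten, then compute the scalar mean / std over all elements
    flat = tf.reshape(tensor, [-1])
    mean, var = tf.nn.moments(flat, axes=[0])
    std = tf.sqrt(var)
    # min/max are tensors here, so no run-time evaluation of the graph is needed
    return tf.fake_quant_with_min_max_vars(tensor,
                                           min=mean - 3.0 * std,
                                           max=mean + 3.0 * std)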