Tensorflow "map operation" for tensor? - python

I am adapting the cifar10 convolution example to my problem. I'd like to change the data input from a design that reads images one-at-a-time from a file to a design that operates on an already-in-memory set of images. The original inputs() function looks like this:
read_input = cifar10_input.read_cifar10(filename_queue)
reshaped_image = tf.cast(read_input.uint8image, tf.float32)
# Crop the central [height, width] of the image.
resized_image = tf.image.resize_image_with_crop_or_pad(reshaped_image,
                                                       width, height)
In the original version, read_input is a tensor containing one image.
I keep all my images in RAM, so instead of using filename_queue, I have one huge images_tensor = tf.constant(images), where images_tensor.shape is (something, 32, 32, 3).
My question is very-very basic: what is the best way to apply some function (tf.image.resize_image_with_crop_or_pad in my case) to all elements of images_tensor?
Iterating is problematic in TensorFlow, since slicing is quite limited (see TensorFlow - numpy-like tensor indexing). Is there a way to achieve this with just one command?

As of version 0.8 there is map_fn. From the documentation:
map_fn(fn, elems, dtype=None, parallel_iterations=10, back_prop=True,
       swap_memory=False, name=None)
map on the list of tensors unpacked from elems on dimension 0.
This map operator repeatedly applies the callable fn to a sequence of elements from first to last. The elements are made of the tensors unpacked from elems. dtype is the data type of the return value of fn. Users must provide dtype if it is different from the data type of elems.
Suppose that elems is unpacked into values, a list of tensors. The shape of the result tensor is [len(values)] + fn(values[0]).shape.
Args:
fn: The callable to be performed.
elems: A tensor to be unpacked to apply fn.
dtype: (optional) The output type of fn.
parallel_iterations: (optional) The number of iterations allowed to run in parallel.
back_prop: (optional) True enables back propagation.
swap_memory: (optional) True enables GPU-CPU memory swapping.
name: (optional) Name prefix for the returned tensors.
Returns:
A tensor that packs the results of applying fn to the list of tensors unpacked from elems, from first to last.
Raises:
TypeError: if fn is not callable.
Example:
elems = [1, 2, 3, 4, 5, 6]
squares = map_fn(lambda x: x * x, elems)
# squares == [1, 4, 9, 16, 25, 36]
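Applied to the original question, a minimal sketch might look like this (assuming the TF 1.x graph/session API; the zero-filled images array and the 24x24 target size are just illustrative stand-ins):

import numpy as np
import tensorflow as tf

# Illustrative stand-in for the in-memory image set from the question.
images = np.zeros((128, 32, 32, 3), dtype=np.float32)
images_tensor = tf.constant(images)

# map_fn applies the per-image crop/pad to every element along dimension 0.
resized = tf.map_fn(
    lambda img: tf.image.resize_image_with_crop_or_pad(img, 24, 24),
    images_tensor)

with tf.Session() as sess:
    out = sess.run(resized)  # shape (128, 24, 24, 3)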

There are a few answers - none quite as elegant as a map function. Which is best depends a bit on your desire for memory efficiency.
(a) You can use enqueue_many to throw them into a tf.FIFOQueue and then dequeue and tf.image.resize_image_with_crop_or_pad an image at a time, and concat it all back into one big smoosh. This is probably slow. Requires N calls to run for N images.
(b) You could use a single placeholder feed and run to resize and crop them on their way in from your original datasource. This is possibly the best option from a memory perspective, because you never have to store the unresized data in memory (a minimal sketch follows below).
(c) You could use the tf.control_flow_ops.While op to iterate through the full batch and build up the result in a tf.Variable. Particularly if you take advantage of the parallel execution permitted by while, this is likely to be the fastest approach.
I'd probably go for option (c) unless you want to minimize memory use, in which case filtering it on the way in (option b) would be a better choice.
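If you go the option (b) route, a sketch could look like the following (assuming the TF 1.x placeholder/session API and a TF version whose resize_image_with_crop_or_pad accepts a 4-D batch; the 24x24 target size and the image_chunks iterator are hypothetical stand-ins):

import tensorflow as tf

# Crop/pad images as they are fed in, so the unresized data never lives in the graph.
raw_images = tf.placeholder(tf.float32, shape=[None, 32, 32, 3])
processed = tf.image.resize_image_with_crop_or_pad(raw_images, 24, 24)

with tf.Session() as sess:
    for chunk in image_chunks:  # hypothetical iterator over your in-memory images
        out = sess.run(processed, feed_dict={raw_images: chunk})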

TensorFlow provides a couple of higher-order functions, and one of them is tf.map_fn. The usage is very easy: you define your mapping and apply it to the tensor:
variable = tf.Variable(...)
mapping = lambda x: f(x)
res = tf.map_fn(mapping, variable)

Related

Confusion about Pytorch `torch.split` documentation

When I read the documentation of the function torch.split in PyTorch, I find it difficult to parse as a non-native English speaker:
torch.split(tensor, split_size_or_sections, dim=0)
[...]
If split_size_or_sections is a list, then tensor will be split
into len(split_size_or_sections) chunks with sizes in dim according
to split_size_or_sections.
Does "with sizes in dim" mean "with sizes in split_size_or_sections along the dimension dim"?
Don't worry - your English is fine, that line is a bit confusing.
Yes you're correct. It means if you pass a list e.g. split_size_or_sections=[1,2,4,5] it will split the tensor into len([1,2,4,5]) chunks (with the splits happening across dim), and each chunk will be of length 1, 2, 4, 5 respectively.
This implicitly assumes that sum([1,2,4,5]) equals the size of dim, and will return an error if not.
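A small, self-contained illustration of that behavior (standard torch.split semantics):

import torch

t = torch.arange(24).reshape(12, 2)            # size 12 along dim 0
chunks = torch.split(t, [1, 2, 4, 5], dim=0)   # 1 + 2 + 4 + 5 == 12

print([c.shape[0] for c in chunks])            # [1, 2, 4, 5]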

How to Efficiently Find the Indices of Max Values in a Multidimensional Array of Matrices using Pytorch and/or Numpy

Background
It is common in machine learning to deal with data of a high dimensionality. For example, in a Convolutional Neural Network (CNN) the dimensions of each input image may be 256x256, and each image may have 3 color channels (Red, Green, and Blue). If we assume that the model takes in a batch of 16 images at a time, the dimensionality of the input going into our CNN is [16,3,256,256]. Each individual convolutional layer expects data in the form [batch_size, in_channels, in_y, in_x], and all of these quantities often change layer-to-layer (except batch_size). The term we use for the matrix made up of the [in_y, in_x] values is feature map, and this question is concerned with finding the maximum value, and its index, in every feature map at a given layer.
Why do I want to do this? I want to apply a mask to every feature map, centered at the max value in each feature map, and to do that I need to know where each max value is located. This mask application is done during both training and testing of the model, so efficiency is vitally important to keep computation times down. There are many Pytorch and Numpy solutions for finding singleton max values and indices, and for finding the maximum values or indices along a single dimension, but no dedicated and efficient built-in functions (that I could find) for finding the indices of maximum values along 2 or more dimensions at a time. Yes, we can nest functions that operate on a single dimension, but these are some of the least efficient approaches.
What I've Tried
I've looked at this Stackoverflow question, but the author is dealing with a special-case 4D array which is trivially squeezed to a 3D array. The accepted answer is specialized for that case, and the answer pointing to topk is misguided because it not only operates on a single dimension, but would necessitate k=1 given the question asked, thus devolving to a regular torch.max call.
I've looked at this Stackoverflow question, but this question, and its answer, focus on looking through a single dimension.
I have looked at this Stackoverflow question, but I already know of the answer's approach, as I independently formulated it in my own answer here (where I noted that the approach is very inefficient).
I have looked at this Stackoverflow question, but it does not satisfy the key part of this question, which is concerned with efficiency.
I have read many other Stackoverflow questions and answers, as well as the Numpy documentation, Pytorch documentation, and posts on the Pytorch forums.
I've tried implementing a LOT of varying approaches to this problem, enough that I have created this question so that I can answer it and give back to the community, and anyone who goes looking for a solution to this problem in the future.
Standard of Performance
If I am asking a question about efficiency I need to detail expectations clearly. I am trying to find a time-efficient solution (space is secondary) for the problem above without writing C code/extensions, and which is reasonably flexible (hyper specialized approaches aren't what I'm after). The approach must accept an [a,b,c,d] Torch tensor of datatype float32 or float64 as input, and output an array or tensor of the form [a,b,2] of datatype int32 or int64 (because we are using the output as indices).
Solutions should be benchmarked against the following typical solution:
max_indices = torch.stack([
    torch.stack([(x[k][j] == torch.max(x[k][j])).nonzero()[0] for j in range(x.size()[1])])
    for k in range(x.size()[0])
])
The Approach
We are going to take advantage of the Numpy community and libraries, as well as the fact that Pytorch tensors and Numpy arrays can be converted to/from one another without copying or moving the underlying arrays in memory (so conversions are low cost). From the Pytorch documentation:
Converting a torch Tensor to a Numpy array and vice versa is a breeze. The torch Tensor and Numpy array will share their underlying memory locations, and changing one will change the other.
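A quick illustration of that zero-copy behavior for CPU tensors (standard PyTorch API):

import numpy as np
import torch

a = torch.zeros(3)
b = a.numpy()                      # no copy: b views the same memory as a
b[0] = 5.0
print(a)                           # tensor([5., 0., 0.]) -- the tensor sees the change

c = torch.from_numpy(np.ones(3))   # zero-copy in the other direction as well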
Solution One
We are first going to use the Numba library to write a function that will be just-in-time (JIT) compiled upon its first usage, meaning we can get C speeds without having to write C code ourselves. Of course, there are caveats to what can get JIT-ed, and one of those caveats is that we work with Numpy functions. But this isn't too bad because, remember, converting from our torch tensor to Numpy is low cost. The function we create is:
import numpy as np
from numba import njit

@njit(cache=True)
def indexFunc(array, item):
    for idx, val in np.ndenumerate(array):
        if val == item:
            return idx
This function is from another Stackoverflow answer located here (this was the answer which introduced me to Numba). The function takes an N-dimensional Numpy array and looks for the first occurrence of a given item. It immediately returns the index of the found item on a successful match. The @njit decorator is short for @jit(nopython=True), and tells the compiler that we want it to compile the function without using any Python objects, and to throw an error if it is not able to do so (Numba is fastest when no Python objects are used, and speed is what we are after).
With this speedy function backing us, we can get the indices of the max values in a tensor as follows:
import numpy as np
import torch

x = x.numpy()
maxVals = np.amax(x, axis=(2, 3))
max_indices = np.zeros((x.shape[0], x.shape[1], 2), dtype=np.int64)
for index in np.ndindex(x.shape[0], x.shape[1]):
    max_indices[index] = np.asarray(indexFunc(x[index], maxVals[index]), dtype=np.int64)
max_indices = torch.from_numpy(max_indices)
We use np.amax because it can accept a tuple for its axis argument, allowing it to return the max values of each 2D feature map in the 4D input. We initialize max_indices with np.zeros ahead of time because appending to numpy arrays is expensive, so we allocate the space we need ahead of time. This approach is much faster than the Typical Solution in the question (by an order of magnitude), but it also uses a for loop outside the JIT-ed function, so we can improve...
Solution Two
We will use the following solution:
from numba import njit, prange

@njit(cache=True)
def indexFunc(array, item):
    for idx, val in np.ndenumerate(array):
        if val == item:
            return idx
    raise RuntimeError

@njit(cache=True, parallel=True)
def indexFunc2(x, maxVals):
    max_indices = np.zeros((x.shape[0], x.shape[1], 2), dtype=np.int64)
    for i in prange(x.shape[0]):
        for j in prange(x.shape[1]):
            max_indices[i, j] = np.asarray(indexFunc(x[i, j], maxVals[i, j]), dtype=np.int64)
    return max_indices
x = x.numpy()
maxVals = np.amax(x, axis=(2,3))
max_indices = torch.from_numpy(indexFunc2(x,maxVals))
Instead of iterating through our feature maps one at a time with a for loop, we can take advantage of parallelization using Numba's prange function (which behaves exactly like range but tells the compiler we want the loop to be parallelized) and the parallel=True decorator argument. Numba also parallelizes the np.zeros call. Because our function is compiled just-in-time and uses no Python objects, Numba can take advantage of all the threads available on our system! It is worth noting that there is now a raise RuntimeError in indexFunc. We need to include this, otherwise the Numba compiler will try to infer the return type of the function and infer that it will either be an array or None. This doesn't jibe with our usage in indexFunc2, so the compiler would throw an error. Of course, from our setup we know that indexFunc will always return an array, so we can simply raise an error in the other logical branch.
This approach is functionally identical to Solution One, but changes the iteration using np.ndindex into two for loops using prange. This approach is about 4x faster than Solution One.
Solution Three
Solution Two is fast, but it is still finding the max values using regular Python. Can we speed this up using a more comprehensive JIT-ed function?
@njit(cache=True)
def indexFunc(array, item):
    for idx, val in np.ndenumerate(array):
        if val == item:
            return idx
    raise RuntimeError

@njit(cache=True, parallel=True)
def indexFunc3(x):
    maxVals = np.zeros((x.shape[0], x.shape[1]), dtype=np.float32)
    for i in prange(x.shape[0]):
        for j in prange(x.shape[1]):
            maxVals[i][j] = np.max(x[i][j])
    max_indices = np.zeros((x.shape[0], x.shape[1], 2), dtype=np.int64)
    for i in prange(x.shape[0]):
        for j in prange(x.shape[1]):
            max_indices[i, j] = np.asarray(indexFunc(x[i, j], maxVals[i, j]), dtype=np.int64)
    return max_indices

max_indices = torch.from_numpy(indexFunc3(x))
max_indices = torch.from_numpy(indexFunc3(x))
It might look like there is a lot more going on in this solution, but the only change is that instead of calculating the maximum values of each feature map using np.amax, we have now parallelized the operation. This approach is marginally faster than Solution Two.
Solution Four
This solution is the best I've been able to come up with:
@njit(cache=True, parallel=True)
def indexFunc4(x):
    max_indices = np.zeros((x.shape[0], x.shape[1], 2), dtype=np.int64)
    for i in prange(x.shape[0]):
        for j in prange(x.shape[1]):
            maxTemp = np.argmax(x[i][j])
            max_indices[i][j] = [maxTemp // x.shape[3], maxTemp % x.shape[3]]
    return max_indices

max_indices = torch.from_numpy(indexFunc4(x))
This approach is more condensed and also the fastest, about 33% faster than Solution Three and 50x faster than the Typical Solution. We use np.argmax to get the index of the max value of each feature map, but np.argmax only returns the index as if each feature map were flattened. That is, we get a single integer telling us which element the max is in the flattened feature map, not the indices we need to access that element. The math [maxTemp // x.shape[3], maxTemp % x.shape[3]] (integer division and remainder by the feature-map width) turns that single int into the [row, column] pair we need.
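As a sanity check of that divide/modulo step, it can be compared against np.unravel_index on a single feature map (fm here is just a hypothetical stand-in; note the divisor is the feature-map width):

import numpy as np

fm = np.random.rand(4, 6)                             # hypothetical non-square feature map
flat = np.argmax(fm)                                  # index into the flattened map
row, col = flat // fm.shape[1], flat % fm.shape[1]    # divide/mod by the width
assert (row, col) == np.unravel_index(flat, fm.shape)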
Benchmarking
All approaches were benchmarked together against a random input of shape [32,d,64,64], where d was incremented from 5 to 245. For each d, 15 samples were gathered and the times averaged. An equality test ensured that all solutions returned identical values. An example of the benchmark output:

[benchmark output table omitted]

A plot of the benchmark times as d increased (leaving out the Typical Solution so the graph isn't squashed):

[benchmark timing plot omitted]
Woah! What is going on at the start with those spikes?
Solution Five
Numba allows us to produce just-in-time compiled functions, but it doesn't compile them until the first time we use them; it then caches the result for subsequent calls. This means the very first time we call our JIT-ed functions we get a spike in compute time as the function is compiled. Luckily, there is a way around this: if we specify ahead of time what our function's return type and argument types will be, the function will be eagerly compiled instead of compiled just-in-time. Applying this knowledge to Solution Four we get:
@njit('i8[:,:,:](f4[:,:,:,:])', cache=True, parallel=True)
def indexFunc4(x):
    max_indices = np.zeros((x.shape[0], x.shape[1], 2), dtype=np.int64)
    for i in prange(x.shape[0]):
        for j in prange(x.shape[1]):
            maxTemp = np.argmax(x[i][j])
            max_indices[i][j] = [maxTemp // x.shape[3], maxTemp % x.shape[3]]
    return max_indices

max_indices6 = torch.from_numpy(indexFunc4(x))
And if we restart our kernel and rerun our benchmark, we can look at the first result where d==5 and the second result where d==10 and note that all of the JIT-ed solutions were slower when d==5 because they had to be compiled, except for Solution Five, because we explicitly provided the function signature ahead of time:

[benchmark output for d==5 and d==10 omitted]
There we go! That's the best solution I have so far for this problem.
EDIT #1
Solution Six
An improved solution has been developed which is 33% faster than the previously posted best solution. This solution only works if the input array is C-contiguous, but this isn't a big restriction since numpy arrays or torch tensors will be contiguous unless they are reshaped, and both have functions to make the array/tensor contiguous if needed.
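A short sketch of that check, using the standard NumPy/PyTorch contiguity helpers (the random tensor is just an illustrative stand-in):

import numpy as np
import torch

x_t = torch.randn(32, 8, 64, 64)      # illustrative float32 input
x = x_t.numpy()

if not x.flags['C_CONTIGUOUS']:       # NumPy side
    x = np.ascontiguousarray(x)

if not x_t.is_contiguous():           # PyTorch side
    x_t = x_t.contiguous()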
This solution is the same as the previous best, but the function decorator which specifies the input and return types is changed from
@njit('i8[:,:,:](f4[:,:,:,:])', cache=True, parallel=True)
to
@njit('i8[:,:,::1](f4[:,:,:,::1])', cache=True, parallel=True)
The only difference is that the last : in each array typing becomes ::1, which signals to the numba njit compiler that the input arrays are C-contiguous, allowing it to better optimize.
The full solution six is then:
@njit('i8[:,:,::1](f4[:,:,:,::1])', cache=True, parallel=True)
def indexFunc5(x):
    max_indices = np.zeros((x.shape[0], x.shape[1], 2), dtype=np.int64)
    for i in prange(x.shape[0]):
        for j in prange(x.shape[1]):
            maxTemp = np.argmax(x[i][j])
            max_indices[i][j] = [maxTemp // x.shape[3], maxTemp % x.shape[3]]
    return max_indices

max_indices7 = torch.from_numpy(indexFunc5(x))
The benchmark including this new solution confirms the speedup:

[updated benchmark plot omitted]

How to remove an item from a multi-dimensional array?

To put it simply, I have a list of shape [32, 31, 4] which I would like to reduce to shape [32, 31, 3], replacing every length-4 array in the last dimension by an array of size (3).
for a in range(len(liste)):                    # len(liste) = 95
    for b in range(len(liste[a])):             # liste[a] has shape [32, 31, 4]; b travels in the 1st dim
        #print('frame : ', liste[a][b].shape)  # [31, 4]
        #print('b', b)                         # 32 frames each time, ok
        for c in range(len(liste[a][b])):
            #print('c', c)                     # 31 each time, ok
            #print('quaternion norm', np.abs(np.linalg.norm(liste[a][b][c])))  # norm = 1
            r = quat2expmap(liste[a][b][c])    # conversion to expmap successful
            #print('ExpMap : ', r)
            quat = liste[a][b][c]
            quat = r                           # this works
            #print('quat', quat)
            liste[a][b][c] = r                 # this doesn't work
To be more precise, I have a dataset of 95 different gestures, each represented by 32 frames of quaternions. I converted the quaternions into ExpMap form, but due to the difference in shapes I am unable to replace the quaternions by their corresponding ExpMaps. The error I receive most often is the following:
ValueError: could not broadcast input array from shape (3) into shape (4)
It comes from the last line of the code.
The weirdest thing is that when I take the quaternion out on its own and replace it, it works perfectly, yet Python refuses to let me do it inside my list. I don't really get why.
Could you enlighten me? How can I get the proper dimensions in my list? I tried all the usual tricks such as del and remove() but got no result...
You seem to be using numpy arrays (not Python lists). Numpy does not allow changing dimensions on assignment to an element of an array, because the array would become irregular (some entries of length 4 and some of length 3).
Also, iterating through numpy arrays with explicit loops is the wrong way to use numpy. In this case you are probably looking at applying the quat2expmap function along the 4th dimension of your array to produce a new array of shape (95, 32, 31, 3). This makes maximum use of numpy's vectorization and can be written in a couple of lines without any loops.
You could either modify the quat2expmap function so that it works directly on your 4-D array (the fastest approach) or use np.apply_along_axis (which is not much faster than loops).
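A hedged sketch of the apply_along_axis route, assuming quat2expmap (from the question) maps a shape-(4,) quaternion to a shape-(3,) vector and that liste can be stacked into a (95, 32, 31, 4) float array:

import numpy as np

# Assumes `liste` and `quat2expmap` from the question.
quats = np.stack(liste)                                  # shape (95, 32, 31, 4)
expmaps = np.apply_along_axis(quat2expmap, -1, quats)    # shape (95, 32, 31, 3)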

how to loop through each row in a tensor in tensorflow

I have a 2D tensor in TensorFlow, say for example a 2x4 tensor [[1.,2.,3.,4.],[2.,4.,5.,6.]].
I have a function a() that each row of the tensor should be passed through, and I then want to sum all of the results of a(). How can I do this (without doing it in the session)?
The output should be a([1.,2.,3.,4.]) + a([2.,4.,5.,6.]); in practice I have a very large tensor with many rows.
This is different from reduce_sum, because the a() function here is quite complex and cannot simply be vectorized.
Many thanks!
Perhaps what you're looking for is the map_fn function in TensorFlow. map_fn(a, elems) unpacks the tensor elems along its first dimension into a sequence of slices, applies the supplied function a to each slice, and then packs the outputs back into a single tensor along the first dimension.
It sounds like what you want is
Y = map_fn(a, X)
answer = reduce_sum(Y, axis=0)
where X is your supplied tensor.
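A concrete sketch using the 2x4 example from the question (TF 1.x session API; tf.reduce_prod is only a stand-in for the real, more complex a()):

import tensorflow as tf

X = tf.constant([[1., 2., 3., 4.], [2., 4., 5., 6.]])
a = lambda row: tf.reduce_prod(row)     # stand-in for the real a()

Y = tf.map_fn(a, X)                     # applies a() to each row
answer = tf.reduce_sum(Y, axis=0)

with tf.Session() as sess:
    print(sess.run(answer))             # a([1,2,3,4]) + a([2,4,5,6]) = 24 + 240 = 264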

tensorflow-using dynamic shape when defining models

I have a batch of input:
input = tf.placeholder(tf.float32, [NUM_SAMPLE, None, 15])
For each one in the batch, I have a dictionary that describes the relationships between its rows. It looks like:
dic = {i:{j:rij,k:rik,...},j:{i:rij,l:rjl,...},...}
Now I want to do this for each sample and its corresponding dic:
updated_sample = sample
for i in range(len(sample)):
    for j in dic[i]:
        tmp = concatenate(sample[j], rij)
        updated_sample[i] += matmul(tmp, W)
in which W is the same for all samples and rows.
However, I cannot use len(sample) in tensorflow. It seems tf.while_loop may be the answer, but I don't know how to use it in this problem. Any suggestions?
Besides, can I use a dictionary in this way in TensorFlow?
There are 2 analogs in tensorflow for len(sample):
tf.shape(sample)[0]
sample.get_shape().as_list()[0]
The first one, tf.shape(sample), returns a 1-D integer tensor whose length equals the rank of the tensor; tf.shape(sample)[0] is therefore a scalar tensor (shape ()) and should be used within the TensorFlow graph.
The second one, sample.get_shape(), returns a TensorShape object; sample.get_shape().as_list() turns it into a list of Python integers (with None for unknown dimensions).
In your case you should use the second of these, since the first dimension (NUM_SAMPLE) is statically known.
Consider also doing these computations at the numpy level and then feeding the results into the graph through placeholders.
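A short illustration of the two options (TF 1.x API), with a placeholder whose first dimension is static and whose second is dynamic; the size 8 stands in for NUM_SAMPLE:

import tensorflow as tf

sample = tf.placeholder(tf.float32, [8, None, 15])   # 8 stands in for NUM_SAMPLE

static_len = sample.get_shape().as_list()[0]   # Python int 8, known at graph-construction time
dynamic_dim = tf.shape(sample)[1]              # scalar int32 tensor, known only when data is fed

print(static_len)                              # 8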
