Scanning over different dimensions of tensors in theano - python

I'm taking my first steps with theano and I cannot figure out how to solve this problem, which may actually be very easy.
I have a 3 * 4 * 2 tensor, like the following:
[1 1] | [2 2] | [3 3]
[1 1] | [2 2] | [3 3]
[0 0] | [2 2] | [3 3]
[9 9] | [0 0] | [3 3]
So I have N=3 sequences, each of them of length L=4 with their elements that are vectors of dimension d=2. Actually, the sequences can be of different length but I could think of padding them with [0 0] vectors, as shown above.
What I want to do is scan along the first axis of the tensor and, for each sequence, sum all the vectors up to the first [0 0] vector. That's why I added the [9 9] at the end of the first tensor slice: to check the exit condition of the sum [1]. I should end up with [[2 2], [6 6], [12 12]]. I have tried many ways to solve this, which seems to me to be just a nested-loop problem, but I always got some weird errors [2].
Thanks,
Giulio
--
[1]: the actual problem is the training of a recurrent neural network for NLP purposes, with N the dimension of the batch, L the max length of a sentence in the batch and d the dimension of the representation of each word. I omitted the problem so that I could focus on the simplest coding aspect.
[2] I omit the history of my failures, maybe I could add them later.

If your sequences are always zero padded then you can just sum along the axis of interest, since the padding regions will not change the sum. However, if the padding regions may contain non-zero values, there are two approaches:
1. Use scan. This is slow and should be avoided if possible. In fact, it can be avoided here, because you can instead:
2. Create a binary mask and multiply out the padding region.
Here's some code that illustrates these three approaches. For the two approaches that allow for non-zero padding regions (v2 and v3) the computation needs an additional input: a vector giving the lengths of the sequences within the batch.
import numpy
import theano
import theano.tensor as tt


def v1():
    # NOTE: [9, 9] element changed to [0, 0]
    # since zero padding must be used for
    # this method
    x_data = [[[1, 1], [1, 1], [0, 0], [0, 0]],
              [[2, 2], [2, 2], [2, 2], [0, 0]],
              [[3, 3], [3, 3], [3, 3], [3, 3]]]
    x = tt.tensor3()
    x.tag.test_value = x_data
    y = x.sum(axis=1)
    f = theano.function([x], outputs=y)
    print(f(x_data))


def v2_step(i_t, s_tm1, x, l):
    # Only add x[i_t] for the sequences that are still running at step i_t.
    in_sequence = tt.lt(i_t, l).dimshuffle(0, 'x')
    s_t = s_tm1 + tt.switch(in_sequence, x[i_t], 0)
    return s_t


def v2():
    x_data = [[[1, 1], [1, 1], [0, 0], [9, 9]],
              [[2, 2], [2, 2], [2, 2], [0, 0]],
              [[3, 3], [3, 3], [3, 3], [3, 3]]]
    l_data = [2, 3, 4]
    x = tt.tensor3()
    x.tag.test_value = x_data
    l = tt.lvector()
    l.tag.test_value = l_data
    # Must dimshuffle first because scan can only iterate over first (0'th) axis.
    x_hat = x.dimshuffle(1, 0, 2)
    y, _ = theano.scan(v2_step, sequences=[tt.arange(x_hat.shape[0])],
                       outputs_info=[tt.zeros_like(x_hat[0])],
                       non_sequences=[x_hat, l], strict=True)
    f = theano.function([x, l], outputs=y[-1])
    print(f(x_data, l_data))


def v3():
    x_data = [[[1, 1], [1, 1], [0, 0], [9, 9]],
              [[2, 2], [2, 2], [2, 2], [0, 0]],
              [[3, 3], [3, 3], [3, 3], [3, 3]]]
    l_data = [2, 3, 4]
    x = tt.tensor3()
    x.tag.test_value = x_data
    l = tt.lvector()
    l.tag.test_value = l_data
    # mask[n, t, 0] is 1 while step t lies inside sequence n, else 0.
    indexes = tt.arange(x.shape[1]).dimshuffle('x', 0)
    mask = tt.lt(indexes, l.dimshuffle(0, 'x')).dimshuffle(0, 1, 'x')
    y = (mask * x).sum(axis=1)
    f = theano.function([x, l], outputs=y)
    print(f(x_data, l_data))


def main():
    theano.config.compute_test_value = 'raise'
    v1()
    v2()
    v3()


main()
In general, if your step function is dependent on the output of a previous step then you need to use scan.
If every step/iteration could, in principle, be executed concurrently (i.e. they don't rely on each other at all), then there is often a much more efficient way to do this without using scan.
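The masking idea behind v3 does not depend on Theano. As a minimal sketch, assuming plain NumPy arrays in place of symbolic tensors (the names steps, mask and masked_sum are only illustrative), the same computation looks like this:

```python
import numpy as np

# Same toy batch as v3: N=3 sequences, L=4 steps, d=2 features.
x = np.array([[[1, 1], [1, 1], [0, 0], [9, 9]],
              [[2, 2], [2, 2], [2, 2], [0, 0]],
              [[3, 3], [3, 3], [3, 3], [3, 3]]])
lengths = np.array([2, 3, 4])

# mask[n, t] is True while step t lies inside sequence n.
steps = np.arange(x.shape[1])              # shape (L,)
mask = steps[None, :] < lengths[:, None]   # shape (N, L)

# Broadcast the mask over the feature axis and sum over time.
masked_sum = (x * mask[:, :, None]).sum(axis=1)
print(masked_sum)
# [[ 2  2]
#  [ 6  6]
#  [12 12]]
```

Because no step depends on the result of an earlier step, the whole sum is a single vectorised expression, which is why no loop is needed in this case.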

Related

How can I stack 2D arrays along an axis to make a 3D array in an h5 file

Most of the solutions I found online concatenate two 3-dimensional arrays.
What I'm looking for is to start with a (possibly empty) 3D array and then keep adding 2D arrays (each of the same dimensions) along one axis.
For example, initially, we have X as an h5 file dataset
X = [[[]]]
A = [[1, 1]
[1, 1]
[1, 1]
[1, 1]
[1, 1]]
# appending A to X
X = [[[1, 1]
[1, 1]
[1, 1]
[1, 1]
[1, 1]]]
# then we have another (5,2) array say
B = [[2, 2]
[2, 2]
[2, 2]
[2, 2]
[2, 2]]
# append B to X we get
X = [[[1, 1]
[1, 1]
[1, 1]
[1, 1]
[1, 1]],
[[2, 2]
[2, 2]
[2, 2]
[2, 2]
[2, 2]]]
It would have been easy to just stack two such 2D arrays, but I want to keep adding 2D arrays to X in such a way that the dimensions along the x- and y-axes remain the same while z keeps growing.
Is this even possible in Python?

Pytorch's packed_sequence/pad_sequence pads tensors vertically for list of tensors

I am trying to pad a sequence of tensors for LSTM mini-batching, where each timestep in the sequence contains a sub-list of values (representing multiple features in a single timestep).
For example, sequence 1 would have 5 timesteps, and within each timestep there are 2 features. An example would be:
Sequence 1 = [[1,2],[2,2],[3,3],[3,2],[3,2]]
Sequence 2 = [[4,2],[5,1],[4,4]]
Sequence 3 = [[6,9]]
I run pytorch's pad_sequence function (this goes for pack_sequence too) like below:
import torch
import torch.nn.utils.rnn as rnn_utils
a = torch.tensor([[1,2],[2,2],[3,3],[3,2],[3,2]])
b = torch.tensor([[4,2],[5,1],[4,4]])
c = torch.tensor([[6,9]])
result = rnn_utils.pad_sequence([a, b, c])
My expected output is as follows:
Sequence 1 = [[1,2],[2,2],[3,3],[3,2],[3,2]]
Sequence 2 = [[4,2],[5,1],[4,4],[0,0],[0,0]]
Sequence 3 = [[6,9],[0,0],[0,0],[0,0],[0,0]]
However, the output I got is as follows:
tensor([[[1, 2],
[4, 2],
[6, 9]],
[[2, 2],
[5, 1],
[0, 0]],
[[3, 3],
[4, 4],
[0, 0]],
[[3, 2],
[0, 0],
[0, 0]],
[[3, 2],
[0, 0],
[0, 0]]])
The padding seems to go vertically rather than what I expect. How do I go about getting the correct padding that I need?
Simply change
result = rnn_utils.pad_sequence([a, b, c])
to
result = rnn_utils.pad_sequence([a, b, c], batch_first=True)
seq1 = result[0]
seq2 = result[1]
seq3 = result[2]
By default, batch_first is False. Output will be in B x T x * if True, or in T x B x * otherwise, where
B is batch size. It is equal to the number of elements in sequences,
T is length of the longest sequence, and
* is any number of trailing dimensions, including none.
output:
tensor([[1, 2],
[2, 2],
[3, 3],
[3, 2],
[3, 2]]) # sequence 1
tensor([[4, 2],
[5, 1],
[4, 4],
[0, 0],
[0, 0]]) # sequence 2
tensor([[6, 9],
[0, 0],
[0, 0],
[0, 0],
[0, 0]]) # sequence 3
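To make the two layouts concrete, here is a minimal NumPy sketch of what the padding amounts to. It is an illustration only, not PyTorch's implementation: pad_batch_first is a hypothetical helper written for this example, and the transpose shows how the B x T x * and T x B x * layouts relate.

```python
import numpy as np

def pad_batch_first(seqs, pad_value=0):
    """Hypothetical helper: pad 2-D arrays to a common length, shape (B, T, features)."""
    max_len = max(len(s) for s in seqs)
    out = np.full((len(seqs), max_len) + seqs[0].shape[1:], pad_value,
                  dtype=seqs[0].dtype)
    for i, s in enumerate(seqs):
        out[i, :len(s)] = s   # copy the real timesteps, leave the rest padded
    return out

a = np.array([[1, 2], [2, 2], [3, 3], [3, 2], [3, 2]])
b = np.array([[4, 2], [5, 1], [4, 4]])
c = np.array([[6, 9]])

batch_major = pad_batch_first([a, b, c])      # shape (3, 5, 2): B x T x *
time_major = batch_major.transpose(1, 0, 2)   # shape (5, 3, 2): T x B x *

print(batch_major[1])   # sequence 2, padded to length 5
print(time_major[0])    # first timestep of every sequence
```

The time-major array has the same "vertical" arrangement as the output in the question, so the padding itself was correct there; only the axis order differed from what was expected.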

How to find missing combinations/sequences in a 2D array with finite element values

In the case of the set np.array([1, 2, 3]), there are only 9 possible combinations/sequences of its constituent elements: [1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3], [3, 1], [3, 2], [3, 3].
If we have the following array:
np.array([[1, 1],
          [1, 2],
          [1, 3],
          [2, 2],
          [2, 3],
          [3, 1],
          [3, 2]])
What is the best way, with NumPy/SciPy, to determine that [2, 1] and [3, 3] are missing? Put another way, how do we find the inverse list of sequences (when we know all of the possible element values)? Manually doing this with a couple of for loops is easy to figure out, but that would negate whatever speed gains we get from using NumPy over native Python (especially with larger datasets).
You can generate a list of all possible pairs using itertools.product and collect those that are not in your array:
from itertools import product
pairs = [ [1, 1], [1, 2], [1, 3], [2, 2], [2, 3], [3, 1], [3, 2] ]
allPairs = list(map(list, product([1, 2, 3], repeat=2)))
missingPairs = [ pair for pair in allPairs if pair not in pairs ]
print(missingPairs)
Result:
[[2, 1], [3, 3]]
Note that map(list, ...) is needed to convert the tuples returned by product into lists, so that they can be compared with the entries of your list of lists. This can be simplified if your input is already a list of tuples.
This is one way using itertools.product and set.
The trick here is to note that sets may only contain immutable types such as tuples.
import numpy as np
from itertools import product
x = np.array([1, 2, 3])
y = np.array([[1, 1], [1, 2], [1, 3], [2, 2],
[2, 3], [3, 1], [3, 2]])
set(product(x, repeat=2)) - set(map(tuple, y))
{(2, 1), (3, 3)}
If you want to stay in numpy instead of going back to raw Python sets, you can do it using void views (based on @Jaime's answer) and numpy's built-in set methods like in1d:
def vview(a):
    # View each row as a single opaque item so whole rows can be compared.
    return np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
x = np.array([1, 2, 3])
y = np.array([[1, 1], [1, 2], [1, 3], [2, 2],
[2, 3], [3, 1], [3, 2]])
xx = np.array([i.ravel() for i in np.meshgrid(x, x)]).T
xx[~np.in1d(vview(xx), vview(y))]
array([[2, 1],
[3, 3]])
If you want to use numpy methods, try this...
import itertools
import numpy as np
a = np.array([1, 2, 3])
b = np.array([[1, 1],
              [1, 2],
              [1, 3],
              [2, 2],
              [2, 3],
              [3, 1],
              [3, 2]])
c = np.array(list(itertools.product(a, repeat=2)))
Compare the array being tested against the product using broadcasting
d = b == c[:,None,:]
#d.shape is (9,7,2)
Check if both elements of a pair matched
e = np.all(d, -1)
#e.shape is (9,7)
Check if any of the test items match an item of the product.
f = np.any(e, 1)
#f.shape is (9,)
Use f as a boolean index into the product to see what is missing.
>>> print(c[np.logical_not(f)])
[[2 1]
[3 3]]
>>>
Every combination corresponds to a number in the range 0..L^2-1, where L=len(array). For example, [2, 2] => 3*(2-1)+(2-1) = 4. The -1 offsets arise because the elements start from 1, not from zero. Such a mapping can be considered a natural perfect hash for this data type.
If operations on integer sets in NumPy are faster than operations on pairs (for example, an integer set of known size can be represented by a bit sequence), then it is worth traversing the pair list, marking the corresponding bits in the integer set, then looking for the unset bits and decoding them back into pairs.
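The encoding described above can be sketched as follows. This is a minimal example that assumes the allowed values are exactly 1..L, so that a - 1 is a valid zero-based digit; the names codes, seen and missing are only illustrative.

```python
import numpy as np

values = np.array([1, 2, 3])          # the allowed element values, assumed to be 1..L
y = np.array([[1, 1], [1, 2], [1, 3], [2, 2],
              [2, 3], [3, 1], [3, 2]])
L = len(values)

# Encode each pair [a, b] as the integer L*(a-1) + (b-1), in 0..L*L-1.
codes = L * (y[:, 0] - 1) + (y[:, 1] - 1)

# Mark the codes that occur, then decode the ones that never did.
seen = np.zeros(L * L, dtype=bool)
seen[codes] = True
missing_codes = np.flatnonzero(~seen)
missing = np.stack([missing_codes // L + 1, missing_codes % L + 1], axis=1)
print(missing)
# [[2 1]
#  [3 3]]
```

If the values are not contiguous integers starting at 1, they would first have to be mapped to 0..L-1 (for example with np.searchsorted on the sorted values) before encoding.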

numpy multiple slicing booleans

I'm having trouble editing values in a numpy array
import numpy as np
foo = np.ones((10, 10, 2))
foo[row_criteria, col_criteria, 0] += 5
foo[row_criteria,:,0][:,col_criteria] += 5
row_criteria and col_criteria are boolean arrays (1D). In the first case I get a
"shape mismatch: objects cannot be broadcast to a single shape" error
In the second case, += 5 doesn't get applied at all. When I do
foo[row_criteria,:,0][:,col_criteria] + 5
I get a modified return value but modifying the value in place doesn't seem to work...
Can someone explain how to fix this? Thanks!
You want:
foo[np.ix_(row_criteria, col_criteria, [0])] += 5
To understand how this works take this example:
import numpy as np
A = np.arange(25).reshape([5, 5])
print(A[[0, 2, 4], [0, 2, 4]])
# [ 0 12 24]
# The above example gives the elements A[0, 0], A[2, 2], A[4, 4]
# But what if I want the "outer product"? I.e. for [[0, 2, 4], [1, 3]] I want
# A[0, 1], A[0, 3], A[2, 1], A[2, 3], A[4, 1], A[4, 3]
print(A[np.ix_([0, 2, 4], [1, 3])])
# [[ 1  3]
#  [11 13]
#  [21 23]]
The same thing works with boolean indexing. Also, np.ix_ doesn't do anything really amazing; it just reshapes its arguments so they can be broadcast against each other:
i, j = np.ix_([0, 2, 4], [1, 3])
print(i.shape)
# (3, 1)
print(j.shape)
# (1, 2)
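Applied to the arrays from the question, a minimal sketch might look like the following. The two boolean criteria are made up for illustration, since the question does not show them. It also shows why the chained form leaves foo untouched: indexing with a boolean array returns a copy, so the += only modifies a temporary.

```python
import numpy as np

foo = np.ones((10, 10, 2))
# Hypothetical criteria, chosen only for illustration.
row_criteria = np.arange(10) % 2 == 0   # rows 0, 2, 4, 6, 8
col_criteria = np.arange(10) < 3        # columns 0, 1, 2

# Chained advanced indexing works on a temporary copy, so foo is unchanged.
foo[row_criteria, :, 0][:, col_criteria] += 5
print(foo.sum())   # 200.0

# np.ix_ builds an open mesh, so the update hits every (row, col) combination.
foo[np.ix_(row_criteria, col_criteria, [0])] += 5
print(foo[0, 0, 0], foo[1, 0, 0], foo[0, 0, 1])   # 6.0 1.0 1.0
print(foo.sum())   # 275.0
```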

Sort a numpy array like a table

I have a list
[[0, 3], [5, 1], [2, 1], [4, 5]]
which I have made into an array using numpy.array:
[[0 3]
[5 1]
[2 1]
[4 5]]
How do I sort this like a table? In particular, I want to sort by the second column in ascending order and then resolve any ties by having the first column sorted in ascending order. Thus I desire:
[[2 1]
[5 1]
[0 3]
[4 5]]
Any help would be greatly appreciated!
See http://docs.scipy.org/doc/numpy/reference/generated/numpy.lexsort.html#numpy.lexsort
Specifically in your case,
import numpy as np
x = np.array([[0,3],[5,1],[2,1],[4,5]])
x[np.lexsort((x[:,0],x[:,1]))]
outputs
array([[2,1],[5,1],[0,3],[4,5]])
You can use numpy.lexsort():
>>> a = numpy.array([[0, 3], [5, 1], [2, 1], [4, 5]])
>>> a[numpy.lexsort(a.T)]
array([[2, 1],
[5, 1],
[0, 3],
[4, 5]])
Another way of doing this - slice out the bits of data you want, get the sort indices using argsort, then use the result of that to slice your original array:
a = np.array([[0, 3], [5, 1], [2, 1], [4, 5]])
subarray = a[:,1] # 3,1,1,5
indices = np.argsort(subarray) # Returns array([1,2,0,3])
result = a[indices]
Or, all in one go:
a[np.argsort(a[:,1])]
If you want to sort using a single column only (e.g., second column), you can do something like:
from operator import itemgetter
a = [[0, 3], [5, 1], [2, 1], [4, 5]]
a_sorted = sorted(a, key=itemgetter(1))
If there is more than one key, then use numpy.lexsort() as pointed out in the other answers.
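One detail worth spelling out is the key order: numpy.lexsort treats the last key as the primary one. A plain argsort on the second column alone does not promise any particular order among ties. As a minimal sketch (the variable names are only illustrative), here is the lexsort call next to an equivalent two-pass stable sort:

```python
import numpy as np

a = np.array([[0, 3], [5, 1], [2, 1], [4, 5]])

# lexsort: the LAST key is the primary key, so column 1 comes last.
by_lexsort = a[np.lexsort((a[:, 0], a[:, 1]))]

# Same result with two stable sorts: secondary key first, primary key last.
tmp = a[np.argsort(a[:, 0], kind='stable')]
by_two_pass = tmp[np.argsort(tmp[:, 1], kind='stable')]

print(by_lexsort)
# [[2 1]
#  [5 1]
#  [0 3]
#  [4 5]]
```

The two-pass version relies on the second sort being stable, which is why kind='stable' is passed explicitly.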
