Flatten a dataset in TensorFlow - python

I am trying to convert a dataset in TensorFlow to have several single-valued tensors. The dataset currently looks like this:
[12 43 64 34 45 2 13 54] [34 65 34 67 87 12 23 43] [23 53 23 1 5] ...
After the transformation it should look like this:
[12] [43] [64] [34] [45] [2] [13] [54] [34] [65] [34] [67] [87] [12] ...
My initial idea was using flat_map on the data set and then converting each tensor to a list of tensors using reshape and unstack:
output_labels = self.dataset.flat_map(convert_labels)
...
def convert_labels(tensor):
    id_list = tf.unstack(tf.reshape(tensor, [-1, 1]))
    return tf.data.Dataset.from_tensors(id_list)
However, the shape of each tensor is only partially known (i.e. (?, 1)), which is why the unstack operation fails. Is there any way to still "concat" the different tensors without explicitly iterating over them?

Your solution is very close, but Dataset.flat_map() takes a function that returns a tf.data.Dataset object, rather than a list of tensors. Fortunately, the Dataset.from_tensor_slices() method works for exactly your use case, because it can split a tensor into a variable number of elements:
output_labels = self.dataset.flat_map(tf.data.Dataset.from_tensor_slices)
Note that the tf.contrib.data.unbatch() transformation implements the same functionality, and has a slightly more efficient implementation in the current master branch of TensorFlow (will be included in the 1.9 release):
output_labels = self.dataset.apply(tf.contrib.data.unbatch())
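For intuition, the flattening that both snippets above perform can be sketched with plain NumPy (using the values from the question): each variable-length array is sliced along its first axis and the resulting single-valued elements are chained into one flat stream.

```python
import numpy as np

# The ragged "dataset" from the question: variable-length label arrays.
batches = [np.array([12, 43, 64, 34, 45, 2, 13, 54]),
           np.array([34, 65, 34, 67, 87, 12, 23, 43]),
           np.array([23, 53, 23, 1, 5])]

# flat_map(from_tensor_slices) slices each array along axis 0 and
# chains the resulting elements into one flat stream.
flattened = [np.array([v]) for batch in batches for v in batch]

print(flattened[:3])  # [array([12]), array([43]), array([64])]
```

In TensorFlow 2.x, this transformation is also available directly as `dataset.unbatch()`.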

Related

Assign values from small matrix to specified places in larger matrix

I would like to know if there is a way of doing the following (shown in Mathematica) in Python:
Mathematica
I have tried it in Python and it does not work. I have also tried it with numpy.put() and with two simple for loops. These two approaches work properly, but I find them very time-consuming with larger matrices (3000×3000 elements, for example).
Described problem in Python:
import numpy as np
a = np.arange(0, 25, 1).reshape(5, 5)
b = np.arange(100, 500, 100).reshape(2, 2)
p = np.array([0, 3])
a[p][:, p] = b
which leaves matrix a unchanged.
Perhaps you are looking for this:
a[p[...,None], p] = b
Array a after the above assignment looks like this:
[[100 1 2 200 4]
[ 5 6 7 8 9]
[ 10 11 12 13 14]
[300 16 17 400 19]
[ 20 21 22 23 24]]
As documented in Integer Array Indexing, the two integer index arrays are broadcast together and iterated in lockstep, which effectively indexes the locations a[0,0], a[0,3], a[3,0], and a[3,3]. The assignment then performs an element-wise write at these locations of a, using the corresponding element values from the right-hand side.
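An equivalent and arguably more readable spelling uses np.ix_, which builds the broadcastable open-mesh index arrays for you (same a, b, and p as in the question):

```python
import numpy as np

a = np.arange(25).reshape(5, 5)
b = np.arange(100, 500, 100).reshape(2, 2)
p = np.array([0, 3])

# np.ix_ turns (p, p) into index arrays of shapes (2, 1) and (1, 2),
# which broadcast to select the full 2x2 cross product of rows and
# columns -- the same cells as a[p[..., None], p].
a[np.ix_(p, p)] = b

print(a[0, 0], a[0, 3], a[3, 0], a[3, 3])  # 100 200 300 400
```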

How to implement fast numpy array computation with multiple occurring slice indices?

I was recently wondering how I could bypass the following NumPy behavior.
Starting with a simple example:
import numpy as np
a = np.array([[1,2,3,4,5,6,7,8,9,0], [11, 12, 13, 14, 15, 16, 17, 18, 19, 10]])
then:
b = a.copy()
b[:, [0,1,4,8]] = b[:, [0,1,4,8]] + 50
print(b)
...results in printing:
[[51 52 3 4 55 6 7 8 59 0]
[61 62 13 14 65 16 17 18 69 10]]
but when one index appears twice in the slice:
c = a.copy()
c[:, [0,1,4,4,8]] = c[:, [0,1,4,4,8]] + 50
print(c)
giving:
[[51 52 3 4 55 6 7 8 59 0]
[61 62 13 14 65 16 17 18 69 10]]
(in short: they do the same thing)
Could I instead have the update applied twice for index 4?
More generally: if a slice index i appears r times, can the expression be applied r times, instead of NumPy taking it into account only once? And what if "50" is replaced by a value that differs for every occurrence of i?
For my current code, I used:
w[p1] = w[p1] + D[pix]
where "pix" and "p1" are NumPy arrays of dtype int and the same length, in which some integers may appear multiple times.
(So one may have pix = [..., 1,1,1,2,2,3,...] alongside p1 = [..., 21,32,13,23,11,78,...]; on its own, the expression above takes for index 1 only the first occurrence and the corresponding 21, and discards the rest of the ones.)
Of course, a for loop would solve the problem easily. The point is that both the integers and the sizes of the arrays are huge, so for loops would cost far more computational resources than efficient NumPy array routines. Any ideas, or links to relevant documentation?
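No answer is included in this excerpt, but one standard tool for exactly this situation (offered here as a suggestion) is np.add.at, the unbuffered equivalent of the in-place add: with it, repeated indices accumulate instead of being collapsed into a single write.

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
              [11, 12, 13, 14, 15, 16, 17, 18, 19, 10]])

c = a.copy()
# Unbuffered in-place add: index 4 appears twice, so column 4
# receives +50 twice (unlike c[:, idx] += 50, which adds it once).
np.add.at(c, (slice(None), [0, 1, 4, 4, 8]), 50)

print(c[0])  # [ 51  52   3   4 105   6   7   8  59   0]
```

For the w[p1] += D[pix] case with per-occurrence values, np.add.at(w, p1, D[pix]) likewise accumulates every occurrence of a repeated index in p1.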

How to index a ndarray with another ndarray?

I am doing some machine learning stuff in python/numpy in which I want to index a 2-dimensional ndarray with a 1-D ndarray, so that I get a 1-D array with the indexed values.
I got it to work with an ugly piece of code, and I would like to know if there is a better way, because this seems unnatural for such a nice combination of language and module as Python + NumPy.
a = np.arange(50).reshape(10, 5) # Array to be indexed
b = np.arange(9, -1, -2) # Indexing array
print(a)
print(b)
print(a[b, np.arange(0, a.shape[1]).reshape(1,a.shape[1])])
#Prints:
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]
[20 21 22 23 24]
[25 26 27 28 29]
[30 31 32 33 34]
[35 36 37 38 39]
[40 41 42 43 44]
[45 46 47 48 49]]
[9 7 5 3 1]
[[45 36 27 18 9]]
This is exactly what I want (even though it is technically a 2-D ndarray), but it seems very complicated. Is there a neater, tidier way?
Edit:
To clarify: I actually do not want a 1-D array; that was poorly explained. I do want one dimension of length 1, because that is what I need for processing it later, but this is easily achieved with a reshape() call. Sorry for the confusion; I mixed my actual code with the more general question.
You want a 1D array, yet you included a reshape call whose only purpose is to take the array from the format you want to a format you don't want.
Stop reshaping the arange output. Also, you don't need to specify the 0 start value explicitly:
result = a[b, np.arange(a.shape[1])]
You can just use np.diagonal to get what you want; no need for reshape or indexing. The tricky part is identifying the pattern you want to obtain, which is basically the diagonal elements of the matrix a[b].
a = np.arange(50).reshape(10, 5) # Array to be indexed
b = np.arange(9, -1, -2) # Indexing array
print(np.diagonal(a[b]))
# [45 36 27 18 9]
As @user2357112 mentioned in the comments, the return value of np.diagonal is read-only. That would be a problem if you plan to modify the values in the result; if you just want to print them or use them for further indexing, it is fine.
As per the docs
Starting in NumPy 1.9 it returns a read-only view on the original array. Attempting to write to the resulting array will produce an error.
In some future release, it will return a read/write view and writing to the returned array will alter your original array. The returned array will have the same type as the input array.
If you don’t write to the array returned by this function, then you can just ignore all of the above.
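Putting the two answers together (same a and b as in the question): the fancy-indexing form returns a normal writable array, while np.diagonal returns a read-only view, so copy it if you intend to write to the result.

```python
import numpy as np

a = np.arange(50).reshape(10, 5)
b = np.arange(9, -1, -2)

# Fancy indexing: pairs each row index in b with a column index,
# giving a writable 1-D array.
r1 = a[b, np.arange(a.shape[1])]

# Diagonal of a[b]: the same values, but read-only; .copy() makes
# the result writable.
r2 = np.diagonal(a[b]).copy()

print(r1)  # [45 36 27 18  9]
assert (r1 == r2).all()
```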

Python subscript syntax clarification

Can you clarify what the [:, :5] part does in the following code segment?
for i in range(5):
    weights = None
    test_inputs = testset[i][:, :5]
    test_inputs = test_inputs.astype(np.float32)
    test_answer = testset[i][:, :5]
    test_answer = code_answer(test_answer)
This is explained in the NumPy indexing guide of the manual; it is not standard Python syntax.
If you have an array a, a[:] returns a view (not a copy; assigning to it will change a) of the whole array, and a[:5] returns a view of elements 0, 1, ..., 4.
NumPy additionally allows multi-dimensional arrays to be indexed as a[i, j] instead of the pure-Python a[i][j].
That should cover all your cases.
This code probably uses NumPy arrays.
NumPy defines more elaborate array slicing, similar to Matlab's.
The [:, :5] means that from your array (probably a 2-D array, returned by testset[i]) you take all the rows (designated by :) and the columns from the beginning up to, but not including, column 5.
Each part of the slicing expression follows regular Python slicing syntax.
The [:, :5] itself is actually interpreted as a[(:, :5)], since in Python comma-separated values without parentheses are interpreted as a tuple.
The array object can handle tuples that denote complex slicing patterns.
That is, in short, the meaning of this syntax.
For more information, see the NumPy documentation; you can start at www.scipy.org.
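The tuple interpretation described above can be checked directly: a[:, :5] is shorthand for indexing with a tuple of slice objects (using a small array made up for illustration).

```python
import numpy as np

a = np.arange(24).reshape(4, 6)

# a[:, :5] is sugar for a[(slice(None), slice(None, 5))]:
# all rows, columns 0 through 4.
s1 = a[:, :5]
s2 = a[(slice(None), slice(None, 5))]

assert (s1 == s2).all()
print(s1.shape)  # (4, 5)
```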
As mentioned by others, the array being referenced is from the numpy package (see its home page).
An example should help. First create a multi-dimensional array structure similar to what the code provided is manipulating:
from numpy import *
a = arange(36).reshape(6, 6)
b = a.reshape(6, 3, 2).swapaxes(0, 2)
print(b)
[[[0 6 12 18 24 30]
[2 8 14 20 26 32]
[4 10 16 22 28 34]]
[[1 7 13 19 25 31]
[3 9 15 21 27 33]
[5 11 17 23 29 35]]]
The [:, :5] syntax is an array slicing mechanism that keeps only the first five entries along the last axis:
print(b[1][:, :5])
[[1 7 13 19 25]
[3 9 15 21 27]
[5 11 17 23 29]]

Mapping functions of 2D numpy arrays

I have a function foo that takes a NxM numpy array as an argument and returns a scalar value. I have a AxNxM numpy array data, over which I'd like to map foo to give me a resultant numpy array of length A.
Currently, I'm doing this:
result = numpy.array([foo(x) for x in data])
It works, but it seems like I'm not taking advantage of the numpy magic (and speed). Is there a better way?
I've looked at numpy.vectorize, and numpy.apply_along_axis, but neither works for a function of 2D arrays.
EDIT: I'm doing boosted regression on 24x24 image patches, so my AxNxM is something like 1000x24x24. What I called foo above applies a Haar-like feature to a patch (so, not terribly computationally intensive).
If NxM is big (say, 100×100), then the cost of iterating over A is amortized into basically nothing.
Say the array is 1000×100×100: iterating contributes O(1000) overhead, but the cumulative cost of the inner function is O(1000×100×100), 10,000 times as much work, so the loop overhead is negligible.
I'm not sure, but you could try this:
result = numpy.empty(data.shape[0])
for i in range(len(data)):
    result[i] = foo(data[i])
You would save a bit of memory allocation on building the list, but the loop overhead would be greater.
Or you could write a parallel version of the loop and split it across multiple processes. That could be a lot faster, depending on how intensive foo is (since it would have to offset the cost of shuttling the data around).
You can achieve that by reshaping your 3D array as a 2D array with the same leading dimension, and wrapping your function foo with a function that works on 1D arrays by reshaping them as required by foo. An example (using trace instead of foo):
from numpy import *
def apply2d_along_first(func2d, arr3d):
    a, n, m = arr3d.shape
    def func1d(arr1d):
        return func2d(arr1d.reshape((n, m)))
    arr2d = arr3d.reshape((a, n*m))
    return apply_along_axis(func1d, -1, arr2d)

A, N, M = 3, 4, 5
data = arange(A*N*M).reshape((A, N, M))
print(data)
print(apply2d_along_first(trace, data))
Output:
[[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]]
[[20 21 22 23 24]
[25 26 27 28 29]
[30 31 32 33 34]
[35 36 37 38 39]]
[[40 41 42 43 44]
[45 46 47 48 49]
[50 51 52 53 54]
[55 56 57 58 59]]]
[ 36 116 196]
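For the trace example specifically, the loop can be avoided entirely: np.trace accepts axis1/axis2 arguments and handles the leading batch axis itself. This is a sketch; whether your real foo vectorizes this way depends on what it computes.

```python
import numpy as np

A, N, M = 3, 4, 5
data = np.arange(A * N * M).reshape((A, N, M))

# Trace over the last two axes for every leading "slice" at once,
# giving one scalar per 4x5 patch -- shape (A,).
batched = np.trace(data, axis1=1, axis2=2)

print(batched)  # [ 36 116 196]
```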
