How can I take out (or slice) the elements in a rank-2 tensor whose first element is unique? - python

My title might be ambiguous due to my awkward English. But I mean this:
Suppose I have a tensor a like this:
array([[1, 2, 3],
       [2, 2, 3],
       [2, 2, 4],
       [3, 2, 3],
       [4, 2, 3]], dtype=int32)
The first column of this tensor can contain duplicate elements (e.g. [1, 2, 2, 3, 4] or [1, 1, 2, 3, 3, 4, 5, 5]), and which elements are duplicated is not known beforehand.
I want to extract a tensor like this:
array([[1, 2, 3],
       [2, 2, 3],
       [3, 2, 3],
       [4, 2, 3]], dtype=int32)
As you can see, I keep the first row for each distinct value in the first column of a.
I first wanted to use the function tf.unique(), BUT the idx value it returns doesn't indicate the first index of each value of the output tensor in the original tensor.
tf.unique() works like this:
# tensor 'x' is [1, 1, 2, 3, 3, 3, 7, 8, 8]
y, idx = tf.unique(x)
y ==> [1, 2, 3, 7, 8]
idx ==> [0, 0, 1, 2, 2, 2, 3, 4, 4]
The function tf.unique(x, name=None) finds the unique elements in a 1-D tensor. It currently returns two values: y and idx. y contains all of the unique elements of x, sorted in the same order that they occur in x. idx contains the index of each value of x in the unique output y.
I wish it had a third return value containing the first index of each value of y in the original tensor x. It might work like this:
# tensor 'x' is [1, 1, 2, 3, 3, 3, 7, 8, 8]
y, idx, idx_ori = tf.unique(x)
y ==> [1, 2, 3, 7, 8]
idx ==> [0, 0, 1, 2, 2, 2, 3, 4, 4]
idx_ori ==> [0, 2, 3, 6, 7]
Just like its NumPy equivalent does:
# array 'x' is [1, 1, 2, 3, 3, 3, 7, 8, 8]
y, idx_ori = np.unique(x, return_index=True)
y ==> [1, 2, 3, 7, 8]
idx_ori ==> [0, 2, 3, 6, 7]
If I had this idx_ori, I could solve my problem with tf.gather():
_, _, idx_ori = tf.unique(a[:, 0])  # hypothetical third return value
result = tf.gather(a, idx_ori)
Any idea how to work around this problem, or how to get the indices I want?
P.S. I know my description is tediously long ... :-p

This is a bit gross, but you could do:
print(a)
y, idx = tf.unique(a[:, 0])
z = tf.one_hot(idx, tf.shape(y)[0])    # one-hot row per element, hot at its unique id
s = tf.cumsum(z)                       # running count of each unique id down the rows
e = tf.equal(s, 1)                     # True where an id has been seen exactly once so far
ss = tf.to_int32(e) * tf.to_int32(z)   # keep the hot entry only at the first occurrence
m = tf.reduce_max(ss, reduction_indices=1)  # 1 for rows that are a first occurrence
out = tf.boolean_mask(a, tf.equal(m, 1))
sess = tf.Session()
print(sess.run(out))
[[1 2 3]
 [2 2 3]
 [2 2 4]
 [3 2 3]
 [4 2 3]]
[[1 2 3]
 [2 2 3]
 [3 2 3]
 [4 2 3]]
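If your TensorFlow build has tf.unsorted_segment_max, you can also compute the wished-for idx_ori directly: the first occurrence of each unique id is the smallest position carrying that id, which is a segment max over negated positions. A sketch under that assumption:
import numpy as np
import tensorflow as tf

a = np.array([[1, 2, 3], [2, 2, 3], [2, 2, 4], [3, 2, 3], [4, 2, 3]], dtype=np.int32)
y, idx = tf.unique(a[:, 0])
pos = tf.range(tf.shape(a)[0])
# min position per segment == -(max of the negated positions)
idx_ori = -tf.unsorted_segment_max(-pos, idx, tf.shape(y)[0])
result = tf.gather(a, idx_ori)
with tf.Session() as sess:
    print(sess.run(result))  # [[1 2 3] [2 2 3] [3 2 3] [4 2 3]]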

Related

Create histogram from two arrays

I have two numpy arrays with the same dimensions: weights and percents. percents is the 'real' data, and weights is how many of each 'real' value there is in the histogram.
E.g.:
weights = [[0, 1, 1, 4, 2],
           [0, 1, 0, 3, 5]]
percents = [[1, 2, 3, 4, 5],
            [1, 2, 3, 4, 5]]
(every row of percents is the same)
I would like to "multiply" these together in such a way that I produce weights[x] * [percents[x]]:
results = [[0 * [1] + 1 * [2] + 1 * [3] + 4 * [4] + 2 * [5]],
           [0 * [1] + 1 * [2] + 0 * [3] + 3 * [4] + 5 * [5]]]
        = [[2, 3, 4, 4, 4, 4, 5, 5],
           [2, 4, 4, 4, 5, 5, 5, 5, 5]]
Notice that the lengths of the rows can differ. Ideally this could be done in numpy, but because of the ragged rows it may end up being a list of lists.
Edit:
I've been able to cobble together these nested for loops, but obviously it's not ideal:
list_of_hists = []
for index in df.index:
    hist = []
    # Create a list of lists, later to be flattened to 'results'
    for i, percent in enumerate(percents):
        # For each percent, create a list of [percent] * weight
        hist.append([percent] * int(df.iloc[index].values[i]))
    # flatten the list of lists in hist
    results = [val for list_ in hist for val in list_]
    list_of_hists.append(results)
There is np.repeat, designed for exactly this kind of operation, but it doesn't take a 2-D array of repeat counts directly. So you need to work with flattened views of the arrays instead.
weights = np.array([[0, 1, 1, 4, 2], [0, 1, 0, 3, 5]])
percents = np.array([[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]])
>>> np.repeat(percents.ravel(), weights.ravel())
array([2, 3, 4, 4, 4, 4, 5, 5, 2, 4, 4, 4, 5, 5, 5, 5, 5])
And after that you need to choose the index locations at which to split it:
>>> np.split(np.repeat(percents.ravel(), weights.ravel()), np.cumsum(np.sum(weights, axis=1)[:-1]))
[array([2, 3, 4, 4, 4, 4, 5, 5]), array([2, 4, 4, 4, 5, 5, 5, 5, 5])]
Note that np.split is a rather inefficient operation, as is the attempt to build an array out of rows of unequal lengths.
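Since np.repeat does accept per-element counts on a 1-D array, a simple alternative is to apply it row by row; a minimal sketch (it yields a list of arrays, since the rows are ragged):
import numpy as np

weights = np.array([[0, 1, 1, 4, 2], [0, 1, 0, 3, 5]])
percents = np.array([[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]])
res = [np.repeat(p, w) for p, w in zip(percents, weights)]
# [array([2, 3, 4, 4, 4, 4, 5, 5]), array([2, 4, 4, 4, 5, 5, 5, 5, 5])]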
You can use a list comprehension and reduce from functools:
import functools

res = [functools.reduce(lambda x, y: x + y,
                        [x * [y] for x, y in zip(w, p)])
       for w, p in zip(weights, percents)]
OUTPUT:
[[2, 3, 4, 4, 4, 4, 5, 5],
 [2, 4, 4, 4, 5, 5, 5, 5, 5]]
Or, a solution with a list comprehension only:
res = [[j for i in [x * [y] for x, y in zip(w, p)]
        for j in i]
       for w, p in zip(weights, percents)]
OUTPUT:
[[2, 3, 4, 4, 4, 4, 5, 5],
 [2, 4, 4, 4, 5, 5, 5, 5, 5]]
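The flatten inside each row can also be written with itertools.chain.from_iterable, which some find more readable; a small sketch of the same computation:
from itertools import chain

res = [list(chain.from_iterable([y] * x for x, y in zip(w, p)))
       for w, p in zip(weights, percents)]
# [[2, 3, 4, 4, 4, 4, 5, 5], [2, 4, 4, 4, 5, 5, 5, 5, 5]]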

Numpy: Imposing row dependent maximum on array

Suppose I have the following array:
a = [[1, 4, 2, 3],
     [3, 1, 5, 4],
     [4, 3, 1, 2]]
What I'd like to do is impose a maximum value on the array, but have that maximum vary by row. For instance if I wanted to limit the 1st and 3rd row to a maximum value of 3, and the 2nd row to a value of 4, I could create something like:
[[1, 3, 2, 3],
 [3, 1, 4, 4],
 [3, 3, 1, 2]]
Is there any better way than just looping over each row individually and setting it with 'nonzero'?
With numpy.clip (using the method version here):
a.clip(max=np.array([3, 4, 3])[:, None])  # np.clip(a, ...)
# array([[1, 3, 2, 3],
#        [3, 1, 4, 4],
#        [3, 3, 1, 2]])
Generalized:
def clip_2d_rows(a, maxs):
    maxs = np.asanyarray(maxs)
    if maxs.ndim == 1:
        maxs = maxs[:, np.newaxis]
    return np.clip(a, a_min=None, a_max=maxs)
You might be safer using the module-level function (np.clip) rather than the method (np.ndarray.clip). The former names its parameter a_max, while the latter uses max as a parameter name, shadowing the builtin, which is never a great idea.
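Equivalently, since only an upper bound is imposed here, np.minimum broadcasts the same way; a small sketch of the same idea:
import numpy as np

a = np.array([[1, 4, 2, 3], [3, 1, 5, 4], [4, 3, 1, 2]])
row_max = np.array([3, 4, 3])[:, None]  # column vector, broadcasts across each row
print(np.minimum(a, row_max))
# [[1 3 2 3]
#  [3 1 4 4]
#  [3 3 1 2]]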
With masking -
In [50]: row_lims = np.array([3, 4, 3])
In [51]: np.where(a > row_lims[:, None], row_lims[:, None], a)
Out[51]:
array([[1, 3, 2, 3],
       [3, 1, 4, 4],
       [3, 3, 1, 2]])
With
>>> a
array([[1, 4, 2, 3],
       [3, 1, 5, 4],
       [4, 3, 1, 2]])
Say you have
>>> maxs = np.array([[3], [4], [3]])
>>> maxs
array([[3],
       [4],
       [3]])
What about doing
>>> a.clip(max=maxs)
array([[1, 3, 2, 3],
       [3, 1, 4, 4],
       [3, 3, 1, 2]])

Repeat a NumPy array in multiple dimensions at once?

np.repeat(np.repeat([[1, 2, 3]], 3, axis=0), 3, axis=1)
works as expected and produces
array([[1, 1, 1, 2, 2, 2, 3, 3, 3],
       [1, 1, 1, 2, 2, 2, 3, 3, 3],
       [1, 1, 1, 2, 2, 2, 3, 3, 3]])
However,
np.repeat([[1, 2, 3]], [3, 3])
and
np.repeat([[1, 2, 3]], [3, 3], axis=0)
produce errors.
Is it possible to repeat an array in multiple dimensions at once?
First off, I think the method you propose is totally fine. It's readable, it makes sense, and it's not very slow.
You could use the repeat method instead of the function, which reads a bit more nicely:
>>> x.repeat(3, 1).repeat(3, 0)
array([[1, 1, 1, 2, 2, 2, 3, 3, 3],
       [1, 1, 1, 2, 2, 2, 3, 3, 3],
       [1, 1, 1, 2, 2, 2, 3, 3, 3]])
With numpy's broadcasting rules, there are likely dozens of ways to create the repeated data and throw it into the shape you want. One approach could be to use np.broadcast_to() and repeat the data in D+1 dimensions, where D is the dimension you need, and then collapse it down to D.
For example:
>>> x = np.array([[1, 2, 3]])
>>> np.broadcast_to(x.T, (3, 3, 3)).reshape((3, 9))
array([[1, 1, 1, 2, 2, 2, 3, 3, 3],
       [1, 1, 1, 2, 2, 2, 3, 3, 3],
       [1, 1, 1, 2, 2, 2, 3, 3, 3]])
And without reshaping (so that you don't need to know the final length):
>>> np.hstack(np.broadcast_to(x, (3, 3, 3)).T)
array([[1, 1, 1, 2, 2, 2, 3, 3, 3],
       [1, 1, 1, 2, 2, 2, 3, 3, 3],
       [1, 1, 1, 2, 2, 2, 3, 3, 3]])
And there's likely a dozen other ways to do this. But I still think your original version is more idiomatic, as throwing it into extra dimensions to collapse it down is weird.
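For completeness, one of those other ways: np.kron against a block of ones repeats along both axes in a single call. A sketch, not part of the answers above:
import numpy as np

x = np.array([[1, 2, 3]])
print(np.kron(x, np.ones((3, 3), dtype=x.dtype)))
# [[1 1 1 2 2 2 3 3 3]
#  [1 1 1 2 2 2 3 3 3]
#  [1 1 1 2 2 2 3 3 3]]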
It isn't possible in a single call; see the docs for repeat. You are using an array with shape (1, 3), so you would have to use:
np.repeat(X, [2], axis=0)
because np.repeat(X, [2, 2], axis=0) needs an input of shape (2, 3), e.g.
X = np.array([[1, 2, 3], [5, 6, 7]])
np.repeat(X, [2, 5], axis=0)
the output looks like:
[[1 2 3]
 [1 2 3]
 [5 6 7]
 [5 6 7]
 [5 6 7]
 [5 6 7]]
This means [2, 5] stands for: 2x the first row and 5x the second row (the input must have shape (2, *anything*), because axis=0 means you want to repeat the rows).
Therefore you first have to generate an array with dimensions (3, *), and then produce the next array.
If you want to repeat a single-row array like X2 = np.array([[1, 2, 3]]):
np.repeat(X2, [5], axis=0)
produces:
[[1 2 3]
 [1 2 3]
 [1 2 3]
 [1 2 3]
 [1 2 3]]
because the array has only one row.
The first call of np.repeat produces a 2-D array; the second call duplicates the columns. To get the result you mentioned in your post above, you have to call np.repeat a second time on the output of np.repeat(X2, [5], axis=0).
In my opinion, your use of np.repeat is the easiest and best way to achieve your output.
Edit: hopefully the answer is clearer now.
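A minimal sketch of that two-step, per-axis repeat, reproducing the question's one-liner (assuming X2 is the (1, 3) array from above):
import numpy as np

X2 = np.array([[1, 2, 3]])
step1 = np.repeat(X2, 3, axis=0)     # repeat the single row  -> shape (3, 3)
step2 = np.repeat(step1, 3, axis=1)  # then each column       -> shape (3, 9)
# step2 == np.repeat(np.repeat([[1, 2, 3]], 3, axis=0), 3, axis=1)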

Building a matrix of 'rolled' rows efficiently in Numpy

I'd like to construct an (n, n) array from a one-dimensional array, where each row is shifted (with wrapping) by one relative to the previous one. The following code does this:
import numpy as np
r = np.array([1, 2, 3, 4, 5])
n = len(r)
MM = np.zeros((n, n), dtype=r.dtype)
for k in range(n):
    MM[k, :] = np.roll(r, k)
print(MM)
which results in:
[[1 2 3 4 5]
 [5 1 2 3 4]
 [4 5 1 2 3]
 [3 4 5 1 2]
 [2 3 4 5 1]]
Is there a faster way to do this in NumPy, i.e., avoiding the for-loop, for large r?
Take a look at scipy.linalg.circulant
In [255]: r
Out[255]: array([1, 2, 3, 4, 5])
In [256]: circulant(r).T
Out[256]:
array([[1, 2, 3, 4, 5],
       [5, 1, 2, 3, 4],
       [4, 5, 1, 2, 3],
       [3, 4, 5, 1, 2],
       [2, 3, 4, 5, 1]])
or scipy.linalg.toeplitz
In [257]: toeplitz(np.roll(r[::-1], 1), r)
Out[257]:
array([[1, 2, 3, 4, 5],
       [5, 1, 2, 3, 4],
       [4, 5, 1, 2, 3],
       [3, 4, 5, 1, 2],
       [2, 3, 4, 5, 1]])
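If you'd rather stay in plain NumPy, fancy indexing with a shifted index grid builds the same matrix without a Python loop; a sketch:
import numpy as np

r = np.array([1, 2, 3, 4, 5])
n = len(r)
idx = (np.arange(n) - np.arange(n)[:, None]) % n  # idx[k, j] = (j - k) mod n
MM = r[idx]  # row k is np.roll(r, k)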

Remove numpy concat from algo

I have a function called gen_data which makes a single pass through a list and constructs a 3D array. I then iterate across a list of lists, applying gen_data, and concatenate the results together.
fst = lambda x: x[0]
snd = lambda x: x[1]

def gen_data(data, p=0, batch_size=BATCH_SIZE, n_session=N_SESSION, return_target=True):
    x = np.zeros((batch_size, SEQ_LENGTH, vocab_size))
    y = np.zeros(batch_size)
    for n in range(batch_size):
        ptr = n
        for i in range(SEQ_LENGTH):
            x[n, i, char_to_ix[data[p + ptr + i]]] = 1.
        if return_target:
            y[n] = char_to_ix[data[p + ptr + SEQ_LENGTH]]
    return x, np.array(y, dtype='int32')

def batch_data(data):
    nest = [gen_data(datum) for datum in data]
    x = np.concatenate(map(fst, nest))
    y = np.concatenate(map(snd, nest))
    return (x, y)
What is the best way to combine these functions so I do not need to make multiple passes back through the data to concatenate the results?
To clarify, the goal is to remove the need to zip/concat/splat/list-comp in general: to initialize the x tensor to the correct dimensions up front and then fill it across each datum/SEQ_LENGTH/batch_size in a single pass.
Without testing things, here are a few quick fixes:
def gen_data(data, p=0, batch_size=BATCH_SIZE, n_session=N_SESSION, return_target=True):
    x = np.zeros((batch_size, SEQ_LENGTH, vocab_size))
    y = np.zeros(batch_size, dtype=int)  # initialize to the desired type
    for n in range(batch_size):
        ptr = n
        for i in range(SEQ_LENGTH):
            x[n, i, char_to_ix[data[p + ptr + i]]] = 1.
        if return_target:
            y[n] = char_to_ix[data[p + ptr + SEQ_LENGTH]]
    return x, y
# y is already an array; we don't need np.array(y, dtype='int32')
nest = [gen_data(datum) for datum in data] produces, I think, [(x0, y0), (x1, y1), ...], where each x is 3-D (n, m, y) and each y is 1-D (n).
x = np.concatenate([n[0] for n in nest]) (I like this format over mapping) looks OK to me. Compared to all the list comprehension operations, concatenate is relatively cheap. Look at the guts of np.vstack, etc. to see how those use comprehensions along with concatenate.
A small example:
In [515]: def gen():
   .....:     return np.arange(8).reshape(2, 4), np.arange(1, 3)
   .....:
In [516]: gen()
Out[516]:
(array([[0, 1, 2, 3],
        [4, 5, 6, 7]]), array([1, 2]))
In [517]: nest=[gen() for _ in range(3)]
In [518]: nest
Out[518]:
[(array([[0, 1, 2, 3],
         [4, 5, 6, 7]]), array([1, 2])),
 (array([[0, 1, 2, 3],
         [4, 5, 6, 7]]), array([1, 2])),
 (array([[0, 1, 2, 3],
         [4, 5, 6, 7]]), array([1, 2]))]
In [519]: np.concatenate([x[0] for x in nest])
Out[519]:
array([[0, 1, 2, 3],
       [4, 5, 6, 7],
       [0, 1, 2, 3],
       [4, 5, 6, 7],
       [0, 1, 2, 3],
       [4, 5, 6, 7]])
In [520]: np.concatenate([x[1] for x in nest])
Out[520]: array([1, 2, 1, 2, 1, 2])
zip(*...) effectively does a 'transpose' on a nested list, so the arrays could be constructed with:
In [532]: nest1=zip(*nest)
In [533]: np.concatenate(nest1[0])
Out[533]:
array([[0, 1, 2, 3],
       [4, 5, 6, 7],
       [0, 1, 2, 3],
       [4, 5, 6, 7],
       [0, 1, 2, 3],
       [4, 5, 6, 7]])
In [534]: np.concatenate(nest1[1])
Out[534]: array([1, 2, 1, 2, 1, 2])
Still requires concatenates.
Since nest is a list of tuples, it could serve as input to a structured array:
In [524]: arr = np.array(nest, dtype=[('x', '(2,4)int'), ('y', '(2,)int')])
In [525]: arr['x']
Out[525]:
array([[[0, 1, 2, 3],
        [4, 5, 6, 7]],
       [[0, 1, 2, 3],
        [4, 5, 6, 7]],
       [[0, 1, 2, 3],
        [4, 5, 6, 7]]])
In [526]: arr['y']
Out[526]:
array([[1, 2],
       [1, 2],
       [1, 2]])
Another possibility is to initialize x and y and iterate. But you are already doing this in gen_data. The only new thing is that I'd be assigning larger blocks.
x = ...
y = ...
for i in range(...):
    x[i, ...], y[i] = gen(data[i])
I like the comprehension solutions better, but I won't speculate on speeds.
In terms of speed, I think it's the low-level iteration in gen_data that consumes the time. Concatenating larger blocks is relatively fast.
Another idea - since you are iterating over the rows of arrays within gen_data, how about passing views to that function and iterating over those?
def gen_data(data, x=None, y=None):
    # accept arrays or make our own
    if x is None:
        x = np.zeros((3, 4), int)
    if y is None:
        y = np.zeros(3, int)
    for n in range(3):
        x[n, ...] = np.arange(4) + n
        y[n] = n
    return x, y
With no inputs, it generates arrays as before:
In [543]: gen_data(None)
Out[543]:
(array([[0, 1, 2, 3],
        [1, 2, 3, 4],
        [2, 3, 4, 5]]),
 array([0, 1, 2]))
or initialize a pair and iterate over views:
In [544]: x, y = np.zeros((9, 4), int), np.zeros(9, int)
In [546]: for i in range(0, 9, 3):
   .....:     gen_data(None, x[i:i+3, ...], y[i:i+3])
In [547]: x
Out[547]:
array([[0, 1, 2, 3],
       [1, 2, 3, 4],
       [2, 3, 4, 5],
       [0, 1, 2, 3],
       [1, 2, 3, 4],
       [2, 3, 4, 5],
       [0, 1, 2, 3],
       [1, 2, 3, 4],
       [2, 3, 4, 5]])
In [548]: y
Out[548]: array([0, 1, 2, 0, 1, 2, 0, 1, 2])
