Tensorflow: Masking an array based on duplicated elements of another array - python

I have an array, x=[2, 3, 4, 3, 2], which contains the states of a model, and another array, prob=[.2, .1, .4, .1, .2], which gives the corresponding probabilities of these states. Some states are duplicated, and I need to sum their probabilities. So my desired outputs are unique_elems=[2, 3, 4] and reduced_prob=[.2+.2, .1+.1, .4]. Here is my approach:
x = tf.constant([2, 3, 4, 3, 2])
prob = tf.constant([.2, .1, .4, .1, .2])
unique_elems, _ = tf.unique(x) # [2, 3, 4]
unique_elems = tf.expand_dims(unique_elems, axis=1) # [[2], [3], [4]]
tiled_prob = tf.tile(tf.expand_dims(prob, axis=0), [3, 1])
# [[0.2, 0.1, 0.4, 0.1, 0.2],
# [0.2, 0.1, 0.4, 0.1, 0.2],
# [0.2, 0.1, 0.4, 0.1, 0.2]]
equal = tf.equal(x, unique_elems)
# [[ True, False, False, False, True],
# [False, True, False, True, False],
# [False, False, True, False, False]]
reduced_prob = tf.multiply(tiled_prob, tf.cast(equal, tf.float32))
# [[0.2, 0. , 0. , 0. , 0.2],
# [0. , 0.1, 0. , 0.1, 0. ],
# [0. , 0. , 0.4, 0. , 0. ]]
reduced_prob = tf.reduce_sum(reduced_prob, axis=1)
# [0.4, 0.2, 0.4]
This works, but I am wondering whether there is a more efficient way to do it. In particular, I am using the tile operation, which I think is not very efficient for large arrays.

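It can be done in two lines with tf.unsorted_segment_sum (exposed as tf.math.unsorted_segment_sum in TensorFlow 2.x):
unique_elems, idx = tf.unique(x) # unique_elems: [2, 3, 4], idx: [0, 1, 2, 1, 0]
reduced_prob = tf.unsorted_segment_sum(prob, idx, tf.size(unique_elems)) # [0.4, 0.2, 0.4]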
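The same segment-sum idea can be sketched in NumPy, which may be handy for checking results outside TensorFlow. This is only an illustrative equivalent, not part of the answer above. One assumption to keep in mind: np.unique returns sorted values, whereas tf.unique keeps first-occurrence order (the two happen to coincide for this input).

```python
import numpy as np

x = np.array([2, 3, 4, 3, 2])
prob = np.array([0.2, 0.1, 0.4, 0.1, 0.2])

# inverse[i] is the position of x[i] within unique_elems (the segment id)
unique_elems, inverse = np.unique(x, return_inverse=True)

# Sum the probabilities that share the same segment id
reduced_prob = np.bincount(inverse, weights=prob, minlength=unique_elems.size)
```

Like the TensorFlow version, this avoids building the tiled (num_unique x n) intermediate matrix.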

Related

How to remove numpy columns based on condition?

I have a NumPy array which contains the correlation between each column and a label column:
[0.5 -0.02 0.2]
And I also have a NumPy array containing the data:
[[0.42 0.35 0.6]
[0.3 0.34 0.2]]
Can I use a function to determine which columns to keep?
Such as
abs(cors) > 0.05
It will yield
[True False True]
and then the resulting NumPy array would become
[[0.42 0.6]
[0.3 0.2]]
May I know how to achieve this?
You can do element-wise boolean indexing with a mask of the same shape as the array (the result is a flat array of the selected values):
a = np.array([
[1, 2, 3],
[4, 5, 6]
])
b = np.array([
[True, False, True],
[False, True, False]
])
new_a = a[b]
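Or, to do boolean indexing along rows and columns, use one mask per axis. Note that a[c, b] would pair up (broadcast) the True positions of the two masks rather than select a sub-block, so combine the masks with np.ix_:
a = np.array([
[1, 2, 3],
[4, 5, 6]
])
b = np.array([True, False, True]) # column mask
c = np.array([False, True]) # row mask
new_a = a[np.ix_(c, b)] # array([[4, 6]])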
So, for your example you could do:
a = np.array([
[0.42, 0.35, 0.6],
[0.3, 0.34, 0.2]
])
cors = np.array([0.5, -0.02, 0.2])
new_a = a[:, abs(cors) > 0.05]
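Alternatively, you can delete the unwanted columns with np.delete (note the abs, so that strong negative correlations are kept):
new_array = np.delete(array, np.where(np.abs(cors) <= 0.05)[0], axis=1)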
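As a quick sanity check, the boolean-mask route and the np.delete route can be compared side by side. The sketch below uses a hypothetical four-column array and correlation vector (not from the question) that includes a strong negative correlation, which is the case where forgetting abs would silently drop a useful column.

```python
import numpy as np

# Hypothetical data: 2 rows, 4 feature columns
a = np.array([[0.42, 0.35, 0.60, 0.10],
              [0.30, 0.34, 0.20, 0.90]])
cors = np.array([0.5, -0.02, 0.2, -0.7])  # hypothetical correlations

# Keep columns whose absolute correlation exceeds the threshold
keep = np.abs(cors) > 0.05

by_mask = a[:, keep]                                   # select kept columns
by_delete = np.delete(a, np.where(~keep)[0], axis=1)   # drop the others
```

Both results contain columns 0, 2 and 3; only the near-zero correlation in column 1 is removed.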

Efficient probability tree branching

As an example, I have an array of branches and probabilities that looks like this:
paths = np.array([
[1, 0, 1.0],
[2, 0, 0.4],
[2, 1, 0.6],
[3, 1, 1.0],
[5, 1, 0.25],
[5, 2, 0.5],
[5, 4, 0.25],
[6, 0, 0.7],
[6, 5, 0.2],
[6, 2, 0.1]])
The columns are upper node, lower node, probability.
Here's a visual of the nodes:
6
/ | \
5 0 2
/ | \ / \
1 2 4 0 1
| /\ |
0 0 1 0
|
0
I want to be able to pick a starting node and output an array of the branches and cumulative probabilities, including all the duplicate branches. For example:
start_node = 5 should return
array([
[5, 1, 0.25],
[5, 2, 0.5],
[5, 4, 0.25],
[1, 0, 0.25],
[2, 0, 0.2],
[2, 1, 0.3],
[1, 0, 0.3]])
Notice the [1, 0, x] branch is included twice, as it's fed by both the [5, 1, 0.25] branch and the [2, 1, 0.3] branch.
Here's some code I got working but it's far too slow for my application (millions of branches):
def branch(start_node, paths):
    output = paths[paths[:,0]==start_node]
    next_nodes = output
    while True:
        can_go_lower = np.isin(next_nodes[:,1], paths[:,0])
        if ~np.any(can_go_lower): break
        next_nodes_checked = next_nodes[can_go_lower]
        next_nodes = np.empty([0,3])
        for nodes in next_nodes_checked:
            to_append = paths[paths[:,0]==nodes[1]]
            to_append[:,2] *= nodes[2]
            next_nodes = np.append(next_nodes, to_append, axis=0)
        output = np.append(output, next_nodes, axis=0)
    return output
The branches always go from higher to lower, so getting caught in cycles isn't a concern. A way to vectorize the for loop and avoid the appends would be the best optimization, I think.
Instead of keeping everything in one NumPy array, let's store the graph in a dict keyed by upper node:
tree = {k: paths[paths[:, 0] == k] for k in np.unique(paths[:, 0])}
Make a set of the non-leaf nodes:
non_leaf_nodes = set(np.unique(paths[:, 0]))
Now to find the branch and cumulative probability:
def branch(start_node, tree, non_leaf_nodes):
    curr_nodes = [[start_node, start_node, 1.0]]  # (prev_node, current_node, cumulative_probability)
    output = []
    while True:
        next_nodes = []
        for _, node, prob in curr_nodes:
            if node not in non_leaf_nodes: continue
            subtree = tree[node]
            to_append = subtree.copy()
            to_append[:, 2] *= prob
            to_append = to_append.tolist()
            output += to_append
            next_nodes += to_append
        curr_nodes = next_nodes
        if len(curr_nodes) == 0:
            break
    return np.array(output)
Output:
>>> branch(5, tree, non_leaf_nodes)
array([
[5. , 1. , 0.25],
[5. , 2. , 0.5 ],
[5. , 4. , 0.25],
[1. , 0. , 0.25],
[2. , 0. , 0.2 ],
[2. , 1. , 0.3 ],
[1. , 0. , 0.3 ]])
I expect this to be faster than the original. Let me know how it performs.
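Since the question asks specifically about vectorizing the inner loop, here is one possible sketch that expands a whole level of the tree at once with np.searchsorted and np.repeat. It is not the method from the answer above; the function name branch_vectorized is hypothetical, and the approach assumes node ids compare exactly (as they do for the small integer-valued floats in the example).

```python
import numpy as np

def branch_vectorized(start_node, paths):
    # Sort edges by upper node so each node's child edges are contiguous.
    edges = paths[np.argsort(paths[:, 0], kind="stable")]
    uppers, starts, counts = np.unique(
        edges[:, 0], return_index=True, return_counts=True)

    nodes = np.array([start_node], dtype=float)  # current frontier
    probs = np.array([1.0])                      # cumulative probabilities
    chunks = []
    while nodes.size:
        # Find which frontier nodes are non-leaf and where their edges start.
        pos = np.minimum(np.searchsorted(uppers, nodes), uppers.size - 1)
        has_children = uppers[pos] == nodes
        pos, probs = pos[has_children], probs[has_children]
        if pos.size == 0:
            break
        n_children = counts[pos]
        # Row indices of all child edges, one contiguous run per parent.
        run_offsets = starts[pos] - np.cumsum(n_children) + n_children
        idx = np.repeat(run_offsets, n_children) + np.arange(n_children.sum())
        level = edges[idx]                        # fancy indexing copies
        level[:, 2] *= np.repeat(probs, n_children)
        chunks.append(level)
        nodes, probs = level[:, 1], level[:, 2]
    return np.vstack(chunks) if chunks else np.empty((0, 3))

paths = np.array([
    [1, 0, 1.0], [2, 0, 0.4], [2, 1, 0.6], [3, 1, 1.0], [5, 1, 0.25],
    [5, 2, 0.5], [5, 4, 0.25], [6, 0, 0.7], [6, 5, 0.2], [6, 2, 0.1]])
result = branch_vectorized(5, paths)
```

The Python-level loop now runs once per tree level rather than once per branch, which may help when levels are wide; whether it beats the dict version on real data would need to be measured.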

numpy 2d argwhere range in a matrix where a(ij) = a(ji)

I am trying to find the row and column index in a 2d numpy array where the value lies in a range.
I can do this with the following code, but for a symmetric matrix (where a[i][j] = a[j][i]) I would like each match to be reported only once:
In [118]: test_arr = np.array([[1, 0.2, 0.04],
...: [0.2, 0.3, 0.06 ],
...: [0.04, 0.06, 0.09]
...: ])
...:
In [119]: test_arr
Out[119]:
array([[1. , 0.2 , 0.04],
[0.2 , 0.3 , 0.06],
[0.04, 0.06, 0.09]])
In [120]: np.argwhere((test_arr==0.06))
Out[120]:
array([[1, 2],
[2, 1]])
Is there any way using numpy where we can restrict i<j so that the output will only be as:
array([[1, 2]])
Any help is appreciated!
In [38]: test_arr = np.array([[1, 0.2, 0.04],
...: [0.2, 0.3, 0.06 ],
...: [0.04, 0.06, 0.09]
...: ])
In [39]: test_arr
Out[39]:
array([[1. , 0.2 , 0.04],
[0.2 , 0.3 , 0.06],
[0.04, 0.06, 0.09]])
In [40]: np.where(test_arr==0.06)
Out[40]: (array([1, 2]), array([2, 1]))
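Let's explore using one of the tri functions (np.tril, np.triu) to set the values on one side of the diagonal to 0. Both keep the diagonal by default; np.triu(test_arr, k=1) would drop it, giving strictly i<j: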
In [41]: np.tril(test_arr)
Out[41]:
array([[1. , 0. , 0. ],
[0.2 , 0.3 , 0. ],
[0.04, 0.06, 0.09]])
In [42]: np.triu(test_arr)
Out[42]:
array([[1. , 0.2 , 0.04],
[0. , 0.3 , 0.06],
[0. , 0. , 0.09]])
Now apply the equality test:
In [44]: np.triu(test_arr)==0.06
Out[44]:
array([[False, False, False],
[False, False, True],
[False, False, False]])
In [45]: np.argwhere(np.triu(test_arr)==0.06)
Out[45]: array([[1, 2]])
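A variation worth considering is to apply the triangle to the boolean condition instead of to the data. That way zeroed-out entries can never match by accident (for example when searching for 0 or for a range that includes 0), and k=1 enforces i<j strictly. The sketch below is an illustration under those assumptions; the range bounds 0.05 and 0.25 are hypothetical.

```python
import numpy as np

test_arr = np.array([[1, 0.2, 0.04],
                     [0.2, 0.3, 0.06],
                     [0.04, 0.06, 0.09]])

# Exact match, restricted to the strict upper triangle (i < j)
exact = np.argwhere(np.triu(test_arr == 0.06, k=1))

# Range match with hypothetical bounds, same restriction
in_range = (test_arr >= 0.05) & (test_arr <= 0.25)
ranged = np.argwhere(np.triu(in_range, k=1))
```

Here exact is [[1, 2]] and ranged is [[0, 1], [1, 2]]; the diagonal entry 0.09 falls inside the range but is excluded by k=1.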

TensorFlow - dense vector to one-hot

Suppose I have the following tensor:
T = [[0.1, 0.3, 0.7],
[0.2, 0.5, 0.3],
[0.1, 0.1, 0.8]]
I want to transform this into a one-hot tensor, such that the indexes with the maximum value over dimension 0 get set to 1 and all the other ones get set to zero, like this:
T_onehot = [[0, 0, 1],
[0, 1, 0],
[0, 0, 1]]
I know there's tf.argmax to get the indices of the largest elements in the tensor, but is there any method which allows me to do what I want to do in one step?
I don't know if there's a way to do this in one step, but there's a one_hot function in tensorflow:
import tensorflow as tf
T = tf.constant([[0.1, 0.3, 0.7], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]])
T_onehot = tf.one_hot(tf.argmax(T, 1), T.shape[1])
tf.InteractiveSession()
print(T_onehot.eval())
# [[ 0. 0. 1.]
# [ 0. 1. 0.]
# [ 0. 0. 1.]]
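For comparison, the same argmax-then-one-hot idea can be sketched in NumPy by indexing an identity matrix with the per-row argmax. This is an illustrative counterpart rather than a TensorFlow recipe, and it assumes (as in the example) that the maximum is taken along each row.

```python
import numpy as np

T = np.array([[0.1, 0.3, 0.7],
              [0.2, 0.5, 0.3],
              [0.1, 0.1, 0.8]])

# Row i of the identity matrix is the one-hot vector for class i
T_onehot = np.eye(T.shape[1])[np.argmax(T, axis=1)]
```

If a row contains ties, argmax picks the first maximum, so each output row still has exactly one 1.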

Numpy array of distances to list of (row,col,distance)

I have an nd array that looks as follows:
[[ 0. 1.73205081 6.40312424 7.21110255 2.44948974]
[ 1.73205081 0. 5.09901951 5.91607978 1. ]
[ 6.40312424 5.09901951 0. 1. 4.35889894]
[ 7.21110255 5.91607978 1. 0. 5.09901951]
[ 2.44948974 1. 4.35889894 5.09901951 0. ]]
Each element in this array is a distance and I need to turn this into a list with the row,col,distance as follows:
l = [(0,0,0),(0,1, 1.73205081),(0,2, 6.40312424),...,(1,0, 1.73205081),(1,1,0),...,(4,4,0)]
Additionally, it would be cool to remove the diagonal elements and also the elements (j,i) as (i,j) are already there. Essentially, is it possible to take just the top triangular matrix of this?
Is this possible to do efficiently (without a lot of loops)? I had created this array with squareform, but couldn't find any docs to do this.
squareform does all this. Read the docs and experiment. It works in both directions. If you give it a matrix it returns the upper triangle values (condensed form). If you give it those values, it returns the matrix.
In [668]: M
Out[668]:
array([[ 0. , 0.1, 0.5, 0.2],
[ 0.1, 0. , 2. , 0.3],
[ 0.5, 2. , 0. , 0.2],
[ 0.2, 0.3, 0.2, 0. ]])
In [669]: spatial.distance.squareform(M)
Out[669]: array([ 0.1, 0.5, 0.2, 2. , 0.3, 0.2])
In [670]: v=spatial.distance.squareform(M)
In [671]: v
Out[671]: array([ 0.1, 0.5, 0.2, 2. , 0.3, 0.2])
In [672]: spatial.distance.squareform(v)
Out[672]:
array([[ 0. , 0.1, 0.5, 0.2],
[ 0.1, 0. , 2. , 0.3],
[ 0.5, 2. , 0. , 0.2],
[ 0.2, 0.3, 0.2, 0. ]])
You can also specify a force and checks parameter, but without those it just goes by the shape.
Indices can come from triu_indices:
In [677]: np.triu_indices(4,1)
Out[677]:
(array([0, 0, 0, 1, 1, 2], dtype=int32),
array([1, 2, 3, 2, 3, 3], dtype=int32))
In [680]: np.vstack((np.triu_indices(4,1),v)).T
Out[680]:
array([[ 0. , 1. , 0.1],
[ 0. , 2. , 0.5],
[ 0. , 3. , 0.2],
[ 1. , 2. , 2. ],
[ 1. , 3. , 0.3],
[ 2. , 3. , 0.2]])
Just to check, we can fill in a 4x4 matrix with these values
In [686]: A=np.vstack((np.triu_indices(4,1),v)).T
In [687]: MM = np.zeros((4,4))
In [688]: MM[A[:,0].astype(int),A[:,1].astype(int)]=A[:,2]
In [689]: MM
Out[689]:
array([[ 0. , 0.1, 0.5, 0.2],
[ 0. , 0. , 2. , 0.3],
[ 0. , 0. , 0. , 0.2],
[ 0. , 0. , 0. , 0. ]])
Those triu indices can also fetch the values from M:
In [693]: I,J = np.triu_indices(4,1)
In [694]: M[I,J]
Out[694]: array([ 0.1, 0.5, 0.2, 2. , 0.3, 0.2])
squareform uses compiled code in spatial.distance._distance_wrap, so I expect it will be quite fast for large arrays. The only problem is that it returns just the condensed-form values, not the indices. But given the shape, the indices can always be calculated; they don't need to be stored with the values.
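Putting the pieces above together, a list of (row, col, distance) tuples for the upper triangle can be built without an explicit Python loop over the matrix. This is a minimal sketch using the 4x4 matrix M from the transcript; the variable names rows, cols and triples are only for illustration.

```python
import numpy as np

M = np.array([[0.0, 0.1, 0.5, 0.2],
              [0.1, 0.0, 2.0, 0.3],
              [0.5, 2.0, 0.0, 0.2],
              [0.2, 0.3, 0.2, 0.0]])

# Index pairs (i, j) with i < j, i.e. the upper triangle without the diagonal
rows, cols = np.triu_indices(M.shape[0], k=1)

# Convert to plain Python ints/floats so the tuples keep integer indices
triples = list(zip(rows.tolist(), cols.tolist(), M[rows, cols].tolist()))
```

Using zip on the separate index and value arrays keeps the indices as integers, whereas stacking everything into one float array (as with vstack) turns them into floats.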
If your input is x, first generate the indices:
i0,i1 = np.indices(x.shape)
Then:
np.concatenate((i1,i0,x)).reshape(3,5,5).T
That gives you the first result--for the entire matrix.
As for taking only the upper triangle, you might consider trying np.triu(), but I'm not sure exactly what result you're looking for. You can probably figure out how to mask the parts you don't want now, though.
You can try this:
print([(x,y, value) for (x,y), value in np.ndenumerate(numpymatrixarray)])
output [(0, 0, 0.0), (0, 1, 1.7320508100000001), (0, 2, 6.4031242400000004), (0, 3, 7.2111025499999997), (0, 4, 2.4494897400000002), (1, 0, 1.7320508100000001), (1, 1, 0.0), (1, 2, 5.0990195099999998), (1, 3, 5.9160797799999996), (1, 4, 1.0), (2, 0, 6.4031242400000004), (2, 1, 5.0990195099999998), (2, 2, 0.0), (2, 3, 1.0), (2, 4, 4.3588989400000004), (3, 0, 7.2111025499999997), (3, 1, 5.9160797799999996), (3, 2, 1.0), (3, 3, 0.0), (3, 4, 5.0990195099999998), (4, 0, 2.4494897400000002), (4, 1, 1.0), (4, 2, 4.3588989400000004), (4, 3, 5.0990195099999998), (4, 4, 0.0)]
Do you really want the top triangular matrix for an [nxm] matrix where n>m? That will give you (nxn-n)/2 elements and lose all the data where m⊖n.
What you probably want is the lower triangular matrix:
def tri_reduce(m):
    n=m.shape
    if n[0]>n[1]:
        i=np.tril_indices(n[0],1,n[1])
    else:
        i=np.triu_indices(n[0],1,n[1])
    return np.vstack((i,m[i])).T
I believe rebuilding it into a list of tuples would require a loop, though; list(tri_reduce(m)) would give a list of ndarrays.
