I am trying to create a boolean mask (for tf.boolean_mask) that filters duplicate indices from a tensor, based on the values associated with the indices. If an index's value is greater than that of its duplicates, it should be kept and the others discarded. If index and value are both identical, only one should remain:
[Pseudocode]
for index in indices:
    if index is unique:
        keep index = True
    else:
        if val[index] > val[all other duplicates of index]:
            keep index = True
        elif val[index] < val[any other duplicate of index]:
            keep index = False
        elif val[index] == val[any other duplicate of index]:
            keep only a single one of the equal indices (doesn't matter which)
A short example of the problem is the following:
import tensorflow as tf
tf.enable_eager_execution()
index = tf.convert_to_tensor([ 10, 5, 20, 20, 30, 30])
value = tf.convert_to_tensor([ 1., 0., 2., 0., 0., 0.])
# bool_mask = [True, True, True, False, True, False]
# or [True, True, True, False, False, True]
# index 3 is filtered out because index 2 has a greater value (2 vs. 0)
# Indices 4 and 5 have identical values, so either one of them can be
# kept, but at most one.
...
bool_mask = ?
My current approach successfully solves the removal of duplicates with different values, but fails for the ones with identical values. However, this is an edge case that unfortunately appears in my data:
import tensorflow as tf

# `index` and `value` are the tensors from the example above
y, idx = tf.unique(index)
num_segments = tf.shape(y)[0]
maximum_vals = tf.unsorted_segment_max(value, idx, num_segments)
# pair each unique index with its per-group maximum, and each original
# index with its own value
fused_filt = tf.stack([tf.cast(y, tf.float32), maximum_vals], axis=1)
fused_orig = tf.stack([tf.cast(index, tf.float32), value], axis=1)
# compare every original (index, value) pair against every filtered pair
fused_orig_tiled = tf.tile(fused_orig, [1, tf.shape(fused_filt)[0]])
fused_orig_res = tf.reshape(fused_orig_tiled, [-1, tf.shape(fused_filt)[0], 2])
comp_1 = tf.equal(fused_orig_res, fused_filt)
comp_2 = tf.reduce_all(comp_1, -1)
comp_3 = tf.reduce_any(comp_2, -1)
# comp_3 = [True, True, True, False, True, True]
A pure TensorFlow solution would be nice; otherwise a for loop over the indices could be implemented rather simply. Thank you.
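One way to extend this segment-max approach so that it also breaks ties is a first-occurrence trick: among the entries that attain their group's maximum, keep only the one at the smallest position. A minimal sketch, assuming TF 1.x eager execution (the intermediate names are mine):

import tensorflow as tf
tf.enable_eager_execution()

index = tf.convert_to_tensor([10, 5, 20, 20, 30, 30])
value = tf.convert_to_tensor([1., 0., 2., 0., 0., 0.])

y, idx = tf.unique(index)
num_segments = tf.shape(y)[0]
max_vals = tf.unsorted_segment_max(value, idx, num_segments)

# True for every entry that attains the maximum of its group
is_max = tf.equal(value, tf.gather(max_vals, idx))

# among those, keep only the first occurrence per group: replace the
# positions of non-maximal entries with an out-of-range sentinel, then
# take the per-group minimum position
positions = tf.range(tf.shape(index)[0])
sentinel = tf.shape(index)[0]  # larger than any valid position
masked_pos = tf.where(is_max, positions, tf.fill(tf.shape(positions), sentinel))
first_max_pos = tf.unsorted_segment_min(masked_pos, idx, num_segments)

bool_mask = tf.equal(positions, tf.gather(first_max_pos, idx))
# bool_mask = [True, True, True, False, True, False]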
I have a 1D numpy array of False booleans, and a 2D numpy array containing the min/max index pairs of the (inclusive) ranges in the first array to change to True.
An example:
my_data = numpy.zeros((10,), dtype=bool)
inds2true = numpy.array([[1, 3], [8, 9]])
And I want the following result:
out = numpy.array([False, True, True, True, False, False, False, False, True, True])
How is this possible in Python with Numpy?
Edit: I would like this to be performed in one step (i.e. no looping).
There's one rule-breaking hack:
my_data[inds2true] = True
my_data = np.cumsum(my_data) % 2 == 1
my_data
>>> array([False, True, True, False, False, False, False, False, True, False])
The most common practice is to treat the ranges as half-open, changing indices within np.arange(1, 3) and np.arange(8, 9), i.e. not including 3 or 9. If you still want to include them, do in addition: my_data[inds2true[:, 1]] = True
If you're looking for other options to do it in one go, they will most probably involve np.cumsum tricks, as in the sketch below.
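For example, one such np.cumsum sketch (the variable names are mine) that includes both endpoints of each range:

import numpy as np

my_data = np.zeros((10,), dtype=bool)
inds2true = np.array([[1, 3], [8, 9]])

# +1 at every range start, -1 just past every (inclusive) range end;
# np.add.at accumulates correctly even for duplicate or overlapping ranges
delta = np.zeros(len(my_data) + 1, dtype=int)
np.add.at(delta, inds2true[:, 0], 1)
np.add.at(delta, inds2true[:, 1] + 1, -1)

# the running sum is positive exactly inside the ranges
out = np.cumsum(delta[:-1]) > 0
# out: [False  True  True  True False False False False  True  True]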
import numpy as np

my_data = np.zeros((10,), dtype=bool)
inds2true = np.array([[1, 3], [8, 9]])

indices = []
for ix_range in inds2true:
    # include both endpoints of each range
    indices += list(range(ix_range[0], ix_range[1] + 1))
my_data[indices] = True
I'm looking for a vectorized function that returns a mask with values of True if the value in the array has been seen before and False otherwise.
I'm looking for the fastest solution possible as speed is very important.
For example this is what I would like to see:
array = [1, 2, 1, 2, 3]
mask = [False, False, True, True, False]
So is_duplicate = array[mask] should return [1, 2].
Is there a fast, vectorized way to do this? Thanks!
Approach #1 : With sorting
import numpy as np

def mask_firstocc(a):
    sidx = a.argsort(kind='stable')
    b = a[sidx]
    # compare neighbours in sorted order, then undo the sort
    out = np.r_[False, b[:-1] == b[1:]][sidx.argsort()]
    return out
We can use array assignment to boost performance further -
def mask_firstocc_v2(a):
    sidx = a.argsort(kind='stable')
    b = a[sidx]
    mask = np.r_[False, b[:-1] == b[1:]]
    out = np.empty(len(a), dtype=bool)
    # scatter the sorted-order mask back to the original positions
    out[sidx] = mask
    return out
Sample run -
In [166]: a
Out[166]: array([2, 1, 1, 0, 0, 4, 0, 3])
In [167]: mask_firstocc(a)
Out[167]: array([False, False, True, False, True, False, True, False])
Approach #2 : With np.unique(..., return_index)
We can leverage np.unique with its return_index argument, which returns the first occurrence of each unique element, hence a simple array-assignment and then indexing works -
def mask_firstocc_with_unique(a):
    mask = np.ones(len(a), dtype=bool)
    # return_index gives the index of the first occurrence of each unique element
    mask[np.unique(a, return_index=True)[1]] = False
    return mask
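For the same a as in the sample run above, this should produce the identical mask:

mask_firstocc_with_unique(a)
# array([False, False,  True, False,  True, False,  True, False])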
Use np.unique
a = np.array([1, 2, 1, 2, 3])
_, ix = np.unique(a, return_index=True)
b = np.full(a.shape, True)
b[ix] = False
In [45]: b
Out[45]: array([False, False, True, True, False])
You can achieve that using enumerate, which lets you loop through with both index and value:
array = [1, 2, 1, 2, 3]
mask = []
for i, v in enumerate(array):
    # list.index returns the position of the first occurrence of v
    if array.index(v) == i:
        mask.append(False)
    else:
        mask.append(True)
print(mask)
Output:
[False, False, True, True, False]
Almost by definition, this can't be fully vectorized. The value of mask at any index depends on the values of array at every position between 0 and that index. There may be some algorithm where you expand array into an NxN matrix and do fancy tests, but that would still be an O(n^2) algorithm. The straightforward set-based scan is O(n) on average.
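For reference, a minimal sketch of that set-based scan (the helper name is mine):

import numpy as np

def mask_seen_before(a):
    # single pass with a hash set: O(n) expected time
    seen = set()
    mask = np.empty(len(a), dtype=bool)
    for i, v in enumerate(a):
        mask[i] = v in seen
        seen.add(v)
    return mask

mask_seen_before(np.array([1, 2, 1, 2, 3]))
# array([False, False,  True,  True, False])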
I have one array of shape (X, 5):
M = [[1, 2, 3, 4, 5],
     [6, 7, 8, 9, 1],
     [2, 5, 7, 8, 3],
     ...]
and one array of shape (X, 1):
n = [[3],
     [7],
     [100],
     ...]
Now I need to get, for each row, the first index at which M >= n, or nan if there is no such index.
For example:
np.where([1,2,3,4,5] >= 3)[0][0] # Returns 2
np.searchsorted(np.array([1,2,3,4,5]), 3) # Returns 2
These examples are applied to each row individually (I could loop X times, as both arrays have length X).
I wonder, is there a way to do it in a multidimensional way to get an output of all indices at once?
Something like:
np.where(M>=n)
Thank you
Edit: Values in M are unsorted; I'm still looking for the first index/occurrence satisfying M >= n (so probably not searchsorted).
You could start by checking which entries of M are greater than or equal to n (i.e. n <= M) and use argmax to get the first True of each row. For the rows where all columns are False, we can use np.where to set the result to np.nan, for instance:
import numpy as np

M = np.array([[1, 2, 3, 4, 5],
              [6, 7, 8, 9, 1],
              [2, 5, 7, 8, 3]])
n = np.array([[3], [7], [100]])

le = n <= M
# array([[False, False,  True,  True,  True],
#        [False,  True,  True,  True, False],
#        [False, False, False, False, False]])

# argmax returns the first True of each row (or 0 if the row is all False)
lea = le.argmax(1)
# check whether the found position is actually True
has_any = le[np.arange(len(le)), lea]
np.where(has_any, lea, np.nan)
# array([ 2.,  1., nan])
I have an N x M numpy array (matrix). Here is an example with a 3 x 6 array:
x = numpy.array([[0,  1,  2,  3,  4,  5],
                 [0, -1,  2,  3, -4, -5],
                 [0, -1, -2, -3,  4,  5]])
I'd like to scan all the columns of x and replace the values in each column if they are equal to a specific value.
This code, for example, aims to replace every negative value that equals the negative of its column number with 100:
for i in range(1, 6):
    x[:, i == -(i)] = 100
Running this code produces the following warning:
DeprecationWarning: using a boolean instead of an integer will result in an error in the future
I'm using numpy 1.8.2. How can I avoid this warning without downgrading numpy?
I don't follow what your code is trying to do. The expression i == -(i) evaluates to a boolean, so the indexing becomes something like:
x[:, True]
x[:, False]
I don't think this is what you want. You should try something like this:
for i in range(1, 6):
    mask = x[:, i] == -i
    x[:, i][mask] = 100
Create a mask over the whole column, and use that to change the values.
Even without the warning, the code you have there will not do what you want. i is the loop index and will equal minus itself only if i == 0, which never happens here. Your test will always return False, which is cast to 0. In other words, your code replaces the first element of each row with 100.
To get this to work I would do
for i in range(1, 6):
    col = x[:, i]  # a view into column i, so assigning to it modifies x
    col[col == -i] = 100
Notice that the mask is built from the column itself, and that the conventional indexing is kept separate from the boolean masking.
If you are worried about the warning spewing out text, you can suppress it like any other warning:
import numpy
import warnings

warnings.simplefilter('default')  # this enables DeprecationWarnings to be thrown

x = numpy.array([[0, 1, 2, 3, 4, 5], [0, -1, 2, 3, -4, -5], [0, -1, -2, -3, 4, 5]])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # and this ignores them
    for i in range(1, 6):
        x[:, i == -(i)] = 100

print(x)  # just to show that you are actually changing the content
As you can see in the comments, some people are not getting a DeprecationWarning. That is probably because Python has suppressed developer-only warnings by default since 2.7.
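If you do want to see them, the standard warnings machinery can re-enable them explicitly (a minimal sketch; the script name is a placeholder):

import warnings

# show every DeprecationWarning, even repeated ones
warnings.simplefilter('always', DeprecationWarning)

# or, equivalently, from the command line:
#   python -W always::DeprecationWarning your_script.py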
As others have said, your loop isn't doing what you think it is doing. I would propose you change your code to use numpy's fancy indexing.
# First, create the "test values" (column index):
>>> test_values = numpy.arange(6)
# test_values is array([0, 1, 2, 3, 4, 5])
#
# Now, we want to check which columns have value == -test_values:
#
>>> mask = (x == -test_values) & (x < 0)
# mask is True wherever a value in the i-th column of x equals -i (i.e. is negative i)
>>> mask
array([[False, False, False, False, False, False],
       [False,  True, False, False,  True,  True],
       [False,  True,  True,  True, False, False]], dtype=bool)
#
# Now, set those values to 100
>>> x[mask] = 100
>>> x
array([[  0,   1,   2,   3,   4,   5],
       [  0, 100,   2,   3, 100, 100],
       [  0, 100, 100, 100,   4,   5]])
I would like to determine the sum of a two-dimensional numpy array, but I want to exclude elements with a certain value from the summation. What is the most efficient way to do this?
For example, here I initialize a two dimensional numpy array of 1s and replace several of them by 2:
import numpy
data_set = numpy.ones((10, 10))
data_set[4][4] = 2
data_set[5][5] = 2
data_set[6][6] = 2
How can I sum over the elements in my two dimensional array while excluding all of the 2s? Note that with the 10 by 10 array the correct answer should be 97 as I replaced three elements with the value 2.
I know I can do this with nested for loops. For example:
elements = []
for idx_x in range(data_set.shape[0]):
    for idx_y in range(data_set.shape[1]):
        if data_set[idx_x][idx_y] != 2:
            elements.append(data_set[idx_x][idx_y])
data_set_sum = numpy.sum(elements)
However on my actual data (which is very large) this is too slow. What is the correct way of doing this?
Use numpy's capability of indexing with boolean arrays. In the example below, data_set != 2 evaluates to a boolean array which is True wherever the element is not 2 (and has the correct shape). So data_set[data_set != 2] is a fast and convenient way to get an array which doesn't contain a certain value. Of course, the boolean expression can be more complex.
In [1]: import numpy as np
In [2]: data_set = np.ones((10, 10))
In [4]: data_set[4,4] = 2
In [5]: data_set[5,5] = 2
In [6]: data_set[6,6] = 2
In [7]: data_set[data_set != 2].sum()
Out[7]: 97.0
In [8]: data_set != 2
Out[8]:
array([[ True, True, True, True, True, True, True, True, True,
True],
[ True, True, True, True, True, True, True, True, True,
True],
...
[ True, True, True, True, True, True, True, True, True,
True]], dtype=bool)
Without numpy, the solution is not much more complex:
x = [1,2,3,4,5,6,7]
sum(y for y in x if y != 7)
# 21
Works for a list of excluded values too:
# set is faster for resolving `in`
exl = set([1,2,3])
sum(y for y in x if y not in exl)
# 22
Using np.sum's where= argument, we avoid the need for array copying, which would otherwise be triggered by using advanced array indexing:
>>> import numpy as np
>>> data_set = np.ones((10,10))
>>> data_set[(4,5,6),(4,5,6)] = 2
>>> np.sum(data_set, where=data_set != 2)
97.0
>>> data_set.sum(where=data_set != 2)
97.0
https://numpy.org/doc/stable/reference/generated/numpy.sum.html
Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).
https://numpy.org/doc/stable/user/basics.indexing.html#advanced-indexing
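A quick illustration of that copy-vs-view distinction (a minimal sketch):

import numpy as np

a = np.arange(6)
view = a[2:5]           # basic slicing returns a view
view[0] = 99            # ...so this modifies a
print(a)                # [ 0  1 99  3  4  5]

fancy = a[[2, 3, 4]]    # advanced indexing returns a copy
fancy[0] = -1           # ...so a is unchanged
print(a)                # [ 0  1 99  3  4  5]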
How about this way, which makes use of numpy's boolean indexing?
We simply set all the values that meet the specification to zero before taking the sum; that way we don't change the shape of the array, as we would if we filtered them out.
The other benefit is that we can still sum along an axis after the filter is applied.
import numpy

data_set = numpy.ones((10, 10))
data_set[4][4] = 2
data_set[5][5] = 2
data_set[6][6] = 2

print("Sum", data_set.sum())

another_set = numpy.array(data_set)  # Take a copy, we'll need that later

data_set[data_set == 2] = 0  # Set all the values that are 2 to zero
print("Filtered sum", data_set.sum())
print("Along axis", data_set.sum(0), data_set.sum(1))
Equally we could use any other boolean to set the data we wish to exclude from the sum.
another_set[(another_set > 1) & (another_set < 3)] = 0
print("Another filtered sum", another_set.sum())