numpy: ravel_multi_index increment different results from iterating over indices loop - python

I have an array of indices (possible duplicates) where I increment each these of indices in another 2D matrix by 1. There have been several several suggestions and this answer proposes to use np.ravel_multi_index.
So, I've tried it out but they don't seem to give me the same set of answers. Any idea why?
raveled = np.ravel_multi_index(legit_indices.T, acc.shape)
counts = np.bincount(raveled)
acc = np.resize(counts, acc.shape)
acc2 = np.zeros(acc2.shape)
for i in legit_indices:
acc2[i[0], i[1]] += 1
(Pdb) np.array_equal(acc, acc2)
False
(Pdb) acc[493][5]
135
(Pdb) acc2[493][5]
0.0

There are a few problems with your current approach. Firstly, np.bincount(x)
will give you the counts for every positive integer value of x starting at 0
and ending at max(x):
print(np.bincount([1, 1, 3, 3, 3, 4]))
# [0, 2, 0, 3, 1]
# i.e. [count for 0, count for 1, count for 2, count for 3, count for 4]
Therefore, if not every location in acc.flat gets indexed, the length of
np.bincount(raveled) will be greater than the number of unique indices. What
you actually want is the counts only for those locations in acc.flat that are
indexed at least once.
Secondly, what you want to do is assign the bin counts to the corresponding
indices into acc.flat. What your call to np.resize does is to repeat parts
of your array of bincounts in order to make it the same size as acc.flat,
then reshape it to the same shape as acc. This will not result in the bin
counts being assigned to the correct locations in acc!
The way I would solve this problem would be to use np.unique instead of
np.bincount, and use it to return both the unique indices and their corresponding
counts. These can then be used to assign the correct counts to the correct unique locations within acc:
import numpy as np
# some example data
acc = np.zeros((4, 3))
legit_indices = np.array([[0, 1],
[0, 1],
[1, 2],
[1, 0],
[1, 0],
[1, 0]])
# convert the index array into a set of indices into acc.flat
flat_idx = np.ravel_multi_index(legit_indices.T, acc.shape)
# get the set of unique indices and their corresponding counts
uidx, ucounts = np.unique(flat_idx, return_counts=True)
# assign the count value to each unique index in acc.flat
acc.flat[uidx] = ucounts
# confirm that this matches the result of your for loop
acc2 = np.zeros_like(acc)
for ii, jj in legit_indices:
acc2[ii, jj] += 1
assert np.array_equal(acc, acc2)

Related

How to do elementwise comparison sum without for loop

The for loop makes my program very slow. I would've used np.sum(target==output) but I need the argmax value for each row in the output. How can I speed this up?
The output is a tensor data type
for i, x in enumerate(target):
if target[i] == torch.argmax(output[i]):
correct_class += 1
You could vectorize the above using np.argmax's axis argument, to obtain the indices of the maximum value across the rows:
(target==np.argmax(output, axis=1)).sum()
For instance:
output = np.random.choice([0,1],(4,2))
print(output)
array([[1, 1],
[0, 1],
[0, 1],
[0, 1]])
target = np.array([[0,1,0,1]])
(target==np.argmax(output, axis=1)).sum()
# 3

Function Failing at Large List Sizes

I have a question: Starting with a 1-indexed array of zeros and a list of operations, for each operation add a value to each the array element between two given indices, inclusive. Once all operations have been performed, return the maximum value in the array.
Example: n = 10, Queries = [[1,5,3],[4,8,7],[6,9,1]]
The following will be the resultant output after iterating through the array, Index 1-5 will have 3 added to it etc...:
[0,0,0, 0, 0,0,0,0,0, 0]
[3,3,3, 3, 3,0,0,0,0, 0]
[3,3,3,10,10,7,7,7,0, 0]
[3,3,3,10,10,8,8,8,1, 0]
Finally you output the max value in the final list:
[3,3,3,10,10,8,8,8,1, 0]
My current solution:
def Operations(size, Array):
ResultArray = [0]*size
Values = [[i.pop(2)] for i in Array]
for index, i in enumerate(Array):
#Current Values in = Sum between the current values in the Results Array AND the added operation of equal length
#Results Array
ResultArray[i[0]-1:i[1]] = list(map(sum, zip(ResultArray[i[0]-1:i[1]], Values[index]*len(ResultArray[i[0]-1:i[1]]))))
Result = max(ResultArray)
return Result
def main():
nm = input().split()
n = int(nm[0])
m = int(nm[1])
queries = []
for _ in range(m):
queries.append(list(map(int, input().rstrip().split())))
result = Operations(n, queries)
if __name__ == "__main__":
main()
Example input: The first line contains two space-separated integers n and m, the size of the array and the number of operations.
Each of the next m lines contains three space-separated integers a,b and k, the left index, right index and summand.
5 3
1 2 100
2 5 100
3 4 100
Compiler Error at Large Sizes:
Runtime Error
Currently this solution is working for smaller final lists of length 4000, however in order test cases where length = 10,000,000 it is failing. I do not know why this is the case and I cannot provide the example input since it is so massive. Is there anything clear as to why it would fail in larger cases?
I think the problem is that you make too many intermediary trow away list here:
ResultArray[i[0]-1:i[1]] = list(map(sum, zip(ResultArray[i[0]-1:i[1]], Values[index]*len(ResultArray[i[0]-1:i[1]]))))
this ResultArray[i[0]-1:i[1]] result in a list and you do it twice, and one is just to get the size, which is a complete waste of resources, then you make another list with Values[index]*len(...) and finally compile that into yet another list that will also be throw away once it is assigned into the original, so you make 4 throw away list, so for example lets said the the slice size is of 5.000.000, then you are making 4 of those or 20.000.000 extra space you are consuming, 15.000.000 of which you don't really need, and if your original list is of 10.000.000 elements, well just do the math...
You can get the same result for your list(map(...)) with list comprehension like
[v+Value[index][0] for v in ResultArray[i[0]-1:i[1]] ]
now we use two less lists, and we can reduce one list more by making it a generator expression, given that slice assignment does not need that you assign a list specifically, just something that is iterable
(v+Value[index][0] for v in ResultArray[i[0]-1:i[1]] )
I don't know if internally the slice assignment it make it a list first or not, but hopefully it doesn't, and with that we go back to just one extra list
here is an example
>>> a=[0]*10
>>> a
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
>>> a[1:5] = (3+v for v in a[1:5])
>>> a
[0, 3, 3, 3, 3, 0, 0, 0, 0, 0]
>>>
we can reduce it to zero extra list (assuming that internally it doesn't make one) by using itertools.islice
>>> import itertools
>>> a[3:7] = (1+v for v in itertools.islice(a,3,7))
>>> a
[0, 3, 3, 4, 4, 1, 1, 0, 0, 0]
>>>

Find the index of first non-zero element to the right of given elements in python

I have a 2D numpy.ndarray. Given a list of positions, I want to find the positions of first non-zero elements to the right of the given elements in the same row. Is it possible to vectorize this? I have a huge array and looping is taking too much time.
Eg:
matrix = numpy.array([
[1, 0, 0, 1, 1],
[1, 1, 0, 0, 1],
[1, 0, 0, 0, 1],
[1, 1, 1, 1, 1],
[1, 0, 0, 0, 1]
])
query = numpy.array([[0,2], [2,1], [1,3], [0,1]])
Expected Result:
>> [[0,3], [2,4], [1,4], [0,3]]
Currently I'm doing this using for loops as follows
for query_point in query:
y, x = query_point
result_point = numpy.min(numpy.argwhere(self.matrix[y, x + 1:] == 1)) + x + 1
print(f'{y}, {result_point}')
PS: I also want to find the first non-zero element to the left as well. I guess, the solution to find the right point can be easily tqeaked to find the left point.
If your query array is sufficiently dense, you can reverse the computation: find an array of the same size as matrix that gives the index of the next nonzero element in the same row for each location. Then your problem becomes one of just one of applying query to this index array, which numpy supports directly.
It is actually much easier to find the left index, so let's start with that. We can transform matrix into an array of indices like this:
r, c = np.nonzero(matrix)
left_ind = np.zeros(matrix.shape, dtype=int)
left_ind[r, c] = c
Now you can find the indices of the preceding nonzero element by using np.maximum similarly to how it is done in this answer: https://stackoverflow.com/a/48252024/2988730:
np.maximum.accumulate(left_ind, axis=1, out=left_ind)
Now you can index directly into ind to get the previous nonzero column index:
left_ind[query[:, 0], query[:, 1]]
or
left_ind[tuple(query.T)]
Now to do the same thing with the right index, you need to reverse the array. But then your indices are no longer ascending, and you risk overwriting any zeros you have in the first column. To solve that, in addition to just reversing the array, you need to reverse the order of the indices:
right_ind = np.zeros(matrix.shape, dtype=int)
right_ind[r, c] = matrix.shape[1] - c
You can use any number larger than matrix.shape[1] as your constant as well. The important thing is that the reversed indices all come out greater than zero so np.maximum.accumulate overwrites the zeros. Now you can use np.maximum.accumulate in the same way on the reversed array:
right_ind = matrix.shape[1] - np.maximum.accumulate(right_ind[:, ::-1], axis=1)[:, ::-1]
In this case, I would recommend against using out=right_ind, since right_ind[:, ::-1] is a view into the same buffer. The operation is buffered, but if your line size is big enough, you may overwrite data unintentionally.
Now you can index the array in the same way as before:
right_ind[(*query.T,)]
In both cases, you need to stack with the first column of query, since that's the row key:
>>> row, col = query.T
>>> np.stack((row, left_ind[row, col]), -1)
array([[0, 0],
[2, 0],
[1, 1],
[0, 0]])
>>> np.stack((row, right_ind[row, col]), -1)
array([[0, 3],
[2, 4],
[1, 4],
[0, 3]])
>>> np.stack((row, left_ind[row, col], right_ind[row, col]), -1)
array([[0, 0, 3],
[2, 0, 4],
[1, 1, 4],
[0, 0, 3]])
If you plan on sampling most of the rows in the array, either at once, or throughout your program, this will help you speed things up. If, on the other hand, you only need to access a small subset, you can apply this technique only to the rows you need.
I came up with a solution to get both your wanted indices,
i.e. to the left and to the right from the indicated position.
First define the following function, to get the row number and both indices:
def inds(r, c, arr):
ind = np.nonzero(arr[r])[0]
indSlice = ind[ind < c]
iLeft = indSlice[-1] if indSlice.size > 0 else None
indSlice = ind[ind > c]
iRight = indSlice[0] if indSlice.size > 0 else None
return r, iLeft, iRight
Parameters:
r and c are row number (in the source array) and the "starting"
index in this row,
arr is the array to look in (matrix will be passed here).
Then define the vectorized version of this function:
indsVec = np.vectorize(inds, excluded=['arr'])
And to get the result, run:
result = np.vstack(indsVec(query[:, 0], query[:, 1], arr=matrix)).T
The result is:
array([[0, 0, 3],
[2, 0, 4],
[1, 1, 4],
[0, 0, 3]], dtype=int64)
Your expected result is the left and right column (row number
and the index of first non-zero element after the "starting" position.
The middle column is the index of last non-zero element before the "starting" position.
This solution is resistant to "non-existing" case (if there are no
any "before" or "after" non-zero element). In such case the respective
index is returned as None.

Loop over clump_masked indices

I have an array y_filtered that contains some masked values. I want to replace these values by some value I calculate based on their neighbouring values. I can get the indices of the masked values by using masked_slices = ma.clump_masked(y_filtered). This returns a list of slices, e.g. [slice(194, 196, None)].
I can easily get the values from my masked array, by using y_filtered[masked_slices], and even loop over them. However, I need to access the index of the values as well, so i can calculate its new value based on its neighbours. Enumerate (logically) returns 0, 1, etc. instead of the indices I need.
Here's the solution I came up with.
# get indices of masked data
masked_slices = ma.clump_masked(y_filtered)
y_enum = [(i, y_i) for i, y_i in zip(range(len(y_filtered)), y_filtered)]
for sl in masked_slices:
for i, y_i in y_enum[sl]:
# simplified example calculation
y_filtered[i] = np.average(y_filtered[i-2:i+2])
It is very ugly method i.m.o. and I think there has to be a better way to do this. Any suggestions?
Thanks!
EDIT:
I figured out a better way to achieve what I think you want to do. This code picks every window of 5 elements and compute its (masked) average, then uses those values to fill the gaps in the original array. If some index does not have any unmasked value close enough it will just leave it as masked:
import numpy as np
from numpy.lib.stride_tricks import as_strided
SMOOTH_MARGIN = 2
x = np.ma.array(data=[1, 2, 3, 4, 5, 6, 8, 9, 10],
mask=[0, 1, 0, 0, 1, 1, 1, 1, 0])
print(x)
# [1 -- 3 4 -- -- -- -- 10]
pad_data = np.pad(x.data, (SMOOTH_MARGIN, SMOOTH_MARGIN), mode='constant')
pad_mask = np.pad(x.mask, (SMOOTH_MARGIN, SMOOTH_MARGIN), mode='constant',
constant_values=True)
k = 2 * SMOOTH_MARGIN + 1
isize = x.dtype.itemsize
msize = x.mask.dtype.itemsize
x_pad = np.ma.array(
data=as_strided(pad_data, (len(x), k), (isize, isize), writeable=False),
mask=as_strided(pad_mask, (len(x), k), (msize, msize), writeable=False))
x_avg = np.ma.average(x_pad, axis=1).astype(x_pad.dtype)
fill_mask = ~x_avg.mask & x.mask
result = x.copy()
result[fill_mask] = x_avg[fill_mask]
print(result)
# [1 2 3 4 3 4 10 10 10]
(note all the values are integers here because x was originally of integer type)
The original posted code has a few errors, firstly it both reads and writes values from y_filtered in the loop, so the results of later indices are affected by the previous iterations, this could be fixed with a copy of the original y_filtered. Second, [i-2:i+2] should probably be [max(i-2, 0):i+3], in order to have a symmetric window starting at zero or later always.
You could do this:
from itertools import chain
# get indices of masked data
masked_slices = ma.clump_masked(y_filtered)
for idx in chain.from_iterable(range(s.start, s.stop) for s in masked_slices):
y_filtered[idx] = np.average(y_filtered[max(idx - 2, 0):idx + 3])

Python - how to find numbers in a list which are not the minimum

I have a list S = [a[n],b[n],c[n]] and for n=0 the minimum of list S is the value 'a'. How do I select the values b and c given that I know the minimum? The code I'm writing runs through many iterations of n, and I want to examine the elements which are not the minimum for a given iteration in the loop.
Python 2.7.3, 32-bit. Numpy 1.6.2. Scipy 0.11.0b1
If you can flatten the whole list into a numpy array, then use argsort, the first row of argsort will tell you which array contains the minimum value:
a = [1,2,3,4]
b = [3,-4,5,8]
c = [6,1,-7,12]
S = [a,b,c]
S2 = np.array(S)
S2.argsort(axis=0)
array([[0, 1, 2, 0],
[1, 2, 0, 1],
[2, 0, 1, 2]])
Maybe you can do something like
S.sort()
S[1:3]
This is what you want?

Categories

Resources