I have two arrays loaded from images and I need to add them together, i.e. if there's a circle of pixel value 1 in the center of one, I need to add it to a triangle of pixel value 2 in the top left corner of the other. The rule I want to set is that if a 1 is already at an index, pixel value 2 is only added to the pixels that are blank (pixel value 0).
How do I do that? I keep trying with np.add and the where option
mask_test = master_array == 0
master_array = np.add(master_array, new_pic, where = mask_test)
But it keeps going wrong and master_array just ends up being new_pic instead of the sum. Online searches for how 'where' works have been fruitless because hardly anyone gives an example; some even just go "oh it's not used much so I won't go over it".
This code:
master_array = np.add(master_array, new_pic, where = mask_test)
gives me this:
But the problem is that where the pixels do overlap I get a pixel value of 3 instead of the value of 1 being retained as it should be.
As explained in the docs, the out array will retain its "original value" in cases where the condition in the where parameter is false. This implies that you need to specify an out array to which the function will output if you are going to set the where parameter. Otherwise the function tries to get the original values from an uninitialized array, which has odd results. If you're happy to overwrite master_array, you can do that like this:
np.add(master_array, new_pic, out=master_array, where=master_array == 0)
(You don't need to assign the returned value here - specifying the output array is sufficient.)
It is probably less of a headache to use + with np.where instead:
master_array += np.where(master_array == 0, new_pic, 0)
But since you are only adding in cases where pixel value is 0 in the master, there is no need to add in the first place. You could just use np.where without any addition.
master_array = np.where(master_array == 0, new_pic, master_array)
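For a concrete picture, here is a minimal sketch with small made-up arrays standing in for the images:
import numpy as np

master_array = np.array([[1, 1, 0],
                         [0, 1, 0],
                         [0, 0, 0]])   # stand-in for the "circle" of 1s
new_pic = np.array([[2, 2, 0],
                    [2, 0, 0],
                    [0, 0, 0]])        # stand-in for the "triangle" of 2s

# Only the blank (0) pixels of master_array take values from new_pic;
# overlapping pixels keep their original value of 1.
result = np.where(master_array == 0, new_pic, master_array)
print(result)
# [[1 1 0]
#  [2 1 0]
#  [0 0 0]]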
Use of where in np.add (or another ufunc) is not common, especially compared to the use of the np.where function. And whenever I answer a question about it on SO, I stress that out needs to be included.
The docs talk about "uninitialized" values when out=None, the default. That may be unclear, but effectively it means an array such as the one created by np.empty.
This may contain anything, such as:
In [263]: res = np.empty((5,5),int)
In [264]: res
Out[264]:
array([[ 50999536, 0, 140274438367024,
-6315338160082163841, 140273540789184],
[ 161, 55839504, 140274448227440,
140273575343728, 358094631352936090],
[ 140273564120384, 140273575343344, -7783537013977118542,
140273543024256, 140273575343200],
[-6522034781934541837, 140273620247296, 140273575343776,
1387433780369843801, 140273560270848],
[ 140273561761968, -3190833100527581043, 140273563628672,
140273561762640, 480]])
Define an initial array:
In [265]: x1 = np.random.randint(0,5,(5,5))
In [266]: x1
Out[266]:
array([[3, 2, 0, 1, 3],
[3, 2, 4, 0, 3],
[2, 3, 3, 4, 3],
[3, 2, 0, 2, 2],
[1, 2, 1, 1, 2]])
In [267]: x2=x1.copy()
Without out, we get values much like res above. Only the x1==0 elements are set to 10:
In [268]: np.add(x1, 10, where=x1==0)
Out[268]:
array([[51108864, 0, 10, 47780512, 51193856],
[51213024, 51245760, 51252528, 10, 51260336],
[51261168, 51261920, 51264176, 51298864, 51270656],
[51271040, 51274864, 10, 51276640, 51277024],
[51277808, 51278528, 51279104, 51284496, 51286448]])
Or we could set the out to np.zeros:
In [269]: np.add(x1, 10, where=x1==0, out=np.zeros((5,5),int))
Out[269]:
array([[ 0, 0, 10, 0, 0],
[ 0, 0, 0, 10, 0],
[ 0, 0, 0, 0, 0],
[ 0, 0, 10, 0, 0],
[ 0, 0, 0, 0, 0]])
But if we set it to x1, or a copy of x1 (which is probably what you want):
In [270]: np.add(x1, 10, where=x1==0, out=x2)
Out[270]:
array([[ 3, 2, 10, 1, 3],
[ 3, 2, 4, 10, 3],
[ 2, 3, 3, 4, 3],
[ 3, 2, 10, 2, 2],
[ 1, 2, 1, 1, 2]])
But we could do the same with masked addition:
In [271]: x1[x1==0] += 10
In [272]: x1
Out[272]:
array([[ 3, 2, 10, 1, 3],
[ 3, 2, 4, 10, 3],
[ 2, 3, 3, 4, 3],
[ 3, 2, 10, 2, 2],
[ 1, 2, 1, 1, 2]])
Or using the more commonly used np.where function:
In [273]: np.where(x1==10, 20, x1)
Out[273]:
array([[ 3, 2, 20, 1, 3],
[ 3, 2, 4, 20, 3],
[ 2, 3, 3, 4, 3],
[ 3, 2, 20, 2, 2],
[ 1, 2, 1, 1, 2]])
In my experience with SO, the where/out is most useful when evaluation at certain values can give rise to errors, for example division by 0, or log of negatives.
In np.where(A, B, C), the three arguments are evaluated in full, and the result just selects from B and C based on A. With np.add(x, y, where=A, out=C), the x+y addition is only done where the condition is true; the evaluation is selective. The distinction may be hard to grasp, and may not matter when using np.add.
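For example, a common sketch of that selective evaluation (not from the question, just an illustration) is suppressing division-by-zero problems:
import numpy as np

num = np.array([1.0, 2.0, 3.0])
den = np.array([2.0, 0.0, 4.0])

# The division is only evaluated where den != 0; the other slots keep the
# value already present in the out array (0.0 here), so no warning is raised.
safe = np.divide(num, den, out=np.zeros_like(num), where=(den != 0))
print(safe)   # [0.5  0.   0.75]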
You could simply use the normal addition:
mask_test = master_array == 0
master_array += new_pic * mask_test
I would like to sort an array by column sum and delete the largest element of each column, then continue the sorting.
import numpy as np

# sorted by sum of columns
def sorting(a):
    b = np.sum(a, axis=0)
    idx = b.argsort()
    a = np.take(a, idx, axis=1)
    return a

arr = [[1, 2, 3, 8], [3, 0, 2, 1], [5, 4, 25, 67], [11, 1, 6, 10]]
print(sorting(arr))
Here is the output:
[[ 2 1 3 8]
[ 0 3 2 1]
[ 4 5 25 67]
[ 1 11 6 10]]
I was able to find the max of each column and their indexes, but I couldn't delete them without deleting the whole row/column. Any help would be appreciated, I am new to numpy!
Though not very elegant, one way to achieve this is with broadcasting and fancy/advanced indexing:
import numpy as np
arr = np.array([[1,2,3,8], [3,0,2,1],[5, 4, 25, 67], [11, 1, 6, 10]])
First get the intermediate array sorted by column sums.
arr1 = arr[:, arr.sum(axis = 0).argsort()]
print(arr1)
# array([[ 2, 1, 3, 8],
# [ 0, 3, 2, 1],
# [ 4, 5, 25, 67],
# [ 1, 11, 6, 10]])
Next get where the maximas occur in each column.
idx = arr1.argmax(axis = 0)
print(idx)
# array([2, 3, 2, 2])
Now prepare row and column index arrays to slice from arr1. Note that the line computing rows essentially takes the set difference of {0, 1, 2, 3} (in general, the row indices of arr) and each element of idx above, and stores the results along the columns of the rows matrix.
k = np.arange(arr1.shape[0]) # original number of rows
rows = np.nonzero(k != idx[:, None])[1].reshape(-1, arr1.shape[0] - 1).T
cols = np.arange(arr1.shape[1])
print(rows)
# array([[0, 0, 0, 0],
# [1, 1, 1, 1],
# [3, 2, 3, 3]])
Note that cols will be broadcast to the shape of rows while indexing arr1 with them. For your understanding, cols effectively looks like this so as to be compatible with rows:
print(np.broadcast_to(cols, rows.shape))
# array([[0, 1, 2, 3],
# [0, 1, 2, 3],
# [0, 1, 2, 3]])
Basically when you (fancy) index arr1 by them, you get the 0th column for rows 0, 1 and 3; 1st column for rows 0, 1 and 2 and so on. Hope you get the idea.
arr2 = arr1[rows, cols]
print(arr2)
# array([[ 2, 1, 3, 8],
# [ 0, 3, 2, 1],
# [ 1, 5, 6, 10]])
You can write a simple function composing these steps for your convenience, to perform the operation multiple times.
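For instance, here is a rough sketch of such a function, just bundling the steps above (the function name is made up; it assumes each column has a single maximum):
import numpy as np

def sort_and_drop_colmax(arr):
    # Sort columns by their sums, then drop the (first) maximum of each column.
    arr1 = arr[:, arr.sum(axis=0).argsort()]
    idx = arr1.argmax(axis=0)
    k = np.arange(arr1.shape[0])
    rows = np.nonzero(k != idx[:, None])[1].reshape(-1, arr1.shape[0] - 1).T
    cols = np.arange(arr1.shape[1])
    return arr1[rows, cols]

arr = np.array([[1, 2, 3, 8], [3, 0, 2, 1], [5, 4, 25, 67], [11, 1, 6, 10]])
print(sort_and_drop_colmax(arr))
# [[ 2  1  3  8]
#  [ 0  3  2  1]
#  [ 1  5  6 10]]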
My goal was to insert a column to the right on a numpy matrix. However, I found that the code I was using is putting in two columns rather than just one.
# This one results in a 4x1 matrix, as expected
np.insert(np.matrix([[0],[0]]), 1, np.matrix([[0],[0]]), 0)
>>>matrix([[0],
[0],
[0],
[0]])
# I would expect this line to return a 2x2 matrix, but it returns a 2x3 matrix instead.
np.insert(np.matrix([[0],[0]]), 1, np.matrix([[0],[0]]), 1)
>>>matrix([[0, 0, 0],
[0, 0, 0]])
Why do I get the above, in the second example, instead of [[0,0], [0,0]]?
While new use of np.matrix is discouraged, we get the same result with np.array:
In [41]: np.insert(np.array([[1],[2]]),1, np.array([[10],[20]]), 0)
Out[41]:
array([[ 1],
[10],
[20],
[ 2]])
In [42]: np.insert(np.array([[1],[2]]),1, np.array([[10],[20]]), 1)
Out[42]:
array([[ 1, 10, 20],
[ 2, 10, 20]])
In [44]: np.insert(np.array([[1],[2]]),1, np.array([10,20]), 1)
Out[44]:
array([[ 1, 10],
[ 2, 20]])
Insert with the index given as a list, [1]:
In [46]: np.insert(np.array([[1],[2]]),[1], np.array([[10],[20]]), 1)
Out[46]:
array([[ 1, 10],
[ 2, 20]])
In [47]: np.insert(np.array([[1],[2]]),[1], np.array([10,20]), 1)
Out[47]:
array([[ 1, 10, 20],
[ 2, 10, 20]])
np.insert is a complex function written in Python, so we need to look at that code to see how values are mapped onto the target space. The docs elaborate on the difference between inserting at 1 and at [1], but offhand I don't see an explanation of how the shape of values matters.
Difference between sequence and scalars:
>>> np.insert(a, [1], [[1],[2],[3]], axis=1)
array([[1, 1, 1],
[2, 2, 2],
[3, 3, 3]])
>>> np.array_equal(np.insert(a, 1, [1, 2, 3], axis=1),
...                np.insert(a, [1], [[1],[2],[3]], axis=1))
True
When adding an array at the end of another, I'd use concatenate (or one of its stack variants) rather than insert. None of these operate in-place.
In [48]: np.concatenate([np.array([[1],[2]]), np.array([[10],[20]])], axis=1)
Out[48]:
array([[ 1, 10],
[ 2, 20]])
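As a quick sketch of those stack variants for the specific case of appending a column on the right:
import numpy as np

a = np.array([[1], [2]])
b = np.array([[10], [20]])

print(np.hstack([a, b]))          # [[ 1 10]
                                  #  [ 2 20]]
print(np.column_stack([a, b]))    # same result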
I'm trying to get the index values out of a numpy array; I've tried using intersects instead, to no avail. I'm simply trying to find like values in two arrays. One is 2D and I'm selecting a column, and the other is 1D, just a list of values to search for, so effectively it's just two 1D arrays.
We'll call this array a:
array([[ 1, 97553, 1],
       [ 1, 97587, 1],
       [ 1, 97612, 1],
       [ 1, 97697, 1],
       [ 1, 97826, 3],
       [ 1, 97832, 1],
       [ 1, 97839, 1],
       [ 1, 97887, 1],
       [ 1, 97944, 1],
       [ 1, 97955, 2]])
And we're searching say, values = numpy.array([97612, 97633, 97697, 97999, 97943, 97944])
So I try:
numpy.where(a[:, 1] == values)
And I'd expect a bunch of indices of the values, but instead I get back an empty result; it spits out [(array([], dtype=int64),)].
If I try this though:
numpy.where(a[:, 1] == 97697)
It gives me back (array([2]),), which is what I would expect.
What weirdness of arrays am I missing here? Or is there maybe an even easier way to do this? Finding array indices and matching arrays doesn't seem to work as I expect at all. When I want to find the unions or intersects of arrays, by index or unique value, it just doesn't seem to function. Any help would be super. Thanks.
Edit:
As per Warrens request:
import numpy
a = numpy.array([[ 1, 97553, 1],
                 [ 1, 97587, 1],
                 [ 1, 97612, 1],
                 [ 1, 97697, 1],
                 [ 1, 97826, 3],
                 [ 1, 97832, 1],
                 [ 1, 97839, 1],
                 [ 1, 97887, 1],
                 [ 1, 97944, 1],
                 [ 1, 97955, 2]])
values = numpy.array([97612, 97633, 97697, 97999, 97943, 97944])
I've found that numpy.in1d will give me a correct truth table of booleans for the operation, as a 1d array of the same length that maps to the original data. My only issue now is how to act on that, for instance deleting or modifying the original array at those indices. I could do it laboriously with a loop, but as far as I know there are better ways in numpy. Truth tables as masks are supposed to be quite powerful with numpy, from what I have been able to find.
np.where with a single argument is equivalent to np.nonzero. It gives you the indices where a condition, the input array, is True.
In your example you are checking for element-wise equality between a[:, 1] and values:
a[:, 1] == values
False
Because the shapes (10,) and (6,) cannot be broadcast together, the comparison collapses to the single scalar False (older NumPy versions return this with a warning rather than raising an error). So it's giving you a correct result for that input: no index is True.
You should use np.isin instead
np.isin(a[:,1], values)
array([False, False, True, True, False, False, False, False, True, False], dtype=bool)
Now you can use np.where to get the indices
np.where(np.isin(a[:,1], values))
(array([2, 3, 8]),)
and use those to address the original array
a[np.where(np.isin(a[:,1], values))]
array([[ 1, 97612, 1],
[ 1, 97697, 1],
[ 1, 97944, 1]])
Your initial solution with a simple equality check could indeed have worked with proper broadcasting:
np.where(a[:,1] == values[..., np.newaxis])[1]
array([2, 3, 8])
EDIT: given that you seem to have issues with using the above results to index and manipulate your array, here are a couple of simple examples.
Now you should have two ways of accessing your matching elements in the original array, either the binary mask or the indices from np.where.
mask = np.isin(a[:,1], values) # np.in1d if np.isin is not available
idx = np.where(mask)
Let's say you want to set all matching rows to zero
a[mask] = 0 # or a[idx] = 0
array([[ 1, 97553, 1],
[ 1, 97587, 1],
[ 0, 0, 0],
[ 0, 0, 0],
[ 1, 97826, 3],
[ 1, 97832, 1],
[ 1, 97839, 1],
[ 1, 97887, 1],
[ 0, 0, 0],
[ 1, 97955, 2]])
Or you want to multiply the third column of matching rows by 100
a[mask, 2] *= 100
array([[ 1, 97553, 1],
[ 1, 97587, 1],
[ 1, 97612, 100],
[ 1, 97697, 100],
[ 1, 97826, 3],
[ 1, 97832, 1],
[ 1, 97839, 1],
[ 1, 97887, 1],
[ 1, 97944, 100],
[ 1, 97955, 2]])
Or you want to delete matching rows (here using indices is more convenient than masks)
np.delete(a, idx, axis=0)
array([[ 1, 97553, 1],
[ 1, 97587, 1],
[ 1, 97826, 3],
[ 1, 97832, 1],
[ 1, 97839, 1],
[ 1, 97887, 1],
[ 1, 97955, 2]])
Just a thought:
Try to flatten the 2D array and compare using numpy.intersect1d.
https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.flatten.html
https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.intersect1d.html
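A minimal sketch of that idea, assuming the a and values arrays from the question and NumPy >= 1.15 for the return_indices keyword:
import numpy as np

a_col = a[:, 1]   # the column being searched
common, idx_in_a, idx_in_values = np.intersect1d(a_col, values,
                                                 return_indices=True)
print(common)      # [97612 97697 97944]
print(idx_in_a)    # [2 3 8]  -- row indices into a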
I have a list of unique rows and another larger array of data (called test_rows in example). I was wondering if there was a faster way to get the location of each unique row in the data. The fastest way that I could come up with is...
import numpy
uniq_rows = numpy.array([[0, 1, 0],
                         [1, 1, 0],
                         [1, 1, 1],
                         [0, 1, 1]])
test_rows = numpy.array([[0, 1, 1],
                         [0, 1, 0],
                         [0, 0, 0],
                         [1, 1, 0],
                         [0, 1, 0],
                         [0, 1, 1],
                         [0, 1, 1],
                         [1, 1, 1],
                         [1, 1, 0],
                         [1, 1, 1],
                         [0, 1, 0],
                         [0, 0, 0],
                         [1, 1, 0]])
# this gives me the indexes of each group of unique rows
for row in uniq_rows.tolist():
    print(row, numpy.where((test_rows == row).all(axis=1))[0])
This prints...
[0, 1, 0] [ 1 4 10]
[1, 1, 0] [ 3 8 12]
[1, 1, 1] [7 9]
[0, 1, 1] [0 5 6]
Is there a better or more numpythonic (not sure if that word exists) way to do this? I was searching for a numpy group function but could not find one. Basically, for any incoming dataset I need the fastest way to get the locations of each unique row in that data set. The incoming dataset will not always contain every unique row, nor the same number of rows.
EDIT:
This is just a simple example. In my application the numbers would not be just zeros and ones; they could be anywhere from 0 to 32000. The size of uniq_rows could be between 4 and 128 rows and the size of test_rows could be in the hundreds of thousands.
Numpy
From version 1.13 of numpy you can use numpy.unique with the axis argument, e.g. np.unique(test_rows, return_counts=True, return_index=True, axis=0)
Pandas
import pandas as pd

df = pd.DataFrame(test_rows)
uniq = pd.DataFrame(uniq_rows)
uniq
   0  1  2
0  0  1  0
1  1  1  0
2  1  1  1
3  0  1  1
Or you could generate the unique rows automatically from the incoming DataFrame
uniq_generated = df.drop_duplicates().reset_index(drop=True)
yields
   0  1  2
0  0  1  1
1  0  1  0
2  0  0  0
3  1  1  0
4  1  1  1
and then look for it
d = dict()
for idx, row in uniq.iterrows():
    d[idx] = df.index[(df == row).all(axis=1)].values
This is about the same as your where method
d
{0: array([ 1, 4, 10], dtype=int64),
1: array([ 3, 8, 12], dtype=int64),
2: array([7, 9], dtype=int64),
3: array([0, 5, 6], dtype=int64)}
There are a lot of solutions here, but I'm adding one with vanilla numpy. In most cases numpy will be faster than list comprehensions and dictionaries, although the array broadcasting may cause memory to be an issue if large arrays are used.
np.where((uniq_rows[:, None, :] == test_rows).all(2))
Wonderfully simple, eh? This returns a tuple of unique-row indices and the corresponding test-row indices.
(array([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3]),
array([ 1, 4, 10, 3, 8, 12, 7, 9, 0, 5, 6]))
How it works:
(uniq_rows[:, None, :] == test_rows)
Uses array broadcasting to compare every row of uniq_rows with every row of test_rows, element by element. This results in a 4x13x3 boolean array. all is used to determine which row pairs are equal (all three comparisons returned True). Finally, where returns the indices of those pairs.
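If you want the indices regrouped per unique row rather than as flat pairs, one possible follow-up is the sketch below, reusing uniq_rows and test_rows from the question:
import numpy as np

uniq_idx, test_idx = np.where((uniq_rows[:, None, :] == test_rows).all(2))
groups = [test_idx[uniq_idx == i] for i in range(len(uniq_rows))]
print(groups)
# [array([ 1,  4, 10]), array([ 3,  8, 12]), array([7, 9]), array([0, 5, 6])]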
With the np.unique from v1.13 (downloaded from the source link in the latest documentation, https://github.com/numpy/numpy/blob/master/numpy/lib/arraysetops.py#L112-L247, and imported here as aset):
In [157]: aset.unique(test_rows, axis=0, return_inverse=True, return_index=True)
Out[157]:
(array([[0, 0, 0],
[0, 1, 0],
[0, 1, 1],
[1, 1, 0],
[1, 1, 1]]),
array([2, 1, 0, 3, 7], dtype=int32),
array([2, 1, 0, 3, 1, 2, 2, 4, 3, 4, 1, 0, 3], dtype=int32))
In [158]: a,b,c=_
In [159]: c
Out[159]: array([2, 1, 0, 3, 1, 2, 2, 4, 3, 4, 1, 0, 3], dtype=int32)
In [164]: from collections import defaultdict
In [165]: dd = defaultdict(list)
In [166]: for i,v in enumerate(c):
...: dd[v].append(i)
...:
In [167]: dd
Out[167]:
defaultdict(list,
{0: [2, 11],
1: [1, 4, 10],
2: [0, 5, 6],
3: [3, 8, 12],
4: [7, 9]})
or indexing the dictionary with the unique rows (as hashable tuples):
In [170]: dd = defaultdict(list)
In [171]: for i,v in enumerate(c):
...: dd[tuple(a[v])].append(i)
...:
In [172]: dd
Out[172]:
defaultdict(list,
{(0, 0, 0): [2, 11],
(0, 1, 0): [1, 4, 10],
(0, 1, 1): [0, 5, 6],
(1, 1, 0): [3, 8, 12],
(1, 1, 1): [7, 9]})
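For comparison, on a released NumPy the same grouping can be sketched without the Python loop (assuming >= 1.13 for the axis argument and >= 1.15 for kind='stable'):
import numpy as np

_, inv = np.unique(test_rows, axis=0, return_inverse=True)
inv = inv.ravel()                          # guard against a multi-dimensional inverse
order = np.argsort(inv, kind='stable')     # members of each group end up adjacent
splits = np.flatnonzero(np.diff(inv[order])) + 1
groups = np.split(order, splits)
print(groups)
# [array([ 2, 11]), array([ 1,  4, 10]), array([0, 5, 6]),
#  array([ 3,  8, 12]), array([7, 9])]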
This will do the job:
import numpy as np
uniq_rows = np.array([[0, 1, 0],
                      [1, 1, 0],
                      [1, 1, 1],
                      [0, 1, 1]])
test_rows = np.array([[0, 1, 1],
                      [0, 1, 0],
                      [0, 0, 0],
                      [1, 1, 0],
                      [0, 1, 0],
                      [0, 1, 1],
                      [0, 1, 1],
                      [1, 1, 1],
                      [1, 1, 0],
                      [1, 1, 1],
                      [0, 1, 0],
                      [0, 0, 0],
                      [1, 1, 0]])
indices = np.where(np.sum(np.abs(np.repeat(uniq_rows, len(test_rows), axis=0) - np.tile(test_rows, (len(uniq_rows), 1))), axis=1) == 0)[0]
loc = indices // len(test_rows)
indices = indices - loc * len(test_rows)
res = [[] for i in range(len(uniq_rows))]
for i in range(len(indices)):
    res[loc[i]].append(indices[i])
print(res)
[[1, 4, 10], [3, 8, 12], [7, 9], [0, 5, 6]]
This will work in all cases, including those where not all the rows in uniq_rows are present in test_rows. However, if you know ahead of time that all of them are present, you could replace the part
res = [[] for i in range(len(uniq_rows))]
for i in range(len(indices)):
    res[loc[i]].append(indices[i])
with the single line:
res=np.split(indices,np.where(np.diff(loc)>0)[0]+1)
Thus avoiding loops entirely.
Not very 'numpythonic', but for a bit of an upfront cost we can make a dict, with each key a tuple of a row and each value a list of indices:
test_rowsdict = {}
for i, j in enumerate(test_rows):
    test_rowsdict.setdefault(tuple(j), []).append(i)
test_rowsdict
{(0, 0, 0): [2, 11],
(0, 1, 0): [1, 4, 10],
(0, 1, 1): [0, 5, 6],
(1, 1, 0): [3, 8, 12],
(1, 1, 1): [7, 9]}
Then you can filter based on your uniq_rows, with a fast dict lookup: test_rowsdict[tuple(row)]:
out = []
for i in uniq_rows:
    out.append((i, test_rowsdict.get(tuple(i), [])))
For your data, I get 16us for just the lookup, and 66us for building and looking up, versus 95us for your np.where solution.
Approach #1
Here's one approach, though I'm not sure about the level of "NumPythonic-ness" for such a tricky problem -
def get1Ds(a, b):  # Get 1D views of each row from the two inputs
    # check that casting to void will create equal size elements
    assert a.shape[1:] == b.shape[1:]
    assert a.dtype == b.dtype
    # compute dtypes
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    # convert to 1d void arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    a_void = a.reshape(a.shape[0], -1).view(void_dt).ravel()
    b_void = b.reshape(b.shape[0], -1).view(void_dt).ravel()
    return a_void, b_void
def matching_row_indices(uniq_rows, test_rows):
    A, B = get1Ds(uniq_rows, test_rows)
    validA_mask = np.in1d(A, B)
    sidx_A = A.argsort()
    validA_mask = validA_mask[sidx_A]
    sidx = B.argsort()
    sortedB = B[sidx]
    split_idx = np.flatnonzero(sortedB[1:] != sortedB[:-1]) + 1
    all_split_indx = np.split(sidx, split_idx)
    match_mask = np.in1d(B, A)[sidx]
    valid_mask = np.logical_or.reduceat(match_mask, np.r_[0, split_idx])
    locations = [e for i, e in enumerate(all_split_indx) if valid_mask[i]]
    return uniq_rows[sidx_A[validA_mask]], locations
Scope(s) of improvement (on performance):
np.split could be replaced by a for-loop for splitting using slicing.
np.r_ could be replaced by np.concatenate.
Sample run -
In [331]: unq_rows, idx = matching_row_indices(uniq_rows, test_rows)
In [332]: unq_rows
Out[332]:
array([[0, 1, 0],
[0, 1, 1],
[1, 1, 0],
[1, 1, 1]])
In [333]: idx
Out[333]: [array([ 1, 4, 10]),array([0, 5, 6]),array([ 3, 8, 12]),array([7, 9])]
Approach #2
Another approach, which beats the setup overhead of the previous one while still making use of get1Ds from it, would be -
A, B = get1Ds(uniq_rows, test_rows)
idx_group = []
for row in A:
    idx_group.append(np.flatnonzero(B == row))
The numpy_indexed package (disclaimer: I am its author) was created to solve problems of this kind in an elegant and efficient manner:
import numpy_indexed as npi
indices = np.arange(len(test_rows))
unique_test_rows, index_groups = npi.group_by(test_rows, indices)
If you don't care about the indices of all rows, but only those present in test_rows, npi has a bunch of simple ways of tackling that problem too; for instance:
subset_indices = npi.indices(unique_test_rows, uniq_rows)
As a side note, it might be useful to take a look at the examples in the npi library; in my experience, most of the time when people ask a question of this kind, these grouped indices are just a means to an end, and not the end goal of the computation. Chances are that using the functionality in npi you can reach that end goal more efficiently, without ever explicitly computing those indices. Do you care to give some more background to your problem?
EDIT: if your arrays are indeed this big, and always consist of a small number of columns with binary values, wrapping them with the following encoding might boost efficiency a lot further still:
def encode(rows):
    return (rows * [[2**i for i in range(rows.shape[1])]]).sum(axis=1, dtype=np.uint8)
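A rough sketch of how that encoding might then be used, assuming binary-valued rows with at most 8 columns (so each code fits in a uint8) and the encode function above, along with test_rows and uniq_rows from the question:
import numpy as np

codes_test = encode(test_rows)   # one small integer per row
codes_uniq = encode(uniq_rows)

# Grouping/matching then becomes a plain 1-D problem, e.g.:
groups = [np.flatnonzero(codes_test == c) for c in codes_uniq]
# [array([ 1,  4, 10]), array([ 3,  8, 12]), array([7, 9]), array([0, 5, 6])]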