I have run into the following situation in numpy multiple times over the past couple of months, and I cannot imagine there is not a proper solution for it.
I have a 2d array, let's say
x = np.array([
[1, 2, 3],
[2, -5, .333],
[1, 4, 2],
[2, -5, 4]])
Now I would like to sort / take the maximum / argsort / argmax / etc. of this array in such a way that rows are compared on the first column first. If the first columns are equal, it compares the second column, and then the third. So this means for our example:
# max like python: max(x.tolist())
np.tuple_like_amax(x) = np.array([2, -5, 4])
# argmax doesn't have a python equivalent, but something like: [i for i, e in enumerate(x.tolist()) if e == max(x.tolist())][0]
np.tuple_like_argmax(x) = 3
# sorting like python: sorted(x.tolist())
np.tuple_like_sort(x) = np.array([[1.0, 2.0, 3.0], [1.0, 4.0, 2.0], [2.0, -5.0, 0.333], [2.0, -5.0, 4.0]])
# argsort doesn't have python equivalent, but something like: sorted(range(len(x)), key=lambda i: x[i].tolist())
np.tuple_like_argsort(x) = np.array([0, 2, 1, 3])
This is exactly the way Python compares tuples (so actually just calling max(x.tolist()) does the trick here for max). It does, however, feel like a waste of time and memory to first convert the array to a Python list, and in addition I would like to use argmax, sort, and all the other great numpy functions.
So just to be clear, I'm not interested in python code that mimics an argmax, but in something that achieves this without converting the array to a python list.
Found so far:
np.sort seems to work on structured arrays when order= is given. It does feel to me that creating a structured array and then using this method is overkill. Also, argmax doesn't seem to support this, meaning that one would have to use argsort, which has a much higher complexity.
Here I will focus only on finding the lexicographic argmax (the others: max, argmin, and min can be found trivially from argmax). In addition, unlike np.argmax(), we will return all rows that are at rank 0 (if there are duplicate rows), i.e. all the indices where the row is the lexicographic maximum.
The idea is that, for the "tuple-like order" desired here, the function is really:
find all indices where the first column has the maximum;
break ties with the places where the second column is max, under the condition that the first column is max;
etc., as long as there are ties to break (and more columns).
def ixmax(x, k=0, idx=None):
    col = x[idx, k] if idx is not None else x[:, k]
    z = np.where(col == col.max())[0]
    return z if idx is None else idx[z]

def lexargmax(x):
    idx = None
    for k in range(x.shape[1]):
        idx = ixmax(x, k, idx)
        if len(idx) < 2:
            break
    return idx
At first, I was worried that the explicit looping in Python would kill it. But it turns out that it is quite fast. In the case where there are no ties (more likely with independent float values, for instance), it returns immediately after a single np.where(x[:, 0] == x[:, 0].max()). Only in the case of ties do we need to look at the (much smaller) subset of rows that were tied. In unfavorable conditions (many repeated values in all columns), it is still ~100x faster or more than the partition method, and O(log n) faster than lexsort(), of course.
Test 1: correctness
for i in range(1000):
    x = np.random.randint(0, 10, size=(1000, 8))
    found = lexargmax(x)
    assert lexargmax_by_sort(x) in found and np.unique(x[found], axis=0).shape[0] == 1
(where lexargmax_by_sort is np.lexsort(x[:, ::-1].T)[-1])
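For completeness, the baseline written out as a function, exactly as defined above:
def lexargmax_by_sort(x):
    # full lexicographic sort (last key compared first, hence the column reversal),
    # then take the index of the largest row
    return np.lexsort(x[:, ::-1].T)[-1]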
Test 2: speed
x = np.random.randint(0, 10, size=(100_000, 100))
a = %timeit -o lexargmax(x)
# 776 µs ± 313 ns per loop
b = %timeit -o lexargmax_by_sort(x)
# 507 ms ± 2.65 ms per loop
# b.average / a.average: 652
c = %timeit -o lexargmax_by_partition(x)
# 141 ms ± 2.38 ms
# c.average / a.average: 182
(where lexargmax_by_partition is based on @MadPhysicist's very elegant idea:
def lexargmax_by_partition(x):
    view = np.ndarray(x.shape[0], dtype=[('', x.dtype)] * x.shape[1], buffer=x)
    return np.argpartition(view, -1)[-1]
)
After some more testing on various sizes, we get the following time measurements and performance ratios:
In the LHS plot, lexargmax is the group shown with 'o-' and lexargmax_by_partition is the upper group of lines.
In the RHS plot, we just show the speed ratio.
Interestingly, the execution time of lexargmax_by_partition seems fairly independent of m, the number of columns, whereas our lexargmax depends a little bit on it. I believe that reflects the fact that, in this setting (purposeful collisions of max in each column), the more columns we have, the "deeper" we need to go when breaking ties.
Previous (wrong) answer
To find the argmax of the row by lexicographic order, I was thinking you could do:
def lexmax(x):
    r = (2.0 ** np.arange(x.shape[1]))[::-1]
    return np.argmax(((x == x.max(axis=0)) * r).sum(axis=1))
Explanation:
x == x.max(axis=0) (as an int) is 1 for each element that is equal to the column's max. In your example, it is (astype(int)):
[[0 0 0]
[1 0 0]
[0 1 0]
[1 0 1]]
then we multiply by a column weight that is more than the sum of 1's on the right. Powers of two achieve that. We do it in float to address cases with more than 64 columns.
But this is fatally flawed: The positions of max in the second column should be considered only in the subset where the first column had the max value (to break the tie).
Other approaches, including affine transformations of all columns so that we can sum them and find the max, don't work either: if the max in column 0 is, say, 1.0, and there is a second place at 0.999, then we would have to know that difference of 0.001 ahead of time and make sure no combination of values from the columns to the right can sum up to overtake that difference. So, that's a dead end.
To sort the array by the contents of each row, you can use np.lexsort. The only catch is that it sorts by the last element of the selected axis first:
index = np.lexsort(x.T[::-1])
OR
index = np.lexsort(x[:, ::-1].T)
This is "argsort". You can make it into "sort" by doing
x[index]
"min" and "max" can be done trivially by using the index:
xmin = x[index[0]]
xmax = x[index[-1]]
Alternatively, you can use a technique I suggested in one of my questions: Sorting array of objects by row using custom dtype. The idea is to make each row into a structure that has a field for each element:
view = np.ndarray(x.shape[0], dtype=[('', x.dtype)] * x.shape[1], buffer=x)
You can sort the array in-place by operating on the view:
>>> view.sort()
>>> x
array([[ 1. , 2. , 3. ],
[ 1. , 4. , 2. ],
[ 2. , -5. , 0.333],
[ 2. , -5. , 4. ]])
That's because the ndarray constructor points to x as the original buffer.
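A quick sanity check of that buffer sharing (np.shares_memory is a standard numpy helper; this line is purely illustrative):
>>> np.shares_memory(view, x)
True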
You cannot get argmin, argmax, min and max to work on the result. However, you can still get the min and max in O(n) time using my favorite function in all of numpy: np.partition:
view.partition([0, -1])
xmin = x[0]
xmax = x[-1]
You can use argpartition on the array as well to get the indices of the desired elements:
index = view.argpartition([0, -1])[[0, -1]]
xmin = x[index[0]]
xmax = x[index[-1]]
Notice that both sort and partition have an order argument that you can use to rearrange the comparison of the columns.
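For example, a minimal sketch of using order with named fields (the field names 'a', 'b', 'c' are just illustrative, assuming x has three columns as in the example above):
named = np.ndarray(x.shape[0],
                   dtype=[('a', x.dtype), ('b', x.dtype), ('c', x.dtype)],
                   buffer=x)
named.sort(order=['c', 'b', 'a'])   # compare column 2 first, then 1, then 0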
For a set of matrices that I have, called distance_matrix (these exist in a function that then generates all of them over a given range), I need to find the smallest value in the matrix, identified by an index pair, for which I have this code:
min_indices = np.unravel_index(np.argmin(np.abs(distance_matrix)), np.shape(distance_matrix))
This works perfectly well. But now I need to write something that finds the second lowest value if the indices returned by the code above are (0, 0). I guess I can't use the code above, since you can't modify it to find the next value (as far as I can tell). I've tried with an if inside a loop, but that's not quite working:
sdiff = np.diff(np.sign(np.diff(distance_matrix)))
rising_1 = (sdiff == 2)
rising_2 = (sdiff[:-1] == 1) & (sdiff[1:] == 1)
rising_all = rising_1
rising_all[1:] = rising_all[1:] | rising_2
min_ind = np.where(rising_all)[0] + 1
minima = list(zip(min_ind, distance_matrix[min_ind]))
for ind_pair in range(0, len(minima)):
    if ind_pair == (0, 0):
        minima = sorted(minima, key=lambda pair: pair[1])[1]
    else:
        minima = sorted(minima, key=lambda pair: pair[1])[0]
Supposing that the distance matrix is 2-dimensional, and using the following test data:
distance_matrix = np.array([[0. , 1. , 2. ],
[1. , 0.5, 1.5],
[2. , 1.5, 2. ]])
Now,
np.unravel_index(
np.argmin(np.abs(distance_matrix)),
np.shape(distance_matrix)
)
returns (0, 0) for you, which is what you don't want, of course. But is there a reason you can't just make that impossible by using the following:
mask = np.ones(np.shape(distance_matrix))
mask[0, 0] = np.nan # you can put this in a loop if there is
# more than one coordinate set you don't want
distance_matrix * mask
# array([[nan, 1. , 2. ],
# [1. , 0.5, 1.5],
# [2. , 1.5, 2. ]])
np.unravel_index(
np.nanargmin(np.abs(distance_matrix * mask)),
np.shape(distance_matrix)
)
# (1, 1)
Note that nanargmin is a version of argmin that ignores NaNs.
You can find the indices of the first two smallest elements at once using np.argpartition. Partitioning on the second element of the array guarantees that the first element will be smaller, so you get both.
indices2 = np.unravel_index(
np.argpartition(np.abs(distance_matrix), 1, None),
distance_matrix.shape)
You must pass an axis of None to ravel the index instead of partitioning the columns. Otherwise you will not be able to call np.unravel_index directly on the result.
If you want each index separately:
i = np.argpartition(np.abs(distance_matrix), 1, None)
indices = np.unravel_index(i[0], distance_matrix.shape)
indices2 = np.unravel_index(i[1], distance_matrix.shape)
If you want to get more minima, you will have to sort the left side of the array.
k = 5
i = np.argpartition(np.abs(distance_matrix), k, None)[:k]
i.sort()
indices = np.unravel_index(i, distance_matrix.shape)
I'm stuck with matrices in numpy.
I need to create a matrix where the sum of each column is not greater than one.
np.random.rand(3,3).round(2)
gives
array([[ 0.48, 0.73, 0.81],
[ 0.4 , 0.01, 0.32],
[ 0.44, 0.4 , 0.92]])
Is there a smart way to generate a matrix of random numbers where the sum of each column is not greater than one?
Thank you!
You could do this:
x = np.random.rand(3,3)
x /= np.sum(x, axis=0)
The rationale behind this is that you're dividing every column by the sum of its values. This ensures that every column adds up to exactly 1.
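A quick verification that the columns now sum to one (up to floating-point rounding):
x = np.random.rand(3, 3)
x /= np.sum(x, axis=0)
print(np.sum(x, axis=0))   # -> [1. 1. 1.]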
Or, you could do:
x = np.random.rand(3,3)/3
Because every number will be in [0, 1). If you squish the domain to [0, 1/3), then each column sum is guaranteed to be < 1.
It is generally unclear what you mean when you want the numbers restricted but still want them to be random.
You could always make sure your values to begin with are restricted:
>>> numcols=3
>>> np.random.uniform(0,1/numcols,9).reshape(3,3)
array([[ 0.26718273, 0.29798534, 0.0309619 ],
[ 0.10923526, 0.12371555, 0.03797226],
[ 0.15974434, 0.02435774, 0.30885667]])
For a square matrix this has the benefit (maybe not?) that it works on rows as well. You can't have rows like (0, 0, 1), though.
Normalize each column by its sum and scale it by a random value 0 < x <= 1 (here the column's own first entry, a[0, i], with a = np.random.rand(3, 3)), so that the sum of each column will be not greater than one, as you request:
>>> for i in range(3): a[:,i] = a[0,i] * a[:,i] / np.sum(a[:,i])
>>> a
array([[0.02041667, 0.42772512, 0.01939597],
[0.07875 , 0.17109005, 0.0433557 ],
[0.04083333, 0.35118483, 0.10724832]])
>>> np.sum(a,axis=0)
array([0.14, 0.95, 0.17])
Suppose I have a 1D numpy array (A) containing 5 elements:
A = np.array([ -4.0, 5.0, -3.5, 5.4, -5.9])
I need to add 5 to all the elements of A that are less than zero. What is the numpy way to do this without for-looping?
It can be done using mask:
A[A < 0] += 5
The way it works: the expression A < 0 returns a boolean array, where each cell holds the result of applying the predicate to the matching cell of A. In the current example:
A < 0 # [ True False True False True]
Then the action is applied only to the cells that match the predicate, i.e. only to the True cells.
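For example, on the array from the question:
A = np.array([-4.0, 5.0, -3.5, 5.4, -5.9])
A[A < 0] += 5
print(A)   # [ 1.   5.   1.5  5.4 -0.9]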
I found another answer:
A = np.where(A<0, A+5, A)
Say I have an array:
values = np.array([1.1,2.2,3.3,4.4,2.1,8.4])
I want to round these values to members of an arbitrary array, say:
rounds = np.array([1.,3.5,5.1,6.7,9.2])
ideally returning an array of rounded numbers and an array of the residues:
rounded = np.array([1.,1.,3.5,5.1,1.,9.2])
residues = np.array([-0.1,-1.2,0.2,0.7,-1.1,0.8])
Is there a good pythonic way of doing this?
One option is this:
>>> x = np.subtract.outer(values, rounds)
>>> y = np.argmin(abs(x), axis=1)
And then rounded and residues are, respectively:
>>> rounds[y]
array([ 1. , 1. , 3.5, 5.1, 1. , 9.2])
>>> rounds[y] - values
array([-0.1, -1.2, 0.2, 0.7, -1.1, 0.8])
Essentially x is a 2D array of every value in values minus every value in rounds. y is a 1D array of the index of the minimum absolute value of each row of x. This y is then used to index rounds.
I should caveat this answer by noting that if len(values) * len(rounds) is big (e.g. starting to exceed 10e8), memory usage may start to become a concern. In this case, you could consider building up y iteratively instead, to avoid having to allocate a large block of memory to x.
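A minimal sketch of that chunked idea (the helper name and chunk size are just illustrative, assuming the same values and rounds arrays as above):
def nearest_indices_chunked(values, rounds, chunk=10_000):
    # Same result as argmin over the full outer difference, but only a
    # (chunk, len(rounds)) block is materialized at a time.
    y = np.empty(len(values), dtype=np.intp)
    for start in range(0, len(values), chunk):
        block = values[start:start + chunk]
        y[start:start + chunk] = np.argmin(
            np.abs(np.subtract.outer(block, rounds)), axis=1)
    return y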
As the items in the rounds array are sorted (or if not, sort them), we can do this in O(n log n) time using numpy.searchsorted:
from functools import partial

def closest(rounds, x):
    # index of the first element of rounds that is > x
    ind = np.searchsorted(rounds, x, side='right')
    length = len(rounds)
    if ind == 0:
        return rounds[0]
    elif ind == length:
        return rounds[-1]
    else:
        left, right = rounds[ind-1], rounds[ind]
        return min((left, right), key=lambda y: abs(x - y))

f = partial(closest, rounds)
rounded = np.apply_along_axis(f, 1, values[:, None])[:, 0]
residues = rounded - values
print(repr(rounded))
print(repr(residues))
Output:
array([ 1. , 1. , 3.5, 5.1, 1. , 9.2])
array([-0.1, -1.2, 0.2, 0.7, -1.1, 0.8])
The same time complexity as the answer by Ashwini Chaudhary, but fully vectorized:
def round_to(rounds, values):
    # The main speed is in this line
    I = np.searchsorted(rounds, values)
    # Pad so that we can index easier
    rounds_p = np.pad(rounds, 1, mode='edge')
    # We have to decide between I and I+1
    rounded = np.vstack([rounds_p[I], rounds_p[I+1]])
    residues = rounded - values
    J = np.argmin(np.abs(residues), axis=0)
    K = np.arange(len(values))
    return rounded[J, K], residues[J, K]
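Applied to the arrays from the question:
rounded, residues = round_to(rounds, values)
# rounded  -> [1.  1.  3.5 5.1 1.  9.2]
# residues -> [-0.1 -1.2  0.2  0.7 -1.1  0.8]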
Find the number in rounds closest to x:
def findClosest(x, rounds):
    return rounds[np.argmin(np.absolute(rounds - x))]
Loop over all values:
rounded = [findClosest(x,rounds) for x in values]
residues = np.array(rounded) - values
This is a straightforward method, but you can be more efficient by exploiting the fact that your rounds array is ordered.
def findClosest(x, rounds):
    for n in range(len(rounds)):
        if x < rounds[n]:
            if n == 0:
                return rounds[n]
            elif rounds[n] - x > x - rounds[n-1]:
                return rounds[n-1]
            else:
                return rounds[n]
    return rounds[-1]
This might be faster than the argmin approach, but not necessarily: you lose time to the Python for loop, but you don't have to scan the whole rounds array.
The selected answer is already great. This one may seem convoluted to those who aren't used to more complex list comprehensions, but otherwise it's actually quite clear (IMO) if you're familiar with them.
(Interestingly enough, this happens to run faster than the selected answer. Why would the NumPy version be slower than this? Hmm...)
values = np.array([1.1,2.2,3.3,4.4,2.1,8.4])
rounds = np.array([1.,3.5,5.1,6.7,9.2])
rounded, residues = zip(*[
    [
        rounds[cIndex],
        dists[cIndex]
    ]
    for v in values
    for dists in [[r - v for r in rounds]]
    for absDists in [[abs(d) for d in dists]]
    for cIndex in [absDists.index(min(absDists))]
])
print(np.array(rounded))
print(np.array(residues))