I'm stuck with matrices in numpy.
I need to create matrix, where sum by column will be not greater than one.
np.random.rand(3,3).round(2)
gives
array([[ 0.48, 0.73, 0.81],
[ 0.4 , 0.01, 0.32],
[ 0.44, 0.4 , 0.92]])
Is there a smart way to generate matrix with random numbers where sum by column will be not greater than one?
Thank you!
You could do this:
x = np.random.rand(3,3)
x /= np.sum(x, axis=0)
The rationale behind this is you're dividing every column by the sum of all the values. This ensures that all columns will add to 1.
Or, you could do:
x = np.random.rand(3,3)/3
Because every number will be between [0,1]. If you squish the domain to [0,1/3], then the sum is guarranteed to be <1.
It is generally unclear what you mean when you want a restriction to the numbers but still want them to be random.
You could always make sure your values to begin with are restricted:
>>> numcols=3
>>> np.random.uniform(0,1/numcols,9).reshape(3,3)
array([[ 0.26718273, 0.29798534, 0.0309619 ],
[ 0.10923526, 0.12371555, 0.03797226],
[ 0.15974434, 0.02435774, 0.30885667]])
For a square matrix this has the benefit (maybe not?) that it works simultaneously on rows as well. You can't have rows of (0,0,1) though.
Normalize the columns by a random value 0 < x <= 1, so that the sum by column will be not greater than one as you request:
>>> for i in range(3): a[:,i] = a[0,i] * a[:,i] / np.sum(a[:,i])
>>> a
array([[0.02041667, 0.42772512, 0.01939597],
[0.07875 , 0.17109005, 0.0433557 ],
[0.04083333, 0.35118483, 0.10724832]])
>>> np.sum(a,axis=0)
array([0.14, 0.95, 0.17])
Related
I find myself running into the following situation in numpy muliple times over the past couple of months, and I cannot imagine there is not a proper solution for it.
I have a 2d array, let's say
x = np.array([
[1, 2, 3],
[2, -5, .333],
[1, 4, 2],
[2, -5, 4]])
Now I would like to (sort/get the maximum/do argsort/argmax/ etc) this array in such a way that it first compares the first column. If the first column is equal, it compares the second column, and then the third. So this means for our example:
# max like python: max(x.tolist())
np.tuple_like_amax(x) = np.array([2, -5, 4])
# argmax does't have python equivalent, but something like: [i for i, e in enumerate(x.tolist()) if e == max(x.tolist())][0]
np.tuple_like_argmax = 3
# sorting like python: sorted(x.tolist())
np.tuple_like_sort(x) = np.array([[1.0, 2.0, 3.0], [1.0, 4.0, 2.0], [2.0, -5.0, 0.333], [2.0, -5.0, 4.0]])
# argsort doesn't have python equivalent, but something like: sorted(range(len(x)), key=lambda i: x[i].tolist())
np.tuple_like_argsort(x) = np.array([0, 2, 1, 3])
This is exactly the way how python compares tuples (so actually just calling max(x.tolist()) does the trick here for max. It does feel however like a time-and-memory waste to first convert the array to a python list, and in addition I would like to use things like argmax, sort and all the other great numpy functions.
So just to be clear, I'm not interested in python code that mimics an argmax, but for something that achieves this without converting the lists to python lists.
Found so far:
np.sort seems to work on structured arrays when order= is given. It does feel to me that creating a structured array and then using this method is overkill. Also, argmax doesn't seem to support this, meaning that one would have to use argsort, which has a much higher complexity.
Here I will focus only on finding the lexicographic argmax (the others: max, argmin, and min can be found trivially from argmax). In addition, unlike np.argmax(), we will return all rows that are at rank 0 (if there are duplicate rows), i.e. all the indices where the row is the lexicographic maximum.
The idea is that, for the "tuple-like order" desired here, the function is really:
find all indices where the first column has the maximum;
break ties with the places where the second column is max, under condition that the first column is max;
etc., as long as there are ties to break (and more columns).
def ixmax(x, k=0, idx=None):
col = x[idx, k] if idx is not None else x[:, k]
z = np.where(col == col.max())[0]
return z if idx is None else idx[z]
def lexargmax(x):
idx = None
for k in range(x.shape[1]):
idx = ixmax(x, k, idx)
if len(idx) < 2:
break
return idx
At first, I was worried that the explicit looping in Python would kill it. But it turns out that it is quite fast. In the case where there is no ties (more likely with independent float values, for instance), that returns immediately after a single np.where(x[:, 0] == x[:, 0].max()). Only in the case of ties do we need to look at the (much smaller) subset of rows that were tied. In unfavorable conditions (many repeated values in all columns), it is still ~100x or more than the partition method, and O(log n) faster than lexsort(), of course.
Test 1: correctness
for i in range(1000):
x = np.random.randint(0, 10, size=(1000, 8))
found = lexargmax(x)
assert lexargmax_by_sort(x) in found and np.unique(x[found], axis=0).shape[0] == 1
(where lexargmark_by_sort is np.lexsort(x[:, ::-1].T)[-1])
Test 2: speed
x = np.random.randint(0, 10, size=(100_000, 100))
a = %timeit -o lexargmax(x)
# 776 µs ± 313 ns per loop
b = %timeit -o lexargmax_by_sort(x)
# 507 ms ± 2.65 ms per loop
# b.average / a.average: 652
c = %timeit -o lexargmax_by_partition(x)
# 141 ms ± 2.38 ms
# c.average / a.average: 182
(where lexargmark_by_partition is based on #MadPhysicist very elegant idea:
def lexargmax_by_partition(x):
view = np.ndarray(x.shape[0], dtype=[('', x.dtype)] * x.shape[1], buffer=x)
return np.argpartition(view, -1)[-1]
)
After some more testing on various sizes, we get the following time measurements and performance ratios:
In the LHS plot, lexargmax is the group shown with 'o-' and lexargmax_by_partition is the upper group of lines.
In the RHS plot, we just show the speed ratio.
Interestingly, lexargmax_by_partition execution time seems fairly independent of m, the number of columns, whereas our lexargmax depends a little bit on it. I believe that is reflecting the fact that, in this setting (purposeful collisions of max in each column), the more columns we have, the "deeper" we need to go when breaking ties.
Previous (wrong) answer
To find the argmax of the row by lexicographic order, I was thinking you could do:
def lexmax(x):
r = (2.0 ** np.arange(x.shape[1]))[::-1]
return np.argmax(((x == x.max(axis=0)) * r).sum(axis=1))
Explanation:
x == x.max(axis=0) (as an int) is 1 for each element that is equal to the column's max. In your example, it is (astype(int)):
[[0 0 0]
[1 0 0]
[0 1 0]
[1 0 1]]
then we multiply by a column weight that is more than the sum of 1's on the right. Powers of two achieve that. We do it in float to address cases with more than 64 columns.
But this is fatally flawed: The positions of max in the second column should be considered only in the subset where the first column had the max value (to break the tie).
Other approaches including affine transformations of all columns so that we can sum them and find the max don't work either: if the max in column 0 is, say, 1.0, and there is a second place at 0.999, then we would have to know ahead of time that difference of 0.001 and make sure no combination of values from the columns to the right to sum up to overtake that difference. So, that's a dead end.
To sort a list by the contents of a row, you can use np.lexsort. The only catch is that it sorts by the last element of the selected axis first:
index = np.lexsort(x.T[::-1])
OR
index = np.lexsort(x[:, ::-1].T)
This is "argsort". You can make it into "sort" by doing
x[index]
"min" and "max" can be done trivially by using the index:
xmin = x[index[0]]
xmax = x[index[-1]]
Alternatively, you can use a technique I suggested in one of my questions: Sorting array of objects by row using custom dtype. The idea is to make each row into a structure that has a field for each element:
view = np.ndarray(x.shape[0], dtype=[('', x.dtype)] * x.shape[1], buffer=x)
You can sort the array in-place by operating on the view:
>>> view.sort()
>>> x
array([[ 1. , 2. , 3. ],
[ 1. , 4. , 2. ],
[ 2. , -5. , 0.333],
[ 2. , -5. , 4. ]])
That's because the ndarray constructor points to x as the original buffer.
You can not get argmin, argmax, min and max to work on the result. However, you can still get the min and max in O(n) time using my favorite function in all of numpy: np.partition:
view.partition([0, -1])
xmin = x[0]
xmax = x[-1]
You can use argpartition on the array as well to get the indices of the desired elements:
index = view.argpartition([0, -1])[[0, -1]]
xmin = x[index[0]]
xmax = x[index[-1]]
Notice that both sort and partition have an order argument that you can use to rearrange the comparison of the columns.
For a set of matrices that I have, called distance_matrix (these exist in a function that then generates all of them over a given range). I need to find the smallest value in this matrix, obviously notated by an index pair,for which i have this code:
min_indices = np.unravel_index(np.argmin(np.abs(distance_matrix)),np.shape(distance_matrix))
This works perfectly well. But now i need to write something that finds the second lowest value if the indices returned by the above code is (0,0). I guess i cant use the code above, since you cant modify it to find the next value (as far as i can tell). Ive tried with an if-loop, but thats not quite working:
sdiff = np.diff(np.sign(np.diff(distance_matrix)))
rising_1 = (sdiff == 2)
rising_2 = (sdiff[:-1] == 1) & (sdiff[1:] == 1)
rising_all = rising_1
rising_all[1:] = rising_all[1:] | rising_2
min_ind = np.where(rising_all)[0] + 1
minima = list(zip(min_ind, distance_matrix[min_ind]))
for ind_pair in range(0,len(minima)):
if ind_pair ==(0,0):
minima=sorted(minima, key=lambda pair: pair[1])[1]
else:
minima=sorted(minima, key=lambda pair: pair[1])[0]
Supposing that the distance matrix is 2 dimensional then, and using the following test data:
distance_matrix = np.array([[0. , 1. , 2. ],
[1. , 0.5, 1.5],
[2. , 1.5, 2. ]])
Now,
np.unravel_index(
np.argmin(np.abs(distance_matrix)),
np.shape(distance_matrix)
)
returns (0, 0) for you, which is what you don't want, of course. But is there a reason you can't just make that impossible by using the following:
mask = np.ones(np.shape(distance_matrix))
mask[0, 0] = np.nan # you can put this in a loop if there is
# more than one coordinate set you don't want
distance_matrix * mask
# array([[nan, 1. , 2. ],
# [1. , 0.5, 1.5],
# [2. , 1.5, 2. ]])
np.unravel_index(
np.nanargmin(np.abs(distance_matrix * mask)),
np.shape(distance_matrix)
)
# (1, 1)
Note that nanargmin is a version of argmin that ignores NaNs.
You can find the indices of the first two smallest elements at once using np.argpartition. Partitioning on the second element of the array guarantees that the first element will be smaller, so you get both.
indices2 = np.unravel_index(
np.argpartition(np.abs(distance_matrix), 1, None),
distance_matrix.shape)
You must pass an axis of None to ravel the index instead of partitioning the columns. Otherwise you will not be able to call np.unravel_index directly on the result.
If you want each index separately:
i = np.argpartition(np.abs(distance_matrix), 1, None)
indices = np.unravel_index(i[0], distance_matrix.shape)
indices2 = np.unravel_index(i[1], distance_matrix.shape)
If you want to get more minima, you will have to sort the left side of the array.
k = 5
i = np.argpartition(np.abs(distance_matrix), k, None)[:k]
i.sort()
indices = np.unravel_index(i, distance_matrix.shape)
I have this array:
a=array([[0. , 0.3, 0.2],
[0.5, 0. , 0.1]])
and this custom min method:
def custom_min(l_):
return min([x for x in l_ if x>0])
How do I apply that over the rows to select some of them? For example, if for a row custom_min > 0.1, that row should be selected: i.e.,
b = [[0. , 0.3, 0.2]]
To be clear, I am looking for methods like this:
a[a[:,1] > 0.1]
First, you can use numpy.apply_along_axis to apply custom_min to each row.
I'd also rewrite custom_min to be more numpythonic: return min(l_[l_ > 0]).
Now that you have that custom min in a vector, you can again use logical indexing:
row_mask = result > 0.3 and filtered_array = a[row_mask, :]
EDIT:
Thinking a bit more about how to make everything use only numpy vectorized functions. We can first use numpy.where to replace everything smaller than 0 with infinity. That takes it out of consideration for the minimum:
row_wise_custom_mins = np.min(np.where(a > 0, a, np.inf), axis=1)
The "where" picks values form a if the condition is true and picks np.inf if the condition is false. Then we pick the minimum (along axis 1) and that's it.
My task is to create a program that simulates a discrete time Markov Chain, for an arbitrary number of events. However, right now the part I'm struggling with is creating the right stochastic matrix that will represent the probabilities. A right stochastic matrix is a matrix that has row entries that sum to 1. And for a given size, I kind of know how to write the matrix that does that, however, the problem is that I don't know how to do that for an arbitrary size.
Any help is appreciated.
(Note that this isn't a homework problem, it's only for extra credit in my Math class and the professor doesn't mind the use of outside sources.)
Using #MBo's idea:
In [16]: matrix = np.random.rand(3,3)
In [17]: matrix/matrix.sum(axis=1)[:,None]
Out[17]:
array([[ 0.25429337, 0.22502947, 0.52067716],
[ 0.17744651, 0.42358254, 0.39897096],
[ 0.36179247, 0.28707039, 0.35113714]])
In [18]:
Generate NxN matrix with random values.
For every row:
Find sum of row S
S[j] = Sum(0..N-1){A[j, i]}
Then subtract (S-1)/N from every value in this row
A[j, i] = A[j, i] - (S[j] - 1) / N
If you need only non-negative values, generate non-negative randoms, and divide every value in row by sum of this row
A[j, i] = A[j, i] / S[j]
Here is some code:
import random
precision = 1000000
def f(n) :
matrix = []
for l in range(n) :
lineLst = []
sum = 0
crtPrec = precision
for i in range(n-1) :
val = random.randrange(crtPrec)
sum += val
lineLst.append(float(val)/precision)
crtPrec -= val
lineLst.append(float(precision - sum)/precision)
matrix.append(lineLst)
return matrix
matrix = f(5)
print matrix
I assumed the random numbers have to be positive, the sum of numbers on a raw has to be 1. I used a precision give in variable 'precision', if this is 1000 it means that the random numbers will have 3 digits after the comma. In y example 6 digits are used, you may use more.
Output:
[[0.086015, 0.596464, 0.161664, 0.03386, 0.121997],
[0.540478, 0.040961, 0.374275, 0.003793, 0.040493],
[0.046263, 0.249761, 0.460089, 0.006739, 0.237148],
[0.594743, 0.125554, 0.142809, 0.056124, 0.08077],
[0.746161, 0.151382, 0.068062, 0.005772, 0.028623]]
A right stochastic matrix is a real square matrix, with each row summing to 1.
Here's a sample you can create a function from, I leave that to you as homework
In [26]: import numpy as np
In [27]: N, M = 5, 5
In [28]: matrix = np.random.rand(N, M)
In [29]: matrix
Out[29]:
array([[ 0.27926909, 0.37026136, 0.35978443, 0.75216853, 0.53517512],
[ 0.93285517, 0.54825643, 0.43948394, 0.15134782, 0.31310007],
[ 0.91934362, 0.51707873, 0.3604323 , 0.78487053, 0.85757986],
[ 0.53595238, 0.80467646, 0.88001499, 0.4668259 , 0.63567632],
[ 0.83359167, 0.41603073, 0.21192656, 0.22650423, 0.95721952]])
In [30]: matrix = np.apply_along_axis(lambda x: x - (np.sum(x) - 1)/len(x), 1, matrix)
In [31]: matrix
Out[31]:
array([[ 0.01993739, 0.11092965, 0.10045272, 0.49283682, 0.27584341],
[ 0.65584649, 0.27124774, 0.16247526, -0.12566087, 0.03609139],
[ 0.43148261, 0.02921772, -0.12742871, 0.29700952, 0.36971886],
[ 0.07132317, 0.34004725, 0.41538578, 0.00219669, 0.17104711],
[ 0.50453713, 0.08697618, -0.11712798, -0.10255031, 0.62816498]])
Explanation
We create an N x M matrix
We then calculate the (sum - 1) / N to be subtracted from each item row-wise
Then we apply it to each row of the matrix by using np.apply_along_axis() with axis=1 to be applied on each row
Verify the result
Each row needs to sum up to 1
In [37]: matrix.sum(axis=1)
Out[37]: array([ 1., 1., 1., 1., 1.])
but how do I subtract that value from each entry in the row?
In my example I've used a lambda that is equivalent to this function
def subtract_value(x):
return x - (np.sum(x) - 1)/len(x)
You can pass a function to apply_along_axis() to be called on each element on the axis, in our case it's the rows
There are other ways too like numpy.vectorize() and numpy.frompyfunc
Making a function and apply it like any method from the above is better than looping through each item in each row, faster and less code, easier to read / understand the intent
One small point has been missed. A stochastic matrix is an M x N matrix of non-negative elements which rows sum to 1.0. MBo comment above states that:
If you need only non-negative values, generate non-negative randoms,
and divide every value in row by sum of this row
A[j, i] = A[j, i] / S[j]
This is only true if the stored matrix is comprised entirely of whole numbers (not necessarily integers). Otherwise the resulting matrix may contain negative numbers, the larger the matrix, the more the negative elements.
This can be accomplished using:
X[i, j] = Math.Abs(random.Next(100, 900));
Say I have an array:
values = np.array([1.1,2.2,3.3,4.4,2.1,8.4])
I want to round these values to members of an arbitrary array, say:
rounds = np.array([1.,3.5,5.1,6.7,9.2])
ideally returning an array of rounded numbers and an array of the residues:
rounded = np.array([1.,1.,3.5,5.1,1.,9.2])
residues = np.array([-0.1,-1.2,0.2,0.7,-1.1,0.6])
Is there a good pythonic way of doing this?
One option is this:
>>> x = np.subtract.outer(values, rounds)
>>> y = np.argmin(abs(x), axis=1)
And then rounded and residues are, respectively:
>>> rounds[y]
array([ 1. , 1. , 3.5, 5.1, 1. , 9.2])
>>> rounds[y] - values
array([-0.1, -1.2, 0.2, 0.7, -1.1, 0.8])
Essentially x is a 2D array of every value in values minus every value in rounds. y is a 1D array of the index of the minimum absolute value of each row of x. This y is then used to index rounds.
I should caveat this answer by noting that if len(values) * len(rounds) is big (e.g. starting to exceed 10e8), memory usage may start to become of concern. In this case, you could consider building up y iteratively instead to avoid having to allocate a large block of memory to x.
As the items in rounds array are sorted(or if not sort them) we can do this is O(n logn) time using numpy.searchsorted:
from functools import partial
def closest(rounds, x):
ind = np.searchsorted(rounds, x, side='right')
length = len(rounds)
if ind in (0, length) :
return rounds[ind]
else:
left, right = rounds[ind-1], rounds[ind]
val = min((left, right), key=lambda y:abs(x-y))
return val
f = partial(closest, rounds)
rounded = np.apply_along_axis(f, 1, values[:,None])[:,0]
residues = rounded - values
print repr(rounded)
print repr(residues)
Output:
array([ 1. , 1. , 3.5, 5.1, 1. , 9.2])
array([-0.1, -1.2, 0.2, 0.7, -1.1, 0.8])
The same time complexity as the answer by Ashwini Chaudhary, but fully vectorized:
def round_to(rounds, values):
# The main speed is in this line
I = np.searchsorted(rounds, values)
# Pad so that we can index easier
rounds_p = np.pad(rounds, 1, mode='edge')
# We have to decide between I and I+1
rounded = np.vstack([rounds_p[I], rounds_p[I+1]])
residues = rounded - values
J = np.argmin(np.abs(residues), axis=0)
K = np.arange(len(values))
return rounded[J,K], residues[J,K]
Find the closest number of x in rounds:
def findClosest(x,rounds):
return rounds[np.argmin(np.absolute(rounds-x))]
Loop over all values:
rounded = [findClosest(x,rounds) for x in values]
residues = values - rounded
This is a straightforward method, but you can be more efficient using that your rounds array is ordered.
def findClosest(x,rounds):
for n in range(len(rounds)):
if x > rounds[n]:
if n == 0:
return rounds[n]
elif rounds[n]-x > x-rounds[n-1]:
return rounds[n-1]
else:
return rounds[n]
return rounds[-1]
This might be, but not necessarily is faster than the argmin approach because you lose time with the python for loop, but you don't have to check along the whole rounds array.
The selected answer is already great. This one may seem convoluted to those that aren't necessarily used to more complex list-comprehensions, but otherwise it's actually quite clear (IMO) if you're familiar with it.
(Interestingly enough, this happens to run faster than the selected answer. Why would the numPy version be slower than this? Hmm... )
values = np.array([1.1,2.2,3.3,4.4,2.1,8.4])
rounds = np.array([1.,3.5,5.1,6.7,9.2])
rounded, residues = zip(*[
[
(rounds[cIndex]),
(dists[cIndex])
]
for v in values
for dists in [[r-v for r in rounds]]
for absDists in [[abs(d) for d in dists]]
for cIndex in [absDists.index(min(absDists))]
])
print np.array(rounded)
print np.array(residues)