I am solving a problem in which I am using two large matrices A and B. Matrix A is formed by ones and zeros and matrix B is formed by integers in the range [0,...,10].
At some point, I have to update the components of A such that, if a component is 1, it stays at 1. If the component is 0, it stays at 0 with probability p or changes to 1 with probability 1-p. The parameter p depends on the same component of matrix B, i.e., I have a list of probabilities such that, when I update A[i,j], p equals probabilities[B[i,j]].
I can do the updating with the following code:
import numpy as np
for i in range(n):
    for j in range(n):
        if A[i,j]==0:  # only zeros may change; ones stay as they are
            A[i,j]=np.random.choice([0,1],p=[probabilities[B[i,j]],1-probabilities[B[i,j]]])
I think that there should be a faster way to update matrix A using slice notation. Any advice?
Note that the lookup step of this problem is equivalent to the following: given a vector 'a' and a matrix 'B' with non-negative integer entries, obtain a matrix C in which C[i,j]=a[B[i,j]]
The general answer
You can achieve this by generating a random value from a continuous distribution and then comparing it with the inverse cumulative distribution function of that distribution, evaluated at the target probability.
To be practical
Let's use a uniform random variable X in the range [0, 1); then the probability that X < a is a, and the probability that X >= a is 1 - a. Assuming that A and B have the same shape,
you can use numpy.where for this (make sure that your probabilities variable is a numpy array). Since a 0 must become 1 with probability 1 - p, compare with >=:
A[:,:] = np.where(A == 0, np.where(np.random.rand(*A.shape) >= probabilities[B], 1, 0), A)
If you want to avoid computing random values for positions where A is non-zero, you need slightly more complex indexing:
A[A == 0] = np.where(np.random.rand(*A[A == 0].shape) >= probabilities[B[A == 0]], 1, 0)
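As a self-contained illustration of the update rule from the question (ones stay, a zero stays 0 with probability p and becomes 1 with probability 1 - p), here is a minimal sketch. The array size, the seed, and the contents of `probabilities` are made up for the example, and it uses `np.random.default_rng` instead of the legacy `np.random.rand`; treat it as one possible way to write it, not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable

n = 6
A = rng.integers(0, 2, size=(n, n))         # zeros and ones
B = rng.integers(0, 11, size=(n, n))        # integers in [0, 10]
probabilities = np.linspace(0.0, 1.0, 11)   # hypothetical table: p for each value of B

A_before = A.copy()
zeros = A == 0                              # positions that are allowed to change
p = probabilities[B[zeros]]                 # per-position probability of staying 0
# A uniform draw u in [0, 1) satisfies u >= p with probability 1 - p,
# so a zero flips to 1 with probability 1 - p and stays 0 with probability p.
A[zeros] = (rng.random(p.shape) >= p).astype(A.dtype)
```

Random numbers are drawn only for the zero positions, and `probabilities[B[zeros]]` performs the lookup for exactly those positions.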
Make many p in the 0-1 range:
In [36]: p=np.arange(1000)/1000
Your use of choice, with sum to get an overall statistic:
In [37]: sum([np.random.choice([0,1],p=[p[i],1-p[i]]) for i in range(p.shape[0])])
Out[37]: 485
and a statistically similar random array without the comprehension:
In [38]: (np.random.rand(p.shape[0])>p).astype(int).sum()
Out[38]: 496
You are welcome to perform other tests to verify their equivalence.
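One such test, sketched under my own assumptions (the number of trials, the seed, and the use of `np.random.default_rng` are not from the question): repeat the vectorized draw many times and compare the empirical frequency of 1 at each position with the expected value 1 - p.

```python
import numpy as np

rng = np.random.default_rng(1)      # arbitrary seed, for repeatability only
p = np.arange(1000) / 1000          # same p as above: probability of drawing 0
trials = 2000                       # arbitrary number of repetitions

# One row per trial; each entry is True with probability 1 - p[i].
draws = rng.random((trials, p.size)) > p
freq = draws.mean(axis=0)           # empirical frequency of 1 at each position

max_err = np.abs(freq - (1 - p)).max()
print(max_err)                      # should be small and shrink as trials grows
```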
If probabilities is a function that takes the whole B or B[mask], we should be able to do:
mask = A==0
n = mask.ravel().sum() # the number of true elements
A[mask] = (np.random.rand(n)>probabilities(B[mask])).astype(int)
To test this:
In [39]: A = np.random.randint(0,2, (4,5))
In [40]: A
Out[40]:
array([[1, 1, 0, 0, 0],
[1, 1, 1, 1, 1],
[1, 1, 0, 0, 1],
[1, 1, 1, 1, 0]])
In [41]: mask = A==0
In [42]: A[mask]
Out[42]: array([0, 0, 0, 0, 0, 0])
In [43]: B = np.arange(1,21).reshape(4,5)
In [44]: def foo(B): # scale the B values to probabilities
...: return B/20
...:
In [45]: foo(B)
Out[45]:
array([[0.05, 0.1 , 0.15, 0.2 , 0.25],
[0.3 , 0.35, 0.4 , 0.45, 0.5 ],
[0.55, 0.6 , 0.65, 0.7 , 0.75],
[0.8 , 0.85, 0.9 , 0.95, 1. ]])
In [46]: n = mask.ravel().sum()
In [47]: n
Out[47]: 6
In [51]: (np.random.rand(n)>foo(B[mask])).astype(int)
Out[51]: array([1, 1, 0, 1, 1, 0])
In [52]: A[mask] = (np.random.rand(n)>foo(B[mask])).astype(int)
In [53]: A
Out[53]:
array([[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 0, 1, 1],
[1, 1, 1, 1, 0]])
In [54]: foo(B[mask])
Out[54]: array([0.15, 0.2 , 0.25, 0.65, 0.7 , 1. ])
Related
Basically I have a 10000x10000 matrix named M and there are 1s and 0s in every column. I'm trying to count the number of 1s in every column and then divide every element in that column with this number.
This is what I have tried:
outbound_links = M[M == 1].count()
mat = [[1] * 10000] * 10000
n = 10000
#len(mat)
# for each column
for col_index in range(0, n):
    # count the number of 1s
    for row_index in range(0, n):
        if M[row_index][col_index] == 1:
            mat[row_index][col_index] = 1 / outbound_links[col_index]
        else:
            mat[row_index][col_index] = 0
print(mat)
But the code is unable to run because it seems too big a matrix. I was wondering what other alternatives I could use?
As suggested in the comments, you should use numpy for this. I think this will do:
import numpy as np
m = np.random.randint(0, 2, (4, 4))
# array([[0, 1, 1, 0],
# [0, 1, 0, 1],
# [0, 1, 0, 1],
# [1, 1, 1, 0]])
m / np.sum(m, axis=0)[np.newaxis, :]
# array([[0. , 0.25, 0.5 , 0. ],
# [0. , 0.25, 0. , 0.5 ],
# [0. , 0.25, 0. , 0.5 ],
# [1. , 0.25, 0.5 , 0. ]])
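If a column may contain no ones at all, the plain division produces nan for that column. A minimal sketch of one way to guard against that, using `np.divide` with its `out` and `where` arguments (the sample matrix is made up, with the last column deliberately left empty):

```python
import numpy as np

m = np.array([[0, 1, 1, 0],
              [0, 1, 0, 0],
              [0, 1, 0, 0],
              [1, 1, 1, 0]])    # last column deliberately has no ones

counts = m.sum(axis=0)           # number of ones per column
out = np.zeros(m.shape, dtype=float)
# Divide only where the column count is non-zero; the other entries keep the 0 from `out`.
np.divide(m, counts, out=out, where=counts != 0)
print(out)
```

Every column that has at least one 1 then sums to 1, and an all-zero column stays all zero.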
Non-numpy way. Simply iterate over all columns, count the number of ones in each, and then divide each cell by that count:
from random import randint
n = 4
mat = [[randint(0,1) for _ in range(n)] for _ in range(n)]
print(*mat, sep='\n')
for col in range(n):
    # count the number of 1s
    ones = sum(mat[row][col] for row in range(n))
    if ones: # Avoid dividing by zero
        for row in range(n):
            mat[row][col] /= ones
print('\n', *mat, sep='\n')
An example run:
[1, 0, 0, 1]
[0, 1, 1, 0]
[0, 0, 0, 1]
[1, 1, 1, 1]
[0.5, 0.0, 0.0, 0.33]
[0.0, 0.5, 0.5, 0.0]
[0.0, 0.0, 0.0, 0.33]
[0.5, 0.5, 0.5, 0.33]
You can try this:
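import numpy as np
mat = np.array(M, dtype=float)  # float dtype so the division is not truncated
for i in range(mat.shape[1]):
    ones = np.sum(mat[:, i])  # number of ones in column i
    if ones:
        mat[:, i] = mat[:, i] / ones
    else:
        print("no ones in that column")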
Here's a sample nparray:
array([[ 0.70582116, 0.29417881],
[ 0.65219176, 0.34780821],
[ 0.82653958, 0.17346044],
...,
[ 0.76903266, 0.23096734],
[ 0.65070963, 0.3492904 ],
[ 0.63485813, 0.36514184]], dtype=float32)
I want to threshold the first column: where it is greater than 0.7, set 1, otherwise 0 (and the reverse for the second column). So in the end the nparray should look something like this:
array([[ 1, 0],
[ 0, 1],
[ 1, 0],
...,
[ 1, 0],
[ 0, 1 ],
[ 0, 1]], dtype=float32)
How could I do it via numpy in a Pythonic way? Thanks!
IIUC, a little broadcasted logical comparison and conversion to int:
(x > 0.7).astype(int)
array([[1, 0],
[0, 0],
[1, 0],
[1, 0],
[0, 0],
[0, 0]])
It's rather simple:
arr > 0.7
That gives a boolean array (dtype bool). To convert to np.float32:
(arr > 0.7).astype(dtype=np.float32)
You could use numpy.column_stack:
x = numpy.array([[ 0.70582116, 0.29417881],
                 [ 0.65219176, 0.34780821],
                 [ 0.82653958, 0.17346044]])
col = x[:,0] > 0.7
final = numpy.column_stack([col, ~col]).astype(int)
Since col consists of booleans, ~col is the inverse of col.
Assuming your rows sum to 1, another way would be to compare it to numpy.array([0.7, 0.3]):
final = (x > numpy.array([0.7, 0.3])).astype(int)
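A runnable version of both variants on the first three rows of the sample array, kept in float32 as in the question. This is only a sketch; the names `final` and `final_alt` are mine.

```python
import numpy as np

x = np.array([[0.70582116, 0.29417881],
              [0.65219176, 0.34780821],
              [0.82653958, 0.17346044]], dtype=np.float32)

col = x[:, 0] > 0.7                                   # boolean mask from the first column
final = np.column_stack([col, ~col]).astype(np.float32)

# Alternative that relies on each row summing to 1: per-column thresholds.
final_alt = (x > np.array([0.7, 0.3])).astype(np.float32)
print(final)
```

The two results agree here only because the rows sum to 1. If a value falls exactly on a threshold, or the rows do not sum to 1, the second variant can give rows that are not one-hot.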
Given a 3x2 array, I want to calculate the average on axis=0, but only considering values that are larger than 0.
So given the array
[ [1,0],
[0,0],
[1,0] ]
I want the output to be
# 1, 0, 1 filtered for > 0 gives 1, 1, average = (1+1)/2 = 1
# 0, 0, 0 filtered for > 0 gives 0, 0, 0, average = 0
[1 0]
My current code is
import numpy as np
frame = np.array([ [1,0],
[0,0],
[1,0] ])
weights=np.array(frame)>0
print("weights:")
print(weights)
print("average without weights:")
print((np.average(frame, axis=0)))
print("average with weights:")
print((np.average(frame, axis=0, weights=weights)))
This gives me
weights:
[[ True False]
[False False]
[ True False]]
average without weights:
[ 0.66666667 0. ]
average with weights:
Traceback (most recent call last):
File "C:\Users\myuser\project\test.py", line 123, in <module>
print((np.average(frame, axis=0, weights=weights)))
File "C:\Users\myuser\Miniconda3\envs\myenv\lib\site-packages\numpy\lib\function_base.py", line 1140, in average
"Weights sum to zero, can't be normalized")
ZeroDivisionError: Weights sum to zero, can't be normalized
I don't understand this error. What am I doing wrong and how can I get the average for all values greater than zero along axis=0? Thanks!
You can get the mask of values greater than zero and use it to do elementwise multiplication and sum-reduction along the first axis. Finally, divide by the number of masked elements along the first axis to get the average values.
Thus, one solution would be -
mask = a > 0 # Input array : a
out = np.einsum('i...,i...->...',a,mask)/mask.sum(0)
Sample run -
In [52]: a
Out[52]:
array([[ 3, -3, 3],
[ 2, 2, 0],
[ 0, -3, 1],
[ 0, 1, 1]])
In [53]: mask = a > 0
In [56]: np.einsum('i...,i...->...',a,mask) # summations of > 0s
Out[56]: array([5, 3, 5])
In [57]: np.einsum('i...,i...->...',a,mask)/mask.sum(0) # avg values of >0s
Out[57]: array([ 2.5 , 1.5 , 1.66666667])
To account for all zero columns, it seems we are expecting 0 as the result. So, we can use np.where to do the choosing, like so -
In [61]: a[:,-1] = 0
In [62]: a
Out[62]:
array([[ 3, -3, 0],
[ 2, 2, 0],
[ 0, -3, 0],
[ 0, 1, 0]])
In [63]: mask = a > 0
In [65]: np.where( mask.any(0), np.einsum('i...,i...->...',a,mask)/mask.sum(0), 0)
__main__:1: RuntimeWarning: invalid value encountered in true_divide
Out[65]: array([ 2.5, 1.5, 0. ])
Just ignore the warning there.
If you feel paranoid about warnings, use masking -
mask = a > 0
vm = mask.any(0) # valid mask
out = np.zeros(a.shape[1])
out[vm] = np.einsum('ij,ij->j',a[:,vm],mask[:,vm])/mask.sum(0)[vm]
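As for the error in the question: np.average normalizes by the sum of the weights along the axis, and the second column of weights is all False, so that sum is zero. A warning-free variant of the same idea can be sketched with `np.divide` and its `where` argument; `positive_mean` is a hypothetical helper name, not a NumPy function.

```python
import numpy as np

def positive_mean(a):
    """Column-wise mean of the entries > 0; columns with no such entry give 0."""
    a = np.asarray(a, dtype=float)
    mask = a > 0
    totals = np.where(mask, a, 0.0).sum(axis=0)   # sum of the positive entries per column
    counts = mask.sum(axis=0)                     # how many entries were positive
    out = np.zeros(a.shape[1])
    # Divide only where there is at least one positive entry; elsewhere keep 0.
    np.divide(totals, counts, out=out, where=counts > 0)
    return out

frame = np.array([[1, 0],
                  [0, 0],
                  [1, 0]])
print(positive_mean(frame))
```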
Example:
arr = np.array([[.5, .25, .19, .05, .01],[.25, .5, .19, .05, .01],[.5, .25, .19, .05, .01]])
print(arr)
[[ 0.5 0.25 0.19 0.05 0.01]
[ 0.25 0.5 0.19 0.05 0.01]
[ 0.5 0.25 0.19 0.05 0.01]]
idxs = np.argsort(arr)
print(idxs)
[[4 3 2 1 0]
[4 3 2 0 1]
[4 3 2 1 0]]
How can I use idxs to index arr? I want to do something like arr[idxs], but this does not work.
It's not the prettiest, but I think something like
>>> arr[np.arange(len(arr))[:,None], idxs]
array([[ 0.01, 0.05, 0.19, 0.25, 0.5 ],
[ 0.01, 0.05, 0.19, 0.25, 0.5 ],
[ 0.01, 0.05, 0.19, 0.25, 0.5 ]])
should work. The first term gives the x coordinates we want (using broadcasting over the last singleton axis):
>>> np.arange(len(arr))[:,None]
array([[0],
[1],
[2]])
with idxs providing the y coordinates. Note that if we had used unravel_index, the x coordinates to use would always have been 0 instead:
>>> np.unravel_index(idxs, arr.shape)[0]
array([[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]])
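In newer NumPy versions (1.15 and later) the same lookup can also be written with np.take_along_axis, which builds the row index for you. A minimal sketch using the array from the question:

```python
import numpy as np

arr = np.array([[.5, .25, .19, .05, .01],
                [.25, .5, .19, .05, .01],
                [.5, .25, .19, .05, .01]])
idxs = np.argsort(arr)                    # sorts along the last axis by default

# Picks arr[i, idxs[i, j]] for every i, j without building the row index by hand.
sorted_rows = np.take_along_axis(arr, idxs, axis=1)
print(sorted_rows)
```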
How about something like this:
I changed variables to make the example more clear, but you basically need to index by two 2D arrays.
In [102]: a = np.array([[1,2,3], [4,5,6]])
In [103]: b = np.array([[0,2,1], [2,1,0]])
In [104]: temp = np.repeat(np.arange(a.shape[0]), a.shape[1]).reshape(a.shape).T
# temp is just [[0,1], [0,1], [0,1]]
# probably can be done more elegantly
In [105]: a[temp, b.T].T
Out[105]:
array([[1, 3, 2],
[6, 5, 4]])
I'm working with two arrays, trying to treat them like a 2-dimensional array. I'm using a lot of vectorized calculations with NumPy. Any idea how I would populate an array like this:
X = [1, 2, 3, 1, 2, 3, 1, 2, 3]
or:
X = [0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8]
Ignore the first part of the message.
I had to populate two arrays in the form of a grid. But the grid dimensions vary between users, which is why I needed a general form. I worked on it all this morning and finally got what I wanted.
I apologize if I caused any confusion earlier. English is not my native language, and sometimes it is hard for me to explain things.
This is the code that did the job for me:
myIter = linspace(1, N, N)
for x in myIter:
    for y in myIter:
        index = int((x - 1)*N + y) - 1  # linspace yields floats; array indices must be ints
        X[index] = x / (N+1)
        Y[index] = y / (N+1)
The user inputs N.
And the length of X, Y is N*N.
You can use the function tile. From the examples:
>>> a = np.array([0, 1, 2])
>>> np.tile(a, 2)
array([0, 1, 2, 0, 1, 2])
With this function, you can also shape your array in one step, as the other answers do with reshape (by giving 'reps' in more dimensions):
>>> np.tile(a, (2, 1))
array([[0, 1, 2],
[0, 1, 2]])
Addition: and a little comparison of the difference in speed between the built in function tile and the multiplication:
In [3]: %timeit numpy.array([1, 2, 3]* 3)
100000 loops, best of 3: 16.3 us per loop
In [4]: %timeit numpy.tile(numpy.array([1, 2, 3]), 3)
10000 loops, best of 3: 37 us per loop
In [5]: %timeit numpy.array([1, 2, 3]* 1000)
1000 loops, best of 3: 1.85 ms per loop
In [6]: %timeit numpy.tile(numpy.array([1, 2, 3]), 1000)
10000 loops, best of 3: 122 us per loop
EDIT
The output of the code you gave in your question can also be achieved as following:
arr = myIter / (N + 1)
X = numpy.repeat(arr, N)
Y = numpy.tile(arr, N)
This way you can avoid looping the arrays (which is one of the great advantages of using numpy). The resulting code is simpler (if you know the functions of course, see the documentation for repeat and tile) and faster.
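To check that repeat and tile reproduce the double loop from the question, here is a small sketch. The value N = 3 is arbitrary, and np.arange is used in place of linspace so that the loop indices are integers.

```python
import numpy as np

N = 3  # grid size; user-supplied in the question
arr = np.arange(1, N + 1) / (N + 1)

X = np.repeat(arr, N)   # each value repeated N times: the slow-moving coordinate
Y = np.tile(arr, N)     # the whole sequence repeated N times: the fast-moving coordinate

# Reference result from the explicit double loop in the question.
X_loop = np.empty(N * N)
Y_loop = np.empty(N * N)
for x in range(1, N + 1):
    for y in range(1, N + 1):
        index = (x - 1) * N + y - 1
        X_loop[index] = x / (N + 1)
        Y_loop[index] = y / (N + 1)

print(np.allclose(X, X_loop), np.allclose(Y, Y_loop))
```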
print(numpy.array(list(range(1, 4)) * 3))
print(numpy.array(list(range(1, 5)) * 4).astype(float) * 2 / 10)
If you want to create lists of repeating values, you could use list/tuple multiplication...
>>> import numpy
>>> numpy.array((1, 2, 3) * 3)
array([1, 2, 3, 1, 2, 3, 1, 2, 3])
>>> numpy.array((0.2, 0.4, 0.6, 0.8) * 3).reshape((3, 4))
array([[ 0.2, 0.4, 0.6, 0.8],
[ 0.2, 0.4, 0.6, 0.8],
[ 0.2, 0.4, 0.6, 0.8]])
Thanks for updating your question -- it's much clearer now. Though I think joris's answer is the best one in this case (because it is more readable), I'll point out that the new code you posted could also be generalized like so:
>>> arr = numpy.arange(1, N + 1) / (N + 1.0)
>>> X = arr[numpy.indices((N, N))[0]].flatten()
>>> Y = arr[numpy.indices((N, N))[1]].flatten()
In many cases, when using numpy, one avoids while loops by using numpy's powerful indexing system. In general, when you use array I to index array A, the result is an array J of the same shape as I. For each index i in I, the value A[i] is assigned to the corresponding position in J. For example, say you have arr = numpy.arange(0, 9) / (9.0) and you want the values at indices 3, 5, and 8. All you have to do is use numpy.array([3, 5, 8]) as the index to arr:
>>> arr
array([ 0. , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
0.55555556, 0.66666667, 0.77777778, 0.88888889])
>>> arr[numpy.array([3, 5, 8])]
array([ 0.33333333, 0.55555556, 0.88888889])
What if you want a 2-d array? Just pass in a 2-d index:
>>> arr[numpy.array([[1,1,1],[2,2,2],[3,3,3]])]
array([[ 0.11111111, 0.11111111, 0.11111111],
[ 0.22222222, 0.22222222, 0.22222222],
[ 0.33333333, 0.33333333, 0.33333333]])
>>> arr[numpy.array([[1,2,3],[1,2,3],[1,2,3]])]
array([[ 0.11111111, 0.22222222, 0.33333333],
[ 0.11111111, 0.22222222, 0.33333333],
[ 0.11111111, 0.22222222, 0.33333333]])
Since you don't want to have to type indices like that out all the time, you can generate them automatically -- with numpy.indices:
>>> numpy.indices((3, 3))
array([[[0, 0, 0],
[1, 1, 1],
[2, 2, 2]],
[[0, 1, 2],
[0, 1, 2],
[0, 1, 2]]])
In a nutshell, that's how the above code works. (Also check out numpy.mgrid and numpy.ogrid -- which provide slightly more flexible index-generators.)
Since many numpy operations are vectorized (i.e. they are applied to each element in an array) you just have to find the right indices for the job -- no loops required.
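For instance, the grid arrays X and Y from the question could be built with numpy.mgrid like this. It is a minimal sketch with an arbitrary N = 3; `rows` and `cols` are names I chose.

```python
import numpy as np

N = 3
arr = np.arange(1, N + 1) / (N + 1.0)

rows, cols = np.mgrid[0:N, 0:N]   # same index grids as np.indices((N, N))
X = arr[rows].ravel()             # value selected by the row index
Y = arr[cols].ravel()             # value selected by the column index
print(X)
print(Y)
```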
import numpy as np
X = list(range(1,4))*3
X = list(np.linspace(.2,.8,4))*4  # linspace avoids the floating-point end-point ambiguity of arange
These will make your two lists, respectively. Hope that's what you were asking.
I'm not exactly sure what you are trying to do, but as a guess: if you have a 1D array and you need to make it 2D, you can use the array's reshape method.
>>> import numpy
>>> a = numpy.array([1,2,3,1,2,3])
>>> a.reshape((2,3))
array([[1, 2, 3],
[1, 2, 3]])