I'm working with two arrays, treating them together like a 2-dimensional array, and I'm doing a lot of vectorized calculations with NumPy. Any idea how I would populate an array like this:
X = [1, 2, 3, 1, 2, 3, 1, 2, 3]
or:
X = [0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8]
Update: ignore the first part of the message. I had to populate two arrays in the form of a grid, but the grid dimensions vary with user input, which is why I needed a general form. I worked on it all morning and finally got what I wanted.
I apologize if I caused any confusion earlier. English is not my native language, and sometimes it is hard for me to explain things.
This is the code that did the job for me:
from numpy import linspace, zeros

X = zeros(N * N)
Y = zeros(N * N)
myIter = linspace(1, N, N)
for x in myIter:
    for y in myIter:
        index = int((x - 1) * N + y) - 1  # linspace yields floats, so cast to int
        X[index] = x / (N + 1)
        Y[index] = y / (N + 1)
The user inputs N, and the length of X and Y is N*N.
You can use the function tile. From the examples:
>>> a = np.array([0, 1, 2])
>>> np.tile(a, 2)
array([0, 1, 2, 0, 1, 2])
With this function, you can also reshape your array at once, as the other answers do with reshape (by giving the 'reps' argument more dimensions):
>>> np.tile(a, (2, 1))
array([[0, 1, 2],
[0, 1, 2]])
Addition: a little comparison of the difference in speed between the built-in function tile and multiplication:
In [3]: %timeit numpy.array([1, 2, 3]* 3)
100000 loops, best of 3: 16.3 us per loop
In [4]: %timeit numpy.tile(numpy.array([1, 2, 3]), 3)
10000 loops, best of 3: 37 us per loop
In [5]: %timeit numpy.array([1, 2, 3]* 1000)
1000 loops, best of 3: 1.85 ms per loop
In [6]: %timeit numpy.tile(numpy.array([1, 2, 3]), 1000)
10000 loops, best of 3: 122 us per loop
EDIT
The output of the code you gave in your question can also be achieved as follows:
arr = myIter / (N + 1)
X = numpy.repeat(arr, N)
Y = numpy.tile(arr, N)
This way you avoid looping over the arrays (which is one of the great advantages of using numpy), and the resulting code is simpler (once you know the functions, of course; see the documentation for repeat and tile) and faster.
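For instance, a minimal check with N = 3 (a hypothetical value) shows that the two functions reproduce the loop's output:
import numpy as np

N = 3
arr = np.arange(1, N + 1) / (N + 1)  # same values as myIter / (N + 1)
X = np.repeat(arr, N)  # each element repeated N times: the slowly varying axis
Y = np.tile(arr, N)    # whole sequence repeated N times: the fast axis
print(X)  # [0.25 0.25 0.25 0.5  0.5  0.5  0.75 0.75 0.75]
print(Y)  # [0.25 0.5  0.75 0.25 0.5  0.75 0.25 0.5  0.75]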
import numpy
print(numpy.array(list(range(1, 4)) * 3))                          # [1 2 3 1 2 3 1 2 3]
print(numpy.array(list(range(1, 5)) * 4).astype(float) * 2 / 10)   # [0.2 0.4 0.6 0.8 ...]
If you want to create lists of repeating values, you could use list/tuple multiplication...
>>> import numpy
>>> numpy.array((1, 2, 3) * 3)
array([1, 2, 3, 1, 2, 3, 1, 2, 3])
>>> numpy.array((0.2, 0.4, 0.6, 0.8) * 3).reshape((3, 4))
array([[ 0.2, 0.4, 0.6, 0.8],
[ 0.2, 0.4, 0.6, 0.8],
[ 0.2, 0.4, 0.6, 0.8]])
Thanks for updating your question -- it's much clearer now. Though I think joris's answer is the best one in this case (because it is more readable), I'll point out that the new code you posted could also be generalized like so:
>>> arr = numpy.arange(1, N + 1) / (N + 1.0)
>>> X = arr[numpy.indices((N, N))[0]].flatten()
>>> Y = arr[numpy.indices((N, N))[1]].flatten()
In many cases, when using numpy, one avoids explicit loops by using numpy's powerful indexing system. In general, when you use array I to index array A, the result is an array J of the same shape as I; for each index i in I, the value A[i] is assigned to the corresponding position in J. For example, say you have arr = numpy.arange(0, 9) / 9.0 and you want the values at indices 3, 5, and 8. All you have to do is use numpy.array([3, 5, 8]) as the index to arr:
>>> arr
array([ 0. , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
0.55555556, 0.66666667, 0.77777778, 0.88888889])
>>> arr[numpy.array([3, 5, 8])]
array([ 0.33333333, 0.55555556, 0.88888889])
What if you want a 2-d array? Just pass in a 2-d index:
>>> arr[numpy.array([[1,1,1],[2,2,2],[3,3,3]])]
array([[ 0.11111111, 0.11111111, 0.11111111],
[ 0.22222222, 0.22222222, 0.22222222],
[ 0.33333333, 0.33333333, 0.33333333]])
>>> arr[numpy.array([[1,2,3],[1,2,3],[1,2,3]])]
array([[ 0.11111111, 0.22222222, 0.33333333],
[ 0.11111111, 0.22222222, 0.33333333],
[ 0.11111111, 0.22222222, 0.33333333]])
Since you don't want to have to type indices like that out all the time, you can generate them automatically -- with numpy.indices:
>>> numpy.indices((3, 3))
array([[[0, 0, 0],
[1, 1, 1],
[2, 2, 2]],
[[0, 1, 2],
[0, 1, 2],
[0, 1, 2]]])
In a nutshell, that's how the above code works. (Also check out numpy.mgrid and numpy.ogrid -- which provide slightly more flexible index-generators.)
Since many numpy operations are vectorized (i.e. they are applied to each element in an array) you just have to find the right indices for the job -- no loops required.
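For instance, here is a small sketch of the same construction using np.mgrid (assuming N is defined; np.mgrid[0:N, 0:N] returns the same pair of index arrays as np.indices((N, N))):
import numpy as np

N = 3
I, J = np.mgrid[0:N, 0:N]             # row indices and column indices
arr = np.arange(1, N + 1) / (N + 1.0)
X = arr[I].flatten()                  # row index varies slowly
Y = arr[J].flatten()                  # column index varies quickly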
import numpy as np
X = list(range(1, 4)) * 3               # [1, 2, 3, 1, 2, 3, 1, 2, 3]
X = list(np.arange(0.2, 1.0, 0.2)) * 4  # [0.2, 0.4, 0.6, 0.8] repeated 4 times
these will make your two lists, respectively (note the arange stop of 1.0 so that 0.8 is included). Hope that's what you were asking.
I'm not exactly sure what you are trying to do, but as a guess: if you have a 1D array and you need to make it 2D, you can use the array class's reshape method.
>>> import numpy
>>> a = numpy.array([1,2,3,1,2,3])
>>> a.reshape((2,3))
array([[1, 2, 3],
[1, 2, 3]])
Related
I have two arrays, and I want to compute the elementwise distance between them based on known distances between individual elements.
dist = {(4,3): 0.25, (4,1):0.75, (0,0):0, (3,3):0, (2,1):0.25, (1,0): 0.25}
a = np.array([[4, 4, 0], [3, 2, 1]])
b = np.array([[3, 1, 0]])
a
array([[4, 4, 0],
[3, 2, 1]])
b
array([[3, 1, 0]])
expected output based on dictionary dist:
array([[0.25, 0.75, 0. ],
[0. , 0.25, 0.25]])
So, if we need to know which elements are different, we can do a != b. Similarly, instead of !=, I want to apply the function below -
def get_distance(a, b):
    return dist[(a, b)]
to get the expected output above.
I tried np.vectorize(get_distance)(a, b) and it works. But I am not sure if it is the best way to do the above in vectorized way. So, for two numpy arrays, what is the best way to apply custom function/operator?
Instead of storing your distance mapping as a dict, use a np.array for lookup (or possibly a sparse matrix if size becomes an issue).
d = np.zeros((5, 4))
for (x, y), z in dist.items():
    d[x, y] = z
Then, simply index.
>>> d[a, b]
array([[0.25, 0.75, 0. ],
[0. , 0.25, 0.25]])
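As an aside, the same lookup table can be built without the Python loop, using the same pairwise fancy indexing (a sketch; it assumes the dict keys fit in the (5, 4) shape used above):
import numpy as np

keys = np.array(list(dist.keys()))    # shape (6, 2): the (x, y) key pairs
vals = np.array(list(dist.values()))
d = np.zeros((5, 4))
d[keys[:, 0], keys[:, 1]] = vals      # scatter all distances at once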
For a sparse solution (code is almost identical):
In [14]: from scipy import sparse
In [15]: d = sparse.dok_matrix((5, 4))
In [16]: for (x, y), z in dist.items():
    ...:     d[x, y] = z
    ...:
In [17]: d[a, b].A
Out[17]:
array([[0.25, 0.75, 0. ],
[0. , 0.25, 0.25]])
I'm looking to see if there is a more efficient way (i.e. using native NumPy functionality) to achieve what I'm doing currently.
My process is I start with an array a:
a = np.array([[0,2,0,-1],[-0.2,0,-0.1,0],[0,0,-0.1,0],[0,0,0,0]])
array([[ 0. , 2. , 0. , -1. ],
[-0.2, 0. , -0.1, 0. ],
[ 0. , 0. , -0.1, 0. ],
[ 0. , 0. , 0. , 0. ]])
I then filter based on where the values are not equal to 0:
r_indices, c_indices = np.where(a != 0)
(array([0, 0, 1, 1, 2]), array([1, 3, 0, 2, 2]))
From there, I create a Python dictionary b like so:
b = {i: c_indices[r_indices == i] for i in np.unique(r_indices)}
{
    0: array([1, 3]),
    1: array([0, 2]),
    2: array([2]),
}
I do this because I want to know, for a given unique row index r, which column indices are not 0.
My own preference is to use NumPy as much as possible to take advantage of its speed. However, I'm not sure how else to structure this in NumPy, since the arrays in the dictionary can range in length from 0 (no nonzero values in the row) to 4 (all values nonzero).
Am I being paranoid about the potential speed benefits?
You can use Pandas in the following way:
import pandas as pd
import numpy as np
if __name__ == '__main__':
    a = np.array([[0, 2, 0, -1], [-0.2, 0, -0.1, 0], [0, 0, -0.1, 0], [0, 0, 0, 0]])
    rows, cols = np.where(a != 0)
    x = list(zip(rows, cols))
    df = pd.DataFrame.from_records(data=x)
    l = df.groupby(0)[1].apply(list)
    L = [np.array(g) for g in l.values]  # loop variable renamed to avoid shadowing a
    d = dict(zip(np.unique(rows), L))
Output
{0: array([1, 3]), 1: array([0, 2]), 2: array([2])}
As pandas works with numpy under the hood, this can be considerably more efficient than the plain dictionary comprehension for large arrays.
Also, if all you need is a dictionary-like object, you can improve performance further by using the Pandas GroupBy result l directly:
l.loc[0]
which will result in:
[1, 3]
which is equivalent to b[0] in your example, and lets you omit the last two lines altogether. Pandas provides very fast mechanisms for handling large amounts of tabular data and is generally preferable to a plain dict when used for the same purpose.
Cheers.
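For completeness, a loop-free pure-NumPy sketch of the same grouping; it relies on np.where returning the row indices of a 2-D array in sorted order:
import numpy as np

a = np.array([[0, 2, 0, -1], [-0.2, 0, -0.1, 0], [0, 0, -0.1, 0], [0, 0, 0, 0]])
rows, cols = np.where(a != 0)
unique_rows, first_pos = np.unique(rows, return_index=True)
# split cols wherever the (sorted) row index changes:
b = dict(zip(unique_rows, np.split(cols, first_pos[1:])))
# {0: array([1, 3]), 1: array([0, 2]), 2: array([2])}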
I am solving a problem in which I am using two large matrices A and B. Matrix A is formed by ones and zeros and matrix B is formed by integers in the range [0,...,10].
At some point, I have to update the components of A such that, if a component is 1, it stays at 1; if it is 0, it stays at 0 with probability p or changes to 1 with probability 1-p. The parameter p is a function of the corresponding component of B, i.e., I have a list probabilities such that, when I update A[i,j], p equals probabilities[B[i,j]].
I can do the updating with the following code:
import numpy as np

for i in range(n):
    for j in range(n):
        if A[i, j] == 0:
            # stays 0 with probability p, becomes 1 with probability 1 - p
            A[i, j] = np.random.choice([0, 1], p=[probabilities[B[i, j]], 1 - probabilities[B[i, j]]])
I think that there should be a faster way to update matrix A using slice notation. Any advice?
Note that this problem is equivalent to: given a vector 'a' and a matrix 'B' of non-negative integer entries, obtain a matrix C in which C[i,j] = a[B[i,j]].
The general answer
You can achieve this by generating a random value from a continuous distribution and then comparing the generated value against the desired probability threshold.
To be practical
Let's use a uniform random variable X in the range [0, 1); then the probability that X < a is exactly a. Assuming that A and B have the same shape, you can use numpy.where for this (make sure that your probabilities variable is a numpy array):
A[:, :] = np.where(A == 0, np.where(np.random.rand(*A.shape) < probabilities[B], 1, 0), A)
If you want to avoid computing the random values for positions where A is non-zero, then you need slightly more complex indexing:
A[A == 0] = np.where(np.random.rand(*A[A == 0].shape) < probabilities[B[A == 0]], 1, 0)
Make many p values in the 0-1 range:
In [36]: p=np.arange(1000)/1000
Your np.random.choice approach, summed to get an overall statistic:
In [37]: sum([np.random.choice([0,1],p=[p[i],1-p[i]]) for i in range(p.shape[0])])
Out[37]: 485
and a statistically similar random array without the comprehension:
In [38]: (np.random.rand(p.shape[0])>p).astype(int).sum()
Out[38]: 496
You are welcome to perform other tests to verify their equivalence.
If probabilities is a function that takes the whole B or B[mask], we should be able to do:
mask = A==0
n = mask.ravel().sum() # the number of true elements
A[mask] = (np.random.rand(n)>probabilities(B[mask])).astype(int)
To test this:
In [39]: A = np.random.randint(0,2, (4,5))
In [40]: A
Out[40]:
array([[1, 1, 0, 0, 0],
[1, 1, 1, 1, 1],
[1, 1, 0, 0, 1],
[1, 1, 1, 1, 0]])
In [41]: mask = A==0
In [42]: A[mask]
Out[42]: array([0, 0, 0, 0, 0, 0])
In [43]: B = np.arange(1,21).reshape(4,5)
In [44]: def foo(B):  # scale the B values to probabilities
    ...:     return B/20
...:
In [45]: foo(B)
Out[45]:
array([[0.05, 0.1 , 0.15, 0.2 , 0.25],
[0.3 , 0.35, 0.4 , 0.45, 0.5 ],
[0.55, 0.6 , 0.65, 0.7 , 0.75],
[0.8 , 0.85, 0.9 , 0.95, 1. ]])
In [46]: n = mask.ravel().sum()
In [47]: n
Out[47]: 6
In [51]: (np.random.rand(n)>foo(B[mask])).astype(int)
Out[51]: array([1, 1, 0, 1, 1, 0])
In [52]: A[mask] = (np.random.rand(n)>foo(B[mask])).astype(int)
In [53]: A
Out[53]:
array([[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 0, 1, 1],
[1, 1, 1, 1, 0]])
In [54]: foo(B[mask])
Out[54]: array([0.15, 0.2 , 0.25, 0.65, 0.7 , 1. ])
I think this is an easy question for experienced numpy users.
I have a score matrix. The row index corresponds to samples and the column index corresponds to items. For example,
score_matrix =
[[ 1. , 0.3, 0.4],
[ 0.2, 0.6, 0.8],
[ 0.1, 0.3, 0.5]]
I want to get the top-M indices of items for each sample. I also want to get the top-M scores. For example,
top2_ind =
[[0, 2],
[2, 1],
[2, 1]]
top2_score =
[[1. , 0.4],
[0.8, 0.6],
[0.5, 0.3]]
What is the best way to do this using numpy?
Here's an approach using np.argpartition -
idx = np.argpartition(a,range(M))[:,:-M-1:-1] # topM_ind
out = a[np.arange(a.shape[0])[:,None],idx] # topM_score
Sample run -
In [343]: a
Out[343]:
array([[ 1. , 0.3, 0.4],
[ 0.2, 0.6, 0.8],
[ 0.1, 0.3, 0.5]])
In [344]: M = 2
In [345]: idx = np.argpartition(a,range(M))[:,:-M-1:-1]
In [346]: idx
Out[346]:
array([[0, 2],
[2, 1],
[2, 1]])
In [347]: a[np.arange(a.shape[0])[:,None],idx]
Out[347]:
array([[ 1. , 0.4],
[ 0.8, 0.6],
[ 0.5, 0.3]])
Alternatively, possibly slower but a bit shorter, you could get idx with np.argsort -
idx = a.argsort(1)[:,:-M-1:-1]
Here's a post containing runtime tests that compare np.argsort and np.argpartition on a similar problem.
I'd use argsort():
top2_ind = score_matrix.argsort()[:,::-1][:,:2]
That is, produce an array which contains the indices which would sort score_matrix:
array([[1, 2, 0],
[0, 1, 2],
[0, 1, 2]])
Then reverse the columns with ::-1, then take the first two columns with :2:
array([[0, 2],
[2, 1],
[2, 1]])
Then do the same, but with regular np.sort(), to get the values:
top2_score = np.sort(score_matrix)[:,::-1][:,:2]
which, following the same mechanics as above, gives you:
array([[ 1. , 0.4],
[ 0.8, 0.6],
[ 0.5, 0.3]])
In case someone is interested in both the values and the corresponding indices without tampering with the order, the following simple approach can be helpful, though it may be computationally expensive for large data since it uses a Python list to store (value, index) tuples.
import numpy as np
values = np.array([0.01, 0.6, 0.4, 0.0, 0.1, 0.7, 0.12])  # a simple array
values_indices = []  # define an empty list to store values and indices
while values.shape[0] > 1:
    values_indices.append((values.max(), values.argmax()))
    # remove the maximum value from the array:
    values = np.delete(values, values.argmax())
The final output as a list of tuples (note that each index refers to the array after the previous deletions, so it may not match the position in the original array):
values_indices
[(0.7, 5), (0.6, 1), (0.4, 1), (0.12, 3), (0.1, 2), (0.01, 0)]
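For comparison, a loop-free sketch using argsort that keeps the indices relative to the original array (so its output differs from the shifting indices above):
import numpy as np

values = np.array([0.01, 0.6, 0.4, 0.0, 0.1, 0.7, 0.12])
order = np.argsort(-values)  # indices in descending value order
values_indices = list(zip(values[order], order))
# [(0.7, 5), (0.6, 1), (0.4, 2), (0.12, 6), (0.1, 4), (0.01, 0), (0.0, 3)]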
An easy way would be:
To get top-2 indices
np.argsort(-score_matrix)[:, :2]
To get top-2 values
-np.sort(-score_matrix)[:, :2]
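If you want the indices and the values to come from a single sort, here is a sketch using np.take_along_axis (available in NumPy 1.15+):
import numpy as np

score_matrix = np.array([[1.0, 0.3, 0.4],
                         [0.2, 0.6, 0.8],
                         [0.1, 0.3, 0.5]])
M = 2
top_ind = np.argsort(-score_matrix)[:, :M]                     # top-M indices
top_score = np.take_along_axis(score_matrix, top_ind, axis=1)  # matching scores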
I want to find the pair of values, and their index numbers, in a meshgrid that are closest to another pair of values. Suppose I have two vectors a = np.array([0.1, 0.5, 0.9]) and b = np.array([0, 3, 6, 10]) and two meshgrids X, Y = np.meshgrid(a, b). For illustration, they look as follows:
X= array([[ 0.1, 0.5, 0.9],
[ 0.1, 0.5, 0.9],
[ 0.1, 0.5, 0.9],
[ 0.1, 0.5, 0.9]])
Y =array([[ 0, 0, 0],
[ 3, 3, 3],
[ 6, 6, 6],
[10, 10, 10]])
Now, I have another array called c of dimension (2 x N). For illustration suppose c contains the following entries:
c = array([[ 0.07268017, 0.08816632, 0.11084398, 0.13352165, 0.1490078 ],
[ 0.00091219, 0.00091219, 0.00091219, 0.00091219, 0.00091219]])
Denote a column vector of c by x. For each vector x I want to find the index (i, j) of the grid point (X[i,j], Y[i,j]) that is closest to x.
To complicate matters a bit, I am in fact not only looking for the index with the smallest distance (i,j) but also the one with the second smallest distance (i',j').
All my approaches so far turned out to be extremely complicated and involved a lot of side routes. Does someone have an idea for how to tackle the problem efficiently?
If X, Y always come from meshgrid(), your minimization is separable in X and Y. Just find the closest elements of X to c[0,] and the closest elements of Y to c[1,] ---
you don't need to calculate the 2-dimensional metric.
If either a or b has uniform steps, you can save yourself even more time by scaling the corresponding values of c onto the indexes. In your example, all(a == 0.1 + 0.4*arange(3)), so you can find the x indices by inverting the map: x = (c[0,] - 0.1)/0.4 (and rounding to the nearest integer). If you have an invertible (possibly non-linear) function that maps integers onto b, you can similarly find y values directly by applying the inverse function to c[1,].
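A minimal sketch of the separable search, using the arrays from the question (the rounding trick above would replace the argmin when the steps are uniform):
import numpy as np

a = np.array([0.1, 0.5, 0.9])
b = np.array([0, 3, 6, 10])
c = np.array([[0.07268017, 0.08816632, 0.11084398, 0.13352165, 0.1490078],
              [0.00091219, 0.00091219, 0.00091219, 0.00091219, 0.00091219]])

# Because the grid is a Cartesian product, minimize along each axis independently:
j = np.abs(a[:, None] - c[0]).argmin(axis=0)  # column index (a-axis) per point
i = np.abs(b[:, None] - c[1]).argmin(axis=0)  # row index (b-axis) per point
# (X[i, j], Y[i, j]) is then the closest grid point for each column of c.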
Complete Revision
As a follow-up, please look at the following.
Setup
In [25]: from numpy import *
In [26]: from scipy.spatial import KDTree
In [27]: X= array([[ 0.1, 0.5, 0.9],
[ 0.1, 0.5, 0.9],
[ 0.1, 0.5, 0.9],
[ 0.1, 0.5, 0.9]])
In [28]: Y =array([[ 0, 0, 0],
[ 3, 3, 3],
[ 6, 6, 6],
[10, 10, 10]])
In [29]: c = array([[ 0.07268017, 0.08816632, 0.11084398, 0.13352165, 0.1490078 ],
[ 0.00091219, 0.00091219, 0.00091219, 0.00091219, 0.00091219]])
Solution
Two lines of code; notice that you have to pass the transpose of your c array.
In [30]: tree = KDTree(list(zip(X.ravel(), Y.ravel())))  # list() needed on Python 3
In [31]: tree.query(c.T,k=2)
Out[31]:
(array([[ 0.02733505, 0.4273208 ],
[ 0.01186879, 0.41183469],
[ 0.01088228, 0.38915709],
[ 0.03353406, 0.36647949],
[ 0.04901629, 0.35099339]]), array([[0, 1],
[0, 1],
[0, 1],
[0, 1],
[0, 1]]))
Comment
To interpret the result, the excellent scipy docs inform you that tree.query() gives you back two arrays, containing respectively, for each point in c:
a scalar or an array of length k >= 2 giving the distances
from the point to the closest grid point, the second closest, etc.;
a scalar or an array of length k >= 2 giving the indices
pointing to the closest grid point(s) (next closest, etc.).
To access the grid points, KDTree maintains a copy of the grid data, e.g.
In [32]: tree.data[[0,1]]
Out[32]:
array([[ 0.1, 0. ],
[ 0.5, 0. ]])
where [0,1] is the first element of the second output array.
Should you need the row and column indices of the closest point(s) in the mesh matrices, it is simply a matter of using divmod.
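A small sketch of that last step, continuing the session above (X.shape[1] is the number of grid columns, 3 here):
dists, flat_idx = tree.query(c.T, k=2)     # flat indices into the raveled grid
rows, cols = divmod(flat_idx, X.shape[1])  # row-major: row = idx // ncols, col = idx % ncols
# X[rows, cols] and Y[rows, cols] recover the matching grid coordinates.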