I'm looking to see if there is a more efficient way (i.e. using native NumPy functionality) to achieve what I'm doing currently.
My process is I start with an array a:
a = np.array([[0,2,0,-1],[-0.2,0,-0.1,0],[0,0,-0.1,0],[0,0,0,0]])
array([[ 0. , 2. , 0. , -1. ],
[-0.2, 0. , -0.1, 0. ],
[ 0. , 0. , -0.1, 0. ],
[ 0. , 0. , 0. , 0. ]])
I then filter based on where the values are not equal to 0:
r_indices, c_indices = np.where(a != 0)
(array([0, 0, 1, 1, 2]), array([1, 3, 0, 2, 2]))
From there, I create a Python dictionary b like so:
b = {i: c_indices[r_indices == i] for i in np.unique(r_indices)}
{
0: array([1, 3]),
1: array([0, 2]),
2: array([2]),
}
I do this because I want to know for a given unique row index r, which column indices are not 0.
My own preference is to try to use NumPy as much as possible to take advantage of speed benefits. However, I'm not sure how else to structure this in NumPy since the values in the dictionary could range from a length of 0 (no values are not zero) to 4 (all values are not zero).
Am I being paranoid about the potential speed benefits?
You can use Pandas in the following way:
import pandas as pd
import numpy as np
if __name__=='__main__':
    a = np.array([[0, 2, 0, -1], [-0.2, 0, -0.1, 0], [0, 0, -0.1, 0], [0, 0, 0, 0]])
    rows, cols = np.where(a != 0)
    x = list(zip(rows, cols))
    df = pd.DataFrame.from_records(data=x)
    l = df.groupby(0)[1].apply(list)
    L = [np.array(a) for a in l.values]
    d = dict(zip(np.unique(rows), L))
Output
{0: array([1, 3]), 1: array([0, 2]), 2: array([2])}
Because pandas works with NumPy under the hood, this can be faster than the dictionary comprehension when there are many distinct rows, although it is worth timing both on your own data.
Also, if all you need is a dictionary-like object, you could improve performance further by indexing the grouped Series l directly:
l.loc[0]
which will result in :
[1, 3]
which is equivalent to b[0] in your example.
You can then omit the last two lines altogether: pandas provides very fast mechanisms for handling large amounts of tabular data, and is generally preferable to a plain dict object when both are used for the same purpose.
Cheers.
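For a NumPy-only variant of the grouping in the question, one possible sketch relies on np.nonzero returning indices in row-major order, so the row indices arrive already sorted and the column indices can be cut at the row boundaries with np.split. The names uniq, starts and groups are illustrative and do not come from the original posts.

```python
import numpy as np

a = np.array([[0, 2, 0, -1], [-0.2, 0, -0.1, 0], [0, 0, -0.1, 0], [0, 0, 0, 0]])

# np.nonzero scans in row-major order, so `rows` is already sorted.
rows, cols = np.nonzero(a)

# First position of each distinct row index inside `rows`.
uniq, starts = np.unique(rows, return_index=True)

# Cut the column indices at every row boundary (skip the leading 0).
groups = np.split(cols, starts[1:])

b = dict(zip(uniq.tolist(), groups))
print(b)
```

Rows without any non-zero entry (row 3 here) simply do not appear as keys, which matches the dictionary built in the question.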
I want to calculate the Gaussian PDF of two-dimensional data. I am trying to do this in Python using the scipy.stats.multivariate_normal function, but I don't understand how to pass my data into it.
Is multivariate_normal only used for analyzing one-dimensional data in n dimensions, or can I use it for my data set as well?
Data set: X = [X1, X2, ..., Xn]
where each
Xi = [x1 x2]
is two-dimensional.
To compute the density function, use the pdf() method of the object scipy.stats.multivariate_normal. The first argument is your array X. The next two arguments are the mean and the covariance matrix of the distribution.
For example:
In [72]: import numpy as np
In [73]: from scipy.stats import multivariate_normal
In [74]: mean = np.array([0, 1])
In [75]: cov = np.array([[2, -0.5], [-0.5, 4]])
In [76]: x = np.array([[0, 1], [1, 1], [0.5, 0.25], [1, 2], [-1, 0]])
In [77]: x
Out[77]:
array([[ 0. , 1. ],
[ 1. , 1. ],
[ 0.5 , 0.25],
[ 1. , 2. ],
[-1. , 0. ]])
In [78]: p = multivariate_normal.pdf(x, mean, cov)
In [79]: p
Out[79]: array([ 0.05717014, 0.04416653, 0.05106649, 0.03639454, 0.03639454])
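As a cross-check on the call above, the following sketch evaluates the same density in two other ways: through a "frozen" distribution object, and by writing out the closed-form bivariate normal density by hand. The variable names rv, p_frozen and p_manual are illustrative only.

```python
import numpy as np
from scipy.stats import multivariate_normal

mean = np.array([0, 1])
cov = np.array([[2, -0.5], [-0.5, 4]])
x = np.array([[0, 1], [1, 1], [0.5, 0.25], [1, 2], [-1, 0]])

# Frozen distribution: fix mean and covariance once, then evaluate many points.
rv = multivariate_normal(mean=mean, cov=cov)
p_frozen = rv.pdf(x)

# Closed form for k = 2 dimensions:
#   pdf(x) = exp(-0.5 * d^T cov^{-1} d) / (2 * pi * sqrt(det(cov))),  d = x - mean
d = x - mean
maha = np.einsum('ni,ij,nj->n', d, np.linalg.inv(cov), d)
p_manual = np.exp(-0.5 * maha) / (2 * np.pi * np.sqrt(np.linalg.det(cov)))

print(np.allclose(p_frozen, p_manual))
```

Each row of x is treated as one two-dimensional observation, so the result has one density value per row.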
I want to find the pair of values in a meshgrid, and their indices, that are closest to another pair of values. Suppose I have two vectors a = np.array([0.1,0.5,0.9]) and b = np.array([0,3,6,10]) and two meshgrids X,Y = np.meshgrid(a,b). For illustration, they look as follows:
X= array([[ 0.1, 0.5, 0.9],
[ 0.1, 0.5, 0.9],
[ 0.1, 0.5, 0.9],
[ 0.1, 0.5, 0.9]])
Y =array([[ 0, 0, 0],
[ 3, 3, 3],
[ 6, 6, 6],
[10, 10, 10]])
Now, I have another array called c of dimension (2 x N). For illustration suppose c contains the following entries:
c = array([[ 0.07268017, 0.08816632, 0.11084398, 0.13352165, 0.1490078 ],
[ 0.00091219, 0.00091219, 0.00091219, 0.00091219, 0.00091219]])
Denote a column vector of c by x. For each vector x I want to find the index (i,j) of the grid point (X[i,j], Y[i,j]) that is closest to x.
To complicate matters a bit, I am in fact not only looking for the index with the smallest distance (i,j) but also the second smallest distance (i',j').
All my approaches so far turned out to be extremely complicated and involved a lot of side routes. Does someone have an idea for how to tackle the problem efficiently?
If X, Y always come from meshgrid(), your minimization is separable in X and Y. Just find the closest elements of X to c[0,] and the closest elements of Y to c[1,]; you don't need to calculate the 2-dimensional metric.
If either a or b have uniform steps, you can save yourself even more time if you scale the corresponding values of c onto the indexes. In your example, all(a == 0.1+0.4*arange(3)), so you can find the x values by inverting: x = (c[0,] - 0.1)/0.4. If you have an invertible (possibly non-linear) function that maps integers onto b, you can similarly find y values directly by applying the inverse function to c[1,].
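A minimal sketch of the separable search, assuming the grid was built as X, Y = np.meshgrid(a, b) with a = [0.1, 0.5, 0.9] as in the displayed X. With that convention the row index i runs over b and the column index j runs over a. The names i and j are illustrative.

```python
import numpy as np

a = np.array([0.1, 0.5, 0.9])
b = np.array([0, 3, 6, 10])
X, Y = np.meshgrid(a, b)

c = np.array([[0.07268017, 0.08816632, 0.11084398, 0.13352165, 0.1490078],
              [0.00091219, 0.00091219, 0.00091219, 0.00091219, 0.00091219]])

# Nearest entry of `a` for every x-coordinate: compare all pairs by broadcasting.
j = np.abs(a[:, None] - c[0]).argmin(axis=0)   # column index into X / Y

# Nearest entry of `b` for every y-coordinate.
i = np.abs(b[:, None] - c[1]).argmin(axis=0)   # row index into X / Y

print(i, j)
print(X[i, j], Y[i, j])
```

This covers only the single nearest grid point; the second-nearest point asked for in the question needs extra bookkeeping or a tree-based lookup.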
This is more a comment than an answer but i like to [... lots of stuff mercifully deleted, that you can still see using the revision history ...]
Complete Revision
As a followup of my own comment, please look at the following
Setup
In [25]: from numpy import *
In [26]: from scipy.spatial import KDTree
In [27]: X= array([[ 0.1, 0.5, 0.9],
[ 0.1, 0.5, 0.9],
[ 0.1, 0.5, 0.9],
[ 0.1, 0.5, 0.9]])
In [28]: Y =array([[ 0, 0, 0],
[ 3, 3, 3],
[ 6, 6, 6],
[10, 10, 10]])
In [29]: c = array([[ 0.07268017, 0.08816632, 0.11084398, 0.13352165, 0.1490078 ],
[ 0.00091219, 0.00091219, 0.00091219, 0.00091219, 0.00091219]])
Solution
Two lines of code, please notice that you have to pass the transpose of your c array.
In [30]: tree = KDTree(list(zip(X.ravel(), Y.ravel())))
In [31]: tree.query(c.T,k=2)
Out[31]:
(array([[ 0.02733505, 0.4273208 ],
[ 0.01186879, 0.41183469],
[ 0.01088228, 0.38915709],
[ 0.03353406, 0.36647949],
[ 0.04901629, 0.35099339]]), array([[0, 1],
[0, 1],
[0, 1],
[0, 1],
[0, 1]]))
Comment
To interpret the result, the excellent scipy docs inform you that tree.query() gives you back two arrays, containing respectively, for each point in c:
- a scalar or an array of length k>=2 giving you the distances from the point to the closest point on the grid, the second closest, and so on;
- a scalar or an array of length k>=2 giving you the indices of the grid point(s) that are closest, next closest, and so on.
To access the grid point, KDTree maintains a copy of the grid data, e.g
In [32]: tree.data[[0,1]]
Out[32]:
array([[ 0.1, 0. ],
[ 0.5, 0. ]])
where [0,1] is the first element of the second output array.
Should you need the indices of the closest point(s) in the mesh matrices, it is simply a matter of using divmod.
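The conversion from flat tree indices back to mesh indices can be sketched as follows. It uses np.unravel_index, which for a C-ordered ravel performs the same arithmetic as divmod(index, number_of_columns). The tree is built here from np.column_stack rather than zip, which is a substitution made for this sketch.

```python
import numpy as np
from scipy.spatial import KDTree

a = np.array([0.1, 0.5, 0.9])
b = np.array([0, 3, 6, 10])
X, Y = np.meshgrid(a, b)
c = np.array([[0.07268017, 0.08816632, 0.11084398, 0.13352165, 0.1490078],
              [0.00091219, 0.00091219, 0.00091219, 0.00091219, 0.00091219]])

# One grid point per row: (x, y) pairs in the same order as X.ravel().
tree = KDTree(np.column_stack([X.ravel(), Y.ravel()]))
dist, flat = tree.query(c.T, k=2)

# Flat index -> (row, column) in the mesh matrices.
rows, cols = np.unravel_index(flat, X.shape)

# Same result with divmod on the number of columns.
rows_dm, cols_dm = divmod(flat, X.shape[1])

print(rows)
print(cols)
```

Column 0 of rows and cols holds the nearest grid point for each column of c, and column 1 holds the second nearest.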
I have a function that has a bunch of parameters. Rather than setting all of the parameters manually, I want to perform a grid search. I have a list of possible values for each parameter. For every possible combination of parameters, I want to run my function which reports the performance of my algorithm on those parameters. I want to store the results of this in a many-dimensional matrix, so that afterwards I can just find the index of the maximum performance, which would in turn give me the best parameters. Here is how the code is written now:
param1_list = [p11, p12, p13,...]
param2_list = [p21, p22, p23,...] # not necessarily the same number of values
...
results_size = (len(param1_list), len(param2_list),...)
results = np.zeros(results_size, dtype=float)
for param1_idx in range(len(param1_list)):
    for param2_idx in range(len(param2_list)):
        ...
            param1 = param1_list[param1_idx]
            param2 = param2_list[param2_idx]
            ...
            results[param1_idx, param2_idx, ...] = my_func(param1, param2, ...)
max_index = np.unravel_index(np.argmax(results), results.shape) # indices of best parameters!
I want to keep the first part, where I define the lists as-is, since I want to easily be able to manipulate the values over which I search.
I also want to end up with the results matrix as is, since I will be visualizing how changing different parameters affects the performance of the algorithm.
The bit in the middle, though, is quite repetitive and bulky (especially because I have lots of parameters, and I might want to add or remove parameters), and I feel like there should be a more succinct/elegant way to initialize the results matrix, iterate over all of the indices, and set the appropriate parameters.
So, is there?
You can use the ParameterGrid from the sklearn module
http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.ParameterGrid.html
Example
from sklearn.model_selection import ParameterGrid  # formerly sklearn.grid_search
param_grid = {'param1': [value1, value2, value3], 'param2': [value1, value2, valueM]}
grid = ParameterGrid(param_grid)
for params in grid:
    your_function(params['param1'], params['param2'])
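If the N-dimensional results array from the question should be kept, one possible sketch iterates over every index tuple with np.ndindex, which replaces the nested loops regardless of how many parameters there are. The parameter values and the scoring function below are hypothetical stand-ins, not taken from the original posts.

```python
import numpy as np

# Hypothetical parameter lists; they need not have the same length.
param_lists = [[0.1, 1.0, 10.0], [1, 2], [3, 5, 7, 9]]

def my_func(p1, p2, p3):
    # Hypothetical score that peaks at p1 = 1.0, p2 = 2, p3 = 5.
    return -(p1 - 1.0) ** 2 - (p2 - 2) ** 2 - (p3 - 5) ** 2

shape = tuple(len(values) for values in param_lists)
results = np.zeros(shape, dtype=float)

# np.ndindex yields every index tuple of an array with the given shape.
for idx in np.ndindex(*shape):
    params = [values[i] for values, i in zip(param_lists, idx)]
    results[idx] = my_func(*params)

# argmax works on the flattened array, so convert back to an index tuple.
best_idx = np.unravel_index(np.argmax(results), results.shape)
best_params = [values[i] for values, i in zip(param_lists, best_idx)]
print(best_idx, best_params)
```

Adding or removing a parameter then only means editing param_lists and the signature of the scoring function.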
I think scipy.optimize.brute is what you're after.
>>> from scipy.optimize import brute
>>> a,f,g,j = brute(my_func,[param1_list,param2_list,...],full_output = True)
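Note that if the full_output argument is True, the evaluation grid and the function values on it are returned as well. Keep in mind that brute minimizes my_func, calls it with a single 1-D array of parameters, and expects each entry of the ranges argument to be a slice object or a (low, high) tuple rather than an arbitrary list of values.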
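A small runnable sketch of brute follows, under the assumption that the objective is written to take one 1-D array of parameters and that lower values are better. The objective function and the slice ranges are hypothetical. finish=None is passed so that only the grid points themselves are evaluated, without the default local polishing step.

```python
import numpy as np
from scipy.optimize import brute

def objective(p):
    # Hypothetical objective: brute passes all parameters as one 1-D array.
    x, y = p
    return (x - 1) ** 2 + (y + 2) ** 2

# Each range is a slice(start, stop, step); stop is exclusive, as in np.mgrid.
ranges = (slice(-3, 4, 1), slice(-3, 4, 1))

x0, fval, grid, jout = brute(objective, ranges, full_output=True, finish=None)

print(x0, fval)        # best grid point and its objective value
print(jout.shape)      # objective evaluated on the whole grid
```

To maximize a performance score instead, the objective can return the negated score.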
The solutions from John Vinyard and Sibelius Seraphini are good built-in options, but if you're looking for more flexibility, you could use broadcasting + vectorize. Use ix_ to produce a broadcastable set of parameters, and then pass those to a vectorized version of the function (but see caveat below):
import numpy
a, b, c = range(3), range(3), range(3)
def my_func(x, y, z):
    return (x + y + z) / 3.0, x * y * z, max(x, y, z)
grids = numpy.vectorize(my_func)(*numpy.ix_(a, b, c))
mean_grid, product_grid, max_grid = grids
With the following results for mean_grid:
array([[[ 0. , 0.33333333, 0.66666667],
[ 0.33333333, 0.66666667, 1. ],
[ 0.66666667, 1. , 1.33333333]],
[[ 0.33333333, 0.66666667, 1. ],
[ 0.66666667, 1. , 1.33333333],
[ 1. , 1.33333333, 1.66666667]],
[[ 0.66666667, 1. , 1.33333333],
[ 1. , 1.33333333, 1.66666667],
[ 1.33333333, 1.66666667, 2. ]]])
product grid:
array([[[0, 0, 0],
[0, 0, 0],
[0, 0, 0]],
[[0, 0, 0],
[0, 1, 2],
[0, 2, 4]],
[[0, 0, 0],
[0, 2, 4],
[0, 4, 8]]])
and max grid:
array([[[0, 1, 2],
[1, 1, 2],
[2, 2, 2]],
[[1, 1, 2],
[1, 1, 2],
[2, 2, 2]],
[[2, 2, 2],
[2, 2, 2],
[2, 2, 2]]])
Note that this may not be the fastest approach. vectorize is handy, but it's limited by the speed of the function passed to it, and python functions are slow. If you could rewrite my_func to use numpy ufuncs, you could get your grids faster, if you cared to. Something like this:
>>> def mean(a, b, c):
... return (a + b + c) / 3.0
...
>>> mean(*numpy.ix_(a, b, c))
array([[[ 0. , 0.33333333, 0.66666667],
[ 0.33333333, 0.66666667, 1. ],
[ 0.66666667, 1. , 1.33333333]],
[[ 0.33333333, 0.66666667, 1. ],
[ 0.66666667, 1. , 1.33333333],
[ 1. , 1.33333333, 1.66666667]],
[[ 0.66666667, 1. , 1.33333333],
[ 1. , 1.33333333, 1.66666667],
[ 1.33333333, 1.66666667, 2. ]]])
You may use numpy meshgrid for this:
import numpy as np
x = range(1, 5)
y = range(10)
xx, yy = np.meshgrid(x, y)
results = my_func(xx, yy)
Note that your function must be able to work with NumPy arrays.
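One detail worth checking with meshgrid is the axis order. By default (indexing='xy') the output has shape (len(y), len(x)), whereas indexing='ij' gives (len(x), len(y)), so that results[i, j] corresponds to x[i] and y[j] as in the question's results matrix. The sketch below uses a hypothetical array-aware my_func to illustrate the difference.

```python
import numpy as np

def my_func(x, y):
    # Hypothetical function that already works elementwise on arrays.
    return x * 10 + y

x = np.arange(1, 5)    # 4 values
y = np.arange(10)      # 10 values

# Default 'xy' indexing: the first axis follows y.
xx_xy, yy_xy = np.meshgrid(x, y)

# 'ij' indexing: the first axis follows x, the second follows y.
xx, yy = np.meshgrid(x, y, indexing="ij")
results = my_func(xx, yy)

print(xx_xy.shape, xx.shape)
print(results[2, 7], my_func(x[2], y[7]))
```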
I'm working with two arrays, trying to work with them like a 2 dimensional array. I'm using a lot of vectorized calculations with NumPy. Any idea how I would populate an array like this:
X = [1, 2, 3, 1, 2, 3, 1, 2, 3]
or:
X = [0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8]
Ignore the first part of the message.
I had to populate two arrays in the form of a grid. But the grid dimensions vary between users, which is why I needed a general form. I worked on it all this morning and finally got what I wanted.
I apologize if I caused any confusion earlier. English is not my native language, and sometimes it is hard for me to explain things.
This is the code that did the job for me:
myIter = linspace(1, N, N)
for x in myIter:
    for y in myIter:
        index = int((x - 1)*N + y) - 1  # linspace yields floats; indices must be integers
        X[index] = x / (N+1)
        Y[index] = y / (N+1)
The user inputs N.
And the length of X, Y is N*N.
You can use the function tile. From the examples:
>>> a = np.array([0, 1, 2])
>>> np.tile(a, 2)
array([0, 1, 2, 0, 1, 2])
With this function, you can also reshape your array at once, as the other answers do with reshape, by defining 'repeats' in more dimensions:
>>> np.tile(a, (2, 1))
array([[0, 1, 2],
[0, 1, 2]])
Addition: and a little comparison of the difference in speed between the built in function tile and the multiplication:
In [3]: %timeit numpy.array([1, 2, 3]* 3)
100000 loops, best of 3: 16.3 us per loop
In [4]: %timeit numpy.tile(numpy.array([1, 2, 3]), 3)
10000 loops, best of 3: 37 us per loop
In [5]: %timeit numpy.array([1, 2, 3]* 1000)
1000 loops, best of 3: 1.85 ms per loop
In [6]: %timeit numpy.tile(numpy.array([1, 2, 3]), 1000)
10000 loops, best of 3: 122 us per loop
EDIT
The output of the code you gave in your question can also be achieved as follows:
arr = myIter / (N + 1)
X = numpy.repeat(arr, N)
Y = numpy.tile(arr, N)
This way you can avoid looping the arrays (which is one of the great advantages of using numpy). The resulting code is simpler (if you know the functions of course, see the documentation for repeat and tile) and faster.
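To see that repeat and tile reproduce the nested loop from the question, the following sketch builds both versions for a small N and compares them. It assumes Python 3 division semantics; the names X_loop and Y_loop are illustrative.

```python
import numpy as np

N = 4
arr = np.arange(1, N + 1) / (N + 1)

# Vectorized version: X repeats each value N times, Y cycles through all values.
X = np.repeat(arr, N)
Y = np.tile(arr, N)

# Reference version: the explicit double loop from the question.
X_loop = np.empty(N * N)
Y_loop = np.empty(N * N)
for x in range(1, N + 1):
    for y in range(1, N + 1):
        index = (x - 1) * N + y - 1
        X_loop[index] = x / (N + 1)
        Y_loop[index] = y / (N + 1)

print(np.allclose(X, X_loop), np.allclose(Y, Y_loop))
```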
import numpy
print(numpy.array(list(range(1, 4)) * 3))
print(numpy.array(list(range(1, 5)) * 4).astype(float) * 2 / 10)
If you want to create lists of repeating values, you could use list/tuple multiplication...
>>> import numpy
>>> numpy.array((1, 2, 3) * 3)
array([1, 2, 3, 1, 2, 3, 1, 2, 3])
>>> numpy.array((0.2, 0.4, 0.6, 0.8) * 3).reshape((3, 4))
array([[ 0.2, 0.4, 0.6, 0.8],
[ 0.2, 0.4, 0.6, 0.8],
[ 0.2, 0.4, 0.6, 0.8]])
Thanks for updating your question -- it's much clearer now. Though I think joris's answer is the best one in this case (because it is more readable), I'll point out that the new code you posted could also be generalized like so:
>>> arr = numpy.arange(1, N + 1) / (N + 1.0)
>>> X = arr[numpy.indices((N, N))[0]].flatten()
>>> Y = arr[numpy.indices((N, N))[1]].flatten()
In many cases, when using numpy, one avoids while loops by using numpy's powerful indexing system. In general, when you use array I to index array A, the result is an array J of the same shape as I. For each index i in I, the value A[i] is assigned to the corresponding position in J. For example, say you have arr = numpy.arange(0, 9) / (9.0) and you want the values at indices 3, 5, and 8. All you have to do is use numpy.array([3, 5, 8]) as the index to arr:
>>> arr
array([ 0. , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
0.55555556, 0.66666667, 0.77777778, 0.88888889])
>>> arr[numpy.array([3, 5, 8])]
array([ 0.33333333, 0.55555556, 0.88888889])
What if you want a 2-d array? Just pass in a 2-d index:
>>> arr[numpy.array([[1,1,1],[2,2,2],[3,3,3]])]
array([[ 0.11111111, 0.11111111, 0.11111111],
[ 0.22222222, 0.22222222, 0.22222222],
[ 0.33333333, 0.33333333, 0.33333333]])
>>> arr[numpy.array([[1,2,3],[1,2,3],[1,2,3]])]
array([[ 0.11111111, 0.22222222, 0.33333333],
[ 0.11111111, 0.22222222, 0.33333333],
[ 0.11111111, 0.22222222, 0.33333333]])
Since you don't want to have to type indices like that out all the time, you can generate them automatically -- with numpy.indices:
>>> numpy.indices((3, 3))
array([[[0, 0, 0],
[1, 1, 1],
[2, 2, 2]],
[[0, 1, 2],
[0, 1, 2],
[0, 1, 2]]])
In a nutshell, that's how the above code works. (Also check out numpy.mgrid and numpy.ogrid -- which provide slightly more flexible index-generators.)
Since many numpy operations are vectorized (i.e. they are applied to each element in an array) you just have to find the right indices for the job -- no loops required.
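As a rough illustration of the mgrid and ogrid alternatives mentioned above: np.mgrid with integer slices produces the same dense index arrays as np.indices, while np.ogrid produces "open" arrays of shape (N, 1) and (1, N) that only expand to (N, N) when combined through broadcasting. The variable names are illustrative.

```python
import numpy as np

N = 3
arr = np.arange(1, N + 1) / (N + 1.0)

# Dense index grids, equivalent to numpy.indices((N, N)).
i, j = np.mgrid[0:N, 0:N]
X = arr[i].ravel()
Y = arr[j].ravel()

# Open grids: shapes (N, 1) and (1, N); memory is only used once broadcast.
oi, oj = np.ogrid[0:N, 0:N]
total = arr[oi] + arr[oj]   # broadcasts to shape (N, N)

print(X)
print(Y)
print(total.shape)
```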
import numpy as np
X = list(range(1, 4)) * 3
X = list(np.linspace(.2, .8, 4)) * 4
These will make your two lists, respectively. Hope that's what you were asking.
I'm not exactly sure what you are trying to do, but as a guess: if you have a 1D array and you need to make it 2D, you can use the array class's reshape method.
>>> import numpy
>>> a = numpy.array([1,2,3,1,2,3])
>>> a.reshape((2,3))
array([[1, 2, 3],
[1, 2, 3]])