Elegant grid search in python/numpy

I have a function that has a bunch of parameters. Rather than setting all of the parameters manually, I want to perform a grid search. I have a list of possible values for each parameter. For every possible combination of parameters, I want to run my function, which reports the performance of my algorithm on those parameters. I want to store the results of this in a many-dimensional matrix, so that afterwards I can just find the index of the maximum performance, which would in turn give me the best parameters. Here is how the code is written now:
param1_list = [p11, p12, p13, ...]
param2_list = [p21, p22, p23, ...]  # not necessarily the same number of values
...
results_size = (len(param1_list), len(param2_list), ...)
results = np.zeros(results_size, dtype=float)
for param1_idx in range(len(param1_list)):
    for param2_idx in range(len(param2_list)):
        ...
        param1 = param1_list[param1_idx]
        param2 = param2_list[param2_idx]
        ...
        results[param1_idx, param2_idx, ...] = my_func(param1, param2, ...)
max_index = np.argmax(results)  # flat index of the best parameters; np.unravel_index(max_index, results.shape) gives the per-parameter indices
I want to keep the first part, where I define the lists as-is, since I want to easily be able to manipulate the values over which I search.
I also want to end up with the results matrix as is, since I will be visualizing how changing different parameters affects the performance of the algorithm.
The bit in the middle, though, is quite repetitive and bulky (especially because I have lots of parameters, and I might want to add or remove parameters), and I feel like there should be a more succinct/elegant way to initialize the results matrix, iterate over all of the indices, and set the appropriate parameters.
So, is there?

You can use the ParameterGrid from the sklearn module
http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.ParameterGrid.html
Example
from sklearn.grid_search import ParameterGrid
param_grid = {'param1': [value1, value2, value3], 'paramN': [value1, value2, valueM]}
grid = ParameterGrid(param_grid)
for params in grid:
    your_function(params['param1'], params['param2'])
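For the question's use case, the pieces can be combined as in the sketch below. It is only an illustration: the parameter values and my_func are made up, and it imports ParameterGrid from sklearn.model_selection, where it lives in newer versions of scikit-learn (sklearn.grid_search was the older location).
import numpy as np
from sklearn.model_selection import ParameterGrid  # newer scikit-learn location

param1_list = [0.1, 1.0, 10.0]   # hypothetical parameter values
param2_list = [2, 4, 8, 16]      # hypothetical parameter values

def my_func(param1, param2):     # stand-in for the real scoring function
    return param1 * param2

results = np.zeros((len(param1_list), len(param2_list)))
for params in ParameterGrid({'param1': param1_list, 'param2': param2_list}):
    # map each sampled value back to its index in the original list
    # (assumes the values in each list are unique)
    i = param1_list.index(params['param1'])
    j = param2_list.index(params['param2'])
    results[i, j] = my_func(params['param1'], params['param2'])

# flat argmax -> per-parameter indices of the best combination
best = np.unravel_index(np.argmax(results), results.shape)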

I think scipy.optimize.brute is what you're after.
>>> from scipy.optimize import brute
>>> a, f, g, j = brute(my_func, [param1_list, param2_list, ...], full_output=True)
Note that if the full_output argument is True, the evaluation grid is also returned. Keep in mind that brute minimizes its objective and expects each parameter range as a slice object or a (low, high) tuple, so a maximization problem needs the objective negated (see the sketch below).
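A minimal sketch of how this could look for a maximization problem; my_func and the ranges are made up for illustration, and brute passes all parameters to the objective as a single array:
import numpy as np
from scipy.optimize import brute

def my_func(params):
    param1, param2 = params
    # toy "performance" with its maximum at (1, 2)
    return -((param1 - 1.0) ** 2 + (param2 - 2.0) ** 2)

ranges = (slice(0, 3, 0.25), slice(0, 4, 0.25))         # one slice per parameter
best, fval, grid, jout = brute(lambda p: -my_func(p),   # negate: brute minimizes
                               ranges, full_output=True, finish=None)
# best: parameter values with the highest performance; jout: the evaluated grid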

The solutions from John Vinyard and Sibelius Seraphini are good built-in options, but if you're looking for more flexibility, you could use broadcasting + vectorize. Use ix_ to produce a broadcastable set of parameters, and then pass those to a vectorized version of the function (but see the caveat below):
import numpy

a, b, c = range(3), range(3), range(3)

def my_func(x, y, z):
    return (x + y + z) / 3.0, x * y * z, max(x, y, z)

grids = numpy.vectorize(my_func)(*numpy.ix_(a, b, c))
mean_grid, product_grid, max_grid = grids
With the following results for mean_grid:
array([[[ 0.        ,  0.33333333,  0.66666667],
        [ 0.33333333,  0.66666667,  1.        ],
        [ 0.66666667,  1.        ,  1.33333333]],

       [[ 0.33333333,  0.66666667,  1.        ],
        [ 0.66666667,  1.        ,  1.33333333],
        [ 1.        ,  1.33333333,  1.66666667]],

       [[ 0.66666667,  1.        ,  1.33333333],
        [ 1.        ,  1.33333333,  1.66666667],
        [ 1.33333333,  1.66666667,  2.        ]]])
product grid:
array([[[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]],

       [[0, 0, 0],
        [0, 1, 2],
        [0, 2, 4]],

       [[0, 0, 0],
        [0, 2, 4],
        [0, 4, 8]]])
and max grid:
array([[[0, 1, 2],
        [1, 1, 2],
        [2, 2, 2]],

       [[1, 1, 2],
        [1, 1, 2],
        [2, 2, 2]],

       [[2, 2, 2],
        [2, 2, 2],
        [2, 2, 2]]])
Note that this may not be the fastest approach. vectorize is handy, but it's limited by the speed of the function passed to it, and python functions are slow. If you could rewrite my_func to use numpy ufuncs, you could get your grids faster, if you cared to. Something like this:
>>> def mean(a, b, c):
...     return (a + b + c) / 3.0
...
>>> mean(*numpy.ix_(a, b, c))
array([[[ 0.        ,  0.33333333,  0.66666667],
        [ 0.33333333,  0.66666667,  1.        ],
        [ 0.66666667,  1.        ,  1.33333333]],

       [[ 0.33333333,  0.66666667,  1.        ],
        [ 0.66666667,  1.        ,  1.33333333],
        [ 1.        ,  1.33333333,  1.66666667]],

       [[ 0.66666667,  1.        ,  1.33333333],
        [ 1.        ,  1.33333333,  1.66666667],
        [ 1.33333333,  1.66666667,  2.        ]]])
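Either way, recovering the best parameter combination from such a grid only needs argmax plus unravel_index. A short sketch, reusing mean_grid and the lists a, b, c from above:
# flat index of the maximum -> per-axis indices -> parameter values
best_flat = numpy.argmax(mean_grid)
best_idx = numpy.unravel_index(best_flat, mean_grid.shape)
best_params = (a[best_idx[0]], b[best_idx[1]], c[best_idx[2]])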

You may use numpy meshgrid for this:
import numpy as np
x = range(1, 5)
y = range(10)
xx, yy = np.meshgrid(x, y)
results = my_func(xx, yy)
Note that your function must be able to work with numpy arrays.
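One detail worth noting: np.meshgrid defaults to 'xy' indexing, which swaps the first two axes of the output, so the result has shape (len(y), len(x)). Passing indexing='ij' keeps the axes in the same order as the parameter lists, matching the question's results layout. A small sketch with a made-up elementwise my_func:
import numpy as np

param1_list = [1, 2, 3, 4]
param2_list = list(range(10))

def my_func(p1, p2):            # must operate elementwise on arrays
    return p1 * np.sin(p2)

g1, g2 = np.meshgrid(param1_list, param2_list, indexing='ij')
results = my_func(g1, g2)       # shape (len(param1_list), len(param2_list)) == (4, 10)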

Related

Apply custom function/operator between numpy arrays

There are two arrays and I want to get distance between two arrays based on known individual elements distance.
dist = {(4,3): 0.25, (4,1):0.75, (0,0):0, (3,3):0, (2,1):0.25, (1,0): 0.25}
a = np.array([[4, 4, 0], [3, 2, 1]])
b = np.array([[3, 1, 0]])
a
array([[4, 4, 0],
       [3, 2, 1]])
b
array([[3, 1, 0]])
expected output based on dictionary dist:
array([[0.25, 0.75, 0.  ],
       [0.  , 0.25, 0.25]])
So, if we need to know which elements are different, we can do a != b. Similarly, instead of !=, I want to apply the function below:
def get_distance(a, b):
    return dist[(a, b)]
to get the expected output above.
I tried np.vectorize(get_distance)(a, b) and it works. But I am not sure it is the best way to do the above in a vectorized way. So, for two numpy arrays, what is the best way to apply a custom function/operator?
Instead of storing your distance mapping as a dict, use a np.array for lookup (or possibly a sparse matrix if size becomes an issue).
d = np.zeros((5, 4))
for (x, y), z in dist.items():
    d[x, y] = z
Then, simply index.
>>> d[a, b]
array([[0.25, 0.75, 0.  ],
       [0.  , 0.25, 0.25]])
For a sparse solution (code is almost identical):
In [14]: from scipy import sparse
In [15]: d = sparse.dok_matrix((5, 4))
In [16]: for (x, y), z in dist.items():
    ...:     d[x, y] = z
    ...:
In [17]: d[a, b].A
Out[17]:
array([[0.25, 0.75, 0.  ],
       [0.  , 0.25, 0.25]])
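The lookup array's shape is hardcoded above; it can also be derived from the keys of dist. A small sketch, assuming every key is a pair of non-negative integers:
import numpy as np

dist = {(4, 3): 0.25, (4, 1): 0.75, (0, 0): 0, (3, 3): 0, (2, 1): 0.25, (1, 0): 0.25}
shape = (max(i for i, _ in dist) + 1, max(j for _, j in dist) + 1)  # (5, 4) here
d = np.zeros(shape)
for (i, j), z in dist.items():
    d[i, j] = z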

Updating a numpy array values based on multiple conditions

I have an array P as shown below:
P
array([[ 0.49530662,  0.32619367,  0.54593724, -0.0224462 ],
       [-0.10503237,  0.48607405,  0.28572714,  0.15175049],
       [ 0.0286128 , -0.32407902, -0.56598029, -0.26743756],
       [ 0.14353725, -0.35624814,  0.25655861, -0.09241335]])
and a vector y:
y
array([0, 0, 1, 0], dtype=int16)
I want to modify another matrix Z which has the same dimension as P, such that Z_ij = y_j when Z_ij < 0.
In the above example, my Z matrix should be
Z = array([[-, -, -, 0],
           [0, -, -, -],
           [-, 0, 1, 0],
           [-, 0, -, 0]])
Where '-' indicates the original Z values. What I have in mind is a very straightforward implementation that iterates through each row of Z and compares the column values against the corresponding y and P. Do you know a better pythonic/numpy approach?
What you need is np.where. This is how to use it:
import numpy as np

z = np.array([[ 0.49530662,  0.32619367,  0.54593724, -0.0224462 ],
              [-0.10503237,  0.48607405,  0.28572714,  0.15175049],
              [ 0.0286128 , -0.32407902, -0.56598029, -0.26743756],
              [ 0.14353725, -0.35624814,  0.25655861, -0.09241335]])
y = [0, 0, 1, 0]

# Where z < 0, replace it by the corresponding value of y
result = np.where(z < 0, y, z)
Result:
>>> print(result)
[[0.49530662 0.32619367 0.54593724 0.        ]
 [0.         0.48607405 0.28572714 0.15175049]
 [0.0286128  0.         1.         0.        ]
 [0.14353725 0.         0.25655861 0.        ]]
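If you would rather modify z in place instead of building a new array, a boolean mask does the same replacement. A sketch reusing z and y from above (y is broadcast across the rows of z):
mask = z < 0
z[mask] = np.broadcast_to(y, z.shape)[mask]   # z now matches result above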

Interpolate 2D matrix along columns using Python

I am trying to interpolate a 2D numpy matrix with the dimensions (5, 3) to a matrix with the dimensions (7, 3) along the axis 1 (columns). Obviously, the wrong approach would be to randomly insert rows anywhere between the original matrix, see the following example:
Source:
[[0, 1, 1]
 [0, 2, 0]
 [0, 3, 1]
 [0, 4, 0]
 [0, 5, 1]]
Target (terrible interpolation -> not wanted!):
[[0, 1, 1]
 [0, 1.5, 0.5]
 [0, 2, 0]
 [0, 3, 1]
 [0, 3.5, 0.5]
 [0, 4, 0]
 [0, 5, 1]]
The correct approach would be to take every row into account and interpolate between all of them to expand the source matrix to a (7, 3) matrix. I am aware of the scipy.interpolate.interp1d and scipy.interpolate.interp2d methods, but could not get them to work based on other Stack Overflow posts or websites. Any tips or tricks would be appreciated.
Update #1: The expected values should be equally spaced.
Update #2:
What I want to do is basically use the separate columns of the original matrix, expand the length of the column to 7 and interpolate between the values of the original column. See the following example:
Source:
[[0, 1, 1]
 [0, 2, 0]
 [0, 3, 1]
 [0, 4, 0]
 [0, 5, 1]]
Split into 3 separate columns:
[0    [1    [1
 0     2     0
 0     3     1
 0     4     0
 0]    5]    1]
Expand length to 7 and interpolate between them, example for second column:
[1
1.66
2.33
3
3.66
4.33
5]
It seems like each column can be treated completely independently, but for each column you need to define essentially an "x" coordinate so that you can fit some function "f(x)" from which you generate your output matrix.
Unless the rows in your matrix are associated with some other datastructure (e.g. a vector of timestamps), an obvious set of x values is just the row-number:
x = numpy.arange(0, Source.shape[0])
You can then construct an interpolating function:
fit = scipy.interpolate.interp1d(x, Source, axis=0)
and use that to construct your output matrix:
Target = fit(numpy.linspace(0, Source.shape[0]-1, 7))
which produces:
array([[ 0.        ,  1.        ,  1.        ],
       [ 0.        ,  1.66666667,  0.33333333],
       [ 0.        ,  2.33333333,  0.33333333],
       [ 0.        ,  3.        ,  1.        ],
       [ 0.        ,  3.66666667,  0.33333333],
       [ 0.        ,  4.33333333,  0.33333333],
       [ 0.        ,  5.        ,  1.        ]])
By default, scipy.interpolate.interp1d uses piecewise-linear interpolation. There are many more exotic options within scipy.interpolate, based on higher order polynomials, etc. Interpolation is a big topic in itself, and unless the rows of your matrix have some particular properties (e.g. being regular samples of a signal with a known frequency range), there may be no "truly correct" way of interpolating. So, to some extent, the choice of interpolation scheme will be somewhat arbitrary.
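For instance, a smoother column-wise fit can be obtained simply by changing the kind argument. A sketch reusing x and Source from above (a cubic spline needs at least four rows):
import numpy
import scipy.interpolate

# same column-wise interpolation, but with a cubic spline instead of the
# default piecewise-linear fit
fit = scipy.interpolate.interp1d(x, Source, axis=0, kind='cubic')
Target = fit(numpy.linspace(0, Source.shape[0] - 1, 7))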
You can do this as follows:
from scipy.interpolate import interp1d
import numpy as np
a = np.array([[0, 1, 1],
              [0, 2, 0],
              [0, 3, 1],
              [0, 4, 0],
              [0, 5, 1]])
x = np.array(range(a.shape[0]))
# define new x range, we need 7 equally spaced values
xnew = np.linspace(x.min(), x.max(), 7)
# apply the interpolation to each column
f = interp1d(x, a, axis=0)
# get final result
print(f(xnew))
This will print
[[ 0.          1.          1.        ]
 [ 0.          1.66666667  0.33333333]
 [ 0.          2.33333333  0.33333333]
 [ 0.          3.          1.        ]
 [ 0.          3.66666667  0.33333333]
 [ 0.          4.33333333  0.33333333]
 [ 0.          5.          1.        ]]

Pass 2-dimensional data in the Multivariate normal density function of python?

I want to calculate the Gaussian PDF of two-dimensional data. I am trying to do this in Python with the scipy.stats.multivariate_normal function, but I don't understand how I can pass my data into it.
Is multivariate_normal only used for analyzing one-dimensional data in n dimensions, or can I use it for my data set as well?
data set -> X = [X1, X2, ..., Xn]
where each
Xi = [x1, x2]
is 2-dimensional.
To compute the density function, use the pdf() method of the object scipy.stats.multivariate_normal. The first argument is your array X. The next two arguments are the mean and the covariance matrix of the distribution.
For example:
In [72]: import numpy as np
In [73]: from scipy.stats import multivariate_normal
In [74]: mean = np.array([0, 1])
In [75]: cov = np.array([[2, -0.5], [-0.5, 4]])
In [76]: x = np.array([[0, 1], [1, 1], [0.5, 0.25], [1, 2], [-1, 0]])
In [77]: x
Out[77]:
array([[ 0.  ,  1.  ],
       [ 1.  ,  1.  ],
       [ 0.5 ,  0.25],
       [ 1.  ,  2.  ],
       [-1.  ,  0.  ]])
In [78]: p = multivariate_normal.pdf(x, mean, cov)
In [79]: p
Out[79]: array([ 0.05717014, 0.04416653, 0.05106649, 0.03639454, 0.03639454])
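If the mean and covariance are not known in advance, they can be estimated from the 2-D data itself before calling pdf(). A short sketch, assuming X is an (n, 2) array of samples:
import numpy as np
from scipy.stats import multivariate_normal

X = np.array([[0, 1], [1, 1], [0.5, 0.25], [1, 2], [-1, 0]])
mean = X.mean(axis=0)
cov = np.cov(X, rowvar=False)            # columns are the two dimensions
p = multivariate_normal.pdf(X, mean, cov)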

How to create a similarity matrix in numpy (Python)?

I have data in a file in following form:
user_id, item_id, rating
1, abc, 5
1, abcd, 3
2, abc, 3
2, fgh, 5
So, the matrix I want to form for the above data is the following:
# item_ids
# abc  abcd  fgh
[[5, 3, 0]   # user_id 1
 [3, 0, 5]]  # user_id 2
where missing data is replaced by 0.
From this I want to create both a user-to-user similarity matrix and an item-to-item similarity matrix.
How do I do that?
Technically, this is not a programming problem but a math problem. I think you are better off using the variance-covariance matrix, or the correlation matrix if the scales of the values are very different, say, instead of having:
>>> x
array([[5, 3, 0],
       [3, 0, 5],
       [5, 5, 0],
       [1, 1, 7]])
You have:
>>> x
array([[  5, 300,   0],
       [  3,   0,   5],
       [  5, 500,   0],
       [  1, 100,   7]])
To get a variance-covariance matrix:
>>> np.cov(x)
array([[  6.33333333,  -3.16666667,   6.66666667,  -8.        ],
       [ -3.16666667,   6.33333333,  -5.83333333,   7.        ],
       [  6.66666667,  -5.83333333,   8.33333333, -10.        ],
       [ -8.        ,   7.        , -10.        ,  12.        ]])
Or the correlation matrix:
>>> np.corrcoef(x)
array([[ 1.        , -0.5       ,  0.91766294, -0.91766294],
       [-0.5       ,  1.        , -0.80295507,  0.80295507],
       [ 0.91766294, -0.80295507,  1.        , -1.        ],
       [-0.91766294,  0.80295507, -1.        ,  1.        ]])
This is the way to look at it: the diagonal cells, e.g. the (0, 0) cell, hold the correlation of a vector in x with itself, so they are 1. The off-diagonal cells, e.g. the (0, 1) cell, hold the correlation between the 1st and 2nd vectors in x; they are negatively correlated. Similarly, the 1st and 3rd vectors are positively correlated.
A covariance or correlation matrix avoids the zero problem pointed out by @Akavall.
See this question: What's the fastest way in Python to calculate cosine similarity given sparse matrix data?
Having:
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

A = np.array([[0, 1, 0, 0, 1],
              [0, 0, 1, 1, 1],
              [1, 1, 0, 1, 0]])
dist_out = 1 - pairwise_distances(A, metric="cosine")
dist_out
Results in:
array([[ 1.        ,  0.40824829,  0.40824829],
       [ 0.40824829,  1.        ,  0.33333333],
       [ 0.40824829,  0.33333333,  1.        ]])
But that works for a dense matrix. For sparse input you would have to develop your own solution (one possibility using scikit-learn is sketched below).
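A sketch of that option, assuming scikit-learn is available: cosine_similarity accepts scipy.sparse input directly.
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

A_sparse = sparse.csr_matrix(np.array([[0, 1, 0, 0, 1],
                                       [0, 0, 1, 1, 1],
                                       [1, 1, 0, 1, 0]]))
similarities = cosine_similarity(A_sparse)   # dense (3, 3) array of pairwise similarities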
