Vectorizing an optimization in Python

I am trying to vectorize an optimization (root-finding) problem in which I need to find the roots of an integral. I haven't found any good integration functions that work with vectorization, so I coded up a trapezoidal approach to integration. The following code is an attempt to vectorize the root-finding problem (each d has a different root) for an input array d of length N=300000, instead of looping over the values of d.
from scipy import optimize as op
import numpy as np
N=300000
d=np.random.rand(N)
def f(z): # vectorized version- depends on d as well
    return ((1-2*z)*np.exp(-d[:,None]/z))/(((1-z)**(2+d[:,None]))*(z**(2-d[:,None])))
def trap(func,start,end,num): #trapezoidal approach to integration
    diff=(end-start)/(num-1.0)
    g=np.arange(num)
    xlinear=diff[:,None]*g+start[:,None]
    fx=func(xlinear)
    return np.trapz(fx,xlinear)
def integral(p):
    start=np.zeros_like(p) +0.001
    return trap(f,start,p,100)
root=op.fsolve(integral,0.75*np.ones_like(d))
However, the above code is throwing a memory error for me.
Traceback (most recent call last):
...
File "H:...my_module.py", line 344, in
root=op.fsolve(integral,0.75*np.ones_like(a))
File "C:..scipy\optimize\minpack.py", line 140, in fsolve
res = _root_hybr(func, x0, args, jac=fprime, **options)
File "C:...scipy\optimize\minpack.py", line 209, in _root_hybr
ml, mu, epsfcn, factor, diag)
MemoryError
I am open to completely different approaches to do this.
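One likely cause of the MemoryError: fsolve treats the whole array as a single coupled N-dimensional system, and the underlying MINPACK routine works with a dense N x N Jacobian approximation. For N=300000 that is about 9e10 floats (roughly 720 GB). Since each root depends only on its own d, an elementwise bracketing method avoids the Jacobian entirely. Below is a minimal sketch of vectorized bisection. The helper name bisect_vectorized and the stand-in function x**3 + x - d are assumptions for illustration, not the integral from the question; to use it with integral you would first need a bracket [lo, hi] on which the integral changes sign for every d, which is not verified here.

```python
import numpy as np

def bisect_vectorized(func, lo, hi, iterations=60):
    """Elementwise bisection; func must change sign on every [lo[i], hi[i]]."""
    lo = np.array(lo, dtype=float)
    hi = np.array(hi, dtype=float)
    sign_lo = np.sign(func(lo))  # sign at the lower end never changes
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        same_sign = np.sign(func(mid)) == sign_lo
        lo = np.where(same_sign, mid, lo)  # root lies in the upper half
        hi = np.where(same_sign, hi, mid)  # root lies in the lower half
    return 0.5 * (lo + hi)

# Stand-in problem (hypothetical): solve x**3 + x = d for every d at once.
N = 300000
d = np.random.default_rng(0).random(N)

def g(x):
    return x**3 + x - d

roots = bisect_vectorized(g, np.zeros(N), np.ones(N))
max_residual = np.max(np.abs(g(roots)))
```

Memory use stays proportional to N (a handful of length-N arrays), and every iteration is a single vectorized function evaluation.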

Related

Which unsupervised clustering algorithm from the sklearn library can I use with custom distance?

I have a function that takes two samples as input and returns their distance, and from this function I have defined a metric:
def TwoPointsDistance(x1, x2):
    cord1 = f.rf.apply(x1)
    cord2 = f.rf.apply(x2)
    return 1 - (cord1==cord2).sum()/f.n_trees
metric = sk.neighbors.DistanceMetric.get_metric('pyfunc',
                                                func=TwoPointsDistance)
Now I would like to cluster my data according to this metric. I would like to see some examples of algorithms for unsupervised clustering that use this as a distance metric.
EDIT: I am particularly interested in this algorithm:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN
EDIT: I have tried
DBSCAN(metric=metric, algorithm='brute').fit(Xor)
but I receive an error:
>>> Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.4/dist-packages/sklearn/cluster/dbscan_.py", line 249, in fit
clust = dbscan(X, **self.get_params())
File "/usr/local/lib/python3.4/dist-packages/sklearn/cluster/dbscan_.py", line 100, in dbscan
metric=metric, p=p)
File "/usr/local/lib/python3.4/dist-packages/sklearn/neighbors/unsupervised.py", line 83, in __init__
leaf_size=leaf_size, metric=metric, **kwargs)
File "/usr/local/lib/python3.4/dist-packages/sklearn/neighbors/base.py", line 127, in _init_params
% (metric, algorithm))
ValueError: Metric '<sklearn.neighbors.dist_metrics.PyFuncDistance object at 0x7ff5c299f358>' not valid for algorithm 'brute'
>>>
I've tried to figure out why this error arises... I first thought sklearn.neighbors.NearestNeighbors (which is what DBSCAN is based upon) would be constrained to those distances listed in sklearn.neighbors.base.VALID_METRICS["brute"]. But judging from the source code, any callable function should be okay - so it seems your distance isn't callable?
Please try this:
DBSCAN(metric=TwoPointsDistance, algorithm='brute').fit(Xor)
i.e. without wrapping your distance as neighbors.DistanceMetric. It seems a bit inconsistent to me to not allow these to be used here...
Myself, I have used ELKI with great success with a custom distance function, and there is a short tutorial on how to write these available: http://elki.dbs.ifi.lmu.de/wiki/Tutorial/DistanceFunctions
Today, years later, I stumbled over this again in a different context. The solution is simple: pass the function directly as the metric.
DBSCAN(metric=TwoPointsDistance, algorithm='brute').fit(Xor)
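For reference, here is a small self-contained sketch of passing a plain Python callable as the DBSCAN metric. The function mismatch_distance and the toy data are assumptions standing in for TwoPointsDistance and Xor, which depend on a fitted random forest not shown in the question.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def mismatch_distance(x1, x2):
    """Hypothetical stand-in metric: fraction of coordinates that differ."""
    return float(np.mean(x1 != x2))

# Toy data: two groups whose members differ in at most one coordinate.
X = np.array([
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
    [5, 5, 5, 5],
    [5, 5, 5, 6],
    [5, 5, 6, 5],
], dtype=float)

# The callable is passed directly, without wrapping it in DistanceMetric.
model = DBSCAN(eps=0.3, min_samples=2, metric=mismatch_distance,
               algorithm='brute')
labels = model.fit(X).labels_
```

A Python callable is evaluated once per pair of samples, so this gets slow on large data sets; precomputing the distance matrix and using metric='precomputed' is a common alternative.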

Trying to parallelize parameter search in scikit-learn leads to "SystemError: NULL result without error in PyObject_Call"

I'm using the sklearn.grid_search.RandomizedSearchCV class from scikit-learn 0.14.1, and I get an error when running the following code:
X, y = load_svmlight_file(inputfile)
min_max_scaler = preprocessing.MinMaxScaler()
X_scaled = min_max_scaler.fit_transform(X.toarray())
parameters = {'kernel':'rbf', 'C':scipy.stats.expon(scale=100), 'gamma':scipy.stats.expon(scale=.1)}
svr = svm.SVC()
classifier = grid_search.RandomizedSearchCV(svr, parameters, n_jobs=8)
classifier.fit(X_scaled, y)
When I set the n_jobs parameter to more than 1, I get the following error output:
Traceback (most recent call last):
File "./svm_training.py", line 185, in <module>
main(sys.argv[1:])
File "./svm_training.py", line 63, in main
gridsearch(inputfile, kerneltype, parameterfile)
File "./svm_training.py", line 85, in gridsearch
classifier.fit(X_scaled, y)
File "/usr/local/lib/python2.7/dist-packages/scikit_learn-0.14.1-py2.7-linux-x86_64.egg/sklearn/grid_search.py", line 860, in fit
return self._fit(X, y, sampled_params)
File "/usr/local/lib/python2.7/dist-packages/scikit_learn-0.14.1-py2.7-linux-x86_64.egg/sklearn/grid_search.py", line 493, in _fit
for parameters in parameter_iterable
File "/usr/local/lib/python2.7/dist-packages/scikit_learn-0.14.1-py2.7-linux-x86_64.egg/sklearn/externals/joblib/parallel.py", line 519, in __call__
self.retrieve()
File "/usr/local/lib/python2.7/dist-packages/scikit_learn-0.14.1-py2.7-linux-x86_64.egg/sklearn/externals/joblib/parallel.py", line 419, in retrieve
self._output.append(job.get())
File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
raise self._value
SystemError: NULL result without error in PyObject_Call
It seems to have something to do with Python's multiprocessing functionality, but I'm not sure how to work around it other than implementing the parallelization of the parameter search by hand. Has anyone had a similar issue when parallelizing the randomized parameter search, and were they able to solve it?
It turns out the problem was with the use of MinMaxScaler. Since MinMaxScaler only accepts dense arrays, I was translating the sparse representation of the feature vector to a dense array before scaling. Since the feature vector has thousands of elements, my assumption is that the dense arrays caused a memory error when trying to parallelize the parameter search. Instead, I switched to StandardScaler, which accepts sparse arrays as input, and should be better for use with my problem space anyway.
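A minimal sketch of scaling sparse input without densifying it is below. The toy matrix is an assumption standing in for the output of load_svmlight_file. Note that StandardScaler only accepts sparse input when with_mean=False, because centering would destroy the sparsity.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Toy sparse feature matrix (stand-in for the svmlight data).
X = sparse.csr_matrix(np.array([
    [0.0, 2.0, 0.0],
    [1.0, 0.0, 0.0],
    [3.0, 4.0, 0.0],
]))

# with_mean=False: scale each column to unit variance without centering.
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X)

still_sparse = sparse.issparse(X_scaled)
column_std = X_scaled.toarray().std(axis=0)
```

The scaled matrix stays in sparse format, so each worker started by n_jobs receives a much smaller object than the dense array would be.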

SymPy cannot solve an equation that Matlab can

I have an equation which is related to the sun-synchronous resonance condition in orbital mechanics. I'm learning Python at the moment, so I attempted to solve it in SymPy using the following code:
from sympy import symbols,solve
[n_,Re_,p_,i_,J2_,Pe_] = symbols(['n_','Re_','p_','i_','J2_','Pe_'])
del_ss = -((3*n_*(Re_**2)*J2_/(4*(p_**2)))*(4-5*(sin(i_)**2)))-((3*n_*(Re_**2)*J2_/(2*(p_**2)))*cos(i_))-((2*pi)/Pe_)
pprint(solve(del_ss,i_))
The expression can be successfully rearranged for five of the variables, but when the variable i_ is used in the solve command (as above), an error is produced:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 479, in runfile
execfile(filename, namespace)
File "C:\Users\Nathan\Python\sympy_test_1.py", line 22, in <module>
pprint(solve(del_ss,i_))
File "C:\Python27\lib\site-packages\sympy\solvers\solvers.py", line 484, in solve
solution = _solve(f, *symbols, **flags)
File "C:\Python27\lib\site-packages\sympy\solvers\solvers.py", line 700, in _solve
soln = tsolve(f_num, symbol)
File "C:\Python27\lib\site-packages\sympy\solvers\solvers.py", line 1143, in tsolve
"(tsolve: at least one Function expected at this point")
NotImplementedError: Unable to solve the equation(tsolve: at least one Function expected at this point
However, when the same expression is entered into Matlab and the solve command is called, it is rearranged correctly. I realise that the error mentions a non-implemented feature and that the two functions will no doubt differ, but it would still be nice to know if there's a more appropriate SymPy function that I can use. Any help would be greatly appreciated.
Use the sympy version of Pi.
Substitute cos(i_) by a new variable ci_, replace sin(i_)**2 by 1-ci_**2, and solve for ci_.
This should do it:
from sympy import symbols,solve,pi,pprint
[n_,Re_,p_,ci_,J2_,Pe_] = symbols(['n_','Re_','p_','ci_','J2_','Pe_'])
del_ss = -((3*n_*(Re_**2)*J2_/(4*(p_**2)))*(4-5*(1-ci_**2)))-((3*n_*(Re_**2)*J2_/(2*(p_**2)))*ci_)-((2*pi)/Pe_)
pprint(solve(del_ss,ci_))
(Edited because I only wrote half of the solution in the first attempt...)
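To get back to the inclination, the substitution has to be undone with i_ = acos(ci_). A minimal sketch of the full round trip follows. The shorthand k and the residual check are my additions, and the sketch assumes the principal branch of acos is the one you want.

```python
from sympy import symbols, solve, acos, pi, simplify

n_, Re_, p_, ci_, J2_, Pe_ = symbols('n_ Re_ p_ ci_ J2_ Pe_')

# Shorthand for the common factor 3*n*Re^2*J2 / (4*p^2).
k = 3 * n_ * Re_**2 * J2_ / (4 * p_**2)

# Same expression as above, with cos(i_) -> ci_ and sin(i_)**2 -> 1 - ci_**2.
del_ss = -k * (4 - 5 * (1 - ci_**2)) - 2 * k * ci_ - 2 * pi / Pe_

# The equation is now a quadratic in ci_, so solve returns two roots.
roots = solve(del_ss, ci_)

# Undo the substitution: i_ = acos(ci_) (principal branch only).
inclinations = [acos(r) for r in roots]

# Sanity check: substituting each root back should give zero.
residuals = [simplify(del_ss.subs(ci_, r)) for r in roots]
```

Only roots with numeric values in [-1, 1] correspond to real inclinations, so that needs checking once numbers are substituted for the symbols.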

Scipy odeint ODE error on array sizes

I am trying to solve an ODE which arises from N-body problems in field theory in Physics. For that I thought of using scipy.integrate.odeint function and I have written some code which can be found on:
http://pastebin.com/yuBbEjwg (updated since the question was first posed)
However, when I try to execute it, I get the following error:
Traceback (most recent call last):
File "./main.py", line 87, in <module>
solution = odeint(ODE,XV0,t,args=(M,))
File "/usr/lib/python2.7/site-packages/scipy/integrate/odepack.py", line 143, in odeint
ixpr, mxstep, mxhnil, mxordn, mxords)
ValueError: object too deep for desired array
Could somebody point out what I am doing wrong, and why my code doesn't work? Also, is there any difference between using the ode and odeint functions in my case?
Thanks.
EDIT: corrected silly mistakes (shape() -> shape), thanks to Talonmies for pointing that out. The link above should point to the correct script now.
EDIT 2: I suspect that the odeint function doesn't like the tuple returned by the ODE function. Could somebody explain how to format the tuple when coupled vector ODEs need to be solved? I found cases where people are solving coupled ODEs or vector ODEs, but not both...
EDIT 3: I have reworked the example so that I give the odeint function a matrix of initial conditions, and the matrix returned from the function named ODE has the same dimensions... However, I get the same error.
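odeint expects the state as a flat 1-D array, and the right-hand-side function has to return a 1-D array of the same length. A tuple of arrays or a matrix is a typical trigger for "object too deep for desired array". A minimal sketch of the usual flatten/reshape pattern follows. The force law (independent harmonic oscillators) and all names are assumptions for illustration, since the pastebin script is not reproduced here.

```python
import numpy as np
from scipy.integrate import odeint

N_BODIES, DIM = 3, 2

def rhs(state, t, k):
    # Unpack the flat state into positions and velocities, each (N_BODIES, DIM).
    x, v = state.reshape(2, N_BODIES, DIM)
    dxdt = v
    dvdt = -k * x  # hypothetical force law, replace with the real interaction
    # Pack the derivatives back into one flat 1-D array.
    return np.concatenate([dxdt.ravel(), dvdt.ravel()])

x0 = np.arange(N_BODIES * DIM, dtype=float).reshape(N_BODIES, DIM)
v0 = np.zeros((N_BODIES, DIM))
state0 = np.concatenate([x0.ravel(), v0.ravel()])  # flat initial condition

t = np.linspace(0.0, 1.0, 11)
solution = odeint(rhs, state0, t, args=(1.0,))

# Recover positions per time step as (len(t), N_BODIES, DIM).
x_t = solution[:, :N_BODIES * DIM].reshape(len(t), N_BODIES, DIM)
```

The matrix layout only exists inside rhs and after the solve; odeint itself only ever sees 1-D arrays.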

Applying SVD throws a Memory Error instantaneously?

I am trying to apply SVD on my matrix (3241 x 12596) that was obtained after some text processing (with the ultimate goal of performing Latent Semantic Analysis) and I am unable to understand why this is happening as my 64-bit machine has 16GB RAM. The moment svd(self.A) is called, it throws an error. The precise error is given below:
Traceback (most recent call last):
File ".\SVD.py", line 985, in <module>
_svd.calc()
File ".\SVD.py", line 534, in calc
self.U, self.S, self.Vt = svd(self.A)
File "C:\Python26\lib\site-packages\scipy\linalg\decomp_svd.py", line 81, in svd
overwrite_a = overwrite_a)
MemoryError
So I tried using
self.U, self.S, self.Vt = svd(self.A, full_matrices= False)
and this time, it throws the following error:
Traceback (most recent call last):
File ".\SVD.py", line 985, in <module>
_svd.calc()
File ".\SVD.py", line 534, in calc
self.U, self.S, self.Vt = svd(self.A, full_matrices= False)
File "C:\Python26\lib\site-packages\scipy\linalg\decomp_svd.py", line 71, in svd
return numpy.linalg.svd(a, full_matrices=0, compute_uv=compute_uv)
File "C:\Python26\lib\site-packages\numpy\linalg\linalg.py", line 1317, in svd
work = zeros((lwork,), t)
MemoryError
Is this matrix simply too large for NumPy to handle, and is there something I can do at this stage without changing the methodology itself?
Yes, the full_matrices parameter to scipy.linalg.svd is important: your input is highly rank-deficient (rank max 3,241), so you don't want to allocate the entire 12,596 x 12,596 matrix for V!
More importantly, matrices coming from text processing are likely very sparse. The scipy.linalg.svd is dense and doesn't offer truncated SVD, which results in a) tragic performance and b) lots of wasted memory.
Have a look at the sparsesvd package from PyPI, which works over sparse input and lets you ask for the top K factors only. Or try scipy.sparse.linalg.svds, though that's not as efficient and is only available in newer versions of scipy.
Or, to avoid the gritty details completely, use a package that does efficient LSA for you transparently, such as gensim.
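As a rough sketch of the truncated approach with scipy.sparse.linalg.svds: the random matrix below is a stand-in for the real term-document matrix, and k=10 is an arbitrary choice for illustration.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# Stand-in for a sparse term-document matrix (1% non-zero entries).
A = sparse.random(300, 1000, density=0.01, format='csr', random_state=0)

# Compute only the k largest singular triplets instead of the full SVD.
k = 10
U, S, Vt = svds(A, k=k)

# The ordering returned by svds is not guaranteed, so sort descending.
order = np.argsort(S)[::-1]
U, S, Vt = U[:, order], S[order], Vt[order, :]
```

Here U is 300 x 10 and Vt is 10 x 1000, so memory scales with k rather than with the square of the larger dimension.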
As it turns out (thanks to @Ferdinand Beyer), I had not noticed that I was using a 32-bit version of Python on my 64-bit machine.
Using a 64-bit version of Python and reinstalling all the libraries solved the problem.
