I have the following code:
import numpy as np
from sklearn import svm
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from functools import partial
import pandas as pd
def tanimotoKernel(xs, ys):
    a = 0
    b = 0
    for x, y in zip(xs, ys):
        a += min(x, y)
        b += max(x, y)
    return a / b
# gammaExp = 1/(np.exp(gamma) - 1), calculated outside the kernel
def tanimotoLambdaKernel(xs, ys, gamma, gammaExp):
    return (np.exp(gamma * tanimotoKernel(xs, ys)) - 1) * gammaExp
class GramBuilder:
    def __init__(self, Kernel):
        self._Kernel = Kernel

    def generateMatrixBuilder(self, X1, X2):
        gram_matrix = np.zeros((X1.shape[0], X2.shape[0]))
        for i, x1 in enumerate(X1):
            for j, x2 in enumerate(X2):
                gram_matrix[i, j] = self._Kernel(x1, x2)
        return gram_matrix
gammaList = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
CList = [0.001, 0.01, 0.1, 1, 10, 100]
X, y = datasets.load_digits(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y)
svc_list = [
    (svm.SVC(
        kernel=GramBuilder(
            partial(tanimotoLambdaKernel, gamma=x, gammaExp=1/(np.exp(x) - 1)))
        .generateMatrixBuilder),
     x)
    for x in gammaList
]
gammas = []
Cs = []
accuracy = []
for svc, gamma in svc_list:
    print("Training gamma ", gamma)
    clf = GridSearchCV(svc, {'C': CList}, verbose=1, n_jobs=-1)
    clf.fit(x_train, y_train)
    gammas.append(gamma)
    Cs.append(clf.best_params_['C'])
    accuracy.append(clf.best_score_)
For this toy dataset I have to wait roughly 50 minutes to run all the cross-validations in the loop.
The first improvement was to compute gammaExp outside the kernel, which saves millions of exponentials. Since multiplication is faster than division, I also precomputed the inverse of the exponential minus one to save a bit more time.
Those changes sped up training considerably, but I still need it to be faster, so I would appreciate any ideas. Thanks.
You can use NumPy to speed up the min/max operations, and then use Numba's JIT to speed up the code even more by inlining the calls.
import numba as nb

@nb.njit
def tanimotoKernel(xs, ys):
    a = np.minimum(xs, ys).sum()
    b = np.maximum(xs, ys).sum()
    return a / b

@nb.njit
def tanimotoLambdaKernel(xs, ys, gamma, gammaExp):
    return (np.exp(gamma * tanimotoKernel(xs, ys)) - 1) * gammaExp
# [...]
The above code should be correct and is more than 20 times faster on my machine; it took only a few minutes to complete.
I think you can speed things up even more by removing the partial call and using Numba for the GramBuilder class as well (see the Numba documentation on JIT-compiling classes; partial functions are probably not supported, but you can store the values in the class and do that part of the job yourself). Also note that many operations seem to be performed multiple times in the kernel: it is called with the same x2 over and over, so it should be possible to compute those quantities once instead of recomputing the max again and again. A rough sketch of the first idea is shown below.
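To illustrate that idea, here is a rough, untested sketch (the name tanimoto_lambda_gram and its exact interface are my own) that JIT-compiles the whole Gram-matrix loop with Numba and takes gamma and gammaExp as plain arguments, so neither partial nor a per-pair Python call sits in the hot loop:
import numpy as np
import numba as nb

@nb.njit(parallel=True)
def tanimoto_lambda_gram(X1, X2, gamma, gammaExp):
    # Build the whole Gram matrix inside one compiled function.
    out = np.empty((X1.shape[0], X2.shape[0]))
    for i in nb.prange(X1.shape[0]):
        for j in range(X2.shape[0]):
            a = 0.0
            b = 0.0
            for k in range(X1.shape[1]):
                a += min(X1[i, k], X2[j, k])
                b += max(X1[i, k], X2[j, k])
            out[i, j] = (np.exp(gamma * (a / b)) - 1) * gammaExp
    return out

# Hypothetical usage: the partial wrapper is applied once, outside the compiled loop.
# svc = svm.SVC(kernel=partial(tanimoto_lambda_gram, gamma=g, gammaExp=1/(np.exp(g) - 1)))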
Extending the examples from http://implicit-layers-tutorial.org/neural_odes/, I am trying to mimic the curve-fitting function in scipy (scipy.optimize.curve_fit) using Google JAX. The function to be fitted is a first-order ODE.
#Generate toy data for first order ode.
import jax.numpy as jnp
import jax
import numpy as np
#input data
u = np.zeros(100)
u[10:50] = 1
t = np.arange(len(u))
u = jnp.array(u)
# first-order ODE
def f(y, t, k, tau, u):
    return (k * u[t] - y) / tau
# Euler integration
def odeint_euler(f, y0, t, *args):
    def step(state, t):
        y_prev, t_prev = state
        dt = t - t_prev
        y = y_prev + dt * f(y_prev, t_prev, *args)
        return (y, t), y
    _, ys = jax.lax.scan(step, (y0, t[0]), t[1:])
    return ys
pred = odeint_euler(f, jnp.array([0.0]),t,2.,5.,u)
pred_noise = pred.reshape(-1) + 0.05* np.random.randn(len(pred)) # this is the data to be fitted
# define loss function
def loss_function(params, u, targets):
    k, tau = params
    pred = odeint_euler(f, jnp.array([0.0]), t, k, tau, u)
    return jnp.sum((pred - targets)**2)
def update(params, u, targets):
    grads = jax.grad(loss_function)(params, u, targets)
    return [w - 0.0001 * dw for w, dw in zip(params, grads)]
updated_params = jnp.array([1.0, 2.0])  # initial parameters
for i in range(100):
    updated_params = update(updated_params, u, pred_noise)
    print(updated_params)
The code works fine. However, it runs pretty slowly compared to scipy's curve fit, and the accuracy of the solution is not good even after 500 or 1000 iterations.
What is wrong with the above code? Any idea how to make it run faster and give a more accurate solution? Is there a better way of doing curve fitting with JAX?
I see two overall issues with your approach:
The reason your code runs slowly is that you are doing your looping in Python, which incurs JAX's dispatch overhead on every iteration. I'd recommend using JAX's built-in tools for minimizing loss functions; for example:
from jax.scipy.optimize import minimize

result = minimize(
    loss_function, x0=jnp.array([1.0, 2.0]),
    method='BFGS', args=(u, pred_noise))
The reason your accuracy does not match scipy's is likely that JAX defaults to 32-bit computations (see Double (64 bit) Precision in the JAX docs). To run your code in 64-bit, you can run this block before any other JAX code:
from jax import config
config.update('jax_enable_x64', True)
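If you want to keep the explicit gradient-descent loop instead, a minimal sketch (assuming the loss_function, u, t and pred_noise defined above, and keeping the original 0.0001 step size) is to JIT-compile the update step so each iteration is one compiled call rather than many small dispatches:
import jax
import jax.numpy as jnp

@jax.jit
def update_jit(params, u, targets):
    # One fused, compiled gradient-descent step on the parameter array.
    grads = jax.grad(loss_function)(params, u, targets)
    return params - 0.0001 * grads

params = jnp.array([1.0, 2.0])
for i in range(1000):
    params = update_jit(params, u, pred_noise)
print(params)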
I am trying to implement the RBF kernel function for my kernel k-means algorithm. Here is my formula: k(x, y) = exp(-||x - y||^2 / (2*sigma^2)), where 2*sigma^2 is estimated as the average squared distance between all pairs of points.
I then implemented it with NumPy, but it has a two-layer for loop, and I'm wondering how to turn it into a matrix operation, because if I could use matrix operations it would be much faster on my 784-dimensional data. Or maybe my implementation is not correct? Can someone help me?
import numpy as np

def get_gamma(X, Y):
    gamma = 0
    for x in X:
        for y in Y:
            tmp = x - y
            gamma += tmp**2
    gamma = gamma / (length**2)  # NOTE: `length` (the number of points) is not defined in this snippet
    return gamma

def kernel(X, Y, gamma):
    up = np.sum(np.power(X - Y, 2))
    res = np.exp(-up / gamma)
    return res

def kernel_distance(X, Y):
    gamma = get_gamma(X, Y)
    a = kernel(X, X, gamma)
    b = kernel(Y, Y, gamma)
    c = kernel(X, Y, gamma)
    return np.sqrt(a + b - 2*c)
That's odd: if I run your code, it gives me a single number for k. But shouldn't it be an array? Also, shouldn't X and Y be 2D, since they are basically lists of your points? Anyway, if I take my own X and Y
from scipy.spatial.distance import cdist
import numpy as np
n = 10
X = np.random.random((n,3))
Y = np.random.random((n,3))
I can solve your problem like this
norms_sq = cdist(X,Y,'sqeuclidean')
two_sigma_sq = 1/n**2*np.sum(norms_sq)
k = np.exp(-norms_sq/two_sigma_sq)
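If you would rather not depend on scipy, an equivalent sketch using only NumPy broadcasting (with the 1/n**2 generalized to the actual number of pairs) looks like this:
# (n, m, d) array of pairwise differences, reduced to an (n, m) matrix of squared distances
diff = X[:, None, :] - Y[None, :, :]
norms_sq = np.einsum('ijk,ijk->ij', diff, diff)
two_sigma_sq = norms_sq.sum() / (X.shape[0] * Y.shape[0])
k = np.exp(-norms_sq / two_sigma_sq)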
I'm studying Gaussian process regression, and I'm trying to use the built-in functions from scikit-learn while also trying to implement a custom function for doing so.
This is the code when using scikit-learn:
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor as gpr
from sklearn.gaussian_process.kernels import RBF,WhiteKernel,ConstantKernel as C
from scipy.optimize import minimize
import scipy.stats as s
X = np.linspace(0,10,10).reshape(-1,1) # Input Values
Y = 2*X + np.sin(X) # Function
v = 1
kernel = v*RBF() + WhiteKernel() #Defining kernel
gp = gpr(kernel=kernel, n_restarts_optimizer=50).fit(X, Y)  # fitting the process to get optimized hyperparameters
gp.kernel_ #Hyperparameters optimized by the GPR function in scikit-learn
Out[]: 14.1**2 * RBF(length_scale=3.7) + WhiteKernel(noise_level=1e-05) #result
And this is the code I wrote manually:
def marglike(par, X, Y):  # defining log-marginal-likelihood
    # print(par)
    l, var, sigma_n = par
    n = len(X)
    dist_X = (X - X.T)**2
    # print(dist_X)
    k = var*np.exp(-(1/(2*(l**2)))*dist_X)
    inverse = np.linalg.inv(k + (sigma_n**2)*np.eye(len(k)))
    ml = ((1/2)*np.dot(np.dot(Y.T, inverse), Y)
          + (1/2)*np.log(np.linalg.det(k + (sigma_n**2)*np.eye(len(k))))
          + (n/2)*np.log(2*np.pi))
    return ml

b = [0.0005, 100]
bnd = [b, b, b]  # bounds used for "minimize" function
start = np.array([1.1, 1.6, 0.05])  # initial hyperparameter values
re = minimize(marglike, start, args=(X, Y), method="L-BFGS-B", options={'disp': True}, bounds=bnd)  # the method used is the same as the one used by scikit-learn
re.x  # Hyperparameter results
Out[]: array([3.55266484e+00, 9.99986210e+01, 5.00000000e-04])
As you can see, the hyperparameters I got from the two methods are different, even though I used the same data (X, Y) and the same minimization method.
Could somebody help me understand why, and maybe how to get the same results?
As suggested by San Mason, adding noise actually works! Otherwise, while doing it manually (in the custom code), set the initial noise reasonably low and use multiple restarts with different initializations, and you will get values close to scikit-learn's. By the way, noiseless data seems to create a stationary ridge in the hyperparameter space (like Fig. 1.6 in the Surrogates GP book). Note that scikit-learn's noise_level corresponds to sigma_n**2 in your custom function. Below are snippets for the noisy and noise-less cases.
Noise-less case
scikit-learn
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor as gpr
from sklearn.gaussian_process.kernels import RBF,WhiteKernel,ConstantKernel as C
from scipy.optimize import minimize
import scipy.stats as s
X = np.linspace(0,10,10).reshape(-1,1) # Input Values
Y = 2*X + np.sin(X) #+ np.random.normal(10)# Function
v = 1
kernel = v*RBF() + WhiteKernel() #Defining kernel
gp = gpr(kernel=kernel, n_restarts_optimizer=50).fit(X, Y)  # fitting the process to get optimized hyperparameters
gp.kernel_ #Hyperparameters optimized by the GPR function in scikit-learn
# Out[]: 14.1**2 * RBF(length_scale=3.7) + WhiteKernel(noise_level=1e-05) #result
custom function
def marglike(par, X, Y):  # defining log-marginal-likelihood
    # print(par)
    l, std, sigma_n = par
    n = len(X)
    dist_X = (X - X.T)**2
    # print(dist_X)
    k = std**2*np.exp(-(dist_X/(2*(l**2)))) + (sigma_n**2)*np.eye(n)
    inverse = np.linalg.inv(k)
    ml = (1/2)*np.dot(np.dot(Y.T, inverse), Y) + (1/2)*np.log(np.linalg.det(k)) + (n/2)*np.log(2*np.pi)
    return ml[0, 0]

b = [10**-5, 10**5]
bnd = [b, b, b]  # bounds used for "minimize" function
start = [1, 1, 10**-5]  # initial hyperparameter values
re = minimize(fun=marglike, x0=start, args=(X, Y), method="L-BFGS-B", options={'disp': True}, bounds=bnd)  # the method used is the same as the one used by scikit-learn
re.x[1], re.x[0], re.x[2]**2
# Output - (9.920690495739379, 3.5657912350017575, 1.0000000000000002e-10)
Noisy case
scikit-learn
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor as gpr
from sklearn.gaussian_process.kernels import RBF,WhiteKernel,ConstantKernel as C
from scipy.optimize import minimize
import scipy.stats as s
X = np.linspace(0,10,10).reshape(-1,1) # Input Values
Y = 2*X + np.sin(X) + np.random.normal(size=10).reshape(10,1)*0.1 # Function
v = 1
kernel = v*RBF() + WhiteKernel() #Defining kernel
gp = gpr(kernel=kernel, n_restarts_optimizer=50).fit(X, Y)  # fitting the process to get optimized hyperparameters
gp.kernel_ #Hyperparameters optimized by the GPR function in scikit-learn
# Out[]: 10.3**2 * RBF(length_scale=3.45) + WhiteKernel(noise_level=0.00792) #result
Custom function
def marglike(par, X, Y):  # defining log-marginal-likelihood
    # print(par)
    l, std, sigma_n = par
    n = len(X)
    dist_X = (X - X.T)**2
    # print(dist_X)
    k = std**2*np.exp(-(dist_X/(2*(l**2)))) + (sigma_n**2)*np.eye(n)
    inverse = np.linalg.inv(k)
    ml = (1/2)*np.dot(np.dot(Y.T, inverse), Y) + (1/2)*np.log(np.linalg.det(k)) + (n/2)*np.log(2*np.pi)
    return ml[0, 0]

b = [10**-5, 10**5]
bnd = [b, b, b]  # bounds used for "minimize" function
start = [1, 1, 10**-5]  # initial hyperparameter values
re = minimize(fun=marglike, x0=start, args=(X, Y), method="L-BFGS-B", options={'disp': True}, bounds=bnd)  # the method used is the same as the one used by scikit-learn
re.x[1], re.x[0], re.x[2]**2
# Output - (10.268943740577331, 3.4462604625225106, 0.007922681239535326)
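For the "multiple restarts" part of the advice, here is a small sketch (the sampling ranges for the starting points are arbitrary choices of mine) that reruns L-BFGS-B from several random initializations and keeps the best optimum:
# Restart the optimizer from random starting points and keep the lowest negative log-marginal-likelihood.
rng = np.random.default_rng(0)
best = None
for _ in range(20):
    start = [rng.uniform(0.1, 10), rng.uniform(0.1, 10), rng.uniform(1e-5, 1e-1)]
    res = minimize(fun=marglike, x0=start, args=(X, Y), method="L-BFGS-B", bounds=bnd)
    if best is None or res.fun < best.fun:
        best = res
print(best.x[1], best.x[0], best.x[2]**2)  # std, length scale, noise variance, as above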
Recently I found an interesting article about a regression clustering algorithm that can handle both regression and clustering tasks:
http://ncss.wpengine.netdna-cdn.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Regression_Clustering.pdf
I'm just curious: are there techniques (or libraries) for doing this in Python? Thanks!
The algorithm of Spath is not implemented in Python, as far as I know.
But you could replicate its results using Gaussian mixture models in scikit-learn:
import numpy as np
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt
# generate random data
np.random.seed(1)
n = 10
x1 = np.random.uniform(0, 20, size=n)
x2 = np.random.uniform(0, 20, size=n)
y1 = x1 + np.random.normal(size=n)
y2 = 15 - x2 + np.random.normal(size=n)
x = np.concatenate([x1, x2])
y = np.concatenate([y1, y2])
data = np.vstack([x, y]).T
model = GaussianMixture(n_components=2).fit(data)
plt.scatter(x, y, c=model.predict(data))
plt.show()
This code produces a picture similar to the one in the paper:
The GMM differs from the Spath algorithm in that the former tries to maximize prediction accuracy of ALL the data (X and y), while the latter maximizes only the R^2 of y. In my opinion, for most practical problems you would prefer the GMM.
If you still want the Spath algorithm, it could be done with a class like this, implementing a version of the EM algorithm:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.base import RegressorMixin, BaseEstimator, clone

class ClusteredRegressor(RegressorMixin, BaseEstimator):
    def __init__(self, n_components=2, base=Ridge(), random_state=1, max_iter=100, tol=1e-10, verbose=False):
        self.n_components = n_components
        self.base = base
        self.random_state = random_state
        self.max_iter = max_iter
        self.tol = tol
        self.verbose = verbose

    def fit(self, X, y):
        np.random.seed(self.random_state)
        self.estimators_ = [clone(self.base) for i in range(self.n_components)]
        # initialize cluster responsibilities randomly
        self.resp_ = np.random.uniform(size=(X.shape[0], self.n_components))
        self.resp_ /= self.resp_.sum(axis=1, keepdims=True)
        for it in range(self.max_iter):
            old_resp = self.resp_.copy()
            # Estimate sample-weighted regressions
            errors = np.empty(shape=self.resp_.shape)
            for i, est in enumerate(self.estimators_):
                est.fit(X, y, sample_weight=self.resp_[:, i])
                errors[:, i] = y - est.predict(X)
            self.mse_ = np.sum(self.resp_ * errors**2) / X.shape[0]
            if self.verbose:
                print(self.mse_)
            # Recalculate responsibilities
            self.resp_ = np.exp(-errors**2 / self.mse_)
            self.resp_ /= self.resp_.sum(axis=1, keepdims=True)
            # stop if the change in responsibilities is small
            delta = np.abs(self.resp_ - old_resp).mean()
            if delta < self.tol:
                break
        self.n_iter_ = it
        return self

    def predict(self, X):
        """ Calculate a matrix of conditional predictions """
        return np.vstack([est.predict(X) for est in self.estimators_]).T

    def predict_proba(self, X, y):
        """ Estimate cluster probabilities of labeled data """
        predictions = self.predict(X)
        errors = np.empty(shape=self.resp_.shape)
        for i, est in enumerate(self.estimators_):
            errors[:, i] = y - est.predict(X)
        resp_ = np.exp(-errors**2 / self.mse_)
        resp_ /= resp_.sum(axis=1, keepdims=True)
        return resp_
This code is similar to the Spath algorithm, with the only difference being that it uses soft "responsibilities" of each cluster for each observation instead of hard cluster assignments (which makes optimization easier). You can see that the resulting cluster assignment is similar to the GMM's:
model = ClusteredRegressor()
model.fit(x[:, np.newaxis], y)
labels = np.argmax(model.resp_, axis=1)
plt.scatter(x, y, c=labels)
plt.show()
Unfortunately, this model cannot be applied to predict test data, because its output depends on the data labels (y). However, if you further modify my code, you could predict cluster probabilities conditional on X; in that case, the model would be useful for prediction. One possible way to do this is sketched below.
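For instance, one possible modification (a sketch; clusterer and predict_clustered are names I made up, not part of the class above) is to fit an auxiliary classifier that maps X to the cluster labels found during training and use its probabilities to weight the per-cluster regression predictions:
from sklearn.linear_model import LogisticRegression

# Map X to the clusters discovered during fitting.
clusterer = LogisticRegression().fit(x[:, np.newaxis], labels)

def predict_clustered(model, clusterer, X_new):
    """Weight each cluster's regression prediction by the estimated P(cluster | X)."""
    proba = clusterer.predict_proba(X_new)   # shape (n_samples, n_components)
    preds = model.predict(X_new)             # shape (n_samples, n_components)
    return (proba * preds).sum(axis=1)

y_hat = predict_clustered(model, clusterer, x[:, np.newaxis])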
I have a function of multiple arguments. I want to optimize it with respect to a single variable while holding the others constant. For that I want to use minimize_scalar from scipy.optimize. I read the documentation, but I am still confused about how to tell minimize_scalar that I want to minimize with respect to the variable w1. Below is a minimal working example.
import numpy as np
from scipy.optimize import minimize_scalar
def error(w0, w1, x, y_actual):
    y_pred = w0 + w1*x
    mse = ((y_actual - y_pred)**2).mean()
    return mse
w0=50
x = np.array([1,2,3])
y = np.array([52,54,56])
minimize_scalar(error,args=(w0,x,y),bounds=(-5,5))
You can use a lambda function
minimize_scalar(lambda w1: error(w0,w1,x,y),bounds=(-5,5))
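A quick usage note: minimize_scalar returns an OptimizeResult, and passing method='bounded' makes sure the bounds are actually honored:
res = minimize_scalar(lambda w1: error(w0, w1, x, y), bounds=(-5, 5), method='bounded')
print(res.x)  # the optimal w1 within the given bounds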
You can also use a partial function. Note that this requires the free variable w1 to be the first positional parameter of error (as in the snippet further below); otherwise the scalar passed in by minimize_scalar would collide with the w0 fixed by partial.
from functools import partial
error_partial = partial(error, w0=w0, x=x, y_actual=y)
minimize_scalar(error_partial, bounds=(-5, 5))
In case you are wondering about the performance ... it is the same as with lambdas.
import time
from functools import partial
import numpy as np
from scipy.optimize import minimize_scalar
def error(w1, w0, x, y_actual):
    y_pred = w0 + w1 * x
    mse = ((y_actual - y_pred) ** 2).mean()
    return mse
w0 = 50
x = np.arange(int(1e5))
y = np.arange(int(1e5)) + 52
error_partial = partial(error, w0=w0, x=x, y_actual=y)
p_time = []
for _ in range(100):
    p_time_ = time.time()
    p = minimize_scalar(error_partial, bounds=(-5, 5))
    p_time_ = time.time() - p_time_
    p_time.append(p_time_ / p.nfev)
l_time = []
for _ in range(100):
    l_time_ = time.time()
    l = minimize_scalar(lambda w1: error(w1, w0, x, y), bounds=(-5, 5))
    l_time_ = time.time() - l_time_
    l_time.append(l_time_ / l.nfev)
print(f'Same performance? {np.median(p_time) == np.median(l_time)}')
# Same performance? True
Note that if error is defined with w1 as its first parameter (as in the benchmark above), the lambda in the marked correct answer actually minimizes with respect to w0. In that case it should be:
minimize_scalar(lambda w1: error(w1,w0,x,y),bounds=(-5,5))