I am wondering whether there is any article where an SVM (Support Vector Machine) is implemented manually in R or Python.
I do not want to use a built-in function or package.
The example could be very simple in terms of feature space and linearly separable.
I just want to go through the whole process to enhance my understanding.
The answer to this question is rather broad since there are several possible algorithms for training SVMs. Also, packages like LibSVM (available for both Python and R) are open source, so you are free to check the code inside.
Hereinafter I will consider the Sequential Minimal Optimization (SMO) algorithm by John Platt, which is implemented in LibSVM. Manually implementing an algorithm that solves the SVM optimization problem is rather tedious, but if this is your first approach to SVMs I'd suggest the following (albeit simplified) version of the SMO algorithm:
http://cs229.stanford.edu/materials/smo.pdf
This lecture is from Prof. Andrew Ng (Stanford), and it shows a simplified version of the SMO algorithm. I don't know what your theoretical background in SVMs is, but let's just say that the major difference is that the pair of Lagrange multipliers (alpha_i and alpha_j) is selected randomly, whereas in the original SMO algorithm a much more elaborate heuristic is involved.
In other words, this algorithm is not guaranteed to converge to a global optimum (which is always reached in SVMs for the dataset at hand if trained with a proper algorithm), but it can give you a nice introduction to the optimization problem behind SVMs.
This paper, however, does not show any code in R or Python, but the pseudo-code is rather straightforward and easy to implement (~100 lines of code in either Matlab or Python). Also keep in mind that this training algorithm returns alpha (the vector of Lagrange multipliers) and b (the intercept). You might want some additional quantities, such as the actual Support Vectors, but starting from the vector alpha these are rather easy to calculate.
First, let's suppose you have a LearningSet with as many rows as there are patterns and as many columns as there are features, and let's also suppose you have LearningLabels, a vector with as many elements as there are patterns which (as its name suggests) contains the proper labels for the patterns in LearningSet.
Also recall that alpha has as many elements as there are patterns.
In order to find the Support Vector indices you can check whether element i in alpha is strictly greater than 0: if alpha[i] > 0 then the i-th pattern from LearningSet is a Support Vector. Similarly, the i-th element of LearningLabels is the related label.
Finally, you might want to evaluate the vector w of free parameters. Given that alpha is known, you can apply the following formula:
w = ((alpha.*LearningLabels)'*LearningSet)'
where alpha and LearningLabels are column vectors and LearningSet is the matrix described above. In this (MATLAB-style) formula, .* is the element-wise product and ' is the transpose operator.
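For reference, here is a minimal NumPy equivalent of this post-processing step (the arrays below are dummy stand-ins; in practice alpha comes from your SMO training routine):

import numpy as np

# dummy stand-ins for LearningSet, LearningLabels and alpha
rng = np.random.default_rng(0)
LearningSet    = rng.normal(size=(20, 2))                   # 20 patterns, 2 features
LearningLabels = np.where(rng.random(20) > 0.5, 1.0, -1.0)
alpha          = rng.random(20) * (rng.random(20) > 0.7)    # mostly zeros, as after training

sv_idx = np.nonzero(alpha > 1e-8)[0]                        # Support Vector indices (alpha_i > 0)
support_vectors, support_labels = LearningSet[sv_idx], LearningLabels[sv_idx]

# free-parameter vector w for the linear kernel: w = sum_i alpha_i * y_i * x_i
w = (alpha * LearningLabels) @ LearningSet
print(sv_idx, w)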
I would like to add a small supplement to the previous answer.
If you feel confident with linear algebra, there is an alternative derivation of the SMO algorithm and a very simple implementation based on those formulas.
Basically, it is about 30 lines of code in Python:
import numpy as np

class SVM:
    def __init__(self, kernel='linear', C=10000.0, max_iter=100000, degree=3, gamma=1):
        # kernel functions operate on whole matrices at once
        self.kernel = {'poly'  : lambda x, y: np.dot(x, y.T)**degree,
                       'rbf'   : lambda x, y: np.exp(-gamma*np.sum((y - x[:, np.newaxis])**2, axis=-1)),
                       'linear': lambda x, y: np.dot(x, y.T)}[kernel]
        self.C = C
        self.max_iter = max_iter

    def restrict_to_square(self, t, v0, u):
        # clip the step t so that the updated pair of multipliers stays in the box [0, C]^2
        t = (np.clip(v0 + t*u, 0, self.C) - v0)[1] / u[1]
        return (np.clip(v0 + t*u, 0, self.C) - v0)[0] / u[0]

    def fit(self, X, y):
        self.X = X.copy()
        self.y = y * 2 - 1                                       # map labels {0, 1} -> {-1, +1}
        self.lambdas = np.zeros_like(self.y, dtype=float)
        self.K = self.kernel(self.X, self.X) * self.y[:, np.newaxis] * self.y

        for _ in range(self.max_iter):
            for idxM in range(len(self.lambdas)):
                idxL = np.random.randint(0, len(self.lambdas))   # second multiplier chosen at random
                Q = self.K[[[idxM, idxM], [idxL, idxL]], [[idxM, idxL], [idxM, idxL]]]
                v0 = self.lambdas[[idxM, idxL]]
                k0 = 1 - np.sum(self.lambdas * self.K[[idxM, idxL]], axis=1)
                u = np.array([-self.y[idxL], self.y[idxM]])      # direction preserving sum(lambda*y) = 0
                t_max = np.dot(k0, u) / (np.dot(np.dot(Q, u), u) + 1E-15)
                self.lambdas[[idxM, idxL]] = v0 + u * self.restrict_to_square(t_max, v0, u)

        idx, = np.nonzero(self.lambdas > 1E-15)                  # support vectors: non-zero multipliers
        self.b = np.sum((1.0 - np.sum(self.K[idx] * self.lambdas, axis=1)) * self.y[idx]) / len(idx)

    def decision_function(self, X):
        return np.sum(self.kernel(X, self.X) * self.y * self.lambdas, axis=1) + self.b
In simple cases it performs not much worse than sklearn.svm.SVC; a comparison is sketched below.
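As a minimal usage sketch (the toy data and the scikit-learn comparison below are my own assumptions, not from the original answer), the class above can be exercised like this:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)   # labels are 0/1, as fit() expects
clf = SVM(kernel='linear', C=10.0, max_iter=60)              # small max_iter keeps the toy run fast
clf.fit(X, y)
pred = (clf.decision_function(X) > 0).astype(int)

ref = SVC(kernel='linear', C=10.0).fit(X, y)
print((pred == y).mean(), ref.score(X, y))                   # training accuracies should be comparable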
For a more elaborate explanation with formulas, you may want to refer to this ResearchGate preprint.
Code for image generation can be found on GitHub.
Let us suppose that my function is
f : C -> C
f(z) = x - i*y (that is, the conjugate of z = x + i*y)
Now the real part is
u(x, y) = x
and the imaginary part is
v(x, y) = -y
so, when we get the derivatives, we find
d_u_x(x,y) = 1 # derivative of u wrt x
d_u_y(x,y) = 0
d_v_x(x, y) = 0
d_v_y(x, y) = -1
so, here,
d_u_x != d_v_y
thus, it does not satisfy the Cauchy-Riemann equations.
but then comes the Wirtinger calculus, which says I could write my function in terms of z and z.conj():
u(x, y) = x = ((x + iy) + (x - iy))/2
           = (z + z.conj())/2
v(x, y) = -y = -((x + iy) - (x - iy))/(2i)
            = -(z - z.conj())/(2i)
but what comes after this? How do I find the gradient?
Also, in PyTorch, what is the correct way to specify such a function?
If I do
import torch
a = torch.randn(1, dtype=torch.cfloat, requires_grad=True)
f = a.conj()
f.backward()
print(a.grad)
is this a correct way?
You may find the following page of interest:
When you use PyTorch to differentiate any function f(z) with complex domain and/or codomain, the gradients are computed under the assumption that the function is a part of a larger real-valued loss function g(input)=L. The gradient computed is ∂L/∂z* (note the conjugation of z), the negative of which is precisely the direction of steepest descent used in Gradient Descent algorithm. Thus, all the existing optimizers work out of the box with complex parameters.
This convention matches TensorFlow’s convention for complex differentiation, but is different from JAX (which computes ∂L/∂z).
If you have a real-to-real function which internally uses complex operations, the convention here doesn’t matter: you will always get the same result that you would have gotten if it had been implemented with only real operations.
...
For optimization problems, only real valued objective functions are used in the research community since complex numbers are not part of any ordered field and so having complex valued loss does not make much sense.
It also turns out that no interesting real-valued objective fulfills the Cauchy-Riemann equations. So the theory of holomorphic functions cannot be used for optimization, and most people therefore use the Wirtinger calculus.
https://pytorch.org/docs/stable/notes/autograd.html
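As a minimal sketch of how this convention is typically exercised (my own example, not from the docs): build a real-valued scalar loss out of complex operations, then backward() works without an explicit gradient argument:

import torch

a = torch.randn(1, dtype=torch.cfloat, requires_grad=True)
loss = (a.conj() * a).real.sum()   # |a|^2, a real scalar loss built from complex ops
loss.backward()
print(a.grad)                      # per the docs quoted above, this follows the dL/dz* convention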
After a full week of print statements, dimensional analysis, refactoring, and talking through the code out loud, I can say I'm completely stuck.
The gradients my cost function produces are too far from those produced by finite differences.
I have confirmed that my cost function produces correct costs both with and without regularization. Here's the cost function:
import numpy as np
import pandas as pd

def nnCost(nn_params, X, y, lambda_, input_layer_size, hidden_layer_size, num_labels):
    # reshape parameter/weight vectors to suit network size
    Theta1 = np.reshape(nn_params[:hidden_layer_size * (input_layer_size + 1)], (hidden_layer_size, (input_layer_size + 1)))
    Theta2 = np.reshape(nn_params[(hidden_layer_size * (input_layer_size + 1)):], (num_labels, (hidden_layer_size + 1)))
    if lambda_ is None:
        lambda_ = 0
    # grab number of observations
    m = X.shape[0]
    # init variables we must return
    cost = 0
    Theta1_grad = np.zeros(Theta1.shape)
    Theta2_grad = np.zeros(Theta2.shape)
    # one-hot encode the vector y
    y_mtx = pd.get_dummies(y.ravel()).to_numpy()
    ones = np.ones((m, 1))
    X = np.hstack((ones, X))
    # layer 1
    a1 = X
    z2 = Theta1 @ a1.T
    # layer 2
    ones_l2 = np.ones((y.shape[0], 1))
    a2 = np.hstack((ones_l2, sigmoid(z2.T)))
    z3 = Theta2 @ a2.T
    # layer 3
    a3 = sigmoid(z3)
    reg_term = (lambda_/(2*m)) * (np.sum(np.sum(np.multiply(Theta1, Theta1))) + np.sum(np.sum(np.multiply(Theta2, Theta2))) - np.subtract((Theta1[:,0].T @ Theta1[:,0]), (Theta2[:,0].T @ Theta2[:,0])))
    cost = (1/m) * np.sum((-np.log(a3).T * (y_mtx) - np.log(1-a3).T * (1-y_mtx))) + reg_term
    # BACKPROPAGATION
    # δ3 equals the difference between a3 and the y_matrix
    d3 = a3 - y_mtx.T
    # δ2 equals the product of δ3 and Θ2 (ignoring the Θ2 bias units) multiplied element-wise by the g′() of z2 (computed back in Step 2).
    d2 = Theta2[:,1:].T @ d3 * sigmoidGradient(z2)
    # Δ1 equals the product of δ2 and a1.
    Delta1 = d2 @ a1
    Delta1 /= m
    # Δ2 equals the product of δ3 and a2.
    Delta2 = d3 @ a2
    Delta2 /= m
    reg_term1 = (lambda_/m) * np.append(np.zeros((Theta1.shape[0], 1)), Theta1[:,1:], axis=1)
    reg_term2 = (lambda_/m) * np.append(np.zeros((Theta2.shape[0], 1)), Theta2[:,1:], axis=1)
    Theta1_grad = Delta1 + reg_term1
    Theta2_grad = Delta2 + reg_term2
    grad = np.append(Theta1_grad.ravel(), Theta2_grad.ravel())
    return cost, grad
Here's the code to check the gradients. I have been over every line and there is nothing whatsoever that I can think of to change here. It seems to be in working order.
def checkNNGradients(lambda_):
    """
    Creates a small neural network to check the backpropagation gradients.
    Credit: Based on the MATLAB code provided by Dr. Andrew Ng, Stanford Univ.
    Input: Regularization parameter, lambda, as int or float.
    Output: Analytical gradients produced by backprop code and the numerical gradients (computed
            using computeNumericalGradient). These two gradient computations should result in
            very similar values.
    """
    input_layer_size = 3
    hidden_layer_size = 5
    num_labels = 3
    m = 5
    # generate 'random' test data
    Theta1 = debugInitializeWeights(hidden_layer_size, input_layer_size)
    Theta2 = debugInitializeWeights(num_labels, hidden_layer_size)
    # reusing debugInitializeWeights to generate X
    X = debugInitializeWeights(m, input_layer_size - 1)
    y = np.ones(m) + np.remainder(np.arange(m), num_labels)
    # unroll parameters
    nn_params = np.append(Theta1.ravel(), Theta2.ravel())
    costFunc = lambda p: nnCost(p, X, y, lambda_, input_layer_size, hidden_layer_size, num_labels)
    cost, grad = costFunc(nn_params)
    numgrad = computeNumericalGradient(costFunc, nn_params)
    # examine the two gradient computations; the two columns should be very similar.
    print('The columns below should be very similar.\n')
    # Credit: http://stackoverflow.com/a/27663954/583834
    print('{:<25}{}'.format('Numerical Gradient', 'Analytical Gradient'))
    for numerical, analytical in zip(numgrad, grad):
        print('{:<25}{}'.format(numerical, analytical))
    # If you have a correct implementation, and assuming you used EPSILON = 0.0001
    # in computeNumericalGradient.m, then diff below should be less than 1e-9
    diff = np.linalg.norm(numgrad - grad) / np.linalg.norm(numgrad + grad)
    print(diff)
    print("\n")
    print('If your backpropagation implementation is correct, then \n'
          'the relative difference will be small (less than 1e-9). \n'
          '\nRelative Difference: {:.10f}'.format(diff))
The check function generates its own data using a debugInitializeWeights function (so there's the reproducible example; just run that and it will call the other functions), and then calls the function that calculates the gradient using finite differences. Both are below.
def debugInitializeWeights(fan_out, fan_in):
    """
    Initializes the weights of a layer with fan_in
    incoming connections and fan_out outgoing connections using a fixed
    strategy.
    Input: fan_out, number of outgoing connections for a layer as int; fan_in, number
           of incoming connections for the same layer as int.
    Output: Weight matrix, W, of size (fan_out, 1 + fan_in); the first column of W handles the "bias" terms
    """
    W = np.zeros((fan_out, 1 + fan_in))
    # Initialize W using "sin"; this ensures that the values in W are of similar scale,
    # which is useful for debugging
    W = np.sin(range(1, np.size(W) + 1)) / 10
    return W.reshape(fan_out, fan_in + 1)
def computeNumericalGradient(J, nn_params):
    """
    Computes the gradient using "finite differences"
    and provides a numerical estimate of the gradient (i.e.,
    gradient of the function J around theta).
    Credit: Based on the MATLAB code provided by Dr. Andrew Ng, Stanford Univ.
    Inputs: Cost function, J, wrapping the nnCost function; parameter vector, theta.
    Output: Gradient vector using finite differences. Per Dr. Ng,
            'Sets numgrad(i) to (a numerical approximation of) the partial derivative of
            J with respect to the i-th input argument, evaluated at theta. (i.e., numgrad(i) should
            be the (approximately) the partial derivative of J with respect
            to theta(i).)'
    """
    numgrad = np.zeros(nn_params.shape)
    perturb = np.zeros(nn_params.shape)
    e = .0001
    for i in range(np.size(nn_params)):
        # Set perturbation (i.e., noise) vector
        perturb[i] = e
        # run cost fxn w/ noise added to and subtracted from parameters theta in nn_params
        cost1, grad1 = J(nn_params - perturb)
        cost2, grad2 = J(nn_params + perturb)
        # record the difference in cost function outputs; this is the numerical gradient
        numgrad[i] = (cost2 - cost1) / (2 * e)
        perturb[i] = 0
    return numgrad
The code is not for class. That MOOC was in MATLAB and it's over. This is for me. Other solutions exist on the web; looking at them has proved fruitless. Everyone has a different (inscrutable) approach. So, I'm in serious need of assistance or a miracle.
Edit/Update: Fortran ordering when raveling vectors influences the outcome, but I have not been able to get the gradients to agree by changing that option.
One thought: I think your perturbation is a little large, being 1e-4. For double precision floating point numbers, it should be more like 1e-8, i.e., the root of the machine precision (or are you working with single precision?!).
That being said, finite differences can be very bad approximations to true derivatives. Specifically, floating point computations in numpy are not deterministic, as you seem to have found out. The noise in evaluations can cancel out many significant digits under some circumstances. What values are you seeing and what are you expecting?
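To get a feel for this trade-off (truncation error vs. rounding error), here is a tiny illustrative sketch on a known derivative (my own example):

import numpy as np

x, exact = 1.0, np.cos(1.0)        # d/dx sin(x) at x = 1
for e in (1e-2, 1e-4, 1e-6, 1e-8, 1e-10):
    approx = (np.sin(x + e) - np.sin(x - e)) / (2 * e)
    print(e, abs(approx - exact))  # the error first shrinks, then grows again as e gets tiny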
All of the following figured into the solution to my problem. For those attempting to translate MATLAB code to Python, whether from Andrew Ng's Coursera Machine Learning course or not, these are things everyone should know.
MATLAB stores arrays in FORTRAN (column-major) order; NumPy defaults to C (row-major) order. This affects how vectors are populated and, thus, your results. You should use FORTRAN order if you want your answers to match what you did in MATLAB (see the NumPy docs, and the sketch below).
Getting your vectors in FORTRAN order can be as easy as passing order='F' as an argument to .reshape(), .ravel(), or .flatten(). You can, however, achieve the same thing with .ravel() by transposing the array and then applying .ravel(), like so: X.T.ravel().
Speaking of .ravel(): the .ravel() and .flatten() functions do not do the same thing and may have different use cases. For example, .flatten() is preferred by SciPy optimization methods. So, if your equivalent of fminunc isn't working, it's likely because you forgot to .flatten() your response vector y. See this StackOverflow Q&A and the docs on .ravel(), which link to .flatten().
If you're translating your code from a MATLAB live script into a Jupyter notebook or Google Colab, you must police your namespace. On one occasion, I found that the variable I thought was being passed was not actually the variable being passed. Why? Jupyter and Colab notebooks have a lot of global variables that one would never write ordinarily.
There is a better way to evaluate the differences between numerical and analytical gradients: the relative error comparison np.abs(numerical - analytical) / (np.abs(numerical) + np.abs(analytical)). Read about it in the CS231n notes on gradient checking. Also, consider the accepted answer above.
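A minimal sketch of the ordering point and the relative-error check above (hypothetical 2x3 weight matrix; my own example):

import numpy as np

Theta = np.arange(6).reshape(2, 3)
print(Theta.ravel())              # C order (NumPy default):               [0 1 2 3 4 5]
print(Theta.ravel(order='F'))     # FORTRAN order, like MATLAB's Theta(:): [0 3 1 4 2 5]
print(Theta.T.ravel())            # transposing first gives the same result

def relative_error(numerical, analytical):
    # element-wise relative error between the two gradient estimates
    return np.abs(numerical - analytical) / (np.abs(numerical) + np.abs(analytical))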
I am currently working on a machine learning project where - given a data matrix Z and a vector rho - I have to compute the value and slope of the logistic loss function at rho. The computation involves basic matrix-vector multiplication and log/exp operations, with a trick to avoid numerical overflow (described in this previous post).
I am currently doing this in Python using NumPy as shown below (as a reference, this code runs in 0.2s). Although this works well, I would like to speed it up since I call the function multiple times in my code (and it represents over 90% of the computation involved in my project).
I am looking for any way to improve the runtime of this code without parallelization (i.e. only 1 CPU). I am happy using any publicly available package in Python, or calling C or C++ (since I have heard that this improves runtimes by an order of magnitude). Preprocessing the data matrix Z would also be OK. Some things that could be exploited for better computation are that the vector rho is usually sparse (around 50% of its entries are 0) and that there are usually far more rows than columns (in most cases n_cols <= 100).
import time
import numpy as np

np.__config__.show()  # make sure BLAS/LAPACK is being used
np.random.seed(seed=0)

# initialize data matrix X and label vector Y
n_rows, n_cols = int(1e6), 100
X = np.random.random(size=(n_rows, n_cols))
Y = np.random.randint(low=0, high=2, size=(n_rows, 1))
Y[Y == 0] = -1
Z = X * Y  # all operations are carried out on Z

def compute_logistic_loss_value_and_slope(rho, Z):
    # compute the value and slope of the logistic loss function in a way that is numerically stable
    # loss_value: (1 x 1) scalar = 1/n_rows * sum(log(1 .+ exp(-Z*rho)))
    # loss_slope: (n_cols x 1) vector = -1/n_rows * Z' * (exp(-Z*rho) ./ (1 + exp(-Z*rho)))
    # see also: https://stackoverflow.com/questions/20085768/
    scores = Z.dot(rho)
    pos_idx = scores > 0
    exp_scores_pos = np.exp(-scores[pos_idx])
    exp_scores_neg = np.exp(scores[~pos_idx])

    # compute loss value
    loss_value = np.empty_like(scores)
    loss_value[pos_idx] = np.log(1.0 + exp_scores_pos)
    loss_value[~pos_idx] = -scores[~pos_idx] + np.log(1.0 + exp_scores_neg)
    loss_value = loss_value.mean()

    # compute loss slope
    phi_slope = np.empty_like(scores)
    phi_slope[pos_idx] = 1.0 / (1.0 + exp_scores_pos)
    phi_slope[~pos_idx] = exp_scores_neg / (1.0 + exp_scores_neg)
    loss_slope = Z.T.dot(phi_slope - 1.0) / Z.shape[0]

    return loss_value, loss_slope

# initialize a vector of integers where more than half of the entries = 0
rho_test = np.random.randint(low=-10, high=10, size=(n_cols, 1))
set_to_zero = np.random.choice(range(0, n_cols), size=int(np.floor(n_cols / 2)), replace=False)
rho_test[set_to_zero] = 0.0

start_time = time.time()
loss_value, loss_slope = compute_logistic_loss_value_and_slope(rho_test, Z)
print("total runtime = %1.5f seconds" % (time.time() - start_time))
Libraries of the BLAS family are already highly tuned for best performance. So no effort to link to some C/C++ code is likely to give you any benefits. You could however try various BLAS implementations, since there are quite a few of them around, including some specially tuned to some CPUs.
The other thing that comes to my mind is to use a library like theano (or Google's tensorflow) that is able to represent the entire computational graph (all of the operations in your function above) and apply global optimizations to it. It can then generate CPU code from that graph via C++ (and by flipping a simple switch also GPU code). It can also automatically compute symbolic derivatives for you. I've used theano for machine learning problems and it's a really great library for that, although not the easiest one to learn.
(I'm posting this as an answer because it's too long for a comment)
Edit:
I actually had a go at this in Theano, but the result is about 2x slower on the CPU (see the comment in the code below for why). I'll post it here anyway; maybe it's a starting point for someone else to do something better. (This is only partial code; complete it with the code from the original post.)
import theano

def make_graph(rho, Z):
    scores = theano.tensor.dot(Z, rho)
    # this is very inefficient... it calculates everything twice and
    # then picks one of them depending on scores being positive or not.
    # not sure how to express this in theano in a more efficient way
    pos = theano.tensor.log(1 + theano.tensor.exp(-scores))
    neg = -scores + theano.tensor.log(1 + theano.tensor.exp(scores))
    loss_value = theano.tensor.switch(scores > 0, pos, neg)
    loss_value = loss_value.mean()

    # however computing the derivative is a real joy now:
    loss_slope = theano.tensor.grad(loss_value, rho)

    return loss_value, loss_slope

sym_rho = theano.tensor.col('rho')
sym_Z = theano.tensor.matrix('Z')
sym_loss_value, sym_loss_slope = make_graph(sym_rho, sym_Z)

compute_logistic_loss_value_and_slope = theano.function(
    inputs=[sym_rho, sym_Z],
    outputs=[sym_loss_value, sym_loss_slope]
)

# use function compute_logistic_loss_value_and_slope() as in original code
NumPy is quite optimized. The best you can do is to try other libraries with data of the same size, initialized to random values (not to 0), and do your own benchmark.
If you want to, you can of course try BLAS directly. You should also give Eigen a try; I personally found it faster in one of my applications.
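As a minimal benchmarking sketch along those lines (my own example; the shapes match the original post), you could time just the dominant matrix-vector product on random data:

import time
import numpy as np

Z = np.random.random((10**6, 100))
rho = np.random.random((100, 1))
t0 = time.time()
for _ in range(10):
    Z.dot(rho)                    # the dominant BLAS call in the loss/slope computation
print("avg Z.dot(rho):", (time.time() - t0) / 10, "s")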
I'm trying to use the scipy Nelder-Mead simplex search function to find a minimum to a non-linear function. It appears my simplex gets stuck because it starts off with an initial simplex that is too small. Unfortunately, I don't see anywhere in scipy where you can change some of the simplex parameters (e.g. initial simplex size). Is there a way? Am I missing something? Or are there other implementations of the NM simplex?
Thanks
Two suggestions for Nelder-Mead:
1) snap all x to a grid, say .01, inside the function:
x = np.round( x / grid ) * grid
f = ...
This acts as a simple noise filter in high dimensions (in 2d or 3d, don't bother); see the sketch below.
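A self-contained sketch of suggestion 1 (the grid size, the Rosenbrock stand-in objective, and the starting point are my own assumptions):

import numpy as np
from scipy.optimize import fmin

grid = 0.01                                    # assumed grid size; tune for your problem

def rosen(x):                                  # stand-in objective; use your own noisy function
    return sum(100.0*(x[1:] - x[:-1]**2)**2 + (1 - x[:-1])**2)

def snapped(x):
    x = np.round(np.asarray(x) / grid) * grid  # snap to the grid before evaluating
    return rosen(x)

xopt = fmin(snapped, x0=np.zeros(4))
print(xopt)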
2) start off with the best d+1 of 2d+1 nearby points,
instead of the usual d+1:
import numpy as np

def neard1(func, x, h, verbose=1):
    """ eval func at the 2d+1 points x, x +- h
        sort
        -> f[ d+1 best values ], X[ d+1 ]
        to start or restart Nelder-Mead
    """
    dim = len(x)
    I = np.eye(dim)
    np.fill_diagonal(I, h)  # h may be a scalar or a vector
    X = x + np.vstack((np.zeros(dim), I, -I))
    fnear = np.array([func(x) for x in X])  # 2d+1 evaluations
    f0 = fnear[0]
    up = np.argsort(fnear)  # for a vector-valued func: |fnear|
    if verbose:
        print("neard1: f %g +- %s around x %s" % (
            f0, fnear[up] - f0, x))
    bestd1 = up[:dim + 1]
    return fnear[bestd1], X[bestd1]
It's also not a bad idea to look at the neard1() values after Nelder-Mead,
to get an idea of what func() looks like there.
If any neighbors are better than the N-M "best", restart N-M from that new simplex.
(One can alternate neard1, N-M, neard1, N-M: easy but very problem-dependent.)
How many variables do you have, and how noisy is your function?
Hope this helps
From the reference at http://docs.scipy.org/doc/:
Method Nelder-Mead uses the Simplex algorithm [R123], [R124]. This algorithm has been successful in many applications but other algorithms using the first and/or second derivatives information might be preferred for their better performances and robustness in general.
It may therefore be advisable to use a completely different algorithm. Note that:
Method BFGS uses the quasi-Newton method of Broyden, Fletcher, Goldfarb, and Shanno (BFGS) [R127] pp. 136. It uses the first derivatives only. BFGS has proven good performance even for non-smooth optimizations. This method also returns an approximation of the Hessian inverse, stored as hess_inv in the OptimizeResult object.
BFGS sounds more robust and faster overall.
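As a minimal sketch of that comparison on a toy objective (my own example, not from the quoted docs):

from scipy.optimize import minimize

f = lambda x: (x[0] - 1.0)**2 + 10.0*(x[1] + 2.0)**2

res_nm   = minimize(f, x0=[0.0, 0.0], method='Nelder-Mead')
res_bfgs = minimize(f, x0=[0.0, 0.0], method='BFGS')
print(res_nm.x, res_nm.nfev)       # simplex: typically more function evaluations
print(res_bfgs.x, res_bfgs.nfev)   # BFGS: uses (numerical) gradients, usually fewer evaluations
print(res_bfgs.hess_inv)           # approximate inverse Hessian, as mentioned in the docs quote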
I have a time series that I want to fit to a function using scipy.optimize.leastsq.
fitfunc = lambda a, x: a[0] + a[1]*exp(-x/a[4]) + a[2]*exp(-x/a[5]) + a[3]*exp(-x/a[6])
errfunc = lambda a, x, y: fitfunc(a, x) - y
Next I would pass errfunc to leastsq to minimize it. The fit function I use is a sum of exponentials decaying with different timescales a(4:6) and different weights (a(0:4)). (As a side question: can I use leastsq with more than one parameter array? I didn't succeed in doing so....)
The question: how can I put additional side conditions on the parameters entering the fit function? For example, I want sum(a(0:4)) = 1.0.
Just use
import numpy as np

def fitfunc(p, x):
    a = np.zeros(7)
    a[1:7] = p[:6]
    a[0] = 1 - a[1:4].sum()
    return a[0] + a[1]*np.exp(-x/a[4]) + a[2]*np.exp(-x/a[5]) + a[3]*np.exp(-x/a[6])

def errfunc(p, x, y1, y2):
    return np.concatenate((
        fitfunc(p[:6], x) - y1,
        fitfunc(p[6:12], x) - y2
    ))
Generally, lambda functions are considered bad style (and they don't add anything in your code). To fit several functions in one least-squares call you may just append them as I indicated, using np.concatenate. It doesn't make much sense if none of the parameters are shared between the curves, though; it will only slow down convergence of the algorithm. The side condition you asked for is implemented by simply computing one weight from the constraint you gave (see 1 - a[1:4].sum()); a usage sketch follows below.
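A hypothetical usage sketch (the synthetic data, starting values, and noise level below are my own assumptions, just to show how errfunc plugs into leastsq):

import numpy as np
from scipy.optimize import leastsq

x = np.linspace(0.1, 10, 200)
true_p = np.array([0.5, 0.3, 0.1, 1.0, 3.0, 9.0])           # a[1..6] for one curve
y1 = fitfunc(true_p, x) + 0.01*np.random.randn(x.size)
y2 = fitfunc(true_p, x) + 0.01*np.random.randn(x.size)      # second series, reused for simplicity
p0 = np.concatenate((true_p*1.2, true_p*0.8))                # 12 starting values for the two curves
p_opt, ier = leastsq(errfunc, p0, args=(x, y1, y2))
print(p_opt[:6], p_opt[6:])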
If you can't solve the equations for your constraints, and you can live with the constraint being satisfied only to some tolerance, another possibility is to add a term to the chi-square with a large weight, which guarantees that the constraint is almost satisfied.
E.g., if you need sum(sin(p[i])) == 1, you can do the following:
import numpy as np

constraint_func = lambda a: np.sin(a).sum() - 1

def fitfunc(a, x):
    return a[0] + a[1]*np.exp(-x/a[4]) + a[2]*np.exp(-x/a[5]) + a[3]*np.exp(-x/a[6])

def errfunc(a, x, y):
    tolerance = 1e-10
    # the constraint residual is divided by the tolerance, i.e. heavily weighted,
    # so the fit is pushed towards constraint_func(a) ~ 0
    return np.concatenate((fitfunc(a, x) - y, [constraint_func(a) / tolerance]))
Obviously the convergence will be slower, but it is still guaranteed.