Why won't Perceptron Learning Algorithm converge?

I have implemented the Perceptron Learning Algorithm in Python as below. Even with 500,000 iterations, it still won't converge.
I have a training data matrix X with target vector Y, and a weight vector w to be optimized.
My update rule is:
while exist_mistakes:
    # dot product to check for mistakes
    output = [np.sign(np.dot(X[i], w)) == Y[i] for i in range(0, len(X))]
    # find the index of a mistake (chosen randomly to avoid repeating the same index)
    n = random.randint(0, len(X)-1)
    while output[n]:  # if output is true here, choose again
        n = random.randint(0, len(X)-1)
    # once we have found a mistake, update
    w = w + Y[n]*X[n]
Is this wrong? Or why is it not converging even after 500,000 iterations?

Perceptrons by Minsky and Papert (in)famously demonstrated in 1969 that the perceptron learning algorithm is not guaranteed to converge for datasets that are not linearly separable.
If you're sure that your dataset is linearly separable, you might try adding a bias to each of your data vectors, as described in the question "Perceptron learning algorithm not converging to 0". Adding a bias can help model decision boundaries that do not pass through the origin.
Alternatively, if you'd like to use a variant of the perceptron learning algorithm that is guaranteed to converge to a margin of specified width, even for datasets that are not linearly separable, have a look at the Averaged Perceptron. The averaged perceptron is an approximation to the voted perceptron, which was introduced (as far as I know) in a nice paper by Freund and Schapire, "Large Margin Classification Using the Perceptron Algorithm".
Using an averaged perceptron, you make a copy of the parameter vector after each presentation of a training example during training. The final classifier uses the mean of all parameter vectors.
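As a rough illustration of the averaging idea described above, here is a minimal pure-Python sketch. The function names, the toy data, and the choice to append a constant 1 as the bias input are assumptions made for this example; they do not come from the Freund and Schapire paper.

```python
def train_averaged_perceptron(X, y, epochs=20):
    """Train a perceptron and return the mean of all intermediate weight vectors.

    X is a list of feature lists, y a list of labels in {-1, +1}.
    The last entry of the returned vector is the bias weight.
    """
    n_features = len(X[0])
    w = [0.0] * (n_features + 1)
    w_sum = [0.0] * (n_features + 1)
    n_seen = 0
    for _ in range(epochs):
        for x, label in zip(X, y):
            xa = list(x) + [1.0]  # constant input for the bias term
            score = sum(wi * xi for wi, xi in zip(w, xa))
            if label * score <= 0:  # mistake: apply the usual perceptron update
                w = [wi + label * xi for wi, xi in zip(w, xa)]
            # accumulate the current weights after every presented example
            w_sum = [si + wi for si, wi in zip(w_sum, w)]
            n_seen += 1
    return [si / n_seen for si in w_sum]


def predict(w_avg, x):
    """Classify x with the averaged weights (bias weight is the last entry)."""
    xa = list(x) + [1.0]
    return 1 if sum(wi * xi for wi, xi in zip(w_avg, xa)) > 0 else -1


# Toy 1-D data (hypothetical): the classes split around x = 3, so a bias is needed.
X = [[1.0], [2.0], [4.0], [5.0]]
y = [-1, -1, 1, 1]
w_avg = train_averaged_perceptron(X, y)
predictions = [predict(w_avg, x) for x in X]
```

The running sum avoids storing a copy of the weights per example; dividing it by the number of presentations gives the same mean.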

The basic issue is that randomly chosen points are not necessarily linearly separable.
However, there is a worse problem in the algorithm:
Even a good reference like Vapnik's "Statistical Learning Theory" does not point out the biggest problem in the OP's algorithm. The issue is NOT the learning rate parameter, either. It is trivial to prove that the learning rate parameter has no real effect on whether or not the algorithm converges, because the learning rate is simply a positive scalar: starting from zero weights, it only rescales w, which does not change the sign of any prediction.
Imagine a set of four points to classify: the corners of a square centered at the origin, each at distance sqrt(2) from it. The "in class" point is (-1,1). The "out of class" points are (-1,-1), (1,1), and (1,-1). Regardless of any learning rate, the OP's original algorithm will never converge.
The reason why the original poster's algorithm fails to converge is because of the missing bias term (effectively the coefficient of the 0th dimensional term), which MUST augment the other dimensional terms. Without the bias term, the perceptron is not completely defined. This is trivial to prove by hand modeling a perceptron in 1D or 2D space.
The bias term is often hand-waved in literature as a way to "shift" the hyperplane, but it seems that's mostly due to the fact that many people tend to teach/learn perceptrons in 2D space for teaching purposes. The term "Shifting" doesn't provide an adequate explanation of why the bias term is needed in high dimensional spaces (what does it mean to "shift" a 100-dimensional hyperplane?)
One may notice that literature proving the mean convergence time of the perceptron excludes the bias term. This is due to the ability to simplify the perceptron equation if you assume that the perceptron will converge (see Vapnik, 1998, Wiley, p.377). That is a big (but necessary) assumption for the proof, but one cannot adequately implement a perceptron by assuming an incomplete implementation.
Alex B. J. Novikoff's 1962/1963 proofs of the perceptron convergence include this zero dimensional term.
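The four-point example above can be checked with a small sketch. The helper below is an assumption made for illustration: it cycles through the points in a fixed order rather than picking mistakes at random as the OP does, and it models the bias by appending a constant 1 to every point.

```python
def train_perceptron(points, labels, use_bias, max_epochs=1000):
    """Cycle through the data, updating on mistakes; return (weights, converged)."""
    dim = len(points[0]) + (1 if use_bias else 0)
    w = [0] * dim
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in zip(points, labels):
            # with a bias, every point gets an extra constant coordinate
            xa = list(x) + [1] if use_bias else list(x)
            score = sum(wi * xi for wi, xi in zip(w, xa))
            if label * score <= 0:  # misclassified (or on the boundary)
                w = [wi + label * xi for wi, xi in zip(w, xa)]
                mistakes += 1
        if mistakes == 0:  # a full pass without mistakes means convergence
            return w, True
    return w, False


points = [(-1, 1), (-1, -1), (1, 1), (1, -1)]
labels = [1, -1, -1, -1]  # only (-1, 1) is "in class"

w_no_bias, converged_no_bias = train_perceptron(points, labels, use_bias=False)
w_bias, converged_bias = train_perceptron(points, labels, use_bias=True)
print("without bias:", converged_no_bias)
print("with bias:", converged_bias, w_bias)
```

Without the bias, (1,1) and (-1,-1) carry the same label but are mirror images through the origin, so no weight vector can put both on the negative side; with the bias the data becomes separable and the loop stops after a few updates.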

Related

Why does scaling down the parameters many times during training help the learning speed be the same for all weights in Progressive GAN?

Equalized learning rate is one of the special techniques in Progressive GAN, a paper by the NVIDIA team. About this method, the authors state:
Our approach ensures that the dynamic range, and thus the learning speed, is the same for all weights.
In detail, they initialize all learnable parameters from a normal distribution N(0, 1). During training, on each forward pass, they scale the result with the per-layer normalization constant from He's initializer.
I reproduced the code from the pytorch GAN zoo GitHub repo:
import math
from numpy import prod

def forward(self, x, equalized):
    # generate the He constant depending on the size of tensor W
    size = self.module.weight.size()
    fan_in = prod(size[1:])
    weight = math.sqrt(2.0 / fan_in)
    '''
    A module example:
    import torch.nn as nn
    module = nn.Conv2d(nChannelsPrevious, nChannels, kernelSize, padding=padding, bias=bias)
    '''
    x = self.module(x)
    if equalized:
        # scale the layer output by the He constant computed above
        x *= weight
    return x
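At first, I thought the He constant would be c = sqrt(2 / fan_in), as in He's paper.
Normally fan_in > 2, so c < 1, and dividing the weights by c (w_hat = w / c, the formula in ProGAN's paper) scales them up, which increases the gradient in backpropagation and prevents vanishing gradients.
However, the code shows that the layer output is multiplied by c instead.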
In summary, I can't understand why scaling down the parameters many times during training helps the learning speed be more stable.
I asked this question in some communities, e.g., Artificial Intelligence and Mathematics, and still haven't had an answer.
Please help me explain it, thank you!

There is already an explanation in the paper for the reason for scaling down the parameters in every single pass:
The benefit of doing this dynamically instead of during initialization is somewhat
subtle and relates to the scale-invariance in commonly used adaptive stochastic gradient descent
methods such as RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015). These
methods normalize a gradient update by its estimated standard deviation, thus making the update
independent of the scale of the parameter. As a result, if some parameters have a larger dynamic
range than others, they will take longer to adjust. This is a scenario modern initializers cause, and
thus it is possible that a learning rate is both too large and too small at the same time.
I believe multiplying by He's constant in every pass ensures that the range of the parameters will not be too wide at any point in backpropagation, and therefore they will not take so long to adjust. So if, for example, the discriminator at some point in the learning process adjusts faster than the generator, it will not take the generator long to adjust itself, and consequently their learning speeds will equalize.
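The quoted argument can be made concrete with a back-of-the-envelope sketch. It rests on a simplifying assumption that is not stated in the paper in this form: an Adam-like optimizer moves every stored parameter by roughly the learning rate per step, whatever the gradient scale. The layer names and fan-in values are hypothetical.

```python
import math


def he_constant(fan_in):
    """Per-layer constant from He's initializer: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)


lr = 0.001  # assumed size of one optimizer step per parameter
layers = {"small_layer": 16, "large_layer": 4096}  # hypothetical fan-in values

relative_step_he_init = {}
relative_step_equalized = {}
for name, fan_in in layers.items():
    c = he_constant(fan_in)
    # He initialization: stored weights have std c, so a step of size lr is
    # large relative to the weights when c is small (large fan_in).
    relative_step_he_init[name] = lr / c
    # Equalized learning rate: stored weights have std 1 and are scaled by c
    # at runtime, so the same step is equally large in every layer.
    relative_step_equalized[name] = lr / 1.0

print(relative_step_he_init)
print(relative_step_equalized)
```

Under this assumption, the relative step differs by a factor of 16 between the two layers with He initialization, and is identical with runtime scaling, which is the sense in which one learning rate can be "both too large and too small at the same time".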

ConvergenceWarning: Liblinear failed to converge, increase the number of iterations

I am running Adrian's linear binary pattern code. The program runs but gives the following warning:
C:\Python27\lib\site-packages\sklearn\svm\base.py:922: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
"the number of iterations.", ConvergenceWarning
I am running Python 2.7 with OpenCV 3.7. What should I do?

When an optimization algorithm does not converge, it is usually because the problem is not well-conditioned, perhaps due to poor scaling of the decision variables. There are a few things you can try.
1) Normalize your training data so that the problem hopefully becomes better conditioned, which in turn can speed up convergence. One possibility is to scale your data to zero mean and unit standard deviation using Scikit-Learn's StandardScaler, for example. Note that you have to apply the StandardScaler fitted on the training data to the test data. Also, if you have discrete features, make sure they are transformed properly so that scaling them makes sense.
2) Related to 1), make sure the other arguments, such as the regularization weight C, are set appropriately. C has to be > 0. Typically one would try various values of C on a logarithmic scale (1e-5, 1e-4, 1e-3, ..., 1, 10, 100, ...) before fine-tuning it at finer granularity within a particular interval. These days, it probably makes more sense to tune parameters using, e.g., Bayesian optimization with a package such as Scikit-Optimize.
3) Set max_iter to a larger value. The default is 1000. This should be your last resort. If the optimization process does not converge within the first 1000 iterations, having it converge by setting a larger max_iter typically masks other problems such as those described in 1) and 2). It might even indicate that you have some inappropriate features or strong correlations in the features. Debug those first before taking this easy way out.
4) Set dual = True if number of features > number of examples and vice versa. This solves the SVM optimization problem using the dual formulation. Thanks @Nino van Hooff for pointing this out, and @JamesKo for spotting my mistake.
5) Use a different solver, e.g., the L-BFGS solver if you are using Logistic Regression. See @5ervant's answer.
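To make points 1) and 2) concrete, here is a small pure-Python sketch. It is a hand-rolled stand-in for what StandardScaler does, not scikit-learn code; the helper names and the toy data are assumptions for illustration. The scaling statistics are computed on the training data only and then reused for the test data.

```python
import math


def fit_standardizer(rows):
    """Return per-column means and (population) standard deviations."""
    n = len(rows)
    n_features = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_features)]
    stds = []
    for j in range(n_features):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / n
        stds.append(math.sqrt(var) or 1.0)  # avoid dividing by 0 for constant columns
    return means, stds


def transform(rows, means, stds):
    """Scale rows with statistics that were fitted elsewhere."""
    return [[(r[j] - means[j]) / stds[j] for j in range(len(r))] for r in rows]


# Hypothetical data: the second feature is about 1000x larger than the first.
train = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]]
test = [[2.0, 2000.0]]

means, stds = fit_standardizer(train)          # fit on the training data only
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)     # reuse the training statistics

# Candidate values of C on a logarithmic scale: 1e-5, 1e-4, ..., 10, 100
c_grid = [10.0 ** k for k in range(-5, 3)]
```

After scaling, both features vary over a comparable range, which is what helps the conditioning; each value in c_grid would then be tried in turn with the classifier.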
Note: One should not ignore this warning.
This warning came about because
Solving the linear SVM is just solving a quadratic optimization problem. The solver is typically an iterative algorithm that keeps a running estimate of the solution (i.e., the weight and bias for the SVM).
It stops running when the solution corresponds to an objective value that is optimal for this convex optimization problem, or when it hits the maximum number of iterations set.
If the algorithm does not converge, then the current estimate of the SVM's parameters is not guaranteed to be any good, hence the predictions can also be complete garbage.
Edit
In addition, consider the comment by @Nino van Hooff and @5ervant to use the dual formulation of the SVM. This is especially important if the number of features you have, D, is more than the number of training examples N. This is what the dual formulation of the SVM is particularly designed for, and it helps with the conditioning of the optimization problem. Credit to @5ervant for noticing and pointing this out.
Furthermore, @5ervant also pointed out the possibility of changing the solver, in particular the use of the L-BFGS solver. Credit to him (i.e., upvote his answer, not mine).
I would like to provide a quick, rough explanation for those who are interested (I am :)) of why this matters in this case. Second-order methods, and in particular approximate second-order methods like the L-BFGS solver, help with ill-conditioned problems because they approximate the Hessian at each iteration and use it to scale the gradient direction. This gives a better convergence rate, but possibly at a higher compute cost per iteration. That is, it takes fewer iterations to finish, but each iteration will be slower than that of a typical first-order method like gradient descent or its variants.
For example, a typical first-order method might update the solution at each iteration like
x(k + 1) = x(k) - alpha(k) * gradient(f(x(k)))
where alpha(k), the step size at iteration k, depends on the particular choice of algorithm or learning rate schedule.
A second-order method, e.g., Newton's method, will have the update equation
x(k + 1) = x(k) - alpha(k) * Hessian(x(k))^(-1) * gradient(f(x(k)))
That is, it uses the information of the local curvature encoded in the Hessian to scale the gradient accordingly. If the problem is ill-conditioned, the gradient will be pointing in less than ideal directions and the inverse Hessian scaling will help correct this.
In particular, L-BFGS, mentioned in @5ervant's answer, is a way to approximate the inverse of the Hessian, as computing it can be an expensive operation.
However, second-order methods might converge much faster (i.e., require fewer iterations) than first-order methods like the usual gradient-descent based solvers, which as you guys know by now sometimes fail to even converge. This can compensate for the time spent at each iteration.
In summary, if you have a well-conditioned problem, or if you can make it well-conditioned through other means such as using regularization and/or feature scaling and/or making sure you have more examples than features, you probably don't have to use a second-order method. But these days, with many models optimizing non-convex problems (e.g., those in DL models), second-order methods such as L-BFGS play a different role there, and there is evidence to suggest they can sometimes find better solutions compared to first-order methods. But that is another story.
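The two update equations above can be compared on a deliberately ill-conditioned toy problem. This is only a sketch: the quadratic objective, its curvatures, the step size, and the function names are assumptions chosen for illustration, and it uses exact Newton steps rather than L-BFGS.

```python
import math

# Ill-conditioned quadratic f(x) = 0.5 * (A * x1^2 + B * x2^2), minimum at (0, 0).
A, B = 1.0, 1000.0  # hypothetical curvatures; condition number B / A = 1000


def gradient(x):
    return [A * x[0], B * x[1]]


def gradient_step(x, alpha):
    """First-order update: x(k+1) = x(k) - alpha * gradient."""
    g = gradient(x)
    return [x[0] - alpha * g[0], x[1] - alpha * g[1]]


def newton_step(x, alpha=1.0):
    """Newton update; the Hessian is diag(A, B), so its inverse divides each component."""
    g = gradient(x)
    return [x[0] - alpha * g[0] / A, x[1] - alpha * g[1] / B]


def iterations_to_converge(step, x0, tol=1e-6, max_iter=100000):
    """Count the steps needed until the iterate is within tol of the minimum."""
    x = list(x0)
    for k in range(max_iter):
        if math.hypot(x[0], x[1]) < tol:
            return k
        x = step(x)
    return max_iter


# The step size 1 / B keeps gradient descent stable along the steep direction.
gd_iters = iterations_to_converge(lambda x: gradient_step(x, alpha=1.0 / B), [1.0, 1.0])
newton_iters = iterations_to_converge(newton_step, [1.0, 1.0])
print(gd_iters, newton_iters)
```

Gradient descent has to use a step small enough for the steep direction, so it crawls along the flat one and needs thousands of iterations; the Newton step rescales both directions and lands on the minimum of this quadratic in one iteration.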

I reached the point where I set max_iter=1200000 on my LinearSVC classifier, but the "ConvergenceWarning" was still present. I fixed the issue by just setting dual=False and leaving max_iter at its default.
With a LogisticRegression(solver='lbfgs') classifier, you should increase max_iter. Mine reached max_iter=7600 before the "ConvergenceWarning" disappeared when training on the features of a large dataset.

Explicitly specifying max_iter resolves the warning, as the default max_iter is 100 (for Logistic Regression).
logreg = LogisticRegression(max_iter=1000)

Please increase max_iter to 10000; the default value is 1000 for LinearSVC and 100 for LogisticRegression. Increasing the number of iterations may help the algorithm converge. For me it converged, and the solver was 'lbfgs':
log_reg = LogisticRegression(solver='lbfgs',class_weight='balanced', max_iter=10000)

What is the key feature in MNIST Dataset that is used to classify images

I was recently learning about neural networks and came across the MNIST data set. I understood that a sigmoid cost function is used to reduce the loss. Also, the weights and biases get adjusted, and optimal weights and biases are found after training. The thing I did not understand is on what basis the images are classified. For example, to classify whether a patient has cancer or not, data like age, location, etc., become features. In the MNIST dataset, I did not find any of that. Am I missing something here? Please help me with this.

First of all, the network pipeline consists of 3 main parts:
1) Input manipulation
2) Parameters that affect the finding of the minimum
3) Parameters like your decision function in your interpretation layer (often a fully connected layer)
In contrast to a regular machine learning pipeline, where you have to extract features manually, a CNN uses filters (like the filters in edge detection or Viola-Jones).
When a filter runs across the image and is convolved with the pixels, it produces an output.
This output is then interpreted by a neuron. If the output is above a threshold, it is considered valid (a step function outputs 1 if valid; with a sigmoid, the output is the value of the sigmoid function).
The next steps are the same as before.
This continues until the interpretation layer (often softmax). This layer interprets your computation: if the filters are well adapted to your problem, you will get a good predicted label, which means a small difference (y_guess - y_true_label).
Now you can see that, to obtain the guess y, we have multiplied the input x by many weights w and also applied functions to it. This can be seen as a chain of functions, to which the chain rule from analysis applies.
To get better results, the effect of each single weight on the error must be known. Therefore, you use backpropagation, which computes the derivative of the error with respect to all w. The trick is that you can reuse derivatives, which is more or less what backpropagation is, and it becomes easier since you can use matrix-vector notation.
Once you have your gradient, you can use the usual concept of minimization, where you walk along the direction of steepest descent. (There are also many other gradient methods, like Adagrad or Adam.)
These steps repeat until convergence or until you reach the maximum number of epochs.
So the answer is: THE COMPUTED WEIGHTS (FILTERS) ARE THE KEY TO DETECT NUMBERS AND DIGITS :)
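The "filter runs across the image, then a neuron thresholds the output" step described above can be sketched in a few lines of plain Python. The tiny image, the hand-written edge filter, and the threshold are assumptions for illustration; in a trained CNN the filter values are learned rather than written by hand.

```python
def correlate2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and sum the elementwise products.

    This is the sliding-window operation CNN layers call "convolution"
    (strictly speaking, cross-correlation, since the kernel is not flipped).
    """
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out


def step_activation(feature_map, threshold):
    """Output 1 where the filter response exceeds the threshold, else 0."""
    return [[1 if v > threshold else 0 for v in row] for row in feature_map]


# 4x4 toy "image": dark on the left (0), bright on the right (1).
image = [[0, 0, 1, 1] for _ in range(4)]
# Hand-written filter that responds to a dark-to-bright vertical edge.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = correlate2d_valid(image, kernel)
activated = step_activation(feature_map, threshold=1)
print(feature_map)
print(activated)
```

The filter responds only in the middle column, where the edge is; the pixels themselves are the input features, and responses like this one are what the later layers use to tell digits apart.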

See retained variance in scikit-learn manifold learning methods

I have a dataset of images that I would like to run nonlinear dimensionality reduction on. To decide what number of output dimensions to use, I need to be able to find the retained variance (or explained variance, I believe they are similar). Scikit-learn seems to have by far the best selection of manifold learning algorithms, but I can't see any way of getting a retained variance statistic. Is there a part of the scikit-learn API that I'm missing, or simple way to calculate the retained variance?

I don't think there is a clean way to derive the "explained variance" of most non-linear dimensionality techniques, in the same way as it is done for PCA.
For PCA, it is trivial: you are simply taking the weight of a principal component in the eigendecomposition (i.e. its eigenvalue) and summing the weights of the ones you use for linear dimensionality reduction.
Of course, if you keep all the eigenvectors, then you will have "explained" 100% of the variance (i.e. perfectly reconstructed the covariance matrix).
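For the PCA case, the computation described above amounts to a ratio of eigenvalue sums. A minimal sketch, with made-up eigenvalues and hypothetical function names, assuming the eigenvalues of the covariance matrix are already available:

```python
def retained_variance(eigenvalues, n_components):
    """Fraction of total variance explained by the n largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:n_components]) / sum(ordered)


def components_needed(eigenvalues, target):
    """Smallest number of components whose retained variance reaches target."""
    for k in range(1, len(eigenvalues) + 1):
        if retained_variance(eigenvalues, k) >= target:
            return k
    return len(eigenvalues)


# Hypothetical eigenvalues of a covariance matrix (total variance = 8.0).
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]

print(retained_variance(eigenvalues, 2))
print(components_needed(eigenvalues, 0.85))
```

This is the statistic that has no direct counterpart in most non-linear methods, because they do not decompose the data's variance into additive per-component weights.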
Now, one could try to define a notion of explained variance in a similar fashion for other techniques, but it might not have the same meaning.
For instance, some dimensionality reduction methods might actively try to push apart more dissimilar points and end up with more variance than what we started with. Or much less if it chooses to cluster some points tightly together.
However, in many non-linear dimensionality reduction techniques, there are other measures that give notions of "goodness-of-fit".
For instance, in scikit-learn, Isomap has a reconstruction error (the reconstruction_error() method), TSNE can return its KL-divergence (the kl_divergence_ attribute), and MDS can return the reconstruction stress (the stress_ attribute).

How to extract info from scikits.learn classifier to then use in C code

I have trained a bunch of RBF SVMs using scikits.learn in Python and then Pickled the results. These are for image processing tasks and one thing I want to do for testing is run each classifier on every pixel of some test images. That is, extract the feature vector from a window centered on pixel (i,j), run each classifier on that feature vector, and then move on to the next pixel and repeat. This is far too slow to do with Python.
Clarification: When I say "this is far too slow..." I mean that even the Libsvm under-the-hood code that scikits.learn uses is too slow. I'm actually writing a manual decision function for the GPU so classification at each pixel happens in parallel.
Is it possible for me to load the classifiers with Pickle, and then grab some kind of attribute that describes how the decision is computed from the feature vector, and then pass that info to my own C code? In the case of linear SVMs, I could just extract the weight vector and bias vector and add those as inputs to a C function. But what is the equivalent thing to do for RBF classifiers, and how do I get that info from the scikits.learn object?
Added: First attempts at a solution.
It looks like the classifier object has the attribute support_vectors_ which contains the support vectors as each row of an array. There is also the attribute dual_coef_ which is a 1 by len(support_vectors_) array of coefficients. From the standard tutorials on non-linear SVMs, it appears then that one should do the following:
1) Compute the feature vector v from your data point under test. This will be a vector that is the same length as the rows of support_vectors_.
2) For each row i in support_vectors_, compute the squared Euclidean distance d[i] between that support vector and v.
3) Compute t[i] as exp(-gamma * d[i]), where gamma is the RBF parameter.
4) Sum up dual_coef_[i] * t[i] over all i. Add the value of the intercept_ attribute of the scikits.learn classifier to this sum.
5) If the sum is positive, classify as 1. Otherwise, classify as 0.
Added: On numbered page 9 at this documentation link it mentions that indeed the intercept_ attribute of the classifier holds the bias term. I have updated the steps above to reflect this.
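The steps above can be prototyped in plain Python before porting them to C or the GPU. This is a sketch for a binary classifier that uses the standard RBF kernel exp(-gamma * d); the support vectors, coefficients, intercept, and gamma below are made-up numbers, whereas in practice they would be read from the fitted classifier's support_vectors_, dual_coef_, and intercept_ attributes and from its gamma parameter.

```python
import math


def rbf_decision_value(v, support_vectors, dual_coef, intercept, gamma):
    """Sum of dual_coef[i] * exp(-gamma * ||sv_i - v||^2), plus the intercept."""
    total = intercept
    for sv, coef in zip(support_vectors, dual_coef):
        d = sum((a - b) ** 2 for a, b in zip(sv, v))  # squared Euclidean distance
        total += coef * math.exp(-gamma * d)
    return total


def classify(v, support_vectors, dual_coef, intercept, gamma):
    """Return 1 if the decision value is positive, otherwise 0."""
    value = rbf_decision_value(v, support_vectors, dual_coef, intercept, gamma)
    return 1 if value > 0 else 0


# Hypothetical model with two support vectors (one per class).
support_vectors = [[0.0, 0.0], [2.0, 0.0]]
dual_coef = [1.0, -1.0]
intercept = 0.0
gamma = 0.5

print(classify([0.0, 0.0], support_vectors, dual_coef, intercept, gamma))
print(classify([2.0, 0.0], support_vectors, dual_coef, intercept, gamma))
```

The loop over support vectors is the part that maps naturally onto a C function taking flat arrays, and its cost grows linearly with the number of support vectors.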

Yes, your solution looks alright. To pass the raw memory of a numpy array directly to a C program, you can use the ctypes helpers from numpy, or wrap your C program with cython and call it directly by passing the numpy array (see the doc at http://cython.org for more details).
However, I am not sure that trying to speed up the prediction on a GPU is the easiest approach: kernel support vector machines are known to be slow at prediction time, since their complexity directly depends on the number of support vectors, which can be high for highly non-linear (multi-modal) problems.
Alternative approaches that are faster at prediction time include neural networks (probably more complicated or slower to train right than SVMs, which only have 2 hyper-parameters, C and gamma) or transforming your data with a non-linear transformation based on distances to prototypes + thresholding + max pooling over image areas (only for image classification).
For the first method, you will find good documentation in the deep learning tutorial.
For the second, read the recent papers by Adam Coates and have a look at this page on kmeans feature extraction.
Finally, you can also try to use NuSVC models, whose regularization parameter nu has a direct impact on the number of support vectors in the fitted model: fewer support vectors mean faster prediction times (check the accuracy though; it will be a trade-off between prediction speed and accuracy in the end).
