sklearn svm SVC fails, but does not report fit_status_==1 - python
I am trying to fit a weighted linear SVC to the "noisy circles" dataset. For some reason, the weighted version finds a decision function that is extremely bad. Yet libsvm reports that the fit was successful. My weights are not totally strange, so I'm not sure why the algorithm fails. Worse, I'm not sure how to predict under what circumstances the algorithm will fail, or what to do about it.
Here is the offending code:
import numpy as np
import sklearn.datasets
import sklearn.svm
## GET THE NOISY CIRCLES DATASET
n = 200
noise=0.04
factor = 0.3
SEED = 1
np.random.seed(SEED)
noisy_circles, c = sklearn.datasets.make_circles(n_samples=n, factor=factor,
noise=noise)
## HARDCODED WEIGHTS 4 STACKOVERFLOW
weights = np.array([0.93301464, 0.92261151, 0.93367401, 0.38632274, 0.35437395,
0.43346701, 1.09297683, 1.19747184, 0.96349809, 0.32426173,
0.29397037, 1.03628304, 1.05908521, 1.10653401, 0.37677232,
0.35153446, 0.24747971, 0.90887151, 0.24463193, 0.85877582,
0.89405636, 1.03921294, 0.87729103, 1.1589434 , 0.93196245,
0.22982046, 0.82391095, 0.95794411, 0.39876209, 0.96383222,
0.91290011, 0.24322639, 0.41364025, 0.32605574, 0.3712862 ,
1.13075687, 0.33799184, 0.94422961, 0.96021123, 0.29392899,
0.40880845, 0.37780868, 0.4861022 , 1.06077845, 0.89866461,
1.07030338, 0.34269111, 0.86699042, 0.39481626, 0.33021158,
1.17056528, 0.24180542, 0.2446189 , 0.87293221, 0.91510412,
0.32998597, 0.37407169, 0.41486528, 0.42505555, 0.20065111,
0.38846804, 0.92251402, 0.99049091, 0.90580681, 0.97491595,
1.08819797, 0.26700098, 0.42487132, 0.93167479, 1.02463133,
0.89980578, 1.1096191 , 0.37254448, 0.2359968 , 0.28334117,
0.33311215, 1.08758973, 0.32901317, 1.13315268, 0.29888742,
0.14581565, 1.07038078, 1.03316864, 0.35451779, 0.45098287,
1.12772454, 1.08896868, 0.28236812, 0.46117373, 0.83258909,
1.174982 , 0.89901124, 0.12965322, 0.41543288, 0.17358532,
0.45842307, 0.42685333, 0.42375945, 0.210712 , 0.377017 ,
1.03517938, 0.9891231 , 1.07126936, 0.19820075, 1.1002386 ,
0.93338903, 1.1061464 , 0.20301447, 1.08130118, 0.34030289,
1.16104716, 0.15868522, 1.07481773, 0.94876721, 0.93468891,
0.3231601 , 1.04994012, 0.32166893, 0.90920628, 0.90999114,
1.03839278, 1.14232502, 0.18056755, 0.2639544 , 0.16631772,
1.10689008, 0.36852231, 0.20091628, 0.28666013, 1.05392917,
0.91207713, 1.13049957, 0.40367044, 0.33333911, 0.3380625 ,
1.0615807 , 0.30797683, 1.08206638, 0.39374589, 0.40647774,
0.23565583, 0.22030266, 0.33806818, 0.44739648, 0.94079254,
1.03878309, 0.84132066, 0.2772951 , 0.40448219, 1.14960352,
0.89091529, 0.97398981, 1.00992373, 0.87505294, 0.98439767,
1.13634672, 0.2694606 , 0.89735526, 0.21407159, 0.31951442,
0.37647624, 0.90387395, 0.36897273, 0.32483939, 0.42423936,
1.14167808, 0.88631001, 0.34304598, 1.12320881, 0.91640671,
1.0111603 , 0.8649317 , 0.97180267, 1.17381377, 0.4581278 ,
0.15286761, 1.14522941, 1.17181889, 1.02299728, 0.91620512,
0.18773065, 0.2600077 , 0.23665254, 0.20477831, 0.16430318,
0.38680433, 1.0352136 , 0.31850732, 1.02505276, 0.24500125,
1.01564276, 0.20866012, 0.2194238 , 0.37527691, 1.05327402,
0.18154061, 0.25013442, 0.99024356, 0.15072547, 0.87641354])
## MODEL SETUP AND TRAINING
model = sklearn.svm.SVC(C=30.,kernel="linear")
model.fit(noisy_circles, c, sample_weight=weights)
print(model.coef_, model.intercept_, model.fit_status_)
Note that fit_status_ reports success. However, the fitted model parameters are total nonsense. To see this, here is a plot of the data (with the size of each dot scaled by the weight of that point):
Here is the fitted line along the same range in x:
Whatever is happening here seems to be driving the decision surface off to infinity. At first I thought that such a large C was simply overpowering the part of the SVM that was trying to learn anything, but reducing C to 0.0001 does not change anything.
What is going on with the algorithm that produces this counter-intuitive behavior? Under what circumstances should I expect the algorithm to fail in this way?
UPDATE: The nightly build of sklearn supports sample weights for LinearSVC. Switching over to LinearSVC, I am witnessing the same behavior when the loss is set to "hinge", but not for this particular set of weights. This causes me to suspect that there is some kind of ill-conditioning in the problem somewhere. I'm still not sure exactly what is happening, but possibly this sheds some light on the problem.
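For reference, a minimal sketch of that LinearSVC experiment (this assumes a scikit-learn build in which LinearSVC.fit accepts sample_weight, which at the time of writing is only the nightly build):

import sklearn.svm

# Assumes LinearSVC.fit accepts sample_weight (nightly build at the time of writing)
lin_model = sklearn.svm.LinearSVC(C=30., loss="hinge")
lin_model.fit(noisy_circles, c, sample_weight=weights)
print(lin_model.coef_, lin_model.intercept_)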
The problem doesn't lie in sample_weight or C, it lies in the linear nature of the kernel. You are trying to learn a non-linear decision boundary (circular in this case) using a function that simply cannot express anything but a linear decision boundary. This applies to both SVC(kernel="linear") and LinearSVC. In my experiments, simply using a non-linear kernel like rbf completely solved it.
All SVMs in fact learn a linear boundary. So why does something like rbf perform well? The answer lies in the "kernel trick". Put simply, the rbf kernel transforms the dataset (technically, it projects the data into a higher-dimensional space) so that linearly separating the classes in that transformed space results in a non-linear boundary in the original space. Here is a more detailed explanation of it.
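As a rough sketch of that fix on the question's own data (only the kernel changes; everything else, including C and the sample weights, is kept from the original snippet):

import sklearn.svm

# Same data and weights as above; only the kernel is changed to rbf
rbf_model = sklearn.svm.SVC(C=30., kernel="rbf")
rbf_model.fit(noisy_circles, c, sample_weight=weights)
print(rbf_model.score(noisy_circles, c))  # training accuracy should now be close to 1.0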
Update: As for how the weights contribute to the failure with the linear kernel, the answer most likely lies in the fact that the average weight assigned to the classes is imbalanced; in particular, the average weight assigned to class 0 is 3 times higher than that of class 1. Here are a few results that point to this conclusion:
The linear kernel learns a "reasonable" boundary (meaning a boundary that lies within the input space of the samples) when the weights are all 1.0 or randomly generated.
The boundary is also reasonable if we balance the class weights by passing class_weight={0: (w[y==1].sum()/w[y==0].sum()), 1: 1} to the constructor.
If we create a weight imbalance in some other way, e.g. by using uniform sample weights but different class weights, by assigning class 1 weights 3 times higher than class 0, or by removing the weights altogether and simply making one class a third as frequent as the other, the problem above reappears.
This imbalance seems to push the coefficients to zero, although the rbf kernel doesn't seem to be affected by it. As for why libsvm can't report the failure, that I unfortunately do not know.
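A minimal sketch of the class_weight balancing experiment from the list above, reusing the question's data (w and y below are just aliases for weights and c):

import sklearn.svm

w, y = weights, c  # the question's sample weights and labels

# Scale class 0 down so the total weight of the two classes is balanced
balance = {0: w[y == 1].sum() / w[y == 0].sum(), 1: 1}
balanced_model = sklearn.svm.SVC(C=30., kernel="linear", class_weight=balance)
balanced_model.fit(noisy_circles, y, sample_weight=w)
print(balanced_model.coef_, balanced_model.intercept_)  # boundary should now stay within the data range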
Related
Why are statsmodels and sklearn returning different Lasso estimates?
I'm currently doing a project investigating the Bayesian Lasso, and part of the project involves running some simulations. It can be shown that if we set independent and identically distributed conditional Laplace priors on the regression coefficients beta, the posterior mode is a frequentist Lasso estimate with tuning parameter 2 x sigma x lambda. So to check my work, I often use both scikit-learn and statsmodels (in particular their Lasso implementations) to calculate the frequentist Lasso estimate that should be approximately equal to the posterior mode (using the medians of sigma and lambda as estimates for 2 x sig x lam) and superimpose them onto my histograms.

In the simulations I've run involving independent predictor variables, all Lasso estimates that I compute with scikit and statsmodels align, and they appear to coincide with the posterior mode when I superimpose them on my histograms. However, if I use virtually the same code but with (a) multicollinearity, (b) p > n, or (c) n > p but n small, e.g. n = 12 and p = 9, sklearn and statsmodels occasionally output different Lasso estimates for the same tuning hyperparameter, which is confusing. Here's my code for the multicollinear predictors case (lam = 1 here):

sigmamedian = np.median(np.sqrt(burned_sig2_tracesLam))

# Check with SciKit's Lasso
skmodel = Lasso(alpha=2*sigmamedian*lam/(2*len(y_train)), fit_intercept=False, tol=1e-12)
skmodel = skmodel.fit(X_trainStd, y_train - np.mean(y_train))
print(skmodel.coef_)

# StatsModels' Lasso
smLasso = sm.OLS(y_train - np.mean(y_train), X_trainStd).fit_regularized(alpha=2*sigmamedian*lam/(2*len(y_train)))
print(smLasso.params)

The output in this case is:

[ 0.28284396, -1.23878332,  1.08344865,  0.29263474,  0.        ,  0.00655085]
[ 0.45950192, -1.32361768,  0.8906759 ,  0.28951489,  0.        ,  0.        ]

These aren't the same, which is confusing, because I've checked the documentation and I haven't made the common mistake with how the intercept parameters are handled in scikit versus statsmodels, the tuning hyperparameters passed in are the same, both modules use coordinate descent, and Lasso estimates should be unique for n > p (and actually Tibshirani has shown that Lasso estimates are almost surely unique if the data is generated from a continuous distribution, even if p > n).

After superimposing both of these onto the histograms of the posterior distributions, it appears that it is scikit's implementation that returns the posterior mode (or at least a very good approximation of it). If I change lambda to 2, then scikit and statsmodels still return different things, but statsmodels returns the estimate that accurately approximates the posterior mode. I also generated data with p = 25 and n = 20; the theory suggests that Lasso should set 5 of the coefficients to zero, but neither scikit nor statsmodels did this. What's going on here?
python sklearn: what is the difference between "sklearn.preprocessing.normalize(X, norm='l2')" and "sklearn.svm.LinearSVC(penalty='l2')"
Here are two methods that involve normalization: 1) this one is used in data pre-processing: sklearn.preprocessing.normalize(X, norm='l2'); 2) the other is used in the classifier itself: sklearn.svm.LinearSVC(penalty='l2'). I want to know: what is the difference between them? Must both steps be used in a complete model, or is it enough to use just one of them?
These are two different things, and you normally need both in order to make a good SVC model.

1) The first one is a pre-processing step: sklearn.preprocessing.normalize(X, norm='l2') rescales the data so that each sample (each row of the data matrix X, by default; pass axis=0 to scale columns instead) is divided by its L2 norm, sqrt(sum(abs(X[i,:])**2)). This keeps individual values from becoming too big, which otherwise makes it hard for some algorithms to converge.

2) Irrespective of how well scaled (and small in values) your data is, there may still be outliers or features (j) that are far too dominant, and your algorithm (LinearSVC()) may over-trust them when it shouldn't. This is where L2 regularization comes into play: in addition to the loss the algorithm minimizes, a penalty is applied to the coefficients so that they don't become too big. In other words, the squared coefficients beta[j]^2 become an additional cost in the SVC objective. How strongly that cost counts relative to the data-fitting term is controlled by C: in scikit-learn, a smaller C means a stronger penalty on the coefficients.

To sum up, the first one tells you how each row of the X matrix is rescaled before fitting; the second controls how much the coefficients are penalized for growing large during fitting.
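A short hedged sketch putting the two together (the data X, y here is made up for illustration; remember that normalize works row-wise by default):

import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

# Hypothetical data: 100 samples, 5 features, binary labels
X = np.random.RandomState(0).randn(100, 5)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Step 1: scale the data (each sample to unit L2 norm by default)
X_scaled = normalize(X, norm='l2')

# Step 2: fit a classifier whose coefficients are L2-penalized
clf = LinearSVC(penalty='l2', C=1.0)  # smaller C = stronger penalty
clf.fit(X_scaled, y)
print(clf.coef_)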
scikit-learn PCA: matrix transformation produces PC estimates with flipped signs
I'm using scikit-learn to perform PCA on this dataset. The scikit-learn documentation states that

Due to implementation subtleties of the Singular Value Decomposition (SVD), which is used in this implementation, running fit twice on the same matrix can lead to principal components with signs flipped (change in direction). For this reason, it is important to always use the same estimator object to transform data in a consistent fashion.

The problem is that I don't think I'm using different estimator objects, yet the signs of some of my PCs are flipped when compared to the results of SAS's PROC PRINCOMP procedure. For the first observation in the dataset, the SAS PCs are:

PC1      PC2      PC3      PC4      PC5
2.0508   1.9600   -0.1663  0.2965   -0.0121

From scikit-learn, I get the following (which are very close in magnitude):

PC1      PC2      PC3      PC4      PC5
-2.0536  -1.9627  -0.1666  -0.297   -0.0122

Here's what I'm doing:

import pandas as pd
import numpy as np
from sklearn.decomposition.pca import PCA

sourcef = pd.read_csv('C:/mydata.csv')
frame = pd.DataFrame(sourcef)

# Some pandas evals, regressions, etc. that I'm not showing,
# but which do not affect the matrix

# Make sure we are working with the proper data -- drop the response variable
cols = [col for col in frame.columns if col not in ['response']]

# Separate out the data matrix from the response variable vector
# into numpy arrays
frame2_X = frame[cols].values
frame2_y = frame['response'].values

# Standardize the values
X_means = np.mean(frame2_X, axis=0)
X_stds = np.std(frame2_X, axis=0)
y_mean = np.mean(frame2_y)
y_std = np.std(frame2_y)

frame2_X_stdz = np.copy(frame2_X)
frame2_y_stdz = frame2_y.astype(np.float32, copy=True)

for (x, y), value in np.ndenumerate(frame2_X_stdz):
    frame2_X_stdz[x][y] = (value - X_means[y]) / X_stds[y]

for index, value in enumerate(frame2_y_stdz):
    frame2_y_stdz[index] = (float(value) - y_mean) / y_std

# Show the first 5 elements of the standardized values, to verify
print frame2_X_stdz[:,0][:5]

# Show the first 5 lines from the standardized response vector, to verify
print frame2_y_stdz[:5]

Those check out OK:

[ 0.9508 -0.5847 -0.2797 -0.4039 -0.598 ]
[ 1.0726 -0.5009 -0.0942 -0.1187 -0.8043]

Continuing on...

# Create a PCA object
pca = PCA()
pca.fit(frame2_X_stdz)

# Create the matrix of PC estimates
pca.transform(frame2_X_stdz)

Here's the output of the last step:

Out[16]:
array([[-2.0536, -1.9627, -0.1666, -0.297 , -0.0122],
       [ 1.382 , -0.382 , -0.5692, -0.0257, -0.0509],
       [ 0.4342,  0.611 ,  0.2701,  0.062 , -0.011 ],
       ...,
       [ 0.0422,  0.7251, -0.1926,  0.0089,  0.0005],
       [ 1.4502, -0.7115, -0.0733,  0.0013, -0.0557],
       [ 0.258 ,  0.3684,  0.1873,  0.0403,  0.0042]])

I've tried replacing pca.fit() and pca.transform() with pca.fit_transform(), but I end up with the same results. What am I doing wrong here that gives me PCs with the signs flipped?
You're doing nothing wrong. What the documentation is warning you about is that repeated calls to fit may yield different principal components - not how they relate to another PCA implementation. Having a flipped sign on all components doesn't make the result wrong - the result is right as long as it fulfills the definition (each component is chosen such that it captures the maximum amount of variance in the data). As it stands, the projection you got is simply mirrored - it still fulfills the definition, and is therefore correct. If, beyond correctness, you're worried about consistency between implementations, you can simply multiply the components by -1 whenever necessary.
SVD decompositions are not guaranteed to be unique - only the singular values are guaranteed to be identical, as different implementations of svd() can produce different signs. Any of the eigenvectors can have flipped signs, and will produce identical results when transformed, then transformed back into the original space. Most algorithms in sklearn which use an SVD decomposition use the function sklearn.utils.extmath.svd_flip() to correct this and enforce an identical convention across algorithms. For historical reasons, PCA() never got this fix (though maybe it should...). In general, this is not something to worry about - it is just a limitation of the SVD algorithm as typically implemented. On an additional note, I find assigning importance to PC weights (and parameter weights in general) dangerous, precisely because of these kinds of issues. Numerical/implementation details should not influence your analysis results, but many times it is hard to tell what is a result of the data, and what is a result of the algorithms you use for exploration. I know this is a homework assignment, not a choice, but it is important to keep these things in mind!
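If you do want a deterministic sign convention across runs or implementations, here is a small hedged sketch: flip each component so that its largest-magnitude loading is positive, and flip the corresponding score column the same way so the projection is unchanged (this is only an illustrative convention, not what SAS or scikit-learn do internally):

import numpy as np

def fix_signs(components, scores):
    # components: (n_components, n_features), scores: (n_samples, n_components)
    largest = np.abs(components).argmax(axis=1)
    signs = np.sign(components[np.arange(components.shape[0]), largest])
    # Flip each component row and its score column together
    return components * signs[:, None], scores * signs

# Usage sketch with the question's objects:
# comps, pcs = fix_signs(pca.components_, pca.transform(frame2_X_stdz))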
Scikit-learn: Parallelize stochastic gradient descent
I have a fairly large training matrix (over 1 billion rows, two features per row). There are two classes (0 and 1). This is too large for a single machine, but fortunately I have about 200 MPI hosts at my disposal. Each is a modest dual-core workstation. Feature generation is already successfully distributed.

The answers in Multiprocessing scikit-learn suggest it is possible to distribute the work of an SGDClassifier:

You can distribute the data sets across cores, do partial_fit, get the weight vectors, average them, distribute them to the estimators, do partial fit again.

When I have run partial_fit for the second time on each estimator, where do I go from there to get a final aggregate estimator? My best guess was to average the coefs and the intercepts again and make an estimator with those values. The resulting estimator gives a different result than an estimator constructed with fit() on the entire data.

Details

Each host generates a local matrix and a local vector. This is n rows of the test set and the corresponding n target values. Each host uses the local matrix and local vector to make an SGDClassifier and do a partial fit. Each then sends the coef vector and the intercept to root. Root averages these and sends them back to the hosts. The hosts do another partial_fit and send the coef vector and the intercept to root. Root constructs a new estimator with these values.

local_matrix = get_local_matrix()
local_vector = get_local_vector()

estimator = linear_model.SGDClassifier()
estimator.partial_fit(local_matrix, local_vector, [0, 1])
comm.send((estimator.coef_, estimator.intercept_), dest=0, tag=rank)

average_coefs = None
avg_intercept = None

comm.bcast(0, root=0)
if rank > 0:
    comm.send((estimator.coef_, estimator.intercept_), dest=0, tag=rank)
else:
    pairs = [comm.recv(source=r, tag=r) for r in range(1, size)]
    pairs.append((estimator.coef_, estimator.intercept_))
    average_coefs = np.average([a[0] for a in pairs], axis=0)
    avg_intercept = np.average([a[1][0] for a in pairs])

estimator.coef_ = comm.bcast(average_coefs, root=0)
estimator.intercept_ = np.array([comm.bcast(avg_intercept, root=0)])
estimator.partial_fit(metric_matrix, edges_exist, [0, 1])  # this host's local matrix and target vector

if rank > 0:
    comm.send((estimator.coef_, estimator.intercept_), dest=0, tag=rank)
else:
    pairs = [comm.recv(source=r, tag=r) for r in range(1, size)]
    pairs.append((estimator.coef_, estimator.intercept_))
    average_coefs = np.average([a[0] for a in pairs], axis=0)
    avg_intercept = np.average([a[1][0] for a in pairs])
    estimator.coef_ = average_coefs
    estimator.intercept_ = np.array([avg_intercept])
    print("The estimator at rank 0 should now be working")

Thank you!
Training a linear model on a dataset with 1e9 samples and 2 features is very likely to underfit, or to waste CPU / IO time in case the data is actually linearly separable. Don't waste time thinking about parallelizing such a problem with a linear model: either switch to a more complex class of models (e.g. train random forests on smaller partitions of the data that fit in memory and aggregate them), or select random subsamples of your dataset of increasing size and train linear models on them. Measure the predictive accuracy on a held-out test set and stop when you see diminishing returns (probably after a couple of tens of thousands of samples of the minority class).
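A hedged sketch of that subsampling loop (load_subsample is a hypothetical helper standing in for however you pull n random rows and labels out of your distributed store):

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

# load_subsample(n) is a hypothetical helper returning (X, y) with n random rows
X_test, y_test = load_subsample(50000)  # held-out evaluation set

for n in [10000, 30000, 100000, 300000]:
    X_train, y_train = load_subsample(n)
    clf = SGDClassifier(loss="hinge").fit(X_train, y_train)
    print(n, accuracy_score(y_test, clf.predict(X_test)))
    # stop growing n once the accuracy stops improving (diminishing returns)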
What you are experiencing is normal and expected. First, using SGD means you are never going to hit an exact result. You will quickly converge towards the optimal solution (since this is a convex problem) and then hover around that area for the remainder. Different runs with the whole data set alone should produce slightly different results each time.

where do I go from there to get a final aggregate estimator?

In theory, you would just keep doing that over and over until you are happy with convergence. It is completely unnecessary for what you are doing. Other systems switch to using more sophisticated methods (like L-BFGS) to converge towards the final solution now that they have a good "warm start" on the solution. However, this will not get you any drastic gains in accuracy (think maybe a whole percentage point if you are lucky), so don't consider it make-or-break. Consider it what it is: fine tuning.

Second, linear models don't parallelize well. Despite the claims of vowpalwabbit and some other libraries, you are not going to get linear scaling out of training a linear model in parallel. Simply averaging the intermediate results is a bad way to parallelize such a system, and sadly that's about as good as it gets for training linear models in parallel.

The fact is, you only have 2 features. You should be able to easily train far more complicated models using only a smaller subset of your data. 1 billion rows is overkill for just 2 features.
Mathematical background of statsmodels wls_prediction_std
wls_prediction_std returns the standard deviation and confidence interval of my fitted model data. I would need to know how the confidence intervals are calculated from the covariance matrix. (I already tried to figure it out by looking at the source code, but wasn't able to.) I was hoping some of you could help me out by writing out the mathematical expression behind wls_prediction_std.
There should be a variation on this in any textbook, without the weights. For OLS, Greene (5th edition, which I used) gives the prediction error variance as

s^2 (1 + x (X'X)^{-1} x')

where s^2 is the estimate of the residual variance, x is the vector of explanatory variables for which we want to predict, and X are the explanatory variables used in the estimation; the standard error for an observation is the square root of this. The second part alone gives the variance of the predicted mean y_predicted = x beta_estimated.

wls_prediction_std uses the variance of the parameter estimate directly. Assuming x is fixed, y_predicted is just a linear transformation of the random variable beta_estimated, so the variance of y_predicted is x Cov(beta_estimated) x'. To this we still need to add the estimate of the error variance. As far as I remember, there are estimates that have better small-sample properties.

I added the weights, but never managed to verify them, so the function has remained in the sandbox for years. (Stata doesn't return prediction errors with weights.)

Aside: Using the covariance of the parameter estimate should also be correct if we use a sandwich robust covariance estimator, while Greene's formula above is only correct if we don't have any misspecified heteroscedasticity. What wls_prediction_std doesn't take into account is that, if we have a model for the heteroscedasticity, the error variance could also depend on the explanatory variables, i.e. on x.
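To make the algebra concrete, here is a minimal hedged sketch for the plain OLS case; it follows the formula above (variance of the predicted mean plus the residual variance) rather than reproducing the exact internals of wls_prediction_std, and the data is made up for illustration:

import numpy as np
import statsmodels.api as sm
from scipy import stats

# Hypothetical data and fit
rng = np.random.RandomState(0)
X = sm.add_constant(rng.randn(50, 2))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.randn(50)
res = sm.OLS(y, X).fit()

x_new = np.array([1.0, 0.5, -0.3])               # point at which to predict
y_hat = x_new @ res.params                       # predicted mean x beta_estimated

var_mean = x_new @ res.cov_params() @ x_new      # x Cov(beta_estimated) x'
var_pred = var_mean + res.scale                  # add the residual variance s^2
se_pred = np.sqrt(var_pred)                      # standard error for an observation

t = stats.t.ppf(0.975, res.df_resid)
print(y_hat - t * se_pred, y_hat + t * se_pred)  # 95% prediction interval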