Scikit-learn: Parallelize stochastic gradient descent - python

I have a fairly large training matrix (over 1 billion rows, two features per row). There are two classes (0 and 1).
This is too large for a single machine, but fortunately I have about 200 MPI hosts at my disposal. Each is a modest dual-core workstation.
Feature generation is already successfully distributed.
The answers in Multiprocessing scikit-learn suggest it is possible to distribute the work of an SGDClassifier:
You can distribute the data sets across cores, do partial_fit, get the weight vectors, average them, distribute them to the estimators, do partial fit again.
When I have run partial_fit for the second time on each estimator, where do I go from there to get a final aggregate estimator?
My best guess was to average the coefs and the intercepts again and make an estimator with those values. The resulting estimator gives a different result than an estimator constructed with fit() on the entire data.
Details
Each host generates a local matrix and a local vector: n rows of the training set and the corresponding n target values.
Each host uses the local matrix and local vector to make an SGDClassifier and do a partial fit. Each then sends the coef vector and the intercept to root. Root averages these and sends them back to the hosts. The hosts do another partial_fit and send the coef vector and the intercept to root.
Root constructs a new estimator with these values.
from mpi4py import MPI
import numpy as np
from sklearn import linear_model

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

local_matrix = get_local_matrix()
local_vector = get_local_vector()

estimator = linear_model.SGDClassifier()
estimator.partial_fit(local_matrix, local_vector, classes=[0, 1])

average_coefs = None
avg_intercept = None

# First aggregation: every non-root rank sends its coefficients and intercept
# to root; root averages them (including its own).
if rank > 0:
    comm.send((estimator.coef_, estimator.intercept_), dest=0, tag=rank)
else:
    pairs = [comm.recv(source=r, tag=r) for r in range(1, size)]
    pairs.append((estimator.coef_, estimator.intercept_))
    average_coefs = np.average([a[0] for a in pairs], axis=0)
    avg_intercept = np.average([a[1][0] for a in pairs])

# Broadcast the averaged parameters and load them into every local estimator.
estimator.coef_ = comm.bcast(average_coefs, root=0)
estimator.intercept_ = np.array([comm.bcast(avg_intercept, root=0)])

# Second round of partial_fit on the local data.
estimator.partial_fit(local_matrix, local_vector, classes=[0, 1])

# Second aggregation: average again and install the result on root.
if rank > 0:
    comm.send((estimator.coef_, estimator.intercept_), dest=0, tag=rank)
else:
    pairs = [comm.recv(source=r, tag=r) for r in range(1, size)]
    pairs.append((estimator.coef_, estimator.intercept_))
    average_coefs = np.average([a[0] for a in pairs], axis=0)
    avg_intercept = np.average([a[1][0] for a in pairs])
    estimator.coef_ = average_coefs
    estimator.intercept_ = np.array([avg_intercept])
    print("The estimator at rank 0 should now be working")
Thank you!

Training a linear model on a dataset with 1e9 samples and only 2 features is very likely to either underfit or waste CPU / IO time if the data is actually linearly separable. Don't waste time thinking about parallelizing such a problem with a linear model:
either switch to a more complex class of models (e.g. train random forests on smaller partitions of the data that fit in memory and aggregate them),
or select random subsamples of your dataset of increasing size and train linear models on them. Measure the predictive accuracy on a held-out test set and stop when you see diminishing returns (probably after a few tens of thousands of samples of the minority class); a rough sketch follows.
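A rough sketch of that second option (purely illustrative; load_subsample(n) is a placeholder for however you draw n random labelled rows from your distributed feature store):
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

X_test, y_test = load_subsample(50_000)    # fixed held-out evaluation set

for n in [10_000, 30_000, 100_000, 300_000, 1_000_000]:
    X_train, y_train = load_subsample(n)
    clf = SGDClassifier(random_state=0)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print("n=%d held-out accuracy: %.4f" % (n, acc))
    # stop increasing n once the accuracy curve flattens out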

What you are experiencing is normal and expected. First is the fact that using SGD means you are never going to hit an exact result. You will quickly converge towards the optimal solution (since this is a convex problem) and then hover around that area for the remainder. Different runs with the whole data set alone should produce slightly different results each time.
where do I go from there to get a final aggregate estimator?
In theory, you would just keep doing that over and over until you are happy with convergence. Completely unnecessary for what you are doing. Other systems switch to using more sophisticated methods (like L-BFGS) to converge towards the final solution now that they have a good "warm start" on the solution. However this will not get you any drastic gains in accuracy (think maybe getting a whole percentage point if you are lucky) - so don't consider it a make or break. Consider it what it is, a fine tuning.
Second is the fact that linear models don't parallelize well. Despite the claims of vowpalwabbit and some other libraries, you are not going to get linear scaling out of training a linear model in parallel. Simply averaging the intermediate results is a bad way to parallelize such a system, and sadly that's about as good as it gets for training linear models in parallel.
The fact is, you only have 2 features. You should be able to easily train far more complicated models using only a smaller subset of your data. 1 billion rows is overkill for just 2 features.
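As a sketch of that idea, assuming X_sub and y_sub are a random subsample that fits in one machine's memory, a non-linear model can be cross-validated directly:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# A more expressive model on a manageable subsample of the 2-feature data.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
scores = cross_val_score(clf, X_sub, y_sub, cv=5, scoring="accuracy")
print("CV accuracy: %.4f +/- %.4f" % (scores.mean(), scores.std()))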

Related

Correct way of normalizing and scaling the MNIST dataset

I've looked everywhere but couldn't quite find what I want. Basically the MNIST dataset has images with pixel values in the range [0, 255]. People say that in general, it is good to do the following:
Scale the data to the [0,1] range.
Normalize the data to have zero mean and unit standard deviation (data - mean) / std.
Unfortunately, no one ever shows how to do both of these things. They all subtract a mean of 0.1307 and divide by a standard deviation of 0.3081. These values are basically the mean and the standard deviation of the dataset divided by 255:
from torchvision.datasets import MNIST

trainset = MNIST(root='./data', train=True, download=True)
print('Min Pixel Value: {} \nMax Pixel Value: {}'.format(trainset.data.min(), trainset.data.max()))
print('Mean Pixel Value {} \nPixel Values Std: {}'.format(trainset.data.float().mean(), trainset.data.float().std()))
print('Scaled Mean Pixel Value {} \nScaled Pixel Values Std: {}'.format(trainset.data.float().mean() / 255, trainset.data.float().std() / 255))
This outputs the following
Min Pixel Value: 0
Max Pixel Value: 255
Mean Pixel Value 33.31002426147461
Pixel Values Std: 78.56748962402344
Scaled Mean: 0.13062754273414612
Scaled Std: 0.30810779333114624
However, clearly this does neither of the above! The resulting data (1) will not be in [0, 1], and (2) will not have mean 0 or std 1. In fact, this is what we are doing:
[data - (mean / 255)] / (std / 255)
which is very different from this
[(scaled_data) - (mean/255)] / (std/255)
where scaled_data is just data / 255.
I may have stumbled upon this a little too late, but hopefully I can help a little bit.
Assuming that you are using torchvision.transforms, the following code can be used to normalize the MNIST dataset.
import torch
from torchvision import datasets, transforms

train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=64, shuffle=True)
Usually, transforms.ToTensor() turns input data in the range [0, 255] into a (C, H, W) tensor and automatically scales it to the range [0, 1] (this is the "scale to [0, 1]" step).
Therefore, the mean and std passed to transforms.Normalize(...) are those of the already-scaled data, 0.1307 and 0.3081, and applying them gives the data zero mean and unit standard deviation (this is the "normalize" step).
Please refer to the link below for better explanation.
https://pytorch.org/vision/stable/transforms.html
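As a quick sanity check (not from the original answer; it just composes the two transforms on the standard MNIST download), the result should indeed have roughly zero mean and unit standard deviation:
import torch
from torchvision import datasets, transforms

tf = transforms.Compose([
    transforms.ToTensor(),                      # [0, 255] uint8 -> [0.0, 1.0] float
    transforms.Normalize((0.1307,), (0.3081,))  # then (x - 0.1307) / 0.3081
])
trainset = datasets.MNIST('./data', train=True, download=True, transform=tf)
x = torch.stack([img for img, _ in trainset])   # shape (60000, 1, 28, 28)
print(x.mean().item(), x.std().item())          # approximately 0 and 1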
I think you misunderstand one critical concept: these are two different, and inconsistent, scaling operations. You can have only one of the two:
mean = 0, stdev = 1
data range [0,1]
Think about it, considering the [0,1] range: if the data are all small positive values, with min=0 and max=1, then the sum of the data must be positive, giving a positive, non-zero mean. Similarly, the stdev cannot be 1 when none of the data can possibly be as much as 1.0 different from the mean.
Conversely, if you have mean=0, then some of the data must be negative.
You use only one of the two transformations. Which one you use depends on the characteristics of your data set, and -- ultimately -- which one works better for your model.
For the [0,1] scaling, you simply divide by 255.
For the mean=0, stdev=1 scaling, you perform the simple linear transformation you already know:
new_val = (old_val - old_mean) / old_stdev
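A minimal sketch of the two (mutually exclusive) options in NumPy, using a random array as a stand-in for your pixel data:
import numpy as np

data = np.random.randint(0, 256, size=(1000, 28, 28)).astype(np.float64)

# Option 1: rescale to the [0, 1] range
scaled = data / 255.0

# Option 2: standardize to mean 0, std 1
standardized = (data - data.mean()) / data.std()

print(scaled.min(), scaled.max())               # 0.0 ... 1.0
print(standardized.mean(), standardized.std())  # ~0.0, ~1.0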
Does that clarify it for you, or have I entirely missed your point of confusion?
Purpose
Two of the most important reasons for feature scaling are:
You scale features to make them all of the same magnitude (i.e. importance or weight).
Example:
Consider a dataset with two features, Age and Weight, where the ages are in years and the weights are in grams. A 20-year-old who weighs only 60 kg becomes the vector [20, 60000], and so on for the whole dataset. The Weight feature will dominate the training process. How much it dominates depends on the type of algorithm you are using; some are more sensitive than others. For example, in a neural network, gradient descent is affected by the magnitude of the network's weights (thetas), which in turn vary with the magnitude of the inputs (the features) during training; feature scaling also improves convergence. Another example is K-Means clustering, which requires features of comparable magnitude since it is isotropic in all directions of space.
You scale features to speed up execution time.
This is straightforward: all the matrix multiplications and parameter summations are faster and better behaved with small numbers than with very large ones (or with the very large numbers produced by multiplying features by other parameters, etc.).
Types
The most popular types of feature scalers can be summarized as follows (a short comparative sketch appears after this list):
StandardScaler: usually your first option and very commonly used. It standardizes the data (i.e. centers and scales it) to mean 0 and standard deviation 1. It is affected by outliers and works best if your data have a roughly Gaussian distribution.
MinMaxScaler: usually used when you want to bring all your data points into a specific range (e.g. [0, 1]). It is heavily affected by outliers simply because it uses the range.
RobustScaler: "robust" against outliers because it scales the data according to the interquartile range. Note, however, that the outliers themselves will still be present in the scaled data.
MaxAbsScaler: mainly used for sparse data.
Normalizer (unit normalization): scales each sample's vector to unit norm, independently of the distribution of the samples.
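A short comparative sketch (purely illustrative) of how these scalers behave on the toy Age/Weight data from the example above:
import numpy as np
from sklearn.preprocessing import (StandardScaler, MinMaxScaler, RobustScaler,
                                   MaxAbsScaler, Normalizer)

# Ages in years, weights in grams (deliberately on very different scales).
X = np.array([[20.0, 60000.0],
              [35.0, 82000.0],
              [50.0, 75000.0],
              [65.0, 68000.0]])

for scaler in [StandardScaler(), MinMaxScaler(), RobustScaler(),
               MaxAbsScaler(), Normalizer()]:
    print(scaler.__class__.__name__)
    print(scaler.fit_transform(X))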
Which One & How Many
You need to get to know your dataset first. As mentioned above, there are things to look at beforehand, such as the distribution of the data, the existence of outliers, and the algorithm being used.
In any case, you need one scaler per dataset, unless a specific requirement dictates otherwise, for example an algorithm that only works if the data are within a certain range and have zero mean and unit standard deviation at the same time. Nevertheless, I have never come across such a case.
Key Takeaways
There are different types of Feature Scalers that are used based on some rules of thumb mentioned above.
You pick one Scaler based on the requirements, not randomly.
You scale data for a purpose, for example, in the Random Forest Algorithm you do NOT usually need to scale.
Well, the data gets scaled to [0, 1] by torchvision.transforms.ToTensor(), and then the normalization (0.1307, 0.3081) is applied.
You can read about it in the PyTorch documentation: https://pytorch.org/vision/stable/transforms.html.
Hope that answers your question.

python sklearn: what is the difference between "sklearn.preprocessing.normalize(X, norm='l2')" and "sklearn.svm.LinearSVC(penalty='l2')"

Here are two methods that involve an "l2" norm:
1. This one is used in data pre-processing: sklearn.preprocessing.normalize(X, norm='l2')
2. The other is used in the classifier itself: sklearn.svm.LinearSVC(penalty='l2')
I want to know: what is the difference between them? Must both steps be used in a complete model, or is it enough to use just one of them?
These two are different things, and you normally need both to build a good SVC model.
1) The first one scales (normalizes) the X data matrix by dividing each sample (each row, by default) by its L2 norm, which is simply sqrt(sum(abs(X[i, :])**2)) for row i (pass axis=0 if you want to normalize columns instead). This keeps the values in any one sample from becoming too large, which can make it hard for some algorithms to converge.
2) Irrespective of how scaled (and small in value) your data is, there may still be outliers or features that are far too dominant, and your algorithm (LinearSVC()) may over-trust them when it shouldn't. This is where L2 regularization comes into play: besides the loss the algorithm minimizes, a penalty is applied to the coefficients so that they don't become too large. In other words, the squared coefficients become an additional cost in the SVC objective. How strong that penalty is relative to the data-fitting term is controlled by the parameter C (in scikit-learn, a smaller C means stronger regularization).
To sum up: the first tells you by what value to divide each row of the X matrix; the second controls how much the coefficients themselves burden the cost function.
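A minimal sketch of both on a tiny made-up dataset, to make the distinction concrete (Normalizer is the transformer form of preprocessing.normalize, so it can sit inside a pipeline):
import numpy as np
from sklearn.preprocessing import normalize, Normalizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
y = np.array([0, 0, 1, 1])

# 1) Pre-processing: scale each sample (row) to unit L2 norm.
X_norm = normalize(X, norm='l2')

# 2) Model: L2 penalty on the coefficients, strength controlled by C.
clf = make_pipeline(Normalizer(norm='l2'),
                    LinearSVC(penalty='l2', C=1.0, dual=False))
clf.fit(X, y)
print(clf.predict(X))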

How to change parameters of a scikit learn function dynamically i.e. find best parameter

I am trying to do dimensionality reduction using the PCA function of sklearn, specifically
from sklearn.decomposition import PCA

def mypca(X, comp):
    pca = PCA(n_components=comp)
    pca.fit(X)
    PCA(copy=True, n_components=comp, whiten=False)  # note: this creates a new PCA object and immediately discards it
    Xpca = pca.fit_transform(X)
    return Xpca

for n_comp in range(10, 1000, 20):
    Xpca = mypca(X, n_comp)  # X is a 2 dimensional array
    print(Xpca)
I am calling the mypca function from a loop with different values for comp. I am doing this in order to find the best value of comp for the problem I am trying to solve. But the mypca function always returns the same value for Xpca, irrespective of the value of comp.
The value it returns is correct for the first value of comp I send from the loop, i.e. the Xpca value it returns each time is the correct one for comp = 10 in my case.
What should I do in order to find the best value of comp?
You are using PCA to reduce the dimensionality.
From your code:
for n_comp in range(10, 1000, 20):
    Xpca = mypca(X, n_comp)  # X is a 2 dimensional array
    print(Xpca)
Your input dataset X is only a 2-dimensional array; the minimum n_comp is 10, so PCA tries to find the 10 best dimensions for you. Since 10 > 2, you will always get the same answer. :)
It looks like you're trying to pass different values for number of components, and re-fit with each. A great thing about PCA is that it's actually not necessary to do this. You can fit the full number of components (even as many components as dimensions in your dataset), then simply discard the components you don't want (i.e. those with small variance). This is equivalent to re-fitting the entire model with fewer components. Saves a lot of computation.
How to do it:
from sklearn.decomposition import PCA

# x = input data, size(<points>, <dimensions>)
# fit the full model
max_components = x.shape[1] # as many components as input dimensions
pca = PCA(n_components=max_components)
pca.fit(x)
# transform the data (contains all components)
y_all = pca.transform(x)
# keep only the top k components (with greatest variance)
k = 2
y = y_all[:, 0:k]
In terms of how to select the number of components, it depends what you want to do. One standard way of choosing the number of components k is to look at the fraction of variance explained (R^2) by each choice of k. If your data is distributed near a low-dimensional linear subspace, then when you plot R^2 vs. k, the curve will have an 'elbow' shape. The elbow will be located at the dimensionality of the subspace. It's good practice to look at this curve because it helps understand the data. Even if there's no clean elbow, it's common to choose a threshold value for R^2, e.g. to preserve 95% of the variance.
Here's how to do it (this should be done on the model with max_components components):
# Calculate fraction of variance explained
# for each choice of number of components
r2 = pca.explained_variance_.cumsum() / x.var(0).sum()
Another way you might want to proceed is to take the PCA-transformed data and feed it to a downstream algorithm (e.g. classifier/regression), then select your number of components based on the performance (e.g. using cross validation).
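A hedged sketch of that last approach (it assumes X and y are your data and labels, and that X has at least as many features as the largest n_components tried):
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Cross-validate the number of components against downstream performance.
pipe = Pipeline([('pca', PCA()), ('clf', LogisticRegression(max_iter=1000))])
param_grid = {'pca__n_components': [10, 30, 50, 100, 200]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)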
Side note: Maybe just a formatting issue, but your code block in mypca() should be indented, or it won't be interpreted as part of the function.

scikit-learn PCA: matrix transformation produces PC estimates with flipped signs

I'm using scikit-learn to perform PCA on this dataset. The scikit-learn documentation states that
Due to implementation subtleties of the Singular Value Decomposition (SVD), which is used in this implementation, running fit twice on the same matrix can lead to principal components with signs flipped (change in direction). For this reason, it is important to always use the same estimator object to transform data in a consistent fashion.
The problem is that I don't think that I'm using different estimator objects, but the signs of some of my PCs are flipped, when compared to results in SAS's PROC PRINCOMP procedure.
For the first observation in the dataset, the SAS PCs are:
PC1 PC2 PC3 PC4 PC5
2.0508 1.9600 -0.1663 0.2965 -0.0121
From scikit-learn, I get the following (which are very close in magnitude):
PC1 PC2 PC3 PC4 PC5
-2.0536 -1.9627 -0.1666 -0.297 -0.0122
Here's what I'm doing:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
sourcef = pd.read_csv('C:/mydata.csv')
frame = pd.DataFrame(sourcef)
# Some pandas evals, regressions, etc... that I'm not showing
# but not affecting the matrix
# Make sure we are working with the proper data -- drop the response variable
cols = [col for col in frame.columns if col not in ['response']]
# Separate out the data matrix from the response variable vector
# into numpy arrays
frame2_X = frame[cols].values
frame2_y = frame['response'].values
# Standardize the values
X_means = np.mean(frame2_X,axis=0)
X_stds = np.std(frame2_X,axis=0)
y_mean = np.mean(frame2_y)
y_std = np.std(frame2_y)
frame2_X_stdz = np.copy(frame2_X)
frame2_y_stdz = frame2_y.astype(np.float32, copy=True)
for (x,y), value in np.ndenumerate(frame2_X_stdz):
frame2_X_stdz[x][y] = (value - X_means[y])/X_stds[y]
for index, value in enumerate(frame2_y_stdz):
frame2_y_stdz[index] = (float(value) - y_mean)/y_std
# Show the first 5 elements of the standardized values, to verify
print(frame2_X_stdz[:,0][:5])
# Show the first 5 lines from the standardized response vector, to verify
print(frame2_y_stdz[:5])
Those check out ok:
[ 0.9508 -0.5847 -0.2797 -0.4039 -0.598 ]
[ 1.0726 -0.5009 -0.0942 -0.1187 -0.8043]
Continuing on...
# Create a PCA object
pca = PCA()
pca.fit(frame2_X_stdz)
# Create the matrix of PC estimates
pca.transform(frame2_X_stdz)
Here's the output of the last step:
Out[16]: array([[-2.0536, -1.9627, -0.1666, -0.297 , -0.0122],
[ 1.382 , -0.382 , -0.5692, -0.0257, -0.0509],
[ 0.4342, 0.611 , 0.2701, 0.062 , -0.011 ],
...,
[ 0.0422, 0.7251, -0.1926, 0.0089, 0.0005],
[ 1.4502, -0.7115, -0.0733, 0.0013, -0.0557],
[ 0.258 , 0.3684, 0.1873, 0.0403, 0.0042]])
I've tried it by replacing the pca.fit() and pca.transform() with pca.fit_transform(), but I end up with the same results.
What am I doing wrong here that I'm getting PCs with the signs flipped?
You're doing nothing wrong.
What the documentation is warning you about is that repeated calls to fit may yield different principal components - not how they relate to another PCA implementation.
Having a flipped sign on all components doesn't make the result wrong - the result is right as long as it fulfills the definition (each component is chosen such that it captures the maximum amount of variance in the data). As it stands, it seems the projection you got is simply mirrored - it still fulfills the definition, and is, thus, correct.
If, beneath correctness, you're worried about consistency between implementations, you can simply multiply the components by -1, when it's necessary.
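For example, here is a small illustrative sketch (not part of the original answer) that flips each fitted component so its largest-magnitude loading is positive, giving a deterministic orientation; applying the same convention to the other implementation's loadings should make the scores directly comparable:
import numpy as np

def fix_signs(pca, X):
    """Flip each fitted component so its largest-magnitude loading is positive."""
    idx = np.argmax(np.abs(pca.components_), axis=1)           # extreme loading per component
    signs = np.sign(pca.components_[np.arange(len(idx)), idx])
    signs[signs == 0] = 1.0                                     # guard against an exact zero
    pca.components_ *= signs[:, np.newaxis]                     # flip rows in place
    return pca.transform(X)                                     # scores follow the fixed orientation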
SVD decompositions are not guaranteed to be unique - only the singular values are guaranteed to be identical, and different implementations of svd() can produce different signs. Any of the eigenvectors can have flipped signs, and will produce identical results when transformed, then transformed back into the original space. Most algorithms in sklearn which use SVD decomposition use the function sklearn.utils.extmath.svd_flip() to correct this and enforce an identical convention across algorithms. For historical reasons, PCA() never got this fix (though maybe it should...)
In general, this is not something to worry about - just a limitation of the SVD algorithm as typically implemented.
On an additional note, I find assigning importance to PC weights (and parameter weights in general) dangerous, because of exactly these kinds of issues. Numerical/implementation details should not influence your analysis results, but many times it is hard to tell what is a result of the data, and what is a result of the algorithms you use for exploration. I know this is a homework assignment, not a choice, but it is important to keep these things in mind!

Pseudoexperiments in PyMC

Is it possible to perform "pseudoexperiments" using PyMC?
By pseudoexperiments, I mean generating random "observations" by sampling from the prior, and then, given each pseudoexperiment, drawing samples from the posterior. Afterwards, one would compare the trace for each parameter to the sample (obtained from the prior) used in sampling from the posterior.
A more concrete example: Suppose that I want to know the rate of process X. I count how many occurrences there are in a certain period of time. However, I know that process Y also sometimes occurs and will contaminate my count. The rate of process Y is known with some uncertainty. So, I build a model, include my observations, and sample from the posterior:
import pymc

class mymodel:
    rate_x = pymc.Uniform('rate_x', lower=0, upper=100)
    rate_y = pymc.Normal('rate_y', mu=150, tau=1./(15**2))
    total_rate = pymc.LinearCombination('total_rate', [1, 1], [rate_x, rate_y])
    data = pymc.Poisson('data', mu=total_rate, value=193, observed=True)

Mod = pymc.Model(mymodel)
MCMC = pymc.MCMC(Mod)
MCMC.sample(100000, burn=5000, thin=5)
print(MCMC.stats()['rate_x']['quantiles'])
However, before I do my experiment (or before I "unblind" my analysis and look at my data), I would like to know how sensitive I expect to be -- what will be the uncertainty on my measurement of rate_x?
To answer this, I could sample from the prior
Mod.draw_from_prior()
but this only samples rate_x, rate_y, and calculates total_rate. But once the values of those are set by draw_from_prior(), I can draw a pseudoexperiment:
Mod.data.random()
This just returns a number, so I have to set the value of Mod.data to a random sample. Because Mod.data has the observed flag set, I have to also "force" it:
Mod.data.set_value(Mod.data.random(), force=True)
Now I can sample from the posterior again
MCMC.sample(100000, burn=500, thin=5)
print(MCMC.stats()['rate_x']['quantiles'])
All this works, so I suppose the simple answer to my question is "yes". But it feels very hacky. Is there a better or more natural way to accomplish this?
