Sklearn / scikit-learn: using the fit method - Python

How does the fit() method work in sklearn.preprocessing with the Imputer class?
What exactly does fit() do in the background, and why is it necessary for the code below?
Everywhere I look I see "fitting" - fitting what, with what, why, and how?
from sklearn.preprocessing import Imputer
impt = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)
impt = impt.fit(X[:,1:3])
X[:,1:3] = impt.transform(X[:,1:3])

The idea is to 'fit' your pre-processing on your training data only (just as you would your model). It will learn some state; for the imputer this might be the mean of your feature. Then, when you transform your test / validation data, you use that state (i.e. the mean in this case) to impute the new, unseen data. Using this design, it is really easy to avoid data leaks. Consider what happens if you impute on your entire dataset: the mean that you use for the imputation now incorporates some of the information from your supposedly unseen test data. This is a data leak: your test data is no longer truly unseen. Scikit-learn uses the fit / transform pattern to mitigate this common pitfall in machine learning.
Furthermore, because ALL sklearn transformers and estimators use this fit API, you can chain them together in a pipeline, making it possible to do all your pre-processing on each fold of a k-fold cross-validation, which would otherwise be a very fiddly, error-prone thing to get right.
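As a minimal sketch of that pattern (using SimpleImputer, which replaced the deprecated Imputer class in later scikit-learn versions; the data and pipeline here are invented for illustration):

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [4.0, np.nan]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# Inside cross_val_score, the pipeline re-fits the imputer on the
# training folds only, so the held-out fold never leaks into the means.
pipe = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
print(cross_val_score(pipe, X, y, cv=2))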

Imputer(missing_values = "NaN", strategy = "mean", axis = 0)
The above line creates an Imputer object which will impute/replace the missing values, denoted as NaNs, with the mean of the non-missing values in each column.
impt = impt.fit(X[:,1:3])
So it needs some data from which it can calculate the mean that will replace the missing values. This is normally done by the fit method, which computes the required values (the mean in this case). fit takes in some data to calculate these values, and this step is normally called the training phase.
impt.transform(X[:,1:3])
Once the values are calculated, they can be applied to new data presented to the imputer. In this case, it will replace the missing data with the mean calculated in the fit step. This is done via the transform method.
Sometimes one might want to run fit and transform on the same data. In such cases, instead of calling fit followed by transform, we can use the fit_transform method.
X[:,1:3] = impt.fit_transform(X[:,1:3])
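For example, a minimal sketch of the two call styles, using SimpleImputer (the modern replacement for Imputer) on made-up arrays:

import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [np.nan], [3.0]])
X_test = np.array([[np.nan], [5.0]])

impt = SimpleImputer(strategy="mean")

# Two steps: learn the mean (2.0 here) on the training data...
impt.fit(X_train)
# ...then fill NaNs in both sets using that same mean.
X_train_filled = impt.transform(X_train)
X_test_filled = impt.transform(X_test)

# One step, equivalent to fit followed by transform on the same data:
X_train_filled = impt.fit_transform(X_train)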

Well, the aim of fit in the preprocessing stage is to compute the necessary values (like the min and max of each variable). With those values scikit-learn can then preprocess your data, which it couldn't do before. It is also useful because you can then reuse your preprocessor object later.
You can also use fit_transform if you would like to do these two steps in one.
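For instance, a small sketch of reusing a fitted preprocessor later (the file name and data are made up; joblib is a common way to persist sklearn objects):

import numpy as np
import joblib
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(np.array([[0.0], [10.0]]))  # learns min=0 and max=10

joblib.dump(scaler, "scaler.joblib")   # persist the fitted state

# Later, possibly in another session: reload and reuse as-is.
scaler = joblib.load("scaler.joblib")
print(scaler.transform(np.array([[5.0]])))  # [[0.5]]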

Related

Is there any reason to do .fit() and .transform() instead of just .fit_transform()?

I just started learning ML and wondered why one would do .fit() and .transform() separately when .fit_transform() exists. Also, I am generally confused about what exactly fitting/.fit() does.
I assume you are talking about sklearn's scalers or sklearn's feature transformation algorithms in general.
Let's say your dataset is split into 5 sub-sets and you want to scale each of them between -1 and 1:
You fit your scaler once using fit; this basically searches for the maximum and minimum over the data you give it.
Then, you can scale each of your sub-sets consistently using transform.
If you had instead used fit_transform on the first sub-set and then again on the second one, the two would have been scaled differently, and you don't want that.
Moreover, instead of sub-sets, you can think of fitting once on your training set and keeping the transformation in memory to scale future samples you want to pass to your model.
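A minimal sketch of the difference (the arrays are invented; feature_range=(-1, 1) matches the example above):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

sub1 = np.array([[0.0], [10.0]])
sub2 = np.array([[2.0], [8.0]])

scaler = MinMaxScaler(feature_range=(-1, 1))
scaler.fit(sub1)                   # learns min=0, max=10 once

print(scaler.transform(sub1))      # [[-1.], [1.]]
print(scaler.transform(sub2))      # [[-0.6], [0.6]], same scale

# By contrast, fit_transform on each sub-set would stretch each one
# to the full [-1, 1] range independently, which is not what you want:
print(scaler.fit_transform(sub2))  # [[-1.], [1.]]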

Is StandardScaler() or scale in sklearn for scaling data better for supervised machine learning model? [duplicate]

I understand that scaling means centering the mean (mean=0) and making unit variance (variance=1).
But what is the difference between preprocessing.scale(x) and preprocessing.StandardScaler() in scikit-learn?
They do exactly the same thing, but:
preprocessing.scale(x) is just a function, which transforms some data
preprocessing.StandardScaler() is a class supporting the Transformer API
I would always use the latter, even if I did not need inverse_transform and co., which are supported by StandardScaler().
Excerpt from the docs:
The function scale provides a quick and easy way to perform this operation on a single array-like dataset
The preprocessing module further provides a utility class StandardScaler that implements the Transformer API to compute the mean and standard deviation on a training set so as to be able to later reapply the same transformation on the testing set. This class is hence suitable for use in the early steps of a sklearn.pipeline.Pipeline
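A quick sketch of the equivalence on made-up data:

import numpy as np
from sklearn.preprocessing import StandardScaler, scale

X = np.array([[1.0], [2.0], [3.0]])

# Function: one-shot standardization, no state is kept.
print(scale(X))

# Transformer: the same numbers, but the mean/std are stored so the
# identical transformation can be reapplied to a test set later.
scaler = StandardScaler()
print(scaler.fit_transform(X))
print(scaler.mean_, scaler.scale_)  # state learned from X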
Note that neither of them does min-max scaling: both scale and StandardScaler standardize the data to zero mean and unit variance. Scaling to a fixed range such as [-1, 1] is what MinMaxScaler does.

Using cross_val_predict for predictions

I have the following code where I want to use k-fold cross validation for a Linear Regression model:
import pandas
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.metrics import mean_squared_error

kf = KFold(n_splits=100)
predi = cross_val_predict(model, train[columns], train[target], cv=kf)
predi = pandas.Series(predi)
model.fit(train[columns], train[target])
pred_test = model.predict(test[columns])
print(mean_squared_error(pred_test, test[target]))
However, I am not sure whether the code does what I would like it to do. Specifically, I am not sure about the model.fit part. Does it even use the cross-validation?
The reason I am not sure is that calculating it like this yields worse results than without cross-validation.
No. Cross-validation is just for checking the performance of the model on the data (or rather on different parts of it).
When you call fit(), it fits on the whole data supplied at that time, whereas cross-validation only uses parts of the data (leaving one fold out in each iteration). This difference in data may cause the estimator to perform better or worse.
model.fit doesn't have any functionality to divide the data. It just works on the cost-function minimization problem and creates a model (i.e. finds the parameters).
Also, if you think that by creating a loop, dividing the data on every iteration, and calling model.fit again and again you will get a more generalized model, that is not possible: calling fit a second time on a linear regression model object makes it forget the old data.
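A sketch of the usual pattern (the data here is invented): use cross-validation only to estimate performance, then fit once on all of the training data for the final model.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

model = LinearRegression()

# Estimates generalization error; does NOT leave a fitted model behind.
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5),
                         scoring="neg_mean_squared_error")
print(-scores.mean())

# The final model is fit separately, on all of the training data.
model.fit(X, y)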

Getting 'ValueError: shapes not aligned' on SciKit Linear Regression

Quite new to SciKit and linear algebra/machine learning with Python in general, so I can't seem to solve the following:
I have a training set and a test set of data, containing both continuous and discrete/categorical values. The CSV files are loaded into Pandas DataFrames that match in width, with shapes (1460, 81) and (1459, 81).
However, after using Pandas' get_dummies, the shapes of the DataFrames change to (1460, 306) and (1459, 294). So, when I do linear regression with the SciKit Linear Regression module, it builds a model for 306 variables and then tries to predict with only 294. This, naturally, leads to the following error:
ValueError: shapes (1459,294) and (306,1) not aligned: 294 (dim 1) != 306 (dim 0)
How could I tackle such a problem? Could I somehow reshape the (1459, 294) to match the other one?
Thanks and I hope I've made myself clear :)
This is an extremely common problem when dealing with categorical data. There are differing opinions on how to best handle this.
One possible approach is to apply a function to categorical features that limits the set of possible options. For example, if your feature contained the letters of the alphabet, you could encode features for A, B, C, D, and 'Other/Unknown'. In this way, you could apply the same function at test time and abstract from the issue. A clear downside, of course, is that by reducing the feature space you may lose meaningful information.
Another approach is to build a model on your training data, with whichever dummies are naturally created, and treat that as the baseline for your model. When you predict with the model at test time, you transform your test data in the same way your training data is transformed. For example, if your training set had the letters of the alphabet in a feature, and the same feature in the test set contained a value of 'AA', you would ignore that in making a prediction. This is the reverse of your current situation, but the premise is the same. You need to create the missing features on the fly. This approach also has downsides, of course.
The second approach is what you mention in your question, so I'll go through it with pandas.
By using get_dummies you're encoding the categorical features into multiple one-hot encoded features. What you could do is force your test data to match your training data by using reindex, like this:
test_encoded = pd.get_dummies(test_data, columns=['your columns'])
test_encoded_for_model = test_encoded.reindex(columns=training_encoded.columns,
                                              fill_value=0)
This will encode the test data in the same way as your training data, filling in 0 for dummy features that were created from the training data but not from the test data.
You could just wrap this into a function, and apply it to your test data on the fly. You don't need the encoded training data in memory (which I access with training_encoded.columns) if you create an array or list of the column names.
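A minimal sketch of such a wrapper (the column names are made up; note that only the list of training column names is needed, not the encoded training frame itself):

import pandas as pd

def encode_for_model(df, categorical_cols, training_columns):
    """One-hot encode df and align it to the training feature columns."""
    encoded = pd.get_dummies(df, columns=categorical_cols)
    # reindex drops dummy columns unseen in training and adds any
    # missing ones, filled with 0.
    return encoded.reindex(columns=training_columns, fill_value=0)

# Usage sketch:
# train_encoded = pd.get_dummies(train_data, columns=['letter'])
# test_ready = encode_for_model(test_data, ['letter'], train_encoded.columns)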
For anyone interested: I ended up merging the train and test sets, then generating the dummies, and then splitting the data again at exactly the same fraction. That way there was no issue with different shapes anymore, since exactly the same dummy columns were generated.
This worked for me:
Initially, I was getting this error message:
shapes (15754,3) and (4,) not aligned
I found out that I was creating a model using 3 variables in my train data, but when I added a constant with X_train = sm.add_constant(X_train), a constant column was created automatically. So, in total, there were now 4 variables.
When you test this model, the test data by default still has only 3 variables, so the error pops up because of the dimension mismatch.
So, I used the trick of adding the constant column to X_test as well:
X_test = sm.add_constant(X_test)
Though the constant is a trivial column, it resolves the whole issue.
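A minimal sketch of the full flow with statsmodels (the data is made up):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = X_train @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=100)
X_test = rng.normal(size=(20, 3))

# add_constant adds the intercept column, so the design matrix has
# 4 columns; the test matrix must get the same treatment.
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

results = sm.OLS(y_train, X_train).fit()
pred = results.predict(X_test)  # shapes now align: (20, 4) and (4,)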

Set the weights of decision functions through stdin in Sklearn

Is there a method that I can input the coefficients to the clf of SVC in my script, then apply clf.score() or clf.predict() function for further test?
Currently I am using joblib.dump(clf, 'file.pkl') to save all the information of a trained clf, but this involves disk writing/reading. It would be helpful if I could just define a clf from a few arrays representing the support vectors (clf.support_vectors_), weights (clf.coef_/clf.dual_coef_), and bias (clf.intercept_) respectively.
The prediction in SVC ultimately calls the prediction function from libsvm. The call looks like this (but please take a look at the whole function _dense_predict):
libsvm.predict(
    X, self.support_, self.support_vectors_, self.n_support_,
    self.dual_coef_, self._intercept_,
    self.probA_, self.probB_, svm_type=svm_type, kernel=kernel,
    degree=self.degree, coef0=self.coef0, gamma=self._gamma,
    cache_size=self.cache_size)
You can use this call and give it all the relevant information directly to obtain a raw prediction. In order to do this, you must import libsvm with from sklearn.svm import libsvm. If your initial fitted classifier is called svc, then you can obtain all the relevant information from it by replacing all the self keywords with svc and keeping the values. If svc._impl gives you "c_svc", then you set svm_type=0.
Note that at the beginning of the _dense_predict function there is X = self._compute_kernel(X). If your data is X, then you need to transform it by doing K = svc._compute_kernel(X), and call the libsvm.predict function with K as the first argument.
Scoring is independent of all this. Take a look at sklearn.metrics, where you will find e.g. the accuracy_score, which is the default score in SVM.
This is of course a somewhat suboptimal way of doing things, but in this specific case, if it is impossible (I didn't check very hard) to set the coefficients directly, then going into the code, seeing what it does, and extracting the relevant part is surely an option.
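Alternatively, for a binary SVC you can compute the decision function yourself from those stored arrays, without going through libsvm at all. A minimal sketch, assuming an RBF kernel and a fitted classifier svc (the helper function name is made up):

import numpy as np

def rbf_decision(X, support_vectors, dual_coef, intercept, gamma):
    """Decision values of a binary RBF-kernel SVC from its raw arrays."""
    # Squared distances between each input and each support vector.
    d2 = ((X[:, None, :] - support_vectors[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * d2)
    # f(x) = sum_j dual_coef_j * K(sv_j, x) + b
    return K @ dual_coef.ravel() + intercept[0]

# Usage sketch with the arrays taken from a fitted classifier svc:
# dec = rbf_decision(X_new, svc.support_vectors_, svc.dual_coef_,
#                    svc.intercept_, svc._gamma)
# labels = svc.classes_[(dec > 0).astype(int)]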
Check out this blog post on memory usage of sklearn models using succinct tries to see if it is applicable.
If the other location does not have access to the sklearn packages, you would need to create your own score and predict functions. clf.score() and clf.predict() require clf to be an sklearn object.
