I'm quite new to scikit-learn and I have a question about the fit() function. I tried to look for information on the internet but couldn't find much.
In an assignment I have to create a dict of parameters passed to the fit function of a classifier, which means the function will take 3 arguments (X, y, kwargs). What parameters is this dictionary supposed to contain? Apparently those are hyperparameters for the fit function. Online I only found information for XGBoost, but I'm not supposed to use that, only classifiers from sklearn.
I also found online that fit can take a dictionary called **fit_params but there is nothing about the parameters the function might take.
I hope my question is clear, thanks a lot in advance!
The model hyperparameters are not arguments to the fit function, but to the constructor of the model class, which you need to create beforehand.
If you have a dictionary with parameters that you want to pass to your model, you need to do things this way (here with a Logistic Regression):
from sklearn.linear_model import LogisticRegression
params = {"C":10, "max_iter":200}
LR = LogisticRegression(**params)
Now that you have created the model specifying the hyperparameters, you can proceed and fit it with your data.
LR.fit(X, y)
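If the assignment really does want a dictionary unpacked into fit() itself, note that fit() only accepts the keyword arguments in its signature; for most sklearn classifiers that is just sample_weight. A minimal sketch with made-up toy data:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(20, 2)
y = np.random.randint(0, 2, size=20)

model_params = {"C": 10, "max_iter": 200}        # goes to the constructor
fit_params = {"sample_weight": np.ones(len(y))}  # goes to fit()

clf = LogisticRegression(**model_params)
clf.fit(X, y, **fit_params)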
I haven't used scikit-learn before, but you can get the docs of a function you are unsure about through its __doc__ attribute. The fit() method of an SVC estimator has this as its __doc__:
Fit the SVM model according to the given training data.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_samples)
Training vectors, where n_samples is the number of samples
and n_features is the number of features.
For kernel="precomputed", the expected shape of X is
(n_samples, n_samples).
y : array-like of shape (n_samples,)
Target values (class labels in classification, real numbers in
regression)
sample_weight : array-like of shape (n_samples,), default=None
Per-sample weights. Rescale C per sample. Higher weights
force the classifier to put more emphasis on these points.
Returns
-------
self : object
Notes
-----
If X and y are not C-ordered and contiguous arrays of np.float64 and
X is not a scipy.sparse.csr_matrix, X and/or y may be copied.
If X is a dense array, then the other methods will not support sparse
matrices as input.
I ran this to get that output:
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)
print(clf.fit.__doc__)
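As an aside, the built-in help() renders the same docstring along with the signature:

help(clf.fit)  # prints the signature plus the docstring quoted above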
I'm fitting some data for a classification task using Gaussian Process Classifiers in sklearn. I know that for the Gaussian Process Regressor one can pass return_std in
y_test, std = gp.predict(x_test, return_std=True)
to output the standard deviation of the test sample (like in this question).
However, I couldn't find such a parameter for the GP Classifier.
Is there such a thing as outputting the predictive mean and standard deviation of test data from a GP classifier? And is there a way to output the posterior mean and covariance of the fitted model?
There is no standard deviation for categorical data, hence there is no return_std parameter in the classifier.
However, if you want to quantify the uncertainty of the classifier's predictions, you could use the .predict_proba(X) method. Once you get the probabilities of each possible class, you can compute the entropy of the predicted distribution.
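As a sketch (the toy data here is made up, only the sklearn/scipy calls are real), computing per-sample entropy from predict_proba could look like this:

import numpy as np
from scipy.stats import entropy
from sklearn.gaussian_process import GaussianProcessClassifier

# Toy binary data, just to make the sketch runnable.
rng = np.random.RandomState(0)
X = rng.rand(50, 2)
y = (X[:, 0] > 0.5).astype(int)

gpc = GaussianProcessClassifier().fit(X, y)
proba = gpc.predict_proba(X)    # shape (n_samples, n_classes)
uncertainty = entropy(proba.T)  # entropy of each sample's class distribution
print(uncertainty[:5])          # higher value = more uncertain prediction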
You could get the variance associated with the logit function by going to the predict_proba function definition in _gpc.py and returning the var_f_star value. I have modified predict_proba into a function that returns the logit variance below:
def predict_var(self, X):
    """Return the variance of the latent (logit-scale) function for X.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features) or list of object
        Query points where the GP is evaluated.

    Returns
    -------
    var_f_star : array of shape (n_samples,)
        Variance of the latent function at each query point.
    """
    check_is_fitted(self)
    # Based on Algorithm 3.2 of GPML
    K_star = self.kernel_(self.X_train_, X)  # K_star = k(x_star)
    f_star = K_star.T.dot(self.y_train_ - self.pi_)  # Line 4 (unused here; kept from the original predict_proba)
    v = solve(self.L_, self.W_sr_[:, np.newaxis] * K_star)  # Line 5
    # Line 6 (compute np.diag(v.T.dot(v)) via einsum)
    var_f_star = self.kernel_.diag(X) - np.einsum("ij,ij->j", v, v)
    return var_f_star
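Note that this is only a sketch: the attributes the function uses (X_train_, y_train_, pi_, L_, W_sr_, kernel_) live on the internal binary Laplace estimator, which (in the sklearn versions I have seen) is exposed as base_estimator_ after fitting. One way to try the function without editing _gpc.py, on made-up toy data:

import types
import numpy as np
from scipy.linalg import solve
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.utils.validation import check_is_fitted

# Toy binary data, just to exercise the function above.
rng = np.random.RandomState(0)
X_train = rng.rand(50, 2)
y_train = (X_train[:, 0] > 0.5).astype(int)
X_test = rng.rand(5, 2)

gpc = GaussianProcessClassifier().fit(X_train, y_train)

# Bind predict_var to the fitted binary Laplace estimator, which owns the
# attributes the function needs (binary problems only).
laplace = gpc.base_estimator_
laplace.predict_var = types.MethodType(predict_var, laplace)
print(laplace.predict_var(X_test))  # per-sample variance on the logit scale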
When training a logistic regression model, the algorithm goes through an iterative process where at each iteration it calculates the weights of the x variables and a bias value so as to minimize the loss function.
From the official sklearn source code for LogisticRegression, the class' fit method is as follows:
def fit(self, X, y, sample_weight=None):
    """
    Fit the model according to the given training data.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like of shape (n_samples,)
        Target vector relative to X.
    sample_weight : array-like of shape (n_samples,), default=None
        Array of weights that are assigned to individual samples.
        If not provided, then each sample is given unit weight.

        .. versionadded:: 0.17
           *sample_weight* support to LogisticRegression.
I am guessing sample_weight is the weight of the x variables, set to 1 if not given; is the bias value also 1?
You sound somewhat confused, perhaps looking for an analogy here with the weights & biases of a neural network. But this is not the case; sample_weight here has nothing to do with the weights of a neural network, even as a concept.
sample_weight is there so that, if the (business) problem requires so, we can give more weight (i.e. more importance) to some samples compared with others, and this importance directly affects the loss. It is sometimes used in cases of imbalanced data; quoting from the Tips on practical use section of the documentation (it is about decision trees, but the rationale is the same):
Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value.
and from a relevant thread at Cross Validated:
Sample weights are used to increase the importance of a single data-point (let's say, some of your data is more trustworthy, then they receive a higher weight). So: The sample weights exist to change the importance of data-points
You can see a practical demonstration of how changing the weight of some samples changes the final model in the SO thread What does `sample_weight` do to the way a `DecisionTreeClassifier` works in sklearn? (again, it is about decision trees, but the rationale is the same).
Having clarified that, it should now be apparent that there is no room here for any kind of "bias" parameter whatsoever. In fact, the introductory paragraph in your question is wrong: logistic regression does not compute such weights and biases; it returns coefficients and an intercept term (sometimes itself called bias), and these coefficients & intercept have nothing to do with sample_weight.
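To make the distinction concrete, here is a minimal sketch (with toy data of my own) showing that sample_weight changes the learned coefficients and intercept, which are the model's actual parameters:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

lr_plain = LogisticRegression().fit(X, y)  # unit weights (the default)
lr_weighted = LogisticRegression().fit(X, y, sample_weight=[1, 1, 1, 10])

# The coefficients and intercept differ because the loss now puts more
# emphasis on the up-weighted last sample.
print(lr_plain.coef_, lr_plain.intercept_)
print(lr_weighted.coef_, lr_weighted.intercept_)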
I'm using tslearn's TimeSeriesKMeans to cluster my dataset of shape (3000, 300, 8). However, the documentation only covers the case where the dataset has dimensions (n_samples, timesteps, 1), i.e. a single feature. Can anybody help me understand whether I can perform clustering with a higher dimension?
I'm using "DTW" as my distance metric.
I used TimeSeriesKMeans from tslearn.clustering. As you mentioned, the only example in the tslearn documentation uses 1-dimensional input. However, it is very common to work with time series data of higher dimensions. For instance, in my case I was clustering human motion, which was 30 frames with 135 joint key points per frame, so my data shape was (number_of_samples, number_of_frames, features).
In order to use tslearn's TimeSeriesKMeans, you need to input an ndarray of shape (n_samples, m_time_steps (sequence_length), k_features (k_dimensions)).
If you take a look at the documentation, the fit function's parameters are as follows:
fit(X, y=None)
    Compute k-means clustering.

    Parameters
    ----------
    X : array-like of shape=(n_ts, sz, d)
        Time series dataset.
    y : Ignored
The point is, your input data should be an ndarray of shape (n_samples, seq_length, n_features), otherwise it won't work. For example, at first my data was a list of shape (n_samples,) where each element was of shape (seq_length, features). It wouldn't work until I converted it to an ndarray of shape (n_samples, seq_length, features).
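A minimal sketch of multivariate clustering under these assumptions (random data stands in for the real dataset, and is kept small so it runs quickly):

import numpy as np
from tslearn.clustering import TimeSeriesKMeans

# Random stand-in for a (n_samples, seq_length, n_features) dataset.
X = np.random.rand(100, 30, 8)

km = TimeSeriesKMeans(n_clusters=3, metric="dtw", random_state=0)
labels = km.fit_predict(X)  # DTW operates on all 8 features jointly
print(labels.shape)         # (100,)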
Here is my code:
scaler = MinMaxScaler()  # default feature_range is (0, 1)
dataset = scaler.fit_transform(dataset)

# ... build and train the model ...

predicted = model.predict(X_test)  # shape: (5, 1)
and when I run predict = scaler.inverse_transform(predicted) I get a ValueError: non-broadcastable output operand with shape (5,1) doesn't match the broadcast shape (5,2)
My model has 2 features as input.
I tried scaler.inverse_transform(predicted)[:, [0]] and reshaping in several ways, but the same ValueError occurs.
How can I solve this problem? Any advice would be very much appreciated.
You are using inverse_transform in the wrong way: you applied fit_transform to your features, but you are applying inverse_transform to your predictions, which have a different shape, hence the error.
This is not the intended usage of inverse_transform; have a look at the docs for more:
inverse_transform(self, X)
Undo the scaling of X according to feature_range.
Parameters: X : array-like, shape [n_samples, n_features]
Input data that will be transformed. It cannot be sparse.
It is not clear from your post why you attempt to "transform back" your predictions; this only makes sense if you have already transformed your labels (your post does not say whether you have) and you want, say, to scale measures like MSE back to the original scale of the labels. In such a case, you should use a separate scaler for your labels - see my own answer in How to interpret MSE in Keras Regressor for details (the example there uses StandardScaler, but the rationale is the same).
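For illustration, here is a sketch with a separate scaler for the labels (shapes invented to match the question: 2 input features, 1 target):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.random.rand(100, 2)  # 2 input features
y_train = np.random.rand(100, 1)  # 1 target column

x_scaler = MinMaxScaler()  # fitted on the 2-column features
y_scaler = MinMaxScaler()  # fitted only on the 1-column labels

X_scaled = x_scaler.fit_transform(X_train)
y_scaled = y_scaler.fit_transform(y_train)

# ... build and train the model on X_scaled / y_scaled ...
# Predictions of shape (n, 1) can then be inverted without a shape mismatch:
# predicted_original = y_scaler.inverse_transform(predicted)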
In sklearn's linear regression, the returned intercept_ is an array and not a scalar. Why is that?
Other types of regressors, for example HuberRegressor, return intercept_ as a scalar, so code consistency throughout the API cannot be the reason.
I would rephrase your question as "why do some algorithms return intercept_ as a scalar value?"
If a linear model has multiple outputs (several regression targets, or several classes in classification), it needs one intercept (bias) per output, so intercept_ is naturally stored as an array...
In HuberRegressor the intercept is set explicitly to a scalar value:
if self.fit_intercept:
    self.intercept_ = parameters[-2]
else:
    self.intercept_ = 0.0
self.coef_ = parameters[:X.shape[1]]
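A quick way to see the one-intercept-per-output behaviour (toy data of my own):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.random.rand(20, 3)
y_single = np.random.rand(20)    # one target
y_multi = np.random.rand(20, 2)  # two targets

print(np.shape(LinearRegression().fit(X, y_single).intercept_))  # () -> scalar
print(np.shape(LinearRegression().fit(X, y_multi).intercept_))   # (2,) -> one per target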