The documentation says this:
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
What would X be overwritten with? And which X, during training or testing?
It is written in the source code that the input data needs to be centered (and possibly normalized) in order to apply the algorithm: https://github.com/scikit-learn/scikit-learn/blob/bac89c2/sklearn/linear_model/base.py#L93.
It seems that the function responsible for that transformation, self._preprocess_data, is only called during the fitting step: https://github.com/scikit-learn/scikit-learn/blob/bac89c2/sklearn/linear_model/base.py#L463. So only the training set could be modified, i.e. overwritten with its centered/rescaled version.
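For illustration, a minimal sketch of what that means in practice (assuming a dense float64 array and the default fit_intercept=True; whether the original array really gets modified can depend on the scikit-learn version, the dtype and the memory layout):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
X_before = X.copy()

# With copy_X=False, the preprocessing step may center X in place during fit().
LinearRegression(copy_X=False).fit(X, y)
print(np.array_equal(X, X_before))  # may print False: the training X was overwritten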
Hope it helped.
I am looking for help with the implementation of a logit model with statsmodels for binary variables.
Here is my code:
(I am using the feature selection methods MinimumRedundancyMaximumRelevance and RecursiveFeatureElimination, available in Python.)
for i_mrmr in range(4, 20):
    for i_rfe in range(3, i_mrmr):
        regressors_step1 = ...  # features selected with the MRMR method
        regressors_step2 = ...  # features selected from regressors_step1 with the RFE method
        for method in ['newton', 'nm', 'bfgs', 'lbfgs', 'powell', 'cg', 'ncg']:
            logit_model = Logit(y, X.loc[:, regressors_step2])
            try:
                result = logit_model.fit(method=method, cov_type='HC1')
                print(result.summary())
            except Exception:
                result = "error"
I am using Logit from statsmodels.discrete.discrete_model.
The y variable, the target, is binary.
All explanatory variables, the X, are binary too.
The logit model is "functioning" for the different optimization methods; that is to say, I end up with a summary to print. Nonetheless, various warnings are printed, such as: "Maximum Likelihood optimization failed to converge."
The optimization methods offered by statsmodels are the ones from scipy:
‘newton’ for Newton-Raphson
‘nm’ for Nelder-Mead
‘bfgs’ for Broyden-Fletcher-Goldfarb-Shanno (BFGS)
‘lbfgs’ for limited-memory BFGS with optional box constraints
‘powell’ for modified Powell’s method
‘cg’ for conjugate gradient
‘ncg’ for Newton-conjugate gradient
We can find these methods on scipy.optimize.
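For what it's worth, a small sketch (on toy binary data, not the real dataset) of how each solver's convergence can be checked explicitly through result.mle_retvals instead of relying on the warning text:

import numpy as np
from statsmodels.discrete.discrete_model import Logit

rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(200, 3)).astype(float)                 # binary regressors
y_toy = (X_toy.sum(axis=1) + rng.normal(size=200) > 1.5).astype(float)  # binary target

for method in ['newton', 'nm', 'bfgs', 'lbfgs', 'powell', 'cg', 'ncg']:
    res = Logit(y_toy, X_toy).fit(method=method, disp=0)
    print(method, res.mle_retvals.get('converged', 'n/a'))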
Here are my questions :
I have not found any argument against using these optimization methods with a binary set of variables. But, because of the warnings, I wonder whether it is correct to do so. And if it is, which method is the most appropriate in this case?
Here: Scipy minimize: how to restrict x only to 0 and 1? it is implicitly suggested that a model of the Python MIP (Mixed-Integer Linear Programming) kind could be better in the binary-variables case. From the documentation of the Python MIP package it appears that, to implement this kind of model, I should explicitly give a function to minimize or maximize and also express the constraints... (see: https://docs.python-mip.com/en/latest/quickstart.html#creating-models)
Therefore I am wondering: do I need to define a logit function as the objective function? What constraints should I express? Is there an easier way to do this?
A few questions about feature importance in xgboost in python:
I'm trying to print the feature importance using xgboost.get_booster().get_score(). However, the function sometimes doesn't return anything for some variables. Does that mean the score for those variables is 0?
Why do the results of xgboost.get_booster().get_score() look so different from xgboost.feature_importances_?
About the importance_type in get_score(): what would be a reasonable option for a multi-class classification task? It seems the default in Python is "weight", but according to this article importance_type = 'gain' makes more sense (I agree). However, when I use importance_type = 'gain' the results don't make as much sense as with the default importance_type.
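For reference, a minimal sketch (random toy data, so the default f0, f1, ... feature names) of how I am comparing the two APIs and the different importance_type options:

import numpy as np
import xgboost as xgb

X_toy = np.random.rand(200, 5)
y_toy = np.random.randint(0, 3, size=200)      # 3-class toy target

model = xgb.XGBClassifier(n_estimators=20)
model.fit(X_toy, y_toy)

booster = model.get_booster()
for imp_type in ('weight', 'gain', 'cover'):
    # Features that are never used in any split are simply missing from this dict.
    print(imp_type, booster.get_score(importance_type=imp_type))

# sklearn-style attribute (normalized), driven by the estimator's importance_type setting
print(model.feature_importances_)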
Thank you!
I am having a problem with a function I am trying to fit to some data. I have a model, given by the equation inside the function, which I am using to find a value for v. However, the order in which I write the variables in the function definition greatly affects the value the fit gives for v. If, as in the code block below, I have def MAR_fit(v,x) where x is the independent variable, the fit gives a value for v hugely different from the one I get with the definition def MAR_fit(x,v). I haven't had much experience with the curve_fit function in the scipy package, and the docs still left me wondering.
Any help would be great!
from scipy.optimize import curve_fit

def MAR_fit(v, x):
    return (3.*((2.-1.)**2.)*0.05*v)/(2.*(2.-1.)*(60.415**2.)) * (((3.*x*((2.-1.)**2.)*v)/(60.415**2.))+1.)**(-((5./2.)-1.)/(2.-1.))

x = newCD10_AVB1_AMIN01['time_phys'][1:]
y = newCD10_AVB1_AMIN01['MAR'][1:]
popt_tf, pcov = curve_fit(MAR_fit, x, y)
Have a look at the documentation again: it says that the callable you pass to curve_fit (the model function you are trying to fit) must take the independent variable as its first argument; further arguments are the parameters you are trying to fit. You must use MAR_fit(x, v), because that is what curve_fit expects.
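A minimal sketch of the corrected call, keeping the same model expression and the x and y defined in the question (x, the independent variable, now comes first):

from scipy.optimize import curve_fit

def MAR_fit(x, v):
    return (3.*((2.-1.)**2.)*0.05*v)/(2.*(2.-1.)*(60.415**2.)) \
        * (((3.*x*((2.-1.)**2.)*v)/(60.415**2.)) + 1.)**(-((5./2.)-1.)/(2.-1.))

popt_tf, pcov = curve_fit(MAR_fit, x, y)   # popt_tf[0] is now the fitted v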
In my program I am applying a Box-Cox transform to my data, and I am interested in reversing the Box-Cox transformation at a certain step of my experiment. However, I noticed there are two variants of boxcox:
scipy.special.boxcox
scipy.stats.boxcox
I learned that the first option has a function that reverses the Box-Cox transform here.
However, I just want to know why in scipy.special the lambda parameter cannot be None while in scipy.stats it can be. In my code I am actually using scipy.stats and the lambda is None. Now, if I want to switch to scipy.special in order to use its reverse function, what should I set lambda to?
Here is my current code:
elif self.output_box:
    # scipy.stats.boxcox with lmbda=None returns (transformed data, fitted lambda)
    y_train, self.y_train_lambda_ = boxcox(y_train)
    y_test, self.y_test_lambda_ = boxcox(y_test)
They both use the same formula for the transformation, so the only difference seems to be that with scipy.stats you can calculate the optimal lambda for the data. If you use scipy.stats.boxcox with lmbda=None it returns two values: the transformed array and the lambda that maximizes the log-likelihood function (and if alpha is not None, it also returns the confidence interval for lambda). Therefore, that is the lambda you have to use with the inverse transformation.
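A minimal sketch (toy, strictly positive data; variable names assumed) of the full round trip, feeding the lambda returned by scipy.stats.boxcox into scipy.special.inv_boxcox:

import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

y_train = np.random.exponential(size=100) + 1e-3      # toy, strictly positive data
y_train_bc, fitted_lambda = boxcox(y_train)            # lmbda=None -> lambda is estimated
y_train_back = inv_boxcox(y_train_bc, fitted_lambda)   # reverse the transform

print(np.allclose(y_train, y_train_back))              # True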
I want to do one-class SVM classification with sklearn.svm.OneClassSVM on physical states that come from a different library (tenpy). I'd like to define a custom kernel
import numpy as np

def overlap(X, Y):
    # Gram matrix of pairwise overlaps between state objects
    return np.array([[x.overlap(y) for y in Y] for x in X])
where overlap() is a defined function in said library to calculate the overlap between states. When I try to fit with my data
clf = OneClassSVM(kernel=overlap)
clf.fit(states)
where states is a list of such state objects, I get the error
TypeError: float() argument must be a string or a number, not 'MPS'
Is there a way to tell sklearn to skip this check (without editing the source code)?
To my understanding, the nature of the data and how it is processed is in principle not essential to the algorithm, as long as there is a well-defined kernel for the objects.
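One possible workaround (a sketch, not tested against tenpy): precompute the Gram matrix of overlaps yourself and pass kernel='precomputed', so that scikit-learn only ever sees a plain numeric array:

import numpy as np
from sklearn.svm import OneClassSVM

# `states` is the list of MPS objects from the question; .overlap() is assumed to return
# a real-valued (or real-castable) overlap between two states.
gram = np.array([[x.overlap(y) for y in states] for x in states], dtype=float)

clf = OneClassSVM(kernel='precomputed')
clf.fit(gram)

# For new states, the kernel between the new and the training states is needed:
# gram_new = np.array([[x.overlap(y) for y in states] for x in new_states], dtype=float)
# clf.predict(gram_new)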