I have seen some people use numpy.log1p to reduce the skewness of the data and then apply StandardScaler to scale it. What's the difference here? I mean, can't we use StandardScaler directly instead of numpy.log1p followed by StandardScaler?
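For context, the pattern being asked about looks something like this (a minimal sketch with made-up, right-skewed toy data):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.exponential(scale=2.0, size=(100, 3))  # right-skewed toy data
X_log = np.log1p(X)                                  # compress the long right tail
X_scaled = StandardScaler().fit_transform(X_log)     # then center to mean 0, unit variance

Note that StandardScaler is a purely linear transformation (subtract the mean, divide by the standard deviation), so on its own it cannot change the shape of a skewed distribution; log1p is what reshapes it.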
I am using sklearn to generate regression datasets like this:
from sklearn.datasets import make_regression
X, y = make_regression()
Now, that function only seems to be able to generate a feature matrix X with float values. However, in my example I would need some features to be binary ([0, 1]). Is there a way to tell sklearn to include a number of binary features?
Maybe a workaround would be to take some of the generated features and assign binary values based on each feature's median, as in the sketch below.
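A minimal sketch of that median-threshold idea (which columns to binarize is an assumption here):

import numpy as np
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=10)

# Binarize the first three features at their respective medians.
binary_cols = [0, 1, 2]
X[:, binary_cols] = (X[:, binary_cols] > np.median(X[:, binary_cols], axis=0)).astype(int)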
I read a post where someone said:
For feature scaling, you learn the means and standard deviations of the training set, and then:
Standardize the training set using the training set means and standard deviations.
Standardize any test set using the training set means and standard deviations.
But now my question is: after fitting a model on scaled training data, should I then apply this fitted model to scaled or unscaled test data? Thanks!
Yes, you should also scale the test data. If you have scaled your training data and fitted a model to that scaled data, then the test set must undergo the same preprocessing. This is standard practice, as it ensures the model is always given input of a consistent form.
In Python, the process might look as follows:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean/std from the training set
X_test = scaler.transform(X_test)        # reuse the training-set statistics
There is a detailed write-up on this topic in another thread that might be of interest to you.
In order to properly fit a regularized linear regression model like the Elastic Net, the independent variables have to be standardized first. However, the coefficients then have a different meaning. In order to extract the proper weights of such a model, do I need to calculate them manually with this equation:
b = b' * std_y/std_x
or is there already some built-in feature in sklearn?
Also: I don't think I can just use the normalize=True parameter, since I have dummy variables, which should probably remain unscaled (see the sketch below for scaling only the continuous columns).
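For what it's worth, scaling only the continuous columns while passing the dummies through untouched can be done with sklearn's ColumnTransformer; a minimal sketch (the column indices here are made up):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

continuous_cols = [0, 1, 2]  # assumption: these are the non-dummy columns
ct = ColumnTransformer(
    [("scale", StandardScaler(), continuous_cols)],
    remainder="passthrough",  # dummy variables pass through unscaled
)
X_scaled = ct.fit_transform(X)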
You can unstandardize using the mean and standard deviation. sklearn provides them after you use StandardScaler.
import numpy as np
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)  # or whatever you called it
# If x_scaled = (x - mean) / std, the coefficient on the original scale
# is the scaled-data coefficient divided by std:
unstandardized_coefficients = model.coef_ / np.sqrt(ss.var_)
That puts them back on the scale of the unstandardized data (note that the intercept shifts too: model.intercept_ - np.sum(model.coef_ * ss.mean_ / np.sqrt(ss.var_))).
However, since you're using regularization, the estimator is biased. There is a tradeoff between performance and interpretability when it comes to biased/unbiased estimators. This is more of a discussion for stats.stackexchange.com. There's a difference between an unbiased estimator and a low-MSE estimator. Read about biased estimators and interpretability here: When is a biased estimator preferable to an unbiased one?
tl;dr It doesn't make sense to do what you suggested.
I understand that scaling means centering the data (mean = 0) and scaling it to unit variance (variance = 1).
But what is the difference between preprocessing.scale(x) and preprocessing.StandardScaler() in scikit-learn?
They do exactly the same thing, but:
preprocessing.scale(x) is just a function that transforms some data
preprocessing.StandardScaler() is a class supporting the Transformer API
I would always use the latter, even when I don't need inverse_transform and the other conveniences StandardScaler() supports.
Excerpt from the docs:
The function scale provides a quick and easy way to perform this operation on a single array-like dataset
The preprocessing module further provides a utility class StandardScaler that implements the Transformer API to compute the mean and standard deviation on a training set so as to be able to later reapply the same transformation on the testing set. This class is hence suitable for use in the early steps of a sklearn.pipeline.Pipeline
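A minimal sketch of the practical difference (toy data):

import numpy as np
from sklearn import preprocessing

X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_new = np.array([[2.0, 3.0]])

# Function: a one-off transformation; nothing is remembered.
X_a = preprocessing.scale(X_train)

# Class: learns the mean/std, which can then be reapplied to later data.
scaler = preprocessing.StandardScaler().fit(X_train)
X_b = scaler.transform(X_train)         # same result as X_a
X_new_scaled = scaler.transform(X_new)  # reuses the *training* statistics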
Note that neither of them rescales the data to a fixed range: both scale and StandardScaler standardize features to zero mean and unit variance. If you want output bounded to a range such as [0, 1] or [-1, 1], use MinMaxScaler or MaxAbsScaler instead.
When I use sklearn to build a decision tree, for example:
from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
result = clf.predict(testdata)
X is the training input samples. If there is "None" (a missing value) in X, how do I handle it?
Decision Trees and ensemble methods like Random Forests (which are built on such trees) only accept numerical data, since they split each node of the tree so as to minimize a given impurity function (entropy, Gini index, ...).
If you have categorical features or NaN values in your data, the learning step will throw an error.
To circumvent this:
Transform categorical data into numerical data: for example, use a One Hot Encoder. Here is a link to sklearn's documentation.
Warning: if you have a feature with many categories (e.g. an ID feature), one-hot encoding may lead to memory issues. Try to avoid encoding such features.
Impute values for the missing entries. Many strategies exist (mean, median, most frequent, ...). Here is a link to sklearn's documentation.
Once you've done this preprocessing, you can fit your Decision Tree to your data, as in the sketch below.
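Putting both steps together, a minimal sketch (toy data; the column layout and imputation strategy are assumptions, using sklearn's ColumnTransformer, OneHotEncoder, and SimpleImputer):

import numpy as np
from sklearn import tree
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data: column 0 is categorical, column 1 is numeric with a missing value
# (use np.nan for missing values, not the string "None").
X = np.array([["red", 1.0], ["blue", np.nan], ["red", 3.0]], dtype=object)
Y = [0, 1, 0]

pre = ColumnTransformer([
    ("onehot", OneHotEncoder(), [0]),                 # encode categories
    ("impute", SimpleImputer(strategy="mean"), [1]),  # fill missing values
])
clf = make_pipeline(pre, tree.DecisionTreeClassifier())
clf = clf.fit(X, Y)
result = clf.predict(X)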