Generate a sklearn regression dataset with categorical features - python

I am using sklearn to generate regression datasets like this:
from sklearn.datasets import make_regression
X, y = make_regression()
That function only seems to generate a feature matrix X with float values. However, in my case I need some features to be binary ([0, 1]). Is there a way to tell sklearn to include a number of binary features?
Maybe a workable solution would be to take some of the generated features and assign binary values based on each feature's median.
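A minimal sketch of that median-threshold idea, assuming you simply binarize a few columns after generation (make_regression has no built-in option for binary features, and the column indices chosen here are arbitrary):

import numpy as np
from sklearn.datasets import make_regression

# generate a purely numeric regression problem, then threshold a few columns at their median
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)
binary_cols = [0, 3, 7]  # arbitrary columns to turn into 0/1 features
for col in binary_cols:
    X[:, col] = (X[:, col] > np.median(X[:, col])).astype(int)
print(np.unique(X[:, binary_cols]))  # -> [0. 1.]

Keep in mind that thresholding after generation weakens the original linear relationship between those columns and y.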

Related

How to use skmultilearn to train models on label specific data

I am using the skmultilearn library to solve a multi-label machine learning problem. There are 5 labels with binary data (0 or 1). Sklearn's logistic regression is being used as the base classifier, but I need to set label-specific features for each classifier: the label data of one classifier should be used as a feature of another classifier.
I am not able to figure out how to do that.
Any help appreciated.
One-vs-Rest is the problem-transformation method for the multi-label problem you are trying to address. You just need to generate a different training set for each simple classifier, so that you have all the combinations between the original attributes and each of the labels; pandas is useful for manipulating the data and generating the different datasets for each simple classifier, as in the sketch below. Note that using this strategy in its original form ignores the relationships between the labels.
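A hedged sketch of that idea (the column names and data below are made up): for each label, build a training set that combines the original attributes with the remaining label columns as extra features, and fit one base classifier per label.

import pandas as pd
from sklearn.linear_model import LogisticRegression

# X: original attributes, Y: the binary label columns (5 in your case, 2 here for brevity)
X = pd.DataFrame({"f1": [0.1, 0.5, 0.9, 0.3], "f2": [1.2, 0.7, 0.4, 1.1]})
Y = pd.DataFrame({"label_a": [0, 1, 1, 0], "label_b": [1, 0, 1, 0]})

models = {}
for label in Y.columns:
    other_labels = Y.drop(columns=[label])          # label-specific extra features
    X_label = pd.concat([X, other_labels], axis=1)  # original attributes + other labels
    models[label] = LogisticRegression().fit(X_label, Y[label])

At prediction time the other labels are not observed, so you would have to predict them first and feed the predictions in; classifier-chain style approaches (available in both sklearn and skmultilearn) formalize exactly that.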

Sklearn regression with clustered data

I'm trying to run a multinomial LogisticRegression in sklearn with a clustered dataset (that is, there is more than one observation for each individual, where only some features change and others remain constant per individual).
I am aware that in statsmodels it is possible to account for this in the following way:
mnl = MNLogit(x, y).fit(cov_type="cluster", cov_kwds={"groups": cluster_groups})
Is there a way to replicate this with the sklearn package instead?
In order to run multinomial logistic regression in sklearn, you can use the LogisticRegression estimator and set the parameter multi_class to "multinomial".
Reference: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
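A minimal sketch on synthetic data (note that sklearn does not report cluster-robust standard errors the way the statsmodels call above does; it only fits the model and predicts):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

# recent sklearn versions already default to a multinomial fit for multi-class
# targets and may warn that multi_class is deprecated
clf = LogisticRegression(multi_class="multinomial", solver="lbfgs", max_iter=1000)
clf.fit(X, y)
print(clf.predict_proba(X[:3]))  # per-class probabilities, as in a multinomial logit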

Is StandardScaler() or scale in sklearn better for scaling data for a supervised machine learning model? [duplicate]

I understand that scaling means centering the data (mean = 0) and making it unit variance (variance = 1).
But what is the difference between preprocessing.scale(x) and preprocessing.StandardScaler() in scikit-learn?
Both do exactly the same thing, but:
preprocessing.scale(x) is just a function that transforms some data
preprocessing.StandardScaler() is a class supporting the Transformer API
I would always use the latter, even if I did not need inverse_transform and the other conveniences supported by StandardScaler().
Excerpt from the docs:
The function scale provides a quick and easy way to perform this operation on a single array-like dataset
The preprocessing module further provides a utility class StandardScaler that implements the Transformer API to compute the mean and standard deviation on a training set so as to be able to later reapply the same transformation on the testing set. This class is hence suitable for use in the early steps of a sklearn.pipeline.Pipeline
Note that neither rescales to the min-max range of the data nor maps it to [-1, 1]: both scale and StandardScaler perform the same z-score standardization (zero mean, unit variance). If you want output in a fixed range, use MinMaxScaler instead.
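A short sketch of the practical difference, assuming a train/test split: the function standardizes one array in a single call, while the class remembers the training statistics so the same transformation can be reapplied to the test set.

import numpy as np
from sklearn import preprocessing

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 25.0]])

X_train_scaled = preprocessing.scale(X_train)          # one-off standardization

scaler = preprocessing.StandardScaler().fit(X_train)   # learns mean_ and scale_
X_train_scaled2 = scaler.transform(X_train)            # identical result
X_test_scaled = scaler.transform(X_test)               # reuses the training statistics

print(np.allclose(X_train_scaled, X_train_scaled2))    # True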

How to deal with "None" when using sklearn's DecisionTreeClassifier?

When I use sklearn to build a decision tree, for example:
from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
result = clf.predict(testdata)
X is the training input samples. If there are "None" values in X, how should I deal with them?
Decision trees and ensemble methods like random forests (which are built from such trees) only accept numerical data, since they perform splits at each node of the tree in order to minimize a given impurity function (entropy, Gini index, ...).
If you have categorical features or NaN values in your data, the learning step will throw an error.
To circumvent this:
Transform categorical data into numerical data: to do this use, for example, a OneHotEncoder. Here is a link to sklearn's documentation.
Warning: if you have a feature with a lot of categories (e.g. an ID feature), one-hot encoding may lead to memory issues. Try to avoid encoding such features.
Impute values for the missing ones. Many strategies exist (mean, median, most frequent, ...). Here is a link to sklearn's documentation.
Once you've done this preprocessing, you can fit your Decision Tree to your data.
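A hedged sketch of that preprocessing (the column names and data are made up, and missing entries are represented as np.nan): impute missing values and one-hot encode the categorical column before fitting the DecisionTreeClassifier.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

X = pd.DataFrame({
    "age":   [25, np.nan, 40, 33],            # numeric feature with a missing value
    "color": ["red", "blue", np.nan, "red"],  # categorical feature with a missing value
})
Y = [0, 1, 1, 0]

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["color"]),
])

clf = Pipeline([("prep", preprocess), ("tree", DecisionTreeClassifier())])
clf = clf.fit(X, Y)
result = clf.predict(X)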

Is it proper to use the float64 data type with scikit-learn ML algorithms?

I am trying to run a decision tree and an SVM on a dataset given here using scikit-learn. My purpose is to compare these two algorithms, so I am using KFold cross-validation for both and showing the difference. But the dataset I am using consists of real numbers like 0.00057. I get an accuracy that suggests there is no overfitting, but I am not sure whether the real numbers affect the results.
Is it a problem to give scikit-learn's built-in classifiers real numbers? If it is, what should I do to get better results?
PS: when I check the type of a single data point in Python, I see it is float64.
DecisionTreeClassifier internally represents features as float32, and estimators such as SVC likewise convert input to the floating-point dtype they need; whatever you pass in is converted automatically. For machine learning tasks, that is usually more than enough precision, so float64 inputs are not a problem.
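A small sketch on synthetic data showing that float64 features are fine: both estimators accept them and handle any dtype conversion internally.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X = X.astype(np.float64)  # values like 0.00057 are perfectly valid inputs

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for name, model in [("tree", DecisionTreeClassifier(random_state=0)), ("svm", SVC())]:
    scores = cross_val_score(model, X, y, cv=cv)
    print(name, scores.mean())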
