I have a set of alphanumeric categorical features (c_1,c_2, ..., c_n) and one numeric target variable (prediction) as a pandas dataframe. Can you please suggest to me any feature selection algorithm that I can use for this data set?
I'm assuming you are solving a supervised learning problem like Regression or Classification.
First of all, I suggest transforming the categorical features into numeric ones using one-hot encoding. Pandas provides a useful function that does this:
dataset = pd.get_dummies(dataset, columns=['feature-1', 'feature-2', ...])
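As a small illustration of what that call produces, here is a sketch on a made-up frame. The column names and values are assumptions for the example only, not the asker's data.

```python
import pandas as pd

# Hypothetical toy data: two categorical features and a numeric target.
dataset = pd.DataFrame({
    "c_1": ["a", "b", "a", "c"],
    "c_2": ["x", "x", "y", "y"],
    "prediction": [1.0, 2.5, 0.7, 3.1],
})

# Each listed column is replaced by one indicator column per category.
encoded = pd.get_dummies(dataset, columns=["c_1", "c_2"])
encoded_columns = sorted(encoded.columns)
print(encoded_columns)
```

Columns that are not listed (here the target) are passed through unchanged.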
If you have a limited number of features and a model that is not too computationally expensive, you can test every possible combination of features. This exhaustive search is the most thorough approach, but it is seldom viable because the number of subsets grows exponentially with the number of features.
A possible alternative is to sort all the features by their correlation with the target, then add them to the model one at a time, measure the model's performance at each step, and select the set of features that gives the best performance.
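The correlation-ranked procedure can be sketched roughly as follows. The synthetic data, the choice of LinearRegression and the 5-fold cross-validated R² score are all assumptions made for the illustration; any model and metric could take their place.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# Only columns 0 and 3 drive the target in this synthetic data.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

# Rank features by absolute Pearson correlation with the target.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
order = np.argsort(corr)[::-1]

# Add features in ranked order and keep the prefix with the best CV score.
best_score, best_k = -np.inf, 0
for k in range(1, len(order) + 1):
    score = cross_val_score(LinearRegression(), X[:, order[:k]], y, cv=5).mean()
    if score > best_score:
        best_score, best_k = score, k

selected = sorted(order[:best_k].tolist())
print(selected, round(best_score, 3))
```

Note that ranking by correlation only captures linear, one-feature-at-a-time relationships, so this is a cheap heuristic rather than a guarantee of the best subset.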
If you have high-dimensional data, you can consider reducing the dimensionality using PCA or another dimensionality reduction technique. It projects the data into a lower-dimensional space, reducing the number of features; you will lose some information due to the PCA approximation.
These are only some examples of methods to perform feature selection; there are many others.
Final tips:
Remember to split the data into Training, Validation and Test sets.
Often data normalization is recommended to obtain better results.
Some models have an embedded mechanism to perform feature selection (Lasso, Decision Trees, ...).
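The three tips can be combined in one short sketch. The synthetic data, the 60/20/20 split and alpha=0.1 are assumptions chosen for the illustration; in practice alpha would be tuned on the validation set.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
# Only columns 1 and 5 matter in this synthetic target.
y = 4.0 * X[:, 1] + 2.0 * X[:, 5] + rng.normal(scale=0.1, size=300)

# 60% train, 20% validation, 20% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Normalization and the model are fitted on the training set only.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
model.fit(X_train, y_train)

# Lasso drives the coefficients of unhelpful features to exactly zero.
coef = model.named_steps["lasso"].coef_
kept = np.flatnonzero(coef).tolist()
val_r2 = model.score(X_val, y_val)
print(kept, round(val_r2, 3))
```

The test set is left untouched here; it would only be used once, after the feature set and hyperparameters are fixed.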
I'm wondering how to train a Multivariate Bayesian Structural Time Series (BSTS) model that automatically performs feature selection on hundreds of input time series using Tensorflow Probability.
The TF-Probability BSTS blog post shows how to include seasonal effects alongside a single input feature:
...
temp_effect = sts.LinearRegression(
design_matrix=tf.reshape(temp - np.mean(temp),
(-1, 1)), name='temp_effect')
...
model = sts.Sum([..., temp_effect,...],
observed_time_series=observed_time_series)
But what about when there are multiple input time series?
Reading through the documentation, it seems that with many inputs SparseLinearRegression would be preferable, which makes sense, but how should I adapt my code?
The documentation for both LinearRegression and SparseLinearRegression suggests using design_matrix=tf.stack([series1, series2], axis=-1), weights_prior_scale=0.1, but since that's different from how TF-Probability's own blog post uses it, I am unsure if that is the best way to go.
Should I be adding all (several hundred) input features inside the design_matrix of a single SparseLinearRegression, or should I be adding a separate LinearRegression for each feature and then use sts.Sum() to combine them all into the model? Though I would like the functionality of visualizing the impact of each feature, I am most interested in having the model automatically perform feature selection and generate weights for the remaining features which I can have access to.
I want to use tpot. The data I have includes multi-output continuous variables only (i.e. output shape is: (n_samples, n_output_variables), where all items are floats).
This could be achieved using sklearn's MultiOutputRegressor class. But because I have over 100 different output variables, I want to avoid applying tpot to each individual output.
Now, how can I use tpot to only search for multi-output models? Is there a way to tell tpot that only multi-output models (such as DecisionTree) should be used?
About regressors with multiple output:
You have a multioutput regression problem. I suggest that you check this answer: Multi-output regression.
There are regressors which natively support multiple outputs on the target, for example KNeighborsRegressor, DecisionTreeRegressor, ExtraTreesRegressor and RandomForestRegressor. Others (like SGDRegressor, ElasticNetCV, GradientBoostingRegressor, etc.) can be used with multiple outputs if you wrap them in MultiOutputRegressor, as you already mentioned.
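To make the difference concrete, here is a minimal sketch on synthetic data (the data and the choice of estimators are assumptions for the example). The tree is fitted once on the 2-D target, while the wrapper fits one SGDRegressor per output column.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
# Three continuous outputs, shape (n_samples, n_output_variables).
Y = np.column_stack([X[:, 0] + X[:, 1], X[:, 2] - X[:, 3], 2.0 * X[:, 0]])

# Native multi-output support: a single model handles all columns of Y.
native = DecisionTreeRegressor(random_state=0).fit(X, Y)

# No native support: the wrapper trains one independent model per column.
wrapped = MultiOutputRegressor(SGDRegressor(random_state=0)).fit(X, Y)

native_shape = native.predict(X).shape
wrapped_shape = wrapped.predict(X).shape
n_fitted = len(wrapped.estimators_)
print(native_shape, wrapped_shape, n_fitted)
```

With over 100 outputs the wrapper therefore trains over 100 models, which is the cost the question is trying to avoid.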
About TPOT and multiple output regression:
Currently TPOT can be used with all the regressors that support multiple outputs natively, but you have to adjust a file for that because it is not implemented yet; take a look at https://github.com/EpistasisLab/tpot/issues/971. If you want to compare the other (single-output) regressors wrapped in MultiOutputRegressor, TPOT will currently let you choose only one at a time. That is, you can specify only one of the several algorithms and then search for the best pipeline. Then you could rerun with another algorithm.
Regarding your question about specifying which algorithms you want to search over: first take a look at the official documentation and read the section Customizing TPOT's operators and parameters. If you want to use only some specific algorithms, one way to achieve this is to copy the standard TPOT configuration for regression (https://github.com/EpistasisLab/tpot/blob/master/tpot/config/regressor.py) into your code and comment out the algorithms you do not want in your search (or add the ones you do want).
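A restricted configuration could look roughly like the sketch below. The parameter grids are made-up assumptions, not TPOT's defaults, and the TPOT call itself is left as a comment because it needs TPOT installed plus the adjustment discussed in the issue linked above. The runnable part only checks that every import path in the dictionary resolves to a scikit-learn estimator.

```python
import importlib

# Hypothetical search space in the dictionary format of TPOT's regressor
# configuration: "import.path.ClassName" -> {parameter name: candidate values}.
multi_output_config = {
    "sklearn.tree.DecisionTreeRegressor": {
        "max_depth": range(1, 11),
        "min_samples_split": range(2, 21),
    },
    "sklearn.ensemble.RandomForestRegressor": {
        "n_estimators": [100],
        "max_features": [0.25, 0.5, 0.75, 1.0],
    },
    "sklearn.neighbors.KNeighborsRegressor": {
        "n_neighbors": range(1, 51),
        "weights": ["uniform", "distance"],
    },
}

def resolve(path):
    """Import the class named by a dotted path such as 'pkg.module.Class'."""
    module_name, class_name = path.rsplit(".", 1)
    return getattr(importlib.import_module(module_name), class_name)

# Sanity check: every key must point at an importable estimator class.
estimators = {path: resolve(path) for path in multi_output_config}
print(sorted(cls.__name__ for cls in estimators.values()))

# Assumed usage, not executed here:
# from tpot import TPOTRegressor
# tpot = TPOTRegressor(config_dict=multi_output_config,
#                      generations=5, population_size=20)
```

Only regressors that accept a 2-D target are listed, so every pipeline TPOT assembles from this dictionary ends in a multi-output-capable model.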
I can't figure out how the sklearn.pipeline.Pipeline works exactly.
There is little explanation in the docs. For example, what do they mean by:
Pipeline of transforms with a final estimator.
To make my question clearer, what are steps? How do they work?
Edit
Thanks to the answers I can make my question clearer:
When I call pipeline and pass, as steps, two transformers and one estimator, e.g:
pipln = Pipeline([("trsfm1",transformer_1),
("trsfm2",transformer_2),
("estmtr",estimator)])
What happens when I call this?
pipln.fit()
OR
pipln.fit_transform()
I can't figure out how an estimator can be a transformer and how a transformer can be fitted.
Transformer in scikit-learn - a class that has fit and transform methods, or a fit_transform method.
Predictor - a class that has fit and predict methods, or a fit_predict method.
Pipeline is just an abstract notion; it is not an ML algorithm in itself. Often in ML tasks you need to perform a sequence of different transformations of the raw dataset (find a set of features, generate new features, select only some good features) before applying the final estimator.
Here is a good example of Pipeline usage.
Pipeline gives you a single interface for all 3 steps: the two transformations and the resulting estimator. It encapsulates the transformers and predictors, so you can replace something like:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = SGDClassifier()
# Fit every step on the training set
vX = vect.fit_transform(Xtrain)
tfidfX = tfidf.fit_transform(vX)
predicted = clf.fit(tfidfX, ytrain).predict(tfidfX)
# Now evaluate all steps on test set (transform only, do not refit)
vX = vect.transform(Xtest)
tfidfX = tfidf.transform(vX)
predicted = clf.predict(tfidfX)
With just:
pipeline = Pipeline([
('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', SGDClassifier()),
])
predicted = pipeline.fit(Xtrain, ytrain).predict(Xtrain)
# Now evaluate all steps on test set
predicted = pipeline.predict(Xtest)
With pipelines you can easily perform a grid search over a set of parameters for each step of this meta-estimator, as described in the link above. All steps except the last one must be transformers; the last step can be a transformer or a predictor.
Answer to edit:
When you call pipln.fit(), each transformer inside the pipeline is fitted on the output of the previous transformer (the first transformer is fitted on the raw dataset), and the last estimator is fitted on the output of the last transformer. The last estimator may be a transformer or a predictor. You can call fit_transform() or transform() on the pipeline only if the last estimator is a transformer (one that implements fit_transform, or fit and transform separately), and you can call fit_predict() or predict() only if the last estimator is a predictor. So you cannot call fit_transform or transform on a pipeline whose last step is a predictor.
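The behaviour described above can be checked with a small sketch. The concrete steps (StandardScaler, PCA, LogisticRegression) and the synthetic data are assumptions standing in for transformer_1, transformer_2 and estimator from the question.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

pipln = Pipeline([
    ("trsfm1", StandardScaler()),
    ("trsfm2", PCA(n_components=3, random_state=0)),
    ("estmtr", LogisticRegression()),
])
pipln.fit(X, y)

# The same computation written out step by step:
# each step is fitted on the output of the previous one.
scaler = StandardScaler()
pca = PCA(n_components=3, random_state=0)
clf = LogisticRegression()
X1 = scaler.fit_transform(X)
X2 = pca.fit_transform(X1)
clf.fit(X2, y)

manual = clf.predict(pca.transform(scaler.transform(X)))
same_predictions = np.array_equal(pipln.predict(X), manual)

# The last step is a predictor, so the pipeline exposes predict but not transform.
can_predict = hasattr(pipln, "predict")
can_transform = hasattr(pipln, "transform")
print(same_predictions, can_predict, can_transform)
```

At prediction time the pipeline only calls transform on the intermediate steps; nothing is refitted.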
I think that M0rkHaV has the right idea. Scikit-learn's pipeline class is a useful tool for encapsulating multiple different transformers alongside an estimator into one object, so that you only have to call your important methods once (fit(), predict(), etc). Let's break down the two major components:
Transformers are classes that implement both fit() and transform(). You might be familiar with some of the sklearn preprocessing tools, like TfidfVectorizer and Binarizer. If you look at the docs for these preprocessing tools, you'll see that they implement both of these methods. What I find pretty cool is that some estimators can also be used as transformation steps, e.g. LinearSVC (in current scikit-learn versions, by wrapping them in SelectFromModel)!
Estimators are classes that implement both fit() and predict(). You'll find that many of the classifiers and regression models implement both of these methods, and as such you can readily test many different models. It is possible to use another transformer as the final estimator (i.e., it doesn't necessarily implement predict(), but it definitely implements fit()). All this means is that you wouldn't be able to call predict().
As for your edit: let's go through a text-based example. Using LabelBinarizer, we want to turn a list of labels into a list of binary values.
from sklearn.preprocessing import LabelBinarizer
bin = LabelBinarizer() #first we initialize
vec = ['cat', 'dog', 'dog', 'dog'] #we have our label list we want binarized
Now, when the binarizer is fitted on some data, it will have an attribute called classes_ that contains the unique classes that the transformer 'knows' about. Without calling fit() the binarizer has no idea what the data looks like, so calling transform() wouldn't make any sense. You can see this if you print out the list of classes before fitting the data.
print(bin.classes_)
I get the following error when trying this:
AttributeError: 'LabelBinarizer' object has no attribute 'classes_'
But when you fit the binarizer on the vec list:
bin.fit(vec)
and try again
print(bin.classes_)
I get the following:
['cat' 'dog']
print(bin.transform(vec))
And now, after calling transform on the vec object, we get the following:
[[0]
 [1]
 [1]
 [1]]
As for estimators being used as transformers, let us use the DecisionTree classifier as an example of a feature extractor. Decision Trees are great for a lot of reasons, but for our purposes, what's important is that they can rank the features that the tree found useful for predicting (exposed as feature_importances_). In older scikit-learn versions you could call transform() directly on a fitted Decision Tree; in current versions the estimator is wrapped in SelectFromModel to get the same behaviour. Either way, the step takes your input data and keeps what the tree considers the most important features. So you can think of it as transforming your data matrix (n rows by m columns) into a smaller matrix (n rows by k columns), where the k columns are the k most important features that the Decision Tree found.
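With a current scikit-learn, the n-by-m to n-by-k reduction can be sketched with SelectFromModel. The synthetic data and k = 3 are assumptions for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))       # n = 200 rows, m = 10 columns
y = (X[:, 2] > 0).astype(int)        # only column 2 carries signal

# threshold=-np.inf disables the importance cutoff, so exactly
# max_features columns (the top-ranked ones) are kept.
selector = SelectFromModel(
    DecisionTreeClassifier(random_state=0),
    max_features=3,
    threshold=-np.inf,
)
X_small = selector.fit_transform(X, y)   # fits the tree, then keeps k = 3 columns
kept = selector.get_support(indices=True).tolist()
print(X_small.shape, kept)
```

Because the selector has fit and transform, it can be used as an intermediate step of a Pipeline like any other transformer.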
ML algorithms typically process tabular data. You may want to do preprocessing and post-processing of this data before and after your ML algorithm. A pipeline is a way to chain those data processing steps.
What are ML pipelines and how do they work?
A pipeline is a series of steps in which data is transformed. It comes from the old "pipe and filter" design pattern (for instance, you could think of unix bash commands with pipes “|” or redirect operators “>”). However, pipelines are objects in the code. Thus, you may have a class for each filter (a.k.a. each pipeline step), and then another class to combine those steps into the final pipeline. Some pipelines may combine other pipelines in series or in parallel, have multiple inputs or outputs, and so on. We like to view Pipelining Machine Learning as:
Pipe and filters. The pipeline’s steps process data, and they manage their inner state which can be learned from the data.
Composites. Pipelines can be nested: for example a whole pipeline can be treated as a single pipeline step in another pipeline. A pipeline step is not necessarily a pipeline, but a pipeline is itself at least a pipeline step by definition.
Directed Acyclic Graphs (DAG). A pipeline step's output may be sent to many other steps, and then the resulting outputs can be recombined, and so on. Side note: although pipelines are acyclic, they can process multiple items one by one, and if their state changes (e.g.: using the fit_transform method each time), then they can be viewed as recurrently unfolding through time, keeping their states (think of an RNN). That’s an interesting way to see pipelines for doing online learning when putting them in production and training them on more data.
Methods of a Scikit-Learn Pipeline
Pipelines (or steps in the pipeline) must have these two methods:
“fit” to learn on the data and acquire state (e.g.: a neural network’s weights are such state)
“transform” (or “predict”) to actually process the data and generate a prediction.
It's also possible to call this method to chain both:
“fit_transform” to fit and then transform the data, but in one pass, which allows for potential code optimizations when the two methods must be done one after the other directly.
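A minimal step implementing these methods could look like the sketch below. MeanCenterer is a hypothetical class written for this illustration; fit_transform is not defined by hand but inherited from scikit-learn's TransformerMixin, which simply chains fit and transform.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical step: learns column means in fit, subtracts them in transform."""

    def fit(self, X, y=None):
        # The learned state of this step is the per-column mean.
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        # Process data using the state acquired in fit.
        return np.asarray(X, dtype=float) - self.mean_

X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
X_new = np.array([[2.0, 20.0]])

step = MeanCenterer()
centered_train = step.fit_transform(X_train)  # fit, then transform, in one call
centered_new = step.transform(X_new)          # reuses the means learned above
print(centered_train)
print(centered_new)
```

A step that can fuse the two operations more efficiently is free to override fit_transform with its own implementation.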
Problems of the sklearn.pipeline.Pipeline class
Scikit-Learn’s “pipe and filter” design pattern is simply beautiful. But how to use it for Deep Learning, AutoML, and complex production-level pipelines?
Scikit-Learn was started in 2007, which was a pre deep learning era. However, it’s one of the best-known and most widely adopted machine learning libraries, and is still growing. On top of that, it uses the Pipe and Filter design pattern as a software architectural style; that is what makes Scikit-Learn so fabulous, added to the fact that it provides algorithms ready for use. However, it has massive issues when it comes to doing the following, which we should be able to do in 2020 already:
Automatic Machine Learning (AutoML),
Deep Learning Pipelines,
More complex Machine Learning pipelines.
Solutions that We’ve Found to Those Scikit-Learn Problems
For sure, Scikit-Learn is very convenient and well-built. However, it needs a refresh. Here are our solutions with Neuraxle to make Scikit-Learn fresh and usable within modern computing projects!
Inability to Reasonably do Automatic Machine Learning (AutoML)
Problem: Defining the Search Space (Hyperparameter Distributions)
Problem: Defining Hyperparameters in the Constructor is Limiting
Problem: Different Train and Test Behavior
Problem: You trained a Pipeline and You Want Feedback on its Learning.
Inability to Reasonably do Deep Learning Pipelines
Problem: Scikit-Learn Hardly Allows for Mini-Batch Gradient Descent (Incremental Fit)
Problem: Initializing the Pipeline and Deallocating Resources
Problem: It is Difficult to Use Other Deep Learning (DL) Libraries in Scikit-Learn
Problem: The Ability to Transform Output Labels
Not ready for Production nor for Complex Pipelines
Problem: Processing 3D, 4D, or ND Data in your Pipeline with Steps Made for Lower-Dimensional Data
Problem: Modify a Pipeline Along the Way, such as for Pre-Training or Fine-Tuning
Problem: Getting Model Attributes from Scikit-Learn Pipeline
Problem: You can't Parallelize nor Save Pipelines Using Steps that Can't be Serialized "as-is" by Joblib
Additional pipeline methods and features offered through Neuraxle
Note: if a step of a pipeline doesn’t need to have one of the fit or transform methods, it could inherit from NonFittableMixin or NonTransformableMixin to be provided a default implementation of one of those methods to do nothing.
As a starter, it is possible for pipelines or their steps to also optionally define those methods:
“setup”, which will call the “setup” method on each of its steps. For instance, if a step contains a TensorFlow, PyTorch, or Keras neural network, the step could create its neural graph and register it to the GPU in the “setup” method before fit. Creating the graphs directly in the constructors of the steps is discouraged for several reasons, for example because the steps may be copied many times before running with different hyperparameters within an Automatic Machine Learning algorithm that searches for the best hyperparameters for you.
“teardown”, which is the opposite of the “setup” method: it clears resources.
The following methods are provided by default to allow for managing hyperparameters:
“get_hyperparams” will return a dictionary of the hyperparameters. If your pipeline contains more pipelines (nested pipelines), then the hyperparameters’ keys are chained with double underscore “__” separators.
“set_hyperparams” will allow you to set new hyperparameters in the same format as when you get them.
“get_hyperparams_space” allows you to get the hyperparameter space, which will not be empty if you defined one. So, the only difference from “get_hyperparams” here is that you’ll get statistical distributions as values instead of precise values. For instance, one hyperparameter for the number of layers could be a RandInt(1, 3), which means 1 to 3 layers. You can call .rvs() on this dict to pick a value randomly and send it to “set_hyperparams” to try training on it.
“set_hyperparams_space” can be used to set a new space using the same hyperparameter distribution classes as in “get_hyperparams_space”.
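The double-underscore key convention is easiest to see with plain scikit-learn, whose get_params and set_params use the same chaining for nested pipelines. The sketch below uses scikit-learn only, as an analogy; it does not call Neuraxle, and the step names are assumptions for the example.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A pipeline nested inside another pipeline.
inner = Pipeline([("scale", StandardScaler()), ("ridge", Ridge(alpha=1.0))])
outer = Pipeline([("model", inner)])

# Keys of nested steps are chained with "__": step name, sub-step name, parameter.
params = outer.get_params()
alpha_before = params["model__ridge__alpha"]

# Setting uses the same key format as getting.
outer.set_params(model__ridge__alpha=0.1)
alpha_after = outer.get_params()["model__ridge__alpha"]
print(alpha_before, alpha_after)
```

The same flat key format is what grid search tools use to address a parameter deep inside a nested pipeline.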
For more info on our suggested solutions, read the entries in the big list with links above.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
import pandas as pd

class TextTransformer(BaseEstimator, TransformerMixin):
    """
    Text feature transformation: selects one text column as a 1-D Series.
    """
    def __init__(self, key):
        self.key = key
    def fit(self, X, y=None, *parg, **kwarg):
        return self
    def transform(self, X):
        return X[self.key]

class NumberTransformer(BaseEstimator, TransformerMixin):
    """
    Numeric feature transformation: selects one numeric column as a 2-D frame.
    """
    def __init__(self, key):
        self.key = key
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[[self.key]]

def fit_predict(model, X_train, X_test, y_train, y_test):
    vec_tfidf = TfidfVectorizer(ngram_range=(2,2), analyzer='word', norm='l2')
    text = Pipeline([
        ('transformer', TextTransformer(key='clear_messages')),
        ('vectorizer', vec_tfidf)
    ])
    word_numeric = Pipeline([
        ('transformer', NumberTransformer(key='word_count')),
        ('scalar', StandardScaler())
    ])
    word_class = Pipeline([
        ('transformer', NumberTransformer(key='preds')),
        ('scalar', StandardScaler())
    ])
    # Combine all features
    features = FeatureUnion([('Text_Feature', text),
                             ('Num1_Feature', word_numeric),
                             ('Num2_Feature', word_class)
                             ])
    # Classifier
    clf = model
    # Combine the features and the classifier
    pipe = Pipeline([('features', features),
                     ('clf', clf)
                     ])
    # Train the model
    pipe_fit = pipe.fit(X_train, y_train)
    # Predict on the test data
    preds = pipe_fit.predict(X_test)
    return preds, pipe_fit
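For comparison, the same feature layout can be sketched with scikit-learn's ColumnTransformer, which removes the need for the two column-selecting classes. This is an alternative technique, not the code above; the toy frame and the LogisticRegression classifier are assumptions, and only the column names are taken from the original.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame reusing the column names from the code above.
df = pd.DataFrame({
    "clear_messages": ["good fast service", "very bad service",
                       "good fast delivery", "very bad delivery"],
    "word_count": [3, 3, 3, 3],
    "preds": [0.9, 0.1, 0.8, 0.2],
})
y = [1, 0, 1, 0]

features = ColumnTransformer([
    # A string selects the column as 1-D text, which TfidfVectorizer expects.
    ("text", TfidfVectorizer(ngram_range=(2, 2), analyzer="word", norm="l2"),
     "clear_messages"),
    # A list selects a 2-D block, which StandardScaler expects.
    ("nums", StandardScaler(), ["word_count", "preds"]),
])

pipe = Pipeline([("features", features), ("clf", LogisticRegression())])
pipe.fit(df, y)
preds = pipe.predict(df)
print(list(preds))
```

The string-versus-list column selector plays the role that X[self.key] and X[[self.key]] play in the two custom transformers.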
Is there a method that I can input the coefficients to the clf of SVC in my script, then apply clf.score() or clf.predict() function for further test?
Currently I am using joblib.dump(clf,'file.plk') to save all the information of a trained clf. But this involves disk writing/reading. It would be helpful if I could just define a clf from a few arrays representing the support vectors (clf.support_vectors_), weights (clf.coef_/clf.dual_coef_), and bias (clf.intercept_) respectively.
This line calls the prediction function from libsvm. It looks like this (but please take a look at the whole function _dense_predict):
libsvm.predict(
X, self.support_, self.support_vectors_, self.n_support_,
self.dual_coef_, self._intercept_,
self.probA_, self.probB_, svm_type=svm_type, kernel=kernel,
degree=self.degree, coef0=self.coef0, gamma=self._gamma,
cache_size=self.cache_size)
You can use this line, give it all the relevant information directly, and obtain a raw prediction. In order to do this, you must import libsvm with from sklearn.svm import libsvm (in recent scikit-learn versions the module is private and lives at sklearn.svm._libsvm). If your initial fitted classifier is called svc, then you can obtain all the relevant information from it by replacing all the self keywords with svc and keeping the values. If svc._impl gives you "c_svc", then you set svm_type=0.
Note that at the beginning of the _dense_predict function you have X = self._compute_kernel(X). If your data is X, then you need to transform it by doing K = svc._compute_kernel(X), and call the libsvm.predict function with K as the first argument.
Scoring is independent from all this. Take a look at sklearn.metrics, where you will find e.g. the accuracy_score, which is the default score in SVM.
This is of course a somewhat suboptimal way of doing things, but in this specific case, if it is impossible (I didn't check very hard) to set the coefficients, then going into the code, seeing what it does and extracting the relevant part is surely an option.
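For a binary SVC, another option that avoids private functions is to recompute the decision function by hand from the arrays mentioned in the question. The sketch below assumes an RBF kernel with an explicitly fixed gamma and two classes; multi-class models and other kernels would need more work, and the helper names are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

gamma = 0.5  # fixed explicitly so the kernel can be recomputed by hand
svc = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# The only arrays the prediction needs in the binary case.
support_vectors = svc.support_vectors_
dual_coef = svc.dual_coef_    # shape (1, n_support_vectors)
intercept = svc.intercept_    # shape (1,)
classes = svc.classes_

def rbf_kernel(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def manual_decision(X_new):
    # Weighted sum of kernel values against the support vectors, plus bias.
    return (rbf_kernel(X_new, support_vectors, gamma) @ dual_coef.T).ravel() + intercept[0]

def manual_predict(X_new):
    # A positive decision value maps to classes[1], otherwise classes[0].
    return classes[(manual_decision(X_new) > 0).astype(int)]

X_new = rng.normal(size=(20, 3))
print(np.allclose(manual_decision(X_new), svc.decision_function(X_new)))
```

The four arrays can be kept in memory or sent elsewhere as plain NumPy data, and the two helper functions need only NumPy to run.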
Check out this blog post on memory usage of sklearn models using succinct tries to see if it is applicable.
If the other location does not have access to the sklearn packages, you would need to create your own score and predict functions. clf.score() and clf.predict() require clf to be an sklearn object.