I am running some algorithms in scikit. Like Currently I use RandomisedLasso. But this question pertain to any ml algo in scikit.
My initial training data is 149x56. Now here is what I do:
from sklearn.linear_model import RandomizedLasso
est_rlasso = RandomizedLasso(max_iter=1000)
# Running Randomised Lasso
x=est_rlasso.fit_transform(tourism_X,tourism_Y)
x.shape
>>> (149x36).
So if you see it gives out 36 best features to be retained out of 56 initially and transforms the dataset from 149x56 to 149x36. But the problem is which 36 features did it retain? The biggest problem with scikit is that it strips off the variable headers. So now I am left clueless which features did this algorithm keep and which one it removed as the final X has no header to cross-check.
THis is common across any ml algorithm implementation in scikit. How does one overcome this? Like if I need to find which variables it gave as significant or if I am running a Regression model then the coefficient stand for which variables as I might have used Onehotencoder to transform categorical variables and then it would change the var order from original.
Any idea?
From the docs
get_support([indices]) Return a mask, or list, of the features/indices
selected.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RandomizedLasso.html
Related
I´m trying to cluster a dataframe with 36 features and a lot (88%) of zeros. It's my first job at ML, I started with K-Means, but for any K I choose, 99.5% of my data remains on cluster 0. I've tried some PCA do reduce features, but the same problem appeared.
Any thoughts on approaches I can try?
Have you tried techniques such as sequential feature selection? These are so-called 'wrapper methods' where you add (for forward selection) or eliminate (for backward elimination) one feature at a time and assess the performance of the model accordingly. I tend to use supervised learning models in my job but I believe you can use sequential selection algorithms to assess unsupervised models as well. I have used the sklearn library for this: https://scikit-learn.org/stable/modules/feature_selection.html
I encountered a problem while doing my ML project. Hope to get some advice from you!
I fit logistic LASSO on a dataset with only 15 features trying to predict a binary outcome. I know that LASSO is supposed to do feature selection and eliminate the unimportant ones (coefficient = 0), but in my analysis, it has selected all the features and did not eliminate any one of them. My questions are:
Is this because I have too few features, or that the features are not correlated with each other(low co-linearity?)
Is this a bad thing or a good thing for a classification model?
some coefficients of the features LASSO selected are less than 0.1, can I interpret them as non-important or not that important to the model?
p.s. I run the model using the sklearn package in python.
Thank you!
Lasso did not fail to perform feature selection. It just determined that none of the 15 features were unimportant. For the one's where you get coefficients = 0.1 this just means that they are less important when compared to other more important features. So I would not be concerned!
Also 15 features is not a large amount of features for Lasso to determine the important one's. I mean it depends on the data so for some datasets, it can eliminate some features from a dataset of 10 features and sometimes it won't eliminate any from a dataset of 20. It just depends on the data!
Cheers!
In my dataset X I have two continuous variables a, b and two boolean variables c, d, making a total of 4 columns.
I have a multidimensional target y consisting of two continuous variables A, B and one boolean variable C.
I would like to train a model on the columns of X to predict the columns of y. However, having tried LinearRegression on X it didn't perform so well (my variables vary several orders of magnitude and I have to apply suitable transforms to get the logarithms, I won't go into too much detail here).
I think I need to use LogisticRegression on the boolean columns.
What I'd really like to do is combine both LinearRegression on the continuous variables and LogisticRegression on the boolean variables into a single pipeline. Note that all the columns of y depend on all the columns of X, so I can't simply train the continuous and boolean variables independently.
Is this even possible, and if so how do I do it?
I've used something called a "Model Tree" (see link below) for the same sort of problem.
https://github.com/ankonzoid/LearningX/tree/master/advanced_ML/model_tree
But it will need to be customized for your application. Please ask more questions if you get stuck using it.
Here's a screen shot of what it does
If your target data Y has multiple columns you need to use multi-task learning approach. Scikit-learn contains some multi-task learning algorithms for regression like multi-task elastic-net but you cannot combine logistic regression with linear regression because these algorithms use different loss functions to optimize. Also, you may try neural networks for your problem.
What i understand you want to do is to is to train a single model that both predicts a continuous variable and a class. You would need to combine both loses into one single loss to be able to do that which I don't think is possible in scikit-learn. However I suggest you use a deep learning framework (tensorflow, pytorch, etc) to implement your own model with the required properties you need which would be more flexible. In addition you can also tinker with solving the above problem using neural networks which would improve your results.
Have read multiple cases where StandardScaler is used on y_train and y_test and also where it is not used. Is there any specific rules where it should be used on them?
Quoting from here:
Standardization of a dataset is a common requirement for many machine
learning estimators: they might behave badly if the individual
features do not more or less look like standard normally distributed
data (e.g. Gaussian with 0 mean and unit variance).
For instance many elements used in the objective function of a
learning algorithm (such as the RBF kernel of Support Vector Machines
or the L1 and L2 regularizers of linear models) assume that all
features are centered around 0 and have variance in the same order. If
a feature has a variance that is orders of magnitude larger that
others, it might dominate the objective function and make the
estimator unable to learn from other features correctly as expected.
So probably When your features has different scales/distributions you should standardize/scale their values.
I had some trouble finding a good transductive svm (semi-supervised support vector machine or s3vm) implementation for python. Finally I found the implementation of Fabian Gieseke of Oldenburg University, Germany (code is here: https://www.ci.uni-oldenburg.de/60506.html, paper title: Fast and Simple Gradient-Based Optimization for Semi-Supervised Support Vector Machines).
I now try to integrate the learned model into my scikit-learn code.
1) This works already:
I've got a binary classification problem. I defined a new method inside the S3VM-code returning the self.__c-coeficients (these are needed for the decision function of the classifier).
I then assign these (in my own scikit-code where clf stands for a svm.SVC-classifier) to clf.dual_coefs_ and properly change clf.support_ too (which holds the indices of the support vectors). It takes a while because sometimes you need numpy-arrays and sometimes lists etc. But it works quite well.
2) This doesnt work at all:
I want to adapt this now to a multi-class-classification problem (again with an svm.SVC-classifier).
I see the scheme for multi-class dual_coef_ in the docs at
http://scikit-learn.org/stable/modules/svm.html
I tried some things already but seem to mess it up all the time. My strategy is as follows:
for all pairs in classes:
calculate the coefficients with qns3vm for the properly binarized labeled training set (filling 0s into spaces in the coef-vector where instances have been in the labeled training set that are not in the current class-pair) --> get a 1x(l+u)-np.array of coefficients
horizontally stack these to get a (n_class*(n_class-1)/2)x(l+u) matrix | I do not have a clue why the specs say that this should be of shape [n_class-1, n_SV(=l+u)]?
replace clf.dual_coef_ with this matrix
Does anybody know the right way to replace dual_coef_ in the multi-class-setting? Or is there a neat piece of example code someone can recommend? Or at least a better explanation for the shape of dual_coef_ in the one-vs-one-multiclass-setting?
Thanks!
Damian