I have a dependent variable y and 6 independent variables, and I want to fit a linear regression to them. I am using the sklearn library to do it.
The problem is that some of my independent variables have a correlation higher than 0.5 with each other, so I can't keep them in the model at the same time.
I searched the internet but didn't find any solution for selecting the best set of independent variables for the linear regression and outputting the variables that were selected.
If you see that you have correlated independent variables, you should consider removing some of them.
I see you are working with scikit-learn. If you don't want to do any feature selection manually, you could always use one of the feature selection methods in scikit-learn's feature_selection module. There are many ways to automatically remove features, and you should cross-validate to determine which one is best for your problem.
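If it helps, here is a minimal sketch of that idea using RFECV (recursive feature elimination with cross-validation) from the feature_selection module. The data below is synthetic stand-in data and the column names x1..x6 are made up, so swap in your own X and y:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Stand-in data with 6 features; replace with your own X (DataFrame) and y.
X_arr, y = make_regression(n_samples=200, n_features=6, n_informative=3, noise=10, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"x{i}" for i in range(1, 7)])

# RFECV drops features one at a time and keeps the subset with the best cross-validated score.
selector = RFECV(LinearRegression(), cv=5, scoring="neg_mean_absolute_error")
selector.fit(X, y)

print("selected features:", list(X.columns[selector.support_]))
```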
You are probably looking for k-fold cross-validation.
The idea is to pick candidate sets of features and have a way to validate them against each other.
You train your model, with a given feature selection, on (k-1) partitions of your data and validate it against the remaining partition. You do this for each partition and take the average of your score (MAE or RMSE, for instance).
That score is an objective figure with which to compare your models, i.e. your feature selections.
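A rough sketch of that workflow, assuming X is a DataFrame with your predictors and y your target (the candidate subsets and column names are made up):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical candidate feature selections to compare.
candidate_subsets = [["x1", "x2", "x3"], ["x1", "x4", "x5"], ["x2", "x3", "x6"]]
cv = KFold(n_splits=5, shuffle=True, random_state=0)

for cols in candidate_subsets:
    scores = cross_val_score(LinearRegression(), X[cols], y, cv=cv,
                             scoring="neg_mean_absolute_error")
    print(cols, "mean MAE:", -scores.mean())
```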
I have a dataset with a large number of predictive variables and I want to use them to predict a number of output variables. However, some of the things I want to predict are categorical, and others are continuous; the things I want to predict are not independent. Is it possible with scikit-learn to, for example, mix a classifier and a regressor so that I can predict and disentangle these variables? (I'm currently looking at gradient boosting classifiers/regressors, but there may be better options.)
You can certainly use One Hot Encoding or Dummy Variable Encoding to convert labels to numerics. See the link below for the details.
https://codefires.com/how-convert-categorical-data-numerical-data-python/
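For example, a small sketch with pandas (the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "Oslo", "Paris"], "temp": [21.0, 14.5, 23.1]})
encoded = pd.get_dummies(df, columns=["city"])  # adds city_Oslo / city_Paris indicator columns
print(encoded)
```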
As an aside, Random Forest is a popular machine learning model that is commonly used for classification tasks as can be seen in many academic papers, Kaggle competitions, and blog posts. In addition to classification, Random Forests can also be used for regression tasks. A Random Forest’s nonlinear nature can give it a leg up over linear algorithms, making it a great option.

However, it is important to know your data and keep in mind that a Random Forest can’t extrapolate. It can only make a prediction that is an average of previously observed labels. In this sense it is very similar to KNN. In other words, in a regression problem, the range of predictions a Random Forest can make is bound by the highest and lowest labels in the training data.

This behavior becomes problematic in situations where the training and prediction inputs differ in their range and/or distributions. This is called covariate shift and it is difficult for most models to handle but especially for Random Forest, because it can’t extrapolate.
https://towardsdatascience.com/a-limitation-of-random-forest-regression-db8ed7419e9f
https://stackabuse.com/random-forest-algorithm-with-python-and-scikit-learn
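A quick illustration of the "no extrapolation" point, on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.arange(0, 100).reshape(-1, 1)
y_train = 2 * X_train.ravel()  # simple linear relationship; labels go up to 198

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
# The prediction stays near the training maximum (~198) instead of reaching 1000.
print(rf.predict([[500]]))
```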
In closing, scikit-learn uses numpy arrays as inputs to its models. As such, all features become de facto numerical (if you have categorical features, you'll need to convert them to numerical values).
I don't think there's a built-in way. There are ClassifierChain and RegressorChain that allow you to use earlier predictions as features in later predictions, but as the names indicate they assume either classification or regression. Two options come to mind:
Manually patch those together for what you want to do. For example, use a ClassifierChain to predict each of your categorical targets using just the independent features, then add those predictions to the dataset before training a RegressorChain with the numeric targets.
Use those classes as a base for defining a custom estimator. In that case you'll probably look mostly at their common parent class _BaseChain. Unfortunately that also uses a single estimator attribute, whereas you'd need (at least) two, one classifier and one regressor.
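A rough sketch of the first of those options, on made-up data (two binary targets and two numeric targets, all depending on the same features):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.multioutput import ClassifierChain, RegressorChain

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y_cat = (X[:, :2] + rng.normal(scale=0.1, size=(200, 2)) > 0).astype(int)   # two binary targets
y_num = X @ rng.normal(size=(4, 2)) + rng.normal(scale=0.1, size=(200, 2))  # two numeric targets

# Predict the categorical targets first...
clf_chain = ClassifierChain(RandomForestClassifier(random_state=0)).fit(X, y_cat)
cat_pred = clf_chain.predict(X)

# ...then feed those predictions to a regressor chain for the numeric targets.
X_aug = np.hstack([X, cat_pred])
reg_chain = RegressorChain(RandomForestRegressor(random_state=0)).fit(X_aug, y_num)
num_pred = reg_chain.predict(X_aug)
```

In practice you would probably generate the intermediate predictions with cross_val_predict rather than predicting on the training data itself, to avoid leaking the training labels into the regressor chain.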
I am trying to get a sense of the relationship between some independent variables and a dependent variable, and to quantify the importance of those variables for it. I came across methods like the random forest that can quantify the importance of variables and then predict the outcome. However, I have an issue with the nature of the data to be used with the random forest or similar methods. An example of the data structure is provided below; as you can see, the time series has some variables, like population and Age, that do not change with time but differ among the different cities, while other variables, such as temperature and the number of internet users, change through time and within each city. My question is: how can I quantify the importance of these variables on the "Y" variable? By the way, I would prefer to apply the method in a Python environment.
"How can I quantity the importance" is very common question also known as "feature-importance".
The feature importance depends on your model; with a regression you have importance in your coefficients, in random forest you can use (but, some would not recommend) the build-in feature_importances_ or better the SHAP-values. Further more you can use som correlaion i.e Spearman/Pearson correlation between your features and your target.
Unfortunately there is no "free lunch", you will need to decide that based on what you want to use it for, how your data looks like etc.
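A hedged sketch of a few of those options side by side, on stand-in data (the column names below just mimic the question and are not your real data):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

X_arr, y = make_regression(n_samples=300, n_features=5, n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=["population", "age_mean", "temperature", "internet_users", "noise"])

# Regression coefficients as importances (only meaningful on comparable feature scales).
print(pd.Series(LinearRegression().fit(X, y).coef_, index=X.columns))

# Random forest: built-in impurity importances and permutation importances.
rf = RandomForestRegressor(random_state=0).fit(X, y)
print(pd.Series(rf.feature_importances_, index=X.columns))
print(pd.Series(permutation_importance(rf, X, y, n_repeats=10, random_state=0).importances_mean,
                index=X.columns))

# Simple rank correlation between each feature and the target.
print(X.corrwith(pd.Series(y), method="spearman"))
```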
I think the one you came across might be Boruta, where you shuffle your variables, add the shuffled copies to your data set and then create a threshold based on the "best shuffled variable" in a Random Forest.
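A toy sketch of that shadow-feature idea (not the full Boruta algorithm), reusing the X and y from the sketch above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_shadow = rng.permuted(X.to_numpy(), axis=0)      # each column shuffled independently
X_all = np.hstack([X.to_numpy(), X_shadow])

rf = RandomForestRegressor(random_state=0).fit(X_all, y)
importances = rf.feature_importances_
threshold = importances[X.shape[1]:].max()         # importance of the best shuffled ("shadow") variable
keep = X.columns[importances[: X.shape[1]] > threshold]
print("kept features:", list(keep))
```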
My idea is as follows. Your outcome variable 'Y' has only a few possible values, so you can build a classifier (Random Forest is one of many existing classifiers) to predict, say, 'Y in [25-94, 95-105, 106-150]'. You then have three outcomes that rule each other out. (Other interval limits than 95 and 105 are possible, if that better suits your application.)
Some of your predictive variables are time series whereas others are constant, as you explain. You should use a sliding window technique where your classifier predicts 'Y' based on the time-related variables in, say, the month of January. It doesn't matter that some variables are constant, as the actual variable 'City' has the four outcomes '[City_1, City_2, City_3, City_4]'. Similarly, use 'Population' and 'Age_mean' as actual variables.
Once you use classifiers, many approaches to feature ranking and feature selection have been developed. You can use a web service like insight classifiers to do it for you, or download a package like Weka for that.
The key point is that you organize your model and its predictive variables such that a classifier can learn correctly.
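A hedged sketch of that setup; the toy frame below only mimics the structure described in the question, and the interval limits are the example ones from above:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({
    "City": ["City_1", "City_1", "City_2", "City_2", "City_3", "City_3", "City_4", "City_4"],
    "Month": ["Jan", "Feb"] * 4,
    "Population": [500, 500, 800, 800, 300, 300, 650, 650],
    "Age_mean": [31, 31, 28, 28, 40, 40, 35, 35],
    "Temperature": [2.1, 3.4, 5.0, 6.2, -1.0, 0.3, 4.4, 5.1],
    "Y": [90, 101, 120, 130, 60, 72, 99, 110],
})

# Bin Y into mutually exclusive classes and one-hot encode the categorical 'City'.
df["Y_class"] = pd.cut(df["Y"], bins=[25, 94, 105, 150], labels=["25-94", "95-105", "106-150"])
X = pd.get_dummies(df[["City", "Population", "Age_mean", "Temperature"]], columns=["City"])

clf = RandomForestClassifier(random_state=0).fit(X, df["Y_class"])
print(dict(zip(X.columns, clf.feature_importances_)))
```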
If city and month are also independent variables, you should convert them from the index into columns. If you use pandas to read your file, df.reset_index() can do the job for you.
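A tiny illustration of that, with a made-up MultiIndex:

```python
import pandas as pd

df = pd.DataFrame({"Y": [90, 101, 120]},
                  index=pd.MultiIndex.from_tuples(
                      [("City_1", "Jan"), ("City_1", "Feb"), ("City_2", "Jan")],
                      names=["City", "Month"]))

df = df.reset_index()        # City and Month move from the index into ordinary columns
print(df.columns.tolist())   # ['City', 'Month', 'Y']
```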
In my dataset X I have two continuous variables a, b and two boolean variables c, d, making a total of 4 columns.
I have a multidimensional target y consisting of two continuous variables A, B and one boolean variable C.
I would like to train a model on the columns of X to predict the columns of y. However, having tried LinearRegression on X, it didn't perform so well (my variables vary over several orders of magnitude and I have to apply suitable transforms to take logarithms; I won't go into too much detail here).
I think I need to use LogisticRegression on the boolean columns.
What I'd really like to do is combine both LinearRegression on the continuous variables and LogisticRegression on the boolean variables into a single pipeline. Note that all the columns of y depend on all the columns of X, so I can't simply train the continuous and boolean variables independently.
Is this even possible, and if so how do I do it?
I've used something called a "Model Tree" (see link below) for the same sort of problem.
https://github.com/ankonzoid/LearningX/tree/master/advanced_ML/model_tree
But it will need to be customized for your application. Please ask more questions if you get stuck using it.
If your target data Y has multiple columns you need to use a multi-task learning approach. Scikit-learn contains some multi-task learning algorithms for regression, like multi-task elastic-net, but you cannot combine logistic regression with linear regression in a single estimator because these algorithms optimize different loss functions. You may also try neural networks for your problem.
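For the two continuous targets alone, a minimal sketch of the multi-task route (the boolean target would still need its own classifier, as noted above; the data here is random stand-in data):

```python
import numpy as np
from sklearn.linear_model import MultiTaskElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))            # stands in for columns a, b, c, d
Y_cont = X @ rng.normal(size=(4, 2))     # stands in for the continuous targets A and B

model = MultiTaskElasticNet(alpha=0.1).fit(X, Y_cont)
print(model.coef_.shape)                 # (2, 4): one row of coefficients per target
```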
What I understand you want to do is train a single model that predicts both the continuous variables and a class. You would need to combine both losses into one single loss to be able to do that, which I don't think is possible in scikit-learn. I suggest you use a deep learning framework (TensorFlow, PyTorch, etc.) to implement your own model with the properties you need, which would be more flexible, and tinkering with such a neural network may also improve your results.
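A rough PyTorch sketch of that combined-loss idea, with shapes matching the question (4 input columns, 2 continuous targets, 1 boolean target); the architecture and data are made up:

```python
import torch
import torch.nn as nn

class TwoHeadNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(4, 32), nn.ReLU())
        self.reg_head = nn.Linear(32, 2)   # predicts the continuous targets A and B
        self.clf_head = nn.Linear(32, 1)   # predicts the logit of the boolean target C

    def forward(self, x):
        h = self.backbone(x)
        return self.reg_head(h), self.clf_head(h)

net = TwoHeadNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
mse, bce = nn.MSELoss(), nn.BCEWithLogitsLoss()

# Stand-in data; replace with your own tensors.
X = torch.randn(256, 4)
y_cont, y_bool = torch.randn(256, 2), torch.randint(0, 2, (256, 1)).float()

for _ in range(100):
    pred_cont, pred_logit = net(X)
    loss = mse(pred_cont, y_cont) + bce(pred_logit, y_bool)  # one combined loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```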
I have a model trained using LightGBM (LGBMRegressor), in Python, with scikit-learn.
On a weekly basis the model is re-trained, and an updated set of chosen features and associated feature_importances_ are plotted. I want to compare these magnitudes across the different weeks, to detect (abrupt) changes in the set of chosen variables and in the importance of each of them. But I fear the raw importances are not directly comparable (I am using the default split option, which gives a feature importance based on the number of times the feature is used to split across the trees).
My question is: assuming these raw importances from different weeks are not directly comparable, is a normalization enough to allow the comparison? If so, what is the best way to do such a normalization (maybe just dividing by the highest importance of the week)?
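For what it's worth, a small sketch of the normalization the question describes (the raw importances below are made up):

```python
import pandas as pd

raw = pd.Series({"feat_a": 120.0, "feat_b": 45.0, "feat_c": 15.0})  # one week's raw importances

share = raw / raw.sum()      # each feature's share of the total importance that week
relative = raw / raw.max()   # or: relative to the week's most important feature
print(share, relative, sep="\n\n")
```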
I am wondering whether there exists some correlation among the hyperparameters of two different classifiers.
For example: let us say that we run LogisticRegression on a dataset with the best hyperparameters (found through GridSearch) and want to run another classifier, like SVC (the SVM classifier), on the same dataset. Instead of finding all the hyperparameters using GridSearch, can we fix some values (or reduce their ranges to limit the search space for GridSearch)?
As an experiment, I used scikit-learn classifiers like LogisticRegression, SVC, LinearSVC, SGDClassifier and Perceptron to classify some well-known datasets. In some cases, I am able to see some correlation empirically, but not always for all datasets.
So please help me clear up this point.
I don't think you can correlate different parameters of different classifiers like this. This is mainly because each classifier behaves differently, as it has its own way of fitting the data with its own set of equations. For example, take the case of SVC with two different kernels, rbf and sigmoid. It might be the case that rbf fits perfectly over the data with the regularization parameter C set to, say, 0.001, while the sigmoid kernel over the same data may fit with a C value of 0.00001. Both values may also be equal. However, you can never say that for sure. When you say that:
In some cases, I am able to see some correlation empirically, but not always for all datasets.
It may simply be a coincidence, since it all depends on the data and the classifiers. You cannot apply it globally. Correlation does not always equal causation.
You can visit this site and see for yourself that although different regressor functions have the same parameter a, their equations are vastly different, and hence over the same dataset you might get drastically different values of a.
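For completeness, a small sketch of the workflow the question describes; narrowing the SVC grid around the logistic-regression C is only a heuristic and may miss the optimum, for the reasons above (the dataset and ranges are arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Full grid search for logistic regression.
lr_search = GridSearchCV(LogisticRegression(max_iter=5000), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
lr_search.fit(X, y)
best_c = lr_search.best_params_["C"]

# Reduced grid for SVC, centered on the value found above.
svc_search = GridSearchCV(SVC(), {"C": [best_c / 10, best_c, best_c * 10]}, cv=5)
svc_search.fit(X, y)

print(lr_search.best_params_, svc_search.best_params_)
```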