I have a dataset at my disposal that consists of around 500 columns, which I need to explore so that I can keep only the relevant ones. Pandas' info(verbose=True) method does not even display that many columns properly. I also used the missingno library to visualise nulls, but it uses a lot of RAM. What can I use instead of matplotlib here?
How do you approach datasets with a lot of features (more than 100)? Is there a useful workflow for eliminating useless features? How should I use info(), or is there an alternative?
Yeah, I also used the expanded display options below to view everything. The question here is how to set them locally rather than globally.
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
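If by "locally" you mean only for a single statement or block, one option is pandas' option_context context manager, which scopes the settings to a with block (a minimal sketch; df stands for your 500-column DataFrame):

import pandas as pd

# Options set inside the context manager apply only within the with block
# and revert to their previous values afterwards.
with pd.option_context('display.max_rows', 500,
                       'display.max_columns', 500,
                       'display.width', 1000):
    print(df.describe())  # df is assumed to be your DataFrame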
UPDATE:
Methods or solutions for exploring initial raw data are of interest, for instance a one-cell script that summarises numerical features as distributions, categorical features as counts, and possibly something else. I could write this myself, but maybe there is a library, or a function of yours, that already does this?
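Libraries such as pandas-profiling (now ydata-profiling) generate this kind of report automatically; if you prefer plain pandas, a minimal hand-rolled sketch could look like this (the grouping of columns via select_dtypes is just one possible choice):

import pandas as pd

def quick_summary(df: pd.DataFrame) -> None:
    """Rough one-cell overview: distributions for numeric columns,
    counts for categorical ones, and the worst null offenders."""
    num_cols = df.select_dtypes(include='number').columns
    cat_cols = df.select_dtypes(exclude='number').columns

    print(df[num_cols].describe().T)                                # numeric distributions
    for col in cat_cols:
        print(df[col].value_counts(dropna=False).head(10))          # top categories per column
    print(df.isna().mean().sort_values(ascending=False).head(20))   # columns with most nulls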
Regarding the issue of useless features, you can estimate metrics associated with feature effectiveness and filter features out using a threshold. Check out the sklearn feature selection docs.
Of course, before doing that you'll have to make sure the features are numeric and that their representation is suitable for the tests of your choice. To do that, I suggest you check out the sklearn pipelines (optional) and preprocessing docs.
Before estimating feature usefulness, make sure you have covered missing-data handling, categorical encoding and feature scaling.
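As a minimal sketch of that workflow (the imputation, encoding, scaling and SelectKBest choices are only illustrative, and X, y are assumed to be your raw features and target):

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

num_cols = X.select_dtypes(include='number').columns
cat_cols = X.select_dtypes(exclude='number').columns

preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), num_cols),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), cat_cols),
])

# f_classif assumes a classification target; use f_regression for regression.
selector = Pipeline([('prep', preprocess),
                     ('kbest', SelectKBest(f_classif, k=50))])   # keep the 50 best-scoring features
X_selected = selector.fit_transform(X, y)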
You can use XGBoost's feature_importances_ attribute. You first need to train a model on your data with XGBoost and then, using feature_importances_, keep only the important features (by setting a threshold of your choice).
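A minimal sketch of that idea, using sklearn's SelectFromModel to apply the threshold (the threshold value is arbitrary, X, y are assumed to be numeric features and the target, and the classifier is just an example; swap in XGBRegressor for a regression target):

import xgboost as xgb
from sklearn.feature_selection import SelectFromModel

model = xgb.XGBClassifier(n_estimators=200).fit(X, y)

# Keep only features whose importance exceeds the (arbitrary) threshold.
selection = SelectFromModel(model, threshold=0.01, prefit=True)
X_reduced = selection.transform(X)
print(X.shape, '->', X_reduced.shape)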
Dimensionality reduction can also come in handy, using PCA or some other algorithm.
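For instance (a sketch; X is assumed to be numeric, and the 95% explained-variance target is an arbitrary choice):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA is scale-sensitive, so standardise first.
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=0.95).fit_transform(X_scaled)   # keep 95% of the variance
print(X_scaled.shape, '->', X_reduced.shape)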
I have a set of alphanumeric categorical features (c_1, c_2, ..., c_n) and one numeric target variable (the prediction) as a pandas dataframe. Can you suggest any feature selection algorithm that I can use for this data set?
I'm assuming you are solving a supervised learning problem like Regression or Classification.
First of all, I suggest transforming the categorical features into numeric ones using one-hot encoding. Pandas provides a useful function that already does it:
dataset = pd.get_dummies(dataset, columns=['feature-1', 'feature-2', ...])
If you have a limited number of features and a model that is not too computationally expensive, you can test every possible combination of features. This is the best approach, but it is seldom a viable option.
A possible alternative is to sort all the features by their correlation with the target, then sequentially add them to the model, measure the goodness of the model at each step, and select the set of features that provides the best performance (a sketch of this approach is shown below).
If you have high-dimensional data, you can consider reducing the dimensionality using PCA or another dimensionality reduction technique, which projects the data into a lower-dimensional space and reduces the number of features. Obviously you will lose some information due to the approximation.
These are only some examples of methods for performing feature selection; there are many others.
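A rough sketch of the second approach, correlation ranking followed by sequential addition (the estimator and the 5-fold scoring are placeholder choices; X is assumed to be the one-hot encoded feature DataFrame and y the numeric target):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Rank features by absolute correlation with the target.
ranked = X.astype(float).corrwith(y).abs().sort_values(ascending=False).index

best_score, best_features, selected = -np.inf, [], []
for feature in ranked:
    selected.append(feature)
    score = cross_val_score(LinearRegression(), X[selected], y, cv=5).mean()
    if score > best_score:
        best_score, best_features = score, list(selected)

print(best_features, best_score)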
Final tips:
Remember to split the data into Training, Validation and Test set.
Often data normalization is recommended to obtain better results.
Some models have embedded mechanisms for performing feature selection (Lasso, Decision Trees, ...).
I am trying to find the best features in my data for classification. For this I want to try feature selection using SVM, KNN, LDA and QDA.
Also, the way to test this data is a leave-one-out approach rather than cross-validation by splitting the data into parts (basically, I can't split a single file/matrix, but have to leave one file out for testing while training with the other files).
I tried using sequential feature selection (sfs) with SVM in Matlab, but I keep getting only the first feature and nothing else (there are 254 features).
Is there any way to do this in Python or Matlab?
If you're trying to code the feature selector from scratch, I think you'd better first dig deeper into the theory of your algorithm of choice.
But if you're looking for a way to get results faster, scikit-learn provides you with a variety of tools for feature selection. Have a look at this page.
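As a minimal sketch of the leave-one-file-out setup in Python (scikit-learn >= 0.24 for SequentialFeatureSelector; the data shapes, group labels and number of features to select are made-up placeholders):

import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import LeaveOneGroupOut

# Placeholder data: 60 samples, 254 features, one "file id" per sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 254))
y = rng.integers(0, 2, size=60)
groups = np.repeat(np.arange(10), 6)        # 10 files, 6 samples each

# Pre-compute the leave-one-file-out splits and pass them as the cv argument.
cv = list(LeaveOneGroupOut().split(X, y, groups=groups))

sfs = SequentialFeatureSelector(SVC(kernel='linear'),
                                n_features_to_select=10,
                                direction='forward',
                                cv=cv)
sfs.fit(X, y)
print(np.flatnonzero(sfs.get_support()))    # indices of the selected features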
Using two different methods of XGBoost feature importance gives me two different most important features. Which one should be believed?
Which method should be used when? I am confused.
Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import xgboost as xgb
df = sns.load_dataset('mpg')
df = df.drop(['name','origin'],axis=1)
X = df.iloc[:,1:]
y = df.iloc[:,0]
Numpy arrays
# fit the model
model_xgb_numpy = xgb.XGBRegressor(n_jobs=-1,objective='reg:squarederror')
model_xgb_numpy.fit(X.to_numpy(), y.to_numpy())
plt.bar(range(len(model_xgb_numpy.feature_importances_)), model_xgb_numpy.feature_importances_)
Pandas dataframe
# fit the model
model_xgb_pandas = xgb.XGBRegressor(n_jobs=-1,objective='reg:squarederror')
model_xgb_pandas.fit(X, y)
axsub = xgb.plot_importance(model_xgb_pandas)
Problem
The numpy method shows the 0th feature, cylinders, as most important. The pandas method shows model_year as most important. Which one is the CORRECT most important feature?
References
How to get feature importance in xgboost?
Feature importance 'gain' in XGBoost
It is hard to define THE correct feature importance measure; each has its pros and cons. It is a wide topic with no golden rule as of now, and I would personally suggest reading this online book by Christoph Molnar: https://christophm.github.io/interpretable-ml-book/. The book has an excellent overview of the different measures and different algorithms.
As a rule of thumb, if you cannot use an external package, I would choose gain, as it is more representative of what one is interested in (one is typically not interested in the raw number of splits on a particular feature, but rather in how much those splits helped); see this question for a good summary: https://datascience.stackexchange.com/q/12318/53060. If you can use other tools, SHAP exhibits very good behaviour, and I would always choose it over the built-in xgb tree measures, unless computation time is strongly constrained.
As for the difference that you pointed at directly in your question, the root of it is that xgb.plot_importance uses weight as the default importance type, while XGBModel itself uses gain as the default. If you configure them to use the same importance type, you will get similar distributions (up to the additional normalisation in feature_importances_ and the sorting in plot_importance).
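For example, a quick way to check this on the model fitted above (plot_importance accepts an importance_type argument, and the booster's get_score exposes the raw per-feature numbers):

# Make both views use the same importance type so they agree.
xgb.plot_importance(model_xgb_pandas, importance_type='gain')

# Raw scores behind each view:
print(model_xgb_pandas.get_booster().get_score(importance_type='gain'))
print(model_xgb_pandas.get_booster().get_score(importance_type='weight'))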
There are 3 ways to get feature importance from XGBoost:
use the built-in feature importance (I prefer the gain type),
use permutation-based feature importance,
use SHAP values to compute feature importance.
In my post I wrote code examples for all 3 methods. Personally, I'm using permutation-based feature importance. In my opinion, the built-in feature importance can show features as important after overfitting to the data (this is just an opinion based on my experience). SHAP explanations are fantastic, but sometimes computing them can be time-consuming (and you may need to downsample your data).
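If it helps, a minimal sketch of the permutation-based option with scikit-learn, reusing the model_xgb_pandas fit from the question (ideally you would score on held-out data rather than the training set):

from sklearn.inspection import permutation_importance

result = permutation_importance(model_xgb_pandas, X, y, n_repeats=10, random_state=0)
for name, mean, std in zip(X.columns, result.importances_mean, result.importances_std):
    print(f'{name}: {mean:.4f} +/- {std:.4f}')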
From the answer here, which gives a neat explanation:
feature_importances_ returns weights - what we usually think of as "importance".
plot_importance returns the number of occurrences in splits.
Note: I think that the selected answer above does not actually cover the point.
I have two data sets, a training set and a test set.
If I have NA values in the training set but not in the test set, I usually drop those rows (if they are few) in the training set and that's all.
But now I have a lot of NA values in both sets, so I have dropped the features in which most of the values were missing, and I am wondering what to do next.
Should I just drop the same features in the test set and impute the remaining missing values?
Is there any other technique I could use to preprocess the data?
Can machine learning algorithms like Logistic Regression, Decision Trees or Neural Networks handle missing values?
The data sets come from a Kaggle competition, so I can't do the preprocessing before splitting the data.
Thanks in advance
This question is not so easy to answer, because it depends on the type of NA values.
Are the NA values missing for some random reason? Or is there a reason they are missing (no matching multiple-choice answer in a survey, or perhaps something people would not like to answer)?
For the first case, it would be fine to use a simple imputation strategy so that you can fit your model on the data. By that I mean something like mean imputation, sampling from an estimated probability distribution, or even sampling values at random. Note that if you simply take the mean of the existing values, you change the statistics of the dataset, i.e. you reduce the standard deviation. You should keep that in mind when choosing your model.
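For the simple strategies, scikit-learn's SimpleImputer covers mean, median and most-frequent imputation out of the box (a minimal sketch with made-up columns):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'age': [25, np.nan, 40, 31], 'income': [50, 60, np.nan, 55]})

# In a train/test setting, fit on the training data and reuse the fitted imputer on the test data.
imputer = SimpleImputer(strategy='mean')            # or 'median', 'most_frequent'
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_filled)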
For the second case, you will have to apply your domain knowledge to find good fill values.
Regarding your last question: if you want to fill the values with a machine learning model, you can use the other features of the dataset and implicitly assume a dependency between the missing feature and the other features. Depending on the model you later use for prediction, you may not benefit from this intermediate estimation.
I hope this helps, but the correct answer really depends on the data.
In general, machine learning algorithms do not cope well with missing values (for mostly good reasons, as it is not known why they are missing or what it means for them to be missing, which could even differ between observations).
Good practice would be to do the preprocessing before the split between training and test sets (are your training and test data truly random subsets of the data, as they should be?) and ensure that both sets are treated identically.
There is a plethora of ways to deal with your missing data, and which of them are best depends strongly on the data as well as on your goals. Feel free to get in contact if you need more specific advice.
Missing values are a common problem in data analysis. One common strategy seems to be to replace missing values with values randomly sampled from the distribution of the existing values.
Is there Python library code that conveniently performs this preprocessing step on a data frame? As far as I can see, the sklearn.preprocessing module does not offer this strategy.
To sample from a distribution of existing values you need to know the distribution. If the distribution is not known you can use kernel density estimation to fit it. This blog post has a nice overview of kernel density estimation implementations for Python: http://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/.
There is an implementation in scikit-learn (see http://scikit-learn.org/stable/modules/density.html#kernel-density); sklearn's KernelDensity has a .sample() method. There is also a kernel density estimator in statsmodels (http://statsmodels.sourceforge.net/devel/generated/statsmodels.nonparametric.kernel_density.KDEMultivariate.html); it supports categorical features.
Another approach is to choose random existing values, without trying to generate values not seen in the dataset. The problem with this solution is that a value could depend on the other values in the same row, and random sampling without taking this into account may produce unrealistic examples.
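A minimal sketch of the "sample from existing values" strategy, applied per column and ignoring dependencies between columns (which is exactly the caveat above):

import numpy as np
import pandas as pd

def fill_by_sampling(df: pd.DataFrame, random_state: int = 0) -> pd.DataFrame:
    """Replace each NaN with a value drawn uniformly from the observed
    values of the same column (ignores cross-column dependencies)."""
    rng = np.random.default_rng(random_state)
    out = df.copy()
    for col in out.columns:
        mask = out[col].isna()
        observed = out.loc[~mask, col].to_numpy()
        if mask.any() and len(observed) > 0:
            out.loc[mask, col] = rng.choice(observed, size=mask.sum())
    return out

df = pd.DataFrame({'a': [1.0, np.nan, 3.0, np.nan], 'b': ['x', 'y', None, 'x']})
print(fill_by_sampling(df))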