Forward feature selection with custom criterion - python

I am trying to find the best features in my data for classification. For this I want to try feature selection using SVM, KNN, LDA and QDA.
Also, the way to test this data is a leave-one-out approach, not cross-validation by splitting the data into parts (basically I can't split a single file/matrix, but have to leave one file out for testing while training with the other files).
I tried using sfs with SVM in Matlab, but it keeps returning only the first feature and nothing else (there are 254 features).
Is there any way to do this in Python or Matlab?

If you're trying to code the feature selector from scratch, I think you'd better first get a deeper understanding of the theory behind your algorithm of choice.
But if you're looking for a way to get results faster, scikit-learn provides a variety of tools for feature selection. Have a look at this page.
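If you go the scikit-learn route, a hedged sketch of what that could look like is below: it combines SequentialFeatureSelector (forward direction) with LeaveOneGroupOut so that whole files are held out, which matches the leave-one-file-out requirement. The toy data, the group layout and the number of features to select are assumptions, not part of the original question.

```python
# Rough sketch, not a drop-in solution: forward feature selection with an SVM,
# scored with a leave-one-file-out scheme.
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 254))               # stand-in for the 254-feature matrix
y = rng.integers(0, 2, size=120)
groups = np.repeat(np.arange(6), 20)          # which "file" each row came from (assumed layout)

# Leave-one-file-out: each split holds out every row of one file.
file_splits = list(LeaveOneGroupOut().split(X, y, groups))

selector = SequentialFeatureSelector(
    SVC(kernel="linear"),
    n_features_to_select=10,                  # arbitrary target count
    direction="forward",
    cv=file_splits,
)
selector.fit(X, y)
print("selected feature indices:", np.flatnonzero(selector.get_support()))
```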

Related

Sample to choose when using Least square method v/s sklearn Regression method?

While using the sklearn LinearRegression library, since we split the data using train_test_split, do we have to use only the training data for OLS (the least squares method), or can we use the full data for OLS and derive the regression result from that?
There are many mistakes that data scientists make as beginners, and one of them is to use the test data as part of the learning process; look at this diagram from here:
As you can see, the data is separated during the training process, and it is really important that it stays that way.
Now, regarding the least squares question: while you may think that using the full data improves the process, you are forgetting about the evaluation part. The evaluation would then look better not because the regression is better, but simply because you have shown the model the data you are testing it with.
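A minimal sketch of the intended workflow, using synthetic data for illustration: fit the least-squares LinearRegression on the training split only, and keep the test split purely for evaluation.

```python
# Minimal sketch on synthetic data: OLS is fit on the training split only;
# the held-out split is used purely for evaluation.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

ols = LinearRegression().fit(X_train, y_train)        # least squares on training data only
print("held-out R^2:", r2_score(y_test, ols.predict(X_test)))
```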

how to do cross validation and hyper parameter tuning for huge dataset?

I have a CSV file of 10+ GB. I used the chunksize parameter available in pandas.read_csv() to read and pre-process the data, and for training the model I want to use one of the online learning algorithms.
Normally, cross-validation and hyper-parameter tuning are done on the entire training data set, and the model is then trained with the best hyper-parameters. But in the case of huge data, if I do the same on a chunk of the training data, how do I choose the hyper-parameters?
I believe you are looking for online learning algorithms like the ones mentioned in this link: Scaling Strategies for large datasets. You should use algorithms that support the partial_fit method to load these large datasets in chunks (a minimal sketch follows the links below). You can also look at the following links to see which one helps you best, since you haven't specified the exact problem or algorithm you are working on:
Numpy save partial results in RAM
Scaling Computationally - Sklearn
Using Large Datasets in Sklearn
Comparison of various Online Solvers - Sklearn
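A minimal sketch of the chunked partial_fit idea (the file name, chunk size and column names below are placeholders, not from the question):

```python
# Stream the CSV in chunks with pandas and update an online learner with
# partial_fit. "big_file.csv" and the "target" column are placeholders.
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()                         # linear model with partial_fit support
classes = [0, 1]                              # assumed labels; required on the first call

for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
    X = chunk.drop(columns=["target"]).to_numpy()
    y = chunk["target"].to_numpy()
    clf.partial_fit(X, y, classes=classes)
```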
EDIT: If you want to solve the class imbalance problem, you can try this: the imbalanced-learn library in Python.

how to predict binary outcome with categorical and continuous features using scikit-learn?

I need advice choosing a model and machine learning algorithm for a classification problem.
I'm trying to predict a binary outcome for a subject. I have 500,000 records in my data set and 20 continuous and categorical features. Each subject has 10--20 records. The data is labeled with its outcome.
So far I'm thinking logistic regression model and kernel approximation, based on the cheat-sheet here.
I am unsure where to start when implementing this in either R or Python.
Thanks!
Choosing an algorithm and optimizing its parameters is a difficult task in any data mining project, because it must be customized for your data and problem. Try different algorithms like SVM, Random Forest, Logistic Regression, KNN, etc., evaluate each with cross-validation, and then compare them.
You can use GridSearchCV in scikit-learn to try different parameters and optimize them for each algorithm (a small example follows). You can also try this project, which tests a range of parameters with a genetic algorithm.
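A small GridSearchCV sketch on synthetic data (the estimator and the grid values are arbitrary placeholders):

```python
# Small GridSearchCV sketch; the parameter grid is arbitrary.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```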
Features
If your categorical features don't have too many possible different values, you might want to have a look at sklearn.preprocessing.OneHotEncoder.
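A small illustration of what that looks like (the column values are made up; sparse_output is the scikit-learn >= 1.2 name of the parameter, older versions use sparse=False):

```python
# One-hot encode a toy categorical column.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([["red"], ["green"], ["blue"], ["green"]])
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
print(encoder.fit_transform(colors))
# [[0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]]
```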
Model choice
The choice of "the best" model depends mainly on the amount of available training data and the simplicity of the decision boundary you expect to get.
You can try dimensionality reduction to 2 or 3 dimensions. Then you can visualize your data and see if there is a nice decision boundary.
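For example, a quick PCA projection to 2 dimensions (synthetic data here, assuming the features are already numeric):

```python
# Project to 2 components and scatter-plot the two classes.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_2d = PCA(n_components=2).fit_transform(X)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5)
plt.show()
```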
With 500,000 training examples you can think about using a neural network. I can recommend Keras for beginners and TensorFlow for people who know how neural networks work.
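A tiny Keras sketch for a binary classifier on 20 numeric features (layer sizes are arbitrary, the data is a random placeholder, and it assumes the categorical features have already been encoded):

```python
# Minimal binary classifier in Keras; data and layer sizes are placeholders.
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

X = np.random.rand(1000, 20).astype("float32")   # placeholder data
y = np.random.randint(0, 2, size=1000)
model.fit(X, y, epochs=3, batch_size=32, verbose=0)
```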
You should also know that there are Ensemble methods.
A nice cheat sheet for what to use is in the sklearn tutorial you already found (source: scikit-learn.org).
Just try it, compare different results. Without more information it is not possible to give you better advice.

How to map XGBoost predictions to the corresponding data rows?

XGBoost generates a list of predictions for the test dataset. My question is: how can I map the generated predictions to the actual rows of the test file? Is it strictly safe to assume that the nth prediction corresponds to the nth data row? XGBoost leverages multi-threading for its operations, so in such a setting can it be trusted that the prediction results strictly map to the test data rows? Ideally, I would love it if there were a way to annotate the predictions with some row identifier from the test data file.
I am using this example and working with DMatrix data format of XGBoost. https://github.com/dmlc/xgboost/tree/master/demo/binary_classification
I'm not sure if it's strictly safe, but based on my experience that assumption works. Also, in most of the xgboost code snippets I have seen in Kaggle competitions like this one, folks make the same assumption and it works. In short, you can rest assured that it works; however, I haven't dug into the documentation, so I can't say that it works all the time.
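In practice, a hedged way to keep the mapping explicit is to attach the predictions back onto the original test DataFrame by position, since the prediction array comes back with one value per DMatrix row, in order. The file and column names below are placeholders:

```python
# Sketch: file/column names are placeholders; predictions come back in the
# same row order as the rows used to build the DMatrix.
import pandas as pd
import xgboost as xgb

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")                # assumed to contain features only

dtrain = xgb.DMatrix(train.drop(columns=["label"]), label=train["label"])
dtest = xgb.DMatrix(test)

bst = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=50)

test_with_preds = test.copy()
test_with_preds["prediction"] = bst.predict(dtest)   # nth prediction -> nth row
```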

Is it possible to adapt the scikit-learn CountVectorizer for other features (not just n-grams)?

I'm new to scikit-learn and to working with text data in general. I've been using the scikit-learn CountVectorizer as a start to get used to basic features of text data (n-grams), but I want to extend this to analyze other features.
I would prefer to adapt CountVectorizer rather than write my own, because then I wouldn't have to reimplement scikit-learn's tf-idf transformer and classifier.
EDIT:
I'm actually still thinking about specific features, to be honest, but for my project I want to do style classification between documents. I know that for text classification, lemmatizing and stemming are popular for feature extraction, so that may be one. Other features that I am thinking of analyzing include
Length of sentences per document within each style
Distinct words per style. A more formal style may have a more eloquent and varied vocabulary
An offshoot of the previous point, but counts of adjectives in particular
Lengths of particular words, again, slang might use much shorter phrases than a formal style
Punctuation, especially marked pauses between statements, lengths of statements
These are a few ideas I was thinking of, but I'm thinking of more features to test!
You can easily extend the class (you can see its source here) and implement what you need. However, it depends on what you want to do, which is not very clear in your question.
Are you asking how to implement the features you listed in terms of a scikit-learn compatible transformer? Then maybe have a look at the developer docs in particular rolling your own estimator.
You can just inherit from BaseEstimator and implement a fit and a transform. That is only necessary if you want to use pipelining, though. For using sklearn classifiers and the tfidf-transformer, it is only necessary that your feature extraction creates numpy arrays or scipy sparse matrices.
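A rough sketch of such a transformer, using a few of the stylistic features mentioned in the question (document length, average word length, punctuation count); it returns a plain numpy array, so it could be combined with a CountVectorizer via FeatureUnion or used on its own:

```python
# Hand-rolled transformer sketch; the specific features are illustrative only.
import string
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class StyleFeatures(BaseEstimator, TransformerMixin):
    def fit(self, raw_documents, y=None):
        return self                                   # nothing to learn

    def transform(self, raw_documents):
        rows = []
        for doc in raw_documents:
            words = doc.split()
            rows.append([
                len(words),                                          # document length
                np.mean([len(w) for w in words]) if words else 0.0,  # average word length
                sum(doc.count(p) for p in string.punctuation),       # punctuation count
            ])
        return np.array(rows)
```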
