Feature Selection for Supervised Learning - python

import numpy as np
from sklearn import svm
from sklearn.feature_selection import SelectKBest, f_classif
I have 3 labels (male, female, na), denoted as follows:
labels = [0,1,2]
Each label was defined by 3 features (height, weight, and age) as the training data:
Training data for males:
male_height = np.array([111,121,137,143,157])
male_weight = np.array([60,70,88,99,75])
male_age = np.array([41,32,73,54,35])
males = np.vstack([male_height,male_weight,male_age]).T
Training data for females:
female_height = np.array([91,121,135,98,90])
female_weight = np.array([32,67,98,86,56])
female_age = np.array([51,35,33,67,61])
females = np.vstack([female_height,female_weight,female_age]).T
Training data for not availables:
na_height = np.array([96,127,145,99,91])
na_weight = np.array([42,97,78,76,86])
na_age = np.array([56,35,49,64,66])
nas = np.vstack([na_height,na_weight,na_age]).T
So, the complete training data are:
trainingData = np.vstack([males,females,nas])
Complete labels are:
labels = np.repeat(labels,5)
Now, I want to select the best features, output their names, and apply only those best features for fitting the support vector machine model.
I tried below according to the answer from #eickenberg and the comments from #larsmans
selector = SelectKBest(f_classif, k=keep)
clf = make_pipeline(selector, StandardScaler(), svm.SVC())
clf.fit(trainingData, labels)
selected = trainingData[selector.get_support()]
print selected
[[111 60 41]
[121 70 32]]
However, all the selected elements belongs to the label 'male' with the features: height, weight, and age respectively. I could not figure out where I am messing up? Could someone guide me into right direction?

You can use e.g. SelectKBest as follows
from sklearn.feature_selection import SelectKBest, f_classif
keep = 2
selector = SelectKBest(f_classif, k=keep)
and place it into your pipeline
pipe = make_pipeline(selector, StandardScaler(), svm.SVC())
pipe.fit(trainingData, labels)

To be honest, I have used the Support Vector Machine Model on text classification (which is an entirely different problem altogether). But, through that experience, I can confidently say that the more features you have, the better your predictions will be.
To summarize, do not filter out the features that are most important because the Support Vector Machine will make use of features no matter how little importance.
But, if this is a huge necessity, look into scikit learn's Random Forest Classifier. It can accurately assess which features are more important, using the "feature_importances_" attribute.
Here's an example of how I would use it (code not tested):
clf = RandomForestClassifier() #tweak the parameters yourself
clf.fit(X,Y) #if you're passing in a sparse matrix, apply .toarray() to X
print clf.feature_importances_
Hope that helps.

Related

Why is this accuracy of this Random forest sentiment classification so low?

I want to use RandomForestClassifier for sentiment classification. The x contains data in string text, so I used LabelEncoder to convert strings. Y contains data in numbers. And my code is this:
import pandas as pd
import numpy as np
from sklearn.model_selection import *
from sklearn.ensemble import *
from sklearn import *
from sklearn.preprocessing.label import LabelEncoder
data = pd.read_csv('data.csv')
x = data['Reviews']
y = data['Ratings']
le = LabelEncoder()
x_encoded = le.fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(x_encoded,y, test_size = 0.2)
x_train = x_train.reshape(-1,1)
x_test = x_test.reshape(-1,1)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
Then I printed out the accuracy like below:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
And here's the output:
Accuracy: 0.5975
I have read that Random forests has high accuracy, because of the number of decision trees participating in the process. But I think that the accuracy is much lower than it should be. I have looked for some similar questions on Stack Overflow, but I couldn't find a solution for my problem.
Is there any problem in my code using Random Forest library? Or is there any exceptions of cases when using Random forest?
It is not a problem regarding Random Forests or the library, it is rather a problem how you transform your text input into a feature or feature vector.
What LabelEncoding does is; given some labels like ["a", "b", "c"] it transforms those labels into numeric values between 0 and n-1 with n-being the number of distinct input labels. However, I assume Reviews contain texts and not pure labels so to say. This means, all your reviews (if not 100% identical) are transformed into different labels. Eventually, this leads to your classifier doing random stuff. give that input. This means you need something different to transform your textual input into a numeric input that Random Forests can work on.
As a simple start, you can try something like TfIDF or also some simple count vectorizer. Those are available from sklearn https://scikit-learn.org/stable/modules/feature_extraction.html section 6.2.3. Text feature extraction. There are more sophisticated ways of transforming texts into numeric vectors but that should be a good start for you to understand what has to happen conceptually.
A last important note is that you fit those vectorizers only on the training set and not on the full dataset. Otherwise, you might leak information from training to evaluation/testing. A good way of doing this would be to build a sklearn pipeline that consists of a feature transformation step and the classifier.

How to deal with dataset that contains both discrete and continuous data

I was training a model that contains 8 features that allows us to predict the probability of a room been sold.
Region: The region the room belongs to (an integer, taking value between 1 and 10)
Date:The date of stay (an integer between 1‐365, here we consider only one‐day
request)
Weekday: Day of week (an integer between 1‐7)
Apartment: Whether the room is a whole apartment (1) or just a room (0)
#beds:The number of beds in the room (an integer between 1‐4)
Review: Average review of the seller (a continuous variable between 1 and 5)
Pic Quality: Quality of the picture of the room (a continuous variable between 0 and 1)
Price: he historic posted price of the room (a continuous variable)
Accept:Whether this post gets accepted (someone took it, 1) or not (0) in the end
Column Accept is the "y". Hence, this is a binary classification.
We have plot the data and some of the data were skewed so we applied power transform.
We tried a neural network, ExtraTrees, XGBoost, Gradient boost, Random forest. They all gave about 0.77 AUC. However, when we tried them on the test set, the AUC dropped to 0.55 with a precision of 27%.
I am not sure where when wrong but my thinking was that the reason may due to the mixing of discrete and continuous data. Especially some of them are either 0 or 1.
Can anyone help?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
df_train = pd.read_csv('case2_training.csv')
X, y = df_train.iloc[:, 1:-1], df_train.iloc[:, -1]
y = y.astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer()
transform_list = ['Pic Quality', 'Review', 'Price']
X_train[transform_list] = pt.fit_transform(X_train[transform_list])
X_test[transform_list] = pt.transform(X_test[transform_list])
for i in transform_list:
df = X_train[i]
ax = df.plot.hist()
ax.set_title(i)
plt.show()
# Normalization
sc = MinMaxScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
X_train = X_train.astype(np.float32)
X_test = X_test.astype(np.float32)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=123, n_estimators=50)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC metric
train_accuracy = roc_auc_score(y_test, yhat[:,-1])
print("AUC",train_accuracy)
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=123, n_estimators=50)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC metric
train_accuracy = roc_auc_score(y_test, yhat[:,-1])
print("AUC",train_accuracy)
from torch import nn
from skorch import NeuralNetBinaryClassifier
import torch
model = nn.Sequential(
nn.Linear(8,64),
nn.BatchNorm1d(64),
nn.GELU(),
nn.Linear(64,32),
nn.BatchNorm1d(32),
nn.GELU(),
nn.Linear(32,16),
nn.BatchNorm1d(16),
nn.GELU(),
nn.Linear(16,1),
# nn.Sigmoid()
)
net = NeuralNetBinaryClassifier(
model,
max_epochs=100,
lr=0.1,
# Shuffle training data on each epoch
optimizer=torch.optim.Adam,
iterator_train__shuffle=True,
)
net.fit(X_train, y_train)
from xgboost.sklearn import XGBClassifier
clf = XGBClassifier(silent=0,
learning_rate=0.01,
min_child_weight=1,
max_depth=6,
objective='binary:logistic',
n_estimators=500,
seed=1000)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC metric
train_accuracy = roc_auc_score(y_test, yhat[:,-1])
print("AUC",train_accuracy)
Here is an attachment of a screenshot of the data.
Sample data
This is the fundamental first step of Data Analytics. You need to do two things here:
Data understanding - do the data fields in their current format make sense (data types, value range etc.)
Data preparation - what should I do to update these data fields before passing them to our model? Also which inputs do you think will be useful for your model and which will provide little benefit? Are there outliers I need to consider/handle?
A good book if you're starting in the field of data analytics is Fundamentals of Machine Learning for Predictive Data Analytics (I have no affiliation with this book).
Looking at your dataset there's a couple of things you could try to see how it influences your prediction results:
Unless region order is actually ranked in importance/value I would change this to a one hot encoded feature, you can do this in sklearn. Otherwise you run the risk of your model thinking that regions with a higher number (say 10) are more important than regions with a lower value (say 1).
You could attempt to normalise certain categories if they are much larger than some of your other data fields Why Data Normalization is necessary for Machine Learning models
Consider looking at the Kaggle competition House Prices: Advanced Regression Techniques. It's doing a similar thing to what you're attempting to do, and it might have some pointers for how you should approach the problem in the Notebooks and Discussion tabs.
Without deeply exploring all the data you are using it is hard to say for certain what is causing the drop in accuracy (or AUC) when moving from your training set to the testing set. It is unlikely to be caused by the mixed discrete/continuous data.
The drop just suggests that your models are over-fitting to your training data (and therefore not transferring well). This could be caused by too many learned parameters (given the amount of data you have)--more often a problem with neural networks than with some of the other methods you mentioned. Or, the problem could be with the way the data was split into training/testing. If the distribution of the data has a significant difference (that's maybe not obvious) then you wouldn't expect the testing performance to be as good. If it were me, I'd look carefully at how the data was split into training/testing (assuming you have a reasonably large set of data). You may try repeating your experiments with a number of random training/testing splits (search k-fold cross validation if you're not familiar with it).
your model is overfit. try to make a simple model first and use a lower parameter value. for tree-based classification, scaling does not have any impact on the model.

How to weigh data points with sklearn training algorithms

I am looking to train either a random forest or gradient boosting algorithm using sklearn. The data I have is structured in a way that it has a variable weight for each data point that corresponds to the amount of times that data point occurs in the dataset. Is there a way to give sklearn this weight during the training process, or do I need to expand my dataset to a non-weighted version that has duplicate data points each represented individually?
You can definitely specify the weights while training these classifiers in scikit-learn. Specifically, this happens during the fit step. Here is an example using RandomForestClassifier but the same goes also for GradientBoostingClassifier:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np
data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 42)
Here I define some arbitrary weights just for the sake of the example:
weights = np.random.choice([1,2],len(y_train))
And then you can fit your model with these models:
rfc = RandomForestClassifier(n_estimators = 20, random_state = 42)
rfc.fit(X_train,y_train, sample_weight = weights)
You can then evaluate your model on your test data.
Now, to your last point, you could in this example resample your training set according to the weights by duplication. But in most real world examples, this could end up being very tedious because
you would need to make sure all your weights are integers to perform duplication
you would have to uselessly multiply the size of your data, which is memory-consuming and is most likely going to slow down the training procedure

scikit-learn SelectPercentile TFIDF data feature reduction

I am using the various mechanisms in scikit-learn to create a tf-idf representation of a training data set and a test set consisting of text features. Both data sets are preprocessed to use the same vocabulary so the features and the number of features are the same. I can create a model on the training data and assess its performance on the test data. I am wondering if I use SelectPercentile to reduce the number of features in the training set after transformation, how can identify the same features in the test set to utilise in prediction?
trainDenseData = trainTransformedData.toarray()
testDenseData = testTransformedData.toarray()
if ( useFeatureReduction== True):
reducedTrainData = SelectPercentile(f_regression,percentile=10).fit_transform(trainDenseData,trainYarray)
clf.fit(reducedTrainData, trainYarray)
# apply feature reduction to the test data
See code and comments below.
import numpy as np
from sklearn.datasets import make_classification
from sklearn import feature_selection
# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000,
n_features=10,
n_informative=3,
n_redundant=0,
n_repeated=0,
n_classes=2,
random_state=0,
shuffle=False)
sp = feature_selection.SelectPercentile(feature_selection.f_regression, percentile=30)
sp.fit_transform(X[:-1], y[:-1]) #here, training are the first 9 data vectors, and the last one is the test set
idx = np.arange(0, X.shape[1]) #create an index array
features_to_keep = idx[sp.get_support() == True] #get index positions of kept features
x_fs = X[:,features_to_keep] #prune X data vectors
x_test_fs = x_fs[-1] #take your last data vector (the test set) pruned values
print x_test_fs #these are your pruned test set values
You should store the SelectPercentile object, and use it to transform the test data:
select = SelectPercentile(f_regression,percentile=10)
reducedTrainData = select.fit_transform(trainDenseData,trainYarray)
reducedTestData = select.transform(testDenseData)

Train scikit SVM, customize score assessment

I plan on using scikit svm for class prediction.
I have a two-class dataset consisting of about 100 experiments. Each experiment encapsulates my data-points (vectors) + classification.
Training of an SVM according to http://scikit-learn.org/stable/modules/svm.html should straight forward.
I will have to put all vectors in an array and generate another array with the corresponding class labels, train SVM. However, in order to run leave-one-out error estimation, I need to leave out a specific subset of vectors - one experiment.
How do I achieve that with the available score function?
Cheers,
EL
You could manually train on everything but the one observation, using numpy indexing to drop it out. Then you can use any of sklearn's helpers to evaluate the classification. For example:
import numpy as np
from sklearn import svm
clf = svm.SVC(...)
idx = np.arange(len(observations))
preds = np.zeros(len(observations))
for i in idx:
is_train = idx != i
clf.fit(observations[is_train, :], labels[is_train])
preds[i] = clf.predict(observations[i, :])
Alternatively, scikit-learn has a helper to do leave-one-out, and another helper to get cross-validation scores:
from sklearn import svm, cross_validation
clf = svm.SVC(...)
loo = cross_validation.LeaveOneOut(len(observations))
was_right = cross_validation.cross_val_score(clf, observations, labels, cv=loo)
total_acc = np.mean(was_right)
See the user's guide for more. cross_val_score actually returns a score for each fold (which is a little strange IMO), but since we have one fold per observation, this will just be 0 if it was wrong and 1 if it was right.
Of course, leave-one-out is very slow and has terrible statistical properties to boot, so you should probably use KFold instead.

Categories

Resources