Selecting number of features for RFE model in Python

I have a dataset with more than 120 features, and I want to use RFE for selecting which features / column names I should use.
I have a problem because RFE is very slow. My code looks like this:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
full_df = pd.read_csv('data.csv')
x = full_df.iloc[:,:-1]
y = full_df.iloc[:,-1]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 42)
model = LogisticRegression(solver ='lbfgs')
for i in range(1, 120):
    rfe = RFE(model, i)
    fit = rfe.fit(x_train, y_train)
    acc = fit.score(x_test, y_test)
    print(acc)
    print(fit.support_)
My problem is this line: rfe = RFE(model, i). I do not know what the best value for i is, which is why I loop over it with for i in range(1, 120). Is there a better way to do this? Is there a function in scikit-learn that can determine both the best number of features and the names of those features?
Because this took too long, I changed my approach, and I want to know whether the new one is a good / correct approach.
First I ran PCA and found that each column contributes around 0.4-1%, except the last 9 columns, which contribute less than 0.00001%, so I removed them. Now I have 121 features.
from sklearn.decomposition import PCA

pca = PCA()
fit = pca.fit(x)
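For context, the per-column percentages mentioned above are usually read from the fitted PCA's explained-variance ratios; a minimal sketch using the pca object above (note that the ratios are per principal component rather than per original column):
import numpy as np

# fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)

# cumulative variance, e.g. how many components are needed to keep 99.9% of it
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(np.searchsorted(cumulative, 0.999) + 1)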
Then I split my data into train and test (with 121 features).
Then I used SelectFromModel and tested it with 4 different classifiers. Each classifier in SelectFromModel reduced the number of columns. I chose the number of columns determined by the classifier that gave me the best accuracy:
from sklearn.feature_selection import SelectFromModel

# clf is one of the 4 classifiers, already fitted on x_train / y_train
model = SelectFromModel(clf, prefit=True)
#train_score = clf.score(x_train, y_train)
test_score = clf.score(x_test, y_test)
column_res = model.transform(x_train).shape
And finally I used RFE, with the number of columns I got from SelectFromModel.
rfe = RFE(clf, number_of_columns)  # number_of_columns comes from the SelectFromModel step above
fit = rfe.fit(x_train, y_train)
acc = fit.score(x_test, y_test)
Is this a good approach, or did I do something wrong?
Also, if I got the best accuracy in SelectFromModel with one classifier, do I need to use the same classifier in RFE?
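For reference, scikit-learn also has RFECV, which automates exactly this search for the number of features by cross-validating each candidate count instead of looping over RFE by hand; a minimal sketch (reusing the x_train/x_test split and a logistic-regression model as in the first snippet):
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='lbfgs', max_iter=1000)

# cross-validate every candidate number of features and keep the best
rfecv = RFECV(estimator=model, step=1, cv=5, scoring='accuracy', n_jobs=-1)
rfecv.fit(x_train, y_train)

print(rfecv.n_features_)                 # selected number of features
print(x_train.columns[rfecv.support_])   # names of the selected columns
print(rfecv.score(x_test, y_test))       # accuracy using only those features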

Related

Should we apply normalization on whole data set or only X

I am doing a machine learning project in Python and trying all models on my data.
I am really confused about whether, for classification and for regression, I should apply normalization (z-score / standard-deviation scaling) to the whole dataset and then take the features (X) and output (y) from it, like this:
import pandas as pd

def normalize(df):
    from sklearn.preprocessing import MaxAbsScaler
    scaler = MaxAbsScaler()
    scaler.fit(df)
    scaled = scaler.transform(df)
    scaled_df = pd.DataFrame(scaled, columns=df.columns)
    return scaled_df

data = normalize(data)
X = data.drop('col', axis=1)
y = data['col']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Or should I apply it only to the features (X)?
X = data.drop('col', axis=1)
y = data['col']

def normalize(df):
    from sklearn.preprocessing import MaxAbsScaler
    scaler = MaxAbsScaler()
    scaler.fit(df)
    scaled = scaler.transform(df)
    scaled_df = pd.DataFrame(scaled, columns=df.columns)
    return scaled_df

X = normalize(X)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
TL;DR: do normalization on the input data, but don't do it on the output.
Whether you need normalization depends on the algorithm, and it is applied per feature.
Some algorithms do not require any normalization (like decision trees).
Applying normalization to the dataset: normalize each feature separately, but compute its statistics over all examples in the dataset, not feature by feature combined, if you have more than one feature.
For example, say you have two features, X and Y. Feature X is always in the range [0, 10], while Y is in the range [100K, 1M]. If you normalize X and Y together instead of each separately, the values of feature X become insignificant.
For Output (labels):
Generally, there is no need to normalize the output or labels for regression or classification tasks. But make sure the normalization fitted on the training data is also applied at inference time.
If the task is classification, the common approach is simply to encode the class labels as numbers (if you have the classes dog and cat, assign 0 to one and 1 to the other).
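A minimal sketch of that advice (scale the features per feature, leave y untouched, and reuse the scaler fitted at training time for the test/inference data; variable names follow the snippets above):
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MaxAbsScaler

# split first, on the raw features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit the scaler on the training features only ...
scaler = MaxAbsScaler()
X_train_scaled = scaler.fit_transform(X_train)

# ... and reuse the same fitted scaler at test / inference time
X_test_scaled = scaler.transform(X_test)

# y_train and y_test are left as-is (labels / targets are not scaled)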

KFold Cross Validation does not fix overfitting

I separate the features into X and y, then preprocess the train and test data after splitting them with k-fold cross-validation. After that I fit the training data to my Random Forest Regressor and calculate the score. Why do I preprocess after splitting? Because people told me it is more correct to do it that way, and I keep that principle for the sake of my model's performance.
This is my first time using KFold cross-validation; my model overfits and I thought I could fix it with cross-validation. I am still confused about how to use it. I have read the documentation and some articles, but I do not really understand how to apply it to my model. I tried anyway, and whether I use train_test_split or cross-validation my score is still 0.999. I do not know what my mistake is since I am very new to this method, but I think I did something wrong so it does not fix the overfitting. Please tell me what is wrong with my code and how to fix it.
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
import scipy.stats as ss
avo_sales = pd.read_csv('avocados.csv')
avo_sales.rename(columns={'4046': 'small PLU sold',
                          '4225': 'large PLU sold',
                          '4770': 'xlarge PLU sold'},
                 inplace=True)
avo_sales.columns = avo_sales.columns.str.replace(' ', '')

x = np.array(avo_sales.drop(['TotalBags', 'Unnamed:0', 'year', 'region', 'Date'], axis=1))
y = np.array(avo_sales.TotalBags)

# X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

kf = KFold(n_splits=10)
for train_index, test_index in kf.split(x):
    X_train, X_test, y_train, y_test = x[train_index], x[test_index], y[train_index], y[test_index]

    impC = SimpleImputer(strategy='most_frequent')
    X_train[:, 8] = impC.fit_transform(X_train[:, 8].reshape(-1, 1)).ravel()
    X_test[:, 8] = impC.transform(X_test[:, 8].reshape(-1, 1)).ravel()

    imp = SimpleImputer(strategy='median')
    X_train[:, 1:8] = imp.fit_transform(X_train[:, 1:8])
    X_test[:, 1:8] = imp.transform(X_test[:, 1:8])

    le = LabelEncoder()
    X_train[:, 8] = le.fit_transform(X_train[:, 8])
    X_test[:, 8] = le.transform(X_test[:, 8])

    rfr = RandomForestRegressor()
    rfr.fit(X_train, y_train)
    confidence = rfr.score(X_test, y_test)
    print(confidence)
The reason you're overfitting is that a non-regularized tree-based model will adjust to the data until all training samples are fitted almost perfectly. See for example this image:
[image: an unregularized tree fit that chases every training point]
As you can see, this does not generalize well. If you don't specify arguments that regularize the trees, the model will fit the test data poorly because it basically just learns the noise in the training data. There are many ways to regularize trees in sklearn; you can find them here. For instance:
max_features
min_samples_leaf
max_depth
With proper regularization, you can get a model that generalizes well to the test data. Look at a regularized model, for instance:
[image: a regularized tree fit that generalizes better]
To regularize your model, instantiate the RandomForestRegressor() module like this:
rfr = RandomForestRegressor(max_features=0.5, min_samples_leaf=4, max_depth=6)
These argument values are arbitrary; it's up to you to find the ones that fit your data best. You can use domain-specific knowledge to choose these values, or a hyperparameter tuning search like GridSearchCV or RandomizedSearchCV.
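A rough sketch of such a search (the parameter grid below is an arbitrary placeholder, not a set of tuned values, and X_train/X_test/y_train/y_test are assumed from the question's split):
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# candidate regularization settings to try
param_grid = {
    'max_features': [0.3, 0.5, 0.8],
    'min_samples_leaf': [2, 4, 8],
    'max_depth': [4, 6, 8],
}

# 5-fold cross-validated grid search over the regularization parameters
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_estimator_.score(X_test, y_test))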
Other than that, imputing the mean or median might bring a lot of noise into your data. I would advise against it unless you have no other choice.
While @NicolasGervais' answer gets to the bottom of why your specific model is overfitting, I think there is a conceptual misunderstanding about cross-validation in the original question; you seem to think that:
Cross-validation is a method that improves the performance of a machine learning model.
But this is not the case.
Cross validation is a method that is used to estimate the performance of a given model on unseen data. By itself, it cannot improve the accuracy.
In other words, the respective scores can tell you if your model is overfitting the training data, but simply applying cross-validation does not make your model better.
Example:
Let's look at a dataset with 10 points, and fit a line through it:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
X = np.random.randint(0,10,10)
Y = np.random.randint(0,10,10)
fig = plt.figure(figsize=(1,10))
def line(x, slope, intercept):
    return slope * x + intercept

for i in range(5):
    # note that this is not technically 5-fold cross-validation
    # because I allow the same datapoint to go into the test set
    # several times. For illustrative purposes it is fine imho.
    test_indices = np.random.choice(np.arange(10), 2)
    train_indices = list(set(range(10)) - set(test_indices))

    # get train and test sets
    X_train, Y_train = X[train_indices], Y[train_indices]
    X_test, Y_test = X[test_indices], Y[test_indices]

    # the training set has one feature and multiple entries, so reshape(-1, 1)
    X_train, Y_train, X_test, Y_test = X_train.reshape(-1, 1), Y_train.reshape(-1, 1), X_test.reshape(-1, 1), Y_test.reshape(-1, 1)

    # fit and evaluate linear regression
    reg = LinearRegression().fit(X_train, Y_train)
    score_train = reg.score(X_train, Y_train)
    score_test = reg.score(X_test, Y_test)

    # extract coefficients from the model
    slope, intercept = reg.coef_[0], reg.intercept_[0]
    print(score_test)

    # show train and test sets
    plt.subplot(5, 1, i + 1)
    plt.scatter(X_train, Y_train, c='k')
    plt.scatter(X_test, Y_test, c='r')

    # draw regression line
    plt.plot(np.arange(10), line(np.arange(10), slope, intercept))
    plt.ylim(0, 10)
    plt.xlim(0, 10)
    plt.title('train: {:.2f} test: {:.2f}'.format(score_train, score_test))
You can see that the scores on the training and test sets are vastly different. You can also see that the estimated parameters vary a lot as the train and test sets change.
That does not make your linear model any better at all.
But now you know exactly how bad it is :)

MAE using Pipeline and GridSearchCV

I am facing a challenge computing the mean absolute error (MAE) using Pipeline and GridSearchCV.
Background:
I have worked on a data science project (MWE below) where the MAE of a classifier is returned as its performance metric.
#Library
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
#Data import and preparation
data = pd.read_csv("data.csv")
data_features = ['location','event_type_count','log_feature_count','total_volume','resource_type_count','severity_type']
X = data[data_features]
y = data.fault_severity
#Train Validation Split for Cross Validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
#RandomForest Modeling
RF_model = RandomForestClassifier(n_estimators=100, random_state=0)
RF_model.fit(X_train, y_train)
#RandomForest Prediction
y_predict = RF_model.predict(X_valid)
#MAE
print(mean_absolute_error(y_valid, y_predict))
#Output:
# 0.38727149627623564
Challenge:
Now I am trying to implement the same thing using Pipeline and GridSearchCV (MWE below), expecting the same MAE value as above. Unfortunately I could not get it right with any of the 3 approaches below.
#Library
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
#Data import and preparation
data = pd.read_csv("data.csv")
data_features = ['location','event_type_count','log_feature_count','total_volume','resource_type_count','severity_type']
X = data[data_features]
y = data.fault_severity
#Train Validation Split for Cross Validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
#RandomForest Modeling via Pipeline and Hyper-parameter tuning
steps = [('rf', RandomForestClassifier(random_state=0))]
pipeline = Pipeline(steps) # define the pipeline object.
parameters = {'rf__n_estimators':[100]}
grid = GridSearchCV(pipeline, param_grid=parameters, scoring='neg_mean_squared_error', cv=None, refit=True)
grid.fit(X_train, y_train)
#Approach 1:
print(grid.best_score_)
# Output:
# -0.508130081300813
#Approach 2:
y_predict=grid.predict(X_valid)
print("score = %3.2f"%(grid.score(y_predict, y_valid)))
# Output:
# ValueError: Expected 2D array, got 1D array instead:
# array=[0. 0. 0. ... 0. 1. 0.].
# Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
#Approach 3:
y_predict_df = pd.DataFrame(y_predict.reshape(len(y_predict), -1),columns=['fault_severity'])
print("score = %3.2f"%(grid.score(y_predict_df, y_valid)))
# Output:
# ValueError: Number of features of the model must match the input. Model n_features is 6 and input n_features is 1
Discussion:
Approach 1:
Since the scoring parameter in GridSearchCV() is set to neg_mean_squared_error, I tried to read grid.best_score_, but it did not give the same MAE value.
Approach 2:
I got the y_predict values using grid.predict(X_valid), then tried to get the MAE using grid.score(y_predict, y_valid), since the scoring parameter in GridSearchCV() is set to neg_mean_squared_error. It returned a ValueError complaining "Expected 2D array, got 1D array instead".
Approach 3:
I tried to reshape y_predict, but it did not work either. This time it returned "ValueError: Number of features of the model must match the input."
It would be helpful if you could point out where I have made an error.
If you need, the data.csv is available at https://www.dropbox.com/s/t1h53jg1hy4x33b/data.csv
Thank you very much
You are comparing mean_absolute_error with neg_mean_squared_error, which is a very different metric; refer here for more details. You should use neg_mean_absolute_error when creating your GridSearchCV object, as shown below:
grid = GridSearchCV(pipeline, param_grid=parameters,scoring='neg_mean_absolute_error', cv=None, refit=True)
Also, the score method in sklearn takes (X, y) as inputs, where X is the input features with shape (n_samples, n_features) and y is the target labels; you need to change grid.score(y_predict, y_valid) to grid.score(X_valid, y_valid).
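Putting both fixes together, a minimal sketch (same pipeline, parameters, and train/validation split as in the question):
from sklearn.metrics import mean_absolute_error

# score the grid search with MAE instead of MSE ...
grid = GridSearchCV(pipeline, param_grid=parameters,
                    scoring='neg_mean_absolute_error', cv=None, refit=True)
grid.fit(X_train, y_train)

# ... and call score with (X, y), not (y_pred, y); this returns the negated MAE
print(grid.score(X_valid, y_valid))

# equivalent explicit computation on the validation split
y_predict = grid.predict(X_valid)
print(mean_absolute_error(y_valid, y_predict))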

Support vector machine overfitting my data

I am trying to make predictions on the iris dataset. I have decided to use SVMs for this purpose, but they give me an accuracy of 1.0. Is it a case of overfitting, or is it because the model is very good? Here is my code.
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)  # load the iris data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
svm_model = svm.SVC(kernel='linear', C=1, gamma='auto')
svm_model.fit(X_train, y_train)
predictions = svm_model.predict(X_test)
accuracy_score(predictions, y_test)
Here, accuracy_score returns a value of 1. Please help me. I am a beginner in machine learning.
You can try cross validation:
Example:
from sklearn.model_selection import LeaveOneOut
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
#load iris data
iris = datasets.load_iris()
X = iris.data
Y = iris.target
#build the model
svm_model = SVC(kernel='linear', C=1, gamma='auto', random_state=0)
#create the Cross validation object
loo = LeaveOneOut()
#calculate cross validated (leave one out) accuracy score
scores = cross_val_score(svm_model, X, Y, cv=loo, scoring='accuracy')
print( scores.mean() )
Result (the mean accuracy of the 150 folds since we used leave-one-out):
0.97999999999999998
Bottom line:
Cross validation (especially LeaveOneOut) is a good way to avoid overfitting and to get robust results.
The iris dataset is not a particularly difficult one from which to get good results. However, you are right not to trust a model with 100% classification accuracy. In your example, all 30 test points happen to be correctly classified, but that doesn't mean your model generalises well to all new data instances. Just try changing test_size to 0.3 and the results are no longer 100% (accuracy drops to 97.78%).
The best way to guarantee robustness and avoid overfitting is to use cross-validation. An easy way to do this, starting from your example:
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
iris = datasets.load_iris()
X = iris.data[:, :4]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
svm_model = svm.SVC(kernel='linear', C=1, gamma='auto')
scores = cross_val_score(svm_model, iris.data, iris.target, cv=10) #10 fold cross validation
Here cross_val_score iteratively uses different parts of the dataset as test data (cross-validation) while keeping all your previous parameters. If you check scores you will see that the 10 accuracies now range from 87.87% to 100%. To report the final model performance you can, for example, use the mean of the scores.
Hope this helps and good luck! :)

Training a Logistic Classifier

I'm trying to train a logistic classifier. My dataset has the following columns.
name , review, rating, reviews_cleaned , word_count, sentiment,
The sentiment is either +1 or -1 depending on whether the rating is greater than 3 or not. word_count contains a dict of words with their occurrences, and reviews_cleaned is just the reviews stripped of punctuation.
This is my code to train a LogisticClassifier.
train_data, test_data = train_test_split(products, test_size = 0.2)
sentiment_model = LogisticRegression(penalty='l2', C=1)
sentiment_model.fit(products['sentiment'], products['word_count'])
I get the following error,
ValueError: Found input variables with inconsistent numbers of samples: [1, 166752]
PS: The equivalent statement using GraphLab Create is
sentiment_model = graphlab.logistic_classifier.create(train_data,
                                                       target='sentiment',
                                                       features=['word_count'],
                                                       validation_set=None)
What am I doing wrong?
Your training data looks like it's a 1-dimensional vector but sklearn requires it to be 2-dimensional - if you reshape it you should be okay. Also you make your train/test split but you're not actually using the data that you're producing (fit with train_data instead).
Using GraphLab in that course is very irritating to say the least. Give this a whirl:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv('amazon_baby.csv', header=0)
df.dropna(how="any", inplace=True)
products = df[df['rating'] != 3]  # drop the products with a 3-star rating
products['sentiment'] = products['rating'] >= 4

X_train, X_test, y_train, y_test = train_test_split(products['review'], products['sentiment'], test_size=.2, random_state=0)

vect = CountVectorizer()
X_train = vect.fit_transform(X_train.values)
X_test = vect.transform(X_test.values)

model = LogisticRegression(penalty='l2', C=1)
model.fit(X_train, y_train)
I'm not sure what the direct translation between Sklearn/Pandas and GraphLab is, but this looks like it's what they are doing.
When I score the model, I get:
model.score(X_test, y_test)
> .93155480
Let me know what results you get or if this works for you.
