Scikit-Learn tree-based feature selection keeping the column names? - python

I want to perform tree-based feature selection.
My dataset has about 30 columns, and after the selection about 5 remain, which is great for me.
The problem is that the resulting 5-column dataset does not keep the column names, so I cannot identify which features were selected.
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
data = pd.read_csv(file)
X = data.drop('target', axis=1)
y = data['target']
X.shape #(100000, 30)
clf = ExtraTreesClassifier()
clf = clf.fit(X, y)
clf.feature_importances_
model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
X_new.shape #(100000, 5)
Can someone help me please?

Now that I'm more confident in the answer, try the following:
mask = model.get_support(indices=False)  # boolean mask over the columns
X_new = X.loc[:, mask]  # the sliced dataframe, keeping only the selected columns
featured_col_names = X_new.columns  # Index of the selected column names
If all you need is just the column names:
X.columns[model.get_support()]
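If you want the reduced data back as a DataFrame with its headers, here is a minimal sketch reusing X and the fitted model from above (the pd.DataFrame reconstruction is my own addition, not part of the original answer):
selected = X.columns[model.get_support()]
# transform() returns a plain NumPy array; wrap it with the surviving column names
X_new_df = pd.DataFrame(model.transform(X), columns=selected, index=X.index)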

Related

Eliminating a certain column in each FOR loop

I have a dataframe with 40 columns as inputs to an algorithm. I initially create a for loop that predicts the target variable in each iteration using only one of the columns (for example, column1 in the first iteration, column2 in the second, and so on).
First, I select one of the columns (named 'vapor') as the base of operation.
The for loop then iterates through the remaining columns, adds each one to 'vapor' (constructing a dataframe with 2 columns), and predicts the target variable (HI). Based on the code below, I find which of the 40 columns achieves the best performance.
Let's call this new best column 'wind'. What I need help with is adding another for loop that drops, for example, 'wind' from X_new, so that the shape of X_new becomes 39 columns instead of 40.
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv(r'Input_hi.csv')
df = df.fillna(0)
X = df.drop(['HI', 'Station', 'lat', 'lon'], axis=1)
y = df['HI']
# I think the for loop should be added here!
first_model = X[['vapor']]  # the base column described above (its definition was missing from the snippet)
X_new = X.drop(['vapor'], axis=1)
scaler = MinMaxScaler()
f = pd.DataFrame()
names = []
# This Model is updated in each iteration of the for loop,
# pairing one candidate column with the base column
for i in range(X_new.shape[1]):  # X_new has 39 columns once 'vapor' is dropped
    Model = X_new.iloc[:, i]
    Model = pd.DataFrame(Model)
    column_headers = list(Model)
    names.append(column_headers)
    e = pd.concat([first_model, Model], axis=1)  # base column + one candidate
    scaledX = scaler.fit_transform(e)
    X_train, X_test, y_train, y_test = train_test_split(scaledX, y,
                                                        test_size=0.25,
                                                        random_state=4)
    GBC = RandomForestRegressor()
    GBC.fit(X_train, y_train)
    GBCperform1 = GBC.score(X_test, y_test)
    A = abs(GBCperform1)
    metr = pd.DataFrame([A])  # a set literal ({A}) would raise, since sets are unordered
    f = pd.concat([f, metr], axis=0)
f.reset_index(drop=True, inplace=True)
names = pd.DataFrame(names)
data_2 = pd.concat([names, f], axis=1)
data_2.columns = ['Variable', 'R-Sq']
data_2 = data_2.sort_values(by="R-Sq", ascending=False)
data_2.reset_index(drop=True, inplace=True)
most_2 = data_2.Variable[0]
What I want in the end is a dataframe with 40 rows (one per iteration of the outer loop), containing the name of the best column found in each round and its corresponding metric (metr). A sketch of this nested-loop idea is shown below.
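The nested loop described above amounts to greedy forward selection. Below is a minimal sketch, assuming X (the 40-column DataFrame) and y as defined above; the function name forward_select and its bookkeeping variables are hypothetical choices of mine, not from the original code:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

def forward_select(X, y, n_rounds, base_cols=('vapor',)):
    selected = list(base_cols)                  # columns kept so far
    remaining = [c for c in X.columns if c not in selected]
    history = []                                # (best column, R-sq) per round
    scaler = MinMaxScaler()
    for _ in range(n_rounds):
        scores = {}
        for col in remaining:                   # try each candidate alongside the base
            scaled = scaler.fit_transform(X[selected + [col]])
            X_train, X_test, y_train, y_test = train_test_split(
                scaled, y, test_size=0.25, random_state=4)
            reg = RandomForestRegressor()
            reg.fit(X_train, y_train)
            scores[col] = reg.score(X_test, y_test)
        best = max(scores, key=scores.get)      # e.g. 'wind' in the first round
        history.append((best, scores[best]))
        selected.append(best)                   # the winner joins the base ...
        remaining.remove(best)                  # ... and leaves the candidate pool
    return pd.DataFrame(history, columns=['Variable', 'R-Sq'])

best_per_round = forward_select(X, y, n_rounds=5)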

Using Imputer in a Pipeline doesn't remove NaNs, gives "Input contains NaN, infinity or a value too large for dtype('float64')"

So I'm using a pipeline to perform a Ridge regression on some data; the pipeline also includes an imputer to handle the NaNs.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
from sklearn.preprocessing import Imputer
from sklearn.pipeline import Pipeline
#making Data ready
gapdata = pd.read_csv('/Users/naveedanwer/desktop/Python Files/Life Expectancy Data.csv')
gapdata.columns = gapdata.columns.str.strip()
gapdata.rename(columns={'Life expectancy': 'life'}, inplace=True)
gapdata.Status = gapdata.Status.astype('category')
model_data = gapdata.drop('Country',axis =1)
model_data = pd.get_dummies(model_data)
#initialize imputer
imp = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
#initializing regression model
reg = Ridge(alpha = 0.5, normalize = True)
#steps for pipeline
steps = [('imputation',imp),('Ridge',reg)]
#initializing pipeline
pipeline = Pipeline(steps)
#target and feature variables
X = model_data.drop('life', axis = 1)
y = model_data.loc[:,'life']
#splitting into training and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
pipeline.fit(X_train, y_train)
The original data does contain a lot of NaN values, which is why the imputer is in place. However, the following error is raised when the code is executed:
Input contains NaN, infinity or a value too large for dtype('float64')
This indicates that there are still NaNs in the data, despite the presence of the imputer. Any idea why this is happening?
There are a few ways a NaN can appear in the dataframe:
Type 1: np.nan
Type 2: math.nan
Type 3: float('nan')
Comparing any of these with another (or even with itself) returns False, because NaN never compares equal to anything. So one thing you can do is check whether the NaNs were saved in a specific format in the CSV, and then use that format for the missing_values argument.
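A quick illustration of why equality checks cannot be used to detect NaNs (a toy example of mine, not from the original answer):
import math
import numpy as np

print(np.nan == np.nan)            # False -- NaN is not equal even to itself
print(math.nan == float('nan'))    # False
print(math.isnan(float('nan')))    # True  -- use math.isnan()/pd.isna() to detect NaNs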
You can also use the pandas fillna() function to impute values instead of the sklearn Imputer, to see whether the problem lies there. For example, if you want to impute each column with that column's mean value, something like this should work (do not pass inplace=True inside apply, or the lambda returns None and the result is lost):
df = df.apply(lambda x: x.fillna(x.mean()))

ValueError: Found array with 0 feature(s) (shape=(546, 0)) while a minimum of 1 is required

I was just trying out some data preprocessing and I keep getting this error. Can anyone explain to me what is wrong in this particular code for the given dataset?
Thanks in advance!
# STEP 1: IMPORTING THE LIBARIES
import numpy as np
import pandas as pd
# STEP 2: IMPORTING THE DATASET
dataset = pd.read_csv("https://github.com/Avik-Jain/100-Days-Of-ML-Code/blob/master/datasets/Data.csv", error_bad_lines=False)
X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:,1:3].values
# STEP 3: HANDLING THE MISSING VALUES
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = "NaN",strategy = "mean",axis = 0)
imputer = imputer.fit(X[ : , 1:3])
X[:,1:3] = imputer.transform(X[:,1:3])
# STEP 4: ENCODING CATEGORICAL DATA
from sklearn.preprocessing import LaberEncoder,OneHotEncoder
labelencoder_X = LabelEncoder() # Encode labels with value between 0 and n_classes-1.
X[ : , 0] = labelencoder_X.fit_transform(X[ : , 0]) # All the rows and first columns
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
# Step 5: Splitting the datasets into training sets and Test sets
from sklearn.cross_validation import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split( X , Y , test_size = 0.2, random_state = 0)
# Step 6: Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.fit_transform(X_test)
Returns Error:
ValueError: Found array with 0 feature(s) (shape=(546, 0)) while a minimum of 1 is required.
Your link in this line
dataset = pd.read_csv("https://github.com/Avik-Jain/100-Days-Of-ML-Code/blob/master/datasets/Data.csv", error_bad_lines=False)
is wrong.
The current link returns the GitHub webpage on which this CSV is displayed, not the actual CSV data, so whatever data ends up in dataset is invalid.
Change that to:
dataset = pd.read_csv("https://raw.githubusercontent.com/Avik-Jain/100-Days-Of-ML-Code/master/datasets/Data.csv", error_bad_lines=False)
Other than that, there is a spelling mistake in the LabelEncoder import (LaberEncoder).
Now even if you correct these, there will still be errors, because of
Y = labelencoder_Y.fit_transform(Y)
LabelEncoder only accepts a single-column (1-D) array as input, but your current Y has 2 columns because of
Y = dataset.iloc[:,1:3].values
Please explain more clearly what you want to do.
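To illustrate the LabelEncoder point, a toy example of mine (not from the original answer):
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
print(le.fit_transform(['a', 'b', 'a']))      # works: 1-D input -> [0 1 0]
# le.fit_transform([['a', 'x'], ['b', 'y']])  # would fail: LabelEncoder expects 1-D input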

Is there a way to output selected column names from the SelectFromModel method?

I performed feature selection using ExtraTreesClassifier and SelectFromModel on a dataset loaded as a DataFrame; however, I want to save the selected features as a DataFrame to a CSV file while maintaining the column names as well. Note that the output is a NumPy array containing the important feature columns but not the column headers.
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
import numpy as np
df = pd.read_csv('los_10_one_encoder.csv')
y = df['LOS']  # target
X = df.drop('LOS', axis=1)  # drop the LOS column
clf = ExtraTreesClassifier()
clf = clf.fit(X, y)
print(clf.feature_importances_)
model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
feature_idx = model.get_support()
feature_name = X.columns[feature_idx]  # index into X.columns, not df.columns, which still contains 'LOS'
Use the method DataFrame.to_csv() to save your dataframe as a CSV file. Since model.transform() returns a NumPy array, wrap it back into a DataFrame with the selected column names first:
pd.DataFrame(X_new, columns=feature_name).to_csv("your/path", sep=';')
See the pandas documentation for DataFrame.to_csv() for details.
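Alternatively, you can skip transform() entirely and slice the original DataFrame with the support mask, which keeps the headers from the start. A minimal sketch reusing X and model from above (the output file name is a placeholder of mine):
selected_df = X.loc[:, model.get_support()]  # DataFrame slicing preserves column names
selected_df.to_csv('selected_features.csv', sep=';', index=False)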

Merging results from model.predict() with original pandas DataFrame?

I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# add outcome variable
df['class'] = data.target
X = np.matrix(df.loc[:, [0, 1, 2, 3]])
y = np.array(df['class'])
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)
To merge these predictions back with the original df, I try this:
df['y_hats'] = y_hats
But that raises:
ValueError: Length of values does not match length of index
I know I could split the df into train_df and test_df and this problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my df, given that the y_hats array is zero-indexed and all information about which rows went into X_test and y_test seems to be lost? Or will I be relegated to splitting the dataframe into train and test first, and then building the feature matrices? I'd like to just fill the rows that were used for training with np.nan values in the dataframe.
Your y_hats length will only be the length of the test data (20%) because you predicted on X_test. Once your model is validated and you're happy with the test predictions (by comparing your model's predictions on X_test against the true X_test values), you should rerun the prediction on the full dataset (X). Add these two lines to the bottom:
y_hats2 = model.predict(X)
df['y_hats'] = y_hats2
EDIT: per your comment, here is an updated version that returns the dataset with the predictions appended on the rows that were in the test dataset:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# add outcome variable
df_class = pd.DataFrame(data = data.target)
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)
y_test['preds'] = y_hats
df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
I had (almost) the same problem, and I fixed it this way:
...
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_hats = model.predict(X_test)
y_hats = pd.DataFrame(y_hats)
df_out = X_test.reset_index()
df_out["Actual"] = y_test.reset_index()["Columns_Name"]
df_out["Prediction"] = y_hats.reset_index()[0]
You can create a y_hat dataframe copying indices from X_test then merge with the original data.
y_hats_df = pd.DataFrame(data = y_hats, columns = ['y_hats'], index = X_test.index.copy())
df_out = pd.merge(df, y_hats_df, how = 'left', left_index = True, right_index = True)
Note: a left join will also include the train-data rows (with NaN predictions); omitting the 'how' parameter results in just the test rows.
Try this:
y_hats2 = model.predict(X)
df['y_hats'] = y_hats2
You can probably make a new dataframe containing the test data along with the predicted values:
df_test = pd.DataFrame(X_test, copy=True)
df_test['y_hats'] = y_hats
df_test.to_csv('data1.csv')
predicted = m.predict(X_valid)
predicted_df = pd.DataFrame(data=predicted, columns=['y_hat'],
                            index=X_valid.index.copy())
df_out = pd.merge(X_valid, predicted_df, how='left', left_index=True,
                  right_index=True)
This worked well for me. It maintains the indexing positions.
pred_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class
pred_class = np.where(pred_prob > 0.5, "Yes", "No")  # for a binary (Yes/No) category
predictions = pd.DataFrame(pred_class, columns=['Prediction'])
my_new_df = pd.concat([my_old_df, predictions], axis=1)
Here is a solution that worked for me:
It consists of building, for each of your folds/iterations, one dataframe which includes observed and predicted values for your test set; this way, you make use of the index (ID) contained in y_true, which should correspond to your subjects' IDs (in my code: 'SubjID').
You then concatenate the DataFrames that you generated (through 5 folds of test data in my case) and paste them back into your original dataset.
I hope this helps!
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)  # 5 folds, as described above (skf was undefined in the snippet)
FoldNr = 0
for train_index, test_index in skf.split(X, y):
    FoldNr = FoldNr + 1
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # [...] your model
    # performance is measured on the test set
    y_true, y_pred = y_test, clf.predict(X_test)
    # save the predicted values for each test set
    a = pd.DataFrame(y_true).reset_index()
    b = pd.Series(y_pred, name='y_pred')
    globals()['ObsPred_df' + str(FoldNr)] = a.join(b)
    globals()['ObsPred_df' + str(FoldNr)].set_index('SubjID', inplace=True)

# create a dataframe with observed and predicted values for all subjects
ObsPred_Concat = pd.concat([ObsPred_df1, ObsPred_df2, ObsPred_df3, ObsPred_df4, ObsPred_df5])
original_df['y_pred'] = ObsPred_Concat['y_pred']
First, convert your y_val or y_test data into a DataFrame.
compare_df = pd.DataFrame(y_val)
Then just create a new column with the predicted data.
compare_df['predicted_res'] = y_pred_val
After that, you can easily filter for the rows where the prediction matches the original value with a simple condition.
test_df = compare_df[compare_df['y_val'] == compare_df['predicted_res']]
You can also predict on the full dataset so the lengths line up:
y_hats = model.predict(X)
df['y_hats'] = y_hats
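Pulling the index-based approaches together: a minimal sketch, assuming X_test is a DataFrame slice of df (as in the edited answer above), that leaves np.nan on the training rows exactly as the question asks (the column name 'y_hats' is kept from the question):
y_hat_series = pd.Series(model.predict(X_test), index=X_test.index, name='y_hats')
df_out = df.join(y_hat_series)  # rows not in X_test (the train rows) get NaN in 'y_hats'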
