I performed feature selection using ExtraTreesClassifier and SelectFromModel on a dataset loaded as a DataFrame, and I want to save the selected features to a CSV file as a DataFrame, keeping the column names. The problem is that the output is a NumPy array: it contains the important-feature columns but not the column headers.
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
import numpy as np
df = pd.read_csv('los_10_one_encoder.csv')
y = df['LOS'] # target
X= df.drop('LOS',axis=1) # drop LOS column
clf = ExtraTreesClassifier()
clf = clf.fit(X, y)
print(clf.feature_importances_)
model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)              # NumPy array: selected columns, headers lost
feature_idx = model.get_support()       # boolean mask over the columns of X
feature_name = X.columns[feature_idx]   # use X.columns, not df.columns: the model was fit on X (df still contains 'LOS')
X_new is a plain NumPy array, so it has no to_csv method and carries no column labels. Rebuild a DataFrame with the selected feature names first, then use DataFrame.to_csv() to save it as a CSV file:
pd.DataFrame(X_new, columns=feature_name).to_csv("your/path", sep=';')
See the pandas documentation of DataFrame.to_csv for the available options.
I'm testing code like this.
import pandas as pd  # needed for read_csv below
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt
from tabulate import tabulate
#Seaborn for easier visualization
import seaborn as sns
# Load data
df = pd.read_csv('C:\\path_to_file\\train.csv')
df.shape
list(df)
# the model can only handle numeric values so filter out the rest
# data = df.select_dtypes(include=[np.number]).interpolate().dropna()
df1 = df.select_dtypes(include=[np.number])
df1.shape
list(df1)
df1.dtypes
df1 = df1.fillna(0)
#Prerequisites
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
#Split train/test sets
y = df1.SalePrice  # target: sale price (regression)
X = df1.drop(['SalePrice', 'index'], axis=1)  # drop the target and the index column from the features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)
# Train model
clf = RandomForestRegressor(n_jobs=2, n_estimators=1000)
model = clf.fit(X_train, y_train)
# Feature Importance
headers = ['name', 'score']
values = sorted(zip(X_train.columns, model.feature_importances_), key=lambda x: x[1] * -1)
print(tabulate(values, headers, tablefmt='plain'))
(pd.Series(model.feature_importances_, index=X.columns)
.nlargest(10)
.plot(kind='barh'))
This works fine on some sample data that I found online. But rather than predicting a sale price as my y variable, I'm trying to figure out how to get the model to predict something like target = True or target = False, or maybe my approach is wrong.
What confuses me is the line df1 = df.select_dtypes(include=[np.number]): only numeric columns are kept, which makes sense for a RandomForestRegressor. I'm just looking for guidance on how to deal with a non-numeric prediction target here.
You are dealing with a classification problem here, with two classes (True and False). To get started, take a look at a simple logistic regression model:
https://en.wikipedia.org/wiki/Logistic_regression
Since you are using sklearn, try:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
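A minimal sketch of that approach, reusing the numeric frame df1 built above and assuming a boolean column named 'target' (the name is illustrative, not from your data):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X = df1.drop(['target'], axis=1)   # features: everything except the boolean label
y = df1['target']                  # True/False labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # mean accuracy on the held-out set
Your existing RandomForestClassifier import would drop into the same skeleton unchanged if you prefer to stay with trees.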
After reading so many examples of 'inconsistent number of samples' errors, I am still not able to see what is wrong with my code.
In an Excel file, sheet 1 contains the data and sheet 2 contains a shortlist of variables. I saved the variable names from sheet 2 into an array and fed it to a random forest model to evaluate their impact on a parameter in sheet 1.
But I am getting "Found input variables with inconsistent numbers of samples: [54, 2016]".
54 is the number of variables in sheet 2; 2016 is the number of rows of data in sheet 1.
I am trying to see how these 54 variables impact the 'Target' variable in sheet 1.
How should I manipulate my data to make this work?
Many thanks in advance.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
df = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev2.xlsx',sheet_name=0)
df2 = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev2.xlsx',sheet_name=1)
df['DateTime']=pd.to_datetime(df['Time Stamp'], format='%Y-%m-%d %H:%M:%S')
df.set_index(df['DateTime'], inplace=True)
print(len(df2.columns))
allvar = list()
for each_var in df2.columns:
    allvar.append(each_var)
allvar = np.array(allvar)
print(allvar)
target = df['(CUP) Chiller Optimization Plant Efficiency [kW/RT]']
target=target.values.reshape(len(target),1)
allvar_train,allvar_test,target_train,target_test= train_test_split(allvar,target, random_state=0, test_size=0.6)
clf = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
clf.fit(allvar_train, target_train)
for feature in zip(allvar, clf.feature_importances_):  # allvar holds the feature names collected above
    print(feature)
[Screenshots in the original post: Sheet 1 (saved as df), Sheet 2 (saved as df2), the error log, a second error log showing "Unknown label type: 'continuous'", and the contents of allvar_train and target_train.]
The issue is with train_test_split: you're only passing the feature column names, not the data. Use the list of columns to select the data from the DataFrame, like this:
allvar_train, allvar_test, target_train, target_test = train_test_split(df[allvar], target, random_state=0, test_size=0.6)
You don't necessarily need to convert allvar and target to NumPy arrays; they can be used directly in train_test_split.
Note: this issue has nothing to do with random forests.
Here is the code that works for me.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
df = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev3.xlsx',sheet_name=0)
df2 = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev3.xlsx',sheet_name=1)
df['DateTime']=pd.to_datetime(df['Time Stamp'], format='%Y-%m-%d %H:%M:%S')
df.set_index(df['DateTime'], inplace=True)
print(len(df2.columns))
allvarlist = list()
for each_var in df2.columns:
    allvarlist.append(each_var)
countvar = len(allvarlist)
allvar = df[allvarlist]
allvar = allvar.values.reshape(len(allvar),countvar)
target = df['(CUP) Chiller Optimization Plant Efficiency [kW/RT]']
target=target.values.reshape(len(target),1)
allvar_train,allvar_test,target_train,target_test= train_test_split(allvar,target, random_state=0, test_size=0.7)
clf = RandomForestRegressor(n_estimators=10000, random_state=0, n_jobs=-1)
#print(allvar_train)
#print(target_train)
clf.fit(allvar_train,np.ravel(target_train))
for feature in zip(allvarlist, clf.feature_importances_):
    print(feature)
importances = clf.feature_importances_
#indices = np.argsort(importances)
plt.figure().set_size_inches(14,16)
plt.barh(range(allvar_train.shape[1]), importances, color="r")
plt.yticks(range(allvar_train.shape[1]),allvarlist)
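If you want the bars ordered by importance (presumably what the commented-out argsort was for), a small variation on the same plot:
indices = np.argsort(importances)  # ascending, so the largest bar ends up at the top of the barh
plt.figure().set_size_inches(14, 16)
plt.barh(range(len(indices)), importances[indices], color="r")
plt.yticks(range(len(indices)), [allvarlist[i] for i in indices])
plt.show()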
I'm a newbie to random forests (as well as Python).
I'm using a random forest classifier, and my dataset is named t2002.
t2002.columns
So here are the columns:
Index(['IndividualID', 'ES2000_B01ID', 'NSSec_B03ID', 'Vehicle', 'Age_B01ID',
       'IndIncome2002_B02ID', 'MarStat_B01ID', 'EcoStat_B03ID',
       'MainMode_B03ID', 'TripStart_B02ID', 'TripEnd_B02ID',
       'TripDisIncSW_B01ID', 'TripTotalTime_B01ID', 'TripTravTime_B01ID',
       'TripPurpFrom_B01ID', 'TripPurpTo_B01ID'],
      dtype='object')
I'm using codes as below to run the classifier:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
X_all = t2002.drop(['MainMode_B03ID'],axis=1)
y_all = t2002['MainMode_B03ID']
p = 0.2
X_train,X_test, y_train, y_test = train_test_split(X_all,y_all,test_size=p,
random_state=23)
clf = RandomForestClassifier()
acc_scorer = make_scorer(accuracy_score)
parameters = {}  # empty grid: GridSearchCV will just fit the default estimator
grid_obj = GridSearchCV(clf,parameters,scoring=acc_scorer)
grid_obj = grid_obj.fit(X_train,y_train)
clf = grid_obj.best_estimator_
clf.fit(X_train,y_train)
predictions = clf.predict(X_test)
print(accuracy_score(y_test,predictions))
In this case, how could I use pandas to generate a crosstab (like a table) to show the detailed prediction results?
Thanks in advance!
You can first create a confusion matrix using sklearn and then convert it to a pandas DataFrame:
from sklearn.metrics import confusion_matrix
# build the confusion matrix as an array; compare the held-out labels with the predictions
# (use y_test, not the full column, so the lengths match)
labels = sorted(y_test.unique())
confusion = confusion_matrix(y_test, predictions, labels=labels)
# convert to a DataFrame, with rows and columns in the same label order
new_df = pd.DataFrame(confusion, index=labels, columns=labels)
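Since you asked for a crosstab specifically, pandas can also build the table directly from the held-out labels and the predictions, without going through sklearn:
pd.crosstab(y_test, predictions, rownames=['Actual'], colnames=['Predicted'])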
It's easy to show all the cross-validation results using pandas. Use cv_results_ as described in the docs:
import pandas as pd
results = pd.DataFrame(grid_obj.cv_results_)  # grid_obj is the fitted GridSearchCV object
print(results.head())
In the train.csv data for the Titanic machine learning project, some passengers have their age data missing, so pandas fills it in as NaN, and a sklearn algorithm will not accept that. I tried dataset.fillna('') but that turns it into an empty string, not a float. Please send help.
https://www.kaggle.com/c/titanic/data
import pandas as pd
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer versions
dataset = pd.read_csv('train.csv')
#dataset = dataset.fillna()
def preprocess(df):
    from sklearn.preprocessing import LabelEncoder
    processed_df = df.copy()
    le = LabelEncoder()
    done = le.fit_transform(processed_df)
    return done
survival = preprocess(dataset.Survived)
data = dataset.drop('Survived',axis= 1)
data = data.drop('PassengerId',axis=1)
data = data.drop('Embarked',axis = 1)
data = data.drop('Cabin',axis = 1)
data = data.drop('Fare',axis = 1)
data = data.drop('Ticket',axis = 1)
data = data.drop('Name',axis=1)
x_train, x_test, y_train, y_test = train_test_split(data, survival, test_size=0.25, random_state=0)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn import svm
from sklearn.metrics import accuracy_score
pipeline = make_pipeline(StandardScaler(),
svm.SVC(kernel='rbf',C=0.1))
pipeline.fit(x_train,y_train)
print(accuracy_score(pipeline.predict(x_test),y_test))
fillna replaces the NaN values with whatever you pass in, so if you pass '' it becomes an empty string. Just write:
dataset = dataset.fillna(0)
(note that fillna returns a new DataFrame by default, so assign the result back). If you need to distinguish between 0 and NaN, you can try replacing with -1; that's what we do.
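For a column like Age, a fill value that is at least plausible often works better than a flat 0; one common choice, sketched here on the Age column the question mentions, is the median:
dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())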
There are many methods you can use to deal with missing values in a machine learning project:
drop every column with missing values,
drop the rows containing missing values,
set the missing values to some value (zero, the mean, the median, etc.).
For the third option:
Scikit-Learn provides a handy class to take care of missing values: Imputer. Here is how to use it. First, you need to create an Imputer instance, specifying that you want to replace each attribute's missing values with the median of that attribute:
from sklearn.preprocessing import Imputer
imputer = Imputer(strategy="median")  # or "mean", as you want
x_train = imputer.fit_transform(x_train)
x_test = imputer.transform(x_test)  # transform only: fit the imputer on the training set to avoid leaking test data
The result is a plain NumPy array containing the transformed features. If you want to put it back into a pandas DataFrame, it's simple, as sketched below. (In recent scikit-learn versions, Imputer has been superseded by sklearn.impute.SimpleImputer, which is used the same way.)
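For instance, a one-liner sketch, assuming x_train kept the original column order from data:
x_train = pd.DataFrame(x_train, columns=data.columns)  # restore the column labels lost by the imputer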
NB: you could also add the imputer to the pipeline, just before the scaler:
pipeline = make_pipeline(Imputer(strategy="median"),
                         StandardScaler(),
                         svm.SVC(kernel='rbf', C=0.1))
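With the imputer inside the pipeline, the un-imputed splits can be fed in directly, reusing the same calls from the question:
pipeline.fit(x_train, y_train)
print(accuracy_score(pipeline.predict(x_test), y_test))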
I want to do a tree-based feature selection.
My dataset has about 30 columns, and after the selection there are about 5, which is great for me. The problem is that the 5-column dataset I get does not keep the column names, so I cannot identify them.
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
data = pd.read_csv(file)
X = data.drop('target', axis=1)
y = data['target']
X.shape #(100000, 30)
clf = ExtraTreesClassifier()
clf = clf.fit(X, y)
clf.feature_importances_
model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
X_new.shape #(100000, 5)
Can someone help me please?
Try the following:
mask = model.get_support(indices=False)  # boolean mask over the columns of X
X_new = X.loc[:, mask]                   # sliced DataFrame, keeping only the selected columns
featured_col_names = X_new.columns       # names of the selected columns
If all you need is just the column names:
X.columns[model.get_support()]
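Note that X_new built with the boolean mask above is still a DataFrame, so the selected column names travel with it; for example (the filename here is just a placeholder), it can be written out with its headers intact:
X_new.to_csv('selected_features.csv', index=False)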