Random Forest Error (input variables with inconsistent numbers of samples) - python

After reading so many examples with 'inconsistent number of samples' errors, I am still not able to see what is wrong with my code.
In an Excel file, sheet 1 contains the data and sheet 2 contains a shortlist of variables.
I saved the variable names from sheet 2 into an array and fed it to a Random Forest model to evaluate their impact on a parameter in sheet 1.
But I am getting "Found input variables with inconsistent numbers of samples: [54, 2016]".
54 is the number of variables in sheet 2.
2016 is the number of rows of data in sheet 1.
I am trying to see how these 54 variables impact the 'Target' variable in sheet 1.
How should I manipulate my data to make this work?
Many thanks in advance.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
df = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev2.xlsx',sheet_name=0)
df2 = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev2.xlsx',sheet_name=1)
df['DateTime']=pd.to_datetime(df['Time Stamp'], format='%Y-%m-%d %H:%M:%S')
df.set_index(df['DateTime'], inplace=True)
print(len(df2.columns))
allvar = list()
for each_var in df2.columns:
    allvar.append(each_var)
allvar = np.array(allvar)
print(allvar)
target = df['(CUP) Chiller Optimization Plant Efficiency [kW/RT]']
target=target.values.reshape(len(target),1)
allvar_train,allvar_test,target_train,target_test= train_test_split(allvar,target, random_state=0, test_size=0.6)
clf = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
clf.fit(allvar_train, target_train)
for feature in zip(feat_labels, clf.feature_importances_):
    print(feature)
Sheet 1 (saved as df) looks like this: [image: Sheet 1]
Sheet 2 (saved as df2) looks like this: [image: Sheet 2]
Error log is as shown: [image: Error log]
Error log 2: Unknown label type: 'continuous' [image: Error Log 2]
[images: allvar_train, target_train]

The issue is with train_test_split: you are passing only the feature column names, not the data. That is why the error reports 54 samples (the names) against 2016 (the target rows). Use the list of column names to select the data from the DataFrame, like this:
allvar_train,allvar_test,target_train,target_test= train_test_split(df[allvar],target, random_state=0, test_size=0.6)
You don't need to convert 'allvar' and 'target' to NumPy arrays; df[allvar] and the target Series can be passed to train_test_split directly.
Note: this issue has nothing to do with Random Forest.
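A minimal sketch of the difference, assuming a small synthetic DataFrame in place of the Excel sheets (the column names "a", "b", "c" and "Target" are made up for illustration). It reproduces the mismatch when the names are passed and shows that selecting the columns first gives matching row counts:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for sheet 1: 20 rows, three features plus a target.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(20, 4)), columns=["a", "b", "c", "Target"])

# Hypothetical stand-in for the sheet 2 header: names of the shortlisted columns.
names = np.array(["a", "b"])

# Passing the names themselves means 2 "samples" against 20 target rows.
try:
    train_test_split(names, df["Target"], random_state=0)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)

# Indexing the DataFrame with the names yields one row per sample.
X = df[list(names)]
y = df["Target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.6, random_state=0
)

print(mismatch_error)
print(X_train.shape, y_train.shape)
```

Printing the shapes before calling fit is a cheap way to catch this kind of mismatch early.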

Here is the code that works for me.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
df = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev3.xlsx',sheet_name=0)
df2 = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev3.xlsx',sheet_name=1)
df['DateTime']=pd.to_datetime(df['Time Stamp'], format='%Y-%m-%d %H:%M:%S')
df.set_index(df['DateTime'], inplace=True)
print(len(df2.columns))
allvarlist = list()
for each_var in df2.columns:
    allvarlist.append(each_var)
countvar = len(allvarlist)
allvar = df[allvarlist]
allvar = allvar.values.reshape(len(allvar),countvar)
target = df['(CUP) Chiller Optimization Plant Efficiency [kW/RT]']
target=target.values.reshape(len(target),1)
allvar_train,allvar_test,target_train,target_test= train_test_split(allvar,target, random_state=0, test_size=0.7)
clf = RandomForestRegressor(n_estimators=10000, random_state=0, n_jobs=-1)
#print(allvar_train)
#print(target_train)
clf.fit(allvar_train,np.ravel(target_train))
for feature in zip(allvarlist, clf.feature_importances_):
    print(feature)
importances = clf.feature_importances_
#indices = np.argsort(importances)
plt.figure().set_size_inches(14,16)
plt.barh(range(allvar_train.shape[1]), importances, color="r")
plt.yticks(range(allvar_train.shape[1]),allvarlist)
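The second error in the question (Unknown label type: 'continuous') is what scikit-learn raises when a classifier is fitted on a continuous target, which is presumably why the working code switches to RandomForestRegressor. A small sketch on synthetic data (the features and target below are invented, not the chiller data set):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
# Continuous target that depends mostly on the first feature (assumption for the demo).
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=60)

# A classifier expects discrete class labels and rejects a continuous target.
try:
    RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
    classifier_error = None
except ValueError as exc:
    classifier_error = str(exc)

# A regressor accepts the same target and still exposes feature_importances_.
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = reg.feature_importances_

print(classifier_error)
print(importances)
```

If the target really is categorical, the alternative is to keep the classifier and bin or encode the target into discrete labels first.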

Related

How to get second best predicted value by Python code using Random Forest Technique?

I am learning machine learning with a Random Forest classifier. I can now predict the best crop using the following Python code. A screenshot is shown below.
[image: screenshot of best crop]
Now I want to know how to get the second-best recommended crop using this code. What should I do? My existing code is below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report
from sklearn import metrics
from sklearn import tree
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('crops.csv')
df.head()
df.tail()
#print(df.head())
df.describe()
#print(df.describe())
df['Crop_Label'].unique()
#list_crop= df['Crop_Label'].unique()
#print(list_crop)
#b = np.array('Maniocs')
#c = np.setdiff1d(list_crop,b)
#print(c)
s = df.corr()
#print(s)
sns.heatmap(s,annot = True)
features = df[['Avg_Temp','Avg_Rainfall','Avg_Humidity','Extent','Production']]
target = df['Crop_Label']
print(target)
# Initializing empty lists to store each model's accuracy and name
acc = []
model = []
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(features,target,test_size= 0.2,random_state = 2)
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
RF = RandomForestClassifier(n_estimators=29, criterion = 'entropy',random_state=0)
RF.fit(X_train,y_train)
predicted = RF.predict(X_test)
x = metrics.accuracy_score(y_test,predicted)
acc.append(x)
model.append('Random Forest')
#print("Random Forest Accuracy is ",x * 100)
#print(classification_report(y_test,predicted))
score = cross_val_score(RF,features,target,cv = 2)
score
data = np.array([[29,150, 80, 24006, 100]])
prediction = RF.predict(data)
print(prediction)
My sample CSV file is here:
https://drive.google.com/file/d/1IcIwZQI08sQxxTNOV0MPP30llkjvqtD2/view?usp=sharing
Please give me any ideas for getting the second-best crop from the prediction using the above code. Thank you.
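One possible approach, sketched here on synthetic data rather than crops.csv (the crop names and feature values are invented): predict_proba returns one probability per class, ordered like RF.classes_, so sorting those probabilities in descending order ranks the crops and the second entry is the second-best recommendation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Three hypothetical crops, each clustered around a different feature value.
centers = {"Maize": 0.0, "Paddy": 3.0, "Tea": 6.0}
X = np.vstack(
    [rng.normal(loc=c, scale=1.0, size=(40, 2)) for c in centers.values()]
)
y = np.repeat(list(centers), 40)

RF = RandomForestClassifier(n_estimators=29, criterion="entropy", random_state=0)
RF.fit(X, y)

sample = np.array([[2.5, 2.5]])

# One probability per class, in the order given by RF.classes_.
proba = RF.predict_proba(sample)[0]

# Stable descending sort so ties resolve the same way predict() does.
order = np.argsort(-proba, kind="stable")
ranked = RF.classes_[order]
best, second_best = ranked[0], ranked[1]

print("best:", best, "second best:", second_best)
```

With the code from the question, the same two lines (predict_proba followed by the sort) could be applied to the existing RF and data objects.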

Good practices - Sklearn Linear Regression with pandas

Is this the best way to work with pandas and a vectorizer: converting a DataFrame to a dict, vectorizing it, and putting everything into a new DataFrame? Or is there a better way?
import pandas as pd
# Putting AmesHousing.txt data into a dataframe
data = pd.read_csv('AmesHousing.txt', encoding='UTF-8', delimiter='\t')
data = data.fillna(0)
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int)
df = pd.DataFrame(vec.fit_transform(data.T.to_dict().values()), columns = [vec.get_feature_names()])
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
# Split the data into two pieces: train and test. Test will have 33% of the data; train will have the rest.
# train_test_split returns the train split first, then the test split.
train, test = train_test_split(df,test_size=0.33, random_state=42)
model = LinearRegression()
model.fit(train.drop(['SalePrice'], axis=1), train[['SalePrice']])
predict = model.predict(test.drop(['SalePrice'], axis=1))
MSE = mean_squared_error(predict,test[['SalePrice']])
RMSE = np.sqrt(MSE)
print('MSE:',MSE,'RMSE:',RMSE)
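One commonly used alternative to the dict round trip is pandas.get_dummies, which one-hot encodes the non-numeric columns and keeps everything in a DataFrame. This is only a sketch under assumptions: the tiny frame below is synthetic, and only the column names LotArea, Neighborhood and SalePrice are borrowed from the Ames data for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for AmesHousing.txt (values are made up).
rng = np.random.default_rng(42)
n = 100
data = pd.DataFrame({
    "LotArea": rng.integers(5000, 15000, size=n),
    "Neighborhood": rng.choice(["NAmes", "Gilbert", "OldTown"], size=n),
})
data["SalePrice"] = 20 * data["LotArea"] + rng.normal(0, 1000, size=n)

# One-hot encode the categorical columns; numeric columns pass through unchanged.
X = pd.get_dummies(data.drop(columns="SalePrice"))
y = data["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print("RMSE:", rmse)
```

Separating X and y before the split also avoids repeating the drop(['SalePrice']) call for both train and test.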

How can I forecast a y-variable based on multiple x-variables?

I'm testing code like this.
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt
from tabulate import tabulate
#Seaborn for easier visualization
import seaborn as sns
# Load data (pandas must be imported before its first use)
import pandas as pd
df = pd.read_csv('C:\\path_to_file\\train.csv')
df.shape
list(df)
# the model can only handle numeric values so filter out the rest
# data = df.select_dtypes(include=[np.number]).interpolate().dropna()
df1 = df.select_dtypes(include=[np.number])
df1.shape
list(df1)
df1.dtypes
df1 = df1.fillna(0)
#Prerequisites
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
#Split train/test sets
# y = df1.SalePrice
X = df1.drop(['index'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)
# Train model
clf = RandomForestRegressor(n_jobs=2, n_estimators=1000)
model = clf.fit(X_train, y_train)
# Feature Importance
headers = ['name', 'score']
values = sorted(zip(X_train.columns, model.feature_importances_), key=lambda x: x[1] * -1)
print(tabulate(values, headers, tablefmt='plain'))
(pd.Series(model.feature_importances_, index=X.columns)
    .nlargest(10)
    .plot(kind='barh'))
This works fine on some sample data that I found online. Now, rather than predicting a sales price as my y variable, I'm trying to figure out how to get the model to make a prediction like Target = True or Target = False. Or maybe my approach is wrong.
It's a bit confusing for me because of this line: df1 = df.select_dtypes(include=[np.number]). Only numeric columns are kept, which makes sense for a RandomForestRegressor. I'm just looking for some guidance on how to deal with a non-numeric prediction here.
You are dealing with a classification problem here with 2 classes (True, False). To get started take a look at a simple logistic regression model.
https://en.wikipedia.org/wiki/Logistic_regression
Since you are using sklearn try:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
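As a rough sketch of what the switch to classification could look like while staying with a random forest: the features remain numeric, only the target becomes True/False, and the regressor is swapped for RandomForestClassifier. The data, the column names x1 and x2, and the rule that generates the target are all invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic numeric features with a boolean target (assumption for the demo).
rng = np.random.default_rng(42)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["target"] = (df["x1"] + df["x2"]) > 0

X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.33
)

# A classifier accepts boolean (or string) labels directly.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)      # array of True/False values
accuracy = clf.score(X_test, y_test)   # mean accuracy on the test split
print(predictions[:5], accuracy)
```

LogisticRegression from the links above can be dropped in with the same fit and predict calls.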

How to use pandas to create a crosstab to show the prediction result of random forest predictor?

I'm a newbie to random forests (as well as Python).
I'm using a random forest classifier; the dataset is named 't2002'.
t2002.columns
So here are the columns:
Index(['IndividualID', 'ES2000_B01ID', 'NSSec_B03ID', 'Vehicle', 'Age_B01ID',
       'IndIncome2002_B02ID', 'MarStat_B01ID', 'EcoStat_B03ID',
       'MainMode_B03ID', 'TripStart_B02ID', 'TripEnd_B02ID',
       'TripDisIncSW_B01ID', 'TripTotalTime_B01ID', 'TripTravTime_B01ID',
       'TripPurpFrom_B01ID', 'TripPurpTo_B01ID'],
      dtype='object')
I'm using the code below to run the classifier:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
X_all = t2002.drop(['MainMode_B03ID'],axis=1)
y_all = t2002['MainMode_B03ID']
p = 0.2
X_train,X_test, y_train, y_test = train_test_split(X_all,y_all,test_size=p,
                                                  random_state=23)
clf = RandomForestClassifier()
acc_scorer = make_scorer(accuracy_score)
parameters = {
} # parameter is blank
grid_obj = GridSearchCV(clf,parameters,scoring=acc_scorer)
grid_obj = grid_obj.fit(X_train,y_train)
clf = grid_obj.best_estimator_
clf.fit(X_train,y_train)
predictions = clf.predict(X_test)
print(accuracy_score(y_test,predictions))
In this case, how could I use pandas to generate a crosstab (like a table) to show the detailed prediction results?
Thanks in advance!
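You can first create a confusion matrix using sklearn and then convert it to a pandas DataFrame. Compare the predictions against y_test (not the full t2002 column) so both have the same length, and pass an explicit label order so the row and column headers match the matrix.
import pandas as pd
from sklearn.metrics import confusion_matrix
# fixed label order shared by the matrix and the DataFrame headers
labels = sorted(y_all.unique())
# creating confusion matrix as array (rows: actual, columns: predicted)
confusion = confusion_matrix(y_test, predictions, labels=labels)
# converting to df
new_df = pd.DataFrame(confusion, index=labels, columns=labels)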
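It's easy to show the cross-validation results using pandas. Use cv_results_ as described in the docs. Note that in the question's code clf has been reassigned to best_estimator_, so read cv_results_ from grid_obj.
import pandas as pd
results = pd.DataFrame(grid_obj.cv_results_) # grid_obj is the GridSearchCV object
print(results.head())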
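Since the question asks for a crosstab specifically, pandas.crosstab can build the table straight from the true and predicted labels. A minimal sketch on synthetic data (the features f1 to f3 and the random labels are placeholders for t2002):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for t2002: three numeric features and a label in {1, 2, 3}.
rng = np.random.default_rng(23)
X_all = pd.DataFrame(rng.normal(size=(150, 3)), columns=["f1", "f2", "f3"])
y_all = pd.Series(rng.integers(1, 4, size=150), name="MainMode_B03ID")

X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.2, random_state=23
)
clf = RandomForestClassifier(n_estimators=20, random_state=23)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

# Rows are the actual labels, columns the predicted ones; margins adds totals.
table = pd.crosstab(
    y_test.to_numpy(),
    predictions,
    rownames=["Actual"],
    colnames=["Predicted"],
    margins=True,
)
print(table)
```

Converting y_test to a NumPy array avoids any index alignment between the Series and the prediction array.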

Is there a way to output selected column names from the SelectFromModel method?

I performed feature selection using ExtraTreesClassifier and SelectFromModel on a data set loaded as a DataFrame. I want to save the selected features as a DataFrame to a CSV file while keeping the column names. Note that the output is a NumPy array containing the values of the important feature columns, not the column headers.
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
import numpy as np
df = pd.read_csv('los_10_one_encoder.csv')
y = df['LOS'] # target
X= df.drop('LOS',axis=1) # drop LOS column
clf = ExtraTreesClassifier()
clf = clf.fit(X, y)
print(clf.feature_importances_)
model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
feature_idx = model.get_support()
# index X.columns (not df.columns, which still contains 'LOS') so the mask length matches
feature_name = X.columns[feature_idx]
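X_new is a NumPy array, which has no to_csv() method, so first wrap it in a DataFrame with the selected column names, then use DataFrame.to_csv() to save it as a CSV file.
Do the following:
pd.DataFrame(X_new, columns=feature_name).to_csv("your/path", sep=';')
See the pandas documentation for DataFrame.to_csv().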
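A self-contained sketch of the whole flow, assuming synthetic data in place of los_10_one_encoder.csv (the column names and the rule that generates the target are invented). It uses the boolean mask from get_support() to slice the original DataFrame, so the column names carry over without rebuilding them:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic features: only "age" and "bmi" influence the target (assumption).
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(200, 5)),
    columns=["age", "bmi", "hr", "noise1", "noise2"],
)
y = (X["age"] + X["bmi"] > 0).astype(int)

clf = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)
selector = SelectFromModel(clf, prefit=True)

# Boolean mask over X's columns: True where the feature was kept.
mask = selector.get_support()

# Slicing the DataFrame with the mask keeps the column names.
selected = X.loc[:, mask]

# to_csv returns a string when no path is given; pass a path to write a file.
csv_text = selected.to_csv(index=False)
print(list(selected.columns))
```

Passing a file name to to_csv, for example selected.to_csv("selected.csv", index=False), would write the same content to disk.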
