RandomForest Regressor: Predict and check performance - python

I am trying to predict the price 5 days into the future. I followed this tutorial, which predicts a categorical variable and therefore uses a RandomForest Classifier. I am using the same approach but with a RandomForest Regressor, since I have to predict the last price 5 days ahead. I am confused about how to make the prediction.
Here is my code:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import roc_curve, auc, roc_auc_score
priceTrainData = pd.read_csv('trainPriceData.csv')
#read test data set
priceTestData = pd.read_csv('testPriceData.csv')
priceTrainData['Type'] = 'Train'
priceTestData['Type'] = 'Test'
target_col = "last"
features = ['low', 'high', 'open', 'last', 'annualized_volatility', 'weekly_return',
'daily_average_volume_10',# try to use log in 10, 30,
'daily_average_volume_30', 'market_cap']
priceTrainData['is_train'] = np.random.uniform(0, 1, len(priceTrainData)) <= .75
Train, Validate = priceTrainData[priceTrainData['is_train']==True], priceTrainData[priceTrainData['is_train']==False]
x_train = Train[list(features)].values
y_train = Train[target_col].values
x_validate = Validate[list(features)].values
y_validate = Validate[target_col].values
x_test = priceTestData[list(features)].values
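# random_state seeds the forest itself; random.seed() was called without importing random and does not affect scikit-learn
rf = RandomForestRegressor(n_estimators=1000, random_state=100)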
rf.fit(x_train, y_train)
status = rf.predict(x_validate)
My first question is how do I get 5 predicted values, and my second question is how do I check the performance of the RandomForest Regressor? Kindly assist me.

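Your x_validate is a NumPy array (it comes from .values), so you can slice the first five rows and pass them to predict to get five predictions:
rf.predict(x_validate[0:5])
For your second question, score returns the R-squared value. Computing it on the held-out validation rows gives a fairer estimate than computing it on the training rows:
rf.score(x_validate, y_validate)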
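A minimal sketch of both steps, assuming synthetic data in place of the price CSV files (the array shapes, coefficients and n_estimators value are made up for illustration). Note that predict returns one value per input row; forecasting five days ahead would additionally need a target built from future prices, which this sketch does not attempt.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the price features and target (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

x_train, x_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.25, random_state=0
)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(x_train, y_train)

# predict returns one value per input row, so five rows give five predictions.
first_five = rf.predict(x_validate[:5])

# Score on held-out rows; training-set scores are usually optimistic.
validate_pred = rf.predict(x_validate)
r2 = r2_score(y_validate, validate_pred)
mae = mean_absolute_error(y_validate, validate_pred)
print(first_five, r2, mae)
```

Here r2_score on the validation predictions is the same quantity that rf.score(x_validate, y_validate) reports; the mean absolute error is added because it is in the units of the target.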

Related

Sales Order Delivery time Prediction Using Random Forest

This is a very basic question. I have implemented a random forest to predict the number of days a delivery takes, based on origin, destination, vendor, etc.
I trained it on the past 12 months of data (80% train, 20% test) and got good results.
My question: when training, I already had the number of days taken for delivery, but future data will not have that column. How am I supposed to use the trained model to predict from origin, destination, dates, etc.?
This is my random forest. As you can see, I split the dataset into two pieces, y and x: y is the column to predict and x is the whole dataset minus y. A model trained this way only needs the x columns at prediction time, so you can use it to predict, in your case, the delivery time.
NOTE: this code is for a forest REGRESSOR; if you need the classifier code, let me know!
Just the dataframe definitions:
y = df[targetkolom]  # target column to predict
x = df.drop(targetkolom, axis=1)  # whole dataset minus the target column
Whole code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
df = pd.read_csv('Dataset Carprices.csv')
df.head()
df = df.drop(['car_ID', 'highwaympg', 'citympg'], axis=1)
targetkolom = 'price'
# Preparation of CarName: keep only the first word of each name
df['CarName'] = df['CarName'].str.split().str[0]
pd.set_option('display.max_columns', 200)
#(df.describe())
# One-hot encode the categorical columns
df = pd.get_dummies(df, columns=['CarName','fueltype','aspiration','doornumber','carbody',
'drivewheel','enginelocation','enginetype','cylindernumber',
'fuelsystem'], prefix="", prefix_sep="")
#print(df.info())
y = df[targetkolom]
x = df.drop(targetkolom, axis=1)
#Normalisation
x = (x-x.min())/(x.max()-x.min())
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.3 ,random_state=7)
model = RandomForestRegressor(n_estimators=10000, random_state=1)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('R2 score:', r2_score(y_test,y_pred))
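To make the "future data without the target column" step concrete, here is a small sketch. The column names (origin, destination, distance_km, delivery_days) and all values are hypothetical. It assumes the new rows get the same one-hot encoding as the training rows and are then aligned to the training columns before calling predict.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical historical orders, including the known delivery time.
history = pd.DataFrame({
    "origin": ["A", "B", "A", "C", "B", "C"],
    "destination": ["X", "Y", "Y", "X", "X", "Y"],
    "distance_km": [120, 300, 180, 90, 260, 150],
    "delivery_days": [2, 5, 3, 1, 4, 3],
})
target = "delivery_days"

x_train = pd.get_dummies(history.drop(columns=[target]))
y_train = history[target]

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(x_train, y_train)

# Future orders have no delivery_days column yet.
future = pd.DataFrame({
    "origin": ["A", "C"],
    "destination": ["Y", "Y"],
    "distance_km": [200, 110],
})

# Encode the same way, then align to the training columns so that
# categories missing from the new data become all-zero columns.
x_future = pd.get_dummies(future).reindex(columns=x_train.columns, fill_value=0)
predicted_days = model.predict(x_future)
print(predicted_days)
```

The reindex step matters because get_dummies only creates columns for categories it sees; without it the new frame could have fewer or differently ordered columns than the model was trained on.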

ValueError: dimension mismatch While Predicting New Values Sentiment Analysis

I am relatively new to the machine learning subject. I am trying to do sentiment analysis prediction.
The Type column holds the sentiment of the tweet (pos, neg or neutral, encoded as 0, 1 and 2). The Tweet column holds the tweet text.
I am trying to predict the sentiment of a new set of tweets as 0, 1 or 2.
When I ran the code given here I got a dimension mismatch error.
import pandas as pd
train_tweets = pd.read_csv("tweets_type.csv")
from sklearn.model_selection import train_test_split
y = train_tweets.Type
X= train_tweets.Tweet
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=1)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(train_X)
train_X_dtm = vect.transform(train_X)
test_X_dtm = vect.transform(test_X)
test_X_dtm
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
%time nb.fit(train_X_dtm, train_y)
# make class predictions for test_X_dtm
y_pred_class = nb.predict(test_X_dtm)
# calculate accuracy of class predictions
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
metrics.accuracy_score(test_y, y_pred_class)
march_tweets = pd.read_csv("march_data.csv")
X=march_tweets.Tweet
vect.fit(X)
train_new_dtm = vect.transform(X)
new_pred_class = nb.predict(train_new_dtm)
The error I am getting is here:
Would be so glad if you could help me.
It seems I made a mistake by fitting the vectorizer on X after I had already fitted it on train_X. Refitting rebuilds the vocabulary, so the new matrix no longer has the columns the model was trained on. There is no need to fit again once the vectorizer is fitted, so I removed this line and it worked perfectly:
vect.fit(X)
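A small sketch of the fit-once, transform-many pattern, using a few made-up sentences in place of the tweet CSV files (the texts and labels are assumptions for illustration only):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the training tweets and labels (assumption).
train_texts = ["great day", "terrible service", "okay I guess", "great service"]
train_labels = [0, 1, 2, 0]
new_texts = ["what a great phone", "terrible day"]

vect = CountVectorizer()
train_dtm = vect.fit_transform(train_texts)  # learn the vocabulary once
nb = MultinomialNB().fit(train_dtm, train_labels)

# Only transform new text: this keeps the same columns the model was trained on.
new_dtm = vect.transform(new_texts)
new_pred = nb.predict(new_dtm)

# Refitting would build a different vocabulary, hence the dimension mismatch.
refit_width = CountVectorizer().fit_transform(new_texts).shape[1]
print(train_dtm.shape[1], new_dtm.shape[1], refit_width)
```

Words in the new text that were not in the training vocabulary are simply ignored by transform, which is the expected behaviour for a model that never saw them.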

How can I forecast a y-variable based on multiple x-variables?

I'm testing code like this.
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
import numpy as np
import pandas as pd  # pd.read_csv is used below, before the later import
import matplotlib.pyplot as plt
from tabulate import tabulate
#Seaborn for easier visualization
import seaborn as sns
# Load data
df = pd.read_csv('C:\\path_to_file\\train.csv')
df.shape
list(df)
# the model can only handle numeric values so filter out the rest
# data = df.select_dtypes(include=[np.number]).interpolate().dropna()
df1 = df.select_dtypes(include=[np.number])
df1.shape
list(df1)
df1.dtypes
df1 = df1.fillna(0)
#Prerequisites
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
#Split train/test sets
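y = df1.SalePrice  # target; must be defined before train_test_split
X = df1.drop(['index', 'SalePrice'], axis=1)  # drop the target from the features as well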
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)
# Train model
clf = RandomForestRegressor(n_jobs=2, n_estimators=1000)
model = clf.fit(X_train, y_train)
# Feature Importance
headers = ['name', 'score']
values = sorted(zip(X_train.columns, model.feature_importances_), key=lambda x: x[1] * -1)
print(tabulate(values, headers, tablefmt='plain'))
(pd.Series(model.feature_importances_, index=X.columns)
.nlargest(10)
.plot(kind='barh'))
This works fine on some sample data that I found online. Now, rather than predicting a sales price as my y variable, I'm trying to figure out how to get the model to make a prediction like target = True or target = False, or maybe my approach is wrong.
It's a bit confusing for me because of this line: df1 = df.select_dtypes(include=[np.number]). Only numbers are included, which makes sense for a RandomForestRegressor. I'm just looking for some guidance on how to deal with a non-numeric prediction here.
You are dealing with a classification problem here with 2 classes (True, False). To get started, take a look at a simple logistic regression model.
https://en.wikipedia.org/wiki/Logistic_regression
Since you are using sklearn try:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
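As a rough sketch of what that looks like in code: the features stay numeric, and only the target holds the True/False labels. The data here is synthetic and the column names (f1, f2, f3, target) are hypothetical. A RandomForestClassifier is shown next to LogisticRegression since the question already uses a random forest.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic numeric features with a boolean target column (assumption).
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(300, 3)), columns=["f1", "f2", "f3"])
df["target"] = (df["f1"] + 0.5 * df["f2"] > 0)  # True / False labels

X = df.drop(columns=["target"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Features must be numeric, but the target may hold class labels such as True/False.
logit = LogisticRegression().fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

logit_pred = logit.predict(X_test)
forest_pred = forest.predict(X_test)
print(logit.score(X_test, y_test), forest.score(X_test, y_test))
```

For classifiers, score reports accuracy rather than R-squared, so the two printed numbers are the fraction of test rows labelled correctly.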

How to use pandas to create a crosstab to show the prediction result of random forest predictor?

I'm a newbie to random forests (as well as Python).
I'm using a random forest classifier; the dataset is named 't2002'.
t2002.columns
So here are the columns:
Index(['IndividualID', 'ES2000_B01ID', 'NSSec_B03ID', 'Vehicle',
'Age_B01ID',
'IndIncome2002_B02ID', 'MarStat_B01ID', 'EcoStat_B03ID',
'MainMode_B03ID', 'TripStart_B02ID', 'TripEnd_B02ID',
'TripDisIncSW_B01ID', 'TripTotalTime_B01ID', 'TripTravTime_B01ID',
'TripPurpFrom_B01ID', 'TripPurpTo_B01ID'],
dtype='object')
I'm using the code below to run the classifier:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
X_all = t2002.drop(['MainMode_B03ID'],axis=1)
y_all = t2002['MainMode_B03ID']
p = 0.2
X_train,X_test, y_train, y_test = train_test_split(X_all,y_all,test_size=p,
random_state=23)
clf = RandomForestClassifier()
acc_scorer = make_scorer(accuracy_score)
parameters = {
} # parameter is blank
grid_obj = GridSearchCV(clf,parameters,scoring=acc_scorer)
grid_obj = grid_obj.fit(X_train,y_train)
clf = grid_obj.best_estimator_
clf.fit(X_train,y_train)
predictions = clf.predict(X_test)
print(accuracy_score(y_test,predictions))
In this case, how could I use pandas to generate a crosstab (like a table) to show the detailed prediction results?
Thanks in advance!
You can first create a confusion matrix using sklearn and then convert it to a pandas DataFrame. Compare the predictions against y_test, since both cover the same test rows.
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
# fix the label order so rows and columns are labelled correctly
labels = np.unique(y_all)
# creating confusion matrix as array (rows: actual, columns: predicted)
confusion = confusion_matrix(y_test, predictions, labels=labels)
# converting to df
new_df = pd.DataFrame(confusion, index=labels, columns=labels)
It's easy to show the cross-validation results of the grid search using pandas. Use cv_results_ as described in the docs.
import pandas as pd
results = pd.DataFrame(grid_obj.cv_results_)  # grid_obj is the fitted GridSearchCV object
print(results.head())
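Since the question asks for a crosstab specifically, pd.crosstab can build the table directly from the true and predicted labels. This is a sketch with toy values standing in for y_test and predictions from the question:

```python
import pandas as pd

# Toy stand-ins for y_test and predictions from the question (assumption).
y_test = pd.Series([1, 2, 2, 3, 1, 3, 2], name="MainMode_B03ID")
predictions = [1, 2, 3, 3, 1, 2, 2]

# to_numpy() drops the Series index, so rows are paired by position
# with the predictions rather than aligned on index labels.
table = pd.crosstab(
    y_test.to_numpy(),
    predictions,
    rownames=["Actual"],
    colnames=["Predicted"],
    margins=True,  # adds row and column totals under the label "All"
)
print(table)
```

The diagonal cells are the correct predictions; off-diagonal cells show which classes get confused with each other.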

How to make (yes/no or 1-0) decisions with random forest?

This is the data set from Kaggle's Titanic competition (train and test csv files). Each file has features of passengers such as ID, sex, age, etc. The train file has a "survived" column with 0 and 1 values. The test file is missing the survived column as it has to be predicted.
This is my simple code using random forest to give me a starting benchmark:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve, auc
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')
train['Type']='Train' #Create a flag for Train and Test Data set
test['Type']='Test'
fullData = pd.concat([train,test],axis=0) #Combined both Train and Test Data set
ID_col = ['PassengerId']
target_col = ["Survived"]
cat_cols = ['Name','Ticket','Sex','Cabin','Embarked']
num_cols= ['Pclass','Age','SibSp','Parch','Fare']
other_col=['Type'] #Test and Train Data set identifier
num_cat_cols = num_cols+cat_cols # Combined numerical and Categorical variables
for var in num_cat_cols:
    if fullData[var].isnull().any():
        fullData[var+'_NA'] = fullData[var].isnull()*1
#Impute numerical missing values with mean
fullData[num_cols] = fullData[num_cols].fillna(fullData[num_cols].mean())  # inplace=True would return None here
#Impute categorical missing values with -9999
fullData[cat_cols] = fullData[cat_cols].fillna(value = -9999)
#create label encoders for categorical features
for var in cat_cols:
    number = LabelEncoder()
    fullData[var] = number.fit_transform(fullData[var].astype('str'))
train=fullData[fullData['Type']=='Train'].copy()  # copy so new columns can be added safely
test=fullData[fullData['Type']=='Test'].copy()
train['is_train'] = np.random.uniform(0, 1, len(train)) <= .75
Train, Validate = train[train['is_train']==True], train[train['is_train']==False]
features=list(set(list(fullData.columns))-set(ID_col)-set(target_col)-set(other_col))
x_train = Train[list(features)].values
y_train = Train["Survived"].values
x_validate = Validate[list(features)].values
y_validate = Validate["Survived"].values
x_test=test[list(features)].values
Train[list(features)]
#*************************
from sklearn import tree
random.seed(100)
rf = RandomForestClassifier(n_estimators=1000)
rf.fit(x_train, y_train)
status = rf.predict_proba(x_validate)
fpr, tpr, _ = roc_curve(y_validate, status[:,1]) #metrics. added by me
roc_auc = auc(fpr, tpr)
print(roc_auc)
final_status = rf.predict_proba(x_test)
test["Survived2"]=final_status[:,1]
test['my prediction']=np.where(test.Survived2 > 0.6, 1, 0)
test
As you can see, final_status gives the probability of survival. I'm wondering how to get yes/no (1 or 0) answers from it. The easiest thing I could think of was to say that if the probability is greater than 0.6 the person survived, and otherwise died (the 'my prediction' column), but once I submit the results, the predictions are not good at all.
I appreciate any insights. Thanks
Transforming your probability into binary output is the right way to go, but why did you choose > .6 and not > .5?
Also, if you are getting bad results, it is most likely because the data cleaning and feature extraction need more work. For example, the title ("Mr", "Mrs", ...) indicates gender, which is a very important feature in this problem (I assume this is the Titanic competition from Kaggle).
I just needed to use a line like:
out = rf.predict(x_test)
and that would be the 0/1 answers I was looking for.
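For reference, a short sketch of how predict relates to predict_proba, using synthetic binary data from make_classification instead of the Titanic files (an assumption for illustration). In scikit-learn's random forest, predict returns the class with the highest averaged probability, so a 0.6 cut-off is stricter than the default rule.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data standing in for the Titanic features (assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[:200], y[:200])

proba = rf.predict_proba(X[200:])   # one column per class, ordered as rf.classes_
hard = rf.predict(X[200:])          # 0/1 labels directly

# predict picks the class with the highest averaged probability.
from_proba = rf.classes_[np.argmax(proba, axis=1)]

# A custom cut-off such as 0.6 never labels more rows as 1 than the default rule.
custom = np.where(proba[:, 1] > 0.6, 1, 0)
print((hard == from_proba).all(), hard.sum(), custom.sum())
```

Keeping predict_proba is still useful when the cut-off itself should be tuned, for example on the validation split.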
