Good practices - Sklearn Linear Regression with pandas - python

Is this the best way to work with pandas and a vectorizer? Converting a DataFrame to a dict, vectorizing it, and putting everything into a new DataFrame? Or is there a better way to do this?
import pandas as pd
# Putting AmesHousing.txt data into a dataframe
data = pd.read_csv('AmesHousing.txt', encoding='UTF-8', delimiter='\t')
data = data.fillna(0)
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int)
df = pd.DataFrame(vec.fit_transform(data.T.to_dict().values()), columns=vec.get_feature_names_out())
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
# Split the data into two pieces: train and test. Test gets 33% of the data; train gets the rest
train, test = train_test_split(df, test_size=0.33, random_state=42)
model = LinearRegression()
model.fit(train.drop(['SalePrice'], axis=1), train[['SalePrice']])
predict = model.predict(test.drop(['SalePrice'], axis=1))
MSE = mean_squared_error(test[['SalePrice']], predict)
RMSE = np.sqrt(MSE)
print('MSE:',MSE,'RMSE:',RMSE)
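For comparison, pandas can one-hot-encode the object columns directly with pd.get_dummies, skipping the DataFrame-to-dict-to-DictVectorizer round-trip entirely. A minimal sketch under the same assumptions (AmesHousing.txt with a SalePrice target):
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

data = pd.read_csv('AmesHousing.txt', encoding='UTF-8', delimiter='\t').fillna(0)
# get_dummies one-hot-encodes the non-numeric columns and keeps numeric ones as-is
df = pd.get_dummies(data)

train, test = train_test_split(df, test_size=0.33, random_state=42)
model = LinearRegression()
model.fit(train.drop(['SalePrice'], axis=1), train['SalePrice'])
predict = model.predict(test.drop(['SalePrice'], axis=1))
print('RMSE:', np.sqrt(mean_squared_error(test['SalePrice'], predict)))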

Related

How can I optimize my code so my Google Colab doesn't crash

I ran into an issue where Google Colab's RAM runs out. I use the free version, and I'm not sure whether it simply can't handle the data or whether my code is badly optimized. As I'm new to the field, I believe my code is very slow and poorly optimized. I wanted to ask for a bit of help, as I'm still learning.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('path/beforeNeural.csv')
df.shape
df.head()
df.isnull().sum()
encoder = LabelEncoder()
df['Property Type'] = encoder.fit_transform(df['Property Type'])
df['Old/New'] = encoder.fit_transform(df['Old/New'])
df['Record Status - monthly file only'] = encoder.fit_transform(df['Record Status - monthly file only'])
df['PPDCategory Type'] = encoder.fit_transform(df['PPDCategory Type'])
df['County'] = encoder.fit_transform(df['County'])
df['District'] = encoder.fit_transform(df['District'])
df['Town/City'] = encoder.fit_transform(df['Town/City'])
df['Duration'] = encoder.fit_transform(df['Duration'])
df['Transaction unique identifier'] = encoder.fit_transform(df['Transaction unique identifier'])
df['Date of Transfer'] = encoder.fit_transform(df['Date of Transfer'])
X = df.drop(columns='Price', axis=1)
Y = df['Price']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
df.shape
boostenc = XGBRegressor()
boostenc.fit(X_train, Y_train)
I'll give it a try; here is a possible way to optimize your code.
Code:
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('path/beforeNeural.csv')
categorical_columns = ['Property Type', 'Old/New', 'Record Status - monthly file only', 'PPDCategory Type', 'County', 'District', 'Town/City', 'Duration', 'Transaction unique identifier', 'Date of Transfer']
encoder = OneHotEncoder()
X_concat = encoder.fit_transform(df[categorical_columns])  # scipy sparse matrix
# Approach 1: densify into a regular DataFrame (simple, but uses more RAM)
X_concat = pd.DataFrame(X_concat.toarray(), columns=encoder.get_feature_names_out(categorical_columns))
# Approach 2 (alternative to Approach 1): a sparse-backed DataFrame, which saves memory
# X_concat = pd.DataFrame.sparse.from_spmatrix(X_concat, columns=encoder.get_feature_names_out(categorical_columns))
X_numerical = df.drop(columns = categorical_columns + ['Price'])
X = pd.concat([X_numerical, X_concat], axis = 1)
Y = df['Price']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 2)
boostenc = XGBRegressor()
boostenc.fit(X_train, Y_train)
Note: I removed the unused imports and deleted calls such as df.head() in the middle of the code, which do nothing there and don't print anything when used that way in a script.
Code Explanation:
Instead of LabelEncoder, I used OneHotEncoder to one-hot-encode all of the categorical features.
This creates a new binary column for each unique value in the categorical features.
In general, one-hot encoding is usually a better way to handle categorical features in machine learning than just assigning integer codes with LabelEncoder, because the integer codes imply an ordering between categories that usually doesn't exist.
I extracted the names of all the categorical columns into a list; that way it's easier to modify them when needed.
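If RAM is still tight, the toarray() call in Approach 1 is the likely culprit, since it densifies the one-hot matrix. Here is a minimal sketch of an alternative that keeps the features sparse end-to-end (XGBoost accepts scipy sparse input; this assumes the non-categorical columns are all numeric):
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBRegressor

df = pd.read_csv('path/beforeNeural.csv')
categorical_columns = ['Property Type', 'Old/New', 'Record Status - monthly file only',
                       'PPDCategory Type', 'County', 'District', 'Town/City',
                       'Duration', 'Transaction unique identifier', 'Date of Transfer']

encoder = OneHotEncoder()                       # sparse output by default
X_cat = encoder.fit_transform(df[categorical_columns])
# assumes the remaining feature columns are numeric
X_num = csr_matrix(df.drop(columns=categorical_columns + ['Price']).to_numpy())
X = hstack([X_num, X_cat]).tocsr()              # the matrix is never densified
Y = df['Price']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
boostenc = XGBRegressor()
boostenc.fit(X_train, Y_train)
Also note that one-hot-encoding a column like Transaction unique identifier, which is presumably unique per row, creates one new column per row; if so, dropping that column is probably the single biggest memory win.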

How to use Machine Learning in Python to predict a binary outcome with a Pandas Dataframe

I have the following code:
import nfl_data_py as nfl
pbp = nfl.import_pbp_data([2022], downcast=True, cache=False, alt_path=None)
which returns a DataFrame of every play that occurred in the 2022 NFL season. The columns I want to train on are score_differential, yardline_100, ydstogo, down, and half_seconds_remaining, to predict the play_type: either run or pass.
Example: I feed it a -4 score differential, the 25-yard line, 4th down, 16 yards to go, and 300 half-seconds remaining, and it returns whatever it learned from the DataFrame, probably pass.
How would I go about doing this? Should I use a scikit-learn decision tree?
Here you go:
import nfl_data_py as nfl
import pandas as pd
#import train_test_split
from sklearn.model_selection import train_test_split
#we need to encode the play_type column
from sklearn.preprocessing import LabelEncoder
#import the model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
pbp = nfl.import_pbp_data([2022], downcast=True, cache=False, alt_path=None)
df = pd.DataFrame(pbp)
#there are definitely other features you can use, but these are the ones you want.
df = df[['score_differential', 'yardline_100', 'ydstogo', 'down', 'half_seconds_remaining', 'play_type']]
df = df.dropna()
# drop the rows whose play_type is 'None' or 'no_play'
df = df[df['play_type'] != 'None']
df = df[df['play_type'] != 'no_play']
#reset the index
df = df.reset_index(drop=True)
#encode the play_type column
le = LabelEncoder()
df['play_type_encode'] = le.fit_transform(df['play_type'])
# train test split
X_train, X_test, y_train, y_test = train_test_split(df.drop(['play_type', 'play_type_encode'], axis=1), df['play_type_encode'], test_size=0.3, random_state=42)
#instantiate the model
rfc = RandomForestClassifier(n_estimators=100)
#fit the model
rfc.fit(X_train, y_train)
#predict the model
rfc_pred = rfc.predict(X_test)
#evaluate the model
print(classification_report(y_test, rfc_pred))
#plot the confusion matrix
plt.figure(figsize=(10,6))
sns.heatmap(confusion_matrix(y_test, rfc_pred), annot=True)
plt.xlabel('Predicted')
plt.ylabel('True')
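To close the loop on the concrete example in the question, you could then feed the fitted model a single hypothetical play (columns in the same order as the training frame) and decode the predicted class back to a label:
# Hypothetical play from the question: -4 score differential, 25-yard line,
# 16 yards to go, 4th down, 300 half-seconds remaining
sample = pd.DataFrame([[-4, 25, 16, 4, 300]],
                      columns=['score_differential', 'yardline_100', 'ydstogo',
                               'down', 'half_seconds_remaining'])
print(le.inverse_transform(rfc.predict(sample)))  # e.g. ['pass']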

How can I forecast a y-variable based on multiple x-variables?

I'm testing code like this.
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
import numpy as np
import pandas as pd  # used below for pd.read_csv
import matplotlib.pyplot as plt
from tabulate import tabulate
#Seaborn for easier visualization
import seaborn as sns
# Load data
df = pd.read_csv('C:\\path_to_file\\train.csv')
df.shape
list(df)
# the model can only handle numeric values so filter out the rest
# data = df.select_dtypes(include=[np.number]).interpolate().dropna()
df1 = df.select_dtypes(include=[np.number])
df1.shape
list(df1)
df1.dtypes
df1 = df1.fillna(0)
# Prerequisites
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
# Split train/test sets
y = df1.SalePrice
X = df1.drop(['SalePrice'], axis=1)  # drop the target from the feature matrix
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)
# Train model
clf = RandomForestRegressor(n_jobs=2, n_estimators=1000)
model = clf.fit(X_train, y_train)
# Feature Importance
headers = ['name', 'score']
values = sorted(zip(X_train.columns, model.feature_importances_), key=lambda x: x[1] * -1)
print(tabulate(values, headers, tablefmt='plain'))
(pd.Series(model.feature_importances_, index=X.columns)
.nlargest(10)
.plot(kind='barh'))
This works fine on some sample data that I found online. Now, rather than predicting a sale price as my y variable, I'm trying to figure out how to get the model to make a prediction like target = True or target = False, or maybe my approach is wrong.
It's a bit confusing for me because of this line: df1 = df.select_dtypes(include=[np.number]). Only numbers are included, which makes sense for a RandomForestRegressor. I'm just looking for some guidance on how to deal with a non-numeric prediction here.
You are dealing with a classification problem here with 2 classes (True and False). To get started, take a look at a simple logistic regression model.
https://en.wikipedia.org/wiki/Logistic_regression
Since you are using sklearn try:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
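A minimal sketch of what that could look like, assuming the label lives in a boolean column named target (the column name is hypothetical) and df1 is the numeric frame from the question:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 'target' is a hypothetical column name; substitute your actual label column
y = df1['target'].astype(int)             # map True/False to 1/0
X = df1.drop(['target'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)

clf = LogisticRegression(max_iter=1000)   # a higher max_iter helps convergence
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))          # mean accuracy on the held-out set
The RandomForestRegressor in your code also has a classification counterpart, RandomForestClassifier, which slots into the same split/fit/score pattern.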

Random Forest Error (input variables with inconsistent numbers of samples)

After reading so many examples with 'inconsistent number of samples' errors, I am still not able to see what is wrong with my code.
In an Excel file, sheet 1 contains data and sheet 2 contains a shortlist of variables.
I saved the variables in sheet 2 into an array and fed it to a Random Forest model to evaluate their impact on a parameter in sheet 1.
But I am getting "Found input variables with inconsistent numbers of samples: [54, 2016]"
54 is the number of variables in sheet 2.
2016 is the number of rows of data in sheet 1.
I am trying to see how these 54 variables impact 'Target' variable in sheet 1.
How should I manipulate my data to make this work?
Many thanks in advance.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
df = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev2.xlsx',sheet_name=0)
df2 = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev2.xlsx',sheet_name=1)
df['DateTime']=pd.to_datetime(df['Time Stamp'], format='%Y-%m-%d %H:%M:%S')
df.set_index(df['DateTime'], inplace=True)
print(len(df2.columns))
allvar = list()
for each_var in df2.columns:
    allvar.append(each_var)
allvar = np.array(allvar)
print(allvar)
target = df['(CUP) Chiller Optimization Plant Efficiency [kW/RT]']
target=target.values.reshape(len(target),1)
allvar_train,allvar_test,target_train,target_test= train_test_split(allvar,target, random_state=0, test_size=0.6)
clf = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
clf.fit(allvar_train, target_train)
for feature in zip(feat_labels, clf.feature_importances_):
    print(feature)
Sheet 1 (saved as df), Sheet 2 (saved as df2), allvar_train, and target_train are shown in screenshots, along with the error logs. Error log 2 shows: Unknown label type: 'continuous'.
The issue is with train_test_split: you are only passing the feature column names, not the data. Use the list of columns to select the data from the DataFrame, like this:
allvar_train, allvar_test, target_train, target_test = train_test_split(df[allvar], target, random_state=0, test_size=0.6)
You don't necessarily need to convert allvar and target to numpy arrays; they can be used directly in train_test_split.
Note: this issue has nothing to do with Random Forest.
Here is the code that works for me. Note that it switches to RandomForestRegressor, since the target is continuous; that is also what the "Unknown label type: 'continuous'" error was complaining about.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
df = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev3.xlsx',sheet_name=0)
df2 = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev3.xlsx',sheet_name=1)
df['DateTime']=pd.to_datetime(df['Time Stamp'], format='%Y-%m-%d %H:%M:%S')
df.set_index(df['DateTime'], inplace=True)
print(len(df2.columns))
allvarlist = list()
for each_var in df2.columns:
    allvarlist.append(each_var)
countvar = len(allvarlist)
allvar = df[allvarlist]
allvar = allvar.values.reshape(len(allvar),countvar)
target = df['(CUP) Chiller Optimization Plant Efficiency [kW/RT]']
target=target.values.reshape(len(target),1)
allvar_train,allvar_test,target_train,target_test= train_test_split(allvar,target, random_state=0, test_size=0.7)
clf = RandomForestRegressor(n_estimators=10000, random_state=0, n_jobs=-1)
#print(allvar_train)
#print(target_train)
clf.fit(allvar_train,np.ravel(target_train))
for feature in zip(allvarlist, clf.feature_importances_):
    print(feature)
importances = clf.feature_importances_
#indices = np.argsort(importances)
plt.figure().set_size_inches(14,16)
plt.barh(range(allvar_train.shape[1]), importances, color="r")
plt.yticks(range(allvar_train.shape[1]),allvarlist)

How to use pandas to create a crosstab to show the prediction result of random forest predictor?

I'm a newbie to random forests (as well as Python).
I'm using a random forest classifier, and the dataset is named t2002.
t2002.columns
So here are the columns:
Index(['IndividualID', 'ES2000_B01ID', 'NSSec_B03ID', 'Vehicle', 'Age_B01ID',
       'IndIncome2002_B02ID', 'MarStat_B01ID', 'EcoStat_B03ID',
       'MainMode_B03ID', 'TripStart_B02ID', 'TripEnd_B02ID',
       'TripDisIncSW_B01ID', 'TripTotalTime_B01ID', 'TripTravTime_B01ID',
       'TripPurpFrom_B01ID', 'TripPurpTo_B01ID'],
      dtype='object')
I'm using the code below to run the classifier:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
X_all = t2002.drop(['MainMode_B03ID'],axis=1)
y_all = t2002['MainMode_B03ID']
p = 0.2
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=p,
                                                    random_state=23)
clf = RandomForestClassifier()
acc_scorer = make_scorer(accuracy_score)
parameters = {}  # the parameter grid is left empty
grid_obj = GridSearchCV(clf,parameters,scoring=acc_scorer)
grid_obj = grid_obj.fit(X_train,y_train)
clf = grid_obj.best_estimator_
clf.fit(X_train,y_train)
predictions = clf.predict(X_test)
print(accuracy_score(y_test,predictions))
In this case, how could I use pandas to generate a crosstab (like a table) to show the detailed prediction results?
Thanks in advance!
You can first create a confusion matrix using sklearn and then convert it to a pandas DataFrame.
from sklearn.metrics import confusion_matrix
# create the confusion matrix as an array; compare the held-out labels with the predictions
labels = clf.classes_
confusion = confusion_matrix(y_test, predictions, labels=labels)
# convert to a DataFrame, with rows and columns in the same label order
new_df = pd.DataFrame(confusion, index=labels, columns=labels)
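Since the question asks specifically for a crosstab, pd.crosstab builds the same table directly from the true and predicted labels; a small sketch using the y_test/predictions pair from the question's code:
import pandas as pd

# cross-tabulate actual vs. predicted classes
pred = pd.Series(predictions, index=y_test.index, name='Predicted')
print(pd.crosstab(y_test.rename('Actual'), pred))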
It's easy to show all of the search's cross-validation results using pandas. Use cv_results_ as described in the docs.
import pandas as pd
results = pd.DataFrame(grid_obj.cv_results_)  # grid_obj is the GridSearchCV object
print(results.head())
