Convert object feature into float - python

I'm trying to run a DecisionTreeClassifier on the Kaggle Titanic dataset. (https://www.kaggle.com/rahulsah06/titanic?select=train.csv)
This is my code:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
titanic_file_path = '../input/titanic/train.csv'
titanic_data = pd.read_csv(titanic_file_path)
#I create X and y
features = ['Pclass', 'Sex', 'Age', 'SibSp',
            'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
X = titanic_data[features]
y = titanic_data.Survived
#Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
#model definition and fit
titanic_model = DecisionTreeClassifier(random_state=1)
titanic_model.fit(train_X, train_y)
But when I run the code I get an error:
could not convert string to float: 'female'
How to resolve this?

A quick fix is to convert the categorical columns to indicator (dummy) variables using pandas' get_dummies method:
X = pd.get_dummies(X)
You should probably do more preprocessing than this (for example, imputing missing Age values and dropping high-cardinality columns such as Ticket and Cabin), but for a toy run get_dummies will suffice.
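An end-to-end sketch of what that quick fix could look like (a sketch, not the only way: Ticket and Cabin are dropped here because get_dummies would create a column per unique value, and Age is imputed because older scikit-learn trees cannot handle NaN):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

titanic_data = pd.read_csv('../input/titanic/train.csv')

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X = titanic_data[features].copy()
X['Age'] = X['Age'].fillna(X['Age'].median())  # impute missing ages
X = pd.get_dummies(X)  # one-hot encodes Sex and Embarked
y = titanic_data.Survived

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
titanic_model = DecisionTreeClassifier(random_state=1)
titanic_model.fit(train_X, train_y)
print(titanic_model.score(val_X, val_y))  # validation accuracy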

Related

How can I optimize my code so my Google Colab doesn't crash

I ran into an issue where Google Colab's RAM is running out. I use the free version, and I'm not sure whether it simply can't handle the data or whether my code is badly optimized. As I'm new to the field, I suspect my code is slow and poorly optimized. I wanted to ask for a bit of help, as I'm still learning.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('path/beforeNeural.csv')
df.shape
df.head()
df.isnull().sum()
encoder = LabelEncoder()
df['Property Type'] = encoder.fit_transform(df['Property Type'])
df['Old/New'] = encoder.fit_transform(df['Old/New'])
df['Record Status - monthly file only'] = encoder.fit_transform(df['Record Status - monthly file only'])
df['PPDCategory Type'] = encoder.fit_transform(df['PPDCategory Type'])
df['County'] = encoder.fit_transform(df['County'])
df['District'] = encoder.fit_transform(df['District'])
df['Town/City'] = encoder.fit_transform(df['Town/City'])
df['Duration'] = encoder.fit_transform(df['Duration'])
df['Transaction unique identifier'] = encoder.fit_transform(df['Transaction unique identifier'])
df['Date of Transfer'] = encoder.fit_transform(df['Date of Transfer'])
X = df.drop(columns='Price', axis=1)
Y = df['Price']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
df.shape
boostenc = XGBRegressor()
boostenc.fit(X_train, Y_train)
I'll give it a try. Here is one possible way to optimize your code:
Code:
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('path/beforeNeural.csv')
categorical_columns = ['Property Type', 'Old/New', 'Record Status - monthly file only', 'PPDCategory Type', 'County', 'District', 'Town/City', 'Duration', 'Transaction unique identifier', 'Date of Transfer']
encoder = OneHotEncoder()
X_concat = encoder.fit_transform(df[categorical_columns])
# Approach 1: dense output (simple, but memory-hungry on large data)
X_concat = pd.DataFrame(X_concat.toarray(),
                        columns=encoder.get_feature_names_out(categorical_columns))
# Approach 2: sparse output; use this *instead of* Approach 1 to save memory
# (pd.SparseDataFrame was removed in pandas 1.0; get_feature_names_out needs sklearn >= 1.0)
# X_concat = pd.DataFrame.sparse.from_spmatrix(
#     X_concat, columns=encoder.get_feature_names_out(categorical_columns))
X_numerical = df.drop(columns = categorical_columns + ['Price'])
X = pd.concat([X_numerical, X_concat], axis = 1)
Y = df['Price']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 2)
boostenc = XGBRegressor()
boostenc.fit(X_train, Y_train)
Note, I removed the unused imports and deleted calls such as df.head() from the middle of the code; a bare df.head() there does nothing and prints nothing when used like that.
Code Explanation:
Instead of using LabelEncoder, I used OneHotEncoder in order to one-hot-encode all of the categorical features.
This creates a new binary column for each unique value in each categorical feature.
In general, one-hot encoding handles categorical features better than simply assigning arbitrary integers with LabelEncoder, because it does not impose a spurious ordering on the categories.
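As a toy illustration of what the encoder produces (hypothetical values; sparse_output requires scikit-learn >= 1.2, older versions use sparse=False):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({'Duration': ['F', 'L', 'F', 'U']})
enc = OneHotEncoder(sparse_output=False)  # sklearn >= 1.2
print(pd.DataFrame(enc.fit_transform(toy),
                   columns=enc.get_feature_names_out(['Duration'])))
#    Duration_F  Duration_L  Duration_U
# 0         1.0         0.0         0.0
# 1         0.0         1.0         0.0
# 2         1.0         0.0         0.0
# 3         0.0         0.0         1.0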
I extracted the names of all the categorical columns into a list, so it's easier to modify them when needed.
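Since the question is specifically about Colab running out of RAM, one more variant worth noting: XGBoost accepts scipy sparse matrices directly, so the one-hot output never has to be densified (the .toarray() call in Approach 1 is usually what exhausts memory). A hedged sketch reusing the names defined above, assuming the remaining columns are numeric:
import scipy.sparse as sp

# keep the encoder output sparse instead of calling .toarray()
X_cat_sparse = encoder.fit_transform(df[categorical_columns])  # sparse matrix
X_num_sparse = sp.csr_matrix(X_numerical.to_numpy(dtype=float))
X_sparse = sp.hstack([X_num_sparse, X_cat_sparse]).tocsr()

X_train, X_test, Y_train, Y_test = train_test_split(
    X_sparse, df['Price'], test_size=0.2, random_state=2)
XGBRegressor().fit(X_train, Y_train)  # xgboost handles sparse input natively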

How to use Machine Learning in Python to predict a binary outcome with a Pandas Dataframe

I have the following code:
import nfl_data_py as nfl
pbp = nfl.import_pbp_data([2022], downcast=True, cache=False, alt_path=None)
which returns a dataframe of every play that occurred in the 2022 NFL season. The columns I want to train it on are score_differential, yardline_100, ydstogo, down, and half_seconds_remaining, to predict the play_type: either run or pass.
Example: I feed it a -4 score differential, 25 yard line, 4th down, 16 yards to go, and 300 half seconds remaining - it would return whatever it learned from the dataframe, probably pass.
How would I go about doing this? Should I use a scikit-learn decision tree?
Here you go:
import nfl_data_py as nfl
import pandas as pd
#import train_test_split
from sklearn.model_selection import train_test_split
#we need to encode the play_type column
from sklearn.preprocessing import LabelEncoder
#import the model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
pbp = nfl.import_pbp_data([2022], downcast=True, cache=False, alt_path=None)
df = pd.DataFrame(pbp)
#there are definitely other features you can use, but these are the ones you want.
df = df[['score_differential', 'yardline_100', 'ydstogo', 'down', 'half_seconds_remaining', 'play_type']]
df = df.dropna()
# drop the rows whose play_type is 'None' or 'no_play'
df = df[df['play_type'] != 'None']
df = df[df['play_type'] != 'no_play']
#reset the index
df = df.reset_index(drop=True)
#encode the play_type column
le = LabelEncoder()
df['play_type_encode'] = le.fit_transform(df['play_type'])
# train test split
X_train, X_test, y_train, y_test = train_test_split(df.drop(['play_type', 'play_type_encode'], axis=1), df['play_type_encode'], test_size=0.3, random_state=42)
#instantiate the model
rfc = RandomForestClassifier(n_estimators=100)
#fit the model
rfc.fit(X_train, y_train)
#predict the model
rfc_pred = rfc.predict(X_test)
#evaluate the model
print(classification_report(y_test, rfc_pred))
#plot the confusion matrix
plt.figure(figsize=(10,6))
sns.heatmap(confusion_matrix(y_test, rfc_pred), annot=True)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
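To answer the concrete example in the question (a hypothetical single play: -4 score differential, ball on the 25, 16 to go on 4th down, 300 seconds left in the half), you can build a one-row frame and decode the prediction:
query = pd.DataFrame([{'score_differential': -4,
                       'yardline_100': 25,
                       'ydstogo': 16,
                       'down': 4,
                       'half_seconds_remaining': 300}])
# keep the exact column order the model was trained on
pred = rfc.predict(query[X_train.columns])
print(le.inverse_transform(pred))  # e.g. ['pass']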

How to use a new data set on a trained model?

I am trying to use a new data set on a previously trained model to see how accurate the model is. I use the following code and receive the error below. Would another method solve this problem? Thanks.
import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
%matplotlib inline
df = pd.read_excel('xxxx.xlsx')
enc = LabelEncoder()
X = df[df.columns[1:]]
Y = df[df.columns[0]].values.ravel()
Y2 = enc.fit_transform(Y)
df.insert(0, "Unit Status", Y2, True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y2, random_state = 0, test_size = 0.25)
clf = LinearSVC(random_state=0,dual=False, tol=1e-5)
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
confusion_matrix(Y_test, Y_pred)
classifier_predictions = clf.predict(X_test)
print(accuracy_score(Y_test, classifier_predictions)*100)
df2 = pd.read_excel('xxxx_v2.xlsx')
y_pred=clf.predict(df2)
ValueError: could not convert string to float: '20-002'
The data in the new dataframe must all be floats, or at least convertible to float. The first and second columns of the new file contain string data that cannot be converted to numbers, so the model cannot train or predict on them. From looking at the data, you could use LabelEncoder on the second column (and then decide whether to follow it with OneHotEncoder), but the first column does not look like categorical data. If the model needs the first column's data, you have to convert it to numbers somehow; otherwise, just drop the column.
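A hedged sketch of what that could look like; feature_enc is hypothetical and stands for a LabelEncoder fitted beforehand on the matching training column (never refit an encoder on new data, or the same category may get a different integer):
df2 = pd.read_excel('xxxx_v2.xlsx')

# drop the ID-like first column (values such as '20-002' are not features)
df2 = df2.drop(df2.columns[0], axis=1)

# encode the categorical second column with an encoder fitted on training data
df2[df2.columns[0]] = feature_enc.transform(df2[df2.columns[0]])

# the columns must now match the training features in number and order
y_pred = clf.predict(df2)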

How can I forecast a y-variable based on multiple x-variables?

I'm testing code like this.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt
from tabulate import tabulate
#Seaborn for easier visualization
import seaborn as sns
# Load Iris Flower Dataset
# Load data
df = pd.read_csv('C:\\path_to_file\\train.csv')
df.shape
list(df)
# the model can only handle numeric values so filter out the rest
# data = df.select_dtypes(include=[np.number]).interpolate().dropna()
df1 = df.select_dtypes(include=[np.number])
df1.shape
list(df1)
df1.dtypes
df1 = df1.fillna(0)
#Prerequisites
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
#Split train/test sets
y = df1.SalePrice
# drop the target (and the helper index column) from the features
X = df1.drop(['index', 'SalePrice'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)
# Train model
clf = RandomForestRegressor(n_jobs=2, n_estimators=1000)
model = clf.fit(X_train, y_train)
# Feature Importance
headers = ['name', 'score']
values = sorted(zip(X_train.columns, model.feature_importances_), key=lambda x: x[1] * -1)
print(tabulate(values, headers, tablefmt='plain'))
(pd.Series(model.feature_importances_, index=X.columns)
.nlargest(10)
.plot(kind='barh'))
This works fine on some sample data that I found online. Now, rather than predicting a sale price as my y variable, I'm trying to figure out how to get the model to make some kind of prediction like target = True or target = False, or maybe my approach is wrong.
It's a bit confusing for me because of this line: df1 = df.select_dtypes(include=[np.number]). Only numbers are included, which makes sense for a RandomForestRegressor. I'm just looking for some guidance on how to deal with a non-numeric prediction here.
You are dealing with a classification problem here with 2 classes (True, False). To get started, take a look at a simple logistic regression model:
https://en.wikipedia.org/wiki/Logistic_regression
Since you are using sklearn, try:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
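A minimal sketch of that switch, reusing df and df1 from the question's code; the boolean column name 'target' is an assumption standing in for SalePrice:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 'target' is a hypothetical True/False column in your data
y = df['target'].astype(int)   # True/False -> 1/0
X = df1                        # the numeric features built in the question

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=.33)

clf = LogisticRegression(max_iter=1000)  # raise max_iter to help convergence
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))         # accuracy on the held-out split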

How to use pandas to create a crosstab to show the prediction result of random forest predictor?

I'm a newbie to random forests (as well as Python).
I'm using a random forest classifier; the dataset is named 't2002'.
t2002.columns
So here are the columns:
Index(['IndividualID', 'ES2000_B01ID', 'NSSec_B03ID', 'Vehicle', 'Age_B01ID',
       'IndIncome2002_B02ID', 'MarStat_B01ID', 'EcoStat_B03ID',
       'MainMode_B03ID', 'TripStart_B02ID', 'TripEnd_B02ID',
       'TripDisIncSW_B01ID', 'TripTotalTime_B01ID', 'TripTravTime_B01ID',
       'TripPurpFrom_B01ID', 'TripPurpTo_B01ID'],
      dtype='object')
I'm using the code below to run the classifier:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
X_all = t2002.drop(['MainMode_B03ID'],axis=1)
y_all = t2002['MainMode_B03ID']
p = 0.2
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=p,
                                                    random_state=23)
clf = RandomForestClassifier()
acc_scorer = make_scorer(accuracy_score)
parameters = {}  # the parameter grid is left empty
grid_obj = GridSearchCV(clf,parameters,scoring=acc_scorer)
grid_obj = grid_obj.fit(X_train,y_train)
clf = grid_obj.best_estimator_
clf.fit(X_train,y_train)
predictions = clf.predict(X_test)
print(accuracy_score(y_test,predictions))
In this case, how could I use pandas to generate a crosstab (like a table) to show the detailed prediction results?
Thanks in advance!
You can first create a confusion matrix using sklearn and then convert it to a pandas data frame:
from sklearn.metrics import confusion_matrix
import numpy as np

# create the confusion matrix as an array; compare the TEST labels
# (the rows the predictions were made for), not the whole column
labels = np.unique(y_test)
confusion = confusion_matrix(y_test, predictions, labels=labels)

# convert to a data frame with the class labels on both axes
new_df = pd.DataFrame(confusion, index=labels, columns=labels)
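Since the question asks for a crosstab specifically, pd.crosstab can build the same table directly from the test labels and the predictions; a short sketch using the variables from the question:
import pandas as pd

# rows = actual classes, columns = predicted classes, cells = counts
ct = pd.crosstab(y_test, predictions,
                 rownames=['Actual'], colnames=['Predicted'])
print(ct)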
It's also easy to inspect the grid-search results with pandas: use cv_results_ as described in the docs (note this shows the cross-validated scores per parameter setting, not per-row predictions).
import pandas as pd
results = pd.DataFrame(grid_obj.cv_results_)  # grid_obj is the GridSearchCV object
print(results.head())
