I was trying to understand the differences between OneHotEncoder and get_dummies from this link: enter link description here
When I wrote the exact same code, I got an error saying:
AttributeError: 'OneHotEncoder' object has no attribute 'get_feature_names_out'
Here is the code:
import pandas as pd
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df = sns.load_dataset('tips')
df = df[['total_bill', 'tip', 'day', 'size']]
df.head(5)
X = df.drop('tip', axis=1)
y = df['tip']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False, dtype='int')
ohe.fit(X_train[['day']])
def get_ohe(df):
    temp_df = pd.DataFrame(data=ohe.transform(df[['day']]), columns=ohe.get_feature_names_out())
    df.drop(columns=['day'], axis=1, inplace=True)
    df = pd.concat([df.reset_index(drop=True), temp_df], axis=1)
    return df
X_train = get_ohe(X_train)
X_test = get_ohe(X_test)
X_train.head()
I checked the OneHotEncoder class in the sklearn.preprocessing module, and the get_feature_names_out() method is there and is not deprecated. I don't know why I am getting this error.
If you're using a scikit-learn version lower than 1.0, you need to use the get_feature_names method instead; get_feature_names_out was only added in 1.0. On newer versions of scikit-learn, get_feature_names_out works fine.
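You can check your installed version and, if you need code that runs on both old and new releases, fall back at runtime. A minimal sketch, where ohe is the fitted encoder from the question:
import sklearn
print(sklearn.__version__)  # get_feature_names_out requires >= 1.0

# Fallback that works across versions:
try:
    feature_names = ohe.get_feature_names_out()
except AttributeError:
    # scikit-learn < 1.0 only has the older method
    feature_names = ohe.get_feature_names()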
I ran into an issue where Google Colab's RAM runs out. I use the free version, and I'm not sure whether it simply can't handle the data or whether my code is badly optimized. As I'm new to the field, I suspect my code is slow and poorly optimized. I wanted to ask for a bit of help, as I'm still learning.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('path/beforeNeural.csv')
df.shape
df.head()
df.isnull().sum()
encoder = LabelEncoder()
df['Property Type'] = encoder.fit_transform(df['Property Type'])
df['Old/New'] = encoder.fit_transform(df['Old/New'])
df['Record Status - monthly file only'] = encoder.fit_transform(df['Record Status - monthly file only'])
df['PPDCategory Type'] = encoder.fit_transform(df['PPDCategory Type'])
df['County'] = encoder.fit_transform(df['County'])
df['District'] = encoder.fit_transform(df['District'])
df['Town/City'] = encoder.fit_transform(df['Town/City'])
df['Duration'] = encoder.fit_transform(df['Duration'])
df['Transaction unique identifier'] = encoder.fit_transform(df['Transaction unique identifier'])
df['Date of Transfer'] = encoder.fit_transform(df['Date of Transfer'])
X = df.drop(columns='Price', axis=1)
Y = df['Price']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
df.shape
boostenc = XGBRegressor()
boostenc.fit(X_train, Y_train)
I'll give it a try; here is one possible way to optimize your code.
Code:
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('path/beforeNeural.csv')
categorical_columns = ['Property Type', 'Old/New', 'Record Status - monthly file only', 'PPDCategory Type', 'County', 'District', 'Town/City', 'Duration', 'Transaction unique identifier', 'Date of Transfer']
encoder = OneHotEncoder()
X_concat = encoder.fit_transform(df[categorical_columns])
# Approach 1: dense DataFrame (simple, but uses more RAM)
X_concat = pd.DataFrame(X_concat.toarray(), columns=encoder.get_feature_names_out(categorical_columns))
# Approach 2: sparse DataFrame (memory-friendlier alternative; use instead of Approach 1)
# X_concat = pd.DataFrame.sparse.from_spmatrix(X_concat, columns=encoder.get_feature_names_out(categorical_columns))
# (on scikit-learn < 1.0, use encoder.get_feature_names(...) instead of get_feature_names_out)
X_numerical = df.drop(columns = categorical_columns + ['Price'])
X = pd.concat([X_numerical, X_concat], axis = 1)
Y = df['Price']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 2)
boostenc = XGBRegressor()
boostenc.fit(X_train, Y_train)
Note: I removed the unused imports and deleted calls such as df.head() in the middle of the code; a bare expression like that does nothing there and doesn't print anything outside of an interactive cell.
Code Explanation:
Instead of using LabelEncoder, I used OneHotEncoder in order to one-hot-encode all of the categorical features.
This creates a new binary column for each unique value in the categorical features.
In general, one-hot encoding is usually a better way to handle categorical features in machine learning than assigning arbitrary integers with LabelEncoder, because integer codes imply an ordering between categories that usually doesn't exist.
I extracted the names of all the categorical columns into a list; that way it's easier to modify them when needed.
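As a small, self-contained illustration of that difference (toy data, not from the question):
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = pd.DataFrame({'color': ['red', 'green', 'blue']})

# LabelEncoder assigns arbitrary integers (blue=0, green=1, red=2),
# which a model may interpret as an ordering
print(LabelEncoder().fit_transform(colors['color']))  # [2 1 0]

# OneHotEncoder creates one binary column per category instead
print(OneHotEncoder().fit_transform(colors).toarray())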
I am working on an assignment and I ran into this error. I am using Python to perform KNN on a data set. I'm pretty sure I defined the variable, but it says otherwise. The code is written below.
import pandas as PD
import numpy as np
import matplotlib.pyplot as mtp
data_set= PD.read_csv('hw6.data.csv.gz')
x= data_set.iloc[:,[2,3]].valuesS
y= data_set.iloc[:, 4].values
from sklearn.model_selection import train_test_split
x_train, x_train, y_train, y_train= train_test_split(x,y, test_size=.25, random_state=0)
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
The error says "x_test" is not defined: Pylance (reportUndefinedVariable).
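Pylance is right here: x_test is never created. The train_test_split line unpacks into x_train and y_train twice instead of four distinct names, and .valuesS is a typo for .values. A corrected sketch of the affected lines:
x = data_set.iloc[:, [2, 3]].values  # .valuesS was a typo
y = data_set.iloc[:, 4].values

# unpack into four distinct names so that x_test and y_test actually exist
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.25, random_state=0)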
I have the following toy code.
I use a pipeline to automatically normalize numerical variables and apply one-hot-encoding to the categorical ones.
I can easily get the coefficients of the logistic regression model using pipe['logisticregression'].coef_, but how can I get all the feature names, in the same order as they appear in the coef_ matrix?
from sklearn.compose import ColumnTransformer
import numpy as np, pandas as pd
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
# data from https://www.kaggle.com/datasets/uciml/adult-census-income
data = pd.read_csv("adult.csv")
data = data.iloc[0:3000,:]
target = "workclass"
y = data[target]
X = data.drop(columns=target)
numerical_columns_selector = make_column_selector(dtype_exclude=object)
categorical_columns_selector = make_column_selector(dtype_include=object)
numerical_columns = numerical_columns_selector(X)
categorical_columns = categorical_columns_selector(X)
ct = ColumnTransformer([('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_columns),
                        ('std', StandardScaler(), numerical_columns)])
model = LogisticRegression(max_iter=500)
pipe = make_pipeline(ct, model)
data_train, data_test, target_train, target_test = train_test_split(
X, y, random_state=42)
pipe.fit(data_train, target_train)
pipe['logisticregression'].coef_.shape
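Assuming scikit-learn >= 1.0, the fitted ColumnTransformer can report the transformed column names in exactly the order used by the coefficient matrix. A sketch:
# names from the ColumnTransformer step created by make_pipeline
feature_names = pipe.named_steps['columntransformer'].get_feature_names_out()

# in recent versions the same works on the pipeline minus its final estimator:
# feature_names = pipe[:-1].get_feature_names_out()
len(feature_names) will match pipe['logisticregression'].coef_.shape[1], and the order matches the columns of the coefficient matrix.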
I'm trying to run a DecisionTreeClassifier on the Kaggle Titanic dataset. (https://www.kaggle.com/rahulsah06/titanic?select=train.csv)
This is my code:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
titanic_file_path = '../input/titanic/train.csv'
titanic_data = pd.read_csv(titanic_file_path)
#I create X and y
features= ['Pclass', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
X= titanic_data[features]
y = titanic_data.Survived
#Split into validation and training data
train_X, val_x, train_y, val_y = train_test_split(X,y, random_state=1)
#model definition and fit
titanic_model = DecisionTreeClassifier(random_state=1)
titanic_model.fit(train_X, train_y)
But when I run the code I get an error:
could not convert string to float: 'female'
How to resolve this?
A quick fix is to convert your string columns to numeric indicator (dummy) columns using the get_dummies method.
X = pd.get_dummies(X)
You should probably take more preprocessing steps than you currently are, but for a toy run, get_dummies will suffice.
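Applied to the code above, that looks like the following sketch (note: depending on your scikit-learn version, you may also need to impute missing values such as Age before fitting, handled here with a simple median fill):
X = pd.get_dummies(titanic_data[features])  # expands 'Sex', 'Ticket', 'Cabin', 'Embarked' into 0/1 columns
X = X.fillna(X.median())  # simple imputation for NaNs such as missing Age values
train_X, val_x, train_y, val_y = train_test_split(X, y, random_state=1)
titanic_model = DecisionTreeClassifier(random_state=1)
titanic_model.fit(train_X, train_y)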
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
df = pd.read_csv('CarSeats_Dataset.csv')
df=df.dropna()
dummies=pd.get_dummies(df[['ShelveLoc', 'Urban', 'US']])
X = df.drop('Sales',axis=1)
y = np.log(df['Sales'])
X_train, X_test , y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
regressor = DecisionTreeRegressor(random_state = 42)
regressor.fit(X_train, y_train)
I was trying to predict Sales, but when I tried to fit the regressor I got the error: ValueError: could not convert string to float: 'Bad'
I am a beginner in this and I do not know how to fix it. Can anyone help me with that please?
That error means a string column reached the model. As a minimal example of converting numeric strings to floats:
import pandas as pd

# Toy example: numeric values stored as strings must be cast before modeling
Data = {'Product': ['ABC', 'XYZ'],
        'Price': ['250', '270']}
df = pd.DataFrame(Data)
df['Price'] = df['Price'].astype(float)  # cast the string prices to floats
print(df)
print(df.dtypes)
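That casting trick only helps when the strings are actually numbers, though. In the CarSeats code above, the more direct cause of the error is that dummies is built but never merged back, so X still contains the raw string columns. A sketch of the fix, reusing the names from the question:
# replace the string columns with their dummy-encoded versions before splitting
dummies = pd.get_dummies(df[['ShelveLoc', 'Urban', 'US']])
X = pd.concat([df.drop(columns=['Sales', 'ShelveLoc', 'Urban', 'US']), dummies], axis=1)
y = np.log(df['Sales'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)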