One-Hot Encoding Question - Concept and Solution to My Problem (Kaggle Dataset) - python
I'm working on an exercise in Kaggle; it's in their module on categorical variables, specifically the one-hot encoding section: https://www.kaggle.com/alexisbcook/categorical-variables
I've worked through the entire workbook fine, and there's one last piece I'm trying to work out: the optional exercise at the end that applies the one-hot encoder to predict house sale values. I've worked out the following code, but on the line in bold, OH_cols_test = pd.DataFrame(OH_encoder.fit_transform(X_test[low_cardinality_cols])), I'm getting the error that the input contains NaN.
So my first question is: when it comes to one-hot encoding, shouldn't NAs just be treated like any other category within a particular column? And my second question is: if I want to remove these NAs, what's the most efficient way? I tried imputation, but it looks like that only works for numbers. Can someone please let me know where I'm going wrong here? Thanks very much!
from sklearn.preprocessing import OneHotEncoder
# Use as many lines of code as you need!
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
**OH_cols_test = pd.DataFrame(OH_encoder.fit_transform(X_test[low_cardinality_cols]))**
# One-hot encoding removed index; put it back
OH_cols_test.index = X_test.index
# Remove categorical columns (will replace with one-hot encoding)
num_X_test = X_test.drop(object_cols, axis=1)
# Add one-hot encoded columns to numerical features
OH_X_test = pd.concat([num_X_test, OH_cols_test], axis=1)
So my first question is, when it comes to one-hot encoding, shouldn't NAs just be treated like any other category within a particular column?
NAs are just the absence of data, so you can loosely think of rows with NAs as being incomplete. You may find yourself dealing with a dataset where NAs occur in half of the rows and require some clever feature engineering to compensate. Think about it this way: if one-hot encoding is a simple way to represent binary state (e.g. is_male, salary_is_less_than_100000, etc.), then what does NaN/null mean? You have a bit of a Schrödinger's cat on your hands there. You're generally safe to drop NAs so long as it doesn't mangle your dataset size; how much data loss you're willing to accept is entirely situation-dependent (it's probably fine for a practice exercise).
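If you do want the encoder to treat "missing" as just another category, one simple option is to fill the NAs with a placeholder string before encoding; otherwise you can drop the offending rows. A minimal sketch of both options, reusing the names from your snippet and assuming the columns are plain object/string columns (note also that for a proper train/test split you would normally fit the encoder on the training columns and only call transform on the test columns):
# Option 1: treat missing values as their own category by filling with a placeholder
X_test_filled = X_test[low_cardinality_cols].fillna('Missing')
OH_cols_test = pd.DataFrame(OH_encoder.fit_transform(X_test_filled))
# Option 2: drop any rows that contain NAs in the encoded columns
X_test_dropped = X_test.dropna(subset=low_cardinality_cols)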
And my second question is, if I want to remove these NAs, what's the most efficient way? I tried imputation, but it looks like that only works for numbers?
May I suggest the following. I cover this topic on my blog; you can check the link at the bottom of this answer, and all of my code/logic appears directly below.
# There are various ways to deal with missing data points.
# You can simply drop records if they contain any nulls.
# data.dropna()
# You can fill nulls with zeros
# data.fillna(0)
# You can also fill with mean, median, or do a forward-fill or back-fill.
# The problem with all of these options is that if you have a lot of missing values for one specific feature,
# you won't be able to do very reliable predictive analytics.
# A viable alternative is to impute missing values using some machine learning techniques
# (regression or classification).
import pandas as pd
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
# Load data
data = pd.read_csv('C:\\Users\\ryans\\seaborn-data\\titanic.csv')
print(data)
print(list(data))    # column names
print(data.dtypes)   # column dtypes
# Now, we will use a simple regression technique to predict the missing values
data_with_null = data[['survived','pclass','sibsp','parch','fare','age']]
data_without_null = data_with_null.dropna()
train_data_x = data_without_null.iloc[:,:5]   # predictors: survived, pclass, sibsp, parch, fare
train_data_y = data_without_null.iloc[:,5]    # target: age
linreg.fit(train_data_x,train_data_y)
test_data = data_with_null.iloc[:,:5]
# Check for nulls, per column and in total
print(data_with_null.apply(lambda x: sum(x.isnull()), axis=0))
print(data_with_null.isnull().sum().sum())
# WOW, 177 NULLS (all in the 'age' column)!!
# LET'S IMPUTE MISSING VALUES...
# Predict 'age' for every row from the other five features
age = linreg.predict(test_data)
print(age)
# Finally, join the predicted values back into the 'data_with_null' dataframe
# (this overwrites every age with the model's prediction; to keep the observed
#  ages, assign only to the rows where 'age' is null)
data_with_null = data_with_null.assign(age=age)
# Check for nulls again -- there should be none left
print(data_with_null.apply(lambda x: sum(x.isnull()), axis=0))
https://github.com/ASH-WICUS/Notebooks/blob/master/Fillna%20with%20Predicted%20Values.ipynb
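If you would rather not hand-roll the regression, scikit-learn packages this same idea as IterativeImputer, which models each feature with missing values as a function of the other features. A minimal sketch, assuming the same data_with_null frame built above:
# IterativeImputer sits behind an experimental flag and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(random_state=0)
data_imputed = pd.DataFrame(imputer.fit_transform(data_with_null),
                            columns=data_with_null.columns)
print(data_imputed.isnull().sum().sum())  # 0 -- only the missing cells get filled in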
One final thought, just in case you don't already know about this. There are two kinds of categorical data:
Ordinal (labeled) data: the categories have an inherent order (small, medium, large)
When your data is ordered in some way like this, USE ORDINAL/LABEL ENCODING!
Nominal data: the categories do not have an inherent order (states in the US)
When your data is nominal, and there is no specific order, USE ONE-HOT ENCODING!
https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/
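As a toy illustration of the difference (the column names here are invented purely for the example, and sparse=False matches the older scikit-learn used in the question; newer versions call it sparse_output=False):
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

toy = pd.DataFrame({'size': ['small', 'medium', 'large', 'small'],
                    'state': ['TX', 'CA', 'NY', 'CA']})

# Ordered categories: encode 'size' as 0 < 1 < 2
ord_enc = OrdinalEncoder(categories=[['small', 'medium', 'large']])
print(ord_enc.fit_transform(toy[['size']]))

# Unordered categories: encode 'state' as one column per state
oh_enc = OneHotEncoder(handle_unknown='ignore', sparse=False)
print(oh_enc.fit_transform(toy[['state']]))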
Related
Replace +200 values from my dataframe python
I want to change the value of my data in my dataframe. Obviously, I can use the replace function.
df['COLUMN'].replace(['SOC','MR','MME',...,'N230'], [0,1,2,...,230], inplace=True)
However, since there are more than 200 different values, I'm looking for a way to avoid listing all 200+ of them like this.
If you want to replace them with unique integer codes, you can use the sklearn LabelEncoder:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df['Column'])
df['Column'] = le.transform(df['Column'])
# if you want to revert the changes
df['Column'] = le.inverse_transform(df['Column'])
Check the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
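As a side note (not from the original answer), pandas itself can produce the same kind of integer codes in one step with pd.factorize, which also hands back the array of unique values needed to reverse the mapping:
import pandas as pd

codes, uniques = pd.factorize(df['Column'])
df['Column'] = codes
# to revert (assuming the column had no NaN values, which factorize encodes as -1):
# df['Column'] = uniques[df['Column']]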
When I one hot encode a column with corresponding values. It gets null values in between them
I hope this finds you all well. I have been working with a steel data-set. I was trying to one_hot_encode a categorical data column with their corresponding values using the mapping method. However when I do this the column gets null values in between. I am unable to understand why. The column before one_hot_encoding did not have any null values. However, after mapping with the corresponding values it gets null values in between. Here is the code:
df["material_spec"].unique()
array(['Material_0', 'Material_1', 'Material_2', 'Material_3', 'Material_4',
       'Material_5', 'Material_6', 'Material_7', 'Material_8', 'Material_9',
       'Material_10', 'Material_11', 'Material_12', 'Material_13', 'Material_14',
       'Material_15', 'Material_16', 'Material_17', 'Material_18', 'Material_19',
       'Material_20', 'Material_21', 'Material_22', 'Material_23', 'Material_24',
       'Material_25', 'Material_26', 'Material_27', 'Material_28', 'Material_29',
       'Material_30', 'Material_31', 'Material_32', 'Material_33', 'Material_34',
       'Material_35', 'Material_36', 'Material_37', 'Material_38', 'Material_39',
       'Material_40', 'Material_41', 'Material_42', 'Material_43', 'Material_44',
       'Material_45', 'Material_46', 'Material_47', 'Material_48'], dtype=object)
This is how I am one_hot_encoding the data:
df["material_spec"] = df["material_spec"].map({"Material_0": 0, "Material_1": 1, "Material_2": 2, "Material_3": 3,
    "Material_4": 4, "Material_5": 5, "Material_6": 6, "Material_7": 7, "Material_8": 8, "Material_9": 9,
    "Material_10": 10, "Material_11": 11, "Material_12": 12, "Material_13": 13, "Material_14": 14,
    "Material_15": 15, "Material_16": 16, "Material_17": 17, "Material_18": 18, "Material_19": 19,
    "Material:20": 20, "Material_21": 21, "Material_22": 22, "Material_23": 23, "Material_24": 24,
    "Material_25": 25, "Material_26": 26, "Material_27": 27, "Material_28": 28, "Material_29": 29,
    "Material_30": 30, "Material_31": 31, "Material_32": 32, "Material_33": 33, "Material_34": 34,
    "Material_35": 35, "Material_36": 36, "Material_37": 37, "Material_38": 38, "Material_39": 39,
    "Material_40": 40, "Material_41": 41, "Material_42": 42, "Material_43": 43, "Material_44": 44,
    "Material_45": 45, "Material_46": 46, "Material_47": 47, "Material_48": 48})
And this results after this mapping:
df["material_spec"].isnull().sum()
122
Can anyone tell me what I am doing wrong here? Is my way of one hot encoding wrong or is it due to some other error? Any suggestions would be helpful. Thanks
#ansev has answered your immediate question in the comments. Here's another way to do what you want to do that may be easier for you:
df["material_spec"].str.extract(r'Material_(\d+)').astype(int)
But what you are doing is not really one-hot encoding, is it? I think of one-hot encoding to be more like this:
df["material_spec"].str.get_dummies()
categorical feature setting error in PMML GBDTLRClassifier
I am trying to set up my GBDTLRClassifier following the instructions here. First, I label-encoded my columns. Then I defined my categorical and continuous features, putting the column names in two lists:
cat    # categorical column names
conts  # continuous column names
gbm = lgb.LGBMClassifier(n_estimators=90)
classifier = GBDTLRClassifier(gbm, LogisticRegression(penalty='l2'))
dm = DataFrameMapper([([cat_col], CategoricalDomain()) for cat_col in cat] + [(conts, ContinuousDomain())])
pipeline = PMMLPipeline([('mapper', dm), ('classifier', classifier)])
pipeline.fit(df[cat + conts], df['y'],
             classifier__gbdt__eval_set=[(val[cat + conts], val['y'])],
             classifier__gbdt__early_stopping_rounds=5,
             classifier__gbdt__categorical_feature=cat)
pp = make_pmml_pipeline(pipeline, target_fields=['y'])
sklearn2pmml(pp, '/tmp/lgb+lr.pmml')
I get this error when fitting: TypeError: Wrong type(str) or unknown name(root) in categorical_feature, while root is definitely in cat. It looks like lgbm is not aware of which columns are categorical, which is confusing. Moreover, when I remove the mapper part there is no fitting error, but the conversion fails when making the pmml file with the message: transformer object of the first step does not specify the number of input features. Could anyone tell me how to make this procedure work? Thanks
Based on the comment here, I needed to also set feature_name when sending string column names into categorical_feature. A little tricky.
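A minimal sketch of that fix using the variable names from the question (hedged, since the full pipeline isn't shown): pass feature_name alongside categorical_feature so LightGBM can match the string column names, which presumably get lost once the mapper has turned the frame into a plain array.
pipeline.fit(df[cat + conts], df['y'],
             classifier__gbdt__eval_set=[(val[cat + conts], val['y'])],
             classifier__gbdt__early_stopping_rounds=5,
             classifier__gbdt__feature_name=cat + conts,
             classifier__gbdt__categorical_feature=cat)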
Keeping track of the output columns in sklearn preprocessing
How do I keep track of the columns of the transformed array produced by sklearn.compose.ColumnTransformer? By "keeping track of" I mean every bit of information required to perform an inverse transform must be shown explicitly. This includes at least the following:
What is the source variable of each column in the output array?
If a column of the output array comes from one-hot encoding of a categorical variable, what is that category?
What is the exact imputed value for each variable?
What is the (mean, stdev) used to standardize each numerical variable? (These may differ from direct calculation because of imputed missing values.)
I am using the same approach based on this answer. My input dataset is also a generic pandas.DataFrame with multiple numerical and categorical columns. Yes, that answer can transform the raw dataset, but I lost track of the columns in the output array. I need this information for peer review, report writing, presentations and further model-building steps. I've been searching for a systematic approach but with no luck.
The answer you mentioned is based on this example in the sklearn docs. You can get the answer to your first two questions using the following snippet:
def get_feature_names(columnTransformer):
    output_features = []
    for name, pipe, features in columnTransformer.transformers_:
        if name != 'remainder':
            for i in pipe:
                trans_features = []
                if hasattr(i, 'categories_'):
                    trans_features.extend(i.get_feature_names(features))
                else:
                    trans_features = features
            output_features.extend(trans_features)
    return output_features

import pandas as pd
pd.DataFrame(preprocessor.fit_transform(X_train), columns=get_feature_names(preprocessor))

transformed_cols = get_feature_names(preprocessor)

def get_original_column(col_index):
    return transformed_cols[col_index].split('_')[0]

get_original_column(3)  # 'embarked'
get_original_column(0)  # 'age'

def get_category(col_index):
    new_col = transformed_cols[col_index].split('_')
    return 'no category' if len(new_col) < 2 else new_col[-1]

print(get_category(3))  # 'Q'
print(get_category(0))  # 'no category'
Tracking whether there has been some imputation or scaling done on a feature is not trivial with the current version of sklearn.
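For context, here is a minimal, hypothetical preprocessor that the snippet above could be run against (the column names are invented for illustration; on newer scikit-learn versions you would use get_feature_names_out and sparse_output=False instead of get_feature_names and sparse=False):
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ['age', 'fare']        # hypothetical numerical columns
cat_cols = ['embarked', 'sex']    # hypothetical categorical columns

preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), num_cols),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('encode', OneHotEncoder(handle_unknown='ignore', sparse=False))]), cat_cols),
])
# get_feature_names(preprocessor) would then return something like
# ['age', 'fare', 'embarked_C', 'embarked_Q', 'embarked_S', 'sex_female', 'sex_male']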
Joining List to a Pandas Frame - Have I kept the order?
So I have 2 scripts for an artificial neural network on insurance claims: one script to train/test and one to execute going forward. I am done with the first one and am developing the second one, using real production data as a test of it. The target/class label is a binary 1 or 0. Input data is initially in a dataframe of shape (5914, 23) and it is all numeric data. I then do a df.values.tolist() on it, apply StandardScaler() to all values (other than the first one, which is a Claim ID), and in the process it goes through np.asarray. I then run it through ANN_Model.predict_proba, which gives me a list of 5,914 pairs of probabilities. Now I want to merge the probabilities (called "predicted_probs") back into the dataframe I had before I did the tolist(), into a new column on that original dataframe (called "Results"), and to do so for one class only (I am only interested in the positive class). I do so via the following code, but I don't know if the order of my results is the same as the order of the dataframe. Is it?
for i in range(0, len(predicted_probs)):
    original_df["Results"] = pd.Series(predicted_probs[i])
    print(predicted_probs[[i],[1]])
Should I be doing it another way? I have to replicate what is done in the training script in order to expect like-for-like results, hence the StandardScaler(), np.asarray etc. Thanks in advance
Your dataframe's shape is (5914, 23) and the output from ann_model.predict_proba has 5914 entries. Since each row of your df produces a single pair of probabilities, you can expect the order of your results to be the same as the order of your dataframe. To add the probability of the positive class to the dataframe:
original_df['Results'] = [i[1] for i in predicted_probs]
There is no need to loop through predicted_probs.
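As a side note (an assumption about the model's return type, since the model object isn't shown): if predict_proba returns a NumPy array rather than a list, you can take the positive-class column directly with a slice instead of a list comprehension.
predicted_probs = ann_model.predict_proba(scaled_inputs)   # hypothetical call, shape (5914, 2)
original_df['Results'] = predicted_probs[:, 1]             # probability of the positive class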