Factorize real time data with consistent mappings to training data?

In production, I would like to use a previously saved model to predict on real-time data.
However, I don't know how to give the real-time data the same mapping as the training data when factorizing categorical columns.
From this article I know I can stack the training data and the new data together so that the mapping is consistent.
However, stacking and rerunning the whole process (feature engineering, training and prediction) is too time-consuming.
Whole process: 15 min vs. model prediction only: 3 sec
As the production system is time-sensitive, is there any way to factorize the new data with the same mapping as the training data?
Or can I only do it "manually", such as
df.loc[df['col_name']=='YES', 'col_name'] = '1'
which could lead to very long code?

If what you mean is accounting for novel categorical values as they come in (say, you get a new value of 'blue-green' for df.color), you could bounce any unexpected values to the same -1 bucket (unknown, let's say) and then handle that in post-processing or whenever it is that you re-tune the model.
Essentially, you could catch the category-exceptions and then handle them later on.
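A minimal sketch of that idea, assuming the training mapping was built with pd.factorize: keep the uniques it returns and look new values up with Index.get_indexer, which returns -1 for anything not seen in training. The column values and the names train and new are made up for illustration.

```python
import pandas as pd

# Hypothetical training column and real-time column (made-up values)
train = pd.Series(["red", "blue", "red", "green"])
new = pd.Series(["green", "blue-green", "red"])

# factorize returns the integer codes and the unique values in order of appearance
train_codes, uniques = pd.factorize(train)
uniques = pd.Index(uniques)  # make sure we have an Index, whatever factorize returned

# Reuse the training mapping: position in `uniques`, or -1 if the value is unseen
new_codes = uniques.get_indexer(new)

print(list(train_codes))  # [0, 1, 0, 2]
print(list(new_codes))    # [2, -1, 0]
```

The uniques object is small, so it could be pickled next to the model and loaded at prediction time instead of restacking the training data.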

After a few hours of work, I switched from pd.factorize to LabelEncoder().
As LabelEncoder only accepts one column (a pd.Series) at a time, I loop over the columns and store each fitted LabelEncoder() in a dictionary.
Training part:
import pickle
from sklearn.preprocessing import LabelEncoder

# list of columns to label-encode
col_list = ['colA', 'colB', 'colC']
df[col_list] = df[col_list].fillna('NAN')
# create a dictionary to save the fitted LabelEncoder() of each column
model = {}
# convert categorical data
for x in col_list:
    encoder = LabelEncoder()
    df[x] = encoder.fit_transform(df[x])
    model[x] = encoder
# save dictionary to pickle file
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
Real-time part:
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
for x in col_list:
    encoder = model[x]
    try:
        df[x] = encoder.transform(df[x].astype(str))
    except (ValueError, TypeError):
        # fall back for columns whose training labels were integers
        df[x] = encoder.transform(df[x].astype(int))
As a result, loading the data, feature engineering and prediction now take 1.5 sec in total.

Which algorithm are you using? I came across the same problem, but since I am using LightGBM (LGBM), it turned out there was no need to factorize my data: the algorithm can handle categorical values. I only had to change the data type from 'object' to 'category'.
categorical_feats = [f for f in combined_data.columns if combined_data[f].dtype == 'object']
categorical_feats
for f_ in categorical_feats:
    # Set feature type as categorical
    combined_data[f_] = combined_data[f_].astype('category')
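One thing to watch with the 'category' approach: in pandas, the integer codes behind a category column depend on the values present when the conversion happens, so real-time data converted on its own can end up with different codes. A minimal sketch of one way to pin the mapping, assuming you keep the dtype learned on the training data and reuse it; the data and the name color_dtype are made up for illustration.

```python
import pandas as pd

# Made-up training and real-time frames
train = pd.DataFrame({"color": ["red", "blue", "green", "red"]})
new = pd.DataFrame({"color": ["green", "purple"]})

# Learn the categories on the training data and keep the resulting dtype
train["color"] = train["color"].astype("category")
color_dtype = train["color"].dtype  # categories: blue, green, red

# Apply the same dtype to new data; unseen values become NaN with code -1
new["color"] = new["color"].astype(color_dtype)

print(train["color"].cat.codes.tolist())  # [2, 0, 1, 2]
print(new["color"].cat.codes.tolist())    # [1, -1]
```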

Related

Getting diff. num of features in train and test data after passing through the same pipeline

I am working on the introductory Titanic problem in Kaggle.
Here I wanted to design a pipeline to pre-process data.
I have written the following code for it:
dropColumns = ['PassengerId','Name','Ticket','Cabin']
numColumns = ['Age','SibSp','Parch','Fare']
catColumns = ['Sex','Embarked']
Num Pipeline:
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])
Full Pipeline:
DataPreperationPipeline = ColumnTransformer([
    ("num", num_pipeline, numColumns),
    ("cat", OneHotEncoder(), catColumns),
])
When I do:
predictions = model.predict(test_prepared)
Error I am getting:
X has 9 features, but model is expecting 10 features as input.
PS: This happens with every model I train, even though the test and train data have the same features before transformation.
What should I do?

Apologies, guys,
I am answering my own question here.
I made a mistake and am sharing it because I think I learned something very important.
I was getting this error because the categorical column 'Embarked' had null values in the training data that were not present in the test data. So when I passed it to OneHotEncoder() in my full pipeline, the encoder created four columns of 0s and 1s for the four values (3 original values plus 1 for null), and the model trained on them then expected that extra feature.
As soon as I removed the null rows, everything went well.
Learning:
Always remove null values from your categorical columns before passing them to encoders.
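If dropping rows is not an option, an alternative is to impute the categorical column inside the pipeline and fit the preprocessing on the training data only. The sketch below is an illustration under assumptions, not the asker's code: the rows are made up, and cat_pipeline and prep are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows that mimic some of the Titanic columns used above
train = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "Q"],  # null only in the training data
})
test = pd.DataFrame({
    "Age": [34.5, 47.0],
    "Fare": [7.83, 7.00],
    "Sex": ["male", "female"],
    "Embarked": ["Q", "S"],
})

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
# Fill missing categories with the most frequent value before encoding
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
prep = ColumnTransformer([
    ("num", num_pipeline, ["Age", "Fare"]),
    ("cat", cat_pipeline, ["Sex", "Embarked"]),
])

train_prepared = prep.fit_transform(train)  # learn categories on the training data only
test_prepared = prep.transform(test)        # reuse them; do not refit on the test set

print(train_prepared.shape[1], test_prepared.shape[1])
```

Because the encoder is fitted once and only applied to the test set, both outputs have the same number of columns regardless of which values happen to appear in the test data.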

One-Hot Encoding Question - Concept and Solution to My Problem (Kaggle Dataset)

I'm working on an exercise on Kaggle, in their module on categorical variables, specifically the one-hot encoding section: https://www.kaggle.com/alexisbcook/categorical-variables
I got through the entire workbook fine, and there's one last piece I'm trying to work out: the optional part at the end, applying the one-hot encoder to predict the house sale values. I've worked out the following code, but on the line OH_cols_test = pd.DataFrame(OH_encoder.fit_transform(X_test[low_cardinality_cols])) I'm getting the error that the input contains NaN.
So my first question is: when it comes to one-hot encoding, shouldn't NAs just be treated like any other category within a particular column? And my second question is: if I want to remove these NAs, what's the most efficient way? I tried imputation, but it looks like that only works for numbers? Can someone please let me know where I'm going wrong here? Thanks very much!
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Use as many lines of code as you need!
# (in scikit-learn >= 1.2 the argument is spelled sparse_output=False)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
# The next line raises the "input contains NaN" error
OH_cols_test = pd.DataFrame(OH_encoder.fit_transform(X_test[low_cardinality_cols]))
# One-hot encoding removed index; put it back
OH_cols_test.index = X_test.index
# Remove categorical columns (will replace with one-hot encoding)
num_X_test = X_test.drop(object_cols, axis=1)
# Add one-hot encoded columns to numerical features
OH_X_test = pd.concat([num_X_test, OH_cols_test], axis=1)

So my first question is: when it comes to one-hot encoding, shouldn't NAs just be treated like any other category within a particular column?
NAs are just the absence of data, so you can loosely think of rows with NAs as being incomplete. You may find yourself dealing with a dataset where NAs occur in half of the rows, which will require some clever feature engineering to compensate. Think about it this way: if one-hot encoding is a simple way to represent binary state (e.g. is_male, salary_is_less_than_100000, etc.), then what does NaN/null mean? You have a bit of a Schrödinger's cat on your hands there. You're generally safe to drop NAs so long as it doesn't shrink your dataset too much. The amount of data loss you're willing to accept is entirely situation-based (it's probably fine for a practice exercise).
And my second question is: if I want to remove these NAs, what's the most efficient way? I tried imputation, but it looks like that only works for numbers?
May I suggest this.
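As a rough illustration of the two options discussed above (dropping the rows, or keeping them and giving the gap its own label), here is a pandas-only sketch. The column name, the values and the "Missing" label are made up; whether an explicit "Missing" category is sensible depends on the data.

```python
import pandas as pd

# Made-up categorical column; None marks a missing value
X_train = pd.DataFrame({"Street": ["Pave", "Grvl", None, "Pave"]})
X_test = pd.DataFrame({"Street": ["Grvl", None]})

# Option 1: drop rows with a missing category
dropped = X_train.dropna(subset=["Street"])

# Option 2: keep the rows and give missing values their own explicit label
train_filled = X_train.fillna("Missing")
test_filled = X_test.fillna("Missing")

OH_train = pd.get_dummies(train_filled)
# Align the test columns with the training columns (absent ones are filled with 0)
OH_test = pd.get_dummies(test_filled).reindex(columns=OH_train.columns, fill_value=0)

print(len(dropped))            # 3
print(list(OH_train.columns))  # ['Street_Grvl', 'Street_Missing', 'Street_Pave']
print(OH_test.shape)           # (2, 3)
```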

I deal with this topic on my blog. You can check the link at the bottom of this answer. All my code/logic appears directly below.
# There are various ways to deal with missing data points.
# You can simply drop records if they contain any nulls.
# data.dropna()
# You can fill nulls with zeros
# data.fillna(0)
# You can also fill with mean, median, or do a forward-fill or back-fill.
# The problems with all of these options, is that if you have a lot of missing values for one specific feature,
# you won't be able to do very reliable predictive analytics.
# A viable alternative is to impute missing values using some machine learning techniques
# (regression or classification).
import pandas as pd
from sklearn.linear_model import LinearRegression

linreg = LinearRegression()
# Load data
data = pd.read_csv('C:\\Users\\ryans\\seaborn-data\\titanic.csv')
print(data)
print(data.dtypes)
# Now, we will use a simple regression technique to predict the missing values
# (.copy() so that later assignments do not touch a view of `data`)
data_with_null = data[['survived','pclass','sibsp','parch','fare','age']].copy()
data_without_null = data_with_null.dropna()
train_data_x = data_without_null.iloc[:,:5]
train_data_y = data_without_null.iloc[:,5]
linreg.fit(train_data_x,train_data_y)
# check for nulls per column
print(data_with_null.isnull().sum())
# Find any/all missing data points in entire data set
print(data_with_null.isnull().sum().sum())
# WOW 177 NULLS!!
# LET'S IMPUTE MISSING VALUES...
# Predict age for every row from the other five features
test_data = data_with_null.iloc[:,:5]
age = pd.Series(linreg.predict(test_data), index=data_with_null.index)
print(age)
# Finally, fill only the missing ages with the predicted values (known ages are kept)
data_with_null['age'] = data_with_null['age'].fillna(age)
# Check for nulls
print(data_with_null.isnull().sum())
https://github.com/ASH-WICUS/Notebooks/blob/master/Fillna%20with%20Predicted%20Values.ipynb
One final thought, just in case you don't already know about this. There are two kinds of categorical data:
Ordinal Data: The categories have an inherent order (small, medium, large)
When your data has an inherent order, USE LABEL (ORDINAL) ENCODING!
Nominal Data: The categories do not have an inherent order (states in the US)
When your data is nominal, and there is no specific order, USE ONE HOT ENCODING!
https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/
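A small sketch of the distinction, using made-up data. Note that sklearn's LabelEncoder sorts labels alphabetically (large, medium, small), so for ordinal data it is safer to spell out the order yourself, as done here with an ordered pandas Categorical.

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["medium", "small", "large", "small"],  # ordinal
    "state": ["TX", "CA", "NY", "CA"],              # nominal
})

# Ordinal: state the order explicitly so the codes follow small < medium < large
size_order = ["small", "medium", "large"]
df["size_code"] = pd.Categorical(df["size"], categories=size_order, ordered=True).codes

# Nominal: one indicator column per category, no order implied
state_dummies = pd.get_dummies(df["state"], prefix="state")

print(df["size_code"].tolist())     # [1, 0, 2, 0]
print(list(state_dummies.columns))  # ['state_CA', 'state_NY', 'state_TX']
```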

Keeping track of the output columns in sklearn preprocessing

How do I keep track of the columns of the transformed array produced by sklearn.compose.ColumnTransformer? By "keeping track of" I mean that every bit of information required to perform an inverse transform must be shown explicitly. This includes at least the following:
What is the source variable of each column in the output array?
If a column of the output array comes from one-hot encoding of a categorical variable, what is that category?
What is the exact imputed value for each variable?
What is the (mean, stdev) used to standardize each numerical variable? (These may differ from direct calculation because of imputed missing values.)
I am using the approach from this answer. My input dataset is also a generic pandas.DataFrame with multiple numerical and categorical columns. Yes, that answer can transform the raw dataset, but I lose track of the columns in the output array. I need this information for peer review, report writing, presentation and further model-building steps. I've been searching for a systematic approach, but with no luck.

The answer you mentioned is based on this in sklearn.
You can get the answers to your first two questions using the following snippet.
def get_feature_names(columnTransformer):
    output_features = []
    for name, pipe, features in columnTransformer.transformers_:
        if name != 'remainder':
            # assumes each transformer is a Pipeline, so it can be iterated step by step
            for i in pipe:
                trans_features = []
                if hasattr(i, 'categories_'):
                    # get_feature_names_out replaces get_feature_names in scikit-learn >= 1.0
                    trans_features.extend(i.get_feature_names_out(features))
                else:
                    trans_features = features
            output_features.extend(trans_features)
    return output_features
import pandas as pd
pd.DataFrame(preprocessor.fit_transform(X_train),
             columns=get_feature_names(preprocessor))
transformed_cols = get_feature_names(preprocessor)
def get_original_column(col_index):
    return transformed_cols[col_index].split('_')[0]
get_original_column(3)
# 'embarked'
get_original_column(0)
# 'age'
def get_category(col_index):
    new_col = transformed_cols[col_index].split('_')
    return 'no category' if len(new_col) < 2 else new_col[-1]
print(get_category(3))
# 'Q'
print(get_category(0))
# 'no category'
Tracking whether there has been some imputation or scaling done on a feature is not trivial with the current version of Sklearn.
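That said, the fitted parameters themselves (questions 3 and 4) can be read from the fitted steps. A minimal sketch, assuming the preprocessor is a ColumnTransformer whose transformers are Pipelines; the step names "imputer", "scaler" and "onehot" and the data are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training data with one numeric and one categorical column
X_train = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 40.0],
    "embarked": ["S", "C", "Q", "S"],
})

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["age"]),
    ("cat", Pipeline([
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["embarked"]),
])
preprocessor.fit(X_train)

num = preprocessor.named_transformers_["num"]
cat = preprocessor.named_transformers_["cat"]

imputed_values = num.named_steps["imputer"].statistics_  # one fill value per numeric column
means = num.named_steps["scaler"].mean_                  # computed after imputation
stdevs = num.named_steps["scaler"].scale_                # standard deviation used for scaling
categories = cat.named_steps["onehot"].categories_       # one array per categorical column

print(imputed_values, means, stdevs)
print(categories)
```

Each array is ordered like the column list passed to that transformer, so it can be zipped with the column names for a report.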

One hot encoding when reading into tf.dataset

I am running a tensorflow model on the gcp-ai platform. The dataset is large and not everything can be kept in memory at the same time, therefore I read the data into a tf.dataset using the following code:
def read_dataset(filepattern):
    def decode_csv(value_column):
        cols = tf.io.decode_csv(value_column, record_defaults=[[0.0],[0],[0.0]])
        features=[cols[1],cols[2]]
        label = cols[0]
        return features, label
    # Create list of files that match pattern
    file_list = tf.io.gfile.glob(filepattern)
    # Create dataset from file list
    dataset = tf.data.TextLineDataset(file_list).map(decode_csv)
    return dataset
training_data=read_dataset(<filepattern>)
The problem is that the second column in my data is categorical, and I need to one-hot encode it. How can this be done, either in the function decode_csv or by manipulating the tf.dataset later?

You could use tf.one_hot. Assuming that the second column is cols[1] and that the categorical values have been converted to integers, you could do the following:
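def decode_csv(value_column):
    cols = tf.io.decode_csv(value_column, record_defaults=[[0.0],[0],[0.0]])
    # cols[1] is the integer-coded categorical column; nb_classes is the number of categories
    # a tuple is used because the two feature tensors have different shapes
    features = (tf.one_hot(cols[1], nb_classes), cols[2])
    label = cols[0]
    return features, label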
NOTE: Not tested.
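To make the effect of the one-hot step concrete without running TensorFlow, here is the same transformation sketched in NumPy. This is a stand-in for illustration, not the tf.one_hot call itself; the integer codes and nb_classes are made up.

```python
import numpy as np

nb_classes = 4                   # assumed number of distinct categories
codes = np.array([0, 2, 3, 2])   # integer-coded categorical column

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(nb_classes, dtype=np.float32)[codes]

print(one_hot)
```

Each row has a single 1 at the position given by the integer code, which is the layout the model receives for the categorical feature.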

Pre-process non-image data to feed into Tensorflow DNN

I have a large amount of non-image data spread across several delimited files which I want to use as inputs to a DNN in TensorFlow. The data need some pre-processing, so I am trying to use the CIFAR10 example in the TensorFlow source as a template because it has pre-processing, it processes multiple files, and it queues data for the model.
I cannot figure out how the data should be represented given that I'll have multiple FeatureColumns and the data are read record by record.
My input data look like the sample below, delimited by '|'. The first column I want to pre-process, which yields two values; these values will then be converted into Tensors with tf.contrib.layers.sparse_column_with_hash_bucket. The second is a real-valued column which I want to convert with tf.contrib.layers.real_valued_column, and the third is the label I want to predict.
uywohy|12.3|0
asdfsvjlk|2.2|1
nlnliu|1.0|1
nlwljw|9.6|0
My plan is to read the data with tf.TextLineReader, split the data on the delimiter, and then pre-process. The example code starts here.
# Read in and pre-process a single record
DELIMITER = "|"
reader = tf.TextLineReader()
unparsed_record = reader.read()
col1, col2, label = unparsed_record.split(DELIMITER)
result.label = tf.cast(label, tf.int32)
col1_a, col1_b = _preprocess(col1)
# How to convert col1_a, col1_b, and col2 into a Tensor?
However, I'm not sure how to then re-assemble the data (col1_a, col1_b, and col2) into something that can be fed into the model. The CIFAR10 model doesn't make use of feed_dict, so I don't see how to assemble the data.
Any help is much appreciated.

You can use tf.learn, which requires an input function that populates the mappings from feature columns to tensors or sparse tensors. Here is an example: https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/examples/learn/wide_n_deep_tutorial.py
