categorical feature setting error in PMML GBDTLRClassifier - python

I am trying to set up a GBDTLRClassifier following the instructions here.
First, I label-encoded my columns. Then I defined my categorical and continuous features by putting the column names in two lists:
cat # categorical column names
conts # continuous column names
gbm = lgb.LGBMClassifier(n_estimators=90)
classifier = GBDTLRClassifier(gbm, LogisticRegression(penalty='l2'))
dm = DataFrameMapper([([cat_col], CategoricalDomain()) for cat_col in cat] + [(conts, ContinuousDomain())])
pipeline = PMMLPipeline([('mapper', dm), ('classifier', classifier)])
pipeline.fit(df[cat + conts], df['y'], classifier__gbdt__eval_set=[(val[cat + conts], val['y'])], classifier__gbdt__early_stopping_rounds = 5, classifier__gbdt__categorical_feature=cat)
pp = make_pmml_pipeline(pipeline, target_fields=['y'])
sklearn2pmml(pp, '/tmp/lgb+lr.pmml')
Fitting fails with this error: TypeError: Wrong type(str) or unknown name(root) in categorical_feature. But root is definitely in cat. It looks like LightGBM is not aware of which columns are categorical, which is confusing.
Moreover, when I remove the mapper step, fitting succeeds, but the conversion to a PMML file fails with the message: transformer object of the first step does not specify the number of input features.
Could anyone tell me how to make this procedure work? Thanks.

Based on the comment here, I need to set feature_name when I pass string column names to categorical_feature. It is a little tricky.
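A possible workaround, sketched here under the assumption that the mapper hands LightGBM a plain array (so the string names in cat are no longer visible to it), is to pass integer column positions instead of names. LightGBM accepts a list of ints for categorical_feature and interprets them as column indices. The column names other than root are hypothetical placeholders.

```python
# Hypothetical column lists; only "root" appears in the original question.
cat = ["root", "city"]
conts = ["age", "income"]

# The mapper in the question emits one column per categorical feature first,
# followed by the continuous columns, so the output order is cat + conts.
mapper_output_order = cat + conts

# Positions of the categorical columns in the mapper output.
cat_indices = [mapper_output_order.index(name) for name in cat]
print(cat_indices)  # [0, 1]

# The indices would then replace the names in the fit call, for example:
# pipeline.fit(df[cat + conts], df['y'],
#              classifier__gbdt__categorical_feature=cat_indices)
```

Indices avoid the need for feature_name altogether, at the cost of having to keep them in sync with the mapper definition.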

Related

Getting diff. num of features in train and test data after passing through the same pipeline

I am working on the introductory Titanic problem in Kaggle.
Here I wanted to design a pipeline to pre-process data.
I have written the following code for it:
dropColumns = ['PassengerId','Name','Ticket','Cabin']
numColumns = ['Age','SibSp','Parch','Fare']
catColumns = ['Sex','Embarked']
Num Pipeline:
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])
Full Pipeline:
DataPreperationPipeline = ColumnTransformer([
    ("num", num_pipeline, numColumns),
    ("cat", OneHotEncoder(), catColumns),
])
When I do:
predictions = model.predict(test_prepared)
Error I am getting:
X has 9 features, but model is expecting 10 features as input.
PS: This happens with every model I train, even though the test and train data have the same features before transformation.
What should I do?
Apologies Guys,
I am answering my own question here.
I made a mistake and am sharing it here because I think I learned something very important.
The reason I was getting this error was that the categorical column 'Embarked' had null values in the training data that were not present in the test data. So when I passed it to OneHotEncoder() in my full pipeline, it created four columns of 0s and 1s to distinguish between the four values (3 original values, 1 null), which my model took to be an extra feature.
As soon as I removed the null rows, everything went well.
Learning:
Always remove (or impute) null values in your categorical columns before passing them to encoders.
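An alternative to dropping rows, sketched below on made-up data, is to impute the categorical columns inside the pipeline, then call fit_transform only on the training data and transform on the test data, so that both end up with the same columns. The cat_pipeline name and the toy values are assumptions, not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: 'Embarked' has a null in train but none in test.
train = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "Q"],
})
test = pd.DataFrame({
    "Age": [34.5, 47.0],
    "Fare": [7.83, 7.00],
    "Sex": ["male", "female"],
    "Embarked": ["Q", "S"],
})

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
# Fill missing categories with the most frequent value before encoding.
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
prep = ColumnTransformer([
    ("num", num_pipeline, ["Age", "Fare"]),
    ("cat", cat_pipeline, ["Sex", "Embarked"]),
])

train_prepared = prep.fit_transform(train)  # learn categories from train only
test_prepared = prep.transform(test)        # reuse them on test
print(train_prepared.shape[1] == test_prepared.shape[1])
```

Because the encoder is fitted once, the number of output columns is fixed by the training data regardless of which categories show up in the test set.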

Apply custom function to a column in data frame if the column value is not equal to nan

Maybe Python geniuses can help me here-
So I have a machine learning algorithm where I am obtaining my classifier like :
clf = MultinomialNB().fit(X_train_tfidf, y_train)
To get it to predict I simply write
print(clf.predict(count_vect.transform(["I am happy today"])))
and it gives me back the Category in which this sentence falls.
Now I want to use this clf to predict results for a whole column. I have another data frame, dfnew, which consists of 2 columns, and the second column, with the header 'Additional Information', has the strings that I need to pass to clf.predict.
But this column has blank (NaN) values.
How do I pass this 'Additional Information' column to clf.predict() such that it skips the NaN ones and only predicts for the strings that are present? Ideally, I would want the results to be added to the same data frame dfnew as a third column.
Any help or guidance would be really appreciated. Thank you
I guess what you're looking for is the apply method. Simply:
import pandas as pd

def classify_when_possible(add_info):
    # Blank cells are read by pandas as NaN, not None, so test with pd.isna
    if pd.isna(add_info):
        return None
    # predict returns an array with one element; take the label itself
    return clf.predict(count_vect.transform([add_info]))[0]

# if you want it to be added to your dataframe
dfnew['third column'] = dfnew['Additional Information'].apply(classify_when_possible)
Note: there are many ways to use apply; some are more efficient than others, depending on the use case.
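As one such variant, here is a minimal sketch that predicts for all non-missing rows in a single call by using a boolean mask instead of a row-by-row apply. The toy training texts, the labels, and the 'Prediction' column name are assumptions made up for the example; clf and count_vect stand in for the objects from the question.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the question's count_vect and clf.
texts = ["I am happy today", "what a great day", "I am sad", "this is terrible"]
labels = ["positive", "positive", "negative", "negative"]
count_vect = CountVectorizer()
clf = MultinomialNB().fit(count_vect.fit_transform(texts), labels)

dfnew = pd.DataFrame({
    "Id": [1, 2, 3],
    "Additional Information": ["I am happy today", np.nan, "this is terrible"],
})

# True where there is a string to classify, False where the cell is NaN.
mask = dfnew["Additional Information"].notna()

# Start with an empty column, then fill only the rows selected by the mask.
dfnew["Prediction"] = None
dfnew.loc[mask, "Prediction"] = clf.predict(
    count_vect.transform(dfnew.loc[mask, "Additional Information"])
)
print(dfnew)
```

Rows with NaN keep None in the new column, and the vectorizer and classifier are each called once rather than once per row.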

One-Hot Encoding Question - Concept and Solution to My Problem (Kaggle Dataset)

I'm working on an exercise in Kaggle. It's in their module on categorical variables, specifically the one-hot encoding section: https://www.kaggle.com/alexisbcook/categorical-variables
I got through the entire workbook fine, and there's one last piece I'm trying to work out: the optional piece at the end, which applies the one-hot encoder to predict the house sale values. I've worked out the following code, but on the line in bold, OH_cols_test = pd.DataFrame(OH_encoder.fit_transform(X_test[low_cardinality_cols])), I'm getting an error that the input contains NaN.
So my first question is: when it comes to one-hot encoding, shouldn't NAs just be treated like any other category within a particular column? And my second question is: if I want to remove these NAs, what's the most efficient way? I tried imputation, but it looks like that only works for numbers. Can someone please let me know where I'm going wrong here? Thanks very much!
from sklearn.preprocessing import OneHotEncoder
# Use as many lines of code as you need!
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
**OH_cols_test = pd.DataFrame(OH_encoder.fit_transform(X_test[low_cardinality_cols]))**
# One-hot encoding removed index; put it back
OH_cols_test.index = X_test.index
# Remove categorical columns (will replace with one-hot encoding)
num_X_test = X_test.drop(object_cols, axis=1)
# Add one-hot encoded columns to numerical features
OH_X_test = pd.concat([num_X_test, OH_cols_test], axis=1)
So my first question is, when it comes to one-hot encoding, shouldn't NAs just be treated like any other category within a particular column?
NA's are just the absence of data, and so you can loosely think of rows with NA's as being incomplete. You may find yourself dealing with a dataset where NAs occur in half of the rows, and will require some clever feature engineering to compensate for this. Think about it this way: if one hot encoding is a simple way to represent binary state (e.g. is_male, salary_is_less_than_100000, etc...), then what does NaN/null mean? You have a bit of a Schrodinger's cat on your hands there. You're generally safe to drop NA's so long as it doesn't mangle your dataset size. The amount of data loss you're willing to handle is entirely situation-based (it's probably fine for a practice exercise).
And second question is, if I want to remove these NAs, what's the most efficient way? I tried imputation, but it looks like that only works for numbers?
May I suggest this.
I deal with this topic on my blog. You can check the link at the bottom of this answer. All my code/logic appears directly below.
# There are various ways to deal with missing data points.
# You can simply drop records if they contain any nulls.
# data.dropna()
# You can fill nulls with zeros
# data.fillna(0)
# You can also fill with mean, median, or do a forward-fill or back-fill.
# The problems with all of these options, is that if you have a lot of missing values for one specific feature,
# you won't be able to do very reliable predictive analytics.
# A viable alternative is to impute missing values using some machine learning techniques
# (regression or classification).
import pandas as pd
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
# Load data
data = pd.read_csv('C:\\Users\\ryans\\seaborn-data\\titanic.csv')
print(data)
list(data)
data.dtypes
# Now, we will use a simple regression technique to predict the missing values
data_with_null = data[['survived','pclass','sibsp','parch','fare','age']]
data_without_null = data_with_null.dropna()
train_data_x = data_without_null.iloc[:,:5]
train_data_y = data_without_null.iloc[:,5]
linreg.fit(train_data_x,train_data_y)
test_data = data_with_null.iloc[:,:5]
age = pd.DataFrame(linreg.predict(test_data))
# check for nulls
data_with_null.apply(lambda x: sum(x.isnull()),axis=0)
# Find any/all missing data points in entire data set
print(data_with_null.isnull().sum().sum())
# WOW 177 NULLS!!
# LET'S IMPUTE MISSING VALUES...
# View age feature
age = list(linreg.predict(test_data))
print(age)
# Finally, we will fill only the missing ages with our predicted values (known ages are kept)
data_with_null['age'] = data_with_null['age'].fillna(pd.Series(age, index=data_with_null.index))
# Check for nulls
data_with_null.apply(lambda x: sum(x.isnull()),axis=0)
https://github.com/ASH-WICUS/Notebooks/blob/master/Fillna%20with%20Predicted%20Values.ipynb
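On the point that imputation seems to work only for numbers: scikit-learn's SimpleImputer also handles string columns with the "most_frequent" and "constant" strategies. The sketch below, on made-up data, fills missing categories with a placeholder label so that NaN effectively becomes its own category, and fits the encoder on the training data only. The column values and the "missing" label are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy frames standing in for the exercise's X_train and X_test.
X_train = pd.DataFrame({"Street": ["Pave", "Grvl", np.nan],
                        "LotShape": ["Reg", "IR1", "Reg"]})
X_test = pd.DataFrame({"Street": ["Pave", np.nan],
                       "LotShape": [np.nan, "IR1"]})
low_cardinality_cols = ["Street", "LotShape"]

# Replace NaN with an explicit placeholder category.
imputer = SimpleImputer(strategy="constant", fill_value="missing")
OH_encoder = OneHotEncoder(handle_unknown="ignore")

train_filled = imputer.fit_transform(X_train[low_cardinality_cols])
test_filled = imputer.transform(X_test[low_cardinality_cols])

# Fit the encoder on train, then only transform test.
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(train_filled).toarray(),
                             index=X_train.index)
OH_cols_test = pd.DataFrame(OH_encoder.transform(test_filled).toarray(),
                            index=X_test.index)
print(OH_cols_train.shape, OH_cols_test.shape)
```

With handle_unknown="ignore", a category seen only in the test data (here, a missing LotShape) is encoded as all zeros instead of raising an error.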
One final thought, just in case you don't already know about this. There are two kinds of categorical data:
Ordinal Data: The categories have an inherent order (small, medium, large)
When your data is ordered like this, USE LABEL (ORDINAL) ENCODING!
Nominal Data: The categories do not have an inherent order (states in the US)
When your data is nominal, and there is no specific order, USE ONE HOT ENCODING!
https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/
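The two cases can be sketched with scikit-learn's OrdinalEncoder and OneHotEncoder. The column names and values below are made up for illustration; the explicit categories list is what tells the encoder the intended order.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    "size": ["small", "large", "medium"],  # ordered categories
    "state": ["TX", "CA", "NY"],           # unordered categories
})

# Ordered data: map each category to its rank in the given order.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = ordinal.fit_transform(df[["size"]])

# Unordered data: one 0/1 column per category.
onehot = OneHotEncoder()
state_cols = onehot.fit_transform(df[["state"]]).toarray()

print(size_codes.ravel())
print(state_cols.shape)
```

Without the explicit categories argument, the ordinal encoder would sort the labels alphabetically, which rarely matches the real order.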

Keeping track of the output columns in sklearn preprocessing

How do I keep track of the columns of the transformed array produced by sklearn.compose.ColumnTransformer? By "keeping track of" I mean that every bit of information required to perform an inverse transform must be shown explicitly. This includes at least the following:
What is the source variable of each column in the output array?
If a column of the output array comes from one-hot encoding of a categorical variable, what is that category?
What is the exact imputed value for each variable?
What is the (mean, stdev) used to standardize each numerical variable? (These may differ from direct calculation because of imputed missing values.)
I am using the approach from this answer. My input dataset is also a generic pandas.DataFrame with multiple numerical and categorical columns. Yes, that answer can transform the raw dataset. But I lose track of the columns in the output array. I need this information for peer review, report writing, presentations and further model-building steps. I've been searching for a systematic approach, but with no luck.
The answer you mentioned is based on this in Sklearn.
You can get the answer to your first two questions using the following snippet.
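def get_feature_names(columnTransformer):
    # Assumes every transformer in the ColumnTransformer is a Pipeline
    output_features = []
    for name, pipe, features in columnTransformer.transformers_:
        if name != 'remainder':
            for i in pipe:
                trans_features = []
                if hasattr(i, 'categories_'):
                    # One-hot encoder step: expand to one name per category
                    # (scikit-learn >= 1.0 provides get_feature_names_out instead)
                    trans_features.extend(i.get_feature_names(features))
                else:
                    trans_features = features
            output_features.extend(trans_features)
    return output_features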
import pandas as pd
pd.DataFrame(preprocessor.fit_transform(X_train),
columns=get_feature_names(preprocessor))
transformed_cols = get_feature_names(preprocessor)
def get_original_column(col_index):
    return transformed_cols[col_index].split('_')[0]
get_original_column(3)
# 'embarked'
get_original_column(0)
# 'age'
def get_category(col_index):
    new_col = transformed_cols[col_index].split('_')
    return 'no category' if len(new_col) < 2 else new_col[-1]
print(get_category(3))
# 'Q'
print(get_category(0))
# 'no category'
Tracking whether there has been some imputation or scaling done on a feature is not trivial with the current version of Sklearn.
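For the imputed values and the scaling statistics, one option is to read the fitted attributes of the individual steps: SimpleImputer stores the fill values in statistics_, and StandardScaler stores mean_ and scale_. The sketch below uses a made-up two-column frame and assumes a "num" pipeline with steps named "imputer" and "scaler"; those names are assumptions, so adjust them to your own preprocessor.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"age": [20.0, np.nan, 40.0], "fare": [10.0, 20.0, 30.0]})
num_cols = ["age", "fare"]

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), num_cols),
])
preprocessor.fit(X)

# Look up the fitted steps by the names given above.
num = preprocessor.named_transformers_["num"]
imputer = num.named_steps["imputer"]
scaler = num.named_steps["scaler"]

# One row per numerical variable: fill value, then the mean and standard
# deviation computed after imputation (scale_ is the population std).
summary = pd.DataFrame({
    "imputed_value": imputer.statistics_,
    "mean": scaler.mean_,
    "stdev": scaler.scale_,
}, index=num_cols)
print(summary)
```

Since the scaler is fitted on the imputed data, its mean and standard deviation already reflect the filled-in values, which is why they can differ from a direct calculation on the raw column.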

How to get the indices for randomly selected rows in a list (Python)

Okay, I don't know if I phrased it badly or something, but I can't seem to find anything similar here for my problem.
So I have a 2D list, with each row representing a case and each column representing a feature (for machine learning). In addition, I have a separate list (column) of labels.
I want to randomly select the rows from the 2D list to train a classifier while using the rest to test for accuracy. Thus I want to be able to know all the indices of rows I used for training to avoid repeats.
I think there are 2 parts of the question:
1) how to randomly select
2) how to get indices
again I have no idea why I can't find good info here by searching (maybe I just suck)
Sorry, I'm still new to the community, so I might have made a lot of formatting mistakes. If you have any suggestions, please let me know.
Here's the part of code I'm using to get the 2D list
# 273 = number of cases
# Build each row separately; [[0]*n]*273 would create 273 references to the same row
feature_list = [[0] * len(mega_list) for _ in range(273)]
# create counters to use for index later
link_count = 0
feature_count = 0
# print len(mega_list)
for link in url_list[:-1]:
    # setup the url
    samp_url = 'http://www.mtsamples.com' + link
    samp_url = "%20".join(samp_url.split())
    # soup it for keywords
    samp_soup = BeautifulSoup(urllib2.urlopen(samp_url).read())
    keywords = samp_soup.find('meta')['content']
    keywords = keywords.split(',')
    for keys in keywords:
        # print 'megalist: ' + str(mega_list.index(keys))
        if keys in mega_list:
            feature_list[link_count][mega_list.index(keys)] = 1
mega_list: a list with all keywords
feature_list: the 2D list, with any word in mega_list, that specific cell is set to 1, otherwise 0
I would store the data in a pandas data frame instead of a 2D list. If I understand your data right you could do that like this:
import pandas as pd
df = pd.DataFrame(feature_list, columns = mega_list)
I don't see any mention of a dependent variable, but I'm assuming you have one because you mentioned a classifier algorithm. If your dependent variable is called "Y" and is in a list format with indices that align with your features, then this code will work for you:
# train_test_split lives in sklearn.model_selection (the old cross_validation module was removed)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    df, Y, test_size=0.8, random_state=0)
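With a data frame, the row indices asked about in the question come for free, because the split keeps the original index on each part. A minimal sketch on made-up data (the feature names and labels are placeholders):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature frame and the label list.
df = pd.DataFrame({"kw_a": [0, 1, 0, 1, 1, 0, 0, 1, 0, 1],
                   "kw_b": [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]})
Y = [0, 1, 0, 1, 1, 0, 0, 1, 0, 1]

x_train, x_test, y_train, y_test = train_test_split(
    df, Y, test_size=0.8, random_state=0)

# The index of each part holds the original row numbers.
train_indices = list(x_train.index)
test_indices = list(x_test.index)
print(train_indices)
print(test_indices)
```

Every row lands in exactly one of the two parts, so the training rows are never reused for testing.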
As I understand the problem, you have a list and you want to sample the list and save the indices for future use.
See: https://docs.python.org/2/library/random.html
You could do random.sample(xrange(sizeoflist), sizeofsample) (range instead of xrange in Python 3), which returns the indices of your sample. You can then use those rows for training and skip over them (or get fancy and do a set difference) for validation.
Hope this helps
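The sampling idea above can be sketched with the standard library alone. This is a Python 3 version on a made-up 2D list; the seed and the split size are arbitrary choices for the example.

```python
import random

# Toy stand-ins for the 2D feature list and the label list.
feature_list = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0]]
labels = [1, 0, 1, 0, 1]

rng = random.Random(0)  # fixed seed so the split is reproducible
n_train = 3

# Sample row indices without replacement, so no row is picked twice.
train_idx = sorted(rng.sample(range(len(feature_list)), n_train))

# Everything not sampled goes to the test set.
train_set = set(train_idx)
test_idx = [i for i in range(len(feature_list)) if i not in train_set]

X_train = [feature_list[i] for i in train_idx]
y_train = [labels[i] for i in train_idx]
X_test = [feature_list[i] for i in test_idx]
y_test = [labels[i] for i in test_idx]

print(train_idx, test_idx)
```

Keeping train_idx around is what lets you check later which cases were used for training.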
