I am trying to write some code that weights the most impactful features (feature importances). My dataframe contains both numerical and categorical data.
example data:
[Brand] [Model] [Car_price] [...] [Prime]
BMW X1 40,000 300
The target Y is the Prime column and X is all the other columns.
I tried using the following:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
df = pd.read_csv(data, delimiter=";")
#df = df.dropna(axis=1)
array = df.values
X = array[:,(6,7,9,12,13,14,15,16,17,18,19,20,21,22,23,24,25,27,34,35,37,44,45,47,48,54,61,62)]
Y = array[:,51]
forest = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
forest.fit(X, Y)
I get the following error: ValueError: could not convert string to float
I know there is a way to transform strings into numerical data, but I was wondering whether that is necessary. What fixes can I apply to get feature weights?
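A minimal sketch of one possible fix (not from the original post; the file name and the assumption that the target column is literally named Prime are illustrative): one-hot encode the string columns with pd.get_dummies, keep the result as a DataFrame so the column names survive, and read the weights from feature_importances_:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("data.csv", delimiter=";")   # "data.csv" is a placeholder for your file
X_raw = df.drop(columns=["Prime"])            # assumes the target column is named "Prime"
y = df["Prime"]

# One-hot encode every string column; numeric columns pass through unchanged
X = pd.get_dummies(X_raw)

forest = RandomForestClassifier(n_jobs=-1, class_weight="balanced", max_depth=5)
forest.fit(X, y)

# Pair each (possibly dummy-encoded) column with its importance and sort
importances = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))
Selecting columns by name rather than by position also makes the long column-index tuple unnecessary.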
Related
In R, I have a data frame with two categorical predictors, one of which has multiple levels, and a categorical response. I am running a multinomial logistic regression on each of the categorical predictors, plus the interaction of the two categorical predictors.
library(VGAM)
x1 <- as.factor(rep(c(1,2,3,3,2,1,1,2,3,3,2,1),5))
x2 <- as.factor(rep(c(1,1,1,1,1,1,2,2,2,2,2,2),5))
y <- as.factor(rep(c(1,2,3,1,2,3,1,2,3,1,2,3),5))
VGAM's vglm function has the ability to handle the categorical variables and their interactions.
M <- vglm(y ~ x1*x2, family=multinomial)
However, I now have to do this work in Python, and I am having a hard time getting the categorical variables to function as cleanly in statsmodels as they do in R. R does the categorical encoding from a factor variable just fine and then does the interactions. statsmodels has not done that for me (yet).
I have the Python function that fits multinomial logistic regressions, smf.mnlogit (smf coming from `import statsmodels.formula.api as smf`). How can I use that with the factor variables to get the interactions that I get in R?
Here is the Python code I've tried:
# import packages
#
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Define data
#
x1 = np.array([1,2,3,3,2,1,1,2,3,3,2,1] * 5)
x2 = np.array([1,1,1,1,1,1,2,2,2,2,2,2] * 5)
y = np.array([1,2,3,1,2,3,1,2,3,1,2,3] * 5)
# Make data frame
#
df = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y})
# Make the columns categorical
#
df['x1'] = df['x1'].astype('category')
df['x2'] = df['x2'].astype('category')
df['y'] = df['y'].astype('category')
# fit the multinomial logistic regression
#
mlr = smf.mnlogit(formula='y ~ x1*x2', data=df).fit()
I get the following error:
ValueError: endog has evaluated to an array with multiple columns that has shape (60, 3). This occurs when the variable converted to endog is non-numeric (e.g., bool or str).
I think the categorical target column gets one-hot encoded once it is used as the endog variable, which is why you are getting this error. A possible workaround is to encode the categories as numbers and then normalize them before supplying the column to the logit() function (although encoding string categories as integers is not strictly correct).
Consider the following example:
from sklearn import preprocessing
import statsmodels.formula.api as smf

# 'target' holds the name of the label column in df_log
df_log[target] = pd.Categorical(df_log[target])
df_log[target] = df_log[target].cat.codes
min_max_scaler = preprocessing.MinMaxScaler()
df_log[[target]] = min_max_scaler.fit_transform(df_log[[target]])
formula = "target ~ x1 + x2"
model = smf.logit(formula=formula, data=df_log).fit()
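Another possibility, sketched here under the assumption that patsy's C() works with smf.mnlogit the same way it does with the other formula-API models: leave y as plain integer class codes so it is not dummy-encoded, and let the formula build the categorical encoding and the interaction, much like y ~ x1*x2 in R:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

x1 = np.array([1, 2, 3, 3, 2, 1, 1, 2, 3, 3, 2, 1] * 5)
x2 = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2] * 5)
y = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3] * 5)
df = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y})

# y stays numeric, so endog is a single column of class labels;
# C() tells patsy to dummy-code x1 and x2 and to build the interaction terms.
# bfgs is used because this toy data is quasi-separated and newton may struggle.
mlr = smf.mnlogit('y ~ C(x1) * C(x2)', data=df).fit(method='bfgs', maxiter=200)
print(mlr.summary())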
I'm trying to conduct a supervised machine-learning experiment using the SelectKBest feature of scikit-learn, but I'm not sure how to create a new dataframe after finding the best features:
Let's assume I would like to conduct the experiment selecting 5 best features:
from sklearn.feature_selection import SelectKBest, f_classif
select_k_best_classifier = SelectKBest(score_func=f_classif, k=5).fit_transform(features_dataframe, targeted_class)
Now, if I add the line:
import pandas as pd
dataframe = pd.DataFrame(select_k_best_classifier)
I receive a new dataframe without feature names (only column indices from 0 to 4), but I want to create a dataframe with the newly selected features, like this:
dataframe = pd.DataFrame(fit_transformed_features, columns=features_names)
My question is how to create the features_names list?
I know that I should use:
select_k_best_classifier.get_support()
which returns an array of boolean values, where the True values mark the columns that should be selected from the original dataframe.
How should I use this boolean array together with the array of all feature names, which I can get via feature_names = list(features_dataframe.columns.values)?
This doesn't require loops.
# Create and fit selector
selector = SelectKBest(f_classif, k=5)
selector.fit(features_df, target)
# Get columns to keep and create new dataframe with those only
cols_idxs = selector.get_support(indices=True)
features_df_new = features_df.iloc[:,cols_idxs]
For me this code works fine and is more 'pythonic':
mask = select_k_best_classifier.get_support()
new_features = features_dataframe.columns[mask]
You can do the following:
mask = select_k_best_classifier.get_support()  # list of booleans
new_features = []  # the list of your K best features
for bool_val, feature in zip(mask, feature_names):
    if bool_val:
        new_features.append(feature)
Then change the name of your features:
dataframe = pd.DataFrame(fit_transformed_features, columns=new_features)
The following code will help you find the top K features along with their F-scores. Let X be the pandas dataframe whose columns are all the features, and y the list of class labels.
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
# Suppose we select the 5 features with the top 5 ANOVA F-scores
selector = SelectKBest(f_classif, k=5)
# New dataframe with the selected features, for later use in the classifier.
# fit() works too if you only want the feature names and their scores.
X_new = selector.fit_transform(X, y)
names = X.columns.values[selector.get_support()]
scores = selector.scores_[selector.get_support()]
names_scores = list(zip(names, scores))
ns_df = pd.DataFrame(data = names_scores, columns=['Feat_names', 'F_Scores'])
#Sort the dataframe for better visualization
ns_df_sorted = ns_df.sort_values(['F_Scores', 'Feat_names'], ascending = [False, True])
print(ns_df_sorted)
Select the 10 best features according to chi2:
from sklearn.feature_selection import SelectKBest, chi2
KBest = SelectKBest(chi2, k=10).fit(X, y)
Get the selected features with get_support():
f = KBest.get_support(1)  # indices of the most important features
Create a new df called X_new:
X_new = X[X.columns[f]]  # final features
As of scikit-learn 1.0, transformers have the get_feature_names_out method, which means you can write
dataframe = pd.DataFrame(fit_transformed_features, columns=transformer.get_feature_names_out())
There is another alternative method which, however, is not as fast as the solutions above.
# Use the selector to retrieve the best features
X_new = select_k_best_classifier.fit_transform(train[feature_cols],train['is_attributed'])
# Get back the kept features as a DataFrame with dropped columns as all 0s
selected_features = pd.DataFrame(select_k_best_classifier.inverse_transform(X_new),
index=train.index,
columns= feature_cols)
selected_columns = selected_features.columns[selected_features.var() !=0]
# Fit the SelectKBest instance
select_k_best_classifier = SelectKBest(score_func=f_classif, k=5).fit(features_dataframe, targeted_class)
# Extract the required features
new_features = select_k_best_classifier.get_feature_names_out(features_names)
Suppose that you want to choose the 10 best features:
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
selector = SelectKBest(score_func=chi2, k=10)
selector.fit_transform(X, y)
# get_feature_names_out() returns the names of the selected features;
# feature_names_in_ would return all of the input feature names
features_names = selector.get_feature_names_out()
print(features_names)
I'm working on multivariable regression from a CSV, predicting crop performance based on multiple factors. Some of my columns are numerical and meaningful. Others are numerical and categorical, or strings and categorical (for instance, crop variety or plot code). How do I teach Python to use them? I've found OneHotEncoder (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) but don't really understand how to apply it here.
My code so far:
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('filepath.csv')
df.drop(df[df['LabeledDataColumn'].isnull()].index.tolist(),inplace=True)
scale = StandardScaler()
pd.options.mode.chained_assignment = None # default='warn'
X = df[['inputColumn1', 'inputColumn2', ...,'inputColumn20']]
y = df['LabeledDataColumn']
X[['inputColumn1', 'inputColumn2', ...,'inputColumn20']] = scale.fit_transform(X[['inputColumn1', 'inputColumn2', ...,'inputColumn20']].values)
#print (X)
est = sm.OLS(y, X).fit()
est.summary()
You could use the get_dummies function pandas provides and convert the categorical values.
Something like this..
predictor = pd.concat([data.get(['numerical_column_1', 'numerical_column_2', 'label']),
                       pd.get_dummies(data['categorical_column1'], prefix='Categorical_col1'),
                       pd.get_dummies(data['categorical_column2'], prefix='categorical_col2')],
                      axis=1)
Then you could get the outcome/label column by doing:
outcome = predictor['label']
del predictor['label']
Then call the model on the data:
est = sm.OLS(outcome, predictor).fit()
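If you would rather stay inside scikit-learn, here is a hedged sketch using OneHotEncoder through a ColumnTransformer (the numeric and categorical column names below are illustrative, not from the original post):
import pandas as pd
import statsmodels.api as sm
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv('filepath.csv')
df = df.dropna(subset=['LabeledDataColumn'])

numeric_cols = ['inputColumn1', 'inputColumn2']   # illustrative numeric predictors
categorical_cols = ['crop_variety', 'plot_code']  # illustrative categorical predictors

# sparse_threshold=0 forces a dense array, which statsmodels expects
preprocess = ColumnTransformer(
    [('num', StandardScaler(), numeric_cols),
     ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)],
    sparse_threshold=0)

X = preprocess.fit_transform(df)
y = df['LabeledDataColumn']

est = sm.OLS(y, sm.add_constant(X)).fit()
print(est.summary())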
This question is a continuation of another question of mine at this link.
I am working on the Random Forest algorithm for classification in Spark MLlib using PySpark. My sample dataset looks like this:
Level1,Male,New York,New York,352.888890
Level1,Male,San Fransisco,California,495.8001345
Level2,Male,New York,New York,-495.8001345
Level1,Male,Columbus,Ohio,165.22352099
Level3,Male,New York,New York,495.8
Level4,Male,Columbus,Ohio,652.8
Level5,Female,Stamford,Connecticut,495.8
Level1,Female,San Fransisco,California,495.8001345
Level3,Male,Stamford,Connecticut,-552.8234
Level6,Female,Columbus,Ohio,7000
As you can see, the fields are in non-numeric format and so require encoding before being passed to the model. The last value in each row is a numeric field in string format (unicode), and some of the values have a - sign in front of them. Whenever the features are, say, Level1,Male,New York,New York, the prediction will be 352.888890, so 352.888890 acts as a category rather than just a numeric value. I wrote this code, where I read the data and form a training_set RDD, then encode the non-numeric fields and form an RDD of LabeledPoint before passing it to the model for classification. This is my current code:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils
import numpy as np
from sklearn.feature_extraction import DictVectorizer as DV
import pandas as pd
import sqlite3
from pyspark.ml.feature import StringIndexer
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import col
from pyspark.mllib.tree import RandomForest, RandomForestModel
def extract(line):
return (line[0],line[1],line[2],line[3],line[4].lstrip('-'))
input_file = sc.textFile('file1.csv').zipWithIndex().filter(lambda (line,rownum): rownum>0).map(lambda (line, rownum): line)
input_data = (input_file
.map(lambda line: line.split(","))
.filter(lambda line: len(line) >1 )
.map(extract)) # Map to tuples
# Divide the input data in training and test set with 80%-20% ratio
(training_data, test_data) = input_data.randomSplit([0.8, 0.2])
# the column in training_data which is label - a numeric field in string format
label_col = "x4"
# converting RDD to dataframe
training_data_df = training_data.toDF(("x0","x1","x2","x3","x4"))
# Indexers encode strings with doubles
string_indexers = [
StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))
for x in training_data_df.columns if x != label_col
]
# Assembles multiple columns into a single vector
assembler = VectorAssembler(
inputCols=["idx_{0}".format(x) for x in training_data_df.columns if x != label_col],
outputCol="features"
)
pipeline = Pipeline(stages=string_indexers + [assembler])
model = pipeline.fit(training_data_df)
indexed = model.transform(training_data_df)
label_points = (indexed
.select(col(label_col).cast("double").alias("label"), col("features"))
.map(lambda row: LabeledPoint(row.label, row.features)))
feature1 = training_data.map(lambda x: x[0]).distinct().collect()
feature2 = training_data.map(lambda x: x[1]).distinct().collect()
feature3 = training_data.map(lambda x: x[2]).distinct().collect()
feature4 = training_data.map(lambda x: x[3]).distinct().collect()
label_set = training_data.map(lambda x: x[4]).distinct().collect()
model_classifier = RandomForest.trainClassifier(label_points,numClasses=len(label_set),categoricalFeaturesInfo={0: len(feature1), 1: len(feature2), 2: len(feature3),3: len(feature4)},
numTrees=50, featureSubsetStrategy="auto",
impurity='gini', maxDepth=10, maxBins=max([len(feature1),len(feature2),len(feature3),len(feature4)]))
When I run this code I get the following error: java.lang.IllegalArgumentException: GiniAggregator given label -495.8001345 but requires label is non-negative.
The problem is that some label values are negative numbers. How can I use negative numeric values to denote a category rather than a number?
In the Spark source code, the Gini impurity logic checks that the label is in the range between 0 and numClasses; see the source below:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala
After doing some research, I found someone point out that the labels causing the problem need to be transformed into a range that the Gini impurity can handle properly:
http://apache-spark-user-list.1001560.n3.nabble.com/Random-Forest-Error-td23847.html
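A minimal sketch of that transformation, assuming the pipeline above has already produced the indexed DataFrame (the column name label_idx is illustrative): run a StringIndexer on the label column so every distinct label string, including the negative-valued ones, is mapped to a class index in 0..numClasses-1:
from pyspark.ml.feature import StringIndexer
from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import col

# Map every distinct label string (including negative-valued ones) to 0..numClasses-1
label_indexer = StringIndexer(inputCol=label_col, outputCol="label_idx")
indexed_with_label = label_indexer.fit(indexed).transform(indexed)

# In Spark 2+ use .rdd.map instead of .map on the DataFrame
label_points = (indexed_with_label
    .select(col("label_idx").alias("label"), col("features"))
    .map(lambda row: LabeledPoint(row.label, row.features)))
The numClasses and categoricalFeaturesInfo arguments from the question's RandomForest.trainClassifier call can then be reused unchanged.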
Hope this helps.