I've been trying to load a CSV file into scikit-learn via pandas, with the target column holding 20 categorical range labels. I tried label_binarize but that didn't seem to do any good, so after some reading I switched to LabelEncoder, but it doesn't appear to change much.
import pandas as pd
from io import StringIO
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import permutation_test_score
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc, confusion_matrix
from sklearn.model_selection import train_test_split, ShuffleSplit
from sklearn.preprocessing import label_binarize, MultiLabelBinarizer, LabelEncoder
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
#loading the data
data=pd.read_csv("data.csv")
y = data.iloc[:,19]
X = data.iloc[:,1:18+20:22]
#Binarize the output
le = LabelEncoder()
le.fit(["0-1","1-1.5","1.5-2","2-2.5","2.5-3","3-3.5","3.5-4","4-4.5","4.5-5","5-5.5","5.5-6","6-6.5","6.5-7","7-7.5","7.5-8","8-8.5","8.5-9","9-9.5","9.5-10","10+"
])
le.transform(y)
y = label_binarize(y, classes=le.classes_)
n_classes = y.shape[1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0)
model3 = KNeighborsClassifier(n_neighbors=7)
yet when I run this I get:
Traceback (most recent call last):
File "file, line 30, in <module>
le.transform(y)
File "C:\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 149, in transform
classes = np.unique(y)
File "\Anaconda3\lib\site-packages\numpy\lib\arraysetops.py", line 198, in unique
ar.sort()
TypeError: '>' not supported between instances of 'str' and 'float'
Is this kind of target data even possible for scikit?
Ok, so to solve this issue I found you needed to surround the categorical data itself with quotation marks, like this: "0-1"
Otherwise Python would read it as a number rather than a label, leaving the column with a mix of strings and floats, and get confused. The data now loads correctly.
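If you'd rather not hand-edit the CSV, a minimal sketch of an equivalent fix (assuming the target really is the range labels in column 19) is to force every label to a string before encoding, so np.unique never has to compare str with float:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, label_binarize

data = pd.read_csv("data.csv")
# cast the target column to str so the labels are uniformly typed
y = data.iloc[:, 19].astype(str)
le = LabelEncoder()
le.fit(y)
# one-hot encode against the classes the encoder learned
y = label_binarize(y, classes=le.classes_)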
The following is my code. I am running it in IDLE on Python 3.8.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction import DictVectorizer
from sklearn import trees
from sklearn.metrics import accuracy_score,classification_report
import warnings
from sklearn.preprocessing import StandardScalar
from sklearn.neural_networks import MLPClassifier
warnings.filterwarnings(action='ignore',category=DeprecationWarning)
data=pd.read_csv('data.csv')
cols_to_retain=[]
x-feature=data[cols_to_retain]
x_dict=x_feature.T.to_dict.values()
vect=DictVectorizer(sparse=False)
x_vector=vect.fit_transform(x_dict)
print(x_vector)
x_train=[:-1]
x_test=[-1:]
print('Train set')
print(x_train)
print('Test set')
print(x_test)
le=LabelEncoder
y_train=le.fit_transform(data['Goal'][:-1])
clf=tree.DecisionTreeClassifier(criteron='entropy')
clf=clf.fit_transform(x_train,y_train)
print('Test Data')
print(le.inverse_transform(clf.predict(x_test)))
It shows me an error for these particular lines; it only says invalid syntax:
x_train=[:-1]
x_test=[-1:]
The packages are imported correctly.
Your code contains multiple issues:
The import should be StandardScaler not StandardScalar,
You have unused imports like MLPClassifier,
cols_to_retain is empty, so data[cols_to_retain] will return an empty data frame,
to_dict should be to_dict(),
variable names x-feature and x_feature do not match,
LabelEncoder is missing brackets (),
x_train=[:-1] and x_test=[-1:] is not valid. You probably wanted to select a subset like x_train = x_vector[:-1] or x_test = x_vector[-1:]. Please add additional sample data, if you need help with this selection.
Here is an updated version of your code:
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier
data = pd.read_csv("data.csv")
print(data)
cols_to_retain = []
x_feature = data[cols_to_retain]
x_dict = x_feature.T.to_dict().values()
vect = DictVectorizer(sparse=False)
x_vector = vect.fit_transform(x_dict)
print(x_vector)
x_train = x_vector[:-1]
x_test = x_vector[-1:]
print("Train set")
print(x_train)
print("Test set")
print(x_test)
le = LabelEncoder()
y_train = le.fit_transform(data["Goal"][:-1])
clf = DecisionTreeClassifier(criterion="entropy")
clf = clf.fit(x_train, y_train)
print("Test Data")
print(le.inverse_transform(clf.predict(x_test)))
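Note that with cols_to_retain left empty, x_vector will be empty as well, so the snippet still has nothing to train on. It needs real feature columns from your CSV, for example (hypothetical names, substitute your own):

cols_to_retain = ["Outlook", "Temperature", "Wind"]  # hypothetical feature columns from data.csv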
I have a data set like this: it's 343 columns of binary data, and it is sparsely encoded (i.e. there are many more 0s than 1s):
column1 ... column343
0 0 ... 0
1 0 ... 0
2 0 ... 0
3 0 ... 0
4 0 ... 0
.. ... ... ...
214 0 ... 0
215 0 ... 0
216 0 ... 0
217 0 ... 0
218 0 ... 0
[219 rows x 343 columns]
(219, 343)
Could someone please explain to me how to fix the issue where this script:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold,KFold
from sklearn.feature_selection import SelectKBest
#from xgboost import XGBClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import SelectKBest, RFECV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score, accuracy_score, precision_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import make_scorer
from sklearn.metrics import precision_score,recall_score,f1_score,roc_auc_score
from sklearn import metrics
from sklearn.datasets import make_classification
from numpy import mean
from sklearn.model_selection import train_test_split
from numpy import std
from sklearn.utils import shuffle
import numpy as np
from sklearn.metrics import roc_curve
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
import pickle
#import neptune.new as neptune
import pandas as pd
import shap
df = pd.read_csv('train.txt',sep='\t') #hard-coded
full_y_train = df['Event']
df = df.drop(['Event'],axis=1)
full_X_train = df
def run_model_with_grid_search(param_grid={}, output_plt_file='plt.png', model_name=RandomForestClassifier(), X_train=full_X_train, y_train=full_y_train, model_id='random_forest_with_hpo_no_fs_geno_class', n_splits=5, output_file='random_forest_with_hpo_no_fs_geno_class.txt'):
    list_shap_values = list()
    list_test_sets = list()
    cv_outer = KFold(n_splits=5, shuffle=True, random_state=1)
    for train_ix, test_ix in cv_outer.split(X_train):
        split_x_train, split_x_test = X_train.iloc[train_ix, :], X_train.iloc[test_ix, :]
        split_y_train, split_y_test = y_train.iloc[train_ix], y_train.iloc[test_ix]
        model = model_name
        cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
        search = GridSearchCV(model, param_grid=param_grid, scoring='roc_auc', cv=cv_inner, refit=True)
        result = search.fit(split_x_train, split_y_train)
        best_model = result.best_estimator_
        yhat = best_model.predict(split_x_test)
        explainer = shap.TreeExplainer(result.best_estimator_)
        shap_values = explainer.shap_values(split_x_test, check_additivity=False)
        list_shap_values.append(shap_values)
        list_test_sets.append(test_ix)
    test_set = list_test_sets[0]
    shap_values = np.array(list_shap_values[0])
    for i in range(1, len(list_test_sets)):
        test_set = np.concatenate((test_set, list_test_sets[i]), axis=0)
        shap_values = np.concatenate((shap_values, np.array(list_shap_values[i])), axis=1)
    X_test_df = pd.DataFrame(full_X_train[test_set])
    cols = X_test_df.columns
    shap_sum = np.abs(shap_values[1, :, :]).mean(0)
    importance_df = pd.DataFrame({
        'column_name': cols,
        'shap_values': shap_sum
    })
    print(importance_df)
    return

param_grid = [{
    'min_samples_leaf': [1, 3, 5],
}]

run_model_with_grid_search(param_grid=param_grid)
Generates the error:
Traceback (most recent call last):
File "/home/data/ml_models_genotypic_only_fortest.py", line 103, in <module>
run_model_with_grid_search(param_grid=param_grid)
File "/home/data/ml_models_genotypic_only_fortest.py", line 80, in run_model_with_grid_search
X_test_df = pd.DataFrame(full_X_train[test_set])
File "/home/apps/easybuild/software/SciPy-bundle/2021.10-foss-2021b/lib/python3.9/site-packages/pandas/core/frame.py", line 3464, in __getitem__
indexer = self.loc._get_listlike_indexer(key, axis=1)[1]
File "/home/apps/easybuild/software/SciPy-bundle/2021.10-foss-2021b/lib/python3.9/site-packages/pandas/core/indexing.py", line 1314, in _get_listlike_indexer
self._validate_read_indexer(keyarr, indexer, axis)
File "/home/apps/easybuild/software/SciPy-bundle/2021.10-foss-2021b/lib/python3.9/site-packages/pandas/core/indexing.py", line 1374, in _validate_read_indexer
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Int64Index([ 0, 4, 11, 16, 18, 19, 28, 29, 31, 33,\n ...\n 156, 157, 175, 178, 192, 203, 204, 207, 211, 215],\n dtype='int64', length=219)] are in the [columns]"
I do not get this error if I remove check_additivity=False from the script; however, without the check_additivity parameter I instead get the error:
shap.utils._exceptions.ExplainerError: Additivity check failed in TreeExplainer! Please ensure the data matrix you pass to the explainer is the same data shape that the model was trained on. If your data shape is correct, then please report this on GitHub.
Consider retrying with the feature_perturbation='interventional' option. This check failed because for one of the samples, the sum of the SHAP values is 0.908553, while the model output was 0.940000. If this difference is acceptable, you can set check_additivity=False to disable this check.
If I replace my data set with a fake data set:
full_X_train, full_y_train = make_classification(n_samples=500, n_features=20, random_state=1, n_informative=10, n_redundant=10)
I do not get the error.
So whether I leave check_additivity=False in or take it out, my real data gives me two different errors, and I'm not sure how to get around this.
It's hard to debug your code as it's not reproducible, but you may follow the code snippet below, which "just runs":
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold,KFold
from sklearn.feature_selection import SelectKBest
from sklearn.datasets import load_breast_cancer
#from xgboost import XGBClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import SelectKBest, RFECV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score, accuracy_score, precision_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import make_scorer
from sklearn.metrics import precision_score,recall_score,f1_score,roc_auc_score
from sklearn import metrics
from sklearn.datasets import make_classification
from numpy import mean
from sklearn.model_selection import train_test_split
from numpy import std
from sklearn.utils import shuffle
import numpy as np
from sklearn.metrics import roc_curve
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
import pickle
#import neptune.new as neptune
import pandas as pd
import shap
full_X_train, full_y_train = load_breast_cancer(return_X_y=True, as_frame=True)
def run_model_with_grid_search(
    param_grid={},
    output_plt_file="plt.png",
    model_name=RandomForestClassifier(),
    X_train=full_X_train,
    y_train=full_y_train,
    model_id="random_forest_with_hpo_no_fs_geno_class",
    n_splits=5,
    output_file="random_forest_with_hpo_no_fs_geno_class.txt",
):
    list_shap_values = list()
    list_test_sets = list()
    cv_outer = KFold(n_splits=5, shuffle=True, random_state=1)
    for train_ix, test_ix in cv_outer.split(X_train):
        split_x_train, split_x_test = (
            X_train.iloc[train_ix, :],
            X_train.iloc[test_ix, :],
        )
        split_y_train, split_y_test = y_train[train_ix], y_train[test_ix]
        model = model_name
        cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
        search = GridSearchCV(
            model, param_grid=param_grid, scoring="roc_auc", cv=cv_inner, refit=True
        )
        result = search.fit(split_x_train, split_y_train)
        best_model = result.best_estimator_
        yhat = best_model.predict(split_x_test)
        explainer = shap.TreeExplainer(result.best_estimator_)
        shap_values = explainer.shap_values(split_x_test, check_additivity=False)
        list_shap_values.append(shap_values)
    shap_values = np.vstack([sv[1] for sv in list_shap_values])
    sv = np.abs(shap_values.mean(0))
    cols = X_train.columns
    importance_df = pd.DataFrame({"column_name": cols, "shap_values": sv})
    return importance_df

param_grid = [{"min_samples_leaf": [1, 3, 5]}]

importance_df = run_model_with_grid_search(param_grid=param_grid)
print(importance_df)
column_name shap_values
0 mean radius 0.000202
1 mean texture 0.000585
2 mean perimeter 0.000728
3 mean area 0.000541
4 mean smoothness 0.000867
5 mean compactness 0.000098
6 mean concavity 0.000759
7 mean concave points 0.003325
8 mean symmetry 0.000033
9 mean fractal dimension 0.000349
...
Note: the above code runs on my machine with both True and False for the check_additivity param.
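As a side note on the original KeyError: full_X_train[test_set] asks pandas to look up the integers in test_set as column labels, which is why none of them are found in a frame with named columns. If you do want the rows gathered from the outer folds, positional row selection (a sketch against the variable names above, not something I can verify without the real data) would be:

X_test_df = full_X_train.iloc[test_set]  # .iloc selects rows by position; plain [] matches column labels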
My goal is to do text classification using supervised ML algorithms. I'm at the stage where I need to encode my words so that the computer can understand them. I'm trying the vectorize method, but I get the error 'Series' object has no attribute 'lower'. Is there another solution for preparing data for sentiment analysis? Or am I on the right path and just need to work out how to vectorize the words? My data is shown in the picture and my code is below:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn import preprocessing
#tostr = svietimas_data['text'].astype(str).tolist()
print(svietimas_data)
tfidf = TfidfVectorizer(max_features=3000)
X = svietimas_data['text']
y = svietimas_data['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X = tfidf.fit_transform([X])
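The error comes from tfidf.fit_transform([X]): wrapping the Series in a list makes the whole Series a single "document", so the vectorizer calls .lower() on it and fails. A minimal sketch of the usual approach (assuming svietimas_data has plain-text 'text' and 'sentiment' columns) is to pass the Series itself, and to fit the vectorizer on the training split only:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

tfidf = TfidfVectorizer(max_features=3000)
X = svietimas_data['text'].astype(str)  # make sure every entry is a plain string
y = svietimas_data['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train_vec = tfidf.fit_transform(X_train)  # learn the vocabulary on the training texts
X_test_vec = tfidf.transform(X_test)        # reuse that vocabulary for the test texts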
Can I ask: when I run this code, it produces an output without error:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.model_selection import cross_val_score, cross_val_predict,cross_validate
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import chi2, f_regression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import SelectKBest
#from xgboost import XGBRegressor
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import mutual_info_regression
from sklearn.feature_selection import mutual_info_classif
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectKBest
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.pipeline import Pipeline
from scipy.stats import spearmanr
from sklearn.svm import SVR
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, precision_recall_curve, auc, make_scorer, recall_score, accuracy_score, precision_score, confusion_matrix,classification_report
import pickle
import numpy as np
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import make_scorer
from sklearn.metrics import precision_score,recall_score
from sklearn.datasets import make_classification
#Generate fake data
X, y = make_classification(n_samples=5000, n_classes=2, n_features=20, n_redundant=0,random_state=0) #fake data
X_train = X[:4500] #.iloc for df
y_train = y[:4500]
X_test = X[4500:]#.reset_index(drop=True,inplace=True)
y_test = y[4500:]
scorers = {
    'precision_score': make_scorer(precision_score),
    'recall_score': make_scorer(recall_score),
    'accuracy_score': make_scorer(accuracy_score)
}
def run_SVC(X_train, y_train, X_test, y_test, output_file, data_name, refit_score='precision_score'):
    '''
    run SVC algorithm, with CV and hyperparameter tuning.
    '''
    short_dataname = data_name.strip().split('/')
    file_model_name = output_file + '_svc_' + short_dataname[-1]
    clf = SVC()
    skf = StratifiedKFold(n_splits=2, random_state=42, shuffle=True)
    #fs = SelectKBest(score_func=mutual_info_classif)
    pipeline = Pipeline(steps=[('svc', clf)])  #,('sel',fs)
    print(pipeline.get_params().keys())
    search = GridSearchCV(
        pipeline,
        param_grid={
            'svc__C': [0.01, 0.1, 10, 1000],  ## Regularization
            'svc__gamma': [0.0001, 0.01, 1, 10],
            'svc__kernel': ['linear', 'rbf'],
        },
        return_train_score=True,
        verbose=3,
        refit=refit_score,
        scoring=scorers,
        cv=skf,
        n_jobs=-1,
    )
    search.fit(X_train, y_train)
    # make the predictions
    y_pred = search.predict(X_test)
    print('Best params for {}'.format(refit_score))
    print(search.best_params_)
    print(classification_report(y_test, y_pred))  # labels=['neg','pos']
    return

print(run_SVC(X_train, y_train, X_test, y_test, 'test.txt', 'dataset'))
When I comment back in the only two commented-out pieces (fs = SelectKBest(score_func=mutual_info_classif), and the ('sel', fs) step on the line after it), I get the error:
TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SVC()' (type <class 'sklearn.svm._classes.SVC'>) doesn't
I can see that other people have addressed this on SO before, e.g. here, so I tried to follow that person's answer, but my SelectKBest is already before my pipeline; when I move the line with fs higher up in my code (which I thought was what the answer was saying), I get the same error.
Could someone show me where I'm going wrong here and what I'm meant to change to remove this error?
The order of the steps in a Pipeline matters, and only the last step can be a non-transformer like your svc.
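In the snippet above, commenting the step back in appends it after the classifier (steps=[('svc', clf), ('sel', fs)]), which makes SVC an intermediate step, hence the error. Putting the selector first satisfies the rule; a sketch against the variable names above:

from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

fs = SelectKBest(score_func=mutual_info_classif)
clf = SVC()
pipeline = Pipeline(steps=[('sel', fs), ('svc', clf)])  # transformers first, the estimator last

The existing svc__C / svc__gamma grid keys keep working, and you could tune the selector the same way with a sel__k entry.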
I am getting this error in C:/Users/HP/.PyCharmCE2019.1/config/scratches/scratch.py:

Traceback (most recent call last):
  File "C:/Users/HP/.PyCharmCE2019.1/config/scratches/scratch.py", line 25, in <module>
    dtree.fit(x_train,y_train)
  File "C:\Users\HP\PycharmProjects\untitled\venv\lib\site-packages\sklearn\tree\tree.py", line 801, in fit
    X_idx_sorted=X_idx_sorted)
  File "C:\Users\HP\PycharmProjects\untitled\venv\lib\site-packages\sklearn\tree\tree.py", line 236, in fit
    "number of samples=%d" % (len(y), n_samples))
ValueError: Number of labels=45 does not match number of samples=36
I am using a DecisionTree model but I am getting an error. Help will be appreciated.
#importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#reading the dataset
df=pd.read_csv(r'C:\csv\kyphosis.csv')
print(df)
print(df.head())
#visualising the dataset
print(sns.pairplot(df,hue='Kyphosis',palette='Set1'))
plt.show()
#training and testing
from sklearn.modelselection import traintestsplit
c=df.drop('Kyphosis',axis=1)
d=df['Kyphosis']
xtrain,ytrain,xtest,ytest=traintestsplit(c,d,testsize=0.30)
#Decision_Tree
from sklearn.tree import DecisionTreeClassifier
dtree=DecisionTreeClassifier()
dtree.fit(xtrain,ytrain)
#Predictions
predictions=dtree.predict(xtest)
from sklearn.metrics import classificationreport,confusionmatrix
print(classificationreport(ytest,predictions))
print(confusionmatrix(y_test,predictions))
The expected result should be my classification_report and confusion_matrix.
So, the error is thrown by the function dtree.fit(xtrain, ytrain), because xtrain and ytrain have unequal lengths.
Checking the part of code that generates it:
xtrain,ytrain,xtest,ytest=traintestsplit(c,d,testsize=0.30)
and comparing to the example in the documentation
import numpy as np
from sklearn.model_selection import train_test_split
[...]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
you can see two things:
1. traintestsplit should be train_test_split
2. by changing the order of the variables to the left of the =, you assign different data to these variables.
so, your code should be:
xtrain, xtest, ytrain, ytest = train_test_split(c, d, test_size=0.30)
(note that the keyword argument is test_size, with an underscore, just like the function name)
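For completeness, a minimal sketch of the corrected training and evaluation section; the other names in the post look like they lost their underscores the same way, so this assumes df is the kyphosis DataFrame loaded above:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

c = df.drop('Kyphosis', axis=1)
d = df['Kyphosis']
# the X splits come back first, then the y splits
xtrain, xtest, ytrain, ytest = train_test_split(c, d, test_size=0.30)

dtree = DecisionTreeClassifier()
dtree.fit(xtrain, ytrain)

predictions = dtree.predict(xtest)
print(classification_report(ytest, predictions))
print(confusion_matrix(ytest, predictions))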