I am trying to get the class names I used with AutoKeras.
Both sentences and labels are strings.
import pandas as pd
import autokeras as ak
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv", names=['text', 'category'])
sentences = df.text.values
labels = df.category.values

x_train, x_test, y_train, y_test = train_test_split(sentences, labels, test_size=0.10, random_state=2021)

clf = ak.TextClassifier(overwrite=True, max_trials=2)
clf.fit(x_train, y_train, epochs=400)
predicted_y = clf.predict(x_test)
clf.evaluate(x_test, y_test)
Now when I export the model to Keras and run a prediction on a sample, I only get an index number.
model = clf.export_model()
probas = model.predict(["Hello world"])
predicted_class = probas.argmax(axis=-1)
predicted_class returns:
[20]
Mapping this back to a label assumes I have an ordered list of the classes that were used, or a dict.
How can I know which string was assigned the index 20?
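One workaround is to build the mapping empirically, assuming clf.predict decodes predictions back to the original string labels while the exported Keras model returns raw probabilities (the name mapping below is purely illustrative):

import numpy as np

# indices from the exported Keras model
probas = model.predict(x_test)
indices = probas.argmax(axis=-1)
# original string labels from the AutoKeras wrapper on the same samples
string_labels = np.array(clf.predict(x_test)).flatten()
# build the index -> class-name mapping by pairing the two predictions
mapping = {int(i): s for i, s in zip(indices, string_labels)}
print(mapping.get(20))  # the string that was assigned index 20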
I am starting to write a machine learning model. I have a Y_train dataset containing the labels, where there are 5 classes, and an X_train dataset containing the samples. I am trying to build my model using logistic regression.
X_train (shape (560, 20531)) and Y_train (shape (560, 5)) have the same number of rows.
I have seen a few posts about the same problem, but I have not been able to solve it.
I don't know how to correct this error. Can you help me please?
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

X = pd.read_csv('/Users/lottie/desktop/data.csv', header=None, skiprows=[0])
Y = pd.read_csv('/Users/lottie/desktop/labels.csv', header=None)

# integer-encode the five class names, then one-hot encode
Y_encoded = list()
for i in Y.loc[0:, 1]:
    if i == 'BRCA': Y_encoded.append(0)
    if i == 'KIRC': Y_encoded.append(1)
    if i == 'COAD': Y_encoded.append(2)
    if i == 'LUAD': Y_encoded.append(3)
    if i == 'PRAD': Y_encoded.append(4)
Y_bis = to_categorical(Y_encoded)

# separation of the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y_bis, test_size=0.30, random_state=42)

regression_log = linear_model.LogisticRegression(multi_class='multinomial', solver='newton-cg')
X_train = X_train.iloc[:, 1:]

# train model
train_train = regression_log.fit(X_train, Y_train)
You get that error because your label is categorical. You need to use a label encoder to encode it into 0, 1, 2, ...; check out the LabelEncoder page in the scikit-learn docs. Below is an implementation using an example dataset similar to yours:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn import preprocessing

le = preprocessing.LabelEncoder()

# example data with the same labels and row count as yours
Y = pd.DataFrame({'label': np.random.choice(['BRCA', 'KIRC', 'COAD', 'LUAD', 'PRAD'], 560)})
X = pd.DataFrame(np.random.normal(0, 1, (560, 5)))

# encode the string labels as integers 0..4
Y_encoded = le.fit_transform(Y['label'])

X_train, X_test, Y_train, Y_test = train_test_split(X, Y_encoded, test_size=0.30, random_state=42)
regression_log = linear_model.LogisticRegression(multi_class='multinomial', solver='newton-cg')
X_train = X_train.iloc[:, 1:]
train_train = regression_log.fit(X_train, Y_train)
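If you later need the original class names back, the fitted encoder can invert the integer predictions. A quick sketch; note that X_test must get the same column slice that X_train got:

preds = train_train.predict(X_test.iloc[:, 1:])
print(le.inverse_transform(preds[:5]))  # back to 'BRCA', 'KIRC', ...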
I'm using CatBoost for my model, and the following code raises the error shown below it:
import numpy as np
from catboost import Pool, CatBoostClassifier, cv
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split data
train = data[:split]
test = data[split:]

# Get variables for a model
x = train.drop(["Survived"], axis=1)
y = train["Survived"]

# Do train data splitting
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

cat_features = np.where(x.dtypes != float)[0]

cat = CatBoostClassifier(one_hot_max_size=7, iterations=21, random_seed=42, use_best_model=True, eval_metric='Accuracy')
cat.fit(X_train, y_train, cat_features=cat_features, eval_set=(X_test, y_test))
pred = cat.predict(X_test)

pool = Pool(X_train, y_train, cat_features=cat_features)
cv_scores = cv(pool, cat.get_params(), fold_count=10, plot=True)
print('CV score: {:.5f}'.format(cv_scores['test-Accuracy-mean'].values[-1]))
print('The test accuracy is :{:.6f}'.format(accuracy_score(y_test, cat.predict(X_test))))
...which raises:
---> 23 cv_scores = cv(pool, cat.get_params(), fold_count=10)
CatBoostError: catboost/libs/metrics/metric.cpp:5069: Cannot calc
metric which requires logits for absolute values.
Help would be appreciated. Thanks.
You need to add the parameter loss_function='Logloss' inside CatBoostClassifier(). This issue is described here; it was supposedly fixed, but it has now appeared again. I will reopen that issue, because this is clearly a bug.
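With that one change, the classifier from the question becomes:

cat = CatBoostClassifier(
    loss_function='Logloss',  # explicit loss so cv() can compute the 'Accuracy' metric
    one_hot_max_size=7,
    iterations=21,
    random_seed=42,
    use_best_model=True,
    eval_metric='Accuracy')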
I'm trying to follow this article to perform over-sampling for imbalanced classification. My class ratio is about 8:1.
https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets/notebook
I am confused about the pipeline and coding structure.
Should you over-sample after train/test splitting?
If so, how do you deal with the fact that the target label is dropped from X? I tried keeping the label in X, performing the over-sampling, then dropping the label from X_train/X_test and swapping the new training set into my pipeline.
However, I get the error "Found input variables with inconsistent numbers of samples", because the shapes are now inconsistent: the over-sampled training frame has roughly doubled, with a 50/50 label distribution.
I understand the issue, but how does one solve this problem when wanting to perform over-sampling to reduce class imbalance?
X = df
#X = df.drop("label", axis=1)
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=11,
                                                    shuffle=True,
                                                    stratify=y)
target_count = df.label.value_counts()
print('Class 1:', target_count[0])
print('Class 0:', target_count[1])
print('Proportion:', round(target_count[0] / target_count[1], 2), ': 1')
target_count.plot(kind='bar', title='Count (target)');
# Class counts (label 1 is the majority class)
count_class_1, count_class_0 = X_train.label.value_counts()

# Divide by class
df_class_1 = X_train[X_train['label'] == 1]
df_class_0 = X_train[X_train['label'] == 0]

# over-sample the minority class with replacement up to the majority count
df_class_0_over = df_class_0.sample(count_class_1, replace=True)
df_test_over = pd.concat([df_class_1, df_class_0_over], axis=0)

print('Random over-sampling:')
print(df_test_over.label.value_counts())
Random over-sampling:
1 12682
0 12682
df_test_over.label.value_counts().plot(kind='bar', title='Count (target)')
# drop label for new X_train and X_test
X_train_OS = df_test_over.drop("label", axis=1)
X_test = X_test.drop("label", axis=1)
print(X_train_OS.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(25364, 9)
(3552, 9)
(14207,)
(3552,)
cat_transformer = Pipeline(steps=[
    ('cat_imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('cat_ohe', OneHotEncoder(handle_unknown='ignore'))])

num_transformer = Pipeline(steps=[
    ('num_imputer', SimpleImputer(strategy='constant', fill_value=0)),
    ('num_scaler', StandardScaler())])

text_transformer_0 = Pipeline(steps=[
    ('text_bow', CountVectorizer(lowercase=True,
                                 token_pattern=SPLIT_PATTERN,
                                 stop_words=stopwords))])
# SelectKBest()
# TruncatedSVD()

text_transformer_1 = Pipeline(steps=[
    ('text_bow', CountVectorizer(lowercase=True,
                                 token_pattern=SPLIT_PATTERN,
                                 stop_words=stopwords))])
# SelectKBest()
# TruncatedSVD()

FE = ColumnTransformer(
    transformers=[
        ('cat', cat_transformer, CAT_FEATURES),
        ('num', num_transformer, NUM_FEATURES),
        ('text0', text_transformer_0, TEXT_FEATURES[0]),
        ('text1', text_transformer_1, TEXT_FEATURES[1])])

pipe = Pipeline(steps=[('feature_engineer', FE),
                       ("scales", MaxAbsScaler()),
                       ('rand_forest', RandomForestClassifier(n_jobs=-1, class_weight='balanced'))])

random_grid = {"rand_forest__max_depth": [3, 10, 100, None],
               "rand_forest__n_estimators": sp_randint(10, 100),
               "rand_forest__max_features": ["auto", "sqrt", "log2", None],
               "rand_forest__bootstrap": [True, False],
               "rand_forest__criterion": ["gini", "entropy"]}

strat_shuffle_fold = StratifiedKFold(n_splits=5,
                                     random_state=123,
                                     shuffle=True)
cv_train = RandomizedSearchCV(pipe, param_distributions=random_grid, cv=strat_shuffle_fold)
cv_train.fit(X_train_OS, y_train)
from sklearn.metrics import classification_report, confusion_matrix
preds = cv_train.predict(X_test)
print(confusion_matrix(y_test, preds))
print(classification_report(y_test, preds))
The problem you are having here is solved very easily (and arguably more elegantly) by SMOTE. It is easy to use and lets you keep the X_train, X_test, y_train, y_test syntax from train_test_split, because it performs the over-sampling on X and y at the same time.
from imblearn.over_sampling import SMOTE

# split first, then over-sample only the training set
X_train, X_test, y_train, y_test = train_test_split(X, y)
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)
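If you also want the over-sampling to happen inside cross-validation, imblearn ships its own Pipeline that accepts a sampler step and applies it only when fitting, never when predicting. A minimal sketch, assuming the features coming out of your FE ColumnTransformer are suitable for SMOTE:

from imblearn.pipeline import Pipeline as ImbPipeline

pipe = ImbPipeline(steps=[
    ('feature_engineer', FE),           # ColumnTransformer from the question
    ('smote', SMOTE(random_state=42)),  # applied only to the training folds
    ('rand_forest', RandomForestClassifier(n_jobs=-1))])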
So I believe I solved my own question. The problem was how I was splitting the data: I normally follow the standard X_train, X_test, y_train, y_test train_test_split, but it was causing the row-count mismatch between X_train and y_train when over-sampling. So I did the following instead, and everything appears to be working. Please let me know if anyone has any recommendations! Thanks!
features = df_
target = df_l["label"]

train_set, test_set = train_test_split(features, test_size=0.2,
                                       random_state=11,
                                       shuffle=True)
print(train_set.shape)
print(test_set.shape)
(11561, 10)
(2891, 10)
count_class_1, count_class_0 = train_set.label.value_counts()
# Divide by class
df_class_1 = train_set[train_set['label'] == 1]
df_class_0 = train_set[train_set['label'] == 0]
# over-sample class 0 with replacement up to the class 1 count
df_class_0_over = df_class_0.sample(count_class_1, replace=True)
df_train_OS = pd.concat([df_class_1, df_class_0_over], axis=0)
print('Random over-sampling:')
print(df_train_OS.label.value_counts())
1 10146
0 10146
df_train_OS.label.value_counts().plot(kind='bar', title='Count (target)');
X_train_OS = df_train_OS.drop("label", axis=1)
y_train_OS = df_train_OS["label"]
X_test = test_set.drop("label", axis=1)
y_test = test_set["label"]
print(X_train_OS.shape)
print(y_train_OS.shape)
print(X_test.shape)
print(y_test.shape)
(20295, 9)
(20295,)
(2891, 9)
(2891,)
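From here the balanced training set lines up row-for-row with its labels, so the search from the question (cv_train above) can be fit unchanged:

cv_train.fit(X_train_OS, y_train_OS)
preds = cv_train.predict(X_test)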
I have trained a classifier on the 'Rocks and Mines' dataset
(https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data)
When calculating the accuracy score, it always seems to be perfect (the output is 1.0), which I find hard to believe. Am I making any mistakes, or is naive Bayes really this powerful?
import urllib.request
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data'
data = urllib.request.urlopen(url)
df = pd.read_csv(data)

# replace R and M with 1 and 0
m = len(df.iloc[:, -1])
Y = df.iloc[:, -1].values
y_val = []
for i in range(m):
    if Y[i] == 'M':
        y_val.append(1)
    else:
        y_val.append(0)

df = df.drop(df.columns[-1], axis=1)  # dropping column containing 'R', 'M'
X = df.values

# initializing the classifier
clf = GaussianNB()

# splitting the data
train_x, test_x, train_y, test_y = train_test_split(X, y_val, test_size=0.33, random_state=42)

# training the classifier
clf.fit(train_x, train_y)
pred = clf.predict(test_x)  # making a prediction

score = accuracy_score(pred, test_y)
# printing the accuracy score
print(score)
X is the input and y_val is the output (I have converted 'R' and 'M' into 0's and 1's).
This is because of the random_state argument inside the train_test_split() function.
When you set random_state to an integer, sklearn ensures that your data sampling is constant.
That means that every time you run it with the same random_state, you will get the same result; this is expected behavior.
Refer to the docs for further details.
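A quick way to see this is to vary the seed and watch the split (and hence the score) change. A sketch reusing the names from the question:

# different random_state values give different splits and usually different scores
for seed in (0, 1, 42):
    train_x, test_x, train_y, test_y = train_test_split(X, y_val, test_size=0.33, random_state=seed)
    clf = GaussianNB().fit(train_x, train_y)
    print(seed, accuracy_score(test_y, clf.predict(test_x)))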