I am trying to get the class names that AutoKeras used for my labels.
Both sentences and labels are strings.
import pandas as pd
import autokeras as ak
from sklearn.model_selection import train_test_split
df = pd.read_csv("train.csv", names=['text', 'category'])
sentences = df.text.values
labels = df.category.values
x_train, x_test, y_train, y_test = train_test_split(sentences, labels, test_size=0.10, random_state=2021)
clf = ak.TextClassifier(overwrite=True, max_trials=2)
clf.fit(x_train, y_train, epochs=400)
predicted_y = clf.predict(x_test)
clf.evaluate(x_test, y_test)
Now when I export the model to Keras and run a prediction on a sample, I only get an index number.
model = clf.export_model()
probas = model.predict(["Hello world"])
predicted_class = probas.argmax(axis=-1)
predicted_class returns:
[20]
Interpreting this requires a list (or a dict) of the classes in the order the model used them.
How can I know which string was assigned the index 20?
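One library-agnostic way to recover the mapping, sketched under the assumption that the exported model is reasonably accurate on its own training data: predict on samples whose labels you already know, then take the most frequent true label for each predicted index. The helper name `infer_index_to_label` and the sample data are hypothetical, not part of AutoKeras.

```python
from collections import Counter, defaultdict


def infer_index_to_label(predicted_indices, true_labels):
    """Map each predicted class index to the true label it most often co-occurs with."""
    votes = defaultdict(Counter)
    for idx, label in zip(predicted_indices, true_labels):
        votes[int(idx)][label] += 1
    # Majority vote per index, so a few misclassified samples do not break the mapping.
    return {idx: counter.most_common(1)[0][0] for idx, counter in votes.items()}


# Hypothetical stand-ins for argmax(model output) on training text and the known labels.
predicted_indices = [0, 2, 1, 2, 0, 1, 2]
true_labels = ["business", "sport", "politics", "sport",
               "business", "politics", "business"]

index_to_label = infer_index_to_label(predicted_indices, true_labels)
print(index_to_label)
```

With the objects from the question, `predicted_indices` would presumably be `model.predict(x_train).argmax(axis=-1)` and `true_labels` would be `y_train`. An index the model never predicts will be missing from the result.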
How do I create an array or dataframe to store seedN, clf.score(X_test, y_test),n_neighbors?
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

for seedN in range(1, 50, 1):
    X_train, X_test, y_train, y_test = train_test_split(
        indicators, data2['target'], test_size=0.25, random_state=seedN)
    training_accuracy = []
    test_accuracy = []
    neighbors_settings = range(1, 70)  # try n_neighbors from 1 to 69
    for n_neighbors in neighbors_settings:
        clf = KNeighborsClassifier(n_neighbors=n_neighbors)  # build the model
        clf.fit(X_train, y_train)
        training_accuracy.append(clf.score(X_train, y_train))  # record training set accuracy
        test_accuracy.append(clf.score(X_test, y_test))  # record generalization accuracy
Create a temporary empty list to store the results:
tmp = []
For each fit, append a new list with the desired values:
for seedN in range(1, 50, 1):
    # your code
    for n_neighbors in neighbors_settings:
        # your code
        tmp.append([seedN, clf.score(X_test, y_test), n_neighbors])
Finally, create the dataframe from this temporary list:
df = pd.DataFrame(tmp, columns=["seedN", "score", "n_neighbors"])
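Putting it together, here is a self-contained sketch of the same pattern. It uses synthetic data from `make_classification` as a stand-in for `indicators` and `data2['target']` (an assumption, since the original data isn't shown), smaller ranges so it runs quickly, and dicts instead of lists so the column names travel with each row.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the asker's features and target (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

rows = []
for seed in range(1, 4):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    for n_neighbors in range(1, 6):
        clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_train, y_train)
        rows.append({
            "seedN": seed,
            "n_neighbors": n_neighbors,
            "train_score": clf.score(X_train, y_train),
            "test_score": clf.score(X_test, y_test),
        })

# One row per (seed, n_neighbors) combination.
results = pd.DataFrame(rows)

# Average test accuracy per n_neighbors across seeds.
mean_by_k = results.groupby("n_neighbors")["test_score"].mean()
print(mean_by_k)
```

Once everything is in one dataframe, a `groupby` like the one at the end is a convenient way to compare settings across seeds.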
I have two nominal attributes of a product (category and producer) plus its price, and I am trying to determine whether, within any given category, a producer usually has a higher price. In other words, I am trying to measure the impact of a brand on price. I used the Python code below, but it fails with this error:
Supported target types are: ('binary', 'multiclass'). Got 'unknown' instead.
Could you please help me to resolve this?
# Load dataset
path = "Sales.xlsx"
names = ['Category', 'Producer', 'Average_base_price']
dataset = read_excel(path, dtype={'Average_base_price': float}, names=names)
# creating instance of labelencoder
labelencoder = LabelEncoder()
array = dataset.values
# Split-out validation dataset
X, y = array[:, :-1], array[:, -1]
X[:, 0] = labelencoder.fit_transform(X[:, 0])
X[:, 1] = labelencoder.fit_transform(X[:, 1])
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
You get that error because your dependent variable is continuous and you are trying to do a stratified k-fold split, which only makes sense for class labels.
If it is like you said:
In other words, I am trying to measure the impact of a brand on price.
Then your dependent variable should be price, and your independent variables should be the nominal values. You should also use one-hot encoding instead of label encoding, because these are predictors, not labels.
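To see where 'unknown' comes from, here is a small check with scikit-learn's `type_of_target`, which is the helper `StratifiedKFold` consults before splitting. The sample prices are made up. Because `dataset.values` mixes strings and floats, the array has dtype object, and so does the `y` sliced from it:

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Made-up prices standing in for Average_base_price.
prices = np.array([1.5, 2.25, 3.1])

float_kind = type_of_target(prices)                  # 'continuous'
object_kind = type_of_target(prices.astype(object))  # 'unknown'
label_kind = type_of_target(np.array([0, 1, 1, 0]))  # 'binary'

print(float_kind, object_kind, label_kind)
```

Neither 'continuous' nor 'unknown' is a target type that stratification accepts, so casting the price to float alone would not make the original classifiers applicable.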
Using an example dataset:
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np
import pandas as pd
dataset = pd.DataFrame({'Category': np.random.choice(['A', 'B', 'C'], 100),
                        'Producer': np.random.choice(['l', 'm', 'n'], 100),
                        'Average_base_price': np.random.uniform(0, 1, 100)})
One-hot encode the predictors:
enc = OneHotEncoder(handle_unknown='ignore')
X = enc.fit_transform(dataset[['Category','Producer']])
y = dataset[['Average_base_price']]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)
Then fit the model. In this case you can use, for example, a simple linear regression:
model = LinearRegression()
cv_results = cross_val_score(model, X_train, Y_train, cv=10)
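To actually read off the brand effect, one option is to fit the regression and compare the producer coefficients. This is a sketch on hypothetical, noise-free data where the price is a category base plus a producer premium (both dicts are invented), so the recovered gap can be checked by eye. The feature names are built by hand from `enc.categories_`.

```python
import itertools

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: price = category base + producer premium, no noise.
base = {"A": 10.0, "B": 20.0, "C": 30.0}
premium = {"l": 0.0, "m": 1.0, "n": 2.0}
rows = [(c, p, base[c] + premium[p]) for c, p in itertools.product(base, premium)]
dataset = pd.DataFrame(rows, columns=["Category", "Producer", "Average_base_price"])

predictors = ["Category", "Producer"]
enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(dataset[predictors]).toarray()  # densify the sparse output
y = dataset["Average_base_price"]

model = LinearRegression().fit(X, y)

# Pair each coefficient with a readable "<column>_<level>" name.
feature_names = [f"{col}_{cat}"
                 for col, cats in zip(predictors, enc.categories_)
                 for cat in cats]
coef = pd.Series(model.coef_, index=feature_names)

# Only differences between levels of the same column are meaningful here,
# because a full one-hot encoding plus an intercept is collinear.
producer_gap = coef["Producer_n"] - coef["Producer_l"]
print(coef)
print("n vs l price gap:", producer_gap)
```

On real data the gap is an estimate rather than an exact value, and the `cross_val_score` call above (which reports R² by default for regressors) indicates how much of the price variation the two nominal variables explain.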
I am working through the heart disease machine learning practice problem, using the dataset from Kaggle.
I split the data into a training set and a test set, then combined several models into a single function to fit and score them. When I run it, an error shows up in Jupyter Notebook.
Here's my code:
# Split data into X and y
X = df.drop("target", axis=1)
y = df["target"]
Splitting
# Split data into train and test sets
np.random.seed(42)
# Split into train & test set
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
Prediction function
# Put models in a dictionary
models = {"Logistic Regression": LogisticRegression(),
          "KNN": KNeighborsClassifier(),
          "Random Forest": RandomForestClassifier()}
# Create a function to fit and score models
def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates given machine learning models.
    models : a dict of different Scikit-Learn machine learning models
    X_train : training data (no labels)
    X_test : testing data (no labels)
    y_train : training labels
    y_test : test labels
    """
    # Set random seed
    np.random.seed(42)
    # Make a dictionary to keep model scores
    model_scores = {}
    # Loop through models
    for name, model in models.items():
        # Fit the model to the data
        model.fit(X_train, y_train)
        # Evaluate the model and append its score to model_scores
        model_scores[name] = model.score(X_test, y_test)
    return model_scores
And when I run this code, the error shows up:
model_scores = fit_and_score(models=models,
                             X_train=X_train,
                             X_test=X_test,
                             y_train=y_train,
                             y_test=y_test)
model_scores
This is the error:
Your X_train, y_train, or both, seem to have entries that are not float numbers.
At some point in the code, try using
X_train = X_train.astype(float)
y_train = y_train.astype(float)
X_test = X_test.astype(float)
y_test = y_test.astype(float)
Either this will work and the error will go away, or one of the conversions will fail, at which point you will need to decide how (or if) you want to encode the data as a float.
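If a conversion does fail, a quick way to find the offending columns is to coerce everything to numeric and see which cells turn into NaN. A minimal sketch, assuming `X_train` is a pandas DataFrame; the column names and the '?' placeholder are invented for illustration:

```python
import pandas as pd

# Invented sample: 'chol' contains a non-numeric placeholder.
X_train = pd.DataFrame({
    "age": ["63", "37", "41"],
    "chol": ["233", "?", "204"],
})

# Values that cannot be parsed as numbers become NaN instead of raising.
converted = X_train.apply(pd.to_numeric, errors="coerce")

# Cells that were present originally but are NaN after coercion are the culprits.
bad_cells = converted.isna() & X_train.notna()
bad_columns = bad_cells.columns[bad_cells.any()].tolist()

print(bad_columns)
print(X_train[bad_cells.any(axis=1)])
```

From there you can decide whether to drop those rows, impute the values, or encode the column some other way.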
I am trying to learn the decision tree regressor and have written the code below.
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=100)
model = DecisionTreeRegressor(random_state=1)
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
I want to create a dataframe that includes X_test, y_test, and y_pred.
Is there a method or function for that?
Append the code below to the end of your prediction code:
final_df = X_test.copy()
final_df["Y_original"] = y_test
final_df["Y_predicted"] = y_pred
Here we create a new dataframe, final_df, and put all the values you need into it. I would not suggest adding the values directly to X_test, as you may need it again for prediction.
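For reference, here is an end-to-end sketch of the same idea on a tiny made-up dataset (the `feature` and `abs_error` columns are my additions). Note that `y_test` is a Series and is aligned on the index of `X_test`, while `y_pred` is a NumPy array and is assigned by position, which is fine because it was produced from `X_test` in the same row order.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Made-up data: the target is simply twice the feature.
x = pd.DataFrame({"feature": range(20)})
y = pd.Series([2.0 * v for v in range(20)], name="target")

X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=100)

model = DecisionTreeRegressor(random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Copy so that X_test itself stays usable for further predictions.
final_df = X_test.copy()
final_df["Y_original"] = y_test    # aligned by index
final_df["Y_predicted"] = y_pred   # assigned by position
final_df["abs_error"] = (final_df["Y_original"] - final_df["Y_predicted"]).abs()

print(final_df)
```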
I'm using CatBoost for my model, and the following code gives me the error below:
from catboost import Pool, CatBoostClassifier, cv
#Split data
train = data[:split]
test = data[split:]
# Get variables for a model
x = train.drop(["Survived"], axis=1)
y = train["Survived"]
#Do train data splitting
X_train, X_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=42)
cat_features = np.where(x.dtypes != float)[0]
cat = CatBoostClassifier(one_hot_max_size=7, iterations=21, random_seed=42, use_best_model=True, eval_metric='Accuracy')
cat.fit(X_train, y_train, cat_features = cat_features, eval_set=(X_test, y_test))
pred = cat.predict(X_test)
pool = Pool(X_train, y_train, cat_features=cat_features)
cv_scores = cv(pool, cat.get_params(), fold_count=10, plot=True)
print('CV score: {:.5f}'.format(cv_scores['test-Accuracy-mean'].values[-1]))
print('The test accuracy is :{:.6f}'.format(accuracy_score(y_test, cat.predict(X_test))))
...which raises:
---> 23 cv_scores = cv(pool, cat.get_params(), fold_count=10)
CatBoostError: catboost/libs/metrics/metric.cpp:5069: Cannot calc metric which requires logits for absolute values.
Help would be appreciated. Thanks.
You need to pass loss_function='Logloss' to CatBoostClassifier(). This issue is described here and was supposedly fixed, but it has now reappeared. I will reopen the issue because this is clearly a bug.