How to output predicted values as strings in Excel? - Python

I was able to output my predicted numerical values into an Excel file, but I am wondering whether it is possible to export the associated string instead of the numerical value.
Currently it looks like this:

Column 1             | Answer Key | Predicted
Something Something  | Cars       | 3
Instead of it returning 3, I would like 3 to be replaced by the actual string it is associated with (for example, Truck).
Here is my code so far. I know I have to change the exporting part of the code, but I cannot figure it out.
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

texts = df['without_Tags'].astype('str')
vector = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=.95)
X = vector.fit_transform(texts)  # features
LE = LabelEncoder()
df['tower_values'] = LE.fit_transform(df['Tower'])
y = df['tower_values'].values
print(X.shape)
print(y.shape)
lsa = TruncatedSVD(n_components=100, n_iter=10, random_state=3)
X = lsa.fit_transform(X)
print(X.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, shuffle=True, stratify=y, random_state=3)
SG = SGDClassifier(random_state=3, loss='log')
SG.fit(X_train, y_train)
y_pred = SG.predict(X_test)
print("SG model accuracy:", accuracy_score(y_test, y_pred))
print("SG model Recall:", recall_score(y_test, y_pred, average="macro"))
print("SG model Precision:", precision_score(y_test, y_pred, average="macro"))
print("SG model F1 Score:", f1_score(y_test, y_pred, average="macro"))
# to_csv writes the file and returns None, so keep the DataFrame in a variable
predictions = pd.DataFrame(y_pred, columns=['predictions'])
predictions.to_csv('prediction.csv', index=False)

Try using the inverse_transform method of LabelEncoder to map the encoded predictions back to their original string labels:
y_pred = LE.inverse_transform(y_pred)
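
For example, here is a minimal sketch of the full round trip with a hypothetical label column, showing that inverse_transform recovers the original strings before export:

from sklearn.preprocessing import LabelEncoder
import pandas as pd

LE = LabelEncoder()
towers = pd.Series(['Cars', 'Truck', 'Cars', 'Boat'])
encoded = LE.fit_transform(towers)       # classes are sorted alphabetically: Boat=0, Cars=1, Truck=2
decoded = LE.inverse_transform(encoded)  # back to ['Cars', 'Truck', 'Cars', 'Boat']
pd.DataFrame({'predictions': decoded}).to_csv('prediction.csv', index=False)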

Related

How can I make a table for Shapley values with Python?

I made a random forest regression with widget filters for the y and x variables, and I also wanted to report Shapley values by creating a graph and a table with one column for the variable and one column for its Shapley value. The code plots the graph, but the table is not showing.
So far my code looks like this:
x = widgets.SelectMultiple(
    options=list(dataset.select_dtypes('number').columns),
    disabled=False,
    value=("NUMBER_SPOTS",)
)

def randomforest(y, x):
    x = dataset[list(x)]
    y = dataset[y]
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
    shap.initjs()
    model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
    y_predict = model.predict(X_test)
    print('Root Mean Squared Error:', mean_squared_error(y_test, y_predict) ** 0.5)
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_train)
    shap.summary_plot(shap_values, features=X_train, feature_names=X_train.columns, plot_size=[15, 8])
    shap_vals = shap_values[0, :]
    feature_importance = pd.DataFrame(list(zip(X_train.columns, shap_vals)), columns=['X_train', 'shap_vals'])
    feature_importance.sort_values(by=['shap_vals'], ascending=False, inplace=True)
    feature_importance

interact(randomforest, y=list(dataset.select_dtypes('number').columns), x=x)
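
The table most likely fails to appear because a bare expression at the end of a function body is not rendered by Jupyter; only the last expression of a cell is echoed. A minimal sketch of the end of randomforest(), assuming it runs in a notebook, which also uses the more conventional mean |SHAP| per feature (shap_values[0, :] is the SHAP values of a single row, not a global importance):

import numpy as np
from IPython.display import display

# global importance: mean absolute SHAP value per feature across all rows
mean_abs_shap = np.abs(shap_values).mean(axis=0)
feature_importance = pd.DataFrame(
    {'feature': X_train.columns, 'mean_abs_shap': mean_abs_shap}
).sort_values('mean_abs_shap', ascending=False)

# display() renders the frame even inside a function body
display(feature_importance)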

ROC AUC score calculation giving nan

I am trying to calculate the ROC AUC score for my data, but it results in nan.
The code:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score

scoring = 'roc_auc'
kfold = KFold(n_splits=10, random_state=42, shuffle=True)
model = LinearDiscriminantAnalysis()
results = cross_val_score(model, df_n, y, cv=kfold, scoring=scoring)
print("AUC: %.3f (%.3f)" % (results.mean(), results.std()))
df_n is an array of the normalised values; I also tried it with just the X values from the dataset. y is an array of binary values.
df_n shape: (150, 4)
y shape: (150,)
I am stumped; it should work!
The problem is that roc_auc_score expects class probabilities rather than hard predictions in the multi-class case, but with that code the scorer is fed the output of predict instead; the resulting error is silently converted to nan because error_score defaults to np.nan. Use a custom scorer that requests probabilities:
from sklearn.metrics import roc_auc_score, make_scorer
from sklearn.model_selection import cross_validate

multi_roc_scorer = make_scorer(
    lambda y_true, y_proba: roc_auc_score(y_true, y_proba, multi_class='ovr'),
    needs_proba=True)
scores = cross_validate(model, df_n, y, scoring=multi_roc_scorer, cv=kfold, error_score='raise')
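
Note that cross_validate returns a dict of arrays; the per-fold scores are under 'test_score':

print(scores['test_score'].mean())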

Predict future values using linear regression

So I've made a model for value prediction using linear regression, and now I need it to predict values for the years 2022-2024. How can I do it? Maybe add rows for 2023-2024 to the dataframe? But would that be correct?
My data preparation and model code:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

data['Year'] = pd.to_datetime(data['Year'])
data.index = data['Year']
data.drop(['Year'], axis=1, inplace=True)
data = data.bfill().ffill()
y = data['x4']
X = data[['x1','x3','x5','x6','x7','x8','x9','x10','x11','x14','x15','x17']]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)
print(model.score(X_train, y_train))
print(model.score(X_test, y_test))
If you want to predict, you just have to pass in the new data:
X_new = new_data[['x1','x3','x5','x6','x7','x8','x9','x10','x11','x14','x15','x17']]
y_new = model.predict(X_new)
model is your trained linear regression; predict with the new data in the same column order and format as your X_train/X_test, and that's it.
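
Note that the model needs feature values for every future year, so predicting 2023-2024 means supplying (or forecasting) x1...x17 for those years. A minimal sketch, assuming you simply carry the last observed feature values forward (a strong assumption; the predictions are only as good as those feature guesses):

# hypothetical future rows: repeat the last observed feature values for two years
future = pd.DataFrame(
    {col: [data[col].iloc[-1]] * 2 for col in X.columns},
    index=pd.to_datetime(['2023', '2024'])
)
future_pred = model.predict(future)
print(future_pred)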

Cross-validation with KFold

I'm trying to use three binary explanatory variables from a banking history (default, housing, and loan) to predict the binary response variable using a Logistic Regression classifier.
I have the following code:
# map text no/yes to integer 0/1
convert_to_binary = {'no': 0, 'yes': 1}
default = bank['default'].map(convert_to_binary)
housing = bank['housing'].map(convert_to_binary)
loan = bank['loan'].map(convert_to_binary)
response = bank['response'].map(convert_to_binary)
I added my three explanatory variables and the response to an array and ran k-fold cross-validation:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

data = np.array([np.array(default), np.array(housing), np.array(loan), np.array(response)]).T
kfold = KFold(n_splits=3)
scores = []
for train_index, test_index in kfold.split(data):
    X_train, X_test = data[train_index], data[test_index]
    y_train, y_test = response[train_index], response[test_index]
    model = LogisticRegression().fit(X_train, y_train)
    pred = model.predict(data[test_index])
    results = model.score(X_test, y_test)
    scores.append(results)
print(np.mean(scores))
My accuracy is always 100%, which I know is not correct; it should be somewhere around 50-65%.
Is there something I'm doing wrong?
The problem is not the split itself: data includes response as its fourth column, so the classifier is trained on the label it is asked to predict, which is why accuracy is always 100%. Build the feature array without the response, then split:
from sklearn.metrics import accuracy_score

data = np.array([np.array(default), np.array(housing), np.array(loan)]).T  # features only

X_train, y_train = data[train_index], response.iloc[train_index]
X_test, y_test = data[test_index], response.iloc[test_index]
model = LogisticRegression().fit(X_train, y_train)
pred = model.predict(X_test)
acc = accuracy_score(y_test, pred)
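
As a cross-check, here is a minimal sketch of the same evaluation in one call, assuming the corrected features-only data array:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(LogisticRegression(), data, response, cv=KFold(n_splits=3))
print(scores.mean())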

Got an error using KFold split with multilabel data in sklearn

I would like to do k-fold cross-validation. The code before k-fold cross-validation looks like this, and it works perfectly:
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

df = pd.read_csv('finalupdatedothers-multilabel.csv')
X = df[['sentences']]
dfy = df[['ADR','WD','EF','INF','SSI','DI','others']]
df1 = dfy.stack().reset_index()
df1.columns = ['a','b','c']
y_train_text = df1.groupby('a')['b'].apply(list)
lb = preprocessing.MultiLabelBinarizer()
# Run classifier
stop_words = stopwords.words('english')
classifier = make_pipeline(CountVectorizer(),
                           TfidfTransformer(),
                           #SelectKBest(chi2, k=4),
                           OneVsRestClassifier(SGDClassifier()))
#combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
random_state = np.random.RandomState(0)
# Split into training and test
X_train, X_test, y_train, y_test = train_test_split(X, y_train_text, test_size=.2,
                                                    random_state=random_state)
print(y_train)
# Binarize the output classes
Y = lb.fit_transform(y_train)
Y_test = lb.transform(y_test)
y_score = classifier.fit(X_train, Y).decision_function(X_test)
print("y_score" + str(y_score))
predicted = classifier.predict(X_test)
all_labels = lb.inverse_transform(predicted)
print("accuracy : " + str(accuracy_score(Y_test, predicted)))
print("micro f-measure " + str(f1_score(Y_test, predicted, average='weighted')))
print("precision " + str(precision_score(Y_test, predicted, average='weighted')))
print("recall " + str(recall_score(Y_test, predicted, average='weighted')))
for item, labels in zip(X_test, all_labels):
    print('%s => %s' % (item, ', '.join(labels)))
When I change the code to use k-fold cross-validation instead of train_test_split, I get this error:
ValueError: Found input variables with inconsistent numbers of samples: [1, 6008]
Updated with iloc. My code for k-fold cross-validation looks like this:
kf = KFold(n_splits=10)
kf.get_n_splits(X)
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y_train_text.iloc[train_index], y_train_text.iloc[test_index]
Would you please let me know which part I'm doing incorrectly?
My data looks like this:
,sentences,ADR,WD,EF,INF,SSI,DI,others
0,"extreme weight gain, short-term memory loss, hair loss.",1.0,,,,,,
1,I am detoxing from Lexapro now.,,,,,,,1.0
2,I slowly cut my dosage over several months and took vitamin supplements to help.,,,,,,,1.0
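
For what it's worth, the [1, 6008] mismatch is consistent with the shape of X: df[['sentences']] is a one-column DataFrame, and iterating over a DataFrame yields its column names, so CountVectorizer sees a single "document" while y has 6008 rows. A sketch of the k-fold loop, assuming X is passed as the Series df['sentences'] and fitting happens inside the loop:

X = df['sentences'].astype(str)  # a Series of documents, not a one-column DataFrame
kf = KFold(n_splits=10)
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y_train_text.iloc[train_index], y_train_text.iloc[test_index]
    Y_train = lb.fit_transform(y_train)
    Y_test = lb.transform(y_test)
    classifier.fit(X_train, Y_train)
    predicted = classifier.predict(X_test)
    print(accuracy_score(Y_test, predicted))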
