So the following code never prints the accuracy.
#!/usr/bin/python

"""
    This is the code to accompany the Lesson 2 (SVM) mini-project.

    Use a SVM to identify emails from the Enron corpus by their authors:
    Sara has label 0
    Chris has label 1
"""

import sys
from time import time
sys.path.append("../tools/")
from email_preprocess import preprocess
from sklearn import svm
from sklearn.metrics import accuracy_score


### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()
clf=svm.SVC(kernel='linear')
clf.fit(features_train, labels_train)
pred=clf.predict(features_test)
print(accuracy_score(labels_test, pred))
I am trying to find out why the line print(accuracy_score(labels_test, pred)) does not print anything at all. It should print some value. What could be the issue?
When I cap the number of iterations as in the line below, it does print something. I have seen people normally use 1000 iterations:
clf=svm.SVC(kernel='linear',max_iter=100)
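This thread has no answer, so the following is only a possible explanation, not a confirmed one: the script may still be busy inside clf.fit. The training time of SVC grows faster than linearly with the number of samples, so a large training set can take a long time before anything is printed, and capping max_iter simply stops the solver early (possibly before it has converged). A minimal sketch of how to check this, assuming synthetic data from make_classification as a stand-in for the email features, is to time the fit on a small stratified subset first:

```python
from time import time

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the email features (hypothetical data, not the Enron corpus).
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
features_train, features_test, labels_train, labels_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Keep a stratified 10% of the training rows to get a quick first result.
features_small, _, labels_small, _ = train_test_split(
    features_train, labels_train,
    train_size=0.1, stratify=labels_train, random_state=0,
)

clf = SVC(kernel="linear")

# Timing the fit shows whether training is where the script spends its time.
t0 = time()
clf.fit(features_small, labels_small)
print("training time: %.3f s" % (time() - t0))

pred = clf.predict(features_test)
accuracy = accuracy_score(labels_test, pred)
print("accuracy:", accuracy)
```

If the small subset trains quickly and prints an accuracy, the full run is probably just slow rather than broken.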
I would like to know whether it is possible to determine the influence of specific variables when a model scores a single test sample. The example below clarifies the question.
Given a dataset used to predict students' scores:
      ID  Studies hours  Games hours  lectures hours  social Activities  Score
0      1             20            5              15                  2     78
1      2             15            6              13                  3     69
2      3             31            2              16                  1     95
3      4             22            2              15                  2     80
4      5             19            7              15                  4     71
5      6             10            8              10                  8     52
6      7             13            7              11                  6     59
7      8             34            1              16                  1     96
8      9             25            6              15                  1     83
9     10             22            3              16                  2     76
10    11             17            7              15                  1     66
11    12             28            2              14                  2     87
12    13             21            3              16                  3     77
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from numpy import absolute
from xgboost import XGBModel
import pickle
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline
from xgboost import plot_importance
data = pd.read_csv("student_score.csv")
def perfomance(data):
    X = data.iloc[:,:-1]
    y = data.iloc[:,-1:]
    model = XGBModel(booster='gbtree')
    #model = XGBModel(booster='gblinear')
    model.fit(X, y)
    cv = RepeatedKFold(n_splits=3, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X,y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
    # force scores to be positive
    scores = np.absolute(scores)
    metrics = ('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()) )
    # save the model to disk
    filename = 'score.sav'
    pickle.dump(model, open(filename, 'wb'))
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
    # load the model from disk
    loaded_model = pickle.load(open('score.sav', 'rb'))
    result = loaded_model.predict(X_test)
    print(result)
    plt.rcParams["figure.figsize"] = (20,15)
    plot_importance(model)
    plt.show()
Feature Importances :
[5.6058721e-04 6.7560148e-01 3.1960118e-01 4.2312010e-03 5.4962843e-06]
The feature importance is the general importance ranked by the model.
What I need now is:
When I pick a test sample, say test = pd.DataFrame([{"Studies hours":15, "Games hours":6, "lectures hours":13,"social Activities":3}]),
and predict with loaded_model.predict(test), I get a score like 68. Which of the variables specifically (not the general importance) kept this particular sample from scoring 100 and left it at 68?
For example, the model should tell me that the study hours were bad or lower than expected.
Can a machine learning model do that?
The topic you're describing is called model explainability or interpretability. The more sophisticated the model, the more accurate it is, but the harder it is to explain (really generally speaking). SHAP values are the most common way I see folks explaining the effect of each feature on predictions generally, and each feature value on the prediction for a given observation. The most common visualization of SHAP values is the force plot. It looks like this:
The blog from which I took this image explains how to build a force plot for any model: Explain Any Models with the SHAP Values — Use the KernelExplainer
You can explain the model's decision for a specific example using SHAP. A waterfall or force plot can show why the model scored 68 for that example based on the input variables.
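To make the idea of a per-sample breakdown concrete, here is a minimal sketch. Note the assumptions: it swaps the XGBoost model for a plain LinearRegression and does not use the SHAP library at all. For a linear model the contribution of each feature can be written down directly as coefficient times (sample value minus average value), which is the same kind of additive breakdown a waterfall or force plot displays. The rows are copied from the table in the question; the variable names are only illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

feature_names = ["Studies hours", "Games hours", "lectures hours", "social Activities"]

# Rows copied from the table in the question (ID column left out).
X = np.array([
    [20, 5, 15, 2], [15, 6, 13, 3], [31, 2, 16, 1], [22, 2, 15, 2],
    [19, 7, 15, 4], [10, 8, 10, 8], [13, 7, 11, 6], [34, 1, 16, 1],
    [25, 6, 15, 1], [22, 3, 16, 2], [17, 7, 15, 1], [28, 2, 14, 2],
    [21, 3, 16, 3],
], dtype=float)
y = np.array([78, 69, 95, 80, 71, 52, 59, 96, 83, 76, 66, 87, 77], dtype=float)

# Stand-in model: linear regression instead of the question's XGBModel.
model = LinearRegression().fit(X, y)

# The test sample from the question.
sample = np.array([[15, 6, 13, 3]], dtype=float)
prediction = model.predict(sample)[0]

# Baseline: the average prediction over the data set.
baseline = model.predict(X).mean()

# Per-feature contribution: how far this sample's value pushes the
# prediction away from the baseline.
contributions = model.coef_ * (sample[0] - X.mean(axis=0))

for name, value in zip(feature_names, contributions):
    print(f"{name:>18}: {value:+.2f}")
print(f"baseline {baseline:.2f} {contributions.sum():+.2f} = prediction {prediction:.2f}")
```

A negative contribution means that feature value pulled this sample's predicted score below the average prediction. For a tree model such as the one in the question, the SHAP library computes the analogous quantities.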
I have the past demand, in kilometers travelled by customers, for 2003-2020 in an Excel file named transport_demand.xlsx, and I have to predict the demand in 2050 using a linear regression model.
The figures look like this:
Year Transport Demand
0 2003 1070000000000
1 2004 1090000000000
2 2005 1090000000000
3 2006 109900000
4 2007 1100000000000
5 2008 1110000000000
6 2009 1120000000000
7 2010 1120000000000
8 2011 1130000000000
9 2012 1140000000000
10 2013 1140000000000
11 2014 1160000000000
12 2015 1180000000000
13 2016 1200000000000
14 2017 1160000000000
15 2018 1160000000000
16 2019 1170000000000
17 2020 943000000000
I am OK with statistics, so I thought of taking 5-year averages, but since the data has only 18 rows, a 5-year or 10-year average is difficult. I am new to Python, so training and testing with such a small dataset is very confusing to me. I am unsure how to go forward or what to code.
import pandas as pd
demand=pd.read_excel('transport_demand.xlsx')
Then I used the following code to check for outliers.
demand.describe()
Since all the values appeared to lie within the range suggested by the mean and standard deviation, I assume that there are no outliers here.
Then I plotted the data to see the trend.
# making Date value a true date-time
demand["Year"] = pd.to_datetime(demand["Year"], format="%Y")
# plot demand dataframe
ax = demand.plot("Year", "Transport Demand",color='green', marker='o')
The ups and downs left me confused about how to predict, but I still tried to go forward.
import numpy as np
import seaborn as sns
# Equation: Demand = β0 + β1*Year + e
#Setting the value for X and Y
x = demand[['Year']]
y = demand['Transport Demand']
#Splitting the dataset
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 100)
#Fitting the Linear Regression model
from sklearn.linear_model import LinearRegression
slr = LinearRegression()
slr.fit(x_train, y_train)
I hope I am correct so far, but I want the model to predict the 2050 demand. How do I do this?
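A minimal sketch of the prediction step, under a few assumptions that differ from the code above: Year is kept as a plain number rather than converted to a datetime, the model is fitted on all 18 rows without a train/test split, and the values are typed in from the table as given (including the 2006 entry, which is far smaller than its neighbours and is left unmodified here). Extrapolating 30 years ahead with a straight line is itself a strong assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Values typed in from the table in the question, unchanged.
years = np.arange(2003, 2021)
demand = np.array([
    1.07e12, 1.09e12, 1.09e12, 1.099e8,   # 2003-2006 (2006 as given)
    1.10e12, 1.11e12, 1.12e12, 1.12e12,   # 2007-2010
    1.13e12, 1.14e12, 1.14e12, 1.16e12,   # 2011-2014
    1.18e12, 1.20e12, 1.16e12, 1.16e12,   # 2015-2018
    1.17e12, 9.43e11,                     # 2019-2020
])

# scikit-learn expects a 2-D feature array: one row per year, one column.
x = years.reshape(-1, 1)

# Demand = b0 + b1 * Year, fitted on every available row.
slr = LinearRegression().fit(x, demand)

# Predicting a new year only needs that year in the same 2-D shape.
demand_2050 = slr.predict(np.array([[2050]]))[0]

print("slope per year:", slr.coef_[0])
print("intercept:", slr.intercept_)
print("predicted demand in 2050:", demand_2050)
```

With the variable names from the question, the equivalent call would be slr.predict on a one-row, one-column input holding 2050, provided Year was left numeric when the model was fitted.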
I want to build a classifier, but I'm having trouble finding sources that clearly explain the Keras functions and how to go about what I'm trying to do. I want to use the following data:
0 1 2 3 4 5 6 7
0 Name TRY LOC OUTPUT TYPE_A SIGNAL A-B SPOT
1 inc 1 2 20 TYPE-1 TORPEDO ULTRA A -21
2 inc 2 3 16 TYPE-2 TORPEDO ILH B -14
3 inc 3 2 20 BLACK47 TORPEDO LION A 49
4 inc 4 3 12 TYPE-2 CENTRALPA LION A 25
5 inc 5 3 10 TYPE-2 THREE LION A -21
6 inc 6 2 20 TYPE-2 ATF LION A -48
7 inc 7 4 2 NIVEA-1 ATF LION B -23
8 inc 8 3 16 NIVEA-1 ATF LION B 18
9 inc 9 3 18 BLENDER CENTRALPA LION B 48
10 inc 10 4 20 DELCO ATF LION B -26
11 inc 11 3 20 VE248 ATF LION B 44
12 inc 12 1 20 SILVER CENTRALPA LION B -35
13 inc 13 2 20 CALVIN3 SEVENX LION B -20
14 inc 14 3 14 DECK-BT CENTRALPA LION B -38
15 inc 15 4 4 10-LEVI BERWYEN OWL B -29
16 inc 16 4 14 TYPE-2 ATF NOV B -31
17 inc 17 4 10 NYNY TORPEDO NOV B 21
18 inc 18 2 20 NIVEA-1 CENTRALPA NOV B 45
19 inc 19 3 27 FMRA97 TORPEDO NOV B -26
20 inc 20 4 18 SILVER ATF NOV B -46
I want to use columns 1, 2, 4, 5, 6, 7 as input, and the output would be column 3 (OUTPUT).
The code I currently have is:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
import numpy as np
from sklearn import metrics
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.callbacks import EarlyStopping
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.text import one_hot
df = pd.read_csv("file.csv")
df.drop('Name', axis=1, inplace=True)
obj_df = df.select_dtypes(include=['object']).copy()
# print(obj_df.head())
obj_df["OUTPUT"] = obj_df["OUTPUT"].astype('category')
obj_df["TYPE_A"] = obj_df["TYPE_A"].astype('category')
obj_df["SIGNAL"] = obj_df["SIGNAL"].astype('category')
obj_df["A-B"] = obj_df["A-B"].astype('category')
# obj_df.dtypes
obj_df["OUTPUT_cat"] = obj_df["OUTPUT"].cat.codes
obj_df["TYPE_A_cat"] = obj_df["TYPE_A"].cat.codes
obj_df["SIGNAL_cat"] = obj_df["SIGNAL"].cat.codes
obj_df["A-B_cat"] = obj_df["A-B"].cat.codes
# print(obj_df.head())
df2 = df[['TRY', 'LOC', 'SPOT']]
df3 = obj_df[['OUTPUT_cat', 'TYPE_A_cat', 'SIGNAL_cat', 'A-B_cat']]
df4 = pd.concat([df2, df3], axis=1, sort=False)
target_column = ['OUTPUT_cat']
predictors = list(set(list(df4.columns))-set(target_column))
df4[predictors] = df4[predictors]/df4[predictors].max()
print(df4.describe())
X = df4[predictors].values
y = df4[target_column].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
print(X_train.shape); print(X_test.shape)
model = Sequential()
model.add(Dense(5000, activation='relu', input_dim=6))
model.add(Dense(1000, activation='relu'))
model.add(Dense(500, activation='relu'))
model.add(Dense(1, activation='softmax'))
# Compile the model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# build the model
model.fit(X_train, y_train, epochs=20, batch_size=150)
I can't figure out why this is the result I'm getting:
Epoch 20/20
56/56 [==============================] - 4s 77ms/step - loss: 0.0000e+00 - accuracy: 1.8165e-04
I also can't seem to find any answers related to this problem. Am I using the Keras functions incorrectly? Is it the way I'm converting the object type to integers? Assuming there are 1250 outputs, how would I fix the layers? Any tips or help would be appreciated. Thank you.
As I said in the comments, it seems like a clear case of model underfitting: you have too little data for the size of the model itself. Rather than playing around with the sizes of the layers, just try SVM or RandomForest classifiers first and see if it's even possible to get any reasonable classification with your data. Also, with this amount of data a neural network is hardly ever a good choice.
So do this instead:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
df = pd.read_csv("file.csv")  # This is your data (file name taken from the question)
X = df.iloc[:, [i for i in range(8) if i != 3]]
y = df.iloc[:, 3]
X = pd.get_dummies(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
rf = RandomForestClassifier(n_estimators=50, min_samples_leaf=5, n_jobs=-1)
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
If this works and can make some predictions then you can go ahead and try to tune your sequential model.
EDIT: Just read your comment that you have 1250 class labels and 5000 samples in total. This is likely not going to work with most classifiers. Too many classes and too little sample support.
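Separately from the data-size argument, one detail of the model in the question may explain the reported loss of exactly 0: the last layer is Dense(1, activation='softmax'). A softmax over a single unit always outputs 1.0, whatever the input, and the categorical cross-entropy (assumed here to be the usual minus sum of y times log p) is then 0 for any label. The following NumPy sketch reproduces that arithmetic; the logits and class codes are made-up illustrative values, and nothing here calls Keras itself.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Three samples, ONE output unit each (hypothetical values).
logits = np.array([[-3.2], [0.0], [7.5]])
probs = softmax(logits)
print(probs.ravel())  # all ones: a single-unit softmax cannot output anything else

# Integer class codes used directly as targets, as in the question (hypothetical values).
y_true = np.array([[4.0], [0.0], [11.0]])

# Categorical cross-entropy per sample: -sum(y * log(p)); log(1) is 0.
loss = -(y_true * np.log(probs)).sum(axis=1)
print(loss)  # all zeros, so there is no gradient signal to learn from
```

A common setup for integer class codes is an output layer with one unit per class and the sparse_categorical_crossentropy loss, although with 1250 classes and 5000 samples the concern raised in the EDIT above still applies.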
I would like to train a Logistic Regression classifier in an online fashion with Sklearn. I know about the 'SAG' and 'SAGA' solvers, but I am not sure how to use them for this.
Specifically, my goal is to have the algorithm train at time t on the most recent x months (e.g. x=3), where t is a month in the year. I would then want to make a prediction over the set of examples for the following month (time t+1).
Here is my df:
X.head()
year month age job marital
0 2008 5 56 3 1
1 2008 5 57 7 1
2 2008 5 37 7 1
3 2008 5 40 0 1
4 2008 5 56 7 1
y.head()
0 0
1 1
2 0
3 0
4 0
Name: y, dtype: int8
Say I have my clf as in the code below (in this example I have trained it on the entire dataset in batch):
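from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

clf = LogisticRegression(C=1, max_iter=100, class_weight = 'balanced')
clf.fit(X, y)  # train on the entire dataset in one batch
y_pred = clf.predict(X)
cmx = pd.DataFrame(confusion_matrix(y, y_pred),
                   index = ['No', 'Yes'],
                   columns = ['No', 'Yes'])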
Note that I am not just looking to create a separate model for each month in the dataset, but to have one model train itself in an online (technically minibatch) fashion throughout the entire dataset.
I am trying to apply Kernel Principal Component Analysis to a dataset without a dependent variable, in order to do a cluster analysis with k-means and learn how this is done. Here is a sample of my dataset (in this scenario it is the dataset of a shopping mall, and the mall wants to discover the segments of its customers from the data below):
CustomerID Genre Age Annual Income (k$) Spending Score (1-100)
1 Male 19 15 39
2 Male 21 15 81
3 Female 20 16 6
4 Female 23 16 77
5 Female 31 17 40
6 Female 22 17 76
7 Female 35 18 6
8 Female 23 18 94
9 Male 64 19 3
10 Female 30 19 72
11 Male 67 19 14
First, I omitted CustomerID column and then encoded the gender column to be able to apply kernel PCA. Here is how I did it:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the mall dataset with pandas
dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, 1:5].values
df = pd.DataFrame(X)
#df is in order to visualize the "X" on variable explorer
#Encoding independent categorical variables
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
After executing this code, I could get the array with float64 Type. The sample from the array I created is below:
0 1 19 15 39
0 1 21 15 81
1 0 20 16 6
1 0 23 16 77
1 0 31 17 40
1 0 22 17 76
1 0 35 18 6
1 0 23 18 94
0 1 64 19 3
1 0 30 19 72
0 1 67 19 14
And then, I wanted to apply Kernel PCA to get the principal components which I will use at k-means. However, when I try to execute the code below, I get the error "TypeError: '<' not supported between instances of 'str' and 'int'".
# Applying Kernel PCA
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components = 'None', kernel = 'rbf')
X = kpca.fit_transform(X)
explained_variance = kpca.explained_variance_ratio_
Even if I encoded my categorical data and I don't have any strings in my dataset, I cannot understand why it gives this error. Is there anyone that could help?
Thank you very much in advance.
n_components = 'None' is the problem. It should be the Python value None, not the string 'None'.
Use:
kpca = KernelPCA(n_components = None, kernel = 'rbf')
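A minimal sketch of the corrected call on the encoded sample rows from the question follows. Two things in it are assumptions rather than part of the answer above: the features are standardized first, and the explained variance ratio is computed by hand from the transformed data, because as far as I know KernelPCA does not provide an explained_variance_ratio_ attribute the way PCA does.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import StandardScaler

# Encoded sample rows from the question: [Female, Male, Age, Income, Score].
X = np.array([
    [0, 1, 19, 15, 39], [0, 1, 21, 15, 81], [1, 0, 20, 16, 6],
    [1, 0, 23, 16, 77], [1, 0, 31, 17, 40], [1, 0, 22, 17, 76],
    [1, 0, 35, 18, 6], [1, 0, 23, 18, 94], [0, 1, 64, 19, 3],
    [1, 0, 30, 19, 72], [0, 1, 67, 19, 14],
], dtype=float)

# Put all columns on a comparable scale before applying the RBF kernel
# (an added step, not in the question's code).
X_scaled = StandardScaler().fit_transform(X)

# n_components=None (the value, not the string) keeps all non-zero components.
kpca = KernelPCA(n_components=None, kernel="rbf")
X_kpca = kpca.fit_transform(X_scaled)

# Hand-computed share of variance carried by each kernel principal component.
component_variance = np.var(X_kpca, axis=0)
explained_variance_ratio = component_variance / component_variance.sum()

print("transformed shape:", X_kpca.shape)
print("explained variance ratio:", np.round(explained_variance_ratio, 3))
```

The leading columns of X_kpca can then be passed to k-means in place of the original features.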
I suspect this is what is happening:
The error comes from an included file, or from some code that runs before your own code. The '<' in "TypeError: '<' ..." refers to a string "<error>", which is what something that ran before your code is returning.