Can someone help me with the code for getting the predicted probability values? My model is working fine and gives me predictions as 1 and 0; however, I also need the probability values. The code is in two Python files. The first file uses the training data set to create the map file. The second Python file (the scoring file) applies the map file to the test data to predict. Can someone let me know what code I should insert to get the probability values?
The code below is from the scoring file; this is where I need the code to get the probability values:
pred = model.predict(X.values)
data["Predicted"] = pred
# I NEED THE CODE HERE TO GET THE PROBABILITY VALUES.
data.to_excel(r'result.xlsx', index=False)
Thanks a lot
Check whether your model has a predict_proba method.
Its usage is the same as that of the predict method.
prob = model.predict_proba(X.values)
Edit:
Some of the learning model implementations in sklearn provide the predict_proba method. It is not a metric but, as I said, a method of the learning model's class.
For example:
from sklearn.tree import DecisionTreeClassifier
# after the split you have X_train, y_train, X_test, y_test
model = DecisionTreeClassifier()
model.fit(X_train,y_train)
proba = model.predict_proba(X_test)
I am no longer able to edit my question so putting it here.
Thanks for all the help. I am using a random forest model. This is my code, and line 4 below gives an error. If I remove line 4, the code runs, but the final Excel file contains only the predictions as 1 and 0, not the probabilities. Can someone please let me know how to resolve this error? The last line of the error says:
ValueError: Wrong number of items passed 2, placement implies 1
pred = model.predict(X.values)
data["Predicted"] = pred
prob = model.predict_proba(X.values)
data["Pred Value"] = prob  # this line causes the error
data.to_excel(r'result.xlsx', index=False)
Thanks
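A possible fix, sketched under the assumption that the model is a binary classifier: predict_proba returns one column per class, so prob has shape (n_samples, 2), and assigning it to a single DataFrame column is what raises the "Wrong number of items passed 2" error above. Picking one column (here the one for class 1) avoids that. The toy data and the column names a, b, c are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in for the real training/scoring data.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y = (X["a"] + X["b"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=20, random_state=0)
model.fit(X.values, y)

data = X.copy()
data["Predicted"] = model.predict(X.values)

# predict_proba returns one column per class, ordered like model.classes_.
prob = model.predict_proba(X.values)

# Keep only the probability of class 1 so it fits into a single column.
class_1_col = list(model.classes_).index(1)
data["Pred Value"] = prob[:, class_1_col]

print(prob.shape)
print(data.head())
```

After this, data.to_excel(r'result.xlsx', index=False) from the original code should write both the prediction and the probability column.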
I have trained my model and saved it as a JSON file, and I have saved the weights as h5 and loaded them.
I then used it in another script to feed it some records and test the accuracy of the model. What I want to do is:
Make the model output the predicted label and then compare it with the true label from the CSV file, but I don't know how to get the predicted values.
I tried this:
predict_x = loaded_model(X_test)
y_pred = np.argmax(predict_x, axis=1)
And then
for i in y_pred:
to loop through y_pred. I thought I would see my y classes, which are 0-4, but I actually found numbers like 17, 22, 13…
Can anyone tell me what I should do?
Thank you in advance
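One thing that can be checked without knowing the model: np.argmax(..., axis=1) can only return indices smaller than the size of axis 1, so values like 17 or 22 mean the model output has more than 5 entries along that axis. The sketch below uses random arrays as hypothetical stand-ins for loaded_model(X_test); printing the shape of the real output would be the corresponding check.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model outputs: 4 samples with 5 scores each, and 4 samples with 30 scores each.
scores_5 = rng.random((4, 5))
scores_30 = rng.random((4, 30))

pred_5 = np.argmax(scores_5, axis=1)    # every value is in 0..4
pred_30 = np.argmax(scores_30, axis=1)  # values can be anywhere in 0..29

print(scores_5.shape, pred_5)
print(scores_30.shape, pred_30)
```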
I'm relatively new to Python and I am trying to build a multiple linear regression model with two predictor variables and one dependent variable. While researching this, I found that scikit-learn provides a class to do it. I tried to fit a model to my variables and got the following message:
Shape of passed values is (3, 1), indices imply (2, 1)
The code I've used is:
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
data = pd.read_csv('data.csv',delimiter=',',header=0)
SEED_VALUE = 12356789
np.random.seed(SEED_VALUE)
data_train, data_test = train_test_split(data, test_size=0.3, random_state=SEED_VALUE)
print('Train size: {}'.format(data_train.shape[0]))
print('Test size: {}'.format(data_test.shape[0]))
data_train_X = data_train.values[:, 0:2]  # predictor variables
data_train_Y = data_train.values[:, 2].astype('float')  # dependent variable
model = linear_model.LinearRegression()
np.random.seed(SEED_VALUE)
model.fit(data_train_X, data_train_Y)
coef = pd.DataFrame([model.intercept_, *model.coef_], ['(Intercept)', *data_train_X.columns], columns=['Coefficients'])
coef
I got the error on the model.fit(data_train_X, data_train_Y) line. I have searched online for different ways to use the scikit-learn method and found that other people with the same code had no error, so I don't know where my mistake could be.
Thank you all so much
The data file is like this:
"retardation","distrust","degree"
2.80,6.1,44
3.10,5.1,25
2.59,6.0,10
3.36,6.9,28
2.80,7.0,25
3.35,5.6,72
2.99,6.3,45
2.99,7.2,25
2.92,6.9,12
3.23,6.5,24
3.37,6.8,46
2.72,6.6, 8
It would be great if you could also provide the input data, but even without it, the error is most likely because you are using the index column from your file. If you remove it, it should be fine.
If you can give an example of the data you use (the columns), I will be able to check it further.
I reran your code with the data you provided, and it does not show any errors on the model.fit(data_train_X, data_train_Y) line :)
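For the coefficient table, note that in the code as posted data_train_X is a NumPy array (it comes from .values), and NumPy arrays have no .columns attribute. A minimal sketch that keeps the feature names in a list and reuses them for the index is below. It fits on all twelve posted rows rather than on a train/test split, and the names feature_names and csv_text are only for illustration.

```python
import io

import pandas as pd
from sklearn.linear_model import LinearRegression

# The data posted in the question, inlined so the example is self-contained.
csv_text = '''"retardation","distrust","degree"
2.80,6.1,44
3.10,5.1,25
2.59,6.0,10
3.36,6.9,28
2.80,7.0,25
3.35,5.6,72
2.99,6.3,45
2.99,7.2,25
2.92,6.9,12
3.23,6.5,24
3.37,6.8,46
2.72,6.6, 8
'''
data = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

feature_names = ["retardation", "distrust"]  # the two predictors
X = data[feature_names]
y = data["degree"].astype(float)             # the dependent variable

model = LinearRegression().fit(X, y)

# Three values (intercept + two slopes) need three index labels.
coef = pd.DataFrame(
    {"Coefficients": [model.intercept_, *model.coef_]},
    index=["(Intercept)", *feature_names],
)
print(coef)
```

The "Shape of passed values is (3, 1), indices imply (2, 1)" message is what pandas reports when the number of values and the number of index labels disagree, so checking that both lists have the same length is a reasonable first step.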
I have used the fasttext train_supervised utility to train a classification model, following their webpage https://fasttext.cc/docs/en/supervised-tutorial.html.
model = fasttext.train_supervised(input='train.txt', autotuneValidationFile='validation.txt', autotuneDuration=600)
Now that I have the model, how can I find out which parameters were chosen as the best ones? In sklearn, once a model has been trained with a set of best parameters, we can always check their values, but I could not find any documentation explaining this for fasttext.
I also used this trained model to make predictions on my data:
model.predict(test_df.iloc[2, 1])
It returns the label with a probability, like this:
(('__label__2',), array([0.92334366]))
I'm wondering: if I have 5 labels, is it possible to get the probability of every label for each text whenever I make a prediction?
For the test_df text above, for example,
I would like to get something like
model.predict(test_df.iloc[2, 1])
(('__label__2',), array([0.92334366])),(('__label__1',), array([0.82334366])),
(('__label__3',), array([0.52333333])),(('__label__0',), array([0.07000000])),
(('__label__4',), array([0.00002000]))
I couldn't find anything on how to get such prediction results.
Any suggestion?
Thanks.
As you can see here in the documentation, when using the predict method you should specify the k parameter to get the top-k predicted classes.
model.predict("Why not put knives in the dishwasher?", k=5)
OUTPUT:
((u'__label__food-safety', u'__label__baking', u'__label__equipment',
u'__label__substitutions', u'__label__bread'), array([0.0857 , 0.0657,
0.0454, 0.0333, 0.0333]))
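If a label-to-probability mapping is easier to work with than the (labels, probabilities) tuple, the result can be zipped into a dict. This is only a sketch: to_label_probabilities is a hypothetical helper, not part of fasttext, and the example tuple is copied from the output above instead of calling a trained model.

```python
import numpy as np


def to_label_probabilities(prediction):
    """Turn a (labels, probabilities) tuple into a {label: probability} dict."""
    labels, probs = prediction
    return {label: float(p) for label, p in zip(labels, probs)}


# Same structure as the output of model.predict(text, k=5) shown above.
example_prediction = (
    ("__label__food-safety", "__label__baking", "__label__equipment",
     "__label__substitutions", "__label__bread"),
    np.array([0.0857, 0.0657, 0.0454, 0.0333, 0.0333]),
)

label_probs = to_label_probabilities(example_prediction)
for label, p in label_probs.items():
    print(label, p)
```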
I'm getting the following warning after upgrading to version 1.0 of scikit-learn:
UserWarning: X does not have valid feature names, but IsolationForest was
fitted with feature names
I cannot find in the docs what a "valid feature name" is. How do I deal with this warning?
I got the same warning message with another sklearn model. I realized that it was showing up because I fitted the model with data in a DataFrame and then used only the values to predict. Once I fixed that, the warning disappeared.
Here is an example:
model_reg.fit(scaled_x_train, y_train[vp].values)
data_pred = model_reg.predict(scaled_x_test.values)
This first code had the warning, because scaled_x_train is a DataFrame with feature names, while scaled_x_test.values is only values, without feature names. Then, I changed to this:
model_reg.fit(scaled_x_train.values, y_train[vp].values)
data_pred = model_reg.predict(scaled_x_test.values)
And now there are no more warnings in my code.
I was getting a very similar warning, but with DecisionTreeClassifier's fit and predict.
Initially I was passing a DataFrame with headers as input to fit, and I got the warning.
When I removed the headers and passed only the values, the warning disappeared.
Sample code before and after the change:
Code with Warning:
model = DecisionTreeClassifier()
model.fit(x,y) #Here x includes the dataframe with headers
predictions = model.predict([
[20,1], [20,0]
])
print(predictions)
Code without Warning:
model = DecisionTreeClassifier()
model.fit(x.values,y) #Here x.values will have only values without headers
predictions = model.predict([
[20,1], [20,0]
])
print(predictions)
I also had the same problem. It was caused by fitting the model with the X training data as a DataFrame (model.fit(X,Y)) and then predicting with the X test data as an array (model.predict([ [20,0] ])). To solve it, I converted the X training DataFrame into an array, as illustrated below.
BEFORE
model = DecisionTreeClassifier()
model.fit(X,Y) # X train here is a dataFrame
predictions = model.predict([ [20,0] ]) ## generates warning
AFTER
model = DecisionTreeClassifier()
X = X.values # conversion of X into array
model.fit(X,Y)
model.predict([ [20,0] ]) #now ok , no warning
The other answers so far recommend (re)training with a NumPy array instead of a DataFrame for the training data. The warning is a sort of safety feature, to ensure you're passing the data you meant to, so I would suggest passing a DataFrame (with the correct column labels!) to the predict function instead.
Also, note that it's just a warning, not an error. You can ignore the warning and proceed with the rest of your code without problem; just be sure that the data is in the same order as it was trained with!
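A minimal sketch of that suggestion, assuming a small made-up training frame with columns age and gender (both hypothetical): the rows to predict are wrapped in a DataFrame that reuses the training column names, and the warnings raised during predict are recorded so the absence of the feature-name warning can be checked.

```python
import warnings

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data with named columns.
X = pd.DataFrame({"age": [20, 25, 30, 35], "gender": [1, 0, 1, 0]})
y = [0, 1, 0, 1]

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)  # fitted on a DataFrame, so feature names are stored

# Build the prediction input with the same column labels, in the same order.
new_rows = pd.DataFrame([[20, 1], [20, 0]], columns=X.columns)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    predictions = model.predict(new_rows)

feature_name_warnings = [w for w in caught if "feature names" in str(w.message)]
print(predictions, len(feature_name_warnings))
```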
I got the same warning while using DataFrames, but when passing only the values it no longer appears.
Use
pred = reg.predict(x[['data']].values)
The warning shows up because our DataFrame has feature names, whereas we should pass the data as a 2D array (or matrix) of values, both for training and for testing.
[Image: Jupyter notebook code showing the same thing as above]
So I am working on a model that attempts to use RandomForest to classify samples into 1 of 7 classes. I'm able to build and train the model, but when it comes to evaluating it with the roc_auc_score function, I'm able to perform 'ovr' (one-vs-rest) but 'ovo' is giving me some trouble.
roc_auc_score(y_test, rf_probs, multi_class = 'ovr', average = 'weighted')
The above works wonderfully and I get my output. However, when I switch multi_class to 'ovo', which I understand might be better with class imbalances, I get the following error:
roc_auc_score(y_test, rf_probs, multi_class = 'ovo')
IndexError: too many indices for array
(I pasted the whole traceback below!)
Currently my data is set up as follows:
y_test (61,1)
y_probs (61, 7)
Do I need to reshape my data in a special way to use 'ovo'?
In the documentation, https://thomasjpfan.github.io/scikit-learn-website/modules/generated/sklearn.metrics.roc_auc_score.html, it says "binary y_true, y_score is supposed to be the score of the class with greater label. The multiclass case expects shape = [n_samples, n_classes] where the scores correspond to probability estimates."
Additionally, the whole traceback seems to hint at maybe using a more binary array (hopefully that's the right term! I'm new to this!)
Very, very thankful for any ideas/thoughts!
@Tirth Patel provided the right answer: I needed to reshape my test set using one-hot encoding. Thank you!
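For anyone landing here later: besides one-hot encoding, another thing that may be worth trying, assuming the (61, 1) column shape of y_test is what trips up 'ovo', is flattening y_test to a 1-D array before scoring. The sketch below uses synthetic 7-class data from make_classification as a stand-in for the real samples, so the numbers it prints say nothing about the original model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 7-class problem standing in for the real data.
X, y = make_classification(
    n_samples=700, n_features=10, n_informative=5, n_redundant=0,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X_train, y_train)
rf_probs = rf.predict_proba(X_test)  # shape (n_samples, 7)

# Mimic a column-shaped label array like (61, 1), then flatten it to 1-D.
y_test_column = y_test.reshape(-1, 1)
y_test_flat = np.ravel(y_test_column)

ovo_auc = roc_auc_score(y_test_flat, rf_probs, multi_class="ovo")
ovr_auc = roc_auc_score(y_test_flat, rf_probs, multi_class="ovr", average="weighted")
print(ovo_auc, ovr_auc)
```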