I am predicting stock prices using support vector regression. I have trained with some values but when i predict the values every time I have to train based on that(online learning). So i have passed the values to train inside the loop after predicted.
inside loop
//prediction
clf.fit(testx[i],testy[i])
So when i call the fit function everytime how svr training work internally based on one input?
clf.fit is not incremental. You have to pass all the previous training points in addition to the new instance to re-train a new model that benefits from the new data points unfortunately.
This is a limitation of the SMO algorithm implemented by the libsvm library used internally in the sklearn.svm.SVR class.
Related
I am trying to get the confidence intervals from an XGBoost saved model in a .tar.gz file that is created using python XGBoost library.
The problem is that the model has already been fitted, and I dont have training data any more, I just have inference or serving data to predict. All the examples that I found entail using a training and test data to create either quantile regression models, or bagged models, but I dont think I have the chance to do that.
Why your desired approach will not work
I assume we are talking about regression here. Given a regression model that you cannot modify, I think you will not be able to achieve your desired result using only the given model. The model was trained to calculate a continuous value that appoximates some objective value (i.e., its true value) based on some given input. Nothing more.
Possible solution
The only workaround I can think of would be to train two more models. These model's training goal would be to predict the quality of the output of your given model. One would calculate the upper bound of a given (i.e., predefined by you at training time) confidence interval and the other one the lower bound. This would probably include a lot of feature engineering. One would probably like to find features that correlate with the prediction quality of the original model.
I have followed an example of applying SciKit Learning's machine learning to facial recognition.
https://scikit-learn.org/stable/auto_examples/applications/plot_face_recognition.html#sphx-glr-auto-examples-applications-plot-face-recognition-py
I have been able to adapt the example to my own data successfully. However, I am lost on one point:
after preparing the data, training the model, ultimately, you end up with the line:
Y_pred = clf.predict(X_test_pca)
This produces a vector of predictions, one per face.
What I can't figure out is how to get any confidence measurement to correspond with that.
The classification method is a forced choice, so that each face passed in MUST be classified as one of the known faces, even if it isn't even close.
How can I get a number per face that will reflect how well the result matches the known face?
It seems like you are looking for the .predict_proba() method of the scikit-learn estimators. It returns the probabilities of possible outcomes instead of a single prediction.
The example you are referring to is using an SVC. It is a little special in regard to this function as it states:
The model need to have probability information computed at training time: fit with attribute probability set to True.
So, if you are using the same model as in the example, instantiate it with:
SVC(kernel='rbf', class_weight='balanced', probability=True)
and use .predict_proba() instead of .predict():
y_pred = clf.predict_proba(X_test_pca)
This returns an array of shape (n_samples, n_classes), i.e. the probabilities for each class for each sample. Accessing the probabilities for class k could then be done by calling y_pred[k] for example.
Using the following guide, I've made an sklearn regression model for doing time series forecasting. I'm able to use the model to get predictions on a set of test data where I have the timestamps, as well as the independent variable data, since the model just takes those variables and and gives the output labels as predictions.
However, I'm not sure how, or even if I can use this model to do out of sample predictions, where I only have a future timestamp and none of the independent variable data that goes with it. Is there some sort of recursive method where the model can use data from a test set, make a prediction, then use the prediction and the data to make the next prediction, etc.? Thanks!
Yes, but it depends on whether you want to do single-step or multi-step forecasts.
For single-step forecasts, as you describe, use the last available window of your data as input to the prediction function, this returns the first step ahead forecasted value.
For multi-step forecasts, you have three options:
Direct: Fit one regressor for each step ahead and let each fitted regressor make a prediction with the last available window,
Recursive: Use the last available window to make the first step prediction, then use the first step prediction to roll the window and predict again.
DirRec: A combination of the above strategies, where you instead of rolling the window, you expand it with the previously predicted value, note however this requires to fit the regressors accordingly.
You can find more details in:
Bontempi, Gianluca, Souhaib Ben Taieb, and Yann-Aël Le Borgne. "Machine learning strategies for time series forecasting." European business intelligence summer school. Springer, Berlin, Heidelberg, 2012.
Also note that you have to be careful to appropriately evaluate your model. The train and test sets are not independent in this setting, as they represent measurements at subsequent time points of the same variable. So you have to account for the potential auto-correlation.
I have a trained model that uses regression to predict house prices. It was trained on a standardized dataset (StandatdScaler from sklearn). How do I scale my models input (a single example) now in a different python program? I can't use StandardScaler on the input, because all features would be reduced to 0 (MinMaxScaler doesn't work either, also tried saving and loading scaler from the training script - didn't work). So, how can I scale my input so that features won't be 0 allowing the model to predict the price correctly?
What you've described is a contradiction in terms. Scaling refers to a range of data; a single datum does not have a "range"; it's a point.
What you seem to be asking is how to scale the input data to fit the translation you made when you trained. The answer here is straightforward again: you have to use the same translation function you applied when you trained. Standard practice is to revert the model's ingestion (i.e. reverse that scaling function); if you didn't do that, and you didn't make any note of that function's coefficients, then you do not have the information needed to apply the same translation to future input -- in short, your trained model isn't particularly useful.
You could try to recover the coefficients by running the scaling function on the original data set, making sure to output the resulting function. Then you could apply that function to your input examples.
I am performing a machine learning task wherein I am using logistic regression for topic classification.
If this is my code:
model= LogisticRegression()
model= model.fit(mat_tmp, label_tmp)
y_train_pred = model.predict(mat_tmp_test)
print(metrics.accuracy_score(label_tmp_test, y_train_pred))
Is there a way I can output what exactly is happening inside the model. Like probably a working example of what my model is doing? Like maybe displaying 2-3 documents and how they are being classified?
In order to be fully aware of what is happening in your model, you must first take some time to study the logistic regression algorithm (eg. from lecture notes or Wikipedia). As with other supervised techniques, logistic regression has hyper-parameters and parameters. Hyper-parameters basically specify how your algorithm runs, which you must provide at initialisation (ie. before it sees any data). For example, you could have prior information about the distribution of classes, which then would be a hyper-parameter. Parameters are "learnt" from your data.
Once you understand the algorithm, the interesting question will be what the parameters of your model are (recall that these are retrieved from the data). By visiting the documentation, you find in the attributes section, that this classifier has 3 parameters, which you can access by their field names.
If you are not interested in such details, but only want to assess the accuracy of your classifier, a useful technique is cross-validation. You split your labeled data into k equal sized subsets, and train your classifier using k-1 of them. Then you evaluate the trained classifier on the remaining 1 subset and calculate the accuracy (ie. what proportion of the data could be predicted properly). This method has its drawbacks, but proves to be very useful in general.