I am working on a flight ticket price prediction dataset using PySpark MLlib. It contains both train and test datasets. I have successfully trained my model on the train dataset and predicted the price (the label column), but I don't know how to apply the same model to the test dataset to predict the ticket price.
The following code trains the model on the train dataset (which contains both the features and the label column).
from pyspark.ml.regression import GBTRegressor
gbt = GBTRegressor(featuresCol="features",labelCol = "Price", maxIter = 10)
gbtModel = gbt.fit(training_data)
predictions_gbt = gbtModel.transform(testing_data)
predictions_gbt.select("features", "Price", "prediction").show()
I am loading a Linear SVM model and then predicting new data using the stored, trained SVM model. I used TF-IDF during training, like this:
vector = TfidfVectorizer(ngram_range=(1, 3)).fit(data['text'])
When I apply the model to new data, I get the following error at prediction time:
ValueError: X has 2 features, but SVC is expecting 472082 features as input.
Code for the Prediction of new data
Linear_SVC_classifier = joblib.load("/content/drive/MyDrive/dataset/Classifers/Linear_SVC_classifier.sav")
test_data = input("Enter Data for Testing: ")
newly_testing_data = vector.transform(test_data)
SVM_Prediction_NewData = Linear_SVC_classifier.predict(newly_testing_data)
I want to predict new data using the stored SVM model, without re-applying TF-IDF to the training data every time I pass data to the model. When I use new data for prediction, the prediction line raises the error above. Is there any way to fix this?
The problem is caused by creating a new TfidfVectorizer fitted on the test dataset. Because the classifier was trained on a matrix produced by the TfidfVectorizer fitted on the training dataset, it expects the test data to have exactly the same number of features.
To get matching dimensions, transform your test dataset with the same vectorizer that was used during training rather than initializing a new one based on the test set.
The vectorizer fitted on the train set can be pickled and stored for later use to avoid any re-fitting at inference time.
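A minimal sketch of that workflow, under some assumptions: the toy texts and labels, the file names, and the use of LinearSVC are all hypothetical stand-ins for the asker's data and classifier. The point it illustrates is that the fitted vectorizer is saved next to the classifier, and that only transform (never fit) is called at inference time.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for data['text'] and its labels.
train_texts = [
    "cheap flights to paris",
    "book a hotel room",
    "flights and airline deals",
    "hotel room with breakfast",
]
train_labels = [0, 1, 0, 1]

# Training time: fit the vectorizer once and train the classifier on its output.
vectorizer = TfidfVectorizer(ngram_range=(1, 3)).fit(train_texts)
classifier = LinearSVC().fit(vectorizer.transform(train_texts), train_labels)

# Persist BOTH objects (hypothetical file names in a temporary directory).
model_dir = tempfile.mkdtemp()
vectorizer_path = os.path.join(model_dir, "tfidf_vectorizer.joblib")
classifier_path = os.path.join(model_dir, "linear_svc.joblib")
joblib.dump(vectorizer, vectorizer_path)
joblib.dump(classifier, classifier_path)

# Inference time: load both and only transform the new text.
loaded_vectorizer = joblib.load(vectorizer_path)
loaded_classifier = joblib.load(classifier_path)

new_text = "flights to rome"
# transform expects an iterable of documents, so wrap the string in a list.
new_features = loaded_vectorizer.transform([new_text])
prediction = loaded_classifier.predict(new_features)
```

Because the loaded vectorizer carries the training vocabulary, the new feature matrix has the same number of columns the classifier was trained on, whatever the new text contains.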
I have created a ridge regression model to predict sales of an item, say X. My final model contains around 180 features. I have pickled this model, and now I want to see how it performs on a new dataset that contains the same features the model was trained on but covers a different timeframe.
I have to pass the entire new dataset (a DataFrame) into the existing model and check a relevant score, such as R-squared or any other score suitable for a regression model.
Below is the code I'm using:
# loading library
import pickle
# save the trained model
with open('ridge_model', 'wb') as p:
    pickle.dump(ridge, p)
# split the new data into target and features
y_test = df.pop('Target')
x_test = df
# load saved model
with open('ridge_model', 'rb') as p:
    new_reg = pickle.load(p)
r_squared = new_reg.score(x_test, y_test)
print(r_squared)
Here ridge is the regression model I created
df is the new data on which prediction needs to be done
y_test contains target variable
x_test contains features in dataset except target variable
I want to understand:
Can we pass an entire new dataset to an existing model for prediction and see how the model performs?
In the line r_squared = new_reg.score(x_test, y_test), does .score compute R-squared on the new data, the same way it would be computed for the existing model, or does it compute some other score?
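A small check can make the second question concrete. The sketch below uses synthetic data (the arrays, coefficients, and alpha value are assumptions, not the asker's 180-feature dataset) and compares .score on new data with r2_score computed from the model's own predictions. For scikit-learn regressors, .score returns the coefficient of determination (R-squared) for whatever data is passed in.

```python
import pickle

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for the original training data and the new timeframe.
true_coef = np.array([1.5, -2.0, 0.0, 0.7, 3.0])
X_train = rng.normal(size=(200, 5))
y_train = X_train @ true_coef + rng.normal(scale=0.1, size=200)
X_new = rng.normal(size=(50, 5))
y_new = X_new @ true_coef + rng.normal(scale=0.1, size=50)

# Train, then round-trip through pickle as in the question.
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
new_reg = pickle.loads(pickle.dumps(ridge))

# .score evaluates the already-fitted model on the data passed in; nothing is refit.
score_new = new_reg.score(X_new, y_new)
manual_r2 = r2_score(y_new, new_reg.predict(X_new))
score_train = new_reg.score(X_train, y_train)

print(score_new, manual_r2, score_train)
```

The first two values agree, which suggests that passing the whole new DataFrame to .score is a reasonable way to measure R-squared on the new timeframe, provided its columns match the training features in name and order.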
I am training an XGBoost classifier in BigQuery. The model trains fine, and the saved model file (bst) is then imported into a Python notebook for plotting. I want to plot the trees in the model to get an idea of how it makes its predictions.
When I plot the model, I get the results that are given below:
I am doing it like this:
import matplotlib.pyplot as plt
import xgboost as xgb
bst = xgb.Booster(model_file='model.bst')
fig, ax = plt.subplots(figsize=(30, 30))
xgb.plot_tree(bst, num_trees=4, ax=ax)
plt.show()
I have learned that the column names are masked: for example, f182 stands for the 182nd feature that the model was trained on. I would like to map these feature indices in the trees back to the actual column names that were used for training. The query used to train the model is given below:
CREATE OR REPLACE MODEL `d1.boost_clf1`
OPTIONS(
MODEL_TYPE='BOOSTED_TREE_CLASSIFIER',
INPUT_LABEL_COLS=['y'],
DATA_SPLIT_METHOD='CUSTOM',
DATA_SPLIT_COL='isTrain',
AUTO_CLASS_WEIGHTS=TRUE,
EARLY_STOP=TRUE,
L2_REG = 0.3,
ENABLE_GLOBAL_EXPLAIN = TRUE
) AS
SELECT
* except(isTrain, x1,x2,x3_timestamp,x4_timestamp, y)
,isTrain = 1 as isTrain
FROM d1.t1_preprocessed;
I have tried to print bst.feature_names but that doesn't print anything.
Any help in finding a way to plot the trees of XGBoost with actual column names would be highly appreciated. Thanks!
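One possible direction, sketched here without XGBoost itself: if the ordered list of training columns is known, the fN tokens in a text dump of the trees can be rewritten with a regular expression. The column list, the sample dump string, and the helper name rename_features are hypothetical. The sketch also assumes that index N corresponds to the N-th selected column, which may not hold if BigQuery ML reorders or encodes features during preprocessing, so the mapping should be verified against the model.

```python
import re


def rename_features(tree_dump, column_names):
    """Replace fN tokens in an XGBoost text dump with column_names[N]."""

    def lookup(match):
        index = int(match.group(1))
        # Leave the token untouched if the index is outside the list.
        if index < len(column_names):
            return column_names[index]
        return match.group(0)

    # \b keeps words such as "leaf" from matching.
    return re.sub(r"\bf(\d+)\b", lookup, tree_dump)


# Hypothetical ordered column list; in practice, the SELECT columns in training order.
columns = ["age", "income", "tenure"]

# Hypothetical dump text in the format produced by Booster.get_dump().
sample_dump = (
    "0:[f2<2.5] yes=1,no=2,missing=1\n"
    "\t1:leaf=0.1\n"
    "\t2:[f0<30] yes=3,no=4,missing=3"
)

renamed = rename_features(sample_dump, columns)
print(renamed)
```

With a real booster, bst.get_dump() returns one such string per tree. Recent XGBoost versions also allow assigning a list of names to bst.feature_names before calling plot_tree, which may be the more direct route for plotting if the installed version supports it.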
I have a validation dataset and a train dataset. I am trying to remove the validation rows that also exist in the train dataset and create a new subset from what remains.
val_data = data from a previous experiment
train_data = current training data
new_data = data from train_data that does not include val_data
I am trying to get new_data.
How do I go about it?
NOTE: I am not using a train/test split to generate the validation dataset because it was already generated beforehand.
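If both datasets are pandas DataFrames (an assumption, since the question does not say), one common approach is an anti-join: left-merge train_data with val_data using indicator=True and keep only the rows marked left_only. The column names and toy rows below are hypothetical, and the sketch assumes a row counts as "the same" when all key columns match.

```python
import pandas as pd

# Hypothetical toy frames standing in for the real datasets.
train_data = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5], "text": ["a", "b", "c", "d", "e"]}
)
val_data = pd.DataFrame({"id": [2, 4], "text": ["b", "d"]})

# Columns that identify a row; here every column is used.
key_columns = ["id", "text"]

# Left merge with an indicator column: rows found only in train_data are "left_only".
merged = train_data.merge(
    val_data[key_columns].drop_duplicates(),
    on=key_columns,
    how="left",
    indicator=True,
)

# Keep the train-only rows and restore the original column set.
new_data = merged.loc[
    merged["_merge"] == "left_only", train_data.columns
].reset_index(drop=True)

print(new_data)
```

Dropping duplicates from val_data before the merge keeps a repeated validation row from multiplying rows in the result.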
I am trying out Multinomial Naive Bayes. My dataset (df) looks like this:
I created training and test datasets and trained the model. This is what I have tried so far:
from pyspark.ml.classification import NaiveBayes
# Initialise the model
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
# Fit the model
model = nb.fit(train)
# Make predictions on test data
predictions = model.transform(test)
Now, how do I predict the top labels for query = "something"?