I have a Google Cloud Platform account with a Kubeflow Pipeline. The first component of the pipeline preprocesses some data and the second trains a model (a scikit-learn DecisionTreeClassifier) on that preprocessed data. For the sake of a code sample, the snippet below is a simple modification of the pipeline's second component:
import logging
import pandas as pd
import os
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics, datasets
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
X = iris.data
y = iris.target
x_train_data, x_test_data, y_train_data, y_test_data = train_test_split(X, y, test_size=0.3, random_state=1, shuffle=True)
print("Creating model")
model = DecisionTreeClassifier()
print(f"Training model ({type(model)})")
model.fit(x_train_data, y_train_data)
print("Evaluating model")
y_train_pred = model.predict(x_train_data)
print("y_train_pred: ", y_train_pred.shape)
y_test_pred = model.predict(x_test_data)
print("y_test_pred: ", y_test_pred.shape)
train_accuracy = metrics.accuracy_score(y_train_data, y_train_pred)
train_classification_report = metrics.classification_report(y_train_data, y_train_pred)
print("\nTraining result:")
print(f"Accuracy:\t{train_accuracy}")
print(f"Classification report:\t{type(train_classification_report)}\n{train_classification_report}")
test_accuracy = metrics.accuracy_score(y_test_data, y_test_pred)
test_classification_report = metrics.classification_report(y_test_data, y_test_pred)
print("\nTesting result:")
print(f"Accuracy:\t{test_accuracy}")
print(f"Classification report:\t{type(test_classification_report)}\n{test_classification_report}")
print("\nDONE !\n")
Here, instead of loading the preprocessed data, I'm using the scikit-learn Iris dataset, but the behavior is exactly the same. Everything seems to work as intended: every print statement appears on the Kubeflow platform output console as expected. However, after the second component finishes executing (after the last print is correctly shown on the output console), an error appears:
Traceback (most recent call last):
File "<string>", line 181, in <module>
File "<string>", line 151, in _serialize_str
TypeError: Value "None" has type "<class 'NoneType'>" instead of str.
Do you have any idea why this is happening? Am I doing something wrong, or is it a Google Cloud / Kubeflow Pipelines problem?
Thanks in advance!
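For context: this traceback matches what a KFP lightweight Python component raises when its function signature declares a str output but the function actually returns None; the generated wrapper calls _serialize_str on the return value only after the user code finishes, which is why the error shows up after the last print. A minimal sketch of the pattern, assuming the component was built with kfp.components.func_to_container_op (the names here are illustrative):

from kfp.components import func_to_container_op

def train_model() -> str:
    # ... the training code shown above ...
    # Falling through without a return makes KFP serialize None as the
    # declared str output, raising the TypeError from the traceback.
    return "done"

train_op = func_to_container_op(train_model)

Returning the declared value (or dropping the -> str annotation when the component has no output) should avoid the error.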
Related
I am doing a voice detection project in Python 3.10. However, I cannot run my code; it fails with "ModuleNotFoundError: No module named 'PCM'" (line 7). I tried pip install PCM and pip install confusion-matrix again and again, but it's still the same. I don't know how to fix it; hopefully someone can help me. Thank you very much!
Here is my code:
####### IMPORTS #############
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from PCM.PCM import plot_confusion_matrix
import pickle
##### Loading saved csv ##############
df = pd.read_pickle("final_audio_data_csv/audio_data.csv")
####### Making our data training-ready
X = df["feature"].values
X = np.concatenate(X, axis=0).reshape(len(X), 40)
y = np.array(df["class_label"].tolist())
####### train test split ############
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
##### Training ############
logit_reg = LogisticRegression(max_iter=10000)
logit_reg.fit(X_train, y_train)
score = logit_reg.score(X_test, y_test)
print("Model Score: \n")
print(score)
#### Evaluating our model ###########
print("Model Classification Report: \n")
y_pred = logit_reg.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(classification_report(y_test, y_pred))
plot_confusion_matrix(cm, classes=["Does not have WW", "Has WW"])
#### Save the model
pickle.dump(logit_reg, open('saved_model/WWD_ML.txt', 'wb'))
'''
To load the model again run this:
>>> model = pickle.load(open('saved_model/WWD_ML.txt', 'rb'))
>>> model.predict(<-- your matrix -->) # to predict
'''
Here is my terminal output:
PS C:\Users\adamn\Documents\voice detection\WakeWordDetection-master\WakeWordDetection-master> & C:/Users/adamn/AppData/Local/Programs/Python/Python310/python.exe "c:/Users/adamn/Documents/voice detection/WakeWordDetection-master/WakeWordDetection-master/UsingML.py"
Traceback (most recent call last):
File "c:\Users\adamn\Documents\voice detection\WakeWordDetection-master\WakeWordDetection-master\UsingML.py", line 7, in <module>
from PCM.PCM import plot_confusion_matrix
ModuleNotFoundError: No module named 'PCM'
PS C:\Users\adamn\Documents\voice detection\WakeWordDetection-master\WakeWordDetection-master>
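As an aside, PCM is not a package on PyPI, so pip install PCM cannot resolve it; it appears to be a local helper module (a PCM/PCM.py file) shipped with the original WakeWordDetection repository. If that file is unavailable, scikit-learn's own ConfusionMatrixDisplay can stand in for plot_confusion_matrix; a minimal sketch:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Plot the confusion matrix computed earlier as cm
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=["Does not have WW", "Has WW"])
disp.plot()
plt.show()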
I am trying to create a model using XGBoost.
It seems like I managed to train the model; however, when I try to predict on my test data and see the actual predictions, I get the following error:
ValueError: Data must be 1-dimensional
This is how I tried to predict my data:
from dask_ml.model_selection import train_test_split
import dask
import xgboost as xgb
import dask_xgboost
from dask.distributed import Client
import dask_ml.model_selection as dcv
#split the data
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.33,random_state=42)
client = Client(n_workers=10, threads_per_worker=1)
#Trying to do hyperparameter tuning
model_xgb = xgb.XGBRegressor(seed=42,verbose=True)
params={
'learning_rate':[0.1,0.01,0.05],
'max_depth':[1,5,8],
'gamma':[0,0.5,1],
'scale_pos_weight':[1,3,5]
}
grid_search = GridSearchCV(model_xgb, params, cv=3, scoring='neg_mean_squared_error')
grid_search.fit(x_train, y_train)
#train data with best parameters
bst = dask_xgboost.train(client, grid_search.best_params_, x_train, y_train, num_boost_round=10)
#predict data
dask_xgboost.predict(client, bst, x_test).persist()
The last line with the predict works, but when I add .compute() to the end in order to see the actual array, I get the dimension error:
dask_xgboost.predict(client, bst, x_test).persist().compute()
>>>ValueError: Data must be 1-dimensional
How can I get predictions with .predict?
As noted on the PyPI page for dask-xgboost:
Dask-XGBoost has been deprecated and is no longer maintained.
The functionality of this project has been included directly
in XGBoost. To use Dask and XGBoost together, please use
xgboost.dask instead
https://xgboost.readthedocs.io/en/latest/tutorials/dask.html.
The code you provided has a few missing assignments and expressions (e.g. how x is defined, where GridSearchCV is imported from). A few things that probably should be changed:
# note the .dask
model_xgb = xgb.dask.DaskXGBRegressor(seed=42, verbose=True)
grid_search = GridSearchCV(model_xgb, params, cv=3, scoring='neg_mean_squared_error')
grid_search.fit(x_train, y_train)
#train with the best params
model_xgb.client = client
model_xgb.set_params(**grid_search.best_params_)
model_xgb.fit(x_train, y_train, eval_set=[(x_test, y_test)])
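To then answer the original question, predictions can be pulled back as a concrete array through the same estimator; a short sketch, assuming x_test is a dask collection:
preds = model_xgb.predict(x_test)  # a lazy dask array
preds.compute()                    # materializes a 1-D numpy array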
Dataset: I created a very simple dataset with "Supplier" and "Item Description" columns. This dataset has a list of item descriptions and the preferred supplier for each item.
Requirement: I would like to write a program that takes an "Item Description" and predicts the "Supplier". To keep it very simple, there are only 5 unique supplier/item-description combinations among the 950 rows in the .txt file.
Issue: The accuracy shows up as 1.0 and the confusion matrix shows no false positives. But when I give it new data, the prediction is wrong.
Steps Done
Read .txt for "Supplier" and "Item Description"
LabelEncoder applied on the "Supplier"
Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10)
Created a Pipeline for applying the TfidfVectorizer and MultinomialNB
pipeline = Pipeline([('vect', vectorizer),
('clf', MultinomialNB())
])
model = pipeline.fit(X_train, y_train)
Fit the model and predict:
y_pred=model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
acc= accuracy_score(y_test,y_pred)
# acc is 1.0 and the cm shows no false positives/negatives
So far, things look OK.
Dumped the pickle:
pickle.dump(model, open(r'supplier_predictions.pkl','wb'))
Tried a prediction on Item Description = 'Lego, Barbie and other Toy Items'; I was expecting "Toys R Us".
The prediction was wrong; it came up as "Office Depot".
loadedModel = pickle.load(open("supplier_predictions.pkl","rb"))
new_items = {'ITEM_DESCRIPTION': ['Lego, Barbie and other Toy Items']}
new_X = pd.DataFrame(new_items, columns = ['ITEM_DESCRIPTION'])
new_y_pred=loadedModel.predict(new_X)
Can you please let me know what I am doing wrong here that produces the wrong prediction, new_y_pred, for the test item description passed in (new_X)?
This is my first ML code. I have tried debugging this by looking at various articles, but no luck.
Thanks
== Complete Code, if it is helpful ==
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import re # library for cleaning data
import nltk # library for NLP
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
import pickle
df=pd.read_csv('git_suppliers.txt', sep='\t')
# Prep the data - Item Description
from sklearn.feature_extraction.text import TfidfVectorizer
stemmer = PorterStemmer()
words = stopwords.words("english")
df['ITEM_DESCRIPTION'] = df['ITEM_DESCRIPTION'].apply(lambda x: " ".join([stemmer.stem(i) for i in re.sub("[^a-zA-Z0-9]", " ", x).split() if i not in words]).lower())
# Feature Generation using the TF-IDF
vectorizer = TfidfVectorizer(min_df= 3, stop_words="english", sublinear_tf=True, norm='l2', ngram_range=(1, 2))
final_features = vectorizer.fit_transform(df['ITEM_DESCRIPTION']).toarray()
final_features.shape
# final_features shows only 43 features - not going to use SelectKBest for such a small feature count
#
# Split into training and test data
#
X = df['ITEM_DESCRIPTION']
y = df['SUPPLIER']
from sklearn.preprocessing import LabelEncoder
labelObj = LabelEncoder()
y=labelObj.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10)
y_test_decoded=labelObj.inverse_transform(y_test)
#
# Create a pipeline, fit the model, predict for test data and save in pickle
#
pipeline = Pipeline([('vect', vectorizer),
('clf', MultinomialNB())
])
model = pipeline.fit(X_train, y_train)
# Predict for test data
y_pred=model.predict(X_test)
# Accuracy shows up as 1.0 and the confusion matrix shows no false positives/negatives
from sklearn.metrics import confusion_matrix,accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
acc= accuracy_score(y_test,y_pred)
print(acc)
# Dump the model and let's predict for one item description,
# for which I expect Toys R Us as the supplier/seller
pickle.dump(model, open(r'supplier_predictions.pkl','wb'))
loadedModel = pickle.load(open("supplier_predictions.pkl","rb"))
new_items = {'ITEM_DESCRIPTION': ['Lego, Barbie and other Toy Items']}
new_X = pd.DataFrame(new_items, columns = ['ITEM_DESCRIPTION'])
new_y_pred=loadedModel.predict(new_X)
labelObj.inverse_transform(new_y_pred)
### Shows Office Depot
My bad - the input to predict was of the wrong type. I had passed in a DataFrame; iterating over a DataFrame yields its column names, so the vectorizer scored the literal string 'ITEM_DESCRIPTION' instead of the item text. Passing the descriptions in as a Series works fine:
new_items = pd.Series(['Lego, Barbie and other Toy Items'])
new_y_pred = loadedModel.predict(new_items)
labelObj.inverse_transform(new_y_pred)
Lately, I have been working on applying grid search cross-validation (sklearn GridSearchCV) for hyper-parameter tuning in Keras with a TensorFlow backend. As soon as my model is tuned, I try to save the GridSearchCV object for later use, without success.
The hyper-parameter tuning is done as follows:
from keras.callbacks import History
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
import numpy as np
x_train, x_val, y_train, y_val = train_test_split(NN_input, NN_target, train_size = 0.85, random_state = 4)
history = History()
kfold = 10
regressor = KerasRegressor(build_fn = create_keras_model, epochs = 100, batch_size=1000, verbose=1)
neurons = np.arange(10,101,10)
hidden_layers = [1,2]
optimizer = ['adam','sgd']
activation = ['relu']
dropout = [0.1]
parameters = dict(neurons = neurons,
hidden_layers = hidden_layers,
optimizer = optimizer,
activation = activation,
dropout = dropout)
gs = GridSearchCV(estimator = regressor,
param_grid = parameters,
scoring = 'neg_mean_squared_error',
n_jobs = 1,
cv = kfold,
verbose = 3,
return_train_score = True)
grid_result = gs.fit(NN_input,
NN_target,
callbacks=[history],
verbose=1,
validation_data=(x_val, y_val))
Remark: create_keras_model function initializes and compiles a Keras Sequential model.
After the cross validation is performed I am trying to save the grid search object (gs) with the following code:
from sklearn.externals import joblib
joblib.dump(gs, 'GS_obj.pkl')
The error I am getting is the following:
TypeError: can't pickle _thread.RLock objects
Could you please let me know what might be the reason for this error?
Thank you!
P.S.: the joblib.dump method works well for saving GridSearchCV objects used for training MLPRegressor models from sklearn.
Use
import joblib
directly instead of
from sklearn.externals import joblib
Save objects or results with:
joblib.dump(gs, 'model_file_name.pkl')
and load your results using:
joblib.load("model_file_name.pkl")
Here is a simple working example:
import joblib
#save your model or results
joblib.dump(gs, 'model_file_name.pkl')
#load your model for further usage
joblib.load("model_file_name.pkl")
Try this:
from sklearn.externals import joblib
joblib.dump(gs.best_estimator_, 'filename.pkl')
If you want to dump your object into one file - use:
joblib.dump(gs.best_estimator_, 'filename.pkl', compress = 1)
Simple Example:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from sklearn.externals import joblib
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
gs = GridSearchCV(svc, parameters)
gs.fit(iris.data, iris.target)
joblib.dump(gs.best_estimator_, 'filename.pkl')
#['filename.pkl']
EDIT 1:
you can also save the whole object:
joblib.dump(gs, 'gs_object.pkl')
Subclass the sklearn.model_selection._search.BaseSearchCV class. Override the fit(self, X, y=None, groups=None, **fit_params) method, and modify its internal evaluate_candidates(candidate_params) function: instead of immediately returning the results dictionary from evaluate_candidates(candidate_params), perform your serialization there (or in the _run_search method, depending on your use case). With some additional modifications, this approach has the added benefit of allowing you to execute the grid search sequentially (see the comment in the source code here: _search.py). Note that the results dictionary returned by evaluate_candidates(candidate_params) is the same as the cv_results_ dictionary. This approach worked for me, but I was also attempting to add save-and-restore functionality for interrupted grid search executions; a minimal sketch follows.
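A sketch of that idea, assuming a recent sklearn in which evaluate_candidates returns the results dict (the checkpoint path below is illustrative):

import joblib
from sklearn.model_selection import GridSearchCV

class CheckpointingGridSearchCV(GridSearchCV):
    # Wrap evaluate_candidates so every evaluated batch of candidates
    # is serialized before the search continues.
    def _run_search(self, evaluate_candidates):
        def evaluate_and_save(candidate_params):
            results = evaluate_candidates(candidate_params)
            joblib.dump(results, 'partial_cv_results.pkl')  # illustrative path
            return results
        super()._run_search(evaluate_and_save)

Because the Keras estimator itself is what cannot be pickled, this checkpoints only the numeric results, not the fitted models.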
I'm working on the Titanic competition in the Spyder IDE. The code is barely complete, but I'm doing it one step at a time (and this is the first time I've ever built a learning model). Now I'm getting a "Found input variables with inconsistent numbers of samples: [891, 183]" error in the log while trying to run my code. This is what I have so far:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
train_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\train.csv"
train_data = pd.read_csv(train_path)
columns_of_interest = ['Survived','Pclass', 'Sex', 'Age']
filtered_titanic_data = train_data.dropna(axis=0)
x = train_data[columns_of_interest]
y = filtered_titanic_data.Survived
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
titanic_model = DecisionTreeRegressor()
titanic_model.fit(train_x, train_y)
val_predictions = titanic_model.predict(val_x)
print(filtered_titanic_data)
I don't know whether it's coming from the data file or the parameters. I'm sorry if this is a simple question; I couldn't implement other people's solutions.
The error is because you are taking the labels from the filtered data while taking x from the unfiltered data.
Change the following line
x = train_data[columns_of_interest]
to
x = filtered_titanic_data[columns_of_interest]
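As a follow-up sketch, dropping NaNs over only the columns of interest keeps every row that is complete for those fields, and keeps x and y the same length. Note that the string-valued Sex column will still need encoding (pd.get_dummies below is one hedged suggestion) before DecisionTreeRegressor can fit:

filtered_titanic_data = train_data[columns_of_interest].dropna(axis=0)
x = pd.get_dummies(filtered_titanic_data[['Pclass', 'Sex', 'Age']])
y = filtered_titanic_data.Survived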