Using trained GB classifier for new data - python

I have trained my Gradient Boosting Classifier and saved the model using pickle:

with open("model.bin", 'wb') as f_out:
    pickle.dump(xgb_clf, f_out)
As a data source, I had a .csv file. Now I need to test the model's performance on completely new data, but I do not know how. I found several tutorials, but was unable to proceed. I understand that the key is to load the saved model:
with open('model.bin', 'rb') as f_in:
    model = pickle.load(f_in)
but I do not know how to apply this model to the new data I have in a .csv file. Could you help, please?
Thank you.

The model object you are using should have a prediction method, typically model.predict(x), depending on the library (I'm assuming it is scikit-learn).
You need to load the data from the .csv file:
import pandas as pd
data = pd.read_csv('data.csv')
Select the feature columns the model was trained on, e.g.:
x = data[['col1', 'col2']]
And call the prediction:
res = model.predict(x)
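Putting the pieces together, a minimal end-to-end sketch (file names and column names are placeholders for your own):

import pickle
import pandas as pd

# load the trained model saved earlier
with open('model.bin', 'rb') as f_in:
    model = pickle.load(f_in)

# load the new data and select the same feature columns
# the model was trained on
new_data = pd.read_csv('new_data.csv')
x_new = new_data[['col1', 'col2']]

# predict a label for each new row
predictions = model.predict(x_new)
print(predictions)

If the new .csv also contains the true labels, you can evaluate the model with model.score(x_new, y_new) or the functions in sklearn.metrics.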

You can directly use the predict function, provided the columns of data match the features used in training:
model.predict(data)

Related

Changing the content of Pickle File

I have trained a deep learning model, and it is saved in a pickle file. For various reasons, I have to slightly change the code that produced the pickle file. Training took me months, and I want to keep using the last pickle file created, since the weights will remain the same. Is there any way to view and change the contents of the pickle file?
Edit: For example, if we have the stylegan2 pre-trained network pickle file, and suppose we made changes to the G_synthesis function code (in https://github.com/NVlabs/stylegan2/blob/master/training/networks_stylegan2.py), how can we use the old pickle file?
If you just want to change some functions but keep the same weights, you can copy the weights over to the new model like this:
import pickle
from old_model_file import old_model
from new_model_file import new_model

# 1. load the old pickle file
with open('old.pickle', 'rb') as f:
    old_pickle = pickle.load(f)

# 2. create a model from the new code
new_pickle = new_model()

# 3. copy all the weights from old_pickle to new_pickle, for example:
# new_pickle.weight_A = old_pickle.weight_A
# new_pickle.weight_B = old_pickle.weight_B

# 4. save the new model
with open('new.pickle', 'wb') as f:
    pickle.dump(new_pickle, f)
Is this what you want?
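As for viewing the contents: once loaded, the pickle is just an ordinary Python object, so the usual introspection tools work. A minimal sketch (what you see depends entirely on your model class):

import pickle

with open('old.pickle', 'rb') as f:
    obj = pickle.load(f)

print(type(obj))   # the class of the stored object
print(dir(obj))    # its attributes and methods
print(vars(obj))   # its instance attributes, if it has a __dict__

Keep in mind that unpickling needs the original class definitions to be importable, which is exactly why changing the code can break loading the old file.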

Tensorflow concatenate tf.data.Dataset.list_files

I'm running my current code on a data set with the following naming conventions:
Training files: training-??-of-??, where the ?? are wildcards (placeholders for any range).
The same convention applies to the validation and test files (e.g. validation-??-of-??).
In my code I create the file pattern like this:
training_file_pattern = os.path.join(config['data_dir'], "training-??-of-??")
Now I also want to train my model on the training and validation sets together, but I'm having trouble figuring out how to take both datasets. For training I would do:
tf_data_files = tf.data.Dataset.list_files(training_file_pattern, seed=1234, shuffle=self.shuffle)
I thought I could do the same with the validation set and concatenate the two like this:
tf_data_files = tf.concat(tf_data_files, tf.data.Dataset.list_files(validation_file_pattern, seed=1234, shuffle=self.shuffle))
But it doesn't work correctly. What would be the correct way to do it?
I also tried to define the file pattern differently so that it also covers the validation files, but I don't know how to do that without also picking up the test set (they are all in the same folder). So I cannot do this:
training_and_validation_file_pattern = os.path.join(config['data_dir'], "?-??-of-??")
because this would also match the test set, right?
Any help would be much appreciated.
If I got your point, you can simply do
dataset = tf.data.Dataset.list_files(os.listdir('path'))
dataset = tf.data.TextLineDataset(dataset)
The Dataset API also has a concatenate method,
dataset = dataset_1.concatenate(dataset_2)
but it's not completely clear whether you need it.
Edit:
list_files creates a dataset of file names:
dataset = tf.data.Dataset.list_files(['f1.csv', 'f2.csv'])
for i in dataset:
    print(i)  # e.g. tf.Tensor(b'f1.csv', shape=(), dtype=string)
I'm using TF 2.0, just for clarity.
On the other hand, tf.data.TextLineDataset() outputs the actual values from the text file, like
tf.Tensor(b'0.7079635943784122,0.9659163071487907', shape=(), dtype=string)
So using just list_files creates a dataset of file names, not their contents, and you will need to apply an additional parsing function to the dataset.
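For the concrete case in the question, a minimal sketch that combines the training and validation file lists (patterns and config taken from the question; I'm assuming shuffle=True in place of self.shuffle):

import os
import tensorflow as tf

training_pattern = os.path.join(config['data_dir'], "training-??-of-??")
validation_pattern = os.path.join(config['data_dir'], "validation-??-of-??")

# two datasets of file names
train_files = tf.data.Dataset.list_files(training_pattern, seed=1234, shuffle=True)
val_files = tf.data.Dataset.list_files(validation_pattern, seed=1234, shuffle=True)

# concatenate the file-name datasets, then read the file contents
# (substitute tf.data.TFRecordDataset if your files are TFRecords)
all_files = train_files.concatenate(val_files)
dataset = tf.data.TextLineDataset(all_files)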

Using sklearn-python model in spark ML 2.2.0 for prediction

I am working on a text classification problem in python using sklearn. I have created the model and saved it in a pickle.
Below is the code I used in sklearn.
import pickle
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

vectorizerPipe = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, stop_words='english')),
    ('classification', OneVsRestClassifier(LinearSVC(penalty='l2', loss='hinge'))),
])
prd = vectorizerPipe.fit(features_used, labels_used)
with open(file_path, 'wb') as f:
    pickle.dump(prd, f)
Is there any way to use this same pickle to get predictions in DataFrame-based Apache Spark, rather than RDD-based? I have gone through the following articles but didn't find a proper way to implement it:
what-is-the-recommended-way-to-distribute-a-scikit-learn-classifier-in-spark
how-to-do-prediction-with-sklearn-model-inside-spark
-> I found both of these questions on Stack Overflow and found them useful.
deploy-a-python-model-more-efficiently-over-spark
I am a beginner in machine learning, so pardon me if the explanation is naive. Any related example or implementation would be helpful.
You can convert an RDD to a Spark DataFrame, like this (in Scala):
import spark.implicits._
val testDF = rdd.map { line =>
  (line._1, line._2)
}.toDF("col1", "col2")
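For the DataFrame-based side of the question, one common pattern in PySpark is to load the pickled sklearn pipeline and wrap prediction in a pandas UDF; Spark then ships the model to the executors with the UDF closure. A minimal sketch, assuming Spark 3.x, a DataFrame df with a 'text' column, and string labels (names and paths are illustrative):

import pickle
import pandas as pd
from pyspark.sql.functions import pandas_udf

# load the pickled sklearn pipeline on the driver
with open('/path/to/model.pkl', 'rb') as f:
    pipeline = pickle.load(f)

@pandas_udf('string')
def predict_udf(texts: pd.Series) -> pd.Series:
    # the pipeline contains the TfidfVectorizer, so raw text goes in
    return pd.Series(pipeline.predict(texts))

predictions_df = df.withColumn('prediction', predict_udf(df['text']))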

Sentiment analysis using pyspark

Since I am new to pyspark, can anyone help me with a pyspark implementation of sentiment analysis? I have done the Python implementation below. What changes need to be made?
import nltk
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
from nltk.classify import NaiveBayesClassifier

def format_sentence(sent):
    return {word: True for word in nltk.word_tokenize(sent)}

# print(format_sentence("The cat is very cute"))

pos = []
with open("./pos_tweets.txt") as f:
    for i in f:
        pos.append([format_sentence(i), 'pos'])

neg = []
with open("./neg_tweets.txt") as fp:
    for i in fp:
        neg.append([format_sentence(i), 'neg'])

# next, split the labeled data into training and test sets
training = pos[:int(.8 * len(pos))] + neg[:int(.8 * len(neg))]
test = pos[int(.8 * len(pos)):] + neg[int(.8 * len(neg)):]

classifier = NaiveBayesClassifier.train(training)

example1 = "no!"
print(classifier.classify(format_sentence(example1)))
The pattern would typically be:
Convert your data into a Spark DataFrame:
df = spark.read.csv('./neg_tweets.txt')
You can do the train/test split here:
train_df, test_df = df.randomSplit([0.8, 0.2])
Find a suitable model. If naive Bayes works for you, in pyspark it will look something like this:
from pyspark.ml.classification import NaiveBayes
Otherwise, for sentiment analysis specifically there may not be a model built into spark.ml/mllib; you may need to look at external projects.
Iterate on the model and its tuning parameters.
You can run an evaluator for the metrics you decide are important to your problem. Some examples for binary classification problems are here:
https://spark.apache.org/docs/2.2.0/mllib-evaluation-metrics.html#binary-classification
metrics = BinaryClassificationMetrics(predictionAndLabels)
A fuller sketch of this pipeline follows below.
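As referenced above, a minimal end-to-end sketch of that pattern with spark.ml (column names are illustrative; NaiveBayes expects a numeric label column and a feature vector column):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import NaiveBayes

# assume df has a 'text' column and a numeric 'label' column (e.g. 0.0 / 1.0)
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

tokenizer = Tokenizer(inputCol='text', outputCol='words')
tf = HashingTF(inputCol='words', outputCol='features')
nb = NaiveBayes(featuresCol='features', labelCol='label')

model = Pipeline(stages=[tokenizer, tf, nb]).fit(train_df)
model.transform(test_df).select('text', 'label', 'prediction').show()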

Pyspark - Load trained model word2vec

I want to use word2vec with PySpark to process some data.
I was previously using the Google pre-trained model GoogleNews-vectors-negative300.bin with gensim in Python.
Is there a way to load this .bin file with mllib.word2vec?
Or does it make sense to export the data from Python as a dictionary {word: [vector]} (or a .csv file) and then load it in PySpark?
Thanks
Binary file import is supported in Spark 3.x:
spark.read.format("binaryFile").option("pathGlobFilter", "*.bin").load("/path/to/data")
However, this gives you raw bytes that you would still have to parse yourself. Exporting from gensim is therefore the more practical route:
# export the gensim vectors in the plain-text word2vec format
# (a header line, then one line per word: the word followed by its vector)
trained_model.wv.save_word2vec_format("stored_model.txt", binary=False)
Then load the file in pyspark:
df = spark.read.load("stored_model.txt",
                     format="csv",
                     sep=" ",
                     inferSchema="true",
                     header="false")
Note that the first line of the file holds the vocabulary size and vector dimension, so you may want to drop it after loading.
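Alternatively, you can load the Google .bin file with gensim and build the Spark DataFrame directly. A minimal sketch, assuming gensim 4.x is available on the driver (older gensim exposes the vocabulary as kv.vocab instead of kv.index_to_key):

from gensim.models import KeyedVectors

# load the pre-trained vectors (binary word2vec format)
kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# build (word, vector) rows; the full vocabulary is ~3M words,
# so you may want to restrict it to the words you actually need
rows = [(w, kv[w].tolist()) for w in kv.index_to_key[:100000]]
df = spark.createDataFrame(rows, ["word", "vector"])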
