PySpark - load trained word2vec model - python

I want to use word2vec with PySpark to process some data.
I was previously using the Google-trained model GoogleNews-vectors-negative300.bin with gensim in Python.
Is there a way to load this bin file with mllib.word2vec?
Or does it make sense to export the data from Python as a dictionary {word: [vector]} (or a .csv file) and then load it in PySpark?
Thanks

Binary file import is supported in Spark 3.x:
spark.read.format("binaryFile").option("pathGlobFilter", "*.bin").load("/path/to/data")
However, this only gives you the raw bytes, which you would still have to parse yourself. Exporting from gensim to plain text is therefore the simpler route. Note that trained_model.save() writes gensim's own format, not CSV; to get a text file Spark can read, export the word vectors instead:
# Export the word vectors as space-separated text ("word v1 v2 ... vN" per line)
filename = "stored_model.csv"
trained_model.wv.save_word2vec_format(filename, binary=False)
Then load the file in PySpark (the first line of the export holds the vocabulary size and vector dimension, so drop it after loading):
df = spark.read.load("stored_model.csv",
                     format="csv",
                     sep=" ",
                     inferSchema="true",
                     header="false")

Related

How to extract metadata from tflite model

I'm loading this object detection model in python. I can load it with the following lines of code:
import tflite_runtime.interpreter as tflite
model_path = 'path_to_model_file.tf'
interpreter = tflite.Interpreter(model_path)
I'm able to perform inference on this without any problem. However, the labels are supposed to be included in the metadata, according to the model's documentation, but I can't extract them.
The closest I got was by following this:
from tflite_support import metadata as _metadata
displayer = _metadata.MetadataDisplayer.with_model_file(model_path)
export_json_file = "extracted_metadata.json"
json_file = displayer.get_metadata_json()
# Optional: write out the metadata as a json file
with open(export_json_file, "w") as f:
    f.write(json_file)
but the very first line of code fails with this error: AttributeError: 'int' object has no attribute 'tobytes'.
How to extract it?
If you only care about the label file, you can simply run a command like unzip model_path on Linux or Mac. A TFLite model with metadata is essentially a zip file. See the public introduction for more details.
Your code snippet to extract metadata works on my end. Make sure to double-check model_path. It should be a string, such as "lite-model_ssd_mobilenet_v1_1_metadata_2.tflite".
If you'd like to read label files in an Android app, here is the sample code to do so.
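Since a TFLite model with metadata is just a zip archive, the associated files can also be pulled out directly from Python with the standard library. A small sketch; the label file name inside the archive (labelmap.txt here) is an assumption, so check namelist() first:

import zipfile

model_path = "lite-model_ssd_mobilenet_v1_1_metadata_2.tflite"

with zipfile.ZipFile(model_path) as archive:
    print(archive.namelist())              # lists the packed associated files
    archive.extract("labelmap.txt", ".")   # file name assumed; pick it from namelist()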

Using trained GB classifier for new data

I have trained my Gradient Boosting Classifier and saved the model using pickle
with open("model.bin", 'wb') as f_out:
pickle.dump(xgb_clf, f_out)
As a data source, I had a .csv file.
Now I need to test the performance on completely new data, but I do not know how.
I found several tutorials, but was unable to proceed.
I understand that the key is to load the saved model
with open('model.bin', 'rb') as f_in:
    model = pickle.load(f_in)
but I do not know how to apply this model on new data I have in csv.
Could you help, please?
Thank you.
The model object you are using should have a prediction method similar to model.predict(x), depending on the library (I'm assuming it is scikit-learn).
You need to load the data from the .csv file:
import pandas as pd
data = pd.read_csv('data.csv')
Select columns that belong to x:
x = data[['col1', 'col2']]
And call the prediction:
res = model.predict(x)
You can directly use the predict function.
model.predict(data)
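Putting those pieces together, here is a minimal end-to-end sketch; the file names and the feature columns col1/col2 are placeholders for whatever your training data actually used:

import pickle
import pandas as pd

# Load the saved classifier
with open("model.bin", "rb") as f_in:
    model = pickle.load(f_in)

# Load the new data and keep the same feature columns used for training
new_data = pd.read_csv("new_data.csv")          # placeholder file name
x_new = new_data[["col1", "col2"]]              # placeholder column names

# Predict labels (and class probabilities, if the model supports it)
predictions = model.predict(x_new)
probabilities = model.predict_proba(x_new)
print(predictions[:10])

The important point is that the new data must be preprocessed exactly like the training data (same columns, same order, same encodings); otherwise predict will fail or give meaningless results.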

Using sklearn-python model in spark ML 2.2.0 for prediction

I am working on a text classification problem in python using sklearn. I have created the model and saved it in a pickle.
Below is the code I used in sklearn.
vectorizerPipe = Pipeline([('tfidf', TfidfVectorizer(lowercase=True,
                                                     stop_words='english')),
                           ('classification', OneVsRestClassifier(LinearSVC(penalty='l2', loss='hinge')))])
prd = vectorizerPipe.fit(features_used, labels_used)
f = open(file_path, 'wb')
pickle.dump(prd, f)
Is there any way to use this same pickle to get predictions with DataFrame-based Apache Spark rather than RDD-based? I have gone through the following articles but didn't find a proper way to implement it.
what-is-the-recommended-way-to-distribute-a-scikit-learn-classifier-in-spark
how-to-do-prediction-with-sklearn-model-inside-spark
I found both of these questions on Stack Overflow and found them useful.
deploy-a-python-model-more-efficiently-over-spark
I am a beginner in machine learning, so pardon me if the explanation is naive. Any related example or implementation would be helpful.
You can convert an RDD to a Spark DataFrame like this (Scala):
import spark.implicits._
val testDF = rdd.map { line =>
  (line._1, line._2)
}.toDF("col1", "col2")

Word2Vec Vocabulary not definded error

I am new to python and word2vec and keep getting a "you must first build vocabulary before training the model" error. What is wrong with my code?
Here is my code:
file_object=open("SupremeCourt.txt","w")
from gensim.models import word2vec
data = word2vec.Text8Corpus('SupremeCourt.txt')
model = word2vec.Word2Vec(data, size=200)
out=model.most_similar()
print(out[1])
print(out[2])
I can see a couple of problems in your code: the file is opened in write mode, and most_similar() is called without the word for which you want to find the most similar words.
I would suggest either loading a pretrained model such as the Google News vectors into gensim, or building your own word2vec model, so that you won't run into this error.
The usage of most_similar in gensim is out = model.most_similar("word-name"), for example:
from gensim.models import word2vec

data = word2vec.Text8Corpus('SupremeCourt.txt')   # Text8Corpus reads the file itself; no need to open it
model = word2vec.Word2Vec(data, size=200)          # or load the Google News vectors here instead
out = model.most_similar("word")
print(out)
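If you go the pretrained route, recent gensim versions ship a downloader that fetches the Google News vectors for you. A short sketch; the model name comes from the gensim-data catalogue, and the download is roughly 1.6 GB (cached locally after the first run):

import gensim.downloader as api

# Download (once) and load the pretrained Google News vectors
vectors = api.load("word2vec-google-news-300")

print(vectors.most_similar("court"))   # nearest neighbours of "court"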
You're opening that file in write mode with this line:
file_object = open("SupremeCourt.txt", "w")
By doing this, you're erasing the contents of your file, so that when you try to pass the file to the model for training, there is no data to read. That's why that error is thrown.
Remove that line (and also restore your file contents), and it'll work.

Update spaCy Vocabulary

I was wondering if it is possible to update spaCy's default vocabulary. What I am trying to do is this:
run word2vec on my own corpus with gensim
load the vectors into my model with nlp.vocab.load_vectors_from_bin_loc(\path)
But since a lot of words in my corpus aren't in spaCy's default vocabulary, I can't make use of the imported vectors. Is there an (easy) way to add those missing types?
Edit:
I realize it might be problematic to mix vectors. So my question is:
How can I import a custom vocabulary into spacy?
This is much easier in the next version, which should be out this week (I'm just finishing testing it). For now:
By default spaCy loads a data/vocab/vec.bin file, where the "data" directory is within the spacy.en module directory
Create the vec.bin file from a bz2 file using spacy.vocab.write_binary_vectors
Either replace spaCy's vec.bin file, or call nlp.vocab.load_rep_vectors at run-time, with the path to the binary file.
The above is a bit inconvenient at first, but the binary file format is much smaller and faster to load, and the vector files are fairly big. Note that GloVe is distributed in gzip format, not bzip2.
Out of interest: are you using the GloVe vectors, or something you trained on your own data? If your own data, did you use Gensim? I'd like to make this much easier, so I'd appreciate suggestions for what work-flow you'd like to see.
Load new vectors at run-time, optionally converting them:
import spacy.vocab

def set_spacy_vectors(nlp, binary_loc, bz2_loc=None):
    # Convert the bz2 vectors file to spaCy's binary format first, if one is given
    if bz2_loc is not None:
        spacy.vocab.write_binary_vectors(bz2_loc, binary_loc)
    nlp.vocab.load_rep_vectors(binary_loc)
Replace the vec.bin, so your vectors will be loaded by default:
from spacy.vocab import write_binary_vectors
import spacy.en
from os import path
import plac

def main(bz2_loc):
    bin_loc = path.join(path.dirname(spacy.en.__file__), 'data', 'vocab', 'vec.bin')
    write_binary_vectors(bz2_loc, bin_loc)

if __name__ == '__main__':
    plac.call(main)
