AttributeError: 'Tokenizer' object has no attribute 'oov_token' in Keras - python

I am trying to encode my text using my loaded tokenizer but am getting the following error
AttributeError: 'Tokenizer' object has no attribute 'oov_token'
I included the code below:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.models import Model, Input, Sequential, load_model
import pickle
import h5py
maxlen = 100
with open('tokenizer.pickle', 'rb') as tok:
    tokenizer = pickle.load(tok)
model = load_model('weights.h5')
def predict():
    # Encode the text and pad it to the length the model expects
    new_text = sequence.pad_sequences(tokenizer.texts_to_sequences(['heyyyy']), maxlen=maxlen)
    prediction = model.predict(new_text, batch_size=1, verbose=2)
The problem occurs on the line tokenizer.texts_to_sequences(['heyyyy']) and I'm not sure why. Is the problem with pickle? tokenizer.texts_to_sequences works with 'hey', 'heyy', and 'heyyy'.
Any guidance is appreciated!

This is most probably a known Keras issue with tokenizers that were pickled under a different version of the library.
You can manually set tokenizer.oov_token = None to fix this.
Pickle is not a reliable way to serialize objects since it assumes
that the underlying Python code/modules you're importing have not
changed. In general, DO NOT use pickled objects with a different
version of the library than what was used at pickling time. That's not
a Keras issue, it's a generic Python/Pickle issue. In this case
there's a simple fix (set the attribute) but in many cases there will
not be.
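A minimal sketch of that fix, assuming the loaded object is only missing attributes that newer library code expects. SimpleNamespace stands in for the pickled tokenizer so the example runs without Keras, and load_with_defaults is a hypothetical helper name, not part of any library:

```python
import pickle
from types import SimpleNamespace


def load_with_defaults(data, defaults):
    """Unpickle an object and fill in attributes it is missing."""
    obj = pickle.loads(data)
    for name, value in defaults.items():
        if not hasattr(obj, name):
            setattr(obj, name, value)
    return obj


# Stand-in for a tokenizer pickled before `oov_token` existed (assumption).
legacy_blob = pickle.dumps(SimpleNamespace(word_index={"hey": 1}))

tokenizer = load_with_defaults(legacy_blob, {"oov_token": None})
print(tokenizer.oov_token)
```

With a real Keras tokenizer the equivalent would be a single assignment after pickle.load, as suggested above. This only patches a missing attribute; it does not protect against deeper changes between library versions.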

Related

OSError when loading tokenizer for huggingface model

I am trying to use this huggingface model and have been following the example provided, but I am getting an error when loading the tokenizer:
from transformers import AutoTokenizer
task = 'sentiment'
MODEL = f"cardiffnlp/twitter-roberta-base-{task}"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
OSError: Can't load tokenizer for 'cardiffnlp/twitter-roberta-base-sentiment'. Make sure that:
'cardiffnlp/twitter-roberta-base-sentiment' is a correct model identifier listed on 'https://huggingface.co/models'
or 'cardiffnlp/twitter-roberta-base-sentiment' is the correct path to a directory containing relevant tokenizer files
What I find very weird is that I was able to run my script several times but ran into an error after some time, while I don't recall changing anything in the meantime. Does anyone know what's the solution here?
EDIT: Here is my entire script:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
import numpy as np
from scipy.special import softmax
import csv
import urllib.request
task = 'sentiment'
MODEL = f"nlptown/bert-base-multilingual-uncased-{task}"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
labels = ['very_negative', 'negative', 'neutral', 'positive', 'very_positive']
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.save_pretrained(MODEL)
text = "I love you"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)
print(scores)
The error seems to start happening when I run model.save_pretrained(MODEL), but this might be a coincidence.
I just came across this same issue. It seems like a bug with model.save_pretrained(), as you noted.
I was able to resolve it by deleting the directory where the model had been saved (cardiffnlp/) and running again without model.save_pretrained().
Not sure what your application is. For me, re-downloading the model each time takes ~5s and that is acceptable.
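A possible explanation, offered as an assumption rather than something confirmed in this thread: model.save_pretrained(MODEL) writes the model files into a local folder whose path equals the model identifier, and a later from_pretrained(MODEL) may then read from that folder and fail because it holds no tokenizer files. The helper below only checks for such a folder with the standard library and does not call transformers. The names local_dir_shadows_model and TOKENIZER_FILES are hypothetical, and the file names listed are typical examples, not a complete list:

```python
import os
import tempfile

# Assumption: typical tokenizer file names; the actual set depends on the model.
TOKENIZER_FILES = {"tokenizer.json", "tokenizer_config.json", "vocab.json", "vocab.txt"}


def local_dir_shadows_model(model_id, base_dir="."):
    """Return True if base_dir has a folder named like model_id with no tokenizer files."""
    path = os.path.join(base_dir, *model_id.split("/"))
    if not os.path.isdir(path):
        return False
    return not (TOKENIZER_FILES & set(os.listdir(path)))


# Demo in a temporary directory: mimic a folder that holds only model files.
with tempfile.TemporaryDirectory() as tmp:
    model_id = "cardiffnlp/twitter-roberta-base-sentiment"
    before = local_dir_shadows_model(model_id, base_dir=tmp)
    model_dir = os.path.join(tmp, "cardiffnlp", "twitter-roberta-base-sentiment")
    os.makedirs(model_dir)
    open(os.path.join(model_dir, "config.json"), "w").close()
    after = local_dir_shadows_model(model_id, base_dir=tmp)

print(before, after)
```

If the check returns True for the working directory, deleting that folder or saving to a differently named directory would be consistent with the fix described above.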

AttributeError: Can't get attribute 'tokenizer' on <module '__main__'>

I trained a logistic regression model on textual data and saved the model using pickle. But for testing when I try to load the model I got the error mentioned in the title while executing the following line:
model = pickle.load(open("sentiment.model", "rb"))
Following is the code used for saving the model:
import pickle
print("[INFO] saving Model...")
f = open('sentiment.model', "wb")
# first I saved the best_estimator_
f.write(pickle.dumps(gs_lr_tfidf.best_estimator_))
# but again I saved the model completely without mentioning any attribute i.e:
# f.write(pickle.dumps(gs_lr_tfidf))
# but none of them helped and I got the same error
f.close()
print("[INFO] Model saved!")
This error doesn't show up when I load the model in the same notebook just after finishing the training process (in the same runtime). But it occurs when I try to load the model separately in a different runtime, even if the model loader code is the same. Why is this happening?
I think the problem comes from how pickle behaves. As @hafiz031 said, it is normal that everything works when the code runs in the same file. The short answer is that you need to import (or define) tokenizer, from whatever library you use, before you load the model.
For people who know Chinese, you can go to this CSDN link for more info.
For people who don't know Chinese, sorry for my bad English; I'll try my best to explain.
The documentation says:
pickle.loads(data, /, *, fix_imports=True, encoding='ASCII', errors='strict', buffers=None)
Return the reconstituted object hierarchy of the pickled representation data of an object. data must be a bytes-like object.
There is an implicit requirement when you use pickle.loads: every class and function in the object hierarchy must already be defined or imported before you load it, because pickle stores them by module and name rather than storing their code. Intuitively, imagine you bring USD to the north pole and want to exchange it for fish with a penguin. Since penguins have no concept of money, they won't make the deal. Pickle is the same: if you haven't imported tokenizer beforehand, then when pickle turns the bytes back into objects it doesn't know what 'tokenizer' is and returns an error. That's why your code works in the training file but fails when you load the model in a different file.
In my case, I just imported an extra library.
# import your own lib
import pickle
import nltk.tokenize
import gensim
import sklearn
#...
with open("sentiment.model", "rb") as f:
    model = pickle.load(f)
#model.predict()
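The behaviour described above can be reproduced with the standard library alone. This is a sketch under stated assumptions: train_script is a hypothetical module name that stands in for the training notebook, and the tokenizer function here is a trivial placeholder, not the one from the question:

```python
import pickle
import sys
import types

# Hypothetical module standing in for the namespace of the training script.
module = types.ModuleType("train_script")
sys.modules["train_script"] = module


def tokenizer(text):
    return text.split()


# Make pickle treat the function as if it had been defined in train_script.
tokenizer.__module__ = "train_script"
tokenizer.__qualname__ = "tokenizer"
module.tokenizer = tokenizer

# The payload stores only the module name and the function name, not the code.
payload = pickle.dumps(tokenizer)

# Simulate a fresh runtime in which the function was never defined.
del module.tokenizer
try:
    pickle.loads(payload)
    load_error = None
except AttributeError as exc:
    load_error = str(exc)

# Defining (or importing) the function again makes the same bytes loadable.
module.tokenizer = tokenizer
restored = pickle.loads(payload)
print(load_error)
```

The same lookup happens for a function defined in a notebook or script, except that the module name stored in the pickle is __main__.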

How to save to disk / export a lightgbm LGBMRegressor model trained in python?

Hi I am unable to find a way to save a lightgbm.LGBMRegressor model to a file for later re-use.
Try:
my_model.booster_.save_model('mode.txt')
#load from model:
bst = lgb.Booster(model_file='mode.txt')
Note: the API states that
bst = lgb.train(…)
bst.save_model('model.txt', num_iteration=bst.best_iteration)
Depending on the version, one of the above works. More generically, you can also use pickle or something similar to freeze your model.
import joblib
# save model
joblib.dump(my_model, 'lgb.pkl')
# load model
gbm_pickle = joblib.load('lgb.pkl')
Let me know if that helps
For Python 3.7 and lightgbm==2.3.1, I found that the previous answers were insufficient to correctly save and load a model. The following worked:
lgbr = lightgbm.LGBMRegressor(n_estimators=200, max_depth=5)
lgbr.fit(train[num_columns], train["prep_time_seconds"])
preds = lgbr.predict(predict[num_columns])
lgbr.booster_.save_model('lgbr_base.txt')
Finally, we can validate that this worked via:
model = lightgbm.Booster(model_file='lgbr_base.txt')
model.predict(predict[num_columns])
Without the above, I was getting the error: AttributeError: 'LGBMRegressor' object has no attribute 'save_model'
With the latest version of LightGBM, using import lightgbm as lgb, here is how to do it:
model.save_model('lgb_classifier.txt', num_iteration=model.best_iteration)
and then you can read the model as follows:
model = lgb.Booster(model_file='lgb_classifier.txt')
clf.save_model('lgbm_model.mdl')
clf = lgb.Booster(model_file='lgbm_model.mdl')

Problem loading an tfidf object with pickle

I have an issue with pickle:
In a previous job, I created an sklearn TfidfVectorizer object and saved it with pickle.
Here is the code I used to do it:
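import pickle
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
def lemma_tokenizer(text):
    # Lemmatize each token after replacing apostrophes with spaces
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in
            word_tokenize(text.replace("'", " "))]
punctuation = list(string.punctuation)
stop_words = set(stopwords.words("english") + punctuation + ['``', "''"] + ['doe', 'ha', 'wa'])
tfidf = TfidfVectorizer(input='content', tokenizer=lemma_tokenizer, stop_words=stop_words)
pickle.dump(tfidf, open(filename_tfidf, 'wb'))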
I saw that if I want to load this tfidf object with pickle, I need to define the function "lemma_tokenizer" beforehand.
So I created the following Python script named 'loadtfidf.py' to load the tfidf object:
import pickle
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
def lemma_tokenizer(text):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in word_tokenize(text.replace("'", " "))]
tfidf = pickle.load(open('tfidf.sav', 'rb'))
If I run this script, the object loads and everything goes well!
BUT then I created another Python script named 'test.py' in the same directory as 'loadtfidf.py', where I simply try to import loadtfidf:
import loadtfidf
And when I run this one line, I get the following error:
"""Can't get attribute 'lemma_tokenizer' on <module '__main__' from '.../test.py'>"""
I really don't understand why, and I don't know what to try to fix this error. Can you help me fix it?
Thank you in advance for your help!
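One possible workaround, sketched here as an assumption and not taken from the thread: the pickle refers to lemma_tokenizer under the module name __main__, which is test.py when loadtfidf is imported. A pickle.Unpickler subclass can override find_class to supply the function from a mapping instead. RedirectingUnpickler and simple_tokenizer are hypothetical names, and the payload is a tiny hand-written pickle so the example runs without nltk or sklearn:

```python
import io
import pickle


class RedirectingUnpickler(pickle.Unpickler):
    """Resolve selected __main__ names from a mapping instead of the running script."""

    def __init__(self, file, replacements):
        super().__init__(file)
        self._replacements = replacements

    def find_class(self, module, name):
        if module == "__main__" and name in self._replacements:
            return self._replacements[name]
        return super().find_class(module, name)


def simple_tokenizer(text):
    # Stand-in for lemma_tokenizer so the example runs without nltk.
    return text.split()


# Hand-written protocol-0 pickle holding only a reference to __main__.lemma_tokenizer.
payload = b"c__main__\nlemma_tokenizer\n."

loaded = RedirectingUnpickler(
    io.BytesIO(payload), {"lemma_tokenizer": simple_tokenizer}
).load()
print(loaded("it works"))
```

In loadtfidf.py the same class could be applied to the opened 'tfidf.sav' file with {"lemma_tokenizer": lemma_tokenizer}. A simpler alternative may be to create the pickle from a script that imports lemma_tokenizer from a module, so that the pickle never refers to __main__.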

Save sklearn cross validation object

Following the tutorial for sklearn, I attempted to save an object that was created via sklearn but was unsuccessful. It appears the problem is with the cross validation object, as I can save the actual (final) model.
Given:
rf_model = RandomForestRegressor(n_estimators=1000, n_jobs=4, compute_importances = False)
cvgridsrch = GridSearchCV(estimator=rf_model, param_grid=parameters,n_jobs=4)
cvgridsrch.fit(X,y)
This will succeed:
joblib.dump(cvgridsrch.best_estimator_, 'C:\\Users\\Desktop\\DMA\\cvgridsrch.pkl', compress=9)
and this will fail:
joblib.dump(cvgridsrch, 'C:\\Users\\Desktop\\DMA\\cvgridsrch.pkl', compress=9)
with error:
PicklingError: Can't pickle <type 'instancemethod'>: it's not found as __builtin__.instancemethod
How to save the full object?
If you are using Python 2,
try:
import dill
So that lambda functions can be pickled....
One possible cause is a multithreading issue; you may refer to this Stack Overflow answer.
Also, is it possible for you to dump your object not via joblib but a more fundamental method like pickle (and not even cPickle, which is more restrictive)?
I know this is an old question, but it might be useful for people coming here having the same, or similar, problem.
I'm not sure of the specific error message, but I managed to successfully save the entire GridSearchCV object in my own project by using pickle:
import pickle
gs = GridSearchCV(some parameters) #create the gridsearch object
gs.fit(X, y) # fit the model
with open('file_name', 'wb') as f:
    pickle.dump(gs, f) # save the object to a file
Then you can use
with open('file_name', 'rb') as f:
    gs = pickle.load(f)
to read the file and hence be able to use the object again.
