HuggingFace BPE Trainer Error - Training Tokenizer

I am trying to train a ByteLevelBPETokenizer from an iterable instead of from files. I must be doing something wrong when I instantiate the trainer, but I can't tell what it is. When I try to train the tokenizer on my dataset (clothing data from Kaggle) with the BpeTrainer, I get this error:
**TypeError**: 'tokenizers.trainers.BpeTrainer' object cannot be interpreted as an integer
I am using Colab.
Step 1: Install tokenizers & download the Kaggle data
!pip install tokenizers
# Download clothing data from Kaggle
# https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews/version/1?select=Womens+Clothing+E-Commerce+Reviews.csv
Step 2: Upload the file
# use colab file upload
from google.colab import files
uploaded = files.upload()
Step 3: Clean the data (remove floats) & run trainer
import io
import pandas as pd
# convert the csv to a dataframe so it can be parsed
data = io.BytesIO(uploaded['clothing_dataset.csv'])
df = pd.read_csv(data)
# convert the review text to a list so it can be passed as iterable to tokenizer
clothing_data = df['Review Text'].to_list()
# Remove float values from the data
clean_data = []
for item in clothing_data:
    if type(item) != float:
        clean_data.append(item)
from tokenizers import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing
from tokenizers import trainers, pre_tokenizers
from tokenizers.trainers import BpeTrainer
from pathlib import Path
# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer(lowercase=True)
# Instantiate BpeTrainer
trainer = BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    show_progress=True,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
# Train the tokenizer
tokenizer.train_from_iterator(clean_data, trainer)
Error - I can see that the trainer is a BpeTrainer Type.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-103-7738a7becb0e> in <module>()
34
35 # Train the tokenizer
---> 36 tokenizer.train_from_iterator(clean_data, trainer)
/usr/local/lib/python3.7/dist-packages/tokenizers/implementations/byte_level_bpe.py in train_from_iterator(self, iterator, vocab_size, min_frequency, show_progress, special_tokens)
119 show_progress=show_progress,
120 special_tokens=special_tokens,
--> 121 initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
122 )
123 self._tokenizer.train_from_iterator(iterator, trainer=trainer)
TypeError: 'tokenizers.trainers.BpeTrainer' object cannot be interpreted as an integer
Interesting note: if I pass the trainer as a keyword argument (trainer=trainer), I get this instead:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-104-64737f948e6d> in <module>()
34
35 # Train the tokenizer
---> 36 tokenizer.train_from_iterator(clean_data, trainer=trainer)
TypeError: train_from_iterator() got an unexpected keyword argument 'trainer'

I haven't used train_from_iterator before but looking at the HF docs it seems you should use a generator function. So something like:
def clothing_generator():
    for item in clothing_data:
        if type(item) != float:
            yield item
Followed by:
tokenizer.train_from_iterator(clothing_generator(), trainer)
Might help?
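It may also be worth looking at the signature printed in the question's traceback: train_from_iterator(self, iterator, vocab_size, min_frequency, show_progress, special_tokens). With that parameter list, a second positional argument lands in vocab_size, which would explain the "cannot be interpreted as an integer" message, and there is no trainer parameter to accept a keyword. The sketch below only illustrates that binding with a hypothetical stand-in function and a placeholder FakeTrainer class; neither is part of the tokenizers library, and the default values are made up.

```python
import inspect


def train_from_iterator(iterator, vocab_size=None, min_frequency=None,
                        show_progress=True, special_tokens=None):
    """Hypothetical stand-in that only echoes the settings it receives."""
    return {
        "vocab_size": vocab_size,
        "min_frequency": min_frequency,
        "show_progress": show_progress,
        "special_tokens": special_tokens,
    }


class FakeTrainer:
    """Placeholder for the BpeTrainer instance from the question."""


signature = inspect.signature(train_from_iterator)
texts = ["love this dress", "runs a little small"]

# A trainer passed positionally is bound to vocab_size, the second parameter.
bound = signature.bind(texts, FakeTrainer())
bound_to = [name for name, value in bound.arguments.items()
            if isinstance(value, FakeTrainer)]
print(bound_to)

# A trainer= keyword is rejected because no such parameter exists.
try:
    signature.bind(texts, trainer=FakeTrainer())
    keyword_error = None
except TypeError as exc:
    keyword_error = str(exc)
print(keyword_error)

# Passing the settings themselves as keywords matches the signature.
settings = train_from_iterator(
    texts,
    vocab_size=20000,
    min_frequency=2,
    show_progress=True,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
print(settings["vocab_size"])
```

If the real wrapper behaves the way its traceback suggests (it appears to build its own trainer internally before calling self._tokenizer.train_from_iterator), the equivalent call on the ByteLevelBPETokenizer would pass vocab_size, min_frequency, show_progress and special_tokens directly and drop the BpeTrainer object.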

Related

splitting annotated images with json file

To keep it short, I'm trying to split annotated images into train, val, and test sets. I used the splitfolders library, but it only split the images and left the annotation file (.JSON) unsplit.
So I tried the annotated_images library, since I've seen it can do that, but I got an error:
TypeError Traceback (most recent call last)
<ipython-input-19-2345d5e2fe5c> in <module>()
6 RANDOM_SEED=1337
7 IMG_PATH= '/content/Untitled Folder'
----> 8 annotated_images.split(IMG_PATH,'/content/', seed=1337, ratio=(0.8,0.15,0.05))
9
10 import splitfolders
3 frames
/usr/local/lib/python3.7/dist-packages/annotated_images/split.py in list_files(directory)
26 if len(files) == 0:
27 files = glob.glob
---> 28 if len(files) == 0:
29 files = glob.glob(directory + "*.*")
30 return files
TypeError: object of type 'function' has no len()
I'm using Google Colab.
Can anyone help, please? Or is there any other way to split images together with their respective annotations?
Thank you.
PS: I'm fairly new to Python and not very familiar with it.
PS2: the code:
import annotated_images
RANDOM_SEED=1337
IMG_PATH= '/content/Untitled Folder'
annotated_images.split(IMG_PATH,'/content/', seed=1337, ratio=(0.8,0.15,0.05))

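A simple oversight. The traceback shows the library building its search pattern as directory + "*.*", so without a trailing slash the pattern does not point inside the folder. Add a slash to the end of your path, like this: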
IMG_PATH = '/content/Untitled Folder/'
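To see why the trailing slash matters, here is a small sketch of the same string concatenation using only the standard library. The folder layout (an images directory holding photo.jpg) is hypothetical and is created in a temporary directory.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: <root>/images/photo.jpg
    image_dir = os.path.join(root, "images")
    os.makedirs(image_dir)
    with open(os.path.join(image_dir, "photo.jpg"), "w") as handle:
        handle.write("placeholder")

    # Without a trailing separator the pattern becomes "<root>/images*.*",
    # which looks for siblings of the folder rather than files inside it.
    without_slash = glob.glob(image_dir + "*.*")

    # With a trailing separator the pattern becomes "<root>/images/*.*".
    with_slash = glob.glob(image_dir + os.sep + "*.*")
    found_names = [os.path.basename(path) for path in with_slash]

print(without_slash)
print(found_names)
```

Using os.path.join(directory, "*.*") instead of plain concatenation would avoid the problem regardless of how the path is written, but that is a change inside the library rather than in the calling code.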

pycaret compare_models() call doesn't recognize sort

After creating the clr_default:
clr_default = setup(df_rain_definitivo_one_drop_catboost_norm_fs_dropna,fold_shuffle=True, target='RainTomorrow', session_id=123)
I've tried to use the compare_models() function in Pycaret, using the following call:
best_model = compare_models()
from pycaret.classification import *
However I get the following error message:
ValueError Traceback (most recent call last)
<ipython-input-228-e1d76b68915a> in <module>()
----> 1 best_model = compare_models(n_select = 5, sort='Accuracy')
1 frames
/usr/local/lib/python3.7/dist-packages/pycaret/internal/tabular.py in compare_models(include, exclude, fold, round, cross_validation, sort, n_select, budget_time, turbo, errors, fit_kwargs, groups, verbose, display)
1954 if sort is None:
1955 raise ValueError(
-> 1956 f"Sort method not supported. See docstring for list of available parameters."
1957 )
1958
ValueError: Sort method not supported. See docstring for list of available parameters.
I've also tried calling compare_models() with sort='Accuracy', but it didn't help.
I'm on Google Colab as well.

I don't understand what n_select = 5 is for. Do you want the top 5 models? If not, try the following, using your code examples.
First, import pycaret:
from pycaret.classification import *
Then run setup:
clr_default = setup(df_rain_definitivo_one_drop_catboost_norm_fs_dropna,fold_shuffle=True, target='RainTomorrow', session_id=123)
Finally, call compare_models:
best_model = compare_models(sort='Accuracy')
After that you can create your models and then tune them.

AttributeError: 'NoneType' object has no attribute 'lower'

I am trying to apply CountVectorizer to tag data, but every time it throws an AttributeError. I have tried everything and still can't understand why. This is my code:
vectorizer = CountVectorizer(tokenizer = lambda x: x.split(" "))
tag_dtm = vectorizer.fit_transform(tag_data['Tags'])
and this is the error I get:
AttributeError Traceback (most recent call last)
<ipython-input-53-7a05ab3b6655> in <module>()
7 # and learns the vocabulary; second, it transforms our training data
8 # into feature vectors. The input to fit_transform should be a list of strings.
----> 9 tag_dtm = vectorizer.fit_transform(tag_data['Tags'])
3 frames
/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py in _preprocess(doc, accent_function, lower)
66 """
67 if lower:
---> 68 doc = doc.lower()
69 if accent_function is not None:
70 doc = accent_function(doc)
AttributeError: 'NoneType' object has no attribute 'lower'

You can filter out null values with a list comprehension before fitting:
tag_dtm = vectorizer.fit_transform([str(val) for val in tag_data['Tags'] if val is not None and val is not np.nan])
Let me know if this works for you!

'doc' does not contain a string here; it is None. None is not the same as an empty or blank string; it is its own type (NoneType).
You can check for it with a condition like "if doc is None:", which will be true for that entry.

Try converting the values to strings: lower() is a string method, and because one of the values passed to CountVectorizer is None, the call fails at that point.
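The answers above all point to a None entry in the input. As a minimal sketch of the cleaning step, the helper below drops both None and float NaN before the values would reach a vectorizer. The function name clean_tags and the sample values are made up for illustration, and the code uses only the standard library.

```python
import math


def clean_tags(values):
    """Keep usable entries as strings; drop None and float NaN."""
    cleaned = []
    for value in values:
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        cleaned.append(str(value))
    return cleaned


# Hypothetical column contents with two kinds of missing value.
raw_tags = ["python pandas", None, float("nan"), "c++ templates"]
tags = clean_tags(raw_tags)
print(tags)
```

If tag_data is a pandas DataFrame, tag_data['Tags'].dropna() should give the same effect in one call, since dropna() removes both None and NaN from a Series.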

'list' object is not callable for checking score accuracy?

I am creating a model using SVM. I want to save the classifier model and the parameters that were used to an Excel file and a .json file, which will later be opened to find the best model among all the .json files.
However, I got this error when I tried to run the second part of the code:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-9fd85866127d> in <module>
88 for x in func:
89 count=count+1
---> 90 train_val(x[0],x[1],x[2],count)
91 end_time = time.time()
<ipython-input-4-9fd85866127d> in train_val(kernel, c, gamma, count)
43 scoring.append(score(y_test, predictions))
44 else:
---> 45 scoring.append(score(y_test, predictions,average='macro'))
46
47 # saving kernel that is used to the list
TypeError: 'list' object is not callable
I didn't name anything 'list', so it shouldn't have been overridden. What makes score uncallable? Thank you.

You create lists:
accuracy = []
precision = []
recall = []
f1 = []
...
and you define scores to hold these lists:
scores = [accuracy, precision, recall, f1]
Then you iterate over these lists:
for score in scores:
    ...
But inside that loop you use these lists as if they're functions:
score(y_test, predictions)
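The pattern described above can be reproduced in a few lines, together with one possible way around it. The question does not show which metric functions were used, so match_rate below is a hypothetical stand-in, and the dictionary layout is only a suggestion.

```python
def match_rate(y_true, y_pred):
    """Hypothetical metric: fraction of predictions that equal the labels."""
    hits = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return hits / len(y_true)


y_test = [1, 0, 1, 1]
predictions = [1, 0, 0, 1]

# Reproducing the problem: the loop variable is a list, so calling it fails.
accuracy, precision = [], []
try:
    for score in [accuracy, precision]:
        score(y_test, predictions)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# One way around it: keep metric functions and result lists under separate
# names, keyed by the same label.
metrics = {"match_rate": match_rate}
results = {name: [] for name in metrics}
for name, metric in metrics.items():
    results[name].append(metric(y_test, predictions))
print(results)
```

Keeping the callables in metrics and the collected values in results means the loop variable that gets called is always a function, never one of the lists.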

When I use TF-IDF in natural language processing, I get "'list' object is not callable". Can you help me with it?

I got an error like this:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-38-b9ac626e6121> in <module>
5
6 # Fitting TF-IDF to both training and test sets (semi-supervised learning)
----> 7 tfv.fit(list(xtrain) + list(xvalid))
8 xtrain_tfv = tfv.transform(xtrain)
9 xvalid_tfv = tfv.transform(xvalid)
TypeError: 'list' object is not callable
when I run this code in Python:
tfv = TfidfVectorizer(min_df=3, max_features=None,
                      strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
                      ngram_range=(1, 3), use_idf=1, smooth_idf=1, sublinear_tf=1,
                      stop_words='english')
# Fitting TF-IDF to both training and test sets (semi-supervised learning)
tfv.fit(list(xtrain) + list(xvalid))
xtrain_tfv = tfv.transform(xtrain)
xvalid_tfv = tfv.transform(xvalid)
P.S. I also tried converting xtrain to a list with xtrain.tolist(), but that didn't work either.

Nothing in the code you provided looks wrong. However, I suspect that somewhere before that block you assigned an object to the name list (most likely something along the lines of list = [...]), which is the usual cause of this error.
Try to find that line, if it exists, and rename the variable. In general it is not a good idea to reuse the names of built-in types, for exactly this reason.
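A short sketch of the suspected cause, assuming the name list was reassigned earlier in the session. The shadowing is confined to a function here so the example does not disturb the real built-in; the function names and sample data are invented for illustration.

```python
def combine_shadowed(xtrain, xvalid):
    # The local name "list" now refers to a list object, not the built-in type.
    list = ["some", "earlier", "values"]
    return list(xtrain) + list(xvalid)


def combine_fixed(xtrain, xvalid):
    # Same data under a different name, so list() is still the built-in.
    earlier_values = ["some", "earlier", "values"]
    return list(xtrain) + list(xvalid)


xtrain = ("good movie", "bad plot")
xvalid = ("great cast",)

try:
    combine_shadowed(xtrain, xvalid)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

combined = combine_fixed(xtrain, xvalid)
print(combined)
```

In a notebook where the reassignment happened at the top level, running del list in a cell removes the shadowing name and makes the built-in visible again for the rest of the session.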
