I am new to machine learning. I have a dataset with 4000-5000 items; they are all product descriptions, along with the result for each example. I want to train a model to classify them into 1 or 0. Can I train it with this kind of text?
Yes, of course you can! The keyword you are searching for is sentiment analysis. Take a look at this article by Hugging Face on sentiment analysis in Python, both with a pre-trained model and from scratch.
Using pretrained models
Install the Hugging Face transformers Python package: pip install -q transformers
Import the sentiment-analysis pipeline provided by Hugging Face, which wraps publicly available pre-trained models from the Hugging Face Hub:
from transformers import pipeline
sentiment_pipeline = pipeline("sentiment-analysis")
Inference
data = [PRODUCT_DESCRIPTION_1, PRODUCT_DESCRIPTION_2, ...]
results = sentiment_pipeline(data)
results then contains a list of objects with a property "label" ("POSITIVE"/"NEGATIVE") and a property "score" (confidence score). In order to turn these into ratings (1-5), you could, for example, sample stars from a Gaussian distribution whose parameters depend on the predicted sentiment (e.g. positive: mean of 4 with a variance of 1; negative: mean of 2 with a variance of 1).
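A minimal sketch of such a mapping (purely illustrative; the means and the clamping to the 1-5 range are my assumptions):

import random

def score_to_stars(label):
    # Illustrative assumption: sample a star count from a normal distribution
    # whose mean depends on the predicted sentiment, then clamp it to 1-5.
    mean = 4 if label == "POSITIVE" else 2
    stars = round(random.gauss(mean, 1))  # standard deviation of 1 (= variance of 1)
    return min(5, max(1, stars))

stars = [score_to_stars(r["label"]) for r in results]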
Fine-tuning a model
To achieve better results than binary classification plus randomness for producing star ratings, you would likely need to fine-tune an existing machine learning model for sentiment analysis. Fine-tuning in this context means that you build on an existing model with existing weights and use your own, smaller dataset to adapt that model to your specific needs. You can do this with the Hugging Face library as well:
Install the Python packages: pip install datasets transformers huggingface_hub
Preprocess your own dataset by shuffling your samples and splitting it into a train and a test set.
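A minimal sketch of this step with the datasets library (texts and labels are hypothetical Python lists holding your product descriptions and their ratings; adjust the names and the split ratio to your data):

from datasets import Dataset

# Build a dataset from your raw lists, shuffle it, and split off a test set.
dataset = Dataset.from_dict({"text": texts, "label": labels})
splits = dataset.shuffle(seed=42).train_test_split(test_size=0.2)
small_train_dataset = splits["train"]
small_test_dataset = splits["test"]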
Tokenize the dataset (our model doesn't understand raw words, so we need to encode them first; common practice is to one-hot encode every word as a high-dimensional vector or to assign every word an index). As these kinds of machine learning models expect the input to have a predefined length, we also need to pad shorter sentences (usually done with a special padding token):
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def preprocess_function(examples):
return tokenizer(examples["text"], truncation=True)
tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
tokenized_test = small_test_dataset.map(preprocess_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
We then load a pretrained model called DistilBERT, but configure it with 5 output labels (1 label equals 1 possible star rating):
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=5)
Now we can finally start the training process:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
output_dir="my-model",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=5, # adjust this parameter to your desired training length
weight_decay=0.01,
save_strategy="epoch",
push_to_hub=False,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_train,
eval_dataset=tokenized_test,
tokenizer=tokenizer,
data_collator=data_collator,
)
trainer.train()
trainer.save_model("my-model")
You can then use the model for inference as before, with the difference that you will now receive results with more than two classes:
from transformers import pipeline

# The tokenizer was saved alongside the model by trainer.save_model()
classifier = pipeline("text-classification", model="my-model", tokenizer="my-model")
results = classifier(data)
Source: huggingface blog
Comment section summarization
As there was a misunderstanding about the problem the question author is facing, here is a summary of the discussion in the comments:
The author wants to predict review ratings (1-5 stars) based on product DESCRIPTIONS. In my opinion, descriptions and the resulting reviews aren't related (descriptions are always written positively, in order to sell the products they describe). I therefore think that no prediction based on the description itself is possible, and further inputs are needed (e.g. overall product "quality", utility of the product, etc.).
Related
I have a binary classification problem with tweets: 17000 in the positive class and 122000 in the negative class. I have balanced the data so that there are 17000 tweets in each class. I have implemented models like LR, SVM, BERT, LSTM and CNN. In every run, the F1 score is around 0.55-0.66. Am I doing something wrong? Is it normal to have an F1 score around 0.55?
The problem persists with another dataset as well. The sample BERT model is:
trainer = Trainer(
model=model, # our loaded pre-trained transformer-based model "DistilBERT"
args=training_args, # our defined training arguments
train_dataset=train_dataset, # training dataset
eval_dataset=eval_dataset, # evaluation dataset
compute_metrics=compute_metrics # our defined evaluation function
)
Although an F1 score around 0.55-0.66 can be normal, whether it is good enough depends on your use case.
The F1 score depends not only on data balancing but also on many other factors.
I'd recommend taking the following steps of your ML pipeline into account:
Data preparation/cleansing (tokenization, stop-words removal, etc.)
Algorithms/Model selection (from experience, SVMs and NNs perform well)
Feature engineering/selection (which features have the most influence on the model)
Hyperparameter tuning (depending on the model, you'll have to search for the best combination of hyperparameters, e.g. with a NN you have to specify how many layers, how many nodes, the activation fn, the backpropagation fn, etc.)
Many people like to focus on the last steps, but I'd say that data preparation is one of the most important steps in any data pipeline. Data preparation/cleansing also plays a huge role in the F1 score and practically all other metrics.
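On the evaluation side, a minimal sketch of a compute_metrics function for the Trainer that reports accuracy and F1 (assuming scikit-learn is installed; the binary averaging mode is an assumption based on your two classes):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    # The Trainer passes a (logits, labels) tuple for the evaluation set.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions, average="binary"),
    }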
I am working on a Sentiment Analysis problem. I am using Gensim's Word2Vec to vectorize my data in the following way:
# PREPROCESSING THE DATA
# SPLITTING THE DATA
import nltk
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y = train_test_split(x, y, test_size = 0.2, random_state = 69, stratify = y)
train_x2 = train_x['review'].to_list()
test_x2 = test_x['review'].to_list()
# CONVERT TRAIN DATA INTO A NESTED LIST, AS WORD2VEC EXPECTS A LIST OF TOKEN LISTS
train_x3 = [nltk.word_tokenize(k) for k in train_x2]
test_x3 = [nltk.word_tokenize(k) for k in test_x2]
# TRAIN THE MODEL ON TRAIN SET
from gensim.models import Word2Vec
model = Word2Vec(train_x3, min_count = 1)
key_index = model.wv.key_to_index
# MAKE A DICT
we_dict = {word:model.wv[word] for word in key_index}
# CONVERT TO DATAFRAME
import pandas as pd
new = pd.DataFrame.from_dict(we_dict)
The new dataframe is the vectorized form of the train data. Now how do I do the same process for the test data? I can't pass the whole corpus (train+test) to the Word2Vec instance as it might lead to data leakage. Should I simply pass the test list to another instance of the model as:
model = Word2Vec(test_x3, min_count = 1)
I don't think this would be the correct way. Any help is appreciated!
PS: I am not using the pretrained word2vec in an LSTM model. What I am doing is training the Word2Vec on the data that I have and then feeding it to a ML algorithm like RF or LGBM. Hence I need to vectorize the test data separately.
Note that because word2vec is an unsupervised algorithm, it can sometimes be defensible to use all available texts to train it. That includes texts with known labels that you're withholding from other supervised-classification steps as test/validation records.
You just make sure the labels themselves aren't in the training data, but still use the bulk unlabeled text for further unsupervised improvement of the raw word-vectors. Those vectors, influenced by all the input text (but none of the known-answer labels) are then used for enhanced feature-modeling of the texts, as input to later supervised label-aware steps.
(Whether this is OK for your project may depend on what future performance you want your various accuracy/etc. evaluation measures to reasonably estimate. Is it new situations where everything always must be trained from scratch, and where relevant raw text and labels as training data are both scarce? Or situations where the corpus always grows, and text is always plentiful even if labels are expensive to acquire, or where any actually deployed classifier will be able to leverage other unlabeled texts before committing to a prediction?)
But note also: word-vectors are only comparison-compatible with each other when trained together, into a shared space. (Or, made compatible via other less-common post-training alignment steps.) There's no single right place for any word's vector, just a good relative position with regard to everything trained in the same session – which used randomization in both initialization and training, so even repeated runs on the same training data can yield end models of approximately-equivalent usefulness with wildly-different word-coordinates.
So, when withholding your test-set texts from the initial word2vec training, you should not train a separate word2vec model on just the test texts; instead, re-use the frozen word2vec model trained on the training data to look up vectors for the test texts.
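A minimal sketch of that frozen-model approach, using your variable names (averaging the word-vectors of each review into one feature vector is my assumption about how you feed texts to RF/LGBM):

import numpy as np

def doc_vector(tokens, wv):
    # Average the vectors of the words the model knows; unseen words are skipped.
    known = [wv[word] for word in tokens if word in wv]
    return np.mean(known, axis=0) if known else np.zeros(wv.vector_size)

# Word2Vec was trained only on train_x3; the test texts are just looked up.
train_features = np.vstack([doc_vector(tokens, model.wv) for tokens in train_x3])
test_features = np.vstack([doc_vector(tokens, model.wv) for tokens in test_x3])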
Separately: min_count=1 is almost always a bad idea for word2vec models, & if you're tempted to do so, you may have far too little data for such a data-hungry algorithm to show its true value. (If using it on the datasets where it really shines, you should be more often raising that threshold above its default – discarding more rare words – than lowering it to save every rare, hard-to-model-well word.)
I'm doing sentiment analysis of Spanish tweets.
After reviewing some of the recent literature, I've seen that there's been a recent effort to train a RoBERTa model exclusively on Spanish text (roberta-base-bne). It seems to perform better than BETO, the current state-of-the-art model for Spanish language modeling so far.
The RoBERTa model has been trained for a variety of tasks, which do not include text classification.
I want to take this RoBERTa model and fine-tune it for text classification, more specifically, sentiment analysis.
I've done all the preprocessing and created the dataset objects, and want to natively train the model.
Code
# Training with native TensorFlow
import tensorflow as tf
from transformers import TFRobertaForSequenceClassification

model = TFRobertaForSequenceClassification.from_pretrained("BSC-TeMU/roberta-base-bne")
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3)
Question
My questions are regarding TFRobertaForSequenceClassification:
Is it correct to use this class, even though it's not specified in the model card (which instead mentions AutoModelForMaskedLM)?
Does simply using TFRobertaForSequenceClassification mean that the pretrained knowledge is automatically applied to the new task, namely text classification?
The class named in the model card essentially reflects what the model has been trained on. If you are familiar with architectural choices for different modeling tasks (e.g., token classification vs. sequence classification), it should become clear that these models have slightly different layouts, specifically in the layers after the Transformer output layer. For token classification, this is (generally speaking) Dropout and an additional linear layer, mapping from the hidden_size of the model to the number of output classes. See here for an example with BERT.
This means that the model checkpoint which was pre-trained with a different learning objective will not have weights for this final layer, but instead you train these (comparably few) parameters during your fine-tuning. In fact, for PyTorch models you will generally get a warning when loading a model checkpoint that slightly differs in the available weights:
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: [...]
This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). [...]
This is exactly what you are doing, so as long as you have a decent number of fine-tuning examples (depending on the number of classes, I would suggest 10e3-10e4 as a rule of thumb), this will not affect your training by much.
I want to point out, however, that it might be necessary for you to specify the number of labels that your classification layer should have. You can do this by specifying it when loading your model:
from transformers import TFRobertaForSequenceClassification
roberta = TFRobertaForSequenceClassification.from_pretrained("BSC-TeMU/roberta-base-bne",
num_labels=<your_value>)
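Afterwards you can compile and fit it just like in your snippet (train_dataset here is assumed to be your already tokenized tf.data dataset):

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
roberta.compile(optimizer=optimizer, loss=roberta.compute_loss)
roberta.fit(train_dataset.shuffle(1000).batch(16), epochs=3)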
Purpose: We are exploring the use of word2vec models in clustering our data. We are looking for the ideal model to fit our needs and have been playing with using (1) existing models offered via Spacy and Gensim (trained on internet data only), (2) creating our own custom models with Gensim (trained on our technical data only) and (3) now looking into creating hybrid models that add our technical data to existing models (trained on internet + our data).
Here is how we created our hybrid model of adding our data to an existing Gensim model:
model = api.load("word2vec-google-news-300")
model = Word2Vec(size=300, min_count =1)
model.build_vocab(our_data)
model.train(our_data, total_examples=2, epochs =1)
model.wv.vocab
Question: Did we do this correctly in terms of our intentions of having a model that is trained on the internet and layered with our data?
Concerns: We are wondering if our data was really added to the model. When using the most similar function, we see really high correlations with more general words with this model. Our custom model has much lower correlations with more technical words. See output below.
Most Similar results for 'Python'
This model (internet + our data):
'technicians' = .99
'system' = .99
'working' = .99
Custom model (just our data):
'scripting' = .65
'perl' = .63
'julia' = .58
No: your code won't work for your intents.
When you execute the line...
model = Word2Vec(size=300, min_count=1)
...you've created an all-new, empty Word2Vec object, assigning it into the model variable, which discards anything that's already there. So the prior-loaded data will have no effect. You're just training a new model on your (tiny) data.
Further, the object you had loaded isn't a full Word2Vec model. The 'GoogleNews' vectors that Google released back in 2013 are only the vectors, not a full model. There's no straightforward & reliable way to keep training that object, as it is missing lots of information a real full model would have (including word-frequencies and the model's internal weights).
There are some advanced ways you could try to seed your own model with those values - but they involve lots of murky tradeoffs & poorly-documented steps, in order for the end-results to have any value, compared to just training your own model on your own sufficient data. There's no officially-documented/supported way to do it in Gensim.
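To see the distinction yourself, a small sketch (gensim 4.x argument names; the min_count/epochs values are placeholders to adjust to your corpus):

import gensim.downloader as api
from gensim.models import Word2Vec

goog_vectors = api.load("word2vec-google-news-300")  # KeyedVectors: lookup only, not further trainable
print(type(goog_vectors))

# A separate, full model trained only on your own (sufficiently large) corpus:
own_model = Word2Vec(our_data, vector_size=300, min_count=5, epochs=10)
print(own_model.wv.most_similar("python"))  # e.g. inspect neighbours of a domain term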
I have tried to incrementally train a word2vec model produced by gensim, but I found that the vocabulary size doesn't increase; only the word2vec model weights are updated. I need to update both the vocabulary and the model size.
#Load data
sentences = []
....................
#Training
model = Word2Vec(sentences, size=100)
model.save("modelbygensim.txt")
model.save_word2vec_format("modelbygensim_text.txt")
#Incremental Training
model = Word2Vec.load('modelbygensim.txt')
model.train(sentences)
model.save("modelbygensim_incremental.txt")
model.save_word2vec_format("modelbygensim_text_incremental.txt")
By default, gensim Word2Vec only does vocabulary-discovery once. It will happen when you supply a corpus like your sentences to the initial constructor (which does an automatic vocabulary-scan and train), or alternatively when you call build_vocab(). While you can continue to call train(), no new words will be recognized.
There is support (that I would consider experimental) for calling build_vocab() with new text examples, and an update=True parameter, to expand the vocabulary. While this would let further train() calls train both old-and-new words, there are many caveats:
such sequential training may not lead to models as good, or as self-consistent, as providing all examples interleaved. (For example, the continued training may drift words learned-from-later-batches arbitrarily far from words/word-senses in earlier batches that are not re-presented.)
such calls to train() should use one of the optional parameters to give an accurate estimate of the new batch size (in words or examples) so that learning-rate decay and progress-logging is done properly
the core algorithm and underlying theories aren't based on such batching, or on multiple restarts of the learning rate from high to low, so the interpretation of the results – and the relative strength/balance of the resulting vectors – isn't as well-grounded
If at all possible, combine all your examples into one corpus, and do one large vocabulary-discovery then training.
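For completeness, a minimal sketch of the experimental incremental route described above (new_sentences stands in for your additional corpus; pick an explicit epoch count):

from gensim.models import Word2Vec

model = Word2Vec.load("modelbygensim.txt")

# Expand the vocabulary with words from the new batch (experimental).
model.build_vocab(new_sentences, update=True)

# Give train() the size of the new batch so learning-rate decay is handled properly.
model.train(new_sentences, total_examples=len(new_sentences), epochs=5)

model.save("modelbygensim_incremental.txt")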