I have a binary classification problem with tweets: 17,000 in the positive class and 122,000 in the negative class. I have balanced the data so that each class has 17,000 tweets. I have implemented models such as LR, SVM, BERT, LSTM, and CNN. In every run, the F1 score is around 0.55-0.66. Am I doing something wrong? Is it normal to have an F1 score around 0.55?
The problem persists with another dataset as well. The sample BERT model is:
trainer = Trainer(
    model=model,                       # our loaded pre-trained transformer-based model "DistilBERT"
    args=training_args,                # our defined training arguments
    train_dataset=train_dataset,       # training dataset
    eval_dataset=eval_dataset,         # evaluation dataset
    compute_metrics=compute_metrics    # our defined evaluation function
)
Although an F1 score around 0.55-0.66 can be normal, whether it is good enough depends on your use case.
The F1 score depends not only on data balancing but also on many other factors.
I'd recommend taking the following steps into account in your ML pipeline (a minimal end-to-end sketch follows below):
Data preparation/cleansing (tokenization, stop-word removal, etc.)
Algorithm/model selection (in my experience, SVMs and neural networks perform well)
Feature engineering/selection (which features have the most influence on the model)
Hyperparameter tuning (depending on the model, you'll have to search for the best combination of hyperparameters; e.g. with a neural network you have to specify the number of layers, the number of nodes per layer, the activation function, the optimizer settings, etc.)
Many people like to focus on the last steps, but I'd say that data preparation is one of the most important steps in any data pipeline. Data preparation/cleansing also plays a huge role in the F1 score and in practically every other metric.
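As a rough illustration of those steps, here is a minimal sketch with scikit-learn (TF-IDF features plus a linear SVM, tuned with a small grid search). All parameter values and the texts/labels variables are placeholders, not taken from your setup:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Data preparation is folded into the vectorizer here (lowercasing,
# stop-word removal); real tweet cleaning usually needs more work.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english", ngram_range=(1, 2))),
    ("svm", LinearSVC()),
])

param_grid = {
    "tfidf__min_df": [1, 3, 5],   # drop very rare tokens
    "svm__C": [0.1, 1.0, 10.0],   # regularization strength
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5)
# texts, labels = ...  # your balanced tweet texts and 0/1 labels
# search.fit(texts, labels)
# print(search.best_params_, search.best_score_)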
I am performing sentiment analysis on a dataset of movie reviews. The neural network is a single-hidden-layer NN, made from scratch in Python. The classifier is expected to assign one of five classes (0 to 4) to each review phrase. However, upon training, the confusion matrix for the dev set gives the following results:
This means that the classifier is heavily biased towards class 0 and class 4. What could be the possible reasons?
The classifier earlier always predicted only class 2 because the dataset was skewed (~50% of the data was from class 2). Hence I chose a subset of the dataset containing an equal number of examples from all 5 classes. I still don't understand the output and the low accuracy.
The link to my notebook can be found here
First of all, your model is essentially linear, with only one hidden layer, so it is a simple model that might not produce good results; try increasing the number of layers.
Your training cost is also very high; you have to train for more epochs until you reach a good training cost.
The fact that your validation cost is twice the training cost is a sign of overfitting.
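To make the "more layers" suggestion concrete, here is a sketch using Keras rather than your from-scratch implementation; the input dimension and layer sizes are illustrative assumptions only:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(2000,)),           # e.g. bag-of-words features per phrase
    tf.keras.layers.Dense(256, activation="relu"),  # first hidden layer
    tf.keras.layers.Dense(64, activation="relu"),   # second hidden layer
    tf.keras.layers.Dense(5, activation="softmax")  # one output per class (0-4)
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, validation_data=(X_dev, y_dev), epochs=20)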
My aim is to create document embeddings from the column df["text"] as a first step and then, as a second step, plug them along with other variables into an XGBoost regressor model in order to make predictions. This works very well for train_df.
I am currently trying to evaluate my trained Doc2Vec model by inferring vectors with infer_vector() on the unseen test_df and then again making predictions with them. However, the results are very bad: I get a very large error (RMSE).
I assume this means that Doc2Vec is massively overfitting?
I am also not sure whether this is the correct way to evaluate my Doc2Vec model (via infer_vector).
What can I do to prevent Doc2Vec from overfitting?
Please find below my code for inferring vectors from a model:
vectors_test = []
for i in range(0, len(test_df)):
    vecs = model.infer_vector(tokenize(test_df["text"][i]))
    vectors_test.append(vecs)
vectors_test = pd.DataFrame(vectors_test)
test_df = pd.concat([test_df, vectors_test], axis=1)
I then make predictions with my XGBoost model:
np.random.seed(0)
test_df = test_df.reindex(np.random.permutation(test_df.index))
y = test_df['target'].values
X = test_df.drop(['target'], axis=1).values
y_pred = mod.predict(X)
pred = pd.DataFrame()
pred["Prediction"] = y_pred
rmse = np.sqrt(mean_squared_error(y, y_pred))
print(rmse)
Please see also the training of my doc2vec model:
doc_tag = train_df.apply(lambda train_df: TaggedDocument(words=tokenize(train_df["text"]), tags=[train_df.Tag]), axis=1)

# initializing model, building a vocabulary
model = Doc2Vec(dm=0, vector_size=200, min_count=1, window=10, workers=cores)
model.build_vocab([x for x in tqdm(doc_tag.values)])

# train model for 5 epochs
for epoch in range(5):
    model.train(utils.shuffle([x for x in tqdm(doc_tag.values)]),
                total_examples=len(doc_tag.values), epochs=1)
Without knowing what your XGBoost model is being trained to predict, or more about the type/quantity of your training data for certain steps, it's hard to speculate about why one particular set of inputs is performing poorly. (For example, it could equally be the XGBoost model's data, parameters, or training that's mismatched to the task.)
But, some observations:
You generally shouldn't be calling train() multiple times in your own loop. See "My Doc2Vec code, after many loops of training, isn't giving good results. What might be wrong?" for a discussion of common problems here. (Yours isn't quite as stark, but the learning rate isn't being handled properly across your 5 separate train() calls; indeed, there should even be some error in your log output.)
Similarly: it's often a bad idea to use a min_count as small as 1 in these kinds of models. Such rare words, without enough varied examples to be truly understood, just inject idiosyncratic noise that dilutes the influence of other, surrounding tokens that are meaningful.
Most published work trains a Doc2Vec model for 10-20 epochs – you're only using 5. (And, for smaller datasets or smaller texts, often even more epochs help.) Inference will also default to the epochs configured when the model was created – here only 5 – but more epochs are often beneficial.
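Putting those observations together, a corrected training setup might look like the sketch below; the parameter values are illustrative only, and in gensim 4+ some attribute names differ slightly:

from gensim.models.doc2vec import Doc2Vec

corpus = list(doc_tag.values)  # the TaggedDocument list built earlier

# Build the vocabulary once and call train() a single time, letting gensim
# manage the learning-rate decay across all epochs internally.
model = Doc2Vec(dm=0, vector_size=100, min_count=5, window=10,
                epochs=20, workers=cores)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)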
It's unclear what the size of your training texts and their unique vocabulary is, but Doc2Vec overfitting will be most likely if the model is relatively large – in terms of vector_size or total surviving vocabulary – compared to the training data. Then the model has lots of opportunity to essentially 'memorize' idiosyncrasies of the training set, instead of more-generalizable patterns that will still be useful for out-of-training data. (For example, min_count=1, if it preserves many singleton words which appear in only one text each, gives the model lots of "nooks and crannies" in which to improve its training target results in ways unlikely to help on other examples.) If your training data is "small", you likely need to use a smaller vector_size and a larger min_count to avoid overfitting, and then perhaps more epochs to ensure adequate training.
infer_vector essentially ignores any words not in its vocabulary – so you should take a look at some of the specific texts in the set performing poorly, and check whether most of their words are present or not. But note also: since Doc2Vec is an unsupervised method, a plausible case can be made for training it to learn textual patterns on all available data, including the texts in your 'test' set. Then it is more likely to have some word data, to at least the min_count threshold, for words across all examples. (Of course the actual supervised predictor itself can only be fairly evaluated on test examples whose desired answers weren't provided during the predictor's training. But it still can receive its features from an unsupervised step that used all text data.)
A crude check of a Doc2Vec model for overfitting or other training problems (but not overall quality) is to re-infer doc-vectors from the same texts it was trained on and check the model's set of bulk-trained vectors (model.docvecs) for the nearest neighbors to these re-inferred vectors. If the re-inferred vector's nearest neighbor isn't usually the same text's bulk-trained vector – or if, more generally, re-inferring the same text multiple times doesn't yield vectors that are 'close' to each other – then something about the model training or inference is deficient: overfitting, or undertraining, or insufficient data, or unwise parameters.
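A rough sketch of that self-consistency check, reusing the model and doc_tag variables from the code above (in gensim 4+, model.docvecs is accessed as model.dv instead):

# Re-infer vectors for a handful of training texts and see whether each
# text's own bulk-trained vector comes back as its nearest neighbor.
for doc in list(doc_tag.values)[:10]:
    inferred = model.infer_vector(doc.words)
    neighbors = model.docvecs.most_similar([inferred], topn=3)
    print(doc.tags[0], "->", neighbors)
# If a text's own tag rarely appears at rank 1, suspect undertraining,
# overfitting, too little data, or unwise parameters.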
I'm currently creating a model, and while creating it I came up with some questions. Does training the same model with the same data multiple times lead to better precision for those objects, since you're training it every time? And what could be the issue when an object sometimes gets 90% precision but, when I re-run it, gets lower precision or even fails to predict the right object? Is it because of TensorFlow running on the GPU?
I will guess that you are doing image recognition and that you want to identify images (objects) using a neural network made with Keras. You should train it once, but during training you will run several epochs, meaning the algorithm adapts the weights over several rounds (epochs). In each round it goes over all training images. Once trained, you can use the model to identify images/objects.
You can evaluate the accuracy of your trained model on the same training set, but it is better to use a separate set (see, for instance, train_test_split from sklearn, sketched below).
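A minimal illustration of holding out an evaluation set; the iris data here is just a stand-in for your own features and labels:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # toy data in place of your images/labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(len(X_train), len(X_test))  # 80% for training, 20% held out for evaluation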
Training is a stochastic process, meaning that every time you train your network the result will be different. Hence, you will get different accuracies. The stochasticity comes, for instance, from different initial weights or from using stochastic gradient descent methods.
The question does not appear to have anything to do with Keras or TensorFlow, but rather with a basic understanding of how neural networks work. There is no connection to running TensorFlow on the GPU. You will also not get better precision by training with the same objects. If you train your model on a dataset for a very long time (many epochs), you might run into overfitting. Then the accuracy of your model on this training dataset will be very high, but the model will have low accuracy on other datasets.
A common technique is to split your data into training and validation datasets, then train your model using EarlyStopping. This will train on the training dataset, calculate the loss against the validation dataset, and keep training until no further improvement is seen. You can set a patience parameter to wait for X epochs without improvement before stopping training (and optionally save the best model):
https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/
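A sketch of that early-stopping setup with the Keras callback API; the dataset variables, patience value, and checkpoint filename are placeholders:

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # stop when val_loss hasn't improved for 5 epochs, keeping the best weights
    EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
    # optionally also save the best model seen so far to disk
    ModelCheckpoint("best_model.h5", monitor="val_loss", save_best_only=True),
]
# model.fit(X_train, y_train,
#           validation_data=(X_val, y_val),
#           epochs=200, callbacks=callbacks)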
Another trick is image augmentation with ImageDataGenerator, which will generate synthetic data for you (rotations, shifts, mirror images, brightness adjustments, noise, etc.). This can effectively increase the amount of data you have to train with, thus reducing overfitting:
https://machinelearningmastery.com/how-to-configure-image-data-augmentation-when-training-deep-learning-neural-networks/
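A sketch of on-the-fly augmentation with Keras' ImageDataGenerator; the transform ranges are illustrative, and X_train/y_train are placeholders:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rotation_range=15,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             horizontal_flip=True,
                             brightness_range=(0.8, 1.2))
# datagen.fit(X_train)  # only needed when using feature-wise statistics
# model.fit(datagen.flow(X_train, y_train, batch_size=32), epochs=50)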
I have gathered a training dataset to train the network model, but unfortunately the dataset is critically unbalanced. Is there a way to balance the data using the Keras library without having to balance it manually (a dataset of two objects: object 1 has 2,000 samples while the other has 15,000)? I don't want to use upsampling or downsampling because I don't want to run into overfitting or underfitting problems.
There are a number of ways and best practices to deal with so-called imbalanced data sets:
Upsample the minority class (drawback: possible overfitting of the minority class)
Downsample the majority class (drawback: loss of training data, information loss)
There are a number of techniques you can use for this, and some even offer ways to overcome these drawbacks (e.g. synthetic sampling). Have a look at the imbalanced-learn package for an easy-to-use implementation, for example:
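A sketch of synthetic oversampling with imbalanced-learn's SMOTE; the generated toy data merely stands in for your real 2,000 vs 15,000 split:

import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# toy imbalanced data (~88% / 12%) in place of your real features and labels
X, y = make_classification(n_samples=17000, weights=[0.88, 0.12], random_state=0)
X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y), np.bincount(y_resampled))  # classes equalized after SMOTE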
Another thing you could do is weight the loss of your model in order to tell the model that it should "pay more attention" to specific classes. This can easily be done by setting the optional class_weight argument of the Keras fit function. The class weights can easily be computed with sklearn's compute_class_weight function, for example:
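A sketch of that class-weighting approach; the toy labels mirror the 15,000 / 2,000 split, and the commented fit call assumes your own model and data:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_train = np.array([0] * 15000 + [1] * 2000)  # toy labels standing in for yours
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))
print(class_weight)  # roughly {0: 0.57, 1: 4.25}: the minority class is weighted up
# model.fit(X_train, y_train, epochs=20, class_weight=class_weight)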
I am building a TensorFlow model with the new Estimator high-level API. My model looks like the screenshot below.
In fact, the model is more complex than that because it is used to simulate game operation. The classification part is responsible for deciding whether it is a good time for an action, and the regression part then gives the details of the action. It contains a combination of CNN and RNN.
However, due to the complexity and memory consumption, it is impossible to train and run classification and regression as two networks simultaneously. Also, when I create my estimator like this:
# Create the Estimator
mnist_classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn, model_dir="/tmp/mnist_convnet_model")
I can only provide one model function to the estimator. Is it possible to train and run two estimators together?
Change your loss function to be a linear combination of regression and classification losses. It will be one estimator with one loss, but multiple inferences.
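A hedged sketch of that idea in the TF1-style Estimator API the question already uses: one model_fn with a shared body, two heads, and a total loss that is a weighted sum of the classification and regression losses. All names (features["x"], the labels dict keys, n_classes, alpha) are illustrative assumptions, not taken from the original model:

import tensorflow as tf

def combined_model_fn(features, labels, mode, params):
    # Shared body (stands in for the real CNN/RNN feature extractor).
    shared = tf.layers.dense(features["x"], 128, activation=tf.nn.relu)

    # Two heads: one for the action/no-action decision, one for the action details.
    class_logits = tf.layers.dense(shared, params["n_classes"])
    reg_output = tf.layers.dense(shared, params["n_regression_outputs"])

    predictions = {"action": tf.argmax(class_logits, axis=1),
                   "details": reg_output}
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode, predictions=predictions)

    # Linear combination of the two losses; alpha balances their scales.
    cls_loss = tf.losses.sparse_softmax_cross_entropy(
        labels=labels["action"], logits=class_logits)
    reg_loss = tf.losses.mean_squared_error(
        labels=labels["details"], predictions=reg_output)
    loss = cls_loss + params.get("alpha", 1.0) * reg_loss

    optimizer = tf.train.AdamOptimizer()
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op,
                                      predictions=predictions)

# estimator = tf.estimator.Estimator(model_fn=combined_model_fn,
#                                    params={"n_classes": 2,
#                                            "n_regression_outputs": 4,
#                                            "alpha": 0.5})

This assumes your input_fn returns the labels as a dict with both the classification and regression targets; at prediction time you get both outputs from the single estimator.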