I am trying to implement a model that uses encodings from multiple pre-trained BERT models (each trained on a different dataset) and combines them into a single representation with a fully-connected layer. I want the BERT models to remain fixed and only the fully-connected layer to be trained. Is it possible to achieve this in huggingface-transformers? I don't see any flag that allows me to do that.
PS: I don't want to go the route of dumping the encodings of the inputs for each BERT model and using them as inputs.
A simple solution to this is to just exclude the parameters related to the BERT models when passing them to the optimizer:
from torch.optim import AdamW

# Start from all named parameters, then drop everything belonging to the BERT encoders
param_optimizer = list(model.named_parameters())
param_optimizer = [x for x in param_optimizer if 'bert' not in x[0]]
# The optimizer takes the parameters themselves, not the (name, parameter) tuples
optimizer = AdamW([p for n, p in param_optimizer], lr=lr)
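Alternatively, you can freeze the BERT weights directly so that autograd never computes gradients for them; a minimal sketch, assuming your combined model exposes its encoders as model.bert1 and model.bert2 (hypothetical attribute names):
for encoder in (model.bert1, model.bert2):
    for param in encoder.parameters():
        param.requires_grad = False  # excluded from gradient computation

# Only parameters that still require gradients go to the optimizer
optimizer = AdamW([p for p in model.parameters() if p.requires_grad], lr=lr)
This also saves time and memory in the backward pass, since the frozen encoders are skipped entirely.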
I'm doing sentiment analysis of Spanish tweets.
After reviewing some of the recent literature, I've seen that there has been a recent effort to train a RoBERTa model exclusively on Spanish text (roberta-base-bne). It seems to perform better than BETO, the current state-of-the-art model for Spanish language modeling.
The RoBERTa model has been trained for a variety of tasks, which do not include text classification.
I want to take this RoBERTa model and fine-tune it for text classification, more specifically, sentiment analysis.
I've done all the preprocessing and created the dataset objects, and want to natively train the model.
Code
# Training with native TensorFlow
import tensorflow as tf
from transformers import TFRobertaForSequenceClassification

model = TFRobertaForSequenceClassification.from_pretrained("BSC-TeMU/roberta-base-bne")

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss)  # can also use any keras loss fn
# batch the dataset itself; batch_size must not be passed to fit() with a tf.data.Dataset
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3)
Question
My questions are regarding TFRobertaForSequenceClassification:
Is it correct to use this class, given that it's not specified in the model card (the card specifies AutoModelForMaskedLM instead)?
By simply using TFRobertaForSequenceClassification, do we imply that the pretrained knowledge will automatically be applied to the new task, namely text classification?
The class referenced in the model card essentially reflects what the model has been trained on. If you are familiar with the architectural choices for different modeling tasks (e.g., token classification vs. sequence classification), it should become clear that these models have slightly different layouts, specifically in the layers after the Transformer output layer. For token classification, this is (generally speaking) Dropout plus an additional linear layer, mapping from the hidden_size of the model to the number of output classes. See here for an example with BERT.
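A minimal sketch of such a classification head (illustrative names, not the exact transformers implementation):
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, hidden_size, num_labels, dropout_prob=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout_prob)
        self.classifier = nn.Linear(hidden_size, num_labels)  # hidden_size -> number of classes

    def forward(self, hidden_states):
        return self.classifier(self.dropout(hidden_states))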
This means that the model checkpoint which was pre-trained with a different learning objective will not have weights for this final layer, but instead you train these (comparably few) parameters during your fine-tuning. In fact, for PyTorch models you will generally get a warning when loading a model checkpoint that slightly differs in the available weights:
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: [...]
This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). [...]
This is exactly what you are doing, so as long as you have a decent number of fine-tuning examples (depending on the number of classes; as a rule of thumb, on the order of thousands to tens of thousands), this will not affect your training by much.
I want to point out, however, that it might be necessary for you to specify the number of labels that your classification head has. You can do this by specifying it when loading the model:
from transformers import TFRobertaForSequenceClassification

roberta = TFRobertaForSequenceClassification.from_pretrained(
    "BSC-TeMU/roberta-base-bne",
    num_labels=<your_value>
)
Is it possible in Keras for the training of some of the outputs in a multi-output model to start at different epochs? For example, one of the outputs takes some other outputs as its input, but at the beginning those outputs are quite premature, which places a huge computational burden on the model. The output whose training I would like to postpone is a custom layer that applies some image-processing operations to its input, an image generated by another output; in the first epochs, when the generated image is still quite meaningless, I think applying this custom layer is just a waste of time. Is there a way to do that? Just as we have weights over each output's loss, can we have a different starting point for calculating each output's loss?
1. Build a model that does not contain the later output.
2. Train that model to the degree you want.
3. Build a new model that incorporates the old model into it.
4. Compile the new model with the new loss functions you want.
5. Train that model.
To elaborate on step 3: Keras models can be used like layers in Keras' functional API.
You can build a normal model like so:
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

inputs = Input((100,))
x = Dense(50)(inputs)
x = Dense(1, activation='sigmoid')(x)
model = Model(inputs, x)
However, if you have another standard Keras model, it can be used just like any other layer. For example, if we have a model (created with Sequential(), Model(), or keras.models.load_model()) called model1, we can put it in like this:
inputs = Input((100,))
x = model1(inputs)
x = Dense(1, activation='sigmoid')(x)
model = Model(inputs, x)
This would be the equivalent of putting in each layer in model1 individually.
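Putting the whole recipe together, here is a rough end-to-end sketch; ExpensiveProcessingLayer, x_train, y_images, and y_processed are hypothetical stand-ins for your custom layer and data:
# Steps 1-2: build and pre-train the model without the expensive output
inputs = Input((100,))
image_out = Dense(64, activation='relu')(inputs)  # stands in for the image-generating branch
stage1 = Model(inputs, image_out)
stage1.compile(optimizer='adam', loss='mse')
stage1.fit(x_train, y_images, epochs=10)

# Steps 3-5: wrap the pre-trained model and add the postponed output
new_inputs = Input((100,))
image = stage1(new_inputs)
processed = ExpensiveProcessingLayer()(image)  # now worth applying
stage2 = Model(new_inputs, [image, processed])
stage2.compile(optimizer='adam', loss=['mse', 'mse'], loss_weights=[1.0, 1.0])
stage2.fit(x_train, [y_images, y_processed], epochs=10)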
Migrating to TF 2.0, I'm trying to use the tf.keras approach for solving things.
In standard TF, I can use with tf.device(...) to control where ops are placed.
For example, I might have a model something like
model = tf.keras.Sequential([
    tf.keras.layers.Input(...),
    tf.keras.layers.Embedding(...),
    tf.keras.layers.LSTM(...),
    ...
])
Assuming I want the network up to and including the Embedding layer on the CPU, and everything from there on on the GPU, how would I go about that?
(This is just an example, the layers could have nothing to do with embeddings)
If the solution involves subclassing tf.keras.Model, that is OK too; I don't mind not using Sequential.
You can use the Keras functional API and wrap each stage in the appropriate tf.device scope:
import tensorflow as tf

inputs = tf.keras.layers.Input(...)
with tf.device("/CPU:0"):
    x = tf.keras.layers.Embedding(...)(inputs)  # embedding lookup stays on the CPU
with tf.device("/GPU:0"):
    outputs = tf.keras.layers.LSTM(...)(x)  # the rest runs on the GPU
model = tf.keras.Model(inputs=inputs, outputs=outputs)
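If you prefer subclassing, the same idea works inside call(); a rough sketch with made-up layer sizes:
class SplitDeviceModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(input_dim=10000, output_dim=128)
        self.lstm = tf.keras.layers.LSTM(64)

    def call(self, inputs):
        with tf.device("/CPU:0"):
            x = self.embedding(inputs)  # lookup on the CPU
        with tf.device("/GPU:0"):
            return self.lstm(x)  # recurrence on the GPU
You can verify where ops actually run with tf.debugging.set_log_device_placement(True).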
I want to extract 1000 image features using the pretrained Xception model, but Xception's last pooling layer (avg_pool) gives 2048 features.
Can I reduce the final number of output features without additional training?
I want the image features from before the softmax, not the prediction result.
from tensorflow.keras.applications import xception
from tensorflow.keras.models import Model

base_model = xception.Xception(include_top=True, weights='imagenet')
base_model.summary()
# Truncate the network at the global average pooling layer (2048-d output)
model = Model(inputs=base_model.input, outputs=base_model.get_layer('avg_pool').output)
This model was trained to produce 2048-dimensional embeddings for the classifier after it. There is no sense in trying to reduce the dimensionality of the embedding space unless you are combining very complex and inflexible models. If you are just doing simple transfer learning with no memory constraints, just snap your new classifier (extra layers) on top of it, and retrain after freezing (or not) all layers in the original Xception. That should work regardless of Xception's output shape. See the Keras docs.
That said, if you REALLY need to reduce the dimensionality to 1000, you will need a method that preserves (or at least tries to preserve) the original topology of the embedding space; otherwise your model will not benefit at all from transfer learning. Take a look at PCA, SVD, or t-SNE.
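As a minimal sketch of the PCA route, assuming features is an (n_samples, 2048) array of avg_pool outputs collected with the model above (PCA needs n_samples >= 1000 here):
from sklearn.decomposition import PCA

pca = PCA(n_components=1000)
reduced = pca.fit_transform(features)  # shape: (n_samples, 1000)
The projection is fit on your own data, so features for new images must go through the same fitted object via pca.transform.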
Considering the example of image classification on ImageNet, how do I update a pre-trained model using new data points?
I have loaded the pre-trained model. I have new data points that are quite different from the distribution of the original data the model was trained on, so I would like to update/fine-tune the model with these new data points. How do I go about doing it? Can anyone help me out? I am using PyTorch 0.4.0 for the implementation, running on a Tesla K40C GPU.
If you don't want to change the output of the classifier (i.e. the number of classes), you can simply continue training the model with new example images, assuming they are reshaped to the same shape the pretrained model accepts.
On the other hand, if you want to change the number of classes in a pre-trained model, you can replace the last fully-connected layer with a new one and train only this specific layer on the new samples. Here's sample code for this case from PyTorch's autograd mechanics notes:
import torch.nn as nn
import torch.optim as optim
import torchvision

model = torchvision.models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

# Replace the last fully-connected layer.
# Parameters of newly constructed modules have requires_grad=True by default.
model.fc = nn.Linear(512, 100)

# Optimize only the classifier
optimizer = optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
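From there, a minimal fine-tuning loop might look like this, assuming train_loader yields (images, labels) batches labeled with the new 100 classes:
criterion = nn.CrossEntropyLoss()
model.train()
for epoch in range(num_epochs):  # num_epochs is up to you
    for images, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()  # gradients flow only into model.fc
        optimizer.step()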