I am trying to analyze the sentiment of earnings calls using FinBERT. Since I am analyzing more than 40,000 earnings calls, computing the sentiment scores takes more than a week, so I want to use the TPU provided by Kaggle to accelerate the process.
All the tutorials and guides I could find only deal with training the model, but I just want to use one of the pre-trained versions and use the TPU to accelerate the sentiment analysis of the earnings calls.
import tensorflow as tf
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import pipeline

try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
except ValueError:
    tpu = None
gpus = tf.config.experimental.list_logical_devices("GPU")

if tpu:
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
    print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
elif len(gpus) > 1:
    strategy = tf.distribute.MirroredStrategy([gpu.name for gpu in gpus])
    print('Running on multiple GPUs ', [gpu.name for gpu in gpus])
elif len(gpus) == 1:
    strategy = tf.distribute.get_strategy()
    print('Running on single GPU ', gpus[0].name)
else:
    strategy = tf.distribute.get_strategy()
    print('Running on CPU')
print("Number of accelerators: ", strategy.num_replicas_in_sync)

finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone', num_labels=3)
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')
nlp = pipeline("sentiment-analysis", model=finbert, tokenizer=tokenizer)
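For reference, calling the pipeline on a list of sentences returns one dictionary per sentence with a `label` and a `score`, which is what the counting loop below relies on; the sentences and scores here are purely illustrative:
example = nlp(["Margins improved this quarter.", "We expect headwinds next year."])
# e.g. [{'label': 'Positive', 'score': 0.98}, {'label': 'Negative', 'score': 0.87}]
print(example[0]["label"])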
# This is the loop I am using to compute and store the scores
import nltk

for i in range(40001, len(clean_data) - 1):
    # for i in range(0, 10):
    print(i)

    # Get Q&A text
    temp = test_data.iloc[i, 3]
    sentences = nltk.sent_tokenize(temp)
    results = nlp(sentences)
    filename = clean_data.iloc[i, 0]

    # Count the labels returned by FinBERT
    positive = 0
    neutral = 0
    negative = 0
    for j in range(0, len(results)):
        label = results[j]["label"]
        if label == "Positive":
            positive = positive + 1
        elif label == "Neutral":
            neutral = neutral + 1
        else:
            negative = negative + 1

    per_pos_qanda = positive / len(results)
    per_neg_qanda = negative / len(results)
    net_score_qanda = per_pos_qanda - per_neg_qanda

    finbert_results.iloc[i, 0] = filename
    finbert_results.iloc[i, 7] = per_pos_qanda
    finbert_results.iloc[i, 8] = per_neg_qanda
    finbert_results.iloc[i, 9] = net_score_qanda
Do I now need to incorporate the TPU strategy into the loop at the point where I call the pipeline, i.e. in this line?
results = nlp(sentences)
If I understand correctly, you're referring to inference rather than training. In that case you can still benefit from TPUs, for example by using a distribution strategy. Please refer to this guide: https://www.tensorflow.org/api_docs/python/tf/distribute/Strategy
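As a rough illustration of that idea, here is a minimal sketch of batched inference under the strategy detected above. It assumes the TensorFlow variant of the checkpoint can be loaded (`from_pt=True` converts the PyTorch weights) and uses fixed-length padding so tensor shapes stay static, which TPUs require; `score_sentences` and the parameter values are illustrative, not part of any API.
# Sketch only: batched FinBERT inference with the strategy created above.
import numpy as np
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

with strategy.scope():
    tf_finbert = TFBertForSequenceClassification.from_pretrained(
        'yiyanghkust/finbert-tone', num_labels=3, from_pt=True)
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')

def score_sentences(sentences, batch_size=128, max_length=128):
    # Fixed-length padding keeps tensor shapes static across batches (TPUs need this).
    enc = tokenizer(sentences, padding='max_length', truncation=True,
                    max_length=max_length, return_tensors='tf')
    ds = tf.data.Dataset.from_tensor_slices(dict(enc)).batch(batch_size)
    logits = tf_finbert.predict(ds)["logits"]
    ids = np.argmax(logits, axis=-1)
    # Map class indices back to the model's label names ('Positive', 'Neutral', 'Negative').
    return [tf_finbert.config.id2label[int(i)] for i in ids]
With something like this, `results = nlp(sentences)` inside the loop could be replaced by one `score_sentences(sentences)` call per transcript, and the label-counting logic stays the same.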
Related
I have trained a multiclass classifier for speech recognition using TensorFlow, then converted the model with the TFLite converter. The model can predict, but it always outputs a single class. I suppose the problem is in the inference code, because the .h5 model can predict multiple classes without any issue. I have been searching online for several days for some insight but I can't quite figure it out. Here is my code; any suggestions would be really appreciated.
import sounddevice as sd
import numpy as np
import scipy.signal
import timeit
import python_speech_features
import tflite_runtime.interpreter as tflite
import importlib
# Parameters
debug_time = 0
debug_acc = 0
word_threshold = 0.95
rec_duration = 0.5 # 0.5
sample_length = 0.5
window_stride = 0.5 # 0.5
sample_rate = 8000 # The mic requires at least 44100 Hz to work
resample_rate = 8000
num_channels = 1
num_mfcc = 16
model_path = 'model.tflite'
mfccs_old = np.zeros((32, 25))
# Load model (interpreter)
interpreter = tflite.Interpreter(model_path)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
print(input_details)
# Filter and downsample
def decimate(signal, old_fs, new_fs):
    # Check to make sure we're downsampling
    if new_fs > old_fs:
        print("Error: target sample rate higher than original")
        return signal, old_fs

    # Downsampling is possible only by an integer factor
    dec_factor = old_fs / new_fs
    if not dec_factor.is_integer():
        print("Error: can only downsample by integer factor")

    # Do decimation
    resampled_signal = scipy.signal.decimate(signal, int(dec_factor))
    return resampled_signal, new_fs
# Callback that gets called every 0.5 seconds
def sd_callback(rec, frames, time, status):
    # Start timing for debug purposes
    start = timeit.default_timer()

    # Notify errors
    if status:
        print('Error:', status)

    global mfccs_old

    # Compute MFCCs
    mfccs = python_speech_features.base.mfcc(rec,
                                             samplerate=resample_rate,
                                             winlen=0.02,
                                             winstep=0.02,
                                             numcep=num_mfcc,
                                             nfilt=26,
                                             nfft=512,  # 2048
                                             preemph=0.0,
                                             ceplifter=0,
                                             appendEnergy=True,
                                             winfunc=np.hanning)
    delta = python_speech_features.base.delta(mfccs, 2)
    mfccs_delta = np.append(mfccs, delta, axis=1)

    mfccs_new = mfccs_delta.transpose()
    mfccs = np.append(mfccs_old, mfccs_new, axis=1)
    # mfccs = np.insert(mfccs, [0], 0, axis=1)
    mfccs_old = mfccs_new

    # Run inference and make predictions
    in_tensor = np.float32(mfccs.reshape(1, mfccs.shape[0], mfccs.shape[1], 1))
    interpreter.set_tensor(input_details[0]['index'], in_tensor)
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    val = np.amax(output_data)  # DEFINED FOR BINARY CLASSIFICATION, CHANGE TO MULTICLASS
    ind = np.where(output_data == val)
    prediction = ind[1].astype(int)
    if val > word_threshold:
        print('index:', ind[1])
        print('accuracy', val, '\n')
        print(int(prediction))

    if debug_acc:
        # print('accuracy:', val)
        # print('index:', ind[1])
        print('out tensor:', output_data)

    if debug_time:
        print(timeit.default_timer() - start)
# Start recording from microphone
with sd.InputStream(channels=num_channels,
                    samplerate=sample_rate,
                    blocksize=int(sample_rate * rec_duration),
                    callback=sd_callback):
    while True:
        pass
Since I figured out the issue, I am answering it myself in case others find it useful.
The issue was not having a "background noise" class in the dataset. Also make sure you have enough data for the background noises. If you look at Google's Teachable Machine audio project (https://teachablemachine.withgoogle.com/train/audio), a "background noise" class is already there, and you cannot delete or disable it.
I tested both the code provided in TensorFlow's GitHub example (https://github.com/tensorflow/examples/blob/master/lite/examples/sound_classification/raspberry_pi/classify.py) and the one on TensorFlow's website (https://www.tensorflow.org/tutorials/audio/simple_audio). Both predict well as long as you have enough background noise samples in your dataset for the particular environment you are testing in.
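For completeness, here is a small, hypothetical sketch of one way to build such a class: slice a long ambient recording into one-second clips and save them under a dedicated noise folder so they can be labelled like any other class. The file names and folder layout are assumptions, not part of the original project.
# Hypothetical helper: turn a long room-noise recording into 1-second
# training clips for a "background_noise" class.
import os
import numpy as np
from scipy.io import wavfile

def make_noise_clips(src_wav='room_noise.wav', out_dir='dataset/background_noise',
                     clip_seconds=1.0):
    os.makedirs(out_dir, exist_ok=True)
    rate, samples = wavfile.read(src_wav)
    clip_len = int(rate * clip_seconds)
    for i in range(len(samples) // clip_len):
        clip = samples[i * clip_len:(i + 1) * clip_len]
        wavfile.write(os.path.join(out_dir, f'noise_{i:04d}.wav'), rate, clip)

make_noise_clips()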
I made slight changes to TensorFlow's GitHub code to output the category name and confidence score:
# Loop until the user closes the classification results plot.
while True:
    # Wait until at least interval_between_inference seconds have passed since
    # the last inference.
    now = time.time()
    diff = now - last_inference_time
    if diff < interval_between_inference:
        time.sleep(pause_time)
        continue
    last_inference_time = now

    # Load the input audio and run classification.
    tensor_audio.load_from_audio_record(audio_record)
    result = classifier.classify(tensor_audio)
    for category in result.classifications[0].categories:
        print(category.category_name, category.score)
Hope it's helpful for people playing around with similar projects.
I have taken this code from GitHub; here is the full code.
This code computes the accuracy and precision over all the test samples, but I want to feed it a new test sample so that it can predict the result for that sample.
What changes should I make?
# evaluate trained models
# find precision, recall, f1 score
num_classes = 1  # no. of classes
model.load_weights('%_on_100_class_CNN.h5')  # load trained model

def maxNumber(arr):  # find index of max number in array
    mx = 0
    mxIndx = -1
    for indx, val in enumerate(arr):
        if val > mx:
            mx = val
            mxIndx = indx
    return mxIndx

total = []    # total samples in each class
pos_reg = []  # number of correctly classified samples in each class
fal_reg = []  # number of misclassified samples in each class
for i in range(num_classes):  # initializing
    total.append(0)
    pos_reg.append(0)
    fal_reg.append(0)

pred = model.predict(X_test)  # predict on the test set
for indx, res in enumerate(pred):
    target = int(y_test[indx])
    total[target] += 1
    if target == maxNumber(res):
        pos_reg[target] += 1
    else:
        fal_reg[maxNumber(res)] += 1

s1 = 0
s2 = 0
s3 = 0
for i in range(num_classes):
    s1 += pos_reg[i]
    s2 += total[i]
    s3 += fal_reg[i]

print("Precision, Recall, F1-score on test dataset")
print("precision: ", s1/(s1+s3), " recall: ", s1/s2, " f1 score: ", (2*(s1/(s1+s3)))*(s1/s2)/((s1/(s1+s3))+(s1/s2)))
print("Precision, Recall, F1-score on each class of test dataset")
for i in range(num_classes):
    print(i, ": ", (pos_reg[i] / total[i] * 100), " precision : ", pos_reg[i]/(pos_reg[i]+fal_reg[i]), " recall : ", pos_reg[i]/total[i], " f1score: ", (2*(pos_reg[i]/(pos_reg[i]+fal_reg[i]))*(pos_reg[i]/total[i]))/((pos_reg[i]/(pos_reg[i]+fal_reg[i]))+(pos_reg[i]/total[i])))
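A minimal sketch of what scoring one new sample could look like, assuming `new_sample` has been preprocessed exactly like the rows of `X_test` (same shape and feature pipeline); the variable names here are illustrative, not from the original repository.
import numpy as np

# new_sample: one example prepared with the same preprocessing as X_test
x = np.expand_dims(new_sample, axis=0)   # add the batch dimension
probs = model.predict(x)[0]              # class scores for the single sample
predicted_class = int(np.argmax(probs))  # same idea as maxNumber(), via argmax
print("predicted class:", predicted_class, "score:", probs[predicted_class])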
I keep running out of memory even after I bought Google Colab Pro, which provides 25 GB of RAM. I have no idea why this is happening. I tried every kernel possible (Google Colab, Google Colab Pro, Kaggle kernels, Amazon SageMaker, Google Cloud Platform). I reduced my batch size to 8, with no success whatsoever.
My goal is to train BERT in DeepPavlov (with the Russian text classification extension) to predict the emotion of a tweet. It is a multiclass classification problem with 5 classes.
Here is my whole code:
!pip3 install deeppavlov
import pandas as pd
train_df = pd.read_csv('train_pikabu.csv')
test_df = pd.read_csv('test_pikabu.csv')
val_df = pd.read_csv('validation_pikabu.csv')
from deeppavlov.dataset_readers.basic_classification_reader import BasicClassificationDatasetReader
# read data from particular columns of `.csv` file
data = BasicClassificationDatasetReader().read(
    data_path='./',
    train='train_pikabu.csv',
    valid="validation_pikabu_a.csv",
    test="test_pikabu.csv",
    x='content',
    y='emotions'
)

from deeppavlov.dataset_iterators.basic_classification_iterator import BasicClassificationDatasetIterator

# initializing an iterator
iterator = BasicClassificationDatasetIterator(data, seed=42, shuffle=True)
!python -m deeppavlov install squad_bert
from deeppavlov.models.preprocessors.bert_preprocessor import BertPreprocessor
bert_preprocessor = BertPreprocessor(vocab_file="./bert/vocab.txt",
do_lower_case=False,
max_seq_length=256)
from deeppavlov.core.data.simple_vocab import SimpleVocabulary
vocab = SimpleVocabulary(save_path="./binary_classes.dict")
iterator.get_instances(data_type="train")
vocab.fit(iterator.get_instances(data_type="train")[1])
from deeppavlov.models.preprocessors.one_hotter import OneHotter
one_hotter = OneHotter(depth=vocab.len,
single_vector=True # means we want to have one vector per sample
)
from deeppavlov.models.classifiers.proba2labels import Proba2Labels
prob2labels = Proba2Labels(max_proba=True)
from deeppavlov.models.bert.bert_classifier import BertClassifierModel
from deeppavlov.metrics.accuracy import sets_accuracy
bert_classifier = BertClassifierModel(
n_classes=vocab.len,
return_probas=True,
one_hot_labels=True,
bert_config_file="./bert/bert_config.json",
pretrained_bert="./bert/bert_model.ckpt",
save_path="sst_bert_model/model",
load_path="sst_bert_model/model",
keep_prob=0.5,
learning_rate=1e-05,
learning_rate_drop_patience=5,
learning_rate_drop_div=2.0
)
# Method `get_instances` returns all the samples of a particular data field
x_valid, y_valid = iterator.get_instances(data_type="valid")

# You need to save the model only when the validation score is higher than the previous one.
# This variable will contain the highest accuracy score
best_score = 0.
patience = 2
impatience = 0

# let's train for 3 epochs
for ep in range(3):
    nbatches = 0
    for x, y in iterator.gen_batches(batch_size=8,
                                     data_type="train", shuffle=True):
        x_feat = bert_preprocessor(x)
        y_onehot = one_hotter(vocab(y))
        bert_classifier.train_on_batch(x_feat, y_onehot)
        print("Batch done\n")
        nbatches += 1

        if nbatches % 1 == 0:
            # validating every batch here (raise the interval for real runs)
            y_valid_pred = bert_classifier(bert_preprocessor(x_valid))
            score = sets_accuracy(y_valid, vocab(prob2labels(y_valid_pred)))
            print("Batches done: {}. Valid Accuracy: {}".format(nbatches, score))

    y_valid_pred = bert_classifier(bert_preprocessor(x_valid))
    score = sets_accuracy(y_valid, vocab(prob2labels(y_valid_pred)))
    print("Epochs done: {}. Valid Accuracy: {}".format(ep + 1, score))

    if score > best_score:
        bert_classifier.save()
        print("New best score. Saving model.")
        best_score = score
        impatience = 0
    else:
        impatience += 1
        if impatience == patience:
            print("Out of patience. Stop training.")
            break
It runs for one batch and then crashes.
I am implementing my own perceptron algorithm in Python, without using numpy or scikit-learn yet. I wanted to get the basics right before proceeding to the machine-learning-specific modules.
I wrote the code as given below:
- used the Iris data set to classify based on sepal length and petal size
- updating the weights at the end of each training set
- learning rate and number of training iterations are provided to the algorithm by the client
Issues:
My training algorithm degrades instead of improving over time. Can someone please explain what I am doing incorrectly?
This is my error across iterations; as you can see, the error is actually increasing:
{
0: 0.01646885885483229,
1: 0.017375368112097056,
2: 0.018105024923841584,
3: 0.01869233173693685,
4: 0.019165059856726563,
5: 0.01954556263697238,
6: 0.019851832477317588,
7: 0.02009835160930562,
8: 0.02029677690109266,
9: 0.020456491062436744
}
import pandas as panda
import matplotlib.pyplot as plot
import random

remote_location = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

class Perceptron(object):

    def __init__(self, epochs, learning_rate, weight_range=None):
        self.epochs = epochs
        self.learning_rate = learning_rate
        self.weight_range = weight_range if weight_range else [-1, 1]
        self.weights = []
        self._x_training_set = None
        self._y_training_set = None
        self.number_of_training_set = 0

    def setup(self):
        self.number_of_training_set = self.setup_training_set()
        self.initialize_weights(len(self._x_training_set[0]) + 1)

    def setup_training_set(self):
        """
        Downloading training set data from UCI ML Repository - Iris DataSet
        """
        data = panda.read_csv(remote_location)
        self._x_training_set = list(data.iloc[0:, [0, 2]].values)
        self._y_training_set = [0 if i.lower() != 'iris-setosa' else 1
                                for i in data.iloc[0:, 4].values]
        return len(self._x_training_set)

    def initialize_weights(self, number_of_weights):
        random_weights = [random.uniform(self.weight_range[0], self.weight_range[1])
                          for i in range(number_of_weights)]
        self.weights.append(-1)  # setting up bias unit
        self.weights.extend(random_weights)

    def draw_initial_plot(self, _x_data, _y_data, _x_label, _y_label):
        plot.xlabel(_x_label)
        plot.ylabel(_y_label)
        plot.scatter(_x_data, _y_data)
        plot.show()

    def learn(self):
        self.setup()
        epoch_data = {}
        error = 0
        for epoch in range(self.epochs):
            for i in range(self.number_of_training_set):
                _x = self._x_training_set[i]
                _desired = self._y_training_set[i]
                _weight = self.weights
                guess = _weight[0]  # setting up the bias unit
                for j in range(len(_x)):
                    guess += _weight[j + 1] * _x[j]
                error = _desired - guess
                # i am going to reset all the weights
                if error != 0:
                    # resetting the bias unit
                    self.weights[0] = error * self.learning_rate
                    for j in range(len(_x)):
                        self.weights[j + 1] = self.weights[j + 1] + error * self.learning_rate * _x[j]
            # saving error at the end of the training set
            epoch_data[epoch] = error
        # print(epoch_data)
        self.draw_initial_plot(list(epoch_data.keys()), list(epoch_data.values()),
                               'Epochs', 'Error')

def runMyCode():
    learning_rate = 0.01
    epochs = 15
    random_generator_start = -1
    random_generator_end = 1
    perceptron = Perceptron(epochs, learning_rate,
                            [random_generator_start, random_generator_end])
    perceptron.learn()

runMyCode()
[plot with epoch 6]
[plot with epoch > 6]
That's really weird, since I just ran your code (as is) and got the following plot:
Can you elaborate on where you see this error? Maybe you are reading it in reverse.
I figured out the issue: I was plotting the error function incorrectly.
The error in gradient descent is the error squared, so the code should be:
epoch_data[epoch] = error ** 2
Once I did that, the epoch vs. cost function plot always converged towards zero, irrespective of the number of epochs.
I am writing a KNN classifier (taken from here) for character recognition from accelerometer and gyroscope data. But the functions below are not working correctly and the prediction is not happening. Are there any mistakes in the code below? Kindly guide me.
trainingset -> training data with 20 samples (10 = A, 10 = B).
testset -> live reading taken for recognition.
# -- KNN Classifier Functions ----------
def loaddataset():
    global trainingset
    with open('imudata.csv', 'rb') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
    for x in range(len(dataset)):
        trainingset.append(dataset[x])

def euclideandistance(instance1, instance2, length):
    distance = 0
    for x in range(length-1):
        instance1[x] = float(instance1[x])
        instance2[x] = float(instance2[x])
    for x in range(length-1):
        distance += pow((instance1[x]-instance2[x]), 2)
    return math.sqrt(distance)

def getneighbours(trainingset, testinstance, k):
    distances = []
    length = len(testinstance)-1
    for x in range(len(trainingset)):
        dist = euclideandistance(testinstance, trainingset[x], length)
        # print(trainingset[x][-1], dist)
        distances.append((trainingset[x], dist))
    # print(distances)
    distances.sort(key=operator.itemgetter(1))
    # print(distances)
    neighbours = []
    print('k=' + repr(k) + ' length of distances=' + repr(len(distances)))
    for x in range(k):
        neighbours.append(distances[x][0])
    return neighbours

def getresponse(neighbours):
    classvotes = {}
    for x in range(len(neighbours)):
        response = neighbours[x][-1]
        if response in classvotes:
            classvotes[response] += 1
        else:
            classvotes[response] = 1
    sortedvotes = sorted(classvotes.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedvotes[0][0]

def getaccuracy(testset, predictions):
    correct = 0
    for x in range(len(testset)):
        if testset[x][-1] is predictions[x]:
            correct += 1
    return ((correct/float(len(testset))) * 100.0)
# ------- END of KNN Classifier Functions -------------
My main compare function is:
def compare():
    loaddataset()
    testset.append(testdata)
    print 'Train set: ' + repr(len(trainingset))
    print 'Test set: ' + repr(len(testset))
    predictions = []
    k = len(trainingset)
    for x in range(len(testset)):
        neighbours = getneighbours(trainingset, testset[x], k)
        result = getresponse(neighbours)
        predictions.append(result)
        print('>Predicted=' + repr(result) + ', actual=' + repr(testset[x][-1]))
    accuracy = getaccuracy(testset, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')
My output is
Train set: 20
Test set: 1
k=20 length of distance=20
>Predicted='A', actual='B'
Accuracy: 0.0%
My sample data packet:
-1.1945864763443935e-16,1.0000000000000031,0.81335962823925234,1.2678119727931405,4.6396523259663871,3,1.0000000000000013,108240.99999999988,328.99999999999966,4.3008487686466931e-16,1.000000000000002,0.73006871826334618,0.88693535629714804,4.3903300136708818,15,1.0000000000000011,108240.99999999977,328.99999999999932,1.990977460573989e-16,1.0000000000000009,0.8120281400849243,1.3556881217171162,4.2839744646260876,9,1.0000000000000004,108240.99999999994,328.99999999999983,-3.4217816017322454e-16,1.0000000000000009,0.7842111273340705,1.0882622268942712,4.4762484049613418,4,1.0000000000000004,108241.00000000038,329.00000000000114,2.6996304550155782e-18,1.000000000000004,0.76504908035654873,1.1890598964371606,4.2138613873737967,7,1.000000000000002,108241.0000000001,329.00000000000028,7.154020705791282e-17,1.0,0.83945423805187047,1.4309844267934049,3.7008217934312198,6,1.0,108240.99999999983,328.99999999999949,-0.66014932688009009,0.48967404184734276,0.083592048161537938,A
I am from a hardware background and don't know much about KNN, which is why I am asking for corrections to my code, if any. I added my dataset here.
I can see from your data that the number of samples is much smaller than the number of features, which may affect the accuracy of the predictions; the number of samples needs to be much higher. You also can't expect to predict everything correctly; algorithms have their own accuracies. Try to check the correctness of this code on another well-known dataset such as Iris, or try the built-in KNN classifier from scikit-learn.
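As a rough illustration of that last suggestion, here is a minimal sketch of sanity-checking a KNN setup with scikit-learn on the Iris dataset (a small k such as 3 or 5, rather than the full training-set size, is the usual choice):
# Sketch: scikit-learn KNN on Iris as a sanity check for the approach.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)  # small k, not len(training set)
knn.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, knn.predict(X_test)))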