Is it necessary to re-train BERT models, specifically RoBERTa model? - python

I am looking for sentiment analysis code with at least 80% accuracy. I tried VADER and found it easy and usable, but it was only giving 64% accuracy.
Now I am looking at some BERT models, and I noticed they need to be re-trained? Is that correct? Aren't they pre-trained? Is re-training necessary?

You can use pre-trained models from HuggingFace. There are plenty to choose from; search for emotion or sentiment models.
Here is an example using a model with 26 emotions. The current implementation works but is very slow for large datasets.
import pandas as pd
from transformers import RobertaTokenizerFast, TFRobertaForSequenceClassification, pipeline

tokenizer = RobertaTokenizerFast.from_pretrained("arpanghoshal/EmoRoBERTa")
model = TFRobertaForSequenceClassification.from_pretrained("arpanghoshal/EmoRoBERTa")

# pass the objects loaded above so the weights are not downloaded twice
emotion = pipeline('sentiment-analysis',
                   model=model,
                   tokenizer=tokenizer)

# example data
DATA_URI = "https://github.com/AFAgarap/ecommerce-reviews-analysis/raw/master/Womens%20Clothing%20E-Commerce%20Reviews.csv"
dataf = pd.read_csv(DATA_URI, usecols=["Review Text"])

# This is super slow, I will find a better optimization ASAP
dataf = (dataf
         .head(50)  # comment this out for the whole dataset
         .assign(Emotion=lambda d: (d["Review Text"]
                                    .fillna("")
                                    .map(lambda x: emotion(x)[0].get("label", None))))
         )
We could also refactor it a bit:
...
# a bit faster than the previous version but still slow
def emotion_func(text: str) -> str:
    if not text:
        return None
    return emotion(text)[0].get("label", None)

dataf = (dataf
         .head(50)  # comment this out for the whole dataset
         .assign(Emotion=lambda d: (d["Review Text"]
                                    .fillna("")  # NaN reviews would crash the pipeline
                                    .map(emotion_func)))
         )
Results:
Review Text Emotion
0 Absolutely wonderful - silky and sexy and comf... admiration
1 Love this dress! it's sooo pretty. i happene... love
2 I had such high hopes for this dress and reall... fear
3 I love, love, love this jumpsuit. it's fun, fl... love
...
6 I aded this in my basket at hte last mintue to... admiration
7 I ordered this in carbon for store pick up, an... neutral
8 I love this dress. i usually get an xs but it ... love
9 I'm 5"5' and 125 lbs. i ordered the s petite t... love
...
16 Material and color is nice. the leg opening i... neutral
17 Took a chance on this blouse and so glad i did... admiration
...
26 I have been waiting for this sweater coat to s... excitement
27 The colors weren't what i expected either. the... disapproval
...
31 I never would have given these pants a second ... love
32 These pants are even better in person. the onl... disapproval
33 I ordered this 3 months ago, and it finally ca... disappointment
34 This is such a neat dress. the color is great ... admiration
35 Wouldn't have given them a second look but tri... love
36 This is a comfortable skirt that can span seas... approval
...
40 Pretty and unique. great with jeans or i have ... admiration
41 This is a beautiful top. it's unique and not s... admiration
42 This poncho is so cute i love the plaid check ... love
43 First, this is thermal ,so naturally i didn't ... love
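A note on the speed: instead of calling the pipeline once per row, you can hand it the whole column as a list, which lets transformers batch the forward passes. A minimal sketch, assuming the emotion pipeline and dataf from above:
texts = dataf["Review Text"].fillna("").tolist()
results = emotion(texts)  # one {'label': ..., 'score': ...} dict per text
dataf["Emotion"] = [r.get("label") for r in results]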

You can use pickle.
Pickle lets you, well, pickle your model for later use. In fact, you can use a loop to keep training the model until it reaches a certain accuracy, then exit the loop and pickle the model for later use.
You can find many tutorials on YouTube on how to pickle a model.
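Here is a minimal, self-contained sketch of that idea with a small scikit-learn classifier (the toy data and the file name are just for illustration):
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy data for illustration only
texts = ["great product", "loved it", "terrible service", "awful quality"]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

# once accuracy on held-out data is acceptable, pickle the fitted model
with open("sentiment_model.pkl", "wb") as f:
    pickle.dump(clf, f)

# later: reload it without retraining
with open("sentiment_model.pkl", "rb") as f:
    clf = pickle.load(f)
print(clf.predict(["great service"]))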

Related

Naive Bayes classifier not working for sentiment analysis

I am trying to train a Naive Bayes classifier to predict whether a movie review is good or bad.
I am following this tutorial but have run into an error when trying to train the model:
https://medium.com/@MarynaL/analyzing-movie-review-data-with-natural-language-processing-7c5cba6ed922
I have followed all steps until training the model. My data and code looks as such:
Reviews Labels
0 For fans of Chris Farley, this is probably his... 1
1 Fantastic, Madonna at her finest, the film is ... 1
2 From a perspective that it is possible to make... 1
3 What is often neglected about Harold Lloyd is ... 1
4 You'll either love or hate movies such as this... 1
... ...
14995 This is perhaps the worst movie I have ever se... 0
14996 I was so looking forward to seeing this film t... 0
14997 It pains me to see an awesome movie turn into ... 0
14998 "Grande Ecole" is not an artful exploration of... 0
14999 I felt like I was watching an example of how n... 0
gnb = MultinomialNB()
gnb.fit(all_train_set['Reviews'], all_train_set['Labels'])
However when trying to fit the model I receive this error:
ValueError: could not convert string to float: 'For fans of Chris Farley, this is probably his best film. David Spade pl
If anyone could help me figure out why following this tutorial has gone wrong, it would be greatly appreciated.
Many thanks
Indeed, with scikit-learn you have to convert texts to numbers before calling a classifier. You can do so by using, for instance, the CountVectorizer or the TfidfVectorizer.
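A minimal sketch with CountVectorizer (assuming all_train_set from your question):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# turn the raw review strings into a sparse bag-of-words matrix,
# which MultinomialNB can consume directly
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(all_train_set['Reviews'])
y = all_train_set['Labels']

gnb = MultinomialNB()
gnb.fit(X, y)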
If you want to use the more modern word embeddings, you can use the Zeugma package (install it with pip install zeugma in a terminal), e.g.
from zeugma.embeddings import EmbeddingTransformer
from sklearn.naive_bayes import GaussianNB

embedding = EmbeddingTransformer('glove')
X = embedding.transform(all_train_set['Reviews'])
y = all_train_set['Labels']

# GloVe vectors contain negative values, which MultinomialNB rejects,
# so use GaussianNB (or e.g. LogisticRegression) on embeddings
gnb = GaussianNB()
gnb.fit(X, y)
I hope it helps!

How to get rid of the bold tag from xml document in python 3 without removing the enclosed text?

I am trying to remove the bold tag (<b> Some text in bold here </b>) from this xml document (but want to keep the text covered by the tags intact). The bold tags are present around the following words/text: Objectives, Design, Setting, Participants, Interventions, Main outcome measures, Results, Conclusion, and Trial registrations.
This is my Python code:
import urllib.request
import xml.etree.ElementTree as etree

urlHead = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&rettype=abstract&id='
pmid = "28420629"
completeUrl = urlHead + pmid
response = urllib.request.urlopen(completeUrl)
tree = etree.parse(response)

studyAbstractParts = tree.findall('.//AbstractText')
for studyAbstractPart in studyAbstractParts:
    print(studyAbstractPart.text)
The problem with this code is that it finds all the text under the AbstractText tag but stops at (or ignores) the text inside the bold tags and everything after them. In principle, I need all the text between the <AbstractText> </AbstractText> tags, but the bold formatting <b> </b> gets in the way.
You can use the itertext() method to get all the text in <AbstractText> and its subelements.
studyAbstractParts = tree.findall('.//AbstractText')
for studyAbstractPart in studyAbstractParts:
    for t in studyAbstractPart.itertext():
        print(t)
Output:
Objectives
 To determine whether preoperative dexamethasone reduces postoperative vomiting in patients undergoing elective bowel surgery and whether it is associated with other measurable benefits during recovery from surgery, including quicker return to oral diet and reduced length of stay.
Design
 Pragmatic two arm parallel group randomised trial with blinded postoperative care and outcome assessment.
Setting
 45 UK hospitals.
Participants
 1350 patients aged 18 or over undergoing elective open or laparoscopic bowel surgery for malignant or benign pathology.
Interventions
 Addition of a single dose of 8 mg intravenous dexamethasone at induction of anaesthesia compared with standard care.
Main outcome measures
 Primary outcome: reported vomiting within 24 hours reported by patient or clinician.
vomiting with 72 and 120 hours reported by patient or clinician; use of antiemetics and postoperative nausea and vomiting at 24, 72, and 120 hours rated by patient; fatigue and quality of life at 120 hours or discharge and at 30 days; time to return to fluid and food intake; length of hospital stay; adverse events.
Results
 1350 participants were recruited and randomly allocated to additional dexamethasone (n=674) or standard care (n=676) at induction of anaesthesia. Vomiting within 24 hours of surgery occurred in 172 (25.5%) participants in the dexamethasone arm and 223 (33.0%) allocated standard care (number needed to treat (NNT) 13, 95% confidence interval 5 to 22; P=0.003). Additional postoperative antiemetics were given (on demand) to 265 (39.3%) participants allocated dexamethasone and 351 (51.9%) allocated standard care (NNT 8, 5 to 11; P<0.001). Reduction in on demand antiemetics remained up to 72 hours. There was no increase in complications.
Conclusions
 Addition of a single dose of 8 mg intravenous dexamethasone at induction of anaesthesia significantly reduces both the incidence of postoperative nausea and vomiting at 24 hours and the need for rescue antiemetics for up to 72 hours in patients undergoing large and small bowel surgery, with no increase in adverse events.
Trial registration
 EudraCT (2010-022894-32) and ISRCTN (ISRCTN21973627).
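If you would rather have each section of the abstract as a single string instead of separate fragments, you can join the pieces (same studyAbstractParts as above):
for studyAbstractPart in studyAbstractParts:
    print("".join(studyAbstractPart.itertext()))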

ERROR while loading & reading custom 20newsgroups corpus with NLTK

I am trying to load the 20newsgroups corpus with the NLTK corpus reader, and then extract words from all documents and tag them. But it shows an error when I try to build the list of extracted, tagged words.
Here is the CODE:
import nltk
import random
from nltk.tokenize import word_tokenize
newsgroups = nltk.corpus.reader.CategorizedPlaintextCorpusReader(
    r"C:\nltk_data\corpora\20newsgroups",
    r'(?!\.).*\.txt',
    cat_pattern=r'(not_sports|sports)/.*',
    encoding="utf8")

documents = [(list(newsgroups.words(fileid)), category)
             for category in newsgroups.categories()
             for fileid in newsgroups.fileids(category)]

random.shuffle(documents)
And the corresponding ERROR is:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-10-de2a1a6859ea> in <module>()
1 documents = [(list(newsgroups.words(fileid)), category)
----> 2 for category in newsgroups.categories()
3 for fileid in newsgroups.fileids(category)]
4
5 random.shuffle(documents)
<ipython-input-10-de2a1a6859ea> in <listcomp>(.0)
1 documents = [(list(newsgroups.words(fileid)), category)
2 for category in newsgroups.categories()
----> 3 for fileid in newsgroups.fileids(category)]
4
5 random.shuffle(documents)
C:\ProgramData\Anaconda3\lib\site-packages\nltk\corpus\reader\util.py in __len__(self)
231 # iterate_from() sets self._len when it reaches the end
232 # of the file:
--> 233 for tok in self.iterate_from(self._toknum[-1]): pass
234 return self._len
235
C:\ProgramData\Anaconda3\lib\site-packages\nltk\corpus\reader\util.py in iterate_from(self, start_tok)
294 self._current_toknum = toknum
295 self._current_blocknum = block_index
--> 296 tokens = self.read_block(self._stream)
297 assert isinstance(tokens, (tuple, list, AbstractLazySequence)), (
298 'block reader %s() should return list or tuple.' %
C:\ProgramData\Anaconda3\lib\site-packages\nltk\corpus\reader\plaintext.py in _read_word_block(self, stream)
120 words = []
121 for i in range(20): # Read 20 lines at a time.
--> 122 words.extend(self._word_tokenizer.tokenize(stream.readline()))
123 return words
124
C:\ProgramData\Anaconda3\lib\site-packages\nltk\data.py in readline(self, size)
1166 while True:
1167 startpos = self.stream.tell() - len(self.bytebuffer)
-> 1168 new_chars = self._read(readsize)
1169
1170 # If we're at a '\r', then read one extra character, since
C:\ProgramData\Anaconda3\lib\site-packages\nltk\data.py in _read(self, size)
1398
1399 # Decode the bytes into unicode characters
-> 1400 chars, bytes_decoded = self._incr_decode(bytes)
1401
1402 # If we got bytes but couldn't decode any, then read further.
C:\ProgramData\Anaconda3\lib\site-packages\nltk\data.py in _incr_decode(self, bytes)
1429 while True:
1430 try:
-> 1431 return self.decode(bytes, 'strict')
1432 except UnicodeDecodeError as exc:
1433 # If the exception occurs at the end of the string,
C:\ProgramData\Anaconda3\lib\encodings\utf_8.py in decode(input, errors)
14
15 def decode(input, errors='strict'):
---> 16 return codecs.utf_8_decode(input, errors, True)
17
18 class IncrementalEncoder(codecs.IncrementalEncoder):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 6: invalid start byte
I have tried changing the encoding in the corpus reader to ascii and to utf16 as well; that's not working either. I am also not sure whether the regex I have provided is the right one. The filenames in the 20newsgroups corpus are in the form of two numbers separated by a hyphen (-), such as:
5-53286
102-53553
8642-104983
The second thing I am worried about is whether the error is generated from the document contents when they are read for feature extraction.
Here is what documents in the 20newsgroups corpus look like:
From: bil@okcforum.osrhe.edu (Bill Conner) Subject: Re: free moral
agency
dean.kaflowitz (decay@cbnewsj.cb.att.com) wrote: : > : > I think
you're letting atheist mythology
: Great start. I realize immediately that you are not interested : in
discussion and are going to thump your babble at me. I would : much
prefer an answer from Ms Healy, who seems to have a : reasonable and
reasoned approach to things. Say, aren't you the : creationist guy
who made a lot of silly statements about : evolution some time ago?
: Duh, gee, then we must be talking Christian mythology now. I : was
hoping to discuss something with a reasonable, logical : person, but
all you seem to have for your side is a repetition : of the same
boring mythology I've seen a thousand times before. : I am deleting
the rest of your remarks, unless I spot something : that approaches an
answer, because they are merely a repetition : of some uninteresting
doctrine or other and contain no thought : at all.
: I have to congratulate you, though, Bill. You wouldn't : know a
logical argument if it bit you on the balls. Such : a persistent lack
of function in the face of repeated : attempts to assist you in
learning (which I have seen : in this forum and others in the past)
speaks of a talent : that goes well beyond my own, meager abilities.
I just don't : seem to have that capacity for ignoring outside
influences.
: Dean Kaflowitz
Dean,
Re-read your comments, do you think that merely characterizing an
argument is the same as refuting it? Do you think that ad hominum
attacks are sufficient to make any point other than you disapproval of
me? Do you have any contribution to make at all?
Bill
From: cmk@athena.mit.edu (Charles M Kozierok) Subject: Re: Jack Morris
In article <1993Apr19.024222.11181@newshub.ariel.yorku.ca> cs902043@ariel.yorku.ca (SHAWN LUDDINGTON) writes: } In article <1993Apr18.032345.5178@cs.cornell.edu> tedward@cs.cornell.edu (Edward [Ted] Fischer) writes: } >In article <1993Apr18.030412.1210@mnemosyne.cs.du.edu> gspira@nyx.cs.du.edu (Greg Spira) writes: } >>Howard_Wong@mindlink.bc.ca (Howard Wong) writes: }
>> } >>>Has Jack lost a bit of his edge? What is the worst start Jack Morris has had? } >> } >>Uh, Jack lost his edge about 5 years ago, and has had only one above } >>average year in the last 5. } > } >Again goes to prove that it is better to be good than lucky. You can }
>count on good tomorrow. Lucky seems to be prone to bad starts (and a } >bad finish last year :-). } > } >(Yes, I am enjoying every last run he gives up. Who was it who said } >Morris was a better signing than Viola?) } } Hey Valentine, I don't see Boston with any world series rings on their } fingers.
oooooo. cheap shot. :^)
} Damn, Morris now has three and probably the Hall of Fame in his } future.
who cares? he had two of them before he came to Toronto; and if the Jays had signed Viola instead of Morris, it would have been Frank who won 20 and got the ring. and he would be on his way to 20 this year, too.
} Therefore, I would have to say Toronto easily made the best } signing.
your logic is curious, and spurious.
there is no reason to believe that Viola wouldn't have won as many games had *he* signed with Toronto. when you compare their stupid W-L records, be sure to compare their team's offensive averages too.
now, looking at anything like the Morris-Viola sweepstakes a year later is basically hindsight. but there were plenty of reasons why it should have been apparent that Viola was the better pitcher, based on previous recent years and also based on age (Frank is almost 5 years younger! how many knew that?). people got caught up in the '91 World Series, and then on Morris' 21 wins last year. wins are the stupidest, most misleading statistic in baseball, far worse than RBI or R. that he won 21 just means that the Jays got him a lot of runs.
the only really valid retort to Valentine is: weren't the Red Sox trying to get Morris too? oh, sure, they *said* Viola was their first choice afterwards, but what should we have expected they would say?
} And don't tell me Boston will win this year. They won't } even be in the top 4 in the division, more like 6th.
if this is true, it won't be for lack of contribution by Viola, so who cares?
-*- charles
Please suggest whether the error occurs while loading the documents or while reading the files and extracting words. What do I need to do to load the corpus correctly?
NLTK has known issues loading some corpora.
You can load the useful category data with scikit-learn instead:
from sklearn.datasets import fetch_20newsgroups
cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
Here, newsgroups_train.target_names gives you the categories.
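If you would rather keep the NLTK reader, note that byte 0xa0 from your traceback is a non-breaking space in Latin-1, and the raw 20newsgroups files are not valid UTF-8. A sketch of your original reader with the encoding swapped (untested against your copy of the corpus):
newsgroups = nltk.corpus.reader.CategorizedPlaintextCorpusReader(
    r"C:\nltk_data\corpora\20newsgroups",
    r'(?!\.).*\.txt',
    cat_pattern=r'(not_sports|sports)/.*',
    encoding="latin-1")  # Latin-1 maps every byte, so decoding cannot fail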

Add data_frame column while fulfill condition in dict

I am trying to add a column to a pandas.DataFrame whenever a string in the DataFrame contains one or more words that are keys in a dict. But it gives me an error, and I don't know what went wrong. Could anyone help?
data_frame:
tw_test.head()
tweet
0 living the dream. #cameraman #camera #camerac...
1 justin #trudeau's reasons for thanksgiving. to...
2 #themadape butt…..butt…..we’re allergic to l...
3 2 massive explosions at peace march in #turkey...
4 #mulcair suggests there’s bad blood between hi...
dict:
party = {'#mulcair': 'NDP', '#cdnleft': 'liberal', '#LiberalExpress': 'liberal', '#ThankYouStephenHarper': 'Conservative ', '#pmjt': 'liberal'...}
My code:
tw_test["party"]=tw_test["tweet"].apply(lambda x: party[x.split(' ')[1].startswith("#")[0]])
I believe your trouble was due to trying to cram too much into a lambda. A function to do the lookup is pretty straightforward:
Code:
party_tags = {
    '#mulcair': 'NDP',
    '#cdnleft': 'liberal',
    '#LiberalExpress': 'liberal',
    '#ThankYouStephenHarper': 'Conservative ',
    '#pmjt': 'liberal'
}

def party(tweet):
    # return the party for the first known hashtag in the tweet, else None
    for tag in [t for t in tweet.split() if t.startswith('#')]:
        if tag in party_tags:
            return party_tags[tag]
Test Code:
import pandas as pd
tw_test = pd.DataFrame([x.strip() for x in u"""
living the dream. #cameraman #camera #camerac
justin #trudeau's reasons for thanksgiving. to
#themadape butt…..butt…..we’re allergic to
2 massive explosions at peace march in #turkey
#mulcair suggests there’s bad blood between
""".split('\n')[1:-1]], columns=['tweet'])
tw_test["party"] = tw_test["tweet"].apply(party)
print(tw_test)
Results:
tweet party
0 living the dream. #cameraman #camera #camerac None
1 justin #trudeau's reasons for thanksgiving. to None
2 #themadape butt…..butt…..we’re allergic to None
3 2 massive explosions at peace march in #turkey None
4 #mulcair suggests there’s bad blood between NDP
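A slightly more pandas-flavoured sketch of the same lookup, extracting the hashtags with a regex first (same party_tags dict as above):
def party_from_tags(tags):
    # return the party for the first known hashtag, else None
    for tag in tags:
        if tag in party_tags:
            return party_tags[tag]

tw_test["party"] = (tw_test["tweet"]
                    .str.findall(r'#\w+')
                    .apply(party_from_tags))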

How to convert a list of mixed data type into a dataframe in Python

I have a list of mixed data type looking like this:
list = [['3D prototypes',
'Can print large objects',
'Autodesk Maya/Mudbox',
'3D Studio'],
['We can produce ultra high resolution 3D prints in multiple materials.',
'The quality of our prints beats MakerBot, Form 1, or any other either
powder based or printers using PLA, ABS, Wax or Resin. This printer has
the highest resolution and a very large build size. It prints fully
functional moving parts like a chain or an engine right out of the
printer.',
'The printer is loaded with DurusWhite.',
'Inquire to change the material. There is a $30 surcharge for material
switch.',
"Also please mention your creation's dimensions in mm and if you need
expedite delivery.",
"Printer's Net build size:",
'294 x 192 x 148.6 mm (11.57 x 7.55 x 5.85 in.)',
'The Objet30 features four Rigid Opaque materials and one material that
mimics polypropylene. The Vero family of materials all feature dimensional
stability and high-detail visualization, and are designed to simulate
plastics that closely resemble the end product.',
'PolyJet based printers have a different way of working. These
technologies deliver the highest quality and precision unmatched by the
competition. These type of printers are ideal for professionals, for uses
ranging from casting jewelry to device prototyping.',
'Rigid opaque white (VeroWhitePlus)',
'Rigid opaque black (VeroBlackPlus )',
'Rigid opaque blue (VeroBlue)',
'Rigid opaque gray (VeroGray)',
'Polypropylene-like material (DurusWhite) for snap fit applications'],
'Hub can print invoices',
'postal service',
'Mar 2015',
'Within the hour i',
[u'40.7134', u'-74.0069'],
'4',
['Customer JAMES reviewed Sun, 2015-04-19 05:17: Awesome print!
Good quality, relatively fast shipping, and very responsive to my
questions; would certainly recommend this hub. ',
'Hub XSENIO replied 2 days 16 hours ago: Thanks James! ',
'Customer Sara reviewed Sun, 2015-04-19 00:10: Thank you for going
out of your way to get this to us in time for our shoot. ',
'Hub XSENIO replied 2 days 16 hours ago: Thanks ! ',
'Customer Aaron reviewed Sat, 2015-04-18 02:36: Great service ',
'Hub XSENIO replied 2 days 16 hours ago: Thanks! ',
"Customer Arnoldas reviewed Mon, 2015-03-23 19:47: Xsenio's Hub was
able to produce an excellent quality print , was quick and reliable.
Awesome printing experience! "]]
It has a mixed data type looking like this,
<type 'list'>
<type 'list'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'list'>
<type 'str'>
<type 'list'>
But when I use
pd.DataFrame(list)
It shows:
TypeError: Expected list, got str
Can anyone tell me what's wrong with that? Do I have to convert all items in list from string to list?
Thanks
It seems you should convert your list into a numpy array or a dict:
from pandas import DataFrame
import numpy

# note the dtype=object at the end of the call: the elements have
# mixed types and lengths, so they must be stored as Python objects
lst = numpy.array([['3D prototypes',
'Can print large objects',
'Autodesk Maya/Mudbox',
'3D Studio'],
['We can produce ultra high resolution 3D prints in multiple materials.',
'''The quality of our prints beats MakerBot, Form 1, or any other either
powder based or printers using PLA, ABS, Wax or Resin. This printer has
the highest resolution and a very large build size. It prints fully
functional moving parts like a chain or an engine right out of the
printer.''',
'The printer is loaded with DurusWhite.',
'''Inquire to change the material. There is a $30 surcharge for material
switch.''',
'''Also please mention your creation's dimensions in mm and if you need
expedite delivery.''',
"Printer's Net build size:",
'294 x 192 x 148.6 mm (11.57 x 7.55 x 5.85 in.)',
'''The Objet30 features four Rigid Opaque materials and one material that
mimics polypropylene. The Vero family of materials all feature dimensional
stability and high-detail visualization, and are designed to simulate
plastics that closely resemble the end product.''',
'''PolyJet based printers have a different way of working. These
technologies deliver the highest quality and precision unmatched by the
competition. These type of printers are ideal for professionals, for uses
ranging from casting jewelry to device prototyping.''',
'Rigid opaque white (VeroWhitePlus)',
'Rigid opaque black (VeroBlackPlus )',
'Rigid opaque blue (VeroBlue)',
'Rigid opaque gray (VeroGray)',
'Polypropylene-like material (DurusWhite) for snap fit applications'],
'Hub can print invoices',
'postal service',
'Mar 2015',
'Within the hour i',
[u'40.7134', u'-74.0069'],
'4',
['''Customer JAMES reviewed Sun, 2015-04-19 05:17: Awesome print!
Good quality, relatively fast shipping, and very responsive to my
questions; would certainly recommend this hub. ''',
'Hub XSENIO replied 2 days 16 hours ago: Thanks James! ',
'''Customer Sara reviewed Sun, 2015-04-19 00:10: Thank you for going
out of your way to get this to us in time for our shoot. ''',
'Hub XSENIO replied 2 days 16 hours ago: Thanks ! ',
'Customer Aaron reviewed Sat, 2015-04-18 02:36: Great service ',
'Hub XSENIO replied 2 days 16 hours ago: Thanks! ',
'''Customer Arnoldas reviewed Mon, 2015-03-23 19:47: Xsenio's Hub was
able to produce an excellent quality print , was quick and reliable.
Awesome printing experience! ''']], dtype=object)
df = DataFrame(lst)
print(df)
The above prints
0
0 [3D prototypes, Can print large objects, Autod...
1 [We can produce ultra high resolution 3D print...
2 Hub can print invoices
3 postal service
4 Mar 2015
5 Within the hour i
6 [40.7134, -74.0069]
7 4
8 [Customer JAMES reviewed Sun, 2015-04-19 05:17...
[9 rows x 1 columns]
The doc does state the data parameter should be a numpy array or dict: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html
PS: I also took the liberty of enclosing the multiline strings in triple quotes
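Another option (a sketch, assuming the original list from the question is renamed to items so it no longer shadows the built-in list) is to skip numpy entirely and hand DataFrame a dict of columns, which it accepts directly:
df = DataFrame({"data": items})  # each element of items becomes one cell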
