TextBlob sentiment analysis: nan values - python

I'm new to sentiment analysis and I'm exploring TextBlob.
My data is pre-processed Twitter data. It's in a pandas Series, and each tweet has been cleaned and tokenized:
0 [new, leaked, treasury, document, full, sugges...
1 [tommy, robinson, endorsing, conservative, for...
2 [thanks, already, watched, catch, tv, morning, ]
3 [treasury, document, check, today, check, cons...
4 [utterly, stunning, video, hoped, prayed, woul...
... ...
307370 [trump, disciple, copycat]
307373 [disgusting]
307389 [wonder, people, vote, racist, homophobe, like...
307391 [gary, neville, slam, fuelling, racism, manche...
307393 [brexit, fault, excuseforeverything]
When I run TextBlob sentiment (using help from Apply textblob in for each row of a dataframe), my result is a column of None values:
# Create sentiment column using TextBlob
# Source: https://stackoverflow.com/questions/43485469/apply-textblob-in-for-each-row-of-a-dataframe
from textblob import TextBlob

def sentiment_calc(text):
    try:
        return TextBlob(text).sentiment
    except:
        return None

boris_data['sentiment'] = boris_data['text'].apply(sentiment_calc)
text sentiment
0 [new, leaked, treasury, document, full, sugges... None
1 [tommy, robinson, endorsing, conservative, for... None
2 [thanks, already, watched, catch, tv, morning, ] None
3 [treasury, document, check, today, check, cons... None
4 [utterly, stunning, video, hoped, prayed, woul... None
... ... ...
307370 [trump, disciple, copycat] None
307373 [disgusting] None
307389 [wonder, people, vote, racist, homophobe, like... None
307391 [gary, neville, slam, fuelling, racism, manche... None
307393 [brexit, fault, excuseforeverything] None
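A likely cause, for reference: TextBlob's constructor requires a raw string, so passing each tokenized list raises a TypeError, which the bare except silently converts to None. A minimal sketch of a fix, assuming the tokens can simply be re-joined with spaces:

from textblob import TextBlob

def sentiment_calc(tokens):
    try:
        # TextBlob expects a string, not a list of tokens
        return TextBlob(" ".join(tokens)).sentiment
    except TypeError:
        return None

boris_data['sentiment'] = boris_data['text'].apply(sentiment_calc)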

Related

How to search for a specific string in another column

What I want to do is: I have several columns, and I want to search for each column's string within another column. Here is the example.
Here is my df.head()
Check index reviews summary output1 output2 ... output35 output36 output37 output38 output39 output40
0 True 1 After realizing my old mascara had a petroleum... Output: Quality: love the wand (positive); Len... Quality: love the wand (positive) Lengthening: lengthens my lashes well (positive) ... None None None None None None
1 True 2 Best mascara I’ve ever used. Makes my non-exis... Output: Makes lashes visible (positive); Non-e... Makes lashes visible (positive) Non-existent Asian lashes (negative) ... None None None None None None
2 True 3 I've never had a mascara that made my lashes l... Output: Look: long lashes (positive) Look: long lashes (positive) None ... None None None None None None
3 True 4 It is clump and smudge-free with an awesome la... Output: Clump-free (positive); Smudge-proof (p... Clump-free (positive) Smudge-proof (positive) ... None None None None None None
4 True 5 And I’m going to buy it again, and again, and ... Output: Quality: impressed by a mascara before... Quality: impressed by a mascara before (positive) Extends Lashes: huge impact (positive) ... None None None None None None
What I want to do is to check whether the values of all output columns (except columns with N/A in them) are inside the string in the reviews column.
Here is an example for the reviews column and the output1 column value.
Reviews = "After realizing my old mascara had a petroleum by-product in it, I needed a new one I felt good about putting on my beautiful lashes. I needed one that had a clean formula, produced by a company making more environmentally sustainable efforts. I was so excited to receive this mascara after ordering it, and it did not disappoint. I love natural, clean beauty products. I love the wand on this mascara, and it lengthens my lashes very well. I like using mascara because I love my lashes, and I don't want to use falsies or extensions. This mascara takes me from girlboss to goddess in less than 2 minutes. It's so easy to use, non-messy, and dries fast. Even if you do or don't feel like doing a full-face of makeup, this mascara will upgrade your look. It is a must-have and worth every cent."
Output1 = "Quality: love the wand (positive)"
I want to check whether the value "love the wand" is in the reviews column.
# Read your DataFrame
import pandas as pd

df = pd.read_excel("ilia.xlsx")
df.insert(0, "Check", True)

# Split the values in the 'summary' column and create new columns
split_values = df['summary'].str.split(";", expand=True)
for i in range(split_values.shape[1]):
    df[f'output{i+1}'] = split_values[i]

# Strip leading whitespace from all string columns
df = df.apply(lambda x: x.str.lstrip() if x.dtype == "object" else x)
df["output1"] = df["output1"].str.replace("Output: ", "")

# Check whether any summary segment appears verbatim in the review text
df["Check"] = df.apply(lambda x: any(i in x['reviews'] for i in x['summary'].split(";")), axis=1)

print(df.head())
df.to_excel("foo.xlsx", index=False)
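Note that the verbatim check above can fail for tags like "Quality: love the wand (positive)", because the "Quality:" prefix and the "(positive)" suffix don't literally appear in the review. A sketch of one way to strip them first (phrase_of is a hypothetical helper, and it assumes the tags always follow the "Label: phrase (sentiment)" shape):

import re

def phrase_of(tag):
    # Drop optional "Output:" / "Label:" prefixes, then the trailing "(positive)"/"(negative)"
    tag = re.sub(r'^\s*(Output:\s*)?([^:]+:\s*)?', '', tag)
    return re.sub(r'\s*\((positive|negative)\)\s*$', '', tag).strip()

df["Check"] = df.apply(
    lambda row: all(phrase_of(t) in row["reviews"]
                    for t in str(row["summary"]).split(";") if t.strip()),
    axis=1)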
Seems like this is what you are looking for:
df['reviews'].str.contains('love the wand').any()
From Statology
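Note that .any() collapses the result to a single True/False for the whole column; dropping it yields a per-row boolean Series instead, e.g.:

df["Check"] = df["reviews"].str.contains("love the wand", regex=False)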

(TypeError: expected string or bytes-like object) Why does my variable display as a different type than the data (string) stored in it?

My code basically looks like this so far, after importing the dataset, libraries, and all of that:
data = pd.read_csv("/content/gdrive/MyDrive/Data/tripadvisor_hotel_reviews.csv")
reviews = data['Review'].str.lower()
#Check
print(reviews)
print(type('Review'))
print(type(reviews))
The output, however, looks like this:
0 nice hotel expensive parking got good deal sta...
1 ok nothing special charge diamond member hilto...
2 nice rooms not 4* experience hotel monaco seat...
3 unique, great stay, wonderful time hotel monac...
4 great stay great stay, went seahawk game aweso...
...
20486 best kept secret 3rd time staying charm, not 5...
20487 great location price view hotel great quick pl...
20488 ok just looks nice modern outside, desk staff ...
20489 hotel theft ruined vacation hotel opened sept ...
20490 people talking, ca n't believe excellent ratin...
Name: Review, Length: 20491, dtype: object
<class 'str'>
<class 'pandas.core.series.Series'>
I want to know why the variable "reviews" is a different type than the data column "Review" if I (supposedly) set them equal.
This is a problem because when I try to tokenize my data, it throws an error.
My code for tokenize:
word_tokenize(reviews)
The error I get:
TypeError                                 Traceback (most recent call last)
<ipython-input-9-ebaf7dca0fec> in <module>()
----> 1 word_tokenize(reviews)

8 frames
/usr/local/lib/python3.7/dist-packages/nltk/tokenize/punkt.py in _slices_from_text(self, text)
   1287     def _slices_from_text(self, text):
   1288         last_break = 0
-> 1289         for match in self._lang_vars.period_context_re().finditer(text):
   1290             context = match.group() + match.group('after_tok')
   1291             if self.text_contains_sentbreak(context):

TypeError: expected string or bytes-like object
There are many things going on here. First of all, reviews is a pd.Series. This means that
word_tokenize(reviews)
won't work, because you can't tokenize a series of strings. You can tokenize, however, a string. The following should work
tokens = [word_tokenize(review) for review in reviews]
because review above is a string, and you are tokenizing each string in the whole pd.Series of strings named reviews.
Also, comparing type('Review') and type(reviews) makes no sense. reviews is a pd.Series (i.e. an iterable) with many different strings, while 'Review' is a string instance that holds the English word "Review" in it. type('Review') will always be str. In contrast, type(reviews) can change depending on what value the variable reviews holds.
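A more pandas-idiomatic equivalent, as a sketch (same tokens as the list comprehension, but returned as a Series aligned with the original index):

tokens = reviews.apply(word_tokenize)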

Is it necessary to re-train BERT models, specifically a RoBERTa model?

I am looking for sentiment analysis code with at least 80% accuracy. I tried VADER and found it easy and usable; however, it was giving an accuracy of only 64%.
Now I was looking at some BERT models, and I noticed they need to be re-trained? Is that correct? Aren't they pre-trained? Is re-training necessary?
You can use pre-trained models from Hugging Face. There are plenty to choose from; search for emotion or sentiment models.
Here is an example of a model with 26 emotions. The current implementation works but is very slow for large datasets.
import pandas as pd
from transformers import RobertaTokenizerFast, TFRobertaForSequenceClassification, pipeline

tokenizer = RobertaTokenizerFast.from_pretrained("arpanghoshal/EmoRoBERTa")
model = TFRobertaForSequenceClassification.from_pretrained("arpanghoshal/EmoRoBERTa")
emotion = pipeline('sentiment-analysis', model='arpanghoshal/EmoRoBERTa')

# example data
DATA_URI = "https://github.com/AFAgarap/ecommerce-reviews-analysis/raw/master/Womens%20Clothing%20E-Commerce%20Reviews.csv"
dataf = pd.read_csv(DATA_URI, usecols=["Review Text"])

# This is super slow, I will find a better optimization ASAP
dataf = (dataf
         .head(50)  # comment this out for the whole dataset
         .assign(Emotion=lambda d: (d["Review Text"]
                                    .fillna("")
                                    .map(lambda x: emotion(x)[0].get("label", None))))
         )
We could also refactor it a bit:
...

# a bit faster than the previous but still slow
def emotion_func(text: str) -> str:
    if not text:
        return None
    return emotion(text)[0].get("label", None)

dataf = (dataf
         .head(50)  # comment this out for the whole dataset
         .assign(Emotion=lambda d: d["Review Text"].map(emotion_func))
         )
Results:
Review Text Emotion
0 Absolutely wonderful - silky and sexy and comf... admiration
1 Love this dress! it's sooo pretty. i happene... love
2 I had such high hopes for this dress and reall... fear
3 I love, love, love this jumpsuit. it's fun, fl... love
...
6 I aded this in my basket at hte last mintue to... admiration
7 I ordered this in carbon for store pick up, an... neutral
8 I love this dress. i usually get an xs but it ... love
9 I'm 5"5' and 125 lbs. i ordered the s petite t... love
...
16 Material and color is nice. the leg opening i... neutral
17 Took a chance on this blouse and so glad i did... admiration
...
26 I have been waiting for this sweater coat to s... excitement
27 The colors weren't what i expected either. the... disapproval
...
31 I never would have given these pants a second ... love
32 These pants are even better in person. the onl... disapproval
33 I ordered this 3 months ago, and it finally ca... disappointment
34 This is such a neat dress. the color is great ... admiration
35 Wouldn't have given them a second look but tri... love
36 This is a comfortable skirt that can span seas... approval
...
40 Pretty and unique. great with jeans or i have ... admiration
41 This is a beautiful top. it's unique and not s... admiration
42 This poncho is so cute i love the plaid check ... love
43 First, this is thermal ,so naturally i didn't ... love
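On the speed point: a likely improvement (a sketch, untested here, and assuming a transformers version whose pipelines accept list inputs and a batch_size argument) is to call the pipeline once on the whole column instead of once per row:

texts = dataf["Review Text"].fillna("").tolist()
# One call; the pipeline batches internally. batch_size is a tuning knob, not a fixed recipe.
results = emotion(texts, batch_size=32)
dataf["Emotion"] = [r.get("label", None) for r in results]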
You can use pickle.
Pickle lets you... well, pickle your model for later use. In fact, you can use a loop to keep training the model until it reaches a certain accuracy, then exit the loop and pickle the model for later use.
You can find many tutorials on YouTube on how to pickle a model.
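A minimal sketch (assuming model is any picklable Python object, e.g. a fitted scikit-learn classifier; note that Hugging Face models normally use save_pretrained/from_pretrained instead):

import pickle

# Save the trained model to disk
with open("sentiment_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Load it back later without re-training
with open("sentiment_model.pkl", "rb") as f:
    model = pickle.load(f)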

ERROR while loading & reading custom 20newsgroups corpus with NLTK

I am trying to load the 20newsgroups corpus with the NLTK corpus reader, and thereafter I am extracting words from all documents and tagging them. But it shows an error when I try to build the list of extracted and tagged words.
Here is the CODE:
import nltk
import random
from nltk.tokenize import word_tokenize

newsgroups = nltk.corpus.reader.CategorizedPlaintextCorpusReader(
    r"C:\nltk_data\corpora\20newsgroups",
    r'(?!\.).*\.txt',
    cat_pattern=r'(not_sports|sports)/.*',
    encoding="utf8")

documents = [(list(newsgroups.words(fileid)), category)
             for category in newsgroups.categories()
             for fileid in newsgroups.fileids(category)]

random.shuffle(documents)
And the corresponding ERROR is:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-10-de2a1a6859ea> in <module>()
1 documents = [(list(newsgroups.words(fileid)), category)
----> 2 for category in newsgroups.categories()
3 for fileid in newsgroups.fileids(category)]
4
5 random.shuffle(documents)
<ipython-input-10-de2a1a6859ea> in <listcomp>(.0)
1 documents = [(list(newsgroups.words(fileid)), category)
2 for category in newsgroups.categories()
----> 3 for fileid in newsgroups.fileids(category)]
4
5 random.shuffle(documents)
C:\ProgramData\Anaconda3\lib\site-packages\nltk\corpus\reader\util.py in __len__(self)
231 # iterate_from() sets self._len when it reaches the end
232 # of the file:
--> 233 for tok in self.iterate_from(self._toknum[-1]): pass
234 return self._len
235
C:\ProgramData\Anaconda3\lib\site-packages\nltk\corpus\reader\util.py in iterate_from(self, start_tok)
294 self._current_toknum = toknum
295 self._current_blocknum = block_index
--> 296 tokens = self.read_block(self._stream)
297 assert isinstance(tokens, (tuple, list, AbstractLazySequence)), (
298 'block reader %s() should return list or tuple.' %
C:\ProgramData\Anaconda3\lib\site-packages\nltk\corpus\reader\plaintext.py in _read_word_block(self, stream)
120 words = []
121 for i in range(20): # Read 20 lines at a time.
--> 122 words.extend(self._word_tokenizer.tokenize(stream.readline()))
123 return words
124
C:\ProgramData\Anaconda3\lib\site-packages\nltk\data.py in readline(self, size)
1166 while True:
1167 startpos = self.stream.tell() - len(self.bytebuffer)
-> 1168 new_chars = self._read(readsize)
1169
1170 # If we're at a '\r', then read one extra character, since
C:\ProgramData\Anaconda3\lib\site-packages\nltk\data.py in _read(self, size)
1398
1399 # Decode the bytes into unicode characters
-> 1400 chars, bytes_decoded = self._incr_decode(bytes)
1401
1402 # If we got bytes but couldn't decode any, then read further.
C:\ProgramData\Anaconda3\lib\site-packages\nltk\data.py in _incr_decode(self, bytes)
1429 while True:
1430 try:
-> 1431 return self.decode(bytes, 'strict')
1432 except UnicodeDecodeError as exc:
1433 # If the exception occurs at the end of the string,
C:\ProgramData\Anaconda3\lib\encodings\utf_8.py in decode(input, errors)
14
15 def decode(input, errors='strict'):
---> 16 return codecs.utf_8_decode(input, errors, True)
17
18 class IncrementalEncoder(codecs.IncrementalEncoder):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 6: invalid start byte
I have tried changing the encoding in the corpus reader to ascii and utf16 as well; that's not working either. I am not sure whether the regex I have provided is the right one or not. The filenames in the 20newsgroups corpus are in the form of two numbers separated by a hyphen (-), such as:
5-53286
102-53553
8642-104983
The second thing that I am worried about is whether the error is being generated from the document contents when they are being read for feature extraction.
Here is what documents in the 20newsgroups corpus look like:
From: bil#okcforum.osrhe.edu (Bill Conner) Subject: Re: free moral
agency
dean.kaflowitz (decay#cbnewsj.cb.att.com) wrote: : > : > I think
you're letting atheist mythology
: Great start. I realize immediately that you are not interested : in
discussion and are going to thump your babble at me. I would : much
prefer an answer from Ms Healy, who seems to have a : reasonable and
reasoned approach to things. Say, aren't you the : creationist guy
who made a lot of silly statements about : evolution some time ago?
: Duh, gee, then we must be talking Christian mythology now. I : was
hoping to discuss something with a reasonable, logical : person, but
all you seem to have for your side is a repetition : of the same
boring mythology I've seen a thousand times before. : I am deleting
the rest of your remarks, unless I spot something : that approaches an
answer, because they are merely a repetition : of some uninteresting
doctrine or other and contain no thought : at all.
: I have to congratulate you, though, Bill. You wouldn't : know a
logical argument if it bit you on the balls. Such : a persistent lack
of function in the face of repeated : attempts to assist you in
learning (which I have seen : in this forum and others in the past)
speaks of a talent : that goes well beyond my own, meager abilities.
I just don't : seem to have that capacity for ignoring outside
influences.
: Dean Kaflowitz
Dean,
Re-read your comments, do you think that merely characterizing an
argument is the same as refuting it? Do you think that ad hominum
attacks are sufficient to make any point other than you disapproval of
me? Do you have any contribution to make at all?
Bill
From: cmk#athena.mit.edu (Charles M Kozierok) Subject: Re: Jack Morris
In article <1993Apr19.024222.11181#newshub.ariel.yorku.ca> cs902043#ariel.yorku.ca (SHAWN LUDDINGTON) writes: } In article <1993Apr18.032345.5178#cs.cornell.edu> tedward#cs.cornell.edu (Edward [Ted] Fischer) writes: } >In article <1993Apr18.030412.1210#mnemosyne.cs.du.edu> gspira#nyx.cs.du.edu (Greg Spira) writes: } >>Howard_Wong#mindlink.bc.ca (Howard Wong) writes: }
>> } >>>Has Jack lost a bit of his edge? What is the worst start Jack Morris has had? } >> } >>Uh, Jack lost his edge about 5 years ago, and has had only one above } >>average year in the last 5. } > } >Again goes to prove that it is better to be good than lucky. You can }
>count on good tomorrow. Lucky seems to be prone to bad starts (and a } >bad finish last year :-). } > } >(Yes, I am enjoying every last run he gives up. Who was it who said } >Morris was a better signing than Viola?) } } Hey Valentine, I don't see Boston with any world series rings on their } fingers.
oooooo. cheap shot. :^)
} Damn, Morris now has three and probably the Hall of Fame in his } future.
who cares? he had two of them before he came to Toronto; and if the Jays had signed Viola instead of Morris, it would have been Frank who won 20 and got the ring. and he would be on his way to 20 this year, too.
} Therefore, I would have to say Toronto easily made the best } signing.
your logic is curious, and spurious.
there is no reason to believe that Viola wouldn't have won as many games had *he* signed with Toronto. when you compare their stupid W-L records, be sure to compare their team's offensive averages too.
now, looking at anything like the Morris-Viola sweepstakes a year later is basically hindsight. but there were plenty of reasons why it should have been apparent that Viola was the better pitcher, based on previous recent years and also based on age (Frank is almost 5 years younger! how many knew that?). people got caught up in the '91 World Series, and then on Morris' 21 wins last year. wins are the stupidest, most misleading statistic in baseball, far worse than RBI or R. that he won 21 just means that the Jays got him a lot of runs.
the only really valid retort to Valentine is: weren't the Red Sox trying to get Morris too? oh, sure, they *said* Viola was their first choice afterwards, but what should we have expected they would say?
} And don't tell me Boston will win this year. They won't } even be in the top 4 in the division, more like 6th.
if this is true, it won't be for lack of contribution by Viola, so who cares?
-*- charles
Please suggest whether the error occurs while loading the documents or while reading the files and extracting words. What do I need to do to load the corpus correctly?
NLTK has corpora loading issues.
You can load the useful category data using:
from sklearn.datasets import fetch_20newsgroups
cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
Here, newsgroups_train.target_names gives you the categories.
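As a sketch, the (word list, category) pairs from the question can be rebuilt on top of sklearn's loader (word_tokenize from NLTK is assumed, as in the original code):

import random
from nltk.tokenize import word_tokenize
from sklearn.datasets import fetch_20newsgroups

cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)

# Pair each tokenized document with its human-readable category name
documents = [(word_tokenize(text), newsgroups_train.target_names[label])
             for text, label in zip(newsgroups_train.data, newsgroups_train.target)]
random.shuffle(documents)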

Add a DataFrame column when a condition in a dict is fulfilled

I am trying to add a column to a pandas.DataFrame whenever a string in the DataFrame has one or more words that are keys in a dict. But it gives me an error, and I don't know what went wrong. Could anyone help?
data_frame:
tw_test.head()
tweet
0 living the dream. #cameraman #camera #camerac...
1 justin #trudeau's reasons for thanksgiving. to...
2 #themadape butt…..butt…..we’re allergic to l...
3 2 massive explosions at peace march in #turkey...
4 #mulcair suggests there’s bad blood between hi...
dict:
party = {'#mulcair': 'NDP', '#cdnleft': 'liberal', '#LiberalExpress': 'liberal', '#ThankYouStephenHarper': 'Conservative ', '#pmjt': 'liberal', ...}
My code:
tw_test["party"]=tw_test["tweet"].apply(lambda x: party[x.split(' ')[1].startswith("#")[0]])
I believe your trouble was due to trying to cram too much into a lambda. A function to do the lookup is pretty straightforward:
Code:
party_tags = {
    '#mulcair': 'NDP',
    '#cdnleft': 'liberal',
    '#LiberalExpress': 'liberal',
    '#ThankYouStephenHarper': 'Conservative ',
    '#pmjt': 'liberal',
}

def party(tweet):
    # Return the affiliation of the first known hashtag in the tweet, else None
    for tag in [t for t in tweet.split() if t.startswith('#')]:
        if tag in party_tags:
            return party_tags[tag]
Test Code:
import pandas as pd
tw_test = pd.DataFrame([x.strip() for x in u"""
living the dream. #cameraman #camera #camerac
justin #trudeau's reasons for thanksgiving. to
#themadape butt…..butt…..we’re allergic to
2 massive explosions at peace march in #turkey
#mulcair suggests there’s bad blood between
""".split('\n')[1:-1]], columns=['tweet'])
tw_test["party"] = tw_test["tweet"].apply(party)
print(tw_test)
Results:
tweet party
0 living the dream. #cameraman #camera #camerac None
1 justin #trudeau's reasons for thanksgiving. to None
2 #themadape butt…..butt…..we’re allergic to None
3 2 massive explosions at peace march in #turkey None
4 #mulcair suggests there’s bad blood between NDP
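If None is not wanted for unmatched rows, the function could return a default instead, or the column can be filled afterwards, e.g.:

tw_test["party"] = tw_test["party"].fillna("unknown")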
