What I want to do is: I have several columns, and I want to check whether each column's string appears inside the string of another column. Here is an example.
Here is my df.head():
Check index reviews summary output1 output2 ... output35 output36 output37 output38 output39 output40
0 True 1 After realizing my old mascara had a petroleum... Output: Quality: love the wand (positive); Len... Quality: love the wand (positive) Lengthening: lengthens my lashes well (positive) ... None None None None None None
1 True 2 Best mascara I’ve ever used. Makes my non-exis... Output: Makes lashes visible (positive); Non-e... Makes lashes visible (positive) Non-existent Asian lashes (negative) ... None None None None None None
2 True 3 I've never had a mascara that made my lashes l... Output: Look: long lashes (positive) Look: long lashes (positive) None ... None None None None None None
3 True 4 It is clump and smudge-free with an awesome la... Output: Clump-free (positive); Smudge-proof (p... Clump-free (positive) Smudge-proof (positive) ... None None None None None None
4 True 5 And I’m going to buy it again, and again, and ... Output: Quality: impressed by a mascara before... Quality: impressed by a mascara before (positive) Extends Lashes: huge impact (positive) ... None None None None None None
What I want to do is check whether the values of all the output columns (except columns with N/A in them) appear inside the string in the reviews column.
Here is an example with the reviews column and the output1 column value:
Reviews = "After realizing my old mascara had a petroleum by-product in it, I needed a new one I felt good about putting on my beautiful lashes. I needed one that had a clean formula, produced by a company making more environmentally sustainable efforts. I was so excited to receive this mascara after ordering it, and it did not disappoint. I love natural, clean beauty products. I love the wand on this mascara, and it lengthens my lashes very well. I like using mascara because I love my lashes, and I don't want to use falsies or extensions. This mascara takes me from girlboss to goddess in less than 2 minutes. It's so easy to use, non-messy, and dries fast. Even if you do or don't feel like doing a full-face of makeup, this mascara will upgrade your look. It is a must-have and worth every cent."
Output1 = "Quality: love the wand (positive)"
I want to check whether the value "love the wand" is inside the reviews column string.
import pandas as pd

# Read your DataFrame
df = pd.read_excel("ilia.xlsx")
df.insert(0, "Check", True)

# Split the values in the 'summary' column and create new output columns
split_values = df['summary'].str.split(";", expand=True)
for i in range(split_values.shape[1]):
    df[f'output{i+1}'] = split_values[i]

# Strip leading whitespace from all string columns
df = df.apply(lambda x: x.str.lstrip() if x.dtype == "object" else x)
# Drop the "Output: " prefix from the first split column
df["output1"] = df["output1"].str.replace("Output: ", "")

# Flag rows where any summary fragment appears verbatim in the review text
df["Check"] = df.apply(lambda x: any(i in x['reviews'] for i in x['summary'].split(";")), axis=1)
print(df.head())
df.to_excel("foo.xlsx", index=False)
Seems like this is what you are looking for:
df['reviews'].str.contains('love the wand').any()
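Note that this returns a single boolean for the whole column: whether any review contains the phrase. If the goal is the per-row check described in the question, testing each output column against that row's review, something like the sketch below could work. The regex that strips the trailing "(positive)"/"(negative)" tag and the leading "Quality:" label is my assumption based on the sample data:
import re

# all columns named output1..outputN
output_cols = [c for c in df.columns if c.startswith("output")]

def row_check(row):
    # True only if every non-N/A output phrase occurs in this row's review
    for col in output_cols:
        val = row[col]
        if pd.isna(val):
            continue  # skip N/A cells, as the question asks
        phrase = re.sub(r"\s*\(\w+\)\s*$", "", str(val))  # drop trailing "(positive)"
        phrase = phrase.split(":", 1)[-1].strip()          # drop a leading "Quality:" label
        if phrase and phrase not in row["reviews"]:
            return False
    return True

df["Check"] = df.apply(row_check, axis=1)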
I am looking for sentiment analysis code with at least 80% accuracy. I tried VADER and found it easy and usable, but it was only giving an accuracy of 64%.
Now I was looking at some BERT models, and I noticed they need to be re-trained? Is that correct? Isn't BERT pre-trained? Is re-training necessary?
You can use pre-trained models from HuggingFace. There are plenty to choose from. Search for emotion or sentiment models.
Here is an example of a model with 26 emotions. The current implementation works but is very slow for large datasets.
import pandas as pd
from transformers import RobertaTokenizerFast, TFRobertaForSequenceClassification, pipeline

# Load the model and tokenizer once and hand them to the pipeline
# (avoids downloading the weights a second time)
tokenizer = RobertaTokenizerFast.from_pretrained("arpanghoshal/EmoRoBERTa")
model = TFRobertaForSequenceClassification.from_pretrained("arpanghoshal/EmoRoBERTa")
emotion = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
# example data
DATA_URI = "https://github.com/AFAgarap/ecommerce-reviews-analysis/raw/master/Womens%20Clothing%20E-Commerce%20Reviews.csv"
dataf = pd.read_csv(DATA_URI, usecols=["Review Text",])
# This is super slow, I will find a better optimization ASAP
dataf = (dataf
         .head(50)  # comment this out for the whole dataset
         .assign(Emotion=lambda d: (d["Review Text"]
                                    .fillna("")
                                    .map(lambda x: emotion(x)[0].get("label", None))
                                    ),
                 )
         )
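A further speed-up worth trying (a sketch; it assumes a transformers version whose pipelines accept a list of texts plus batch_size and truncation arguments, which recent releases do): call the pipeline once over the whole column so it can batch the forward passes instead of running one text at a time.
# call the pipeline once on the full list so it can batch internally
texts = dataf["Review Text"].fillna("").tolist()
results = emotion(texts, batch_size=32, truncation=True)
dataf["Emotion"] = [r.get("label") for r in results]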
We could also refactor it a bit
...
# a bit faster than the previous but still slow
def emotion_func(text: str) -> str:
    if not text:
        return None
    return emotion(text)[0].get("label", None)

dataf = (dataf
         .head(50)  # comment this out for the whole dataset
         # fillna("") keeps missing reviews from reaching the pipeline
         .assign(Emotion=lambda d: d["Review Text"].fillna("").map(emotion_func))
         )
Results:
Review Text Emotion
0 Absolutely wonderful - silky and sexy and comf... admiration
1 Love this dress! it's sooo pretty. i happene... love
2 I had such high hopes for this dress and reall... fear
3 I love, love, love this jumpsuit. it's fun, fl... love
...
6 I aded this in my basket at hte last mintue to... admiration
7 I ordered this in carbon for store pick up, an... neutral
8 I love this dress. i usually get an xs but it ... love
9 I'm 5"5' and 125 lbs. i ordered the s petite t... love
...
16 Material and color is nice. the leg opening i... neutral
17 Took a chance on this blouse and so glad i did... admiration
...
26 I have been waiting for this sweater coat to s... excitement
27 The colors weren't what i expected either. the... disapproval
...
31 I never would have given these pants a second ... love
32 These pants are even better in person. the onl... disapproval
33 I ordered this 3 months ago, and it finally ca... disappointment
34 This is such a neat dress. the color is great ... admiration
35 Wouldn't have given them a second look but tri... love
36 This is a comfortable skirt that can span seas... approval
...
40 Pretty and unique. great with jeans or i have ... admiration
41 This is a beautiful top. it's unique and not s... admiration
42 This poncho is so cute i love the plaid check ... love
43 First, this is thermal ,so naturally i didn't ... love
You can use pickle.
Pickle lets you... well, pickle your model for later use. In fact, you can use a loop to keep training the model until it reaches a certain accuracy, then exit the loop and pickle the model for later use.
You can find many tutorials on YouTube on how to pickle a model.
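A minimal sketch of the save/load round trip (assuming a model object that pickle can serialize, e.g. a scikit-learn estimator; deep-learning models usually ship their own save methods):
import pickle

# save the trained model to disk for later use
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# load it back in a later session
with open("model.pkl", "rb") as f:
    model = pickle.load(f)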
I'm new to sentiment analysis and I'm exploring with TextBlob.
My data is pre-processed Twitter data. It's in a series and each tweet has been cleaned and tokenized:
0 [new, leaked, treasury, document, full, sugges...
1 [tommy, robinson, endorsing, conservative, for...
2 [thanks, already, watched, catch, tv, morning, ]
3 [treasury, document, check, today, check, cons...
4 [utterly, stunning, video, hoped, prayed, woul...
... ...
307370 [trump, disciple, copycat]
307373 [disgusting]
307389 [wonder, people, vote, racist, homophobe, like...
307391 [gary, neville, slam, fuelling, racism, manche...
307393 [brexit, fault, excuseforeverything]
When I run TextBlob sentiment (using help from Apply textblob in for each row of a dataframe), my result is a column of None values:
# Create sentiment column using TextBlob
# Source: https://stackoverflow.com/questions/43485469/apply-textblob-in-for-each-row-of-a-dataframe
def sentiment_calc(text):
    try:
        return TextBlob(text).sentiment
    except:
        return None

boris_data['sentiment'] = boris_data['text'].apply(sentiment_calc)
text sentiment
0 [new, leaked, treasury, document, full, sugges... None
1 [tommy, robinson, endorsing, conservative, for... None
2 [thanks, already, watched, catch, tv, morning, ] None
3 [treasury, document, check, today, check, cons... None
4 [utterly, stunning, video, hoped, prayed, woul... None
... ... ...
307370 [trump, disciple, copycat] None
307373 [disgusting] None
307389 [wonder, people, vote, racist, homophobe, like... None
307391 [gary, neville, slam, fuelling, racism, manche... None
307393 [brexit, fault, excuseforeverything] None
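The likely cause (my reading of the output above): TextBlob expects a raw string, so passing a token list raises a TypeError, which the bare except silently turns into None. A sketch of a fix, assuming boris_data['text'] holds lists of tokens:
from textblob import TextBlob

# join the tokens back into a single string before handing it to TextBlob
boris_data['sentiment'] = boris_data['text'].apply(
    lambda tokens: TextBlob(' '.join(tokens)).sentiment
)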
I need some help with running a filter on some data. I have a data set made up of text, and I also have a list of words. I would like to filter each row of my data so that the remaining text in each row is made up of only words in the list object.
words
('cell', 'CDKs', 'lung', 'mutations', 'monomeric', 'Casitas', 'Background', 'acquired', 'evidence', 'kinases', 'small', 'evidence', 'Oncogenic')
data
ID Text
0 Cyclin-dependent kinases CDKs regulate a
1 Abstract Background Non-small cell lung
2 Abstract Background Non-small cell lung
3 Recent evidence has demonstrated that acquired
4 Oncogenic mutations in the monomeric Casitas
So after my filter, I would like the data frame to look like this:
data
ID Text
0 kinases CDKs
1 Background cell lung
2 Background small cell lung
3 evidence acquired
4 Oncogenic mutations monomeric Casitas
I tried using iloc and similar functions, but I don't seem to get it. Any help with that?
You can simply use apply() along with a simple list comprehension:
>>> df['Text'].apply(lambda x: ' '.join([i for i in x.split() if i in words]))
0 kinases CDKs
1 Background cell lung
2 Background cell lung
3 evidence acquired
4 Oncogenic mutations monomeric Casitas
Also, I made words a set to improve performance (O(1) average lookup time); I recommend you do the same.
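For example, built from the words in the question (a set literal also drops the duplicate 'evidence' automatically):
words = {'cell', 'CDKs', 'lung', 'mutations', 'monomeric', 'Casitas',
         'Background', 'acquired', 'evidence', 'kinases', 'small', 'Oncogenic'}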
I'm not certain this is the most elegant of solutions, but you could do:
to_remove = ['foo', 'bar']
df = pd.DataFrame({'Text': [
    'spam foo& eggs',
    'foo bar eggs bacon and lettuce',
    'spam and foo eggs'
]})
# regex=True is needed on recent pandas for '|' to act as alternation
df['Text'].str.replace('|'.join(to_remove), '', regex=True)
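One caveat I'd add: plain alternation also matches inside longer words ('foo' would match inside 'food'). Escaping each word and anchoring it at word boundaries avoids that:
import re

# escape each word and anchor it at word boundaries
pattern = '|'.join(r'\b{}\b'.format(re.escape(w)) for w in to_remove)
df['Text'].str.replace(pattern, '', regex=True)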
I'm attempting a similar operation as shown here.
I begin by reading in two columns from a CSV file that contains 2405 rows, in the format: Year, e.g. "1995", and cleaned, e.g. ["this", "is", "exemplar", "document", "contents"]; both columns use strings as data types.
import pandas
df = pandas.read_csv("ukgovClean.csv", encoding='utf-8', usecols=[0,2])
I have already pre-cleaned the data, and below shows the format of the top 4 rows:
[IN] df.head()
[OUT] Year cleaned
0 1909 acquaint hous receiv follow letter clerk crown...
1 1909 ask secretari state war whether issu statement...
2 1909 i beg present petit sign upward motor car driv...
3 1909 i desir ask secretari state war second lieuten...
4 1909 ask secretari state war whether would introduc...
[IN] df['cleaned'].head()
[OUT] 0 acquaint hous receiv follow letter clerk crown...
1 ask secretari state war whether issu statement...
2 i beg present petit sign upward motor car driv...
3 i desir ask secretari state war second lieuten...
4 ask secretari state war whether would introduc...
Name: cleaned, dtype: object
Then I initialise the TfidfVectorizer:
[IN] from sklearn.feature_extraction.text import TfidfVectorizer
[IN] v = TfidfVectorizer(decode_error='replace', encoding='utf-8')
Following this, calling upon the below line results in:
[IN] x = v.fit_transform(df['cleaned'])
[OUT] ValueError: np.nan is an invalid document, expected byte or unicode string.
I overcame this using the solution in the aforementioned thread:
[IN] x = v.fit_transform(df['cleaned'].values.astype('U'))
However, this resulted in a MemoryError (Full Traceback).
I've attempted to look into storing things with Pickle to circumvent the mass memory usage, but I'm not sure how it would fit into this scenario. Any tips would be much appreciated, and thanks for reading.
[UPDATE]
#pittsburgh137 posted a solution to a similar problem involving fitting data here, in which the training data is generated using pandas.get_dummies(). What I've done with this is:
[IN] train_X = pandas.get_dummies(df['cleaned'])
[IN] train_X.shape
[OUT] (2405, 2380)
[IN] x = v.fit_transform(train_X)
[IN] type(x)
[OUT] scipy.sparse.csr.csr_matrix
I thought I should update any readers while I see what I can do with this development. If there are any predicted pitfalls with this method, I'd love to hear them.
I believe it's the conversion to dtype('<Unn') that might be giving you trouble: an object array only stores pointers to the strings, while a '<U' array allocates fixed-width storage, sized to the longest string, for every element. Check out the size of the array on a relative basis, using just the first few documents plus an NaN:
>>> df['cleaned'].values
array(['acquaint hous receiv follow letter clerk crown',
'ask secretari state war whether issu statement',
'i beg present petit sign upward motor car driv',
'i desir ask secretari state war second lieuten',
'ask secretari state war whether would introduc', nan],
dtype=object)
>>> df['cleaned'].values.astype('U').nbytes
1104
>>> df['cleaned'].values.nbytes
48
It seems like it would make sense to drop the NaN values first (df.dropna(inplace=True)). Then, it should be pretty efficient to call v.fit_transform(df['cleaned'].tolist()).
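Concretely, that would look something like this (a sketch; subset=['cleaned'] is my assumption so that NaNs in other columns are left alone):
# drop rows whose 'cleaned' text is missing, then fit on a plain list of strings
df.dropna(subset=['cleaned'], inplace=True)
x = v.fit_transform(df['cleaned'].tolist())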
I am trying to add a column to a pandas.DataFrame if the string in the DataFrame has one or more words that are a key in a dict. But it gives me an error, and I don't know what went wrong. Could anyone help?
data_frame:
tw_test.head()
tweet
0 living the dream. #cameraman #camera #camerac...
1 justin #trudeau's reasons for thanksgiving. to...
2 #themadape butt…..butt…..we’re allergic to l...
3 2 massive explosions at peace march in #turkey...
4 #mulcair suggests there’s bad blood between hi...
dict:
party = {'#mulcair': 'NDP', '#cdnleft': 'liberal', '#LiberalExpress': 'liberal', '#ThankYouStephenHarper': 'Conservative ', '#pmjt': 'liberal', ...}
My code:
tw_test["party"]=tw_test["tweet"].apply(lambda x: party[x.split(' ')[1].startswith("#")[0]])
I believe your trouble was due to trying to cram too much into a lambda. A function to do the lookup is pretty straightforward:
Code:
party_tags = {
    '#mulcair': 'NDP',
    '#cdnleft': 'liberal',
    '#LiberalExpress': 'liberal',
    '#ThankYouStephenHarper': 'Conservative ',
    '#pmjt': 'liberal'
}

def party(tweet):
    # return the party for the first hashtag found in the lookup dict
    for tag in [t for t in tweet.split() if t.startswith('#')]:
        if tag in party_tags:
            return party_tags[tag]
Test Code:
import pandas as pd
tw_test = pd.DataFrame([x.strip() for x in u"""
living the dream. #cameraman #camera #camerac
justin #trudeau's reasons for thanksgiving. to
#themadape butt…..butt…..we’re allergic to
2 massive explosions at peace march in #turkey
#mulcair suggests there’s bad blood between
""".split('\n')[1:-1]], columns=['tweet'])
tw_test["party"] = tw_test["tweet"].apply(party)
print(tw_test)
Results:
tweet party
0 living the dream. #cameraman #camera #camerac None
1 justin #trudeau's reasons for thanksgiving. to None
2 #themadape butt…..butt…..we’re allergic to None
3 2 massive explosions at peace march in #turkey None
4 #mulcair suggests there’s bad blood between NDP