Applying a function on pandas dataframe to perform a sentiment analysis - python

I have the function below, which performs sentiment analysis on a phrase and returns a tuple (sentiment, NB classifier probability), e.g. (sadness, 0.78).
I want to apply this function to a pandas DataFrame column, df.Message, and then create two more columns, df.Sentiment and df.Prob.
The code is below:
def avalia(teste):
    testeStemming = []
    stemmer = nltk.stem.RSLPStemmer()
    for (palavras_treinamento) in teste.split():
        comStem = [p for p in palavras_treinamento.split()]
        testeStemming.append(str(stemmer.stem(comStem[0])))
    novo = extrator_palavras(testeStemming)
    distribuicao = classificador.prob_classify(novo)
    classe_array = [(classe, (distribuicao.prob(classe))) for classe in distribuicao.samples()]
    inverse = [(value, key) for key, value in classe_array]
    max_key = max(inverse)[1]
    for each in classe_array:
        if each[0] == max_key:
            a=each[0] # returns the sentiment
            b=each[1] # returns the probability
            #print(each)
    return a, b
Example on a single string:
avalia('i am sad today!')
(sadness, 0.98)
Now I have a dataframe with 13k rows and one column: Message.
I can apply my function to the dataframe column and get a pandas Series like:
0 (surpresa, 0.27992165905522154)
1 (medo, 0.5632686358414051)
2 (surpresa, 0.2799216590552195)
3 (alegria, 0.5429940754962914)
I want to use this information to create two new columns in the same dataframe, like below.
   Message     Sentiment  Probability
0  I am sad    surpresa   0.2799
1  I am happy  medo       0.56
I can't get this last part done. Any help, please?

Your function already returns both values, so apply() produces a Series of (sentiment, probability) tuples. Unzip that Series and assign each part to its own column:
def avalia(teste):
    testeStemming = []
    stemmer = nltk.stem.RSLPStemmer()
    for (palavras_treinamento) in teste.split():
        comStem = [p for p in palavras_treinamento.split()]
        testeStemming.append(str(stemmer.stem(comStem[0])))
    novo = extrator_palavras(testeStemming)
    distribuicao = classificador.prob_classify(novo)
    classe_array = [(classe, (distribuicao.prob(classe))) for classe in distribuicao.samples()]
    inverse = [(value, key) for key, value in classe_array]
    max_key = max(inverse)[1]
    for each in classe_array:
        if each[0] == max_key:
            a=each[0] # returns the sentiment
            b=each[1] # returns the probability
    return a, b
# zip(*...) turns the Series of pairs into one tuple of sentiments and one of probabilities
df['Sentiment'], df['Prob'] = zip(*df['Message'].apply(avalia))
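To make the column-splitting step concrete, here is a minimal sketch that runs without NLTK. The function avalia_stub and its return values are hypothetical stand-ins for avalia(); only the pandas part is the point.

```python
import pandas as pd

def avalia_stub(text):
    """Hypothetical stand-in for avalia(): returns (sentiment, probability)."""
    if "sad" in text.lower():
        return ("sadness", 0.98)
    return ("joy", 0.75)

df = pd.DataFrame({"Message": ["I am sad", "I am happy"]})

# apply() yields a Series of tuples; tolist() turns it into a list of pairs,
# which the DataFrame constructor splits into two columns.
pairs = df["Message"].apply(avalia_stub)
df[["Sentiment", "Prob"]] = pd.DataFrame(pairs.tolist(), index=df.index)

print(df)
```

Passing index=df.index keeps the new columns aligned with the original rows even if the dataframe does not have a default 0..n-1 index.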

Related

Iterate function across dataframe

I have a dataset of pre-processed online reviews; each row contains the words from one review. I am running Latent Dirichlet Allocation (LDA) to extract topics from the entire dataframe. Now I want to assign topics to each row based on the LDA model method get_document_topics.
I found some code, but it only prints the probability of a single document being assigned to each topic. I'm trying to run it over all documents and write the results back to the same dataset. Here's the code I found...
text = ["user"]
bow = dictionary.doc2bow(text)
print("get_document_topics", model.get_document_topics(bow))
# get_document_topics [(0, 0.74568415806946331), (1, 0.25431584193053675)]
Here's what I'm trying to get...
   stemming                 probOnTopic1  probOnTopic2  probOnTopic3  topic
0  [bank, water, bank]      0.7           0.3           0.0           0
1  [baseball, rain, track]  0.1           0.8           0.1           1
2  [coin, money, money]     0.9           0.0           0.1           0
3  [vote, elect, bank]      0.2           0.0           0.8           2
Here's the code I'm working on...
def bow (text):
    return [dictionary.doc2bow(text) in document]
df["probability"] = optimal_model.get_document_topics(bow)
df[['probOnTopic1', 'probOnTopic2', 'probOnTopic3']] = pd.DataFrame(df['probability'].tolist(), index=df.index)
A slightly different approach, @Christabel, that includes your other request for a 0.7 threshold:
import pandas as pd
results = []
# Iterate over each review
for review in df['review']:
    bow = dictionary.doc2bow(review)
    topics = model.get_document_topics(bow)
    # Convert the (topic_id, probability) pairs to a dictionary
    topic_dict = {topic[0]: topic[1] for topic in topics}
    # Pick the topic with the highest probability
    max_topic = max(topic_dict, key=topic_dict.get)
    if topic_dict[max_topic] > 0.7:
        topic = max_topic
    else:
        topic = 0
    topic_dict['topic'] = topic
    results.append(topic_dict)
# Build a DataFrame from the results and join it to the original by index
df_topics = pd.DataFrame(results)
df = df.merge(df_topics, left_index=True, right_index=True)
Does this work for you?
You can then put this code inside a function and pass the 0.70 threshold as a parameter, so that it can be reused in different use cases.
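The suggestion to wrap the loop in a function could be sketched as follows. This is only an illustration: assign_topics, get_topics, text_col and default_topic are hypothetical names, and a toy lookup table stands in for the LDA model so that the example runs without gensim.

```python
import pandas as pd

def assign_topics(df, get_topics, text_col="review", threshold=0.7, default_topic=0):
    """Return a copy of df with one probability column per topic plus 'topic'.

    get_topics is a hypothetical callable mapping one document to a list of
    (topic_id, probability) pairs, for example
    lambda doc: model.get_document_topics(dictionary.doc2bow(doc)).
    """
    rows = []
    for doc in df[text_col]:
        probs = dict(get_topics(doc))
        best = max(probs, key=probs.get)
        # Fall back to default_topic when no topic clears the threshold,
        # as the loop version does with topic 0.
        probs["topic"] = best if probs[best] > threshold else default_topic
        rows.append(probs)
    return df.join(pd.DataFrame(rows, index=df.index))


# Toy stand-in for the LDA model (made-up probabilities).
fake_output = {
    "bank": [(0, 0.8), (1, 0.2)],
    "rain": [(0, 0.4), (1, 0.6)],
}
df = pd.DataFrame({"review": [["bank"], ["rain"]]})
out = assign_topics(df, lambda doc: fake_output[doc[0]])
print(out)
```

With a real model, topics that the model omits for a document would show up as NaN in that document's row, so a fillna(0) on the probability columns may be wanted.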
Another option is to create a new column in your DataFrame and then iterate over its rows. Use get_document_topics to get the topic distribution for each row, then assign the most likely topic to that row.
df['topic'] = None
for index, row in df.iterrows():
    text = row['review_text']
    bow = dictionary.doc2bow(text)
    topic_distribution = model.get_document_topics(bow)
    # Each entry is a (topic_id, probability) pair; pick the most probable one
    most_likely_topic = max(topic_distribution, key=lambda x: x[1])
    # Store only the topic id, not the whole pair
    df.at[index, 'topic'] = most_likely_topic[0]
Is this helpful?

Represent a p-value obtained from chi square test between multiple columns in the form of a crosstab in Python

I have 10 features in my dataframe. I applied a chi-square test and generated the p-values for all column pairs in the dataframe. I want to represent the p-values as a cross-grid of the features.
Example: A, B, C are my features and the p-values are (A,B) = 0.0001, (A,C) = 0.5, (B,C) = 0.0
So I want to see this as:
   A       B       C
A  1       0.0001  0.5
B  0.0001  1       0.0
C  0.5     0.0     1
If any other detail is needed, please let me know.
Assuming you have the list of features as features = ['A','B','C',...] and the p-values as a dictionary
p_values = {('A','B'):0.0001,('A','C'):0.5,...}
import pandas as pd
p_values = {('A','B'):0.0001,('A','C'):0.5}
features = ['A','B','C']
df = pd.DataFrame(columns=features)
for row in features:
    rowdf = [] # prepare a row for df
    for col in features:
        if row == col:
            rowdf.append(1) # (A,A) taken as 1
            continue
        try:
            rowdf.append(p_values[(row,col)]) # add the value from dictionary
        except KeyError:
            try:
                rowdf.append(p_values[(col, row)]) # look for pair like (B,A) if (A,B) not found
            except KeyError: # still not found, append None
                rowdf.append(None)
    df.loc[len(df)] = rowdf # write row in df
df.index = features # to make row names as A,B,C ...
print(df)
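The same grid can also be built by starting from an empty square table and writing each p-value into both mirrored cells. This is a sketch of an alternative, not a replacement for the answer above; the variable name grid is arbitrary, and the (B,C) value is taken from the question's example.

```python
import pandas as pd

p_values = {("A", "B"): 0.0001, ("A", "C"): 0.5, ("B", "C"): 0.0}
features = ["A", "B", "C"]

# Start from an empty (NaN-filled) square grid labelled by feature name.
grid = pd.DataFrame(index=features, columns=features, dtype=float)

# Each pair is stored once in the dictionary, so fill both (a, b) and (b, a).
for (a, b), p in p_values.items():
    grid.loc[a, b] = p
    grid.loc[b, a] = p

# A feature compared with itself is taken as 1, as in the question.
for f in features:
    grid.loc[f, f] = 1.0

print(grid)
```

Pairs missing from the dictionary simply stay NaN, which plays the role of the None fallback in the loop version.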

spacy stemming on pandas df column not working

How do I apply stemming to a pandas DataFrame column?
I am using this function for stemming, which works perfectly on a string:
xx='kenichan dived times ball managed save 50 rest'
def make_to_base(x):
    x_list = []
    doc = nlp(x)
    for token in doc:
        lemma=str(token.lemma_)
        if lemma=='-PRON-' or lemma=='be':
            lemma=token.text
        x_list.append(lemma)
    print(" ".join(x_list))
make_to_base(xx)
But when I apply this function to my pandas DataFrame column, it does not work, and it does not raise an error either:
x = list(df['text']) #my df column
x = str(x)#converting into string otherwise it is giving error
make_to_base(x)
I've tried different things, but nothing works, for example:
df["texts"] = df.text.apply(lambda x: make_to_base(x))
make_to_base(df['text'])
My dataset looks like this:
df['text'].head()
Out[17]:
0 Hope you are having a good week. Just checking in
1 K..give back my thanks.
2 Am also doing in cbe only. But have to pay.
3 complimentary 4 STAR Ibiza Holiday or £10,000 ...
4 okmail: Dear Dave this is your final notice to...
Name: text, dtype: object
You need to actually return the value built inside make_to_base instead of only printing it. Use
def make_to_base(x):
    x_list = []
    for token in nlp(x):
        lemma=str(token.lemma_)
        if lemma=='-PRON-' or lemma=='be':
            lemma=token.text
        x_list.append(lemma)
    return " ".join(x_list)
Then, use
df['texts'] = df['text'].apply(lambda x: make_to_base(x))
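The difference between printing and returning can be seen without spaCy. In this small sketch, shout_print and shout_return are hypothetical toy functions standing in for the two versions of make_to_base: apply() stores whatever the function returns, and a function that only prints returns None.

```python
import pandas as pd

def shout_print(text):
    print(text.upper())      # shows the result but implicitly returns None

def shout_return(text):
    return text.upper()      # hands the result back to apply()

df = pd.DataFrame({"text": ["hope you are well", "thanks"]})
df["printed"] = df["text"].apply(shout_print)
df["returned"] = df["text"].apply(shout_return)
print(df)
```

The "printed" column ends up holding only missing values, which matches the symptom in the question: no error, but no usable result either.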

How do I perform a T-test from a dataframe?

I want to do a t-test for the means of hourly wages of male and female staff.
df1 = df[["gender","hourly_wage"]] #creating a sub-dataframe with only the columns of gender and hourly wage
staff_wages=df1.groupby(['gender']).mean() #grouping the data frame by gender and assigning it to a new variable 'staff_wages'
staff_wages.head()
Truth is, I think I got confused halfway. I wanted to do a t-test, so I wrote this code:
mean_val_salary_female = df1[staff_wages['gender'] == 'female'].mean()
mean_val_salary_female = df1[staff_wages['gender'] == 'male'].mean()
t_val, p_val = stats.ttest_ind(mean_val_salary_female, mean_val_salary_male)
# obtain a one-tail p-value
p_val /= 2
print(f"t-value: {t_val}, p-value: {p_val}")
It only returns errors.
I sort of went crazy trying different things...
#married_vs_dependents = df[['married', 'num_dependents', 'years_in_employment']]
#married_vs_dependents = df[['married', 'num_dependents', 'years_in_employment']]
#married_vs_dependents.head()
#my_data = df(married_vs_dependents)
#my_data.groupby('married').mean()
mean_gender = df.groupby("gender")["hourly_wage"].mean()
married_vs_dependents.head()
mean_gender.groupby('gender').mean()
mean_val_salary_female = df[staff_wages['gender'] == 'female'].mean()
mean_val_salary_female = df[staff_wages['gender'] == 'male'].mean()
#cat1 = mean_gender['male']==['cat1']
#cat2 = mean_gender['female']==['cat2']
ttest_ind(cat1['gender'], cat2['hourly_wage'])
Please, who can guide me to the right steps to take?
You're passing the mean value of each group as the a and b parameters, which is why the error is raised. Instead, you should pass arrays of the individual observations, as stated in the documentation.
from scipy import stats
df1 = df[["gender","hourly_wage"]]
# select the hourly wages of each group as NumPy arrays
m = df1.loc[df1["gender"].eq("male"), "hourly_wage"].to_numpy()
f = df1.loc[df1["gender"].eq("female"), "hourly_wage"].to_numpy()
stats.ttest_ind(m,f)
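A self-contained version of the same idea, using made-up wage values purely for illustration (the numbers are not from the question's data), might look like this:

```python
import pandas as pd
from scipy import stats

# Made-up wages purely for illustration.
df = pd.DataFrame({
    "gender": ["male", "male", "male", "male",
               "female", "female", "female", "female"],
    "hourly_wage": [21.0, 25.5, 23.0, 27.5, 20.0, 22.5, 19.0, 24.5],
})

# Pass the raw observations of each group, not their means.
male = df.loc[df["gender"] == "male", "hourly_wage"].to_numpy()
female = df.loc[df["gender"] == "female", "hourly_wage"].to_numpy()

# Two-sided test by default; equal_var=False selects Welch's t-test,
# which does not assume equal variances in the two groups.
t_val, p_val = stats.ttest_ind(male, female, equal_var=False)
print(f"t-value: {t_val:.3f}, two-sided p-value: {p_val:.3f}")
```

For a one-sided test, recent SciPy versions accept an alternative argument ('greater' or 'less') on ttest_ind. Halving the two-sided p-value, as in the question, is only appropriate when the observed difference lies in the hypothesised direction.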

convert for loop to apply function in python to reduce run-time

I have a for-loop that looks like this, but it takes a long time to run when a large dataset is passed into it.
for i in range(0,len(data_sim.index)):
    for j in range(1,len(data_sim.columns)):
        user = data_sim.index[i]
        activity = data_sim.columns[j]
        if dt_full.loc[i][j] != 0:
            data_sim.loc[i][j] = 0
        else:
            activity_top_names = data_neighbours.loc[activity][1:dt_length]
            activity_top_sims = data_corr.loc[activity].sort_values(ascending=False)[1:dt_length]
            user_purchases = data_activity.loc[user,activity_top_names]
            data_sim.loc[i][j] = getScore(user_purchases,activity_top_sims)
In the for-loop, data_sim looks like this:
CustomerId  A    B    C    D    E
1           NAs  NAs  NAs  NAs  NAs
2           ..
I tried to reproduce the same process with apply, using this function:
def test(cell):
    user = cell.index
    activity = cell
    activity_top_names = data_neighbours.loc[activity][1:dt_length]
    activity_top_sims = data_corr.loc[activity].sort_values(ascending=False)[1:dt_length]
    user_purchase = data_activity_index.loc[user, activity_top_names]
    if dt_full.loc[user][activity] != 0:
        return cell.replace(cell, 0)
    else:
        re = getScore(user_purchase, activity_top_sims)
        return cell.replace(cell, re)
For the function, data_sim2 looks like this: I set the 'CustomerId' column as the index and repeated the activity name in each activity column.
CustomerId(Index)  A  B  C  D  E
1                  A  B  C  D  E
2                  A  B  C  D  E
Inside the function test(cell), if the cell is data_sim2[1][0], then
cell.index = 1 # userId
cell # activity name
The whole idea of the for-loop is to fill the 'data_sim' table with scores based on the position of each cell. I used the same idea when writing the function, with the same calculation for each cell, and then applied it to the data table 'data_sim':
data_test = data_sim2.apply(lambda x: test(x))
It gave me an error saying
"sort_values() missing 1 required positional argument: 'by'"
which is odd, because this did not happen inside the for-loop. It sounds like 'data_corr.loc[activity]' is still a DataFrame instead of a Series.
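One plausible explanation, offered as an assumption rather than a confirmed diagnosis: DataFrame.apply passes each whole column to the function as a Series, not one cell at a time. In that case `activity` is a Series of labels, and `.loc` with a Series of labels returns a DataFrame, whose sort_values requires `by`. The toy tables below are illustrative only and just demonstrate those two pandas behaviours.

```python
import pandas as pd

# Toy correlation table and activity table; values are illustrative only.
data_corr = pd.DataFrame(
    [[1.0, 0.2], [0.2, 1.0]], index=["A", "B"], columns=["A", "B"]
)
data_sim2 = pd.DataFrame({"A": ["A", "A"], "B": ["B", "B"]}, index=[1, 2])

# Record what type of object apply() hands to the function.
seen = []
data_sim2.apply(lambda col: seen.append(type(col).__name__))
print(seen)

by_label = data_corr.loc["A"]               # single label: one row as a Series
by_series = data_corr.loc[data_sim2["A"]]   # Series of labels: a DataFrame
print(type(by_label).__name__, type(by_series).__name__)
```

If that is what is happening, the function would need to work on a whole column at a time, or the per-cell logic would need the row and column labels passed in explicitly.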
