I have the function below, which performs sentiment analysis on a phrase and returns a tuple (sentiment, probability from the Naive Bayes classifier), such as (sadness, 0.78).
I want to apply this function to a pandas DataFrame column, df.Message, and then create two additional columns, df.Sentiment and df.Prob.
The code is below:
def avalia(teste):
    testeStemming = []
    stemmer = nltk.stem.RSLPStemmer()
    for (palavras_treinamento) in teste.split():
        comStem = [p for p in palavras_treinamento.split()]
        testeStemming.append(str(stemmer.stem(comStem[0])))
    novo = extrator_palavras(testeStemming)
    distribuicao = classificador.prob_classify(novo)
    classe_array = [(classe, (distribuicao.prob(classe))) for classe in distribuicao.samples()]
    inverse = [(value, key) for key, value in classe_array]
    max_key = max(inverse)[1]
    for each in classe_array:
        if each[0] == max_key:
            a=each[0] # returns the sentiment
            b=each[1] # returns the probability
            #print(each)
    return a, b
Example on a single string:
avalia('i am sad today!')
(sadness, 0.98)
Now I have a DataFrame with 13k rows and one column: Message.
I can apply my function to the DataFrame column and get a pandas Series like this:
0 (surpresa, 0.27992165905522154)
1 (medo, 0.5632686358414051)
2 (surpresa, 0.2799216590552195)
3 (alegria, 0.5429940754962914)
I want to use this information to create two new columns in the same DataFrame, like below.
   Message     Sentiment  Probability
0  I am sad    surpresa   0.2799
1  I am happy  medo       0.56
I can't get this last part done. Any help, please?
Return both values at the end of the function. apply() then gives a Series of (sentiment, probability) tuples, which you can unpack into two separate columns with zip():
def avalia(teste):
    testeStemming = []
    stemmer = nltk.stem.RSLPStemmer()
    for (palavras_treinamento) in teste.split():
        comStem = [p for p in palavras_treinamento.split()]
        testeStemming.append(str(stemmer.stem(comStem[0])))
    novo = extrator_palavras(testeStemming)
    distribuicao = classificador.prob_classify(novo)
    classe_array = [(classe, (distribuicao.prob(classe))) for classe in distribuicao.samples()]
    inverse = [(value, key) for key, value in classe_array]
    max_key = max(inverse)[1]
    for each in classe_array:
        if each[0] == max_key:
            a=each[0] # returns the sentiment
            b=each[1] # returns the probability
    return a, b
# zip(*...) turns the Series of tuples into one sequence per column
df['Sentiment'], df['Prob'] = zip(*df['Message'].apply(avalia))
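As a minimal sketch of the same idea, here is one way to expand a Series of tuples into named columns. The function avalia_stub and its return values are made up for illustration; it only stands in for the real avalia(), which needs the trained classifier.

```python
import pandas as pd


def avalia_stub(text):
    """Hypothetical stand-in for avalia(): returns (label, probability)."""
    return ("sadness", 0.9) if "sad" in text else ("joy", 0.6)


df = pd.DataFrame({"Message": ["I am sad", "I am happy"]})

# apply() yields one (label, probability) tuple per row
pairs = df["Message"].apply(avalia_stub)

# Expand the tuples into a two-column frame that shares df's index
parts = pd.DataFrame(pairs.tolist(), index=df.index, columns=["Sentiment", "Prob"])
df = df.join(parts)
print(df)
```

Building the helper frame with the original index keeps the new columns aligned with their rows even if df does not use a default 0..n-1 index.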
Related
I have a dataset of pre-processed online reviews; each row contains the words from one review. I am running Latent Dirichlet Allocation to extract topics from the entire DataFrame. Now I want to assign topics to each row of data using the LDA model's get_document_topics function.
I found some code, but it only prints the probability of a single document being assigned to each topic. I'm trying to run it over all documents and write the results back to the same dataset. Here's the code I found...
text = ["user"]
bow = dictionary.doc2bow(text)
print("get_document_topics", model.get_document_topics(bow))
### get_document_topics [(0, 0.74568415806946331), (1, 0.25431584193053675)]
Here's what I'm trying to get...
   stemming                 probOnTopic1  probOnTopic2  probOnTopic3  topic
0  [bank, water, bank]      0.7           0.3           0.0           0
1  [baseball, rain, track]  0.1           0.8           0.1           1
2  [coin, money, money]     0.9           0.0           0.1           0
3  [vote, elect, bank]      0.2           0.0           0.8           2
Here's the code that I'm working on...
def bow (text):
    return [dictionary.doc2bow(text) in document]
df["probability"] = optimal_model.get_document_topics(bow)
df[['probOnTopic1', 'probOnTopic2', 'probOnTopic3']] = pd.DataFrame(df['probability'].tolist(), index=df.index)
A slightly different approach, @Christabel, that includes your other request for a 0.7 threshold:
import pandas as pd
results = []
# Iterate over each review
for review in df['review']:
    bow = dictionary.doc2bow(review)
    topics = model.get_document_topics(bow)
    #to a dictionary
    topic_dict = {topic[0]: topic[1] for topic in topics}
    #get the prob
    max_topic = max(topic_dict, key=topic_dict.get)
    if topic_dict[max_topic] > 0.7:
        topic = max_topic
    else:
        topic = 0
    topic_dict['topic'] = topic
    results.append(topic_dict)
#to a DF
df_topics = pd.DataFrame(results)
df = df.merge(df_topics, left_index=True, right_index=True)
Is it helpful and working for you?
You can then place this code inside a function and pass the 0.70 threshold as a parameter, so that it can be reused in different use cases.
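A rough sketch of such a function is below. The names topics_to_frame, threshold and fallback are hypothetical, and the input is assumed to be a list of per-document lists of (topic_id, probability) pairs, which is the shape shown in the get_document_topics output above. The sample numbers are invented, and falling back to topic 0 below the threshold simply mirrors the loop above.

```python
import pandas as pd


def topics_to_frame(distributions, num_topics, threshold=0.7, fallback=0):
    """Build one row per document: a probability column per topic plus a label.

    distributions: list of lists of (topic_id, probability) pairs (assumed shape).
    """
    rows = []
    for dist in distributions:
        probs = dict(dist)
        # Topics missing from the distribution are treated as probability 0.0
        row = {f"probOnTopic{t + 1}": probs.get(t, 0.0) for t in range(num_topics)}
        best = max(probs, key=probs.get) if probs else fallback
        # Keep the most likely topic only if it clears the threshold
        row["topic"] = best if probs.get(best, 0.0) > threshold else fallback
        rows.append(row)
    return pd.DataFrame(rows)


# Invented sample data standing in for the model output
sample = [
    [(0, 0.75), (1, 0.25)],
    [(1, 0.8), (2, 0.2)],
    [(0, 0.4), (1, 0.35), (2, 0.25)],
]
df_topics = topics_to_frame(sample, num_topics=3)
print(df_topics)
```

With real data, the list passed in would be built by calling get_document_topics on the bag-of-words of each review, and the resulting frame can be joined back to the original DataFrame on the index.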
One option is to create a new column in your DataFrame and then iterate over each row. You can use the get_document_topics function to get the topic distribution for each row and then assign the most likely topic to that row.
df['topic'] = None
for index, row in df.iterrows():
    text = row['review_text']
    bow = dictionary.doc2bow(text)
    topic_distribution = model.get_document_topics(bow)
    # max() returns a (topic_id, probability) pair; keep only the topic id
    most_likely_topic = max(topic_distribution, key=lambda x: x[1])[0]
    df.at[index, 'topic'] = most_likely_topic
Is it helpful?
I have 10 features in my DataFrame. I applied a chi-square test and generated the p-values for all the column pairs in the DataFrame. I want to represent the p-values as a cross-grid of the features.
Example: A, B, C are my features, and the p-values are (A,B) = 0.0001, (A,C) = 0.5, (B,C) = 0.0
So I want to see this as:
   A       B       C
A  1       0.0001  0.5
B  0.0001  1       0.0
C  0.5     0.0     1
If any other details are needed, please let me know.
Assuming you have the list of features as features = ['A','B','C',...] and the p-values as
p_values = {('A','B'):0.0001,('A','C'):0.5,...}
import pandas as pd
p_values = {('A','B'):0.0001,('A','C'):0.5}
features = ['A','B','C']
df = pd.DataFrame(columns=features)
for row in features:
    rowdf = [] # prepare a row for df
    for col in features:
        if row == col:
            rowdf.append(1) # (A,A) taken as 1
            continue
        try:
            rowdf.append(p_values[(row,col)]) # add the value from dictionary
        except KeyError:
            try:
                rowdf.append(p_values[(col, row)]) # look for pair like (B,A) if (A,B) not found
            except KeyError: # still not found, append None
                rowdf.append(None)
    df.loc[len(df)] = rowdf # write row in df
df.index = features # to make row names as A,B,C ...
print(df)
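An alternative sketch, under the same assumptions about features and p_values, is to start from an empty square grid and write each pair into both mirrored cells. The (B,C) value here is taken from the example in the question; the variable name grid is just illustrative.

```python
import pandas as pd

features = ["A", "B", "C"]
p_values = {("A", "B"): 0.0001, ("A", "C"): 0.5, ("B", "C"): 0.0}

# Start with a square grid of NaN so missing pairs stay visibly empty
grid = pd.DataFrame(float("nan"), index=features, columns=features)

# Write every p-value into both (row, col) and (col, row)
for (first, second), p in p_values.items():
    grid.loc[first, second] = p
    grid.loc[second, first] = p

# A feature compared with itself is taken as 1, as in the answer above
for name in features:
    grid.loc[name, name] = 1.0

print(grid)
```

This avoids the nested try/except lookups, at the cost of assuming every key in p_values names two features that are present in the list.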
How to apply stemming to a pandas DataFrame column
I am using this function for stemming, and it works perfectly on a string:
xx='kenichan dived times ball managed save 50 rest'
def make_to_base(x):
    x_list = []
    doc = nlp(x)
    for token in doc:
        lemma=str(token.lemma_)
        if lemma=='-PRON-' or lemma=='be':
            lemma=token.text
        x_list.append(lemma)
    print(" ".join(x_list))
make_to_base(xx)
But when I apply this function to my pandas DataFrame column, it does not work, and it does not raise any error either:
x = list(df['text']) #my df column
x = str(x)#converting into string otherwise it is giving error
make_to_base(x)
I've tried different things, but nothing works. For example:
df["texts"] = df.text.apply(lambda x: make_to_base(x))
make_to_base(df['text'])
My dataset looks like this:
df['text'].head()
Out[17]:
0 Hope you are having a good week. Just checking in
1 K..give back my thanks.
2 Am also doing in cbe only. But have to pay.
3 complimentary 4 STAR Ibiza Holiday or £10,000 ...
4 okmail: Dear Dave this is your final notice to...
Name: text, dtype: object
You need to actually return the value computed inside the make_to_base function instead of printing it. Use
def make_to_base(x):
    x_list = []
    for token in nlp(x):
        lemma=str(token.lemma_)
        if lemma=='-PRON-' or lemma=='be':
            lemma=token.text
        x_list.append(lemma)
    return " ".join(x_list)
Then, use
df['texts'] = df['text'].apply(lambda x: make_to_base(x))
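To see why the return matters, here is a small sketch with two made-up functions that do not use spaCy at all: one prints its result and one returns it. A function without a return statement returns None, so apply() fills the new column with None.

```python
import pandas as pd


def upper_print(text):
    # Prints the result but implicitly returns None
    print(text.upper())


def upper_return(text):
    # Returns the result so apply() can collect it
    return text.upper()


df = pd.DataFrame({"text": ["hello", "world"]})
df["printed"] = df["text"].apply(upper_print)
df["returned"] = df["text"].apply(upper_return)
print(df)
```

The "printed" column ends up empty even though the text appears on screen, which matches the "not working, but no error" behaviour described in the question.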
I want to do a t-test for the means of hourly wages of male and female staff.
`df1 = df[["gender","hourly_wage"]] #creating a sub-dataframe with only the columns of gender and hourly wage
staff_wages=df1.groupby(['gender']).mean() #grouping the data frame by gender and assigning it to a new variable 'staff_wages'
staff_wages.head()`
Truth is, I think I got confused halfway. I wanted to do a t-test, so I wrote this code:
`mean_val_salary_female = df1[staff_wages['gender'] == 'female'].mean()
mean_val_salary_female = df1[staff_wages['gender'] == 'male'].mean()
t_val, p_val = stats.ttest_ind(mean_val_salary_female, mean_val_salary_male)
# obtain a one-tail p-value
p_val /= 2
print(f"t-value: {t_val}, p-value: {p_val}")`
It only returns errors.
I went a bit crazy trying different things...
`#married_vs_dependents = df[['married', 'num_dependents', 'years_in_employment']]
#married_vs_dependents = df[['married', 'num_dependents', 'years_in_employment']]
#married_vs_dependents.head()
#my_data = df(married_vs_dependents)
#my_data.groupby('married').mean()
mean_gender = df.groupby("gender")["hourly_wage"].mean()
married_vs_dependents.head()
mean_gender.groupby('gender').mean()
mean_val_salary_female = df[staff_wages['gender'] == 'female'].mean()
mean_val_salary_female = df[staff_wages['gender'] == 'male'].mean()
#cat1 = mean_gender['male']==['cat1']
#cat2 = mean_gender['female']==['cat2']
ttest_ind(cat1['gender'], cat2['hourly_wage'])`
Could someone please guide me to the right steps to take?
You're passing the mean value of each group as the a and b parameters, which is why the error is raised. Instead, you should pass arrays of the individual observations, as stated in the documentation.
from scipy import stats
df1 = df[["gender","hourly_wage"]]
m = df1.loc[df1["gender"].eq("male")]["hourly_wage"].to_numpy()
f = df1.loc[df1["gender"].eq("female")]["hourly_wage"].to_numpy()
stats.ttest_ind(m,f)
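A self-contained sketch of the same call is below. The wage figures are invented purely so the example runs; only the column names gender and hourly_wage come from the question.

```python
import pandas as pd
from scipy import stats

# Invented sample data, for illustration only
df = pd.DataFrame({
    "gender": ["male"] * 5 + ["female"] * 5,
    "hourly_wage": [20.0, 22.0, 25.0, 30.0, 28.0, 18.0, 19.0, 21.0, 20.0, 22.0],
})

# Pass the raw observations of each group, not their means
male = df.loc[df["gender"] == "male", "hourly_wage"].to_numpy()
female = df.loc[df["gender"] == "female", "hourly_wage"].to_numpy()

result = stats.ttest_ind(male, female)
t_val, p_val = result.statistic, result.pvalue
print(f"t-value: {t_val}, two-sided p-value: {p_val}")
```

The p-value returned by default is two-sided. Halving it, as in the question, only gives a valid one-tailed value when the sign of the t-statistic agrees with the direction of the hypothesis being tested.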
I have a for loop that looks like this, but it takes a long time to run when a large dataset is passed into it.
for i in range(0,len(data_sim.index)):
    for j in range(1,len(data_sim.columns)):
        user = data_sim.index[i]
        activity = data_sim.columns[j]
        if dt_full.loc[i][j] != 0:
            data_sim.loc[i][j] = 0
        else:
            activity_top_names = data_neighbours.loc[activity][1:dt_length]
            activity_top_sims = data_corr.loc[activity].sort_values(ascending=False)[1:dt_length]
            user_purchases = data_activity.loc[user,activity_top_names]
            data_sim.loc[i][j] = getScore(user_purchases,activity_top_sims)
In the for loop, data_sim looks like this:
CustomerId  A    B    C    D    E
1           NAs  NAs  NAs  NAs  NAs
2           ..
I tried to reproduce the same process with apply, using a function that looks like this:
def test(cell):
    user = cell.index
    activity = cell
    activity_top_names = data_neighbours.loc[activity][1:dt_length]
    activity_top_sims = data_corr.loc[activity].sort_values(ascending=False)[1:dt_length]
    user_purchase = data_activity_index.loc[user, activity_top_names]
    if dt_full.loc[user][activity] != 0:
        return cell.replace(cell, 0)
    else:
        re = getScore(user_purchase, activity_top_sims)
        return cell.replace(cell, re)
For the function, data_sim2 looks like this. I set the 'CustomerId' column as the index and copied the activity name into every cell of each activity column.
CustomerId(Index)  A  B  C  D  E
1                  A  B  C  D  E
2                  A  B  C  D  E
Inside the function test(cell), if the cell is data_sim2[1][0], I expect:
cell.index = 1 # userId
cell # activity name
The whole idea of this for loop is to fill the data_sim table with scores based on the position of each cell. I used the same idea when writing the function: it performs the same calculation for each cell, and I then applied it to the data table:
data_test = data_sim2.apply(lambda x: test(x))
It gave me an error that said
"sort_values() missing 1 required positional argument: 'by'"
which is odd, because this issue did not occur inside the for loop. It sounds like data_corr.loc[activity] is still a DataFrame instead of a Series.
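That suspicion can be checked with a small sketch. DataFrame.apply passes a whole column (a Series) to the function on each call, not a single cell, so .loc receives a list of labels and returns a DataFrame. The tiny tables and the inspect helper below are invented for illustration; only the names data_corr and data_sim2 come from the question.

```python
import pandas as pd

# Invented miniature versions of the tables in the question
data_corr = pd.DataFrame(
    [[1.0, 0.2], [0.2, 1.0]], index=["A", "B"], columns=["A", "B"]
)
data_sim2 = pd.DataFrame({"A": ["A", "A"], "B": ["B", "B"]}, index=[1, 2])

seen = []


def inspect(cell):
    # Record what apply() actually hands to the function
    seen.append(type(cell).__name__)
    return cell


data_sim2.apply(inspect)
print(seen)  # every entry is a Series, i.e. a whole column

# A single label gives a Series, which sorts without extra arguments
row = data_corr.loc["A"]
print(type(row).__name__, row.sort_values(ascending=False).index.tolist())

# A Series of labels gives a DataFrame, whose sort_values() needs 'by'
block = data_corr.loc[data_sim2["A"]]
print(type(block).__name__)
```

Under this reading, the function would need to work on one column at a time, or be called per cell with the user and activity passed in explicitly, rather than assuming that cell is a single value.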