I have a dataset of pre-processed online reviews, where each row contains the words from one review. I am running Latent Dirichlet Allocation (LDA) to extract topics from the entire dataframe. Now I want to assign a topic to each row of data using the LDA function get_document_topics.
I found code from a source, but it only prints the probability of a single document being assigned to each topic. I'm trying to iterate it over all documents and write the results back to the same dataframe. Here's the code I found...
text = ["user"]
bow = dictionary.doc2bow(text)
print("get_document_topics", model.get_document_topics(bow))
### get_document_topics [(0, 0.74568415806946331), (1, 0.25431584193053675)]
Here's what I'm trying to get...
   stemming                  probOnTopic1  probOnTopic2  probOnTopic3  topic
0  [bank, water, bank]       0.7           0.3           0.0           0
1  [baseball, rain, track]   0.1           0.8           0.1           1
2  [coin, money, money]      0.9           0.0           0.1           0
3  [vote, elect, bank]       0.2           0.0           0.8           2
Here's the code that I'm working on...
def bow(text):
    return [dictionary.doc2bow(text) in document]

df["probability"] = optimal_model.get_document_topics(bow)
df[['probOnTopic1', 'probOnTopic2', 'probOnTopic3']] = pd.DataFrame(df['probability'].tolist(), index=df.index)
A slightly different approach, @Christabel, that includes your other request with the 0.7 threshold:
import pandas as pd

results = []
# Iterate over each review
for review in df['review']:
    bow = dictionary.doc2bow(review)
    topics = model.get_document_topics(bow)
    # to a dictionary {topic_id: probability}
    topic_dict = {topic[0]: topic[1] for topic in topics}
    # get the most probable topic
    max_topic = max(topic_dict, key=topic_dict.get)
    if topic_dict[max_topic] > 0.7:
        topic = max_topic
    else:
        topic = 0
    topic_dict['topic'] = topic
    results.append(topic_dict)

# to a DataFrame, then merge back into df
df_topics = pd.DataFrame(results)
df = df.merge(df_topics, left_index=True, right_index=True)
Is this helpful and working for you?
You can then place this code inside a function and expose the 0.70 value as an external parameter, so it is reusable in different use cases.
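For example, a minimal sketch of that parameterization (the function name assign_topics and its arguments are illustrative, not from the original code):

import pandas as pd

def assign_topics(df, dictionary, model, text_col='review', threshold=0.7, default_topic=0):
    """Assign the most probable LDA topic to each row; fall back to default_topic
    when no topic probability exceeds the threshold."""
    results = []
    for review in df[text_col]:
        bow = dictionary.doc2bow(review)
        topic_dict = {tid: prob for tid, prob in model.get_document_topics(bow)}
        best = max(topic_dict, key=topic_dict.get)
        topic_dict['topic'] = best if topic_dict[best] > threshold else default_topic
        results.append(topic_dict)
    return df.merge(pd.DataFrame(results), left_index=True, right_index=True)

# e.g. df = assign_topics(df, dictionary, optimal_model, threshold=0.7)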
One possible option is to create a new column in your DataFrame and then iterate over each row. You can use the get_document_topics function to get the topic distribution for each row and then assign the most likely topic to that row.
df['topic'] = None
for index, row in df.iterrows():
    text = row['review_text']
    bow = dictionary.doc2bow(text)
    topic_distribution = model.get_document_topics(bow)
    # max() returns a (topic_id, probability) tuple; keep only the topic id
    most_likely_topic = max(topic_distribution, key=lambda x: x[1])
    df.at[index, 'topic'] = most_likely_topic[0]
Is this helpful?
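If you also want the per-topic probability columns from the desired output above, a sketch along these lines may help (it assumes minimum_probability=0.0 so gensim returns every topic for every document; the column naming is illustrative):

import pandas as pd

def topic_probs(text):
    # return a {topic_id: probability} dict for one tokenized review
    bow = dictionary.doc2bow(text)
    return dict(model.get_document_topics(bow, minimum_probability=0.0))

probs = df['review_text'].apply(topic_probs).apply(pd.Series).fillna(0.0).sort_index(axis=1)
probs.columns = [f'probOnTopic{i + 1}' for i in probs.columns]
df = df.join(probs)
df['topic'] = probs.values.argmax(axis=1)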
Related
I have the function below that performs sentiment analysis on a phrase and returns a tuple (sentiment, NB classifier probability), like (sadness, 0.78).
I want to apply this function to a pandas dataframe column, df.Message, and then create two other columns, df.Sentiment and df.Prob.
The code is below:
def avalia(teste):
    testeStemming = []
    stemmer = nltk.stem.RSLPStemmer()
    for palavras_treinamento in teste.split():
        comStem = [p for p in palavras_treinamento.split()]
        testeStemming.append(str(stemmer.stem(comStem[0])))
    novo = extrator_palavras(testeStemming)
    distribuicao = classificador.prob_classify(novo)
    classe_array = [(classe, distribuicao.prob(classe)) for classe in distribuicao.samples()]
    inverse = [(value, key) for key, value in classe_array]
    max_key = max(inverse)[1]
    for each in classe_array:
        if each[0] == max_key:
            a = each[0]  # returns the sentiment
            b = each[1]  # returns the probability
            # print(each)
    return a, b
Example on a single string:
avalia('i am sad today!')
(sadness, 0.98)
Now I have a dataframe with 13k rows and one column: Message.
I can apply my function to the dataframe column and get a pandas.Series like:
0 (surpresa, 0.27992165905522154)
1 (medo, 0.5632686358414051)
2 (surpresa, 0.2799216590552195)
3 (alegria, 0.5429940754962914)
I want to use this info to create two new columns in the same dataframe, like below.
Message Sentiment Probability
0 I am sad surpresa 0.2799
1 I am happy medo 0.56
I can't get this last part done. Any help, please?
Try returning both values at the end of the function, and saving them into separate columns with an apply():
def avalia(teste):
    testeStemming = []
    stemmer = nltk.stem.RSLPStemmer()
    for palavras_treinamento in teste.split():
        comStem = [p for p in palavras_treinamento.split()]
        testeStemming.append(str(stemmer.stem(comStem[0])))
    novo = extrator_palavras(testeStemming)
    distribuicao = classificador.prob_classify(novo)
    classe_array = [(classe, distribuicao.prob(classe)) for classe in distribuicao.samples()]
    inverse = [(value, key) for key, value in classe_array]
    max_key = max(inverse)[1]
    for each in classe_array:
        if each[0] == max_key:
            a = each[0]  # returns the sentiment
            b = each[1]  # returns the probability
    return a, b
# unpack the (sentiment, probability) tuples into two columns
df['Sentiment'], df['Prob'] = zip(*df.Message.apply(avalia))
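If you prefer to expand the tuples directly into two columns, an equivalent pattern is (a sketch, assuming avalia is defined as above and Message holds the raw phrases):

import pandas as pd

df[['Sentiment', 'Prob']] = pd.DataFrame(df['Message'].apply(avalia).tolist(), index=df.index)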
Currently I am working on assigning different inflow rates (float values) to each product based on its product code. There should be two columns: 'PRODUCT_CODE' and 'INFLOW_RATE'. The product code has 4 characters and the rules are as follows:
If the code starts with 'L', 'H' or 'M': assign the float value 1.0 to the 'INFLOW_RATE' column.
If the code is 'SVND' or 'SAVL': assign the float value 0.1 to the 'INFLOW_RATE' column.
Other cases: assign the float value 0.5 to the 'INFLOW_RATE' column.
The sample data is as follows:
There are more than 50 product codes, so I believe it is best to check the conditions and assign values using wildcards. So far I have managed to come up with this code:
import re

CFIn_01 = ['SVND', 'SAVL']
CFIn_10 = ["M.+", "L.+", "H.+"]
file_consol['INFLOW_RATE'] = 0.5
file_consol.loc[file_consol['PRODUCT_CODE'].isin(CFIn_01), 'INFLOW_RATE'] = 0.1
file_consol.loc[file_consol['PRODUCT_CODE'].isin(CFIn_10), 'INFLOW_RATE'] = 1.0
However, when I check the result, every value in the 'INFLOW_RATE' column is still 0.5 instead of following the rules I expected. I'm not sure what the appropriate code for this problem would be. Any help or advice is appreciated!
Create a custom function, just as you would for a simple string:
def my_func(word: str):
    # codes starting with 'H', 'L' or 'M' get an inflow rate of 1.0
    if word.startswith(('H', 'L', 'M')):
        out = 1.0
    # the two special codes get 0.1
    elif word in ('SVND', 'SAVL'):
        out = 0.1
    else:
        out = 0.5
    return out
Then apply the function:
df['INFLOW_RATE'] = df['PRODUCT_CODE'].apply(my_func)
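If you would rather stay vectorized than use apply, here is a sketch with numpy.select (a regex via .str.match stands in for the wildcard idea; the first matching condition wins):

import numpy as np

conditions = [
    df['PRODUCT_CODE'].str.match(r'^[LHM]'),    # starts with L, H or M
    df['PRODUCT_CODE'].isin(['SVND', 'SAVL']),  # the two special codes
]
choices = [1.0, 0.1]
df['INFLOW_RATE'] = np.select(conditions, choices, default=0.5)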
I have 10 features in my dataframe. I applied a chi-square test and generated the p-values for all column pairs in the dataframe. I want to represent the p-values as a cross-grid of the features.
Example: A, B, C are my features, and the p-values are (A,B) = 0.0001, (A,C) = 0.5, (B,C) = 0.0.
So I want to see it as:
      A       B       C
A     1       0.0001  0.5
B     0.0001  1       0.0
C     0.5     0.0     1
If any other detail is needed, please let me know.
Assuming you have the list of features as features = ['A','B','C',...] and the p-values as
p_values = {('A','B'): 0.0001, ('A','C'): 0.5, ...}:
import pandas as pd

p_values = {('A', 'B'): 0.0001, ('A', 'C'): 0.5}
features = ['A', 'B', 'C']

df = pd.DataFrame(columns=features)
for row in features:
    rowdf = []  # prepare a row for df
    for col in features:
        if row == col:
            rowdf.append(1)  # (A, A) taken as 1
            continue
        try:
            rowdf.append(p_values[(row, col)])  # add the value from the dictionary
        except KeyError:
            try:
                rowdf.append(p_values[(col, row)])  # look for a pair like (B, A) if (A, B) not found
            except KeyError:  # still not found, append None
                rowdf.append(None)
    df.loc[len(df)] = rowdf  # write the row into df
df.index = features  # to make the row names A, B, C, ...
print(df)
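A shorter alternative, if you prefer to build the grid in one go (a sketch that assumes the same features list and p_values dictionary; pairs missing from the dictionary come out as None):

import pandas as pd

grid = pd.DataFrame(
    [[1.0 if r == c else p_values.get((r, c), p_values.get((c, r))) for c in features]
     for r in features],
    index=features,
    columns=features,
)
print(grid)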
I have a time series with the structure below: an identifier column and two value columns (floats).
The dataframe is called just df:
Date Id Value1 Value2
2014-10-01 A 1.1 1.2
2014-10-01 B 1.3 1.4
2014-10-02 A 1.5 1.6
2014-10-02 B 1.7 1.8
2014-10-03 A 3.2 4.8
2014-10-03 B 8.2 10.1
2014-10-04 A 6.1 7.2
2014-10-04 B 4.3 4.1
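For reference, a small snippet to reproduce this sample frame (a sketch; the dtypes are assumed):

import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2014-10-01', '2014-10-01', '2014-10-02', '2014-10-02',
                            '2014-10-03', '2014-10-03', '2014-10-04', '2014-10-04']),
    'Id': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Value1': [1.1, 1.3, 1.5, 1.7, 3.2, 8.2, 6.1, 4.3],
    'Value2': [1.2, 1.4, 1.6, 1.8, 4.8, 10.1, 7.2, 4.1],
})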
What I am trying to do is turn it into an array grouped by the identifier column with a rolling window of 3 observations, so I would end up with this:
[[[1.1 1.2]
[1.5 1.6] '----> ID A 10/1 to 10/3'
[3.2 4.8]]
[[1.3 1.4]
[1.7 1.8] '----> ID B 10/1 to 10/3'
[8.2 10.1]]
[[1.5 1.6]
[3.2 4.8] '----> ID A 10/2 to 10/4'
[6.1 7.2]]
[[1.7 1.8]
[8.2 10.1] '----> ID B 10/2 to 10/4'
[4.3 4.1]]]
Of course, ignore the parts in quotes above in the array, but hopefully you get the idea.
I have a larger dataset that has more identifiers and may need to change the observation count, so I can't hard-code the row count. So far the direction I am leaning towards is taking the unique values of the Id column and iterating, grabbing 3 values at a time by creating a temp df and iterating over that.
It seems there is probably a better and faster way to do this.
"pseudo code"
unique_ids = df.Id.unique().tolist()
for id in unique_ids:
    temp_df = df.loc[df['Id'] == id]
The part I am stuck on is the best way to iterate over the temp_df as well.
The end output would be used in an LSTM model; however, most other solutions are written without needing to handle the groupby aspect of the 'Id' column.
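One possibly faster alternative to iterating manually is numpy's sliding_window_view applied per group (a sketch, assuming every Id has at least 3 observations; note the windows come out grouped per Id rather than interleaved by date as in the hand-built example above):

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

step = 3
windows = []
for _, g in df.sort_values(['Id', 'Date']).groupby('Id'):
    values = g[['Value1', 'Value2']].to_numpy()  # shape (n_obs, 2)
    # sliding_window_view gives shape (n_obs - step + 1, 2, step); move the window axis to the middle
    windows.append(sliding_window_view(values, window_shape=step, axis=0).transpose(0, 2, 1))
all_array = np.concatenate(windows)  # shape (total windows, step, 2)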
Here is what I ended up doing for the solution. It's not the prettiest, but then again my question wasn't winning any beauty contests to begin with.
import numpy as np

id_list = array_steps_df['Id'].unique().tolist()
# change number of steps as needed
step = 3
column_list = ['Value1', 'Value2']
master_list = []
for id in id_list:
    master_dict = {}
    for column in column_list:
        array_steps_id_df = array_steps_df.loc[array_steps_df['Id'] == id]
        array_steps_id_df = array_steps_id_df[[column]].values
        master_dict[column] = []
        for obs in range(len(array_steps_id_df) - step + 1):
            start_obs = obs + step
            master_dict[column].append(array_steps_id_df[obs:start_obs, ])
    master_list.append(master_dict)

for idx, dic in enumerate(master_list):
    # init arrays here
    if idx == 0:
        value1_array_init = master_list[0]['Value1']
        value2_array_init = master_list[0]['Value2']
    else:
        value1_array_init += master_list[idx]['Value1']
        value2_array_init += master_list[idx]['Value2']

value1_array = np.array(value1_array_init)
value2_array = np.array(value2_array_init)
all_array = np.hstack((value1_array, value2_array)).reshape((len(array_steps_df) - (step + 1),
                                                             len(column_list),
                                                             step)).transpose(0, 2, 1)
Fixed, my mistake: I added a transpose at the end and redid the order of features and steps in the reshape.
Credit to this site for some extra help:
https://www.mikulskibartosz.name/how-to-turn-pandas-data-frame-into-time-series-input-for-rnn/
I ended up redoing this a bit to make it more dynamic for the columns and to keep the time series in order, and I also added a target array to keep the predictions in order. For anyone that needs it, here is the function:
def data_to_array_steps(array_steps_df, time_steps, columns_to_array, id_column):
    """
    https://www.mikulskibartosz.name/how-to-turn-pandas-data-frame-into-time-series-input-for-rnn/
    :param array_steps_df: the dataframe from the csv
    :param time_steps: how many time steps
    :param columns_to_array: what columns to convert to the array
    :param id_column: what is to be used for the identifier
    :return: data grouped in # observations by identifier and date
    """
    id_list = array_steps_df[id_column].unique().tolist()
    date_list = array_steps_df['date'].unique().tolist()
    master_list = []
    target_list = []
    missing_counter = 0
    total_counter = 0
    # grab date size = time steps at a time and iterate through all of them
    for date in range(len(date_list) - time_steps + 1):
        date_range_test = date_list[date:time_steps + date]
        date_range_df = array_steps_df.loc[(array_steps_df['date'] <= date_range_test[-1]) &
                                           (array_steps_df['date'] >= date_range_test[0])]
        # for each id do it separately so time series data doesn't get mixed up
        for identifier in id_list:
            # get the id in here and then skip if it doesn't have the required time steps/observations
            date_range_id = date_range_df.loc[date_range_df[id_column] == identifier]
            master_dict = {}
            # if there aren't enough observations for the date range
            if len(date_range_id) != time_steps:
                # don't fully need the counter except in unusual circumstances when debugging; it causes no harm for now
                missing_counter += 1
            else:
                # add the target each loop through for the last date in the date range for the id or ticker
                target = array_steps_df['target'].loc[
                    (array_steps_df['date'] == date_range_test[-1])
                    & (array_steps_df[id_column] == identifier)
                ].iloc[0]
                target_list.append(target)
                total_counter += 1
                # loop through each column in the dataframe
                for column in columns_to_array:
                    date_range_id_value = date_range_id[[column]].values
                    master_dict[column] = []
                    master_dict[column].append(date_range_id_value)
                master_list.append(master_dict)
    # redo columns to arrays, after they have been ordered and grouped by id above
    array_list = []
    # for each column go through the values in the array, create an array for the column, then append to the list
    for column in columns_to_array:
        for idx, dic in enumerate(master_list):
            # init arrays here if the first value
            if idx == 0:
                value_array_init = master_list[0][column]
            else:
                value_array_init += master_list[idx][column]
        array_list.append(np.array(value_array_init))
    # for each value in the array list, horizontally stack each value
    all_array = np.hstack(array_list).reshape((total_counter,
                                               len(columns_to_array),
                                               time_steps)).transpose(0, 2, 1)
    target_array_all = np.array(target_list).reshape(len(target_list), 1)
    # should probably make this an if condition later after a few more tests
    print('check of length of arrays', len(all_array), len(target_array_all))
    return all_array, target_array_all
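A hypothetical call, assuming df carries the 'Id', 'Value1' and 'Value2' columns from the example above plus the lowercase 'date' and 'target' columns the function expects:

all_array, target_array = data_to_array_steps(
    array_steps_df=df,
    time_steps=3,
    columns_to_array=['Value1', 'Value2'],
    id_column='Id',
)
print(all_array.shape)     # (total windows, time steps, features)
print(target_array.shape)  # (total windows, 1)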
My Problem
I'm working on sentiment analysis using ML models.
I have a dataset of Amazon reviews from 1 to 5 stars.
print(df.groupby('overall').count())
overall reviewText
1.0 108725
2.0 82139
3.0 142257
4.0 347041
5.0 1009026
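The share of each rating can be checked directly with value_counts, for example:

print(df['overall'].value_counts(normalize=True).sort_index())
# 5.0 comes out to roughly 0.60 of all reviews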
These results are imbalanced, with 59% of the reviews being 5-star. I'm afraid that if I train my model with this dataset, it will quickly learn to be biased towards predicting a 'Positive' sentiment.
I would like to balance these rows so that each 'overall' rating has an equal number of 'reviewText' entries.
My Current Solution
Here is my current solution
one_star_ratings = df.loc[df['overall'] == 1.0][0:80000]
two_star_ratings = df.loc[df['overall'] == 2.0][0:80000]
three_star_ratings = df.loc[df['overall'] == 3.0][0:80000]
four_star_ratings = df.loc[df['overall'] == 4.0][0:80000]
five_star_ratings = df.loc[df['overall'] == 5.0][0:80000]
df2 = pd.concat([one_star_ratings, two_star_ratings, three_star_ratings, four_star_ratings,
five_star_ratings])
This works, but it is a naive solution.
My question
I will encounter this issue frequently while working with datasets, and I am trying to find a better solution. Assume I had 100 categories and not just 5: how can I solve this problem without writing 100+ lines of code?
You could use groupby().head() for this:
n_sample = 80000
df2 = df.groupby('overall').head(n_sample)
If you want to sample randomly:
df2 = df.sample(frac=1).groupby('overall').head(n_sample)
You can also use sample to randomly select the data:
df2 = df.groupby('overall').apply(lambda x: x.sample(n=n_sample))
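Note that groupby().apply(...sample...) leaves a MultiIndex of ('overall', original index) on the result, and sample(n=...) raises an error if a group has fewer than n_sample rows. A sketch that flattens the index, assuming every group is large enough:

n_sample = 80000
df2 = (
    df.groupby('overall', group_keys=False)
      .apply(lambda x: x.sample(n=n_sample, random_state=42))
      .reset_index(drop=True)
)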