My Problem
I'm working on sentiment analysis using ML models.
I have a dataset of Amazon reviews from 1 to 5 stars.
print(df.groupby('overall').count())
overall reviewText
1.0 108725
2.0 82139
3.0 142257
4.0 347041
5.0 1009026
These results are biased, with 59% of them being 5-stars. I'm afraid if I train my model with this dataset, it will learn quickly to be biased towards rating a sentiment of 'Positive'.
I would like to equalize all of these rows so each 'overall' rating has an equal number of 'reviewText'
My Current Solution
Here is my current solution
one_star_ratings = df.loc[df['overall'] == 1.0][0:80000]
two_star_ratings = df.loc[df['overall'] == 2.0][0:80000]
three_star_ratings = df.loc[df['overall'] == 3.0][0:80000]
four_star_ratings = df.loc[df['overall'] == 4.0][0:80000]
five_star_ratings = df.loc[df['overall'] == 5.0][0:80000]
df2 = pd.concat([one_star_ratings, two_star_ratings, three_star_ratings, four_star_ratings,
five_star_ratings])
This works, but it is a naive solution.
My question
I will encounter this issue frequently while working with datasets, and I am trying to find a better solution. Assume I had 100 categories, and not just 5. How can I better solve this problem without writing 100+ lines of code to do it?
You could use groupby().head() for this:
n_sample = 80000
df2 = df.groupby('overall').head(n_sample)
If you want to sample randomly:
df2 = df.sample(frac=1).groupby('overall').head(n_sample)
You can also use sample to randomly select the data:
df2 = df.groupby('overall')apply(lambda x: x.sample(n=n_sample))
Related
I have a dataset containing pre-processed online reviews, each row contains words from online review. I am doing a Latent Dirichlet Allocation process to extract topics from the entire dataframe. Now, I want to assign topics to each row of data based on an LDA function called get_document_topics.
I found a code from a source but it only prints the probability of a document being assign to each topic. I'm trying to iterate the code to all documents and returns to the same dataset. Here's the code I found...
text = ["user"]
bow = dictionary.doc2bow(text)
print "get_document_topics", model.get_document_topics(bow)
### get_document_topics [(0, 0.74568415806946331), (1, 0.25431584193053675)]
Here's what I'm trying to get...
stemming probabOnTopic1 probOnTopic2 probaOnTopic3 topic
0 [bank, water, bank] 0.7 0.3 0.0 0
1 [baseball, rain, track] 0.1 0.8 0.1 1
2 [coin, money, money] 0.9 0.0 0.1 0
3 [vote, elect, bank] 0.2 0.0 0.8 2
Here's the codes that I'm working on...
def bow (text):
return [dictionary.doc2bow(text) in document]
df["probability"] = optimal_model.get_document_topics(bow)
df[['probOnTopic1', 'probOnTopic2', 'probOnTopic3']] = pd.DataFrame(df['probability'].tolist(), index=df.index)
slightly different approach #Christabel, that include your other request with 0.7 threshold:
import pandas as pd
results = []
# Iterate over each review
for review in df['review']:
bow = dictionary.doc2bow(review)
topics = model.get_document_topics(bow)
#to a dictionary
topic_dict = {topic[0]: topic[1] for topic in topics}
#get the prob
max_topic = max(topic_dict, key=topic_dict.get)
if topic_dict[max_topic] > 0.7:
topic = max_topic
else:
topic = 0
topic_dict['topic'] = topic
results.append(topic_dict)
#to a DF
df_topics = pd.DataFrame(results)
df = df.merge(df_topics, left_index=True, right_index=True)
Is it helpful and working for you ?
You can then place this code inside of a function and use the '0.70' value as an external parameter so to make it usable in different use-cases.
One possible option can be creating a new column in your DF and then iterate over each row in your DF. You can use the get_document_topics function to get the topic distribution for each row and then assign the most likely topic to that row.
df['topic'] = None
for index, row in df.iterrows():
text = row['review_text']
bow = dictionary.doc2bow(text)
topic_distribution = model.get_document_topics(bow)
most_likely_topic = max(topic_distribution, key=lambda x: x[1])
df.at[index, 'topic'] = most_likely_topic
is it helpful ?
Since my last post did lack in information:
example of my df (the important col):
deviceID: unique ID for the vehicle. Vehicles send data all Xminutes.
mileage: the distance moved since the last message (in km)
positon_timestamp_measure: unixTimestamp of the time the dataset was created.
deviceID mileage positon_timestamp_measure
54672 10 1600696079
43423 20 1600696079
42342 3 1600701501
54672 3 1600702102
43423 2 1600702701
My Goal is to validate the milage by comparing it to the max speed of the vehicle (which is 80km/h) by calculating the speed of the vehicle using the timestamp and the milage. The result should then be written in the orginal dataset.
What I've done so far is the following:
df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
#create new col and set all values to false
df_ori['valid'] = 0
for group_name, group in df:
#sort group by time
group = group.sort_values(by='position_timestamp_measure')
group = group.reset_index()
#since I can't validate the first point in the group, I set it to valid
df_ori.loc[df_ori.index == group.dataIndex.values[0], 'validPosition'] = 1
#iterate through each data in the group
for i in range(1, len(group)):
timeGoneSec = abs(group.position_timestamp_measure.values[i]-group.position_timestamp_measure.values[i-1])
timeHours = (timeGoneSec/60)/60
#calculate speed
if((group.mileage.values[i]/timeHours)<maxSpeedKMH):
df_ori.loc[dataset.index == group.dataIndex.values[i], 'validPosition'] = 1
dataset.validPosition.value_counts()
It definitely works the way I want it to, however it lacks in performance a lot. The df contains nearly 700k in data (already cleaned). I am still a beginner and can't figure out a better solution. Would really appreciate any of your help.
If I got it right, no for-loops are needed here. Here is what I've transformed your code into:
df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
#create new col and set all values to false
df_ori['valid'] = 0
df_ori = df_ori.sort_values(['position_timestamp_measure'])
# Subtract preceding values from currnet value
df_ori['timeGoneSec'] = \
df_ori.groupby('device_id')['position_timestamp_measure'].transform('diff')
# The operation above will produce NaN values for the first values in each group
# fill the 'valid' with 1 according the original code
df_ori[df_ori['timeGoneSec'].isna(), 'valid'] = 1
df_ori['timeHours'] = df_ori['timeGoneSec']/3600 # 60*60 = 3600
df_ori['flag'] = (df_ori['mileage'] / df_ori['timeHours']) <= maxSpeedKMH
df_ori.loc[df_ori['flag'], 'valid'] = 1
# Remove helper columns
df_ori = df.drop(columns=['flag', 'timeHours', 'timeGoneSec'])
The basic idea is try to use vectorized operation as much as possible and to avoid for loops, typically iteration row by row, which can be insanly slow.
Since I can't get the context of your code, please double check the logic and make sure it works as desired.
I have the following code which reads a csv file and then analyzes it. One patient has more than one illness and I need to find how many times an illness is seen on all patients. But the query given here
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
is so slow that it takes more than 15 mins. Is there a way to make the query faster?
raw_data = pd.read_csv(r'C:\Users\omer.kurular\Desktop\Data_Entry_2017.csv')
data = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia", "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax", "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]
illnesses = pd.DataFrame({"Finding_Label":[],
"Count_of_Patientes_Having":[],
"Count_of_Times_Being_Shown_In_An_Image":[]})
ids = raw_data["Patient ID"].drop_duplicates()
index = 0
for ctr in data[:1]:
illnesses.at[index, "Finding_Label"] = ctr
illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = raw_data[raw_data["Finding Labels"].str.contains(ctr)].size / 12
for i in ids:
illnesses.at[index, "Count_of_Patientes_Having"] = raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
index = index + 1
Part of dataframes:
Raw_data
Finding Labels - Patient ID
IllnessA|IllnessB - 1
Illness A - 2
From what I read I understand that ctr stands for the name of a disease.
When you are doing this query:
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
You are not only filtering the rows which have the disease, but also which have a specific patient id. If you have a lot of patients, you will need to do this query a lot of times. A simpler way to do it would be to not filter on the patient id and then take the count of all the rows which have the disease.
This would be:
raw_data[raw_data['Finding Labels'].str.contains(ctr)].size
And in this case since you want the number of rows, len is what you are looking for instead of size (size will be the number of cells in the dataframe).
Finally another source of error in your current code was the fact that you were not keeping the count for every patient id. You needed to increment illnesses.at[index, "Count_of_Patientes_Having"] not set it to a new value each time.
The code would be something like (for the last few lines), assuming you want to keep the disease name and the index separate:
for index, ctr in enumerate(data[:1]):
illnesses.at[index, "Finding_Label"] = ctr
illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = len(raw_data[raw_data["Finding Labels"].str.contains(ctr)]) / 12
illnesses.at[index, "Count_of_Patientes_Having"] = len(raw_data[raw_data['Finding Labels'].str.contains(ctr)])
I took the liberty of using enumerate for a more pythonic way of handling indexes. I also don't really know what "Count_of_Times_Being_Shown_In_An_Image" is, but I assumed you had had the same confusion between size and len.
Likely the reason your code is slow is that you are growing a data frame row-by-row inside a loop which can involve multiple in-memory copying. Usually this is reminiscent of general purpose Python and not Pandas programming which ideally handles data in blockwise, vectorized processing.
Consider a cross join of your data (assuming a reasonable data size) to the list of illnesses to line up Finding Labels to each illness in same row to be filtered if longer string contains shorter item. Then, run a couple of groupby() to return the count and distinct count by patient.
# CROSS JOIN LIST WITH MAIN DATA FRAME (ALL ROWS MATCHED)
raw_data = (raw_data.assign(key=1)
.merge(pd.DataFrame({'ills':ills, 'key':1}), on='key')
.drop(columns=['key'])
)
# SUBSET BY ILLNESS CONTAINED IN LONGER STRING
raw_data = raw_data[raw_data.apply(lambda x: x['ills'] in x['Finding Labels'], axis=1)]
# CALCULATE GROUP BY count AND distinct count
def count_distinct(grp):
return (grp.groupby('Patient ID').size()).size
illnesses = pd.DataFrame({'Count_of_Times_Being_Shown_In_An_Image': raw_data.groupby('ills').size(),
'Count_of_Patients_Having': raw_data.groupby('ills').apply(count_distinct)})
To demonstrate, consider below with random, seeded input data and output.
Input Data (attempting to mirror original data)
import numpy as np
import pandas as pd
alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
data_tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']
ills = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia",
"Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax",
"Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]
np.random.seed(542019)
raw_data = pd.DataFrame({'Patient ID': np.random.choice(data_tools, 25),
'Finding Labels': np.core.defchararray.add(
np.core.defchararray.add(np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]),
np.random.choice(ills, 25).astype('str')),
np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]))
})
print(raw_data.head(10))
# Patient ID Finding Labels
# 0 r xPNPneumothoraxXYm
# 1 python ScSInfiltration9Ud
# 2 stata tJhInfiltrationJtG
# 3 r thLPneumoniaWdr
# 4 stata thYAtelectasis6iW
# 5 sas 2WLPneumonia1if
# 6 julia OPEConsolidationKq0
# 7 sas UFFCardiomegaly7wZ
# 8 stata 9NQHerniaMl4
# 9 python NB8HerniapWK
Output (after running above process)
print(illnesses)
# Count_of_Times_Being_Shown_In_An_Image Count_of_Patients_Having
# ills
# Atelectasis 3 1
# Cardiomegaly 2 1
# Consolidation 1 1
# Effusion 1 1
# Emphysema 1 1
# Fibrosis 2 2
# Hernia 4 3
# Infiltration 2 2
# Mass 1 1
# Nodule 2 2
# Pleural_Thickening 1 1
# Pneumonia 3 3
# Pneumothorax 2 2
I have a series of dataframes containing daily rainfall totals (continuous data) and whether or not a flood occurs (binary data, i.e. 1 or 0). Each data frame represents a year (e.g. df01, df02, df03, etc.), which looks like this:
date ppt fld
01/02/2011 1.5 0
02/02/2011 0.0 0
03/02/2011 2.7 0
04/02/2011 4.6 0
05/02/2011 15.5 1
06/02/2011 1.5 0
...
I wish to perform logistic regression on each year of data, but the data is heavily imbalanced due to the very small number of flood events relative to the number of rainfall events. As such, I wish to upsample just the minority class (values of 1 in 'fld'). So far I know to split each dataframe into two according to the 'fld' value, upsample the resulting '1' dataframe, and then remerge into one dataframe.
# So if I apply to one dataframe it looks like this:
# Separate majority and minority classes
mask = df01.fld == 0
fld_0 = df01[mask]
fld_1 = df01[~mask]
# Upsample minority class
fld_1_upsampled = resample(fld_1,
replace=True, # sample with replacement
n_samples=247, # to match majority class
random_state=123) # reproducible results
# Combine majority class with upsampled minority class
df01_upsampled = pd.concat([fld_0, fld_1_upsampled])
As I have 17 dataframes, it is inefficient to go dataframe-by-dataframe. Are there any thoughts as to how I could be more efficient with this? So far I have tried this (it is probably evident I have no idea what I am doing with loops of this kind, I am quite new to python):
df_all = [df01, df02, df03, df04,
df05, df06, df07, df08,
df09, df10, df11, df12,
df13, df14, df15, df16, df17]
# This is my list of annual data
for i in df_all:
fld_0 = i[mask]
fld_1 = i[~mask]
fld_1_upsampled = resample(fld_1,
replace=True, # sample with replacement
n_samples=len(fld_0), # to match majority class
random_state=123) # reproducible results
i_upsampled = pd.concat([fld_0, fld_1_upsampled])
return i_upsampled
Which returns the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-36-6fd782d4c469> in <module>()
11 replace=True, # sample with replacement
12 n_samples=247, # to match majority class
---> 13 random_state=123) # reproducible results
14 i_upsampled = pd.concat([fld_0, fld_1_upsampled])
15 return i_upsampled
~/anaconda3/lib/python3.6/site-packages/sklearn/utils/__init__.py in resample(*arrays, **options)
259
260 if replace:
--> 261 indices = random_state.randint(0, n_samples, size=(max_n_samples,))
262 else:
263 indices = np.arange(n_samples)
mtrand.pyx in mtrand.RandomState.randint()
ValueError: low >= high
Any advice or comments greatly appreciated :)
UPDATE: one reply suggested that some of my dataframes may not contain any samples from the minority class. This was correct, so I have removed them, but the same error arises.
Giving you the benefit of the doubt that you're using the same mask syntax in your second code block as in your first, it looks like you may not have any samples to pass in to your resample in one or more of your DFs:
df=pd.DataFrame({'date':[1,2,3,4,5,6],'ppt':[1.5,0,2.7,4.6,15.5,1.5],'fld':[0,1,0,0,1,1]})
date ppt fld
1 1.5 0
2 0.0 1
3 2.7 0
4 4.6 0
5 15.5 1
6 1.5 1
resample(df[df.fld==1], replace=True, n_samples=3, random_state=123)
date ppt fld
6 1.5 1
5 15.5 1
6 1.5 1
resample(df[df.fld==2], replace=True, n_samples=3, random_state=123)
"...ValueError: low >= high"
I am having performance issues with iterrows in on my dataframe as I start to scale up my data analysis.
Here is the current loop that I am using.
for ii, i in a.iterrows():
for ij, j in a.iterrows():
if ii != ij:
if i['DOCNO'][-5:] == j['DOCNO'][4:9]:
if i['RSLTN1'] > j['RSLTN1']:
dl.append(ij)
else:
dl.append(ii)
elif i['DOCNO'][-5:] == j['DOCNO'][-5:]:
if i['RSLTN1'] > j['RSLTN1']:
dl.append(ij)
else:
dl.append(ii)
c = a.drop(a.index[dl])
The point of the loop is to find 'DOCNO' values that are different in the dataframe but are known to be equivalent denoted by the 5 characters that are equivalent but spaced differently in the string. When found I want to drop the smaller number from the associated 'RSLTN1' column. Additionally, my data set may have multiple entries for a unique 'DOCNO' that I want to drop the lower number 'RSLTN1' result.
I was successful running this will small quantities of data (~1000 rows) but as I scale up 10x I am running into performance issues. Any suggestions?
Sample from dataset
In [107]:a[['DOCNO','RSLTN1']].sample(n=5)
Out[107]:
DOCNO RSLTN1
6815 MP00064958 72386.0
218 MP0059189A 65492.0
8262 MP00066187 96497.0
2999 MP00061663 43677.0
4913 MP00063387 42465.0
How does this fit you needs?
import pandas as pd
s = '''\
DOCNO RSLTN1
MP00059189 72386.0
MP0059189A 65492.0
MP00066187 96497.0
MP00061663 43677.0
MP00063387 42465.0'''
# Recreate dataframe
df = pd.read_csv(pd.compat.StringIO(s), sep='\s+')
# Create mask
# We sort to make sure we keep only highest value
# Remove all non-digit according to: https://stackoverflow.com/questions/44117326/
m = (df.sort_values(by='RSLTN1',ascending=False)['DOCNO']
.str.extract('(\d+)', expand=False)
.astype(int).duplicated())
# Apply inverted `~` mask
df = df.loc[~m]
Resulting df:
DOCNO RSLTN1
0 MP00059189 72386.0
2 MP00066187 96497.0
3 MP00061663 43677.0
4 MP00063387 42465.0
In this example the following row was removed:
MP0059189A 65492.0