Removing duplicates with a condition in data frame - python

I have a data frame with text in one column and its labels in another column.
The same text can appear several times, each time with a different label.
I want to remove these duplicates and keep only the record with the specified label.
Sample dataframe:
                 text label
0          great view     a
1          great view     b
2        good balcony     a
3        nice service     a
4        nice service     b
5        nice service     c
6           bad rooms     f
7     nice restaurant     a
8     nice restaurant     d
9   nice beach nearby     x
10        good casino     z
Now I want to keep the text wherever label a is present and remove only the duplicates.
Sample output:
                text label
0         great view     a
1       good balcony     a
2       nice service     a
3          bad rooms     f
4    nice restaurant     a
5  nice beach nearby     x
6        good casino     z
Thanks in advance!

You can simply try sort_values before drop_duplicates: the df will first be ordered by label alphabetically ('a' sorts before 'b'), so drop_duplicates keeps the row with the smallest label for each text.
df=df.sort_values('label').drop_duplicates('text')
Or
df=df.sort_values('label').groupby('text').head(1)
Update
Valuetokeep = 'a'
# Rows where label == Valuetokeep sort first (False < True),
# so drop_duplicates keeps the preferred label when it exists.
df = df.iloc[(df.label != Valuetokeep).argsort()].drop_duplicates('text')
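For reference, here is a minimal end-to-end sketch of the sort-then-dedupe idea, using the sample frame from the question (the value_to_keep name is just for illustration):

import pandas as pd

df = pd.DataFrame({
    'text': ['great view', 'great view', 'good balcony', 'nice service',
             'nice service', 'nice service', 'bad rooms', 'nice restaurant',
             'nice restaurant', 'nice beach nearby', 'good casino'],
    'label': list('abaabcfadxz'),
})

value_to_keep = 'a'
# Rows whose label matches sort first (False < True), so drop_duplicates keeps them.
out = (df.iloc[(df.label != value_to_keep).argsort()]
         .drop_duplicates('text')
         .sort_index()
         .reset_index(drop=True))
print(out)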

Related

Processing dataframe with conditionals, using df.apply

I have a catalog of trees, which I've imported into a dataframe. It looks like this:
>>> df
   ID    Tree  Zone  Temp_Limit Grade
0   1   Apple     1          21     A
1   2   Apple     1          21     B
2   3  Orange     3          28     B
3   4    Pear     2          26     A
4   5   Apple     4          24     C
The idea is that depending on the type of tree, zone, and temp_limit, the dosage for irrigation, fertilizers, estimated transplant date, etc. would be calculated. Those would be additional columns in the dataframe.
The problem is that the formulas are conditional. It's not just "multiply the temp limit by 5 and add 4"; it's more like "if it's an apple tree in zone 2, apply this formula; if it's an orange tree in zone 1, the formula goes like this", etc.
And to make things a bit more complicated, there might be rows that have an ID, a Tree type, and no data, that correspond to trees that haven't been delivered, etc.
My current solution is to use df.apply with a function that does the conditionals and skips the blank rows:
def calculate_irrigation(species, zone, templimit, grade):
    if species.lower() == "apple":
        if zone == 3:
            etc etc etc

df['irrigation'] = df.apply(lambda x: calculate_irrigation(x['Tree'], x['Zone'], x['Temp_Limit'], x['Grade']), axis=1)
Question: are a DataFrame and df.apply the best solution for this? I used a df because it adapts very well to the data I'm working with, and getting the data in there is pretty straightforward. Plus, exporting the final results is easy. But when you have to do different operations based on values, and have to start putting functions in there, it makes you wonder if there's a better way you're not seeing.
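Something like the following runnable sketch captures the pattern described above; the per-(species, zone) formulas and the missing-data guard are invented placeholders for illustration, not real agronomy:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'Tree': ['Apple', 'Apple', 'Orange', 'Pear', 'Apple'],
                   'Zone': [1, 1, 3, 2, 4],
                   'Temp_Limit': [21, 21, 28, 26, 24],
                   'Grade': ['A', 'B', 'B', 'A', 'C']})

# Hypothetical (species, zone) -> formula dispatch table.
FORMULAS = {
    ('apple', 1): lambda t: t * 1.5,
    ('orange', 3): lambda t: t * 2.0 + 4,
}

def calculate_irrigation(row):
    # Skip undelivered trees that have an ID but no data.
    if pd.isna(row['Zone']) or pd.isna(row['Temp_Limit']):
        return np.nan
    formula = FORMULAS.get((row['Tree'].lower(), row['Zone']))
    return formula(row['Temp_Limit']) if formula is not None else np.nan

df['irrigation'] = df.apply(calculate_irrigation, axis=1)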

How to filter a DataFrame column based on text containing the symbol #

Hi, I have a DataFrame with two columns which looks like this:
Index Text
0 READ MY NEW OP-ED: IRREVERSIBLE – Many Effects...
1 #COVID19 is linked to more #diabetes diagnoses...
2 #COVID19: IRREVERSIBLE – Many Effects...
3 READ MY NEW OP-ED: IRREVERSIBLE – Many Effects...
4 Advanced healthcare at your fingertips\nhttps:...
I am trying to keep only the rows which contain the # symbol, so based on my data frame above my desired output is:
Index Text
1 #COVID19 is linked to more #diabetes diagnoses...
2 #COVID19: IRREVERSIBLE – Many Effects...
I have tried several ways to obtain that, unsuccessfully; my latest code attempt was:
for column in twt_text:
    print(twt_text['text'].str.contains('#'))
But the output generated was not at all what I expected:
0 False
1 True
2 True
3 False
4 False
Any idea or insight on how I can obtain the output I want based on text containing # ?
You could build a selection mask and use that to filter the rows:
df[df['Text'].str.contains('#')]
Result
Text
1 #COVID19 is linked to more #diabetes diagnoses...
2 #COVID19: IRREVERSIBLE – Many Effects...
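One caveat worth hedging: if the Text column can contain missing values, str.contains returns NaN for them and the boolean indexing fails. Passing na=False (and regex=False, since '#' is a literal character, not a pattern) avoids both issues:

mask = df['Text'].str.contains('#', regex=False, na=False)
df[mask]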

How to split multiple values from a dataframe column into separate columns

I have a column with multiple values per row. I want to split the unique values into multiple columns with headers and then apply LabelEncoder or OneHotEncoder (I don't know which yet), because I have a multi-label text classification problem to solve.
I tried
df['labels1'] = df['labels1'].str.split(',', expand=True)
but it keeps only the first item. Also, before trying to split the column I tried to change its type, but that didn't work.
id
0 Politics, Journals, International
1 Social, Blogs, Celebrities
2 Media, Blogs, Video
3 Food&Drink, Cooking
4 Media, Blogs, Video
5 Culture
6 Social, TV Shows
7 News, Crime, National
8 Social, Blogs, Celebrities
9 Social, Blogs, Celebrities
10 Social, Blogs, Celebrities
11 Family, Blogs
12 Media, Blogs, Video
13 Social, TV Shows
14 Entertainment, TV Shows
15 Social, TV Shows
16 Social, Blogs, Celebrities
It seems like the right-hand side, df['labels1'].str.split(', ', expand=True), spits out one column per item (three here, since rows have up to three labels). So perhaps you can do something like:
df[['newcolumn1', 'newcolumn2', 'newcolumn3']] = df['labels1'].str.split(', ', expand=True)
You are trying to set a single column of a dataframe with a three-column dataframe - which unfortunately is done silently by passing only the first column...
Perhaps concatenate the three expanded columns onto the original dataframe
df = pd.concat([df, df['labels1'].str.split(', ', expand=True)], axis=1)
or just keep the result of this step in a new one
df_exp = df['labels1'].str.split(', ', expand=True)
Edit:
IIUC, your binary table can be created like this (though I don't know if this is the recommended way to do it):
col_head = set(df.labels1.str.split(', ', expand=True).values.flatten())
col_head.discard(None)  # rows with fewer labels pad the expanded frame with None
bin_tbl = pd.DataFrame(columns=sorted(col_head))
for c in bin_tbl:
    bin_tbl[c] = df.labels1.str.split(', ').apply(lambda x: c in x)
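For what it's worth, pandas can also build that binary table in a single call with str.get_dummies, assuming every label list is consistently separated by ', ':

bin_tbl = df['labels1'].str.get_dummies(sep=', ')

This returns one 0/1 column per distinct label, which is exactly the input format one-hot-style encoders expect.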

plotting stacked bar graph on column values

I have a Pandas data frame that looks like this:
ID Management Administrative
1 1 2
3 2 1
4 3 3
10 1 3
essentially 1-3 is a grade of low, medium, or high. I want a stacked bar chart that has Management and Administrative on the x-axis and the stacked composition of 1, 2, 3 for each column, in percentages.
e.g. if there were only 4 entries as above, 1 would compose 50% of the height of the Management bar, 2 would compose 25%, and 3 would compose 25%. The y-axis would go up to 100%.
Hope this makes sense. Hard to explain but if unclear willing to clarify further!
You will need to chain several operations: first melt your dataset to move the department names into a single variable; then group by Dept and Rating to count the number of IDs that fall into each bucket; then group by Dept again to convert the counts to percentages. Lastly, plot your stacked bar graph:
(df4.melt()
    .rename(columns={'variable': 'Dept', 'value': 'Rating'})
    .query('Dept != "ID"')
    .groupby(['Dept', 'Rating']).size()
    .rename('Count')
    .groupby(level=0).apply(lambda x: x / sum(x))
    .unstack()
    .plot(kind='bar', stacked=True))
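A self-contained version of the same chain, building the sample frame from the question (df4 is assumed to be that frame):

import pandas as pd
import matplotlib.pyplot as plt

df4 = pd.DataFrame({'ID': [1, 3, 4, 10],
                    'Management': [1, 2, 3, 1],
                    'Administrative': [2, 1, 3, 3]})

(df4.melt()                                             # wide -> long
    .rename(columns={'variable': 'Dept', 'value': 'Rating'})
    .query('Dept != "ID"')                              # drop the melted ID rows
    .groupby(['Dept', 'Rating']).size()                 # count per bucket
    .rename('Count')
    .groupby(level=0).apply(lambda x: x / sum(x))       # counts -> fractions per Dept
    .unstack()
    .plot(kind='bar', stacked=True))
plt.show()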

Count occurrence of elements in column of lists (with a twist)

I've got a column of lists called "author_background" which I would like to analyze. The actual column consists of 8,000 rows. My aim is to get an overview of how many different elements there are in total (across all lists in the column) and to count how many lists each element occurs in.
What my column looks like:
df.author_background
0 [Professor for Business Administration, Harvard Business School]
1 [Professor for Industrial Engineering, University of Oakland]
2 [Harvard Business School]
3 [CEO, SpaceX]
desired output
0 Harvard Business School 2
1 Professor for Business Administration 1
2 Professor for Industrial Engineering 1
3 CEO 1
4 University of Oakland 1
5 SpaceX 1
I would like to know how often "Professor for Business Administration", "Professor for Industrial Engineering", "Harvard Business School", etc. occur in the column. There are way more titles I don't know about.
Basically, I would like to use pd.value_counts on the column. However, that's not possible because each entry is a list.
Is there another way to count the occurrences of each element?
If that's more helpful: I also have a flat (not nested) list which contains all elements of the lists.
Turn it all into a single series by flattening the lists:
pd.Series([bg for bgs in df.author_background for bg in bgs])
Now you can call value_counts() on that series to get your result.
You can try this:
el = pd.Series([item for sublist in df.author_background for item in sublist])
df = el.groupby(el).size().rename_axis('author_background').reset_index(name='counter')
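On pandas 0.25+, Series.explode gets you there in one line (a sketch, assuming each cell holds an actual Python list rather than a string):

counts = df['author_background'].explode().value_counts()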
