Drop or replace values within duplicate rows in pandas dataframe - python

I have a data frame df where some rows are duplicates with respect to a subset of columns:
A B C
1 Blue Green
2 Red Green
3 Red Green
4 Blue Orange
5 Blue Orange
I would like to remove (or replace with a dummy string) values for duplicate rows with respect to B and C, without deleting the whole row, ideally producing:
A B C
1 Blue Green
2 Red Green
3 NaN NaN
4 Blue Orange
5 NaN NaN
As per this thread: Replace duplicate values across columns in Pandas, I've tried using pd.Series.duplicated; however, I can't get it to work with duplicates in a subset of columns.
I've also played around with:
is_duplicated = df.loc[df.duplicated(subset=['B','C'])]
df = df.where(is_duplicated==True, 999) # 999 intended as a placeholder that I could find-and-replace later on
However, this replaces almost every row with 999 in each column, so clearly I'm doing something wrong. I'd appreciate any advice on how to proceed!

df.loc[df.duplicated(subset=['B','C']), ['B','C']] = np.nan seems to work for me.
Edited to include @ALollz's and @macaw_9227's corrections.
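For reference, a minimal runnable sketch of that one-liner, using the frame from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': ['Blue', 'Red', 'Red', 'Blue', 'Blue'],
                   'C': ['Green', 'Green', 'Green', 'Orange', 'Orange']})

# rows 2 and 4 repeat an earlier (B, C) pair; blank out just those two columns
df.loc[df.duplicated(subset=['B', 'C']), ['B', 'C']] = np.nan
print(df)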

Let me share with you how I used to confront this kind of challenge in the beginning. Obviously there are quicker ways (a one-liner), but for the sake of the answer, let's do it on a more intuitive level (later, you'll see that you can do it in one line).
So here we go...
df = pd.DataFrame({"B":['Blue','Red','Red','Blue','Blue'],"C":['Green','Green','Green','Orange','Orange']})
which results in:
      B       C
0  Blue   Green
1   Red   Green
2   Red   Green
3  Blue  Orange
4  Blue  Orange
Step 1: identify the duplication:
For this, I simply add another (helper) column and ask, True/False, whether the (B, C) pair is duplicated.
df['IS_DUPLICATED']= df.duplicated(subset=['B','C'])
Step 2: Identify the indexes of the 'True' IS_DUPLICATED:
dup_index = df[df['IS_DUPLICATED']==True].index
result: Int64Index([2, 4], dtype='int64')
Step 3: mark them as NaN:
df.loc[dup_index] = np.nan  # .loc, since dup_index holds index labels, not positions
Step 4: remove the IS_DUPLICATED column:
df.drop('IS_DUPLICATED',axis=1, inplace=True)
and the desired result:
      B       C
0  Blue   Green
1   Red   Green
2   NaN     NaN
3  Blue  Orange
4   NaN     NaN
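And the promised one-liner: steps 1 through 4 collapse into the same assignment the accepted answer uses:
df.loc[df.duplicated(subset=['B', 'C']), ['B', 'C']] = np.nan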

I would use
df[['B','C']]=df[['B','C']].mask(df.duplicated(['B','C']))
df
Out[141]:
A B C
0 1 Blue Green
1 2 Red Green
2 3 NaN NaN
3 4 Blue Orange
4 5 NaN NaN
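If you wanted to keep the last occurrence of each (B, C) pair instead of the first, duplicated accepts a keep parameter (a small variant, not part of the original answer):
# blank out every occurrence except the last one in each (B, C) group
df[['B','C']] = df[['B','C']].mask(df.duplicated(['B','C'], keep='last'))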

Related

Is there a way to associate the value of a row with another row in Excel using Python

I created a df from the data in my Excel sheet, and in a specific column I have a lot of values that are the same, but some of them are different. What I want to do is find which rows these different values are in and associate each one with another value from the same row. I will give an example:
ColA ColB
'Ship' 5
'Ship' 5
'Car' 3
'Ship' 5
'Plane' 2
Following the example, is there a way for the code to find where the values different from 5 are and give me the respective value from ColA? In this case it would find 3 and 2, returning 'Car' and 'Plane', respectively.
Any help is welcome! :)
It depends on exactly what you want to do, but you could use:
a filter - to flag the value you seek.
.where - to keep only the values where that flag is False.
Given the above dataframe the following would work:
df['different'] = df['ColB']==5
df['type'] = df['ColA'].where(df['different']==False)
print(df)
Which returns this:
ColA ColB different type
0 Ship 5 True NaN
1 Ship 5 True NaN
2 Car 3 False Car
3 Ship 5 True NaN
4 Plane 2 False Plane
The 4th column has what you seek...
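The same result is also available without the helper columns; a hedged one-liner using boolean indexing on ColB:
# ColA values for the rows whose ColB differs from 5
print(df.loc[df['ColB'] != 5, 'ColA'])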

conditionally filling blanks in pandas according multiple columns from other rows

I have a huge df with missing entries in the brand column that need to be filled in based on other rows: if all three other columns match, fill the blank with the existing brand; otherwise fill with 'Other'.
if this is my starting df:
df_start = pd.DataFrame({'device_id':[1,1,1,1,2,2,3,3,3,3,4,4,4,4],
'head':['a','a','b','b','a','b','a','b','b','b','a','b','c','d'],
'supplement':['Salt','Salt','Pepper','Pepper','Pepper','Pepper','Salt','Pepper','Salt','Pepper','Pepper','Salt','Pepper','Salt'],
'brand':['white',np.nan,np.nan,'white','white','black',np.nan,np.nan,'white','black',np.nan,'white','black',np.nan]})
how to get this result:
df_end = pd.DataFrame({'device_id':[1,1,1,1,2,2,3,3,3,3,4,4,4,4],
'head':['a','a','b','b','a','b','a','b','b','b','a','b','c','d'],
'supplement':['Salt','Salt','Pepper','Pepper','Pepper','Pepper','Salt','Pepper','Salt','Pepper','Pepper','Salt','Pepper','Salt'],
'brand':['white','white','white','white','white','black','Other','black','white','black','Other','white','black','Other']})
You could try a groupby on the columns that need to be the same, in your case 'device_id', 'head', 'supplement', then use forward fill ffill() and backward fill bfill(), and at the very end fillna() with 'Other', since the leftovers will be the ones with no identical rows in those 3 columns:
result = df_start.groupby(['device_id','head','supplement'])\
.apply(lambda x: x.ffill().bfill().fillna('Other'))
prints:
>>> result
device_id head supplement brand
0 1 a Salt white
1 1 a Salt white
2 1 b Pepper white
3 1 b Pepper white
4 2 a Pepper white
5 2 b Pepper black
6 3 a Salt Other
7 3 b Pepper black
8 3 b Salt white
9 3 b Pepper black
10 4 a Pepper Other
11 4 b Salt white
12 4 c Pepper black
13 4 d Salt Other
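A sketch of an equivalent approach that works on just the brand column via transform (same df_start assumed); it avoids returning the whole frame from apply:
df_start['brand'] = (df_start.groupby(['device_id', 'head', 'supplement'])['brand']
                             .transform(lambda s: s.ffill().bfill())
                             .fillna('Other'))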
A solution that avoids a (costly) groupby, based on a simple mapping.
from collections import defaultdict
# create a mapping (a defaultdict keyed by ('device_id', 'head', 'supplement'))
# that returns 'Other' for missing keys
mapping = defaultdict(lambda: 'Other')
mapping.update(df_start.dropna()\
.set_index(['device_id', 'head', 'supplement'])['brand']\
.to_dict())
# apply function using the mapping to get the brand
brand = df_start.iloc[:, :-1].apply(lambda row: mapping[tuple(row)], axis=1)
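To finish, assign the result back (a final step not shown in the original answer):
df_start['brand'] = brand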
You can replace the NaN values in the brand column after the creation of the dataframe. This may not be the most efficient way, but it is the simplest one.
df['brand'] = df['brand'].replace(np.nan, "Other")

How to find different values of different columns from the same row

I'm working with a dataframe structured like this one below:
A B C D
group_997 NaN Purple JMHKJALJ_06538
group_998 NaN Pink JMHKJALJ_04556
group_999 NaN Red JMHKJALJ_09211
group_995 NaN Green JMHKJALJ_16378
group_996 NaN Yellow JMHKJALJ_81324
Now I'd like to find the value in column A (that is, group_995) corresponding to JMHKJALJ_16378 in column D.
I tried to use:
df[df['A'].str.contains('JMHKJALJ_16378')]
But the output is not correct.
How can I do that?
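No answer was posted, but str.contains is being run on the wrong column: the identifier lives in column D, not A. A likely fix (a sketch, under that assumption) is to filter on D and select A:
# the A value(s) for rows whose D column matches the identifier
print(df.loc[df['D'] == 'JMHKJALJ_16378', 'A'])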

Unexpected Result When Filtering PANDAS DF Row

I typically filter a pandas DataFrame using the following syntax:
FDF = DF[DF['Color'] == 'Blue']
I expect FDF, my filtered DataFrame, to return just the rows where the Color column is set to Blue. Instead, I get something like this. Funny thing is, the program used to work as expected, but stopped working after I upgraded my operating system and re-installed Python and all of the libraries. Also, it does not do this on all of my DataFrames. Any ideas?
0 Color Shape Data
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 Blue NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
Edit: I think the first 3 responses misunderstood the question. I am showing the result, not the original DF. My original DF looks like this:
Color Shape Data
0 Green square Y
1 Red triangle N
2 Red circle Y
3 Blue circle N
4 Green square N
5 Red triangle N
The result I am expecting is:
Color Shape Data
3 Blue circle N
('Color', 'Shape', 'Data') is NOT your column names but your first row of data; otherwise there wouldn't be an index 0 assigned to that row. Since DF has no column named 'Color', DF['Color'] == 'Blue' doesn't filter out anything, and therefore it returns all records.
If you imported your data from a csv or Excel sheet, I'd suggest you specify using the first row of your file as column names.
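With read_csv that would look like this (a sketch, assuming a CSV source named data.csv):
import pandas as pd

# header=0 (the default) uses the file's first line as column names
DF = pd.read_csv('data.csv', header=0)
FDF = DF[DF['Color'] == 'Blue']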
You are missing DF.loc, so you are getting unnecessary rows.
Make the first row the column header:
DF = DF.rename(columns=DF.iloc[0]).drop(DF.index[0])
Then use the below to get just the rows where color is blue :
FDF=DF.loc[DF['Color'] =='Blue']
You're trying to filter by a column name which does not exist. First make your first row the column header:
DF.columns = DF.iloc[0]
DF = DF.reindex(DF.index.drop(0))
Now filter using
FDF = DF[DF['Color'] == 'Blue']
Not sure I understand the negative ratings on this question. However, I was able to work around the issue by assigning a new index and renaming the columns.

Inner joining two columns in Pandas

I have a rather basic pandas question, but I've tried merge and join without success.
Edit: these are in the same dataframe, and that wasn't clear. We are indeed condensing the data.
print df
product_code_shipped quantity product_code
0 A12395 1 A12395
1 H53456 4 D78997
2 A13456 3 E78997
3 A12372 8 A13456
4 E28997 1 D83126
5 B78997 2 C64516
6 C78117 9 B78497
7 B78227 1 H53456
8 B78497 2 J12372
So I want to have just one product code column, holding the unique product codes of the shipped products, alongside their other data such as quantity (and, in another column, color). How do I do this inside the same dataframe?
So I should get
print df2
product_code_shipped quantity product_code color
0 A12395 1 A12395 red
1 H53456 4 H53456 blue
2 B78497 2 B78497 yellow
I'm a little confused by your question, specifically where "unique product codes" enter in...are we condensing the data? The example does not make that clear. Nonetheless I'll give it a shot:
Many DataFrame methods rely on the indexes to automatically align data. In your case, it seems convenient to set the index of these DataFrames to the product code. So you'd have this:
In [132]: shipped
Out[132]:
quantity
product_code_shipped
A 1
B 4
C 2
In [133]: info
Out[133]:
color
product_code
A red
B blue
C yellow
Now, join requires no extra parameters; it gives you exactly what (I think) you want.
In [134]: info.join(shipped)
Out[134]:
color quantity
product_code
A red 1
B blue 4
C yellow 2
If this doesn't answer your question, please clarify it by giving example input including where color comes from and the exact output that would come from that input.
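A self-contained version of that example (the frames are hypothetical, mirroring the output above):
import pandas as pd

shipped = pd.DataFrame({'quantity': [1, 4, 2]},
                       index=pd.Index(['A', 'B', 'C'], name='product_code_shipped'))
info = pd.DataFrame({'color': ['red', 'blue', 'yellow']},
                    index=pd.Index(['A', 'B', 'C'], name='product_code'))

# join aligns on index values, so no extra parameters are needed
print(info.join(shipped))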
