df.duplicated() not finding duplicates - python

I am trying to run this code:
import pandas as pd

df = pd.DataFrame({'A': ['1', '2'],
                   'B': ['1', '2'],
                   'C': ['1', '2']})
print(df.duplicated())
It gives me this output:
0 False
1 False
dtype: bool
I want to know why it shows index 1 as False and not True.
I'm expecting this output:
0 False
1 True
dtype: bool
I'm using Python 3.11.1 and Pandas 1.4.4

duplicated works on full rows (or on a subset of the columns if the subset parameter is used).
Here you don't have any duplicates:
A B C
0 1 1 1 # this row is unique
1 2 2 2 # this one is also unique
I believe you might want to check for duplicates column-wise instead:
df.T.duplicated()
Output:
A False
B True
C True
dtype: bool
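If the goal is to drop the duplicated columns rather than just flag them, the same mask can be inverted and used with loc (a minimal sketch, assuming the df from the question):
df.loc[:, ~df.T.duplicated()]  # keep only the first of each group of identical columns
Output:
   A
0  1
1  2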

You are not getting the expected output because you don't have duplicates to begin with. I added duplicate rows to the end of your dataframe, and this is closer to what you are looking for:
import pandas as pd

df = pd.DataFrame({'A': ['1', '2'],
                   'B': ['1', '2'],
                   'C': ['1', '2']})
df = pd.concat([df]*2)
df
A B C
0 1 1 1
1 2 2 2
0 1 1 1
1 2 2 2
df.duplicated(keep='first')
Output:
0 False
1 False
0 True
1 True
dtype: bool
And if you want to keep the duplicates the other way around:
df.duplicated(keep='last')
0 True
1 True
0 False
1 False
dtype: bool
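duplicated also accepts keep=False, which flags every row that has a duplicate anywhere in the frame (shown here on the same concatenated df):
df.duplicated(keep=False)
0 True
1 True
0 True
1 True
dtype: bool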

Related

Confusion regarding duplicates in a data frame in python

I have a Pandas dataframe for which I am checking for duplicates. I get the following output, but I don't know why it's showing these rows as duplicates. Aren't all the column values in a row supposed to be the same for the row to be shown as a duplicate? Please correct me if I am wrong; I am a newbie in Python.
By default, duplicated does not count the first occurrence of a duplicated row.
>>> df = pd.DataFrame([[1,1], [2,2], [1,1], [1,1]])
>>> df
0 1
0 1 1
1 2 2
2 1 1
3 1 1
>>> df.duplicated()
0 False
1 False
2 True
3 True
dtype: bool
This means that your df[dup] will contain only unique rows if each duplicated row in df appeared exactly twice.
You can adjust this behavior with the keep argument.
>>> df.duplicated(keep=False)
0 True
1 False
2 True
3 True
dtype: bool
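If you want to pull out every copy of the duplicated rows (the df[dup] from the question), the keep=False mask can be used directly for boolean indexing (a small sketch on the same df):
>>> df[df.duplicated(keep=False)]
   0  1
0  1  1
2  1  1
3  1  1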

Drop the row only if all columns contain 0

I am trying to drop rows that have 0 in all 3 columns. I tried the code below, but it dropped all the rows that have 0 in any one of the 3 columns instead.
indexNames = news[ news['contain1']&news['contain2'] &news['contain3']== 0 ].index
news.drop(indexNames , inplace=True)
My CSV file
contain1 contain2 contain3
1 0 0
0 0 0
0 1 1
1 0 1
0 0 0
1 1 1
Using the code above, all of my rows get deleted. Below is the result I wanted instead:
contain1 contain2 contain3
1 0 0
0 1 1
1 0 1
1 1 1
First compare against 0 with DataFrame.ne (not equal), then keep the rows with at least one non-zero value via DataFrame.any, so that only the all-zero rows are removed:
df = news[news.ne(0).any(axis=1)]
# if necessary, filter only the listed columns:
# cols = ['contain1','contain2','contain3']
# df = news[news[cols].ne(0).any(axis=1)]
print (df)
contain1 contain2 contain3
0 1 0 0
2 0 1 1
3 1 0 1
5 1 1 1
Details:
print (news.ne(0))
contain1 contain2 contain3
0 True False False
1 False False False
2 False True True
3 True False True
4 False False False
5 True True True
print (news.ne(0).any(axis=1))
0 True
1 False
2 True
3 True
4 False
5 True
dtype: bool
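An equivalent spelling, if you find it easier to read, drops the rows where every value equals 0 (same result on this data):
df = news[~news.eq(0).all(axis=1)]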
If this is a pandas dataframe, you can sum each row with .sum(axis=1) and drop the rows whose sum is 0. Note this only matches the all-zero rows when the values are non-negative, as in the 0/1 data here.
news_sums = news.sum(axis=1)
indexNames = news.loc[news_sums == 0].index
news.drop(indexNames, inplace=True)
(note: not tested, just from memory)
A simple solution would be to filter on the row sums: news[news.sum(axis=1) != 0]. As above, this assumes non-negative values; see the caveat sketched after the next answer.
Hope this helps :)
You might want to try this:
news[(news.T != 0).any()]
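One caveat on the sum-based answers: they only coincide with the all-zero test when values cannot cancel out. A tiny illustration with hypothetical data (not from the question):
import pandas as pd

tricky = pd.DataFrame({'contain1': [-1], 'contain2': [1], 'contain3': [0]})
tricky[tricky.sum(axis=1) != 0]   # empty: -1 + 1 + 0 sums to 0, so the row is dropped
tricky[tricky.ne(0).any(axis=1)]  # keeps the row: it is not all zeros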

Conditional Formatting on duplicate values using pandas

I have a DataFrame with 2 columns, A and B. I need to highlight all the values that are duplicated anywhere across the two columns using pandas.
For Example
My dataFrame looks like this
A  B
1  1
2  3
4  4
8  8
5  6
4  7
Then the output should be
A  B
1  1  <--- both values highlighted
2  3
4  4  <--- both values highlighted
8  8  <--- both values highlighted
5  6
4  7  <--- value in column A highlighted
How do I do that?
Thanks in advance.
You can use this:
import numpy as np
import pandas as pd

def color_dupes(x):
    c1 = 'background-color:red'
    c2 = ''
    # mark every value that appears more than once anywhere in the frame
    cond = x.stack().duplicated(keep=False).unstack()
    df1 = pd.DataFrame(np.where(cond, c1, c2), columns=x.columns, index=x.index)
    return df1

df.style.apply(color_dupes, axis=None)
# if df has many columns: df.style.apply(color_dupes, axis=None, subset=['A','B'])
Explanation:
First we stack the dataframe to bring all the columns into a single Series, then call duplicated with keep=False to mark every duplicate as True:
df.stack().duplicated(keep=False)
0 A True
B True
1 A False
B False
2 A True
B True
3 A True
B True
4 A False
B False
5 A True
B False
dtype: bool
After this we unstack() the result, which gives a boolean dataframe with the same structure as the original:
df.stack().duplicated(keep=False).unstack()
A B
0 True True
1 False False
2 True True
3 True True
4 False False
5 True False
Once we have this, we assign the background color where the mask is True and no color otherwise, using np.where.
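If you want to keep the highlighting outside a notebook, the resulting Styler can be exported. A minimal usage sketch (the file name is just an example, and the openpyxl engine must be installed):
styled = df.style.apply(color_dupes, axis=None)
styled.to_excel('highlighted_dupes.xlsx', engine='openpyxl')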

Select rows in Pandas which do not contain a specific character

I need something similar to
.str.startswith()
.str.endswith()
but for the middle part of a string.
For example, given the following pd.DataFrame
str_name
0 aaabaa
1 aabbcb
2 baabba
3 aacbba
4 baccaa
5 ababaa
I need to throw away rows 1, 3 and 4, which contain (at least one) letter 'c'.
The position of the specific letter ('c') is not known.
The task is to remove all rows which contain at least one occurrence of the specific letter.
You want df['str_name'].str.contains('c'):
>>> df
str_name
0 aaabaa
1 aabbcb
2 baabba
3 aacbba
4 baccaa
5 ababaa
>>> df['str_name'].str.contains('c')
0 False
1 True
2 False
3 True
4 True
5 False
Name: str_name, dtype: bool
Now, you can "delete" like this
>>> df = df[~df['str_name'].str.contains('c')]
>>> df
str_name
0 aaabaa
2 baabba
5 ababaa
>>>
Edited to add:
If you only want to check the first k characters, you can slice. Suppose k=3:
>>> df.str_name.str.slice(0,3)
0 aaa
1 aab
2 baa
3 aac
4 bac
5 aba
Name: str_name, dtype: object
>>> df.str_name.str.slice(0,3).str.contains('c')
0 False
1 False
2 False
3 True
4 True
5 False
Name: str_name, dtype: bool
Note, Series.str.slice does not behave like a typical Python slice.
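If you'd rather skip the intermediate slice, a regex can express the same "letter within the first k characters" check in one step (a sketch, again with k=3, so 'c' may be preceded by at most two characters):
>>> df.str_name.str.contains('^.{0,2}c')
0 False
1 False
2 False
3 True
4 True
5 False
Name: str_name, dtype: bool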
You can use NumPy (assuming import numpy as np):
df[np.core.chararray.find(df.str_name.values.astype(str), 'c') < 0]
str_name
0 aaabaa
2 baabba
5 ababaa
You can use str.contains()
str_name = pd.Series(['aaabaa', 'aabbcb', 'baabba', 'aacbba', 'baccaa','ababaa'])
str_name.str.contains('c')
This will return a boolean Series.
The following returns the inverse of the above:
~str_name.str.contains('c')
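Note that str.contains treats its pattern as a regular expression by default; for a plain substring test you can pass regex=False, which gives the same result for a single literal letter like 'c':
str_name.str.contains('c', regex=False)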

Drop row in pandas dataframe if any value in the row equals zero

How do I drop a row if any of the values in the row equal zero?
I would normally use df.dropna() for NaN values, but I'm not sure how to do it with "0" values.
I think the easiest way is to look at the rows where all values are not equal to 0:
df[(df != 0).all(1)]
You could make a boolean frame and then use any:
>>> df = pd.DataFrame([[1,0,2],[1,2,3],[0,1,2],[4,5,6]])
>>> df
0 1 2
0 1 0 2
1 1 2 3
2 0 1 2
3 4 5 6
>>> df == 0
0 1 2
0 False True False
1 False False False
2 True False False
3 False False False
>>> df = df[~(df == 0).any(axis=1)]
>>> df
0 1 2
1 1 2 3
3 4 5 6
Although this is late, someone else might find it helpful.
I had a similar issue, and the following worked best for me:
df = pd.read_csv(r'your file')
df = df[df['your column name'] != 0]
Reference: Drop rows with all zeros in pandas data frame (see the answer by ikbel benabdessamad).
Assume a simple DataFrame like the one below:
df = pd.DataFrame([1, 2, 0, 3, 4, 0, 9])
Picking the non-zero values turns the zeros into NaN, and the NaN rows can then be removed:
df = df[df != 0].dropna()
df
Output:
0
0 1.0
1 2.0
3 3.0
4 4.0
6 9.0
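The float output above is a side effect of the intermediate NaNs, which force a float dtype. If you want to keep the original integers, a boolean mask on the column avoids the NaN step entirely (a sketch, assuming the single column is labeled 0 as in the frame above):
df = df[df[0] != 0]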
