Confusion regarding duplicates in a data frame in Python

I have a Pandas dataframe for which I am checking for duplicates. I get the following output, but I don't know why it's showing them as duplicates. Aren't all the column values in a row supposed to be the same for it to be shown as a duplicate? Please correct me if I am wrong; I am a newbie in Python.

By default, duplicated does not count the first occurrence of a duplicated row.
>>> df = pd.DataFrame([[1,1], [2,2], [1,1], [1,1]])
>>> df
0 1
0 1 1
1 2 2
2 1 1
3 1 1
>>> df.duplicated()
0 False
1 False
2 True
3 True
dtype: bool
This means that your df[dup] will contain only unique rows if each duplicated row in df appeared exactly twice.
You can adjust this behavior with the keep argument.
>>> df.duplicated(keep=False)
0 True
1 False
2 True
3 True
dtype: bool
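A quick way to see only the repeated rows themselves (every copy, including the first) is to combine keep=False with boolean indexing. A small sketch using the same frame as above:

```python
import pandas as pd

df = pd.DataFrame([[1, 1], [2, 2], [1, 1], [1, 1]])

# keep=False marks every member of a duplicate group as True,
# so this selects all three copies of the [1, 1] row.
all_dups = df[df.duplicated(keep=False)]
print(all_dups.index.tolist())  # → [0, 2, 3]
```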


df.duplicated() not finding duplicates

I am trying to run this code.
import pandas as pd
df = pd.DataFrame({'A':['1','2'],
'B':['1','2'],
'C':['1','2']})
print(df.duplicated())
It gives me this output:
0 False
1 False
dtype: bool
I want to know why it is showing index 1 as False and not True.
I'm expecting this output:
0 False
1 True
dtype: bool
I'm using Python 3.11.1 and Pandas 1.4.4
duplicated works on full rows (or on a subset of the columns if the subset parameter is used).
Here you don't have any duplicates:
A B C
0 1 1 1 # this row is unique
1 2 2 2 # this one is also unique
I believe you might want duplication column-wise?
df.T.duplicated()
Output:
A False
B True
C True
dtype: bool
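If the goal is to drop the duplicated columns rather than just flag them, the same transpose trick can feed boolean indexing on the columns. A sketch building on the answer above:

```python
import pandas as pd

df = pd.DataFrame({'A': ['1', '2'], 'B': ['1', '2'], 'C': ['1', '2']})

# df.T.duplicated() flags columns whose values repeat an earlier column;
# negating it keeps only the first occurrence of each distinct column.
deduped = df.loc[:, ~df.T.duplicated()]
print(deduped.columns.tolist())  # → ['A']
```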
You are not getting the expected output because you don't have duplicates to begin with. I added duplicate rows to the end of your dataframe, and this is closer to what you are looking for:
import pandas as pd
df = pd.DataFrame({'A':['1','2'],
'B':['1','2'],
'C':['1','2']})
df = pd.concat([df]*2)
df
A B C
0 1 1 1
1 2 2 2
0 1 1 1
1 2 2 2
df.duplicated(keep='first')
Output:
0 False
1 False
0 True
1 True
dtype: bool
And if you want to keep duplicates the other way around:
df.duplicated(keep='last')
0 True
1 True
0 False
1 False
dtype: bool
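Related: duplicated can also be restricted to certain columns via the subset parameter, which treats rows as duplicates when only those columns match. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': ['1', '2', '1'],
                   'B': ['x', 'y', 'z']})

# Full-row check: no duplicates, because column B differs everywhere.
print(df.duplicated().tolist())              # → [False, False, False]

# Restricted to column A: row 2 repeats row 0's value.
print(df.duplicated(subset=['A']).tolist())  # → [False, False, True]
```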

Pandas, fill new column with existing values dependent on conditions of another column [duplicate]

It's easy to create a vector of bools in pandas for testing values, such as
DF['a'] > 10
but how do you write
DF['a'] in list
to generate a vector of bools based on membership of each value in the Series in some list or other? I am getting a ValueError.
I know I can loop through the data pretty simply, but is this possible without having to do that?
Use the isin method:
DF['a'].isin(list)
Example:
import pandas as pd
import numpy as np

DF = pd.DataFrame({'a':np.arange(5),'b':np.arange(5)*2})
print(DF)
a b
0 0 0
1 1 2
2 2 4
3 3 6
4 4 8
print(DF['a'].isin([0,2,3]))
0 True
1 False
2 True
3 True
4 False
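To go from the boolean mask to the rows themselves, feed the isin result straight into the indexer; ~ gives the complement:

```python
import pandas as pd
import numpy as np

DF = pd.DataFrame({'a': np.arange(5), 'b': np.arange(5) * 2})

members = DF[DF['a'].isin([0, 2, 3])]       # rows whose 'a' is in the list
non_members = DF[~DF['a'].isin([0, 2, 3])]  # everything else

print(members.index.tolist())      # → [0, 2, 3]
print(non_members.index.tolist())  # → [1, 4]
```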

Duplicated() function as boolean indexing generates a different result compared with drop_duplicates

The data frame looks like the following:
df = pd.DataFrame({'k1':['one']*3 + ['two']*4,'k2':[1,1,2,3,3,4,4]})
When I check for duplicates, I get a boolean index with
df.duplicated(), then I use it as a filter:
df[df.duplicated()], which shows a different result compared with df.drop_duplicates().
An additional row appears in the result:
2 one 2
drop_duplicates drops all duplicated rows after the first. duplicated returns False for the first occurrence and True for the other rows of each duplicate group, so they are different functions targeting different problems.
df.duplicated()
0 False
1 True
2 False
3 False
4 True
5 False
6 True
dtype: bool
df.drop_duplicates()
k1 k2
0 one 1
2 one 2
3 two 3
5 two 4
How do I make the output the same?
Check for the unique values:
df[~df.duplicated(keep=False)]
k1 k2
2 one 2
df.drop_duplicates(keep=False)
k1 k2
2 one 2
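As a sanity check, the boolean-indexing form and drop_duplicates agree once the same keep value is used on both sides:

```python
import pandas as pd

df = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                   'k2': [1, 1, 2, 3, 3, 4, 4]})

# Both keep only the rows that have no duplicate anywhere in the frame.
via_mask = df[~df.duplicated(keep=False)]
via_drop = df.drop_duplicates(keep=False)
print(via_mask.equals(via_drop))  # → True
```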

python dataframe boolean values with if statement

I want to make an if statement to show all REF_INT values that are duplicated. I tried this:
(df_picru['REF_INT'].value_counts()==1)
and it shows me all values as True or False, but I don't want to do something like this:
if (df_picru['REF_INT'].value_counts()==1)
print "df_picru['REF_INT']"
In [28]: df_picru['new'] = \
df_picru['REF_INT'].duplicated(keep=False) \
.map({True:'duplicates',False:'unique'})
In [29]: df_picru
Out[29]:
REF_INT new
0 1 unique
1 2 duplicates
2 3 unique
3 8 duplicates
4 8 duplicates
5 2 duplicates
I think you need duplicated for the boolean mask and numpy.where for the new column:
mask = df_picru['REF_INT'].duplicated(keep=False)
Sample:
df_picru = pd.DataFrame({'REF_INT':[1,2,3,8,8,2]})
mask = df_picru['REF_INT'].duplicated(keep=False)
print (mask)
0 False
1 True
2 False
3 True
4 True
5 True
Name: REF_INT, dtype: bool
df_picru['new'] = np.where(mask, 'duplicates', 'unique')
print (df_picru)
REF_INT new
0 1 unique
1 2 duplicates
2 3 unique
3 8 duplicates
4 8 duplicates
5 2 duplicates
If you need to check whether at least one row is a duplicate, use any to convert the boolean mask (an array) to a scalar True or False:
if mask.any():
    print('at least one duplicate')
at least one duplicate
Another solution using groupby.
#groupby REF_INT and then count the occurrence and set as duplicate if count is greater than 1
df_picru.groupby('REF_INT').apply(lambda x: 'Duplicated' if len(x)> 1 else 'Unique')
Out[21]:
REF_INT
1 Unique
2 Duplicated
3 Unique
8 Duplicated
dtype: object
value_counts can actually work if you make a minor change:
df_picru.REF_INT.value_counts()[lambda x: x>1]
Out[31]:
2 2
8 2
Name: REF_INT, dtype: int64
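If you want the duplicated values themselves (to print them, as in the question), the index of that filtered value_counts is enough. A small sketch:

```python
import pandas as pd

df_picru = pd.DataFrame({'REF_INT': [1, 2, 3, 8, 8, 2]})

# Values appearing more than once; the index holds the values themselves.
counts = df_picru['REF_INT'].value_counts()
dup_values = counts[counts > 1].index.tolist()
print(sorted(dup_values))  # → [2, 8]
```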

Drop row in pandas dataframe if any value in the row equals zero

How do I drop a row if any of the values in the row equal zero?
I would normally use df.dropna() for NaN values but not sure how to do it with "0" values.
I think the easiest way is looking at rows where all values are not equal to 0:
df[(df != 0).all(1)]
You could make a boolean frame and then use any:
>>> df = pd.DataFrame([[1,0,2],[1,2,3],[0,1,2],[4,5,6]])
>>> df
0 1 2
0 1 0 2
1 1 2 3
2 0 1 2
3 4 5 6
>>> df == 0
0 1 2
0 False True False
1 False False False
2 True False False
3 False False False
>>> df = df[~(df == 0).any(axis=1)]
>>> df
0 1 2
1 1 2 3
3 4 5 6
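The two answers above are the same test written two ways: "all values non-zero" and "not any value zero". A quick sketch confirming they select identical rows:

```python
import pandas as pd

df = pd.DataFrame([[1, 0, 2], [1, 2, 3], [0, 1, 2], [4, 5, 6]])

keep_all = df[(df != 0).all(axis=1)]   # every value in the row is non-zero
keep_any = df[~(df == 0).any(axis=1)]  # no value in the row is zero
print(keep_all.equals(keep_any))  # → True
print(keep_all.index.tolist())    # → [1, 3]
```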
Although it is late, someone else might find it helpful.
I had a similar issue, but the following worked best for me:
df = pd.read_csv(r'your file')
df = df[df['your column name'] != 0]
reference:
Drop rows with all zeros in pandas data frame
see #ikbel benabdessamad
Assume a simple DataFrame as below:
df=pd.DataFrame([1,2,0,3,4,0,9])
Picking non-zero values turns all zero values into NaN; then remove the NaN values:
df=df[df!=0].dropna()
df
Output:
0
0 1.0
1 2.0
3 3.0
4 4.0
6 9.0
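The same mask-then-dropna idea extends to a multi-column frame, with the caveat that masking converts integer columns to float (the zeros become NaN):

```python
import pandas as pd

df = pd.DataFrame([[1, 0, 2], [1, 2, 3], [0, 1, 2], [4, 5, 6]])

# Zeros become NaN, then dropna removes any row that contained one.
cleaned = df[df != 0].dropna()
print(cleaned.index.tolist())  # → [1, 3]
```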
