duplicated() function as boolean indexing generates a different result compared with drop_duplicates - python

The data frame looks like the following:
df = pd.DataFrame({'k1':['one']*3 + ['two']*4,'k2':[1,1,2,3,3,4,4]})
When I check for duplicates, I get a boolean index with
df.duplicated(), then I use it as a filter:
df[df.duplicated()], which shows a different result compared with df.drop_duplicates().
An additional row appears in the drop_duplicates result:
2 one 2

drop_duplicates drops all duplicated rows, keeping the first occurrence of each. duplicated returns False for the first occurrence and True for every subsequent duplicate of a row, so they are different functions targeting different problems.
df.duplicated()
0 False
1 True
2 False
3 False
4 True
5 False
6 True
dtype: bool
df.drop_duplicates()
k1 k2
0 one 1
2 one 2
3 two 3
5 two 4
How to make the outputs the same?
Check for the rows that have no duplicate at all:
df[~df.duplicated(keep=False)]
k1 k2
2 one 2
df.drop_duplicates(keep=False)
k1 k2
2 one 2
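To see why the two calls differ, here is a minimal runnable sketch using the question's data: drop_duplicates() is equivalent to indexing with the negated duplicated() mask, while indexing with the mask itself selects the repeated rows.
import pandas as pd

df = pd.DataFrame({'k1': ['one']*3 + ['two']*4, 'k2': [1, 1, 2, 3, 3, 4, 4]})

# drop_duplicates keeps the first occurrence, which is the complement of duplicated()
print(df.drop_duplicates().equals(df[~df.duplicated()]))  # True

# indexing with the mask itself selects the repeated rows (1, 4 and 6 here)
print(df[df.duplicated()])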

Related

How to find which columns have duplicate values within groups of a dataframe groupby (python)?

First, I have a df; when I groupby it on a column, will it remove duplicate values?
Second, how can I know which groups have duplicate values? (I tried to find out how to know which columns of a df have duplicate values but couldn't find anything; existing answers only discuss whether each individual element is duplicated or not.)
For example, I have a df like this:
A B C
1 1 2 3
2 1 4 3
3 2 2 2
4 2 3 4
5 2 2 3
after groupby('A')
A B C
1 2 3
4 3
2 2 2
3 2
2 3
I want to know how many groups of A have duplicates in B, and how many have duplicates in C.
result:
B C
1 1 2
or, even better, calculate a percentage:
B : 50%
C : 100%
thanks
You could use a lambda function inside GroupBy.agg to check whether the number of unique values in a group differs from the number of values in that group. Series.nunique gives the number of unique values and Series.size the total number of values in the group.
df.groupby(level=0).agg(lambda x: x.size!=x.nunique())
# B C
# 1 False True
# 2 True False
Let us try
out = df.groupby(level=0).agg(lambda x : x.duplicated().any())
B C
1 False True
2 True False
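To also get the percentage the question asks for, one option (a sketch that assumes "percent" means the share of groups in which the column contains at least one duplicated value) is to take the mean of the boolean result:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 2],
                   'B': [2, 4, 2, 3, 2],
                   'C': [3, 3, 2, 4, 3]}).set_index('A')

# True where a group contains at least one duplicated value in that column
dup = df.groupby(level=0).agg(lambda x: x.duplicated().any())

# mean of booleans = fraction of groups flagged; multiply by 100 for percent
print(dup.mean() * 100)
# B    50.0
# C    50.0
# dtype: float64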

Pandas, fill new column with existing values dependent on conditions of another column [duplicate]

It's easy to create a vector of bools in pandas for testing values, such as
DF['a'] > 10
but how do you write
DF['a'] in list
to generate a vector of bools based on whether each value in the Series is a member of some list? I am getting a ValueError.
I know I can loop through the data pretty simply, but is this possible without having to do that?
Use the isin method:
DF['a'].isin(list)
Example:
import numpy as np
import pandas as pd

DF = pd.DataFrame({'a': np.arange(5), 'b': np.arange(5) * 2})
print(DF)
   a  b
0  0  0
1  1  2
2  2  4
3  3  6
4  4  8
print(DF['a'].isin([0, 2, 3]))
0     True
1    False
2     True
3     True
4    False
Name: a, dtype: bool
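The resulting boolean Series can then be used directly to filter rows, for example:
print(DF[DF['a'].isin([0, 2, 3])])
   a  b
0  0  0
2  2  4
3  3  6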

pandas: Select rows by diff with previous row, but only one time per row

I have a dataset like below
ID value
1 10
2 15
3 18
4 30
5 35
I would like to keep all the rows where value minus the value of the previous row is <= 5, so I do
df['diff'] = df.value.diff()
df = df[df['diff'] <= 5]
Then I will have
ID value diff
2 15 5
3 18 3
5 35 5
However, I don't want to keep row 3, because row 2 is kept due to row 1, and as row 1 and row 2 become a pair, row 3 should not be paired with row 2 anymore.
How could I do that using pandas? Indeed I can write a for loop but it is not the best idea.
So you have a mask that checks whether the difference to the previous row is <= 5:
>>> d = df.value.diff().le(5)
>>> d
1 False
2 True
3 True
4 False
5 True
Rows marked with True will be kept, but you don't want to keep a True row if the previous row was also True.
Then we can shift this mask, negate it, and & it with the original to turn the Trues that have a True in the previous row into False:
>>> d & ~d.shift(fill_value=False)
1 False
2 True
3 False
4 False
5 True
fill_value is needed because the shift would otherwise introduce a NaN, and a float Series cannot be bitwise-negated. Putting False there has no effect other than avoiding that issue.
Now we can select the rows from the dataframe with this resultant mask:
>>> wanted = d & ~d.shift(fill_value=False)
>>> df[wanted]
ID value
2 15
5 35
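Putting the pieces together as a self-contained sketch (column names follow the question, with ID used as the index):
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'value': [10, 15, 18, 30, 35]}).set_index('ID')

# True where the gap to the previous row is <= 5
d = df.value.diff().le(5)

# drop a True row whose previous row was already True (already paired up)
wanted = d & ~d.shift(fill_value=False)

print(df[wanted])
#     value
# ID
# 2      15
# 5      35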

Confusion regarding duplicates in a data frame in python

I have a Pandas dataframe for which I am checking for duplicates. I get the following output, but I don't know why it shows them as duplicates. Aren't all the column values in a row supposed to be the same for it to be shown as a duplicate? Please correct me if I am wrong, I am a newbie in Python.
By default, duplicated does not count the first occurrence of a duplicated row.
>>> df = pd.DataFrame([[1,1], [2,2], [1,1], [1,1]])
>>> df
0 1
0 1 1
1 2 2
2 1 1
3 1 1
>>> df.duplicated()
0 False
1 False
2 True
3 True
dtype: bool
This means that your df[dup] will contain only unique rows if each duplicated row in df appeared only twice.
You can adjust this behavior with the keep argument.
>>> df.duplicated(keep=False)
0 True
1 False
2 True
3 True
dtype: bool
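As a quick check on the keep argument, the default mask selects only the repeats, while keep=False selects every occurrence:
print(df[df.duplicated()])            # rows 2 and 3 (the repeats of row 0)
print(df[df.duplicated(keep=False)])  # rows 0, 2 and 3 (all occurrences)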

python dataframe boolean values with if statement

I want to make an if statement to show all REF_INT values that are duplicated. I tried this:
(df_picru['REF_INT'].value_counts()==1)
and it shows me all values as True or False, but I don't want to do something like this:
if (df_picru['REF_INT'].value_counts()==1):
    print(df_picru['REF_INT'])
In [28]: df_picru['new'] = \
df_picru['REF_INT'].duplicated(keep=False) \
.map({True:'duplicates',False:'unique'})
In [29]: df_picru
Out[29]:
REF_INT new
0 1 unique
1 2 duplicates
2 3 unique
3 8 duplicates
4 8 duplicates
5 2 duplicates
I think you need duplicated for the boolean mask and numpy.where for the new column:
mask = df_picru['REF_INT'].duplicated(keep=False)
Sample:
import numpy as np
import pandas as pd

df_picru = pd.DataFrame({'REF_INT':[1,2,3,8,8,2]})
mask = df_picru['REF_INT'].duplicated(keep=False)
print (mask)
0 False
1 True
2 False
3 True
4 True
5 True
Name: REF_INT, dtype: bool
df_picru['new'] = np.where(mask, 'duplicates', 'unique')
print (df_picru)
REF_INT new
0 1 unique
1 2 duplicates
2 3 unique
3 8 duplicates
4 8 duplicates
5 2 duplicates
If you need to check whether at least one value is duplicated, use any to convert the boolean mask (an array) into a scalar True or False:
if mask.any():
    print('at least one duplicate')
at least one duplicate
Another solution using groupby.
#groupby REF_INT and then count the occurrence and set as duplicate if count is greater than 1
df_picru.groupby('REF_INT').apply(lambda x: 'Duplicated' if len(x)> 1 else 'Unique')
Out[21]:
REF_INT
1 Unique
2 Duplicated
3 Unique
8 Duplicated
dtype: object
value_counts can actually work if you make a minor change:
df_picru.REF_INT.value_counts()[lambda x: x>1]
Out[31]:
2 2
8 2
Name: REF_INT, dtype: int64
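If the goal is to pull out the duplicated rows themselves rather than the counts, a small sketch building on that result:
counts = df_picru['REF_INT'].value_counts()
dup_values = counts[counts > 1].index

# rows whose REF_INT occurs more than once (same rows as duplicated(keep=False))
print(df_picru[df_picru['REF_INT'].isin(dup_values)])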
