Pandas Drop Duplicates And Store Duplicates [duplicate] - python

This question already has answers here:
How do I get a list of all the duplicate items using pandas in python?
(13 answers)
Closed 2 months ago.
I use pandas.DataFrame.drop_duplicates to remove duplicate rows from a DataFrame, and it works great. However, I would like to know which rows were removed.
Is there a way to save the removed data in a new list or DataFrame before dropping it?
Unfortunately, I found no information on this in the pandas documentation.
Thanks in advance.

drop_duplicates uses the duplicated function to identify the duplicated rows. By default (keep='first') the first occurrence is marked False and every subsequent occurrence is marked True. By applying this function as a filter on the original data, you can see which rows are kept and which are dropped.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html
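As a minimal sketch of that idea (assuming whole-row duplicates and the default keep='first'; the sample data is made up):
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2, 3]})

# True for every row that repeats an earlier one (keep='first')
mask = df.duplicated()

removed = df[mask]    # the rows drop_duplicates would discard
kept = df[~mask]      # equivalent to df.drop_duplicates()

print(removed)
print(kept)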

You can use duplicated and boolean indexing with groupby.agg to keep the list of duplicates:
m = df.duplicated('group')
dropped = df[m].groupby(df['group'])['value'].agg(list)
print(dropped)
df = df[~m]
print(df)
Output:
# print(dropped)
group
A       [2]
B    [4, 5]
C       [7]
Name: value, dtype: object

# print(df)
  group  value
0     A      1
2     B      3
5     C      6
Used input:
  group  value
0     A      1
1     A      2
2     B      3
3     B      4
4     B      5
5     C      6
6     C      7
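As a small variation (not part of the answer above): if you instead want every row that belongs to a duplicated group, including the first occurrence, duplicated accepts keep=False:

all_dupes = df[df.duplicated('group', keep=False)]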

Related

Combining pandas rows based on condition [duplicate]

This question already has answers here:
Pandas groupby with delimiter join
(2 answers)
Concatenate strings from several rows using Pandas groupby
(8 answers)
Closed 3 years ago.
Given a pandas DataFrame df with the column names 'Session' and 'List':
Can I group together the 'List' values for the same values of 'Session'?
My Approach
I've tried solving the problem by creating a new dataframe and iterating through the rows of the initial dataframe while maintaining a session counter that I increment when I see that the session has changed.
If it hasn't changed, I append the List value that corresponds to that row's value, followed by a comma.
Whenever the session changes, I use strip to get rid of the last (extra) comma.
Initial DataFrame
   Session List
0        1    a
1        1    b
2        1    c
3        2    d
4        2    e
5        3    f
Required DataFrame
   Session   List
0        1  a,b,c
1        2    d,e
2        3      f
Can someone suggest something more efficient or simple?
Thank you in advance.
Use groupby with agg and reset_index:
>>> df.groupby('Session')['List'].agg(','.join).reset_index()
   Session   List
0        1  a,b,c
1        2    d,e
2        3      f
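One caveat worth noting: if the List column can hold non-string values, ','.join will raise a TypeError. A common workaround (an assumption about the data, not taken from the question) is to cast first:

df.groupby('Session')['List'].agg(lambda s: ','.join(s.astype(str))).reset_index()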

Remove rows which are matched by multiple values [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I have two data frames. One contains the main data (called dtt_main) and could be huge; the other one (called dtt_selected) contains only two columns, which also exist in the main data frame. For every entry in dtt_selected, I want to check whether the same values appear in dtt_main. If so, those rows should be removed (these values are not unique in dtt_main, so a single entry may remove multiple rows). I managed to write a small function that does exactly this, but it is really slow because I have to iterate over both dataframes simultaneously. I would be very happy about a faster, more pandas-like solution. Thanks!
# The real data set contains ~100_000 rows and ~1000 columns
dtt_main = pd.DataFrame({
    'a': [1, 1, 1, 2, 2, 4, 5, 4],
    'b': [1, 1, 2, 2, 3, 3, 4, 6],
    'data': list('abcdefgh')
})
dtt_selected = pd.DataFrame({
    'a': [1, 1, 2, 4],
    'b': [1, 5, 3, 6]
})

def remove_selected(dtt_main, dtt_selected):
    for row_select in dtt_selected.itertuples():
        for row_main in dtt_main.itertuples():
            # First entry of the tuples is the index!
            if (row_select[1] == row_main[1]) & (row_select[2] == row_main[2]):
                dtt_main.drop(row_main[0], axis='rows', inplace=True)

remove_selected(dtt_main, dtt_selected)
print(dtt_main)
>>>    a  b data
>>> 2  1  2    c
>>> 3  2  2    d
>>> 5  4  3    f
>>> 6  5  4    g
You could left join the DataFrames using pd.merge. Setting indicator=True adds a column _merge, which is 'both' for rows that also occur in dtt_selected (and therefore should be dropped) and 'left_only' for rows that appear only in dtt_main (and thus should be kept). In the next line, you first keep only the rows marked 'left_only' and then drop the now unnecessary '_merge' column:
df1 = dtt_main.merge(dtt_selected, how='left', indicator=True)
df1[df1['_merge'] == 'left_only'].drop(columns='_merge')
# Output
#    a  b data
# 2  1  2    c
# 3  2  2    d
# 5  4  3    f
# 6  5  4    g
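If you prefer to avoid the helper column, an equivalent anti-join can be written with a MultiIndex membership test. This is just an alternative sketch, not part of the answer above:

import pandas as pd

keys_main = pd.MultiIndex.from_frame(dtt_main[['a', 'b']])
keys_sel = pd.MultiIndex.from_frame(dtt_selected[['a', 'b']])

# keep only the rows of dtt_main whose (a, b) pair is absent from dtt_selected
result = dtt_main[~keys_main.isin(keys_sel)]
print(result)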

subset the dataframe into a new one using copy [duplicate]

This question already has answers here:
why should I make a copy of a data frame in pandas
(8 answers)
Closed 4 years ago.
I have a dataframe df
   a  b   c
0  5  6   9
1  6  7  10
2  7  8  11
3  8  9  12
So if I want to select only columns a and b and store them in another dataframe, I would use something like this:
df1 = df[['a','b']]
But I have seen places where people write it this way
df1 = df[['a','b']].copy()
Can anyone let me know what .copy() does, since the earlier code works just fine?
For example, suppose you modify a dataframe in place (example using replace with inplace=True):
df2 = df
df2.replace('blah', 'foo', inplace=True)
Here df2 is just another name for the same object, so the replacement also hits df, and
df == df2
will be True everywhere.
If you want the change to apply only to df2, take a copy first:
df2 = df.copy()
df2.replace('blah', 'foo', inplace=True)
Now
df == df2
returns False wherever a value was replaced.
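A minimal runnable sketch of the difference (the column names and values here are made up for illustration):

import pandas as pd

df = pd.DataFrame({'a': [5, 6], 'b': [6, 7]})

alias = df               # same object, not a copy
alias.loc[0, 'a'] = 99   # modifies df as well
print(df.loc[0, 'a'])    # 99

df2 = df.copy()          # independent copy
df2.loc[0, 'a'] = 0      # df stays untouched
print(df.loc[0, 'a'])    # still 99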

How can I drop rows in a dataframe efficiently if a specific column contains a substring [duplicate]

This question already has answers here:
Pandas filtering for multiple substrings in series
(3 answers)
Closed 4 years ago.
I tried
df = df[~df['event.properties.comment'].isin(['Extra'])]
The problem is that this only drops a row when the column value is exactly 'Extra'; I need to drop rows that contain it even as a substring.
Any help?
You can combine multiple substring checks with an or condition if needed; for your requirement, a single str.contains test on 'Extra' is enough, so values such as '~' are retained. fillna('') avoids NaN results that would break the boolean mask.
Considered df:
   vals         ids
0     1           ~
1     2       bball
2     3         NaN
3     4  Extra text
df[~df.ids.fillna('').str.contains('Extra')]
Out:
   vals    ids
0     1      ~
1     2  bball
2     3    NaN
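For the multiple-substring case that the linked duplicate covers, a common pattern is to join the substrings into a single regex; the list below is hypothetical:

substrings = ['Extra', 'Bonus']   # hypothetical values to drop on
pattern = '|'.join(substrings)
df = df[~df['event.properties.comment'].str.contains(pattern, na=False)]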

Error subsetting a data frame in python [duplicate]

This question already has an answer here:
Python - splitting dataframe into multiple dataframes based on column values and naming them with those values [duplicate]
(1 answer)
Closed 4 years ago.
I am learning python and pandas and am having trouble overcoming an error while trying to subset a data frame.
I have an input data frame:
df0 -
Index  Group  Value
1      A       10
2      A       15
3      B       20
4      C       10
5      C       10
df0.dtypes -
Group     object
Value    float64
I am trying to split df0 into separate data frames based on the unique values of the Group column, with the output looking something like this:
df1 -
Index  Group  Value
1      A       10
2      A       15
df2 -
Index  Group  Value
3      B       20
df3 -
Index  Group  Value
4      C       10
5      C       10
So far I have written this code to subset the input:
UniqueGroups = df0['Group'].unique().tolist()
OutputFrame = {}
for x in UniqueAgencies:
    ReturnFrame[str('ConsolidateReport_')+x] = UniqueAgencies[df0['Group']==x]
The code above returns the following error, which I can't quite wrap my head around. Can anyone point me in the right direction?
*** TypeError: list indices must be integers or slices, not str
You can use groupby to group by the column:
for _, g in df0.groupby('Group'):
    print(g)
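If you want the named dictionary of sub-frames the question was aiming for, a minimal sketch built on the same groupby (the key prefix is taken from the question's own code):

OutputFrame = {}
for name, g in df0.groupby('Group'):
    OutputFrame['ConsolidateReport_' + name] = g

print(OutputFrame['ConsolidateReport_A'])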
