Retain only duplicated rows in a pandas dataframe - python

I have a dataframe with two columns: "Agent" and "Client"
Each row corresponds to an interaction between an Agent and a client.
I want to keep only the rows where a client had interactions with at least 2 agents.
How can I do that?

Worth adding that now you can use df.duplicated():
df = df.loc[df.duplicated(subset='Client', keep=False)]
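For illustration, a minimal sketch on made-up Agent/Client data (keep=False marks every occurrence of a duplicated Client, not just the later ones):
import pandas as pd
df = pd.DataFrame({'Agent': ['A', 'B', 'C'], 'Client': ['Y', 'Y', 'Z']})
print(df.loc[df.duplicated(subset='Client', keep=False)])
  Agent Client
0     A      Y
1     B      Y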

Use groupby and transform by value_counts.
df[df.Client.groupby(df.Client).transform('value_counts') > 1]
Note that you might have one agent interacting with the same client multiple times; those rows would be retained as false positives. If you do not want this, you could add a drop_duplicates call before filtering:
df = df.drop_duplicates()
df = df[df.Client.groupby(df.Client).transform('value_counts') > 1]
print(df)
A B
0 1 2
1 2 5
2 3 1
3 4 1
4 5 5
5 6 1
mask = df.B.groupby(df.B).transform('value_counts') > 1
print(mask)
0 False
1 True
2 True
3 True
4 True
5 True
Name: B, dtype: bool
df = df[mask]
print(df)
A B
1 2 5
2 3 1
3 4 1
4 5 5
5 6 1
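Alternatively, counting distinct agents per client with transform('nunique') answers the original question in one step and avoids the false-positive issue; a sketch using the question's Agent/Client column names:
df = df[df.groupby('Client')['Agent'].transform('nunique') > 1]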

How to drop row with bracket in Pandas

I would like to drop the rows containing [] for a given df
df=pd.DataFrame(dict(a=[1,2,4,[],5]))
Such that the expected output will be
a
0 1
1 2
2 4
3 5
Edit:
Or, to make things more interesting, what if we have two columns and some of the cells contain [] to be dropped?
df=pd.DataFrame(dict(a=[1,2,4,[],5],b=[2,[],1,[],6]))
One way is to get the string repr and filter:
df = df[df['a'].map(repr)!='[]']
Output:
a
0 1
1 2
2 4
4 5
For multiple columns, we could apply the above:
out = df[df.apply(lambda c: c.map(repr)).ne('[]').all(axis=1)]
Output:
a b
0 1 2
2 4 1
4 5 6
You can't use equality directly (e.g. df['a'] == [] raises a ValueError because pandas tries to align the list elementwise with the Series), but you can use isin:
df[~df['a'].isin([[]])]
output:
a
0 1
1 2
2 4
4 5
To act on all columns:
df[~df.isin([[]]).any(axis=1)]
output:
a b
0 1 2
2 4 1
4 5 6

Adding boolean column to pandas dataframe where one row being true should make all same users rows true

I am having problems with a pandas dataframe when adding a boolean column. The data has users who can open their projects in several places. I need to find the group of users who have worked with the same project in several places: if the same user has opened the same project in different places even once, shared_projects should be true, and then all rows with that user_id should be true.
Here is an example df:
user_id project_id_x project_id_y
1 1 2
1 3 4
2 5 6
2 7 7
2 8 9
3 10 11
3 12 10
This is a simple example of what I would like to get out. If the condition is true on one row, it should be true for all rows with that user_id.
user_id project_id_x project_id_y shared_projects
1 1 2 false
1 3 4 false
2 5 6 true
2 7 7 true
2 8 9 true
3 10 11 true
3 12 10 true
I can get boolean values based on each row, but I am stuck on how to make it true for all of a user's rows if it is true on one row.
Assuming you want to match on the same row:
df['shared_projects'] = (df['project_id_x'].eq(df['project_id_y'])
.groupby(df['user_id']).transform('any')
)
If you want to match on any value x/y for a given user, you can use a set intersection:
s = df.groupby('user_id').apply(lambda g: bool(set(g['project_id_x'])
.intersection(g['project_id_y'])))
df.merge(s.rename('shared_projects'), left_on='user_id', right_index=True)
output:
user_id project_id_x project_id_y shared_projects
0 1 1 2 False
1 1 3 4 False
2 2 5 6 True
3 2 7 7 True
4 2 8 9 True
First you will have to do a complex selection to find the users that have worked on the same project in different columns:
df['shared_projects'] = (df['project_id_x'] == df['project_id_y'])
That will create a new boolean column as you've already done. But then you can use the index of those True values to set the flag for the rest of that user's rows, assuming that "user_id" is the index of the dataframe.
for index in df[df['shared_projects'] == True].index.unique():
    df.loc[index, 'shared_projects'] = True
Update
Another approach without apply, using value_counts.
user_id = df.melt('user_id', var_name='project', value_name='project_id') \
.value_counts(['user_id', 'project_id']) \
.loc[lambda x: x > 1].index.get_level_values('user_id')
df['shared_projects'] = df['user_id'].isin(user_id)
Output:
>>> df
user_id project_id_x project_id_y shared_projects
0 1 1 2 False
1 1 3 4 False
2 2 5 6 True
3 2 7 7 True
4 2 8 9 True
# Intermediate result
>>> df.melt('user_id', var_name='project', value_name='project_id') \
.value_counts(['user_id', 'project_id'])
user_id project_id
2 7 2 # <- project 7 in multiple places for user 2
1 1 1
2 1
3 1
4 1
2 5 1
6 1
8 1
9 1
dtype: int64
Old answer
You can use melt:
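# flag a user when any project id appears more than once across the melted x/y values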
shared_projects = lambda x: len(set(x)) != len(x)
user_id = df.melt('user_id').groupby('user_id')['value'].apply(shared_projects)
df['shared_projects'] = df['user_id'].isin(user_id[user_id].index)
Output:
>>> df
user_id project_id_x project_id_y shared_projects
0 1 1 2 False
1 1 3 4 False
2 2 5 6 True
3 2 7 7 True
4 2 8 9 True

Summarize dataframe: key column (unique values) plus flag if all rows for the key have the same 'type'

I have something like this:
data = {'SKU':[1,1,2,1,2,2,3],
'QTY':[5,12,2,24,1,2,12],
'TYPE': ['M','C','M','C','M','M','C']
}
df = pd.DataFrame(data)
print(df)
OUTPUT:
SKU QTY TYPE
0 1 5 M
1 1 12 C
2 2 2 M
3 1 24 C
4 2 1 M
5 2 2 M
6 3 12 C
And I want a list of unique SKUs and a true / false column indicating if Type = C for all instances of that SKU.
Something like this:
SKU Case
0 1 False
1 2 False
2 3 True
I've tried all manner of combinations of groupby, filter, agg, value_counts, etc. and just can't seem to find a reasonable way to achieve this.
Any help would be much appreciated. I'm sure the answer will be humbling.
import numpy as np
print(df.groupby('SKU')['TYPE'].agg(lambda x: np.all(x == 'C')).reset_index())
Prints:
SKU TYPE
0 1 False
1 2 False
2 3 True
Let us do eq + groupby + all:
s=df.TYPE.eq('C').groupby(df['SKU']).all().reset_index()
SKU TYPE
0 1 False
1 2 False
2 3 True
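If you want the flag column literally named Case as in the expected output, a small variation (the column name is just taken from the example) would be:
s = df['TYPE'].eq('C').groupby(df['SKU']).all().reset_index(name='Case')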

Perform operation on corresponding matching grouped Pandas dataframe

I have a Dataframe:
User Numbers
A 0
A 4
A 5
B 0
B 0
C 1
C 3
I want to perform an operation on each corresponding grouped data. For example, if I want to remove all Users that have the Number 0, it should look like:
User Numbers
A 0
A 4
A 5
C 1
C 3
since all Numbers of User B are 0.
Or for example, if I want to find the variance of the Numbers of all the Users, it should look like:
Users Variance
A 7
B 0
C 2
This means only the Numbers of A are calculated for finding the variance of A and so on.
Is there a general way to do all these computations for matching grouped data?
You want two different operations - filtration per group and aggregation per group.
Filtration:
For better performance, use transform to build a boolean mask and filter with boolean indexing.
df1 = df[~df['Number'].eq(0).groupby(df['User']).transform('all')]
print (df1)
User Number
0 A 0
1 A 4
2 A 5
5 C 1
6 C 3
Steps:
1. First create a boolean Series by comparing Number with eq:
print (df['Number'].eq(0))
0 True
1 False
2 False
3 True
4 True
5 False
6 False
Name: Number, dtype: bool
2. Then group the mask by the other column and use transform('all') to check whether all values in each group are True; transform returns a mask the same size as the original DataFrame:
print (df['Number'].eq(0).groupby(df['User']).transform('all'))
0 False
1 False
2 False
3 True
4 True
5 False
6 False
Name: Number, dtype: bool
3. Invert the boolean mask with ~:
print (~df['Number'].eq(0).groupby(df['User']).transform('all'))
0 True
1 True
2 True
3 False
4 False
5 True
6 True
Name: Number, dtype: bool
4. Filter:
print (df[~df['Number'].eq(0).groupby(df['User']).transform('all')])
User Number
0 A 0
1 A 4
2 A 5
5 C 1
6 C 3
Another solution, slower on a large DataFrame, uses filter with the same logic as the first solution:
df2 = df.groupby('User').filter(lambda x: ~x['Number'].eq(0).all())
print (df2)
User Number
0 A 0
1 A 4
2 A 5
5 C 1
6 C 3
Aggregation:
For simple aggregation of one column with one aggregate function, e.g. GroupBy.var, use:
df3 = df.groupby('User', as_index=False)['Number'].var()
print (df3)
User Number
0 A 7
1 B 0
2 C 2
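If you need several statistics per group at once, groupby followed by agg with a list of functions is the general pattern; a sketch (the particular functions are arbitrary):
df.groupby('User')['Number'].agg(['var', 'mean', 'max'])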

Python Pandas replicate rows in dataframe

If the dataframe looks like:
Store,Dept,Date,Weekly_Sales,IsHoliday
1,1,2010-02-05,24924.5,FALSE
1,1,2010-02-12,46039.49,TRUE
1,1,2010-02-19,41595.55,FALSE
1,1,2010-02-26,19403.54,FALSE
1,1,2010-03-05,21827.9,FALSE
1,1,2010-03-12,21043.39,FALSE
1,1,2010-03-19,22136.64,FALSE
1,1,2010-03-26,26229.21,FALSE
1,1,2010-04-02,57258.43,FALSE
And I want to duplicate rows with IsHoliday equal to TRUE. I can do:
is_hol = df['IsHoliday'] == True
df_try = df[is_hol]
df=df.append(df_try*10)
But is there a better way to do this? I need to duplicate the holiday rows 5 times, and with the above approach I would have to append 5 times.
You can put df_try inside a list and then do what you have in mind:
>>> df.append([df_try]*5,ignore_index=True)
Store Dept Date Weekly_Sales IsHoliday
0 1 1 2010-02-05 24924.50 False
1 1 1 2010-02-12 46039.49 True
2 1 1 2010-02-19 41595.55 False
3 1 1 2010-02-26 19403.54 False
4 1 1 2010-03-05 21827.90 False
5 1 1 2010-03-12 21043.39 False
6 1 1 2010-03-19 22136.64 False
7 1 1 2010-03-26 26229.21 False
8 1 1 2010-04-02 57258.43 False
9 1 1 2010-02-12 46039.49 True
10 1 1 2010-02-12 46039.49 True
11 1 1 2010-02-12 46039.49 True
12 1 1 2010-02-12 46039.49 True
13 1 1 2010-02-12 46039.49 True
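Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same idea is written with concat; a sketch assuming df and df_try as above:
import pandas as pd
df = pd.concat([df] + [df_try] * 5, ignore_index=True)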
Another way is to use the concat() function:
import pandas as pd
In [603]: df = pd.DataFrame({'col1':list("abc"),'col2':range(3)},index = range(3))
In [604]: df
Out[604]:
col1 col2
0 a 0
1 b 1
2 c 2
In [605]: pd.concat([df]*3, ignore_index=True) # Ignores the index
Out[605]:
col1 col2
0 a 0
1 b 1
2 c 2
3 a 0
4 b 1
5 c 2
6 a 0
7 b 1
8 c 2
In [606]: pd.concat([df]*3)
Out[606]:
col1 col2
0 a 0
1 b 1
2 c 2
0 a 0
1 b 1
2 c 2
0 a 0
1 b 1
2 c 2
This is an old question, but since it still comes up at the top of my results in Google, here's another way.
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':list("abc"),'col2':range(3)},index = range(3))
Say you want to replicate the rows where col1="b".
reps = [3 if val=="b" else 1 for val in df.col1]
df.loc[np.repeat(df.index.values, reps)]
You could replace the 3 if val=="b" else 1 in the list comprehension with another function that returns 3 if val=="b" or 4 if val=="c" and so on, so it's pretty flexible.
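For instance, the repeat counts could also come from a dict lookup rather than the inline conditional (the numbers below are arbitrary):
reps = df.col1.map({'b': 3, 'c': 4}).fillna(1).astype(int)
df.loc[np.repeat(df.index.values, reps)]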
Appending and concatenating is usually slow in Pandas so I recommend just making a new list of the rows and turning that into a dataframe (unless appending a single row or concatenating a few dataframes).
import pandas as pd
df = pd.DataFrame([
[1,1,'2010-02-05',24924.5,False],
[1,1,'2010-02-12',46039.49,True],
[1,1,'2010-02-19',41595.55,False],
[1,1,'2010-02-26',19403.54,False],
[1,1,'2010-03-05',21827.9,False],
[1,1,'2010-03-12',21043.39,False],
[1,1,'2010-03-19',22136.64,False],
[1,1,'2010-03-26',26229.21,False],
[1,1,'2010-04-02',57258.43,False]
], columns=['Store','Dept','Date','Weekly_Sales','IsHoliday'])
temp_df = []
for row in df.itertuples(index=False):
    if row.IsHoliday:
        temp_df.extend([list(row)]*5)
    else:
        temp_df.append(list(row))
df = pd.DataFrame(temp_df, columns=df.columns)
You can do it in one line:
df.append([df[df['IsHoliday'] == True]] * 5, ignore_index=True)
or
df.append([df[df['IsHoliday']]] * 5, ignore_index=True)
Another alternative to append() is to first replace the values of a column by a list of entries and then explode() (either using ignore_index=True or not, depending on what you want):
df['IsHoliday'] = df['IsHoliday'].apply(lambda x: 5*[x] if (x == True) else x)
df.explode('IsHoliday', ignore_index=True)
The nice thing about this one is that you can already use the list in the apply() call to build copies of rows with modified values in a column, in case you want to do that later anyway...
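For example, a sketch that numbers the five copies of each holiday row while replicating (the copy_no column is made up purely for illustration):
df['copy_no'] = df['IsHoliday'].apply(lambda x: list(range(5)) if x else [0])
df = df.explode('copy_no', ignore_index=True)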
