Drop group if another column has duplicate values - pandas dataframe - python

I have the following df
id value many other variables
A 5
A 5
A 8
A 9
B 3
B 4
B 5
B 9
C 10
C 11
C 19
D 6
D 6
D 10
E 0
E 0
E 0
...
I want to drop the whole id group if there are duplicate values in the value column (except zeros). So the output should be:
id value many other variables
B 3
B 4
B 5
B 9
C 10
C 11
C 19
E 0
E 0
E 0
...

You can use duplicated to flag the duplicated (id, value) pairs, then groupby.transform('any') to flag every row of a group that contains a duplicate. To keep the rows with 0s, combine the negated mask with a second boolean mask that flags 0s:
out = df[~df.duplicated(['id', 'value']).groupby(df['id']).transform('any') | df['value'].eq(0)]
Output:
id value many_other_variables
4 B 3
5 B 4
6 B 5
7 B 9
8 C 10
9 C 11
10 C 19
14 E 0
15 E 0
16 E 0
Note: groupby.any on its own is an aggregation that returns one value per group. transform broadcasts that per-group result back to every row, so the result has the same length as the original DataFrame. This matters because a boolean mask used to filter df must align with df row by row.
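For reference, here is a minimal runnable sketch of the approach above, with the sample frame rebuilt from the question (only the id and value columns; the other variables are omitted). The second mask, out_strict, is not from the answer: it is an assumed alternative that ignores zeros when looking for duplicates, so a group with non-zero duplicates is dropped entirely even if it also contains a 0 row. On this data both give the same result.

```python
import pandas as pd

df = pd.DataFrame({
    "id": list("AAAABBBBCCCDDDEEE"),
    "value": [5, 5, 8, 9, 3, 4, 5, 9, 10, 11, 19, 6, 6, 10, 0, 0, 0],
})

# True on the second and later occurrence of each (id, value) pair
is_dup = df.duplicated(["id", "value"])

# Broadcast "this group contains at least one duplicate" back to every row
group_has_dup = is_dup.groupby(df["id"]).transform("any")

# Keep rows from duplicate-free groups, plus any row whose value is 0
out = df[~group_has_dup | df["value"].eq(0)]
print(out)

# Assumed alternative: do not count zeros as duplicates in the first place
nonzero_dup = is_dup & df["value"].ne(0)
out_strict = df[~nonzero_dup.groupby(df["id"]).transform("any")]
```

The two masks differ only when a group mixes non-zero duplicates with zeros: the original keeps the 0 rows of such a group, the alternative drops the whole group.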

Related

I want to groupby and drop groups if the shape is 3 and none of the values from a column is zero

I want to groupby and drop a group if it satisfies two conditions (its shape is 3 and the value column doesn't contain zeros).
My df
ID value
A 3
A 2
A 0
B 1
B 1
C 3
C 3
C 4
D 0
D 5
D 5
E 6
E 7
E 7
F 3
F 2
my desired df would be
ID value
A 3
A 2
A 0
B 1
B 1
D 0
D 5
D 5
F 3
F 2
You can use boolean indexing with groupby operations:
g = df['value'].eq(0).groupby(df['ID'])
# group contains a 0
m1 = g.transform('any')
# group doesn't have size 3
m2 = g.transform('size').ne(3)
# keep if either of the conditions above is met
# this is equivalent to dropping groups that have size 3 AND contain no 0
out = df[m1|m2]
Output:
ID value
0 A 3
1 A 2
2 A 0
3 B 1
4 B 1
8 D 0
9 D 5
10 D 5
14 F 3
15 F 2
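A self-contained version of the above, with the frame rebuilt from the question, might look like this. The groupby.filter call at the end is not part of the answer; it is an assumed cross-check that expresses the same rule per group (usually slower on large frames, but arguably easier to read).

```python
import pandas as pd

df = pd.DataFrame({
    "ID": list("AAABBCCCDDDEEEFF"),
    "value": [3, 2, 0, 1, 1, 3, 3, 4, 0, 5, 5, 6, 7, 7, 3, 2],
})

# Group the "is zero" flags by ID
g = df["value"].eq(0).groupby(df["ID"])

m1 = g.transform("any")          # True for rows whose group contains a 0
m2 = g.transform("size").ne(3)   # True for rows whose group size is not 3

# Keep a row if its group contains a 0 OR does not have size 3
out = df[m1 | m2]
print(out)

# Assumed cross-check: the same rule written as a per-group predicate
out_filter = df.groupby("ID").filter(
    lambda grp: bool(len(grp) != 3 or grp["value"].eq(0).any())
)
```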

Join two dataframe into one and fill the missing gaps

Here is the first df:
index name
4 a
8 b
10 c
Here is the second df:
index name
4 d
5 d
6 d
7 d
8 e
9 e
10 f
Is there a way to join them into the df below?
index name1 name2
4 a d
5 a d
6 a d
7 a d
8 b e
9 b e
10 c f
Basically, join the two dfs on index, then fill the gaps in name1 with the last available value.
# left-merge the two DataFrames on 'index', then forward-fill the missing name1 values
df2 = df2.merge(df, on='index', how='left', suffixes=('2', '1')).ffill()
# sort the columns so that name1 comes before name2
df2.reindex(sorted(df2.columns), axis=1)
index name1 name2
0 4 a d
1 5 a d
2 6 a d
3 7 a d
4 8 b e
5 9 b e
6 10 c f
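Here is a runnable sketch of the merge-then-ffill idea, with both frames rebuilt from the question (the names df1, merged and asof are mine). It forward-fills only name1, which avoids touching other columns by accident. The pd.merge_asof call is an assumed alternative, not part of the answer: with both frames sorted on index, it matches each row of df2 to the last row of df1 whose index is less than or equal to it, so no separate ffill is needed.

```python
import pandas as pd

df1 = pd.DataFrame({"index": [4, 8, 10], "name": ["a", "b", "c"]})
df2 = pd.DataFrame({"index": [4, 5, 6, 7, 8, 9, 10], "name": list("ddddeef")})

# Left merge keeps every row of df2; rows without a match get NaN in name1
merged = df2.merge(df1, on="index", how="left", suffixes=("2", "1"))

# Fill the gaps in name1 with the last seen value, then order the columns
merged["name1"] = merged["name1"].ffill()
merged = merged[["index", "name1", "name2"]]
print(merged)

# Assumed alternative: backward as-of join (both frames must be sorted on the key)
asof = pd.merge_asof(df2, df1, on="index", suffixes=("2", "1"))
asof = asof[["index", "name1", "name2"]]
```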

Boolean filtering using AND - Pandas

I just can't seem to get the proper output when filtering the df below using boolean operators. I want to remove rows where ID is <= 2 AND String is A, B, or C. It seems to be removing rows whose String is not equal to A, B, or C.
df = pd.DataFrame({
'String' : ['A','F','B','C','D','A','X','C','B','D','A','Y','A','C','A','D','C','B'],
'ID' : [4,2,3,4,5,6,4,2,3,4,5,6,4,2,3,4,5,6],
})
df = df[~(df['ID'] <= 2) & (df['String'].isin(['A','B','C']))]
intended Output:
String ID
0 A 4
1 F 2
2 B 3
3 C 4
4 D 5
5 A 6
6 X 4
#7 C 2 Remove
8 B 3
9 D 4
10 A 5
11 Y 6
12 A 4
#13 C 2 Remove
14 A 3
15 D 4
16 C 5
17 B 6
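A likely cause is operator precedence: in the expression above, ~ applies only to (df['ID'] <= 2), so the mask actually means "ID > 2 AND String in A/B/C", which drops every row whose String is something else. A minimal sketch of one way to get the intended output is to build the "remove" condition first and negate it as a whole (to_remove and out are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    "String": ["A", "F", "B", "C", "D", "A", "X", "C", "B",
               "D", "A", "Y", "A", "C", "A", "D", "C", "B"],
    "ID": [4, 2, 3, 4, 5, 6, 4, 2, 3, 4, 5, 6, 4, 2, 3, 4, 5, 6],
})

# Rows to remove: ID <= 2 AND String is one of A, B, C
to_remove = (df["ID"] <= 2) & df["String"].isin(["A", "B", "C"])

# Negate the combined condition, not just its first half
out = df[~to_remove]
print(out)
```

By De Morgan's law the same mask can be written as (df['ID'] > 2) | ~df['String'].isin(['A', 'B', 'C']).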

Filling missing data in df.loc filtered conditions?

I have the following problem with filling NaN in a filtered df.
Let's take this df:
condition value
0 A 1
1 B 8
2 B np.nan
3 A np.nan
4 C 3
5 C np.nan
6 A 2
7 B 5
8 C 4
9 A np.nan
10 B np.nan
11 C np.nan
How can I fill each np.nan with the last preceding value for the same condition, so that I get the following result?
condition value
0 A 1
1 B 8
2 B 8
3 A 1
4 C 3
5 C 3
6 A 2
7 B 5
8 C 4
9 A 2
10 B 5
11 C 4
I've failed with the following code (ValueError: Cannot index with multidimensional key):
conditions = set(df['condition'].tolist())
for c in conditions:
    filter = df.loc[df['condition'] == c]
    df.loc[filter, 'value'] = df.loc[filter, 'value'].fillna(method='ffill')
THX & BR from Vienna
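If your values are actual NaN, you simply need to group by condition and call ffill on the value column (ffill is essentially a wrapper for fillna(method='ffill'), whose method argument is deprecated in recent pandas versions). Assigning the result back keeps the condition column in place; in recent pandas versions, calling df.groupby('condition').ffill() on the whole frame returns only the non-grouping columns.
df['value'] = df.groupby('condition')['value'].ffill()
After that, df is:
condition value
0 A 1
1 B 8
2 B 8
3 A 1
4 C 3
5 C 3
6 A 2
7 B 5
8 C 4
9 A 2
10 B 5
11 C 4
If your values are strings that say np.nan, as in your example, then replace them first:
df.replace('np.nan', np.nan, inplace=True)
df['value'] = df.groupby('condition')['value'].ffill()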
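A runnable sketch of the grouped forward fill, with the frame rebuilt from the question and real NaN values assumed. The loop at the end is an assumed repair of the code from the question, shown only for comparison: the original failed because filter held a filtered DataFrame, while .loc needs a boolean mask (a Series) as the row indexer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "condition": list("ABBACCABCABC"),
    "value": [1, 8, np.nan, np.nan, 3, np.nan, 2, 5, 4, np.nan, np.nan, np.nan],
})

# Forward-fill within each condition; the result is aligned to df's index
filled = df.copy()
filled["value"] = filled.groupby("condition")["value"].ffill()
print(filled)

# Assumed repair of the loop from the question: use a boolean mask with .loc
looped = df.copy()
for c in looped["condition"].unique():
    mask = looped["condition"] == c
    looped.loc[mask, "value"] = looped.loc[mask, "value"].ffill()
```

The groupby version does the same work without an explicit Python loop, which is usually both shorter and faster.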

Python value difference in dataframe by group key

I have a DataFrame
name value
A 2
A 4
A 5
A 7
A 8
B 3
B 4
B 8
C 1
C 3
C 5
And I want to get the value differences within each name, like this:
name value dif
A 2 0
A 4 2
A 5 1
A 7 2
A 8 1
B 3 0
B 4 1
B 8 4
C 1 0
C 3 2
C 5 2
Can anyone show me the easiest way?
You can use GroupBy.diff to compute the difference between consecutive rows within each group. The first row of every group has no predecessor, so its difference is NaN; optionally fill those with 0 and cast the result to integers.
df['dif'] = df.groupby('name')['value'].diff().fillna(0).astype(int)
df
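For completeness, a self-contained sketch with the frame rebuilt from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "name": list("AAAAABBBCCC"),
    "value": [2, 4, 5, 7, 8, 3, 4, 8, 1, 3, 5],
})

# diff() restarts at each group boundary, leaving NaN on every group's first row
raw = df.groupby("name")["value"].diff()

# Replace those NaN with 0 and cast back to int (diff() returns floats)
df["dif"] = raw.fillna(0).astype(int)
print(df)
```

Note that fillna(0) makes a group's first row indistinguishable from a genuine zero difference; keeping the NaN (and skipping astype(int)) preserves that distinction if it matters.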
