Pandas: drop duplicated rows based on some conditions - python

I hope you're doing well.
I want to drop duplicate rows based on some conditions.
For example:
A B C D E
0 foo 2 3 4 100
1 foo 2 3 1 3
2 foo 2 3 5 nan
3 bar 1 2 8 nan
4 bar 1 2 1 nan
The result should be
A B C D E
0 foo 2 3 4 100
1 foo 2 3 1 3
2 bar 1 2 nan nan
So we have duplicated rows (based on columns A, B, and C). First we check the value in column E: if it is NaN we drop the row, but if all values in column E are NaN within a group (as for rows 3 and 4 with the name 'bar'), we keep one row and set its value in column D to NaN.
Thanks in advance.

This works:
import io
import numpy as np
import pandas as pd

table = """
A B C D E
0 foo 2 3 4 100
1 foo 2 3 1 3
2 foo 2 3 5 nan
3 bar 1 2 8 nan
4 bar 1 2 1 nan
"""
df = pd.read_table(io.StringIO(table), index_col=0, sep=' ', skipinitialspace=True)
# Rows duplicated in A, B, C and E where E is NaN (i.e. the whole group has NaN in E)
index_1 = set(df[df.duplicated(['A', 'B', 'C', 'E'], keep=False) & df['E'].isna()].index)
# Rows duplicated in A, B, C with NaN in E
index_2 = set(df[df.duplicated(['A', 'B', 'C'], keep=False) & df['E'].isna()].index)
# Set NaN for D in index_1
df.loc[list(index_1), 'D'] = np.nan
# Drop NaN-E rows with duplicated A, B, C, except those in index_1
df.drop(list(index_2 - index_1), inplace=True)
# Drop the remaining duplicates
df.drop_duplicates(['A', 'B', 'C', 'D'], inplace=True)
print(df)
This is what was required:
A B C D E
0 foo 2 3 4.0 100.0
1 foo 2 3 1.0 3.0
3 bar 1 2 NaN NaN
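An alternative sketch of the same logic with a single groupby (my own variant, starting again from the original df, not part of the answer above):
import numpy as np

def collapse(group):
    # If every E in the group is NaN, keep one row and blank out its D
    if group['E'].isna().all():
        out = group.iloc[[0]].copy()
        out['D'] = np.nan
        return out
    # Otherwise drop only the rows whose E is NaN
    return group[group['E'].notna()]

result = df.groupby(['A', 'B', 'C'], sort=False, group_keys=False).apply(collapse)
print(result)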

Related

How to perform such a pandas aggregation: drop NaN and concatenate to the first?

Input:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': ['1', np.nan, np.nan, '2', np.nan, np.nan],
    'b': ['a', np.nan, 'ddd', np.nan, 'd', 'gg'],
    'c': [np.nan, 'aa', 'bb', np.nan, 'd', np.nan],
})
print (df)
a b c
0 1 a NaN
1 NaN NaN aa
2 NaN ddd bb
3 2 NaN NaN
4 NaN d d
5 NaN gg NaN
Output:
a b c
0 1 a ddd aa bb
1 2 d gg d
If there is a non-missing value at the start of each group, use ffill to forward-fill the missing values, then aggregate all values with join after removing the missing ones:
df = df.groupby(df['a'].ffill()).agg(lambda x: ' '.join(x.dropna())).reset_index(drop=True)
print (df)
a b c
0 1 a ddd aa bb
1 2 d gg d
Detail:
print (df['a'].ffill())
0 1
1 1
2 1
3 2
4 2
5 2
Name: a, dtype: object
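One edge case worth flagging (my assumption, not covered above): if the first rows of df have NaN in 'a', ffill leaves them NaN and groupby then drops them silently. Passing dropna=False (pandas >= 1.1) keeps them as their own group:
# Sketch: also keep rows whose forward-filled key is still NaN
df.groupby(df['a'].ffill(), dropna=False).agg(lambda x: ' '.join(x.dropna()))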

With a dataframe in pandas, how do I append a dataframe where only some columns are the same?

If I have a dataframe that looks like this
rootID parentID jobID counter
0 A B D 0
1 E F G 0
2 A C D 0
3 E B F 0
4 E F G 0
And a second dataframe that looks like this
rootID parentID StepID
0 A B 1
1 A F 2
2 A C 3
3 E B 4
4 E F 5
How can I append the second dataframe to the first dataframe based on the keys they have in common, "rootID" and "parentID", such that I get:
rootID parentID jobID counter StepID
0 A B D 0 Null
1 E F G 0 Null
2 A C D 0 Null
3 E B F 0 Null
4 E F G 0 Null
5 A B Null Null 1
6 A F Null Null 2
7 A C Null Null 3
8 E B Null Null 4
9 E F Null Null 5
Thank you for the help
Try pd.concat. Pandas has intrinsic data alignment, so when using this function (and most others) pandas will keep row index labels and column headers aligned:
pd.concat([df, df2], ignore_index=True, sort=False)
Output:
rootID parentID jobID counter StepID
0 A B D 0.0 NaN
1 E F G 0.0 NaN
2 A C D 0.0 NaN
3 E B F 0.0 NaN
4 E F G 0.0 NaN
5 A B NaN NaN 1.0
6 A F NaN NaN 2.0
7 A C NaN NaN 3.0
8 E B NaN NaN 4.0
9 E F NaN NaN 5.0
Note: pandas has an unfortunate side-effect of converting numerical columns that contain NaN to float datatype.
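If that side-effect is a problem, one option (a sketch, assuming a recent pandas with nullable integer dtypes) is to cast the affected columns to Int64 afterwards:
out = pd.concat([df, df2], ignore_index=True, sort=False)
# Nullable Int64 keeps whole numbers while representing missing values as <NA>
out[['counter', 'StepID']] = out[['counter', 'StepID']].astype('Int64')
print(out)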

Match values based on group values with column values and merge them into two columns

df
index group1 group2 a b c d
-
0 a b 1 2 NaN NaN
1 b c NaN 5 1 NaN
2 c d NaN NaN 6 9
4 b a 1 7 NaN NaN
5 d a 6 NaN NaN 5
df expect
index group1 group2 one two
-
0 a b 1 2
1 b c 5 1
2 c d 6 9
4 b a 7 1
5 d a 5 6
I want to match the values in columns ['group1', 'group2'] against the column names and append them to columns ['one', 'two'] in that order. For example, for row index 5: group1 is 'd', so it first takes the value 5 from column 'd', and then does the same for group2.
I am trying to use the lookup function, df.one = df.lookup(df.index, df.group1); it works on small data, but not on big data with lots of columns, where the values get mixed up.
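DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; the replacement recommended in the pandas docs is plain NumPy indexing, which also scales to large frames. A minimal sketch, assuming the value columns are exactly ['a', 'b', 'c', 'd']:
import numpy as np

cols = ['a', 'b', 'c', 'd']               # assumed value columns
values = df[cols].to_numpy()
pos = {c: i for i, c in enumerate(cols)}  # column name -> position
rows = np.arange(len(df))

# Row-wise pick: the column named by group1 fills 'one', group2 fills 'two'
df['one'] = values[rows, df['group1'].map(pos).to_numpy()]
df['two'] = values[rows, df['group2'].map(pos).to_numpy()]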

Get only two values from 4 specified columns and merge valid values into 2 columns

df:
index a b c d
-
0 1 2 NaN NaN
1 2 NaN 3 NaN
2 5 NaN 6 NaN
3 1 NaN NaN 5
df expect:
index one two
-
0 1 2
1 2 3
2 5 6
3 1 5
The output example above is self-explanatory. Basically, I just need to shift the two non-NaN values from columns [a, b, c, d] into another set of two columns ["one", "two"].
Back-fill missing values along the rows and select the first 2 columns:
df = df.bfill(axis=1).iloc[:, :2].astype(int)
df.columns = ["one", "two"]
print (df)
one two
index
0 1 2
1 2 3
2 5 6
3 1 5
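One assumption worth stating: this relies on every row having at least two non-missing values among a..d; a row with fewer would keep NaN in the second column, and astype(int) would then raise. A defensive variant (a sketch) using the nullable integer dtype:
out = df.bfill(axis=1).iloc[:, :2]
out.columns = ['one', 'two']
out = out.astype('Int64')  # tolerates missing values as <NA>
print(out)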
Or combine_first + pop (pop already removes each column, so no extra drop is needed):
df['two'] = df.pop('b').combine_first(df.pop('c')).combine_first(df.pop('d'))
df.columns = ['index', 'one', 'two']
Or fillna:
df['two'] = df.pop('b').fillna(df.pop('c')).fillna(df.pop('d'))
df.columns = ['index', 'one', 'two']
In both cases, print(df) gives:
index one two
0 0 1 2.0
1 1 2 3.0
2 2 5 6.0
3 3 1 5.0
If you want output like jezrael's answer above (works in both cases), add:
df=df.set_index('index')
And then print(df) gives:
one two
index
0 1 2.0
1 2 3.0
2 5 6.0
3 1 5.0

Adding rows in dataframe based on values of another dataframe

I have the following two dataframes. Please note that 'amt' is grouped by 'id' in both dataframes.
df1
id code amt
0 A 1 5
1 A 2 5
2 B 3 10
3 C 4 6
4 D 5 8
5 E 6 11
df2
id code amt
0 B 1 9
1 C 12 10
I want to add a row in df2 for every id of df1 not contained in df2. For example, as ids A, D and E are not contained in df2, I want to add a row for each of these ids. The appended row should contain the id not present in df2, a null value for the attribute code, and the value stored in df1 for the attribute amt.
The result should be something like this:
id code name
0 B 1 9
1 C 12 10
2 A nan 5
3 D nan 8
4 E nan 11
I would highly appreciate some guidance on this.
By using pd.concat:
df = df1.drop(columns='code').drop_duplicates()
pd.concat([df2, df[~df.id.isin(df2.id)]]).rename(columns={'amt': 'name'}).reset_index(drop=True)
Output:
  id  code  name
0  B   1.0     9
1  C  12.0    10
2  A   NaN     5
3  D   NaN     8
4  E   NaN    11
Drop duplicates from df1, append df2, drop every id that now appears twice, then append what is left back onto df2.
df2.append(
df1.drop_duplicates('id').append(df2)
.drop_duplicates('id', keep=False).assign(code=np.nan),
ignore_index=True
)
id code amt
0 B 1.0 9
1 C 12.0 10
2 A NaN 5
3 D NaN 8
4 E NaN 11
A slight variation, using np.in1d for the membership test:
m = ~np.in1d(df1.id.values, df2.id.values)
d = ~df1.duplicated('id').values
df2.append(df1[m & d].assign(code=np.nan), ignore_index=True)
id code amt
0 B 1.0 9
1 C 12.0 10
2 A NaN 5
3 D NaN 8
4 E NaN 11
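Note that DataFrame.append was removed in pandas 2.0, so on current versions both snippets above need pd.concat instead. An equivalent sketch:
import numpy as np
import pandas as pd

# ids present in df1 but missing from df2, with code blanked out
extra = (pd.concat([df1.drop_duplicates('id'), df2])
           .drop_duplicates('id', keep=False)
           .assign(code=np.nan))
result = pd.concat([df2, extra], ignore_index=True)
print(result)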
