Python - keep duplicates if two columns are equal

I have a dataset that looks like below:
col1 col2 col3
a b c
a d x
b c e
s f e
f f e
I need to drop rows with a duplicated col3 value, keeping the row where col1 equals col2 when there is one. The result should look like:
col1 col2 col3
a b c
a d x
f f e
Is there a way to nest this condition in df = df.drop_duplicates(subset=['col3'])?

Yes. Use argsort to move the rows where col1 equals col2 to the end, then keep the last duplicate of each col3 value:
df = df.iloc[df.eval('col1==col2').argsort()].drop_duplicates('col3',keep='last')
col1 col2 col3
0 a b c
1 a d x
4 f f e
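The same idea can be sketched with an explicit helper column, which makes the tie-breaking visible. This is only an illustration: the `_same` column name is made up here, and `kind="stable"` plus the final `sort_index()` are additions that keep the original row order among ties.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a", "b", "s", "f"],
    "col2": ["b", "d", "c", "f", "f"],
    "col3": ["c", "x", "e", "e", "e"],
})

# True where col1 equals col2; False sorts before True.
same = df["col1"].eq(df["col2"])

result = (
    df.assign(_same=same)                      # hypothetical helper column
      .sort_values("_same", kind="stable")     # matching rows go last
      .drop_duplicates("col3", keep="last")    # so they win within each col3 group
      .drop(columns="_same")
      .sort_index()                            # restore the original row order
)
print(result)
```

If no row in a col3 group has col1 equal to col2, this sketch keeps the last row of that group.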

Related

Python data frame drop duplicate rows based on pairwise columns

How do I remove the duplicate rows based on the pairwise columns (Col1, Col2) and (Col3, Col4)?
import pandas as pd
df = pd.DataFrame({'Col1': ['A','A','C','A','C'],
                   'Col2': ['B','B','D','B','D'],
                   'Col3': ['C','A','C','B','D'],
                   'Col4': ['D','B','D','A','C']})
Col1 Col2 Col3 Col4
A B C D
A B A B
C D C D
A B B A
C D D C
The desired output is:
Col1 Col2 Col3 Col4
A B C D
A B B A
C D D C
Rows two and three are dropped because (Col1, Col2) equals (Col3, Col4) in each of them:
A B = A B and C D = C D
I tried something like
df.drop_duplicates(subset=[['Col1', 'Col2'],['Col3', 'Col4']])
but this is not right.
Compare the values of the two column pairs and keep the rows where both positions differ:
import numpy as np
out = df[np.all(df[['Col1', 'Col2']].values != df[['Col3', 'Col4']].values,1)]
Out[298]:
Col1 Col2 Col3 Col4
0 A B C D
3 A B B A
4 C D D C
You could try comparing columns like this.
new_df = df[(df['Col1'] != df['Col3']) & (df['Col2'] != df['Col4'])]
print(new_df)
Output:
Col1 Col2 Col3 Col4
0 A B C D
3 A B B A
4 C D D C
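One thing worth checking with both comparisons above: they keep a row only when Col1 differs from Col3 and Col2 differs from Col4. If the goal is to drop a row only when the whole pair is identical, "any position differs" may be the closer test. The sketch below shows the difference; the sixth row (A B A C) is a made-up addition, not part of the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["A", "A", "C", "A", "C", "A"],
    "Col2": ["B", "B", "D", "B", "D", "B"],
    "Col3": ["C", "A", "C", "B", "D", "A"],
    "Col4": ["D", "B", "D", "A", "C", "C"],   # last row is hypothetical: A B A C
})

left = df[["Col1", "Col2"]].to_numpy()
right = df[["Col3", "Col4"]].to_numpy()
differs = left != right          # element-wise comparison, shape (6, 2)

keep_all = df[differs.all(axis=1)]   # every position must differ
keep_any = df[differs.any(axis=1)]   # at least one position differs

print(keep_all.index.tolist())   # [0, 3, 4]
print(keep_any.index.tolist())   # [0, 3, 4, 5]
```

On the question's five rows the two variants give the same result; they only diverge for partially matching rows such as the added one.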
I used the following approach. Given the DataFrame:
import pandas as pd
df = pd.DataFrame({'Col1': ['A','A','C','A','C'],
                   'Col2': ['B','B','D','B','D'],
                   'Col3': ['C','A','C','B','D'],
                   'Col4': ['D','B','D','A','C']})
Compare the values of (Col1, Col2) with (Col3, Col4) and drop the duplicates:
desired_output = df[df[['Col1', 'Col2']].values != df[['Col3', 'Col4']].values].drop_duplicates()
desired_output
Output:
Col1 Col2 Col3 Col4
0 A B C D
3 A B B A
4 C D D C

How to skip some symbol characters, when this character is used as a split column symbol in pandas split function

I have a dataframe like below:
Original data
index string
0 a,b,c,d,e,f
1 a,b,c,d,e,f
2 a,(I,j,k),c,d,e,f
I want to be:
To be data
index col1 col2 col3 col4 col5 col6
0 a b c d e f
1 a b c d e f
2 a (I,j,k) c d e f
You can split on commas that are not inside brackets. Then convert the result to a DataFrame and assign to df columns:
df[['col {}'.format(i) for i in range(1,7)]] = df['string'].str.split(r",\s*(?![^()]*\))").apply(pd.Series)
Output:
index string col 1 col 2 col 3 col 4 col 5 col 6
0 0 a,b,c,d,e,f a b c d e f
1 1 a,b,c,d,e,f a b c d e f
2 2 a,(I,j,k),c,d,e,f a (I,j,k) c d e f
Try this:
df = df['string'].str.split(r",\s*(?![^()]*\))", expand= True)
df.columns = ['col1','col2','col3','col4','col5','col6']
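To see what the pattern does on its own, here is a small sketch that runs it through `re.split` first and then through `str.split` with `expand=True`. The negative lookahead rejects any comma that is followed by a `)` with no parenthesis in between, which is how commas inside `(I,j,k)` are skipped. The sample frame is rebuilt from the question's data.

```python
import re
import pandas as pd

# A comma, optional whitespace, NOT followed by "...)" without an intervening paren.
pattern = r",\s*(?![^()]*\))"

pieces = re.split(pattern, "a,(I,j,k),c,d,e,f")
print(pieces)   # ['a', '(I,j,k)', 'c', 'd', 'e', 'f']

df = pd.DataFrame({"string": ["a,b,c,d,e,f", "a,(I,j,k),c,d,e,f"]})
wide = df["string"].str.split(pattern, expand=True)
wide.columns = [f"col{i}" for i in range(1, wide.shape[1] + 1)]
print(wide)
```

The pattern assumes one level of parentheses. With nested parentheses, a comma between the two opening brackets is still treated as a separator, so such input would need a small parser instead.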

How to group a list into a dataframe with four columns?

Let's assume I have a list similar to the one below:
l = ['A','B','C','D','E','F','G','H','I','L','M','N']
I want to create a dataframe that has 4 columns from the fact that every 4 objects in the list is a row. The outcome should be a dataframe with the following form:
Col1 Col2 Col3 Col4
A B C D
E F G H
I L M N
Can anyone help me do it?
Thanks!
Convert values to numpy array and then use reshape:
import numpy as np
import pandas as pd
l = ['A','B','C','D','E','F','G','H','I','L','M','N']
df = pd.DataFrame(np.array(l).reshape(-1, 4)).add_prefix('col')
print(df)
col0 col1 col2 col3
0 A B C D
1 E F G H
2 I L M N
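`reshape(-1, 4)` raises a ValueError when the list length is not a multiple of 4. If that can happen, one possible alternative is to slice the list into chunks and let pandas pad the short last row. The 13-item list below is a made-up input for illustration, and the column names follow the ones asked for in the question.

```python
import pandas as pd

l = list("ABCDEFGHILMNO")   # hypothetical input: 13 items, one left over
n = 4                       # number of columns

# Slice the list into consecutive chunks of n items; the last chunk may be shorter.
rows = [l[i:i + n] for i in range(0, len(l), n)]

df = pd.DataFrame(rows, columns=[f"Col{i}" for i in range(1, n + 1)])
print(df)
```

The missing cells in the last row come out as missing values rather than raising an error.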

How to transpose and merge same column names after transposing?

I have a dataframe that has column names in the index and values in a column next to it like so:
column
col1 a
col2 b
col3 c
col1 d
col2 e
col3 f
How do I flip and merge the index into columns like so?
col1 col2 col3
a b c
d e f
I tried:
new_df = pd.DataFrame(df).transpose()
new_df looks like this:
col1 col2 col3 col1 col2 col3
a b c d e f
Use DataFrame.set_index with append=True to add a per-label counter from GroupBy.cumcount as a second index level, then move the first level (the column names) into the columns with Series.unstack:
df = df.set_index(df.groupby(level=0).cumcount(), append=True)['column'].unstack(0)
print(df)
col1 col2 col3
0 a b c
1 d e f
If the labels repeat in the same order every time, as in your example, a plain reshape is enough:
pd.DataFrame(df.values.reshape([-1,3]), columns=df.index[:3])
col1 col2 col3
0 a b c
1 d e f
You can populate a dictionary first, collecting the values for each label in a list:
d = {}
for k, v in df.column.items():
    d.setdefault(k, []).append(v)
pd.DataFrame(d)
col1 col2 col3
0 a b c
1 d e f
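The answers differ in how they treat uneven data. A rough sketch of the cumcount/unstack approach on a frame where the second group has no col3 entry (a made-up variation of the question's data) shows that it fills the gap with NaN, whereas a fixed-width reshape would fail on five values.

```python
import pandas as pd

# Hypothetical variation: the second group is missing its col3 value.
df = pd.DataFrame(
    {"column": ["a", "b", "c", "d", "e"]},
    index=["col1", "col2", "col3", "col1", "col2"],
)

# Number each occurrence of a label: 0 for the first, 1 for the second, ...
occurrence = df.groupby(level=0).cumcount()

# (label, occurrence) becomes a MultiIndex; unstack(0) turns labels into columns.
wide = df.set_index(occurrence, append=True)["column"].unstack(0)
print(wide)
```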

Sort and align 2 dataframes by values in corresponding columns

I have 2 dataframes that I want to sort that are similar in structure to what I have shown below, but the rows of values when looking at only the first 3 columns are jumbled. How do I sort the dataframes such that the row indices match?
Also it could so happen that there may not be matching rows in which case I want to create a blank entry in the other dataframe at that index. How would I go about doing this?
Dataframe1:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
Dataframe2:
Col1 Col2 Col3 Col4
0 f e g 6
1 a b c 5
2 b c d 3
Is this what you want?:
import pandas as pd
df=pd.DataFrame({'a':[1,3,2],'b':[4,6,5]})
print(df.sort_values(df.columns.tolist()))
Output:
a b
0 1 4
2 2 5
1 3 6
How do I sort the dataframes such that the row indices match
You can sort by the columns that should determine order on both data frames & reset index.
cols = ['Col1', 'Col2', 'Col3']
df1.sort_values(cols).reset_index(drop=True)
#outputs:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
df2.sort_values(cols).reset_index(drop=True)
#outputs:
Col1 Col2 Col3 Col4
0 a b c 5
1 b c d 3
2 f e g 6
...there may not be matching rows in which case I want to create a blank entry in the other dataframe at that index
Let's add one more row to df1:
df1 = pd.DataFrame({
'Col1': list('abfh'),
'Col2': list('bceg'),
'Col3': list('cdgi'),
'Col4': [1,4,5,7]
})
df1
# outputs:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
3 h g i 7
We can use a left join from df1 to add a blank row to df2, where every column is NaN at index 3.
If you have already sorted both dataframes, you can merge on the indexes:
df3 = df1.merge(df2, 'left', left_index=True, right_index=True, suffixes=('_x', ''))
Otherwise, merge on the columns that should determine the sort order. This creates a new dataframe with the joined values, in the same row order as df1:
df3 = df1.merge(df2, 'left', on=cols, suffixes=('_x', ''))
Then filter out the columns from the left data frame
df3.iloc[:, ~df3.columns.str.endswith('_x')]
#outputs:
Col1 Col2 Col3 Col4
0 f e g 6.0
1 a b c 5.0
2 b c d 3.0
3 NaN NaN NaN NaN
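As a self-contained sketch of the key-based variant, the snippet below left-merges only the key columns of df1 with df2, so no suffixes are needed and the result follows df1's row order. The frames are rebuilt from the question and the answer above; the `aligned` name is just for illustration.

```python
import pandas as pd

cols = ["Col1", "Col2", "Col3"]

df1 = pd.DataFrame({
    "Col1": list("abfh"),
    "Col2": list("bceg"),
    "Col3": list("cdgi"),
    "Col4": [1, 4, 5, 7],
})
df2 = pd.DataFrame({
    "Col1": list("fab"),
    "Col2": list("ebc"),
    "Col3": list("gcd"),
    "Col4": [6, 5, 3],
})

# Left merge on the key columns: one output row per df1 row, in df1's order.
# Keys that are missing from df2 get NaN in df2's Col4.
aligned = df1[cols].merge(df2, how="left", on=cols)
print(aligned)
```

Here the key columns stay filled for the unmatched row and only Col4 is NaN. If rows can be missing on either side, `how="outer"` would be the natural variation to try.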
