How do I remove duplicate rows based on the pairwise columns (Col1, Col2) and (Col3, Col4)?
import pandas as pd
df = pd.DataFrame({'Col1': ['A', 'A', 'C', 'A', 'C'],
                   'Col2': ['B', 'B', 'D', 'B', 'D'],
                   'Col3': ['C', 'A', 'C', 'B', 'D'],
                   'Col4': ['D', 'B', 'D', 'A', 'C']})
Col1 Col2 Col3 Col4
A B C D
A B A B
C D C D
A B B A
C D D C
The desired output is:
Col1 Col2 Col3 Col4
A B C D
A B B A
C D D C
The second and third rows are dropped because (Col1, Col2) equals (Col3, Col4) in each:
A B = A B and C D = C D
I tried something like
df.drop_duplicates(subset=[['Col1', 'Col2'],['Col3', 'Col4']])
but this is not right.
Let us try comparing with the values:
import numpy as np
out = df[np.all(df[['Col1', 'Col2']].values != df[['Col3', 'Col4']].values, axis=1)]
Out[298]:
Col1 Col2 Col3 Col4
0 A B C D
3 A B B A
4 C D D C
You could try comparing columns like this.
new_df = df[(df['Col1'] != df['Col3']) & (df['Col2'] != df['Col4'])]
print(new_df)
Output:
Col1 Col2 Col3 Col4
0 A B C D
3 A B B A
4 C D D C
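Note that both masks above keep a row only when each column differs element-wise. If instead you want to drop a row only when the whole (Col1, Col2) pair matches (Col3, Col4), a minimal sketch (pair_equal is my own name; on this data the result is the same, but the two masks differ on rows where only one column matches):

# Drop a row only when the full pair matches: (Col1, Col2) == (Col3, Col4).
pair_equal = df['Col1'].eq(df['Col3']) & df['Col2'].eq(df['Col4'])
out = df[~pair_equal]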
I used the following approach. For the df:
import pandas as pd
df = pd.DataFrame({'Col1': ['A', 'A', 'C', 'A', 'C'],
                   'Col2': ['B', 'B', 'D', 'B', 'D'],
                   'Col3': ['C', 'A', 'C', 'B', 'D'],
                   'Col4': ['D', 'B', 'D', 'A', 'C']})
Compare the values of (Col1, Col2) with (Col3, Col4) and keep only the rows where the values differ:
mask = (df[['Col1', 'Col2']].values != df[['Col3', 'Col4']].values).all(axis=1)
desired_output = df[mask]
desired_output
Output:
Col1 Col2 Col3 Col4
0 A B C D
3 A B B A
4 C D D C
Related
I'm new to Python and pandas and I'm struggling with a problem.
Here is the dataset:
data = {'col1': ['a','b','a','c'], 'col2': [None,None,'a',None], 'col3': [None,'a',None,'b'], 'col4': ['a',None,'b',None], 'col5': ['b','c','c',None]}
df = pd.DataFrame(data)
I need to create 3 new columns based on the unique values of col1 to col4: whenever col1, col2, col3, or col4 has a value equal to the header of the new column, it should return 1; otherwise it should return 0.
I need an output like this:
dataset output example:
data = {'col1': ['a','b','a','c'], 'col2': [None,None,'a',None], 'col3': [None,'a',None,'b'], 'col4': ['a',None,'b',None], 'col5': ['b','c','c',None], 'a':[1,1,1,0],'b':[0,1,1,1],'c':[0,1,1,1]}
df = pd.DataFrame(data)
I was able to create a new column and set it to 1 using the code below:
df['a'] = 0
df['a'] = (df['col1'] == 'a').astype(int)
but it works only for the first column; I would have to repeat it for every column.
Is there a way to make this happen for all columns at once?
Check with pd.get_dummies and groupby:
df = pd.concat([df,
                pd.get_dummies(df, prefix='', prefix_sep='')
                  .groupby(level=0, axis=1).max()],
               axis=1)
Out[377]:
col1 col2 col3 col4 col5 a b c
0 a None None a b 1 1 0
1 b None a None c 1 1 1
2 a a None b c 1 1 1
3 c None b None None 0 1 1
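Note that groupby(level=0, axis=1) is deprecated in recent pandas; a sketch of the same idea that transposes instead (the astype(int) guards against newer pandas returning boolean dummies):

# One-hot encode every column, collapse duplicate dummy names, and rejoin.
dummies = pd.get_dummies(df, prefix='', prefix_sep='').astype(int)
collapsed = dummies.T.groupby(level=0).max().T
df = pd.concat([df, collapsed], axis=1)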
Alternatively, stack all the columns into a single Series, one-hot encode it, and take the max per original row:
pd.concat([df, pd.get_dummies(df.stack().droplevel(1)).groupby(level=0).max()], axis=1)
result:
col1 col2 col3 col4 col5 a b c
0 a None None a b 1 1 0
1 b None a None c 1 1 1
2 a a None b c 1 1 1
3 c None b None None 0 1 1
Consider a Table1 having two columns: col1 and col2.
col1  col2
A     B
B     A
C     D
E     F
F     E
How can we get the final output as follows (we have to find the unique records)?
col1  col2
A     B
C     D
E     F
**** OR ****
col1  col2
B     A
C     D
F     E
You can use not exists:
select col1, col2
from t
where col1 < col2 or
      not exists (select 1
                  from t t2
                  where t2.col1 = t.col2 and t2.col2 = t.col1);
That is, select all rows where col1 < col2, or where no row exists with the two columns swapped; the second condition keeps pairs that only ever appear in one order.
This snippet cannot enumerate every possible "unique" set, but it can find at least one. Hope it can fit your current problem.
table = [['A', 'B'],
         ['B', 'A'],
         ['C', 'D'],
         ['E', 'F'],
         ['F', 'E']]

unique_set = set()
for row in table:
    tmp = frozenset(row)
    if tmp not in unique_set:
        unique_set.add(tmp)
print(unique_set)
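If you need the surviving rows themselves rather than the set of frozensets, a small extension of the same idea that keeps the first row seen for each unordered pair:

seen = set()
unique_rows = []
for row in table:
    key = frozenset(row)            # unordered, so ['B', 'A'] hashes like ['A', 'B']
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)     # keep the first ordering encountered
print(unique_rows)                  # [['A', 'B'], ['C', 'D'], ['E', 'F']]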
I like Pythonic solutions; here is my way:
df
col1 col2
0 A B
1 B A
2 C D
3 E F
4 F E
create a new column of sorted pairs:
df['test'] = [sorted(x) for x in zip(df['col1'], df['col2'])]
output:
col1 col2 test
0 A B [A, B]
1 B A [A, B]
2 C D [C, D]
3 E F [E, F]
4 F E [E, F]
remove duplicates and get the index:
idx = df['test'].astype(str).drop_duplicates().index
new dataframe:
del df['test']
df.loc[idx]
output:
col1 col2
0 A B
2 C D
3 E F
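A more compact variant of the same idea, assuming only these two columns matter: sort each pair with numpy and drop duplicates on the sorted values.

import numpy as np

# Sort each row's pair so ('B', 'A') becomes ('A', 'B'), then mark duplicates.
sorted_pairs = pd.DataFrame(np.sort(df[['col1', 'col2']].values, axis=1))
df[~sorted_pairs.duplicated().values]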
I have a dataset that looks like below:
col1 col2 col3
a    b    c
a    d    x
b    c    e
s    f    e
f    f    e
I need to drop duplicates in col3 if col1 differs from col2. The result looks like:
col1 col2 col3
a    b    c
a    d    x
f    f    e
Is there a way to nest this condition in df = df.drop_duplicates(subset=['col3'])?
Yes, we can do it with argsort; sorting by whether col1 equals col2 pushes the matching rows to the end, so keep='last' prefers them:
df = df.iloc[df.eval('col1 == col2').argsort()].drop_duplicates('col3', keep='last')
col1 col2 col3
0 a b c
1 a d x
4 f f e
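An equivalent, perhaps more readable sketch using a throwaway helper column (the _match name is my own):

out = (df.assign(_match=df['col1'].eq(df['col2']))
         .sort_values('_match', kind='stable')   # matching rows sort last
         .drop_duplicates('col3', keep='last')   # so they win per col3 group
         .drop(columns='_match'))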
I have a dataframe that has column names in the index and values in a column next to it like so:
column
col1 a
col2 b
col3 c
col1 d
col2 e
col3 f
How do I flip and merge the index into columns like so?
col1 col2 col3
a b c
d e f
I tried:
new_df = pd.DataFrame(df).transpose()
new_df looks like this:
col1 col2 col3 col1 col2 col3
a b c d e f
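For reference, a minimal reconstruction of the input frame used in the answers below (assuming the display above):

import pandas as pd

# Column labels repeat in the index; the values live in one column.
df = pd.DataFrame({'column': list('abcdef')},
                  index=['col1', 'col2', 'col3', 'col1', 'col2', 'col3'])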
Use DataFrame.set_index with a counter from GroupBy.cumcount and the parameter append=True to create a MultiIndex, then reshape by the first level with Series.unstack:
df = df.set_index(df.groupby(level=0).cumcount(), append=True)['column'].unstack(0)
print(df)
col1 col2 col3
0 a b c
1 d e f
If the blocks are regular as in your example, you can just reshape the values:
pd.DataFrame(df.values.reshape([-1,3]), columns=df.index[:3])
col1 col2 col3
0 a b c
1 d e f
You can populate a dictionary first:
d = {}
for k, v in df.column.items():
    d.setdefault(k, []).append(v)
pd.DataFrame(d)
col1 col2 col3
0 a b c
1 d e f
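The same loop reads a little cleaner with collections.defaultdict, which spares the setdefault call:

from collections import defaultdict

d = defaultdict(list)
for k, v in df.column.items():
    d[k].append(v)          # missing keys start out as an empty list
pd.DataFrame(d)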
I have 2 dataframes, similar in structure to what I have shown below, whose rows are jumbled when looking at only the first 3 columns. How do I sort the dataframes so that the row indices match?
Also, it could happen that there is no matching row, in which case I want to create a blank entry in the other dataframe at that index. How would I go about doing this?
Dataframe1:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
Dataframe2:
Col1 Col2 Col3 Col4
0 f e g 6
1 a b c 5
2 b c d 3
Is this what you want?
import pandas as pd
df=pd.DataFrame({'a':[1,3,2],'b':[4,6,5]})
print(df.sort_values(df.columns.tolist()))
Output:
a b
0 1 4
2 2 5
1 3 6
How do I sort the dataframes such that the row indices match
You can sort both dataframes by the columns that should determine the order, then reset the index:
cols = ['Col1', 'Col2', 'Col3']
df1.sort_values(cols).reset_index(drop=True)
#outputs:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
df2.sort_values(cols).reset_index(drop=True)
#outputs:
Col1 Col2 Col3 Col4
0 a b c 5
1 b c d 3
2 f e g 6
...there may not be matching rows in which case I want to create a blank entry in the other dataframe at that index
Let's add one more row to df1:
df1 = pd.DataFrame({
    'Col1': list('abfh'),
    'Col2': list('bceg'),
    'Col3': list('cdgi'),
    'Col4': [1, 4, 5, 7]
})
df1
# outputs:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
3 h g i 7
We can use a left join from df1 to add a blank row to df2, where every column is NaN at index 3.
If you have sorted both dataframes already, you can merge on the indexes:
df3 = df1.merge(df2, 'left', left_index=True, right_index=True, suffixes=('_x', ''))
Otherwise, merge on the columns that *should* determine the sort order; this creates a new dataframe with the joined values, sorted the same way df1 is sorted:
df3 = df1.merge(df2, 'left', on=cols, suffixes=('_x', ''))
Then filter out the columns that came from the left dataframe:
df3.iloc[:, ~df3.columns.str.endswith('_x')]
#outputs:
Col1 Col2 Col3 Col4
0 f e g 6.0
1 a b c 5.0
2 b c d 3.0
3 NaN NaN NaN NaN
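A hedged alternative to the merge: sort both frames on the key columns, then align df2 to df1's (longer) index with reindex, so the unmatched position becomes an all-NaN row. This assumes the sorted keys line up positionally, which holds here because the extra row sorts last.

cols = ['Col1', 'Col2', 'Col3']
df1_sorted = df1.sort_values(cols).reset_index(drop=True)
df2_sorted = df2.sort_values(cols).reset_index(drop=True)

# Index 3 exists only in df1, so df2 gets an all-NaN row there.
df2_aligned = df2_sorted.reindex(df1_sorted.index)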