Remove duplicates in a row pandas - python

I have a df:
Name  Symbol               Dummy
A     (BO),(BO),(AD),(TR)  2
B     (TV),(TV),(TV)       2
C     (HY)                 2
D     (UI)                 2
I need the df as:
Name  Symbol           Dummy
A     (BO),(AD),(TR)   2
B     (TV)             2
C     (HY)             2
D     (UI)             2
I tried drop_duplicates, but it is not working as expected (it removes duplicate rows, not duplicate values within a cell).

Split the strings around the delimiter ',', then dedupe using dict.fromkeys, which also preserves the order of the strings, and finally join back around the delimiter ','.
df['Symbol'] = df['Symbol'].str.split(',').map(dict.fromkeys).str.join(',')
Name Symbol Dummy
0 A (BO),(AD),(TR) 2
1 B (TV) 2
2 C (HY) 2
3 D (UI) 2
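For reference, a self-contained version of this approach, rebuilding the frame from the question:
import pandas as pd

df = pd.DataFrame({'Name': list('ABCD'),
                   'Symbol': ['(BO),(BO),(AD),(TR)', '(TV),(TV),(TV)', '(HY)', '(UI)'],
                   'Dummy': [2, 2, 2, 2]})

# dict.fromkeys keeps only the first occurrence of each key and,
# since Python 3.7, dicts preserve insertion order
df['Symbol'] = df['Symbol'].str.split(',').map(dict.fromkeys).str.join(',')
print(df)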

Another method
#original DF
   col1                 col2
0  (BO),(BO),(AD),(TR)     2
df.col1 = df.col1.str.split(',').apply(lambda x: sorted(set(x), key=x.index)).str.join(',')
df
#output
   col1            col2
0  (BO),(AD),(TR)     2
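The key=x.index trick is what preserves the order here: set(x) drops the duplicates, and sorting by each value's first position in the original list restores the original ordering. A quick illustration:
x = ['(BO)', '(BO)', '(AD)', '(TR)']
# x.index(v) is the position of v's first occurrence, so sorting by it
# puts the unique values back in their original order
print(sorted(set(x), key=x.index))   # ['(BO)', '(AD)', '(TR)']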
If the order of the values is not important, you can simply do:
df.col1 = df.col1.str.split(',').apply(lambda x: set(x)).str.join(',')
df
#output
   col1            col2
0  (AD),(BO),(TR)     2
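Note that a plain set does not preserve order, which is why (AD) now comes before (BO); the order you get back is arbitrary.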

Related

iloc[] by value columns

I want to use iloc with values in a column.
df1 = pd.DataFrame({'col1': ['1','1','1','2','2','2','2','2','3','3','3'],
                    'col2': ['A','B','C','D','E','F','G','H','I','J','K']})
I want to select the row at position 2 within each group of col1 as a DataFrame, and the result will be like
col1 col2
1 C
2 F
3 K
Thank you so much
Use GroupBy.nth:
df2 = df1.groupby('col1', as_index=False).nth(2)
Alternative with GroupBy.cumcount:
df2 = df1[df1.groupby('col1').cumcount().eq(2)]
print (df2)
col1 col2
2 1 C
5 2 F
10 3 K
Use GroupBy.nth with as_index=False:
df1.groupby('col1', as_index=False).nth(2)
output:
col1 col2
2 1 C
5 2 F
10 3 K
df1.groupby('col1').agg(lambda ss: ss.iloc[2])
     col2
col1
1       C
2       F
3       K
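One difference worth knowing: GroupBy.nth(2) and the cumcount filter simply skip groups with fewer than three rows, while the agg/iloc[2] lambda raises an IndexError for them. A quick check on a frame with a short group:
import pandas as pd

# group '4' has only one row, so it has no third element
df = pd.DataFrame({'col1': ['1', '1', '1', '4'],
                   'col2': ['A', 'B', 'C', 'D']})

print(df[df.groupby('col1').cumcount().eq(2)])   # keeps only ('1', 'C')
# df.groupby('col1').agg(lambda ss: ss.iloc[2])  # would raise IndexError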

How to select most occurring common values from each column

I have the DataFrame
Col1 Col2 Col3 Col4
A B C OP
B D A JK
B C E MK
A B B LO
and would like to get the DataFrame below:
Result Total
B 5
A 3
C 2
I managed to get the top value from each column using the commands below, but I am not sure how to get from there to the DataFrame I need. Trying to find the best way to approach this scenario.
df.groupby(['Col1']).size().sort_values(ascending=False).head(1)
df.groupby(['Col2']).size().sort_values(ascending=False).head(1)
df.groupby(['Col3']).size().sort_values(ascending=False).head(1)
Use stack and value_counts(), then rename the column:
df.stack().value_counts().head(3).to_frame('Total')
If you need to filter columns, as per your comment:
cols=['Col1', 'Col2', 'Col3']
df.loc[:,cols].stack().value_counts().head(3).to_frame('Total')
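As a self-contained check, rebuilding the frame from the question:
import pandas as pd

df = pd.DataFrame({'Col1': list('ABBA'), 'Col2': list('BDCB'),
                   'Col3': list('CAEB'), 'Col4': ['OP', 'JK', 'MK', 'LO']})

cols = ['Col1', 'Col2', 'Col3']
# stack flattens the selected columns into one long Series;
# value_counts sorts by frequency, so head(3) is the top three
print(df.loc[:, cols].stack().value_counts().head(3).to_frame('Total'))
#    Total
# B      5
# A      3
# C      2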
Use DataFrame.melt with the columns specified in value_vars, then count values with GroupBy.size plus Series.nlargest, and last convert the Series to a DataFrame:
df1 = (df.melt(value_vars=['Col1','Col2','Col3'], value_name='Result')
         .groupby(['Result'])
         .size()
         .nlargest(3)
         .reset_index(name='Total'))
print (df1)
Result Total
0 B 5
1 A 3
2 C 2
Or use Series.value_counts with Series.head for the top 3:
df1 = (df.melt(value_vars=['Col1','Col2','Col3'], value_name='Result')['Result']
         .value_counts()
         .head(3)
         .rename_axis('Result')
         .reset_index(name='Total'))
print (df1)
Result Total
0 B 5
1 A 3
2 C 2

Pandas Groupby mean and first of multiple columns

My Pandas df is like the following, and I want to apply groupby and then calculate the mean and the first value of many columns:
index col1 col2 col3 col4 col5 col6
0 a c 1 2 f 5
1 a c 1 2 f 7
2 a d 1 2 g 9
3 b d 6 2 g 4
4 b e 1 2 g 8
5 b e 1 2 g 2
I tried something like this:
df.groupby(['col1','col5']).agg({['col6','col3']: 'mean', ['col4','col2']: 'first'})
expecting output
col1 col5 col6 col3 col4 col2
a f 6 1 2 c
a g 9 1 2 d
b g 4 3 2 e
but it seems a list is not an option here. In my real dataset I have hundreds of columns of different natures, so I can't pass them individually. Any thoughts on passing them as lists?
If you have one list per aggregation, you can do:
l_mean = ['col6','col3']
l_first = ['col4','col2']
df.groupby(['col1','col5']).agg({**{col: 'mean' for col in l_mean},
                                 **{col: 'first' for col in l_first}})
The ** notation unpacks a dictionary: {**{}, **{}} builds one dictionary from two (it could be more than two), like a union of dictionaries. And {col: 'mean' for col in l_mean} is a dictionary comprehension that creates a dictionary with each column of the list as a key and 'mean' as the value.
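In isolation, the merged mapping looks like this:
l_mean = ['col6', 'col3']
l_first = ['col4', 'col2']

# one dict per aggregation, merged into a single column -> function map
agg_map = {**{col: 'mean' for col in l_mean},
           **{col: 'first' for col in l_first}}
print(agg_map)
# {'col6': 'mean', 'col3': 'mean', 'col4': 'first', 'col2': 'first'}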
Or using concat:
gr = df.groupby(['col1','col5'])
pd.concat([gr[l_mean].mean(),
           gr[l_first].first()],
          axis=1)
and call reset_index afterwards to get the expected output:
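Put together, that is:
out = pd.concat([gr[l_mean].mean(), gr[l_first].first()], axis=1).reset_index()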
Or use named aggregation:
(
df.groupby(['col1','col5'])
  .agg(col6=('col6', 'mean'),
       col3=('col3', 'mean'),
       col4=('col4', 'first'),
       col2=('col2', 'first'))
)
This is an extension of @Ben.T's solution, just wrapping it in a function and passing it via the pipe method:
# set list1 and list2
def fil(grp, list1, list2):
    # numeric_only avoids a TypeError on the string columns in newer pandas
    A = grp.mean(numeric_only=True).filter(list1)
    B = grp.first().filter(list2)
    C = A.join(B)
    return C

grp1 = ['col6','col3']
grp2 = ['col4','col2']
m = df.groupby(['col1','col5']).pipe(fil, grp1, grp2)
m
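Since the real dataset has hundreds of columns, another option is to build the mapping from dtypes instead of by hand. A sketch, reusing df from the question and assuming the rule is "average the numeric columns, take the first value of everything else":
import pandas as pd

keys = ['col1', 'col5']
rest = df.columns.difference(keys)
# mean for numeric columns, first for everything else
agg_map = {c: 'mean' if pd.api.types.is_numeric_dtype(df[c]) else 'first'
           for c in rest}
out = df.groupby(keys).agg(agg_map).reset_index()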

Sort and align 2 dataframes by values in corresponding columns

I have 2 dataframes that I want to sort, similar in structure to what I have shown below, but the rows are jumbled when compared on only the first 3 columns. How do I sort the dataframes so that the row indices match?
Also, it could happen that there are no matching rows, in which case I want to create a blank entry in the other dataframe at that index. How would I go about doing this?
Dataframe1:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
Dataframe2:
Col1 Col2 Col3 Col4
0 f e g 6
1 a b c 5
2 b c d 3
Is this what you want?
import pandas as pd
df=pd.DataFrame({'a':[1,3,2],'b':[4,6,5]})
print(df.sort_values(df.columns.tolist()))
Output:
a b
0 1 4
2 2 5
1 3 6
How do I sort the dataframes such that the row indices match
You can sort both dataframes by the columns that should determine the order, then reset the index:
cols = ['Col1', 'Col2', 'Col3']
df1.sort_values(cols).reset_index(drop=True)
#outputs:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
df2.sort_values(cols).reset_index(drop=True)
#outputs:
Col1 Col2 Col3 Col4
0 a b c 5
1 b c d 3
2 f e g 6
...there may not be matching rows in which case I want to create a blank entry in the other dataframe at that index
Let's add 1 more row to df1:
df1 = pd.DataFrame({
    'Col1': list('abfh'),
    'Col2': list('bceg'),
    'Col3': list('cdgi'),
    'Col4': [1, 4, 5, 7]
})
df1
# outputs:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
3 h g i 7
We can use a left join (keeping df1 on the left) to add a blank row where df2 has no match, so every df2 column is NaN at index 3.
If you have sorted both dataframes already, you can merge on the indexes:
df3 = df1.merge(df2, 'left', left_index=True, right_index=True, suffixes=('_x', ''))
Otherwise, merge on the columns that *should* determine the sort order; this creates a new dataframe with the joined values, sorted the same way df1 is sorted:
df3 = df1.merge(df2, 'left', on=cols, suffixes=('_x', ''))
Then filter out the columns that came from the left dataframe:
df3.iloc[:, ~df3.columns.str.endswith('_x')]
#outputs:
Col1 Col2 Col3 Col4
0 f e g 6.0
1 a b c 5.0
2 b c d 3.0
3 NaN NaN NaN NaN
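If both frames can contain rows the other lacks, an outer merge covers both directions; a sketch, using merge's indicator option to label where each row came from:
df3 = df1.merge(df2, how='outer', on=cols,
                suffixes=('_df1', '_df2'), indicator=True)
# rows found only in df1 show 'left_only' in the _merge column,
# rows found only in df2 show 'right_only'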

How to change values in certain columns according to certain rule in pandas dataframe

Suppose I have a pandas dataframe that looks like this:
col1 col2
0 A A60
1 B B23
2 C NaN
The data is read from a CSV file. Suppose I want to change each non-missing value of 'col2' to its prefix (i.e. 'A' or 'B'). How could I do this without writing a for loop?
The expected output is
col1 col2
0 A A
1 B B
2 C NaN
.str[:1] just returns the first character:
import numpy as np
import pandas as pd

d = {'col1': ['A', 'B', 'C'], 'col2': ['A32', 'B60', np.nan]}
df = pd.DataFrame(data=d)
df['col2'] = df['col2'].str[:1]
df
out:
col1 col2
0 A A
1 B B
2 C NaN
You can also use the pandas str.replace function directly on the column.
import numpy as np
import pandas as pd

# sample data
df = pd.DataFrame({'col1': ['A','B','C'], 'col2': ['A60','B23',np.nan]})
# remove numbers from col2 (regex=True is required in newer pandas)
df['col2'] = df['col2'].str.replace(r'\d+', '', regex=True)
print(df)
print(df)
col1 col2
0 A A
1 B B
2 C NaN
You may need to use isnull():
df['col2'] = df['col2'].apply(lambda x: str(x)[0] if not pd.isnull(x) else x)
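The guard matters because str() turns a missing value into the literal string 'nan':
import numpy as np

str(np.nan)      # 'nan'
str(np.nan)[0]   # 'n', not NaN -- hence the pd.isnull check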
