How to select most occurring common values from each column

I have the following DataFrame:
Col1 Col2 Col3 Col4
A B C OP
B D A JK
B C E MK
A B B LO
and would like to get the DataFrame below:
Result Total
B 5
A 3
C 2
I managed to get the top value from each column using the following commands, but I am not sure how to get from there to the DataFrame I need. I am trying to find the best way to approach this scenario.
df.groupby(['Col1']).size().sort_values(ascending=False).head(1)
df.groupby(['Col2']).size().sort_values(ascending=False).head(1)
df.groupby(['Col3']).size().sort_values(ascending=False).head(1)

Use stack and value_counts(), then name the count column with to_frame:
df.stack().value_counts().head(3).to_frame('Total')
If you need to restrict the count to certain columns, as per your comment:
cols=['Col1', 'Col2', 'Col3']
df.loc[:,cols].stack().value_counts().head(3).to_frame('Total')

Use DataFrame.melt with the columns specified in value_vars, count the values with GroupBy.size, take the top 3 with Series.nlargest, and finally convert the Series to a DataFrame:
df1 = (df.melt(value_vars=['Col1','Col2','Col3'], value_name='Result')
         .groupby(['Result'])
         .size()
         .nlargest(3)
         .reset_index(name='Total'))
print(df1)
Result Total
0 B 5
1 A 3
2 C 2
Or use Series.value_counts with Series.head for the top 3:
df1 = (df.melt(value_vars=['Col1','Col2','Col3'], value_name='Result')['Result']
         .value_counts()
         .head(3)
         .rename_axis('Result')
         .reset_index(name='Total'))
print(df1)
Result Total
0 B 5
1 A 3
2 C 2
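For reference, here is a minimal self-contained sketch that rebuilds the question's frame and combines the approaches above. The names cols and top are only illustrative, and it assumes Col4 should be left out of the count, as in the filtered variants.

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "Col1": list("ABBA"),
    "Col2": list("BDCB"),
    "Col3": list("CAEB"),
    "Col4": ["OP", "JK", "MK", "LO"],
})

# Columns to pool together (assumption: Col4 is excluded).
cols = ["Col1", "Col2", "Col3"]

top = (
    df[cols]
    .stack()                      # one long Series holding every cell value
    .value_counts()               # frequency of each value, highest first
    .head(3)                      # keep the three most frequent values
    .rename_axis("Result")        # name the index so it becomes a column
    .reset_index(name="Total")    # Series -> DataFrame(Result, Total)
)
print(top)
```

With the sample data this gives B 5, A 3, C 2. Note that head(3) simply cuts off after three rows, so ties at the cutoff are resolved by whatever order value_counts returns.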

Related

Merge two pandas dataframes that have slightly different values on the column which is being merged

How can I merge two data frames when the column has a slight offset than the column I am merging to?
df1 =
col1 col2
1 a
2 b
3 c
df2 =
col1 col3
1.01 d
2 e
2.95 f
so, the merged column would end up like this even though the values in col1 are slightly different.
df_merge =
col1 col2 col3
1 a d
2 b e
3 c f
I have seen scenarios like this where "col1" is a string, but I'm wondering if it's possible to do this with something like pandas.merge() in the scenario where there is slight numerical offset (e.g +/- 0.05).

Let's use merge_asof with the tolerance parameter:
pd.merge_asof(
    df1.astype({'col1': 'float'}).sort_values('col1'),
    df2.sort_values('col1'),
    on='col1',
    direction='nearest',
    tolerance=.05
)
col1 col2 col3
0 1.0 a d
1 2.0 b e
2 3.0 c f
PS: if the dataframes are already sorted on col1 then there is no need to sort again.
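A minimal end-to-end sketch of the same idea, rebuilding the frames from the question. The names left, right, merged and strict are just for illustration. Both key columns need the same dtype and must be sorted, which is why col1 in df1 is cast to float first.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
df2 = pd.DataFrame({"col1": [1.01, 2, 2.95], "col3": ["d", "e", "f"]})

# merge_asof needs matching key dtypes and keys sorted ascending.
left = df1.astype({"col1": "float"}).sort_values("col1")
right = df2.sort_values("col1")

# Match each left key to the nearest right key within +/- 0.05.
merged = pd.merge_asof(left, right, on="col1",
                       direction="nearest", tolerance=0.05)
print(merged)

# A tighter tolerance leaves the near-misses unmatched (NaN in col3).
strict = pd.merge_asof(left, right, on="col1",
                       direction="nearest", tolerance=0.001)
print(strict)
```

The result keeps the key values from the left frame, so col1 comes out as 1.0, 2.0, 3.0 rather than the offset values in df2.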

Remove duplicates in a row pandas

I have a df
Name Symbol Dummy
A (BO),(BO),(AD),(TR) 2
B (TV),(TV),(TV) 2
C (HY) 2
D (UI) 2
I need df as
Name Symbol Dummy
A (BO),(AD),(TR) 2
B (TV) 2
C (HY) 2
D (UI) 2
I tried drop_duplicates, but it does not work as expected.

Split the strings on the delimiter ",", dedupe the pieces with dict.fromkeys (which also preserves their original order), and finally join them with "," again:
df['Symbol'] = df['Symbol'].str.split(',').map(dict.fromkeys).str.join(',')
Name Symbol Dummy
0 A (BO),(AD),(TR) 2
1 B (TV) 2
2 C (HY) 2
3 D (UI) 2
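A self-contained sketch of the same technique on the question's data. It is a slight variant, not the exact one-liner above: the join is done inside a small helper (dedupe_tokens, a name made up here) so that each cell is turned straight back into a string.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": list("ABCD"),
    "Symbol": ["(BO),(BO),(AD),(TR)", "(TV),(TV),(TV)", "(HY)", "(UI)"],
    "Dummy": [2, 2, 2, 2],
})

def dedupe_tokens(parts):
    # dict.fromkeys keeps the first occurrence of each token, in order.
    return ",".join(dict.fromkeys(parts))

df["Symbol"] = df["Symbol"].str.split(",").map(dedupe_tokens)
print(df)
```

This assumes every cell in Symbol is a string; missing values would need to be handled separately (for example by filling them before splitting).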

Another method
#original DF
   col1                 col2
0  (BO),(BO),(AD),(TR)  2
df.col1 = df.col1.str.split(',').apply(lambda x: sorted(set(x), key=x.index)).str.join(',')
df
#output
   col1            col2
0  (BO),(AD),(TR)  2
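If the order of the values is not important, you can simply do the following (a set has no guaranteed order, so the order in your output may differ from the one shown):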
df.col1 = df.col1.str.split(',').apply(lambda x: set(x)).str.join(',')
df
#output
   col1            col2
0  (AD),(BO),(TR)  2

iloc[] by value columns

I want to use iloc-style positional selection within each group of values in a column.
df1 = pd.DataFrame({'col1': ['1' ,'1','1','2','2','2','2','2','3' ,'3','3'],
'col2': ['A' ,'B','C','D','E','F','G','H','I' ,'J','K']})
I want to select the row at position 2 within each group of col1 values, as a DataFrame, so the result will look like
col1 col2
1 C
2 F
3 K
Thank you so much

Use GroupBy.nth:
df2 = df1.groupby('col1', as_index=False).nth(2)
Alternative with GroupBy.cumcount:
df2 = df1[df1.groupby('col1').cumcount().eq(2)]
print (df2)
col1 col2
2 1 C
5 2 F
10 3 K

Use GroupBy.nth with as_index=False:
df1.groupby('col1', as_index=False).nth(2)
output:
col1 col2
2 1 C
5 2 F
10 3 K

df1.groupby('col1').agg(lambda ss:ss.iloc[2])
col2
col1
1 C
2 F
3 K
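A self-contained sketch of the cumcount variant, which behaves the same way across pandas versions. The extra group '4' with a single row is not in the question; it is added here only to show what happens when a group is too short.

```python
import pandas as pd

df1 = pd.DataFrame({
    "col1": ["1", "1", "1", "2", "2", "2", "2", "2", "3", "3", "3", "4"],
    "col2": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L"],
})

# cumcount numbers the rows inside each group: 0, 1, 2, ...
position = df1.groupby("col1").cumcount()

# Keep only the rows sitting at position 2 of their group.
df2 = df1[position.eq(2)]
print(df2)
```

Groups with fewer than three rows (the hypothetical '4' here) are silently dropped by this filter, whereas the agg approach with ss.iloc[2] raises an IndexError for such a group.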

Pandas Groupby mean and first of multiple columns

My pandas df looks like the following. I want to apply groupby and then calculate the mean of some columns and the first value of others.
index col1 col2 col3 col4 col5 col6
0 a c 1 2 f 5
1 a c 1 2 f 7
2 a d 1 2 g 9
3 b d 6 2 g 4
4 b e 1 2 g 8
5 b e 1 2 g 2
I tried something like this:
df.groupby(['col1','col5']).agg({['col6','col3']:'mean',['col4','col2']:'first'})
expecting output
col1 col5 col6 col3 col4 col2
a f 6 1 2 c
a g 9 1 2 d
b g 4 3 2 e
but it seems a list is not an option here. In my real dataset I have hundreds of columns of different types, so I can't pass them individually. Any thoughts on passing them as lists?

If you have one list of columns per aggregation, you can do:
l_mean = ['col6','col3']
l_first = ['col4','col2']
df.groupby(['col1','col5']).agg({**{col:'mean' for col in l_mean},
                                 **{col:'first' for col in l_first}})
The notation **{} unpacks a dictionary, so {**{}, **{}} creates one dictionary from two (it could be more than two), like a union of dictionaries. And {col:'mean' for col in l_mean} is a dictionary comprehension that creates a dictionary with each column of the list as a key and 'mean' as the value.
Or using concat:
gr = df.groupby(['col1','col5'])
pd.concat([gr[l_mean].mean(),
           gr[l_first].first()],
          axis=1)
Then call reset_index on the result to get the expected output.
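A self-contained sketch of the dictionary approach on the question's data. The names agg_spec and out are illustrative, and dict.fromkeys is used here as a shorthand for the dictionary comprehensions above.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": list("aaabbb"),
    "col2": list("ccddee"),
    "col3": [1, 1, 1, 6, 1, 1],
    "col4": [2, 2, 2, 2, 2, 2],
    "col5": list("ffgggg"),
    "col6": [5, 7, 9, 4, 8, 2],
})

l_mean = ["col6", "col3"]    # columns to average
l_first = ["col4", "col2"]   # columns to take the first value of

# Build {column: aggregation} from the two lists, then merge the dicts.
agg_spec = {**dict.fromkeys(l_mean, "mean"),
            **dict.fromkeys(l_first, "first")}

out = df.groupby(["col1", "col5"]).agg(agg_spec).reset_index()
print(out)
```

For the sample data, the (b, g) group gives a col6 mean of about 4.67 and a first col2 value of d.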

(
    df.groupby(['col1','col5'])
      .agg(col6=('col6', 'mean'),
           col3=('col3', 'mean'),
           col4=('col4', 'first'),
           col2=('col2', 'first'))
)

This is an extension of @Ben.T's solution, just wrapping it in a function and passing it via the pipe method:
#set the list1, list2
def fil(grp, list1, list2):
    # select the columns first so that mean() only sees the numeric ones
    A = grp[list1].mean()
    B = grp[list2].first()
    C = A.join(B)
    return C
grp1 = ['col6','col3']
grp2 = ['col4','col2']
m = df.groupby(['col1','col5']).pipe(fil, grp1, grp2)
m

Sort and align 2 dataframes by values in corresponding columns

I have 2 dataframes that I want to sort that are similar in structure to what I have shown below, but the rows of values when looking at only the first 3 columns are jumbled. How do I sort the dataframes such that the row indices match?
Also it could so happen that there may not be matching rows in which case I want to create a blank entry in the other dataframe at that index. How would I go about doing this?
Dataframe1:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
Dataframe2:
Col1 Col2 Col3 Col4
0 f e g 6
1 a b c 5
2 b c d 3

Is this what you want?:
import pandas as pd
df=pd.DataFrame({'a':[1,3,2],'b':[4,6,5]})
print(df.sort_values(df.columns.tolist()))
Output:
a b
0 1 4
2 2 5
1 3 6

"How do I sort the dataframes such that the row indices match?"
You can sort both data frames by the columns that should determine the order, then reset the index.
cols = ['Col1', 'Col2', 'Col3']
df1.sort_values(cols).reset_index(drop=True)
#outputs:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
df2.sort_values(cols).reset_index(drop=True)
#outputs:
Col1 Col2 Col3 Col4
0 a b c 5
1 b c d 3
2 f e g 6
"...there may not be matching rows in which case I want to create a blank entry in the other dataframe at that index"
Let's add one more row to df1:
df1 = pd.DataFrame({
    'Col1': list('abfh'),
    'Col2': list('bceg'),
    'Col3': list('cdgi'),
    'Col4': [1,4,5,7]
})
df1
# outputs:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
3 h g i 7
We can use a left join from df1 to add a blank row to df2, where every column is NaN at index 3.
If you have already sorted both dataframes, you can merge on the indexes:
df3 = df1.merge(df2, 'left', left_index=True, right_index=True, suffixes=('_x', ''))
Otherwise, merge on the columns that *should* determine the sort order. This creates a new dataframe with the joined values, sorted in the same way as df1:
df3 = df1.merge(df2, 'left', on=cols, suffixes=('_x', ''))
Then filter out the columns that came from the left data frame:
df3.iloc[:, ~df3.columns.str.endswith('_x')]
#outputs (for the index-based merge, with df2 still in its original, unsorted order):
Col1 Col2 Col3 Col4
0 f e g 6.0
1 a b c 5.0
2 b c d 3.0
3 NaN NaN NaN NaN
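Putting the pieces together, here is a self-contained sketch that aligns df2 to a sorted df1 by merging on the key columns. The use of indicator=True and the names left, merged, missing and df2_aligned are choices made for this example rather than part of the answer above.

```python
import numpy as np
import pandas as pd

cols = ["Col1", "Col2", "Col3"]   # columns that identify a row

df1 = pd.DataFrame({"Col1": list("abfh"), "Col2": list("bceg"),
                    "Col3": list("cdgi"), "Col4": [1, 4, 5, 7]})
df2 = pd.DataFrame({"Col1": list("fab"), "Col2": list("ebc"),
                    "Col3": list("gcd"), "Col4": [6, 5, 3]})

# Sort df1 by the key columns and give it a clean 0..n-1 index.
left = df1.sort_values(cols).reset_index(drop=True)

# A left merge keeps the row order of `left`; `_merge` flags rows
# that have no partner in df2.
merged = left[cols].merge(df2, how="left", on=cols, indicator=True)
missing = merged["_merge"].eq("left_only")

# Blank out the key columns as well for the rows df2 does not have.
df2_aligned = merged.drop(columns="_merge")
df2_aligned.loc[missing, cols] = np.nan
print(df2_aligned)
```

After this, left and df2_aligned have the same length and matching rows share the same index; the row for (h, g, i), which exists only in df1, is all NaN in df2_aligned.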
