I want to group a DataFrame and take the n largest values of column 'C' per group, but the return is a Series, not a DataFrame.
dftest = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9,10],
                       'B':['A','B','A','B','A','B','A','B','B','B'],
                       'C':[0,0,1,1,2,2,3,3,4,4]})
dfn=dftest.groupby('B',group_keys=False)\
.apply(lambda grp:grp['C'].nlargest(int(grp['C'].count()*0.8))).sort_index()
The result is a Series:
2 1
4 2
5 2
6 3
7 3
8 4
9 4
Name: C, dtype: int64
I want the result to be a DataFrame, like
A B C
2 3 A 1
4 5 A 2
5 6 B 2
6 7 A 3
7 8 B 3
8 9 B 4
9 10 B 4
****** Update ******
Sorry, column 'A' does not in fact contain sequential integers; dftest might look more like
dftest = pd.DataFrame({'A':['Feb','Flow','Air','Flow','Feb','Beta','Cat','Feb','Beta','Air'],
                       'B':['A','B','A','B','A','B','A','B','B','B'],
                       'C':[0,0,1,1,2,2,3,3,4,4]})
and the result should be
A B C
2 Air A 1
4 Feb A 2
5 Beta B 2
6 Cat A 3
7 Feb B 3
8 Beta B 4
9 Air B 4
It may be a bit clumsy, but it does what you asked:
dfn = dftest.groupby('B')\
            .apply(lambda grp: grp['C'].nlargest(int(grp['C'].count()*0.8)))\
            .reset_index().rename(columns={'level_1':'A'})
dfn.A = dfn.A + 1
dfn = dfn[['A','B','C']].sort_values(by='A')
Thanks to my friends, the following code works for me.
dfn=dftest.groupby('B',group_keys=False)\
.apply(lambda grp:grp.nlargest(n=int(grp['C'].count()*0.8),columns='C').sort_index())
dfn is now
In [8]:dfn
Out[8]:
A B C
2 3 A 1
4 5 A 2
6 7 A 3
5 6 B 2
7 8 B 3
8 9 B 4
9 10 B 4
My previous code deals with a Series; the latter deals with a DataFrame.
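If all you need is the matching rows of the original frame, note that the Series result keeps the original row index, so you can simply use it to slice dftest. A minimal sketch (it assumes the updated dftest above):
# take the per-group nlargest as a Series, then use its index
# to pull back the full rows, including the string column 'A'
ser = dftest.groupby('B', group_keys=False)\
            .apply(lambda grp: grp['C'].nlargest(int(grp['C'].count()*0.8)))
dfn = dftest.loc[ser.index.sort_values()]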
I have a pandas DataFrame that looks like this:
a b c
8 3 3
4 3 3
5 3 3
1 9 4
7 3 1
1 3 3
6 3 3
9 7 7
1 7 7
I want to get a DataFrame like this:
a b c
17 3 3
1 9 4
7 3 1
7 3 3
10 7 7
Essentially, I want to add together the values in column a when the values in columns b and c are the same, but I want to do that in sections. groupby wouldn't work here because it would put the DataFrame out of order. I have an iterative solution, but it is messy and not very Pythonic. Is there a way to do this using the functions of the DataFrame?
Let us use shift with cumsum to create the subgroup key:
s = df[['b','c']].ne(df[['b','c']].shift()).any(1).cumsum()
out = df.groupby([s,df.b,df.c]).agg({'a':'sum','b':'first','c':'first'}).reset_index(drop=True)
a b c
0 17 3 3
1 1 9 4
2 7 3 1
3 7 3 3
4 10 7 7
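To see why this works, it helps to print the intermediate key s: it increases whenever the (b, c) pair changes, so each run of consecutive equal pairs gets its own label (values below assume the sample frame above):
print(s.tolist())
# [1, 1, 1, 2, 3, 4, 4, 5, 5]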
Try this:
df.groupby(['b', 'c', df[['b', 'c']].diff().any(axis=1).cumsum()], as_index=False)['a'].sum()
Output:
b c a
0 3 1 7
1 3 3 17
2 3 3 7
3 7 7 10
4 9 4 1
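Note that groupby sorts by the group keys by default, which is why this output is not in the original row order. If the order of first appearance matters, passing sort=False should preserve it; a sketch under that assumption:
out = df.groupby(['b', 'c', df[['b', 'c']].diff().any(axis=1).cumsum()],
                 as_index=False, sort=False)['a'].sum()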
I have a main df like so:
index A B C
5 1 5 8
6 2 4 1
7 8 3 4
8 3 9 5
and an auxiliary df2 that I want to add to the main df like so:
index A B
5 4 2
6 4 3
7 7 1
8 6 2
Columns A & B share the same names; however, the main df contains many columns that the secondary df2 does not. I want to sum the columns that are common and leave the others as they are.
Output:
index A B C
5 5 7 8
6 6 7 1
7 15 4 4
8 9 11 5
I have tried variations of df.join, pd.merge and groupby, but am having no luck at the moment.
Last Attempt:
df.groupby('index').sum().add(df2.groupby('index').sum())
But this does not keep the columns that are not common (C becomes NaN).
With pd.merge I am getting the _x and _y suffixes.
Use add only on the common columns, obtained by intersection:
c = df.columns.intersection(df2.columns)
df[c] = df[c].add(df2[c], fill_value=0)
print (df)
A B C
index
5 5 7 8
6 6 7 1
7 15 4 4
8 9 11 5
If you use add alone, integer columns that are not matched are converted to floats:
df = df.add(df2, fill_value=0)
print (df)
A B C
index
5 5 7 8.0
6 6 7 1.0
7 15 4 4.0
8 9 11 5.0
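If you want the unmatched integer columns back as integers afterwards, one option is to cast the result back to the original dtypes; a minimal sketch, assuming every value can be cast safely:
orig_dtypes = df.dtypes.to_dict()   # remember the original column types
df = df.add(df2, fill_value=0).astype(orig_dtypes)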
EDIT:
If the frames share columns that contain strings:
print (df)
A B C D
index
5 1 5 8 a
6 2 4 1 e
7 8 3 4 r
8 3 9 5 w
print (df2)
A B C D
index
5 1 5 8 a
6 2 4 1 e
7 8 3 4 r
8 3 9 5 w
The solution is similar; just filter for numeric columns first with select_dtypes:
c = df.select_dtypes(np.number).columns.intersection(df2.select_dtypes(np.number).columns)
df[c] = df[c].add(df2[c], fill_value=0)
print (df)
A B C D
index
5 5 7 8 a
6 6 7 1 e
7 15 4 4 r
8 9 11 5 w
Not the cleanest way but it might work.
df_new = pd.DataFrame()
df_new['A'] = df['A'] + df2['A']
df_new['B'] = df['B'] + df2['B']
df_new['C'] = df['C']
print(df_new)
A B C
0 5 7 8
1 6 7 1
2 15 4 4
3 9 11 5
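The same idea can be written a bit more generally by looping over whatever columns the two frames share, so the column names are not hard-coded (a sketch; it assumes the indexes of df and df2 align):
df_new = df.copy()
# sum only the columns present in both frames, leave the rest untouched
for col in df.columns.intersection(df2.columns):
    df_new[col] = df[col] + df2[col]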
Let's suppose I have
ID A1 B1 A2 B2
1 3 4 5 6
2 7 8 9 10
I want to use pandas stack and achieve something like this
ID A B
1 3 4
1 5 6
2 7 8
2 9 10
but what I got is
ID A B
1 3 4
2 7 8
1 5 6
2 9 10
This is what I am using:
df.stack().reset_index()
Is it possible to achieve this using stack? The append() method in pandas does this, but if possible I want to achieve it using stack(). Any ideas?
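For reference, the sample frame above can be built like this (a sketch of the data shown):
import pandas as pd
df = pd.DataFrame({'ID': [1, 2],
                   'A1': [3, 7], 'B1': [4, 8],
                   'A2': [5, 9], 'B2': [6, 10]})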
You can use pd.wide_to_long:
pd.wide_to_long(df, ['A','B'], 'ID', 'value', sep='', suffix='.+')\
.reset_index()\
.sort_values('ID')\
.drop('value', axis=1)
Output:
ID A B
0 1 3 4
2 1 5 6
1 2 7 8
3 2 9 10
Create a new columns object by splitting up the existing column names. This takes for granted that we have single-character letters followed by a single digit.
d = df.set_index('ID')
d.columns = d.columns.map(tuple)
d.stack().reset_index('ID')
ID A B
1 1 3 4
2 1 5 6
1 2 7 8
2 2 9 10
One-liner:
df.set_index('ID').rename(columns=tuple).stack().reset_index('ID')
More generalized
d = df.set_index('ID')
s = d.columns.str
d.columns = [
    s.extract(r'^(\D+)', expand=False),
    s.extract(r'(\d+)$', expand=False)
]
d.stack().reset_index('ID')
A more interesting way:
d = df.set_index('ID')
d.groupby(d.columns.str[0], axis=1).agg(lambda x: x.values.tolist()).stack().apply(pd.Series).unstack(0).T.reset_index(level=0, drop=True)
Out[90]:
A B
ID
1 3 4
2 7 8
1 5 6
2 9 10
I have a dataframe with two numeric columns, A & B. I want to find the top 5 values from col A and return the values from Col B held in the location of those top 5.
Many thanks.
I think you need DataFrame.nlargest with column A for the top 5 rows and then select column B:
df = pd.DataFrame({'A':[4,5,26,43,54,36,18,7,8,9],
'B':range(10)})
print (df)
A B
0 4 0
1 5 1
2 26 2
3 43 3
4 54 4
5 36 5
6 18 6
7 7 7
8 8 8
9 9 9
print (df.nlargest(5, 'A'))
A B
4 54 4
3 43 3
5 36 5
2 26 2
6 18 6
a = df.nlargest(5, 'A')['B']
print (a)
4 4
3 3
5 5
2 2
6 6
Name: B, dtype: int64
Alternative solution with sorting:
a = df.sort_values('A', ascending=False)['B'].head(5)
print (a)
4 4
3 3
5 5
2 2
6 6
Name: B, dtype: int64
The nlargest function on the DataFrame will do the work: df.nlargest(n_rows, 'column_to_sort')
import pandas as pd
df = pd.DataFrame({'A':[1,1,1,2,2,2,2,3,4],'B':[1,2,3,1,2,3,4,1,1]})
df.nlargest(5,'B')
Out[13]:
A B
6 2 4
2 1 3
5 2 3
1 1 2
4 2 2
# if you want only a certain column in the output, then use
df.nlargest(5,'B')['A']
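Selecting column 'A' keeps nlargest's row order, so this should give:
6    2
2    1
5    2
1    1
4    2
Name: A, dtype: int64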
I want to reverse a column's values in my dataframe, but only on an individual "groupby" level. Below you can find a minimal demonstration example, where I want to "flip" the values that belong to the same letter A, B or C:
df = pd.DataFrame({"group":["A","A","A","B","B","B","B","C","C"],
"value": [1,3,2,4,4,2,3,2,5]})
group value
0 A 1
1 A 3
2 A 2
3 B 4
4 B 4
5 B 2
6 B 3
7 C 2
8 C 5
My desired output looks like this (the column is added instead of replaced only for brevity):
group value value_desired
0 A 1 2
1 A 3 3
2 A 2 1
3 B 4 3
4 B 4 2
5 B 2 4
6 B 3 4
7 C 2 5
8 C 5 2
As always, when I don't see a proper vector-style approach, I end up messing with loops just for the sake of the final output, but my current code hurts me very much:
for i in list(set(df["group"].values.tolist())):
    reversed_group = df.loc[df["group"]==i, "value"].values.tolist()[::-1]
    df.loc[df["group"]==i, "value_desired"] = reversed_group
Pandas gurus, please show me the way :)
You can use transform:
In [900]: df.groupby('group')['value'].transform(lambda x: x[::-1])
Out[900]:
0 2
1 3
2 1
3 3
4 2
5 4
6 4
7 5
8 2
Name: value, dtype: int64
Details
In [901]: df['value_desired'] = df.groupby('group')['value'].transform(lambda x: x[::-1])
In [902]: df
Out[902]:
group value value_desired
0 A 1 2
1 A 3 3
2 A 2 1
3 B 4 3
4 B 4 2
5 B 2 4
6 B 3 4
7 C 2 5
8 C 5 2
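One caveat: in some pandas versions, transform aligns a returned Series back to the original index, which can silently undo the reversal. A more defensive variant returns a plain array so that no alignment happens (a sketch):
df['value_desired'] = df.groupby('group')['value'].transform(lambda x: x.to_numpy()[::-1])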