Sort multiIndex dataframe given the sum

Hello I have prepared a MultiIndex table in Pandas that looks like this:
Lang                 C++  java  python  All
Corp     Name
ASW      ASW         0.0   7.0     8.0   15
         Cristiano   NaN   NaN     8.0    8
         Michael     NaN   7.0       0    7
Facebook Facebook    8.0   1.0     5.0   14
         Piter       8.0   NaN     NaN    8
         Cristiano   NaN   NaN     3.0    3
         Michael     NaN   1.0     2.0    3
Google   Google      2.0  24.0     1.0   27
         Michael     NaN  24.0     NaN   24
         Piter       2.0   NaN     NaN    2
         Cristiano   NaN   NaN     1.0    1
Now I would like to sort the groups of rows so that the sum per Corp (in column "All") is in descending order, and then select only the two "Corp" index values (and their rows) with the largest sums.
It should look like:
Lang                 C++  java  python  All
Corp     Name
Google   Google      2.0  24.0     1.0   27
         Michael     NaN  24.0     NaN   24
         Piter       2.0   NaN     NaN    2
         Cristiano   NaN   NaN     1.0    1
ASW      ASW         0.0   7.0     8.0   15
         Cristiano   NaN   NaN     8.0    8
         Michael     NaN   7.0       0    7
Thank You!

IIUC, you can sort_values per group, then slice using the largest sums of All per group:
out = (df
   # for each company, sort the rows by the All column in descending order
   # (group_keys=False avoids adding a duplicate Corp index level)
   .groupby(level=0, group_keys=False)
   .apply(lambda g: g.sort_values('All', ascending=False))
   # calculate the sum of All per company,
   # get the index of the top 2 companies (nlargest(2)),
   # and slice to keep only those
   .loc[lambda d: d.groupby(level=0)['All'].sum().nlargest(2).index]
)
output (for the table in the question, where the per-company sums of All are Google 54, ASW 30, Facebook 28):
Lang                 C++  java  python  All
Corp     Name
Google   Google      2.0  24.0     1.0   27
         Michael     NaN  24.0     NaN   24
         Piter       2.0   NaN     NaN    2
         Cristiano   NaN   NaN     1.0    1
ASW      ASW         0.0   7.0     8.0   15
         Cristiano   NaN   NaN     8.0    8
         Michael     NaN   7.0     0.0    7
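For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the question's table and picks the top two companies without groupby.apply. It is an alternative to the answer above, not the same method: the helper column name "_rank" is made up for illustration, and the sort assumes the per-company sums are distinct.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Rebuild the question's table (values copied from the post).
index = pd.MultiIndex.from_tuples(
    [
        ("ASW", "ASW"), ("ASW", "Cristiano"), ("ASW", "Michael"),
        ("Facebook", "Facebook"), ("Facebook", "Piter"),
        ("Facebook", "Cristiano"), ("Facebook", "Michael"),
        ("Google", "Google"), ("Google", "Michael"),
        ("Google", "Piter"), ("Google", "Cristiano"),
    ],
    names=["Corp", "Name"],
)
df = pd.DataFrame(
    {
        "C++": [0, nan, nan, 8, 8, nan, nan, 2, nan, 2, nan],
        "java": [7, nan, 7, 1, nan, nan, 1, 24, 24, nan, nan],
        "python": [8, 8, 0, 5, nan, 3, 2, 1, nan, nan, 1],
        "All": [15, 8, 7, 14, 8, 3, 3, 27, 24, 2, 1],
    },
    index=index,
)

# Sum of "All" per company, keeping the two largest.
top = df.groupby(level="Corp")["All"].sum().nlargest(2)

# Keep only the rows of those companies.
out = df[df.index.get_level_values("Corp").isin(top.index)].copy()

# "_rank" (hypothetical helper column) holds each row's company total,
# so sorting on it orders the companies; "All" orders rows within a company.
out["_rank"] = out.groupby(level="Corp")["All"].transform("sum")
out = out.sort_values(["_rank", "All"], ascending=False).drop(columns="_rank")
print(out)
```

With the data above, the per-company sums are Google 54, ASW 30 and Facebook 28, so Google and ASW are kept.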

Related

How to use grouped rows in pandas

Hello, I have a table with a MultiIndex:
Lang                 C++  java  python  All
Corp     Name
ASW      ASW         0.0   0.0     5.0    5
Facebook Facebook    8.0   1.0     5.0   14
Google   Google      2.0  24.0     1.0   27
ASW      Cristiano   NaN   NaN     5.0    5
Facebook Cristiano   NaN   NaN     3.0    3
         Michael     NaN   1.0     2.0    3
         Piter       8.0   NaN     NaN    8
Google   Cristiano   NaN   NaN     1.0    1
         Michael     NaN  24.0     NaN   24
         Piter       2.0   NaN     NaN    2
I am trying to use this code:
out = df.groupby(level=0).apply(lambda g: g.sort_values('All', ascending=False))
But it adds one more index level. How can I use this code without adding the extra level?
I don't want to add an index level and then delete it.
Thank you in advance!

Add the group_keys=False parameter to DataFrame.groupby:
out = (df.groupby(level=0, group_keys=False)
         .apply(lambda g: g.sort_values('All', ascending=False)))
print(out)
                     C++  java  python  All
Corp     Name
ASW      ASW         0.0   0.0     5.0    5
         Cristiano   NaN   NaN     5.0    5
Facebook Facebook    8.0   1.0     5.0   14
         Piter       8.0   NaN     NaN    8
         Cristiano   NaN   NaN     3.0    3
         Michael     NaN   1.0     2.0    3
Google   Google      2.0  24.0     1.0   27
         Michael     NaN  24.0     NaN   24
         Piter       2.0   NaN     NaN    2
         Cristiano   NaN   NaN     1.0    1
A better, faster and simpler solution is to sort by the MultiIndex level and the column together:
out = df.sort_values(['Corp','All'], ascending=[True, False])
print(out)
                     C++  java  python  All
Corp     Name
ASW      ASW         0.0   0.0     5.0    5
         Cristiano   NaN   NaN     5.0    5
Facebook Facebook    8.0   1.0     5.0   14
         Piter       8.0   NaN     NaN    8
         Cristiano   NaN   NaN     3.0    3
         Michael     NaN   1.0     2.0    3
Google   Google      2.0  24.0     1.0   27
         Michael     NaN  24.0     NaN   24
         Piter       2.0   NaN     NaN    2
         Cristiano   NaN   NaN     1.0    1
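A small self-contained check of the sort_values approach, in case it helps. The frame below is made up (only the Corp/Name index layout follows the question); it shows that an index level name and a column name can be mixed in one sort_values call and that no extra index level appears.

```python
import pandas as pd

# Made-up rows in shuffled order, using the same (Corp, Name) index layout.
idx = pd.MultiIndex.from_tuples(
    [
        ("Google", "Piter"),
        ("ASW", "Cristiano"),
        ("Google", "Google"),
        ("ASW", "ASW"),
        ("Google", "Michael"),
    ],
    names=["Corp", "Name"],
)
df = pd.DataFrame({"All": [2, 5, 27, 6, 24]}, index=idx)

# "Corp" is an index level and "All" is a column; sort_values accepts both by name.
out = df.sort_values(["Corp", "All"], ascending=[True, False])
print(out)
```

The result keeps the original two index levels, which is why no reset or droplevel step is needed afterwards.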

How to join two dataframe with same category?

I have two dataframes. I used groupby and count() to produce this dataframe (df1), which holds the total count for each category. However, groupby left out the categories whose count is 0. How can I use Python to get a dataframe that also contains all of the required categories?
Original dataframe:
           Cat    UR3   VR1    VR   VR3
0        ATIDS  137.0  99.0  40.0  84.0
1   BasicCrane    2.0   8.0   3.0   1.0
2  Beam Sensor   27.0  12.0  13.0  14.0
3         CLPS    1.0   NaN   NaN   1.0
(required categories: ATIDS, BasicCrane, LLP, Beam Sensor, CLPS, SPR)
Expected dataframe (the count for 'LLP' and 'SPR' is 0):
           Cat    UR3   VR1    VR   VR3
0        ATIDS  137.0  99.0  40.0  84.0
1   BasicCrane    2.0   8.0   3.0   1.0
2          LLP    NaN   NaN   NaN   NaN
3  Beam Sensor   27.0  12.0  13.0  14.0
4         CLPS    1.0   NaN   NaN   1.0
5          SPR    NaN   NaN   NaN   NaN

>>> categories
['ATIDS', 'BasicCrane', 'LLP', 'Beam Sensor', 'CLPS', 'SPR']
>>> pd.merge(pd.DataFrame({'Cat': categories}), df, how='outer')
           Cat    UR3   VR1    VR   VR3
0        ATIDS  137.0  99.0  40.0  84.0
1   BasicCrane    2.0   8.0   3.0   1.0
2          LLP    NaN   NaN   NaN   NaN
3  Beam Sensor   27.0  12.0  13.0  14.0
4         CLPS    1.0   NaN   NaN   1.0
5          SPR    NaN   NaN   NaN   NaN
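A self-contained variant of the merge above, as a sketch. It swaps how='outer' for how='left': since the list of categories already contains every key, a left merge returns the same rows, and it is documented to preserve the key order of the left frame. Whether an outer merge keeps that order may depend on the pandas version, so treat that as an assumption to check on your installation.

```python
import numpy as np
import pandas as pd

# The counted frame from the question.
df = pd.DataFrame({
    "Cat": ["ATIDS", "BasicCrane", "Beam Sensor", "CLPS"],
    "UR3": [137.0, 2.0, 27.0, 1.0],
    "VR1": [99.0, 8.0, 12.0, np.nan],
    "VR": [40.0, 3.0, 13.0, np.nan],
    "VR3": [84.0, 1.0, 14.0, 1.0],
})
categories = ["ATIDS", "BasicCrane", "LLP", "Beam Sensor", "CLPS", "SPR"]

# A left merge keeps every row of the left frame, in the left frame's order;
# categories missing from df get NaN in all value columns.
out = pd.DataFrame({"Cat": categories}).merge(df, on="Cat", how="left")
print(out)
```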

One easy way is to fill the NaN values with 0 before doing the groupby. All of the zero entries (previously NaN) will then be treated as zero.
df = df.fillna(0)
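If the goal is a count of 0 (rather than NaN) for categories that have no rows at all, another option is to reindex the counted result. This is a different technique from the fillna suggestion above, and the raw records below are hypothetical, since the question only shows the already-counted table.

```python
import pandas as pd

# Hypothetical raw records; not taken from the question.
raw = pd.DataFrame({
    "Cat": ["ATIDS", "ATIDS", "CLPS", "BasicCrane"],
    "UR3": [1.0, 2.0, 3.0, None],
})
categories = ["ATIDS", "BasicCrane", "LLP", "Beam Sensor", "CLPS", "SPR"]

# count() skips NaN values, and categories with no rows are absent from the result.
counts = raw.groupby("Cat").count()

# reindex adds the missing categories; fill_value=0 gives them a count of 0.
counts = counts.reindex(categories, fill_value=0)
print(counts)
```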

Convert two pandas rows into one

I want to convert the dataframe below,
   ID     TYPE     A     B
0   1  MISSING   0.0   0.0
1   2       1T   1.0   2.0
2   2       2T   3.0   4.0
3   3  MISSING   0.0   0.0
4   4       2T  10.0   4.0
5   5      CBN  15.0  20.0
6   5      DSV  25.0  35.0
to:
   ID  MISSING_A  MISSING_B  1T_A  1T_B  2T_A  2T_B  CBN_A  CBN_B  DSV_A  DSV_B
0   1        0.0        0.0   NaN   NaN   NaN   NaN    NaN    NaN    NaN    NaN
1   2        NaN        NaN   1.0   2.0   3.0   4.0    NaN    NaN    NaN    NaN
3   3        0.0        0.0   NaN   NaN   NaN   NaN    NaN    NaN    NaN    NaN
4   4        NaN        NaN   NaN   NaN  10.0   4.0    NaN    NaN    NaN    NaN
5   5        NaN        NaN   NaN   NaN   NaN   NaN   15.0   20.0   25.0   35.0
For IDs with multiple types, the multiple rows for A and B should be merged into one row, as shown above.

You are looking for a pivot, which will give you MultiIndex columns. You then need to join the column levels to get the suffixed names you are looking for.
df = df.pivot(index='ID', columns='TYPE', values=['A','B'])
df.columns = ['_'.join(reversed(col)).strip() for col in df.columns.values]
df = df.reset_index()
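Here is the same idea as a runnable sketch on the question's data. The extra sort_index(axis=1) step is an addition of mine to put each type's A and B columns next to each other; the resulting column order is alphabetical by type, which differs from the order shown in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 2, 3, 4, 5, 5],
    "TYPE": ["MISSING", "1T", "2T", "MISSING", "2T", "CBN", "DSV"],
    "A": [0.0, 1.0, 3.0, 0.0, 10.0, 15.0, 25.0],
    "B": [0.0, 2.0, 4.0, 0.0, 4.0, 20.0, 35.0],
})

# One row per ID; columns become tuples such as ("A", "2T").
wide = df.pivot(index="ID", columns="TYPE", values=["A", "B"])

# Flatten each (value, type) tuple into a name such as "2T_A".
wide.columns = [f"{typ}_{val}" for val, typ in wide.columns]

# Sort the flattened names so each type's A and B columns sit side by side.
wide = wide.sort_index(axis=1).reset_index()
print(wide)
```

Note that pivot raises an error if an (ID, TYPE) pair occurs more than once; pivot_table with an aggregation function is the usual fallback in that case.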

Replace row in pandas with next row only when the entire row (each column) has NaN values

I have a DataFrame where I want to replace only the rows that have NaN in every column with the row below them. I tried solutions from multiple posts and used ffill, but that filled only a few cells and not the entire row.
    ss    s     h     b   sb
0  NaN  NaN   NaN   NaN  NaN
1  3.0  NaN  14.0   NaN  8.0
2  9.0  8.0  23.0   NaN  2.0
3  NaN  NaN   NaN   NaN  NaN
4  NaN  NaN   NaN   NaN  NaN
5  1.0  6.0   7.0  11.0  3.0
Expected output:
    ss    s     h     b   sb
0  3.0  NaN  14.0   NaN  8.0
1  3.0  NaN  14.0   NaN  8.0
2  9.0  8.0  23.0   NaN  2.0
3  1.0  6.0   7.0  11.0  3.0
4  1.0  6.0   7.0  11.0  3.0
5  1.0  6.0   7.0  11.0  3.0

You can build groups by flagging the rows that are not entirely missing, taking the cumulative sum of those flags in reversed row order, and grouping by the result; then back-fill within each group with GroupBy.bfill:
df = df.groupby((df.notna().any(axis=1)).iloc[::-1].cumsum().iloc[::-1]).bfill()
print(df)
    ss    s     h     b   sb
0  3.0  NaN  14.0   NaN  8.0
1  3.0  NaN  14.0   NaN  8.0
2  9.0  8.0  23.0   NaN  2.0
3  1.0  6.0   7.0  11.0  3.0
4  1.0  6.0   7.0  11.0  3.0
5  1.0  6.0   7.0  11.0  3.0
Detail:
print ((df.notna().any(axis=1)).iloc[::-1].cumsum().iloc[::-1])
0 3
1 3
2 2
3 1
4 1
5 1
dtype: int32
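The same steps split into named pieces, as a runnable sketch on the question's data. The variable names has_data and group_id are mine, chosen for readability.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    [
        [nan, nan, nan, nan, nan],
        [3.0, nan, 14.0, nan, 8.0],
        [9.0, 8.0, 23.0, nan, 2.0],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [1.0, 6.0, 7.0, 11.0, 3.0],
    ],
    columns=["ss", "s", "h", "b", "sb"],
)

# True for rows that contain at least one non-missing value.
has_data = df.notna().any(axis=1)

# Counting from the bottom up gives each non-empty row, together with the
# all-NaN rows directly above it, one shared group id.
group_id = has_data.iloc[::-1].cumsum().iloc[::-1]

# Back-fill inside each group only, so an all-NaN row copies the next
# non-empty row, while partial NaNs in non-empty rows are left alone.
out = df.groupby(group_id).bfill()
print(out)
```

Because each group contains exactly one non-empty row (its last one), the NaN in column "s" of row 1 stays NaN instead of being filled from row 2.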

ReArrange Pandas DataFrame date columns in date order

I have a pandas dataframe that summarises sales by calendar month & outputs something like:
Month level_0 UNIQUE_ID 102018 112018 12018 122017 122018 22018 32018 42018 52018 62018 72018 82018 92018
0 SOLD_QUANTITY 01 3692.0 5182.0 3223.0 1292.0 2466.0 2396.0 2242.0 2217.0 3590.0 2593.0 1665.0 3371.0 3069.0
1 SOLD_QUANTITY 011 3.0 6.0 NaN NaN 7.0 5.0 2.0 1.0 5.0 NaN 1.0 1.0 3.0
2 SOLD_QUANTITY 02 370.0 130.0 NaN NaN 200.0 NaN NaN 269.0 202.0 NaN 201.0 125.0 360.0
3 SOLD_QUANTITY 03 2.0 6.0 NaN NaN 2.0 1.0 NaN 6.0 11.0 9.0 2.0 3.0 5.0
4 SOLD_QUANTITY 08 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 175.0 NaN NaN
I want to be able to programmatically re-arrange the column headers in ascending date order (e.g. starting 122017, 12018, 22018...). It needs to be programmatic because the list of months is different every time the report runs: it runs every month and covers the last 365 days.
The index data type:
Index(['level_0', 'UNIQUE_ID', '102018', '112018', '12018', '122017', '122018',
'22018', '32018', '42018', '52018', '62018', '72018', '82018', '92018'],
dtype='object', name='Month')

Use set_index to move the non-date columns into the index so that only the date columns remain, convert the column labels to datetimes, get the sorted positions with argsort, then reorder the columns with iloc:
df = df.set_index(['level_0','UNIQUE_ID'])
df = df.iloc[:, pd.to_datetime(df.columns, format='%m%Y').argsort()].reset_index()
print (df)
level_0 UNIQUE_ID 122017 12018 22018 32018 42018 52018 \
0 SOLD_QUANTITY 1 1292.0 3223.0 2396.0 2242.0 2217.0 3590.0
1 SOLD_QUANTITY 11 NaN NaN 5.0 2.0 1.0 5.0
2 SOLD_QUANTITY 2 NaN NaN NaN NaN 269.0 202.0
3 SOLD_QUANTITY 3 NaN NaN 1.0 NaN 6.0 11.0
4 SOLD_QUANTITY 8 NaN NaN NaN NaN NaN NaN
62018 72018 82018 92018 102018 112018 122018
0 2593.0 1665.0 3371.0 3069.0 3692.0 5182.0 2466.0
1 NaN 1.0 1.0 3.0 3.0 6.0 7.0
2 NaN 201.0 125.0 360.0 370.0 130.0 200.0
3 9.0 2.0 3.0 5.0 2.0 6.0 2.0
4 NaN 175.0 NaN NaN NaN NaN NaN
Another idea is to create a monthly PeriodIndex with DatetimeIndex.to_period, so that sort_index can be used:
df = df.set_index(['level_0','UNIQUE_ID'])
df.columns = pd.to_datetime(df.columns, format='%m%Y').to_period('M')
# alternative: keep the labels as datetimes instead of periods
#df.columns = pd.to_datetime(df.columns, format='%m%Y')
df = df.sort_index(axis=1).reset_index()
print(df)
level_0 UNIQUE_ID 2017-12 2018-01 2018-02 2018-03 2018-04 \
0 SOLD_QUANTITY 1 1292.0 3223.0 2396.0 2242.0 2217.0
1 SOLD_QUANTITY 11 NaN NaN 5.0 2.0 1.0
2 SOLD_QUANTITY 2 NaN NaN NaN NaN 269.0
3 SOLD_QUANTITY 3 NaN NaN 1.0 NaN 6.0
4 SOLD_QUANTITY 8 NaN NaN NaN NaN NaN
2018-05 2018-06 2018-07 2018-08 2018-09 2018-10 2018-11 2018-12
0 3590.0 2593.0 1665.0 3371.0 3069.0 3692.0 5182.0 2466.0
1 5.0 NaN 1.0 1.0 3.0 3.0 6.0 7.0
2 202.0 NaN 201.0 125.0 360.0 370.0 130.0 200.0
3 11.0 9.0 2.0 3.0 5.0 2.0 6.0 2.0
4 NaN NaN 175.0 NaN NaN NaN NaN NaN
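If the original column labels should be kept as they are (e.g. '12018' rather than 2018-01), a plain sorted() with a date-parsing key is another option. This sketch uses the standard library's datetime.strptime instead of the pandas calls above, and the values in the frame are made up; only the labels follow the question's month+year format.

```python
from datetime import datetime

import pandas as pd

# Made-up values; the column labels use the question's "%m%Y" style.
df = pd.DataFrame(
    [["SOLD_QUANTITY", "01", 1.0, 2.0, 3.0, 4.0]],
    columns=["level_0", "UNIQUE_ID", "102018", "12018", "122017", "22018"],
)

id_cols = ["level_0", "UNIQUE_ID"]
date_cols = [c for c in df.columns if c not in id_cols]

# Parse each label as month + four-digit year so the sort is chronological,
# not alphabetical.
ordered = sorted(date_cols, key=lambda c: datetime.strptime(c, "%m%Y"))

# Identifier columns first, then the month columns in date order.
out = df[id_cols + ordered]
print(out.columns.tolist())
```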
