Hello, I have prepared a MultiIndex table in pandas that looks like this:
Lang C++ java python All
Corp Name
ASW ASW 0.0 7.0 8.0 15
Cristiano NaN NaN 8.0 8
Michael NaN 7.0 0 7
Facebook Facebook 8.0 1.0 5.0 14
Piter 8.0 NaN NaN 8
Cristiano NaN NaN 3.0 3
Michael NaN 1.0 2.0 3
Google Google 2.0 24.0 1.0 27
Michael NaN 24.0 NaN 24
Piter 2.0 NaN NaN 2
Cristiano NaN NaN 1.0 1
Now I would like to sort the groups of rows so that the Corp totals (in column "All") are in descending order, and then select only the two largest "Corp" index values (and their rows).
It should look like this:
Lang C++ java python All
Corp Name
Google Google 2.0 24.0 1.0 27
Michael NaN 24.0 NaN 24
Piter 2.0 NaN NaN 2
Cristiano NaN NaN 1.0 1
ASW ASW 0.0 7.0 8.0 15
Cristiano NaN NaN 8.0 8
Michael NaN 7.0 0 7
Thank You!
IIUC, you can sort_values per group, then slice using the max sum of All per group:
out = (df
# for each company, sort the values using the All column in descending order
# (group_keys=False avoids adding a duplicate Corp index level)
.groupby(level=0, group_keys=False).apply(lambda g: g.sort_values('All', ascending=False))
# calculate the sum of All per company
# get the index of the top 2 companies (nlargest(2))
# slice to keep only those
.loc[lambda d: d.groupby(level=0)['All'].sum().nlargest(2).index]
)
output:
Lang C++ java python All
Corp Name
Google Google 2.0 24.0 1.0 27
Michael NaN 24.0 NaN 24
Piter 2.0 NaN NaN 2
Cristiano NaN NaN 1.0 1
ASW ASW 0.0 7.0 8.0 15
Cristiano NaN NaN 8.0 8
Michael NaN 7.0 0 7
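For anyone who wants to try this without the original data, here is a minimal, self-contained sketch. It rebuilds the question's table by hand (the construction code is my assumption, not the asker's) and uses sort_values on the index level name instead of groupby.apply. Note that the per-company sum of All also includes the subtotal row, which does not change the ranking here.

```python
import numpy as np
import pandas as pd

# Rebuild the question's table; the (Corp, Corp) rows are per-company subtotals.
rows = [
    ("ASW", "ASW", 0.0, 7.0, 8.0, 15),
    ("ASW", "Cristiano", np.nan, np.nan, 8.0, 8),
    ("ASW", "Michael", np.nan, 7.0, 0.0, 7),
    ("Facebook", "Facebook", 8.0, 1.0, 5.0, 14),
    ("Facebook", "Piter", 8.0, np.nan, np.nan, 8),
    ("Facebook", "Cristiano", np.nan, np.nan, 3.0, 3),
    ("Facebook", "Michael", np.nan, 1.0, 2.0, 3),
    ("Google", "Google", 2.0, 24.0, 1.0, 27),
    ("Google", "Michael", np.nan, 24.0, np.nan, 24),
    ("Google", "Piter", 2.0, np.nan, np.nan, 2),
    ("Google", "Cristiano", np.nan, np.nan, 1.0, 1),
]
df = pd.DataFrame(rows, columns=["Corp", "Name", "C++", "java", "python", "All"])
df = df.set_index(["Corp", "Name"])
df.columns.name = "Lang"

# Sum of "All" per company, largest first; keep the two biggest companies.
top = df.groupby(level="Corp")["All"].sum().nlargest(2).index

# Sort rows inside each company by "All" (descending), then select the
# top companies in ranked order.
out = df.sort_values(["Corp", "All"], ascending=[True, False]).loc[top]
print(out)
```

With the data as posted, the two companies kept should be Google and ASW, in that order.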
Related
Hello, I have a table with a MultiIndex:
Lang C++ java python All
Corp Name
ASW ASW 0.0 0.0 5.0 5
Facebook Facebook 8.0 1.0 5.0 14
Google Google 2.0 24.0 1.0 27
ASW Cristiano NaN NaN 5.0 5
Facebook Cristiano NaN NaN 3.0 3
Michael NaN 1.0 2.0 3
Piter 8.0 NaN NaN 8
Google Cristiano NaN NaN 1.0 1
Michael NaN 24.0 NaN 24
Piter 2.0 NaN NaN 2
I am trying to use this code:
out = df.groupby(level=0).apply(lambda g: g.sort_values('All', ascending=False))
But it adds one more index level. How can I use this code without adding an index level?
I don't want to add and then delete index levels.
Thank you in advance!
Add the group_keys=False parameter to DataFrame.groupby:
out = (df.groupby(level=0, group_keys=False)
.apply(lambda g: g.sort_values('All', ascending=False)))
print (out)
C++ java python All
Corp Name
ASW ASW 0.0 0.0 5.0 5
Cristiano NaN NaN 5.0 5
Facebook Facebook 8.0 1.0 5.0 14
Piter 8.0 NaN NaN 8
Cristiano NaN NaN 3.0 3
Michael NaN 1.0 2.0 3
Google Google 2.0 24.0 1.0 27
Michael NaN 24.0 NaN 24
Piter 2.0 NaN NaN 2
Cristiano NaN NaN 1.0 1
A better, faster and simpler solution is to sort by the MultiIndex level and the column together:
out = df.sort_values(['Corp','All'], ascending=[True, False])
print (out)
C++ java python All
Corp Name
ASW ASW 0.0 0.0 5.0 5
Cristiano NaN NaN 5.0 5
Facebook Facebook 8.0 1.0 5.0 14
Piter 8.0 NaN NaN 8
Cristiano NaN NaN 3.0 3
Michael NaN 1.0 2.0 3
Google Google 2.0 24.0 1.0 27
Michael NaN 24.0 NaN 24
Piter 2.0 NaN NaN 2
Cristiano NaN NaN 1.0 1
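To see the effect of group_keys and compare the two approaches, here is a small sketch on made-up data (the frame below is hypothetical and has no ties in All, so both approaches should give the same row order):

```python
import pandas as pd

# Hypothetical two-level frame; values chosen so there are no ties in "All".
idx = pd.MultiIndex.from_tuples(
    [("B", "x"), ("A", "y"), ("B", "z"), ("A", "w")], names=["Corp", "Name"]
)
df = pd.DataFrame({"All": [1, 5, 9, 7]}, index=idx)


def sort_desc(g):
    # Sort one company's rows by "All", largest first.
    return g.sort_values("All", ascending=False)


with_keys = df.groupby(level=0).apply(sort_desc)                  # extra level
no_keys = df.groupby(level=0, group_keys=False).apply(sort_desc)  # no extra level
flat = df.sort_values(["Corp", "All"], ascending=[True, False])   # no groupby

print(with_keys.index.nlevels, no_keys.index.nlevels, flat.index.nlevels)
```

The sort_values version avoids calling a Python function per group, which is why it tends to be the simpler choice when all you need is ordering.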
I have two dataframes. I used groupby and count() to produce this dataframe (df1). When I used groupby to count the total for each category, it dropped the categories whose count is 0. How can I get the outcome below in Python?
However, I would like a dataframe that also includes the required categories.
Original dataframe:
Cat UR3 VR1 VR VR3
0 ATIDS 137.0 99.0 40.0 84.0
1 BasicCrane 2.0 8.0 3.0 1.0
2 Beam Sensor 27.0 12.0 13.0 14.0
3 CLPS 1.0 NaN NaN 1.0
(required categories: ATIDS, BasicCrane, LLP, Beam Sensor, CLPS, SPR)
Expected dataframe (The count number of 'LLP' and 'SPR' is 0)
Cat UR3 VR1 VR VR3
0 ATIDS 137.0 99.0 40.0 84.0
1 BasicCrane 2.0 8.0 3.0 1.0
2 LLP NaN NaN NaN NaN
3 Beam Sensor 27.0 12.0 13.0 14.0
4 CLPS 1.0 NaN NaN 1.0
5 SPR NaN NaN NaN NaN
>>> categories
['ATIDS', 'BasicCrane', 'LLP', 'Beam Sensor', 'CLPS', 'SPR']
>>> pd.merge(pd.DataFrame({'Cat': categories}), df, how='outer')
Cat UR3 VR1 VR VR3
0 ATIDS 137.0 99.0 40.0 84.0
1 BasicCrane 2.0 8.0 3.0 1.0
2 LLP NaN NaN NaN NaN
3 Beam Sensor 27.0 12.0 13.0 14.0
4 CLPS 1.0 NaN NaN 1.0
5 SPR NaN NaN NaN NaN
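Depending on the pandas version, an outer merge may return the keys sorted rather than in the order of the categories list. A minimal sketch of two alternatives that keep the list order, assuming the frame is rebuilt as below (the construction code is mine): a left merge from the categories frame, or a reindex on Cat.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Cat": ["ATIDS", "BasicCrane", "Beam Sensor", "CLPS"],
    "UR3": [137.0, 2.0, 27.0, 1.0],
    "VR1": [99.0, 8.0, 12.0, np.nan],
    "VR": [40.0, 3.0, 13.0, np.nan],
    "VR3": [84.0, 1.0, 14.0, 1.0],
})
categories = ["ATIDS", "BasicCrane", "LLP", "Beam Sensor", "CLPS", "SPR"]

# Left merge: one output row per required category, in list order.
merged = pd.DataFrame({"Cat": categories}).merge(df, on="Cat", how="left")

# Same idea via reindex: missing categories become all-NaN rows.
reindexed = df.set_index("Cat").reindex(categories).reset_index()

print(merged)
```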
One easy way is to fill the NaN values with 0 before doing the groupby. All the zero data (previously NaN) will then be counted as zero.
df.fillna(0)
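If the goal is an explicit zero for categories that never occur in the data, another option is to reindex the count result against the required list. This is only a sketch: the raw records and the column names Cat and Site are hypothetical, since the question does not show the data before the groupby.

```python
import pandas as pd

# Hypothetical raw records; "Cat" and "Site" are assumed column names.
raw = pd.DataFrame({
    "Cat": ["ATIDS", "ATIDS", "CLPS", "BasicCrane"],
    "Site": ["UR3", "VR1", "UR3", "UR3"],
})
categories = ["ATIDS", "BasicCrane", "LLP", "Beam Sensor", "CLPS", "SPR"]

# count() only reports categories that appear; reindex adds the rest as 0.
counts = raw.groupby("Cat")["Site"].count().reindex(categories, fill_value=0)
print(counts)
```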
I want to convert the dataframe below,
ID TYPE A B
0 1 MISSING 0.0 0.0
1 2 1T 1.0 2.0
2 2 2T 3.0 4.0
3 3 MISSING 0.0 0.0
4 4 2T 10.0 4.0
5 5 CBN 15.0 20.0
6 5 DSV 25.0 35.0
to:
ID MISSING_A MISSING_B 1T_A 1T_B 2T_A 2T_B CBN_A CBN_B DSV_A DSV_B
0 1 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
1 2 NaN NaN 1.0 2.0 3.0 4.0 NaN NaN NaN NaN
3 3 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
4 4 NaN NaN NaN NaN 10.0 4.0 NaN NaN NaN NaN
5 5 NaN NaN NaN NaN NaN NaN 15.0 20.0 25.0 35.0
For IDs with multiple types, the multiple rows of A and B should be merged into one row, as shown above.
You are looking for a pivot, which will end up giving you a multi-index. You'll need to join those columns to get the suffix you are looking for.
df = df.pivot(index='ID',columns='TYPE', values=['A','B'])
df.columns = ['_'.join(reversed(col)).strip() for col in df.columns.values]
df = df.reset_index()
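A self-contained sketch of the pivot approach, with the input rebuilt from the question. The swaplevel and sort_index step is my addition so that the A and B columns of each TYPE sit next to each other; the resulting column order is alphabetical by TYPE, which differs from the order in the question's expected output.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 2, 3, 4, 5, 5],
    "TYPE": ["MISSING", "1T", "2T", "MISSING", "2T", "CBN", "DSV"],
    "A": [0.0, 1.0, 3.0, 0.0, 10.0, 15.0, 25.0],
    "B": [0.0, 2.0, 4.0, 0.0, 4.0, 20.0, 35.0],
})

# One row per ID; columns become (value, TYPE) pairs.
wide = df.pivot(index="ID", columns="TYPE", values=["A", "B"])

# Put TYPE first and sort so each TYPE's A and B columns are adjacent.
wide = wide.swaplevel(axis=1).sort_index(axis=1)

# Flatten the two column levels into "TYPE_value" names.
wide.columns = [f"{typ}_{val}" for typ, val in wide.columns]
wide = wide.reset_index()
print(wide)
```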
I have a DataFrame where I want to replace only the rows that are entirely NaN with the next non-empty row below them. I tried solutions from multiple posts and used ffill, but that filled only a few cells and not the entire row.
ss s h b sb
0 NaN NaN NaN NaN NaN
1 3.0 NaN 14.0 NaN 8.0
2 9.0 8.0 23.0 NaN 2.0
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 1.0 6.0 7.0 11.0 3.0
Expected output:
ss s h b sb
0 3.0 NaN 14.0 NaN 8.0
1 3.0 NaN 14.0 NaN 8.0
2 9.0 8.0 23.0 NaN 2.0
3 1.0 6.0 7.0 11.0 3.0
4 1.0 6.0 7.0 11.0 3.0
5 1.0 6.0 7.0 11.0 3.0
You can create groups by testing for rows that contain at least one non-missing value, taking the cumulative sum in reversed row order, and passing the result to GroupBy.bfill:
df = df.groupby((df.notna().any(axis=1)).iloc[::-1].cumsum().iloc[::-1]).bfill()
print (df)
ss s h b sb
0 3.0 NaN 14.0 NaN 8.0
1 3.0 NaN 14.0 NaN 8.0
2 9.0 8.0 23.0 NaN 2.0
3 1.0 6.0 7.0 11.0 3.0
4 1.0 6.0 7.0 11.0 3.0
5 1.0 6.0 7.0 11.0 3.0
Detail:
print ((df.notna().any(axis=1)).iloc[::-1].cumsum().iloc[::-1])
0 3
1 3
2 2
3 1
4 1
5 1
dtype: int32
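The same steps, split up and runnable on the question's data (rebuilt by hand here), in case the one-liner is hard to follow. The variable names has_data and groups are mine.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "ss": [nan, 3.0, 9.0, nan, nan, 1.0],
    "s": [nan, nan, 8.0, nan, nan, 6.0],
    "h": [nan, 14.0, 23.0, nan, nan, 7.0],
    "b": [nan, nan, nan, nan, nan, 11.0],
    "sb": [nan, 8.0, 2.0, nan, nan, 3.0],
})

# True for rows that hold at least one value, False for all-NaN rows.
has_data = df.notna().any(axis=1)

# Cumulative sum from the bottom up: each all-NaN row ends up in the same
# group as the next row below it that has data.
groups = has_data.iloc[::-1].cumsum().iloc[::-1]

# Back-fill only inside each group, so partially filled rows keep their NaNs.
out = df.groupby(groups).bfill()
print(out)
```

Because each group contains exactly one row with data (its last row), only the all-NaN rows above it are filled; NaNs in rows such as row 1 stay as they are.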
I have a pandas dataframe that summarises sales by calendar month & outputs something like:
Month level_0 UNIQUE_ID 102018 112018 12018 122017 122018 22018 32018 42018 52018 62018 72018 82018 92018
0 SOLD_QUANTITY 01 3692.0 5182.0 3223.0 1292.0 2466.0 2396.0 2242.0 2217.0 3590.0 2593.0 1665.0 3371.0 3069.0
1 SOLD_QUANTITY 011 3.0 6.0 NaN NaN 7.0 5.0 2.0 1.0 5.0 NaN 1.0 1.0 3.0
2 SOLD_QUANTITY 02 370.0 130.0 NaN NaN 200.0 NaN NaN 269.0 202.0 NaN 201.0 125.0 360.0
3 SOLD_QUANTITY 03 2.0 6.0 NaN NaN 2.0 1.0 NaN 6.0 11.0 9.0 2.0 3.0 5.0
4 SOLD_QUANTITY 08 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 175.0 NaN NaN
I want to be able to programmatically re-arrange the column headers in ascending date order (e.g. starting 122017, 12018, 22018...). I need to do it programmatically because every time the report runs it will have a different list of months, as it runs every month for the last 365 days.
The index data type:
Index(['level_0', 'UNIQUE_ID', '102018', '112018', '12018', '122017', '122018',
'22018', '32018', '42018', '52018', '62018', '72018', '82018', '92018'],
dtype='object', name='Month')
Use set_index so that only the date columns remain as columns, convert them to datetimes, get the sorted positions with argsort, then reorder with iloc:
df = df.set_index(['level_0','UNIQUE_ID'])
df = df.iloc[:, pd.to_datetime(df.columns, format='%m%Y').argsort()].reset_index()
print (df)
level_0 UNIQUE_ID 122017 12018 22018 32018 42018 52018 \
0 SOLD_QUANTITY 1 1292.0 3223.0 2396.0 2242.0 2217.0 3590.0
1 SOLD_QUANTITY 11 NaN NaN 5.0 2.0 1.0 5.0
2 SOLD_QUANTITY 2 NaN NaN NaN NaN 269.0 202.0
3 SOLD_QUANTITY 3 NaN NaN 1.0 NaN 6.0 11.0
4 SOLD_QUANTITY 8 NaN NaN NaN NaN NaN NaN
62018 72018 82018 92018 102018 112018 122018
0 2593.0 1665.0 3371.0 3069.0 3692.0 5182.0 2466.0
1 NaN 1.0 1.0 3.0 3.0 6.0 7.0
2 NaN 201.0 125.0 360.0 370.0 130.0 200.0
3 9.0 2.0 3.0 5.0 2.0 6.0 2.0
4 NaN 175.0 NaN NaN NaN NaN NaN
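If you would rather not move the id columns into the index and back, the same argsort idea can be wrapped in a small function. This is a sketch only: sort_month_columns is a hypothetical helper name, and the sample frame is a cut-down version of the question's data.

```python
import pandas as pd


def sort_month_columns(df, id_cols):
    """Hypothetical helper: id columns first, then 'mYYYY' columns by date."""
    month_cols = [c for c in df.columns if c not in id_cols]
    # Parse labels such as '12018' or '122017' as month + 4-digit year.
    order = pd.to_datetime(month_cols, format="%m%Y").argsort()
    return df[list(id_cols) + [month_cols[i] for i in order]]


df = pd.DataFrame(
    [["SOLD_QUANTITY", "01", 3692.0, 5182.0, 3223.0, 1292.0, 2466.0, 2396.0]],
    columns=["level_0", "UNIQUE_ID", "102018", "112018", "12018",
             "122017", "122018", "22018"],
)
out = sort_month_columns(df, ["level_0", "UNIQUE_ID"])
print(out.columns.tolist())
```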
Another idea is to create a monthly period index with DatetimeIndex.to_period, so that it is possible to use sort_index:
df = df.set_index(['level_0','UNIQUE_ID'])
df.columns = pd.to_datetime(df.columns, format='%m%Y').to_period('M')
#alternative for convert to datetimes
#df.columns = pd.to_datetime(df.columns, format='%m%Y')
df = df.sort_index(axis=1).reset_index()
print (df)
level_0 UNIQUE_ID 2017-12 2018-01 2018-02 2018-03 2018-04 \
0 SOLD_QUANTITY 1 1292.0 3223.0 2396.0 2242.0 2217.0
1 SOLD_QUANTITY 11 NaN NaN 5.0 2.0 1.0
2 SOLD_QUANTITY 2 NaN NaN NaN NaN 269.0
3 SOLD_QUANTITY 3 NaN NaN 1.0 NaN 6.0
4 SOLD_QUANTITY 8 NaN NaN NaN NaN NaN
2018-05 2018-06 2018-07 2018-08 2018-09 2018-10 2018-11 2018-12
0 3590.0 2593.0 1665.0 3371.0 3069.0 3692.0 5182.0 2466.0
1 5.0 NaN 1.0 1.0 3.0 3.0 6.0 7.0
2 202.0 NaN 201.0 125.0 360.0 370.0 130.0 200.0
3 11.0 9.0 2.0 3.0 5.0 2.0 6.0 2.0
4 NaN NaN 175.0 NaN NaN NaN NaN NaN
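A compact, runnable version of the period approach on a made-up one-row frame (the values are placeholders). It also shows how to get string labels back from the PeriodIndex if the report needs plain text headers.

```python
import pandas as pd

# Placeholder data: one row, four month columns in 'mYYYY' form.
df = pd.DataFrame([[1, 2, 3, 4]], columns=["102018", "12018", "122017", "22018"])

# Convert the labels to monthly periods, then sort the columns by period.
df.columns = pd.to_datetime(df.columns, format="%m%Y").to_period("M")
df = df.sort_index(axis=1)

# Optional: turn the periods back into strings such as '2017-12'.
labels = df.columns.strftime("%Y-%m").tolist()
print(labels)
```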