How to use grouped rows in pandas - python

Hello, I have a table with a MultiIndex:
Lang                C++  java  python  All
Corp     Name
ASW      ASW        0.0   0.0     5.0    5
Facebook Facebook   8.0   1.0     5.0   14
Google   Google     2.0  24.0     1.0   27
ASW      Cristiano  NaN   NaN     5.0    5
Facebook Cristiano  NaN   NaN     3.0    3
         Michael    NaN   1.0     2.0    3
         Piter      8.0   NaN     NaN    8
Google   Cristiano  NaN   NaN     1.0    1
         Michael    NaN  24.0     NaN   24
         Piter      2.0   NaN     NaN    2
I am trying to use this code:
out = df.groupby(level=0).apply(lambda g: g.sort_values('All', ascending=False))
But it adds one more index level. How can I do this without adding an index level? I don't want to add and then delete index levels.
Thank you in advance!
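For reproducibility, a minimal sketch that rebuilds the sample frame above (the values are transcribed from the table; the construction itself is an assumption):
import pandas as pd
import numpy as np

idx = pd.MultiIndex.from_tuples(
    [('ASW', 'ASW'), ('Facebook', 'Facebook'), ('Google', 'Google'),
     ('ASW', 'Cristiano'), ('Facebook', 'Cristiano'), ('Facebook', 'Michael'),
     ('Facebook', 'Piter'), ('Google', 'Cristiano'), ('Google', 'Michael'),
     ('Google', 'Piter')], names=['Corp', 'Name'])
df = pd.DataFrame({
    'C++':    [0.0, 8.0, 2.0, np.nan, np.nan, np.nan, 8.0, np.nan, np.nan, 2.0],
    'java':   [0.0, 1.0, 24.0, np.nan, np.nan, 1.0, np.nan, np.nan, 24.0, np.nan],
    'python': [5.0, 5.0, 1.0, 5.0, 3.0, 2.0, np.nan, 1.0, np.nan, np.nan],
    'All':    [5, 14, 27, 5, 3, 3, 8, 1, 24, 2]}, index=idx)
df.columns.name = 'Lang'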

Add the group_keys=False parameter in DataFrame.groupby:
out = (df.groupby(level=0, group_keys=False)
.apply(lambda g: g.sort_values('All', ascending=False)))
print (out)
Lang                C++  java  python  All
Corp     Name
ASW      ASW        0.0   0.0     5.0    5
         Cristiano  NaN   NaN     5.0    5
Facebook Facebook   8.0   1.0     5.0   14
         Piter      8.0   NaN     NaN    8
         Cristiano  NaN   NaN     3.0    3
         Michael    NaN   1.0     2.0    3
Google   Google     2.0  24.0     1.0   27
         Michael    NaN  24.0     NaN   24
         Piter      2.0   NaN     NaN    2
         Cristiano  NaN   NaN     1.0    1
A better/faster/simpler solution is to sort by the MultiIndex level and the column:
out = df.sort_values(['Corp','All'], ascending=[True, False])
print (out)
Lang                C++  java  python  All
Corp     Name
ASW      ASW        0.0   0.0     5.0    5
         Cristiano  NaN   NaN     5.0    5
Facebook Facebook   8.0   1.0     5.0   14
         Piter      8.0   NaN     NaN    8
         Cristiano  NaN   NaN     3.0    3
         Michael    NaN   1.0     2.0    3
Google   Google     2.0  24.0     1.0   27
         Michael    NaN  24.0     NaN   24
         Piter      2.0   NaN     NaN    2
         Cristiano  NaN   NaN     1.0    1

Related

I need to assign some records from one table to another according to some conditions

I have 2 data frames.
train:
rooms bedr bathr surface_t surface_c property_type
0 NaN 4.0 4.0 NaN NaN Casa
1 NaN 3.0 2.0 NaN NaN Apartamento
2 NaN NaN 2.0 NaN NaN Casa
3 NaN NaN 1.0 NaN NaN Otro
4 NaN NaN 2.0 NaN NaN Apartamento
... ... ... ... ... ... ...
197544 3.0 3.0 NaN NaN NaN Apartamento
197545 NaN NaN 1.0 NaN 17.0 Oficina
197546 NaN NaN 1.0 NaN NaN Otro
197547 NaN NaN 2.0 NaN NaN Casa
197548 NaN NaN 1.0 NaN NaN Apartamento
empty: with the mean value for each column according to the type of property
property_type rooms bedrooms bathrooms surface_total surface_covered
0 Apartamento 3.0 3.0 2.0 108.0 113.0
1 Casa 4.0 4.0 3.0 897.0 300.0
2 Finca 4.0 4.0 4.0 14925.0 30939.0
3 Local comercial 3.0 1.0 2.0 180.0 160.0
4 Lote 3.0 1.0 2.0 8979.0 13101.0
5 Oficina 3.0 1.0 2.0 144.0 121.0
6 Otro 6.0 5.0 3.0 991.0 1010.0
7 Parqueadero 4.0 2.0 NaN 496.0 545.0
In the dataframe train, for each of these columns: rooms, bedrooms, bathrooms, surface_total and surface_covered, if the value is NaN I need to fill it with the appropriate record of empty, matching on the property_type column.
E.g. I need train.loc[0, 'rooms'] to equal 4.0 from empty (empty.loc[1, 'rooms']),
train.loc[1, 'rooms'] to equal the value of empty.loc[0, 'rooms'], which is 3.0, and so on.
I have been trying with nested for loops but I have not been able to make it work. I'm frustrated now.
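A minimal sketch of one way to do this (not from the thread; the correspondence between train's abbreviated column names and empty's full names is assumed from the tables above):
# per-type means, indexed by property_type for lookup
means = empty.set_index('property_type')

# assumed mapping from train's abbreviated names to empty's column names
col_map = {'rooms': 'rooms', 'bedr': 'bedrooms', 'bathr': 'bathrooms',
           'surface_t': 'surface_total', 'surface_c': 'surface_covered'}

for train_col, empty_col in col_map.items():
    # map each row's property_type to that type's mean, then fill only the NaNs
    train[train_col] = train[train_col].fillna(
        train['property_type'].map(means[empty_col]))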

Sort multiIndex dataframe given the sum

Hello, I have prepared a MultiIndex table in Pandas that looks like this:
Lang                C++  java  python  All
Corp     Name
ASW      ASW        0.0   7.0     8.0   15
         Cristiano  NaN   NaN     8.0    8
         Michael    NaN   7.0     0.0    7
Facebook Facebook   8.0   1.0     5.0   14
         Piter      8.0   NaN     NaN    8
         Cristiano  NaN   NaN     3.0    3
         Michael    NaN   1.0     2.0    3
Google   Google     2.0  24.0     1.0   27
         Michael    NaN  24.0     NaN   24
         Piter      2.0   NaN     NaN    2
         Cristiano  NaN   NaN     1.0    1
Now I would like to sort the groups of rows so that the sum per Corp (in column 'All') is descending, and then select only the two 'Corp' groups (and their rows) with the largest sums.
It should look like:
Lang              C++  java  python  All
Corp   Name
Google Google     2.0  24.0     1.0   27
       Michael    NaN  24.0     NaN   24
       Piter      2.0   NaN     NaN    2
       Cristiano  NaN   NaN     1.0    1
ASW    ASW        0.0   7.0     8.0   15
       Cristiano  NaN   NaN     8.0    8
       Michael    NaN   7.0     0.0    7
Thank You!
IIUC, you can sort_values per group, then slice using the two largest sums of All per group:
out = (df
 # for each company, sort the values using the All column in descending order
 # (group_keys=False prevents adding an extra index level, as in the Q&A above)
 .groupby(level=0, group_keys=False)
 .apply(lambda g: g.sort_values('All', ascending=False))
 # calculate the sum of All per company,
 # get the index of the top 2 companies (nlargest(2)),
 # then slice to keep only those
 .loc[lambda d: d.groupby(level=0)['All'].sum().nlargest(2).index]
)
output:
Lang              C++  java  python  All
Corp   Name
Google Google     2.0  24.0     1.0   27
       Michael    NaN  24.0     NaN   24
       Piter      2.0   NaN     NaN    2
       Cristiano  NaN   NaN     1.0    1
ASW    ASW        0.0   7.0     8.0   15
       Cristiano  NaN   NaN     8.0    8
       Michael    NaN   7.0     0.0    7
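A shorter equivalent sketch (an assumption, not from the answer): sort all rows by All once, then select the two largest groups. .loc with the ordered group labels both slices and restores the group order, while the rows inside each group keep their already-sorted order:
sums = df.groupby(level=0)['All'].sum()
out = df.sort_values('All', ascending=False).loc[sums.nlargest(2).index]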

How to find the cumprod for a dataframe by reversing the row values?

I have a dataframe df:
0 1 2 3 4 5 6
Row Labels
2017 A1 2.0 2.0 NaN 2.0 NaN 2.0 NaN
2017 A2 2.0 2.0 2.0 NaN 2.0 2.0 NaN
2017 A3 2.0 2.0 2.0 2.0 2.0 2.0 NaN
2017 A4 2.0 2.0 2.0 2.0 2.0 2.0 NaN
2018 A1 2.0 2.0 2.0 2.0 NaN NaN NaN
2019 A2 2.0 2.0 2.0 NaN NaN NaN NaN
2020 A3 2.0 2.0 NaN NaN NaN NaN NaN
2021 A4 2.0 NaN NaN NaN NaN NaN NaN
I have to find the cumprod of the dataframe by reversing the values in each row.
I tried this code:
df1 = df[::-1].cumprod(axis=1)[::-1]
and I got this output:
0 1 2 3 4 5 6
Row Labels
2017 A1 2.0 4.0 NaN 8.0 NaN 16.0 NaN
2017 A2 2.0 4.0 8.0 NaN 16.0 32.0 NaN
2017 A3 2.0 4.0 8.0 16.0 32.0 64.0 NaN
2017 A4 2.0 4.0 8.0 16.0 32.0 64.0 NaN
2018 A1 2.0 4.0 8.0 16.0 NaN NaN NaN
2019 A2 2.0 4.0 8.0 NaN NaN NaN NaN
2020 A3 2.0 4.0 NaN NaN NaN NaN NaN
2021 A4 2.0 NaN NaN NaN NaN NaN NaN
But the expected output is:
0 1 2 3 4 5 6
Row Labels
2017 A1 16.0 8.0 NaN 4.0 NaN 2.0 NaN
2017 A2 32.0 16.0 8.0 NaN 4.0 2.0 NaN
2017 A3 64.0 32.0 16.0 8.0 4.0 2.0 NaN
2017 A4 64.0 32.0 16.0 8.0 4.0 2.0 NaN
2018 A1 16.0 8.0 4.0 2.0 NaN NaN NaN
2019 A2 8.0 4.0 2.0 NaN NaN NaN NaN
2020 A3 4.0 2.0 NaN NaN NaN NaN NaN
2021 A4 2.0 NaN NaN NaN NaN NaN NaN
Thank you for your time :)
Use DataFrame.iloc with : first to select all rows and ::-1 to reverse the column order. Your attempt reversed the rows (axis 0), but cumprod(axis=1) runs along the columns, so it is the columns that need reversing:
df1 = df.iloc[:, ::-1].cumprod(axis=1).iloc[:, ::-1]
print (df1)
0 1 2 3 4 5 6
Row Labels
2017 A1 16.0 8.0 NaN 4.0 NaN 2.0 NaN
2017 A2 32.0 16.0 8.0 NaN 4.0 2.0 NaN
2017 A3 64.0 32.0 16.0 8.0 4.0 2.0 NaN
2017 A4 64.0 32.0 16.0 8.0 4.0 2.0 NaN
2018 A1 16.0 8.0 4.0 2.0 NaN NaN NaN
2019 A2 8.0 4.0 2.0 NaN NaN NaN NaN
2020 A3 4.0 2.0 NaN NaN NaN NaN NaN
2021 A4 2.0 NaN NaN NaN NaN NaN NaN
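An equivalent label-based sketch (an assumption, not from the answer; it relies on the column labels being unique):
# reverse the column order by labels, cumprod, then restore the original order
df1 = df[df.columns[::-1]].cumprod(axis=1)[df.columns]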

Replace row in pandas with next row only when the entire row (each column) has NaN values

I have a DataFrame where I want to replace only the rows that have NaN values in every column with the row below them. I tried solutions from multiple threads and used ffill, but that filled only a few cells, not the entire row.
ss s h b sb
0 NaN NaN NaN NaN NaN
1 3.0 NaN 14.0 NaN 8.0
2 9.0 8.0 23.0 NaN 2.0
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 1.0 6.0 7.0 11.0 3.0
Expected output:
ss s h b sb
0 3.0 NaN 14.0 NaN 8.0
1 3.0 NaN 14.0 NaN 8.0
2 9.0 8.0 23.0 NaN 2.0
3 1.0 6.0 7.0 11.0 3.0
4 1.0 6.0 7.0 11.0 3.0
5 1.0 6.0 7.0 11.0 3.0
You can create groups by testing which rows have at least one non-missing value, taking the cumulative sum in reversed row order, and passing the result to GroupBy.bfill:
df = df.groupby((df.notna().any(axis=1)).iloc[::-1].cumsum().iloc[::-1]).bfill()
print (df)
ss s h b sb
0 3.0 NaN 14.0 NaN 8.0
1 3.0 NaN 14.0 NaN 8.0
2 9.0 8.0 23.0 NaN 2.0
3 1.0 6.0 7.0 11.0 3.0
4 1.0 6.0 7.0 11.0 3.0
5 1.0 6.0 7.0 11.0 3.0
Detail:
print ((df.notna().any(axis=1)).iloc[::-1].cumsum().iloc[::-1])
0 3
1 3
2 2
3 1
4 1
5 1
dtype: int32
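An alternative sketch (an assumption, not from the answer): map every all-NaN row to the label of the next row that has data, then pull those rows directly:
import numpy as np

# rows where every column is NaN
mask = df.isna().all(axis=1)
# back-fill row labels so all-NaN rows borrow the next data row's label;
# assumes the frame does not end with an all-NaN row
src = pd.Series(np.where(mask, np.nan, df.index), index=df.index).bfill().astype(int)
out = df.loc[src].set_axis(df.index)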

ReArrange Pandas DataFrame date columns in date order

I have a pandas dataframe that summarises sales by calendar month and outputs something like:
Month level_0 UNIQUE_ID 102018 112018 12018 122017 122018 22018 32018 42018 52018 62018 72018 82018 92018
0 SOLD_QUANTITY 01 3692.0 5182.0 3223.0 1292.0 2466.0 2396.0 2242.0 2217.0 3590.0 2593.0 1665.0 3371.0 3069.0
1 SOLD_QUANTITY 011 3.0 6.0 NaN NaN 7.0 5.0 2.0 1.0 5.0 NaN 1.0 1.0 3.0
2 SOLD_QUANTITY 02 370.0 130.0 NaN NaN 200.0 NaN NaN 269.0 202.0 NaN 201.0 125.0 360.0
3 SOLD_QUANTITY 03 2.0 6.0 NaN NaN 2.0 1.0 NaN 6.0 11.0 9.0 2.0 3.0 5.0
4 SOLD_QUANTITY 08 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 175.0 NaN NaN
I want to be able to programmatically re-arrange the column headers in ascending date order (e.g. starting 122017, 12018, 22018, ...). It needs to be programmatic because every time the report runs there will be a different list of months, as it runs every month for the last 365 days.
The column index:
Index(['level_0', 'UNIQUE_ID', '102018', '112018', '12018', '122017', '122018',
'22018', '32018', '42018', '52018', '62018', '72018', '82018', '92018'],
dtype='object', name='Month')
Move the non-date columns into the index with set_index so only the date columns remain, convert them to datetimes, get the order positions with argsort, then change the ordering with iloc:
df = df.set_index(['level_0','UNIQUE_ID'])
df = df.iloc[:, pd.to_datetime(df.columns, format='%m%Y').argsort()].reset_index()
print (df)
level_0 UNIQUE_ID 122017 12018 22018 32018 42018 52018 \
0 SOLD_QUANTITY 1 1292.0 3223.0 2396.0 2242.0 2217.0 3590.0
1 SOLD_QUANTITY 11 NaN NaN 5.0 2.0 1.0 5.0
2 SOLD_QUANTITY 2 NaN NaN NaN NaN 269.0 202.0
3 SOLD_QUANTITY 3 NaN NaN 1.0 NaN 6.0 11.0
4 SOLD_QUANTITY 8 NaN NaN NaN NaN NaN NaN
62018 72018 82018 92018 102018 112018 122018
0 2593.0 1665.0 3371.0 3069.0 3692.0 5182.0 2466.0
1 NaN 1.0 1.0 3.0 3.0 6.0 7.0
2 NaN 201.0 125.0 360.0 370.0 130.0 200.0
3 9.0 2.0 3.0 5.0 2.0 6.0 2.0
4 NaN 175.0 NaN NaN NaN NaN NaN
Another idea is to create a monthly period index with DatetimeIndex.to_period, which makes it possible to use sort_index:
df = df.set_index(['level_0','UNIQUE_ID'])
df.columns = pd.to_datetime(df.columns, format='%m%Y').to_period('m')
#alternative for convert to datetimes
#df.columns = pd.to_datetime(df.columns, format='%m%Y')
df = df.sort_index(axis=1).reset_index()
print (df)
level_0 UNIQUE_ID 2017-12 2018-01 2018-02 2018-03 2018-04 \
0 SOLD_QUANTITY 1 1292.0 3223.0 2396.0 2242.0 2217.0
1 SOLD_QUANTITY 11 NaN NaN 5.0 2.0 1.0
2 SOLD_QUANTITY 2 NaN NaN NaN NaN 269.0
3 SOLD_QUANTITY 3 NaN NaN 1.0 NaN 6.0
4 SOLD_QUANTITY 8 NaN NaN NaN NaN NaN
2018-05 2018-06 2018-07 2018-08 2018-09 2018-10 2018-11 2018-12
0 3590.0 2593.0 1665.0 3371.0 3069.0 3692.0 5182.0 2466.0
1 5.0 NaN 1.0 1.0 3.0 3.0 6.0 7.0
2 202.0 NaN 201.0 125.0 360.0 370.0 130.0 200.0
3 11.0 9.0 2.0 3.0 5.0 2.0 6.0 2.0
4 NaN NaN 175.0 NaN NaN NaN NaN NaN
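A further sketch (an assumption, not from the answers): sort the date columns by their parsed datetimes and keep the id columns in front, without touching the index at all:
id_cols = ['level_0', 'UNIQUE_ID']
date_cols = sorted((c for c in df.columns if c not in id_cols),
                   key=lambda c: pd.to_datetime(c, format='%m%Y'))
df = df[id_cols + date_cols]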
