Applying function to Pandas dataframe with groupby ('Too many indexers' error) - python

I am trying to compute the mean and variance along axis=1 of a dataframe, using only the first k columns (selected with .iloc[:,:-5]). Naively, I would run:
df.groupby('id').agg([lambda x: x.iloc[:,:-5].mean(axis=1), lambda x: x.iloc[:,:-5].var(axis=1)])
but it throws the 'too many indexers' error.
EDIT
Some data:
0 1 2 3 4 5 6 7 8 9 Q1 Q2 Q3 Q4 id
0 3.0 3.0 4.0 4.0 3.0 3.0 3.0 3.0 3.0 3.0 12.0 0.83 80.0 1.000 11.0
1 3.0 3.0 4.0 4.0 4.0 3.0 3.0 3.0 3.0 3.0 14.0 1.60 80.0 1.000 11.0
2 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 5.0 13.0 1.40 75.0 1.000 11.0
3 3.0 3.0 4.0 4.0 4.0 3.0 3.0 3.0 3.0 3.0 12.0 0.50 80.0 0.848 11.0
4 3.0 4.0 4.0 4.0 7.0 7.0 5.0 4.0 4.0 2.0 13.0 1.74 70.0 0.883 11.0
13 3.0 3.0 2.0 2.0 2.0 2.0 3.0 2.0 3.0 3.0 12.0 3.67 45.0 1.000 14.0
14 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 13.0 3.67 48.0 0.848 14.0
15 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 12.0 1.67 70.0 0.848 14.0
16 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 NaN 2.0 12.0 3.33 60.0 0.848 14.0
17 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 12.0 3.33 60.0 0.848 14.0
25 4.0 4.0 6.0 5.0 NaN 6.0 4.0 3.0 NaN 4.0 11.0 3.36 85.0 0.796 17.0
26 4.0 5.0 4.0 7.0 6.0 5.0 4.0 6.0 7.0 5.0 8.0 4.76 50.0 0.725 17.0
27 4.0 4.0 3.0 4.0 5.0 4.0 5.0 3.0 3.0 5.0 9.0 3.33 50.0 0.725 17.0
28 3.0 4.0 4.0 3.0 4.0 4.0 NaN 3.0 NaN 3.0 10.0 3.12 75.0 0.725 17.0
29 3.0 3.0 2.0 NaN 2.0 1.0 NaN NaN 1.0 2.0 15.0 3.05 79.0 0.725 17.0
39 3.0 3.0 5.0 4.0 4.0 4.0 4.0 4.0 NaN 5.0 12.0 1.19 80.0 0.883 18.0
40 5.0 4.0 5.0 5.0 5.0 5.0 4.0 5.0 7.0 4.0 9.0 1.83 75.0 0.883 18.0
41 5.0 6.0 4.0 4.0 4.0 4.0 4.0 4.0 7.0 7.0 12.0 1.71 35.0 1.000 18.0
42 5.0 5.0 5.0 5.0 4.0 NaN 4.0 4.0 3.0 2.0 13.0 0.86 85.0 1.000 18.0
43 3.0 3.0 3.0 3.0 3.0 3.0 3.0 5.0 3.0 3.0 11.0 1.36 75.0 1.000 18.0
48 1

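It seems you first need to compute the row-wise statistics. Select the columns once, before adding new ones, so that both statistics use the same slice:
vals = df.iloc[:,:-5]
df['m'] = vals.mean(axis=1)
df['v'] = vals.var(axis=1)
and then aggregate if necessary.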
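As far as I can tell, the error arises because agg with a list of functions calls each function on one column of each group at a time, so x is a Series and the two-dimensional .iloc[:, :-5] has one indexer too many. A minimal sketch of the two-step approach on made-up data with the same layout (three value columns instead of ten; taking the per-id mean of the row statistics is only an assumption about what "aggregate" should mean here):

```python
import pandas as pd

# Made-up frame: three value columns, then Q1..Q4 and id as the last five.
columns = [0, 1, 2, 'Q1', 'Q2', 'Q3', 'Q4', 'id']
df = pd.DataFrame(
    [
        [3.0, 3.0, 4.0, 12.0, 0.83, 80.0, 1.000, 11.0],
        [3.0, 3.0, 3.0, 13.0, 1.40, 75.0, 1.000, 11.0],
        [2.0, 2.0, 2.0, 12.0, 1.67, 70.0, 0.848, 14.0],
        [2.0, 4.0, 6.0, 12.0, 3.33, 60.0, 0.848, 14.0],
    ],
    columns=columns,
)

# Slice once so that later column additions cannot shift the selection.
values = df.iloc[:, :-5]

# Row-wise statistics, kept next to the grouping key.
row_stats = pd.DataFrame({
    'id': df['id'],
    'm': values.mean(axis=1),
    'v': values.var(axis=1),
})

# Assumed aggregation: average the row statistics within each id.
per_id = row_stats.groupby('id')[['m', 'v']].mean()
print(per_id)
```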

Related

How can I merge my columns into a single one using a multiindex

I have a DataFrame looking like this:
year 2015 2016 2017 2018 2019 2015 2016 2017 2018 2019 ... 2015 2016 2017 2018 2019 2015 2016 2017 2018 2019
PATIENTS PATIENTS PATIENTS PATIENTS PATIENTS month month month month month ... diffs_24h diffs_24h diffs_24h diffs_24h diffs_24h diffs_168h diffs_168h diffs_168h diffs_168h diffs_168h
date
2016-01-01 00:00:00 0.0 2.0 1.0 7.0 3.0 1.0 1.0 1.0 1.0 1.0 ... NaN -1.0 -4.0 2.0 -2.0 NaN -3.0 -2.0 -3.0 -6.0
2016-01-01 01:00:00 6.0 6.0 7.0 6.0 7.0 1.0 1.0 1.0 1.0 1.0 ... NaN 4.0 0.0 0.0 1.0 NaN 3.0 1.0 2.0 -1.0
2016-01-01 02:00:00 2.0 7.0 6.0 2.0 3.0 1.0 1.0 1.0 1.0 1.0 ... NaN 4.0 3.0 -1.0 0.0 NaN 6.0 2.0 -3.0 0.0
2016-01-01 03:00:00 0.0 2.0 2.0 4.0 6.0 1.0 1.0 1.0 1.0 1.0 ... NaN -1.0 0.0 2.0 4.0 NaN -1.0 -2.0 3.0 3.0
2016-01-01 04:00:00 1.0 2.0 5.0 8.0 0.0 1.0 1.0 1.0 1.0 1.0 ... NaN -1.0 5.0 7.0 -1.0 NaN -2.0 3.0 5.0 -2.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2016-12-31 19:00:00 6.0 7.0 6.0 6.0 6.0 12.0 12.0 12.0 12.0 12.0 ... -9.0 -1.0 -7.0 1.0 -2.0 1.0 0.0 -6.0 -4.0 0.0
2016-12-31 20:00:00 2.0 2.0 5.0 5.0 3.0 12.0 12.0 12.0 12.0 12.0 ... -9.0 -7.0 -12.0 -1.0 -10.0 -2.0 -6.0 -2.0 -1.0 -4.0
2016-12-31 21:00:00 4.0 5.0 3.0 3.0 3.0 12.0 12.0 12.0 12.0 12.0 ... -2.0 -3.0 -10.0 -2.0 -11.0 -2.0 -2.0 -2.0 -3.0 -2.0
2016-12-31 22:00:00 5.0 2.0 6.0 6.0 3.0 12.0 12.0 12.0 12.0 12.0 ... 0.0 -6.0 -4.0 5.0 -4.0 2.0 -1.0 0.0 2.0 -3.0
2016-12-31 23:00:00 1.0 3.0 4.0 4.0 6.0 12.0 12.0 12.0 12.0 12.0 ... -6.0 -1.0 -11.0 2.0 -3.0 -4.0 -2.0 -7.0 -2.0 -2.0
and I want to end with a DataFrame in which the first level is the years but having a single year with all of the columns inside. How can I achieve that?
Example:
year 2015 2016 2017 2018 2019
PATIENTS month PATIENTS month PATIENTS month PATIENTS month PATIENTS month ...
date
2016-01-01 00:00:00 0.0 2.0 1.0 7.0 3.0 1.0 1.0 1.0 1.0 1.0 ... NaN -1.0 -4.0 2.0 -2.0 NaN -3.0 -2.0 -3.0 -6.0
2016-01-01 01:00:00 6.0 6.0 7.0 6.0 7.0 1.0 1.0 1.0 1.0 1.0 ... NaN 4.0 0.0 0.0 1.0 NaN 3.0 1.0 2.0 -1.0
2016-01-01 02:00:00 2.0 7.0 6.0 2.0 3.0 1.0 1.0 1.0 1.0 1.0 ... NaN 4.0 3.0 -1.0 0.0 NaN 6.0 2.0 -3.0 0.0
2016-01-01 03:00:00 0.0 2.0 2.0 4.0 6.0 1.0 1.0 1.0 1.0 1.0 ... NaN -1.0 0.0 2.0 4.0 NaN -1.0 -2.0 3.0 3.0
2016-01-01 04:00:00 1.0 2.0 5.0 8.0 0.0 1.0 1.0 1.0 1.0 1.0 ... NaN -1.0 5.0 7.0 -1.0 NaN -2.0 3.0 5.0 -2.0
... ... ... ... ... ... ... ... ... ... ... .
I think you only need to sort your columns:
new_df = df.sort_index(axis=1, level=0)
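A small sketch of what the sort does, on a made-up frame with two years and two variables. The level name 'variable' is hypothetical; only 'year' and 'date' appear in the question.

```python
import pandas as pd

# Columns arrive grouped by variable, as in the question.
cols = pd.MultiIndex.from_tuples(
    [(2015, 'PATIENTS'), (2016, 'PATIENTS'), (2015, 'month'), (2016, 'month')],
    names=['year', 'variable'],  # 'variable' is an assumed name
)
df = pd.DataFrame(
    [[0.0, 2.0, 1.0, 1.0], [6.0, 6.0, 1.0, 1.0]],
    index=pd.to_datetime(['2016-01-01 00:00:00', '2016-01-01 01:00:00']),
    columns=cols,
)
df.index.name = 'date'

# Sorting on the first column level puts all columns of one year together.
new_df = df.sort_index(axis=1, level=0)
print(new_df)
```

After the sort each year appears once in the top level, with its variables underneath it.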

Moving average by column / year - python, pandas

I need to build a moving average over the column "total_medals" by country [noc] for all previous years. My data looks like this:
medal Bronze Gold Medal Silver **total_medals**
noc year
ALG 1984 2.0 NaN NaN NaN 2.0
1992 4.0 2.0 NaN NaN 6.0
1996 2.0 1.0 NaN 4.0 7.0
ANZ 1984 2.0 15.0 NaN 2.0 19.0
1992 3.0 5.0 NaN 2.0 10.0
1996 1.0 2.0 NaN 2.0 5.0
ARG 1984 2.0 6.0 NaN 3.0 11.0
1992 5.0 3.0 NaN 24.0 32.0
1992 3.0 7.0 NaN 5.0 15.0
I want to have a moving average per country and year (i.e. for ALG: 1984 Avg(total_medals) = 2.0; 1992 Avg(total_medals) = (2.0+6.0)/2 = 4.0; 1996 Avg(total_medals) = (2.0+6.0+7.0)/3 = 5.0). The moving average should appear in a new column (next to total_medals).
Additionally, for each country and year combination, a new column called "performance" should hold "total_medals" divided by the moving average.
Sample dataframe:
print(df)
medal Bronze Gold Medal Silver total_medals
noc year
ALG 1984 2.0 NaN NaN NaN 2.0
1992 4.0 2.0 NaN NaN 6.0
1996 2.0 1.0 NaN 4.0 7.0
ANZ 1984 2.0 15.0 NaN 2.0 19.0
1992 3.0 5.0 NaN 2.0 10.0
1996 1.0 2.0 NaN 2.0 5.0
ARG 1984 2.0 6.0 NaN 3.0 11.0
1992 5.0 3.0 NaN 24.0 32.0
1992 3.0 7.0 NaN 5.0 15.0
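Use DataFrame.groupby + expanding on the total_medals column (transform keeps the result aligned with the original index):
df['total_mean']=df.groupby(level=0,sort=False)['total_medals'].transform(lambda x: x.expanding(1).mean())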
print(df)
medal Bronze Gold Medal Silver total_medals total_mean
noc year
ALG 1984 2.0 NaN NaN NaN 2.0 2.000000
1992 4.0 2.0 NaN NaN 6.0 4.000000
1996 2.0 1.0 NaN 4.0 7.0 5.000000
ANZ 1984 2.0 15.0 NaN 2.0 19.0 19.000000
1992 3.0 5.0 NaN 2.0 10.0 14.500000
1996 1.0 2.0 NaN 2.0 5.0 11.333333
ARG 1984 2.0 6.0 NaN 3.0 11.0 11.000000
1992 5.0 3.0 NaN 24.0 32.0 21.500000
1992 3.0 7.0 NaN 5.0 15.0 19.333333
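The "performance" column from the question could then be derived from the expanding mean. A minimal sketch on a reduced copy of the sample data (only total_medals for ALG and ANZ), assuming performance is simply total_medals divided by the expanding mean of the same row:

```python
import pandas as pd

# Reduced sample: only the total_medals column for two countries.
idx = pd.MultiIndex.from_tuples(
    [('ALG', 1984), ('ALG', 1992), ('ALG', 1996),
     ('ANZ', 1984), ('ANZ', 1992), ('ANZ', 1996)],
    names=['noc', 'year'],
)
df = pd.DataFrame({'total_medals': [2.0, 6.0, 7.0, 19.0, 10.0, 5.0]}, index=idx)

# Expanding (cumulative) mean per country; transform keeps the original index.
df['total_mean'] = (
    df.groupby(level='noc', sort=False)['total_medals']
      .transform(lambda s: s.expanding(min_periods=1).mean())
)

# Assumed definition: medals of the year relative to the mean so far.
df['performance'] = df['total_medals'] / df['total_mean']
print(df)
```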
Bronze divided by the lagged (previous row's) total_medals:
s=df.groupby('noc').apply(lambda x: x['Bronze']/x['total_medals'].shift())
s.index=s.index.droplevel()
df['bronze_lagged']=s
You could create a function for this...
def lagged_medals(type_of_medal):
    # divide the medal count by the previous row's total within each country
    s=df.groupby('noc').apply(lambda x: x[type_of_medal]/x['total_medals'].shift())
    s.index=s.index.droplevel()
    df[f'{type_of_medal}_lagged']=s

lagged_medals('Silver')
#print(df)

How to reindex data frame in Pandas?

I'm using pandas in Python. I have performed some crosstab calculations and concatenations, and I end up with a data frame that looks like this:
ID 5 6 7 8 9 10 11 12 13
Total 87.0 3.0 9.0 6.0 92.0 7.0 3.0 3.0 20.0
Regular 72.0 2.0 8.0 5.0 81.0 7.0 3.0 3.0 18.0
CR 22.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 3.0
HDG 20.0 0.0 0.0 0.0 24.0 4.0 0.0 0.0 1.0
PPG 30.0 2.0 8.0 5.0 40.0 3.0 3.0 3.0 14.0
Superior 15.0 1.0 1.0 1.0 11.0 0.0 0.0 0.0 2.0
CR 3.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
HDG 5.0 1.0 1.0 1.0 4.0 0.0 0.0 0.0 0.0
PPG 7.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 2.0
The problem is that I want the last 4 rows, which start with Superior, to be placed directly after the Total row. In other words, I want to swap the last 4 rows with the 4 rows that start with Regular. How can I achieve this in pandas, so that I get this:
ID 5 6 7 8 9 10 11 12 13
Total 87.0 3.0 9.0 6.0 92.0 7.0 3.0 3.0 20.0
Superior 15.0 1.0 1.0 1.0 11.0 0.0 0.0 0.0 2.0
CR 3.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
HDG 5.0 1.0 1.0 1.0 4.0 0.0 0.0 0.0 0.0
PPG 7.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 2.0
Regular 72.0 2.0 8.0 5.0 81.0 7.0 3.0 3.0 18.0
CR 22.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 3.0
HDG 20.0 0.0 0.0 0.0 24.0 4.0 0.0 0.0 1.0
PPG 30.0 2.0 8.0 5.0 40.0 3.0 3.0 3.0 14.0
A more generalized solution uses Categorical and argsort. I know this df is already ordered by block, so ffill is safe here:
s=df.ID
s=s.where(s.isin(['Total','Regular','Superior'])).ffill()
s=pd.Categorical(s,['Total','Superior','Regular'],ordered=True)
df=df.iloc[np.argsort(s)]
df
Out[188]:
ID 5 6 7 8 9 10 11 12 13
0 Total 87.0 3.0 9.0 6.0 92.0 7.0 3.0 3.0 20.0
5 Superior 15.0 1.0 1.0 1.0 11.0 0.0 0.0 0.0 2.0
6 CR 3.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
7 HDG 5.0 1.0 1.0 1.0 4.0 0.0 0.0 0.0 0.0
8 PPG 7.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 2.0
1 Regular 72.0 2.0 8.0 5.0 81.0 7.0 3.0 3.0 18.0
2 CR 22.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 3.0
3 HDG 20.0 0.0 0.0 0.0 24.0 4.0 0.0 0.0 1.0
4 PPG 30.0 2.0 8.0 5.0 40.0 3.0 3.0 3.0 14.0
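One caveat: np.argsort defaults to quicksort, which is not guaranteed to be stable, so on larger frames the rows inside each block could in principle be shuffled. A sketch of the same idea with an explicit stable sort on the category codes, using a cut-down copy of the data (only the ID column and column 5):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': ['Total', 'Regular', 'CR', 'HDG', 'PPG', 'Superior', 'CR', 'HDG', 'PPG'],
    '5': [87.0, 72.0, 22.0, 20.0, 30.0, 15.0, 3.0, 5.0, 7.0],
})

# Label every row with the block header it belongs to.
headers = ['Total', 'Regular', 'Superior']
block = df['ID'].where(df['ID'].isin(headers)).ffill()

# Desired block order, encoded as ordered categories.
block = pd.Categorical(block, categories=['Total', 'Superior', 'Regular'], ordered=True)

# A stable sort keeps the original row order inside each block.
order = np.argsort(block.codes, kind='stable')
reordered = df.iloc[order]
print(reordered)
```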
Here's one way:
import numpy as np
df.iloc[1:,:] = np.roll(df.iloc[1:,:].values, 4, axis=0)
ID 5 6 7 8 9 10 11 12 13
0 Total 87.0 3.0 9.0 6.0 92.0 7.0 3.0 3.0 20.0
1 Superior 15.0 1.0 1.0 1.0 11.0 0.0 0.0 0.0 2.0
2 CR 3.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
3 HDG 5.0 1.0 1.0 1.0 4.0 0.0 0.0 0.0 0.0
4 PPG 7.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 2.0
5 Regular 72.0 2.0 8.0 5.0 81.0 7.0 3.0 3.0 18.0
6 CR 22.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 3.0
7 HDG 20.0 0.0 0.0 0.0 24.0 4.0 0.0 0.0 1.0
8 PPG 30.0 2.0 8.0 5.0 40.0 3.0 3.0 3.0 14.0
For a specific answer to this question, just use iloc
df.iloc[[0,5,6,7,8,1,2,3,4],:]
For a more generalized solution,
m = (df.ID.eq('Superior') | df.ID.eq('Regular')).cumsum()
pd.concat([df[m==0], df[m==2], df[m==1]])
or
order = (2,1)
pd.concat([df[m==0], *[df[m==c] for c in order]])
where order defines the mapping from previous ordering to new ordering.

python script breaks output lines

If I run
import numpy as np
import pandas as pd
import sys
df = pd.read_csv(sys.argv[1]) # note to self: argv[0] is the script path, argv[1] the first argument
description = df.groupby(['option','subcase']).describe()
totals = df.groupby('option').describe().set_index(np.array(['total'] * df['option'].nunique()), append=True)
description = pd.concat([description, totals]).sort_index() # DataFrame.append was removed in pandas 2.0
print(description)
on .csv
option,subcase,cost,time
A,sub1,4,3
A,sub1,2,0
A,sub2,3,8
A,sub2,1,2
B,sub1,13,0
B,sub1,11,0
B,sub2,5,2
B,sub2,3,4
, I get an output like this:
cost time \
count mean std min 25% 50% 75% max count
option subcase
A sub1 2.0 3.0 1.414214 2.0 2.50 3.0 3.50 4.0 2.0
sub2 2.0 2.0 1.414214 1.0 1.50 2.0 2.50 3.0 2.0
total 4.0 2.5 1.290994 1.0 1.75 2.5 3.25 4.0 4.0
B sub1 2.0 12.0 1.414214 11.0 11.50 12.0 12.50 13.0 2.0
sub2 2.0 4.0 1.414214 3.0 3.50 4.0 4.50 5.0 2.0
total 4.0 8.0 4.760952 3.0 4.50 8.0 11.50 13.0 4.0
mean std min 25% 50% 75% max
option subcase
A sub1 1.50 2.121320 0.0 0.75 1.5 2.25 3.0
sub2 5.00 4.242641 2.0 3.50 5.0 6.50 8.0
total 3.25 3.403430 0.0 1.50 2.5 4.25 8.0
B sub1 0.00 0.000000 0.0 0.00 0.0 0.00 0.0
sub2 3.00 1.414214 2.0 2.50 3.0 3.50 4.0
total 1.50 1.914854 0.0 0.00 1.0 2.50 4.0
This is annoying, especially if you want to save it as a .csv instead of displaying it in a console.
(e.g. python myscript.py my.csv > my.summary)
How do I stop this linebreak from happening?
Add pd.set_option before printing:
pd.set_option('expand_frame_repr', False)
print(description)
cost time
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
option subcase
A sub1 2.0 3.0 1.414214 2.0 2.50 3.0 3.50 4.0 2.0 1.50 2.121320 0.0 0.75 1.5 2.25 3.0
sub2 2.0 2.0 1.414214 1.0 1.50 2.0 2.50 3.0 2.0 5.00 4.242641 2.0 3.50 5.0 6.50 8.0
total 4.0 2.5 1.290994 1.0 1.75 2.5 3.25 4.0 4.0 3.25 3.403430 0.0 1.50 2.5 4.25 8.0
B sub1 2.0 12.0 1.414214 11.0 11.50 12.0 12.50 13.0 2.0 0.00 0.000000 0.0 0.00 0.0 0.00 0.0
sub2 2.0 4.0 1.414214 3.0 3.50 4.0 4.50 5.0 2.0 3.00 1.414214 2.0 2.50 3.0 3.50 4.0
total 4.0 8.0 4.760952 3.0 4.50 8.0 11.50 13.0 4.0 1.50 1.914854 0.0 0.00 1.0 2.50 4.0
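If the goal is a file rather than console output, it may be simpler to skip print altogether and write the frame with to_csv, which does not depend on the display options. A rough sketch using the .csv from the question inlined as a DataFrame (the io.StringIO buffer stands in for a real output file):

```python
import io
import pandas as pd

df = pd.DataFrame({
    'option':  ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'subcase': ['sub1', 'sub1', 'sub2', 'sub2', 'sub1', 'sub1', 'sub2', 'sub2'],
    'cost':    [4, 2, 3, 1, 13, 11, 5, 3],
    'time':    [3, 0, 8, 2, 0, 0, 2, 4],
})
description = df.groupby(['option', 'subcase']).describe()

# For console output: keep wide frames on one line per row.
pd.set_option('display.expand_frame_repr', False)

# to_string does not wrap unless line_width is given.
text = description.to_string()

# For files: write real CSV instead of redirecting printed output.
buffer = io.StringIO()  # replace with a file path in a real script
description.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```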

Easy pythonic way to classify columns in groups and store it in Dictionary?

Machine_number Machine_Running_Hours
0 1.0 424.0
1 2.0 458.0
2 3.0 465.0
3 4.0 446.0
4 5.0 466.0
5 6.0 466.0
6 7.0 445.0
7 8.0 466.0
8 9.0 447.0
9 10.0 469.0
10 11.0 467.0
11 12.0 449.0
12 13.0 436.0
13 14.0 465.0
14 15.0 463.0
15 16.0 372.0
16 17.0 460.0
17 18.0 450.0
18 19.0 467.0
19 20.0 463.0
20 21.0 205.0
I am trying to classify according to machine number. Like Machine_number 1 to 5 will be one group. Then 6 to 10 in one group and so on.
I think you need to subtract 1 with sub and then use floordiv:
df['g'] = df.Machine_number.sub(1).floordiv(5)
#same as //
#df['g'] = df.Machine_number.sub(1) // 5
print (df)
Machine_number Machine_Running_Hours g
0 1.0 424.0 -0.0
1 2.0 458.0 0.0
2 3.0 465.0 0.0
3 4.0 446.0 0.0
4 5.0 466.0 0.0
5 6.0 466.0 1.0
6 7.0 445.0 1.0
7 8.0 466.0 1.0
8 9.0 447.0 1.0
9 10.0 469.0 1.0
10 11.0 467.0 2.0
11 12.0 449.0 2.0
12 13.0 436.0 2.0
13 14.0 465.0 2.0
14 15.0 463.0 2.0
15 16.0 372.0 3.0
16 17.0 460.0 3.0
17 18.0 450.0 3.0
18 19.0 467.0 3.0
19 20.0 463.0 3.0
20 21.0 205.0 4.0
If you need to store the groups in a dictionary, use groupby with a dict comprehension:
dfs = {i:g for i, g in df.groupby(df.Machine_number.astype(int).sub(1).floordiv(5))}
print (dfs)
{0: Machine_number Machine_Running_Hours
0 1.0 424.0
1 2.0 458.0
2 3.0 465.0
3 4.0 446.0
4 5.0 466.0, 1: Machine_number Machine_Running_Hours
5 6.0 466.0
6 7.0 445.0
7 8.0 466.0
8 9.0 447.0
9 10.0 469.0, 2: Machine_number Machine_Running_Hours
10 11.0 467.0
11 12.0 449.0
12 13.0 436.0
13 14.0 465.0
14 15.0 463.0, 3: Machine_number Machine_Running_Hours
15 16.0 372.0
16 17.0 460.0
17 18.0 450.0
18 19.0 467.0
19 20.0 463.0, 4: Machine_number Machine_Running_Hours
20 21.0 205.0}
print (dfs[0])
Machine_number Machine_Running_Hours
0 1.0 424.0
1 2.0 458.0
2 3.0 465.0
3 4.0 446.0
4 5.0 466.0
print (dfs[1])
Machine_number Machine_Running_Hours
5 6.0 466.0
6 7.0 445.0
7 8.0 466.0
8 9.0 447.0
9 10.0 469.0
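For reference, a self-contained sketch of the same grouping on the first eleven machines from the question. The key for machine n is (n - 1) // 5, so machines 1 to 5 land in group 0, 6 to 10 in group 1, and so on:

```python
import pandas as pd

df = pd.DataFrame({
    'Machine_number': [float(n) for n in range(1, 12)],
    'Machine_Running_Hours': [424.0, 458.0, 465.0, 446.0, 466.0, 466.0,
                              445.0, 466.0, 447.0, 469.0, 467.0],
})

# Integer group key: (machine number - 1) // 5.
group_id = df['Machine_number'].astype(int).sub(1).floordiv(5)

# One sub-frame per group, keyed by the group number.
dfs = {key: part for key, part in df.groupby(group_id)}

for key, part in dfs.items():
    print(key, part['Machine_number'].tolist())
```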
