Accessing columns with MultiIndex after using pandas groupby and aggregate - python

I am using the df.groupby() method:
g1 = df[['md', 'agd', 'hgd']].groupby(['md']).agg(['mean', 'count', 'std'])
It produces exactly what I want!
agd hgd
mean count std mean count std
md
-4 1.398350 2 0.456494 -0.418442 2 0.774611
-3 -0.281814 10 1.314223 -0.317675 10 1.161368
-2 -0.341940 38 0.882749 0.136395 38 1.240308
-1 -0.137268 125 1.162081 -0.103710 125 1.208362
0 -0.018731 603 1.108109 -0.059108 603 1.252989
1 -0.034113 178 1.128363 -0.042781 178 1.197477
2 0.118068 43 1.107974 0.383795 43 1.225388
3 0.452802 18 0.805491 -0.335087 18 1.120520
4 0.304824 1 NaN -1.052011 1 NaN
However, I now want to access the groupby object columns like a "normal" dataframe.
I will then be able to:
1) calculate the errors on the agd and hgd means
2) make scatter plots on md (x axis) vs agd mean (hgd mean) with appropriate error bars added.
Is this possible? Perhaps by playing with the indexing?

1) You can rename the columns and proceed as normal (will get rid of the multi-indexing)
g1.columns = ['agd_mean', 'agd_std','hgd_mean','hgd_std']
2) You can keep multi-indexing and use both levels in turn (docs)
g1['agd']['mean count']

It is possible to do what you are searching for and it is called transform. You will find an example that does exactly what you are searching for in the pandas documentation here.

Related

Grouping rows by proximity of floats in a python pandas dataframe

It feels so straight forward but I haven't found the answer to my question yet. How does one group by proximity, or closeness, of two floats in pandas?
Ok, I could do this the loopy way but my data is big and I hope I can expand my pandas skills with your help and do this elegantly:
I have a column of times in nanoseconds in my DataFrame. I want to group these based on the proximity of their values to little clusters. Most of them will be two rows per cluster maybe up to five or six. I do not know the number of clusters. It will be a massive amount of very small clusters.
I thought I could e.g. introduce a second index or just an additional column with 1 for all rows of the first cluster, 2 for the second and so forth so that groupby gets straight forward thereafter.
something like:
t (ns)
cluster
71
1524957248.4375
1
72
1524957265.625
1
699
14624846476.5625
2
700
14624846653.125
2
701
14624846661.287
2
1161
25172864926.5625
3
1160
25172864935.9375
3
Thanks for your help!
Assuming you want to create the "cluster" column from the index based on the proximity of the successive values, you could use:
thresh = 1
df['cluster'] = df.index.to_series().diff().gt(thresh).cumsum().add(1)
using the "t (ns)":
thresh = 1
df['cluster'] = df['t (ns)'].diff().gt(thresh).cumsum().add(1)
output:
t (ns) cluster
71 1.524957e+09 1
72 1.524957e+09 1
699 1.462485e+10 2
700 1.462485e+10 2
701 1.462485e+10 2
1161 2.517286e+10 3
1160 2.517286e+10 3
You can 'round' the t (ns) column by floor dividing them with a threshold value and looking at their differences:
df[['t (ns)']].assign(
cluster=(df['t (ns)'] // 10E7)
.diff().gt(0).cumsum().add(1)
)
Or you can experiment with the number of clusters you try to organize your data:
bins=3
df[['t (ns)']].assign(
bins=pd.cut(
df['t (ns)'], bins=bins).cat.rename_categories(range(1, bins + 1)
)
)

Summarising features with multiple values in Python for Machine Learning model

I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
As you can see from the table above, I have multiple measurements per pregnancy (between 1 and 26 observations each).
I want to summarise the ultrasound measurements somehow such that I can replace the multiple measurements with a fixed amount of features per pregnancy. So I thought of creating 3 new features, one for each trimester of pregnancy that would hold the maximum measurement recorded during that trimester:
abdomCirc1st: this feature would hold the maximum value of all abdominal circumference measurements measured between 0 to 13 Weeks
abdomCirc2nd: this feature would hold the maximum value of all abdominal circumference measurements measured between 14 to 26 Weeks
abdomCirc3rd: this feature would hold the maximum value of all abdominal circumference measurements measured between 27 to 40 Weeks
So my final dataset would look like this:
PregnancyID MotherID abdomCirc1st abdomCirc2nd abdomCirc3rd
0 0 NaN 200 NaN
1 1 NaN 315 350
2 2 180 NaN NaN
The reason for using the maximum here is that a larger abdominal circumference is associated with the adverse outcome I am trying to predict.
But I am quite confused about how to go about this. I have used the groupby function previously to derive certain statistical features from the multiple measurements, however this is a more complex task.
What I want to do is the following:
Group all abdominal circumference measurements that belong to the same pregnancy into 3 trimesters based on gestationalAgeInWeeks value
Compute the maximum value of all abdominal circumference measurements within each trimester, and assign this value to the relevant feature; abdomCirc1st, abdomCir2nd or abdomCirc3rd.
I think I have to do something along the lines of:
df["abdomCirc1st"] = df.groupby(['MotherID', 'PregnancyID', 'gestationalAgeInWeeks'])["abdomCirc"].transform('max')
But this code does not check what trimester the measurement was taken in (gestationalAgeInWeeks). I would appreciate some help with this task.
You can try this. a bit of a complicated query but it seems to work:
(df.groupby(['MotherID', 'PregnancyID'])
.apply(lambda d: d.assign(tm = (d['gestationalAgeInWeeks']+ 13 - 1 )// 13))
.groupby('tm')['abdomCirc']
.apply(max))
.unstack()
)
produces
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 294.0 350.0
2 2 180.0 NaN NaN
Let's unpick this a bit. First we groupby on MontherId, PregnancyID. Then we apply a function to each grouped dataframe (d)
For each d, we create a 'trimester' column 'tm' via assign (I assume I got the math right here, but correct it if it is wrong!), then we groupby by 'tm' and apply max. For each sub-dataframe d then we obtain a Series which is tm:max(abdomCirc).
Then we unstack() that moves tm to the column names
You may want to rename this columns later, but I did not bother
Solution 2
Come to think of it you can simplify the above a bit:
(df.assign(tm = (df['gestationalAgeInWeeks']+ 13 - 1 )// 13))
.drop(columns = 'gestationalAgeInWeeks')
.groupby(['MotherID', 'PregnancyID','tm'])
.agg('max')
.unstack()
)
similar idea, same output.
There is a magic command called query. This should do your work for now:
abdomCirc1st = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks <= 13')['abdomCirc'].max()
abdomCirc2nd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 14 and gestationalAgeInWeeks <= 26')['abdomCirc'].max()
abdomCirc3rd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 27 and gestationalAgeInWeeks <= 40')['abdomCirc'].max()
If you want something more automatic (and not manually changing the values of your ID's: MotherID and PregnancyID, every time for each different group of rows), you have to combine it with groupby (as you did on your own)
Check this as well: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html

Get maximum relative difference between row-values and row-mean in new pandas dataframe column

I want to have an extra column with the maximum relative difference [-] of the row-values and the mean of these rows:
The df is filled with energy use data for several years.
The theoretical formula that should get me this is as follows:
df['max_rel_dif'] = MAX [ ABS(highest energy use – mean energy use), ABS(lowest energy use – mean energy use)] / mean energy use
Initial dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014
0 23 22631 21954.0 22314.0 22032 21843
1 43 27456 29654.0 28159.0 28654 2000
2 36 61200 NaN NaN 31895 1600
3 87 87621 86542.0 87542.0 88456 86961
4 90 58951 57486.0 2000.0 0 0
5 98 24587 25478.0 NaN 24896 25461
Desired dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014 max_rel_dif
0 23 22631 21954.0 22314.0 22032 21843 0.02149
1 43 27456 29654.0 28159.0 28654 2000 0.91373
2 36 61200 NaN NaN 31895 1600 0.94931
3 87 87621 86542.0 87542.0 88456 86961 0.01179
4 90 58951 57486.0 2000.0 0 0 1.48870
5 98 24587 25478.0 NaN 24896 25461 0.02065
tried code:
import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [23,43,36,87,90,98],
"y_2010": [22631,27456,61200,87621,58951,24587],
"y_2011": [21954,29654,np.nan,86542,57486,25478],
"y_2012": [22314,28159,np.nan,87542,2000,np.nan],
"y_2013": [22032,28654,31895,88456,0,24896,],
"y_2014": [21843,2000,1600,86961,0,25461]})

print(df)

a = df.loc[:, ['y_2010','y_2011','y_2012','y_2013', 'y_2014']]


# calculate mean
mean = a.mean(1)
# calculate max_rel_dif
df['max_rel_dif'] = (((df.max(axis=1).sub(mean)).abs(),(df.min(axis=1).sub(mean)).abs()).max()).div(mean)
# AttributeError: 'tuple' object has no attribute 'max'
-> I'm obviously doing the wrong thing with the tuple, I just don't know how to get the maximum values
from the tuples and divide them then by the mean in the proper Phytonic way
I feel like the whole function can be
s=df.filter(like='y')
s.sub(s.mean(1),axis=0).abs().max(1)/s.mean(1)
0 0.021494
1 0.913736
2 0.949311
3 0.011800
4 1.488707
5 0.020653
dtype: float64

Pandas timeseries bins and indexing

I have some experimental data collected from a number of samples at set time intervals, in a dataframe organised like so:
Studynumber Time Concentration
1 20 80
1 40 60
1 60 40
2 15 95
2 44 70
2 65 30
Although the time intervals are supposed to be fixed, there is some variation in the data based on when they were actually collected. I want to create bins of the Time column, calculate an 'average' concentration, and then compare the difference between actual concentration and average concentration for each studynumber, at each time.
To do this, I created a column called 'roundtime', then used a groupby to calculate the mean:
data['roundtime']=data['Time'].round(decimals=-1)
meanconc = data.groupby('roundtime')['Concentration'].mean()
This gives a pandas series of the mean concentrations, with roundtime as the index. Then I want to get this back into the main frame to calculate the difference between each actual concentration and the mean concentration:
data['meanconcentration']=meanconc.loc[data['roundtime']].reset_index()['Concentration']
This works for the first 60 or so values, but then returns NaN for each entry, I think because the index of data is longer than the index of meanconcentration.
On the one hand, this looks like an indexing issue - equally, it could be that I'm just approaching this the wrong way. So my question is: a) can this method work? and b) is there another/better way of doing it? All advice welcome!
Use transform to add a column from a groupby aggregation, this will create a Series with it's index aligned to the original df so you can assign it back correctly:
In [4]:
df['meanconcentration'] = df.groupby('roundtime')['Concentration'].transform('mean')
df
Out[4]:
Studynumber Time Concentration roundtime meanconcentration
0 1 20 80 20 87.5
1 1 40 60 40 65.0
2 1 60 40 60 35.0
3 2 15 95 20 87.5
4 2 44 70 40 65.0
5 2 65 30 60 35.0

sorting the quintile output from qcut in pandas python

I have a ebola dataset with 499 records. I am trying to find the number of observations in each quintile based on the prob(probability variable). the number of observations should fall into categories 0-20%, 20-40% etc. My code I think to do this is,
test = pd.qcut(ebola.prob,5).value_counts()
this returns
[0.044, 0.094] 111
(0.122, 0.146] 104
(0.106, 0.122] 103
(0.146, 0.212] 92
(0.094, 0.106] 89
My question is how do I sort this to return the correct number of observations for 0-20%, 20-40% 40-60% 60-80% 80-100%?
I have tried
test.value_counts(sort=False)
This returns
104 1
89 1
92 1
103 1
111 1
Is this the order 104,89,92,103,111? for each quintile?
I am confused because if I look at the probability outputs from my first piece of code it looks like it should be 111,89,103,104,92?
What you're doing is essentially correct but you might have two issues:
I think you are using pd.cut() instead of pd.qcut().
You are applying value_counts() one too many times.
(1) You can reference this question here here; when you use pd.qcut(), you should have the same number of records in each bin (assuming that your total records are evenly divisible by the # of bins) which you do not. Maybe check and make sure you are using the one you intended to use.
Here is some random data to illustrate (2):
>>> np.random.seed(1234)
>>> arr = np.random.randn(100).reshape(100,1)
>>> df = pd.DataFrame(arr, columns=['prob'])
>>> pd.cut(df.prob, 5).value_counts()
(0.00917, 1.2] 47
(-1.182, 0.00917] 34
(1.2, 2.391] 9
(-2.373, -1.182] 8
(-3.569, -2.373] 2
Adding the sort flag will get you what you want
>>> pd.cut(df.prob, 5).value_counts(sort=False)
(-3.569, -2.373] 2
(-2.373, -1.182] 8
(-1.182, 0.00917] 34
(0.00917, 1.2] 47
(1.2, 2.391] 9
or with pd.qcut()
>>> pd.qcut(df.prob, 5).value_counts(sort=False)
[-3.564, -0.64] 20
(-0.64, -0.0895] 20
(-0.0895, 0.297] 20
(0.297, 0.845] 20
(0.845, 2.391] 20

Categories

Resources