Mean Std in pandas data frame - python

Having a pandas data frame as follow:
   a   b
0  1  12
1  1  13
2  1  23
3  2  22
4  2  23
5  2  24
6  3  30
7  3  35
8  3  55
I want to find the mean and standard deviation of column b in each group.
My following code gives me 0 for each group:
stdMeann = lambda x: np.std(np.mean(x))
print(pd.Series(data.groupby('a').b.apply(stdMeann)))

As noted in the comments you can use .agg to aggregate by multiple statistics:
In [11]: df.groupby("a")["b"].agg([np.mean, np.std])
Out[11]:
   mean        std
a
1    16   6.082763
2    23   1.000000
3    40  13.228757
pandas lets you pass dispatch strings rather than the numpy functions:
In [12]: df.groupby("a")["b"].agg(["mean", "std"]) # just b
Out[12]:
   mean        std
a
1    16   6.082763
2    23   1.000000
3    40  13.228757
In [13]: df.groupby("a").agg(["mean", "std"]) # all columns
Out[13]:
      b
   mean        std
a
1    16   6.082763
2    23   1.000000
3    40  13.228757
You can also specify what to do on a per-column basis:
In [14]: df.groupby("a").agg({"b": ["mean", "std"]})
Out[14]:
      b
   mean        std
a
1    16   6.082763
2    23   1.000000
3    40  13.228757
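If you are on pandas 0.25 or newer (an assumption, not something stated in the question), named aggregation gives the same numbers with flat, explicit column names instead of a MultiIndex:
In [15]: df.groupby("a").agg(b_mean=("b", "mean"), b_std=("b", "std"))
Out[15]:
   b_mean      b_std
a
1      16   6.082763
2      23   1.000000
3      40  13.228757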
Note: the reason you were getting 0s was that np.std of a single number is 0 (it's a little surprising to me that it's not an error, but there we are):
In [21]: np.std(1)
Out[21]: 0.0
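If the goal was simply the per-group standard deviation, the fix to the original lambda is to drop the inner np.mean so that np.std sees the whole group rather than a single number. A minimal sketch, rebuilding the question's frame (note that np.std defaults to ddof=0, while the .agg output above corresponds to the sample definition, ddof=1):
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                     'b': [12, 13, 23, 22, 23, 24, 30, 35, 55]})

# std over each group's values is non-zero
print(data.groupby('a')['b'].apply(np.std))  # population std (ddof=0)
print(data.groupby('a')['b'].std())          # sample std (ddof=1), matches the agg output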

Related

How can I normalize data in a pandas dataframe to the starting value of a time series?

I would like to analyze a dataset from a clinical study using pandas.
Patients come to the clinic at different visits and some parameters are measured. I would like to normalize the blood parameters to the values of the first visit (baseline values), i.e. Normalized = Parameter[Visit X] / Parameter[Visit 1].
The dataset looks roughly like the following example:
import pandas as pd
import numpy as np
rng = np.random.RandomState(0)
df = pd.DataFrame({'Patient': ['A','A','A','B','B','B','C','C','C'],
                   'Visit': [1,2,3,1,2,3,1,2,3],
                   'Parameter': rng.randint(0, 100, 9)},
                  columns=['Patient', 'Visit', 'Parameter'])
df
  Patient  Visit  Parameter
0       A      1         44
1       A      2         47
2       A      3         64
3       B      1         67
4       B      2         67
5       B      3          9
6       C      1         83
7       C      2         21
8       C      3         36
Now I would like to add a column that contains each parameter normalized to the baseline value, i.e. the value at Visit 1. The simplest approach would be to add a column that holds only the Visit 1 value for each patient and then divide the parameter column by it. However, I fail to create such a column, one holding the baseline value for each respective patient. But maybe there is also a one-line solution without adding another column.
The result should look like this:
  Patient  Visit  Parameter  Normalized
0       A      1         44        1.0
1       A      2         47        1.07
2       A      3         64        1.45
3       B      1         67        1.0
4       B      2         67        1.0
5       B      3          9        0.13
6       C      1         83        1.0
7       C      2         21        0.25
8       C      3         36        0.43
IIUC, use GroupBy.transform:
df['Normalized'] = df['Parameter'].div(df.groupby('Patient')['Parameter']
                                         .transform('first'))
print(df)
  Patient  Visit  Parameter  Normalized
0       A      1         44    1.000000
1       A      2         47    1.068182
2       A      3         64    1.454545
3       B      1         67    1.000000
4       B      2         67    1.000000
5       B      3          9    0.134328
6       C      1         83    1.000000
7       C      2         21    0.253012
8       C      3         36    0.433735
df['Normalized'] = df['Parameter'].div(df.groupby('Patient')['Parameter']
                                         .transform('first')).round(2)
print(df)
  Patient  Visit  Parameter  Normalized
0       A      1         44        1.00
1       A      2         47        1.07
2       A      3         64        1.45
3       B      1         67        1.00
4       B      2         67        1.00
5       B      3          9        0.13
6       C      1         83        1.00
7       C      2         21        0.25
8       C      3         36        0.43
If you need to create a new DataFrame:
df2 = df.assign(Normalized = df['Parameter'].div(df.groupby('Patient')['Parameter'].transform('first')))
We could also use a lambda, as suggested.
Or:
df2 = df.copy()
df2['Normalized'] = df['Parameter'].div(df.groupby('Patient')['Parameter']
                                          .transform('first'))
What @ansev said: GroupBy.transform.
If you wish to preserve the Parameter column, just run the last line he wrote but with Normalized instead of Parameter as the new column name:
df = df.assign(Normalized = lambda x: x['Parameter'].div(x.groupby('Patient')['Parameter'].transform('first')))
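If you do want the intermediate baseline column described in the question (the Visit 1 value repeated for each patient), the same transform can build it explicitly first. A small sketch, assuming rows are ordered so that Visit 1 comes first within each patient, as in the example:
df['Baseline'] = df.groupby('Patient')['Parameter'].transform('first')  # per-patient Visit 1 value
df['Normalized'] = df['Parameter'] / df['Baseline']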

Python: How to fill a customized value (such as "#NA###") with the bfill method?

I have a data frame containing "#NA###". I want to back-fill this value with the group mean.
I know I can first replace "#NA###" with np.nan and then use pd.fillna, but are there any more convenient ways?
Setup
df
   Group   Value
0      1      10
1      1  #NA###
2      3       5
3      2      10
4      2  #NA###
5      3  #NA###
6      1      40
7      2  #NA###
8      3     100
9      1      20
Call pd.to_numeric to coerce those strings to NaN.
df.Value = pd.to_numeric(df.Value, errors='coerce')
Now, group by Group, and call fillna with the mean -
df = df.set_index('Group').Value\
       .fillna(df.groupby('Group').mean().Value)\
       .reset_index()
df
   Group       Value
0      1   10.000000
1      1   23.333333
2      3    5.000000
3      2   10.000000
4      2   10.000000
5      3   52.500000
6      1   40.000000
7      2   10.000000
8      3  100.000000
9      1   20.000000
An alternative fill method (from a now-deleted answer), which I thought was pretty good, involves groupby + transform -
df.Value = df.Value.fillna(df.groupby('Group')['Value'].transform('mean'))
df
   Group       Value
0      1   10.000000
1      1   23.333333
2      3    5.000000
3      2   10.000000
4      2   10.000000
5      3   52.500000
6      1   40.000000
7      2   10.000000
8      3  100.000000
9      1   20.000000
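The two steps can also be chained so the coercion and the per-group fill happen in one pass; a minimal sketch, assuming the same Group/Value frame:
# coerce the '#NA###' strings to NaN, then fill each NaN with its group's mean
vals = pd.to_numeric(df['Value'], errors='coerce')
df['Value'] = vals.fillna(vals.groupby(df['Group']).transform('mean'))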

Pandas: Iterate group by object

Given a data frame like the following:
In [8]:
df
Out[8]:
   Experiment  SampleVol  Mass
0           A          1    11
1           A          1    12
2           A          2    20
3           A          2    17
4           A          2    21
5           A          3    28
6           A          3    29
7           A          4    35
8           A          4    38
9           A          4    35
10          B          1    12
11          B          1    11
12          B          2    22
13          B          2    24
14          B          3    30
15          B          3    33
16          B          4    37
17          B          4    42
18          C          1     8
19          C          1     7
20          C          2    17
21          C          2    19
22          C          3    29
23          C          3    30
24          C          3    31
25          C          4    41
26          C          4    44
27          C          4    42
I would like to run a correlation study on the data frame of each Experiment. The study I want to conduct is to calculate the correlation of 'SampleVol' with the mean of 'Mass'.
The groupby function can help me get the mean of the masses.
grp = df.groupby(['Experiment', 'SampleVol'])
grp.mean()
Out[17]:
                           Mass
Experiment SampleVol
A          1          11.500000
           2          19.333333
           3          28.500000
           4          36.000000
B          1          11.500000
           2          23.000000
           3          31.500000
           4          39.500000
C          1           7.500000
           2          18.000000
           3          30.000000
           4          42.333333
I understand that for each data frame I should use some numpy function to compute the correlation coefficient. But now my question is: how can I iterate over the data frames for each Experiment?
Following is an example of the desired output.
Out[18]:
Experiment  Slope  Intercept
A            0.91       0.01
B            1.1        0.02
C            0.95       0.03
Thank you very much.
You'll want to group on just the 'Experiment' column, rather than the two columns as you have above. You can iterate through the groups and perform a simple linear regression on the grouped values using the code below:
from scipy import stats
import pandas as pd
import numpy as np
grp = df.groupby(['Experiment'])
output = pd.DataFrame(columns=['Slope', 'Intercept'])
for name, group in grp:
    slope, intercept, r_value, p_value, std_err = stats.linregress(group['SampleVol'], group['Mass'])
    output.loc[name] = [slope, intercept]
print(output)
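If you would rather not build the result row by row, the regression can also be wrapped in a function and applied per group; a sketch under the same assumptions (a df with Experiment, SampleVol and Mass columns):
from scipy import stats
import pandas as pd

def fit_line(group):
    # linregress returns slope, intercept, r, p and stderr; keep the first two
    slope, intercept, *_ = stats.linregress(group['SampleVol'], group['Mass'])
    return pd.Series({'Slope': slope, 'Intercept': intercept})

output = df.groupby('Experiment').apply(fit_line)
print(output)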
For those curious, this is how I generated the dummy data:
df = pd.DataFrame()
df['Experiment'] = np.array(pd.date_range('2018-01-01', periods=12, freq='6h').strftime('%a'))
df['SampleVol'] = np.random.uniform(1,5,12)
df['Mass'] = np.random.uniform(10,42,12)
References:
How to loop over grouped Pandas dataframe?
scipy.stats.linregress — SciPy v1.0.0 Reference Guide
Group By: split-apply-combine — pandas 0.22.0 documentation

Iterate through the rows of a dataframe and reassign minimum values by group

I am working with a dataframe that looks like this.
   id  time  diff
0   0    34   nan
1   0    36     2
2   1    43     7
3   1    55    12
4   1    59     4
5   2     2   -57
6   2    10     8
What is an efficient way to find the minimum values of 'time' by id and then set 'diff' to nan at those minimum values? I am looking for a solution that results in:
   id  time  diff
0   0    34   nan
1   0    36     2
2   1    43   nan
3   1    55    12
4   1    59     4
5   2     2   nan
6   2    10     8
Group by 'id' and use idxmin to find the location of the minimum 'time' values. Finally, use loc to assign np.nan:
df.loc[df.groupby('id').time.idxmin(), 'diff'] = np.nan
df
You can group time by id and compute a boolean vector that is True where the time is the minimum within its group and False otherwise, then use that vector to assign NaN to the corresponding rows:
import numpy as np
import pandas as pd
df.loc[df.groupby('id')['time'].apply(lambda g: g == min(g)), "diff"] = np.nan
df
#   id  time  diff
# 0  0    34   NaN
# 1  0    36   2.0
# 2  1    43   NaN
# 3  1    55  12.0
# 4  1    59   4.0
# 5  2     2   NaN
# 6  2    10   8.0
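Equivalently, the boolean mask can be built without a Python-level lambda by comparing 'time' against a per-group transform; a small sketch on the same df (like the lambda version, this marks every row that ties for the group minimum, whereas idxmin marks only the first):
mask = df['time'] == df.groupby('id')['time'].transform('min')  # True at each group's minimum time
df.loc[mask, 'diff'] = np.nan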

How to normalize values in a dataframe column in different ranges

I have a dataframe like this:
     T  data
0    0    10
1    1    20
2    2    30
3    3    40
4    4    50
5    0     5
6    1    13
7    2    21
8    0     3
9    1     7
10   2    11
11   3    15
12   4    19
The values in T are sequences which all start at 0 and run up to a certain value, where the maximum can differ between the sequences.
Normally, the values in data are NOT equally spaced, that is now just for demonstration purposes.
What I want to achieve is to add a third column called dataDiv where each value in data of a certain sequence is divided by the value at T = 0 that belongs to the respective sequence. In my case, I have 3 sequences and for the first sequence I want to divide each value by 10, in the second sequence each value should be divided by 5 and in the third by 3.
So the expected outcome would look like this:
     T  data   dataDiv
0    0    10  1.000000
1    1    20  2.000000
2    2    30  3.000000
3    3    40  4.000000
4    4    50  5.000000
5    0     5  1.000000
6    1    13  2.600000
7    2    21  4.200000
8    0     3  1.000000
9    1     7  2.333333
10   2    11  3.666667
11   3    15  5.000000
12   4    19  6.333333
The way I currently implement it is as follows:
I first determine the indices at which T = 0. Then I loop through these indices and divide the values in data by the value at T = 0 of the respective sequence, which gives me the desired output (shown above). The code looks as follows:
import pandas as pd
df = pd.DataFrame({'T': list(range(5)) + list(range(3)) + list(range(5)),
                   'data': list(range(10, 60, 10)) + list(range(5, 25, 8)) + list(range(3, 21, 4))})
# get indices where T = 0
idZE = df[df['T'] == 0].index.tolist()
# last index of dataframe
idZE.append(max(df.index)+1)
# add the column with normalized values
df['dataDiv'] = df['data']
# loop through indices where T = 0 and normalize values
for ix, indi in enumerate(idZE[:-1]):
    df['dataDiv'].iloc[indi:idZE[ix + 1]] = df['data'].iloc[indi:idZE[ix + 1]] / df['data'].iloc[indi]
My question is: is there a smarter solution that avoids the loop?
The following approach avoids loops in favour of vectorized computations and should perform faster. The basic idea is to label runs of integers in column 'T', find the first value in each of these groups and then divide the values in 'data' by the appropriate first value.
df['grp'] = (df['T'] == 0).cumsum() # label consecutive runs of integers
x = df.groupby('grp')['data'].first() # first value in each group
df['dataDiv'] = df['data'] / df['grp'].map(x) # divide
This gives the DataFrame with the desired column:
     T  data  grp   dataDiv
0    0    10    1  1.000000
1    1    20    1  2.000000
2    2    30    1  3.000000
3    3    40    1  4.000000
4    4    50    1  5.000000
5    0     5    2  1.000000
6    1    13    2  2.600000
7    2    21    2  4.200000
8    0     3    3  1.000000
9    1     7    3  2.333333
10   2    11    3  3.666667
11   3    15    3  5.000000
12   4    19    3  6.333333
(You can then drop the 'grp' column if you wish: df.drop('grp', axis=1).)
As @DSM points out below, the three lines of code can be collapsed into one with the use of groupby.transform:
df['dataDiv'] = df['data'] / df.groupby((df['T'] == 0).cumsum())['data'].transform('first')
