Given a data frame like the following:
In [8]:
df
Out[8]:
Experiment SampleVol Mass
0 A 1 11
1 A 1 12
2 A 2 20
3 A 2 17
4 A 2 21
5 A 3 28
6 A 3 29
7 A 4 35
8 A 4 38
9 A 4 35
10 B 1 12
11 B 1 11
12 B 2 22
13 B 2 24
14 B 3 30
15 B 3 33
16 B 4 37
17 B 4 42
18 C 1 8
19 C 1 7
20 C 2 17
21 C 2 19
22 C 3 29
23 C 3 30
24 C 3 31
25 C 4 41
26 C 4 44
27 C 4 42
I would like to run a correlation study on the data frame for each Experiment. Specifically, I want to calculate the correlation of 'SampleVol' with the mean of 'Mass'.
The groupby function can help me get the mean of the masses:
grp = df.groupby(['Experiment', 'SampleVol'])
grp.mean()
Out[17]:
Mass
Experiment SampleVol
A 1 11.500000
2 19.333333
3 28.500000
4 36.000000
B 1 11.500000
2 23.000000
3 31.500000
4 39.500000
C 1 7.500000
2 18.000000
3 30.000000
4 42.333333
I understand that for each data frame I should use some NumPy function to compute the correlation coefficient. But now my question is: how can I iterate over the data frames for each Experiment?
Following is an example of the desired output.
Out[18]:
Experiment Slope Intercept
A 0.91 0.01
B 1.1 0.02
C 0.95 0.03
Thank you very much.
You'll want to group on just the 'Experiment' column, rather than on the two columns as you have above. You can then iterate through the groups and perform a simple linear regression on the grouped values using the code below:
from scipy import stats
import pandas as pd
import numpy as np

grp = df.groupby('Experiment')
output = pd.DataFrame(columns=['Slope', 'Intercept'])
for name, group in grp:
    # fit Mass as a linear function of SampleVol within this Experiment
    slope, intercept, r_value, p_value, std_err = stats.linregress(group['SampleVol'], group['Mass'])
    output.loc[name] = [slope, intercept]
print(output)
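Alternatively, here is a minimal sketch (assuming df as shown in the question) that lets groupby.apply build the result frame directly instead of filling it row by row:
def fit(group):
    # linregress returns slope, intercept, r-value, p-value and stderr
    slope, intercept, *_ = stats.linregress(group['SampleVol'], group['Mass'])
    return pd.Series({'Slope': slope, 'Intercept': intercept})

output = df.groupby('Experiment').apply(fit)
print(output)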
For those curious, this is how I generated the dummy data and what it looks like:
df = pd.DataFrame()
df['Experiment'] = np.array(pd.date_range('2018-01-01', periods=12, freq='6h').strftime('%a'))
df['SampleVol'] = np.random.uniform(1,5,12)
df['Mass'] = np.random.uniform(10,42,12)
References:
How to loop over grouped Pandas dataframe?
scipy.stats.linregress — SciPy v1.0.0 Reference Guide
Group By: split-apply-combine — pandas 0.22.0 documentation
I want to pivot this dataframe and convert the original columns into a second-level MultiIndex (or a column).
Original dataframe:
Type VC C B Security
0 Standard 2 2 2 A
1 Standard 16 13 0 B
2 Standard 52 35 2 C
3 RI 10 10 0 A
4 RI 10 15 31 B
5 RI 10 15 31 C
Desired dataframe:
Type A B C
0 Standard VC 2 16 52
1 Standard C 2 13 35
2 Standard B 2 0 2
3 RI VC 10 10 10
11 RI C 10 15 15
12 RI B 0 31 31
You could try as follows:
Use df.pivot and then transpose using df.T.
Next, chain df.sort_index to rearrange the entries, and apply df.swaplevel to change the order of the MultiIndex.
Lastly, consider getting rid of the Security as columns.name, and adding an index.name for the unnamed variable, e.g. Subtype here.
If you want the MultiIndex as columns, you can of course simply use df.reset_index at this stage (a minimal sketch follows the output below).
res = (df.pivot(index='Security', columns='Type').T
.sort_index(level=[1,0], ascending=[False, False])
.swaplevel(0))
res.columns.name = None
res.index.names = ['Type','Subtype']
print(res)
A B C
Type Subtype
Standard VC 2 16 52
C 2 13 35
B 2 0 2
RI VC 10 10 10
C 10 15 15
B 0 31 31
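And, per the last step above, a minimal sketch of moving the MultiIndex back into ordinary columns with df.reset_index:
res = res.reset_index()
print(res)
       Type Subtype   A   B   C
0  Standard      VC   2  16  52
1  Standard       C   2  13  35
2  Standard       B   2   0   2
3        RI      VC  10  10  10
4        RI       C  10  15  15
5        RI       B   0  31  31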
I have the following dataframe:
df = pd.DataFrame({'timestamp' : [10,10,10,20,20,20], 'idx': [1,2,3,1,2,3], 'v1' : [1,2,4,5,1,9], 'v2' : [1,2,8,5,1,2]})
timestamp idx v1 v2
0 10 1 1 1
1 10 2 2 2
2 10 3 4 8
3 20 1 5 5
4 20 2 1 1
5 20 3 9 2
I'd like to group the data by timestamp and calculate the following aggregate statistic:
np.sum(v1*v2) for every timestamp. I'd like to see the following result:
timestamp idx v1 v2 stat
0 10 1 1 1 37
1 10 2 2 2 37
2 10 3 4 8 37
3 20 1 5 5 44
4 20 2 1 1 44
5 20 3 9 2 44
I'm trying to do the following:
def calc_some_stat(d):
    return np.sum(d.v1 * d.v2)

df.loc[:, 'stat'] = df.groupby('timestamp').apply(calc_some_stat)
But the stat column comes out as all NaN values. What is wrong with my code?
We want groupby transform here, not groupby apply:
df['stat'] = (df['v1'] * df['v2']).groupby(df['timestamp']).transform('sum')
If we really want to use the function, we need to join back to scale up the aggregated DataFrame:
def calc_some_stat(d):
    return np.sum(d.v1 * d.v2)

df = df.join(
    df.groupby('timestamp').apply(calc_some_stat)
      .rename('stat'),  # needed to use join, but this also sets the column name
    on='timestamp'
)
df:
timestamp idx v1 v2 stat
0 10 1 1 1 37
1 10 2 2 2 37
2 10 3 4 8 37
3 20 1 5 5 44
4 20 2 1 1 44
5 20 3 9 2 44
The issue is that groupby apply is producing summary information:
timestamp
10 37
20 44
dtype: int64
This does not assign back to the DataFrame naturally, as there are only 2 rows where the initial DataFrame has 6. We either need to use join to scale these 2 rows up to align with the original DataFrame, or we can avoid all of this by using groupby transform, which is designed to produce a "like-indexed DataFrame on each group and return a DataFrame having the same indexes as the original object filled with the transformed values".
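If you do want to keep the apply-based function, a minimal sketch that maps the 2 summary rows back onto the 6 original rows (equivalent to the join above, assuming calc_some_stat as defined earlier):
# Series.map aligns each row's timestamp with the aggregated result's index
df['stat'] = df['timestamp'].map(df.groupby('timestamp').apply(calc_some_stat))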
I would like to analyze a dataset from a clinical study using pandas.
Patients come to the clinic for several visits, at which some parameters are measured. I would like to normalize the blood parameters to the values of the first visit (the baseline values), i.e. Normalized = Parameter[Visit X] / Parameter[Visit 1].
The dataset looks roughly like the following example:
import pandas as pd
import numpy as np
rng = np.random.RandomState(0)
df = pd.DataFrame({'Patient': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'Visit': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'Parameter': rng.randint(0, 100, 9)},
                  columns=['Patient', 'Visit', 'Parameter'])
df
Patient Visit Parameter
0 A 1 44
1 A 2 47
2 A 3 64
3 B 1 67
4 B 2 67
5 B 3 9
6 C 1 83
7 C 2 21
8 C 3 36
Now I would like to add a column that contains each parameter normalized to the baseline value, i.e. the value at Visit 1. The simplest approach would be to add a column that contains only the Visit 1 value for each patient and then divide the parameter column by this added column. However, I have failed to create such a column with the baseline value for each respective patient. Maybe there is also a one-line solution that avoids adding another column.
The result should look like this:
Patient Visit Parameter Normalized
0 A 1 44 1.0
1 A 2 47 1.07
2 A 3 64 1.45
3 B 1 67 1.0
4 B 2 67 1.0
5 B 3 9 0.13
6 C 1 83 1.0
7 C 2 21 0.25
8 C 3 36 0.43
IIUC, use GroupBy.transform:
df['Normalized'] = df['Parameter'].div(df.groupby('Patient')['Parameter']
                                         .transform('first'))
print(df)
Patient Visit Parameter Normalized
0 A 1 44 1.000000
1 A 2 47 1.068182
2 A 3 64 1.454545
3 B 1 67 1.000000
4 B 2 67 1.000000
5 B 3 9 0.134328
6 C 1 83 1.000000
7 C 2 21 0.253012
8 C 3 36 0.433735
To match the rounding in the desired output:
df['Normalized'] = df['Parameter'].div(df.groupby('Patient')['Parameter']
                                         .transform('first')).round(2)
print(df)
Patient Visit Parameter Normalized
0 A 1 44 1.00
1 A 2 47 1.07
2 A 3 64 1.45
3 B 1 67 1.00
4 B 2 67 1.00
5 B 3 9 0.13
6 C 1 83 1.00
7 C 2 21 0.25
8 C 3 36 0.43
If you need to create a new DataFrame:
df2 = df.assign(Normalized = df['Parameter'].div(df.groupby('Patient')['Parameter'].transform('first')))
We could also use a lambda inside assign, as in the final answer below.
Or:
df2 = df.copy()
df2['Normalized'] = df['Parameter'].div(df.groupby('Patient')['Parameter']
                                          .transform('first'))
What #ansev said: GroupBy.transform
If you wish to preserve the Parameter column, just run the last line he wrote but with Normalized instead of Parameter as the new column name:
df = df.assign(Normalized = lambda x: x['Parameter'].div(x.groupby('Patient')['Parameter'].transform('first')))
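For completeness, the intermediate-column approach described in the question also works as a minimal sketch; the Baseline column name here is just an illustrative choice:
# broadcast each patient's first Parameter onto every row, then divide
df['Baseline'] = df.groupby('Patient')['Parameter'].transform('first')
df['Normalized'] = df['Parameter'] / df['Baseline']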
Say we have a DataFrame df
df = pd.DataFrame({
    "Id": [1, 2],
    "Value": [2, 5]
})
df
Id Value
0 1 2
1 2 5
and some function f which takes an element of df and returns a DataFrame.
def f(value):
    return pd.DataFrame({"A": range(10, 10 + value), "B": range(20, 20 + value)})
f(2)
A B
0 10 20
1 11 21
We want to apply f to each element in df["Value"], and join the result to df, like so:
Id Value A B
0 1 2 10 20
1 1 2 11 21
2 2 5 10 20
2 2 5 11 21
2 2 5 12 22
2 2 5 13 23
2 2 5 14 24
In T-SQL, with a table df and table-valued function f, we would do this with a CROSS APPLY:
SELECT * FROM df
CROSS APPLY f(df.Value)
How can we do this in pandas?
You could apply the function to each element in Value in a list comprehension and use pd.concat to concatenate all the resulting dataframes. Also assign the corresponding Id so that it can later be used to merge both dataframes:
l = pd.concat([f(row.Value).assign(Id=row.Id) for _, row in df.iterrows()])
df.merge(l, on='Id')
Id Value A B
0 1 2 10 20
1 1 2 11 21
2 2 5 10 20
3 2 5 11 21
4 2 5 12 22
5 2 5 13 23
6 2 5 14 24
This is one of the few cases where I would use DataFrame.iterrows. We can iterate over each row, concatenate the cartesian product of your function's output with the original dataframe, and at the same time fill the NaN values with bfill and ffill:
df = pd.concat([pd.concat([f(r['Value']), pd.DataFrame(r).T], axis=1).bfill().ffill()
                for _, r in df.iterrows()],
               ignore_index=True)
Which yields:
print(df)
A B Id Value
0 10 20 1.0 2.0
1 11 21 1.0 2.0
2 10 20 2.0 5.0
3 11 21 2.0 5.0
4 12 22 2.0 5.0
5 13 23 2.0 5.0
6 14 24 2.0 5.0
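Another minimal sketch (assuming f and the original df from the question) keys pd.concat by the original row index and joins back, which avoids the float coercion of Id and Value seen above:
# one sub-frame per original row, keyed by that row's index
parts = pd.concat({i: f(v) for i, v in df['Value'].items()})
parts = parts.reset_index(level=1, drop=True)  # keep only the row key
out = df.join(parts).reset_index(drop=True)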
Given a pandas data frame as follows:
a b
0 1 12
1 1 13
2 1 23
3 2 22
4 2 23
5 2 24
6 3 30
7 3 35
8 3 55
I want to find the mean and the standard deviation of column b in each group.
My following code gives me 0 for each group:
stdMeann = lambda x: np.std(np.mean(x))
print(pd.Series(data.groupby('a').b.apply(stdMeann)))
As noted in the comments, you can use .agg to aggregate by multiple statistics:
In [11]: df.groupby("a")["b"].agg([np.mean, np.std])
Out[11]:
mean std
a
1 16 6.082763
2 23 1.000000
3 40 13.228757
pandas lets you pass dispatch strings rather than using the numpy functions:
In [12]: df.groupby("a")["b"].agg(["mean", "std"]) # just b
Out[12]:
mean std
a
1 16 6.082763
2 23 1.000000
3 40 13.228757
In [13]: df.groupby("a").agg(["mean", "std"]) # all columns
Out[13]:
b
mean std
a
1 16 6.082763
2 23 1.000000
3 40 13.228757
You can also specify what to do on a per-column basis:
In [14]: df.groupby("a").agg({"b": ["mean", "std"]})
Out[14]:
b
mean std
a
1 16 6.082763
2 23 1.000000
3 40 13.228757
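As a side note, newer pandas versions (0.25 and later) also accept named aggregation, which gives flat column names up front; a minimal sketch:
In [15]: df.groupby("a").agg(b_mean=("b", "mean"), b_std=("b", "std"))
Out[15]:
   b_mean      b_std
a
1      16   6.082763
2      23   1.000000
3      40  13.228757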
Note: the reason you were getting 0s was that np.std of a single number is 0 (it's a little surprising to me that it's not an error, but there we are):
In [21]: np.std(1)
Out[21]: 0.0