This question already has answers here:
Lookup Values by Corresponding Column Header in Pandas 1.2.0 or newer
(4 answers)
Closed 11 months ago.
First of all: thank you for all the questions and answers. So far, I have always found a solution to my problems here. However, I'm stuck with the following problem:
I have a dataframe as this:
Jan_x Feb_x Mar_x Apr_x ... driest driest_rr DMAI Station_id
0 -433 -398 -18 508 ... Mar_x 2684 37.189000 2
1 -95 -102 164 631 ... Mar_x 2732 30.568445 10
2 59 272 691 1165 ... Jan_x 1970 40.237462 12
3 30 239 696 1108 ... Feb_x 3548 43.941148 13
4 -1128 -1193 -985 -667 ... Feb_x 12715 334.828246 15
(995 rows in total)
The first 12 columns are monthly mean temperature values (in 0.01 degrees), the last column ('Station_id') is an identifier for climate stations. From another dataframe containing precipitation data I got the driest month ('driest') and its precipitation amount ('driest_rr'; in 0.01 mm). Finally, 'DMAI' is an annual aridity index already calculated in the previous step.
Now I want to compute another Aridity Index (for meteorologists/climate scientists: the Pinna Combinative Index) that includes both the annual mean temperature and precipitation (already included in 'DMAI') and the mean temperature and precipitation of the driest month. The equations are:
DMAI = P / (T + 10)
PCI = 0.5 * (DMAI + 12 * Pd / (Td + 10))
with P, T the annual precipitation and mean temperature
and Pd, Td the precipitation and mean temperature of the driest month
(in mm and °C respectively)
I already have:
df['PCI'] = 0.5 * (df.loc[:, 'DMAI'] + (12 * (df.loc[:, 'driest_rr'] / 100)) / (df.loc[:, 'Mar_x'] + 10))
which works. However, the driest month is not always March, I need the one specified in the column 'driest'.
df['PCI'] = 0.5 * (df.loc[:, 'DMAI'] + (12 * (df.loc[:, 'driest_rr'] / 100)) / (df.loc[:, df_dmai.loc[:, 'driest']] + 10))
does not work however.
Is there a way to solve this?
I found a few similar questions, like this one:
How can I select a specific column from each row in a Pandas DataFrame?
However, the answers that I found use either the deprecated df.lookup() or a numpy workaround, so they don't help me in this case.
pandas has a lot of numpy behind it, and so the workaround from the pandas docs is very easy to plug right back into your DataFrame:
In [27]: df = pd.DataFrame({'select': ['a', 'b', 'c', 'b', 'c', 'a'], 'a': range(6), 'b': range(6, 12), 'c': range(12, 18)})
In [28]: idx, cols = pd.factorize(df['select'])
In [29]: df['chosen'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
In [30]: df
Out[30]:
select a b c chosen
0 a 0 6 12 0
1 b 1 7 13 7
2 c 2 8 14 14
3 b 3 9 15 9
4 c 4 10 16 16
5 a 5 11 17 5
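If it helps, here is a sketch of the same trick applied back to the question's PCI formula (column names taken from the question; the /100 scaling of precipitation mirrors the asker's own working Mar_x line, so treat it as an assumption):
import numpy as np
import pandas as pd

# Per-row lookup of the driest month's temperature column
idx, cols = pd.factorize(df['driest'])
driest_temp = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]

df['PCI'] = 0.5 * (df['DMAI'] + (12 * (df['driest_rr'] / 100)) / (driest_temp + 10))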
You can use the loc or iloc methods; you can discover these (and other) methods by typing a . after your dataframe's name and pressing Tab for autocompletion.
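For instance, a minimal sketch with toy data (not the question's frame):
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
toy.loc[0, 'b']    # label-based selection: row label 0, column 'b' -> 4
toy.iloc[0, 1]     # position-based selection: first row, second column -> 4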
I have a table that looks like this:
code  year  month  Value A  Value B
1     2020      1      120      100
1     2020      2      130       90
1     2020      3       90       89
1     2020      4       67       65
...    ...    ...      ...      ...
100   2020     10       90       90
100   2020     11      115      100
100   2020     12      150      135
I would like to know if there's a way to rearrange the data to find the correlation between A and B for every distinct code.
What I'm thinking is, for example, getting an array for every code, like:
[(A1,A2,A3...,A12),(B1,B2,B3...,B12)]
where A and B are the values for the respective month, and then I could see the correlation between these two columns. Is there a way to make this dynamic?
IIUC, you don't need to re-arrange to get the correlation for each "code". Instead, try with groupby:
>>> df.groupby("code").apply(lambda x: x["Value A"].corr(x["Value B"]))
code
1 0.830163
100 0.977093
dtype: float64
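If you prefer the full per-code correlation matrix rather than a single number, a variant of the same idea (same column names) would be:
>>> df.groupby("code")[["Value A", "Value B"]].corr()
which returns a 2x2 matrix for each code, with the same off-diagonal values as above (0.830163 and 0.977093).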
I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
As you can see from the table above, I have multiple measurements per pregnancy (between 1 and 26 observations each).
I want to summarise the ultrasound measurements somehow such that I can replace the multiple measurements with a fixed amount of features per pregnancy. So I thought of creating 3 new features, one for each trimester of pregnancy that would hold the maximum measurement recorded during that trimester:
abdomCirc1st: this feature would hold the maximum value of all abdominal circumference measurements measured between 0 and 13 weeks
abdomCirc2nd: this feature would hold the maximum value of all abdominal circumference measurements measured between 14 and 26 weeks
abdomCirc3rd: this feature would hold the maximum value of all abdominal circumference measurements measured between 27 and 40 weeks
So my final dataset would look like this:
PregnancyID MotherID abdomCirc1st abdomCirc2nd abdomCirc3rd
0 0 NaN 200 NaN
1 1 NaN 315 350
2 2 180 NaN NaN
The reason for using the maximum here is that a larger abdominal circumference is associated with the adverse outcome I am trying to predict.
But I am quite confused about how to go about this. I have used the groupby function previously to derive certain statistical features from the multiple measurements; however, this is a more complex task.
What I want to do is the following:
Group all abdominal circumference measurements that belong to the same pregnancy into 3 trimesters based on the gestationalAgeInWeeks value
Compute the maximum value of all abdominal circumference measurements within each trimester, and assign this value to the relevant feature: abdomCirc1st, abdomCirc2nd or abdomCirc3rd.
I think I have to do something along the lines of:
df["abdomCirc1st"] = df.groupby(['MotherID', 'PregnancyID', 'gestationalAgeInWeeks'])["abdomCirc"].transform('max')
But this code does not check what trimester the measurement was taken in (gestationalAgeInWeeks). I would appreciate some help with this task.
You can try this. A bit of a complicated query, but it seems to work:
(df.groupby(['MotherID', 'PregnancyID'])
   .apply(lambda d: d.assign(tm = (d['gestationalAgeInWeeks'] + 13 - 1) // 13)
                     .groupby('tm')['abdomCirc']
                     .apply(max))
   .unstack()
)
produces
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 315.0 350.0
2 2 180.0 NaN NaN
Let's unpick this a bit. First we groupby on MotherID, PregnancyID. Then we apply a function to each grouped dataframe (d).
For each d, we create a 'trimester' column 'tm' via assign (I assume I got the math right here, but correct it if it is wrong!), then we groupby by 'tm' and apply max. For each sub-dataframe d we then obtain a Series which is tm:max(abdomCirc).
Then we unstack(), which moves tm to the column names.
You may want to rename these columns later, but I did not bother.
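If you do want friendlier names, a small follow-up could be (res here is just a hypothetical name for the result of the expression above):
res = res.rename(columns={1: 'abdomCirc1st', 2: 'abdomCirc2nd', 3: 'abdomCirc3rd'})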
Solution 2
Come to think of it, you can simplify the above a bit:
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns = 'gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
similar idea, same output.
There is a handy DataFrame method called query. This should do your work for now:
abdomCirc1st = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks <= 13')['abdomCirc'].max()
abdomCirc2nd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 14 and gestationalAgeInWeeks <= 26')['abdomCirc'].max()
abdomCirc3rd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 27 and gestationalAgeInWeeks <= 40')['abdomCirc'].max()
If you want something more automatic (so you don't have to manually change the values of your IDs, MotherID and PregnancyID, for every different group of rows), you have to combine it with groupby (as you did on your own).
Check this as well: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
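A hedged sketch of that combination might look like this (column and feature names as in the question; the third trimester follows the same pattern):
import pandas as pd

first = (df.query('gestationalAgeInWeeks <= 13')
           .groupby(['MotherID', 'PregnancyID'])['abdomCirc']
           .max()
           .rename('abdomCirc1st'))
second = (df.query('gestationalAgeInWeeks >= 14 and gestationalAgeInWeeks <= 26')
            .groupby(['MotherID', 'PregnancyID'])['abdomCirc']
            .max()
            .rename('abdomCirc2nd'))
result = pd.concat([first, second], axis=1)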
I have a dataframe consisting of two columns filled with float values. I need to calculate all the values of 'h' minus all the values of 'c', at the index previous to the current 'h' value.
So, for instance, for 'h' in row 1, I need to calculate 1.17322 - 1.17285 (the value of 'c' in the previous row).
I have tried several different methods to accomplish this, including the use of: .iloc, .shift(), .groupby(), and .diff(), but I cannot get exactly what I'm looking for.
If anybody could help, it would be greatly appreciated
c h
0 1.17285 1.17310
1 1.17287 1.17322
2 1.17298 1.17340
3 1.17346 1.17348
4 1.17478 1.17511
5 1.17595 1.17700
6 1.17508 1.17633
7 1.17474 1.17545
8 1.17463 1.17546
9 1.17224 1.17468
10 1.17437 1.17456
11 1.17552 1.17641
12 1.17750 1.17784
13 1.17694 1.17770
Try this using shift, as an example:
df['c_shift'] = df['c'].shift()
df['diff'] = df['h'] - df['c_shift']
print(df)
Output:
c h c_shift diff
0 1.17285 1.17310 NaN NaN
1 1.17287 1.17322 1.17285 0.00037
2 1.17298 1.17340 1.17287 0.00053
3 1.17346 1.17348 1.17298 0.00050
4 1.17478 1.17511 1.17346 0.00165
5 1.17595 1.17700 1.17478 0.00222
6 1.17508 1.17633 1.17595 0.00038
7 1.17474 1.17545 1.17508 0.00037
8 1.17463 1.17546 1.17474 0.00072
9 1.17224 1.17468 1.17463 0.00005
10 1.17437 1.17456 1.17224 0.00232
11 1.17552 1.17641 1.17437 0.00204
12 1.17750 1.17784 1.17552 0.00232
13 1.17694 1.17770 1.17750 0.00020
Of course, you can do this in one step:
df['diff'] = df['h'] - df['c'].shift()
I have some experimental data collected from a number of samples at set time intervals, in a dataframe organised like so:
Studynumber Time Concentration
1 20 80
1 40 60
1 60 40
2 15 95
2 44 70
2 65 30
Although the time intervals are supposed to be fixed, there is some variation in the data based on when they were actually collected. I want to create bins of the Time column, calculate an 'average' concentration, and then compare the difference between actual concentration and average concentration for each studynumber, at each time.
To do this, I created a column called 'roundtime', then used a groupby to calculate the mean:
data['roundtime']=data['Time'].round(decimals=-1)
meanconc = data.groupby('roundtime')['Concentration'].mean()
This gives a pandas series of the mean concentrations, with roundtime as the index. Then I want to get this back into the main frame to calculate the difference between each actual concentration and the mean concentration:
data['meanconcentration']=meanconc.loc[data['roundtime']].reset_index()['Concentration']
This works for the first 60 or so values, but then returns NaN for each entry, I think because the index of data is longer than the index of meanconcentration.
On the one hand, this looks like an indexing issue - equally, it could be that I'm just approaching this the wrong way. So my question is: a) can this method work? and b) is there another/better way of doing it? All advice welcome!
Use transform to add a column from a groupby aggregation; this will create a Series with its index aligned to the original df, so you can assign it back correctly:
In [4]:
df['meanconcentration'] = df.groupby('roundtime')['Concentration'].transform('mean')
df
Out[4]:
Studynumber Time Concentration roundtime meanconcentration
0 1 20 80 20 87.5
1 1 40 60 40 65.0
2 1 60 40 60 35.0
3 2 15 95 20 87.5
4 2 44 70 40 65.0
5 2 65 30 60 35.0
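From there, the difference the question asks for is just one more column (assuming the column names above):
df['difference'] = df['Concentration'] - df['meanconcentration']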
I have an ebola dataset with 499 records. I am trying to find the number of observations in each quintile based on the prob (probability) variable. The number of observations should fall into the categories 0-20%, 20-40%, etc. The code I think should do this is:
test = pd.qcut(ebola.prob,5).value_counts()
this returns
[0.044, 0.094] 111
(0.122, 0.146] 104
(0.106, 0.122] 103
(0.146, 0.212] 92
(0.094, 0.106] 89
My question is: how do I sort this to return the correct number of observations for 0-20%, 20-40%, 40-60%, 60-80% and 80-100%?
I have tried
test.value_counts(sort=False)
This returns
104 1
89 1
92 1
103 1
111 1
Is this the order 104, 89, 92, 103, 111 for each quintile?
I am confused, because if I look at the probability outputs from my first piece of code, it looks like it should be 111, 89, 103, 104, 92.
What you're doing is essentially correct but you might have two issues:
I think you are using pd.cut() instead of pd.qcut().
You are applying value_counts() one too many times.
(1) You can reference this question here; when you use pd.qcut(), you should have the same number of records in each bin (assuming that your total records are evenly divisible by the # of bins), which you do not. Maybe check and make sure you are using the one you intended to use.
Here is some random data to illustrate (2):
>>> np.random.seed(1234)
>>> arr = np.random.randn(100).reshape(100,1)
>>> df = pd.DataFrame(arr, columns=['prob'])
>>> pd.cut(df.prob, 5).value_counts()
(0.00917, 1.2] 47
(-1.182, 0.00917] 34
(1.2, 2.391] 9
(-2.373, -1.182] 8
(-3.569, -2.373] 2
Adding the sort flag will get you what you want:
>>> pd.cut(df.prob, 5).value_counts(sort=False)
(-3.569, -2.373] 2
(-2.373, -1.182] 8
(-1.182, 0.00917] 34
(0.00917, 1.2] 47
(1.2, 2.391] 9
or with pd.qcut()
>>> pd.qcut(df.prob, 5).value_counts(sort=False)
[-3.564, -0.64] 20
(-0.64, -0.0895] 20
(-0.0895, 0.297] 20
(0.297, 0.845] 20
(0.845, 2.391] 20
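As a side note, if you keep the single value_counts() call from the question, sorting by the interval index also puts the quintiles in order; a minimal sketch:
>>> pd.qcut(df.prob, 5).value_counts().sort_index()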