I would like to take a weighted average of "cycle" using "day" as the window. The window is not always the same size. How do I compute a weighted average in pandas?
In [3]: data = {'cycle':[34.1, 41, 49.0, 53.9, 35.8, 49.3, 38.6, 51.2, 44.8],
'day':[6,6,6,13,13,20,20,20,20]}
In [4]: df = pd.DataFrame(data, index=np.arange(9), columns = ['cycle', 'day'])
In [5]: df
Out[5]:
cycle day
0 34.1 6
1 41.0 6
2 49.0 6
3 53.9 13
4 35.8 13
5 49.3 20
6 38.6 20
7 51.2 20
8 44.8 20
I would expect three values (if I have done this correctly):
34.1 * 1/3 + 41 * 1/3 + 49 * 1/3 = 41.36
cycle day
41.36 6
6.90 13
45.90 20
If I'm understanding correctly, I think you just want:
df.groupby(['day']).mean()
Group on day, and then apply a lambda function that calculates the sum of the group and divides it by the number of non-null values within the group.
>>> df.groupby('day').cycle.apply(lambda group: group.sum() / group.count())
day
6 41.366667
13 44.850000
20 45.975000
Name: cycle, dtype: float64
Although you say weighted average, I don't believe there are any weights involved. It appears to be a simple average of the cycle values for a particular day, so a simple mean should suffice.
Also, I believe the value for day 13 should be calculated as 53.9 * 1/2 + 35.8 * 1/2, which yields 44.85. The same approach applies to day 20.
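If you did have per-row weights (the data in the question has none), a weighted mean per day could be computed with numpy.average. This is only a sketch and the 'weight' column is hypothetical:
import numpy as np
# hypothetical: suppose df also had a 'weight' column
df.groupby('day').apply(lambda g: np.average(g['cycle'], weights=g['weight']))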
I have been trying to select the most similar molecule in the data below using Python. Since I'm new to Python programming, I couldn't do more than plotting. So how could we consider all factors, such as surface area, volume, and ovality, when choosing the best molecule? The most similar molecule should replicate the drug V0L in all aspects. V0L is the actual drug (the last row); the rest are the candidate molecules.
Mol Su Vol Su/Vol PSA Ov D A Mw Vina
1. 1 357.18 333.9 1.069721473 143.239 1.53 5 10 369.35 -8.3
2. 2 510.31 496.15 1.028539756 137.388 1.68 6 12 562.522 -8.8
3. 3 507.07 449.84 1.127223013 161.116 1.68 6 12 516.527 -9.0
4. 4 536.54 524.75 1.022467842 172.004 1.71 7 13 555.564 -9.8
5. 5 513.67 499.05 1.029295662 180.428 1.69 7 13 532.526 -8.9
6. 6 391.19 371.71 1.052406446 152.437 1.56 6 11 408.387 -8.9
7. 7 540.01 528.8 1.021198941 149.769 1.71 7 13 565.559 -9.4
8. 8 534.81 525.99 1.01676838 174.741 1.7 7 13 555.564 -9.3
9. 9 533.42 520.67 1.024487679 181.606 1.7 7 14 566.547 -9.7
10. 10 532.52 529.47 1.005760477 179.053 1.68 8 14 571.563 -9.4
11. 11 366.72 345.89 1.060221458 159.973 1.54 6 11 385.349 -8.2
12. 12 520.75 504.36 1.032496629 168.866 1.7 6 13 542.521 -8.7
13. 13 512.69 499 1.02743487 179.477 1.69 7 13 532.526 -8.6
14. 14 542.78 531.52 1.021184527 189.293 1.71 7 14 571.563 -9.6
15. 15 519.04 505.7 1.026379276 196.982 1.69 8 14 548.525 -8.8
16. 16 328.95 314.03 1.047511384 125.069 1.47 4 9 339.324 -6.9
17. 17 451.68 444.63 1.01585588 118.025 1.6 5 10 466.47 -9.4
18. 18 469.67 466.11 1.007637682 130.99 1.62 5 11 486.501 -8.3
19. 19 500.79 498.09 1.005420707 146.805 1.65 6 12 525.538 -9.8
20. 20 476.59 473.03 1.00752595 149.821 1.62 6 12 502.5 -8.4
21. 21 357.84 347.14 1.030823299 138.147 1.5 5 10 378.361 -8.6
22. 22 484.15 477.28 1.014394066 129.93 1.64 6 11 505.507 -10.2
23. 23 502.15 498.71 1.006897796 142.918 1.65 6 12 525.538 -9.3
24. 24 526.73 530.31 0.993249232 154.106 1.66 7 13 564.575 -9.9
25. 25 509.34 505.64 1.007317459 161.844 1.66 7 13 541.537 -9.2
26. 26 337.53 320.98 1.051560845 144.797 1.49 5 10 355.323 -7.1
27. 27 460.25 451.58 1.019199256 137.732 1.62 5 11 482.469 -9.6
28. 28 478.4 473.25 1.010882198 155.442 1.63 6 12 502.5 -8.9
29. 29 507.62 505.68 1.003836418 161.884 1.65 6 13 541.537 -9.2
30. 30 482.27 479.07 1.006679608 171.298 1.63 7 13 518.499 -9.1
31. V0L 355.19 333.42 1.065293024 59.105 1.530 0 9 345.37 -10.4
Su = Surface Area in squared angstrom
Vol = Volume in cubic angstrom
PSA = Polar Surface Area in squared angstrom
Ov = Ovality
D = Number of Hydrogen Bond Donating groups
A = Number of Hydrogen Bond Accepting groups
Vina = Binding affinity (lower is better)
Mw = Molecular Weight
Mol = The number of molecule candidate
Your question is missing an important ingredient: how do YOU define "most similar"? The answer suggesting Euclidean distance is bad because it doesn't even suggest normalizing the data. You should also obviously discard the numbering column when computing the distance.
Once you have defined your distance in some mathematical form, it's a simple matter of computing the distance between the candidate molecules and the target.
Some considerations for defining the distance measure:
I'd suggest normalizing each column in some way. Otherwise, a column with large values will dominate over those with smaller values. Popular ways of normalizing include scaling everything into the range [0, 1], or alternatively shifting and scaling so that the mean is 0 and the standard deviation is 1.
Make sure to get rid of "id"-type columns when computing your distance.
After normalization, all columns will contribute equally. How to change that depends on your measure, but the easiest way is to element-wise multiply the columns by factors to emphasize or de-emphasize them.
For the details, using pandas and/or numpy is the way to go here.
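To make those considerations concrete, here is one possible sketch, not a definitive implementation: it assumes the table above has been loaded into a DataFrame df with an identifier column 'Mol' and the drug V0L as the last row, normalizes each property to mean 0 and standard deviation 1, applies optional per-column weights, and then computes the Euclidean distance of every candidate to the target:
import numpy as np
import pandas as pd

# drop the identifier column before computing distances
features = df.drop(columns=['Mol'])

# z-score normalization: mean 0, standard deviation 1 for every column
normalized = (features - features.mean()) / features.std()

# optional per-column weights to emphasize or de-emphasize properties
weights = np.ones(normalized.shape[1])  # equal weighting by default
weighted = normalized * weights

# Euclidean distance of every candidate row to the target (last row)
target = weighted.iloc[-1]
distances = np.sqrt(((weighted.iloc[:-1] - target) ** 2).sum(axis=1))

most_similar = df.loc[distances.idxmin(), 'Mol']
The weights array and the column name are assumptions; adjust them to whatever weighting scheme you settle on.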
In order to find the most similar molecule we can compute the Euclidean distance between every row and the last one, and pick the row with the minimal distance:
# make the last row (V0L) a new dataframe named `df1`
df1 = df[30:31]
# and put the candidate rows in another dataframe (excluding V0L itself):
df2 = df[0:30]
Then use the scipy.spatial package:
import scipy.spatial
ary = scipy.spatial.distance.cdist(df2, df1, metric='euclidean')
df2[ary==ary.min()]
Output
This output was produced using the dataframe from an earlier revision of the question:
Molecule SurfaceAr Volume PSA Ovality HBD HBA Mw Vina BA Su/Vol
15 RiboseGly 1.047511 314.03 125.069 1.47 4 9 339.324 -6.9 0.003336
I have a csv file with days of the year in one column and temperature in another. The days are split into sections and I want to find the average temperature over each day, e.g. day 0, 1, 2, 3, etc.
The temperature measurements were taken irregularly, meaning there is a different number of measurements for each day.
Typically I would use df.groupby(np.arange(len(df)) // n).mean(), but n, the number of rows per group, varies in this case.
I have an example of what the data is like.
Days   Temp
0.75   19
0.8    18
1.2    18
1.25   18
1.75   19
3.05   18
3.55   21
3.60   21
3.9    18
4.5    20
You could convert Days to an integer and use that to group.
>>> df.groupby(df["Days"].astype(int)).mean()
Days Temp
Days
0 0.775 18.500000
1 1.400 18.333333
3 3.525 19.500000
4 4.500 20.000000
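If you only want the temperature (the output above also includes the mean of the Days values themselves), you can select the column before aggregating; this is a small variant of the same idea:
>>> df.groupby(df["Days"].astype(int))["Temp"].mean()
Days
0    18.500000
1    18.333333
3    19.500000
4    20.000000
Name: Temp, dtype: float64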
I'm trying to calculate the amount of interest that would've accrued during a period of time. I have the starting DataFrame as below
MONTH_BEG_D NO_OF_DAYS RATE
1/10/2017 31 5.22
1/11/2017 30 5.22
1/12/2017 31 5.22
1/1/2018 31 3.5
1/2/2018 28 3.5
1/3/2018 31 3.5
If the starting value is 20, I would like the outcome to be:
FORMULA: INTEREST = (PRINCIPAL_A * RATE * NO_OF_DAYS) / 36500
PRINCIPAL_A MONTH_BEG_D RATE NO_OF_DAYS INTEREST NEW_BALANCE
20 1/10/2017 5.22 31 0.08866849 20.08866849
20.08866849 1/11/2017 5.22 30 0.08618864 20.17485713
20.17485713 1/12/2017 5.22 31 0.08944371 20.26430084
20.26430084 1/1/2018 3.5 31 0.06023772 20.32453856
20.32453856 1/2/2018 3.5 28 0.05456999 20.37910855
Just to explain, the 36500 comes from the 365 days of the year (for NO_OF_DAYS) combined with the 0.01 multiplier for RATE. I can easily add/modify columns for these two variables, so this is no problem. My problem lies in how I can carry the NEW_BALANCE over as the next month's PRINCIPAL_A.
This is basically a cumprod between each column with a cumsum between each row. Is there an easier way of doing this while avoiding loops?
There you go, not the cleanest solution but it does what you require!
# Ensure numpy is imported.
import numpy as np
# Increase display precision.
pd.options.display.precision = 7
# Insert a PRINCIPAL_A column filled with NaN; the initial value is set next.
df.insert(0, 'PRINCIPAL_A', np.nan)
# Set the initial value of 20 in the first row only.
df.iloc[0, 0] = 20
# Loop over the dataframe to roll the balance over month by month.
for row in range(1, len(df)):
    df['INTEREST'] = (df['PRINCIPAL_A'] * df['RATE'] * df['NO_OF_DAYS']) / 36500
    df['NEW_BALANCE'] = df['PRINCIPAL_A'] + df['INTEREST']
    # Roll the previous month's NEW_BALANCE into this month's PRINCIPAL_A
    # (explicit .iloc positions avoid chained-assignment problems).
    df.iloc[row, 0] = df['NEW_BALANCE'].iloc[row - 1]
Here is the output (slightly different from yours due to rounding):
PRINCIPAL_A MONTH_BEG_D NO_OF_DAYS RATE INTEREST NEW_BALANCE
0 20.0000000 01/10/2017 31 5.22 0.0886685 20.0886685
1 20.0886685 01/11/2017 30 5.22 0.0861886 20.1748571
2 20.1748571 01/12/2017 31 5.22 0.0894437 20.2643008
3 20.2643008 01/01/2018 31 3.50 0.0602377 20.3245386
4 20.3245386 01/02/2018 28 3.50 0.0545700 20.3791086
5 20.3791086 01/03/2018 31 3.50 NaN NaN
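Since the balance simply compounds each month, a loop-free alternative is also possible. The sketch below is not part of the answer above; it assumes the original columns (NO_OF_DAYS, RATE) and a starting principal of 20, and uses cumprod on the per-month growth factors:
principal = 20.0
# growth factor for each month: 1 + RATE * NO_OF_DAYS / 36500
growth = 1 + df['RATE'] * df['NO_OF_DAYS'] / 36500
df['NEW_BALANCE'] = principal * growth.cumprod()
df['PRINCIPAL_A'] = df['NEW_BALANCE'].shift(1).fillna(principal)
df['INTEREST'] = df['NEW_BALANCE'] - df['PRINCIPAL_A']
This reproduces the rolled-over balances without an explicit Python loop.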
Below is a sample dataframe which is similar to mine except the one I am working on has 200,000 data points.
import pandas as pd
import numpy as np
df=pd.DataFrame([
[10.07,5], [10.24,5], [12.85,5], [11.85,5],
[11.10,5], [14.56,5], [14.43,5], [14.85,5],
[14.95,5], [10.41,5], [15.20,5], [15.47,5],
[15.40,5], [15.31,5], [15.43,5], [15.65,5]
], columns=['speed','delta_t'])
df
speed delta_t
0 10.07 5
1 10.24 5
2 12.85 5
3 11.85 5
4 11.10 5
5 14.56 5
6 14.43 5
7 14.85 5
8 14.95 5
9 10.41 5
10 15.20 5
11 15.47 5
12 15.40 5
13 15.31 5
14 15.43 5
15 15.65 5
std_dev = df.iloc[0:3,0].std() # this will give 1.55
print(std_dev)
I have 2 columns, 'speed' and 'delta_t'. delta_t is the difference in time between subsequent rows in my actual data (which has date and time). The operating speed keeps varying, and what I want to achieve is to extract all data points where the speed is nearly steady, say by filtering for a standard deviation of < 0.5 and delta_t >= 15 min. For example, starting from the first speed, the code should keep moving to the next speeds, keep calculating the standard deviation, and if it is less than 0.5 and the delta_t sums up to 30 min or more, I should copy that data into a new dataframe.
So for this dataframe I will be left with index 5 to 8 and 10 to 15.
Is this possible? Could you please give me some suggestion on how to do it? Sorry, I am stuck; it seems too complicated to me.
Thank you.
Best Regards Arun
Let's use rolling, shift, and std:
Calculate the rolling std for a window of 3, then find those stds less than 0.5 and use shift(-2) to also pick up the values at the start of each window whose std was less than 0.5. Using boolean indexing with | (or), we can get the entire steady-state range.
df_std = df['speed'].rolling(3).std()
df_ss = df[(df_std < 0.5) | (df_std < 0.5).shift(-2)]
df_ss
Output:
speed delta_t
5 14.56 5
6 14.43 5
7 14.85 5
8 14.95 5
10 15.20 5
11 15.47 5
12 15.40 5
13 15.31 5
14 15.43 5
15 15.65 5
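If delta_t were not a constant 5 minutes, one possible generalisation (a sketch under that assumption, not part of the answer above) is to build timestamps from the cumulative elapsed minutes and use a time-based rolling window:
import pandas as pd
ts = df.copy()
# hypothetical timestamps built from the cumulative elapsed minutes
ts.index = pd.Timestamp('2023-01-01') + pd.to_timedelta(ts['delta_t'].cumsum(), unit='min')
# rolling std over the trailing 15-minute window ending at each row
df_std = ts['speed'].rolling('15min').std()
# keep rows whose trailing window is steady; as in the answer above, the mask
# would still need to be shifted to also cover the start of each window
steady = ts[df_std < 0.5]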
I have some experimental data collected from a number of samples at set time intervals, in a dataframe organised like so:
Studynumber Time Concentration
1 20 80
1 40 60
1 60 40
2 15 95
2 44 70
2 65 30
Although the time intervals are supposed to be fixed, there is some variation in the data based on when they were actually collected. I want to create bins of the Time column, calculate an 'average' concentration, and then compare the difference between actual concentration and average concentration for each studynumber, at each time.
To do this, I created a column called 'roundtime', then used a groupby to calculate the mean:
data['roundtime']=data['Time'].round(decimals=-1)
meanconc = data.groupby('roundtime')['Concentration'].mean()
This gives a pandas series of the mean concentrations, with roundtime as the index. Then I want to get this back into the main frame to calculate the difference between each actual concentration and the mean concentration:
data['meanconcentration']=meanconc.loc[data['roundtime']].reset_index()['Concentration']
This works for the first 60 or so values, but then returns NaN for each entry, I think because the index of data is longer than the index of meanconc.
On the one hand, this looks like an indexing issue - equally, it could be that I'm just approaching this the wrong way. So my question is: a) can this method work? and b) is there another/better way of doing it? All advice welcome!
Use transform to add a column from a groupby aggregation; this creates a Series with its index aligned to the original df, so you can assign it back correctly:
In [4]:
df['meanconcentration'] = df.groupby('roundtime')['Concentration'].transform('mean')
df
Out[4]:
Studynumber Time Concentration roundtime meanconcentration
0 1 20 80 20 87.5
1 1 40 60 40 65.0
2 1 60 40 60 35.0
3 2 15 95 20 87.5
4 2 44 70 40 65.0
5 2 65 30 60 35.0
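The difference between each actual concentration and its bin average, which was the original goal, is then a simple column subtraction (the column name 'difference' is just an example):
df['difference'] = df['Concentration'] - df['meanconcentration']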