I have a CSV file with days of the year in one column and temperature in another. The days are split into sections, and I want to find the average temperature over each day, e.g. day 0, 1, 2, 3, etc.
The temperature measurements were taken irregularly, meaning there are different numbers of measurements at certain times for each day.
Typically I would use df.groupby(np.arange(len(df)) // n).mean(), but n, the number of rows per group, varies in this case.
I have an example of what the data is like.
Days   Temp
0.75   19
0.8    18
1.2    18
1.25   18
1.75   19
3.05   18
3.55   21
3.60   21
3.9    18
4.5    20
You could convert Days to an integer and use that to group.
>>> df.groupby(df["Days"].astype(int)).mean()
Days Temp
Days
0 0.775 18.500000
1 1.400 18.333333
3 3.525 19.500000
4 4.500 20.000000
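For completeness, a minimal runnable sketch of this suggestion, rebuilding the example data from the question:
import pandas as pd

# the example data from the question
df = pd.DataFrame({
    'Days': [0.75, 0.8, 1.2, 1.25, 1.75, 3.05, 3.55, 3.60, 3.9, 4.5],
    'Temp': [19, 18, 18, 18, 19, 18, 21, 21, 18, 20],
})

# group on the whole-day part of Days and average each group
daily_means = df.groupby(df['Days'].astype(int)).mean()
print(daily_means)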
Related
I have been trying to select the most similar molecule in the data below using Python. Since I'm new to Python programming, I couldn't do more than plotting. How could we consider all factors, such as surface area, volume, and ovality, when choosing the best molecule? The most similar molecule should replicate the drug V0L in all respects. V0L is the actual drug (the last row); the rest are the candidate molecules.
Mol Su Vol Su/Vol PSA Ov D A Mw Vina
1. 1 357.18 333.9 1.069721473 143.239 1.53 5 10 369.35 -8.3
2. 2 510.31 496.15 1.028539756 137.388 1.68 6 12 562.522 -8.8
3. 3 507.07 449.84 1.127223013 161.116 1.68 6 12 516.527 -9.0
4. 4 536.54 524.75 1.022467842 172.004 1.71 7 13 555.564 -9.8
5. 5 513.67 499.05 1.029295662 180.428 1.69 7 13 532.526 -8.9
6. 6 391.19 371.71 1.052406446 152.437 1.56 6 11 408.387 -8.9
7. 7 540.01 528.8 1.021198941 149.769 1.71 7 13 565.559 -9.4
8. 8 534.81 525.99 1.01676838 174.741 1.7 7 13 555.564 -9.3
9. 9 533.42 520.67 1.024487679 181.606 1.7 7 14 566.547 -9.7
10. 10 532.52 529.47 1.005760477 179.053 1.68 8 14 571.563 -9.4
11. 11 366.72 345.89 1.060221458 159.973 1.54 6 11 385.349 -8.2
12. 12 520.75 504.36 1.032496629 168.866 1.7 6 13 542.521 -8.7
13. 13 512.69 499 1.02743487 179.477 1.69 7 13 532.526 -8.6
14. 14 542.78 531.52 1.021184527 189.293 1.71 7 14 571.563 -9.6
15. 15 519.04 505.7 1.026379276 196.982 1.69 8 14 548.525 -8.8
16. 16 328.95 314.03 1.047511384 125.069 1.47 4 9 339.324 -6.9
17. 17 451.68 444.63 1.01585588 118.025 1.6 5 10 466.47 -9.4
18. 18 469.67 466.11 1.007637682 130.99 1.62 5 11 486.501 -8.3
19. 19 500.79 498.09 1.005420707 146.805 1.65 6 12 525.538 -9.8
20. 20 476.59 473.03 1.00752595 149.821 1.62 6 12 502.5 -8.4
21. 21 357.84 347.14 1.030823299 138.147 1.5 5 10 378.361 -8.6
22. 22 484.15 477.28 1.014394066 129.93 1.64 6 11 505.507 -10.2
23. 23 502.15 498.71 1.006897796 142.918 1.65 6 12 525.538 -9.3
24. 24 526.73 530.31 0.993249232 154.106 1.66 7 13 564.575 -9.9
25. 25 509.34 505.64 1.007317459 161.844 1.66 7 13 541.537 -9.2
26. 26 337.53 320.98 1.051560845 144.797 1.49 5 10 355.323 -7.1
27. 27 460.25 451.58 1.019199256 137.732 1.62 5 11 482.469 -9.6
28. 28 478.4 473.25 1.010882198 155.442 1.63 6 12 502.5 -8.9
29. 29 507.62 505.68 1.003836418 161.884 1.65 6 13 541.537 -9.2
30. 30 482.27 479.07 1.006679608 171.298 1.63 7 13 518.499 -9.1
31.V0L 355.19 333.42 1.065293024 59.105 1.530 0 9 345.37 -10.4
Su = Surface Area in squared angstrom
Vol = Volume in cubic angstrom
PSA = Polar Surface Area in squared angstrom
Ov = Ovality
D = Number of Hydrogen Bond Donating groups
A = Number of Hydrogen Bond Accepting groups
Vina = Binding affinity (lower is better)
Mw = Molecular Weight
Mol = The number of molecule candidate
Your question is missing an important ingredient: How do YOU define "most similar"? The answer suggesting Euclidean distance is bad because it doesn't even suggest normalizing the data. You should also, obviously, discard the numbering column when computing the distance.
Once you have defined your distance in some mathematical form, it's a simple matter of computing the distance between the candidate molecules and the target.
Some considerations for defining the distance measure:
I'd suggest normalizing each column in some way. Otherwise, a column with large values will dominate over those with smaller values. Popular ways of normalizing include scaling everything into the range [0, 1], or shifting and scaling so that each column has mean 0 and standard deviation 1.
Make sure to get rid of "id"-type columns when computing your distance.
After normalization, all columns will contribute equal weight. How to change that depends on your measure, but the easiest way is to element-wise multiply the columns by factors that emphasize or de-emphasize them.
For the details, using pandas and/or numpy is the way to go here.
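A minimal sketch of that idea, assuming df holds only the numeric columns of the table above (Su, Vol, Su/Vol, PSA, Ov, D, A, Mw, Vina) with the Mol identifier dropped and the target drug V0L as the last row; the most_similar helper and the optional per-column weights Series are illustrative, not an established API:
import pandas as pd

def most_similar(df, weights=None):
    # z-score each column so that no single column dominates the distance
    z = (df - df.mean()) / df.std()
    if weights is not None:
        z = z * weights  # optional per-column Series to emphasize/de-emphasize features
    target = z.iloc[-1]       # V0L, the last row
    candidates = z.iloc[:-1]  # all candidate molecules
    # Euclidean distance of every candidate to the target
    dist = ((candidates - target) ** 2).sum(axis=1) ** 0.5
    return dist.idxmin(), dist.sort_values()

best, ranked = most_similar(df)
print(best)            # index of the closest candidate
print(ranked.head())   # the few closest candidates, nearest first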
To find the most similar molecule, we can compute the Euclidean distance between every row and the last one, and pick the row with the minimal distance:
# put the last row (the target drug V0L) in its own dataframe named `df1`
df1 = df[30:31]
# and the candidate molecules (every row except the last) in another dataframe
df2 = df[0:30]
Then use the scipy.spatial package (identifier and other non-numeric columns should be dropped before computing distances):
import scipy.spatial

# pairwise Euclidean distances between each candidate row and the target row
ary = scipy.spatial.distance.cdist(df2, df1, metric='euclidean')

# the candidate with the smallest distance to V0L
df2[ary == ary.min()]
Output
This output was produced with the previous dataframe, before the latest edits to the question:
Molecule SurfaceAr Volume PSA Ovality HBD HBA Mw Vina BA Su/Vol
15 RiboseGly 1.047511 314.03 125.069 1.47 4 9 339.324 -6.9 0.003336
I tried to sort the values of a particular column in a data frame. The values get sorted, but the index values do not change. I want the index values to change as well, according to the sorted data.
rld=pd.read_excel(r"C:\Users\DELL\nagrajun sagar reservoir data - Copy.xlsx")
rl = rld.iloc[:,1].sort_values()
rl
output:
15 0.043
3 0.370
17 0.391
2 0.823
16 1.105
1 1.579
0 2.070
12 2.235
4 2.728
18 4.490
9 4.905
13 5.036
14 5.074
11 6.481
10 6.613
6 6.806
7 6.807
8 6.824
5 6.841
Name: 2 October, dtype: float64
rl[0]
output:
2.07
I expected rl[0] to be 0.043, but the actual result is 2.07, which is the value at index 0 of the list before sorting...
I suppose you can try reset_index() with drop=True.
Something like rl = rl.reset_index(drop=True) in your case, or you can do it while sorting:
rl = rld.iloc[:,1].sort_values().reset_index(drop=True)
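A quick sketch of the effect, using a few of the values from the output above:
import pandas as pd

s = pd.Series([2.070, 1.579, 0.043], index=[0, 1, 15])

s_sorted = s.sort_values()                         # keeps the original labels: 15, 1, 0
s_reset = s.sort_values().reset_index(drop=True)   # relabels positions as 0, 1, 2

print(s_sorted[0])  # 2.07  -> [0] is label-based, so it still finds the old label 0
print(s_reset[0])   # 0.043 -> after the reset, label 0 is the smallest value

Alternatively, positional access such as rl.iloc[0] returns the first element of the sorted result without resetting the index.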
Below is a sample dataframe which is similar to mine except the one I am working on has 200,000 data points.
import pandas as pd
import numpy as np
df=pd.DataFrame([
[10.07,5], [10.24,5], [12.85,5], [11.85,5],
[11.10,5], [14.56,5], [14.43,5], [14.85,5],
[14.95,5], [10.41,5], [15.20,5], [15.47,5],
[15.40,5], [15.31,5], [15.43,5], [15.65,5]
], columns=['speed','delta_t'])
df
speed delta_t
0 10.07 5
1 10.24 5
2 12.85 5
3 11.85 5
4 11.10 5
5 14.56 5
6 14.43 5
7 14.85 5
8 14.95 5
9 10.41 5
10 15.20 5
11 15.47 5
12 15.40 5
13 15.31 5
14 15.43 5
15 15.65 5
std_dev = df.iloc[0:3,0].std() # this will give 1.55
print(std_dev)
I have 2 columns, 'speed' and 'delta_t'. delta_t is the difference in time between subsequent rows in my actual data (which has date and time). The operating speed keeps varying, and what I want to achieve is to filter out all data points where the speed is nearly steady, say by filtering for a standard deviation of < 0.5 and delta_t >= 15 min. For example, starting with the first speed, the code should keep jumping to the next speeds, keep calculating the standard deviation, and if it is less than 0.5 and the delta_t sums up to 30 min or more, I should copy that data into a new dataframe.
So for this dataframe I will be left with index 5 to 8 and 10 to 15.
Is this possible? Could you please give me some suggestions on how to do it? Sorry, I am stuck; it seems too complicated to me.
Thank you.
Best Regards Arun
Let's use rolling, shift, and std:
Calculate the rolling std for a window of 3, then find the stds that are less than 0.5, and use shift(-2) to also pick up the values at the start of each window whose std was less than 0.5. Using boolean indexing with | (or), we get the entire steady-state range.
df_std = df['speed'].rolling(3).std()
df_ss = df[(df_std < 0.5) | (df_std < 0.5).shift(-2)]
df_ss
Output:
speed delta_t
5 14.56 5
6 14.43 5
7 14.85 5
8 14.95 5
10 15.20 5
11 15.47 5
12 15.40 5
13 15.31 5
14 15.43 5
15 15.65 5
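The rolling/shift filter above does not yet enforce the duration condition from the question. A hedged sketch of one way to add it, assuming the steady stretches are contiguous and taking the minimum duration as a parameter (15 here, which reproduces the expected rows 5 to 8 and 10 to 15):
# label contiguous steady runs and keep only those whose delta_t adds up to enough time
min_minutes = 15  # assumed threshold; the question mentions both 15 and 30 minutes

low_std = df['speed'].rolling(3).std() < 0.5
steady = low_std | low_std.shift(-2, fill_value=False)   # same mask as above

run_id = (steady != steady.shift()).cumsum()             # consecutive True/False blocks
run_duration = df.groupby(run_id)['delta_t'].transform('sum')

df_ss = df[steady & (run_duration >= min_minutes)]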
I would like to take a weighted average of "cycle" using "day" as the window. The window is not always the same size. How do I compute a weighted average in pandas?
In [3]: data = {'cycle':[34.1, 41, 49.0, 53.9, 35.8, 49.3, 38.6, 51.2, 44.8],
'day':[6,6,6,13,13,20,20,20,20]}
In [4]: df = pd.DataFrame(data, index=np.arange(9), columns = ['cycle', 'day'])
In [5]: df
Out[5]:
cycle day
0 34.1 6
1 41.0 6
2 49.0 6
3 53.9 13
4 35.8 13
5 49.3 20
6 38.6 20
7 51.2 20
8 44.8 20
I would expect three values (if I have done this correctly):
34.1 * 1/3 + 41 * 1/3 + 49 * 1/3 = 41.36
cycle day
41.36 6
6.90 13
45.90 20
If I'm understanding correctly, I think you just want:
df.groupby(['day']).mean()
Group on day, and then apply a lambda function that calculates the sum of the group and divides it by the number of non-null values within the group.
>>> df.groupby('day').cycle.apply(lambda group: group.sum() / group.count())
day
6 41.366667
13 44.850000
20 45.975000
Name: cycle, dtype: float64
Although you say weighted average, I don't believe there are any weights involved. It appears to be a simple average of the cycle values for a particular day, so a simple mean should suffice.
Also, I believe the value for day 13 should be calculated as 53.9 * 1/2 + 35.8 * 1/2, which yields 44.85. The same approach applies for day 20.
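If a genuine weights column did exist, a hedged sketch of a per-day weighted average could look like this (the 'w' column below is hypothetical; with equal weights it reproduces the plain mean):
import pandas as pd

data = {'cycle': [34.1, 41.0, 49.0, 53.9, 35.8, 49.3, 38.6, 51.2, 44.8],
        'day':   [6, 6, 6, 13, 13, 20, 20, 20, 20]}
df = pd.DataFrame(data)
df['w'] = 1  # hypothetical weights column

# weighted sum per day divided by the sum of weights per day
weighted = (df['cycle'] * df['w']).groupby(df['day']).sum() / df.groupby('day')['w'].sum()
print(weighted)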
I have some experimental data collected from a number of samples at set time intervals, in a dataframe organised like so:
Studynumber Time Concentration
1 20 80
1 40 60
1 60 40
2 15 95
2 44 70
2 65 30
Although the time intervals are supposed to be fixed, there is some variation in the data based on when they were actually collected. I want to create bins of the Time column, calculate an 'average' concentration, and then compare the difference between actual concentration and average concentration for each studynumber, at each time.
To do this, I created a column called 'roundtime', then used a groupby to calculate the mean:
data['roundtime']=data['Time'].round(decimals=-1)
meanconc = data.groupby('roundtime')['Concentration'].mean()
This gives a pandas series of the mean concentrations, with roundtime as the index. Then I want to get this back into the main frame to calculate the difference between each actual concentration and the mean concentration:
data['meanconcentration']=meanconc.loc[data['roundtime']].reset_index()['Concentration']
This works for the first 60 or so values, but then returns NaN for each entry, I think because the index of data is longer than the index of meanconcentration.
On the one hand, this looks like an indexing issue - equally, it could be that I'm just approaching this the wrong way. So my question is: a) can this method work? and b) is there another/better way of doing it? All advice welcome!
Use transform to add a column from a groupby aggregation; this will create a Series with its index aligned to the original df, so you can assign it back correctly:
In [4]:
df['meanconcentration'] = df.groupby('roundtime')['Concentration'].transform('mean')
df
Out[4]:
Studynumber Time Concentration roundtime meanconcentration
0 1 20 80 20 87.5
1 1 40 60 40 65.0
2 1 60 40 60 35.0
3 2 15 95 20 87.5
4 2 44 70 40 65.0
5 2 65 30 60 35.0
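With the aligned column in place, the per-row deviation the question asks for is then a plain subtraction (the 'difference' column name is just illustrative):
df['difference'] = df['Concentration'] - df['meanconcentration']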