I am trying to do spline interpolation between two arrays in Python. My data set looks like this:
       |    5       15
-------+--------------------
   1   |  32.68   29.16
   2   |  32.73   27.20
   3   |  32.78   28.24
   4   |  32.83   27.27
   5   |  32.88   25.27
   6   |  32.93   31.35
   7   |  32.98   27.39
   8   |  33.03   26.42
   9   |  33.08   27.46
  10   |  33.13   30.50
  11   |  33.18   27.53
  12   |  33.23   29.57
  13   |  33.23   27.99
  14   |  33.23   28.64
  15   |  33.23   26.68
  16   |  33.23   29.72
And I am trying to do a spline interpolation between the two columns (5 and 15) to produce the values at 10, something that will eventually look like this (but spline interpolated):
     |   10
-----+----------
   1 |  30.92
   2 |  29.965
   3 |  30.51
   4 |  30.05
   5 |  29.075
   6 |  32.14
   7 |  30.185
   8 |  29.725
   9 |  30.27
  10 |  31.815
  11 |  30.355
  12 |  31.4
  13 |  30.61
  14 |  30.935
  15 |  29.955
  16 |  31.475
I have been looking at examples of using scipy.interpolate.InterpolatedUnivariateSpline, but it seems to take only one array for x and one for y, and I can't figure out how to make it interpolate these two arrays.
Can someone please help point me in the right direction?
With the amount of data you have (only two points for each x value), piecewise linear interpolation is the most practical tool. Taking your two arrays to be v5 and v15 (values along the y=5 and y=15 lines), and the x-values to be 1, 2, ..., 16, we can create a piecewise linear interpolant like this:
import numpy as np
from scipy.interpolate import interp2d

# rows of the z-array correspond to y=5 and y=15
f = interp2d(np.arange(1, 17), [5, 15], np.stack((v5, v15)), kind='linear')
This can be evaluated in the usual way: for example, f(np.arange(1, 17), 10) returns precisely the numbers you expected.
[ 30.92 , 29.965, 30.51 , 30.05 , 29.075, 32.14 , 30.185,
29.725, 30.27 , 31.815, 30.355, 31.4 , 30.61 , 30.935,
29.955, 31.475]
interp2d can also create a cubic bivariate spline, but not from this data: a cubic spline needs at least four points along each axis, and your grid has only two in the y-direction.
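For completeness, here is a self-contained version of the above (a sketch; the two arrays are just the columns from your table typed in):

import numpy as np
from scipy.interpolate import interp2d

# the two measured columns from the table in the question
v5 = np.array([32.68, 32.73, 32.78, 32.83, 32.88, 32.93, 32.98, 33.03,
               33.08, 33.13, 33.18, 33.23, 33.23, 33.23, 33.23, 33.23])
v15 = np.array([29.16, 27.20, 28.24, 27.27, 25.27, 31.35, 27.39, 26.42,
                27.46, 30.50, 27.53, 29.57, 27.99, 28.64, 26.68, 29.72])

x = np.arange(1, 17)                # row indices 1..16
f = interp2d(x, [5, 15], np.stack((v5, v15)), kind='linear')
print(f(x, 10))                     # interpolated values at y=10

Newer SciPy releases deprecate interp2d; since y=10 lies exactly halfway between y=5 and y=15, the linear result is simply the element-wise average (v5 + v15) / 2, which you can compute directly if you ever need to drop interp2d.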
I have been trying to select the most similar molecule in the data below using Python. Since I'm new to Python programming, I couldn't do more than plotting. So how could we consider all factors, such as surface area, volume, and ovality, for choosing the best molecule? The most similar molecule should replicate the drug V0L in all aspects. V0L IS THE ACTUAL DRUG (the last row); the rest are candidate molecules.
Mol Su Vol Su/Vol PSA Ov D A Mw Vina
1. 1 357.18 333.9 1.069721473 143.239 1.53 5 10 369.35 -8.3
2. 2 510.31 496.15 1.028539756 137.388 1.68 6 12 562.522 -8.8
3. 3 507.07 449.84 1.127223013 161.116 1.68 6 12 516.527 -9.0
4. 4 536.54 524.75 1.022467842 172.004 1.71 7 13 555.564 -9.8
5. 5 513.67 499.05 1.029295662 180.428 1.69 7 13 532.526 -8.9
6. 6 391.19 371.71 1.052406446 152.437 1.56 6 11 408.387 -8.9
7. 7 540.01 528.8 1.021198941 149.769 1.71 7 13 565.559 -9.4
8. 8 534.81 525.99 1.01676838 174.741 1.7 7 13 555.564 -9.3
9. 9 533.42 520.67 1.024487679 181.606 1.7 7 14 566.547 -9.7
10. 10 532.52 529.47 1.005760477 179.053 1.68 8 14 571.563 -9.4
11. 11 366.72 345.89 1.060221458 159.973 1.54 6 11 385.349 -8.2
12. 12 520.75 504.36 1.032496629 168.866 1.7 6 13 542.521 -8.7
13. 13 512.69 499 1.02743487 179.477 1.69 7 13 532.526 -8.6
14. 14 542.78 531.52 1.021184527 189.293 1.71 7 14 571.563 -9.6
15. 15 519.04 505.7 1.026379276 196.982 1.69 8 14 548.525 -8.8
16. 16 328.95 314.03 1.047511384 125.069 1.47 4 9 339.324 -6.9
17. 17 451.68 444.63 1.01585588 118.025 1.6 5 10 466.47 -9.4
18. 18 469.67 466.11 1.007637682 130.99 1.62 5 11 486.501 -8.3
19. 19 500.79 498.09 1.005420707 146.805 1.65 6 12 525.538 -9.8
20. 20 476.59 473.03 1.00752595 149.821 1.62 6 12 502.5 -8.4
21. 21 357.84 347.14 1.030823299 138.147 1.5 5 10 378.361 -8.6
22. 22 484.15 477.28 1.014394066 129.93 1.64 6 11 505.507 -10.2
23. 23 502.15 498.71 1.006897796 142.918 1.65 6 12 525.538 -9.3
24. 24 526.73 530.31 0.993249232 154.106 1.66 7 13 564.575 -9.9
25. 25 509.34 505.64 1.007317459 161.844 1.66 7 13 541.537 -9.2
26. 26 337.53 320.98 1.051560845 144.797 1.49 5 10 355.323 -7.1
27. 27 460.25 451.58 1.019199256 137.732 1.62 5 11 482.469 -9.6
28. 28 478.4 473.25 1.010882198 155.442 1.63 6 12 502.5 -8.9
29. 29 507.62 505.68 1.003836418 161.884 1.65 6 13 541.537 -9.2
30. 30 482.27 479.07 1.006679608 171.298 1.63 7 13 518.499 -9.1
31.V0L 355.19 333.42 1.065293024 59.105 1.530 0 9 345.37 -10.4
Su = Surface Area in squared angstrom
Vol = Volume in cubic angstrom
PSA = Polar Surface Area in squared angstrom
Ov = Ovality
D = Number of Hydrogen Bond Donating groups
A = Number of Hydrogen Bond Accepting groups
Vina = Binding affinity (lower is better)
Mw = Molecular Weight
Mol = The number of molecule candidate
Your question is missing an important ingredient: how do YOU define "most similar"? The answer suggesting Euclidean distance is bad because it doesn't even suggest normalizing the data. You should also obviously discard the numbering column when computing the distance.
Once you have defined your distance in some mathematical form, it's a simple matter of computing the distance between the candidate molecules and the target.
Some considerations for defining the distance measure:
- I'd suggest normalizing each column in some way; otherwise, a column with large values will dominate those with smaller values. Popular ways of normalizing include scaling everything into the range [0, 1], or shifting and scaling so that each column has mean 0 and standard deviation 1.
- Make sure to get rid of "id"-type columns when computing your distance.
- After normalization, all columns will truly contribute the same weight. The way to change that depends on your measure, but the easiest option is to element-wise multiply the columns by factors that emphasize or de-emphasize them.
For the details, using pandas and/or numpy is the way to go here.
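For illustration, a minimal sketch of that recipe (assuming the table above is loaded into a pandas dataframe df with columns named as in the header and the drug V0L as the last row; the weights are left at 1 here):

import pandas as pd

# drop identifier-like columns before measuring similarity
features = df.drop(columns=['Mol'])
# z-score normalization: each column gets mean 0 and standard deviation 1
normed = (features - features.mean()) / features.std()
# optional per-column weights to emphasize or de-emphasize factors
weights = pd.Series(1.0, index=normed.columns)
target = normed.iloc[-1]          # the drug V0L
candidates = normed.iloc[:-1]     # all candidate molecules
dist = (((candidates - target) * weights) ** 2).sum(axis=1) ** 0.5
print(dist.idxmin(), dist.min())  # index and distance of the most similar candidate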
In order to find the most similar molecule, we can use the Euclidean distance between all rows and the last one, and pick the row with the minimal distance value:
# make the last row as a new dataframe named `df1`
df1 = df[30:31]
# and the candidate rows (everything except the drug) in another dataframe:
df2 = df[0:30]
Then use the scipy.spatial package:
import scipy.spatial
# pairwise Euclidean distances between every candidate row and the drug row
ary = scipy.spatial.distance.cdist(df2, df1, metric='euclidean')
# flatten the (n, 1) mask to select the candidate(s) at minimal distance
df2[(ary == ary.min()).ravel()]
Output
This output was produced with the earlier version of the dataframe, before the question was edited:
Molecule SurfaceAr Volume PSA Ovality HBD HBA Mw Vina BA Su/Vol
15 RiboseGly 1.047511 314.03 125.069 1.47 4 9 339.324 -6.9 0.003336
I have to generate only the positive part of a sine curve between two values. The idea is that my variable, say monthly-averaged RH, which has 12 data points in a year (i.e. a time series), varies between 50 and 70 in a sinusoidal way. The first and the last data points end at 50.
Can anyone help how I can generate this curve/function for the curve to get values of all intermediate data points? I am trying to use numpy/scipy for this.
Best,
Debayan
This is basic trig.
import math

for i in range(12):
    print(i, 50 + 20 * math.sin(math.pi * i / 12))
Output:
0 50.0
1 55.17638090205041
2 60.0
3 64.14213562373095
4 67.32050807568876
5 69.31851652578136
6 70.0
7 69.31851652578136
8 67.32050807568878
9 64.14213562373095
10 60.0
11 55.17638090205042
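Since the question mentions numpy, the same values can also be produced in vectorized form (a sketch):

import numpy as np

i = np.arange(12)
rh = 50 + 20 * np.sin(np.pi * i / 12)    # same values as the loop above
# rh = 50 + 20 * np.sin(np.pi * i / 11)  # variant if month 11 must return to exactly 50
print(rh)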
I am new to Python. I need to obtain the ROC curve from two columns in my pandas data frame; any solution or recommendation?
I need to use this formula:
x = (1-dfpercentiles['acum_0%'])
y = (1-dfpercentiles['acum_1%'])
I tried using sklearn and matplotlib, but I didn't find a solution.
This is my DF:
In [109]: dfpercentiles['acum_0%']
Out[110]:
0 10.89
1 22.93
2 33.40
3 44.83
4 55.97
5 67.31
6 78.15
7 87.52
8 95.61
9 100.00
Name: acum_0%, dtype: float64
and
In [111]:dfpercentiles['acum_1%']
Out[112]:
0 2.06
1 5.36
2 8.30
3 13.49
4 18.98
5 23.89
6 29.72
7 42.87
8 62.31
9 100.00
Name: acum_1%, dtype: float64
This seems to be a matplotlib question.
Before anything else, your percentiles are in the range 0-100 but your adjustment is 1 - percentile_value, so you need to rescale your values to 0-1.
I just used pyplot.plot to generate the ROC curve:
import matplotlib.pyplot as plt
plt.plot([1-(x/100) for x in [10.89, 22.93, 33.40, 44.83, 55.97, 67.31, 78.15, 87.52, 95.61, 100.00]],
[1-(x/100) for x in [2.06, 5.36, 8.30, 13.49, 18.98, 23.89, 29.72, 42.87, 62.31, 100.0]])
Using your dataframe, it would be:
plt.plot(1 - dfpercentiles['acum_0%'] / 100, 1 - dfpercentiles['acum_1%'] / 100)
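A slightly fuller sketch of the same plot, with axis labels and a reference diagonal added (assuming dfpercentiles is defined as shown above):

import matplotlib.pyplot as plt

x = 1 - dfpercentiles['acum_0%'] / 100
y = 1 - dfpercentiles['acum_1%'] / 100
plt.plot(x, y, marker='o', label='ROC')
plt.plot([0, 1], [0, 1], linestyle='--', label='chance')  # reference diagonal
plt.xlabel('1 - acum_0%')
plt.ylabel('1 - acum_1%')
plt.legend()
plt.show()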
I'm looking for a faster way to do odds ratio tests on a large dataset. I have about 1200 variables (see var_col) I want to test against each other for mutual exclusion/co-occurrence. An odds ratio test is defined as (a * d) / (b * c), where a, b, c, d are the numbers of samples that are (a) altered in neither site x nor y, (b) altered in site x but not in y, (c) altered in y but not in x, and (d) altered in both. I'd also like to calculate the Fisher exact test to determine statistical significance. The scipy function fisher_exact can calculate both of these (see below).
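To make the definition concrete, here is a small illustration for a single hypothetical pair of sites (the counts are made up purely for illustration):

import numpy as np
from scipy.stats import fisher_exact

# made-up counts: a = neither altered, b = x only, c = y only, d = both
a, b, c, d = 40, 8, 10, 22
table = np.array([[a, c],
                  [b, d]])
oddsratio, pvalue = fisher_exact(table)
print(oddsratio, (a * d) / (b * c))  # scipy's odds ratio equals a*d / (b*c)
print(pvalue)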
#here's a sample of my original dataframe
sample_id_no var_col
0 258.0
1 -24.0
2 -150.0
3 149.0
4 108.0
5 -126.0
6 -83.0
7 2.0
8 -177.0
9 -171.0
10 -7.0
11 -377.0
12 -272.0
13 66.0
14 -13.0
15 -7.0
16 0.0
17 189.0
18 7.0
13 -21.0
19 80.0
20 -14.0
21 -76.0
3 83.0
22 -182.0
import pandas as pd
import numpy as np
from scipy.stats import fisher_exact
import itertools
#create a dataframe with each possible pair of variable
var_pairs = pd.DataFrame(list(itertools.combinations(df.var_col.unique(),2) )).rename(columns = {0:'alpha_site', 1: 'beta_site'})
#create a cross-tab with samples and vars
sample_table = pd.crosstab(df.sample_id_no, df.var_col)
odds_ratio_results = var_pairs.apply(getOddsRatio, axis=1, args = (sample_table,))
#where the function getOddsRatio is defined as:
def getOddsRatio(pairs, sample_table):
    alpha_site, beta_site = pairs
    oddsratio, pvalue = fisher_exact(pd.crosstab(sample_table[alpha_site] > 0, sample_table[beta_site] > 0))
    return [oddsratio, pvalue]
This code runs very slowly, especially on large datasets. In my actual dataset, I have around 700k variable pairs. Since the getOddsRatio() function is applied to each pair individually, it is definitely the main source of the slowness. Is there a more efficient solution?
Below is a sample dataframe which is similar to mine except the one I am working on has 200,000 data points.
import pandas as pd
import numpy as np
df=pd.DataFrame([
[10.07,5], [10.24,5], [12.85,5], [11.85,5],
[11.10,5], [14.56,5], [14.43,5], [14.85,5],
[14.95,5], [10.41,5], [15.20,5], [15.47,5],
[15.40,5], [15.31,5], [15.43,5], [15.65,5]
], columns=['speed','delta_t'])
df
speed delta_t
0 10.07 5
1 10.24 5
2 12.85 5
3 11.85 5
4 11.10 5
5 14.56 5
6 14.43 5
7 14.85 5
8 14.95 5
9 10.41 5
10 15.20 5
11 15.47 5
12 15.40 5
13 15.31 5
14 15.43 5
15 15.65 5
std_dev = df.iloc[0:3,0].std() # this will give 1.55
print(std_dev)
I have 2 columns, 'speed' and 'delta_t'. delta_t is the difference in time between subsequent rows in my actual data (which has date and time). The operating speed keeps varying, and what I want to achieve is to filter out all data points where the speed is nearly steady, say by filtering for a standard deviation of < 0.5 and delta_t >= 15 min. For example, if we start with the first speed, the code should be able to keep jumping to the next speeds, keep calculating the standard deviation, and if it is less than 0.5 and the delta_t sums up to 30 min or more, that data should be copied into a new dataframe.
So for this dataframe I will be left with indices 5 to 8 and 10 to 15.
Is this possible? Could you please give me some suggestions on how to do it? Sorry, I am stuck; it seems too complicated to me.
Thank you.
Best Regards Arun
Let's use rolling, shift and std:
Calculate the rolling std for a window of 3, then find the stds less than 0.5 and use shift(-2) to also mark the values at the start of each window whose std was less than 0.5. Using boolean indexing with | (or), we can get the entire steady-state range.
# rolling std over a 3-row window (labelled at the window's last row)
df_std = df['speed'].rolling(3).std()
# keep rows that either end or (via shift(-2)) start a low-std window
df_ss = df[(df_std < 0.5) | (df_std < 0.5).shift(-2)]
df_ss
Output:
speed delta_t
5 14.56 5
6 14.43 5
7 14.85 5
8 14.95 5
10 15.20 5
11 15.47 5
12 15.40 5
13 15.31 5
14 15.43 5
15 15.65 5
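The question also asks for the total delta_t of a steady stretch to reach 30 min; one possible follow-up (a sketch, assuming delta_t is in minutes) labels the contiguous steady runs and keeps only the long-enough ones. With this sample it would drop the first run (4 rows x 5 min = 20 min) and keep indices 10 to 15:

# same steadiness mask as above, made explicitly boolean
steady = ((df_std < 0.5) | (df_std < 0.5).shift(-2)).fillna(False).astype(bool)
# label contiguous runs of identical mask values, then sum delta_t per run
run_id = (steady != steady.shift()).cumsum()
long_enough = df.groupby(run_id)['delta_t'].transform('sum') >= 30
df[steady & long_enough]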