I'm fitting the following data where t: time (s), G: counts, f: impulse function:
t G f
-7200 4.7 0
-6300 5.17 0
-5400 4.93 0
-4500 4.38 0
-3600 4.47 0
-2700 4.4 0
-1800 3.36 0
-900 3.68 0
0 4.58 0
900 11.73 11
1800 18.23 8.25
2700 19.33 3
3600 19.04 0.5
4500 17.21 0
5400 12.98 0
6300 11.59 0
7200 9.26 0
8100 7.66 0
9000 6.59 0
9900 5.68 0
10800 5.1 0
Using the following convolution integral:

(f * g)(t) = ∫ f(τ) g(t − τ) dτ

And more specifically:

G(t) = A ∫ f(τ) exp(−λ₁(t − τ)) dτ + B ∫ f(τ) exp(−λ₂(t − τ)) dτ + C

Where: lambda_1 = 0.000431062 and lambda_2 = 0.000580525.
The code used to perform that fitting is:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# Extract data into numpy arrays (Series.as_matrix() is deprecated; use .to_numpy())
t = df['t'].to_numpy()
g = df['G'].to_numpy()
f = df['f'].to_numpy()

# Definition of the model function
def convol(x, A, B, C):
    dx = x[1] - x[0]
    return A*np.convolve(f, np.exp(-lambda_1*x))[:len(x)]*dx + B*np.convolve(f, np.exp(-lambda_2*x))[:len(x)]*dx + C
#Determination of fit parameters A,B,C
popt, pcov = curve_fit(convol, t, g)
A,B,C= popt
perr = np.sqrt(np.diag(pcov))
#Plot fit
fit = convol(t,A,B,C)
plt.plot(t, fit)
plt.scatter(t, g,s=50, color='black')
plt.show()
The problem is that my fit parameters A and B come out far too low and have no physical meaning. I think the problem is related to the step width dx: it should tend to 0 so that my sum (np.convolve() corresponds to a discrete sum of the convolution product) approximates the integral.
While this is not an answer, I cannot format code in a comment and so post it here. This code shows how to add bounds to curve_fit. Note that if parameter values are returned at or extremely near the bounds there is likely some other problem.
#Determination of fit parameters A,B,C
lowerBounds = [0.0, 0.0, 0.0] # A, B, C lower bounds
upperBounds = [10.0, 10.0, 10.0] # A, B, C upper bounds
popt, pcov = curve_fit(convol, t, g, bounds=[lowerBounds, upperBounds])
I think the problem is that the convolution calculation is incorrect.
import numpy as np
import scipy.optimize
import matplotlib.pyplot as plt
t = np.array([ -7200, -6300, -5400, -4500, -3600, -2700, -1800, -900, 0, 900, 1800, 2700, 3600, 4500, 5400, 6300, 7200, 8100, 9000, 9900, 10800])
g = np.array([ 4.7, 5.17, 4.93, 4.38, 4.47, 4.4, 3.36, 3.68, 4.58, 11.73, 18.23, 19.33, 19.04, 17.21, 12.98, 11.59, 9.26, 7.66, 6.59, 5.68, 5.1])
f = np.array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 11, 8.25, 3, 0.5, 0, 0, 0, 0, 0, 0, 0, 0])
lambda_1 = 0.000431062
lambda_2 = 0.000580525
delta_t = 900
# Define the exponential parts of the integrals
x_1 = np.exp(-lambda_1 * t)
x_2 = np.exp(-lambda_2 * t)
# Define the convolution for a given 't' (in this case, using the index of 't')
def convolution(n, x):
    return np.dot(f[:n], x[:n][::-1])
# The integrals do not vary as part of the optimization, so calculate them now
integral_1 = delta_t * np.array([convolution(i, x_1) for i in range(len(t))])
integral_2 = delta_t * np.array([convolution(i, x_2) for i in range(len(t))])
#Definition of the function
def convol(n, A, B, C):
    return A * integral_1[n] + B * integral_2[n] + C
#Determination of fit parameters A,B,C
popt, pcov = scipy.optimize.curve_fit(convol, range(len(t)), g)
A,B,C= popt
perr = np.sqrt(np.diag(pcov))
# Print out the coefficients determined by the optimization
print(A, B, C)
#Plot fit
fit = convol(range(len(t)),A,B,C)
plt.plot(t, fit)
plt.scatter(t, g,s=50, color='black')
plt.show()
The values that I get for the coefficients are:
A = 7.9742184468342304e-05
B = -1.0441976351760864e-05
C = 5.1089841502260178
I don't know if the negative value for B is reasonable or not so I have left it as-is. If you want coefficients that are positive, you can constrain them as shown by James.
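If positive coefficients are physically required, here is a minimal sketch of how the bounds James showed could be combined with the corrected model above (my own combination, not part of either original answer; it assumes integral_1, integral_2, t and g are defined as in the code above):
lowerBounds = [0.0, 0.0, 0.0]            # A, B, C lower bounds
upperBounds = [np.inf, np.inf, np.inf]   # A, B, C unbounded above
popt, pcov = scipy.optimize.curve_fit(convol, range(len(t)), g,
                                      bounds=(lowerBounds, upperBounds))
A, B, C = popt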
For a particular gene scoring system I would like to set up a rudimentary plot such that new sample values that are entered immediately gravitate, based on multiple gene measurements, towards either a healthy or unhealthy group within the plot. Let's presume we have 5 people, each having 6 genes measured.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.DataFrame(np.array([[A, 1, 1.2, 1.4, 2, 2], [B, 1.5, 1, 1.4, 1.3, 1.2], [C, 1, 1.2, 1.6, 2, 1.4], [D, 1.7, 1.5, 1.5, 1.5, 1.4], [E, 1.6, 1.9, 1.8, 3, 2.5], [F, 2, 2.2, 1.9, 2, 2]]), columns=['Gene', 'Healthy 1', 'Healthy 2', 'Healthy 3', 'Unhealthy 1', 'Unhealthy 2'])
This creates the following table:
Gene  Healthy 1  Healthy 2  Healthy 3  Unhealthy 1  Unhealthy 2
A     1.0        1.2        1.4        2.0          2.0
B     1.5        1.0        1.4        1.3          1.2
C     1.0        1.2        1.6        2.0          1.4
D     1.7        1.5        1.5        1.5          1.4
E     1.6        1.9        1.8        3.0          2.5
F     2.0        2.2        1.9        2.0          2.0
The X and Y coordinates of each sample are then calculated by summing the contributions of the genes, each contribution being its parameter/weight multiplied by the measured value. The first 4 genes contribute to the Y value, whilst genes 5 and 6 determine the X value. wA to wF are the parameters/weights associated with their gene A to F counterparts.
wA = .15
wB = .25
wC = .35
wD = .45
wE = .50
wF = .60
for n in range(5):
    y1 = df.iat[0, n]
    y2 = df.iat[1, n]
    y3 = df.iat[2, n]
    y4 = df.iat[3, n]
    TrueY = wA*y1 + wB*y2 + wC*y3 + wD*y4
    x1 = df.iat[4, n]
    x2 = df.iat[5, n]
    TrueX = wE*x1 + wF*x2
    result = (TrueX, TrueY)
    label = f"({TrueX},{TrueY})"
    plt.scatter(TrueX, TrueY, alpha=0.5)
    plt.annotate(label, (TrueX, TrueY), textcoords="offset points", xytext=(0, 10), ha='center')
We thus calculate all the coordinates and plot them.
What I would now like to do is find out how I can optimize the wA-wF parameters/weights such that the healthy samples are pushed towards the origin of the plot, say (0,0), whilst the unhealthy samples are pushed towards a reasonable opposite point, say (1,1). I've looked into K-means/SVM, but as a novice coder/biochemist I was thoroughly overwhelmed and would appreciate any help available.
Here's an example using scipy.optimize combined with your code. (Since your code contains some syntax and type errors, I had to make small corrections.)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.DataFrame(np.array([[1, 1.2, 1.4, 2, 2],
[1.5, 1, 1.4, 1.3, 1.2],
[1, 1.2, 1.6, 2, 1.4],
[1.7, 1.5, 1.5, 1.5, 1.4],
[1.6, 1.9, 1.8, 3, 2.5],
[2, 2.2, 1.9, 2, 2]]),
columns=['Healthy 1', 'Healthy 2', 'Healthy 3', 'Unhealthy 1', 'Unhealthy 2'],
index=[['A', 'B', 'C', 'D', 'E', 'F']])
wA = .15
wB = .25
wC = .35
wD = .45
wE = .50
wF = .60
from scipy.optimize import minimize
# use your given weights as the initial guess
w0 = np.array([wA, wB, wC, wD, wE, wF])
# the objective function to be minimized
# - it computes the (square of) the samples' distances to (0,0) resp. (1,1)
def fun(w):
    weighted = df.values*w[:, None]  # multiply all sample values by their weight
    y = sum(weighted[:4])            # compute all 5 "TrueY" coordinates
    x = sum(weighted[4:])            # compute all 5 "TrueX" coordinates
    y[3:] -= 1                       # adjust the "Unhealthy" y towards the target (x, 1)
    x[3:] -= 1                       # adjust the "Unhealthy" x towards the target (1, y)
    return sum(x**2 + y**2)          # return the sum of (squared) distances
res = minimize(fun, w0)
print(res)
# assign the optimized weights back to your parameters
wA, wB, wC, wD, wE, wF = res.x
# this is mostly your unchanged code
for n in range(5):
    y1 = df.iat[0, n]
    y2 = df.iat[1, n]
    y3 = df.iat[2, n]
    y4 = df.iat[3, n]
    TrueY = wA*y1 + wB*y2 + wC*y3 + wD*y4
    x1 = df.iat[4, n]
    x2 = df.iat[5, n]
    TrueX = wE*x1 + wF*x2
    result = (TrueX, TrueY)
    label = f"({TrueX:.3f},{TrueY:.3f})"
    plt.scatter(TrueX, TrueY, alpha=0.5)
    plt.annotate(label, (TrueX, TrueY), textcoords="offset points", xytext=(0, 10), ha='center')
plt.savefig("mygraph.png")
This yields the parameters [ 1.21773653, 0.22185886, -0.39377451, -0.76513658, 0.86984207, -0.73166533] as the solution array. With these weights, the healthy samples cluster around (0,0) and the unhealthy samples around (1,1).
You may want to experiment with other optimization methods - see scipy.optimize.minimize.
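For instance, a minimal sketch of selecting a derivative-free solver (my own illustration, not from the original answer; it reuses the fun and w0 defined above):
# Gradient-free methods can be more robust when the objective is noisy or non-smooth
res_nm = minimize(fun, w0, method='Nelder-Mead')
res_pw = minimize(fun, w0, method='Powell')
print(res_nm.x)
print(res_pw.x)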
For example, I have an index array
ax = [0, 0.2, 2] #start from index 0: python
and matrix I
I=
10 20 30 40 50
10 20 30 40 50
10 20 30 40 50
10 20 30 40 50
10 20 30 40 50
In MATLAB, by running this code
[gx, gy] = meshgrid([1,1.2,3], [1,1.2,3]);
I = [10:10:50];
I = vertcat(I,I,I,I,I)
SI = interp2(I,gx,gy,'bilinear');
The resulting SI is
SI =
10 12 30
10 12 30
10 12 30
I tried to do the same interpolation in Python, using NumPy. I first interpolate row-wise, then column-wise
import numpy as np
ax = np.array([0.0, 0.2, 2.0])
ay = np.array([0.0, 0.2, 2.0])
I = np.array([[10,20,30,40,50]])
I = np.concatenate((I,I,I,I,I), axis=0)
r_idx = np.arange(1, I.shape[0]+1)
c_idx = np.arange(1, I.shape[1]+1)
I_row = np.transpose(np.array([np.interp(ax, r_idx, I[:,x]) for x in range(0,I.shape[0])]))
I_col = np.array([np.interp(ay, c_idx, I_row[y,:]) for y in range(0, I_row.shape[0])])
SI = I_col
However, the resulting SI is
SI =
10 10 20
10 10 20
10 10 20
Why are my results using Python different from those using MATLAB?
It seems that you over-corrected yourself when moving from MATLAB to Python, as shown by your first code excerpt:
ax = [0, 0.2, 2] #start from index 0: python
In NumPy, this sequence does not represent indices but the coordinates at which the function is interpolated.
Since you already take care of shifting the coordinates to be compatible with MATLAB's 1-based indexing here:
r_idx = np.arange(1, I.shape[0]+1)
c_idx = np.arange(1, I.shape[1]+1)
You can reuse the same interpolation coordinates that you used in MATLAB:
ax = [1,1.2,3]
Full code:
import numpy as np
ax = np.array([1.0, 1.2, 3.0])
ay = np.array([1.0, 1.2, 3.0])
I = np.array([[10,20,30,40,50]])
I = np.concatenate((I,I,I,I,I), axis=0)
r_idx = np.arange(1, I.shape[0]+1)
c_idx = np.arange(1, I.shape[1]+1)
I_row = np.transpose(np.array([np.interp(ax, r_idx, I[:,x]) for x in range(0, I.shape[0])]))
I_col = np.array([np.interp(ay, c_idx, I_row[y,:]) for y in range(0, I_row.shape[0])])
SI = I_col
and result:
array([[10., 12., 30.],
[10., 12., 30.],
[10., 12., 30.]])
Explanation of the bug
Since ax represented coordinates, its first two values 0.0 and 0.2 fell before the first coordinate of r_idx.
According to the np.interp documentation, points outside the coordinate range are clamped, so the interpolation defaults to the first data value, I[:,x][0].
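A quick illustration of that clamping behaviour (my own example, not from the original answer):
import numpy as np
# Requested points below the first x-coordinate return the first y-value
print(np.interp([0.0, 0.2, 2.0], [1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))
# [10. 10. 20.]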
I am trying to solve an LP problem with two variables and two constraints, one inequality and one equality constraint, in SciPy.
To convert the inequality constraint into an equality, I have added another variable to it called A.
Min(z) = 80x + 60y
Constraints:
0.2x + 0.32y <= 0.25
x + y = 1
x, y >= 0
I have converted the inequality constraint into the following equations by adding an extra variable A:
0.2x + 0.32y + A = 0.25
Min(z) = 80x + 60y + 0A
x + y + 0A = 1
from scipy.optimize import linprog
import numpy as np
z = np.array([80, 60, 0])
C = np.array([
[0.2, 0.32, 1],
[1, 1, 0]
])
b = np.array([0.25, 1])
x1 = (0, None)
x2 = (0, None)
sol = linprog(-z, A_eq = C, b_eq = b, bounds = (x1, x2), method='simplex')
However, I am getting an error message
Invalid input for linprog with method = 'simplex'. Length of bounds
is inconsistent with the length of c
How can I fix this?
The problem is that you do not provide bounds for A. If you e.g. run
linprog(-z, A_eq = C, b_eq = b, bounds = (x1, x2, (0, None)), method='simplex')
you will obtain:
con: array([0., 0.])
fun: -80.0
message: 'Optimization terminated successfully.'
nit: 3
slack: array([], dtype=float64)
status: 0
success: True
x: array([1. , 0. , 0.05])
As you can see, the constraints are met:
0.2 * 1 + 0.32 * 0.0 + 0.05 = 0.25 # (0.2x + 0.32y + A = 0.25)
and also
1 + 0 + 0 = 1 # (X + Y + 0A = 1)
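As a side note (my own addition, not part of the original answer), linprog can also take the inequality constraint directly through A_ub/b_ub, so the extra slack variable A is not strictly needed:
from scipy.optimize import linprog
import numpy as np

z = np.array([80, 60])
A_ub = np.array([[0.2, 0.32]])   # 0.2x + 0.32y <= 0.25
b_ub = np.array([0.25])
A_eq = np.array([[1, 1]])        # x + y = 1
b_eq = np.array([1])
sol = linprog(-z, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None), (0, None)], method='simplex')
print(sol.x)  # expected: [1. 0.], matching the result above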
I'm defining price momentum as the average of a given stock's momentum over the past n days.
Momentum, in turn, is a classification: each day is labeled 1 if the closing price that day is higher than the day before, and −1 if it is lower.
I have stock change percentages as follows:
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['close in percent'] = np.array([0.27772152, 1.05468772, 0.124156, -0.39298394,
                                   0.56415267, 1.67812005])
momentum = df['close in percent'].apply(lambda x: 1 if x > 0 else -1).values
Momentum should be: [1,1,1,-1,1,1].
So if I'm finding the average momentum for the last n = 3 days, I want my price momentum to be:
Price_momentum = [Nan, Nan, 1, 1/3, 1/3, 1/3]
I managed to use the following code to get it working, but this is extremely slow (the dataset is 5000+ rows and it takes 10 min to execute).
for i in range(3, len(df)+1, 1):
    data = np.array(momentum[i-3:i])
    df['3_day_momentum'].iloc[i-1] = data.mean()
You can create a rolling object:
df = pd.DataFrame()
df['close_in_percent'] = np.array([0.27772152, 1.05468772,
0.124156 , -0.39298394,
0.56415267, 1.67812005])
df['momentum'] = np.where(df['close_in_percent'] > 0, 1, -1)
df['3_day_momentum'] = df.momentum.rolling(3).mean()
Here, np.where is an alternative to apply(), which is generally slow and should be used as a last resort.
close_in_percent momentum 3_day_momentum
0 0.2777 1 NaN
1 1.0547 1 NaN
2 0.1242 1 1.0000
3 -0.3930 -1 0.3333
4 0.5642 1 0.3333
5 1.6781 1 0.3333
You can use np.where + pd.Rolling.mean -
s = df['close in percent']
pd.Series(np.where(s > 0, 1, -1)).rolling(3).mean()
0 NaN
1 NaN
2 1.000000
3 0.333333
4 0.333333
5 0.333333
dtype: float64
For v0.17 or below, there's also rolling_mean which works with arrays directly.
pd.rolling_mean(np.where(s > 0, 1, -1), window=3)
array([ nan, nan, 1. , 0.33333333, 0.33333333,
0.33333333])
Those rolling averages are basically uniform filtered values. Hence, we can use SciPy's uniform filter -
from scipy.ndimage import uniform_filter1d

def rolling_mean(ar, W=3):
    hW = (W-1)//2
    out = uniform_filter1d(ar.astype(float), size=W, origin=hW)
    out[:W-1] = np.nan
    return out
momentum = 2*(df['close in percent'] > 0) - 1
df['out'] = rolling_mean(momentum, W=3)
Benchmarking
Timing pandas.rolling and SciPy's uniform filter -
In [463]: df = pd.DataFrame({'close in percent':np.random.randn(1000000)})
In [464]: df['momentum'] = np.where(df['close in percent'] > 0, 1, -1)
In [465]: momentum = 2*(df['close in percent'] > 0) - 1
# From #Brad Solomon's soln
In [466]: %timeit df['3_day_momentum'] = df.momentum.rolling(3).mean()
10 loops, best of 3: 27.3 ms per loop
# SciPy's uniform filter
In [467]: %timeit df['3_day_momentum_out'] = rolling_mean(momentum, W=3)
100 loops, best of 3: 7.69 ms per loop
I have two data points x and y:
x = 5 (value corresponding to 95%)
y = 17 (value corresponding to 102.5%)
Now I would like to calculate the value for xi which should correspond to 100%.
x = 5 (value corresponding to 95%)
xi = ?? (value corresponding to 100%)
y = 17 (value corresponding to 102.5%)
How should I do this using Python?
Is that what you want?
In [145]: s = pd.Series([5, np.nan, 17], index=[95, 100, 102.5])
In [146]: s
Out[146]:
95.0 5.0
100.0 NaN
102.5 17.0
dtype: float64
In [147]: s.interpolate(method='index')
Out[147]:
95.0 5.0
100.0 13.0
102.5 17.0
dtype: float64
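If you just need the single number, you can pull it out of the interpolated series (my own follow-up, not part of the original answer):
s.interpolate(method='index').loc[100.0]   # 13.0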
You can use the numpy.interp function to interpolate a value:
import numpy as np
import matplotlib.pyplot as plt
x = [95, 102.5]
y = [5, 17]
x_new = 100
y_new = np.interp(x_new, x, y)
print(y_new)
# 13.0
plt.plot(x, y, "og-", x_new, y_new, "or");
We can easily plot the two points on a graph, even without Python, taking the value as the X axis and the percentage as the Y axis; this shows us what the answer should be (13).
But how do we calculate this? First, we find the gradient:
m = (Y2 - Y1) / (X2 - X1)
Substituting the numbers gives:
m = (102.5 - 95) / (17 - 5) = 7.5 / 12 = 0.625
So for every 0.625 we increase the Y value (the percentage) by, we increase the X value by 1.
We've been given that Y is 100. We know that 102.5 relates to 17. 100 - 102.5 = -2.5. -2.5 / 0.625 = -4 and then 17 + -4 = 13.
This also works with the other numbers: 100 - 95 = 5, 5 / 0.625 = 8, 5 + 8 = 13.
We can also go backwards using the reciprocal of the gradient (1 / m).
We've been given that X is 13. We know that 102.5 relates to 17. 13 - 17 = -4. -4 / 1.6 = -2.5 and then 102.5 + -2.5 = 100.
How do we do this in Python?
def findXPoint(xa, xb, ya, yb, yc):
    m = (xa - xb) / (ya - yb)
    xc = (yc - yb) * m + xb
    return xc
And to find a Y point given the X point:
def findYPoint(xa, xb, ya, yb, xc):
    m = (ya - yb) / (xa - xb)
    yc = (xc - xb) * m + yb
    return yc
This function will also extrapolate from the data points.
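For the numbers in the question, a quick check (my own example, not part of the original answer):
# xa, xb are the known values; ya, yb the corresponding percentages
print(findXPoint(5, 17, 95, 102.5, 100))   # 13.0  (value at 100%)
print(findYPoint(5, 17, 95, 102.5, 13))    # 100.0 (percentage at value 13)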