Python - curve_fit gives incorrect coefficients - python

I'm trying to pass two arrays for a fitting function, that takes both values.
Data file:
Column 1: Time
Column 2: Temperature
Column 3: Volume
Column 4: Pressure
0.000,0.946,4.668,0.981
0.050,0.946,4.668,0.981
0.100,0.946,4.669,0.981
0.150,0.952,4.588,0.996
0.200,1.025,4.008,1.117
0.250,1.210,3.093,1.361
0.300,1.445,2.299,1.652
0.350,1.650,1.803,1.887
0.400,1.785,1.524,2.038
0.450,1.867,1.340,2.145
0.500,1.943,1.138,2.280
0.550,2.019,0.958,2.411
0.600,2.105,0.750,2.587
0.650,2.217,0.542,2.791
0.700,2.332,0.366,2.978
0.750,2.420,0.242,3.116
0.800,2.444,0.219,3.114
0.850,2.414,0.219,3.080
here is the code
import numpy as np
from scipy.optimize import curve_fit
# Importing the Data
Time_Air1 = []
Vol_Air1 = []
Temp_Air1 = []
Pres_Air1 = []
with open('Good_Air_Run1.csv', 'r') as Air1:
reader = csv.reader(Air1, delimiter=',')
for row in reader:
Time_Air1.append(row[0])
Temp_Air1.append(row[1])
Vol_Air1.append(row[2])
Pres_Air1.append(row[3])
# Arrays are now passable floats
Time_Air1 = np.float32(np.array(Time_Air1))
Vol_Air1 = np.float32(np.array(Vol_Air1))
Temp_Air1 = np.float32(np.array(Temp_Air1))
Pres_Air1 = np.float32(np.array(Pres_Air1))
# fitting Model
def model_Gamma(V, gam, C):
return -gam*np.log(V) + C
# Air Data Fitting Data
x1 = Vol_Air1
y1 = Pres_Air1
p0_R1 = (1.0 ,1.0)
optR1, pcovR1 = curve_fit(model_Gamma, x1, y1, p0_R1)
gam_R1, C_R1 = optR1
gam_R1p, C_R1p = pcovR1
y1Mair = model_Gamma2(x_air1, gam_R1, C_R1)
compute the gamma coefficient, but it's not giving me the value i'm expecting, ~1.2. It gives me ~0.72
Yes this is the correct value because my friend fit the data into gnuplot and got that value.
If there is any information needed to actually try this, i'm happy to supply it.

Caveat: the result obtained here for gamma (about 1.7) still deviates from the postulated 1.2. This answer merely highlights the source of possible errors and illustrates what a good fit might look like.
You are trying to fit data where the dependent variable is related to the independent variable by a model that resembles that of adiabatic processes for ideal gases. Here, the pressure and the volume of a gass are related through
pressure * volume**gamma = constant
When you rearrange the left hand side and right hand side, you get:
pressure = constant * volume**-gamma
or in logarithmic form:
log(pressure) = log(constant) - gamma * log(volume)
You could fit the pressure data to the volume data using either of these 2 forms,
but the fit might not be optimal, because of measurement errors. One such error could be a fixed offset (e.g. some solid object is present in a beaker: the volume scale on the beaker will not represent accurately the volume of any liquid you pour in it).
When you account for such errors, often times a fit becomes markedly better.
Below, I've shown the fitting of your data using 3 models: the first is your model, the second takes into account a volume offset, and the third is a non-logarithmic variant of the 2nd model (it is basically the 2nd equation, with an optional volume offset). Note that in your code when you fit to what I call model1, you do not pass log(pressure) to the model, which would only make sense in case your pressure data is already tabulated on a logarithmic scale.
>>> import numpy as np
>>> from scipy.optimize import curve_fit
>>> data = np.genfromtxt('/tmp/datafile.txt',
... names=('time', 'temp', 'vol', 'press'), delimiter=',', usecols=range(4))
>>> def model1(volume, gamma, c):
... return np.log(c) - gamma*np.log(volume)
...
>>> def model2(volume, gamma, c, volume_offset):
... return np.log(c) - gamma*np.log(volume + volume_offset)
...
>>> def model3(volume, gamma, c, volume_offset):
... return c * (volume + volume_offset)**(-gamma)
...
>>> vol, press = data['vol'], data['press']
>>> guess1, _ = curve_fit(model1, vol, np.log(press))
>>> guess2, _ = curve_fit(model2, vol, np.log(press))
>>> guess3, _ = curve_fit(model3, vol, press)
>>> guess1, guess2, guess3
(array([ 0.38488521, 2.04536926]),
array([ 1.7269364 , 44.57369479, 4.44625865]),
array([ 1.73186133, 45.20087949, 4.46364872]))
>>> rms = lambda x: np.sqrt(np.mean(x**2))
>>> rms( press - np.exp(model1(vol, *guess1)))
0.29464410744456304
>>> rms(press - model3(vol, *guess3))
0.012672077620951249
Notice how guess2 and guess3 are nearly identical
The last two lines indicate the rms error. You'll notice that it is smaller for the model that takes into account the offset (if you plot them, you'll see the fit is much better than when you use model1*).
As a final remark, have a look at numpy's excellent functions for importing data, like the one I've shown here (np.genfromtxt), as they can save you a lot of tedious typing, like I demonstrated in this code.
Footnote: * when you plot using model1, don't forget to put everything back to linear scale, like this:
plt.plot(vol, np.exp(model1(vol, *guess1)))

Related

JiTCDDE problems with integration

I'm trying to solve a 2D delay differential equation with some parameters. The problem is that I can’t get the right solution (which I know) and I suspect that it comes from the integration step, but I'm not sure and I don't really understand how the JiTCDDE works.
This is the DDE:
This is my model:
def model(p, q, r, alpha, T, tau, tmax, ci):
f = [1/T * (p*y(0)+alpha*y(1, t-tau)), 1/T * (r*y(0)+q*y(1))]
DDE = jitcdde(f)
DDE.constant_past(ci)
DDE.step_on_discontinuities()
data = []
for time in np.arange(DDE.t, DDE.t+tmax, 0.09):
data.append( DDE.integrate(time)[1])
return data
Where I'm only interested in the y(1) solution
And the parameters:
T=32 #escala temporal
p=-2.4/T
q=-1.12/T
r=1.5/T
alpha=.6/T
tau=T*2.4 #delay
tmax=400
ci = np.array([4080, 0])
This is the plot I have with that model and parameters:
And this is (the blue line) the correct solution (someone give me the plot not the data)
The following code works for me and produces a result similar to your control:
import numpy as np
from jitcdde import y,t,jitcdde
T = 1
p = -2.4/T
q = -1.12/T
r = 1.5/T
alpha = .6/T
tau = T*2.4
tmax = 10
ci = np.array([4080, 0])
f = [
1/T * (p*y(0)+alpha*y(1, t-tau)),
1/T * (r*y(0)+q*y(1))
]
DDE = jitcdde(f)
DDE.constant_past(ci)
DDE.adjust_diff()
times = np.linspace( DDE.t, DDE.t+tmax, 1000 )
values = [DDE.integrate(time)[1] for time in times]
from matplotlib.pyplot import subplots
fig,axes = subplots()
axes.plot(times,values)
fig.show()
Note the following:
I set T=1 (and adjusted tmax accordingly). I presume that there still is a mistake here.
I used adjust_diff instead of step_on_discontinuities. The problem with your model is that it has a heavy discontinuity in the derivative at t=0. (A discontinuity is normal, but none of this kind). This causes problems with the adaptive step size control at the very beginning of the integration. Such a discontinuity suggests that there is something wrong with either your model or your initial past. The latter doesn’t matter if you only care about long-term behaviour, but this doesn’t seem to be the case here. I added a passage to the documentation about this kind of issue.

Calculating cross-correlation with fft returning backwards output

I'm trying to cross correlate two sets of data, by taking the fourier transform of both and multiplying the conjugate of the first fft with the second fft, before transforming back to time space. In order to test my code, I am comparing the output with the output of numpy.correlate. However, when I plot my code, (restricted to a certain window), it seems the two signals go in opposite directions/are mirrored about zero.
This is what my output looks like
My code:
import numpy as np
import pyplot as plt
phl_data = np.sin(np.arange(0, 10, 0.1))
mlac_data = np.cos(np.arange(0, 10, 0.1))
N = phl_data.size
zeroes = np.zeros(N-1)
phl_data = np.append(phl_data, zeroes)
mlac_data = np.append(mlac_data, zeroes)
# cross-correlate x = phl_data, y = mlac_data:
# take FFTs:
phl_fft = np.fft.fft(phl_data)
mlac_fft = np.fft.fft(mlac_data)
# fft of cross-correlation
Cw = np.conj(phl_fft)*mlac_fft
#Cw = np.fft.fftshift(Cw)
# transform back to time space:
Cxy = np.fft.fftshift(np.fft.ifft(Cw))
times = np.append(np.arange(-N+1, 0, dt),np.arange(0, N, dt))
plt.plot(times, Cxy)
plt.xlim(-250, 250)
# test against convolving:
c = np.correlate(phl_data, mlac_data, mode='same')
plt.plot(times, c)
plt.show()
(both data sets have been padded with N-1 zeroes)
The documentation to numpy.correlate explains this:
This function computes the correlation as generally defined in signal processing texts:
c_{av}[k] = sum_n a[n+k] * conj(v[n])
and:
Notes
The definition of correlation above is not unique and sometimes correlation may be defined differently. Another common definition is:
c'_{av}[k] = sum_n a[n] conj(v[n+k])
which is related to c_{av}[k] by c'_{av}[k] = c_{av}[-k].
Thus, there is not a unique definition, and the two common definitions lead to a reversed output.

Using np.interp to find x value for a given y gives wrong answer

I want to find the x value for a given y (I want to know at what t, X, the conversion, reaches 0.9). There are questions like this all over SO and they say use np.interp but I did that in two ways and both were wrong. The code is:
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint
# Create time domain
t = np.linspace(0,4000,100)
# Parameters
A = 1.5*10**(-3) # Arrhenius constant
T = 300 # Temperature [K]
R = 8.31 # Ideal gas constant [J/molK]
E_a= 1000 # Activation energy [J/mol]
V = 5 # Reactor volume [m3]
# Initial condition
C_A0 = 0.1 # Initial concentration [mol/m3]
def dNdt(C_A,t):
r_A = (-k*C_A)/V
dNdt = r_A*V
return dNdt
k=A*np.exp(-E_a/(R*T))
C_A = odeint(dNdt,C_A0,t)
N_A0 = C_A0*V
N_A = C_A*V
X = (N_A0 - N_A)/N_A0
# Plot
plt.figure()
plt.plot(t,X,'b-',label='Conversion')
plt.plot(t,C_A,'r--',label='Concentration')
plt.legend(loc='best')
plt.grid(True)
plt.xlabel('Time [s]')
plt.ylabel('Conversion')
Looking at the graph, at roughly t=2300, the conversion is 0.9.
Method 1:
I wrote this function so I can ask for any given point and get the x-value:
def find(x_val,f):
f = np.reshape(f,len(f))
global t
t = np.reshape(t,len(t))
return np.interp(x_val,t,f)
print('Conversion of 0.9 is reached at: ',int(find(0.9,X)),'s')
When I call the function at 0.9 I get 0.0008858 which gets rounded to 0 which is wrong. I thought maybe something is going wrong when I declare global t??
Method 2:
When I do it outside the function; so I manually reshape X and t and use np.interp(0.9,t,X), the output is 0.9.
X = np.reshape(X,len(X))
t = np.reshape(t,len(t))
print(np.interp(0.9,t,X))
I thought I made a mistake in the order of the variables so I did np.interp(0.9,X,t), and again it surprised me with 0.9.
I'm unsure as to where I'm going wrong. Any help would be appreciated. Many thanks :)
On your plot, t is horizontal and X is vertical. You want to find the horizontal coordinate where the vertical one is 0.9. That is, find t for a given X. Saying
find x value for a given y
is bound to lead to confusion, as it did here.
The problem is solved with
print(np.interp(0.9, X.ravel(), t)) # prints 2292.765497278863
(It's better to use ravel for flattening, instead of the reshape as you did). There is no need to reshape t, which is already one-dimensional.
I did np.interp(0.9,X,t), and again it surprised me with 0.9.
That sounds unlikely, you probably mistyped. This was the correct order.

Least Squares method in practice

Very simple regression task. I have three variables x1, x2, x3 with some random noise. And I know target equation: y = q1*x1 + q2*x2 + q3*x3. Now I want to find target coefs: q1, q2, q3 evaluate the
performance using the mean Relative Squared Error (RSE) (Prediction/Real - 1)^2 to evaluate the performance of our prediction methods.
In the research, I see that this is ordinary Least Squares Problem. But I can't get from examples on the internet how to solve this particular problem in Python. Let say I have data:
import numpy as np
sourceData = np.random.rand(1000, 3)
koefs = np.array([1, 2, 3])
target = np.dot(sourceData, koefs)
(In real life that data are noisy, with not normal distribution.) How to find this koefs using Least Squares approach in python? Any lib usage.
#ayhan made a valuable comment.
And there is a problem with your code: Actually there is no noise in the data you collect. The input data is noisy, but after the multiplication, you don't add any additional noise.
I've added some noise to your measurements and used the least squares formula to fit the parameters, here's my code:
data = np.random.rand(1000,3)
true_theta = np.array([1,2,3])
true_measurements = np.dot(data, true_theta)
noise = np.random.rand(1000) * 1
noisy_measurements = true_measurements + noise
estimated_theta = np.linalg.inv(data.T # data) # data.T # noisy_measurements
The estimated_theta will be close to true_theta. If you don't add noise to the measurements, they will be equal.
I've used the python3 matrix multiplication syntax.
You could use np.dot instead of #
That makes the code longer, so I've split the formula:
MTM_inv = np.linalg.inv(np.dot(data.T, data))
MTy = np.dot(data.T, noisy_measurements)
estimated_theta = np.dot(MTM_inv, MTy)
You can read up on least squares here: https://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)#The_general_problem
UPDATE:
Or you could just use the builtin least squares function:
np.linalg.lstsq(data, noisy_measurements)
In addition to the #lhk answer I have found great scipy Least Squares function. It is easy to get the requested behavior with it.
This way we can provide a custom function that returns residuals and form Relative Squared Error instead of absolute squared difference:
import numpy as np
from scipy.optimize import least_squares
data = np.random.rand(1000,3)
true_theta = np.array([1,2,3])
true_measurements = np.dot(data, true_theta)
noise = np.random.rand(1000) * 1
noisy_measurements = true_measurements + noise
#noisy_measurements[-1] = data[-1] # (1000 * true_theta) - uncoment this outliner to see how much Relative Squared Error esimator works better then default abs diff for this case.
def my_func(params, x, y):
res = (x # params) / y - 1 # If we change this line to: (x # params) - y - we will got the same result as np.linalg.lstsq
return res
res = least_squares(my_func, x0, args=(data, noisy_measurements) )
estimated_theta = res.x
Also, we can provide custom loss with loss argument function that will process the residuals and form final loss.

Autocorrelation code in Python produces errors (guitar pitch detection)

This link provides code for an autocorrelation-based pitch detection algorithm. I am using it to detect pitches in simple guitar melodies.
In general, it produces very good results. For example, for the melody C4, C#4, D4, D#4, E4 it outputs:
262.743653536
272.144441273
290.826273006
310.431336809
327.094621169
Which correlates to the correct notes.
However, in some cases like this audio file (E4, F4, F#4, G4, G#4, A4, A#4, B4) it produces errors:
325.861452246
13381.6439242
367.518651703
391.479384923
414.604661221
218.345286173
466.503751322
244.994090035
More specifically, there are three errors here: 13381Hz is wrongly detected instead of F4 (~350Hz) (weird error), and also 218Hz instead of A4 (440Hz) and 244Hz instead of B4 (~493Hz), which are octave errors.
I assume the two errors are caused by something different? Here is the code:
slices = segment_signal(y, sr)
for segment in slices:
pitch = freq_from_autocorr(segment, sr)
print pitch
def segment_signal(y, sr, onset_frames=None, offset=0.1):
if (onset_frames == None):
onset_frames = remove_dense_onsets(librosa.onset.onset_detect(y=y, sr=sr))
offset_samples = int(librosa.time_to_samples(offset, sr))
print onset_frames
slices = np.array([y[i : i + offset_samples] for i
in librosa.frames_to_samples(onset_frames)])
return slices
You can see the freq_from_autocorr function in the first link above.
The only think that I have changed is this line:
corr = corr[len(corr)/2:]
Which I have replaced with:
corr = corr[int(len(corr)/2):]
UPDATE:
I noticed the smallest the offset I use (the smallest the signal segment I use to detect each pitch), the more high-frequency (10000+ Hz) errors I get.
Specifically, I noticed that the part that goes differently in those cases (10000+ Hz) is the calculation of the i_peak value. When in cases with no error it is in the range of 50-150, in the case of the error it is 3-5.
The autocorrelation function in the code snippet that you linked is not particularly robust. In order to get the correct result, it needs to locate the first peak on the left hand side of the autocorrelation curve. The method that the other developer used (calling the numpy.argmax() function) does not always find the correct value.
I've implemented a slightly more robust version, using the peakutils package. I don't promise that it's perfectly robust either, but in any case it achieves a better result than the version of the freq_from_autocorr() function that you were previously using.
My example solution is listed below:
import librosa
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import fftconvolve
from pprint import pprint
import peakutils
def freq_from_autocorr(signal, fs):
# Calculate autocorrelation (same thing as convolution, but with one input
# reversed in time), and throw away the negative lags
signal -= np.mean(signal) # Remove DC offset
corr = fftconvolve(signal, signal[::-1], mode='full')
corr = corr[len(corr)//2:]
# Find the first peak on the left
i_peak = peakutils.indexes(corr, thres=0.8, min_dist=5)[0]
i_interp = parabolic(corr, i_peak)[0]
return fs / i_interp, corr, i_interp
def parabolic(f, x):
"""
Quadratic interpolation for estimating the true position of an
inter-sample maximum when nearby samples are known.
f is a vector and x is an index for that vector.
Returns (vx, vy), the coordinates of the vertex of a parabola that goes
through point x and its two neighbors.
Example:
Defining a vector f with a local maximum at index 3 (= 6), find local
maximum if points 2, 3, and 4 actually defined a parabola.
In [3]: f = [2, 3, 1, 6, 4, 2, 3, 1]
In [4]: parabolic(f, argmax(f))
Out[4]: (3.2142857142857144, 6.1607142857142856)
"""
xv = 1/2. * (f[x-1] - f[x+1]) / (f[x-1] - 2 * f[x] + f[x+1]) + x
yv = f[x] - 1/4. * (f[x-1] - f[x+1]) * (xv - x)
return (xv, yv)
# Time window after initial onset (in units of seconds)
window = 0.1
# Open the file and obtain the sampling rate
y, sr = librosa.core.load("./Vocaroo_s1A26VqpKgT0.mp3")
idx = np.arange(len(y))
# Set the window size in terms of number of samples
winsamp = int(window * sr)
# Calcualte the onset frames in the usual way
onset_frames = librosa.onset.onset_detect(y=y, sr=sr)
onstm = librosa.frames_to_time(onset_frames, sr=sr)
fqlist = [] # List of estimated frequencies, one per note
crlist = [] # List of autocorrelation arrays, one array per note
iplist = [] # List of peak interpolated peak indices, one per note
for tm in onstm:
startidx = int(tm * sr)
freq, corr, ip = freq_from_autocorr(y[startidx:startidx+winsamp], sr)
fqlist.append(freq)
crlist.append(corr)
iplist.append(ip)
pprint(fqlist)
# Choose which notes to plot (it's set to show all 8 notes in this case)
plidx = [0, 1, 2, 3, 4, 5, 6, 7]
# Plot amplitude curves of all notes in the plidx list
fgwin = plt.figure(figsize=[8, 10])
fgwin.subplots_adjust(bottom=0.0, top=0.98, hspace=0.3)
axwin = []
ii = 1
for tm in onstm[plidx]:
axwin.append(fgwin.add_subplot(len(plidx)+1, 1, ii))
startidx = int(tm * sr)
axwin[-1].plot(np.arange(startidx, startidx+winsamp), y[startidx:startidx+winsamp])
ii += 1
axwin[-1].set_xlabel('Sample ID Number', fontsize=18)
fgwin.show()
# Plot autocorrelation function of all notes in the plidx list
fgcorr = plt.figure(figsize=[8,10])
fgcorr.subplots_adjust(bottom=0.0, top=0.98, hspace=0.3)
axcorr = []
ii = 1
for cr, ip in zip([crlist[ii] for ii in plidx], [iplist[ij] for ij in plidx]):
if ii == 1:
shax = None
else:
shax = axcorr[0]
axcorr.append(fgcorr.add_subplot(len(plidx)+1, 1, ii, sharex=shax))
axcorr[-1].plot(np.arange(500), cr[0:500])
# Plot the location of the leftmost peak
axcorr[-1].axvline(ip, color='r')
ii += 1
axcorr[-1].set_xlabel('Time Lag Index (Zoomed)', fontsize=18)
fgcorr.show()
The printed output looks like:
In [1]: %run autocorr.py
[325.81996740236065,
346.43374761017725,
367.12435233192753,
390.17291696559079,
412.9358117076161,
436.04054933498134,
465.38986619237039,
490.34120132405866]
The first figure produced by my code sample depicts the amplitude curves for the next 0.1 seconds following each detected onset time:
The second figure produced by the code shows the autocorrelation curves, as computed inside of the freq_from_autocorr() function. The vertical red lines depict the location of the first peak on the left for each curve, as estimated by the peakutils package. The method used by the other developer was getting incorrect results for some of these red lines; that's why his version of that function was occasionally returning the wrong frequencies.
My suggestion would be to test the revised version of the freq_from_autocorr() function on other recordings, see if you can find more challenging examples where even the improved version still gives incorrect results, and then get creative and try to develop an even more robust peak finding algorithm that never, ever mis-fires.
The autocorrelation method is not always right. You may want to implement a more sophisticated method like YIN:
http://audition.ens.fr/adc/pdf/2002_JASA_YIN.pdf
or MPM:
http://www.cs.otago.ac.nz/tartini/papers/A_Smarter_Way_to_Find_Pitch.pdf
Both of the above papers are good reads.

Categories

Resources