import numpy as np

step = [0.1, 0.2, 0.3, 0.4, 0.5]
static = []
for x in step:
    rng = np.arange(5, 10 + x, x)
    static.append(rng)
# this returns a list that looks something like this:
# [[5., 5.1, 5.2, ...], [5., 5.2, 5.4, ...], [5., 5.3, 5.6, ...], ...]
I'm trying to create standard and dynamic stop/step ranges from 5.0 to 10.0. For the standard ranges I used a list of step sizes and looped over it to get the different interval lists.
What I want now is to get varying step sizes within the 5.0-10.0 interval. For example, from 5.0 to 7.3 the step size is 0.2, from 7.3 to 8.3 the step is 0.5, and from 8.3 to 10.0 the step is, say, 0.8. What I don't understand is how to make the dynamic version run through and get all the possible combinations.
Using a list of steps and a list of "milestones" that we are going to use to determine the start and end points of each np.arange, we can do this:
import numpy as np
def dynamic_range(milestones, steps) -> list:
    start = milestones[0]
    result = []
    for end, step in zip(milestones[1:], steps):
        result += np.arange(start, end, step).tolist()
        start = end
    return result
print(dynamic_range(milestones=(5.0, 7.3, 8.3, 10.0), steps=(0.2, 0.5, 0.8)))
# [5.0, 5.2, 5.4, 5.6, 5.8, 6.0, 6.2, 6.4, 6.6, 6.8, 7.0,
# 7.2, 7.3, 7.8, 8.3, 8.3, 9.1, 9.9]
Note on performance: this answer assumes that you are going to use a few hundred points in your dynamic range. If you want millions of points, we should try another approach with pure numpy and no list concatenation.
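For reference, a minimal sketch of such a vectorised variant (same hypothetical milestones and steps as above, building one array per segment and concatenating once at the end) might look like this:
import numpy as np

def dynamic_range_np(milestones, steps):
    # one np.arange per segment, joined in a single concatenation
    parts = [np.arange(start, end, step)
             for start, end, step in zip(milestones[:-1], milestones[1:], steps)]
    return np.concatenate(parts)

print(dynamic_range_np((5.0, 7.3, 8.3, 10.0), (0.2, 0.5, 0.8)))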
If you want it to stay within the [5, 10] interval, then don't add x to 10:
import numpy as np

step = [0.1, 0.2, 0.3, 0.4, 0.5]
static = []
for x in step:
    rng = np.arange(5, 10, x)
    static.append(rng)
print(static)
Dynamic:
import numpy as np

step = [0.1, 0.2, 0.3, 0.4, 0.5]
breakingpoints = [6, 7, 8, 9, 10]
dynamic = []
i = 0
startingPoint = 5
for x in step:
    # print(breakingpoints[i])
    rng = np.arange(startingPoint, breakingpoints[i], x)
    dynamic.append(rng)
    i += 1
    # print(rng[-1])
    startingPoint = rng[-1]
print(dynamic)
import numpy as np
from numpy import array
from scipy.integrate import odeint
from scipy.optimize import curve_fit

def derivative(X, t, A, B, C, D):
    x, y = X
    dotx = x * (A - B * y)
    doty = y * (-D + C * x)
    return np.array([dotx, doty])

def integration(t, A, B, C, D, X0):
    res = odeint(derivative, X0, t, args=(A, B, C, D))
    return res
X0 = [30, 4]
X = array([[30. , 4. ],
[47.2, 6.1],
[70.2, 9.8],
[77.4, 35.2],
[36.3, 59.4],
[20.6, 41.7],
[18.1, 19. ],
[21.4, 13. ],
[22. , 8.3],
[25.4, 9.1],
[27.1, 7.4],
[40.3, 8. ],
[57. , 12.3],
[76.6, 19.5],
[52.3, 45.7],
[19.5, 51.1],
[11.2, 29.7],
[ 7.6, 15.8],
[14.6, 9.7],
[16.2, 10.1],
[24.7, 8.6]])
t = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0]
XData = t
YData = X
curve_fit(integration,XData,YData)
So X is my data; the first column is species x and the second column is species y.
I tried to infer the parameters of this Lotka-Volterra model using odeint and curve_fit.
The error says "not enough values to unpack (expected 2, got 1)".
I am actually not even sure whether I should infer the parameters this way.
Can anyone help me with this? Are there any better methods of inferring the parameters?
Thanks in advance!
Note that ydata is required to be a flat array. While it is strongly suggested that xdata contains one input value or vector per element of ydata, there is no requirement for it. xdata is a constant that could also have been passed some other way. It is there just for the convenience in standard regression tasks.
Thus it is also no problem to have ydata twice as long as xdata. Just apply .flatten() to the 2-dimensional arrays.
Next, the parameter list has to be a list of scalars, so add Y0 and pass the initial vector [X0,Y0].
Together these corrections lead to a result, which however is not very convincing.
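A minimal sketch of that straightforward (single-shooting) version, with the initial state folded into the parameter list and purely illustrative starting guesses in p0, could look like this:
def integration(t, A, B, C, D, X0, Y0):
    res = odeint(derivative, [X0, Y0], t, args=(A, B, C, D))
    return res.flatten()

# rough, illustrative starting guesses for A, B, C, D, X0, Y0
p0 = [1.0, 0.1, 0.1, 1.0, 30.0, 4.0]
params, cov = curve_fit(integration, t, X.flatten(), p0=p0)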
I got a better result, but still not perfect, in using a multiple shooting approach, taking the points in X[:-1] and integrating for the time step 1, comparing the collected list of end-points to X[1:]. This works better in finding parameters that match amplitude and frequency, but produces a slight speed difference that looks better with a 3% correction of the coefficients.
One would probably need a mix of both approaches to get the local as well as global characteristics respected.
And indeed it works, giving parameters
A,B = 0.5215206964006734, 0.02567364947581818
C,D = 0.02493663631623848, 0.8476224408838039
X0,Y0 = 34.53872014350661, 4.653177640949391
Code for that complex fitting program: For the residual computation, first encapsulate the solver to avoid repetition of solver parameters. Then use that to first integrate over the full interval with the variable initial point, and then over the time step 1 segments.
def solver(XY, t, para):
    return odeint(derivative, XY, t, args=para, atol=1e-8, rtol=1e-11)

def integration(XY_arr, *para):
    XY0 = para[4:]
    para = para[:4]
    T = np.arange(len(XY_arr))
    # integrate over the full time range from the (fitted) initial point
    res0 = solver(XY0, T, para)
    # integrate each observed point forward by one time step
    res1 = [solver(XY, [t, t + 1], para)[-1]
            for t, XY in enumerate(XY_arr[:-1])]
    return np.concatenate([res0, res1]).flatten()
This obviously needs the reference array prepared in a similar fashion:
XData = X
YData = np.concatenate([ X,X[1:]]).flatten()
p0 =[ 0.5215, 0.02567,
0.02493, 0.8476,
34.53, 4.653]
After that, the curve fitting procedure call remains the same; all the changes happened before it:
params, info = curve_fit(integration,XData,YData,p0=p0)
XY0, para = params[4:], params[:4]
print(XY0,tuple(para))
t_plot = np.linspace(0,len(X),500)
x_plot = solver(XY0, t_plot, tuple(para))
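To visualize the result, one could then plot the data points against the fitted trajectories, for example:
import matplotlib.pyplot as plt

plt.plot(t, X[:, 0], "ro", label="x data")
plt.plot(t, X[:, 1], "bo", label="y data")
plt.plot(t_plot, x_plot[:, 0], "r-", label="x fit")
plt.plot(t_plot, x_plot[:, 1], "b-", label="y fit")
plt.legend()
plt.show()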
I'm trying to compute the standard deviation of a list vr. The list has 32 elements, each an array of size 3980; each array holds one value per height (3980 heights).
First I split the data into 15 minute chunks, where the minutes are given in raytimes. raytimes is also a list of size 32 (containing just the time of each observation in vr).
I want the standard deviation computed at each height level, so that I end up with one final array of size 3980. That part works in my code. However, the code does not produce the correct standard deviation values when I test it: the values output to w1sd, w2sd etc. are not correct (although the arrays do have the correct size of 3980 elements). I assume I am mixing up the indices when computing the standard deviation.
Below are example values from the dataset. All data should fall into w1 and w1sd, as the raytimes provided in this example are all within 15 minutes (< 0.25). I want to compute the standard deviation of the first element of vr, that is, the standard deviation of [2.0, 3.1, 2.1], then of the second element, i.e. [3.1, 4.1, nan], and so on.
The result for w1sd SHOULD BE [0.497, 0.499, 1.0, 7.5], but instead the code below gives w1sd = [0.497, 0.77, 1.31, 5.301]. Is something wrong with nanstd or with my indexing?
from numpy import nan, nanstd

vr = [
    [2.0, 3.1, 4.1, nan],
    [3.1, 4.1, nan, 5.1],
    [2.1, nan, 6.1, 20.1]
]
Height = [10.0, 20.0, 30.0, 40]
raytimes = [0, 0.1, 0.2]

w1, w2, w3, w4 = [], [], [], []
w1sd, w2sd, w3sd, w4sd = [], [], [], []

for j, h in enumerate(Height):
    for i, t in enumerate(raytimes):
        if raytimes[i] < 0.25:
            w1.append(float(vr[i][j]))
        elif 0.25 <= raytimes[i] < 0.5:
            w2.append(float(vr[i][j]))
        elif 0.5 <= raytimes[i] < 0.75:
            w3.append(float(vr[i][j]))
        else:
            w4.append(float(vr[i][j]))
    w1sd.append(round(nanstd(w1), 3))
    w2sd.append(round(nanstd(w2), 3))
    w3sd.append(round(nanstd(w3), 3))
    w4sd.append(round(nanstd(w4), 3))

w1 = []
w2 = []
w3 = []
w4 = []
I would consider using pandas for this. It is a library that allows for efficient processing of datasets in numpy arrays and takes all the looping and indexing out of your hands.
In this case I would define a dataframe with N_raytimes rows and N_Height columns, which allows you to easily slice and aggregate the data any way you like.
This code gives the expected output.
import pandas as pd
import numpy as np
vr = [
[2.0, 3.1, 4.1, np.nan],
[3.1, 4.1, np.nan, 5.1],
[2.1, np.nan, 6.1, 20.1]
]
Height = [10.0, 20.0, 30.0, 40]
raytimes = [0, 0.1, 0.2]
# Define a dataframe with the data
df = pd.DataFrame(vr, columns=Height, index=raytimes)
df.columns.name = "Height"
df.index.name = "raytimes"
# Split it out (this could be more elegant)
w1 = df[df.index < 0.25]
w2 = df[(df.index >= 0.25) & (df.index < 0.5)]
w3 = df[(df.index >= 0.5) & (df.index < 0.75)]
w4 = df[df.index >= 0.75]
# Compute standard deviations
w1sd = w1.std(axis=0, ddof=0).values
w2sd = w2.std(axis=0, ddof=0).values
w3sd = w3.std(axis=0, ddof=0).values
w4sd = w4.std(axis=0, ddof=0).values
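For the example data above, this yields the per-height population standard deviations (essentially the values the question expects; pandas skips NaNs by default), while w2sd to w4sd come out as all-NaN because those time windows are empty here:
print(w1sd.round(3))  # per-height values: 0.497, 0.5, 1.0, 7.5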
I have a hypothetical y as a function of x and am trying to find/fit a lognormal distribution curve that best fits the data. I am using the curve_fit function and was able to fit a normal distribution, but the curve does not look optimized.
Below are the given y and x data points, where y = f(x).
y_axis = [0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05]
The y-axis values are probabilities of an event occurring in the x-axis time bins:
x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]
I was able to get a better fit on my data using Excel and a lognormal approach. When I attempt to use lognormal in Python, the fit does not work, so I am doing something wrong.
Below is the code I have for fitting a normal distribution, which seems to be the only one that I can fit in Python (hard to believe):
# fitting distribution on top of Savitzky-Golay
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import scipy
import scipy.stats
import numpy as np
from scipy.stats import norm, gamma, lognorm, halflogistic, foldcauchy
from scipy.optimize import curve_fit
matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')
# results from savgol
x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]
y_axis = [0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05]
## y_axis values must be normalised
sum_ys = sum(y_axis)
# normalize to 1
y_axis = [_/sum_ys for _ in y_axis]
# def gamma_f(x, a, loc, scale):
# return gamma.pdf(x, a, loc, scale)
def norm_f(x, loc, scale):
    # print('loc: ', loc, 'scale: ', scale, "\n")
    return norm.pdf(x, loc, scale)
fitting = norm_f
# param_bounds = ([-np.inf,0,-np.inf],[np.inf,2,np.inf])
result = curve_fit(fitting, x_axis, y_axis)
result_mod = result
# mod scale
# results_adj = [result_mod[0][0]*.75, result_mod[0][1]*.85]
plt.plot(x_axis, y_axis, 'ro')
plt.bar(x_axis, y_axis, 1, alpha=0.75)
plt.plot(x_axis, [fitting(_, *result[0]) for _ in x_axis], 'b-')
plt.axis([0,35,0,.1])
# convert back into probability
y_norm_fit = [fitting(_, *result[0]) for _ in x_axis]
y_fit = [_*sum_ys for _ in y_norm_fit]
print(list(y_fit))
plt.show()
I am trying to get answers to two questions:
Is this the best fit I will get from a normal distribution curve? How can I improve the fit?
Normal distribution result:
How can I fit a lognormal distribution to this data, or is there a better distribution that I can use?
I was playing around with a lognormal distribution curve, adjusting mu and sigma, and it looks like a better fit is possible. I don't understand what I am doing wrong to get similar results in Python.
Actually, a Gamma distribution might be a good fit, as @Glen_b proposed. I'm using the second definition, with shape α and rate β.
NB: the trick I use for a quick fit is to compute the mean and variance; for a typical two-parameter distribution that is enough to recover the parameters and get a quick idea of whether it is a good fit or not.
Code
import math
import matplotlib.pyplot as plt
y_axis = [0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05]
x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]
## y_axis values must be normalised
sum_ys = sum(y_axis)
# normalize to 1
y_axis = [_/sum_ys for _ in y_axis]
# mean and variance of the (normalised) distribution
m = 0.0
for k in range(0, len(x_axis)):
    m += y_axis[k] * x_axis[k]
v = 0.0
for k in range(0, len(x_axis)):
    t = (x_axis[k] - m)
    v += y_axis[k] * t * t
print(m, v)

# recover Gamma parameters from mean and variance: m = a/b, v = a/b**2
b = m/v
a = m * b
print(a, b)

# Gamma density at each x
z = []
for k in range(0, len(x_axis)):
    q = b**a * x_axis[k]**(a-1.0) * math.exp(-b*x_axis[k]) / math.gamma(a)
    z.append(q)
plt.plot(x_axis, y_axis, 'ro')
plt.plot(x_axis, z, 'b*')
plt.axis([0, 35, 0, .1])
plt.show()
A discrete distribution might look better; your x are all integers, after all. The distribution has a variance about 3 times higher than the mean and is asymmetric, so most likely something like a Negative Binomial might work quite well. Here is a quick fit.
r comes out a bit above 6, so you might want to move to a distribution with real-valued r, the Polya distribution.
Code
from scipy.special import comb  # scipy.misc.comb in older SciPy versions
import matplotlib.pyplot as plt
y_axis = [0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05]
x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]
## y_axis values must be normalised
sum_ys = sum(y_axis)
# normalize to 1
y_axis = [_/sum_ys for _ in y_axis]
s = 1.0  # shift by 1 so the support starts at 0
m = 0.0
for k in range(0, len(x_axis)):
    m += y_axis[k] * (x_axis[k] - s)
v = 0.0
for k in range(0, len(x_axis)):
    t = (x_axis[k] - s - m)
    v += y_axis[k] * t * t
print(m, v)

# recover Negative Binomial parameters from mean and variance
p = 1.0 - m/v
r = int(m*(1.0 - p) / p)
print(p, r)

# Negative Binomial pmf at each (shifted) x
z = []
for k in range(0, len(x_axis)):
    q = comb(k + r - 1, k) * (1.0 - p)**r * p**k
    z.append(q)
plt.plot(x_axis, y_axis, 'ro')
plt.plot(x_axis, z, 'b*')
plt.axis([0, 35, 0, .1])
plt.show()
Note that if a lognormal curve is correct and you take logs of both variables, you should see a quadratic relationship. Even if that's not a suitable scale for a final model (because of variance effects: if your variance is near constant on the original scale, it will overweight the small values), it should at least give a good starting point for a nonlinear fit.
Indeed, aside from the first two points, the log-log plot looks fairly good: a quadratic fit to the solid points would describe that data quite well and should give suitable starting values if you then want to do a nonlinear fit.
(If error in x is at all possible, the lack of fit at the lowest x may be as much an issue of error in x as of error in y.)
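As a rough illustration (not from the original answer), the quadratic fit on the log-log scale, and its mapping to lognormal-style starting values, could be sketched like this, reusing x_axis and y_axis from the question:
import numpy as np

lx = np.log(x_axis)
ly = np.log(y_axis)

# quadratic fit on the log-log scale: ly ~ c2*lx**2 + c1*lx + c0
c2, c1, c0 = np.polyfit(lx, ly, 2)

# for a lognormal density, the coefficient of (log x)**2 is -1/(2*sigma**2)
# and the coefficient of log x is mu/sigma**2 - 1
sigma2 = -1.0 / (2.0 * c2)
mu = sigma2 * (c1 + 1.0)
print(mu, sigma2)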
Incidentally, that plot seems to hint that a gamma curve may fit a little better overall than a lognormal one (in particular if you don't want to reduce the impact of those first two points relative to points 4-6). A good initial fit for that can be had by regressing log(y) on x and log(x):
The scaled gamma density is g = c * x^(a-1) * exp(-b*x); taking logs gives log(g) = log(c) + (a-1)*log(x) - b*x = b0 + b1*log(x) + b2*x, so supplying log(x) and x to a linear regression routine will fit that. The same caveats about variance effects apply (so it might be best used as a starting point for a nonlinear least squares fit if your relative error in y isn't nearly constant).
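A minimal sketch of that regression, again just to obtain starting values (the names a_start and b_start are illustrative):
import numpy as np

x = np.asarray(x_axis)
ly = np.log(y_axis)

# design matrix for log(g) = b0 + b1*log(x) + b2*x
A = np.column_stack([np.ones_like(x), np.log(x), x])
b0, b1, b2 = np.linalg.lstsq(A, ly, rcond=None)[0]

# map back to gamma-style parameters: shape a = b1 + 1, rate b = -b2
a_start, b_start = b1 + 1.0, -b2
print(a_start, b_start)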
In Python, I explained a trick here for fitting a LogNormal very simply using the OpenTURNS library:
import numpy as np
import openturns as ot

N = 10**4  # N is not defined in the original snippet; assumed here as a large total sample size
n_times = [int(y_axis[i] * N) for i in range(len(y_axis))]
S = np.repeat(x_axis, n_times)
sample = ot.Sample([[p] for p in S])
fitdist = ot.LogNormalFactory().buildAsLogNormal(sample)
That's it!
print(fitdist) will show you >>> LogNormal(muLog = 2.92142, sigmaLog = 0.305, gamma = -6.24996)
and the fitting seems good:
import matplotlib.pyplot as plt
plt.hist(S, density=True, color='grey', bins=34, alpha=0.5)
plt.scatter(x_axis, y_axis, color='red')
plt.plot(x_axis, fitdist.computePDF(ot.Sample([[p] for p in x_axis])), color='black')
plt.show()
I wanted to use the built-in range function for floats, but apparently it doesn't work, and from a quick search I understood that there isn't a built-in option for that and that I'll need to code my own function. So I did:
def fltrange(mini, maxi, step):
    lst = []
    while mini < maxi:
        lst.append(mini)
        mini += step
    return lst
rang = fltrange(-20.0, 20.1, 0.1)
print(rang)
input()
but this is what I get: the step should be just 0.1000000..., but instead it's about (it sometimes changes) 0.100000000000001.
Thanks in advance.
Fun fact: 1/10 can't be exactly represented by floating point numbers. The closest you can get is 0.1000000000000000055511151231257827021181583404541015625. The rightmost digits usually get left out when you print them, but they're still there. This explains the accumulation of errors as you continually add more 0.1s to the sum.
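You can see the stored value for yourself:
from decimal import Decimal

print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625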
You can eliminate some inaccuracy (but not all of it) by using a multiplication approach instead of a cumulative sum:
def fltrange(mini, maxi, step):
    lst = []
    width = maxi - mini
    num_steps = int(width/step)
    for i in range(num_steps):
        lst.append(mini + i*step)
    return lst
rang = fltrange(-20.0, 20.1, 0.1)
print(rang)
Result (newlines added by me for clarity):
[-20.0, -19.9, -19.8, -19.7, -19.6, -19.5, -19.4, -19.3, -19.2, -19.1,
-19.0, -18.9, -18.8, -18.7, -18.6, -18.5, -18.4, -18.3, -18.2, -18.1,
-18.0, -17.9, -17.8, -17.7, -17.6, -17.5, -17.4, -17.3, -17.2, -17.1,
-17.0, -16.9, -16.8, -16.7, -16.6, -16.5, -16.4, -16.3, -16.2, -16.1,
-16.0, -15.899999999999999, -15.8, -15.7, -15.6, -15.5, -15.399999999999999, -15.3, -15.2, -15.1, -15.0,
...
19.1, 19.200000000000003, 19.300000000000004, 19.400000000000006, 19.5, 19.6, 19.700000000000003, 19.800000000000004, 19.900000000000006, 20.0]
You can use numpy for it. There are a few functions for your needs.
import numpy as np # of course :)
linspace:
np.linspace(1, 10, num=200)
array([ 1. , 1.04522613, 1.09045226, 1.13567839,
1.18090452, 1.22613065, 1.27135678, 1.31658291,
...
9.68341709, 9.72864322, 9.77386935, 9.81909548,
9.86432161, 9.90954774, 9.95477387, 10. ])
arange:
np.arange(1., 10., 0.1)
array([ 1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2. ,
2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3. , 3.1,
...
8.7, 8.8, 8.9, 9. , 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7,
9.8, 9.9])
P.S. Note, however, that these return arrays rather than a lazy generator object like Python 3's range (xrange in Python 2.x).
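If you do want lazy, range-like behaviour for floats, a minimal generator sketch (not part of numpy; multiplication-based to limit error accumulation) could be:
def frange(start, stop, step):
    # yields start, start + step, ... while staying below stop
    i = 0
    while start + i * step < stop:
        yield start + i * step
        i += 1

print(list(frange(5.0, 10.0, 0.5)))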