calculating Gini coefficient in Python/numpy

I'm calculating the Gini coefficient (similar to: Python - Gini coefficient calculation using Numpy) but I get an odd result. For a uniform distribution sampled from np.random.rand(), the Gini coefficient is 0.3, but I would have expected it to be close to 0 (perfect equality). What is going wrong here?
import numpy as np
import matplotlib.pyplot as plt

def G(v):
    bins = np.linspace(0., 100., 11)
    total = float(np.sum(v))
    yvals = []
    for b in bins:
        bin_vals = v[v <= np.percentile(v, b)]
        bin_fraction = (np.sum(bin_vals) / total) * 100.0
        yvals.append(bin_fraction)
    # perfect equality area
    pe_area = np.trapz(bins, x=bins)
    # lorenz area
    lorenz_area = np.trapz(yvals, x=bins)
    gini_val = (pe_area - lorenz_area) / float(pe_area)
    return bins, yvals, gini_val
v = np.random.rand(500)
bins, result, gini_val = G(v)
plt.figure()
plt.subplot(2, 1, 1)
plt.plot(bins, result, label="observed")
plt.plot(bins, bins, '--', label="perfect eq.")
plt.xlabel("fraction of population")
plt.ylabel("fraction of wealth")
plt.title("GINI: %.4f" %(gini_val))
plt.legend()
plt.subplot(2, 1, 2)
plt.hist(v, bins=20)
For the given set of numbers, the above code calculates the fraction of the total distribution's values that fall in each percentile bin.
In the resulting plot, the Lorenz curve bends away from the perfect-equality line, which seems off: a uniform distribution should be near "perfect equality".

This is to be expected. A random sample from a uniform distribution does not result in uniform values (i.e. values that are all relatively close to each other). With a little calculus, it can be shown that the expected value (in the statistical sense) of the Gini coefficient of a sample from the uniform distribution on [0, 1] is 1/3, so getting values around 1/3 for a given sample is reasonable.
You'll get a lower Gini coefficient with a sample such as v = 10 + np.random.rand(500). Those values are all close to 10.5; the relative variation is lower than the sample v = np.random.rand(500).
In fact, the expected value of the Gini coefficient for the sample base + np.random.rand(n) is 1/(6*base + 3).
Here's a simple implementation of the Gini coefficient. It uses the fact that the Gini coefficient is half the relative mean absolute difference.
def gini(x):
    # (Warning: This is a concise implementation, but it is O(n**2)
    # in time and memory, where n = len(x). *Don't* pass in huge
    # samples!)
    # Mean absolute difference
    mad = np.abs(np.subtract.outer(x, x)).mean()
    # Relative mean absolute difference
    rmad = mad/np.mean(x)
    # Gini coefficient
    g = 0.5 * rmad
    return g
(For some more efficient implementations, see More efficient weighted Gini coefficient in Python)
Here's the Gini coefficient for several samples of the form v = base + np.random.rand(500):
In [80]: v = np.random.rand(500)
In [81]: gini(v)
Out[81]: 0.32760618249832563
In [82]: v = 1 + np.random.rand(500)
In [83]: gini(v)
Out[83]: 0.11121487509454202
In [84]: v = 10 + np.random.rand(500)
In [85]: gini(v)
Out[85]: 0.01567937753659053
In [86]: v = 100 + np.random.rand(500)
In [87]: gini(v)
Out[87]: 0.0016594595244509495
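For larger samples, the same quantity can be computed in O(n log n) time by sorting instead of forming all pairwise differences. The following is a minimal sketch, assuming non-negative values with a positive sum; the name gini_sorted is made up for illustration and does not come from any library.

```python
import numpy as np

def gini_sorted(x):
    """Gini coefficient via sorting; equals half the relative mean absolute difference."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    # Ranks 1..n of the sorted values
    i = np.arange(1, n + 1)
    # sum_i (2i - n - 1) * x_(i) equals the sum of |x_i - x_j| over all pairs i < j
    return np.sum((2 * i - n - 1) * x) / (n * np.sum(x))

# Compare sample values with the formula 1/(6*base + 3)
rng = np.random.default_rng(0)
for base in (0, 1, 10, 100):
    v = base + rng.random(500)
    print(base, gini_sorted(v), 1 / (6 * base + 3))
```

The printed pairs should be close to each other but not identical, since each sample is random.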

A slightly faster implementation (using numpy vectorization and only computing each difference once):
def gini_coefficient(x):
    """Compute Gini coefficient of array of values"""
    diffsum = 0
    for i, xi in enumerate(x[:-1], 1):
        diffsum += np.sum(np.abs(xi - x[i:]))
    return diffsum / (len(x)**2 * np.mean(x))
Note: x must be a numpy array.

The Gini coefficient is derived from the Lorenz curve: it is twice the area between the line of perfect equality and the Lorenz curve. It is usually calculated to analyze the distribution of income in a population. https://github.com/oliviaguest/gini provides a simple implementation in Python.

A quick note on the original methodology:
When calculating Gini coefficients directly from areas under curves with np.trapz or another integration method, the first value of the Lorenz curve needs to be 0 so that the area between the origin and the second value is accounted for. The following changes to G(v) fix this:
yvals = [0]
for b in bins[1:]:
I also discussed this issue in this answer, where including the origin in those calculations provides an equivalent answer to using the other methods discussed here (which do not need 0 to be appended).
In short, when calculating Gini coefficients directly using integration, start from the origin. If using the other methods discussed here, then it's not needed.
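To illustrate the point about starting from the origin, here is a minimal sketch of an integration-based calculation. It is a variation on G(v), not the original function: it uses every sorted sample instead of 11 percentile bins, and the trapezoidal rule is written out by hand so that the sketch does not depend on a particular NumPy integration function. The name gini_from_lorenz is hypothetical.

```python
import numpy as np

def gini_from_lorenz(v):
    """Gini coefficient from the area under the Lorenz curve, starting at (0, 0)."""
    v = np.sort(np.asarray(v, dtype=float))
    n = v.size
    # Population fractions 0, 1/n, ..., 1 and cumulative wealth fractions (origin included)
    x = np.linspace(0.0, 1.0, n + 1)
    y = np.concatenate([[0.0], np.cumsum(v) / np.sum(v)])
    # Trapezoidal rule written out by hand
    lorenz_area = np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0
    # The area under the line of perfect equality is 1/2
    return (0.5 - lorenz_area) / 0.5

print(gini_from_lorenz(np.random.rand(500)))  # expected to land near 1/3
```

With the origin included, this agrees with the pairwise-difference implementations for the same sample.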

Note that a Gini index is also available in scikit-bio as skbio.diversity.alpha.gini_index. It might give slightly different results for the examples mentioned above.

You are getting the right answer. The Gini Coefficient of the uniform distribution is not 0 "perfect equality", but (b-a) / (3*(b+a)). In your case, b = 1, and a = 0, so Gini = 1/3.
The only distributions with perfect equality are the Kronecker and the Dirac deltas. Remember that equality means "all the same", not "all equally probable".
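The closed form can be checked from the Lorenz curve. A short derivation, assuming X is uniform on [a, b] with 0 <= a < b; the symbols Q (quantile function), \mu (mean) and L (Lorenz curve) are notation introduced here:

```latex
\begin{align*}
Q(p) &= a + (b - a)\,p, \qquad \mu = \frac{a + b}{2} \\
L(p) &= \frac{1}{\mu} \int_0^p Q(t)\,dt = \frac{a\,p + \tfrac{1}{2}(b - a)\,p^2}{\mu} \\
\int_0^1 L(p)\,dp &= \frac{\tfrac{a}{2} + \tfrac{b - a}{6}}{\mu} = \frac{2a + b}{3\,(a + b)} \\
G &= 1 - 2 \int_0^1 L(p)\,dp = \frac{b - a}{3\,(a + b)}
\end{align*}
```

Setting a = 0 and b = 1 gives G = 1/3.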

There were some issues with the previous implementations. They never gave the gini index = 1 for perfectly sparse data.
example:
def gini_coefficient(x):
    """Compute Gini coefficient of array of values"""
    diffsum = 0
    for i, xi in enumerate(x[:-1], 1):
        diffsum += np.sum(np.abs(xi - x[i:]))
    return diffsum / (len(x)**2 * np.mean(x))
gini_coefficient(np.array([0, 0, 1]))
gives the answer 0.666666. That happens because of the implied "integration scheme" it uses.
Here is another variant that bypasses the issue, although it is computationally heavier:
import numpy as np
from scipy.interpolate import interp1d

def gini(v, n_new = 1000):
    """Compute Gini coefficient of array of values"""
    v_abs = np.sort(np.abs(v))
    cumsum_v = np.cumsum(v_abs)
    n = len(v_abs)
    vals = np.concatenate([[0], cumsum_v/cumsum_v[-1]])
    x = np.linspace(0, 1, n+1)
    f = interp1d(x=x, y=vals, kind='previous')
    xnew = np.linspace(0, 1, n_new+1)
    dx_new = 1/(n_new)
    vals_new = f(xnew)
    return 1 - 2 * np.trapz(y=vals_new, x=xnew, dx=dx_new)
gini(np.array([0, 0, 1]))
It gives 0.999, which is closer to what one wants =)
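A lighter-weight alternative, using a different technique from the interpolation above, is the small-sample correction: for n values with a single nonzero entry the uncorrected coefficient is (n - 1)/n, so multiplying by n/(n - 1) rescales the maximum to 1. The sketch below assumes non-negative values with a positive sum; gini_corrected is a name made up for illustration.

```python
import numpy as np

def gini_corrected(x):
    """Gini coefficient rescaled by n / (n - 1) so that its maximum is 1."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    if n < 2:
        raise ValueError("need at least two values")
    # Sorted-rank form of the pairwise-difference Gini coefficient
    i = np.arange(1, n + 1)
    g = np.sum((2 * i - n - 1) * x) / (n * np.sum(x))
    # Small-sample correction
    return g * n / (n - 1)

print(gini_corrected(np.array([0, 0, 1])))
```

Whether the corrected or uncorrected value is appropriate depends on whether the data are treated as a sample or as the whole population.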

Related

How can I make this python code run faster?

I have two questions:
1- This code takes too long to execute. Any idea how I can make it faster?
With the code below I want to generate 100 random discrete values between 700 and 1200.
I chose the Weibull distribution because I wanted to generate failure-rate data; please see the histogram below.
import random
nums = []
alpha = 0.6
beta = 0.4
while len(nums) != 100:
    temp = int(random.weibullvariate(alpha, beta))
    if 700 <= temp < 1200:
        nums.append(temp)
print(nums)
# plotting a graph
#plt.hist(nums, bins = 200)
#plt.show()
I wanted to generate a histogram like this one:
[Figure: Histogram]
2- I have this function for the discrete Weibull distribution:
def DiscreteWeibull(q, b, x):
    return q**(x**b) - q**((x + 1)**b)
How can I generate random values that follow this distribution?

Since a Weibull variate W with shape parameter K and scale parameter lambda can be written as a function of a Uniform(0, 1) variate U, namely W = lambda * (-ln(U))^(1/K), we can 'cut' the distribution to a desired minimum and maximum value. We do this by inverting the equation, setting W to 700 and to 1200, and finding the corresponding values of U between 0 and 1. Drawing U uniformly between those two values then yields W between 700 and 1200. Here's some sample code.
import math
import random
import matplotlib.pyplot as plt

def weibull_from_uniform(shape, scale, x):
    # x = 0 is excluded because log(0) is undefined
    assert 0 < x <= 1
    return scale * pow(-1 * math.log(x), 1.0 / shape)

scale_param = 0.6
shape_param = 0.4
min_value = 700.0
max_value = 1200.0
lower_bound = math.exp(-1 * pow(min_value / scale_param, shape_param))
upper_bound = math.exp(-1 * pow(max_value / scale_param, shape_param))
if lower_bound > upper_bound:
    lower_bound, upper_bound = upper_bound, lower_bound
nums = []
while len(nums) < 100:
    nums.append(weibull_from_uniform(shape_param, scale_param, random.uniform(lower_bound, upper_bound)))
print(nums)
plt.hist(nums, bins=8)
plt.show()
This code gives a histogram very similar to the one you provided; the method draws values from the same distribution as your original method (before the int() truncation), just faster, because no samples are rejected. Because the map from U to W is monotonic for any shape parameter K > 0, a single uniform interval is enough even when K > 1, where the Weibull density increases to a mode and then decreases.
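The same inversion can also be written in vectorized form. This is a sketch under the same parameter values as above (shape 0.4, scale 0.6); the function name truncated_weibull and its arguments are made up for illustration, and it returns floats rather than the integers of the original rejection loop.

```python
import numpy as np

def truncated_weibull(shape, scale, low, high, size, rng=None):
    """Draw Weibull(shape, scale) values restricted to [low, high] by inversion."""
    rng = np.random.default_rng() if rng is None else rng
    # Survival function S(w) = exp(-(w / scale) ** shape), evaluated at both cut points
    u_max = np.exp(-((low / scale) ** shape))
    u_min = np.exp(-((high / scale) ** shape))
    # Uniform draws between the two bounds map back to values between low and high
    u = rng.uniform(u_min, u_max, size)
    return scale * (-np.log(u)) ** (1.0 / shape)

nums = truncated_weibull(shape=0.4, scale=0.6, low=700.0, high=1200.0, size=100)
print(nums.min(), nums.max())
```

Applying np.floor(nums).astype(int) afterwards would mimic the int() conversion in the question.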

Your question is not very clear on why you thought using this Weibull distribution was a good idea, nor what distribution you are looking to achieve.
Discrete uniform distribution
Here are two ways to achieve the discrete uniform distribution on [700, 1200).
1) With random
import random
nums = [random.randrange(700, 1200) for _ in range(100)]
2) With numpy
import numpy
nums = numpy.random.randint(700, 1200, 100)
Geometric distribution
You have edited your question with an example histogram, and the mention "I wanted to generate a histogram like this one". The histogram vaguely looks like a geometric distribution.
We can use numpy.random.geometric:
import numpy
n_samples = 100
p = 0.5
a, b = 50, 650
cap = 1200
nums = numpy.random.geometric(p, size = 2 * n_samples) * a + b
nums = nums[numpy.where(nums < cap)][:n_samples]

Is there a DP solution for my subset average problem?

I have a combinatorics problem that I can't solve.
Given a set of vectors and a target vector, return a scalar for each vector, so that the average of the scaled vectors in the set is closest to the target.
Edit: Weights w_i are in range [0, 1]. This is a constrained optimisation problem:
minimise d(avg(w_i * x_i), target)
subject to sum(w_i) - 1 = 0
If I had to name this problem, it would be unbounded subset average.
I have looked at the unbounded knapsack and similar problems, but a dynamic programming implementation seems to be impossible due to the interdependence of the numbers.
I also implemented a genetic algorithm that is able to approximate the weights moderately well, but it takes too long, and I was initially hoping to solve the problem using dynamic programming.
Is there any hope?

Visualization
In a 2D space, the solution to the problem can be represented as in the plots produced by the code below.
Problem class identification
As recognized by others, this is an optimization problem. You have linear constraints and a convex objective function, so it can be cast as a quadratic program (read the Least squares section).
Casting to standard form
The average of w[i] * x[i] is sum(w[i] * x[i]) / N_vectors. If you arrange the w[i] as the elements of a (1 x N_vectors) matrix, and each vector x[i] as the i-th row of a (N_vectors x DIM) matrix X, it becomes w @ X / N_vectors (with @ being the matrix product operator).
To cast the constraints to the standard form A*x <= b (where the unknown x is our w), you would have to construct a matrix A with one row per constraint: each -w[i] <= 0 gets a row, and the equality sum(w) = 1 becomes the pair sum(w) <= 1 and -sum(w) <= -1. But there are amazing tools to automate this part.
Implementation
This can be readily implemented using cvxpy, and you don't have to care about expanding all the constraints.
The following function solves the problem and if the vectors have dimension 2 plot the result.
import cvxpy
import numpy as np
import matplotlib.pyplot as plt

def place_there(X, target):
    # some linear algebra arrangements
    target = target.reshape((1, -1))
    ncols = target.shape[1]
    X = np.array(X).reshape((-1, ncols))
    N_vectors = X.shape[0]
    # variable of the problem
    w = cvxpy.Variable((1, X.shape[0]))
    # solve the problem with the objective of minimizing the norm of w @ X / N - T (@ is the matrix product)
    P = cvxpy.Problem(cvxpy.Minimize(cvxpy.norm((w @ X) / N_vectors - target)), [w >= 0, cvxpy.sum(w) == 1])
    # here it is solved
    print('Distance from target is: ', P.solve())
    # show the solution in a nice plot
    # w.value is the w that gave the optimal solution
    Y = w.value.transpose() * X / N_vectors
    path = np.zeros((X.shape[0] + 1, 2))
    path[1:, :] = np.cumsum(Y, axis=0)
    randColors=np.random.rand( 3* X.shape[0], 3).reshape((-1, 3)) * 0.7
    plt.quiver(path[:-1,0], path[:-1, 1], Y[:, 0], Y[:, 1], color=randColors, angles='xy', scale_units='xy', scale=1)
    plt.plot(target[:, 0], target[:, 1], 'or')
And you can run it like this:
target = np.array([[1.234, 0.456]])
plt.figure(figsize=(12, 4))
for i in [1,2,3]:
    X = np.random.randn(20) * 100
    plt.subplot(1,3,i)
    place_there(X, target)
    plt.xlim([-3, 3])
    plt.ylim([-3, 3])
    plt.grid()
plt.show()
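As a dependency-light cross-check of the same formulation, the problem can also be approached with projected gradient descent using only NumPy. This is a swapped-in technique, not what cvxpy does internally: it minimizes the squared norm (which has the same minimizer) and projects onto the set {w >= 0, sum(w) = 1} after every step. The names project_to_simplex and simplex_weights are made up for illustration, and the sketch assumes X is not all zeros.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    # Largest index for which the shifted sorted value stays positive
    rho = np.nonzero(u - css / k > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def simplex_weights(X, target, n_iter=5000):
    """Minimise ||w @ X / N - target||^2 over the simplex by projected gradient descent."""
    X = np.asarray(X, dtype=float)
    target = np.asarray(target, dtype=float).ravel()
    N = X.shape[0]
    A = X.T / N  # maps w to the average of the scaled vectors
    # Step size 1/L, where L is the Lipschitz constant of the gradient
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)
    w = np.full(N, 1.0 / N)  # start from uniform weights
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ w - target)
        w = project_to_simplex(w - step * grad)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2)) * 100
target = np.array([1.234, 0.456])
w = simplex_weights(X, target)
print(np.linalg.norm(w @ X / 10 - target))
```

The fixed iteration count is a simplification; a convergence test on the change in w would be a natural refinement.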

Smooth a curve in Python while preserving the value and slope at the end points

I actually have two solutions to this problem; both are applied below to a test case. The thing is that neither of them is perfect: the first one only takes into account the two end points, and the other one can't be made "arbitrarily smooth": there is a limit to the amount of smoothness one can achieve (the one I am showing).
I am sure there is a better solution, one that goes from the first solution to the other and all the way to no smoothing at all. It may already be implemented somewhere. Maybe solving a minimization problem with an arbitrary number of equidistributed splines?
Thank you very much for your help.
PS: the seed used is a challenging one.
import matplotlib.pyplot as plt
from scipy import interpolate
from scipy.signal import savgol_filter
import numpy as np
import random
def scipy_bspline(cv, n=100, degree=3):
    """ Calculate n samples on a bspline
    cv : Array of control vertices
    n : Number of samples to return
    degree: Curve degree
    """
    cv = np.asarray(cv)
    count = cv.shape[0]
    degree = np.clip(degree,1,count-1)
    kv = np.clip(np.arange(count+degree+1)-degree,0,count-degree)
    # Return samples (non-periodic curve)
    max_param = count - degree
    spl = interpolate.BSpline(kv, cv, degree)
    return spl(np.linspace(0,max_param,n))

def round_up_to_odd(f):
    return int(np.ceil(f / 2.) * 2 + 1)
def generateRandomSignal(n=1000, seed=None):
    """
    Parameters
    ----------
    n : integer, optional
        Number of points in the signal. The default is 1000.
    Returns
    -------
    sig : numpy array
    """
    np.random.seed(seed)
    print("Seed was:", seed)
    steps = np.random.choice(a=[-1, 0, 1], size=(n-1))
    roughSig = np.concatenate([np.array([0]), steps]).cumsum(0)
    sig = savgol_filter(roughSig, round_up_to_odd(n/10), 6)
    return sig
# Generate a random signal to illustrate my point
n = 1000
t = np.linspace(0, 10, n)
seed = 45136  # Challenging seed (must be an integer)
sig = generateRandomSignal(n=1000, seed=seed)
sigInit = np.copy(sig)
# Add noise to the signal
mean = 0
std = sig.max()/3.0
num_samples = n // 5  # integer division so the values can be used as indices
idxMin = n // 2 - 100
idxMax = idxMin + num_samples
tCut = t[idxMin+1:idxMax]
noise = np.random.normal(mean, std, size=num_samples-1) + 2*std*np.sin(2.0*np.pi*tCut/0.4)
sig[idxMin+1:idxMax] += noise
# Define filtering range enclosing the noisy area of the signal
idxMin -= 20
idxMax += 20
# Extreme filtering solution
# Spline between first and last points, the points in between have no influence
sigTrim = np.delete(sig, np.arange(idxMin,idxMax))
tTrim = np.delete(t, np.arange(idxMin,idxMax))
f = interpolate.interp1d(tTrim, sigTrim, kind='quadratic')
sigSmooth1 = f(t)
# My attempt. Not bad but not perfect because there is a limit in the maximum
# amount of smoothing we can add (degree=len(tSlice) is the maximum)
# If I could do degree=10*len(tSlice) and converging to the first solution
# I would be done!
sigSlice = sig[idxMin:idxMax]
tSlice = t[idxMin:idxMax]
cv = np.stack((tSlice, sigSlice)).T
p = scipy_bspline(cv, n=len(tSlice), degree=len(tSlice))
tSlice = p.T[0]
sigSliceSmooth = p.T[1]
sigSmooth2 = np.copy(sig)
sigSmooth2[idxMin:idxMax] = sigSliceSmooth
# Plot
plt.figure()
plt.plot(t, sig, label="Signal")
plt.plot(t, sigSmooth1, label="Solution 1")
plt.plot(t, sigSmooth2, label="Solution 2")
plt.plot(t[idxMin:idxMax], sigInit[idxMin:idxMax], label="What I'd want (kind of, smoother will be even better actually)")
plt.plot([t[idxMin],t[idxMax]], [sig[idxMin],sig[idxMax]],"o")
plt.legend()
plt.show()

Yes, a minimization is a good way to approach this smoothing problem.
Least squares problem
Here is a suggestion for a least squares formulation: let s[0], ..., s[N] denote the N+1 samples of the given signal to smooth, and let L and R be the desired slopes to preserve at the left and right endpoints. Find the smoothed signal u[0], ..., u[N] as the minimizer of
min_u (1/2) sum_n (u[n] - s[n])² + (λ/2) sum_n (u[n+1] - 2 u[n] + u[n-1])²
subject to
s[0] = u[0], s[N] = u[N] (value constraints),
L = u[1] - u[0], R = u[N] - u[N-1] (slope constraints),
where in the minimization objective, the sums are over n = 1, ..., N-1 and λ is a positive parameter controlling the smoothing strength. The first term tries to keep the solution close to the original signal, and the second term penalizes u for bending to encourage a smooth solution.
The slope constraints require that
u[1] = L + u[0] = L + s[0] and u[N-1] = u[N] - R = s[N] - R. So we can consider the minimization as over only the interior samples u[2], ..., u[N-2].
Finding the minimizer
The minimizer satisfies the Euler–Lagrange equations
(u[n] - s[n]) / λ + (u[n+2] - 4 u[n+1] + 6 u[n] - 4 u[n-1] + u[n-2]) = 0
for n = 2, ..., N-2.
An easy way to find an approximate solution is by gradient descent: initialize u = np.copy(s), set u[1] = L + s[0] and u[N-1] = s[N] - R, and do 100 iterations or so of
u[2:-2] -= 0.05 * ((u - s)[2:-2] / λ + np.convolve(u, [1, -4, 6, -4, 1])[4:-4])
But with some more work, it is possible to do better than this by solving the E–L equations directly. For each n, move the known quantities to the right-hand side: s[n] and also the endpoints u[0] = s[0], u[1] = L + s[0], u[N-1] = s[N] - R, u[N] = s[N]. Then you will have a linear system "A u = b", and matrix A has rows like
0, ..., 0, 1, -4, (6 + 1/λ), -4, 1, 0, ..., 0.
Finally, solve the linear system to find the smoothed signal u. You could use numpy.linalg.solve to do this if N is not too large, or if N is large, try an iterative method like conjugate gradients.
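The direct solve described above can be sketched as follows. This is a minimal dense-matrix version, assuming the signal has at least four samples and is small enough for an O(N^2) matrix; the function name smooth_fixed_ends and its argument names are made up for illustration.

```python
import numpy as np

def smooth_fixed_ends(s, lam, left_slope, right_slope):
    """Solve the E-L equations for the interior samples u[2], ..., u[N-2]."""
    s = np.asarray(s, dtype=float)
    n_pts = s.size  # N + 1 samples
    if n_pts < 4:
        raise ValueError("need at least four samples")
    u = s.copy()
    # Value and slope constraints fix the two outermost samples on each side
    u[1] = s[0] + left_slope
    u[-2] = s[-1] - right_slope
    interior = np.arange(2, n_pts - 2)
    if interior.size == 0:
        return u
    # One row per interior n: the stencil 1, -4, (6 + 1/lam), -4, 1 centred on column n
    full = np.zeros((interior.size, n_pts))
    stencil = np.array([1.0, -4.0, 6.0, -4.0, 1.0])
    for row, n in enumerate(interior):
        full[row, n - 2:n + 3] = stencil
        full[row, n] += 1.0 / lam
    # Move the known samples to the right-hand side
    known = np.array([0, 1, n_pts - 2, n_pts - 1])
    A = full[:, interior]
    b = s[interior] / lam - full[:, known] @ u[known]
    u[interior] = np.linalg.solve(A, b)
    return u

# Example: smooth a noisy ramp while keeping its end values and unit end slopes
rng = np.random.default_rng(0)
s = np.arange(50.0) + rng.normal(scale=2.0, size=50)
u = smooth_fixed_ends(s, lam=20.0, left_slope=1.0, right_slope=1.0)
```

Because A is banded, a sparse or banded solver would scale better for long signals; the dense solve is used here only to keep the sketch short.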

You can apply a simple smoothing method and plot the smoothed curves with different smoothness values to see which one works best.
def smoothing(data, smoothness=0.5):
    last = data[0]
    new_data = [data[0]]
    for datum in data[1:]:
        new_value = smoothness * last + (1 - smoothness) * datum
        new_data.append(new_value)
        last = datum
    return new_data
You can plot this curve for multiple values of smoothness and pick the curve which suits your needs. You can also apply this method only on a range of values in the actual curve by defining start and end
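Restricting the smoothing to a range could look like the sketch below. It applies the same weighted average of each point with its predecessor, but only for indices in [start, end); the function name smooth_range and the exact meaning of start and end are assumptions made for illustration.

```python
def smooth_range(data, smoothness=0.5, start=0, end=None):
    """Blend each point in [start, end) with its predecessor; leave the rest untouched."""
    out = list(data)
    end = len(out) if end is None else end
    # Index 0 has no predecessor, so smoothing starts at index 1 at the earliest
    for i in range(max(start, 1), end):
        out[i] = smoothness * data[i - 1] + (1 - smoothness) * data[i]
    return out

print(smooth_range([0, 2, 4, 6], smoothness=0.5, start=1, end=3))  # [0, 1.0, 3.0, 6]
```

Note that this changes the values at the edges of the range only slightly, so it does not by itself guarantee a preserved slope at the end points.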

Efficiently sample from arbitrary multivariate function

I would like to sample from an arbitrary function in Python.
In Fast arbitrary distribution random sampling it was stated that one could use inverse transform sampling, and in Pythonic way to select list elements with different probability it was mentioned that one should use the inverse cumulative distribution function. As far as I understand, those methods only work in the univariate case. My function is multivariate, though, and too complex for any of the suggestions in https://stackoverflow.com/a/48676209/4533188 to apply.
Preliminaries: My function is based on Rosenbrock's banana function, whose value we can get with
import scipy.optimize
scipy.optimize.rosen([1.1,1.2])
(here [1.1,1.2] is the input vector) from scipy, see https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.optimize.rosen.html.
Here is what I came up with: I make a grid over my area of interest and calculate for each point the function value. Then I sort the resulting data frame by the value and make a cumulative sum. This way we get "slots" which have different sizes - points which have large function values have larger slots than points with small function values. Now we generate random values and look into which slot the random value falls into. The row of the data frame is our final sample.
Here is the code:
import scipy.optimize
from itertools import product
from dfply import *
nb_of_samples = 50
nb_of_grid_points = 30
rosen_data = pd.DataFrame(array([item for item in product(*[linspace(fm[0], fm[1], nb_of_grid_points) for fm in zip([-2,-2], [2,2])])]), columns=['x','y'])
rosen_data['z'] = [np.exp(-scipy.optimize.rosen(row)**2/500) for index, row in rosen_data.iterrows()]
rosen_data = rosen_data >> \
    arrange(X.z) >> \
    mutate(z_upperbound=cumsum(X.z)) >> \
    mutate(z_upperbound=X.z_upperbound/np.max(X.z_upperbound))
value = np.random.sample(1)[0]
def get_rosen_sample(value):
    return (rosen_data >> mask(X.z_upperbound >= value) >> select(X.x, X.y)).iloc[0,]
values = pd.DataFrame([get_rosen_sample(s) for s in np.random.sample(nb_of_samples)])
This works well, but I don't think it is very efficient. What would be a more efficient solution to my problem?
I read that Markov chain Monte Carlo might help, but I am in over my head for now on how to do this in Python.

I was in a similar situation, so, I implemented a rudimentary version of Metropolis-Hastings (which is an MCMC method) to sample from a bivariate distribution. An example follows.
Say we want to sample from the following density:
def density1(z):
    z = np.reshape(z, [z.shape[0], 2])
    z1, z2 = z[:, 0], z[:, 1]
    norm = np.sqrt(z1 ** 2 + z2 ** 2)
    exp1 = np.exp(-0.5 * ((z1 - 2) / 0.8) ** 2)
    exp2 = np.exp(-0.5 * ((z1 + 2) / 0.8) ** 2)
    u = 0.5 * ((norm - 4) / 0.4) ** 2 - np.log(exp1 + exp2)
    return np.exp(-u)
which looks like this
The following function implements MH with multivariate normal as the proposal
def metropolis_hastings(target_density, size=500000):
    burnin_size = 10000
    size += burnin_size
    x0 = np.array([[0, 0]])
    xt = x0
    samples = []
    for i in range(size):
        xt_candidate = np.array([np.random.multivariate_normal(xt[0], np.eye(2))])
        accept_prob = (target_density(xt_candidate))/(target_density(xt))
        if np.random.uniform(0, 1) < accept_prob:
            xt = xt_candidate
        samples.append(xt)
    samples = np.array(samples[burnin_size:])
    samples = np.reshape(samples, [samples.shape[0], 2])
    return samples
Run MH and plot samples
samples = metropolis_hastings(density1)
plt.hexbin(samples[:,0], samples[:,1], cmap='rainbow')
plt.gca().set_aspect('equal', adjustable='box')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.show()
Check out this repo of mine for details.
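If the grid-based approach from the question is accurate enough, it can also be vectorized so that all samples are drawn in one call. The sketch below writes the two-dimensional Rosenbrock function out by hand instead of calling scipy.optimize.rosen, uses the same weights exp(-rosen**2 / 500) as the question, and replaces the per-sample DataFrame lookup with np.searchsorted; the function name sample_from_grid and its parameters are made up for illustration.

```python
import numpy as np

def sample_from_grid(n_samples=50, n_grid=30, lo=-2.0, hi=2.0, rng=None):
    """Draw grid points with probability proportional to exp(-rosen**2 / 500)."""
    rng = np.random.default_rng() if rng is None else rng
    axis = np.linspace(lo, hi, n_grid)
    x, y = np.meshgrid(axis, axis, indexing="ij")
    x, y = x.ravel(), y.ravel()
    # Two-dimensional Rosenbrock function, written out
    rosen = 100.0 * (y - x**2) ** 2 + (1.0 - x) ** 2
    weights = np.exp(-rosen**2 / 500.0)
    # Normalised cumulative sum: each grid point owns a "slot" proportional to its weight
    cdf = np.cumsum(weights)
    cdf /= cdf[-1]
    # For each uniform draw, find the first slot whose upper bound is >= the draw
    idx = np.searchsorted(cdf, rng.random(n_samples))
    return np.column_stack([x[idx], y[idx]])

samples = sample_from_grid()
print(samples[:5])
```

Sorting by weight, as done in the question, is not needed for the slot lookup to be correct. The samples remain restricted to grid points, which is the main limitation compared with MCMC.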

Lyapunov Exponent Python Implementation

I have 10k data points like this:
0.010222
0.010345
0.010465
0.010611
0.010768
0.010890
0.011049
0.011206
0.011329
0.011465
0.011613
0.11763
0.011888
0.012015
0.012154
0.012282
0.012408
0.012524
....
I want to calculate Lyapunov exponent for that. This is what I've done so far:
lyapunovs = []
eps = 0.0001
for i in range(N):
    for j in range(i + 1, N):
        if np.abs(data[i] - data[j]) < eps:
            for k in range(1, min(N - i, N - j)):
                d0 = np.abs(data[i] - data[j])
                dn = np.abs(data[i + k] - data[j + k])
                lyapunovs.append(math.log(dn) - math.log(d0)) # problem
My problem is that I don't know whether the first Lyapunov exponent is the average of all the lyapunovs when k = 1, or the average of all the lyapunovs for the first time that data[i] - data[j] < eps.
Is this the right implementation for the Lyapunov exponent?
And this is the Numerical Calculation of Lyapunov Exponent

I would calculate the Lyapunov exponent in this way and then output the results as tuples in a file; see this blog post:
https://blog.abhranil.net/2014/07/22/calculating-the-lyapunov-exponent-of-a-time-series-with-python-code/
from math import log
import numpy as np
with open('data.txt', 'r') as f:
    data = [float(i) for i in f.read().split()]
N = len(data)
eps = 0.001
lyapunovs = [[] for i in range(N)]
for i in range(N):
    for j in range(i + 1, N):
        if np.abs(data[i] - data[j]) < eps:
            for k in range(min(N - i, N - j)):
                lyapunovs[k].append(log(np.abs(data[i+k] - data[j+k])))
with open('lyapunov.txt', 'w') as f:
    for i in range(len(lyapunovs)):
        if len(lyapunovs[i]):
            string = str((i, sum(lyapunovs[i]) / len(lyapunovs[i])))
            f.write(string + '\n')

I see from the chosen loop structure in the question that a triangle of the Cartesian product of the points is being used. This might improve the estimate of the derivatives, which are susceptible to noise, but it is not part of the Lyapunov exponent explicitly. See this example of the calculations on a known function in the absence of measurement error. Feel free to look into that aspect more, but below I will assume the comparison of signal points adjacent in time.
Your original question uses NumPy, so I will also make use of it. One of the rules of thumb to using NumPy well is to avoid loops, although it is possible to vectorize functions that contain loops. With no explicit time measurements, and no repeated values, you could simply do:
import numpy as np
x = np.random.normal(0,1,size=10**4) # Mock signal data
np.mean(np.log(np.abs(np.diff(x))))
Or if the signal is paired with an array of timepoints, then the numerical derivative can involve time:
import numpy as np
x = np.random.normal(0,1,size=10**4) # Mock signal data
t = np.arange(10**4) # Mock time data
np.mean(np.log(np.abs(np.diff(x) / np.diff(t))))
However, in some datasets it is possible for adjacent values to repeat! This can occur when you've measured the signal only to a few decimal places, and it is a problem because it leads to np.log(0) (=-np.inf) which will blow up your calculation. A simple solution is to remove duplicated values, but this will only be suitable if duplicates are relatively rare and you have a large sample size. It is possible to estimate an upper bound on the estimate of the L-exponent by considering the precision of your measurements, but that is not the estimate of the L-exponent itself.
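The "remove duplicated values" workaround can be sketched as below. It simply drops zero differences before taking the logarithm, which is only reasonable under the assumption stated above that repeats are rare; the function name mean_log_abs_diff is made up for illustration.

```python
import numpy as np

def mean_log_abs_diff(x):
    """Mean of log|x[n+1] - x[n]|, skipping adjacent pairs with identical values."""
    d = np.abs(np.diff(np.asarray(x, dtype=float)))
    # Zero differences would give log(0) = -inf, so they are left out
    d = d[d > 0]
    if d.size == 0:
        raise ValueError("all adjacent values are identical")
    return float(np.mean(np.log(d)))

x = np.round(np.random.normal(0, 1, size=10**4), 2)  # mock signal with coarse precision
print(mean_log_abs_diff(x))
```

Dropping pairs biases the estimate upward slightly, because the smallest differences are exactly the ones removed.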

I just want to mention that knowing the analytical expression of the system is best.
I will take an example with the logistic map equation:
def logisticmap(x_init, r, length):
    x = [x_init]
    for t in range(length):
        x.append(r*x[-1]*(1-x[-1]))
    return np.array(x)
Now let's generate the data:
r = 3.92
x = logisticmap(0.2, r, 1000)
plt.plot(x)
plt.show()
[Figure: plot of the logistic map]
Here is the proposed solution by Galan,
np.mean(np.log(abs(np.diff(x))))
Which gives : -1.0379
When you derive the Lyapunov exponent from the logistic map equation :
np.mean(np.log(abs(r*(1-2*x))))
It gives : 0.538296
This is the true value of the Lyapunov exponent: since the system is in its chaotic regime, it must be positive. So I guess the evaluation from data points is not working in this example. You can try with more data points, but it will still give you a negative LE.
Unfortunately I don't know enough to guide you towards a better estimate of the Lyapunov exponent if you can't derive a mathematical expression, but I would be interested to know!

I tried to reduce computational complexity with numpy vectorization.
import numpy as np

def lyapunov_exponent(series: np.ndarray, threshold: float) -> np.ndarray:
    N = len(series)
    eps = threshold
    L = [np.array([0]*N)]
    for i in range(1, N):
        diff = np.abs(series[i:]-series[:-i])
        dist = np.log(diff)
        L.append(np.concatenate([[0]*i, dist]))
    L = np.array(L)
    tf_L = np.where(L<eps, 1, 0)
    count_L = np.zeros_like(tf_L)
    for i in range(N):
        indices = ( np.array(range(0,N-i)), np.array(range(i,N)) )
        count_L[indices] = np.cumsum(tf_L[indices])
    avg = np.sum(count_L * L, axis=0) / np.sum(count_L, axis=0)
    return avg
If there is room for improvement or you get some different result than already answered, please reply.
