What drives numerical instability in eigenvalue computations in Python?

Let's say I have a data matrix X with num_samples = 1600 and dim_data = 2, from which I build a 1600×1600 similarity matrix S using the RBF kernel. I then normalize each row of the matrix by dividing every entry of the row by the row sum. This procedure gives me a (square) right-stochastic matrix, which we expect to have an eigenvalue equal to 1 associated with a constant eigenvector of all 1s.
We can easily check that this is indeed an eigenvector by multiplying it with the matrix. However, the eigenvector that scipy.linalg.eig returns for eigenvalue 1 is only piecewise constant.
I have tried scipy.linalg.eig on similarly sized matrices built from randomly generated data, which I also transformed into stochastic matrices, and I consistently obtained a constant eigenvector associated with eigenvalue 1.
My question is then: what factors can cause numerical instability when computing eigenvalues of stochastic matrices with scipy.linalg.eig?
Reproducible example:
import numpy as np
import scipy.linalg

def kernel(sigma, X):
    """
    param sigma: bandwidth of the RBF kernel
    param X: array of shape (num_samples, data_dim)
    """
    squared_norm = (np.expand_dims(np.sum(X**2, axis=1), axis=1)
                    + np.expand_dims(np.sum(X**2, axis=1), axis=0)
                    - 2 * np.einsum('ni,mi->nm', X, X))
    return np.exp(-0.5 * squared_norm / sigma**2)

def normalize(array):
    # divide each row by its sum (here via a diagonal degree matrix)
    degrees = []
    M = array.shape[0]
    for i in range(M):
        norm = sum(array[i, :])
        degrees.append(norm)
    degrees_matrix = np.diag(np.array(degrees))
    P = np.matmul(np.linalg.inv(degrees_matrix), array)
    return P
# generate the data (two concentric arcs)
points = np.linspace(0, 4*np.pi, 1600)
Z = np.zeros((1600, 2))
Z[0:800, :] = np.array([2.2*np.cos(points[0:800]), 2.2*np.sin(points[0:800])]).T
Z[800:, :] = np.array([4*np.cos(points[0:800]), 4*np.sin(points[0:800])]).T
params = np.zeros(3)  # offsets; not defined in the original snippet, assumed zero here
X = np.zeros((1600, 2))
X[:, 0] = np.where(Z[:, 1] >= 0, Z[:, 0] + .8 + params[1], Z[:, 0] - .8 + params[2])
X[:, 1] = Z[:, 1] + params[0]

# create the stochastic matrix P
P = normalize(kernel(.05, X))

# inspect the eigenvectors
e, v = scipy.linalg.eig(P)
p = np.flip(np.argsort(e))
e = e[p]
v = v[:, p]
plot_array(v[:, 0])  # plot_array is the asker's plotting helper (not shown)

# check on synthetic data:
Y = np.random.normal(size=(1600, 2))
P = normalize(kernel(.05, Y))  # sigma not given in the original snippet; .05 assumed
# inspect the eigenvectors
e, v = scipy.linalg.eig(P)
p = np.flip(np.argsort(e))
e = e[p]
v = v[:, p]
plot_array(v[:, 0])
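As a quick sanity check (not part of the original post), one can confirm that the rows of P sum to 1 and that the all-ones vector is an eigenvector with eigenvalue 1 up to floating-point error:
ones = np.ones(P.shape[0])
print(np.allclose(P.sum(axis=1), 1.0))        # rows sum to 1
print(np.max(np.abs(P @ ones - ones)))        # residual of P @ 1 = 1, on the order of 1e-16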
Using the code provided by Ahmed AEK, here are some results showing how far the computed eigenvector deviates from a constant one:
[-1.36116641e-05 -1.36116641e-05 -1.36116641e-05 ...  5.44472888e-06
  5.44472888e-06  5.44472888e-06]
norm = 0.9999999999999999
max difference = 0.04986484253966891
max difference / element value = -3663.3906291852545
UPDATE:
I have observed that a low value of sigma in the construction of the kernel matrix produces a less sharp decay in the (sorted) eigenvalues. In fact, for sigma = 0.05, the first 4 eigenvalues returned by scipy.linalg.eig are all rounded to 1. This near-degeneracy may be linked to the imprecision in the eigenvectors. When sigma is increased to 0.5, I do obtain a constant eigenvector.
(Figure: first 5 eigenvectors in the sigma = 0.05 case)
(Figure: first 5 eigenvectors in the sigma = 0.5 case)
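A short way to inspect this (not in the original post) is to print the largest eigenvalue magnitudes for both values of sigma and look at the gap below the leading eigenvalue:
for sigma in (0.05, 0.5):
    P = normalize(kernel(sigma, X))
    top = np.sort(np.abs(scipy.linalg.eigvals(P)))[::-1][:5]
    print(sigma, top)   # for sigma = 0.05 several eigenvalues cluster at 1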

A 64-bit float gives an expected accuracy of roughly 14 significant digits, as shown here, which means that any result will only be accurate up to about 14 digits.
You can check this with the code below:
Y = np.random.normal(size=(1600, 2))
P = normalize(kernel(5, Y))
P = P / np.sum(P, axis=1, keepdims=True)  # keepdims so each row (not column) is divided by its own sum

# inspect the eigenvectors
e, v = np.linalg.eig(P)
a = np.isclose(e, 1)
e1 = e[a]
v1 = v[:, a]
v11 = v1[:, 0]
print(v11)
print('norm = ', np.sum(v11**2))
print('max difference = ', np.amax(np.abs(np.diff(v11))))
print('max difference / element value =', np.amax(np.abs(np.diff(v11))) / v11[0])
result is:
[0.025+0.j 0.025+0.j 0.025+0.j ... 0.025+0.j 0.025+0.j 0.025+0.j]
norm = (1+0j)
max difference = 1.97758476261356e-16
max difference / element value = (7.91033905045416e-15+0j)
As you can see, the difference is accurate to within 8e-15, which is around 14 digits of precision; the norm will sometimes come out as 0.99999999999998, which is also within 14 digits of precision.
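For reference (my own addition), the underlying precision limit can be queried directly from NumPy: float64 has a machine epsilon of about 2.2e-16, i.e. roughly 15–16 significant decimal digits before any algorithmic error is taken into account:
import numpy as np
print(np.finfo(np.float64).eps)   # 2.220446049250313e-16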

Related

Efficient metric to measure whether two clouds of points overlap?

Assume I have two sets of points A and B. A is a matrix of size N-by-D and B is a matrix of size M-by-D. Both sets have the same dimension D but may contain different numbers of samples N, M > 1 (assume the number of samples is sufficiently large).
I am looking for an efficient metric to determine to what degree set A overlaps with set B. This metric should have the following properties:
It should return a large value when set A is contained in set B
It should return an intermediate value when set A partially overlaps with set B
It should return a small value when set A and set B do not overlap
I have thought of some ways to achieve this, but none of them quite make the cut:
Determine the convex hull of B, then calculate what percentage of A lies within this convex hull. This is more or less reliable (assuming B is sufficiently convex), but calculating the convex hull becomes prohibitively expensive for large D.
Estimate the mean and covariance of A and B and calculate a Kullback-Leibler divergence between the two resulting multivariate Gaussian distributions. This is reasonably efficient, but does not distinguish the case when A is completely embedded in B but has significantly lower spread, and when A and B have similar spread but only overlap partially.
Do you have any other ideas on how I could tackle this issue? Below I have provided an example code illustrating the problem with using the Kullback-Leibler divergence:
import numpy as np
import scipy.stats
N = 1000
M = 1000
D = 3
def metric_KLD(A, B):
    # Get the dimension of A and B
    D = A.shape[-1]
    # Estimate mean and cov of A
    A_mean = np.mean(A, axis=0)
    A_cov = np.cov(A.T)
    # Estimate mean and cov of B
    B_mean = np.mean(B, axis=0)
    B_cov = np.cov(B.T)
    # Calculate the KLD
    KLD = 0.5*(np.log(np.linalg.det(B_cov)/np.linalg.det(A_cov)) -
               D + np.trace(np.dot(np.linalg.inv(B_cov), A_cov)) +
               np.linalg.multi_dot((
                   (B_mean - A_mean)[np.newaxis, :],
                   np.linalg.inv(B_cov),
                   (B_mean - A_mean)[:, np.newaxis])))
    return KLD
# Case 1: Both distributions overlap perfectly
A1 = scipy.stats.multivariate_normal.rvs(
    mean=np.zeros(D), cov=np.identity(D)*10**2, size=N)
B1 = scipy.stats.multivariate_normal.rvs(
    mean=np.zeros(D), cov=np.identity(D)*10**2, size=M)
print('KLD case 1: '+str(metric_KLD(A1, B1)))

# Case 2: Both distributions overlap partially
A2 = scipy.stats.multivariate_normal.rvs(
    mean=np.asarray([0, 0, 0]), cov=np.identity(D)*10**2, size=N)
B2 = scipy.stats.multivariate_normal.rvs(
    mean=np.asarray([10, 0, 0]), cov=np.identity(D)*10**2, size=M)
print('KLD case 2: '+str(metric_KLD(A2, B2)))

# Case 3: Both distributions don't overlap at all
A3 = scipy.stats.multivariate_normal.rvs(
    mean=np.asarray([0, 0, 0]), cov=np.identity(D)*10**2, size=N)
B3 = scipy.stats.multivariate_normal.rvs(
    mean=np.asarray([30, 30, 30]), cov=np.identity(D)*10**2, size=M)
print('KLD case 3: '+str(metric_KLD(A3, B3)))

# Case 4: A is included in B, but A has significantly smaller spread
A4 = scipy.stats.multivariate_normal.rvs(
    mean=np.asarray([0, 0, 0]), cov=np.identity(D)*1**2, size=N)
B4 = scipy.stats.multivariate_normal.rvs(
    mean=np.asarray([0, 0, 0]), cov=np.identity(D)*10**2, size=M)
# This is problematic: Should be better than case 2 but isn't
print('KLD case 4: '+str(metric_KLD(A4, B4)))
Another idea is to compare the standard deviation within each set to the overall standard deviation in the pooled set. The more the sets overlap, the more similar these numbers will be.
To handle the case of one set with a small spread within the other one, you could just use the larger individual standard deviation for comparison with the pooled one.
This could be done for each dimension, and the final metric would be the average over all dimensions:
def metric_std(A, B):
    # standard deviations
    A_std = np.std(A, axis=0)
    B_std = np.std(B, axis=0)
    pooled_std = np.std(np.vstack([A, B]), axis=0)
    # metric: ratio of higher individual to pooled std
    std_ratio = np.max([A_std, B_std], axis=0) / pooled_std
    return np.mean(std_ratio)
print('case 1: '+str(metric_std(A1, B1)))
print('case 2: '+str(metric_std(A2, B2)))
print('case 3: '+str(metric_std(A3, B3)))
print('case 4: '+str(metric_std(A4, B4)))
case 1: 1.012685295955191
case 2: 0.973728554241719
case 3: 0.5569612281519413
case 4: 1.4071848466046386
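As a small sanity check of this metric (my addition, reusing the A1 sample defined above): comparing a set against itself gives a ratio of essentially 1, since pooling a set with a copy of itself leaves the standard deviation unchanged:
print(metric_std(A1, A1))   # -> 1.0 (up to floating-point rounding)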

Coding Isomap (& MDS) function using only numpy and scipy in python

I have coded an Isomap function: I start by computing the Euclidean distance matrix (using scipy.spatial.distance.cdist), then, based on the K-nearest-neighbours method and Dijkstra's algorithm (to determine the shortest paths), I compute the full distance matrix over all paths, and finally I do the map computations, followed by the dimensionality reduction.
BUT, I want to use epsilon instead of K-nearest neighbours, like in the following:
Y = isomap(X, epsilon, d)
• X is an n × m matrix which corresponds to n points with m attributes.
• epsilon is an anonymous function of the distance matrix used to find the neighbourhood parameter (the neighbourhood graph must be formed by eliminating the edges whose weight is greater than epsilon in the complete distance graph).
• d is a parameter which signifies the output dimension.
• Y is an n × d matrix, which contains the embedding resulting from Isomap.
THANKS in advance
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

def distance_Matrix(X):
    return cdist(X, X, 'euclidean')

def Dijkstra(h, neighbours):
    ndata = h.shape[0]
    q = h.copy()
    for i in range(ndata):
        for j in range(ndata):
            k = np.argmin(q[i, :])
            while not(np.isinf(q[i, k])):
                q[i, k] = np.inf
                for l in neighbours[k, :]:
                    possible = h[i, l] + h[l, k]
                    if possible < h[i, k]:
                        h[i, k] = possible
                k = np.argmin(q[i, :])
    return h

def MDS(D, newdim=2):
    n = D.shape[0]
    # Torgerson formula
    I = np.eye(n)
    J = np.ones(D.shape)
    J = I - (1/n)*J                            # centering matrix
    B = (-1/2)*np.dot(np.dot(J, D**2), J)      # B = -(1/2).J.D².J with D² the elementwise squared distances
    eigenval, eigenvec = np.linalg.eig(B)
    indices = np.argsort(eigenval)[::-1]
    eigenval = eigenval[indices]
    eigenvec = eigenvec[:, indices]
    # dimension reduction
    K = eigenvec[:, :newdim]
    L = np.diag(eigenval[:newdim])
    # result
    Y = K @ L**(1/2)
    return np.real(Y)

def isomap(data, newdim=2, K=12):
    ndata = np.shape(data)[0]
    ndim = np.shape(data)[1]
    d = distance_Matrix(data)
    # replace begin
    # K-nearest neighbours
    indices = d.argsort()
    #notneighbours = indices[:,K+1:]
    neighbours = indices[:, :K+1]
    # replace end
    h = np.ones((ndata, ndata), dtype=float)*np.inf
    for i in range(ndata):
        h[i, neighbours[i, :]] = d[i, neighbours[i, :]]
    h = Dijkstra(h, neighbours)
    return MDS(h, newdim)
Try sklearn.neighbors.radius_neighbors_graph for your distance matrix
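A minimal sketch of the epsilon-based variant (my own code, with the hypothetical name isomap_eps), reusing the distance_Matrix and MDS helpers from the question and scipy's shortest-path routine instead of the hand-rolled Dijkstra; keep every edge whose weight is at most epsilon and drop the rest:
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap_eps(data, epsilon, newdim=2):
    # complete distance graph
    d = distance_Matrix(data)
    # epsilon neighbourhood: edges longer than epsilon are removed (np.inf = no edge)
    W = np.where(d <= epsilon, d, np.inf)
    # geodesic (shortest-path) distances over the pruned graph
    h = shortest_path(W, directed=False)
    if np.isinf(h).any():
        raise ValueError("epsilon too small: the neighbourhood graph is disconnected")
    return MDS(h, newdim)

# The same neighbourhood graph can be built in sparse form with
# sklearn.neighbors.radius_neighbors_graph(data, radius=epsilon, mode='distance').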

Elements of a matrix inverse for ill-conditioned matrix

I am trying to find the elements of a matrix inverse for an ill-conditioned matrix.
Consider the complex non-Hermitian matrix M. I know this matrix has one zero eigenvalue and is therefore singular. However, I need to find the sum of matrix elements v @ f(M) @ u, where u and v are both vectors and f(x) = 1/x (effectively the matrix inverse). I know that the zero eigenvalue does not contribute to this sum, so there is no explicit issue with the singularity. However, my code is very numerically unstable; I presume this is a consequence of an error in finding the eigenvalues of the system.
Starting by building the preliminary matrices:
import numpy as np
import scipy.linalg

g0 = np.array([0, 0, 1])
g1 = np.array([0, 1, 0])
e0 = np.array([1, 0, 0])

sm = np.outer(g0, e0)
sp = np.outer(e0, g0)

def spre(op):
    return np.kron(np.eye(op.shape[0]), op)

def spost(op):
    return np.kron(op.T, np.eye(op.shape[0]))

def sprepost(op1, op2):
    return np.kron(op1.T, op2)

sm_reg = spre(sm)
sp_reg = spre(sp)
spsm_reg = spre(sp @ sm)

hil_dim = int(g0.shape[0])
cav_proj = np.eye(hil_dim).reshape(hil_dim**2,)
rho0 = (np.outer(e0, e0)).reshape(hil_dim**2,)

def ham(g):
    return g * (np.outer(g1, e0) + np.outer(e0, g1))

def lind_op(A):
    L = 2 * sprepost(A, A.conj().T) - spre(A.conj().T @ A)
    L += - spost(A.conj().T @ A)
    return L

def JC_lio(g, kappa, gamma):
    unit = -1j * (spre(ham(g)) - spost(ham(g)))
    lind = gamma * lind_op(np.outer(g0, e0)) + kappa * lind_op(np.outer(g0, g1))
    return unit + lind
Now define a function that first finds the left and right eigenvectors, and then forms the sum of the matrix elements:
def power_int(g, kappa, gamma):
    # Construct the non-Hermitian matrix of interest
    lio = JC_lio(g, kappa, gamma)
    # Find its left and right eigenvectors:
    ev, left, right = scipy.linalg.eig(lio, left=True, right=True)
    # Find the appropriate normalisation factors
    norm = np.array([(left.conj().T[ii]).dot(right.conj().T[ii]) for ii in range(len(ev))])
    # Find the similarity transformation for the problem
    P = right
    Pinv = (left/norm).conj().T
    # find the projectors for the eigenbasis
    Proj = [np.outer(P.conj().T[ii], Pinv[ii]) for ii in range(len(ev))]
    # Find the relevant matrix elements between the eigenbasis and the projectors
    # --- this is where the zero eigenvector gets removed
    PowList = [(spsm_reg @ Proj[ii] @ rho0).dot(cav_proj) for ii in range(len(ev))]
    # apply the function
    Pow = 0
    for ii in range(len(ev)):
        if PowList[ii] == 0:   # note: exact comparison with 0 will rarely trigger for floating-point values
            Pow = Pow
        else:
            Pow += PowList[ii]/ev[ii]
    return -np.pi * np.real(Pow)

# example run:
grange = np.linspace(0.001, 10, 40)
dat = np.array([power_int(g, 1, 1) for g in grange])
Running this code leads to extremely oscillatory results where I expect a smooth curve. I suspect this error is due to poor accuracy in determining the eigenvectors, but I can't seem to find any documentation on this. Any insights would be welcome.
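As a hedged illustration (my own code, not from the post) of the spectral-sum identity the question relies on: for a diagonalizable M, v @ inv(M) @ u equals the sum over eigenvalues of (v·rᵢ)(lᵢᴴu)/(λᵢ lᵢᴴrᵢ). Checking this against the direct inverse on a small, well-conditioned non-Hermitian test matrix helps separate coding errors from genuine ill-conditioning of the eigenvectors:
import numpy as np
from scipy.linalg import eig

rng = np.random.default_rng(0)
n = 5
M = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))   # generic non-Hermitian, nonsingular
u = rng.normal(size=n)
v = rng.normal(size=n)

ev, vl, vr = eig(M, left=True, right=True)

# v @ inv(M) @ u  =  sum_i (v . r_i)(l_i^H . u) / (lambda_i * l_i^H r_i)
norms = np.einsum('ki,ki->i', vl.conj(), vr)          # biorthogonal factors l_i^H r_i
spectral = np.sum((v @ vr) * (vl.conj().T @ u) / (ev * norms))

direct = v @ np.linalg.inv(M) @ u
print(abs(spectral - direct))                         # ~1e-13 for a well-conditioned M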

Why does simple gradient descent diverge?

This is my second attempt at implementing gradient descent in one variable and it always diverges. Any ideas?
This is simple linear regression for minimizing the residual sum of squares in one variable.
def gradient_descent_wtf(xvalues, yvalues):
    tolerance = 0.1

    #y=mx+b
    #some line to predict y values from x values
    m=1.
    b=1.

    #a predicted y-value has value mx + b
    for i in range(0,10):
        #calculate y-value predictions for all x-values
        predicted_yvalues = list()
        for x in xvalues:
            predicted_yvalues.append(m*x + b)

        # predicted_yvalues holds the predicted y-values

        #now calculate the residuals = y-value - predicted y-value for each point
        residuals = list()
        number_of_points = len(yvalues)
        for n in range(0,number_of_points):
            residuals.append(yvalues[n] - predicted_yvalues[n])

        ## calculate the residual sum of squares from the residuals, that is,
        ## square each residual and add them all up. we will try to minimize
        ## the residual sum of squares later.
        residual_sum_of_squares = 0.
        for r in residuals:
            residual_sum_of_squares += r**2
        print("RSS = %s" % residual_sum_of_squares)

        #now make a version of the residuals which is multiplied by the x-values
        residuals_times_xvalues = list()
        for n in range(0,number_of_points):
            residuals_times_xvalues.append(residuals[n] * xvalues[n])

        #now create the sums for the residuals and for the residuals times the x-values
        residuals_sum = sum(residuals)
        residuals_times_xvalues_sum = sum(residuals_times_xvalues)

        # now multiply the sums by a positive scalar and add each to m and b.
        residuals_sum *= 0.1
        residuals_times_xvalues_sum *= 0.1
        b += residuals_sum
        m += residuals_times_xvalues_sum

        #and repeat until convergence.
        #convergence occurs when ||sum vector|| < some tolerance.
        # ||sum vector|| = sqrt( residuals_sum**2 + residuals_times_xvalues_sum**2 )
        #check for convergence
        magnitude_of_sum_vector = (residuals_sum**2 + residuals_times_xvalues_sum**2)**0.5
        if magnitude_of_sum_vector < tolerance:
            break

    return (b, m)
Result:
gradient_descent_wtf([1,2,3,4,5,6,7,8,9,10],[6,23,8,56,3,24,234,76,59,567])
RSS = 370433.0
RSS = 300170125.7
RSS = 4.86943013045e+11
RSS = 7.90447409339e+14
RSS = 1.28312217794e+18
RSS = 2.08287421094e+21
RSS = 3.38110045417e+24
RSS = 5.48849288217e+27
RSS = 8.90939341376e+30
RSS = 1.44624932026e+34
Out[108]:
(-3.475524066284303e+16, -2.4195981188763203e+17)
The gradients are huge -- hence you are following large vectors for long distances (0.1 times a large number is large). Find unit vectors in the appropriate direction. Something like this (with comprehensions replacing your loops):
def gradient_descent_wtf(xvalues, yvalues):
    tolerance = 0.1

    m=1.
    b=1.

    for i in range(0,10):
        predicted_yvalues = [m*x+b for x in xvalues]

        residuals = [y-y_hat for y,y_hat in zip(yvalues,predicted_yvalues)]

        residual_sum_of_squares = sum(r**2 for r in residuals) #only needed for debugging purposes
        print("RSS = %s" % residual_sum_of_squares)

        residuals_times_xvalues = [r*x for r,x in zip(residuals,xvalues)]

        residuals_sum = sum(residuals)
        residuals_times_xvalues_sum = sum(residuals_times_xvalues)

        # (residuals_sum,residuals_times_xvalues_sum) is a vector which points in the negative
        # gradient direction. *Find a unit vector which points in same direction*
        magnitude = (residuals_sum**2 + residuals_times_xvalues_sum**2)**0.5
        residuals_sum /= magnitude
        residuals_times_xvalues_sum /= magnitude

        b += residuals_sum * (0.1)
        m += residuals_times_xvalues_sum * (0.1)

        #check for convergence -- this needs work!
        magnitude_of_sum_vector = (residuals_sum**2 + residuals_times_xvalues_sum**2)**0.5
        if magnitude_of_sum_vector < tolerance:
            break

    return (b, m)
For example:
>>> gradient_descent_wtf([1,2,3,4,5,6,7,8,9,10],[6,23,8,56,3,24,234,76,59,567])
RSS = 370433.0
RSS = 368732.1655050716
RSS = 367039.18363896786
RSS = 365354.0543519137
RSS = 363676.7775934381
RSS = 362007.3533123621
RSS = 360345.7814567845
RSS = 358692.061974069
RSS = 357046.1948108295
RSS = 355408.17991291644
(1.1157111313023558, 1.9932828425473605)
which is certainly much more plausible.
It isn't a trivial matter to make a numerically stable gradient-descent algorithm. You might want to consult a decent textbook in numerical analysis.
First, your code is right.
But you should consider some of the math behind linear regression.
For example, if the residual sum is -205.8 and your learning rate is 0.1, you take a huge descent step of about -20.6.
A step that large means you can never get back to the correct m and b; you have to make your steps small enough.
There are two ways to make the gradient descent step reasonable (see the sketch after this list):
initialize a small learning rate, such as 0.001 or 0.0003;
divide your step by the total number of your input values (i.e. use the mean gradient rather than the sum).
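A minimal sketch combining both suggestions (my own code, with the hypothetical name gradient_descent_mean): divide the summed gradient by the number of points and use a modest learning rate:
def gradient_descent_mean(xvalues, yvalues, lr=0.01, iters=50000, tolerance=1e-6):
    m, b = 1.0, 1.0
    n = len(xvalues)
    for _ in range(iters):
        residuals = [y - (m*x + b) for x, y in zip(xvalues, yvalues)]
        # mean-gradient components: dividing by n keeps the step size bounded
        grad_b = sum(residuals) / n
        grad_m = sum(r*x for r, x in zip(residuals, xvalues)) / n
        b += lr * grad_b
        m += lr * grad_m
        if (grad_b**2 + grad_m**2)**0.5 < tolerance:
            break
    return (b, m)

print(gradient_descent_mean([1,2,3,4,5,6,7,8,9,10], [6,23,8,56,3,24,234,76,59,567]))
# converges towards roughly (-101, 37.6), the least-squares solution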

calculating Gini coefficient in Python/numpy

I'm calculating the Gini coefficient (similar to: Python - Gini coefficient calculation using Numpy), but I get an odd result. For a uniform distribution sampled from np.random.rand(), the Gini coefficient is about 0.3, but I would have expected it to be close to 0 (perfect equality). What is going wrong here?
import numpy as np
import matplotlib.pyplot as plt

def G(v):
    bins = np.linspace(0., 100., 11)
    total = float(np.sum(v))
    yvals = []
    for b in bins:
        bin_vals = v[v <= np.percentile(v, b)]
        bin_fraction = (np.sum(bin_vals) / total) * 100.0
        yvals.append(bin_fraction)
    # perfect equality area
    pe_area = np.trapz(bins, x=bins)
    # lorenz area
    lorenz_area = np.trapz(yvals, x=bins)
    gini_val = (pe_area - lorenz_area) / float(pe_area)
    return bins, yvals, gini_val

v = np.random.rand(500)
bins, result, gini_val = G(v)

plt.figure()
plt.subplot(2, 1, 1)
plt.plot(bins, result, label="observed")
plt.plot(bins, bins, '--', label="perfect eq.")
plt.xlabel("fraction of population")
plt.ylabel("fraction of wealth")
plt.title("GINI: %.4f" % (gini_val))
plt.legend()
plt.subplot(2, 1, 2)
plt.hist(v, bins=20)
For a given set of numbers, the above code calculates the fraction of the total distribution's values that fall into each percentile bin.
The result:
A uniform distribution should be near "perfect equality", so the bending of the Lorenz curve seems off to me.
This is to be expected. A random sample from a uniform distribution does not result in uniform values (i.e. values that are all relatively close to each other). With a little calculus, it can be shown that the expected value (in the statistical sense) of the Gini coefficient of a sample from the uniform distribution on [0, 1] is 1/3, so getting values around 1/3 for a given sample is reasonable.
You'll get a lower Gini coefficient with a sample such as v = 10 + np.random.rand(500). Those values are all close to 10.5; the relative variation is lower than for the sample v = np.random.rand(500).
In fact, the expected value of the Gini coefficient for the sample base + np.random.rand(n) is 1/(6*base + 3).
Here's a simple implementation of the Gini coefficient. It uses the fact that the Gini coefficient is half the relative mean absolute difference.
def gini(x):
    # (Warning: This is a concise implementation, but it is O(n**2)
    # in time and memory, where n = len(x). *Don't* pass in huge
    # samples!)

    # Mean absolute difference
    mad = np.abs(np.subtract.outer(x, x)).mean()
    # Relative mean absolute difference
    rmad = mad/np.mean(x)
    # Gini coefficient
    g = 0.5 * rmad
    return g
(For some more efficient implementations, see More efficient weighted Gini coefficient in Python)
Here's the Gini coefficient for several samples of the form v = base + np.random.rand(500):
In [80]: v = np.random.rand(500)
In [81]: gini(v)
Out[81]: 0.32760618249832563
In [82]: v = 1 + np.random.rand(500)
In [83]: gini(v)
Out[83]: 0.11121487509454202
In [84]: v = 10 + np.random.rand(500)
In [85]: gini(v)
Out[85]: 0.01567937753659053
In [86]: v = 100 + np.random.rand(500)
In [87]: gini(v)
Out[87]: 0.0016594595244509495
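A quick numerical check of the 1/(6*base + 3) formula quoted above (my own addition); the sampled values should scatter around the theoretical expectation:
for base in (0, 1, 10, 100):
    v = base + np.random.rand(500)
    print(base, gini(v), 1 / (6*base + 3))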
A slightly faster implementation (using numpy vectorization and only computing each difference once):
def gini_coefficient(x):
    """Compute Gini coefficient of array of values"""
    diffsum = 0
    for i, xi in enumerate(x[:-1], 1):
        diffsum += np.sum(np.abs(xi - x[i:]))
    return diffsum / (len(x)**2 * np.mean(x))
Note: x must be a numpy array.
The Gini coefficient is derived from the area between the line of equality and the Lorenz curve, and it is usually calculated for analyzing the distribution of income in a population. https://github.com/oliviaguest/gini provides a simple Python implementation.
A quick note on the original methodology:
When calculating Gini coefficients directly from areas under curves with np.trapz or another integration method, the first value of the Lorenz curve needs to be 0 so that the area between the origin and the second value is accounted for. The following changes to G(v) fix this:
yvals = [0]
for b in bins[1:]:
I also discussed this issue in this answer, where including the origin in those calculations gives an answer equivalent to the other methods discussed here (which do not need the 0 to be prepended).
In short, when calculating Gini coefficients directly using integration, start from the origin; with the other methods discussed here, it is not needed.
Note that a Gini index implementation is also available in skbio.diversity.alpha as gini_index. It may give slightly different results from the examples mentioned above.
You are getting the right answer. The Gini coefficient of the uniform distribution on [a, b] is not 0 ("perfect equality") but (b-a) / (3*(b+a)). In your case b = 1 and a = 0, so the Gini coefficient is 1/3.
The only distributions with perfect equality are the Kronecker and Dirac deltas. Remember that equality means "all the same", not "all equally probable".
There were some issues with the previous implementations: they never give a Gini index of 1 for perfectly sparse data, where one element holds everything.
For example:
def gini_coefficient(x):
    """Compute Gini coefficient of array of values"""
    diffsum = 0
    for i, xi in enumerate(x[:-1], 1):
        diffsum += np.sum(np.abs(xi - x[i:]))
    return diffsum / (len(x)**2 * np.mean(x))

gini_coefficient(np.array([0, 0, 1]))
gives the answer 0.666666. That happens because of the implied "integration scheme" it uses.
Here is another variant that bypasses the issue, although it is computationally heavier:
import numpy as np
from scipy.interpolate import interp1d

def gini(v, n_new=1000):
    """Compute Gini coefficient of array of values"""
    v_abs = np.sort(np.abs(v))
    cumsum_v = np.cumsum(v_abs)
    n = len(v_abs)
    vals = np.concatenate([[0], cumsum_v/cumsum_v[-1]])
    x = np.linspace(0, 1, n+1)
    f = interp1d(x=x, y=vals, kind='previous')
    xnew = np.linspace(0, 1, n_new+1)
    dx_new = 1/(n_new)
    vals_new = f(xnew)
    return 1 - 2 * np.trapz(y=vals_new, x=xnew, dx=dx_new)

gini(np.array([0, 0, 1]))
This gives an output of 0.999, which is closer to what one wants to have =)
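Another option (my own addition, not from the answers above) that returns exactly 1 for the one-element-holds-everything case is to apply the standard n/(n-1) sample-bias correction to the relative-mean-absolute-difference formula:
import numpy as np

def gini_corrected(x):
    """Sample Gini coefficient with the n/(n-1) bias correction."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mad = np.abs(np.subtract.outer(x, x)).mean()   # mean absolute difference (O(n**2) memory)
    g = 0.5 * mad / np.mean(x)                     # uncorrected Gini
    return g * n / (n - 1)

print(gini_corrected([0, 0, 1]))   # -> 1.0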
