I am trying to make the weights as close to their average as possible while meeting both constraints, w > 0 and Aw = b. With help I was able to satisfy the second constraint (Aw = b) almost exactly, but some weights still come back negative and are not close to the average. As you can see in the result below, there are two large negative numbers (-39.99771255, -2.9709434) and one large positive number (43.82670431) that are out of bounds, which defeats the purpose of the exercise. I also tried splitting my data set into random groups, but still found no acceptable solution.
Please let me know if there are any math or Python suggestions, or any methodology, that could lead to a better solution. I expect the weights to be close to 0.2. Any numerical suggestions and advice are very welcome.
Question: Find weights that are as close to their average as possible, meet the targets as closely as possible, and are strictly positive.
My math right now is
**min** ||w - mean(w)||_1
**s.t.** Aw - b = 0, w >= 0
**bounds** 0.2 < w < 0.5
Data:
variable1 variable2 variable3 weight
1.49E+03 9.65E+07 2.27E+08 w1
1.52E+03 1.02E+08 2.20E+08 w2
1.54E+03 1.11E+08 2.23E+08 w3
1.49E+03 9.73E+07 2.18E+08 w4
1.60E+03 1.05E+08 2.38E+08 w5
1.57E+03 1.01E+08 2.25E+08 w6
1.59E+03 1.06E+08 2.30E+08 w7
1.49E+03 9.15E+07 2.20E+08 w8
1.54E+03 1.02E+08 2.20E+08 w9
1.54E+03 1.01E+08 2.28E+08 w10
1.57E+03 1.08E+08 2.25E+08 w11
1.54E+03 1.00E+08 2.25E+08 w12
1.57E+03 9.97E+07 2.26E+08 w13
1.49E+03 9.64E+07 2.13E+08 w14
1.53E+03 1.05E+08 2.36E+08 w15
1.53E+03 1.09E+08 2.23E+08 w16
1.55E+03 1.00E+08 2.45E+08 w17
1.52E+03 1.07E+08 2.17E+08 w18
1.50E+03 9.62E+07 2.07E+08 w19
1.57E+03 9.91E+07 2.32E+08 w20
Target
8531 429386951 1079115532
Current Result:
('x: ', array([ 0.28601612, 0.28601614, 43.82670431, 0.28601614, 0.28601613, 0.28601614, 0.28601614, -2.9709434 ,
0.28601614, 0.28601613, 0.28601615, 0.28601613, 0.28601613, 0.28601614, 0.28601614, 0.28601618,
-39.99771255, 0.28601618, 0.28601614, 0.28601612]))
Python Code:
# -*- coding: utf-8 -*-
"""
Created on Wed Nov 9 13:15:49 2017
#author: zkmkq7a
"""
import numpy as np
import scipy.optimize as spo
np.set_printoptions(linewidth=120)
np.random.seed(1)
""" Test Solver with random data """
#M, N = 2, 3
#A = np.random.random(size=(M,N))
#x_hidden = np.random.random(size=N)
#b = A.dot(x_hidden) # target
#n = N
#
#print('Original A')
#print(A)
#print('hidden x: ', x_hidden)
#print('target: ', b)
A = np.array(
[
[1486,1522,1536,1485,1598,1570,1593,1491,1540,1544,1566,1544,1573,1490,1530,1526,1546,1518,1498,1567],
[96454395,101615856,110933285,97323661,104973593,101499747,106069432,91496414,101843454,101040852,107795438,100260356,99722283,96419402,104875509,108908675,100158114,107298576,96209051,99065264],
[226553304,219597259,223410668,217613414,237752337,225150384,230103384,219982453,219726185,227791091,225452565,225279979,225682678,213304290,236148563,223027694,244961198,216856884,207345788,231758040]
]
)
b = np.array([8531., 1079115532., 429386951.])
n = 20
""" Optimize as LP """
def solve(A, b, n):
    print('Reformulation')
    am, an = A.shape

    # Introduce aux-vars
    # 1: y = mean
    # n: z = x - mean
    # n: abs(z)
    n_plus_aux_vars = 3*n + 1

    # Equality constraint: y = mean
    eq_mean_A = np.zeros(n_plus_aux_vars)
    eq_mean_A[:n] = 1. / n
    eq_mean_A[n] = -1.
    eq_mean_b = np.array([0])
    print('y = mean A:')
    print(eq_mean_A)
    print('y = mean b:')
    print(eq_mean_b)

    # Equality constraints: Ax = b
    eq_A = np.zeros((am, n_plus_aux_vars))
    eq_A[:, :n] = A[:, :n]
    eq_b = np.copy(b)
    print('Ax=b A:')
    print(eq_A)
    print('Ax=b b:')
    print(eq_b)

    # Equality constraints: z = x - mean
    eq_mean_A_z = np.hstack([-np.eye(n), np.ones((n, 1)), np.eye(n), np.zeros((n, n))])
    eq_mean_b_z = np.zeros(n)
    print('z = x - mean A:')
    print(eq_mean_A_z)
    print('z = x - mean b:')
    print(eq_mean_b_z)

    # Inequality constraints for the absolute values: z <= abs(z) ; -z <= abs(z)
    ineq_abs_0_A = np.hstack([np.zeros((n, n)), np.zeros((n, 1)), np.eye(n), -np.eye(n)])
    ineq_abs_0_b = np.zeros(n)
    ineq_abs_1_A = np.hstack([np.zeros((n, n)), np.zeros((n, 1)), -np.eye(n), -np.eye(n)])
    ineq_abs_1_b = np.zeros(n)

    # Bounds
    # REMARK: Do not touch anything besides the first bounds-row!
    bounds = [(-50., 50.) for i in range(n)] + \
             [(None, None)] + \
             [(None, None) for i in range(n)] + \
             [(0, None) for i in range(n)]

    # Objective: minimize the sum of the abs(z) aux-vars
    c = np.zeros(n_plus_aux_vars)
    c[-n:] = 1

    A_eq = np.vstack((eq_mean_A, eq_A, eq_mean_A_z))
    b_eq = np.hstack([eq_mean_b, eq_b, eq_mean_b_z])
    A_ineq = np.vstack((ineq_abs_0_A, ineq_abs_1_A))
    b_ineq = np.hstack([ineq_abs_0_b, ineq_abs_1_b])

    print('solve...')
    result = spo.linprog(c, A_ineq, b_ineq, A_eq, b_eq, bounds=bounds, method='interior-point')
    print(result)

    x = result.x[:n]
    print('x: ', x)
    print('residual Ax-b: ', A.dot(x) - b)
    print('residual Ax-b %: ', np.divide(A.dot(x) - b, b))
    print('mean: ', result.x[n])
    print('x - mean: ', x - result.x[n])
    print('l1-norm(x - mean) / objective: ', np.linalg.norm(x - result.x[n], 1))

solve(A, b, n)
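One thing worth checking in the code above: the question states the bound 0.2 < w < 0.5, but the LP is solved with the much looser box (-50, 50) on the actual weights, which is exactly where the large negative and positive values come from. A sketch of the change (untested against this data, and consistent with the "do not touch anything besides the first bounds-row" remark) would be to tighten only that first bounds row:

# tighten the bounds on the n real weights; the aux-vars keep their bounds
bounds = [(0.2, 0.5) for i in range(n)] + \
         [(None, None)] + \
         [(None, None) for i in range(n)] + \
         [(0, None) for i in range(n)]

If linprog then reports the problem as infeasible, Aw = b simply cannot be met exactly inside that box; in that case relax the equalities into the objective (extra absolute-value terms for Aw - b) rather than widening the bounds. Also double-check the order of b: the Target row above reads 8531, 429386951, 1079115532, while the code builds b = [8531, 1079115532, 429386951]; if that swap is unintentional, it alone can make the equalities unreachable within 0.2-0.5.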
Related
How to generate a random integer as with np.random.randint(), but with a normal distribution around 0.
np.random.randint(-10, 10) returns integers with a discrete uniform distribution
np.random.normal(0, 0.1, 1) returns floats with a normal distribution
What I want is a kind of combination between the two functions.
One other way to get a discrete distribution that looks like the normal distribution is to draw from a multinomial distribution where the probabilities are calculated from a normal distribution.
import scipy.stats as ss
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(-10, 11)
xU, xL = x + 0.5, x - 0.5
prob = ss.norm.cdf(xU, scale = 3) - ss.norm.cdf(xL, scale = 3)
prob = prob / prob.sum() # normalize the probabilities so their sum is 1
nums = np.random.choice(x, size = 10000, p = prob)
plt.hist(nums, bins = len(x))
Here, np.random.choice picks an integer from [-10, 10]. The probability for selecting an element, say 0, is calculated by p(-0.5 < x < 0.5) where x is a normal random variable with mean zero and standard deviation 3. I chose a std. dev. of 3 because this way p(-10 < x < 10) is almost 1.
The result looks like this:
It may be possible to generate a similar distribution from a Truncated Normal Distribution that is rounded up to integers. Here's an example with scipy's truncnorm().
import numpy as np
from scipy.stats import truncnorm
import matplotlib.pyplot as plt
scale = 3.
range = 10
size = 100000
X = truncnorm(a=-range/scale, b=+range/scale, scale=scale).rvs(size=size)
X = X.round().astype(int)
Let's see what it looks like
bins = 2 * range + 1
plt.hist(X, bins)
The accepted answer here works, but I tried Will Vousden's solution and it works well too:
import numpy as np
import matplotlib.pyplot as plt
# Generate Distribution:
randomNums = np.random.normal(scale=3, size=100000)
randomInts = np.round(randomNums)
# Plot:
axis = np.arange(start=min(randomInts), stop = max(randomInts) + 1)
plt.hist(randomInts, bins = axis)
Old question, new answer:
For a bell-shaped distribution on the integers {-10, -9, ..., 9, 10}, you can use the binomial distribution with n=20 and p=0.5, and subtract 10 from the samples.
For example,
In [167]: import numpy as np
In [168]: import matplotlib.pyplot as plt
In [169]: rng = np.random.default_rng()
In [170]: N = 5000000 # Number of samples to generate
In [171]: samples = rng.binomial(n=20, p=0.5, size=N) - 10
In [172]: samples.min(), samples.max()
Out[172]: (-10, 10)
Note that the probability of -10 or 10 is pretty low, so you won't necessarily see them in any given sample, especially if you use a smaller N.
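For reference, that tail probability is easy to compute (a small sketch; scipy is assumed to be available):

from scipy.stats import binom

# P(X = 20) for X ~ Binomial(n=20, p=0.5), i.e. the chance of a sample equal to +10
print(binom.pmf(20, 20, 0.5))  # about 9.5e-07, roughly one in a million draws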
np.bincount() is an efficient way to generate a histogram for a collection of small nonnegative integers:
In [173]: counts = np.bincount(samples + 10, minlength=20)
In [174]: counts
Out[174]:
array([ 4, 104, 889, 5517, 22861, 73805, 184473, 369441,
599945, 800265, 881140, 801904, 600813, 370368, 185082, 73635,
23325, 5399, 931, 95, 4])
In [175]: plt.bar(np.arange(-10, 11), counts)
Out[175]: <BarContainer object of 21 artists>
This version is not mathematically correct (because you crop the bell), but it does the job quickly and is easy to understand if precision is not that important:
import numpy as np

def draw_random_normal_int(low: int, high: int):
    # generate a random normal number (float)
    normal = np.random.normal(loc=0, scale=1, size=1)

    # clip to [-3, 3] (where the bell with mean 0 and std 1 is already very close to zero)
    normal = -3 if normal < -3 else normal
    normal = 3 if normal > 3 else normal

    # scale the range of 6 (-3..3) to the range low..high
    scaling_factor = (high - low) / 6
    normal_scaled = normal * scaling_factor

    # center around the mean of the range low..high
    normal_scaled += low + (high - low)/2

    # then round and return
    return np.round(normal_scaled)
Drawing 100000 numbers results in this histogram:
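Something along these lines (a sketch, not the original plotting code) reproduces that histogram:

import numpy as np
import matplotlib.pyplot as plt

# draw 100000 integers; .item() flattens the single-element result the function returns
nums = [np.asarray(draw_random_normal_int(-10, 10)).item() for _ in range(100000)]
plt.hist(nums, bins=21)
plt.show()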
Here we start by getting values from the bell curve.
CODE:
#--------*---------*---------*---------*---------*---------*---------*---------*
# Desc: Discretize a normal distribution centered at 0
#--------*---------*---------*---------*---------*---------*---------*---------*
import sys
import random
from math import sqrt, pi
import numpy as np
import matplotlib.pyplot as plt
def gaussian(x, var):
    k1 = np.power(x, 2)
    k2 = -k1/(2*var)
    return (1./(sqrt(2. * pi * var))) * np.exp(k2)

#--------*---------*---------*---------*---------*---------*---------*---------#
while 1:  # M A I N   L I N E                                                   #
#--------*---------*---------*---------*---------*---------*---------*---------#
    # # probability density function
    # # for discrete normal RV
    pdf_DGV = []
    pdf_DGW = []
    var = 9
    tot = 0
    # # create 'rough' gaussian
    for i in range(-var - 1, var + 2):
        if i == -var - 1:
            r_pdf = + gaussian(i, 9) + gaussian(i - 1, 9) + gaussian(i - 2, 9)
        elif i == var + 1:
            r_pdf = + gaussian(i, 9) + gaussian(i + 1, 9) + gaussian(i + 2, 9)
        else:
            r_pdf = gaussian(i, 9)
        tot = tot + r_pdf
        pdf_DGV.append(i)
        pdf_DGW.append(r_pdf)
        print(i, r_pdf)
    # # amusing how close tot is to 1!
    print('\nRough total = ', tot)
    # # no need to normalize with Python 3.6,
    # # but can't help ourselves
    for i in range(0, len(pdf_DGW)):
        pdf_DGW[i] = pdf_DGW[i]/tot
    # # print out pdf weights
    # # for our discrete gaussian
    print('\npdf:\n')
    print(pdf_DGW)
    # # plot random variable action
    rv_samples = random.choices(pdf_DGV, pdf_DGW, k=10000)
    plt.hist(rv_samples, bins = 100)
    plt.show()
    sys.exit()
OUTPUT:
-10 0.0007187932912256041
-9 0.001477282803979336
-8 0.003798662007932481
-7 0.008740629697903166
-6 0.017996988837729353
-5 0.03315904626424957
-4 0.05467002489199788
-3 0.0806569081730478
-2 0.10648266850745075
-1 0.12579440923099774
0 0.1329807601338109
1 0.12579440923099774
2 0.10648266850745075
3 0.0806569081730478
4 0.05467002489199788
5 0.03315904626424957
6 0.017996988837729353
7 0.008740629697903166
8 0.003798662007932481
9 0.001477282803979336
10 0.0007187932912256041
Rough total = 0.9999715875468381
pdf:
[0.000718813714486599, 0.0014773247784004072, 0.003798769940305483, 0.008740878047691289, 0.017997500190860556, 0.033159988420867426, 0.05467157824565407, 0.08065919989878699, 0.10648569402724471, 0.12579798346031068, 0.13298453855078374, 0.12579798346031068, 0.10648569402724471, 0.08065919989878699, 0.05467157824565407, 0.033159988420867426, 0.017997500190860556, 0.008740878047691289, 0.003798769940305483, 0.0014773247784004072, 0.000718813714486599]
I'm not sure whether scipy's generators offer a choice of output dtype, but a common way to generate such data is with scipy.stats:
# Generate pseudodata from a single normal distribution
import scipy
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
dist_mean = 0.0
dist_std = 0.5
n_events = 500
toy_data = scipy.stats.norm.rvs(dist_mean, dist_std, size=n_events)
toy_data2 = [[i, j] for i, j in enumerate(toy_data )]
arr = np.array(toy_data2)
print("sample:\n", arr[1:500, 0])
print("bin:\n",arr[1:500, 1])
plt.scatter(arr[1:501, 1], arr[1:501, 0])
plt.xlabel("bin")
plt.ylabel("sample")
plt.show()
or in this way (again, there is no option to choose the dtype):
import numpy as np
import matplotlib.pyplot as plt
mu, sigma = 0, 0.1 # mean and standard deviation
s = np.random.normal(mu, sigma, 500)
count, bins, ignored = plt.hist(s, 30, density=True)
plt.show()
print(bins) # <<<<<<<<<<
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp( - (bins - mu)**2 / (2 * sigma**2) ),
linewidth=2, color='r')
plt.show()
Without visualization, the most common way (again with no way to specify the dtype):
bins = np.random.normal(3, 2.5, size=(10, 1))
A wrapper class could be written to return samples with a given dtype (e.g. by rounding the floats to integers, as mentioned above)...
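A minimal sketch of such a wrapper (the class name and its interface are made up for illustration):

import numpy as np

class NormalIntSampler:
    """Draws normal samples and rounds them to the requested integer dtype."""

    def __init__(self, mu=0.0, sigma=3.0, dtype=np.int64):
        self.mu = mu
        self.sigma = sigma
        self.dtype = dtype

    def rvs(self, size=1):
        samples = np.random.normal(self.mu, self.sigma, size=size)
        return np.round(samples).astype(self.dtype)

sampler = NormalIntSampler(mu=0, sigma=3)
print(sampler.rvs(size=10))  # ten integers, bell-shaped around 0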
I am trying to fit a quadratic function to some data, and I'm trying to do this without using numpy's polyfit function.
Mathematically I tried to follow this website https://neutrium.net/mathematics/least-squares-fitting-of-a-polynomial/, but somehow I don't think I'm doing it right. If anyone could assist me that would be great, or if you could suggest another way to do it, that would also be awesome.
What I've tried so far:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
ones = np.ones(3)
A = np.array( ((0,1),(1,1),(2,1)))
xfeature = A.T[0]
squaredfeature = A.T[0] ** 2
b = np.array( (1,2,0), ndmin=2 ).T
b = b.reshape(3)
features = np.concatenate((np.vstack(ones), np.vstack(xfeature), np.vstack(squaredfeature)), axis = 1)
featuresc = features.copy()
print(features)
m_det = np.linalg.det(features)
print(m_det)
determinants = []
for i in range(3):
    featuresc.T[i] = b
    print(featuresc)
    det = np.linalg.det(featuresc)
    determinants.append(det)
    print(det)
    featuresc = features.copy()
determinants = determinants / m_det
print(determinants)
plt.scatter(A.T[0],b)
u = np.linspace(0,3,100)
plt.plot(u, u**2*determinants[2] + u*determinants[1] + determinants[0] )
p2 = np.polyfit(A.T[0],b,2)
plt.plot(u, np.polyval(p2,u), 'b--')
plt.show()
As you can see, my curve doesn't compare well to numpy's polyfit curve.
Update:
I went through my code and removed the silly mistakes, and it now works when I fit over 3 points, but I have no idea how to fit over more than three points.
This is the new code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
ones = np.ones(3)
A = np.array( ((0,1),(1,1),(2,1)))
xfeature = A.T[0]
squaredfeature = A.T[0] ** 2
b = np.array( (1,2,0), ndmin=2 ).T
b = b.reshape(3)
features = np.concatenate((np.vstack(ones), np.vstack(xfeature), np.vstack(squaredfeature)), axis = 1)
featuresc = features.copy()
print(features)
m_det = np.linalg.det(features)
print(m_det)
determinants = []
for i in range(3):
    featuresc.T[i] = b
    print(featuresc)
    det = np.linalg.det(featuresc)
    determinants.append(det)
    print(det)
    featuresc = features.copy()
determinants = determinants / m_det
print(determinants)
plt.scatter(A.T[0],b)
u = np.linspace(0,3,100)
plt.plot(u, u**2*determinants[2] + u*determinants[1] + determinants[0] )
p2 = np.polyfit(A.T[0],b,2)
plt.plot(u, np.polyval(p2,u), 'r--')
plt.show()
Instead of using Cramer's Rule, solve the system using least squares. Remember that Cramer's Rule only works if the total number of points you have equals the desired polynomial order plus 1.
If you don't have that, Cramer's Rule will not work, because you're trying to find an exact solution to the problem; with more points you create an overdetermined system of equations and the method is unsuitable.
To adapt this to more points, numpy.linalg.lstsq is a better fit, as it solves Ax = b by computing the vector x that minimizes the Euclidean norm of the residual. Therefore, remove the y values from the last column of the features matrix and use numpy.linalg.lstsq to solve for the coefficients:
import numpy as np
import matplotlib.pyplot as plt
ones = np.ones(4)
xfeature = np.asarray([0,1,2,3])
squaredfeature = xfeature ** 2
b = np.asarray([1,2,0,3])
features = np.concatenate((np.vstack(ones),np.vstack(xfeature),np.vstack(squaredfeature)), axis = 1) # Change - remove the y values
determinants = np.linalg.lstsq(features, b)[0] # Change - use least squares
plt.scatter(xfeature,b)
u = np.linspace(0,3,100)
plt.plot(u, u**2*determinants[2] + u*determinants[1] + determinants[0] )
plt.show()
I get this plot now, which matches what the dashed curve is in your graph, also matching what numpy.polyfit gives you:
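For intuition, the same coefficients can also be obtained by solving the normal equations directly (a sketch for well-conditioned problems; numpy.linalg.lstsq remains the more robust choice in general):

import numpy as np

xfeature = np.asarray([0, 1, 2, 3])
b = np.asarray([1, 2, 0, 3])
features = np.column_stack((np.ones(4), xfeature, xfeature ** 2))

# Normal equations: (F^T F) c = F^T b
coeffs = np.linalg.solve(features.T @ features, features.T @ b)
print(coeffs)  # matches np.linalg.lstsq(features, b)[0]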
In my project I need to compute the entropy of 0-1 vectors many times. Here's my code:
def entropy(labels):
    """ Computes entropy of 0-1 vector. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    counts = np.bincount(labels)
    probs = counts[np.nonzero(counts)] / n_labels
    n_classes = len(probs)

    if n_classes <= 1:
        return 0
    return - np.sum(probs * np.log(probs)) / np.log(n_classes)
Is there a faster way?
Sanjeet Gupta's answer is good but could be condensed. This question specifically asks for the "fastest" way, but I only see timings in one answer, so I'll post a comparison of scipy and numpy against the original poster's entropy2 answer, with slight alterations.
Four different approaches: (1) scipy/numpy, (2) numpy/math, (3) pandas/numpy, (4) numpy
import numpy as np
from scipy.stats import entropy
from math import log, e
import pandas as pd
import timeit
def entropy1(labels, base=None):
    value, counts = np.unique(labels, return_counts=True)
    return entropy(counts, base=base)

def entropy2(labels, base=None):
    """ Computes entropy of label distribution. """
    n_labels = len(labels)
    if n_labels <= 1:
        return 0
    value, counts = np.unique(labels, return_counts=True)
    probs = counts / n_labels
    n_classes = np.count_nonzero(probs)
    if n_classes <= 1:
        return 0
    ent = 0.
    # Compute entropy
    base = e if base is None else base
    for i in probs:
        ent -= i * log(i, base)
    return ent

def entropy3(labels, base=None):
    vc = pd.Series(labels).value_counts(normalize=True, sort=False)
    base = e if base is None else base
    return -(vc * np.log(vc)/np.log(base)).sum()

def entropy4(labels, base=None):
    value, counts = np.unique(labels, return_counts=True)
    norm_counts = counts / counts.sum()
    base = e if base is None else base
    return -(norm_counts * np.log(norm_counts)/np.log(base)).sum()
Timeit operations:
repeat_number = 1000000
a = timeit.repeat(stmt='''entropy1(labels)''',
setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy1''',
repeat=3, number=repeat_number)
b = timeit.repeat(stmt='''entropy2(labels)''',
setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy2''',
repeat=3, number=repeat_number)
c = timeit.repeat(stmt='''entropy3(labels)''',
setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy3''',
repeat=3, number=repeat_number)
d = timeit.repeat(stmt='''entropy4(labels)''',
setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy4''',
repeat=3, number=repeat_number)
Timeit results:
# for loop to print out results of timeit
for approach, timeit_results in zip(['scipy/numpy', 'numpy/math', 'pandas/numpy', 'numpy'], [a, b, c, d]):
    print('Method: {}, Avg.: {:.6f}'.format(approach, np.array(timeit_results).mean()))
Method: scipy/numpy, Avg.: 63.315312
Method: numpy/math, Avg.: 49.256894
Method: pandas/numpy, Avg.: 884.644023
Method: numpy, Avg.: 60.026938
Winner: numpy/math (entropy2)
It's also worth noting that the entropy2 function above can handle numeric AND text data. ex: entropy2(list('abcdefabacdebcab')). The original poster's answer is from 2013 and had a specific use-case for binning ints but it won't work for text.
With the data as a pd.Series and scipy.stats, calculating the entropy of a given quantity is pretty straightforward:
import pandas as pd
import scipy.stats
def ent(data):
    """Calculates entropy of the passed `pd.Series`
    """
    p_data = data.value_counts()           # counts occurrence of each value
    entropy = scipy.stats.entropy(p_data)  # get entropy from counts
    return entropy
Note: scipy.stats will normalize the provided data, so this doesn't need to be done explicitly, i.e. passing an array of counts works fine.
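A quick usage sketch (the series values here are arbitrary):

import pandas as pd

s = pd.Series([1, 3, 5, 2, 3, 5, 3, 2, 1, 3, 4, 5])
print(ent(s))  # natural-log entropy of the value distribution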
An answer that doesn't rely on numpy, either:
import math
from collections import Counter
def eta(data, unit='natural'):
    base = {
        'shannon' : 2.,
        'natural' : math.exp(1),
        'hartley' : 10.
    }

    if len(data) <= 1:
        return 0

    counts = Counter()
    for d in data:
        counts[d] += 1

    ent = 0
    probs = [float(c) / len(data) for c in counts.values()]
    for p in probs:
        if p > 0.:
            ent -= p * math.log(p, base[unit])

    return ent
This will accept any datatype you could throw at it:
>>> eta(['mary', 'had', 'a', 'little', 'lamb'])
1.6094379124341005
>>> eta([c for c in "mary had a little lamb"])
2.311097886212714
The answer provided by Jarad suggests timings as well. To that end:
repeat_number = 1000000
e = timeit.repeat(
stmt='''eta(labels)''',
setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import eta''',
repeat=3,
number=repeat_number)
Timeit results: (I believe this is ~4x faster than the best numpy approach)
print('Method: {}, Avg.: {:.6f}'.format("eta", np.array(e).mean()))
Method: eta, Avg.: 10.461799
Following the suggestion from unutbu, I created a pure-Python implementation.
import numpy as np
from math import log

def entropy2(labels):
    """ Computes entropy of label distribution. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    counts = np.bincount(labels)
    probs = counts / n_labels
    n_classes = np.count_nonzero(probs)

    if n_classes <= 1:
        return 0

    ent = 0.

    # Compute standard entropy.
    for i in probs:
        ent -= i * log(i, n_classes)

    return ent
The point I was missing was that labels is a large array, whereas probs is only 3 or 4 elements long. Using pure Python, my application is now twice as fast.
Here is my approach:
labels = [0, 0, 1, 1]
from collections import Counter
from scipy import stats
stats.entropy(list(Counter(labels).values()), base=2)
My favorite function for entropy is the following:
def entropy(labels):
    prob_dict = {x: labels.count(x)/len(labels) for x in labels}
    probs = np.array(list(prob_dict.values()))
    return - probs.dot(np.log2(probs))
I am still looking for a nicer way to avoid the dict -> values -> list -> np.array conversion. Will comment again if I find it.
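One possible way around that conversion (a sketch, not part of the original answer) is to let np.unique produce the counts directly:

import numpy as np

def entropy_unique(labels):
    # np.unique returns the counts as an array, so no dict/list round-trip is needed
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / len(labels)
    return -probs.dot(np.log2(probs))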
Uniformly distributed data (high entropy):
s=range(0,256)
Shannon entropy calculation step by step:
import collections
import math
# calculate probability for each byte as number of occurrences / array length
probabilities = [n_x/len(s) for x,n_x in collections.Counter(s).items()]
# [0.00390625, 0.00390625, 0.00390625, ...]
# calculate per-character entropy fractions
e_x = [-p_x*math.log(p_x,2) for p_x in probabilities]
# [0.03125, 0.03125, 0.03125, ...]
# sum fractions to obtain Shannon entropy
entropy = sum(e_x)
>>> entropy
8.0
One-liner (assuming collections and math are imported):
def H(s): return sum([-p_x*math.log(p_x,2) for p_x in [n_x/len(s) for x,n_x in collections.Counter(s).items()]])
A proper function:
import collections
import math
def H(s):
    probabilities = [n_x/len(s) for x, n_x in collections.Counter(s).items()]
    e_x = [-p_x*math.log(p_x, 2) for p_x in probabilities]
    return sum(e_x)
Test cases - English text taken from CyberChef entropy estimator:
>>> H(range(0,256))
8.0
>>> H(range(0,64))
6.0
>>> H(range(0,128))
7.0
>>> H([0,1])
1.0
>>> H('Standard English text usually falls somewhere between 3.5 and 5')
4.228788210509104
This method extends the other solutions by allowing for binning. For example, bin=None (default) won't bin x and will compute an empirical probability for each element of x, while bin=256 chunks x into 256 bins before computing the empirical probabilities.
import numpy as np
def entropy(x, bins=None):
    N = x.shape[0]
    if bins is None:
        counts = np.bincount(x)
    else:
        counts = np.histogram(x, bins=bins)[0]  # 0th idx is counts
    p = counts[np.nonzero(counts)]/N            # avoids log(0)
    H = -np.dot(p, np.log2(p))
    return H
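A short usage sketch (the data here is arbitrary):

import numpy as np

x = np.random.randint(0, 10, size=1000)
print(entropy(x))          # empirical probability for each distinct value
print(entropy(x, bins=5))  # empirical probabilities from a 5-bin histogram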
This is the fastest Python implementation I've found so far:
import numpy as np
def entropy(labels):
    ps = np.bincount(labels) / len(labels)
    return -np.sum([p * np.log2(p) for p in ps if p > 0])
from collections import Counter
from scipy import stats
labels = [0.9, 0.09, 0.1]
stats.entropy(list(Counter(labels).keys()), base=2)
BiEntropy won't be the fastest way of computing entropy, but it is rigorous and builds upon Shannon entropy in a well-defined way. It has been tested in various fields including image-related applications.
It is implemented in Python on Github.
A bit late to the party, but I stumbled upon this and all the answers seem to rely on Kullback-Leibler divergence, which has no upper bound and hence doesn't fit my needs.
Here I have an approximation (the TODOs could be improved) of an entropy-like function whose value lies in [0, 1].
It calculates the bias of a single column.
import numpy as np

class Pandas_Dataframe_helper:
    # some other methods here...

    @staticmethod
    def column_biass(df_column):
        df_column_as_list = list(df_column)
        N = len(df_column_as_list)
        values, counts = np.unique(df_column_as_list, return_counts=True)
        # generate synthetic list (TODO: what if the split is not even? Use the least
        # common multiple of (num_different_labels, [x for x in counts]))
        num_different_labels = len(values)
        num_items_per_label = N // num_different_labels
        synthetic_list = []
        for current_value in values:
            synthetic_list.extend([current_value] * num_items_per_label)
        # TODO: approximation
        if len(synthetic_list) != len(df_column_as_list):
            synthetic_list.extend([current_value] * (len(df_column_as_list) - len(synthetic_list)))
        # now, extrapolate the differences between the sorted input list and synthetic_list
        df_column_as_list_sorted = sorted(df_column_as_list)
        counter_unmatches = 0
        for i in range(0, N):
            if df_column_as_list_sorted[i] != synthetic_list[i]:
                counter_unmatches += 1
        # upper_bound = g(N, num_different_labels)
        # ((K-1)*M)-1, K == num_different_labels, M == items per label in a theoretically perfect distribution
        upper_bound = ((num_different_labels - 1) * num_items_per_label) - 1
        return counter_unmatches / upper_bound
#---------------------------------------------------------------------------------------------------------------------
Complete code at https://github.com/glezo1/pcommonlibs/blob/master/com/glezo/pandas_dataframe_helper/Pandas_Dataframe_Helper.py
The above answer is good, but if you need a version that can operate along different axes, here's a working implementation.
def entropy(A, axis=None):
"""Computes the Shannon entropy of the elements of A. Assumes A is
an array-like of nonnegative ints whose max value is approximately
the number of unique values present.
>>> a = [0, 1]
>>> entropy(a)
1.0
>>> A = np.c_[a, a]
>>> entropy(A)
1.0
>>> A # doctest: +NORMALIZE_WHITESPACE
array([[0, 0], [1, 1]])
>>> entropy(A, axis=0) # doctest: +NORMALIZE_WHITESPACE
array([ 1., 1.])
>>> entropy(A, axis=1) # doctest: +NORMALIZE_WHITESPACE
array([[ 0.], [ 0.]])
>>> entropy([0, 0, 0])
0.0
>>> entropy([])
0.0
>>> entropy([5])
0.0
"""
if A is None or len(A) < 2:
return 0.
A = np.asarray(A)
if axis is None:
A = A.flatten()
counts = np.bincount(A) # needs small, non-negative ints
counts = counts[counts > 0]
if len(counts) == 1:
return 0. # avoid returning -0.0 to prevent weird doctests
probs = counts / float(A.size)
return -np.sum(probs * np.log2(probs))
elif axis == 0:
entropies = map(lambda col: entropy(col), A.T)
return np.array(entropies)
elif axis == 1:
entropies = map(lambda row: entropy(row), A)
return np.array(entropies).reshape((-1, 1))
else:
raise ValueError("unsupported axis: {}".format(axis))
def entropy(base, prob_a, prob_b):
    import math
    base = 2
    x = prob_a
    y = prob_b
    expression = -((x*math.log(x, base) + (y*math.log(y, base))))
    return [expression]
import numpy as np
from numpy.linalg import solve,norm,cond,inv,pinv
import math
import matplotlib.pyplot as plt
from scipy.linalg import toeplitz
from numpy.random import rand
c = np.zeros(512)
c[0] = 2
c[1] = -1
a = c
A = toeplitz(c,a)
cond_A = cond(A,2)
# creating 10 random vectors 512 x 1
b = rand(10,512)
# making b into unit vector
for i in range(10):
    b[i] = b[i]/norm(b[i], 2)
# creating 10 random del_b vectors
del_b = [rand(10,512), rand(10,512), rand(10,512), rand(10,512), rand(10,512), rand(10,512), rand(10,512), rand(10,512), rand(10,512), rand(10,512)]
# del_b = 10 sets of 10 vectors (512x1) whose norm is 0.01,0.02 ~0.1
for i in range(10):
    for j in range(10):
        del_b[i][j] = del_b[i][j]/(norm(del_b[i][j],2)/((float(j+1)/100)))
x_in = [np.zeros(512), np.zeros(512), np.zeros(512), np.zeros(512), np.zeros(512), np.zeros(512), np.zeros(512), np.zeros(512), np.zeros(512), np.zeros(512)]
x2 = np.zeros((10,10,512))
for i in range(10):
    x_in[i] = A.transpose()*b[i]
for i in range(10):
    for j in range(10):
        x2[i][j] = (A.transpose()*(b[i]+del_b[i][j]))
The LAST line is giving me the error ("output operand requires a reduction, but reduction is not enabled").
How do I fix it?
I'm new to Python, so please also let me know if there is an easier way to do this.
Thanks
The error you're seeing is caused by a mismatch in the dimensions of the arrays you have created, but your code is also quite inefficient: all the looping doesn't make optimal use of NumPy's automatic broadcasting. I've rewritten the code to do what it seems you want:
import numpy as np
from numpy.linalg import solve,norm,cond,inv,pinv
import math
import matplotlib.pyplot as plt
from scipy.linalg import toeplitz
from numpy.random import rand
# These should probably get more sensible names
Nvec = 10 # number of vectors in b
Nlevels = 11 # number of perturbation norm levels
Nd = 512 # dimension of the vector space
c = np.zeros(Nd)
c[0] = 2
c[1] = -1
a = c
# NOTE: I'm assuming you want A to be a matrix
A = np.asmatrix(toeplitz(c, a))
cond_A = cond(A,2)
# create Nvec random vectors Nd x 1
# Note: packing the vectors in the columns makes the next step easier
b = rand(Nd, Nvec)
# normalise each column of b to be a unit vector
b /= norm(b, axis=0)
# create Nlevels of Nd x Nvec random del_b vectors
del_b = rand(Nd, Nvec, Nlevels)
# del_b = 10 sets of 10 vectors (512x1) whose norm is 0.01,0.02 ~0.1
targetnorms = np.linspace(0.01, 0.1, Nlevels)
# cause the norms in the Nlevels dimension to be equal to the target norms
del_b /= norm(del_b, axis=0)[None, :, :]/targetnorms[None, None, :]
# Straight linear transformation - make sure you actually want the transpose
x_in = A.T*b
# same linear transformation on perturbed versions of b
x2 = np.zeros((Nd, Nvec, Nlevels))
for i in range(Nlevels):
    x2[:, :, i] = A.T*(b + del_b[:, :, i])
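As a quick sanity check (not part of the original answer), you can verify that the perturbation norms came out as intended:

# each column del_b[:, i, j] should have norm targetnorms[j]
print(np.allclose(norm(del_b, axis=0), targetnorms))  # expected: True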
I followed the advice of defining the autocorrelation function in another post:
def autocorr(x):
    result = np.correlate(x, x, mode='full')
    maxcorr = np.argmax(result)
    #print 'maximum = ', result[maxcorr]
    result = result / result[maxcorr]  # <=== normalization
    return result[result.size/2:]
However, the maximum value was not 1.0, so I introduced the line tagged with "<=== normalization".
I tried the function with the dataset from "Time Series Analysis" (Box-Jenkins), chapter 2. I expected to get a result like fig. 2.7 in that book, but instead I got the following:
Does anybody have an explanation for this strange, unexpected behaviour of the autocorrelation?
Addition (2012-09-07):
I got into Python programming and did the following:
from ClimateUtilities import *
import phys
#
# the above imports are from R.T.Pierrehumbert's book "principles of planetary
# climate"
# and the homepage of that book at "cambridge University press" ... they mostly
# define the
# class "Curve()" used in the below section which is not necessary in order to solve
# my
# numpy-problem ... :)
#
import numpy as np;
import scipy.spatial.distance;
# functions to be defined ... :
#
#
def autocorr(x):
    result = np.correlate(x, x, mode='full')
    maxcorr = np.argmax(result)
    # print 'maximum = ', result[maxcorr]
    result = result / result[maxcorr]
    #
    return result[result.size/2:]
##
# second try ... "Box and Jenkins" chapter 2.1 Autocorrelation Properties
# of stationary models
##
# from table 2.1 I get:
s1 = np.array([47,64,23,71,38,64,55,41,59,48,71,35,57,40,58,44,\
80,55,37,74,51,57,50,60,45,57,50,45,25,59,50,71,56,74,50,58,45,\
54,36,54,48,55,45,57,50,62,44,64,43,52,38,59,\
55,41,53,49,34,35,54,45,68,38,50,\
60,39,59,40,57,54,23],dtype=float);
# alternatively in order to test:
s2 = np.array([47,64,23,71,38,64,55,41,59,48,71])
##################################################################################3
# according to BJ, ch.2
###################################################################################3
print '*************************************************'
global s1short, meanshort, stdShort, s1dev, s1shX, s1shXk
s1short = s1
#s1short = s2 # for testing take s2
meanshort = s1short.mean()
stdShort = s1short.std()
s1dev = s1short - meanshort
#print 's1short = \n', s1short, '\nmeanshort = ', meanshort, '\ns1deviation = \n',\
# s1dev, \
# '\nstdShort = ', stdShort
s1sh_len = s1short.size
s1shX = np.arange(1,s1sh_len + 1)
#print 'Len = ', s1sh_len, '\nx-value = ', s1shX
##########################################################
# c0 to be computed ...
##########################################################
sumY = 0
kk = 1
for ii in s1shX:
    #print 'ii-1 = ',ii-1,
    if ii > s1sh_len:
        break
    sumY += s1dev[ii-1]*s1dev[ii-1]
    #print 'sumY = ',sumY, 's1dev**2 = ', s1dev[ii-1]*s1dev[ii-1]
c0 = sumY / s1sh_len
print 'c0 = ', c0
##########################################################
# now compute autocorrelation
##########################################################
auCorr = []
s1shXk = s1shX
lenS1 = s1sh_len
nn = 1 # factor by which lenS1 should be divided in order
# to reduce computation length ... 1, 2, 3, 4
# should not exceed 4
#print 's1shX = ',s1shX
for kk in s1shXk:
    sumY = 0
    for ii in s1shX:
        #print 'ii-1 = ',ii-1, ' kk = ', kk, 'kk+ii-1 = ', kk+ii-1
        if ii >= s1sh_len or ii + kk - 1 >= s1sh_len/nn:
            break
        sumY += s1dev[ii-1]*s1dev[ii+kk-1]
        #print sumY, s1dev[ii-1], '*', s1dev[ii+kk-1]
    auCorrElement = sumY / s1sh_len
    auCorrElement = auCorrElement / c0
    #print 'sum = ', sumY, ' element = ', auCorrElement
    auCorr.append(auCorrElement)
    #print '', auCorr
    #
    # manipulate s1shX
    #
    s1shX = s1shXk[:lenS1-kk]
    #print 's1shX = ',s1shX
#print 'AutoCorr = \n', auCorr
#########################################################
#
# first 15 of above Values are consistent with
# Box-Jenkins "Time Series Analysis", p.34 Table 2.2
#
#########################################################
s1sh_sdt = s1dev.std() # Standardabweichung short
#print '\ns1sh_std = ', s1sh_sdt
print '#########################################'
# "Curve()" is a class from RTP ClimateUtilities.py
c2 = Curve()
s1shXfloat = np.ndarray(shape=(1,lenS1),dtype=float)
s1shXfloat = s1shXk # to make floating point from integer
# might be not necessary
#print 'test plotting ... ', s1shXk, s1shXfloat
c2.addCurve(s1shXfloat)
c2.addCurve(auCorr, '', 'Autocorr')
c2.PlotTitle = 'Autokorrelation'
w2 = plot(c2)
##########################################################
#
# now try function "autocorr(arr)" and plot it
#
##########################################################
auCorr = autocorr(s1short)
c3 = Curve()
c3.addCurve( s1shXfloat )
c3.addCurve( auCorr, '', 'Autocorr' )
c3.PlotTitle = 'Autocorr with "autocorr"'
w3 = plot(c3)
#
# well that should it be!
#
So your problem with your initial attempt is that you did not subtract the average from your signal. The following code should work:
timeseries = (your data here)
mean = np.mean(timeseries)
timeseries -= np.mean(timeseries)
autocorr_f = np.correlate(timeseries, timeseries, mode='full')
temp = autocorr_f[autocorr_f.size/2:]/autocorr_f[autocorr_f.size/2]
iact.append(sum(autocorr_f[autocorr_f.size/2:]/autocorr_f[autocorr_f.size/2]))
In my example, temp is the variable you are interested in; it is the forward-integrated autocorrelation function. If you want the integrated autocorrelation time, you are interested in iact.
I'm not sure what the issue is.
The autocorrelation of a vector x has to be 1 at lag 0 since that is just the squared L2 norm divided by itself, i.e., dot(x, x) / dot(x, x) == 1.
In general, for any lags i, j in Z with i != j, the unit-scaled autocorrelation is dot(shift(x, i), shift(x, j)) / dot(x, x), where shift(y, n) is a function that shifts the vector y by n time points, and Z is the set of integers since we're talking about the implementation (in theory the lags can be in the set of real numbers).
I get 1.0 as the max with the following code (start on the command line as $ ipython --pylab), as expected:
In[1]: n = 1000
In[2]: x = randn(n)
In[3]: xc = correlate(x, x, mode='full')
In[4]: xc /= xc[xc.argmax()]
In[5]: xchalf = xc[xc.size / 2:]
In[6]: xchalf_max = xchalf.max()
In[7]: print xchalf_max
Out[1]: 1.0
The only time when the lag 0 autocorrelation is not equal to 1 is when x is the zero signal (all zeros).
The answer to your question is: no, there is no NumPy function that automatically performs standardization for you.
Besides, even if it did you would still have to check it against your expected output, and if you're able to say "Yes this performed the standardization correctly", then I would assume that you know how to implement it yourself.
I'm going to suggest that it might be the case that you've implemented their algorithm incorrectly, although I can't be sure since I'm not familiar with it.