Merging similar columns in NumPy, probability vector - python

I have a numpy array that looks like this:
[[0.1  0.2  0.6  0.1 ]
 [0.   1.   0.   1.01]]
It is a probability vector, where the second row corresponds to a value and the first row to the probability that this value is realized. (e.g. the probability of getting 1.0 is 20%)
When two values are close to each other, I want to merge their columns by adding up the probabilities. In this example I want to have:
[[0.7 0.3]
 [0.  1. ]]
My current solution involves 3 loops and is really slow for larger arrays. Does someone know an efficient way to program this in NumPy?

While it won't do exactly what you want, you could try to use np.histogram to tackle the problem.
For example, if you just want two "bins" as in your example, you could do
import numpy as np
x = np.array([[0.1, 0.2, 0.6, 0.1], [0.0, 1.0, 0.0, 1.01]])
hist, bin_edges = np.histogram(x[1, :], bins=[0, 1.0, 1.5], weights=x[0, :])
and then stack your histogram with the leading bin edges to get your output
print(np.stack([hist, bin_edges[:-1]]))
This will print
[[0.7 0.3]
[0. 1. ]]
You can use the bins parameter to get your desired output. I hope this helps.
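If you need the columns merged exactly rather than binned at fixed edges, a possible fully vectorized alternative is to group values with np.unique and sum the probabilities with np.bincount. This is only a sketch and assumes "close" means "equal after rounding to a chosen number of decimals":
import numpy as np
x = np.array([[0.1, 0.2, 0.6, 0.1], [0.0, 1.0, 0.0, 1.01]])
decimals = 1  # tolerance: values equal after rounding to 1 decimal are merged
values, inverse = np.unique(np.round(x[1, :], decimals), return_inverse=True)
probs = np.bincount(inverse, weights=x[0, :])
print(np.stack([probs, values]))
# [[0.7 0.3]
#  [0.  1. ]]
Note that rounding only merges values that land in the same rounding cell, so e.g. 0.95 and 1.04 would still end up in different groups even though they are within 0.1 of each other.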

Related

Functionally is torch.multinomial the same as torch.distributions.categorical.Categorical?

For example, if I provide a probability array of [0.5, 0.5], will both functions sample the indices 0 and 1 with equal probability?
Yes:
[torch.distributions.categorical.Categorical()] is equivalent to the distribution that torch.multinomial() samples from.
https://pytorch.org/docs/stable/distributions.html#categorical
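A quick empirical check (only a sketch, sampling from both APIs and comparing the frequency of index 1; the sample size of 10000 is arbitrary):
import torch
probs = torch.tensor([0.5, 0.5])
# torch.multinomial samples indices in proportion to the given weights
idx_a = torch.multinomial(probs, num_samples=10000, replacement=True)
# Categorical(probs) defines the same distribution; .sample() draws from it
idx_b = torch.distributions.Categorical(probs=probs).sample((10000,))
print(idx_a.float().mean().item(), idx_b.float().mean().item())  # both close to 0.5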

Find linear combination of vectors that is the best fit for a target vector

I am trying to find weights across a number of forecasts to give a result that is as close as possible (say, mean squared error) to a known target.
Here is a simplified example showing three different types of forecast across four data points:
target = [1.0, 1.02, 1.01, 1.04] # all approx 1.0
forecasts = [
    [0.9, 0.91, 0.92, 0.91],  # all approx 0.9
    [1.1, 1.11, 1.13, 1.11],  # all approx 1.1
    [1.21, 1.23, 1.21, 1.23]  # all approx 1.2
]
where one forecast is always approximately 0.9, one is always approximately 1.1 and one is always approximately 1.2.
I'd like an automated way of finding weights of approximately [0.5, 0.5, 0.0] for the three forecasts because averaging the first two forecasts and ignoring the third is very close to the target. Ideally the weights would be constrained to be non-negative and sum to 1.
I think I need to use some form of linear programming or quadratic programming to do this. I have installed the Python quadprog library, but I'm not sure how to translate this problem into the form that solvers like this require. Can anyone point me in the right direction?
If I understand you correctly, you want to model an optimization problem and solve it. In the general case (without any constraints), your problem is essentially an ordinary least-squares problem (which you could solve with scikit-learn, for example).
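For reference, a minimal unconstrained baseline can be computed with np.linalg.lstsq; this is only a sketch and ignores the non-negativity and sum-to-1 constraints:
import numpy as np
forecasts = np.array([[0.9, 0.91, 0.92, 0.91],
                      [1.1, 1.11, 1.13, 1.11],
                      [1.21, 1.23, 1.21, 1.23]])
target = np.array([1.0, 1.02, 1.01, 1.04])
# ordinary least squares: the weights may be negative and need not sum to 1
w, *_ = np.linalg.lstsq(forecasts.T, target, rcond=None)
print(w)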
For the constrained problem I recommend the cvxpy library. It is a convenient way to model a convex optimization problem, and you can choose which solver runs in the background.
Expanding on cvxpy's least-squares example by adding the constraints you mentioned:
# Import packages.
import cvxpy as cp
import numpy as np
# Generate data.
m = 20
n = 15
np.random.seed(1)
A = np.random.randn(m, n)
b = np.random.randn(m)
# Define and solve the CVXPY problem.
x = cp.Variable(n)
cost = cp.sum_squares(A @ x - b)
prob = cp.Problem(cp.Minimize(cost), [x>=0, cp.sum(x)==1])
prob.solve()
# Print result.
print("\nThe optimal value is", prob.value)
print("The optimal x is")
print(x.value)
print("The norm of the residual is ", cp.norm(A # x - b, p=2).value)
In this example, A (the matrix) stacks all of your forecast vectors, x (the variable) is the weight vector, and b is the known target.
EDIT:
example with your data:
forecasts = np.array([
    [0.9, 0.91, 0.92, 0.91],
    [1.1, 1.11, 1.13, 1.11],
    [1.21, 1.23, 1.21, 1.23]
])
target = np.array([1.0, 1.02, 1.01, 1.04])
x = cp.Variable(forecasts.shape[0])
cost = cp.sum_squares(forecasts.T @ x - target)
prob = cp.Problem(cp.Minimize(cost), [x >= 0, cp.sum(x) == 1])
prob.solve()
print("\nThe optimal value is", prob.value)
print("The optimal x is")
print(x.value)
Output:
The optimal value is 0.0005306233766233817
The optimal x is
[ 6.52207792e-01 -1.45736370e-24 3.47792208e-01]
The result is approximately [0.65, 0, 0.35], which differs from the [0.5, 0.5, 0.0] you mentioned, but that depends on how you define your problem: this is the least-squares solution.
We can see this problem as a least squares, which is indeed equivalent to quadratic programming. If I understand correctly, the weight vector you are looking for is a convex combination, so in least squares form the problem is:
minimize || [w0 w1 w2] * forecasts - target ||^2
s.t. w0 >= 0, w1 >= 0, w2 >= 0
w0 + w1 + w2 == 1
There is a least-squares function you can use out of the box in the qpsolvers package:
import numpy as np
from qpsolvers import solve_ls
target = np.array(target)
forecasts = np.array(forecasts)
w = solve_ls(forecasts.T, target, G=-np.eye(3), h=np.zeros(3), A=np.array([1, 1., 1]), b=np.array([1.]))
You can check in the documentation that the matrices G, h, A and b correspond to the problem above. Using quadprog as the backend solver, I get the following solution on my machine:
In [6]: w
Out[6]: array([6.52207792e-01, 9.94041282e-15, 3.47792208e-01])
In [7]: np.dot(w, forecasts)
Out[7]: array([1.00781558, 1.02129351, 1.02085974, 1.02129351])
Which is the same solution as in Roim's answer. (CVXPY is indeed a great way to start!)

Choosing variables for scipy.optimize from a pre-defined set

I am trying to minimize a function with scipy.optimize with three input variables, two of which are bounded and one has to be chosen from a set of values. To ensure that the third variable is chosen from a predefined set of values, I introduced the following constraint:
from scipy.optimize import rosen, shgo
import numpy as np

# Set of values the third variable to be optimized may take
Z = np.array([-1, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1])

def Rosen_Test(x):  # arbitrary objective function
    print(x)
    return rosen(x)**2 - np.sin(x[0])

def Cond_1(x):
    if x[2] in Z:
        return 1
    else:
        return -1

bounds = [(-512, 512)] * 3
conds = ({'type': 'ineq', 'fun': Cond_1},)
result = shgo(Rosen_Test, bounds, constraints=conds)
print(result)
However, when looking at the values printed from Rosen_Test, it is evident that the condition is not being enforced - perhaps the condition is not defined correctly?
I was wondering if anyone has any ideas to ensure that the third variable can be chosen from a set.
Note: the shgo method was chosen because constraints can be introduced and changed. I am also open to using other optimization packages if this condition can be met.
The inequality constraints do not work like that.
As mentioned in the docs they are defined as
g(x) <= 0
and you need to write g(x) accordingly. In your case that is not the case: you are only returning a single scalar for one dimension. You need to return a vector of shape (3,).
In your case you could try to use an equality constraint instead, as this allows a slightly better hack. But I am still not sure if it will work, as these optimizers are not designed for membership tests like this.
And the whole thing will probably leave the optimizer with a rather bumpy and discontinuous objective function. You can read up on Mixed-Integer Nonlinear Programming (MINLP), maybe start here.
There is one more reason why your approach won't work as expected:
as optimizers work with floating-point numbers, they will most likely never hit a number from your array exactly while guessing new solutions.
This illustrates the issue:
import numpy as np
Z = np.array([-1, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1])
print(0.7999999 in Z) # False, this is what the optimizer will find
print(0.8 in Z) # True, this is what you want
Maybe you should try to define your problem in a way that allows you to use an inequality constraint on the whole range of Z.
But let's see how the equality-constraint idea could work.
An equality constraint is defined as
h(x) == 0
So you could use
def Cond_1(x):
    if x[2] in Z:
        return np.zeros_like(x)
    else:
        return np.ones_like(x) * 1.0  # maybe multiply by some scalar?
The idea is to return an array [0.0, 0.0, 0.0] that satisfies the equality constraint if the number is found. Else return [1.0, 1.0, 1.0] to show that it is not satisfied.
Caveats:
1.)
You might have to tune this to return an array like [0.0, 0.0, 1.0], to show the optimizer which dimension you are unhappy about, so that it can make better guesses by adjusting only a single dimension.
2.)
You might have to return a value larger than 1.0 to signal a non-satisfied equality constraint. This depends on the implementation: the optimizer could decide that 1.0 is fine because it is close to 0.0. So maybe you have to try something like [0.0, 0.0, 999.0].
This solves the problem with the dimensions, but it still will not find any numbers, due to the floating-point issue mentioned above.
But we can try to hack this like
import numpy as np

Z = np.array([-1, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1])

def Cond_1(x):
    # how close you want to get to the numbers in your array
    tolerance = 0.001
    delta = np.abs(x[2] - Z)
    print(delta)
    print(np.min(delta) < tolerance)
    if np.min(delta) < tolerance:
        return np.zeros_like(x)
    else:
        # maybe you have to multiply this with some scalar;
        # we need a value telling the optimizer "NOT THIS ONE!!!"
        return np.ones_like(x) * 1.0

sol = np.array([0.5123, 0.234, 0.2])
print(Cond_1(sol))    # [0. 0. 0.] -> within tolerance of a value in Z
sol = np.array([0.5123, 0.234, 0.202])
print(Cond_1(sol))    # [1. 1. 1.] -> not within tolerance
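To actually wire this hack into shgo you would register Cond_1 as an equality constraint rather than the inequality from the question. A sketch, assuming Rosen_Test and Cond_1 as defined above and that shgo's local minimizer (SLSQP by default) accepts the 'eq' type; whether the optimizer can make progress with such a discontinuous constraint is the open question discussed above:
from scipy.optimize import shgo
bounds = [(-512, 512)] * 3
conds = ({'type': 'eq', 'fun': Cond_1},)  # equality instead of inequality
result = shgo(Rosen_Test, bounds, constraints=conds)
print(result.x, result.fun)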
Here are some recommendations on optimization. To make sure it works in a reliable way, try to start the optimization at different initial values. Global optimization algorithms might not take initial values when used with bounds; the optimizer somehow discretizes the space.
What you could do to check the reliability of your optimization and get better overall results:
Optimize on the complete region [-512, 512] (for all three dimensions)
Try 1/2 of that: [-512, 0] and [0, 512] (8 sub-optimizations, 2 for each dimension)
Try 1/3 of that: [-512, -171], [-171, 170], [170, 512] (27 sub-optimizations, 3 for each dimension)
Now compare the converged results to see if the complete global optimization found the same result
If the global optimizer did not find the "real" minimum but the sub-optimizations did:
your objective function is too difficult on the whole domain
try a different global optimizer
tune the parameters (maybe the 999 for the equality constraint)
I often use the sub-optimization as part of the normal process, not only for testing. Especially for blackbox problems.
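A minimal sketch of that sub-domain strategy, assuming the (corrected) Rosen_Test objective from the question without the print, and leaving the membership constraint aside; itertools.product builds one bounds tuple per sub-region:
import numpy as np
from itertools import product
from scipy.optimize import rosen, shgo

def Rosen_Test(x):
    return rosen(x)**2 - np.sin(x[0])

# split [-512, 512] into two halves per dimension -> 2**3 = 8 sub-problems
halves = [(-512.0, 0.0), (0.0, 512.0)]
results = [shgo(Rosen_Test, bounds) for bounds in product(halves, repeat=3)]
best = min(results, key=lambda r: r.fun)
print(best.x, best.fun)  # compare against the single run over the full region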
Please also see these answers:
Scipy.optimize Inequality Constraint - Which side of the inequality is considered?
Scipy minimize constrained function

Matplotlib: Probability Mass Graph

Why does the total probability exceed 1?
import matplotlib.pyplot as plt
figure, axes = plt.subplots(nrows = 1, ncols = 1)
axes.hist(x = [0.1, 0.2, 0.3, 0.4], density = True)
figure.show()
Expected y-values: [0.25, 0.25, 0.25, 0.25]
Following is my understanding as per the documentation. I don't claim to be an expert in matplotlib, nor am I one of the authors. Your question made me think, so I read the documentation and took some logical steps to understand it. So this is not an expert opinion.
===================================================================
Since you have not passed bins information, matplotlib went ahead and created its own bins. In this case the bins are as below.
bins = [0.1 , 0.13, 0.16, 0.19, 0.22, 0.25, 0.28, 0.31, 0.34, 0.37, 0.4 ]
You can see the bin width is 0.03.
Now according to the documentation.
density : bool, optional
If True, the first element of the return
tuple will be the counts normalized to form a probability density,
i.e., the area (or integral) under the histogram will sum to 1. This
is achieved by dividing the count by the number of observations times
the bin width and not dividing by the total number of observations.
In order to make the area sum to 1, matplotlib normalizes the counts so that multiplying the normalized count in each bin by the bin width and summing the products gives 1.
Your counts for x = [0.1, 0.2, 0.3, 0.4] are as below (one count per bin, 10 bins):
OriginalCounts = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
As you can see, if you multiply the OriginalCounts array by the bin width and sum everything, you get 4*0.03 = 0.12, which is less than one.
So according to the documentation we need to divide the OriginalCounts array by a factor, which is (number of observations * bin width).
In this case the number of observations is 4 and the bin width is 0.03, so the factor is 4*0.03 = 0.12. Dividing each element of OriginalCounts by 0.12 gives the normalized histogram values.
That means that the revised counts are as below
NormalizedCounts = [8.33333333, 0. , 0. , 8.33333333, 0. , 0. , 8.33333333, 0. , 0. , 8.33333333]
Please note that now, if you sum the normalized counts multiplied by the bin width, the result equals 1. You can quickly check this: 8.333333 * 4 * 0.03 = 0.9999999..., which is very close to 1.
These normalized counts are what is finally shown in the graph. This is why the histogram has bars of height close to 8.33 at four positions.
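A quick sketch to verify these numbers with np.histogram (which applies the same density normalization) and the same default of 10 bins:
import numpy as np
counts, bins = np.histogram([0.1, 0.2, 0.3, 0.4], bins=10, density=True)
print(bins)                            # the 11 bin edges listed above
print(counts)                          # approx [8.33, 0, 0, 8.33, 0, 0, 8.33, 0, 0, 8.33]
print(np.sum(counts * np.diff(bins)))  # 1.0 -- the area under the histogram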
Hope this helps.

Distributions and p-values in python

I have a big list of numbers, and I would like to create a distribution out of this data, plot it, then find the p-value for every number in my list with regards to the distribution.
Is it possible to do this in Python? I can't find it in the matplotlib documentation. Should I be using something else?
I would suggest looking into the stats module of scipy; it offers numerous statistical functions for things like this. For plotting, I would still use matplotlib.
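For example, a minimal sketch along those lines, assuming your numbers are roughly normal (the choice of distribution is up to you, and "p-value" here means the upper-tail probability under the fitted distribution):
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

data = np.random.normal(loc=5.0, scale=2.0, size=1000)  # stand-in for your list of numbers
mu, sigma = stats.norm.fit(data)        # fit a normal distribution to the data
pvals = stats.norm.sf(data, mu, sigma)  # upper-tail probability for every number

plt.hist(data, bins=30, density=True)        # the empirical distribution
xs = np.linspace(data.min(), data.max(), 200)
plt.plot(xs, stats.norm.pdf(xs, mu, sigma))  # the fitted density
plt.show()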
You can use the searchsorted function from NumPy, which gives you the position each of a set of values would take in a sorted array. You can then turn that into a p-value simply by dividing by the length of the original array:
import numpy as np

data = np.sort(np.random.rand(10))   # reference sample, sorted
new_data = np.random.rand(5)         # new values to evaluate
pvals = np.searchsorted(data, new_data) / len(data)
print(pvals)
# e.g. [0.3 0.7 0.1 0.9 0.5] -- the fraction of data below each new value
In fact, if you want the p-values of the original numbers, you don't need any special function at all: the p-values are just each value's rank in the sorted dataset divided by its length.
If you need the p-values of new values with respect to your original ones, you can use the snippet above.
