I ran into trouble doing a numpy 2D-array division.
I have a 2D numpy array A (shape=(N, N)). I divide it by its row sums (np.sum(A, axis=1)) to get a 2D array B, but when I compute the row sums of B, some rows do not sum exactly to one. The code follows:
(python2.7.x)
from __future__ import division
import numpy as np
A = np.array([[x_11, x_12, ..., x_1N],
              [x_21, x_22, ..., x_2N],
              [ ...   ...  ...  ... ],
              [x_N1, x_N2, ..., x_NN]])  # x_ij are some np.float64 values
B = A / np.sum(A, axis=1, keepdims=True)
Expected result:
np.count_nonzero(np.sum(B, axis=1) != 1)
# it should be 0
Actual result:
np.count_nonzero(np.sum(B, axis=1) != 1)
# something bigger than 0
I believe the reason is lost precision, even though I use dtype=np.float64. In my project the 2D array A has shape (N, N) with N > 8000, and within the same row most of the values are very small (e.g. 1.0) while the others are very big (e.g. 2000).
I have tried this workaround: add the lost amount back.
while np.count_nonzero(np.sum(B, axis=1) != 1) != 0:
    losts = 1 - np.sum(B, axis=1)
    B[:, i] += losts  # the i may change depending on some conditions
This does eventually fix the sums, but it is not good for the next step in my project.
Could anyone help me? Thanks a lot!!!
When working with floating-point numbers you lose precision, and floating-point results very rarely match integers exactly.
A simple test to demonstrate this is:
>>> 0.1 + 0.2 == 0.3
False
This is because the floating point representation of 0.1 + 0.2 is 0.30000000000000004.
To solve this you just need to switch to np.isclose or np.allclose:
import numpy as np
N = 100
A = np.random.randn(N, N)
B = A / np.sum(A, axis=1, keepdims=True)
Then:
>>> np.count_nonzero(np.sum(B, axis=1) != 1)
79
whereas
>>> np.allclose(np.sum(B, axis=1), 1)
True
In short, your rows are properly normalized, they just don't sum exactly to 1.
From the documentation np.isclose(a, b) is equivalent to:
absolute(a - b) <= (atol + rtol * absolute(b))
with atol=1e-8 and rtol=1e-5 by default, which is the proper way of checking that two floating-point numbers represent (at least approximately) the same number.
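For example, to count how many rows fail even the approximate check (a small sketch reusing the B from above):
not_close = np.count_nonzero(~np.isclose(np.sum(B, axis=1), 1))
print(not_close)
# 0 -- every row is normalized within the default tolerances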
Related
I have a 1-D array arr and I need to compute the variance of all possible contiguous subvectors that begin at position 0. It may be easier to understand with a for loop:
np.random.seed(1)
arr = np.random.normal(size=100)
res = []
for i in range(1, arr.size+1):
    subvector = arr[:i]
    var = np.var(subvector)
    res.append(var)
Is there any way to compute res without the for loop?
Yes: since var = sum_squares/N - mean**2 and mean = sum/N, you can use cumsum to get the cumulative sums:
cumsum = np.cumsum(arr)
cummean = cumsum/(np.arange(len(arr)) + 1)
sq = np.cumsum(arr**2)
# this is the population variance (ddof=0); correct the denominator here if you need another ddof
cumvar = sq/(np.arange(len(arr))+1) - cummean**2
np.allclose(res, cumvar)
# True
With pandas, you could use expanding:
import pandas as pd
pd.Series(arr).expanding().var(ddof=0).values
NB: one of the advantages is that you can use var's parameters (ddof=1 by default), and of course you can run many other expanding methods.
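If you do want the sample variance (ddof=1) from the pure-numpy approach, a small sketch (not part of the original answers) is to rescale the ddof=0 result computed above:
n = np.arange(1, len(arr) + 1)
with np.errstate(divide='ignore', invalid='ignore'):
    cumvar_ddof1 = cumvar * n / (n - 1)  # undefined for the first element (n = 1)
np.allclose(cumvar_ddof1[1:], pd.Series(arr).expanding().var().values[1:])
# True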
I need help computing a mathematical expression using only numpy operations. The expression I want to compute is the sum over all pairs (i, j) of the product over s of f(x[i, s], x[j, s]) (exactly what the loop below computes), where x is an (N, S) array and f is a numpy function (that can work with broadcastable arrays, e.g. np.maximum, np.sum, np.prod, ...). If that is of importance, in my case f is a symmetric function.
So far my code looks like this:
s = 0
for xp in x:  # loop over N...
    s += np.sum(np.prod(f(xp, x), axis=1))
It still has a loop over N that I'd like to get rid of.
Typically N is "large" (around 30k) but S is small (less than 20), so if anyone can find a trick that only loops over S, that would still be a major improvement.
I believe the problem would be easy if I replicated the array N times, but an array of size (32768, 32768, 20) requires about 150 GB of RAM, which I don't have. However, (32768, 32768) fits in memory, though I would appreciate a solution that does not allocate such an array.
Maybe a use of np.einsum with well-chosen arrays is possible?
Thanks for your replies. If any information is missing, let me know!
Have a nice day!
Edit 1:
The forms of f I'm interested in include (for now): f(x, y) = |x - y|, f(x, y) = |x - y|^2, f(x, y) = 2 - max(x, y).
Your loop is already quite efficient. Some possible approaches are:
Method-1 (looping over S)
import numpy as np

def f(x, y):
    return np.abs(x - y)

N = 200
S = 20
x_data = np.random.rand(N, S)  # (i, s)
y_data = np.random.rand(N, S)  # (i', s)

# build the (N, N) product over s one slice at a time
product = f(np.broadcast_to(x_data[:, 0][..., None], (N, N)),
            np.broadcast_to(y_data[:, 0][..., None], (N, N)).T)
for i in range(1, S):
    product *= f(np.broadcast_to(x_data[:, i][..., None], (N, N)),
                 np.broadcast_to(y_data[:, i][..., None], (N, N)).T)
total = np.sum(product)
Method-2 (processing the rows in S blocks)
import numpy as np

def f(x, y):
    # pairwise |x_i - y_j| for a block of rows x against all rows y, shape (len(x), len(y), S)
    x1 = np.broadcast_to(x[:, None, ...], (x.shape[0], y.shape[0], x.shape[1]))
    y1 = np.broadcast_to(y[None, ...], (x.shape[0], y.shape[0], x.shape[1]))
    return np.abs(x1 - y1)

def f1(x1, y1):
    return np.abs(x1 - y1)

N = 5000
S = 20
x_data = np.random.rand(N, S)  # (i, s)
y_data = np.random.rand(N, S)  # (i', s)

def fun_new(x_data1, y_data1):
    # process the rows of x_data1 in S equal blocks (np.split requires N to be divisible by S)
    s = 0
    pp = np.split(x_data1, S, axis=0)
    for xp in pp:
        s += np.sum(np.prod(f(xp, y_data1), axis=2))
    return s

def fun_op(x_data1, y_data1):
    # the original row-by-row loop, kept for comparison
    s = 0
    for xp in x_data1:  # loop over N...
        s += np.sum(np.prod(f1(xp, y_data1), axis=1))
    return s

fun_new(x_data, y_data)
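A quick sanity check (a sketch, not in the original answer) that the blocked version matches the original loop:
out_new = fun_new(x_data, y_data)
out_op = fun_op(x_data, y_data)
print(np.isclose(out_new, out_op))  # expect True, up to floating-point summation order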
I have been working on converting some MATLAB code to Python for one of my professors (not an assignment, just putting some material together) and I am stuck on this one part.
When I run the code I get UnitTypeError: "Can only apply 'exp' function to dimensionless quantities". None of the methods I have tried fix this. I imagine the error is caused by the linspace command but am not sure. Any help with this would be great.
Here is the line:
IM0 = ((2*h*c**2)/(l**5))/(np.exp(h*c/( h*c/(k*T1*l)))-1)
with the constants coming from astropy:
h = const.h;
c = const.c;
k = const.k_B;
l = np.linspace(0, 1.5e-6, 1500);
T1 = 3750
The astropy constants are instances of classes. Try extracting the value of each one before using them as arguments to np.exp():
import astropy.constants as const
import numpy as np
h = const.h.value
c = const.c.value
k = const.k_B.value
l = np.linspace(0, 1.5e-6, 1500);
T1 = 3750
IM0 = ((2*h*c**2)/(l**5))/(np.exp(h*c/( h*c/(k*T1*l)))-1)
However, please note that there are numerical problems with IM0: the denominator np.exp(...) - 1 evaluates to zero for every element of l.
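A quick check of that claim (a small sketch reusing the bare values defined above):
exponent = k*T1*l  # algebraically the same as h*c/(h*c/(k*T1*l)); at most about 8e-26
print(np.exp(exponent).max())
# 1.0 exactly in float64, so np.exp(...) - 1 is zero everywhere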
The accepted answer is fine as long as you're confident that the bare values you're using are in compatible units. But in general, using .value and throwing away the units information can be dangerous: using Quantities with units ensures that all your calculations are done with compatible units.
Let's look at just the exponent in your exponential, which normally should be a dimensionless quantity.
First note that all the constants you're using from Astropy have units:
>>> from astropy.constants import h, c, k_B
>>> h
<<class 'astropy.constants.codata2018.CODATA2018'> name='Planck constant' value=6.62607015e-34 uncertainty=0.0 unit='J s' reference='CODATA 2018'>
>>> c
<<class 'astropy.constants.codata2018.CODATA2018'> name='Speed of light in vacuum' value=299792458.0 uncertainty=0.0 unit='m / s' reference='CODATA 2018'>
>>> k_B
<<class 'astropy.constants.codata2018.CODATA2018'> name='Boltzmann constant' value=1.380649e-23 uncertainty=0.0 unit='J / K' reference='CODATA 2018'>
You then declared some unitless values and mixed them in with those:
>>> T1 = 3750
>>> l = np.linspace(0, 1.5e-6, 1500)
>>> h*c/(h*c/(k_B*T1*l))
/home/embray/.virtualenvs/astropy/lib/python3.6/site-packages/astropy/units/quantity.py:481: RuntimeWarning: divide by zero encountered in true_divide
result = super().__array_ufunc__(function, method, *arrays, **kwargs)
<Quantity [0.00000000e+00, 5.18088768e-29, 1.03617754e-28, ...,
           7.75578885e-26, 7.76096974e-26, 7.76615063e-26] J / K>
This gives a result in joules per kelvin; the k_B needs to be cancelled out by values carrying the right units. I'm guessing T1 is supposed to be a temperature in kelvin (I'm not sure about l, but let's say it's a thermodynamic beta in J^-1, though you should double-check what units this value is supposed to be in).
So what you probably want to do is declare these values with appropriate units (as an aside, you can avoid the annoying divide-by-zero by defining some epsilon and using that as the start of your range):
>>> from astropy import units as u
>>> eps = np.finfo(float).eps
>>> T1 = 3750 * u.K
>>> l = np.linspace(eps, 1.5e-6, 1500) * (u.J**-1)
Now your exponent is properly a dimensionless quantity:
>>> h*c/(h*c/(k_B*T1*l))
<Quantity [1.14962123e-35, 5.18088882e-29, 1.03617765e-28, ...,
           7.75578885e-26, 7.76096974e-26, 7.76615063e-26]>
>>> np.exp(h*c/(h*c/(k_B*T1*l)))
<Quantity [1., 1., 1., ..., 1., 1., 1.]>
(In this case the dimensionless values were all so close to zero that the exponents round to 1--if this is not correct then you'll want to check some of my assumptions about your units).
In any case, this is how the library is intended to be used, and the error you're getting is a deliberate safety check against your assumptions.
Update: I saw in your other question that you gave some more context to your problem, in particular specifying that l are wavelengths in meters (this was my first guess but I wasn't sure based on the equation you gave).
Actually you can avoid directly using h and c in your Planck equation by taking advantage of equivalencies. Here you can define l as wavelengths in meters:
>>> l = np.linspace(eps, 1.5e-6, 1500) * u.m
and convert this directly to spectral energy:
>>> E = l.to(u.J, equivalencies=u.spectral())
>>> E
<Quantity [8.94615682e-10, 1.98512112e-16, 9.92560670e-17, ...,
           1.32606651e-19, 1.32518128e-19, 1.32429724e-19] J>
Then write the exponent in your Planck's law equation like:
>>> np.exp(E / (k_B * T1))
/home/embray/.virtualenvs/astropy/lib/python3.6/site-packages/astropy/units/quantity.py:481: RuntimeWarning: overflow encountered in exp
result = super().__array_ufunc__(function, method, *arrays, **kwargs)
<Quantity [ inf, inf, inf, ..., 12.95190431,
           12.92977839, 12.90771972]>
(here it gives some infinities near low wavelengths, but you can avoid this by clipping to a larger lower bound).
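Putting the pieces together, a minimal sketch of the full expression with units (assuming l is a wavelength in metres and T1 a temperature in kelvin; this is the standard form of Planck's law rather than the exact expression from the question):
import numpy as np
from astropy import units as u
from astropy.constants import h, c, k_B

eps = np.finfo(float).eps
l = np.linspace(eps, 1.5e-6, 1500) * u.m
T1 = 3750 * u.K

# the exponent h*c/(l*k_B*T1) is dimensionless, so np.exp accepts it
IM0 = (2 * h * c**2 / l**5) / (np.exp(h * c / (l * k_B * T1)) - 1)
print(IM0.unit)  # equivalent to W / m3: power per unit area per unit wavelength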
I have an array of scalars of m rows and n columns. I have a Variable(m) and a Variable(n) that I would like to find solutions for.
The two variables represent values that need to be broadcast over the columns and rows respectively.
I was naively thinking of writing the variables as Variable((m, 1)) and Variable((1, n)), and adding them together as if they're ndarrays. However, that doesn't work, as broadcasting is not allowed.
import cvxpy as cp
import numpy as np
# Problem data.
m = 3
n = 4
np.random.seed(1)
data = np.random.randn(m, n)
# Construct the problem.
x = cp.Variable((m, 1))
y = cp.Variable((1, n))
objective = cp.Minimize(cp.sum(cp.abs(x + y + data)))
# or:
#objective = cp.Minimize(cp.sum_squares(x + y + data))
prob = cp.Problem(objective)
result = prob.solve()
print(x.value)
print(y.value)
This fails on the x + y expression: ValueError: Cannot broadcast dimensions (3, 1) (1, 4).
Now I'm wondering two things:
Is my problem indeed solvable using convex optimization?
If yes, how can I express it in a way that cvxpy understands?
I'm very new to the concept of convex optimization, as well as cvxpy, and I hope I described my problem well enough.
I offered to show you how to represent this as a linear program, so here goes. I'm using Pyomo, since I'm more familiar with it, but you could do something similar in PuLP.
To run this, you will need to first install Pyomo and a linear program solver like glpk. glpk should work for reasonable-sized problems, but if you are finding it's taking too long to solve, you could try a (much faster) commercial solver like CPLEX or Gurobi.
You can install Pyomo via pip install pyomo or conda install -c conda-forge pyomo. You can install glpk from https://www.gnu.org/software/glpk/ or via conda install glpk. (I think PuLP comes with a version of glpk built-in, so that might save you a step.)
Here's the script. Note that this calculates absolute error as a linear expression by defining one variable for the positive component of the error and another for the negative part. Then it seeks to minimize the sum of both. In this case, the solver will always set one to zero since that's an easy way to reduce the error, and then the other will be equal to the absolute error.
import random
import pyomo.environ as po
random.seed(1)
# ~50% sparse data set, big enough to populate every row and column
m = 10 # number of rows
n = 10 # number of cols
data = {
    (r, c): random.random()
    for r in range(m)
    for c in range(n)
    if random.random() >= 0.5
}
# define a linear program to find vectors
# x in R^m, y in R^n, such that x[r] + y[c] is close to data[r, c]
# create an optimization model object
model = po.ConcreteModel()
# create indexes for the rows and columns
model.ROWS = po.Set(initialize=range(m))
model.COLS = po.Set(initialize=range(n))
# create indexes for the dataset
model.DATAPOINTS = po.Set(dimen=2, initialize=data.keys())
# data values
model.data = po.Param(model.DATAPOINTS, initialize=data)
# create the x and y vectors
model.X = po.Var(model.ROWS, within=po.NonNegativeReals)
model.Y = po.Var(model.COLS, within=po.NonNegativeReals)
# create dummy variables to represent errors
model.ErrUp = po.Var(model.DATAPOINTS, within=po.NonNegativeReals)
model.ErrDown = po.Var(model.DATAPOINTS, within=po.NonNegativeReals)
# Force the error variables to match the error
def Calculate_Error_rule(model, r, c):
    pred = model.X[r] + model.Y[c]
    err = model.ErrUp[r, c] - model.ErrDown[r, c]
    return (model.data[r, c] + err == pred)
model.Calculate_Error = po.Constraint(
    model.DATAPOINTS, rule=Calculate_Error_rule
)
# Minimize the total error
def ClosestMatch_rule(model):
    return sum(
        model.ErrUp[r, c] + model.ErrDown[r, c]
        for (r, c) in model.DATAPOINTS
    )
model.ClosestMatch = po.Objective(
    rule=ClosestMatch_rule, sense=po.minimize
)
# Solve the model
# get a solver object
opt = po.SolverFactory("glpk")
# solve the model
# turn off "tee" if you want less verbose output
results = opt.solve(model, tee=True)
# show solution status
print(results)
# show verbose description of the model
model.pprint()
# show X and Y values in the solution
for r in model.ROWS:
    print('X[{}]: {}'.format(r, po.value(model.X[r])))
for c in model.COLS:
    print('Y[{}]: {}'.format(c, po.value(model.Y[c])))
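If you want the solution back as plain Python lists instead of the printed output (a small sketch, not part of the original script):
x_sol = [po.value(model.X[r]) for r in model.ROWS]
y_sol = [po.value(model.Y[c]) for c in model.COLS]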
Just to complete the story, here's a solution that's closer to your original example. It uses cvxpy, but with the sparse data approach from my solution.
I don't know the "official" way to do elementwise calculations with cvxpy, but it seems to work OK to just use the standard Python sum function with a lot of individual cp.abs(...) calculations.
This gives a solution that is very slightly worse than the linear program, but you may be able to fix that by adjusting the solution tolerance.
import cvxpy as cp
import random
random.seed(1)
# Problem data.
# ~50% sparse data set
m = 10 # number of rows
n = 10 # number of cols
data = {
    (i, j): random.random()
    for i in range(m)
    for j in range(n)
    if random.random() >= 0.5
}
# Construct the problem.
x = cp.Variable(m)
y = cp.Variable(n)
objective = cp.Minimize(
    sum(
        cp.abs(x[i] + y[j] + data[i, j])
        for (i, j) in data.keys()
    )
)
prob = cp.Problem(objective)
result = prob.solve()
print(x.value)
print(y.value)
I did not fully get the idea, so here is just some hacky stuff based on the assumption that:
you want some cvxpy-equivalent to numpy's broadcasting-rules behaviour on arrays (m, 1) + (1, n)
So numpy-wise:
m = 3
n = 4
np.random.seed(1)
a = np.random.randn(m, 1)
b = np.random.randn(1, n)
a
array([[ 1.62434536],
       [-0.61175641],
       [-0.52817175]])
b
array([[-1.07296862, 0.86540763, -2.3015387 , 1.74481176]])
a + b
array([[ 0.55137674, 2.48975299, -0.67719333, 3.36915713],
       [-1.68472504, 0.25365122, -2.91329511, 1.13305535],
       [-1.60114037, 0.33723588, -2.82971045, 1.21664001]])
Let's mimic this with np.kron, which has a cvxpy-equivalent:
aLifted = np.kron(np.ones((1,n)), a)
bLifted = np.kron(np.ones((m,1)), b)
aLifted
array([[ 1.62434536, 1.62434536, 1.62434536, 1.62434536],
       [-0.61175641, -0.61175641, -0.61175641, -0.61175641],
       [-0.52817175, -0.52817175, -0.52817175, -0.52817175]])
bLifted
array([[-1.07296862, 0.86540763, -2.3015387 , 1.74481176],
       [-1.07296862, 0.86540763, -2.3015387 , 1.74481176],
       [-1.07296862, 0.86540763, -2.3015387 , 1.74481176]])
aLifted + bLifted
array([[ 0.55137674, 2.48975299, -0.67719333, 3.36915713],
       [-1.68472504, 0.25365122, -2.91329511, 1.13305535],
       [-1.60114037, 0.33723588, -2.82971045, 1.21664001]])
Let's check cvxpy semi-blindly (we only check dimensions; too lazy to set up a full problem and fix the variables to check the output :-D):
import cvxpy as cp
x = cp.Variable((m, 1))
y = cp.Variable((1, n))
cp.kron(np.ones((1,n)), x) + cp.kron(np.ones((m, 1)), y)
# Expression(AFFINE, UNKNOWN, (3, 4))
# looks good!
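Plugging this into the original example gives something like the following sketch (not verified numerically here; data, m and n as defined in the question):
objective = cp.Minimize(cp.sum(cp.abs(
    cp.kron(np.ones((1, n)), x) + cp.kron(np.ones((m, 1)), y) + data
)))
prob = cp.Problem(objective)
result = prob.solve()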
Now some caveats:
I don't know how efficiently cvxpy can reason about this matrix form internally
it is unclear whether this is more efficient than a simple list-comprehension-based form using cp.vstack and co (it probably is)
this operation itself kills all sparsity
(if both vectors are dense; your matrix is dense)
cvxpy and more or less all convex-optimization solvers are based on some sparsity assumption
scaling this problem up to machine-learning dimensions will not make you happy
there is probably a much more concise mathematical theory for your problem than (sparsity-assuming, pretty general) convex optimization (DCP, as implemented in cvxpy, is a subset of it)
I want to compute binomial probabilities in Python. I tried to apply the formula:
probability = scipy.misc.comb(n,k)*(p**k)*((1-p)**(n-k))
Some of the probabilities I get are infinite. I checked some of the values for which the result is inf. For one of them, n=450,000 and k=17. This value must be greater than 1e302, which is close to the maximum value handled by floats.
I then tried to use sum(np.random.binomial(n,p,numberOfTrials)==valueOfInterest)/numberOfTrials
This draws numberOfTrials samples and computes the average number of times the value valueOfInterest is drawn.
This doesn't produce any infinite value. However, is this a valid way to proceed? And why doesn't this approach produce any infinite values, whereas computing the probabilities directly does?
Because you're using scipy I thought I would mention that scipy already has statistical distributions implemented. Also note that when n is this large the binomial distribution is well approximated by the normal distribution (or Poisson if p is very small).
import numpy as np
import scipy.stats

n = 450000
p = .5
k = np.array([17., 225000, 226000])
b = scipy.stats.binom(n, p)
print b.pmf(k)
# array([ 0.00000000e+00, 1.18941527e-03, 1.39679862e-05])
n = scipy.stats.norm(n*p, np.sqrt(n*p*(1-p)))
print n.pdf(k)
# array([ 0.00000000e+00, 1.18941608e-03, 1.39680605e-05])
print b.pmf(k) - n.pdf(k)
# array([ 0.00000000e+00, -8.10313274e-10, -7.43085142e-11])
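Side note (a sketch, not in the original answer): the frozen distribution also exposes log-probabilities directly, which sidesteps the underflow for extreme values such as k = 17:
print b.logpmf(17.)
# about -311728.4: the probability itself is far too small for a float, but its log is fine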
Work in the log domain: compute the logs of the combination and of the power terms, add them, and only exponentiate at the end.
Something like this:
import numpy as np

n, k, p = 450000, 17, 0.5  # the values from the question

combination_num = range(k+1, n+1)
combination_den = range(1, n-k+1)
combination_log = np.log(combination_num).sum() - np.log(combination_den).sum()
p_k_log = k * np.log(p)
neg_p_K_log = (n - k) * np.log(1 - p)
p_log = combination_log + p_k_log + neg_p_K_log
probability = np.exp(p_log)
This gets rid of the numeric underflow/overflow caused by the large numbers. On your example with n = 450000, p = 0.5 and k = 17, it returns p_log = -311728.4, i.e. the final probability is so small that np.exp underflows to zero. However, you can still work with the log probability.
I think you should do all your computation using logarithms:
from numpy import exp, log
from scipy import special

lgam = special.gammaln

def binomial(n, k, p):
    return exp(lgam(n+1) - lgam(n-k+1) - lgam(k+1) + k*log(p) + (n-k)*log(1.-p))
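For the extreme values in the question the probability still underflows to zero, so a small sketch (not part of the original answer) is to return the log-probability and only exponentiate when it is safe:
def log_binomial(n, k, p):
    return lgam(n+1) - lgam(n-k+1) - lgam(k+1) + k*log(p) + (n-k)*log(1.-p)

print(binomial(450000, 17, 0.5))      # 0.0 -- underflows
print(log_binomial(450000, 17, 0.5))  # about -311728, matching the log-domain answer above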
To avoid multiplying something like zero by something like infinity, use step-by-step multiplication, as follows.
def Pbinom(N, p, k):
    q = 1 - p
    lt1 = [q]*(N-k)                                        # these multiply to q**(N-k)
    gt1 = list(map(lambda x: p*(N-k+x)/x, range(1, k+1)))  # these multiply to comb(N, k) * p**k
    Pb = 1.0
    # multiply the factors in an order that keeps the running product close to 1
    while (len(lt1) + len(gt1)) > 0:
        if Pb > 1:
            if len(lt1) > 0:
                Pb *= lt1.pop()
            else:
                if len(gt1) > 0:
                    Pb *= gt1.pop()
        else:
            if len(gt1) > 0:
                Pb *= gt1.pop()
            else:
                if len(lt1) > 0:
                    Pb *= lt1.pop()
    return Pb
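A quick usage sketch (not part of the original answer), checking it against scipy.stats on a moderate case:
import scipy.stats

print(Pbinom(20, 0.5, 3))                  # step-by-step product
print(scipy.stats.binom(20, 0.5).pmf(3))   # reference value, about 0.001087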