I had some code that random-initialized some numpy arrays with:
rng = np.random.default_rng(seed=seed)
new_vectors = rng.uniform(-1.0, 1.0, target_shape).astype(np.float32) # [-1.0, 1.0)
new_vectors /= vector_size
And all was working well, all project tests passing.
Unfortunately, uniform() returns np.float64, though downstream steps only want np.float32, and in some cases, this array is very large (think millions of 400-dimensional word-vectors). So the temporary np.float64 return-value momentarily uses 3X the RAM necessary.
Thus, I replaced the above with what definitionally should be equivalent:
rng = np.random.default_rng(seed=seed)
new_vectors = rng.random(target_shape, dtype=np.float32) # [0.0, 1.0)
new_vectors *= 2.0 # [0.0, 2.0)
new_vectors -= 1.0 # [-1.0, 1.0)
new_vectors /= vector_size
And after this change, all closely-related functional tests still pass, but a single distant, fringe test relying on far-downstream calculations from the vectors so-initialized has started failing, and failing very reliably. It's a stochastic test that passes with a large margin for error under the first (uniform-based) version, but always fails under the second (random-based) version. So: something has changed, but in some very subtle way.
The superficial values of new_vectors seem properly and similarly distributed in both cases. And again, all the "close-up" tests of functionality still pass.
So I'd love theories for what non-intuitive changes this 3-line change may have made that could show up far-downstream.
(I'm still trying to find a minimal test that detects whatever's different. If you'd enjoy doing a deep-dive into the affected project - seeing the exact close-up tests that succeed, the one fringe test that fails, and commits with/without the tiny change - it's all at https://github.com/RaRe-Technologies/gensim/pull/2944#issuecomment-704512389. But really, I'm just hoping a numpy expert might recognize some tiny corner-case where something non-intuitive happens, or offer some testable theories of same.)
Any ideas, proposed tests, or possible solutions?
A way to maintain precision and conserve memory could be to create your large target array, then fill it in using blocks at higher precision.
For example:
import numpy as np

def generate(shape, value, *, seed=None, step=10):
    arr = np.empty(shape, dtype=np.float32)
    rng = np.random.default_rng(seed=seed)
    (d0, *dr) = shape
    for i in range(0, d0, step):
        j = min(d0, i + step)
        arr[i:j, :] = rng.uniform(-1/value, 1/value, size=[j-i] + dr)
    return arr
which can be used as:
generate((100, 1024, 1024), 7, seed=13)
You can tune the size of these blocks (via step) to maintain performance.
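To see the memory effect (illustrative numbers, using the millions-of-400-dimensional-vectors scale mentioned in the question), only one small float64 block exists at a time:
full_float64_bytes = 1_000_000 * 400 * 8    # one big uniform() draw: about 3.2 GB of temporary float64
per_block_float64_bytes = 10 * 400 * 8      # with step=10: about 32 KB of temporary float64 at a time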
Let's print new_vectors * 2**22 % 1 for both methods, i.e., let's look at what's left after the first 22 fractional bits (program is at the end). With the first method:
[[0. 0.5 0.25 0. 0. ]
[0.5 0.875 0.25 0. 0.25 ]
[0. 0.25 0. 0.5 0.5 ]
[0.6875 0.328125 0.75 0.5 0.52539062]
[0.75 0.75 0.25 0.375 0.25 ]]
With the second method:
[[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]
Quite a difference! The second method doesn't produce any numbers with 1-bits after the first 22 fractional bits.
Let's imagine we had a type float3 which could only hold three significant bits (think span of non-zero bits), so for example the numbers (in binary) 1.01 or 11100.0 or 0.0000111, but not 10.01 because that has four significant bits.
Then the random number generator for the range [0, 1) would pick from these eight numbers:
0.000
0.001
0.010
0.011
0.100
0.101
0.110
0.111
Wait, hold on. Why only from those eight? What about for example the aforementioned 0.0000111? That's in [0, 1) and can be represented, right?
Well yes, but note that that's in [0, 0.5). And there are no further representable numbers in the range [0.5, 1), as those numbers all start with "0.1" and thus any further 1-bits can only be at the second or third fractional bit. For example 0.1001 would not be representable, as that has four significant bits.
So if the generator were also to pick from any numbers other than those eight above, they'd all have to be in [0, 0.5), creating a bias. It could pick from four different numbers in that range instead, or possibly include all representable numbers in that range with proper probabilities, but either way you'd have a "gap bias", where numbers picked from [0, 0.5) can have smaller or larger gaps than numbers picked from [0.5, 1). Not sure "gap bias" is a thing or the correct term, but the point is that the distribution in [0, 0.5) would look different than that in [0.5, 1). The only way to make them look the same is to stick with picking from those equally-spaced eight numbers above. The distribution/possibilities in [0.5, 1) dictate what you should use in [0, 0.5).
So... a random number generator for float3 would pick from those eight numbers and never generate for example 0.0000111. But now imagine we also had a type float5, which could hold five significant bits. Then a random number generator for that could pick 0.00001. And if you then convert that to our float3, that would survive, you'd have 0.00001 as float3. But in the range [0.5, 1), this process of generating float5 numbers and converting them to float3 would still only produce the numbers 0.100, 0.101, 0.110 and 0.111, since float3 still can't represent any other numbers in that range.
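To make the analogy concrete, here's a tiny sketch (the "float3" type is of course made up; this just enumerates the values with at most three significant bits):
from fractions import Fraction

# "float3" values in [0, 1): a mantissa of at most 3 bits times a power of two
vals = sorted({Fraction(m, 2**(e + 3)) for m in range(1, 8) for e in range(8)})
print([v for v in vals if v >= Fraction(1, 2)])      # only 1/2, 5/8, 3/4, 7/8 -- the coarse, even grid
print(len([v for v in vals if v < Fraction(1, 2)]))  # many more, finer-spaced values below 1/2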
So that's what you get, just with float32 and float64. Your two methods give you different distributions. I'd say the second method's distribution is actually better, as the first method has what I called "gap bias". So maybe it's not your new method that's broken, but the test. If that's the case, fix the test. Otherwise, an idea to fix your situation might be to use the old float64-to-float32 way, but not produce everything at once. Instead, prepare the float32 structure with just 0.0 everywhere, and then fill it in smaller chunks generated the old float64 way.
Small caveat, btw: Looks like there's a bug in NumPy for generating random float32 values, not using the lowest-position bit. So that might be another reason the test fails. You could try your second method with (rng.integers(0, 2**24, target_shape) / 2**24).astype(np.float32) instead of rng.random(target_shape, dtype=np.float32). I think that's equivalent to what the fixed version would be (since it's apparently currently doing it that way, except with 23 instead of 24).
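Spelled out, that test variant of the second method would look like this (note the int64/float64 temporaries mean this is only for testing the theory, not a memory fix):
rng = np.random.default_rng(seed=seed)
new_vectors = (rng.integers(0, 2**24, target_shape) / 2**24).astype(np.float32)  # 24 random bits per value, [0.0, 1.0)
new_vectors *= 2.0   # [0.0, 2.0)
new_vectors -= 1.0   # [-1.0, 1.0)
new_vectors /= vector_size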
The program for the experiment at the top (also at repl.it):
import numpy as np
# Some setup
seed = 13
target_shape = (5, 5)
vector_size = 1
# First way
rng = np.random.default_rng(seed=seed)
new_vectors = rng.uniform(-1.0, 1.0, target_shape).astype(np.float32) # [-1.0, 1.0)
new_vectors /= vector_size
print(new_vectors * 2**22 % 1)
# Second way
rng = np.random.default_rng(seed=seed)
new_vectors = rng.random(target_shape, dtype=np.float32) # [0.0, 1.0)
new_vectors *= 2.0 # [0.0, 2.0)
new_vectors -= 1.0 # [-1.0, 1.0)
new_vectors /= vector_size
print(new_vectors * 2**22 % 1)
I ran your code with the following values:
seed = 0
target_shape = [100]
vector_size = 3
I noticed that the code in your first solution generated a different new_vectors than your second solution.
Specifically, it looks like uniform produces only every other value that random produces with the same seed (compare the aligned values below). This is probably an implementation detail of the random generator in numpy.
In the following snippet I only inserted spaces to align similar values. There is probably also some float rounding going on, making the results appear not quite identical.
[ 0.09130779, -0.15347552, -0.30601767, -0.32231492, 0.20884682, ...]
[0.23374946, 0.09130772, 0.007424275, -0.1534756, -0.12811375, -0.30601773, -0.28317323, -0.32231498, -0.21648853, 0.20884681, ...]
Based on this, I speculate that your stochastic test case only tests your solution with one seed, and because the new solution generates a different sequence from that seed, this causes the failure in the test case.
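A quick way to see this for yourself (an illustrative sketch mirroring the question's two recipes with seed 0 and vector_size 3):
import numpy as np

a = np.random.default_rng(0).uniform(-1.0, 1.0, 10).astype(np.float32) / 3    # first way
b = (np.random.default_rng(0).random(10, dtype=np.float32) * 2.0 - 1.0) / 3   # second way
print(a)   # should start like the shorter list above
print(b)   # should start like the longer list above; same seed, but the sequences don't line up one-to-one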
Related
I can see how to define a custom continuous random variable class in Scipy using the rv_continuous class or a discrete one using rv_discrete.
But is there any way to create a non-continuous random variable which can model the following random variable which is a combination of a normal distribution and a discontinuous distribution that simply outputs 0?
import numpy as np
import matplotlib.pyplot as plt

def my_rv(a, loc, scale):
    if np.random.random() > a:
        return 0
    else:
        return np.random.normal(loc=loc, scale=scale)
# Example
samples = np.array([my_rv(0.5,0,1) for i in range(1000)])
plt.hist(samples, bins=21, density=True)
plt.grid()
plt.show()
print(samples[:10].round(2))
# [0. 0. 0. 0. 0. 0. 2.2 0.07 0. 0.12]
The pdf of this random variable is essentially a normal bell curve combined with a point mass (spike) at zero.
[In my actual application a is quite small, e.g. 0.01, so this variable is 0 most of the time.]
Obviously, I can do this without Scipy in Python either with the simple function above or by generating samples from each distribution separately and then merging them appropriately but I was hoping to make this a Scipy custom random variable and get all the benefits and features of that class.
You could do something like:
temp = np.random.normal(loc=loc, scale=scale, size=1000)
mask = np.random.choice((0, 1), size=1000, p=(a, 1-a))
temp[np.where(mask == 0)] = 0
So you start off with a normal distribution, and then you replace each element with 0 with probability a. (Note that I'm using Robert's correction above. If you think you really need to multiply by a, then do so.)
Alternatively, the last line could just be:
result = temp * mask
But this only works when you're setting some fraction of the values to 0.
=== Edited ===
Change #1. In my call to np.random.choice, I left out the most important part! p=(a, 1-a) to give you the right probabilities. Sorry about that. I must have deleted that after copying.
Change #2. I'm assuming that a is fairly close to 1, and that you need to calculate most of the normally distributed numbers anyway. (Calculating a normally distributed random number is pretty cheap.) But if a is close to 0, you might instead want:
result = np.random.choice((0.0, 1.0), size=1000, p=(a, 1-a))
mask = np.where(result != 0)
result[mask] = np.random.normal(size=len(mask[0]))
which only calculates the number of normally-distributed values that you need.
I recently discovered there are other packages that can model more general types of stochastic processes such as this one.
I'm not sure this code is completely correct but I think the above process can be represented in Pyro something like this:
import torch
import pyro
import pyro.distributions as dist
def process(a, loc, scale):
    c = pyro.sample('c', dist.Bernoulli(a))
    switch = [
        pyro.deterministic('my_rv', torch.tensor(0.0)),
        pyro.sample('my_rv', pyro.distributions.Normal(loc, scale))
    ]
    return switch[c.long()]
The documentation explains how Pyro handles discrete latent variables.
I haven't figured it all out yet, but I believe you can do a lot with pyro such as conditioning the model on data and probabilistic inference.
Another package is PyMC3. I would be interested to know if there are others that would be good for this.
I'm trying to compute the maximum likelihood estimate (MLE) for the following probability density function (PDF), which (as in the code below) is f(d) = (d / a**2) * exp(-d / a) / scale:
I'm computing it by minimising the objective function (the negative log-likelihood) without relying on any predefined log-likelihood Python built-in modules whatsoever. The code is:
import os
import numpy as np
import scipy.io
from scipy import optimize

# Alpha Distribution (PDF)
def AD(z, *params):
    a, scale = z
    diameters = params
    return -np.sum(np.log((diameters / a**2 * np.exp(-diameters / a)) / scale))
# load data
currpath = ('path')
os.chdir(currpath)
diameters = scipy.io.loadmat('data.mat')["m1"]
# minimise
x0 = [1,1] # initial guesses
res = optimize.minimize(AD, x0, args=diameters, method='Nelder-Mead', tol=1e-6)
print(res.x)
My data vector (here already sorted) comprises a number of diameters in the following form (0.19, 0.19, 0.19, 0.2, 0.21, 0.21, 0.22, 0.22, 0.22, 0.25, 0.27 ...).
First question: Since I'm fairly new to the topic of MLE, is the form of my data vector correct? I'm not completely sure whether I use a data vector containing every observed diameter (like shown above), or a data vector which only contains the "possible" diameters (which would be: 0.19, 0.2, 0.21, 0.22, 0.25, 0.27 ...), or just the frequencies of the observed diameters (which would be: 3, 1, 2, 3, 1, 1 ...). I think the first option is the right one, but I just wanted to be completely sure.
Second question: If I wish to use a cumulative distribution function (CDF) instead of a PDF to perform my MLE on, I would have to change my PDF function to a CDF, right? I was just wondering if I could alternatively somehow modify my data vector and still use the PDF.
However, for the minimisation in python (if I understood it correctly) I had to rethink the definition of my variables. That means, normally I would assume that the parameters of my PDF (here "a" and "scale") are the variables which should be passed to "args" in "optimize.minimize". However, in the documentation it is stated, that args should contain the "constant" parameters, therefore I used my data vector as a constant "parameter vector" for the minimisation.
Third question: Is this assumption an error in reasoning?
Fourth question: Is the optimisation method "Nelder-Mead" appropriate? I'm not really familiar with optimisation methods and not sure which of the options I should use/is the best.
Finally, the program returns an error "TypeError: bad operand type for unary -: 'tuple'", where I have no clue how to deal with it, since I'm not passing any tuples to the minimisation function ...
Fifth question: Where does the tuple come from and how can I solve this error?
I'd appreciate any help you could give me very much!
Best regards!
PS: Since this post is kind of a mixture between general math and programming, I wasn't completely sure if this is the right place to put the question. Sorry if I'm mistaken!
First, apart from the first part (before the multiplication operator), we are discussing what is generally called maximum likelihood estimation (MLE) for the exponential distribution. It has just been reparameterised in terms of something called a.
We want to estimate this single parameter based on a sample of diameters; there is no scale parameter. Under MLE, we pretend that the sample is fixed and treat the parameter as something that can be varied. We form the likelihood of the sample by taking the product of the density functions (not the cdfs) where each density function is to be calculated for one element of the sample.
(Likelihood is, in concept, like throwing a die twice. In ultra ugly terms, we could say that the likelihood of getting two ones in a row might be (1/6)(1/6).)
We want to maximise this likelihood. However, to make the optimisation problem mathematically and/or computationally tractable we take the function's logarithm. Since all of its constituent functions are densities, less than one, this function must be everywhere less than zero. Thus, the maximisation problem becomes one of minimisation.
If you want to avoid almost all of the algebra then you would:
Write a function to calculate the density function for a given diameter and parameter value.
Write another function that accepts a density-function parameter value as its first argument and the sample as its second. Have it call the first function once for each sample value, take the log of each result, and return the sum of these logs.
Call minimize with the second function as its first argument, some reasonable guess for the density-function parameter (in a list) as the second argument, and the sample as args. Nelder-Mead is probably ok.
Edit: In a nutshell:
from scipy.optimize import minimize
from math import exp, log

diameters = [0.19, 0.19, 0.19, 0.2, 0.21, 0.21, 0.22, 0.22, 0.22, 0.25, 0.27]

def pdf(d, a):
    result = d*exp(-d/a)/a**2
    return result

def log_L(a, diameters):
    result = sum(log(pdf(d, a)) for d in diameters)
    return result

res = minimize(log_L, [1], args=diameters)
print(res)
Output:
fun: -337.80985348524604
hess_inv: array([[ 8.71770021e+10]])
jac: array([ -7.62939453e-06])
message: 'Optimization terminated successfully.'
nfev: 93
nit: 30
njev: 31
status: 0
success: True
x: array([ 2157576.39996697])
Addendum:
The wikipedia article gives the pdf of the exponential distribution as f(x; lambda) = lambda * exp(-lambda * x) for x >= 0.
The constant lambda can be viewed as a value that scales the integral of the remainder of the expression, from zero to infinity, to one. We can ignore it and equate the exponents of your pdf, without the scaling factor, and of the exponential, remembering that d takes the role of x: -d/a = -lambda * d.
Solving for lambda gives lambda = 1/a.
We see that this is the normalising expression in your pdf. In other words, the alpha is an exponential expressed with different parameters.
Here is another approach, assuming that you're analysing data and not simply working out the details of MLE.
scipy provides means for generating samples from arbitrary distributions. Here I define just the pdf for your alpha. Your parameter a becomes p because a is used as the lower limit for the distribution support, which I define to be zero.
I draw a sample of size 100 with p set somewhat arbitrarily to 0.4. I did a little experimentation, trying to find a value that would give me a sample whose lowest 11 values would approximate those in your sample.
The scipy rv_continuous object has a method called fit that will attempt calculation of MLE estimates of location, scale and 'shape'. In this case, the value for shape, about 0.36, is not all that far from 0.4.
from scipy.stats import rv_continuous
import numpy as np

class Alpha(rv_continuous):
    'alpha distribution'
    def _pdf(self, x, p):
        return x*np.exp(-x/p)/p**2

alpha = Alpha(a=0, shapes='p')
sample = sorted(alpha.rvs(size=100, p=0.4))
for a in sample[:12]:
    print('{:10.2f}'.format(a))
print(Alpha(a=0, shapes='p').fit(sample))
I don't believe that your sample is alpha-distributed. The values seem to be too 'uniform' compared with what I could generate. But I've been wrong before.
I would suggest plotting your sample cdf to see if you can recognise what it is.
Incidentally, when I changed the sign of the log-likelihood in the other answer the code croaked. I suspect that the alpha is just a poor fit.
0.00
0.03
0.04
0.04
0.08
0.09
0.09
0.11
0.12
0.14
0.19
0.20
(1.0902616847853124, -0.039102949269294023, 0.35922022997329517)
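Following up on the suggestion to plot the sample cdf, a minimal sketch (using just the 11 values quoted in the question) could be:
import numpy as np
import matplotlib.pyplot as plt

diameters = np.array([0.19, 0.19, 0.19, 0.2, 0.21, 0.21, 0.22, 0.22, 0.22, 0.25, 0.27])
x = np.sort(diameters)
y = np.arange(1, len(x) + 1) / len(x)   # empirical CDF
plt.step(x, y, where='post')
plt.xlabel('diameter')
plt.ylabel('empirical CDF')
plt.show()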
So the output of my network is a list of probabilities, which I then round using tf.round() to be either 0 or 1; this is crucial for this project.
I then found out that tf.round isn't differentiable so I'm kinda lost there.. :/
Something along the lines of x - sin(2pi x)/(2pi)?
I'm sure there's a way to squish the slope to be a bit steeper.
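For a feel of that suggestion, evaluating x - sin(2*pi*x)/(2*pi) at a few points (plain numpy, just for illustration):
import numpy as np

x = np.array([0.1, 0.4, 0.6, 0.9])
print(x - np.sin(2 * np.pi * x) / (2 * np.pi))  # ~[0.006, 0.306, 0.694, 0.994]: each value pulled toward the nearest integer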
You can use the fact that tf.maximum() and tf.minimum() are differentiable, and that the inputs are probabilities from 0 to 1:
# round numbers less than 0.5 to zero;
# by making them negative and taking the maximum with 0
differentiable_round = tf.maximum(x-0.499,0)
# scale the remaining numbers (0 to 0.5) to greater than 1
# the other half (zeros) is not affected by multiplication
differentiable_round = differentiable_round * 10000
# take the minimum with 1
differentiable_round = tf.minimum(differentiable_round, 1)
Example:
[0.1, 0.5, 0.7]
[-0.399, 0.001, 0.20099] # x - 0.499
[0, 0.001, 0.20099] # max(x-0.499, 0)
[0, 10, 2009.9] # max(x-0.499, 0) * 10000
[0, 1.0, 1.0] # min(max(x-0.499, 0) * 10000, 1)
This works for me:
x_rounded_NOT_differentiable = tf.round(x)
x_rounded_differentiable = x - tf.stop_gradient(x - x_rounded_NOT_differentiable)
In the forward pass this equals tf.round(x) exactly, but tf.stop_gradient hides the correction term from backpropagation, so the gradient is that of the identity (a straight-through estimator).
Rounding is a fundamentally nondifferentiable function, so you're out of luck there. The normal procedure for this kind of situation is to find a way to either use the probabilities, say by using them to calculate an expected value, or to take the maximum probability that is output and choose that one as the network's prediction. If you aren't using the output for calculating your loss function, though, you can go ahead and just apply it to the result, and it doesn't matter whether it's differentiable. Now, if you want an informative loss function for the purpose of training the network, consider whether keeping the output in the format of probabilities might actually be to your advantage (it will likely make your training process smoother) - that way you can just convert the probabilities to actual estimates outside of the network, after training.
Building on a previous answer, a way to get an arbitrarily good approximation is to approximate round() using a finite Fourier approximation and use as many terms as you need. Fundamentally, you can think of round(x) as adding a reverse (i.e., descending) sawtooth wave to x. So, using the Fourier expansion of the sawtooth wave, we get
round(x) ≈ x + sum over n = 1..N of (-1)^n * sin(2*pi*n*x) / (pi*n)
With N = 5, we already get a pretty nice (smooth but steep) approximation of the rounding step.
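A sketch of that truncated series (the function name fourier_round is mine, not from any library; it assumes TensorFlow 2.x eager mode):
import numpy as np
import tensorflow as tf

def fourier_round(x, terms=5):
    # round(x) approximated as x plus a truncated Fourier series of the descending sawtooth
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    result = x
    for n in range(1, terms + 1):
        result += ((-1.0) ** n) * tf.sin(2.0 * np.pi * n * x) / (np.pi * n)
    return result

print(fourier_round(tf.constant([0.1, 0.5, 0.7, 1.3])))  # roughly [0.0, 0.5, 1.0, 1.0]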
Kind of an old question, but I just solved this problem for TensorFlow 2.0. I am using the following round function in my audio auto-encoder project. I basically want to create a discrete representation of sound which is compressed in time. I use the round function to clamp the output of the encoder to integer values. It has been working well for me so far.
@tf.custom_gradient
def round_with_gradients(x):
    def grad(dy):
        return dy
    return tf.round(x), grad
In the range 0 to 1, translating and scaling a sigmoid can be a solution:
slope = 1000
center = 0.5
# tf.sigmoid(slope * (x - center)) is e / (e + 1) with e = tf.exp(slope * (x - center)),
# computed without overflowing to inf/NaN when slope * (x - center) is large
round_diff = tf.sigmoid(slope * (x - center))
In tensorflow 2.10, there is a function called soft_round which achieves exactly this.
Fortunately, for those who are using lower versions, the source code is really simple, so I just copy-pasted those lines, and it works like a charm:
def soft_round(x, alpha, eps=1e-3):
  """Differentiable approximation to `round`.

  Larger alphas correspond to closer approximations of the round function.
  If alpha is close to zero, this function reduces to the identity.

  This is described in Sec. 4.1. in the paper
  > "Universally Quantized Neural Compression"
  > Eirikur Agustsson & Lucas Theis
  > https://arxiv.org/abs/2006.09952

  Args:
    x: `tf.Tensor`. Inputs to the rounding function.
    alpha: Float or `tf.Tensor`. Controls smoothness of the approximation.
    eps: Float. Threshold below which `soft_round` will return identity.

  Returns:
    `tf.Tensor`
  """
  # This guards the gradient of tf.where below against NaNs, while maintaining
  # correctness, as for alpha < eps the result is ignored.
  alpha_bounded = tf.maximum(alpha, eps)

  m = tf.floor(x) + .5
  r = x - m
  z = tf.tanh(alpha_bounded / 2.) * 2.
  y = m + tf.tanh(alpha_bounded * r) / z

  # For very low alphas, soft_round behaves like identity
  return tf.where(alpha < eps, x, y, name="soft_round")
alpha sets how soft the function is. Greater values lead to better approximations of the round function, but then it becomes harder to fit since gradients vanish:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

x = tf.convert_to_tensor(np.arange(-2, 2, .1).astype(np.float32))
for alpha in [3., 7., 15.]:
    y = soft_round(x, alpha)
    plt.plot(x.numpy(), y.numpy(), label=f'alpha={alpha}')
plt.legend()
plt.title('Soft round function for different alphas')
plt.grid()
plt.show()
In my case, I tried different values for alpha, and 3. looks like a good choice.
In my code I'm using theano to calculate a Euclidean distance matrix (code from here):
import theano
import theano.tensor as T
MAT = T.fmatrix('MAT')
squared_euclidean_distances = (MAT ** 2).sum(1).reshape((MAT.shape[0], 1)) + (MAT ** 2).sum(1).reshape((1, MAT.shape[0])) - 2 * MAT.dot(MAT.T)
f_euclidean = theano.function([MAT], T.sqrt(squared_euclidean_distances))
def pdist_euclidean(mat):
return f_euclidean(mat)
But this code causes some values of the matrix to be NaN. I've read that this happens when calculating theano.tensor.sqrt() and here it's suggested to
Add an eps inside the sqrt (or max(x,EPs))
So I've added an eps to my code:
import theano
import theano.tensor as T
eps = 1e-9
MAT = T.fmatrix('MAT')
squared_euclidean_distances = (MAT ** 2).sum(1).reshape((MAT.shape[0], 1)) + (MAT ** 2).sum(1).reshape((1, MAT.shape[0])) - 2 * MAT.dot(MAT.T)
f_euclidean = theano.function([MAT], T.sqrt(eps+squared_euclidean_distances))
def pdist_euclidean(mat):
return f_euclidean(mat)
And I'm adding it before performing sqrt. I'm getting fewer NaNs, but I'm still getting them. What is the proper solution to the problem? I've also noticed that there are no NaNs if MAT is T.dmatrix().
There are two likely sources of NaNs when computing Euclidean distances.
1. Floating point representation approximation issues causing negative squared distances when they're really just zero. The square root of a negative number is undefined (assuming you're not interested in the complex solution).
Imagine MAT has the value
[[ 1.62434536 -0.61175641 -0.52817175 -1.07296862 0.86540763]
[-2.3015387 1.74481176 -0.7612069 0.3190391 -0.24937038]
[ 1.46210794 -2.06014071 -0.3224172 -0.38405435 1.13376944]
[-1.09989127 -0.17242821 -0.87785842 0.04221375 0.58281521]]
Now, if we break down the computation we see that (MAT ** 2).sum(1).reshape((MAT.shape[0], 1)) + (MAT ** 2).sum(1).reshape((1, MAT.shape[0])) has value
[[ 10.3838024 -9.92394296 10.39763039 -1.51676099]
[ -9.92394296 18.16971188 -14.23897281 5.53390084]
[ 10.39763039 -14.23897281 15.83764622 -0.65066204]
[ -1.51676099 5.53390084 -0.65066204 4.70316652]]
and 2 * MAT.dot(MAT.T) has value
[[ 10.3838024 14.27675714 13.11072431 7.54348446]
[ 14.27675714 18.16971188 17.00367905 11.4364392 ]
[ 13.11072431 17.00367905 15.83764622 10.27040637]
[ 7.54348446 11.4364392 10.27040637 4.70316652]]
The diagonal of these two values should be equal (the distance between a vector and itself is zero) and from this textual representation it looks like that is true, but in fact they are slightly different -- the differences are too small to show up when we print the floating point values like this.
This becomes apparent when we print the value of the full expression (the second of the matrices above subtracted from the first):
[[ 0.00000000e+00 2.42007001e+01 2.71309392e+00 9.06024545e+00]
[ 2.42007001e+01 -7.10542736e-15 3.12426519e+01 5.90253836e+00]
[ 2.71309392e+00 3.12426519e+01 0.00000000e+00 1.09210684e+01]
[ 9.06024545e+00 5.90253836e+00 1.09210684e+01 0.00000000e+00]]
The diagonal is almost composed of zeros but the item in the second row, second column is now a very small negative value. When you then compute the square root of all these values you get NaN in that position because the square root of a negative number is undefined (for real numbers).
[[ 0. 4.91942071 1.64714721 3.01002416]
[ 4.91942071 nan 5.58951267 2.42951402]
[ 1.64714721 5.58951267 0. 3.30470398]
[ 3.01002416 2.42951402 3.30470398 0. ]]
2. Computing the gradient of a Euclidean distance expression with respect to a variable inside the input to the function. This can happen not only if a negative number is generated due to floating point approximations, as above, but also if any of the inputs are zero length.
If y = sqrt(x) then dy/dx = 1/(2 * sqrt(x)). So if x=0 or, for your purposes, if squared_euclidean_distances=0 then the gradient will be NaN because 2 * sqrt(0) = 0 and dividing by zero is undefined.
The solution to the first problem can be achieved by ensuring squared distances are never negative by forcing them to be no less than zero:
T.sqrt(T.maximum(squared_euclidean_distances, 0.))
To solve both problems (if you need gradients) then you need to make sure the squared distances are never negative or zero, so bound with a small positive epsilon:
T.sqrt(T.maximum(squared_euclidean_distances, eps))
The first solution makes sense since the problem only arises from approximate representations. The second is a bit more questionable because the true distance is zero so, in a sense, the gradient should be undefined. Your specific use case may yield some alternative solution that maintains the semantics without an artificial bound (e.g. by ensuring that gradients are never computed/used for zero-length vectors). But NaN values can be pernicious: they can spread like weeds.
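Putting the first fix back into the code from the question, a sketch might look like:
import theano
import theano.tensor as T

MAT = T.fmatrix('MAT')
sq_norms = (MAT ** 2).sum(1)
squared_euclidean_distances = (sq_norms.reshape((MAT.shape[0], 1))
                               + sq_norms.reshape((1, MAT.shape[0]))
                               - 2 * MAT.dot(MAT.T))
# clamp at zero before the sqrt (use a small positive eps instead of 0. if gradients are needed)
f_euclidean = theano.function([MAT], T.sqrt(T.maximum(squared_euclidean_distances, 0.)))

def pdist_euclidean(mat):
    return f_euclidean(mat)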
Just checking
In squared_euclidean_distances you're adding a column, a row, and a matrix. Are you sure this is what you want?
More precisely, if MAT is of shape (n, p), you're adding matrices of shapes (n, 1), (1, n) and (n, n).
Theano seems to silently repeat the rows (resp. the columns) of each one-dimensional member to match the number of rows and columns of the dot product.
If this is what you want
In reshape, you should probably specify ndim=2, according to the basic tensor functionality documentation for reshape:
If the shape is a Variable argument, then you might need to use the optional ndim parameter to declare how many elements the shape has, and therefore how many dimensions the reshaped Variable will have.
Also, it seems that squared_euclidean_distances should always be positive, unless imprecision errors in the difference change zero values into small negative values. If this is true, and if negative values are responsible for the NaNs you're seeing, you could indeed get rid of them without corrupting your result by surrounding squared_euclidean_distances with abs(...).
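In code, that last suggestion would be something like (a sketch of the abs(...) variant, reusing the names from the question):
f_euclidean = theano.function([MAT], T.sqrt(abs(squared_euclidean_distances)))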
I'm trying to understand how the error bands are calculated in the tsplot. Examples of the error bands are shown here.
When I plot something simple like
sns.tsplot(np.array([[0,1,0,1,0,1,0,1], [1,0,1,0,1,0,1,0], [.5,.5,.5,.5,.5,.5,.5,.5]]))
I get a vertical line at y=0.5 as expected. The top error band is also a vertical line at around y=0.665 and the bottom error band is a vertical line at around y=0.335. Can someone explain how these are derived?
EDIT: The question and this answer refer to old versions of Seaborn and are not relevant for new versions. See @CGFoX's comment below.
I'm not a statistician, but I read through the seaborn code in order to see exactly what's happening. There are three steps:
Bootstrap resampling. Seaborn creates resampled versions of your data. Each of these is a 3x8 matrix like yours, but each row is randomly selected from the three rows of your input. For example, one might be:
[[ 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5]
[ 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5]
[ 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5]]
and another might be:
[[ 1. 0. 1. 0. 1. 0. 1. 0. ]
[ 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5]
[ 0. 1. 0. 1. 0. 1. 0. 1. ]]
It creates n_boot of these (10000 by default).
Central tendency estimation. Seaborn runs a function on each of the columns of each of the 10000 resampled versions of your data. Because you didn't specify this argument (estimator), it feeds the columns to a mean function (numpy.mean with axis=0). Lots of your columns in your bootstrap iterations are going to have a mean of 0.5, because they will be things like [0, 0.5, 1], [0.5, 1, 0], [0.5, 0.5, 0.5], etc. but you will also have some [1,1,0] and even some [1,1,1] which will result in higher means.
Confidence interval determination. For each column, seaborn sorts the 10000 estimates of the mean calculated from the resampled versions of the data from smallest to greatest, and picks the ones which represent the upper and lower CI. By default, it's using a 68% CI, so if you line up all 10000 mean estimates, then it will pick roughly the 1600th and the 8400th (8400 - 1600 = 6800, or 68% of 10000).
A couple of notes:
There are actually only 3^3, or 27, possible resampled versions of your array, and if you use a function such as mean where the order doesn't matter then there are only 10 distinct unordered combinations (multisets of the three rows). So all 10000 bootstrap iterations will be identical to one of those 27 versions, or 10 versions in the unordered case. This means that it's probably silly to do 10000 iterations in this case.
The means 0.3333... and 0.6666... that show up as your confidence intervals are the means for [1,1,0] and [1,0,0] or rearranged versions of those.
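To make those steps concrete, here is a by-hand sketch of the same percentile-bootstrap idea (the seed and the exact percentile convention are my choices; seaborn's internals differ in detail):
import numpy as np

data = np.array([[0, 1, 0, 1, 0, 1, 0, 1],
                 [1, 0, 1, 0, 1, 0, 1, 0],
                 [.5, .5, .5, .5, .5, .5, .5, .5]])
rng = np.random.default_rng(0)
n_boot = 10000
boot_means = np.empty((n_boot, data.shape[1]))
for i in range(n_boot):
    rows = rng.integers(0, data.shape[0], size=data.shape[0])  # resample rows with replacement
    boot_means[i] = data[rows].mean(axis=0)                    # column means of the resample
low, high = np.percentile(boot_means, [16, 84], axis=0)        # central ~68% of the mean estimates
print(low[0], high[0])   # approximately 0.333 and 0.667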
They show a bootstrap confidence interval, computed by resampling units (rows in the 2d array input form). By default it shows a 68 percent confidence interval, which is equivalent to a standard error, but this can be changed with the ci parameter.
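For example, with the array from the question, the band width can be changed like this (old tsplot API only; this sketch keeps the default bootstrap settings otherwise):
import numpy as np
import seaborn as sns

data = np.array([[0, 1, 0, 1, 0, 1, 0, 1],
                 [1, 0, 1, 0, 1, 0, 1, 0],
                 [.5, .5, .5, .5, .5, .5, .5, .5]])
sns.tsplot(data, ci=95)  # 95% bootstrap CI instead of the default 68%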