Differentiable round function in Tensorflow?

So the output of my network is a list of probabilities, which I then round using tf.round() to be either 0 or 1; this is crucial for this project.
I then found out that tf.round isn't differentiable, so I'm kinda lost there.. :/

Something along the lines of x - sin(2pi x)/(2pi)?
I'm sure there's a way to squish the slope to be a bit steeper.
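A minimal NumPy sketch of this suggestion, handy for checking the shape of the curve before porting it to TensorFlow. The name smooth_round and the iterations parameter are illustrative assumptions, not part of the answer: feeding the output back into the same function is just one possible way to make the step around .5 steeper.

```python
import numpy as np

def smooth_round(x, iterations=1):
    """Smooth stand-in for round(): x - sin(2*pi*x) / (2*pi).

    The curve passes through every integer and every half-integer,
    and its slope is 1 - cos(2*pi*x), which is zero at the integers.
    Feeding the output back in (iterations > 1) is an assumed way to
    steepen the step around the half-integers.
    """
    y = np.asarray(x, dtype=float)
    for _ in range(iterations):
        y = y - np.sin(2 * np.pi * y) / (2 * np.pi)
    return y

x = np.array([0.25, 0.5, 0.75])
once = smooth_round(x)                   # one application, still fairly soft
thrice = smooth_round(x, iterations=3)   # much closer to 0, 0.5, 1
```

The trade-off is the usual one: the steeper the step, the flatter the curve elsewhere, so gradients away from .5 shrink quickly.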

You can use the fact that tf.maximum() and tf.minimum() are differentiable (almost everywhere), and that the inputs are probabilities between 0 and 1:
# round numbers less than 0.5 to zero
# by making them negative and taking the maximum with 0
differentiable_round = tf.maximum(x - 0.499, 0)
# scale the remaining numbers (0 to 0.501) by a large factor so that almost all of them exceed 1;
# the other half (zeros) is not affected by the multiplication
differentiable_round = differentiable_round * 10000
# take the minimum with 1
differentiable_round = tf.minimum(differentiable_round, 1)
Example:
[0.1, 0.5, 0.7]
[-0.399, 0.001, 0.201] # x - 0.499
[0, 0.001, 0.201] # max(x - 0.499, 0)
[0, 10, 2010] # max(x - 0.499, 0) * 10000
[0, 1.0, 1.0] # min(max(x - 0.499, 0) * 10000, 1)
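The same clamp can be checked in plain NumPy. This is only a sketch: hard_clip_round and its default arguments are names chosen here, and the finite-difference probe is an addition to show where the slope is non-zero, namely only on the narrow ramp between 0.499 and 0.4991.

```python
import numpy as np

def hard_clip_round(x, threshold=0.499, scale=10000.0):
    # max(x - threshold, 0) zeroes the lower half, the scaling and
    # min(..., 1) saturate everything clearly above the threshold at 1
    shifted = np.maximum(x - threshold, 0.0)
    return np.minimum(shifted * scale, 1.0)

x = np.array([0.1, 0.5, 0.7])
rounded = hard_clip_round(x)

# Central finite differences: flat at the example points, steep on the ramp
h = 1e-6
probe = np.array([0.1, 0.49905, 0.7])
slopes = (hard_clip_round(probe + h) - hard_clip_round(probe - h)) / (2 * h)
```

Outside the ramp the slope is exactly zero, so a network trained through this function receives no gradient for most inputs.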

This works for me:
x_rounded_NOT_differentiable = tf.round(x)
x_rounded_differentiable = x - tf.stop_gradient(x - x_rounded_NOT_differentiable)
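To see what this trick does to the gradient, here is a small sketch using tf.GradientTape in eager mode (TensorFlow 2.x assumed). The weights vector is an arbitrary stand-in for whatever comes after the rounding in a real model.

```python
import tensorflow as tf

x = tf.Variable([0.2, 0.7])
weights = tf.constant([1.0, 2.0])  # arbitrary downstream weights, for illustration

with tf.GradientTape() as tape:
    # Forward value is round(x); stop_gradient hides the correction term
    # from backprop, so the gradient is that of the identity function.
    x_rounded = x - tf.stop_gradient(x - tf.round(x))
    loss = tf.reduce_sum(weights * x_rounded)

grad = tape.gradient(loss, x)  # same as d(sum(weights * x))/dx
```

The forward pass sees the rounded values, while the backward pass behaves as if no rounding had happened. This is often called a straight-through estimator.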

Rounding is a fundamentally nondifferentiable function, so you're out of luck there. The normal procedure for this kind of situation is to find a way to either use the probabilities, say by using them to calculate an expected value, or by taking the maximum probability that is output and choose that one as the network's prediction. If you aren't using the output for calculating your loss function though, you can go ahead and just apply it to the result and it doesn't matter if it's differentiable. Now, if you want an informative loss function for the purpose of training the network, maybe you should consider whether keeping the output in the format of probabilities might actually be to your advantage (it will likely make your training process smoother)- that way you can just convert the probabilities to actual estimates outside of the network, after training.

Building on a previous answer, a way to get an arbitrarily good approximation is to approximate round() with a finite Fourier series and use as many terms as you need. Fundamentally, you can think of round(x) as adding a reverse (i.e. descending) sawtooth wave to x. So, using the Fourier expansion of the sawtooth wave, we get
round(x) ≈ x + (1/π) Σ_{k=1..N} (−1)^k sin(2πkx) / k
With N = 5, we get a pretty nice approximation.
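A NumPy sketch of the truncated series, assuming the standard Fourier expansion of a period-1 sawtooth, i.e. round(x) ≈ x + (1/π) Σ_{k=1..N} (−1)^k sin(2πkx)/k. The function name fourier_round and the sample points are choices made here for illustration.

```python
import numpy as np

def fourier_round(x, n_terms=5):
    """Approximate round(x) by x plus the first n_terms sawtooth harmonics."""
    x = np.asarray(x, dtype=float)
    k = np.arange(1, n_terms + 1)
    # outer product: one row per input value, one column per harmonic
    harmonics = np.sin(2 * np.pi * np.multiply.outer(x, k)) * (-1.0) ** k / k
    return x + harmonics.sum(axis=-1) / np.pi

x = np.array([0.25, 0.75, 1.2, 1.8])
approx = fourier_round(x, n_terms=5)
max_error = np.max(np.abs(approx - np.round(x)))
```

Every term is a sine, so the approximation is differentiable everywhere; the price is some ripple near the half-integers, where the true function jumps.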

Kind of an old question, but I just solved this problem for TensorFlow 2.0. I am using the following round function in my audio auto-encoder project. I basically want to create a discrete representation of sound which is compressed in time. I use the round function to clamp the output of the encoder to integer values. It has been working well for me so far.
@tf.custom_gradient
def round_with_gradients(x):
    def grad(dy):
        # pass the upstream gradient through unchanged
        return dy
    return tf.round(x), grad

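In the range 0 to 1, translating and scaling a sigmoid can be a solution:
slope = 1000
center = 0.5
# tf.sigmoid(z) is mathematically e/(e+1) with e = tf.exp(z),
# but it does not overflow when slope*(x-center) is large
round_diff = tf.sigmoid(slope * (x - center))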
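With a slope this large, the way the sigmoid is evaluated matters. The NumPy sketch below (float32, to mirror TensorFlow's default dtype) compares the literal e/(e+1) form with an algebraically equivalent tanh form; the variable names are illustrative.

```python
import numpy as np

slope, center = np.float32(1000.0), np.float32(0.5)
x = np.array([0.1, 0.5, 0.9], dtype=np.float32)
z = slope * (x - center)

with np.errstate(over="ignore", invalid="ignore"):
    e = np.exp(z)          # exp(400) overflows float32 to inf
    naive = e / (e + 1)    # inf / inf gives nan for x = 0.9

# Same function written with tanh, which saturates instead of overflowing
stable = 0.5 * (1 + np.tanh(z / 2))
```

Even in the stable form, the gradient of such a steep sigmoid is essentially zero everywhere except very close to the center, so a smaller slope may train better.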

There is a function called soft_round, shipped in the TensorFlow Compression package (tensorflow_compression), which achieves exactly this.
Fortunately, for those who cannot use it (for example on lower versions), the source code is really simple, so I just copy-pasted those lines, and it works like a charm:
def soft_round(x, alpha, eps=1e-3):
    """Differentiable approximation to `round`.

    Larger alphas correspond to closer approximations of the round function.
    If alpha is close to zero, this function reduces to the identity.

    This is described in Sec. 4.1. in the paper
    > "Universally Quantized Neural Compression"
    > Eirikur Agustsson & Lucas Theis
    > https://arxiv.org/abs/2006.09952

    Args:
      x: `tf.Tensor`. Inputs to the rounding function.
      alpha: Float or `tf.Tensor`. Controls smoothness of the approximation.
      eps: Float. Threshold below which `soft_round` will return identity.

    Returns:
      `tf.Tensor`
    """
    # This guards the gradient of tf.where below against NaNs, while maintaining
    # correctness, as for alpha < eps the result is ignored.
    alpha_bounded = tf.maximum(alpha, eps)
    m = tf.floor(x) + .5
    r = x - m
    z = tf.tanh(alpha_bounded / 2.) * 2.
    y = m + tf.tanh(alpha_bounded * r) / z
    # For very low alphas, soft_round behaves like identity
    return tf.where(alpha < eps, x, y, name="soft_round")
alpha sets how soft the function is. Greater values lead to better approximations of the round function, but then the model becomes harder to fit, since the gradients vanish:
import numpy as np
import matplotlib.pyplot as plt
x = tf.convert_to_tensor(np.arange(-2, 2, .1).astype(np.float32))
for alpha in [3., 7., 15.]:
    y = soft_round(x, alpha)
    plt.plot(x.numpy(), y.numpy(), label=f'alpha={alpha}')
plt.legend()
plt.title('Soft round function for different alphas')
plt.grid()
In my case, I tried different values for alpha, and 3. looks like a good choice.

Related

Different Sigmoid Equations and its implementation

While reviewing the sigmoid function that is used in neural nets, we found this equation at https://en.wikipedia.org/wiki/Softmax_function#Softmax_Normalization:
x_i' = 1 / (1 + exp(-(x_i - μ) / σ))
It is different from the standard sigmoid equation:
f(x) = 1 / (1 + exp(-x))
The first equation involves the mean and standard deviation (I hope I didn't read the symbols wrongly), whereas the second equation leaves out the subtraction of the mean and the division by the standard deviation, as if they were a constant that is the same for all terms within a vector/matrix/tensor.
So when implementing the equations, I get different results.
With the 2nd equation (standard sigmoid function):
def sigmoid(x):
    return 1. / (1 + np.exp(-x))
I get this output:
>>> x = np.array([1,2,3])
>>> print(sigmoid(x))
[ 0.73105858 0.88079708 0.95257413]
I would have expected the 1st function to give similar results, but the gap between the first and second elements widens by quite a bit (though the ranking of the elements remains the same):
def get_statistics(x):
    n = float(len(x))
    m = x.sum() / n                    # mean
    s2 = sum((x - m)**2) / (n - 1.)    # sample variance
    s = s2**0.5                        # sample standard deviation
    return m, s2, s
# unpack in the same order as the return statement
m, s2, s = get_statistics(x)
sigmoid_x1 = 1 / (1 + np.exp(-(x[0] - m) / s))
sigmoid_x2 = 1 / (1 + np.exp(-(x[1] - m) / s))
sigmoid_x3 = 1 / (1 + np.exp(-(x[2] - m) / s))
sigmoid_x1, sigmoid_x2, sigmoid_x3
[out]:
(0.2689414213699951, 0.5, 0.7310585786300049)
Possibly it has to do with the fact that the first equation contains some sort of softmax normalization, but if it's the generic softmax then the elements need to sum to one, as such:
def softmax(x):
    exp_x = np.exp(x)
    return exp_x / exp_x.sum()
[out]:
>>> x = np.array([1,2,3])
>>> print(softmax(x))
[ 0.09003057 0.24472847 0.66524096]
But the outputs from the first equation don't sum to one, and they aren't similar to or the same as those of the standard sigmoid equation. So the question is:
Have I implemented the function for equation 1 wrongly?
Is equation 1 on the wikipedia page wrong? Or is it referring to something else and not really the sigmoid/logistic function?
Why is there a difference in the first and second equation?

You have implemented the equations correctly. Your problem is that you are mixing up the definitions of softmax and sigmoid functions.
A softmax function is a way to normalize your data by making outliers "less interesting". Additionally, it "squashes" your input vector in a way that it ensures the sum of the vector to be 1.
For your example:
> np.sum([ 0.09003057, 0.24472847, 0.66524096])
> 1.0
It is simply a generalization of a logistic function with the additional "constraint" to get every element of the vector in the interval (0, 1) and its sum to 1.0.
The sigmoid function is another special case of logistic functions. It is just a real-valued, differentiable function with an S shape. It is interesting for neural networks because it is rather easy to compute, non-linear and bounded from below and above (by 0 and 1), so your activation cannot diverge but runs into saturation if the input gets "too high".
However, a sigmoid function is not ensuring that an input vector sums up to 1.0.
In neural networks, sigmoid functions are used frequently as an activation function for single neurons, while a sigmoid/softmax normalization function is rather used at the output layer, to ensure the whole layer adds up to 1. You just mixed up the sigmoid function (for single neurons) versus the sigmoid/softmax normalization functions (for a whole layer).
EDIT: To clarify this, I will give you an easy example with outliers that demonstrates the behaviour of the two different functions.
Let's implement a sigmoid function:
import numpy as np
def s(x):
    return 1.0 / (1.0 + np.exp(-x))
And the normalized version (in little steps, making it easier to read):
def sn(x):
    numerator = x - np.mean(x)
    denominator = np.std(x)
    fraction = numerator / denominator
    return 1.0 / (1.0 + np.exp(-fraction))
Now we define some measurements of something with huge outliers:
measure = np.array([0.01, 0.2, 0.5, 0.6, 0.7, 1.0, 2.5, 5.0, 50.0, 5000.0])
Now we take a look at the results that s (sigmoid) and sn (normalized sigmoid) give:
> s(measure)
> array([ 0.50249998, 0.549834 , 0.62245933, 0.64565631, 0.66818777,
0.73105858, 0.92414182, 0.99330715, 1. , 1. ])
> sn(measure)
> array([ 0.41634425, 0.41637507, 0.41642373, 0.41643996, 0.41645618,
0.41650485, 0.41674821, 0.41715391, 0.42447515, 0.9525677 ])
As you can see, s only translates the values "one-by-one" via a logistic function, so the outliers are fully saturated at 0.993, 1.0, 1.0. The distance between the other values varies.
When we look at sn we see that the function actually normalized our values. Everything is now nearly identical, except for 0.95, which was the 5000.0.
What is this good for or how to interpret this?
Think of an output layer in a neural network: an activation of 5000.0 in one class on an output layer (compared to our other small values) means that the network is really sure that this is the "right" class for your given input. If you had used s there, you would end up with 0.99, 1.0 and 1.0 and would not be able to distinguish which class is the correct guess for your input.

In this case you have to differentiate between three things: a sigmoid function, a sigmoid function with softmax normalization, and a softmax function.
A sigmoid function is a real-valued function which is simply given by the equation f(x) = 1 / (1 + exp(-x)). For many years it was used in the machine learning domain because it squashes a real input to the (0,1) interval, which might be interpreted as e.g. a probability value. Right now, many experts advise against using it because of its saturation and non-zero mean problems. You can read about it (as well as how to deal with the problem) e.g. here: http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf.
A sigmoid with softmax normalization is used to deal with two important problems which may occur when using a sigmoid function. The first is dealing with outliers (it shifts your x to be centered around 0 and scales it to sd = 1, which normalizes your data) and the second (in my opinion the more crucial one) is to make different variables equally important in further analysis. To understand this phenomenon, imagine that you have two variables, age and income, where age varies from 20 to 70 and income varies from 2000 to 60000. Without normalizing the data, both variables would be squashed to almost one by the sigmoid transformation. Moreover, due to its greater mean absolute value, the income variable would be significantly more important for your analysis without any rational explanation.
I think that standardization is even more crucial for understanding softmax normalization than dealing with outliers. To understand why, imagine a variable which is equal to 0 in 99% of cases and 1 in the others. In this case your sd ≈ 0.1 and mean ≈ 0.01, so softmax normalization would push the value 1 out even further (to about 10 standard deviations above the mean).
Something completely different is a softmax function. A softmax function is a mathematical transformation from R^k to R^k which squashes a real-valued vector to a positive-valued vector of the same size which sums up to 1. It's given by the equation softmax(v) = exp(v)/sum(exp(v)). It's something completely different from softmax normalization and it's usually used in the case of multiclass classification.
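The age/income argument can be made concrete with a few made-up values. This is only a sketch: the numbers below are invented for illustration, and normalized_sigmoid uses the population standard deviation from np.std.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def normalized_sigmoid(x):
    # z-score first, then squash (the "softmax normalization" described above)
    return sigmoid((x - x.mean()) / x.std())

age = np.array([20.0, 35.0, 50.0, 70.0])                 # made-up values
income = np.array([2000.0, 15000.0, 30000.0, 60000.0])   # made-up values

# Raw inputs: every value lands in the saturated region, close to 1
raw_age, raw_income = sigmoid(age), sigmoid(income)

# Standardized inputs: both variables spread over a comparable range
norm_age, norm_income = normalized_sigmoid(age), normalized_sigmoid(income)
```

After standardization both variables occupy roughly the same part of the sigmoid, so neither dominates simply because of its units.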

sampling multinomial from small log probability vectors in numpy/scipy

Is there a function in numpy/scipy that lets you sample multinomial from a vector of small log probabilities, without losing precision? example:
# sample element randomly from these log probabilities
l = [-900, -1680]
the naive method fails because of underflow:
import scipy
import numpy as np
# this makes a all zeroes
a = np.exp(l) / scipy.misc.logsumexp(l)
r = np.random.multinomial(1, a)
this is one attempt:
def s(l):
    m = np.max(l)
    norm = m + np.log(np.sum(np.exp(l - m)))
    p = np.exp(l - norm)
    return np.where(np.random.multinomial(1, p) == 1)[0][0]
is this the best/fastest method and can np.exp() in the last step be avoided?

First of all, I believe the problem you're encountering is because you're normalizing your probabilities incorrectly. This line is incorrect:
a = np.exp(l) / scipy.misc.logsumexp(l)
You're dividing a probability by a log probability, which makes no sense. Instead you probably want
a = np.exp(l - scipy.special.logsumexp(l))
(In current SciPy releases logsumexp lives in scipy.special; scipy.misc.logsumexp was its old location.)
If you do that, you find a = [1, 0] and your multinomial sampler works as expected up to floating point precision in the second probability.
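A NumPy-only sketch of the log-space normalization, for readers who want to see the arithmetic without SciPy. The helper name log_normalize is an assumption made here; it uses the same max-shift idea as the asker's attempt.

```python
import numpy as np

def log_normalize(logp):
    """Return log-probabilities shifted so that exp() of them sums to 1."""
    logp = np.asarray(logp, dtype=float)
    m = logp.max()
    # log-sum-exp with the maximum factored out, so the largest term is exp(0)
    log_total = m + np.log(np.sum(np.exp(logp - m)))
    return logp - log_total

l = [-900, -1680]
naive = np.exp(l)                 # both entries underflow to 0.0
p = np.exp(log_normalize(l))      # [1.0, 0.0] up to floating point
sample = np.random.multinomial(1, p)
```

The second probability is exp(-780), which is far below the smallest positive float64, so it is still rounded to zero after normalization. That is the precision limit the rest of this answer works around.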
A Solution for Small N: Histograms
That said, if you still need more precision and performance is not as much of a concern, one way you could make progress is by implementing a multinomial sampler from scratch, and then modifying this to work at higher precision.
NumPy's multinomial function is implemented in Cython, and essentially performs a loop over a number of binomial samples and combines them into a multinomial sample.
You can call it like this:
np.random.multinomial(10, [0.1, 0.2, 0.7])
# [0, 1, 9]
(Note that the precise output values here & below are random, and will change from call to call).
Another way you might implement a multinomial sampler is to generate N uniform random values, then compute the histogram with bins defined by the cumulative probabilities:
def multinomial(N, p):
    rand = np.random.uniform(size=N)
    p_cuml = np.cumsum(np.hstack([[0], p]))
    p_cuml /= p_cuml[-1]
    return np.histogram(rand, bins=p_cuml)[0]
multinomial(10, [0.1, 0.2, 0.7])
# [1, 1, 8]
With this method in mind, we can think about doing things to higher precision by keeping everything in log-space. The main trick is to realize that the log of uniform random deviates is equivalent to the negative of exponential random deviates, and so you can do everything above without ever leaving log space:
def multinomial_log(N, logp):
    log_rand = -np.random.exponential(size=N)
    logp_cuml = np.logaddexp.accumulate(np.hstack([[-np.inf], logp]))
    logp_cuml -= logp_cuml[-1]
    return np.histogram(log_rand, bins=logp_cuml)[0]
multinomial_log(10, np.log([0.1, 0.2, 0.7]))
# [1, 2, 7]
The resulting multinomial draws will maintain precision even for very small values in the p array.
Unfortunately, these histogram-based solutions will be much slower than the native np.random.multinomial function, so if performance is an issue you may need another approach. One option would be to adapt NumPy's Cython code to work in log-space, using similar mathematical tricks as I used here.
A Solution for Large N: Poisson Approximation
The problem with the above solution is that as N grows large, it becomes very slow.
I was thinking about this and realized there's a more efficient way forward, despite np.random.multinomial failing for probabilities smaller than 1E-16 or so.
Here's an example of that failure: on a 64-bit machine, this will always give zero for the first entry because of the way the code is implemented, when in reality it should give something near 10:
np.random.multinomial(1E18, [1E-17, 1])
# array([ 0, 1000000000000000000])
If you dig into the source, you can trace this issue to the binomial function upon which the multinomial function is built. The cython code internally does something like this:
def multinomial_basic(N, p, size=None):
    results = np.array([np.random.binomial(N, pi, size) for pi in p])
    results[-1] = int(N) - results[:-1].sum(0)
    return np.rollaxis(results, 0, results.ndim)
multinomial_basic(1E18, [1E-17, 1])
# array([ 0, 1000000000000000000])
The problem is that the binomial function chokes on very small values of p – this is because the algorithm computes the value (1 - p), so the value of p is limited by floating-point precision.
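The precision limit is easy to see in isolation. The sketch below only illustrates the floating-point effect; it does not reproduce NumPy's sampler, and the log1p comparison is an addition made here to show that the information is lost by the subtraction itself.

```python
import numpy as np

p = 1e-17
one_minus_p = 1.0 - p              # rounds to exactly 1.0 in float64

# log1p keeps the information that plain subtraction throws away
log_q_naive = np.log(one_minus_p)  # 0.0
log_q_exact = np.log1p(-p)         # about -1e-17

N = 1e18
expected_count = N * p             # about 10, the mean a sampler should reproduce
```

Any algorithm that forms 1 - p explicitly therefore treats p = 1e-17 as if it were zero, even though N * p is far from negligible.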
So what can we do? Well, it turns out that for small values of p, the Poisson distribution is an extremely good approximation of the binomial distribution, and the implementation doesn't have these issues. So we can build a robust multinomial function based on a robust binomial sampler that switches to a Poisson sampler at small p:
def binomial_robust(N, p, size=None):
    if p < 1E-7:
        return np.random.poisson(N * p, size)
    else:
        return np.random.binomial(N, p, size)
def multinomial_robust(N, p, size=None):
    results = np.array([binomial_robust(N, pi, size) for pi in p])
    results[-1] = int(N) - results[:-1].sum(0)
    return np.rollaxis(results, 0, results.ndim)
multinomial_robust(1E18, [1E-17, 1])
# array([ 12, 999999999999999988])
The first entry is nonzero and near 10 as expected! Note that we can't use N larger than 1E18, because it will overflow the long integer.
But we can confirm that our approach works for smaller probabilities using the size parameter, and averaging over results:
p = [1E-23, 1E-22, 1E-21, 1E-20, 1]
size = int(1E6)
multinomial_robust(1E18, p, size).mean(0)
# array([ 1.70000000e-05, 9.00000000e-05, 9.76000000e-04,
# 1.00620000e-02, 1.00000000e+18])
We see that even for these very small probabilities, the multinomial values are turning up in the right proportion. The result is a very robust and very fast approximation to the multinomial distribution for small p.

Batch Gradient Descent for Logistic Regression

I've been following Andrew Ng's CS229 machine learning course, and am now covering logistic regression. The goal is to maximize the log likelihood function and find the optimal values of theta to do so. The link to the lecture notes is http://cs229.stanford.edu/notes/cs229-notes1.ps (pages 16-19). The code below was shown on the course homepage (in MATLAB though; I converted it to Python).
I'm applying it to a data set with 100 training examples (a data set given on the Coursera homepage for an introductory machine learning course). The data has two features, which are the scores on two exams. The output is 1 if the student received admission and 0 if the student did not. I have shown all of the code below. The following code causes the likelihood function to converge to a maximum of about -62. The corresponding values of theta are [-0.05560301 0.01081111 0.00088362]. Using these values, when I test a training example like [1, 30.28671077, 43.89499752], which should give a value of 0 as output, I obtain 0.576, which makes no sense to me. If I test the hypothesis function with input [1, 10, 10], I obtain 0.515, which once again makes no sense. These values should correspond to a lower probability. This has me quite confused.
import numpy as np
import sig as s  # presumably the asker's own module providing sigmoid()
def batchlogreg(X, y):
    max_iterations = 800
    alpha = 0.00001
    (m,n) = np.shape(X)
    X = np.insert(X, 0, 1, 1)
    theta = np.array([0] * (n+1), 'float')
    ll = np.array([0] * max_iterations, 'float')
    for i in range(max_iterations):
        hx = s.sigmoid(np.dot(X, theta))
        d = y - hx
        theta = theta + alpha*np.dot(np.transpose(X),d)
        ll[i] = sum(y * np.log(hx) + (1-y) * np.log(1- hx))
    return (theta, ll)

Note that the sigmoid function has:
sig(0) = 0.5
sig(x > 0) > 0.5
sig(x < 0) < 0.5
Since you get all probabilities above 0.5, this suggests that you never make X * theta negative, or that you do, but your learning rate is too small to make it matter.
for i in range(max_iterations):
    hx = s.sigmoid(np.dot(X, theta)) # this will probably be > 0.5 initially
    d = y - hx # then this will be "very" negative when y is 0
    theta = theta + alpha*np.dot(np.transpose(X),d) # (1)
    ll[i] = sum(y * np.log(hx) + (1-y) * np.log(1- hx))
The problem is most likely at (1). The dot product will be very negative, but your alpha is very small and will negate its effect. So theta will never decrease enough to properly handle correctly classifying labels that are 0.
Positive instances are then only barely correctly classified for the same reason: your algorithm does not discover a reasonable hypothesis under your number of iterations and learning rate.
Possible solution: increase alpha and / or the number of iterations, or use momentum.

It sounds like you could be confusing probabilities with assignments.
The probability will be a real number between 0.0 and 1.0. A label will be an integer (0 or 1). Logistic regression is a model that provides the probability of a label being 1 given the input features. To obtain a label value, you need to make a decision using that probability. An easy decision rule is that the label is 0 if the probability is less than 0.5, and 1 if the probability is greater than or equal to 0.5.
So, for the example you gave, the decisions would both be 1 (which means the model is wrong for the first example where it should be 0).
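The distinction can be checked directly with the numbers from the question. A short sketch: theta and the two inputs are copied from the question, and the 0.5 threshold is the decision rule described above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-0.05560301, 0.01081111, 0.00088362])  # from the question
X = np.array([
    [1.0, 30.28671077, 43.89499752],
    [1.0, 10.0, 10.0],
])

prob = sigmoid(X @ theta)            # P(label = 1 | features)
label = (prob >= 0.5).astype(int)    # decision rule described above
```

Both probabilities come out slightly above 0.5 (about 0.58 and 0.52), so both decisions are 1. The model is barely committed either way, which is what one would expect from a fit that has not converged.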

I came to the same question and found the reason.
Normalize X first or set a scale-comparable intercept like 50.
Otherwise contours of cost function are too "narrow". A big alpha makes it overshoot and a small alpha fails to progress.
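The effect of normalizing X can be tried on synthetic data. This is a sketch under stated assumptions: the exam scores below are randomly generated stand-ins (not the Coursera file), the admission rule is invented, and alpha=0.01 for the standardized run is a choice made here.

```python
import numpy as np

def fit_logreg(X, y, alpha, n_iter=800):
    """Batch gradient ascent on the log-likelihood; returns (theta, final ll)."""
    X = np.column_stack([np.ones(len(X)), X])   # prepend the intercept column
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = X @ theta
        hx = 0.5 * (1.0 + np.tanh(z / 2.0))     # sigmoid, written to avoid overflow
        theta += alpha * X.T @ (y - hx)
    z = X @ theta
    # log-likelihood via logaddexp, so it stays finite when hx saturates
    ll = np.sum(y * z - np.logaddexp(0.0, z))
    return theta, ll

# Synthetic stand-in for the exam data (NOT the Coursera file)
rng = np.random.default_rng(0)
scores = rng.uniform(30, 100, size=(100, 2))
admitted = (scores.sum(axis=1) > 130).astype(float)

# Raw features with the question's tiny learning rate
_, ll_raw = fit_logreg(scores, admitted, alpha=1e-5)

# Standardized features tolerate a much larger learning rate
standardized = (scores - scores.mean(axis=0)) / scores.std(axis=0)
_, ll_std = fit_logreg(standardized, admitted, alpha=0.01)
```

With the same 800 iterations, the standardized run should reach a noticeably higher log-likelihood, because the intercept and the feature weights now live on comparable scales.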

Fitting curve: why small numbers are better?

I spent some time these days on a problem. I have a set of data:
y = f(t), where y is a very small concentration (10^-7), and t is in seconds. t varies from 0 to around 12000.
The measurements follow an established model:
y = Vs * t - ((Vs - Vi) * (1 - np.exp(-k * t)) / k)
And I need to find Vs, Vi, and k. So I used curve_fit, which returns the best fitting parameters, and I plotted the curve.
And then I used a similar model:
y = (Vs * t/3600 - ((Vs - Vi) * (1 - np.exp(-k * t/3600)) / k)) * 10**7
By doing that, t is a number of hours, and y is a number between 0 and about 10. The parameters returned are of course different. But when I plot each curve, here is what I get:
http://i.imgur.com/XLa4LtL.png
The green fit is the first model, the blue one with the "normalized" model. And the red dots are the experimental values.
The fitted curves are different. I didn't expect that, and I don't understand why. Are the calculations more accurate if the numbers are "reasonable"?

The docstring for optimize.curve_fit says,
p0 : None, scalar, or M-length sequence
    Initial guess for the parameters. If None, then the initial
    values will all be 1 (if the number of parameters for the function
    can be determined using introspection, otherwise a ValueError
    is raised).
Thus, to begin with, the initial guess for the parameters is by default 1.
Moreover, curve fitting algorithms have to sample the function for various values of the parameters. The "various values" are initially chosen with an initial step size on the order of 1. The algorithm will work better if your data varies somewhat smoothly with changes in the parameter values that are on the order of 1.
If the function varies wildly with parameter changes on the order of 1, then the algorithm may tend to miss the optimum parameter values.
Note that even if the algorithm uses an adaptive step size when it tweaks the parameter values, if the initial tweak is so far off the mark as to produce a big residual, and if tweaking in some other direction happens to produce a smaller residual, then the algorithm may wander off in the wrong direction and miss the local minimum. It may find some other (undesired) local minimum, or simply fail to converge. So using an algorithm with an adaptive step size won't necessarily save you.
The moral of the story is that scaling your data can improve the algorithm's chances of finding the desired minimum.
Numerical algorithms in general all tend to work better when applied to data whose magnitude is on the order of 1. This bias enters into the algorithm in numerous ways. For instance, optimize.curve_fit relies on optimize.leastsq, and the call signature for optimize.leastsq is:
def leastsq(func, x0, args=(), Dfun=None, full_output=0,
col_deriv=0, ftol=1.49012e-8, xtol=1.49012e-8,
gtol=0.0, maxfev=0, epsfcn=None, factor=100, diag=None):
Thus, by default, the tolerances ftol and xtol are on the order of 1e-8. If finding the optimum parameter values requires much smaller tolerances, then these hard-coded default numbers will cause optimize.curve_fit to miss the optimal parameter values.
To make this more concrete, suppose you were trying to minimize f(x) = 1e-100*x**2. The factor of 1e-100 squashes the y-values so much that a wide range of x-values (the parameter values mentioned above) will fit within the tolerance of 1e-8. So, with un-ideal scaling, leastsq will not do a good job of finding the minimum.
Another reason to use floats on the order of 1 is because there are many more (IEEE754) floats in the interval [-1,1] than there are far away from 1. For example,
import struct
def floats_between(x, y):
    """
    http://stackoverflow.com/a/3587987/190597 (jsbueno)
    """
    a = struct.pack("<dd", x, y)
    b = struct.unpack("<qq", a)
    return b[1] - b[0]
In [26]: floats_between(0,1) / float(floats_between(1e6,1e7))
Out[26]: 311.4397707054894
This shows there are over 300 times as many floats representing numbers between 0 and 1 than there are in the interval [1e6, 1e7].
Thus, all else being equal, you'll typically get a more accurate answer if working with small numbers than very large numbers.

I would imagine it has more to do with the initial parameter estimates you are passing to curve fit. If you are not passing any I believe they all default to 1. Normalizing your data makes those initial estimates closer to the truth. If you don't want to use normalized data just pass the initial estimates yourself and give them reasonable values.

Others have already mentioned that you probably need a good starting guess for your fit. In cases like this, I usually try to find some quick and dirty tricks to get at least a ballpark estimate of the parameters. In your case, for large t, the exponential decays pretty quickly to zero, so for large t, you have
y == Vs * t - (Vs - Vi) / k
Doing a first-order linear fit like
[slope1, offset1] = polyfit(t[t > 2000], y[t > 2000], 1)
you will get slope1 == Vs and offset1 == (Vi - Vs) / k.
Subtracting this straight line from all the points you have, you get the exponential
residual == y - slope1 * t - offset1 == (Vs - Vi) / k * exp(-t * k)
Taking the log of both sides, you get
log(residual) == log((Vs - Vi) / k) - t * k
So doing a second fit
[slope2, offset2] = polyfit(t, log(y - slope1 * t - offset1), 1)
will give you slope2 == -k and offset2 == log((Vs - Vi) / k), which is solvable for Vi since you already know Vs and k. You might have to limit the second fit to small values of t, otherwise you might be taking the log of negative numbers. Collect all the parameters you obtained with these fits and use them as the starting points for your curve_fit.
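Here is a NumPy sketch of this two-stage estimate on synthetic, noise-free data. The "true" parameter values, the cut-offs t > 6000 and t < 1500, and the variable names are all assumptions chosen for the illustration. It relies on the remainder after subtracting the straight line being (Vs - Vi) / k * exp(-k * t), which follows from the model.

```python
import numpy as np

def model(t, Vs, Vi, k):
    return Vs * t - (Vs - Vi) * (1 - np.exp(-k * t)) / k

# Synthetic, noise-free data with assumed "true" parameters
true_Vs, true_Vi, true_k = 1e-11, 2e-12, 2e-3
t = np.linspace(0, 12000, 241)
y = model(t, true_Vs, true_Vi, true_k)

# Stage 1: straight line through the late points, where exp(-k*t) is ~0
late = t > 6000
slope1, offset1 = np.polyfit(t[late], y[late], 1)
Vs0 = slope1

# Stage 2: the remainder is (Vs - Vi)/k * exp(-k*t); fit its log on early points
early = t < 1500
residual = y[early] - slope1 * t[early] - offset1
slope2, offset2 = np.polyfit(t[early], np.log(residual), 1)
k0 = -slope2
Vi0 = Vs0 - k0 * np.exp(offset2)

p0 = (Vs0, Vi0, k0)   # starting point to hand to curve_fit(..., p0=p0)
```

With real, noisy measurements the early residuals can turn negative, so the cut-off for the second fit needs to be chosen by looking at the data.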
Finally, you might want to look into doing some sort of weighted fit. The information about the exponential part of your curve is contained in just the first few points, so maybe you should give those a higher weight. Doing this in a statistically correct way is not trivial.

Equivalent python command for quantile in matlab

I'm trying to replicate some Matlab code in python. I could not find an exact equivalent to the Matlab function quantile. What I found most close is python's mquantiles.
Matlab example:
quantile( [ 8.60789925e-05, 1.98989354e-05 , 1.68308882e-04, 1.69379370e-04], 0.8)
...gives: 0.00016958
Same example in python:
scipy.stats.mstats.mquantiles( [8.60789925e-05, 1.98989354e-05, 1.68308882e-04, 1.69379370e-04], 0.8)
...gives 0.00016912
Does anyone know how to exactly replicate Matlab's quantile function?

The documentation for quantile (under the More About => Algorithms section) gives the exact algorithm used. Here's some Python code that does it for a single quantile of a flat array, using np.partition to do the partial sorting:
import numpy as np
def quantile(a, prob):
    """
    Estimates the prob'th quantile of the values in a data array.
    Uses the algorithm of matlab's quantile(), namely:
    - Remove any nan values
    - Take the sorted data as the (.5/n), (1.5/n), ..., (1-.5/n) quantiles.
    - Use linear interpolation for values between (.5/n) and (1 - .5/n).
    - Use the minimum or maximum for quantiles outside that range.
    See also: scipy.stats.mstats.mquantiles
    """
    a = np.asanyarray(a)
    a = a[np.logical_not(np.isnan(a))].ravel()
    n = a.size
    if prob >= 1 - .5/n:
        return a.max()
    elif prob <= .5 / n:
        return a.min()
    # find the two bounds we're interpolating between:
    # that is, find i such that (i+.5) / n <= prob <= (i+1.5)/n
    t = n * prob - .5
    i = int(np.floor(t))  # integer, so that it can be used as an index
    # partial sort so that the ith element is at position i, with bigger ones
    # to the right and smaller to the left
    a = np.partition(a, i)
    if i == t: # did we luck out and get an integer index?
        return a[i]
    else:
        # we'll linearly interpolate between this and the next index
        smaller = a[i]
        larger = a[i+1:].min()
        if np.isinf(smaller):
            return smaller # avoid inf - inf
        return smaller + (larger - smaller) * (t - i)
I only did the single-quantile, 1d case because that's all I needed. If you want several quantiles, it's probably worth just doing the full sort; if you want to do it per-axis and know you don't have any nans, all you should need to do is add an axis argument to the sort and vectorize the linear interpolation bit. Doing it per-axis with nans would be a little trickier.
This code gives:
>>> quantile([ 8.60789925e-05, 1.98989354e-05 , 1.68308882e-04, 1.69379370e-04], 0.8)
0.00016905822360000001
and the matlab code gave 0.00016905822359999999; the difference is on the order of 1e-20, which is less than machine precision.
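If a full sort is acceptable, the algorithm in the docstring above can be sketched compactly with np.interp, which interpolates linearly and clamps to the end values outside the given range. The function name quantile_interp is an assumption made here.

```python
import numpy as np

def quantile_interp(a, prob):
    """Quantile with sorted data placed at the (i + 0.5) / n positions."""
    a = np.asarray(a, dtype=float)
    a = np.sort(a[~np.isnan(a)])
    n = a.size
    # Sorted values sit at the (0.5/n), (1.5/n), ..., (1 - 0.5/n) quantiles
    positions = (np.arange(n) + 0.5) / n
    # np.interp interpolates linearly and clamps to the end values outside
    return np.interp(prob, positions, a)

data = [8.60789925e-05, 1.98989354e-05, 1.68308882e-04, 1.69379370e-04]
q80 = quantile_interp(data, 0.8)
```

Because np.interp accepts an array for its first argument, the same function can return several quantiles from one call.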

Your input vector only has 4 values, which is far too few to get a good approximation of the quantiles of the underlying distribution. The discrepancy is probably the result of Matlab and SciPy using different heuristics to compute quantiles on under sampled distributions.

A bit late, but:
mquantiles is very flexible. You just need to provide alphap and betap parameters.
Here, since MATLAB does a linear interpolation, you need to set the parameters to (0.5,0.5).
In [9]: scipy.stats.mstats.mquantiles( [8.60789925e-05, 1.98989354e-05, 1.68308882e-04, 1.69379370e-04], 0.8, alphap=0.5, betap=0.5)
EDIT: MATLAB says that it does linear interpolation, however it seems that it calculates the quantile through piece-wise linear interpolation, which is equivalent to Type 5 quantile in R, and (0.5, 0.5) in scipy.
