I'm trying to understand how the error bands are calculated in seaborn's tsplot. Examples of the error bands are shown here.
When I plot something simple like
sns.tsplot(np.array([[0,1,0,1,0,1,0,1], [1,0,1,0,1,0,1,0], [.5,.5,.5,.5,.5,.5,.5,.5]]))
I get a vertical line at y=0.5 as expected. The top error band is also a vertical line at around y=0.665 and the bottom error band is a vertical line at around y=0.335. Can someone explain how these are derived?
EDIT: This question and answer refer to old versions of Seaborn and are not relevant for newer versions. See @CGFoX's comment below.
I'm not a statistician, but I read through the seaborn code in order to see exactly what's happening. There are three steps:
Bootstrap resampling. Seaborn creates resampled versions of your data. Each of these is
a 3x8 matrix like yours, but each row is randomly selected from the
three rows of your input. For example, one might be:
[[ 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5]
[ 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5]
[ 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5]]
and another might be:
[[ 1. 0. 1. 0. 1. 0. 1. 0. ]
[ 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5]
[ 0. 1. 0. 1. 0. 1. 0. 1. ]]
It creates n_boot of these (10000 by default).
Central tendency estimation. Seaborn runs a function on the columns of each of the 10000 resampled versions of your data. Because you didn't specify the estimator argument, it feeds the columns to a mean function (numpy.mean with axis=0). Lots of the columns in your bootstrap iterations are going to have a mean of 0.5, because they will be things like [0, 0.5, 1], [0.5, 1, 0], [0.5, 0.5, 0.5], etc., but you will also get some like [1, 1, 0] and even some [1, 1, 1], which result in higher means.
Confidence interval determination. For each column, seaborn sorts the 10000 estimates of the mean calculated from the resampled versions of the data from smallest to largest, and picks the ones that mark the lower and upper ends of the CI. By default it uses a 68% CI, so if you line up all 10000 mean estimates, it will pick the 1600th and the 8400th. (8400 - 1600 = 6800, or 68% of 10000.)
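Here is a rough sketch of those three steps in plain numpy (not seaborn's actual code, but it reproduces the band for your example):

import numpy as np

data = np.array([[0, 1, 0, 1, 0, 1, 0, 1],
                 [1, 0, 1, 0, 1, 0, 1, 0],
                 [.5, .5, .5, .5, .5, .5, .5, .5]])
rng = np.random.default_rng(0)
n_boot, ci = 10000, 68

# Step 1: resample whole rows with replacement, n_boot times.
idx = rng.integers(0, data.shape[0], size=(n_boot, data.shape[0]))
# Step 2: column-wise mean of each resampled 3x8 matrix -> (n_boot, 8) estimates.
boot_means = data[idx].mean(axis=1)
# Step 3: take the percentiles that bound the central `ci` percent.
low, high = np.percentile(boot_means, [50 - ci / 2, 50 + ci / 2], axis=0)
print(low)   # roughly 0.333 for every column
print(high)  # roughly 0.667 for every column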
A couple of notes:
There are actually only 3^3, or 27, possible ordered resampled versions of your array, and if you use a function such as mean, where the order of the rows doesn't matter, there are only 10 distinct ones (the number of size-3 multisets drawn from 3 rows). So all 10000 bootstrap iterations will be identical to one of those 27 versions, or 10 versions in the unordered case. This means it's probably silly to do 10000 iterations in this case.
The means 0.3333... and 0.6666... that show up as your confidence intervals are the means for [1,1,0] and [1,0,0] or rearranged versions of those.
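For this particular input you can enumerate all 27 ordered resamples by brute force and see exactly which column means are possible (a quick check, not something seaborn does):

import itertools
import numpy as np

rows = np.array([[0, 1, 0, 1, 0, 1, 0, 1],
                 [1, 0, 1, 0, 1, 0, 1, 0],
                 [.5, .5, .5, .5, .5, .5, .5, .5]])

# All 27 ordered resamples of three rows (with replacement) and the
# distinct column-0 means they can produce.
means = sorted({float(rows[list(idx)].mean(axis=0)[0])
                for idx in itertools.product(range(3), repeat=3)})
print(means)  # [0.0, 0.1666..., 0.3333..., 0.5, 0.6666..., 0.8333..., 1.0]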
They show a bootstrap confidence interval, computed by resampling units (rows in the 2d array input form). By default it shows a 68 percent confidence interval, which is equivalent to a standard error, but this can be changed with the ci parameter.
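For example, with the old tsplot API this question refers to (tsplot has since been removed from seaborn), a wider band could be requested roughly like this:

import numpy as np
import seaborn as sns

data = np.array([[0, 1, 0, 1, 0, 1, 0, 1],
                 [1, 0, 1, 0, 1, 0, 1, 0],
                 [.5, .5, .5, .5, .5, .5, .5, .5]])
sns.tsplot(data, ci=95)  # 95% bootstrap band instead of the default 68%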
Related
I had some code that random-initialized some numpy arrays with:
rng = np.random.default_rng(seed=seed)
new_vectors = rng.uniform(-1.0, 1.0, target_shape).astype(np.float32) # [-1.0, 1.0)
new_vectors /= vector_size
And all was working well, all project tests passing.
Unfortunately, uniform() returns np.float64, though downstream steps only want np.float32, and in some cases, this array is very large (think millions of 400-dimensional word-vectors). So the temporary np.float64 return-value momentarily uses 3X the RAM necessary.
Thus, I replaced the above with what definitionally should be equivalent:
rng = np.random.default_rng(seed=seed)
new_vectors = rng.random(target_shape, dtype=np.float32) # [0.0, 1.0)
new_vectors *= 2.0 # [0.0, 2.0)
new_vectors -= 1.0 # [-1.0, 1.0)
new_vectors /= vector_size
And after this change, all closely-related functional tests still pass, but a single distant, fringe test relying on far-downstream calculations from the vectors so-initialized has started failing. And failing in a very reliable way. It's a stochastic test, and it passes with a large margin for error in the top case, but always fails in the bottom case. So: something has changed, but in some very subtle way.
The superficial values of new_vectors seem properly and similarly distributed in both cases. And again, all the "close-up" tests of functionality still pass.
So I'd love theories for what non-intuitive changes this 3-line change may have made that could show up far-downstream.
(I'm still trying to find a minimal test that detects whatever's different. If you'd enjoy doing a deep-dive into the affected project, the exact close-up tests that succeed, the one fringe test that fails, and commits with/without the tiny change are all at https://github.com/RaRe-Technologies/gensim/pull/2944#issuecomment-704512389. But really, I'm just hoping a numpy expert might recognize some tiny corner case where something non-intuitive happens, or offer some testable theories of same.)
Any ideas, proposed tests, or possible solutions?
A way to maintain precision and conserve memory could be to create your large target array, then fill it in using blocks at higher precision.
For example:
import numpy as np

def generate(shape, value, *, seed=None, step=10):
    # Allocate the float32 result up front, then fill it block by block,
    # so the float64 intermediate only ever covers `step` rows at a time.
    arr = np.empty(shape, dtype=np.float32)
    rng = np.random.default_rng(seed=seed)
    (d0, *dr) = shape
    for i in range(0, d0, step):
        j = min(d0, i + step)
        arr[i:j, :] = rng.uniform(-1/value, 1/value, size=[j - i] + dr)
    return arr
which can be used as:
generate((100, 1024, 1024), 7, seed=13)
You can tune the size of these blocks (via step) to maintain performance.
Let's print new_vectors * 2**22 % 1 for both methods, i.e., let's look at what's left after the first 22 fractional bits (program is at the end). With the first method:
[[0. 0.5 0.25 0. 0. ]
[0.5 0.875 0.25 0. 0.25 ]
[0. 0.25 0. 0.5 0.5 ]
[0.6875 0.328125 0.75 0.5 0.52539062]
[0.75 0.75 0.25 0.375 0.25 ]]
With the second method:
[[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]
Quite a difference! The second method doesn't produce any numbers with 1-bits after the first 22 fractional bits.
Let's imagine we had a type float3 which could only hold three significant bits (think span of non-zero bits), so for example the numbers (in binary) 1.01 or 11100.0 or 0.0000111, but not 10.01 because that has four significant bits.
Then the random number generator for the range [0, 1) would pick from these eight numbers:
0.000
0.001
0.010
0.011
0.100
0.101
0.110
0.111
Wait, hold on. Why only from those eight? What about for example the aforementioned 0.0000111? That's in [0, 1) and can be represented, right?
Well yes, but note that that's in [0, 0.5). And there are no further representable numbers in the range [0.5, 1), as those numbers all start with "0.1" and thus any further 1-bits can only be at the second or third fractional bit. For example 0.1001 would not be representable, as that has four significant bits.
So if the generator were to pick from any numbers other than those eight above, they'd all have to be in [0, 0.5), creating a bias. It could instead pick from a different set of four numbers in that range, or possibly include all representable numbers in that range with proper probabilities, but either way you'd have a "gap bias", where numbers picked from [0, 0.5) can have smaller or larger gaps than numbers picked from [0.5, 1). Not sure "gap bias" is a thing or the correct term, but the point is that the distribution in [0, 0.5) would look different from that in [0.5, 1). The only way to make them look the same is to stick with picking from those equally-spaced eight numbers above. The distribution/possibilities in [0.5, 1) dictate what you should use in [0, 0.5).
So... a random number generator for float3 would pick from those eight numbers and never generate for example 0.0000111. But now imagine we also had a type float5, which could hold five significant bits. Then a random number generator for that could pick 0.00001. And if you then convert that to our float3, that would survive, you'd have 0.00001 as float3. But in the range [0.5, 1), this process of generating float5 numbers and converting them to float3 would still only produce the numbers 0.100, 0.101, 0.110 and 0.111, since float3 still can't represent any other numbers in that range.
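One way to see this "gap bias" with the real float32/float64 types is to compare the smallest gap between distinct values each method produces in two subranges (a rough sketch; the exact spacings depend on your NumPy version). The cast-from-float64 method should fill [0.25, 0.5) on a finer grid than [0.5, 1.0), while the native float32 method uses the same grid everywhere:

import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.0, 1.0, 1_000_000).astype(np.float32)  # float64, then cast down
b = rng.random(1_000_000, dtype=np.float32)               # native float32

for lo, hi in [(0.25, 0.5), (0.5, 1.0)]:
    for name, v in [("cast from float64", a), ("native float32  ", b)]:
        vals = np.unique(v[(v >= lo) & (v < hi)])
        # Smallest gap between distinct values observed in this subrange.
        print(f"[{lo}, {hi}) {name}: min gap = {np.diff(vals).min():.3g}")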
So that's what you get, just with float32 and float64. Your two methods give you different distributions. I'd say the second method's distribution is actually better, as the first method has what I called "gap bias". So maybe it's not your new method that's broken, but the test. If that's the case, fix the test. Otherwise, an idea to fix your situation might be to use the old float64-to-float32 way, but not produce everything at once. Instead, prepare the float32 structure with just 0.0 everywhere, and then fill it in smaller chunks generated the old way.
Small caveat, btw: Looks like there's a bug in NumPy for generating random float32 values, not using the lowest-position bit. So that might be another reason the test fails. You could try your second method with (rng.integers(0, 2**24, target_shape) / 2**24).astype(np.float32) instead of rng.random(target_shape, dtype=np.float32). I think that's equivalent to what the fixed version would be (since it's apparently currently doing it that way, except with 23 instead of 24).
The program for the experiment at the top (also at repl.it):
import numpy as np
# Some setup
seed = 13
target_shape = (5, 5)
vector_size = 1
# First way
rng = np.random.default_rng(seed=seed)
new_vectors = rng.uniform(-1.0, 1.0, target_shape).astype(np.float32) # [-1.0, 1.0)
new_vectors /= vector_size
print(new_vectors * 2**22 % 1)
# Second way
rng = np.random.default_rng(seed=seed)
new_vectors = rng.random(target_shape, dtype=np.float32) # [0.0, 1.0)
new_vectors *= 2.0 # [0.0, 2.0)
new_vectors -= 1.0 # [-1.0, 1.0)
new_vectors /= vector_size
print(new_vectors * 2**22 % 1)
I ran your code with the following values:
seed = 0
target_shape = [100]
vector_size = 3
I noticed that the code in your first solution generated a different new_vectors than your second solution.
Specifically, it looks like, for the same seed, uniform produces only every other value that random produces. This is probably due to an implementation detail within numpy's random generator.
In the following snippet I only inserted spaces to align similar values; there is probably also some float rounding going on that makes the results appear not quite identical.
[ 0.09130779, -0.15347552, -0.30601767, -0.32231492, 0.20884682, ...]
[0.23374946, 0.09130772, 0.007424275, -0.1534756, -0.12811375, -0.30601773, -0.28317323, -0.32231498, -0.21648853, 0.20884681, ...]
Based on this, I speculate that your stochastic test case only tests your solution with one seed, and because the new solution generates a different sequence for that seed, the result changes and the test case fails.
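You can check this kind of stream divergence outside the project with a few lines (a small sketch using the same seed and a hypothetical small shape, not the gensim code):

import numpy as np

seed, vector_size = 0, 3

rng = np.random.default_rng(seed=seed)
old = rng.uniform(-1.0, 1.0, 5).astype(np.float32) / vector_size

rng = np.random.default_rng(seed=seed)
new = (rng.random(10, dtype=np.float32) * 2.0 - 1.0) / vector_size

print(old)  # compare these against every other entry of...
print(new)  # ...this: same seed, but the bit stream is consumed differently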
I have a numpy array of the following shape:
[[0.1  0.2  0.6  0.1 ]
 [0.   1.   0.   1.01]]
It is a probability vector, where the second row corresponds to a value and the first row to the probability that this value is realized. (e.g. the probability of getting 1.0 is 20%)
When two values are close to each other, I want to merge their columns by adding up the probabilities. In this example I want to have:
[image: the desired 2x2 result, where the probabilities of the merged (close) values have been added up]
My current solution involves 3 loops and is really slow for larger arrays. Does someone know an efficient way to program this in NumPy?
While it won't do exactly what you want, you could try to use np.histogram to tackle the problem.
For example say you just want two "bins" like in your example you could do
import numpy as np
x = np.array([[0.1, 0.2, 0.6, 0.1], [0.0, 1.0, 0.0, 1.01]])
hist, bin_edges = np.histogram(x[1, :], bins=[0, 1.0, 1.5], weights=x[0, :])
and then stack your histogram with the leading bin edges to get your output
print(np.stack([hist, bin_edges[:-1]]))
This will print
[[0.7 0.3]
[0. 1. ]]
You can use the bins parameter to get your desired output. I hope this helps.
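If the bin edges themselves should depend on which values are "close", one option is to derive them from the sorted values first; here is a sketch with a hypothetical tolerance, building on the same np.histogram idea:

import numpy as np

x = np.array([[0.1, 0.2, 0.6, 0.1], [0.0, 1.0, 0.0, 1.01]])
tol = 0.05  # hypothetical threshold for what counts as "close"

# One possible way to build the bin edges automatically: sort the values
# and start a new bin wherever the gap to the previous value exceeds tol.
vals = np.sort(x[1])
starts = vals[np.insert(np.diff(vals) > tol, 0, True)]
edges = np.append(starts, vals[-1] + tol)

hist, bin_edges = np.histogram(x[1], bins=edges, weights=x[0])
print(np.stack([hist, bin_edges[:-1]]))  # -> [[0.7 0.3] [0.  1. ]]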
Why does the total probability exceed 1?
import matplotlib.pyplot as plt
figure, axes = plt.subplots(nrows = 1, ncols = 1)
axes.hist(x = [0.1, 0.2, 0.3, 0.4], density = True)
figure.show()
Expected y-values: [0.25, 0.25, 0.25, 0.25]
Following is my understanding as per the documentation. I don't claim to be an expert in matplotlib, nor am I one of the authors. Your question made me think, so I read the documentation and took some logical steps to understand it. So this is not an expert opinion.
===================================================================
Since you have not passed bins information, matplotlib went ahead and created its own bins. In this case the bins are as below.
bins = [0.1 , 0.13, 0.16, 0.19, 0.22, 0.25, 0.28, 0.31, 0.34, 0.37, 0.4 ]
You can see the bin width is 0.03.
Now according to the documentation.
density : bool, optional
If True, the first element of the return
tuple will be the counts normalized to form a probability density,
i.e., the area (or integral) under the histogram will sum to 1. This
is achieved by dividing the count by the number of observations times
the bin width and not dividing by the total number of observations.
In order to make the area sum to 1, it normalizes the counts so that when you multiply each bin's normalized count by the bin width and add up the products, the result is 1.
Your counts are as below for X = [0.1, 0.2, 0.3, 0.4] (one count per bin, 10 bins in total):
OriginalCounts = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
As you can see, if you multiply the OriginalCounts array by the bin width and sum everything, it comes out to 4*0.03 = 0.12, which is less than one.
So according to the documentation we need to divide the OriginalCounts array with a factor .. which is (number of observations * bin width).
In this case the number of observations are 4 and bin width is 0.03. So 4*0.03 is equal to 0.12. Thus you divide each element of OriginalCounts with 0.12 to get a Normalized histogram values array.
That means that the revised counts are as below
NormalizedCounts = [8.33333333, 0. , 0. , 8.33333333, 0. , 0. , 8.33333333, 0. , 0. , 8.33333333]
Please note that, now, if you sum the Normalized counts multiplied by bin width, it will be equal to 1. You can quickly check this: 8.333333*4*0.03=0.9999999.. which is very close to 1.
These normalized counts are what is finally shown in the graph. This is the reason why the height of the bars in the histogram is close to 8.33 at four positions.
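You can reproduce this arithmetic with numpy directly (a quick check of the numbers above, using the same default of 10 bins):

import numpy as np

x = [0.1, 0.2, 0.3, 0.4]
counts, edges = np.histogram(x, bins=10)   # same 10 default bins as plt.hist
widths = np.diff(edges)                    # each bin is 0.03 wide
density = counts / (len(x) * widths)       # what density=True plots

print(density)                   # four bins at ~8.33, the rest 0
print((density * widths).sum())  # ~1.0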
Hope this helps.
The matplotlib.pyplot.hist() documentation describes the parameter "density" (its deprecated name was "normed") as:
density : bool, optional
If True, the first element of the return tuple will be the counts normalized to form a probability density, i.e., the area (or integral) under the histogram will sum to 1. This is achieved by dividing the count by the number of observations times the bin width and not dividing by the total number of observations.
With the first element of the tuple it refers to the y-axis values. It says that it manages to get the area under the histogram to be 1 by: dividing the count by the number of observations times the bin width.
What is the difference between count and number of observations? In my head they are the same thing: the number of instances (or number of counts or number of observations) where the variable value falls into a certain bin. However, this would mean that the transformed count for each bin is just one over the bin width (since count / (count * bin_width) = 1 / bin_width), which does not make any sense.
Could someone clarify this for me? Thank you for your help and sorry for the probably stupid question.
I think the wording in the documentation is a bit confusing. The count is the number of entries in a given bin (the height of the bin), and the number of observations is the total number of events that go into the histogram.
The documentation makes the distinction about how they normalized because there are generally two ways to do the normalization:
count / number of observations - in this case if you add up all the entries of the output array you would get 1.
count / (number of observations * bin width) - in this case the integral of the output array is 1 so it is a true probability density. This is what matplotlib does, and they just want to be clear in this distinction.
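A small numpy sketch of the two normalizations:

import numpy as np

data = np.random.default_rng(0).normal(size=1000)
counts, edges = np.histogram(data, bins=20)
widths = np.diff(edges)

p_mass = counts / counts.sum()                # entries sum to 1
p_density = counts / (counts.sum() * widths)  # area under histogram is 1

print(p_mass.sum())                # 1.0
print((p_density * widths).sum())  # 1.0 -- this is what density=True matches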
The count over all observations is the number of observations. But with a histogram you're interested in the counts per bin. So for each bin you divide the count of that bin by the total number of observations times the bin width.
import numpy as np
observations = [1.2, 1.5, 1.7, 1.9, 2.2, 2.3, 3.6, 4.1, 4.2, 4.4]
bin_edges = [0,1,2,3,4,5]
counts, edges = np.histogram(observations, bins=bin_edges)
print(counts) # prints [0 4 2 1 3]
density, edges = np.histogram(observations, bins=bin_edges, density=True)
print(density) # prints [0. 0.4 0.2 0.1 0.3]
# calculate density manually according to formula
man_density = counts/(len(observations)*np.diff(edges))
print(man_density) # prints [0. 0.4 0.2 0.1 0.3]
# Check that density == manually calculated density
assert(np.all(man_density == density))
I'm new to Python and was wondering why np.var(x) gives a different answer from the cov(x,x) values found in the output of np.cov(x, y). Shouldn't they be the same? I understand it has something to do with the bias or ddof, something about normalising it but I am not really sure what that means and could not find any resources that specifically answer my question. Hope someone can help!
In numpy, cov defaults to a "delta degree of freedom" of 1 while var defaults to a ddof of 0. From the notes to numpy.var
Notes
-----
The variance is the average of the squared deviations from the mean,
i.e., ``var = mean(abs(x - x.mean())**2)``.
The mean is normally calculated as ``x.sum() / N``, where ``N = len(x)``.
If, however, `ddof` is specified, the divisor ``N - ddof`` is used
instead. In standard statistical practice, ``ddof=1`` provides an
unbiased estimator of the variance of a hypothetical infinite population.
``ddof=0`` provides a maximum likelihood estimate of the variance for
normally distributed variables.
So you can get them to agree by taking:
In [69]: cov(x,x)#defaulting to ddof=1
Out[69]:
array([[ 0.5, 0.5],
[ 0.5, 0.5]])
In [70]: x.var(ddof=1)
Out[70]: 0.5
In [71]: cov(x,x,ddof=0)
Out[71]:
array([[ 0.25, 0.25],
[ 0.25, 0.25]])
In [72]: x.var()#defaulting to ddof=0
Out[72]: 0.25
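For completeness, here is the formula spelled out by hand. The original x isn't shown; assuming, say, x = np.array([1.0, 2.0]), which reproduces the 0.25 and 0.5 above:

import numpy as np

# The original x isn't shown in the session; x = [1.0, 2.0] is one array
# that reproduces the 0.25 / 0.5 values seen there.
x = np.array([1.0, 2.0])
n = len(x)
sq_dev = ((x - x.mean()) ** 2).sum()

print(sq_dev / n)        # 0.25 -> np.var(x) with default ddof=0
print(sq_dev / (n - 1))  # 0.5  -> np.cov(x, x) entries, ddof=1 by default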