understanding min and max values of randn while using NumPy - python

I am fairly new to Python and learning some basic Python for data science, and I am trying to ascertain the min and max values for an array when you run randn.
To clarify the question: how low and how high could the numbers below potentially get?
Does it have anything to do with the row/column values entered?
I thought the values were between -1 and 1, but this is not the case when I test it.
import numpy as np
np.random.randn(3,3)
array([[ 1.61311526, -1.20028357, -0.41723647],
       [-0.31983635, -3.05411198, -0.43453723],
       [ 0.09385744, -0.28239577, -1.17262933]])

As mentioned by others, the values follow the standard normal (bell curve) distribution. A couple of reference points from that curve:
Probability of getting a value from -0.5 to 0.5: 19.1% + 19.1% = 38.2%
Probability of getting a value larger than 3: about 0.1%

np.random.randn returns a sample (or samples) from the “standard normal” distribution (see the documentation).
The standard normal distribution is not bounded. However, the probability of, for example, getting a sample smaller than -3 is only about 0.0013.
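For reference, that tail probability can be reproduced with SciPy's normal CDF (a quick sketch, assuming SciPy is available):
from scipy.stats import norm

print(norm.cdf(-3))      # P(X < -3) for the standard normal, about 0.00135
print(1 - norm.cdf(3))   # the symmetric upper tail, P(X > 3)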

The function numpy.random.randn returns values from the standard normal distribution, which can be anything between negative and positive infinity, so there's no max or min. These values are distributed along the "bell curve" centered at 0, and are exponentially less likely to occur the farther you get from 0.
The row/column parameters don't determine any (non-existent) max/min; they just determine the shape of the output array (see the documentation).
So in your example, passing (3,3) into np.random.randn(3,3) returns a 3x3 array of values from the standard normal distribution.
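A quick way to convince yourself that the arguments only control the output shape (a small sketch):
import numpy as np

print(np.random.randn(3, 3).shape)   # (3, 3) -- a 3x3 array of samples
print(np.random.randn(2, 5).shape)   # (2, 5) -- same distribution, different shape
print(np.random.randn())             # no arguments: a single float sample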

Basically, there's no max or min value, but since values far from zero have low probability, in practice you'll usually only see values between about -3.5 and 3.5. The larger the amount of random data you generate, however, the higher the chance of seeing a more extreme value.
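A small seeded sketch illustrating that point (the exact numbers depend on the seed, but the maximum tends to creep upward as the sample grows):
import numpy as np

rng = np.random.default_rng(0)            # seeded so the run is repeatable
for n in (10, 1_000, 100_000, 10_000_000):
    sample = rng.standard_normal(n)
    print(n, sample.max())                # observed maximum grows with sample size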

The numpy.random.randn function draws from the standard normal distribution, meaning that there is no maximum or minimum value. However, values far from zero are less likely to be produced than values close to zero.

Related

Is it possible to make the mean of a list of numbers generated from the normal distribution exactly zero?

So, even in the absence of errors due to finite machine precision, it would be mathematically incorrect to expect a finite number of points sampled from a Gaussian distribution to always give a mean of exactly zero. One would truly need an infinite number of points for that to hold exactly.
Nonetheless, I am manually (in an ad hoc manner) trying to center the distribution so that the mean is at zero. For that I first generate a Gaussian distribution, find its mean and then shift each point by that mean. This brings the mean very close to zero, but I am left with a small value close to machine precision (of the order 10**(-17)) and I do not know how to make it exactly zero.
Here is the code I used:
import numpy as np
n = 10000
X = np.random.normal(0, 1, size=(2, n))[0, :]
Xm = np.mean(X)
print("Xm = ", Xm)
Y = np.random.normal(0, 1, size=(2, n))[1, :]
Ym = np.mean(Y)        # was np.mean(Ym), which raises a NameError
print("Ym = ", Ym)     # was printing the whole array Y instead of the mean Ym
for i in range(len(X)):
    X[i] = X[i] - Xm   # shift each point by the sample mean
    Y[i] = Y[i] - Ym
new_X = np.mean(X)
new_Y = np.mean(Y)
print(new_X)
print(new_Y)
Output:
Xm =  0.002713682499601005
Ym =  -0.0011499576497770079
-3.552713678800501e-18
2.2026824808563105e-17
I am not good with code, but mathematically you could have a while loop that checks whether the sum of the numbers is 0. If it isn't 0, you would add 1 to the lowest unit of precision you allow.
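As a side note on the code in the question, the centering can be done without an explicit loop, and the leftover mean is still only near machine precision rather than exactly zero (a sketch, seeded for repeatability):
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(10_000)
X -= X.mean()        # vectorized centering instead of the element-wise loop
print(X.mean())      # typically of the order 1e-17, not exactly 0.0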

How can I find the min and max values of y and draw a border between them in Python?

I am currently gathering X-axis gyro data from a device. I need to find the differences between readings in terms of their y values in graphs like the one I recorded.
I have written an algorithm that finds the min and max y values, but when the data has minor fluctuations it returns faulty answers. I wrote it in Python and it is as follows:
import numpy as np
x, y = np.loadtxt("gyroDataX.txt", unpack=True, delimiter=",")
max_value = max(y)
min_value = min(y)   # was min(x); the minimum should also be taken from y
print(min_value)
borders = max_value - min_value
I now need to write an algorithm that will:
Determine the max and min y value and draw their borders.
If it sees minor fluctuations it will ignore them.
Would writing such an algorithm be possible, and if so, how could I go about writing one? Are there any libraries or reading material you would recommend? Thanks for your help.
1. Generally, maths like this should be done in pure code, with little or no help from external APIs, so it's easier to debug the algorithm and the process.
2. Now for a possible answer:
Since you do not want any outliers (those buggy and irritating minor fluctuations), you need to calculate the standard deviation of your data.
What is the standard deviation, you might ask?
It represents how far your data is spread out from the mean (symbol: µ), i.e. the average, of your data set.
If you do not know what it is, here is a quick tutorial on the standard deviation and its partner, variance.
First, the MEAN:
It should not be that hard to calculate the mean of your x and y arrays/lists. Just loop (a for loop would work) through the list, add up all the values, and divide by the length of the list itself. There you have the mean.
Second, the VARIANCE (σ squared):
If you followed the website above: to calculate the variance, loop through the x and y lists again, subtract the respective mean from each value to get the difference, square this difference, add up all the squared differences, and divide by the length of the respective list; that gives you the variance.
For the standard deviation (symbol: σ), just take the square root of the variance.
Now the standard deviation can be used to find the real max and min (leaving out those buggy outliers).
Use the normal curve (the 68-95-99.7 rule) as an approximate reference to find where most of your values are likely to be.
Since your data is mostly uniform, you should get a pretty accurate answer.
Do test the different bands, µ ± σ or µ ± 2σ, to find the optimum max and min.
Edit:
Sorry, considering only y now.
Sorry for the horrible representation and drawing; the sketch (not reproduced here) shows what this looks like graphically. Also, do experiment yourself with the number of standard deviations from the mean (as above, µ ± σ or µ ± 2σ) to find the max and min suited to your data.
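A minimal sketch of the approach described above, reusing the file name from the question; the ±2σ band is an assumption you should tune to your data:
import numpy as np

x, y = np.loadtxt("gyroDataX.txt", unpack=True, delimiter=",")

mu = y.mean()        # mean of the y values
sigma = y.std()      # standard deviation of the y values

# Use the mu +/- 2*sigma band as the borders, so isolated spikes are ignored.
max_value = mu + 2 * sigma
min_value = mu - 2 * sigma
border = max_value - min_value
print(min_value, max_value, border)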

Adding 0 to log-uniform parameter distribution in BayesSearchCV

I'm using scikit-optimize to run a BayesSearchCV over my RandomForestClassifier hyperparameter space. One hyperparameter is supposed to be able to take the value 0 (zero) while otherwise having a log-uniform distribution:
ccp_alpha = Real(min(ccp_alpha), max(ccp_alpha), prior='log-uniform')
Since log(0) is impossible to calculate, it is apparently impossible for the parameter to ever take the value 0.
Consequently, the following error is thrown:
ValueError: Not all points are within the bounds of the space.
Is there any way to work around this?
Note that getting a 0 from a log-uniform distribution is not well defined. How would you normalize this distribution, or in other words what would the odds be of drawing a 0?
The simplest approach would be to yield a list of values to try with the specified distribution. Since the values in this list will be sampled uniformly, you can use any distribution you like. For example with the list
reals = [0,0,0,0,x1,x2,x3,x4]
where x1 to x4 are log-uniformly distributed, you get odds of 4/8 of drawing a 0 and odds of 4/8 of drawing a log-uniformly distributed value.
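A sketch of how such a list could be built and handed to the search space as a categorical dimension; the bounds, the number of draws, and the 50/50 split are all assumptions to adapt to your case (and this relies on Categorical accepting a prior argument in your scikit-optimize version):
import numpy as np
from skopt.space import Categorical

rng = np.random.default_rng(0)

# Four log-uniform draws between 1e-4 and 1 (placeholder bounds for illustration).
log_uniform_draws = np.exp(rng.uniform(np.log(1e-4), np.log(1.0), size=4))

# Give 0 a 50% prior and split the remaining 50% over the log-uniform draws,
# mimicking the 4-zeros-out-of-8 list above without duplicate categories.
ccp_alpha_dim = Categorical([0.0] + list(log_uniform_draws),
                            prior=[0.5] + [0.125] * 4)

search_spaces = {"ccp_alpha": ccp_alpha_dim}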
If you really wanted to, you could also implement a class called MyReal (probably subclassed from Real) with an rvs method that yields the distribution you want.

Question about numpy correlate: not giving expected result

I want to make sure I am using numpy's correlate correctly, as it is not giving me the answer I expect. Perhaps I am misunderstanding the correlate function. Here is a code snippet with comments:
import numpy as np
ref = np.sin(np.linspace(-2*np.pi, 2*np.pi, 10000)) # make some data
fragment = ref[2149:7022] # create a fragment of data from ref
corr = np.correlate(ref, fragment) # Find the correlation between the two
maxLag = np.argmax(corr) # find the maximum lag, this should be the offset that we chose above, 2149
print(maxLag)
2167 # I expected this to be 2149.
Isn't the index in the corr array where the correlation is maximum the lag between these two datasets? I would think the starting index I chose for the smaller dataset would be the offset with the greatest correlation.
Why is there a discrepancy between what I expect, 2149, and the result, 2167?
Thanks
That looks like a precision issue to me. Cross-correlation is an integral, and it will always have problems when represented in discrete space; I guess the problem arises when the values are close to 0. Maybe if you scale the numbers up or increase the precision the difference will disappear, but I don't think that is really necessary, since you are already dealing with an approximation by using the discrete cross-correlation. Plotting the correlation shows that the values around the peak are indeed very close:
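Reusing the code from the question, you can also inspect the correlation at both lags to see how small the gap is (a quick sketch):
import numpy as np

ref = np.sin(np.linspace(-2 * np.pi, 2 * np.pi, 10000))
fragment = ref[2149:7022]
corr = np.correlate(ref, fragment)

peak = np.argmax(corr)
print(peak, corr[peak])          # the reported lag and its correlation value
print(2149, corr[2149])          # the expected lag; its value is nearly identical
print(corr[peak] - corr[2149])   # the gap is tiny relative to the values themselves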

How to get a percentile for an empirical data distribution and get its x-coordinate?

I have some discrete data values that, taken together, form some sort of distribution.
The attached plot shows one of them; the others are different, with the peak appearing anywhere from 0 to the end.
So, I want to use its quantiles (percentiles) in Python. I think I could write some sort of function that sums up all the values starting from zero until it reaches the desired percentage, but perhaps there is a better solution? For example, creating an empirical distribution of some sort in SciPy and then using SciPy's methods for calculating percentiles?
In the very end I need the x-coordinates of a left percentile and a right percentile. The 20th and 80th percentiles can serve as an example; I will have to find the best numbers for my case later.
Thank you in advance!
EDIT:
Some example code for almost what I want:
import numpy as np
np.random.seed(0)
distribution = np.random.normal(0, 1, 1000)
left, right = np.percentile(distribution, [20, 80])
print(left, right)
This returns the percentile values themselves, but I need to get their x-coordinates somehow. For the normal distribution here that is obviously possible, but I have a distribution of unknown shape, so if a percentile isn't equal to one of the values (which is the most common case), it becomes much more complicated.
If you are looking for the empirical CDF, you can use statsmodels' ECDF. For percentiles/quantiles you can use numpy.percentile.
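A small sketch of both suggestions, reusing the sample from the question (assuming statsmodels is installed):
import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF

np.random.seed(0)
distribution = np.random.normal(0, 1, 1000)

ecdf = ECDF(distribution)                     # empirical CDF built from the sample
print(ecdf(0.0))                              # fraction of values <= 0, close to 0.5 here
print(np.percentile(distribution, [20, 80]))  # 20th/80th percentile values of the sample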
OK, for now I have written the following function and use it:
def percentile(distribution, percent):
    # `distribution` is assumed to hold normalized weights (summing to 1),
    # one per x-coordinate; the returned index is the x-coordinate reached.
    percent = 1.0 * percent / 100
    cum_percent = 0
    i = 0
    while cum_percent <= percent:
        cum_percent = cum_percent + distribution[i]
        i = i + 1
    return i
It is a little rough, because it returns the index of the closest value to the left of the required value. For my purposes it works as a temporary solution, but I would love to see a working solution for determining the precise percentile x-coordinate.
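For a more precise x-coordinate, the cumulative sum can be interpolated; here is a sketch that assumes the data is stored as x-coordinates plus a weight (or count) per coordinate, with hypothetical xs/ws arrays:
import numpy as np

def percentile_x(x, weights, percent):
    # Normalise the weights into an empirical CDF, then interpolate to find
    # the x-coordinate where the cumulative fraction reaches `percent` %.
    weights = np.asarray(weights, dtype=float)
    cdf = np.cumsum(weights) / weights.sum()
    return np.interp(percent / 100.0, cdf, np.asarray(x, dtype=float))

# Hypothetical discrete distribution: x-coordinates and a weight for each one.
xs = np.arange(10)
ws = np.array([1, 2, 5, 9, 12, 9, 5, 3, 2, 1], dtype=float)
print(percentile_x(xs, ws, 20), percentile_x(xs, ws, 80))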
