What is the best way of getting random numbers in NumPy? - python

I want to generate random numbers in the range -1, 1 and want each one to have equal probability of being generated. I.e. I don't want the extremes to be less likely to come up. What is the best way of doing this?
So far, I have used:
2 * numpy.random.rand() - 1
and also:
2 * numpy.random.random_sample() - 1

Your approach is fine. An alternative is to use the function numpy.random.uniform():
>>> numpy.random.uniform(-1, 1, size=10)
array([-0.92592953, -0.6045348 , -0.52860837,  0.00321798,  0.16050848,
       -0.50421058,  0.06754615,  0.46329675, -0.40952318,  0.49804386])
Regarding the probability of the extremes: for idealised, continuous random numbers, the probability of getting one of the extremes would be 0. Since floating-point numbers are a discretisation of the continuous real numbers, in reality there is some positive probability of getting some of the extremes. This is a form of discretisation error, and it is almost certain that this error will be dwarfed by other errors in your simulation. Stop worrying!
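As a quick empirical sanity check of that point (illustrative only), you can draw a large sample and count how often an endpoint actually shows up:
import numpy as np

# uniform(-1, 1) samples the half-open interval [-1, 1), so exactly +1.0 never
# occurs; hitting exactly -1.0 is possible in principle but vanishingly rare
samples = np.random.uniform(-1, 1, size=10_000_000)
print((samples == -1.0).sum(), (samples == 1.0).sum())  # almost certainly prints: 0 0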

Note that numpy.random.rand lets you generate multiple samples from a uniform distribution in a single call:
>>> np.random.rand(5)
array([ 0.69093485, 0.24590705, 0.02013208, 0.06921124, 0.73329277])
It can also generate samples in a given shape:
>>> np.random.rand(3,2)
array([[ 0.14022471,  0.96360618],
       [ 0.37601032,  0.25528411],
       [ 0.49313049,  0.94909878]])
As you said, uniformly distributed random numbers on [-1, 1) can be generated with:
>>> 2 * np.random.rand(5) - 1
array([ 0.86704088, -0.65406928, -0.02814943, 0.74080741, -0.14416581])

From the documentation for numpy.random.random_sample:
Results are from the “continuous uniform” distribution over the stated interval. To sample Unif[a, b), b > a, multiply the output of random_sample by (b - a) and add a:
(b - a) * random_sample() + a
Per Sven Marnach's answer, the documentation probably needs updating to reference numpy.random.uniform.
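As an aside, if you are on NumPy 1.17 or newer, the Generator API offers the same distribution; a minimal sketch:
import numpy as np

rng = np.random.default_rng()          # new-style generator (NumPy 1.17+)
samples = rng.uniform(-1, 1, size=10)  # uniform on [-1, 1), like the examples above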

To ensure that both extremes of the range [-1, 1] can occur, I randomly generate a numpy array of integers in the range [0, 200000001), i.e. 0 up to and including 200000000. The value of that upper bound depends on the numpy data type you ultimately want; here I use numpy float64, the default type for numpy arrays. I then divide the array by 100000000 to obtain floats and subtract one. The code:
>>> import numpy as np
>>> number = ((np.random.randint(low=0, high=200000001, size=5)) / 100000000) - 1
>>> print(number)
[-0.65960772 0.30378946 -0.05171788 -0.40737182 0.12998227]
Make sure not to transform these numpy floats to python floats to avoid rounding errors.
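A small illustrative check of why high=200000001 is used (the three integers below are just example values showing the endpoint mapping):
import numpy as np

# randint's high bound is exclusive, so 0 and 200000000 are both reachable;
# after dividing by 100000000 and subtracting 1 they map to -1.0 and +1.0
ints = np.array([0, 100000000, 200000000])
print(ints / 100000000 - 1)  # [-1.  0.  1.]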

Related

Min/max scaling with additional points

I'm trying to normalize an array within a range, e.g. [10,100]
But I also want to manually specify additional points in my result array, for example:
num = [1,2,3,4,5,6,7,8]
num_expected = [min(num), 5, max(num)]
expected_range = [10, 20, 100]
result_array = normalize(num, num_expected, expected_range)
Intended results:
Values from 1-5 are normalized to range (10,20].
5 in num array is mapped to 20 in expected range.
Values from 6-8 are normalized to range (20,100].
I know I can do it by normalizing the array twice, but I might have many additional points to add. I was wondering if there's any built-in function in numpy or scipy to do this?
I've checked MinMaxScaler in sklearn, but did not find the functionality I want.
Thanks!
Linear interpolation will do exactly what you want:
import scipy.interpolate
interp = scipy.interpolate.interp1d(num_expected, expected_range)
Then just pass numbers or arrays of numbers that you want to interpolate:
In [20]: interp(range(1, 9))
Out[20]:
array([ 10.        ,  12.5       ,  15.        ,  17.5       ,
        20.        ,  46.66666667,  73.33333333, 100.        ])
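Putting it together with the names from the question, a self-contained sketch of the same idea:
import scipy.interpolate

num = [1, 2, 3, 4, 5, 6, 7, 8]
num_expected = [min(num), 5, max(num)]   # anchor points in the input
expected_range = [10, 20, 100]           # where those anchors should land

interp = scipy.interpolate.interp1d(num_expected, expected_range)
result_array = interp(num)               # piecewise-linear map through the anchors
print(result_array)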

How to efficiently round only one column of numpy array to nearest 0.5?

Is there an efficient way of rounding only one column of a numpy array? I.e. I want the numbers to round to the nearest 0.5, which can be done with round(number * 2.0) / 2.0.
Assume I have the numpy array tmp and I aim at rounding the third column. Here is what I tried:
For just rounding to certain decimals, I could use
tmp[:,2] = np.around(tmp[:,2],1)
But that's not what I want.
I define a function and try to apply along axis:
def roundToHalf(number):
    return round(number * 2.0) / 2.0

tmp[:,2] = np.apply_along_axis(roundToHalf, 0, tmp[:,2])
or
tmp[:,2] = roundToHalf(tmp[:,2])
This doesn't work because I get an error:
*** TypeError: type numpy.ndarray doesn't define __round__ method
In the worst case, I would just go with a for loop. But I hope you guys can help me to find a smoother solution.
The problem is that you wrote the function to handle a single number, not an array. You can use numpy's around to round an entire array. Your function would then be
import numpy as np
def roundToHalf(array):
    return np.around(array * 2.0) / 2.0
and if you input a numpy array it should work. Example below
In [24]: roundToHalf(np.asarray([3.6,3.8,3.3,3.1]))
Out[24]: array([3.5, 4. , 3.5, 3. ])
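Applied in place to the third column of tmp, as in the question (the tmp values here are made up for illustration):
import numpy as np

def roundToHalf(array):
    return np.around(array * 2.0) / 2.0

tmp = np.array([[1.0, 2.0, 3.61],
                [4.0, 5.0, 3.34]])
tmp[:, 2] = roundToHalf(tmp[:, 2])  # only the third column is rounded to the nearest 0.5
print(tmp)                          # third column becomes 3.5 in both rows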
You can apply np.vectorize() to your function roundToHalf() so that it can be applied to a numpy array:
roundToHalf_vect = np.vectorize(roundToHalf)
tmp[:,2] = roundToHalf_vect(tmp[:,2])

How to use numpy to calculate mean and standard deviation of an irregular shaped array

I have a numpy array that contains many samples of varying length:
Samples = np.array([[1001, 1002, 1003],
                    ...,
                    [1001, 1002]])
I want to (elementwise) subtract the mean of the array then divide by the standard deviation of the array. Something like:
newSamples = (Samples-np.mean(Samples))/np.std(Samples)
Except that doesn't work for irregularly shaped arrays:
np.mean(Samples) causes
unsupported operand type(s) for /: 'list' and 'int'
due to, I assume, numpy having fixed a static size for each axis and then being unable to handle a sample of a different size. What is an approach to solve this using numpy?
Example input:
Sample = np.array([[1, 2, 3],
                   [1, 2]])
After subtracting the mean and dividing by the standard deviation:
Sample = array([[-1.06904497,  0.26726124,  1.60356745],
                [-1.06904497,  0.26726124]])
Don't make ragged arrays. Just don't. Numpy can't do much with them, and any code you might make for them will always be unreliable and slow because numpy doesn't work that way. It turns them into object dtypes:
Sample
array([[1, 2, 3], [1, 2]], dtype=object)
Which almost no numpy functions work on. In this case those objects are list objects, which makes your code even more confusing, as you either have to switch between list and ndarray methods or stick to list-safe numpy methods. This is a recipe for disaster, as anyone noodling around with the code later (even yourself, if you forget) will be dancing in a minefield.
There are two things you can do with your data to make things work better.
The first method is to index and flatten.
i = np.cumsum(np.array([len(x) for x in Sample]))
flat_sample = np.hstack(Sample)
This preserves the index of the end of each sample in i, while keeping the samples in a 1D array.
The other method is to pad one dimension with np.nan and use nan-safe functions:
m = np.array([len(x) for x in Sample]).max()
nan_sample = np.array([x + [np.nan] * (m - len(x)) for x in Sample])
So to do your calculations, you can use flat_sample and proceed similarly to the above:
new_flat_sample = (flat_sample - np.mean(flat_sample)) / np.std(flat_sample)
and use i to recreate your original array (or a list of arrays, which I recommend; see np.split):
new_list_sample = np.split(new_flat_sample, i[:-1])
[array([-1.06904497,  0.26726124,  1.60356745]),
 array([-1.06904497,  0.26726124])]
Or use nan_sample, but you will need to replace np.mean and np.std with np.nanmean and np.nanstd:
new_nan_sample = (nan_sample - np.nanmean(nan_sample)) / np.nanstd(nan_sample)
array([[-1.06904497,  0.26726124,  1.60356745],
       [-1.06904497,  0.26726124,         nan]])
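Putting the flatten-and-split route together in one runnable snippet (same names as above, keeping the ragged data as a plain list of lists):
import numpy as np

Sample = [[1, 2, 3], [1, 2]]

i = np.cumsum([len(x) for x in Sample])   # end index of each sample: [3, 5]
flat_sample = np.hstack(Sample)           # [1, 2, 3, 1, 2]

new_flat_sample = (flat_sample - np.mean(flat_sample)) / np.std(flat_sample)
new_list_sample = np.split(new_flat_sample, i[:-1])
print(new_list_sample)
# [array([-1.06904497,  0.26726124,  1.60356745]), array([-1.06904497,  0.26726124])]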
@MichaelHackman (following the comment remark):
That's weird, because when I compute the overall std and mean and then apply them, I obtain a different result (see the code below).
import numpy as np
Samples = np.array([[1, 2, 3],
                    [1, 2]])
c = np.hstack(Samples)  # gives [1, 2, 3, 1, 2]
mean, std = np.mean(c), np.std(c)
newSamples = np.asarray([(np.array(xi)-mean)/std for xi in Samples])
print(newSamples)
# [array([-1.06904497, 0.26726124, 1.60356745]), array([-1.06904497, 0.26726124])]
Edit: added np.asarray() and moved the mean/std computation outside the loop, following Imanol Luengo's excellent comments (thanks!)

How to test for closeness in angular quantities

I'm trying to write a unit test where the result should be an array of arrays of zero degrees. Using np.testing.assert_allclose results in the following failure:
E AssertionError:
E Not equal to tolerance rtol=1e-07, atol=0.000277778
E
E (mismatch 100.0%)
E x: array([[ 3.600000e+02],
E [ 3.155310e-10]])
E y: array([[0],
E [0]])
What's clearly happening is that the code is working ([[360], [3e-10]] is close enough to [[0], [0]] for angular quantities, as far as I'm concerned), but np.testing.assert_allclose doesn't realize that 0 ≅ 360.
Is there a way to use numpy's testing framework for comparisons where I don't care if the values are off by multiples of 360?
In this particular case, printing the first element of the array with np.set_printoptions(precision=30) gives me 359.999999999823955931788077577949, so this isn't a case that can just be normalized to be between 0 and 360.
This is not a package I maintain, so I'd like to not include other dependencies besides astropy and numpy.
(edited answer, previous version was wrong)
Use e.g. this to reduce your values to the required range:
>>> def _h(x, a):
...     xx = np.mod(x, a)
...     return np.minimum(xx, np.abs(a - xx))
Then
>>> xx = np.asarray([1, -1, 359, 361, 360*3+1, -8*360 + 2])
>>> _h(xx, 360)
array([1, 1, 1, 1, 1, 2])
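A short check tying this back to the failing assertion in the question (the array values are copied from the question's error output, the tolerance from its atol):
import numpy as np

def _h(x, a):
    xx = np.mod(x, a)
    return np.minimum(xx, np.abs(a - xx))

x = np.array([[3.600000e+02],
              [3.155310e-10]])
# distance on the circle from 0 degrees; this passes with the question's tolerance
np.testing.assert_allclose(_h(x, 360), 0, atol=0.000277778)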
Given that all the numbers you want to test for closeness on a circle are in an ndarray named a, then
np.allclose(np.fmod(a+180, 360)-180,0, atol=mytol)
or, even simpler,
np.allclose(np.fmod(a+180, 360),180, atol=mytol)
is all you need (note that the value 180 is quite arbitrary; the point is just to move the comparison away from 0, a.k.a. 360).
Edit
I had deleted my answer because of a flaw that was pointed out to me in a comment by ev-br, but later I changed my mind because (thank you, ev-br) I saw the light.
One wants to test whether a point on a circle, identified by an angle in degrees, is close to the point identified by the angle 0. First, the distance along the circumference D(0, theta) is equal to D(0, -theta), hence we can compare the absolute values of the angles.
The test I proposed above is valid, or at least I think so, for any positive value of theta.
If I use the above test on the absolute values of the angles to be tested, everything should be OK, shouldn't it? Here follows a bit of testing:
In [1]: import numpy as np
In [2]: a = np.array([0, 1e-5,-1e-7,360.1,-360.1,359.9,-359.9,3600.1,-3600.1,3599.9,-3599.9])
In [3]: np.allclose(np.mod(np.abs(a)+180, 360), 180, atol=0.2)
Out[3]: True

Why don't scipy.stats.mstats.pearsonr results agree with scipy.stats.pearsonr?

I expected scipy.stats.mstats.pearsonr with masked-array inputs to give the same results as scipy.stats.pearsonr applied to the unmasked values of the input data, but it doesn't:
from pylab import randn,rand
from numpy import ma
import scipy.stats
# Normally distributed data with noise
x=ma.masked_array(randn(10000),mask=False)
y=x+randn(10000)*0.6
# Randomly mask one tenth of each of x and y
x[rand(10000)<0.1]=ma.masked
y[rand(10000)<0.1]=ma.masked
# Identify indices for which both data are unmasked
bothok=((~x.mask)*(~y.mask))
# print results of both functions, passing only the data where
# both x and y are good to scipy.stats
print "scipy.stats.mstats.pearsonr:", scipy.stats.mstats.pearsonr(x,y)[0]
print "scipy.stats.pearsonr:", scipy.stats.pearsonr(x[bothok].data,y[bothok].data)[0]
The answer will vary a little bit each time you do this, but the values differed by about 0.1 for me, and the bigger the masked fraction, the bigger the disagreement.
I noticed that if the same mask was used for both x and y, the results are the same for both functions, i.e.:
mask=rand(10000)<0.1
x[mask]=ma.masked
y[mask]=ma.masked
...
Is this a bug, or am I expected to precondition the input data to make sure the masks in both x and y are identical (surely not)?
I'm using numpy version '1.8.0' and scipy version '0.11.0b1'
This looks like a bug in scipy.stats.mstats.pearsonr. It appears that the values in x and y are expected to be paired by index, so if one is masked, the other should be ignored. That is, if x and y look like (using -- for a masked value):
x = [1, --, 3, 4, 5]
y = [9, 8, --, 6, 5]
then both (--, 8) and (3, --) are to be ignored, and the result should be the same as scipy.stats.pearsonr([1, 4, 5], [9, 6, 5]).
The bug in the mstats version is that the code to compute the means of x and y does not use the common mask.
I created an issue for this on the scipy github site: https://github.com/scipy/scipy/issues/3645
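A small sketch of the pairwise-complete behaviour described above, using the masked example from this answer (the numbers are illustrative):
from numpy import ma
from scipy.stats import pearsonr

x = ma.array([1, 2, 3, 4, 5], mask=[0, 1, 0, 0, 0])  # 1, --, 3, 4, 5
y = ma.array([9, 8, 7, 6, 5], mask=[0, 0, 1, 0, 0])  # 9, 8, --, 6, 5

# keep only the pairs where neither value is masked
ok = ~(ma.getmaskarray(x) | ma.getmaskarray(y))
r, p = pearsonr(x[ok].data, y[ok].data)              # same as pearsonr([1, 4, 5], [9, 6, 5])
print(r)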
We have (at least) two options for missing-value handling: complete case deletion and pairwise deletion.
In your use of scipy.stats.pearsonr you completely drop cases where there is a missing value in any of the variables.
numpy.ma.corrcoef gives the same results.
Checking the source of scipy.stats.mstats.pearsonr, it doesn't do complete case deletion for calculating the variance or the mean.
>>> xm = x - x.mean(0)
>>> ym = y - y.mean(0)
>>> np.ma.dot(xm, ym) / np.sqrt(np.ma.dot(xm, xm) * np.ma.dot(ym, ym))
0.7731167378113557
>>> scipy.stats.mstats.pearsonr(x,y)[0]
0.77311673781135637
However, the difference between complete and pairwise case deletion on mean and standard deviations is small.
The main discrepancy seems to come from the missing correction for the different number of non-missing elements. Ignoring degrees-of-freedom corrections, I get
>>> np.ma.dot(xm, ym) / bothok.sum() / \
...     np.sqrt(np.ma.dot(xm, xm) / (~xm.mask).sum() * np.ma.dot(ym, ym) / (~ym.mask).sum())
0.85855728319303393
which is close to the complete case deletion case.
