I'm trying to normalize an array to a given range, e.g. [10, 100].
But I also want to manually specify additional anchor points in my result array, for example:
num = [1,2,3,4,5,6,7,8]
num_expected = [min(num), 5, max(num)]
expected_range = [10, 20, 100]
result_array = normalize(num, num_expected, expected_range)
Intended results:
Values from 1-5 are normalized to the range [10, 20].
5 in the num array is mapped to 20 in the expected range.
Values from 6-8 are normalized to the range (20, 100].
I know I can do it by normalizing the array twice, but I might have many additional points to add. I was wondering if there's any built-in function in numpy or scipy to do this?
I've checked MinMaxScaler in sklearn, but did not find the functionality I want.
Thanks!
Linear interpolation will do exactly what you want:
import scipy.interpolate
interp = scipy.interpolate.interp1d(num_expected, expected_range)
Then just pass numbers or arrays of numbers that you want to interpolate:
In [20]: interp(range(1, 9))
Out[20]:
array([ 10.        ,  12.5       ,  15.        ,  17.5       ,
        20.        ,  46.66666667,  73.33333333, 100.        ])
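If you prefer not to build an interpolator object, np.interp does the same piecewise-linear mapping in one call. A minimal self-contained sketch of the same idea (note that np.interp expects the anchor points to be sorted in increasing order):
import numpy as np
num = np.array([1, 2, 3, 4, 5, 6, 7, 8])
num_expected = [num.min(), 5, num.max()]  # anchor points in the input: 1, 5, 8
expected_range = [10, 20, 100]            # values those anchors should map to
result_array = np.interp(num, num_expected, expected_range)
# result_array matches the interp1d output above: 10, 12.5, ..., 20, 46.67, 73.33, 100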
Apologies in advance for the potentially misleading title. I could not think of a way to properly word the problem without an illustrative example.
I have some data array (e.g.):
x = np.array([2,2,2,3,3,3,4,4,4,1,1,2,2,3,3])
and a corresponding array of equal length which indicates which elements of x are grouped:
y = np.array([0,0,0,0,0,0,0,0,0,1,1,1,1,1,1])
In this example, there are two groupings in x: [2,2,2,3,3,3,4,4,4] where y=0; and [1,1,2,2,3,3] where y=1. I want to obtain a statistic on all elements of x where y is 0, then 1. I would like this to be extendable to large arrays with many groupings. y is always ordered from lowest to highest AND is always sequentially increasing without any missing integers between the min and max. For example, y could be np.array([0,0,1,2,2,2,2,3,3,3]) for some x array of the same length, but not y = np.array([0,0,2,2,2,2,2,3,3,3]), as this has no ones.
I can do this by brute force quite easily for this example.
import numpy as np
x = np.array([2,2,2,3,3,3,4,4,4,1,1,2,2,3,3])
y = np.array([0,0,0,0,0,0,0,0,0,1,1,1,1,1,1])
y_max = np.max(y)
stat_min = np.zeros(y_max+1)
stat_sum = np.zeros(y_max+1)
for i in np.arange(y_max+1):
stat_min[i] = np.min(x[y==i])
stat_sum[i] = np.sum(x[y==i])
print(stat_min)
print(stat_sum)
Gives: [2. 1.] and [27. 12.] for the minimum and sum statistics for each grouping, respectively. I need a way to make this efficient for large numbers of groupings and where the arrays are very large (> 1 million elements).
EDIT
A bit better with a list comprehension:
import numpy as np
x = np.array([2,2,2,3,3,3,4,4,4,1,1,2,2,3,3])
y = np.array([0,0,0,0,0,0,0,0,0,1,1,1,1,1,1])
y_max = np.max(y)
stat_min = np.array([np.min(x[y==i]) for i in range(y_max+1)])
stat_sum = np.array([np.sum(x[y==i]) for i in range(y_max+1)])
print(stat_min)
print(stat_sum)
You can put your arrays into a DataFrame, then use groupby and its aggregation methods: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
import pandas as pd
df = pd.DataFrame({'x': x, 'y': y})
mins = df.groupby('y').min()
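If you want both statistics at once, a sketch along the same lines (reusing the x and y arrays from the question) could use agg:
import numpy as np
import pandas as pd
x = np.array([2,2,2,3,3,3,4,4,4,1,1,2,2,3,3])
y = np.array([0,0,0,0,0,0,0,0,0,1,1,1,1,1,1])
df = pd.DataFrame({'x': x, 'y': y})
stats = df.groupby('y')['x'].agg(['min', 'sum'])  # one row per group in y
print(stats['min'].to_numpy())  # [2 1]
print(stats['sum'].to_numpy())  # [27 12]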
I have the following code to calculate the correlation coefficient using two different ways of generating number series. It fails to work for the first way (corr_coeff_pearson) but works for the second way (corr_coeff_pearson_1). Why is this so? In both cases the variables are of class 'numpy.ndarray'.
import numpy as np
np.random.seed(1000)
inp_vct_lngt = 5
X = 2 * np.random.rand(inp_vct_lngt, 1)
y = 4 + 3*X + np.random.randn(inp_vct_lngt, 1)
print(type(X))
corr_coeff_pearson=0
corr_coeff_pearson = np.corrcoef(X,y)
print("Pearson Correlation:")
print(corr_coeff_pearson)
X_1 = np.random.randint(0,50,5)
y_1 = X_1 + np.random.normal(0,10,5)
print(type(X_1))
corr_coeff_pearson_1 = np.corrcoef(X_1,y_1)
print("Pearson Correlation:")
print(corr_coeff_pearson_1)
Is there some way to "convert" the number in the first way of generating the series that I am missing?
The issue is that X and y are 2 dimensional:
>>> X
array([[1.9330627 ],
[0.19204405],
[0.21168505],
[0.65018234],
[0.83079548]])
>>> y
array([[8.60619212],
[6.09210226],
[5.33097283],
[5.71649684],
[5.18771916]])
So corrcoef is thinking
Each row of x represents a variable, and each column a single observation of all those variables
(quoted from the docs)
What you can do is either flatten the two to one dimension:
>>> np.corrcoef(X.flatten(),y.flatten())
array([[1. , 0.84196446],
[0.84196446, 1. ]])
Or use rowvar=False:
>>> np.corrcoef(X,y,rowvar=False)
array([[1. , 0.84196446],
[0.84196446, 1. ]])
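Either way, corrcoef returns the full 2x2 correlation matrix, so if you only need the scalar coefficient you can pick out the off-diagonal entry:
>>> np.corrcoef(X.flatten(), y.flatten())[0, 1]  # ~0.84196446, the off-diagonal entry shown above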
I am trying to generate sparse 3 dimensional nonparametric datasets in the range 0-1, where the dataset should contain zeros as well. I tried to generate this using:
training_matrix = numpy.random.rand(3000, 3)
but none of the rows come out as exact zeros (e.g. 0.00000).
We start by creating an array of zeros of nrows rows by 3 columns:
import numpy as np
nrows = 3000 # total number of rows
training_matrix = np.zeros((nrows, 3))
Then we randomly draw (without replacement) nz integers from range(nrows). These numbers are the indices of the rows with nonzero data. The sparsity of training_matrix is determined by nz. You can adjust its value to fit your needs (in this example sparsity is set to 50%):
nz = 1500 # number of rows with nonzero data
indices = np.random.choice(nrows, nz, replace=False)
And finally, we populate the selected rows with random numbers through advanced indexing:
training_matrix[indices, :] = np.random.rand(nz, 3)
This is what you get by running the code above:
>>> print(training_matrix)
[[ 0.96088615 0.81550102 0.21647398]
[ 0. 0. 0. ]
[ 0.55381338 0.66734065 0.66437689]
...,
[ 0. 0. 0. ]
[ 0.03182902 0.85349965 0.54315029]
[ 0.71628805 0.2242126 0.02481218]]
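If you want to sanity-check the sparsity, one quick way (assuming exact zeros only occur in the rows we left untouched, which is essentially certain with random floats) is to count the all-zero rows:
>>> np.count_nonzero((training_matrix == 0).all(axis=1))  # nrows - nz = 1500 here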
If you truncate each value at 5 decimal places, the probability that a single value comes out as exactly 0.00000 is 1/10^5 = 0.00001. Even with 3000*3 = 9000 values, the expected number of such zeros is only about 0.09, so it is not surprising that none appear. Something else you can try for your peace of mind is to generate the random numbers and truncate them at a certain number of decimal places, e.g. 5.
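A minimal sketch of that truncation idea (assuming truncation at 5 decimal places; an entry then becomes exactly 0.0 whenever the underlying draw is below 1e-5):
import numpy as np
training_matrix = np.trunc(np.random.rand(3000, 3) * 1e5) / 1e5  # keep 5 decimal places
# Each entry is now exactly 0.0 with probability 1e-5, so zeros are still rare.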
I am looking to create an array via numpy that generates equally spaced values from interval to interval based on the values in a given array.
I understand there is:
np.linspace(min, max, num_elements)
but what I am looking for is imagine you have a set of values:
arr = np.array([1, 2, 4, 6, 7, 8, 12, 10])
When I do:
#some numpy function
arr = np.somefunction(arr, 16)
>>> arr
array([1, 1.12, 2, 2.5, 4, 4.5, ...])
# new array with 16 elements including all the numbers from previous
# array with generated numbers to 'evenly space them out'
So I am looking for the same functionality as linspace(), but taking all the elements of an array and creating another array with the desired number of elements, evenly spaced between the given values. I hope I am making myself clear on this.
What I am trying to actually do with this setup is take existing x,y data and expand it to have more 'control points' in a sense, so I can do calculations in the long run.
Thank you in advance.
xp = np.arange(len(arr)) # X coordinates of arr
targets = np.arange(0, len(arr)-0.5, 0.5) # X coordinates desired
np.interp(targets, xp, arr)
The above does simple linear interpolation of 8 data points at 0.5 spacing for a total of 15 points (because of fenceposting):
array([ 1. , 1.5, 2. , 3. , 4. , 5. , 6. , 6.5, 7. ,
7.5, 8. , 10. , 12. , 11. , 10. ])
There are some additional options you can use in numpy.interp to tweak the behavior. You can also generate targets in different ways if you want.
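For example, if you want exactly 16 output points as in the question, one variation is to build the targets with np.linspace (note that with 16 points over 8 samples the target grid no longer lands exactly on every original position):
targets = np.linspace(0, len(arr) - 1, 16)  # 16 evenly spaced positions over the same x range
np.interp(targets, xp, arr)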
I want to generate random numbers in the range -1, 1 and want each one to have equal probability of being generated. I.e. I don't want the extremes to be less likely to come up. What is the best way of doing this?
So far, I have used:
2 * numpy.random.rand() - 1
and also:
2 * numpy.random.random_sample() - 1
Your approach is fine. An alternative is to use the function numpy.random.uniform():
>>> numpy.random.uniform(-1, 1, size=10)
array([-0.92592953, -0.6045348 , -0.52860837, 0.00321798, 0.16050848,
-0.50421058, 0.06754615, 0.46329675, -0.40952318, 0.49804386])
Regarding the probability for the extremes: for idealised, continuous random numbers, the probability of getting one of the extremes would be 0. Since floating point numbers are a discretisation of the continuous real numbers, in reality there is some positive probability of getting some of the extremes. This is a form of discretisation error, and it is almost certain that this error will be dwarfed by other errors in your simulation. Stop worrying!
Note that numpy.random.rand can generate multiple samples from a uniform distribution in a single call:
>>> np.random.rand(5)
array([ 0.69093485, 0.24590705, 0.02013208, 0.06921124, 0.73329277])
It can also generate samples in a given shape:
>>> np.random.rand(3,2)
array([[ 0.14022471, 0.96360618],
[ 0.37601032, 0.25528411],
[ 0.49313049, 0.94909878]])
As you said, uniformly distributed random numbers in [-1, 1) can be generated with:
>>> 2 * np.random.rand(5) - 1
array([ 0.86704088, -0.65406928, -0.02814943, 0.74080741, -0.14416581])
From the documentation for numpy.random.random_sample:
Results are from the “continuous uniform” distribution over the stated interval. To sample Unif[a, b), b > a, multiply the output of random_sample by (b-a) and add a:
(b - a) * random_sample() + a
Per Sven Marnach's answer, the documentation probably needs updating to reference numpy.random.uniform.
To ensure that the extremes of the range [-1, 1] can actually occur, I randomly generate a numpy array of integers in the range [0, 200000001). The value of the latter integer depends on the final numpy data type that is desired. Here, I take numpy float64, which is the default type used for numpy arrays. Then I divide the numpy array by 100000000 to generate floats and subtract one. Code for this is:
>>> import numpy as np
>>> number = ((np.random.randint(low=0, high=200000001, size=5)) / 100000000) - 1
>>> print(number)
[-0.65960772 0.30378946 -0.05171788 -0.40737182 0.12998227]
Make sure not to convert these numpy floats to Python floats, to avoid rounding errors.