Generating random sparse data in the range 0-1 - python

I am trying to generate sparse 3 dimensional nonparametric datasets in the range 0-1, where the dataset should contain zeros as well. I tried to generate this using:
training_matrix = numpy.random.rand(3000, 3)
but it never prints 0.00000 in any of the rows.

We start by creating an array of zeros of nrows rows by 3 columns:
import numpy as np
nrows = 3000 # total number of rows
training_matrix = np.zeros((nrows, 3))
Then we randomly draw (without replacement) nz integers from range(nrows). These numbers are the indices of the rows with nonzero data. The sparsity of training_matrix is determined by nz. You can adjust its value to fit your needs (in this example sparsity is set to 50%):
nz = 1500 # number of rows with nonzero data
indices = np.random.choice(nrows, nz, replace=False)
And finally, we populate the selected rows with random numbers through advanced indexing:
training_matrix[indices, :] = np.random.rand(nz, 3)
This is what you get by running the code above:
>>> print(training_matrix)
[[ 0.96088615  0.81550102  0.21647398]
 [ 0.          0.          0.        ]
 [ 0.55381338  0.66734065  0.66437689]
 ...,
 [ 0.          0.          0.        ]
 [ 0.03182902  0.85349965  0.54315029]
 [ 0.71628805  0.2242126   0.02481218]]

A value only prints as 0.00000 when all five displayed decimal digits are zero, and the probability of that for a single draw is 1/10^5 = 0.00001. Even with 3000*3 = 9000 values, the chance of seeing one is small. Something else you can do for your peace of mind is to generate the random numbers and truncate them at a certain precision, e.g. 5 decimal places.
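If you do want values that actually print as 0.00000, here is a minimal sketch of that truncation idea (rounding is shown; np.trunc gives genuine truncation):
import numpy as np
# Round (or truncate) to 5 decimal places so that tiny draws can print as 0.00000.
training_matrix = np.round(np.random.rand(3000, 3), 5)
# Truncation instead of rounding:
# training_matrix = np.trunc(np.random.rand(3000, 3) * 1e5) / 1e5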

Related

NumPy array with largest value on diagonal and other values shuffled

I am trying to create a square NumPy (or PyTorch, since PyTorch code can be turned into NumPy with minimal effort) matrix which has the following property: given a set of values, the diagonal elements in each row have the largest value and the other values are randomly shuffled for the other positions.
For example, if I have [1, 2, 3, 4], a possible desired output is:
[[4, 3, 1, 2],
[1, 4, 3, 2],
[2, 1, 4, 3],
[2, 3, 1, 4]]
There can be (several) other possible outputs, as long as the diagonal elements are the largest value (4 in this case) and the off-diagonal elements in each row contain the other values but shuffled.
A hacky/inefficient way of doing this could be first creating a square matrix (4x4) of zeros and putting the largest value (4) in all the diagonal positions, and then traversing the matrix row by row, where for each row i, populate the elements except index i with shuffled remaining values (shuffled versions of [1, 2, 3]). This would be very slow as the matrix size increases. Is there a cleaner/faster/Pythonic way of doing it? Thank you.
First you can generate a randomized array along the first axis with np.random.shuffle(); then a (not so easy to understand) mathematical trick shifts each row:
import numpy as np
from numpy.fft import fft, ifft
# First create your randomized array with np.random.shuffle()
x = np.array([[1, 2, 3, 4],
              [2, 4, 3, 1],
              [4, 1, 2, 3],
              [2, 3, 1, 4]])
# We use np.where to determine in which column each 4 sits.
_, s = np.where(x == 4)
# We compute the left shift that needs to be applied to each row in order to get each 4 on the diagonal
s = s - np.r_[0:x.shape[0]]
# And here is the trick: we can use the fast Fourier transform to left-shift each row by a given amount:
L = np.real(ifft(fft(x, axis=1)*np.exp(2*1j*np.pi/x.shape[1]*s[:, None]*np.r_[0:x.shape[1]][None, :]), axis=1).round())
# Note that we could also use a right shift; we simply have to negate the exponent of the exponential:
# np.exp(-2*1j*np.pi...
And we obtain the following matrix:
[[4. 1. 2. 3.]
[2. 4. 1. 3.]
[2. 3. 4. 1.]
[3. 2. 1. 4.]]
No hidden for loop, only pure linear algebra.
To give you an idea, it takes only a few milliseconds for a 1000x1000 matrix on my computer and ~20 s for a 10000x10000 matrix.
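If you would rather avoid the FFT machinery entirely, here is an alternative sketch (my own, not part of the answer above) that shuffles the non-maximal values independently per row via an argsort of random keys and drops them into the off-diagonal positions with a boolean mask:
import numpy as np
values = np.array([1, 2, 3, 4])
n = len(values)
rest = np.sort(values)[:-1]                       # every value except the maximum
# One independent permutation of `rest` per row: argsort of random keys.
perm = np.argsort(np.random.rand(n, n - 1), axis=1)
shuffled = rest[perm]                             # shape (n, n-1)
out = np.empty((n, n), dtype=values.dtype)
out[~np.eye(n, dtype=bool)] = shuffled.ravel()    # fill off-diagonal slots row by row
np.fill_diagonal(out, values.max())
print(out)
Like the FFT version it is fully vectorized, but it stays in integer arithmetic and needs no rounding.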

Min/max scaling with additional points

I'm trying to normalize an array within a range, e.g. [10,100]
But I also want to manually specify additional points in my result array, for example:
num = [1,2,3,4,5,6,7,8]
num_expected = [min(num), 5, max(num)]
expected_range = [10, 20, 100]
result_array = normalize(num, num_expected, expected_range)
Intended results:
Values from 1-5 are normalized to range (10,20].
5 in num array is mapped to 20 in expected range.
Values from 6-8 are normalized to range (20,100].
I know I can do it by normalizing the array twice, but I might have many additional points to add. I was wondering if there's any built-in function in numpy or scipy to do this?
I've checked MinMaxScaler in sklearn, but did not find the functionality I want.
Thanks!
Linear interpolation will do exactly what you want:
import scipy.interpolate
interp = scipy.interpolate.interp1d(num_expected, expected_range)
Then just pass numbers or arrays of numbers that you want to interpolate:
In [20]: interp(range(1, 9))
Out[20]:
array([ 10.        ,  12.5       ,  15.        ,  17.5       ,
        20.        ,  46.66666667,  73.33333333, 100.        ])
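For what it's worth, the same piecewise-linear mapping is also available directly in NumPy through np.interp, so no SciPy dependency is needed (a small sketch using the arrays from the question):
import numpy as np
num = [1, 2, 3, 4, 5, 6, 7, 8]
num_expected = [min(num), 5, max(num)]   # [1, 5, 8]
expected_range = [10, 20, 100]
result = np.interp(num, num_expected, expected_range)
# result matches the interp1d output above:
# [ 10.  12.5  15.  17.5  20.  46.66666667  73.33333333  100. ]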

Correlate a single time series with a large number of time series

I have a large number (M) of time series, each with N time points, stored in an MxN matrix. Then I also have a separate time series with N time points that I would like to correlate with all the time series in the matrix.
An easy solution is to go through the matrix row by row and run numpy.corrcoef. However, I was wondering if there is a faster or more concise way to do this?
Let's use this formula for the Pearson correlation coefficient:
r = sum((A - mean(A)) * (B - mean(B))) / sqrt(sum((A - mean(A))**2) * sum((B - mean(B))**2))
You can implement this with X as the M x N array and Y as the separate time series of N elements to be correlated with X. So, with X and Y referred to as A and B respectively, a vectorized implementation would look something like this -
import numpy as np
# Rowwise mean of input arrays & subtract from input arrays themselves
A_mA = A - A.mean(1)[:,None]
B_mB = B - B.mean()
# Sum of squares across rows
ssA = (A_mA**2).sum(1)
ssB = (B_mB**2).sum()
# Finally get corr coeff
out = np.dot(A_mA,B_mB.T).ravel()/np.sqrt(ssA*ssB)
# OR out = np.einsum('ij,j->i',A_mA,B_mB)/np.sqrt(ssA*ssB)
Verify results -
In [115]: A
Out[115]:
array([[ 0.1001229 ,  0.77201334,  0.19108671,  0.83574124],
       [ 0.23873773,  0.14254842,  0.1878178 ,  0.32542199],
       [ 0.62674274,  0.42252403,  0.52145288,  0.75656695],
       [ 0.24917321,  0.73416177,  0.40779406,  0.58225605],
       [ 0.91376553,  0.37977182,  0.38417424,  0.16035635]])
In [116]: B
Out[116]: array([ 0.18675642, 0.3073746 , 0.32381341, 0.01424491])
In [117]: out
Out[117]: array([-0.39788555, -0.95916359, -0.93824771, 0.02198139, 0.23052277])
In [118]: np.corrcoef(A[0],B), np.corrcoef(A[1],B), np.corrcoef(A[2],B)
Out[118]:
(array([[ 1.        , -0.39788555],
        [-0.39788555,  1.        ]]),
 array([[ 1.        , -0.95916359],
        [-0.95916359,  1.        ]]),
 array([[ 1.        , -0.93824771],
        [-0.93824771,  1.        ]]))
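As a self-contained sanity check (a sketch of my own, not part of the answer above), the vectorized result can be compared against a plain row-by-row np.corrcoef loop:
import numpy as np
rng = np.random.default_rng(0)
A = rng.random((5, 4))        # M x N matrix of time series
B = rng.random(4)             # the single time series of length N
A_mA = A - A.mean(axis=1, keepdims=True)
B_mB = B - B.mean()
out = A_mA @ B_mB / np.sqrt((A_mA**2).sum(axis=1) * (B_mB**2).sum())
loop = np.array([np.corrcoef(row, B)[0, 1] for row in A])
print(np.allclose(out, loop))  # True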

Representing a ragged array in numpy by padding

I have a 1-dimensional numpy array scores of scores associated with some objects. These objects belong to some disjoint groups, and all the scores of the items in the first group are first, followed by the scores of the items in the second group, etc.
I'd like to create a 2-dimensional array where each row corresponds to a group, and each entry is the score of one of its items. If all the groups are of the same size I can just do:
scores.reshape((numGroups, groupSize))
Unfortunately, my groups may be of varying size. I understand that numpy doesn't support ragged arrays, but it is fine for me if the resulting array simply pads each row with a specified value to make all rows the same length.
To make this concrete, suppose I have a set A with 3 items, B with 2 items, and C with 4 items.
scores = numpy.array([f(a[0]), f(a[1]), f(a[2]), f(b[0]), f(b[1]),
                      f(c[0]), f(c[1]), f(c[2]), f(c[3])])
rowStarts = numpy.array([0, 3, 5])
paddingValue = -1.0
scoresByGroup = groupIntoRows(scores, rowStarts, paddingValue)
The desired value of scoresByGroup would be:
[[f(a[0]), f(a[1]), f(a[2]), -1.0],
 [f(b[0]), f(b[1]), -1.0, -1.0],
 [f(c[0]), f(c[1]), f(c[2]), f(c[3])]]
Is there some numpy function or composition of functions I can use to create groupIntoRows?
Background:
This operation will be used in calculating the loss for a minibatch for a gradient descent algorithm in Theano, so that's why I need to keep it as a composition of numpy functions if possible, rather than falling back on native Python.
It's fine to assume there is some known maximum row size
The original objects being scored are vectors and the scoring function is a matrix multiplication, which is why we flatten things out in the first place. It would be possible to pad everything to the maximum item set size before doing the matrix multiplication, but the biggest set is over ten times bigger than the average set size, so this is undesirable for speed reasons.
Try this:
scores = np.random.rand(9)
row_starts = np.array([0, 3, 5])
row_ends = np.concatenate((row_starts, [len(scores)]))
lens = np.diff(row_ends)
pad_len = np.max(lens) - lens
where_to_pad = np.repeat(row_ends[1:], pad_len)
padding_value = -1.0
padded_scores = np.insert(scores, where_to_pad,
                          padding_value).reshape(-1, np.max(lens))
>>> padded_scores
array([[ 0.05878244,  0.40804443,  0.35640463, -1.        ],
       [ 0.39365072,  0.85313545, -1.        , -1.        ],
       [ 0.133687  ,  0.73651147,  0.98531828,  0.78940163]])
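Wrapped up as the groupIntoRows helper the question asks for, here is a sketch of the same np.insert composition (with integer scores so the output is easy to check by eye):
import numpy as np
def group_into_rows(scores, row_starts, padding_value):
    # Group boundaries: [start_0, ..., start_k, len(scores)]
    bounds = np.concatenate((row_starts, [len(scores)]))
    lens = np.diff(bounds)
    pad_len = lens.max() - lens
    # Flat positions where padding must be inserted to even out the rows.
    where_to_pad = np.repeat(bounds[1:], pad_len)
    return np.insert(scores, where_to_pad, padding_value).reshape(-1, lens.max())
scores = np.arange(9, dtype=float)
print(group_into_rows(scores, np.array([0, 3, 5]), -1.0))
# [[ 0.  1.  2. -1.]
#  [ 3.  4. -1. -1.]
#  [ 5.  6.  7.  8.]]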

Why do I get rows of zeros in my 2D fft?

I am trying to replicate the results from a paper.
"Two-dimensional Fourier Transform (2D-FT) in space and time along sections of constant latitude (east-west) and longitude (north-south) were used to characterize the spectrum of the simulated flux variability south of 40degS." - Lenton et al(2006)
The figures published show "the log of the variance of the 2D-FT".
I have tried to create an array consisting of the seasonal cycle of similar data as well as the noise. I have defined the noise as the original array minus the signal array.
Here is the code that I used to plot the 2D-FT of the signal array averaged in latitude:
import numpy as np
from numpy import ma
from matplotlib import pyplot as plt
from Scientific.IO.NetCDF import NetCDFFile
### input directory
indir = '/home/nicholas/data/'
### get the flux data which is in
### [time(5day ave for 10 years),latitude,longitude]
nc = NetCDFFile(indir + 'CFLX_2000_2009.nc','r')
cflux_southern_ocean = nc.variables['Cflx'][:,10:50,:]
cflux_southern_ocean = ma.masked_values(cflux_southern_ocean,1e+20) # mask land
nc.close()
cflux = cflux_southern_ocean*1e08 # change units of data from mmol/m^2/s
### create an array that consists of the seasonal signal for each pixel
year_stack = np.split(cflux, 10, axis=0)
year_stack = np.array(year_stack)
signal_array = np.tile(np.mean(year_stack, axis=0), (10, 1, 1))
signal_array = ma.masked_where(signal_array > 1e20, signal_array) # need to mask
### average the array over latitude(or longitude)
signal_time_lon = ma.mean(signal_array, axis=1)
### do a 2D Fourier Transform of the time/space image
ft = np.fft.fft2(signal_time_lon)
mgft = np.abs(ft)
ps = mgft**2
log_ps = np.log(mgft)
log_mgft = np.log(mgft)
Every second row of the ft consists entirely of zeros. Why is this?
Would it be acceptable to add a small random number to the signal to avoid this?
signal_time_lon = signal_time_lon + np.random.randint(0,9,size=(730, 182))*1e-05
EDIT: Adding images and clarifying their meaning
The output of rfft2 still appears to be a complex array. Using fftshift shifts the edges of the image to the centre; I still have a power spectrum regardless. I expect that the reason I get rows of zeros is that I have re-created the time series for each pixel. The ft[0, 0] pixel contains the mean of the signal, so ft[1, 0] corresponds to a sinusoid with one cycle over the entire signal along the rows of the starting image.
Here is the starting image, produced with the following code:
plt.pcolormesh(signal_time_lon); plt.colorbar(); plt.axis('tight')
Here is the result, produced with the following code:
ft = np.fft.rfft2(signal_time_lon)
mgft = np.abs(ft)
ps = mgft**2
log_ps = np.log1p(mgft)
plt.pcolormesh(log_ps); plt.colorbar(); plt.axis('tight')
It may not be clear in the image, but only every second row consists entirely of zeros. Every tenth pixel (log_ps[10, 0]) has a high value, while the other pixels (log_ps[2, 0], log_ps[4, 0], etc.) have very low values.
Consider the following example:
In [59]: from scipy import absolute, fft
In [60]: absolute(fft([1,2,3,4]))
Out[60]: array([ 10. , 2.82842712, 2. , 2.82842712])
In [61]: absolute(fft([1,2,3,4, 1,2,3,4]))
Out[61]:
array([ 20.        ,   0.        ,   5.65685425,   0.        ,
         4.        ,   0.        ,   5.65685425,   0.        ])
In [62]: absolute(fft([1,2,3,4, 1,2,3,4, 1,2,3,4]))
Out[62]:
array([ 30.        ,   0.        ,   0.        ,   8.48528137,
         0.        ,   0.        ,   6.        ,   0.        ,
         0.        ,   8.48528137,   0.        ,   0.        ])
If X[k] = fft(x) and Y[k] = fft([x x]), then Y[2k] = 2*X[k] for k in {0, 1, ..., N-1}, and Y is zero at all the odd indices.
Therefore, I would look into how your signal_time_lon is being tiled. That may be where the problem lies.
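The same pattern scales to more repetitions: tiling one cycle m times leaves nonzero FFT bins only at every m-th index, which is consistent with the large values you see at every tenth row. A small sketch of my own, with a made-up cycle length:
import numpy as np
one_cycle = np.random.rand(73)          # e.g. one year of 5-day averages
ten_years = np.tile(one_cycle, 10)      # 730 points, repeating exactly 10 times
spectrum = np.abs(np.fft.fft(ten_years))
# Only bins 0, 10, 20, ... are nonzero; everything else is numerically zero.
print(np.allclose(spectrum[np.arange(730) % 10 != 0], 0.0))  # True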
