Efficient Numpy multiple sampling which results in a Matrix - python

I would like to create a 2D numpy matrix such that each row is a sample drawn from a bigger population (without replacement).
I've created the following code snippet:
import numpy as np
full_population = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
number_of_iterations = 8
drawn_observations = 6
rng = np.random.default_rng()
for single_draw in range(number_of_iterations):
    indices = rng.choice(a=full_population, size=drawn_observations, replace=False, shuffle=True)
However, this code runs slowly (serially) compared to my needs.
I've tried to look it up; this vectorized question seems close, but it's not exactly what I need.
Note that the real length of full_population is 2M, number_of_iterations = 5000, and drawn_observations ranges from 20k to 600k.
Any help on that would be awesome!

Use random permutations after repeatedly tiling your full_population array:
repeats = np.tile(full_population, (number_of_iterations, 1))
permutations = rng.permuted(repeats, axis=1)
sample_array = permutations[:, :drawn_observations]
Should be much faster than the looping approach!
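For reference, here is a self-contained sketch of this approach (the small sizes are illustrative stand-ins for the real ones):
import numpy as np

rng = np.random.default_rng()
full_population = np.arange(11)
number_of_iterations = 8
drawn_observations = 6

# Tile the population into one row per iteration, shuffle every row
# independently, then keep the first drawn_observations columns:
# each row is then a draw without replacement.
repeats = np.tile(full_population, (number_of_iterations, 1))
permutations = rng.permuted(repeats, axis=1)
sample_array = permutations[:, :drawn_observations]
print(sample_array.shape)  # (8, 6)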

Related

Numpy - Fastest way to create an integer index matrix

Problem:
I have an array that represents products, let's say 3 for example:
prod = [1,2,3]
then I have a correlation matrix for those products (just a number that represents something between two products; let's call it c_ij for simplicity), in this case a 3x3 matrix:
corr = [[c_11, c_12, c_13],
        [c_21, c_22, c_23],
        [c_31, c_32, c_33]]
The problem is that I need to shuffle the prod array, and then I need to shuffle the corr matrix in a way that corr[i,j] still represents the correlation between prod[i] and prod[j].
My solution:
I know I can use an integer array as an index to shuffle multiple arrays in the same way, like this:
order = [0,1,2]
new_order = np.random.permutation(order) # [2,0,1] for example
shuf_prod = prod[new_order]
Looking on the web, I found that to make this work on a matrix I need to transform the order array into a matrix like:
new_order = [2,0,1]
new_index = [ [[2,2,2],[0,0,0],[1,1,1]],
              [[2,0,1],[2,0,1],[2,0,1]] ]
new_corr = corr[tuple(new_index)]
# this output what I want that is:
# [[c_33,c_31,c_32],
# [c_13,c_11,c_12],
# [c_23,c_21,c_22]]
Question:
The entire shuffling solution looks clunky and inefficient, and this is a performance-critical application, so is there a faster way to do this? (I don't really care about code simplicity, just performance.)
If this is a good way of doing it, how can I create the new_index matrix from the new_order array?
EDIT: Michael Szczesny solved the problem
new_corr = corr[new_order].T[new_order].T
You can use the indices directly as subscripts to the matrix, as long as you provide the right shape for the second axis:
import numpy as np
mat = np.array([[3, 4, 5],
                [4, 8, 9],
                [5, 9, 7]])
order = np.array([2, 0, 1])
mat[order, order[:, None]]
array([[7, 5, 9],
       [5, 3, 4],
       [9, 4, 8]])
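As a sanity check, here is a small sketch of my own (the symmetric matrix is a made-up stand-in for corr) showing that the accepted one-liner and the broadcasting form give the same consistent reordering:
import numpy as np

rng = np.random.default_rng(0)
prod = np.array([1, 2, 3])
corr = np.array([[11, 12, 13],
                 [12, 22, 23],
                 [13, 23, 33]])  # made-up symmetric stand-in for the c_ij values

new_order = rng.permutation(len(prod))
shuf_prod = prod[new_order]

# Both reorder rows and columns consistently, so that new_corr[i, j]
# is the correlation between shuf_prod[i] and shuf_prod[j].
a = corr[new_order].T[new_order].T
b = corr[new_order[:, None], new_order]
assert np.array_equal(a, b)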

use `numpy.take` to randomly select 2d points

Problem Setup:
points - a 2D numpy.array of length N.
centroids - a 2D numpy.array that I get as output from the K-Means algorithm, of length k < N.
As a centroid initialization routine for an MLE algorithm, I want to assign each point in points a random centroid from centroids.
Required Output:
A numpy.array of shape (N, 2), of randomly chosen 2D points from centroids
My Efforts:
I've tried using numpy.take with numpy.random.choice, as shown in Code 1, but it doesn't return the desired output.
Code 1:
import numpy as np
a = np.random.randint(1, 10, 10).reshape((5, 2))
idx = np.random.choice(5, 20)
np.take(a, idx)
Out: array([6, 2, 3, 3, 8, 2, 5, 2, 6, 3, 3, 8, 6, 6, 6, 6, 8, 2, 6, 5])
From the numpy.take documentation page I've learned that it chooses items from the flattened array, which is not what I need.
I'd appreciate any ideas on how to accomplish this task. Thanks in advance for any help.
One way is sampling the indexes and then using them to index the first dimension of centroids:
idx = np.random.choice(np.arange(len(centroids)), size=N)  # one random centroid index per point
out = centroids[idx]  # shape (N, 2)
Similar to @Quang Hoang's answer, but a bit more intuitive in my opinion:
a = np.random.randint(1, 10, 10).reshape((5, 2))  # stand-in for centroids, k = 5
n_sampled_points = 20
a[np.random.randint(0, a.shape[0], n_sampled_points)]  # shape (20, 2)
Cheers.
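A minimal end-to-end sketch of this idea (N = 20 points and k = 5 centroids are illustrative values of my own):
import numpy as np

rng = np.random.default_rng()
N, k = 20, 5
centroids = rng.integers(1, 10, size=(k, 2))  # stand-in for the K-Means output

# Draw one random centroid index per point, then fancy-index axis 0:
assigned = centroids[rng.integers(0, k, size=N)]
print(assigned.shape)  # (20, 2)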

Fill numpy array with other numpy array

I have following numpy arrays:
whole = np.array(
[1, 0, 3, 0, 6]
)
sparse = np.array(
[9, 8]
)
Now I want to replace every zero in the whole array, in order, with the items from the sparse array. For the example, my desired array would look like:
merged = np.array(
[1, 9, 3, 8, 6]
)
I could write a small algorithm myself to do this, but if someone knows a time-efficient way to solve it, I would be very grateful for your help!
Do you assume that sparse has the same length as the number of zeros in whole?
If so, you can do:
import numpy as np
from copy import copy
whole = np.array([1, 0, 3, 0, 6])
sparse = np.array([9, 8])
merge = copy(whole)
merge[whole == 0] = sparse  # the boolean mask picks out the zeros, in order
If the lengths mismatch, you have to restrict to the correct length using len(...) and slicing, as sketched below.
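A minimal sketch of that length guard (the extra value in sparse is my own example):
import numpy as np

whole = np.array([1, 0, 3, 0, 6])
sparse = np.array([9, 8, 7])  # one more fill value than there are zeros

merged = whole.copy()
zeros = np.flatnonzero(whole == 0)  # positions of the zeros, in order
n = min(len(zeros), len(sparse))    # guard against mismatched lengths
merged[zeros[:n]] = sparse[:n]
print(merged)  # [1 9 3 8 6]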

numpy padding matrix of different row size

I have a numpy array with rows of different sizes:
a = np.array([[1,2,3,4,5],[1,2,3],[1]], dtype=object)  # dtype=object is needed for ragged rows on recent NumPy
and I would like to turn it into a dense matrix (fixed n x m size, no variable-length rows). So far I have tried something like this:
size = (len(a),5)
result = np.zeros(size)
result[[0],[len(a[0])]]=a[0]
But I receive an error telling me
shape mismatch: value array of shape (5,) could not be broadcast to
indexing result of shape (1,)
I also tried to do padding with np.pad, but according to the documentation of numpy.pad it seems I need to specify in pad_width the previous size of each row (which is variable, and produced errors when I tried -1, 0, and the biggest row size).
I know I can do it by padding lists per row as shown here, but I need to do that with a much bigger array of data.
If someone can help me with the answer to this question, I would be glad to know it.
There's really no way to pad a jagged array such that it loses its jaggedness without iterating over the rows of the array. You'll even have to iterate over the array twice: once to find out the maximum length you need to pad to, and another time to actually do the padding.
The code proposal you've linked to will get the job done, but it's not very efficient, because it adds zeroes in a Python for-loop that iterates over the elements of the rows, whereas that appending could have been precalculated, thereby pushing more of the work to C.
The code below precomputes an array of the required minimal dimensions, filled with zeroes, and then simply adds each row of the jagged array M in place, which is far more efficient.
import random
import numpy as np

M = [[random.random() for n in range(random.randint(0, m))] for m in range(10000)]  # play-data

def pad_to_dense(M):
    """Append the minimal required amount of zeroes at the end of each
    array in the jagged array `M`, such that `M` loses its jaggedness."""
    maxlen = max(len(r) for r in M)
    Z = np.zeros((len(M), maxlen))
    for enu, row in enumerate(M):
        Z[enu, :len(row)] += row
    return Z
To give you some idea of the speed:
from timeit import timeit
n = [10, 100, 1000, 10000]
s = [timeit(stmt='Z = pad_to_dense(M)', setup='from __main__ import pad_to_dense; import numpy as np; from random import random, randint; M = [[random() for n in range(randint(0,m))] for m in range({})]'.format(ni), number=1) for ni in n]
print('\n'.join(map(str,s)))
# 7.838103920221329e-05
# 0.0005027339793741703
# 0.01208890089765191
# 0.8269036808051169
If you want to prepend zeroes to the arrays, rather than append them, that's a simple enough change to the code; a sketch follows below.
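For instance, a minimal sketch of that prepending variant (my own, reusing the np import from above):
def pad_to_dense_front(M):
    """Like pad_to_dense, but puts the zeroes in front of each row."""
    maxlen = max(len(r) for r in M)
    Z = np.zeros((len(M), maxlen))
    for enu, row in enumerate(M):
        Z[enu, maxlen - len(row):] += row
    return Z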
You can do something like this with numpy.pad:
import numpy as np

a = np.array([[1,2,3,4,5],[1,2,3],[1]], dtype=object)  # dtype=object for the ragged input
l = np.array([len(a[i]) for i in range(len(a))])
width = l.max()
b = []
for i in range(len(a)):
    if len(a[i]) != width:
        x = np.pad(a[i], (0, width - len(a[i])), 'constant', constant_values=0)
    else:
        x = a[i]
    b.append(x)
b = np.array(b)
print(b)
The above piece of code outputs something like this:
b = [[1, 2, 3, 4, 5],
     [1, 2, 3, 0, 0],
     [1, 0, 0, 0, 0]]
You can read back your input version of the data by doing something as follows:
a = []
for i in range(len(b)):
    a.append(b[i][0:l[i]])
a = np.array(a, dtype=object)  # dtype=object keeps the rows ragged
print(a)
where you get the following output
a = array([array([1, 2, 3, 4, 5]), array([1, 2, 3]), array([1])], dtype=object)
Hopefully this helps someone who struggled like me to solve the issue.
Thank you.

Python time-lat-lon array manipulation and grouping

For a t-x-y array representing time-latitude-longitude, where the values of the t-x-y grid hold arbitrary measured variables, how can I 'group' x-y slices of the array for a given time condition?
For example, if a companion t-array is a 1D list of datetimes, how can I find the elementwise mean of the x-y grids whose month equals 1? If t has only 10 elements where month = 1, then I want a (10, len(x), len(y)) array. From there I know I can do np.mean(out, axis=0) to get my desired mean values across the x-y grid, where out is the result of the array manipulation.
The shape of t-x-y is approximately (2000, 50, 50), that is, a (50, 50) grid of values for 2000 different times. Assume that the number of unique conditions (whether I'm slicing by month or by year) is much smaller than the total number of elements in the t array.
What is the most pythonic way to achieve this? This operation will be repeated with many datasets, so a computationally efficient solution is preferred. I'm relatively new to Python (I can't even figure out how to create an example array for you to test with), so feel free to recommend other modules that may help. (I have looked at Pandas, but it seems like it mainly handles 1D time-series data...?)
Edit:
This is the best I can do as an example array:
>>> t = np.repeat([1,2,3,4,5,6,7,8,9,10,11,12],83)
>>> t.shape
(996,)
>>> a = np.random.randint(1,101,2490000).reshape(996, 50, 50)
>>> a.shape
(996, 50, 50)
>>> list(set(t))
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
So a is the array of random data, and t is (say) your array representing months of the year, in this case just plain integers. In this example there are 83 instances of each month. How can we separate out the 83 x-y slices of a that correspond to when t = 1 (to create a monthly mean dataset)?
One possible answer to my own question, using numpy.where.
To find the slices of a, where t = 1:
>>> import numpy as np
>>> out = a[np.where(t == 1),:,:]
although this gives the slightly confusing (to me at least) output of:
>>> out.shape
(1, 83, 50, 50)
but if we follow through with the mean that I need,
>>> out2 = np.mean(np.mean(out, axis = 0), axis = 0)
reduces the result to the expected:
>>> out2.shape
(50,50)
Can anyone improve on this or see any issues here?
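One possible improvement: indexing with the boolean mask directly avoids the extra leading axis that the np.where tuple introduces, so a single mean suffices:
>>> out = a[t == 1]          # shape (83, 50, 50), no size-1 leading axis
>>> out2 = out.mean(axis=0)  # shape (50, 50)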
