Monte Carlo Simulation with multiple distributions in each loop - python

I have an array of NaNs 10 columns wide and 5 rows long.
I have a 5x3 array of poisson random number generations. This represents 5 runs of each A, B, and C, where each column has a different lambda value for the poisson distribution.
 A  B  C
[1, 1, 2,
 1, 2, 2,
 2, 1, 4,
 1, 2, 3,
 0, 1, 2]
Each row represents the number of events. That is, the first row would produce one event of type A, one event of type B, and two events of type C.
I would like to loop through each row and produce a set of uniform random numbers. For A, it would be between 1 and 100; for B, between 101 and 200; and for C, between 201 and 300.
The output of the first row would have four numbers, one number between 1 and 100, one number between 101 and 200, and two numbers between 201 and 300. So a sample output of the first row might be:
[34, 105, 287, 221]
The second output row would have five numbers in it, the third row would have seven, etc. I would like to store it in my array of NaNs by overwriting the NaNs that get replaced in each row. Can anyone please help with this? Thanks!

I've got a rather inefficient/unvectorised method which may or may not be what you're looking for, because one part of your question is unclear to me. Do you want the final array to have rows of different sizes, or to be the same size but padded with nans?
This solution assumes padding with nans, since you talked about the nans being overwritten and didn't mention the extra/unused nans being deleted. I'm also assuming that your ABC thing is structured into a numpy array of size (5,3), and I'm calling the array of nans results_arr.
import numpy as np
from random import randint

# Initializing the arrays
results_arr = np.full((5, 10), np.nan)
abc = np.array([[1, 1, 2], [1, 2, 2], [2, 1, 4], [1, 2, 3], [0, 1, 2]])

# Loop through each row in abc
for row_idx in range(len(abc)):
    a, b, c = abc[row_idx]
    # Draw a number in the specified uniform range as many times as the
    # A column specifies; the next two loops do the same for B and C.
    for i in range(0, a):
        results_arr[row_idx, i] = randint(1, 100)
    for j in range(a, a + b):
        results_arr[row_idx, j] = randint(101, 200)
    for k in range(a + b, a + b + c):
        results_arr[row_idx, k] = randint(201, 300)
Hope that helps!
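For reference, here's a slightly more numpy-flavoured sketch of the same padded-nan idea (rng_bounds is a name I'm introducing; it assumes the results_arr and abc defined above):
rng_bounds = [(1, 100), (101, 200), (201, 300)]   # inclusive ranges for A, B, C

for row_idx, counts in enumerate(abc):
    draws = np.concatenate([
        np.random.randint(low, high + 1, size=n)  # numpy's high is exclusive
        for (low, high), n in zip(rng_bounds, counts)
    ])
    results_arr[row_idx, :draws.size] = draws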
P.S. Here's a solution with uneven rows. The result is stored in a list of lists because numpy doesn't support ragged arrays (i.e. rows of different lengths).
import numpy as np
from random import randint

# Initializations
results_arr = []
abc = np.array([[1, 1, 2], [1, 2, 2], [2, 1, 4], [1, 2, 3], [0, 1, 2]])

# Same code logic as before, just storing the results differently
for row_idx in range(len(abc)):
    a, b, c = abc[row_idx]
    results_this_row = []
    for i in range(0, a):
        results_this_row.append(randint(1, 100))
    for j in range(a, a + b):
        results_this_row.append(randint(101, 200))
    for k in range(a + b, a + b + c):
        results_this_row.append(randint(201, 300))
    results_arr.append(results_this_row)
I hope these two solutions cover what you're looking for!
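As a quick sanity check on the ragged version, each row's length should match the row sums of abc:
print([len(row) for row in results_arr])  # [4, 5, 7, 6, 3]
print(abc.sum(axis=1).tolist())           # [4, 5, 7, 6, 3]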

Related

Differences in an array based on groups defined by another array

I have two arrays of the same size. One, call it A, contains a series of repeated numbers; the other, B, contains random numbers.
import numpy as np
A = np.array([1,1,1,2,2,2,0,0,0,3,3])
B = np.array([1,2,3,6,5,4,7,8,9,10,11])
I need to find the differences in B between the two extremes defined by the groups in A. More specifically, I need an output C such as
C = [2, -2, 2, 1]
where each term is the difference 3 - 1, 4 - 6, 9 - 7, and 11 - 10, i.e., the difference between the extremes in B identified by the groups of repeated numbers in A.
I tried to play around with itertools.groupby to isolate the groups in the first array, but it is not clear to me how to exploit the indexing to operate the differences in the second.
Edit: C is now sorted the same way as in the question
C = []
_, idx = np.unique(A, return_index=True)
for i in A[np.sort(idx)]:
    bs = B[A == i]
    C.append(bs[-1] - bs[0])
print(C)  # [2, -2, 2, 1]
With return_index=True, np.unique returns the index of the first appearance of each unique value in A.
i in A[np.sort(idx)] iterates over the unique values in the order of the indexes.
B[A==i] extracts the values from B at the same indexes as those values in A.
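If you want to avoid the Python loop entirely, here's a numpy-only sketch using first/last occurrence indices (variable names are mine; like the pandas version below, it takes the global first and last occurrence of each value):
# first occurrence of each unique value, and last occurrence via the reversed array
_, first = np.unique(A, return_index=True)
last = len(A) - 1 - np.unique(A[::-1], return_index=True)[1]
order = np.argsort(first)          # restore the order the groups appear in A
C = B[last[order]] - B[first[order]]
print(C)  # [ 2 -2  2  1]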
This is easily achieved using pandas' groupby:
A = np.array([1,1,1,2,2,2,0,0,0,3,3])
B = np.array([1,2,3,6,5,4,7,8,9,10,11])
import pandas as pd
pd.Series(B).groupby(A, sort=False).agg(lambda g: g.iloc[-1]-g.iloc[0]).to_numpy()
output: array([ 2, -2, 2, 1])
using itertools.groupby (the walrus operator below requires Python 3.8+):
from itertools import groupby
[(x:=list(g))[-1][1]-x[0][1] for k, g in groupby(zip(A,B), lambda x: x[0])]
output: [2, -2, 2, 1]
NB: the two solutions behave differently if the same group value appears in non-consecutive runs; itertools.groupby only groups consecutive elements, while pandas' groupby pools all occurrences of a value.
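Here's a small illustration with hypothetical inputs A2 and B2 where the value 1 appears in two separate runs:
A2 = np.array([1, 1, 2, 2, 1, 1])
B2 = np.array([1, 2, 3, 4, 5, 6])

# pandas pools all occurrences of 1: [6 - 1, 4 - 3] -> [5, 1]
print(pd.Series(B2).groupby(A2, sort=False).agg(lambda g: g.iloc[-1] - g.iloc[0]).to_numpy())

# itertools.groupby restarts at each run: [2 - 1, 4 - 3, 6 - 5] -> [1, 1, 1]
print([(x := list(g))[-1][1] - x[0][1] for k, g in groupby(zip(A2, B2), lambda x: x[0])])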

Tensor reduction based off index vector

As an example, I have 2 tensors: A = [1;2;3;4;5;6;7] and B = [2;3;2]. I want to reduce A based on B, where B's values say how to sum A's values: B = [2;3;2] means the reduced A is the sum of the first 2 values, the next 3, and the last 2, i.e. A' = [(1+2);(3+4+5);(6+7)]. The sum of B is always equal to the length of A. I'm trying to do this as efficiently as possible, preferably with specific functions or matrix operations within pytorch/python. Thanks!
Here is the solution.
First, we create an array of indices B_idx with the same size as A.
Then, accumulate (add) all elements in A based on the indices B_idx using index_add_.
import torch

A = torch.arange(1, 8)
B = torch.tensor([2, 3, 2])
B_idx = [idx.repeat(times) for idx, times in zip(torch.arange(len(B)), B)]
B_idx = torch.cat(B_idx) # tensor([0, 0, 1, 1, 1, 2, 2])
A_sum = torch.zeros_like(B)
A_sum.index_add_(dim=0, index=B_idx, source=A)
print(A_sum) # tensor([ 3, 12, 13])
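A small aside: the B_idx construction can also be done in one call with torch.repeat_interleave, which repeats each index the requested number of times:
B_idx = torch.repeat_interleave(torch.arange(len(B)), B)  # tensor([0, 0, 1, 1, 1, 2, 2])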

How to randomly throw numbers in a 2D dimensional board

I have a 50x50 2D board of empty cells. I want to fill 20% of the cells with 0, 30% with 1, 30% with 2 and 20% with 3. How do I randomly throw these 4 numbers onto the board in those percentages?
import numpy as np
from numpy import random
dim = 50
map = [[" " for i in range(dim)] for j in range(dim)]
print(map)
One way to get this kind of randomness would be to start with a random permutation of the numbers from 0 to the total number of cells you have minus one.
perm = np.random.permutation(2500)
Now you split the permutation according to the proportions you want and treat the entries of the permutation as indices into the array.
array = np.empty(2500)
p1 = int(0.2*2500)
p2 = int(0.3*2500)
p3 = int(0.3*2500)
array[perm[range(0, p1)]] = 0
array[perm[range(p1, p1 + p2)]] = 1
array[perm[range(p1 + p2, p1 + p2 + p3)]] = 2
array[perm[range(p1 + p2 + p3, 2500)]] = 3
array = array.reshape(50, 50)
This way you ensure the proportions for each number.
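A quick sanity check that the proportions come out exactly as intended:
values, counts = np.unique(array, return_counts=True)
print(values, counts)  # [0. 1. 2. 3.] [500 750 750 500]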
Since the percentages sum up to 1, you can start with a board of zeros
bsize = 50
board = np.zeros((bsize, bsize))
In this approach the board positions are interpreted as 1D positions; we then need a set of positions covering 80% of the board.
for i, pos in enumerate(np.random.choice(bsize**2, int(0.8*bsize**2), replace=False)):
    # the first 30% will be set with 1
    if i < int(0.3*bsize**2):
        board[pos//bsize][pos%bsize] = 1
    # the second 30% (between 30% and 60%) will be set with 2
    elif i < int(0.6*bsize**2):
        board[pos//bsize][pos%bsize] = 2
    # the rest 20% (between 60% and 80%) will be set with 3
    else:
        board[pos//bsize][pos%bsize] = 3
At the end the last 20% of positions will remain as zeros
As suggested by @alexis in the comments, this approach can be made simpler with the shuffle method from the random module:
import numpy as np
from random import shuffle

bsize = 50
board = np.zeros((bsize, bsize))
l = list(range(bsize**2))
shuffle(l)
for i, pos in enumerate(l):
    # the first 30% will be set with 1
    if i < int(0.3*bsize**2):
        board[pos//bsize][pos%bsize] = 1
    # the second 30% (between 30% and 60%) will be set with 2
    elif i < int(0.6*bsize**2):
        board[pos//bsize][pos%bsize] = 2
    # the rest 20% (between 60% and 80%) will be set with 3
    elif i < int(0.8*bsize**2):
        board[pos//bsize][pos%bsize] = 3
The last 20% of positions will remain as zeros again.
A different approach (admittedly probabilistic, so you won't get the exact proportions that the solution proposed by Brad Solomon guarantees):
import numpy as np
res = np.random.random((50, 50))
# res <= 0.2 maps to 0 anyway; the line is kept for symmetry with the others
zeros = np.where(res <= 0.2, 0, 0)
ones = np.where(np.logical_and(res <= 0.5, res > 0.2), 1, 0)
twos = np.where(np.logical_and(res <= 0.8, res > 0.5), 2, 0)
threes = np.where(res > 0.8, 3, 0)
final_result = zeros + ones + twos + threes
Running
np.unique(final_result, return_counts=True)
yielded
(array([0, 1, 2, 3]), array([499, 756, 754, 491]))
Here's an approach with np.random.choice to shuffle indices, then filling those indices with repeats of the inserted ints. It will fill the array in the exact proportions that you specify:
import numpy as np
np.random.seed(444)
board = np.zeros(50 * 50, dtype=np.uint8).flatten()
# The "20% cells with 0" can be ignored since that is the default.
#
# This will work as long as the proportions are "clean" ints
# (I.e. mod to 0; 2500 * 0.2 is a clean 500. Otherwise, need to do some rounding.)
rpt = (board.shape[0] * np.array([0.3, 0.3, 0.2])).astype(int)
repl = np.repeat([1, 2, 3], rpt)
idx = np.random.choice(board.shape[0], size=repl.size, replace=False)
board[idx] = repl
board = board.reshape((50, 50))
Resulting frequencies:
>>> np.unique(board, return_counts=True)
(array([0, 1, 2, 3], dtype=uint8), array([500, 750, 750, 500]))
>>> board
array([[1, 3, 2, ..., 3, 2, 2],
       [0, 0, 2, ..., 0, 2, 0],
       [1, 1, 1, ..., 2, 1, 0],
       ...,
       [1, 1, 2, ..., 2, 2, 2],
       [1, 2, 2, ..., 2, 1, 2],
       [2, 2, 2, ..., 1, 0, 1]], dtype=uint8)
Approach
Flatten the board. Easier to work with indices when the board is (temporarily) one-dimensional.
rpt is a 1d vector of the number of repeats per int. It gets "zipped" together with [1, 2, 3] to create repl, which is length 2000. (80% of the size of the board; you don't need to worry about the 0s in this example.)
The indices of the flattened array are effectively shuffled (idx), and the length of this shuffled array is constrained to the size of the replacement candidates. Lastly, those indices in the 1d board are filled with the replacements, after which it can be made 2d again.
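If the proportions don't multiply out to clean integers, one way to handle the rounding mentioned in the code comment above is largest-remainder rounding. A minimal sketch (the helper name counts_from_props is hypothetical):
import numpy as np

def counts_from_props(n, props):
    """Largest-remainder rounding: floor each count, then hand the leftover
    slots to the entries with the largest fractional parts."""
    raw = n * np.asarray(props, dtype=float)
    counts = np.floor(raw).astype(int)
    leftover = int(round(raw.sum())) - counts.sum()
    order = np.argsort(raw - counts)[::-1]   # largest fractional parts first
    counts[order[:leftover]] += 1
    return counts

rpt = counts_from_props(2500, [0.3, 0.3, 0.2])   # array([750, 750, 500])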

Convert numpy array with values into array with frequency for each observation in each row

I have a numpy array as follows:
array = np.random.randint(6, size=(50, 400))
This array holds the cluster each value belongs to, with each row representing a sample and each column representing a feature. I would like to create a 50x5 array holding the frequency of each cluster in each sample (one row per sample).
However, in the frequency calculation I want to ignore 0, meaning that the frequencies of all values except 0 (i.e. 1-5) should add to 1.
Essentially what I want is an array where each column corresponds to a cluster (1-5) and each row still corresponds to a single sample.
How can this be done?
Edit:
small input:
input = np.random.randint(6, size=(2, 5))
array([[0, 4, 2, 3, 0],
       [5, 5, 2, 5, 3]])
output:
  1    2    3    4    5
  0   .33  .33  .33   0
  0   .2   .2    0   .6
Where 1-5 are the column names, and the two rows below them are the desired output as a numpy array.
This is a simple application of bincount. Does this do what you want?
def freqs(x):
    counts = np.bincount(x, minlength=6)[1:]
    return counts / counts.sum()

frequencies = np.apply_along_axis(freqs, axis=1, arr=array)
If you were wondering about the speed implications of apply_along_axis, this method using tricky indexing is marginally slower in my tests:
values = np.arange(1, 6)
counts = (array[:, :, None] == values[None, None, :]).sum(axis=1)
frequencies2 = counts/counts.sum(axis=1)[:, None]
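If you'd rather avoid apply_along_axis altogether, another common trick (a sketch; frequencies3 is just an illustrative name) is to offset each row's values so one bincount call counts every row at once:
n_rows = array.shape[0]
offsets = np.arange(n_rows)[:, None] * 6   # shift row i into its own block of 6 bins
counts = np.bincount((array + offsets).ravel(), minlength=n_rows * 6)
counts = counts.reshape(n_rows, 6)[:, 1:]  # drop the bin for the ignored 0s
frequencies3 = counts / counts.sum(axis=1, keepdims=True)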

Finding the row with the highest average in a numpy array

Given the following array:
complete_matrix = numpy.array([
    [0, 1, 2, 4],
    [1, 0, 3, 5],
    [2, 3, 0, 6],
    [4, 5, 6, 0]])
I would like to identify the row with the highest average, excluding the diagonal zeros.
So, in this case, I would be able to identify complete_matrix[:,3] as being the row with the highest average.
Note that the presence of the zeros doesn't affect which row has the highest mean because all rows have the same number of elements. Therefore, we just take the mean of each row, and then ask for the index of the largest element.
# Take the mean along the 1st axis, i.e. collapse into an Nx1 array of means
means = np.mean(complete_matrix, 1)
# Now just get the index of the largest mean
idx = np.argmax(means)
idx is now the index of the row with the highest mean!
You don't need to worry about the 0s; they shouldn't affect how the averages compare, since there will presumably be one in each row. Hence, you can do something like this to get the index of the row with the highest average:
>>> import numpy as np
>>> complete_matrix = np.array([
...     [0, 1, 2, 4],
...     [1, 0, 3, 5],
...     [2, 3, 0, 6],
...     [4, 5, 6, 0]])
>>> np.argmax(np.mean(complete_matrix, axis=1))
3
Reference:
numpy.mean
numpy.argmax
As pointed out by a lot of people, presence of zeros isn't an issue as long as you have the same number of zeros in each column. Just in case your intention was to ignore all the zeros, preventing them from participating in the average computation, you could use weights to suppress the contribution of the zeros. The following solution assigns 0 weight to zero entries, 1 otherwise:
numpy.argmax(numpy.average(complete_matrix, axis=0, weights=complete_matrix != 0))
You can always create a weight matrix where the weight is 0 for diagonal entries, and 1 otherwise.
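For instance, for the 4x4 example above (a sketch; this ignores the diagonal even when off-diagonal entries happen to be 0):
weights = 1 - np.eye(complete_matrix.shape[0])   # 0 on the diagonal, 1 elsewhere
np.argmax(np.average(complete_matrix, axis=1, weights=weights))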
This answer would actually fit better with your other question, which was marked as a duplicate of this one (I don't know why, because it is not the same question...).
The presence of zeros can indeed affect the columns' or rows' average, for instance:
a = np.array([[  0,   1, 0.9,   1],
              [0.9,   0,   1,   1],
              [  1,   1,   0, 0.5]])
Without eliminating the diagonal, column 3 (0-indexed) has the highest average; but after eliminating the diagonal, the highest average belongs to column 1, and column 3 now has the lowest average of all!
You can correct the calculated mean using the lcm (least common multiple) of the number of rows with and without the diagonal, while guaranteeing that the correction is not applied where a diagonal element does not exist. Since the diagonal entries are 0, the column sum is unchanged by dropping them, and dividing by n-1 instead of n adds column_sum/(n(n-1)) to the mean; n and n-1 are coprime, so n(n-1) = lcm(n, n-1):
correction = column_sum/lcm(len(column), len(column)-1)
new_mean = mean + correction
For the 3-row example above, lcm(3, 2) = 6, so for column 0 the correction is 1.9/6 ≈ 0.317 and the corrected mean is 0.633 + 0.317 = 0.95.
I copied the algorithm for lcm from this answer and proposed a solution for your case:
import numpy as np

def gcd(a, b):
    """Return greatest common divisor using Euclid's Algorithm."""
    while b:
        a, b = b, a % b
    return a

def lcm(a, b):
    """Return lowest common multiple."""
    return a * b // gcd(a, b)

def mymean(a):
    if len(a.diagonal()) < a.shape[1]:
        # non-square matrix: columns past the diagonal get no correction
        tmp = np.hstack((a.diagonal()*0 + 1, 0))
    else:
        tmp = a.diagonal()*0 + 1
    return np.mean(a, axis=0) + np.sum(a, axis=0)*tmp/lcm(a.shape[0], a.shape[0] - 1)
Testing with the a given above:
mymean(a)
#array([ 0.95 , 1. , 0.95 , 0.83333333])
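As a cross-check (an alternative, not the answer's method), masking the diagonal with a masked array gives the same column means:
masked = np.ma.masked_array(a, mask=np.eye(*a.shape, dtype=bool))
print(masked.mean(axis=0))  # [0.95 1.0 0.95 0.8333...]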
With another example:
b = np.array([[  0,   1, 0.9,   0],
              [0.9,   0,   1,   1],
              [  1,   1,   0, 0.5],
              [0.9, 0.2,   1,   0],
              [  1,   1, 0.7, 0.5]])
mymean(b)
#array([ 0.95, 0.8 , 0.9 , 0.5 ])
With the corrected average you just use np.argmax() to get the column index with the highest average. Similarly, np.argmin() to get the index of the column with the least average:
np.argmin(mymean(a))
