python: creating random values by excluding specific values from list

Assuming I have the following data:
data = [1,1,3,2,4]
max_value = 4 # it is known from before
number_of_random_values = 2
I want to generate random values in the range 1 to 4 for each point of data, excluding the value of that point itself. To make it clearer, here is an example:
data point random_values
1 -> [2,4]
1 -> [3,2]
3 -> [1,4]
2 -> [3,1]
4 -> [1,3]
So, for each data point we get two random values, and neither of them may be equal to the data point itself. What I have done so far is the following:
import copy
import random
import numpy as np

desired_values = np.zeros((len(data), number_of_random_values))
range_of_data = list(range(1, max_value + 1))
i = 0
for data_point in data:
    copy_of_range = copy.copy(range_of_data)
    copy_of_range.remove(data_point)
    random_values_for_data_point = random.sample(copy_of_range, number_of_random_values)
    desired_values[i] = random_values_for_data_point
    i = i + 1
The above code does what I want (desired results in a NumPy array), but it is clearly not optimized for performance.
Is there a vectorized method to implement this, or something more efficient?
Edit
replacing data with
data = np.random.random_integers(max_value, size=(1000, 1)).tolist()
and running my solution along with the solutions from the answers below, using:
import time

start_time = time.time()
for _ in range(10000):
    # each solution
    ...
end_time = time.time()
print(end_time - start_time)
we have the following results:
my solution: 40.3 sec
Anton vBR solution: 31.7 sec
Desire: 261 sec

If we don't use np for the random numbers,
we can make something simple like this:
import random
import numpy as np
data = [1,1,3,2,4]
max_value = 4 # it is known from before
number_of_random_values = 2
output = [random.sample([i for i in range(1, max_value+1) if i != item], 2)
          for item in data]
np.array(output)
Returns
array([[4, 2],
       [3, 4],
       [1, 4],
       [1, 3],
       [3, 2]])

Avoiding a given integer in the range [1, max_value] can be achieved with modular arithmetic, which is vectorized in NumPy:
Generate a random number in range(0, max_value-1) (so not including max_value or max_value-1).
Add it to the given, excluded number.
Take the remainder modulo max_value, and add 1.
The result is equally likely to be any number between 1 and max_value inclusive, except the excluded one. (Indeed, the only way to obtain the excluded value would be to add max_value-1 at step 1, which is not possible.)
So the problem boils down to generating many samples without replacement from the same range (no exclusion needed anymore). Unfortunately, NumPy does not seem to have a tool for this at present: numpy.random.choice produces only one such sample, so it has to be called in a loop.
data = np.array([1,1,3,2,4])
max_value = 4
number_of_random_values = 2
desired_values = np.zeros((len(data), number_of_random_values), dtype=int)
for i in range(len(data)):
    desired_values[i, :] = np.random.choice(max_value-1, number_of_random_values, replace=False)
desired_values = np.mod(desired_values + data.reshape(-1, 1), max_value) + 1
Notice this version declares the dtype of the array desired_values, which would otherwise be float64 by default. The type could be np.int8 if you expect only small integers.
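If the per-row loop over np.random.choice becomes the bottleneck, the sampling step itself can also be vectorized with a common argsort trick (row-wise argsort of a matrix of uniform random numbers yields independent random permutations, and the first k columns form a sample without replacement). This is only a minimal sketch under that assumption, reusing the names from the answer above:

import numpy as np

data = np.array([1, 1, 3, 2, 4])
max_value = 4
number_of_random_values = 2

# Row-wise argsort of uniform noise gives one random permutation of
# range(max_value - 1) per data point; keep the first k entries.
rand = np.random.rand(len(data), max_value - 1)
samples = np.argsort(rand, axis=1)[:, :number_of_random_values]

# Shift by the excluded value and wrap around, exactly as in the loop above.
desired_values = (samples + data.reshape(-1, 1)) % max_value + 1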

Related

Is there an efficient way in python of multiplying each column in a matrix with all columns in the same matrix?

I have a large matrix (around 100.000 x 100.000). The good thing is that it contains only zeros and ones, and mostly zeros (it is already stored as a boolean matrix to save some RAM). Now I need to multiply each column of the matrix with all of the other columns. The reason is that I need to check whether there is at least one row where both columns have a non-zero element (therefore multiplying and summing the resulting vector to check whether it is zero or not). As an example, assume we have the matrix
1.column  2.column  3.column
    1         0         0
    1         1         0
    0         0         1
Then I need to compare all columns and check whether there is at least one row where both columns are one. Comparing the first and the second column returns True, since they are both one in the second row. However, comparing the first and the third column, or the second and the third column, returns False, since there is no row where both are one.
Obviously this can be done with a for loop iterating over all columns, but not at a very satisfying speed. I already tried numba like this:
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def create_dist_arr(arr: np.array):
    n = arr.shape[1]
    dist_arr = np.zeros(shape=(n, n))  # , dtype=bool)
    for i in prange(arr.shape[1]):
        for j in prange(i, arr.shape[1]):
            dist_greater_zero = calc_dist_graeter_than_zeros(arr[:, i], arr[:, j])
            dist_arr[i][j] = dist_greater_zero
    return dist_arr

@njit
def calc_dist_graeter_than_zeros(ith_col, jth_col):
    return np.sum(np.multiply(ith_col, jth_col)) != 0

zero_arr = np.zeros(shape=(2000, 6000), dtype=bool)
bool_dist_matrix = create_dist_arr(zero_arr)
But despite having 120 GB of RAM and 32 cores, this gets very slow at around 10.000 x 10.000 matrices. It is even worse when trying scipy.spatial.distance.pdist like this:
from scipy.spatial.distance import pdist
zero_arr = np.zeros(shape=(500, 500), dtype=bool)
bool_dist_matrix = pdist(zero_arr, lambda u, v: np.sum(np.multiply(u, v)) != 0)
Is there maybe some nice and fast workaround using sparse matrices or something else not taking like forever?
Thank you in advance :)
I don't know how memory efficient this solution is, nor whether it will be faster than the other solutions, but it is vectorized.
Your idea was to multiply the columns and sum the result, which reminded me of matrix multiplication, except against the matrix's own columns. So:
Let's say your matrix is M. If you take the transpose M.T and matrix-multiply it against M itself, M.T @ M, that is the same as multiplying each column against the other columns and taking the sum.
import numpy as np

# A random matrix
M = np.array([[1,0,0,0,0,1],
              [1,1,0,0,0,1],
              [0,1,1,1,0,0]])

bool_dist_matrix = (M.T @ M).astype('bool')
np.fill_diagonal(bool_dist_matrix, 0)
"""
[[0 1 0 0 0 1]
[1 0 1 1 0 1]
[0 1 0 1 0 0]
[0 1 1 0 0 0]
[0 0 0 0 0 0]
[1 1 0 0 0 0]]
"""
If your matrix is mainly composed of zeros, then it will be more efficient to work with indices:
import numpy as np

# A random matrix
M = np.array([[1,0,0,0,0,1],
              [1,1,0,0,0,1],
              [0,1,1,1,0,0]])

# Get the indices where M == 1
ind = np.where(M)
# Get the unique row numbers and their counts.
uni = np.unique(ind[0], return_counts=True)
# Keep only the rows that contain at least two 1s.
col = uni[0][uni[1] > 1]
# Create a boolean index.
bol = np.isin(ind[0], col)
# And here we get the "compressed" result:
res = ind[1][bol]  # Column number
grp = ind[0][bol]  # Group (row number)
# res = [0, 5, 0, 1, 5, 1, 2, 3]
# grp = [1, 1, 2, 2, 2, 3, 3, 3]
# So each pair of each group will output a True statement:
# group 1: 0-5
# group 2: 0-1, 0-5, 1-5
# group 3: 1-2, 1-3, 2-3
# For an explicit result you could use itertools. But all the information is
# contained in the res variable.
Note that this method will produce some duplicates if two columns have a common 1 value in more than one row, but it is easy to get rid of those duplicates. Since you're working with a 100000x100000 matrix and not all columns have at least one common 1 value, it is very likely that the percentage of 1s in your matrix is very low, so this method should give good results.
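As the answer notes, itertools can turn res and grp into explicit column pairs. A possible sketch continuing from the variables above, with a set to drop duplicates coming from different rows:

from itertools import combinations, groupby

# Group the column numbers by row ("grp") and emit every pair of columns
# that share a 1 in that row.
pairs = set()
for _, cols_in_row in groupby(zip(grp, res), key=lambda t: t[0]):
    cols = sorted(int(c) for _, c in cols_in_row)
    pairs.update(combinations(cols, 2))

print(sorted(pairs))
# [(0, 1), (0, 5), (1, 2), (1, 3), (1, 5), (2, 3)]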
I'm not quite sure I understood what you need to do, but if I did, my code should help. Here's what I tried; it seems to run a little faster, although that won't be the case if the matrix isn't sparse as you mentioned, and I get a symmetric matrix rather than an upper triangular one:
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def create_dist_arr(arr: np.array):
    n = arr.shape[1]
    dist_arr = np.zeros(shape=(n, n))  # , dtype=bool)
    for i in prange(arr.shape[1]):
        for j in prange(i, arr.shape[1]):
            dist_greater_zero = calc_dist_graeter_than_zeros(arr[:, i], arr[:, j])
            dist_arr[i][j] = dist_greater_zero
    return dist_arr

@njit
def calc_dist_graeter_than_zeros(ith_col, jth_col):
    return np.sum(np.multiply(ith_col, jth_col)) != 0

def create_dist_arr_sparse(arr: np.array):
    n = arr.shape[1]
    dist_arr = np.zeros(shape=(n, n))  # , dtype=bool)
    rows, cols = np.array(np.where(arr))
    same_rows = rows.reshape(1, -1) == rows.reshape(-1, 1)
    idx0, idx1 = np.meshgrid(cols, cols)
    idx0 = idx0.flatten()[same_rows.flatten()]
    idx1 = idx1.flatten()[same_rows.flatten()]
    dist_arr[idx0, idx1] = 1
    return dist_arr

np.random.seed(1)
k = 1000
zero_arr = np.zeros(shape=(k, k), dtype=bool)
rows, cols = np.random.randint(0, k, (2, k))
zero_arr[rows, cols] = 1

%timeit bool_dist_matrix = create_dist_arr(zero_arr)
%timeit bool_dist_matrix = create_dist_arr_sparse(zero_arr)
output:
1 loop, best of 5: 1.59 s per loop
100 loops, best of 5: 9.8 ms per loop

Looking for a better way to handle periodic boundary condition on numpy array or list in python

I have a large dataset (a 2-dimensional matrix) of about 5 to 100 rows and 5000 to 25000 columns. I was told to extract a strip out of each row; the strip length is given. For each row, the strip is filled starting from a random position in the row and going up; if a position goes beyond the length of the row, entries are picked from the beginning, like a periodic boundary. For example, assume a row has 10 elements,
row = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
the position picked is 8 and the strip length is 4. The strip will then be [9, 10, 1, 2].
At first I tried to use NumPy for the computation:
import random
import time
import numpy as np

A = np.ones((5, 8000), order='F')
L = (4,3,3,3,4)  # length for each of the 5 strips

starttime = time.process_time()
for i in range(80000):
    B = []
    for c, row in enumerate(A):
        start = random.randint(0, len(row)-1)
        end = start + L[c]
        if end > len(row)-1:
            sce = np.zeros(L[c])
            for k in range(start, end):
                sce[k-start] = k % len(row)
        else:
            sce = row[start:end]
        B = sce
print(time.process_time() - starttime)
I don't have a good way to handle the boundary condition, so I just break it into two cases: one where the whole strip is within the row, and one where parts of the strip go beyond the row. This code works and takes about 1.5 seconds to run. I then tried using a list instead:
A = [[1]*8000]*5

starttime = time.process_time()
for i in range(80000):
    B = []
    for c, row in enumerate(A):
        start = random.randint(0, len(row)-1)
        end = start + L[c]
        if end > len(row)-1:
            sce = np.zeros(L[c])
            for k in range(start, end):
                sce[k-start] = k % len(row)
        else:
            sce = row[start:end]
        B = sce
print(time.process_time() - starttime)
This one is about 0.5 seconds faster, which is quite surprising; I expected NumPy to be faster! Both versions are fine for a small matrix and a small number of iterations, but in the real project I will be dealing with a very large matrix and many more iterations, so I wonder if there is any suggestion to improve the efficiency. Also, is there any suggestion on how to handle the periodic boundary condition more neatly and efficiently?
Considering that you create the array A before timing, both solutions will be about equally fast, because you are just iterating over the array. I am actually not sure why the pure-Python solution is quicker; maybe collection-based iterators (enumerate) are better suited to primitive Python types?
Looking at the example with one row, you want to take a range of elements from the row and wrap around the out-of-bounds indices. For this I would suggest doing:
row = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
start = 8
L = 4
np.take(row, np.arange(start, start+L), mode='wrap')
output:
array([ 9, 10, 1, 2])
This behavior can then be extended to 2 dimensions by specifying the axis keyword. Working with uneven lengths in L does make it a bit trickier, though, because with non-homogeneous arrays you lose most of the benefits of using numpy. The work-around is to partition L so that equal lengths are grouped together.
If I understand the whole task correctly, you are given some start value and you want to extract each corresponding strip length along the second axis of A.
A = np.arange(5*8000).reshape(5, 8000)  # using arange makes it easier to verify output
L = (4,3,3,3,4)  # length for each of the 5 strips
parts = ((0,4), (1,2,3))  # partition L (too lazy to implement this myself atm)
start = 7998  # arbitrary start position

for part in parts:
    ranges = np.arange(start, start+L[part[0]])
    out = np.take(A[part,:], ranges, axis=-1, mode='wrap')
    print(f'Output for rows {part} with length {L[part[0]]}:\n\n{out}\n')
Output:
Output for rows (0, 4) with length 4:

[[ 7998  7999     0     1]
 [39998 39999 32000 32001]]

Output for rows (1, 2, 3) with length 3:

[[15998 15999  8000]
 [23998 23999 16000]
 [31998 31999 24000]]
Although, it looks like you want a random starting position for each row?
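As a side note, the one-row wrap-around shown at the start of this answer can also be written with plain modular indexing, which some may find more explicit. A minimal equivalent sketch:

import numpy as np

row = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
start, L = 8, 4

# Equivalent to np.take(..., mode='wrap'): reduce the indices modulo the
# row length before fancy indexing.
strip = row[(start + np.arange(L)) % len(row)]
print(strip)   # [ 9 10  1  2]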

How to calculate numbers of "uninterrupted" repeats in an array in python?

I have a 0,1 numpy array like this:
[0,0,0,1,1,1,0,0,1,1,0,0,0,1,1,1,1,0,0,0]
I want a function that tells me that the number 1 is repeated 3, 2, and 4 times in this array, respectively. Is there a simple numpy function for this?
This is one way to do it: first find the clusters, and then get their frequencies using Counter. The first part is inspired by this answer for 2d arrays; I added the second Counter part to get the desired answer.
If you find the linked original answer helpful, please visit it and upvote it.
import numpy as np
from scipy.ndimage import measurements
from collections import Counter

arr = np.array([0,0,0,1,1,1,0,0,1,1,0,0,0,1,1,1,1,0,0,0])
cluster, freq = measurements.label(arr)
print(list(Counter(cluster).values())[1:])
# [3, 2, 4]
Assume you only have 0s and 1s:
import numpy as np
a = np.array([0,0,0,1,1,1,0,0,1,1,0,0,0,1,1,1,1,0,0,0])
# pad a with 0 at both sides for edge cases when a starts or ends with 1
d = np.diff(np.pad(a, pad_width=1, mode='constant'))
# subtract indices when value changes from 0 to 1 from indices where value changes from 1 to 0
np.flatnonzero(d == -1) - np.flatnonzero(d == 1)
# array([3, 2, 4])
A custom implementation?
def count_consecutives(predicate, iterable):
    tmp = []
    for e in iterable:
        if predicate(e):
            tmp.append(e)
        else:
            if len(tmp) > 0: yield len(tmp)  # > 1 if you want at least two consecutive
            tmp = []
    if len(tmp) > 0: yield len(tmp)  # > 1 if you want at least two consecutive
So you can:
array = [0,0,0,1,1,1,0,0,1,1,0,0,0,1,1,1,1,0,0,0]
list(count_consecutives(lambda x: x == 1, array))
#=> [3, 2, 4]
And also:
array = [0,0,0,1,2,3,0,0,3,2,1,0,0,1,11,10,10,0,0,100]
list(count_consecutives(lambda x: x > 1, array))
# => [2, 2, 3, 1]
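For completeness, the standard library alone can also do this run-length counting; a short itertools.groupby sketch:

import numpy as np
from itertools import groupby

a = np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0])

# Group consecutive equal values and keep the lengths of the runs of 1s.
run_lengths = [sum(1 for _ in group) for value, group in groupby(a) if value == 1]
print(run_lengths)   # [3, 2, 4]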

Python: check if columns of array are within boundaries, if not pick a random number within the boundary

I have an array of the form
a = np.array([[1,2],[3,4],[5,6]])
and I have a "domain" or boundary, which is again an array of the form
b = np.array([[0, 4], [3,7]])
Basically I want to check that a[:,0] is within the first row of b and that a[:,1] is within the second row of b. For instance, in this example a[:,0] = [1,3,5], and they are all fine except for 5, which is bigger than 4. Similarly a[:,1] = [2,4,6], and we see that 2 fails because 2 < 3.
So basically I want 0 <= a[:,0] <= 4 and 3 <= a[:,1] <= 7. When a number goes outside these boundaries, I want to replace it with a random number within the boundary.
My try
a[:,0][~np.logical_and(b[0][0] <= a[:,0], a[:,0] <= b[0][1])] = np.random.uniform(b[0][0], b[0][1])
a[:,1][~np.logical_and(b[1][0] <= a[:,1], a[:,1] <= b[1][1])] = np.random.uniform(b[1][0], b[1][1])
Is there a faster/better way?
Approach #1 : Here's one approach -
# Invalid mask where new values are to be put
mask = (a < b[:,0]) | (a > b[:,1])
# Number of invalid ones per column of a
count = mask.sum(0)
# Get lengths for range limits set by b
lens = b[:,1] - b[:,0]
# Scale for uniform random number generation
scale = np.repeat(lens, count)
# Generate random numbers in [0,1)
rand_num = np.random.rand(count.sum())
# Get offset for each set of random numbers. Scale and add offsets to get
#equivalent of all the original code uniform rand number generation
offset = np.repeat(b[:,0], count)
put_num = rand_num*scale + offset
# Finally make a copy as a float array and assign using invalid mask
out = a.copy().astype(float)
out.T[mask.T] = put_num
Sample run -
In [1004]: a
Out[1004]:
array([[1, 2],
       [7, 4],
       [5, 6]])

In [1005]: b
Out[1005]:
array([[ 2,  6],
       [ 5, 12]])

In [1006]: out
Out[1006]:
array([[ 2.9488404 ,  8.97938277],
       [ 4.51508777,  5.69467752],
       [ 5.        ,  6.        ]])
# limits: [2, 6] [5, 12]
Approach #2 : Another approach would be to generate scaled and offsetted random numbers of the same shape as a and simply use np.where alongwith the invalid mask to choose between the generated random numbers and a. The implementation would be a simpler one, like so -
rand_nums = np.random.rand(*a.shape)*(b[:,1] - b[:,0]) + b[:,0]
mask = (a < b[:,0]) | (a > b[:,1])
out = np.where(mask, rand_nums, a)
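Applied to the arrays from the question, Approach #2 behaves as follows (an illustrative run; the replacement values are random, so only the mask is deterministic):

import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])
b = np.array([[0, 4], [3, 7]])   # column 0 must lie in [0, 4], column 1 in [3, 7]

rand_nums = np.random.rand(*a.shape) * (b[:, 1] - b[:, 0]) + b[:, 0]
mask = (a < b[:, 0]) | (a > b[:, 1])
out = np.where(mask, rand_nums, a)

print(mask)
# [[False  True]
#  [False False]
#  [ True False]]
# "out" keeps the valid entries; the 2 is replaced by a draw from [3, 7)
# and the 5 by a draw from [0, 4).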
import numpy as np

a = np.array([[1,2],[3,4],[5,6]])
b = np.array([[0,4], [3,7]])

for iter in range(np.size(a,1)):
    index = np.where(np.logical_or(a[:,iter]<b[0,iter], a[:,iter]>b[1,iter]))
    if len(index)!=0:
        a[index,iter] = np.random.random_integers(b[0,iter], b[1,iter], size=[len(index),1])
This should give you what you need :)

Selecting rows from array under many conditions

I am trying to extract rows from a large numpy array. The columns of the array are obs number, group id (j), time id (t), and some data x_jt.
Here is an example:
import numpy as np
N = 100
T = 100
X = np.vstack((np.array(range(1,N*T+1)),np.repeat(np.array(range(1,N+1)),T), np.tile(np.array(range(1,T+1)),N), np.random.randint(100,size=N*T))).T
If I want to extract all rows from X where group id = 2, I would do
X[np.where(X[:,1] == 2)]
And if I wanted all rows where j = 2 or 3, I could extend that code. However, in my case, I have many group ids (j's) to extract. Specifically, I want to extract all rows where j comes from
samples = np.random.randint(N, size=N) + 1
For example, suppose size = 5 instead of N, and samples = (2,4,5,4,7). What I am after is code that goes through X and selects all rows where j = 2, then j = 4, then j = 5, j = 4, and finally j = 7, and creates a new array with the results. Basically this:
result = []
for j in samples:
    result.extend(X[np.where(X[:,1] == j)])
However, this code is slow when N is large. Do you have any suggestions to speed it up? Thanks!
Without replacement
This could be done with vectorized functions:
import numpy

def contains(X, samples):
    return numpy.vectorize(lambda x: x in samples)(X)

result = X[contains(X[:, 1], set(samples)), :]
With replacement
If you want to do this with replacement just check off one count per sample until there are no more samples (assuming the order does not matter). This way you at least reduce the amount of times you need to iterate over the matrix.
import collections
import itertools

result = []
sample_counts = collections.Counter(samples)
while sum(sample_counts.values()):
    # pick up one of each of the remaining samples and chain their rows
    # together in result
    s = set(key for key, value in sample_counts.items() if value)
    result = itertools.chain(result, X[contains(X[:, 1], s), :])
    sample_counts -= collections.Counter(dict.fromkeys(s, 1))
# create a matrix of the final result
result = numpy.array(list(result))
In that case the only way I can think of that might speed up what you're already doing is preallocating a matrix. So you would do:
It doesn't do exactly what you are describing, but this type of problem is a good candidate for np.in1d. Something like this should work:
result = X[np.in1d(X[:, 1], samples)]
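In newer NumPy versions, np.isin is the recommended spelling of np.in1d; a quick sketch with the question's setup (like np.in1d, it ignores repeats in samples):

import numpy as np

N, T = 100, 100
X = np.vstack((np.arange(1, N * T + 1),
               np.repeat(np.arange(1, N + 1), T),
               np.tile(np.arange(1, T + 1), N),
               np.random.randint(100, size=N * T))).T
samples = np.random.randint(N, size=N) + 1

# Keep every row whose group id (column 1) appears in samples.
result = X[np.isin(X[:, 1], samples)]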
