Generate an array of bit vectors with no repeated columns - python

I have an array of dimensions [batch_size, input_dim] which needs to be filled only with 0s and 1s. I need each column to be distinct from all the other columns. I have taken an approach like the one below:
train_data = np.zeros(shape=[batch, input_dim])
num_of_ones = random.sample(range(input_dim + 1), batch)
for k in range(batch):
    num_of_one = num_of_ones[k]
    for _ in range(num_of_one):
        train_data[k][np.random.randint(0, input_dim)] = 1
Though this guarantees that no column is repeated (owing to the fact that each column has a different number of 1's), there are still many combinations that are left out. For instance, when num_of_one = 1 there are input_dim possibilities, and so on. Another downside of the method I have followed is that both batch_size and input_dim have to be the same (else random.sample throws an error). I do not want to list all possibilities, as that would take forever to complete.
Is there any simple way to achieve the above stated problem?

Observe the binary representation of the numbers from 0 to 7:
000
001
010
011
100
101
110
111
Each row is different! So all we have to do is convert each row to a column, e.g.
arr = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
]
Also, observe that we have used all the unique possibilities. With 3 rows, we cannot add a (2**3 + 1)-th column.
In general, if cols > 2**rows, then we cannot find a unique representation.
You can do something like this:
rows = 3
cols = 8
if 2**rows < cols:
    print('Not possible')

arr = [[None] * cols for _ in range(rows)]
for col_idx in range(cols):
    binary = bin(col_idx)[2:]
    binary = binary.zfill(rows)
    for row_idx in range(rows):
        arr[row_idx][col_idx] = int(binary[row_idx])

for row in arr:
    print(row)
Time Complexity: O(rows * cols)
Space Complexity: O(rows * cols)
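For larger sizes, the same idea can be written with NumPy broadcasting instead of string formatting; a minimal sketch (not part of the original answer):

import numpy as np

rows, cols = 3, 8
assert cols <= 2 ** rows, 'Not possible'

# Bit (rows-1-r) of the column index c gives arr[r][c], matching the loop above.
col_ids = np.arange(cols)
arr = (col_ids[None, :] >> np.arange(rows - 1, -1, -1)[:, None]) & 1
print(arr)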

Why yours doesn't work
Your code has an issue with these lines:
for _ in range(num_of_one):
    train_data[k][np.random.randint(0, input_dim)] = 1
Because you select random positions to set to 1, the same position can repeat, so it's not guaranteed that you'll end up with the right number of ones in each column, and hence you can still have duplicates. This is essentially no better than randomizing the entire array and hoping there are no duplicates.
Solution
You can achieve this via the magic of binary counting: each of the columns is a different number's binary representation. There are some limitations to this, as there would be with any solution, past which it's impossible to have all unique columns.
d = np.arange(input_dim)
random.shuffle(d)
train_data = ((d[:, None] & (1 << np.arange(batch))) > 0).astype(float).T
print(train_data)
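A quick check with small, hypothetical sizes (note that this requires input_dim <= 2**batch, otherwise some bit patterns would have to repeat):

import random
import numpy as np

batch, input_dim = 3, 8
d = np.arange(input_dim)
random.shuffle(d)
train_data = ((d[:, None] & (1 << np.arange(batch))) > 0).astype(float).T
print(train_data.shape)                    # (3, 8)
print(len(set(map(tuple, train_data.T))))  # 8 -> every column is distinct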

You could select a set of distinct numbers (look in itertools) between 0 and 2^input_dim, and use their binary representations to get the sequence of 0's and 1's for each value. Since the numbers selected would be distinct, their binary representations would be distinct as well.
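A minimal sketch of this idea, using random.sample to pick the distinct numbers (the function and variable names here are mine, not from the answer):

import random
import numpy as np

def unique_bit_columns(n_bits, n_cols):
    # Distinct integers guarantee distinct bit patterns, hence distinct columns.
    nums = random.sample(range(1 << n_bits), n_cols)
    # Column j is the n_bits-long binary representation of nums[j] (LSB in row 0).
    return np.array([[(n >> bit) & 1 for n in nums] for bit in range(n_bits)],
                    dtype=float)

print(unique_bit_columns(3, 8))  # shape (3, 8), all 8 columns distinct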

Your best bet is something like np.unpackbits combined with python's random.sample. random.sample samples without replacement and without creating a list of the input. This means that you can use a range object over arbitrarily large integers without any risk of problems, as long as the sample size fits in memory. np.unpackbits then converts the integers into unique bit sequences. This idea is a concrete implementation of @ScottHunter's answer.
Here, batch_size is the number of bits and input_size is the number of samples.
First, decide how many bytes you'll need to generate, and the max integer that you'll need to cover the range. Remember, Python supports arbitrary precision integers, so go crazy:
bytes_size = int(np.ceil(batch_size / 8))
max_int = 1 << batch_size
Now get your unique samples:
samples = random.sample(range(max_int), input_size)
Python integers are full-blown objects with a to_bytes method that will prep your samples for np.unpackbits:
data = np.array([list(x.to_bytes(bytes_size, 'little')) for x in samples], dtype=np.uint8).T
The byte order matters if batch_size is not a multiple of 8: we're going to trim the final array to size.
Now unpack and you're good to go:
# bitorder='little' matches to_bytes(..., 'little'), so trimming keeps the low bits
result = np.unpackbits(data, axis=0, bitorder='little')[:batch_size, :]
Putting it all together into a single package:
import random
import numpy as np

def random_bit_columns(batch_size, input_size):
    samples = random.sample(range(1 << batch_size), input_size)
    # int(...) because to_bytes needs an integer byte count
    data = np.array([list(x.to_bytes(int(np.ceil(batch_size / 8)), 'little'))
                     for x in samples], dtype=np.uint8).T
    # bitorder='little' matches to_bytes(..., 'little'), so trimming keeps the low bits
    result = np.unpackbits(data, axis=0, bitorder='little')[:batch_size, :]
    return result
I'm afraid I can't see a way out of using a list comprehension over the number of columns if you want to have the benefit of python's arbitrary precision integers.
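For example (sizes here are arbitrary; the second print is just a sanity check that the columns are unique):

cols = random_bit_columns(batch_size=10, input_size=20)
print(cols.shape)                    # (10, 20)
print(len(set(map(tuple, cols.T))))  # 20 -> every column is distinct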

Related

Looking for a better way to handle periodic boundary condition on numpy array or list in python

I have a large dataset (a 2-dimensional matrix) of about 5 to 100 rows and 5000 to 25000 columns. I was told to extract a strip out of each row, where the strip length is given. For each row, the strip is filled starting from a random position on the row and going upward; if a position is beyond the length of the row, it picks entries from the beginning, like a periodic boundary. For example, assume a row has 10 elements,
row = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
the position picked is 8 and the strip length is 4. The strip will then be [9, 10, 1, 2].
At first I tried to use NumPy to do the computation:
import time
import random
import numpy as np

A = np.ones((5, 8000), order='F')
L = (4, 3, 3, 3, 4)  # length for each of the 5 strips

starttime = time.process_time()
for i in range(80000):
    B = []
    for c, row in enumerate(A):
        start = random.randint(0, len(row) - 1)
        end = start + L[c]
        if end > len(row) - 1:
            sce = np.zeros(L[c])
            for k in range(start, end):
                sce[k - start] = k % len(row)
        else:
            sce = row[start:end]
        B = sce
print(time.process_time() - starttime)
I don't have a good way to handle the boundary condition, so I just break it into two cases: one where the whole strip is within the row and one where part of the strip is beyond the row. This code works and takes about 1.5 seconds to run. I then tried to use a list instead:
A = [[1] * 8000] * 5

starttime = time.process_time()
for i in range(80000):
    B = []
    for c, row in enumerate(A):
        start = random.randint(0, len(row) - 1)
        end = start + L[c]
        if end > len(row) - 1:
            sce = np.zeros(L[c])
            for k in range(start, end):
                sce[k - start] = k % len(row)
        else:
            sce = row[start:end]
        B = sce
print(time.process_time() - starttime)
This one is about 0.5 seconds faster, which is quite surprising; I expected NumPy to be faster! Both versions are fine for a small matrix and a small number of iterations, but in the real project I will deal with a very large matrix and many more iterations, so I wonder if there is any suggestion to improve the efficiency. Also, is there any suggestion on how to handle the periodic boundary condition (neater and more efficient)?
Considering that you create the array A before timing it, both solutions will be roughly equally fast, because you are just iterating over the array. I am actually not sure why the pure Python solution is quicker; maybe collection-based iterators (enumerate) are better suited to primitive Python types?
Looking at the example with one row, you want to take a range of elements from the row and wrap around the out-of-bounds indices. For this I would suggest doing:
row = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
start = 8
L = 4
np.take(row, np.arange(start, start+L), mode='wrap')
output:
array([ 9, 10, 1, 2])
This behavior can then be extended to 2 dimensions by specifying the axis keyword. But working with uneven lengths in L does make it a bit trickier, because with non-homogeneous arrays you lose most of the benefits of using numpy. The work-around is to partition L so that equal lengths are grouped together.
If I understand the whole task correctly, you are given some start value and you want to extract each corresponding strip length along the second axis of A.
A = np.arange(5*8000).reshape(5, 8000)  # using arange makes it easier to verify output
L = (4, 3, 3, 3, 4)                     # length for each of the 5 strips
parts = ((0, 4), (1, 2, 3))             # partition of L (too lazy to implement this myself atm)
start = 7998                            # arbitrary start position

for part in parts:
    ranges = np.arange(start, start + L[part[0]])
    out = np.take(A[part, :], ranges, axis=-1, mode='wrap')
    print(f'Output for rows {part} with length {L[part[0]]}:\n\n{out}\n')
Output:
Output for rows (0, 4) with length 4:

[[ 7998  7999     0     1]
 [39998 39999 32000 32001]]

Output for rows (1, 2, 3) with length 3:

[[15998 15999  8000]
 [23998 23999 16000]
 [31998 31999 24000]]
Although, it looks like you want a random starting position for each row?
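If a random start per row is what's needed, one possible sketch (not from the original answer) keeps the same grouping but builds a per-row index array and wraps it with a modulo via np.take_along_axis:

import numpy as np

A = np.arange(5 * 8000).reshape(5, 8000)
L = (4, 3, 3, 3, 4)
parts = ((0, 4), (1, 2, 3))              # rows grouped by equal strip length

rng = np.random.default_rng()
starts = rng.integers(0, A.shape[1], size=A.shape[0])  # one random start per row

for part in parts:
    length = L[part[0]]
    rows = np.array(part)
    # One row of indices per group member, offset by that row's own start, wrapped.
    idx = (starts[rows][:, None] + np.arange(length)) % A.shape[1]
    strips = np.take_along_axis(A[rows, :], idx, axis=1)
    print(strips)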

Split a list into n randomly sized chunks

I am trying to split a list into n sublists where the size of each sublist is random (with at least one entry; assume P > I). I used the numpy.split function, which works fine but does not satisfy my randomness condition. You may ask which distribution the randomness should follow; I think it should not matter. I checked several posts, but they were not equivalent to mine as they were trying to split into almost equally sized chunks. If this is a duplicate, let me know. Here is my approach:
import numpy as np
P = 10
I = 5
mylist = range(1, P + 1)
[list(x) for x in np.split(np.array(mylist), I)]
This approach fails when P is not divisible by I. Further, it creates equally sized chunks, not probabilistically sized chunks. Another constraint: I do not want to use the package random, but I am fine with numpy. Don't ask me why; I wish I had a logical response for it.
Based on the answer provided by the mad scientist, this is the code I tried:
P = 10
I = 5
data = np.arange(P) + 1
indices = np.arange(1, P)
np.random.shuffle(indices)
indices = indices[:I - 1]
result = np.split(data, indices)
result
Output:
[array([1, 2]),
 array([3, 4, 5, 6]),
 array([], dtype=int32),
 array([4, 5, 6, 7, 8, 9]),
 array([10])]
The problem can be refactored as choosing I-1 random split points from {1,2,...,P-1}, which can be viewed using stars and bars.
Therefore, it can be implemented as follows:
import numpy as np

# I-1 distinct split points drawn from {1, ..., P-1}
split_points = np.random.choice(P - 1, I - 1, replace=False) + 1
split_points.sort()
result = np.split(data, split_points)
np.split is still the way to go. If you pass in a sequence of integers, split will treat them as cut points. Generating random cut points is easy. You can do something like
P = 10
I = 5
data = np.arange(P) + 1
indices = np.random.randint(P, size=I - 1)
You want I - 1 cut points to get I chunks. The indices need to be sorted, and duplicates need to be removed. np.unique does both for you. You may end up with fewer than I chunks this way:
indices = np.unique(indices)
result = np.split(data, indices)
If you absolutely need to have I numbers, choose without resampling. That can be implemented, for example, via np.random.shuffle:
indices = np.arange(1, P)
np.random.shuffle(indices)
indices = indices[:I - 1]
indices.sort()
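Putting the pieces together (a sketch; choosing without replacement via np.random.choice is equivalent to the shuffle-and-slice above):

import numpy as np

def random_chunks(P, I):
    # Split 1..P into exactly I non-empty chunks with random sizes.
    data = np.arange(P) + 1
    cut_points = np.sort(np.random.choice(np.arange(1, P), I - 1, replace=False))
    return np.split(data, cut_points)

print(random_chunks(10, 5))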

Is there a faster way to do this pseudo code efficiently in python numpy?

I have three arrays called RowIndex, ColIndex and Entry in numpy. Essentially, these hold a subset of entries from a matrix: the row indexes, column indexes, and value of each entry, respectively. I have two numpy 2D arrays (matrices) U and M. Let alpha and beta be two given constants. I need to iterate through the subset of entries of the matrix, which is possible if I iterate through RowIndex, ColIndex and Entry. Say,
i=RowIndex[0], j=ColIndex[0], value = Entry[0]
then I need to update i'th row and j'th column of U and M respectively according to some equation. Then, I make
i=RowIndex[1], j=ColIndex[1], value = Entry[1]
and so on. The detail is below.
for iter in np.arange(len(RowIndex)):
    i = RowIndex[iter]
    j = ColIndex[iter]
    value = Entry[iter]
    e = value - np.dot(U[i, :], M[:, j])
    OldUi = U[i, :].copy()  # copy so the M update below uses the old U values
    OldMj = M[:, j].copy()
    U[i, :] = OldUi + beta * (e * OldMj - alpha * OldUi)
    M[:, j] = OldMj + beta * (e * OldUi - alpha * OldMj)
The problem is that the code is extremely slow. Is there any portion of code where I can speed this up?
PS: For the curious, this is a variant of the prize-winning solution to the famous Netflix million-dollar prize problem. RowIndex corresponds to users, ColIndex corresponds to movies, and Entry to their ratings. Most of the ratings are missing; the known ratings are stacked up in RowIndex, ColIndex and Entry. Now you try to find matrices U and M such that the rating of the i'th user for the j'th movie is given by np.dot(U[i,:], M[:,j]). Based on the available ratings, you try to find the matrices U and M (or their rows and columns) using an update equation as shown in the above code.
If I didn't misunderstand, I think your code can be vectorized as follows:
import numpy as np

# U, M: two 2D matrices
# rows_idx, cols_idx: lists of indexes
# values: np.array() of values

e = values - np.dot(U[rows_idx, :], M[:, cols_idx]).diagonal()
Uo = U.copy()
Mo = M.copy()
U[rows_idx, :] += beta * ((e * Mo[:, cols_idx]).T - alpha * Uo[rows_idx, :])
M[:, cols_idx] += beta * ((e * Uo[rows_idx, :].T) - alpha * Mo[:, cols_idx])
Here,
e = values - np.dot(U[rows_idx, :], M[:, cols_idx]).diagonal()
computes your
e = value - np.dot(U[i,:],M[:,j])
Note that the result you want resides in the diagonal of the dot product between matrices.
This won't handle sequential updates (for those there is no available vectorization), but it will allow you to perform a batch of independent updates in a vectorized and faster way.
As stated above, the code I proposed to you can't handle sequential updates, because by definition, a sequential updating scheme can't be vectorized. Anything of the form
A(t) = A(t-1) +/* something
where t defines time, can't be updated in parallel.
So, what I proposed, is a vectorized update for independent updates.
Imagine you have M and U of shape 10x10 each, and you have the following row and column indexes:
rows_idx = [1, 1, 3, 4, 5, 0]
cols_idx = [7, 1, 7, 5, 6, 5]
You can identify from there two independent sets (considering that indexes are ordered):
rows_idx = [1, 4, 5], [1, 3, 0]
cols_idx = [7, 5, 6], [1, 7, 5]
Note that independent sets are made of row and column indexes that are unique within the set. With that definition, you can reduce the number of loop iterations you need from 6 (in this case) to 2:
for i in range(len(rows_idx)):
    ridx = rows_idx[i]
    cidx = cols_idx[i]
    # Use the vectorized scheme proposed above the edit
    e = values - np.dot(U[ridx, :], M[:, cidx]).diagonal()
    Uo = U.copy()
    Mo = M.copy()
    U[ridx, :] += beta * ((e * Mo[:, cidx]).T - alpha * Uo[ridx, :])
    M[:, cidx] += beta * ((e * Uo[ridx, :].T) - alpha * Mo[:, cidx])
So, whether you have a way of manually (or easily) extracting the independent updates, or you compute the groups using a search algorithm, the above code will vectorize the independent updates.
For clarification just in case, in the above example:
rows_idx = [1, 1, 3, 4, 5, 0]
cols_idx = [7, 1, 7, 5, 6, 5]
The 2nd row can't be parallelized because 1 has appeared before, and the 3rd and last columns can't be parallelized for the same reason (with 7 and 5). So, as both rows and columns need to be unique, we end up with 2 sets of tuples:
rows_idx = [1, 4, 5], [1, 3, 0]
cols_idx = [7, 5, 6], [1, 7, 5]
From here, the way to go depends on your data. The problem of finding independent sets could be very expensive, especially if most of them depend on some previous updates.
If you have a way, from how your data is generated (say it is recorded over time), to extract independent sets, then the batch update will help you. On the other hand, if you have your data all together (which is common), it will depend on one factor:
If you can ensure that the size of the independent sets N is much larger than the number of independent sets M (which more or less means that you end up with a few, say M = {2,3,4}, independent sets for your N = 100000 row/col indexes, with N >> M), then it might be worth looking for independent sets.
In other words, if you are going to update 30 users and 30 movies in 10000 different combinations, then your data is likely to depend on previous updates; however, if you are going to update 100000 users and 100000 movies in 30 combinations, then your data is likely to be independent.
Some pseudocode to find independent sets, if you don't have a way of extracting them beforehand, would be something like this:
independent_sets = []  # list with sets
for row, col in zip(rows_idx, cols_idx):
    for iset in independent_sets:
        if row and col DONT exist in iset:
            insert row and col
            break
    if nothing inserted:
        add new set to independent_sets
        add current (row, col) to the new set
As you can see, in order to find independent sets you already need to iterate over the whole list of row/column indexes. The pseudocode above is not the most efficient, and I'm pretty sure there are specific algorithms for this. But the cost of finding independent sets might be higher than doing all your sequential updates if your updates are likely to depend on previous ones.
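For reference, a runnable version of that greedy sketch (the function name and the grouping-by-pairs output format are my own choices):

def find_independent_sets(rows_idx, cols_idx):
    sets = []  # each entry: (used_rows, used_cols, list of (row, col) pairs)
    for row, col in zip(rows_idx, cols_idx):
        for used_rows, used_cols, pairs in sets:
            if row not in used_rows and col not in used_cols:
                used_rows.add(row)
                used_cols.add(col)
                pairs.append((row, col))
                break
        else:
            # No existing set can take this update without a clash: start a new one.
            sets.append(({row}, {col}, [(row, col)]))
    return [pairs for _, _, pairs in sets]

print(find_independent_sets([1, 1, 3, 4, 5, 0], [7, 1, 7, 5, 6, 5]))
# [[(1, 7), (4, 5), (5, 6)], [(1, 1), (3, 7), (0, 5)]] -- the two sets above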
To finish: after the whole post, it entirely depends on your data.
If, from the way you obtain the rows/columns you want to update, you can extract independent sets beforehand, then you can easily update them in a vectorized way.
If you can ensure that most of your updates will be independent (say, 990 out of 10000 will be), it might be worth trying to find that set of 990. One way to approximate the set is by using np.unique:
# Just get the index of the unique rows and columns
_, idx_rows = np.unique(rows_idx, return_index=True)
_, idx_cols = np.unique(cols_idx, return_index=True)
# Get the index where both rows and columns are unique
idx = np.intersect1d(idx_rows, idx_cols)
Now idx contains the positions in rows_idx and cols_idx that are unique; hopefully this can reduce your computational cost a lot. You can use my batch update to quickly update the rows and columns corresponding to those indexes, and then use your initial approach to update the hopefully few entries that are repeated, iterating over the non-unique indexes.
If you have multiple updates for the same users or movies, then... keep your sequential update scheme, as finding independent sets will be harder than the iterative update.

Selecting rows from array under many conditions

I am trying to extract rows from a large numpy array. The columns of the array are obs number, group id (j), time id (t), and some data x_jt.
Here is an example:
import numpy as np
N = 100
T = 100
X = np.vstack((np.array(range(1, N*T + 1)),
               np.repeat(np.array(range(1, N + 1)), T),
               np.tile(np.array(range(1, T + 1)), N),
               np.random.randint(100, size=N*T))).T
If I want to extract all rows from X where group id = 2, I would do
X[np.where(X[:,1] == 2)]
And if I wanted all rows where j = 2 or 3, I could extend that code. However, in my case, I have many group ids (j's) to extract. Specifically, I want to extract all rows where j comes from
samples = np.random.randint(N, size=N) + 1
For example, suppose size = 5 instead of N, and samples = (2,4,5,4,7). What I am after is code that goes through X and selects all rows where j = 2, then j = 4, then j = 5, j = 4, and finally j = 7, and creates a new array with the results. Basically this:
result = []
for j in samples:
    result.extend(X[np.where(X[:,1] == j)])
However, this code is slow when N is large. Do you have any suggestions to speed it up? Thanks!
Without replacement
This could be done with vectorized functions:
import numpy

def contains(X, samples):
    return numpy.vectorize(lambda x: x in samples)(X)

result = X[contains(X[:, 1], set(samples)), :]
With replacement
If you want to do this with replacement, just check off one count per sample until there are no samples left (assuming the order does not matter). This way you at least reduce the number of times you need to iterate over the matrix.
import collections
import itertools

result = []
sample_counts = collections.Counter(samples)
while sum(sample_counts.values()):
    # pick up one of each of the remaining samples and chain their rows
    # together in result
    s = set(key for key, value in sample_counts.items() if value)
    result = itertools.chain(result, X[contains(X[:, 1], s), :])
    sample_counts -= collections.Counter(dict.fromkeys(s, 1))

# create a matrix of the final result
result = numpy.array(list(result))
In that case, the only other way I can think of that might speed up what you're already doing is preallocating the result matrix instead of growing a list.
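A rough sketch of that preallocation idea (the original answer doesn't show the code; shapes and dtype here are taken from the question):

import numpy as np

# Count how many rows each sample contributes, then fill a preallocated array.
counts = np.array([(X[:, 1] == j).sum() for j in samples])
result = np.empty((counts.sum(), X.shape[1]), dtype=X.dtype)
pos = 0
for j, n in zip(samples, counts):
    result[pos:pos + n] = X[X[:, 1] == j]
    pos += n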
It doesn't do exactly what you are describing, but this type of problem is a good candidate for np.in1d. Something like this should work:
result = X[np.in1d(X[:, 1], samples)]
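Note that np.in1d keeps X's original row order and ignores duplicates in samples; a quick check with small, hypothetical sizes:

import numpy as np

N, T = 4, 3
X = np.vstack((np.arange(1, N*T + 1),
               np.repeat(np.arange(1, N + 1), T),
               np.tile(np.arange(1, T + 1), N),
               np.random.randint(100, size=N*T))).T
samples = (2, 4, 4)
result = X[np.in1d(X[:, 1], samples)]
print(result[:, 1])  # [2 2 2 4 4 4] -- group 4 appears once, in X's order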

Min-Max difference in continuous part of certain length within a np.array

I have a numpy array of values like this:
a = np.array((1, 3, 4, 5, 10))
In this case the array has length 5. Now I want to know the difference between the lowest and highest value in the array, but only within a certain continuous part of the array, for example with length 3.
So in this case it would be the difference between 4 and 10, so 6. It would also be nice to have the index of the starting point of the continuous part (in the above example that would be 2). So something like this:
def f(a, length_of_part):
    ...
    return (max_difference, starting_index)
I know I could iterate over sliced parts of the array, but for my actual purpose I have ~150k arrays of length 1500, so that would take too long.
What would be an easy and quick way of doing this?
Thanks in advance!
This is a bit tricky to do in a vectorised way in NumPy. One option is to use numpy.lib.stride_tricks.as_strided, which requires care because it allows access to arbitrary memory. Here's an example for a window size of k = 3:
>>> k = 3
>>> shape = (len(a) - k + 1, k)
>>> b = numpy.lib.stride_tricks.as_strided(
...     a, shape=shape, strides=(a.itemsize, a.itemsize))
>>> moving_ptp = b.ptp(axis=1)
>>> start_index = moving_ptp.argmax()
>>> moving_ptp[start_index]
6
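Wrapped into the function signature the question asks for (a sketch; it assumes a one-dimensional, contiguous array):

import numpy as np

def f(a, length_of_part):
    k = length_of_part
    shape = (len(a) - k + 1, k)
    windows = np.lib.stride_tricks.as_strided(
        a, shape=shape, strides=(a.strides[0], a.strides[0]))
    moving_ptp = np.ptp(windows, axis=1)
    start_index = moving_ptp.argmax()
    return moving_ptp[start_index], start_index

print(f(np.array((1, 3, 4, 5, 10)), 3))  # (6, 2)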
