Split a list into n randomly sized chunks - python

I am trying to split a list into n sublists where the size of each sublist is random (with at least one entry; assume P > I). I used the numpy.split function, which works but does not satisfy my randomness condition. You may ask which distribution the randomness should follow; I think it should not matter. I checked several related posts, but they were about splitting into almost equally sized chunks, which is not what I want. If this is a duplicate, let me know. Here is my approach:
import numpy as np
P = 10
I = 5
mylist = range(1, P + 1)
[list(x) for x in np.split(np.array(mylist), I)]
This approach breaks when P is not divisible by I, and it creates equally sized chunks rather than randomly sized ones. Another constraint: I do not want to use the random package, but I am fine with numpy. Don't ask me why; I wish I had a logical reason for it.
Based on the answer provided by the mad scientist, this is the code I tried:
P = 10
I = 5
data = np.arange(P) + 1
indices = np.arange(1, P)
np.random.shuffle(indices)
indices = indices[:I - 1]
result = np.split(data, indices)
result
Output:
[array([1, 2]),
array([3, 4, 5, 6]),
array([], dtype=int32),
array([4, 5, 6, 7, 8, 9]),
array([10])]

The problem can be rephrased as choosing I - 1 distinct split points from {1, 2, ..., P - 1}, which can be viewed through a stars-and-bars argument.
Therefore, it can be implemented as follows:
import numpy as np

# choose I - 1 distinct cut points from the P - 1 interior gaps {1, ..., P - 1}
split_points = np.random.choice(P - 1, I - 1, replace=False) + 1
split_points.sort()
result = np.split(data, split_points)
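Because the cut points are distinct and lie strictly inside the array, each of the I chunks is guaranteed to be non-empty. A quick sanity check (reusing data, P, I, and result from above):
# every chunk is non-empty and the chunk lengths add back up to P
assert len(result) == I
assert all(len(chunk) > 0 for chunk in result)
assert sum(len(chunk) for chunk in result) == P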

np.split is still the way to go. If you pass in a sequence of integers, split will treat them as cut points. Generating random cut points is easy. You can do something like
P = 10
I = 5
data = np.arange(P) + 1
indices = np.random.randint(P, size=I - 1)
You want I - 1 cut points to get I chunks. The indices need to be sorted, and duplicates need to be removed. np.unique does both for you. You may end up with fewer than I chunks this way:
indices = np.unique(indices)   # sorts the cut points and drops duplicates
result = np.split(data, indices)
If you absolutely need to have exactly I chunks, choose the cut points without replacement. That can be done, for example, via np.random.shuffle:
indices = np.arange(1, P)
np.random.shuffle(indices)
indices = indices[:I - 1]
indices.sort()
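Putting the pieces together, here is a minimal, self-contained sketch of the shuffle-based approach (the names match the snippets above):
import numpy as np

P = 10
I = 5
data = np.arange(P) + 1

# pick I - 1 distinct interior cut points, then sort them for np.split
indices = np.arange(1, P)
np.random.shuffle(indices)
indices = np.sort(indices[:I - 1])

result = np.split(data, indices)
print([list(chunk) for chunk in result])  # I chunks, each with at least one element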

Related

Looking for a better way to handle periodic boundary condition on numpy array or list in python

I have a large dataset (a 2-dimensional matrix) of about 5 to 100 rows and 5000 to 25000 columns. I need to extract a strip of a given length from each row. For each row, the strip is filled starting from a random position and moving forward; if an index goes beyond the end of the row, it wraps around and picks entries from the beginning, like a periodic boundary. For example, assume a row has 10 elements,
row = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
the picked position is 8 and the strip length is 4. The strip will then be [9, 10, 1, 2].
I am trying to use NumPy to do the computation at first
import random
import time

import numpy as np

A = np.ones((5, 8000), order='F')
L = (4, 3, 3, 3, 4)  # length for each of the 5 strips

starttime = time.process_time()
for i in range(80000):
    B = []
    for c, row in enumerate(A):
        start = random.randint(0, len(row) - 1)
        end = start + L[c]
        if end > len(row) - 1:
            # strip runs past the end of the row: wrap around manually
            sce = np.zeros(L[c])
            for k in range(start, end):
                sce[k - start] = row[k % len(row)]
        else:
            sce = row[start:end]
        B.append(sce)
print(time.process_time() - starttime)
I don't have a good way to handle the boundary condition, so I just break it into two cases: one where the whole strip lies within the row and one where part of the strip extends beyond it. This code works and takes about 1.5 seconds to run. I then tried using lists instead:
A = [[1] * 8000] * 5

starttime = time.process_time()
for i in range(80000):
    B = []
    for c, row in enumerate(A):
        start = random.randint(0, len(row) - 1)
        end = start + L[c]
        if end > len(row) - 1:
            # same manual wrap-around as above
            sce = np.zeros(L[c])
            for k in range(start, end):
                sce[k - start] = row[k % len(row)]
        else:
            sce = row[start:end]
        B.append(sce)
print(time.process_time() - starttime)
This one is about 0.5 seconds faster, which is quite surprising; I expected NumPy to be faster! Both versions are fine for a small matrix and a small number of iterations, but in the real project I will be dealing with a much larger matrix and many more iterations, so I wonder if there are any suggestions to improve the efficiency. Also, is there a neater and more efficient way to handle the periodic boundary condition?
Considering that you create the array A before timing it, both solutions should be roughly equally fast, because you are just iterating over the array. I am actually not sure why the pure Python solution is quicker; maybe it has to do with collection-based iterators (enumerate) being better suited to primitive Python types?
Looking at the example with one row, you want to take a range of elements from the row and wrap around the out-of-bounds indices. For this I would suggest doing:
row = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
start = 8
L = 4
np.take(row, np.arange(start, start+L), mode='wrap')
output:
array([ 9, 10, 1, 2])
This behavior can then be extended to 2 dimensions by specifying the axis keyword. But working with uneven lengths in L does make it a bit trickier, because with non-homogeneous arrays you lose most of the benefits of using numpy. The work-around is to partition L so that equal lengths are grouped together.
If I understand the whole task correctly, you are given some start value and, for each row of A, you want to extract a strip of the corresponding length along the second axis.
A = np.arange(5*8000).reshape(5, 8000)  # using arange makes it easier to verify output
L = (4, 3, 3, 3, 4)          # length for each of the 5 strips
parts = ((0, 4), (1, 2, 3))  # partition of L (too lazy to implement this myself atm)
start = 7998                 # arbitrary start position

for part in parts:
    ranges = np.arange(start, start + L[part[0]])
    out = np.take(A[part, :], ranges, axis=-1, mode='wrap')
    print(f'Output for rows {part} with length {L[part[0]]}:\n\n{out}\n')
Output:
Output for rows (0, 4) with length 4:

[[ 7998  7999     0     1]
 [39998 39999 32000 32001]]

Output for rows (1, 2, 3) with length 3:

[[15998 15999  8000]
 [23998 23999 16000]
 [31998 31999 24000]]
Although, it looks like you want a random starting position for each row?
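If each row does need its own random start, here is a minimal sketch along the same lines (the grouping of rows by strip length is done with a plain dictionary, and the wrap-around is handled with modular column indices instead of np.take):
import numpy as np

A = np.arange(5 * 8000).reshape(5, 8000)
L = (4, 3, 3, 3, 4)

# group row indices by strip length, e.g. {4: [0, 4], 3: [1, 2, 3]}
groups = {}
for row_idx, length in enumerate(L):
    groups.setdefault(length, []).append(row_idx)

rng = np.random.default_rng()
for length, rows in groups.items():
    rows = np.array(rows)
    starts = rng.integers(0, A.shape[1], size=len(rows))      # one random start per row
    cols = (starts[:, None] + np.arange(length)) % A.shape[1]  # wrap-around column indices
    out = A[rows[:, None], cols]                               # shape (len(rows), length)
    print(f'rows {rows.tolist()} (length {length}):\n{out}\n')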

Generate an array of bit vectors with no repeated columns

I have an array of dimensions [batch_size, input_dim] which needs to be filled only with 0s and 1s. Each column must be distinct from all the other columns. I have taken an approach like the one below:
import random
import numpy as np

train_data = np.zeros(shape=[batch, input_dim])
num_of_ones = random.sample(range(input_dim + 1), batch)
for k in range(batch):
    num_of_one = num_of_ones[k]
    for _ in range(num_of_one):
        train_data[k][np.random.randint(0, input_dim)] = 1
Though this guarantees that no two vectors are repeated (owing to the fact that each one gets a different number of 1s), there are still many combinations that are left out. For instance, when num_of_one = 1 there are input_dim possibilities, and so on. Another downside of the method I have followed is that batch_size and input_dim have to be the same (else random.sample throws an error). I do not want to list down all the possibilities, as that would take forever.
Is there a simple way to solve the problem stated above?
Observe the binary representation of the numbers from 0 to 7:
000
001
010
011
100
101
110
111
Each row is different! So all we have to do is turn each row into a column, e.g.
arr = [
[0, 0, 0, 0, 1, 1, 1, 1],
[0, 0, 1, 1, 0, 0, 1, 1],
[0, 1, 0, 1, 0, 1, 0, 1],
]
Also, observe that we have used all the unique possibilities: with 3 rows, we cannot add a (2**3 + 1)-th column.
In general, if cols > 2**rows, then we cannot find a unique representation for every column.
You can do something like this:
rows = 3
cols = 8

if 2**rows < cols:
    print('Not possible')

arr = [[None] * cols for _ in range(rows)]
for col_idx in range(cols):
    binary = bin(col_idx)[2:]       # binary representation of the column index
    binary = binary.zfill(rows)     # left-pad with zeros to `rows` bits
    for row_idx in range(rows):
        arr[row_idx][col_idx] = int(binary[row_idx])

for row in arr:
    print(row)
Time Complexity: O(rows * cols)
Space Complexity: O(rows * cols)
Why yours doesn't work
Yours has an issue with these lines:
for _ in range(num_of_one):
    train_data[k][np.random.randint(0, input_dim)] = 1
Because you select random positions to set to 1, the same position can be chosen more than once, so it is not guaranteed that you end up with the intended number of ones in each vector; hence you can still have duplicates. This is essentially no better than randomizing the entire array and hoping there are no duplicates.
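A tiny illustration of the collision problem (the sizes here are made up for the example):
import numpy as np

input_dim = 5
positions = np.random.randint(0, input_dim, size=4)  # 4 draws, duplicates are possible
row = np.zeros(input_dim)
row[positions] = 1
# whenever `positions` contains a repeat, the row ends up with fewer than 4 ones
print(positions, row, int(row.sum()))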
Solution
You can achieve this via the magic of binary counting: each column is a different number's binary representation. There are some limitations to this, as there would be with any solution, since it is impossible to have more unique columns than 2**batch.
import random
import numpy as np

d = np.arange(input_dim)
random.shuffle(d)
# bit j of each shuffled number becomes row j, so columns are distinct bit patterns
train_data = ((d[:, None] & (1 << np.arange(batch))) > 0).astype(float).T
print(train_data)
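A quick, self-contained check that the columns really are distinct (batch and input_dim are example values; input_dim must not exceed 2**batch):
import random
import numpy as np

batch, input_dim = 4, 10
d = np.arange(input_dim)
random.shuffle(d)
train_data = ((d[:, None] & (1 << np.arange(batch))) > 0).astype(float).T

# each column encodes a different number, so no two columns can be equal
assert train_data.shape == (batch, input_dim)
assert len({tuple(col) for col in train_data.T}) == input_dim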
You could select a set of distinct numbers (look in itertools) between 0 and 2^input_dim, and use their binary representations to get the sequence of 0's and 1's for each value. Since the numbers selected would be distinct, their binary representations would be distinct as well.
Your best bet is something like np.unpackbits combined with Python's random.sample. random.sample samples without replacement and without materializing its input as a list, so you can pass a range object over arbitrarily large integers without any risk of problems, as long as the sample size fits in memory. np.unpackbits then converts the integers into unique bit sequences. This idea is a concrete implementation of #ScottHunter's answer.
batch_size = number_of_bits
input_size = number_of_samples
First, decide how many bytes you'll need to generate, and the max integer that you'll need to cover the range. Remember, Python supports arbitrary precision integers, so go crazy:
bytes_size = int(np.ceil(batch_size / 8))
max_int = 1 << batch_size
Now get your unique samples:
samples = random.sample(range(max_int), input_size)
Python integers are full blown objects with a to_bytes method that will prep your samples for np.unpackbits:
data = np.array([list(x.to_bytes(bytes_size, 'little')) for x in samples], dtype=np.uint8).T
The byte order matters if batch_size is not a multiple of 8: we're going to trim the final array to size.
Now unpack and you're good to go:
result = np.unpackbits(data, axis=0, bitorder='little')[:batch_size, :]  # keep the low batch_size bits
Putting it all together into a single package:
import random
import numpy as np

def random_bit_columns(batch_size, input_size):
    samples = random.sample(range(1 << batch_size), input_size)
    data = np.array([list(x.to_bytes(int(np.ceil(batch_size / 8)), 'little')) for x in samples], dtype=np.uint8).T
    result = np.unpackbits(data, axis=0, bitorder='little')[:batch_size, :]
    return result
I'm afraid I can't see a way out of using a list comprehension over the number of columns if you want to have the benefit of python's arbitrary precision integers.
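A short usage sketch of the function above (the sizes are illustrative):
cols = random_bit_columns(batch_size=10, input_size=5)
print(cols.shape)                              # (10, 5): one 10-bit column per sample
assert len({tuple(c) for c in cols.T}) == 5    # all columns are distinct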

Matlab to Python Matrix Code

I am trying to translate some code from MATLAB to Python. I have been stumped on this part of the MATLAB code:
[L,N] = size(Y);
if (L < p)
    error('Insufficient number of columns in y');
end
I understand that [L,N] = size(Y) returns the number of rows and columns when Y is a matrix. However, I have limited experience with Python and cannot work out how to do the same there, which is also partly why I do not understand how the MATLAB logic in the if-block can be reproduced in Python.
Thank you in advance!
Also, in case the rest of the code is also needed. Here it is.
function [M,Up,my,sing_values] = mvsa(Y,p,varargin)

if (nargin - length(varargin)) ~= 2
    error('Wrong number of required parameters');
end

% data set size
[L,N] = size(Y)
if (L < p)
    error('Insufficient number of columns in y');
end
I am still unclear as to what p is from your post; however, the excerpt below effectively performs the same task as your MATLAB code in Python. Using numpy, you can represent a matrix as an array of arrays and then read .shape to get the number of rows and columns, respectively.
import numpy as np

p = 2
Y = np.array([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]])
L, N = Y.shape
if L < p:
    print('Insufficient number of columns in y')
Non-numpy:
data = [[1, 2], [3, 4], [5, 6]]
L, N = len(data), len(data[0])
p = 2
if L < p:
    raise ValueError("Insufficient number of columns in y")
number_of_rows = len(Y)
number_of_cols = len(Y[0])

Selecting rows from array under many conditions

I am trying to extract rows from a large numpy array. The columns of the array are the observation number, the group id (j), the time id (t), and some data x_jt.
Here is an example:
import numpy as np
N = 100
T = 100
X = np.vstack((np.array(range(1, N*T + 1)),
               np.repeat(np.array(range(1, N + 1)), T),
               np.tile(np.array(range(1, T + 1)), N),
               np.random.randint(100, size=N*T))).T
If I want to extract all rows from X where group id = 2, I would do
X[np.where(X[:,1] == 2)]
And if I wanted all rows where j = 2 or 3, I could extend that code. However, in my case, I have many group ids (j's) to extract. Specifically, I want to extract all rows where j comes from
samples = np.random.randint(N, size=N) + 1
For example, suppose size = 5 instead of N, and samples = (2,4,5,4,7). What I am after is code that goes through X and selects all rows where j = 2, then j = 4, then j = 5, j = 4, and finally j = 7, and creates a new array with the results. Basically this:
result = []
for j in samples:
    result.extend(X[np.where(X[:,1] == j)])
However, this code is slow when N is large. Do you have any suggestions to speed it up? Thanks!
Without replacement
This could be done with vectorized functions:
import numpy

def contains(X, samples):
    return numpy.vectorize(lambda x: x in samples)(X)

result = X[contains(X[:, 1], set(samples)), :]
With replacement
If you want to do this with replacement, just check off one count per sample until there are no more samples left (assuming the order does not matter). This way you at least reduce the number of times you need to iterate over the matrix.
import collections
import itertools

result = []
sample_counts = collections.Counter(samples)
while sum(sample_counts.values()):
    # pick up one of each of the remaining samples and chain their rows
    # together in result
    s = set(key for key, value in sample_counts.items() if value)
    result = itertools.chain(result, X[contains(X[:, 1], s), :])
    sample_counts -= collections.Counter(dict.fromkeys(s, 1))

# create a matrix of the final result
result = numpy.array(list(result))
In that case, the only other thing I can think of that might speed up what you're already doing is preallocating the result matrix.
It doesn't do exactly what you are describing, but this type of problem is a good candidate for np.in1d. Something like this should work:
result = X[np.in1d(X[:, 1], samples)]
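If the duplicates in samples matter (e.g. j = 4 appearing twice should contribute its rows twice), a minimal sketch that precomputes the row indices of each group once and then concatenates them in sample order (the intermediate names here are illustrative):
import numpy as np

# row indices of X grouped by group id, computed from one stable sort
order = np.argsort(X[:, 1], kind='stable')
groups, starts = np.unique(X[order, 1], return_index=True)
ends = np.append(starts[1:], len(order))
group_rows = {j: order[s:e] for j, s, e in zip(groups, starts, ends)}

# concatenate the rows for every sampled j, keeping duplicates and their order
result = X[np.concatenate([group_rows[j] for j in samples])]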

Min-Max difference in continuous part of certain length within a np.array

I have a numpy array of values like this:
a = np.array((1, 3, 4, 5, 10))
In this case the array has length 5. Now I want to know the difference between the lowest and highest value in the array, but only within a certain continuous part of the array, for example with length 3.
So in this case it would be the difference between 4 and 10, so 6. It would also be nice to have the index of the starting point of the continuous part (in the above example that would be 2). So something like this:
def f(a, length_of_part):
    ...
    return (max_difference, starting_index)
I know I could iterate over sliced parts of the array, but for my actual purpose I have ~150k arrays of length 1500, so that would take too long.
What would be an easy and quick way of doing this?
Thanks in advance!
This is a bit tricky to do in a vectorised way in NumPy. One option is numpy.lib.stride_tricks.as_strided, which requires care because it allows access to arbitrary memory. Here's an example for a window size of k = 3:
>>> import numpy
>>> a = numpy.array((1, 3, 4, 5, 10))
>>> k = 3
>>> shape = (len(a) - k + 1, k)
>>> b = numpy.lib.stride_tricks.as_strided(
...     a, shape=shape, strides=(a.itemsize, a.itemsize))
>>> moving_ptp = b.ptp(axis=1)   # peak-to-peak (max - min) of each window
>>> start_index = moving_ptp.argmax()
>>> start_index
2
>>> moving_ptp[start_index]
6
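On NumPy 1.20 or newer, a minimal sketch of the same idea can be written without manual stride arithmetic via sliding_window_view:
import numpy as np

a = np.array((1, 3, 4, 5, 10))
k = 3

windows = np.lib.stride_tricks.sliding_window_view(a, k)  # shape (len(a) - k + 1, k)
moving_ptp = windows.ptp(axis=1)             # max - min within each window
start_index = moving_ptp.argmax()
print(start_index, moving_ptp[start_index])  # 2 6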
