I am trying to extract rows from a large numpy array. The columns of the array are obs number, group id (j), time id (t), and some data x_jt.
Here is an example:
import numpy as np
N = 100
T = 100
X = np.vstack((
    np.array(range(1, N*T + 1)),
    np.repeat(np.array(range(1, N + 1)), T),
    np.tile(np.array(range(1, T + 1)), N),
    np.random.randint(100, size=N*T),
)).T
If I want to extract all rows from X where group id = 2, I would do
X[np.where(X[:,1] == 2)]
And if I wanted all rows where j = 2 or 3, I could extend that code. However, in my case, I have many group ids (j's) to extract. Specifically, I want to extract all rows where j comes from
samples = np.random.randint(N, size=N) + 1
For example, suppose size = 5 instead of N, and samples = (2,4,5,4,7). What I am after is code that goes through X and selects all rows where j = 2, then j = 4, then j = 5, j = 4, and finally j = 7, and creates a new array with the results. Basically this:
result = []
for j in samples:
    result.extend(X[np.where(X[:, 1] == j)])
However, this code is slow when N is large. Do you have any suggestions to speed it up? Thanks!
Without replacement
This could be done with vectorized functions:
def contains(X, samples):
    return np.vectorize(lambda x: x in samples)(X)

result = X[contains(X[:, 1], set(samples)), :]
With replacement
If you want to do this with replacement, just check off one count per sample until there are no samples left (assuming the order does not matter). This way you at least reduce the number of times you need to iterate over the matrix.
import collections
import itertools

result = []
sample_counts = collections.Counter(samples)
while sum(sample_counts.values()):
    # pick up one of each of the remaining samples and chain their rows
    # together in result
    s = set(key for key, value in sample_counts.items() if value)
    result = itertools.chain(result, X[contains(X[:, 1], s), :])
    sample_counts -= collections.Counter(dict.fromkeys(s, 1))

# create a matrix of the final result
result = np.array(list(result))
In that case, the only way I can think of that might speed up what you're already doing is preallocating the result matrix instead of growing a list.
It doesn't do exactly what you are describing, but this type of problem is a good candidate for np.in1d. Something like this should work:
result = X[np.in1d(X[:, 1], samples)]
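Note that the np.in1d mask keeps rows in their original order and does not repeat rows for duplicated entries in samples. If the sample order and repetitions matter, here is a minimal sketch (assuming X and samples as defined in the question) that sorts by group id once and then slices each group with np.searchsorted, instead of scanning the whole array per sample:
import numpy as np

order = np.argsort(X[:, 1], kind="stable")                # sort rows by group id once
Xs = X[order]
left = np.searchsorted(Xs[:, 1], samples, side="left")    # first row of each requested group
right = np.searchsorted(Xs[:, 1], samples, side="right")  # one past the last row of each group
result = np.vstack([Xs[l:r] for l, r in zip(left, right)])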
Related
I need to write, in Python 3, a function that fills a bigger square with smaller squares based on the input.
The input is a list of positive integers. Each integer is the number of squares per row. So the list [3, 8, 5, 2] means I have 4 rows of squares, where the first one has 3 squares, the second one 8, and so on. All squares in all rows are of the same size.
The output should be a description of the row distribution in the form of a list of lists.
The thing is that the output cannot contain empty rows, so effectively the number of columns cannot be greater than the number of rows. A row can, however, be split into two or more rows. So, for example, for the list [3, 8, 5, 2] the function should return [[3], [5, 3], [5], [2]]:
AAA
BBBBB
BBB
CCCCC
DD
and for input [14,13,2,12] it should return [[7,7], [7,6], [2], [7,5]]:
AAAAAAA
AAAAAAA
BBBBBBB
BBBBBB
CC
DDDDDDD
DDDDD
As we can see, the numbers of rows and columns are equal in both examples. Of course that is not always possible, but the smaller the difference between the number of columns and the number of rows, the more efficient the algorithm is - the better the square is filled. In general we aim for as many columns and as few rows as possible.
And here is the issue: the above examples used 4 input rows, but the input list can have many more elements (for example 200 input rows). The problem is to find the optimal way to split the rows (for example, whether I should split 18 as 9+9 or 6+6+6 or 7+7+4 or maybe 5+5+5+3). Every time I split the rows based on the available columns (which depend on the number of rows used), I get more output rows and am therefore able to use additional available columns - and I fall into some weird loop or recursion.
I'm sorry if there is some easy solution that I don't see and thank you in advance for help <3
EDIT: Here is an example function that simply ignores the fact that the number of rows increases and just treats the number of input rows as the maximum number of columns possible:
def getSquare(x):
    output = list()
    ln = len(x)
    for i in x:
        if i <= ln:
            output.append([i])
        else:
            split = list()
            nrows = i // ln
            for j in range(nrows):
                split.append(ln)
            if i % ln:
                split.append(i % ln)
            output.append(split)
    return output
print(getSquare([14, 13, 2, 12]))
# returns [[4, 4, 4, 2], [4, 4, 4, 1], [2], [4, 4, 4]]
# so 4 columns and 12 rows
# columns - maximum number in the matrix
# rows - number of all elements in the matrix (length of flattened output)
# while it should return: [[7,7], [7,6], [2], [7,5]]
# so 7 columns and 7 rows
# (the number of columns should be not larger than, but as close as possible to, the number of rows)
EDIT2: It doesn't have to return a perfect square - just something as close to a square as possible. For example, for [4,3,3] it should return [[3,1],[3],[3]], while for extreme cases like [1,1,1] it should just return [[1],[1],[1]].
lst = [3, 8, 5, 2]  # As close to a square as possible

def getSquare(lst):
    for i in range(1, max(lst) + 1):
        output = []
        num = 0
        for j in lst:
            output.append([i] * (j // i))
            num += j // i
            if j % i != 0:
                output[-1].append(j % i)
                num += 1
        if num == i:
            return output
    # fallback: no width matched, use the square root of the total count
    i = int(sum(lst) ** (1 / 2) // 1)
    output = []
    num = 0
    for j in lst:
        output.append([i] * (j // i))
        num += j // i
        if j % i != 0:
            output[-1].append(j % i)
            num += 1
    return output

for i in getSquare(lst):
    for j in i:
        print("*" * j)
Here I just repeat the splitting process of the list until i and num become equal.
Essentially, when you decrease the width you increase the number of rows (and vice versa: when you increase the width you decrease the number of rows). That's why you want to equalize width and height as far as possible, because that gives you the minimum covering square. From this point of view your problem looks like a ternary search: minimize the maximum of the resulting width and height.
So we perform a ternary search over the width (fix it) and calculate the resulting maximum side of the square. The calculating function is straightforward:
def get_max_side(l, fixed_width):  # O(len(l))
    height = 0
    for i in l:
        height += i // fixed_width
        if i % fixed_width:
            height += 1
    return max(fixed_width, height)
Then use it in a ternary search to find the minimum of get_max_side and reconstruct the answer from the width found. Time complexity is O(len(l) * log(max_meaningful_square_width)).
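A minimal sketch of that search and the reconstruction (assuming the get_max_side helper above; the plateau handling and the final scan over the last few candidate widths are my own additions, not part of the original answer):
def best_width(l):
    lo, hi = 1, max(l)
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if get_max_side(l, m1) < get_max_side(l, m2):
            hi = m2 - 1
        else:
            lo = m1 + 1
    # only a handful of candidates remain; take the best one
    return min(range(lo, hi + 1), key=lambda w: get_max_side(l, w))

def layout(l, width):
    # split every input row into chunks of the chosen width
    out = []
    for i in l:
        row = [width] * (i // width)
        if i % width:
            row.append(i % width)
        out.append(row)
    return out

w = best_width([14, 13, 2, 12])
print(w, layout([14, 13, 2, 12], w))  # 7 [[7, 7], [7, 6], [2], [7, 5]]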
I have improved the answer above, adding some stopping conditions to generally speed up the algorithm. It also gives a much more efficient answer when no ideal solution is found.
arr = [14, 13, 2, 12]  # as close to a square as possible

def getSquare(lst):
    output = []
    mx = max(lst)
    for cols in range(min(len(lst), mx), mx + 1):
        out = []
        rows = 0
        for j in lst:
            ln = j // cols
            r = j % cols
            out.append([cols] * ln)
            rows += ln
            if r:
                out[-1].append(r)
                rows += 1
        if rows >= cols:
            output = out
        else:
            break
    return output

for i in getSquare(arr):
    for j in i:
        print("*" * j)
I have a large matrix (around 100,000 x 100,000). The good thing is that it contains only zeros and ones, and mostly zeros (it is already saved as a boolean matrix to save some RAM). Now I need to multiply each column of the matrix with all of the other columns. The reason is that I need to check whether there is at least one row where both columns have a non-zero element (hence multiplying and summing the resulting vector to check whether it is zero or not). As an example, assume we have the matrix
1.column  2.column  3.column
       1         0         0
       1         1         0
       0         0         1
Then I need to compare all columns and check whether there is at least one row where both columns are one. Comparing the first and the second column would return True, since they are both one in the second row. However, comparing the first and the third column, or the second and the third column, would return False, since there is no row where both are one.
Obviously this can be done using a for loop iterating over all columns, but not at a very satisfying speed. I already tried numba like this:
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def create_dist_arr(arr: np.array):
    n = arr.shape[1]
    dist_arr = np.zeros(shape=(n, n))  # , dtype=bool)
    for i in prange(arr.shape[1]):
        for j in prange(i, arr.shape[1]):
            dist_greater_zero = calc_dist_graeter_than_zeros(arr[:, i], arr[:, j])
            dist_arr[i][j] = dist_greater_zero
    return dist_arr

@njit
def calc_dist_graeter_than_zeros(ith_col, jth_col):
    return np.sum(np.multiply(ith_col, jth_col)) != 0

zero_arr = np.zeros(shape=(2000, 6000), dtype=bool)
bool_dist_matrix = create_dist_arr(zero_arr)
But despite having 120 GB of RAM and 32 cores, this gets very slow for matrices around 10,000 x 10,000. It is even worse when trying scipy.spatial.distance.pdist like this:
from scipy.spatial.distance import pdist
zero_arr = np.zeros(shape=(500, 500), dtype=bool)
bool_dist_matrix = pdist(zero_arr, lambda u, v: np.sum(np.multiply(u, v)) != 0)
Is there maybe some nice and fast workaround using sparse matrices, or something else that doesn't take forever?
Thank you in advance :)
I don't know how memory efficient this solution is, nor whether it will be faster than the other solutions, but it is vectorized.
Your idea was to multiply the columns and add the result, which reminded me of matrix multiplication, except against the matrix's own columns. So:
Let's say your matrix is M. If you take the transpose M.T and matrix-multiply it by M itself, M.T @ M, that is the same as multiplying each column against the other columns and taking the sum.
import numpy as np

# A random matrix
M = np.array([[1, 0, 0, 0, 0, 1],
              [1, 1, 0, 0, 0, 1],
              [0, 1, 1, 1, 0, 0]])

bool_dist_matrix = (M.T @ M).astype('bool')
np.fill_diagonal(bool_dist_matrix, 0)
"""
[[0 1 0 0 0 1]
[1 0 1 1 0 1]
[0 1 0 1 0 0]
[0 1 1 0 0 0]
[0 0 0 0 0 0]
[1 1 0 0 0 0]]
"""
If your matrix is mainly composed of zeros, then it will be more efficient to work with indices:
import numpy as np

# A random matrix
M = np.array([[1, 0, 0, 0, 0, 1],
              [1, 1, 0, 0, 0, 1],
              [0, 1, 1, 1, 0, 0]])

# Get the indices where M == 1
ind = np.where(M)

# Get the unique row indices and their counts.
uni = np.unique(ind[0], return_counts=True)

# Keep only the rows that contain at least two 1s.
col = uni[0][uni[1] > 1]

# Create a boolean index.
bol = np.isin(ind[0], col)

# And here we get the "compressed" result:
res = ind[1][bol]  # column number
grp = ind[0][bol]  # group (row) number

# res = [0, 5, 0, 1, 5, 1, 2, 3]
# grp = [0, 0, 1, 1, 1, 2, 2, 2]

# So each pair within each group will output a True statement:
# group 0: 0-5
# group 1: 0-1, 0-5, 1-5
# group 2: 1-2, 1-3, 2-3

# For an explicit result you could use itertools. But all the information is
# contained in the res variable.
Notice that this method will produce some duplicates if two columns share a common 1 value in more than one row, but it is easy to get rid of those duplicates. And since you're working with a 100,000 x 100,000 matrix in which not every pair of columns has a common 1 value, the percentage of 1s in your matrix is very likely quite low, so this method should give good results.
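The itertools step hinted at in the comments could look like the following sketch (assuming res and grp as computed above; the pair-collection logic is my own illustration, not part of the original answer). The set also takes care of the duplicates just mentioned:
import itertools
import numpy as np

pairs = set()
for row in np.unique(grp):
    cols_in_row = res[grp == row].tolist()  # columns with a 1 in this row
    pairs.update(itertools.combinations(sorted(cols_in_row), 2))

print(sorted(pairs))
# [(0, 1), (0, 5), (1, 2), (1, 3), (1, 5), (2, 3)]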
I'm not sure exactly what you need to do, but if I understood correctly my code should help. Here's what I tried; it seems to run a little faster, although that won't be the case if the matrix isn't sparse as you mentioned, and I get a symmetric matrix rather than an upper triangular one:
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def create_dist_arr(arr: np.array):
    n = arr.shape[1]
    dist_arr = np.zeros(shape=(n, n))  # , dtype=bool)
    for i in prange(arr.shape[1]):
        for j in prange(i, arr.shape[1]):
            dist_greater_zero = calc_dist_graeter_than_zeros(arr[:, i], arr[:, j])
            dist_arr[i][j] = dist_greater_zero
    return dist_arr

@njit
def calc_dist_graeter_than_zeros(ith_col, jth_col):
    return np.sum(np.multiply(ith_col, jth_col)) != 0

def create_dist_arr_sparse(arr: np.array):
    n = arr.shape[1]
    dist_arr = np.zeros(shape=(n, n))  # , dtype=bool)
    rows, cols = np.array(np.where(arr))
    same_rows = rows.reshape(1, -1) == rows.reshape(-1, 1)
    idx0, idx1 = np.meshgrid(cols, cols)
    idx0 = idx0.flatten()[same_rows.flatten()]
    idx1 = idx1.flatten()[same_rows.flatten()]
    dist_arr[idx0, idx1] = 1
    return dist_arr

np.random.seed(1)
k = 1000
zero_arr = np.zeros(shape=(k, k), dtype=bool)
rows, cols = np.random.randint(0, k, (2, k))
zero_arr[rows, cols] = 1

%timeit bool_dist_matrix = create_dist_arr(zero_arr)
%timeit bool_dist_matrix = create_dist_arr_sparse(zero_arr)
output:
1 loop, best of 5: 1.59 s per loop
100 loops, best of 5: 9.8 ms per loop
I have a matrix, X, and wish to delete columns based upon values in two different lists named "starts" and "lengths". Values in the first list are in increasing order, with each denoting the index of the starting column in X to delete. The corresponding value in "lengths" indicates how many columns to delete from that point forward, including the starting value itself. A simple example:
import numpy as np
X=np.random.randint(5, size=(3, 20))
starts=[2,9,16]
lengths=[3,4,2]
So, I want to delete columns 2-5, 9-13, and 16-18 of X. In other words, I want my result to be the same as
X[:,[0,1,6,7,8,14,15,19]]
What is the most efficient means of accomplishing this?
This should work. The time complexity is O(number of rows * number of columns). (The inner for loop that iterates over starts runs only up to the number of columns in that row.) I don't think you can improve the time complexity beyond this.
def delete_columns(matrix, starts, lengths):
    # New matrix with columns removed
    new_matrix = []
    # Iterate over all rows.
    for row in matrix:
        new_row = []
        col_index = 0
        # Number of columns in current row
        column_count = len(row)
        # Iterate over given starts
        for start_index in range(len(starts)):
            start_col = starts[start_index]
            # Add columns which are not present in starts to new matrix
            while col_index < min(column_count, start_col):
                new_row.append(row[col_index])
                col_index += 1
            # Move the column index past the block to be deleted
            col_index = start_col + lengths[start_index] + 1
            if col_index >= column_count:
                break
        # Handles empty starts and the last few columns to be added
        while col_index < column_count:
            new_row.append(row[col_index])
            col_index += 1
        # Add row to new matrix
        new_matrix.append(new_row)
    return new_matrix

matrix = [list(range(0, 20))]
starts = [2, 9, 16]
lengths = [3, 4, 2]
print(delete_columns(matrix, starts, lengths))
Output:
[[0, 1, 6, 7, 8, 14, 15, 19]]
Another approach which just came to mind.
import numpy as np

num_times = 20
X = np.random.randint(5, size=(3, num_times))
starts = [2, 9, 16]
lengths = [3, 4, 2]

T = [set(np.arange(starts[i], starts[i] + lengths[i] + 1, 1)) for i in range(len(starts))]
to_remove = set()
for s in T:
    to_remove = to_remove.union(s)
U = set(np.arange(0, num_times))
to_keep = list(U.difference(to_remove))
Y = X[:, to_keep]  # The desired matrix
A colleague provided me another succinct way of doing it:
import numpy as np

num_times = 20
X = np.random.randint(5, size=(3, num_times))
starts = [2, 9, 16]
lengths = [3, 4, 2]

cols = list(range(X.shape[1]))
remove = []
for i, s in enumerate(starts):
    remove += range(s, s + lengths[i] + 1)
saved_cols = list(set(cols).difference(set(remove)))
Y = X[:, saved_cols]
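For completeness, here is a small fully vectorized sketch in the same spirit (assuming the same starts/lengths convention as the desired output above, i.e. each block spans lengths[i] + 1 columns): build all column indices to drop in one go and hand them to np.delete.
import numpy as np

X = np.random.randint(5, size=(3, 20))
starts = [2, 9, 16]
lengths = [3, 4, 2]

drop = np.concatenate([np.arange(s, s + l + 1) for s, l in zip(starts, lengths)])
Y = np.delete(X, drop, axis=1)  # keeps columns 0, 1, 6, 7, 8, 14, 15, 19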
I am trying to split a list into n sublists where the size of each sublist is random (with at least one entry; assume P > I). I used the numpy.split function, which works fine but does not satisfy my randomness condition. You may ask which distribution the randomness should follow; I think it should not matter. I checked several posts, but they were not equivalent to mine, as they were trying to split into almost equally sized chunks. If this is a duplicate, let me know. Here is my approach:
import numpy as np
P = 10
I = 5
mylist = range(1, P + 1)
[list(x) for x in np.split(np.array(mylist), I)]
This approach collapses when P is not divisible by I. Further, it creates equal-sized chunks, not probabilistically sized chunks. Another constraint: I do not want to use the random package, but I am fine with numpy. Don't ask me why; I wish I had a logical reason for it.
Based on the answer provided by the mad scientist, this is the code I tried:
P = 10
I = 5
data = np.arange(P) + 1
indices = np.arange(1, P)
np.random.shuffle(indices)
indices = indices[:I - 1]
result = np.split(data, indices)
result
Output:
[array([1, 2]),
array([3, 4, 5, 6]),
array([], dtype=int32),
array([4, 5, 6, 7, 8, 9]),
array([10])]
The problem can be refactored as choosing I-1 random split points from {1,2,...,P-1}, which can be viewed using stars and bars.
Therefore, it can be implemented as follows:
import numpy as np

split_points = np.random.choice(P - 1, I - 1, replace=False) + 1  # I - 1 distinct points from {1, ..., P-1}
split_points.sort()
result = np.split(data, split_points)
np.split is still the way to go. If you pass in a sequence of integers, split will treat them as cut points. Generating random cut points is easy. You can do something like
P = 10
I = 5
data = np.arange(P) + 1
indices = np.random.randint(P, size=I - 1)
You want I - 1 cut points to get I chunks. The indices need to be sorted, and duplicates need to be removed. np.unique does both for you. You may end up with fewer than I chunks this way:
indices = np.unique(indices)
result = np.split(data, indices)
If you absolutely need to have I numbers, choose without resampling. That can be implemented, for example, via np.random.shuffle:
indices = np.arange(1, P)
np.random.shuffle(indices)
indices = indices[:I - 1]
indices.sort()
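Putting the pieces above together (with P and I as in the question), a minimal runnable sketch:
import numpy as np

P, I = 10, 5
data = np.arange(P) + 1

indices = np.arange(1, P)           # candidate cut points 1..P-1
np.random.shuffle(indices)
indices = np.sort(indices[:I - 1])  # I - 1 distinct, sorted cut points
print(np.split(data, indices))      # always I non-empty chunks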
Assuming I have the following data:
data = [1,1,3,2,4]
max_value = 4 # it is known from before
number_of_random_values = 2
What I want is to create random values in the range between 1 and 4 for each data point, but excluding the value of the data point itself in each case. To make it clearer, here is an example:
data point random_values
1 -> [2,4]
1 -> [3,2]
3 -> [1,4]
2 -> [3,1]
4 -> [1,3]
So what we have above is: for each data point, two random values, where these random numbers cannot be the same as the data point. What I have done so far is the following:
import copy
import random

import numpy as np

desired_values = np.zeros((len(data), number_of_random_values))
range_of_data = list(range(1, max_value + 1))
i = 0
for data_point in data:
    copy_of_range = copy.copy(range_of_data)
    copy_of_range.remove(data_point)
    random_values_for_data_point = random.sample(copy_of_range, number_of_random_values)
    desired_values[i] = random_values_for_data_point
    i = i + 1
The above code does what I want (the desired results in a numpy array), but it is clearly not optimized for performance.
Is there a vectorized way to implement this, or something more efficient?
Edit
replacing data with
data = np.random.random_integers(max_value, size=(1000, 1)).tolist()
and running my solution along with the solutions from the answers below using:
import time

start_time = time.time()
for _ in range(10000):
    # each solution
    .
    .
    .
end_time = time.time()
print(end_time - start_time)
we have the following results:
my solution: 40.3 sec
Anton vBR solution: 31.7 sec
Desire: 261 sec
If we don't use np for the random numbers, we can make something simple like this:
import random
import numpy as np
data = [1,1,3,2,4]
max_value = 4 # it is known from before
number_of_random_values = 2
output = [random.sample([i for i in range(1, max_value + 1) if i != item], 2)
          for item in data]
np.array(output)
Returns
array([[4, 2],
[3, 4],
[1, 4],
[1, 3],
[3, 2]])
Avoiding a given integer in the range [1, max_value] can be achieved with modular arithmetic, which is vectorized in NumPy:
Generate a random number in range(0, max_value-1) (so not including max_value or max_value-1).
Add it to the given, excluded number.
Take the remainder modulo max_value, and add 1.
The result is equally likely to be any number between 1 and max_value inclusive, except the excluded one. (Indeed, the only way to get the excluded value would be to add max_value-1 at step 1, which is not allowed.)
So the problem boils down to generating many samples from the same array (no exclusion), without replacement. Unfortunately it does not seem like NumPy has a tool for this at present. The method numpy.random.choice only produces one sample, so one has to call it in a loop.
data = np.array([1, 1, 3, 2, 4])
max_value = 4
number_of_random_values = 2
desired_values = np.zeros((len(data), number_of_random_values), dtype=int)
for i in range(len(data)):
    desired_values[i, :] = np.random.choice(max_value - 1, number_of_random_values, replace=False)
desired_values = np.mod(desired_values + data.reshape(-1, 1), max_value) + 1
Notice this version declares the dtype of the array desired_values, which would be float64 by default. The type could be np.int8 if you expect only small integers.
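As a quick sanity check of the modular trick, a small sketch (assuming data, max_value and desired_values as defined above):
assert not np.any(desired_values == data.reshape(-1, 1))               # exclusion holds in every row
assert np.all((desired_values >= 1) & (desired_values <= max_value))   # values stay in [1, max_value]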