Pairwise count of common elements in 2D numpy array - python

I have a numpy array of shape (5000, 9) and dtype int. I am trying to create an array of shape (5000, 5000) and dtype int that contains the count of shared elements for each pair of rows.
I can accomplish this using itertools.combinations and a loop, but that approach is pretty slow (3-4 minutes on my machine), so I'm searching for a more efficient alternative. Any suggestions would be greatly appreciated!
from itertools import combinations
import numpy as np
# create a random array whose rows (almost always) contain no duplicates
data = np.random.rand(5000, 9).argsort(axis=0)
# one counter per pair of rows, so the result must be 5000 x 5000
counts = np.zeros((5000, 5000), dtype=int)
for i, j in combinations(range(len(data)), 2):
    counts[i, j] = len(np.intersect1d(data[i], data[j]))

Let's try:
# sample data with 200 unique values
np.random.seed(1)
data = np.array([np.random.choice(np.arange(200), size=9, replace=False)
                 for _ in range(5000)])
# identify the unique values:
uniques = np.unique(data)
# dummy for each row
a = (data[...,None] == uniques).sum(1)
# output
out = np.einsum('ij,kj->ik',a,a)
Takes about 4.5s on my system.
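As a sanity check, the one-hot approach above can be compared against the np.intersect1d loop on a small input. This is only a sketch: the helper names pairwise_common_counts and pairwise_common_counts_loop are made up for illustration, it assumes that no row contains a repeated value, and it uses a plain matrix product, which computes the same thing as the einsum call.

```python
import numpy as np

def pairwise_common_counts(data):
    """Count shared values for every pair of rows (rows assumed duplicate-free)."""
    uniques = np.unique(data)
    # One-hot encode each row: onehot[r, u] == 1 if uniques[u] occurs in row r.
    onehot = (data[..., None] == uniques).sum(axis=1)
    # The dot product of two one-hot rows counts the values they share.
    return onehot @ onehot.T

def pairwise_common_counts_loop(data):
    """Slow reference implementation using np.intersect1d, for checking only."""
    n = len(data)
    out = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            out[i, j] = len(np.intersect1d(data[i], data[j]))
    return out

rng = np.random.default_rng(0)
small = np.array([rng.choice(50, size=9, replace=False) for _ in range(20)])
fast = pairwise_common_counts(small)
slow = pairwise_common_counts_loop(small)
print(np.array_equal(fast, slow))
```

Unlike the combinations loop in the question, the result is the full symmetric matrix with the row length on the diagonal; np.triu(out, k=1) keeps only the upper triangle if that is what is needed.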

Related

Conditional averaged of numpy array based on a different

I have a sequence of N images with some shape (N, x,y). I also have corresponding times for each image, which is just a 1D array of length N.
Some of these times are duplicates, so I want to average the images at the same time steps so that I have a single (x,y) image for each time. I am curious what the best pythonic way for this would be?
Essentially just groupby("time").agg("mean"), but for 2D arrays.
If you want to stick to the groupby() paradigm, I would suggest the following:
import numpy as np
import pandas as pd
# the number of 512x512 gray-scale images
N = 200
# the iterator function that generates random images with associated random time points
# the images are numpy 2D arrays
def get_image():
    yield [np.random.randint(24*60*60), np.random.randint(256, size=(512, 512))]
# We generate the list of random images with associated random time points
my_images = [next(get_image()) for _ in range(N)]
df = pd.DataFrame(my_images)
df.columns = ['time_point', 'image']
df = df.sort_values('time_point').reset_index(drop=True)
# making at least some images have same time point as others
df.iloc[range(0,N,5),0] = df.iloc[range(1,N-1,5),0].values
#finally our groupby
result = df.groupby('time_point').mean()
# convert pixels back from floats to integers
result['image'] = result['image'].apply(np.int64)
print(result)
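If the images are already stacked in one (N, x, y) array, the same grouping can probably be done without pandas. The following is a minimal sketch, not the answer's method: the function name average_by_time is hypothetical, and it assumes times is a 1D array whose equal entries mark images that belong together.

```python
import numpy as np

def average_by_time(times, images):
    """Average all images that share the same time stamp.

    times  : 1D array of length N
    images : array of shape (N, x, y)
    """
    unique_times, inverse = np.unique(times, return_inverse=True)
    inverse = inverse.ravel()  # flatten defensively; a no-op for 1D input
    # Accumulate the images of each group into one slot per unique time.
    sums = np.zeros((len(unique_times),) + images.shape[1:], dtype=float)
    np.add.at(sums, inverse, images)
    # Divide each accumulated image by the number of images in its group.
    counts = np.bincount(inverse, minlength=len(unique_times))
    return unique_times, sums / counts[:, None, None]

times = np.array([3, 1, 3, 2, 1])
images = np.arange(5 * 2 * 2, dtype=float).reshape(5, 2, 2)
unique_times, averaged = average_by_time(times, images)
print(unique_times)
print(averaged.shape)
```

The returned averages are floats, ordered by the sorted unique time values; cast with astype if integer pixels are required.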

Find Indexes that Maps a Numpy Array to Another

If we have a numpy array a that needs to be sampled with replacement to create a second numpy array b,
import numpy as np
a = np.arange(10, 200*1000)
b = np.random.choice(a, len(a), replace=True)
What is the most efficient way to find an array of indexes named mapping that will transform a to b? It is OK to change np.random.choice to a more suitable function.
The following code is too slow: it takes 7-8 seconds on a MacBook Pro to create the mapping array. With an array size of 1 million, it will take much longer.
mapping = np.array([], dtype=int)
for n in b:
    m = np.searchsorted(a, n)
    mapping = np.append(mapping, m)
Perhaps run the choice on the indices of a instead, and then index a with the resulting random mapping:
mapping = np.random.choice(np.arange(len(a)), len(a), replace=True)
b = a[mapping]
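If b is already given and cannot be regenerated, the loop from the question can likely be replaced by a single vectorized call, because np.searchsorted also accepts an array of values. A minimal sketch, assuming a is sorted and free of duplicates (true for np.arange) and that every element of b occurs in a:

```python
import numpy as np

a = np.arange(10, 200 * 1000)
b = np.random.choice(a, len(a), replace=True)

# One call looks up the position of every element of b in the sorted array a.
mapping = np.searchsorted(a, b)

# Indexing a with the mapping reproduces b.
print(np.array_equal(a[mapping], b))
```

This avoids the repeated np.append calls, each of which copies the whole mapping array.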

Python: Initialize numpy arrays within an array of zeroes

In Python, I am trying to initialize 2-element arrays of zeros within a size N by N array. The code I'm using works but I'm looking for something more efficient and elegant:
array1 = np.empty((N,N), dtype=object)
for i in range(N):
    for j in range(N):
        array1[i,j] = np.zeros(2, dtype=int)
Thanks in advance for the help.
As I understand it, you should probably use a 3D array:
import numpy as np
array1 = np.zeros((N, N, 2), dtype=int)
which returns an array with N rows, N columns and a depth of 2, already filled with zeros. If you want to assign an (N x N) array to, say, the first depth slice, just use:
tmp = np.ones((N, N))  # for instance
array1[:, :, 0] = tmp
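A minimal sketch of the 3D layout, assuming integer zeros are what is wanted; N = 3 is an arbitrary example size. Each [i, j] entry then behaves like the 2-element arrays created by the loop in the question:

```python
import numpy as np

N = 3  # example size, chosen only for illustration

# One contiguous integer array: entry [i, j] is a 2-element vector of zeros.
array1 = np.zeros((N, N, 2), dtype=int)

# Each "cell" can be read or written like the small arrays in the question.
array1[0, 1] = [5, 7]
print(array1[0, 1])
print(array1.shape)
```

A single numeric array is usually more compact and faster to operate on than an object array holding N*N separate small arrays.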

Append numpy one dimensional arrays does not lead to a matrix

I am trying to get a 2d array, by randomly generating its rows and appending
import numpy as np
my_nums = np.array([])
for i in range(100):
    x = np.random.rand(2, 1)
    my_nums = np.append(my_nums, np.array(x))
But I do not get what I want but instead get a 1d array.
What is wrong?
Transposing x did not help either.
You could do this by using np.append(axis=0) or np.vstack. This however requires the rows appended to have the same length as the rows already in the array.
You cannot use the same code to append a row with two values to an empty array, and to append a row to an already existing 2D array: numpy will throw a
ValueError: all the input arrays must have same number of dimensions.
You could initialize my_nums to work around this:
my_nums = np.random.rand(1, 2)
for i in range(99):
    x = np.random.rand(1, 2)
    my_nums = np.append(my_nums, x, axis=0)
Note the decrease in the range by one due to the initialization row. Also note that I changed the dimensions to (1, 2) to get actual row vectors.
Much easier than appending row-wise will of course be to create the array in the wanted final shape:
my_nums = np.random.rand(100, 2)
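When the rows really do have to be produced one at a time, a common pattern is to collect them in a Python list and stack them once at the end with np.vstack. A minimal sketch, with np.random.rand(2) standing in for whatever generates a row:

```python
import numpy as np

rows = []
for _ in range(100):
    # Stand-in for whatever produces one row at a time.
    rows.append(np.random.rand(2))

# Stack the 1D rows into a single (100, 2) array in one step.
my_nums = np.vstack(rows)
print(my_nums.shape)
```

Appending to a list is cheap, whereas np.append copies the whole array on every call.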

numpy padding matrix of different row size

I have a numpy array whose rows have different sizes
a = np.array([[1,2,3,4,5],[1,2,3],[1]], dtype=object)
and I would like to turn it into a dense matrix (fixed n x m size, no variable-length rows). So far I have tried something like this
size = (len(a),5)
result = np.zeros(size)
result[[0],[len(a[0])]]=a[0]
But I receive an error telling me
shape mismatch: value array of shape (5,) could not be broadcast to
indexing result of shape (1,)
I also tried padding with np.pad, but according to the documentation of numpy.pad it seems I need to specify in pad_width the previous size of the rows (which is variable, and trying -1, 0, and the biggest row size all produced errors).
I know I can do it by padding the lists row by row, as shown here, but I need to do that with a much bigger array of data.
If someone can help me with the answer to this question, I would be glad to know of it.
There's really no way to pad a jagged array so that it loses its jaggedness without iterating over its rows. You even have to iterate twice: once to find the maximum length you need to pad to, and once more to actually do the padding.
The code proposal you've linked to will get the job done, but it's not very efficient, because it adds zeroes in a Python for-loop that iterates over the elements of the rows, whereas that appending could have been precalculated, thereby pushing more of that code to C.
The code below preallocates a zero-filled array of the minimal required dimensions and then simply adds each row of the jagged array M in place, which is far more efficient.
import random
import numpy as np
M = [[random.random() for n in range(random.randint(0,m))] for m in range(10000)] # play-data
def pad_to_dense(M):
    """Appends the minimal required amount of zeroes at the end of each
    array in the jagged array `M`, such that `M` loses its jaggedness."""
    maxlen = max(len(r) for r in M)
    Z = np.zeros((len(M), maxlen))
    for enu, row in enumerate(M):
        Z[enu, :len(row)] += row
    return Z
To give you some idea for speed:
from timeit import timeit
n = [10, 100, 1000, 10000]
s = [timeit(stmt='Z = pad_to_dense(M)', setup='from __main__ import pad_to_dense; import numpy as np; from random import random, randint; M = [[random() for n in range(randint(0,m))] for m in range({})]'.format(ni), number=1) for ni in n]
print('\n'.join(map(str,s)))
# 7.838103920221329e-05
# 0.0005027339793741703
# 0.01208890089765191
# 0.8269036808051169
If you want to prepend zeroes to the arrays, rather than append, that's a simple enough change to the code, which I'll leave to you.
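For reference, one way the prepending variant might look. This is a sketch under the same assumptions as pad_to_dense above; the name pad_to_dense_left is made up for illustration:

```python
import numpy as np

def pad_to_dense_left(M):
    """Hypothetical variant of pad_to_dense that puts the zeroes in front."""
    maxlen = max(len(r) for r in M)
    Z = np.zeros((len(M), maxlen))
    for enu, row in enumerate(M):
        # Right-align the row; an explicit start index also handles empty rows.
        Z[enu, maxlen - len(row):] = row
    return Z

padded = pad_to_dense_left([[1, 2, 3], [4], []])
print(padded)
```

The start index is written as maxlen - len(row) rather than -len(row), because a negative-zero slice start would select the whole row for an empty input row.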
You can do something like this with numpy.pad
import numpy as np
a = np.array([[1,2,3,4,5],[1,2,3],[1]], dtype=object)
l = np.array([len(a[i]) for i in range(len(a))])
width = l.max()
b=[]
for i in range(len(a)):
    if len(a[i]) != width:
        x = np.pad(a[i], (0,width-len(a[i])), 'constant',constant_values = 0)
    else:
        x = a[i]
    b.append(x)
b = np.array(b)
print(b)
Above piece of code outputs something like this.
b = [[1, 2, 3, 4, 5],
[1, 2, 3, 0, 0],
[1, 0, 0, 0, 0]]
You can read back your input version of data by doing something as follows
a = []
for i in range(len(b)):
    a.append(b[i][0:l[i]])
a = np.array(a, dtype=object)
print(a)
where you get the following output
a = array([array([1, 2, 3, 4, 5]), array([1, 2, 3]), array([1])], dtype=object)
Hopefully this helps someone who struggled like me to solve the issue.
Thank you.
