Compare rows in a matrix and count the number of equal elements - python

I am wondering if there is an efficient way to compare rows in a matrix and count the number of equal elements in the rows. Say I have a matrix (a NumPy array of strings):
[['food', 'food', 'food'],
['food', 'food', 'drink'],
['food', 'food', 'drink']]
I would like to compare the first row with the second row, the first row with the third row, and the second row with the third row. There is no need to compare two rows two times and I don't want to compare a row with itself. I'd like to return a list or array that is as long as the number of comparisons (or similar) and that contains the number of equal elements for each comparison. In this case, I'd get: [2, 2, 3].
I've tried looping through the matrix as follows:
comparisons = [sum(matrix[i]==matrix[j]) for i in range(len(matrix)) for j in range(len(matrix)) if i < j]
I'm worried this solution will be too slow if the size of the matrix grows. Is there a more efficient solution by using e.g. NumPy?

By using itertools.chain.from_iterable:
>>> from itertools import chain
>>> list(chain.from_iterable(
...     (matrix[i+1:] == row).sum(1) for i, row in enumerate(matrix[:-1])
... ))
[2, 2, 3]
Timing:
# Method 1 [from the question]
>>> %timeit [sum(matrix[i]==matrix[j]) for i in range(len(matrix)) for j in range(len(matrix)) if i < j]
25.6 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Method 2
>>> %timeit list(chain.from_iterable((matrix[i+1:] == row).sum(1) for i, row in enumerate(matrix[:-1])))
11.8 µs ± 320 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
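For larger matrices, a fully vectorized alternative (a sketch of my own using NumPy broadcasting, not from the answer above) compares every pair of rows at once and then keeps only the i < j pairs:

```python
import numpy as np

matrix = np.array([['food', 'food', 'food'],
                   ['food', 'food', 'drink'],
                   ['food', 'food', 'drink']])

# Broadcast row-vs-row comparison to shape (rows, rows, cols), count
# equal elements per pair, then keep the upper triangle (i < j) only.
pairwise = (matrix[:, None] == matrix[None, :]).sum(-1)
i, j = np.triu_indices(len(matrix), k=1)
print(pairwise[i, j].tolist())  # [2, 2, 3]
```

This trades O(rows²·cols) memory for speed, so it is best suited to moderately sized inputs.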

Generate a random binary matrix with all rows different using numpy

I need to generate a random binary matrix with dimensions m x n where all rows are distinct. Using numpy I tried:
import numpy as np
import random
n = 512
m = 1000
a = random.sample(range(0, 2**n), m)
a = np.array(a)
a = np.reshape(a, (m, 1))
np.unpackbits(a.view(np.uint8), axis=1)
But it is not suitable for my case, where n > 128 and m > 1000: the code above only generates rows with at most 62 elements. Could you help me, please?
You could generate a random array of 0's and 1's with numpy.random.choice and then make sure that the rows are different through numpy.unique:
import numpy as np
m = 1000
n = 512
while True:
    a = np.random.choice([0, 1], size=(m, n))
    if len(np.unique(a, axis=0)) == m:
        break
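For n = 512 this loop will almost always succeed on its first try: by a birthday-style bound, the chance that any two of the m rows collide is at most m²/2^(n+1). A back-of-the-envelope sketch of my own, not part of the answer:

```python
import math

n, m = 512, 1000
# Birthday upper bound on the probability of any duplicate row:
# C(m, 2) / 2**n  <  m**2 / 2**(n+1)
log2_p = 2 * math.log2(m) - (n + 1)
print(f"collision probability < 2**{log2_p:.0f}")  # < 2**-493
```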
I would try creating one row at a time and checking whether it already exists via a set, which has O(1) membership testing. If the row exists, simply generate another one; if not, add it to the array and move on to the next row, until you are done. This can be made faster by:
1. Setting the unique-row counter to 0.
2. Generating m - counter rows and adding the unique ones to the solution.
3. Increasing the counter by the number of unique rows added.
4. If counter == m you are done; otherwise return to step 2.
The implementation is as follows:
import numpy as np

n = 128
m = 1000
a = np.zeros((m, n))
rows = set()
counter = 0
while counter < m:
    temp = np.random.randint(0, 2, (m - counter, n))
    for row in temp:
        if tuple(row) not in rows:
            rows.add(tuple(row))
            a[counter] = row
            counter += 1
Runtime comparison
Generating the whole matrix at once and checking that all rows are unique saves a lot of time, but only if n >> log2(m).
Example 1
with the following:
n = 128
m = 1000
I ran my suggestion and the solution mentioned in the other answer, resulting in:
# my suggestion
17.7 ms ± 328 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# generating the whole matrix at once and checking if all rows are unique
4.62 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This is because the probability of generating m different rows is very high in this situation.
Example 2
When changing to:
n = 10
m = 1024
I ran my suggestion and the solution mentioned in the other answer, resulting in:
# my suggestion
26.3 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The suggestion of generating the whole matrix at once and checking that all rows are unique did not finish running. This is because when math.log2(m) == n there are exactly m valid rows, and the probability of randomly generating a valid matrix approaches 0 as the matrix grows.
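To see just how unlikely success is in that regime, here is a small estimate of my own: with n = 10 and m = 1024, a uniform draw is valid only if it hits every one of the 2**10 = 1024 possible rows exactly once, which happens with probability m!/m**m (the birthday problem):

```python
import math

n, m = 10, 1024
assert 2 ** n == m  # exactly m distinct binary rows of length n exist

# Probability that m uniform draws over m possibilities are all distinct:
# m! / m**m -- computed in log space to avoid overflow
log_p = math.lgamma(m + 1) - m * math.log(m)
print(math.exp(log_p))  # underflows to 0.0: success is essentially impossible
```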
You could create a matrix with unique rows and shuffle the rows:
n = 512
m = 1000
d = np.arange(m) # m unique numbers
d = ((d[:, None] & (1 << d[:n])) > 0).astype(np.uint8) # convert to binary array
i = np.random.randn(m).argsort() # indices used for shuffling rows
a = d[i] # output
All rows are unique:
assert len(np.unique(a, axis=0)) == m
Timings
n=128, m=1000:
271 µs ± 6.06 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
n=2**10, m=2**14:
50.9 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
This works best for n <= m; otherwise you need to swap d[:n] for np.arange(n), resulting in a longer runtime.

Pick n random elements from set1 not present in set2

I would like to pick n random elements from set1 that are not present in set2, if there are n such elements. If there are not, only those that are different should be returned or an empty set in the worst case.
Example 1:
input: n=2, set1={0,1,2,3,4,5,6,7}, set2={0,2,4,6}
example possible output: {1,5} (other possible outputs: {1,3}, {1,7}, {3,5}, {3,7}, {5,7})
Example 2:
input: n=5, set1={0,1,2,3,4,5,6,7}, set2={0,2,4,6}
single possible output: {1,3,5,7} since there are only 4 choices and number of elements to choose is 5
Create a set of unique values and return either n random elements from it, or all of them if n is larger than the population:
import random

def random_unique(x, y, n):
    '''Return n random elements from set x not found in set y.'''
    unique = x - y
    # random.sample requires a sequence as of Python 3.11, so convert the set
    return set(random.sample(list(unique), min(n, len(unique))))
In action:
x = {0, 1, 2, 3, 4, 5, 6, 7}
y = {0, 2, 4, 6}
random_unique(x, y, 2)
{3, 5}
random_unique(x, y, 10)
{1, 3, 5, 7}
In two lines (we could make it one, but it's a little ugly), using sets and random.sample:
diff = set(list1).difference(list2)
random.sample(diff, min(len(diff), n))
Original solution (preserving duplicates in list1, as well as order, which doesn't really matter for random samples):
diff = [x for x in list1 if x not in set(list2)]
random.sample(diff, min(len(diff), n))
If you don't care about preserving duplicates, then set difference is indeed the way to go. Checking the timing of these implementations with the following:
list1 = np.arange(10000)
list2 = np.random.randint(0, 10000, 1000)
we get:
set difference: 1.15 ms ± 58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
list comprehension: 1.13 s ± 44.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
list comprehension with set pre-defined: 1.47 ms ± 24.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
That's a factor of 1000 faster for sets! There's not much difference between set.difference and the list comprehension if we have already built the set, but what difference there is appears significant (those standard deviations are tiny!).
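For reference, the "list comprehension with set pre-defined" variant timed above presumably hoists the set construction out of the loop, something like this sketch:

```python
import numpy as np

list1 = np.arange(10000)
list2 = np.random.randint(0, 10000, 1000)

# Build the set once, instead of re-evaluating set(list2) on every
# iteration of the comprehension (which is what makes it 1000x slower)
s2 = set(list2)
diff = [x for x in list1 if x not in s2]
```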

Efficiently extract values from array using a list of index

Given a 2D NumPy array a and a list of indices stored in index, there must be a way of extracting the values at those indices very efficiently. Using a for loop as follows takes about 5 ms, which seems extremely slow for extracting 2000 elements:
import numpy as np
import time

# generate dummy array
a = np.arange(4000).reshape(1000, 4)
# generate dummy list of indices
r1 = np.random.randint(1000, size=2000)
r2 = np.random.randint(3, size=2000)
index = np.concatenate([[r1], [r2]]).T
start = time.time()
result = [a[i, j] for [i, j] in index]
print(time.time() - start)
How can I increase the extraction speed? np.take does not seem appropriate here because it would return a 2D array instead of a 1D array.
You can use advanced indexing which basically means extract the row and column indices from the index array and then use it to extract values from a, i.e. a[index[:,0], index[:,1]] -
%timeit a[index[:,0], index[:,1]]
# 12.1 µs ± 368 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit [a[i, j] for [i, j] in index]
# 2.22 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Another option is numpy.ravel_multi_index, which converts the index pairs into flat indices; pass those to np.take to get the values as a 1D array:
np.take(a, np.ravel_multi_index(index.T, a.shape))
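A quick self-contained check (my own sketch, using a seeded generator so it is reproducible) that advanced indexing and the ravel_multi_index route give the same 1D result:

```python
import numpy as np

a = np.arange(4000).reshape(1000, 4)
rng = np.random.default_rng(0)
index = np.stack([rng.integers(0, 1000, 2000),
                  rng.integers(0, 3, 2000)], axis=1)

fancy = a[index[:, 0], index[:, 1]]                        # advanced indexing
flat = np.take(a, np.ravel_multi_index(index.T, a.shape))  # flat indices
assert np.array_equal(fancy, flat) and fancy.ndim == 1
```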

Python: Taking the outer product of each row of a matrix with itself, summing, then returning a vector of sums

Say I have a matrix A of dimension N by M.
I wish to return an N-dimensional vector V where the nth element is the double sum of all pairwise products of the entries in the nth row of A.
In loops, I guess I could do:
V = np.zeros(A.shape[0])
for n in range(A.shape[0]):
    for i in range(A.shape[1]):
        for j in range(A.shape[1]):
            V[n] += A[n, i] * A[n, j]
I want to vectorise this, and I guess I could do:
V_temp = np.einsum('ij,ik->ijk', A, A)
V = np.einsum('ijk->i', V_temp)
But I don't think this is a very memory-efficient way, as the intermediate V_temp stores the whole outer products when all I need are the sums. Is there a better way to do this?
Thanks
You can use
V = np.einsum("ni,nj->n", A, A)
You are actually calculating
A.sum(-1)**2
In other words, the sum over an outer product is just the product of the sums of the factors.
Demo:
A = np.random.random((1000,1000))
np.allclose(np.einsum('ij,ik->i', A, A), A.sum(-1)**2)
# True
t = timeit.timeit('np.einsum("ij,ik->i",A,A)', globals=dict(A=A,np=np), number=10)*100; f"{t:8.4f} ms"
# '948.4210 ms'
t = timeit.timeit('A.sum(-1)**2', globals=dict(A=A,np=np), number=10)*100; f"{t:8.4f} ms"
# ' 0.7396 ms'
Perhaps you can use
np.einsum('ij,ik->i', A, A)
or the equivalent
np.einsum(A, [0,1], A, [0,2], [0])
On a 2015 Macbook, I get
In [35]: A = np.random.rand(100,100)
In [37]: %timeit for_loops(A)
640 ms ± 24.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [38]: %timeit np.einsum('ij,ik->i', A, A)
658 µs ± 7.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [39]: %timeit np.einsum(A, [0,1], A, [0,2], [0])
672 µs ± 19.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
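A quick consistency check of my own that the triple loop from the question, the einsum contraction, and the row-sum-squared identity all agree on a small random matrix:

```python
import numpy as np

A = np.random.rand(50, 40)

# Triple loop from the question, for reference
V = np.zeros(A.shape[0])
for n in range(A.shape[0]):
    for i in range(A.shape[1]):
        for j in range(A.shape[1]):
            V[n] += A[n, i] * A[n, j]

assert np.allclose(V, np.einsum('ij,ik->i', A, A))
assert np.allclose(V, A.sum(-1) ** 2)
```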

How to sum a 2d array in Python?

I want to sum a 2 dimensional array in python:
Here is what I have:
def sum1(input):
    sum = 0
    for row in range(len(input)-1):
        for col in range(len(input[0])-1):
            sum = sum + input[row][col]
    return sum

print sum1([[1, 2],[3, 4],[5, 6]])
It displays 4 instead of 21 (1+2+3+4+5+6 = 21). Where is my mistake?
I think this is better:
>>> x=[[1, 2],[3, 4],[5, 6]]
>>> sum(sum(x,[]))
21
You could rewrite that function as:
def sum1(input):
    return sum(map(sum, input))
Basically, map(sum, input) returns the sums across all your rows; the outermost sum then adds up that list.
Example:
>>> a=[[1,2],[3,4]]
>>> sum(map(sum, a))
10
This is yet another alternative solution:
In [1]: a=[[1, 2],[3, 4],[5, 6]]
In [2]: sum([sum(i) for i in a])
Out[2]: 21
And numpy solution is just:
import numpy as np
x = np.array([[1, 2],[3, 4],[5, 6]])
Result:
>>> b = np.sum(x)
>>> print(b)
21
Better still, forget the index counters and just iterate over the items themselves:
def sum1(input):
    my_sum = 0
    for row in input:
        my_sum += sum(row)
    return my_sum
print sum1([[1, 2],[3, 4],[5, 6]])
One of the nice (and idiomatic) features of Python is letting it do the counting for you. sum() is a built-in and you should not use names of built-ins for your own identifiers.
This is the issue:
for row in range(len(input)-1):
    for col in range(len(input[0])-1):
Try:
for row in range(len(input)):
    for col in range(len(input[0])):
Python's range(x) already goes from 0 to x-1.
range(...)
range([start,] stop[, step]) -> list of integers
Return a list containing an arithmetic progression of integers.
range(i, j) returns [i, i+1, i+2, ..., j-1]; start (!) defaults to 0.
When step is given, it specifies the increment (or decrement).
For example, range(4) returns [0, 1, 2, 3]. The end point is omitted!
These are exactly the valid indices for a list of 4 elements.
range() in Python excludes the stop value. In other words, range(1, 5) covers [1, 5), that is, 1 through 4. So you should just use len(input) to iterate over the rows/columns:
def sum1(input):
    sum = 0
    for row in range(len(input)):
        for col in range(len(input[0])):
            sum = sum + input[row][col]
    return sum
Don't put -1 in range(len(input)-1); instead use:
range(len(input))
range already stops one before its argument, so there is no need to subtract 1 explicitly.
def sum1(input):
    return sum([sum(x) for x in input])
Quick answer: use
total = sum(map(sum, array))
where array is your 2D list.
In Python 3.7
import numpy as np
x = np.array([ [1,2], [3,4] ])
sum(sum(x))
outputs
10
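Note what sum(sum(x)) actually does on a NumPy array: the inner sum adds the rows elementwise into a 1D array of column totals, and the outer sum then adds up its entries. A small sketch making the two stages explicit:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
col_totals = sum(x)      # elementwise over rows -> array([4, 6])
total = sum(col_totals)  # 4 + 6 -> 10
print(total)  # 10
```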
It seems the general consensus is that numpy is a complicated solution compared to the simpler algorithms, but for the sake of completeness:
import numpy as np

def addarrays(arr):
    b = np.sum(arr, axis=0)  # column sums (np.sum(arr) with no axis gives the total directly)
    return sum(b)

array_1 = [
    [1, 2],
    [3, 4],
    [5, 6]
]
print(addarrays(array_1))
This appears to be the preferred solution:
x=[[1, 2],[3, 4],[5, 6]]
sum(sum(x,[]))
def sum1(input):
    sum = 0
    for row in input:
        for col in row:
            sum += col
    return sum
print(sum1([[1, 2],[3, 4],[5, 6]]))
Speed comparison
import random
import timeit
import numpy as np
x = [[random.random() for i in range(100)] for j in range(100)]
xnp = np.array(x)
Methods
print("Sum python array:")
%timeit sum(map(sum,x))
%timeit sum([sum(i) for i in x])
%timeit sum(sum(x,[]))
%timeit sum([x[i][j] for i in range(100) for j in range(100)])
print("Convert to numpy, then sum:")
%timeit np.sum(np.array(x))
%timeit sum(sum(np.array(x)))
print("Sum numpy array:")
%timeit np.sum(xnp)
%timeit sum(sum(xnp))
Results
Sum python array:
130 µs ± 3.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
149 µs ± 4.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
3.05 ms ± 44.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.58 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Convert to numpy, then sum:
1.36 ms ± 90.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.63 ms ± 26.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sum numpy array:
24.6 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
301 µs ± 4.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
def sum1(input):
    sum = 0
    for row in range(len(input)):
        for col in range(len(input[0])):
            sum = sum + input[row][col]
    return sum

print(sum1([[1, 2],[3, 4],[5, 6]]))
Besides adding parentheses to the print call (required in Python 3), the -1 in both range calls has to go; otherwise the last row and column are skipped.
