Let's consider the distance d(a, b) = number of digits which are pairwise different in a and b, e.g.:
d(1003000000, 1000090000) = 2 # the 4th and 6th digits don't match
(we only work with 10-digit numbers) and this list:
L = [2678888873,
2678878873, # distance 1 from L[0]
1000000000,
1000040000, # distance 1 from L[2]
1000300000, # distance 1 from L[2], distance 2 from L[3]
1000300009, # distance 1 from L[4], distance 2 from L[2]
]
I would like to find the minimal number of points P such that each integer in the list is at distance <= 1 from a point in P.
Here I think this number is 3: every number in the list is at distance <= 1 from 2678888873, 1000000000, or 1000300009.
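A quick way to check that claim (a sketch; d10 and centers are just names I'm using here):
def d10(a, b):
    # digit-wise Hamming distance between two 10-digit numbers
    return sum(x != y for x, y in zip(f"{a:010d}", f"{b:010d}"))

centers = [2678888873, 1000000000, 1000300009]
print(all(min(d10(x, c) for c in centers) <= 1 for x in L))  # True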
I imagine an O(n^2) algorithm is possible by first computing a distance matrix i.e. M[i, j] = d(L[i], L[j]).
Is there a better way to do this, especially using Numpy? (maybe there's a built-in algorithm in Numpy/Scipy?)
PS: If we see these 10-digit integers as strings, we're close to finding a minimal number of clusters in a list of many words with a Levenshtein distance.
PS2: I now realize this distance has a name for strings: Hamming distance.
Let's see what we know from the distance metric. Given a number P (not necessarily in L), if two members of L are within distance 1 of P, they each share 9 digits with P, but not necessarily the same ones, so they are only guaranteed to share 8 digits with each other. Conversely, any two numbers at distance 2 from each other are guaranteed to have two unique Ps that are at distance 1 from each of them (and at distance 2 from each other as well). You can use this information to reduce the amount of brute-force effort required to optimize the selection of P.
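For example, take 1000040000 and 1000300000, which are at distance 2. A small sketch (plain Python strings, my variable names) makes the two midpoint Ps explicit:
a, b = "1000040000", "1000300000"
diff = [i for i in range(10) if a[i] != b[i]]    # positions 4 and 5 differ
# Swap the first differing digit each way to get the two candidate Ps:
p0 = a[:diff[0]] + b[diff[0]] + a[diff[0] + 1:]  # '1000340000'
p1 = b[:diff[0]] + a[diff[0]] + b[diff[0] + 1:]  # '1000000000'
Each of p0 and p1 is at distance 1 from both numbers; this is exactly the swap construction used below.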
Let's say you have a distance matrix. You can immediately discard rows (or columns) with no off-diagonal entries less than 3: those numbers are automatically their own clusters. For the remaining entries that are equal to 2, construct a list of possible P values. Find the number of elements of L that are within 1 of each element of P (another distance matrix). Sort P by the number of neighbors, and select. You will need to update the matrix at each iteration as you remove members with maximal neighbors, to avoid inefficient grouping due to overlap (members of L that are near multiple members of P).
You can compute a distance matrix for L in numpy by first converting it to a 2D array of digits:
L = np.array([2678888873, 2678878873, 1000000000, 1000040000, 1000300000, 1000300009])
z = 10 # Number of digits
n = len(L) # Number of numbers
dec = 10**np.arange(z).reshape(-1, 1).astype(np.int64)  # column of powers 10^0 .. 10^9
digits = (L // dec) % 10  # row i holds digit i of each number, least significant first
digits is now a 10xN array:
array([[3, 3, 0, 0, 0, 9],
[7, 7, 0, 0, 0, 0],
[8, 8, 0, 0, 0, 0],
[8, 8, 0, 0, 0, 0],
[8, 7, 0, 4, 0, 0],
[8, 8, 0, 0, 3, 3],
[8, 8, 0, 0, 0, 0],
[7, 7, 0, 0, 0, 0],
[6, 6, 0, 0, 0, 0],
[2, 2, 1, 1, 1, 1]], dtype=int64)
You can compute the distance between digits and itself, or digits and any other 10xM array using != and sum along the right axis:
distance = (digits[:, None, :] != digits[..., None]).sum(axis=0)
The result:
array([[ 0, 1, 10, 10, 10, 10],
[ 1, 0, 10, 10, 10, 10],
[10, 10, 0, 1, 1, 2],
[10, 10, 1, 0, 2, 3],
[10, 10, 1, 2, 0, 1],
[10, 10, 2, 3, 1, 0]])
We are only concerned with the upper (or lower) triangle of that matrix, so we can immediately mask out the other triangle:
distance[np.tril_indices(n)] = z + 1
Find all candidate values of P: all elements of L, plus the constructed midpoints of every pair of elements at distance 2:
# Find indices of pairs that differ by 2
indices = np.nonzero(distance == 2)
# Extract those numbers as 10xKx2 array
d = digits[:, np.stack(indices, axis=1)]
# Compute where the difference is nonzero (Kx2)
locs = np.diff(d, axis=2).astype(bool).squeeze()
# Find the index of the first digit to replace (K)
s = np.argmax(locs, axis=0)
The extra values of P are constructed from each half of d, with the first differing digit (at index s) swapped in from the other half:
P0 = digits[:, indices[0]]
P1 = digits[:, indices[1]]
k = np.arange(s.size)
tmp = P0[s, k]
P0[s, k] = P1[s, k]
P1[s, k] = tmp
Pextra = np.unique(np.concatenate((P0, P1), axis=1), axis=1)  # drop duplicate columns
So now you can compute the total set of possibilities for P:
P = np.concatenate((digits, Pextra), axis=1)
distance2 = (P[:, None, :] != digits[..., None]).sum(axis=0)
You can discard any elements of Pextra that coincide with elements of digits, based on the distance:
mask = np.concatenate((np.ones(n, bool), distance2[:, n:].all(axis=0)))  # keep columns at nonzero distance from every element of L
P = P[:, mask]
distance2 = distance2[:, mask]
Now you can iteratively compare P with L and select the best values of P, removing the covered elements from the distance matrix at each step. A greedy selection from P will not necessarily be optimal, since an alternative combination may cover everything with fewer elements due to overlaps, but that is a matter for a simple (though somewhat expensive) graph traversal algorithm. The following snippet shows a simple greedy selection, which works fine for your toy example:
distMask = distance2 <= 1
quality = distMask.sum(axis=0)
clusters = []
accounted = 0
while accounted < n:
    # Get the cluster location
    best = np.argmax(quality)
    # Get the cluster number
    clusters.append(P[:, best].dot(dec).item())
    # Remove numbers in cluster from consideration
    accounted += quality[best]
    quality -= distMask[distMask[:, best], :].sum(axis=0)
The last couple of steps can be optimized using sets and graphs, but this shows a starting point for a valid approach. This is going to be slow for large data, but probably not prohibitively so. Do some benchmarks to decide how much time you want to spend optimizing vs just running the algorithm.
I would like to loop over the following check_matrix in such a way that the code recognizes whether the first and second elements are 1 and 1, or 1 and 2, etc. Then, for each separate class of pair, i.e. 1,1 or 1,2 or 2,2, the code should store in new matrices the sum of the last element (index 8 in this case) times exp(-i*q·(check_matrix[k][2:5]-check_matrix[k][5:8])), where i is the imaginary unit, k is the running index over check_matrix, and q is a vector defined below. There are 20 q vectors.
import numpy as np
q= []
for i in np.linspace(0, 10, 20):
    q.append(np.array((0, 0, i)))
q = np.array(q)
check_matrix = np.array([[1, 1, 0, 0, 0, 0, 0, -0.7977, -0.243293],
[1, 1, 0, 0, 0, 0, 0, 1.5954, 0.004567],
[1, 2, 0, 0, 0, -1, 0, 0, 1.126557],
[2, 1, 0, 0, 0, 0.5, 0.86603, 1.5954, 0.038934],
[2, 1, 0, 0, 0, 2, 0, -0.7977, -0.015192],
[2, 2, 0, 0, 0, -0.5, 0.86603, 1.5954, 0.21394]])
This means that in principle I will have 20 matrices of shape 2x2, one corresponding to each q vector.
At the moment my code gives only one matrix, which appears to be the last one, even though I am appending to Matrices. My code looks like this:
for i in range(2):
    i = i + 1
    for j in range(2):
        j = j + 1
        j_list = []
        Matrices = []
        for k in range(len(check_matrix)):
            if check_matrix[k][0] == i and check_matrix[k][1] == j:
                j_list.append(check_matrix[k][8]*np.exp(-1J*np.dot(q, (np.subtract(check_matrix[k][2:5], check_matrix[k][5:8])))))
        j_11 = np.sum(j_list)
        I_matrix[i-1][j-1] = j_11
        Matrices.append(I_matrix)
I_matrix is defined as below:
I_matrix= np.zeros((2,2),dtype=np.complex_)
At the moment I get the following output:
Matrices = [array([[-0.66071446-0.77603624j, -0.29038112+2.34855023j], [-0.31387562-0.08116629j, 4.2788 +0.j ]])]
But I want to get a matrix corresponding to each q value, meaning there should be 20 matrices in total in this case, where each 2x2 matrix element contains the sum for the corresponding pair class (1,1 or 1,2 or 2,1 or 2,2), arranged in the following manner:
array([[11., 12.],
[21., 22.]])
I shall highly appreciate your suggestion to correct it. Thanks in advance!
I am pretty sure this problem can be solved in an easier way, and I am not 100% sure that I understood you correctly, but here is some code that does what I think you want. If you have a way to check whether the results are valid, I suggest you do so.
import numpy as np
n = 20
q = np.zeros((20, 3))
q[:, -1] = np.linspace(0, 10, n)
check_matrix = np.array([[1, 1, 0, 0, 0, 0, 0, -0.7977, -0.243293],
[1, 1, 0, 0, 0, 0, 0, 1.5954, 0.004567],
[1, 2, 0, 0, 0, -1, 0, 0, 1.126557],
[2, 1, 0, 0, 0, 0.5, 0.86603, 1.5954, 0.038934],
[2, 1, 0, 0, 0, 2, 0, -0.7977, -0.015192],
[2, 2, 0, 0, 0, -0.5, 0.86603, 1.5954, 0.21394]])
check_matrix[:, :2] -= 1 # python indexing is zero based
matrices = np.zeros((n, 2, 2), dtype=np.complex_)
for i in range(2):
    for j in range(2):
        k_list = []
        for k in range(len(check_matrix)):
            if check_matrix[k][0] == i and check_matrix[k][1] == j:
                k_list.append(check_matrix[k][8] *
                              np.exp(-1J * np.dot(q, check_matrix[k][2:5]
                                                     - check_matrix[k][5:8])))
        matrices[:, i, j] = np.sum(k_list, axis=0)
NOTE: I changed your indices to consistent zero-based indexing.
Here is another approach where I replaced the k-loop with a vectorized version:
for i in range(2):
    for j in range(2):
        k = np.logical_and(check_matrix[:, 0] == i, check_matrix[:, 1] == j)
        temp = np.dot(check_matrix[k, 2:5] - check_matrix[k, 5:8], q[:, :, np.newaxis])[..., 0]
        temp = check_matrix[k, 8:] * np.exp(-1J * temp)
        matrices[:, i, j] = np.sum(temp, axis=0)
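If you want to drop the i/j loops entirely, here is a further sketch (my own, untested against your data) that accumulates all rows at once with np.add.at; it assumes the zero-based check_matrix and the n and q defined above:
i_idx = check_matrix[:, 0].astype(int)  # zero-based row labels
j_idx = check_matrix[:, 1].astype(int)  # zero-based column labels
# Phase for every (row of check_matrix, q vector) pair: shape (rows, n)
phases = np.exp(-1J * ((check_matrix[:, 2:5] - check_matrix[:, 5:8]) @ q.T))
contrib = check_matrix[:, 8:9] * phases
out = np.zeros((2, 2, n), dtype=np.complex_)
np.add.at(out, (i_idx, j_idx), contrib)  # duplicate (i, j) pairs are summed
matrices = out.transpose(2, 0, 1)        # shape (n, 2, 2)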
3 line solution
You asked for an efficient solution in your original title, so how about this three-liner, which avoids the nested loops and if statements and is thus hopefully faster?
fac = 2 * (check_matrix[:, 0] - 1) + (check_matrix[:, 1] - 1)
grp = np.split(check_matrix[:, 8], np.cumsum(np.unique(fac, return_counts=True)[1])[:-1])
[np.sum(x) for x in grp]
output:
[-0.23872600000000002, 1.126557, 0.023742000000000003, 0.21394]
How does it work?
I combine the first two columns into a single index, treating each as "bits" (i.e. base 2)
fac = 2 * (check_matrix[:, 0] - 1) + (check_matrix[:, 1] - 1)
( If you have indexes that exceed 2, you can still use this technique but you will need to use a different base to combine the columns. i.e. if your indices go from 1 to 18, you would need to multiply column 0 by a number equal to or larger than 18 instead of 2. )
So the result of the first line is
array([0., 0., 1., 2., 2., 3.])
Note as well that this assumes the data is ordered, with one column changing fastest; if that is not the case, you will need an extra step to sort the index and the original check_matrix together, as sketched below. In your example the data is ordered.
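That extra sorting step could look like this (a sketch):
order = np.argsort(fac, kind="stable")  # sort rows by the combined index
fac_sorted = fac[order]
grp = np.split(check_matrix[order, 8], np.cumsum(np.unique(fac_sorted, return_counts=True)[1])[:-1])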
The next step groups the data according to the index, and uses the solution posted here.
np.split(check_matrix[:, 8], np.cumsum(np.unique(fac, return_counts=True)[1])[:-1])
[array([-0.243293, 0.004567]), array([1.126557]), array([ 0.038934, -0.015192]), array([0.21394])]
i.e. it splits column 8 of check_matrix according to the grouping given by fac.
The last line then simply sums those groups. Knowing how the first two columns were combined into the single index allows you to map the results back; or you could simply append them to check_matrix as an extra column if you wanted.
As an example, I have 2 tensors: A = [1;2;3;4;5;6;7] and B = [2;3;2]. The idea is that I want to reduce A based on B, such that B's values describe how to group and sum A's values: B = [2;3;2] means the reduced A is the sum of the first 2 values, the next 3, and the last 2: A' = [(1+2);(3+4+5);(6+7)]. It is apparent that the sum of B must always equal the length of A. I'm trying to do this as efficiently as possible, preferably with specific functions or matrix operations from PyTorch/Python. Thanks!
Here is a solution.
First, we create an index tensor B_idx with the same size as A.
Then we accumulate (add) the elements of A according to the indices in B_idx using index_add_.
import torch

A = torch.arange(1, 8)
B = torch.tensor([2, 3, 2])
B_idx = [idx.repeat(times) for idx, times in zip(torch.arange(len(B)), B)]
B_idx = torch.cat(B_idx) # tensor([0, 0, 1, 1, 1, 2, 2])
A_sum = torch.zeros_like(B)
A_sum.index_add_(dim=0, index=B_idx, source=A)
print(A_sum) # tensor([ 3, 12, 13])
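As a side note, assuming a reasonably recent PyTorch, B_idx can also be built in a single call with torch.repeat_interleave:
B_idx = torch.repeat_interleave(torch.arange(len(B)), B)  # tensor([0, 0, 1, 1, 1, 2, 2])
A_sum = torch.zeros_like(B).index_add_(0, B_idx, A)       # index_add_ returns the tensor
print(A_sum)  # tensor([ 3, 12, 13])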
I have a 50x50 2D dimensional board with empty cells now. I want to fill 20% cells with 0, 30% cells with 1, 30% cells with 2 and 20% cells with 3. How to randomly throw these 4 numbers onto the board with the percentages?
import numpy as np
from numpy import random
dim = 50
map = [[" " for i in range(dim)] for j in range(dim)]
print(map)
One way to get this kind of randomness would be to start with a random permutation of the numbers from 0 to the total number of cells you have minus one.
perm = np.random.permutation(2500)
Now you split the permutation according to the proportions you want, and treat the entries of the permutation as indices into the array.
array = np.empty(2500)
p1 = int(0.2*2500)
p2 = int(0.3*2500)
p3 = int(0.3*2500)
array[perm[range(0, p1)]] = 0
array[perm[range(p1, p1 + p2)]] = 1
array[perm[range(p1 + p2, p1 + p2 + p3)]] = 2
array[perm[range(p1 + p2 + p3, 2500)]] = 3
array = array.reshape(50, 50)
This way you ensure the proportions for each number.
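An equivalent, more compact variant of the same idea (a sketch using np.repeat and np.random.shuffle):
counts = [p1, p2, p3, 2500 - p1 - p2 - p3]           # 500, 750, 750, 500
vals = np.repeat([0, 1, 2, 3], counts).astype(float)
np.random.shuffle(vals)                              # in-place random permutation
array = vals.reshape(50, 50)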
Since the percentages sum up to 1, you can start with a board of zeros
bsize = 50
board = np.zeros((bsize, bsize))
In this approach the board positions are interpreted as 1D positions; we then need a set of positions covering 80% of the board.
for i, pos in enumerate(np.random.choice(bsize**2, int(0.8*bsize**2), replace=False)):
    # the first 30% will be set with 1
    if i < int(0.3*bsize**2):
        board[pos//bsize][pos%bsize] = 1
    # the second 30% (between 30% and 60%) will be set with 2
    elif i < int(0.6*bsize**2):
        board[pos//bsize][pos%bsize] = 2
    # the rest 20% (between 60% and 80%) will be set with 3
    else:
        board[pos//bsize][pos%bsize] = 3
At the end the last 20% of positions will remain as zeros
As suggested by @alexis in the comments, this approach can be simplified by using the shuffle method from the random module:
from random import shuffle
bsize = 50
board = np.zeros((bsize, bsize))
l = list(range(bsize**2))
shuffle(l)
for i, pos in enumerate(l):
    # the first 30% will be set with 1
    if i < int(0.3*bsize**2):
        board[pos//bsize][pos%bsize] = 1
    # the second 30% (between 30% and 60%) will be set with 2
    elif i < int(0.6*bsize**2):
        board[pos//bsize][pos%bsize] = 2
    # the rest 20% (between 60% and 80%) will be set with 3
    elif i < int(0.8*bsize**2):
        board[pos//bsize][pos%bsize] = 3
The last 20% of positions will remain as zeros again.
A different approach (admittedly it's probabilistic, so you won't get the exact proportions that the solution proposed by Brad Solomon guarantees):
import numpy as np
res = np.random.random((50, 50))
zeros = np.where(res <= 0.2, 0, 0)  # always zero; kept only for symmetry with the lines below
ones = np.where(np.logical_and(res <= 0.5, res > 0.2), 1, 0)
twos = np.where(np.logical_and(res <= 0.8, res > 0.5), 2, 0)
threes = np.where(res > 0.8, 3, 0)
final_result = zeros + ones + twos + threes
Running
np.unique(final_result, return_counts=True)
yielded
(array([0, 1, 2, 3]), array([499, 756, 754, 491]))
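Incidentally, the four np.where calls can be collapsed into a single np.digitize call (the boundary handling differs slightly, but for continuous draws that matters with probability zero):
final_result = np.digitize(res, [0.2, 0.5, 0.8])  # 0, 1, 2 or 3 per cell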
Here's an approach with np.random.choice to shuffle indices, then filling those indices with repeats of the inserted ints. It will fill the array in the exact proportions that you specify:
import numpy as np
np.random.seed(444)
board = np.zeros(50 * 50, dtype=np.uint8).flatten()
# The "20% cells with 0" can be ignored since that is the default.
#
# This will work as long as the proportions are "clean" ints
# (I.e. mod to 0; 2500 * 0.2 is a clean 500. Otherwise, need to do some rounding.)
rpt = (board.shape[0] * np.array([0.3, 0.3, 0.2])).astype(int)
repl = np.repeat([1, 2, 3], rpt)
idx = np.random.choice(board.shape[0], size=repl.size, replace=False)
board[idx] = repl
board = board.reshape((50, 50))
Resulting frequencies:
>>> np.unique(board, return_counts=True)
(array([0, 1, 2, 3], dtype=uint8), array([500, 750, 750, 500]))
>>> board
array([[1, 3, 2, ..., 3, 2, 2],
[0, 0, 2, ..., 0, 2, 0],
[1, 1, 1, ..., 2, 1, 0],
...,
[1, 1, 2, ..., 2, 2, 2],
[1, 2, 2, ..., 2, 1, 2],
[2, 2, 2, ..., 1, 0, 1]], dtype=uint8)
Approach
Flatten the board. Easier to work with indices when the board is (temporarily) one-dimensional.
rpt is a 1d vector of the number of repeats per int. It gets "zipped" together with [1, 2, 3] to create repl, which is length 2000. (80% of the size of the board; you don't need to worry about the 0s in this example.)
The indices of the flattened array are effectively shuffled (idx), and the length of this shuffled array is constrained to the size of the replacement candidates. Lastly, those indices in the 1d board are filled with the replacements, after which it can be made 2d again.
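For what it's worth, the shuffled indices could equivalently be drawn as a prefix of a full permutation (a sketch):
idx = np.random.permutation(50 * 50)[:repl.size]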
I have the following list:
Sum=[54,1536,36,14,9,360]
I need to generate 4 other lists, where each list will consist of 6 random numbers starting from 0, and the numbers will add up to the corresponding value in Sum. For example:
l1=[a,b,c,d,e,f] where a+b+c+d+e+f=54
l2=[g,h,i,j,k,l] where g+h+i+j+k+l=1536
and so on up to l6. And I need to do this in Python. Can it be done?
Generating a list of random numbers that sum to a given integer is a very difficult task. Keeping track of the remaining quantity and generating items sequentially from the remaining available quantity results in a non-uniform distribution, where the first numbers in the series are generally much larger than the others. On top of that, the last one will always be nonzero, because the previous items in the list will never sum up to the desired total (random generators usually use half-open intervals that exclude the maximum). Shuffling the list after generation might help a bit, but won't generally give good results either.
A solution could be to generate random numbers and then normalize the result, eventually rounding it if you need them to be integers.
import numpy as np
totals = np.array([54, 1536, 36, 14])  # don't use Sum as a name: sum is a Python built-in and it's confusing
a = np.random.random((6, 4)) # create random numbers
a = a/np.sum(a, axis=0) * totals # force them to sum to totals
# Ignore the following if you don't need integers
a = np.round(a) # transform them into integers
remainings = totals - np.sum(a, axis=0) # check if there are corrections to be done
for j, r in enumerate(remainings):  # implement the correction
    step = 1 if r > 0 else -1
    while r != 0:
        i = np.random.randint(6)
        if a[i, j] + step >= 0:
            a[i, j] += step
            r -= step
Each column of a represents one of the lists you want.
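A quick sanity check of the result (my addition):
assert np.all(a.sum(axis=0) == totals)  # every column sums exactly to its target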
Hope this helps.
This might not be the most efficient way, but it will work:
totals = [54, 1536, 36, 14]
nums = []
for i in totals:
    x = np.random.randint(0, i, size=(6,))
    while sum(x) != i:
        x = np.random.randint(0, i, size=(6,))
    nums.append(x)
print(nums)
[array([ 3, 19, 21, 11,  0,  0]), array([111, 155, 224, 511, 457,  78]), array([ 8,  5,  4, 12,  2,  5]), array([3, 1, 3, 2, 1, 4])]
Here is a much more efficient way to do it:
totals = [54, 1536, 36, 14, 9, 360, 0]
nums = []
for i in totals:
    if i == 0:
        nums.append([0 for _ in range(6)])
        continue
    total = i
    temp = []
    for _ in range(5):
        val = np.random.randint(0, total)
        temp.append(val)
        total -= val
    temp.append(total)
    nums.append(temp)
print(nums)
[[22, 4, 16, 0, 2, 10], [775, 49, 255, 112, 185, 160], [2, 10, 18, 2, 0, 4], [10, 2, 1, 0, 0, 1], [8, 0, 0, 0, 0, 1], [330, 26, 1, 0, 2, 1], [0, 0, 0, 0, 0, 0]]
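For completeness, NumPy also has a direct primitive for drawing k non-negative integers with a fixed sum, np.random.multinomial, although its distribution differs from the schemes above:
import numpy as np

totals = [54, 1536, 36, 14, 9, 360]
nums = [np.random.multinomial(t, [1 / 6] * 6) for t in totals]  # each entry sums exactly to its total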
Given the following array:
complete_matrix = numpy.array([
[0, 1, 2, 4],
[1, 0, 3, 5],
[2, 3, 0, 6],
[4, 5, 6, 0]])
I would like to identify the row with the highest average, excluding the diagonal zeros.
So, in this case, I would be able to identify row 3 (complete_matrix[3]) as the row with the highest average.
Note that the presence of the zeros doesn't affect which row has the highest mean, because every row contains exactly one zero and the same number of elements. Therefore, we just take the mean of each row, and then ask for the index of the largest element.
# Take the mean along axis 1, i.e. collapse into a length-N array of row means
means = np.mean(complete_matrix, 1)
# Now just get the index of the largest mean
idx = np.argmax(means)
idx is now the index of the row with the highest mean!
You don't need to worry about the 0s; they shouldn't affect how the averages compare, since there will presumably be one in each row. Hence, you can do something like this to get the index of the row with the highest average:
>>> import numpy as np
>>> complete_matrix = np.array([
... [0, 1, 2, 4],
... [1, 0, 3, 5],
... [2, 3, 0, 6],
... [4, 5, 6, 0]])
>>> np.argmax(np.mean(complete_matrix, axis=1))
3
Reference:
numpy.mean
numpy.argmax
As pointed out by a lot of people, the presence of zeros isn't an issue as long as you have the same number of zeros in each column. Just in case your intention was to ignore all the zeros, preventing them from participating in the average computation, you can use weights to suppress their contribution. The following solution assigns weight 0 to zero entries and 1 otherwise:
numpy.argmax(numpy.average(complete_matrix,axis=0, weights=complete_matrix!=0))
You can always create a weight matrix where the weight is 0 for diagonal entries, and 1 otherwise.
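That could look like this (a sketch; the matrix here is symmetric, so rows and columns agree):
w = 1 - numpy.eye(complete_matrix.shape[0])  # 0 on the diagonal, 1 elsewhere
numpy.argmax(numpy.average(complete_matrix, axis=0, weights=w))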
This answer would actually fit your other question better, the one that was marked as a duplicate of this one (I don't know why, because it is not the same question...).
The presence of zeros can indeed affect the columns' or rows' average, for instance:
a = np.array([[ 0, 1, 0.9, 1],
[0.9, 0, 1, 1],
[ 1, 1, 0, 0.5]])
Without eliminating the diagonal, it would say that column 3 has the highest average; after eliminating the diagonal, the highest average belongs to column 1, and column 3 now has the lowest average of all columns!
You can correct the calculated mean using the lcm (least common multiple) of the number of rows with and without the diagonal, while guaranteeing that the correction is not applied where a diagonal element does not exist:
correction = column_sum/lcm(len(column), len(column)-1)
new_mean = mean + correction
This works because sum/(n-1) - sum/n = sum/(n*(n-1)), and lcm(n, n-1) = n*(n-1) since consecutive integers are coprime.
I copied the algorithm for lcm from this answer and proposed a solution for your case:
import numpy as np

def gcd(a, b):
    """Return greatest common divisor using Euclid's Algorithm."""
    while b:
        a, b = b, a % b
    return a

def lcm(a, b):
    """Return lowest common multiple."""
    return a * b // gcd(a, b)

def mymean(a):
    if len(a.diagonal()) < a.shape[1]:
        tmp = np.hstack((a.diagonal()*0 + 1, 0))
    else:
        tmp = a.diagonal()*0 + 1
    return np.mean(a, axis=0) + np.sum(a, axis=0)*tmp/lcm(a.shape[0], a.shape[0]-1)
Testing with the a given above:
mymean(a)
#array([ 0.95 , 1. , 0.95 , 0.83333333])
With another example:
b = np.array([[ 0, 1, 0.9, 0],
[0.9, 0, 1, 1],
[ 1, 1, 0, 0.5],
[0.9, 0.2, 1, 0],
[ 1, 1, 0.7, 0.5]])
mymean(b)
#array([ 0.95, 0.8 , 0.9 , 0.5 ])
With the corrected average, you just use np.argmax() to get the column index with the highest average, or np.argmin() to get the index of the column with the lowest average:
np.argmin(mymean(a))