How to find difference between all integers in an array - python

Can someone point me in the right direction to accomplish the following. I would really appreciate it.
Given the following column.
111
108
106
107
109
130
I would like to take the first number (111) and print the difference between it and each of the remaining values, in the order they appear.
I would then like to repeat the process starting from the second position (108), and so on until every row has been used as the starting point.
Lastly, I would like to display the biggest difference and its row number from the results.
The expected output is something along these lines:
Start biggest-difference row/position
111 19 5
108 22 5
106 24 5
107 23 5
109 24 5
130 24 2

You could use broadcasting:
import numpy as np
data = np.array([111, 108, 106, 107, 109, 130])
data - data[:, None]
# array([[  0,  -3,  -5,  -4,  -2,  19],
#        [  3,   0,  -2,  -1,   1,  22],
#        [  5,   2,   0,   1,   3,  24],
#        [  4,   1,  -1,   0,   2,  23],
#        [  2,  -1,  -3,  -2,   0,  21],
#        [-19, -22, -24, -23, -21,   0]])
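Building on the broadcasting idea, the biggest difference for each starting value and its row position could be pulled out with `np.abs`, `max` and `argmax`. This is only a minimal sketch: it assumes the question wants the absolute difference against every other value, and the names `diff`, `biggest` and `position` are illustrative rather than part of the original answer.

```python
import numpy as np

data = np.array([111, 108, 106, 107, 109, 130])

# Absolute pairwise differences: diff[i, j] = |data[j] - data[i]|
diff = np.abs(data - data[:, None])

biggest = diff.max(axis=1)      # largest difference for each starting value
position = diff.argmax(axis=1)  # zero-based row where that difference occurs

for start, b, p in zip(data, biggest, position):
    print(start, b, p)
```

Under that assumption the row for 109 comes out as 21 at position 5 (130 - 109), not the 24 shown in the question's sample output.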

Related

Scipy hierarchical clustering appropriate linkage method

Apologies, because I asked a similar question yesterday, but I feel that question lacked content; hopefully this one is easier to understand.
I have a symmetric matrix with pairwise distances between individuals (see below), and I want to cluster groups of individuals so that all members of a cluster have pairwise distances of zero. I have applied scipy.cluster.hierarchy using different linkage methods and clustering criteria for this, but I don't get the results I expect. In the example below I would argue that ind5 shouldn't be part of cluster #1 because its distance to ind9 is 1 and not 0.
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
import numpy as np
import pandas as pd
df = pd.read_csv(infile1, sep = '\t', index_col = 0)
print(df)
ind1 ind2 ind3 ind4 ind5 ind6 ind7 ind8 ind9
ind1 0 29 27 1 2 1 2 1 1
ind2 29 0 2 30 31 29 31 30 30
ind3 27 2 0 28 29 27 29 28 28
ind4 1 30 28 0 0 0 1 2 0
ind5 2 31 29 0 0 0 2 2 1
ind6 1 29 27 0 0 0 1 2 0
ind7 2 31 29 1 2 1 0 3 1
ind8 1 30 28 2 2 2 3 0 2
ind9 1 30 28 0 1 0 1 2 0
X = squareform(df.to_numpy())
print(X)
[29 27 1 2 1 2 1 1 2 30 31 29 31 30 30 28 29 27 29 28 28 0 0 1
2 0 0 2 2 1 1 2 0 3 1 2]
Z = linkage(X, 'single')
print(Z)
[[ 3. 4. 0. 2.]
[ 5. 9. 0. 3.]
[ 8. 10. 0. 4.]
[ 0. 11. 1. 5.]
[ 6. 12. 1. 6.]
[ 7. 13. 1. 7.]
[ 1. 2. 2. 2.]
[14. 15. 27. 9.]]
max_d = 0
clusters = fcluster(Z, max_d, criterion='distance')
sample_list = df.index.to_list()
clust_name_list = clusters.tolist()
result = pd.DataFrame({'Inds': sample_list, 'Clusters': clust_name_list})
print(result)
Inds Clusters
0 ind1 2
1 ind2 5
2 ind3 6
3 ind4 1
4 ind5 1
5 ind6 1
6 ind7 3
7 ind8 4
8 ind9 1
I was hoping that anybody more familiar with these methods could advise whether there is a linkage method that would exclude from a cluster any element (in this case ind5) with distance > 0 to at least one of the other elements in the cluster.
Thanks for your help!
Gonzalo
You can reinterpret your problem as the problem of finding cliques in a graph. The graph is obtained from your distance matrix by interpreting a distance of 0 as an edge between two nodes. Once you have the graph, you can use networkx (or some other graph theory library) to find its cliques. The cliques are the sets of nodes in which all pairwise distances are 0.
Here is your distance matrix (but note that your distances do not satisfy the triangle inequality):
In [136]: D
Out[136]:
array([[ 0, 29, 27, 1, 2, 1, 2, 1, 1],
[29, 0, 2, 30, 31, 29, 31, 30, 30],
[27, 2, 0, 28, 29, 27, 29, 28, 28],
[ 1, 30, 28, 0, 0, 0, 1, 2, 0],
[ 2, 31, 29, 0, 0, 0, 2, 2, 1],
[ 1, 29, 27, 0, 0, 0, 1, 2, 0],
[ 2, 31, 29, 1, 2, 1, 0, 3, 1],
[ 1, 30, 28, 2, 2, 2, 3, 0, 2],
[ 1, 30, 28, 0, 1, 0, 1, 2, 0]])
Convert the distance matrix to the adjacency matrix A:
In [137]: A = D == 0
In [138]: A.astype(int) # Display as integers for a more compact output.
Out[138]:
array([[1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 1, 1, 0, 0, 1],
[0, 0, 0, 1, 1, 1, 0, 0, 0],
[0, 0, 0, 1, 1, 1, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 1, 0, 1, 0, 0, 1]])
Create a networkx graph G, and find the cliques with nx.find_cliques:
In [139]: import networkx as nx
In [140]: G = nx.Graph(A)
In [141]: cliques = nx.find_cliques(G)
In [142]: list(cliques)
Out[142]: [[0], [1], [2], [3, 5, 8], [3, 5, 4], [6], [7]]
(The values in the lists are the indices; e.g. the clique [2] corresponds to the set of labels ['ind3'].)
Note that there are two nontrivial cliques, [3, 5, 8] and [3, 5, 4], and 3 and 5 occur in both. This is a consequence of your distances having this anomalous data: distance(ind5, ind4) = 0, and distance(ind4, ind9) = 0, but distance(ind5, ind9) = 1 (i.e. the triangle inequality is not satisfied). So, by your definition of a "cluster", there are two possible nontrivial clusters: [ind4, ind6, ind9] or [ind4, ind5, ind6].
Finally, note the warning in the networkx documentation: "Finding the largest clique in a graph is NP-complete problem, so most of these algorithms have an exponential running time". If your distance matrix is large, this calculation could take a very long time!
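A candidate cluster can also be checked directly against the distance matrix, without building a graph. The sketch below uses only numpy; `is_zero_distance_cluster` is a hypothetical helper name introduced here for illustration, not something from networkx or scipy.

```python
import numpy as np

D = np.array([[ 0, 29, 27,  1,  2,  1,  2,  1,  1],
              [29,  0,  2, 30, 31, 29, 31, 30, 30],
              [27,  2,  0, 28, 29, 27, 29, 28, 28],
              [ 1, 30, 28,  0,  0,  0,  1,  2,  0],
              [ 2, 31, 29,  0,  0,  0,  2,  2,  1],
              [ 1, 29, 27,  0,  0,  0,  1,  2,  0],
              [ 2, 31, 29,  1,  2,  1,  0,  3,  1],
              [ 1, 30, 28,  2,  2,  2,  3,  0,  2],
              [ 1, 30, 28,  0,  1,  0,  1,  2,  0]])

def is_zero_distance_cluster(D, members):
    """Return True if every pair of `members` is at distance 0 in D."""
    sub = D[np.ix_(members, members)]  # sub-matrix restricted to the members
    return bool((sub == 0).all())

print(is_zero_distance_cluster(D, [3, 5, 8]))     # ind4, ind6, ind9
print(is_zero_distance_cluster(D, [3, 4, 5]))     # ind4, ind5, ind6
print(is_zero_distance_cluster(D, [3, 4, 5, 8]))  # cluster 1 from the question
```

The first two calls correspond to the two cliques found above; the third is the four-member cluster from the question, which fails the check because distance(ind5, ind9) = 1.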
Your solution is correct!
You are getting the following clusters:
cluster 1 with elements ind4, ind5, ind6 and ind9 (with single linkage, each member is at distance 0 from at least one other member of the cluster).
cluster 2 with element ind1
cluster 3 with element ind7
cluster 4 with element ind8
cluster 5 with element ind2
cluster 6 with element ind3
Only the elements at distance 0 are clustered together in cluster 1, as you require. Clusters 2 to 6 are degenerate clusters, with a single isolated element.
Let's modify the distances so that more proper clusters are created:
X = np.array([ 0, 27, 1, 2, 1, 2, 1, 1,
2, 30, 31, 29, 31, 30, 30,
28, 29, 27, 29, 28, 28,
0, 0, 1, 2, 0,
0, 2, 2, 1,
1, 2, 0,
0, 1,
2])
Z = linkage(X, 'single')
max_d = 0
clusters = fcluster(Z, max_d, criterion='distance')
print("Clusters:", clusters)
for cluster_id in np.unique(clusters):
    members = np.where(clusters == cluster_id)[0]
    print(f"Cluster {cluster_id} has members {members}")
Getting:
Clusters: [2 2 4 3 3 3 1 1 3]
Cluster 1 has members [6 7]
Cluster 2 has members [0 1]
Cluster 3 has members [3 4 5 8]
Cluster 4 has members [2]
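If the requirement is that every pair inside a cluster is at distance 0, one option worth trying is complete linkage, where the merge height of two clusters is their largest pairwise distance. The sketch below is an illustration under that assumption, using the question's original matrix. Because the data contain ties at distance 0, which of the possible zero-distance groups is returned may depend on the merge order, so no particular grouping is claimed here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

D = np.array([[ 0, 29, 27,  1,  2,  1,  2,  1,  1],
              [29,  0,  2, 30, 31, 29, 31, 30, 30],
              [27,  2,  0, 28, 29, 27, 29, 28, 28],
              [ 1, 30, 28,  0,  0,  0,  1,  2,  0],
              [ 2, 31, 29,  0,  0,  0,  2,  2,  1],
              [ 1, 29, 27,  0,  0,  0,  1,  2,  0],
              [ 2, 31, 29,  1,  2,  1,  0,  3,  1],
              [ 1, 30, 28,  2,  2,  2,  3,  0,  2],
              [ 1, 30, 28,  0,  1,  0,  1,  2,  0]])

X = squareform(D).astype(float)       # condensed distance vector
Z = linkage(X, 'complete')            # merge height = largest pairwise distance
clusters = fcluster(Z, 0, criterion='distance')

for cluster_id in np.unique(clusters):
    members = np.where(clusters == cluster_id)[0]
    # Largest distance between any two members of this cluster
    worst = D[np.ix_(members, members)].max()
    print(f"Cluster {cluster_id}: members {members}, largest internal distance {worst}")
```

With a threshold of 0, no flat cluster can contain a pair at distance greater than 0, which is the property the question asks for.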

Comparing elements of the same multi dimensional array

I have a multidimensional array in this format:
Cjk = [[81 51 31] [82 47 54] [34 55 64] [96 73 43]];
How can I get the minimum value at each index across the contained arrays?
I want this output:
34 47 31 # the minimum values, comparing the values at the same index of each array
I have tried some methods, but they were unsuccessful because I had to work with i and j explicitly; the array Cjk will get more values over time, so the solution needs to be scalable.
You want to find the minimum in each column. You can use zip here.
Cjk = [[81, 51, 31], [82, 47, 54], [34, 55, 64], [96, 73, 43]]
min_cols=[min(lst) for lst in zip(*Cjk)]
# [34, 47, 31]
You can do this,
In [21]: list(map(lambda x:min(x),zip(*Cjk)))
Out[21]: [34, 47, 31]
You can import numpy and find the minimums and maximums along the rows and columns of the matrix using the axis parameter.
For example:
>>> import numpy as np
>>> x = -np.matrix(np.arange(12).reshape((3,4))); x
matrix([[ 0, -1, -2, -3],
[ -4, -5, -6, -7],
[ -8, -9, -10, -11]])
>>> x.min()
-11
>>> x.min(0)
matrix([[ -8, -9, -10, -11]])
>>> x.min(1)
matrix([[ -3],
[ -7],
[-11]])
Check this https://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.min.html
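Applied to the array from the question, the same `axis` idea might look like the following. This sketch uses a plain `np.array` rather than `np.matrix`, which is a choice made here for illustration; `col_min` is an illustrative name.

```python
import numpy as np

Cjk = np.array([[81, 51, 31],
                [82, 47, 54],
                [34, 55, 64],
                [96, 73, 43]])

# axis=0 compares values down each column, i.e. across the inner arrays
col_min = Cjk.min(axis=0)
print(col_min)  # [34 47 31]
```

Adding more rows to `Cjk` needs no change to the call, so it scales with the data.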

Converting Matrix Definition to Zero-Indexed Notation - Numpy

I am trying to construct a numpy array (a 2-dimensional numpy array, i.e. a matrix) from a paper that uses non-standard indexing to define the matrix, i.e. the top left element is $q_{1,2}$ instead of $q_{0,0}$.
Define the $n \times (n-2)$ matrix $Q$ by its elements $q_{i,j}$ for $i = 1, \dots, n$ and $j = 2, \dots, n-1$, given by
$q_{j-1,j} = h_{j-1}^{-1}$, $q_{j,j} = -h_{j-1}^{-1} - h_j^{-1}$ and $q_{j+1,j} = h_j^{-1}$. (I have posted this in Latex form here: http://www.texpaste.com/n/8vwds4fx)
I have tried to implement in python like this:
# n = u_s.size
# n = 299 for this example
n = 299
Q = np.zeros((n,n-2))
for i in range(0,n+1):
    for j in range(2,n):
        Q[j-1,j] = 1.0/h[j-1]
        Q[j,j] = -1.0/h[j-1] - 1.0/h[j]
        Q[j+1,j] = 1.0/h[j]
But I always get the error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-54-c07a3b1c81bb> in <module>()
1 for i in range(1,n+1):
2 for j in range(2,n-1):
----> 3 Q[j-1,j] = 1.0/h[j-1]
4 Q[j,j] = -1.0/h[j-1] - 1.0/h[j]
5 Q[j+1,j] = 1.0/h[j]
IndexError: index 297 is out of bounds for axis 1 with size 297
I initially thought I could decrement both i and j in my for loop to keep edge cases safe, as a quick way to move to zero-indexed notation, but this hasn't worked. I also tried incrementing and modifying the range().
Is there a way to convert this definition to one that python can handle? Is this a common issue?
Simplifying the problem to make the assignment pattern obvious:
In [228]: h=np.arange(10,15)
In [229]: Q=np.zeros((5,5),int)
In [230]: for j in range(1,5):
     ...:     Q[j-1:j+2,j] = h[j-1:j+2]
In [231]: Q
Out[231]:
array([[ 0, 10, 0, 0, 0],
[ 0, 11, 11, 0, 0],
[ 0, 12, 12, 12, 0],
[ 0, 0, 13, 13, 13],
[ 0, 0, 0, 14, 14]])
Assignment to the partial first and last columns may need tweaking. Here's the equivalent built from diagonals:
In [232]: np.diag(h,0)+np.diag(h[:-1],1)+np.diag(h[1:],-1)
Out[232]:
array([[10, 10, 0, 0, 0],
[11, 11, 11, 0, 0],
[ 0, 12, 12, 12, 0],
[ 0, 0, 13, 13, 13],
[ 0, 0, 0, 14, 14]])
With the h[j-1], h[j] indexing this diagonal assignment probably needs tweaking, but it should be a useful starting point.
Selecting h values more like what you use (skipping the 1/h for now):
In [238]: Q=np.zeros((5,5),int)
In [239]: for j in range(1,4):
     ...:     Q[j-1:j+2,j] =[h[j-1],h[j-1]+h[j], h[j]]
     ...:
In [240]: Q
Out[240]:
array([[ 0, 10, 0, 0, 0],
[ 0, 21, 11, 0, 0],
[ 0, 11, 23, 12, 0],
[ 0, 0, 12, 25, 0],
[ 0, 0, 0, 13, 0]])
I'm skipping the two partial end columns for now. The first slicing approach allowed me to be a bit sloppy, since it's ok to slice 'off the end'. The end columns, if set, will require their own expressions.
In [241]: j=0; Q[j:j+2,j] =[h[j], h[j]]
In [242]: j=4; Q[j-1:j+1,j] =[h[j-1],h[j-1]+h[j]]
In [243]: Q
Out[243]:
array([[10, 10, 0, 0, 0],
[10, 21, 11, 0, 0],
[ 0, 11, 23, 12, 0],
[ 0, 0, 12, 25, 13],
[ 0, 0, 0, 13, 27]])
The relevant diagonal pieces are still evident:
In [244]: h[1:]+h[:-1]
Out[244]: array([21, 23, 25, 27])
The equation doesn't contain any value for i; it refers only to j. Q should be a matrix of dimension (n+2) x (n+2). For j = 1, it refers to Q[0,1], Q[1,1] and Q[2,1]. For j = n, it refers to Q[n-1,n], Q[n,n] and Q[n+1,n]. So Q should have indices from 0 to n+1, which is n+2 entries per axis.
I don't think you need the i loop. You can get your result with only the j loop from 1 to n, but Q should be indexed from 0 to n+1.
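Going back to the definition as stated in the question (an n x (n-2) matrix with rows 1..n and columns 2..n-1), one way to move to zero-based indexing is to shift rows by 1 and columns by 2, so that paper column j becomes array column c = j - 2. The sketch below assumes that h holds n-1 values h_1..h_{n-1} stored zero-based as h[0]..h[n-2]; that length is an assumption, since the question does not define h, and `build_q` is a hypothetical helper name.

```python
import numpy as np

def build_q(h):
    """Build the n x (n-2) matrix Q from n-1 values h (assumed h_1..h_{n-1})."""
    h = np.asarray(h, dtype=float)
    n = h.size + 1
    inv = 1.0 / h
    Q = np.zeros((n, n - 2))
    cols = np.arange(n - 2)                      # c = j - 2 for j = 2..n-1
    Q[cols, cols] = inv[:-1]                     # q_{j-1,j} = 1/h_{j-1}
    Q[cols + 1, cols] = -inv[:-1] - inv[1:]      # q_{j,j}   = -1/h_{j-1} - 1/h_j
    Q[cols + 2, cols] = inv[1:]                  # q_{j+1,j} = 1/h_j
    return Q

Q = build_q([1.0, 2.0, 4.0])
print(Q)
```

Each column holds three consecutive non-zero entries that sum to zero, and no index ever reaches past the n-2 columns, which is what triggered the IndexError in the question.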

Python Randomly assign a list from a set number

What I'm trying to do is make a list that gets filled with different combinations of numbers (not evenly split) that all add up to a predefined number.
For example, if I have a variable total = 50 and a list that holds 7 numbers, then each time I generate and print the list in a loop, the results should be completely different, with some values being huge and others near empty or empty. I don't want any restriction on the range of a value (it could come out as 0 or the entire 50, and the next time the list may even be balanced).
Is this possible?
Thanks
EDIT: I've gotten this far, but it seems to prioritize the ending. How can I make each variable have an equal chance of high or low numbers?
import random
tot = 50
size = 7
s = 0
run = 7
num = {}
while run > 0:
    num[run] = random.randint(s,tot)
    tot -= num[run]
    run -= 1
print(str(num))
Disclaimer: I don't mind what this code is meant to be.
from random import randint, seed
seed(345)
def bizarre(total, slots):
    tot = total
    acc = []
    # draw each of the first slots-1 values from whatever is left
    for _ in range(slots-1):
        r = randint(0,tot)
        tot -= r
        acc.append(r)
    # the last slot takes the remainder so the list sums to total
    acc.append(total-sum(acc))
    return acc
# testing code
for i in range(10):
    tot = randint(50,80)
    n = randint(5,10)
    b = bizarre(tot, n)
    print("%3d %3d %s -> %d" % (tot, n, b, sum(b)))
Output
73 5 [73, 0, 0, 0, 0] -> 73
54 6 [36, 5, 9, 0, 3, 1] -> 54
60 7 [47, 6, 6, 1, 0, 0, 0] -> 60
69 7 [3, 48, 15, 3, 0, 0, 0] -> 69
72 8 [36, 18, 18, 0, 0, 0, 0, 0] -> 72
65 8 [17, 32, 13, 3, 0, 0, 0, 0] -> 65
54 7 [33, 13, 0, 2, 4, 1, 1] -> 54
54 6 [7, 11, 26, 3, 5, 2] -> 54
67 7 [62, 5, 0, 0, 0, 0, 0] -> 67
67 8 [28, 25, 1, 0, 10, 3, 0, 0] -> 67
If you want a list of n random numbers that add up to a value x, create n-1 random numbers. The last number is then the difference between x and the sum of those n-1 numbers. For example, if you want a list of size three that adds up to 5, create two numbers randomly, say 1 and 2. 1+2 = 3 and 5-3 = 2, so the list is 1, 2, 2.
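The sequential draws above favour the early slots, because later slots can only take what is left. If every slot should have the same chance of a high or low value, one alternative is the "stars and bars" construction: place slots-1 dividers at random among total+slots-1 cells and read off the gaps. This swaps in a different technique from the answers above, and `random_partition` is a hypothetical name used only for this sketch.

```python
import random

def random_partition(total, slots):
    """Split `total` into `slots` non-negative integers, all splits equally likely."""
    cells = total + slots - 1
    # Pick distinct positions for the slots-1 dividers ("bars")
    bars = sorted(random.sample(range(cells), slots - 1))
    bounds = [-1] + bars + [cells]
    # Each value is the number of cells ("stars") between neighbouring bars
    return [b - a - 1 for a, b in zip(bounds, bounds[1:])]

for _ in range(5):
    parts = random_partition(50, 7)
    print(parts, "->", sum(parts))
```

Since every arrangement of dividers is equally likely, no position in the list is treated differently from any other, while values of 0 or the full total remain possible.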

Iterate over a matrix, sum over some rows and add the result to another array

Hi there I have the following matrix
[[ 47 43 51 81 54 81 52 54 31 46]
[ 35 21 30 16 37 11 35 30 39 37]
[ 8 17 11 2 5 4 11 9 17 10]
[ 5 9 4 0 1 1 0 3 9 3]
[ 2 7 2 0 0 0 0 1 2 1]
[215 149 299 199 159 325 179 249 249 199]
[ 27 49 24 4 21 8 35 15 45 25]
[100 100 100 100 100 100 100 100 100 100]]
I need to iterate over the matrix, summing all elements in rows 0, 1, 2, 3 and 4 only.
For example, I need
row_0_sum = 47+43+51+81....46
Furthermore, I need to store each row's sum in an array like this:
[row0_sum, row1_sum, row2_sum, row3_sum, row4_sum]
So far I have tried this code, but it's not doing the job:
mu = np.zeros(shape=(1,6))
#get an average
def standardize_ratings(matrix):
    sum = 0
    for i, eli in enumerate(matrix):
        for j, elj in enumerate(eli):
            if(i<5):
                sum = sum + matrix[i][j]
                if(j==elj.len -1):
                    mu[i] = sum
                    sum = 0
                    print "mu[i]="
                    print mu[i]
This just gives me an error: numpy.int32 object has no attribute 'len'
Can someone help me? What's the best way to do this, and which type of array in Python should I use to store the result? I'm new to Python but have done programming before.
Thanks
Make your data, matrix, a numpy.ndarray object, instead of a list of lists, and then just do matrix.sum(axis=1).
>>> matrix = np.asarray([[ 47,  43,  51,  81,  54,  81,  52,  54,  31,  46],
...                      [ 35,  21,  30,  16,  37,  11,  35,  30,  39,  37],
...                      [  8,  17,  11,   2,   5,   4,  11,   9,  17,  10],
...                      [  5,   9,   4,   0,   1,   1,   0,   3,   9,   3],
...                      [  2,   7,   2,   0,   0,   0,   0,   1,   2,   1],
...                      [215, 149, 299, 199, 159, 325, 179, 249, 249, 199],
...                      [ 27,  49,  24,   4,  21,   8,  35,  15,  45,  25],
...                      [100, 100, 100, 100, 100, 100, 100, 100, 100, 100]])
>>> print(matrix.sum(axis=1))
[ 540 291 94 35 15 2222 253 1000]
To get the first five rows from the result, you can just do:
>>> row_sums = matrix.sum(axis=1)
>>> rows_0_through_4_sums = row_sums[:5]
>>> print(rows_0_through_4_sums)
[540 291 94 35 15]
Or, you can alternatively sub-select only those rows to begin with and only apply the summation to them:
>>> rows_0_through_4 = matrix[:5,:]
>>> print(rows_0_through_4.sum(axis=1))
[540 291 94 35 15]
Some helpful links will be:
NumPy for Matlab Users, if you are familiar with these things in Matlab/Octave
Slicing/Indexing in NumPy
