NumPy array with largest value on diagonal and other values shuffled - python

I am trying to create a square NumPy (or PyTorch, since PyTorch code can be turned into NumPy with minimal effort) matrix which has the following property: given a set of values, the diagonal element in each row is the largest value in that row, and the remaining values are randomly shuffled across the other positions.
For example, if I have [1, 2, 3, 4], a possible desired output is:
[[4, 3, 1, 2],
 [1, 4, 3, 2],
 [2, 1, 4, 3],
 [2, 3, 1, 4]]
There can be (several) other possible outputs, as long as the diagonal elements are the largest value (4 in this case) and the off-diagonal elements in each row contain the other values but shuffled.
A hacky/inefficient way of doing this could be first creating a square matrix (4x4) of zeros and putting the largest value (4) in all the diagonal positions, and then traversing the matrix row by row, where for each row i, populate the elements except index i with shuffled remaining values (shuffled versions of [1, 2, 3]). This would be very slow as the matrix size increases. Is there a cleaner/faster/Pythonic way of doing it? Thank you.
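For concreteness, here is a minimal sketch of that baseline approach (assuming the example values [1, 2, 3, 4]; np.fill_diagonal and the row loop mirror the description above):
import numpy as np
values = np.array([1, 2, 3, 4])
n = len(values)
out = np.zeros((n, n), dtype=values.dtype)
np.fill_diagonal(out, values.max())      # largest value on the diagonal
rest = values[values != values.max()]    # the remaining values, here [1, 2, 3]
for i in range(n):                       # row-by-row traversal
    out[i, np.arange(n) != i] = np.random.permutation(rest)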

First you can generate a randomized array on the first axis with np.random.shuffle(); then I've used a (not so easy to understand) mathematical trick to shift each row:
import numpy as np
from numpy.fft import fft, ifft
# First create your randomized array, e.g. with np.random.shuffle()
x = np.array([[1, 2, 3, 4],
              [2, 4, 3, 1],
              [4, 1, 2, 3],
              [2, 3, 1, 4]])
# We use np.where to determine in which column each 4 sits.
_, s = np.where(x == 4)
# We compute the left shift that needs to be applied to each row
# in order to bring each 4 onto the diagonal.
s = s - np.r_[0:x.shape[0]]
# And here is the trick: we can use the fast Fourier transform
# to circularly left-shift each row by a given amount:
L = np.real(ifft(fft(x, axis=1)
                 * np.exp(2j * np.pi / x.shape[1]
                          * s[:, None] * np.r_[0:x.shape[1]][None, :]),
                 axis=1).round())
# Note that we could also use a right shift; we simply have to
# negate the exponent of the exponential: np.exp(-2j*np.pi...)
And we obtain the following matrix:
[[4. 1. 2. 3.]
 [2. 4. 3. 1.]
 [2. 3. 4. 1.]
 [2. 3. 1. 4.]]
No hidden for loop, only pure linear algebra stuff.
To give you an idea, it takes only a few milliseconds for a 1000x1000 matrix on my computer and ~20 s for a 10000x10000 matrix.
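As a side note, the same per-row circular shift can also be done with plain integer indexing instead of the FFT. A sketch (reusing the x and s computed above; L2 is just an illustrative name), which stays in integer dtype and needs no rounding:
rows = np.arange(x.shape[0])[:, None]
cols = (np.arange(x.shape[1])[None, :] + s[:, None]) % x.shape[1]  # left-shift row i by s[i]
L2 = x[rows, cols]  # same matrix as L, but integer-valued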

Related

2D version of numpy random choice with weighting

This relates to this earlier post: Numpy random choice of tuples
I have a 2D numpy array and want to choose from it using a 2D probability array. The only way I could think to do this was to flatten both, draw from the flat arrays, and then use integer division and modulo to convert the result back to a 2D index:
import numpy as np
# dummy data
x = np.arange(100).reshape(10, 10)
# dummy probability array
p = np.zeros([10, 10])
p[4:7, 1:4] = 1.0 / 9
xy = np.random.choice(x.flatten(), 1, p=p.flatten())
index = [int(xy[0]) // 10, int(xy[0]) % 10]  # convert back to a 2D index
print(index)
which gives
[5, 2]
but is there a cleaner way that avoids the flattening and the modulo arithmetic? I.e. I could pass a list of coordinate tuples as x, but how can I then handle the weights?
I don't think it's possible to directly specify a 2D-shaped array of probabilities, so raveling should be fine. However, to get the corresponding 2D-shaped indices from the flat index you can use np.unravel_index:
index= np.unravel_index(xy.item(), x.shape)
# (4, 2)
For multiple indices, you can just stack the result:
xy=np.random.choice(x.flatten(),3,p=p.flatten())
indices = np.unravel_index(xy, x.shape)
# (array([4, 4, 5], dtype=int64), array([1, 2, 3], dtype=int64))
np.c_[indices]
array([[4, 1],
       [4, 2],
       [5, 3]], dtype=int64)
where np.c_ stacks along the right hand axis and gives the same result as
np.column_stack(indices)
You could use numpy.random.randint to generate a random (unweighted) index, for example:
# assumes p is a square array
ij = np.random.randint(p.shape[0], size=p.ndim)  # size p.ndim = 2 generates 2 coords
# need to convert to a tuple to index correctly
p[tuple(ij)]
>>> 0.0
You can also index multiple random values at once:
ij = np.random.randint(p.shape[0], size=(p.ndim, 5)) # get 5 values
p[tuple(ij)]
>>> array([0.        , 0.        , 0.        , 0.11111111, 0.        ])

More efficient way to merge columns in pandas

My code calculates the Euclidean distance between all points in a set of samples I have. What I want to know is: in general, is this the most efficient way to perform some operation between all elements in a set and then plot them, for instance to make a correlation matrix?
The index of samples is used to initialize the dataframe and provide labels. Then the 3D coordinates are provided as tuples in three_D_coordinate_tuple_list, but this could easily be any measurement, and the variable distance could then be any operation. I'm curious about finding a more efficient solution than making each column separately and then merging them again using pandas or numpy. Am I clogging up any memory with my solution? How can I make this cleaner?
import pandas as pd

def euclidean_distance_matrix_maker(three_D_coordinate_tuple_list, index_of_samples):
    # three_D_coordinate_tuple_list: list of (x, y, z) tuples
    # index_of_samples: well_id or index as series or list
    n = len(three_D_coordinate_tuple_list)
    distance_matrix_df = pd.DataFrame(index_of_samples)
    for i in range(0, n):
        column = []
        # iterates through all elements and calculates the distance vs this element
        for j in range(0, n):
            # euclidean_dist_threeD_for_tuples: helper defined elsewhere in my code
            distance = euclidean_dist_threeD_for_tuples(three_D_coordinate_tuple_list[i],
                                                        three_D_coordinate_tuple_list[j])
            column.append(distance)
        # each column of Euclidean distances is collected in a list, wrapped in a
        # one-column frame, and appended column-wise to the output with concat
        new_column = pd.DataFrame(column)
        distance_matrix_df = pd.concat([distance_matrix_df, new_column], axis=1)
    distance_matrix_df = distance_matrix_df.set_index(distance_matrix_df.iloc[:, 0])
    distance_matrix_df = distance_matrix_df.iloc[:, 1:]
    distance_matrix_df.columns = distance_matrix_df.index
    return distance_matrix_df
Setup
import numpy as np
x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
scipy.spatial.distance_matrix
from scipy.spatial import distance_matrix
distance_matrix(x, x)
array([[ 0.        ,  5.19615242, 10.39230485],
       [ 5.19615242,  0.        ,  5.19615242],
       [10.39230485,  5.19615242,  0.        ]])
Numpy
from scipy.spatial.distance import squareform
i, j = np.triu_indices(len(x), 1)
((x[i] - x[j]) ** 2).sum(-1) ** .5
array([ 5.19615242, 10.39230485, 5.19615242])
Which we can make into a square form with squareform
squareform(((x[i] - x[j]) ** 2).sum(-1) ** .5)
array([[ 0.        ,  5.19615242, 10.39230485],
       [ 5.19615242,  0.        ,  5.19615242],
       [10.39230485,  5.19615242,  0.        ]])
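If you want to skip scipy entirely, a pure-broadcasting sketch gives the same square matrix directly (at the cost of materialising the full n x n x 3 difference array):
np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
array([[ 0.        ,  5.19615242, 10.39230485],
       [ 5.19615242,  0.        ,  5.19615242],
       [10.39230485,  5.19615242,  0.        ]])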

Math behind scipy.ndimage.convolve

While I have already found the documentation on scipy.ndimage.convolve function and I "practically know what it does", when I try to calculate the resulting arrays I can't follow the mathematical formula. Let's take for example:
a = np.array([[1, 2, 0, 0],
              [5, 3, 0, 4],
              [0, 0, 0, 7],
              [9, 3, 0, 0]])
k = np.array([[1, 1, 1],
              [1, 1, 0],
              [1, 0, 0]])
from scipy import ndimage
ndimage.convolve(a, k, mode='constant', cval=0.0)
# Why is the result like this ?
array([[11, 10,  7,  4],
       [10,  3, 11, 11],
       [15, 12, 14,  7],
       [12,  3,  7,  0]])
I would appreciate a step by step calculation.
Details on NDImage.convolve
I stumbled on this ndimage convolution even though I know the basic np.convolve, and the documentation is not very self-explanatory, so I took the effort to crunch through it and supplement the earlier explanatory post:
A. Basics:
Reference: refer to the following if your grasp of convolution is not well grounded:
https://en.wikipedia.org/wiki/Kernel_(image_processing)
https://en.wikipedia.org/wiki/Convolution
Essentially ndimage.convolve has 4 modes. This post focuses on the constant mode, in which the source array is padded with the value specified by cval=0 (or whatever) in as many extra rows and columns as needed (explained in a little bit).
The convolution essentially slides the kernel from left to right, then steps down a row and slides from left to right again, until the needed number of convolved elements (the same number as in the source) has been produced.
The function will calculate the padded rows/columns needed. In this case the filter K is a 3 x 3 matrix and the source image a is a 4 x 4 matrix, so you need one padded row at the top and bottom and one padded column at the left and right (4 + 2 = 6; the number of rows or columns needed is 3 + 1 + 1 + 1 = 6, since each slide needs one extra row or column).
B. Operations (a code sketch reproducing these steps follows the list):
Add a row and column of padded zeros to the top and left of array a, and likewise one row and column of padded zeros to the bottom and right (to convolve a 3 x 3 kernel over a 4 x 4 image evenly, you need the extra padded row/column at the 1st and 4th sliding windows).
Flip the kernel K to get Kflip: [[0,0,1], [0,1,1], [1,1,1]]; you can use np.flip. (Why it needs to be flipped relates to the distinction between convolution and correlation, which are like twins running in opposite directions.)
Slide the flipped K matrix over this 6 x 6 expanded matrix [[0,0,0,0,0,0], [0,1,2,0,0,0], [0,5,3,0,4,0], [0,0,0,0,7,0], [0,9,3,0,0,0], [0,0,0,0,0,0]]
For the first sliding-window position (note that the first row and column of the kernel are convolved with the padded zeros), you get:
Flipped K dot-sum [[0,0,0], [0,1,2], [0,5,3]] = 11 (1*1 + 1*2 + 1*5 + 1*3, the rest are zeros)
(dot-sum refers to the sum of the element-wise products: just multiply the corresponding elements at the same positions in the two given matrices and add everything up)
Slide K one step to the right and you get 10 (the first row of the window is all zeros due to the padding; the second row contributes 1*2; the third row contributes 1*5 + 1*3).
Likewise, slide to the right another two steps to get all four elements of this row of the convolved matrix (note that for the 4th element of the row we again partially convolve over the padded columns).
Then slide the K filter one row down and reset to the far left of the expanded/padded matrix. You get again the same 10 (the first row of the window contributes 1*2; the second row contributes 1*5 + 1*3), and so on and so forth.
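Putting the steps above together, here is a minimal sketch (reusing the a and k from the question) that reproduces ndimage.convolve with explicit padding, flipping and dot-sums:
import numpy as np
from scipy import ndimage
a = np.array([[1, 2, 0, 0],
              [5, 3, 0, 4],
              [0, 0, 0, 7],
              [9, 3, 0, 0]])
k = np.array([[1, 1, 1],
              [1, 1, 0],
              [1, 0, 0]])
padded = np.pad(a, 1, mode='constant', constant_values=0)  # the 6 x 6 expanded matrix
kflip = np.flip(k)                                         # flip along both axes
out = np.empty_like(a)
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        # "dot-sum" of the flipped kernel with the current 3 x 3 window
        out[i, j] = (padded[i:i + 3, j:j + 3] * kflip).sum()
assert (out == ndimage.convolve(a, k, mode='constant', cval=0.0)).all()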
Just to warm up consider
k = np.array([[1,0,0],[0,1,0],[0,0,0]])
instead of your k, then if you
ndimage.convolve(a, k, mode='constant', cval=0.0)
you get
array([[4, 2, 4, 0],
       [5, 3, 7, 4],
       [3, 0, 0, 7],
       [9, 3, 0, 0]])
and note that any element is the sum of its own position (due to the 2nd 1 in k) and the element one below and to the right (due to the 1st 1 in k), i.e. the 4 in the top corner is the original 1 in the top corner plus the 3 diagonally down from it.
The (possibly) confusing part is that the effect of k is opposite to what you might expect: for the k above you might expect the first 1 to add the value above and to the left, instead of the one below and to the right.
Now back to yours: the 12 (3 down and 2 across) is the sum of 9+3+0+0+0+0.
Note that anything outside the matrix is assumed to be 0.
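To double-check that 12 numerically, a small sketch (again reusing a and k from the question):
padded = np.pad(a, 1, mode='constant')  # zeros outside the matrix
window = padded[2:5, 1:4]               # 3 x 3 neighbourhood of a[2, 1] (3 down, 2 across)
print((window * np.flip(k)).sum())      # -> 12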

Calculating cosine similarity of columns of a python matrix

I have a numpy array, say a, as below:
array([[1, 2, 3],
       [1, 2, 2]])
I want to find the cosine similarity matrix of this array, where the cosine similarity is between the columns.
Now, the cosine similarity of two vectors is just their dot product divided by the product of their L2 norms.
But I don't want to iterate over each column in a loop to do it.
So I first tried this:
from scipy.spatial import distance
cos=distance.cdist(a.T,a.T,'cosine')
Here I am taking the transpose, as otherwise it would compute the cosine of the rows (observations); I want it for the columns.
However I am not sure this is the right answer. The doc of this function says it gives 1 - cosine_similarity. So should I then do?
cos = 1 - distance.cdist(a.T, a.T, 'cosine')
Please advise.
II)
Also what If I try doing something like this:
cos=(np.dot(a.T,a))/(np.linalg.norm(a, axis=0, keepdims=True))*(np.linalg.norm(a, axis=0, keepdims=True))
It doesn't work; there is some problem in getting the right L2 norm for each column. Any idea how we can implement this without a library function?
Try this:
a = np.array([[1, 2, 3], [1, 2, 2]])
n = np.linalg.norm(a, axis=0).reshape(1, a.shape[1])
a.T.dot(a) / n.T.dot(n)
array([[1.        , 1.        , 0.98058068],
       [1.        , 1.        , 0.98058068],
       [0.98058068, 0.98058068, 1.        ]])
This assignment for n would have also worked.
np.linalg.norm(a, axis=0)[None, :]
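To tie this back to the cdist attempt in the question, a quick sketch checking that the two approaches agree (remember cdist returns the cosine distance, i.e. 1 - similarity):
import numpy as np
from scipy.spatial import distance
a = np.array([[1, 2, 3], [1, 2, 2]])
n = np.linalg.norm(a, axis=0, keepdims=True)  # one L2 norm per column
cos = a.T.dot(a) / n.T.dot(n)                 # column-wise cosine similarity
assert np.allclose(cos, 1 - distance.cdist(a.T, a.T, 'cosine'))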

How to choose axis value in numpy array

I am a new user of numpy, and I was using numpy delete, where it mentions that to delete a horizontal row we should use axis=0, but in other documentation, the numpy glossary, it says the horizontal axis is 1. It would be great if someone could let me know what is wrong with my understanding.
An array is a systematic way of structuring numbers in grids of any dimensionality. The grid directions have labels, and these labels come from a convention of how new dimensions are added to a grid.
Here's the convention:
The simplest such grid is a 0-dimensional (0D) array, which has no axes and can only hold a scalar. This is a 0D array:
42
If we start putting scalars into a list we get a 1D array. This new grid only has one axis, and if we want to label that axis with a number, we better start with something simple - like axis=0! A 1D array could be:
# ----0--->
[42, π, √2]
Now we want to create an array of 1D arrays, which will give us a 2D array. The new vertical axis gets the lowest number we know, axis=0, and the old horizontal axis is bumped up to axis=1. Here's what it could look like:
#  ----1---->
[[42, π, √2],  # |
 [1,  2,  3],  # 0
 [10, 20, 30]] # V
The true beauty is that this generalizes to infinity. If we need a box of numbers we'd create a 3D array by stacking 2D arrays, and the direction that traces the depth of the box would naturally become the new axis=0, bumping the old axes up to 1 and 2. If we wanted a 4D array, we would just make a list of boxes (3D arrays) and call every box using an index along the new front axis. This can go on forever.
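A quick illustration of that convention using np.stack, whose default axis=0 inserts the new axis in front:
import numpy as np
v = np.array([42.0, 3.14, 1.41])  # 1D: a single axis, axis=0
m = np.stack([v, v, v])           # 2D: new vertical axis in front -> shape (3, 3)
b = np.stack([m, m])              # 3D: new depth axis in front    -> shape (2, 3, 3)
print(v.shape, m.shape, b.shape)  # (3,) (3, 3) (2, 3, 3)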
In NumPy:
Any function/method that takes an axis argument uses this convention. For a 2D array this means that doing something like np.delete(X, [1, 2, 3], axis=0) will iterate over the arrays stacked along the 0th axis, to return X without rows 1, 2 and 3. The same logic applies for getting values from an array:
X[rows_along_0th_axis, columns_along_1st_axis, ..., vectors_along_nth_axis]
Taking from the links that you provided, here are the excerpts from numpy delete and the glossary that probably caused you some confusion, with the clarification following.
Excerpt
>>> arr = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
>>> arr
array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])
>>> np.delete(arr, 1, 0)
array([[ 1,  2,  3,  4],
       [ 9, 10, 11, 12]])
Excerpt
the first running vertically downwards across rows (axis 0), and the
second running horizontally across columns (axis 1)
I think the confusion derives from the words vertically and horizontally in the second excerpt.
What the second excerpt means is that by setting axis it is possible to decide which dimension to move along. For example, in a 2D matrix, axis=0 corresponds to iterating over the rows (thus moving vertically over the array), while axis=1 corresponds to iterating over the columns (thus moving horizontally over the array). It does not say that axis=1 corresponds to the horizontal axis, as the OP understood.
The delete function follows the above description: indeed, by using np.delete(arr, 1, axis=0), the function iterates over the rows and deletes the row with index 1. If, instead, columns should be deleted, then use axis=1. For example, on the same array arr:
>>> np.delete(arr, [0,1,4], axis=1)
array([[ 3,  4],
       [ 7,  8],
       [11, 12]])
in which delete iterates over the columns: the columns with indices 0 and 1 are deleted, and nothing else, since a column with index 4 does not exist.
