I want to generate a rank 5 100x600 matrix in numpy with all the entries sampled via np.random.uniform(0, 20), so that all the entries are uniformly distributed on [0, 20). What would be the best way to do this in Python?
I see there is an SVD-inspired way to do it here (https://math.stackexchange.com/questions/3567510/how-to-generate-a-rank-r-matrix-with-entries-uniform), but I am not sure how to code it up. I am looking for a working example of this SVD-inspired way to get uniformly distributed entries.
I have actually managed to build a rank 5 100x100 matrix by vertically stacking five 20x100 rank 1 matrices and then shuffling the rows. However, the resulting 100x100 matrix does not have entries uniformly distributed on [0, 20).
Here is my code (my best attempt):
import numpy as np

def randomMatrix(m, n, p, q):
    # creates an m x n matrix with entries drawn uniformly from [p, q)
    return np.random.uniform(p, q, size=(m, n))

Qs = []
my_rank = 5
for i in range(my_rank):
    L = randomMatrix(20, 1, 0, np.sqrt(20))   # L is tall
    R = randomMatrix(1, 100, 0, np.sqrt(20))  # R is long
    Q = np.outer(L, R)
    Qs.append(Q)

Q = np.vstack(Qs)

# shuffle the rows (preserves rank 5 [confirmed])
np.random.shuffle(Q)
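A quick check along these lines (rank plus a histogram of the entries, assuming matplotlib is available) looks like this; the histogram comes out clearly non-flat on [0, 20), because each entry is the product of two uniform draws from [0, sqrt(20)):

from matplotlib import pyplot as plt

print(np.linalg.matrix_rank(Q))   # 5
plt.hist(Q.flatten(), bins=50)    # skewed towards 0, not uniform on [0, 20)
plt.show()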
Not a perfect solution, I must admit. But it's simple and comes pretty close.
I create 5 vectors that are going to span the row space of the matrix and fill the rest of the matrix with random linear combinations of them.
My initial thought was that a trivial solution would be to copy those vectors 20 times each.
To improve on that, I created linear combinations of them with weights drawn from a uniform distribution, but then the distribution of the entries becomes approximately normal, because the weighted averaging lets the central limit theorem take effect.
A middle ground between the trivial approach and the second approach that doesn't work is to use sets of weights that strongly favor one of the vectors over the others. You can generate these sorts of weight vectors by passing random vectors through the softmax function with an appropriately high temperature parameter.
The resulting distribution is almost uniform, but the generated rows are still very close to the base vectors. You can play with the temperature parameter to find a sweet spot that suits your purpose.
from scipy.special import softmax
import numpy as np
from matplotlib import pyplot as plt

N = 100
R = 5
low = 0
high = 20
sm_temperature = 100

# R base rows with uniform entries
p = np.random.uniform(low, high, (1, R, N))
# sharpened softmax weights, so each combination strongly favors one base row
weights = np.random.uniform(0, 1, (N - R, R, 1))
weights = softmax(weights * sm_temperature, axis=1)
# N - R additional rows built as convex combinations of the base rows
p_lc = (weights * p).sum(1)
rand_mat = np.concatenate([p[0], p_lc])
plt.hist(rand_mat.flatten())
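As a quick sanity check (my addition, not part of the original snippet), you can confirm the shape and rank of the result and display the histogram:

print(rand_mat.shape)                   # (N, N), here (100, 100)
print(np.linalg.matrix_rank(rand_mat))  # R = 5
plt.show()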
I just couldn't get over the fact that my previous solution (the "selection" method) did not really produce strictly uniformly distributed entries, only something close enough to fool a statistical test some of the time; asymptotically it will almost surely not be uniform. So I dreamed up another crazy idea that is just as bad, but in a different way - it is not really random.
In this solution I do something similar to the OP's method of forming rank-1 matrices and concatenating them, but a little differently. I build each block from one base vector: the base vector itself, then the base vector scaled by 1/2 stacked with the same scaled vector shifted by half the dynamic range of the uniform distribution, then the base vector scaled by 1/3 with shifts of zero, one third and two thirds of the range, and so on, until that part of the matrix has the required number of vectors.
I know it sounds incomprehensible, but unfortunately I couldn't find a better way to explain it. Hopefully reading the code sheds some more light.
I hope this "staircase" method will be more reliable and useful.
import numpy as np
from matplotlib import pyplot as plt

'''
params:
N - base dimension
M - matrix length
R - matrix rank
high - max value of the matrix
low - min value of the matrix
'''
N = 100
M = 600
R = 5
high = 20
low = 0

# base vectors of the matrix
base = low + np.random.rand(R - 1, N) * (high - low)
def build_staircase(base, num_stairs, low, high):
    '''
    Create a rank-2 matrix of 'num_stairs' different vectors whose
    elements are all uniformly distributed like the values of 'base'.
    '''
    l = levels(num_stairs)
    vectors = []
    for l_i in l:
        for i in range(l_i):
            vector_dynamic = (base - low) / l_i
            vector_bias = low + np.ones_like(base) * i * ((high - low) / l_i)
            vectors.append(vector_dynamic + vector_bias)
    return np.array(vectors)

def levels(total):
    '''
    Create a sequence of strictly increasing numbers summing up to the total.
    '''
    l = []
    sum_l = 0
    i = 1
    while sum_l < total:
        l.append(i)
        i += 1
        sum_l = sum(l)
    i = 0
    while sum_l > total:
        l[i] -= 1
        if l[i] == 0:
            l.pop(i)
        else:
            i += 1
        if i == len(l):
            i = 0
        sum_l = sum(l)
    return l
n_rm = R - 1  # number of matrix subsections
len_rms = [M // n_rm for _ in range(n_rm)]
len_rms[-1] += M % n_rm

rm_list = []
for i, len_rm in enumerate(len_rms):
    # create a rank-2 matrix with uniform entries out of the
    # vector 'base[i]' and a ones vector.
    rm_list.append(build_staircase(
        base=base[i],
        num_stairs=len_rm,
        low=low,
        high=high,
    ))
rm = np.concatenate(rm_list)
plt.hist(rm.flatten(), bins=100)
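A quick verification (my addition, not part of the original answer): the concatenated matrix has shape (M, N), i.e. 600 x 100, so transpose it if you want 100 x 600; since every block is spanned by one base vector plus the constant vector, the rank comes out as (R - 1) + 1 = R for generic random base vectors:

print(rm.shape)                    # (600, 100); use rm.T for a 100 x 600 matrix
print(np.linalg.matrix_rank(rm))   # 5
plt.show()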
A few example histograms of the entries follow (plots omitted), including one with N = 1000, M = 6000 to empirically demonstrate the near-asymptotic behavior.
I'm trying to generate a square matrix that only contains the values -1 or 0 or 1 but no other values. The matrix is used as a relationship matrix for a genetic algorithm project that I am working on. The diagonal has to be all zeros.
So far I have tried this:
import numpy as np
import random

n = 5
M = []
# draw n*n values from {-1, 0, 1} with roughly equal probability
for i in range(n * n):
    disc = random.random()
    if disc <= 0.33:
        M.append(-1)
    elif disc <= 0.66:
        M.append(0)
    else:
        M.append(1)
RelMat = np.array(M).reshape(n, n)
np.fill_diagonal(RelMat, 0)
This yields a matrix with all three values, but it is not symmetric. I have tried multiplying it by its transpose, but then the values are no longer correct.
I have also tried to get it to work with:
import numpy as np

N = 5
b = np.random.randint(-1, 2, size=(N, N))  # values in {-1, 0, 1}
b_symm = (b + b.T) / 2
but this gives me 0.5 values in the matrix, which is a problem.
My main issues are the symmetry of the matrix and the condition that the matrix has to contain all three numbers. Any help is appreciated.
numpy.triu returns the upper triangular portion of the matrix (it sets the elements below the k-th diagonal to 0). With k=1 you also zero the main diagonal in the same call, so there is no need for fill_diagonal.
After that, b + b.T gives you a symmetric matrix with the desired values.
Here's a much more compact way to build the matrix, in this case 5x5:
b = np.triu(np.random.randint(-1, 2, size=[5,5]), k=1)
b + b.T
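For completeness, a small check (my addition) that the result is symmetric, has a zero diagonal, and only contains values from {-1, 0, 1}:

b_symm = b + b.T
print(np.array_equal(b_symm, b_symm.T))   # True: symmetric
print(np.all(np.diag(b_symm) == 0))       # True: zero diagonal
print(np.unique(b_symm))                  # a subset of [-1, 0, 1]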
I have about 9000 trajectories, and for a project I have to calculate the distance between each pair of them. A trajectory consists of 11 points, and each point contains x and y coordinates. A sample dataset can be generated by:
import numpy as np
trajs = np.random.rand(9000,11,2)
I took the Frechet distance function from https://pypi.org/project/similaritymeasures/, which takes two trajectories as input and returns a float.
At the beginning I wrote a nested for-loop:
from similaritymeasures import frechet_dist

distance_matrix = []
for i in trajs:
    for j in trajs:
        distance_matrix.append(frechet_dist(i, j))
It takes too long to get results.
Since the distance calculation is symmetric (i.e. frechet_dist(t1, t2) == frechet_dist(t2, t1)), I cut the number of calculations in half like this:
from scipy.spatial.distance import squareform

n = len(trajs)
distance_matrix = []
flag = 0
for i in range(n):
    for j in range(flag, n):
        if i != j:
            distance_matrix.append(frechet_dist(trajs[i], trajs[j]))
    flag += 1
dist_mat = squareform(np.asarray(distance_matrix))
Now, for 9000 trajectories, it takes 19 hours. I got the result, but that is still too long. Are there any methods to speed up the calculation?
I have two lists containing x and y n-dimensional points respectively. I have to calculate the sum, over every point in the first list (containing x points), of its minimum distance to the points in the second list (containing y points). The distance I am calculating is the Euclidean distance, and I need an optimized solution.
I have already implemented a naive solution in Python, but its time complexity is too high to be usable anywhere. Can the time complexity be reduced below that of my implementation?
I was reading this paper, which I was trying to implement. The authors face a similar problem, which they describe as a special case of the Earth Mover's Distance, but no code is given, so I could not see how they implemented it. Hence my naive implementation below, which was too slow to work on a data set of 11k documents. I used Google Colab to execute my code.
# Calculating the Euclidean distance between two points
def euclidean_dist(x, y):
    # len(x) is the number of dimensions; x and y are lists
    # containing the coordinates of a point
    dd = 0.0
    for i in range(len(x)):
        dd = dd + (x[i] - y[i]) ** 2
    return dd ** (1 / 2)

# Calculating the desired solution to our problem
def dist(l1, l2):
    min_dd = 0.0
    for j in range(len(l1)):
        # start from the distance to the first point of l2,
        # then keep the minimum over all points of l2
        dd = euclidean_dist(l1[j], l2[0])
        for k in range(len(l2)):
            temp = euclidean_dist(l1[j], l2[k])
            if dd > temp:
                dd = temp
        min_dd = min_dd + dd
    return min_dd
To reduce the runtime, I would suggest first computing Manhattan distances (|delta x| + |delta y|), sorting the resulting array for each point, and then creating a buffer of +20% above the lowest Manhattan distance; for the candidates that fall within that +20% range, you compute the Euclidean distances and pick the correct/minimum Euclidean answer.
This will reduce some time, but the 20% figure might not help if the points are all close together, since most of them will then fall inside the buffer region; try fine-tuning the 20% parameter to see what works best for your data set. Keep in mind that reducing it too much might lead to inaccurate answers due to the difference between Euclidean and Manhattan distances.
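A rough sketch of this idea (my own code, not from the original answer; the 1.2 buffer factor and the helper name are illustrative only):

import numpy as np

def min_dist_sum_manhattan_filter(xs, ys, buffer_factor=1.2):
    # xs: (n, d) array, ys: (m, d) array
    total = 0.0
    for p in xs:
        manhattan = np.sum(np.abs(ys - p), axis=1)   # cheap pre-filter
        cutoff = manhattan.min() * buffer_factor     # +20% buffer
        candidates = ys[manhattan <= cutoff]
        euclid = np.sqrt(np.sum((candidates - p) ** 2, axis=1))
        total += euclid.min()
    return total

xs = np.random.random((100, 3))
ys = np.random.random((200, 3))
print(min_dist_sum_manhattan_filter(xs, ys))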
This is similar to a k-nearest-neighbour problem: finding the closest point to a given point costs O(N), so for your problem the total cost is O(N^2).
Sometimes using a k-d tree MAY improve performance if your data is low-dimensional.
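For instance, a minimal sketch using scipy.spatial.cKDTree (my own example, assuming the points are low-dimensional numpy arrays):

import numpy as np
from scipy.spatial import cKDTree

x = np.random.random((1000, 3))   # list one
y = np.random.random((2000, 3))   # list two

tree = cKDTree(y)                 # build the tree once over the second list
d, _ = tree.query(x, k=1)         # nearest neighbour in y for every point in x
print(d.sum())                    # sum of minimum Euclidean distances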
To calculate the distance between two points, you can use the distance formula d = sqrt((x2 - x1)^2 + (y2 - y1)^2), which you can implement like this in Python:
import math

def dist(x1, y1, x2, y2):
    return math.sqrt(pow(x1 - x2, 2) + pow(y1 - y2, 2))
Then all you need to do is loop over the points, compute the distance between each pair, and store it if it is below the currently stored minimal distance. You end up with an O(n²) algorithm, which seems to be what you want. Here's a working example:
min_dd = None
for i in range(len(l1)):
    for j in range(i + 1, len(l1)):
        dd = dist(l1[i], l2[i], l1[j], l2[j])
        if min_dd is None or dd < min_dd:
            min_dd = dd
With this you can get pretty good performance even with large lists of points.
Small arrays
For two numpy arrays x and y of shape (n,) and (m,) respectively, you can vectorize the distance calculations and then get the minimum distance:
import numpy as np
n = 10
m = 20
x = np.random.random(n)
y = np.random.random(m)
# Using squared distance matrix and taking the
# square root at the minimum value
distance_matrix = (x[:,None]-y[None,:])**2
minimum_distance_sum = np.sum(np.sqrt(np.min(distance_matrix, axis=1)))
For arrays of shape (n,l) and (m,l), you just need to calculate the distance_matrix as:
distance_matrix = np.sum((x[:,None]-y[None,:])**2, axis=2)
Alternatively, you could use np.linalg.norm, scipy.spatial.distance.cdist, np.einsum etc., but in many cases they are not faster.
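For reference, here is roughly what the cdist alternative looks like for the (n, l) / (m, l) case (my own sketch, assuming scipy is available):

from scipy.spatial.distance import cdist

# x has shape (n, l), y has shape (m, l)
minimum_distance_sum = cdist(x, y).min(axis=1).sum()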
Large arrays
If l, n and m above are too large for you to keep the distance_matrix in memory, you can use the mathematical lower and upper bounds of the Euclidean distance to increase the speed (see this paper). Since this relies on for loops it would normally be very slow, but the functions can be wrapped with numba to counter that:
import numpy as np
import numba

@numba.jit(nopython=True, fastmath=True)
def get_squared_distance(a, b):
    return np.sum((a - b) ** 2)

def get_minimum_distance_sum(x, y):
    n = x.shape[0]
    m = y.shape[0]
    l = x.shape[1]
    # Calculate mean and standard deviation of both arrays
    mx = np.mean(x, axis=1)
    my = np.mean(y, axis=1)
    sx = np.std(x, axis=1)
    sy = np.std(y, axis=1)
    return _get_minimum_distance_sum(x, y, n, m, l, mx, my, sx, sy)

@numba.jit(nopython=True, fastmath=True)
def _get_minimum_distance_sum(x, y, n, m, l, mx, my, sx, sy):
    min_distance_sum = 0
    for i in range(n):
        min_distance = get_squared_distance(x[i], y[0])
        for j in range(1, m):
            # cheap lower bound on the squared distance; skip the exact
            # computation if it cannot beat the current minimum
            lower_bound = l * ((mx[i] - my[j]) ** 2 + (sx[i] - sy[j]) ** 2)
            if lower_bound >= min_distance:
                continue
            distance = get_squared_distance(x[i], y[j])
            if distance < min_distance:
                min_distance = distance
        min_distance_sum += np.sqrt(min_distance)
    return min_distance_sum

def test_minimum_distance_sum():
    # Will likely need to be much larger for this to be faster than the other method
    n = 10
    m = 20
    l = 100
    x = np.random.random((n, l))
    y = np.random.random((m, l))
    return get_minimum_distance_sum(x, y)
This approach should become faster than the former one as the array sizes increase. The algorithm can be improved slightly as described in the paper, but any speedup will depend heavily on the shape of the arrays.
Timings
On my laptop, on two arrays of shape (1000,100), your approach takes ~1 min, the "small arrays" approach takes 690 ms and the "large arrays" approach takes 288 ms. For two arrays of shape (100, 3), your approach takes 28 ms, the "small arrays" approach takes 429 μs and the "large arrays" approach takes 578 μs.
I have collected the outputs of several clustering algorithms on the same data set, from which I would like to generate an adjacency matrix indicating in how many different runs any two samples were clustered together. I.e. for each of the I = 10 clusterings, I have a one-hot representation N x C_i indicating whether or not sample n belongs to cluster c in the i-th run, with possibly different (numbers of) clusters for each run. The goal is then to build an N x N adjacency matrix on which I can threshold and select only consistent clusters for further analysis.
It is quite easy to build an algorithm that does this:
import itertools
import numpy as np
import scipy.sparse

n_samples = runs[0].shape[0]
i = []
j = []
for iter_no, ca in enumerate(runs):
    print("Processing adjacency", iter_no)
    for col in range(ca.shape[1]):
        # all pairs of samples assigned to this cluster
        comb = itertools.combinations(np.where(ca[:, col])[0], 2)
        for c in comb:
            i.append(c[0])
            j.append(c[1])

i = np.array(i)
j = np.array(j)
adj_mat = scipy.sparse.coo_matrix((np.ones(len(i)), (i, j)), shape=[n_samples, n_samples])
This scales very poorly with cluster size, and I typically have N = 15000 with cluster sizes occasionally reaching 12k. Hence, I'm looking at the networkx library to possibly speed this up. Is there any obvious way to do this?
UPDATE: Solution found (see answer).
Linear algebra to the rescue:
import numpy as np
import scipy.sparse

def build_adjacency(runs):
    assert len(runs) > 0
    N = runs[0].shape[0]
    # Iteratively populate the output matrix (dense)
    S = np.zeros((N, N), dtype=np.int8)
    for i, scan in enumerate(runs):
        print("Adjacency", i)
        # scan @ scan.T is 1 for every pair of samples that share a
        # cluster in this run, 0 otherwise
        S += np.matmul(scan, scan.T).astype(np.int8)
    # Convert to sparse and return
    return scipy.sparse.csr_matrix(S)
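A minimal usage sketch (my own illustration; build_adjacency is just the name given to the wrapped-up snippet above, and the toy data and the threshold of 2 are arbitrary):

# toy example: three runs over 50 samples with hard (one-hot) assignments
rng = np.random.default_rng(0)
runs = []
for n_clusters in (4, 5, 6):
    labels = rng.integers(0, n_clusters, size=50)
    runs.append(np.eye(n_clusters, dtype=np.int8)[labels])  # shape (50, n_clusters)

S = build_adjacency(runs)
consistent_pairs = S.toarray() >= 2   # clustered together in at least 2 of the 3 runs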
I have a simple question regarding normalization when doing a 2D FFT in python.
My understanding is that normalization factors can be determined from making arrays filled with ones.
For example in 1d, FFT of [1,1,1,1] would give me [4+0j,0+0j,0+0j,0+0j] so the normalization factor should be 1/N=1/4.
In 2D, FFT of [[1,1],[1,1]] would give me [[4+0j,0+0j],[0+0j,0+0j]] so the normalization should be 1/MN=1/(2*2)=1/4.
Now suppose we have a 3000 by 3000 matrix, each element with a Gaussian distributed value with mean 0. When we FFT and normalize this (normalization factor = 1/(3000*3000)), we get a mean power of order 10^-7.
Now we repeat this using a 1000 by 1000 element sub-region (normalization factor = 1/(1000*1000)). The mean power we get from this is of order 10^-6. I'm wondering why there is a factor of ~10 difference. Shouldn't the mean power be the same? Am I missing an extra normalization factor?
If we say that the factor difference is in fact 9, then I could guess that it comes from the number of elements (3000 x 3000 has 9 times more elements than 1000 x 1000), but what is the intuitive reason for this extra factor? Also, how do we determine the absolute normalization factor to obtain the "true" underlying mean power?
Any insight will be greatly appreciated. Thanks in advance!
sample code:
import numpy as np

a = np.random.randn(3000, 3000)
af = np.fft.fft2(a) / 3000.0 / 3000.0
aP = np.mean(np.abs(af)**2)

b = a[1000:2000, 1000:2000]
bf = np.fft.fft2(b) / 1000.0 / 1000.0
bP = np.mean(np.abs(bf)**2)

print(aP, bP)
# 1.11094908545e-07 1.00226264535e-06
First, it's important to note that this issue is not related to the difference between 1D and 2D FFTs, but rather to how total power and mean power scale with the number of elements in an array.
You are exactly right that the factor of 9 comes from a having 9x more elements than b. What is confusing, perhaps, is that you have already normalized, by computing np.fft.fft2(a)/3000./3000. and np.fft.fft2(b)/1000./1000. In fact, those normalizations are necessary to get the total (not the mean) power to be equal in the space and frequency domains. To get the mean you have to divide again by the array sizes.
Your question is really about Parseval's theorem, which states that the total power in the two domains (space/time and frequency) is equal. For the DFT its statement is:

sum_{n=0}^{N-1} |x[n]|^2 = (1/N) * sum_{k=0}^{N-1} |X[k]|^2

Notice that, in spite of the 1/N on the right, this is not mean power but total power; the 1/N is there because of the normalization convention for the DFT.
Put in Python, this means that for a time/space signal sig, Parseval equivalence may be stated as:
np.sum(np.abs(sig)**2) == np.sum(np.abs(np.fft.fft(sig))**2)/sig.size
Below is a complete example, starting with some toy cases (one- and two-dimensional arrays filled with 1s) and ending with your own case. Note that I used the .size property of numpy.ndarray, which returns the total number of elements in the array; it is equivalent to your /1000./1000. etc. Hope this helps!
import numpy as np

print('simple examples:')

# 1-d, 4 elements:
ones_1d = np.array([1., 1., 1., 1.])
ones_1d_f = np.fft.fft(ones_1d)
# compute total power in space and frequency domains:
space_power_1d = np.sum(np.abs(ones_1d)**2)
freq_power_1d = np.sum(np.abs(ones_1d_f)**2)/ones_1d.size
print('space_power_1d = %f' % space_power_1d)
print('freq_power_1d = %f' % freq_power_1d)

# 2-d, 4 elements:
ones_2d = np.array([[1., 1.], [1., 1.]])
ones_2d_f = np.fft.fft2(ones_2d)
# compute and print total power in space and frequency domains:
space_power_2d = np.sum(np.abs(ones_2d)**2)
freq_power_2d = np.sum(np.abs(ones_2d_f)**2)/ones_2d.size
print('space_power_2d = %f' % space_power_2d)
print('freq_power_2d = %f' % freq_power_2d)

# 2-d, 9 elements:
ones_2d_big = np.array([[1., 1., 1.], [1., 1., 1.], [1., 1., 1.]])
ones_2d_big_f = np.fft.fft2(ones_2d_big)
# compute and print total power in space and frequency domains:
space_power_2d_big = np.sum(np.abs(ones_2d_big)**2)
freq_power_2d_big = np.sum(np.abs(ones_2d_big_f)**2)/ones_2d_big.size
print('space_power_2d_big = %f' % space_power_2d_big)
print('freq_power_2d_big = %f' % freq_power_2d_big)
print()
# asker's example array a and fft af:
print('askers examples:')
a = np.random.randn(3000, 3000)
af = np.fft.fft2(a)
# compute the space and frequency total powers:
space_power_a = np.sum(np.abs(a)**2)
freq_power_a = np.sum(np.abs(af)**2)/af.size
# mean power is the total power divided by the array size:
freq_power_a_mean = freq_power_a/af.size
print('space_power_a = %f' % space_power_a)
print('freq_power_a = %f' % freq_power_a)
print('freq_power_a_mean = %e' % freq_power_a_mean)   # %e, since the mean power is tiny
print()

# the central 1000x1000 section of the 3000x3000 original array:
b = a[1000:2000, 1000:2000]
bf = np.fft.fft2(b)
# we expect the total power in the space and frequency domains
# to be about 1/9 of the total power in the space and frequency
# domains for matrix a:
space_power_b = np.sum(np.abs(b)**2)
freq_power_b = np.sum(np.abs(bf)**2)/bf.size
# we expect the mean power to be the same as the mean power of
# matrix a:
freq_power_b_mean = freq_power_b/bf.size
print('space_power_b = %f' % space_power_b)
print('freq_power_b = %f' % freq_power_b)
print('freq_power_b_mean = %e' % freq_power_b_mean)