I have a large data matrix and I want to calculate the similarity matrix of that large matrix, but due to memory limitations I want to split the calculation.
Let's assume I have the following (for the example I have taken a smaller matrix):
data1 = data/np.linalg.norm(data,axis=1)[:,None]
(Pdb) data1
array([[ 0. , 0. , 0. , ..., 0. ,
0. , 0. ],
[ 0. , 0. , 0. , ..., 0. ,
0. , 0. ],
[ 0.04777415, 0.00091094, 0.01326067, ..., 0. ,
0. , 0. ],
...,
[ 0. , 0.01503281, 0.00655707, ..., 0. ,
0. , 0. ],
[ 0.00418038, 0.00308079, 0.01893477, ..., 0. ,
0. , 0. ],
[ 0.06883803, 0. , 0.0209448 , ..., 0. ,
0. , 0. ]])
Then I try to do the following:
similarity_matrix[n1:n2,m1:m2] = np.einsum('ik,jk->ij', data1[n1:n2,:], data1[m1:m2,:])
n1, n2, m1, m2 are calculated as follows (df is a DataFrame):
data = df.values
m, k = data.shape
n1=0; n2=m/2; m1=n2+1; m2=m;
But the error is:
(Pdb) similarity_matrix[n1:n2,m1:m2] = np.einsum('ik,jk->ij', data1[n1:n2,:], data1[m1:m2,:])
*** NameError: name 'similarity_matrix' is not defined
Didn't you do something like
similarity_matrix = np.empty((N,M),dtype=float)
at the start of your calculations?
You can't index an array, on either the right or left side of an assignment, before you create it.
If that full (N,M) matrix is too big for memory, then just assign your einsum value to another variable, and work with that.
partial_matrix = np.einsum...
How you relate that partial_matrix to the virtual similarity_matrix is a different issue.
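For illustration, here is a minimal sketch of computing one block at a time without ever allocating the full similarity matrix (similarity_block and the block boundaries are just illustrative names, not part of the original code):
import numpy as np

def similarity_block(data1, n1, n2, m1, m2):
    # cosine-style similarity of rows n1:n2 against rows m1:m2 only
    return np.einsum('ik,jk->ij', data1[n1:n2, :], data1[m1:m2, :])

half = data1.shape[0] // 2   # integer division, so it also works on Python 3
partial_matrix = similarity_block(data1, 0, half, half, data1.shape[0])
If you do need the whole matrix persisted, something like np.memmap could back it on disk instead of RAM, but that is a separate decision.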
I've been staring at my textbook for 3 days, as well as the supplementary Matlab code, and I'm simply not grasping how the Gamma matrix looks when the transition matrix has an exit state.
Usually, digamma would have T-1 elements, and each element would be a matrix with the dimensions of the transition matrix (right? Digamma_t_ij)... so, say, my toy example is
A = [[0.9, 0.1, 0.0],
[0.0, 0.9, 0.1]]
x = [-0.2, 2.6, 1.3]
The emission probabilities for state 1 and 2 are modelled by gaussian distributions. I'm disregarding r for the moment, i.e. several sequences.
In my mind digamma should have the shape np.zeros((sequence_length, self.nStates, self.nStates + 1)) where the +1 is for the exit state...
However, here's an excerpt from my textbook:
"We therefore supplement the ξr matrix with one additional column", how... what... I feel so stupid :(
This is what I'm getting atm
Digamma
array([[[0.06495786, 0.93504214, 0. ],
[0. , 0. , 0. ]],
[[0. , 0.06495786, 0. ],
[0. , 0.93504214, 0. ]],
[[0. , 0. , 0. ],
[0. , 0. , 0. ]]])
I'm guessing the first two elements are right, as the values in each matrix add up to 1, so changing their last column just seems wrong... but filling in the last matrix doesn't seem right either?
I'll add all the variables needed to calculate digamma, in case someone can explain this to me and would prefer not to come up with a toy example (a sketch of the update I'm computing follows the variables):
# a_pass from the Forward Algorithm
a_pass = array([[1. , 0.38470424, 0.41887466],
[0. , 0.61529576, 0.58112534]])
# b_pass from the Backward Algorithm
b_pass = array([[1. , 1.03893571, 0. ],
[8.41537925, 9.35042138, 2.08182773]])
# scale variables derived from Forward
c_scale = array([1. , 0.16252347, 0.82658096, 0.05811253])
# Transition matrix
A = array([[0.9, 0.1, 0. ],
[0. , 0.9, 0.1]])
# Emission matrix, calculated as getting the pdf for each x given a gaussian dist
B = array([[1. , 0.06947052, 1. ],
[0.14182701, 1. , 0.81107303]])
# Observation sequence
x = [-0.2, 2.6, 1.3]
# The normalised gamma
gamma = array([[1. , 0.06495786, 0. ],
[0. , 0.93504214, 1. ]])
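For reference, a minimal sketch of the update I'm computing for the non-exit steps (it reproduces the first two matrices of the Digamma above; the exit column is the part I can't figure out):
import numpy as np

T = len(x)                       # sequence length (3 here)
nStates, nCols = A.shape         # nCols includes the exit-state column
digamma = np.zeros((T, nStates, nCols))
for t in range(T - 1):
    for i in range(nStates):
        for j in range(nStates):
            # scaled forward * transition * emission at t+1 * scaled backward
            digamma[t, i, j] = a_pass[i, t] * A[i, j] * B[j, t + 1] * b_pass[j, t + 1]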
I have this Python code to calculate distances among different coordinate points.
IDs,X,Y,Z
0-20,193.722,175.733,0.0998975
0-21,192.895,176.727,0.0998975
7-22,187.065,178.285,0.0998975
0-23,192.296,178.648,0.0998975
7-24,189.421,179.012,0.0998975
8-25,179.755,179.347,0.0998975
8-26,180.436,179.288,0.0998975
7-27,186.453,179.2,0.0998975
8-28,178.899,180.92,0.0998975
The code works perfectly, but as the number of coordinates I now have is very large (~50000), I need to optimise this code; otherwise it is impossible to run. Could someone suggest a more memory-efficient way of doing this? Thanks for any suggestions.
#!/usr/bin/env python
import pandas as pd
import scipy.spatial as spsp
df_1 = pd.read_csv('Spots.csv', sep=',')
coords = df_1[['X', 'Y', 'Z']].to_numpy()
distances = spsp.distance_matrix(coords, coords)
df_1['dist'] = distances.tolist()
# CREATES columns d0, d1, d2, d3
dist_cols = df_1['IDs']
df_1[dist_cols] = df_1['dist'].apply(pd.Series)
df_1.to_csv("results_Spots.csv")
There are a couple of ways to save space. The first is to store only the upper triangle of your matrix and make sure that your indices always reflect that. The second is to store only the values that meet your threshold. Both can be done together using sparse matrices, which support most of the operations you are likely to need and will only store the elements you need.
To store half the data, preprocess your indices when you access your matrix. So for your matrix, access index [i, j] like this:
def getitem(dist, i, j):
    # only the upper triangle is stored, so swap the indices when i > j
    if i > j:
        i, j = j, i
    return dist[i, j]
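For example, getitem(distances, 5, 2) and getitem(distances, 2, 5) would then both return the single stored entry distances[2, 5].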
scipy.sparse supports a number of sparse matrix formats: BSR, Coordinate, CSR, CSC, Diagonal, DOK, LIL. According to the usage reference, the easiest way to construct a matrix is using DOK or LIL format. I will show the latter for simplicity, although the former may be more efficient. I will leave it up to the reader to benchmark different options once a basic functioning approach has been shown. Remember to convert to CSR or CSC format when doing matrix math.
We will sacrifice speed for spatial efficiency by constructing one row at a time:
import numpy as np
import scipy.sparse

N = coords.shape[0]
threshold = 2
threshold2 = threshold**2  # minor optimization to save on some square roots
distances = scipy.sparse.lil_matrix((N, N))
for i in range(N):
    # Compute squared distances to all later points
    d2 = np.sum(np.square(coords[i + 1:, :] - coords[i]), axis=1)
    # Threshold
    mask = np.flatnonzero(d2 <= threshold2)
    # Apply, only compute square root if necessary
    distances[i, mask + i + 1] = np.sqrt(d2[mask])
For your toy example, we find that there are only four elements that actually pass the threshold, making the storage very efficient:
>>> distances.nnz
4
>>> distances.toarray()
array([[0. , 1.29304486, 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 1.1008038 , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0.68355102, 0. , 1.79082802],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ]])
Using the result from scipy.spatial.distance_matrix confirms that these numbers are in fact accurate.
If you want to fill the matrix (effectively doubling the storage, which should not be prohibitive), you should probably move away from LIL format before doing so. Simply add the transpose to the original matrix to fill it out.
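For example, a minimal sketch of doing that (assuming, as here, that the diagonal is zero so nothing is double-counted):
distances_csr = distances.tocsr()
distances_full = distances_csr + distances_csr.T   # symmetric; the diagonal stays zero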
The approach shown here addresses your storage concerns, but you can improve the efficiency of the entire computation using spatial sorting and other geospatial techniques. For example, you could use scipy.spatial.KDTree or the similar scipy.spatial.cKDTree to arrange your dataset and query neighbors within a specific threshold directly and efficiently.
For example, the following would replace the matrix construction shown here with what is likely a more efficient method:
tree = scipy.spatial.KDTree(coords)
distances = tree.sparse_distance_matrix(tree, threshold)
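Note that sparse_distance_matrix returns a DOK-format sparse matrix by default, so you may want to convert it (for example with .tocsr()) before doing heavier arithmetic with it.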
You are asking in your code for point-to-point distances in a ~50000 x ~50000 matrix. The result will be very big if you really want to store it. The matrix is dense, as each point has a positive distance to every other point.
I recommend revisiting your requirements. Do you really need to calculate all these distances upfront and store them in a file on disk? Sometimes it is better to do the required calculations on the fly; scipy.spatial is fast, perhaps not much slower than reading a precalculated value.
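For example, a minimal sketch of computing the distances for a single point on demand instead of storing the full matrix (distances_to and query_idx are just illustrative names):
import scipy.spatial as spsp

def distances_to(coords, query_idx):
    # distances from one point to all other points, computed on the fly
    return spsp.distance.cdist(coords[query_idx:query_idx + 1], coords)[0]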
EDIT (based on comment):
You can filter calculated distances by a threshold (here for illustration: 5.0) and then look up the IDs in the DataFrame
import numpy as np
import pandas as pd
import scipy.spatial as spsp

df_1 = pd.read_csv('Spots.csv', sep=',')
coords = df_1[['X', 'Y', 'Z']].to_numpy()
distances = spsp.distance_matrix(coords, coords)

adj_5 = np.argwhere(distances[:] < 5.0)
pd.DataFrame(zip(df_1['IDs'][adj_5[:, 0]].values,
                 df_1['IDs'][adj_5[:, 1]].values),
             columns=['from', 'to'])
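Note that this result also contains each point paired with itself (distance 0) and both orderings (i, j) and (j, i); if you only want each pair once, you could additionally require adj_5[:, 0] < adj_5[:, 1].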
In the second (FFT) plot, I am expecting a bigger peak at frequency = 1.0 compared to other frequencies, since it is a 1 Hz square-wave signal sampled at 5 Hz.
I am a beginner at this, and possibly missing something silly here.
Here's what I have done:
import numpy as np
from matplotlib import pyplot as plt
from scipy import signal
t500 = np.linspace(0,5,500,endpoint=False)
s1t500 = signal.square(2*np.pi*1.0*t500)
The first plot shows the 1 Hz square wave sampled at 5 Hz for 5 seconds:
t5 = np.linspace(0,5,25,endpoint=False)
t5 = t5 + 1e-14
s1t5 = signal.square(2.0*np.pi*1.0*t5)
plt.ylim(-2,2); plt.plot(t500,s1t500,'k',t5,s1t5,'b',t5,s1t5,'bo'); plt.show()
Here in the second plot, I am expecting the magnitude at f = 1 Hz to be greater than at f = 2 Hz. Am I missing something?
y1t5 = np.fft.fft(s1t5)
ff1t5 = np.fft.fftfreq(25,d=0.2)
plt.plot(ff1t5,y1t5); plt.show()
It seems you missed the fact that the Fourier transform produces functions (or sequences of numbers, in the case of the DFT/FFT) in complex space:
>>> np.fft.fft(s1t5)
[ 5. +0.j 0. +0.j 0. +0.j 0. +0.j 0. +0.j
5.-15.38841769j 0. +0.j 0. +0.j 0. +0.j 0. +0.j
5. +3.63271264j 0. +0.j 0. +0.j 0. +0.j 0. +0.j
# and so on
In order to see the amplitude spectrum on your plot, apply np.absolute or abs:
>>> np.absolute(np.fft.fft(s1t5))
[ 5. 0. 0. 0. 0. 16.18033989
0. 0. 0. 0. 6.18033989 0. 0.
0. 0. 6.18033989 0. 0. 0. 0.
16.18033989 0. 0. 0. 0. ]
Otherwise only the real part will be shown.
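For example, a minimal change to the plotting line from the question:
plt.plot(ff1t5, np.absolute(y1t5)); plt.show()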
I have to do the SVD of a matrix, but it has some errors; in the following example U[1][1], U[2][1] and U[2][0] should be 0.
The thing is that the above example was only a test. I have to work with large matrices which won't be so well conditioned, so what can I do to trust the results I'll get?
By most standards 1e-17 is considered to be 0.
For example, it passes the np.allclose test
In [582]: A=np.array([1,-1,1,1,1,1]).reshape(3,2)
In [583]: U,d,V=np.linalg.svd(A)
In [584]: U
Out[584]:
array([[ -8.56248666e-17, 1.00000000e+00, -6.40884929e-17],
[ -7.07106781e-01, 2.53974359e-17, -7.07106781e-01],
[ -7.07106781e-01, 2.53974359e-17, 7.07106781e-01]])
In [585]: y=np.array([[0,np.sqrt(2),0],[-1,0,-1],[-1,0,1]])/np.sqrt(2)
In [586]: y
Out[586]:
array([[ 0. , 1. , 0. ],
[-0.70710678, 0. , -0.70710678],
[-0.70710678, 0. , 0.70710678]])
In [587]: np.allclose(U,y)
Out[587]: True
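If you want to clean up such negligible entries before further processing, one option is to zero them out explicitly (a sketch; the tolerance tol is an arbitrary choice, not a fixed rule):
import numpy as np

tol = 1e-12                                  # pick a tolerance appropriate to your problem
U_clean = np.where(np.abs(U) < tol, 0.0, U)  # replace near-zero entries with exact zeros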
I need to sum the rows of a matrix, negate them, and put them on the diagonal of either the original matrix or a matrix where the off-diagonal terms are zero. What works is
Mat2 = numpy.diag(numpy.negative(numpy.squeeze(numpy.asarray(numpy.sum(Mat1, axis=1)))))
Is there a cleaner/faster way to do this? I'm trying to optimize some code.
I think np.diag(-Mat1.A.sum(1)) would produce the same results:
>>> Mat1 = np.matrix(np.random.rand(3,3))
>>> Mat1
matrix([[ 0.35702661, 0.0191392 , 0.34793743],
[ 0.9052968 , 0.16182118, 0.2239716 ],
[ 0.57865916, 0.77934846, 0.60984091]])
>>> Mat2 = np.diag(np.negative(np.squeeze(np.asarray(np.sum(Mat1,axis=1)))))
>>> Mat2
array([[-0.72410324, 0. , 0. ],
[ 0. , -1.29108958, 0. ],
[ 0. , 0. , -1.96784852]])
>>> np.diag(-Mat1.A.sum(1))
array([[-0.72410324, 0. , 0. ],
[ 0. , -1.29108958, 0. ],
[ 0. , 0. , -1.96784852]])
Note that matrices are a bit of a headache in numpy -- arrays are generally much more convenient -- and the only syntactic advantage they had, namely easier multiplication, doesn't really count any more now that we have @ for matrix multiplication in modern Python.
If Mat1 were an array instead of a matrix, you wouldn't need the .A there.
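For example, a minimal sketch of the same computation with a plain array (random data just for illustration):
import numpy as np

arr = np.random.rand(3, 3)            # a plain ndarray instead of np.matrix
Mat2 = np.diag(-arr.sum(axis=1))      # negated row sums on the diagonal, zeros elsewhere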