I have to scale a matrix to [0, 1]. So, for each element of the matrix I have to apply this formula:
(Element - min_cols) / (max_cols - min_cols)
min_cols -> array with the minimum of each column of the matrix; max_cols -> the same, but with the maxima.
My problem is that I want to calculate the result with this:
result = (Element - min_cols) / (max_cols - min_cols)
In other words, from each element of the matrix I subtract the minimum of that element's column, and divide by the difference between that column's maximum and minimum.
But when, for example, the value from min_cols is negative and the value from max_cols is also negative, the subtraction effectively becomes a sum of the two.
I want to specify that the matrix is: _mat = np.random.randn(1000, 1000) * 50
Use numpy
Example
import numpy as np
x = 50*np.random.rand(6,4)
array([[26.7041017 , 46.88118463, 41.24541748, 31.17881807],
[47.57036124, 16.49040094, 6.62454156, 37.15976348],
[46.7157895 , 8.53357717, 39.01399714, 5.14287858],
[24.36012016, 5.67603151, 40.7697121 , 13.09877845],
[21.69045322, 12.61989002, 8.74692768, 46.23368735],
[ 3.9058066 , 35.50845507, 4.66785679, 2.34177134]])
Apply your formula
np.divide(np.subtract(x, x.min(axis=0)), x.max(axis=0)-x.min(axis=0))
array([[0.52212361, 1. , 1. , 0.65700132],
[1. , 0.26245187, 0.05349413, 0.79326663],
[0.98042871, 0.06934923, 0.93899483, 0.06381829],
[0.46844205, 0. , 0.98699461, 0.24507946],
[0.40730168, 0.16851918, 0.1115184 , 1. ],
[0. , 0.7239974 , 0. , 0. ]])
The max value of each column is mapped to 1, the min value of each column is mapped to 0, and the intermediate values are linearly mapped between 0 and 1.
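The chained np.divide/np.subtract calls above can also be written with plain operators; the guard for constant columns (which would give a zero denominator) is my addition, not part of the original formula:

```python
import numpy as np

def scale_01(x):
    """Scale each column of x to [0, 1] via (x - min) / (max - min)."""
    mins = x.min(axis=0)
    rng = x.max(axis=0) - mins
    # Guard against constant columns: replace a zero range with 1 so the
    # whole column maps to 0 instead of raising a divide-by-zero warning
    rng = np.where(rng == 0, 1, rng)
    return (x - mins) / rng

mat = np.random.randn(1000, 1000) * 50  # the matrix from the question
scaled = scale_01(mat)
```

This handles negative minima correctly as well: subtracting a negative min_cols value adds its magnitude, which is exactly what the formula requires.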
I've been staring at my textbook for 3 days, as well as supplementary Matlab code, and I'm simply not grasping how the Gamma matrix looks when the transition matrix has an exit state.
Usually, digamma would have t-1 elements, and each element would be a matrix of the transition matrix dimension (right? Digamma_t_ij)... so, say my toy example is
A = [[0.9, 0.1, 0.0],
[0.0, 0.9, 0.1]]
x = [-0.2, 2.6, 1.3]
The emission probabilities for states 1 and 2 are modelled by Gaussian distributions. I'm disregarding r (i.e. multiple sequences) for the moment.
In my mind digamma should have the shape np.zeros((sequence_length, self.nStates, self.nStates + 1)), where the +1 is for the exit state...
Although, here's an excerpt from my textbook:
"We therefore supplement the ξr matrix with one additional column", how... what... I feel so stupid :(
This is what I'm getting atm
Digamma
array([[[0.06495786, 0.93504214, 0. ],
[0. , 0. , 0. ]],
[[0. , 0.06495786, 0. ],
[0. , 0.93504214, 0. ]],
[[0. , 0. , 0. ],
[0. , 0. , 0. ]]])
I'm guessing the first two elements are right, as the values in each matrix add up to 1, so changing their last column just seems wrong... but filling in the last matrix doesn't seem right either?
I'll add all the variables needed to calculate digamma, in case someone can explain this to me and would prefer not to come up with a toy example:
# a_pass from the Forward Algorithm
a_pass = array([[1. , 0.38470424, 0.41887466],
[0. , 0.61529576, 0.58112534]])
# b_pass from the Backward Algorithm
b_pass = array([[1. , 1.03893571, 0. ],
[8.41537925, 9.35042138, 2.08182773]])
# scale variables derived from Forward
c_scale = array([1. , 0.16252347, 0.82658096, 0.05811253])
# Transition matrix
A = array([[0.9, 0.1, 0. ],
[0. , 0.9, 0.1]])
# Emission matrix, calculated as getting the pdf for each x given a gaussian dist
B = array([[1. , 0.06947052, 1. ],
[0.14182701, 1. , 0.81107303]])
# Observation sequence
x = [-0.2, 2.6, 1.3]
# The normalised gamma
gamma = array([[1. , 0.06495786, 0. ],
[0. , 0.93504214, 1. ]])
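For reference, the posted numbers are reproducible with the usual scaled Baum-Welch product, plus one supplemented exit column that is only filled at the final time step. That last part is my reading of the textbook's "one additional column" remark, so treat this as a sketch to verify against your text:

```python
import numpy as np

# Variables exactly as posted in the question
A = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.9, 0.1]])  # last column = exit probabilities
B = np.array([[1.0, 0.06947052, 1.0],
              [0.14182701, 1.0, 0.81107303]])
a_pass = np.array([[1.0, 0.38470424, 0.41887466],
                   [0.0, 0.61529576, 0.58112534]])
b_pass = np.array([[1.0, 1.03893571, 0.0],
                   [8.41537925, 9.35042138, 2.08182773]])
c_scale = np.array([1.0, 0.16252347, 0.82658096, 0.05811253])

T, n_states = 3, 2
digamma = np.zeros((T, n_states, n_states + 1))  # +1 column for the exit state

# Ordinary transitions for t = 0 .. T-2 (the scale factors cancel here)
for t in range(T - 1):
    for i in range(n_states):
        for j in range(n_states):
            digamma[t, i, j] = a_pass[i, t] * A[i, j] * B[j, t + 1] * b_pass[j, t + 1]

# Supplemented exit column: only the final time step can transition to exit,
# normalized by the extra (fourth) scale value from the forward termination
for i in range(n_states):
    digamma[T - 1, i, n_states] = a_pass[i, T - 1] * A[i, n_states] / c_scale[T]
```

With these inputs, each digamma[t] sums to 1 and summing over the last axis recovers the posted gamma, including gamma[:, -1] = [0, 1] coming entirely from the exit column, so the last matrix is not all zeros after all.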
I have this Python code to calculate distances among points with the following coordinates:
IDs,X,Y,Z
0-20,193.722,175.733,0.0998975
0-21,192.895,176.727,0.0998975
7-22,187.065,178.285,0.0998975
0-23,192.296,178.648,0.0998975
7-24,189.421,179.012,0.0998975
8-25,179.755,179.347,0.0998975
8-26,180.436,179.288,0.0998975
7-27,186.453,179.2,0.0998975
8-28,178.899,180.92,0.0998975
The code works perfectly, but as the number of coordinates I now have is very big (~50000), I need to optimise this code; otherwise it is impossible to run. Could someone suggest a more memory-efficient way of doing this? Thanks for any suggestions.
#!/usr/bin/env python
import pandas as pd
import scipy.spatial as spsp
df_1 =pd.read_csv('Spots.csv', sep=',')
coords = df_1[['X', 'Y', 'Z']].to_numpy()
distances = spsp.distance_matrix(coords, coords)
df_1['dist'] = distances.tolist()
# CREATES columns d0, d1, d2, d3
dist_cols = df_1['IDs']
df_1[dist_cols] = df_1['dist'].apply(pd.Series)
df_1.to_csv("results_Spots.csv")
There are a couple of ways to save space. The first is to only store the upper triangle of your matrix and make sure that your indices always reflect that. The second is only to store the values that meet your threshold. This can be done collectively by using sparse matrices, which support most of the operations you will likely need, and will only store the elements you need.
To store half the data, preprocess your indices when you access your matrix. So for your matrix, access index [i, j] like this:
def getitem(dist, i, j):
    if i > j:
        i, j = j, i
    return dist[i, j]
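As an aside, scipy already ships exactly this upper-triangle-only storage: scipy.spatial.distance.pdist returns the condensed form, and the swap trick becomes an index computation (the condensed_index helper below is my own illustration of the standard layout formula):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

coords = np.random.rand(100, 3)
condensed = pdist(coords)      # shape (N*(N-1)/2,), upper triangle only
full = squareform(condensed)   # expand to the full (N, N) matrix when needed

def condensed_index(n, i, j):
    """Map (i, j) to the position of that pair in pdist's condensed output."""
    if i > j:
        i, j = j, i
    return n * i - i * (i + 1) // 2 + (j - i - 1)

i, j = 5, 42
assert condensed[condensed_index(100, i, j)] == full[i, j]
```

This stores the N*(N-1)/2 distances in one flat array, which is the same halving of storage without managing the triangle yourself.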
scipy.sparse supports a number of sparse matrix formats: BSR, Coordinate, CSR, CSC, Diagonal, DOK, LIL. According to the usage reference, the easiest way to construct a matrix is using DOK or LIL format. I will show the latter for simplicity, although the former may be more efficient. I will leave it up to the reader to benchmark different options once a basic functioning approach has been shown. Remember to convert to CSR or CSC format when doing matrix math.
We will sacrifice speed for spatial efficiency by constructing one row at a time:
import numpy as np
import scipy.sparse

N = coords.shape[0]
threshold = 2
threshold2 = threshold**2  # minor optimization to save on some square roots
distances = scipy.sparse.lil_matrix((N, N))
for i in range(N):
    # Compute squared distances to all later points
    d2 = np.sum(np.square(coords[i + 1:, :] - coords[i]), axis=1)
    # Threshold
    mask = np.flatnonzero(d2 <= threshold2)
    # Apply; only compute the square root where necessary
    distances[i, mask + i + 1] = np.sqrt(d2[mask])
For your toy example, we find that there are only four elements that actually pass the threshold, making the storage very efficient:
>>> distances.nnz
4
>>> distances.toarray()
array([[0. , 1.29304486, 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 1.1008038 , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0.68355102, 0. , 1.79082802],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ]])
Using the result from scipy.spatial.distance_matrix confirms that these numbers are in fact accurate.
If you want to fill the matrix (effectively doubling the storage, which should not be prohibitive), you should probably move away from LIL format before doing so. Simply add the transpose to the original matrix to fill it out.
The approach shown here addresses your storage concerns, but you can improve the efficiency of the entire computation using spatial sorting and other geospatial techniques. For example, you could use scipy.spatial.KDTree or the similar scipy.spatial.cKDTree to arrange your dataset and query neighbors within a specific threshold directly and efficiently.
For example, the following would replace the matrix construction shown here with what is likely a more efficient method:
tree = scipy.spatial.KDTree(coords)
distances = tree.sparse_distance_matrix(tree, threshold)
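A sketch of that tree-based approach on the nine points from the question, with the same threshold of 2. Note that it never materializes the dense N x N matrix:

```python
import numpy as np
import scipy.spatial

# The coordinates from the question's Spots.csv sample
coords = np.array([
    [193.722, 175.733, 0.0998975],
    [192.895, 176.727, 0.0998975],
    [187.065, 178.285, 0.0998975],
    [192.296, 178.648, 0.0998975],
    [189.421, 179.012, 0.0998975],
    [179.755, 179.347, 0.0998975],
    [180.436, 179.288, 0.0998975],
    [186.453, 179.2,   0.0998975],
    [178.899, 180.92,  0.0998975],
])

tree = scipy.spatial.cKDTree(coords)
# Sparse matrix (DOK) of distances for pairs closer than max_distance
distances = tree.sparse_distance_matrix(tree, max_distance=2)
```

The result contains the same four close pairs found above (stored symmetrically), e.g. distances[0, 1] ≈ 1.293.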
You are asking your code for point-to-point distances in a ~50000 x ~50000 matrix. The result will be very big if you really want to store it. The matrix is dense, as each point has a positive distance to every other point.
I recommend revisiting your business requirements. Do you really need to calculate all these distances upfront and store them in a file on disk? Sometimes it is better to do the required calculations on the fly; scipy.spatial is fast, perhaps not much slower than reading a precalculated value.
EDIT (based on comment):
You can filter calculated distances by a threshold (here for illustration: 5.0) and then look up the IDs in the DataFrame
import numpy as np
import pandas as pd
import scipy.spatial as spsp

df_1 = pd.read_csv('Spots.csv', sep=',')
coords = df_1[['X', 'Y', 'Z']].to_numpy()
distances = spsp.distance_matrix(coords, coords)
adj_5 = np.argwhere(distances < 5.0)
pd.DataFrame(zip(df_1['IDs'][adj_5[:, 0]].values,
                 df_1['IDs'][adj_5[:, 1]].values),
             columns=['from', 'to'])
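One caveat: the distances < 5.0 mask also matches the zero diagonal and reports every pair twice, as (i, j) and (j, i). If that matters, restrict to the strict upper triangle first; a minimal sketch with a stand-in matrix:

```python
import numpy as np

# Tiny stand-in for the output of spsp.distance_matrix(coords, coords)
distances = np.array([[0.0, 1.3, 8.0],
                      [1.3, 0.0, 4.2],
                      [8.0, 4.2, 0.0]])

# Keep i < j only: no self-pairs, no (j, i) duplicates
adj_5 = np.argwhere(np.triu(distances < 5.0, k=1))
```

Here adj_5 is [[0, 1], [1, 2]]: each qualifying pair appears exactly once.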
I am trying to create a basic item-based recommender system with KNN, but with the following code the model always returns the same distances for different k's. Why does it return the same results?
df_ratings = pd.read_csv('ml-1m/ratings.dat', names=["user_id", "movie_id", "rating", "timestamp"],
header=None, sep='::', engine='python')
matrix_df = df_ratings.pivot(index='movie_id', columns='user_id', values='rating').fillna(0).astype(bool).astype(int)
um_matrix = scipy.sparse.csr_matrix(matrix_df.values)
# knn model
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=17, n_jobs=-1)
model_knn.fit(um_matrix)
distances, indices = model_knn.kneighbors(um_matrix[int(movie)], n_neighbors=100)
Your model returns the same distances for any k because k does not change the distances between your data points.
k-nearest neighbours simply finds the nearest neighbours of a point in your feature space; k specifies how many of them you are looking for, not how far they are from each other.
A simple example would be
X = [[0,0],[0,5],[5,0],[5,5],[4,4]]
(Scatter plot omitted: the points are the four corners of a 5x5 square plus one interior point at [4,4].)
The distances from each point to every point, sorted in ascending order, are:
[0,0]: [0. , 5. , 5. , 5.65685425, 7.07106781]
[0,5]: [0. , 4.12310563, 5. , 5. , 7.07106781]
[5,0]: [0. , 4.12310563, 5. , 5. , 7.07106781]
[5,5]: [0. , 1.41421356, 5. , 5. , 7.07106781]
[4,4]: [0. , 1.41421356, 4.12310563, 4.12310563, 5.65685425]
The first row shows the sorted distances from point [0,0] to every point:
to itself it's 0
to [0,5] the distance is 5
to [5,0] the distance is 5
to [4,4] it's (in my case the Euclidean distance) the square root of 4*4 + 4*4, so 5.65...
to [5,5] the Euclidean distance is 7.07106781
So no matter how many points you are looking for (k), the distances are always the same.
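The same point can be sketched with plain numpy (no sklearn needed): compute the distance matrix once, sort each row, and the first k entries form an identical prefix for every k:

```python
import numpy as np

X = np.array([[0, 0], [0, 5], [5, 0], [5, 5], [4, 4]], dtype=float)

# Full pairwise Euclidean distance matrix via broadcasting
diff = X[:, None, :] - X[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

sorted_dist = np.sort(dist, axis=1)
# "k neighbors" is just a prefix of each sorted row; changing k changes how
# many values you see, never the values themselves
k3 = sorted_dist[:, :3]
k5 = sorted_dist[:, :5]
assert np.array_equal(k3, k5[:, :3])
```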
I need to find the steady state of Markov models using the left eigenvectors of their transition matrices using some python code.
It has already been established in this question that scipy.linalg.eig fails to provide actual left eigenvectors as described, but a fix is demonstrated there. The official documentation is mostly useless and incomprehensible as usual.
A bigger problem than the incorrect format is that the eigenvalues produced are not in any particular order (not sorted, and different each time). So if you want to find the left eigenvectors that correspond to the eigenvalue 1, you have to hunt for them, and this poses its own problem (see below). The math is clear, but how to get Python to compute this and return the correct eigenvectors is not clear. Other answers to this question, like this one, don't seem to be using the left eigenvectors, so those can't be correct solutions.
This question provides a partial solution, but it doesn't account for the unordered eigenvalues of larger transition matrices. So, just using
leftEigenvector = scipy.linalg.eig(A,left=True,right=False)[1][:,0]
leftEigenvector = leftEigenvector / sum(leftEigenvector)
is close, but doesn't work in general because the entry in the [:,0] position may not be the eigenvector for the correct eigenvalue (and in my case it is usually not).
Okay, but the output of scipy.linalg.eig(A, left=True, right=False) is a tuple whose [0] element is an array of the eigenvalues (in no particular order), followed in position [1] by an array whose columns are the eigenvectors, in an order corresponding to those eigenvalues.
I don't know a good way to sort or search that whole thing by the eigenvalues to pull out the correct eigenvectors (all eigenvectors with eigenvalue 1, normalized by the sum of the vector entries). My thinking is to get the indices of the eigenvalues that equal 1, and then pull those columns from the array of eigenvectors. My version of this is slow and cumbersome. First I have a function (that doesn't quite work) to find the positions in a list that match a value:
# Find the positions of the element a in theList
def findPositions(theList, a):
return [i for i, x in enumerate(theList) if x == a]
Then I use it like this to get the eigenvectors matching the eigenvalues = 1.
M = transitionMatrix( G )
leftEigenvectors = scipy.linalg.eig(M, left=True, right=False)
unitEigenvaluePositions = findPositions(leftEigenvectors[0], 1.000)
steadyStateVectors = []
for i in unitEigenvaluePositions:
    thisEigenvector = leftEigenvectors[1][:, i]
    thisEigenvector = thisEigenvector / sum(thisEigenvector)
    steadyStateVectors.append(thisEigenvector)
print(steadyStateVectors)
But actually this doesn't work. There is one eigenvalue = 1.00000000e+00 +0.00000000e+00j that is not found even though two others are.
My expectation is that I am not the first person to use python to find stationary distributions of Markov models. Somebody who is more proficient/experienced probably has a working general solution (whether using numpy or scipy or not). Considering how popular Markov models are I expected there to be a library just for them to perform this task, and maybe it does exist but I couldn't find one.
You linked to How do I find out eigenvectors corresponding to a particular eigenvalue of a matrix? and said it doesn't compute the left eigenvector, but you can fix that by working with the transpose.
For example,
In [901]: import numpy as np
In [902]: import scipy.sparse.linalg as sla
In [903]: M = np.array([[0.5, 0.25, 0.25, 0], [0, 0.1, 0.9, 0], [0.2, 0.7, 0, 0.1], [0.2, 0.3, 0, 0.5]])
In [904]: M
Out[904]:
array([[ 0.5 , 0.25, 0.25, 0. ],
[ 0. , 0.1 , 0.9 , 0. ],
[ 0.2 , 0.7 , 0. , 0.1 ],
[ 0.2 , 0.3 , 0. , 0.5 ]])
In [905]: eval, evec = sla.eigs(M.T, k=1, which='LM')
In [906]: eval
Out[906]: array([ 1.+0.j])
In [907]: evec
Out[907]:
array([[-0.32168797+0.j],
[-0.65529032+0.j],
[-0.67018328+0.j],
[-0.13403666+0.j]])
In [908]: np.dot(evec.T, M).T
Out[908]:
array([[-0.32168797+0.j],
[-0.65529032+0.j],
[-0.67018328+0.j],
[-0.13403666+0.j]])
To normalize the eigenvector (which you know should be real):
In [913]: u = (evec/evec.sum()).real
In [914]: u
Out[914]:
array([[ 0.18060201],
[ 0.36789298],
[ 0.37625418],
[ 0.07525084]])
In [915]: np.dot(u.T, M).T
Out[915]:
array([[ 0.18060201],
[ 0.36789298],
[ 0.37625418],
[ 0.07525084]])
If you don't know the multiplicity of eigenvalue 1 in advance, see #pv.'s comment showing code using scipy.linalg.eig. Here's an example:
In [984]: M
Out[984]:
array([[ 0.9 , 0.1 , 0. , 0. , 0. , 0. ],
[ 0.3 , 0.7 , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0.25, 0.75, 0. , 0. ],
[ 0. , 0. , 0.5 , 0.5 , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. , 1. ],
[ 0. , 0. , 0. , 0. , 1. , 0. ]])
In [985]: import scipy.linalg as la
In [986]: evals, lvecs = la.eig(M, right=False, left=True)
In [987]: tol = 1e-15
In [988]: mask = abs(evals - 1) < tol
In [989]: evals = evals[mask]
In [990]: evals
Out[990]: array([ 1.+0.j, 1.+0.j, 1.+0.j])
In [991]: lvecs = lvecs[:, mask]
In [992]: lvecs
Out[992]:
array([[ 0.9486833 , 0. , 0. ],
[ 0.31622777, 0. , 0. ],
[ 0. , -0.5547002 , 0. ],
[ 0. , -0.83205029, 0. ],
[ 0. , 0. , 0.70710678],
[ 0. , 0. , 0.70710678]])
In [993]: u = lvecs/lvecs.sum(axis=0, keepdims=True)
In [994]: u
Out[994]:
array([[ 0.75, -0. , 0. ],
[ 0.25, -0. , 0. ],
[ 0. , 0.4 , 0. ],
[ 0. , 0.6 , 0. ],
[ 0. , -0. , 0.5 ],
[ 0. , -0. , 0.5 ]])
In [995]: np.dot(u.T, M).T
Out[995]:
array([[ 0.75, 0. , 0. ],
[ 0.25, 0. , 0. ],
[ 0. , 0.4 , 0. ],
[ 0. , 0.6 , 0. ],
[ 0. , 0. , 0.5 ],
[ 0. , 0. , 0.5 ]])
All right, I had to make some changes while implementing Warren's solution, and I've included those below. It's basically the same, so he gets all the credit, but the realities of numerical approximation with numpy and scipy required more massaging, which I thought would be helpful for others trying to do this in the future. I also changed the variable names to be super noob-friendly.
Please let me know if I got anything wrong or if there are further recommended improvements (e.g. for speed).
# in this case my Markov model is a weighted directed graph, so convert that nx.Graph (G) into its transition matrix
M = transitionMatrix( G )
# create a list of the left eigenvalues and a separate array of the left eigenvectors
theEigenvalues, leftEigenvectors = scipy.linalg.eig(M, right=False, left=True)
# for stationary distributions the eigenvalues and vectors are always real, and this speeds it up a bit
theEigenvalues = theEigenvalues.real
leftEigenvectors = leftEigenvectors.real
# set how close to 1 counts as 1... 1e-15 was too tight to find one of the actual unit eigenvalues
tolerance = 1e-10
# create a filter to collect the eigenvalues that are near enough to 1
mask = abs(theEigenvalues - 1) < tolerance
# apply that filter
theEigenvalues = theEigenvalues[mask]
# keep only the eigenvectors whose eigenvalue is (numerically) 1
leftEigenvectors = leftEigenvectors[:, mask]
# convert all the tiny and negative values to zero to isolate the actual stationary distributions
leftEigenvectors[leftEigenvectors < tolerance] = 0
# normalize each distribution by the sum of its eigenvector column
attractorDistributions = leftEigenvectors / leftEigenvectors.sum(axis=0, keepdims=True)
# this checks that the vectors are actually left eigenvectors, but it isn't needed for normal usage
# attractorDistributions = np.dot(attractorDistributions.T, M).T
# convert the column vectors into row vectors (lists), one per attractor (the standard output for this kind of analysis)
attractorDistributions = attractorDistributions.T
# a list of the states in any attractor, with the approximate stationary distribution within THAT attractor (e.g. for graph coloring)
theSteadyStates = np.sum(attractorDistributions, axis=0)
Putting that all together in an easy copy-and-paste format:
M = transitionMatrix( G )
theEigenvalues, leftEigenvectors = scipy.linalg.eig(M, right=False, left=True)
theEigenvalues = theEigenvalues.real
leftEigenvectors = leftEigenvectors.real
tolerance = 1e-10
mask = abs(theEigenvalues - 1) < tolerance
theEigenvalues = theEigenvalues[mask]
leftEigenvectors = leftEigenvectors[:, mask]
leftEigenvectors[leftEigenvectors < tolerance] = 0
attractorDistributions = leftEigenvectors / leftEigenvectors.sum(axis=0, keepdims=True)
attractorDistributions = attractorDistributions.T
theSteadyStates = np.sum(attractorDistributions, axis=0)
Using this analysis on a generated Markov model produced one attractor (of three) with a steady-state distribution of 0.19835218 and 0.80164782, compared to the mathematically exact values of 0.2 and 0.8. That's about 0.8% off, which is a big error for scientific work. It's not a REAL problem, because if accuracy is important then, now that the individual attractors have been identified, a more accurate analysis of behavior within each attractor can be done using a matrix subset.
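That final "matrix subset" idea can be made concrete: once an attractor's states are identified, slice out its submatrix and compute the stationary distribution there, away from the numerical noise of the full matrix. A sketch using the first attractor {0, 1} of the 6x6 example matrix from earlier:

```python
import numpy as np
import scipy.linalg

M = np.array([[0.9 , 0.1 , 0.  , 0.  , 0. , 0. ],
              [0.3 , 0.7 , 0.  , 0.  , 0. , 0. ],
              [0.  , 0.  , 0.25, 0.75, 0. , 0. ],
              [0.  , 0.  , 0.5 , 0.5 , 0. , 0. ],
              [0.  , 0.  , 0.  , 0.  , 0. , 1. ],
              [0.  , 0.  , 0.  , 0.  , 1. , 0. ]])

attractor = [0, 1]                     # states of one recurrent class
sub = M[np.ix_(attractor, attractor)]  # 2x2 submatrix; rows still sum to 1

evals, lvecs = scipy.linalg.eig(sub, right=False, left=True)
idx = np.argmin(abs(evals - 1))        # pick the eigenvalue closest to 1
u = lvecs[:, idx].real
u = u / u.sum()                        # normalize (also fixes the sign)
# u is [0.75, 0.25], matching the first column of the earlier result
```

Because the submatrix is small, the eigensolver's roundoff is confined to that attractor, which is exactly the follow-up analysis suggested above.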