I have a cooccurrence matrix, a symmetric matrix (Numpy Array) in which each cell indicates the frequency of two co-occurring words.
In this matrix, I want to calculate the association strength, which is defined as the number of times words i and j co-occur, divided by the product of the total frequencies of i and j:
def calculate_association_strength(self, cooc, i, j, word_occurrences):
    return cooc[i, j] / (word_occurrences[i] * word_occurrences[j])
Here:
cooc = the cooccurrence matrix, with size vocabulary_size x vocabulary_size; cooc[i, j] is the cooccurrence count of words i and j.
word_occurrences = a list of length vocabulary_size, holding at each index i the frequency of word i.
i and j = integers, indicating the word indices.
I am looping through the cooccurrence matrix to calculate the association strength per cell, but this approach is very slow. I am familiar with np.apply_along_axis, but it is unclear to me how to use it for this calculation. Is this possible? And if so, how can I do this?
IIUC, you want to divide each element of cooc by the word_occurrences entry for its row and by the entry for its column. This can be done by simple elementwise division and broadcasting:
import numpy as np
cooc = np.array([[0, 1, 1, 0], [1, 0, 2, 1], [1, 2, 0, 1], [0, 1, 1, 0]])
word_occurrences = [1, 2, 3, 4]
cooc / word_occurrences / np.array(word_occurrences)[:, np.newaxis]
Result:
array([[0. , 0.5 , 0.33333333, 0. ],
[0.5 , 0. , 0.33333333, 0.125 ],
[0.33333333, 0.33333333, 0. , 0.08333333],
[0. , 0.125 , 0.08333333, 0. ]])
Is this what you are looking for?
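As a cross-check, here is a minimal sketch that compares the explicit double loop from the question with an equivalent vectorized form that divides by the outer product of the frequencies. The two function names are made up for illustration, and the sketch assumes every word frequency is nonzero.

```python
import numpy as np

def association_strength_loop(cooc, word_occurrences):
    """Reference implementation: one division per cell (slow)."""
    n = cooc.shape[0]
    out = np.zeros((n, n), dtype=float)
    for i in range(n):
        for j in range(n):
            out[i, j] = cooc[i, j] / (word_occurrences[i] * word_occurrences[j])
    return out

def association_strength_vectorized(cooc, word_occurrences):
    """Same result: divide by the outer product of the frequencies."""
    w = np.asarray(word_occurrences, dtype=float)
    return cooc / np.outer(w, w)

cooc = np.array([[0, 1, 1, 0], [1, 0, 2, 1], [1, 2, 0, 1], [0, 1, 1, 0]])
word_occurrences = [1, 2, 3, 4]

slow = association_strength_loop(cooc, word_occurrences)
fast = association_strength_vectorized(cooc, word_occurrences)
print(np.allclose(slow, fast))  # True
```

np.outer(w, w)[i, j] equals w[i] * w[j], so this is the same computation as the two chained divisions, written as a single denominator.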
I have an array P as shown below:
P
array([[ 0.49530662, 0.32619367, 0.54593724, -0.0224462 ],
[-0.10503237, 0.48607405, 0.28572714, 0.15175049],
[ 0.0286128 , -0.32407902, -0.56598029, -0.26743756],
[ 0.14353725, -0.35624814, 0.25655861, -0.09241335]])
and a vector y:
y
array([0, 0, 1, 0], dtype=int16)
I want to modify another matrix Z which has the same dimension as P, such that Z_ij = y_j when P_ij < 0.
In the above example, my Z matrix should be
Z = array([[-, -, -, 0],
[0, -, -, -],
[-, 0, 1, 0],
[-, 0, -, 0]])
Where '-' indicates the original Z values. What I thought about is a very straightforward implementation that iterates through each row of Z and compares the column values against the corresponding y and P. Do you know a better Pythonic/NumPy approach?
What you need is np.where. This is how to use it:
import numpy as np
z = np.array([[ 0.49530662, 0.32619367, 0.54593724, -0.0224462 ],
[-0.10503237, 0.48607405, 0.28572714, 0.15175049],
[ 0.0286128 , -0.32407902, -0.56598029, -0.26743756],
[ 0.14353725, -0.35624814, 0.25655861, -0.09241335]])
y = [0, 0, 1, 0]
# Where z < 0, take the matching entry of y (broadcast along each row); elsewhere keep z
result = np.where(z < 0, y, z)
Result
>>> print(result)
[[0.49530662 0.32619367 0.54593724 0. ]
[0. 0.48607405 0.28572714 0.15175049]
[0.0286128 0. 1. 0. ]
[0.14353725 0. 0.25655861 0. ]]
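The snippet above passes the P values in as z, so the cells that are kept come from P itself. If Z is a separate array, as in the question, the condition and the fallback can be different arrays. A minimal sketch, assuming a made-up Z filled with 9 so the replaced cells are easy to spot:

```python
import numpy as np

P = np.array([[ 0.49530662,  0.32619367,  0.54593724, -0.0224462 ],
              [-0.10503237,  0.48607405,  0.28572714,  0.15175049],
              [ 0.0286128 , -0.32407902, -0.56598029, -0.26743756],
              [ 0.14353725, -0.35624814,  0.25655861, -0.09241335]])
y = np.array([0, 0, 1, 0], dtype=np.int16)

# Hypothetical Z (not from the question), same shape as P.
Z = np.full(P.shape, 9.0)

# y has shape (4,) and broadcasts across rows, so column j receives y[j].
Z_new = np.where(P < 0, y, Z)

# In-place alternative: broadcast y to the full shape, then mask-assign.
mask = P < 0
Z[mask] = np.broadcast_to(y, P.shape)[mask]
```

np.where returns a new array, while the masked assignment modifies Z directly; both end up with the same values here.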
I am trying to interpolate a 2D numpy matrix with the dimensions (5, 3) to a matrix with the dimensions (7, 3), i.e. along axis 0 (each column gets longer). Obviously, the wrong approach would be to insert rows at arbitrary positions in the original matrix, see the following example:
Source:
[[0, 1, 1]
[0, 2, 0]
[0, 3, 1]
[0, 4, 0]
[0, 5, 1]]
Target (terrible interpolation -> not wanted!):
[[0, 1, 1]
[0, 1.5, 0.5]
[0, 2, 0]
[0, 3, 1]
[0, 3.5, 0.5]
[0, 4, 0]
[0, 5, 1]]
The correct approach would be to take every row into account and interpolate between all of them to expand the source matrix to a (7, 3) matrix. I am aware of the scipy.interpolate.interp1d and scipy.interpolate.interp2d methods, but could not get them to work based on other Stack Overflow posts or websites. Any tips or tricks are welcome.
Update #1: The expected values should be equally spaced.
Update #2:
What I want to do is basically use the separate columns of the original matrix, expand the length of the column to 7 and interpolate between the values of the original column. See the following example:
Source:
[[0, 1, 1]
[0, 2, 0]
[0, 3, 1]
[0, 4, 0]
[0, 5, 1]]
Split into 3 separate Columns:
[0 [1 [1
0 2 0
0 3 1
0 4 0
0] 5] 1]
Expand length to 7 and interpolate between them, example for second column:
[1
1.66
2.33
3
3.66
4.33
5]
It seems like each column can be treated completely independently, but for each column you need to define essentially an "x" coordinate so that you can fit some function "f(x)" from which you generate your output matrix.
Unless the rows in your matrix are associated with some other datastructure (e.g. a vector of timestamps), an obvious set of x values is just the row-number:
import numpy
import scipy.interpolate
x = numpy.arange(0, Source.shape[0])
You can then construct an interpolating function:
fit = scipy.interpolate.interp1d(x, Source, axis=0)
and use that to construct your output matrix:
Target = fit(numpy.linspace(0, Source.shape[0]-1, 7))
which produces:
array([[ 0. , 1. , 1. ],
[ 0. , 1.66666667, 0.33333333],
[ 0. , 2.33333333, 0.33333333],
[ 0. , 3. , 1. ],
[ 0. , 3.66666667, 0.33333333],
[ 0. , 4.33333333, 0.33333333],
[ 0. , 5. , 1. ]])
By default, scipy.interpolate.interp1d uses piecewise-linear interpolation. There are many more exotic options within scipy.interpolate, based on higher order polynomials, etc. Interpolation is a big topic in itself, and unless the rows of your matrix have some particular properties (e.g. being regular samples of a signal with a known frequency range), there may be no "truly correct" way of interpolating. So, to some extent, the choice of interpolation scheme will be somewhat arbitrary.
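For the piecewise-linear case specifically, a NumPy-only variant is possible as well. The following is a minimal sketch that treats each column independently with np.interp; the variable names source, x_new and target are chosen here for illustration.

```python
import numpy as np

source = np.array([[0, 1, 1],
                   [0, 2, 0],
                   [0, 3, 1],
                   [0, 4, 0],
                   [0, 5, 1]], dtype=float)

x = np.arange(source.shape[0])       # original row positions 0..4
x_new = np.linspace(0, x[-1], 7)     # 7 equally spaced positions

# np.interp works on 1-D data, so interpolate each column separately.
target = np.column_stack([np.interp(x_new, x, source[:, k])
                          for k in range(source.shape[1])])
print(target.shape)  # (7, 3)
```

This only covers linear interpolation; for other schemes the scipy.interpolate route is the one to use.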
You can do this as follows:
from scipy.interpolate import interp1d
import numpy as np
a = np.array([[0, 1, 1],
[0, 2, 0],
[0, 3, 1],
[0, 4, 0],
[0, 5, 1]])
x = np.arange(a.shape[0])
# define new x range, we need 7 equally spaced values
xnew = np.linspace(x.min(), x.max(), 7)
# apply the interpolation to each column
f = interp1d(x, a, axis=0)
# get final result
print(f(xnew))
This will print
[[ 0. 1. 1. ]
[ 0. 1.66666667 0.33333333]
[ 0. 2.33333333 0.33333333]
[ 0. 3. 1. ]
[ 0. 3.66666667 0.33333333]
[ 0. 4.33333333 0.33333333]
[ 0. 5. 1. ]]
I have several sparse vectors represented as lists of tuples, e.g.
[[(22357, 0.6265631775164965),
(31265, 0.3900572375543419),
(44744, 0.4075397480094991),
(47751, 0.5377595092643747)],
[(22354, 0.6265631775164965),
(31261, 0.3900572375543419),
(42344, 0.4075397480094991),
(47751, 0.5377595092643747)],
...
]
My goal is to build a scipy.sparse.csr_matrix from several million vectors like this.
Is there a simple, elegant way to do this conversion without loading everything into memory at once?
EDIT:
Just a clarification: my goal is to build a 2D matrix in which each of my sparse vectors represents one row.
Collecting indices and data into a structured array avoids the integer-double conversion issue. In limited testing it is also a bit faster than the vstack approach (with list data like this, np.array is faster than np.vstack).
import numpy as np
from scipy import sparse
indptr = np.cumsum([0]+[len(i) for i in vectors])
aa = np.array(vectors,dtype='i,f').flatten()
A = sparse.csr_matrix((aa['f1'], aa['f0'], indptr))
I substituted the list comprehension for map since I'm using Python3.
Indices in the coo format (data, (i,j)) might be more intuitive:
ii = [[i]*len(v) for i,v in enumerate(vectors)]
ii = np.array(ii).flatten()
aa = np.array(vectors,dtype='i,f').flatten()
A2 = sparse.coo_matrix((aa['f1'],(np.array(ii), aa['f0'])))
# A2.tocsr()
Here, ii from the first step holds the row number for each entry of each sublist:
[[0, 0, 0, 0],
[1, 1, 1, 1],
[2, 2, 2, 2],
[3, 3, 3, 3],
...]]
This construction method is slower than building the csr matrix directly from indptr.
For a case where there are differing numbers of entries per row, this approach works (using itertools.chain to flatten the lists, which needs from itertools import chain):
A sample list (no empty rows for now):
In [779]: vectors=[[(1, .12),(3, .234),(6,1.23)],
[(2,.222)],
[(2,.23),(1,.34)]]
row indexes:
In [780]: ii=[[i]*len(v) for i,v in enumerate(vectors)]
In [781]: ii=list(chain(*ii))
column and data values pulled from tuples and flattened
In [782]: jj=[j for j,_ in chain(*vectors)]
In [783]: data=[d for _,d in chain(*vectors)]
In [784]: ii
Out[784]: [0, 0, 0, 1, 2, 2]
In [785]: jj
Out[785]: [1, 3, 6, 2, 2, 1]
In [786]: data
Out[786]: [0.12, 0.234, 1.23, 0.222, 0.23, 0.34]
In [787]: A=sparse.csr_matrix((data,(ii,jj))) # coo style input
In [788]: A.toarray()
Out[788]:
array([[ 0. , 0.12 , 0. , 0.234, 0. , 0. , 1.23 ],
[ 0. , 0. , 0.222, 0. , 0. , 0. , 0. ],
[ 0. , 0.34 , 0.23 , 0. , 0. , 0. , 0. ]])
Consider the following:
import numpy as np
from scipy.sparse import csr_matrix
vectors = [[(22357, 0.6265631775164965),
(31265, 0.3900572375543419),
(44744, 0.4075397480094991),
(47751, 0.5377595092643747)],
[(22354, 0.6265631775164965),
(31261, 0.3900572375543419),
(42344, 0.4075397480094991),
(47751, 0.5377595092643747)]]
indptr = np.cumsum([0] + list(map(len, vectors)))
indices, data = np.vstack(vectors).T
A = csr_matrix((data, indices.astype(int), indptr))
Unfortunately, this way the column indices are converted from integers to doubles and back. This works correctly for up to very large matrices, but is not ideal.
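Regarding the memory concern in the question, one option is to consume the vectors one at a time (for example from a generator) and append to compact typed buffers instead of holding all tuples in a Python list. The following is only a sketch under that assumption; csr_from_rows is a hypothetical helper name, not a SciPy function.

```python
from array import array

import numpy as np
from scipy.sparse import csr_matrix

def csr_from_rows(rows, n_cols=None):
    """Build a CSR matrix from an iterable of [(col, value), ...] rows."""
    indptr = array("q", [0])   # 64-bit ints: where each row starts/ends
    indices = array("q")       # column index of every stored value
    data = array("d")          # the stored values as 64-bit floats
    for row in rows:
        for col, value in row:
            indices.append(col)
            data.append(value)
        indptr.append(len(indices))
    n_rows = len(indptr) - 1
    if n_cols is None:
        # Infer the width from the largest column index seen.
        n_cols = max(indices) + 1 if len(indices) else 0
    return csr_matrix(
        (np.asarray(data, dtype=np.float64),
         np.asarray(indices, dtype=np.int64),
         np.asarray(indptr, dtype=np.int64)),
        shape=(n_rows, n_cols),
    )

vectors = [[(1, 0.12), (3, 0.234), (6, 1.23)],
           [],                          # empty rows are handled too
           [(2, 0.23), (1, 0.34)]]
A = csr_from_rows(iter(vectors))
print(A.shape)  # (3, 7)
```

Because the column indices never pass through a float array, the integer-double round trip mentioned above does not occur, and rows may have different lengths.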
I have a 3D numpy array consisting of ones and zeros defining open versus filled space in a porous solid (it's currently a numpy int64 array). I want to determine the Euclidean distance from each of the "1" points (voxels) to its nearest zero point. Is there a simple way to do this?
What you are asking for is the distance transform, which you can compute using scipy's ndimage package and its distance_transform_edt function:
>>> import numpy as np
>>> import scipy.ndimage as ndi
>>> img = np.random.randint(2, size=(5, 5))
>>> img
array([[0, 0, 1, 1, 1],
[1, 0, 1, 0, 1],
[0, 1, 1, 1, 1],
[0, 0, 0, 1, 1],
[0, 1, 1, 1, 1]])
>>> ndi.distance_transform_edt(img)
array([[ 0. , 0. , 1. , 1. , 1.41421356],
[ 1. , 0. , 1. , 0. , 1. ],
[ 0. , 1. , 1. , 1. , 1.41421356],
[ 0. , 0. , 0. , 1. , 2. ],
[ 0. , 1. , 1. , 1.41421356, 2.23606798]])
If val contains the value (0 or 1) and pos contains the positions of each of these voxels, then you could use scipy.spatial.distance.cdist to compute all pairwise distances:
import numpy as np
from scipy.spatial.distance import cdist
# Find the points corresponding to zeros and ones
zero_indices = (val == 0)
one_indices = (val == 1)
# Compute all pairwise distances between zero-points and one-points
pairwise_distances = cdist(pos[zero_indices, :], pos[one_indices, :])
# Choose the minimum distance
min_dist = np.min(pairwise_distances, axis=0)
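To see how the two approaches relate, here is a small sketch on a made-up 3D volume. It builds the voxel coordinates with np.argwhere (an assumption about how pos might be obtained) and checks that the brute-force cdist minimum agrees with distance_transform_edt.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt
from scipy.spatial.distance import cdist

# Small made-up 3D volume: 1 = filled, 0 = open.
vol = np.ones((3, 4, 4), dtype=np.int64)
vol[0, 0, 0] = 0
vol[2, 3, 1] = 0

ones_pos = np.argwhere(vol == 1)     # (n_ones, 3) voxel coordinates
zeros_pos = np.argwhere(vol == 0)    # (n_zeros, 3) voxel coordinates

# Brute force: all one-to-zero distances, then the minimum per one-voxel.
brute = cdist(ones_pos, zeros_pos).min(axis=1)

# Distance transform: same numbers, without the pairwise matrix.
edt = distance_transform_edt(vol)
print(np.allclose(brute, edt[vol == 1]))  # True
```

The pairwise matrix has n_ones x n_zeros entries, so the cdist route is only practical for small volumes.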
I have data in a file in following form:
user_id, item_id, rating
1, abc,5
1, abcd,3
2, abc, 3
2, fgh, 5
So, the matrix I want to form for above data is following:
# item_ids
# abc abcd fgh
[[5, 3, 0] # user_id 1
[3, 0, 5]] # user_id 2
where missing data is replaced by 0.
From this I want to create both a user-to-user similarity matrix and an item-to-item similarity matrix.
How do I do that?
Technically, this is not a programming problem but a math problem. I think you are better off using the variance-covariance matrix, or the correlation matrix if the scales of the values are very different, say, if instead of having:
>>> x
array([[5, 3, 0],
[3, 0, 5],
[5, 5, 0],
[1, 1, 7]])
You have:
>>> x
array([[5, 300, 0],
[3, 0, 5],
[5, 500, 0],
[1, 100, 7]])
To get a variance-cov matrix:
>>> np.cov(x)
array([[ 6.33333333, -3.16666667, 6.66666667, -8. ],
[ -3.16666667, 6.33333333, -5.83333333, 7. ],
[ 6.66666667, -5.83333333, 8.33333333, -10. ],
[ -8. , 7. , -10. , 12. ]])
Or the correlation matrix:
>>> np.corrcoef(x)
array([[ 1. , -0.5 , 0.91766294, -0.91766294],
[-0.5 , 1. , -0.80295507, 0.80295507],
[ 0.91766294, -0.80295507, 1. , -1. ],
[-0.91766294, 0.80295507, -1. , 1. ]])
This is the way to look at it: a diagonal cell, e.g. the (0,0) cell, is the correlation of the 1st vector in x with itself, so it is 1. An off-diagonal cell, e.g. the (0,1) cell, is the correlation between the 1st and 2nd vectors in x; they are negatively correlated. Similarly, the 1st and 3rd vectors are positively correlated.
The covariance matrix and the correlation matrix avoid the zero problem pointed out by @Akavall.
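Note that np.cov(x) and np.corrcoef(x) treat each row as one variable, so with users as rows the calls above give the user-to-user matrix. A minimal sketch of getting both matrices from the first example x, using the rowvar parameter for the item-to-item case:

```python
import numpy as np

# Rows are users, columns are items (the first example matrix above).
x = np.array([[5, 3, 0],
              [3, 0, 5],
              [5, 5, 0],
              [1, 1, 7]])

user_corr = np.corrcoef(x)                 # rows as variables: 4 x 4
item_corr = np.corrcoef(x, rowvar=False)   # columns as variables: 3 x 3
print(user_corr.shape, item_corr.shape)  # (4, 4) (3, 3)
```

Passing rowvar=False is equivalent to transposing first, i.e. np.corrcoef(x.T); np.cov accepts the same parameter.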
See this question: What's the fastest way in Python to calculate cosine similarity given sparse matrix data?
Having:
import numpy as np
from sklearn.metrics import pairwise_distances
A = np.array(
[[0, 1, 0, 0, 1],
[0, 0, 1, 1, 1],
[1, 1, 0, 1, 0]])
dist_out = 1-pairwise_distances(A, metric="cosine")
dist_out
Result in:
array([[ 1. , 0.40824829, 0.40824829],
[ 0.40824829, 1. , 0.33333333],
[ 0.40824829, 0.33333333, 1. ]])
But that works for a dense matrix. For a sparse one you have to develop your own solution.
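One possible way to do that for sparse input, shown here only as a sketch: scale every row to unit length and multiply the result by its own transpose, staying in scipy.sparse throughout. The helper name sparse_cosine_similarity is made up for this example, and all-zero rows are assumed to get similarity 0.

```python
import numpy as np
from scipy import sparse

def sparse_cosine_similarity(A):
    """Row-by-row cosine similarity of a SciPy sparse matrix."""
    A = sparse.csr_matrix(A, dtype=float)
    # Euclidean norm of every row; rows of all zeros keep a norm of 1
    # so that they are not divided by zero (their similarity stays 0).
    norms = np.sqrt(np.asarray(A.multiply(A).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0
    normalized = sparse.diags(1.0 / norms) @ A
    return normalized @ normalized.T   # still sparse

A = sparse.csr_matrix(np.array([[0, 1, 0, 0, 1],
                                [0, 0, 1, 1, 1],
                                [1, 1, 0, 1, 0]]))
sim = sparse_cosine_similarity(A)
print(np.round(sim.toarray(), 4))
```

The similarity matrix itself can become dense when most rows share at least one nonzero column, so for very many rows it may need to be computed in blocks.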