How to convert co-occurrence matrix to sparse matrix - python

I am just starting to deal with sparse matrices, so I'm not really proficient on this topic. My problem: I have a simple co-occurrence matrix built from a word list, i.e. a 2-dimensional word-by-word matrix counting how many times two words occur in the same context. The matrix is quite sparse since the corpus is not that big. I want to convert it to a sparse matrix to handle it more efficiently and eventually do some matrix multiplication afterwards. Here is what I have done so far (only the first part; the rest is just output formatting and data cleaning):
from collections import defaultdict

def matrix(corpus):
    d = defaultdict(lambda: defaultdict(int))
    heads = set()
    trans = set()
    for text in corpus:
        d[text[0]][text[1]] += 1
        heads.add(text[0])
        trans.add(text[1])
    return d, heads, trans
My idea would be to make a new function:
def matrix_to_sparse(d):
    A = sparse.lil_matrix(d)
Does this make any sense? However, it is not working, and I don't see how to get a sparse matrix from the dictionary. Should I rather work with numpy arrays? What would be the best way to do this? I want to compare many ways of dealing with matrices.
It would be nice if someone could point me in the right direction.

Here's how you construct a document-term matrix A from a set of documents in SciPy's COO format, which is a good tradeoff between ease of use and efficiency:
import scipy.sparse

vocabulary = {}  # map terms to column indices
data = []        # values (maybe weights)
row = []         # row (document) indices
col = []         # column (term) indices

for i, doc in enumerate(documents):
    for term in doc:
        # get column index, adding the term to the vocabulary if needed
        j = vocabulary.setdefault(term, len(vocabulary))
        data.append(1)  # uniform weights
        row.append(i)
        col.append(j)

A = scipy.sparse.coo_matrix((data, (row, col)))
Now, to get a cooccurrence matrix:
A.T * A
(ignore the diagonal, which holds co-occurrences of terms with themselves, i.e. squared frequencies).
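A minimal sketch putting the two steps together, continuing from the snippet above (A, the COO matrix just built) and zeroing out the diagonal explicitly:

A = A.tocsr()              # convert to CSR for fast arithmetic
cooc = (A.T * A).tolil()   # term-by-term co-occurrence counts
cooc.setdiag(0)            # drop the squared-frequency diagonal
cooc = cooc.tocsr()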
Alternatively, use some package that does this kind of thing for you, such as Gensim or scikit-learn. (I'm a contributor to both projects, so this might not be unbiased advice.)

Related

Any alternate approaches to calculate co-occurrence matrix in Python?

I'm trying to calculate the co-occurrence matrix for a large corpus, but it takes a very long time (6+ hours). Are there any faster ways?
My approach:
Consider this array as the corpus, with each element of the corpus as a context:
corpus = [
    'where python is used',
    'what is python used in',
    'why python is best',
    'what companies use python'
]
Algorithm:
import numpy as np

words = list(set(' '.join(corpus).split(' ')))
c_matrix = np.zeros((len(words), len(words)), dtype='int')

for context in corpus:
    context = context.split(' ')
    for i in range(len(context)):
        for j in range(i + 1, len(context)):
            row = words.index(context[i])
            column = words.index(context[j])
            c_matrix[row][column] += 1
The provided algorithm is not efficient because it recomputes words.index(...) many times. You can pre-compute the indices first and then build the matrix. Here is a significantly better solution:
words = list(set(' '.join(corpus).split(' ')))
c_matrix = np.zeros((len(words), len(words)), dtype='int')

for context in corpus:
    context = context.split(' ')
    index = [words.index(item) for item in context]
    for i in range(len(context)):
        for j in range(i + 1, len(context)):
            c_matrix[index[i]][index[j]] += 1
Moreover, you can transform index to a Numpy array and use Numba (or Cython) to build the c_matrix very quickly from index.
Finally, you can turn words into a dictionary (with the strings in the list as keys and their positions as values) so that each lookup is constant-time instead of a linear scan (see the sketch below).
The resulting algorithm should be several orders of magnitude faster. If this is not enough, then you probably need to replace c_matrix with a more advanced (but also much more complex) sparse data structure, depending on your needs.
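As a rough illustration of the dictionary and Numba suggestions above, here is a hedged sketch; word_to_index and count_pairs are names of my own, and it assumes numba is installed:

import numpy as np
from numba import njit

@njit
def count_pairs(index, c_matrix):
    # count ordered co-occurring pairs within one context
    n = len(index)
    for i in range(n):
        for j in range(i + 1, n):
            c_matrix[index[i], index[j]] += 1

word_to_index = {word: i for i, word in enumerate(words)}  # O(1) lookups
c_matrix = np.zeros((len(words), len(words)), dtype=np.int64)

for context in corpus:
    tokens = context.split(' ')
    index = np.array([word_to_index[t] for t in tokens], dtype=np.int64)
    count_pairs(index, c_matrix)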

How to collect similar vectors efficiently with Python?

I have a large set of 3-dimensional vectors, and each vector is associated with a weight. That is, the set is in the form
{[[0.707,0.5,0.5],0.3],[[0.6,0.8,0],0.2]....}
I want to collect vectors which are close to each other and sum their weights up. My idea is: if a vector A is close to another vector B, I treat them as the same vector and add the weight of A to the weight of B. The Python code is
def gathervecs(vecs):
    # in each vec there are two elements: the first is the normalized vector,
    # the second is the norm**2
    gathers = []
    for vec in vecs:
        index = 0
        for i, avec in enumerate(gathers):
            if sum(abs(vec[0] - avec[0])) < 10**(-10):
                gathers[i][1] = avec[1] + vec[1]
                index = 1
                break
        if index == 0:
            gathers.append(vec)
    return gathers
But the running time of this code is quadratic in the size of the original set. So the question is: how can I design a more efficient algorithm?
PS: please generate a random original set to test the efficiency of the algorithm.
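One common way to make this closer to linear time is to quantize each vector onto a coarse grid and use the rounded coordinates as a dictionary key, so near-identical vectors fall into the same bucket and their weights can be summed directly. A hedged sketch (the function name, the rounding precision, and the test data are my own choices, not from the question):

from collections import defaultdict
import numpy as np

def gathervecs_fast(vecs, decimals=8):
    # bucket vectors by their rounded coordinates and sum the weights per bucket
    buckets = defaultdict(float)
    for vec, weight in vecs:
        key = tuple(np.round(vec, decimals))
        buckets[key] += weight
    return [[np.array(key), weight] for key, weight in buckets.items()]

# random test set (normalized vector, weight), as requested in the question
rng = np.random.default_rng(0)
vecs = []
for _ in range(100000):
    v = rng.normal(size=3)
    vecs.append([v / np.linalg.norm(v), rng.random()])
gathered = gathervecs_fast(vecs)

Note that two vectors sitting just either side of a rounding boundary end up in different buckets, so this approximates, rather than exactly reproduces, the 10**(-10) tolerance test above.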

Is it possible to translate this Python code to Cython?

I'm actually looking to speed up #2 of this code as much as possible, so I thought it might be useful to try Cython. However, I'm not sure how to work with sparse matrices in Cython. Can somebody show how to, or whether it's possible to, wrap it in Cython (or perhaps Julia) to make it faster?
#1) This part computes u_dict dictionary filled with unique strings and then enumerates them.
import scipy.sparse as sp
import numpy as np
from scipy.sparse import csr_matrix

full_dict = set(train1.values.ravel().tolist() + test1.values.ravel().tolist() +
                train2.values.ravel().tolist() + test2.values.ravel().tolist())
print len(full_dict)

u_dict = dict()
for i, q in enumerate(full_dict):
    u_dict[q] = i

shape = (len(full_dict), len(full_dict))
H = sp.lil_matrix(shape, dtype=np.int8)

def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])
#2) I need to speed up this part
# train_full is a pandas dataframe with two columns w1 and w2 filled with strings
H = load_sparse_csr('matrix.npz')
correlation_train = []
for idx, row in train_full.iterrows():
    if idx % 1000 == 0: print idx
    id_1 = u_dict[row['w1']]
    id_2 = u_dict[row['w2']]
    a_vec = H[id_1].toarray()  # these vectors are of length < 3 mil.
    b_vec = H[id_2].toarray()
    correlation_train.append(np.corrcoef(a_vec, b_vec)[0][1])
While I contributed to How to properly pass a scipy.sparse CSR matrix to a cython function? quite some time ago, I doubt that cython is the way to go, especially if you don't already have experience with numpy and cython. cython gives the biggest speedup when you replace iterative calculations with code that it can translate to C without calling numpy or other python code. Throw pandas into the mix and you have an even bigger learning curve.
And the important parts of the scipy sparse code are already written in cython.
Without touching the cython issue I see a couple of problems.
H is defined twice:
H = sp.lil_matrix(shape, dtype=np.int8)
H = load_sparse_csr('matrix.npz')
That's either an oversight, or a failure to understand how Python variables are created and assigned. The second assignment replaces the first, so the first does nothing. In addition, the first just makes an empty lil matrix. Such a matrix could be filled iteratively; while not fast, that is the intended use of the lil format.
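For illustration only (the values below are hypothetical, not from the question), the iterative-fill pattern that lil is meant for looks like this:

H = sp.lil_matrix(shape, dtype=np.int8)
H[0, 5] = 1          # cheap incremental assignment in LIL format
H[2, 3] = 4
H = H.tocsr()        # convert once the structure is fixed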
The second assignment (load_sparse_csr) creates a new matrix from data saved in an npz file. That involves the numpy npz file loader as well as the basic csr matrix creation code. And since the attributes are already in csr format, there's nothing for cython to touch.
You do have an iteration here - but over a Pandas dataframe:
for idx, row in train_full.iterrows():
    id_1 = u_dict[row['w1']]
    a_vec = H[id_1].toarray()
Looks like you are picking a particular row of H based on a dictionary lookup. Sparse matrix indexing is slow compared to dense matrix indexing. That is, if Ha = H.toarray() fits in your memory, then
a_vec = Ha[id_1,:]
will be a lot faster.
Faster selection of rows (or columns) from a sparse matrix has been asked about before (see the link below). If you could work directly with the sparse data of a row, I could recommend something more direct. But you want a dense array that you can pass to np.corrcoef, so we'd have to implement the toarray step as well.
How to read/traverse/slice Scipy sparse matrices (LIL, CSR, COO, DOK) faster?
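If the full dense Ha does not fit in memory, a middle ground (a sketch of my own, not from the linked answer) is to build the dense row by hand from the CSR attributes, which skips the overhead of H[id_1] creating a new one-row sparse matrix:

import numpy as np

def dense_row(H, i):
    # slice the CSR storage for row i and scatter it into a dense vector
    start, stop = H.indptr[i], H.indptr[i + 1]
    out = np.zeros(H.shape[1], dtype=H.dtype)
    out[H.indices[start:stop]] = H.data[start:stop]
    return out

a_vec = dense_row(H, id_1)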

Better way to map a function (cosine similarity) to each row in scipy.csr_matrix

Suppose I have a sparse matrix of document collection, where each row is a vector representing a document (generated by scikit-learn's tfidf_transformer for example).
tfidf_matrix = tfidf_transformer.fit_transform(posting)
Now I have a query coming in,
query = transformer.transform(vectorizer.transform(['I am a sample query']))
So I want to compare this query to each document (each row) of the matrix using scipy.spatial.distance.cosine (cosine similarity). So I do a map as follows:
result = map(lambda document: cosine(document.toarray(), query[0].toarray()), tfidf_matrix)
it could be done with a loop as well
result = []
for row in tfidf_matrix:
    result = result + [cosine(row.toarray(), query[0].toarray())]
However, it is slow (out of frustration I even threw a gevent.threadpool.map at it, with the same result). I am pretty sure this is not the right way of doing this (mapping a function to each row of a sparse matrix), but I can't seem to find the proper way.
So the question is, what is the proper way to map a function to each row in the sparse matrix (scipy.csr_matrix)?
First thing I noticed is that you're running query[0].toarray() every time you go through the for loop (or on every iteration of the map() call). Is that value ever going to change between rows? Because if it isn't, you can save some time by calculating it just once, outside the for loop:
result = []
query_array = query[0].toarray()
for row in tfidf_matrix:
    result = result + [cosine(row.toarray(), query_array)]
Also, don't do result = result + [another_list_element]; that's much slower than result.append(another_list_element). In this case, you should be doing:
result = []
query_array = query[0].toarray()
for row in tfidf_matrix:
    result.append(cosine(row.toarray(), query_array))
Or with map, that would be:
query_array = query[0].toarray()
result = map(lambda document: cosine(document.toarray(), query_array), tfidf_matrix)
There may be other speedups possible as well, but try this one and see if it helps.
EDIT: Also, have you seen Function application over numpy's matrix row/column? It looks like the vectorize function may be what you want. I can't give you more details since I'm not really familiar with numpy and scipy myself, but that looks like a good starting point for your reading.
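For what it's worth, a fully vectorized alternative (my own suggestion, not part of the answer above) is to L2-normalize the rows once and get every similarity in a single sparse matrix product, avoiding the per-row toarray() entirely:

import numpy as np
from sklearn.preprocessing import normalize

tfidf_norm = normalize(tfidf_matrix)   # L2-normalize each row, stays sparse
query_norm = normalize(query)          # same for the query row
similarities = tfidf_norm.dot(query_norm.T).toarray().ravel()
distances = 1.0 - similarities         # scipy's cosine() returns a distance, not a similarity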

Pandas: matrix correlation for words

I am new to pandas and python. I want to find common words in my data set. E.g. I have a list of companies ["Microsoft.com", "Microsoft", "Microsoft com", "apple", ...] etc. I have around 1M such companies, and I want to calculate the correlation between them to find the relevance of the words, e.g. Microsoft.com, Microsoft, and Microsoft com share common words.
This is what I did, but it's very slow:
import hashlib
import pandas as pd
from pandas import DataFrame, Series

companies = pd.read_csv('/tmp/companies.csv', error_bad_lines=False)
unique_companies = companies.groupby(['company'])['company'].unique()

df = DataFrame()
for company in unique_companies:
    df[hashlib.md5(company).hexdigest()] = [{'name': company[0], 'code': [ord(c) for c in company[0]]}]

rows = df.unstack()
for company in rows:
    series1 = Series(company['code'])
    for word in rows:
        series2 = Series(word['code'])
        if series1.corr(series2) > 0.8:
            company['match'] = [word['name']]
Can anyone please guide me on how to compute a correlation matrix for the words?
I don't think there's a corr function that will work for strings - only numerics.
If you can somehow compress your words into meaningful numeric values that preserves the "closeness" of one against another, you might then be able to "corr" them, but other options are available.
Hamming distance is one (basic) method, but slightly better is calculating the Levenshtein distance: http://en.wikipedia.org/wiki/Levenshtein_distance
It's tricky, but one way of trying this would be to build an m x n matrix, where m is the number of unique words in your first wordlist and n is the number of unique words in the second wordlist, then calculate the Hamming or Levenshtein distance between each row/column pair of identifiers.
There are python modules that package up the distance-algorithms for you -
e.g. https://pypi.python.org/pypi/python-Levenshtein/
Or you could write your own; I think the packaged ones are likely to be faster as they're written in C.
So, assuming the Levenshtein module (I don't know it, as I have not used it) provides a function, say getLev(word1, word2), that generates a numeric score, you should be able to feed in the contents of two wordlists from sources 1 and 2. If you make sure your inputs are already filtered for uniqueness, and maybe sorted alphabetically, that would help too.
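For reference, with the python-Levenshtein package linked above, getLev could be as simple as the following (a sketch based on my own assumption; check the package docs for the exact API):

import Levenshtein

def getLev(word1, word2):
    # raw edit distance; Levenshtein.ratio(word1, word2) gives a 0..1 similarity instead
    return Levenshtein.distance(word1, word2)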
Feed them into a matrix generation function.
Here, I've imported numpy as np and am using that module for speed
import numpy as np

def genLevenshteinMatrix(wordlist1, wordlist2):
    x = len(wordlist1)
    y = len(wordlist2)
    l_matrix = np.zeros((x, y))
    for i in range(0, x):
        x_word = wordlist1[i]
        for j in range(0, y):
            y_word = wordlist2[j]
            l_matrix[i][j] = getLev(x_word, y_word)
    return l_matrix
Something like that should allow you to generate a matrix that stores a measure of which words are most like which other words.
Once that's created, you can interrogate it using a function like this:
def interrogate_Levenshtein_matrix(ndarray_x, wordlist1, wordlist2, float_threshold):
    l = []
    x = len(ndarray_x)
    y = len(ndarray_x[0])
    for i in range(0, x):
        for j in range(0, y):
            if ndarray_x[i][j] >= float_threshold:
                l.append([(wordlist1[i], wordlist2[j]), ndarray_x[i][j]])
    return l
And that will output a list of word pairs that pass the threshold, each entry containing the pair of words and their score. Note that if the matrix holds raw Levenshtein distances, "close" means a lower value, so you would flip the comparison to <=; if it holds a similarity score, keep >= as written.
You might need to trim the output down somehow, as I think you'll get all like-combinations twice, i.e. ['word','work'] as one return value and ['work','word'] as another.
As you develop the code, you could swap in different correlation functions and try different threshold values.
