I'm trying to calculate the co-occurrence matrix for a large corpus but it takes a very long time (over 6 hours). Are there any faster ways?
My approach:
Consider this array as the corpus, where each element is treated as one context:
corpus = [
    'where python is used',
    'what is python used in',
    'why python is best',
    'what companies use python'
]
Algorithm:
import numpy as np

words = list(set(' '.join(corpus).split(' ')))
c_matrix = np.zeros((len(words), len(words)), dtype='int')

for context in corpus:
    context = context.split(' ')
    for i in range(len(context)):
        for j in range(i + 1, len(context)):
            row = words.index(context[i])
            column = words.index(context[j])
            c_matrix[row][column] += 1
The provided algorithm is inefficient because it recomputes words.index(...) many times. You can pre-compute the indices once per context and then build the matrix. Here is a significantly better solution:
words = list(set(' '.join(corpus).split(' ')))
c_matrix = np.zeros((len(words), len(words)), dtype='int')

for context in corpus:
    context = context.split(' ')
    index = [words.index(item) for item in context]
    for i in range(len(context)):
        for j in range(i + 1, len(context)):
            c_matrix[index[i]][index[j]] += 1
Moreover, you can convert index to a NumPy array and use Numba (or Cython) to build c_matrix very quickly from it.
Finally, you can turn words into a dictionary (with the strings as keys and their positions in the list as values) so that each lookup is a constant-time operation. A sketch of both ideas follows.
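For illustration, here is a hedged sketch combining both suggestions. The word_to_index dict and the accumulate helper are my own names, it assumes Numba is installed, and it reuses corpus, words and c_matrix from the code above:

import numpy as np
from numba import njit  # assumption: Numba is installed; drop the decorator to stay in pure Python

word_to_index = {w: i for i, w in enumerate(words)}  # constant-time lookups

@njit
def accumulate(c_matrix, index):
    # count ordered co-occurring pairs within one context
    for i in range(len(index)):
        for j in range(i + 1, len(index)):
            c_matrix[index[i], index[j]] += 1

for context in corpus:
    index = np.array([word_to_index[w] for w in context.split(' ')], dtype=np.int64)
    accumulate(c_matrix, index)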
The resulting algorithm should be several orders of magnitude faster. If that is still not enough, then you probably need to replace the matrix c_matrix with a more advanced (but also much more complex) sparse data structure, depending on your needs.
Related
I have a corpus with m documents and n unique words.
Based on this corpus, I want to calculate a co-occurrence matrix for words and calculate their similarity.
To do so, I have created a NumPy array occurrences (m x n), which indicates which words are present in each document.
Based on the occurrences, I have created cooccurrences as follows:
cooccurrences = np.transpose(occurrences) @ occurrences
Furthermore, word_occurrences gives the sum per word in the corpus:
word_occurrences = occurrences.sum(axis=0)
Now, I want to calculate the similarity scores of words in cooccurrences based on the association strength.
I want to divide each cell i, j in cooccurrences, by word_occurrences[i] * word_occurrences[j].
Currently, I loop through cooccurrences to do this.
def calculate_association_strength(cooc, i, j, word_occurrences):
    return cooc / (word_occurrences[i] * word_occurrences[j])

for i in range(len(cooccurrences)):
    for j in range(len(cooccurrences)):
        if i != j:
            if cooccurrences[i, j] > 0:
                cooccurrences[i, j] = 1 - calculate_association_strength(cooccurrences[i, j], i, j, word_occurrences)
            else:
                cooccurrences[i, j] = 0
But with m > 30 000, this is very time-consuming. Is there a faster way to do this?
Here, they discuss mapping a function over an np.array. However, those examples don't use multiple variables derived from the array indices.
If I understand the problem correctly, you could vectorise the whole operation, so the core computation becomes:
cooccurrences / (word_occurrences.reshape(-1, 1) * word_occurrences)
Note the parentheses: you want to divide by the product word_occurrences[i] * word_occurrences[j], not divide by one factor and multiply by the other. This is pretty much guaranteed to be faster than looping through the array in Python.
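To reproduce the full loop above (including the 1 - ... step and the skipped diagonal), a hedged sketch along these lines should work; the names follow the question, and the small occurrences array is only there to make the example self-contained:

import numpy as np

# occurrences is (m x n), one row per document, one column per word
occurrences = np.array([[1, 1, 0],
                        [1, 0, 1],
                        [0, 1, 1]])
cooccurrences = occurrences.T @ occurrences
word_occurrences = occurrences.sum(axis=0)

strength = cooccurrences / (word_occurrences.reshape(-1, 1) * word_occurrences)
mask = cooccurrences > 0
np.fill_diagonal(mask, False)  # the original loop leaves i == j untouched
cooccurrences = np.where(mask, 1 - strength, cooccurrences)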
For performance reasons I'd like to use the Python list insert() method. I will demonstrate why:
My final list is a 31k * 31k matrix:
w=31*10**3
h=31*10**3
distance_matrix = [[0 for x in range(w)] for y in range(h)]
I intend to update the matrix one iteration at a time:
for i in range(len(index)):
    for j in range(len(index)):
        distance_matrix[index[i]][index[j]] = k[0][i][j]
Obviously this doesn't perform well.
I'd rather like to start with an empty list and fill it up gradually, making the computation intense at the end of the process (and easy at the beginning):
distance_matrix = []
for i in range(len(index)):
    for j in range(len(index)):
        distance_matrix.insert([index[i]][index[j]], k[0][i][j])
But this multi-index or list-in-list insert doesn't seem to be possible.
How would you advise to proceed? I've also looked into numpy arrays, but without luck so far.
To be precise: updating the (ordered) large array of zeros index by index is the issue here. In a DataFrame I can use custom columns/indices, but that is not scalable in performance.
Additional information:
I split up the entire original data matrix in parts to compute distance matrices in parallel. The issue in this process is to aggregate the distance matrix again with the computed values. The distance matrix/array is very large, therefore a simple list insert or edit takes very long.
I think this approach achieves what I had in mind:
distance_matrix = []

def dynamic_append(x, i, j, val):
    if (len(x) - 1) < i:
        dif_x = i - len(x) + 1
        for k in range(dif_x):
            x.append([])
        dif_y = j - len(x[i]) + 1
        for l in range(dif_y):
            x[i].append([])
    elif (len(x[i]) - 1) < j:
        dif_y = j - len(x[i]) + 1
        for l in range(dif_y):
            x[i].append([])
    x[i][j] = val
    return x

for i in range(len(index)):
    for j in range(len(index)):
        distance_matrix = dynamic_append(distance_matrix, index[i], index[j], k[0][i][j])
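For reference, a NumPy-based alternative is a hedged sketch like the following. It assumes index holds the target row/column positions and k[0] is the len(index) x len(index) block computed for them, as in the question, and it replaces both nested loops with a single vectorised assignment:

import numpy as np

# ~7.7 GB as float64 at 31k x 31k; consider dtype=np.float32 or a sparse format if memory is tight
distance_matrix = np.zeros((w, h))
distance_matrix[np.ix_(index, index)] = k[0]  # scatter the whole block in one call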
I am new to pandas and Python. I want to find common words in my data set. For example, I have a list of companies such as ["Microsoft.com", "Microsoft", "Microsoft com", "apple", ...]. I have around 1M such company names, and I want to calculate the correlation between them to find which words are relevant, e.g. Microsoft.com, Microsoft and Microsoft com share common words.
This is what I did, but it's very slow:
import hashlib
import pandas as pd
from pandas import DataFrame, Series

companies = pd.read_csv('/tmp/companies.csv', error_bad_lines=False)
unique_companies = companies.groupby(['company'])['company'].unique()

df = DataFrame()
for company in unique_companies:
    df[hashlib.md5(company).hexdigest()] = [{'name': company[0], 'code': [ord(c) for c in company[0]]}]

rows = df.unstack()
for company in rows:
    series1 = Series(company['code'])
    for word in rows:
        series2 = Series(word['code'])
        if series1.corr(series2) > 0.8:
            company['match'] = [word['name']]
Can anyone please guide me on how to compute a correlation matrix for the words?
I don't think there's a corr function that will work for strings - only numerics.
If you can somehow compress your words into meaningful numeric values that preserves the "closeness" of one against another, you might then be able to "corr" them, but other options are available.
Hamming Distance is one (basic) method, but slightly better is calculating the Levenshtein difference: http://en.wikipedia.org/wiki/Levenshtein_distance
It's tricky, but one way of trying this would be to build a matrix of m x n cells, where m is the number of unique words in your first wordlist and n is the number of unique words in the second wordlist - then calculate the Hamming or Levenshtein distances between the row/column identifiers.
There are python modules that package up the distance-algorithms for you -
e.g. https://pypi.python.org/pypi/python-Levenshtein/
Or you could write your own; I think the packaged ones are likely to be faster, as they're implemented in C.
So, assuming the Levenshtein module (I don't know, as I have not used it) provides a function, say getLev(word1, word2), that generates a numeric score, you should be able to feed in the contents of the two wordlists from sources 1 and 2. If you make sure your inputs are already filtered for uniqueness, and maybe sorted alphabetically, that would help too.
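For illustration only, here is one possible getLev built on the python-Levenshtein package linked above. Levenshtein.ratio returns a similarity in [0, 1], so higher means closer, which also matches the >= threshold check used further down; use Levenshtein.distance instead if you want a raw edit distance:

import Levenshtein  # assumption: pip install python-Levenshtein

def getLev(word1, word2):
    # similarity ratio in [0, 1]; 1.0 means identical strings
    return Levenshtein.ratio(word1, word2)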
Feed them into a matrix generation function.
Here, I've imported numpy as np and am using that module for speed
import numpy as np

def genLevenshteinMatrix(wordlist1, wordlist2):
    x = len(wordlist1)
    y = len(wordlist2)
    l_matrix = np.zeros((x, y))
    for i in range(0, x):
        x_word = wordlist1[i]
        for j in range(0, y):
            y_word = wordlist2[j]
            l_matrix[i][j] = getLev(x_word, y_word)
    return l_matrix
Something like that should allow you to generate a matrix that stores a measure of which words are most like which other words.
Once that's created, you can interrogate it using a function like this:
def interrogate_Levenshtein_matrix(ndarray_x, wordlist1, wordlist2, float_threshold):
    l = []
    x = len(ndarray_x)
    y = len(ndarray_x[0])
    for i in range(0, x):
        for j in range(0, y):
            if ndarray_x[i][j] >= float_threshold:
                l.append([(wordlist1[i], wordlist2[j]), ndarray_x[i][j]])
    return l
That will output a list of word pairs whose score is at or above the threshold, each entry containing the two words and their score. Note that with a raw distance a lower value means "closer", so you would flip the comparison to <=; with a similarity score such as a ratio, the >= comparison above is what you want.
You might need to trim it down somehow, as I think you'll get all like-combinations twice, i.e. ['word', 'work'] as one return value and ['work', 'word'] as another (a sketch for avoiding that follows below).
As you develop the code, you could swap in different correlation functions and try different threshold values.
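As for trimming the duplicated pairs: when both wordlists are the same list, a hedged sketch is to walk only the upper triangle of the matrix, so each pair is reported once and the diagonal is skipped:

def interrogate_upper_triangle(ndarray_x, wordlist, float_threshold):
    # same idea as interrogate_Levenshtein_matrix, but only for j > i
    l = []
    n = len(ndarray_x)
    for i in range(n):
        for j in range(i + 1, n):
            if ndarray_x[i][j] >= float_threshold:
                l.append([(wordlist[i], wordlist[j]), ndarray_x[i][j]])
    return l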
I am just starting to deal with sparse matrices, so I'm not really proficient on this topic. My problem is that I have a simple co-occurrence matrix from a word list: a 2-dimensional word-by-word matrix counting how many times a word occurs in the same context. The matrix is quite sparse, since the corpus is not that big. I want to convert it to a sparse matrix to deal with it better and eventually do some matrix multiplication afterwards. Here is what I have done so far (only the first part; the rest is just output formatting and data cleaning):
from collections import defaultdict

def matrix(from_corpus):
    d = defaultdict(lambda: defaultdict(int))
    heads = set()
    trans = set()
    for text in from_corpus:
        d[text[0]][text[1]] += 1
        heads.add(text[0])
        trans.add(text[1])
    return d, heads, trans
My idea would be to make a new function:
def matrix_to_sparse(d):
    A = sparse.lil_matrix(d)
Does this make any sense? This is, however, not working, and somehow I don't see how to get a sparse matrix out of it. Should I rather work with NumPy arrays? What would be the best way to do this? I want to compare many ways of dealing with matrices.
It would be nice if someone could point me in the right direction.
Here's how you construct a document-term matrix A from a set of documents in SciPy's COO format, which is a good tradeoff between ease of use and efficiency:
import scipy.sparse

vocabulary = {}  # map terms to column indices
data = []        # values (maybe weights)
row = []         # row (document) indices
col = []         # column (term) indices

for i, doc in enumerate(documents):
    for term in doc:
        # get column index, adding the term to the vocabulary if needed
        j = vocabulary.setdefault(term, len(vocabulary))
        data.append(1)  # uniform weights
        row.append(i)
        col.append(j)

A = scipy.sparse.coo_matrix((data, (row, col)))
Now, to get a cooccurrence matrix:
A.T * A
(ignore the diagonal, which holds co-occurrences of terms with themselves, i.e. squared frequencies).
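If you would rather drop the diagonal than just ignore it, one hedged option is scipy's setdiag; converting to LIL first avoids a sparse-efficiency warning:

C = (A.T * A).tolil()
C.setdiag(0)    # remove term-with-itself counts
C = C.tocsr()   # back to a format that is efficient for arithmetic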
Alternatively, use some package that does this kind of thing for you, such as Gensim or scikit-learn. (I'm a contributor to both projects, so this might not be unbiased advice.)
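As a hedged sketch of the scikit-learn route: CountVectorizer builds the sparse document-term matrix for you (raw_documents here is a hypothetical list of plain strings rather than token lists):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(raw_documents)  # sparse document-term counts
cooc = X.T @ X                               # term-term co-occurrence, same computation as A.T * A above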
I'm implementing feature vectors as bit maps for documents in a corpus. I already have the vocabulary for the entire corpus (as a list/set) and a list of the terms in each document.
For example, if the corpus vocabulary is ['a', 'b', 'c', 'd'] and the terms in document d1 are ['a', 'b', 'd', 'd'], the feature vector for d1 should be [1, 1, 0, 2].
To generate the feature vector, I'd iterate over the corpus vocabulary and check if each term is in the list of document terms, then set the bit in the correct position in the document's feature vector.
What would be the most efficient way to implement this? Here are some things I've considered:
Using a set would make checking vocab membership very efficient but sets have no ordering, and the feature vector bits need to be in the order of the sorted corpus vocabulary.
Using a dict for the corpus vocab (mapping each vocab term to an arbitrary value, like 1) would allow iteration over sorted(dict.keys()) so I could keep track of the index. However, I'd have the space overhead of dict.values().
Using a sorted(list) would be inefficient to check membership.
What would StackOverflow suggest?
I think the most efficient way is to loop over each document's terms, get the position of the term in the (sorted) corpus and set the bit accordingly.
The sorted list of corpus terms can be stored as a dictionary with a term -> index mapping (basically an inverted index).
You can create it like so:
corpus = dict(((term, index) for index, term in enumerate(sorted(all_words))))
For each document you'd have to generate a list of 0's as feature vector:
num_words = len(corpus)
fvs = [[0]*num_words for _ in docs]
Then building the feature vectors would be:
for i, doc_terms in enumerate(docs):
    fv = fvs[i]
    for term in doc_terms:
        fv[corpus[term]] += 1
There is no overhead in testing membership; you just have to loop over all the terms of all the documents.
That all said, depending on the size of the corpus, you should also have a look at numpy and scipy. It is likely that you will run into memory problems, and scipy provides special datatypes for sparse matrices (instead of using a list of lists), which can save a lot of memory.
You can use the same approach as shown above, but instead of adding numbers to list elements, you add them to matrix elements (e.g. the rows will be the documents and the columns the terms of the corpus).
You can also make use of some matrix operations provided by numpy if you want to apply local or global weighting schemes.
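As a hedged sketch of that last point, reusing the corpus dict, docs and num_words from above: scipy.sparse.lil_matrix supports the same incremental assignment, so the fill loop barely changes:

import scipy.sparse

fvs = scipy.sparse.lil_matrix((len(docs), num_words))
for i, doc_terms in enumerate(docs):
    for term in doc_terms:
        fvs[i, corpus[term]] += 1
fvs = fvs.tocsr()  # convert before doing heavy arithmetic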
I hope this gets you started :)