Scaling problem with parallel word finding in text (Python)

I'm working in Python and I have to solve a simple task (at least, it has a simple definition):
I have a set of names, where each name is a sequence of tokens: names_to_find = ['York', 'New York', 'Dustin']
I have a corpus, which consists of a list of sentences: corpus = [' I love New York but also York ', ' Dustin is my cat ', ' I live in New York with my friend Dustin ']
My desired output is a dictionary with the names_to_find as keys and, for each occurrence in the corpus, a pair (sentence_index, word_index).
The desired output for the example is:
output = { 'York' : [(0, 3), (0, 6), (2, 4)], 'New York' : [(0, 2), (2, 3)], 'Dustin' : [(1, 0), (2, 8)]}
As you can see, if a name_to_find appears twice in the same sentence I want both occurrences, and for multi-word names (e.g., 'New York') I want the index of the first word.
The problem is that I have 1 million names_to_find and 4.8 million sentences in the corpus.
I wrote some code that does not scale, just to see whether the running time was acceptable (it was not): finding all the names in 100,000 (100k) sentences takes my code 12 hours :'(
My question: you can either help me improve my code or suggest a completely different approach, it doesn't matter; the only thing that matters is that the code scales.
I report my (parallel) code below; it only finds single words, and composite names (e.g. 'New York') are found in another function which checks whether the word indexes are contiguous:
from multiprocessing import Pool
import time

def parallel_find(self, n_proc):
    """
    Takes the entities in self.entities_token_in_corpus and calls
    self.create_word_occurrence_index on each of them.
    This method (and the ones it calls) is meant to run in parallel,
    so after the call a reduce is applied.
    :param n_proc: the number of processes used for the computation
    """
    p = Pool(n_proc)
    print('start indexing')
    t = time.time()
    index_list = p.map(self.create_word_occurrence_index, self.entities_token_in_corpus)
    t = time.time() - t
    index_list_dict = {k: v for elem in index_list for k, v in elem.items() if v}
    p.close()
    return index_list_dict, n_proc, len(self.corpus), t

def create_word_occurrence_index(self, word):
    """
    Loops over the whole corpus and calls self.find_in_sentence to find
    occurrences of word in each sentence; returns a dict.
    :param word: the word to find
    :return: a dict with the structure {entity_name: [(row, [occurrences in row]), ...]}
    """
    key = word
    returning_list = []
    for row_index, sent in enumerate(self.joined_corpus):
        if sent.find(' ' + word + ' ') != -1:
            indices = self.find_in_sentence(word=word, sentence=sent)
            if indices:
                returning_list.append((row_index, indices))
    return {key: returning_list}

def find_in_sentence(self, word, sentence):
    """
    Returns the indexes at which the word appears in a sentence.
    :param word: the word to find
    :param sentence: the sentence in which to find the word
    :return: a list of indices
    """
    indices = [i for i, x in enumerate(sentence.split()) if x == word]
    return indices
Thanks in advance
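The quadratic cost comes from scanning the whole corpus once for every name. One way around that is to invert the loops: tokenize each sentence once and look every token up in a dict keyed by the first word of each name, so the total work is roughly proportional to the number of tokens plus the number of matches. A minimal sketch of that idea (build_index and its internals are illustrative names, not code from the question):
from collections import defaultdict

def build_index(names_to_find, corpus):
    # group the names by their first token,
    # e.g. {'New': [['New', 'York']], 'York': [['York']], 'Dustin': [['Dustin']]}
    by_first_token = defaultdict(list)
    for name in names_to_find:
        tokens = name.split()
        by_first_token[tokens[0]].append(tokens)

    output = defaultdict(list)
    for s_idx, sentence in enumerate(corpus):
        words = sentence.split()
        for w_idx, word in enumerate(words):
            # only names starting with this token are candidates here
            for name_tokens in by_first_token.get(word, ()):
                if words[w_idx:w_idx + len(name_tokens)] == name_tokens:
                    output[' '.join(name_tokens)].append((s_idx, w_idx))
    return dict(output)

# build_index(names_to_find, corpus) on the example above yields
# {'New York': [(0, 2), (2, 3)], 'York': [(0, 3), (0, 6), (2, 4)], 'Dustin': [(1, 0), (2, 8)]}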

Here's an attempt using generators, but I'm not sure how much better it will perform on a large target. The problematic part is the multi-word matches, but I tried to build in some short-circuiting and early-termination logic (I think there is more to do on that, but the complexity starts building up there too):
def matcher(words, targets):
    for word in words:
        result = {word: []}               # empty dict to hold each word
        if len(word.split()) == 1:        # check to see if word is single
            for t, target in enumerate(targets):
                foo = target.split()
                bar = [(t, i) for i, x in enumerate(foo) if x == word]  # collect the indices
                if bar:
                    result[word].extend(bar)  # update the dict
            yield result                  # returns a generator
        else:
            consecutive = word.split()
            end = len(consecutive)
            starter = consecutive[0]      # only look for a first-word match
            for t, target in enumerate(targets):
                foo = target.split()
                limit = len(foo)
                if foo.count(starter):    # skip the entire target if the 1st word is missing
                    indices = [i for i, x in enumerate(foo)
                               if (x == starter and
                                   limit - end >= i)]  # skip starts where the phrase would run past the end
                    bar = []
                    for i in indices:
                        if foo[i:i + end] == consecutive:  # do the match (expensive)
                            bar.append((t, i))
                    result[word].extend(bar)
                else:
                    continue
            yield result
If you want to collect everything at one go, for this modified example
targets = [' I love New York but also York ',
           ' Dustin is my cat ',
           ' I live in New York with my friend Dustin ',
           ' New York State contains New York City aka New York']

values = ['York', 'New York', 'Dustin', 'New York State']

zed = matcher(values, targets)
print(list(zed))
Produces:
[{'York': [(0, 3), (0, 6), (2, 4), (3, 1), (3, 5), (3, 9)]},
 {'New York': [(0, 2), (2, 3), (3, 0), (3, 4), (3, 8)]},
 {'Dustin': [(1, 0), (2, 8)]},
 {'New York State': [(3, 0)]}]
There might be ways to exploit concurrency here, I'm really not sure, not being too familiar with that as of yet. See https://realpython.com/async-io-python/ for example. Also, I didn't go over that code carefully for off-by-one errors, so you probably want some unit tests here.
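Since the matching itself is CPU-bound, multiprocessing is probably a better fit than asyncio here. A rough sketch of how the matcher above could be spread over worker processes by chunking the list of names (the function names, n_proc and chunk_size are illustrative, and note each worker receives its own copy of targets, which is costly for a very large corpus):
from functools import partial
from multiprocessing import Pool

def match_chunk(word_chunk, targets):
    # run the matcher generator for one chunk of names and merge its dicts
    merged = {}
    for d in matcher(word_chunk, targets):
        merged.update(d)
    return merged

def parallel_match(words, targets, n_proc=4, chunk_size=10000):
    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    with Pool(n_proc) as pool:
        partial_results = pool.map(partial(match_chunk, targets=targets), chunks)
    # merge the per-chunk dicts; keys are disjoint because each name is in exactly one chunk
    out = {}
    for r in partial_results:
        out.update(r)
    return out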

Related

Walking through a syntax tree recursively

I have a sentence which is syntactically parsed. For example, "My mom wants to cook". The parse is [('My', 1), ('mom', 2), ('wants', -1), ('to', 2), ('cook', 3)]. The numbers are the indexes of the items the words depend on: 'mom' depends on 'wants', and 'wants' is the second element of the array (counting from zero, as usual). 'wants' has -1 because it is the core of the sentence; it doesn't depend on anything else. I need to GET the SUBJECT, which is 'My mom' here. How can I do this?
So far I have only tried writing loops, which don't work in every case. The problem is that the subject may consist of more than two words, and that number is not known in advance. Something like this...
'Values' is [('My', 1), ('mom', 2), ('wants', -1), ('to', 2), ('cook', 3)]
for indx, value in enumerate(values):
    m = morph.parse(value[0])
    if isinstance(m, list):
        m = m[0]
    if 'NOUN' in m.tag:
        if value[1] == str(index[0]):  # check if the word (part of the subject) depends on the verb
            terms.append([value[0], indx])

if len(terms) > 0:
    term = terms[0][0]

t = []
for indx, value in enumerate(values):
    if value[1] == str(terms[0][1]):  # check if the word depends on the found part of the subject
        m = morph.parse(value[0])
        if isinstance(m, list):
            m = m[0]
        if 'NOUN' in m.tag:
            t.append([value[0], terms[0][0]])
The algorithm should work like this: it walks the whole array and stops when it has found all the dependencies of the given word and all the dependencies of those dependencies (in the example, all the dependencies of 'mom'). Please help!
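Collecting "all the dependencies of the given word and all the dependencies of those dependencies" amounts to gathering the transitive dependents of the subject head. A small sketch of that idea, separate from the answers below (subtree and its argument names are illustrative):
values = [('My', 1), ('mom', 2), ('wants', -1), ('to', 2), ('cook', 3)]

def subtree(parsed, head_index):
    # collect head_index plus every word that (transitively) depends on it
    keep = {head_index}
    changed = True
    while changed:              # keep sweeping until no new dependents turn up
        changed = False
        for i, (_, parent) in enumerate(parsed):
            if parent in keep and i not in keep:
                keep.add(i)
                changed = True
    return ' '.join(parsed[i][0] for i in sorted(keep))

# the subject head is the noun that points directly at the verb ('mom', index 1):
print(subtree(values, 1))   # -> 'My mom'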
Sorry that I took so long to get back to you.
I finally figured it out. Code is at the bottom, but I'll explain how it works first:
When parse(values) is called, it iterates through the sentence in values and calls the recursive getDepth() for each word. getDepth() computes how many word relations lie between a given word and the verb.
E.g. for "The", the depth is 1, because it points directly at the verb.
For "King", it is 2, because "King" points at "The" and "The" points at the verb (where "points at" means the word's index entry refers to that word).
Once all depths are computed, parse() finds the word with the highest depth ("France") and uses the recursive traceFrom() to string the subject together.
All you really care about is parse(), which takes a preparsed list like [('My', 1), ('mom', 2), ('wants', -1), ('to', 2), ('cook', 3)] and spits out the complete subject. It works for both examples, but you should test some more.
values = [('The', 4), ('King', 0), ('of', 1), ('France', 2), ('died', -1)]
#values = [('My', 1), ('mom', 2), ('wants', -1), ('to', 2), ('cook', 3)]

def getDepth(i, s, n=0):
    print('D is %d' % n)
    if s[i][1] == -1:
        return n
    else:
        return getDepth(s[i][1], s, n + 1)

def traceFrom(m, s, dt=None):
    if dt is None:  # avoid a mutable default argument carrying over between calls
        dt = []
    print('Got n:%d, s:' % m, s)
    if s[m][1] == -1:
        d = []
        for i in range(len(s)):
            if i in dt:
                d.append(s[i][0])
        return " ".join(d)
    else:
        dt.append(m)
        return traceFrom(s[m][1], s, dt)

def parse(sentence):
    d = []
    for i in range(len(sentence)):
        d.append(getDepth(i, sentence))
    m = d.index(max(d))
    print('Largest is ', d[m], ' of ', d)
    return traceFrom(m, sentence)

print('Subject :', parse(values))
Given your preparsed array, this is quite easy:
values = [('My', 1), ('mom', 2), ('wants', -1), ('to', 2), ('cook', 3)]

def parseFrom(n, data):
    if values[n][1] != -1:
        #print('Stepping (%s,%d)' % (values[n][0], values[n][1]))
        data.append(values[n][0])
        return parseFrom(values[n][1], data)
    else:
        #print('At verb')
        return data

subject = ' '.join(parseFrom(0, []))
print('Sentence has subject:', subject)
The function keeps recursing as long as the current word is not the verb; once it reaches the verb, it returns the collected subject as a list. Sorry if it doesn't work on all sentences.

Reducing compute time for Anagram word search

The code below is a brute-force method of searching a list of words and creating sub-lists of any that are anagrams.
Searching the entire English dictionary is prohibitively time consuming, so I'm curious if anyone has tips for reducing the computational complexity of the code?
def anogramtastic(anagrms):
    d = []
    e = []
    for j in range(len(anagrms)):
        if anagrms[j] in e:
            pass
        else:
            templist = []
            tester = anagrms[j]
            tester = list(tester)
            tester.sort()
            tester = ''.join(tester)
            for k in range(len(anagrms)):
                if k == j:
                    pass
                else:
                    testers = anagrms[k]
                    testers = list(testers)
                    testers.sort()
                    testers = ''.join(testers)
                    if testers == tester:
                        templist.append(anagrms[k])
                        e.append(anagrms[k])
            if len(templist) > 0:
                templist.append(anagrms[j])
                d.append(templist)
    d.sort(key=len, reverse=True)
    return d

print(anogramtastic(wordlist))
How about using a dictionary of frozensets? Frozensets are immutable and hashable, meaning you can use them as dictionary keys with constant-time lookup on average. And when it comes to anagrams, what makes two words anagrams of each other is that they have the same letters with the same counts. So you can construct a frozenset of {(letter, count), ...} pairs for each word and use these as keys for efficient lookup.
Here's a quick little function to convert a word to a multiset using collections.Counter:
from collections import Counter, defaultdict

def word2multiset(word):
    return frozenset(Counter(word).items())
Now, given a list of words, populate your anagram dictionary like this:
list_of_words = [...]
anagram_dict = defaultdict(set)
for word in list_of_words:
    anagram_dict[word2multiset(word)].add(word)
For example, when list_of_words = ['hello', 'olleh', 'test', 'apple'], this is the output of anagram_dict after a run of the loop above:
print(anagram_dict)

defaultdict(set,
            {frozenset({('e', 1), ('h', 1), ('l', 2), ('o', 1)}): {'hello', 'olleh'},
             frozenset({('e', 1), ('s', 1), ('t', 2)}): {'test'},
             frozenset({('a', 1), ('e', 1), ('l', 1), ('p', 2)}): {'apple'}})
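To get back the kind of grouped output the original function returned (only the groups that actually contain anagrams), you can then filter the dictionary's values, for example:
# keep only the letter-multisets that map to more than one word
anagram_groups = [sorted(words) for words in anagram_dict.values() if len(words) > 1]
# with the example list above this gives [['hello', 'olleh']]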
Unless I'm misunderstanding the problem, simply grouping the words by sorting their characters should be an efficient solution -- as you've already realized. The trick is to avoid comparing every word to all the other ones. A dict with the char-sorted string as key makes finding the right group for each word fast; a lookup or insertion is O(1) on average.
#!/usr/bin/env python3
# coding=utf8
from sys import stdin

groups = {}
for line in stdin:
    w = line.strip()
    g = ''.join(sorted(w))
    if g not in groups:
        groups[g] = []
    groups[g].append(w)

for g, words in groups.items():
    if len(words) > 1:
        print('%2d %-20s' % (len(words), g), ' '.join(words))
Testing on my words file (99171 words), it seems to work well:
anagram$ wc /usr/share/dict/words
99171 99171 938848 /usr/share/dict/words
anagram$ time ./anagram.py < /usr/share/dict/words | tail
2 eeeprsw sweeper weepers
2 brsu burs rubs
2 aeegnrv avenger engrave
2 ddenoru redound rounded
3 aesy ayes easy yeas
2 gimnpu impugn umping
2 deeiinsst densities destinies
2 abinost bastion obtains
2 degilr girdle glider
2 orsttu trouts tutors
real 0m0.366s
user 0m0.357s
sys 0m0.012s
You can speed things up considerably by using a dictionary for checking membership instead of doing linear searches. The only "trick" is to devise a way to create keys for it such that it will be the same for anagrammatical words (and not for others).
In the code below this is being done by creating a sorted tuple from the letters in each word.
def anagramtastic(words):
    dct = {}
    for word in words:
        key = tuple(sorted(word))  # identifier based on letters
        dct.setdefault(key, []).append(word)
    # return a list of all words that had an anagram
    return [words for words in dct.values() if len(words) > 1]

wordlist = ['act', 'cat', 'binary', 'brainy', 'case', 'aces',
            'aide', 'idea', 'earth', 'heart', 'tea', 'tee']

print('result:', anagramtastic(wordlist))
Output produced:
result: [['act', 'cat'], ['binary', 'brainy'], ['case', 'aces'], ['aide', 'idea'], ['earth', 'heart']]

Python: Finding unknown repeated word(s) in a list of strings

I have a list of strings, which are subjects from different email conversations. I would like to see if there are words or word combinations which are being used frequently.
An example list would be:
subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal'
]
The function would have to detect that the combination "Company Name" is used more than once, and that "Proposal" is used more than once. These words won't be known in advance, though, so I guess it would have to start trying all possible combinations.
The actual list is of course a lot longer than this example, so manually trying all combinations doesn't seem like the best way to go. What would be the best way to go about this?
UPDATE
I've used Tim Pietzcker's answer to start developing a function for this, but I get stuck on applying the Counter correctly. It keeps returning the length of the list as the count for all phrases.
The phrases function, including a punctuation filter, a check whether the phrase has already been seen, and a maximum phrase length of 3 words:
def phrases(string, phrase_list):
    words = string.split()
    result = []
    punctuation = '\'\"-_,.:;!? '
    for number in range(len(words)):
        for start in range(len(words)-number):
            if number+1 <= 3:
                phrase = " ".join(words[start:start+number+1])
                if phrase in phrase_list:
                    pass
                else:
                    phrase_list.append(phrase)
                    phrase = phrase.strip(punctuation).lower()
                    if phrase:
                        result.append(phrase)
    return result, phrase_list
And then the loop through the list of subjects:
phrase_list = []
ranking = {}
for s in subjects:
result, phrase_list = phrases(s, phrase_list)
all_phrases = collections.Counter(phrase.lower() for s in subjects for phrase in result)
"all_phrases" returns a list with tuples where each count value is 167, which is the length of the subject list I'm using. Not sure what I'm missing here...
You also want to find phrases that are composed of more than single words. No problem. This should even scale quite well.
import collections

subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal',
    'Some more Firm / Company Names'
]

def phrases(string):
    words = string.split()
    result = []
    for number in range(len(words)):
        for start in range(len(words)-number):
            result.append(" ".join(words[start:start+number+1]))
    return result
The function phrases() splits the input string on whitespace and returns all contiguous phrases of every length:
In [2]: phrases("A Day in the Life")
Out[2]:
['A',
'Day',
'in',
'the',
'Life',
'A Day',
'Day in',
'in the',
'the Life',
'A Day in',
'Day in the',
'in the Life',
'A Day in the',
'Day in the Life',
'A Day in the Life']
Now you can count how many times each of these phrases are found in all your subjects:
all_phrases = collections.Counter(phrase for subject in subjects for phrase in phrases(subject))
Result:
In [3]: print([(phrase, count) for phrase, count in all_phrases.items() if count > 1])
Out [3]:
[('Company', 4), ('Proposal', 2), ('Firm', 2), ('Name', 3), ('Company Name', 3),
('Firm /', 2), ('/', 2), ('/ Company', 2), ('Firm / Company', 2)]
Note that you might want to use other criteria than simply splitting on whitespace, maybe ignore punctuation and case etc.
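For instance, a possible normalization step before calling phrases() (a sketch; the exact characters to strip out are an assumption):
import re

def normalize(subject):
    # lowercase and replace anything that is not a word character or whitespace with a space
    return re.sub(r'[^\w\s]', ' ', subject).lower()

all_phrases = collections.Counter(
    phrase for subject in subjects for phrase in phrases(normalize(subject)))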
I would suggest that you use space as a separator; otherwise there are too many possibilities if you don't specify what an allowed 'phrase' should look like.
To count word occurrences you can use Counter from the collections module:
import operator
from collections import Counter

d = Counter(' '.join(subjects).split())

# create a list of tuples, ordered by occurrence frequency
sorted_d = sorted(d.items(), key=operator.itemgetter(1), reverse=True)

# print all entries that occur more than once
for x in sorted_d:
    if x[1] > 1:
        print(x[1], x[0])
Output:
3 Name
3 Company
2 Proposal
Similar to pp_'s answer, using split().
import operator

subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal'
]

flat_list = [item for i in subjects for item in i.split()]
count_dict = {i: flat_list.count(i) for i in flat_list}
sorted_dict = sorted(count_dict.items(), reverse=True, key=operator.itemgetter(1))
Output:
[('Name', 3),
('Company', 3),
('Proposal', 2),
('Other', 1),
('/', 1),
('for', 1),
('cooperate', 1),
('Request', 1),
('Introduction', 1),
('Into', 1),
('-', 1),
('to', 1),
('Firm', 1)]

How to return the count of the same elements in two lists?

I have two very large lists (that's why I used ...): a list of lists:
x = [['I like stackoverflow. Hi ok!'],['this is a great community'],['Ok, I didn\'t like this!.'],...,['how to match and return the frequency?']]
and a list of strings:
y = ['hi', 'nice', 'ok',..., 'frequency']
I would like to return, in a new list, the number of times (count) that any word in y occurs in each of the lists of x. For example, for the above lists, this should be the correct output:
[(1,2),(2,0),(3,1),...,(n,count)]
That is, [(1, count), ..., (n, count)], where n is the index of the list and count is the number of times any word from y appeared in it. Any idea how to approach this?
First, you should preprocess x into a list of sets of lowercased words -- that will speed up the following lookups enormously. E.g.:
import re

ppx = []
for subx in x:
    # each subx is a one-element list; findall returns the word strings directly
    ppx.append(set(w.lower() for w in re.findall(r'\w+', subx[0])))
(yes, you could collapse this into a list comprehension, but I'm aiming for some legibility).
Next, you loop over y, checking how many of the sets in ppx contain each item of y -- that would be
[sum(1 for s in ppx if w in s) for w in y]
That doesn't give you those redundant first items you crave, but enumerate to the rescue...:
list(enumerate((sum(1 for s in ppx if w in s) for w in y), 1))
should give exactly what you require.
Here is a more readable solution. Check my comments in the code.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re

x = [['I like stackoverflow. Hi ok!'],['this is a great community'],['Ok, I didn\'t like this!.'],['how to match and return the frequency?']]
y = ['hi', 'nice', 'ok', 'frequency']
assert len(x) == len(y), "you have to make sure length of x equals y's"

num = []
for i in xrange(len(y)):
    # lower all the strings in x for comparison,
    # find all matched patterns in x, count them, and store the result in num
    num.append(len(re.findall(y[i], x[i][0].lower())))

res = []
# use enumerate to give output in the format you want
for k, v in enumerate(num):
    res.append((k, v))

# here is what you want
print res
OUTPUT:
[(0, 1), (1, 0), (2, 1), (3, 1)]
INPUT:
x = [['I like stackoverflow. Hi ok!'],['this is a great community'],
['Ok, I didn\'t like this!.'],['how to match and return the frequency?']]
y = ['hi', 'nice', 'ok', 'frequency']
CODE:
import re

s1 = set(y)
index = 0
result = []
for itr in x:
    # remove special chars (.,!?) and convert to lower case
    itr = re.sub('[!.?,]', '', itr[0].lower()).split(' ')
    s2 = set(itr)
    # find the intersection of common strings
    intersection = s1 & s2
    num = len(intersection)
    result.append((index, num))
    index = index + 1
OUTPUT:
result = [(0, 2), (1, 0), (2, 1), (3, 1)]
You could do it like this as well.
>>> import re
>>> x = [['I like stackoverflow. Hi ok!'],['this is a great community'],['Ok, I didn\'t like this!.'],['how to match and return the frequency?']]
>>> y = ['hi', 'nice', 'ok', 'frequency']
>>> l = []
>>> for i, j in enumerate(x):
        c = 0
        for word in y:
            if re.search(r'(?i)\b' + word + r'\b', j[0]):
                c += 1
        l.append((i+1, c))
>>> l
[(1, 2), (2, 0), (3, 1), (4, 1)]
(?i) makes the match case-insensitive. \b is a word boundary, which matches between a word character and a non-word character.
Maybe you could concatenate the strings in x to make the computation easy:
w = ' '.join(i[0] for i in x)
Now w is a long string like this:
>>> w
"I like stackoverflow. Hi ok! this is a great community Ok, I didn't like this!. how to match and return the frequency?"
With this conversion, you can simply do this:
>>> l = []
>>> for i in range(len(y)):
        l.append((i+1, w.count(str(y[i]))))
which gives you:
>>> l
[(1, 2), (2, 0), (3, 1), (4, 1)]
You can also make a dictionary whose keys are the items in the y list, loop through the sentences in your nested x list, and update a key's value each time you encounter that word.
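A sketch of that idea (note it produces per-word totals keyed by the entries of y, rather than the per-sentence pairs asked for above):
import re

counts = {word: 0 for word in y}
for sub in x:
    # each sub is a one-element list of text; tokenize it and tally the hits
    for token in re.findall(r'\w+', sub[0].lower()):
        if token in counts:
            counts[token] += 1
# counts -> {'hi': 1, 'nice': 0, 'ok': 2, 'frequency': 1}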

Separating nltk.FreqDist words into two lists?

I have a series of texts that are instances of a custom WebText class. Each text is an object that has a rating (-10 to +10) and a word count (nltk.FreqDist) associated with it:
>>trainingTexts = [WebText('train1.txt'), WebText('train2.txt'), WebText('train3.txt'), WebText('train4.txt')]
>>trainingTexts[1].rating
10
>>trainingTexts[1].freq_dist
<FreqDist: 'the': 60, ',': 49, 'to': 38, 'is': 34,...>
How can I now get two lists (or dictionaries): one containing every word used exclusively in the positively rated texts (trainingTexts[].rating > 0), and another containing every word used exclusively in the negatively rated texts (trainingTexts[].rating < 0)? Each list should contain the total word counts across all the positive or negative texts, so that you get something like this:
>>only_positive_words
[('sky', 10), ('good', 9), ('great', 2)...]
>>only_negative_words
[('earth', 10), ('ski', 9), ('food', 2)...]
I considered using sets, as sets contain unique instances, but I can't see how this can be done with nltk.FreqDist, and on top of that, a set wouldn't be ordered by word frequency. Any ideas?
Ok, let's say you start with this for the purposes of testing:
import nltk

class Rated(object):
    def __init__(self, rating, freq_dist):
        self.rating = rating
        self.freq_dist = freq_dist

a = Rated(5, nltk.FreqDist('the boy sees the dog'.split()))
b = Rated(8, nltk.FreqDist('the cat sees the mouse'.split()))
c = Rated(-3, nltk.FreqDist('some boy likes nothing'.split()))
trainingTexts = [a, b, c]
Then your code would look like:
from collections import defaultdict
from operator import itemgetter

# dictionaries for keeping track of the counts
pos_dict = defaultdict(int)
neg_dict = defaultdict(int)

for r in trainingTexts:
    rating = r.rating
    freq = r.freq_dist
    # choose the appropriate counts dict
    if rating > 0:
        partition = pos_dict
    elif rating < 0:
        partition = neg_dict
    else:
        continue
    # add the information to the correct counts dict
    for word, count in freq.iteritems():
        partition[word] += count

# turn the counts dictionaries into lists of descending-frequency words
def only_list(counts, filtered):
    return sorted(filter(lambda (w, c): w not in filtered, counts.items()),
                  key=itemgetter(1),
                  reverse=True)

only_positive_words = only_list(pos_dict, neg_dict)
only_negative_words = only_list(neg_dict, pos_dict)
And the result is:
>>> only_positive_words
[('the', 4), ('sees', 2), ('dog', 1), ('cat', 1), ('mouse', 1)]
>>> only_negative_words
[('nothing', 1), ('some', 1), ('likes', 1)]
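If you are on Python 3 rather than Python 2, freq.iteritems() becomes freq.items() and the tuple-unpacking lambda in only_list is no longer valid syntax; an equivalent Python 3 only_list might look like this (a sketch with the same logic):
from operator import itemgetter

def only_list(counts, filtered):
    # keep only words that never occur in the other partition,
    # sorted by descending total count
    return sorted(((w, c) for w, c in counts.items() if w not in filtered),
                  key=itemgetter(1), reverse=True)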
