I have a sentence which has been syntactically parsed. For example, "My mom wants to cook". The parse is [('My', 1), ('mom', 2), ('wants', -1), ('to', 2), ('cook', 3)]. Each number is the index of the item the word depends on: 'mom' depends on 'wants', and 'wants' is element 2 of the array (counting from zero as usual). 'wants' has -1 because it is the root of the sentence and doesn't depend on anything else. I need to GET the SUBJECT, which here is 'my mom'. How can I do this?
So far I have only tried writing loops, and they don't work in every case. The problem is that the subject may consist of more than two words, and how many is not fixed. Something like this...
values = [('My', 1), ('mom', 2), ('wants', -1), ('to', 2), ('cook', 3)]
for indx, value in enumerate(values):
    m = morph.parse(value[0])
    if isinstance(m, list):
        m = m[0]
    if 'NOUN' in m.tag:
        if value[1] == str(index[0]):  # checks if the word (a part of the subject) depends on the verb
            terms.append([value[0], indx])
if len(terms) > 0:
    term = terms[0][0]
    t = []
    for indx, value in enumerate(values):
        if value[1] == str(terms[0][1]):  # checks if the word depends on the found part of the subject
            m = morph.parse(value[0])
            if isinstance(m, list):
                m = m[0]
            if 'NOUN' in m.tag:
                t.append([value[0], terms[0][0]])
The algorithm should work like this: it walks the whole array and stops when it has found all the dependencies of the given word, plus all the dependencies of those dependencies (in the example, all the dependencies of 'mom'). Please help!
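Just to make the goal concrete, here is a minimal sketch of the intended behaviour on the example (purely illustrative, not a working solution for real parses): start from the root, take its first dependent as the subject head, then collect every word whose dependency chain leads back to that head.
values = [('My', 1), ('mom', 2), ('wants', -1), ('to', 2), ('cook', 3)]

def collect_subtree(head, parsed):
    # return the index of `head` plus every word depending on it, directly or transitively
    keep = {head}
    changed = True
    while changed:  # keep sweeping until no new dependents turn up
        changed = False
        for i, (_, parent) in enumerate(parsed):
            if parent in keep and i not in keep:
                keep.add(i)
                changed = True
    return sorted(keep)

root = next(i for i, (_, p) in enumerate(values) if p == -1)    # the verb
head = next(i for i, (_, p) in enumerate(values) if p == root)  # first word depending on it
print(' '.join(values[i][0] for i in collect_subtree(head, values)))  # My mom
Picking the first dependent of the verb as the subject head is of course a simplification; a real solution would check the part of speech, as my loop above tries to do with morph.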
Sorry it took so long to get back to you.
I finally figured it out. The code is at the bottom, but I'll explain how it works first:
When parse(values) is called, it iterates through the sentence in values and calls the recursive getDepth for each word. getDepth computes how many word relations lie between a given word and the verb.
E.g. for "The", the depth is 1, because it directly calls the verb.
For "King", it is 2, because "King" calls "The" and "The" calls the verb (where "calls" means "has a pointer to").
Once all depths are computed, parse finds the word with the highest depth ("France") and uses the recursive traceFrom() to string together the subject.
All you really care about is parse(), which takes the preparsed list like [('My', 1), ('mom', 2), ('wants', -1), ('to', 2), ('cook', 3)] and spits out the complete subject. It works for both examples, but you should test some more.
values = [('The', 4), ('King', 0), ('of', 1), ('France', 2), ('died', -1)]
#values = [('My', 1), ('mom', 2), ('wants', -1), ('to', 2), ('cook', 3)]
def getDepth(i, s, n=0):
    print('D is %d' % n)
    if s[i][1] == -1:
        return n
    else:
        return getDepth(s[i][1], s, n + 1)

def traceFrom(m, s, dt=[]):
    print('Got n:%d, s:' % m, s)
    if s[m][1] == -1:
        d = []
        for i in range(len(s)):
            if i in dt:
                d.append(s[i][0])
        return " ".join(d)
    else:
        dt.append(m)
        return traceFrom(s[m][1], s, dt)

def parse(sentence):
    d = []
    for i in range(len(sentence)):
        d.append(getDepth(i, sentence))
    m = d.index(max(d))
    print('Largest is ', d[m], ' of ', d)
    return traceFrom(m, sentence)

print('Subject :', parse(values))
Given your preparsed array, this is quite easy:
values = [('My', 1), ('mom', 2), ('wants', -1), ('to', 2), ('cook', 3)]
def parseFrom(n, data):
    if values[n][1] != -1:
        #print('Stepping (%s,%d)' % (values[n][0], values[n][1]))
        data.append(values[n][0])
        return parseFrom(values[n][1], data)
    else:
        #print('At verb')
        return data

subject = ' '.join(parseFrom(0, []))
print('Sentence has subject:', subject)
The function recurses while the current word is not the verb, and otherwise returns the subject as a list. Sorry if it doesn't work on all sentences.
I'm working in Python and I have to solve a simple task (at least it has a simple definition):
I have a set of names, where each name is a sequence of tokens: names_to_find = ['York', 'New York', 'Dustin']
I have a corpus, which is a list of sentences: corpus = [' I love New York but also York ', ' Dustin is my cat ', ' I live in New York with my friend Dustin ']
My desired output is a dictionary with names_to_find as keys and, for each occurrence in the corpus, a pair (sentence_index, word_index).
The desired output for the example is:
output = { 'York' : [(0, 3), (0, 6), (2, 4)], 'New York' : [(0, 2), (2, 2)], 'Dustin' : [(1, 0), (2, 8)]}
As you can see, if a name appears twice in the same sentence I want both occurrences, and for compound names (e.g. 'New York') I want the index of the first word.
The problem is that I have 1 million names_to_find and 4.8 million sentences in the corpus.
I wrote some code that does not scale, just to see whether the running time was acceptable (it was not): to find all names in 100,000 (100k) sentences my code needs 12 hours :'(
My question is twofold: I'm asking either for help with my code or for a largely different approach; it doesn't matter which, the only thing that matters is that the code scales.
Here is my (parallel) code. It finds only single words; compound names (e.g. 'New York') are handled by another function that checks whether the word indexes are contiguous (a sketch of that idea appears after the code):
def parallel_find(self, n_proc):
    """
    Takes the entities in self.entities_token_in_corpus and calls
    self.create_word_occurrence_index for each one.
    This method (and the ones it calls) is meant to run in parallel,
    so after the calls a reduce is applied.
    :param n_proc: the number of processes used for the computation
    """
    p = Pool(n_proc)
    print('start indexing')
    t = time.time()
    index_list = p.map(self.create_word_occurrence_index, self.entities_token_in_corpus)
    t = time.time() - t
    index_list_dict = {k: v for elem in index_list for k, v in elem.items() if v}
    p.close()
    return index_list_dict, n_proc, len(self.corpus), t

def create_word_occurrence_index(self, word):
    """
    Loops over the whole corpus, calls self.find_in_sentence to find the
    occurrences of word in each sentence, and returns a dict.
    :param word: the word to find
    :return: a dict with the structure {entity_name: [(row, [occurrences in row]), ...]}
    """
    key = word
    returning_list = []
    for row_index, sent in enumerate(self.joined_corpus):
        if sent.find(' ' + word + ' ') != -1:
            indices = self.find_in_sentence(word=word, sentence=sent)
            if indices:
                returning_list.append((row_index, indices))
    return {key: returning_list}

def find_in_sentence(self, word, sentence):
    """
    Returns the indexes at which the word appears in a sentence.
    :param word: the word to find
    :param sentence: the sentence in which to find the word
    :return: a list of indices
    """
    indices = [i for i, x in enumerate(sentence.split()) if x == word]
    return indices
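The contiguity check itself isn't shown above; roughly, it works on the single-word indexes like this (a minimal, hypothetical sketch, not my actual code -- the function name and the index format are made up for illustration):
def find_compound(compound, single_word_occurrences):
    # keep only occurrences of the first word that are followed by the remaining
    # words at contiguous positions in the same sentence
    words = compound.split()
    first, rest = words[0], words[1:]
    hits = []
    for sent_idx, word_idx in single_word_occurrences.get(first, []):
        if all((sent_idx, word_idx + offset) in single_word_occurrences.get(w, [])
               for offset, w in enumerate(rest, start=1)):
            hits.append((sent_idx, word_idx))
    return hits

# hypothetical example
occurrences = {'New': [(0, 2), (2, 3)], 'York': [(0, 3), (0, 6), (2, 4)]}
print(find_compound('New York', occurrences))  # [(0, 2), (2, 3)]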
Thanks in advance
Here's an attempt using generators, but I'm not sure how much better it will perform on a large target. The problematic part is the multi-word matches, but I tried to build in some short-circuiting and early-termination code (I think there is more to do on that, but the complexity starts building up there too):
def matcher(words, targets):
    for word in words:
        result = {word: []}  # empty dict to hold each word
        if len(word.split()) == 1:  # check to see if word is single
            for t, target in enumerate(targets):
                foo = target.split()
                bar = [(t, i) for i, x in enumerate(foo) if x == word]  # collect the indices
                if bar:
                    result[word].extend(bar)  # update the dict
            yield result  # returns a generator
        else:
            consecutive = word.split()
            end = len(consecutive)
            starter = consecutive[0]  # only look for first-word matches
            for t, target in enumerate(targets):
                foo = target.split()
                limit = len(foo)
                if foo.count(starter):  # skip the entire target if the 1st word is missing
                    indices = [i for i, x in enumerate(foo)
                               if (x == starter and limit - end > i)]  # don't try to match if the index is too high
                    bar = []
                    for i in indices:
                        if foo[i:i + end] == consecutive:  # do the match (expensive)
                            bar.append((t, i))
                    result[word].extend(bar)
                else:
                    continue
            yield result
If you want to collect everything in one go, then for this modified example
targets = [' I love New York but also York ',
           ' Dustin is my cat ',
           ' I live in New York with my friend Dustin ',
           ' New York State contains New York City aka New York']

values = ['York', 'New York', 'Dustin', 'New York State']
zed = matcher(values, targets)
print(list(zed))
Produces:
{'York': [(0, 3), (0, 6), (2, 4), (3, 1), (3, 5), (3, 9)]}
{'New York': [(0, 2), (2, 3), (3, 0), (3, 4)]}
{'Dustin': [(1, 0), (2, 8)]}
{'New York State': [(3, 0)]}
There might be ways to exploit concurrency here; I'm really not sure, not being too familiar with that as of yet. See https://realpython.com/async-io-python/ for example. Also, I didn't go over that code carefully for off-by-one errors... I think it's okay. You'd probably want some unit tests here.
I am trying to take the Spark word count example and aggregate word counts by some other value (for example, words and counts by person where person is "VI" or "MO" in the case below)
I have an RDD that is a list of tuples whose values are lists of tuples:
from operator import add
reduced_tokens = tokenized.reduceByKey(add)
reduced_tokens.take(2)
Which gives me:
[(u'VI', [(u'word1', 1), (u'word2', 1), (u'word3', 1)]),
(u'MO',
[(u'word4', 1),
(u'word4', 1),
(u'word5', 1),
(u'word8', 1),
(u'word10', 1),
(u'word1', 1),
(u'word4', 1),
(u'word6', 1),
(u'word9', 1),
...
)]
I want something like:
[
 ('VI',
  [(u'word1', 1), (u'word2', 1), (u'word3', 1)]),
 ('MO',
  [(u'word4', 58), (u'word8', 2), (u'word9', 23), ...])
]
Similar to the word count example here, I would like to be able to filter out words with a count below some threshold for some person. Thanks!
The keys that you're trying to reduce across are (name, word) pairs, not just names. So you need to do a .map step to fix up your data:
def key_by_name_word(record):
    name, (word, count) = record
    return (name, word), count

tokenized_by_name_word = tokenized.map(key_by_name_word)
counts_by_name_word = tokenized_by_name_word.reduceByKey(add)
This should give you
[
(('VI', 'word1'), 1),
(('VI', 'word2'), 1),
(('VI', 'word3'), 1),
(('MO', 'word4'), 58),
...
]
To get it into exactly the same format you mentioned, you can then do:
def key_by_name(record):
    # this is the inverse of key_by_name_word
    (name, word), count = record
    return name, (word, count)

output = counts_by_name_word.map(key_by_name).reduceByKey(add)
But it might actually be easier to work with the data in the flat format that counts_by_name_word is in.
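For example, the threshold filter you mention can be applied directly to that flat format; a minimal sketch (the threshold of 5 is just an example value):
# keep only (name, word) pairs whose count is at least 5; counts_by_name_word
# holds ((name, word), count) records in the flat format described above
frequent = counts_by_name_word.filter(lambda record: record[1] >= 5)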
For completeness, here is how I solved each part of the question:
Ask 1: Aggregate word counts by some key
import re
def restructure_data(name_and_freetext):
    name = name_and_freetext[0]
    tokens = re.sub('[&|/|\d{4}|\.|\,|\:|\-|\(|\)|\+|\$|\!]', ' ', name_and_freetext[1]).split()
    return [((name, token), 1) for token in tokens]
filtered_data = data.filter((data.flag==1)).select('name', 'item')
tokenized = filtered_data.rdd.flatMap(restructure_data)
Ask 2: Filter out words with a count below some threshold:
from operator import add
# keep words which have counts >= 5
counts_by_name_word = tokenized.reduceByKey(add).filter(lambda x: x[1] >= 5)
# map filtered word counts into a list by key so we can sort them
restruct = counts_by_name_word.map(lambda x: (x[0][0], [(x[0][1], x[1])]))
Bonus: Sort words from most frequent to least frequent
# sort the word counts from most frequent to least frequent words
output = restruct.reduceByKey(add).map(lambda x: (x[0], sorted(x[1], key=lambda y: y[1], reverse=True))).collect()
I am trying to get the sorted output from following program.
"""Count words."""
# TODO: Count the number of occurrences of each word in s
# TODO: Sort the occurrences in descending order (alphabetically in case of ties)
# TODO: Return the top n words as a list of tuples
from operator import itemgetter
def count_words(s, n):
    """Return the n most frequently occurring words in s."""
    t1 = []
    t2 = []
    temp = {}
    top_n = {}
    words = s.split()
    for word in words:
        if word not in temp:
            t1.append(word)
            temp[word] = 1
        else:
            temp[word] += 1
    top_n = sorted(temp.items(), key=itemgetter(1, 0), reverse=True)
    print top_n
    return
def test_run():
    """Test count_words() with some inputs."""
    count_words("cat bat mat cat bat cat", 3)
    count_words("betty bought a bit of butter but the butter was bitter", 3)

if __name__ == '__main__':
    test_run()
This program output is like:
[('cat', 3), ('bat', 2), ('mat', 1)]
[('butter', 2), ('was', 1), ('the', 1), ('of', 1), ('but', 1), ('bought', 1), ('bitter', 1), ('bit', 1), ('betty', 1), ('a', 1)]
but I need it in a form like:
[('cat', 3), ('bat', 2), ('mat', 1)]
[('butter', 2), ('a', 1),('betty', 1),('bit', 1),('bitter', 1) ... rest of them here]
Could you please let me know the best possible way?
You need to change the key function you're giving to sorted, since the items in your desired output need to be sorted in descending order by count but ascending order alphabetically. I'd use a lambda function:
top_n = sorted(temp.items(), key=lambda item: (-item[1], item[0]))
By negating the count, an ascending sort gets you the desired order.
You can change:
top_n=sorted(temp.items(), key=itemgetter(1,0),reverse=True)
To:
temp2 = sorted(temp.items(), key=itemgetter(0), reverse=False)
top_n = sorted(temp2, key=itemgetter(1), reverse=True)
and thanks to sort stability you will be good.
Instead of itemgetter, use lambda t:(-t[1],t[0]) and drop the reverse=True:
top_n=sorted(temp.items(), key=lambda t:(-t[1],t[0]))
This returns the same thing as itemgetter(1,0) only with the first value inverted so that higher numbers will be sorted before lower numbers.
def count_words(s, n):
    """Return the n most frequently occurring words in s."""
    t1 = []
    t2 = []
    temp = {}
    top_n = {}
    words = s.split()
    for word in words:
        if word not in temp:
            t1.append(word)
            temp[word] = 1
        else:
            temp[word] += 1
    top_n = sorted(temp.items(), key=lambda t: (t[1], t[0]), reverse=True)
    print top_n
    return

def test_run():
    """Test count_words() with some inputs."""
    count_words("cat bat mat cat bat cat", 3)
    count_words("betty bought a bit of butter but the butter was bitter", 3)

if __name__ == '__main__':
    test_run()
I used lambda instead of itemgetter and in other apps I've written, lambda seems to work.
I have two very large lists(that's why I used ... ), a list of lists:
x = [['I like stackoverflow. Hi ok!'],['this is a great community'],['Ok, I didn\'t like this!.'],...,['how to match and return the frequency?']]
and a list of strings:
y = ['hi', 'nice', 'ok',..., 'frequency']
I would like to return, in a new list, the number of times (count) that any word in y occurs in each of the lists of x. For example, for the lists above, this should be the correct output:
[(1,2),(2,0),(3,1),...,(n,count)]
That is, [(1, count), ..., (n, count)], where n is the number of the list and count is the number of times any word from y appeared in it. Any idea how to approach this?
First, you should preprocess x into a list of sets of lowercased words -- that will speed up the following lookups enormously. E.g:
import re

ppx = []
for subx in x:
    ppx.append(set(w.group(0).lower() for w in re.finditer(r'\w+', subx[0])))
(yes, you could collapse this into a list comprehension, but I'm aiming for some legibility).
Next, you loop over y, checking how many of the sets in ppx contain each item of y -- that would be
[sum(1 for s in ppx if w in s) for w in y]
That doesn't give you those redundant first items you crave, but enumerate to the rescue...:
list(enumerate((sum(1 for s in ppx if w in s) for w in y), 1))
should give exactly what you require.
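Putting the two steps together on your sample data, a minimal sketch (note that, as written, it yields one pair per word of y, i.e. the count of sub-lists of x containing that word):
import re

x = [['I like stackoverflow. Hi ok!'], ['this is a great community'],
     ["Ok, I didn't like this!."], ['how to match and return the frequency?']]
y = ['hi', 'nice', 'ok', 'frequency']

# preprocess x into a list of sets of lowercased words
ppx = [set(w.group(0).lower() for w in re.finditer(r'\w+', subx[0])) for subx in x]

# for each word in y, count how many of the sets contain it, numbering from 1
print(list(enumerate((sum(1 for s in ppx if w in s) for w in y), 1)))
# [(1, 1), (2, 0), (3, 2), (4, 1)]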
Here is a more readable solution. Check my comments in the code.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re

x = [['I like stackoverflow. Hi ok!'], ['this is a great community'],
     ['Ok, I didn\'t like this!.'], ['how to match and return the frequency?']]
y = ['hi', 'nice', 'ok', 'frequency']

assert len(x) == len(y), "you have to make sure length of x equals y's"

num = []
for i in xrange(len(y)):
    # lower all the strings in x for comparison
    # find all matched patterns in x, count them, and store the result in num
    num.append(len(re.findall(y[i], x[i][0].lower())))

res = []
# use enumerate to give output in the format you want
for k, v in enumerate(num):
    res.append((k, v))

# here is what you want
print res
OUTPUT:
[(0, 1), (1, 0), (2, 1), (3, 1)]
INPUT:
x = [['I like stackoverflow. Hi ok!'],['this is a great community'],
['Ok, I didn\'t like this!.'],['how to match and return the frequency?']]
y = ['hi', 'nice', 'ok', 'frequency']
CODE:
import re

s1 = set(y)
index = 0
result = []
for itr in x:
    # remove special chars and convert to lower case
    itr = re.sub('[!.?]', '', itr[0].lower()).split(' ')
    s2 = set(itr)
    # find the intersection of common strings
    intersection = s1 & s2
    num = len(intersection)
    result.append((index, num))
    index = index + 1
OUTPUT:
result = [(0, 2), (1, 0), (2, 1), (3, 1)]
You could also do it like this.
>>> import re
>>> x = [['I like stackoverflow. Hi ok!'],['this is a great community'],['Ok, I didn\'t like this!.'],['how to match and return the frequency?']]
>>> y = ['hi', 'nice', 'ok', 'frequency']
>>> l = []
>>> for i, j in enumerate(x):
        c = 0
        for word in y:
            if re.search(r'(?i)\b' + word + r'\b', j[0]):
                c += 1
        l.append((i+1, c))

>>> l
[(1, 2), (2, 0), (3, 1), (4, 1)]
(?i) does a case-insensitive match. \b is a word boundary, which matches between a word character and a non-word character.
Maybe you could concatenate the strings in x to make the computation easy:
w = ' '.join(i[0] for i in x)
Now w is a long string like this:
>>> w
"I like stackoverflow. Hi ok! this is a great community Ok, I didn't like this!. how to match and return the frequency?"
With this conversion, you can simply do this:
>>> l = []
>>> for i in range(len(y)):
        l.append((i+1, w.count(str(y[i]))))
which gives you:
>>> l
[(1, 2), (2, 0), (3, 1), (4, 0), (5, 1)]
You can make a dictionary whose keys are the items of the y list. Then loop through the words of each nested list in x, look them up in the dictionary, and update the counts whenever you encounter one of them; something like the sketch below.
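A minimal sketch of that idea, assuming the sample x and y from the question (the words of y are already lowercase and a simple regex does the tokenizing):
import re

x = [['I like stackoverflow. Hi ok!'], ['this is a great community'],
     ["Ok, I didn't like this!."], ['how to match and return the frequency?']]
y = ['hi', 'nice', 'ok', 'frequency']

counts = {word: 0 for word in y}  # dictionary keyed by the words of y

result = []
for n, sub in enumerate(x, 1):
    hits = 0
    for token in re.findall(r'\w+', sub[0].lower()):
        if token in counts:        # O(1) lookup in the dictionary
            counts[token] += 1
            hits += 1
    result.append((n, hits))

print(result)  # [(1, 2), (2, 0), (3, 1), (4, 1)]
print(counts)  # {'hi': 1, 'nice': 0, 'ok': 2, 'frequency': 1}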
I am using Python 3 and I want to write a function that takes a string of all capital letters, so suppose s = 'VENEER', and gives me the following output '614235'.
The function I have so far is:
def key2(s):
    new = ''
    for ch in s:
        acc = 0
        for temp in s:
            if temp <= ch:
                acc += 1
        new += str(acc)
    return new
If s == 'VENEER' then new == '634335'. If s contains no duplicates, the code works perfectly.
I am stuck on how to edit the code to get the output stated in the beginning.
Note that the built-in method for replacing characters within a string, str.replace, takes an optional third argument, count. You can use this to your advantage, replacing only the first appearance of each letter (obviously, once you replace the first 'E', the second one becomes the first appearance, and so on):
def process(s):
    for i, c in enumerate(sorted(s), 1):
        ## print s  # uncomment to see the process
        s = s.replace(c, str(i), 1)
    return s
I have used the built-in functions sorted and enumerate to get the appropriate numbers to replace the characters:
1 2 3 4 5 6 # 'enumerate' from 1 -> 'i'
E E E N R V # 'sorted' input 's' -> 'c'
Example usage:
>>> process("VENEER")
'614235'
One way would be to use numpy.argsort to find the order, then find the ranks, and join them:
>>> s = 'VENEER'
>>> order = np.argsort(list(s))
>>> rank = np.argsort(order) + 1
>>> ''.join(map(str, rank))
'614235'
You can use a regex:
import re

s = "VENEER"
for n, c in enumerate(sorted(s), 1):
    s = re.sub('%c' % c, '%i' % n, s, count=1)

print s
# 614235
You can also use several nested generators:
def indexes(seq):
    for v, i in sorted((v, i) for (i, v) in enumerate(seq)):
        yield i

print ''.join('%i' % (e+1) for e in indexes(indexes(s)))
# 614235
From your title, you may want to do something like this?
>>> from collections import OrderedDict
>>> s='VENEER'
>>> d = {k: n for n, k in enumerate(OrderedDict.fromkeys(sorted(s)), 1)}
>>> "".join(map(lambda k: str(d[k]), s))
'412113'
As #jonrsharpe commented I didn't need to use OrderedDict.
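A minimal sketch of the same idea with a plain dict (sorted() already orders the distinct letters, so OrderedDict isn't needed):
>>> s = 'VENEER'
>>> d = {k: n for n, k in enumerate(sorted(set(s)), 1)}
>>> "".join(str(d[k]) for k in s)
'412113'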
def caps_to_nums(in_string):
    indexed_replaced_string = [(idx, val) for val, (idx, ch) in enumerate(sorted(enumerate(in_string), key=lambda x: x[1]), 1)]
    return ''.join(map(lambda x: str(x[1]), sorted(indexed_replaced_string)))
First we run enumerate to be able to save the natural sort order
enumerate("VENEER") -> [(0, 'V'), (1, 'E'), (2, 'N'), (3, 'E'), (4, 'E'), (5, 'R')]
# this gives us somewhere to RETURN to later.
Then we sort that according to its second element, which is alphabetical, and run enumerate again with a start value of 1 to get the replacement value. We throw away the alpha value, since it's not needed anymore.
[(idx, val) for val, (idx, ch) in enumerate(sorted([(0, 'V'), (1, 'E'), ...], key = lambda x: x[1]), start=1)]
# [(1, 1), (3, 2), (4, 3), (2, 4), (5, 5), (0, 6)]
Then map the second element (our value) sorting by the first element (the original index)
map(lambda x: str(x[1]), sorted(replacement_values))
and str.join it
''.join(that_mapping)
Ta-da!