Most common n words in a text - python

I am currently learning to work with NLP. One of the problems I am facing is finding the n most common words in a text. Consider the following:
text=['Lion Monkey Elephant Weed','Tiger Elephant Lion Water Grass','Lion Weed Markov Elephant Monkey Fine','Guard Elephant Weed Fortune Wolf']
Suppose n = 2. I am not looking for the most common bigrams. I am searching for the pairs of words that occur together most often in the text. For example, the output for the above should give:
'Lion' & 'Elephant': 3
'Elephant' & 'Weed': 3
'Lion' & 'Monkey': 2
'Elephant' & 'Monkey': 2
and so on.
Could anyone suggest a suitable way to tackle this?

I would suggest using Counter and combinations as follows.
from collections import Counter
from itertools import combinations, chain

text = ['Lion Monkey Elephant Weed', 'Tiger Elephant Lion Water Grass', 'Lion Weed Markov Elephant Monkey Fine', 'Guard Elephant Weed Fortune Wolf']

def count_combinations(text, n_words, n_most_common=None):
    count = []
    for t in text:
        words = t.split()
        # all unordered n_words-sized combinations within this sentence
        combos = combinations(words, n_words)
        count.append([" & ".join(sorted(c)) for c in combos])
    # flatten the per-sentence lists, count, and keep the most common entries
    return dict(Counter(sorted(list(chain(*count)))).most_common(n_most_common))

count_combinations(text, 2)
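For the sample text above, asking for only the top four pairs should return something along these lines (pairs with equal counts may come back in a different order):

count_combinations(text, 2, 4)
# {'Elephant & Lion': 3, 'Elephant & Weed': 3, 'Elephant & Monkey': 2, 'Lion & Monkey': 2}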

It was tricky, but I solved it for you. I used the space count to detect how many words each element contains: an element with three words must contain two spaces, so only elements with exactly one space (i.e. two words) are printed.
l = ["hello world", "good night world", "good morning sunshine", "wassap babe"]
for elem in l:
if elem.count(" ") == 1:
print(elem)
output
hello world
wassap babe

Related

Suitable data structure for fast search on sets. Input: tags, output: sentence

I have the following problem.
I get 1-10 tags related to an image, each with a probability of appearing in the image.
inputs: beach, woman, dog, tree ...
I would like to retrieve from a database an already composed sentence which is most related to the tags.
e.g:
beach -> "fun at the beach" / "chilling on the beach" ....
beach, woman -> "woman at the beach"
beach, woman, dog - > none found!
take the closest existing match, but consider the probabilities.
Let's say: woman 0.95, beach 0.85, dog 0.7
So if it exists, take woman+beach (0.95, 0.85), then woman+dog, and last beach+dog; the order means higher probabilities are better, but we are not summing them.
I thought of using Python sets, but I am not really sure how.
Another option would be a defaultdict:
db['beach']['woman']['dog'], but I want to get the same result also from:
db['woman']['beach']['dog']
I would like to get a nice solution.
Thanks.
EDIT: Working solution
from collections import OrderedDict

list_of_keys = []
sentences = OrderedDict()
sentences[('dogs',)] = ['I like dogs','dogs are man best friends!']
sentences[('dogs', 'beach')] = ['the dog is at the beach']
sentences[('woman', 'cafe')] = ['The woman sat at the cafe.']
sentences[('woman', 'beach')] = ['The woman was at the beach']
sentences[('dress',)] = ['hi nice dress', 'what a nice dress !']

def keys_to_list_of_sets(dict_):
    list_of_keys = []
    for key in dict_:
        list_of_keys.append(set(key))
    return list_of_keys

def match_best_sentence(image_tags):
    for i, tags in enumerate(list_of_keys):
        if (tags & image_tags) == tags:
            print(list(sentences.keys())[i])

list_of_keys = keys_to_list_of_sets(sentences)
tags = set(['beach', 'dogs', 'woman'])
match_best_sentence(tags)
results:
('dogs',)
('dogs', 'beach')
('woman', 'beach')
This solution runs over all the keys of an ordered dictionary, which is O(n); I would like to see any performance improvement.
What seems to be the simplest way of doing this without using DBs would be to keep sets for each word and take intersections.
More explicitly:
If a sentence contains the word "woman" then you put it into the "woman" set. Similarly for dog and beach etc. for each sentence. This means your space complexity is O(sentences*average_tags) as each sentence is repeated in the data structure.
You may have:
>>> dogs = set(["I like dogs", "the dog is at the beach"])
>>> woman = set(["The woman sat at the cafe.", "The woman was at the beach"])
>>> beach = set(["the dog is at the beach", "The woman was at the beach", "I do not like the beach"])
>>> dogs.intersection(beach)
{'the dog is at the beach'}
You can build this into an object on top of a defaultdict, so that you can take a list of tags, intersect only those sets, and return the results.
Rough implementation idea:
from collections import defaultdict

class myObj(object): #python2
    def __init__(self):
        self.sets = defaultdict(lambda: set())

    def add_sentence(self, sentence, tags):
        #how you process tags is up to you, they could also be parsed from
        #the input string.
        for t in tags:
            self.sets[t].add(sentence)

    def get_match(self, tags):
        result = self.sets[tags[0]] #start from the first tag's set
        for t in tags[1:]:
            result = result.intersection(self.sets[t])
        return result #this function can stand to be improved but the idea is there
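A quick usage sketch of the class above (the sentences and tag lists here are made up for illustration):

obj = myObj()
obj.add_sentence("the dog is at the beach", ["dogs", "beach"])
obj.add_sentence("The woman was at the beach", ["woman", "beach"])
print(obj.get_match(["beach", "dogs"])) # set(['the dog is at the beach'])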
Maybe this will make it more clear how the default dict and sets will end up looking in the object.
>>> a = defaultdict(lambda: set())
>>> a['woman']
set([])
>>> a['woman'].add(1)
>>> str(a)
"defaultdict(<function <lambda> at 0x7fcb3bbf4b90>, {'woman': set([1])})"
>>> a['beach'].update([1,2,3,4])
>>> a['woman'].intersection(a['beach'])
set([1])
>>> str(a)
"defaultdict(<function <lambda> at 0x7fcb3bbf4b90>, {'woman': set([1]), 'beach': set([1, 2, 3, 4])})"
It mainly depends on the size of the database and on the number of combinations between keywords. It also depends on which operation you do most.
If it's small and you need a fast find operation, a possibility is to use a dictionary whose keys are frozensets containing the tags and whose values are lists of all the associated sentences.
For instance,
from collections import defaultdict

d = defaultdict(list)
# preprocessing
d[frozenset(["bob","car","red"])].append("Bob owns a red car")
# searching
d[frozenset(["bob","car","red"])] #['Bob owns a red car']
d[frozenset(["red","car","bob"])] #['Bob owns a red car']
For combinations of words like "bob", "car" you have different possibilities, depending on the number of keywords and on what matters more. For example:
you could add an additional entry for each combination
you could iterate over the keys and check which ones contain both "car" and "bob" (a sketch of this follows below)
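A minimal sketch of the second option, reusing the frozenset-keyed dictionary d from above and treating the query tags as a set:

query = {"bob", "car"}
# keep the sentences of every key that contains all the query tags
matches = [s for key, sents in d.items() if query <= key for s in sents]
print(matches) # ['Bob owns a red car']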

Replacing text in tags

I have been having problems trying to find a way to replace tags in my strings in Python.
What I have at the moment is the text:
you should buy a {{cat_breed + dog_breed}} or a {{cat_breed + dog_breed}}
Where cat_breed and dog_breed are lists of cat and dog breeds.
What I want to end up with is:
you should buy a Scottish short hair or a golden retriever
I want the tag to be replaced by a random entry in one of the two lists.
I have been looking at re.sub(), but I do not know how to do the replacement without ending up with the same result in both tags.
Use random.sample to get two unique elements from the population.
import random
cats = 'lazy cat', 'cuddly cat', 'angry cat'
dogs = 'dirty dog', 'happy dog', 'shaggy dog'
print("you should buy a {} or a {}".format(*random.sample(dogs + cats, 2)))
There's no reason to use regular expressions here. Just use string.format instead.
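That said, if you do want to keep the {{...}} tags in the template and substitute them in place, a minimal sketch using re.sub with a replacement function could look like this (the breed lists are placeholders; each match triggers its own random.choice, so the two tags can come out different):

import random
import re

cat_breed = ['Scottish short hair', 'Siamese']
dog_breed = ['golden retriever', 'beagle']
text = "you should buy a {{cat_breed + dog_breed}} or a {{cat_breed + dog_breed}}"
# every {{...}} tag is replaced independently by the lambda
print(re.sub(r'\{\{.*?\}\}', lambda m: random.choice(cat_breed + dog_breed), text))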
I hope the snippet below gives you some idea of how to complete your task:

import random

list1 = ['cat_breed1', 'cat_breed2']
list2 = ['dog_breed1', 'dog_breed2']
a = random.choice(list1)
b = random.choice(list2)
sentence = "you should buy a %s or a %s" % (a, b)
print(sentence)

Python, find words from array in string

I just want to ask how I can find words from an array in my string.
I need to make a filter that will find the words I saved in my array in the text that the user types into a text window on my web page.
I need to have 30+ words in an array or list or something.
Then the user types text into a text box.
Then the script should find all the words.
Something like a spam filter, I guess.
Thanks
import re
words = ['word1', 'word2', 'word4']
s = 'Word1 qwerty word2, word3 word44'
r = re.compile('|'.join([r'\b%s\b' % w for w in words]), flags=re.I)
r.findall(s)
>> ['Word1', 'word2']
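If you only need a yes/no check for the spam-filter case, the same compiled pattern can be reused, for example:

if r.search(s):
    print('text contains at least one filtered word')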
Solution 1 uses the regex approach, which returns all instances of each keyword found in the data. Solution 2 returns the indexes of all instances of each keyword found in the data.
import re
dataString = '''Life morning don't were in multiply yielding multiply gathered from it. She'd of evening kind creature lesser years us every, without Abundantly fly land there there sixth creature it. All form every for a signs without very grass. Behold our bring can't one So itself fill bring together their rule from, let, given winged our. Creepeth Sixth earth saying also unto to his kind midst of. Living male without for fruitful earth open fruit for. Lesser beast replenish evening gathering.
Behold own, don't place, winged. After said without of divide female signs blessed subdue wherein all were meat shall that living his tree morning cattle divide cattle creeping rule morning. Light he which he sea from fill. Of shall shall. Creature blessed.
Our. Days under form stars so over shall which seed doesn't lesser rule waters. Saying whose. Seasons, place may brought over. All she'd thing male Stars their won't firmament above make earth to blessed set man shall two it abundantly in bring living green creepeth all air make stars under for let a great divided Void Wherein night light image fish one. Fowl, thing. Moved fruit i fill saw likeness seas Tree won't Don't moving days seed darkness.
'''
keyWords = ['Life', 'stars', 'seed', 'rule']
#---------------------- SOLUTION 1
print 'Solution 1 output:'
for keyWord in keyWords:
    print re.findall(keyWord, dataString)
#---------------------- SOLUTION 2
print '\nSolution 2 output:'
for keyWord in keyWords:
    index = 0
    indexes = []
    indexFound = 0
    while indexFound != -1:
        indexFound = dataString.find(keyWord, index)
        if indexFound not in indexes:
            indexes.append(indexFound)
        index += 1
    indexes.pop(-1) # drop the trailing -1 recorded when find() stops matching
    print indexes
Output:
Solution 1 output:
['Life']
['stars', 'stars']
['seed', 'seed']
['rule', 'rule', 'rule']
Solution 2 output:
[0]
[765, 1024]
[791, 1180]
[295, 663, 811]
Try
words = ['word1', 'word2', 'word4']
s = 'word1 qwerty word2, word3 word44'
s1 = s.split(" ")
i = 0
for x in s1:
    # strip trailing punctuation so 'word2,' still matches 'word2'
    if x.strip(",.") in words:
        print x.strip(",.")
        i += 1
print "count is " + str(i)
output
word1
word2
count is 2

Count number of times each word has repeated in a string?

For example if I had a string without any punctuation:
"She walked the dog to the park and played ball with the dog When she threw the ball to the dog the dog missed the ball and ran to the other side of the park to fetch it"
I know how to do it by converting the string to uppercase/lowercase and using the function
from collections import Counter
but I can't think of any other way to count without using built-in functions (this includes setdefault, get, sorted, etc.)
It should come out in a key:value format. Any ideas?
Forget about libraries and "fast" ways of doing it, use simpler logic:
Start by splitting your string using stringName.split(). This returns an array of words. Now create an empty dictionary. Then iterate through the array and do one of two things: if the word exists in the dictionary, increment its count by 1; otherwise, create the key-value pair with the word as the key and 1 as the value.
At the end, you'll have a count of words.
The code:
testString = "She walked the dog to the park and played ball with the dog When she threw the ball to the dog the dog missed the ball and ran to the other side of the park to fetch it"
dic = {}
words = testString.split()
for raw_word in words:
word = raw_word.lower()
if word in dic:
dic[word] += 1
else:
dic[word] = 1
print dic
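For the sample sentence this should print a dictionary along these lines (key order may differ):

# {'she': 2, 'walked': 1, 'the': 9, 'dog': 4, 'to': 4, 'park': 2, 'and': 2, 'ball': 3, ...}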

Counting occurrences of multiple strings in another string

In Python 2.7, given this string:
Spot is a brown dog. Spot has brown hair. The hair of Spot is brown.
what would be the best way to find the total number of occurrences of "Spot", "brown", and "hair" in the string? In the example, it would return 8.
I'm looking for something like string.count("Spot","brown","hair") but that works with the "strings to be found" in a tuple or list.
Thanks!
This does what you asked for, but notice that it will also count words like "hairy", "browner" etc.
>>> s = "Spot is a brown dog. Spot has brown hair. The hair of Spot is brown."
>>> sum(s.count(x) for x in ("Spot", "brown", "hair"))
8
You can also write it as a map
>>> sum(map(s.count, ("Spot", "brown", "hair")))
8
A more robust solution might use the nltk package
>>> import nltk # Natural Language Toolkit
>>> from collections import Counter
>>> sum(x in {"Spot", "brown", "hair"} for x in nltk.wordpunct_tokenize(s))
8
I might use a Counter:
s = 'Spot is a brown dog. Spot has brown hair. The hair of Spot is brown.'
words_we_want = ("Spot","brown","hair")
from collections import Counter
data = Counter(s.split())
print (sum(data[word] for word in words_we_want))
Note that this will under-count by 1 since 'brown.' and 'brown' are separate Counter entries.
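One way to avoid that under-count, if only trailing punctuation is the issue, is to strip it before counting (reusing s, words_we_want and Counter from above):

data = Counter(w.strip('.,') for w in s.split())
print(sum(data[word] for word in words_we_want)) # 8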
A slightly less elegant solution that doesn't trip up on punctuation uses a regex:
>>> import re
>>> len(re.findall('Spot|brown|hair','Spot is a brown dog. Spot has brown hair. The hair of Spot is brown.'))
8
You can create the regex from a tuple simply by
'|'.join(re.escape(x) for x in words_we_want)
The nice thing about these solutions is that they have much better algorithmic complexity than the solution by gnibbler. Of course, which one actually performs better on real-world data still needs to be measured by the OP (since the OP is the only one with the real-world data).
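Putting the two regex pieces together into a self-contained snippet:

import re

s = 'Spot is a brown dog. Spot has brown hair. The hair of Spot is brown.'
words_we_want = ("Spot", "brown", "hair")
# build the alternation pattern from the tuple, escaping each word
pattern = '|'.join(re.escape(x) for x in words_we_want)
print(len(re.findall(pattern, s))) # 8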
