In Python 2.7, given this string:
Spot is a brown dog. Spot has brown hair. The hair of Spot is brown.
what would be the best way to find the total number of occurrences of "Spot", "brown", and "hair" in the string? In this example, it would return 8.
I'm looking for something like string.count("Spot", "brown", "hair"), but something that works with the "strings to be found" in a tuple or list.
Thanks!
This does what you asked for, but notice that it will also count words like "hairy", "browner" etc.
>>> s = "Spot is a brown dog. Spot has brown hair. The hair of Spot is brown."
>>> sum(s.count(x) for x in ("Spot", "brown", "hair"))
8
You can also write it as a map
>>> sum(map(s.count, ("Spot", "brown", "hair")))
8
A more robust solution might use the nltk package
>>> import nltk # Natural Language Toolkit
>>> from collections import Counter
>>> sum(x in {"Spot", "brown", "hair"} for x in nltk.wordpunct_tokenize(s))
8
I might use a Counter:
s = 'Spot is a brown dog. Spot has brown hair. The hair of Spot is brown.'
words_we_want = ("Spot","brown","hair")
from collections import Counter
data = Counter(s.split())
print (sum(data[word] for word in words_we_want))
Note that this will under-count by 2, since 'brown.' and 'hair.' (with the trailing periods) are separate Counter entries from 'brown' and 'hair'.
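If you want the Counter approach to reach 8 as well, one option (my own sketch, not part of the original answer) is to strip punctuation from each token before counting:
from collections import Counter
import string

s = 'Spot is a brown dog. Spot has brown hair. The hair of Spot is brown.'
words_we_want = ("Spot", "brown", "hair")

# strip leading/trailing punctuation so 'brown.' and 'brown' count as the same word
data = Counter(word.strip(string.punctuation) for word in s.split())
print(sum(data[word] for word in words_we_want))  # 8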
A slightly less elegant solution that doesn't trip up on punctuation uses a regex:
>>> import re
>>> len(re.findall('Spot|brown|hair','Spot is a brown dog. Spot has brown hair. The hair of Spot is brown.'))
8
You can create the regex from a tuple simply by
'|'.join(re.escape(x) for x in words_we_want)
The nice thing about these solutions is that they have much better algorithmic complexity than gnibbler's solution (a single pass over the string instead of one pass per search word). Of course, which one actually performs better on real-world data still needs to be measured by the OP, since only the OP has that data.
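If you only want whole-word matches (so "hairy" or "browner" are not counted), a possible variation (my own sketch, not part of the answers above) is to wrap the alternation in \b word boundaries:
import re

s = "Spot is a brown dog. Spot has brown hair. The hair of Spot is brown."
words_we_want = ("Spot", "brown", "hair")

# \b anchors each alternative at word boundaries, so "hairy" and "browner" do not match
pattern = r'\b(?:' + '|'.join(re.escape(x) for x in words_we_want) + r')\b'
print(len(re.findall(pattern, s)))  # 8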
Related
I am currently learning to work with NLP. One of the problems I am facing is finding most common n words in text. Consider the following:
text=['Lion Monkey Elephant Weed','Tiger Elephant Lion Water Grass','Lion Weed Markov Elephant Monkey Fine','Guard Elephant Weed Fortune Wolf']
Suppose n = 2. I am not looking for the most common bigrams. I am searching for pairs of words that occur together in the same entry most often in the text. The output for the above should give:
'Lion' & 'Elephant': 3
'Elephant' & 'Weed': 3
'Lion' & 'Monkey': 2
'Elephant' & 'Monkey': 2
and such..
Could anyone suggest a suitable way to tackle this?
I would suggest using Counter and combinations as follows.
from collections import Counter
from itertools import combinations, chain
text = ['Lion Monkey Elephant Weed', 'Tiger Elephant Lion Water Grass', 'Lion Weed Markov Elephant Monkey Fine', 'Guard Elephant Weed Fortune Wolf']
def count_combinations(text, n_words, n_most_common=None):
    count = []
    for t in text:
        words = t.split()
        combos = combinations(words, n_words)
        count.append([" & ".join(sorted(c)) for c in combos])
    return dict(Counter(sorted(list(chain(*count)))).most_common(n_most_common))
count_combinations(text, 2)
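For example, limiting the result with the n_most_common parameter should reproduce the counts asked for above (a usage sketch; the exact ordering among tied pairs may differ):
top_pairs = count_combinations(text, 2, n_most_common=4)
print(top_pairs)
# e.g. {'Elephant & Lion': 3, 'Elephant & Weed': 3, 'Elephant & Monkey': 2, 'Lion & Monkey': 2}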
It was tricky, but I solved it for you: I used the number of spaces to detect whether an element contains more than two words. If an element has three words, it must contain two spaces, so only elements with exactly two words (one space) are printed.
l = ["hello world", "good night world", "good morning sunshine", "wassap babe"]
for elem in l:
    if elem.count(" ") == 1:
        print(elem)
output
hello world
wassap babe
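A slightly more robust variant (my own sketch, not the answer above) splits on whitespace instead, so double spaces or leading/trailing spaces do not throw the count off:
l = ["hello world", "good night world", "good morning sunshine", "wassap babe"]

# keep the elements that split into exactly two words
two_word_elems = [elem for elem in l if len(elem.split()) == 2]
print(two_word_elems)  # ['hello world', 'wassap babe']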
Sorry if it's a simple question; I'm new to Python. I have a string (split into an array of words) and a two-dimensional array of words that I'm going to use to replace them one by one, something like the following:
str="Jim is a good person"
# and will convert to:
parts=['Jim','is','a','good','person']
and a two-dimensional array in which each row is a list of words that can replace the element with the same index in parts, for example something like this:
replacement = [['john', 'Nock', 'Kati'],
               ['were', 'was', 'are'],
               ['a', 'an'],
               ['bad', 'perfect', 'awesome'],
               ['cat', 'human', 'dog']]
result can be something like this:
1: john is a good person
2: John are an bad human
3: Kati were a perfect cat
and so on
Actually, I'm going to replace each word of a sentence with some possible words and then do some calculation on the new sentence. I need to generate all possible replacements.
Many thanks.
itertools.product might be the best choice for creating all of the combinations that you're looking for.
Let's use your replacement list as a starting point for what could work. A way to get all the combinations you're looking for could look something like this
from itertools import product
word_options = [['john', 'Nock', 'Kati'],
                ['were', 'was', 'are'],
                ['a', 'an'],
                ['bad', 'perfect', 'awesome'],
                ['cat', 'human', 'dog']]
for option in product(*word_options):
    new_sentence = ' '.join(option)
    # do calculation on new_sentence
Each option that is being iterated through is a tuple, where each element is a single choice from each of the individual sub-lists of the original 2D list. Then the ' '.join(option) will combine the individual strings into a single string where the words are separated by a space. If you were to just print new_sentence, the output would look as follows.
john were a bad cat
john were a bad human
john were a bad dog
john were a perfect cat
john were a perfect human
john were a perfect dog
.
.
.
Kati are an perfect cat
Kati are an perfect human
Kati are an perfect dog
Kati are an awesome cat
Kati are an awesome human
Kati are an awesome dog
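If the original words from parts should also remain allowed in each position (an assumption on my part; the question is not explicit about this), you could prepend them to each sub-list before taking the product:
from itertools import product

parts = ['Jim', 'is', 'a', 'good', 'person']
word_options = [['john', 'Nock', 'Kati'],
                ['were', 'was', 'are'],
                ['a', 'an'],
                ['bad', 'perfect', 'awesome'],
                ['cat', 'human', 'dog']]

# prepend the original word so each slot can also stay unchanged
options_with_original = [[original] + alternatives
                         for original, alternatives in zip(parts, word_options)]

for option in product(*options_with_original):
    new_sentence = ' '.join(option)
    # do calculation on new_sentence
Note that if a replacement list already contains the original word (like 'a' in the third slot), duplicate sentences can appear; they can be filtered with a set if that matters.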
Write code to count the number of characters in original_str using the accumulation pattern and assign the answer to a variable num_chars. Do NOT use the len function to solve the problem (if you use it while you are working on this problem, comment it out afterward!)
original_str = "The quick brown rhino jumped over the extremely lazy fox."
num_chars = len(original_str)
print(len(original_str))
for i in original_str:
    print(len(i))
The computer tells me this is correct, but it doesn't answer the question: I must count the characters without using len.
If you cannot use the len() function, you could write a function like num_characters below: it uses a for loop to iterate over the characters in the passed-in string, increments a total variable for each character, and returns that total. I think that is what you mean by an accumulator, right?
def num_characters(string):
    total = 0
    for character in string:
        total += 1
    return total
original_string = "The quick brown rhino jumped over the extremely lazy fox."
print(f"The numbers of characters in the original string using `len` is {len(original_string)}.")
print(f"The numbers of characters in the original string using `num_characters` is {num_characters(original_string)}.")
Output:
The number of characters in the original string using `len` is 57.
The number of characters in the original string using `num_characters` is 57.
With the accumulator pattern, you have a variable, and you add to it when something happens. Here you can make that "something" be "seeing the next character".
So, write a loop that steps through each character in the string, and each time through the loop add one to a variable that starts at zero.
original_str = "The quick brown rhino jumped over the extremely lazy fox."
num_chars = original_str.count('') - 1
print (num_chars)
original_str = "The quick brown rhino jumped over the extremely lazy fox."
count = 0
for w in original_str:
    count = count + 1
num_chars = count
print(num_chars)
original_str = "The quick brown rhino jumped over the extremely lazy fox."
for num_chars in range(0, 58):
    print(original_str.count)
You could write a loop that goes through each character in the string, and each time you go through that loop, add one to the accumulator variable. For this, you need to set the accumulator variable to zero before executing the loop.
original_str = "The quick brown rhino jumped over the extremely lazy fox."
num_chars = 0
for achar in original_str:
    num_chars = num_chars + 1
print (num_chars)
original_str = "The quick brown rhino jumped over the extremely lazy fox."
num_chars=0
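# note: the check below skips spaces, so this prints 48 rather than the full 57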
for i in original_str:
    if i != " ":
        num_chars += 1
print(num_chars)
I have the following problem.
I get 1-10 tags related to an image, each with a probability of appearing in the image.
inputs: beach, woman, dog, tree ...
I would like to retrieve from database an already composed sentence which is most related to the tags.
e.g:
beach -> "fun at the beach" / "chilling on the beach" ....
beach, woman -> "woman at the beach"
beach, woman, dog -> none found!
If no exact match exists, take the closest existing combination, taking the probabilities into account.
Say the probabilities are: woman 0.95, beach 0.85, dog 0.7.
Then first try woman+beach (0.95, 0.85), then woman+dog, and last beach+dog; higher probabilities come first, but the probabilities are not summed.
I thought of using python sets but I am not really sure how.
Another option would be a defaultdict:
db['beach']['woman']['dog'], but I want to get the same result also from:
db['woman']['beach']['dog']
I would like to get a nice solution.
Thanks.
EDIT: Working solution
from collections import OrderedDict
list_of_keys = []
sentences = OrderedDict()
sentences[('dogs',)] = ['I like dogs','dogs are man best friends!']
sentences[('dogs', 'beach')] = ['the dog is at the beach']
sentences[('woman', 'cafe')] = ['The woman sat at the cafe.']
sentences[('woman', 'beach')] = ['The woman was at the beach']
sentences[('dress',)] = ['hi nice dress', 'what a nice dress !']
def keys_to_list_of_sets(dict_):
    list_of_keys = []
    for key in dict_:
        list_of_keys.append(set(key))
    return list_of_keys

def match_best_sentence(image_tags):
    for i, tags in enumerate(list_of_keys):
        if (tags & image_tags) == tags:
            print(list(sentences.keys())[i])

list_of_keys = keys_to_list_of_sets(sentences)
tags = set(['beach', 'dogs', 'woman'])
match_best_sentence(tags)
results:
('dogs',)
('dogs', 'beach')
('woman', 'beach')
This solution runs over all keys of an ordered dictionary, O(n); I would like to see any performance improvement.
The simplest way of doing this without using a DB seems to be to keep a set of sentences for each word and take intersections.
More explicitly:
If a sentence contains the word "woman" then you put it into the "woman" set. Similarly for dog and beach etc. for each sentence. This means your space complexity is O(sentences*average_tags) as each sentence is repeated in the data structure.
You may have:
>>> dogs = set(["I like dogs", "the dog is at the beach"])
>>> woman = set(["The woman sat at the cafe.", "The woman was at the beach"])
>>> beach = set(["the dog is at the beach", "The woman was at the beach", "I do not like the beach"])
>>> dogs.intersection(beach)
{'the dog is at the beach'}
You can build this into an object on top of a defaultdict, so that you can take a list of tags, intersect only those sets, and return the results.
Rough implementation idea:
from collections import defaultdict

class myObj(object):  # python2
    def __init__(self):
        self.sets = defaultdict(lambda: set())

    def add_sentence(self, sentence, tags):
        # how you process tags is up to you; they could also be parsed from
        # the input string.
        for t in tags:
            self.sets[t].add(sentence)

    def get_match(self, tags):
        result = self.sets[tags[0]]  # seed with the first tag's set
        for t in tags[1:]:
            result = result.intersection(self.sets[t])
        return result  # this function can stand to be improved, but the idea is there
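A quick usage sketch of the class above (the sentences and tags are just made-up examples):
obj = myObj()
obj.add_sentence("the dog is at the beach", ["dogs", "beach"])
obj.add_sentence("The woman was at the beach", ["woman", "beach"])
print(obj.get_match(["beach", "dogs"]))  # set(['the dog is at the beach'])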
Maybe this will make it clearer how the defaultdict and sets will end up looking inside the object.
>>> a = defaultdict(lambda: set())
>>> a['woman']
set([])
>>> a['woman'].add(1)
>>> str(a)
"defaultdict(<function <lambda> at 0x7fcb3bbf4b90>, {'woman': set([1])})"
>>> a['beach'].update([1,2,3,4])
>>> a['woman'].intersection(a['beach'])
set([1])
>>> str(a)
"defaultdict(<function <lambda> at 0x7fcb3bbf4b90>, {'woman': set([1]), 'beach': set([1, 2, 3, 4])})"
It mainly depends on the size of the database and on the number of combinations between keywords. Moreover, it also depends on which operation you do most.
If it's small and you need a fast find operation, one possibility is to use a dictionary whose keys are frozensets of the tags and whose values are lists of all the associated sentences.
For instance,
from collections import defaultdict

d = defaultdict(list)
# preprocessing
d[frozenset(["bob","car","red"])].append("Bob owns a red car")
# searching
d[frozenset(["bob","car","red"])] #['Bob owns a red car']
d[frozenset(["red","car","bob"])] #['Bob owns a red car']
For combinations of words like "bob", "car" you have different possibilities, depending on the number of keywords and on what matters more. For example:
- you could add an extra entry for each combination of tags
- you could iterate over the keys and check which ones contain both "car" and "bob" (a sketch of this option follows below)
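A minimal sketch of the second option, assuming the same d as above (frozenset keys support subset tests directly against a plain set):
from collections import defaultdict

d = defaultdict(list)
d[frozenset(["bob", "car", "red"])].append("Bob owns a red car")

# keep the sentences whose key contains every queried tag
query = {"bob", "car"}
matches = [s for key, sents in d.items() for s in sents if query <= key]
print(matches)  # ['Bob owns a red car']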
Say I have a list of movie names with misspellings and small variations like this -
"Pirates of the Caribbean: The Curse of the Black Pearl"
"Pirates of the carribean"
"Pirates of the Caribbean: Dead Man's Chest"
"Pirates of the Caribbean trilogy"
"Pirates of the Caribbean"
"Pirates Of The Carribean"
How do I group or find such sets of words, preferably using python and/or redis?
Have a look at "fuzzy matching". There are some great tools in the thread linked below that calculate similarities between strings.
I'm especially fond of the difflib module
>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']
https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison
You might notice that similar strings have a large common subsequence, for example:
"Bla bla bLa" and "Bla bla bRa" => the common subsequence is "Bla bla ba" (notice the third word)
To find the common subsequence you can use a dynamic programming algorithm. A closely related measure is the Levenshtein distance (the distance between very similar strings is small, and the distance between more different strings is larger): http://en.wikipedia.org/wiki/Levenshtein_distance
Also, for faster performance, you could try adapting the Soundex algorithm: http://en.wikipedia.org/wiki/Soundex
After calculating the distance between all your strings, you have to cluster them. The simplest way is k-means (but it requires you to specify the number of clusters). If you don't know the number of clusters, you have to use hierarchical clustering. Note that the number of clusters in your situation is the number of distinct movie titles + 1 (for totally misspelled strings).
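A minimal sketch of the distance-then-cluster idea, using difflib's SequenceMatcher ratio as a stand-in for a proper Levenshtein library and a greedy threshold grouping instead of k-means/hierarchical clustering (the 0.8 threshold is an arbitrary assumption that would need tuning on real data):
from difflib import SequenceMatcher

titles = [
    "Pirates of the carribean",
    "Pirates Of The Carribean",
    "Pirates of the Caribbean",
    "The Matrix",
]

def similarity(a, b):
    # 1.0 means identical; this is a similarity ratio, not an edit distance
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

clusters = []
for title in titles:
    for cluster in clusters:
        # join the first cluster whose representative is similar enough
        if similarity(title, cluster[0]) > 0.8:
            cluster.append(title)
            break
    else:
        clusters.append([title])

print(clusters)  # groups the three "Pirates" variants together, with "The Matrix" on its own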
I believe there are in fact two distinct problems.
The first is spell correction. You can find a Python implementation here:
http://norvig.com/spell-correct.html
The second is more functional. Here is what I'd do after the spell correction: I would make a relation function.
related(sentence1, sentence2) holds if and only if sentence1 and sentence2 have rare words in common. By rare, I mean words other than (The, what, is, etc...). You can take a look at the TF/IDF system to determine whether two documents are related using their words. Just googling a bit, I found this:
https://code.google.com/p/tfidf/
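A tiny sketch of the related() idea described above, using a hand-written stop-word set instead of real TF/IDF weighting (both the stop words and the "at least one shared rare word" rule are my own assumptions):
STOPWORDS = {"the", "of", "a", "is", "what", "and"}

def related(sentence1, sentence2):
    # two sentences are related if they share at least one non-stop word
    words1 = {w.lower().strip(".,:!?'") for w in sentence1.split()} - STOPWORDS
    words2 = {w.lower().strip(".,:!?'") for w in sentence2.split()} - STOPWORDS
    return bool(words1 & words2)

print(related("Pirates of the carribean", "Pirates of the Caribbean trilogy"))  # True
print(related("The Matrix", "Pirates of the Caribbean"))                        # False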
To add another tip to Fredrik's answer, you could also take inspiration from search-engine-like code, such as this one:
def dosearch(terms, searchtype, case, adddir, files = []):
    found = []
    if files != None:
        titlesrch = re.compile('>title<.*>/title<')
        for file in files:
            title = ""
            if not (file.lower().endswith("html") or file.lower().endswith("htm")):
                continue
            filecontents = open(BASE_DIR + adddir + file, 'r').read()
            titletmp = titlesrch.search(filecontents)
            if titletmp != None:
                title = filecontents.strip()[titletmp.start() + 7:titletmp.end() - 8]
            filecontents = remove_tags(filecontents)
            filecontents = filecontents.lstrip()
            filecontents = filecontents.rstrip()
            if dofind(filecontents, case, searchtype, terms) > 0:
                found.append(title)
                found.append(file)
    return found
Source and more information: http://www.zackgrossbart.com/hackito/search-engine-python/
Regards,
Max
One approach would be to pre-process all the strings before you compare them: convert everything to lowercase and standardize whitespace (e.g., replace any run of whitespace with a single space). If punctuation is not important to your end goal, you can remove all punctuation characters as well.
Levenshtein distance is commonly used to determine the similarity of two strings; this should help you group strings that differ only by small spelling errors.
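A short sketch of that pre-processing step (the exact normalization rules here are my own assumption):
import re

def normalize(s):
    s = s.lower()                        # convert to lowercase
    s = re.sub(r'[^\w\s]', '', s)        # drop punctuation, if it is not important
    s = re.sub(r'\s+', ' ', s).strip()   # standardize whitespace to single spaces
    return s

print(normalize("Pirates Of The  Carribean"))   # pirates of the carribean
print(normalize("Pirates of the carribean"))    # pirates of the carribean
After normalization, near-duplicates like the two titles above become identical, and the remaining differences can be handled with Levenshtein distance.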