Looking for Word Sequences Inside of a String


I haven't seen anyone solve something similar to this without using regex. I have a text = "I have three apples because yesterday I bought three apples", a list of keywords words = ['I have', 'three', 'apples', 'yesterday'], and k = 2. I have to write a function that finds and returns word sequences of >= k 'words' (by 'words' I mean word combinations identified as a single element in the list, e.g. 'I have' is considered one word).
In this case, it should return ['I have three apples', 'three apples']. Even though 'yesterday' is in the string, it's < k, so it doesn't match.
I'm assuming there needs to be a stack to keep track of the size of the sequences. I started writing the code, but 1) it doesn't work in this situation because I try checking 'I have', then 'I have three', etc., so it can't identify just 'three apples'; 2) I don't know how to proceed with it. Here's the code:
text = "I have three apples because yesterday I bought three apples"
words = ['I have', 'three', 'apples', 'yesterday']
k = 2
check1 = []
def search(text, words, k):
    for i in words:
        finding = text.count(i)
        if finding != 0:
            check1.append(i)
            check2 = ' '.join(check1)
            occurrences = text.count(check2)
            if occurrences > 0:
                #i want to check if the previous number of occurrences was the same
                #that's why I think I need a stack. if it is, i keep going
                #if it's not, i append the previous phrases to the list if they're >= k and keep
                #checking
                pass
            else:
                #the next word doesn't belong to the sequence, so we finish the process
                #by adding the right number of word sequences >= k to the resulting list
                pass
        else:
            #the word is not in the list and I need to add check2 to the list
            #considering all word sequences
            pass
Different ways of solving this or any ideas are very much appreciated because I'm stuck on trying to solve it this way and I don't know how to implement it.

I found a solution by walking through the text, noting the words in the correct order. The complexity of this algorithm increases quickly with the length of your word list and the length of text, however. Depending on the application, you might want to do it differently:
def walk(t,w,k):
    t+=' '
    node = -1
    current = []
    collection = []
    while len(t)>1:
        elong = False
        for i in range(len(w)):
            if i > node and t[:len(w[i])] == w[i]:
                node = i
                t = t[len(w[i])+1:]
                current.append(w[i])
                elong=True
        if not elong or len(t)<2:
            t = t[t.find(' ')+1:]
            if len(current)>=k: collection.append(' '.join(current))
            current = []
            node = -1
    return collection
This function would handle the requests you mentioned in the question as follows:
#Input:
print(walk("I have three apples because yesterday I bought three apples",
['I have', 'three', 'apples', 'yesterday'],
2))
#Output:
['I have three apples', 'three apples']
#Input:
print(walk("Is three apples all I have three",
['I have', 'three', 'apples', 'yesterday'],
2))
#Output:
['three apples', 'I have three']
It heavily relies on there being blanks that separate the words and would not cope well with punctuation. You might want to include some preprocessing.
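The preprocessing mentioned above could look something like the following. This is only a minimal sketch: the helper name preprocess is made up for illustration, and it assumes that every character other than letters, digits, underscores and apostrophes can safely be treated as a separator.

```python
import re

def preprocess(text):
    """Hypothetical helper: turn punctuation into blanks before matching."""
    # Every run of characters that is neither a word character nor an
    # apostrophe (this includes runs of whitespace) becomes one blank.
    cleaned = re.sub(r"[^\w']+", " ", text)
    # Drop a leading or trailing blank left behind by punctuation.
    return cleaned.strip()

print(preprocess("I have three apples, because yesterday I bought three apples!"))
```

The cleaned string could then be passed to a blank-separated matcher such as walk in place of the raw text.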

Related

How do I turn a series of IF statements into a Pythonic code line?

I am trying to improve my list comprehension capabilities. I would like to turn the following into a Pythonic code line:
# Sample sentence
sample_sentence = "Once upon a time, three people walked into a bar, Once upon a time, three people walked into a bar"
# Empty list to hold words from sentence
words = []
# Empty lists to hold longest and shortest items
longest = []
shortest = []
# Pythonic method to split sentence into list of words
words = [c for c in sample_sentence.split()]
# Non-Pythonic routine to find the shortest and longest words
for i in words:
    if len(i) <= shortest_len:
        # shortest = [] # Reset list
        shortest.append(i)
    if len(i) >= longest_len:
        longest.append(i)
print(f"Shortest item: {', '.join(shortest)}")
print(f"Longest item: {', '.join(longest)}")
I have tried creating a Pythonic version of the non-Pythonic routine to find the shortest and longest words:
shortest = [i for i in words if len(i) <= shortest_len: shortest.append(i) ]
print(shortest)
longest = [i for i in words if len(i) >= longest_len: longest.append(i)]
print(longest)
But it gets an invalid syntax error.
How can I fix the two lines above, and is it possible to combine them into a single Pythonic line?

Don't make the mistake of thinking that everything has to be a clever one-liner. It is better to write code that you will be able to look at in the future and know exactly what it will do. In my opinion, this is a very Pythonic way to write your script. Yes, there are improvements that could be made, but don't optimize if you don't have to. This script is reading two sentences; if it were reading entire books, then improvements would be worth making.
As said in the comments, everyone has their own definition of "Pythonic" and their own formatting. Personally, I don't think you need the comments when it is very clear what your code is doing. For instance, you don't need to comment "Sample Sentence" when your variable is named sample_sentence.
sample_sentence = "Once upon a time, three people walked into a bar, Once upon a time, three people walked into a bar"
words = sample_sentence.split()
words.sort(key=len)
short_len = len(words[0])
long_len = len(words[-1])
shortest_words = [word for word in words if len(word) == short_len]
longest_words = [word for word in words if len(word) == long_len]
print(f"Shortest item: {', '.join(shortest_words)}")
print(f"Longest item: {', '.join(longest_words)}")

Try this for your list comprehension case to resolve the syntax error -
shortest = [i for i in words if len(i) <= shortest_len]
print(shortest)
longest = [i for i in words if len(i) >= longest_len]
print(longest)
Also instead of words = [c for c in sample_sentence.split()] you can use words = sample_sentence.split()

To resolve the SyntaxError, just remove the : shortest.append(i) and : longest.append(i) parts. The lists will then be as expected.
is it possible to use a single Pythonic line to combine the two?
No, it is not possible to build up both lists on one line. (Except that you can of course put both list constructions on one line:
shortest = [i for i in words if len(i) <= shortest_len]; longest = [i for i in words if len(i) >= longest_len]
). But I don't think that was your goal …
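If the aim is to collect both lists in a single pass over the words rather than in a single line, one possible approach is to group the words by length first. The sketch below is an assumption about what might be wanted, not something from the question; the function name shortest_and_longest is invented for illustration.

```python
from collections import defaultdict

def shortest_and_longest(sentence):
    """Hypothetical helper: group words by length in one pass."""
    by_length = defaultdict(list)
    for word in sentence.split():
        by_length[len(word)].append(word)
    if not by_length:
        # Empty input: there are no shortest or longest words.
        return [], []
    # The smallest and largest keys are the shortest and longest lengths.
    return by_length[min(by_length)], by_length[max(by_length)]

shortest, longest = shortest_and_longest(
    "Once upon a time, three people walked into a bar")
print(f"Shortest item: {', '.join(shortest)}")
print(f"Longest item: {', '.join(longest)}")
```

Like the other answers, this counts punctuation as part of a word ("time," has length 5), and repeated words appear once per occurrence.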

An alternative code to get the shortest and longest words:
sample_sentence = "Once upon a time, three people walked into a bar, Once upon a time, three people walked into a bar"
my_list = list(set([(token,len(token)) for token in sample_sentence.split()]))
longest_word = ', '.join(i[0] for i in my_list if i[1]==max([x[1] for x in my_list]))
shortest_word = ', '.join(i[0] for i in my_list if i[1]==min([x[1] for x in my_list]))
print(f"longest words: {longest_word}")
print(f"shortest words: {shortest_word}")
output
longest words: people, walked
shortest words: a

First of all, you need to define shortest_len and longest_len somewhere. You can use this:
shortest_len = min([len(word) for word in words])
longest_len = max([len(word) for word in words])
shortest = [word for word in words if len(word) == shortest_len]
longest = [word for word in words if len(word) == longest_len]
It is possible to combine the two lines into
shortest, longest = [word for word in words if len(word) == shortest_len], [word for word in words if len(word) == longest_len]
If you want to go really crazy, you can compact it as far as this:
sample_sentence = "Once upon a time, three people walked into a bar, Once upon a time, three people walked into a bar"
print(f"Shortest item: {', '.join([word for word in sample_sentence.split() if len(word) == min([len(w) for w in sample_sentence.split()])])}")
print(f"Longest item: {', '.join([word for word in sample_sentence.split() if len(word) == max([len(w) for w in sample_sentence.split()])])}")
or even join these two into one print to get a one-liner:
sample_sentence = "Once upon a time, three people walked into a bar, Once upon a time, three people walked into a bar"
print(f"Shortest item: {', '.join([word for word in sample_sentence.split() if len(word) == min([len(w) for w in sample_sentence.split()])])}\nLongest item: {', '.join([word for word in sample_sentence.split() if len(word) == max([len(w) for w in sample_sentence.split()])])}")
But, as already pointed out by @big_bad_bison, the shortest solution is not always the best solution, and the readability of your code (for yourself, over months or years; or for others, if you ever need to work in a team or ask for help) is much more important.
Many one-liners are not cleanly Pythonic because they make for a line of code that is too long. PEP 8 - Style Guide for Python Code (definitely suggested reading!) states clearly:
Limit all lines to a maximum of 79 characters.

This is a one-liner to find the length of every word in your sentence and sort the words from shortest to longest:
words_len = sorted(list(set([(c, len(c)) for c in sample_sentence.split()])), key=lambda t: t[1])
Or, as was pointed out to me in the comments, you can achieve the same by doing this:
words = sorted(list(set(sample_sentence.split())), key=len)
print(f"shortest: {words[0]}\nlongest: {words[-1]}")

Find triple words in a word list in Python

The comparative form of the adjective big is bigger, and the superlative form is biggest. I would like to print out all such triples, e.g. (early, earlier, earliest) or (hard, harder, hardest).
I'm using Python to open wordList.txt, which consists of about 5000 words. My code works perfectly on a small file, but it is too slow on the large file because the nested loops take too long.
def puzzleA(wordList):
    tempList1 = []
    tempList2 = []
    tempList3 = []
    for word in wordList:
        if word[-2:]=='er':
            tempList1.append(word)
        if word[-3:]=='est':
            tempList2.append(word)
    for word1 in wordList:
        for word2 in tempList1:
            if word1==word2[:-2]:
                tempList3.append(word1)
    for word1 in tempList3:
        for word2 in tempList2:
            if word1==word2[:-3]:
                print('{}, {}er, {}'.format(word1,word1,word2))
Can you guys suggest another algorithm to optimize the run time please!

You can build a dict with the root of the words as keys, and all variations in a list as value. Then we only keep the entries with 3 values. This way, we only iterate once on the list, and once on the dict we created, keeping the whole process O(n).
We can use a defaultdict to build the dict more easily. Note that the root function might need some improvement, check your list of English adjectives!
from collections import defaultdict
def root(word):
    if len(word) < 4:
        return word
    if word[-3:] == 'ier':
        return word[:-3] + 'y'
    elif word[-4:] == 'iest':
        return word[:-4] + 'y'
    elif word[-2:] == 'er':
        return word[:-2]
    elif word[-3:] == 'est':
        return word[:-3]
    else:
        return word
def find_triples(words):
    out_dict = defaultdict(list)
    for word in words:
        out_dict[root(word)].append(word)
    # keep only the lists with 3 distinct values, sorted by length
    out = [sorted(set(values), key=len) for values in out_dict.values()
           if len(set(values))==3]
    return out
data = ['early', 'earlier', 'earliest', 'or', 'hard', 'harder', 'hardest', 'ignored']
print(find_triples(data))
# [['early', 'earlier', 'earliest'], ['hard', 'harder', 'hardest']]

Thank you so much for posting, Thierry Lathuille. It's been 4 hours now since I first looked at your answer, and I am not good enough to fully understand your code. I did adjust the root(word) function like this:
def root(word):
    if len(word) < 4:
        return word
    if word[-3:] == 'ier':
        return word[:-3] + 'y'
    elif word[-4:] == 'iest':
        return word[:-4] + 'y'
    elif word[-2:] == 'er':
        if word[-4:-3]==word[-3:-2]:
            return word[:-3]
        else:
            return word[:-2]
    elif word[-3:] == 'est':
        if word[-4:-3]==word[-5:-4]:
            return word[:-4]
        return word[:-3]
    else:
        return word
But it has 2 problems right now:
First, the word list contains duplicated words, so it produces something like [terry, terry, terrier].
Second, it's really difficult to find a triple such as [big, bigger, biggest].
Mine produces [whin, whiner, whinner], [willy, willier, willyer], [slat, slater, slatter],...
Assume that I am not allowed to remove the duplicate words first. Is there any way to access each value under each key? I'd like to compare those values pairwise to eliminate unwanted results.
And Thierry, if you have time, can you explain this code please?
out = [sorted(values, key=len) for values in out_dict.values() if len(values)==3]
I am really bad at reading list comprehensions.
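For readers in the same position, the list comprehension quoted above can be unrolled into an ordinary loop. The sketch below uses a small hand-made out_dict (the sample data is an assumption for illustration) and shows that both forms build the same list.

```python
# Sample data standing in for the dict built by find_triples.
out_dict = {
    'early': ['early', 'earlier', 'earliest'],
    'or': ['or'],
    'hard': ['hard', 'harder', 'hardest'],
}

# The comprehension from the question...
out_comprehension = [sorted(values, key=len)
                     for values in out_dict.values() if len(values) == 3]

# ...does the same work as this explicit loop.
out_loop = []
for values in out_dict.values():       # visit each list of word variations
    if len(values) == 3:               # keep only groups with three entries
        out_loop.append(sorted(values, key=len))  # shortest word first

print(out_loop)
print(out_loop == out_comprehension)
```

Reading a comprehension from the for clause outward (loop, then condition, then the expression at the front) usually makes the order of operations clear.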

It seems like you're now struggling with how to stem words in a consistent manner.
For your scenario, you can make do with a list of pattern-replacement rules, i.e. "if the word ends with this pattern, replace it with that".
Using regular expressions, you can easily specify the patterns and replacements, including things like "if it ends with a repeating letter, replace it with a single instance of that letter".
For example:
import re
def root(word):
    pattern_replacements = [
        ("e$", ""), # fine => fin (to match finer, finest)
        ("y$", "i"), # tiny => tini (to match tinier, tiniest)
        ("er$", ""),
        ("est$", ""),
        (r"([a-z])\1$", r"\1") # bigger => big
    ]
    # apply every rule in order; each rule sees the result of the previous one
    for pattern, replacement in pattern_replacements:
        word = re.sub(pattern, replacement, word)
    return word
words = "big bigger biggest tiny tinier tiniest fine finer finest same samer samest good gooder goodest".split(" ")
print(list(map(root, words)))
# ['big', 'big', 'big', 'tini', 'tini', 'tini', 'fin', 'fin', 'fin', 'sam', 'sam', 'sam', 'good', 'good', 'good']
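Whichever root function is used, groups such as [whin, whiner, whinner] or groups containing duplicates can still slip through. One way to filter them, sketched below under the assumption that only four spelling patterns matter (plain suffix, y to i, dropped e, doubled final consonant), is to check each candidate group explicitly. The names is_triple and groups are invented for this illustration.

```python
def is_triple(base, comparative, superlative):
    """Hypothetical check: do the three words form base/-er/-est?"""
    stems = {base}                    # hard -> harder, hardest
    if base.endswith('y'):
        stems.add(base[:-1] + 'i')    # early -> earlier, earliest
    if base.endswith('e'):
        stems.add(base[:-1])          # fine -> finer, finest
    stems.add(base + base[-1])        # big -> bigger, biggest
    return any(comparative == s + 'er' and superlative == s + 'est'
               for s in stems)

# Sample groups keyed by root; one has a duplicate, one is a false triple.
groups = {
    'big': ['big', 'bigger', 'biggest', 'big'],
    'whin': ['whin', 'whiner', 'whinner'],
}

triples = []
for stem, values in groups.items():           # each key with its list
    unique = sorted(set(values), key=len)     # duplicates collapse here
    if len(unique) == 3 and is_triple(*unique):
        triples.append(unique)

print(triples)
```

Iterating over groups.items() gives access to every value under every key, so duplicates can be collapsed per group without first cleaning the whole word list.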

How to slice a string into sub strings of words and append to list recursively?

I am trying to slice a string based on a provided limit and append the slices to a list if they form entire words.
If a slice does not end on a complete word, I want to move backward (or forward) to the point where the previous word ends, slice the string at that point, and continue this for the rest of the string.
I have written the code, but I am having problems doing it in a recursive way.
strr = "stackoverflow is the best knowledge platform"
limit = 11
def get_list(strr, limit, fin_list = []):
    i = 0
    if limit <= len(strr):
        if strr[limit].isspace():
            fin_list.append(strr[i:limit])
            i=limit+1
            return get_list(strr[i:],limit, fin_list)
            print(fin_list)
        else:
            return get_list(strr,limit-1)
print(get_list(strr, limit))
My limit is 11 so my expected output is
['stackoverflow','is the best','knowledge','platform']
stackoverflow ==> complete word hence move forward
Another method I tried was using dictionaries but it did not solve the problem.
Can this be achieved in a typical Python-style one-liner using a comprehension?

You should use a while loop to find the closest white space to the limit, going backwards, and if no white space is found, then go forward. Recursively find the next substring until the remaining string fits the length limit:
def get_list(strr, limit):
    if len(strr) <= limit:
        return [strr] if strr else []
    i = limit
    while 0 < i < len(strr) and not strr[i].isspace():
        i -= 1
    if not i:
        i = limit
        while i < len(strr) and not strr[i].isspace():
            i += 1
    return [strr[:i]] + get_list(strr[i + 1:], limit)
so that:
get_list('stackoverflow is the best knowledge platform', 11)
returns:
['stackoverflow', 'is the best', 'knowledge', 'platform']
Alternatively, you can split the input by white space, and iterate over the list of words to decide whether to add a word to the current string of words or to output the current string of words and start a new one based on whether the new word is going to make the string of words exceed the limit:
def get_list(strr, limit):
    words = ''
    for word in strr.split():
        if len(words) + len(word) + 1 > limit:
            if words:
                yield words
                words = word
            else:
                yield word
                words = ''
        else:
            if words:
                words += ' '
            words += word
    if words:
        yield words
so that:
print(list(get_list(strr, 3)))
print(list(get_list(strr, 11)))
print(list(get_list(strr, 100)))
outputs:
['stackoverflow', 'is', 'the', 'best', 'knowledge', 'platform']
['stackoverflow', 'is the best', 'knowledge', 'platform']
['stackoverflow is the best knowledge platform']
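As for the one-line version asked about in the question, the standard library's textwrap module may already be enough. The sketch below is a swapped-in alternative rather than the approach of the answer above; it assumes that words longer than the limit should be kept whole, which is what break_long_words=False requests.

```python
import textwrap

strr = "stackoverflow is the best knowledge platform"

# Wrap at 11 characters per chunk; words longer than that stay intact
# and end up on a line of their own.
chunks = textwrap.wrap(strr, width=11, break_long_words=False)
print(chunks)
```

Note that textwrap.wrap also normalizes whitespace by default, so the chunks are not guaranteed to be exact slices of the original string.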

Scanning for two word phrase in Python dictionary

I am trying to use a Python dictionary object to help translate an input string to other words or phrases. I am having success with translating single words from the input, but I can't seem to figure out how to translate multi-word phrases.
Example:
sentence = input("Please enter a sentence: ")
myDict = {"hello": "hi","mean adult":"grumpy elder", ...ect}
How can I return hi grumpy elder if the user enters hello mean adult for the input?

"fast car" is a key to the dictionary, so you can extract the value if you use the key coming back from it.
If you're taking the input straight from the user and using it to reference the dictionary, get is safer, as it allows you to provide a default value in case the key doesn't exist.
print(myDict.get(sentence, "Phrase not found"))
Since you've clarified your requirements a bit more, the hard part now is the splitting; the get doesn't change. If you can guarantee the order and structure of the sentences (that is, it's always going to be structured such that we have a phrase with 1 word followed by a phrase with 2 words), then split only on the first occurrence of a space character.
split_input = sentence.split(' ', 1)
print("{} {}".format(myDict.get(split_input[0]), myDict.get(split_input[1])))
More complex split requirements I leave as an exercise for the reader. A hint would be to use the keys of myDict to determine what valid tokens are present in the sentence.
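One way to follow that hint is to build a regular expression out of the dictionary keys and let re.sub do the scanning. This is only a sketch: the function name translate is made up, matching is case-sensitive, and the sample dictionary contains just the two entries shown in the question.

```python
import re

my_dict = {"hello": "hi", "mean adult": "grumpy elder"}

def translate(sentence, mapping):
    """Hypothetical helper: replace every known phrase in the sentence."""
    # Try longer keys first so a multi-word phrase wins over any
    # shorter key that happens to be a prefix of it.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(key) for key in keys) + r")\b")
    # Each match is looked up in the mapping; other text is left alone.
    return pattern.sub(lambda match: mapping[match.group(0)], sentence)

print(translate("hello mean adult", my_dict))
```

Words that are not keys of the dictionary pass through unchanged, so no default value is needed here.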

The same way as you normally would.
translation = myDict['fast car']
A solution to your particular problem would be something like the following, where maxlen is the maximum number of words in a single phrase in the dictionary.
translation = []
words = sentence.split(' ')
maxlen = 3
index = 0
while index < len(words):
    # try the longest candidate phrase first, then shorter ones
    for i in range(maxlen, 0, -1):
        phrase = ' '.join(words[index:index+i])
        if phrase in myDict:
            translation.append(myDict[phrase])
            index += i
            break
    else:
        # no phrase starting at this word is in the dictionary
        translation.append(words[index])
        index += 1
print(' '.join(translation))
Given the sentence hello this is a nice fast car, it outputs hi this is a sweet quick ride

This will check for each word and also a two word phrase using the word before and after the current word to make the phrase:
myDict = {"hello": "hi",
          "fast car": "quick ride"}
sentence = input("Please enter a sentence: ")
words = sentence.split()
for i, word in enumerate(words):
    if word in myDict:
        print(myDict.get(word))
        continue
    if i:
        # phrase made of the previous word and the current word
        phrase = ' '.join([words[i-1], word])
        if phrase in myDict:
            print(myDict.get(phrase))
            continue
    if i < len(words)-1:
        # phrase made of the current word and the next word
        phrase = ' '.join([word, words[i+1]])
        if phrase in myDict:
            print(myDict.get(phrase))
            continue

Python - For Loop of Itertools Permutations, How to not Create Endless Elif Statments?

I am a python/programming newbie, been at it for a few months. Hopefully this chunk of code is not too big or over-the-top for SO, but I couldn't figure how else to ask the question without complete context. So here goes:
import re
import itertools
nouns = ['bacon', 'cheese', 'eggs', 'milk', 'fish', 'houses', 'dog']
CC = ['and', 'or']
def replacer_factory():
    def create_permutations(match):
        group1_string = (match.group(1)[:-1]) # strips trailing whitespace
        # creates list of matched.group() with word 'and' or 'or' removed
        nouns2 = list(filter(None, re.split(r',\s*', group1_string))) + [match.group(3)]
        perm_nouns2 = list(itertools.permutations(nouns2))
        CC_match = match.group(2) # this either matches word 'and' or 'or'
        # create list that holds the permutations created in for loop below
        perm_list = []
        for comb in itertools.permutations(nouns2):
            comb_len = len(comb)
            if comb_len == 2:
                perm_list.append(' '.join((comb[0], CC_match, comb[-1])))
            elif comb_len == 3:
                perm_list.append(', '.join((comb[0], comb[1], CC_match, comb[-1])))
            elif comb_len == 4:
                perm_list.append(', '.join((comb[0], comb[1], comb[2], CC_match, comb[-1])))
        # does the match.group contain word 'and' or 'or'
        if (match.group(2)) == "and":
            joined = '*'.join(perm_list)
            strip_comma = joined.replace("and,", "and")
            completed = '|'+strip_comma+'|'
            return completed
        elif (match.group(2)) == "or":
            joined = '*'.join(perm_list)
            strip_comma = joined.replace("or,", "or")
            completed = '|'+strip_comma+'|'
            return completed
    return create_permutations
def search_and_replace(text):
    # use 'nouns' and 'CC' lists to find a noun list phrase
    # e.g 'bacon, eggs, and milk' is 1 example of a match
    noun_patt = r'\b(?:' + '|'.join(nouns) + r')\b'
    CC_patt = r'\b(' + '|'.join(CC) + r')\b'
    patt = r'((?:{0},? )+){1} ({0})'.format(noun_patt, CC_patt)
    replacer = replacer_factory()
    return re.sub(patt, replacer, text)
def main():
    with open('test_sentence.txt') as input_f:
        read_f = input_f.read()
    with open('output.txt', 'w') as output_f:
        output_f.write(search_and_replace(read_f))
if __name__ == '__main__':
    main()
Contents of the 'test_sentence.txt':
I am 2 list with 'or': eggs or cheese.
I am 2 list with 'and': milk and eggs.
I am 3 list with 'or': cheese, bacon, and eggs.
I am 3 list with 'and': bacon, milk and cheese.
I am 4 list: milk, bacon, eggs, and cheese.
I am 5 list, I don't match.
I am 3 list with non match noun: cheese, bacon and pie.
So, the code all works great, but I have hit a limitation that I can't figure out how to solve. This limitation is contained within the for loop. As it stands, I have only created 'if' and 'elif' statements that go as far as elif comb_len == 4:. I actually want this to become unlimited, moving on to elif comb_len == 5:, elif comb_len == 6:, elif comb_len == 7:. (Well, in practical reality, I would not really need to go beyond elif comb_len == 20, but the point is the same, and I want to allow for that possibility.) However, it is not practical to create so many 'elif' statements.
Any ideas about how to tackle this?
Note that 'test_sentence.txt' and the 'nouns' list here are just samples. My actual 'nouns' list is in the 1000's, and I will be working with bigger documents than the text contained within 'test_sentence.txt'.
Cheers
Darren
P.S - I struggled to come up with a suitable title for this one!

If you notice, each of the lines in the if-elif statements follows roughly the same structure: you first take every element in the comb list except for the last one, add on CC_match, then add the last item.
If we write this out as code, we get something like this:
head = list(comb[0:-1])
head.append(CC_match)
head.append(comb[-1])
perm_list.append(', '.join(head))
Then, inside your for loop, you can replace the if-elif statements:
for comb in itertools.permutations(nouns2):
    head = list(comb[0:-1])
    head.append(CC_match)
    head.append(comb[-1])
    perm_list.append(', '.join(head))
You should also consider adding some error checking so that the program doesn't react oddly if the length of the comb list is equal to zero or one.
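A variant with that error checking, and with the two-item case kept comma-free as in the question's original code, might look like the sketch below. The function name join_permutations and its parameters are invented for illustration and are not part of the question's script. It also places the conjunction directly, so no later "and," clean-up is needed.

```python
import itertools

def join_permutations(nouns, conjunction):
    """Hypothetical helper: render every ordering as 'a, b, and c'."""
    if len(nouns) < 2:
        # With zero or one noun there is nothing to join.
        raise ValueError("need at least two nouns")
    phrases = []
    for comb in itertools.permutations(nouns):
        if len(comb) == 2:
            # Two items take no commas: 'eggs or cheese'.
            phrases.append(' '.join((comb[0], conjunction, comb[1])))
        else:
            # Any longer list: comma-separated head, conjunction, last item.
            head = ', '.join(comb[:-1])
            phrases.append('{}, {} {}'.format(head, conjunction, comb[-1]))
    return phrases

print(join_permutations(['eggs', 'cheese'], 'or'))
print(join_permutations(['bacon', 'milk', 'cheese'], 'and')[0])
```

Keep in mind that the number of permutations grows factorially, so lists approaching 20 nouns would produce far more phrases than can be stored or written out.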
