Python: find n-sized window around phrase within string

I have a string, for example 'i cant sleep what should i do', as well as a phrase that is contained in the string, 'cant sleep'. What I am trying to accomplish is to get an n-sized window around the phrase even if there aren't n words on either side. So in this case, with a window size of 2 (2 words on either side of the phrase), I would want 'i cant sleep what should'.
This is my current attempt at finding a window of size 2. It fails when the number of words to the left or right of the phrase is less than 2, and I would also like to be able to use different window sizes.
import re
sentence = 'i cant sleep what should i do'
phrase = 'cant sleep'
sentence_words = re.findall(r'\w+', sentence)
phrase_words = re.findall(r'\w+', phrase)
left = sentence_words.index(phrase_words[0])
right = sentence_words.index(phrase_words[-1])
# fails when left < 2: the negative index wraps around the list
print(sentence_words[left-2:right+3])

You can use the partition method for a non-regex solution:
>>> s='i cant sleep what should i do'
>>> p='cant sleep'
>>> lh, _, rh = s.partition(p)
Then use a slice to get up to two words:
>>> n=2
>>> ' '.join(lh.split()[:n]), p, ' '.join(rh.split()[:n])
('i', 'cant sleep', 'what should')
Your exact output:
>>> ' '.join(lh.split()[:n]+[p]+rh.split()[:n])
'i cant sleep what should'
Of course, you would want to check whether p is in s, i.e. whether the partition actually succeeded.
As pointed out in the comments, the slice on lh should use a negative index to take the last n words (thanks Mathias Ettinger):
>>> s='w1 w2 w3 w4 w5 w6 w7 w8 w9'
>>> p='w4 w5'
>>> n=2
>>> lh, _, rh = s.partition(p)
>>> ' '.join(lh.split()[-n:]+[p]+rh.split()[:n])
'w2 w3 w4 w5 w6 w7'
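Pulling the pieces together, a minimal sketch of the whole approach (the get_window name and the None return for a missing phrase are my own choices; it assumes n >= 1, since lh.split()[-0:] would take everything):
def get_window(s, p, n):
    # guard against a phrase that is not actually in the string
    if p not in s:
        return None
    lh, _, rh = s.partition(p)
    # last n words on the left, first n words on the right
    return ' '.join(lh.split()[-n:] + [p] + rh.split()[:n])

print(get_window('i cant sleep what should i do', 'cant sleep', 2))
# i cant sleep what should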

If you define words as entities separated by spaces, you can split your sentences and use regular Python slicing:
def get_window(sentence, phrase, window_size):
    sentence = sentence.split()
    phrase = phrase.split()
    words = len(phrase)
    for i, word in enumerate(sentence):
        if word == phrase[0] and sentence[i:i+words] == phrase:
            start = max(0, i-window_size)
            return ' '.join(sentence[start:i+words+window_size])
sentence = 'i cant sleep what should i do'
phrase = 'cant sleep'
print(get_window(sentence, phrase, 2))
You can also change it to a generator by changing return to yield, which lets you generate all windows if there are several matches of phrase in sentence:
>>> list(gen_window('I dont need it, I need to get rid of it', 'need', 2))
['I dont need it, I', 'it, I need to get']
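For reference, a sketch of that generator variant (same space-based definition of a word as above; gen_window is the name used in the example):
def gen_window(sentence, phrase, window_size):
    sentence = sentence.split()
    phrase = phrase.split()
    words = len(phrase)
    for i, word in enumerate(sentence):
        if word == phrase[0] and sentence[i:i+words] == phrase:
            start = max(0, i-window_size)
            # yield instead of return, so every match produces a window
            yield ' '.join(sentence[start:i+words+window_size])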

import re

def contains_sublist(lst, sublst):
    n = len(sublst)
    for i in range(len(lst)-n+1):
        if sublst == lst[i:i+n]:
            a = max(0, i-2)  # not max(i, i-2), which would drop the left context
            b = min(i+n+2, len(lst))
            return ' '.join(lst[a:b])

sentence = 'i cant sleep what should i do'
phrase = 'cant sleep'
sentence_words = re.findall(r'\w+', sentence)
phrase_words = re.findall(r'\w+', phrase)
print(contains_sublist(sentence_words, phrase_words))
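Since the question also asks for configurable window sizes and this hard-codes 2, here is a hedged tweak (the window parameter is my addition):
def contains_sublist(lst, sublst, window=2):
    n = len(sublst)
    for i in range(len(lst)-n+1):
        if sublst == lst[i:i+n]:
            a = max(0, i-window)
            b = min(i+n+window, len(lst))
            return ' '.join(lst[a:b])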

You can split words using the built-in string methods, so re shouldn't be necessary. If you want to allow varying window sizes, wrap it in a function like so:
def get_word_window(sentence, phrase, w_left=0, w_right=0):
    w_lst = sentence.split()
    p_lst = phrase.split()
    for i, word in enumerate(w_lst):
        if word == p_lst[0] and w_lst[i:i+len(p_lst)] == p_lst:
            left = max(0, i-w_left)
            right = min(len(w_lst), i+w_right+len(p_lst))
            return w_lst[left:right]
Then you can get the new phrase like so:
>>> sentence='i cant sleep what should i do'
>>> phrase='cant sleep'
>>> ' '.join(get_word_window(sentence,phrase,2,2))
'i cant sleep what should'

Related

Check for words in a sentence

I am writing a program in Python. The user enters a text message, and I need to check whether a sequence of words appears in this message. For example, with the message "Hello world, my friend." and the two-word sequence "Hello", "world", the result is True. But when checking the same words against the message "Hello, beautiful world", the result is False. When only two words need to be checked I can do it as in the code below, but with combinations of 5 or more words it gets difficult. Is there a compact solution to this problem?
s = message.text
s = s.lower()
lst = s.split()
if "hello" in lst and "world" in lst:
    c = lst.index("hello")
    if lst[c+1] == "world" or lst[c-1] == "world":
        E = True
    else:
        E = False
The straightforward way is to use a loop. Split your message into individual words, and then check for each of those in the sentence in general.
word_list = message.split()  # this gives you a list of words to find
word_found = True
for word in word_list:
    if word not in message2:
        word_found = False
print(word_found)
The flag word_found is True iff all words were found in the sentence. There are many ways to make this shorter and faster, especially using the all() builtin and providing the word list as an inline expression:
word_found = all(word in message2 for word in message.split())
Now, if you need to restrict your "found" property to matching exact words, you'll need more preprocessing. The above code is too forgiving of substrings, such as finding "Are you OK ?" in the sentence "your joke is only barely funny". For the more restrictive case, you should break message2 into words, strip those words of punctuation, drop them to lower-case (to make matching easier), and then look for each word (from message) in the list of words from message2.
Can you take it from there?
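For reference, one rough sketch of that preprocessing (the words_found name and the use of string.punctuation are my choices, not the only way to do it):
import string

def words_found(message, message2):
    # strip punctuation and lower-case both sides, then match whole words
    wanted = [w.strip(string.punctuation).lower() for w in message.split()]
    wanted = [w for w in wanted if w]  # drop bare punctuation tokens
    have = {w.strip(string.punctuation).lower() for w in message2.split()}
    return all(w in have for w in wanted)

print(words_found("Are you OK ?", "your joke is only barely funny"))  # False
Note this still checks only word presence, not that the words appear consecutively.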
I will clarify your requirements first:
- ignore case
- the words must appear as a consecutive sequence
- they may match in any order, like a permutation or anagram
- duplicated words are supported
If the number of words is not too large, you can try this easy-to-understand but not fastest way:
- split the text message into all its words
- join them with ' '
- list all the permutations of the word list and join each with ' ' too. For example, to check the sequence ['Hello', 'beautiful', 'world'], the permutations will be 'Hello beautiful world', 'Hello world beautiful', 'beautiful Hello world', and so on.
- then simply check whether any permutation, such as 'hello beautiful world', is a substring of the joined text.
The sample code is here:
import itertools
import re

# permutations brute-force, O(nk!)
def checkWords(text, word_list):
    # split all words without space and punctuation
    text_words = re.findall(r"[\w']+", text.lower())
    # list all the permutations of word_list, and match
    for words in itertools.permutations(word_list):
        if ' '.join(words).lower() in ' '.join(text_words):
            return True
    return False
    # or use any(), just one line:
    # return any(' '.join(words).lower() in ' '.join(text_words)
    #            for words in itertools.permutations(word_list))
def test():
    # True
    print(checkWords('Hello world, my friend.', ['Hello', 'world', 'my']))
    # False
    print(checkWords('Hello, beautiful world', ['Hello', 'world']))
    # True
    print(checkWords('Hello, beautiful world Hello World', ['Hello', 'world', 'beautiful']))
    # True
    print(checkWords('Hello, beautiful world Hello World', ['Hello', 'world', 'world']))
But this costs a lot when the number of words is large: k words generate k! permutations, so the time complexity is O(nk!).
I think a more efficient solution is a sliding window, which decreases the time complexity to O(n):
import re
import collections

# sliding window, O(n)
def checkWords(text, word_list):
    # split all words without space and punctuation
    text_words = re.findall(r"[\w']+", text.lower())
    counter = collections.Counter(map(str.lower, word_list))
    start, end, count, all_indexes = 0, 0, len(word_list), []
    while end < len(text_words):
        counter[text_words[end]] -= 1
        if counter[text_words[end]] >= 0:
            count -= 1
        end += 1
        # if you want all the indexes of the matches, collect them here
        if count == 0:
            # all_indexes.append(start)
            return True
        if end - start == len(word_list):
            counter[text_words[start]] += 1
            if counter[text_words[start]] > 0:
                count += 1
            start += 1
    # return all_indexes
    return False
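A quick sanity check of the sliding-window version against the same test cases as before (assuming the checkWords above):
print(checkWords('Hello world, my friend.', ['Hello', 'world', 'my']))  # True
print(checkWords('Hello, beautiful world', ['Hello', 'world']))  # False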
I don't know if this is what you really need, but it worked when I tested it:
message = 'hello world'
message2 = ' hello beautiful world'
if 'hello' in message and 'world' in message:
    print('yes')
else:
    print('no')
if 'hello' in message2 and 'world' in message2:
    print('yes')
Output:
yes
yes

Create all combinations from regex

I have sentences that define a template for random combinations:
I like dogs/cats
I want to eat today/(the next day)
I tried using a regex:
m = re.search(r'(?P<list>[A-Za-z]+/([A-Za-z]+)+)', sentence)
words = m.group('list').split('/')
combs = [comb for comb in [sentence.replace(m.group('list'), w) for w in words]]
For the first sentence I get ['i like dogs', 'i like cats'] which is what I want. For the second sentence, re.search returns None. What I would like to get is ['I want to eat today', 'I want to eat the next day'].
How do I need to change the regex?
(I want to eat today)*|(the next day)
is the regex that will select the text you want...
r'(?P<list>[A-Za-z]+/([a-zA-Z]+|\(.+?\)))'
([a-zA-Z]+|\(.+?\)) matches strings like "word" or "(some word)". It also matches the surrounding "()", so we need to remove the leading "(" and trailing ")" using strip.
m = re.search(r'(?P<list>[A-Za-z]+/([a-zA-Z]+|\(.+?\)))', sentence)
words = m.group('list').split('/')
combs = [comb for comb in [sentence.replace(m.group('list'), w.strip('()')) for w in words]]
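Putting that together, a small sketch that covers both example sentences (the expand helper name is mine):
import re

def expand(sentence):
    m = re.search(r'(?P<list>[A-Za-z]+/([a-zA-Z]+|\(.+?\)))', sentence)
    if m is None:
        return [sentence]  # no alternatives found
    words = m.group('list').split('/')
    return [sentence.replace(m.group('list'), w.strip('()')) for w in words]

print(expand('I like dogs/cats'))
# ['I like dogs', 'I like cats']
print(expand('I want to eat today/(the next day)'))
# ['I want to eat today', 'I want to eat the next day']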
With the code below you will get something like:
sentence = 'I want to eat today/(the next day)'
m = re.search(r'(?P<list>[A-Za-z]+/([A-Za-z]+|(\(.*?\))))', sentence)
print(m.group('list'))
words = m.group('list').split('/')
combs = [comb for comb in [sentence.replace(m.group('list'), w) for w in words]]
print(combs)
['I want to eat today', 'I want to eat (the next day)']
You could do some extra processing to get rid of the extra parentheses, which should be easy.

Title case with a list of exception words

I'm trying to come up with something that will "title" a string of words. It should capitalize all words in the string except words given as an argument not to capitalize, but it should still capitalize the first word no matter what. I know how to capitalize every word, but I don't know how to not capitalize the exceptions. I'm kind of lost on where to start and couldn't find much on Google.
def titlemaker(title, exceptions):
    return ' '.join(x[0].upper() + x[1:] for x in title.split(' '))
or
    return title.title()
but I found that title() will capitalize a letter after an apostrophe, so I don't think I should use it.
Any help on how I should take into account the exceptions would be nice
example: titlemaker('a man and his dog', 'a and') should return 'A Man and His Dog'
def titlemaker(title, exceptions):
    exceptions = exceptions.split(' ')
    return ' '.join(x.title() if nm == 0 or x not in exceptions else x
                    for nm, x in enumerate(title.split(' ')))

titlemaker('a man and his dog', 'a and')  # returns "A Man and His Dog"
The above assumes that the input string and the list of exceptions are in the same case (as they are in your example), but it would fail on something like titlemaker('a man And his dog', 'a and'). If they could be in mixed case, do:
def titlemaker(title, exceptions):
    exceptionsl = [x.lower() for x in exceptions.split(' ')]
    return ' '.join(x.title() if nm == 0 or x.lower() not in exceptionsl else x.lower()
                    for nm, x in enumerate(title.split(' ')))

titlemaker('a man and his dog', 'a and')  # returns "A Man and His Dog"
titlemaker('a man AND his dog', 'a and')  # returns "A Man and His Dog"
titlemaker('A Man And His DOG', 'a and')  # returns "A Man and His Dog"
Try this:
def titleize(text, exceptions):
    exceptions = exceptions.split()
    text = text.split()
    # Capitalize every word that is not on the "exceptions" list;
    # the first word is capitalized no matter what
    for i, word in enumerate(text):
        text[i] = word.title() if word not in exceptions or i == 0 else word
    return ' '.join(text)

print(titleize('a man and his dog', 'a and'))
Output:
A Man and His Dog
def titleize(text, exceptions):
    words = [word if word in exceptions else word.title()
             for word in text.split()]
    # title-case the first word unconditionally; calling .capitalize()
    # on the joined string would lower-case every other word
    words[0] = words[0].title()
    return ' '.join(words)
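For example (assuming exceptions is passed as a collection of lowercase words):
print(titleize('a man and his dog', ['a', 'and']))  # A Man and His Dog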
import re
import nltk
from nltk.corpus import stopwords
from itertools import chain

def setTitleCase(title):
    exceptions = []
    exceptions.append([word for word in stopwords.words('english')])
    exceptions.append([word for word in stopwords.words('portuguese')])
    exceptions.append([word for word in stopwords.words('spanish')])
    exceptions.append([word for word in stopwords.words('french')])
    exceptions.append([word for word in stopwords.words('german')])
    exceptions = list(chain.from_iterable(exceptions))
    list_of_words = re.split(' ', title)
    final = [list_of_words[0].capitalize()]
    for word in list_of_words[1:]:
        word = word.lower()
        if word in exceptions:
            final.append(word)
        else:
            final.append(word.capitalize())
    return " ".join(final)

print(setTitleCase("a Fancy Title WITH many stop Words and other stuff"))
Which gives you the answer: "A Fancy Title with Many Stop Words and other Stuff"

I need to calculate average sentence length in Python

avg_sentence_length is a function which calculates the average length of a sentence:
def avg_sentence_length(text):
    """ (list of str) -> float

    Precondition: text contains at least one sentence.
    A sentence is defined as a non-empty string of non-terminating
    punctuation surrounded by terminating punctuation or beginning or
    end of file. Terminating punctuation is defined as !?.

    Return the average number of words per sentence in text.

    >>> text = ['The time has come, the Walrus said\n',
        'To talk of many things: of shoes - and ships - and sealing wax,\n',
        'Of cabbages; and kings.\n',
        'And why the sea is boiling hot;\n',
        'and whether pigs have wings.\n']
    >>> avg_sentence_length(text)
    17.5
    """
I think you are looking for something like this?
def averageSentence(sentence):
    words = sentence.split()
    average = sum(len(word) for word in words)/len(words)
    print(average)

def main():
    sentence = input("Enter Sentence: ")
    averageSentence(sentence)

main()
output:
Enter Sentence: my name is something
4.25
I am using IDLE 3 and above. If you are working with Python 2.7 or so, the code will be a little bit different.
We can use reduce and lambda.
from functools import reduce

def Average(l):
    avg = reduce(lambda x, y: x + y, l) / len(l)
    return avg

def AVG_SENT_LNTH(File):
    SENTS = [i.split() for i in open(File).read().splitlines()]
    Lengths = [len(i) for i in SENTS]
    return Average(Lengths)

print("Train\t", AVG_SENT_LNTH("Train.dat"))
Though the splitting process depends entirely on your input format.
import doctest
import re

def avg_sentence_length(text):
    r"""(list of str) -> float

    Precondition: text contains at least one sentence.
    A sentence is defined as a non-empty string of non-terminating
    punctuation surrounded by terminating punctuation or beginning or
    end of file. Terminating punctuation is defined as !?.

    Return the average number of words per sentence in text.

    >>> text = ['The time has come, the Walrus said\n',
    ... 'To talk of many things: of shoes - and ships - and sealing wax,\n',
    ... 'Of cabbages; and kings.\n',
    ... 'And why the sea is boiling hot;\n',
    ... 'and whether pigs have wings.\n']
    >>> avg_sentence_length(text)
    17.5
    """
    terminating_punct = "[!?.]"
    punct = r"\W"  # non-word characters

    sentences = [
        s.strip()  # without trailing whitespace
        for s in re.split(
            terminating_punct,
            "".join(text).replace("\n", " "),  # text as 1 string
        )
        if s.strip()  # non-empty
    ]

    def wordcount(s):
        """Split sentence s on punctuation
        and return number of non-empty words
        """
        return len([w for w in re.split(punct, s) if w])

    return sum(map(wordcount, sentences)) / len(sentences)

# test the spec. I just made the docstring raw with 'r'
# and added ... where needed
doctest.run_docstring_examples(avg_sentence_length, globals())
Use word_tokenize to find the words, and then use a list comprehension to keep the alphabetic and digit words.
from nltk import word_tokenize
from functools import reduce

text = ['The time has come, the Walrus said\n',
        'To talk of many things: of shoes - and ships - and sealing wax,\n',
        'Of cabbages; and kings.\n',
        'And why the sea is boiling hot;\n',
        'and whether pigs have wings.\n']

sentences = [[word for word in word_tokenize(sent)
              if word.isalpha() or word.isdigit()] for sent in text]

counts = []
for sentence in sentences:
    counts.append(len(sentence))

# https://www.geeksforgeeks.org/reduce-in-python/
# reduce adds the first two items of the list, then takes the next item
# and accumulates the sum until the end of the list.
def Average(list_counts):
    avg = reduce(lambda x, y: x + y, list_counts) / len(list_counts)
    return avg

print("Average words in the sentence is", Average(counts))
output:
[['The', 'time', 'has', 'come', 'the', 'Walrus', 'said'], ['To', 'talk', 'of', 'many', 'things', 'of', 'shoes', 'and', 'ships', 'and', 'sealing', 'wax'], ['Of', 'cabbages', 'and', 'kings'], ['And', 'why', 'the', 'sea', 'is', 'boiling', 'hot'], ['and', 'whether', 'pigs', 'have', 'wings']]
[7, 12, 4, 7, 5]
Average words in the sentence is 7.0
I have edited the code a bit in this answer, but it should work (I uninstalled Python so I can't test it, sorry; it was to make space on this rubbish laptop that only started with 28GB!). This is the code:
def findAverageSentenceLength(long1, medium2, short3):
    # len(x.split()) counts words; strings have no .length attribute in Python
    S1LENGTH = len(long1.split())
    S2LENGTH = len(medium2.split())
    S3LENGTH = len(short3.split())
    ADDED_LENGTHS = S1LENGTH + S2LENGTH + S3LENGTH
    AVERAGE = ADDED_LENGTHS / 3
    print("The average sentence length is", AVERAGE, "!")

long1input = input("Enter a 17-30 word sentence.")
medium2input = input("Enter a 10-16 word sentence.")
short3input = input("Enter a 5-9 word sentence.")
findAverageSentenceLength(long1input, medium2input, short3input)
Hope this helps.
PS: This WILL only work in Python 3

How do I print words with only 1 vowel?

My code so far; since I'm so lost, it doesn't do anything close to what I want it to do:
vowels = 'a','e','i','o','u','y'
#Consider 'y' as a vowel
input = input("Enter a sentence: ")
words = input.split()
if vowels == words[0]:
    print(words)
So for an input like this:
"this is a really weird test"
I want it to only print:
this, is, a, test
because those words only contain 1 vowel.
Try this:
vowels = set(('a','e','i','o','u','y'))

def count_vowels(word):
    return sum(letter in vowels for letter in word)

my_string = "this is a really weird test"

def get_words(my_string):
    for word in my_string.split():
        if count_vowels(word) == 1:
            print(word)
Result:
>>> get_words(my_string)
this
is
a
test
Here's another option:
import re

words = 'This sentence contains a bunch of cool words'
for word in words.split():
    if len(re.findall('[aeiouy]', word)) == 1:
        print(word)
Output:
This
a
bunch
of
words
You can translate all the vowels to a single vowel and count that vowel:
trans = str.maketrans('aeiouy', 'aaaaaa')
strs = 'this is a really weird test'
print([word for word in strs.split() if word.translate(trans).count('a') == 1])
>>> s = "this is a really weird test"
>>> [w for w in s.split() if len(w) - len(w.translate(str.maketrans('', '', 'aeiouy'))) == 1]
['this', 'is', 'a', 'test']
Not sure if words with no vowels are required. If so, just replace == 1 with < 2.
You may use one for-loop to save the substrings into a string array, splitting whenever the next character is a space. Then, for each substring, check whether it contains exactly one of a, e, i, o, u, y (the vowels); if yes, add it to another array. After that, concatenate all the strings in that second array with commas and spaces.
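A rough sketch of that description (the names are mine, and split() stands in for the manual scan for spaces):
def one_vowel_words(sentence):
    vowels = 'aeiouy'
    result = []
    for word in sentence.split():
        # count the vowels in this word by hand
        count = 0
        for ch in word:
            if ch in vowels:
                count += 1
        if count == 1:
            result.append(word)
    return ', '.join(result)

print(one_vowel_words('this is a really weird test'))  # this, is, a, test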
Try this:
vowels = ('a','e','i','o','u','y')
words = [i for i in input('Enter a sentence ').split() if i != '']
interesting = [word for word in words
               if sum(1 for char in word if char in vowels) == 1]
I found so much nice code here, and I want to show my ugly one:
v = 'aoeuiy'
o = 'oooooo'
sentence = 'i found so much nice code here'
words = sentence.split()
trans = str.maketrans(v, o)
for word in words:
    if not word.translate(trans).count('o') > 1:
        print(word)
I find your lack of regex disturbing.
Here's a plain regex-only solution:
import re

s = "this is a really weird test"
words = re.findall(r"\b[^aeiouy\W]*[aeiouy][^aeiouy\W]*\b", s)
print(words)
