How to check if two words are ordered in python - python

What is the best way to check whether two words appear in order in a sentence, and how many times that occurs, in Python?
For example: I like to eat maki sushi and the best sushi is in Japan.
words are: [maki, sushi]
Thanks.
The code
import re
x = "I like to eat maki sushi and the best sushi is in Japan"
x1 = re.split(r'\W+', x)
l1 = [i for i, m in enumerate(x1) if m == "maki"]
l2 = [i for i, m in enumerate(x1) if m == "sushi"]
ordered = []
for i in l1:
    for j in l2:
        if j == i + 1:
            ordered.append((i, j))
print(ordered)

According to the added code, you mean that words are adjacent?
Why not just put them together:
print(len(re.findall(r'\bmaki sushi\b', sent)))  # sent is the sentence string

def ordered(string, words):
    pos = [string.index(word) for word in words]
    return pos == sorted(pos)
s = "I like to eat maki sushi and the best sushi is in Japan"
w = ["maki", "sushi"]
ordered(s, w)  # Returns True.
Not exactly the most efficient way of doing it but simpler to understand.
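Note that str.index matches substrings and only reports the first occurrence, so the check above can be fooled by a word embedded in a longer one. Here is a minimal sketch of a token-based alternative, assuming words are separated by non-word characters. The helper name in_order is made up for illustration.

```python
import re

def in_order(sentence, words):
    """Return True if `words` occur in `sentence` in the given order."""
    # Tokenize on runs of word characters so punctuation is ignored.
    tokens = iter(re.findall(r"\w+", sentence))
    # `word in tokens` consumes the iterator up to the first match,
    # so each following word is searched only in the remaining tokens.
    return all(word in tokens for word in words)

s = "I like to eat maki sushi and the best sushi is in Japan"
print(in_order(s, ["maki", "sushi"]))  # True
print(in_order(s, ["sushi", "maki"]))  # False
```

This only answers the ordering question; it does not require the words to be adjacent.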

s = 'I like to eat maki sushi and the best sushi is in Japan'
check order
indices = [s.split().index(w) for w in ['maki', 'sushi']]
sorted(indices) == indices
how to count
s.split().count('maki')
Note (based on discussion below):
Suppose the sentence is 'I like makim more than sushi or maki'. Since makim is a different word from maki, the word maki comes after sushi and occurs only once in the sentence. To detect this and count correctly, the sentence must be split on spaces into its actual words.
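If the goal is to count how often one word is immediately followed by the other, as in the question's own code, the split tokens can be paired up with zip. This is only a sketch, assuming whitespace-separated words with no punctuation attached. count_adjacent is a hypothetical helper name.

```python
def count_adjacent(sentence, first, second):
    """Count how often `first` is immediately followed by `second`."""
    tokens = sentence.split()
    # zip(tokens, tokens[1:]) yields every pair of neighbouring words.
    return sum(1 for a, b in zip(tokens, tokens[1:])
               if a == first and b == second)

s = "I like to eat maki sushi and the best sushi is in Japan"
print(count_adjacent(s, "maki", "sushi"))  # 1
print(count_adjacent("I like makim more than sushi or maki", "maki", "sushi"))  # 0
```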

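from functools import reduce  # built in on Python 2, must be imported on Python 3
words = ["sushi", "maki", "xxx"]
sorted_words = sorted(words)
sen = " I like to eat maki sushi and the best sushi is in Japan xxx"
ind = [sen.index(x) for x in sorted_words]
# Note: folding with b - a only compares positions reliably for two words
res = reduce(lambda a, b: b - a, ind)
if res > 0:
    print("words are sorted in the sentence")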

A regex solution :)
import re
sent = 'I like to eat maki sushi and the best sushi is in Japan'
words = sorted(['maki', 'sushi'])
assert re.search(r'\b%s\b' % r'\b.*\b'.join(words), sent)

Just an idea; it might need some more work:
(sentence.index('maki') <= sentence.index('sushi')) == ('maki' <= 'sushi')

Related

Looking for Word Sequences Inside of a String

I haven't seen anyone solve something like this without using regex. I have a text = "I have three apples because yesterday I bought three apples", a list of key words words = ['I have', 'three', 'apples', 'yesterday'], and k = 2. I have to write a function that finds and returns word sequences of >= k 'words' (by words I mean word combinations identified as a single element in the list, e.g. 'I have' is considered a word).
In this case, it should return ['I have three apples', 'three apples']. Even though 'yesterday' is in the string, it's < k, so it doesn't match.
I'm assuming there needs to be a stack to keep track of the size of the sequences. I started writing the code without one, but 1) it doesn't work in this situation, because I try checking 'I have', then 'I have three', etc., and it can't identify just 'three apples'; 2) I don't know how to proceed with it. Here's the code:
text = "I have three apples because yesterday I bought three apples"
words = ['I have', 'three', 'apples', 'yesterday']
k = 2
check1 = []
def search(text, words, k):
    for i in words:
        finding = text.count(i)
        if finding != 0:
            check1.append(i)
            check2 = ' '.join(check1)
            occurrences = text.count(check2)
            if occurrences > 0:
                #i want to check if the previous number of occurrences was the same
                #that's why I think I need a stack. if it is, i keep going
                #if it's not, i append the previous phrases to the list if they're >= k and keep
                #checking
                pass
            else:
                #the next word doesn't belong to the sequence, so we finish the process
                #by adding the right number of word sequences >= k to the resulting list
                pass
        else:
            #the word is not in the list and I need to add check2 to the list
            #considering all word sequences
            pass
Different ways of solving this or any ideas are very much appreciated because I'm stuck on trying to solve it this way and I don't know how to implement it.
I found a solution by walking through the text, noting the words in the correct order. The complexity of this algorithm increases quickly with the length of your word list and the length of text, however. Depending on the application, you might want to do it differently:
def walk(t, w, k):
    t += ' '
    node = -1
    current = []
    collection = []
    while len(t) > 1:
        elong = False
        for i in range(len(w)):
            if i > node and t[:len(w[i])] == w[i]:
                node = i
                t = t[len(w[i])+1:]
                current.append(w[i])
                elong = True
        if not elong or len(t) < 2:
            t = t[t.find(' ')+1:]
            if len(current) >= k: collection.append(' '.join(current))
            current = []
            node = -1
    return collection
This function would handle the requests you mentioned in the question as follows:
#Input:
print(walk("I have three apples because yesterday I bought three apples",
['I have', 'three', 'apples', 'yesterday'],
2))
#Output:
['I have three apples', 'three apples']
#Input:
print(walk("Is three apples all I have three",
['I have', 'three', 'apples', 'yesterday'],
2))
#Output:
['three apples', 'I have three']
It heavily relies on there being blanks that separate the words and would not cope well with punctuation. You might want to include some preprocessing.
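As a rough idea of what such preprocessing could look like, the sketch below replaces punctuation with spaces and collapses repeated whitespace before the text is handed to walk. The function name normalize is an assumption for illustration. It also splits contractions such as "don't" into two tokens, which may or may not be what you want.

```python
import re

def normalize(text):
    """Replace punctuation with spaces and collapse runs of whitespace."""
    # Anything that is neither a word character nor whitespace becomes a space.
    no_punct = re.sub(r"[^\w\s]", " ", text)
    # split() without arguments drops the extra whitespace this creates.
    return " ".join(no_punct.split())

print(normalize("I have three apples, because yesterday I bought three apples."))
# I have three apples because yesterday I bought three apples
```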

How do I turn a series of IF statements into a Pythonic code line?

I am trying to improve my list comprehension capabilities. I would like to turn the following into a Pythonic code line:
# Sample sentence
sample_sentence = "Once upon a time, three people walked into a bar, Once upon a time, three people walked into a bar"
# Empty list to hold words from sentence
words = []
# Empty lists to hold longest and shortest items
longest = []
shortest = []
# Pythonic method to split sentence into list of words
words = [c for c in sample_sentence.split()]
# Non-Pythonic routine to find the shortest and longest words
for i in words:
    if len(i) <= shortest_len:
        # shortest = [] # Reset list
        shortest.append(i)
    if len(i) >= longest_len:
        longest.append(i)
print(f"Shortest item: {', '.join(shortest)}")
print(f"Longest item: {', '.join(longest)}")
I have tried creating a Pythonic version of the non-Pythonic routine to find the shortest and longest words:
shortest = [i for i in words if len(i) <= shortest_len: shortest.append(i) ]
print(shortest)
longest = [i for i in words if len(i) >= longest_len: longest.append(i)]
print(longest)
But it gets an invalid syntax error.
How can I construct the above two code lines and also is it possible to use a single Pythonic line to combine the two?
Don't make the mistake of thinking that everything has to be a clever one-liner. It is better to write code that you will be able to look at in the future and know exactly what it will do. In my opinion, this is a very Pythonic way to write your script. Yes, there are improvements that could be made, but don't optimize if you don't have to. This script reads two sentences; if it were reading entire books, then improvements would be worth making.
As said in the comments, everyone has their own definition of "Pythonic" and their own formatting. Personally, I don't think you need the comments when it is very clear what your code is doing. For instance, you don't need to comment "Sample Sentence" when your variable is named sample_sentence.
sample_sentence = "Once upon a time, three people walked into a bar, Once upon a time, three people walked into a bar"
words = sample_sentence.split()
words.sort(key=len)
short_len = len(words[0])
long_len = len(words[-1])
shortest_words = [word for word in words if len(word) == short_len]
longest_words = [word for word in words if len(word) == long_len]
print(f"Shortest item: {', '.join(shortest_words)}")
print(f"Longest item: {', '.join(longest_words)}")
Try this for your list comprehension case to resolve the syntax error -
shortest = [i for i in words if len(i) <= shortest_len]
print(shortest)
longest = [i for i in words if len(i) >= longest_len]
print(longest)
Also instead of words = [c for c in sample_sentence.split()] you can use words = sample_sentence.split()
To resolve the SyntaxError, you must just remove the : shortest.append(i) and : longest.append(i). The lists will be as expected, then.
is it possible to use a single Pythonic line to combine the two ?
No, it is not possible to build up both lists on one line. (Except that you can of course put both list constructions on one line:
shortest = [i for i in words if len(i) <= shortest_len]; longest = [i for i in words if len(i) >= longest_len]
). But I don't think that was your goal …
An alternative code to get the shortest and longest words:
sample_sentence = "Once upon a time, three people walked into a bar, Once upon a time, three people walked into a bar"
my_list = list(set([(token,len(token)) for token in sample_sentence.split()]))
longest_word = ', '.join(i[0] for i in my_list if i[1]==max([x[1] for x in my_list]))
shortest_word = ', '.join(i[0] for i in my_list if i[1]==min([x[1] for x in my_list]))
print(f"longest words: {longest_word}")
print(f"shortest words: {shortest_word}")
output
longest words: people, walked
shortest words: a
First of all, you need to define shortest_len and longest_len somewhere. You can use this:
shortest_len = min([len(word) for word in words])
longest_len = max([len(word) for word in words])
shortest = [word for word in words if len(word) == shortest_len]
longest = [word for word in words if len(word) == longest_len]
It is possible to combine the two lines into
shortest, longest = [word for word in words if len(word) == shortest_len], [word for word in words if len(word) == longest_len]
If you want to go really crazy, you can compact it as far as this:
sample_sentence = "Once upon a time, three people walked into a bar, Once upon a time, three people walked into a bar"
print(f"Shortest item: {', '.join([word for word in [c for c in sample_sentence.split()] if len(word) == min([len(word) for word in sample_sentence.split()])])}")
print(f"Longest item: {', '.join([word for word in [c for c in sample_sentence.split()] if len(word) == max([len(word) for word in sample_sentence.split()])])}")
or even join these two into one print to get a one-liner:
sample_sentence = "Once upon a time, three people walked into a bar, Once upon a time, three people walked into a bar"
print(f"Shortest item: {', '.join([word for word in [c for c in sample_sentence.split()] if len(word) == min([len(word) for word in sample_sentence.split()])])}\nLongest item: {', '.join([word for word in [c for c in sample_sentence.split()] if len(word) == max([len(word) for word in sample_sentence.split()])])}")
But, as already pointed out by @big_bad_bison, the shortest solution is not always the same as the best solution, and the readability of your code (for yourself, over months or years; or for others, if you ever need to work in a team or maybe ask for help) is much more important.
Many one-liners are not cleanly Pythonic because they make for a line of code that is too long. PEP 8 - Style Guide for Python Code (definitely suggested reading!) states clearly:
Limit all lines to a maximum of 79 characters.
This is a one-liner to find the length of every word in your sentence and sort them from the shortest to the longest:
words_len = sorted(list(set([(c, len(c)) for c in sample_sentence.split()])), key=lambda t: t[1])
Or, as was pointed out to me in the comments, you can achieve the same by doing this:
words = sorted(list(set(sample_sentence.split())), key=len)
print(f"shortest: {words[0]}\nlongest: {words[-1]}")
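One caveat with the set-based versions: a set has no defined order, so ties may come out in any order, and words[0] / words[-1] return only one word each. A possible sketch that removes duplicates while keeping first-seen order and returns all ties is shown below. The function name shortest_and_longest is made up, and an empty sentence would raise ValueError from min().

```python
def shortest_and_longest(sentence):
    """Return (shortest_words, longest_words) without duplicates."""
    # dict.fromkeys drops duplicates but keeps first-seen order.
    words = list(dict.fromkeys(sentence.split()))
    lengths = [len(word) for word in words]
    lo, hi = min(lengths), max(lengths)
    shortest = [word for word in words if len(word) == lo]
    longest = [word for word in words if len(word) == hi]
    return shortest, longest

sample_sentence = ("Once upon a time, three people walked into a bar, "
                   "Once upon a time, three people walked into a bar")
shortest, longest = shortest_and_longest(sample_sentence)
print(f"Shortest item: {', '.join(shortest)}")  # Shortest item: a
print(f"Longest item: {', '.join(longest)}")    # Longest item: people, walked
```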

Get middle term from list of sentences using start and end phrases

I have a list of sentences, I need to find the start phrase and end phrase in that sentence if present and get the middle element. If the middle element has more than one word it should skip and move to the next occurrence.
list of sentences
para_list = [["hello sir good afternoon calling to you from news18 curious"],["a pleasant welcome from enws18 team what can i"], ["hi a good afternoon sir from news18"]]
start phrase
to_find_s =['good','afternoon']
end phrase
to_find_l = ['from','news18']
Code
for i, w in enumerate(para_list):
    l = [sentence for sentence in w if all(word in w for word in to_find_s)]
    if l:
        m = [sentence for sentence in w if all(word in w for word in to_find_l)]
Output
I am getting the sentences in which the phrases are present but not able to get the middle term
Expected Output
list = ['sir'] #from the last list. There would not be any item from first list as it has two words in between-'to you'
The following function will do the work.
def find_middle_phrase(para_list, to_find_s, to_find_l):
    output_list = []
    start_phrase, end_phrase = " ".join(to_find_s), " ".join(to_find_l)
    for para in para_list:
        para_string = para[0]
        if para_string.find(start_phrase) != -1 and para_string.find(end_phrase) != -1:
            required_phrase_starting_index = para_string.index(start_phrase) + len(start_phrase)
            required_phrase_ending_index = para_string.index(end_phrase)
            required_output_string = para_string[required_phrase_starting_index: required_phrase_ending_index].strip()
            if required_output_string.find(" ") == -1:
                output_list.append(required_output_string)
    return output_list
EXAMPLE:
para_list = [["hello sir good afternoon calling to you from news18 curious"],["a pleasant welcome from enws18 team what can i"], ["hi a good afternoon sir from news18"]]
to_find_s =['good','afternoon']
to_find_l = ['from','news18']
expected_output = find_middle_phrase(para_list, to_find_s, to_find_l)
print(expected_output)
Got output:
['sir']
This produces the expected output:
for sentence in para_list:
    words = sentence[0].split()
    # stop 4 short of the end so that words[i+4] is always a valid index
    for i in range(len(words) - 4):
        if words[i] == to_find_s[0] and words[i+1] == to_find_s[1]:
            if words[i+3] == to_find_l[0] and words[i+4] == to_find_l[1]:
                m = words[i+2]
                print(m)
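Both answers above look at fixed positions or only the first occurrence of each phrase. As an alternative sketch, a regular expression can require exactly one whitespace-free token between the start and end phrases and will move on to later occurrences by itself. The name middle_terms is hypothetical, and the sketch assumes each inner list holds a single sentence string, as in the question.

```python
import re

def middle_terms(para_list, start_words, end_words):
    """Collect single words found between the start and end phrases."""
    # Escape each word and allow any whitespace between phrase words.
    start = r"\s+".join(re.escape(w) for w in start_words)
    end = r"\s+".join(re.escape(w) for w in end_words)
    # (\S+) captures exactly one token, so multi-word gaps do not match.
    pattern = re.compile(r"\b" + start + r"\s+(\S+)\s+" + end + r"\b")
    results = []
    for para in para_list:
        results.extend(pattern.findall(para[0]))
    return results

para_list = [["hello sir good afternoon calling to you from news18 curious"],
             ["a pleasant welcome from enws18 team what can i"],
             ["hi a good afternoon sir from news18"]]
print(middle_terms(para_list, ['good', 'afternoon'], ['from', 'news18']))
# ['sir']
```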

Compare spans in a list and return a label if similar

I want to build a list of labels to match the words in a sentence
sentence="I am a healthy boy who lives in Florida"
subject= "I am a healthy boy"
first I split:
sen = sentence.split(" ")
subj = subject.split(" ")
then I try to iterate.
tokens=[]
tokensStartIndex = 0
for i in range(0, len(sen)):
    ...
I want to have this as a result:
sentence = "I am a healthy boy who lives in Florida"
labels = [sub sub sub sub sub 0 0 0 0]
and how do I iterate this over a dataframe of columns sentence and subject?
It can be done quite easily using regular expressions.
sentence="I am a healthy boy who lives in Florida"
subject= "I am a healthy boy"
Build the expression
abc = '({})'.format('|'.join(re.escape(y) for y in sorted(subject, key=len, reverse=True)))
Iterate the expression over the sentence
xyz = (re.sub(abc, r"sub" , sent) for sent in sentence)
Build new list and filter
new = [new_sentence for old_sentence, new_sentence in zip(sentence, xyz) if old_sentence != new_sentence]
new
voila!
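Be aware that with plain strings, as defined above, sorted(subject, ...) and "for sent in sentence" iterate character by character, so that snippet seems to be intended for lists of strings. A minimal sketch that produces the label list asked for in the question, without regex, might look like this. label_tokens and its parameters are made-up names, and it labels only the first place where the subject's words occur contiguously.

```python
def label_tokens(sentence, subject, label="sub", other="0"):
    """Label each token of `sentence`: `label` if part of `subject`, else `other`."""
    sen = sentence.split()
    subj = subject.split()
    labels = [other] * len(sen)
    n = len(subj)
    # Slide a window of the subject's length over the sentence tokens.
    for i in range(len(sen) - n + 1):
        if sen[i:i + n] == subj:
            labels[i:i + n] = [label] * n
            break  # only the first occurrence is labelled
    return labels

sentence = "I am a healthy boy who lives in Florida"
subject = "I am a healthy boy"
print(label_tokens(sentence, subject))
# ['sub', 'sub', 'sub', 'sub', 'sub', '0', '0', '0', '0']
```

For a pandas DataFrame with columns named sentence and subject, something like df.apply(lambda row: label_tokens(row['sentence'], row['subject']), axis=1) should apply it row by row.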

Create all combinations from regex

I have sentences that define a template for random combinations:
I like dogs/cats
I want to eat today/(the next day)
I tried using a regex:
m = re.search(r'(?P<list>[A-Za-z]+/([A-Za-z]+)+)', sentence)
words = m.group('list').split('/')
combs = [comb for comb in [sentence.replace(m.group('list'), w) for w in words]]
For the first sentence I get ['i like dogs', 'i like cats'] which is what I want. For the second sentence, re.search returns None. What I would like to get is ['I want to eat today', 'I want to eat the next day'].
How do I need to change the regex?
(I want to eat today)*|(the next day)
Is the regex that will select the text you want...
r'(?P<list>[A-Za-z]+/([a-zA-Z]+|\(.+?\)))'
([a-zA-Z]+|\(.+?\)) matches strings like "word" or "(some word)". Since the match includes the surrounding parentheses, we need to remove the leading "(" and trailing ")" using strip.
m = re.search(r'(?P<list>[A-Za-z]+/([a-zA-Z]+|\(.+?\)))', sentence)
words = m.group('list').split('/')
combs = [comb for comb in [sentence.replace(m.group('list'), w.strip('()')) for w in words]]
With the code below you will get something like:
sentence = 'I want to eat today/(the next day)'
m = re.search(r'(?P<list>[A-Za-z]+/([A-Za-z]+|(\(.*?\))))', sentence)
print(m.group('list'))
words = m.group('list').split('/')
combs = [comb for comb in [sentence.replace(m.group('list'), w) for w in words]]
print(combs)
['I want to eat today', 'I want to eat (the next day)']
You could do some extra processing to get rid of the extra parentheses, which should be easy.
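If a template can contain more than one slash group, re.search plus replace only handles the first one. The sketch below is one possible generalization using itertools.product. The pattern ALT and the function expand are assumptions for illustration, and they expect each option to be a single word or a parenthesized phrase that contains no slash or nested parentheses.

```python
import itertools
import re

# One option is a bare word or a parenthesized phrase; a group is
# two or more options joined by "/".
OPTION = r"(?:\([^()/]+\)|[A-Za-z]+)"
ALT = re.compile(OPTION + r"(?:/" + OPTION + r")+")

def expand(template):
    """Return every sentence obtainable by picking one option per group."""
    parts = []
    pos = 0
    for m in ALT.finditer(template):
        parts.append([template[pos:m.start()]])  # literal text before the group
        parts.append([opt.strip("()") for opt in m.group(0).split("/")])
        pos = m.end()
    parts.append([template[pos:]])  # literal text after the last group
    return ["".join(choice) for choice in itertools.product(*parts)]

print(expand("I like dogs/cats"))
# ['I like dogs', 'I like cats']
print(expand("I want to eat today/(the next day)"))
# ['I want to eat today', 'I want to eat the next day']
```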
