I have sentences that define a template for random combinations:
I like dogs/cats
I want to eat today/(the next day)
I tried using a regex:
m = re.search(r'(?P<list>[A-Za-z]+/([A-Za-z]+)+)', sentence)
words = m.group('list').split('/')
combs = [comb for comb in [sentence.replace(m.group('list'), w) for w in words]]
For the first sentence I get ['i like dogs', 'i like cats'] which is what I want. For the second sentence, re.search returns None. What I would like to get is ['I want to eat today', 'I want to eat the next day'].
How do I need to change the regex?
(I want to eat today)*|(the next day)
is the regex that will select the text you want, but a more general pattern is:
r'(?P<list>[A-Za-z]+/([a-zA-Z]+|\(.+?\)))'
([a-zA-Z]+|\(.+?\)) matches strings like "word" or "(some word)". Since it also matches the surrounding parentheses, we need to remove the leading "(" and trailing ")" using strip:
m = re.search(r'(?P<list>[A-Za-z]+/([a-zA-Z]+|\(.+?\)))', sentence)
words = m.group('list').split('/')
combs = [comb for comb in [sentence.replace(m.group('list'), w.strip('()')) for w in words]]
With the code below you will get something like this:
sentence = 'I want to eat today/(the next day)'
m = re.search(r'(?P<list>[A-Za-z]+/([A-Za-z]+|(\(.*?\))))', sentence)
print m.group('list')
words = m.group('list').split('/')
combs = [comb for comb in [sentence.replace(m.group('list'), w) for w in words]]
print combs

['I want to eat today', 'I want to eat (the next day)']
You could do some extra processing to get rid of the extra parentheses, which should be easy.
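For example, a minimal clean-up that reuses the strip('()') idea from the answer above:

combs = [sentence.replace(m.group('list'), w.strip('()')) for w in words]
# ['I want to eat today', 'I want to eat the next day']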
Ideally, my code revolves around the idea of making a list of the words that I look for in the text. This is my code:
import re

def checkWord(regex):
    resList = []
    with open('wells.txt', 'r') as wordFile:
        for line in wordFile:
            if re.match(regex, line[:-1]):
                resList.append(line[:-1])
    return resList
However, my problem is that the code ends up returning the full line where that word appears, and I just need the word itself. Here's the outcome:
checkWord('He')
['He was full of speculation that night about the condition of Mars, and',
'He remained standing at the edge of the pit that the Thing had made for',
'Henderson stood up with his spade in his hand.',
]
If you want just the word, you can use '\w*{}\w*'.format(regex). I think this will do exactly what you want:
import re

def checkWord(regex):
    resList = []
    with open('wells.txt', 'r') as wordFile:
        for line in wordFile:
            # accumulate the matching words rather than the whole line
            resList += re.findall(r'\w*{}\w*'.format(regex), line, flags=re.IGNORECASE)
    return resList
I used re.IGNORECASE, but I'm not sure whether you want to ignore case; you can remove it.
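As a quick illustration of what the wrapped pattern matches, using one of the lines from the question as the sample text:

import re

sample = 'He remained standing at the edge of the pit that the Thing had made for'
print(re.findall(r'\w*{}\w*'.format('He'), sample, flags=re.IGNORECASE))
# ['He', 'the', 'the', 'the'] -- IGNORECASE also picks up words containing lowercase 'he'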
You can turn the code below into a function, but the idea is to use a list comprehension after .split():
import pandas as pd

df = pd.DataFrame({'Sentence' : ['He was full of speculation that night about the condition of Mars, and',
                                 'He remained standing at the edge of the pit that the Thing had made for',
                                 'Henderson stood up with his spade in his hand.']})

word = 'He'

df['words'] = (df['Sentence'].str.split()
                             .apply(lambda x: sum([True for y in x if y == word])))
Sentence words
0 He was full of speculation that night about th... 1
1 He remained standing at the edge of the pit th... 1
2 Henderson stood up with his spade in his hand. 0
Essentially, you build a list with a True entry for each match and then sum it (Boolean True has the integer value 1).
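A quick illustration of summing booleans in plain Python:

words = 'He was full of speculation'.split()
matches = [True for y in words if y == 'He']
print(sum(matches))  # 1 -- True counts as integer 1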
If you want the sum of all rows, then simply do:
(df['Sentence'].str.split()
.apply(lambda x: sum([True for y in x if y == word]))).sum()
2
I want to remove all the words that contain a specific substring.
Sentence = 'walking my dog https://github.com/'
substring = 'http'
# Remove all words that start with the substring
#...
result = 'walking my dog'
This respects the original spacing in the string without having to fiddle around too much.
import re
string = "a suspect http://string.com with spaces before and after"
starts = "http"
re.sub(f"\\b{starts}[^ ]*[ ]+", "", string)
'a suspect with spaces before and after'
There is a simple approach that we can use for this:
Split the sentence into words.
Find all the words that contain the substring and remove them.
Join back the remaining words.
>>> sentence = 'walking my dog https://github.com/'
>>> substring = 'http'
>>> f = lambda v, w: ' '.join(filter(lambda x: w not in x, v.split(' ')))
>>> f(sentence, substring)
'walking my dog'
Explanation:
1. ' '.join(
2.     filter(
3.         lambda x: w not in x,
4.         v.split(' ')
5.     )
6. )
1 starts with a join. 2 filters the elements produced by 4, which splits the string into words. The condition used to filter (3) is substring not in word. The not in check does an O(len(substring) * len(word)) comparison.
Note: the only step that can be sped up is line 3. Since you are comparing each word against a constant string, you could use Rabin-Karp string matching to find the substring in O(len(word)), or the Z-function to find it in O(len(word) + len(substring)).
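For reference, a minimal sketch of the Z-function variant mentioned above (my own illustration, not part of the original answer):

def z_array(s):
    # z[i] = length of the longest common prefix of s and s[i:]
    n = len(s)
    z = [0] * n
    l = r = 0
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z

def contains(word, substring):
    # O(len(word) + len(substring)) substring test via the Z-function
    s = substring + '\x00' + word   # '\x00' is a separator assumed absent from both strings
    return any(z >= len(substring) for z in z_array(s))

print(contains('https://github.com/', 'http'))  # True
print(contains('walking', 'http'))              # False

In practice CPython's built-in in operator is already heavily optimised, so this is mostly of theoretical interest.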
I want to add text to each item in a list and then put it back into the list.
Example
foods = [pie,apple,bread]
phrase = ('I like')
statement = [(phrase) + (foods) for phrase in foods]
I want to have the output (I like pie, I like apple, I like bread)
It prints out something along the lines of (hI like,pI like,pI like)
You should avoid declaring phrase on a separate line; here is a sample that should do it:
statement = ['I like ' + food for food in foods] # Note the white space after I like
In your code you probably went too quickly, because you are reassigning the value of phrase with for phrase in foods. Also, when you do phrase + foods, foods is a list, which cannot be concatenated with a string.
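A quick demonstration of that second point, using the names from the question:

phrase = 'I like'
foods = ['pie', 'apple', 'bread']
phrase + foods  # raises TypeError: a str cannot be concatenated with a list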
I prefer string formatting:
foods = ['pie','apple','bread']
phrase = 'I like'
statement = ['{0} {1}'.format(phrase, j) for j in foods]
# Python 3.6+
statement = [f'{phrase} {j}' for j in foods]
You can also create a template string and use .format:
phrase = 'I like {}'
statement = [phrase.format(food) for food in ("pie","apple","bread")]
foods = ['pie','apple','bread']
phrase = 'I like'
statement = [phrase + ' ' + j for j in foods]
>>> statement
['I like pie', 'I like apple', 'I like bread']
The problem in your code was that you were using the loop variable incorrectly.
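A minimal reconstruction of that mistake, next to the fixed version:

foods = ['pie', 'apple', 'bread']
phrase = 'I like'
# naming the loop variable 'phrase' shadows the outer string inside the comprehension
bad = [phrase + ' ' + phrase for phrase in foods]   # ['pie pie', 'apple apple', 'bread bread']
good = [phrase + ' ' + food for food in foods]      # ['I like pie', 'I like apple', 'I like bread']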
I have a string, for example 'i cant sleep what should i do', as well as a phrase that is contained in the string, 'cant sleep'. What I am trying to accomplish is to get an n-sized window around the phrase, even if there aren't n words on either side. So in this case, with a window size of 2 (2 words on either side of the phrase), I would want 'i cant sleep what should'.
This is my current solution attempting to find a window of size 2; however, it fails when the number of words to the left or right of the phrase is less than 2. I would also like to be able to use different window sizes.
import re

sentence = 'i cant sleep what should i do'
phrase = 'cant sleep'
sentence_words = re.findall(r'\w+', sentence)
phrase_words = re.findall(r'\w+', phrase)
left = sentence_words.index(phrase_words[0])
right = sentence_words.index(phrase_words[-1])
print sentence_words[left-2:right+3]
You can use the partition method for a non-regex solution:
>>> s='i cant sleep what should i do'
>>> p='cant sleep'
>>> lh, _, rh = s.partition(p)
Then use a slice to get up to two words:
>>> n=2
>>> ' '.join(lh.split()[:n]), p, ' '.join(rh.split()[:n])
('i', 'cant sleep', 'what should')
Your exact output:
>>> ' '.join(lh.split()[:n]+[p]+rh.split()[:n])
'i cant sleep what should'
You would, of course, want to check whether p is in s, or whether the partition succeeds.
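For example, partition returns an empty separator when the phrase is not found, so a simple guard could be:

lh, sep, rh = s.partition(p)
if not sep:
    # p is not in s: partition returned (s, '', ''), so handle that case here
    pass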
As pointed out in the comments, the slice on lh should be negative so that it takes the last n words (thanks Mathias Ettinger):
>>> s='w1 w2 w3 w4 w5 w6 w7 w8 w9'
>>> p='w4 w5'
>>> n=2
>>> lh, _, rh = s.partition(p)
>>> ' '.join(lh.split()[-n:]+[p]+rh.split()[:n])
'w2 w3 w4 w5 w6 w7'
If you define words as entities separated by spaces, you can split your sentences and use regular Python slicing:
def get_window(sentence, phrase, window_size):
    sentence = sentence.split()
    phrase = phrase.split()
    words = len(phrase)
    for i, word in enumerate(sentence):
        if word == phrase[0] and sentence[i:i+words] == phrase:
            start = max(0, i-window_size)
            return ' '.join(sentence[start:i+words+window_size])
sentence = 'i cant sleep what should i do'
phrase = 'cant sleep'
print(get_window(sentence, phrase, 2))
You can also change it to a generator by changing return to yield, which lets you produce all windows when there are several matches of phrase in the sentence (a sketch of the generator follows the example below):
>>> list(gen_window('I dont need it, I need to get rid of it', 'need', 2))
['I dont need it, I', 'it, I need to get']
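For reference, a minimal sketch of that generator (the name gen_window comes from the usage above; the body is just get_window with yield instead of return):

def gen_window(sentence, phrase, window_size):
    sentence = sentence.split()
    phrase = phrase.split()
    words = len(phrase)
    for i, word in enumerate(sentence):
        if word == phrase[0] and sentence[i:i+words] == phrase:
            start = max(0, i - window_size)
            yield ' '.join(sentence[start:i + words + window_size])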
import re

def contains_sublist(lst, sublst):
    n = len(sublst)
    for i in xrange(len(lst)-n+1):
        if sublst == lst[i:i+n]:
            a = max(0, i-2)          # clamp the left edge at the start of the list
            b = min(i+n+2, len(lst))
            return ' '.join(lst[a:b])

sentence = 'i cant sleep what should i do'
phrase = 'cant sleep'
sentence_words = re.findall(r'\w+', sentence)
phrase_words = re.findall(r'\w+', phrase)
print contains_sublist(sentence_words, phrase_words)
You can split words using the built-in string methods, so re shouldn't be necessary. If you want varying window sizes, wrap it in a function like so:
def get_word_window(sentence, phrase, w_left=0, w_right=0):
    w_lst = sentence.split()
    p_lst = phrase.split()
    for i, word in enumerate(w_lst):
        if word == p_lst[0] and w_lst[i:i+len(p_lst)] == p_lst:
            left = max(0, i-w_left)
            right = min(len(w_lst), i+w_right+len(p_lst))
            return w_lst[left:right]
Then you can get the new phrase like so:
>>> sentence='i cant sleep what should i do'
>>> phrase='cant sleep'
>>> ' '.join(get_word_window(sentence,phrase,2,2))
'i cant sleep what should'
What is the best way to check whether two words appear in order in a sentence, and how many times that occurs, in Python?
For example: I like to eat maki sushi and the best sushi is in Japan.
words are: [maki, sushi]
Thanks.
The code
import re

x = "I like to eat maki sushi and the best sushi is in Japan"
x1 = re.split(r'\W+', x)
l1 = [i for i, m in enumerate(x1) if m == "maki"]
l2 = [i for i, m in enumerate(x1) if m == "sushi"]

ordered = []
for i in l1:
    for j in l2:
        if j == i+1:
            ordered.append((i, j))
print ordered
Judging from the added code, you mean that the words are adjacent?
Why not just put them together:
print len(re.findall(r'\bmaki sushi\b', sent))
def ordered(string, words):
    pos = [string.index(word) for word in words]
    return pos == sorted(pos)

s = "I like to eat maki sushi and the best sushi is in Japan"
w = ["maki", "sushi"]
ordered(s, w)  # Returns True.
Not exactly the most efficient way of doing it but simpler to understand.
s = 'I like to eat maki sushi and the best sushi is in Japan'
check order
indices = [s.split().index(w) for w in ['maki', 'sushi']]
sorted(indices) == indices
how to count
s.split().count('maki')
Note (based on discussion below):
Suppose the sentence is 'I like makim more than sushi or maki'. Since makim is a different word from maki, the word maki actually comes after sushi and occurs only once in the sentence. To detect this and count correctly, the sentence must be split on spaces into actual words.
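A quick illustration of the difference:

s = 'I like makim more than sushi or maki'
print(s.count('maki'))          # 2 -- the substring count also hits 'makim'
print(s.split().count('maki'))  # 1 -- counting whole words only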
words = ["sushi", "maki", "xxx"]
sorted_words = sorted(words)
sen = "I like to eat maki sushi and the best sushi is in Japan xxx"
ind = map(lambda x: sen.index(x), sorted_words)
res = reduce(lambda a, b: b - a, ind)

If res > 0, the words are sorted in the sentence.
A regex solution :)
import re
sent = 'I like to eat maki sushi and the best sushi is in Japan'
words = sorted(['maki', 'sushi'])
assert re.search(r'\b%s\b' % r'\b.*\b'.join(words), sent)
Just an idea; it might need some more work:
(sentence.index('maki') <= sentence.index('sushi')) == ('maki' <= 'sushi')
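Evaluated against the example sentence, for instance:

sentence = 'I like to eat maki sushi and the best sushi is in Japan'
print((sentence.index('maki') <= sentence.index('sushi')) == ('maki' <= 'sushi'))
# True: 'maki' appears before 'sushi' in the sentence and also sorts before it alphabetically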