Finding all occurrences + substrings of a word

I have a 'main' word, "LAUNCHER", and two other words, "LAUNCH" and "LAUNCHER". I want to find out, using regex, which of these words occur in the main word. I'm using re.findall with the regex "(LAUNCH)|(LAUNCHER)", but this only returns LAUNCH and not both of them. How do I fix this?
import re
mainword = "launcher"
words = "(launch|launcher)"
matches = re.findall(words,mainword)
for match in matches:
    print(match)

You can try something like this:
import re
mainword = "launcher"
words = "(launch|launcher)"
for x in (re.findall(r"[A-Za-z##]+|\S", words)):
    if x in mainword:
        print(x)
result:
launch
launcher

If you're not required to use regular expressions, this can be done more efficiently with the in operator and a simple loop or list comprehension:
mainWord = "launcher"
words = ["launch","launcher"]
matches = [ word for word in words if word in mainWord ]
# case insensitive...
matchWord = mainWord.lower()
matches = [ word for word in words if word.lower() in matchWord ]
Even if you do require regex, a loop is needed because re.findall() never returns overlapping matches:
import re
pattern = re.compile("launcher|launch")
mainWord = "launcher"
matches = []
startPos = 0
lastMatch = None
while startPos < len(mainWord):
    if lastMatch : match = pattern.match(mainWord,lastMatch.start(),lastMatch.end()-1)
    else : match = pattern.match(mainWord,startPos)
    if not match:
        if not lastMatch : break
        startPos = lastMatch.start() + 1
        lastMatch = None
        continue
    matches.append(mainWord[match.start():match.end()])
    lastMatch = match
print(matches)
Note that, even with this loop, the longer words must appear before the shorter ones when you use the | operator in the regular expression. This is because alternation with | is ordered rather than longest-match: the engine tries the alternatives from left to right and takes the first one that matches, not the longest one.
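As a rough sketch of another way around the overlap limitation, each candidate word can be searched for separately with a zero-width lookahead, so that no match consumes text needed by another. The helper name find_words is hypothetical and not part of the answers above.

```python
import re

def find_words(main_word, candidates):
    """Return (word, start) pairs for every occurrence, overlapping ones included."""
    hits = []
    for word in candidates:
        # The zero-width lookahead consumes nothing, so matches may overlap.
        for m in re.finditer(f"(?={re.escape(word)})", main_word):
            hits.append((word, m.start()))
    return hits

print(find_words("launcher", ["launch", "launcher"]))
# [('launch', 0), ('launcher', 0)]
```

Because each word gets its own pattern, the order of the candidates no longer affects which ones are found.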

Related

Filter a list of strings by a char in same position

I am trying to make a simple function that gets three inputs: a list of words, list of guessed letters and a pattern. The pattern is a word with some letters hidden with an underscore. (for example the word apple and the pattern '_pp_e')
For some context it's a part of the game hangman where you try to guess a word and this function gives a hint.
I want this function to return a filtered list of the input words: a word is kept only if it contains none of the guessed letters and has the same letters in the same positions as the given pattern.
I tried making this work with three loops.
The first loop filters all words by the same length as the pattern.
The second loop checks for similarity between the pattern and the given word. If a word contains one of the pattern's letters but not in the same position, I filter it out.
The final loop checks that the filtered word does not contain any letters from the given guessed list.
I have not had much success making it work, and I would love some help. Any tips for making the code shorter (without using third-party libraries) would also be very much appreciated.
Thanks in advance!
Example: pattern: "d _ _ _ _ a _ _ _ _", guessed letter list ['b','c'], and the word list contains all the words in English.
output list: ['delegating', 'derogation', 'dishwasher']
this is the code for more context:
def filter_words_list(words, pattern, wrong_guess_lst):
    lst_return = []
    lst_return_2 = []
    lst_return_3 = []
    new_word = ''
    for i in range(len(words)):
        if len(words[i]) == len(pattern):
            lst_return.append(words[i])
    pattern = list(pattern)
    for i in range(len(lst_return)):
        count = 0
        word_to_check = list(lst_return[i])
        for j in range(len(pattern)):
            if pattern[j] == word_to_check[j] or (pattern[j] == '_' and
                                                  (not (word_to_check[j] in
                                                        pattern))):
                count += 1
        if count == len(pattern):
            lst_return_2.append(new_word.join(word_to_check))
    for i in range(len(lst_return_2)):
        word_to_check = lst_return_2[i]
        for j in range(len(wrong_guess_lst)):
            if word_to_check.find(wrong_guess_lst[j]) == -1:
                lst_return_3.append(word_to_check)
    return lst_return_3

The easiest, and likely quite efficient, way to do this would be to translate your pattern into a regular expression, if regular expressions are in your "toolbox". (The re module is in the standard library.)
In a regular expression, . matches any single character. So, we replace all _s with .s and add "^" and "$" to anchor the regular expression to the whole string.
import re
def filter_words(words, pattern, wrong_guesses):
    re_pattern = re.compile("^" + re.escape(pattern).replace("_", ".") + "$")
    # get words that
    # (a) are the correct length
    # (b) aren't in the wrong guesses
    # (c) match the pattern
    return [
        word
        for word in words
        if (
            len(word) == len(pattern) and
            word not in wrong_guesses and
            re_pattern.match(word)
        )
    ]
all_words = [
    "cat",
    "dog",
    "mouse",
    "horse",
    "cow",
]
print(filter_words(all_words, "c_t", []))
print(filter_words(all_words, "c__", []))
print(filter_words(all_words, "c__", ["cat"]))
prints out
['cat']
['cat', 'cow']
['cow']
If you don't care for using regexps, you can instead translate the pattern to a dict mapping each defined position to the character that should be found there:
def filter_words_without_regex(words, pattern, wrong_guesses):
    # map each defined position in the pattern to the letter expected there
    letter_map = {i: letter for i, letter in enumerate(pattern) if letter != "_"}
    # get words that
    # (a) are the correct length
    # (b) aren't in the wrong guesses
    # (c) have the correct letters in the correct positions
    return [
        word
        for word in words
        if (
            len(word) == len(pattern) and
            word not in wrong_guesses and
            all(word[i] == ch for i, ch in letter_map.items())
        )
    ]
The result is the same.
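The question describes the third argument as a list of guessed letters, whereas the two functions above compare whole words against it. A minimal sketch of the letter-based reading follows, assuming that a hidden position may hold neither a wrongly guessed letter nor a letter already revealed in the pattern. The name filter_by_letters is hypothetical.

```python
def filter_by_letters(words, pattern, wrong_letters):
    """Keep words that fit the pattern and avoid the wrongly guessed letters."""
    revealed = {ch for ch in pattern if ch != "_"}
    # Letters that may not sit in a hidden ("_") position.
    banned_hidden = revealed | set(wrong_letters)

    def fits(word):
        if len(word) != len(pattern):
            return False
        for p, w in zip(pattern, word):
            if p == "_":
                if w in banned_hidden:
                    return False
            elif p != w:
                return False
        return True

    return [word for word in words if fits(word)]

print(filter_by_letters(["cat", "cot", "cut", "dog", "act", "coat"], "c_t", ["a"]))
# ['cot', 'cut']
```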

Probably not the most efficient, but this should work:
def filter_words_list(words, pattern, wrong_guess_lst):
    fewer_words = [w for w in words if not any([wgl in w for wgl in wrong_guess_lst])]
    equal_len_words = [w for w in fewer_words if len(w) == len(pattern)]
    pattern_indices = [idl for idl, ltr in enumerate(pattern) if ltr != '_']
    word_indices = [[idl for idl, ltr in enumerate(w) if ((ltr in pattern) and (ltr != '_'))] for w in equal_len_words]
    # all() is required here: a bare generator expression is always truthy
    out = [w for wid, w in zip(word_indices, equal_len_words) if ((wid == pattern_indices) and all(w[pid] == pattern[pid] for pid in pattern_indices))]
    return out
The idea is to first remove all words that have letters in your wrong_guess_lst.
Then, remove everything which does not have the same length (you could also merge this condition in the first one..).
Next, for both pattern and your remaining words, you create a pattern mask, which indicates the positions of non '_' letters.
To be a candidate, the masks have to be identical AND the letters in these positions have to be identical as well.
Note that I replaced a lot of the for loops in your code with list comprehensions. A list comprehension is a very useful construct, especially if you don't want to use other libraries.
Edit: I cannot really tell you where your code went wrong, as it was a little too long for me.

The regex rule is constructed explicitly, so no separate check on the word's length is needed. To achieve this, the groupby function from the itertools module of the standard library is used:
'_ b _ _ _' --regex--> r'^.{1}b.{3}$'
Here is how to filter the dictionary by a guess string:
import itertools as it
import re
# sample dictionary
dictionary = "a ability able about above accept according account across act action activity actually add address"
dictionary = dictionary.split()
guess = '_ b _ _ _'
guess = guess.replace(' ', '') # remove white spaces
# construction of the regex rule
regex = r'^'
for _, i in it.groupby(guess, key=lambda x: x == '_'):
    if '_' in (l:=list(i)):
        regex += ''.join(f'.{{{len(l)}}}') # escape the curly brackets
    else:
        regex += ''.join(l)
regex += '$'
# processing the regex rule
pattern = re.compile(regex)
# filter the dictionary by the rule
l = [word for word in dictionary if pattern.match(word)]
print(l)
Output
['about', 'above']
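The same rule can also be built without groupby. The sketch below uses re.sub with a replacement function to turn each run of underscores into a counted wildcard. It assumes the guess contains only letters, underscores, and spaces (no regex metacharacters), and the name guess_to_regex is hypothetical.

```python
import re

def guess_to_regex(guess):
    """Translate a guess such as '_ b _ _ _' into a compiled, anchored regex."""
    guess = guess.replace(" ", "")
    # Replace every run of underscores with ".{n}", where n is the run length.
    body = re.sub(r"_+", lambda m: f".{{{len(m.group())}}}", guess)
    return re.compile(f"^{body}$")

words = "a ability able about above accept".split()
pattern = guess_to_regex("_ b _ _ _")
print(pattern.pattern)                             # ^.{1}b.{3}$
print([w for w in words if pattern.match(w)])      # ['about', 'above']
```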

How to use regex to only keep first n repeated words

If I have an input sentence
input = 'ok ok, it is very very very very very hard'
and what I want to do is keep only the first three occurrences of any repeated word:
output = 'ok ok, it is very very very hard'
How can I achieve this with the re or regex module in Python?

One option could be to use a capturing group with a backreference and use that in the replacement.
((\w+)(?: \2){2})(?: \2)*
Explanation
(  Open capture group 1
(\w+)  Capture group 2: match 1+ word characters. (The example data only uses word characters; to make sure the match is not part of a larger word, add a word boundary \b.)
(?: \2){2}  Repeat 2 times: match a space and a backreference to group 2. Instead of a single space you could use [ \t]+ to match 1+ spaces or tabs, or \s+ to match 1+ whitespace characters (note that \s would also match a newline).
)  Close group 1
(?: \2)*  Match 0+ times a space and a backreference to group 2; these are the extra repetitions that get removed
For example
import re
regex = r"((\w+)(?: \2){2})(?: \2)*"
s = "ok ok, it is very very very very very hard"
result = re.sub(regex, r"\1", s)
if result:
    print(result)
Result
ok ok, it is very very very hard

You can capture a word in a group and use a backreference to it to check that it occurs more than three times in a row:
import re
s = 'ok ok, it is very very very very very hard'
print(re.sub(r'\b((\w+)(?:\s+\2){2})(?:\s+\2)+\b', r'\1', s))
This outputs:
ok ok, it is very very very hard
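Both patterns above hard-code the number three. A possible generalization to the first n repetitions is sketched below, with a word boundary after each backreference so that a longer word starting with the same letters is not treated as a repeat. The function name keep_first_n is hypothetical.

```python
import re

def keep_first_n(text, n=3):
    """Collapse runs of a repeated word so that at most n copies remain."""
    # Group 1 holds the word plus n-1 further copies; the trailing group
    # matches the surplus copies, which the substitution drops.
    pattern = re.compile(rf"\b((\w+)(?:\s+\2\b){{{n - 1}}})(?:\s+\2\b)+")
    return pattern.sub(r"\1", text)

s = "ok ok, it is very very very very very hard"
print(keep_first_n(s))       # ok ok, it is very very very hard
print(keep_first_n(s, n=2))  # ok ok, it is very very hard
```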

One solution uses re.sub with a custom replacement function:
s = 'ok ok, it is very very very very very hard'
def replace(n=3):
    last_word, cnt = '', 0
    current_word = yield
    while True:
        if last_word == current_word:
            cnt += 1
        else:
            cnt = 0
            last_word = current_word
        if cnt >= n:
            current_word = yield ''
        else:
            current_word = yield current_word
import re
replacer = replace()
next(replacer)
print(re.sub(r'\s*[\w]+\s*', lambda g: replacer.send(g.group(0)), s))
Prints:
ok ok, it is very very very hard

How to find the largest repeating substring given character in Python?

Given some string, say 'aabaaab', how would I go about finding the longest substring made up only of the character a? It should return 'aaa'. Any help would be greatly appreciated.
def sub_string(s):
    best_run = 0
    current_run = 0
    for char in s:
        if char == 'a':
            current_run += 1
        else:
            current_letter = char
    return(best_run)
I have something like the one above. Not sure how to fix it up.

Not the most efficient, but a straightforward solution:
word = "aasfgaaassaasdsddaaaaaafff"
substr_count = 0
substr_counts = []
character = "f"
for i, letter in enumerate(word):
    if (letter == character):
        substr_count += 1
    else:
        substr_counts.append(substr_count)
        substr_count = 0
    if (i == len(word) - 1):
        substr_counts.append(substr_count)
print(max(substr_counts))
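The loop from the question can be completed along the same lines without storing every run length. This is only a sketch: the function name and the target parameter are assumptions, and it returns the length of the longest run rather than the substring itself.

```python
def longest_run_length(s, target="a"):
    """Return the length of the longest unbroken run of target in s."""
    best_run = 0
    current_run = 0
    for char in s:
        if char == target:
            current_run += 1
            # Record the run as soon as it becomes the longest seen so far.
            best_run = max(best_run, current_run)
        else:
            # Any other character ends the current run.
            current_run = 0
    return best_run

print(longest_run_length("aabaaab"))  # 3
```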

If you want a short method using standard Python tools (and want to avoid writing loops that reconstruct the string as you iterate), you can use a regex to split the string on any non-a character and then take the max() according to len:
import re
test_string = 'aabaaab'
split_string_list = re.split( '[^a]', test_string )
longest_string_subset = max( split_string_list, key=len )
print( longest_string_subset )
The re library is for regex, and '[^a]' is a regex for any single non-a character. Basically, 'aabaaab' is split into a list at every match of that regex, so that it becomes ['aa', 'aaa', '']. Then max() picks the longest string based on len (that is, length).
You can read more about functions like re.split() in the docs: https://docs.python.org/2/library/re.html
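For comparison, a regex-free sketch of the same idea uses itertools.groupby from the standard library, which groups consecutive equal characters. The function name longest_run and its char parameter are hypothetical.

```python
from itertools import groupby

def longest_run(s, char="a"):
    """Return the longest substring of s consisting only of char."""
    # groupby yields one (key, group) pair per run of identical characters.
    runs = ("".join(group) for key, group in groupby(s) if key == char)
    # default="" covers strings that do not contain char at all.
    return max(runs, key=len, default="")

print(longest_run("aabaaab"))  # aaa
```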

check if a pattern is in a list of words

I need an output containing the words that match a pattern exactly: the same letters in the same positions (and those letters must not appear anywhere else in the word) and the same length.
For example:
words = ['hatch','catch','match','chat','mates']
pattern = '_atc_'
Needed output:
['hatch','match']
I have tried to use nested for loops, but it didn't work for a pattern that starts and ends with '_'.
def filter_words_list(words, pattern):
    relevant_words = []
    for word in words:
        if len(word) == len(pattern):
            for i in range(len(word)):
                for j in range(len(pattern)):
                    if word[i] != pattern[i]:
                        break
                    if word[i] == pattern[i]:
                        relevant_words.append(word)
Thanks!

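You should use a regex and replace each underscore with '.', which matches any single character.
So the input looks like:
words = ['hatch','catch','match','chat','mates']
pattern = '.atc.'
and the code is:
import re
def filter_words_list(words, pattern):
    ret = []
    for word in words:
        if re.match(pattern, word):
            ret.append(word)
    return ret
Note that '.' also accepts 'catch', and re.match only anchors the start of the word, so this does not enforce the rule that the pattern's letters must not appear elsewhere or the rule that the lengths must be equal.
Hope that helped.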

You could use a regular expression:
import re
words = ['hatch','catch','match','chat','mates']
pattern = re.compile('[^atc]atc[^atc]')
result = list(filter(pattern.fullmatch, words))
print(result)
Output
['hatch', 'match']
The pattern '[^atc]atc[^atc]' matches any single character that is not a, t, or c ([^atc]), followed by 'atc', followed again by any single character that is not a, t, or c.
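Such a pattern does not have to be written by hand. As a sketch, it can be derived from the underscore pattern in the question: each '_' becomes a negated character class listing the pattern's own letters. The helper name pattern_to_regex is hypothetical.

```python
import re

def pattern_to_regex(pattern):
    """Turn '_atc_' into a regex whose wildcards exclude the pattern's letters."""
    letters = sorted({ch for ch in pattern if ch != "_"})
    # With no known letters there is nothing to exclude, so fall back to ".".
    wildcard = "[^" + "".join(map(re.escape, letters)) + "]" if letters else "."
    return re.compile("".join(wildcard if ch == "_" else re.escape(ch) for ch in pattern))

words = ['hatch', 'catch', 'match', 'chat', 'mates']
regex = pattern_to_regex('_atc_')
print(regex.pattern)                             # [^act]atc[^act]
print([w for w in words if regex.fullmatch(w)])  # ['hatch', 'match']
```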
As an alternative you could write your own matching function that will work with any given pattern:
from collections import Counter
def full_match(word, pattern='_atc_'):
    if len(pattern) != len(word):
        return False
    pattern_letter_counts = Counter(e for e in pattern if e != '_')  # count characters that are not wild card
    word_letter_counts = Counter(word)  # count letters
    if any(count != word_letter_counts.get(ch, 0) for ch, count in pattern_letter_counts.items()):
        return False
    return all(p == w for p, w in zip(pattern, word) if p != '_')  # the word must match in all characters that are not wild card
words = ['hatch', 'catch', 'match', 'chat', 'mates']
result = list(filter(full_match, words))
print(result)
Output
['hatch', 'match']
Further
See the documentation on the built-in functions any and all.
See the documentation on Counter.

compare specific string to a word python

Say I have a certain pattern string and a list of strings.
I would like to append to a new list all the words from the list of strings
that exactly match the pattern.
For example:
list of strings = ['string1','string2'...]
pattern = __letter__letter_ ('_c__ye_' for instance)
I need to add all strings that have the same letters in the same places as the pattern and the same length.
So for instance:
new_list = ['aczxyep','zcisyef'...]
I have tried this:
def pattern_word_equality(words,pattern):
    list1 = []
    for word in words:
        for letter in word:
            if letter in pattern:
                list1.append(word)
    return list1
Help would be much appreciated :)

If your pattern is as simple as _c__ye_, then you can look for the characters in the specific positions:
words = ['aczxyep', 'cxxye', 'zcisyef', 'abcdefg']
result1 = list(filter(lambda w: w[1] == 'c' and w[4:6] == 'ye', words))
If your pattern gets more complex, you can start using regular expressions:
import re
pat = re.compile("^.c..ye.$")
result2 = list(filter(lambda w: pat.match(w), words))
Output:
print(result1) # ['aczxyep', 'zcisyef']
print(result2) # ['aczxyep', 'zcisyef']
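To avoid hard-coding the positions, the pattern string can be compared with each word character by character. This is a minimal sketch, assuming '_' is the only wildcard, and the name matches_pattern is hypothetical.

```python
def matches_pattern(word, pattern):
    """True if word has the pattern's length and its fixed letters in place."""
    return len(word) == len(pattern) and all(
        p == "_" or p == w for p, w in zip(pattern, word)
    )

words = ['aczxyep', 'cxxye', 'zcisyef', 'abcdefg']
print([w for w in words if matches_pattern(w, '_c__ye_')])
# ['aczxyep', 'zcisyef']
```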

This works:
words = ['aczxyep', 'cxxye', 'zcisyef', 'abcdefg']
pattern = []
for i in range(len(words)):
    if (words[i])[1].lower() == 'c' and (words[i])[4:6].lower() == 'ye':
        pattern.append(words[i])
print(pattern)
You start by defining the words and pattern lists. Then you loop over the indices of words, using len(words). For each index i, you check whether that word follows the pattern by testing whether its second letter is c and its 5th and 6th letters are y and e. If so, the word is appended to pattern, and the whole list is printed at the end.
