RegEx for matching substrings

I want to write a regex to match on all substrings of a given input string. I have tried the following:
def sub_string(str):
    n = len(str)
    # For holding all the formed substrings
    output = "\\b("
    # This loop maintains the starting character
    for i in range(0, n):
        # This loop will add a character to start character one by one till the end is reached
        for j in range(i, n):
            output += str[i:(j + 1)] + "|"
    return output + "\\b)"
However, it matches the wrong words. For example, if I input "h", it matches. What could be the problem? And is there another approach I could use?
Thanks in advance!
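
One possible reading of the problem: the loop leaves a trailing "|" before the closing "\\b)", so the last alternative is only a word boundary, and the non-empty alternatives have no "\b" after them, which lets a prefix such as "h" match inside a longer word. A minimal sketch under that reading, assuming the goal is to match whole words that are substrings of the input (the name substring_pattern is hypothetical):

```python
import re

def substring_pattern(text):
    # Collect every distinct non-empty substring of text.
    subs = {text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)}
    # Longest first, so the alternation prefers the longest candidate.
    ordered = sorted(subs, key=lambda s: (-len(s), s))
    # re.escape protects special characters; join avoids a trailing "|".
    return r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b"

pattern = re.compile(substring_pattern("hello"))
print(pattern.findall("he said ell to hex"))  # ['he', 'ell']
```

Here "hex" is not matched because the closing \b sits outside the group and applies to every alternative.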

Related

How to replace the first n numbers of a specific character in a string

I’m new to computer science and really stuck on this question so any help would be great :).
Firstly I was given the following global variable:
ARROWS = '<>^v'
The idea is to create a function that takes in a string and an integer (n). The function should then replace the first n number of characters with 'X'. So far this is easy, however, the problem is that the characters should ONLY be replaced if it is a part of the global variable ARROWS. If it isn't, it should not be modified but still counts as one of the n numbers. The following exemplifies what needs to be done:
>>> function('>>.<>>...', 4)
'XX.X>>...'
>>> function('>..>..>', 6)
'X..X..>'
>>> function('..>>>.', 2)
'..>>>.'
Please help :)

Hi it seems to me (if you want to avoid using libraries) that you can iterate over the characters in the string and do comparisons to decide if the character needs to be changed. Here is some sample code which should help you.
ARROWS = '<>^v'
def replace_up_to(in_str, n):
    # store the new characters in this list
    result = []
    for i, char in enumerate(in_str):
        # decide if we need to change the char
        if i < n and char in ARROWS:
            result.append("X")
            continue
        result.append(char)
    # return a new string from our result list
    return "".join(result)

The solution is very straightforward. Hope this helps. :)
ARROWS = "<>^v"
arrows_set = set(ARROWS)
def function(word, n):
    newWord = ""
    for i in range(0, len(word)):
        if word[i] in arrows_set and i < n:
            newWord += 'X'
        else:
            newWord += word[i]
    return newWord
print(function(">>.<>>...", 4))
print(function(">..>..>", 6))

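You can use the re module to match a character class of your arrows. Note that the count argument limits the number of replacements, not the number of characters examined.
import re
txt_to_replace = "aa^aaa<>aaa>"
x = re.sub(r'[<>^v]', 'X', txt_to_replace, count=3)
# x is now aaXaaaXXaaa>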

You can use re.sub() to replace any of the characters in ARROWS.
To limit it to the first N characters of the input string, perform the replacement on a slice and concatenate that with the remainder of the input string.
import re
def replace_arrow(string, limit):
    arrows_regex = r'[<>^v]'
    first_n = string[:limit]
    rest = string[limit:]
    return re.sub(arrows_regex, 'X', first_n) + rest
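
The same slice-then-concatenate idea can also be written without re, using str.translate on the first n characters. This is only a sketch, not the answerer's code; the name replace_first_n is made up, and the expected values are the examples from the question:

```python
ARROWS = "<>^v"

def replace_first_n(text, n):
    # Map every arrow character to "X"; other characters are left alone.
    table = str.maketrans(ARROWS, "X" * len(ARROWS))
    # Only the first n characters are translated; the rest is appended unchanged.
    return text[:n].translate(table) + text[n:]

print(replace_first_n(">>.<>>...", 4))  # XX.X>>...
print(replace_first_n(">..>..>", 6))    # X..X..>
print(replace_first_n("..>>>.", 2))     # ..>>>.
```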

Remove string character after run of n characters in string

Suppose you have a given string and an integer, n. Every time a character appears in the string more than n times in a row, you want to remove some of the characters so that it only appears n times in a row. For example, for the case n = 2, we would want the string 'aaabccdddd' to become 'aabccdd'. I have written this crude function that compiles without errors but doesn't quite get me what I want:
def strcut(string, n):
    for i in range(len(string)):
        for j in range(n):
            if i + j < len(string)-(n-1):
                if string[i] == string[i+j]:
                    beg = string[:i]
                    ends = string[i+1:]
                    string = beg + ends
    print(string)
These are the outputs for strcut('aaabccdddd', n):
n    output     expected
1    'abcdd'    'abcd'
2    'acdd'     'aabccdd'
3    'acddd'    'aaabccddd'
I am new to Python, but I am pretty sure that my error is in line 3, 4 or 5 of my function. Does anyone have any suggestions or know of any methods that would make this easier?

This may not answer why your code does not work, but here's an alternate solution using regex:
import re
def strcut(string, n):
    return re.sub(fr"(.)\1{{{n-1},}}", r"\1"*n, string)
How it works: First, the pattern formatted is "(.)\1{n-1,}". If n=3 then the pattern becomes "(.)\1{2,}"
(.) is a capture group that matches any single character
\1 matches the first capture group
{2,} matches the previous token 2 or more times
The replacement string is the first capture group repeated n times
For example: str = "aaaab" and n = 3. The first "a" is the capture group (.). The next 3 "aaa" matches \1{2,} - in this example a{2,}. So the whole thing matches "a" + "aaa" = "aaaa". That is replaced with "aaa".
regex101 can explain it better than me.
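
To see the formatted pattern concretely, here is a small sketch that prints it and applies it to the "aaaab" example from the explanation. The helper name run_pattern is hypothetical and only wraps the f-string from the answer:

```python
import re

def run_pattern(n):
    # Doubled braces in an f-string produce literal "{" and "}".
    return fr"(.)\1{{{n - 1},}}"

print(run_pattern(3))                              # (.)\1{2,}
print(re.sub(run_pattern(3), r"\1" * 3, "aaaab"))  # aaab
```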

You can implement this with a stack data structure.
The idea is to push characters onto the stack one at a time. If the new character is the same as the one on top of the stack, increment a counter and push the character only if the counter stays within the limit n. If the new character differs from the one on top, push it and reset the counter to 1.
def func(string, n):
    stack = []
    counter = None
    for i in string:
        if not stack:
            counter = 1
            stack.append(i)
        elif stack[-1]==i:
            if counter+1<=n:
                stack.append(i)
                counter+=1
        elif stack[-1]!=i:
            stack.append(i)
            counter = 1
    return ''.join(stack)
print(func('aaabbcdaaacccdsdsccddssse', 2)=='aabbcdaaccdsdsccddsse')
print(func('aaabccdddd',1 )=='abcd')
print(func('aaabccdddd',2 )=='aabccdd')
print(func('aaabccdddd',3 )=='aaabccddd')
output
True
True
True
True

The method I would use is to start with an empty output string and, while scanning the input, skip any character whose current run already exceeds n. It is efficient because it makes a single pass over the input:
def strcut(string, n):
    new_string = ""
    first_c, s = None, 0  # None also handles an empty input string
    for c in string:
        if c != first_c:
            first_c, s = c, 0
        s += 1
        if s > n:
            continue
        else:
            new_string += c
    return new_string
print(strcut("aabcaaabbba", 2))  # output: aabcaabba

Simply, to answer the question's requirement ("appears in the string more than n times in a row"), the following code is small and simple, and will work fine :-)
def strcut(string: str, n: int) -> None:
    tmp = "*" * (n+1)
    for char in string:
        # append only if the last n characters are not already n copies of char
        if tmp[len(tmp) - n:] != char * n:
            tmp += char
    print(tmp[n+1:])
strcut("aaabccdddd", 1)
strcut("aaabccdddd", 2)
strcut("aaabccdddd", 3)
Output:
abcd
aabccdd
aaabccddd
Notes:
The character "*" in the line tmp = "*" * (n+1) can be any character that is not in the string; it is just a placeholder to handle the start case, when no characters have been added yet.
The print(tmp[n+1:]) line simply strips the "*" characters added at the beginning.

You don't need nested loops. Keep track of the current character and its count. Include characters when the count is less than or equal to n, and reset the current character and count when the character changes.
def strcut(s,n):
    result = ''                        # resulting string
    char,count = '',0                  # initial character and count
    for c in s:                        # only loop once on the characters
        if c == char: count += 1       # increase count
        else: char,count = c,1         # reset character/count
        if count<=n: result += c       # include character if count is ok
    return result
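
The same run-tracking idea can be expressed with itertools.groupby from the standard library, which groups consecutive equal characters. This is an alternative sketch rather than the code above; the name strcut_groupby is hypothetical:

```python
from itertools import groupby, islice

def strcut_groupby(s, n):
    # groupby yields one group per run of identical consecutive characters;
    # islice keeps at most n characters from each run.
    return "".join("".join(islice(run, n)) for _, run in groupby(s))

print(strcut_groupby("aaabccdddd", 2))  # aabccdd
```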

Just to give some ideas, this is a different approach. I didn't like how the inner loop over n runs every time: even if I am on i=3 and n=2, I still move on to i=4 although I already checked that character in the inner loop. And since you are checking the next n characters in the string, your method doesn't fit with keeping the string in order. Here is a rough method that I find easier to read. Note that string.count counts every occurrence of a character, not only consecutive ones, so this works only when each character appears in a single run.
def strcut(string, n):
    for i in range(len(string)-1,0,-1):  # I go backwards assuming you want to keep the front characters
        if string.count(string[i]) > n:
            string = remove(string,i)
    print(string)
def remove(string, i):
    if i > len(string):
        return string[:i]
    return string[:i] + string[i+1:]
strcut('aaabccdddd',2)

finding the minimum window substring

The problem says to create a string, take 3 non-consecutive characters from it as a substring, and print the positions of the first and the last of those characters in the string.
str1 = "subliminal"
sub1 = "bmn"
n = len(str1) - 3
for i in range(0, n):
    print(str1[i:i+4])
    if sub1 in str1:
        print(sub1[i])
This should print 3 to 8, because b is the 3rd letter and n is the 8th letter.
I also don't know how to make the code work for substrings that aren't 3 characters long without rewriting it entirely.

Not sure if this is what you meant. I assume that the substring is already valid, meaning it contains non-consecutive letters of the string. I take the first and last letter of the substring and build a list of all the letters in the string with a list comprehension. Then I loop through the letters and record where the first and last letter occur. If anything is missing, let me know.
sub = "bmn"
str = "subliminal"
first_letter = sub[0]
last_letter = sub[-1]
start = None
end = None
letters = [let for let in str]
for i, letter in enumerate(letters):
    if letter == first_letter:
        start = i
    if letter == last_letter:
        end = i
# compare with None so that a match at index 0 is not treated as missing
if start is not None and end is not None:
    print("From %s to %s." % (start + 1, end + 1))  # Output: From 3 to 8.

Some recursion for good health:
def minimum_window_substring(strn, sub, beg=0, fin=0, firstFound=False):
    if len(sub) == 0 or len(strn) == 0:
        return f'From {beg + 1} to {fin}'
    elif strn[0] == sub[0]:
        return minimum_window_substring(strn[1:], sub[1:], beg, fin + 1, True)
    if not firstFound:
        beg += 1
    return minimum_window_substring(strn[1:], sub, beg, fin + 1, firstFound)
Explanation:
The base case is when either the original string or the substring has been reduced to length 0; we then stop and return the beginning and the end of the substring within the original string.
If the first letter of the current string equals the first letter of the current substring, we fix the beginning ("beg") by setting the flag "firstFound", then keep incrementing "fin" until we finish (sub is an empty string or the original string is empty).
Something to think about / more explanation:
If the substring occurs more than once, the function picks the first occurrence. For example, if the original string is "sububusubulum" and sub is "sbl", the window starts at the first "s": if a later "sbl" exists in the original string, the remaining letters it needs also come after the first "s", so we treat them as belonging to the first one.
Note: This function does not care whether the substring contains consecutive letters, and it does not check whether the characters are actually in the string, because you said the characters are taken from the original string. On the plus side, the substring can be longer or shorter than 3 characters.
When I say "original string" I mean "subliminal" (or other inputs).
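
The recursion above can also be written as a loop. A rough iterative sketch of the same greedy left-to-right search follows; the name find_window is hypothetical, and unlike the recursive version it returns None when the characters of sub do not all appear in order:

```python
def find_window(strn, sub):
    if not sub:
        return None
    start = None
    k = 0  # index of the next character of sub we are looking for
    for pos, ch in enumerate(strn, 1):  # 1-based positions, as in the question
        if ch == sub[k]:
            if k == 0:
                start = pos  # first matched character fixes the beginning
            k += 1
            if k == len(sub):
                return start, pos  # last character found: window is complete
    return None

print(find_window("subliminal", "bmn"))  # (3, 8)
```

Like the recursive answer, this picks the first occurrence, which is not necessarily the shortest window.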

There are many different ways you could do it; here is one solution:
import re
def Func(String, SubString):
    # use the SubString parameter, not the global variable sub
    patt = "".join([char + "[A-Za-z]" + "+" for char in SubString[:-1]] + [SubString[-1]])
    MatchedString = re.findall(patt, String)[0]
    FirstIndex = String.find(MatchedString) + 1
    LastIndex = FirstIndex + len(MatchedString) - 1
    return FirstIndex, LastIndex
string = "subliminal"
sub = "bmn"
FirstIndex, LastIndex = Func(string, sub)
This returns 3, 8. It works for substrings of other lengths and assumes you only want the first match.

String slicing in specific scenarios Python

I have a string I'd like to split to new strings which will contain only text (no commas, spaces, dots etc.). The length of each new string must be of variable n. The slicing must go through each possible combination.
Meaning, for example, an input of func('banana pack', 3) will result in ['ban','ana','nan','ana','pac','ack']. So far what I managed to achieve is:
def func(text, n):
    text = text.lower()
    text = text.translate(str.maketrans("", "", " .,"))
    remainder = len(text) % n
    split_text = [text[i:i + n] for i in range(0, len(text) - remainder, n)]
    if remainder > 0:
        split_text.append(text[-n:])
    return split_text

First I clean the input, by removing ',' and '.'. The input is then split at spaces to take only full words into account. For each word the sections are appended.
def func(text,n):
    text=text.replace('.','').replace(',','')  # Cleanup
    words = text.split()  # split words
    output = []
    for word in words:
        for i in range(len(word)-n+1):
            output.append(word[i:i+n])
    return output
You could unroll the loop one level if you just iterate over everything and discard results with unwanted symbols.
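
A minimal sketch of that last suggestion, assuming "unwanted symbols" means anything that is not a letter (the name func_windows is hypothetical). Unlike the version above, a window that contains a '.' or ',' is discarded rather than having the punctuation stripped first:

```python
def func_windows(text, n):
    # Slide a window of length n over the whole text and keep only
    # the windows made entirely of letters.
    windows = (text[i:i + n] for i in range(len(text) - n + 1))
    return [w for w in windows if w.isalpha()]

print(func_windows("banana pack", 3))
# ['ban', 'ana', 'nan', 'ana', 'pac', 'ack']
```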

Append last letter in a string to another string

I am constructing a chatbot that rhymes in Python. Is it possible to identify the last vowel (and all the letters after that vowel) in a random word and then append those letters to another string, without having to go through all the possible letters one by one (like in the following example)?
lastLetters = ''  # String we want to append the letters to
if user_answer.endswith("a"):
    lastLetters += "a"
elif user_answer.endswith("b"):
    lastLetters += "b"
For example, if the word was "right", we'd want to get "ight".

You need to find the last index of a vowel, for that you could do something like this (a bit fancy):
s = input("Enter the word: ") # You can do this to get user input
last_index = len(s) - next((i for i, e in enumerate(reversed(s), 1) if e in "aeiou"), -1)
result = s[last_index:]
print(result)
Output
ight
An alternative using regex:
import re
s = "right"
last_index = -1
match = re.search("[aeiou][^aeiou]*$", s)
if match:
    last_index = match.start()
result = s[last_index:]
print(result)
The pattern [aeiou][^aeiou]*$ matches a vowel followed by zero or more characters that are not vowels, up to the end of the string ([^aeiou] means "not a vowel"; the ^ sign inside brackets means negation in regex). So it basically matches from the last vowel onward. Notice this assumes a string composed only of consonants and vowels.
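
A small sketch that wraps the regex in a function and makes two assumptions explicit: uppercase vowels should count too, and a word without any vowel should give an empty string. The name rhyme_suffix is hypothetical:

```python
import re

def rhyme_suffix(word):
    # Last vowel followed only by non-vowels up to the end of the string.
    # re.IGNORECASE makes both character classes case-insensitive.
    match = re.search(r"[aeiou][^aeiou]*$", word, flags=re.IGNORECASE)
    return match.group(0) if match else ""

print(rhyme_suffix("right"))   # ight
print(rhyme_suffix("Banana"))  # a
```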
