How can you group a very specific pattern with regex? - python

Problem:
https://coderbyte.com/editor/Simple%20Symbols
The str parameter will be composed of + and = symbols with
several letters between them (ie. ++d+===+c++==a) and for the string
to be true each letter must be surrounded by a + symbol. So the string
to the left would be false. The string will not be empty and will have
at least one letter.
Input:"+d+=3=+s+"
Output:"true"
Input:"f++d+"
Output:"false"
I'm trying to create a regular expression for the following problem, but I keep running into various problems. How can I produce something that returns the specified rules('+\D+')?
import re
plusReg = re.compile(r'[(+A-Za-z+)]')
plusReg.findall()
>>> []
Here I thought I could create my own class that searches for the pattern.
import re
plusReg = re.compile(r'([\\+,\D,\\+])')
plusReg.findall('adf+a+=4=+S+')
>>> ['a', 'd', 'f', '+', 'a', '+', '=', '=', '+', 'S', '+']
Here I thought the '\\+' would single out the plus symbol and read it as a char.
mo = plusReg.search('adf+a+=4=+S+')
mo.group()
>>>'a'
Here using the same shell, I tried using the search instead of findall, but I just ended up with the first letter which isn't even surrounded by a plus.
My end result is to group the string 'adf+a+=4=+S+' into ['+a+','+S+'] and so on.
edit:
Solution:
import re
def SimpleSymbols(str):
    #added padding, because if str = 'y+4==+r+'
    #then program would return true when it should return false.
    string = '=' + str + '='
    #regex that returns false if a letter *doesn't* have a + in front or back
    plusReg = re.compile(r'[^\+][A-Za-z].|.[A-Za-z][^\+]')
    #if statement that returns "true" if regex doesn't find any letters
    #without a + behind or in front
    if plusReg.search(string) is None:
        return "true"
    return "false"
print SimpleSymbols(raw_input())
I borrowed some code from ekhumoro and Sanjay. Thanks

One approach is to search the string for any letters that are either: (1) not preceded by a +, or (2) not followed by a +. This can be done using look ahead and look behind assertions:
>>> rgx = re.compile(r'(?<!\+)[a-zA-Z]|[a-zA-Z](?!\+)')
So if rgx.search(string) returns None, the string is valid:
>>> rgx.search('+a+') is None
True
>>> rgx.search('+a+b+') is None
True
but if it returns a match, the string is invalid:
>>> rgx.search('+ab+') is None
False
>>> rgx.search('+a=b+') is None
False
>>> rgx.search('a') is None
False
>>> rgx.search('+a') is None
False
>>> rgx.search('a+') is None
False
The important thing about look ahead/behind assertions is that they don't consume characters, so they can handle overlapping matches.
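As a rough sketch, the lookaround pattern above could be wrapped in a small helper. The name letters_surrounded is made up for illustration and is not part of the original answer; the sketch also assumes the input contains at least one letter, as the problem statement promises.

```python
import re

# Matches a letter that is missing a '+' on its left or on its right.
_BAD_LETTER = re.compile(r'(?<!\+)[a-zA-Z]|[a-zA-Z](?!\+)')

def letters_surrounded(s):
    """Return True if every letter in s has a '+' on both sides."""
    return _BAD_LETTER.search(s) is None

print(letters_surrounded('+d+=3=+s+'))  # True
print(letters_surrounded('f++d+'))      # False
```

Because the assertions consume nothing, the shared plus sign in '+a+b+' counts for both letters.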

Something like this should do the trick:
import re
def is_valid_str(s):
    # the lookahead leaves the closing + available for the next letter, as in '+a+b+'
    return re.findall(r'[a-zA-Z]', s) == re.findall(r'\+([a-zA-Z])(?=\+)', s)
Usage:
In [10]: is_valid_str("f++d+")
Out[10]: False
In [11]: is_valid_str("+d+=3=+s+")
Out[11]: True

I think you are on the right track, but the regular expression can be simplified to a single letter between two plus signs:
search_pattern = re.compile(r'\+[a-zA-Z]\+')
This covers both upper and lower case letters. Now we can use this regex with the findall function:
results = re.findall(search_pattern, 'adf+a+=4=+S+') # returns ['+a+', '+S+']
Now the question needs you to return a boolean depending on whether the string matches the specified pattern, so we can wrap this all up into a function:
def is_valid_pattern(pattern_string):
    # the lookahead for the closing + lets adjacent groups such as '+a+b+' share a plus sign
    search_pattern = re.compile(r'\+[a-zA-Z](?=\+)')
    letter_pattern = re.compile(r'[a-zA-Z]') # to search for all letters
    results = re.findall(search_pattern, pattern_string)
    letters = re.findall(letter_pattern, pattern_string)
    # if the number of letters equals the number of values found
    # with the pattern, we can say that it is a valid string
    return len(results) == len(letters)
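If the goal is the list of groups itself (['+a+', '+S+'] in the question), one possible sketch is to put the whole group inside a lookahead so that neighbouring groups may overlap. The function name plus_groups is hypothetical and only used for illustration.

```python
import re

def plus_groups(s):
    # The lookahead consumes nothing, so '+a+b+' yields both '+a+' and '+b+'.
    return re.findall(r'(?=(\+[a-zA-Z]\+))', s)

print(plus_groups('adf+a+=4=+S+'))  # ['+a+', '+S+']
print(plus_groups('+a+b+'))         # ['+a+', '+b+']
```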

You should be looking for what isn't there, as opposed to what is. You should search for something like ([^\+][A-Za-z]|[A-Za-z][^\+]). The | in the middle is a logical or operator. On either side, it checks whether it can find any scenario where there is a letter without a "+" on the left/right respectively. If it finds something, that means the string fails. If it can't find anything, that means that there are no instances of a letter not being surrounded by "+"'s.

Related

How can we remove word with repeated single character?

I am trying to remove word with single repeated characters using regex in python, for example :
good => good
gggggggg => g
What I have tried so far is following
re.sub(r'([a-z])\1+', r'\1', 'ffffffbbbbbbbqqq')
Problem with above solution is that it changes good to god and I just want to remove words with single repeated characters.
A better approach here is to use a set
def modify(s):
    #Create a set from the string
    c = set(s)
    #If you have only one character in the set, convert set to string
    if len(c) == 1:
        return ''.join(c)
    #Else return original string
    else:
        return s
print(modify('good'))
print(modify('gggggggg'))
If you want to use regex, mark the start and end of the string in our regex by ^ and $ (inspired from #bobblebubble comment)
import re
def modify(s):
    #Create the sub string with a regex which only matches if a single character is repeated
    #Marking the start and end of string as well
    out = re.sub(r'^([a-z])\1+$', r'\1', s)
    return out
print(modify('good'))
print(modify('gggggggg'))
The output will be
good
g
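If the input is a sentence rather than a single word, a similar idea might be applied per word by swapping the ^ and $ anchors for word boundaries. This is only a sketch under that assumption; collapse_repeated_words is a made-up name.

```python
import re

def collapse_repeated_words(text):
    # \b...\b restricts each match to a whole word made of one repeated letter.
    return re.sub(r'\b([a-z])\1+\b', r'\1', text)

print(collapse_repeated_words('good gggggggg food'))  # good g food
```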
If you do not want to use a set in your method, this should do the trick:
def simplify(s):
    l = len(s)
    if l>1 and s.count(s[0]) == l:
        return s[0]
    return s
print(simplify('good'))
print(simplify('abba'))
print(simplify('ggggg'))
print(simplify('g'))
print(simplify(''))
output:
good
abba
g
g
Explanations:
You compute the length of the string
you count the number of characters that are equal to the first one and you compare the count with the initial string length
depending on the result you return the first character or the whole string
You can use the Trim method.
Take a look at this example:
"ggggggg".Trim('g');
Update:
For characters which are in the middle of the string, use this function (thanks to this answer).
In C#:
public static string RemoveDuplicates(string input)
{
    return new string(input.ToCharArray().Distinct().ToArray());
}
in python:
used = set()
unique = [x for x in mylist if x not in used and (used.add(x) or True)]
But I think none of these answers handles a situation like aaaaabbbbbcda: this string has an a at the end which would not appear in the result (abcd). For this kind of situation, use this function which I wrote:
In:
def unique(s):
    used = set()
    ret = list()
    s = list(s)
    for x in s:
        if x not in used:
            ret.append(x)
            used = set()
            used.add(x)
    return ret
print(unique('aaaaabbbbbcda'))
out:
['a', 'b', 'c', 'd', 'a']
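Collapsing each run of equal adjacent characters, as the function above does, can also be sketched with itertools.groupby from the standard library. The name collapse_runs is invented for this example.

```python
from itertools import groupby

def collapse_runs(s):
    # groupby yields one (key, group) pair per run of equal adjacent items.
    return [key for key, _ in groupby(s)]

print(collapse_runs('aaaaabbbbbcda'))  # ['a', 'b', 'c', 'd', 'a']
```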

In python, how to 'if finditer(...) has no matches'?

I would like to do something when finditer() does not find anything.
import re
pattern = "1"
string = "abc"
matched_iter = re.finditer(pattern, string)
# <if matched_iter is empty (no matched found>.
# do something.
# else
for m in matched_iter:
    print m.group()
The best thing I could come up with is to keep track of found manually:
mi_no_find = re.finditer(r'\w+',"$$%%%%") # not matching.
found = False
for m in mi_no_find:
    print m.group()
    found = True
if not found:
    print "Nothing found"
Related posts that don't answer:
Counting finditer matches: Number of regex matches (I don't need to count, I just need to know if there are no matches).
finditer vs match: different behavior when using re.finditer and re.match (says always have to loop over an iterator returned by finditer)
[edit]
- I have no interest in enumerating or counting total output. Only if found else not found actions.
- I understand I can put finditer into a list, but this would be inefficient for large strings. One objective is to have low memory utilization.
Updated 04/10/2020
Use re.search(pattern, string) to check if a pattern exists.
pattern = "1"
string = "abc"
if re.search(pattern, string) is None:
    print('do this because nothing was found')
Returns:
do this because nothing was found
If you want to iterate over the return, then place the re.finditer() within the re.search().
pattern = '[A-Za-z]'
string = "abc"
if re.search(pattern, string) is not None:
    for thing in re.finditer(pattern, string):
        print('Found this thing: ' + thing[0])
Returns:
Found this thing: a
Found this thing: b
Found this thing: c
Therefore, if you wanted both options, use the else: clause with the if re.search() conditional.
pattern = "1"
string = "abc"
if re.search(pattern, string) is not None:
    for thing in re.finditer(pattern, string):
        print('Found this thing: ' + thing[0])
else:
    print('do this because nothing was found')
Returns:
do this because nothing was found
previous reply below (not sufficient, just read above)
If the .finditer() does not match a pattern, then it will not perform any commands within the related loop.
So:
Set the variable before the loop you are using to iterate over the regex returns
Call the variable after (And outside of) the loop you are using to iterate over the regex returns
This way, if nothing is returned from the regex call, the loop won't execute and your variable call after the loop will return the exact same variable it was set to.
Below, example 1 demonstrates the regex finding the pattern. Example 2 shows the regex not finding the pattern, so the variable within the loop is never set.
Example 3 shows my suggestion - where the variable is set before the regex loop, so if the regex does not find a match (and subsequently, does not trigger the loop), the variable call after the loop returns the initial variable set (Confirming the regex pattern was not found).
Remember to import the re module.
EXAMPLE 1 (Searching for the characters 'he' in the string 'hello world' will return 'he')
my_string = 'hello world'
pat = '(he)'
regex = re.finditer(pat,my_string)
for a in regex:
    b = str(a.groups()[0])
    print(b)
# returns 'he'
EXAMPLE 2 (Searching for the characters 'ab' in the string 'hello world' do not match anything, so the 'for a in regex:' loop does not execute and does not assign the b variable any value.)
my_string = 'hello world'
pat = '(ab)'
regex = re.finditer(pat,my_string)
for a in regex:
    b = str(a.groups()[0])
    print(b)
# no return
EXAMPLE 3 (Searching for the characters 'ab' again, but this time setting the variable b to 'CAKE' before the loop, and calling the variable b after, outside of the loop returns the initial variable - i.e. 'CAKE' - since the loop did not execute).
my_string = 'hello world'
pat = '(ab)'
regex = re.finditer(pat,my_string)
b = 'CAKE' # sets the variable prior to the for loop
for a in regex:
    b = str(a.groups()[0])
print(b) # calls the variable after (and outside) the loop
# returns 'CAKE'
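It's also worth noting that these examples read a.groups()[0], so the pattern needs parentheses to mark the start and end of a capturing group. Without a group, groups() returns an empty tuple and a.group() should be used instead.
pattern = '(ab)' # works with a.groups()[0]
pattern = 'ab' # no capturing group, so a.groups()[0] fails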
To tie back to the initial question:
Since nothing found won’t execute the for loop (for a in regex), the user can preload the variable, then check it after the for loop for the original loaded value. This will allow for the user to know if nothing was found.
my_string = 'hello world'
pat = '(ab)'
regex = re.finditer(pat,my_string)
b = 'CAKE' # sets the variable prior to the for loop
for a in regex:
    b = str(a.groups()[0])
if b == 'CAKE':
    pass # action taken if nothing is returned
If performance isn't an issue, simply use findall or list(finditer(...)), which returns a list.
Otherwise, you can "peek" into the generator with next: if it raises StopIteration there were no matches, and if it does not, loop as normal. Though there are other ways to do it, this is the simplest to me:
import itertools
import re
pattern = "1"
string = "abc"
matched_iter = re.finditer(pattern, string)
try:
    first_match = next(matched_iter)
except StopIteration:
    print("No match!") # action for no match
else:
    for m in itertools.chain([first_match], matched_iter):
        print(m.group())
You can probe the iterator with next and then chain the results back together, while catching StopIteration, which means the iterator was empty:
import itertools as it
matches = iter([])
try:
    probe = next(matches)
except StopIteration:
    print('empty')
else:
    for m in it.chain([probe], matches):
        print(m)
Regarding your solution you could check m directly, setting it to None beforehand:
matches = iter([])
m = None
for m in matches:
    print(m)
if m is None:
    print('empty')
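The probing idea from the answers above could be packaged into a small reusable helper. This is a sketch only: finditer_or_none is a hypothetical name, and it relies on the two-argument form of next, which returns the default instead of raising StopIteration.

```python
import itertools
import re

def finditer_or_none(pattern, string):
    """Return an iterator over all matches, or None if there are none."""
    matches = re.finditer(pattern, string)
    first = next(matches, None)  # consumes at most one match
    if first is None:
        return None
    # Put the probed match back in front of the remaining ones.
    return itertools.chain([first], matches)

result = finditer_or_none(r'\d', 'abc')
if result is None:
    print('Nothing found')
else:
    for m in result:
        print(m.group())
```

Only one match is held at a time, so memory use stays low even for large strings.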
The function below replaces the n-th occurrence of a pattern in the string. It returns the original string if there are fewer than n matches.
For more reference: https://docs.python.org/2/howto/regex.html
import re
Input_Str = "FOOTBALL"
def replacing(Input_String, char_2_replace, replaced_char, n):
    pattern = re.compile(char_2_replace)
    if len(re.findall(pattern, Input_String)) >= n:
        where = [m for m in pattern.finditer(Input_String)][n-1]
        before = Input_String[:where.start()]
        after = Input_String[where.end():]
        newString = before + replaced_char + after
    else:
        newString = Input_String
    return newString
print(replacing(Input_Str, 'L', 'X', 4))
I know this answer is late, but it is very suitable for Python 3.8+.
You can use the new walrus operator := along with next(iterator[, default]) to handle 'no matches' in re.finditer(pattern, string, flags=0), somewhat like this:
import re
pattern_ = "1"
string_ = "abc"
def is_match():
    matches = re.finditer(pattern_, string_) # create the iterator once
    was_found = False
    while (match := next(matches, None)) is not None:
        was_found = True
        yield match.group() # or just print it
    return was_found # in a generator this ends up in StopIteration.value

Simple way to accept only one form of delimiter and rejecting multiple types?

I was wondering how I could go about making something that accepts a string with only one type of delimiter, something like this:
car:bus:boat
and rejecting something like:
car:bus-boat
I am not really sure about how to go about creating something like this.
Well, first you have to define what the invalid delimiters are. A hyphen could well be part of a valid hyphenated word or name, and the algorithm wouldn't be able to tell those apart. Supposing you have a list of invalid delimiters, you could just do:
def string_is_valid(s):
    invalid_delimiters = ['-', ';']
    for d in invalid_delimiters:
        if d in s:
            return False
    return True
s1 = 'car:bus-boat'
print(string_is_valid(s1)) # False
s2 = 'car:bus:boat'
print(string_is_valid(s2)) # True
If, on the other hand, you have a list of delimiters and you want to make sure that only one type is present on the string, you could do this:
def string_is_valid(s):
    valid_delimiters = [',', ':', ';']
    # For each delimiter in our list...
    for d in valid_delimiters:
        # If the delimiter is present in the string...
        if d in s:
            # If any of the other delimiters is in s (and the other delimiter isn't the same one we're currently looking at), return False (it's invalid)
            if any([other_d in s and other_d != d for other_d in valid_delimiters]):
                return False
    return True
s1 = 'car:bus:boat'
print(string_is_valid(s1)) # True
s2 = 'car,bus,boat'
print(string_is_valid(s2)) # True
s3 = 'car,bus;boat'
print(string_is_valid(s3)) # False
You can have an alphabet of "allowed" characters and count whatever is not in it (hence interpreting it as a separator).
e.g.
allowed = list('abcdefghijklmnopqrstuvxwyz')
def validate(string):
    if len(set([k for k in string if k not in allowed])) > 1:
        return False
    return True
Of course you can expand the allowed for capital letters etc.
Use a regular expression:
import re
data = re.compile(r'^[a-zA-Z]+(?::[a-zA-Z]+)+$')
data.match('car:bus:boat') # returns a match object; 'car:bus-boat' gives None
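A regex can also enforce "whichever delimiter appears first must be the only one" by capturing the first delimiter and requiring it again through a backreference. The sketch below assumes words made of ASCII letters and a delimiter set of ':', ';', ',' and '-'; both the set and the name uses_one_delimiter are illustrative assumptions, not something from the question.

```python
import re

# Assumed delimiter set for illustration: ':', ';', ',' and '-'.
# Group 1 captures the first delimiter; \1 forces every later one to match it.
_SINGLE_DELIM = re.compile(
    r'^[A-Za-z]+(?:([:;,-])[A-Za-z]+(?:\1[A-Za-z]+)*)?$'
)

def uses_one_delimiter(s):
    return _SINGLE_DELIM.match(s) is not None

print(uses_one_delimiter('car:bus:boat'))  # True
print(uses_one_delimiter('car:bus-boat'))  # False
```

Note that under these assumptions a single word with no delimiter at all is accepted.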

Find the repeating substring a string is composed of, if it exists

How would you go about splitting a normal string into as many identical pieces as possible whilst using all characters? For example
a = "abab"
Would return "ab", whereas with
b= "ababc"
It would return "ababc", as it can't be split into identical pieces using all letters.
This is very similar, but not identical, to How can I tell if a string repeats itself in Python? – the difference being that that question only asks to determine whether a string is made up of identical repeating substrings, rather than what the repeating substring (if any) is.
The accepted (and by far the best performing) answer to that question can be adapted to return the repeating string if there is one:
def repeater(s):
    i = (s+s)[1:-1].find(s)
    if i == -1:
        return s
    else:
        return s[:i+1]
Examples:
>>> repeater('abab')
'ab'
>>> repeater('ababc')
'ababc'
>>> repeater('xyz' * 1000000)
'xyz'
>>> repeater('xyz' * 50 + 'q')
'xyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzq'
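The trick works because finding s at index i inside (s+s)[1:-1] means s equals itself rotated by i+1 characters, and find returns the smallest such index. For comparison, here is a slower but more explicit sketch that simply tries every candidate unit length; repeater_bruteforce is a made-up name and can serve as a cross-check for the version above.

```python
def repeater_bruteforce(s):
    n = len(s)
    # Try candidate unit lengths from shortest to longest.
    for size in range(1, n // 2 + 1):
        # A unit must divide the length evenly and rebuild s when repeated.
        if n % size == 0 and s[:size] * (n // size) == s:
            return s[:size]
    return s

print(repeater_bruteforce('abab'))   # ab
print(repeater_bruteforce('ababc'))  # ababc
```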
Since the repeating substring has nothing before or after it, a regular expression anchored with ^ and $ could also be used this way:
In[4]: re.sub(r'^([a-z]+)\1$',r'\1','abab')
Out[4]: 'ab'
In[5]: re.sub(r'^([a-z]+)\1$',r'\1','ababc')
Out[5]: 'ababc'
([a-z]+) means substring, \1 means repeat.
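EDIT: for more than two repetitions, make the group lazy so that it captures the shortest repeating unit (a greedy group would capture 'abcabc' here):
re.sub(r'^([a-z]+?)\1+$',r'\1','abcabcabcabc')
'abc'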

How to find what matched in any() with Python?

I'm working in Python, using any() like so to look for a match between a String[] array and a comment pulled from Reddit's API.
Currently, I'm doing it like this:
isMatch = any(string in comment.body for string in myStringArray)
But it would also be useful to not just know if isMatch is true, but which element of myStringArray it was that had a match. Is there a way to do this with my current approach, or do I have to find a different way to search for a match?
You could use next with a default of False on a conditional generator expression (next takes the default as a positional argument, not as a keyword):
next((string for string in myStringArray if string in comment.body), False)
The default is returned when there is no item that matched (so it's like any returning False), otherwise the first matching item is returned.
This is roughly equivalent to:
isMatch = False # variable to store the result
for string in myStringArray:
    if string in comment.body:
        isMatch = string
        break # after the first occurrence stop the for-loop.
or if you want to have isMatch and whatMatched in different variables:
isMatch = False # variable to store the any result
whatMatched = '' # variable to store the first match
for string in myStringArray:
    if string in comment.body:
        isMatch = True
        whatMatched = string
        break # after the first occurrence stop the for-loop.
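For a version that can be run without the Reddit API, here is a minimal sketch using plain strings. The names first_match, words and the sample texts are placeholders; None is used as the default so that "no match" cannot be confused with a matching empty string.

```python
def first_match(candidates, text):
    """Return the first candidate found in text, or None if none match."""
    return next((c for c in candidates if c in text), None)

words = ['spam', 'eggs', 'ham']
print(first_match(words, 'green eggs and ham'))  # eggs
print(first_match(words, 'toast'))               # None
```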
For python 3.8 or newer use Assignment Expressions
if any((match := string) in comment.body for string in myStringArray):
    print(match)
I agree with the comment that an explicit loop would be clearest. You could fudge your original like so:
isMatch = any(string in comment.body and remember(string) for string in myStringArray)
                                     ^^^^^^^^^^^^^^^^^^^^
where:
def remember(x):
    global memory
    memory = x
    return True
Then the global memory will contain the matched string if isMatch is True, or retain whatever value (if any) it originally had if isMatch is False.
It's not a good idea to use one variable to store two different kinds of information: whether a string matches (a bool) and what that string is (a string).
You really only need the second piece of information: while there are creative ways to do this in one statement, as in the above answer, it really makes sense to use a for loop:
match = ''
for string in myStringArray:
    if string in comment.body:
        match = string
        break
if match:
    pass # do stuff
Say you have a = ['a','b','c','d'] and b = ['x','y','d','z'],
so that by doing any(i in b for i in a) you get True.
You can get:
The array of matches : matches = list( (i in b for i in a) )
Where in a it first matches : posInA = matches.index(True)
The value : value = a[posInA]
Where in b it first matches : posInB = b.index(value)
To get all the values and their indexes, matches alone is not enough: a result such as [False, False, True, True] only says which elements of a occur in b, not where or how often, so you need to use enumerate in loops (or in a list comprehension).
for m,i in enumerate(a):
    print('considering '+i+' at pos '+str(m)+' in a')
    for n,j in enumerate(b):
        print('against '+j+' at pos '+str(n)+' in b')
        if i == j:
            print('in a: '+i+' at pos '+str(m)+', in b: '+j+' at pos '+str(n))
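The list-comprehension variant mentioned above might look like the following sketch. It records only the first position of each value in b, which is an assumption made here to keep the example short; the name first_pos_in_b is invented.

```python
a = ['a', 'b', 'c', 'd']
b = ['x', 'y', 'd', 'z']

# Map each value in b to the index of its first occurrence.
first_pos_in_b = {}
for n, value in enumerate(b):
    first_pos_in_b.setdefault(value, n)

# (position in a, position in b, value) for every element of a found in b.
common = [(m, first_pos_in_b[v], v) for m, v in enumerate(a) if v in first_pos_in_b]
print(common)  # [(3, 2, 'd')]
```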
