how to remove whitespace inside bracket? - python

I have the following string:
res = '(321, 3)-(m-5, 5) -(31,1)'
I want to remove the whitespace within the brackets, but I don't have any knowledge of regular expressions.
I've tried this, but it doesn't work:
import re
res = re.sub(r'\(.*\s+\)', '', res)

You can substitute a non-greedy wildcard match for characters in parentheses with a function that splits the match on whitespace and rejoins it.
>>> import re
>>> res = '(321, 3)-(m-5, 5) -(31,1)'
>>> re.sub(r'\(.*?\)', lambda x: ''.join(x.group(0).split()), res)
'(321,3)-(m-5,5) -(31,1)'
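To see why the non-greedy .*? matters here, this small sketch compares it with a greedy .* on the same string. The variable names greedy and lazy are only illustrative and do not come from the answer above.

```python
import re

res = '(321, 3)-(m-5, 5) -(31,1)'

# Greedy: .* runs to the LAST ')' in the string, so everything is one match.
greedy = re.findall(r'\(.*\)', res)

# Non-greedy: .*? stops at the FIRST ')' it can, giving one match per group.
lazy = re.findall(r'\(.*?\)', res)

print(greedy)  # ['(321, 3)-(m-5, 5) -(31,1)']
print(lazy)    # ['(321, 3)', '(m-5, 5)', '(31,1)']
```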

You could convert the string into a list, go through each character, and keep count of whether you are inside brackets. In toRemove you collect the positions of the spaces, which you then remove from the list. Then you convert the list back to a string:
res = '(321, 3)-(m-5, 5) -(31,1)'
r = list(res)
insideBracket = 0
toRemove = []
for pos,letter in enumerate(r):
    if letter == '(':
        insideBracket += 1
    elif letter == ')':
        insideBracket -= 1
    if insideBracket > 0:
        if letter == ' ':
            toRemove.append(pos)
for t in toRemove[::-1]:
    r.pop(t)
result = ''.join(r)
print(result)
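The same counting idea can be sketched as a single-pass function that builds the output directly instead of popping positions afterwards. The name strip_inside_brackets is hypothetical, and two details are assumptions that go beyond the answer above: it drops any whitespace character (not only spaces) and it never lets the depth go below zero on a stray ')'.

```python
def strip_inside_brackets(text):
    """Remove whitespace that appears inside (possibly nested) parentheses."""
    depth = 0   # how many '(' are currently open
    out = []
    for ch in text:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth = max(depth - 1, 0)  # assumption: ignore unmatched ')'
        if depth > 0 and ch.isspace():
            continue                   # skip whitespace while inside brackets
        out.append(ch)
    return ''.join(out)

print(strip_inside_brackets('(321, 3)-(m-5, 5) -(31,1)'))  # (321,3)-(m-5,5) -(31,1)
print(strip_inside_brackets('(a (b c) d) e f'))            # (a(bc)d) e f
```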

I think regular expressions aren't quite powerful enough to do what you want here; you want to remove all whitespace that's found between parentheses. The trouble is, solving this for the general case means doing a context-sensitive match on the string, and regular expressions are mostly context-insensitive, so they can't do the job. There are lookaheads and lookbehinds that can restrict matches to particular contexts, but they won't solve your problem in the general case either. For lookbehinds, the Python documentation says:
"The contained pattern must only match strings of some fixed length, meaning that abc or a|b are allowed, but a* and a{3,4} are not. Group references are not supported even if they match strings of some fixed length."
Because of this, I would match the parenthesized groups first:
>>> re.split(r'(\([^)]*\))', res)
['', '(321, 3)', '-', '(m-5, 5)', ' -', '(31,1)', '']
and then remove whitespace from them in a second step before joining everything back up into a single string:
>>> g = re.split(r'(\([^)]*\))', res)
>>> g[1::2] = [re.sub(r'\s+', '', x) for x in g[1::2]]
>>> ''.join(g)
'(321,3)-(m-5,5) -(31,1)'
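The two steps above can be wrapped in one helper. This is only a sketch: the function name strip_ws_in_parens is made up for illustration, and like the snippet above it assumes the parentheses are not nested.

```python
import re

def strip_ws_in_parens(text):
    # Because the pattern is wrapped in a capturing group, re.split keeps
    # the matched "(...)" pieces in the result, at the odd indices.
    parts = re.split(r'(\([^)]*\))', text)
    # Remove whitespace only from the parenthesized pieces.
    parts[1::2] = [re.sub(r'\s+', '', p) for p in parts[1::2]]
    return ''.join(parts)

print(strip_ws_in_parens('(321, 3)-(m-5, 5) -(31,1)'))  # (321,3)-(m-5,5) -(31,1)
```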

Related

Regex to find specific letter before a condition Python

I just want to find all characters (other than A) which are followed by triple A, i.e., have AAA to the right. I don’t want to include the triple A in the output and just want the character immediately preceding AAA
import re
result = []
s = 'ACAABAACAAABACDBADDDFSDDDFFSSSASDAFAAACBAAAFASD'
pattern = r'(\w[BF])(?!AAA)'
for item in re.finditer(pattern, s):
    result.append(item.group())
print(result)
I used the pattern r'(\w[BF])(?!AAA)', but it didn't work.
I just need to find the letters in [] below:
'ACAABAA[C]AAABACDBADDDFSDDDFFSSSASDA[F]AAAC[B]AAAFASD'
In your example, you want to match a single character to the left of triple A. Using \w[BF] matches at least 2 characters: 1 word character followed by either B or F.
The negative lookahead asserts that what is directly to the right is not triple A, but you want the opposite.
You can match a single B-Z and assert that what is directly to the right is AAA:
[B-Z](?=AAA)
import re
result = []
s = 'ACAABAACAAABACDBADDDFSDDDFFSSSASDAFAAACBAAAFASD'
pattern = r'[B-Z](?=AAA)'
for item in re.finditer(pattern, s):
    result.append(item.group())
print(result)
Output
['C', 'F', 'B']
You could also use re.findall
import re
s = 'ACAABAACAAABACDBADDDFSDDDFFSSSASDAFAAACBAAAFASD'
pattern = r'[B-Z](?=AAA)'
result = re.findall(pattern, s)
print(result)
[^A](?=A{3})
Here I use positive lookahead.
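The two lookahead patterns suggested so far differ in what they accept before AAA: [B-Z] only allows uppercase letters other than A, while [^A] allows any character that is not A. The sample string below is made up for illustration and is not from the question.

```python
import re

sample = '1AAA bAAA CAAA'  # hypothetical input with a digit and a lowercase letter

only_upper = re.findall(r'[B-Z](?=AAA)', sample)
anything_but_a = re.findall(r'[^A](?=A{3})', sample)

print(only_upper)      # ['C']
print(anything_but_a)  # ['1', 'b', 'C']
```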
Here is a solution to your problem, using capture groups instead of a lookahead (result and s are defined as above):
pattern = r"([B-Z])(A{3})"
for item in re.finditer(pattern, s):
    result.append(item.group(1))

How can we remove word with repeated single character?

I am trying to remove word with single repeated characters using regex in python, for example :
good => good
gggggggg => g
What I have tried so far is following
re.sub(r'([a-z])\1+', r'\1', 'ffffffbbbbbbbqqq')
The problem with the above solution is that it changes good to god, and I just want to collapse words that consist of a single repeated character.
A better approach here is to use a set
def modify(s):
    #Create a set from the string
    c = set(s)
    #If you have only one character in the set, convert set to string
    if len(c) == 1:
        return ''.join(c)
    #Else return original string
    else:
        return s
print(modify('good'))
print(modify('gggggggg'))
If you want to use a regex, mark the start and end of the string in the regex with ^ and $ (inspired by a comment from bobblebubble):
import re
def modify(s):
    #Create the substituted string with a regex which only matches if a single character is repeated
    #Marking the start and end of string as well
    out = re.sub(r'^([a-z])\1+$', r'\1', s)
    return out
print(modify('good'))
print(modify('gggggggg'))
The output will be
good
g
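If the input is a whole sentence rather than a single word, one possible variation is to replace the ^ and $ anchors with \b word boundaries. This is an assumption about the use case, not something the question asks for, and the function name collapse_repeated_words is made up for this sketch.

```python
import re

def collapse_repeated_words(text):
    # \b...\b restricts the match to whole words, so only words made of one
    # repeated lowercase letter are collapsed; 'good' and 'food' are untouched.
    return re.sub(r'\b([a-z])\1+\b', r'\1', text)

print(collapse_repeated_words('good gggggggg food'))  # good g food
```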
If you do not want to use a set in your method, this should do the trick:
def simplify(s):
    l = len(s)
    if l>1 and s.count(s[0]) == l:
        return s[0]
    return s
print(simplify('good'))
print(simplify('abba'))
print(simplify('ggggg'))
print(simplify('g'))
print(simplify(''))
output:
good
abba
g
g
Explanation:
You compute the length of the string.
You count the number of characters equal to the first one and compare the count with the string length.
Depending on the result, you return the first character or the whole string.
You can use the Trim method (this example is C#):
"ggggggg".Trim('g');
Note that Trim removes every leading and trailing g, so a string made up only of g comes back empty.
Update:
For characters in the middle of the string, use this function (thanks to this answer).
In C#:
public static string RemoveDuplicates(string input)
{
    return new string(input.ToCharArray().Distinct().ToArray());
}
In Python:
used = set()
unique = [x for x in mylist if x not in used and (used.add(x) or True)]
However, I think none of these answers handles a string like aaaaabbbbbcda, which has an a at the end that does not appear in the result (abcd). For this kind of situation, use this function, which I wrote:
In:
def unique(s):
    used = set()
    ret = list()
    s = list(s)
    for x in s:
        if x not in used:
            ret.append(x)
            # forget earlier characters so that only consecutive repeats are dropped
            used = set()
            used.add(x)
    return ret
print(unique('aaaaabbbbbcda'))
out:
['a', 'b', 'c', 'd', 'a']
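Collapsing only consecutive repeats, as the function above does, can also be sketched with itertools.groupby from the standard library, which groups adjacent equal items. The name collapse_runs is illustrative, and unlike the function above it returns a string rather than a list.

```python
from itertools import groupby

def collapse_runs(s):
    # groupby yields one (key, group) pair per run of equal adjacent
    # characters; keeping only the keys collapses each run to one character.
    return ''.join(key for key, _group in groupby(s))

print(collapse_runs('aaaaabbbbbcda'))  # abcda
```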

How can you group a very specific pattern with regex?

Problem:
https://coderbyte.com/editor/Simple%20Symbols
The str parameter will be composed of + and = symbols with
several letters between them (ie. ++d+===+c++==a) and for the string
to be true each letter must be surrounded by a + symbol. So the string
to the left would be false. The string will not be empty and will have
at least one letter.
Input:"+d+=3=+s+"
Output:"true"
Input:"f++d+"
Output:"false"
I'm trying to create a regular expression for the following problem, but I keep running into various problems. How can I produce something that returns the specified rules('+\D+')?
import re
plusReg = re.compile(r'[(+A-Za-z+)]')
plusReg.findall()
>>> []
Here I thought I could create my own character class that searches for the pattern.
import re
plusReg = re.compile(r'([\\+,\D,\\+])')
plusReg.findall('adf+a+=4=+S+')
>>> ['a', 'd', 'f', '+', 'a', '+', '=', '=', '+', 'S', '+']
Here I thought the '\\+' would single out the plus symbol and read it as a char.
mo = plusReg.search('adf+a+=4=+S+')
mo.group()
>>>'a'
Here using the same shell, I tried using the search instead of findall, but I just ended up with the first letter which isn't even surrounded by a plus.
My end result is to group the string 'adf+a+=4=+S+' into ['+a+','+S+'] and so on.
edit:
Solution:
import re
def SimpleSymbols(str):
    # added padding, because if str = 'y+4==+r+'
    # then the program would return true when it should return false.
    string = '=' + str + '='
    # regex that finds a letter that *doesn't* have a + in front or back
    plusReg = re.compile(r'[^\+][A-Za-z].|.[A-Za-z][^\+]')
    # return "true" if the regex doesn't find any letters
    # without a + behind or in front
    if plusReg.search(string) is None:
        return "true"
    return "false"
print(SimpleSymbols(input()))
I borrowed some code from ekhumoro and Sanjay. Thanks
One approach is to search the string for any letters that are either: (1) not preceded by a +, or (2) not followed by a +. This can be done using lookahead and lookbehind assertions:
>>> rgx = re.compile(r'(?<!\+)[a-zA-Z]|[a-zA-Z](?!\+)')
So if rgx.search(string) returns None, the string is valid:
>>> rgx.search('+a+') is None
True
>>> rgx.search('+a+b+') is None
True
but if it returns a match, the string is invalid:
>>> rgx.search('+ab+') is None
False
>>> rgx.search('+a=b+') is None
False
>>> rgx.search('a') is None
False
>>> rgx.search('+a') is None
False
>>> rgx.search('a+') is None
False
The important thing about look ahead/behind assertions is that they don't consume characters, so they can handle overlapping matches.
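As a sketch, the assertion-based check can be wrapped in a function that returns the "true"/"false" strings the challenge expects. The function name simple_symbols is an assumption; the pattern is the one shown above.

```python
import re

# A letter not preceded by '+', or a letter not followed by '+'.
rgx = re.compile(r'(?<!\+)[a-zA-Z]|[a-zA-Z](?!\+)')

def simple_symbols(s):
    # No offending letter found means every letter is surrounded by '+'.
    return 'true' if rgx.search(s) is None else 'false'

print(simple_symbols('+d+=3=+s+'))  # true
print(simple_symbols('f++d+'))      # false
print(simple_symbols('+a+b+'))      # true
```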
Something like this should do the trick:
import re
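def is_valid_str(s):
    # The lookahead (?=\+) leaves the trailing + unconsumed, so it can
    # also serve as the leading + of the next letter (e.g. '+a+b+').
    return re.findall(r'[a-zA-Z]', s) == re.findall(r'\+([a-zA-Z])(?=\+)', s)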
Usage:
In [10]: is_valid_str("f++d+")
Out[10]: False
In [11]: is_valid_str("+d+=3=+s+")
Out[11]: True
I think you are on the right track. The regular expression you have is close, but it can be simplified down to just letters:
search_pattern = re.compile(r'\+[a-zA-Z]\+')
for upper and lower case letters. Now we can use this regex with the findall function:
results = re.findall(search_pattern, 'adf+a+=4=+S+') # returns ['+a+', '+S+']
Now the question needs you to return a boolean depending on whether the string is valid for the specified pattern, so we can wrap this all up into a function:
def is_valid_pattern(pattern_string):
    # The lookahead (?=\+) checks the trailing + without consuming it,
    # so adjacent letters such as '+a+b+' can share a + sign.
    search_pattern = re.compile(r'\+[a-zA-Z](?=\+)')
    letter_pattern = re.compile(r'[a-zA-Z]') # to search for all letters
    results = re.findall(search_pattern, pattern_string)
    letters = re.findall(letter_pattern, pattern_string)
    # if the number of letters equals the number of matches found with
    # the pattern, we can say that it is a valid string
    return len(results) == len(letters)
You should be looking for what isn't there, as opposed to what is. You should search for something like ([^\+][A-Za-z]|[A-Za-z][^\+]). The | in the middle is a logical or operator. On either side, it checks whether it can find a letter without a "+" on the left or right, respectively. If it finds something, the string fails. If it can't find anything, there are no instances of a letter not being surrounded by "+"s. Note that a letter at the very start or end of the string has no neighbour to test, so pad the string (for example with =) before searching.
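A short sketch of that padding caveat: without padding, the negated character classes need a neighbouring character to match, so a letter at the edge of the string slips through. The names bad_neighbour and is_valid, and the pad parameter, are made up for this illustration.

```python
import re

# A letter with a non-'+' character on its left, or on its right.
bad_neighbour = re.compile(r'[^+][A-Za-z]|[A-Za-z][^+]')

def is_valid(s, pad=True):
    # Padding with '=' gives letters at the edges a neighbour to be tested against.
    candidate = '=' + s + '=' if pad else s
    return bad_neighbour.search(candidate) is None

print(is_valid('a+', pad=False))  # True  (wrong: 'a' has no + before it)
print(is_valid('a+'))             # False
print(is_valid('+a+'))            # True
```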

Check and remove particular char from string in python

I'm in a situation where I have a string and a special symbol that is consecutively repeating, such as:
s = 'a.b.c...d..e.g'
How can I check whether it is repeating or not and remove consecutive symbols, resulting in this:
s = 'a.b.c.d.e.g'
import re
result = re.sub(r'\.{2,}', '.', 'a.b.c...d..e.g')
A bit more generalized version:
import re
symbol = '.'
regex_pattern_to_replace = re.escape(symbol)+'{2,}'
# Note that escape sequences are processed in replace_to
# but this time we have no backslash characters in it.
# In case of more complex replacement we could use
# replace_to = replace_to.replace('\\', '\\\\')
# to defend against occasional escape sequences.
replace_to = symbol
result = re.sub(regex_pattern_to_replace, replace_to, 'a.b.c...d..e.g')
The same with compiled regex (added after Cristian Ciupitu's comment):
compiled_regex = re.compile(regex_pattern_to_replace)
# You can store the compiled_regex and reuse it multiple times.
result = compiled_regex.sub(replace_to, 'a.b.c...d..e.g')
Check out the docs for re.sub
Simple and clear:
>>> a = 'a.b.c...d..e.g'
>>> while '..' in a:
...     a = a.replace('..','.')
...
>>> a
'a.b.c.d.e.g'
Lots of answers, so why not throw another one into the mix.
You can zip the string with itself off by one and eliminate all matching '.'s:
''.join(x[0] for x in zip(s, s[1:]+' ') if x != ('.', '.'))
Certainly not the fastest, just interesting. It's trivial to turn this into eliminating all repeating elements:
''.join(a for a,b in zip(s, s[1:]+' ') if a != b)
Note: you can use izip_longest (py2) or zip_longest (py3) if ' ' as a filler causes an issue.
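A sketch of the zip_longest variant mentioned in the note (Python 3). The missing partner of the last character is filled with None, which never equals a character, so a string that ends in a space keeps it. The function name collapse_repeats is illustrative.

```python
from itertools import zip_longest

def collapse_repeats(s):
    # Pair each character with its successor; the last one is paired with None.
    # A character is kept only when it differs from the one that follows it.
    return ''.join(a for a, b in zip_longest(s, s[1:]) if a != b)

print(collapse_repeats('a.b.c...d..e.g'))  # a.b.c.d.e.g
print(repr(collapse_repeats('ab ')))       # 'ab '
```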
My previous answer was a dud so here's another attempt using reduce(). It makes a single pass over the string, although the repeated string concatenation means it is not strictly O(n) for long inputs:
from functools import reduce  # reduce is a builtin only in Python 2
def remove_consecutive(s, symbol='.'):
    def _remover(x, y):
        if y == symbol and x[-1:] == y:
            return x
        else:
            return x + y
    return reduce(_remover, s, '')
for s in 'abcdefg', '.a.', '..aa..', '..aa...b...c.d.e.f.g.....', '.', '..', '...', '':
    print(remove_consecutive(s))
Output
abcdefg
.a.
.aa.
.aa.b.c.d.e.f.g.
.
.
.
Kind of complicated, but it works and it's being done in a single loop:
import itertools
def remove_consecutive(s, c='.'):
    return ''.join(
        itertools.chain.from_iterable(
            c if k else g
            for k, g in itertools.groupby(s, c.__eq__)
        )
    )

How to efficiently match regex in python

I am writing a code to match the US phone number format
So it should match:
123-333-1111
(123)111-2222
123-2221111
But should not match
1232221111
matchThreeDigits = r"(?:\s*\(?[\d]{3}\)?\s*)"
matchFourDigits = r"(?:\s*[\d]{4}\s*)"
phoneRegex = '('+ '('+ matchThreeDigits + ')' + '-?' + '('+ matchThreeDigits + ')' + '-?' + '(' + matchFourDigits + ')' +')';
matches = re.findall(re.compile(phoneRegex),line)
The problem is that I need to make sure at least one of () or '-' is present in the pattern (or else it could be a plain ten-digit number rather than a phone number). I don't want to do another pattern search, for efficiency reasons. Is there any way to accommodate this information in the regex pattern itself?
Something like this?
pattern = r'(\(?(\d{3})\)?(?P<A>-)?(\d{3})(?(A)-?|-)(\d{4}))'
Using it:
import re
regex = re.compile(pattern)
check = ['123-333-1111', '(123)111-2222', '123-2221111', '1232221111']
for number in check:
    match = regex.match(number)
    print(number, bool(match))
    if match:
        # show the numbers
        print('nums:', tuple(g for g in match.groups() if g and g.isalnum()))
Output:
123-333-1111 True
nums: ('123', '333', '1111')
(123)111-2222 True
nums: ('123', '111', '2222')
123-2221111 True
nums: ('123', '222', '1111')
1232221111 False
Note:
You requested an explanation of (?P<A>-) and (?(A)-?|-):
(?P<A>-) is a named capture group with the name A; the general form is (?P<NAME>...).
(?(A)-?|-) is a conditional group that checks whether the named group A captured something. If it did, the YES pattern (-?) is tried; otherwise the NO pattern (-) is tried. The general form is (?(NAME)YES|NO).
All of this can be learned from help(re) in the Python interpreter, or from a search for Python regular expressions.
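A smaller sketch of the same conditional construct, applied to a different, made-up case: an area code whose closing parenthesis is required only when an opening one was matched. The group name open and the variable area_code are illustrative and are not part of the pattern above.

```python
import re

# (?P<open>\()?   optionally capture an opening parenthesis as group "open"
# (?(open)\))     if "open" matched, require ")"; otherwise match nothing
area_code = re.compile(r'(?P<open>\()?\d{3}(?(open)\))')

for text in ['(123)', '123', '(123', '123)']:
    print(text, bool(area_code.fullmatch(text)))
# (123) True
# 123 True
# (123 False
# 123) False
```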
You can use the following regex:
regex = r'(?:\d{3}-|\(\d{3}\))\d{3}-?\d{4}'
assuming that (123)1112222 is acceptable.
The | acts as an or, and \( and \) escape ( and ), respectively.
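A quick check of this pattern against the numbers from the question, plus the (123)1112222 case mentioned above. The use of fullmatch (so that trailing characters are rejected) and the names samples and results are choices made for this sketch.

```python
import re

regex = re.compile(r'(?:\d{3}-|\(\d{3}\))\d{3}-?\d{4}')

samples = ['123-333-1111', '(123)111-2222', '123-2221111',
           '1232221111', '(123)1112222']

# Map each sample to whether the whole string matches the pattern.
results = {s: bool(regex.fullmatch(s)) for s in samples}

for number, ok in results.items():
    print(number, ok)
# 123-333-1111 True
# (123)111-2222 True
# 123-2221111 True
# 1232221111 False
# (123)1112222 True
```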
import re
phoneRegex = re.compile(r"(\({0,1}[\d]{3}\)(?=[\d]{3})|[\d]{3}-)([\d]{3}[-]{0,1}[\d]{4})")
numbers = ["123-333-1111", "(123)111-2222", "123-2221111", "1232221111", "(123)-111-2222"]
for number in numbers:
    print(bool(re.match(phoneRegex, number)))
Output
True
True
True
False
False
You can see an explanation to this regular expression here : http://regex101.com/r/bA4fH8
