Remove strings with repeating characters [Python] - python

I need to determine if a string is composed of a certain repeating character, for example eeeee, 55555, or !!!.
I know this regex 'e{1,15}' can match eeeee but it obviously can't match 555. I tried [a-z0-9]{1-15} but it matches even the strings I don't need like Hello.
The solution doesn't have to be regex. I just can't think of any other way to do this.

A string consists of a single repeating character if and only if all characters in it are the same. You can easily test that by constructing a set out of the string: set('55555').
All characters are the same if and only if the set has size 1:
>>> len(set('55555')) == 1
True
>>> len(set('Hello')) == 1
False
>>> len(set('')) == 1
False
If you want to allow the empty string as well (set size 0), then use <= 1 instead of == 1.
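The set test can be wrapped in a small helper and used to filter a list. This is only a sketch: the name is_single_char and its allow_empty flag are made up for illustration and are not part of the answer above.

```python
def is_single_char(s, allow_empty=False):
    """Return True if s consists of one character repeated."""
    distinct = len(set(s))
    # An empty string gives an empty set (size 0).
    return distinct == 1 or (allow_empty and distinct == 0)

words = ['eeeee', '55555', '!!!', 'Hello', '']
# Keep only the strings that are NOT a single repeated character.
print([w for w in words if not is_single_char(w)])  # ['Hello', '']
```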

Regex solution (via re.search() function):
import re
s = 'eeeee'
print(bool(re.search(r'^(.)\1+$', s))) # True
s = 'ee44e'
print(bool(re.search(r'^(.)\1+$', s))) # False
^(.)\1+$ :
(.) - capture any character
\1+ - backreference to the previously captured group, repeated one or many times
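Two details of this pattern may matter in practice: \1+ requires at least two characters, and . does not match a newline by default. A possible variant, assuming a one-character string should also count, uses re.fullmatch with \1* and re.DOTALL. The names SINGLE_CHAR and is_repeated are illustrative only.

```python
import re

# (.) captures the first character; \1* allows zero or more repeats of it.
# re.DOTALL lets "." match a newline as well.
SINGLE_CHAR = re.compile(r'(.)\1*', re.DOTALL)

def is_repeated(s):
    # fullmatch succeeds only if the whole string fits the pattern.
    return SINGLE_CHAR.fullmatch(s) is not None

print(is_repeated('eeeee'))  # True
print(is_repeated('ee44e'))  # False
print(is_repeated('e'))      # True
```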

You do not have to use regex for this. A test of whether all characters in the string are the same produces the desired output:
s = "eee"
assert len(s) > 0
reference = s[0]
result = all([c==reference for c in s])
Or use set as Thomas showed, which is probably a better way.

Related

Python re.sub() is not replacing every match

I'm using Python 3 and I have two strings: abbcabb and abca. I want to remove every double occurrence of a single character. For example:
abbcabb should give c and abca should give bc.
I've tried the following regex (here):
(.)(.*?)\1
But, it gives wrong output for first string. Also, when I tried another one (here):
(.)(.*?)*?\1
But, this one again gives wrong output. What's going wrong here?
The python code is a print statement:
print(re.sub(r'(.)(.*?)\1', r'\g<2>', s)) # s is the string
It can be solved without regular expressions, like below:
>>> s1 = 'abca'
>>> ''.join([i for i in s1 if s1.count(i) == 1])
'bc'
>>> s = 'abbcabb'
>>> ''.join([i for i in s if s.count(i) == 1])
'c'
re.sub() doesn't perform overlapping replacements. After it replaces the first match, it starts looking after the end of the match. So when you perform the replacement on
abbcabb
it first replaces abbca with bbc. Then it replaces bb with an empty string. It doesn't go back and look for another match in bbc.
If you want that, you need to write your own loop.
while True:
    newS = re.sub(r'(.)(.*?)\1', r'\g<2>', s)
    if newS == s:
        break
    s = newS
print(newS)
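To see the non-overlapping behaviour described above, one can list the matches that a single pass actually finds. This is a small illustration using re.finditer on the question's first string, not part of the original answer.

```python
import re

pattern = re.compile(r'(.)(.*?)\1')
s = 'abbcabb'

# Each match starts where the previous one ended, so the "bbc" left
# behind by the first replacement is never scanned again.
for m in pattern.finditer(s):
    print(m.span(), m.group(0), '->', repr(m.group(2)))

print(pattern.sub(r'\g<2>', s))  # bbc
```

The two matches are abbca and bb, which is why a single re.sub() call returns bbc rather than c.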
Regular expressions don't seem to be the ideal solution:
they don't handle overlapping matches, so a loop is needed (like in this answer), and every pass creates a new string (performance suffers)
they're overkill here, we just need to count the characters
I like this answer, but using count repeatedly in a list comprehension loops over all elements each time.
It can be solved without regular expression and without O(n**2) complexity, only O(n) using collections.Counter
first count the characters of the string very easily & quickly
then filter the string testing if the count matches using the counter we just created.
like this:
import collections
s = "abbcabb"
cnt = collections.Counter(s)
s = "".join([c for c in s if cnt[c]==1])
(as a bonus, you can change the count to keep characters which have 2, 3, whatever occurrences)
EDIT: based on the comment exchange - if you're just concerned with the parity of the letter counts, then you don't want regex and instead want an approach like @jon's recommendation. (If you don't care about order, then a more performant approach with very long strings might use something like collections.Counter instead.)
My best guess as to what you're trying to match is: "one or more characters - call this subpattern A - followed by a different set of one or more characters - call this subpattern B - followed by subpattern A again".
You can use + as a shortcut for "one or more" (instead of specifying it once and then using * for the rest of the matches), but either way you need to get the subpatterns right. Let's try:
>>> import re
>>> pattern = re.compile(r'(.+?)(.+?)\1')
>>> pattern.sub(r'\g<2>', 'abbcabbabca')
'bbcbaca'
Hmm. That didn't work. Why? Because with the first pattern not being greedy, our "subpattern A" can just match the first a in the string - it does appear later, after all. So if we use a greedy match, Python will backtrack until it finds the longest match for subpattern A that still allows the A-B-A pattern to appear:
>>> pattern = re.compile(r'(.+)(.+?)\1')
>>> pattern.sub(r'\g<2>', 'abbcabbabca')
'cbc'
Looks good to me.
The site explains it well, hover and use the explanation section.
(.)(.*?)\1 does not remove or match every double occurrence. It matches one character, followed by anything in the middle, sandwiched until that same character is encountered again.
So, for abbcabb the "sandwiched" portion is the bbc between the two a's.
EDIT:
You can try something like this instead without regexes:
string = "abbcabb"
result = []
for i in string:
    if i not in result:
        result.append(i)
    else:
        result.remove(i)
print(''.join(result))
Note that this keeps the "last" odd occurrence of a character and not the first.
For the "first" occurrence, you should use a counter as suggested in this answer. Just change the condition to check for odd counts, e.g. count[letter] % 2 == 1.
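The counter-based variant with the odd-count condition might look like the sketch below. It assumes that "first occurrence" means keeping each odd-count character once, at the position where it first appears. The function name odd_count_chars is made up for illustration.

```python
from collections import Counter

def odd_count_chars(s):
    counts = Counter(s)   # one pass to count every character
    seen = set()
    out = []
    for c in s:           # second pass keeps the original order
        # Keep a character once, at its first position, if its count is odd.
        if counts[c] % 2 == 1 and c not in seen:
            seen.add(c)
            out.append(c)
    return ''.join(out)

print(odd_count_chars('abbcabb'))  # c
print(odd_count_chars('abca'))     # bc
```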

How to access the first character and integer pair in a Python string?

I have several strings in Python. Let's assume each string is associated with a variable. These strings are only composed of characters and integers:
one = '74A76B217C'
two = '8B7A1742B'
three = '8123A9A8B'
I would like a conditional in my code which checks these strings if 'A' exists first, and if so, return the integer.
So, in the example above, the first integer and first character is: for one, 74 and A; for two, 8 and B; for three, 8123 and A.
For the function, one would return True, and 74; two would be False, and three would be 8123 and A.
My problem is, I am not sure how to efficiently parse the strings in order to check for the first integer and character.
In Python, there are methods to check whether the character exists in the string, e.g.
if 'A' in one:
print('contains A')
But this doesn't take order into account. What is the most efficient way to access the first character and first integer, or at least check whether the first character to occur in a string is of a certain identity?
Try this as an alternative of regex:
def check(s):
    i = s.find('A')
    if i > 0 and s[:i].isdigit():
        return int(s[:i]), True
    return False
# check(one)   -> (74, True)
# check(two)   -> False
# check(three) -> (8123, True)
Use regex: ^(\d+)A
import re
def check_string(s):
    m_test = re.search(r"^(\d+)A", s)
    if m_test:
        return m_test.group(1)
See online regex tester:
https://regex101.com/r/LmSbly/1
Try a regular expression.
>>> import re
>>> re.search('[0-9]+', '74A76B217C').group(0)
'74'
>>> re.search('[A-Za-z]', '74A76B217C').group(0)
'A'
You can use a regex:
>>> re.match('^([0-9]+)A', string)
For example:
import re
for s in ['74A76B217C', '8B7A1742B', '8123A9A8B']:
    match = re.match('^([0-9]+)A', s)
    print(match is not None and match.group(1))
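If both the first integer and the first letter are wanted, a single pattern with two groups can return them together. This is a sketch that assumes each string starts with digits followed directly by an ASCII letter, as in the question's examples. The names FIRST_PAIR, first_pair and starts_with_a are illustrative.

```python
import re

# Leading run of digits, then the first letter right after it.
FIRST_PAIR = re.compile(r'(\d+)([A-Za-z])')

def first_pair(s):
    m = FIRST_PAIR.match(s)   # match() anchors at the start of s
    if m is None:
        return None
    return int(m.group(1)), m.group(2)

def starts_with_a(s):
    pair = first_pair(s)
    return pair is not None and pair[1] == 'A'

for s in ['74A76B217C', '8B7A1742B', '8123A9A8B']:
    print(first_pair(s), starts_with_a(s))
```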

Regular Expression (find matching characters in order)

Let us say that I have the following string variables:
welcome = "StackExchange 2016"
string_to_find = "Sx2016"
Here, I want to find the string string_to_find inside welcome using regular expressions. I want to see if each character in string_to_find comes in the same order as in welcome.
For instance, this expression would evaluate to True since the 'S' comes before the 'x' in both strings, the 'x' before the '2', the '2' before the 0, and so forth.
Is there a simple way to do this using regex?
The answer is rather simple. The .* combination matches zero or more characters (other than newlines, by default). For your purpose, you would put it between all the characters, as in S.*x.*2.*0.*1.*6. If this pattern matches, then the string obeys your condition.
For a general string you would insert the .* pattern between characters, also taking care of escaping special characters like literal dots, stars etc. that may otherwise be interpreted by regex.
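The escaping step can be handled with re.escape. The sketch below is one possible way to build the pattern for an arbitrary string; the function names are made up, and re.DOTALL is an added assumption so that .* can also span line breaks.

```python
import re

def subsequence_pattern(chars):
    # Escape each character so ".", "*" etc. are matched literally,
    # then join them with ".*".
    return '.*'.join(re.escape(c) for c in chars)

def contains_in_order(text, chars):
    return re.search(subsequence_pattern(chars), text, re.DOTALL) is not None

print(subsequence_pattern('Sx2016'))                      # S.*x.*2.*0.*1.*6
print(contains_in_order('StackExchange 2016', 'Sx2016'))  # True
print(contains_in_order('axxb', 'a.b'))                   # False
```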
This function might fit your needs:
import re
def check_string(text, pattern):
    # re.search, so the first character need not sit at the start of text
    return re.search('.*'.join(pattern), text)
'.*'.join(pattern) creates a pattern with all your characters separated by '.*'. For instance:
>>> ".*".join("Sx2016")
'S.*x.*2.*0.*1.*6'
Use wildcard matches with ., repeating with *:
expression = 'S.*x.*2.*0.*1.*6'
You can also assemble this expression with join():
expression = '.*'.join('Sx2016')
Or just find it without a regular expression, checking whether the location of each of string_to_find's characters within welcome proceeds in ascending order, handling the case where a character in string_to_find is not present in welcome by catching the ValueError:
>>> welcome = "StackExchange 2016"
>>> string_to_find = "Sx2016"
>>> try:
...     result = [welcome.index(c) for c in string_to_find]
... except ValueError:
...     result = None
...
>>> print(result and result == sorted(result))
True
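One caveat with index(): it always returns the first occurrence, so the check above can reject a valid sequence when characters repeat (for example, looking for "ca" in "abca"). A common alternative, sketched here with a made-up function name, consumes a single iterator over the text so each character is searched for only after the previous match.

```python
def is_subsequence(needle, haystack):
    it = iter(haystack)
    # "c in it" advances the iterator until it finds c, so every later
    # character is looked up only in the remaining part of haystack.
    return all(c in it for c in needle)

print(is_subsequence('Sx2016', 'StackExchange 2016'))  # True
print(is_subsequence('ca', 'abca'))                    # True
print(is_subsequence('6102', 'StackExchange 2016'))    # False
```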
Actually, for a sequence of characters like Sx2016, the pattern that best serves your purpose is a more specific one:
S[^x]*x[^2]*2[^0]*0[^1]*1[^6]*6
You can obtain this kind of check defining a function like this:
import re
def contains_sequence(text, seq):
    pattern = seq[0] + ''.join(map(lambda c: '[^' + c + ']*' + c, list(seq[1:])))
    return re.search(pattern, text)
This approach adds a layer of complexity but brings a couple of advantages as well:
It's the fastest one, because the regex engine walks the string only once, while the dot-star approach runs to the end of the string and backtracks each time a .* is used. Compare on the same string (~1k chars):
Negated class -> 12 steps
Dot star -> 4426 steps
It works on multiline input strings as well.
Example code
>>> sequence = 'Sx2016'
>>> inputs = ['StackExchange2015','StackExchange2016','Stack\nExchange\n2015','Stach\nExchange\n2016']
>>> [x + ': yes' if contains_sequence(x, sequence) else x + ': no' for x in inputs]
['StackExchange2015: no', 'StackExchange2016: yes', 'Stack\nExchange\n2015: no', 'Stach\nExchange\n2016: yes']

Python Regex for string matching

There are 2 rules that I am trying to match with regex. I've tried testing on various cases, giving me unwanted results.
Rules are as follows:
Find all strings that are numbers (integer, decimal, and negative values included)
Find all strings that have no numeric value. This is referring to special characters like !@#$%^&*()
So in my attempt to match these rules, I got this:
def rule(word):
    if re.match(r"\W", word):
        return True
    elif re.match(r"[-.\d]", word):
        return True
    else:
        return False
Input: output tests are as follows
word = '972.2' : True
word = '-88.2' : True
word = '8fdsf' : True
word = '86fdsf' : True I want this to be False
word = '&^(' : True
There were some more tests, but I just wanted to show the one I want to return False. It seems like it's matching just the first character, so I tried changing the regex expressions, but that made things worse.
As the documentation says, re.match will return a MatchObject which always evaluates to True whenever the start of the string is matched to the regex.
Thus, you need to use anchors in regex to make sure only whole string match counts, e.g. [-.\d]$ (note the dollar sign).
EDIT: plus what Max said - use + so your regex won't just match a single letter.
Your regexes (both of them) only look at the first character of your string. Change them to add +$ at the end in order to make sure your string is only made of the target characters.
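Putting the anchoring and + advice together, a possible version of rule() uses re.fullmatch, which only succeeds when the whole string fits the pattern. This is a sketch: the NUMBER pattern is one assumption about what counts as a number (optional minus sign, digits, optional decimal part), and the names NUMBER and SPECIAL are illustrative.

```python
import re

# Assumed number format: optional "-", then digits with an optional
# decimal part, or a bare fraction such as ".5".
NUMBER = re.compile(r'-?(?:\d+\.?\d*|\.\d+)')
# One or more characters that are neither letters nor digits.
SPECIAL = re.compile(r'[\W_]+')

def rule(word):
    # fullmatch() requires the entire word to match, not just its start.
    return bool(NUMBER.fullmatch(word) or SPECIAL.fullmatch(word))

for w in ['972.2', '-88.2', '8fdsf', '86fdsf', '&^(']:
    print(w, ':', rule(w))
```

With these patterns the first two and the last sample give True, while 8fdsf and 86fdsf give False.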
As I understand it, you need to exclude everything that doesn't fall under rule 1 or rule 2.
Try this:
import re
def rule(word):
    # True when the word contains no letters or underscores
    return re.search(r"[^\W\d\-.]+", word) is None
Results on provided samples:
972.2: True
-88.2: True
8fdsf: False
86fdsf: False
&^(: True

How to match specific characters only?

I am trying to match specific characters in a string:
^[23]*$
Here are my cases:
2233 --> Match
22 --> Not Match
33 --> Not Match
2435 --> Not Match
2322 --> Match
323 --> Match
I want to match these strings with the correct regular expression; that is, only cases 1, 5 and 6 should match.
Update:
What if I have more than two digits to match, as in patterns like 234 or 43? How can I match such a pattern against any string?
I want dynamic matching.
How about:
(2+3|3+2)[23]*$
String must either:
start with one or more 2s, contain a 3, followed by any mix of 2 or 3 only
start with one or more 3s, contain a 2, followed by any mix of 2 or 3 only
Update: to parameterize the pattern
To parameterize this pattern, you could do something like:
import re
x = 2
y = 3
pat = re.compile("(%s+%s|%s+%s)[%s%s]*$" % (x,y,y,x,x,y))
pat.match('2233')
Or a bit clearer, but longer:
pat = re.compile("({x}+{y}|{y}+{x})[{x}{y}]*$".format(x=2, y=3))
Or you could use Python template strings
Update: to handle more than two characters:
If you have more than two characters to test, then the regex gets unwieldy and my other answer becomes easier:
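def match(s,ch):
    # Python 3: str.maketrans('', '', ch) builds a table that deletes every character in ch
    return all([c in s for c in ch]) and len(s.translate(str.maketrans('', '', ch))) == 0
match('223344','234') # True
match('2233445', '234') # False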
Another update: use sets
I wasn't entirely happy with the above solution, as it seemed a bit ad-hoc. Eventually I realized it's just a set comparison - we just want to check that the input consists of a fixed set of characters:
def match(s,ch):
return set(s) == set(ch)
If you want to match strings containing both 2 and 3, but no other characters you could use lookaheads combined with what you already have:
^(?=.*2)(?=.*3)[23]*$
The lookaheads (?=.*2) and (?=.*3) assert the presence of 2 and 3, and ^[23]*$ matches the actual string to only those two characters.
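The lookahead idea extends to any set of required characters by generating one lookahead per character. The sketch below is one way to do that; build_matcher is a hypothetical helper, and it uses fullmatch() in place of the explicit ^ and $ anchors.

```python
import re

def build_matcher(chars):
    """Compile a pattern that accepts strings made up only of `chars`,
    with every character of `chars` present at least once."""
    if not chars:
        raise ValueError("chars must not be empty")
    escaped = [re.escape(c) for c in chars]
    # One (?=.*c) lookahead per required character.
    lookaheads = ''.join('(?=.*%s)' % e for e in escaped)
    # The character class restricts the string to the allowed characters.
    char_class = '[%s]' % ''.join(escaped)
    return re.compile(lookaheads + char_class + '+')

pat = build_matcher('23')
for s in ['2233', '22', '33', '2435', '2322', '323']:
    print(s, '-->', 'Match' if pat.fullmatch(s) else 'Not Match')
```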
I know you asked for a solution using regex (which I have posted), but here's an alternative approach:
def match(s):
    # Python 3: str.maketrans('', '', '23') builds a table that deletes 2s and 3s
    return '2' in s and '3' in s and len(s.translate(str.maketrans('', '', '23'))) == 0
We check that the string contains both desired characters, then translate them both to empty strings, then check that there's nothing left (i.e. we only had 2s and 3s).
This approach can easily be extended to handle more than two characters, using the all function, and a list comprehension:
def match(s,ch):
    return all([c in s for c in ch]) and len(s.translate(str.maketrans('', '', ch))) == 0
which would be used as follows:
match('223344','234') # True
match('2233445', '234') # False
Try this: (both 2 and 3 should exist)
^(?:[2]+[3]+[23]*|[3]+[2]+[23]*)$
^(?:2[23]*3[23]*|3[23]*2[23]*)$
I think this will do it. Look for either a 2 at the start, then a 3 has to appear somewhere (surrounded by as many other 2s and 3s as needed). Or vice versa, with 3 at the start and a 2 somewhere.
Should start with 2 or 3 followed by 2 or more occurrence of 2 or 3
^[23][23]{2,}$
