This question already has answers here:
Python RegEx that matches char followed/preceded by same char but uppercase/lowercase
(2 answers)
Closed 3 years ago.
I have the following string string = DdCcaaBbbB. I want to delete all the combinations of the same letter that are of the following form, being x any letter: xX, Xx.
And I want to delete them one by one, in the example, first I would delete Dd, after Cc, Bb and finally bB.
What I have done so far is:
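import re
string = "DdCcaaBbbB"
for letter in string.lower():
    try:
        string = string.replace(re.search(letter + letter.upper(), string).group(), '')
    except AttributeError:  # re.search() returned None: no xX pair found
        try:
            string = string.replace(re.search(letter.upper() + letter, string).group(), '')
        except AttributeError:  # no Xx pair either
            pass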
But I am sure this is not the most pythonic way to do it. What has come up to my mind, and thus the question, is if I could combine the two patterns I am searching for. Any other suggestion or improvement is more than welcome!
I think you can do a case-insensitive regex search to find all combinations of the same two letters, then have a function check if they're of the xX or Xx format before deciding if it should be replaced (by nothing) or left alone.
import re

def replacer(match):
    text = match.group()
    # Delete the pair only when the two letters differ in case (xX or Xx)
    if (text[0].islower() and text[1].isupper()) or (text[0].isupper() and text[1].islower()):
        return ""
    return text

string = "DdCcaaBbbB"
pattern = r'([a-z])\1'
new_string = re.sub(pattern, replacer, string, flags=re.IGNORECASE)
There is a downside to this approach. Because the regex is matching case-insensitively, it won't let you test overlapping matches. So if you have an input string like 'BBbb', it will match the two capital Bs and then the two lowercase bs, replace neither pair, and never check the Bb pair in the middle.
Unfortunately I don't think regex can solve that problem, since it has no way to transform cases in the middle of its search. We're already a bit beyond the bounds of the most basic regular expression specifications, since we need to use a backreference to even get as far as we did.
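If cascading removal is acceptable, a non-regex alternative is a single pass with a stack. This is only a minimal sketch and not part of the answer above; the function name remove_case_pairs is hypothetical. Note that it also removes pairs that become adjacent only after an earlier removal, so 'BBbb' collapses to an empty string.

```python
def remove_case_pairs(text):
    """Remove adjacent pairs of the same letter in opposite cases (xX or Xx)."""
    stack = []
    for char in text:
        # A pair is the same letter in different cases: 'D' and 'd', but not 'aa'.
        if stack and stack[-1] != char and stack[-1].lower() == char.lower():
            stack.pop()
        else:
            stack.append(char)
    return "".join(stack)


print(remove_case_pairs("DdCcaaBbbB"))  # aa
```

Because each character is pushed and popped at most once, the pass is linear in the length of the input.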
This question already has answers here:
What do ^ and $ mean in a regular expression?
(2 answers)
Closed 2 years ago.
I've got a problem with carets and dollar signs in Python.
I want to find every word which starts with a number and ends with a letter
Here is what I've tried already:
import re
text = "Cell: 415kkk -555- 9999ll Work: 212-555jjj -0000"
phoneNumRegex = re.compile(r'^\d+\w+$')
print(phoneNumRegex.findall(text))
Result is an empty list:
[]
The result I want:
415kkk, 9999ll, 555jjj
Where is the problem?
Problems with your regex:
^...$ anchors the pattern to the start and end of the whole string, so it can only match when the entire string fits the pattern. Remove the anchors.
\w+ matches one or more word characters: letters (upper or lower case), digits, and the underscore '_'. Without the anchors, \d+\w+ would therefore also match a purely numeric run such as '5555' ('555' via \d+ and the final '5' via \w+) and add it to the result.
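Both problems can be seen directly on the sample text from the question. This is a small illustrative sketch that only uses re.findall:

```python
import re

text = "Cell: 415kkk -555- 9999ll Work: 212-555jjj -0000"

# Anchored: the pattern must span the entire string, so nothing matches here.
print(re.findall(r'^\d+\w+$', text))      # []

# The same anchored pattern does match when the string is a single token.
print(re.findall(r'^\d+\w+$', '415kkk'))  # ['415kkk']

# Unanchored: \w+ also accepts digits, so purely numeric runs are returned too.
print(re.findall(r'\d+\w+', text))
# ['415kkk', '555', '9999ll', '212', '555jjj', '0000']
```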
You need this instead:
import re
text = "Cell: 415kkk -555- 9999ll Work: 212-555jjj -0000"
phoneNumRegex = re.compile(r'\b\d+[a-zA-Z]+\b')
print(phoneNumRegex.findall(text))
Output:
['415kkk', '9999ll', '555jjj']
The '\b' anchors are word boundaries, so the pattern does not match a digit-letter run that is embedded in a longer word, such as '1111abcd' inside 'xyz1111abcd_'.
Readup:
re-syntax
regex101.com - Regextester website that can handle python syntax
This question already has answers here:
How to use regex to find all overlapping matches
(5 answers)
How do I get a substring of a string in Python? [duplicate]
(16 answers)
Closed 2 years ago.
I'm new to Python and to Regex. Here is my current problem, for which I have not managed to find any straight answer online.
I have a string of 5 or more characters, for which I need to search for all the possible combinations of 5 characters.
I wonder if it's doable with regular expressions (instead of, say, creating a list of all possible 5-character combinations and then testing them in loop with my string).
For example, let's say my string is "stackoverflow", I need an expression that could give me a list containing all the possible combinations of 5 successive letters, such as: ['stack', 'tacko', 'ackov', ...] (but not 'stcko' or 'wolfr', for example).
That's what I would try:
import re
word = "stackoverflow"
list = re.findall(r".....", word)
But printing this list would only give:
['stack', 'overf']
Thus it seems that a position can only be matched once, a 5-character combination cannot concern a position that has already been matched.
Could anyone help me better understand how regex work in this situation, and if my demand is even possible directly using regular expressions?
Thanks!
If the letters are always consecutive, this will work:
wd = "stackoverflow"
lst = [wd[i:i+5] for i in range(len(wd) - 4)]
print(lst)
Output
['stack', 'tacko', 'ackov', 'ckove', 'kover', 'overf', 'verfl', 'erflo', 'rflow']
I think you could just use a simple loop with a sliding window of size 5:
word = "stackoverflow"
result = []
for i in range(len(word) - 4):  # there are len(word) - 5 + 1 window start positions
    result.append(word[i:i+5])
print(result)
This is quite efficient: for a fixed window size it runs in O(n) time.
The reason is that, as its docstring says, findall returns only non-overlapping matches:
def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.
    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.
    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)
See the non-regex solutions in the other answers.
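For completeness, overlapping matches can still be obtained with a regex by putting the capturing group inside a lookahead. The lookahead itself consumes no characters, so the search advances one position at a time. A minimal sketch:

```python
import re

word = "stackoverflow"

# (?=...) is a zero-width lookahead; findall returns the group captured inside it.
windows = re.findall(r'(?=(.{5}))', word)
print(windows)
# ['stack', 'tacko', 'ackov', 'ckove', 'kover', 'overf', 'verfl', 'erflo', 'rflow']
```

For a plain fixed-size window, slicing is arguably still simpler; the lookahead form is mainly useful when the window has to satisfy a real pattern.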
I'm trying to find all instances of a specific substring (a!b2, as an example) and return them with the 4 characters that follow the substring match. These 4 following characters are always dynamic and can be any letter/digit/symbol.
I've tried searching, but it seems like the similar questions that are asked are requesting help with certain characters that can easily split a substring, but since the characters I'm looking for are dynamic, I'm not sure how to write the regex.
When using regex, you can use "." to dynamically match any character. Use {number} to specify how many characters to match, and use parentheses as in (.{number}) to specify that the match should be captured for later use.
>>> import re
>>> s = "a!b2foobar a!b2bazqux a!b2spam and eggs"
>>> print(re.findall("a!b2(.{4})", s))
['foob', 'bazq', 'spam']
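import re
s = "a!b2foobar a!b2bazqux"  # sample input, assumed for illustration
print(re.search(r'a!b2(.{4})', s).group(1))  # first match only: foob
.{4} matches any 4 characters except newlines.
group(0) is the complete match; group(1) is the text captured by the first pair of parentheses. You can read more about group ids in the re module documentation.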
If you're only looking for how to grab the following 4 characters using Regex, what you are probably looking to use is the curly brace indicator for quantity to match: '{}'.
The linked post goes into more detail, but essentially you would use [a-zA-Z0-9]{X,Y} or (.{X,Y}), where X and Y are the minimum and maximum number of characters you're looking for (in your case, you would only need {4}).
A more Pythonic way to solve this problem, however, would be to use string slicing and the index method.
E.g. given an input_string, when you find the substring at index i using index, you could use input_string[i+len(sub_str):i+len(sub_str)+4] to grab the characters that follow.
As an example,
input_string = 'abcdefg'
sub_str = 'abcd'
found_index = input_string.index(sub_str)
start_index = found_index + len(sub_str)
symbol = input_string[start_index: start_index + 4]
Outputs (to show it works with <4 as well): efg
Index also allows you to give start and end indexes for the search, so you could also use it in a loop if you wanted to find it for every sub string, with the start of the search index being the previous found index + 1.
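The loop described above could look roughly like this. It is a sketch under two assumptions: the helper name following_chars is hypothetical, and it uses str.find (which returns -1 when nothing is found) instead of str.index (which raises ValueError) so the loop can end without exception handling.

```python
def following_chars(text, sub, count=4):
    """Return the `count` characters that follow each occurrence of `sub`."""
    results = []
    start = text.find(sub)
    while start != -1:
        begin = start + len(sub)
        results.append(text[begin:begin + count])
        # Resume the search just after the previously found index.
        start = text.find(sub, start + 1)
    return results


print(following_chars("a!b2foobar a!b2bazqux a!b2spam and eggs", "a!b2"))
# ['foob', 'bazq', 'spam']
```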
In Python, I try to find the last position in an arbitrary string that does match a given pattern, which is specified as negative character set regex pattern. For example, with the string uiae1iuae200, and the pattern of not being a number (regex pattern in Python for this would be [^0-9]), I would need '8' (the last 'e' before the '200') as result.
What is the most pythonic way to achieve this?
As it's a little tricky to quickly find method documentation and the best suited method for something in the Python docs (due to method docs being somewhere in the middle of the corresponding page, like re.search() in the re page), the best way I quickly found myself is using re.search() - but the current form simply must be a suboptimal way of doing it:
import re
string = 'uiae1iuae200' # the string to investigate
len(string) - 1 - re.search(r'[^0-9]', string[::-1]).start()  # 8
I am not satisfied with this for two reasons:
- a) I need to reverse the string with [::-1] before searching it, and
- b) I also need to reverse the resulting position (subtracting it from len(string) - 1, because the string was reversed before).
There needs to be better ways for this, likely even with the result of re.search().
I am aware of re.search(...).end() over .start(), but re.search() seems to split the results into groups, for which I did not quickly find a not-cumbersome way to apply it to the last matched group. Without specifying the group, .start(), .end(), etc, seem to always match the first group, which does not have the position information about the last match. However, selecting the group seems to at first require the return value to temporarily be saved in a variable (which prevents neat one-liners), as I would need to access both the information about selecting the last group and then to select .end() from this group.
What's your pythonic solution to this? I would value being pythonic more than having the most optimized runtime.
Update
The solution should also work in corner cases, like 123 (no position that matches the regex), the empty string, etc. It should not crash, e.g. because of selecting the last index of an empty list. However, as even my ugly attempt above in the question would need more than one line for this, I guess a one-liner might be impossible (simply because one needs to check the return value of re.search() or re.finditer() before handling it). For this reason, I'll accept pythonic multi-line solutions to this question.
You can use re.finditer to extract start positions of all matches and return the last one from list. Try this Python code:
import re
print([m.start(0) for m in re.finditer(r'\D', 'uiae1iuae200')][-1])
Prints:
8
Edit:
To make the solution behave properly for all kinds of input, here is the updated code. It now takes two lines, because it has to check whether the list is empty: if it is, it prints None, otherwise the index value:
import re
arr = ['', '123', 'uiae1iuae200', 'uiae1iuae200aaaaaaaa']
for s in arr:
    lst = [m.start() for m in re.finditer(r'\D', s)]
    print(s, '-->', lst[-1] if len(lst) > 0 else None)
Prints the following, where if no such index is found then prints None instead of index:
--> None
123 --> None
uiae1iuae200 --> 8
uiae1iuae200aaaaaaaa --> 19
Edit 2:
As OP stated in his post, \d was only an example we started with, due to which I came up with a solution to work with any general regex. But, if this problem has to be really done with \d only, then I can give a better solution which would not require list comprehension at all and can be easily written by using a better regex to find the last occurrence of non-digit character and print its position. We can use .*(\D) regex to find the last occurrence of non-digit and easily print its index using following Python code:
import re
arr = ['', '123', 'uiae1iuae200', 'uiae1iuae200aaaaaaaa']
for s in arr:
    m = re.match(r'.*(\D)', s)
    print(s, '-->', m.start(1) if m else None)
Prints the string and their corresponding index of non-digit char and None if not found any:
--> None
123 --> None
uiae1iuae200 --> 8
uiae1iuae200aaaaaaaa --> 19
And as you can see, this code doesn't need to use any list comprehension and is better as it can just find the index by just one regex call to match.
But in case OP indeed meant it to be written using any general regex pattern, then my above code using comprehension will be needed. I can even write it as a function that can take the regex (like \d or even a complex one) as an argument and will dynamically generate a negative of passed regex and use that in the code. Let me know if this indeed is needed.
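A general-purpose version that accepts any pattern could be sketched as follows. The function name last_match_start is hypothetical; the sketch iterates over re.finditer and keeps only the final match, so no intermediate list is built and the empty cases fall through to None.

```python
import re


def last_match_start(pattern, string):
    """Return the start index of the last match of `pattern`, or None."""
    last = None
    for last in re.finditer(pattern, string):
        pass  # keep overwriting until the iterator is exhausted
    return last.start() if last is not None else None


for s in ['', '123', 'uiae1iuae200', 'uiae1iuae200aaaaaaaa']:
    print(repr(s), '-->', last_match_start(r'[^0-9]', s))
```

Like findall, finditer yields non-overlapping matches, which makes no difference for single-character patterns such as [^0-9].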
To me it seems that you just want the last position which matches a given pattern (in this case, the "not a number" pattern).
This is as pythonic as it gets:
import re
string = 'uiae1iuae200'
pattern = r'[^0-9]'
match = re.match(fr'.*({pattern})', string)
print(match.end(1) - 1 if match else None)
Output:
8
Or the exact same as a function and with more test cases:
import re

def last_match(pattern, string):
    match = re.match(fr'.*({pattern})', string)
    return match.end(1) - 1 if match else None

cases = [(r'[^0-9]', 'uiae1iuae200'), (r'[^0-9]', '123a'), (r'[^0-9]', '123'), (r'[^abc]', 'abcabc1abc'), (r'[^1]', '11eea11')]
for pattern, string in cases:
    print(f'{pattern}, {string}: {last_match(pattern, string)}')
Output:
[^0-9], uiae1iuae200: 8
[^0-9], 123a: 3
[^0-9], 123: None
[^abc], abcabc1abc: 6
[^1], 11eea11: 4
This does not look Pythonic because it's not a one-liner, and it uses range(len(foo)), but it's pretty straightforward and probably not too inefficient.
import re

def last_match(pattern, string):
    for i in range(1, len(string) + 1):
        substring = string[-i:]  # suffix of length i
        if re.match(pattern, substring):
            return len(string) - i
    return None  # no suffix matched
The idea is to iterate over the suffixes of string from the shortest to the longest, and to check if it matches pattern.
Since we're checking from the end, we know for sure that the first substring we meet that matches the pattern is the last.
This question already has answers here:
How to remove non-alphanumeric characters at the beginning or end of a string
(5 answers)
Closed 6 years ago.
I am wondering how I can implement a string check, where I want to make sure that the first (&last) character of the string is alphanumeric. I am aware of the isalnum, but how do I use this to implement this check/substitution?
So, I have a string like so:
st="-jkkujkl-ghjkjhkj*"
and I would want back:
st="jkkujkl-ghjkjhkj"
Thanks..
Though not exactly what you asked for, using str.strip should serve your purpose:
import string
st.strip(string.punctuation)
Out[174]: 'jkkujkl-ghjkjhkj'
You could use regex like shown below:
import re
# \W matches any non-word character; '_' counts as a word character, so it is added explicitly
# If characters from the set [\W_] appear at the start or end, replace them with ''
p = re.compile(r'^[\W_]+|[\W_]+$')
st="-jkkujkl-ghjkjhkj*"
print(p.sub('', st))
Output:
jkkujkl-ghjkjhkj
Edit:
If your special chars are in the set: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
@Abhijit's answer is much simpler and cleaner.
If you are not sure then this regex version is better.
You can use the following two expressions:
st = re.sub(r'^\W*', '', st)
st = re.sub(r'\W*$', '', st)
This will strip all non-word characters (anything other than letters, digits, and underscores) from the beginning and the end of the string, not just the first ones.
You could use a regular expression.
Something like this could work:
\w.*\w
However, I don't know how to do a regexp match in Python.
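In Python, the idea of matching from the first alphanumeric character to the last one could be sketched with re.search. The function name trim_to_alnum is hypothetical, and the pattern is a variation that uses [^\W_] instead of \w so that underscores are trimmed as well; the optional group lets it handle single-character strings.

```python
import re


def trim_to_alnum(text):
    """Return `text` from its first to its last alphanumeric character."""
    # [^\W_] means "a word character other than '_'", i.e. a letter or digit.
    # The greedy .* stretches to the last such character; DOTALL lets it cross newlines.
    match = re.search(r'[^\W_](?:.*[^\W_])?', text, flags=re.DOTALL)
    return match.group() if match else ''


print(trim_to_alnum("-jkkujkl-ghjkjhkj*"))  # jkkujkl-ghjkjhkj
```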
hint 1: ord() can convert a character to its character code
hint 2: lowercase ASCII letters are between 97 and 122 in ord()
hint 3: st[0] will return the first character of the string st; st[-1] will return the last
An exact answer to your question may be the following:
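def stringCheck(astring):
    if len(astring) < 2:  # avoid IndexError on '' and duplicating a single character
        return astring if astring.isalnum() else ''
    firstChar = astring[0] if astring[0].isalnum() else ''
    lastChar = astring[-1] if astring[-1].isalnum() else ''
    return firstChar + astring[1:-1] + lastChar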
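If more than one non-alphanumeric character may appear at either end, the same isalnum check can be applied in a loop. This is a sketch only; the function name strip_non_alnum is hypothetical.

```python
def strip_non_alnum(text):
    """Strip every leading and trailing character that is not alphanumeric."""
    start, end = 0, len(text)
    # Move the left edge forward past non-alphanumeric characters.
    while start < end and not text[start].isalnum():
        start += 1
    # Move the right edge backward past non-alphanumeric characters.
    while end > start and not text[end - 1].isalnum():
        end -= 1
    return text[start:end]


print(strip_non_alnum("--jkkujkl-ghjkjhkj*!"))  # jkkujkl-ghjkjhkj
```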