I have this regex:
"\w{4}[A-D]{1}[a-d]*\s*"
How can I repeat the part of [A-D]{1}[a-d]*\s* several time with something like *?
So if I have the expression:
"Bed0Dabc Babc Cabb99rrAbaaaa Daa6ab"
the regex will give me:
"Bed0Dabc Babc Cabb"
"99rrAbaaaa Daa"
Your regex is invalid and lacks "\" at start, your desired output is also invalid and second string should be "99rrAbaaaa Daa".
I believe what you mean is groups, this is a pretty basic concept though, you should probably read more about regular expressions before using them.
The desired regex:
\w{4}([A-D][a-d]*\s*)+
You should add the \s to the set of characters.
import re
data = 'Bed0Dabc Babc Cabb99rrAbaaaa Daa6ab'
pattern = r'\w{4}(?:[A-D][a-d\s]*)+'
matches = re.findall(pattern, data)
The result:
['Bed0Dabc Babc Cabb', '99rrAbaaaa Daa']
The ?: at the start of the group defines a non-capturing group. If you omit it your result will look like this.
['Cabb', 'Daa']
Related
I have the following regex:
r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
When I apply this to a text string with, let's say,
"this is www.website1.com and this is website2.com", I get:
['www.website1.com']
['website.com']
How can i modify the regex to exclude the 'www', so that I get 'website1.com' and 'website2.com? I'm missing something pretty basic ...
Try this one (thanks #SunDeep for the update):
\s(?:www.)?(\w+.com)
Explanation
\s matches any whitespace character
(?:www.)? non-capturing group, matches www. 0 or more times
(\w+.com) matches any word character one or more times, followed by .com
And in action:
import re
s = 'this is www.website1.com and this is website2.com'
matches = re.findall(r'\s(?:www.)?(\w+.com)', s)
print(matches)
Output:
['website1.com', 'website2.com']
A couple notes about this. First of all, matching all valid domain names is very difficult to do, so while I chose to use \w+ to capture for this example, I could have chosen something like: [a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z]{2,}.
This answer has a lot of helpful info about matching domains:
What is a regular expression which will match a valid domain name without a subdomain?
Next, I only look for .com domains, you could adjust my regular expression to something like:
\s(?:www.)?(\w+.(com|org|net))
To match whichever types of domains you were looking for.
Here a try :
import re
s = "www.website1.com"
k = re.findall ( '(www.)?(.*?)$', s, re.DOTALL)[0][1]
print(k)
O/P like :
'website1.com'
if it is s = "website1.com" also it will o/p like :
'website1.com'
Can you please get me python regex that can match
9am, 5pm, 4:30am, 3am
Simply saying - it has the list of times in csv format
I know the pattern for time, here it is:
'^(\\d{1,2}|\\d{1,2}:\\d{1,2})(am|pm)$'
^(\d+(:\d+)?(am|pm)(, |$))+ will work for you.
Demo here
If you have a regex X and you want a list of them separated by comma and (optional) spaces, it's a simple matter to do:
^X(,\s*X)*$
The X is, of course, your current search pattern sans anchors though you could adapt that to be shorter as well. To my mind, a better pattern for the times would be:
\d{1,2}(:\d{2})?[ap]m
meaning that the full pattern for what you want would be:
^\d{1,2}(:\d{2})?[ap]m(,\s*\d{1,2}(:\d{2})?[ap]m)*$
You can use re.findall() to get all the matches for a given regex
>>> str = "hello world 9am, 5pm, 4:30am, 3am hai"
>>> re.findall(r'\d{1,2}(?::\d{1,2})?(?:am|pm)', str)
['9am', '5pm', '4:30am', '3am']
What it does?
\d{1,2} Matches one or two digit
(?::\d{1,2}) Matches : followed by one ore 2 digits. The ?: is to prevent regex from capturing the group.
The ? at the end makes this part optional.
(?:am|pm) Match am or pm.
Use the following regex pattern:
tstr = '9am, 5pm, 4:30am, 3amsdfkldnfknskflksd hello'
print(re.findall(r'\b\d+(?::\d+)?(?:am|pm)', tstr))
The output:
['9am', '5pm', '4:30am', '3am']
Try this,
((?:\d?\d(?:\:?\d\d?)?(?:am|pm)\,?\s?)+)
https://regex101.com/r/nkcWt5/1
I want to match a string contained in a pair of either single or double quotes. I wrote a regex pattern as so:
pattern = r"([\"\'])[^\1]*\1"
mytext = '"bbb"ccc"ddd'
re.match(pattern, mytext).group()
The expected output would be:
"bbb"
However, this is the output:
"bbb"ccc"
Can someone explain what's wrong with the pattern above? I googled and found the correct pattern to be:
pattern = r"([\"\'])[^\1]*?\1"
However, I don't understand why I must use ?.
In your regex
([\"'])[^\1]*\1
Character class is meant for matching only one character. So your use of [^\1] is incorrect. Think, what would have have happened if there were more than one characters in the first capturing group.
You can use negative lookahead like this
(["'])((?!\1).)*\1
or simply with alternation
(["'])(?:[^"'\\]+|\\.)*\1
or
(?<!\\)(["'])(?:[^"'\\]+|\\.)*\1
if you want to make sure "b\"ccc" does not matches in string bb\"b\"ccc"
You should use a negative lookahead assertion. And I assume there won't be any escaped quotes in your input string.
>>> pattern = r"([\"'])(?:(?!\1).)*\1"
>>> mytext = '"bbb"ccc"ddd'
>>> re.search(pattern, mytext).group()
'"bbb"'
You can use:
pattern = r"[\"'][^\"']*[\"']"
https://regex101.com/r/dO0cA8/1
[^\"']* will match everything that isn't " or '
I have a more challenging task, but first I am faced with this issue. Given a string s, I want to extract all the groups of characters marked by some delimiter, e.g. parentheses. How can I accomplish this using regular expressions (or any Pythonic way)?
import re
>>> s = '(3,1)-[(7,2),1,(a,b)]-8a'
>>> pattern = r'(\(.+\))'
>>> re.findall(pattern, s).group() # EDITED: findall vs. search
['(3,1)-[(7,2),1,(a,b)']
# Desire result
['(3,1)', '(7,2)', '(a,b)']
Use findall() instead of search(). The former finds all occurences, the latter only finds the first.
Use the non-greedy ? operator. Otherwise, you'll find a match starting at the first ( and ending at the final ).
Note that regular expressions aren't a good tool for finding nested expressions like: ((1,2),(3,4)).
import re
s = '(3,1)-[(7,2),1,(a,b)]-8a'
pattern = r'(\(.+?\))'
print re.findall(pattern, s)
Use re.findall()
import re
data = '(3,1)-[(7,2),1,(a,b)]-8a'
found = re.findall('(\(\w,\w\))', data)
print found
Output:
['(3,1)', '(7,2)', '(a,b)']
I try to change string like s='2.3^2+3^3-√0.04*2+√4',
where 2.3^2 has to change to pow(2.3,2), 3^3 - pow(3,3), √0.04 - sqrt(0.04) and
√4 - sqrt(4).
s='2.3^2+3^3-√0.04*2+√4'
patt1='[0-9]+\.[0-9]+\^[0-9]+|[0-9]+\^[0-9]'
patt2='√[0-9]+\.[0-9]+|√[0-9]+'
idx1=re.findall(patt1, s)
idx2=re.findall(patt2, s)
idx11=[]
idx22=[]
for i in range(len(idx1)):
idx11.append('pow('+idx1[i][:idx1[i].find('^')]+','+idx1[i][idx1[i].find('^')+1:]+')')
for i in range(len(idx2)):
idx22.append('sqrt('+idx2[i][idx2[i].find('√')+1:]+')')
for i in range(len(idx11)):
s=re.sub(idx1[i], idx11[i], s)
for i in range(len(idx22)):
s=re.sub(idx2[i], idx22[i], s)
print(s)
Temp results:
idx1=['2.3^2', '3^3']
idx2=['√0.04', '√4']
idx11=['pow(2.3,2)', 'pow(3,3)']
idx22=['sqrt(0.04)', 'sqrt(4)']
but string result:
2.3^2+3^3-sqrt(0.04)*2+sqrt(4)
Why calculating 'idx1' is right, but re.sub don't insert this value into string ?
(sorry for my english:)
Try this using only re.sub()
Input string:
s='2.3^2+3^3-√0.04*2+√4'
Replacing for pow()
s = re.sub("(\d+(?:\.\d+)?)\^(\d+)", "pow(\\1,\\2)", s)
Replacing for sqrt()
s = re.sub("√(\d+(?:\.\d+)?)", "sqrt(\\1)", s)
Output:
pow(2.3,2)+pow(3,3)-sqrt(0.04)*2+sqrt(4)
() means group capture and \\1 means first captured group from regex match. Using this link you can get the detail explanation for the regex.
I've only got python 2.7.5 but this works for me, using str.replace rather than re.sub. Once you've gone to the effort of finding the matches and constructing their replacements, this is a simple find and replace job:
for i in range(len(idx11)):
s = s.replace(idx1[i], idx11[i])
for i in range(len(idx22)):
s = s.replace(idx2[i], idx22[i])
edit
I think you're going about this in quite a long-winded way. You can use re.sub in one go to make these changes:
s = re.sub('(\d+(\.\d+)?)\^(\d+)', r'pow(\1,\3)', s)
Will substitute 2.3^2+3^3 for pow(2.3,2)+pow(3,3) and:
s = re.sub('√(\d+(\.\d+)?)', r'sqrt(\1)', s)
Will substitute √0.04*2+√4 to sqrt(0.04)*2+sqrt(4)
There's a few things going on here that are different. Firstly, \d, which matches a digit, the same as [0-9]. Secondly, the ( ) capture whatever is inside them. In the replacement, you can refer to these captured groups by the order in which they appear. In the pow example I'm using the first and third group that I have captured.
The prefix r before the replacement string means that the string is to be treated as "raw", so characters are interpreted literally. The groups are accessed by \1, \2 etc. but because the backslash \ is an escape character, I would have to escape it each time (\\1, \\2, etc.) without the r.