Regex for the repeated pattern - python

Can you please get me python regex that can match
9am, 5pm, 4:30am, 3am
Simply saying - it has the list of times in csv format
I know the pattern for time, here it is:
'^(\\d{1,2}|\\d{1,2}:\\d{1,2})(am|pm)$'

^(\d+(:\d+)?(am|pm)(, |$))+ will work for you.
Demo here

If you have a regex X and you want a list of them separated by comma and (optional) spaces, it's a simple matter to do:
^X(,\s*X)*$
The X is, of course, your current search pattern sans anchors though you could adapt that to be shorter as well. To my mind, a better pattern for the times would be:
\d{1,2}(:\d{2})?[ap]m
meaning that the full pattern for what you want would be:
^\d{1,2}(:\d{2})?[ap]m(,\s*\d{1,2}(:\d{2})?[ap]m)*$

You can use re.findall() to get all the matches for a given regex
>>> str = "hello world 9am, 5pm, 4:30am, 3am hai"
>>> re.findall(r'\d{1,2}(?::\d{1,2})?(?:am|pm)', str)
['9am', '5pm', '4:30am', '3am']
What it does?
\d{1,2} Matches one or two digit
(?::\d{1,2}) Matches : followed by one ore 2 digits. The ?: is to prevent regex from capturing the group.
The ? at the end makes this part optional.
(?:am|pm) Match am or pm.

Use the following regex pattern:
tstr = '9am, 5pm, 4:30am, 3amsdfkldnfknskflksd hello'
print(re.findall(r'\b\d+(?::\d+)?(?:am|pm)', tstr))
The output:
['9am', '5pm', '4:30am', '3am']

Try this,
((?:\d?\d(?:\:?\d\d?)?(?:am|pm)\,?\s?)+)
https://regex101.com/r/nkcWt5/1

Related

Parsing based on pattern not at the beginning

I want to extract the number before "2022" in a set of strings possibly. I current do
a= mystring.strip().split("2022")[0]
and, for instance, when mystring=' 1020220519AX', this gives a = '10'. However,
mystring.strip().split("2022")[0]
fails when mystring=' 20220220519AX' to return a='202'. Therefore, I want the code to split the string on "2022" that is not at the beginning non-whitespace characters in the string.
Can you please guide with this?
Use a regular expression rather than split().
import re
mystring = ' 20220220519AX'
match = re.search(r'^\s*(\d+?)2022', mystring)
if match:
print(match.group(1))
^\s* skips over the whitespace at the beginning, then (\d+?) captures the following digits up to the first 2022.
You can tell a regex engine that you want all the digits before 2022:
r'\d+(?=2022)'
Like .split(), a regex engine is 'greedy' by default - 'greedy' here means that as soon as it can take something that it is instructed to take, it will take that and it won't try another option, unless the rest of the expression cannot be made to work.
So, in your case, mystring.strip().split("2022") splits on the first 2020 it can find and since there's nothing stopping it, that is the result you have to work with.
Using regex, you can even tell it you're not interested in the 2022, but in the numbers before it: the \d+ will match as long a string of digits it can find (greedy), but the (?=2022) part says it must be followed by a literal 2022 to be a match (and that won't be part of the match, a 'positive lookahead').
Using something like:
import re
mystring = ' 20220220519AX'
print(re.findall(r'\d+(?=2022)', mystring))
Will show you all consecutive matches.
Note that for a string like ' 920220220519AX 12022', it will find ['9202', '1'] and only that - it won't find all possible combinations of matches. The first, greedy pass through the string that succeeds is the answer you get.
You could split() asserting not the start of the string to the left after using strip(), or you can get the first occurrence of 1 or more digits from the start of the string, in case there are more occurrences of 2022
import re
strings = [
' 1020220519AX',
' 20220220519AX'
]
for s in strings:
parts = re.split(r"(?<!^)2022", s.strip())
if parts:
print(parts[0])
for s in strings:
m = re.match(r"\s*(\d+?)2022", s)
if m:
print(m.group(1))
Both will output
10
202
Note that the split variant does not guarantee that the first part consists of digits, it is only splitted.
If the string consists of only word characters, splitting on \B2022 where \B means non a word boundary, will also prevent splitting at the start of the example string.

Use python 3 regex to match a string in double quotes

I want to match a string contained in a pair of either single or double quotes. I wrote a regex pattern as so:
pattern = r"([\"\'])[^\1]*\1"
mytext = '"bbb"ccc"ddd'
re.match(pattern, mytext).group()
The expected output would be:
"bbb"
However, this is the output:
"bbb"ccc"
Can someone explain what's wrong with the pattern above? I googled and found the correct pattern to be:
pattern = r"([\"\'])[^\1]*?\1"
However, I don't understand why I must use ?.
In your regex
([\"'])[^\1]*\1
Character class is meant for matching only one character. So your use of [^\1] is incorrect. Think, what would have have happened if there were more than one characters in the first capturing group.
You can use negative lookahead like this
(["'])((?!\1).)*\1
or simply with alternation
(["'])(?:[^"'\\]+|\\.)*\1
or
(?<!\\)(["'])(?:[^"'\\]+|\\.)*\1
if you want to make sure "b\"ccc" does not matches in string bb\"b\"ccc"
You should use a negative lookahead assertion. And I assume there won't be any escaped quotes in your input string.
>>> pattern = r"([\"'])(?:(?!\1).)*\1"
>>> mytext = '"bbb"ccc"ddd'
>>> re.search(pattern, mytext).group()
'"bbb"'
You can use:
pattern = r"[\"'][^\"']*[\"']"
https://regex101.com/r/dO0cA8/1
[^\"']* will match everything that isn't " or '

How i can do this regex simple?

I have this regex:
"\w{4}[A-D]{1}[a-d]*\s*"
How can I repeat the part of [A-D]{1}[a-d]*\s* several time with something like *?
So if I have the expression:
"Bed0Dabc Babc Cabb99rrAbaaaa Daa6ab"
the regex will give me:
"Bed0Dabc Babc Cabb"
"99rrAbaaaa Daa"
Your regex is invalid and lacks "\" at start, your desired output is also invalid and second string should be "99rrAbaaaa Daa".
I believe what you mean is groups, this is a pretty basic concept though, you should probably read more about regular expressions before using them.
The desired regex:
\w{4}([A-D][a-d]*\s*)+
You should add the \s to the set of characters.
import re
data = 'Bed0Dabc Babc Cabb99rrAbaaaa Daa6ab'
pattern = r'\w{4}(?:[A-D][a-d\s]*)+'
matches = re.findall(pattern, data)
The result:
['Bed0Dabc Babc Cabb', '99rrAbaaaa Daa']
The ?: at the start of the group defines a non-capturing group. If you omit it your result will look like this.
['Cabb', 'Daa']

Regular expression for repeating sequence

I'd like to match three-character sequences of letters (only letters 'a', 'b', 'c' are allowed) separated by comma (last group is not ended with comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However it doesn't work correctly, because for above example:
>>> print bool(re.match('([abc][abc][abc],)+', "abc,defx,df")) # defx in second group
True
>>> print bool(re.match('([abc][abc][abc],)+', "axc,defx,df")) # 'x' in first group
False
It seems only to check first group of three letters but it ignores the rest. How to write this regular expression correctly?
Try following regex:
^[abc]{3}(,[abc]{3})*$
^...$ from the start till the end of the string
[...] one of the given character
...{3} three time of the phrase before
(...)* 0 till n times of the characters in the brackets
What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$'
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
You need to iterate over sequence of found values.
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
print match.group('value')
So the regex to check if the string matches pattern will be
data_string = "abc,bca,df"
match = re.match(r'^([abc]{3}(,|$))+', data_string)
if match:
print "data string is correct"
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match a string containing three characters of [abc] followed by a comma one ore more times anywhere in the string. So the most important part is to make sure that there is nothing more in the string - as scessor suggests with adding ^ (start of string) and $ (end of string) to the regular expression.
An alternative without using regex (albeit a brute force way):
>>> def matcher(x):
total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
for i in x.split(','):
if i not in total:
return False
return True
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplet of letters a,b,and c:
for ss in ("abc,bbc,abb,baa,bbb",
"acc",
"abc,bbc,abb,bXa,bbb",
"abc,bbc,ab,baa,bbb"):
print ss,' ',bool(re.match('([abc]{3},?)+\Z',ss))
result
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means: the end of the string. Its presence obliges the match to be until the very end of the string
By the way, I like the form of Sonya too, in a way it is clearer:
bool(re.match('([abc]{3},)*[abc]{3}\Z',ss))
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...) like contruct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group (SO post).
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See a classical issue described at re.findall behaves weird post, for example, where re.findall and all other regex methods using this function behind the scenes only return captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that this pattern has match groups. To actually get the groups, use str.extract. and
the Series.str.extract, Series.str.extractall and Series.str.findall will behave as re.findall.

Regex to ensure group match doesn't end with a specific character

I'm having trouble coming up with a regular expression to match a particular case. I have a list of tv shows in about 4 formats:
Name.Of.Show.S01E01
Name.Of.Show.0101
Name.Of.Show.01x01
Name.Of.Show.101
What I want to match is the show name. My main problem is that my regex matches the name of the show with a preceding '.'. My regex is the following:
"^([0-9a-zA-Z\.]+)(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3})"
Some Examples:
>>> import re
>>> SHOW_INFO = re.compile("^([0-9a-zA-Z\.]+)(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3})")
>>> match = SHOW_INFO.match("Name.Of.Show.S01E01")
>>> match.groups()
('Name.Of.Show.', 'S01E01')
>>> match = SHOW_INFO.match("Name.Of.Show.0101")
>>> match.groups()
('Name.Of.Show.0', '101')
>>> match = SHOW_INFO.match("Name.Of.Show.01x01")
>>> match.groups()
('Name.Of.Show.', '01x01')
>>> match = SHOW_INFO.match("Name.Of.Show.101")
>>> match.groups()
('Name.Of.Show.', '101')
So the question is how do I avoid the first group ending with a period? I realize I could simply do:
var.strip(".")
However, that doesn't handle the case of "Name.Of.Show.0101". Is there a way I could improve the regex to handle that case better?
Thanks in advance.
I think this will do:
>>> regex = re.compile(r'^([0-9a-z.]+)\.(S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}x[0-9]{2})$', re.I)
>>> regex.match('Name.Of.Show.01x01').groups()
('Name.Of.Show', '01x01')
>>> regex.match('Name.Of.Show.101').groups()
('Name.Of.Show', '101')
ETA: Of course, if you're just trying to extract different bits from trusted strings you could just use string methods:
>>> 'Name.Of.Show.101'.rpartition('.')
('Name.Of.Show', '.', '101')
So the only real restriction on the last group is that it doesn’t contain a dot? Easy:
^(.*?)(\.[^.]+)$
This matches anything, non-greedily. The important part is the second group, which starts with a dot and then matches any non-dot character until the end of the string.
This works with all your test cases.
It seems like the problem is that you haven't specified that the period before the last group is required, so something like ^([0-9a-zA-Z\.]+)\.(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3}) might work.
I believe this will do what you want:
^([0-9a-z\.]+)\.(?:S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}(?:x[0-9]+)?)$
I tested this against the following list of shows:
30.Rock.S01E01
The.Office.0101
Lost.01x01
How.I.Met.Your.Mother.101
If those 4 cases are representative of the types of files you have, then that regex should place the show title in its own capture group and toss away the rest. This filter is, perhaps, a bit more restrictive than some others, but I'm a big fan of matching exactly what you need.
If the last part never contains a dot: ^(.*)\.([^\.]+)$

Categories

Resources