Regex matching returns several capture blocks - python

I want to extract the experience from a line of text. It may contain some variation of years and months. I tried to build the regex from two non-capturing blocks, but it ends up giving me several matches.
Work Experience: 15 years 2 months
regex is:
((?:\d{1,3}(?:\.)?(?:\d{1})?\s+year(?:s)?\s+)?(?:\d{1,3}\s+month(?:s)?)?)
It captures the string that I want to find, but it returns spurious matches as well.
One way is to simply join all the matches, since the rest of them are '', but that would not do justice to good coding practice.
Can someone help me figure out where I went wrong?
Apologies,
I missed one scenario, which has put everyone off track. There are also strings like these:
2 Months
1 year 3 months
1.5 year
15 year 2 months

The pattern also matches at the position before and after each character, because all the parts in the pattern are optional.
You can write parts like (?:s)? as just s?, and you can omit {1}.
If you don't want to match empty strings, you could either match an optional year part followed by the months part, or match the years part on its own.
You could either use a case-insensitive match or match both the lowercase and the uppercase char using a character class such as [yY].
As you only want the match, you can omit the capturing group.
\b(?:\d+(?:\.\d+)? years? )?\d{1,3} months?\b|\b\d+(?:\.\d+)? years?\b
Explanation
\b(?:\d+(?:\.\d+)? years? )? Match an optional year part with optional decimal part
\d{1,3} months?\b Match the month part with 1-3 digits
| Or
\b\d+(?:\.\d+)? years?\b Match the years part
Regex demo
Note that \s could also match a newline
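To see the difference in practice, here is a minimal Python sketch. It assumes the standard re module and the re.IGNORECASE flag (my choice, so that "2 Months" is covered); the names ALL_OPTIONAL and EXPERIENCE are just illustrative.

```python
import re

# The pattern from the question: every part is optional, so it can match
# the empty string at any position.
ALL_OPTIONAL = re.compile(
    r"((?:\d{1,3}(?:\.)?(?:\d{1})?\s+year(?:s)?\s+)?(?:\d{1,3}\s+month(?:s)?)?)"
)

# The pattern from this answer. re.IGNORECASE is an assumption added here
# so that "Months" and "Year" are matched too.
EXPERIENCE = re.compile(
    r"\b(?:\d+(?:\.\d+)? years? )?\d{1,3} months?\b|\b\d+(?:\.\d+)? years?\b",
    re.IGNORECASE,
)

samples = [
    "Work Experience: 15 years 2 months",
    "2 Months",
    "1 year 3 months",
    "1.5 year",
    "15 year 2 months",
]

for text in samples:
    # No capturing groups, so findall returns the whole matches.
    print(repr(text), "->", EXPERIENCE.findall(text))

# The original pattern also reports empty strings where nothing fits.
print(ALL_OPTIONAL.findall("ab 2 months"))
```

With the revised pattern each sample line yields exactly one match, while the all-optional pattern returns empty strings alongside "2 months".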

You can reduce your regex as below so that it captures only the group that is required. Here \1 will hold the required string.
This will also match strings separated by tabs and newlines.
^\D*((?:[\d.]+\s*[yY]ears?)?\s*(?:[\d.]+\s*[mM]onths?)?)
Demo
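A quick way to try this pattern in Python is sketched below. It assumes re.search and reads the result from group 1; the name PATTERN is just illustrative. Note that, because everything inside the group is optional, group 1 can still come back as an empty string when the line contains no experience at all.

```python
import re

# Pattern from this answer: skip leading non-digits, then capture the
# optional years part and the optional months part in group 1.
PATTERN = re.compile(r"^\D*((?:[\d.]+\s*[yY]ears?)?\s*(?:[\d.]+\s*[mM]onths?)?)")

for text in ["Work Experience: 15 years 2 months", "2 Months", "1.5 year"]:
    match = PATTERN.search(text)
    print(repr(text), "->", repr(match.group(1)))

# A line without any digits still matches, but with an empty group 1.
print(repr(PATTERN.search("no experience listed").group(1)))
```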

Related

When should we use groups in our regex? What is the real advantage here? [duplicate]

This question already has answers here:
Regular expression pipe confusion
(5 answers)
Closed 2 years ago.
I have two regexes. Both match American date formats. Here they are (I have highlighted the group I am talking about):
^(.*?)((0|1)?\d)-((0|1|2|3)?\d)-**(19|20\d\d)**(.*?)$
^(.*?)((0|1)?\d)-((0|1|2|3)?\d)-**((19|20)\d\d)**(.*?)$
Both match:
asasa12-12-1993.txt
asassa12-12-2010.txt
In the book he put 19|20 into its own group. Why?
AKX is almost right but it's more than that.
19|20\d\d will match either 19 on its own OR 20 followed by 2 digits.
But it will not match 19 followed by 2 digits as a single unit.
Have a look here: https://regex101.com/r/lvYGUb/3
You'll see that 2010 is a single match, whereas 19 is matched alone, without the 93. As a consequence, the 93 goes with the .txt group, which is probably not what you want.
In a similar way, consider this data file:
20 euros
20 €
Let's say you want to match 100% of both lines using a regex.
\d+ euros|€ won't work, because it means either a number followed by the word euros OR just the € sign alone.
But
\d+ (euros|€) will work.
So the purpose of the parentheses here is not to capture the group; they are just meant to put a boundary on the OR operator.
If you don't want those parentheses to capture the group, you can add ?: to make it a non-capturing group, like so:
^(.*?)((0|1)?\d)-((0|1|2|3)?\d)-((?:19|20)\d\d)(.*?)$
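The same point can be checked in Python. This is a small sketch using only the alternation part of the patterns, with re.search and re.fullmatch from the standard library; the variable names are just illustrative.

```python
import re

# Without a group, the alternation splits the WHOLE pattern: "19" or "20\d\d".
ungrouped = re.compile(r"19|20\d\d")
# With a (non-capturing) group, the alternation is limited to "19" or "20".
grouped = re.compile(r"(?:19|20)\d\d")

for text in ["asasa12-12-1993.txt", "asassa12-12-2010.txt"]:
    print(text, "->", ungrouped.search(text).group(), "vs", grouped.search(text).group())

# The euros example: only the grouped version matches both full lines.
for line in ["20 euros", "20 €"]:
    print(
        repr(line),
        bool(re.fullmatch(r"\d+ euros|€", line)),
        bool(re.fullmatch(r"\d+ (euros|€)", line)),
    )
```

For 1993 the ungrouped pattern stops after "19", while the grouped one returns the full year.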
My best guess is it's easier for humans to parse.
The first, (19|20\d\d), doesn't make it obvious whether the alternation is "19 or 20\d\d", whereas in ((19|20)\d\d) it's obvious that it's "19 or 20, then \d\d".

Matching a space between occurrences in Regex

I need assistance with matching spaces between subsequent matches in regex.
The example is as follows.
I want to match all of the following scenarios:
60 ml ( 1)
60ML (2 )
60ml(2) (a)
The regex I have used is:
(60\s?(?:ml)\s?(?:\w|\(.{0,3}\)){0,5})
link to the example: link to regex
The regex matches the first 2 examples, but not the instance where there is a space between (2) and (a).
Any guidance would be appreciated.
Your regex doesn't allow for spaces between the parenthesised groups (2) and (a) in your last example. You can add <space>* to it to allow it to do so. Note you cannot use \s* unless you are only matching a single value at a time, otherwise the fact that \s will match newline can cause the first match to go too far.
(60\s?ml\s?(?:\w|\(.{0,3}\) *){0,5})
Note that without anchors counting repetitions doesn't really make sense. For example, this regex will match both 60ML (2 )(a)(a)(a)(a) and 60ML (2 )(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a), returning 60ML (2 )(a)(a)(a)(a) in both cases. If that is not what you want, you will need to add an anchor to the end of the regex ($ perhaps) to prevent it matching the longer string.
Demo on regex101
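For reference, a rough Python equivalent of the comparison is below. It assumes the case-insensitive flag was enabled in the original demo (otherwise 60ML would not match), and the names original, fixed and anchored are just illustrative.

```python
import re

# Assumption: the original demo ran with the case-insensitive flag.
original = re.compile(r"(60\s?(?:ml)\s?(?:\w|\(.{0,3}\)){0,5})", re.IGNORECASE)
fixed = re.compile(r"(60\s?ml\s?(?:\w|\(.{0,3}\) *){0,5})", re.IGNORECASE)

for text in ["60 ml ( 1)", "60ML (2 )", "60ml(2) (a)"]:
    print(repr(text), "->", original.search(text).group(), "|", fixed.search(text).group())

# Without an anchor, {0,5} only limits how much is consumed, not what follows.
long_text = "60ML (2 )" + "(a)" * 12
print(fixed.search(long_text).group())

# Adding $ makes the repetition limit reject the longer string.
anchored = re.compile(fixed.pattern + "$", re.IGNORECASE)
print(anchored.search(long_text))
```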

Capture the number with comma or dot with regex

I have this regex:
https://regex101.com/r/o5gdDt/8
As you can see, this pattern
(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d,])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
can capture all numbers whose digits are separated by commas into groups of 3, in text like
"here is 100,100"
"23,456"
"1,435"
all numbers with 4 or more digits and no comma separators, like
2345
1234 " here is 123456"
and also this kind of number
65,656½
65,656½,
23,123½
The only tiny issue here is that if there is a comma (or dot) after the first two types, it cannot capture them. For example, it cannot capture
"here is 100,100,"
"23,456,"
"1,435,"
Unfortunately, there are a few numbers in the text that end with a comma. Can someone give me an idea of how to modify this to capture the above as well?
I have tried to do it, and the modified version is:
(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
Basically, I deleted the comma in (?![\d,]), but that causes another problem in my context:
it captures part of a number that is part of an equation, like this:
4,310,747,475x2
57,349,565,416,398x.
see here:
https://regex101.com/r/o5gdDt/10
I know this is kind of a special question. I would be happy to hear your ideas.
The main problem here is that (?![\d,]) fails any match followed by a digit or a comma, while you want to fail the match only when it is followed by a digit or by a comma plus a digit.
Replace (?![\d,]) with (?!,?\d).
Also, (?<!\S)(?<![\d,]) looks redundant, as (?<!\S) requires whitespace or the start of the string, and that is certainly not a digit or a comma. Use either (?<!\S) or (?<!\d)(?<!\d,), depending on your requirements.
Join the negative lookaheads with OR: (?!x)(?!/) => (?!x|/) => (?![x/]).
You want to avoid matching years, but you just fail all numbers that start with them, so 2020222 won't get matched. Add (?!\d) to the lookahead: (?!(?:1[2-9]\d\d|20[01]\d|2020)(?!\d)).
So, the pattern might look like
(?<!\S)(?:(?!(?:1[2-9]\d\d|20[01]\d|2020)(?!\d))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?!,?\d)[\u00BC-\u00BE\u2150-\u215E]?(?![x/])
See the regex demo.
IMPORTANT: You have [\u00BC-\u00BE\u2150-\u215E]?(?![x/]) at the end, a negative lookahead after an optional pattern. If the lookahead fails because an x or / follows the fraction, the engine will backtrack into the optional pattern and will most probably find a partial match. If you do not want to match 65,656 in 65,656½x, replace [\u00BC-\u00BE\u2150-\u215E]?(?![x/]) with (?![\u00BC-\u00BE\u2150-\u215E]?[x/])[\u00BC-\u00BE\u2150-\u215E]?.
See another regex demo.
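If you want to check this outside regex101, here is a minimal Python sketch. It assumes the standard re module (which accepts \uXXXX escapes in patterns) and builds the two variants from a shared prefix; the names FRACTION, HEAD, NUMBER and STRICT are just illustrative.

```python
import re

# Vulgar-fraction characters such as ½, as used in the patterns above.
FRACTION = r"[\u00BC-\u00BE\u2150-\u215E]"

# Shared part: left boundary, the two number alternatives, and the
# "not followed by a digit or a comma plus a digit" check.
HEAD = (
    r"(?<!\S)(?:(?!(?:1[2-9]\d\d|20[01]\d|2020)(?!\d))\d{4,}"
    + FRACTION
    + r"?|\d{1,3}(?:,\d{3})+)(?!,?\d)"
)

# Suggested pattern: optional fraction, then "not followed by x or /".
NUMBER = re.compile(HEAD + FRACTION + r"?(?![x/])")

# Stricter variant: check for x or / BEFORE consuming the optional fraction,
# so backtracking cannot produce a partial match.
STRICT = re.compile(HEAD + r"(?!" + FRACTION + r"?[x/])" + FRACTION + r"?")

tests = [
    "here is 100,100,",
    "23,456,",
    "here is 123456",
    "65,656\u00bd,",       # 65,656½,
    "4,310,747,475x2",
    "57,349,565,416,398x.",
    "2020222",
    "1991",
    "65,656\u00bdx",       # 65,656½x
]

for text in tests:
    print(repr(text), "->", NUMBER.findall(text), "| strict:", STRICT.findall(text))
```

The last test line is the one where the two variants differ: NUMBER falls back to 65,656, while STRICT finds nothing.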

Regex backreference to match opposite case

Before I begin — it may be worth stating, that: this technically does not have to be solved using a Regex, it's just that I immediately thought of a Regex when I started solving this problem, and I'm interested in knowing whether it's possible to solve using a Regex.
I've spent the last couple hours trying to create a Regex that does the following.
The regex must match a string that is ten characters long, iff the first five characters and last five characters are identical but each individual character is opposite in case.
In other words, if you take the first five characters, invert the case of each individual character, that should match the last five characters of the string.
For example, the regex should match abCDeABcdE, since the first five characters and the last five characters are the same, but each matching character is opposite in case. In other words, flip_case("abCDe") == "ABcdE"
Here are a few more strings that should match:
abcdeABCDE, abcdEABCDe, zYxWvZyXwV.
And here are a few that shouldn't match:
abcdeABCDZ, although the case is opposite, the strings themselves do not match.
abcdeABCDe, is a very close match, but should not match since the e's are not opposite in case.
Here is the first regex I tried, which is obviously wrong since it doesn't account for the case-swap process.
/([a-zA-Z]{5})\1/g
My next thought was whether the following is possible in a regex, but I've been reading several regex tutorials and I can't seem to find it anywhere.
/([A-Z])[\1+32]/g
This new regex (that obviously doesn't work) is supposed to match a single uppercase letter immediately followed by itself-plus-32-ascii; in other words, it should match an uppercase letter followed immediately by its lowercase counterpart. But, as far as I know, you cannot "add an ascii value" to a backreference in a regex.
And, bonus points to whoever can answer this — in this specific case, the string in question is known to be 10 characters long. Would it be possible to create a regex that matches strings of an arbitrary length?
You can use the following pattern with the Python regex module (the third-party PyPI package, not the built-in re):
^(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L}))(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$)
See the regex demo
Details
^ - start of string
(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L})) - a positive lookahead with a sequence of five capturing groups that capture the first five letters individually
(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$) - a positive lookahead that makes sure that, at the end of the string, there are 5 letters that are the same as the ones captured at the start but of a different case.
In brief, the first (\p{L}) in the first lookahead captures the first a in abcdeABCDE. Then, inside the second lookahead, (?i:\1) makes sure the fifth char from the end is the same letter (with case-insensitive mode on), and the (?!\1) negative lookahead before it makes sure this letter is not identical to the one captured.
The built-in re module does not support \p{L} (and, before Python 3.6, it did not support inline modifier groups such as (?i:...) either), so this expression won't work with that module.
Python regex based module demo:
import regex
strs = ['abcdeABCDE', 'abcdEABCDe', 'zYxWvZyXwV', 'abcdeABCDZ', 'abcdeABCDe']
rx = r'^(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L}))(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$)'
for s in strs:
    print("Testing {}...".format(s))
    if regex.search(rx, s):
        print("Matched")
Output:
Testing abcdeABCDE...
Matched
Testing abcdEABCDe...
Matched
Testing zYxWvZyXwV...
Matched
Testing abcdeABCDZ...
Testing abcdeABCDe...
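Since the question notes that a regex is not strictly required, and asks about strings of arbitrary length, here is a plain-Python sketch without any regex. It is not the approach above, just an alternative built on str.swapcase; the function name is_case_flipped_repeat is made up for illustration, and the sketch assumes letters whose case flip is a single character (as with ASCII letters).

```python
def is_case_flipped_repeat(s: str) -> bool:
    """Return True if the second half of s is the first half with every
    letter's case inverted. Works for any even, non-zero length."""
    half, remainder = divmod(len(s), 2)
    if remainder or half == 0:
        return False
    first, second = s[:half], s[half:]
    return (
        first.isalpha()                                 # letters only, like \p{L}
        and first.swapcase() == second                  # same letters, flipped case
        and all(a != b for a, b in zip(first, second))  # reject caseless letters
    )


for s in ['abcdeABCDE', 'abcdEABCDe', 'zYxWvZyXwV', 'abcdeABCDZ', 'abcdeABCDe']:
    print(s, is_case_flipped_repeat(s))
```

For the fixed 10-character case from the question, an extra len(s) == 10 check would be needed on top of this.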

Regex not working with group in group

I'm wondering why my regex is not working. The only group it works on is year.
The rest of the groups are None.
formatted_date = re.search('.*((?P<day>\d{1,2}) )?((?P<month>[a-zA-Z]+) )?(?P<year>\d{4}).*', '10 may 1991')
The idea behind the regex is that it will work on the following input:
10 may 1991
may 1991
1991
The regex is written in Python.
Thanks in advance :)
The issue is that the greedy dot-matching subpattern (.*) at the beginning of the pattern grabs all the characters up to the end of the string and then backtracks, giving back only as much as it must for the remaining subpatterns to match. Since the first two groups are optional, no text is given to them.
You do not need any .* as re.search does not require a full string match.
Use
(?:(?P<day>\d{1,2}) )?(?:(?P<month>[a-zA-Z]+) )?(?P<year>\d{4})
See the regex demo
I also converted capturing optional groups to non-capturing so that the match object was a bit cleaner.
Note that if you still use your approach, you might consider using .*? at the beginning of the pattern (lazy dot matching), but you would have to worry about newlines then (ok, you can use re.S flag to solve that one), and that way, you'd get the first instance of the date in your string. If you have more than one, and you need to get the last one, the best approach is to use re.findall with my suggested pattern and just get the last element of the resulting list.
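Here is a short sketch of both behaviours in Python, using the standard re module. The name DATE and the two-date sample string are made up for illustration.

```python
import re

# Suggested pattern: optional day and month, mandatory 4-digit year.
DATE = re.compile(r"(?:(?P<day>\d{1,2}) )?(?:(?P<month>[a-zA-Z]+) )?(?P<year>\d{4})")

for text in ["10 may 1991", "may 1991", "1991"]:
    print(repr(text), "->", DATE.search(text).groupdict())

# The original pattern: the leading greedy .* leaves nothing for day and month.
greedy = re.search(
    r".*((?P<day>\d{1,2}) )?((?P<month>[a-zA-Z]+) )?(?P<year>\d{4}).*",
    "10 may 1991",
)
print(greedy.groupdict())

# Getting the last date in a longer string: findall returns one
# (day, month, year) tuple per match, so take the final element.
print(DATE.findall("10 may 1991 and 3 june 2005")[-1])
```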
