I'm trying to match a string that contains 4-digit numbers separated by spaces (spaces only between them, not at the start or end), and there should be exactly 3 of these numbers.
I've tried the regex below, but it matches the numbers both with and without spaces. I don't want that: the match should require spaces between the numbers.
Example
(\d{4}.?){3}
The regex above accepts all of these as correct:
(1) 2131 2312 3675
(2) 2131231212313675
(3) 2131 1231 3675 - (this includes spaces at start & end)
In option (3) I can ignore spaces but I don't want output as option (2).
How can I fix this?
Live example
Converting my comment to answer so that solution is easy to find for future visitors.
You may use this regex:
\b\d{4}(?: \d{4}){2}\b
RegEx Demo
RegEx Breakup:
\b: Word boundary
\d{4}: Match 4 digits
(?: \d{4}){2}: Match a space followed by 4 digits. Repeat this group 2 times to make sure to match 3 sets separated by a single space.
\b: Word boundary
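As a quick check of the breakup above, here is a minimal Python sketch that runs the pattern against the three strings from the question. The helper name first_match is only illustrative and not part of the original answer.

```python
import re

# Pattern from the answer: three 4-digit groups separated by single spaces.
PATTERN = re.compile(r"\b\d{4}(?: \d{4}){2}\b")


def first_match(text):
    """Return the first matching substring, or None if nothing matches."""
    m = PATTERN.search(text)
    return m.group(0) if m else None


for sample in ["2131 2312 3675", "2131231212313675", " 2131 1231 3675 "]:
    print(repr(sample), "->", first_match(sample))
```

The run of 16 digits is rejected because the pattern requires a literal space between the groups, while the padded string still yields the inner three groups.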
This should work: (?:\s*\b\d{4}\b){3}
I'm trying to match some sort of amount, here are all possibilities:
$5.6 million
$4,1 million
$8,1M
$6.3M
$333,333
$2 million
$5 million
I have already this regex:
\$\d{1,3}(?:,\d{3})*(?:\s+(?:thousand|[mb]illion|[MB]illion)|[M])?
See online demo.
But I'm not able to match those ones:
$5.6 million
$4,1 million
$8,1M
$6.3M
Any help would be appreciated.
Let's look at your regular expression:
\$\d{1,3}(?:,\d{3})*(?:\s+(?:thousand|[mb]illion|[MB]illion)|[M])?
\$\d{1,3} is fine. What follows? One way to answer that is to consider the following three possibilities.
The string to be matched ends ' million'
This string (which begins with a space, in case you missed that) is preceded by an empty string or a single digit preceded by a comma or period:
(?:[,.]\d)? million
Evidently, "million" can also be "thousand" or "billion", and the first letter of "million" or "billion" might be capitalized, so we change the expression to
(?:[,.]\d)? (?:[MmBb]illion|thousand)
One potential problem is that this matches '$5.6 millionaire'. We can avoid that problem by tacking on a word boundary, which prevents the match from being followed by a word character:
(?:[,.]\d)? (?:[MmBb]illion|thousand)\b
The string ends 'M'
In this case the 'M' must be preceded by a single digit preceded by a comma or period:
[,.]\dM\b
You could accept 'B' as well by changing M to [MB].
The string ends with three digits preceded by a comma
Here we need
,\d{3}\b
Here the word boundary avoids matching, for example, '$333,3333'. It will not match, however, '$333,333,333' or '$333,333,333,333'. If we want to match those we could change the expression to
(?:,\d{3})+\b
or to match '$333' as well, change it to
(?:,\d{3})*\b
Construct the alternation
We therefore can use the following regular expression.
\$\d{1,3}(?:(?:[,.]\d)? (?:[MmBb]illion|thousand)\b|[,.]\dM\b|,\d{3}\b)
Factoring out the word boundary we obtain
\$\d{1,3}(?:(?:[,.]\d)? (?:[MmBb]illion|thousand)|[,.]\dM|,\d{3})\b
Demo
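To see the three alternatives working together, here is a small Python sketch that applies the final expression (with the shared word boundary written as \b) to the amounts from the question. The helper name match_amount is an assumption made for illustration.

```python
import re

# Final alternation with the trailing word boundary factored out.
AMOUNT = re.compile(
    r"\$\d{1,3}(?:(?:[,.]\d)? (?:[MmBb]illion|thousand)|[,.]\dM|,\d{3})\b"
)


def match_amount(text):
    """Return the matched amount, or None if the text contains none."""
    m = AMOUNT.search(text)
    return m.group(0) if m else None


wanted = ["$5.6 million", "$4,1 million", "$8,1M", "$6.3M",
          "$333,333", "$2 million", "$5 million"]
for amount in wanted:
    print(amount, "->", match_amount(amount))

# Strings the word boundary is meant to reject.
print(match_amount("$5.6 millionaire"))
print(match_amount("$333,3333"))
```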
You can use
(?i)\$\d+(?:[.,]\d+)*(?:\s+(?:thousand|[mb]illion)|m)?
If you need to make sure you do not match m that is part of another word:
(?i)\$\d+(?:[.,]\d+)*(?:\s+(?:thousand|[mb]illion)|m)?\b
See the regex demo. Details:
(?i) - case insensitive option
\$ - a $ char
\d+ - one or more digits
(?:[.,]\d+)* - zero or more repetitions of . or , and then one or more digits
(?:\s+(?:thousand|[mb]illion)|m)? - an optional occurrence of
\s+(?:thousand|[mb]illion) - one or more whitespaces and then thousand, million or billion
| - or
m - an m char
\b - a word boundary.
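A minimal Python sketch of the second pattern, assuming the amounts appear inside a longer sentence (the sample text is made up for illustration). Because the pattern has no capturing groups, findall returns the full matches.

```python
import re

# Case-insensitive pattern from the answer, with the trailing word boundary.
AMOUNT_RE = re.compile(
    r"(?i)\$\d+(?:[.,]\d+)*(?:\s+(?:thousand|[mb]illion)|m)?\b"
)

text = "Deals: $5.6 million, $4,1 million, $8,1M, $6.3M, $333,333, $2 million"
found = AMOUNT_RE.findall(text)
print(found)
```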
I have a text like this:
Film_relase_date:1970_films_by_20th_Century_Fox
I would like to create a regex that matches all text except 1970, resulting in:
Film_relase_date:_films_by_20th_Century_Fox
I tried with the regex:
[^\d{4}]
But this regex returns:
Film_relase_date:_films_by_th_Century_Fox
And therefore also excludes the 20 which instead I would like to be matched.
How can I improve the regex?
EDIT:
I want to use this regex to do something like:
x = 'Film_relase_date: 1970_films_by_20th_Century_Fox'
REPLACE (x, "Anything that is not a 4-digit number", "Non-Space") = 1970
Remember that {4} is supposed to be added after the character class, not inside.
Anyway, if you want to match "all text except 1970", you can use the following regex:
([^\d]|(?<!\d)\d(?!\d{3}(?!\d))\d*)?
see demo.
This regex matches:
a non-digit character or
a digit char that is not preceded by another digit and is not followed by exactly 3 digits
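One way to try this pattern is to delete everything it matches, which should leave only the 4-digit number. The sketch below assumes Python's re.sub and uses the sample text from the question. Positions where the optional group matches nothing produce empty matches, which do not change the string.

```python
import re

# Pattern from the answer: a non-digit, or a digit run that is not exactly 4 long.
NOT_YEAR = re.compile(r"([^\d]|(?<!\d)\d(?!\d{3}(?!\d))\d*)?")

text = "Film_relase_date:1970_films_by_20th_Century_Fox"

# Deleting every match keeps only the standalone 4-digit run.
remaining = NOT_YEAR.sub("", text)
print(remaining)
```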
If you want to match all except 4 digits, I would suggest an unrolled version matching either 1-3 or 5 or more digits, asserting that no digit follows so that part of a longer run of digits is not matched.
If you don't want to cross newlines, you could use [^\d\r\n] instead of \D
\D+(?:(?:\d{1,3}|\d{5,})(?!\d)\D*)*
Explanation
\D+ Match 1+ non digits
(?: Non capture group
(?:\d{1,3}|\d{5,}) Match either 1-3 or 5 or more digits
(?!\d)\D* Negative lookahead, assert not a digit directly to the right followed by matching optional non digits
)* Close the non capture group and repeat 0+ times
Regex demo
Note that if you want to match 4 digits only, you could perhaps extract the 4 digits using (?<!\d)\d{4}(?!\d) instead of replacing with an empty string.
See another regex demo
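Both suggestions can be compared side by side. This is a rough Python sketch using the sample string from the question: one pattern collects everything except the standalone 4-digit run, the other extracts that run directly.

```python
import re

text = "Film_relase_date:1970_films_by_20th_Century_Fox"

# Unrolled pattern: text that may contain digit runs of length 1-3 or 5+.
NOT_FOUR_DIGITS = re.compile(r"\D+(?:(?:\d{1,3}|\d{5,})(?!\d)\D*)*")
# Direct extraction of a run of exactly 4 digits.
FOUR_DIGITS = re.compile(r"(?<!\d)\d{4}(?!\d)")

kept = NOT_FOUR_DIGITS.findall(text)   # pieces surrounding the year
joined = "".join(kept)                 # the text with the year left out
year = FOUR_DIGITS.findall(text)       # the year itself

print(kept)
print(joined)
print(year)
```

Extracting with the second pattern is usually the simpler option when only the 4-digit value is needed.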
I have such list (it's only a part);
not match me
norme
16/02574/REMMAJ
20160721
17/00016/FULM
OUT/2017/1071
SMD/2017/0391
17/01090/FULM
2017/30597
17/03940/MAO
18/00076/FULM
CH/17/323
18/00840/OUTMEI
17/00902/EIAM
PL/2017/02671/MINFOT
I need a general rule that matches all of them except the first rows (plain words), and that does not match a lone \d or \w group unless digits and letters are combined with each other through a slash. Numbers like \d{8} are allowed.
I don't know how to express something like a MUST clause applied to all 3 of these groups together, so that none of them can be missing.
The patterns below either do not match fully or also match plain words. I need a regex that is as simple as possible.
\d{8}|(\w+|/+|\d+)
\d{8}|[\w/\d]+
EDIT
It's funny, but some examples I did not provide don't match the proposed expressions. For example:
7/2018/4127
NWB/18CM032
but I know why and this is outside the scope. However, adding functionality for mixed numbers and letters in one group, like NWB/18CM032 would be great and wouldn't break previous idea I think.
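You could match one or more letters or digits, followed by a forward slash and more letters or digits repeated one or more times, or alternatively exactly 8 digits. Note that [a-z0-9] only matches the uppercase samples when the case-insensitive flag is enabled: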
^(?:[a-z0-9]+(?:/[a-z0-9]+)+|\d{8})$
That will match
^ Start of string
(?: Non capturing group
[a-z0-9]+ Match a char a-z or a digit 1+ times
(?:/[a-z0-9]+)+ Match a / followed by a char or digit 1+ times and repeat 1+ times.
| Or
\d{8} Match 8 digits
) Close group
$ End of string
See it on regex101
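Here is a short Python sketch of that pattern applied to a few of the listed values. It assumes the case-insensitive flag is set (re.IGNORECASE here), since the character class is written with a-z while the samples are uppercase.

```python
import re

# Slash-separated alphanumeric groups, or a bare 8-digit number.
ID_RE = re.compile(r"^(?:[a-z0-9]+(?:/[a-z0-9]+)+|\d{8})$", re.IGNORECASE)

lines = [
    "not match me",
    "norme",
    "16/02574/REMMAJ",
    "20160721",
    "OUT/2017/1071",
    "2017/30597",
    "CH/17/323",
    "7/2018/4127",
    "NWB/18CM032",
]

matched = [s for s in lines if ID_RE.match(s)]
print(matched)
```

Because each group accepts a mix of letters and digits, values such as NWB/18CM032 from the edit are covered as well.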
I am trying to find combination of dates. I am having the following regular expression.
\b([\d]{1,2}[\/\s-]{0,3}\d{2,4})
I want to match the following combinations:
8/1967 or 8-1967
08/1967 same
8/67 same
08/67 same
I don't want it to match the following:
08/967
That is, I want the part after "/" or "-" to be either 2 digits or 4 digits.
But "\d{2,4}" also accepts 3 digits, and I don't know how to restrict it to either 2 or 4. If there is any other problem with this regex, please let me know.
If you are matching months and years, do
\b((?:0?[1-9]|1[0-2]) ?[/-] ?(?:[12][0-9])?[0-9]{2})\b
Explanation:
\b - a word boundary between non-alphanumeric and alphanumeric character
(?:0?[1-9]|1[0-2]) - 1-12 and 01-12 (with leading zero)
? - possible space on either side of the separator
[-/] 1 separator character, either - or /
(?:[12][0-9])?[0-9]{2}) - either 4-digit number that starts with 1 or 2, or 2 digit number with any digits.
\b - ends with word boundary (the next character is not alphanumeric).
This will match the following strings: 03-1902, 12 / 2014, 6 / 03
but will not match any of 3 / 3009, 13/2009, or 26-30, or 3///60, or 12/34567.
I use [0-9] instead of \d because \d is locale dependent.
DEMO
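The matching and non-matching examples listed above can be checked with a few lines of Python. This is only a sketch. The list names are illustrative, and the strings from the question are added to the lists from the answer.

```python
import re

# Month (1-12, optional leading zero), separator, then a 2- or 4-digit year.
DATE_RE = re.compile(r"\b((?:0?[1-9]|1[0-2]) ?[/-] ?(?:[12][0-9])?[0-9]{2})\b")

should_match = ["03-1902", "12 / 2014", "6 / 03",
                "8/1967", "8-1967", "08/1967", "8/67", "08/67"]
should_not_match = ["3 / 3009", "13/2009", "26-30", "3///60",
                    "12/34567", "08/967"]

for s in should_match:
    m = DATE_RE.search(s)
    print(s, "->", m.group(1) if m else None)

for s in should_not_match:
    print(s, "->", DATE_RE.search(s))
```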
To match a date range (are you possibly doing a cv/resume parser here?), you can do:
date_re = r'\b((?:0?[1-9]|1[0-2]) ?[/-] ?(?:[12][0-9])?[0-9]{2})\b'
date_span = r'%s(?:[\s-]+)-\s*%s' % (date_re, date_re)
which produces the following regular expression in date_span:
\b((?:0?[1-9]|1[0-2]) ?[/-] ?(?:[12][0-9])?[0-9]{2})\b(?:[\s-]+)-\s*\b((?:0?[1-9]|1[0-2]) ?[/-] ?(?:[12][0-9])?[0-9]{2})\b
DEMO
Change \d{2,4} into \d{2}(\d{2})?
This will get you what you want.
First match 2 digits, then optionally match 2 more digits (zero or one time).
That's exactly 2 or 4 digits.
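A small Python sketch of the "2 or 4 digits" idea. The first pattern checks the year part in isolation. The second, SIMPLE_DATE, is a simplified, hypothetical date pattern (single required separator and a trailing word boundary) and is not the asker's full regex. Without the trailing \b, the first two digits of a 3-digit year would still match.

```python
import re

# Exactly 2 digits, optionally followed by exactly 2 more.
YEAR = re.compile(r"\d{2}(?:\d{2})?")

for candidate in ["67", "1967", "967", "19670", "7"]:
    print(candidate, "->", bool(YEAR.fullmatch(candidate)))

# Hypothetical simplified date pattern built around the same idea.
SIMPLE_DATE = re.compile(r"\b\d{1,2}[/-]\d{2}(?:\d{2})?\b")

for candidate in ["8/1967", "8-1967", "08/1967", "8/67", "08/67", "08/967"]:
    print(candidate, "->", bool(SIMPLE_DATE.search(candidate)))
```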
\b((?<!\/)[\d]{1,2}[\/\s-]{0,3}(?!\d{3}\b)\d{2,4})
Try this. See the demo:
https://regex101.com/r/wX9fR1/11
The (?!\d{3}\b) lookahead prevents a 3-digit year from being matched.
Trying to write a regex that will do the following in python 2.7:
FOO 288-B BAR <MATCH: "288-B BAR">
BURT 69/ERNIE 96/KERMIT 287 <MATCH: "69">
53 ORANGE <MATCH: "53 ORANGE">
APPLE 457-W <MATCH: "457-W">
Apart from spaces, '-' and '/', there is no other punctuation. I just want to match the first occurrence of any number, plus any letter/word following it that is preceded by a '-' or a space.
I have tried:
([\d]+)(-?[\w+])
This misses the letters AFTER the space. Adding \s? doesn't go well for me.
(\d+(?:(?:\-\w+)|\w)?)(.*)
This picks up the letters but I can't seem to modify it to get rid of the stuff after the forward slash.
(\d+(?:(?:\-\w+)|\w))[^\/]*(\/*.*)
I'm trying to use [] to tackle those slashes. This was clearly unsuccessful.
If I understand your requirements, you can use this, then retrieve the matches from Group 1:
(?im)^\D*(\d+(?:[- ][a-z ]*[a-z])?)
Here's a demo (please look at the capture groups in the bottom right pane).
To retrieve the matches:
import re
for match in re.finditer(r"(?im)^\D*(\d+(?:[- ][a-z ]*[a-z])?)", subject):
    yournumber = match.group(1)  # the number plus any trailing letters/words
How does it work?
The ^ in (?im) multi-line, case-insensitive mode anchors us at the beginning of the line.
The \D* skips any non-digits
The (\d+(?:[- ][a-z ]*[a-z])?) matches, and captures to Group 1, digits optionally followed by a dash or a space and more spaces and letters, ending with a letter.
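Putting the pieces together, here is a self-contained Python sketch that runs the pattern over the four sample lines from the question. It is written for Python 3, although the question mentions 2.7, and the subject variable simply holds the sample input.

```python
import re

subject = """FOO 288-B BAR
BURT 69/ERNIE 96/KERMIT 287
53 ORANGE
APPLE 457-W"""

# One match per line: skip leading non-digits, capture the number and any
# dash- or space-separated letters that follow it.
PATTERN = re.compile(r"(?im)^\D*(\d+(?:[- ][a-z ]*[a-z])?)")

matches = [m.group(1) for m in PATTERN.finditer(subject)]
print(matches)
```

Note that \D also matches a newline, so a line without any digits would let the match run on into the next line. The sample input has a number on every line, so this does not come up here.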