Regex to find aadhar number

Regex to find aadhar number - python

I'm trying to find a unique number in a string. It consists of 3 groups of 4 digits, separated by spaces (spaces only between the groups, not at the start or end).
I've tried the regex below, but it matches the number both with and without spaces. I only want matches that have spaces between the groups.
Example
(\d{4}.?){3}
The above regex accepts all of these as correct:
1. 2131 2312 3675
2. 2131231212313675
3. 2131 1231 3675 (this one has spaces at the start and end)
In option (3) I can ignore the spaces, but I don't want option (2) to match.
How can I fix this?
Live example

Converting my comment to an answer so that the solution is easy to find for future visitors.
You may use this regex:
\b\d{4}(?: \d{4}){2}\b
RegEx Demo
RegEx Breakup:
\b: Word boundary
\d{4}: Match 4 digits
(?: \d{4}){2}: Match a space followed by 4 digits. Repeat this group 2 times to make sure to match 3 sets separated by a single space.
\b: Word boundary
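
As a quick check, here is a minimal Python sketch that runs this regex against the three inputs from the question. The helper name find_number is only for illustration and is not part of the answer.

```python
import re

# Pattern from the answer: three 4-digit groups separated by single spaces.
AADHAAR_RE = re.compile(r"\b\d{4}(?: \d{4}){2}\b")


def find_number(text):
    """Return the first matching number in text, or None (hypothetical helper)."""
    match = AADHAAR_RE.search(text)
    return match.group(0) if match else None


print(find_number("2131 2312 3675"))     # spaced groups are matched
print(find_number("2131231212313675"))   # unspaced digits give None
print(find_number(" 2131 1231 3675 "))   # surrounding spaces are left out of the match
```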

This should work (three groups of four digits): (?:\s*\b\d{4}\b){3}

Related

Regex to match dollar amount with uppercase letter or word

I'm trying to match some sort of amount, here are all possibilities:
$5.6 million
$4,1 million
$8,1M
$6.3M
$333,333
$2 million
$5 million
I have already this regex:
\$\d{1,3}(?:,\d{3})*(?:\s+(?:thousand|[mb]illion|[MB]illion)|[M])?
See online demo.
But I'm not able to match those ones:
$5.6 million
$4,1 million
$8,1M
$6.3M
Any help would be appreciated.

Let's look at your regular expression:
\$\d{1,3}(?:,\d{3})*(?:\s+(?:thousand|[mb]illion|[MB]illion)|[M])?
\$\d{1,3} is fine. What follows? One way to answer that is to consider the following three possibilities.
The string to be matched ends ' million'
This string (which begins with a space, in case you missed that) is preceded by an empty string or a single digit preceded by a comma or period:
(?:[,.]\d)? million
Evidently, "million" can also be "thousand" or "billion", and the first letter of "million" or "billion" might be capitalized, so we change the expression to
(?:[,.]\d)? (?:[MmBb]illion|thousand)
One potential problem is that this matches '$5.6 millionaire'. We can avoid that by tacking on a word boundary, which prevents the match from being followed by a word character:
(?:[,.]\d)? (?:[MmBb]illion|thousand)\b
The string ends 'M'
In this case the 'M' must be preceded by a single digit preceded by a comma or period:
[,.]\dM\b
You could accept 'B' as well by changing M to [MB].
The string ends with three digits preceded by a comma
Here we need
,\d{3}\b
Here the word boundary avoids matching, for example, '$333,3333'. It will not match, however, '$333,333,333' or '$333,333,333,333'. If we want to match those we could change the expression to
(?:,\d{3})+\b
or to match '$333' as well, change it to
(?:,\d{3})*\b
Construct the alternation
We therefore can use the following regular expression.
\$\d{1,3}(?:(?:[,.]\d)? (?:[MmBb]illion|thousand)\b|[,.]\dM\b|,\d{3}\b)
Factoring out the word boundary we obtain
\$\d{1,3}(?:(?:[,.]\d)? (?:[MmBb]illion|thousand)|[,.]\dM|,\d{3})\b
Demo
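
To see the alternation at work, here is a small Python sketch that applies the final expression (with the word boundary written as \b) to the amounts from the question plus the '$5.6 millionaire' case. The variable names are only for illustration.

```python
import re

# Final expression from this answer, with a single trailing word boundary.
AMOUNT_RE = re.compile(
    r"\$\d{1,3}(?:(?:[,.]\d)? (?:[MmBb]illion|thousand)|[,.]\dM|,\d{3})\b"
)

samples = [
    "$5.6 million",
    "$4,1 million",
    "$8,1M",
    "$6.3M",
    "$333,333",
    "$2 million",
    "$5 million",
    "$5.6 millionaire",  # should be rejected by the word boundary
]

# Matched text for each sample, or None when nothing matches.
matches = [m.group(0) if m else None for m in map(AMOUNT_RE.search, samples)]

for sample, matched in zip(samples, matches):
    print(repr(sample), "->", repr(matched))
```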

You can use
(?i)\$\d+(?:[.,]\d+)*(?:\s+(?:thousand|[mb]illion)|m)?
If you need to make sure you do not match m that is part of another word:
(?i)\$\d+(?:[.,]\d+)*(?:\s+(?:thousand|[mb]illion)|m)?\b
See the regex demo. Details:
(?i) - case insensitive option
\$ - a $ char
\d+ - one or more digits
(?:[.,]\d+)* - zero or more repetitions of . or , and then one or more digits
(?:\s+(?:thousand|[mb]illion)|m)? - an optional occurrence of
\s+(?:thousand|[mb]illion) - one or more whitespaces and then thousand, million or billion
| - or
m - an m char
\b - a word boundary.
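
A minimal Python sketch of this pattern, checked with re.fullmatch against the amounts listed in the question. The extra '$6.3 Billion' sample is my own addition to show the case-insensitive flag.

```python
import re

# Second pattern from this answer; (?i) makes the unit words and "m" case insensitive.
AMOUNT_RE = re.compile(r"(?i)\$\d+(?:[.,]\d+)*(?:\s+(?:thousand|[mb]illion)|m)?\b")

samples = [
    "$5.6 million",
    "$4,1 million",
    "$8,1M",
    "$6.3M",
    "$333,333",
    "$2 million",
    "$5 million",
    "$6.3 Billion",  # illustrative extra sample, not from the question
]

# True when the whole string is a valid amount according to the pattern.
results = {s: bool(AMOUNT_RE.fullmatch(s)) for s in samples}
print(results)
```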

How can I write a regex that finds everything but 4 digit numbers like 2000 or 1990 or 1234?

I have a text like this:
Film_relase_date:1970_films_by_20th_Century_Fox
I would like to create a regex that matches all text except 1970, resulting in:
Film_relase_date:_films_by_20th_Century_Fox
I tried with the regex:
[^\d{4}]
But this regex returns:
Film_relase_date:_films_by_th_Century_Fox
And therefore also excludes the 20 which instead I would like to be matched.
How can I improve the regex?
EDIT:
I want to use this regex to do something like:
x = 'Film_relase_date: 1970_films_by_20th_Century_Fox'
REPLACE (x, "Anything that is not a 4-digit number", "Non-Space") = 1970

Remember that {4} is supposed to be added after the character class, not inside.
Anyway, if you want to match "all text except 1970", you can use the following regex:
([^\d]|(?<!\d)\d(?!\d{3}(?!\d))\d*)?
see demo.
This regex matches:
a non-digit character or
a digit char that is not preceded by another digit and is not followed by exactly 3 digits, plus any digits that follow it

If you want to match everything except 4-digit numbers, I would suggest an unrolled version that matches either 1-3 digits or 5 or more digits, asserting that no digit follows so that a longer run of digits is not matched piece by piece.
If you don't want to cross newlines, you could use [^\d\r\n] instead of \D
\D+(?:(?:\d{1,3}|\d{5,})(?!\d)\D*)*
Explanation
\D+ Match 1+ non digits
(?: Non capture group
(?:\d{1,3}|\d{5,}) Match either 1-3 or 5 or more digits
(?!\d)\D* Negative lookahead, assert not a digit directly to the right followed by matching optional non digits
)* Close the non capture group and repeat 0+ times
Regex demo
Note that if you want to match 4 digits only, you could perhaps extract the 4 digits using (?<!\d)\d{4}(?!\d) instead of replacing with an empty string.
See another regex demo
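
Both approaches can be tried in Python on the string from the question. This is only a sketch; the variable names are mine.

```python
import re

text = "Film_relase_date:1970_films_by_20th_Century_Fox"

# Approach 1: match everything except standalone 4-digit numbers, then join the pieces.
kept = "".join(re.findall(r"\D+(?:(?:\d{1,3}|\d{5,})(?!\d)\D*)*", text))

# Approach 2: target the standalone 4-digit numbers directly.
four_digit = r"(?<!\d)\d{4}(?!\d)"
years = re.findall(four_digit, text)   # extract them
removed = re.sub(four_digit, "", text)  # or delete them

print(kept)
print(years)
print(removed)
```

In this example both approaches leave the 20 in "20th" untouched, because it is not a run of exactly four digits.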

How to match words in which must be letter, number and slash using regex (Python)?

I have such list (it's only a part);
not match me
norme
16/02574/REMMAJ
20160721
17/00016/FULM
OUT/2017/1071
SMD/2017/0391
17/01090/FULM
2017/30597
17/03940/MAO
18/00076/FULM
CH/17/323
18/00840/OUTMEI
17/00902/EIAM
PL/2017/02671/MINFOT
I need a general rule that matches all of them except the first rows (plain words), and that does not match digits (\d) or word characters (\w) on their own unless they are combined with each other and a slash. Pure numbers like \d{8} are allowed.
I don't know how to express a MUST condition that applies to all three groups together, so that none of them can be missing.
The patterns below either don't match fully or also match plain words. I need a regex that is as simple as possible.
\d{8}|(\w+|/+|\d+)
\d{8}|[\w/\d]+
EDIT
It's funny, but some examples that I didn't provide don't match the proposed expressions. For example:
7/2018/4127
NWB/18CM032
but I know why and this is outside the scope. However, adding functionality for mixed numbers and letters in one group, like NWB/18CM032 would be great and wouldn't break previous idea I think.

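You could match one or more letters or digits, followed by one or more groups of a forward slash and more letters or digits, or alternatively exactly 8 digits. The pattern uses a-z, so enable the case-insensitive flag to match the uppercase letters in your data: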
^(?:[a-z0-9]+(?:/[a-z0-9]+)+|\d{8})$
That will match
^ Start of string
(?: Non capturing group
[a-z0-9]+ Match a char a-z or a digit 1+ times
(?:/[a-z0-9]+)+ Match a / followed by a char or digit 1+ times and repeat 1+ times.
| Or
\d{8} Match 8 digits
) Close group
$ End of string
See it on regex101
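
Here is a short Python sketch that filters a few rows from the question with this pattern. It assumes the case-insensitive flag (re.IGNORECASE), since the pattern is written with a-z while the data is uppercase.

```python
import re

# Pattern from this answer, compiled case-insensitively (an assumption, see above).
REF_RE = re.compile(r"^(?:[a-z0-9]+(?:/[a-z0-9]+)+|\d{8})$", re.IGNORECASE)

lines = [
    "not match me",
    "norme",
    "16/02574/REMMAJ",
    "20160721",
    "OUT/2017/1071",
    "2017/30597",
    "CH/17/323",
    "PL/2017/02671/MINFOT",
    "7/2018/4127",
    "NWB/18CM032",
]

# Keep only the rows that match the whole pattern.
matched = [line for line in lines if REF_RE.match(line)]
print(matched)
```

Only the first two rows (plain words) are dropped here; the two extra examples from the EDIT are kept as well.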

Restrict the count of occurrences regex

I am trying to find combination of dates. I am having the following regular expression.
\b([\d]{1,2}[\/\s-]{0,3}\d{2,4})
I want to match the following combinations:
8/1967 or 8-1967
08/1967 same
8/67 same
08/67 same
I don't want it to match the following:
08/967
That is, I want the part after "/" or "-" to be either 2 digits or 4 digits.
But "\d{2,4}" matches 2, 3, or 4 digits, and I don't know how to restrict it to either 2 or 4. If there is any other problem with this regex, please let me know.

If you are matching months and years, do
\b((?:0?[1-9]|1[0-2]) ?[/-] ?(?:[12][0-9])?[0-9]{2})\b
Explanation:
\b - a word boundary between non-alphanumeric and alphanumeric character
(?:0?[1-9]|1[0-2]) - 1-12 and 01-12 (with leading zero)
" ?" - an optional space on either side of the separator
[/-] - one separator character, either / or -
(?:[12][0-9])?[0-9]{2} - either a 4-digit number that starts with 1 or 2, or a 2-digit number with any digits
\b - ends with word boundary (the next character is not alphanumeric).
This will match the following strings: 03-1902, 12 / 2014, 6 / 03
but will not match any of 3 / 3009, 13/2009, or 26-30, or 3///60, or 12/34567.
I use [0-9] instead of \d because \d is not limited to ASCII digits: in Python 3 it matches any Unicode decimal digit unless the re.ASCII flag is set.
DEMO
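
The claims above can be checked with a small Python sketch. The lists hits and misses are my own names, and I added the question's 8/1967, 08/67 and 08/967 to the strings listed in this answer.

```python
import re

# Month/year pattern from this answer.
DATE_RE = re.compile(r"\b((?:0?[1-9]|1[0-2]) ?[/-] ?(?:[12][0-9])?[0-9]{2})\b")

should_match = ["03-1902", "12 / 2014", "6 / 03", "8/1967", "08/67"]
should_not_match = ["3 / 3009", "13/2009", "26-30", "3///60", "12/34567", "08/967"]

# Strings in which the pattern finds a date, and strings in which it finds none.
hits = [s for s in should_match if DATE_RE.search(s)]
misses = [s for s in should_not_match if not DATE_RE.search(s)]

print(hits)
print(misses)
```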
To match a date range (are you possibly doing a cv/resume parser here?), you can do:
date_re = r'\b((?:0?[1-9]|1[0-2]) ?[/-] ?(?:[12][0-9])?[0-9]{2})\b'
date_span = r'%s(?:[\s-]+)-\s*%s' % (date_re, date_re)
which produces the following regular expression in date_span:
\b((?:0?[1-9]|1[0-2]) ?[/-] ?(?:[12][0-9])?[0-9]{2})\b(?:[\s-]+)-\s*\b((?:0?[1-9]|1[0-2]) ?[/-] ?(?:[12][0-9])?[0-9]{2})\b
DEMO
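
A minimal sketch of the span pattern in use. The input string is a made-up example, not taken from the question.

```python
import re

date_re = r'\b((?:0?[1-9]|1[0-2]) ?[/-] ?(?:[12][0-9])?[0-9]{2})\b'
date_span = r'%s(?:[\s-]+)-\s*%s' % (date_re, date_re)

# Hypothetical input: a date range with spaces around the dash.
match = re.search(date_span, "03/2010 - 05/2012")
start, end = match.group(1), match.group(2)
print(start, end)

# Without whitespace before the dash, this span pattern finds nothing.
no_space = re.search(date_span, "03/2010-05/2012")
print(no_space)
```

Note that, as written, the separator part needs at least one whitespace or dash character before the final dash, so a range typed without spaces is not matched.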

Change \d{2,4} into \d{2}(\d{2})?
This will get you what you want.
It first matches 2 digits, then optionally matches 2 more digits.
That's exactly 2 or 4 digits.
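
To illustrate just this sub-pattern, here is a small Python sketch using re.fullmatch on a few digit strings. It uses a non-capturing group, (?:\d{2})?, which is my own variation; it does not change what is matched.

```python
import re

# Two digits, optionally followed by two more: exactly 2 or 4 digits in total.
YEAR_RE = re.compile(r"\d{2}(?:\d{2})?")

# True when the whole string is 2 or 4 digits long.
results = {s: bool(YEAR_RE.fullmatch(s)) for s in ["67", "967", "1967", "12345"]}
print(results)
```

Inside a larger regex the pattern still needs something after it (for example a word boundary) so that it cannot stop partway through a longer run of digits.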

\b((?<!\/)[\d]{1,2}[\/\s-]{0,3}(?!\d{3}\b)\d{2,4})
Try this. See the demo:
https://regex101.com/r/wX9fR1/11
The (?!\d{3}\b) lookahead prevents a group of exactly 3 digits from being matched.
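
A quick Python sketch of this regex against the strings from the question. The variable names are mine.

```python
import re

# Regex from this answer: the lookahead rejects a 3-digit year part,
# and the lookbehind stops a match from starting right after a slash.
DATE_RE = re.compile(r"\b((?<!\/)[\d]{1,2}[\/\s-]{0,3}(?!\d{3}\b)\d{2,4})")

samples = ["8/1967", "8-1967", "08/1967", "8/67", "08/67", "08/967"]

# Captured date for each sample, or None when nothing matches.
found = [m.group(1) if m else None for m in map(DATE_RE.search, samples)]
print(found)
```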

regex to extract first series of numbers in a string and all words after

Trying to write a regex that will do the following in python 2.7:
FOO 288-B BAR <MATCH: "288-B BAR">
BURT 69/ERNIE 96/KERMIT 287 <MATCH: "69">
53 ORANGE <MATCH: "53 ORANGE">
APPLE 457-W <MATCH: "457-W">
There is no punctuation other than spaces, '-' and '/'. I just want to match the first occurrence of a number, plus any letters or words following it that are preceded by a '-' or a space.
I have tried:
([\d]+)(-?[\w+])
This misses the letters AFTER the space. Adding \s? doesn't go well for me.
(\d+(?:(?:\-\w+)|\w)?)(.*)
This picks up the letters but I can't seem to modify it to get rid of the stuff after the slash.
(\d+(?:(?:\-\w+)|\w))[^\/]*(\/*.*)
I'm trying to use [] to tackle those slashes. This was clearly unsuccessful.

If I understand your requirements, you can use this, then retrieve the matches from Group 1:
(?im)^\D*(\d+(?:[- ][a-z ]*[a-z])?)
Here's a demo (please look at the capture groups in the bottom right pane).
To retrieve the matches:
import re
# subject is the input text to search
for match in re.finditer(r"(?im)^\D*(\d+(?:[- ][a-z ]*[a-z])?)", subject):
    yournumber = match.group(1)
How does it work?
The ^ in (?im) multi-line, case-insensitive mode anchors us at the beginning of the line.
The \D* skips any non-digits
The (\d+(?:[- ][a-z ]*[a-z])?) matches, and captures to Group 1, digits optionally followed by a dash or a space and more spaces and letters, ending with a letter.
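
For reference, here is a self-contained sketch that runs the regex over the four lines from the question. It is written for Python 3 (the question mentions 2.7), and the names subject and extracted are only for illustration.

```python
import re

# The four sample lines from the question, joined into one multi-line string.
subject = "\n".join([
    "FOO 288-B BAR",
    "BURT 69/ERNIE 96/KERMIT 287",
    "53 ORANGE",
    "APPLE 457-W",
])

pattern = r"(?im)^\D*(\d+(?:[- ][a-z ]*[a-z])?)"

# Group 1 holds the first number on each line plus any trailing letters/words.
extracted = [match.group(1) for match in re.finditer(pattern, subject)]
print(extracted)
```

One caveat: \D* can also consume newlines, so if the text contains lines without any digits it may be safer to apply the pattern to each line separately.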
