Write regex for matching all 4 digit numbers between patterns - python

I am trying to write a regex to find a pattern in a string. It will have the word 'LAT_LON', then some non-word characters, then many 4-digit numbers, and after that either some letters or the end of the string.
Eg1.
SOME EXAMPLE STRING 12334...
LAT_LON .... 1234 5678 9012 1234
1234 1234
Eg2.
SOME EXAMPLE STRING 1234...
LAT_LON ... 1234 5678 9012 1234
1234 1234 SOMETHING_ELSE
In both the examples I need those 6 4-digit numbers after the pattern 'LAT_LON' and before any other alphabet.
EDIT: I am working in python, although I don't care much about the language. I am fairly new to regex world. So I am just trying some random stuff, nothing very conclusive at all till now.

One way is to capture the numbers then split on whitespace.
LAT_LON[^\da-zA-Z]*(\d{4}(?:\s+\d{4})*)
Then split capture group 1 on whitespace.
LAT_LON [^\da-zA-Z]*
( # (1 start)
\d{4}
(?:
\s+
\d{4}
)*
) # (1 end)
Here is a more verbose formatted version.
( Regex's constructed by RegexFormat 6 )
LAT_LON # Exact 'LAT_LON'
[^\da-zA-Z]* # Optional chars, 0 to many times
# not digit nor letter (case insensitive)
( # (1 start), Capture all 4 digit numbers
\d{4} # Single 4 digit number
(?: # Cluster group
\s+ # Whitespace(s)
\d{4} # Single 4 digit number
)* # End Cluster, do 0 to many times
) # (1 end)
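Since the asker is working in Python, the capture-then-split approach could be sketched as follows. The helper name `extract_numbers` and the two sample strings (rebuilt from the examples in the question) are illustrative assumptions, not part of the original answer.

```python
import re

# Pattern from the answer above: 'LAT_LON', then any run of characters that
# are neither digits nor letters, then whitespace-separated 4-digit numbers.
LAT_LON_RE = re.compile(r"LAT_LON[^\da-zA-Z]*(\d{4}(?:\s+\d{4})*)")


def extract_numbers(text):
    """Return the 4-digit numbers following 'LAT_LON', or [] if absent."""
    match = LAT_LON_RE.search(text)
    if match is None:
        return []
    # \s+ also matches newlines, so the capture may span several lines;
    # str.split() with no argument splits on any whitespace.
    return match.group(1).split()


eg1 = "SOME EXAMPLE STRING 12334...\nLAT_LON .... 1234 5678 9012 1234\n1234 1234"
eg2 = ("SOME EXAMPLE STRING 1234...\nLAT_LON ... 1234 5678 9012 1234\n"
       "1234 1234 SOMETHING_ELSE")

print(extract_numbers(eg1))
print(extract_numbers(eg2))
```

Both calls should yield the same six numbers, because the repetition stops as soon as the whitespace is no longer followed by four digits.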

Let me try it another way, just to have some variation in the answers. I'm going to use awk for the job.
awk '/LAT_LON/,/\n[^0-9]/{printf gensub(/[^0-9 ]/, "", "g", $0) " "}' /path/to/input/file
With a possible pipe to clean up the output | tr -s ' '.
This code searches for lines containing LAT_LON, then parses each of the following lines until a non-number is found. On these lines, gensub removes everything that is not a space or a digit.
Note that the regex is fairly simple because we have already filtered out all irrelevant parts; simply removing the non-numeric characters does the job here. See also grep if you want to experiment with regex; in my opinion it's the best way to learn. In particular egrep, which supports extended regular expressions!

Related

Regex to find aadhar number

I'm trying to find a unique number in a string. It consists of groups of 4 digits separated by spaces (between the groups only, not at the start or end), and there should be exactly 3 such groups.
I've tried the regex below, but it matches numbers both with and without spaces. I only want matches that have spaces between the groups.
Example
(\d{4}.?){3}
above regex selects these as correct
2131 2312 3675
2131231212313675
2131 1231 3675 - (this includes spaces at start & end)
In option (3) I can ignore spaces but I don't want output as option (2).
How can I fix this?
Converting my comment to answer so that solution is easy to find for future visitors.
You may use this regex:
\b\d{4}(?: \d{4}){2}\b
RegEx Breakup:
\b: Word boundary
\d{4}: Match 4 digits
(?: \d{4}){2}: Match a space followed by 4 digits. Repeat this group 2 times to make sure to match 3 sets separated by a single space.
\b: Word boundary
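A quick way to check the breakdown above is to run the pattern against the three inputs from the question. The names `GROUPED_RE`, `samples` and `results` are made up for this sketch.

```python
import re

# Three groups of 4 digits, separated by exactly one space each,
# with word boundaries on both sides.
GROUPED_RE = re.compile(r"\b\d{4}(?: \d{4}){2}\b")

samples = [
    "2131 2312 3675",      # spaced groups: should match
    "2131231212313675",    # no spaces: should not match
    " 2131 1231 3675 ",    # surrounding spaces are left out of the match
]

results = {s: GROUPED_RE.findall(s) for s in samples}
for sample, found in results.items():
    print(repr(sample), "->", found)
```

The unspaced input is rejected because the pattern requires a literal space after the first four digits, and the word boundary prevents a match from starting in the middle of a digit run.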
This should work: (?:\s*\b\d{4}\b){4}

Fetching respective group values in a regex expression

I have an example string like below:
Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00
I can have another example string that can be like:
Unpacking/Unremoval fee Zero Rated 100.00
I am trying to access the first set of words and the last number values.
So I want the dict to be
{'Handling - Uncrating of 3 crates - USD600 each':1800.00}
or
{'Unpacking/Unremoval fee':100.00}
There might be strings where none of the above patterns (Zero Rated or something with %) present and I would skip those strings.
To do that, I was regexing the following pattern
pattern = re.search(r'(.*)Zero.*Rated\s*(\S*)',line.strip())
and then
pattern.group(1)
gives the keys for dict and
pattern.group(2)
gives the value of 1800.00. This works for lines where Zero Rated is present.
However if I want to also check for pattern where Zero Rated is not present but % is present as in first example above, I was trying to use | but it didn't work.
pattern = re.search(r'(.*)Zero.*Rated|%\s*(\S*)',line.strip())
But this time the match does not give me the right groups.
Sites like regex101.com can help debug regexes.
In this case, the problem is with operator precedence; the | operates over the whole of the rest of the regex. You can group parts of the regex without creating additional groups with (?: )
Try: r'(.*)(?:Zero.*Rated|%)\s*(\S*)'
Definitely give regex101.com a go, though, it'll show you what's going on in the regex.
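To make the precedence point concrete, here is a small sketch comparing the two patterns on the example lines from the question. The variable names are illustrative only.

```python
import re

percent_line = "Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00"
zero_line = "Unpacking/Unremoval fee Zero Rated 100.00"

# Without grouping, '|' splits the WHOLE pattern into two alternatives:
#   (.*)Zero.*Rated     or     %\s*(\S*)
ungrouped = re.compile(r"(.*)Zero.*Rated|%\s*(\S*)")

# With (?: ), only 'Zero.*Rated' and '%' are alternatives.
grouped = re.compile(r"(.*)(?:Zero.*Rated|%)\s*(\S*)")

for line in (percent_line, zero_line):
    print("ungrouped:", ungrouped.search(line).groups())
    print("grouped:  ", grouped.search(line).groups())
```

With the ungrouped pattern, one of the two groups is always `None`, because only one alternative can take part in a match. With the grouped pattern both groups are filled, although on the `%` line the second group is `=126.00` (the text right after `%`) rather than the final amount, so a more specific pattern is still needed for that case.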
You might use
^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:\,\d{3})*\.\d{2})\s*(\d{1,3}(?:,\d{3})*\.\d{2})
The pattern matches
^ Start of string
(.+?) Capture group 1, match any char except a newline, as few as possible
\s* Match 0+ whitespace chars
(?: Non capture group
Zero Rated Match literally
| Or
\d+%= Match 1+ digits and %=
\d{1,3}(?:\,\d{3})*\.\d{2} Match a digit format of 1-3 digits, optionally repeated by a comma and 3 digits followed by a dot and 2 digits
) Close non capture group
\s* Match 0+ whitespace chars
(\d{1,3}(?:,\d{3})*\.\d{2}) Capture group 2, match the digit format
For example
import re
regex = r"^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:\,\d{3})*\.\d{2})\s*(\d{1,3}(?:,\d{3})*\.\d{2})"
test_str = ("Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00\n"
"Unpacking/Unremoval fee Zero Rated 100.00\n"
"Delivery Cartage - IT Equipment, up to 1000kgs - 7%=210.00 3,000.00")
print(dict(re.findall(regex, test_str, re.MULTILINE)))
Output
{'Handling - Uncrating of 3 crates - USD600 each': '1,800.00', 'Unpacking/Unremoval fee': '100.00', 'Delivery Cartage - IT Equipment, up to 1000kgs -': '3,000.00'}
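The question asked for numeric values such as 1800.00, whereas the output above holds strings. One possible way to convert them, assuming the amounts always use `,` as the thousands separator, is sketched below; `parse_amounts` and the extra non-matching line are hypothetical additions.

```python
import re

LINE_RE = re.compile(
    r"^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:\,\d{3})*\.\d{2})"
    r"\s*(\d{1,3}(?:,\d{3})*\.\d{2})",
    re.MULTILINE,
)


def parse_amounts(text):
    """Map each description to its final amount as a float."""
    return {
        description: float(amount.replace(",", ""))  # '1,800.00' -> 1800.0
        for description, amount in LINE_RE.findall(text)
    }


test_str = (
    "Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00\n"
    "Unpacking/Unremoval fee Zero Rated 100.00\n"
    "Delivery Cartage - IT Equipment, up to 1000kgs - 7%=210.00 3,000.00\n"
    "Freight charge 50.00"  # neither 'Zero Rated' nor a percentage: skipped
)

amounts = parse_amounts(test_str)
print(amounts)
```

Lines that contain neither marker produce no match and are therefore left out of the dictionary, which is the skipping behaviour the question describes.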

I need to write a regex that recognizes all numbers, comma-separated or not, excluding 4-digit numbers

I want to capture all numbers, whether comma-separated or not, excluding 4-digit years.
I want to match these numbers (in my case the comma-separated groups always have 3 digits):
978,763,835,536,363
123
123,456
123456
7456
3400
excluding the years like
1200 till 2020
I have written this
regex_patterns = [
re.compile(r'[0-9]+,?[0-9]+,?[0-9]+,?[0-9]+')
]
It works well, but I do not know how to exclude years from these numbers. Many thanks.
Of course, I am working on sentences; the numbers are inside the sentences, not necessarily at the start of the line, like this:
-Thus 60 is to 41 as 100,000 is to 65,656½, the appropriate magnitude for βυ
This was found to be 36,075,5621 (with an eccentricity of 9165), corresponding to the entire oval path of Mars.
-It was 4657.
EDIT:
Since I ran into a lot of issues during my task, I have updated the question a few times.
First of all, the problem is mainly solved! Thank you all for the contributions.
There is just one very tiny issue. Based on other comments, I have integrated the solution as follows:
r'(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d,])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)'
It captures most of the cases correctly:
https://regex101.com/r/o5gdDt/8
then again as there is a kind of noise in my text like this:
"
I take ψο as a figured unit [x]. It's square GEOM will also be a figured unit [x2]. Add the square GEOM on εο, 227,052, and the sum of the two will be the square GEOM of ψε or ψν. But the square GEOM of βν is 4,310,747,475 PARA
"
It cannot capture the number 227,052, which ends with ",".
When I changed it to this, I ran into another problem:
(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
(basically removing the comma from (?![\d,]))
The regex now captures 4,310,747,475 in this:
4,310,747,475x2+978,763,835,536,363
as you see here..
https://regex101.com/r/o5gdDt/9
Any ideas would be much appreciated.
The regex now works almost well, but it needs further improvement to be perfect.
If you want to exclude all 4-digit numbers (years), it is this:
\b(?!\d{4}\b)[0-9]+(?:,(?!\d{4}\b)[0-9]+)*\b
https://regex101.com/r/T3L3X5/1
If you want to exclude just the years between 1200 and 2020, it is this:
\b(?!(?:12\d{2}|1[3-9]\d{2}|20[01]\d|2020)\b)[0-9]+(?:,(?!(?:12\d{2}|1[3-9]\d{2}|20[01]\d|2020)\b)[0-9]+)*\b
https://regex101.com/r/ZuC6LR/1
You can use following regex to match one to three digit numbers and optionally also match any subsequent numbers that are comma separated but don't have more than 3 digits.
\b\d{1,3}(?:,\d{1,3})*\b
https://regex101.com/r/T6sNUs/1/
The explanation goes like this,
\b - marks word boundary to avoid matching partially in a number larger than 3 digits
\d{1,3} - matches one to three digit number
(?:,\d{1,3})* - non-capturing group optionally matches comma separated number having one to three digits
\b - again marks word boundary to avoid matching partially in a number larger than 3 digits
Edit: For requirement mentioned in comments, where numbers with at least three or more digits optionally separated by comma should match. But it should reject the match if any of the numbers present in the line lies from 1200 to 2020.
This regex should give you what you need,
^(?!.*\b(?:1[2-9]\d\d|20[01]\d|2020)\b)\d{3,}(?:,\d{3,})*$
Please confirm if this works for you, so I can add explanation to above regex.
And in case you want it to restrict it from 1200 to 1800 as you mentioned in your comments, you can use this regex,
^(?!.*\b(?:1[2-7]\d\d|1800)\b)\d{3,}(?:,\d{3,})*$
This is matching all your test cases:
(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}|\d{1,3}(?:,\d{3})*)(?![\d,])
Explanation:
(?<![\d,]) # negative lookbehind, make sure there is no digit or comma before
(?: # non capture group
(?! # negative lookahead, make sure what follows is not:
(?: # non capture group
1[2-9]\d\d # range 1200 -> 1999
| # OR
20[01]\d # range 2000 -> 2019
| # OR
2020 # 2020
) # end group
) # end lookahead
\d{4,} # 4 or more digits
| # OR
\d{1,3} # 1 up to 3 digits
(?:,\d{3})* # non capture group, a comma and 3 digits, 0 or more times
) # end group
(?![\d,]) # negative lookahead, make sure there is no digit or comma after
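The explained pattern can be tried in Python roughly like this. The sample sentence is invented for the sketch, and `NUMBER_RE` is just an illustrative name.

```python
import re

NUMBER_RE = re.compile(
    r"(?<![\d,])"                                  # no digit/comma before
    r"(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}"   # 4+ digits, not year-like
    r"|\d{1,3}(?:,\d{3})*)"                        # or 1-3 digits + ,ddd groups
    r"(?![\d,])"                                   # no digit/comma after
)

text = "It was 4657 in 1850, then 123,456 and 60."
found = NUMBER_RE.findall(text)
print(found)

# Plain years in the 1200-2020 range are skipped.
print(NUMBER_RE.findall("1200 1999 2020"))

# The year lookahead has no trailing boundary, so an unseparated number
# that merely STARTS with a year-like prefix is skipped as well.
print(NUMBER_RE.findall("123456"))
```

The last call is worth noting: because the year check only looks at the first four digits, a long number such as 123456 is rejected too. Ending the inner lookahead with `(?!\d)` would be one way to limit the exclusion to exact 4-digit years, though that variant is an assumption and not part of the answer above.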
Here is the final answer that I arrived at by using the comments and adapting them to my context:
https://regex101.com/r/o5gdDt/8
As you can see, this regex
(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d,])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
can capture all numbers whose digits are comma-separated in groups of 3, in text like
"here is 100,100"
"23,456"
"1,435"
and all numbers of 4 or more digits without comma separators, like
2345
1234 " here is 123456"
also this kind of number
65,656½
65,656½,
23,123½
The only tiny issue is that if a comma (or dot) follows numbers of the first two types, it cannot capture them. For example, it cannot capture
"here is 100,100,"
"23,456,"
"1,435,"
Unfortunately, there are a few numbers in the text which end with a comma. Can someone give me an idea of how to modify this to capture the above as well?
I have tried to do it, and the modified version is:
(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
Basically, I deleted the comma in (?![\d,]), but this causes another problem in my context:
it captures part of a number that belongs to an equation, like this:
4,310,747,475x2
57,349,565,416,398x.
see here:
https://regex101.com/r/o5gdDt/10
I know this is a rather special question. I would be happy to hear your ideas.

python regex look ahead positive + negative

This regex will get 456. My question is why it CANNOT be 234 from 1-234-56? Does 56 satisfy the (?!\d) pattern, since it is NOT a single digit? Where does (?!\d) start looking from?
import re
pattern = re.compile(r'\d{1,3}(?=(\d{3})+(?!\d))')
a = pattern.findall("The number is: 123456") ; print(a)
This is the first stage of adding a comma separator, as in 123,456.
a = pattern.findall("The number is: 123456") ; print(a)
results = pattern.finditer('123456')
for result in results:
    print(result.start(), result.end(), result)
My question is why it CANNOT be 234 from 1-234-56?
It is not possible as (?=(\d{3})+(?!\d)) requires 3-digit sequences appear after a 1-3-digit sequence. 56 (the last digit group in your imagined scenario) is a 2-digit group. Since a quantifier can be either lazy or greedy, you cannot match both one, two and three digit groups with \d{1,3}. To get 234 from 123456, you'd need a specifically tailored regex for it: \B\d{3}, or (?<=1)\d{3} or even \d{3}(?=\d{2}(?!\d)).
Does 56 match the (?!\d) pattern? Where is the beginning point that (?!\d) will look for?
No, this is a negative lookahead; it does not match anything, it only checks that there is no digit right after the current position in the input string. If there is a digit, the match fails (no result is found or returned).
More clarification on the look-ahead: it is located after (\d{3})+ subpattern, thus the regex engine starts searching for a digit right after the last 3-digit group, and fails a match if the digit is found (as it is a negative lookahead). In plain words, the (?!\d) is a number closing/trailing boundary in this regex.
A more detailed breakdown:
\d{1,3} - 1 to 3 digit sequence, as many as possible (greedy quantifier is used)
(?=(\d{3})+(?!\d)) - a positive look-ahead ((?=...)) that checks if the 1-3 digit sequence matched before is followed by
(\d{3})+ - 1 or more (+) sequences of exactly 3 digits...
(?!\d) - not followed by a digit.
Lookaheads do not match, do not consume characters, but you can still capture inside them. When a lookahead is executed, the regex index stays at the same character as before. With your regex and input, \d{1,3} matches 123 because a 3-digit sequence (456) follows. But 456 is captured within the lookahead, and re.findall returns only the captured texts if capturing groups are present.
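The difference between what is matched and what is captured can be seen by inspecting the match object directly. This is a small sketch using the regex from the question; the variable names are illustrative.

```python
import re

pattern = re.compile(r"\d{1,3}(?=(\d{3})+(?!\d))")

# findall returns the content of the capturing group, not the whole match.
captured = pattern.findall("The number is: 123456")
print(captured)

# The match object shows both: group(0) is the consumed text,
# group(1) is what the lookahead captured without consuming it.
m = pattern.search("123456")
print(m.group(0), m.span(), m.group(1))
```

Here the consumed text is `123` at positions 0 to 3, while `456` only appears as group 1, which explains why `findall` reports `456`.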
To just add comma as digit grouping symbol, use
rx = r'\d(?=(?:\d{3})+(?!\d))'
See IDEONE demo
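For completeness, the comma-insertion regex could be applied with `re.sub` along these lines. The function name `add_thousands_separators` is hypothetical, and the sketch assumes plain integer digit runs (no decimal part).

```python
import re

rx = r"\d(?=(?:\d{3})+(?!\d))"


def add_thousands_separators(text):
    """Insert a comma after every digit that is followed by a whole
    number of 3-digit groups up to the end of the digit run."""
    # \g<0> re-inserts the matched digit, then the comma is appended.
    return re.sub(rx, r"\g<0>,", text)


print(add_thousands_separators("The number is: 123456"))
print(add_thousands_separators("1234567"))
print(add_thousands_separators("123"))
```

Using a non-capturing group inside the lookahead keeps the match down to a single digit, so only a comma is added and nothing is dropped.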

Restrict the count of occurrences regex

I am trying to find combination of dates. I am having the following regular expression.
\b([\d]{1,2}[\/\s-]{0,3}\d{2,4})
I want to match the following combinations:
8/1967 or 8-1967
08/1967 same
8/67 same
08/67 same
I don't want it to match the following:
08/967
That is, I want the part after "/" or "-" to be either 2 digits or 4 digits.
But "\d{2,4}" will match 2, 3 or 4 digits, and I don't know how to restrict it to either 2 or 4. If there is any other problem with this regex, please let me know.
If you are matching months and years, do
\b((?:0?[1-9]|1[0-2]) ?[/-] ?(?:[12][0-9])?[0-9]{2})\b
Explanation:
\b - a word boundary between non-alphanumeric and alphanumeric character
(?:0?[1-9]|1[0-2]) - 1-12 and 01-12 (with leading zero)
? - possible space on either side of the separator
[-/] 1 separator character, either - or /
(?:[12][0-9])?[0-9]{2}) - either 4-digit number that starts with 1 or 2, or 2 digit number with any digits.
\b - ends with word boundary (the next character is not alphanumeric).
This will match the following strings: 03-1902, 12 / 2014, 6 / 03
but will not match any of 3 / 3009, 13/2009, or 26-30, or 3///60, or 12/34567.
I use [0-9] instead of \d because \d is not limited to ASCII digits; in Python 3, for example, it matches any Unicode decimal digit unless re.ASCII is set.
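The matching and non-matching strings listed above can be checked with a short script. This is only a sketch; `DATE_RE`, `should_match` and `should_not_match` are names chosen for illustration, and the last few inputs come from the question rather than the answer.

```python
import re

DATE_RE = re.compile(r"\b((?:0?[1-9]|1[0-2]) ?[/-] ?(?:[12][0-9])?[0-9]{2})\b")

should_match = ["03-1902", "12 / 2014", "6 / 03", "8/1967", "08/67"]
should_not_match = ["3 / 3009", "13/2009", "26-30", "3///60", "12/34567", "08/967"]

for s in should_match:
    print(s, "->", DATE_RE.search(s).group(1))

for s in should_not_match:
    print(s, "->", DATE_RE.search(s))
```

The trailing `\b` is what rejects inputs such as `08/967`: after the two digits `96` the next character is another digit, so no word boundary exists at that point.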
To match a date range (are you possibly doing a cv/resume parser here?), you can do:
date_re = r'\b((?:0?[1-9]|1[0-2]) ?[/-] ?(?:[12][0-9])?[0-9]{2})\b'
date_span = r'%s(?:[\s-]+)-\s*%s' % (date_re, date_re)
which produces the following regular expression in date_span:
\b((?:0?[1-9]|1[0-2]) ?[/-] ?(?:[12][0-9])?[0-9]{2})\b(?:[\s-]+)-\s*\b((?:0?[1-9]|1[0-2]) ?[/-] ?(?:[12][0-9])?[0-9]{2})\b
Change \d{2,4} into \d{2}(\d{2})?
This will get you what you want.
First match 2 digits, then optionally match 2 more digits.
That's exactly 2 or 4 digits.
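A minimal sketch of the "2 or 4 digits" idea follows. Two things here are assumptions not stated in the answer: the optional part is written as a non-capturing group, and `re.fullmatch` is used so that the whole input must fit the pattern (with `re.search`, a trailing boundary would be needed to stop `08/967` from matching as `08/96`).

```python
import re

# Exactly 2 digits, optionally followed by exactly 2 more: 2 or 4, never 3.
YEAR_PART = re.compile(r"\d{2}(?:\d{2})?")

# The asker's pattern with \d{2,4} swapped for the construct above.
DATE_RE = re.compile(r"\d{1,2}[/\s-]{0,3}\d{2}(?:\d{2})?")

for candidate in ["67", "967", "1967", "19675"]:
    print(candidate, bool(YEAR_PART.fullmatch(candidate)))

for candidate in ["8/1967", "8-1967", "08/1967", "8/67", "08/67", "08/967"]:
    print(candidate, bool(DATE_RE.fullmatch(candidate)))
```

Only the inputs whose final part has 2 or 4 digits should pass; `967` and `08/967` are rejected.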
\b((?<!\/)[\d]{1,2}[\/\s-]{0,3}(?!\d{3}\b)\d{2,4})
Try this. See the demo.
https://regex101.com/r/wX9fR1/11
(?!\d{3}\b) ensures that a group of exactly 3 digits is not matched.
