I am trying to write a regular expression that searches for a string and, if it finds even a partial match, extracts the numbers from the two lines above and the two lines below the matched string or substring.
My text is:
Subtotal AED1,232.20
AED61.61
VAT
5 % Tax:
RECEIPT TOTAL: AED1.293.81
I wish to search for the word VAT and extract all numbers from two lines above and below it.
Expected output:
AED1,232.20
AED61.61
5 %
AED1.293.81
I am able to extract the entire content, but I need only the numbers; AED can be dropped or ignored.
My regex is:
((.*\n){2}).*vat(.*\n.*\n.*)
Thanks in advance!
Try this:
(?:[a-zA-Z:]*([0-9,.]+)[a-zA-Z:]*)\n(?:[a-zA-Z:]*([0-9,.]+)[a-zA-Z:]*)\nVAT\n(?:[a-zA-Z:]*([0-9,.]+)[a-zA-Z:]*).*\n[^0-9]*(?:[a-zA-Z:]*([0-9,.]+)[a-zA-Z:]*)
This regex may seem too complex or long, but it gives better control and returns only the numbers, so it will do the job.
You may use this regex in python:
((?:^.*\d.*\n){0,2})VAT((?:\n.*\d.*){0,2})
RegEx Details:
((?:^.*\d.*\n){0,2}): Match up to 2 leading lines, each of which must contain at least one digit
VAT: Match the text VAT
((?:\n.*\d.*){0,2}): Match up to 2 trailing lines, each of which must contain at least one digit
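As a rough sketch of how this pattern might be used from Python: the two groups hold the lines around VAT, and a second pass pulls the numbers out of them. The re.MULTILINE flag (so that ^ can match at the start of every line), the helper name numbers_around_vat, and the number pattern \d[\d.,]* are assumptions made for illustration; they are not part of the answer above.

```python
import re

text = """Subtotal AED1,232.20
AED61.61
VAT
5 % Tax:
RECEIPT TOTAL: AED1.293.81"""

# Up to two digit-bearing lines before and after the line "VAT".
context = re.compile(r'((?:^.*\d.*\n){0,2})VAT((?:\n.*\d.*){0,2})', re.MULTILINE)
# Assumed number shape: a digit followed by digits, dots or commas.
number = re.compile(r'\d[\d.,]*')

def numbers_around_vat(text):
    """Return the numbers found in the lines captured around VAT."""
    m = context.search(text)
    if m is None:
        return []
    return number.findall(m.group(1) + m.group(2))

print(numbers_around_vat(text))  # ['1,232.20', '61.61', '5', '1.293.81']
```

The currency prefix is dropped simply because the number pattern starts at the first digit.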
This regex is tailor-made for your input text and expected output:
r'.* (AED\d{1,3}(?:,\d{3})*\.\d{2})\n(AED\d{1,3}(?:,\d{3})*\.\d{2})\nVAT\n(\d{1,2} %) Tax:\n.* (AED\d{1,3}(?:,\d{3})*\.\d{2})'
Your required Regex
It outputs exactly the text you want, without extra words.
It also works with more than one "VAT" in your input text.
Regex logic:
(AED\d{1,3}(?:,\d{3})*\.\d{2}) Match currency code and amount (in one group)
(\d{1,2} %) Match VAT %. Supports 1 to 2 digits. You can further enhance it to support decimal point.
Note that the proper regex for currency amount (with comma as thousand separator and exactly 2 decimal points) should be as follows:
r'\d{1,3}(?:,\d{3})*\.\d{2}'
[Here (?: expr) marks a non-capturing group, so this subgroup is not returned as a separate match by your re.findall call.]
In case your input supports currency codes other than 'AED', you can replace 'AED' with [A-Z]{3} as currency code should normally be in 3-character capital letters.
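A small sketch of the generalised amount pattern with [A-Z]{3} in place of AED. The sample string and the USD amount in it are made up for illustration; splitting the code and the amount into two groups is also just one possible choice.

```python
import re

# Three capital letters as the currency code, then an amount with optional
# comma thousand separators and exactly two decimals.
amount = re.compile(r'([A-Z]{3})(\d{1,3}(?:,\d{3})*\.\d{2})')

sample = "Subtotal AED1,232.20 and USD61.61"  # hypothetical input
pairs = amount.findall(sample)
print(pairs)  # [('AED', '1,232.20'), ('USD', '61.61')]
```

With two capturing groups, re.findall returns one (code, amount) tuple per match, so the code can be ignored by taking only the second element.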
I want to capture all numbers, comma-separated or not, excluding 4-digit numbers that are years:
I want to match these numbers (in my case the digits are always grouped in threes):
978,763,835,536,363
123
123,456
123456
7456
3400
excluding the years like
1200 till 2020
I have written this
regex_patterns = [
re.compile(r'[0-9]+,?[0-9]+,?[0-9]+,?[0-9]+')
]
It works well, but I do not know how to exclude years from these numbers. Many thanks.
Of course, I am working on sentences; the numbers are inside the sentences, not necessarily at the start of the line, like this:
-Thus 60 is to 41 as 100,000 is to 65,656½, the appropriate magnitude for βυ
This was found to be 36,075,5621 (with an eccentricity of 9165), corresponding to the entire oval path of Mars.
-It was 4657.
EDIT:
Since I ran into a lot of issues during my task, I have updated the question a few times.
First of all, the problem is mainly solved! Thank you all for the contributions.
There is just one very tiny issue. Based on the other comments, I have integrated the solution as follows:
r'(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d,])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)'
It captures most of the cases correctly:
https://regex101.com/r/o5gdDt/8
Then again, there is a kind of noise in my text, like this:
"
I take ψο as a figured unit [x]. It's square GEOM will also be a figured unit [x2]. Add the square GEOM on εο, 227,052, and the sum of the two will be the square GEOM of ψε or ψν. But the square GEOM of βν is 4,310,747,475 PARA
"
It cannot capture the number 227,052, which ends with ",".
When I changed it to
(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
(basically dropping the comma from (?![\d,])),
I ran into another problem: the regex captured part of 4,310,747,475 in this:
4,310,747,475x2+978,763,835,536,363
as you can see here:
https://regex101.com/r/o5gdDt/9
Any idea would be much appreciated.
The regex now works almost perfectly, but I still need to improve it to cover these cases.
-
To exclude all 4-digit numbers, use this:
\b(?!\d{4}\b)[0-9]+(?:,(?!\d{4}\b)[0-9]+)*\b
https://regex101.com/r/T3L3X5/1
To exclude only the years between 1200 and 2020, use this:
\b(?!(?:12\d{2}|1[3-9]\d{2}|20[01]\d|2020)\b)[0-9]+(?:,(?!(?:12\d{2}|1[3-9]\d{2}|20[01]\d|2020)\b)[0-9]+)*\b
https://regex101.com/r/ZuC6LR/1
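A minimal sketch of the second pattern applied to the sample numbers from the question. Factoring the year alternation into a YEAR constant is only a readability choice made here; the compiled pattern is the same one shown above.

```python
import re

# Years 1200 through 2020, as in the pattern above.
YEAR = r'(?:12\d{2}|1[3-9]\d{2}|20[01]\d|2020)'
pattern = re.compile(
    r'\b(?!' + YEAR + r'\b)[0-9]+(?:,(?!' + YEAR + r'\b)[0-9]+)*\b'
)

sample = "978,763,835,536,363 123 123,456 123456 7456 3400 1200 2020"
matches = pattern.findall(sample)
print(matches)
# ['978,763,835,536,363', '123', '123,456', '123456', '7456', '3400']
```

The \b after the year alternative is what lets 123456 through: its first four digits look like a year, but they are not followed by a word boundary.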
You can use the following regex to match numbers of one to three digits and, optionally, any subsequent comma-separated groups that have no more than 3 digits.
\b\d{1,3}(?:,\d{1,3})*\b
https://regex101.com/r/T6sNUs/1/
The explanation goes like this:
\b - marks a word boundary to avoid matching part of a number longer than 3 digits
\d{1,3} - matches a one- to three-digit number
(?:,\d{1,3})* - non-capturing group that optionally matches further comma-separated groups of one to three digits
\b - again marks a word boundary to avoid matching part of a number longer than 3 digits
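To illustrate the effect of the two word boundaries, here is a short sketch; the sample sentence is made up for the purpose.

```python
import re

pattern = re.compile(r'\b\d{1,3}(?:,\d{1,3})*\b')

sample = "123 and 123,456 but 7456"  # hypothetical input
matches = pattern.findall(sample)
print(matches)  # ['123', '123,456']
```

7456 is skipped entirely: after at most three digits the pattern needs a word boundary, and there is none between two digits.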
Edit: For the requirement mentioned in the comments, numbers with three or more digits, optionally separated by commas, should match, but the match should be rejected if any number on the line lies between 1200 and 2020.
This regex should give you what you need:
^(?!.*\b(?:1[2-9]\d\d|20[01]\d|2020)\b)\d{3,}(?:,\d{3,})*$
Please confirm whether this works for you, so I can add an explanation of the regex above.
And in case you want to restrict the range to 1200 to 1800, as you mentioned in your comments, you can use this regex:
^(?!.*\b(?:1[2-7]\d\d|1800)\b)\d{3,}(?:,\d{3,})*$
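Since both of these patterns are anchored with ^ and $, they test a whole line at a time. A small sketch of filtering candidate lines with each variant follows; the candidate list and the variable names are assumptions for illustration.

```python
import re

# Rejects any line containing a year from 1200 to 2020.
up_to_2020 = re.compile(r'^(?!.*\b(?:1[2-9]\d\d|20[01]\d|2020)\b)\d{3,}(?:,\d{3,})*$')
# Rejects any line containing a year from 1200 to 1800.
up_to_1800 = re.compile(r'^(?!.*\b(?:1[2-7]\d\d|1800)\b)\d{3,}(?:,\d{3,})*$')

candidates = ["978,763,835,536,363", "123456", "7456", "1200", "1850", "2020", "12"]

kept_2020 = [c for c in candidates if up_to_2020.match(c)]
kept_1800 = [c for c in candidates if up_to_1800.match(c)]

print(kept_2020)  # ['978,763,835,536,363', '123456', '7456']
print(kept_1800)  # ['978,763,835,536,363', '123456', '7456', '1850', '2020']
```

Note that 12 is dropped by both variants because \d{3,} requires at least three digits.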
This is matching all your test cases:
(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}|\d{1,3}(?:,\d{3})*)(?![\d,])
Explanation:
(?<![\d,]) # negative lookbehind, make sure there is no digit or comma before
(?: # non capture group
(?! # negative lookahead, make sure the following does not come next:
(?: # non capture group
1[2-9]\d\d # range 1200 -> 1999
| # OR
20[01]\d # range 2000 -> 2019
| # OR
2020 # 2020
) # end group
) # end lookahead
\d{4,} # 4 or more digits
| # OR
\d{1,3} # 1 up to 3 digits
(?:,\d{3})* # non capture group, a comma and 3 digits, 0 or more times
) # end group
(?![\d,]) # negative lookahead, make sure there is no digit or comma after
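The annotated layout above maps almost directly onto Python's re.VERBOSE mode, where whitespace and # comments inside the pattern are ignored. A minimal sketch follows; the sample sentence is an assumption, loosely modelled on the sentences in the question.

```python
import re

pattern = re.compile(r'''
    (?<![\d,])                              # no digit or comma before
    (?:
        (?!(?:1[2-9]\d\d|20[01]\d|2020))    # must not start with 1200..2020
        \d{4,}                              # 4 or more digits
      |
        \d{1,3}(?:,\d{3})*                  # 1 to 3 digits, then groups of ",ddd"
    )
    (?![\d,])                               # no digit or comma after
''', re.VERBOSE)

sample = "It was 4657 in 1850 then 123,456 and 7456."  # hypothetical input
matches = pattern.findall(sample)
print(matches)  # ['4657', '123,456', '7456']
```

One thing to be aware of: the year check has no trailing boundary, so a longer uncommaed number whose first four digits fall in 1200 to 2020 (for example 12345) is rejected as well.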
Here is the final answer that I arrived at by using the comments and adapting them to my context:
https://regex101.com/r/o5gdDt/8
As you can see, this pattern
(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d,])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
can capture all numbers whose digits are separated into groups of 3 in text like
"here is 100,100"
"23,456"
"1,435"
all numbers with 4 or more digits and no comma separators, like
2345
1234 " here is 123456"
also this kind of number
65,656½
65,656½,
23,123½
The only tiny issue is that if there is a comma (or dot) right after the first two types, it cannot capture them. For example, it cannot capture
"here is 100,100,"
"23,456,"
"1,435,"
Unfortunately, there are a few numbers in the text that end with a comma. Can someone give me an idea of how to modify this to capture the above as well?
I have tried to do it, and the modified version is:
(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
Basically, I deleted the comma in (?![\d,]), but that causes another problem in my context:
it captures part of a number that belongs to an equation, like this:
4,310,747,475x2
57,349,565,416,398x.
see here:
https://regex101.com/r/o5gdDt/10
I know this is a rather special question; I would be happy to hear your ideas.
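One possible direction, offered only as a sketch checked against the examples in this post and not against the full text: keep rejecting a comma that continues the number, but allow a comma that is followed by something other than a digit. That means replacing (?![\d,]) with (?!\d|,\d). The lookahead (?!\d|,\d) is an assumption introduced here, not something taken from the comments.

```python
import re

pattern = re.compile(
    r'(?<!\S)(?<![\d,])'
    r'(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?'
    r'|\d{1,3}(?:,\d{3})+)'
    r'(?!\d|,\d)'  # changed: a comma blocks the match only when a digit follows it
    r'[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)'
)

prose = "Add the square on AB, 227,052, and the other is 4,310,747,475 PARA"
equation = "4,310,747,475x2+978,763,835,536,363"

print(pattern.findall(prose))               # ['227,052', '4,310,747,475']
print(pattern.findall(equation))            # []
print(pattern.findall("here is 100,100,"))  # ['100,100']
```

The equation is still skipped because backtracking to 4,310,747 now fails: the comma that follows it is itself followed by a digit.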
I am working on input validation and need a regex that accepts only two numbers, each at most 2 digits long, with a single space between them.
Regex for Python
import re
pattern = "^[0-9_ ]{2}$"
check = "01 03"
a = re.match(pattern, check)
if a is None:
    print('Not valid value')
else:
    print("valid value")
The output I get is 'Not valid value'. What am I doing wrong here?
You're repeating a character set with {2}, which matches exactly two of the preceding token. So there will only be a match if the string contains exactly two characters.
Instead, use the character set [0-9]{1,2} to match one or two digits, followed by a space, followed by that repeated character set again:
[0-9]{1,2} [0-9]{1,2}$
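A short sketch of that check in Python 3. It uses re.fullmatch in place of the $ anchor, which is a swapped-in choice: $ also matches just before a trailing newline, while fullmatch requires the whole string to match. The helper name is_valid is hypothetical.

```python
import re

# One or two digits, a single space, one or two digits.
pattern = re.compile(r'[0-9]{1,2} [0-9]{1,2}')

def is_valid(value):
    """Return True if value is two 1-2 digit numbers separated by one space."""
    return pattern.fullmatch(value) is not None

for check in ["01 03", "1 3", "001 03", "01  03"]:
    print(repr(check), "valid value" if is_valid(check) else "Not valid value")
```

Only the first two inputs are reported as valid; three digits or a double space are rejected.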
I want to extract the capitalized word that occurs up to 3 or 4 words before the word "cell" or "cells".
Example:
Briefly, MCF-7 idential cells grown as described above were treated with a range of LTX-diol or iso-LTX-diol.
I would like to extract MCF-7 from above example.
I tried to use [A-Z0-9-]+cells, but it's returning cells instead of MCF-7.
This answer assumes that you want to match a word beginning with a capital letter, which in turn is followed by 1 to 4 other words, followed then by cell or cells. We can try matching using the following pattern:
([A-Z][^ ]*)(?=\s+(?:[^A-Z]\S*\s+){1,4}cells?)
The positive lookahead at the end of the pattern asserts the requirement for 1 to 4 words occurring before cell or cells.
import re

text = "Briefly, MCF-7 idential cells grown as described above were treated with a range of LTX-diol or iso-LTX-diol."
r1 = re.findall(r"([A-Z][^ ]*)(?=\s+(?:[^A-Z]\S*\s+){1,4}cells?)", text)
print(r1)
['MCF-7']
In trying to solve this challenge (which I pasted at the bottom of this question) using Python 3, the first of the two proposed solutions below, passes all test cases, while the second one doesn't. Since, in my eyes, they're doing pretty much the same, this leaves me very confused. Why doesn't the second block of code work?
It must be something very obvious because the second one fails most test cases, but having worked through custom-inputs, I still can't figure it out.
Working solution:
import re
import sys
lines = sys.stdin.readlines()
n = int(lines[0])
q = int(lines[n+1])
N = lines[1:n+1]
S = lines[n+2:]
text = "\n".join(N)
for s in S:
    print(len(re.findall(r"(?<!\W)(?="+s.strip()+r"\w)", text)))
Broken "solution":
import re
import sys
lines = sys.stdin.readlines()
n = int(lines[0])
q = int(lines[n+1])
N = lines[1:n+1]
S = lines[n+2:]
for s in S:
    total = 0
    for string in N:
        total += len(re.findall("(?<!\W)(?="+s.strip()+"\w)", string))
    print(total)
We define a word character to be any of the following:
An English alphabetic letter (i.e., a-z and A-Z).
A decimal digit (i.e., 0-9).
An underscore (i.e., _, which corresponds to ASCII value 95).
We define a word to be a contiguous sequence of one or more word characters that is preceded and succeeded by one or more occurrences of non-word-characters or line terminators. For example, in the string I l0ve-cheese_?, the words are I, l0ve, and cheese_.
We define a sub-word as follows:
A sequence of word characters (i.e., English alphabetic letters, digits, and/or underscores) that occur in the same exact order (i.e., as a contiguous sequence) inside another word.
It is preceded and succeeded by word characters only.
Given n sentences consisting of one or more words separated by non-word characters, process q queries where each query consists of a single string. To process each query, count the number of occurrences of the query string as a sub-word in all sentences, then print the number of occurrences on a new line.
Input Format
The first line contains an integer, n, denoting the number of sentences.
Each of the n subsequent lines contains a sentence consisting of words separated by non-word characters.
The next line contains an integer, q, denoting the number of queries.
Each of the q subsequent lines contains a string to check.
Constraints
1 ≤ n ≤ 100
1 ≤ q ≤ 10
Output Format
For each query string, print the total number of times it occurs as a sub-word within all words in all sentences.
Sample Input
1
existing pessimist optimist this is
1
is
Sample Output
3
Explanation
We must count the number of times is occurs as a sub-word in our input sentence(s):
is occurs 1 time as a sub-word of existing.
is occurs 1 time as a sub-word of pessimist.
is occurs 1 time as a sub-word of optimist.
While is is a substring of the word this, it's followed by a blank space; because a blank space is non-alphabetic, non-numeric, and not an underscore, we do not count it as a sub-word occurrence.
While is is a substring of the word is in the sentence, we do not count it as a match because it is preceded and succeeded by non-word characters (i.e., blank spaces) in the sentence. This means it doesn't count as a sub-word occurrence.
Next, we sum the occurrences of is as a sub-word of all our words as 1+1+1+0+0=3. Thus, we print 3 on a new line.
Specify your patterns as raw strings. Without the r prefix, Python processes backslash sequences in the literal before the regex engine sees them; \W and \w happen to pass through unchanged, but other sequences (such as \b) would not, and recent Python versions warn about unrecognized escapes.
Since you are no longer looking inside a single multiline string, each sentence now starts at the beginning of the string being searched, where the negative lookbehind (?<!\W) succeeds. You'll want to change your negative lookbehind to a positive one: (?<=\w)
As Wiktor mentions in his comment, it would be a good idea to escape s.strip() so that any characters that could be treated as regex metacharacters are taken literally. You can use re.escape(s.strip()) for that.
Your code will work with this change:
total += len(re.findall(r"(?<=\w)(?=" + re.escape(s.strip()) + r"\w)", string))