Extracting data that follows specific string only - python

I have a file I want to extract data from using regex that looks like this :
RID: RSS-130 SERVICE PAGE: 2
REPORTING FOR: 100019912 SSSE INTSERVICE PROC DATE: 15SEP21
ROLLUP FOR: 100076212 SSSE REPORT REPORT DATE: 15SEP21
ENTITY: 1000208212 SSSE
ACQT
PUR
SAME 10SEP21 120 12,263,518 19,48.5
T PUR 120 12,263,518 19,48.5
The regex I wrote to extract the data :
regex_1 = PROC DATE:\s*(\w+).*? # to get 15SEP21
regex_2 = T PUR\s*([0-9,]*\s*[0-9,]*) # to get the first two elements of the line after T PUR
This works, but the file contains multiple records just like this one under different RIDs, for example RID: RSS-140. I want to extract only the information that follows RID: RSS-130 and ACQT, and stop when that record is over instead of carrying on into whatever comes after it. How can I do that?
Desired output would be :
[(15SEP21;120;12,263,518)] for the record that comes under RID: RSS-130 and after ACQT only

I suggest leveraging a tempered greedy token here:
(?s)PROC DATE:\s*(?P<date>\w+)(?:(?!RID:\s+RSS-\d).)*T PUR\s+(?P<num>\d[.,\d]*)\s+(?P<val>\d[\d,]*)
See the regex demo. Details:
(?s) - an inline re.S / re.DOTALL modifier
PROC DATE: - a literal text
\s* - zero or more whitespaces
(?P<date>\w+) - Group "date": one or more word chars
(?:(?!RID:\s+RSS-\d).)* - any single char, zero or more but as many as possible occurrences, that does not start a RID:\s+RSS-\d pattern (block start pattern, RID:, one or more whitespaces, RSS- and a digit)
T PUR - a literal string
\s+ - one or more whitespaces
(?P<num>\d[.,\d]*) - Group "num": a digit and then zero or more commas, dots and digits
\s+ - one or more whitespaces
(?P<val>\d[\d,]*) - Group "val": a digit and then zero or more commas or digits.
See the Python demo:
import re
text = "RID: RSS-130 SERVICE PAGE: 2 \nREPORTING FOR: 100019912 SSSE INTSERVICE PROC DATE: 15SEP21 \nROLLUP FOR: 100076212 SSSE REPORT REPORT DATE: 15SEP21 \nENTITY: 1000208212 SSSE \n \n \n \n \n ACQT \n \n \n PUR \n SAME 10SEP21 120 12,263,518 19,48.5 \n \n T PUR 120 12,263,518 19,48.5"
rx = r"PROC DATE:\s*(?P<date>\w+)(?:(?!RID:\s+RSS-\d).)*T PUR\s+(?P<num>\d[.,\d]*)\s+(?P<val>\d[\d,]*)"
m = re.search(rx, text, re.DOTALL)
if m:
    print(m.groupdict())
# => {'date': '15SEP21', 'num': '120', 'val': '12,263,518'}
If you MUST check for T PUR after ACQT, modify the pattern to
(?s)PROC DATE:\s*(?P<date>\w+)(?:(?!RID:\s+RSS-\d|ACQT).)*ACQT(?:(?!RID:\s+RSS-\d).)*T PUR\s+(?P<num>\d[.,\d]*)\s+(?P<val>\d[\d,]*)
See this regex demo.
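To tie the match to the RID: RSS-130 record specifically, one option is to start the pattern at that record header. The following is a minimal sketch, assuming the layout from the question; the RSS-140 and RSS-150 records and their values are made up for illustration.

```python
import re

# Hypothetical input: the RSS-140 and RSS-150 records are invented for this sketch.
text = (
    "RID: RSS-140 SERVICE PAGE: 1\n"
    "REPORTING FOR: 100019912 SSSE INTSERVICE PROC DATE: 14SEP21\n"
    "ACQT\n"
    "T PUR 99 1,000 5.5\n"
    "RID: RSS-130 SERVICE PAGE: 2\n"
    "REPORTING FOR: 100019912 SSSE INTSERVICE PROC DATE: 15SEP21\n"
    "ACQT\n"
    "PUR\n"
    "SAME 10SEP21 120 12,263,518 19,48.5\n"
    "T PUR 120 12,263,518 19,48.5\n"
    "RID: RSS-150 SERVICE PAGE: 3\n"
    "REPORTING FOR: 100019912 SSSE INTSERVICE PROC DATE: 16SEP21\n"
    "ACQT\n"
    "T PUR 7 70 1.0\n"
)

rx = re.compile(
    r"RID:\s+RSS-130\b"                                    # only this record header
    r"(?:(?!RID:\s+RSS-\d).)*PROC DATE:\s*(?P<date>\w+)"   # stay inside the record
    r"(?:(?!RID:\s+RSS-\d|ACQT).)*ACQT"                    # require ACQT first
    r"(?:(?!RID:\s+RSS-\d).)*T PUR\s+(?P<num>\d[.,\d]*)\s+(?P<val>\d[\d,]*)",
    re.DOTALL,
)

records = [(m["date"], m["num"], m["val"]) for m in rx.finditer(text)]
print(records)
```

Each tempered token stops at the next RID: RSS- header, so values from the neighbouring records cannot leak into the result.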


How to extract all comma delimited numbers inside () brackets and ignore any text

I am trying to extract the comma delimited numbers inside () brackets from a string. I can get the numbers if they are alone on a line, but I can't seem to find a solution to get the numbers when other surrounding text is involved. Any help will be appreciated. Below is the code that I currently use in Python.
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
line = each.strip()
regex_criteria = r'"^([1-9][0-9]*|\([1-9][0-9]*\}|\(([1-9][0-9]*,?)+[1-9][0-9]*\))$"gm'
if (line.__contains__('(') and line.__contains__(')') and not re.search('[a-zA-Z]', refline)):
refline = line[line.find('(')+1:line.find(')')]
if not re.search('[a-zA-Z]', refline):
Remove the ^ and $ anchors; they are what prevents you from getting all the numbers. Also, the gm flags won't work in Python re.
You can change your regex to :([1-9][0-9]*|\([1-9][0-9]*\}|\(?:([1-9][0-9]*,?)+[1-9][0-9]*\)) if you want to get each number separately.
Or you can simplify your pattern to (?<=[(,])[1-9][0-9]+(?=[,)])
Test regex here: https://regex101.com/r/RlGwve/1
Python code:
import re
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
print(re.findall(r'(?<=[(,])[1-9][0-9]+(?=[,)])', line))
# ['101065', '101066', '101067', '101065']
(?<=[(,])[1-9][0-9]+(?=[,)])
The above pattern matches a number that starts with 1-9 and continues with one or more further digits (so at least two digits in total), but only if the number is immediately preceded by ( or , and immediately followed by , or ).
Here's another option:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)*(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?<=\(): Lookbehind for (
[1-9]+\d*: A number that starts with a non-zero digit (\d+ would work too if leading zeros are acceptable)
(?:,[1-9]\d*)*: Zero or multiple numbers after a ,
(?=\)): Lookahead for )
Result for your line:
[['101065', '101066', '101067'], ['101065']]
If you only want the comma separated numbers:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)+(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?:,[1-9]\d*)+: One or more numbers after a ,
Result:
[['101065', '101066', '101067']]
Now, if your line could also look like
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines ( 101065,101066, 101067 )
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
then you have to sprinkle the pattern with \s* and remove the whitespace afterwards (here with str.translate and str.maketrans):
pattern = re.compile(r"(?<=\()\s*[1-9]+\d*(?:\s*,\s*[1-9]\d*\s*)*(?=\))")
table = str.maketrans("", "", " ")
results = [match[0].translate(table).split(",") for match in pattern.finditer(line)]
Result:
[['101065', '101066', '101067'], ['101065']]
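If the lookarounds feel hard to maintain, a two-step variant may be easier to read: first capture the inside of each bracketed list, then pull the numbers out of it. This is only a sketch; unlike the patterns above it uses \d+, so it assumes that numbers with leading zeros are acceptable.

```python
import re

line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines ( 101065,101066, 101067 )
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""

# Step 1: capture the inside of every bracketed, comma-separated digit list.
group_rx = re.compile(r"\(\s*(\d+(?:\s*,\s*\d+)*)\s*\)")

# Step 2: extract the individual numbers from each captured list.
results = [re.findall(r"\d+", inner) for inner in group_rx.findall(line)]
print(results)
```

Brackets that contain anything other than digits, commas and whitespace are skipped entirely, so no whitespace cleanup is needed afterwards.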
Using the pypi regex module you could also use capture groups:
\((?P<num>\d+)(?:,(?P<num>\d+))*\)
The pattern matches:
\( Match (
(?P<num>\d+) Capture group, match 1+ digits
(?:,(?P<num>\d+))* Optionally repeat matching , and 1+ digits in a capture group
\) Match )
Regex demo | Python demo
Example code
import regex
pattern = r"\((?P<num>\d+)(?:,(?P<num>\d+))*\)"
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
matches = regex.finditer(pattern, line)
for _, m in enumerate(matches, start=1):
    print(m.capturesdict())
Output
{'num': ['101065', '101066', '101067']}
{'num': ['101065']}

preprocessing the text and excluding form footnotes , extra spaces and

I need to clean my corpus. It has these problems:
multiple spaces --> Tables .
footnote markers --> 10h 50m,1
the unknown character ” --> should be replaced with "
For instance, you can see them here:
On 1580 November 12 at 10h 50m,1 they set Mars down at 8° 36’ 50” Gemini2 without mentioning the horizontal variations, by which term I wish the diurnal parallaxes and the refractions to be understood in what follows. Now this observation is distant and isolated. It was reduced to the moment of opposition using the diurnal motion from the Prutenic Tables .
I have done it using these functions
def fix4token(x):
    x=re.sub('”', '\"', x)
    if (x[0].isdigit()== False )| (bool(re.search('[a-zA-Z]', x))==True ):
        res=x.rstrip('0123456789')
        output = re.split(r"\b,\b",res, 1)[0]
        return output
    else:
        return x
def removespaces(x):
    res=x.replace("  ", " ")
    return(res)
This works reasonably well for this paragraph; the result is:
On 1580 November 12 at 10h 50m, they set Mars down at 8° 36’ 50" Gemini without mentioning the horizontal variations, by which term I wish the diurnal parallaxes and the refractions to be understood in what follows. Now this observation is distant and isolated. It was reduced to the moment of opposition using the diurnal motion from the Prutenic Tables.
But the problem is that it damaged other paragraphs. It does not work very well,
I guess because these lines break other things:
x=re.sub('”', '\"', x)
if (x[0].isdigit()== False )| (bool(re.search('[a-zA-Z]', x))==True ):
    res=x.rstrip('0123456789')
    output = re.split(r"\b,\b",res, 1)[0]
What is the safest way to do the following?
1- Remove footnote markers like the ones in these phrases:
"10h 50m,1" (an extra footnote number in the text after a comma)
"Gemini2" (a zodiac sign name + footnote number)
without changing any other part of the text (e.g. my approach turns "DC2" into "DC", which is not desired).
2- Remove multiple spaces before a dot, so that "Tables ." has no space before the dot,
or reduce multiple spaces to only one space, so that ", by which term" becomes ", by which term".
3- Replace the unknown ” with ", which is already done.
Thank you.
You can use
text = re.sub(r'\b(?:(?<=,)\d+|(Capricorn|Aquarius|Pisces|Aries|Taurus|Gemini|Cancer|Leo|Virgo|Libra|Scorpio|Ophiuchus|Sagittarius)\d+)\b|\s+(?=[.,])', r'\1', text, flags=re.I).replace('”', '"')
text = re.sub(r'\s{2,}', ' ', text)
Details:
\b - a word boundary
(?: - start of a non-capturing group:
(?<=,)\d+ - one or more digits that are preceded with a comma
| - or
(Capricorn|Aquarius|Pisces|Aries|Taurus|Gemini|Cancer|Leo|Virgo|Libra|Scorpio|Ophiuchus|Sagittarius)\d+ - one of the zodiac sign words (captured into Group 1, \1 in the replacement pattern refers to this value) and then one or more digits
) - the end of the non-capturing group
\b
| - or
\s+ - one or more whitespaces
(?=[.,]) - that are immediately followed with . or ,.
The .replace('”', '"') replaces all ” with a " char.
The re.sub(r'\s{2,}', ' ', text) code replaces all chunks of two or more whitespaces with a single regular space.
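As a rough check, the two substitutions can be wrapped in a helper and run on a shortened sample. This is a sketch only: the names clean, ZODIAC and FOOTNOTE_RX are made up here, and the sample sentence is adapted from the question so that it also contains a "DC2" that must survive.

```python
import re

# Hypothetical helper names; the pattern itself is the one from the answer above.
ZODIAC = ("Capricorn|Aquarius|Pisces|Aries|Taurus|Gemini|Cancer|Leo|Virgo|"
          "Libra|Scorpio|Ophiuchus|Sagittarius")
FOOTNOTE_RX = re.compile(
    rf"\b(?:(?<=,)\d+|({ZODIAC})\d+)\b|\s+(?=[.,])",
    flags=re.I,
)

def clean(text):
    # Drop footnote digits and spaces before . or , then normalise the quote.
    text = FOOTNOTE_RX.sub(r"\1", text).replace("”", '"')
    # Collapse any remaining runs of whitespace into one space.
    return re.sub(r"\s{2,}", " ", text)

sample = ("At 10h 50m,1 they set Mars down at 8° 36’ 50” Gemini2,  by which "
          "term DC2 is kept. See the Prutenic Tables   .")
cleaned = clean(sample)
print(cleaned)
```

One caveat with this pattern: the (?<=,)\d+ branch removes any run of digits that directly follows a comma, so a number written like 1,000 would lose its last three digits. If the corpus contains such numbers, that branch needs a stricter condition.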

Regex to match "words" that contain two continuous streaks of digits and letters or vice-versa and split them

I have the following line of text:
text= 'Cms12345678 Gleandaleacademy Fee Collection 00001234Abcd Renewal 123Acgf456789'
I am trying to split only the words that are letters followed by digits, or digits followed by letters, to get this output:
output_text = 'Cms 12345678 Gleandaleacademy Fee Collection 00001234 Abcd Renewal 123Acgf456789'
I have tried the following approach:
import re
text = 'Cms12345678 Gleandaleacademy Fee Collection 00001234Abcd Renewal 123Acgf456789'
text = text.lower().strip()
text = text.split(' ')
output_text =[]
for i in text:
    if bool(re.match(r'[a-z]+\d+|\d+\w+',i, re.IGNORECASE))==True:
        out_split = re.split('(\d+)',i)
        for j in out_split:
            output_text.append(j)
    else:
        output_text.append(i)
output_text = ' '.join(output_text)
Which gives this output:
output_text = 'cms 12345678 gleandaleacademy fee collection 00001234 abcd renewal 123 acgf 456789 '
This code also splits the last element of the text, 123acgf456789, because of the incorrect regex in re.match.
Please help me get the correct output.
You can use
re.sub(r'\b(?:([a-zA-Z]+)(\d+)|(\d+)([a-zA-Z]+))\b', r'\1\3 \2\4', text)
See the regex demo
Details
\b - word boundary
(?: - start of a non-capturing group (necessary for the word boundaries to be applied to all the alternatives):
([a-zA-Z]+)(\d+) - Group 1: one or more letters and Group 2: one or more digits
| - or
(\d+)([a-zA-Z]+) - Group 3: one or more digits and Group 4: one or more letters
) - end of the group
\b - word boundary
During the replacement, either \1 and \2 or \3 and \4 replacement backreferences are initialized, so concatenating them as \1\3 and \2\4 yields the right results.
See a Python demo:
import re
text = "Cms1291682971 Gleandaleacademy Fee Collecti 0000548Andb Renewal 402Ecfev845410001"
print( re.sub(r'\b(?:([a-zA-Z]+)(\d+)|(\d+)([a-zA-Z]+))\b', r'\1\3 \2\4', text) )
# => Cms 1291682971 Gleandaleacademy Fee Collecti 0000548 Andb Renewal 402Ecfev845410001

Finding Standardized Text Pattern In String

We are looking through a very large set of strings for standard number patterns in order to locate drawing sheet numbers. For example, valid sheet numbers are: A-101, A101, C-101, C102, E-101, A1, C1, A-100-A, etc.
They may be contained in a string such as "The sheet number is A-101 first floor plan"
The sheet number patterns always consist of similar sequences of character types (numbers, characters and separators (-, space, _)), and if we convert all valid numbers to a pattern indicating the character type (A-101=ASNNN, A101=ANNN, A1=AN, etc.), there are only ~100 valid patterns.
Our plan is to convert each character in the string to its character type and then search for a valid pattern. So the question is: what is the best way to search through "AAASAAAAASAAAAAASAASASNNNSAAAAASAAAAASAAAA" to find one of 100 valid character type patterns? We considered doing 100 text searches, one for each pattern, but it seems there could be a better way to find a candidate pattern and then check whether it is one of the 100 valid patterns.
Solution
Is this what you want?
import re
pattern_dict = {
    'S': r'[ _-]',
    'A': r'[A-Z]',
    'N': r'[0-9]',
}
patterns = [
    'ASNNN',
    'ANNN',
    'AN',
]
text = "A-1 A2 B-345 C678 D900 E80"
for pattern in patterns:
    converted = ''.join(pattern_dict[c] for c in pattern)
    print(pattern, re.findall(rf'\b{converted}\b', text))
output:
ASNNN ['B-345']
ANNN ['C678', 'D900']
AN ['A2']
Explanation
rf'some\b {string}': a combination of an r-string and an f-string.
r'some\b': a raw string. It prevents Python from processing backslash escapes, so it is the same as 'some\\b'.
f'{string}': a formatted string literal (f-string). Python 3.6+ supports this syntax. It is similar to '{}'.format(string).
So you can change rf'\b{converted}\b' to '\\b' + converted + '\\b'.
\b in regex: it matches a word boundary.
bookmark_strings = []
bookmark_strings.append("I-111 - INTERIOR FINISH PLAN & FINISH SCHEDULE")
bookmark_strings.append("M0.01 SCHEDULES & CALCULATIONS")
bookmark_strings.append("M-1 HVAC PLAN - OH Maple Heights PERMIT")
bookmark_strings.append("P-2 - PLUMBING DEMOLITION")
pattern_dict = {
    'S': r'[. _-]',
    'A': r'[A-Z]',
    'N': r'[0-9]',
}
patterns = [
    'ASNNN',
    'ANSNN',
    'ASN',
    'ANNN'
]
for bookmark in bookmark_strings:
    for pattern in patterns:
        converted = ''.join(pattern_dict[c] for c in pattern)
        if len(re.findall(rf'\b{converted}\b', bookmark)) > 0:
            print ("We found a match for pattern - {}, value = {} in bookmark {}".format(pattern, re.findall(rf'\b{converted}\b', bookmark) , bookmark))
Output:
We found a match for pattern - ASNNN, value = ['I-111'] in bookmark I-111 - INTERIOR FINISH PLAN & FINISH SCHEDULE
We found a match for pattern - ANSNN, value = ['M0.01'] in bookmark M0.01 SCHEDULES & CALCULATIONS
We found a match for pattern - ASN, value = ['M-1'] in bookmark M-1 HVAC PLAN - OH Maple Heights PERMIT
We found a match for pattern - ASN, value = ['P-2'] in bookmark P-2 - PLUMBING DEMOLITION
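Instead of running one search per pattern, the character-type patterns can also be joined into a single alternation and compiled once, which speaks to the "100 text searches" concern in the question. A minimal sketch under that assumption; the names to_regex and sheet_rx are made up here, and only the four patterns from the example above are used.

```python
import re

pattern_dict = {"S": r"[. _-]", "A": r"[A-Z]", "N": r"[0-9]"}
patterns = ["ASNNN", "ANSNN", "ASN", "ANNN"]

def to_regex(type_pattern):
    # Translate a character-type pattern such as "ASNNN" into a regex fragment.
    return "".join(pattern_dict[c] for c in type_pattern)

# Longest patterns first, so longer candidates are tried before shorter ones.
alternation = "|".join(to_regex(p) for p in sorted(patterns, key=len, reverse=True))
sheet_rx = re.compile(rf"\b(?:{alternation})\b")

bookmarks = [
    "I-111 - INTERIOR FINISH PLAN & FINISH SCHEDULE",
    "M0.01 SCHEDULES & CALCULATIONS",
    "M-1 HVAC PLAN - OH Maple Heights PERMIT",
    "P-2 - PLUMBING DEMOLITION",
]

results = []
for bookmark in bookmarks:
    m = sheet_rx.search(bookmark)
    results.append(m.group(0) if m else None)
print(results)
```

With around 100 patterns the alternation gets long, but it is still built and compiled only once and each string is then scanned by a single compiled regex.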
Use a regex:
import re
re.findall("[A-Z][-_ ]?[0-9]+",text)

Python - regular expressions - part of the search pattern is in same line, part is in the next one

I have 3 emails which have the following in the email body:
1st email
2nd email
3rd email
= means a new line. There are 3 cases:
Case 1
machine name is on the next line
Example
MACHINE: =
ldnmdsbatchxl01
Case 2
machine name is on the same line:
MACHINE: p2prog06
Case 3
Part of the machine is in the same line, part is in next line
MACHINE: p1prog=
07
The following works for the first 2 cases and partially for the 3rd: regex2 = r'\bMACHINE:\s*(?:=.*)?\s*([^<^\n ]+)'
In the 3rd case I'm getting p1prog=
Desired output:
p1prog07
ldnmdsbatchxl01
p2prog06
Thanks
if resp == 'OK':
    email_body = data[0][1].decode('utf-8')
    mail = email.message_from_string(email_body)
    #get all emails with words "PA1" or "PA2" in subject
    if mail["Subject"].find("PA1") > 0 or mail["Subject"].find("PA2") > 0:
        #search email body for job name (string after word "JOB")
        regex1 = r'(?<!^)JOB:\s*(\S+)'
        regex2 = r'\bMACHINE:\s*(?:=.*)?\s*([^<^\n ]+)|$'
        c=re.findall(regex2, email_body)[0]#,re.DOTALL)
        a=re.findall(regex1 ,email_body)
You may use
import re
email = 'MACHINE: =\nldnmdsbatchxl01\n\n\nMACHINE: p2prog06\n\n\nMACHINE: p1prog=^M\n07'
res = list(set([re.sub(r'=(?:\^M)?|[\r\n]+', '', x) for x in re.findall(r'\bMACHINE:\s*(.*(?:(?:\r\n?|\n)\S+)?)', email, re.M)]))
print(res)
# => ['ldnmdsbatchxl01', 'p2prog06', 'p1prog07'] (the order may vary, since a set is unordered)
See the Python demo
The regex used is \bMACHINE:\s*(.*(?:(?:\r\n?|\n)\S+)?):
\bMACHINE - whole word MACHINE
: - a : char
\s* - 0+ whitespaces
(.*(?:(?:\r\n?|\n)\S+)?) - Group 1 (this substring will be returned by re.findall):
.* - 0+ chars other than line break chars
(?:(?:\r\n?|\n)\S+)? - an optional substring:
(?:\r\n?|\n) - a CRLF, LF or CR line break sequence
\S+ - 1+ non-whitespace chars
The re.sub(r'=(?:\^M)?|[\r\n]+', '', x) removes = or =^M and CR/LF symbols from the Group 1 value.
The list(set(...)) call around the list comprehension keeps only the unique values.
Short answer:
regexp = re.compile(r'MACHINE:\s={0,1}\s{0,1}((\S+=\^M\s\S+|\S+))')
value = regexp.search(data)[1]
value = value.replace('=^M\n', '')
Long answer:
Assume we have data from your examples:
data = """
BFAILURE JOB: p2_batch_excel_quants_fx_daily_vol_check_0800 MACHINE: =
ldnmdsbatchxl01 EXITCODE: 268438455
(...)
RUNALARM JOB: p2_credit_qv_curve_snap MACHINE: p2prog06
Attachments:
(...)
[11/01/2019 08:15:09] CAUAJM_I_40245 EVENT: ALARM ALARM: JO=^M
BFAILURE JOB: p1_static_console_row_based_permissions MACHINE: p1prog=^M
07 EXITCODE: 1<br>^M
"""
Then we may use code:
import re
regexp = re.compile(r'MACHINE:\s={0,1}\s{0,1}((\S+=\^M\s\S+|\S+))')
for d in data.split("(...)"):
    value = regexp.search(d)[1]
    print(value.replace('=^M\n', ''))
As you can see, the regexp matches =^M\n too, so we need to remove it afterwards.
output:
ldnmdsbatchxl01
p2prog06
p1prog07
EDIT:
If your data contains many email bodies in one string:
import re
regexp = re.compile(r'MACHINE:\s={0,1}\s{0,1}((\S+=\^M\s\S+|\S+))')
matches = regexp.findall(data)
print(matches)
print('---')
for m in matches:
    print(m[0].replace('=^M\n', ''))
produces:
[('ldnmdsbatchxl01', 'ldnmdsbatchxl01'), ('p2prog06', 'p2prog06'), ('p1prog=^M\n07', 'p1prog=^M\n07')]
---
ldnmdsbatchxl01
p2prog06
p1prog07
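A trailing = at the end of a line looks like a quoted-printable soft line break. If that assumption holds for these emails, decoding the body first removes the break, and a much simpler regex is enough afterwards. Below is a minimal sketch using the standard-library quopri module; the raw sample is made up from the three cases in the question and uses plain \n line endings.

```python
import quopri
import re

# Hypothetical raw body built from the three cases in the question.
raw = (
    b"MACHINE: =\nldnmdsbatchxl01\n\n"
    b"MACHINE: p2prog06\n\n"
    b"MACHINE: p1prog=\n07\n"
)

# Decoding quoted-printable joins lines that end with a soft break ("=").
decoded = quopri.decodestring(raw).decode("utf-8")

# After decoding, the machine name is a single run of non-whitespace chars.
machines = re.findall(r"\bMACHINE:\s*(\S+)", decoded)
print(machines)
```

This is a different technique from the regex answers above, and it only applies if the message part really is quoted-printable encoded; message.get_payload(decode=True) from the email package performs the same decoding based on the part's Content-Transfer-Encoding header.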
