Python regex string groups capture

I have a number of medical reports, from each of which I am trying to capture 6 groups (groups 5 and 6 are optional):
(clinical details | clinical indication) + (text1) + (result|report) + (text2) + (interpretation|conclusion) + (text3).
The regex I am using is:
reportPat=re.compile(r'(Clinical details|indication)(.*?)(result|description|report)(.*?)(Interpretation|conclusion)(.*)',re.IGNORECASE|re.DOTALL)
This works, except that it fails on strings missing the optional groups. I have tried putting a question mark after group 5, like so: (Interpretation|conclusion)?(.*), but then the text for groups 5 and 6 just gets absorbed into group 4. I am pasting two conflicting strings (one containing groups 5/6 and the other without them) for people to test their regex against. Thanks for helping.
text 1 (all groups present)
Technical Report:\nAdministrations:\n1.04 ml of Fluorine 18, fluorodeoxyglucose with aco - Bronchus and lung\nJA - Staging\n\nClinical Details:\nSquamous cell lung cancer, histology confirmed ?stage\nResult:\nAn FDG scan was acquired from skull base to upper thighs together with a low dose CT scan for attenuation correction and image fusion. \n\nThere is a large mass noted in the left upper lobe proximally, with lower grade uptake within a collapsed left upper lobe. This lesi\n\nInterpretation: \nThe scan findings are in keeping with the known lung primary in the left upper lobe and involvement of the lymph nodes as dThere is no evidence of distant metastatic disease.
text 2 (without group 5 and 6)
Technical Report:\nAdministrations:\n0.81 ml of Fluorine 18, fluorodeoxyglucose with activity 312.79\nScanner: 3D Static\nPatient Position: Supine, Head First. Arms up\n\n\nDiagnosis Codes:\n- Bronchus and lung\nJA - Staging\n\nClinical Indication:\nNewly diagnosed primary lung cancer with cranial metastasis. PET scan to assess any further metastatic disease.\n\nScanner DST 3D\n\nSession 1 - \n\n.\n\nResult:\nAn FDG scan was acquired from skull base to upper thighs together with a low dose CT scan for attenuation correction and image fusion.\n\nThere is increased FDG uptake in the right lower lobe mass abutting the medial and posterior pleura with central necrosis (maximum SUV 18.2). small nodule at the right paracolic gutte

It seems that what you were missing is essentially an end-of-pattern anchor to rein in the greedy matches once you combine them with the optional presence of groups 5 and 6. This regexp should do the trick, maintaining your current group numbering:
reportPat=re.compile(
r'(Clinical details|indication)(.*)'
r'(result|description|report)(.*?)'
r'(?:(Interpretation|conclusion)(.*))?$',
re.IGNORECASE|re.DOTALL)
The changes are adding the $ at the end, and enclosing the last two groups in an optional non-capturing group, (?: ... )?. Also note how you can easily make the entire regexp more readable by splitting it across lines (Python concatenates adjacent string literals automatically).
Added: When reviewing the result of the matches I saw some :\n or : \n leftovers, which can easily be cleaned up by adding (?:[:\s]*)? between the header and text groups. This is an optional non-capturing group of colons and whitespace. Your regexp then looks like this:
reportPat=re.compile(
r'(Clinical details|indication)(?:[:\s]*)?(.*)'
r'(result|description|report)(?:[:\s]*)?(.*?)'
r'(?:(Interpretation|conclusion)(?:[:\s]*)?(.*))?$',
re.IGNORECASE|re.DOTALL)
Added 2: At this link, https://regex101.com/r/gU9eV7/3, you can see the regex in action. I've also added some unit test cases to verify that it works against both texts: for text1 the optional groups are matched, and for text2 they come back empty. I used this in parallel with direct testing in a Python script to verify my answer.
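For completeness, here is a minimal sketch of how the compiled pattern could be exercised from Python; text1 and text2 are assumed to be strings holding the two sample reports (with real newlines rather than the literal \n shown above):
import re

reportPat = re.compile(
    r'(Clinical details|indication)(?:[:\s]*)?(.*)'
    r'(result|description|report)(?:[:\s]*)?(.*?)'
    r'(?:(Interpretation|conclusion)(?:[:\s]*)?(.*))?$',
    re.IGNORECASE | re.DOTALL)

for text in (text1, text2):  # text1/text2: the two sample reports (assumed to be defined)
    m = reportPat.search(text)
    if m:
        # groups 5 and 6 come back as None when the Interpretation/conclusion block is absent
        print(m.group(1, 3, 5))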

The following pattern works for both your test cases, though given the format of the data you're having to parse, I wouldn't be confident that it will work for all cases (for example, I've added a : after each of the keyword matches to try to prevent inadvertent matches against more common words like result or description):
re.compile(
r'(Clinical details|indication):(.+?)(result|description|report):(.+?)((Interpretation|conclusion):(.+?)){0,1}\Z',
re.IGNORECASE|re.DOTALL
)
I grouped the last 2 groups and marked them as optional using {0,1}. This means the output groups will vary a little from your original pattern: you'll have an extra group, the 5th group now contains the combined output of the last two groups, and their individual data ends up in groups 6 and 7.
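As a quick illustration of that numbering (a sketch, again assuming text1 and text2 hold the two sample reports):
import re

pattern = re.compile(
    r'(Clinical details|indication):(.+?)(result|description|report):(.+?)'
    r'((Interpretation|conclusion):(.+?)){0,1}\Z',
    re.IGNORECASE | re.DOTALL)

for text in (text1, text2):
    m = pattern.search(text)
    if m:
        # group 5 wraps the whole optional tail; groups 6 and 7 hold its header and text.
        # All three are None for text2, which has no Interpretation/conclusion section.
        print(m.group(6), m.group(7))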

Related

How to reduce computational cost with Regex function

I am trying to use regex to extract sentences containing specific words, plus the sentences immediately before and after them. My code works, but it takes 20 seconds for each txt file and I have about a million txt files. Is it possible to get the same result in less time? Any other related suggestions are also welcome. Thanks!
My current thought is to extract paragraphs containing these target words first, then use nltk to tokenize the target paragraphs and extract the target sentences together with the sentences before and after them.
Here is my demo:
import re, nltk
txt = '''There is widespread agreement that screening for breast cancer, when combined with appropriate follow-up, will reduce mortality from the disease. How we measure response to treatment is called the 5-year survival rate, or the percentage of people who live 5 years after being diagnosed with cancer. According to information provided by National Cancer Institute, Cancer stage at diagnosis, which refers to extent of a cancer in the body, determines treatment options and has a strong influence on the length of survival. In general, if the cancer is found only in the part of the body where it started it is localized (sometimes referred to as stage 1). If it has spread to a different part of the body, the stage is regional or distant . The earlier female breast cancer is caught, the better chance a person has of surviving five years after being diagnosed. For female breast cancer, 62.5 are diagnosed at the local stage. The 5-year survival for localized female breast cancer is 98.8 . It decreases from 98.8 to 85.5 after the cancer has spread to the lymph nodes (stage 2), and to 27.4
(stage 4) after it has spread to other organs such as the lung, liver or brain. A major problem with current detection methods is that studies have shown that mammography does not detect 10 -20 of breast cancers that are detected by physical exam alone, which may be attributed to a falsely negative mammogram.
Breast cancer screening is generally recommended as a routine part of preventive healthcare for women over the age of 20 (approximately 90 million in the United States). Besides skin cancer, breast cancer is the most commonly diagnosed cancer among U.S. women. For these women, the American Cancer Society (ACS) has published guidelines for breast cancer screening including: (i) monthly breast self-examinations for all women over the age of 20; (ii) a clinical breast exam (CBE) every three years for women in their 20s and 30s; (iii) a baseline mammogram for women by the age of 40; and (iv) an annual mammogram for women age 40 or older (according to the American College of Radiology). Unfortunately, the U.S. Preventive Task Force Guidelines have stirred confusion by recommending biennial screening mammography for women ages 50-74.
Each year, approximately eight million women in the United States require diagnostic testing for breast cancer due to a physical symptom, such as a palpable lesion, pain or nipple discharge, discovered through self or physical examination (approximately seven million) or a non-palpable lesion detected by screening x-ray mammography
(approximately one million). Once a physician has identified a suspicious lesion in a woman's breast, the physician may recommend further diagnostic procedures, including a diagnostic x-ray mammography, an ultrasound study, a magnetic resonance imaging procedure, or a minimally invasive procedure such as fine needle aspiration or large core needle biopsy. In each case, the potential benefits of additional diagnostic testing must be balanced against the costs, risks and discomfort to the patient associated with undergoing the additional procedures.
'''
target_words = ['risks', 'discomfort', 'surviving', 'risking', 'risks', 'risky']
pattern = r'.*\b(?='+'|'.join(target_words) + r')\b.*'
target_paras = re.findall(pattern, txt, re.IGNORECASE)
# Function to extract sentences containing any target word and its neighbor sentences
def UncertaintySentences (paragraph):
    sent_token = nltk.tokenize.sent_tokenize(paragraph)
    keepsents = []
    for i, sentence in enumerate(sent_token):
        # sentences containing any target word
        if re.search(pattern, sentence, re.IGNORECASE) != None:
            try:
                if i==0: # first sentence in a para, keep it and the one next to it
                    keepsents.extend([sent_token[i], sent_token[i+1]])
                elif i!=len(sent_token)-1: # sentence in the middle, keep it and the ones before and next to it
                    keepsents.extend([sent_token[i-1], sent_token[i], sent_token[i+1]])
                else: # last sentence, keep it and the one before it
                    keepsents.extend([sent_token[i-1], sent_token[i]])
            except: # para with only one sentence
                keepsents = sent_token
    # drop duplicate sentences
    del_dup = []
    [del_dup.append(x) for x in keepsents if x not in del_dup]
    return(del_dup)

for para in target_paras:
    uncertn_sents = UncertaintySentences(para)
    print(uncertn_sents)
The final speed of your original regex is highly dependent on the data you are inspecting.
There's a problem with your regex:
r'.*\b(?='+'|'.join(target_words) + r')\b.*'
If there are many/big paragraphs with no keywords then the search process is very slow.
Why this happens?
Because your regex starts with .*
Your regex matches the whole paragraph and starts to backtrack characters one by one and tries to match the keywords while doing so. If there are no keywords at all, the backtracking process reaches the beginning of the paragraph.
Then, it advances one more character and repeats the whole process again (It reaches the end of string, backtracks to position 1), then advances to position 2 and repeats everything again...
You can better look at this process with this regex debugger:
https://regex101.com/r/boZLQU/1/debugger
Optimization
Just add an ^ to your regex, like this:
r'^.*\b(?='+'|'.join(target_words) + r')\b.*'
Note that we also need to use the M flag in order to make ^ behave as "beginning of line" instead of "beginning of string"
re.findall(pattern, txt, re.MULTILINE | re.IGNORECASE)
That way you'll just do the backtracking process one time instead of one for every character, which in the end should speed up the process a lot when searching through paragraphs that don't have any of the required keywords.
In terms of computational cost, the number of steps the regex engine needs drops dramatically.
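If you want to verify the difference yourself, here is a rough sketch with timeit, assuming target_words and txt are defined as in the question:
import re
import timeit

slow_pattern = r'.*\b(?=' + '|'.join(target_words) + r')\b.*'
fast_pattern = r'^.*\b(?=' + '|'.join(target_words) + r')\b.*'

# ^ needs re.MULTILINE so it anchors at the start of every line, not just the string
print(timeit.timeit(lambda: re.findall(slow_pattern, txt, re.IGNORECASE), number=100))
print(timeit.timeit(lambda: re.findall(fast_pattern, txt, re.MULTILINE | re.IGNORECASE), number=100))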
Here's a few ideas to optimize this code:
The target_words list can be converted to a set to make the in operation more efficient.
The pattern variable can be precompiled using re.compile to make the subsequent calls to re.findall and re.search faster.
The del_dup list comprehension can be replaced with a set() call to remove duplicates more efficiently.
Maybe move the sent_token = nltk.tokenize.sent_tokenize(paragraph) out of the loop of the UncertaintySentences function, so that the tokenization operation is only performed once per paragraph.
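Put together, those suggestions could look roughly like this (a sketch; I use an order-preserving dict.fromkeys for the de-duplication instead of a plain set() so the sentence order is kept):
import re
import nltk

target_words = {'risks', 'discomfort', 'surviving', 'risking', 'risky'}   # a set instead of a list
pattern = re.compile(r'.*\b(?=' + '|'.join(target_words) + r')\b.*', re.IGNORECASE)   # precompiled once

def UncertaintySentences(paragraph):
    sent_token = nltk.tokenize.sent_tokenize(paragraph)   # tokenize once per paragraph
    keepsents = []
    for i, sentence in enumerate(sent_token):
        if pattern.search(sentence):
            keepsents.extend(sent_token[max(i - 1, 0):i + 2])   # the sentence plus its neighbours
    return list(dict.fromkeys(keepsents))   # drop duplicates while keeping order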

Regex that matches all German and Austrian mobile phone numbers

I need a Python regex which matches to mobile phone numbers from Germany and Austria.
In order to do so, we first have to understand the structure of a phone number:
a mobile number can be written with a country calling code at the beginning; however, this code is optional!
if we use the country calling code, the trunk prefix is redundant (the leading 0 is dropped)!
The prefix is composed of the trunk prefix and the company code.
The prefix is followed by an individual and unique number with 7 or 8 digits, respectively.
List of German prefixes:
0151, 0160, 0170, 0171, 0175, 0152, 0162, 0172, 0173, 0174, 0155, 0157, 0159, 0163, 0176, 0177, 0178, 0179, 0164, 0168, 0169
List of Austrian prefixes:
0664, 0680, 0688, 0681, 0699, 0664, 0667, 0650, 0678, 0650, 0677, 0676, 0660, 0699, 0690, 0665, 0686, 0670
Now that we know all the rules to build a regex, we have to consider that humans sometimes write numbers in very strange ways, with multiple whitespace characters, / or (). For example:
0176 98 600 18 9
+49 17698600189
+(49) 17698600189
0176/98600189
0176 / 98600189
and many more ways to write the same number
I am looking for a Python regex which can match all Austrian and German mobile numbers.
What I have so far is this:
^(?:\+4[39]|004[39]|0|\+\(49\)|\(\+49\))\s?(?=(?:[^\d\n]*\d){10,11}(?!\d))(\()?[19][1567]\d{1,2}(?(1)\))\s?\d(?:[ /-]?\d)+
You can use
(?x)^ # Free spacing mode on and start of string
(?: # A container group:
(\+49|0049|\+\(49\)|\(\+49\))? [ ()\/-]* # German: country code
(?(1)|0)1(?:5[12579]|6[023489]|7[0-9]) # trunk prefix and company code
| # or
(\+43|0043|\+\(43\)|\(\+43\))? [ ()\/-]* # Austrian: country code
(?(2)|0)6(?:64|(?:50|6[0457]|7[0678]|8[0168]|9[09])) # trunk prefix and company code
)
[ ()\/-]* # zero or more spaces, parens, / and -
\d(?:[ \/-]*\d){6,7} # a digit and then six or seven occurrences of space, / or - and a digit
\s* # zero or more whites
$ # end of string
See the regex demo.
A one-line version of the pattern is
^(?:(\+49|0049|\+\(49\)|\(\+49\))?[ ()\/-]*(?(1)|0)1(?:5[12579]|6[023489]|7[0-9])|(\+43|0043|\+\(43\)|\(\+43\))?[ ()\/-]*(?(2)|0)6(?:64|(?:50|6[0457]|7[0678]|8[0168]|9[09])))[ ()\/-]*\d(?:[ \/-]*\d){6,7}\s*$
See this demo.
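A minimal sanity check of the one-line pattern in Python against a few of the sample inputs could look like this (the list of numbers is illustrative):
import re

phone_re = re.compile(
    r'^(?:(\+49|0049|\+\(49\)|\(\+49\))?[ ()\/-]*(?(1)|0)1(?:5[12579]|6[023489]|7[0-9])'
    r'|(\+43|0043|\+\(43\)|\(\+43\))?[ ()\/-]*(?(2)|0)6(?:64|(?:50|6[0457]|7[0678]|8[0168]|9[09])))'
    r'[ ()\/-]*\d(?:[ \/-]*\d){6,7}\s*$')

for number in ['0176 98 600 18 9', '+49 17698600189', '0176/98600189', '0664 1234567']:
    print(number, bool(phone_re.match(number)))   # all four should print True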
How to create company code regex
Go to the Optimize long lists of fixed string alternatives in regex answer
Click the Run code snippet button at the bottom of the answer to run the last code snippet
Re-size the input box if you wish
Get the list of your supported numbers, either comma- or linebreak-separated, and paste it into the field
Click the Generate button and grab the pattern that will appear below.
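If you would rather build that part of the pattern directly in Python instead of using the web tool, a naive sketch is to join the prefixes into a single alternation (this does not perform the trie-style optimization the linked answer does):
# Hypothetical helper: turn the prefix list into an alternation such as (?:176|151|...)
german_prefixes = ['0151', '0160', '0170', '0171', '0175', '0152', '0162', '0172', '0173',
                   '0174', '0155', '0157', '0159', '0163', '0176', '0177', '0178', '0179',
                   '0164', '0168', '0169']
company_codes = sorted({p.lstrip('0') for p in german_prefixes}, key=len, reverse=True)
alternation = '(?:' + '|'.join(company_codes) + ')'
print(alternation)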

Regular expression in Python, 2-3 numbers then 2 letters

I am trying to auto-detect bra sizes in a list of clothing descriptions. I have managed to extract only the bra items, and I am now looking at extracting the size information; I think I am almost there (thanks to the stackoverflow community). However, there is a particular case that I could not find in another post.
I am using:
regexp = re.compile(r' \d{2,3} ?[a-fA-F]([^bce-zBCE-Z]|$)')
So
Possible white space if not at the beginning of the description
two or three numbers
Another possible white space or not
A letter (lower or upper case) between A and F
and then another letter for the two special cases AA and FF, or the end of the string.
My question is: is there a way to force the second letter to match the first letter (AA or FF)? In my case, my code outputs some BA and CA sizes which do not exist.
Examples:
Not working:
"bh sexig top matchande h&m genomskinligt parti svart detaljer 42 basic plain" return "42 ba" instead of not found
"puma, sport-bh, strl: 34cd, svart/grå", I guess the customer meant c/d
Working fine:
"victoria's secret, bh, strl: 32c, gul/vit" returns "32 c"
"pink victorias secret bh 75dd burgundy" returns "75 dd"
Thanks!
You might use
\d{2,3} ?([a-fA-F])\1?(?![a-fA-F])
Explanation
\d{2,3} ? Match 2-3 digits followed by an optional space
([a-fA-F])\1? Capture a-fA-F in group 1 followed by an optional backreference to group 1
(?![a-fA-F]) Negative lookahead, assert what is on the right is not a-fA-F
Regex demo
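A small sketch applying that pattern to the strings from the question:
import re

size_re = re.compile(r'\d{2,3} ?([a-fA-F])\1?(?![a-fA-F])')

descriptions = [
    "bh sexig top matchande h&m genomskinligt parti svart detaljer 42 basic plain",
    "victoria's secret, bh, strl: 32c, gul/vit",
    "pink victorias secret bh 75dd burgundy",
]
for d in descriptions:
    m = size_re.search(d)
    print(m.group(0) if m else 'not found')   # not found, 32c, 75dd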

Use regex to extract unit number

I have a list of descriptions and I want to extract the unit information using regular expressions.
I watched a video on regex and here's what I got:
import re
x = ["Four 10-story towers - five 11-story residential towers around Lake Peterson - two 9-story hotel towers facing Devon Avenue & four levels of retail below the hotels",
"265 rental units",
"10 stories and contain 200 apartments",
"801 residential properties that include row homes, town homes, condos, single-family housing, apartments, and senior rental units",
"4-unit townhouse building (6,528 square feet of living space & 2,755 square feet of unheated garage)"]
unit = []
for item in x:
    extract = re.findall('[0-9]+.unit', item)
    unit.append(extract)
print(unit)
This works with strings ending in 'unit', but I also have strings ending with 'rental unit', 'apartment', 'bed', and others, as in this example.
I could do this with multiple regex, but is there a way to do this within one regex?
Thanks!
As long as you're not afraid of making a hideously long regex, you could use something to the effect of:
compiled_re = re.compile(r"(\d*)-unit|(\d*)\srental unit|(\d*)\sbed|(\d*)\sapartment")
unit = []
for item in x:
    extract = re.findall(compiled_re, item)
    unit.append(extract)
You would have to extend the regex pattern with a new "|" followed by a search pattern for each possible type of reference to unit numbers. Unfortunately, if there is very low consistency in the entries this approach would become basically unusable.
Also, might I suggest using a regex tester like Regex101. It really helps determining if your regex will do what you want it to.
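If you would rather keep a single capture group, one possible consolidation (a sketch, not from the answer above, and the keyword list is only illustrative) is:
import re

# Hypothetical single-group variant; extend the keyword alternation to suit your data.
unit_re = re.compile(r'(\d+)[\s-]*(?:rental\s+)?(?:units?|apartments?|beds?)', re.IGNORECASE)

unit = [unit_re.findall(item) for item in x]   # x is the list from the question
print(unit)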

How to match a string not containing two consecutive newlines

Demo at regex101. I have the following text file (a bibtex .bbl file):
\bibitem[{\textit{Alfonsi et~al.}(2011{\natexlab{a}})\textit{Alfonsi, Spogli,
De~Franceschi, Romano, Aquino, Dodson, and Mitchell}}]{alfonsi2011bcg}
Alfonsi, L., L.~Spogli, G.~De~Franceschi, V.~Romano, M.~Aquino, A.~Dodson, and
C.~N. Mitchell (2011{\natexlab{a}}), Bipolar climatology of {GPS} ionospheric
scintillation at solar minimum, \textit{Radio Science}, \textit{46}(3),
\doi{10.1029/2010RS004571}.
\bibitem[{\textit{Alfonsi et~al.}(2011{\natexlab{b}})\textit{Alfonsi, Spogli,
Tong, De~Franceschi, Romano, Bourdillon, Le~Huy, and
Mitchell}}]{alfonsi2011gsa}
Alfonsi, L., L.~Spogli, J.~Tong, G.~De~Franceschi, V.~Romano, A.~Bourdillon,
M.~Le~Huy, and C.~Mitchell (2011{\natexlab{b}}), {GPS} scintillation and
{TEC} gradients at equatorial latitudes in april 2006, \textit{Advances in
Space Research}, \textit{47}(10), 1750--1757,
\doi{10.1016/j.asr.2010.04.020}.
\bibitem[{\textit{Anghel et~al.}(2008)\textit{Anghel, Astilean, Letia, and
Komjathy}}]{anghel2008nrm}
Anghel, A., A.~Astilean, T.~Letia, and A.~Komjathy (2008), Near real-time
monitoring of the ionosphere using dual frequency {GPS} data in a kalman
filter approach, in \textit{{IEEE} International Conference on Automation,
Quality and Testing, Robotics, 2008. {AQTR} 2008}, vol.~2, pp. 54--58,
\doi{10.1109/AQTR.2008.4588793}.
\bibitem[{\textit{Baker and Wing}(1989)}]{baker1989nmc}
Baker, K.~B., and S.~Wing (1989), A new magnetic coordinate system for
conjugate studies at high latitudes, \textit{Journal of Geophysical Research:
Space Physics}, \textit{94}(A7), 9139--9143, \doi{10.1029/JA094iA07p09139}.
I want to match the whole \bibitem command for a single entry (with some capture groups) if I know the reference code at the end of the command. I use this regex, which works for the first entry, but not for the rest (second entry exemplified below):
\\bibitem\[{(.*?)\((.*?)\)(.*?)}\]{alfonsi2011gsa}
This doesn't work, since it matches everything from the start of the first \bibitem command to the end of the second \bibitem command. How can I match only the second \bibitem command? I have tried using a negative lookahead for ^$ and \n\n, but I couldn't get either to work - basically, I want the third (.*?) to match any string not including two consecutive newlines. (If there's any other way to do this, I'm all ears.)
You can use negative look-arounds (?!) to prevent the match from having multiple occurrences of 'bibitem'. With this, the match will start with the 'bibitem' which immediately precedes your reference code. This seems to work:
\\bibitem\[{(((?!bibitem).)*?)\((((?!bibitem).)*?)\)(((?!bibitem).)*?)}\]{alfonsi2011gsa}
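In Python that could be used roughly like this, assuming bbl holds the file contents; re.DOTALL is needed so the groups can span the wrapped lines:
import re

bib_re = re.compile(
    r'\\bibitem\[{(((?!bibitem).)*?)\((((?!bibitem).)*?)\)(((?!bibitem).)*?)}\]{alfonsi2011gsa}',
    re.DOTALL)

m = bib_re.search(bbl)   # bbl: the .bbl text shown above (assumed to be loaded already)
if m:
    print(m.group(1, 3, 5))   # the three outer capture groups; 2, 4 and 6 are the tempered-dot helpers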
Regex is not my strong point, but this will get all the content you want without reading the whole file into memory at once:
from itertools import groupby
import re
with open("file.txt") as f:
r = re.compile(r"\[{(.*?)\((.*?)\)(.*?)}\]\{alfonsi2011gsa\}")
for k, v in groupby(map(str.strip, f), key=lambda x: bool(x.strip())):
match = r.search("".join(v))
if match:
print(match.groups())
('\\textit{Alfonsi et~al.}', '2011{\\natexlab{b}}', '\\textit{Alfonsi, Spogli,Tong, De~Franceschi, Romano, Bourdillon, Le~Huy, andMitchell}')
