How to reduce computational cost with Regex function - python

I am trying to use regex to extract sentences containing specific words, along with the sentences immediately before and after them. My code works, but it takes 20 seconds per txt file and I have about a million txt files. Is it possible to get the same result in less time? Any other related suggestions are also welcome. Thanks!
My current approach is to first extract the paragraphs containing the target words, then use nltk to tokenize those paragraphs and extract the target sentences together with the sentences before and after them.
Here is my demo:
import re, nltk
txt = '''There is widespread agreement that screening for breast cancer, when combined with appropriate follow-up, will reduce mortality from the disease. How we measure response to treatment is called the 5-year survival rate, or the percentage of people who live 5 years after being diagnosed with cancer. According to information provided by National Cancer Institute, Cancer stage at diagnosis, which refers to extent of a cancer in the body, determines treatment options and has a strong influence on the length of survival. In general, if the cancer is found only in the part of the body where it started it is localized (sometimes referred to as stage 1). If it has spread to a different part of the body, the stage is regional or distant . The earlier female breast cancer is caught, the better chance a person has of surviving five years after being diagnosed. For female breast cancer, 62.5 are diagnosed at the local stage. The 5-year survival for localized female breast cancer is 98.8 . It decreases from 98.8 to 85.5 after the cancer has spread to the lymph nodes (stage 2), and to 27.4
(stage 4) after it has spread to other organs such as the lung, liver or brain. A major problem with current detection methods is that studies have shown that mammography does not detect 10 -20 of breast cancers that are detected by physical exam alone, which may be attributed to a falsely negative mammogram.
Breast cancer screening is generally recommended as a routine part of preventive healthcare for women over the age of 20 (approximately 90 million in the United States). Besides skin cancer, breast cancer is the most commonly diagnosed cancer among U.S. women. For these women, the American Cancer Society (ACS) has published guidelines for breast cancer screening including: (i) monthly breast self-examinations for all women over the age of 20; (ii) a clinical breast exam (CBE) every three years for women in their 20s and 30s; (iii) a baseline mammogram for women by the age of 40; and (iv) an annual mammogram for women age 40 or older (according to the American College of Radiology). Unfortunately, the U.S. Preventive Task Force Guidelines have stirred confusion by recommending biennial screening mammography for women ages 50-74.
Each year, approximately eight million women in the United States require diagnostic testing for breast cancer due to a physical symptom, such as a palpable lesion, pain or nipple discharge, discovered through self or physical examination (approximately seven million) or a non-palpable lesion detected by screening x-ray mammography
(approximately one million). Once a physician has identified a suspicious lesion in a woman's breast, the physician may recommend further diagnostic procedures, including a diagnostic x-ray mammography, an ultrasound study, a magnetic resonance imaging procedure, or a minimally invasive procedure such as fine needle aspiration or large core needle biopsy. In each case, the potential benefits of additional diagnostic testing must be balanced against the costs, risks and discomfort to the patient associated with undergoing the additional procedures.
'''
target_words = ['risks', 'discomfort', 'surviving', 'risking', 'risks', 'risky']
pattern = r'.*\b(?='+'|'.join(target_words) + r')\b.*'
target_paras = re.findall(pattern, txt, re.IGNORECASE)
# Function to extract sentences containing any target word and its neighbor sentences
def UncertaintySentences (paragraph):
    sent_token = nltk.tokenize.sent_tokenize(paragraph)
    keepsents = []
    for i, sentence in enumerate(sent_token):
        # sentences contain any target word
        if re.search(pattern, sentence, re.IGNORECASE) != None:
            try:
                if i==0: # first sentence in a para, keep it and the one next to it
                    keepsents.extend([sent_token[i], sent_token[i+1]])
                elif i!=len(sent_token)-1: # sentence in the middle, keep it and the ones before and next to it
                    keepsents.extend([sent_token[i-1], sent_token[i], sent_token[i+1]])
                else: # last sentence, keep it and the one before it
                    keepsents.extend([sent_token[i-1], sent_token[i]])
            except: # para with only one sentence
                keepsents = sent_token
    # drop duplicate sentences
    del_dup = []
    [del_dup.append(x) for x in keepsents if x not in del_dup]
    return(del_dup)
for para in target_paras:
    uncertn_sents = UncertaintySentences(para)
    print(uncertn_sents)

The final speed of your original regex is highly dependent on the data you are inspecting.
There's a problem with your regex:
r'.*\b(?='+'|'.join(target_words) + r')\b.*'
If there are many or large paragraphs with no keywords, the search process is very slow.
Why does this happen?
Because your regex starts with .*
Your regex first matches the whole paragraph, then backtracks one character at a time, trying to match the keywords at each step. If there are no keywords at all, the backtracking reaches the beginning of the paragraph.
Then it advances one character and repeats the whole process (it reaches the end of the string, then backtracks to position 1), then advances to position 2 and repeats everything again...
You can step through this process in this regex debugger:
https://regex101.com/r/boZLQU/1/debugger
Optimization
Just add an ^ to your regex, like this:
r'^.*\b(?='+'|'.join(target_words) + r')\b.*'
Note that we also need to use the M flag in order to make ^ behave as "beginning of line" instead of "beginning of string"
re.findall(pattern, txt, re.MULTILINE | re.IGNORECASE)
That way you'll just do the backtracking process one time instead of one for every character, which in the end should speed up the process a lot when searching through paragraphs that don't have any of the required keywords.
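In terms of computational cost, a paragraph with no keywords goes from roughly quadratic time in its length (one full backtracking pass per starting position) to roughly linear time (a single pass).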
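As a rough illustration of the anchored pattern, here is a minimal sketch that compiles it once and runs it over a short made-up sample (the sample text and the name para_re are assumptions for the example, not part of the question's data):

```python
import re

target_words = ['risks', 'discomfort', 'surviving', 'risking', 'risky']

# Same pattern as above, anchored with ^ and compiled once with both flags.
para_re = re.compile(
    r'^.*\b(?=' + '|'.join(target_words) + r')\b.*',
    re.MULTILINE | re.IGNORECASE,
)

# Made-up sample: three one-line "paragraphs", two of which contain a keyword.
sample = (
    "Screening reduces mortality. It is widely recommended.\n"
    "Early detection gives a better chance of surviving five years.\n"
    "Benefits must be balanced against the costs, Risks and discomfort.\n"
)

target_paras = para_re.findall(sample)
print(len(target_paras))
for para in target_paras:
    print(para)
```

Because . does not match a newline, each match is exactly one line, so paragraphs without a keyword are rejected after a single pass.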

Here are a few ideas to optimize this code:
The target_words list can be converted to a set, which drops the duplicate 'risks' entry and makes any in membership test more efficient.
The pattern variable can be precompiled using re.compile to make the subsequent calls to re.findall and re.search faster.
The del_dup list comprehension can be replaced with a set() call to remove duplicates more efficiently, although a set does not preserve sentence order.
Keep sent_token = nltk.tokenize.sent_tokenize(paragraph) outside the loop of the UncertaintySentences function, so that the tokenization is only performed once per paragraph.
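One way these ideas might fit together is sketched below. It assumes the sentences have already been tokenized (for example by nltk.tokenize.sent_tokenize, as in the question), and the names word_re and keep_with_neighbours are hypothetical. Instead of collecting sentences and removing duplicates afterwards, it collects sentence indices in a set, so each sentence is kept at most once and the original order is preserved:

```python
import re

target_words = ['risks', 'discomfort', 'surviving', 'risking', 'risky']

# Compiled once and reused for every sentence.
word_re = re.compile(r'\b(?:' + '|'.join(target_words) + r')\b', re.IGNORECASE)

def keep_with_neighbours(sentences):
    """Return sentences containing a target word plus their immediate neighbours, in order."""
    keep = set()
    for i, sentence in enumerate(sentences):
        if word_re.search(sentence):
            # Clamp the window so the first and last sentences need no special case.
            keep.update(range(max(i - 1, 0), min(i + 2, len(sentences))))
    return [sentences[i] for i in sorted(keep)]

# Made-up, already tokenized sentences.
sents = ["A first.", "Nothing here.", "The risks are real.",
         "After it.", "Far away.", "Also far."]
result = keep_with_neighbours(sents)
print(result)
```

Note that this removes duplicates by position rather than by text, so two identical sentences at different positions would both be kept, unlike the del_dup step in the question.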

Related

How to match complete words for acronym using regex?

I want to only get complete words from acronyms with ( ) around them.
For example, there is a sentence
'Lung cancer screening (LCS) reduces NSCLC mortality';
->I want to get 'Lung cancer screening' as a result.
How can I do it with regex?
original question:
I want to remove runs of uppercase letters:
"HIV acquired immunodeficiency syndrome are at a particularly high risk of cervical cancer" => " acquired immunodeficiency syndrome are at a particularly high risk of cervical cancer"
Assuming you want to target 2 or more capital letters, I would use re.sub here:
import re
inp = "Lung cancer screening (LCS) reduces NSCLC mortality"
output = re.sub(r'\s*(?:\([A-Z]+\)|[A-Z]{2,})\s*', ' ', inp).strip()
print(output) # Lung cancer screening reduces mortality
import re
s = 'HIV acquired immunodeficiency syndrome are at a particularly high risk of cervical cancer'
print(re.sub(r'([A-Z])', lambda pat:'', s).strip()) # Inline
according to #jensgram answer
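For the question as it is currently phrased (recovering the full words that a parenthesised acronym stands for), one possible approach is sketched below. It assumes that each letter of the acronym is the initial of one of the words immediately before the parentheses; the function name expand_acronyms is hypothetical:

```python
import re

def expand_acronyms(text):
    """Map each '(ACRONYM)' to the preceding words whose initials spell it."""
    found = {}
    for match in re.finditer(r'\(([A-Z]{2,})\)', text):
        acronym = match.group(1)
        # Take as many words before the parenthesis as the acronym has letters.
        words = text[:match.start()].split()
        candidate = words[-len(acronym):]
        initials = ''.join(word[0].upper() for word in candidate)
        if initials == acronym:
            found[acronym] = ' '.join(candidate)
    return found

expansions = expand_acronyms('Lung cancer screening (LCS) reduces NSCLC mortality')
print(expansions)
```

Acronyms that skip small words (such as "of") or use letters from inside a word would need a looser check than this one.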

matching content creating new column

Hello, I have a dataset where I want to match my keywords against the location. The problem is that the locations "Afghanistan", "Kabul" or "Helmund" appear in my dataset in over 150 combinations, including spelling mistakes, different capitalization, and the city or town attached to the name. I want to create a separate column that contains 1 if any of the strings "afg", "Afg", "kab" or "helm" is contained in the location. I am not sure whether upper or lower case makes a difference.
For instance there are hundreds of location combinations like so: Jegdalak, Afghanistan, Afghanistan,Ghazni♥, Kabul/Afghanistan,
I have tried this code and it is good if it matches the phrase exactly but there is too much variation to write every exception down
keywords= ['Afghanistan','Kabul','Herat','Jalalabad','Kandahar','Mazar-i-Sharif', 'Kunduz', 'Lashkargah', 'mazar', 'afghanistan','kabul','herat','jalalabad','kandahar']
#how to make a column that shows rows with a certain keyword..
def keyword_solution(value):
    strings = value.split()
    if any(word in strings for word in keywords):
        return 1
    else:
        return 0
taleban_2['keyword_solution'] = taleban_2['location'].apply(keyword_solution)
# below will return the 1 values
taleban_2[taleban_2['keyword_solution'].isin([1])].head(5)
Just need to replace this logic where all results will be put into column "keyword_solution" that matches either "Afg" or "afg" or "kab" or "Kab" or "kund" or "Kund"
Given the following:
Sentences from the New York Times
Remove all non-alphanumeric characters
Change everything to lowercase, thereby removing the need for different word variations
Split the sentence into a list or set. I used set because of the long sentences.
Add to the keywords list as needed
Matching words from two lists
'afgh' in ['afghanistan']: False
'afgh' in 'afghanistan': True
Therefore, the list comprehension searches for each keyword, in each word of word_list.
[True if word in y else False for y in x for word in keywords]
This allows the list of keywords to be shorter (i.e. given afgh, afghanistan is not required)
import re
import pandas as pd
keywords= ['jalalabad',
'kunduz',
'lashkargah',
'mazar',
'herat',
'mazar',
'afgh',
'kab',
'kand']
df = pd.DataFrame({'sentences': ['The Taliban have wanted the United States to pull troops out of Afghanistan Turkey has wanted the Americans out of northern Syria and North Korea has wanted them to at least stop military exercises with South Korea.',
'President Trump has now to some extent at least obliged all three — but without getting much of anything in return. The self-styled dealmaker has given up the leverage of the United States’ military presence in multiple places around the world without negotiating concessions from those cheering for American forces to leave.',
'For a president who has repeatedly promised to get America out of foreign wars, the decisions reflect a broader conviction that bringing troops home — or at least moving them out of hot spots — is more important than haggling for advantage. In his view, decades of overseas military adventurism has only cost the country enormous blood and treasure, and waiting for deals would prolong a national disaster.',
'The top American commander in Afghanistan, Gen. Austin S. Miller, said Monday that the size of the force in the country had dropped by 2,000 over the last year, down to somewhere between 13,000 and 12,000.',
'“The U.S. follows its interests everywhere, and once it doesn’t reach those interests, it leaves the area,” Khairullah Khairkhwa, a senior Taliban negotiator, said in an interview posted on the group’s website recently. “The best example of that is the abandoning of the Kurds in Syria. It’s clear the Kabul administration will face the same fate.”',
'afghan']})
# substitute non-alphanumeric characters
df['sentences'] = df['sentences'].apply(lambda x: re.sub(r'[\W_]+', ' ', x))
# create a new column with a list of all the words
df['word_list'] = df['sentences'].apply(lambda x: set(x.lower().split()))
# check the list against the keywords
df['location'] = df.word_list.apply(lambda x: any([True if word in y else False for y in x for word in keywords]))
# final
print(df.location)
0 True
1 False
2 False
3 True
4 True
5 True
Name: location, dtype: bool
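Applied to the location column from the question, the same substring idea could be written as a drop-in replacement for keyword_solution. This is only a sketch: the stem list is taken from the question, and the names STEMS and STEM_RE are assumptions for the example.

```python
import re

# Stems from the question; extend the list as needed.
STEMS = ['afg', 'kab', 'helm', 'kund']
STEM_RE = re.compile('|'.join(map(re.escape, STEMS)), re.IGNORECASE)

def keyword_solution(value):
    """Return 1 if any stem occurs anywhere in value (case-insensitive), else 0."""
    if not isinstance(value, str):  # e.g. NaN for a missing location
        return 0
    return 1 if STEM_RE.search(value) else 0

# Made-up locations in the style of the question.
locations = ['Jegdalak, Afghanistan', 'Afghanistan,Ghazni', 'Kabul/Afghanistan',
             'KUNDUZ', 'Paris']
flags = [keyword_solution(loc) for loc in locations]
print(flags)
```

With pandas, the function can be used exactly as in the question, via taleban_2['location'].apply(keyword_solution). Because the match is case-insensitive and not tied to word boundaries, spelling variants that still contain a stem are caught without listing every combination.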

How can I remove numbers that may occur at the end of words in a text

I have text data to be cleaned using regex. However, some words in the text are immediately followed by numbers which I want to remove.
For example, one row of the text is:
Preface2 Contributors4 Abrreviations5 Acknowledgements8 Pes
terminology10 Lessons learnt from the RUPES project12 Payment for
environmental service and it potential and example in Vietnam16
Chapter Integrating payment for ecosystem service into Vietnams policy
and programmes17 Chapter Creating incentive for Tri An watershed
protection20 Chapter Sustainable financing for landscape beauty in
Bach Ma National Park 24 Chapter Building payment mechanism for carbon
sequestration in forestry a pilot project in Cao Phong district of Hoa
Binh province Vietnam26 Chapter 5 Local revenue sharing Nha Trang Bay
Marine Protected Area Vietnam28 Synthesis and Recommendations30
References32
The first word in the above text should be 'preface' instead of 'preface2' and so on.
line = re.sub(r"[A-Za-z]+(\d+)", "", line)
This, however, removes the words as well, as seen here:
Pes Lessons learnt from the RUPES Payment for environmental service
and it potential and example in Chapter Integrating payment for
ecosystem service into Vietnams policy and Chapter Creating incentive
for Tri An watershed Chapter Sustainable financing for landscape
beauty in Bach Ma National Park 24 Chapter Building payment mechanism
for carbon sequestration in forestry a pilot project in Cao Phong
district of Hoa Binh province Chapter 5 Local revenue sharing Nha
Trang Bay Marine Protected Area Synthesis and
How can I capture only the numbers that immediately follow words?
You can capture the text part and replace the whole match with that captured part. The substitution is simply:
re.sub(r"([A-Za-z]+)\d+", r"\1", line)
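A short check of this substitution on an excerpt of the sample text (the excerpt is shortened for the example) might look like this:

```python
import re

# Shortened excerpt of the text from the question.
line = "Preface2 Contributors4 Bach Ma National Park 24 Chapter 5 Vietnam28"

# Keep the letters (group 1) and drop the digits that directly follow them.
cleaned = re.sub(r"([A-Za-z]+)\d+", r"\1", line)
print(cleaned)
```

Standalone numbers such as "24" and "5" are untouched, because the pattern requires letters immediately before the digits.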
You could try a lookbehind assertion to check for a word character before your numbers, and a word boundary (\b) at the end to force the regex to only match numbers at the end of a word. Python's re module requires lookbehinds to be fixed-width, so the assertion checks a single character:
re.sub(r'(?<=\w)\d+\b', '', line)
Hope this helps
EDIT:
Sorry about the glitch mentioned in the comments, where numbers that are NOT preceded by words are matched as well. That is because (sorry again) \w matches alphanumeric characters instead of only alphabetic ones. Depending on what you would like to delete, you can use the positive version
re.sub(r'(?<=[a-zA-Z])\d+\b', '', line)
to only check for English alphabetic characters (you can add characters to the [a-zA-Z] list) preceding your number, or the negative version
re.sub(r'(?<![\d\s])\d+\b', '', line)
to match anything that is NOT \d (a digit) or \s (whitespace) before your desired number. This will also match after punctuation marks, though.
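To make the difference between the variants concrete, here is a small sketch on a made-up sample string. The first variant uses a single-character lookbehind, (?<=\w), since Python's re only accepts fixed-width lookbehinds:

```python
import re

sample = "Preface2 Park 24 Vietnam28 p.7"

# Lookbehind on \w: digits count as \w, so "24" loses its last digit.
any_word_char = re.sub(r'(?<=\w)\d+\b', '', sample)

# Lookbehind on ASCII letters only: standalone numbers are left alone.
letters_only = re.sub(r'(?<=[a-zA-Z])\d+\b', '', sample)

# Negative lookbehind: anything except a digit or whitespace may precede,
# so the "7" after the full stop is removed as well.
not_digit_or_space = re.sub(r'(?<![\d\s])\d+\b', '', sample)

print(any_word_char)
print(letters_only)
print(not_digit_or_space)
```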
Try this:
line = re.sub(r"([A-Za-z]+)(\d+)", "\\2", line) #just keep the number
line = re.sub(r"([A-Za-z]+)(\d+)", "\\1", line) #just keep the word
line = re.sub(r"([A-Za-z]+)(\d+)", r"\2", line) #same as first one
line = re.sub(r"([A-Za-z]+)(\d+)", r"\1", line) #same as second one
\\1 refers to the word (the first captured group) and \\2 to the number (the second). See: How to use python regex to replace using captured group?
Below is a working sample of code that might solve your problem.
Here's the snippet:
import re
# A function that takes the test data as input and returns the
# desired result as stated in your question.
def transform(data):
    """Strip the trailing number from every word that ends with one."""
    # First, let's construct a pattern matching the words we're looking for
    pattern1 = r"([A-Za-z]+\d+)"
    # Let's construct another pattern matching the trailing digits to remove
    # from each matched word.
    pattern2 = r"\d+$"
    # Find all matching words
    matches = re.findall(pattern1, data)
    # Build the replacement for each word
    replacements = []
    for match in matches:
        replacements.append(re.sub(pattern2, '', match))
    # Pair each word with its replacement for use with str.replace
    changers = zip(matches, replacements)
    # Apply each replacement in turn; str.replace returns a new string,
    # so the result must be assigned back.
    output = data
    for changer in changers:
        output = output.replace(*changer)
    # The work is done, we can return the result
    return output
For test purpose, we run the above function with your test data:
data = """
Preface2 Contributors4 Abrreviations5 Acknowledgements8 Pes terminology10 Lessons
learnt from the RUPES project12 Payment for environmental service and it potential and
example in Vietnam16 Chapter Integrating payment for ecosystem service into Vietnams
policy and programmes17 Chapter Creating incentive for Tri An watershed protection20
Chapter Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter
Building payment mechanism for carbon sequestration in forestry a pilot project in Cao
Phong district of Hoa Binh province Vietnam26 Chapter 5 Local revenue sharing Nha Trang
Bay Marine Protected Area Vietnam28 Synthesis and Recommendations30 References32
"""
result = transform(data)
print(result)
And the result looks like this:
Preface Contributors Abrreviations Acknowledgements Pes terminology Lessons learnt from
the RUPES project Payment for environmental service and it potential and example in
Vietnam Chapter Integrating payment for ecosystem service into Vietnams policy and
programmes Chapter Creating incentive for Tri An watershed protection Chapter
Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter Building
payment mechanism for carbon sequestration in forestry a pilot project in Cao Phong
district of Hoa Binh province Vietnam Chapter 5 Local revenue sharing Nha Trang Bay
Marine Protected Area Vietnam Synthesis and Recommendations References
You can also use a character range for the digits. Note that this removes every digit in the text, including standalone numbers such as 24:
re.sub(r"[0-9]", "", line)

Python regex string groups capture

I have a number of medical reports, from each of which I am trying to capture 6 groups (groups 5 and 6 are optional):
(clinical details | clinical indication) + (text1) + (result|report) + (text2) + (interpretation|conclusion) + (text3).
The regex I am using is:
reportPat=re.compile(r'(Clinical details|indication)(.*?)(result|description|report)(.*?)(Interpretation|conclusion)(.*)',re.IGNORECASE|re.DOTALL)
It works, except that it fails on strings missing the optional groups. I have tried putting a question mark after group 5 like so: (Interpretation|conclusion)?(.*) but then this group gets merged into group 4. I am pasting two conflicting strings (one containing groups 5 and 6 and the other without them) for people to test their regex. Thanks for helping.
text 1 (all groups present)
Technical Report:\nAdministrations:\n1.04 ml of Fluorine 18, fluorodeoxyglucose with aco - Bronchus and lung\nJA - Staging\n\nClinical Details:\nSquamous cell lung cancer, histology confirmed ?stage\nResult:\nAn FDG scan was acquired from skull base to upper thighs together with a low dose CT scan for attenuation correction and image fusion. \n\nThere is a large mass noted in the left upper lobe proximally, with lower grade uptake within a collapsed left upper lobe. This lesi\n\nInterpretation: \nThe scan findings are in keeping with the known lung primary in the left upper lobe and involvement of the lymph nodes as dThere is no evidence of distant metastatic disease.
text 2 (without group 5 and 6)
Technical Report:\nAdministrations:\n0.81 ml of Fluorine 18, fluorodeoxyglucose with activity 312.79\nScanner: 3D Static\nPatient Position: Supine, Head First. Arms up\n\n\nDiagnosis Codes:\n- Bronchus and lung\nJA - Staging\n\nClinical Indication:\nNewly diagnosed primary lung cancer with cranial metastasis. PET scan to assess any further metastatic disease.\n\nScanner DST 3D\n\nSession 1 - \n\n.\n\nResult:\nAn FDG scan was acquired from skull base to upper thighs together with a low dose CT scan for attenuation correction and image fusion.\n\nThere is increased FDG uptake in the right lower lobe mass abutting the medial and posterior pleura with central necrosis (maximum SUV 18.2). small nodule at the right paracolic gutte
It seems that what you were missing is an end-of-pattern anchor, which forces the lazy match to keep going when groups 5 and 6 are optional. This regexp should do the trick, maintaining your current group numbering:
reportPat=re.compile(
r'(Clinical details|indication)(.*)'
r'(result|description|report)(.*?)'
r'(?:(Interpretation|conclusion)(.*))?$',
re.IGNORECASE|re.DOTALL)
The changes are adding $ at the end and enclosing the last two groups in an optional non-capturing group, (?: ... )?. Also note that you can make the regexp more readable by splitting it across lines; Python concatenates adjacent string literals automatically.
Added: When reviewing the result of the matches I saw some :\n or : \n, which can easily be cleaned up by adding (?:[:\s]*)? in between the header and text groups. This is an optional non-capturing group of colons and whitespace. Your regexp then looks like this:
reportPat=re.compile(
r'(Clinical details|indication)(?:[:\s]*)?(.*)'
r'(result|description|report)(?:[:\s]*)?(.*?)'
r'(?:(Interpretation|conclusion)(?:[:\s]*)?(.*))?$',
re.IGNORECASE|re.DOTALL)
Added 2: At this link: https://regex101.com/r/gU9eV7/3, you can see the regex in action. I've also added some unit test cases to verify that it works against both texts, and that the optional group has a match for text1 and nothing for text2. I used this in parallel with direct editing in a Python script to verify my answer.
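The same check can be run locally. The sketch below uses two shortened, made-up report strings in place of the full texts from the question (the names report_pat, text_a and text_b are assumptions for the example):

```python
import re

report_pat = re.compile(
    r'(Clinical details|indication)(?:[:\s]*)?(.*)'
    r'(result|description|report)(?:[:\s]*)?(.*?)'
    r'(?:(Interpretation|conclusion)(?:[:\s]*)?(.*))?$',
    re.IGNORECASE | re.DOTALL)

# Shortened stand-ins: text_a has all six groups, text_b lacks groups 5 and 6.
text_a = ("Clinical Details:\nLung cancer\nResult:\nLarge mass noted.\n\n"
          "Interpretation:\nNo distant disease.")
text_b = ("Clinical Indication:\nNew primary lung cancer\nResult:\n"
          "Increased uptake in right lower lobe.")

match_a = report_pat.search(text_a)
match_b = report_pat.search(text_b)

print(match_a.groups())
print(match_b.groups())
```

When the optional part is absent, groups 5 and 6 are None rather than empty strings, so downstream code should check for None before using them.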
The following pattern works for both your test cases though given the format of the data you're having to parse I wouldn't be confident that the pattern will work for all cases (for example I've added : after each of the keyword matches to try to prevent inadvertent matches against more common words like result or description):
re.compile(
r'(Clinical details|indication):(.+?)(result|description|report):(.+?)((Interpretation|conclusion):(.+?)){0,1}\Z',
re.IGNORECASE|re.DOTALL
)
I wrapped the last two groups in an outer group and marked it as optional using {0,1}. This means the output groups differ slightly from your original pattern: there is one extra group. The 5th group (index 4 in match.groups()) now contains the combined text of the last two groups, and their individual values are in groups 6 and 7 (indices 5 and 6).

Use regex to extract unit number

I have a list of descriptions and I want to extract the unit information using regular expression
I watched a video on regex and here's what I got
import re
x = ["Four 10-story towers - five 11-story residential towers around Lake Peterson - two 9-story hotel towers facing Devon Avenue & four levels of retail below the hotels",
"265 rental units",
"10 stories and contain 200 apartments",
"801 residential properties that include row homes, town homes, condos, single-family housing, apartments, and senior rental units",
"4-unit townhouse building (6,528 square feet of living space & 2,755 square feet of unheated garage)"]
unit=[]
for item in x:
    extract = re.findall('[0-9]+.unit',item)
    unit.append(extract)
print(unit)
This works for strings that end in 'unit', but I also have strings that end with 'rental unit', 'apartment', 'bed' and others, as in this example.
I could do this with multiple regexes, but is there a way to do it within one regex?
Thanks!
As long as you're not afraid of making a hideously long regex, you could use something along the lines of:
compiled_re = re.compile(r"(\d+)-unit|(\d+)\srental unit|(\d+)\sbed|(\d+)\sapartment")
unit = []
for item in x:
    extract = re.findall(compiled_re, item)
    unit.append(extract)
You would have to extend the regex pattern with a new "|" followed by a search pattern for each possible type of reference to unit numbers. Note that with several capture groups, re.findall returns one tuple per match, with empty strings for the alternatives that did not match. Unfortunately, if there is very low consistency in the entries, this approach would become basically unusable.
Also, might I suggest using a regex tester like Regex101. It really helps in determining whether your regex will do what you want it to.
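A more compact variant, sketched here under the assumption that the number always comes directly before an optional qualifier and a unit noun, uses a single capture group for the number and a non-capturing alternation for the nouns. The qualifier and noun lists are guesses based on the sample descriptions and would need extending for real data:

```python
import re

UNIT_RE = re.compile(
    r"(\d[\d,]*)"                              # the number, allowing thousands separators
    r"[\s-]"                                   # a space or hyphen before the noun
    r"(?:rental\s|residential\s)?"             # optional qualifier (assumed list)
    r"(?:units?|apartments?|beds?|propert(?:y|ies))\b",  # unit nouns (assumed list)
    re.IGNORECASE,
)

x = ["Four 10-story towers - five 11-story residential towers around Lake Peterson - two 9-story hotel towers facing Devon Avenue & four levels of retail below the hotels",
     "265 rental units",
     "10 stories and contain 200 apartments",
     "801 residential properties that include row homes, town homes, condos, single-family housing, apartments, and senior rental units",
     "4-unit townhouse building (6,528 square feet of living space & 2,755 square feet of unheated garage)"]

# With exactly one capture group, findall returns a plain list of numbers per item.
unit = [UNIT_RE.findall(item) for item in x]
print(unit)
```

Keeping the nouns in one alternation means a new unit type only adds one word to the pattern instead of a whole new branch with its own capture group.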
