I want to get only the complete words that acronyms in parentheses stand for.
For example, given the sentence
'Lung cancer screening (LCS) reduces NSCLC mortality',
I want to get 'Lung cancer screening' as a result.
How can I do it with regex?
Original question:
I want to remove repeated uppercase letters:
"HIV acquired immunodeficiency syndrome are at a particularly high risk of cervical cancer" => " acquired immunodeficiency syndrome are at a particularly high risk of cervical cancer"
Assuming you want to target 2 or more capital letters, I would use re.sub here:
inp = "Lung cancer screening (LCS) reduces NSCLC mortality"
output = re.sub(r'\s*(?:\([A-Z]+\)|[A-Z]{2,})\s*', ' ', inp).strip()
print(output) # Lung cancer screening reduces mortality
import re

s = 'HIV acquired immunodeficiency syndrome are at a particularly high risk of cervical cancer'
# Remove every uppercase letter, then strip the leftover leading space.
print(re.sub(r'[A-Z]', '', s).strip())
according to @jensgram's answer
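If you instead want to capture the words the acronym stands for, here is a minimal sketch. It assumes the common convention that an N-letter acronym is preceded by exactly N words, which is a heuristic, not a guarantee:

import re

s = 'Lung cancer screening (LCS) reduces NSCLC mortality'
for m in re.finditer(r'\(([A-Z]{2,})\)', s):
    n = len(m.group(1))  # number of letters in the acronym
    words = s[:m.start()].split()[-n:]  # take that many words before it
    print(' '.join(words))  # Lung cancer screening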
Related
I am trying to use regex to extract sentences containing specific words, along with the sentences immediately before and after them. My code works, but it takes 20 seconds per txt file and I have about a million txt files. Is it possible to get the same result in less time? Any other related suggestions are also welcome. Thanks!
My current thought is to first extract the paragraphs containing these target words, then use nltk to tokenize the target paragraphs and extract the target sentences plus the sentences before and after them.
Here is my demo:
import re, nltk
txt = '''There is widespread agreement that screening for breast cancer, when combined with appropriate follow-up, will reduce mortality from the disease. How we measure response to treatment is called the 5-year survival rate, or the percentage of people who live 5 years after being diagnosed with cancer. According to information provided by National Cancer Institute, Cancer stage at diagnosis, which refers to extent of a cancer in the body, determines treatment options and has a strong influence on the length of survival. In general, if the cancer is found only in the part of the body where it started it is localized (sometimes referred to as stage 1). If it has spread to a different part of the body, the stage is regional or distant . The earlier female breast cancer is caught, the better chance a person has of surviving five years after being diagnosed. For female breast cancer, 62.5 are diagnosed at the local stage. The 5-year survival for localized female breast cancer is 98.8 . It decreases from 98.8 to 85.5 after the cancer has spread to the lymph nodes (stage 2), and to 27.4
(stage 4) after it has spread to other organs such as the lung, liver or brain. A major problem with current detection methods is that studies have shown that mammography does not detect 10 -20 of breast cancers that are detected by physical exam alone, which may be attributed to a falsely negative mammogram.
Breast cancer screening is generally recommended as a routine part of preventive healthcare for women over the age of 20 (approximately 90 million in the United States). Besides skin cancer, breast cancer is the most commonly diagnosed cancer among U.S. women. For these women, the American Cancer Society (ACS) has published guidelines for breast cancer screening including: (i) monthly breast self-examinations for all women over the age of 20; (ii) a clinical breast exam (CBE) every three years for women in their 20s and 30s; (iii) a baseline mammogram for women by the age of 40; and (iv) an annual mammogram for women age 40 or older (according to the American College of Radiology). Unfortunately, the U.S. Preventive Task Force Guidelines have stirred confusion by recommending biennial screening mammography for women ages 50-74.
Each year, approximately eight million women in the United States require diagnostic testing for breast cancer due to a physical symptom, such as a palpable lesion, pain or nipple discharge, discovered through self or physical examination (approximately seven million) or a non-palpable lesion detected by screening x-ray mammography
(approximately one million). Once a physician has identified a suspicious lesion in a woman's breast, the physician may recommend further diagnostic procedures, including a diagnostic x-ray mammography, an ultrasound study, a magnetic resonance imaging procedure, or a minimally invasive procedure such as fine needle aspiration or large core needle biopsy. In each case, the potential benefits of additional diagnostic testing must be balanced against the costs, risks and discomfort to the patient associated with undergoing the additional procedures.
'''
target_words = ['risks', 'discomfort', 'surviving', 'risking', 'risks', 'risky']
pattern = r'.*\b(?='+'|'.join(target_words) + r')\b.*'
target_paras = re.findall(pattern, txt, re.IGNORECASE)
# Function to extract sentences containing any target word plus their neighboring sentences
def UncertaintySentences(paragraph):
    sent_token = nltk.tokenize.sent_tokenize(paragraph)
    keepsents = []
    for i, sentence in enumerate(sent_token):
        # sentence contains a target word
        if re.search(pattern, sentence, re.IGNORECASE) is not None:
            try:
                if i == 0:  # first sentence in a para, keep it and the one after it
                    keepsents.extend([sent_token[i], sent_token[i+1]])
                elif i != len(sent_token)-1:  # sentence in the middle, keep it and the ones before and after it
                    keepsents.extend([sent_token[i-1], sent_token[i], sent_token[i+1]])
                else:  # last sentence, keep it and the one before it
                    keepsents.extend([sent_token[i-1], sent_token[i]])
            except IndexError:  # para with only one sentence
                keepsents = sent_token
    # drop duplicate sentences, preserving order
    del_dup = []
    [del_dup.append(x) for x in keepsents if x not in del_dup]
    return del_dup

for para in target_paras:
    uncertn_sents = UncertaintySentences(para)
    print(uncertn_sents)
The final speed of your original regex is highly dependent on the data you are inspecting.
There's a problem with your regex:
r'.*\b(?='+'|'.join(target_words) + r')\b.*'
If there are many/big paragraphs with no keywords then the search process is very slow.
Why does this happen?
Because your regex starts with .*
Your regex matches the whole paragraph, then starts backtracking character by character, trying to match the keywords as it goes. If there are no keywords at all, the backtracking reaches the beginning of the paragraph.
Then it advances one more character and repeats the whole process (match to the end of the string, backtrack to position 1), then advances to position 2 and repeats everything again...
You can better look at this process with this regex debugger:
https://regex101.com/r/boZLQU/1/debugger
Optimization
Just add a ^ to your regex, like this:
r'^.*\b(?='+'|'.join(target_words) + r')\b.*'
Note that we also need to use the M flag in order to make ^ behave as "beginning of line" instead of "beginning of string"
re.findall(pattern, txt, re.MULTILINE | re.IGNORECASE)
That way you'll do the backtracking process once per line instead of once for every character, which should speed things up a lot when searching through paragraphs that don't contain any of the required keywords.
In terms of computational cost, the regex drops from O(n²) to O(n) on paragraphs that contain no keywords.
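To see the effect yourself, here's an illustrative timing sketch using the question's pattern shape on keyword-free text (the sizes are made up):

import re
import timeit

# Paragraphs that contain none of the keywords: the worst case described above.
text = ("word " * 100 + "\n") * 100

slow = re.compile(r'.*\b(?=risks|discomfort)\b.*', re.IGNORECASE)
fast = re.compile(r'^.*\b(?=risks|discomfort)\b.*', re.IGNORECASE | re.MULTILINE)

print(timeit.timeit(lambda: slow.findall(text), number=1))  # quadratic scan
print(timeit.timeit(lambda: fast.findall(text), number=1))  # one attempt per line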
Here are a few ideas to optimize this code:
- The target_words list can be converted to a set to make the in operation more efficient.
- The pattern variable can be precompiled using re.compile to make the subsequent calls to re.findall and re.search faster.
- The del_dup list comprehension can be replaced with a set() call to remove duplicates more efficiently (note that a plain set does not preserve sentence order, while dict.fromkeys does).
- Make sure sent_token = nltk.tokenize.sent_tokenize(paragraph) stays outside the sentence loop of the UncertaintySentences function, so that the tokenization operation is only performed once per paragraph.
I have a corpus that looks something like this
LETTER AGREEMENT N°5 CHINA SOUTHERN AIRLINES COMPANY LIMITED Bai Yun
Airport, Guangzhou 510405, People's Republic of China Subject: Delays
CHINA SOUTHERN AIRLINES COMPANY LIMITED (the "Buyer") and AIRBUS
S.A.S. (the "Seller") have entered into a purchase agreement (the
"Agreement") dated as of even date
And a list of company names that looks like this
l = [ 'airbus', 'airbus internal', 'china southern airlines', ... ]
The elements of this list do not always have exact matches in the corpus, because of different formulations or just typos: for this reason I want to perform fuzzy matching.
What is the most efficient way of finding the best matches of l in the corpus? In theory the task is not super difficult but I don't see a way of solving it that does not entail looping through both the corpus and list of matches, which could cause huge slowdowns.
You can concatenate your list l in a single regex expression, then use regex to fuzzy match (https://github.com/mrabarnett/mrab-regex#approximate-fuzzy-matching-hg-issue-12-hg-issue-41-hg-issue-109) the words in the corpus.
Something like:
my_regex = ""
for pattern in l:
    my_regex += f'(?:{pattern})' + '{1<=e<=3}'  # {1<=e<=3} permits at least 1 and at most 3 errors
    my_regex += '|'
my_regex = my_regex[:-1]  # remove the trailing |
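A runnable sketch of the same idea; note this needs the third-party regex module (pip install regex), not the built-in re, and I've used {e<=3} (zero to three errors) so that exact matches are found as well:

import regex  # third-party: pip install regex

l = ['airbus', 'airbus internal', 'china southern airlines']
my_regex = '|'.join(f'(?:{p}){{e<=3}}' for p in l)  # each alternative allows up to 3 errors

corpus = ('CHINA SOUTHERN AIRLINES COMPANY LIMITED (the "Buyer") and '
          'AIRBUS S.A.S. (the "Seller") have entered into a purchase agreement')

# BESTMATCH makes the engine look for the match with the fewest errors.
for m in regex.finditer(my_regex, corpus, regex.IGNORECASE | regex.BESTMATCH):
    print(m.group(), m.fuzzy_counts)  # (substitutions, insertions, deletions)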
I went through various answers before posting and they are all regular expression based and involve symbols and special characters.
Here is my input text and expected output. I want to extract the text between 'Investment Objective' and 'Investment Policy'.
input_text:
"Investment Objective To provide long - term capital growth by investing primarily in a portfolio of African companies. Investment Policy"
output_text:
"To provide long - term capital growth by investing primarily in a portfolio of African companies."
An alternative to Joshua's answer:
input_text="Investment Objective To provide long - term capital growth by investing primarily in a portfolio of African companies. Investment Policy"
start_str = "Investment Objective"
startpos = input_text.find(start_str)
end_str = "Investment Policy"
endpos = input_text.find(end_str)
output_str = input_text[startpos + len(start_str):endpos]
output_str_nospaces = output_str.strip()
print(f"'{output_str}'")
print(f"'{output_str_nospaces}'")
Which prints:
' To provide long - term capital growth by investing primarily in a portfolio of African companies. '
'To provide long - term capital growth by investing primarily in a portfolio of African companies.'
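If you need this for more than one pair of markers, the same logic packs into a small helper (the name extract_between is mine, not from the question):

def extract_between(text, start_str, end_str):
    """Return the stripped text between start_str and the next end_str, or None."""
    startpos = text.find(start_str)
    if startpos == -1:
        return None
    startpos += len(start_str)
    endpos = text.find(end_str, startpos)
    if endpos == -1:
        return None
    return text[startpos:endpos].strip()

print(extract_between(input_text, "Investment Objective", "Investment Policy"))
# To provide long - term capital growth by investing primarily in a portfolio of African companies.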
Let's say your blacklisted words are:
black = ["Investment Objective", "Investment Policy"]
Now let's remove them:
for i in black:
    input_text = input_text.replace(i, '').strip()
This gives:
'To provide long - term capital growth by investing primarily in a portfolio of African companies.'
I have text data to be cleaned using regex. However, some words in the text are immediately followed by numbers which I want to remove.
For example, one row of the text is:
Preface2 Contributors4 Abrreviations5 Acknowledgements8 Pes
terminology10 Lessons learnt from the RUPES project12 Payment for
environmental service and it potential and example in Vietnam16
Chapter Integrating payment for ecosystem service into Vietnams policy
and programmes17 Chapter Creating incentive for Tri An watershed
protection20 Chapter Sustainable financing for landscape beauty in
Bach Ma National Park 24 Chapter Building payment mechanism for carbon
sequestration in forestry a pilot project in Cao Phong district of Hoa
Binh province Vietnam26 Chapter 5 Local revenue sharing Nha Trang Bay
Marine Protected Area Vietnam28 Synthesis and Recommendations30
References32
The first word in the above text should be 'preface' instead of 'preface2' and so on.
line = re.sub(r"[A-Za-z]+(\d+)", "", line)
This, however, removes the words as well, as seen here:
Pes Lessons learnt from the RUPES Payment for environmental service
and it potential and example in Chapter Integrating payment for
ecosystem service into Vietnams policy and Chapter Creating incentive
for Tri An watershed Chapter Sustainable financing for landscape
beauty in Bach Ma National Park 24 Chapter Building payment mechanism
for carbon sequestration in forestry a pilot project in Cao Phong
district of Hoa Binh province Chapter 5 Local revenue sharing Nha
Trang Bay Marine Protected Area Synthesis and
How can I capture only the numbers that immediately follow words?
You can capture the text part and substitute the whole match with the captured part. It's as simple as:
re.sub(r"([A-Za-z]+)\d+", r"\1", line)
You could try a lookbehind assertion to check for a word character before your numbers, plus a word boundary (\b) at the end, forcing your regex to only match numbers at the end of a word:
re.sub(r'(?<=\w)\d+\b', '', line)
Hope this helps
EDIT:
Sorry about the glitch, mentioned in the comments, of matching numbers that are NOT preceded by words as well. That is because (sorry again) \w matches alphanumeric characters instead of only alphabetic ones. Depending on what you would like to delete, you can use the positive version
re.sub(r'(?<=[a-zA-Z])\d+\b', '', line)
to only check for English alphabetic characters (you can add characters to the [a-zA-Z] class) preceding your number, or the negative version
re.sub(r'(?<![\d\s])\d+\b', '', line)
to match anything that is NOT \d (a digit) or \s (whitespace) before your desired number. This will also match after punctuation marks, though.
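A quick comparison of the two versions on a made-up line shows the difference around punctuation:

import re

line = "Preface2 Chapter 5 note(7) References32"

# Positive version: the digits must directly follow a letter.
print(re.sub(r'(?<=[a-zA-Z])\d+\b', '', line))
# Preface Chapter 5 note(7) References

# Negative version: the digits must not follow a digit or whitespace,
# so the 7 after the parenthesis is stripped too.
print(re.sub(r'(?<![\d\s])\d+\b', '', line))
# Preface Chapter 5 note() References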
Try this:
line = re.sub(r"([A-Za-z]+)(\d+)", "\\2", line) #just keep the number
line = re.sub(r"([A-Za-z]+)(\d+)", "\\1", line) #just keep the word
line = re.sub(r"([A-Za-z]+)(\d+)", r"\2", line) #same as first one
line = re.sub(r"([A-Za-z]+)(\d+)", r"\1", line) #same as second one
\\1 refers to the word, \\2 to the number. See: How to use python regex to replace using captured group?
Below is a working sample of code that might solve your problem.
Here's the snippet:
import re

# I'll write a function that takes the test data as input and returns the
# desired result as stated in your question.
def transform(data):
    """Strip the trailing numbers from words in a text."""
    # First, let's construct a pattern matching the words we're looking for.
    pattern1 = r"([A-Za-z]+\d+)"
    # Let's construct another pattern that strips the trailing digits
    # from each matched word.
    pattern2 = r"\d+$"
    # Find all matching words.
    matches = re.findall(pattern1, data)
    # Construct a replacement for each word.
    replacements = []
    for match in matches:
        replacements.append(re.sub(pattern2, '', match))
    # Intermediate variable to construct tuples of (word, replacement) for
    # use with the string method 'replace'.
    changers = zip(matches, replacements)
    # Now replace every matched word with its stripped form.
    output = data
    for changer in changers:
        output = output.replace(*changer)
    # The work is done; we can return the result.
    return output
For test purposes, we run the above function on your test data:
data = """
Preface2 Contributors4 Abrreviations5 Acknowledgements8 Pes terminology10 Lessons
learnt from the RUPES project12 Payment for environmental service and it potential and
example in Vietnam16 Chapter Integrating payment for ecosystem service into Vietnams
policy and programmes17 Chapter Creating incentive for Tri An watershed protection20
Chapter Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter
Building payment mechanism for carbon sequestration in forestry a pilot project in Cao
Phong district of Hoa Binh province Vietnam26 Chapter 5 Local revenue sharing Nha Trang
Bay Marine Protected Area Vietnam28 Synthesis and Recommendations30 References32
"""
result = transform(data)
print(result)
And the result looks like this:
Preface Contributors Abrreviations Acknowledgements Pes terminology Lessons learnt from
the RUPES project Payment for environmental service and it potential and example in
Vietnam Chapter Integrating payment for ecosystem service into Vietnams policy and
programmes Chapter Creating incentive for Tri An watershed protection Chapter
Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter Building
payment mechanism for carbon sequestration in forestry a pilot project in Cao Phong
district of Hoa Binh province Vietnam Chapter 5 Local revenue sharing Nha Trang Bay
Marine Protected Area Vietnam Synthesis and Recommendations References
You can use a character range for the digits as well (note that this removes every digit, not just those attached to words):
re.sub(r"[0-9]", "", line)
Given 10,000,000,000 lines of around 20-50 words each line, e.g.:
Anarchism is often defined as a political philosophy which holds the state to be undesirable , unnecessary , or harmful .
However , others argue that while anti-statism is central , it is inadequate to define anarchism .
Therefore , they argue instead that anarchism entails opposing authority or hierarchical organization in the conduct of human relations , including , but not limited to , the state system .
Proponents of anarchism , known as " anarchists " , advocate stateless societies based on non - hierarchical free association s. As a subtle and anti-dogmatic philosophy , anarchism draws on many currents of thought and strategy .
Anarchism does not offer a fixed body of doctrine from a single particular world view , instead fluxing and flowing as a philosophy .
There are many types and traditions of anarchism , not all of which are mutually exclusive .
Anarchist schools of thought can differ fundamentally , supporting anything from extreme individualism to complete collectivism .
Strains of anarchism have often been divided into the categories of social and individualist anarchism or similar dual classifications .
Anarchism is often considered a radical left-wing ideology , and much of anarchist economics and anarchist legal philosophy reflect anti-authoritarian interpretations of communism , collectivism , syndicalism , mutualism , or participatory economics .
Anarchism as a mass social movement has regularly endured fluctuations in popularity .
The central tendency of anarchism as a social movement has been represented by anarcho-communism and anarcho-syndicalism , with individualist anarchism being primarily a literary phenomenon which nevertheless did have an impact on the bigger currents and individualists have also participated in large anarchist organizations .
Many anarchists oppose all forms of aggression , supporting self-defense or non-violence ( anarcho-pacifism ) , while others have supported the use of some coercive measures , including violent revolution and propaganda of the deed , on the path to an anarchist society .
Etymology and terminology The term derives from the ancient Greek ἄναρχος , anarchos , meaning " without rulers " , from the prefix ἀν - ( an - , " without " ) + ἀρχός ( arkhos , " leader " , from ἀρχή arkhē , " authority , sovereignty , realm , magistracy " ) + - ισμός ( - ismos , from the suffix - ιζειν , - izein " - izing " ) . "
Anarchists " was the term adopted by Maximilien de Robespierre to attack those on the left whom he had used for his own ends during the French Revolution but was determined to get rid of , though among these " anarchists " there were few who exhibited the social revolt characteristics of later anarchists .
There would be many revolutionaries of the early nineteenth century who contributed to the anarchist doctrines of the next generation , such as William Godwin and Wilhelm Weitling , but they did not use the word " anarchist " or " anarchism " in describing themselves or their beliefs .
Pierre-Joseph Proudhon was the first political philosopher to call himself an anarchist , making the formal birth of anarchism the mid-nineteenth century .
Since the 1890s from France , the term " libertarianism " has often been used as a synonym for anarchism and was used almost exclusively in this sense until the 1950s in the United States ; its use as a synonym is still common outside the United States .
On the other hand , some use " libertarianism " to refer to individualistic free-market philosophy only , referring to free-market anarchism as " libertarian anarchism " .
And let's say I have a list of dictionary terms that is made up of one or more words, e.g:
clinical anatomy
clinical psychology
cognitive neuroscience
cognitive psychology
cognitive science
comparative anatomy
comparative psychology
compound morphology
computational linguistics
correlation
cosmetic dentistry
cosmography
cosmology
craniology
craniometry
criminology
cryobiology
cryogenics
cryonics
cryptanalysis
crystallography
curvilinear correlation
cybernetics
cytogenetics
cytology
deixis
demography
dental anatomy
dental surgery
dentistry
philosophy
political philosophy
And I need to find all sentences that contain any of these terms and then replace the spaces between the words within the terms with underscores.
For example, there is this sentence in the text:
Anarchism is often defined as a political philosophy which holds the state to be undesirable , unnecessary , or harmful .
And there's the dictionary term political philosophy in the text. So the output for this sentence needs to be:
Anarchism is often defined as a political_philosophy which holds the state to be undesirable , unnecessary , or harmful .
I could do this:
dictionary = sorted(dictionary, key=len, reverse=True)  # replace the longest terms first
for line in text:
    for term in dictionary:
        if term in line:
            line = line.replace(term, term.replace(' ', '_'))
Assuming that I have 10,000 Dictionary terms (D) and 10,000,000,000 Sentences (S), the complexity of using the simple method would be O(D*S), right? Is there a faster and less brute-forcey way to achieve the same results?
Is there a way to replace all of the terms with spaces by the terms with underscore for each line? That will help in avoiding the inner loop.
Would it be more efficient if I index the text using something like whoosh first, then query the index and replace the terms? I would still need something like O(1*S) to do the replacements, right?
The solution doesn't need to be in Python, even if it's some Unix command tricks like grep/sed/awk, it's fine, as long as subprocess.Popen-able.
Please correct my complexity assumptions if I'm wrong, pardon my noobiness.
Given a sentence:
This is a sentence that contains multiple phrases that I need to
replace with phrases with underscores, e.g. social political
philosophy with political philosophy under the branch of philosophy
and some computational linguistics where the cognitive linguistics and
psycho cognitive linguistics appears with linguistics
And let's say I have the dictionary:
cognitive linguistics
psycho cognitive linguistics
socio political philosophy
political philosophy
computational linguistics
linguistics
philosophy
social political philosophy
The output should look like:
This is a sentence that contains multiple phrases that I need to
replace with phrases with underscores, e.g.
social_political_philosophy with political_philosophy under the branch
of philosophy and some computational_linguistics where the
cognitive_linguistics and psycho_cognitive_linguistics appears with
linguistics
And the aim is to do this with a text file of 10 billion lines and a dictionary of 10-100k phrases.
It may be better to split each line into words and map the first word of every phrase to the full phrases that start with it. Then, instead of checking every item in the dict against every line, you only need to sort the phrases that actually appear in the line by length, longest first:
from collections import defaultdict
from itertools import chain

def get_phrases(fle):
    # Map the first word of each phrase to every phrase that starts with it.
    phrase_dict = defaultdict(list)
    with open(fle) as ph:
        for line in map(str.rstrip, ph):
            k, _, phr = line.partition(" ")
            phrase_dict[k].append(line)
    return phrase_dict

def replace(fle, dct):
    with open(fle) as f:
        for line in f:
            # Only consider phrases whose first word appears in the line,
            # sorted longest first so longer phrases are replaced first.
            phrases = sorted(chain.from_iterable(dct[word] for word in line.split()
                                                 if word in dct), reverse=True, key=len)
            for phr in phrases:
                line = line.replace(phr, phr.replace(" ", "_"))
            yield line
Output:
In [10]: cat out.txt
This is a sentence that contains multiple phrases that I need to replace with phrases with underscores, e.g. social political philosophy with political philosophy under the branch of philosophy and some computational linguistics where the cognitive linguistics and psycho cognitive linguistics appears with linguistics
In [11]: cat phrases.txt
cognitive linguistics
psycho cognitive linguistics
socio political philosophy
political philosophy
computational linguistics
linguistics
philosophy
social political philosophy
In [12]: list(replace("out.txt",get_phrases("phrases.txt")))
Out[12]: ['This is a sentence that contains multiple phrases that I need to replace with phrases with underscores, e.g. social_political_philosophy with political_philosophy under the branch of philosophy and some computational_linguistics where the cognitive_linguistics and psycho_cognitive_linguistics appears with linguistics']
A few other versions:
import re
from itertools import chain

def repl(x):
    # re.sub callback: swap the spaces in the matched phrase for underscores.
    if x:
        return x.group().replace(" ", "_")
    return x

def replace_re(fle, dct):
    with open(fle) as f:
        for line in f:
            spl = set(line.split())
            phrases = chain.from_iterable(dct[word] for word in spl if word in dct)
            line = re.sub("|".join(phrases), repl, line)
            yield line

def replace_re2(fle, dct):
    # Cache the compiled pattern for each distinct tuple of phrases.
    cached = {}
    with open(fle) as f:
        for line in f:
            phrases = tuple(chain.from_iterable(
                dct[word] for word in set(line.split()) if word in dct))
            if phrases not in cached:
                r = re.compile("|".join(phrases))
                cached[phrases] = r
                line = r.sub(repl, line)
            else:
                line = cached[phrases].sub(repl, line)
            yield line
I would make a regex of your Dictionary to match the data.
Then on the replacement side, use a callback to replace spaces with _.
I estimate it would take less than 3 hours to do the whole thing.
Fortunately there is a Ternary Tool (Dictionary) regex generator.
To generate the regex, and for what is shown below, you'll need the Trial
version of RegexFormat 7.
Some links:
Screenshot of tool
TernaryTool(Dictionary) - Text version Dictionary samples
A 175,000 word Dictionary Regex
You basically generate your own Dictionary
by dropping in the strings you want to find, then press the Generate button.
Then all you have to do is read in 5 MB chunks and do a find/replace using the
regex, then append the result to a new file; rinse and repeat.
Pretty simple really.
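For illustration, here's a minimal sketch of that read/replace/append loop in Python; the file names are placeholders, and the saved regex is assumed to be the one generated by the tool:

import re

CHUNK_SIZE = 5 * 1024 * 1024  # 5 MB chunks, as suggested above

# Assumed: the generated dictionary regex was saved to this file.
with open('dictionary_regex.txt') as f:
    pattern = re.compile(f.read().strip())  # pre-compile once

def underscore(m):
    # Replacement callback: swap the spaces in the matched phrase for underscores.
    return m.group().replace(' ', '_')

with open('corpus.txt') as src, open('corpus_out.txt', 'w') as dst:
    while True:
        chunk = src.read(CHUNK_SIZE)
        if not chunk:
            break
        # Caveat: a phrase straddling a chunk boundary will be missed;
        # reading complete lines instead avoids that.
        dst.write(pattern.sub(underscore, chunk))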
Based on your sample (above) this is an estimate of the time it would take
to complete 10 Billion lines.
This analysis is based on using a benchmark that was run on your sample input using the generated regex (below).
19 lines (≈ 3,600 chars)
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 5
Elapsed Time: 4.03 s, 4034.28 ms, 4034278 µs
////////////////////////////
3606 chars
x 50,000
------------
180,300,000 (chars)
or
20 lines
x 50,000
------------
1,000,000 (lines)
=========================
10,000,000,000 lines
/
1,000,000 (lines) per 4 seconds
-----------------------------------------
40,000 seconds
/
3600 secs per hour
-------------------------
11 hours
////////////////////////////
However, if you read in and process 5-megabyte chunks
(as a single string), it will reduce the engine overhead
and bring the time down to 1-3 hours.
This is the generated regex for your sample Dictionary (compressed):
\b(?:c(?:linical[ ](?:anatomy|psychology)|o(?:gnitive[ ](?:neuroscience|psychology|science)|mp(?:arative[ ](?:anatomy|psychology)|ound[ ]morphology|utational[ ]linguistics)|rrelation|sm(?:etic[ ]dentistry|o(?:graphy|logy)))|r(?:anio(?:logy|metry)|iminology|y(?:o(?:biology|genics|nics)|ptanalysis|stallography))|urvilinear[ ]correlation|y(?:bernetics|to(?:genetics|logy)))|de(?:ixis|mography|nt(?:al[ ](?:anatomy|surgery)|istry))|p(?:hilosophy|olitical[ ]philosophy))\b
(Note that the space separations are generated as [ ], one per space.
If you want to change them to a quantified class, just run a
find on (?:\[ \])+ and replace it with whatever you want,
for example \s+ or [ ]+.)
Here it is Formatted:
\b
(?:
   c
   (?:
      linical [ ]
      (?: anatomy | psychology )
    | o
      (?:
         gnitive [ ]
         (?: neuroscience | psychology | science )
       | mp
         (?:
            arative [ ]
            (?: anatomy | psychology )
          | ound [ ] morphology
          | utational [ ] linguistics
         )
       | rrelation
       | sm
         (?:
            etic [ ] dentistry
          | o
            (?: graphy | logy )
         )
      )
    | r
      (?:
         anio
         (?: logy | metry )
       | iminology
       | y
         (?:
            o
            (?: biology | genics | nics )
          | ptanalysis
          | stallography
         )
      )
    | urvilinear [ ] correlation
    | y
      (?:
         bernetics
       | to
         (?: genetics | logy )
      )
   )
 | de
   (?:
      ixis
    | mography
    | nt
      (?:
         al [ ]
         (?: anatomy | surgery )
       | istry
      )
   )
 | p
   (?: hilosophy | olitical [ ] philosophy )
)
\b
Adding 10,000 phrases is very easy, and the regex is no bigger than the
number of bytes in the phrases plus a bit of overhead to interlace
the regex.
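If you'd rather not depend on the trial tool, a trie-style regex like the one above can also be generated with a short Python function. This is a rough sketch of the idea, assuming the phrases contain no regex metacharacters other than spaces:

import re

def trie_regex(phrases):
    # Build a character trie of the phrases.
    trie = {}
    for p in phrases:
        node = trie
        for ch in p:
            node = node.setdefault(ch, {})
        node[''] = {}  # end-of-phrase marker

    # Serialize the trie into an alternation that shares common prefixes.
    def serialize(node):
        if list(node) == ['']:
            return ''
        alts = []
        for ch, child in sorted(node.items()):
            if ch == '':
                alts.append('')  # a phrase ends here
            else:
                piece = '[ ]' if ch == ' ' else re.escape(ch)
                alts.append(piece + serialize(child))
        if len(alts) == 1:
            return alts[0]
        body = '|'.join(a for a in alts if a)
        return '(?:' + body + (')?' if '' in alts else ')')

    return r'\b' + serialize(trie) + r'\b'

print(trie_regex(['cosmography', 'cosmology', 'dentistry']))
# \b(?:cosmo(?:graphy|logy)|dentistry)\b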
A final note: you can reduce the time even further by only generating the
regex for actual phrases, i.e. entries whose words are separated by horizontal
whitespace (single-word terms don't need any underscores).
And be sure to pre-compile the regex. You only have to do this once.