Retrieve definition for parenthesized abbreviation, based on letter count - python

I need to retrieve the definition of an acronym based on the number of letters enclosed in parentheses. For the data I'm dealing with, the number of letters in parentheses corresponds to the number of words to retrieve. I know this isn't a reliable method for getting abbreviations, but in my case it will be. For example:
String = 'Although family health history (FHH) is commonly accepted as an important risk factor for common, chronic diseases, it is rarely considered by a nurse practitioner (NP).'
Desired output: family health history (FHH), nurse practitioner (NP)
I know how to extract parentheses from a string, but after that I am stuck. Any help is appreciated.
import re
a = ('Although family health history (FHH) is commonly accepted as an '
     'important risk factor for common, chronic diseases, it is rarely considered '
     'by a nurse practitioner (NP).')
x2 = re.findall(r'(\(.*?\))', a)
for x in x2:
    length = len(x)
    print(x, length)

Use the regex match to find the position of the start of the match, then use Python string indexing to get the substring leading up to it. Split that substring into words and take the last n words, where n is the length of the abbreviation.
import re
s = 'Although family health history (FHH) is commonly accepted as an important risk factor for common, chronic diseases, it is rarely considered by a nurse practitioner (NP).'
for match in re.finditer(r"\((.*?)\)", s):
    start_index = match.start()
    abbr = match.group(1)
    size = len(abbr)
    words = s[:start_index].split()[-size:]
    definition = " ".join(words)
    print(abbr, definition)
This prints:
FHH family health history
NP nurse practitioner

Does this solve your problem?
a = 'Although family health history (FHH) is commonly accepted as an important risk factor for common, chronic diseases, it is rarely considered by a nurse practitioner (NP).'
splitstr = a.replace('.', '').split(' ')
output = ''
for i, word in enumerate(splitstr):
    if '(' in word:
        w = word.replace('(', '').replace(')', '').replace('.', '')
        for n in range(len(w) + 1):
            output = splitstr[i - n] + ' ' + output
print(output)
actually, Keatinge beat me to it

An idea: use a recursive pattern with the PyPI regex module.
\b[A-Za-z]+\s+(?R)?\(?[A-Z](?=[A-Z]*\))\)?
See this PCRE demo at regex101.
\b[A-Za-z]+\s+ matches a word boundary, one or more letters, one or more whitespace characters
(?R)? recursive part: optionally paste the pattern from the start
\(? \)? the parentheses need to be optional so the recursion can fit inside them
[A-Z](?=[A-Z]*\)) matches one uppercase letter if followed by a closing ) with any number of A-Z in between
It does not check that the first letter of each word actually matches the letter at the corresponding position in the abbreviation.
It does not check for an opening parenthesis in front of the abbreviation. To check, add a variable-length lookbehind (supported by the regex module): change [A-Z](?=[A-Z]*\)) to (?<=\([A-Z]*)[A-Z](?=[A-Z]*\)).
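For a quick test from Python, here is a minimal sketch; it assumes the third-party regex package (pip install regex), which supports the (?R) recursion used above:
import regex

s = ('Although family health history (FHH) is commonly accepted as an important '
     'risk factor for common, chronic diseases, it is rarely considered by a '
     'nurse practitioner (NP).')

# Each recursion level consumes one preceding word plus one letter of the abbreviation.
pattern = r'\b[A-Za-z]+\s+(?R)?\(?[A-Z](?=[A-Z]*\))\)?'
print(regex.findall(pattern, s))
# Should print something like: ['family health history (FHH)', 'nurse practitioner (NP)']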

Using re with list-comprehension
x_lst = [str(len(i[1:-1])) for i in re.findall(r'(\(.*?\))', a)]
[re.search(r'(\S+\s+){' + i + r'}\(.{' + i + r'}\)', a).group(0) for i in x_lst]
#['family health history (FHH)', 'nurse practitioner (NP)']

This solution isn't particularly clever; it simply searches for the acronyms and then builds a pattern to extract the words ahead of each one:
import re
string = "Although family health history (FHH) is commonly accepted as an important risk factor for common, chronic diseases, it is rarely considered by a nurse practitioner (NP)."
definitions = []
for acronym in re.findall(r'\(([A-Z]+?)\)', string):
    length = len(acronym)
    match = re.search(r'(?:\w+\W+){' + str(length) + r'}\(' + acronym + r'\)', string)
    definitions.append(match.group(0))
print(", ".join(definitions))
OUTPUT
> python3 test.py
family health history (FHH), nurse practitioner (NP)
>

Related

Regex for matching alphabet, numbers and special characters while looping in python

I am trying to find words and print them using the code below. Everything is working perfectly, but the only issue is that I am unable to print the last word (which is a number).
words = ['Town of', 'Block No.', 'Lot No.', 'Premium (if any) Paid ']
import re
for i in words:
    y = re.findall('{} ([^ ]*)'.format(i), textfile)
    print(y)
Text file I am working with:
textfile = """1, REBECCA M. ROTH , COLLECTOR OF TAXES of the taxing district of the
township of MORRIS for Six Hundred Sixty Seven dollars andFifty Two cents, the land
in said taxing district described as Block No. 10303 Lot No. 10 :
and known as 239 E HANOVER AVE , on the tax Taxes For: 2012
Sewer
Assessments For Improvements
Total Cost of Sale 35.00
Total
Premium (if any) Paid 1,400.00 """
I would like to know where I am making a mistake.
Any suggestion is appreciated.
A couple of issues:
As others have mentioned, you need to escape special characters like parentheses ( ) and the dot (.). Very simply, you can use re.escape.
Another issue is the trailing space in 'Premium \(if any\) Paid ': combined with the space you already have in your regex {} ([^ ]*), it ends up trying to match two spaces instead of one.
You should instead change your code to the following:
See working code here
words = ['Town of', 'Block No.', 'Lot No.', 'Premium (if any) Paid']
import re
for i in words:
    y = re.findall('{} ([^ ]*)'.format(re.escape(i)), textfile)
    print(y)
Two problems:
Your current 'Premium (if any) Paid ' string ends on a space, and '{} ([^ ]*)' also has a space after {}, which adds them together. Delete the trailing space in 'Premium (if any) Paid '.
You need to escape the parentheses, so if you want to keep your regular expression unchanged, the string in the list should be ['Premium \(if any\) Paid']. You can also use re.escape instead.
For your particular cases, this seems to be an optimal solution:
words = ['Town of', 'Block No.', 'Lot No.', 'Premium (if any) Paid']
import re
for i in words:
    y = re.findall(r'{}\s+(\S*)'.format(re.escape(i)), textfile, re.I)
    print(y)

Returning strings contained within blocks of text, without punctuation attached

I need to match and return any word containing at least one of the strings/combinations of characters below:
- tion (as in navigation, isolation, or mitigation)
- ex (as in explanation, exfiltrate, or expert)
- ph (as in philosophy, philanthropy, or ephemera)
- ost, ist, ast (as in hostel, distribute, past)
My function appears to do this
TEXT_SAMPLE = """
Striking an average of observations taken at different times-- rejecting those
timid estimates that gave the object a length of 200 feet, and ignoring those
exaggerated views that saw it as a mile wide and three long--you could still
assert that this phenomenal creature greatly exceeded the dimensions of
anything then known to ichthyologists, if it existed at all.
Now then, it did exist, this was an undeniable fact; and since the human mind
dotes on objects of wonder, you can understand the worldwide excitement caused
by this unearthly apparition. As for relegating it to the realm of fiction,
that charge had to be dropped.
In essence, on July 20, 1866, the steamer Governor Higginson, from the
Calcutta & Burnach Steam Navigation Co., encountered this moving mass five
miles off the eastern shores of Australia.
"""
def latin_ish_words(text):
    # Splits the input text into a list of words on whitespace
    text_list = text.split()
    # Creates an empty list for the matches
    match_list = []
    # A list of latin-ish word parts
    part_list = ["tion", "ex", "ph", "ost", "ist", "ast"]
    # Iterates through the words and parts, adding a word to match_list if it contains a part
    for word in text_list:
        for part in part_list:
            if part in word:
                match_list.append(word)
    # Removes duplicates while preserving order
    match_list = list(dict.fromkeys(match_list))
    return match_list

latin_ish_words(TEXT_SAMPLE)
['observations', 'exaggerated', 'phenomenal', 'exceeded', 'ichthyologists,', 'existed', 'exist,', 'excitement', 'apparition.', 'fiction,', 'Navigation', 'eastern']
However, when words have punctuation attached, the function will also return the punctuation,
e.g. 'exist,' or 'apparition.'.
How could one filter out such attached punctuation?
You can use r"\b\w*(?:tion|ex|ph|ost|ist|ast)\w*\b" regex. Explanation (see also docs):
\b ... word boundary
\w ... word character
* ... 0 or more repetition
\w* ... 0 or more word characters
(?:...) ... "plain" parens, not creating a group
| ... or
tion|ex|ph ... tion or ex or ph
Code:
import re
print(re.findall(r"\b\w*(?:tion|ex|ph|ost|ist|ast)\w*\b",TEXT_SAMPLE))
For convenience, you can build the pattern programmatically, adding the parts from a variable:
import re

part_list = [
    "tion",
    "ex",
    "ph",
    "ost",
    "ist",
    "ast",
]
part_re = "|".join(part_list)
pattern = fr"\b\w*(?:{part_re})\w*\b"
# pattern = r"\b\w*(?:{})\w*\b".format(part_re)  # for older versions not allowing f-string syntax
print(re.findall(pattern, TEXT_SAMPLE))
Output:
[
'observations',
'exaggerated',
'phenomenal',
'exceeded',
'ichthyologists',
'existed',
'exist',
'excitement',
'apparition',
'fiction',
'Navigation',
'eastern',
]

Program to grab abbreviations and definitions - trouble getting all lower case abbreviations

I have a program that grabs abbreviations (i.e., looks for words enclosed in parentheses) and then, based on the number of characters in the abbreviation, goes back that many words and defines it. So far, it works for definitions whose preceding words start with capital letters, or when most of the preceding words start with capital letters; in the latter case it skips lowercase words like "in" and goes on to the next one. However, my problem is when the corresponding words are all lowercase.
Current Output:
All Awesome Dudes (AAD)
Initiative on Methods, Measurement, and Pain Assessment in Clinical Trials (IMMPACT)
Trials (IMMPACT). Some patient prefer the usual care (UC)
Desired Output:
All Awesome Dudes (AAD)
Initiative on Methods, Measurement, and Pain Assessment in Clinical Trials (IMMPACT)
usual care (UC)
import re

s = """Too many people, but not All Awesome Dudes (AAD) only care about the
Initiative on Methods, Measurement, and Pain Assessment in Clinical
Trials (IMMPACT). Some patient prefer the usual care (UC) approach of
doing nothing"""

allabbre = []
for match in re.finditer(r"\((.*?)\)", s):
    start_index = match.start()
    abbr = match.group(1)
    size = len(abbr)
    words = s[:start_index].split()
    count = 0
    for k, i in enumerate(words[::-1]):
        if i[0].isupper():
            count += 1
        if count == size:
            break
    words = words[-k - 1:]
    definition = " ".join(words)
    abbr_keywords = definition + " " + "(" + abbr + ")"
    pattern = '[A-Z]'
    if re.search(pattern, abbr):
        if abbr_keywords not in allabbre:
            allabbre.append(abbr_keywords)
            print(abbr_keywords)
The flag is used for rare cases like All are Awesome Dudes (AAD)
import re

s = """Too many people, but not All Awesome Dudes (AAD) only care about the
Initiative on Methods, Measurement, and Pain Assessment in Clinical
Trials (IMMPACT). Some patient prefer the usual care (UC) approach of
doing nothing
"""

allabbre = []
for match in re.finditer(r"\((.*?)\)", s):
    start_index = match.start()
    abbr = match.group(1)
    size = len(abbr)
    words = s[:start_index].split()
    count = size - 1
    flag = words[-1][0].isupper()
    for k, i in enumerate(words[::-1]):
        first_letter = i[0] if flag else i[0].upper()
        if first_letter == abbr[count]:
            count -= 1
        if count == -1:
            break
    words = words[-k - 1:]
    definition = " ".join(words)
    abbr_keywords = definition + " " + "(" + abbr + ")"
    pattern = '[A-Z]'
    if re.search(pattern, abbr):
        if abbr_keywords not in allabbre:
            allabbre.append(abbr_keywords)
            print(abbr_keywords)

Delete regex matches within loop and continue with updated string version

I have a string that I want to run through four wordlists, one with four-grams, one with trigrams, one with bigrams and one with single terms. To avoid a word from the single-term wordlist being counted twice when it also forms part of a bigram or trigram, I start by counting four-grams, then want to update the string by removing the matches so that only the remaining part of the string is checked for trigrams, bigrams and single terms, respectively. I have used the following code, illustrated here starting with four-grams and then trigrams:
import re

financial_trigrams_count = 0
financial_fourgrams_count = 0

strn = "thank you, john, and good morning, everyone. with me today is tim, our chief financial officer."

pattern_fourgrams = ["value to the business", "car and truck sales"]
pattern_trigrams = ["cash flow statement", "chief financial officer"]

for i in pattern_fourgrams:
    financial_fourgrams_count = financial_fourgrams_count + strn.count(i)

new_strn = strn
def clean_text1(pattern_fourgrams, new_strn):
    for r in pattern_fourgrams:
        new_strn = re.sub(r, '', new_strn)
    return new_strn

for i in pattern_trigrams:
    financial_trigrams_count = financial_trigrams_count + new_strn.count(i)

new_strn1 = new_strn
def clean_text2(pattern_trigrams, new_strn1):
    for r in pattern_trigrams:
        new_strn1 = re.sub(r, '', new_strn1)
    return new_strn1

print(financial_fourgrams_count)
print(financial_trigrams_count)

word_count_wostop = len(strn.split())
print(word_count_wostop)
For four-grams there is no match, so new_strn will be the same as strn. However, there is one match among the trigrams ("chief financial officer"), yet I do not succeed in deleting the match from new_strn1. Instead, I again get the full string, namely strn (or new_strn, which is the same).
Could someone help me find the mistake here?
(As a complement to Tilak Putta's answer)
Note that you are searching the string twice: once when counting the occurrences of the ngrams with .count() and once more when you remove the matches using re.sub().
You can increase performance by counting and removing at the same time.
This can be done using re.subn. This function takes the same parameters as re.sub but returns a tuple containing the cleaned string as well as the number of substitutions made.
Example:
for i in pattern_fourgrams:
    new_strn, n = re.subn(i, '', new_strn)
    financial_fourgrams_count += n
Note that this assumes the n-grams are pairwise different (for fixed n), i.e. they shouldn't share a common word, since subn will delete that word the first time it sees it and thus won't be able to find occurrences of other n-grams containing that particular word.
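Put together, a sketch of the whole approach might look like this (variable names follow the question; re.escape is my addition, on the assumption that the n-grams are meant to be matched literally):
import re

financial_fourgrams_count = 0
financial_trigrams_count = 0

strn = "thank you, john, and good morning, everyone. with me today is tim, our chief financial officer."
pattern_fourgrams = ["value to the business", "car and truck sales"]
pattern_trigrams = ["cash flow statement", "chief financial officer"]

new_strn = strn
# Count and remove four-grams in one pass, then run the trigrams over the already-cleaned string.
for p in pattern_fourgrams:
    new_strn, n = re.subn(re.escape(p), '', new_strn)
    financial_fourgrams_count += n
for p in pattern_trigrams:
    new_strn, n = re.subn(re.escape(p), '', new_strn)
    financial_trigrams_count += n

print(financial_fourgrams_count)  # 0
print(financial_trigrams_count)   # 1, and "chief financial officer" is gone from new_strn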
You need to remove the def statements; the functions are defined but never called, so the cleaned string is never actually produced:
import re

financial_trigrams_count = 0
financial_fourgrams_count = 0

strn = "thank you, john, and good morning, everyone. with me today is tim, our chief financial officer."

pattern_fourgrams = ["value to the business", "car and truck sales"]
pattern_trigrams = ["cash flow statement", "chief financial officer"]

for i in pattern_fourgrams:
    financial_fourgrams_count = financial_fourgrams_count + strn.count(i)

new_strn = strn
for r in pattern_fourgrams:
    new_strn = re.sub(r, '', new_strn)

for i in pattern_trigrams:
    financial_trigrams_count = financial_trigrams_count + new_strn.count(i)

new_strn1 = new_strn
for r in pattern_trigrams:
    new_strn1 = re.sub(r, '', new_strn1)

print(new_strn1)
print(financial_fourgrams_count)
print(financial_trigrams_count)

word_count_wostop = len(strn.split())
print(word_count_wostop)

Python - RegEx for splitting text into sentences (sentence-tokenizing) [duplicate]

This question already has answers here:
How can I split a text into sentences?
(20 answers)
Closed 3 years ago.
I want to make a list of sentences from a string and then print them out. I don't want to use NLTK to do this. So it needs to split on a period at the end of a sentence, and not at decimals, abbreviations, the title in a name, or when the sentence contains a .com. This is my attempt at a regex, which doesn't work:
import re
text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
for stuff in sentences:
    print(stuff)
Example output of what it should look like
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s
Try this to split your string. You can also check the demo:
http://regex101.com/r/nG1gU7/27
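For reference, a quick sketch applying it with re.split to the question's sample text:
import re

text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. "
        "Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... "
        "Well, with a probability of .9 it isn't.")

# Split on whitespace preceded by '.' or '?', unless the lookbehinds detect a
# dotted token such as "i.e." or a short abbreviation such as "Mr." / "Jr.".
for sentence in re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text):
    print(sentence)
# Should print one sentence per line for this sample.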
Ok so sentence-tokenizers are something I looked at in a little detail, using regexes, nltk, CoreNLP, spaCy. You end up writing your own and it depends on the application. This stuff is tricky and valuable and people don't just give their tokenizer code away. (Ultimately, tokenization is not a deterministic procedure, it's probabilistic, and also depends very heavily on your corpus or domain, e.g. legal/financial documents vs social-media posts vs Yelp reviews vs biomedical papers...)
In general you can't rely on one single Great White infallible regex, you have to write a function which uses several regexes (both positive and negative); also a dictionary of abbreviations, and some basic language parsing which knows that e.g. 'I', 'USA', 'FCC', 'TARP' are capitalized in English.
To illustrate how easily this can get seriously complicated, let's try to write you that functional spec for a deterministic tokenizer just to decide whether single or multiple period ('.'/'...') indicates end-of-sentence, or something else:
function isEndOfSentence(leftContext, rightContext)
Return False for decimals inside numbers or currency e.g. 1.23 , $1.23, "That's just my $.02" Consider also section references like 1.2.A.3.a, European date formats like 09.07.2014, IP addresses like 192.168.1.1, MAC addresses...
Return False (and don't tokenize into individual letters) for known abbreviations e.g. "U.S. stocks are falling" ; this requires a dictionary of known abbreviations. Anything outside that dictionary you will get wrong, unless you add code to detect unknown abbreviations like A.B.C. and add them to a list.
Ellipses '...' at ends of sentences are terminal, but in the middle of sentences are not. This is not as easy as you might think: you need to look at the left context and the right context, specifically whether the RHS is capitalized, and again consider capitalized words like 'I' and abbreviations. Here's an example proving the ambiguity: She asked me to stay... I left an hour later. (Was that one sentence or two? Impossible to determine.)
You may also want to write a few patterns to detect and reject miscellaneous non-sentence-ending uses of punctuation: emoticons :-), ASCII art, spaced ellipses . . . and other stuff esp. Twitter. (Making that adaptive is even harder). How do we tell if #midnight is a Twitter user, the show on Comedy Central, text shorthand, or simply unwanted/junk/typo punctuation? Seriously non-trivial.
After you handle all those negative cases, you could arbitrarily say that any isolated period followed by whitespace is likely to be an end of sentence. (Ultimately, if you really want to buy extra accuracy, you end up writing your own probabilistic sentence-tokenizer which uses weights, and training it on a specific corpus (e.g. legal texts, broadcast media, StackOverflow, Twitter, forum comments etc.).) Then you have to manually review exemplars and training errors. See the Manning and Jurafsky book or the Coursera course [a].
Ultimately you get as much correctness as you are prepared to pay for.
All of the above is clearly specific to the English-language/ abbreviations, US number/time/date formats. If you want to make it country- and language-independent, that's a bigger proposition, you'll need corpora, native-speaking people to label and QA it all, etc.
All of the above is still only ASCII, which is practically speaking only 96 characters. Allow the input to be Unicode, and things get harder still (and the training-set necessarily must be either much bigger or much sparser)
In the simple (deterministic) case, function isEndOfSentence(leftContext, rightContext) would return boolean, but in the more general sense, it's probabilistic: it returns a float 0.0-1.0 (confidence level that that particular '.' is a sentence end).
References: [a] Coursera video: "Basic Text Processing 2-5 - Sentence Segmentation - Stanford NLP - Professor Dan Jurafsky & Chris Manning" [UPDATE: an unofficial version used to be on YouTube, was taken down]
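To make that spec concrete, here is a deliberately minimal sketch of the deterministic variant. It implements only the decimal and known-abbreviation checks described above; the abbreviation set and the final whitespace-plus-capital heuristic are assumptions, not a complete tokenizer:
import re

# Assumed, tiny dictionary; a real tokenizer needs a much larger one plus the
# other checks (ellipses, emoticons, section references, Unicode, ...).
KNOWN_ABBREVIATIONS = {"mr.", "mrs.", "dr.", "jr.", "st.", "i.e.", "e.g.", "u.s."}

def is_end_of_sentence(left_context, right_context):
    """Decide whether the '.' between left_context and right_context ends a sentence."""
    # Decimals, section references, IP addresses: digits on both sides of the period.
    if left_context[-1:].isdigit() and right_context[:1].isdigit():
        return False
    # Known abbreviations: the token ending the left context is in the dictionary.
    tokens = left_context.split()
    if tokens and (tokens[-1].lower() + '.') in KNOWN_ABBREVIATIONS:
        return False
    # Otherwise, a period followed by whitespace and a capital (or quote) is a boundary.
    return bool(re.match(r'\s+["\']?[A-Z]', right_context))

# is_end_of_sentence("bought it for 1", "5 million dollars")  -> False (decimal)
# is_end_of_sentence("He met Mr", " Smith on Tuesday")        -> False (abbreviation)
# is_end_of_sentence("He paid a lot for it", " Did he mind?") -> True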
Try to split the input on the spaces rather than on the dot or ?; if you do it like this, then the dot or ? won't be stripped from the final result.
>>> import re
>>> s = """Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."""
>>> m = re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', s)
>>> for i in m:
...     print(i)
...
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
sent = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)', text)
for s in sent:
    print(s)
Here the regex used is: (?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)
First block: (?<!\w\.\w.): a negative lookbehind (?<!) that blocks the split when the preceding characters are a word character (\w), a full stop (\.), a word character and any character, as in dotted tokens like i.e.
Second block: (?<![A-Z][a-z]\.): a negative lookbehind that blocks the split when preceded by an uppercase letter ([A-Z]), a lowercase letter ([a-z]) and a dot (\.), as in short abbreviations like Mr. or Jr.
Third block: (?<=\.|\?): a positive lookbehind requiring a dot (\.) or a question mark (\?) immediately before the split point.
Fourth block: (\s|[A-Z].*): this matches what comes after the dot or question mark from the third block: either whitespace (\s) or a sequence of characters starting with an uppercase letter ([A-Z].*).
This block is important to split input such as
Hello world.Hi I am here today.
i.e. whether there is a space or no space after the dot.
Naive approach for proper English sentences not starting with non-alphas and not containing quoted parts of speech:
import re

text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
EndPunctuation = re.compile(r'([\.\?\!]\s+)')
NonEndings = re.compile(r'(?:Mrs?|Jr|i\.e)\.\s*$')
parts = EndPunctuation.split(text)
sentence = []
for part in parts:
    if len(part) and len(sentence) and EndPunctuation.match(sentence[-1]) and not NonEndings.search(''.join(sentence)):
        print(''.join(sentence))
        sentence = []
    if len(part):
        sentence.append(part)
if len(sentence):
    print(''.join(sentence))
False positive splitting may be reduced by extending NonEndings a bit. Other cases will require additional code. Handling typos in a sensible way will prove difficult with this approach.
You will never reach perfection with this approach. But depending on the task it might just work "enough"...
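For instance, a slightly larger (still assumed, far from exhaustive) NonEndings might be:
import re

NonEndings = re.compile(r'(?:Mrs?|Ms|Dr|Prof|Sr|Jr|St|vs|etc|i\.e|e\.g)\.\s*$')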
I'm not great at regular expressions, but a simpler, "brute force" version of the above is
sentence = re.compile("([\'\"][A-Z]|([A-Z][a-z]*\. )|[A-Z])(([a-z]*\.[a-z]*\.)|([A-Za-z0-9]*\.[A-Za-z0-9])|([A-Z][a-z]*\. [A-Za-z]*)|[^\.?]|[A-Za-z])*[\.?]")
which means
acceptable start units are '[A-Z] or "[A-Z]
Please note that alternation (|) is tried from left to right, so the order is very important: that's why I have written the i.e. regular expression first, and only then forms like Inc.
Try this:
(?<!\b(?:[A-Z][a-z]|\d|[i.e]))\.(?!\b(?:com|\d+)\b)
I wrote this taking into consideration smci's comments above. It is a middle-of-the-road approach that doesn't require external libraries and doesn't use regex. It allows you to provide a list of abbreviations and accounts for sentences ended by terminators in wrappers, such as a period and quote: [.", ?', .)].
abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior', 'i.e.': 'for example', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']

def find_sentences(paragraph):
    end = True
    sentences = []
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences

def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end

def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)
I used Karl's find_all function from this entry: Find all occurrences of a substring in Python
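A quick usage sketch (the sample paragraph is my own; note the matching is case-sensitive, so only abbreviations written exactly as in the lowercase dictionary are skipped):
paragraph = 'He paid $1.50 for it. "Was it worth it?" she asked. Probably not...'
for sentence in find_sentences(paragraph):
    print(sentence)
# Should print:
# He paid $1.50 for it.
# "Was it worth it?" she asked.
# Probably not...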
My example is based on the example of Ali, adapted to Brazilian Portuguese. Thanks Ali.
import re

ABREVIACOES = ['sra?s?', 'exm[ao]s?', 'ns?', 'nos?', 'doc', 'ac', 'publ', 'ex', 'lv', 'vlr?', 'vls?',
               'exmo(a)', 'ilmo(a)', 'av', 'of', 'min', 'livr?', 'co?ls?', 'univ', 'resp', 'cli', 'lb',
               'dra?s?', '[a-z]+r\(as?\)', 'ed', 'pa?g', 'cod', 'prof', 'op', 'plan', 'edf?', 'func', 'ch',
               'arts?', 'artigs?', 'artg', 'pars?', 'rel', 'tel', 'res', '[a-z]', 'vls?', 'gab', 'bel',
               'ilm[oa]', 'parc', 'proc', 'adv', 'vols?', 'cels?', 'pp', 'ex[ao]', 'eg', 'pl', 'ref',
               '[0-9]+', 'reg', 'f[ilí]s?', 'inc', 'par', 'alin', 'fts', 'publ?', 'ex', 'v. em', 'v.rev']

ABREVIACOES_RGX = re.compile(r'(?:{})\.\s*$'.format(r'|\s'.join(ABREVIACOES)), re.IGNORECASE)
def sentencas(texto, min_len=5):
    # based on https://stackoverflow.com/questions/25735644/python-regex-for-splitting-text-into-sentences-sentence-tokenizing
    texto = re.sub(r'\s\s+', ' ', texto)
    EndPunctuation = re.compile(r'([\.\?\!]\s+)')
    # print(NonEndings)
    parts = EndPunctuation.split(texto)
    sentencas = []
    sentence = []
    for part in parts:
        txt_sent = ''.join(sentence)
        q_len = len(txt_sent)
        if len(part) and len(sentence) and q_len >= min_len and \
                EndPunctuation.match(sentence[-1]) and \
                not ABREVIACOES_RGX.search(txt_sent):
            sentencas.append(txt_sent)
            sentence = []
        if len(part):
            sentence.append(part)
    if sentence:
        sentencas.append(''.join(sentence))
    return sentencas
Full code in: https://github.com/luizanisio/comparador_elastic
If you want to break up sentences at 3 periods (not sure if this is what you want), you can use this regular expression:
import re
text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r'\.{3}', text)
for stuff in sentences:
    print(stuff)
