Long regex replace requires multiple passes to finish - why? - python

I've checked the site for an answer to this question and exhausted Google and my own patience trying to answer it myself, so here it is. Happy to be pointed to the answer if this is a dupe.
So I have a long regex--nothing complicated, just a bunch of simple conditions piped together. I'm using it to remove the piped words from the beginnings and ends of named entities I've extracted from news article data. The use case is, many of the names have these short words within them (think Centers for Disease Control and Prevention) but I want to remove the words when they appear at the beginning or end of the name. E.g., I don't want "Centers for Disease Control" counted differently from "the Centers for Disease Control" for obvious reasons.
I used this regex string on a large (>1M) list of named entities in Python 3.7.2 using the following code (file here):
import re

with open('pnames.csv','r') as f:
    named_entities = f.read().splitlines()

print(len([i for i in named_entities if i == 'the wall street journal']))
# 146

short_words = r"^and\s|\sand$|^at\s|\sat$|^by\s|\sby$|^for\s|\sfor$|^in\s|\sin$|^of\s|\sof$|^on\s|\son$|^the\s|\sthe$|^to\s|\sto$"

cleaned_entities = [re.sub(short_words, "", i)
                    for i
                    in named_entities]
print(len([i for i in cleaned_entities
           if i == 'the wall street journal']))
# 80 (huh, should be 0. Let me try again...)

cleaned_entities2 = [re.sub(short_words, "", i)
                     for i
                     in cleaned_entities]
print(len([i for i in cleaned_entities2
           if i == 'the wall street journal']))
# 1 (better, but still unexpected. One more time...)

cleaned_entities3 = [re.sub(short_words, "", i)
                     for i
                     in cleaned_entities2]
print(len([i for i in cleaned_entities3
           if i == 'the wall street journal']))
# 0 (this is what I expected on the first run!)
My question is, why doesn't the regex remove all the matching substrings in one pass? i.e., why is len([i for i in cleaned_entities if i == 'the wall street journal']) not equal to 0? Why does it take multiple runs to finish the job?
Things I've tried:
Restarting Spyder
Running the same code in Python 3.7.2, Python 3.6.2, and equivalent code in R 3.4.2 (the Pythons gave the exact same results, and R gave different numbers but I still had to run it several times to get to zero)
Running the code only on the substrings that match the regex (same result)
Running the code only on the strings that equal "the wall street journal" (works in one pass)
Substituting the regex "^the " in the above code (fixes all matches in one pass)
So yeah, any ideas would be helpful.

Your regular expression will only ever remove one layer of unwanted words per pass. So if you had a sentence such as:
and and at by in of the the wall street journal at the by on the
it would have needed many passes to completely remove everything.
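For example, with a doubled leading word the original pattern strips only one layer per pass. A quick sketch using the question's pattern (the comments show what each pass leaves behind):
import re

short_words = r"^and\s|\sand$|^at\s|\sat$|^by\s|\sby$|^for\s|\sfor$|^in\s|\sin$|^of\s|\sof$|^on\s|\son$|^the\s|\sthe$|^to\s|\sto$"

s = "the the wall street journal"
s = re.sub(short_words, "", s)
print(repr(s))  # 'the wall street journal' -- only the outer 'the ' was removed
s = re.sub(short_words, "", s)
print(repr(s))  # 'wall street journal' -- a second pass is needed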
The expression can be rearranged to make use of + to indicate one or more occurrences, as follows:
import re

with open('pnames2.csv','r') as f:
    named_entities = f.read().splitlines()

print(len([i for i in named_entities if i == 'the wall street journal']))
# 146

short_words = r"^((and|at|by|for|in|of|on|the|to)\s)+|(\s(and|at|by|for|in|of|on|the|to))+$"
re_sw = re.compile(short_words)

cleaned_entities = [re_sw.sub("", i) for i in named_entities]
print(len([i for i in cleaned_entities if i == 'the wall street journal']))
# 0
The process can be sped up slightly by pre-compiling the regular expression. It would be even faster if you applied it to
the whole file rather than applying it on a line by line basis.
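As a rough sketch of that whole-file idea (an untested assumption on my part; [ \t] replaces \s so the pattern cannot run across line breaks when the file is processed in one go):
import re

short_words = r"^((and|at|by|for|in|of|on|the|to)[ \t])+|([ \t](and|at|by|for|in|of|on|the|to))+$"
re_sw = re.compile(short_words, re.MULTILINE)  # ^ and $ now anchor at every line

with open('pnames2.csv', 'r') as f:
    cleaned_entities = re_sw.sub("", f.read()).splitlines()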

Related

Why doesn't this pattern with a negative look-ahead restrict these overrides of the re.sub() function?

Using a negative look-ahead X(?!Y), which checks that Y is NOT ahead of the match, the goal is to identify substrings "they" that are not preceded by the sequence ((PERS)the, and to replace each such "they" with the string "((PERS)they NO DATA) ". Otherwise no replacement should be made.
import re

# Example 1 :
input_text = "They are great friends, the cellos of they, soon they became best friends. They saw each other in the park before taking the old cabinets, since ((PERS)the printers) were still useful to the company themselves. They are somewhat worse than the new models."

# Example 2 :
input_text = "They finished the flow chart pretty quickly"

input_text = re.sub(r"\(\(PERS\)\s*the", "((PERS)the", input_text)

#constraint_pattern = r"\bthey\b(?<!\(\(PERS\)/s*the)"  # --> re.error: look-ahead requires fixed-width pattern
constraint_pattern = r"\bellos\b(?<!\(\(PERS\)the)"

input_text = re.sub(constraint_pattern,
                    "((PERS)they NO DATA)",
                    input_text, flags=re.IGNORECASE)

print(input_text)  # --> output
Using this code, for some reason all occurrences of the "ellos" substring are replaced with "((PERS)they NO DATA)", but really only "they" substrings that are NOT preceded by the sequence "((PERS)the" should be replaced with "((PERS)they NO DATA)".
The goal is really to get this output:
#correct output for example 1
"((PERS)they NO DATA) are great friends, the cellos of ((PERS)they NO DATA), soon ((PERS)they NO DATA) became best friends. ((PERS)they NO DATA) saw each other in the park before taking the old cabinets, since ((PERS)the printers) were still useful to the company themselves. They are somewhat worse than the new models."
#correct output for example 2
"((PERS)they NO DATA) finished the flow chart pretty quickly"

How to efficiently search for a list of strings in another list of strings using Python?

I have two list of names (strings) that look like this:
executives = ['Brian Olsavsky', 'Some Guy', 'Some Lady']
analysts = ['Justin Post', 'Some Dude', 'Some Chick']
I need to find where those names occur in a list of strings that looks like this:
str = ['Justin Post - Bank of America',
       "Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.",
       "I know it's a tough 3Q comp, but could you comment a little bit about that?",
       'Brian Olsavsky - Amazon.com',
       "Thank you, Justin. Yeah, let me just remind you a couple of things from last year.",
       "We had two reactions on our Super Saver Shipping threshold in the first half.",
       "I'll just remind you that the units those do not count",
       "In-stock is very strong, especially as we head into the holiday period.",
       'Dave Fildes - Amazon.com',
       "And, Justin, this is Dave. Just to add on to that. You mentioned the online stores."]
The reason I need to do this is so that I can concatenate the conversation strings together (which are separated by the names). How would I go about doing this efficiently?
I looked at some similar questions and tried the solutions to no avail, such as this:
if any(x in str for x in executives):
    print('yes')
and this ...
match = next((x for x in executives if x in str), False)
match
I'm not sure if that is what you are looking for:
executives = ['Brian Olsavsky', 'Some Guy', 'Some Lady']

text = ['Justin Post - Bank of America',
        "Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.",
        "I know it's a tough 3Q comp, but could you comment a little bit about that?",
        'Brian Olsavsky - Amazon.com',
        "Thank you, Justin. Yeah, let me just remind you a couple of things from last year.",
        "We had two reactions on our Super Saver Shipping threshold in the first half.",
        "I'll just remind you that the units those do not count",
        "In-stock is very strong, especially as we head into the holiday period.",
        'Dave Fildes - Amazon.com',
        "And, Justin, this is Dave. Just to add on to that. You mentioned the online stores."]

result = [s for s in text if any(ex in s for ex in executives)]
print(result)
output:
['Brian Olsavsky - Amazon.com']
str = ['Justin Post - Bank of America',
       "Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.",
       "I know it's a tough 3Q comp, but could you comment a little bit about that?",
       'Brian Olsavsky - Amazon.com',
       "Thank you, Justin. Yeah, let me just remind you a couple of things from last year.",
       "We had two reactions on our Super Saver Shipping threshold in the first half.",
       "I'll just remind you that the units those do not count",
       "In-stock is very strong, especially as we head into the holiday period.",
       'Dave Fildes - Amazon.com',
       "And, Justin, this is Dave. Just to add on to that. You mentioned the online stores"]

executives = ['Brian Olsavsky', 'Justin', 'Some Guy', 'Some Lady']
As an addition, if you need the exact location, you could use this:
print([[i, str.index(q), q.index(i)] for i in executives for q in str if i in q ])
this outputs
[['Brian Olsavsky', 3, 0], ['Justin', 0, 0], ['Justin', 4, 11], ['Justin', 9, 5]]
TLDR
This answer focuses on efficiency. Use the other answers if efficiency is not a key issue. If it is, make a dict from the corpus you are searching in, then use this dict to find what you are looking for.
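As a minimal sketch of that idea on the question's own data (trimmed to three lines here):
executives = ['Brian Olsavsky', 'Some Guy', 'Some Lady']
text = ['Justin Post - Bank of America',
        'Brian Olsavsky - Amazon.com',
        'Dave Fildes - Amazon.com']

# map each word to the set of line indices it occurs in
worddict = {}
for i, line in enumerate(text):
    for word in line.split(' '):
        worddict.setdefault(word, set()).add(i)

for name in executives:
    parts = name.split(' ')
    candidates = set.intersection(*(worddict.get(p, set()) for p in parts))
    hits = [i for i in candidates if name in text[i]]
    print(name, hits)
# Brian Olsavsky [1]
# Some Guy []
# Some Lady []
The rest of this answer benchmarks the same idea on a synthetic corpus.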
#import stuff we need later
import string
import random
import numpy as np
import time
import matplotlib.pyplot as plt
Create example corpus
First we create a list of strings which we will search in.
Create random words, by which I mean random sequence of characters, with a length drawn from a Poisson distribution, with this function:
def poissonlength_words(lam_word):  # generating words, length chosen from a Poisson distrib
    return ''.join([random.choice(string.ascii_lowercase) for _ in range(np.random.poisson(lam_word))])
(lam_word being the parameter of the Poisson distribution.)
Let's create number_of_sentences variable length sentences from these words (by sentence I mean list of the randomly generated words separated by spaces).
The length of the sentences can also be drawn from a Poisson distribution.
lam_word = 5
lam_sentence = 1000
number_of_sentences = 10000

sentences = [' '.join([poissonlength_words(lam_word) for _ in range(np.random.poisson(lam_sentence))])
             for x in range(number_of_sentences)]
sentences[0] now will start like this:
tptt lxnwf iem fedg wbfdq qaa aqrys szwx zkmukc...
Let's also create names, which we will search for. Let these names be bigrams. The first name (i.e. the first element of the bigram) will be n characters long, the last name (the second bigram element) will be m characters long, and both will consist of random characters:
def bigramgen(n, m):
    return ''.join([random.choice(string.ascii_lowercase) for _ in range(n)]) + ' ' + \
           ''.join([random.choice(string.ascii_lowercase) for _ in range(m)])
The task
Let's say we want to find sentences in which a bigram such as ab c appears. We don't want to find dab c or ab cd, only places where ab c stands alone.
To test how fast a method is, let's find an ever-increasing number of bigrams, and measure the elapsed time. The number of bigrams we search for can be, for example:
number_of_bigrams_we_search_for = [10,30,50,100,300,500,1000,3000,5000,10000]
Brute force method
Just loop through each bigram, loop through each sentence, use in to find matches. Meanwhile, measure elapsed time with time.time().
bruteforcetime = []

for number_of_bigrams in number_of_bigrams_we_search_for:
    bigrams = [bigramgen(2,1) for _ in range(number_of_bigrams)]
    start = time.time()
    for bigram in bigrams:
        # the core of the brute force method starts here
        reslist = []
        for sentencei, sentence in enumerate(sentences):
            if ' ' + bigram + ' ' in sentence:
                reslist.append([bigram, sentencei])
        # and ends here
    end = time.time()
    bruteforcetime.append(end - start)
bruteforcetime will hold the number of seconds necessary to find 10, 30, 50 ... bigrams.
Warning: this might take a long time for a high number of bigrams.
The sort your stuff to make it quicker method
Let's create an empty set for every word appearing in any of the sentences (using dict comprehension):
worddict={word:set() for sentence in sentences for word in sentence.split(' ')}
To each of these sets, add the index of each word in which it appears:
for sentencei, sentence in enumerate(sentences):
    for wordi, word in enumerate(sentence.split(' ')):
        worddict[word].add(sentencei)
Note that we only do this once, no matter how many bigrams we search later.
Using this dictionary, we can search for the sentences in which each part of the bigram appears. This is very fast, since looking up a dict value is very fast. We then take the intersection of these sets. When we are searching for ab c, we end up with a set of sentence indexes where both ab and c appear.
for bigram in bigrams:
    reslist = []
    setlist = [worddict[gram] for gram in bigram.split(' ')]
    intersection = set.intersection(*setlist)
    for candidate in intersection:
        if bigram in sentences[candidate]:
            reslist.append([bigram, candidate])
Let's put the whole thing together, and measure the time elapsed:
logtime = []

for number_of_bigrams in number_of_bigrams_we_search_for:
    bigrams = [bigramgen(2,1) for _ in range(number_of_bigrams)]
    start_time = time.time()
    worddict = {word: set() for sentence in sentences for word in sentence.split(' ')}
    for sentencei, sentence in enumerate(sentences):
        for wordi, word in enumerate(sentence.split(' ')):
            worddict[word].add(sentencei)
    for bigram in bigrams:
        reslist = []
        setlist = [worddict[gram] for gram in bigram.split(' ')]
        intersection = set.intersection(*setlist)
        for candidate in intersection:
            if bigram in sentences[candidate]:
                reslist.append([bigram, candidate])
    end_time = time.time()
    logtime.append(end_time - start_time)
Warning: this might take a long time for a high number of bigrams, but less time than the brute force method.
Results
We can plot how much time each method took.
plt.plot(number_of_bigrams_we_search_for, bruteforcetime,label='linear')
plt.plot(number_of_bigrams_we_search_for, logtime,label='log')
plt.legend()
plt.xlabel('Number of bigrams searched')
plt.ylabel('Time elapsed (sec)')
Or, plotting the y axis on a log scale:
plt.plot(number_of_bigrams_we_search_for, bruteforcetime,label='linear')
plt.plot(number_of_bigrams_we_search_for, logtime,label='log')
plt.yscale('log')
plt.legend()
plt.xlabel('Number of bigrams searched')
plt.ylabel('Time elapsed (sec)')
Giving us the plots:
Building the worddict dictionary takes a lot of time, which is a disadvantage when searching for a small number of names. There is a point, however, where the corpus is big enough and the number of names we are searching for is high enough that this setup time is compensated by the speed of searching in it, compared to the brute force method. So, if these conditions are satisfied, I recommend using this method.
(Notebook available here.)

Delete regex matches within loop and continue with updated string version

I have a string that I want to run through four wordlists: one with four-grams, one with trigrams, one with bigrams and one with single terms. To avoid a word from the single-term wordlist being counted twice when it also forms part of a bigram or trigram, for example, I start by counting four-grams, then want to update the string by removing the matches so that only the remaining part of the string is checked for trigram, bigram and single-term matches, respectively. I have used the following code, illustrated here starting with four-grams and then trigrams:
import re

financial_trigrams_count = 0
financial_fourgrams_count = 0

strn = "thank you, john, and good morning, everyone. with me today is tim, our chief financial officer."

pattern_fourgrams = ["value to the business", "car and truck sales"]
pattern_trigrams = ["cash flow statement", "chief financial officer"]

for i in pattern_fourgrams:
    financial_fourgrams_count = financial_fourgrams_count + strn.count(i)

new_strn = strn
def clean_text1(pattern_fourgrams, new_strn):
    for r in pattern_fourgrams:
        new_strn = re.sub(r, '', new_strn)
    return new_strn

for i in pattern_trigrams:
    financial_trigrams_count = financial_trigrams_count + new_strn.count(i)

new_strn1 = new_strn
def clean_text2(pattern_trigrams, new_strn1):
    for r in pattern_trigrams:
        new_strn1 = re.sub(r, '', new_strn1)
    return new_strn1

print(financial_fourgrams_count)
print(financial_trigrams_count)

word_count_wostop = len(strn.split())
print(word_count_wostop)
For four-grams there is no match, so new_strn will be the same as strn. There is one match with the trigrams ("chief financial officer"), however, and I do not succeed in deleting the match from new_strn1. Instead, I again get the full string, namely strn (or new_strn, which is the same).
Could someone help me find the mistake here?
(As a complement to Tilak Putta's answer)
Note that you are searching the string twice: once when counting the occurrences of the ngrams with .count() and once more when you remove the matches using re.sub().
You can increase performance by counting and removing at the same time.
This can be done using re.subn. This function takes the same parameters as re.sub but returns a tuple containing the cleaned string as well as the number of matches.
Example:
for i in pattern_fourgrams:
    new_strn, n = re.subn(i, '', new_strn)
    financial_fourgrams_count += n
Note that this assumes the n-grams are pairwise different (for fixed n), i.e. they shouldn't share a common word, since subn will delete that word the first time it sees it and thus won't be able to find occurrences of other n-grams containing that particular word.
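Putting it together on the question's inputs, a minimal sketch of a full subn-based version could look like this (the values in the comments are what I would expect it to print):
import re

strn = "thank you, john, and good morning, everyone. with me today is tim, our chief financial officer."
pattern_fourgrams = ["value to the business", "car and truck sales"]
pattern_trigrams = ["cash flow statement", "chief financial officer"]

financial_fourgrams_count = 0
financial_trigrams_count = 0

new_strn = strn
for p in pattern_fourgrams:
    # re.escape(p) would be safer if an n-gram ever contained regex metacharacters
    new_strn, n = re.subn(p, '', new_strn)
    financial_fourgrams_count += n

for p in pattern_trigrams:
    new_strn, n = re.subn(p, '', new_strn)
    financial_trigrams_count += n

print(financial_fourgrams_count)  # 0
print(financial_trigrams_count)   # 1
print(new_strn)                   # "thank you, john, and good morning, everyone. with me today is tim, our ."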
You need to remove the def statements:
import re

financial_trigrams_count = 0
financial_fourgrams_count = 0

strn = "thank you, john, and good morning, everyone. with me today is tim, our chief financial officer."

pattern_fourgrams = ["value to the business", "car and truck sales"]
pattern_trigrams = ["cash flow statement", "chief financial officer"]

for i in pattern_fourgrams:
    financial_fourgrams_count = financial_fourgrams_count + strn.count(i)

new_strn = strn
for r in pattern_fourgrams:
    new_strn = re.sub(r, '', new_strn)

for i in pattern_trigrams:
    financial_trigrams_count = financial_trigrams_count + new_strn.count(i)

new_strn1 = new_strn
for r in pattern_trigrams:
    new_strn1 = re.sub(r, '', new_strn1)

print(new_strn1)
print(financial_fourgrams_count)
print(financial_trigrams_count)

word_count_wostop = len(strn.split())
print(word_count_wostop)

Python - RegEx for splitting text into sentences (sentence-tokenizing) [duplicate]

This question already has answers here:
How can I split a text into sentences?
(20 answers)
Closed 3 years ago.
I want to make a list of sentences from a string and then print them out. I don't want to use NLTK to do this. So it needs to split on a period at the end of a sentence and not at decimals, abbreviations, the title of a name, or when the sentence contains a .com. This is my attempt at a regex that doesn't work:
import re

text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
for stuff in sentences:
    print(stuff)
Example output of what it should look like
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s
Try this to split your string. You can also check the demo:
http://regex101.com/r/nG1gU7/27
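For reference, plugging that pattern into re.split on the question's sample text gives the expected sentences (a minimal usage sketch):
import re

text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. "
        "Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... "
        "Well, with a probability of .9 it isn't.")

for s in re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text):
    print(s)
# Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
# Did he mind?
# Adam Jones Jr. thinks he didn't.
# In any case, this isn't true...
# Well, with a probability of .9 it isn't.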
Ok so sentence-tokenizers are something I looked at in a little detail, using regexes, nltk, CoreNLP, spaCy. You end up writing your own and it depends on the application. This stuff is tricky and valuable and people don't just give their tokenizer code away. (Ultimately, tokenization is not a deterministic procedure, it's probabilistic, and also depends very heavily on your corpus or domain, e.g. legal/financial documents vs social-media posts vs Yelp reviews vs biomedical papers...)
In general you can't rely on one single Great White infallible regex, you have to write a function which uses several regexes (both positive and negative); also a dictionary of abbreviations, and some basic language parsing which knows that e.g. 'I', 'USA', 'FCC', 'TARP' are capitalized in English.
To illustrate how easily this can get seriously complicated, let's try to write a functional spec for a deterministic tokenizer just to decide whether a single or multiple period ('.'/'...') indicates end-of-sentence, or something else:
function isEndOfSentence(leftContext, rightContext)
Return False for decimals inside numbers or currency, e.g. 1.23, $1.23, "That's just my $.02". Consider also section references like 1.2.A.3.a, European date formats like 09.07.2014, IP addresses like 192.168.1.1, MAC addresses...
Return False (and don't tokenize into individual letters) for known abbreviations e.g. "U.S. stocks are falling" ; this requires a dictionary of known abbreviations. Anything outside that dictionary you will get wrong, unless you add code to detect unknown abbreviations like A.B.C. and add them to a list.
Ellipses '...' at ends of sentences are terminal, but in the middle of sentences are not. This is not as easy as you might think: you need to look at the left context and the right context, specifically whether the RHS is capitalized, and again consider capitalized words like 'I' and abbreviations. Here's an example proving the ambiguity: She asked me to stay... I left an hour later. (Was that one sentence or two? Impossible to determine.)
You may also want to write a few patterns to detect and reject miscellaneous non-sentence-ending uses of punctuation: emoticons :-), ASCII art, spaced ellipses . . . and other stuff esp. Twitter. (Making that adaptive is even harder). How do we tell if #midnight is a Twitter user, the show on Comedy Central, text shorthand, or simply unwanted/junk/typo punctuation? Seriously non-trivial.
After you handle all those negative cases, you could arbitrarily say that any isolated period followed by whitespace is likely to be an end of sentence. (Ultimately, if you really want to buy extra accuracy, you end up writing your own probabilistic sentence-tokenizer which uses weights, and training it on a specific corpus (e.g. legal texts, broadcast media, StackOverflow, Twitter, forum comments etc.).) Then you have to manually review exemplars and training errors. See the Manning and Jurafsky book or the Coursera course [a].
Ultimately you get as much correctness as you are prepared to pay for.
All of the above is clearly specific to the English-language/ abbreviations, US number/time/date formats. If you want to make it country- and language-independent, that's a bigger proposition, you'll need corpora, native-speaking people to label and QA it all, etc.
All of the above is still only ASCII, which is practically speaking only 96 characters. Allow the input to be Unicode, and things get harder still (and the training-set necessarily must be either much bigger or much sparser)
In the simple (deterministic) case, function isEndOfSentence(leftContext, rightContext) would return boolean, but in the more general sense, it's probabilistic: it returns a float 0.0-1.0 (confidence level that that particular '.' is a sentence end).
References: [a] Coursera video: "Basic Text Processing 2-5 - Sentence Segmentation - Stanford NLP - Professor Dan Jurafsky & Chris Manning" [UPDATE: an unofficial version used to be on YouTube, was taken down]
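To make the checklist concrete, here is a deliberately minimal, deterministic Python sketch of the isEndOfSentence idea; the abbreviation set and heuristics are illustrative assumptions only, nowhere near a complete tokenizer:
# illustrative, incomplete abbreviation list -- an assumption for this sketch
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "jr", "sr", "i.e", "e.g", "u.s"}

def is_end_of_sentence(left_context, right_context):
    """Return True if the '.' between the two contexts looks like a sentence end."""
    right = right_context.lstrip()
    # decimals, section references, IPs: a digit on both sides of the period
    if left_context[-1:].isdigit() and right[:1].isdigit():
        return False
    # known abbreviation right before the period
    tokens = left_context.split()
    if tokens and tokens[-1].lower() in ABBREVIATIONS:
        return False
    # mid-sentence ellipsis or lowercase continuation
    if right[:1].islower():
        return False
    # default: a period followed by whitespace and a capital is treated as a boundary
    return True

print(is_end_of_sentence("He paid $1", "5 million"))                # False (decimal)
print(is_end_of_sentence("Adam Jones Jr", " thinks he didn't"))     # False (abbreviation)
print(is_end_of_sentence("he paid a lot for it", " Did he mind?"))  # True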
Try to split the input on the spaces rather than on a dot or ?; that way the dot or ? is kept in the final result instead of being consumed by the split.
>>> import re
>>> s = """Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."""
>>> m = re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', s)
>>> for i in m:
...     print(i)
...
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
sent = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)', text)
for s in sent:
    print(s)
Here the regex used is : (?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)
First block: (?<!\w\.\w.): this is a negative lookbehind (?<!) that rejects a split point preceded by a word character (\w), a full stop (\.), another word character and one further character, as in i.e.
Second block: (?<![A-Z][a-z]\.): this negative lookbehind rejects a split point preceded by an uppercase letter ([A-Z]), a lowercase letter ([a-z]) and a dot (\.), as in Mr. or Jr.
Third block: (?<=\.|\?): this positive lookbehind requires a dot (\.) OR a question mark (\?) immediately before the split point.
Fourth block: (\s|[A-Z].*): this matches what comes after the dot or question mark from the third block: either a whitespace character (\s) OR any sequence of characters starting with an uppercase letter ([A-Z].*).
This last block is important for splitting input such as
Hello world.Hi I am here today.
i.e. regardless of whether there is a space after the dot.
A naive approach for proper English sentences that do not start with non-alphabetic characters and do not contain quoted parts of speech:
import re

text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
EndPunctuation = re.compile(r'([\.\?\!]\s+)')
NonEndings = re.compile(r'(?:Mrs?|Jr|i\.e)\.\s*$')
parts = EndPunctuation.split(text)
sentence = []
for part in parts:
    if len(part) and len(sentence) and EndPunctuation.match(sentence[-1]) and not NonEndings.search(''.join(sentence)):
        print(''.join(sentence))
        sentence = []
    if len(part):
        sentence.append(part)
if len(sentence):
    print(''.join(sentence))
False positive splitting may be reduced by extending NonEndings a bit. Other cases will require additional code. Handling typos in a sensible way will prove difficult with this approach.
You will never reach perfection with this approach. But depending on the task it might just work "enough"...
I'm not great at regular expressions, but a simpler, "brute force" version of the above is
sentence = re.compile("([\'\"][A-Z]|([A-Z][a-z]*\. )|[A-Z])(([a-z]*\.[a-z]*\.)|([A-Za-z0-9]*\.[A-Za-z0-9])|([A-Z][a-z]*\. [A-Za-z]*)|[^\.?]|[A-Za-z])*[\.?]")
which means
acceptable starting units are '[A-Z] or "[A-Z]
please note, most regular expressions are greedy, so order is very important when we use | (or). That's why I have written the i.e. regular expression first, and then come forms like Inc.
Try this:
(?<!\b(?:[A-Z][a-z]|\d|[i.e]))\.(?!\b(?:com|\d+)\b)
I wrote this taking into consideration smci's comments above. It is a middle-of-the-road approach that doesn't require external libraries and doesn't use regex. It allows you to provide a list of abbreviations and accounts for sentences ended by terminators in wrappers, such as a period and quote: [.", ?', .)].
abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior', 'i.e.': 'for example', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']

def find_sentences(paragraph):
    end = True
    sentences = []
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences

def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end

def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)
I used Karl's find_all function from this entry: Find all occurrences of a substring in Python
My example is based on the example of Ali, adapted to Brazilian Portuguese. Thanks Ali.
import re

ABREVIACOES = ['sra?s?', 'exm[ao]s?', 'ns?', 'nos?', 'doc', 'ac', 'publ', 'ex', 'lv', 'vlr?', 'vls?',
               'exmo(a)', 'ilmo(a)', 'av', 'of', 'min', 'livr?', 'co?ls?', 'univ', 'resp', 'cli', 'lb',
               'dra?s?', '[a-z]+r\(as?\)', 'ed', 'pa?g', 'cod', 'prof', 'op', 'plan', 'edf?', 'func', 'ch',
               'arts?', 'artigs?', 'artg', 'pars?', 'rel', 'tel', 'res', '[a-z]', 'vls?', 'gab', 'bel',
               'ilm[oa]', 'parc', 'proc', 'adv', 'vols?', 'cels?', 'pp', 'ex[ao]', 'eg', 'pl', 'ref',
               '[0-9]+', 'reg', 'f[ilí]s?', 'inc', 'par', 'alin', 'fts', 'publ?', 'ex', 'v. em', 'v.rev']

ABREVIACOES_RGX = re.compile(r'(?:{})\.\s*$'.format('|\s'.join(ABREVIACOES)), re.IGNORECASE)

def sentencas(texto, min_len=5):
    # based on https://stackoverflow.com/questions/25735644/python-regex-for-splitting-text-into-sentences-sentence-tokenizing
    texto = re.sub(r'\s\s+', ' ', texto)
    EndPunctuation = re.compile(r'([\.\?\!]\s+)')
    # print(NonEndings)
    parts = EndPunctuation.split(texto)
    sentencas = []
    sentence = []
    for part in parts:
        txt_sent = ''.join(sentence)
        q_len = len(txt_sent)
        if len(part) and len(sentence) and q_len >= min_len and \
           EndPunctuation.match(sentence[-1]) and \
           not ABREVIACOES_RGX.search(txt_sent):
            sentencas.append(txt_sent)
            sentence = []
        if len(part):
            sentence.append(part)
    if sentence:
        sentencas.append(''.join(sentence))
    return sentencas
Full code in: https://github.com/luizanisio/comparador_elastic
If you want to break up sentences at 3 periods (not sure if this is what you want), you can use this regular expression:
import re

text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r'\.{3}', text)
for stuff in sentences:
    print(stuff)

How to use text.split() and retain blank (empty) lines

New to python, need some help with my program. I have a code which takes in an unformatted text document, does some formatting (sets the pagewidth and the margins), and outputs a new text document. My entire code works fine except for this function which produces the final output.
Here is the segment of the problem code:
def process(document, pagewidth, margins, formats):
    res = []
    onlypw = []
    pwmarg = []
    count = 0
    marg = 0

    for segment in margins:
        for i in range(count, segment[0]):
            res.append(document[i])
        text = ''
        foundmargin = -1
        for i in range(segment[0], segment[1]+1):
            marg = segment[2]
            text = text + '\n' + document[i].strip(' ')
        words = text.split()
Note: segment[0] means the beginning of the document, and segment[1] just means the end of the document, if you are wondering about the range. My problem is that when I copy text to words (in words = text.split()) it does not retain my blank lines. The output I should be getting is:
This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I
quietly take to the ship. There is nothing surprising in
this. If they but knew it, almost all men in their degree,
some time or other, cherish very nearly the same feelings
towards the ocean with me.
There now is your insular city of the Manhattoes, belted
round by wharves as Indian isles by coral reefs--commerce
surrounds it with her surf.
And what my current output looks like:
This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I
quietly take to the ship. There is nothing surprising in
this. If they but knew it, almost all men in their degree,
some time or other, cherish very nearly the same feelings
towards the ocean with me. There now is your insular city of
the Manhattoes, belted round by wharves as Indian isles by
coral reefs--commerce surrounds it with her surf.
I know the problem happens when I copy text to words, since it doesn't keep the blank lines. How can I make sure it copies the blank lines plus the words?
Please let me know if I should add more code or more detail!
First split on at least 2 newlines, then split on words:
import re
paragraphs = re.split('\n\n+', text)
words = [paragraph.split() for paragraph in paragraphs]
You now have a list of lists, one per paragraph; process these per paragraph, after which you can rejoin the whole thing into new text with double newlines inserted back in.
I've used re.split() to support paragraphs being delimited by more than 2 newlines; you could use a simple text.split('\n\n') if there are ever only going to be exactly 2 newlines between paragraphs.
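A minimal sketch of that round trip, with the per-paragraph processing step left as a placeholder:
import re

text = "This is my substitute for pistol and ball.\n\nThere now is your insular city of the Manhattoes."

paragraphs = re.split('\n\n+', text)
words = [paragraph.split() for paragraph in paragraphs]

# ... reflow / wrap each paragraph's word list here as needed ...
formatted = [' '.join(w) for w in words]

# put the blank lines back between paragraphs
new_text = '\n\n'.join(formatted)
print(new_text)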
Use a regexp to find the words and the blank lines rather than split:
m = re.compile(r'(\S+|\n\n)')
words = m.findall(text)
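For example, on a small two-paragraph string the blank line survives as its own '\n\n' token (a quick sketch):
import re

text = "This is my substitute for pistol and ball.\n\nThere now is your insular city."
m = re.compile(r'(\S+|\n\n)')
words = m.findall(text)
print(words)
# ['This', 'is', 'my', 'substitute', 'for', 'pistol', 'and', 'ball.', '\n\n', 'There', 'now', 'is', 'your', 'insular', 'city.']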
