Python: getting a certain no. of strings from a dictionary - python

I have a dictionary in the following format. I split the elements wherever a comma (,) occurred using split(), and I am now trying to extract the names from the resulting list. I am trying to use regular expressions but, being new to Python, I am failing miserably. The names come in the following formats:
firstname(space)last name
name(space)name(space)name
x.name
x.y.name
name(space) x.(space)(name)
where x and y represent a name initial, like J. for John.
Also, if you can guide me in removing the "\t" while keeping the other information intact, that would be great.
Any sort of help would be more than welcome. Thank you all.
[[' I. Antonov', ' I. Antonova', ' E. R. Kandel', ' and R. D. Hawkins. Activity-dependent presynaptic facilitation and hebbian ltp are both required and interact during classical conditioning in aplysia. Neuron', ' 37(1):135--47', ' Jan 2003.'], ['\tSander M. Bohte ', ' Joost N. Kok', ' Applications of spiking neural networks', ' Information Processing Letters', ' v.95 n.6', ' p.519-520'], [' L. J. Eshelman. The CHC Adaptive Search Algorithm: How to Have Safe Search When Engaging in Nontraditional Genetic Recombination. Foundations Of Genetic Algorithms', ' pages 265-283', ' 1990.'], ['Wulfram Gerstner ', ' Werner Kistler', ' Spiking Neuron Models: An Introduction', ' Cambridge University Press', ''], [' D. O. Hebb. Organization of behavior. New York: Wiley', ' 1949.'], [' D. Z. Jin. Spiking neural network for recognizing spatiotemporal sequences of spikes. Physical Review E', '69', ' 2004.'], ['Wolfgang Maass ', ' Christopher M. Bishop', ' Pulsed Neural Networks', ' MIT Press', ' '], ['Wolfgang Maass ', ' Henry Markram', ' Synapses as dynamic memory buffers', ' Neural Networks', ' v.15 n.2', ' p.'], [' H. Markram', ' Y. Wang', ' and M. Tsodyks. Differential signaling via the same axon of neocortical pyramidal neurons. Neurobiology', ' 95:5323--5328', ' April 1998.'], ['\t\tD. E. Rumelhart ', ' G. E. Hinton ', ' R. J. Williams', ' Learning internal representations by error propagation', ' Parallel distributed processing: explorations in the microstructure of cognition', ' vol. 1: foundations', ' MIT Press', ' Cambridge', ' MA', ' 1986 </a> \t\t\t\t\t\t\t\t\t'], ['\t J. D. Schaffer', ' L. D. Whitley', ' and L. J. Eshelman. Combinations of genetic algorithms and neural networks: A survey of the state of the art. In Combinations of Genetic Algorithms and NeuralNetworks', ' 1992.', ' COGANN-92. International Workshop on', ' pages 1--37', ' Philips Labs.', ' Briarcliff Manor', ' NY', ' 6 Jun 1992.'], ['\t S. Song', ' K. D. Miller', ' and L. F. Abbott. 
Competitive hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience', ' 3(9):919--926', ' 2000.'], ['\t L. Watts. Event-driven simulation of networks of spiking neurons. Advances in Neural Information Processing Systems', ' 6:927--934', ' 1994.']]

It looks like you're going to have to tailor this pretty heavily to your input. Because there are so many different words and constructs in the text you're parsing, you're probably not going to get 100% accuracy with the rules you create. Here's an example, though, assuming your original input text is called input_text (and I don't think using the split() method is really all that useful, because the commas don't just delimit names):
import re
regexes = (r'[A-Z][a-z]+ [A-Z][a-z]+',  # capitalized first and last name
           r'[A-Z]\. [A-Z][a-z]+')      # capitalized initial, then last name
names = []
for regex in regexes:
    names += re.findall(regex, input_text)
You'd obviously want to write additional, more specific regexes for your various name types. This does a good job of finding names, but it also produces a lot of false positives (Information Processing looks a lot like a name under these rules). It should give you a starting point, though.

To remove the tab (and other empty spaces at beginning or end of the strings):
stripped = [s.strip() for s in mylist]
To be honest, if you are trying to extract names, splitting lines like that will not help -- notice how some names are still grouped together with titles. Would be better to build a good regex that will match names, and use re.findall on individual lines.

To remove tabs and extra spaces, use strip():
>>> "\t foobar \t\t\t".strip()
'foobar'

It may also be easier to find an online source where this job has already been done. For example, at places like this or this.

Strip all the strings.
Identify the strings that are surely not names (very long ones, ones that include numbers, and the ones that come after these in the list).
Identify the strings that are surely names (short strings at the beginning of the list, and strings starting with a title pattern such as ^[A-Z][a-z]{0,3}\.?\s, which covers Dr., Miss, Mr, Prof, etc.).
Study the remaining strings that these rules can't classify, and try fuzzy rules that compute a certainty score: a short string near the beginning of the list gets a higher score than a long one near the end. Add criteria like that and set a minimum score.
If you need high accuracy, look for name databases and Bayesian filters.
It won't be perfect: it's very hard to know the difference between 'name name name' and 'word word word'.
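The steps above can be sketched roughly as follows. This is only an illustration: the helper names name_score and likely_names, the weights and the 0.7 threshold are all made up here and would need tuning against the real data.

```python
import re


def name_score(text, position):
    """Return a rough 0..1 confidence that `text` is a person's name.

    `position` is the index of the string inside its sub-list. All weights
    and thresholds are illustrative guesses, not tuned values.
    """
    s = text.strip()
    if not s or re.search(r"\d", s) or len(s) > 40:
        return 0.0                      # empty, contains digits, or very long
    score = 0.5
    if re.match(r"^(?:and )?[A-Z]\.", s):
        score += 0.3                    # starts with an initial such as "J."
    if len(s.split()) <= 3:
        score += 0.1                    # short strings are more name-like
    score += 0.1 / (1 + position)       # earlier in the list scores higher
    return min(score, 1.0)


def likely_names(entry, threshold=0.7):
    """Keep the stripped strings of one sub-list whose score passes the threshold."""
    return [s.strip() for i, s in enumerate(entry) if name_score(s, i) >= threshold]


entry = ['\t J. D. Schaffer', ' L. D. Whitley', ' pages 1--37', ' NY']
print(likely_names(entry))
```

As the answer warns, a capitalized title such as 'Information Processing Letters' still looks name-like to rules of this kind, so the threshold only trades false positives against missed names.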

Related

Extract text with multiple regex patterns in Python

I have a list with address information
The placement of words in the list can be random.
address = [' South region', ' district KTS', ' 4', ' app. 106', ' ent. 1', ' st. 15']
I want to extract each item of a list in a new string.
r = re.compile(".region")
region = list(filter(r.match, address))
It works, but "region" can be written in more than one way. For example, there can be "South reg." or "South r-n".
How can I combine multiple patterns?
The digit 4 in the list is the building number. It can be digits only, or something like 4k1.
How can I extract the building number?
Hopefully I understood the requirement correctly.
For extracting the region, I chose to get it by the first word, but if you can be sure of the regions which are accepted, it would be better to construct the regex based on the valid values, not first word.
Also, for the building extraction, I am not sure of which are the characters you want to keep, versus the ones which you may want to remove. In this case I chose to keep only alphanumeric, meaning that everything else would be stripped.
CODE
import re
list1 = [' South region', ' district KTS', ' -4k-1.', ' app. 106', ' ent. 1', ' st. 15']
def GetFirstWord(list2,column):
    return re.search(r'\w+', list2[column].strip()).group()
def KeepAlpha(list2,column):
    return re.sub(r'[^A-Za-z0-9 ]+', '', list2[column].strip())
print(GetFirstWord(list1,0))
print(KeepAlpha(list1,2))
OUTPUT
South
4k1
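On the question of combining several spellings, one option is regex alternation with |. The sketch below is not part of the answer above. It assumes the only spellings are the ones named in the question ("region", "reg.", "r-n") and that a building number is an item made of digits with an optional letter-plus-digits suffix such as "k1". The function names find_regions and find_buildings are hypothetical.

```python
import re

# Assumed spellings of "region", matched at the end of an item.
REGION_RE = re.compile(r"\b(?:region|reg\.|r-n)\s*$", re.IGNORECASE)
# Assumed building format: digits, optionally followed by a letter and more digits.
BUILDING_RE = re.compile(r"^\d+(?:[a-z]\d+)?$", re.IGNORECASE)


def find_regions(address):
    """Return the stripped items that end in one of the region spellings."""
    return [item.strip() for item in address if REGION_RE.search(item.strip())]


def find_buildings(address):
    """Return the stripped items that look like a building number."""
    return [item.strip() for item in address if BUILDING_RE.match(item.strip())]


address = [' South region', ' district KTS', ' 4k1', ' app. 106', ' ent. 1', ' st. 15']
print(find_regions(address))
print(find_buildings(address))
```

Because the items are tested one by one, their order in the list does not matter.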

How to extract particular data from nested data structure in Python

['[{"word":"meaning","phonetics":[{"text":"/ˈmiːnɪŋ/","audio":"https://lex-audio.useremarkable.com/mp3/meaning_gb_1.mp3"}],"meanings":[{"partOfSpeech":"noun","definitions":[{"definition":"What '
'is meant by a word, text, concept, or '
'action.","synonyms":["definition","sense","explanation","denotation","connotation","interpretation","elucidation","explication"],"example":"the '
'meaning of the Hindu word is ‘breakthrough, '
'release’"}]},{"partOfSpeech":"adjective","definitions":[{"definition":"Intended '
'to communicate something that is not directly '
'expressed.","synonyms":["meaningful","significant","pointed","eloquent","expressive","pregnant","speaking","telltale","revealing","suggestive"]}]}]}]']
This is the format.
I want to extract:
"meanings":[{"partOfSpeech":"noun","definitions":[{"definition":"A single distinct meaningful element of speech or writing, used with others (or sometimes alone) to form a sentence and typically shown with a space on either side when written or printed.",
How can I do this in Python?
I believe this is what you want:
import json
data = ['[{"word":"meaning","phonetics":[{"text":"/ˈmiːnɪŋ/","audio":"https://lex-audio.useremarkable.com/mp3/meaning_gb_1.mp3"}],"meanings":[{"partOfSpeech":"noun","definitions":[{"definition":"What ' 'is meant by a word, text, concept, or ' 'action.","synonyms":["definition","sense","explanation","denotation","connotation","interpretation","elucidation","explication"],"example":"the ' 'meaning of the Hindu word is ‘breakthrough, ' 'release’"}]},{"partOfSpeech":"adjective","definitions":[{"definition":"Intended ' 'to communicate something that is not directly ' 'expressed.","synonyms":["meaningful","significant","pointed","eloquent","expressive","pregnant","speaking","telltale","revealing","suggestive"]}]}]}]']
json_data = json.loads(data[0])
meanings = json_data[0]['meanings']
print(meanings)
# [{'partOfSpeech': 'noun', 'definitions': [{'definition': 'What is meant by a word, text, concept, or action.', 'synonyms': ['definition', 'sense', 'explanation', 'denotation', 'connotation', 'interpretation', 'elucidation', 'explication'], 'example': 'the meaning of the Hindu word is ‘breakthrough, release’'}]}, {'partOfSpeech': 'adjective', 'definitions': [{'definition': 'Intended to communicate something that is not directly expressed.', 'synonyms': ['meaningful', 'significant', 'pointed', 'eloquent', 'expressive', 'pregnant', 'speaking', 'telltale', 'revealing', 'suggestive']}]}]
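If only the part of speech and the definition text are needed, the parsed structure can be walked one level further. A minimal sketch, using a shortened stand-in for the response in the question (the definition strings are abbreviated here, not the real ones):

```python
import json

# Shortened stand-in for the response shown in the question.
raw = ('[{"word":"meaning","meanings":['
       '{"partOfSpeech":"noun","definitions":[{"definition":"What is meant by a word."}]},'
       '{"partOfSpeech":"adjective","definitions":[{"definition":"Intended to communicate something."}]}'
       ']}]')

entries = json.loads(raw)

# Flatten entries -> meanings -> definitions into (part of speech, definition) pairs.
pairs = [
    (meaning["partOfSpeech"], definition["definition"])
    for entry in entries
    for meaning in entry["meanings"]
    for definition in meaning["definitions"]
]

for part_of_speech, definition in pairs:
    print(part_of_speech, "-", definition)
```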

Replacing text using regex in a list of strings in a dataframe

I have a dataframe of text where I want to replace the text of some substrings. For example:
"[' Foods are adequately protected from\\n contamination during handling and storage.', ' Food handler hygiene and hand washing is\\n properly followed.', ' Foods are cooked, cooled and stored at\\n proper temperatures.', ' Garbage and/or waste is properly stored\\n and removed.', ' Pest control practices are properly maintained.', ' Equipment and utensils are properly cleaned,\\n sanitized and maintained.', ' Food premise is properly maintained in a clean\\n and sanitary condition.']"
I want to replace '\n' with ''.
[sub.replace('\\n', '') for sub in abc_test]
where abc_test is just the first row of the dataframe content. When I apply this function the result turns out to be different than what I was hoping for.
['[',
"'",
' ',
'F',
'o',
'o',
'd',
's',
' ',
'a',
'r',
'e',
'
Any help would be appreciated.
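The point here is that your strings contain a backslash character followed by the letter n, not newline characters. A plain "\n" is a line feed (LF) character, and "\\n" used as a regex is the \n escape, which also matches a line feed, so neither finds the two-character sequence.
You can use a regex that matches a literal backslash followed by n (regex=True is needed on pandas 2.0 and later, where str.replace treats the pattern as a literal string by default):
df['res'] = df['text'].str.replace(r"\\n", "", regex=True)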
Pandas test:
>>> import pandas as pd
>>> df = pd.DataFrame({'text': [' Foods are adequately protected from\\n contamination during handling and storage.', ' Food handler hygiene and hand washing is\\n properly followed.', ' Foods are cooked, cooled and stored at\\n proper temperatures.', ' Garbage and/or waste is properly stored\\n and removed.', ' Pest control practices are properly maintained.', ' Equipment and utensils are properly cleaned,\\n sanitized and maintained.', ' Food premise is properly maintained in a clean\\n and sanitary condition.']})
>>> df['res'] = df['text'].str.replace(r"\\n", "", regex=True)
>>> df
text res
0 Foods are adequately protected from\n contami... Foods are adequately protected from contamina...
1 Food handler hygiene and hand washing is\n pr... Food handler hygiene and hand washing is prop...
2 Foods are cooked, cooled and stored at\n prop... Foods are cooked, cooled and stored at proper...
3 Garbage and/or waste is properly stored\n and... Garbage and/or waste is properly stored and r...
4 Pest control practices are properly maintained. Pest control practices are properly maintained.
5 Equipment and utensils are properly cleaned,\... Equipment and utensils are properly cleaned, ...
6 Food premise is properly maintained in a clea... Food premise is properly maintained in a clea...
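The character-by-character output in the question suggests that abc_test is a single string that merely looks like a list, so the comprehension iterates over its characters. A minimal sketch of handling that case outside pandas, assuming the cell really holds the printed form of a Python list (the sample value below is a shortened stand-in):

```python
import ast

# Shortened stand-in for one dataframe cell: a string that *looks* like a list.
abc_test = "[' Foods are protected from\\n contamination.', ' Pest control is maintained.']"

# Iterating over a string yields single characters, as in the question's output.
first_chars = [ch for ch in abc_test][:3]

# Option 1: remove the backslash-n pair from the raw string itself.
cleaned_text = abc_test.replace("\\n", "")

# Option 2: parse the string into a real list first. literal_eval evaluates the
# inner string literals, so the backslash-n pair becomes an actual newline.
items = ast.literal_eval(abc_test)
cleaned_items = [item.replace("\n", "") for item in items]

print(first_chars)
print(cleaned_items)
```

Option 2 is only safe to rely on when every cell is a valid Python literal; ast.literal_eval raises an exception otherwise.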

How can I concatenate the lines of dialogue while doing Natural Language Processing on a book

I am working on a sentiment analysis project on a book. I am using nltk.sentiment.vader.SentimentIntensityAnalyzer to record the sentiment polarity of paragraphs in the Harry Potter series.
To create paragraphs and remove the line breaks I did:
text_file = open('HP1 Sorcerer of Stone.txt', 'r')
text = str(text_file.readlines())
text.replace('\\n"', "").replace("\'", "").replace(" , ","")
This breaks the book down into paragraphs. The problem arises when it comes to dialogue.
Dialogue has the same paragraph breaks in between each character's words
' "So?" snapped Mrs. Dursley. ',
' "Well, I just thought... maybe... it was something to do with... you
know... her crowd." ',
' Mrs. Dursley sipped her tea through pursed lips. Mr. Dursley wondered
whether he dared tell her he\\d heard the name "Potter." He decided he
didn\\t dare. Instead he said, as casually as he could, "Their son --
he\\d be about Dudley\\s age now, wouldn\\t he?" ',
' "I suppose so," said Mrs. Dursley stiffly. ',
' "What\\s his name again? Howard, isn\\t it?" ',
' "Harry. Nasty, common name, if you ask me." ',
How can I edit my breakdown methods so dialogue stays together as one element? The dialogue as a whole will then be used as a single input into the intensity analyzer.

Python - RegEx for splitting text into sentences (sentence-tokenizing) [duplicate]

I want to make a list of sentences from a string and then print them out. I don't want to use NLTK to do this. So it needs to split on a period at the end of a sentence, but not at decimals, abbreviations, titles in names, or a .com inside the sentence. This is my attempt at a regex, which doesn't work.
import re
text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
for stuff in sentences:
    print(stuff)
Example output of what it should look like
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s
Try this: split your string with it. You can also check the demo.
http://regex101.com/r/nG1gU7/27
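A minimal sketch of using that pattern with re.split on the sample text from the question. The wrapper name split_sentences is made up for illustration, and the trailing newline is stripped first so that no empty string is left at the end.

```python
import re

# The pattern from the answer above, compiled once.
SENTENCE_BOUNDARY = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s")


def split_sentences(text):
    """Split on whitespace that follows '.' or '?', unless a short abbreviation precedes it."""
    return SENTENCE_BOUNDARY.split(text.strip())


text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. "
        "Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... "
        "Well, with a probability of .9 it isn't.")

sentences = split_sentences(text)
for sentence in sentences:
    print(sentence)
```

Note that the pattern only looks for '.' and '?', so sentences ending in '!' are not split, and abbreviations longer than two letters (for example "Prof.") are not protected.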
Ok so sentence-tokenizers are something I looked at in a little detail, using regexes, nltk, CoreNLP, spaCy. You end up writing your own and it depends on the application. This stuff is tricky and valuable and people don't just give their tokenizer code away. (Ultimately, tokenization is not a deterministic procedure, it's probabilistic, and also depends very heavily on your corpus or domain, e.g. legal/financial documents vs social-media posts vs Yelp reviews vs biomedical papers...)
In general you can't rely on one single Great White infallible regex, you have to write a function which uses several regexes (both positive and negative); also a dictionary of abbreviations, and some basic language parsing which knows that e.g. 'I', 'USA', 'FCC', 'TARP' are capitalized in English.
To illustrate how easily this can get seriously complicated, let's try to write you that functional spec for a deterministic tokenizer just to decide whether single or multiple period ('.'/'...') indicates end-of-sentence, or something else:
function isEndOfSentence(leftContext, rightContext)
Return False for decimals inside numbers or currency e.g. 1.23 , $1.23, "That's just my $.02" Consider also section references like 1.2.A.3.a, European date formats like 09.07.2014, IP addresses like 192.168.1.1, MAC addresses...
Return False (and don't tokenize into individual letters) for known abbreviations e.g. "U.S. stocks are falling" ; this requires a dictionary of known abbreviations. Anything outside that dictionary you will get wrong, unless you add code to detect unknown abbreviations like A.B.C. and add them to a list.
Ellipses '...' at ends of sentences are terminal, but in the middle of sentences are not. This is not as easy as you might think: you need to look at the left context and the right context, specifically whether the right-hand side is capitalized, and again consider capitalized words like 'I' and abbreviations. Here's an example that shows the ambiguity: She asked me to stay... I left an hour later. (Was that one sentence or two? Impossible to determine.)
You may also want to write a few patterns to detect and reject miscellaneous non-sentence-ending uses of punctuation: emoticons :-), ASCII art, spaced ellipses . . . and other stuff esp. Twitter. (Making that adaptive is even harder). How do we tell if #midnight is a Twitter user, the show on Comedy Central, text shorthand, or simply unwanted/junk/typo punctuation? Seriously non-trivial.
After you handle all those negative cases, you could arbitrarily say that any isolated period followed by whitespace is likely to be an end of sentence. (Ultimately, if you really want to buy extra accuracy, you end up writing your own probabilistic sentence-tokenizer which uses weights, and training it on a specific corpus(e.g. legal texts, broadcast media, StackOverflow, Twitter, forums comments etc.)) Then you have to manually review exemplars and training errors. See Manning and Jurafsky book or Coursera course [a].
Ultimately you get as much correctness as you are prepared to pay for.
All of the above is clearly specific to the English-language/ abbreviations, US number/time/date formats. If you want to make it country- and language-independent, that's a bigger proposition, you'll need corpora, native-speaking people to label and QA it all, etc.
All of the above is still only ASCII, which is practically speaking only 96 characters. Allow the input to be Unicode, and things get harder still (and the training-set necessarily must be either much bigger or much sparser)
In the simple (deterministic) case, function isEndOfSentence(leftContext, rightContext) would return boolean, but in the more general sense, it's probabilistic: it returns a float 0.0-1.0 (confidence level that that particular '.' is a sentence end).
References: [a] Coursera video: "Basic Text Processing 2-5 - Sentence Segmentation - Stanford NLP - Professor Dan Jurafsky & Chris Manning" [UPDATE: an unofficial version used to be on YouTube, was taken down]
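To make the deterministic isEndOfSentence(leftContext, rightContext) spec above more concrete, here is a deliberately small Python sketch. It covers only three of the listed cases (periods inside tokens, known abbreviations, ellipses). The abbreviation set is a tiny hypothetical sample and the rules are simplifications, not a working tokenizer.

```python
# Hypothetical, deliberately tiny abbreviation list; a real one would be much larger.
KNOWN_ABBREVIATIONS = {"mr.", "mrs.", "dr.", "jr.", "i.e.", "e.g.", "u.s."}


def is_end_of_sentence(left_context, right_context):
    """Guess whether the period that ends `left_context` closes a sentence.

    left_context:  text up to and including the period under test
    right_context: text that follows that period
    """
    # A period glued to the next character is inside a token:
    # decimals (1.23), domains (cheapsite.com), IP addresses and so on.
    if right_context and not right_context[0].isspace():
        return False

    # Known abbreviations are assumed never to end a sentence.
    tokens = left_context.split()
    if tokens and tokens[-1].lower() in KNOWN_ABBREVIATIONS:
        return False

    # An ellipsis is treated as terminal only when the next word is capitalized.
    following = right_context.lstrip()
    if left_context.endswith("..."):
        return not following or following[0].isupper()

    return True


print(is_end_of_sentence("U.S.", " stocks are falling"))
print(is_end_of_sentence("he paid a lot for it.", " Did he mind?"))
```

The ambiguous example from the answer ("She asked me to stay... I left an hour later.") is classified as a sentence end here simply because "I" is capitalized, which is exactly the kind of arbitrary choice a deterministic version has to make.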
Try splitting the input on the spaces that follow a dot or ?, rather than on the dot or ? itself. If you split on the punctuation, the dot or ? is consumed and won't be printed in the final result.
>>> import re
>>> s = """Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."""
>>> m = re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', s)
>>> for i in m:
...     print(i)
...
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
sent = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)', text)
for s in sent:
    print(s)
Here the regex used is: (?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)
First block: (?<!\w\.\w.): a negative lookbehind (?<!) that rejects a position preceded by a word character (\w), a full stop (\.), another word character (\w) and any one character (.), as in "i.e.".
Second block: (?<![A-Z][a-z]\.): a negative lookbehind that rejects a position preceded by an uppercase letter ([A-Z]), a lowercase letter ([a-z]) and a dot (\.), as in "Mr.".
Third block: (?<=\.|\?): a positive lookbehind that requires a dot (\.) or a question mark (\?) immediately before the position.
Fourth block: (\s|[A-Z].*): matches what comes after the dot or question mark from the third block, either a whitespace character (\s) or any sequence of characters starting with an uppercase letter ([A-Z].*).
This block is important to split if the input is as
Hello world.Hi I am here today.
i.e. if there is space or no space after the dot.
Naive approach for proper english sentences not starting with non-alphas and not containing quoted parts of speech:
import re
text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
EndPunctuation = re.compile(r'([\.\?\!]\s+)')
NonEndings = re.compile(r'(?:Mrs?|Jr|i\.e)\.\s*$')
parts = EndPunctuation.split(text)
sentence = []
for part in parts:
    if len(part) and len(sentence) and EndPunctuation.match(sentence[-1]) and not NonEndings.search(''.join(sentence)):
        print(''.join(sentence))
        sentence = []
    if len(part):
        sentence.append(part)
if len(sentence):
    print(''.join(sentence))
False positive splitting may be reduced by extending NonEndings a bit. Other cases will require additional code. Handling typos in a sensible way will prove difficult with this approach.
You will never reach perfection with this approach. But depending on the task it might just work "enough"...
I'm not great at regular expressions, but a simpler version, "brute force" actually, of above is
sentence = re.compile("([\'\"][A-Z]|([A-Z][a-z]*\. )|[A-Z])(([a-z]*\.[a-z]*\.)|([A-Za-z0-9]*\.[A-Za-z0-9])|([A-Z][a-z]*\. [A-Za-z]*)|[^\.?]|[A-Za-z])*[\.?]")
which means
start acceptable units are '[A-Z] or "[A-Z]
Please note that alternation with | tries its alternatives from left to right, so their order is very important. That's why I have written the "i.e." form first, followed by forms like "Inc.".
Try this:
(?<!\b(?:[A-Z][a-z]|\d|[i.e]))\.(?!\b(?:com|\d+)\b)
I wrote this taking into consideration smci's comments above. It is a middle-of-the-road approach that doesn't require external libraries and doesn't use regex. It allows you to provide a list of abbreviations and accounts for sentences ended by terminators in wrappers, such as a period and quote: [.", ?', .)].
abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior', 'i.e.': 'that is', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']
def find_sentences(paragraph):
    end = True
    sentences = []
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences
def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end
def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)
I used Karl's find_all function from this entry: Find all occurrences of a substring in Python
My example is based on the example of Ali, adapted to Brazilian Portuguese. Thanks Ali.
import re
ABREVIACOES = ['sra?s?', 'exm[ao]s?', 'ns?', 'nos?', 'doc', 'ac', 'publ', 'ex', 'lv', 'vlr?', 'vls?',
               'exmo(a)', 'ilmo(a)', 'av', 'of', 'min', 'livr?', 'co?ls?', 'univ', 'resp', 'cli', 'lb',
               'dra?s?', r'[a-z]+r\(as?\)', 'ed', 'pa?g', 'cod', 'prof', 'op', 'plan', 'edf?', 'func', 'ch',
               'arts?', 'artigs?', 'artg', 'pars?', 'rel', 'tel', 'res', '[a-z]', 'vls?', 'gab', 'bel',
               'ilm[oa]', 'parc', 'proc', 'adv', 'vols?', 'cels?', 'pp', 'ex[ao]', 'eg', 'pl', 'ref',
               '[0-9]+', 'reg', 'f[ilí]s?', 'inc', 'par', 'alin', 'fts', 'publ?', 'ex', 'v. em', 'v.rev']
ABREVIACOES_RGX = re.compile(r'(?:{})\.\s*$'.format(r'|\s'.join(ABREVIACOES)), re.IGNORECASE)
def sentencas(texto, min_len=5):
    # based on https://stackoverflow.com/questions/25735644/python-regex-for-splitting-text-into-sentences-sentence-tokenizing
    texto = re.sub(r'\s\s+', ' ', texto)  # collapse runs of whitespace
    EndPunctuation = re.compile(r'([\.\?\!]\s+)')
    # print(NonEndings)
    parts = EndPunctuation.split(texto)
    sentencas = []
    sentence = []
    for part in parts:
        txt_sent = ''.join(sentence)
        q_len = len(txt_sent)
        if len(part) and len(sentence) and q_len >= min_len and \
                EndPunctuation.match(sentence[-1]) and \
                not ABREVIACOES_RGX.search(txt_sent):
            sentencas.append(txt_sent)
            sentence = []
        if len(part):
            sentence.append(part)
    if sentence:
        sentencas.append(''.join(sentence))
    return sentencas
Full code in: https://github.com/luizanisio/comparador_elastic
If you want to break up sentences at 3 periods (not sure if this is what you want), you can use this regular expression:
import re
text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r'\.{3}', text)
for stuff in sentences:
    print(stuff)
