Splitting text into sentences using regex in Python [duplicate] - python

This question already has answers here:
How can I split a text into sentences?
(20 answers)
Closed 3 years ago.
I'm trying to split a piece of sample text into a list of sentences, with no delimiters and no spaces at the end of each sentence.
Sample text:
The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war? Is AI a bad thing?
Into this (desired output):
['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing']
My code is currently:
import re

def sent_tokenize(text):
    sentences = re.split(r"[.!?]", text)
    sentences = [sent.strip(" ") for sent in sentences]
    return sentences
However this outputs (current output):
['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing', '']
Notice the extra '' on the end.
Any ideas on how to remove the extra '' at the end of my current output?

nltk's sent_tokenize
If you're in the business of NLP, I'd strongly recommend sent_tokenize from the nltk package.
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(text)
[
'The first time you see The Second Renaissance it may look boring.',
'Look at it at least twice and definitely watch part 2.',
'It will change your view of the matrix.',
'Are the human people the ones who started the war?',
'Is AI a bad thing?'
]
It's a lot more robust than regex, and provides a lot of options to get the job done. More info can be found at the official documentation.
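Note that sent_tokenize relies on NLTK's pre-trained Punkt model, so on a fresh install you may need a one-off download before the call above works:
>>> import nltk
>>> nltk.download('punkt')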
If you are picky about the trailing delimiters, you can use nltk.tokenize.RegexpTokenizer with a slightly different pattern:
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'[^.?!]+')
>>> list(map(str.strip, tokenizer.tokenize(text)))
[
'The first time you see The Second Renaissance it may look boring',
'Look at it at least twice and definitely watch part 2',
'It will change your view of the matrix',
'Are the human people the ones who started the war',
'Is AI a bad thing'
]
Regex-based re.split
If you must use regex, then you'll need to modify your pattern by adding a negative lookahead -
>>> list(map(str.strip, re.split(r"[.!?](?!$)", text)))
[
'The first time you see The Second Renaissance it may look boring',
'Look at it at least twice and definitely watch part 2',
'It will change your view of the matrix',
'Are the human people the ones who started the war',
'Is AI a bad thing?'
]
The added (?!$) negative lookahead specifies that you only split when you have not yet reached the end of the string. Unfortunately, I am not sure the trailing delimiter on the last sentence can be reasonably removed without doing something like result[-1] = result[-1][:-1].
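Spelled out, that cleanup would be (a small sketch building on the split above):
>>> result = list(map(str.strip, re.split(r"[.!?](?!$)", text)))
>>> result[-1] = result[-1][:-1]  # drop the delimiter left on the last sentence
>>> result
['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing']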

You can use filter to remove the empty elements
Ex:
import re
text = """The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war? Is AI a bad thing?"""
def sent_tokenize(text):
    sentences = re.split(r"[.!?]", text)
    sentences = [sent.strip(" ") for sent in sentences]
    return list(filter(None, sentences))

print(sent_tokenize(text))

Any ideas on how to remove the extra '' at the end of my current
output?
You could remove it by doing this:
sentences = sentences[:-1]
Or faster (by ᴄᴏʟᴅsᴘᴇᴇᴅ):
del sentences[-1]
Output:
['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing']

You could either strip the trailing delimiter from your paragraph before splitting it, or filter the empty strings out of the result, as sketched below.
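A minimal sketch of both options, assuming text holds the sample paragraph from the question:
import re

# Option 1: strip trailing delimiters first, so no empty string appears
sentences = [s.strip() for s in re.split(r"[.!?]", text.rstrip(".!? "))]

# Option 2: split first, then drop the empty strings
sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]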

Related

Lemmatisation of web scraped data

Let's suppose that I have a text document such as the following:
document = '<p> I am a sentence. I am another sentence <p> I am a third sentence.'
( or a more complex text example:
document = '<p>Forde Education are looking to recruit a Teacher of Geography for an immediate start in a Doncaster Secondary school.</p> <p>The school has a thriving and welcoming environment with very high expectations of students both in progress and behaviour. This position will be working until Easter with a <em><strong>likely extension until July 2011.</strong></em></p> <p>The successful candidates will need to demonstrate good practical subject knowledge but also possess the knowledge and experience to teach to GCSE level with the possibility of teaching to A’Level to smaller groups of students.</p> <p>All our candidate will be required to hold a relevant teaching qualifications with QTS successful applicants will be required to provide recent relevant references and undergo a Enhanced CRB check.</p> <p>To apply for this post or to gain information regarding similar roles please either submit your CV in application or Call Debbie Slater for more information. </p>'
)
I am applying a series of pre-processing NLP techniques to get a "cleaner" version of this document by also taking the stem word for each of its words.
I am using the following code for this:
import re
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

stemmer_1 = PorterStemmer()
stemmer_2 = LancasterStemmer()
stemmer_3 = SnowballStemmer(language='english')
# Remove all the special characters
document = re.sub(r'\W', ' ', document)
# remove all single characters
document = re.sub(r'\b[a-zA-Z]\b', ' ', document)
# Substituting multiple spaces with single space
document = re.sub(r' +', ' ', document, flags=re.I)
# Converting to lowercase
document = document.lower()
# Tokenisation
document = document.split()
# Stemming
document = [stemmer_3.stem(word) for word in document]
# Join the words back to a single document
document = ' '.join(document)
This gives the following output for the text document above:
'am sent am anoth sent am third sent'
(and this output for the more complex example:
'ford educ are look to recruit teacher of geographi for an immedi start in doncast secondari school the school has thrive and welcom environ with veri high expect of student both in progress and behaviour nbsp this posit will be work nbsp until easter with nbsp em strong like extens until juli 2011 strong em the success candid will need to demonstr good practic subject knowledg but also possess the knowledg and experi to teach to gcse level with the possibl of teach to level to smaller group of student all our candid will be requir to hold relev teach qualif with qts success applic will be requir to provid recent relev refer and undergo enhanc crb check to appli for this post or to gain inform regard similar role pleas either submit your cv in applic or call debbi slater for more inform nbsp'
)
What I want to do now is to get an output like the one exactly above but after I have applied lemmatisation and not stemming.
However, unless I am missing something, this requires splitting the original document into (sensible) sentences, applying POS tagging and then applying the lemmatisation.
But here things are a little bit complicated, because the text data come from web scraping and hence you will encounter many HTML tags such as <br>, <p> etc.
My idea is that every time a sequence of words ends with some common punctuation mark (full stop, exclamation mark etc.) or with an HTML tag such as <br>, <p> etc., it should be considered a separate sentence.
Thus for example the original document above:
document = '<p> I am a sentence. I am another sentence <p> I am a third sentence.'
Should be split in something like this:
['I am a sentence', 'I am another sentence', 'I am a third sentence']
and then I guess we will apply POS tagging to each sentence, split each sentence into words, apply lemmatisation and .join() the words back into a single document, as I am doing with my code above.
How can I do this?
Removing HTML tags is a common part of text cleaning. You can use your own hand-written rules like text.replace('<p>', '.'), but there is a better solution: html2text. This library can do all the dirty HTML cleaning work for you, like:
>>> import html2text
>>> h = html2text.HTML2Text()
>>> print(h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!"))
Hello, world!
You can import this library in your Python code, or you can use it as a stand-alone program.
Edit: Here is the small chain example that splits your text to sentences:
>>> import re, html2text, nltk
>>> document = '<p> I am a sentence. I am another sentence <p> I am a third sentence.'
>>> text_without_html = html2text.html2text(document)
>>> refined_text = re.sub(r'\n+', '. ', text_without_html)
>>> sentences = nltk.sent_tokenize(refined_text)
>>> sentences
['I am a sentence.', 'I am another sentence.', 'I am a third sentence..']
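From there, the POS tagging and lemmatisation step described in the question could look roughly like the sketch below. It continues from the sentences list above; get_wordnet_pos is an illustrative helper (not a library function) that maps Penn Treebank tags to WordNet POS tags:
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):
    # crude mapping from Penn Treebank tags to WordNet POS tags
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatised = []
for sentence in sentences:
    tagged = pos_tag(word_tokenize(sentence))
    lemmatised.append(' '.join(
        lemmatizer.lemmatize(word.lower(), get_wordnet_pos(tag))
        for word, tag in tagged))

document = ' '.join(lemmatised)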

Break sentence into parts from full stop, comma, and & but

sentence = "Very disorganized and hard professor. Does not come to classes on time, she grades tough, does not help on anything. She says come for help but when you go to her office hour, she is not there to help."
I want to break this sentence into parts at full stops, commas, and the words 'and' and 'but'.
The output should be like:
Very disorganized
and hard professor.
Does not come to classes on time,
she grades tough,
does not help on anything.
She says come for help
but when you go to her office hour,
she is not there to help.
For now I am using:
sample = re.split(r' *[\.\?!][\'"\)\]]* *', sentence)
print (sample)
and this only breaks the sentence at full stops.
Output:
['Very disorganized and hard professor', 'Does not come to classes on time, she grades tough, does not help on anything', 'She says come for help but when you go to her office hour, she is not there to help']
Any idea how to do this?
Try this:
result = re.split(r'[.,&]', sentence)
You can use re.sub() to add in newline characters where your stopwords are encountered.
The regex is simply: (and|\.|but|,), which matches your stopwords. Then you replace that group with itself, plus a newline character.
>>> import re
>>> sentence = "Very disorganized and hard professor. Does not come to classes on time, she grades tough, does not help on anything. She says come for help but when you go to her office hour, she is not there to help."
>>> sample = re.sub(r'(and|\.|but|,)', r'\1\n', sentence)
>>> print(sample)
Very disorganized and
hard professor.
Does not come to classes on time,
she grades tough,
does not help on anything.
She says come for help but
when you go to her office hour,
she is not there to help.
If you want it in a list:
>>> re.sub(r'(and|\.|but|,)', r'\1\n', sentence).split('\n')
['Very disorganized and', ' hard professor.', ' Does not come to classes on time,', ' she grades tough,', ' does not help on anything.', ' She says come for help but', ' when you go to her office hour,', ' she is not there to help.', '']
If you want to remove the whitespace before each following line, you can use this:
sample = re.sub(r'(and|\.|but|,)(?:\s)', r'\1\n', sentence)
Or a loop
for x in ['.', ',', 'and', 'but']:
    sentence = sentence.replace(x, x + '\n')
Adds a \n after each of those delimiters.
Output:
Very disorganized and
hard professor.
Does not come to classes on time,
she grades tough,
does not help on anything.
She says come for help but
when you go to her office hour,
she is not there to help.

Splitting A List At A Certain Character

Say I have a list that goes something like this:
['Try not to become a man of success, but rather try to become a man of value. \n',
 'Look deep into nature, and then you will understand everything better.\n',
 'The true sign of intelligence is not knowledge but imagination. \n',
 'We cannot solve our problems with the same thinking we used when we created them. \n',
 'Weakness of attitude becomes weakness of character.\n',
 "You can't blame gravity for falling in love. \n",
 'The difference between stupidity and genius is that genius has its limits. \n']
How would I go about splitting this list into sub-lists at every newline?
How do I turn the output above into my desired output below?
['Try not to become a man of success, but rather try to become a man of value.']
['Look deep into nature, and then you will understand everything better']
['The true sign of intelligence is not knowledge but imagination.']
Your expected result seems to be a list of items where each item is a list containing a single element, composed of the original list item stripped of whitespace, something that can be achieved using:
result = [[line.strip()] for line in lines]
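For example, binding the question's list to the name lines used above (a quick usage sketch with just the first two items):
>>> lines = ['Try not to become a man of success, but rather try to become a man of value. \n',
...          'Look deep into nature, and then you will understand everything better.\n']
>>> [[line.strip()] for line in lines]
[['Try not to become a man of success, but rather try to become a man of value.'], ['Look deep into nature, and then you will understand everything better.']]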

Removing "\n"s when printing sentences from text file in python?

I am trying to print a list of sentences from a text file (one of the Project Gutenberg eBooks). When I print the file as a single string it looks fine:
file = open('11.txt','r+')
alice = file.read()
print(alice[:500])
Output is:
ALICE'S ADVENTURES IN WONDERLAND
Lewis Carroll
THE MILLENNIUM FULCRUM EDITION 3.0
CHAPTER I. Down the Rabbit-Hole
Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversations?'
So she was considering in her own mind (as well as she could, for the
hot d
Now, when I split it into sentences (The assignment was specifically to do this by "splitting at the periods," so it's a very simplified split), I get this:
>>> print(sentences[:5])
["ALICE'S ADVENTURES IN WONDERLAND\n\nLewis Carroll\n\nTHE MILLENNIUM FULCRUM EDITION 3", '0\n\n\n\n\nCHAPTER I', " Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or\nconversations?'\n\nSo she was considering in her own mind (as well as she could, for the\nhot day made her feel very sleepy and stupid), whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up and\npicking the daisies, when suddenly a White Rabbit with pink eyes ran\nclose by her", "\n\nThere was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit say to itself, 'Oh dear!\nOh dear! I shall be late!' (when she thought it over afterwards, it\noccurred to her that she ought to have wondered at this, but at the time\nit all seemed quite natural); but when the Rabbit actually TOOK A WATCH\nOUT OF ITS WAISTCOAT-POCKET, and looked at it, and then hurried on,\nAlice started to her feet, for it flashed across her mind that she had\nnever before seen a rabbit with either a waistcoat-pocket, or a watch\nto take out of it, and burning with curiosity, she ran across the field\nafter it, and fortunately was just in time to see it pop down a large\nrabbit-hole under the hedge", '\n\nIn another moment down went Alice after it, never once considering how\nin the world she was to get out again']
Where do the extra "\n" characters come from and how can I remove them?
If you want to replace all the newlines with one space, do this:
import re
new_sentences = [re.sub(r'\n+', ' ', s) for s in sentences]
You may not want to use regex, but I would do:
import re
new_sentences = []
for s in sentences:
    new_sentences.append(re.sub(r'\n{2,}', '\n', s))
This should replace all instances of two or more '\n' with a single newline, so you still have newlines, but don't have "extra" newlines.
If you want to avoid creating a new list, and instead modify the existing one (credit to @gavriel and Andrew L.: I hadn't thought of using enumerate when I first posted my answer):
import re
for i, s in enumerate(sentences):
    sentences[i] = re.sub(r'\n{2,}', '\n', s)
The extra newlines aren't really extra, by which I mean they are meant to be there and are visible in the text in your question: the more '\n' there are, the more space there is visible between the lines of text (i.e., one between the chapter heading and the first paragraph, and many between the edition and the chapter heading).
You'll understand where the \n characters come from with this little example:
alice = """ALICE'S ADVENTURES IN WONDERLAND
Lewis Carroll
THE MILLENNIUM FULCRUM EDITION 3.0
CHAPTER I. Down the Rabbit-Hole
Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversations?'
So she was considering in her own mind (as well as she could, for the
hot d"""
print(len(alice.split(".")))
print(len(alice.split("\n")))
It all depends on how you're splitting your text; the above example will give this output:
3
19
This means there are 3 substrings if you split the text using . as the separator, or 19 substrings if you split using \n. You can read more about str.split in the documentation.
In your case you've split your text using ., so the 3 substrings will contain multiple newline characters \n. To get rid of them you can either split these substrings again or remove them using str.replace.
The text uses newlines to delimit sentences as well as full stops. You have an issue where just replacing the newline characters with an empty string will result in words without spaces between them. Before you split alice by '.', I would use something along the lines of @elethan's solution to replace all of the multiple newlines in alice with a '.'. Then you could do alice.split('.') and all of the sentences separated by multiple newlines would be split appropriately, along with the sentences separated by . initially.
Then your only issue is the decimal point in the version number.
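A minimal sketch of that combined approach, assuming alice holds the full text read from the file:
import re

# turn runs of blank lines into sentence boundaries,
# then turn the remaining single newlines into spaces
cleaned = re.sub(r'\n{2,}', '. ', alice)
cleaned = cleaned.replace('\n', ' ')

sentences = [s.strip() for s in cleaned.split('.') if s.strip()]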
file = open('11.txt','r+')
file.read().split('\n')

Using regular expressions in Python

I'm struggling with the problem of cutting the very first sentence from a string.
It wouldn't be such a problem if there were no abbreviations ending with a dot.
So my example is:
string = 'I like cheese, cars, etc. but my the most favorite website is stackoverflow. My new horse is called Randy.'
And the result should be:
result = 'I like cheese, cars, etc. but my the most favorite website is stackoverflow.'
Normally I would do it with:
re.findall(r'^(\s*.*?\s*)(?:\.|$)', string)
but I would like to skip some pre-defined words, like the etc. mentioned above.
I came up with a couple of expressions but none of them worked.
You could try NLTK's Punkt sentence tokenizer, which does this kind of thing using a real algorithm to figure out what the abbreviations are instead of your ad-hoc collection of abbreviations.
NLTK includes a pre-trained one for English; load it with:
nltk.data.load('tokenizers/punkt/english.pickle')
From the source code:
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print '\n-----\n'.join(sent_detector.tokenize(text.strip()))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.
How about looking for the first capital letter after a sentence-ending character? It's not foolproof, of course.
import re
r = re.compile(r"^(.+?[.?!])\s*[A-Z]")
print(r.match('I like cheese, cars, etc. but my the most favorite website is stackoverflow. My new horse is called Randy.').group(1))
outputs
'I like cheese, cars, etc. but my the most favorite website is stackoverflow.'
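If you would rather stick with a plain regex and a hand-maintained list of abbreviations, one option is a negative lookbehind that refuses to split right after those words. A rough sketch, assuming etc. is the only abbreviation to skip (each additional abbreviation needs its own fixed-width lookbehind):
>>> import re
>>> string = 'I like cheese, cars, etc. but my the most favorite website is stackoverflow. My new horse is called Randy.'
>>> re.split(r'(?<=[.?!])(?<!etc\.)\s+', string)[0]
'I like cheese, cars, etc. but my the most favorite website is stackoverflow.'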
