Lemmatisation of web scraped data - python

Let's suppose that I have a text document such as the following:
document = '<p> I am a sentence. I am another sentence <p> I am a third sentence.'
( or a more complex text example:
document = '<p>Forde Education are looking to recruit a Teacher of Geography for an immediate start in a Doncaster Secondary school.</p> <p>The school has a thriving and welcoming environment with very high expectations of students both in progress and behaviour. This position will be working until Easter with a <em><strong>likely extension until July 2011.</strong></em></p> <p>The successful candidates will need to demonstrate good practical subject knowledge but also possess the knowledge and experience to teach to GCSE level with the possibility of teaching to A’Level to smaller groups of students.</p> <p>All our candidate will be required to hold a relevant teaching qualifications with QTS successful applicants will be required to provide recent relevant references and undergo a Enhanced CRB check.</p> <p>To apply for this post or to gain information regarding similar roles please either submit your CV in application or Call Debbie Slater for more information. </p>'
)
I am applying a series of pre-processing NLP techniques to get a "cleaner" version of this document, reducing each of its words to its stem.
I am using the following code for this:
import re
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

stemmer_1 = PorterStemmer()
stemmer_2 = LancasterStemmer()
stemmer_3 = SnowballStemmer(language='english')
# Remove all the special characters
document = re.sub(r'\W', ' ', document)
# remove all single characters
document = re.sub(r'\b[a-zA-Z]\b', ' ', document)
# Substituting multiple spaces with single space
document = re.sub(r' +', ' ', document, flags=re.I)
# Converting to lowercase
document = document.lower()
# Tokenisation
document = document.split()
# Stemming
document = [stemmer_3.stem(word) for word in document]
# Join the words back to a single document
document = ' '.join(document)
This gives the following output for the text document above:
'am sent am anoth sent am third sent'
(and this output for the more complex example:
'ford educ are look to recruit teacher of geographi for an immedi start in doncast secondari school the school has thrive and welcom environ with veri high expect of student both in progress and behaviour nbsp this posit will be work nbsp until easter with nbsp em strong like extens until juli 2011 strong em the success candid will need to demonstr good practic subject knowledg but also possess the knowledg and experi to teach to gcse level with the possibl of teach to level to smaller group of student all our candid will be requir to hold relev teach qualif with qts success applic will be requir to provid recent relev refer and undergo enhanc crb check to appli for this post or to gain inform regard similar role pleas either submit your cv in applic or call debbi slater for more inform nbsp'
)
What I want to do now is to get an output like the one exactly above but after I have applied lemmatisation and not stemming.
However, unless I am missing something, this requires splitting the original document into (sensible) sentences, applying POS tagging and then implementing the lemmatisation.
But here things get a little complicated, because the text data come from web scraping, so you will encounter many HTML tags such as <br>, <p> etc.
My idea is that whenever a sequence of words ends with a common punctuation mark (full stop, exclamation mark etc.) or with an HTML tag such as <br> or <p>, it should be treated as a separate sentence.
Thus for example the original document above:
document = '<p> I am a sentence. I am another sentence <p> I am a third sentence.'
should be split into something like this:
['I am a sentence', 'I am another sentence', 'I am a third sentence']
and then I guess we will apply POS tagging to each sentence, split each sentence into words, apply lemmatisation and .join() the words back into a single document, as I am doing with my code above.
How can I do this?

Removing HTML tags is a common part of text cleaning. You can use your own hand-written rules such as text.replace('<p>', '.'), but there is a better solution: html2text. This library can do all the dirty HTML-cleaning work for you, e.g.:
>>> import html2text
>>> h = html2text.HTML2Text()
>>> h.ignore_links = True
>>> print(h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!"))
Hello, world!
You can import this library in your Python code, or you can use it as a stand-alone program.
Edit: here is a small example chain that splits your text into sentences:
>>> document = '<p> I am a sentence. I am another sentence <p> I am a third sentence.'
>>> text_without_html = html2text.html2text(document)
>>> refined_text = re.sub(r'\n+', '. ', text_without_html)
>>> sentences = nltk.sent_tokenize(refined_text)
>>> sentences
['I am a sentence.', 'I am another sentence.', 'I am a third sentence..']

Related

Python text to sentences when uppercase word appears

I am using Google Speech-to-Text API and after I transcribe an audio file, I end up with a text which is a conversation between two people and it doesn't contain punctuation (Google's automatic punctuation or speaker diarization features are not supported for this non-English language). For example:
Hi you are speaking with customer support how can i help you Hi my name is whatever and this is my problem Can you give me your address please Yes of course
It appears as one big sentence, but I want to split the different sentences whenever an uppercase word appears, and thus have:
Hi you are speaking with customer support how can i help you
Hi my name is whatever and this is my problem
Can you give me your address please
Yes of course
I am using Python and I don't want to use regex, instead I want to use a simpler method. What should I add to this code in order to split each result into multiple sentences as soon as I see an uppercase letter?
# Each result is for a consecutive portion of the audio. Iterate through
# them to get the transcripts for the entire audio file.
for i, result in enumerate(response.results):
    transcribed_text = []
    # The first alternative is the most likely one for this portion.
    alternative = result.alternatives[0]
    print("-" * 20)
    print("First alternative of result {}".format(i))
    print("Transcript: {}".format(alternative.transcript))
A simple solution would be a regex split:
import re

inp = "Hi you are speaking with customer support how can i help you Hi my name is whatever and this is my problem Can you give me your address please Yes of course"
sentences = re.split(r'\s+(?=[A-Z])', inp)
print(sentences)
This prints:
['Hi you are speaking with customer support how can i help you',
'Hi my name is whatever and this is my problem',
'Can you give me your address please',
'Yes of course']
Note that this simple approach can easily fail should there be things like proper names in the middle of sentences, or maybe acronyms, both of which also have uppercase letters but are not markers for the actual end of the sentence. A better long term approach would be to use a library like nltk, which has the ability to find sentences with much higher accuracy.
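Since the asker would rather avoid regex, the same uppercase heuristic can be sketched with plain string methods (the function name is mine); it carries the same proper-name and acronym caveat as the regex version:

```python
def split_on_uppercase(text):
    # Start a new sentence whenever a word begins with an uppercase
    # letter; no regex involved, as the asker requested.
    sentences, current = [], []
    for word in text.split():
        if word[0].isupper() and current:
            sentences.append(' '.join(current))
            current = []
        current.append(word)
    if current:
        sentences.append(' '.join(current))
    return sentences

transcript = ("Hi you are speaking with customer support how can i help you "
              "Hi my name is whatever and this is my problem "
              "Can you give me your address please Yes of course")
print(split_on_uppercase(transcript))
```

In the transcription loop above, you would call this on `alternative.transcript` instead of printing it directly.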

Determine if a text extract from spacy is a complete sentence

We are working on sentences extracted from a PDF. The problem is that the text includes the title, footers, table of contents, etc. Is there a way to determine whether what we get when we pass the document to spaCy is a complete sentence? Is there a way to filter out fragments such as titles?
A complete sentence contains at least one subject, one predicate, one object, and closes with punctuation.
Subject and object are almost always nouns, and the predicate is always a verb.
Thus you need to check that your sentence contains two nouns and one verb, and closes with punctuation:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I. Introduction\nAlfred likes apples! A car runs over a red light.")
for sent in doc.sents:
    if sent[0].is_title and sent[-1].is_punct:
        has_noun = 2
        has_verb = 1
        for token in sent:
            if token.pos_ in ["NOUN", "PROPN", "PRON"]:
                has_noun -= 1
            elif token.pos_ == "VERB":
                has_verb -= 1
        if has_noun < 1 and has_verb < 1:
            print(sent.text.strip())
Update
I would also advise checking whether the sentence starts with an upper-case letter; I have added that modification to the code. Furthermore, I would like to point out that what I wrote holds for English and German; I don't know how it is in other languages.
Try looking for the first noun chunk in each sentence. That is usually (but not always) the subject of the sentence.
sentence_title = [chunk.text for chunk in doc.noun_chunks][0]
You can perform sentence segmentation using a trainable pipeline component in spaCy:
https://spacy.io/api/sentencerecognizer
Additionally, if you can find a pattern in the text string, use Python's regex
library re: https://docs.python.org/3/library/re.html
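As a rule-based alternative that needs no trained model, a quick sketch using spaCy's Sentencizer component (rather than the trainable SentenceRecognizer linked above); the example text is mine:

```python
import spacy  # assumes spaCy 3.x

# A blank pipeline plus the rule-based Sentencizer splits on
# sentence-final punctuation, without downloading a trained model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
doc = nlp("I. Introduction Alfred likes apples! A car runs over a red light.")
print([sent.text.strip() for sent in doc.sents])
```

Because it is purely punctuation-driven, it cannot do the POS-based completeness check from the first answer, but it is a cheap way to get candidate sentences to filter.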

Splitting text into sentences using regex in Python [duplicate]

This question already has answers here:
How can I split a text into sentences?
(20 answers)
Closed 3 years ago.
I'm trying to split a piece of sample text into a list of sentences, without delimiters and with no spaces at the end of each sentence.
Sample text:
The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war? Is AI a bad thing?
Into this (desired output):
['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing']
My code is currently:
import re

def sent_tokenize(text):
    sentences = re.split(r"[.!?]", text)
    sentences = [sent.strip(" ") for sent in sentences]
    return sentences
However this outputs (current output):
['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing', '']
Notice the extra '' on the end.
Any ideas on how to remove the extra '' at the end of my current output?
nltk's sent_tokenize
If you're in the business of NLP, I'd strongly recommend sent_tokenize from the nltk package.
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(text)
[
'The first time you see The Second Renaissance it may look boring.',
'Look at it at least twice and definitely watch part 2.',
'It will change your view of the matrix.',
'Are the human people the ones who started the war?',
'Is AI a bad thing?'
]
It's a lot more robust than regex, and provides a lot of options to get the job done. More info can be found at the official documentation.
If you are picky about the trailing delimiters, you can use nltk.tokenize.RegexpTokenizer with a slightly different pattern:
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'[^.?!]+')
>>> list(map(str.strip, tokenizer.tokenize(text)))
[
'The first time you see The Second Renaissance it may look boring',
'Look at it at least twice and definitely watch part 2',
'It will change your view of the matrix',
'Are the human people the ones who started the war',
'Is AI a bad thing'
]
Regex-based re.split
If you must use regex, then you'll need to modify your pattern by adding a negative lookahead -
>>> list(map(str.strip, re.split(r"[.!?](?!$)", text)))
[
'The first time you see The Second Renaissance it may look boring',
'Look at it at least twice and definitely watch part 2',
'It will change your view of the matrix',
'Are the human people the ones who started the war',
'Is AI a bad thing?'
]
The added (?!$) specifies that you split only when you have not yet reached the end of the line. Unfortunately, I am not sure the trailing delimiter on the last sentence can be reasonably removed without doing something like result[-1] = result[-1][:-1].
You can use filter to remove the empty elements
Ex:
import re
text = """The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war? Is AI a bad thing?"""

def sent_tokenize(text):
    sentences = re.split(r"[.!?]", text)
    sentences = [sent.strip(" ") for sent in sentences]
    return list(filter(None, sentences))

print(sent_tokenize(text))
Any ideas on how to remove the extra '' at the end of my current output?
You could remove it by doing this:
sentences[:-1]
Or faster (by ᴄᴏʟᴅsᴘᴇᴇᴅ)
del result[-1]
Output:
['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing']
You could either strip your paragraph before splitting it, or filter the empty strings out of the result.
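A minimal sketch of the first suggestion: stripping the trailing delimiter before splitting means re.split never produces a trailing empty string (the sample text is shortened from the question):

```python
import re

text = ("The first time you see The Second Renaissance it may look boring. "
        "Are the human people the ones who started the war? Is AI a bad thing?")

# Removing the final delimiter (and trailing spaces) first means the
# split never yields an empty last element.
sentences = [s.strip() for s in re.split(r'[.!?]', text.rstrip('.!? '))]
print(sentences)
```

This keeps the original one-pass structure of the asker's function while avoiding both the filter step and the slicing step.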

NLTK Sentence Tokenizer, custom sentence starters

I'm trying to split a text into sentences with the PunktSentenceTokenizer from nltk. The text contains listings starting with bullet points, but they are not recognized as new sentences. I tried to add some parameters but that didn't work. Is there another way?
Here is some example code:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
params = PunktParameters()
params.sent_starters = set(['•'])
tokenizer = PunktSentenceTokenizer(params)
tokenizer.tokenize('• I am a sentence • I am another sentence')
['• I am a sentence • I am another sentence']
You can subclass PunktLanguageVars and adapt the sent_end_chars attribute to fit your needs like so:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktLanguageVars

class BulletPointLangVars(PunktLanguageVars):
    sent_end_chars = ('.', '?', '!', '•')

tokenizer = PunktSentenceTokenizer(lang_vars=BulletPointLangVars())
tokenizer.tokenize(u"• I am a sentence • I am another sentence")
This will result in the following output:
['•', 'I am a sentence •', 'I am another sentence']
However, this makes • a sentence end marker, while in your case it is more of a sentence start marker. Thus this example text:
I introduce a list of sentences.
I am sentence one
I am sentence two
And I am one, too!
Would, depending on the details of your text, result in something like the following:
>>> tokenizer.tokenize("""
Look at these sentences:
• I am sentence one
• I am sentence two
But I am one, too!
""")
['\nLook at these sentences:\n\n•', 'I am sentence one\n•', 'I am sentence two\n\nBut I am one, too!\n']
One reason why PunktSentenceTokenizer is used for sentence tokenization, instead of a simple multi-delimiter split, is that it can learn to distinguish punctuation that ends a sentence from punctuation used for other purposes, as in "Mr.", for example.
There should, however, be no such complications for •, so I would advise you to write a simple parser to preprocess the bullet-point formatting instead of abusing PunktSentenceTokenizer for something it is not really designed for.
How this might be achieved in detail is dependent on how exactly this kind of markup is used in the text.
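For instance, under the assumption that • only ever introduces a list item, a minimal preprocessing sketch (the function name is mine) could cut the text at each bullet before handing the pieces to a normal sentence tokenizer:

```python
import re

def split_bullet_items(text):
    # Cut the text at each bullet point and drop empty fragments;
    # each piece can then be fed to an ordinary sentence tokenizer.
    return [part.strip() for part in re.split(r'•', text) if part.strip()]

print(split_bullet_items('• I am a sentence • I am another sentence'))
# prints: ['I am a sentence', 'I am another sentence']
```

If bullets in your text can also contain multi-sentence items, run each returned piece through sent_tokenize afterwards.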

Using regular expressions in Python

I'm struggling with the problem of cutting the very first sentence from a string.
It wouldn't be such a problem if there were no abbreviations ending with a dot.
So my example is:
string = 'I like cheese, cars, etc. but my the most favorite website is stackoverflow. My new horse is called Randy.'
And the result should be:
result = 'I like cheese, cars, etc. but my the most favorite website is stackoverflow.'
Normally I would do this with:
re.findall(r'^(\s*.*?\s*)(?:\.|$)', string)
but I would like to skip some pre-defined words, like above mentioned etc.
I came with couple of expression but none of them worked.
You could try NLTK's Punkt sentence tokenizer, which does this kind of thing using a real algorithm to figure out what the abbreviations are instead of your ad-hoc collection of abbreviations.
NLTK includes a pre-trained one for English; load it with:
nltk.data.load('tokenizers/punkt/english.pickle')
From the source code:
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.
How about looking for the first capital letter after a sentence-ending character? It's not foolproof, of course.
import re

r = re.compile(r"^(.+?[.?!])\s*[A-Z]")
print(r.match('I like cheese, cars, etc. but my the most favorite website is stackoverflow. My new horse is called Randy.').group(1))
outputs
I like cheese, cars, etc. but my the most favorite website is stackoverflow.
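If you would rather keep your own pre-defined list of words to skip than rely on Punkt, a sketch of that approach (ABBREVIATIONS and first_sentence are hypothetical names, not from either answer):

```python
import re

ABBREVIATIONS = ('etc.', 'e.g.', 'i.e.')  # your pre-defined words

def first_sentence(text):
    # Try each candidate sentence end in turn, skipping dots that
    # merely terminate a known abbreviation.
    for match in re.finditer(r'[.?!]', text):
        end = match.end()
        if any(text[:end].endswith(abbr) for abbr in ABBREVIATIONS):
            continue
        return text[:end]
    return text

string = 'I like cheese, cars, etc. but my the most favorite website is stackoverflow. My new horse is called Randy.'
print(first_sentence(string))
```

Note that multi-dot abbreviations such as "e.g." would need an extra check for their internal dots; this sketch only handles the final dot of each abbreviation.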
