I have a dataset containing a collection of text messages, and I want to calculate the average number of words per sentence. But each message is in a different format, i.e. some messages end with a full stop and some do not...
eg messages:
Tiwary to rcb.battle between bang and kochi
Dhawan for dc:)
Warner to delhi.
make it fast...
by using,
words = messages.split()  # get each word in the message
len_wrd = len(words)
but there is a problem finding the end of a sentence because the messages are not in a consistent format. How can I identify the end of a sentence, and how do I calculate this using Python 2.7?
This is not a trivial problem. I would recommend using a 3rd-party library like NLTK. It has a sentence tokenizer which works like this:
# Make sure that you have NLTK installed
from nltk.tokenize import sent_tokenize

text = "this's a sent tokenize test. this is sent two. is this sent three? sent 4 is cool! Now it's your turn."
sent_tokenize_list = sent_tokenize(text)
print(sent_tokenize_list)
# Will output ["this's a sent tokenize test.", 'this is sent two.', 'is this sent three?', 'sent 4 is cool!', "Now it's your turn."]
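From there, the average words per sentence is just a word count divided by the sentence count. A minimal sketch (works under Python 2.7 and 3, assuming NLTK and its punkt data are installed; the messages list just echoes the question's examples):

from nltk.tokenize import sent_tokenize, word_tokenize

messages = ["Tiwary to rcb.battle between bang and kochi", "Dhawan for dc:)", "Warner to delhi."]
text = " ".join(messages)

sentences = sent_tokenize(text)
total_words = sum(len(word_tokenize(s)) for s in sentences)
avg_words = float(total_words) / len(sentences) if sentences else 0
print(avg_words)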
I am using Google Speech-to-Text API and after I transcribe an audio file, I end up with a text which is a conversation between two people and it doesn't contain punctuation (Google's automatic punctuation or speaker diarization features are not supported for this non-English language). For example:
Hi you are speaking with customer support how can i help you Hi my name is whatever and this is my problem Can you give me your address please Yes of course
It appears as one big sentence, but I want to split it into separate sentences whenever a word starting with an uppercase letter appears, and thus have:
Hi you are speaking with customer support how can i help you
Hi my name is whatever and this is my problem
Can you give me your address please
Yes of course
I am using Python, and I don't want to use regex; I want to use a simpler method instead. What should I add to this code to split each result into multiple sentences as soon as I see an uppercase letter?
# Each result is for a consecutive portion of the audio. Iterate through
# them to get the transcripts for the entire audio file.
for i, result in enumerate(response.results):
    transcribed_text = []
    # The first alternative is the most likely one for this portion.
    alternative = result.alternatives[0]
    print("-" * 20)
    print("First alternative of result {}".format(i))
    print("Transcript: {}".format(alternative.transcript))
A simple solution would be a regex split:
inp = "Hi you are speaking with customer support how can i help you Hi my name is whatever and this is my problem Can you give me your address please Yes of course"
sentences = re.split(r'\s+(?=[A-Z])', inp)
print(sentences)
This prints:
['Hi you are speaking with customer support how can i help you',
'Hi my name is whatever and this is my problem',
'Can you give me your address please',
'Yes of course']
Note that this simple approach can easily fail should there be things like proper names in the middle of sentences, or maybe acronyms, both of which also have uppercase letters but are not markers for the actual end of the sentence. A better long term approach would be to use a library like nltk, which has the ability to find sentences with much higher accuracy.
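Since the asker would rather avoid regex, the same split can also be done with a plain loop; this is just a sketch of that idea (the same caveats about mid-sentence capitals apply):

def split_on_uppercase(text):
    # Start a new sentence at each word that begins with an uppercase letter.
    sentences = []
    current = []
    for word in text.split():
        if word[0].isupper() and current:
            sentences.append(" ".join(current))
            current = []
        current.append(word)
    if current:
        sentences.append(" ".join(current))
    return sentences

inp = "Hi you are speaking with customer support how can i help you Hi my name is whatever"
print(split_on_uppercase(inp))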
For example, I want to map inevitable, unavoidable, certain, sure = "necessary": if any of these words is used in a given sentence, my program should automatically change it to "necessary" and give me back the sentence.
For example,
it is inevitable or unavoidable or certain or sure, that person age should be 18
should be automatically detected and converted into
"it is necessary that person age should be 18"
Your issue isn't very clear; tell us what you want to do and what you can't figure out.
I think you should split your sentence to get a list of all the words in it. Then check whether each word belongs to your list of "changeable" words (inevitable, unavoidable, certain, sure); if so, replace it with the word you want ("necessary" in your example).
But I'm not sure I understood your problem.
sen = "this is unavoidable that the kids must be 18"
words = sen.split()
new_words = []
for word in words:
if word in ['inevitable', 'unavoidable', 'certain', 'sure']:
word = 'necessary'
new_words.append(word)
new_sen = " ".join(new_words)
print(new_sen)
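If the same idea has to scale to several replacement groups, a dictionary lookup keeps the loop flat; a sketch of that variation:

# Hypothetical mapping; extend with more synonym -> replacement pairs as needed.
replacements = {w: 'necessary' for w in ('inevitable', 'unavoidable', 'certain', 'sure')}

sen = "this is unavoidable that the kids must be 18"
new_sen = " ".join(replacements.get(word, word) for word in sen.split())
print(new_sen)  # this is necessary that the kids must be 18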
So I'm trying to do a cosine similarity with a text file I have. https://lms.uwa.edu.au/bbcswebdav/pid-1143173-dt-content-rid-16133365_1/courses/CITS1401_SEM-2_2018/CITS1401_SEM-2_2018_ImportedContent_20180713092326/CITS1401_SEM-1_2018/Unit%20Content/Resources/Project2_2018/sample.txt
I'm wondering how I can read this file sentence by sentence, rather than using readline() to read it line by line.
I'm trying to create sentence variables. For example:
s1 = "the mississippi is well worth reading about"
s2 = "it is not a commonplace river, but on the contrary is in all ways remarkable"
First of all, is this the right way to go about it? If it is, my next step, which I know how to do, is to remove the common words from the sentences and leave only unique words to compare.
How do I stop at the full stop and then store that sentence in a variable while looping through the text?
Thanks
Do you mean this:
with open("file.txt", 'r') as in_f:
    sentences = in_f.read().replace('\n', ' ').split('.')

for s in sentences:
    # your code
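For the cosine-similarity step the question mentions, a bag-of-words version over two sentences can be sketched like this (plain Python, no external libraries; the sentences are the question's own examples):

import math
from collections import Counter

def cosine_similarity(s1, s2):
    # Build word-count vectors for each sentence.
    v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    # Dot product over the words the two sentences share.
    dot = sum(v1[w] * v2[w] for w in set(v1) & set(v2))
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

s1 = "the mississippi is well worth reading about"
s2 = "it is not a commonplace river, but on the contrary is in all ways remarkable"
print(cosine_similarity(s1, s2))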
We are working on sentences extracted from a PDF. The problem is that the extraction includes the title, footers, table of contents, etc. Is there a way to determine whether the sentence we get when we pass the document to spaCy is a complete sentence? Is there a way to filter out sentence fragments like titles?
A complete sentence contains at least one subject, one predicate, one object, and closes with punctuation.
Subject and object are almost always nouns, and the predicate is always a verb.
Thus you need to check whether your sentence contains two nouns and one verb, and closes with punctuation:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I. Introduction\nAlfred likes apples! A car runs over a red light.")
for sent in doc.sents:
    # Require a capitalized first token and closing punctuation.
    if sent[0].is_title and sent[-1].is_punct:
        has_noun = 2
        has_verb = 1
        for token in sent:
            if token.pos_ in ["NOUN", "PROPN", "PRON"]:
                has_noun -= 1
            elif token.pos_ == "VERB":
                has_verb -= 1
        if has_noun < 1 and has_verb < 1:
            print(sent.text.strip())
Update
I would also advise checking whether the sentence starts with an uppercase letter; I added that modification in the code. Furthermore, I would like to point out that what I wrote is true for English and German; I don't know how it is in other languages.
Try looking for the first noun chunk in each sentence. That is usually (but not always) the subject of the sentence.
sentence_title = [chunk.text for chunk in doc.noun_chunks][0]
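In context, and guarding against sentences that have no noun chunks at all, this might look like the following sketch:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Alfred likes apples. A car runs over a red light.")
for sent in doc.sents:
    chunks = list(sent.noun_chunks)
    if chunks:
        print(chunks[0].text)  # first noun chunk, often the subject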
You can perform sentence segmentation using a trainable pipeline component in spaCy:
https://spacy.io/api/sentencerecognizer
Additionally, if you can find some pattern in the text string, you can use Python's regex library re: https://docs.python.org/3/library/re.html
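As a sketch of the first suggestion: spaCy 3 pipelines ship a senter component that is disabled by default; assuming en_core_web_sm, it can be enabled in place of the parser:

import spacy

# "senter" is a lightweight, trainable sentence segmenter shipped
# (disabled by default) with the trained pipelines in spaCy 3.
nlp = spacy.load("en_core_web_sm", exclude=["parser"])
nlp.enable_pipe("senter")

doc = nlp("I. Introduction\nAlfred likes apples. A car runs over a red light.")
for sent in doc.sents:
    print(sent.text)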
Should I use NLTK or regular expressions to split it?
How can I do the selection for pronouns (he/she)? I want to select any sentence that has a pronoun.
This is part of a larger project and I am new to Python. Could you please point me to any helpful code?
I am working on an NLP project with similar needs. I suggest you use NLTK, since it makes things really easy and gives us a lot of flexibility. Since you need to collect all sentences containing pronouns, you can split the text into sentences and hold them in a list. Then iterate over the list and look for sentences containing pronouns. Also make sure you note down the index of each such sentence (in the list), or form a new list.
Sample code below:
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

sentences = ['alice loves to read crime novels.', 'she also loves to play chess with him']
sentences_with_pronouns = []
for sentence in sentences:
    words = word_tokenize(sentence)
    for word in words:
        # Tag each word individually; keep the sentence on the first pronoun found.
        word_pos = pos_tag([word])
        if word_pos[0][1] == 'PRP':
            sentences_with_pronouns.append(sentence)
            break
print(sentences_with_pronouns)
Output:
['she also loves to play chess with him']
P.S. Also check the pylucene and whoosh libraries, which are useful Python packages for text search and indexing.
NLTK is your best bet. Given a string of sentences as input, you can obtain a list of those sentences containing pronouns by doing:
from nltk import pos_tag, sent_tokenize, word_tokenize

paragraph = "This is a sentence with no pronouns. Take it or leave it."
print([sentence for sentence in sent_tokenize(paragraph)
       if 'PRP' in {pos for _, pos in pos_tag(word_tokenize(sentence))}])
Returns:
['Take it or leave it.']
Basically, we split the string into a list of sentences, split those sentences into lists of words, and convert the list of words for each sentence into a set of part-of-speech tags (this is important since, if we don't, a sentence with multiple pronouns would be added more than once, giving duplicate sentences).