sentence = "Very disorganized and hard professor. Does not come to classes on time, she grades tough, does not help on anything. She says come for help but when you go to her office hour, she is not there to help."
I want to break this sentence into parts at every full stop, comma, "and", and "but".
The output should look like this:
Very disorganized
and hard professor.
Does not come to classes on time,
she grades tough,
does not help on anything.
She says come for help
but when you go to her office hour,
she is not there to help.
For now I am using:
import re
sample = re.split(r' *[\.\?!][\'"\)\]]* *', sentence)
print(sample)
This only breaks the sentence at full stops.
Output:
['Very disorganized and hard professor', 'Does not come to classes on time, she grades tough, does not help on anything', 'She says come for help but when you go to her office hour, she is not there to help']
Any idea how to do this?
Try this
Result=re.split(r'[.,&]', sentence)
You can use re.sub() to add in newline characters where your stopwords are encountered.
The regex is simply: (and|\.|but|,), which matches your stopwords. Then you replace that group with itself, plus a newline character.
>>> import re
>>> sentence = "Very disorganized and hard professor. Does not come to classes on time, she grades tough, does not help on anything. She says come for help but when you go to her office hour, she is not there to help."
>>> sample = re.sub(r'(and|\.|but|,)', r'\1\n', sentence)
>>> print(sample)
Very disorganized and
 hard professor.
 Does not come to classes on time,
 she grades tough,
 does not help on anything.
 She says come for help but
 when you go to her office hour,
 she is not there to help.
If you want it in a list:
>>> re.sub(r'(and|\.|but|,)', r'\1\n', sentence).split('\n')
['Very disorganized and', ' hard professor.', ' Does not come to classes on time,', ' she grades tough,', ' does not help on anything.', ' She says come for help but', ' when you go to her office hour,', ' she is not there to help.', '']
If you want to remove the whitespace before each following line, you can use this:
sample = re.sub(r'(and|\.|but|,)(?:\s)', r'\1\n', sentence)
Or use a loop:
for x in ['.', ',', 'and', 'but']:
    sentence = sentence.replace(x, x + '\n')
This adds a \n after each of those delimiters.
Output:
Very disorganized and
hard professor.
Does not come to classes on time,
she grades tough,
does not help on anything.
She says come for help but
when you go to her office hour,
she is not there to help.
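The answers above put "and" and "but" at the end of a line, whereas the question asks for them at the start of the next one. A minimal sketch of one way to get that layout, assuming the break should come after a full stop or comma but before the whole words "and" and "but" (the names `pattern` and `parts` are only illustrative):

```python
import re

sentence = ("Very disorganized and hard professor. Does not come to classes on time, "
            "she grades tough, does not help on anything. She says come for help but "
            "when you go to her office hour, she is not there to help.")

# Split on whitespace that follows a full stop or comma, or on whitespace
# that precedes the whole word "and" or "but". The trailing \b keeps words
# such as "android" or "button" from triggering a split.
pattern = r'(?<=[.,])\s+|\s+(?=(?:and|but)\b)'
parts = re.split(pattern, sentence)

for part in parts:
    print(part)
```

Because the split consumes only the whitespace, the punctuation and the words themselves are kept and no line starts with a stray space.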
I am using Google Speech-to-Text API and after I transcribe an audio file, I end up with a text which is a conversation between two people and it doesn't contain punctuation (Google's automatic punctuation or speaker diarization features are not supported for this non-English language). For example:
Hi you are speaking with customer support how can i help you Hi my name is whatever and this is my problem Can you give me your address please Yes of course
It appears as one big sentence, but I want to split the different sentences whenever an uppercase word appears, and thus have:
Hi you are speaking with customer support how can i help you
Hi my name is whatever and this is my problem
Can you give me your address please
Yes of course
I am using Python and I don't want to use regex; I want to use a simpler method instead. What should I add to this code in order to split each result into multiple sentences as soon as I see an uppercase letter?
# Each result is for a consecutive portion of the audio. Iterate through
# them to get the transcripts for the entire audio file.
for i, result in enumerate(response.results):
    transcribed_text = []
    # The first alternative is the most likely one for this portion.
    alternative = result.alternatives[0]
    print("-" * 20)
    print("First alternative of result {}".format(i))
    print("Transcript: {}".format(alternative.transcript))
A simple solution would be a regex split:
import re
inp = "Hi you are speaking with customer support how can i help you Hi my name is whatever and this is my problem Can you give me your address please Yes of course"
sentences = re.split(r'\s+(?=[A-Z])', inp)
print(sentences)
This prints:
['Hi you are speaking with customer support how can i help you',
'Hi my name is whatever and this is my problem',
'Can you give me your address please',
'Yes of course']
Note that this simple approach can easily fail should there be things like proper names in the middle of sentences, or maybe acronyms, both of which also have uppercase letters but are not markers for the actual end of the sentence. A better long term approach would be to use a library like nltk, which has the ability to find sentences with much higher accuracy.
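Since the question asks for a method without regex, here is a rough sketch of the same idea using only string methods. The function name `split_on_capitalized_words` is made up for illustration, and it has the same weakness as the regex: any capitalized word, including a proper name, starts a new sentence.

```python
def split_on_capitalized_words(text):
    """Start a new sentence at every word whose first character is uppercase."""
    sentences = []
    current = []
    for word in text.split():
        # A capitalized word closes the sentence collected so far.
        if word[0].isupper() and current:
            sentences.append(" ".join(current))
            current = []
        current.append(word)
    if current:
        sentences.append(" ".join(current))
    return sentences


inp = ("Hi you are speaking with customer support how can i help you "
       "Hi my name is whatever and this is my problem "
       "Can you give me your address please Yes of course")
sentences = split_on_capitalized_words(inp)
for s in sentences:
    print(s)
```

In the transcription loop from the question, this could presumably be called on `alternative.transcript` for each result.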
My aim is to remove all punctuation from a string so that I can then get the frequency of each word in the string.
My string is:
WASHINGTON—Coming to the realization in front of millions of viewers
during the broadcast of his show, a horrified Tucker Carlson stated,
‘I…I am the mainstream media’ Wednesday as he began spiraling live on
air. “We’ve discovered evidence of rampant voter fraud, and the
president has every right to call for an investigation even if the
mainstream media thinks...” said Carlson, who trailed off, stared down
at his shaking hands, and felt a sudden ringing in his ears as he
looked back up and zeroed in on the production crew surrounding him.
“The media says…wait. Those liars on TV will try to tell you…oh God.
We’re the number-one program on cable news, aren’t we? Fox News…Fox
‘News.’ It’s the media. It’s me. This can’t be. No, no, no, no. Jesus
Christ, I make $6 million a year. Get that camera off me!” At press
time, Carlson had torn the microphone from his lapel and fled the set
in panic.
source: https://www.theonion.com/i-i-am-the-mainstream-media-realizes-horrified-tuc-1845646901
I want to remove all punctuation from it. I do that like this:
s.translate(str.maketrans('', '', string.punctuation))
This is the output -
WASHINGTON—Coming to the realization in front of millions of viewers
during the broadcast of his show a horrified Tucker Carlson stated
‘I…I am the mainstream media’ Wednesday as he began spiraling live on
air “We’ve discovered evidence of rampant voter fraud and the
president has every right to call for an investigation even if the
mainstream media thinks” said Carlson who trailed off stared down at
his shaking hands and felt a sudden ringing in his ears as he looked
back up and zeroed in on the production crew surrounding him “The
media says…wait Those liars on TV will try to tell you…oh God We’re
the numberone program on cable news aren’t we Fox News…Fox ‘News’ It’s
the media It’s me This can’t be No no no no Jesus Christ I make 6
million a year Get that camera off me” At press time Carlson had torn
the microphone from his lapel and fled the set in panic
As you can see, characters such as “, — and … still exist. Am I wrong to expect them to be removed too? If the output is correct, how can I avoid differentiating between ‘News’ and News?
>>> import string
>>> "“" in string.punctuation
False
>>> "—" in string.punctuation
False
Welcome to the wonderful world of Unicode where, among many other things, … is not three concatenated full stops, and:
>>> import unicodedata
>>> unicodedata.name('—')
'EM DASH'
is not a hyphen.
How you want to handle the full scope of what could be considered 'punctuation' across the Unicode table is probably out of scope for this question, but you could either come up with your own ad-hoc list or use a third-party library designed for that type of text manipulation. Here is one such approach:
Best way to strip punctuation from a string
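One ad-hoc option that stays within the standard library is to drop every character whose Unicode general category starts with "P" (punctuation). This is only a sketch: the function name `strip_punctuation` is made up here, and whether symbols such as $ should also go is a decision left to you.

```python
import unicodedata


def strip_punctuation(text):
    """Remove every character that Unicode classifies as punctuation.

    Categories beginning with "P" cover ASCII punctuation as well as
    curly quotes, the em dash and the horizontal ellipsis. Symbols such
    as "$" are in category "Sc" (currency symbol) and are kept.
    """
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


sample = "“The media says…wait.” Fox ‘News.’ number-one—$6 million"
print(strip_punctuation(sample))
```

Note that deleting … or — outright glues the neighbouring words together ("sayswait"), so for word frequencies it may be better to replace those characters with a space instead.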
Here is the list of characters that your implementation removes from the string:
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
You can use this implementation to remove all special characters and keep spaces:
''.join(e for e in s if e.isalnum() or e == ' ')
It looks like the … and a couple of the other characters you are having trouble with are special Unicode characters. A workaround is to use str.isalpha(), which tells you whether the characters of a string are alphabetic or not. Here s is the input text:
result = ""
for x in s:
    if x.isalpha() or x == " ":
        result = result + x
For example, suppose I have a sentence with a set length, sentence1 = "How do you learn 21 code?", and then a few more: sentence2 = "Wow,coding is 4353 so much fun!", sentence3 = ..., and so on.
The goal is to keep the important information and get rid of the rest, so that we can store it in a variable and call it a day.
The sentences are grammatically wrong on purpose: in my situation there is always the same amount of space between the first letter and the first number.
With that, I would love it if anyone could answer this for me and tell me if there's any missing information.
You don't have to know the length. You could just replace the element with the empty string, like this:
sentence1 = 'How much do you 21 like to code?'
sentence1 = sentence1.replace('21 ', '')
print(sentence1)
How much do you like to code?
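If the number differs from sentence to sentence, replacing a fixed string no longer works. A possible generalization, assuming the part to discard is always a standalone run of digits, is to remove it with a regular expression (the helper name `drop_numbers` is hypothetical):

```python
import re


def drop_numbers(sentence):
    """Remove standalone runs of digits and the whitespace that follows them."""
    return re.sub(r'\b\d+\b\s*', '', sentence)


sentence1 = "How do you learn 21 code?"
sentence2 = "Wow,coding is 4353 so much fun!"
print(drop_numbers(sentence1))
print(drop_numbers(sentence2))
```

A number at the very end of a sentence would leave a trailing space behind, which `str.rstrip()` can remove if needed.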
I'm trying to split a piece of sample text into a list of sentences, without the delimiters and with no spaces at the end of each sentence.
Sample text:
The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war? Is AI a bad thing?
Into this (desired output):
['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing']
My code is currently:
import re
def sent_tokenize(text):
    sentences = re.split(r"[.!?]", text)
    sentences = [sent.strip(" ") for sent in sentences]
    return sentences
However this outputs (current output):
['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing', '']
Notice the extra '' on the end.
Any ideas on how to remove the extra '' at the end of my current output?
nltk's sent_tokenize
If you're in the business of NLP, I'd strongly recommend sent_tokenize from the nltk package.
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(text)
[
'The first time you see The Second Renaissance it may look boring.',
'Look at it at least twice and definitely watch part 2.',
'It will change your view of the matrix.',
'Are the human people the ones who started the war?',
'Is AI a bad thing?'
]
It's a lot more robust than regex, and provides a lot of options to get the job done. More info can be found at the official documentation.
If you are picky about the trailing delimiters, you can use nltk.tokenize.RegexpTokenizer with a slightly different pattern:
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'[^.?!]+')
>>> list(map(str.strip, tokenizer.tokenize(text)))
[
'The first time you see The Second Renaissance it may look boring',
'Look at it at least twice and definitely watch part 2',
'It will change your view of the matrix',
'Are the human people the ones who started the war',
'Is AI a bad thing'
]
Regex-based re.split
If you must use regex, then you'll need to modify your pattern by adding a negative lookahead -
>>> list(map(str.strip, re.split(r"[.!?](?!$)", text)))
[
'The first time you see The Second Renaissance it may look boring',
'Look at it at least twice and definitely watch part 2',
'It will change your view of the matrix',
'Are the human people the ones who started the war',
'Is AI a bad thing?'
]
The added (?!$) specifies that you split only when you have not yet reached the end of the line. Unfortunately, I am not sure the trailing delimiter on the last sentence can be reasonably removed without doing something like result[-1] = result[-1][:-1].
You can use filter to remove the empty elements.
Ex:
import re
text = """The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war? Is AI a bad thing?"""
def sent_tokenize(text):
    sentences = re.split(r"[.!?]", text)
    sentences = [sent.strip(" ") for sent in sentences]
    # filter(None, ...) drops empty strings; list() is needed in Python 3,
    # where filter returns a lazy iterator.
    return list(filter(None, sentences))
print(sent_tokenize(text))
Any ideas on how to remove the extra '' at the end of my current
output?
You could remove it by slicing off the last element:
sentences = sentences[:-1]
Or, faster and in place (by ᴄᴏʟᴅsᴘᴇᴇᴅ):
del sentences[-1]
Output:
['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing']
You could either strip your paragraph first before splitting it or filter empty strings in the result out.
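Both options can be sketched in a few lines. This assumes "stripping" means removing the trailing delimiters as well as whitespace, since stripping whitespace alone would still leave the final "?" to split on; the variable names are only illustrative.

```python
import re

text = ("The first time you see The Second Renaissance it may look boring. "
        "Look at it at least twice and definitely watch part 2. "
        "It will change your view of the matrix. "
        "Are the human people the ones who started the war? Is AI a bad thing?")

# Option 1: strip trailing delimiters and whitespace first, then split.
stripped_first = [s.strip() for s in re.split(r"[.!?]", text.rstrip(".!? \n"))]

# Option 2: split first, then filter out pieces that are empty after stripping.
filtered = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]

print(stripped_first)
print(filtered)
```

Option 2 is a little more forgiving, because it also drops empty pieces produced in the middle of the text, for example by "..." or "?!".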
I am trying to print a list of sentences from a text file (one of the Project Gutenberg eBooks). When I print the file as a single string it looks fine:
file = open('11.txt','r+')
alice = file.read()
print(alice[:500])
Output is:
ALICE'S ADVENTURES IN WONDERLAND
Lewis Carroll
THE MILLENNIUM FULCRUM EDITION 3.0
CHAPTER I. Down the Rabbit-Hole
Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversations?'
So she was considering in her own mind (as well as she could, for the
hot d
Now, when I split it into sentences (The assignment was specifically to do this by "splitting at the periods," so it's a very simplified split), I get this:
>>> print(sentences[:5])
["ALICE'S ADVENTURES IN WONDERLAND\n\nLewis Carroll\n\nTHE MILLENNIUM FULCRUM EDITION 3", '0\n\n\n\n\nCHAPTER I', " Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or\nconversations?'\n\nSo she was considering in her own mind (as well as she could, for the\nhot day made her feel very sleepy and stupid), whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up and\npicking the daisies, when suddenly a White Rabbit with pink eyes ran\nclose by her", "\n\nThere was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit say to itself, 'Oh dear!\nOh dear! I shall be late!' (when she thought it over afterwards, it\noccurred to her that she ought to have wondered at this, but at the time\nit all seemed quite natural); but when the Rabbit actually TOOK A WATCH\nOUT OF ITS WAISTCOAT-POCKET, and looked at it, and then hurried on,\nAlice started to her feet, for it flashed across her mind that she had\nnever before seen a rabbit with either a waistcoat-pocket, or a watch\nto take out of it, and burning with curiosity, she ran across the field\nafter it, and fortunately was just in time to see it pop down a large\nrabbit-hole under the hedge", '\n\nIn another moment down went Alice after it, never once considering how\nin the world she was to get out again']
Where do the extra "\n" characters come from and how can I remove them?
If you want to replace all the newlines with one space, do this:
import re
new_sentences = [re.sub(r'\n+', ' ', s) for s in sentences]
You may not want to use regex, but I would do:
import re
new_sentences = []
for s in sentences:
    new_sentences.append(re.sub(r'\n{2,}', '\n', s))
This should replace all instances of two or more '\n' with a single newline, so you still have newlines, but don't have "extra" newlines.
If you want to avoid creating a new list, and instead modify the existing one (credit to @gavriel and Andrew L.: I hadn't thought of using enumerate when I first posted my answer):
import re
for i, s in enumerate(sentences):
    sentences[i] = re.sub(r'\n{2,}', '\n', s)
The extra newlines aren't really extra, by which I mean they are meant to be there and are visible in the text in your question: the more '\n' there are, the more space there is visible between the lines of text (i.e., one between the chapter heading and the first paragraph, and many between the edition and the chapter heading).
You'll understand where the \n characters come from with this little example:
alice = """ALICE'S ADVENTURES IN WONDERLAND
Lewis Carroll
THE MILLENNIUM FULCRUM EDITION 3.0
CHAPTER I. Down the Rabbit-Hole
Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversations?'
So she was considering in her own mind (as well as she could, for the
hot d"""
print(len(alice.split(".")))
print(len(alice.split("\n")))
It all depends on the way you're splitting your text. The above example will give this output:
3
19
This means there are 3 substrings if you split the text using . and 19 substrings if you split it using \n as the separator. You can read more about str.split.
In your case you've split your text using ., so the 3 substrings will contain multiple newline characters \n. To get rid of them you can either split these substrings again or remove them using str.replace.
The text uses newlines as well as full stops to delimit sentences. The catch is that just replacing the newline characters with an empty string will leave words without spaces between them. Before you split alice by '.', I would use something along the lines of @elethan's solution to replace every run of multiple newlines in alice with a '.'. Then you could do alice.split('.'), and all of the sentences separated by multiple newlines would be split appropriately, along with the sentences that were separated by . initially.
Then your only issue is the decimal point in the version number.
file = open('11.txt','r+')
file.read().split('\n')
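The approach described in this answer might look roughly like the sketch below. It assumes a run of two or more newlines marks a paragraph break and a single newline is only a line wrap; the function name `split_sentences` and the sample text are made up for illustration, and a decimal point such as the one in "3.0" would still be split.

```python
import re


def split_sentences(text):
    """Split at periods, treating paragraph breaks as sentence ends too."""
    # A run of two or more newlines (a paragraph break) becomes a period.
    text = re.sub(r'\n{2,}', '.', text)
    # A single newline is only a line wrap, so turn it into a space.
    text = text.replace('\n', ' ')
    # Split at periods and drop pieces that are empty after stripping.
    return [s.strip() for s in text.split('.') if s.strip()]


sample = "TITLE\n\nAuthor Name\n\nFirst line of a\nwrapped sentence. Second one."
print(split_sentences(sample))
```

For the file in the question, the function would be called on the string returned by file.read().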