Capitalizing the beginning of sentences in Python - python

The following code is for an assignment that asks that a string of sentences is entered from a user and that the beginning of each sentence is capitalized by a function.
For example, if a user enters: 'hello. these are sample sentences. there are three of them.'
The output should be: 'Hello. These are sample sentences. There are three of them.'
I have created the following code:
def main():
sentences = input('Enter sentences with lowercase letters: ')
capitalize(sentences)
#This function capitalizes the first letter of each sentence
def capitalize(user_sentences):
sent_list = user_sentences.split('. ')
new_sentences = []
count = 0
for count in range(len(sent_list)):
new_sentences = sent_list[count]
new_sentences = (new_sentences +'. ')
print(new_sentences.capitalize())
main()
This code has two issues that I am not sure how to correct. First, it prints each sentence as a new line. Second, it adds an extra period at the end. The output from this code using the sample input from above would be:
Hello.
These are sample sentences.
There are three of them..
Is there a way to format the output to be one line and remove the final period?

The following works for reasonably clean input:
>>> s = 'hello. these are sample sentences. there are three of them.'
>>> '. '.join(x.capitalize() for x in s.split('. '))
'Hello. These are sample sentences. There are three of them.'
If there is more varied whitespace around the full-stop, you might have to use some more sophisticated logic:
>>> '. '.join(x.strip().capitalize() for x in s.split('.'))
Which normalizes the whitespace which may or may not be what you want.

def main():
sentences = input('Enter sentences with lowercase letters: ')
capitalizeFunc(sentences)
def capitalizeFunc(user_sentences):
sent_list = user_sentences.split('. ')
print(".".join((i.capitalize() for i in sent_list)))
main()
Output:
Enter sentences with lowercase letters: "hello. these are sample sentences. there are three of them."
Hello.These are sample sentences.There are three of them.

I think this might be helpful:
>>> sentence = input()
>>> '. '.join(map(lambda s: s.strip().capitalize(), sentence.split('.')))

>>> s = 'hello. these are sample sentences. there are three of them.'
>>> '. '.join(map(str.capitalize, s.split('. ')))
'Hello. These are sample sentences. There are three of them.'

This code has two issues that I am not sure how to correct. First, it prints each sentence as a new line.
That’s because you’re printing each sentence with a separate call to print. By default, print adds a newline. If you don’t want it to, you can override what it adds with the end keyword parameter. If you don’t want it to add anything at all, just use end=''
Second, it adds an extra period at the end.
That’s because you’re explicitly adding a period to every sentence, including the last one.
One way to fix this is to keep track of the index as well as the sentence as you’re looping over them—e.g., with for index, sentence in enumerate(sentences):. Then you only add the period if index isn’t the last one. Or, slightly more simply, you add the period at the start, if the index is anything but zero.
However, theres a better way out of both of these problems. You split the string into sentences by splitting on '. '. You can join those sentences back into one big string by doing the exact opposite:
sentences = '. '.join(sentences)
Then you don’t need a loop (there’s one hidden inside join of course), you don’t need to worry about treating the last or first one special, and you only have one print instead of a bunch of them so you don’t need to worry about end.
A different trick is to put the cleverness of print to work for you instead of fighting it. Not only does it add a newline at the end by default, it also lets you print multiple things and adds a space between them by default. For example, print(1, 2, 3) or, equivalently, print(*[1, 2, 3]) will print out 1 2 3. And you can override that space separator with anything else you want. So you can print(*sentences, sep='. ', end='') to get exactly what you want in one go. However, this may be a bit opaque or over-clever to people reading your code. Personally, whenever I can use join instead (which is usually), I do that even though it’s a bit more typing, because it makes it more obvious what’s happening.
As a side note, a bit of your code is misleading:
new_sentences = []
count = 0
for count in range(len(sent_list)):
new_sentences = sent_list[count]
new_sentences = (new_sentences +'. ')
print(new_sentences.capitalize())
The logic of that loop is fine, but it would be a lot easier to understand if you called the one-new-sentence variable new_sentence instead of new_sentences, and didn’t set it to an empty list at the start. As it is, the reader is led to expect that you’re going to build up a list of new sentences and then do something with it, but actually you just throw that list away at the start and handle each sentence one by one.
And, while we’re at it, you don’t need count here; just loop over sent_list directly:
for sentence in sent_list:
new_sentence = sent + '. '
print(new_sentence.capitalize())
This does the same thing as the code you had, but I think it’s easier to understand that it does that think from a quick glance.
(Of course you still need the fixes for your two problems.)

Use nltk.sent_tokenize to tokenize the string into sentences. And capitalize each sentence, and join them again.
A sentence can't always end with a ., there can other things too, like a ?, or !. Also three consecutive dots ..., doesn't end the sentence. sent_tokenize will handle them all.
from nltk.tokenize import sent_tokenize
def capitalize(user_sentences):
sents = sent_tokenize(user_sentences)
capitalized_sents = [sent.capitalize() for sent in sents]
joined_ = ' '.join(capitalized_sents)
print(joined_)
The reason your sentences were being printed on separate lines, were because print always ends its output with a newline. So, printing sentences separately in loop will make them print on newlines. So, you should print them all at once, after joining them. Or, you can specify end='' in print statement, so it doesn't end the sentences with newline characters.
The second thing, about output being ended with an extra period, is because, you're appending '. ' with each of the sentence. The good thing about sent_tokenize is, it doesn't remove '.', '?', etc from the end of the sentences, so you don't have to append '. ' at the end manually again. Instead, you can just join the sentences with a space character, and you'll be good to go.
If you get an error for nltk not being recognized, you can install it by running pip install nltk on the terminal/cmd.

Related

Python: 4 words before a substring if they exsist

I want to have 4 words a particular word. if there are only 3 wrords before I want the 3 words to be printed.
Example:
input: There is a bad CAT sitting on the wall
output: is a bad
line is having the sentance.
if 'CAT' in line:
print(line.split('CAT')[0].split()[len((line.split('CAT')[0]))-3): len(line.split('CAT')[0])])
Can you let me know if I am missing anything and if there is any other efficent way.
Planning to do line.split(CAT)[0] to get all the data before cat.
again on that I want to getoutput of [0] starting at len-3 to len.
Its giving Error am I missing anything.
also can I add a condition if there are only 2 words print only 2
You're on the right track. If you want to get the three words before D in a string S, defaulting to fewer words if there's less than three available, you can use this:
S.split(D)[0].split()[-3:]
Examples:
>>> S = 'There is a bad CAT sitting on the wall'
>>> S.split('CAT')[0].split()[-3:]
['is', 'a', 'bad']
>>> S = 'The bad CAT is sitting on the wall'
>>> S.split('CAT')[0].split()[-3:]
['The', 'bad']
Of course, if you wish to join this back into a string, you can use:
' '.join(S.split(D)[0].split()[-3:])
This can also be accomplished using regular expressions, but I doubt it would offer much better performance.
Split the line at the start, then find the index of the word you want in the resulting list. You can then slice the list (making sure that you don't start the slice at less than zero), and join it back together again. If there are fewer than 3 words preceding it will only show what is there.
line = "There is a bad CAT sitting on the wall"
sline = line.split(' ')
if 'CAT' in sline:
pos = sline.index('CAT')
print(' '.join(sline[max(0, pos-3):pos]))
IMO, trying to do all this in one line makes things too confusing. I recommend that you break it up into smaller parts.
if line.find('CAT') != -1:
words = line.split('CAT')[0].strip().split(' ')
print(words[max(len(words) - 3, 0):])
Explanation of some things:
Yes, some people will think 'CAT' in line is more Pythonic, but I prefer line.find('CAT') != -1, as it won't ignore some errors that might occur if line isn't a string. See the str.find() documentation for details about the function.
The strip() in the 2nd line assures that trailing spaces are removed.
The final line finds the position of the third word before 'CAT', if there is one, and then prints the appropriate words out as a list. As noted in other answers, you and use str.join() to put them back together as a string if you want.

Remove unmixed numbers from file

Say I have a file called input.txt that looks like this
I listened to 4 u2 albums today
meet me at 5
squad 4ever
I want to filter out the numbers that are on their own, so "4" and "5" should go but "u2" and "4ever" should remain the same. i.e the output should be
I listened to u2 albums today
meet me at
squad 4ever
I've been trying to use this code
for line in fileinput.input("input.txt", inplace=True):
new_s = ""
for word in line.split(' '):
if not all(char.isdigit() for char in word):
new_s += word
new_s += ' '
print(new_s, end='')
Which is pretty similar to the code I found here: Removing numbers mixed with letters from string
But instead of the wanted output I get
I listened to u2 albums today
meet me at 5
squad 4ever
As you can see there are two problems here, first only the first line loses the digit I want it to lose, "5" is still present in the second line. The second problem is the extra white space at the beginning of a new line.
I've been playing around with the code for a while and browsing stackoverflow, but can't find where the problem is coming from. Any insights?
str.split(' ') does not remove the trailing newlines from each line. They end up attached to the last word of the line. So for your first problem, the '5' doesn't get removed because it's actually '5\n', and the \n is not a digit.
The second problem is related. When you print the last word of each line, it contains that newline, plus you're adding a space on to the end. That space shows up as the first character of the next line.
The simplest solution is simply to change line.split(' ') to line.split(). Without any arguments, split() will remove all whitespace, including the newlines. You'll also need to remove the end='' from your print so that the newlines are added back in.
Just use regexp.
re.sub(r"\b\d+\b", "", input)
match any digit between word boundaries
Or to avoid double spaces:
re.sub(r"\s\d+\s", " ", input)
You can use regex:
data = open('file.txt').read()
import re
data = re.sub('(?<=\s)\d+(?=$)|(?<=^)\d+(?<=\s)|(\s\d+\s)', '', data)
Output:
I listened tou2 albums today
meet me at
squad 4ever

Python Create List of Words Per Sentence and Calculate Mean and Place in CSV File

I'm looking to count the number of words per sentence, calculate the mean words per sentence, and put that info into a CSV file. Here's what I have so far. I probably just need to know how to count the number of words before a period. I might be able to figure it out from there.
#Read the data in the text file as a string
with open("PrideAndPrejudice.txt") as pride_file:
pnp = pride_file.read()
#Change '!' and '?' to '.'
for ch in ['!','?']:
if ch in pnp:
pnp = pnp.replace(ch,".")
#Remove period after Dr., Mr., Mrs. (choosing not to include etc. as that often ends a sentence although in can also be in the middle)
pnp = pnp.replace("Dr.","Dr")
pnp = pnp.replace("Mr.","Mr")
pnp = pnp.replace("Mrs.","Mrs")
To split a string into a list of strings on some character:
pnp = pnp.split('.')
Then we can split each of those sentences into a list of strings (words)
pnp = [sentence.split() for sentence in pnp]
Then we get the number of words in each sentence
pnp = [len(sentence) for sentence in pnp]
Then we can use statistics.mean to calculate the mean:
statistics.mean(pnp)
To use statistics you must put import statistics at the top of your file. If you don't recognize the ways I'm reassigning pnp, look up list comprehensions.
You might be interested in the split() function for strings. It seems like you're editing your text to make sure all sentences end in a period and every period ends a sentence.
Thus,
pnp.split('.')
is going to give you a list of all sentences. Once you have that list, for each sentence in the list,
sentence.split() # i.e., split according to whitespace by default
will give you a list of words in the sentence.
Is that enough of a start?
You can try the code below.
numbers_per_sentence = [len(element) for element in (element.split() for element in pnp.split("."))]
mean = sum(numbers_per_sentence)/len(numbers_per_sentence)
However, for real natural language processing I would probably recommend a more robust solution such as NLTK. The text manipulation you perform (replacing "?" and "!", removing commas after "Dr.","Mr." and "Mrs.") is probably not enough to be 100% sure that comma is always a sentence separator (and that there are no other sentence separators in your text, even if it happens to be true for Pride And Prejudice)

Count all occurrences of elements with and without special characters in a list from a text file in python

I really apologize if this has been answered before but I have been scouring SO and Google for a couple of hours now on how to properly do this. It should be easy and I know I am missing something simple.
I am trying to read from a file and count all occurrences of elements from a list. This list is not just whole words though. It has special characters and punctuation that I need to get as well.
This is what I have so far, I have been trying various ways and this post got me the closest:
Python - Finding word frequencies of list of words in text file
So I have a file that contains a couple of paragraphs and my list of strings is:
listToCheck = ['the','The ','the,','the;','the!','the\'','the.','\'the']
My full code is:
#!/usr/bin/python
import re
from collections import Counter
f = open('text.txt','r')
wanted = ['the','The ','the,','the;','the!','the\'','the.','\'the']
words = re.findall('\w+', f.read().lower())
cnt = Counter()
for word in words:
if word in wanted:
print word
cnt[word] += 1
print cnt
my output thus far looks like:
the
the
the
the
the
the
the
the
the
the
the
the
the
the
the
the
the
Counter({'the': 17})
It is counting my "the" strings with punctuation but not counting them as separate counters. I know it is because of the \W+. I am just not sure what the proper regex pattern to use here or if I'm going about this the wrong way.
I suspect there may be some extra details to your specific problem that you are not describing here for simplicity. However, I'll assume that what you are looking for is to find a given word, e.g. "the", which could have either an upper or lower case first letter, and can be preceded and followed either by a whitespace or by some punctuation characters such as ;,.!'. You want to count the number of all the distinct instances of this general pattern.
I would define a single (non-disjunctive) regular expression that define this. Something like this
import re
pattern = re.compile(r"[\s',;.!][Tt]he[\s.,;'!]")
(That might not be exactly what you are looking for in general. I just assuming it is based on what you stated above. )
Now, let's say our text is
text = '''
Foo and the foo and ;the, foo. The foo 'the and the;
and the' and the; and foo the, and the. foo.
'''
We could do
matches = pattern.findall(text)
where matches will be
[' the ',
';the,',
' The ',
"'the ",
' the;',
" the'",
' the;',
' the,',
' the.']
And then you just count.
from collections import Counter
count = Counter()
for match in matches:
count[match] += 1
which in this case would lead to
Counter({' the;': 2, ' the.': 1, ' the,': 1, " the'": 1, ' The ': 1, "'the ": 1, ';the,': 1, ' the ': 1})
As I said at the start, this might not be exactly what you want, but hopefully you could modify this to get what you want.
Just to add, a difficulty with using a disjunctive regular expression like
'the|the;|the,|the!'
is that the strings like "the," and "the;" will also match the first option, i.e. "the", and that will be returned as the match. Even though this problem could be avoided by more careful ordering of the options, I think it might not be easier in general.
The simplest option is to combine all "wanted" strings into one regular expression:
rr = '|'.join(map(re.escape, wanted))
and then find all matches in the text using re.findall.
To make sure longer stings match first, just sort the wanted list by length:
wanted.sort(key=len, reverse=True)
rr = '|'.join(map(re.escape, wanted))

Python NLTK not taking out punctuations correctly

I have defined the following code
exclude = set(string.punctuation)
lmtzr = nltk.stem.wordnet.WordNetLemmatizer()
wordList= ['"the']
answer = [lmtzr.lemmatize(word.lower()) for word in list(set(wordList)-exclude)]
print answer
I have previously printed exclude and the quotation mark " is part of it. I expected answer to be [the]. However, when I printed answer, it shows up as ['"the']. I'm not entirely sure why it's not taking out the punctuation correctly. Would I need to check each character individually instead?
When you create a set from wordList it stores the string '"the' as the only element,
>>> set(wordList)
set(['"the'])
So using set difference will return the same set,
>>> set(wordList) - set(string.punctuation)
set(['"the'])
If you want to just remove punctuation you probably want something like,
>>> [word.translate(None, string.punctuation) for word in wordList]
['the']
Here I'm using the translate method of strings, only passing in a second argument specifying which characters to remove.
You can then perform the lemmatization on the new list.

Categories

Resources