I must use Python to print the number of words and mean length of words in each sentence of a text file. I cannot use NLTK or Regex for this assignment.
The sentence in the file ends with a period, exclamation point, or question mark. A hyphen, dash, or apostrophe does not end a sentence. Quotation marks do not end a sentence. But also, some periods do not end sentences. For example, Mrs., Mr., Dr., Fr., Jr., St., are all commonly occurring abbreviations.
For example, if input text is:
"My name? Bob. Your name? Lily! Hi there"
...output should be:
[(no. of words, mean length of words in sentence1),
(no. of words, mean length of words in sentence2),
...]
The code:
p = ("Mrs.", "Mr.", "St.")

def punct_after_ab(texts):
    new_text = texts
    for abb in p:
        new_text = new_text.replace(abb, abb[:-1])
    return print(new_text)
import numpy

def word_list(text):
    special_characters = ["'", ","]
    clean_text = text
    for string in special_characters:
        clean_text = clean_text.replace(string, "")
    count_list = [len(i) for i in clean_text.split()]
    count = [numpy.mean(count_list)]
    return print((count_list), (count))
But when I tested this, it does not split sentences.
Use something along the lines of .split(' ') to separate the words (in the stated case by spaces) and then use array operations and basic math/statistics to get your answers. If you update your question to be more specific and include some of your own code I would be willing to revise my answer accordingly.
You will find that on this site if you do not put much effort into the question you are asking, you aren't going to get very helpful answers. Try doing some research and writing as much code as you can before asking questions. This makes it much easier for people to help you and they will be more willing. As of right now it seems like you are just trying to get someone to do your homework for you.
Update:
Your code works for the most part; there are just some things you need to change. I played around with what you have and was able to break the text down into an array of sentences, from which you could continue to run statistics.
input.txt:
My name? Mr. Bob. Your name? Mrs. Lily!
What's up?
test.py (I use python 3.6):
def punct_after_ab(texts):
    p = ("Mrs.", "Mr.", "St.")
    new_text = texts
    for abb in p:
        new_text = new_text.replace(abb, abb[:-1])
    return new_text

def clean_text(text):
    special_characters = ["'", ","]
    clean_text = text
    for string in special_characters:
        clean_text = clean_text.replace(string, "")
    return clean_text
def split_sentence(text):
    # Initialize vars
    sentences = []
    start = 0
    i = 0
    # Loop through the text until you find punctuation,
    # then add the sentence to the final array
    for char in text:
        if char == '.':
            sentences.append(text[start:i+1])
            start = i + 2
        if char == '?':
            sentences.append(text[start:i+1])
            start = i + 2
        if char == '!':
            sentences.append(text[start:i+1])
            start = i + 2
        i += 1
    # Print the sentences to console
    for sentence in sentences:
        print(sentence)
def main():
    # Ask user for file name
    file = input("Enter file name: ")
    # Open the file and strip newline chars
    fd = open(file).read()
    fd = fd.strip("\n")
    # Remove punctuation that doesn't delineate sentences
    text = punct_after_ab(fd)
    text = clean_text(text)
    # Separate sentences
    split_sentence(text)

# Run program
if __name__ == '__main__':
    main()
I was able to get this to output the text below:
Enter file name: input.txt
My name?
Mr Bob.
Your name?
Mrs Lily!
Whats up?
Process finished with exit code 0
From there you can easily do your sentence statistics. I just typed this up, so you'll probably want to go through it and clean it up a bit. I hope this helps.
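For the statistics themselves, a minimal sketch might look like this (assuming split_sentence is changed to return its sentences list instead of printing it):

```python
def sentence_stats(sentences):
    # For each sentence, count the words and compute the mean word length
    stats = []
    for sentence in sentences:
        # Strip the terminal punctuation before measuring word lengths
        words = sentence.strip(".?!").split()
        lengths = [len(w) for w in words]
        mean = sum(lengths) / len(lengths) if lengths else 0
        stats.append((len(words), mean))
    return stats

print(sentence_stats(["My name?", "Mr Bob.", "Whats up?"]))
# [(2, 3.0), (2, 2.5), (2, 3.5)]
```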
Related
I want to replace the beginning of each word in a sentence if the beginning matches string x, regardless of uppercase lowercase.
For example:
s = "My fAther is your grandFAther and Family is important"
replacing "FA" with "zZ" in the beginning of each word in the sentence results in "My zZther is your grandFAther and zZamily is important".
A regex would be very helpful, but if that's not possible, other code would be OK. I honestly can't find an effective solution.
Also, if you can give me a regex for when you want to replace the word termination, it would be very nice. eg: "green blablaen enenbla" -> "grezz blablazz enenbla".
thank you very much :D!
My code (inputs: s, value_to_replace, new_value):
import re

def replace_start(s, value_to_replace, new_value):
    w = s.split()
    for i in range(len(w)):
        w[i] = re.sub(pattern='^' + value_to_replace, repl=new_value,
                      string=w[i], flags=re.IGNORECASE)
    return ' '.join(w)
But I don't know if this is efficient; maybe there is somehow a regex that applies over the whole sentence, rather than over each word separately.
Give this a shot, I think it'll get you what you want:
import re

def main():
    text = "My father is your grandFAther and family is important"
    text = re.sub(r"\bfa", "zZ", text, flags=re.IGNORECASE)
    print(text)

    term_text = "green blablaen enenbla"
    term_text = re.sub(r"\w{2}\b", "zz", term_text, flags=re.IGNORECASE)
    print(term_text)

if __name__ == "__main__":
    main()
My zZther is your grandFAther and zZmily is important
grezz blablazz enenbzz
You can certainly iterate/loop by splitting the sentence, but this is probably the direction you want to go in.
What we are taking advantage of is word boundaries (the \b). The \w is a character class ([A-Za-z0-9_]) and the {2} is the number of characters to its immediate left. You can read more about word boundaries here; they are extremely powerful! :) https://www.regular-expressions.info/wordboundaries.html
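A quick demo of how \b anchors a match to the edge of a word (using a made-up string just for illustration):

```python
import re

text = "fat father grandfather"
# \bfa matches "fa" only at the start of a word, so the "fa" inside
# "grandfather" is left alone
result = re.sub(r"\bfa", "zZ", text)
print(result)  # zZt zZther grandfather
```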
note: nice update/edit to your question :)
# Reading line of text
text = input("Enter text: ")
print("English: ", text)

# Removing punctuation
text = removePunctuation(text)

# Converting to lower case
text = text.lower()

# Iterating over words
for word in text.split(" "):
    # Converting word to Pig Latin form
    word = pigLatin(word)
    # Printing word
    print(word, end=" ")
How can I get it to say Pig: and then the Pig Latin form? Every time I try this, it just adds the Pig Latin transformation to the previous word.
You're trying to make one variable do two things at once.
If you want to use the old value later (in your print) then quit destroying the original value:
pl_word = pigLatin(word)
print(word, pl_word)
print("Pig:", pl_word)
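Putting that advice together, a minimal sketch (the pigLatin body here is a hypothetical stand-in for your real translation function) keeps the English and Pig Latin forms in separate variables so neither is destroyed:

```python
def pigLatin(word):
    # Hypothetical stand-in for your real pigLatin function
    return word[1:] + word[0] + "ay"

text = "hello world"
english = []
pig = []
for word in text.split():
    english.append(word)        # original word survives untouched
    pig.append(pigLatin(word))  # translation kept separately

print("English:", " ".join(english))
print("Pig:", " ".join(pig))
```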
Next time, please use capital letters and some markdown... https://stackoverflow.com/editing-help
I have two text files: 1.txt is a dictionary of words, and the other, 2.txt, contains phrases. Now I would like to check for the common words in 1.txt and 2.txt, and I want to replace those common words with a third word, "explain".
I have tried many ways to crack this but failed. Can anyone help me?
Code I have used:
wordsreplace = open("1.txt", 'r')
with open("2.txt") as main:
    words = main.read().split()

replaced = []
for y in words:
    if y in wordreplace:
        replaced.append(wordreplace[y])
    else:
        replaced.append(y)
text = ' '.join(replaced)

replaced = []
for y in words:
    replacement = wordreplace.get(y, y)
    replaced.append(replacement)
text = ' '.join(replaced)

text = ' '.join(wordreplace.get(y, y) for y in words)

new_main = open("2.txt", 'w')
new_main.write(text)
new_main.close()
This code writes to 2.txt, but the words are not replaced.
I don't want to point out problems in your code, because this task can basically be done in a few lines. Here's a self-contained example (no files, only text input):
First, create a set called words that you can look into when needed (passing read().split() to set() will do when reading from a file: words = set(main.read().split())).
Now use a word-boundary word regex and a replacement function.
The replacement function issues the word if it is not found in the dictionary; else, it issues "explain":
import re

words = {"computer", "random"}
text = "computer sometimes yields random results"

new_text = re.sub(r"\b(\w+)\b",
                  lambda m: "explain" if m.group(1) in words else m.group(1),
                  text)
print(new_text)
So the replacement is handled by the regex engine, calling my lambda when there's a match, so I can decide whether to replace the word, or issue it back again.
result:
explain sometimes yields explain results
Of course this doesn't handle plurals (computers, ...) which must be in the dictionary as well.
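Applied to the original two-file setup, the same idea can be wrapped in a small helper (replace_common is a name I'm introducing here; the file-reading lines in the comment assume 1.txt holds the dictionary and 2.txt the phrases, as in the question):

```python
import re

def replace_common(text, words):
    # Replace any word found in the dictionary set with "explain"
    return re.sub(r"\b(\w+)\b",
                  lambda m: "explain" if m.group(1) in words else m.group(1),
                  text)

# With files it would be something like:
#   words = set(open("1.txt").read().split())
#   text = open("2.txt").read()
#   open("2.txt", "w").write(replace_common(text, words))

print(replace_common("computer sometimes yields random results",
                     {"computer", "random"}))
# explain sometimes yields explain results
```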
I'm trying to use one function to count the number of words in a text file, after this text file has been "cleaned" up to include only letters and single spaces. So I have my first function, which I want to clean up the text file, and then my next function to return the length of the result of the previous function (the cleaned text). Here are those two functions:
def cleanUpWords(file):
    words = (file.replace("-", " ").replace("  ", " ").replace("\n", " "))
    onlyAlpha = ""
    for i in words:
        if i.isalpha() or i == " ":
            onlyAlpha += i
    return onlyAlpha
So words is the text file cleaned up without double spaces, hyphens, or line feeds.
Then I take out all numbers and return the cleaned-up onlyAlpha text.
Now, if I put return len(onlyAlpha.split()) instead of just return onlyAlpha, it gives me the correct number of words in the file (I know because I have the answer). But if I do it this way and try to split it into two functions, it screws up the word count. Here's what I'm talking about (my word-counting function):
def numWords(newWords):
    '''Function finds the amount of words in the text file by returning
    the length of the cleaned up version of words from cleanUpWords().'''
    return len(newWords.split())
newWords I define in main(), where newWords = cleanUpWords(harper); harper is a variable set by another read function (beside the point).
def main():
    harper = readFile("Harper's Speech.txt")  # readFile function reads
    newWords = cleanUpWords(harper)
    print(numWords(harper), "Words.")
Given all of this, please tell me why it gives a different answer when I split it into two functions.
For reference, here is the version that counts the words correctly but doesn't split the word cleaning and word counting into separate functions; numWords cleans and counts now, which isn't preferred.
def numWords(file):
    '''Function finds the amount of words in the text file by returning
    the length of the cleaned up version of words from cleanUpWords().'''
    words = (file.replace("-", " ").replace("  ", " ").replace("\n", " "))
    onlyAlpha = ""
    for i in words:
        if i.isalpha() or i == " ":
            onlyAlpha += i
    return len(onlyAlpha.split())
def main():
    harper = readFile("Harper's Speech.txt")
    print(numWords(harper), "Words.")
Hope I gave enough info.
The problem is quite simple: you split it into two functions, but you completely ignore the result of the first function and instead calculate the number of words before the cleanup!
Change your main function to this, then it should work.
def main():
    harper = readFile("Harper's Speech.txt")
    newWords = cleanUpWords(harper)
    print(numWords(newWords), "Words.")  # use newWords here!
Also, your cleanUpWords function could be improved a bit. It can still leave double or triple spaces in the text, and you could also make it a bit shorter. You could use regular expressions:
import re

def cleanUpWords(string):
    only_alpha = re.sub("[^a-zA-Z]", " ", string)
    single_spaces = re.sub(r"\s+", " ", only_alpha)
    return single_spaces
Or you could first filter out all the illegal characters, and then split the words and join them back together with a single space.
def cleanUpWords(string):
    only_alpha = ''.join(c for c in string if c.isalpha() or c == ' ')
    single_spaces = ' '.join(only_alpha.split())
    return single_spaces
Example, for which your original function would leave some double spaces:
>>> s = "text with   triple spaces and other \n sorts \t of strange ,.-#+ stuff and 123 numbers"
>>> cleanUpWords(s)
'text with triple spaces and other sorts of strange stuff and numbers'
(Of course, if you intend to split the words anyway, double spaces are not a problem.)
I have written a really good program that uses text files as word banks for generating sentences from sentence skeletons. An example:
The skeleton
"The noun is good at verbing nouns"
can be made into a sentence by searching a word bank of nouns and verbs to replace "noun" and "verb" in the skeleton. I would like to get a result like
"The dog is good at fetching sticks"
Unfortunately, the handy replace() method was designed with speed, not custom functions, in mind. I made methods that accomplish the task of selecting random words from the right banks, but doing something like skeleton = skeleton.replace('noun', getNoun(file.txt)) replaces ALL instances of 'noun' with the result of a single getNoun() call, instead of calling it for each replacement. So the sentences look like
"The dog is good at fetching dogs"
How might I work around this feature of replace() and make my method get called for each replacement? My minimum length code is below.
import random

def getRandomLine(rsv):
    # Parameter must be a return-separated value text file whose first
    # line contains the number of lines in the file.
    f = open(rsv, 'r')        # file handle in read mode
    n = int(f.readline())     # number of lines in file
    n = random.randint(1, n)  # line number chosen to use
    s = ""                    # string to hold data
    for x in range(1, n):
        s = f.readline()
    s = s.replace("\n", "")
    return s
def makeSentence(rsv):
    # Parameter must be a return-separated value text file whose first
    # line contains the number of lines in the file.
    pattern = getRandomLine(rsv)  # get a random pattern from file
    # Replace word tags with random words from matching files
    pattern = pattern.replace('noun', getRandomLine('noun.txt'))
    pattern = pattern.replace('verb', getRandomLine('verb.txt'))
    return str(pattern)
def main():
    result = makeSentence('pattern.txt')
    print(result)

main()
The re module's re.sub function does the job str.replace does, but with far more abilities. In particular, it offers the ability to pass a function for the replacement, rather than a string. The function is called once for each match with a match object as an argument and must return the string that will replace the match:
import re
pattern = re.sub('noun', lambda match: getRandomLine('noun.txt'), pattern)
The benefit here is added flexibility. The downside is that if you don't know regexes, the fact that the replacement interprets 'noun' as a regex may cause surprises. For example,
>>> re.sub('Aw, man...', 'Match found.', 'Aw, manatee.')
'Match found.e.'
If you don't know regexes, you may want to use re.escape to create a regex that will match the raw text you're searching for even if the text contains regex metacharacters:
>>> re.sub(re.escape('Aw, man...'), 'Match found.', 'Aw, manatee.')
'Aw, manatee.'
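Applied to the question's problem, a minimal sketch looks like this (fill_pattern is a name I'm introducing, and the in-memory nouns list stands in for reading noun.txt; the real code would call getRandomLine inside the lambda instead):

```python
import re
import random

nouns = ["dog", "stick", "cat"]  # stand-in for reading words from noun.txt

def fill_pattern(pattern):
    # The lambda runs once per match, so each "noun" tag can get a
    # different random word (unlike str.replace, which substitutes
    # one fixed string everywhere)
    return re.sub(r'noun', lambda m: random.choice(nouns), pattern)

result = fill_pattern("The noun is good at fetching nouns")
print(result)  # e.g. "The dog is good at fetching sticks"
```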
I don't know if you are asking to edit your code or to write new code, so I wrote new code:
import random

verbs = open('verb.txt').read().split()
nouns = open('noun.txt').read().split()

def makeSentence(sent):
    sent = sent.split()
    for k in range(0, len(sent)):
        if sent[k] == 'noun':
            sent[k] = random.choice(nouns)
        elif sent[k] == 'nouns':
            sent[k] = random.choice(nouns) + 's'
        elif sent[k] == 'verbing':
            sent[k] = random.choice(verbs)
    return ' '.join(sent)

var = raw_input('Enter: ')
print makeSentence(var)
This runs as:
$ python make.py
Enter: the noun is good at verbing nouns
the mouse is good at eating cats