Ordering sentences according to their length - python

I tried using this code that I found online:
K=sentences
m=[len(i.split()) for i in K]
lengthorder= sorted(K, key=len, reverse=True)
#print(lengthorder)
#print("\n")
list1 = lengthorder
str1 = '\n'.join(list1)
print(str1)
print('\n')
Sentence1 = "We have developed speed, but we have shut ourselves in"
res = len(Sentence1.split())
print ("The longest sentence in this text contains" + ' ' + str(res) + ' ' + "words.")
Sentence2 = "More than cleverness we need kindness and gentleness"
res = len(Sentence2.split())
print ("The second longest sentence in this text contains" + ' ' + str(res) + ' ' + "words.")
Sentence3 = "Machinery that gives abundance has left us in want"
res = len(Sentence3.split())
print ("The third longest sentence in this text contains" + ' ' + str(res) + ' ' + "words.")
But it doesn't sort the sentences by number of words; it sorts them by their actual length in characters.

You can simply iterate through the sentences and split each one into words, like this:
text = " We have developed speed. but we have. shut ourselves in Machinery that. gives abundance has left us in want Our knowledge has made us cynical Our cleverness, hard and unkind We think too much and feel too little More than machinery we need humanity More than cleverness we need kindness and gentleness"
# split into sentences
text2array = text.split(".")
# iterate through the sentences and split each one into words
for i, sentence in enumerate(text2array):
    text2array[i] = sentence.split()
# sort the sentences by word count, longest first
text2array.sort(key=len, reverse=True)
# iterate through the sentences and print them to the screen
for i, sentence in enumerate(text2array, start=1):
    print("the nr " + str(i) + " longest sentence is " + " ".join(sentence) + ".")

You can define a function that uses a regular expression to count the words in a given sentence:
import re
def get_word_count(sentence: str) -> int:
    return len(re.findall(r"\w+", sentence))
Assuming you already have a list of sentences, you can iterate the list and pass each sentence to the word count function then store each sentence and its word count in a dictionary:
sentences = [
"Assume that this sentence has one word. Really?",
"Assume that this sentence has more words than all sentences in this list. Obviously!",
"Assume that this sentence has more than one word. Duh!",
]
word_count_dict = {}
for sentence in sentences:
    word_count_dict[sentence] = get_word_count(sentence)
At this point, the word_count_dict contains sentences as keys and their associated word count as values.
You can then sort word_count_dict by values:
sorted_word_count_dict = dict(
    sorted(word_count_dict.items(), key=lambda item: item[1], reverse=True)
)
Here's the full snippet:
import re
def get_word_count(sentence: str) -> int:
    return len(re.findall(r"\w+", sentence))
sentences = [
"Assume that this sentence has one word. Really?",
"Assume that this sentence has more words than all sentences in this list. Obviously!",
"Assume that this sentence has more than one word. Duh!",
]
word_count_dict = {}
for sentence in sentences:
    word_count_dict[sentence] = get_word_count(sentence)
sorted_word_count_dict = dict(
    sorted(word_count_dict.items(), key=lambda item: item[1], reverse=True)
)
print(sorted_word_count_dict)
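If the dictionary itself is not needed, a shorter route is to pass the word-count function to sorted as the key. The following is only a minimal sketch: it reuses get_word_count from this answer, and sort_by_word_count is a hypothetical helper name. Unlike the dictionary version, it would keep duplicate sentences instead of collapsing them into one key.

```python
import re

def get_word_count(sentence):
    # Count runs of word characters, as in the answer above
    return len(re.findall(r"\w+", sentence))

def sort_by_word_count(sentences):
    # Hypothetical helper: sentence with the most words comes first
    return sorted(sentences, key=get_word_count, reverse=True)

sentences = [
    "Assume that this sentence has one word. Really?",
    "Assume that this sentence has more words than all sentences in this list. Obviously!",
    "Assume that this sentence has more than one word. Duh!",
]
for sentence in sort_by_word_count(sentences):
    print(get_word_count(sentence), sentence)
```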

Let's assume that your sentences are already separate and there is no need to detect the sentences.
So we have a list of sentences. Next we need to calculate the length of each sentence as a word count. The basic way is to split on whitespace, since each space separates two words in a sentence.
list_of_sen = ['We have developed speed, but we have shut ourselves in','Machinery that gives abundance has left us in want Our knowledge has made us cynical Our cleverness', 'hard and unkind We think too much and feel too little More than machinery we need humanity More than cleverness we need kindness and gentleness']
sen_len=[len(i.split()) for i in list_of_sen]
sen_len= sorted(sen_len, reverse=True)
for index, count in enumerate(sen_len):
    print(f'The {index+1} longest sentence in this text contains {count} words')
But if your sentences are not separated, we first need to recognize where each sentence ends and then split the text there. Your sample data does not contain any punctuation that could be used to separate sentences. If we assume that your data has punctuation, the answer below can be helpful.
see this question
from nltk import tokenize
p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."
tokenize.sent_tokenize(p)

Related

Looking for single exact word within string

I am trying to find a single exact word within a large string.
I have tried the below:
for word in words:
    if word in strings:
        best.append("The word " + word + " The Sentence " + strings)
    else:
        pass
This seemed to work at first, but when I tried it with a larger set of words in a much larger string I was getting partial matches. For example, if the word is "me", it would report "message" as a match.
Is there a way of searching for exactly "me"?
Thanks in advance.
You need to set boundaries in order to find a complete word. I'd go with regex. Something like:
re.search(r'\b' + re.escape(word_to_find) + r'\b', strings)
You can split the string into words and then perform the in operation, making sure you strip the words in the list and the string of any surrounding whitespace:
import string
def find_words(words, s):
    best = []
    # Strip extra whitespace, if any, around each word and make them all lowercase
    modified_words = [word.strip().lower() for word in words]
    # Strip punctuation from the string, and make it lowercase
    modified_s = s.translate(str.maketrans('', '', string.punctuation))
    words_list = [word.strip().lower() for word in modified_s.lower().split()]
    # Iterate through the list
    for idx, word in enumerate(modified_words):
        # If the word is found in the list of words, append to the result
        if word in words_list:
            best.append("The word " + words[idx] + " The Sentence " + s)
    return best
print(find_words(['me', 'message'], 'I me myself'))
print(find_words([' me ', 'message'], 'I me myself'))
print(find_words(['me', 'message'], 'I me myself'))
print(find_words(['me', 'message'], 'I am me.'))
print(find_words(['me', 'message'], 'I am ME.'))
print(find_words(['Me', 'message'], 'I am ME.'))
The output will be
['The word me The Sentence I me myself']
['The word me The Sentence I me myself']
['The word me The Sentence I me myself']
['The word me The Sentence I am me.']
['The word me The Sentence I am ME.']
['The word Me The Sentence I am ME.']
You can also use regex to find the word exactly. \\b matches a word boundary, i.e. the position between a word character and a non-word character such as a space or punctuation mark.
for word in words:
    if len(re.findall("\\b" + word + "\\b", strings)) > 0:
        best.append("The word " + word + " The Sentence " + strings)
    else:
        pass
The double backslashes are needed because '\b' in a regular Python string is the backspace control character. Source
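The difference between the spellings can be checked directly. This is a small sketch with an illustrative sample string that is not taken from the question; it compares the escaped form, the raw-string form, and the unescaped form that silently searches for backspace characters.

```python
import re

text = "send me a message"  # illustrative sample, not from the question

# In a normal string literal '\b' is one backspace character,
# while '\\b' and r'\b' are both the two characters backslash + b
print(len('\b'), len('\\b'), len(r'\b'))

matches_escaped = re.findall('\\bme\\b', text)
matches_raw = re.findall(r'\bme\b', text)
# This pattern looks for literal backspace characters around "me"
matches_backspace = re.findall('\bme\b', text)

print(matches_escaped, matches_raw, matches_backspace)
```

Using a raw string (r'...') is usually the easier way to avoid the problem.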
You could include the surrounding spaces in the if statement.
for word in words:
    if f' {word} ' in strings:
        best.append("The word " + word + " The Sentence " + strings)
    else:
        pass
To make sure you don't detect words inside other words that contain them (like "me" in "message" or "flame"), add spaces before and after the word in the check. Note that this misses a word at the very start or end of the string, or one next to punctuation. The easiest way of doing this is to replace
if word in strings:
with
if " "+word+" " in strings:
Hope this helps! -Theo
You need to set boundaries for your search, \b is the boundary character.
import re
string = 'youyou message me me me me me'
print(re.findall(r'\bme\b', string))
The string has message and me, we only need me explicitly. So added boundaries in my search expression. The result is below -
['me', 'me', 'me', 'me', 'me']
Got all the me(s), but not the message which also has a me in it.
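The same idea can be applied to a whole list of search words, as in the question. The sketch below is one possible way to do it, not the asker's code: find_exact_words is a hypothetical helper, and re.escape is added on the assumption that the search words might contain regex metacharacters.

```python
import re

def find_exact_words(words, text):
    # Hypothetical helper: return the words that occur in text as whole words
    found = []
    for word in words:
        # re.escape guards against words containing regex metacharacters
        pattern = r'\b' + re.escape(word) + r'\b'
        if re.search(pattern, text):
            found.append(word)
    return found

print(find_exact_words(["me", "mess", "message"], "youyou message me"))
```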
Without knowing the rest of the code, the best I could suggest is using == to get a direct match, for example:
words = ["Me", "Hello", "Message"]
query = input("What do you want to find?")
for candidate in words:
    # == only matches the whole word, so "Me" does not match "Message"
    if candidate == query:
        print("Found a match")
        break

Need help making a "compliment generator"

I'm trying to make a simple compliment generator that takes a noun and an adjective from two separate lists and randomly combines them. I can get one on its own to work, but trying to get the second word to appear makes weird stuff happen. What am I doing wrong here? Any input would be great.
import random
sentence = "Thou art a *adj *noun."
sentence = sentence.split()
adjectives = ["decadent", "smelly", "delightful", "volatile", "marvelous"]
indexCount= 0
noun = ["dandy", "peaseant", "mule", "maiden", "sir"]
wordCount= 0
for word in sentence:
    if word =="*adj":
        wordChoice = random.choice (adjectives)
        sentence [indexCount] = wordChoice
    indexCount += 1
for word in sentence:
    if "*noun" in word:
        wordChoice = random.choice (noun)
        sentence [wordCount] = wordChoice
        wordCount += 1
st =""
for word in sentence:
    st+= word + " "
print (st)
The end result nets me a double noun. How would I get rid of the duplicate?
You aren't incrementing wordCount in the second loop as you do indexCount in the first.
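One way to avoid the manual counters altogether is to let enumerate supply the index. This is a minimal sketch rather than the asker's exact code: make_compliment is a hypothetical function name, and str.replace is used so that the period attached to the noun placeholder survives.

```python
import random

adjectives = ["decadent", "smelly", "delightful", "volatile", "marvelous"]
nouns = ["dandy", "peasant", "mule", "maiden", "sir"]

def make_compliment(template="Thou art a *adj *noun."):
    words = template.split()
    # enumerate supplies the index, so no manual counter can get out of step
    for index, word in enumerate(words):
        if "*adj" in word:
            words[index] = word.replace("*adj", random.choice(adjectives))
        elif "*noun" in word:
            # replace() keeps the trailing period in "*noun."
            words[index] = word.replace("*noun", random.choice(nouns))
    return " ".join(words)

print(make_compliment())
```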

Need to edit a list in a for loop. [Python] [duplicate]

I'm taking a sentence and turning it into pig latin, but when I edit the words in the list it never stays.
sentence = input("Enter a sentence you want to convert to pig latin")
sentence = sentence.split()
for words in sentence:
    if words[0] in "aeiou":
        words = words+'yay'
And when I print sentence I get the same sentence I put in.
Another way to do it (includes some fixes):
sentence = input("Enter a sentence you want to convert to pig latin: ")
sentence = sentence.split()
for i in range(len(sentence)):
    if sentence[i][0] in "aeiou":
        sentence[i] = sentence[i] + 'yay'
sentence = ' '.join(sentence)
print(sentence)
That's because the loop never changes sentence itself; it only rebinds the loop variable.
To get the result you want, build a new string:
new_sentence = ''
for word in sentence:
    if word[0] in "aeiou":
        new_sentence += word +'yay' + ' '
    else:
        new_sentence += word + ' '
So now print new_sentence.
I set this up to return a string; if you would rather have a list, that can be accomplished just as easily:
new_sentence = []
for word in sentence:
    if word[0] in "aeiou":
        new_sentence.append(word + 'yay')
    else:
        new_sentence.append(word)
If you are working with a list and then want to convert it to a string, just use
" ".join(new_sentence)
It does not seem as though you are updating sentence.
sentence = input("Enter a sentence you want to convert to pig latin")
sentence = sentence.split()
# lambda and mapping instead of a loop
sentence = list(map(lambda word: word+'yay' if word[0] in 'aeiou' else word, sentence))
# instead of printing a list, print the sentence
sentence = ' '.join(sentence)
print(sentence)
PS. Kinda forgot some things about Python's for loop so I didn't use it. Sorry
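The point shared by the answers above can be seen in a few lines. This is an illustrative sketch with a made-up input sentence: assigning to the loop variable leaves the list untouched, while building a new list (here with a comprehension) keeps the changes.

```python
sentence = "apple banana orange".split()  # illustrative input

# Rebinding the loop variable only changes the local name, not the list
for word in sentence:
    if word[0] in "aeiou":
        word = word + "yay"
print(sentence)

# Building a new list keeps the converted words
converted = [word + "yay" if word[0] in "aeiou" else word for word in sentence]
print(" ".join(converted))
```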

How to remove stop words using string.replace()

I have a text file for which I am counting the number of lines, characters, and words. How can I clean the data by removing stop words such as (the, for, a) using string.replace()?
Ex. if the text file contains the line:
"The only words to count are Apple and Grapes for this text"
It should output:
2 Apple
2 Grapes
1 words
1 only
1 text
And should not output words like:
the
to
are
for
this
Below is the code I have as of now.
# Open the input file
fname = open('2013_honda_accord.txt', 'r').read()
# COUNT CHARACTERS
num_chars = len(fname)
# COUNT LINES
num_lines = fname.count('\n')
#COUNT WORDS
fname = fname.lower() # convert the text to lower first
words = fname.split()
d = {}
for w in words:
    # if the word is repeated - start count
    if w in d:
        d[w] += 1
    # if the word is only used once then give it a count of 1
    else:
        d[w] = 1
# Add the sum of all the repeated words
num_words = sum(d[w] for w in d)
lst = [(d[w], w) for w in d]
# sort the list of words in alpha for the same count
lst.sort()
# list word count from greatest to lowest (will also show the sort in reserve order Z-A)
lst.reverse()
# output the total number of characters
print('Your input file has characters = ' + str(num_chars))
# output the total number of lines
print('Your input file has num_lines = ' + str(num_lines))
# output the total number of words
print('Your input file has num_words = ' + str(num_words))
print('\n The 30 most frequent words are \n')
# print the number of words as a count from the text file with the sum of each word used within the text
i = 1
for count, word in lst[:10000]:
    print('%2s. %4s %s' % (i, count, word))
    i += 1
Thanks
After opening and reading the file (fname = open('2013_honda_accord.txt', 'r').read()), you can place this code:
blacklist = ["the", "to", "are", "for", "this"] # Blacklist of words to be filtered out
for word in blacklist:
    fname = fname.replace(word, "")
# The above leaves multiple spaces in the text (e.g. '  Apple  Grapes  Apple')
while "  " in fname:
    fname = fname.replace("  ", " ") # Replace double spaces by one while double spaces are in text
Edit:
To avoid problems with words containing the unwanted words, you may do it like this (assuming words are in sentence middle):
blacklist = ["the", "to", "are", "for", "this"] # Blacklist of words to be filtered out
for word in blacklist:
    fname = fname.replace(" " + word + " ", " ")
# Or .'!? etc.
A check for double spaces is not required here.
Hope this helps!
You can easily filter out those words by writing a simple function:
#This function drops the restricted words from a sentence.
#Input - sentence, list of restricted words (restricted list should be all lower case)
#Output - list of allowed words.
def restrict (sentence, restricted):
    return list(set([word for word in sentence.split() if word.lower() not in restricted]))
Then you can use this function whenever you want (before or after the word count).
For example:
restricted = ["the", "to", "are", "and", "for", "this"]
sentence = "The only words to count are Apple and Grapes for this text"
word_list = restrict(sentence, restricted)
print(word_list)
Would print (the order may vary, because a set is unordered):
["count", "Apple", "text", "only", "Grapes", "words"]
Of course you can add empty words removal (double spaces):
    return list(set([word for word in sentence.split() if word.lower() not in restricted and len(word) > 0]))
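Because the set in restrict discards duplicates, counting after it would report every word once. If the counts matter, one alternative is to filter without deduplicating and hand the result to collections.Counter. This is only a sketch: it does not use string.replace as the question asks, and count_without_stop_words is a hypothetical helper name.

```python
from collections import Counter

def count_without_stop_words(text, stop_words):
    # Hypothetical helper: skip stop words case-insensitively, keep duplicates
    stop = {w.lower() for w in stop_words}
    words = [w for w in text.split() if w.lower() not in stop]
    return Counter(words)

line = "The only words to count are Apple and Grapes for this text"
stop_words = ["the", "to", "are", "and", "for", "this"]
counts = count_without_stop_words(line, stop_words)

# most_common() lists words from the highest count to the lowest
for word, count in counts.most_common():
    print(count, word)
```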

Scanning for two word phrase in Python dictionary

I am trying to use a Python dictionary object to help translate an input string to other words or phrases. I am having success with translating single words from the input, but I can't seem to figure out how to translate multi-word phrases.
Example:
sentence = input("Please enter a sentence: ")
myDict = {"hello": "hi","mean adult":"grumpy elder", ...ect}
How can I return hi grumpy elder if the user enters hello mean adult for the input?
"fast car" is a key to the dictionary, so you can extract the value if you use the key coming back from it.
If you're taking the input straight from the user and using it to reference the dictionary, get is safer, as it allows you to provide a default value in case the key doesn't exist.
print(myDict.get(sentence, "Phrase not found"))
Since you've clarified your requirements a bit more, the hard part now is the splitting; the get doesn't change. If you can guarantee the order and structure of the sentences (that is, it's always going to be structured such that we have a phrase with 1 word followed by a phrase with 2 words), then split only on the first occurrence of a space character.
split_input = sentence.split(' ', 1)
print("{} {}".format(myDict.get(split_input[0]), myDict.get(split_input[1])))
More complex split requirements I leave as an exercise for the reader. A hint would be to use the keys of myDict to determine what valid tokens are present in the sentence.
The same way as you normally would.
translation = myDict['fast car']
A solution to your particular problem would be something like the following, where maxlen is the maximum number of words in a single phrase in the dictionary.
translation = []
words = sentence.split(' ')
maxlen = 3
index = 0
while index < len(words):
    # try the longest phrase first, then shorter ones
    for i in range(maxlen, 0, -1):
        phrase = ' '.join(words[index:index+i])
        if phrase in myDict:
            translation.append(myDict[phrase])
            index += i
            break
    else:
        # no phrase starting at this word is in the dictionary
        translation.append(words[index])
        index += 1
print(' '.join(translation))
Given the sentence hello this is a nice fast car, it outputs hi this is a sweet quick ride
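The same greedy longest-match idea can be wrapped in a function that derives maxlen from the dictionary keys instead of hard-coding it. This is a sketch under assumptions: translate and phrase_map are hypothetical names, and the sample dictionary only combines entries that appear in this thread.

```python
def translate(sentence, phrase_map):
    # Hypothetical helper: greedy longest-match replacement of known phrases
    words = sentence.split()
    # The longest key, measured in words, bounds how far ahead we look
    maxlen = max((len(key.split()) for key in phrase_map), default=1)
    result = []
    index = 0
    while index < len(words):
        for size in range(min(maxlen, len(words) - index), 0, -1):
            phrase = " ".join(words[index:index + size])
            if phrase in phrase_map:
                result.append(phrase_map[phrase])
                index += size
                break
        else:
            # No known phrase starts here; keep the word unchanged
            result.append(words[index])
            index += 1
    return " ".join(result)

my_dict = {"hello": "hi", "mean adult": "grumpy elder", "fast car": "quick ride"}
print(translate("hello mean adult", my_dict))
```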
This will check for each word and also a two word phrase using the word before and after the current word to make the phrase:
myDict = {"hello": "hi",
          "fast car": "quick ride"}
sentence = input("Please enter a sentence: ")
words = sentence.split()
for i, word in enumerate(words):
    if word in myDict:
        print(myDict.get(word))
        continue
    if i:
        phrase = ' '.join([words[i-1], word])
        if phrase in myDict:
            print(myDict.get(phrase))
            continue
    if i < len(words)-1:
        phrase = ' '.join([word, words[i+1]])
        if phrase in myDict:
            print(myDict.get(phrase))
            continue
