I have a text that goes like this:
text = "All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood."
How do I write a function hedging(text) that processes my text and produces a new version that inserts the word "like" after every third word of the text?
The outcome should look like this:
text2 = "All human beings like are born free like and equal in like..."
Thank you!
Instead of giving you something like
solution=' like '.join(map(' '.join, zip(*[iter(text.split())]*3)))
I'm posting general advice on how to approach the problem. The "algorithm" is not particularly "pythonic", but hopefully easy to understand:
words = split text into words
number of words processed = 0
for each word in words
    output word
    number of words processed += 1
    if number of words processed is divisible by 3 then
        output like
Let us know if you have questions.
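If you want it runnable, a direct Python translation of that outline might look like this (a minimal sketch; the function name hedging comes from your question):

def hedging(text):
    result = []
    words_processed = 0
    for word in text.split():
        result.append(word)
        words_processed += 1
        if words_processed % 3 == 0:
            result.append("like")
    return " ".join(result)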
You could go with something like this:
' '.join([n + ' like' if i % 3 == 2 else n for i, n in enumerate(text.split())])
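Here enumerate pairs each word with its index, and i % 3 == 2 is true for every third word (indices 2, 5, 8, …), so ' like' gets appended to it.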
I have been searching for the solution to this problem. I am writing a custom function to count the number of sentences. I tried nltk and textstat for this problem, but both are giving me different counts.
An example sentence is something like this:
Annie said, "Are you sure? How is it possible? you are joking, right?"
NLTK is giving me --> count=3.
['Annie said, "Are you sure?', 'How is it possible?', 'you are joking, right?"']
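The NLTK call here is presumably something along these lines (using sent_tokenize is my assumption about how NLTK was invoked):

from nltk.tokenize import sent_tokenize
sent_tokenize('Annie said, "Are you sure? How is it possible? you are joking, right?"')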
Another example:
Annie said, "It will work like this! you need to go and confront your friend. Okay!"
NLTK is giving me --> count=3.
Please suggest. The expected count is 1 as it is a single direct sentence.
I have written a simple function that does what you want:
def sentences_counter(text: str):
    end_of_sentence = ".?!…"
    # complete with whatever end-of-sentence punctuation mark I might have forgotten
    # you might for instance want to add '\n'.
    sentences_count = 0
    sentences = []
    inside_a_quote = False
    start_of_sentence = 0
    last_end_of_sentence = -2
    for i, char in enumerate(text):
        # quote management, to solve your issue
        if char == '"':
            inside_a_quote = not inside_a_quote
            if not inside_a_quote and text[i-1] in end_of_sentence:  # 🚩
                last_end_of_sentence = i  # 🚩
        elif inside_a_quote:
            continue
        # basic management of sentences with the punctuation marks in `end_of_sentence`
        if char in end_of_sentence:
            last_end_of_sentence = i
        elif last_end_of_sentence == i-1:
            sentences.append(text[start_of_sentence:i].strip())
            sentences_count += 1
            start_of_sentence = i
    # same as the last block, in case there is no end punctuation mark in the text
    last_sentence = text[start_of_sentence:]
    if last_sentence:
        sentences.append(last_sentence.strip())
        sentences_count += 1
    return sentences_count, sentences
Consider the following:
text = '''Annie said, "Are you sure? How is it possible? you are joking, right?" No, I'm not... I thought you were'''
To generalize your problem a bit, I added 2 more sentences, one with an ellipsis and the last one without any end punctuation mark at all. Now, if I execute this:
sentences_count, sentences = sentences_counter(text)
print(f'{sentences_count} sentences detected.')
print(f'The detected sentences are: {sentences}')
I obtain this:
3 sentences detected.
The detected sentences are: ['Annie said, "Are you sure? How is it possible? you are joking, right?"', "No, I'm not...", 'I thought you were']
I think it works fine.
Note: Please consider that the quote management in my solution works for American-style quotes, where the end punctuation mark of the sentence can be inside the quote. Remove the lines where I have put flag emojis 🚩 to disable this.
I am new to Python, so I apologize for a simple question. My task is the following:
Create a list of alphabetically sorted unique words and display the first 5 words
I have a text variable which contains a lot of text information.
I did
test = text.split()
sorted(test)
As a result, I receive a list which starts with symbols like $ and numbers.
How do I get to the words and print N of them?
I'm assuming that by "word", you mean strings that consist of only alphabetical characters. In such a case, you can use filter to first get rid of the unwanted strings, turn the result into a set, sort it, and then print your stuff.
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $#"
# Extract only the words that consist of alphabetic characters
words = filter(lambda x: x.isalpha(), text.split(' '))
# Print the first 5 words
sorted(set(words))[:5]
Output:
['atop', 'king', 'mountain', 'of', 'peak']
But the problem with this is that it will still ignore words like mountain's, because of that pesky '. A regex solution might actually be far better in such a case.
For now, we'll go with this regex: ^[A-Za-z']+$, which means the string must contain only alphabetic characters and '. You may add more to this regex according to what you deem as "words". Read more on regexes here.
We'll be using re.match instead of .isalpha this time.
import re

WORD_PATTERN = re.compile(r"^[A-Za-z']+$")
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $#"
# Extract only the words that match the pattern
words = filter(lambda x: bool(WORD_PATTERN.match(x)), text.split(' '))
# Print the first 5 words
sorted(set(words))[:5]
Output:
['atop', 'king', 'mountain', "mountain's", 'of']
Keep in mind, however, that this gets tricky when you have a string like hi! What's your name?. Here hi! and name? are both words, except they are not fully alphabetic. The trick is to split the text in such a way that you get hi instead of hi! and name instead of name? in the first place.
Unfortunately, a true word split is far outside the scope of this question. I suggest taking a look at this question.
I am a newbie here, apologies for mistakes. Thank you.
test = '''The coronavirus outbreak has hit hard the cattle farmers in Pabna and Sirajganj as they are now getting hardly any customer for the animals they prepared for the last year targeting the Eid-ul-Azha this year.
Normally, cattle traders flock in large numbers to the belt -- one of the biggest cattle producing areas of the country -- one month ahead of the festival, when Muslims slaughter animals as part of their efforts to honour Prophet Ibrahim's spirit of sacrifice.
But the scene is different this year.'''
test = test.lower().split()
test2 = sorted([j for j in test if j.isalpha()])
print(test2[:5])
You can slice the sorted list up to position 5:
sorted(test)[:5]
or, if looking only for words:
sorted([i for i in test if i.isalpha()])[:5]
or by regex (requires import re):
sorted([i for i in test if re.search(r"[a-zA-Z]", i)])[:5]
By using a slice of a list you will be able to get all list elements up to a specific index, in this case 5.
I want to create a little homemade translation tool where only a specific list of sentences is translated.
I have learnt to use the replace() method, but my main problem is that I am translating from English to Spanish, so two problems appear:
- the order reverses many times
- sometimes a group of words is translated as just one, and sometimes a single word has to be translated as two or more
I know how to translate word by word, but that is not enough for this problem.
In this particular case I guess I have to translate whole chunks of words.
How could I do that?
I know how to translate word by word.
I am able to define two lists, in the first one I put the original english words to be translated, and in the other one the corresponding spanish words.
Then I get the input text, split it, and using two for loops I check whether any of the words are present. If they are, I use replace to change them to the Spanish version.
After that I use the join method adding a space between words to get the final result.
a = (["Is", "this", "the", "most","violent","show"])
b = (["Es", "este", "el", "más", "violento", "show"])
text = "Is this the most violent show?"
text2 = text.split()
for i in range(len(a)):
    for j in range(text2.__len__()):
        if a[i] == text2[j]:
            text2[j] = b[i]
print("Final text is: ", " ".join(text2))
The output is:
Final text is: Es este el más violento show?
The result is in the wrong order: "más violento show" sounds weird in Spanish; it should instead be "show más violento".
What I want to learn is to put chunks of words in the array a, like this:
a = (["most violent show"])
b= (["show más violento"])
But in that case I can't use the split tool and I am a bit lost on how to do this.
What about a simpler solution using replace and a mapping:
t = {'aa': 'dd', 'bbb': 'eee', 'c c c': 'f f f'}
v = 'aa bbb zz c c c'
output = v
for a, b in t.items():
    output = output.replace(a, b)
print(output)
# 'dd eee zz f f f'
This is actually a fairly complicated problem (if you allow it to be)! As of writing, some other answers are perfectly fine for this particular example, so if they work, please mark one of those as the accepted answer.
First off, you should use dictionaries for this. They are a "dictionary" where you look something up (the key) and get a definition (the value).
The difficult part is being able to match parts of the input phrase to-be-translated in order to get a translated output. Our general algorithm: go through every single one of the English key words/phrases and then translate them to Spanish.
There are a few problems:
You will be translating as-you-go, meaning if your translation contains words that could be both English and Spanish, you can run into nonsense translations.
English key words might be character subsets of other key terms, e.g.: "most" -> "más", "most violent show" -> "show más violento".
You need to handle case sensitivity.
I won't bother with 3 as it's not really in scope of the question and will take too long. Solving 2 is easiest: when reading the keys of the dictionary, order by length of the input key. Solving 1 is much harder: you need to know which terms have already been translated when looking at the "translation in progress."
So a complex but thorough solution for this is outlined below:
translation_dict = {
    "is": "es",
    "this": "este",
    "the": "el",
    "most violent show": "show más violento",
}
input_phrase = "Is this the most violent show?"
translations = list()
# Force the translation to be lower-case.
input_phrase = input_phrase.lower()
for key in sorted(translation_dict.keys(), key=lambda phrase: -len(phrase)):
    spanish_translation = translation_dict[key]
    # Code will assume all keys are lower-case.
    if key in input_phrase:
        input_phrase = input_phrase.replace(key, "{{{}}}".format(len(translations)))
        translations.append(spanish_translation)
print(input_phrase.format(*translations))
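For the sample input, this prints: es este el show más violento?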
There are yet more complex solutions if you know the max word size for a translation (i.e., iterating n-grams where n <= m, and m is the largest group of words you expect to translate). You would iterate the n-grams for the largest m first, attempting to search your translation dictionary, and decrement n by 1 until you are iterating over individual words.
For example, with m = 3 and the input "This is a test string.", you would get the following English phrases to attempt to translate:
"This is a"
"is a test"
"a test string"
"this is"
"is a"
"a test"
"test string"
"this"
"is"
"a"
"test"
"string"
This can have a performance benefit with a huge translation dictionary. I would show it but this answer is complex enough as it is.
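As a minimal sketch (my illustration, not code from the answer above), the n-gram enumeration for m = 3 could look like this, producing exactly the list shown:

def ngrams_longest_first(words, m):
    # Yield every n-gram, from the longest (n = m) down to single words (n = 1).
    for n in range(m, 0, -1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

for phrase in ngrams_longest_first("this is a test string".split(), 3):
    print(phrase)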
I think you can achieve what you are looking for with the string replace method:
a = ("Is", "this", "the", "most violent show")
b = ("Es", "este", "el", "show más violento")
text = "Is this the most violent show?"
for val, elem in enumerate(a):
    text = text.replace(elem, b[val])
print(text)
>>> 'Es este el show más violento?'
Also note that the parentheses around your lists are redundant.
Note that Caspar Wylie's solution is a neater method using dicts instead.
Hey guys, I'm confused and very unsure why my code is not working. What I am doing in this code is trying to find certain words from a list in a sentence I have, and output the number of times each is repeated within the sentence.
vocabulary = ["in", "on", "to", "www"]
numwords = [0, 0, 0, 0]
mysentence = (" ʺAlso the sea tosses itself and breaks itself, and should any sleeper fancying that he might find on the beach an answer to his doubts, a sharer of his solitude, throw off his bedclothes and go down by himself to walk on the sand, no image with semblance of serving and divine promptitude comes readily to hand bringing the night to order and making the world reflect the compass of the soul.ʺ")
for word in mysentence.split():
    if (word == vocabulary):
    else:
        numwords[0] += 1
    if (word == vocabulary):
    else:
        numwords[1] += 1
    if (word == vocabulary):
    else:
        numwords[2] += 1
    if (word == vocabulary):
    else:
        numwords[3] += 1
    if (word == vocabulary):
    else:
        numwords[4] += 1
print "total number of words : " + str(len(mysentence))
The easiest way to do this is to use collections.Counter to count all the words in the sentence, and then look up the ones you're interested in.
from collections import Counter
vocabulary =["in","on","to","www"]
mysentence = "Also the sea tosses itself and breaks itself, and should any sleeper fancying that he might find on the beach an answer to his doubts, a sharer of his solitude, throw off his bedclothes and go down by himself to walk on the sand, no image with semblance of serving and divine promptitude comes readily to hand bringing the night to order and making the world reflect the compass of the soul."
mysentence = mysentence.split()
c = Counter(mysentence)
numwords = [c[i] for i in vocabulary]
print(numwords)
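For this vocabulary and sentence, this prints [0, 2, 4, 0].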
Presumably you could iterate through the list with a for loop, check whether each element is the word, and increment a counter; an example implementation might look like this:
def find_word(word, string):
    word_count = 0
    for i in range(len(string)):
        if string[i] == word:
            word_count += 1
    return word_count
This might be a little inefficient, but it might be easier for you to understand than collections.Counter :)
Honestly, I would check it like this:
for word in mysentence.split():
    if word in vocabulary:
        numwords[vocabulary.index(word)] += 1
Therefore your entire code would look like this:
vocabulary = ["in", "on", "to", "www"]
numwords = [0, 0, 0, 0]
mysentence = (" ʺAlso the sea tosses itself and breaks itself, and should any sleeper fancying that he might find on the beach an answer to his doubts, a sharer of his solitude, throw off his bedclothes and go down by himself to walk on the sand, no image with semblance of serving and divine promptitude comes readily to hand bringing the night to order and making the world reflect the compass of the soul.ʺ")
for word in mysentence.replace('.', '').replace(',', '').split():
    if word in vocabulary:
        numwords[vocabulary.index(word)] += 1
print("total number of words : " + str(len(mysentence)))
As @Jacob suggested, replacing the '.' and ',' characters can also be applied before the split, to avoid any possible conflicts.
Consider the issue that characters like “ and ” may not parse well unless an appropriate encoding scheme has been specified.
this_is_how_you_define_a_string = "The string goes here"
# and thus:
mysentence = "Also the sea tosses itself and breaks itself, and should any sleeper fancying that he might find on the beach an answer to his doubts, a sharer of his solitude, throw off his bedclothes and go down by himself to walk on the sand, no image with semblance of serving and divine promptitude comes readily to hand bringing the night to order and making the world reflect the compass of the soul."
for v in vocabulary:
    v in mysentence  # Notice the indentation of 4 spaces
This expression evaluates to True or False depending on whether v is in mysentence. I think I will leave how to accumulate the values as an exercise. Hint: True == 1 and False == 0; you need the sum of the true values for each word v.
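Following that hint, one possible way to accumulate the values (a sketch of the exercise, not code from the answer itself):

# Since True == 1 and False == 0, booleans can be summed directly.
numwords = [sum(word == v for word in mysentence.split()) for v in vocabulary]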
In Python, I have created a text generator that acts on certain parameters, but my code is slow most of the time and performs below my expectations. I expect one sentence every 3-4 minutes, but it fails to comply if the database it works on is large. I use Project Gutenberg's 18-book corpus, and I will create my custom corpus and add further books, so performance is vital. The algorithm and the implementation are below:
ALGORITHM
1- Enter the trigger sentence (only once, at the beginning of the program).
2- Get the longest word in the trigger sentence.
3- Find all the sentences of the corpus that contain the word from step 2.
4- Randomly select one of those sentences.
5- Get the sentence (named sentA to resolve the ambiguity in the description) that follows the sentence picked at step 4, so long as sentA is longer than 40 characters.
6- Go to step 2; now the trigger sentence is the sentA of step 5.
IMPLEMENTATION
from nltk.corpus import gutenberg
from random import choice

triggerSentence = raw_input("Please enter the trigger sentence: ")  # get input sentence from user
previousLongestWord = ""
listOfSents = gutenberg.sents()
listOfWords = gutenberg.words()
corpusSentences = []  # all sentences in the related corpus
sentenceAppender = ""
longestWord = ""

# this function is not mine, code courtesy of Dave Kirby, found on the internet, about sorting a list without duplication speed tricks
def arraySorter(seq):
    seen = set()
    return [x for x in seq if x not in seen and not seen.add(x)]

def findLongestWord(longestWord):
    if (listOfWords.count(longestWord) == 1 or longestWord.upper() == previousLongestWord.upper()):
        longestWord = sortedSetOfValidWords[-2]
        if (listOfWords.count(longestWord) == 1):
            longestWord = sortedSetOfValidWords[-3]

doappend = corpusSentences.append

def appending():
    for mysentence in listOfSents:  # sentences are organized into arrays so they can actually be read word by word.
        sentenceAppender = " ".join(mysentence)
        doappend(sentenceAppender)

appending()

sentencesContainingLongestWord = []

def getSentence(longestWord, sentencesContainingLongestWord):
    for sentence in corpusSentences:
        if sentence.count(longestWord):  # if the sentence contains the longest target string, push it into the sentencesContainingLongestWord list
            sentencesContainingLongestWord.append(sentence)

def lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord):
    while (len(corpusSentences[sentenceIndex + 1]) < 40):  # in case the next sentence is shorter than 40 characters, pick another trigger sentence
        sentencesContainingLongestWord.remove(triggerSentence)
        triggerSentence = choice(sentencesContainingLongestWord)
        sentenceIndex = corpusSentences.index(triggerSentence)

while len(triggerSentence) > 0:  # run the loop as long as you get a trigger sentence
    sentencesContainingLongestWord = []  # all the sentences that include the longest word are to be inserted into this list
    setOfValidWords = []  # set for words in a sentence that exist in the corpus
    split_str = triggerSentence.split()  # split the sentence into words
    setOfValidWords = [word for word in split_str if listOfWords.count(word)]
    sortedSetOfValidWords = arraySorter(sorted(setOfValidWords, key=len))
    longestWord = sortedSetOfValidWords[-1]
    findLongestWord(longestWord)
    previousLongestWord = longestWord
    getSentence(longestWord, sentencesContainingLongestWord)
    triggerSentence = choice(sentencesContainingLongestWord)
    sentenceIndex = corpusSentences.index(triggerSentence)
    lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord)
    triggerSentence = corpusSentences[sentenceIndex + 1]  # get the sentence that is next to the previous trigger sentence
    print triggerSentence
    print "\n"
    corpusSentences.remove(triggerSentence)  # in order to view the sentence index numbers, you can remove this one so index numbers are concurrent with actual Gutenberg numbers

print "End of session, please rerun the program"
# initiated once the while loop exits, so that the program ends without errors
The computer I run the code on is a bit old: the dual-core CPU was bought in Feb. 2006 and the 2x512 MB RAM in Sept. 2004, so I'm not sure if my implementation is bad or the hardware is the reason for the slow runtime. Any ideas on how I can rescue this from its hazardous form? Thanks in advance.
I think my first advice must be: Think carefully about what your routines do, and make sure the name describes that. Currently you have things like:
- arraySorter, which neither deals with arrays nor sorts (it's an implementation of nub)
- findLongestWord, which counts things or selects words by criteria not present in the algorithm description, yet ends up doing nothing at all because longestWord is a local variable (an argument, as it were)
- getSentence, which appends an arbitrary number of sentences onto a list
- appending, which sounds like it might be a state checker, but operates only through side effects
- considerable confusion between local and global variables; for instance, the global variable sentenceAppender is never used, nor is it an actor (for instance, a function) as the name suggests
For the task itself, what you really need are indices. It might be overkill to index every word - technically you should only need index entries for words that occur as the longest word of a sentence. Dictionaries are your primary tool here, and the second tool is lists. Once you have those indices, looking up a random sentence containing any given word takes only a dictionary lookup, a random.choice, and a list lookup. Perhaps a few list lookups, given the sentence length restriction.
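A minimal sketch of such an index (my illustration; corpus_sentences and longest_word stand in for the question's variables):

import random
from collections import defaultdict

# Build the index once: word -> positions of the sentences containing it.
index = defaultdict(list)
for pos, sentence in enumerate(corpus_sentences):
    for word in sentence.split():
        index[word].append(pos)

# Each generation step is then a dictionary lookup plus a random.choice
# (length and bounds checks omitted):
pos = random.choice(index[longest_word])
next_sentence = corpus_sentences[pos + 1]  # the sentence that follows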
This example should prove a good object lesson that modern hardware or optimizers like Psyco do not solve algorithmic problems.
Maybe Psyco speeds up the execution?