I have a text:
text = """Among domestic cats, males are more likely to fight than females. Among feral cats, the most common reason for cat fighting is competition between two males to mate with a female. In such cases, most fights are won by the heavier male. Another common reason for fighting in domestic cats is the difficulty of establishing territories within a small home. Female cats also fight over territory or to defend their kittens."""
How can I implement a function in Python 3 that marks every 12th word with "***", like this?
"""
Among domestic cats, males are more likely to fight than females. Among***
feral cats, the most common reason for cat fighting is competition between***
next ...***
"""
Use a list comprehension:

text = "Create your own function that takes in a sentence and mark every 12th word with ***"
mark = " ".join(["{}***".format(word) if (idx + 1) % 12 == 0 else word
                 for idx, word in enumerate(text.split())])
print(mark)
The main point here is to use the enumerate() function and the modulo operator (%).
First we break the text into individual words using str.split(). We can then visit every 12th word by setting the step of range() to 12, appending "***" where appropriate, and rejoining the words with a space.

words = text.split()
for i in range(0, len(words), 12):  # step by 12
    words[i] += "***"
new_text = " ".join(words)

NOTE: this marks the 0th word with "***"; use range(11, len(words), 12) to start at the 12th word.
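Putting the pieces together, here is a minimal function version (a sketch; the name mark_every and the n parameter are illustrative):

```python
def mark_every(text, n=12):
    """Append "***" to every n-th word (the 12th, 24th, ... by default)."""
    words = text.split()
    for i in range(n - 1, len(words), n):  # index 11 is the 12th word, and so on
        words[i] += "***"
    return " ".join(words)

sample = "one two three four five six seven eight nine ten eleven twelve thirteen"
print(mark_every(sample))  # the marker lands on "twelve", the 12th word
```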
Related
I'm trying to analyze an article to determine if a specific substring appears.
If "Bill" appears, then I want to delete the substring's parent sentence from the article, as well as every sentence following the first deleted sentence.
If "Bill" does not appear, no alterations are made to the article.
Sample Text:
stringy = """This is Bill Everest here. A long time ago in, erm, this galaxy, a game called Star Wars Episode I: Racer was a smash hit, leading to dozens of enthusiastic magazine reviews with the byline "now this is podracing!" Unfortunately, the intervening years have been unkind to the Star Wars prequels, Star Fox in the way you can rotate your craft to fit through narrow gaps.
This is Bill, signing off. Thank you for reading. And see you tomorrow!"""
Desired Result When Targeted Substring is "Bill":
stringy = """This is Bill Everest here. A long time ago in, erm, this galaxy, a game called Star Wars Episode I: Racer was a smash hit, leading to dozens of enthusiastic magazine reviews with the byline "now this is podracing!" Unfortunately, the intervening years have been unkind to the Star Wars prequels, but does that hindsight extend to this thoroughly literally-named racing tie-in? Star Fox in the way you can rotate your craft to fit through narrow gaps.
"""
This is the code so far:
if "Bill" not in stringy[-200:]:
    print(stringy)
text = stringy.rsplit("Bill")[0]
text = text.split('.')[:-1]
text = '.'.join(text) + '.'
It currently doesn't work when "Bill" appears outside of the last 200 characters: the text is cut off at the very first instance of "Bill" (the opening sentence, "This is Bill Everest here"). How can this code be altered to act only on occurrences of "Bill" within the last 200 characters?
Here's another approach that loops through each sentence using a regex. We keep a running character count, and once we're within the last 200 characters we check each sentence for 'Bill'. If found, we exclude everything from that sentence onward.
Hope the code is readable enough.
import re

def remove_bill(stringy):
    sentences = re.findall(r'([A-Z][^\.!?]*[\.!?]\s*\n*)', stringy)
    total = len(stringy)
    count = 0
    for index, line in enumerate(sentences):
        # check each position of 'Bill' within this sentence
        for pos in (m.start() for m in re.finditer('Bill', line)):
            if count + pos >= total - 200:
                return ''.join(sentences[:index])
        count += len(line)
    return stringy

stringy = remove_bill(stringy)
Here is how you can use re (note the break once a match has been handled):

import re
stringy = """..."""
target = "Bill"
l = re.findall(r'([A-Z][^\.!?]*[\.!?])', stringy)
for i in range(len(l) - 1, -1, -1):
    # keep only the sentences before the last match that lies within the final 200 characters
    if target in l[i] and sum(len(a) for a in l[i:]) - sum(len(a) for a in l[i].split(target)[:-1]) < 200:
        stringy = ' '.join(l[:i])
        break
print(stringy)
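Both answers above split the text into sentences first; a simpler alternative (a sketch, with the helper name trim_after_target invented for illustration) is to search only the tail of the string with str.find() and then cut back to the end of the previous sentence:

```python
def trim_after_target(text, target="Bill", window=200):
    """If target occurs in the last `window` characters of text, drop the
    sentence containing it and everything after; otherwise return text as-is."""
    pos = text.find(target, max(0, len(text) - window))
    if pos == -1:
        return text
    # back up to the period that ends the preceding sentence
    cut = text.rfind(".", 0, pos)
    return text[:cut + 1] if cut != -1 else ""

s = "Keep this sentence. This is Bill, signing off. Thank you."
print(trim_after_target(s))  # only the first sentence survives
```

This sidesteps sentence tokenization entirely, at the cost of assuming sentences end with periods.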
I have a string:
s = "Now is the time for all good men to come to the aid of their country. Time is of the essence friends."
I want to divide s into substrings 25 characters in length like this:
"Now is the time for all\n",
"good men to come to the\n",
"aid of their country.\n",
"Time is of the essence,\n",
"friends."
or
Now is the time for all
good men to come to the
aid of their country.
Time is of the essence,
friends.
using spaces to pad each line 'equally', starting on the left, so that every substring is 25 characters long.
Using split() I can divide the string s into a list of lines of words, each at most 25 characters long:

lines = []
current = []
length = 0
for word in s.split():
    if length + len(word) <= 25:
        current.append(word)
        length += len(word) + 1  # +1 for the following space
    else:
        lines.append(' '.join(current))
        current = [word]
        length = len(word) + 1
lines.append(' '.join(current))

result:
lines = ['Now is the time for all', 'good men to come to the', 'aid of their country.', 'Time is of the essence', 'friends.']
Then I want to use this list to build the string t. I don't understand how to pad between the words in each group to reach a length of 25; it presumably involves handing out the extra spaces in some circular fashion, but I haven't found that method yet.
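That "circular method" is usually done by distributing the missing spaces round-robin across the gaps between words, leftmost gaps first (a sketch; the function name justify is illustrative):

```python
def justify(line, width=25):
    """Pad the gaps between words so the line is exactly `width` characters."""
    words = line.split()
    if len(words) == 1:
        return words[0].ljust(width)  # a lone word is just right-padded
    gaps = len(words) - 1
    missing = width - sum(len(w) for w in words)
    base, extra = divmod(missing, gaps)  # each gap gets `base` spaces; the first `extra` gaps get one more
    out = ""
    for i, word in enumerate(words[:-1]):
        out += word + " " * (base + (1 if i < extra else 0))
    return out + words[-1]

for line in ['Now is the time for all', 'good men to come to the',
             'aid of their country.', 'Time is of the essence', 'friends.']:
    print(repr(justify(line)))  # every printed line is 25 characters wide
```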
I am teaching myself python and have completed a rudimentary text summarizer. I'm nearly happy with the summarized text but want to polish the final product a bit more.
The code performs some standard text processing correctly (tokenization, remove stopwords, etc). The code then scores each sentence based on a weighted word frequency. I am using the heapq.nlargest() method to return the top 7 sentences which I feel does a good job based on my sample text.
The issue I'm facing is that the top 7 sentences are returned sorted from highest score to lowest score. I understand why this is happening, but I would prefer to maintain the sentence order of the original text. I've included the relevant bits of code and hope someone can guide me to a solution.
#remove all stopwords from text, build clean list of lower case words
clean_data = []
for word in tokens:
    if str(word).lower() not in stoplist:
        clean_data.append(word.lower())

#build dictionary of all words with frequency counts: {key:value = word:count}
word_frequencies = {}
for word in clean_data:
    if word not in word_frequencies:
        word_frequencies[word] = 1
    else:
        word_frequencies[word] += 1

#update the dictionary with a weighted frequency
maximum_frequency = max(word_frequencies.values())
for word in word_frequencies:
    word_frequencies[word] = word_frequencies[word] / maximum_frequency

#iterate through each sentence and accumulate the weighted scores of its words
sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies:
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores:
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
print(summary)
I'm testing using the following article: https://www.bbc.com/news/world-australia-45674716
Current output: "Australia bank inquiry: 'They didn't care who they hurt'
The inquiry has also heard testimony about corporate fraud, bribery rings at banks, actions to deceive regulators and reckless practices. A royal commission this year, the country's highest form of public inquiry, has exposed widespread wrongdoing in the industry. The royal commission came after a decade of scandalous behaviour in Australia's financial sector, the country's largest industry. "[The report] shines a very bright light on the poor behaviour of our financial sector," Treasurer Josh Frydenberg said. "When misconduct was revealed, it either went unpunished or the consequences did not meet the seriousness of what had been done," he said. The bank customers who lost everything
He also criticised what he called the inadequate actions of regulators for the banks and financial firms. It has also received more than 9,300 submissions of alleged misconduct by banks, financial advisers, pension funds and insurance companies."
As an example of the desired output: the third sentence above, "A royal commission this year, the country's highest form of public inquiry, has exposed widespread wrongdoing in the industry.", actually comes before "Australia bank inquiry: 'They didn't care who they hurt'" in the original article, and I would like the output to maintain that sentence order.
Got it working, leaving this here in case others are curious:

#iterate through each sentence, accumulating the weighted score of its words
#and remembering each sentence's position in the original text
sentence_scores = {}
cnt = 0
for sent in sentence_list:
    score = 0
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies and len(sent.split(' ')) < 30:
            score += word_frequencies[word]
    sentence_scores[sent] = [score, cnt]  # [score, original position]
    cnt += 1

#sort by score in descending order and take the top 7 sentences
from operator import itemgetter
top7 = dict(sorted(sentence_scores.items(), key=itemgetter(1), reverse=True)[0:7])

#re-sort the top 7 by original position (the second element of each value)
def sort_by_position(values):
    return sorted(values, key=lambda value: value[1])

sentence_summary = sort_by_position(top7.values())
summary = ""
for value in sentence_summary:
    for key in top7:
        if top7[key] == value:
            summary = summary + key
print(summary)
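If the scores are kept as plain numbers (as in the original question), the order restoration can also be written more directly (a sketch; ordered_summary is an invented name): pick the top n with heapq.nlargest(), then sort those sentences by their position in sentence_list.

```python
import heapq

def ordered_summary(sentence_list, sentence_scores, n=7):
    """Pick the n highest-scoring sentences, then emit them in source order."""
    top = heapq.nlargest(n, sentence_scores, key=sentence_scores.get)
    top.sort(key=sentence_list.index)  # restore original text order
    return ' '.join(top)

sentences = ["First sentence.", "Second sentence.", "Third sentence."]
scores = {"Third sentence.": 3.0, "First sentence.": 1.0, "Second sentence.": 2.0}
print(ordered_summary(sentences, scores, n=2))
# the two best-scoring sentences, in the order they appear in `sentences`
```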
I just want to ask: how can I find words from an array in my string?
I need a filter that finds the words I saved in an array within the text a user types into a text box on my web page.
I'll have 30+ words in an array or list.
Then the user types text into the text box, and the script should find all matching words.
Something like a spam filter, I guess.
Thanks
import re
words = ['word1', 'word2', 'word4']
s = 'Word1 qwerty word2, word3 word44'
r = re.compile('|'.join([r'\b%s\b' % w for w in words]), flags=re.I)
r.findall(s)
>> ['Word1', 'word2']
Solution 1 uses a regex approach that returns every instance of each keyword found in the data. Solution 2 returns the indexes of every instance of each keyword found in the data.
import re
dataString = '''Life morning don't were in multiply yielding multiply gathered from it. She'd of evening kind creature lesser years us every, without Abundantly fly land there there sixth creature it. All form every for a signs without very grass. Behold our bring can't one So itself fill bring together their rule from, let, given winged our. Creepeth Sixth earth saying also unto to his kind midst of. Living male without for fruitful earth open fruit for. Lesser beast replenish evening gathering.
Behold own, don't place, winged. After said without of divide female signs blessed subdue wherein all were meat shall that living his tree morning cattle divide cattle creeping rule morning. Light he which he sea from fill. Of shall shall. Creature blessed.
Our. Days under form stars so over shall which seed doesn't lesser rule waters. Saying whose. Seasons, place may brought over. All she'd thing male Stars their won't firmament above make earth to blessed set man shall two it abundantly in bring living green creepeth all air make stars under for let a great divided Void Wherein night light image fish one. Fowl, thing. Moved fruit i fill saw likeness seas Tree won't Don't moving days seed darkness.
'''
keyWords = ['Life', 'stars', 'seed', 'rule']
#---------------------- SOLUTION 1
print('Solution 1 output:')
for keyWord in keyWords:
    print(re.findall(keyWord, dataString))

#---------------------- SOLUTION 2
print('\nSolution 2 output:')
for keyWord in keyWords:
    indexes = []
    indexFound = dataString.find(keyWord)
    while indexFound != -1:
        indexes.append(indexFound)
        indexFound = dataString.find(keyWord, indexFound + 1)
    print(indexes)
Output:
Solution 1 output:
['Life']
['stars', 'stars']
['seed', 'seed']
['rule', 'rule', 'rule']
Solution 2 output:
[0]
[765, 1024]
[791, 1180]
[295, 663, 811]
Try

words = ['word1', 'word2', 'word4']
s = 'word1 qwerty word2, word3 word44'
i = 0
for x in s.split(" "):
    # strip trailing punctuation so 'word2,' matches 'word2'
    if x.strip(',.') in words:
        print(x.strip(',.'))
        i += 1
print("count is", i)

output

word1
word2
count is 2
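With 30+ words, a set makes each membership test O(1), and lowercasing both sides makes the match case-insensitive (a sketch; find_banned_words is an invented helper name):

```python
def find_banned_words(text, banned):
    """Return the banned words that occur in text, case-insensitively."""
    banned_set = {w.lower() for w in banned}
    # strip common punctuation so 'word2,' still matches 'word2'
    tokens = (tok.strip('.,!?;:').lower() for tok in text.split())
    return [tok for tok in tokens if tok in banned_set]

print(find_banned_words('Word1 qwerty word2, word3 word44', ['word1', 'word2', 'word4']))
# finds 'word1' and 'word2'
```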
For example if I had a string without any punctuation:
"She walked the dog to the park and played ball with the dog When she threw the ball to the dog the dog missed the ball and ran to the other side of the park to fetch it"
I know how to do it by converting the string to lowercase and using

from collections import Counter

but I can't think of any other way to count without using built-in helpers (this includes setdefault, get, sorted, etc.).
It should come out in a key:value format. Any ideas?
Forget about libraries and "fast" ways of doing it; use simpler logic:
Start by splitting your string with stringName.split(), which returns a list of words. Now create an empty dictionary. Then iterate through the list and do one of two things: if the word already exists in the dictionary, increment its count by 1; otherwise, create the key-value pair with the word as key and 1 as value.
At the end, you'll have a count of every word.
The code:

testString = "She walked the dog to the park and played ball with the dog When she threw the ball to the dog the dog missed the ball and ran to the other side of the park to fetch it"
dic = {}
words = testString.split()
for raw_word in words:
    word = raw_word.lower()
    if word in dic:
        dic[word] += 1
    else:
        dic[word] = 1
print(dic)