Python Text Summarizer - maintain sentence order

Python Text Summarizer - maintain sentence order - python

I am teaching myself python and have completed a rudimentary text summarizer. I'm nearly happy with the summarized text but want to polish the final product a bit more.
The code performs some standard text processing correctly (tokenization, remove stopwords, etc). The code then scores each sentence based on a weighted word frequency. I am using the heapq.nlargest() method to return the top 7 sentences which I feel does a good job based on my sample text.
The issue I'm facing is that the top 7 sentences are returned sorted from highest score -> lowest score. I understand the why this is happening. I would prefer to maintain the same sentence order as present in the original text. I've included the relevant bits of code and hope someone can guide me on a solution.
#remove all stopwords from text, build clean list of lower case words
clean_data = []
for word in tokens:
if str(word).lower() not in stoplist:
clean_data.append(word.lower())
#build dictionary of all words with frequency counts: {key:value = word:count}
word_frequencies = {}
for word in clean_data:
if word not in word_frequencies.keys():
word_frequencies[word] = 1
else:
word_frequencies[word] += 1
#print(word_frequencies.items())
#update the dictionary with a weighted frequency
maximum_frequency = max(word_frequencies.values())
#print(maximum_frequency)
for word in word_frequencies.keys():
word_frequencies[word] = (word_frequencies[word]/maximum_frequency)
#print(word_frequencies.items())
#iterate through each sentence and combine the weighted score of the underlying word
sentence_scores = {}
for sent in sentence_list:
for word in nltk.word_tokenize(sent.lower()):
if word in word_frequencies.keys():
if len(sent.split(' ')) < 30:
if sent not in sentence_scores.keys():
sentence_scores[sent] = word_frequencies[word]
else:
sentence_scores[sent] += word_frequencies[word]
#print(sentence_scores.items())
summary_sentences = heapq.nlargest(7, sentence_scores, key = sentence_scores.get)
summary = ' '.join(summary_sentences)
print(summary)
I'm testing using the following article: https://www.bbc.com/news/world-australia-45674716
Current output: "Australia bank inquiry: 'They didn't care who they hurt'
The inquiry has also heard testimony about corporate fraud, bribery rings at banks, actions to deceive regulators and reckless practices. A royal commission this year, the country's highest form of public inquiry, has exposed widespread wrongdoing in the industry. The royal commission came after a decade of scandalous behaviour in Australia's financial sector, the country's largest industry. "[The report] shines a very bright light on the poor behaviour of our financial sector," Treasurer Josh Frydenberg said. "When misconduct was revealed, it either went unpunished or the consequences did not meet the seriousness of what had been done," he said. The bank customers who lost everything
He also criticised what he called the inadequate actions of regulators for the banks and financial firms. It has also received more than 9,300 submissions of alleged misconduct by banks, financial advisers, pension funds and insurance companies."
As an example of the desired output: The third sentence above, "A royal commission this year, the country's highest form of public inquiry, has exposed widespread wrongdoing in the industry." actually comes before "Australia bank inquiry: They didnt care who they hurt" in the original article and I would like the output to maintain that sentence order.

Got it working, leaving here in case others are curious:
#iterate through each sentence and combine the weighted score of the underlying word
sentence_scores = {}
cnt = 0
for sent in sentence_list:
sentence_scores[sent] = []
score = 0
for word in nltk.word_tokenize(sent.lower()):
if word in word_frequencies.keys():
if len(sent.split(' ')) < 30:
if sent not in sentence_scores.keys():
score = word_frequencies[word]
else:
score += word_frequencies[word]
sentence_scores[sent].append(score)
sentence_scores[sent].append(cnt)
cnt = cnt + 1
#Sort the dictionary using the score in descending order and then index in ascending order
#Getting the top 7 sentences
#Putting them in 1 string variable
from operator import itemgetter
top7 = dict(sorted(sentence_scores.items(), key=itemgetter(1), reverse = True)[0:7])
#print(top7)
def Sort(sub_li):
return(sorted(sub_li, key = lambda sub_li: sub_li[1]))
sentence_summary = Sort(top7.values())
summary = ""
for value in sentence_summary:
for key in top7:
if top7[key] == value:
summary = summary + key
print(summary)

Related

mark "***" at the end of every 12 words

There is a text.
text = """Among domestic cats, males are more likely to fight than females. Among feral cats, the most common reason for cat fighting is competition between two males to mate with a female. In such cases, most fights are won by the heavier male. Another common reason for fighting in domestic cats is the difficulty of establishing territories within a small home. Female cats also fight over territory or to defend their kittens."""
How to implement this function (mark "***" every 12 words), please tell me in python 3?
"""
Among domestic cats, males are more likely to fight than females. Among***
feral cats, the most common reason for cat fighting is competition between***
next ...***
"""

Use a list comprehension:
text = "Create your own function that takes in a sentence and mark every 12th word with ***"
mark = " ".join(["{}***".format(word)
for idx, word in enumerate(text.split())
if idx % 12 == 0])
print(mark)
The main point here is to use the enumerate() function and the modulo operator (%).

First we break the text into individual words using str.split(), we can then iterate through every 12 words by setting the step of range to be 12, adding "***" where appropriate and rejoining the words with a space.
words = text.split()
for i in range(0, len(words), 12): # step by 12
words[i] += "***"
new_text = " ".join(words)
NOTE: this will mark the 0th word with "***", use range(11, len(words), 12): to start with 12th word

Count swear words from tweets in Python

I want to count how many times certain (swear) words appear in my database of tweets, below a simple example. It does not give me the desired output due to the fact that after the "#" is introduced in the text it becomes comments, but I do not know how to solve this.
Thank you.
text = RT #JGalt09: #Trump never owed millions $$$ to the Bank of China. Another hoax from the #FakeNews media.
word_list = ['fakenews', 'hoax']
swearword_count = 0
text_swear_count = text.lower().replace('.,#?!', ' ').split()
for word in text_swear_count:
if word in word_list:
swearword_count += 1

How to classify derived words that share meaning as the same tokens?

I would like to count unrelated words in an article but I have troubles with grouping words of the same meaning derived from one another.
For instance, I would like gasoline and gas to be treated as the same token in sentences like The price of gasoline has risen. and "Gas" is a colloquial form of the word gasoline in North American English. Conversely, in BE the term would be "petrol". Therefore, if these two sentences comprised the entire article, the count for gas (or gasoline) would be 3 (petrol would not be counted).
I have tried using NLTK's stemmers and lemmatizers but to no avail. Most seem to reproduce gas as gas and gasoline as gasolin which is not helpful for my purposes at all. I understand that this is the usual behaviour. I have checked out a thread that seems to be a little bit similar, however the answers there are not completely applicable to my case as I require the words to be derived from one another.
How to treat derived words of the same meaning as same tokens in order to count them together?

I propose a two steps approach:
First, find synonyms by comparing word embeddings (only non-stopwords). This should remove similar written words, which mean something else, such as gasolineand gaseous.
Then, check if synonyms share some of their stem. Essentially if "gas" is in "gasolin" and the other way around. This shall suffice because you only compare your synonyms.
import spacy
import itertools
from nltk.stem.porter import *
threshold = 0.6
#compare the stems of the synonyms
stemmer = PorterStemmer()
def compare_stems(a, b):
if stemmer.stem(a) in stemmer.stem(b):
return True
if stemmer.stem(b) in stemmer.stem(a):
return True
return False
candidate_synonyms = {}
#add a candidate to the candidate dictionary of sets
def add_to_synonym_dict(a,b):
if a not in candidate_synonyms:
if b not in candidate_synonyms:
candidate_synonyms[a] = {a, b}
return
a, b = b,a
candidate_synonyms[a].add(b)
nlp = spacy.load('en_core_web_lg')
text = u'The price of gasoline has risen. "Gas" is a colloquial form of the word gasoline in North American English. Conversely in BE the term would be petrol. A gaseous state has nothing to do with oil.'
words = nlp(text)
#compare every word with every other word, if they are similar
for a, b in itertools.combinations(words, 2):
#check if one of the word pairs are stopwords or punctuation
if a.is_stop or b.is_stop or a.is_punct or b.is_punct:
continue
if a.similarity(b) > threshold:
if compare_stems(a.text.lower(), b.text.lower()):
add_to_synonym_dict(a.text.lower(), b.text.lower())
print(candidate_synonyms)
#output: {'gasoline': {'gas', 'gasoline'}}
Then you can count your synonym candidates based on their appearances in the text.
Note: I chose the threshold for synonyms with 0.6 by chance. You would probably test which threshold suits your task. Also my code is just a quick and dirty example, this could be done a lot cleaner.
`

Optimizing the process of finding word association strengths from an input text

I have written the following (crude) code to find the association strengths among the words in a given piece of text.
import re
## The first paragraph of Wikipedia's article on itself - you can try with other pieces of text with preferably more words (to produce more meaningful word pairs)
text = "Wikipedia was launched on January 15, 2001, by Jimmy Wales and Larry Sanger.[10] Sanger coined its name,[11][12] as a portmanteau of wiki[notes 3] and 'encyclopedia'. Initially an English-language encyclopedia, versions in other languages were quickly developed. With 5,748,461 articles,[notes 4] the English Wikipedia is the largest of the more than 290 Wikipedia encyclopedias. Overall, Wikipedia comprises more than 40 million articles in 301 different languages[14] and by February 2014 it had reached 18 billion page views and nearly 500 million unique visitors per month.[15] In 2005, Nature published a peer review comparing 42 science articles from Encyclopadia Britannica and Wikipedia and found that Wikipedia's level of accuracy approached that of Britannica.[16] Time magazine stated that the open-door policy of allowing anyone to edit had made Wikipedia the biggest and possibly the best encyclopedia in the world and it was testament to the vision of Jimmy Wales.[17] Wikipedia has been criticized for exhibiting systemic bias, for presenting a mixture of 'truths, half truths, and some falsehoods',[18] and for being subject to manipulation and spin in controversial topics.[19] In 2017, Facebook announced that it would help readers detect fake news by suitable links to Wikipedia articles. YouTube announced a similar plan in 2018."
text = re.sub("[\[].*?[\]]", "", text) ## Remove brackets and anything inside it.
text=re.sub(r"[^a-zA-Z0-9.]+", ' ', text) ## Remove special characters except spaces and dots
text=str(text).lower() ## Convert everything to lowercase
## Can add other preprocessing steps, depending on the input text, if needed.
from nltk.corpus import stopwords
import nltk
stop_words = stopwords.words('english')
desirable_tags = ['NN'] # We want only nouns - can also add 'NNP', 'NNS', 'NNPS' if needed, depending on the results
word_list = []
for sent in text.split('.'):
for word in sent.split():
'''
Extract the unique, non-stopword nouns only
'''
if word not in word_list and word not in stop_words and nltk.pos_tag([word])[0][1] in desirable_tags:
word_list.append(word)
'''
Construct the association matrix, where we count 2 words as being associated
if they appear in the same sentence.
Later, I'm going to define associations more properly by introducing a
window size (say, if 2 words seperated by at most 5 words in a sentence,
then we consider them to be associated)
'''
table = np.zeros((len(word_list),len(word_list)), dtype=int)
for sent in text.split('.'):
for i in range(len(word_list)):
for j in range(len(word_list)):
if word_list[i] in sent and word_list[j] in sent:
table[i,j]+=1
df = pd.DataFrame(table, columns=word_list, index=word_list)
# Count the number of occurrences of each word from word_list in the text
all_words = pd.DataFrame(np.zeros((len(df), 2)), columns=['Word', 'Count'])
all_words.Word = df.index
for sent in text.split('.'):
count=0
for word in sent.split():
if word in word_list:
all_words.loc[all_words.Word==word,'Count'] += 1
# Sort the word pairs in decreasing order of their association strengths
df.values[np.triu_indices_from(df, 0)] = 0 # Make the upper triangle values 0
assoc_df = pd.DataFrame(columns=['Word 1', 'Word 2', 'Association Strength (Word 1 -> Word 2)'])
for row_word in df:
for col_word in df:
'''
If Word1 occurs 10 times in the text, and Word1 & Word2 occur in the same sentence 3 times,
the association strength of Word1 and Word2 is 3/10 - Please correct me if this is wrong.
'''
assoc_df = assoc_df.append({'Word 1': row_word, 'Word 2': col_word,
'Association Strength (Word 1 -> Word 2)': df[row_word][col_word]/all_words[all_words.Word==row_word]['Count'].values[0]}, ignore_index=True)
assoc_df.sort_values(by='Association Strength (Word 1 -> Word 2)', ascending=False)
This produces the word associations like so:
Word 1 Word 2 Association Strength (Word 1 -> Word 2)
330 wiki encyclopedia 3.0
895 encyclopadia found 1.0
1317 anyone edit 1.0
754 peer science 1.0
755 peer encyclopadia 1.0
756 peer britannica 1.0
...
...
...
However, the code contains a lot of for loops which hampers its running time. Specially the last part (sort the word pairs in decreasing order of their association strengths) consumes a lot of time as it computes the association strengths of n^2 word pairs/combinations, where n is the number of words we are interested in (those in word_list in my code above).
So, the following are what I would like some help on:
How do I vectorize the code, or otherwise make it more efficient?
Instead of producing n^2 combinations/pairs of words in the last step, is there any way to prune some of them before producing them? I am going to prune some of the useless/meaningless pairs by inspection after they are produced anyway.
Also, and I know this does not fall into the purview of a coding question, but I would love to know if there's any mistake in my logic, specially when calculating the word association strengths.

Since you asked about your specific code, I will not go into alternate libraries. I will be mostly concentrating on points 1) and 2) of your question:
Instead of iterating through the whole word ist twice (i and j) you can already cut the processing time by ~ half by just iterating j between i + i and the end of the list. This removes duplicate pairs (index 24 and 42 as well as index 42 and 24) as well as the identical pair (index 42 and 42).
for sent in text.split('.'):
for i in range(len(word_list)):
for j in range(i+1, len(word_list)):
if word_list[i] in sent and word_list[j] in sent:
table[i,j]+=1
Be careful with this, though. the in operator will also match partial words (like and in hand)
Of course, you could also remove the j iteration completely by first filtering for all words in your word list and then pairing them afterward:
word_list = set() # Using set instead of list makes lookups faster since this is a hashed structure
for sent in text.split('.'):
for word in sent.split():
'''
Extract the unique, non-stopword nouns only
'''
if word not in word_list and word not in stop_words and nltk.pos_tag([word])[0][1] in desirable_tags:
word_list.add(word)
(...)
for sent in text.split('.'):
found_words = [word for word in sent.split() if word in word_list] # list comprehensions are usually faster than pure for loops
# If you want to count duplicate words, then leave the whole line below out.
found_words = tuple(frozenset(found_words)) # make every word unique using a set and then iterable by index again by converting it into a tuple.
for i in range(len(found_words):
for j in range(i+1, len(found_words):
table[i, j] += 1
In general, you should really think about using external libraries for most of this, though. As some of the comments on your question already pointed out, splitting on . may get you wrong results, the same counts for splitting on whitespace, for example with words that are separated with a - or words that are followed by a ,.

NER naive algorithm

I never really dealt with NLP but had an idea about NER which should NOT have worked and somehow DOES exceptionally well in one case. I do not understand why it works, why doesn't it work or weather it can be extended.
The idea was to extract names of the main characters in a story through:
Building a dictionary for each word
Filling for each word a list with the words that appear right next to it in the text
Finding for each word a word with the max correlation of lists (meaning that the words are used similarly in the text)
Given that one name of a character in the story, the words that are used like it, should be as well (Bogus, that is what should not work but since I never dealt with NLP until this morning I started the day naive)
I ran the overly simple code (attached below) on Alice in Wonderland, which for "Alice" returns:
21 ['Mouse', 'Latitude', 'William', 'Rabbit', 'Dodo', 'Gryphon', 'Crab', 'Queen', 'Duchess', 'Footman', 'Panther', 'Caterpillar', 'Hearts', 'King', 'Bill', 'Pigeon', 'Cat', 'Hatter', 'Hare', 'Turtle', 'Dormouse']
Though it filters for upper case words (and receives "Alice" as the word to cluster around), originally there are ~500 upper case words, and it's still pretty spot on as far as main characters goes.
It does not work that well with other characters and in other stories, though gives interesting results.
Any idea if this idea is usable, extendable or why does it work at all in this story for "Alice" ?
Thanks!
#English Name recognition
import re
import sys
import random
from string import upper
def mimic_dict(filename):
dict = {}
f = open(filename)
text = f.read()
f.close()
prev = ""
words = text.split()
for word in words:
m = re.search("\w+",word)
if m == None:
continue
word = m.group()
if not prev in dict:
dict[prev] = [word]
else :
dict[prev] = dict[prev] + [word]
prev = word
return dict
def main():
if len(sys.argv) != 2:
print 'usage: ./main.py file-to-read'
sys.exit(1)
dict = mimic_dict(sys.argv[1])
upper = []
for e in dict.keys():
if len(e) > 1 and e[0].isupper():
upper.append(e)
print len(upper),upper
exclude = ["ME","Yes","English","Which","When","WOULD","ONE","THAT","That","Here","and","And","it","It","me"]
exclude = [ x for x in exclude if dict.has_key(x)]
for s in exclude :
del dict[s]
scores = {}
for key1 in dict.keys():
max = 0
for key2 in dict.keys():
if key1 == key2 : continue
a = dict[key1]
k = dict[key2]
diff = []
for ia in a:
if ia in k and ia not in diff:
diff.append( ia)
if len(diff) > max:
max = len(diff)
scores[key1]=(key2,max)
dictscores = {}
names = []
for e in scores.keys():
if scores[e][0]=="Alice" and e[0].isupper():
names.append(e)
print len(names), names
if __name__ == '__main__':
main()

From the looks of your program and previous experience with NER, I'd say this "works" because you're not doing a proper evaluation. You've found "Hare" where you should have found "March Hare".
The difficulty in NER (at least for English) is not finding the names; it's detecting their full extent (the "March Hare" example); detecting them even at the start of a sentence, where all words are capitalized; classifying them as person/organisation/location/etc.
Also, Alice in Wonderland, being a children's novel, is a rather easy text to process. Newswire phrases like "Microsoft CEO Steve Ballmer" pose a much harder problem; here, you'd want to detect
[ORG Microsoft] CEO [PER Steve Ballmer]

What you are doing is building a distributional thesaurus-- finding words which are distributionally similar to a query (e.g. Alice), i.e. words that appear in similar contexts. This does not automatically make them synonyms, but means they are in a way similar to the query. The fact that your query is a named entity does not on its own guarantee that the similar words that you retrieve will be named entities. However, since Alice, the Hare and the Queen tend to appear is similar context because they share some characteristics (e.g. they all speak, walk, cry, etc-- the details of Alice in wonderland escape me) they are more likely to be retrieved. It turns out whether a word is capitalised or not is a very useful piece of information when working out if something is a named entity. If you do not filter out the non-capitalised words, you will see many other neighbours that are not named entities.
Have a look at the following papers to get an idea of what people do with distributional semantics:
Lin 1998
Grefenstette 1994
Schuetze 1998
To put your idea in the terminology used in these papers, Step 2 is building a context vector for the word with from a window of size 1. Step 3 resembles several well-known similarity measures in distributional semantics (most notably the so-called Jaccard coefficient).
As larsmans pointed out, this seems to work so well because you are not doing a proper evaluation. If you ran this against a hand-annotated corpus you will find it is very bad at identifying the boundaries of names entities and it does not even attempt to guess if they are people or places or organisations... Nevertheless, it is a great first attempt at NLP, keep it up!

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.