I made a bag-of-words model and when I printed it out, the output doesn't quite make sense.
This is the code I used to initialise the bag of words:
#creating the bag of words model
headline_bow = CountVectorizer()
headline_bow.fit(x)
a = headline_bow.transform(x)
b = headline_bow.get_feature_names()
print(a)
This is a sample of the output that comes out from the bag of words model:
(0, 837) 1
(0, 1496) 1
(0, 1952) 1
(0, 2610) 1
From my understanding, "(0, 837) 1" means that in the first list passed through the model, the 837th word in that list appears once. This makes no sense, because when I print
x[0] I get this:
Four ways Bob Corker skewered Donald Trump
There are clearly not 837 words here, so I'm confused as to what's going on.
Here is a sample of what x is: (a bunch of headlines)
['Four ways Bob Corker skewered Donald Trump'
"Linklater's war veteran comedy speaks to modern America, says star"
'Trump’s Fight With Corker Jeopardizes His Legislative Agenda' ...
'Ron Paul on Trump, Anarchism & the AltRight'
'China to accept overseas trial data in bid to speed up drug approvals'
'Vice President Mike Pence Leaves NFL Game Because of Anti-American Protests']
Here is the rest of my code:
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

data = pd.read_csv("/Users/amanpuranik/Desktop/fake-news-detection/data.csv")
data = data[['Headline', "Label"]]

x = np.array(data['Headline'])
print(x[0])
y = np.array(data["Label"])

# tokenization of the data here
headline_vector = []
for headline in x:
    headline_vector.append(word_tokenize(headline))
print(headline_vector)

stopwords = set(stopwords.words('english'))

# removing stopwords at this part
filtered = [[word for word in sentence if word not in stopwords]
            for sentence in headline_vector]
#print(filtered)

# stem() comes from the stemming library imported elsewhere in the script
stemmed2 = [[stem(word) for word in headline] for headline in filtered]
#print(stemmed2)

# lowercase
lower = [[word.lower() for word in headline] for headline in stemmed2]  # start here

# convert lower into a list of strings
lower_sentences = [" ".join(x) for x in lower]

# organising
articles = []
for headline in lower:
    articles.append(headline)

# creating the bag of words model
headline_bow = CountVectorizer()
headline_bow.fit(lower_sentences)
a = headline_bow.transform(lower_sentences)
print(a)
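A quick way to see which vocabulary entry a column index such as 837 refers to (a minimal sketch reusing the fitted headline_bow above; on newer scikit-learn releases the method is called get_feature_names_out()):
vocab = headline_bow.get_feature_names()   # list of vocabulary words, indexed by column
print(vocab[837])                          # the word counted by entries like (0, 837)
print(headline_bow.vocabulary_)            # full word -> column-index mapping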
Related
I have a text A and a text B. I wish to find the percentage of words in text B (counting all occurrences) not present in the vocabulary (i.e., the list of all unique words) of text A.
E.g.,
A = "a cat and a cat and a mouse"
B = "a plump cat and some more cat and some more mice too"
B has 12 words. "plump", "some", "more", "mice" and "too" are not in A: "plump" occurs once, "some" twice, "more" twice, "mice" once and "too" once, so 7 out of 12 words in B are not in A --> 58% of B is not in A.
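For reference, the calculation above can be sketched in plain Python with collections.Counter (the variable names are just illustrative):
from collections import Counter

A = "a cat and a cat and a mouse"
B = "a plump cat and some more cat and some more mice too"

vocab_a = set(A.split())                 # vocabulary of A: its unique words
counts_b = Counter(B.split())            # word counts of B

total_b = sum(counts_b.values())         # 12 words in B
not_in_a = sum(c for w, c in counts_b.items() if w not in vocab_a)  # 7

print(not_in_a / total_b)                # 0.5833... i.e. ~58%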
I think we can use TensorFlow's Tokenizer. We could probably also use something else; plain Python, another tokenizer, and other solutions are welcome.
With tf.keras.preprocessing.text.Tokenizer I get the word_index for text A
a_tokenizer = Tokenizer()
a_tokenizer.fit_on_texts(textA)  # builds the word index
word_index = a_tokenizer.word_index
and the word_counts for text B
b_tokenizer = Tokenizer()
b_tokenizer.fit_on_texts(generated_text)
word_count = b_tokenizer.word_counts
A bad way to achieve this is to go through the words in B's word counts and look each one up in A:
num_words_b = 0
num_words_b_in_a = 0
for word_b, count_b in b_tokenizer.word_counts.items():
    num_words_b += count_b
    for word_a, index_a in a_tokenizer.word_index.items():
        if word_b == word_a:
            num_words_b_in_a += count_b
            break
and then do 1 - num_words_b_in_a / num_words_b. Is there a more elegant look-up?
EDIT : in fact the above does not work at all, because it tokenizes into characters. I want to tokenize into words.
Maybe doing it like this:
import tensorflow as tf

docs1 = ['Well done!',
         'Good work',
         'Great effort',
         'nice work',
         'Excellent!']
docs2 = ['Well!',
         'work',
         'work',
         'effort',
         'rabbit',
         'nice']

a = tf.keras.preprocessing.text.Tokenizer()
a.fit_on_texts(docs1)
b = tf.keras.preprocessing.text.Tokenizer()
b.fit_on_texts(docs2)

word_index = a.word_index            # vocabulary of A
word_counts = dict(b.word_counts)    # word counts of B

# words of B that also appear in A's vocabulary
diff = set(word_counts.keys()).intersection(set(word_index.keys()))

num_words_b = sum(list(word_counts.values()))
num_words_b_in_a = sum(list(map(word_counts.get, list(diff))))

percentage = 1 - num_words_b_in_a / num_words_b
print(percentage)
0.16666666666666663
Let's say I have a bag of keywords.
Ex :
['profit low', 'loss increased', 'profit lowered']
I have a PDF document and I parse the entire text from it; now I want to get the sentences which match the bag of words.
Let's say one sentence is:
'The profit in the month of November lowered from 5% to 3%.'
This should match, because 'profit lowered' from the bag of words matches this sentence.
What will be the best approach to solve this problem in python?
# input
checking_words = ['profit low', 'loss increased', 'profit lowered']
checking_string = 'The profit in the month of November lowered from 5% to 3%.'
trans_check_words = checking_string.split()
# output
for word_bug in [st.split() for st in checking_words]:
    if word_bug[0] in trans_check_words and word_bug[1] in trans_check_words:
        print(word_bug)
You want to check, for each element of your list of check words, whether it is inside the long sentence:
sentence = 'The profit in the month of November lowered from 5% to 3%.'
words = ['profit','month','5%']
for element in words:
    if element in sentence:
        # do something with it
        print(element)
If you want to go cleaner, you can collect the matched words into a list with a one-line comprehension:
sentence = 'The profit in the month of November lowered from 5% to 3%.'
words = ['profit','month','5%']
matched_words = [word for word in words if word in sentence]  # collects the matched words
print(matched_words)
If you have "spaced" words on each element if your list, you want to take care of it by using the split() method.
sentence = 'The profit in the month of November lowered from 5% to 3%.'
words = ['profit low','month high','5% 3%']
single_words = []
for w in words:
for s in range(len(w.split(' '))):
single_words.append(w.split(' ')[s])
matched_words = [] # Will collect the matched words in the next life loop:
[matched_words.append(word) for word in single_words if word in sentence]
print(matched_words)
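An equivalent, more compact form of the same splitting idea, as a sketch (a set also drops duplicate split words, so the printed order may differ):
single_words = {w for phrase in words for w in phrase.split()}
matched_words = [word for word in single_words if word in sentence]
print(matched_words)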
You can try something like the following.
Convert the bag of words to a single string:
bag_of_words = ['profit low', 'loss increased', 'profit lowered']
bag_of_word_sent = ' '.join(bag_of_words)
Then, with the list of sentences:
list_sents = ['The profit in the month of November lowered from 5% to 3%.']
use the Levenshtein distance:
import distance
for sent in list_sents:
    dist = distance.levenshtein(bag_of_word_sent, sent)
    if dist > len(bag_of_word_sent):
        # do something
        print(dist)
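For context, a tiny usage example of the distance package's levenshtein function (assuming the package is installed, e.g. via pip install distance); it returns the character-level edit distance between two strings:
import distance
print(distance.levenshtein('profit lowered', 'profit low'))  # 4 single-character edits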
I am teaching myself python and have completed a rudimentary text summarizer. I'm nearly happy with the summarized text but want to polish the final product a bit more.
The code performs some standard text processing correctly (tokenization, remove stopwords, etc). The code then scores each sentence based on a weighted word frequency. I am using the heapq.nlargest() method to return the top 7 sentences which I feel does a good job based on my sample text.
The issue I'm facing is that the top 7 sentences are returned sorted from highest score to lowest score. I understand why this is happening. I would prefer to maintain the same sentence order as in the original text. I've included the relevant bits of code and hope someone can guide me to a solution.
#remove all stopwords from text, build clean list of lower case words
clean_data = []
for word in tokens:
    if str(word).lower() not in stoplist:
        clean_data.append(word.lower())

#build dictionary of all words with frequency counts: {key:value = word:count}
word_frequencies = {}
for word in clean_data:
    if word not in word_frequencies.keys():
        word_frequencies[word] = 1
    else:
        word_frequencies[word] += 1
#print(word_frequencies.items())

#update the dictionary with a weighted frequency
maximum_frequency = max(word_frequencies.values())
#print(maximum_frequency)
for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequency)
#print(word_frequencies.items())

#iterate through each sentence and combine the weighted scores of the underlying words
sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]
#print(sentence_scores.items())

summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
print(summary)
I'm testing using the following article: https://www.bbc.com/news/world-australia-45674716
Current output: "Australia bank inquiry: 'They didn't care who they hurt'
The inquiry has also heard testimony about corporate fraud, bribery rings at banks, actions to deceive regulators and reckless practices. A royal commission this year, the country's highest form of public inquiry, has exposed widespread wrongdoing in the industry. The royal commission came after a decade of scandalous behaviour in Australia's financial sector, the country's largest industry. "[The report] shines a very bright light on the poor behaviour of our financial sector," Treasurer Josh Frydenberg said. "When misconduct was revealed, it either went unpunished or the consequences did not meet the seriousness of what had been done," he said. The bank customers who lost everything
He also criticised what he called the inadequate actions of regulators for the banks and financial firms. It has also received more than 9,300 submissions of alleged misconduct by banks, financial advisers, pension funds and insurance companies."
As an example of the desired output: the third sentence above, "A royal commission this year, the country's highest form of public inquiry, has exposed widespread wrongdoing in the industry.", actually comes before "Australia bank inquiry: 'They didn't care who they hurt'" in the original article, and I would like the output to maintain that sentence order.
Got it working, leaving here in case others are curious:
#iterate through each sentence and combine the weighted score of the underlying words
sentence_scores = {}
cnt = 0
for sent in sentence_list:
    sentence_scores[sent] = []
    score = 0
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    score = word_frequencies[word]
                else:
                    score += word_frequencies[word]
    sentence_scores[sent].append(score)
    sentence_scores[sent].append(cnt)
    cnt = cnt + 1

#Sort the dictionary using the score in descending order and then index in ascending order
#Getting the top 7 sentences
#Putting them in 1 string variable
from operator import itemgetter
top7 = dict(sorted(sentence_scores.items(), key=itemgetter(1), reverse=True)[0:7])
#print(top7)

def Sort(sub_li):
    return sorted(sub_li, key=lambda sub_li: sub_li[1])

sentence_summary = Sort(top7.values())

summary = ""
for value in sentence_summary:
    for key in top7:
        if top7[key] == value:
            summary = summary + key
print(summary)
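For comparison, here is a more compact sketch of the same idea (reusing word_frequencies, sentence_list and the nltk/heapq imports from above): score each sentence, take the top 7 with heapq.nlargest, then restore the original document order by sentence position.
indexed_scores = {}  # sentence -> (score, original position)
for pos, sent in enumerate(sentence_list):
    if len(sent.split(' ')) < 30:
        score = sum(word_frequencies.get(w, 0) for w in nltk.word_tokenize(sent.lower()))
        indexed_scores[sent] = (score, pos)

# top 7 by score, then sorted back into document order before joining
top7 = heapq.nlargest(7, indexed_scores, key=lambda s: indexed_scores[s][0])
summary = ' '.join(sorted(top7, key=lambda s: indexed_scores[s][1]))
print(summary)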
I use the LDA package to model topics with a large set of text documents. A simplified(!) example of my code (I removed all the other cleaning steps, lemmatization, bigrams, etc.) is below, and I'm happy with the results so far. But now I'm struggling to write code to predict on a new text. I can't find any reference in LDA's documentation about save/load/predict options. I can add a new text to my set and fit it again, but that is an expensive way of doing it.
I know I can do it with gensim. But somehow the results from the gensim model are less impressive so I'd stick to my initial LDA model.
Will appreciate any suggestions!
My code:
import lda
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import nltk
from nltk.corpus import stopwords
stops = set(stopwords.words('english')) # nltk stopwords list
documents = ["Liz Dawn: Coronation Street's Vera Duckworth dies at 77",
'Game of Thrones stars Kit Harington and Rose Leslie to wed',
'Tony Booth: Till Death Us Do Part actor dies at 85',
'The Child in Time: Mixed reaction to Benedict Cumberbatch drama',
"Alanna Baker: The Cirque du Soleil star who 'ran off with the circus'",
'How long can The Apprentice keep going?',
'Strictly Come Dancing beats X Factor for Saturday viewers',
"Joe Sugg: 8 things to know about one of YouTube's biggest stars",
'Sir Terry Wogan named greatest BBC radio presenter',
"DJs celebrate 50 years of Radio 1 and 2'"]
clean_docs = []
for doc in documents:
    # set all to lower case and tokenize
    tokens = nltk.tokenize.word_tokenize(doc.lower())
    # remove stop words
    texts = [i for i in tokens if i not in stops]
    clean_docs.append(texts)
# join back all tokens to create a list of docs
docs_vect = [' '.join(txt) for txt in clean_docs]
cvectorizer = CountVectorizer(max_features=10000, stop_words=stops)
cvz = cvectorizer.fit_transform(docs_vect)
n_topics = 3
n_iter = 2000
lda_model = lda.LDA(n_topics=n_topics, n_iter=n_iter)
X_topics = lda_model.fit_transform(cvz)
n_top_words = 3
topic_summaries = []
topic_word = lda_model.topic_word_ # get the topic words
vocab = cvectorizer.get_feature_names()
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
    print('Topic {}: {}'.format(i+1, ' '.join(topic_words)))
# How to predict a new document?
new_text = '50 facts about Radio 1 & 2 as they turn 50'
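One possible direction, as a hedged sketch: if the installed version of the lda package exposes a transform() method for inference on unseen documents (worth checking in your version's docs), the already-fitted CountVectorizer can be reused instead of refitting everything:
new_vec = cvectorizer.transform([new_text])   # reuse the fitted vectorizer, do not refit it
doc_topic = lda_model.transform(new_vec)      # assumed API: per-topic probabilities for the new doc
print(doc_topic)
print('Most likely topic:', doc_topic.argmax(axis=1) + 1)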
I just want to ask how I can find words from an array in my string.
I need to build a filter that will find the words I saved in my array in the text the user types into a text window on my web page.
I need to have 30+ words in an array or list or something.
Then the user types text in a text box.
Then the script should find all the words.
Something like a spam filter, I guess.
Thanks
import re
words = ['word1', 'word2', 'word4']
s = 'Word1 qwerty word2, word3 word44'
r = re.compile('|'.join([r'\b%s\b' % w for w in words]), flags=re.I)
r.findall(s)
>> ['Word1', 'word2']
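A small variation on the snippet above that may be safer if the keywords can contain regex metacharacters (re.escape is the only addition):
import re
words = ['word1', 'word2', 'word4']
s = 'Word1 qwerty word2, word3 word44'
r = re.compile('|'.join(r'\b%s\b' % re.escape(w) for w in words), flags=re.I)
print(r.findall(s))   # ['Word1', 'word2']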
Solution 1 uses the regex approach which will return all instances of the keyword found in the data. Solution 2 will return the indexes of all instances of the keyword found in the data
import re
dataString = '''Life morning don't were in multiply yielding multiply gathered from it. She'd of evening kind creature lesser years us every, without Abundantly fly land there there sixth creature it. All form every for a signs without very grass. Behold our bring can't one So itself fill bring together their rule from, let, given winged our. Creepeth Sixth earth saying also unto to his kind midst of. Living male without for fruitful earth open fruit for. Lesser beast replenish evening gathering.
Behold own, don't place, winged. After said without of divide female signs blessed subdue wherein all were meat shall that living his tree morning cattle divide cattle creeping rule morning. Light he which he sea from fill. Of shall shall. Creature blessed.
Our. Days under form stars so over shall which seed doesn't lesser rule waters. Saying whose. Seasons, place may brought over. All she'd thing male Stars their won't firmament above make earth to blessed set man shall two it abundantly in bring living green creepeth all air make stars under for let a great divided Void Wherein night light image fish one. Fowl, thing. Moved fruit i fill saw likeness seas Tree won't Don't moving days seed darkness.
'''
keyWords = ['Life', 'stars', 'seed', 'rule']
#---------------------- SOLUTION 1
print('Solution 1 output:')
for keyWord in keyWords:
    print(re.findall(keyWord, dataString))

#---------------------- SOLUTION 2
print('\nSolution 2 output:')
for keyWord in keyWords:
    index = 0
    indexes = []
    indexFound = 0
    while indexFound != -1:
        indexFound = dataString.find(keyWord, index)
        if indexFound not in indexes:
            indexes.append(indexFound)
        index += 1
    indexes.pop(-1)  # drop the trailing -1 recorded when the keyword is no longer found
    print(indexes)
Output:
Solution 1 output:
['Life']
['stars', 'stars']
['seed', 'seed']
['rule', 'rule', 'rule']
Solution 2 output:
[0]
[765, 1024]
[791, 1180]
[295, 663, 811]
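A shorter way to get the same index lists is re.finditer, whose match objects carry their start positions (a sketch reusing keyWords and dataString from above):
for keyWord in keyWords:
    print([m.start() for m in re.finditer(keyWord, dataString)])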
Try
words = ['word1', 'word2', 'word4']
s = 'word1 qwerty word2, word3 word44'
s1 = s.split(" ")
i = 0
for x in s1:
    x = x.strip(',')   # drop the trailing comma on 'word2,' so it can match
    if x in words:
        print(x)
        i += 1
print("count is " + str(i))
output
word1
word2
count is 2