I am trying to train a naive Bayes classifier and I am having trouble representing the data. I plan to use it for extractive text summarization.
Example_Input: It was a sunny day. The weather was nice and the birds were singing.
Example_Output: The weather was nice and the birds were singing.
I have a dataset that I plan to use, and in every document at least one sentence belongs to the summary.
I decided to use sklearn, but I don't know how to represent the data that I have, namely X and y.
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X, y)
The closest I can think of is to make it like this:
X = [
'It was a sunny day. The weather was nice and the birds were singing.',
'I like trains. Hi, again.'
]
y = [
[0,1],
[1,0]
]
where the target values mean 1 - included in the summary and 0 - not included. Unfortunately this raises a bad-shape exception because y is expected to be a 1-d array. I cannot think of a way to represent my data like that, so please help.
By the way, I don't use the string values in X directly; I represent them as vectors with CountVectorizer and TfidfTransformer from sklearn.
As per your requirement, you are classifying sentences. That means you need to separate each sentence in order to predict its class.
For example, instead of using:
X = [
'It was a sunny day. The weather was nice and the birds were singing.',
'I like trains. Hi, again.'
]
use it as follows:
X = [
'It was a sunny day.',
'The weather was nice and the birds were singing.',
'I like trains.',
'Hi, again.'
]
Use NLTK's sentence tokenizer to achieve this.
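For instance, a minimal sketch with NLTK's sent_tokenize (this assumes the punkt tokenizer data has already been downloaded with nltk.download('punkt')):
import nltk

document = 'It was a sunny day. The weather was nice and the birds were singing.'
sentences = nltk.sent_tokenize(document)
print(sentences)
# ['It was a sunny day.', 'The weather was nice and the birds were singing.']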
Now, for the labels, use two classes: say 1 for included in the summary, 0 for not included.
y = [0, 1, 1, 0]
Now, use this data to fit and predict the way you want!
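Putting it together with the CountVectorizer you mentioned, a minimal sketch (the labels follow your example) could look like this:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

X_text = [
    'It was a sunny day.',
    'The weather was nice and the birds were singing.',
    'I like trains.',
    'Hi, again.'
]
y = [0, 1, 1, 0]  # 1 = included in the summary, 0 = not included

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X_text)  # one row per sentence
clf = MultinomialNB().fit(X, y)

print(clf.predict(vectorizer.transform(['The weather was nice.'])))  # predicted label for a new sentence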
Hope it helps!
I need to generate embeddings for documents in lists, calculate the cosine similarity between every sentence of corpus 1 and every sentence of corpus 2, rank them, and give out the best fit:
import tensorflow_hub as hub
from sklearn.metrics.pairwise import cosine_similarity  # assuming sklearn's pairwise implementation

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings1 = ["I'd like an apple juice",
"An apple a day keeps the doctor away",
"Eat apple every day",
"We buy apples every week",
"We use machine learning for text classification",
"Text classification is subfield of machine learning"]
embeddings1 = embed(embeddings1)
embeddings2 = ["I'd like an orange juice",
"An orange a day keeps the doctor away",
"Eat orange every day",
"We buy orange every week",
"We use machine learning for document classification",
"Text classification is some subfield of machine learning"]
embeddings2 = embed(embeddings2)
print(cosine_similarity(embeddings1, embeddings2))
The vectors seem to work fine (due to the shape of the array) and also the calculation of the cosine similarity.
My problem is that the Universal Sentence Encoder does not return the embeddings together with their respective strings, which is crucial. The code always has to find the right match, and I must be able to sort by the cosine similarity value.
array([[ 0.7882168 , 0.3366559 , 0.22973989, 0.15428472, -0.10180502,
-0.04344492],
[ 0.256085 , 0.7713026 , 0.32120776, 0.17834462, -0.10769081,
-0.09398925],
[ 0.23850328, 0.446203 , 0.62606746, 0.25242645, -0.03946173,
-0.00908459],
[ 0.24337521, 0.35571027, 0.32963073, 0.6373588 , 0.08571904,
-0.01240187],
[-0.07001016, -0.12002315, -0.02002328, 0.09045915, 0.9141338 ,
0.8373743 ],
[-0.04525191, -0.09421931, -0.00631144, -0.00199519, 0.75919366,
0.9686416 ]])
The goal is that the code finds out itself that the highest cosine similarity of "I'd like an apple juice" in the second corpus is "I'd like an orange juice" and matches them.
I tried for loops, for instance:
for sentence in embeddings1:
    print(sentence, embed(sentence))
resulting in this error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: input must be a vector, got shape: []
[[{{node StatefulPartitionedCall/StatefulPartitionedCall/text_preprocessor/tokenize/StringSplit/StringSplit}}]] [Op:__inference_restored_function_body_5285]
Function call stack:
restored_function_body
As I mentioned in the comment, you should write the for loop as follows:
for sentence in embeddings1:
    print(sentence, embed([sentence]))
The reason is simply that embed expects a list of strings as input; there is not much more to it than that.
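To get the matching you described (the best fit in the second corpus for each sentence of the first), a minimal sketch would take the row-wise argmax of the similarity matrix. Here sentences1 and sentences2 are assumed to hold the original string lists, kept in separate variables instead of being overwritten by embed:
import numpy as np

sim = cosine_similarity(embed(sentences1), embed(sentences2))
for i, j in enumerate(np.argmax(sim, axis=1)):
    # for each sentence of corpus 1, print its closest sentence in corpus 2 and the similarity
    print(sentences1[i], '->', sentences2[j], sim[i, j])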
I'm trying to write a program that, given a list of sentences, returns the most probable one. I want to use GPT-2, but I am quite new to using it (as in, I don't really know how to do it). I'm planning on finding the probability of a word given the previous words and multiplying all the probabilities together to get the overall probability of that sentence occurring; however, I don't know how to find the probability of a word occurring given the previous words. This is my (pseudo) code:
sentences = # my list of sentences
max_prob = 0
best_sentence = sentences[0]
for sentence in sentences:
    words = sentence.split()
    prob = 1  # probability of that sentence
    for idx, word in enumerate(words[1:]):
        prob *= probability(word, " ".join(words[:idx + 1]))  # this is where I need help
    if prob > max_prob:
        max_prob = prob
        best_sentence = sentence
print(best_sentence)
Can I have some help please?
You can also try lm-scorer, a tiny wrapper around transformers that allows you to get sentence probabilities using models that support it (at the time of writing, only GPT-2 models are implemented).
https://github.com/simonepri/lm-scorer
I just used it myself and it works perfectly.
Warning: If you use other transformers / pipelines in the same environment, things may get messy.
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import numpy as np

model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

def score(tokens_tensor):
    # when labels are passed, the model returns the average cross-entropy loss;
    # exponentiating it gives the perplexity of the sentence
    loss = model(tokens_tensor, labels=tokens_tensor)[0]
    return np.exp(loss.cpu().detach().numpy())

texts = ['i would like to thank you mr chairman', 'i would liking to thanks you mr chair in', 'thnks chair']
for text in texts:
    tokens_tensor = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")
    print(text, score(tokens_tensor))
This code snippet could be an example of what you are looking for. You feed the model a list of sentences, and it scores each one; the lower the score (perplexity), the better.
The output of the code above is:
i would like to thank you mr chairman 122.3066
i would liking to thanks you mr chair in 1183.7637
thnks chair 14135.129
I wrote a set of functions that can do precisely what you're looking for. Recall that GPT-2 parses its input into tokens (not words): the last word in 'Joe flicked the grasshopper' is actually three tokens: ' grass', 'ho', and 'pper'. The cloze_finalword function takes this into account, and computes the probabilities of all tokens (conditioned on the tokens appearing before them). You can adapt part of this function so that it returns what you're looking for. I hope you find the code useful!
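The linked code is not reproduced here, but a minimal sketch of the underlying idea (per-token probabilities conditioned on the preceding tokens, not the author's exact cloze_finalword code) could look like this:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

def sentence_log_prob(text):
    ids = tokenizer.encode(text, return_tensors='pt')  # shape (1, seq_len)
    with torch.no_grad():
        logits = model(ids)[0]                          # shape (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits, dim=-1)
    # the prediction at position i-1 gives the probability of token i
    targets = ids[0, 1:].unsqueeze(1)
    token_log_probs = log_probs[0, :-1].gather(1, targets).squeeze(1)
    return token_log_probs.sum()

# a higher (less negative) log probability means a more probable sentence
print(sentence_log_prob('Joe flicked the grasshopper'))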
I think GPT-2 is a bit overkill for what you're trying to achieve. You can build a basic language model that will give you sentence probabilities using NLTK. A tutorial for this can be found here.
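As a rough illustration only (this sketch assumes the nltk.lm module available in recent NLTK versions, and is no substitute for the tutorial), a simple bigram model gives per-word probabilities you can multiply together:
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

tokenized = [['i', 'would', 'like', 'to', 'thank', 'you'],
             ['i', 'would', 'like', 'an', 'apple']]
train, vocab = padded_everygram_pipeline(2, tokenized)  # bigram training data + vocabulary
lm = MLE(2)
lm.fit(train, vocab)
print(lm.score('like', ['would']))  # P(like | would)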
I am trying to train a Naive Bayes classifier to predict whether a movie review is good or bad.
I am following this tutorial but have run into an error when trying to train the model:
https://medium.com/@MarynaL/analyzing-movie-review-data-with-natural-language-processing-7c5cba6ed922
I have followed all steps until training the model. My data and code look like this:
Reviews Labels
0 For fans of Chris Farley, this is probably his... 1
1 Fantastic, Madonna at her finest, the film is ... 1
2 From a perspective that it is possible to make... 1
3 What is often neglected about Harold Lloyd is ... 1
4 You'll either love or hate movies such as this... 1
... ...
14995 This is perhaps the worst movie I have ever se... 0
14996 I was so looking forward to seeing this film t... 0
14997 It pains me to see an awesome movie turn into ... 0
14998 "Grande Ecole" is not an artful exploration of... 0
14999 I felt like I was watching an example of how n... 0
gnb = MultinomialNB()
gnb.fit(all_train_set['Reviews'], all_train_set['Labels'])
However when trying to fit the model I receive this error:
ValueError: could not convert string to float: 'For fans of Chris Farley, this is probably his best film. David Spade pl
If anyone could help me figure out why following this tutorial has gone wrong, it would be greatly appreciated.
Many Thanks
Indeed, with scikit-learn you have to convert the texts to numbers before calling a classifier. You can do so by using, for instance, CountVectorizer or TfidfVectorizer.
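For example, a minimal sketch with TfidfVectorizer (reusing the column names from your data frame) could look like this:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(all_train_set['Reviews'])  # sparse document-term matrix
y = all_train_set['Labels']

clf = MultinomialNB()
clf.fit(X, y)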
If you want to use the more modern word embeddings, you can use the Zeugma package (install it with pip install zeugma in a terminal), e.g.
from zeugma.embeddings import EmbeddingTransformer
embedding = EmbeddingTransformer('glove')
X = embedding.transform(all_train_set['Reviews'])
y = all_train_set['Labels']
from sklearn.naive_bayes import GaussianNB

# GloVe embeddings contain negative values, which MultinomialNB cannot handle,
# so a classifier such as GaussianNB is used on the dense embedding features
gnb = GaussianNB()
gnb.fit(X, y)
I hope it helps!
I have a dataset where each document has a corresponding score/rating:
dataset = [
{"text":"I don't like this small device", "rating":"2"},
{"text":"Really love this large device", "rating":"5"},
....
]
In addition, I have term lists (one variable per category) extracted from the text fields of the same dataset:
x1 = ['short', 'slim', 'small', 'shrink']
x2 = ['big', 'huge', 'large']
So, how can I do a linear regression with the word lists above as multiple independent variables (or rather, variables representing whether any word from the corresponding term list occurs, since each term in the lists is unique) and the rating as the dependent variable? In other words, how could I evaluate the term lists' impact on the rating with sklearn?
I used TfidfVectorizer to derive the document-term matrix. If possible, please provide a simple code snippet or example.
Given the discussion in the comments, it seems that the interpretation should be that each list defines a binary variable whose value depends on whether or not any words from the list appear in the text in question. So, let us first change the texts so that the words actually appear:
dataset = [
{"text": "I don't like this large device", "rating": "2"},
{"text": "Really love this small device", "rating": "5"},
{"text": "Some other text", "rating": "3"}
]
To simplify our work, we'll then load this data into a data frame, change the ratings to be integers, and create the relevant variables:
import pandas as pd

df = pd.DataFrame(dataset)
df['rating'] = df['rating'].astype(int)
df['text'] = df['text'].str.split().apply(set)
x1 = ['short', 'slim', 'small', 'shrink']
x2 = ['big', 'huge', 'large']
df['x1'] = df.text.apply(lambda x: x.intersection(x1)).astype(bool)
df['x2'] = df.text.apply(lambda x: x.intersection(x2)).astype(bool)
That is, at this point df is the following data frame:
rating text x1 x2
0 2 {this, large, don't, like, device, I} False True
1 5 {this, small, love, Really, device} True False
2 3 {other, Some, text} False False
With this, we can create the relevant model, and check what the coefficients end up being:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(df[['x1', 'x2']], df.rating)
print(model.coef_) # array([ 2., -1.])
print(model.intercept_) # 3.0
As also mentioned in the comments, this thing will produce at most four ratings, one for each of the combinations of x1 and x2 being True or False. In this case, it just so happens that all possible outputs are integers, but in general, they need not be, nor need they be confined to the interval of interest. Given the ordinal nature of the ratings, this is really a case for some sort of ordinal regression (cf. e.g. mord).
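For example, predicting over the four combinations makes this explicit (a small check on the model fitted above):
combos = pd.DataFrame({'x1': [False, False, True, True],
                       'x2': [False, True, False, True]})
print(model.predict(combos))  # [3. 2. 5. 4.] given the coefficients above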
I am working on a keyword extraction problem. Consider the very general case:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')  # tokenize is a custom tokenizer defined elsewhere (not shown)
t = """Two Travellers, walking in the noonday sun, sought the shade of a widespreading tree to rest. As they lay looking up among the pleasant leaves, they saw that it was a Plane Tree.
"How useless is the Plane!" said one of them. "It bears no fruit whatever, and only serves to litter the ground with leaves."
"Ungrateful creatures!" said a voice from the Plane Tree. "You lie here in my cooling shade, and yet you say I am useless! Thus ungratefully, O Jupiter, do men receive their blessings!"
Our best blessings are often the least appreciated."""
tfs = tfidf.fit_transform(t.split(" "))
str = 'tree cat travellers fruit jupiter'
response = tfidf.transform([str])
feature_names = tfidf.get_feature_names()
for col in response.nonzero()[1]:
    print(feature_names[col], ' - ', response[0, col])
and this gives me
(0, 28) 0.443509712811
(0, 27) 0.517461475101
(0, 8) 0.517461475101
(0, 6) 0.517461475101
tree - 0.443509712811
travellers - 0.517461475101
jupiter - 0.517461475101
fruit - 0.517461475101
which is good. For any new document that comes in, is there a way to get the top n terms with the highest tfidf score?
You have to do a little bit of a song and dance to get the matrices as numpy arrays instead, but this should do what you're looking for:
import numpy as np

feature_array = np.array(tfidf.get_feature_names())
tfidf_sorting = np.argsort(response.toarray()).flatten()[::-1]

n = 3
top_n = feature_array[tfidf_sorting][:n]
This gives me:
array([u'fruit', u'travellers', u'jupiter'],
dtype='<U13')
The argsort call is really the useful one; here are the docs for it. We have to do [::-1] because argsort only supports sorting from small to large. We call flatten to reduce the dimensions to 1d so that the sorted indices can be used to index the 1d feature array. Note that including the call to flatten will only work if you're testing one document at a time.
Also, on another note, did you mean something like tfs = tfidf.fit_transform(t.split("\n\n"))? Otherwise, each term in the multiline string is being treated as a "document". Using \n\n instead means that we are actually looking at 4 documents (one for each line), which makes more sense when you think about tfidf.
Solution using sparse matrix itself (without .toarray())!
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
corpus = [
'I would like to check this document',
'How about one more document',
'Aim is to capture the key words from the corpus',
'frequency of words in a document is called term frequency'
]
X = tfidf.fit_transform(corpus)
feature_names = np.array(tfidf.get_feature_names())
new_doc = ['can key words in this new document be identified?',
           'idf is the inverse document frequency calculated for each of the words']
responses = tfidf.transform(new_doc)
def get_top_tf_idf_words(response, top_n=2):
    sorted_nzs = np.argsort(response.data)[:-(top_n+1):-1]
    return feature_names[response.indices[sorted_nzs]]
print([get_top_tf_idf_words(response,2) for response in responses])
# [array(['key', 'words'], dtype='<U9'),
#  array(['frequency', 'words'], dtype='<U9')]
Here is a quick piece of code for that (documents is a list of strings):
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tfidf_top_features(documents, n_top=10):
    tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
    tfidf = tfidf_vectorizer.fit_transform(documents)
    # sum the tf-idf scores of each term over all documents and rank terms by that total
    importance = np.argsort(np.asarray(tfidf.sum(axis=0)).ravel())[::-1]
    tfidf_feature_names = np.array(tfidf_vectorizer.get_feature_names())
    return tfidf_feature_names[importance[:n_top]]
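A hypothetical usage example (note that min_df=2 keeps only terms that appear in at least two documents):
docs = ['the cat sat on the mat',
        'the cat chased the dog',
        'the dog chased the ball']
print(get_tfidf_top_features(docs, n_top=3))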