I am trying to understand why the output of an sklearn prediction is different when the call is put inside a function.
I have a naive Bayes classifier trained on text, and when I make my predictions like this:
examples = ['my favorite sport is probably baseball']
predictions = vec_clf.predict(examples)[0]
probs = vec_clf.predict_proba(examples)
m = np.max(probs)
print predictions,m
I get the correct prediction result. However, if I write a function to do this:
def classify(input):
    predictions = vec_clf.predict(input)[0]
    probs = vec_clf.predict_proba(input)
    m = np.max(probs)
    return predictions, m
classify('my favorite sport is probably baseball')
It returns a completely different and very wrong result with different confidence and class label. Why would it do this?
In the first attempt you are passing a list of strings to model.predict_proba and model.predict (which is what they expect); in the second attempt you are passing a single string. Instead, pass a list of strings:
classify(['my favorite sport is probably baseball'])
Or wrap input in a list inside your function:
def classify(input):
    input = [input]
    predictions = vec_clf.predict(input)[0]
    probs = vec_clf.predict_proba(input)
    m = np.max(probs)
    return predictions, m
What is going on when you pass only a string is that each individual character is interpreted as a separate document. To see this more clearly, try just doing:
vec_clf.predict('my favorite sport is probably baseball')
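A quick way to see why this happens (a minimal sketch, independent of the model): iterating over a string yields single characters, and scikit-learn treats each element of whatever iterable it receives as a document.
text = 'my favorite sport is probably baseball'
print(list(text)[:5])  # ['m', 'y', ' ', 'f', 'a'] -- each character becomes a "document"
print(len(text))       # 38 "documents" instead of 1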
Related
dfff is a DataFrame that has already been tokenized and will be converted to tf-idf using TfidfVectorizer.
A sample of dfff was shown as a screenshot (not reproduced here).
Then I create a TfidfVectorizer:
tfidf_vecer2 = TfidfVectorizer(analyzer = 'word', token_pattern=None)
Then I ran this code:
tfidf_vectorr= tfidf_vecer2.fit_transform(dfff)
tfidf_array = np.array(tfidf_vectorr.todense())
Suddenly a TypeError occurred, and I still can't figure it out.
I tried using a list instead of a DataFrame, but it still errors. This is the output:
TypeError: first argument must be string or compiled pattern
Can't see your example dataframe, but let's assume it's something like this:
import nltk
import pandas as pd
df = pd.DataFrame({'text':["Though worlds of wanwood leafmeal lie",
"And yet you will weep and know why"]})
df['tokenized'] = df['text'].apply(nltk.word_tokenize)
text tokenized
0 Though worlds of wanwood leafmeal lie [Though, worlds, of, wanwood, leafmeal, lie]
1 And yet you will weep and know why [And, yet, you, will, weep, and, know, why]
Then you need a dummy function to use as the tokenizer (and preprocessor), in order to leave the input as it is:
from sklearn.feature_extraction.text import TfidfVectorizer

def func(x):
    return x

tfidf_vec = TfidfVectorizer(tokenizer=func, analyzer='word',
                            preprocessor=func, token_pattern=None)
tfidf_vec.fit(df['tokenized'])
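From there you can transform exactly as in your original attempt; a minimal sketch using the assumed df above:
import numpy as np

tfidf_vector = tfidf_vec.transform(df['tokenized'])
tfidf_array = np.array(tfidf_vector.todense())
print(tfidf_array.shape)  # (number of documents, vocabulary size)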
There are plenty of good tutorials on building a TensorFlow model, and I have managed to create a model with good accuracy. However, two questions are left.
In my dataset there are plenty of classes, I try to illustrate that like this:
label - text
--------------------
A - this is a A text
B - this is a B text
C - this is a C text
...
Z - this is a Z text
...
ZA - this is a ZA text
...
Now I want to build a network that learns to classify the texts. I understand that I have to provide a fixed set of labels, because the net needs a fixed number of output neurons. So, for learning purposes, I started by building a network for only the 3 classes A, B and C. I fed the network only the corresponding rows (A, B, C) and got a model that can recognize A, B and C with good accuracy.
Now I want to predict new texts and would like to get an output like this:
input text -> predicted label
----------------------------
this is a B text -> B // successful prediction
this is a xyz text -> ? // cannot be predicted, because not learned
How do I achieve the "not predictable" for the not yet learned classes?
All in all, my way of producing a CSV file with an added prediction column might be a little clumsy. Could you show me how to do it better?
import numpy as np
import pandas as pd
import tensorflow as tf
df = pd.read_parquet(path)
#print(df)
#label = df['kategorie'].fillna("N/A")
text = df['text'].fillna("")
text_padded = tokenize_and_pad(text)
# Predictions
probability_model = tf.keras.Sequential([model,
tf.keras.layers.Softmax()])
predictions = probability_model.predict(text_padded)
# get the predicted labels
# I only achieved this with this loop - there must be a more elegant way???
predictedLabels = []
for prediction in predictions:
    labelID = np.argmax(prediction)
    predictedLabel = label_encoder.inverse_transform([labelID])
    predictedLabels.append(predictedLabel)
# add the new column to the dataframe
# the prediction is accurate for the learned labels
# but totally wrong for the labels, that I excluded from the learning
df['predictedLabels'] = predictedLabels
# todo: write to file
From your question, I understand that you need help in two areas:
Answer to the question, How do I achieve the "not predictable" for the not yet learned classes?:
a. Since you want to consider only 3 classes, instead of deleting the rows corresponding to the other classes, you can replace the labels of those rows with "Not Predictable", i.e., replace 'D', 'E', 'F', etc. with "Not Predictable" (a short sketch follows below).
b. In the final Dense layer, change the number of neurons from 3 to 4, with the 4th class representing "Not Predictable".
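As a concrete sketch of the relabeling in point a (assuming the label column is the 'kategorie' column from your commented-out line):
keep = ['A', 'B', 'C']
df['kategorie'] = df['kategorie'].where(df['kategorie'].isin(keep), 'Not Predictable')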
Answer to the question, How to write Predictions to a CSV File:
Now that the predictions are added as a column in the DataFrame df, you can write it to a CSV file using the command:
df.to_csv('My_Predictions.csv')
For more information about this command, please refer to this link.
The way you are accessing the labels already looks elegant.
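That said, if you want to drop the explicit loop, argmax can be applied to the whole prediction array at once; a sketch, assuming label_encoder is a fitted sklearn LabelEncoder:
labelIDs = np.argmax(predictions, axis=1)  # one label index per row
df['predictedLabels'] = label_encoder.inverse_transform(labelIDs)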
Please let me know if you face any other errors and I will be happy to help.
Hope this helps. Happy Learning!
I am just a beginner in this subject. I have tested some NNs for image recognition as well as NLP for sequence classification.
This second topic is interesting for me.
Using
sentences = [
'some test sentence',
'and the second sentence'
]
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
sentences = tokenizer.texts_to_sequences(sentences)
will result in an array of size [n, 1], where n is the number of words in the sentence. And assuming I have implemented padding correctly, each training example in the set will be of size [n, 1], where n is the max sentence length.
That prepared training set I can then pass into Keras model.fit.
But what about when I have multiple features in my dataset?
Let's say I would like to build an event prioritization algorithm and my data structure looks like:
[event_description, event_category, event_location, label]
Trying to tokenize such an array would result in an [n, m] matrix, where n is the maximum word length and m is the number of features.
How do I prepare such a dataset so that a model can be trained on it?
Would this approach be OK:
# Going through the training set to get all features into specific arrays
for data in dataset:
    training_sentence.append(data['event_description'])
    training_category.append(data['event_category'])
    training_location.append(data['event_location'])
    training_labels.append(data['label'])

# Tokenize each array so it contains tokenized values
tokenizer.fit_on_texts(training_sentence)
tokenizer.fit_on_texts(training_category)
tokenizer.fit_on_texts(training_location)

sequences = tokenizer.texts_to_sequences(training_sentence)
categories = tokenizer.texts_to_sequences(training_category)
locations = tokenizer.texts_to_sequences(training_location)

# Concatenating arrays with features into one
training_example = numpy.concatenate([sequences, categories, locations])

# omitting model definition, training the model
model.fit(training_example, training_labels, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))
I haven't tested it yet. I just want to make sure that I understand everything correctly and that my assumptions are correct.
Is this a correct approach to creating NLP models with a NN?
I know of two common ways to manage multiple input sequences, and your approach lands somewhere between them.
One approach is to design a multi-input model with each of your text columns as a different input. They can share the vocabulary and/or embedding layer later, but for now you still need a distinct input sub-model for each of description, category, etc.
Each of these becomes an input to the network, using the Model(inputs=[...], outputs=rest_of_nn) syntax. You will need to design rest_of_nn so it can take multiple inputs. This can be as simple as your current concatenation, or you could use additional layers to do the synthesis.
It could look something like this:
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Flatten, concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.text import Tokenizer
import tensorflow.keras.backend as K

# Build separate vocabularies. This could be shared.
desc_tokenizer = Tokenizer()
desc_tokenizer.fit_on_texts(training_sentence)
desc_vocab_size = len(desc_tokenizer.word_index) + 1  # +1 because Tokenizer indices start at 1
categ_tokenizer = Tokenizer()
categ_tokenizer.fit_on_texts(training_category)
categ_vocab_size = len(categ_tokenizer.word_index) + 1
# Inputs.
desc = Input(shape=(desc_maxlen,))
categ = Input(shape=(categ_maxlen,))
# Input encodings, opting for different embeddings.
# Descriptions go through an LSTM as a demo of extra processing.
embedded_desc = Embedding(desc_vocab_size, desc_embed_size, input_length=desc_maxlen)(desc)
encoded_desc = LSTM(categ_embed_size, return_sequences=True)(embedded_desc)
encoded_categ = Embedding(categ_vocab_size, categ_embed_size, input_length=categ_maxlen)(categ)
# Rest of the NN, which knows how to put everything together to get an output.
merged = concatenate([encoded_desc, encoded_categ], axis=1)
rest_of_nn = Dense(hidden_size, activation='relu')(merged)
rest_of_nn = Flatten()(rest_of_nn)
rest_of_nn = Dense(output_size, activation='softmax')(rest_of_nn)
# Create the model, assuming some sort of classification problem.
model = Model(inputs=[desc, categ], outputs=rest_of_nn)
model.compile(optimizer='adam', loss=K.categorical_crossentropy)
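Training such a multi-input model then takes a list of arrays, one per input, in the same order as inputs=[desc, categ]; a sketch, where desc_padded and categ_padded are assumed to be your padded sequence arrays:
model.fit([desc_padded, categ_padded], training_labels,
          epochs=num_epochs,
          validation_data=([desc_val_padded, categ_val_padded], val_labels))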
The second approach is to concatenate all of your data before encoding it, and then treat everything as a more standard single-sequence problem after that. It is common to use a unique token to separate or define the different fields, similar to BOS and EOS for the beginning and end of the sequence.
It would look something like this:
XXBOS XXDESC This event will be fun. XXCATEG leisure XXLOC Seattle, WA XXEOS
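Building those combined strings is straightforward; a sketch, where combine_fields is a hypothetical helper and the lists are the ones from your question:
def combine_fields(description, category, location):
    # Hypothetical helper: join the fields into one sequence with marker tokens.
    return 'XXBOS XXDESC {} XXCATEG {} XXLOC {} XXEOS'.format(description, category, location)

combined = [combine_fields(d, c, l)
            for d, c, l in zip(training_sentence, training_category, training_location)]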
You can also do end tags for the fields like DESCXX, omit the BOS and EOS tokens, and generally mix and match however you want. You can even use this to combine some of your input sequences, but then use a multi-input model as above to merge the rest.
Speaking of mixing and matching, you also have the option to treat some of your inputs directly as an embedding. Low-cardinality fields like category and location do not need to be tokenized, and can be embedded directly without any need to split into tokens. That is, they don't need to be a sequence.
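A sketch of that last idea, assuming the category has already been mapped to a single integer id (num_categories and categ_embed_size are placeholders):
categ_id = Input(shape=(1,))                           # one integer id per example
embedded = Embedding(num_categories, categ_embed_size)(categ_id)
encoded_categ = Flatten()(embedded)                    # shape: (batch, categ_embed_size)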
If you are looking for a reference, I enjoyed this paper on Large Scale Product Categorization using Structured and Unstructured Attributes. It tests all or most of the ideas I have just outlined, on real data at scale.
How do I make "First word in the doc was [target word]" a feature?
Consider these two sentences:
example = ["At the moment, my girlfriend is Jenny. She is working as an artist at the moment.",
"My girlfriend is Susie. She is working as an accountant at the moment."]
If I were trying to measure relationship commitment, I'd want to be able to treat the phrase "at the moment" as a feature only when it shows up at the beginning like that.
I would love to be able to use regexes in the vocabulary...
phrases = ["^at the moment", 'work']
vect = CountVectorizer(vocabulary=phrases, ngram_range=(1, 3), token_pattern=r'\w{1,}')
dtm = vect.fit_transform(example)
But that doesn't seem to work.
I have also tried this, but get a 'vocabulary is empty' error...
CountVectorizer(token_pattern = r"(?u)^currently")
What's the right way to do this? Do I need a custom vectorizer? Any simple tutorials you can link me to? This is my first sklearn project, and I've been Googling this for hours. Any help much appreciated!
OK I think I've figured out a way, based on hacking the get_tweet_length() function in this tutorial...
https://ryan-cranfill.github.io/sentiment-pipeline-sklearn-4/
I added this function...
import re

def first_words(text):
    matchesList = re.findall('^at the moment', text, re.I)
    if len(matchesList) > 0:
        return 1
    else:
        return 0
And used it with the sklearn_helper pipelinize_feature() function, which converts the output into the array format expected by sklearn's FeatureUnion.
vect4 = pipelinize_feature(first_words, active=True)
I can then use this along with my normal CountVectorizers via FeatureUnion:
unionObj = FeatureUnion([
    ('vect1', vect1),
    ('vect2', vect2),
    ('vect4', vect4)
])
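The union is then used like any other vectorizer (vect1 and vect2 here stand for the ordinary CountVectorizers mentioned above):
dtm = unionObj.fit_transform(example)
print(dtm.shape)  # one row per document, columns from all the stacked feature blocks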
I have read a description of how to apply random forest regression here. In this example the authors use the following code to create the features:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer = "word",max_features = 5000)
train_data_features = vectorizer.fit_transform(clean_train_reviews)
train_data_features = train_data_features.toarray()
I am thinking of combining several possibilities as features and turning them on and off, but I don't know how to do it.
What I have so far is a class where I can turn the features on and off and see whether it brings something (for example, all unigrams and the 20 most frequent unigrams; it could then be the 10 most frequent adjectives, tf-idf, etc.). But for now I don't understand how to combine them.
The code looks like this, and in the function part I am lost (the kind of function I have would replicate what they do in the tutorial, but it doesn't seem to be really helpful the way I do it):
class FeatureGen: # for example, feat = FeatureGen(unigrams=False) creates a feature set without the turned-off feature
    def __init__(self, unigrams=True, unigrams_freq=True):
        self.unigrams = unigrams
        self.unigrams_freq = unigrams_freq

    def get_features(self, input):
        vectorizer = CountVectorizer(analyzer="word", max_features=5000)
        tokens = input["token"]
        if self.unigrams:
            train_data_features = vectorizer.fit_transform(tokens)
        return train_data_features
What should I do to add one more feature possibility, like "contains the 10 most frequent words"?
if self.unigrams:
    train_data_features = vectorizer.fit_transform(tokens)
if self.unigrams_freq:
    # something else
return features # and this should be a combination somehow
Looks like you need np.hstack.
However, you need each feature array to have one row per training case.
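A minimal sketch of the idea (the two CountVectorizers here are just assumed stand-ins for your unigram and most-frequent-unigram features):
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["red apple pie", "green apple tart", "red grape juice"]

blocks = []
vec_all = CountVectorizer(analyzer="word", max_features=5000)
blocks.append(vec_all.fit_transform(docs).toarray())   # all unigrams

vec_top = CountVectorizer(analyzer="word", max_features=20)
blocks.append(vec_top.fit_transform(docs).toarray())   # 20 most frequent unigrams

# Stack column-wise: every row still corresponds to one training case.
features = np.hstack(blocks)
print(features.shape)  # (n_documents, total_number_of_feature_columns)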