Predict a text with the bag-of-words approach - Python

I am trying text classification using the bag-of-words model. Everything works fine until I use the test set for testing and evaluating accuracy, but how can I check the class of a single statement?
I have a data frame with two columns, label and body.
from sklearn.feature_extraction.text import CountVectorizer
cout_vect = CountVectorizer()
final_count = cout_vect.fit_transform(df['body'].values.astype('U'))
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
X_train, X_test, y_train, y_test = train_test_split(final_count, df['label'], test_size = .3, random_state=25)
model = Sequential()
model.add(Dense(264, input_dim=X_train.shape[1], activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
y_train = np_utils.to_categorical(y_train, num_classes=3)
y_test = np_utils.to_categorical(y_test, num_classes=3)
model.fit(X_train, y_train, epochs=50, batch_size=32)
model.evaluate(x=X_test, y=y_test, batch_size=None, verbose=1, sample_weight=None)
Now I want to predict the class of this statement using my model. How do I do this?
I tried converting my statement to a vector using the count vectorizer, but it comes out as just an 8-dimensional vector.
x = "Your account balance has been deducted for 4300"
model.predict(x, batch_size=None, verbose=0, steps=None)

You need to do this:
# First transform the sentence to bag-of-words according to the already learnt vocabulary
x = cout_vect.transform([x])
# Then send the feature vector to the predict
print(model.predict(x, batch_size=None, verbose=0, steps=None))
You have not shown how you "tried converting my statement to vector using the count vectorizer", but I'm guessing you did this:
cout_vect.fit_transform([x])
If you call fit() (or fit_transform()) again, the vectorizer forgets the vocabulary learnt during training and only remembers the vocabulary of the current input; hence you got a feature vector of size 8, whereas your training vectors had the full vocabulary size.
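Putting it together, a minimal end-to-end sketch (assuming cout_vect and model are the fitted vectorizer and trained model from above, and that the labels were encoded as the integers 0-2):
import numpy as np
x = "Your account balance has been deducted for 4300"
# Reuse the vocabulary learnt during training: transform, never fit again
x_vec = cout_vect.transform([x])
# Depending on your Keras version you may need a dense array: x_vec.toarray()
probs = model.predict(x_vec)
# The index of the largest softmax probability is the predicted class
print(np.argmax(probs, axis=1))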

Text binary classification error: logits and labels must have the same shape

I'm trying to build a model to classify my text as hate (1) or not (0) using a neural network.
About the data: it consists of tweets and a class label (hate (1) or not (0)):
sentences = df['comment']
y = df['isHate']
sentences_train, sentences_test, train_y, test_y = train_test_split(sentences, y, test_size=0.25, random_state=42)
The text goes through word embeddings; I applied pad_sequences to the tweets and LabelEncoder to the labels.
The problem is that when I run it, I get this error:
ValueError: logits and labels must have the same shape ((None, 1) vs (None, 2))
The code of the model:
emb_dim = 16
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim= emb_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
The problem happens in this part:
history = model.fit(X_train, y_train,
                    batch_size=32,
                    epochs=15,
                    validation_data=(X_test, y_test))
Any help?
In your code:
model.add(Dense(1, activation='sigmoid'))
Your last dense layer has only 1 unit, but your labels are one-hot encoded and consist of 2 classes. So you need to change it to:
model.add(Dense(2, activation='softmax'))
You also need to change your loss function, because the labels are one-hot encoded:
loss='categorical_crossentropy'
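The two internally consistent setups, as a sketch (pick one; the rest of the model stays the same):
# Option 1: integer labels of shape (None, 1) with a single sigmoid unit
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Option 2: one-hot labels of shape (None, 2) with two softmax units
model.add(Dense(2, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])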

Binary classification using 04-zoo.csv data

In the zoo data, each sample has one of several types.
To turn this into binary classification, types 0, 1, 2, 3 get label 0 and types 4, 5, 6 get label 1.
Splitting the types into two labels was too hard for me, so I adjusted the data by adding the labels by hand, as in the screen capture below.
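For reference, the relabeling can also be done in code rather than by hand; a sketch, assuming numpy is imported as np and that the original type sits in the last column of the loaded array:
# Types 0-3 map to label 0, types 4-6 map to label 1
types = zoo[:, -1]
labels = (types >= 4).astype(np.float32).reshape(-1, 1)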
and I built the model like below.
import tensorflow as tf
import numpy as np
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense, Dropout
zoo = np.loadtxt('~\\data-04-zoo.csv', delimiter=',', dtype=np.float32)
features = zoo[:, 0:-1]
labels = zoo[:, [-1]]
X_train, X_test, Y_train, Y_test = train_test_split(features, labels, test_size=0.25, shuffle=True, random_state=77)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
print('y_train shape:', Y_train.shape)
print('y_test shape:', Y_test.shape)
model = Sequential()
model.add(Dense(40, input_dim=17, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(40, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='Adam', metrics=['accuracy'])
model.fit(X_train, Y_train, validation_split=0.25, shuffle=True, epochs=20, batch_size=2)
score = model.evaluate(X_test, Y_test, batch_size=2)
print(score)
My professor said that this model can't learn.
What's the problem? I don't know how to adjust the code.
The result: [screenshot of the training log omitted]

Keras gives lower accuracy than other classifiers

I use Python for multi-class text classification; my data set contains 25,000 Arabic tweets divided into 10 classes [sport, politics, ...].
When I use scikit-learn:
import pandas as pd
training = pd.read_csv(r'E:\cluster data\One_File_nonnormalizenew2norm.txt', sep="*")
training.dropna(inplace=True)
training.columns = ["text", "class1"]
training['class1'] = training.class1.astype('category').cat.codes
training.dropna(inplace=True)
# create our training data from the tweets
text = training['text']
y = (training['class1'])
from sklearn.model_selection import train_test_split
sentences_train, sentences_test, y_train, y_test = train_test_split(text, y, test_size=0.25, random_state=1000)
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)
X_train = vectorizer.transform(sentences_train)
X_test = vectorizer.transform(sentences_test)
X_train
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
score = classifier.score(X_test, y_test)
print("Accuracy:", score)
Accuracy: 0.9525099601593625
When I use Keras:
model = Sequential()
max_words=5000
model.add(Dense(512, input_shape=(input_dim,), activation='softmax'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='softmax'))
model.add(Dropout(0.5))
model.add(Dense(1,activation='softmax'))
model.add(Dense(10))
model.summary()
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=150, epochs=5, verbose=1, validation_split=0.3,shuffle=True)
predicted = model.predict(X_test)
predicted = np.argmax(predicted, axis=1)
accuracy_score(y_test, predicted)
0.28127490039840636
Where is the mistake?
Update
I changed the code to:
model = Sequential()
max_words=5000
model.add(Dense(512, input_shape=(input_dim,)))
model.add(Dropout(0.5))
model.add(Dense(256))
model.add(Dropout(0.5))
#model.add(Dense(1,activation='sigmoid'))####
model.add(Dense(10))
model.summary()
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(X_train, y_train,batch_size=150,epochs=10,verbose=1,validation_split=0.3,shuffle=True)
predicted = model.predict(X_test)
predicted = np.argmax(predicted, axis=1)
accuracy_score(y_test, predicted)
0.7201593625498008
Still bad accuracy!
Some ideas:
Remove all the softmax activations from the hidden layers, as @Matias said (see the sketch below).
Remove the model.add(Dense(1, activation='softmax')) line; squeezing everything through a single unit is probably destroying your results.
Train for more than 5 epochs.
You are not using the same tweets for validation in the two approaches.
You should report the accuracy on both the training and the test sets to be sure what is going on.
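A sketch of a fixed architecture, assuming input_dim is the vocabulary size from the vectorizer and the class codes are the integers 0-9:
model = Sequential()
model.add(Dense(512, input_shape=(input_dim,), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
# softmax only on the output layer, one unit per class
model.add(Dense(10, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=150, epochs=10, verbose=1,
          validation_split=0.3, shuffle=True)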

TensorFlow - Stop training when losses reach a defined value

I used the first example here as my network.
How can I stop training when the loss reaches a fixed value?
For example, I would like to set a maximum of 3000 epochs, with training stopping as soon as the loss drops under 0.2.
I read this topic, but it is not the solution I am looking for: I want to stop training when the loss reaches a given value, not when there is no improvement, as with the function proposed in that topic.
Here is the code:
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import SGD
# Generate dummy data
import numpy as np
x_train = np.random.random((1000, 20))
y_train = keras.utils.to_categorical(np.random.randint(10, size=(1000, 1)), num_classes=10)
x_test = np.random.random((100, 20))
y_test = keras.utils.to_categorical(np.random.randint(10, size=(100, 1)), num_classes=10)
model = Sequential()
# Dense(64) is a fully-connected layer with 64 hidden units.
# in the first layer, you must specify the expected input data shape:
# here, 20-dimensional vectors.
model.add(Dense(64, activation='relu', input_dim=20))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])
model.fit(x_train, y_train,
          epochs=3000,
          batch_size=128)
score = model.evaluate(x_test, y_test, batch_size=128)
You can use a callback like this if you switch to TensorFlow 2.0:
class haltCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs={}):
        if logs.get('loss') <= 0.05:
            print("\n\n\nReached 0.05 loss value so cancelling training!\n\n\n")
            self.model.stop_training = True
You just need to create a callback like that, instantiate it, and pass the instance (not a string) to model.fit, so it becomes something like this:
trainingStopCallback = haltCallback()
model.fit(x_train, y_train,
          epochs=3000,
          batch_size=128,
          callbacks=[trainingStopCallback])
This way, fitting stops as soon as the loss drops to 0.05 or below (or whatever threshold you set when defining the callback).
Also, since this question was asked a long time ago and still had no answer for TensorFlow 2.0, here is the code snippet updated to TensorFlow 2.0 so everyone can easily find and use it in new projects.
import tensorflow as tf
# Generate dummy data
import numpy as np
x_train = np.random.random((1000, 20))
y_train = tf.keras.utils.to_categorical(
np.random.randint(10, size=(1000, 1)), num_classes=10)
x_test = np.random.random((100, 20))
y_test = tf.keras.utils.to_categorical(
np.random.randint(10, size=(100, 1)), num_classes=10)
model = tf.keras.models.Sequential()
# Dense(64) is a fully-connected layer with 64 hidden units.
# in the first layer, you must specify the expected input data shape:
# here, 20-dimensional vectors.
model.add(tf.keras.layers.Dense(64, activation='relu', input_dim=20))
model.add(tf.keras.layers.Dropout(0.5))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dropout(0.5))
model.add(tf.keras.layers.Dense(10, activation='softmax'))
class haltCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs={}):
        if logs.get('loss') <= 0.05:
            print("\n\n\nReached 0.05 loss value so cancelling training!\n\n\n")
            self.model.stop_training = True
trainingStopCallback = haltCallback()
sgd = tf.keras.optimizers.SGD(learning_rate=0.01, decay=1e-6, momentum=0.9, nesterov=True)
# 'loss' is not a valid entry in metrics; the loss is reported automatically
model.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])
model.fit(x_train, y_train,
          epochs=3000,
          batch_size=128,
          callbacks=[trainingStopCallback])
score = model.evaluate(x_test, y_test, batch_size=128)
Documentation here: EarlyStopping
keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=0, verbose=0, mode='auto', baseline=None)
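Note that EarlyStopping stops training when the monitored quantity stops improving, which is not the same as stopping at an absolute threshold; for the threshold behaviour, the custom callback above is the right tool. A minimal EarlyStopping usage sketch for comparison:
early_stop = keras.callbacks.EarlyStopping(monitor='loss', min_delta=0, patience=5, verbose=1)
model.fit(x_train, y_train, epochs=3000, batch_size=128, callbacks=[early_stop])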
I solved it by fitting one epoch at a time and checking the loss after each epoch (inspecting the history and setting model.stop_training after fit() has already returned would have no effect):
for epoch in range(100):
    history = model.fit(training1, training2, epochs=1, verbose=True)
    if history.history["loss"][-1] <= 1:
        print("Reached 1 loss value, cancelling training")
        break

Why did this Keras Python program fail?

I followed a tutorial on YouTube, accidentally left out model.add(Dense(6, activation='relu')), and got 36% accuracy. After I added this line, accuracy rose to 86%. Why did this happen?
This is the code
from sklearn.model_selection import train_test_split
import keras
from keras.models import Sequential
from keras.layers import Dense
import numpy as np
np.random.seed(3)
classifications = 3
dataset = np.loadtxt('wine.csv', delimiter=",")
X = dataset[:,1:14]
Y = dataset[:,0:1]
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.66,
                                                    random_state=5)
y_train = keras.utils.to_categorical(y_train-1, classifications)
y_test = keras.utils.to_categorical(y_test-1, classifications)
model = Sequential()
model.add(Dense(10, input_dim=13, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(6, activation='relu')) # This is the code I missed
model.add(Dense(6, activation='relu'))
model.add(Dense(4, activation='relu'))
model.add(Dense(2, activation='relu'))
model.add(Dense(classifications, activation='softmax'))
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=15, epochs=2500,
          validation_data=(x_test, y_test))
The number of layers is a hyperparameter, just like the learning rate and the number of neurons.
These play an important role in determining accuracy.
So in your case,
model.add(Dense(6, activation='relu'))
this layer played the key role.
We cannot easily interpret what exactly each layer is doing.
The best we can do is hyperparameter tuning to find the best combination of hyperparameters; a sketch of such a search follows.
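For instance, a minimal sketch of a search over the number and width of hidden layers (build_model is a hypothetical helper, not from the question; it assumes the imports and data split above):
def build_model(hidden_sizes):
    # Build a fresh model with the given hidden-layer widths
    model = Sequential()
    model.add(Dense(10, input_dim=13, activation='relu'))
    for size in hidden_sizes:
        model.add(Dense(size, activation='relu'))
    model.add(Dense(classifications, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model
# Try a few depth configurations and compare validation accuracy
for hidden_sizes in [(8, 4, 2), (8, 6, 4, 2), (8, 6, 6, 4, 2)]:
    model = build_model(hidden_sizes)
    history = model.fit(x_train, y_train, batch_size=15, epochs=100,
                        validation_data=(x_test, y_test), verbose=0)
    # The key may be 'val_acc' in older Keras versions
    print(hidden_sizes, history.history['val_accuracy'][-1])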
In my opinion, it may be the ratio of your training set to your test set: with test_size=0.66 you keep 66% of the data for testing, so training on the remaining third tends to underfit, and one dense layer more or less then produces a large swing in accuracy. Set test_size=0.2 and check again how much the missing layer changes the accuracy.
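A sketch of the suggested split, keeping 80% of the data for training:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2,
                                                    random_state=5)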
