How to split data into test and train using TensorFlow - Python

I'm very new to TensorFlow. I've attended an online course, but I still have many questions related to data pre-processing. I would really appreciate it if someone could help me out!
My goal is to train a model that classifies Portuguese nouns into two gender categories (feminine and masculine) based on their internal structure. So, for this, I have a file containing about 4300 nouns and their categories (F and M labels).
First question:
I opened the nouns file, tokenized the words, and then padded them. I put the labels in a separate file. The labels file is a txt list containing the labels 'f' and 'm'. I converted them into 0 and 1 integers and then into a NumPy array. I also converted the padded nouns into a NumPy array. Is that correct?
What is strange is that I set the number of epochs to 100, but the program keeps training…
Second question:
How can I separate my training data and labels into test and test_labels?
My code so far is below:
from collections import defaultdict
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize, wordpunct_tokenize
import re
import os
import sys
from pathlib import Path
import numpy as np
import tensorflow as tf
while True:
    try:
        file_to_open = Path(input("Please, insert your file path: "))
        with open(file_to_open, 'r', encoding="utf-8") as f:
            words = f.read()
        break
    except FileNotFoundError:
        print("\nFile not found. Better try again")
    except IsADirectoryError:
        print("\nIncorrect directory path. Try again")
corpus = words.split('\n')

labels = []
new_labels = []
nouns = []
for i in corpus:
    if i == '0':
        labels.append(i)
    elif i == '1':
        labels.append(i)
    else:
        nouns.append(i)

for x in labels:
    new_labels.append(int(x))
training_labels = np.array(new_labels)

training_nouns = []
for w in nouns:
    a = list(w)
    b = ' '.join([str(elem) for elem in a]) + ',' + ' '
    training_nouns.append(b)

vocab_size = 10000
embedding_dim = 16
max_length = 120
trunc_type = 'post'
oov_tok = "<OOV>"

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_nouns)
word_index = tokenizer.word_index
nouns_sequences = tokenizer.texts_to_sequences(training_nouns)
padded = pad_sequences(nouns_sequences, maxlen=max_length)
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(36, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
training_padded = np.array(padded)
num_epochs = 150
model.fit(training_padded, training_labels, epochs=num_epochs)

If you don't have to do the split in TensorFlow itself, you can use scikit-learn's train_test_split function like this (and then continue with TensorFlow):
from sklearn.model_selection import train_test_split

train_data, test_data, train_labels, test_labels = train_test_split(YOUR_DATA, YOUR_LABELS)
See the scikit-learn documentation for more information.
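As a rough sketch applied to the variables in your script (training_padded, training_labels, num_epochs and model from above; the 80/20 split and random_state are illustrative choices, not requirements), it could look like this:
from sklearn.model_selection import train_test_split

# Hold out 20% of the padded nouns and labels for testing (values are illustrative).
train_data, test_data, train_labels, test_labels = train_test_split(
    training_padded, training_labels, test_size=0.2, random_state=42)

# Train on the training split and evaluate on the held-out test split.
model.fit(train_data, train_labels, epochs=num_epochs,
          validation_data=(test_data, test_labels))
model.evaluate(test_data, test_labels)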

Related

Sentiment analysis model returns identical output for any input

So I made this sentiment analysis model, and it works just fine in the training-testing script. I built a simple interface using Streamlit for my saved model, but it always returns identical scores for any input text. Plus, it somehow returns many scores when it should only return a single score for a single input.
Here is my code:
import streamlit as st
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
import pickle

with open("tokenizer.pkl", "rb") as f:
    tokenizer = pickle.load(f)

#st.cache(allow_output_mutation=True)
def load_model():
    model = tf.keras.models.load_model('C:/Users/k/Downloads/test/model_final.h5')
    return model

if __name__ == '__main__':
    model = load_model()
    st.title('Analisis Sentimen')
    txt = st.text_input('masukkan teks')
    if not txt:
        st.warning("masukkan teks sebelum lanjut")
        st.stop()
    else:
        text = txt
        text = tokenizer.texts_to_sequences(text)
        text = pad_sequences(text)
        prediction = model.predict(text)
        st.title('sentimen: ')
        if (prediction > 0.5).any():
            st.write(prediction)
            st.write('positif')
        else:
            st.write(prediction)
            st.write('negatif')
Here are some snaps from when I try it with Streamlit (screenshots omitted).
ML is largely trial and error. I modified the model as below and got good results. Kindly try changing the model.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam

embedding_size = 50

model = Sequential()
model.add(Embedding(input_dim=1500, output_dim=embedding_size, input_length=max_tokens, name='embedding_layer'))
model.add(LSTM(16))
model.add(Dropout(0.8))
model.add(Dense(16))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

optimizer = Adam()
model.compile(loss=tf.keras.losses.BinaryCrossentropy(), optimizer=optimizer, metrics=['accuracy'])
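As a rough usage sketch to go with the modified model (x_train_pad, y_train, max_tokens, and tokenizer are hypothetical stand-ins for your own padded training data, labels, sequence length, and the tokenizer fitted on the training texts):
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical training call; x_train_pad and y_train stand in for your own data.
model.fit(x_train_pad, y_train, epochs=10, validation_split=0.1)

# Predicting a single text: wrap the raw string in a list so texts_to_sequences
# treats it as one document instead of a sequence of characters.
sample = "produk ini bagus sekali"
seq = tokenizer.texts_to_sequences([sample])
pad = pad_sequences(seq, maxlen=max_tokens)
score = model.predict(pad)[0][0]   # one sigmoid score between 0 and 1
print("positif" if score > 0.5 else "negatif", score)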
Please find the working code here. Thank you!

Failed to convert a NumPy array to a Tensor (Unsupported object type list)

I am getting this error all the time. Any idea?
I have added all the necessary libraries. Is anything wrong with TensorFlow? I am not able to understand it. I am trying to create a chatbot. The JSON file is OK. I have checked some videos, and I guess I have to change the train_x and train_y data.
import random
import json
import numpy as np
import pandas
import nltk
import pickle
from nltk.stem import WordNetLemmatizer
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD

lm = WordNetLemmatizer()
intents = json.loads(open('intents.json').read())

words = []
classes = []
documents = []
ignore = ['?', '!', '.', ',']

for intent in intents['intents']:
    for pattern in intent['patterns']:
        word_list = nltk.word_tokenize(pattern)
        words.extend(word_list)
        documents.append((word_list, intent['tag']))
        if intent['tag'] not in classes:
            classes.append(intent['tag'])

words = [lm.lemmatize(word) for word in words if word not in ignore]
words = sorted(set(words))
classes = sorted(set(classes))

pickle.dump(words, open('words.pkl', 'wb'))
pickle.dump(words, open('classes.pkl', 'wb'))

training = []
output = [0] * len(classes)
for document in documents:
    area = []
    word_pattern = document[0]
    word_pattern = [lm.lemmatize(word.lower()) for word in word_pattern]
    for word in words:
        if word in word_pattern:
            area.append(1) if word in word_pattern else area.append(0)
    output_rw = list(output)
    output_rw[classes.index(document[1])] = 1
    training.append([area, output_rw])

random.shuffle(training)
training = np.array(training)

train_x = list(training[:, 0])
train_y = list(training[:, 1])

model = Sequential()
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))

sgd = SGD(lr=0.01, decay=0.000001, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
model.fit(np.array(train_x), np.array(train_y), epochs=200, batch_size=5, verbose=1)
model.save('chatbot model.model')
print("Done")
Only 1's are appended to area, due to a faulty if condition in your code, resulting in a variable number of elements in each row of train_x. Since an array must have the same number of elements in all of its rows, the list cannot be converted to an np.array. Kindly change the code as follows.
for word in words:
    # Remove the line: if word in word_pattern:
    area.append(1) if word in word_pattern else area.append(0)
Please find the working code here. Thank you!
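As a small, self-contained illustration of why this matters (the vocabulary and pattern below are made-up examples, not taken from your intents file), the corrected loop gives every row exactly one entry per vocabulary word, so np.array() gets a rectangular shape:
import numpy as np

words = ['hello', 'how', 'are', 'you', 'bye']   # hypothetical vocabulary
word_pattern = ['hello', 'you']                 # hypothetical tokenized pattern

area = []
for word in words:
    # 1 if the vocabulary word occurs in the pattern, else 0
    area.append(1) if word in word_pattern else area.append(0)

print(area)                      # [1, 0, 0, 1, 0]
print(len(area) == len(words))   # True: every row has the same length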

How to use tensorflow_io's IODataset?

I'm trying to write a program that uses malicious pcap files as datasets and predicts whether other pcap files have malicious packets in them.
After some digging through the TensorFlow documentation, I found tensorflow_io, but I can't figure out how to use the dataset to create a model and predict with it.
Here's my code:
%tensorflow_version 2.x
import tensorflow as tf
import numpy as np
from tensorflow import keras

try:
    import tensorflow_io as tfio
    import tensorflow_datasets as tfds
except:
    !pip install tensorflow-io
    !pip install tensorflow-datasets
    import tensorflow_io as tfio
    import tensorflow_datasets as tfds

# print(tf.__version__)
dataset = tfio.IODataset.from_pcap("dataset.pcap")
print(dataset)  # <PcapIODataset shapes: ((), ()), types: (tf.float64, tf.string)>
(Using Google Colab)
I've tried looking for answers online but couldn't find any.
I downloaded two pcap files and concatenated them, then extracted the packet_timestamp and packet_data. Please preprocess the packet_data as per your requirement. If you have labels to add, you can add them to the training dataset (in the model example below, I have created dummy labels of all zeros and added them as a column). If they are in a file, you can zip them with the pcap features. Passing a dataset of (feature, label) pairs is all that's needed for Model.fit and Model.evaluate:
Below is an example of reading the packet_data; maybe you can modify it along the lines of: if packet_data is valid then label = valid, else malicious.
%tensorflow_version 2.x
import tensorflow as tf
import tensorflow_io as tfio
import numpy as np

# Create an IODataset from a pcap file
first_file = tfio.IODataset.from_pcap('/content/fuzz-2006-06-26-2594.pcap')
second_file = tfio.IODataset.from_pcap(['/content/fuzz-2006-08-27-19853.pcap'])

# Concatenate the read files
feature = first_file.concatenate(second_file)

# Lists for pcap
packet_timestamp_list = []
packet_data_list = []
# Some dummy labels
labels = []

packets_total = 0
for v in feature:
    (packet_timestamp, packet_data) = v
    packet_timestamp_list.append(packet_timestamp.numpy())
    packet_data_list.append(packet_data.numpy())
    labels.append(0)
    if packets_total == 0:
        assert np.isclose(
            packet_timestamp.numpy()[0], 1084443427.311224, rtol=1e-15
        )  # we know this is the correct value in the test pcap file
        assert (
            len(packet_data.numpy()[0]) == 62
        )  # we know this is the correct packet data buffer length in the test pcap file
    packets_total += 1

assert (
    packets_total == 43
)  # we know this is the correct number of packets in the test pcap file
Below is an example of using it in a model. The model won't work as-is because I have not handled the packet_data, which is of string type. Do the preprocessing as explained, per your requirement, and use it in the model.
%tensorflow_version 2.x
import tensorflow as tf
import tensorflow_io as tfio
import numpy as np

# Create an IODataset from a pcap file
first_file = tfio.IODataset.from_pcap('/content/fuzz-2006-06-26-2594.pcap')
second_file = tfio.IODataset.from_pcap(['/content/fuzz-2006-08-27-19853.pcap'])

# Concatenate the read files
feature = first_file.concatenate(second_file)

# Lists for pcap
packet_timestamp = []
packet_data = []
# Some dummy labels
labels = []

# Add 0 as label. You can use your actual labels here
for v in feature:
    (timestamp, data) = v
    packet_timestamp.append(timestamp.numpy())
    packet_data.append(data.numpy())
    labels.append(0)

## Do the preprocessing of packet_data here

# Add labels to the training data
# Preprocess the packet_data to convert string to meaningful value and use here
train_ds = tf.data.Dataset.from_tensor_slices(((packet_timestamp, packet_data), labels))

# Set the batch size
train_ds = train_ds.shuffle(5000).batch(32)

##### PROGRAM WILL RUN SUCCESSFULLY TILL HERE. TO USE IN THE MODEL, DO THE PREPROCESSING OF PACKET DATA AS EXPLAINED #####

# A simple model definition
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(100),
    tf.keras.layers.Dense(10)
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

model.fit(train_ds, epochs=2)
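Since the packet_data preprocessing is left open above, below is a minimal, hedged sketch of one possible approach: truncating or padding each packet's raw bytes to a fixed length and scaling them, so the Dense layers receive numeric input. The 100-byte cut-off is an arbitrary illustrative choice, not anything required by tensorflow_io, and this train_ds replaces the string-typed one built above.
MAX_BYTES = 100  # arbitrary illustrative cut-off

def bytes_to_features(raw_bytes):
    # raw_bytes is one packet_data value (a Python bytes object from .numpy())
    values = list(raw_bytes[:MAX_BYTES])               # first MAX_BYTES byte values
    values += [0] * (MAX_BYTES - len(values))          # pad short packets with zeros
    return np.array(values, dtype=np.float32) / 255.0  # scale to [0, 1]

packet_features = np.stack([bytes_to_features(d) for d in packet_data])
train_ds = tf.data.Dataset.from_tensor_slices((packet_features, labels))
train_ds = train_ds.shuffle(5000).batch(32)
model.fit(train_ds, epochs=2)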
Hope this answers your question. Happy Learning.

Word embedding example in Keras predicts different results on each run

I am following the pretrained_word_embeddings example and am saving the model using the following piece of code:
print('Saving model to disk ...')
model.save('/home/data/pretrained-model.h5')
I am then loading the pretrained model using
pretrained_model = load_model('/home/data/pretrained-model.h5')
Later, the following piece of code is used for predicting on a different text altogether:
predict_texts = []  # list of text samples
for predict_name in sorted(os.listdir(PREDICT_TEXT_DATA_DIR)):
    predict_path = os.path.join(PREDICT_TEXT_DATA_DIR, predict_name)
    if os.path.isdir(predict_path):
        for predict_fname in sorted(os.listdir(predict_path)):
            if predict_fname.isdigit():
                predict_fpath = os.path.join(predict_path, predict_fname)
                if sys.version_info < (3,):
                    f = open(predict_fpath)
                else:
                    f = open(predict_fpath, encoding='latin-1')
                predict_text = f.read()
                i = predict_text.find('\n\n')  # skip header
                if 0 < i:
                    predict_text = predict_text[i:]
                predict_texts.append(predict_text)
                f.close()

print('Found %s texts.' % len(predict_texts))

tokenizer.fit_on_texts(predict_texts)
predict_sequences = tokenizer.texts_to_sequences(predict_texts)
predict_data = pad_sequences(predict_sequences, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of predict data tensor:', predict_data.shape)

x_predict = predict_data
y_predict = pretrained_model.predict(x_predict)
max_val = np.argmax(y_predict)
print('Category it belongs to : ', max_val)
The problem I am facing now is that each time I run the above piece of code, max_val is a different value.
How do I make the predictions consistent, please?
I think you should predict one text at a time, not merge all the texts from all the files.
The following code, which I tested, is OK:
from __future__ import print_function
import os
import sys
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Dense, Input, GlobalMaxPooling1D
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Model
from keras.models import load_model
from keras.preprocessing.text import text_to_word_sequence
MAX_SEQUENCE_LENGTH = 1000
MAX_NB_WORDS = 20000
EMBEDDING_DIM = 100
model = load_model('embedding.h5')
PREDICT_TEXT_DATA_DIR = 'predict_data'
predict_path = os.path.join(PREDICT_TEXT_DATA_DIR, '1.txt')
f = open(predict_path, encoding='utf-8')
predict_text = f.read()
f.close()
texts=[predict_text]
# finally, vectorize the text samples into a 2D integer tensor
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
x_predict = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of predict data tensor:', x_predict.shape)
y_predict = model.predict(x_predict)
max_val = np.argmax(y_predict)
print('Category it belongs to : ',max_val)
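One further suggestion, beyond the original example: fitting a fresh Tokenizer on the prediction text gives it its own word-to-index mapping, which generally will not match the mapping the model was trained with, and that alone can change the predicted category between runs. A hedged sketch of one way around this (the file name tokenizer.pkl is hypothetical) is to persist the tokenizer fitted on the training data and reuse it at prediction time:
import pickle

# At training time, after tokenizer.fit_on_texts(training_texts):
with open('tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)

# At prediction time, load the same tokenizer instead of fitting a new one:
with open('tokenizer.pkl', 'rb') as f:
    tokenizer = pickle.load(f)

sequences = tokenizer.texts_to_sequences(texts)   # texts = [predict_text] as above
x_predict = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)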

LSTM error in Keras (Python)

Good morning. I'm trying to train an LSTM to classify spam and not-spam, and I came across the following error:
ValueError: Input 0 is incompatible with layer lstm_1: expected ndim = 3, found ndim = 4
Can someone help me understand where the problem is?
My code:
import sys
import pandas as pd
import numpy as np
import math
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from sklearn.feature_extraction.text import CountVectorizer

if __name__ == "__main__":
    np.random.seed(7)

    with open('SMSSpamCollection') as file:
        dataset = [[x.split('\t')[0], x.split('\t')[1]] for x in [line.strip() for line in file]]

    data = np.array([dat[1] for dat in dataset])
    labels = np.array([dat[0] for dat in dataset])

    dataVectorizer = CountVectorizer(analyzer="word",
                                     tokenizer=None,
                                     preprocessor=None,
                                     stop_words=None,
                                     max_features=5000)
    labelVectorizer = CountVectorizer(analyzer="word",
                                      tokenizer=None,
                                      preprocessor=None,
                                      stop_words=None,
                                      max_features=5000)

    data = dataVectorizer.fit_transform(data).toarray()
    labels = labelVectorizer.fit_transform(labels).toarray()
    vocab = labelVectorizer.get_feature_names()
    print(vocab)
    print(data)
    print(labels)

    data = np.reshape(data, (data.shape[0], 1, data.shape[1]))

    input_dim = data.shape
    tam = len(data[0])
    print(data.shape)
    print(tam)

    model = Sequential()
    model.add(LSTM(tam, input_shape=input_dim))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    model.fit(data, labels, epochs=100, batch_size=1, verbose=2)
I tried adding another dimension to the data array, but also with no result.
My file SMSSpamCollection:
ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham Ok lar... Joking wif u oni...
spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham U dun say so early hor... U c already then say...
ham Nah I don't think he goes to usf, he lives around here though
spam FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
ham Even my brother is not like to speak with me. They treat me like aids patent.
...
Thanks!
The problem lies in the fact that you are adding an additional dimension connected with the samples. Try:
input_dim = (data.shape[1], data.shape[2])
This should work.
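To make the fix concrete, here is a hedged sketch of how the shapes line up, using the variable names from the question; Keras expects input_shape to leave out the samples (batch) dimension:
# After the reshape, data has shape (samples, 1, features).
# input_shape must exclude the samples dimension, so pass
# (timesteps, features) = (data.shape[1], data.shape[2]) instead of data.shape.
input_dim = (data.shape[1], data.shape[2])   # e.g. (1, 5000)

model = Sequential()
model.add(LSTM(tam, input_shape=input_dim))  # tam as defined in the question
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')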
