Convert predicted sequence back to text in Keras? - python

I am using a simple RNN model in Keras to predict the category of simple text data, but I am unable to convert my predicted sequences into categories.
test_sequences = tok.texts_to_sequences(X)
test_sequences_matrix = sequence.pad_sequences(test_sequences,maxlen=max_len)
classes = model.predict(test_sequences_matrix)
I want to convert classes back into the format of X (sequence to text). Is there a function I can use to revert it? Thanks in advance.
I have seen this, but it did not solve my problem.
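The question can be read two ways: mapping each predicted probability row to a category name, or mapping a padded integer sequence back to words. The sketch below shows both in plain Python. The names `word_index`, `category_names`, and the sample values are hypothetical stand-ins for `tok.word_index`, the label order used during training, and the output of `model.predict`. A fitted Keras `Tokenizer` also provides `sequences_to_texts` for the second step.

```python
# Hypothetical stand-ins: in the question these would come from
# tok.word_index and model.predict(test_sequences_matrix).
word_index = {"good": 1, "bad": 2, "movie": 3}
category_names = ["negative", "positive"]  # assumed training label order
predictions = [[0.2, 0.8], [0.9, 0.1]]     # one probability row per sample


def decode_sequence(sequence, word_index):
    """Turn a padded integer sequence back into text (unknown ids and padding are skipped)."""
    index_word = {index: word for word, index in word_index.items()}
    return " ".join(index_word[i] for i in sequence if i in index_word)


def decode_categories(predictions, category_names):
    """Pick the highest-probability category for each prediction row."""
    return [category_names[row.index(max(row))] for row in predictions]


decoded_text = decode_sequence([0, 0, 1, 3], word_index)
decoded_labels = decode_categories(predictions, category_names)
print(decoded_text)
print(decoded_labels)
```

Which of the two helpers applies depends on whether the model's output layer produces class probabilities or token ids.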

Related

How to get predicted classes from y_pred

I have trained my model and saved it as a JSON file, then saved the weights as an h5 file and loaded them.
I then used the model in another script, feeding it some records to test its accuracy. What I want to do is:
Make the model output the predicted label so that I can compare it with the true label from the CSV file, but I don't know how to get the predicted values.
I tried this:
predict_x = loaded_model(X_test)
y_pred = np.argmax(predict_x, axis=1)
And then
for i in y_pred:
to loop through y_pred. I thought I would see my y classes, which range from 0 to 4, but instead I found numbers like 17, 22, 13…
Can anyone tell me what I should do?
Thank you in advance
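One possible cause, offered as an assumption rather than a diagnosis, is that the argmax is taken over an axis that is not the class axis (for example when the output has shape (samples, timesteps, classes)), or that the final layer has more units than there are classes. The sketch below uses plain Python lists as a stand-in for `np.argmax(predict_x, axis=-1)`; the values in `probabilities` and `true_labels` are made up for illustration.

```python
# Made-up model output: 3 samples, 5 classes (indices 0-4).
probabilities = [
    [0.10, 0.60, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.05, 0.05, 0.80],
    [0.70, 0.10, 0.10, 0.05, 0.05],
]
true_labels = [1, 4, 2]  # hypothetical labels read from the CSV file


def argmax(row):
    """Index of the largest value in a row (the predicted class)."""
    return max(range(len(row)), key=row.__getitem__)


# Every row should have one entry per class; if rows are longer,
# indices outside 0-4 (such as 17) can appear.
num_classes = 5
assert all(len(row) == num_classes for row in probabilities)

predicted_labels = [argmax(row) for row in probabilities]
correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
accuracy = correct / len(true_labels)
print(predicted_labels, accuracy)
```

Printing the shape of the model output before taking the argmax is a quick way to check which axis holds the classes.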

What is the correct way to convert a csv file with text to recordIO format?

I need to convert my dataset (which includes text columns) to RecordIO format. I tried the code below, but I am unable to fix the following error. Do I need to make further changes to my data format?
ValueError: Unsupported dtype object on array
Code:
import io
import sagemaker.amazon.common as smac
X = df[['Subject','Body']].to_numpy()
y = df[['Label']].to_numpy()
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, X, y)
buf.seek(0)
Dataset example:
Label    Subject   Body
label a  Test one  Test Body
label b  Test two  Test second
According to the documentation in "Common Data Formats for Training", your content type is associated with the algorithms in the following table:
ContentType                      Algorithm
application/x-recordio           Object Detection Algorithm
application/x-recordio-protobuf  Factorization Machines, K-Means, k-NN, Latent Dirichlet Allocation, Linear Learner, NTM, PCA, RCF, Sequence-to-Sequence
Looking at the guide in the documentation (Data conversion), the data should be passed as arrays of numbers, not strings.
This means that an encoder of some kind is needed (e.g. LabelEncoder for the labels, while an encoding/embedding algorithm would be needed for the remaining data). Depending on the result you want to achieve, you can choose from a variety of methods, such as one-hot encoding, binary encoding, one-of-k encoding, or more complex word/sentence embedding algorithms.
For example, for a text classification task with RFC/SVM, it is first necessary to encode the text with a more or less expressive embedding algorithm (e.g. fastText).
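As a rough illustration of the encoding step described above, the sketch below turns string labels into integers and text into fixed-length count vectors using only the standard library. The helper names and the bag-of-words choice are assumptions for illustration; in practice scikit-learn's LabelEncoder or an embedding model would fill these roles. The point is only that numeric arrays, rather than the original strings, are what the conversion step expects.

```python
# Rows copied from the dataset example: (Label, Subject, Body).
rows = [
    ("label a", "Test one", "Test Body"),
    ("label b", "Test two", "Test second"),
]


def encode_labels(labels):
    """Map each distinct label string to an integer id."""
    classes = sorted(set(labels))
    lookup = {name: idx for idx, name in enumerate(classes)}
    return [lookup[name] for name in labels], classes


def bag_of_words(texts):
    """Encode each text as a vector of word counts over a shared vocabulary."""
    vocab = sorted({word for text in texts for word in text.lower().split()})
    vectors = [[text.lower().split().count(word) for word in vocab] for text in texts]
    return vectors, vocab


y, classes = encode_labels([label for label, _, _ in rows])
X, vocab = bag_of_words([f"{subject} {body}" for _, subject, body in rows])
print(y)
print(vocab)
print(X)
```

The lists `X` and `y` now contain only numbers and could be converted to numeric NumPy arrays before being written out.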

LSTM for binary classification using multiple attributes

I haven't used neural networks for many years, so excuse my ignorance.
I was wondering what is the most appropriate way to train a LSTM model based on my dataset.
I have 3 attributes as follows:
Attribute 1: small int e.g., [123, 321, ...]
Attribute 2: text sequence ['cgtaatta', 'ggcctaaat', ... ]
Attribute 3: text sequence ['ttga', 'gattcgtt', ... ]
Class label: binary [0, 1, ...]
The length of each sample's attributes (2 or 3) is arbitrary; therefore I do not want to treat them as words but rather as sequences (that's why I want to use RNN/LSTM models).
Is it possible to have more than one (sequence) input to the LSTM model (are there examples)? Or should I concatenate them into one, e.g. input 1: ["123 cgtaatta ttga", 0]?
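You don't need to concatenate the inputs into one string. A Keras model can take several inputs, and their features can be merged inside the model with a layer such as tf.keras.layers.Concatenate. Note that tf.keras.layers.Flatten() takes a single input and flattens it without affecting the batch size; it does not combine separate inputs.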
Read more here: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Flatten
And here:
https://www.tensorflow.org/tutorials/structured_data/time_series#multi-step_dense
I'm not sure about the most appropriate way, since I wandered here looking for my own answers, but I do know you need to encode the text numerically (assign numerical identities to it), if applicable in your case.
Hope this helps
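The "numerical identities" mentioned above could look like the following: each nucleotide character is mapped to an integer and every sequence is padded to a common length, so that each text attribute becomes a fixed-shape integer array that a sequence model can consume. This is a minimal sketch; the character-to-integer mapping, the use of 0 as padding, and the choice to pad on the right are assumptions, not something the answer specifies.

```python
# Assumed mapping; 0 is reserved for padding.
CHAR_TO_INT = {"a": 1, "c": 2, "g": 3, "t": 4}


def encode(sequence):
    """Convert a nucleotide string into a list of integer ids."""
    return [CHAR_TO_INT[ch] for ch in sequence]


def pad(encoded, length):
    """Right-pad (or truncate) an encoded sequence to a fixed length."""
    return (encoded + [0] * length)[:length]


# Sample values taken from the question.
attribute_2 = ["cgtaatta", "ggcctaaat"]
attribute_3 = ["ttga", "gattcgtt"]

max_len_2 = max(len(s) for s in attribute_2)
max_len_3 = max(len(s) for s in attribute_3)

input_2 = [pad(encode(s), max_len_2) for s in attribute_2]
input_3 = [pad(encode(s), max_len_3) for s in attribute_3]
print(input_2)
print(input_3)
```

Keeping `input_2` and `input_3` as separate arrays leaves open the option of feeding each attribute to its own sequence branch.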

Keras Lambda Layer Before Embedding: Use to Convert Text to Integers

I currently have a keras model which uses an Embedding layer. Something like this:
input = tf.keras.layers.Input(shape=(20,), dtype='int32')
x = tf.keras.layers.Embedding(input_dim=1000,
                              output_dim=50,
                              input_length=20,
                              trainable=True,
                              embeddings_initializer='glorot_uniform',
                              mask_zero=False)(input)
This is great and works as expected. However, I want to be able to send text to my model, have it preprocess the text into integers, and continue normally.
Two issues:
1) The Keras docs say that Embedding layers can only be used as the first layer in a model: https://keras.io/layers/embeddings/
2) Even if I could add a Lambda layer before the Embedding, I'd need it to keep track of certain state (like a dictionary mapping specific words to integers). How might I go about this stateful preprocessing?
In short, I need to modify the underlying Tensorflow DAG, so when I save my model and upload to ML Engine, it'll be able to handle my sending it raw text.
Thanks!
Here are the first few layers of a model which uses a string input:
input = keras.layers.Input(shape=(1,), dtype="string", name='input_1')
lookup_table_op = tf.contrib.lookup.index_table_from_tensor(
    mapping=vocab_list,
    num_oov_buckets=num_oov_buckets,
    default_value=-1,
)
lambda_output = Lambda(lookup_table_op.lookup)(input)
emb_layer = Embedding(int(number_of_categories),int(number_of_categories**0.25))(lambda_output)
Then you can continue the model as you normally would after an embedding layer. This is working for me and the model trains fine from string inputs.
It is recommended that you do the string -> int conversion in some preprocessing step to speed up the training process. Then after the model is trained you create a second keras model that just converts string -> int and then combine the two models to get the full string -> target model.
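The separate string -> int preprocessing step suggested above might be sketched like this in plain Python. It mimics a vocabulary lookup with out-of-vocabulary (OOV) buckets. The bucket assignment via `zlib.crc32` is an assumption chosen for determinism and is not the hash TensorFlow uses, and `vocab_list` and `num_oov_buckets` carry made-up values here.

```python
import zlib

vocab_list = ["cat", "dog", "fish"]  # made-up vocabulary
num_oov_buckets = 2                  # made-up bucket count


def build_lookup(vocab):
    """Assign ids 0..len(vocab)-1 to the known words."""
    return {word: idx for idx, word in enumerate(vocab)}


def to_id(word, lookup, num_oov_buckets):
    """Return the word's id, or an id in one of the OOV buckets."""
    if word in lookup:
        return lookup[word]
    # Unknown words get a deterministic bucket after the vocabulary ids.
    bucket = zlib.crc32(word.encode("utf-8")) % num_oov_buckets
    return len(lookup) + bucket


lookup = build_lookup(vocab_list)
ids = [to_id(w, lookup, num_oov_buckets) for w in ["dog", "cat", "zebra"]]
print(ids)
```

Running this once over the training text and storing the integers avoids repeating the lookup on every training step.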

Show label probability/confidence in NLTK

I'm using the MaxEnt classifier from the Python NLTK library. For my dataset, I have many possible labels, and as expected, MaxEnt returns just one label. I have trained my dataset and get about 80% accuracy. I've also tested my model on unknown data items, and the results are good. However, for any given unknown input, I want to be able to print/display a ranking of all the possible labels based on some internal criteria MaxEnt used to select the one, such as confidence/probability. For example, suppose I had a,b,c as possible labels and I use MaxEnt.classify(input), I get currently one label, let's say c. However, I want to be able to view something like a (0.9), b(0.7), c(0.92), so I can see why c was selected, and possibly choose multiple labels based on those parameters. Apologies for my fuzzy terminology, I'm fairly new to NLP and machine learning.
Solution
Based on the accepted answer, here's a skeleton code example to demonstrate what I wanted and how it can be achieved. More classifier examples on the NLTK website.
import nltk

contents = read_data('mydataset.csv')  # read_data() is assumed to be user-defined
data_set = [(feature_sets(input), label) for (label, input) in contents]  # User-defined feature_sets() function
train_set, test_set = data_set[:1000], data_set[1000:]
labels = set(label for (input, label) in train_set)  # Unique labels seen in training
maxent = nltk.MaxentClassifier.train(train_set)
maxent.classify(feature_sets(new_input))  # Returns one label
multi_label = maxent.prob_classify(feature_sets(new_input))  # Returns a DictionaryProbDist object
for label in labels:
    print(label, multi_label.prob(label))  # Probability assigned to each label
Try prob_classify(input)
It returns a probability distribution object that gives the probability for each label; see the docs.
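To turn the per-label probabilities into the ranking the question asks for, the labels can be sorted by probability. In the sketch below a plain dict stands in for the values obtained from the distribution's `prob(label)` method; the numbers and the `threshold` parameter are hypothetical.

```python
# Hypothetical probabilities, one entry per label, standing in for
# the values returned by prob(label) on the classifier's distribution.
label_probs = {"a": 0.05, "b": 0.15, "c": 0.80}


def rank_labels(label_probs, threshold=0.0):
    """Return (label, probability) pairs, most probable first, at or above a threshold."""
    ranked = sorted(label_probs.items(), key=lambda item: item[1], reverse=True)
    return [(label, prob) for label, prob in ranked if prob >= threshold]


ranking = rank_labels(label_probs)
likely = rank_labels(label_probs, threshold=0.1)
for label, prob in ranking:
    print(f"{label} ({prob:.2f})")
```

A threshold like the one above is one simple way to pick several labels instead of only the top one.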
