Analysing features of text in tensorflow

Analysing features of text in tensorflow - python

I am relatively new to TensorFlow and the idea of auto-encoders, but I was trying create an auto-encoder for text and I didn't want to write this code because it measures token by token.
array = array(shape=(2000,20))
model = keras.sequential()
model.add(Dense(20))
model.add(Dense(5))
model.add(Dense(20))
model.compile(optimizer = 'Adam', loss='mse')
model.fit(array, array, epochs = epochs)
This code isn't ideal because it tries to reconstruct the text piece by piece, which would be fine for images but not for text because it relies on what came before it. Is there a way to have a neural network compare the conceptual similarities of the output and input, similar to PCA with the components being the encoded layer. Or is there a way to store components of a text without an auto-encoder like in this video? https://www.youtube.com/watch?v=_ry6S-Dc2X8 http://codeparade.net/classes/
I want to make the conceptual difference between the output and the input be the loss instead of accuracy with tokens. He mentions something about PCA(principal component analysis) in the video.

I realized that instead of defining the loss based on if it got each token correct, I could have it guess a one hot encoded version of the sample, which shows that it can identify the sample without trying to spell. Like this
array = array(shape=(2000,20))
onehot = array(1-2000, shape = (2000, 2000))
model = keras.sequential()
model.add(Dense(20))
model.add(Dense(5))
model.add(Dense(20))
model.add(Dense(2000))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(array, onehot, epochs = epochs)

Related

Comprehending which inputs have the highest weight in a neuronal network

I am currently working on a Supervised Machine Learning Solution to categorize some data into two classes.
So far I have worked on a keras/tensorflow Python Scipt which seems to manage that just fine:
input_dim = len(data.columns) - 1
print(input_dim)
model = Sequential()
model.add(Dense(8, input_dim=input_dim, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(train_x, train_y, validation_split=0.33, epochs=1500, batch_size=1000, verbose=1)
The input Data I use is a csv data with 168 input features. As I was first running this script successfully I was very surprised to see that I actually got an accuracy of over 99% after only a couple hundred epochs of training. I didn't even bother to normalize the input data yet.
What I am trying to find out now is which of my 168 input features is responsible for such a high accuracy rate and which features dont take much of an effect while training.
Is there a way to check the weights of each input column to see which of them is being used most, respectively which make the most impact.

Answering your last question:
model.layers[0].get_weights()
However, unless there is an obviously dominating weight, it is unlikely that a single sample gives you good accuracy. For feature selection, try replacing some features of your input by their mean and check how the prediction fluctuates. Little-to-no fluctuation means that the feature is not important.
Also, please consider posting ML questions on https://datascience.stackexchange.com/

There is going to be a connection from each 'column' to each neuron in first layer. You could go two ways (apart from randomizing or dropping (equivalent to replacing with mean as suggested in the answer above) the columns values) about finding the relative importance of columns using the weights. Please keep in mind that these methods make sense only if you input standardized dataset
You could use L1 or L2 norm of each columns weight in the first layer
Say your input has 100 columns. You create a layer that dot products the input with a tensor (trainable) of size (100,). Now, you input the output of this layer to your sequential model. Your trained (100,) tensor is the relative importance of your columns

Why did I receive input error in a simple Keras Functional API?

I hope that together we can solve my problem, I will try to summarise it and paste in some codes.
At our research we are using TensorFlow/Keras for our DNN, to classify images (convolutional network). It was a quite simple Sequential model, but now we are trying to add more inputs, so I started to change the network to the Functional API (honestly, this is the first time that I used it). I recreated the original convolutional network and everything was working fine.
So I generated the necessary additional data into simple text files (one for each image) and these will be the second input in our model. This is were something went wrong, because I've got an error and now I am stuck with the solution.
To recreate the error, I made a new simple python file without the convolution part, just to try out the model's text file part. The input is 5000 text files, containing only a float number. After the preprocessing, we will have 4000 for train and additional 1000 for testing, both are stored in a numpy array. The train and test arrays were split through sklearn.model_selection.train_test_split.
In the original multi-channel network, I tried to concatenate the convolutional part with the text file and after it some dense layers are coming. But no concatenate here, just the train for the data stored in the text files.
Here are the shapes of the input and label arrays:
X_train.shape
(4000,)
y_train.shape
(4000, 5)
The - very - simple network:
inputA = Input(shape=(X_train.shape[0],))
x = Flatten()(inputA)
x = Dense(256, activation='relu')(x)
x = Dense(256, activation='relu')(x)
x = Dense(64, activation='relu')(x)
x = Dense(5, activation='softmax')(x)
x = Model(inputs=inputA, outputs = x)
Compile:
x.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
Model fit:
x.fit(X_train, y_train,
validation_data=(X_test, y_test),
batch_size=batch,
verbose=1,
epochs=epoch_number)
And the received error message:
ValueError: Error when checking input: expected input_8 to have shape (4000,) but got array with shape (1,)
The question is, what did I wrong? The previous codes are working just fine in the Sequential model, but here no. Could somebody help me solve this mystery?
Best wishes,
Tamas

It's because of this line:
inputA = Input(shape=(X_train.shape[0],))
Keras expects the number of features, not the number of samples. In your case, the input_shape=(1,).

Could not increase accuracy from a fixed threshold using Keras Dense layer ANN

I'm learning the simplest neural networks using Dense layers using Keras. I'm trying to implement face recognition on a relatively small dataset (In total ~250 images with 50 images per class).
I've downloaded the images from google images and resized them to 100 * 100 png files. Then I've read those files into a numpy array and also created a one hot label array for training my model.
Here is my code for processing the training data:
X, Y = [], []
feature_map = {
'Alia Bhatt': 0,
'Dipika Padukon': 1,
'Shahrukh khan': 2,
'amitabh bachchan': 3,
'ayushmann khurrana': 4
}
for each_dir in os.listdir('.'):
if os.path.isdir(each_dir):
for each_file in os.listdir(each_dir):
X.append(cv2.imread(os.path.join(each_dir, each_file), -1).reshape(1, -1))
Y.append(feature_map[os.path.basename(each_file).split('-')[0]])
X = np.squeeze(X)
X = X / 255.0 # normalize the training data
Y = np.array(Y)
Y = np.eye(5)[Y]
print (X.shape)
print (Y.shape)
This is printing (244, 40000) and (244, 5). Here is my model:
model = Sequential()
model.add(Dense(8000, input_dim = 40000, activation = 'relu'))
model.add(Dense(1200, activation = 'relu'))
model.add(Dense(700, activation = 'relu'))
model.add(Dense(100, activation = 'relu'))
model.add(Dense(5, activation = 'softmax'))
# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
model.fit(X, Y, epochs=25, batch_size=15)
When I train the model, It stuck at the accuracy 0.2172, which is almost the same as random predictions (0.20).
I've also tried to train mode with grayscale images but still not getting expected accuracy. Also tried with different network architectures by changing the number of hidden layers and neurons in hidden layers.
What am I missing here? Is my dataset too small? or am I missing any other technical detail?
For more details of code, here is my notebook: https://colab.research.google.com/drive/1hSVirKYO5NFH3VWtXfr1h6y0sxHjI5Ey

Two suggestions I can make:
Your data set is probably too small. If you are splitting training and validation at 80/20, that means you are only training on 200 images, which is probably too small. Try increasing your data set to see if results improve.
I would recommend adding Dropout to each layer of your network as your training set is so small. Your network is most likely over-fitting your training data set since it is so small, and Dropout is an easy way to help avoid this problem.
Let me know if these suggestions make a difference!

I agree that the dataset is too small, 50 instances of each person is probably not enough. You can use data augmentation with the keras ImageDataGenerator method to increase the number of images, and rewrite your numpy reshaping code as a pre-processing function for the generator. I also noticed that you haven't shuffled the data, so the network is likely predicting the first class for everything (which is maybe why the accuracy is near random chance).
If increasing the dataset size doesn't help, you'll probably have to play around with the learning rate for the Adam optimizer.

CNN on small dataset is overfiting

I want to classify pattern on image. My original image shape are 200 000*200 000 i reshape it to 96*96, pattern are still recognizable with human eyes. Pixel value are 0 or 1.
i'm using the following neural network.
train_X, test_X, train_Y, test_Y = train_test_split(cnn_mat, img_bin["Classification"], test_size = 0.2, random_state = 0)
class_weights = class_weight.compute_class_weight('balanced',
np.unique(train_Y),
train_Y)
train_Y_one_hot = to_categorical(train_Y)
test_Y_one_hot = to_categorical(test_Y)
train_X,valid_X,train_label,valid_label = train_test_split(train_X, train_Y_one_hot, test_size=0.2, random_state=13)
model = Sequential()
model.add(Conv2D(24,kernel_size=3,padding='same',activation='relu',
input_shape=(96,96,1)))
model.add(MaxPool2D())
model.add(Conv2D(48,kernel_size=3,padding='same',activation='relu'))
model.add(MaxPool2D())
model.add(Conv2D(64,kernel_size=3,padding='same',activation='relu'))
model.add(MaxPool2D())
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(16, activation='softmax'))
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
train = model.fit(train_X, train_label, batch_size=80,epochs=20,verbose=1,validation_data=(valid_X, valid_label),class_weight=class_weights)
I have already run some experiment to find a "good" number of hidden layer and fully connected layer. it's probably not the most optimal architecture since my computer is slow, i just ran different model once and selected best one with matrix confusion, i didn't use cross validation,I didn't try more complex architecture since my number of data is small, i have read small architecture are the best, is it worth to try more complex architecture?
here the result with 5 and 12 epoch, bach size 80. This is the confusion matrix for my test set
As you can see it's look like i'm overfiting. When i only run 5 epoch, most of the class are assigned to class 0; With more epoch, class 0 is less important but classification is still bad
I added 0.8 dropout after each convolutional layer
e.g
model.add(Conv2D(48,kernel_size=3,padding='same',activation='relu'))
model.add(MaxPool2D())
model.add(Dropout(0.8))
model.add(Conv2D(64,kernel_size=3,padding='same',activation='relu'))
model.add(MaxPool2D())
model.add(Dropout(0.8))
With drop out, 95% of my image are classified in class 0.
I tryed image augmentation; i made rotation of all my training image, still used weighted activation function, result didnt improve. Should i try to augment only class with small number of image? Most of the thing i read says to augment all the dataset...
To resume my question are:
Should i try more complex model?
Is it usefull to do image augmentation only on unrepresented class? then should i still use weight class (i guess no)?
Should i have hope to find a "good" model with cnn when we see the size of my dataset?

I think according to the imbalanced data, it is better to create a custom data generator for your model so that each of it's generated data batch, contains at least one sample from each class. And also it is better to use Dropout layer after each dense layer instead of conv layer. For data augmentation it is better to at least use combination of rotate, horizontal flip and vertical flip. there are some other approaches for data augmentation like using GAN network or random pixel replacement.
For Gan you can check This SO post
For using Gan as data augmenter you can read This Article.
For combination of pixel level augmentation and GAN pixel level data augmentation

What I used - in a different setting - was to upsample my data with ADASYN. This algorithm calculates the amount of new data required to balance your classes, and then takes available data to sample novel examples.
There is an implementation for Python. Otherwise, you also have very little data. SVMs are good performing even with little data. You might want to try them or other image classification algorithms depending where the expected pattern is always at the same position, or varies. Then you could also try the Viola–Jones object detection framework.

Supervised Extractive Text Summarization

I want to extract potential sentences from news articles which can be part of article summary.
Upon spending some time, I found out that this can be achieved in two ways,
Extractive Summarization (Extracting sentences from text and clubbing them)
Abstractive Summarization (internal language representation to generate more human-like summaries)
Reference: rare-technologies.com
I followed abigailsee's Get To The Point: Summarization with Pointer-Generator Networks for summarization which was producing good results with the pre-trained model but it was abstractive.
The Problem:
Most of the extractive summarizers that I have looked so far(PyTeaser, PyTextRank and Gensim) are not based on Supervised learning but on Naive Bayes classifier, tf–idf, POS-tagging, sentence ranking based on keyword-frequency, position etc., which don't require any training.
Few things that I have tried so far to extract potential summary sentences.
Get all sentences of articles and label summary sentences as 1 and 0 for all others
Clean up the text and apply stop word filters
Vectorize a text corpus using Tokenizer from keras.preprocessing.text import Tokenizer with Vocabulary size of 20000 and pad all sequences to average length of all sentences.
Build a Sqequential keras model a train it.
model_lstm = Sequential()
model_lstm.add(Embedding(20000, 100, input_length=sentence_avg_length))
model_lstm.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model_lstm.add(Dense(1, activation='sigmoid'))
model_lstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
This is giving very low accuracy ~0.2
I think this is because the above model is more suitable for positive/negative sentences rather than summary/non-summary sentences classification.
Any guidance on approach to solve this problem would be appreciated.

I think this is because the above model is more suitable for
positive/negative sentences rather than summary/non-summary sentences
classification.
That's right. The above model is used for binary classification, not text summarization. If you notice, the output (Dense(1, activation='sigmoid')) only gives you a score between 0-1 while in text summarization we need a model that generates a sequence of tokens.
What should I do?
The dominant idea to tackle this problem is encoder-decoder (also known as seq2seq) models. There is a nice tutorial on Keras repository which used for Machine translation but it is fairly easy to adapt it for text summarization.
The main part of the code is:
from keras.models import Model
from keras.layers import Input, LSTM, Dense
# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]
# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
batch_size=batch_size,
epochs=epochs,
validation_split=0.2)
Based on the above implementation, it is necessary to pass encoder_input_data, decoder_input_data and decoder_target_data to model.fit() which respectively are input text and summarize version of the text.
Note that, decoder_input_data and decoder_target_data are the same things except that decoder_target_data is one token ahead of decoder_input_data.
This is giving very low accuracy ~0.2
I think this is because the above model is more suitable for
positive/negative sentences rather than summary/non-summary sentences
classification.
The low accuracy performance caused by various reasons including small training size, overfitting, underfitting and etc.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Analysing features of text in tensorflow - python

Related

Comprehending which inputs have the highest weight in a neuronal network

Why did I receive input error in a simple Keras Functional API?

Could not increase accuracy from a fixed threshold using Keras Dense layer ANN

CNN on small dataset is overfiting

Supervised Extractive Text Summarization

Categories

Resources