How to get feature vector column length in Spark Pipeline


I am using a Pipeline object to run an ML task.
This is what my Pipeline object looks like:
jpsa_mlp.pipeline.getStages()
Out[244]:
[StringIndexer_479d82259c10308d0587,
Tokenizer_4c5ca5ea35544bb835cb,
StopWordsRemover_4641b68e77f00c8fbb91,
CountVectorizer_468c96c6c714b1000eef,
IDF_465eb809477c6c986ef9,
MultilayerPerceptronClassifier_4a67befe93b015d5bd07]
All the estimators and transformers inside this pipeline are created in class methods, JPSA being the class.
Now I want to add a method for hyperparameter tuning, so I use the following:
self.paramGrid = ParamGridBuilder()\
    .addGrid(self.pipeline.getStages()[5].layers, [len(self.pipeline.getStages()[3].vocab),10,3])\
    .addGrid(self.pipeline.getStages()[5].maxIter, [100,300])\
    .build()
The problem is that for a neural network classifier, one of the hyperparameters is the hidden layer size. The layers attribute of the MLP classifier requires the sizes of the input, hidden, and output layers. The input and output sizes are fixed (based on the data we have), so I wanted to set the input layer size to the size of my feature vector. However, I don't know that size, because the estimators inside the pipeline that create the feature vectors (CountVectorizer, IDF) have not been fit to the data yet.
The pipeline will only be fit to the data during cross-validation, by Spark's CrossValidator. Only then will I have a CountVectorizerModel from which to read the feature vector size.
If I had the CountVectorizer materialized, I could take the length of countVectorizerModel.vocabulary as the feature vector size and use it as the input layer value in the layers attribute of the MLP.
So how do I add hyperparameters for the layers of the MLP (both the hidden and input layer sizes)?

You can find out that information from your dataframe schema metadata.
Scala code:
val length = datasetAfterPipe.schema(datasetAfterPipe.schema.fieldIndex("columnName"))
  .metadata.getMetadata("ml_attr").getLong("num_attrs")

Since PySpark code was requested:
You can see the attributes by navigating the metadata: datasetAfterPipe.schema["features"].metadata["ml_attr"]
Here is sample output (xxx stands for all the features assembled into the features column, and the last entry is the size):
Out:
{'attrs': {'numeric': [{'idx': xxxxxxx }]}, 'num_attrs': 337}
So you index into the metadata:
lenFeatureVect = datasetAfterPipe.schema["features"].metadata["ml_attr"]["num_attrs"]
print('Len feature vector:', lenFeatureVect)
Out:
Len feature vector: 337
Note: if you have scaled features, you need to read the attribute info from the pre-scaled "features" column (assuming you scale after vectorizing; the caveat does not apply if you feed the original columns), since that column holds the feature vectors you feed into the scaling step of the Pipeline.
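A minimal sketch of how the number read from the metadata could feed the layers grid from the question. It is plain Python so it runs without Spark: the sample metadata dictionary mirrors the sample output above, and the helper names num_attrs_from_metadata and layer_options, the hidden sizes and the output size of 3 are assumptions made for illustration. Not every transformer is guaranteed to write ml_attr metadata, so the helper returns None when it is missing.

```python
from typing import List, Mapping, Optional, Sequence


def num_attrs_from_metadata(metadata: Mapping) -> Optional[int]:
    """Return the vector length stored under ml_attr/num_attrs, or None if absent."""
    ml_attr = metadata.get("ml_attr")
    if not ml_attr or "num_attrs" not in ml_attr:
        return None
    return int(ml_attr["num_attrs"])


def layer_options(input_size: int, hidden_sizes: Sequence[int],
                  output_size: int) -> List[List[int]]:
    """Build one [input, hidden, output] layer list per candidate hidden size."""
    return [[input_size, hidden, output_size] for hidden in hidden_sizes]


# Stand-in for datasetAfterPipe.schema["features"].metadata (assumed values).
sample_metadata = {"ml_attr": {"attrs": {"numeric": []}, "num_attrs": 337}}

n_features = num_attrs_from_metadata(sample_metadata)
grid_values = layer_options(n_features, hidden_sizes=[10, 20], output_size=3)
print(grid_values)  # [[337, 10, 3], [337, 20, 3]]
```

With Spark, the same helper would take datasetAfterPipe.schema["features"].metadata, and grid_values would be the list passed to ParamGridBuilder().addGrid(mlp.layers, grid_values), where each grid value is a complete layers list. This presumes the feature stages have been fit and applied once before the grid is built.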

Related

How can I make predictions from a trained model inside a Tensorflow input pipeline?

I am trying to train a model for emotion recognition, which uses one of VGG's layer's output as an input.
I could manage what I want by running the prediction in a first step, saving the extracted features and then using them as input to my network, but I am looking for a way to do the whole process at once.
The second model uses a concatenated array of feature maps as input (I am working with video data), so I am not able to simply wire it to the output of VGG.
I tried to use a map operation, as described in the tf.data.Dataset API documentation, like this:
import tensorflow as tf
def trimmed_vgg16():
    # VGG16 truncated at the third layer from the end (fc1).
    vgg16 = tf.keras.applications.vgg16.VGG16(input_shape=(224,224,3))
    trimmed = tf.keras.models.Model(inputs=vgg16.get_input_at(0),
                                    outputs=vgg16.layers[-3].get_output_at(0))
    return trimmed
vgg16 = trimmed_vgg16()
def _extract_vgg_features(images, labels):
    # batch_size is defined elsewhere.
    pred = vgg16.predict(images, batch_size=batch_size, steps=1)
    return pred, labels
dataset = ...  # load the dataset (image, label) as usual
dataset = dataset.map(_extract_vgg_features)
But I'm getting this error: Tensor Tensor("fc1/Relu:0", shape=(?, 4096), dtype=float32) is not an element of this graph, which is pretty explicit. I'm stuck here, as I don't see a good way of inserting the trained model into the same graph and getting predictions "on the fly".
Is there a clean way of doing this, or something similar?

You should be able to connect the layers by first creating the VGG16 model and then retrieving its output tensor as shown here. Afterward, you can use that tensor as the input to your own network.
vgg16 = tf.keras.applications.vgg16.VGG16(input_shape=(224,224,3))
network_input = vgg16.get_input_at(0)
vgg16_out = vgg16.layers[-3].get_output_at(0) # use this tensor as input to your own network

Keras Lambda Layer Before Embedding: Use to Convert Text to Integers

I currently have a keras model which uses an Embedding layer. Something like this:
input = tf.keras.layers.Input(shape=(20,), dtype='int32')
x = tf.keras.layers.Embedding(input_dim=1000,
                              output_dim=50,
                              input_length=20,
                              trainable=True,
                              embeddings_initializer='glorot_uniform',
                              mask_zero=False)(input)
This is great and works as expected. However, I want to be able to send text to my model, have it preprocess the text into integers, and continue normally.
Two issues:
1) The Keras docs say that Embedding layers can only be used as the first layer in a model: https://keras.io/layers/embeddings/
2) Even if I could add a Lambda layer before the Embedding, I'd need it to keep track of certain state (like a dictionary mapping specific words to integers). How might I go about this stateful preprocessing?
In short, I need to modify the underlying Tensorflow DAG, so when I save my model and upload to ML Engine, it'll be able to handle my sending it raw text.
Thanks!

Here are the first few layers of a model which uses a string input:
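import tensorflow as tf  # TF 1.x: tf.contrib was removed in TF 2
import keras
from keras.layers import Embedding, Lambda
# vocab_list, num_oov_buckets and number_of_categories are defined elsewhere.
input = keras.layers.Input(shape=(1,), dtype="string", name='input_1')
lookup_table_op = tf.contrib.lookup.index_table_from_tensor(
    mapping=vocab_list,
    num_oov_buckets=num_oov_buckets,
    default_value=-1,
)
# Map each input string to its integer id before the embedding.
lambda_output = Lambda(lookup_table_op.lookup)(input)
emb_layer = Embedding(int(number_of_categories),int(number_of_categories**0.25))(lambda_output)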
Then you can continue the model as you normally would after an embedding layer. This is working for me and the model trains fine from string inputs.
It is recommended that you do the string -> int conversion in some preprocessing step to speed up the training process. Then after the model is trained you create a second keras model that just converts string -> int and then combine the two models to get the full string -> target model.
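As a rough illustration of that string -> int preprocessing step, here is a plain-Python sketch of the kind of mapping a vocabulary lookup table performs: known tokens get their position in the vocabulary, unknown tokens go to one of the out-of-vocabulary buckets when there are any, and otherwise get the default value. The helper build_lookup and the sample vocabulary are hypothetical, and zlib.crc32 is only a stand-in hash, so bucket ids will not match the ones TensorFlow assigns.

```python
import zlib


def build_lookup(vocab_list, num_oov_buckets=0, default_value=-1):
    """Return a function mapping a token to an integer id."""
    index = {token: i for i, token in enumerate(vocab_list)}

    def lookup(token):
        if token in index:
            return index[token]
        if num_oov_buckets > 0:
            # Stand-in hash (assumption): TensorFlow uses its own fingerprint.
            bucket = zlib.crc32(token.encode("utf-8")) % num_oov_buckets
            return len(vocab_list) + bucket
        return default_value

    return lookup


vocab_list = ["happy", "sad", "angry"]  # hypothetical vocabulary
lookup = build_lookup(vocab_list, num_oov_buckets=2)

# "puzzled" is not in the vocabulary, so it lands in bucket id 3 or 4.
encoded = [lookup(token) for token in ["sad", "happy", "puzzled"]]
print(encoded)
```

Running the conversion ahead of time like this means the training data already holds integers, so the model used for training can start directly at the Embedding layer.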

Tensorflow WarmStartSettings embedding shape mismatch

I am using the new tf.estimator.WarmStartSettings to initialize my network from a previous checkpoint. I now want to run the same network on a new data source, with other vocabs to use for the embeddings.
This snippet from the documentation page of WarmStartSettings seems to describe my use case:
Warm-start all weights but the embedding parameters corresponding to
sc_vocab_file have a different vocab from the one used in the current
model:
vocab_info = ws_util.VocabInfo(
    new_vocab=sc_vocab_file.vocabulary_file,
    new_vocab_size=sc_vocab_file.vocabulary_size,
    num_oov_buckets=sc_vocab_file.num_oov_buckets,
    old_vocab="old_vocab.txt"
)
ws = WarmStartSettings(
    ckpt_to_initialize_from="/tmp",
    var_name_to_vocab_info={
        "input_layer/sc_vocab_file_embedding/embedding_weights": vocab_info
    })
tf.estimator.VocabInfo lets you specify the old and new vocabs with their respective sizes. However, when I try to use the WarmStartSettings as shown above with two vocabs of different sizes, I get the following error:
ValueError: Shape of variable input_layer/sc_vocab_file_embedding/embedding_weights
((1887, 30)) doesn't match with shape of tensor
input_layer/sc_vocab_file_embedding/embedding_weights ([537, 30]) from checkpoint reader.
Why does VocabInfo let you provide separate sizes for the vocabs if their sizes have to match anyway?

Hold out tensorflow 1.4 new dataset API

With the new dataset object, is there a way to divide a dataset into training and test datasets according to a certain ratio, to get a hold-out set? And what about k-fold cross-validation?
In my case I wrote all the data to a single TFRecord file and then imported it with tf.data.TFRecordDataset.
Now, for the hold-out, I'd like a way to split this dataset into two datasets by a ratio. I solved this with data.take() and data.skip(), but for the ratio I need the dataset's length, which is not graceful.
def split_dataset(dataset, ratio, n):
    count_train = (n*ratio)//100
    train = dataset.take(count_train)
    test = dataset.skip(count_train)
    return train,test
filenames = ["dataset_breast.tfrecords"]
dataset = tf.data.TFRecordDataset(filenames)
train_dataset, test_dataset = split_dataset(dataset, 80, 3360)
For k-fold, the only solution I have found is a scikit-learn workaround applied to the data before importing it with tf.data.TFRecordDataset.

I do not know of any feature like what you're describing. There are, of course, ways to achieve the functionality you're after. Here are two:
"Source" Placeholder
This one comes straight from the API docs. Though it is originally intended for TFRecordDataset, I imagine it could be adapted to other types. I'll copy/paste from the link:
filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...) # Parse the record into tensors.
dataset = dataset.repeat() # Repeat the input indefinitely.
dataset = dataset.batch(32)
iterator = dataset.make_initializable_iterator()
# You can feed the initializer with the appropriate filenames for the current
# phase of execution, e.g. training vs. validation.
# Initialize `iterator` with training data.
training_filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
sess.run(iterator.initializer, feed_dict={filenames: training_filenames})
# Initialize `iterator` with validation data.
validation_filenames = ["/var/data/validation1.tfrecord", ...]
sess.run(iterator.initializer, feed_dict={filenames: validation_filenames})
A little discussion: This works with just one training tfrecord and one validation tfrecord, too. So, you could split your data using sklearn.model_selection.train_test_split before writing the TFRecords. Write one TFRecord dedicated to training data, one dedicated to validation data. With sklearn, you can specify a ratio (or an absolute number).
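A minimal sketch of splitting before the TFRecords are written. It uses only the standard library in place of sklearn.model_selection.train_test_split, and it also covers the k-fold case from the question. The helper names holdout_split and k_fold_splits are hypothetical, and record_ids stands in for whatever list of examples you serialize (3360 of them, as in the question).

```python
import random


def holdout_split(records, train_percent, seed=0):
    """Shuffle the records and split them by an integer percentage."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = (len(shuffled) * train_percent) // 100
    return shuffled[:cut], shuffled[cut:]


def k_fold_splits(records, k):
    """Yield (training, validation) lists, one pair per fold."""
    records = list(records)
    folds = [records[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield training, validation


record_ids = list(range(3360))  # stand-in for the examples to serialize

train_ids, test_ids = holdout_split(record_ids, 80)
print(len(train_ids), len(test_ids))  # 2688 672

fold_pairs = list(k_fold_splits(record_ids, k=5))
print(len(fold_pairs), len(fold_pairs[0][1]))  # 5 672
```

Each resulting list can then be written to its own TFRecord file, and the filenames fed to the placeholder for the matching phase.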
Two Datasets
Exactly like the name says, forget the filenames = tf.placeholder. Just create two iterators, one for testing and one for training. I usually use TFRecords, but you're free to try another type of dataset. Typically, I put the get_next calls into a tf.cond on a boolean tf.placeholder. If you're especially interested in this method, I could provide a MWE. But, the source placeholder seems to be the preferred method (seeing as it's in the docs...).

CNTK & python: How to pass input data to the eval func?

With CNTK I have created a network with 2 input neurons and 1 output neuron.
A line in the training file looks like
|features 1.567518 2.609619 |labels 1.000000
Then the network was trained with BrainScript. Now I want to use the network for predicting values. For example: the input data is [1.82, 3.57]. What is the output of the net?
I have tried Python with the following code, but I am new to it and the code does not work. So my question is: how do I pass the input data [1.82, 3.57] to the eval function?
There are some hints on Stack Overflow, but they are too abstract for me.
Thank you.
import cntk as ct
import numpy as np
z = ct.load_model("LR_reg.dnn", ct.device.cpu())
input_data= np.array([1.82, 3.57], dtype=np.float32)
pred = z.eval({ z.arguments[0] : input_data })
print(pred)

Here's the most defensive way of doing it. CNTK can be forgiving if you omit some of this when the network is specified with V2 constructs. Not sure about a network that was created with V1 code.
Basically you need a pair of braces for each axis. Which axes exist in Brainscript? There's a batch axis, a sequence axis and then the static axes of your network. You have one dimensional data so that means the following should work:
input_data= np.array([[[1.82, 3.57]]], dtype=np.float32)
This specifies a batch of one sequence, of length one, containing one 1d vector of two elements. You can also try omitting the outermost braces and see if you are getting the same result.
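To make the "one pair of braces per axis" rule concrete, here is a small sketch that builds the input one axis at a time and checks the resulting shape. It uses plain nested lists so it runs without NumPy or CNTK; nested_shape is a hypothetical helper, and np.array(...).shape would report the same tuple for the same nesting.

```python
def nested_shape(data):
    """Return the shape of a regularly nested list, outermost axis first."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        data = data[0]
    return tuple(shape)


sample = [1.82, 3.57]         # static axis: one vector of two elements
one_sequence = [sample]       # sequence axis: a sequence of length one
one_batch = [one_sequence]    # batch axis: a batch of one sequence

print(nested_shape(sample))        # (2,)
print(nested_shape(one_sequence))  # (1, 2)
print(nested_shape(one_batch))     # (1, 1, 2)
```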
Update: based on more information from the comment below, we should not forget that the V1 code also saved the part of the network that computes things like loss and accuracy. If we provide only the features, CNTK will complain that the labels have not been provided. There are two ways to deal with this issue. One possibility is to provide some fake labels, so that the network can evaluate these auxiliary operations. Another possibility is to identify the prediction node and use that. If the prediction was called 'p' in V1, this Python code
p = z.find_by_name('p')
should create a CNTK function that only needs the features in order to compute the prediction.
