I'd like to compare the performance of the following types of CNNs on two different large image datasets. The goal is to measure the similarity between two images, neither of which has been seen during training. I have access to 2 GPUs and 16 CPU cores.
Triplet CNN (Input: Three images, Label: encoded in position)
Siamese CNN (Input: Two images, Label: one binary label)
Softmax CNN for Feature Learning (Input: One image, Label: one integer label)
For the softmax network I can store the data in a binary format (label and image stored sequentially) and then read it with a TensorFlow reader.
To use the same method for the triplet and Siamese networks, I'd have to generate the combinations in advance and store them to disk. That would result in a big overhead, both in the time it takes to create the file and in disk space. How can it be done on the fly?
Another easy option would be to use feed_dict, but this would be slow. The problem would therefore be solved if it were possible to run the same function I'd use for feed_dict in parallel and convert the result to a TensorFlow tensor as a last step. But as far as I know such a conversion does not exist, so the files have to be read with a TensorFlow reader in the first place and the whole process done with TensorFlow ops. Is this correct?
Short answer: do the pair/triplet creation online with numpy. There is no need to convert the result to a tensor, since feed_dict already accepts numpy arrays.
The best option would be to use tf.nn.embedding_lookup() on already existing batches, in combination with itertools to create the indices of the pairs. For a naive, non-optimal solution you can look at the gen_batches_siamese.py script in my GitHub repository, where I reimplemented the Caffe siamese example.
Obviously this will be less efficient than using TensorFlow queues, but my advice would be to try this baseline first before moving to the pure TensorFlow solution.
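As a minimal sketch of the online approach, assuming the images and labels are already loaded as numpy arrays and that the placeholder names (left_ph, right_ph, label_ph) are hypothetical:

import numpy as np

def gen_siamese_batch(images, labels, batch_size):
    # Build (left, right, same-label) pairs online; nothing is written to disk.
    n = len(images)
    while True:
        left_idx = np.random.randint(0, n, size=batch_size)
        right_idx = np.random.randint(0, n, size=batch_size)
        pair_labels = (labels[left_idx] == labels[right_idx]).astype(np.float32)
        yield images[left_idx], images[right_idx], pair_labels

# feed_dict accepts the numpy arrays directly:
# left, right, y = next(gen_siamese_batch(train_images, train_labels, 64))
# sess.run(train_op, feed_dict={left_ph: left, right_ph: right, label_ph: y})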
I'm building a GAN in TensorFlow for image deblurring; it's an implementation of DeblurGANv2. I set up the GAN so that it has two inputs: a batch of blurred images and a batch of sharp images. Following this, I designed the input as a Python dictionary with two keys, ['sharp', 'blur'], each holding a tensor of shape [batch_size, 512, 512, 3]. This makes it easy to feed the blurred batch to the generator, and then feed the generator's output together with the sharp batch to the discriminator.
Based on these requirements, I created a tf.data.Dataset that outputs exactly that: a dict containing the two tensors, each with its own batch dimension. This fits my GAN implementation perfectly, and everything works fine and smoothly.
So keep in mind: my input is not a tensor but a Python dict with no outer batch dimension. This will be relevant for explaining my problem later.
Recently, I decided to add support for distributed training using TensorFlow Distribution Strategies. This feature of TensorFlow distributes training over multiple devices, even across multiple machines. Some implementations, for example MirroredStrategy, take the input tensor, split it into equal parts, and feed each slice to a different device: with a batch size of 16 and 4 GPUs, each GPU ends up with a local batch of 4 datapoints. After that there is some magic to aggregate the results and other details that are not relevant to my problem.
As you may already notice, it is critical for distribution strategies to receive a tensor as input, or at least some input with an outer batch dimension, and what I have is a Python dict whose batch dimension lives inside the dictionary's tensor values. This is a huge problem: my current implementation is not compatible with distributed training.
I have been looking for workarounds but can't quite wrap my head around this. Maybe I could make the input one big tensor of shape [batch_size, 2, 512, 512, 3] and slice it? I'm not sure, this just came to mind. In any case it feels ambiguous: I can't differentiate the two inputs, at least not with the clarity of the dictionary keys. Edit: the problem with this solution is that it makes my dataset transformations very expensive and therefore slows down the dataset throughput considerably; given that this is an image loading pipeline, that is a major point.
Maybe my explanation of how distribution strategies work is not the most rigorous one; if I am missing something, please feel free to correct me.
PS: This is not a bug report or a code error, mostly a "system design query". I hope that is acceptable here.
Instead of using a dictionary as the GAN input, you can try mapping a function in the following way:
import glob
import tensorflow as tf

def load_image(fileA, fileB):
    # Read and decode the blurred/sharp image pair
    imageA = tf.io.read_file(fileA)
    imageA = tf.image.decode_jpeg(imageA, channels=3)
    imageB = tf.io.read_file(fileB)
    imageB = tf.image.decode_jpeg(imageB, channels=3)
    return imageA, imageB

trainA = glob.glob('blur/*.jpg')
trainB = glob.glob('sharp/*.jpg')
train_dataset = tf.data.Dataset.from_tensor_slices((trainA, trainB))
train_dataset = train_dataset.map(load_image).batch(batch_size)

# for mirrored strategy
dist_dataset = mirrored_strategy.experimental_distribute_dataset(train_dataset)
You can then iterate over the dataset and update the network by passing both images.
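A rough sketch (not from the answer above) of that iteration in recent TF 2.x; build_generator, build_discriminator, and train_step are hypothetical placeholders for your own models and per-replica training step:

with mirrored_strategy.scope():
    generator = build_generator()          # hypothetical model constructors
    discriminator = build_discriminator()

@tf.function
def distributed_step(blur_batch, sharp_batch):
    # train_step runs once per replica on its local slice of the batch
    per_replica_loss = mirrored_strategy.run(train_step, args=(blur_batch, sharp_batch))
    return mirrored_strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_loss, axis=None)

for blur_batch, sharp_batch in dist_dataset:
    loss = distributed_step(blur_batch, sharp_batch)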
I hope this helps!
I am working on a problem that requires me to build a deep learning model which, given a certain input image, has to output another image. It is worth noting that the two images are conceptually related but do not have the same dimensions.
At first I thought that a classical CNN with a final dense layer whose size is the product of the output image's height and width would suit this case, but during training it produced strange figures such as an accuracy of 0.
While looking for answers on the Internet I discovered the concept of CNN autoencoders, and I was wondering whether this approach could help me solve my problem. In all the examples I saw, the input and output of an autoencoder had the same size and dimensions.
At this point I wanted to ask whether there is a type of CNN autoencoder that produces an output image with different dimensions than the input image.
An autoencoder (AE) is an architecture that encodes your image into a lower-dimensional representation while simultaneously learning to reconstruct the data from that representation. AEs are therefore trained in an unsupervised way (no labels needed): the same data is used both as the input and as the target in the loss.
You can try a U-Net-based architecture for your use case. A U-Net forwards intermediate representations to later layers of the network, which should help with faster learning/mapping of the inputs into the new domain.
You can also experiment with a simple architecture containing a few ResNet blocks without any downsampling layers, which might or might not be enough for your use-case.
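As a minimal sketch, assuming TF 2.x/Keras and arbitrary example sizes (128x128 input, 96x192 output), an encoder-decoder whose output resolution differs from its input could look like this:

import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(128, 128, 3))
# Encoder: compress to a smaller spatial representation
x = layers.Conv2D(32, 3, strides=2, padding='same', activation='relu')(inputs)
x = layers.Conv2D(64, 3, strides=2, padding='same', activation='relu')(x)
# Decoder: upsample again
x = layers.Conv2DTranspose(64, 3, strides=2, padding='same', activation='relu')(x)
x = layers.Conv2DTranspose(32, 3, strides=2, padding='same', activation='relu')(x)
# Map to the (different) target resolution; Resizing exists in recent Keras,
# otherwise tf.image.resize inside a Lambda layer works as well
x = layers.Resizing(96, 192)(x)
outputs = layers.Conv2D(3, 3, padding='same', activation='sigmoid')(x)
model = tf.keras.Model(inputs, outputs)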
If you want to dig a little deeper, you can look into DiscoGAN and related methods. They explicitly try to map an image into a new domain while preserving image information.
I have an image classification task where I've created multiple crops of each image as well as flipped/flopped versions to extend my limited dataset. I have written the dataset to a tfrecords file where each record consists of (simplified here to two crops and only a flipped version):
{
lbl: int,
crop_0: np.ndarray,
crop_1: np.ndarray,
crop_0_flipped: np.ndarray,
crop_1_flipped: np.ndarray
}
Basically 4 images / entry. During training, I'd like to treat each image as separate, i.e. feed each record as 4 images with the same label, shuffled with the rest of the images in the dataset, so that N images becomes 4N images. During testing (using a separate but similarly structured dataset), I'd like to take each image, only use the crop_0 and crop_1 images and average the softmax outputs for classification.
My question is: what is the best and most efficient way of training on such a dataset? I'm willing to change my approach if my current one makes training inefficient. It seems the simplest thing would have been to write separate tfrecords files for each version (crops and flips/flops) and interleave the files into one dataset, but I do not want a whole bunch of files to deal with if I can help it.
Writing the dataset to disk with 4N images is an approach that you'll come to loathe later (I did it this way originally and loathe that code now). The better way is to keep your original dataset on disk as-is and not write your preprocessing steps to disk. Do that kind of preprocessing on the CPU while you train. The TensorFlow Dataset preprocessing pipeline makes this easy and modular, and provides the parallelization you need to take advantage of multiple cores at no extra coding expense.
This is the main guide:
https://www.tensorflow.org/programmers_guide/datasets
Your approach should be to create 2 Dataset objects, one for train and one for test. The train Dataset pipeline will perform all the data augmentation you mentioned. The test Dataset pipeline will not, naturally.
One key to understanding this approach is that you will not feed the data to TensorFlow using feed_dict; instead, TensorFlow will invoke the Dataset pipeline to pull the data it needs for each batch.
To get parallelization you'll use the Dataset.map function to apply a set of transformations, and use the num_parallel_calls argument to distribute the operations across multiple cores. If your preprocessing can be done in TensorFlow code, great; if not, you'll need to use tf.py_func to wrap your Python preprocessing code.
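A minimal sketch of that idea, assuming the original (unaugmented) images are stored one per record; the feature names 'image' and 'label' and the crop size are hypothetical:

def parse_example(serialized):
    features = tf.parse_single_example(serialized, {
        'label': tf.FixedLenFeature([], tf.int64),
        'image': tf.FixedLenFeature([], tf.string),
    })
    image = tf.image.decode_jpeg(features['image'], channels=3)
    return image, features['label']

def augment(image, label):
    # Crop/flip on the fly instead of storing 4N images on disk
    image = tf.random_crop(image, [224, 224, 3])
    image = tf.image.random_flip_left_right(image)
    return image, label

train_ds = (tf.data.TFRecordDataset(train_files)
            .map(parse_example, num_parallel_calls=8)
            .map(augment, num_parallel_calls=8)
            .shuffle(10000)
            .batch(batch_size))

test_ds = (tf.data.TFRecordDataset(test_files)
           .map(parse_example, num_parallel_calls=8)
           .batch(batch_size))     # no augmentation at test time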
The guide I linked to above describes all of this very well. You will want to use a feedable iterator, described in the section called "Creating an iterator". This will allow you to get a string ID from each of the 2 datasets (train and test) and pass that string to TensorFlow via feed_dict, indicating which of the two datasets TensorFlow should pull samples from.
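A rough sketch of that feedable-iterator pattern, reusing the hypothetical train_ds/test_ds datasets from the previous sketch (train_step and eval_op stand in for your own ops):

handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
    handle, train_ds.output_types, train_ds.output_shapes)
images, labels = iterator.get_next()   # used to build the model graph

train_iter = train_ds.make_one_shot_iterator()
test_iter = test_ds.make_initializable_iterator()

with tf.Session() as sess:
    train_handle = sess.run(train_iter.string_handle())
    test_handle = sess.run(test_iter.string_handle())
    # Training: pull batches from the train pipeline
    sess.run(train_step, feed_dict={handle: train_handle})
    # Evaluation: initialize and switch to the test pipeline
    sess.run(test_iter.initializer)
    sess.run(eval_op, feed_dict={handle: test_handle})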
I was asked to create a machine learning algorithm using TensorFlow and Python that can detect anomalies by learning a range of 'normal' values. I have two parameters: a large array of floats around 1.5 and their timestamps. I have not seen similar threads using TensorFlow at a basic level, and since I am new to this technology I am looking to build something fairly basic. However, I would like it to be unsupervised, meaning that I do not specify what an anomaly is; rather, a large amount of past data does. Thank you. I am running Python 3.5 and TensorFlow 1.2.1.
Deep Learning - Anomaly and Fraud Detection
https://exploreai.org/p/deep-learning-anomaly-and-fraud-detection
Simply normalize the values and feed them to a TensorFlow autoencoder model.
Autoencoders are deep neural networks used to reproduce the input at the output layer, i.e. the number of neurons in the output layer is exactly the same as the number of neurons in the input layer.
The encoder part of the architecture compresses the input data into a smaller representation, ensuring that important information is not lost while the overall size of the data is reduced significantly. This concept is called dimensionality reduction.
Check this repo for code: Autoencoder in TensorFlow
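A minimal sketch, not from the linked article, of a small dense autoencoder on sliding windows of the normalized values; the window size and layer widths are arbitrary choices:

import tensorflow as tf

window_size = 32                                        # window of consecutive normalized values
x = tf.placeholder(tf.float32, [None, window_size])
hidden = tf.layers.dense(x, 8, activation=tf.nn.relu)   # compressed representation
recon = tf.layers.dense(hidden, window_size)            # reconstruction, same size as input
loss = tf.reduce_mean(tf.square(recon - x))             # reconstruction error
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

# After training on "normal" data only, windows whose reconstruction error is
# much larger than the errors seen on normal data can be flagged as anomalies.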
I'm developing a way to compare two spectrograms and score their similarity.
I have been thinking for a long time about how to do this and how to pick the overall model/approach.
The audio clips I use to make the spectrograms are recordings from an Android phone; I convert them from .m4a to .wav and then process them to plot the spectrogram, all in Python.
All audio recordings have the same length.
That really helps, because all the data can then be represented in the same dimensional space.
I filtered the audio using a Butterworth band-pass filter, which is commonly used in voice filtering thanks to its flat behavior in the passband. As cutoff frequencies I used 400 Hz and 3500 Hz.
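For reference, a minimal sketch of that filtering step with scipy (my reconstruction, not the original code):

from scipy.signal import butter, sosfiltfilt

def bandpass(signal, sample_rate, low_hz=400.0, high_hz=3500.0, order=5):
    # Butterworth band-pass between 400 Hz and 3500 Hz, applied zero-phase
    sos = butter(order, [low_hz, high_hz], btype='bandpass', fs=sample_rate, output='sos')
    return sosfiltfilt(sos, signal)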
After this procedure the output looks like this
My first idea was to find the region of interest in that spectrogram using OpenCV, so I filtered by color and got an output that can roughly be used to find the limits of the signal, but that would make every clip a different length, and I probably don't want that to happen.
Now to get to my question: I was thinking about embedding those spectrograms as multidimensional points and simply scoring their similarity as the distance to the reference sample, which would be visualizable, thanks to dimensionality reduction, in some cluster-like space. But that seems too plain, doesn't involve training, and is thus hard to verify. So:
Is it possible to use a convolutional neural network, or a combination of networks like CNN -> time-delay NN, to embed this spectrogram into a multidimensional point, so that instead of comparing the spectrograms directly I compare the outputs of the network?
If there is anything I missed in this question please comment; I will fix it right away. Thank you very much for your time.
Josef K.
EDIT:
After a tip from Nikolay Shmyrev I switched to using the Mel spectrogram:
That looks much more promising, but my question remains almost the same: can I use pretrained CNN models, like VGG16, to embed those spectrograms into vectors and thus be able to compare them? And if so, how? Just remove the last fully connected layer and flatten instead?
In my opinion, and according to Yann LeCun, when you target speech recognition with a deep neural network you have two obligations:
You need to use a recurrent neural network in order to have memory (memory is really important for speech recognition...)
and
you will need a lot of training data
You can try an RNN in TensorFlow, but you will definitely need a lot of training data.
If you don't want to (or can't) find or generate a lot of training data, you should forget about deep learning for solving this...
In that case (forgetting deep learning) you may take a look at how Shazam works (it is based on a fingerprinting algorithm).
You can use a CNN of course; TensorFlow has special classes for that, as do many other frameworks. You simply convert your image to a tensor and apply the network, and as a result you get a lower-dimensional vector that you can compare.
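A minimal sketch of that idea, assuming Keras and spectrograms saved as RGB images (the file names are hypothetical): embed two spectrograms with a pretrained VGG16 without its top layers and compare the resulting vectors.

import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

embedder = VGG16(weights='imagenet', include_top=False, pooling='avg',
                 input_shape=(224, 224, 3))

def embed(path):
    # Load a spectrogram image and turn it into a fixed-size embedding vector
    img = tf.keras.preprocessing.image.load_img(path, target_size=(224, 224))
    arr = preprocess_input(np.expand_dims(np.array(img, dtype=np.float32), 0))
    return embedder.predict(arr)[0]

a, b = embed('spec_a.png'), embed('spec_b.png')
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine similarity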
You can train your own CNN too.
For best accuracy it is better to expand the lower frequencies (bottom part) and compress the higher frequencies in your picture, since the lower frequencies carry more importance. You can read about the Mel scale for more information.
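If you want to produce such a Mel-scaled spectrogram yourself, here is a small sketch assuming librosa is available (the file name is hypothetical):

import librosa
import numpy as np

y, sr = librosa.load('clip.wav', sr=None)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmax=4000)
mel_db = librosa.power_to_db(mel, ref=np.max)   # log-scaled, suitable for plotting or as CNN input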