I'm trying to use Keras's implementation of ResNet for a transfer learning task with a rather different set of images (16-bit black-and-white). What does Keras expect as input? Images with 3 channels in the -127 to 128 range (which is what I assume a zero-centered 8-bit image would be)? 0-255? What would happen if I pass something outside this range?
Thanks.
According to the paper referenced in the Keras documentation, you should provide a 224 x 224 RGB image with values in [0, 255]. The actual dimension ordering depends on the backend you use in your Keras installation.
The data preparation follows AlexNet: the mean pixel value is subtracted from each color channel. The per-channel means are 103.939, 116.779, 123.68 (these follow the original Caffe weights and are applied in BGR order).
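For reference, Keras ships a helper that applies exactly this preprocessing. A minimal sketch (the random input stands in for a real image):

import numpy as np
from keras.applications.resnet50 import ResNet50, preprocess_input

# a 224 x 224 RGB image with values in [0, 255]
x = np.random.randint(0, 256, size=(1, 224, 224, 3)).astype('float32')
# reorders RGB -> BGR and subtracts the per-channel means for you
x = preprocess_input(x)

model = ResNet50(weights='imagenet')
preds = model.predict(x)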
If your values extend beyond this range, it could harm your training because the magnitude of the data is unfamiliar to the network. The network could still adapt to such a change, but it usually takes more time and makes training more chaotic.
In the case of monochromatic images, a commonly used technique is repeating the same channel 3 times to make the dimensions compatible with the network architecture.
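A minimal NumPy sketch of that channel repetition (gray is a hypothetical single-channel image):

import numpy as np

gray = np.random.rand(224, 224).astype('float32')  # (height, width)
# repeat the single channel 3 times -> (height, width, 3)
rgb = np.repeat(gray[:, :, np.newaxis], 3, axis=2)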
How do you flatten an image?
I know that we use conv2d and pooling to detect edges and reduce the size of the picture, so do we then flatten it after that?
Will the flattened, pooled image be a vector in one row with the features, or one column with the features?
Do we apply x_data = x_data / 255 after flattening, or before convolution and pooling?
I hope to know the answer.
Here's the pipeline:
Input images (could be in batches -- let's say your network processes 10 images simultaneously), so 10 images of size (28, 28) -- 28 pixels height/width -- and let's say each image has 1 channel only (grayscale).
You are supposed to provide your network an input of size (10, 28, 28, 1), which will be accepted by a convolutional layer. You are free to use max pooling and perhaps an activation function. Your convolutional layer will apply a number of filters of your choice -- let's assume you want to apply 40 filters. These are 40 different kernels with different weights.

If you want to, say, classify these images, you will (most likely) have a number of Dense layers after your convolutional layers. Before passing the output of the convolutional layers (a representation of your input image after a feature extraction process) to your dense layers, you have to flatten it (you may use the simplest form of flattening: just laying the numbers out one after the other). So your dense layer accepts the output of those 40 filters, which will be 'images' whose size depends on many things (kernel size, stride, original image size); these are flattened into a vector, which propagates forward the information extracted by your conv layers.
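Here is a minimal Keras sketch of that pipeline; the layer sizes and the 10 output classes are assumptions for illustration:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    # 40 filters of size 3x3 on a (28, 28, 1) grayscale input
    Conv2D(40, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    # downsample the 40 feature maps
    MaxPooling2D((2, 2)),
    # flatten the (13, 13, 40) feature maps into one vector
    Flatten(),
    # dense layers consume the flattened features
    Dense(64, activation='relu'),
    Dense(10, activation='softmax'),
])
model.summary()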
Regarding your second question about min-max scaling (dividing by 255): that is supposed to take place before everything else. There are other ways of normalizing your data (standard scaling -- converting to zero mean and unit variance), but keep in mind that when using transformations like that, you are supposed to fit the transformation on your training data and transform your test data accordingly. You are not supposed to fit and transform on your test data. Here, dividing everything by 255 is acceptable, but keep that in mind for the future.
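A small sketch of the fit-on-train-only rule using scikit-learn's StandardScaler (the data shapes are made up):

import numpy as np
from sklearn.preprocessing import StandardScaler

x_train = np.random.rand(100, 784)
x_test = np.random.rand(20, 784)

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)  # fit on training data only
x_test = scaler.transform(x_test)        # reuse the training statistics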
I'm working in TensorFlow and my dataset is composed only of black-and-white images, so I thought I could make my neural net (currently I am using ResNet50) less heavy and easier to train and test by changing the number of channels from 3 to 1.
Is there a way to do so?
(I know I can treat b/w images as RGB images, but I don't want to do that.)
Thanks in advance for the answer
The pretrained weights in keras.applications require a 3-channel input. You could do one of two things:
Use a different pretrained model that works on grayscale images.
Set the R, G and B channels to replicate your BW input (see the sketch below), then fine-tune the entire neural network on your own dataset. This probably won't work without the fine-tuning step.
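A minimal tf.keras sketch of the second option, assuming 224x224 inputs and 10 classes:

import numpy as np
import tensorflow as tf

# hypothetical grayscale batch (n, 224, 224, 1), repeated to 3 channels
x_gray = np.random.rand(8, 224, 224, 1).astype('float32')
x_rgb = np.repeat(x_gray, 3, axis=-1)

base = tf.keras.applications.ResNet50(weights='imagenet',
                                      include_top=False,
                                      input_shape=(224, 224, 3))
base.trainable = True  # fine-tune the entire network

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation='softmax'),  # 10 classes assumed
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')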
On a side note, I must say this will not help your goal of making the network 'less heavy and easier to train and test'. If you call model.summary() on Keras ResNet50, you will see that of the 23,534,592 trainable parameters, only about 10K are in the initial layer. So at best you can reduce the parameter count by an insignificant few thousand.
I would instead suggest using a lighter model, such as MobileNet, which is also available in Keras.
This question is almost a duplicate of a post on Cross Validated, but no one has replied to that, so I hope it is okay to ask almost the same question here.
I have been reading about and looking at implementations of SRGAN, from Photo-realistic Single Image Super Resolution with Generative Adversarial Networks. I ran the PyTorch implementation of SRGAN on 3-channel images and it makes some decent super-resolution images. However, when I try it with 1-channel images it fails to generate plausible images.
The grayscale images I use are from the public MSTAR release of high-resolution Synthetic Aperture Radar (SAR) data from SDMS. The dataset contains 2774 images across 10 classes.
Since SRGAN uses VGG as one of its networks, I had to convert my grayscale images to RGB. I copied the first channel to the other two channels and created a 3-channel image with the following approach:
import numpy as np

# hr_image is a 2-D grayscale array of shape (height, width)
h, w = hr_image.shape
ret = np.empty((h, w, 3), dtype=np.uint8)
ret[:, :, :] = hr_image[:, :, np.newaxis]  # broadcast the channel 3 times
hr_image = ret
The following are the outputs after 34 epochs (this continues): low resolution, high resolution, and super resolution.
What I notice is that the discriminator's loss quickly goes to 0, the generator's loss to 0.08, the generator's score to 0, and the discriminator's score to 1. I assume this means that it is too easy for the discriminator to distinguish between real and fake images. This presumably causes the generator to stop learning anything new.
I tried to isolate one class in the MSTAR dataset, but that did not change anything. I noticed that others (Super-resolution SAR Image Reconstruction via Generative Adversarial Network) did use SRGAN for SAR images and it seems to work, but their paper does not explain how they implemented it.
I am wondering if I am using the wrong approach and need to change the loss functions. SRGAN uses MSE, TV loss and perceptual loss. MSE by itself is not the best loss function, as is explained really well here, but it is probably good for keeping the images inside the MSE hypersphere. Still, I ask myself whether it makes sense to use a network trained on low-frequency images on high-frequency images. As I understand it, the loss functions are designed to work really well on low-frequency images (something our eyes like to look at) and not so much on high-frequency images. My questions are therefore:
Should I change the loss functions? Should I pre-train my own network on high-frequency grayscale images so it is more suitable for these images?
Why does the generator crash after a few epochs? Why is it so hard for the generator to make plausible transformations? Are the images too "noisy"?
Should I change something in the structure of the generator and discriminator? And should I try to use another pretrained network that is more suitable for grayscale images, and if so, which one?
Should I use a pretrained network that is trained on high frequency images?
UPDATE 11 JUNE 2020: NEW RESULTS
Instead of converting all my images to 3 channels up front, I now do it only right before the VGG network (line 22 of my code), with the following commands:
# repeat the single channel 3 times along dim 1 (the channel dim of NCHW)
out_images = out_images.repeat(1, 3, 1, 1)
target_images = target_images.repeat(1, 3, 1, 1)
I also made some minor changes to my code, and this changed the output of the model after 250 epochs:
However, as can be seen, it did not improve the super-resolution images. The discriminator quickly learns to distinguish between real and fake images, as can be seen in the loss plot below:
Does anyone have any suggestions on how I can make the generator stronger? I tried adding more layers to the generator and removing some from the discriminator, but with no success.
I would like to train a new model using my own dataset. I will be using Darkflow/TensorFlow for it.
Here are my doubts:
(1) Should we resize our training images to a specific size?
(2) I think smaller images might save time, but can smaller images harm the accuracy?
(3) And what about the images to be predicted, should we resize them as well or is it not necessary?
(1) Yes. With random=1 in the .cfg file, YOLO already resizes the images. The input resolution of the images must be the same; you can resize them yourself or let YOLO do it.
(2) If your hardware is good enough, I suggest you use large images. Also, as a suggestion: if you will use a webcam, use images at the same resolution as your webcam.
(3) Yes, the same as in training.
(1) Yes, neural networks have fixed input dimensions. These can be adjusted to fit your purpose, but in the end you need to commit to a defined input dimension, and thus you need to input images fitting those dimensions. For YOLO I found the following:
layer    filters  size        input              output
0 conv   32       3 x 3 / 1   416 x 416 x 3  ->  416 x 416 x 32
It could be that the framework you are using already does that step for you. Maybe somebody could comment on that.
(2) Smaller images make sense if your hardware is not able to hold larger images in memory, or if you train with large batch sizes so that your hardware needs to hold multiple images in memory at once. In the end, the computational time is roughly proportional to the number of operations in your architecture, not necessarily to the image size.
(3) The images/samples you feed during inference should be as similar to the training images/samples as possible. So whatever preprocessing you're doing on your training data, you should definitely do the same on your inference data.
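For example, resizing both training and inference images the same way might look like this (416 x 416 taken from the layer table above; the file path is hypothetical):

import cv2

image = cv2.imread('sample.jpg')
resized = cv2.resize(image, (416, 416))  # match the network's input size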
(1) No, it is not necessary. But if your dataset contains images of random resolutions, you can put
random = 1
in your .cfg file for better results.
(2) Smaller images don't reduce the time to converge, but if your dataset contains only small images, YOLO will probably fail to converge (YOLOv3 is not a good detector for lots of tiny objects).
(3) It is not necessary
I am trying to assess the feasibility of using TensorFlow to identify features in my image data. I have 50x50 px grayscale images of nuclei that I would like to have segmented: the desired output would be either a 0 or 1 for each pixel, 0 for the background and 1 for the nucleus.
Example input: raw input data
Example label (what the "label"/real answer would be): output data (label)
Is it even possible to use TensorFlow to perform this type of machine learning on my dataset? I could potentially have thousands of images for the training set.
A lot of the examples have a label corresponding to a single category; for example, a 10-number array [0,0,0,0,0,0,0,0,0,0] for the handwritten digit data set. But I haven't seen many examples that would output a larger array. I would assume the label would be a 50x50 array?
Also, any ideas on the CPU processing time for this type of analysis?
Yes, this is possible with TensorFlow. In fact, there are many ways to approach it. Here's a very simple one:
Consider this to be a binary classification task: each pixel needs to be classified as foreground or background. Choose a set of features by which each pixel will be classified. These features could be local (such as a patch around the pixel in question) or global (such as the pixel's location in the image), or a combination of the two.
Then train a model of your choosing (such as a NN) on this dataset; a sketch follows below. Of course your results will be highly dependent upon your choice of features.
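A minimal sketch of this per-pixel approach (the 5x5 patch size, network shape and random stand-in data are all assumptions):

import numpy as np
import tensorflow as tf

def pixel_features(image, r, c, k=2):
    # 5x5 patch around (r, c) plus the normalized pixel location
    padded = np.pad(image, k, mode='reflect')
    patch = padded[r:r + 2*k + 1, c:c + 2*k + 1].ravel()
    loc = np.array([r / image.shape[0], c / image.shape[1]])
    return np.concatenate([patch, loc])

image = np.random.rand(50, 50).astype('float32')           # stand-in image
label = (np.random.rand(50, 50) > 0.5).astype('float32')   # stand-in mask

X = np.array([pixel_features(image, r, c)
              for r in range(50) for c in range(50)])
y = label.ravel()

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(27,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # foreground probability
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(X, y, epochs=1, verbose=0)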
You could also take a graph-cut approach if you can represent that computation as a computational graph using the primitives that TensorFlow provides. You could then either not make use of TensorFlow's optimization functions such as backprop, or, if there are differentiable variables in your computation, use TF's optimization functions to optimize them.
A SoftmaxWithLoss-style loss works for your image segmentation problem if you reshape the predicted label map and the true label map from [batch, height, width, channel] to [N, channel].
In your case, your final predicted map will have channel = 2, and after reshaping, N = batch * height * width; then you can use a softmax cross-entropy loss in TensorFlow to run the optimization.
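A sketch of that reshape-then-loss step in TensorFlow (the shapes are assumptions; sparse_softmax_cross_entropy_with_logits plays the role of SoftmaxWithLoss):

import tensorflow as tf

batch, height, width = 4, 50, 50
logits = tf.random.normal((batch, height, width, 2))      # predicted map
labels = tf.random.uniform((batch, height, width),
                           maxval=2, dtype=tf.int32)      # 0/1 true mask

# [batch, height, width, channel] -> [N, channel], N = batch*height*width
logits_flat = tf.reshape(logits, [-1, 2])
labels_flat = tf.reshape(labels, [-1])

loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels_flat,
                                                   logits=logits_flat))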
See this question that may help.
Try using convolutional filters for the model: a stack of convolution and downsampling layers. The input should be the normalized pixel image and the output should be the mask. The last layer should be a softmax-with-loss layer. HTH.
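A minimal sketch of such a stack in tf.keras (layer sizes are assumptions; a softmax output plus cross-entropy loss stands in for softmaxWithLoss):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu',
                           input_shape=(50, 50, 1)),     # normalized input
    tf.keras.layers.MaxPooling2D(),                      # 50x50 -> 25x25
    tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu'),
    tf.keras.layers.UpSampling2D(),                      # 25x25 -> 50x50
    tf.keras.layers.Conv2D(2, 1, activation='softmax'),  # per-pixel classes
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')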