I'm trying to build a video stabilization deep learning model.
I want the model to predict how a frame should be stabilized based on the last 10 frames.
I have tried pix2pix, which is image-to-image, but I didn't get good results.
So I want something like pix2pix, but mapping multiple images to one image.
Is there a method for this, or can I do it using pix2pix?
I do not know whether you actually need to build this video stabilization with deep learning or whether you just want an off-the-shelf solution.
For an off-the-shelf solution, you can look into vidgear, which has an awesome stabilisation system built in: https://abhitronix.github.io/vidgear/latest/gears/stabilizer/overview/
If you want a more advanced solution and architecture, you could take a look at this Papers with Code page: https://paperswithcode.com/task/video-stabilization
Given the current architecture of pix2pix, I do not see how multiple input images would provide any stabilisation since, as you said, pix2pix is image-to-image: it considers neither its previous outputs nor the flow of images when generating its prediction.
I hope that it helps ^^
I'd like to implement something like the title, but I wonder if it's technically possible.
I know that it is possible to recognize pictures with CNN,
but I don't know whether the nipple area can be covered automatically.
If you know of a library or have any related information,
I would like to get some advice.
CNNs are able to detect whatever you train them for, to varying degrees of accuracy. What you would need are a lot of training samples (i.e. pairs of the original image and the labeled ground-truth image) with which to train your model, and then some new data on which you can test the accuracy of your model. The point is, CNNs are not biased to innately learn a task, you have to tell them what to learn!
I can recommend the machine learning library Keras (https://keras.io/) if you plan to do some machine learning using CNNs, as it's pretty simple and somewhat beginner-friendly. Work through some of the CNN tutorials, which are quite good.
Essentially, you have what I can only assume is a pretty niche problem. The main issue will come down to how much data you have to train your model. CNNs need a lot of training data, especially for a problem like this which isn't simple. A way which would make this simpler would be to have a model which detects the ahem area of interest and denotes it as such on a per-pixel basis. Then a simple mask could be applied to the source image to censor it. This relates to image segmentation, and there are many academic papers on the topic.
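To make the masking step concrete, here is a minimal sketch in plain Python. It assumes the segmentation model has already produced a per-pixel mask; the function name `censor` and the tiny nested-list "image" are made up for illustration, and in practice you would use an array library rather than lists.

```python
def censor(image, mask, fill=0):
    """Return a copy of `image` with every masked pixel replaced by `fill`.

    image: 2-D list of pixel values (rows of grayscale intensities).
    mask:  2-D list of the same shape; a truthy entry marks a pixel to censor.
    """
    same_shape = len(image) == len(mask) and all(
        len(img_row) == len(mask_row) for img_row, mask_row in zip(image, mask)
    )
    if not same_shape:
        raise ValueError("image and mask must have the same shape")
    return [
        [fill if m else px for px, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


# Hypothetical 3x4 grayscale image and a mask such as a segmentation model might output.
image = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
]
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
censored = censor(image, mask)
```

The hard part remains the model that produces `mask`; once you have it, the censoring itself is this simple.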
So here is my question:
I want to make my very own dataset using a motion capture camera system to get the ground truth poses and one RGB camera to get images, and then using this as input to my network, train/test a convNet.
I have looked around at other datasets for tensorflow, caffe and Matlab. I have viewed the MNIST, Cats/Dogs, Iris, LSP, HumanEva, HumanEva3.6, FLIC, etc. datasets and have viewed and tried to understand their data as best as I can. I have viewed online people trying to make their own datasets. The one thing is usually when you use their datasets as an example, you download a .txt file that already contains the labels.
If anyone could please explain to me how to use the image data with the labels to feed it into my network, it would be a tremendous help. I have written code before using tensorflow to input a .txt file into the network and get the correct predicted output. But my brain is missing something to understand how to input an image with a label. How do I create that dataset?
Your input images and your labels are two separate variables. You will be writing separate bits of code to import them. The videos typically need to be converted to JPG files (it's a royal pain to read video files directly, mostly because you can't randomly skip around the video easily).
Probably the easiest way to structure your data is via a CSV that contains filename, poseinfoA, poseinfoB, etc., where the filename refers to the JPG image on disk.
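As a rough sketch of that CSV layout, using only the standard library: the column names (`pose_x`, `pose_y`, `pose_z`) and the file names are invented for illustration, and the CSV is read from a string here so the example runs without any files on disk.

```python
import csv
import io

# Hypothetical CSV contents; in practice this would be a file on disk.
csv_text = """filename,pose_x,pose_y,pose_z
frame_0001.jpg,0.10,1.50,0.30
frame_0002.jpg,0.12,1.48,0.31
"""


def load_samples(csv_file):
    """Return a list of (filename, [pose values as floats]) pairs."""
    samples = []
    for row in csv.DictReader(csv_file):
        filename = row.pop("filename")           # first column: image on disk
        pose = [float(v) for v in row.values()]  # remaining columns: the label
        samples.append((filename, pose))
    return samples


samples = load_samples(io.StringIO(csv_text))
filenames = [name for name, _ in samples]  # inputs: paths to the JPG images
labels = [pose for _, pose in samples]     # targets: the mocap ground truth
```

The two lists stay aligned by index, which is what lets you load the images and the labels with separate bits of code.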
To get started on the basics, I suggest looking at Aymeric Damien's tutorial examples. I haven't found tutorials anywhere that were as clear and concise.
https://github.com/aymericdamien/TensorFlow-Examples
Those examples don't go into detail on the data input pipeline though. To set up a good data input pipeline in tensorflow I suggest you use the new (as of TF 1.4) Dataset object. It will force you into a good data input pipeline workflow, and it's the way all data input is going in tensorflow, so it's worth learning. It's also easy to test and debug when you write it this way. Here's the guide you want to follow.
https://www.tensorflow.org/programmers_guide/datasets
You can start your Dataset object from the CSV, and use dataset.map() to load the images using tf.image.decode_jpeg.
Since you're doing pose estimation I'll also suggest a nice blog I came across recently that will probably interest you. The topic is segmentation, but pose estimation is quite related.
http://blog.qure.ai/notes/semantic-segmentation-deep-learning-review
I am new to scikit-learn. I have a lot of images, and they are not all the same size. Some are images of real scenes, like
cdn.mayike.com/emotion/img/attached/1/image/dt/20170920/12/20170920121356_795.png
cdn.mayike.com/emotion/img/attached/1/image/mainImg/20170916/15/20170916153205_512.png
and others are not images of real scenes, like
cdn.mayike.com/emotion/img/attached/1/image/dt/20170917/01/20170917011403_856.jpeg
cdn.mayike.com/emotion/img/attached/1/image/dt/20170917/14/20170917145613_197.png
I want to use scikit-learn to recognize which images are not of real scenes. I think it is similar to the face recognition example (http://scikit-learn.org/stable/auto_examples/applications/plot_face_recognition.html#sphx-glr-auto-examples-applications-plot-face-recognition-py). I have no idea how to begin. How do I create a dataset and extract features from the images? Can someone tell me what I should do?
This does not seem to be directly a programming problem, and your questions relate to non-basic, current research.
It seems that you should read about Natural Scene (Statistics) and get familiar with one of the current machine learning frameworks, such as TensorFlow or Caffe.
There are many tutorials out there to get started, for example you could begin with a binary classifier which outputs if the given image shows a natural scene or not.
Your database setup could have a structure like so:
-> Dataset
    -> natural_scenes
    -> artificial_images
DIGITS, for example, can create a dataset from such a structure and is able to use models designed for Caffe and TensorFlow.
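If you want to read such a folder structure yourself, a minimal sketch with the standard library could look like this. The helper name `list_labeled_images` and the file names are made up, and the example creates empty placeholder files in a temporary directory just so it runs end to end.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_labeled_images(root):
    """Return sorted (path, label) pairs, using each subfolder name as the label."""
    pairs = []
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for image_path in sorted(class_dir.iterdir()):
            if image_path.suffix.lower() in IMAGE_SUFFIXES:
                pairs.append((image_path, class_dir.name))
    return pairs


# Build a tiny throwaway dataset (empty placeholder files) to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "Dataset"
    layout = {"natural_scenes": ["a.png", "b.jpeg"], "artificial_images": ["c.png"]}
    for label, names in layout.items():
        (root / label).mkdir(parents=True)
        for name in names:
            (root / label / name).touch()
    pairs = list_labeled_images(root)
    names_found = [path.name for path, _ in pairs]
    labels = [label for _, label in pairs]
```

The folder name doubles as the class label, so adding more training images is just a matter of dropping files into the right folder.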
I would also recommend that you read about finetuning neural networks, as you would need a lot of images in your database if you start training from scratch.
In Caffe you can finetune pretrained models like CaffeNet or GoogLeNet.
I think that is some basic information which should get you started.
As for scikit-learn and face detection: face detection is more about looking for local candidates or image patches which could possibly contain a face. Your problem, on the other hand, is more of a global problem, as the whole image is concerned. That said, I would start off with a neural network here, which is able to extract local and global features for you.
I'm in need of an artificial neural network library (preferably in python) for one (simple) task. I want to train it so that it can tell whether a thing is in an image. I would train it by feeding it lots of pictures and telling it whether each one contains the thing I'm looking for or not:
These images contain this thing, return True (or probability of it containing the thing)
These images do not contain this thing, return False (or probability of it containing the thing)
Does such a library already exist? I'm fairly new to ANNs and image recognition; although I understand how they both work in principle I find it quite hard to find an adequate library for this task, and even research in this field has proven to be kind of a frustration - any advice towards the right direction is greatly appreciated.
There are several good Neural Network approaches in Python, including TensorFlow, Caffe, Lasagne, and sknn (Sci-kit Neural Network). sknn provides an easy, out of the box solution, although in my opinion it is more difficult to customize and can be slow on large datasets.
One thing to consider is whether you want to use a CNN (Convolutional Neural Network) or a standard ANN. With an ANN you will most likely have to "unroll" your images into a vector, whereas a CNN expects the image as a cube (height x width x color channels) if it is in color, or as a square (height x width) otherwise.
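To illustrate what "unrolling" means, here is a minimal sketch in plain Python with a made-up 2x3 grayscale image stored as nested lists. With an array library such as NumPy you would typically reshape the array instead.

```python
from itertools import chain


def unroll(image):
    """Flatten a 2-D grayscale image (list of rows) into a 1-D vector, row by row."""
    return list(chain.from_iterable(image))


# Hypothetical 2x3 grayscale image.
image = [
    [0, 1, 2],
    [3, 4, 5],
]
vector = unroll(image)  # one input value per pixel
```

Note that the flattened vector loses the information about which pixels were neighbours, which is one reason CNNs tend to do better on images.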
Here is a good resource on CNNs in Python.
However, since you aren't really doing a multiclass image classification (for which CNNs are the current gold standard) and doing more of a single object recognition, you may consider a transformed image approach, such as one using the Histogram of Oriented Gradients (HOG).
In any case, the accuracy of a Neural Network approach, especially when using CNNs, is highly dependent on successful hyperparameter tuning. Unfortunately, there isn't yet any kind of general theory on what hyperparameter values (number and size of layers, learning rate, update rule, dropout percentage, batch size, etc.) are optimal in a given situation. So be prepared to have a nice Training, Validation, and Test set setup in order to fit a robust model.
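A minimal sketch of such a Training / Validation / Test setup, using only the standard library. The 70/15/15 proportions, the fixed seed, and the function name `split_dataset` are arbitrary choices for illustration, not a recommendation.

```python
import random


def split_dataset(samples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle `samples` and split them into (train, val, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_val = round(len(shuffled) * val_fraction)
    n_test = round(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Stand-in for 100 labeled images; here each sample is just an index.
train, val, test = split_dataset(range(100))
```

You tune hyperparameters against the validation set and touch the test set only once at the end, so the final accuracy estimate is not biased by the tuning.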
I am unaware of any library which can do this for you. I use a lot of Caffe and can give you a solution till you find a single library which can do it for you.
I hope you know about ImageNet and that Caffe has a trained model based on ImageNet.
Here is the idea:
Define what the object is. Say object = "laptop".
Use Caffe's ImageNet trained model, change the code to display the required output you want (you mentioned TRUE or FALSE) when the object is in the output labels.
Here is a link to the ImageNet tutorial which I wrote.
Here is what you might try:
Take a look here. It is a stripped down version of the ImageNet program which I used in a prediction engine.
In line 80 you'll get the top-1 predicted output label. In line 86 you'll get the top-5 predicted labels. Write a line of code to check whether object is in the output_label and return TRUE or FALSE according to it.
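The check itself can be sketched as follows. The label strings are made up to resemble ImageNet-style output, and `contains_object` is a hypothetical helper, not part of Caffe.

```python
def contains_object(predicted_labels, target):
    """Return True if `target` appears in any predicted label (case-insensitive)."""
    target = target.lower()
    return any(target in label.lower() for label in predicted_labels)


# Hypothetical top-5 output labels from an ImageNet-trained classifier.
top5 = [
    "notebook, notebook computer",
    "laptop, laptop computer",
    "desk",
    "monitor",
    "mouse, computer mouse",
]
found_laptop = contains_object(top5, "laptop")
found_banana = contains_object(top5, "banana")
```

This uses plain substring matching, so a short target such as "cat" would also match a label like "catamaran"; compare against whole label names if that matters for your object.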
I understand that you are looking for a specific library, I will look for it, but this is something I would try out in the beginning.
I learned that neural networks can replicate any function.
Normally the neural network is fed with a set of descriptors to its input neurons and then gives out a certain score at its output neuron. I want my neural network to recognize certain behaviours from a screen. Objects on the screen are already preprocessed and clearly visible, so recognition should not be a problem.
Is it possible to use the neural network to recognize a pixelated picture of the screen and make decisions on that basis? The amount of training data would be huge of course. Is there a way to teach the ANN by online supervised learning?
Edit:
Because a commenter said the programming problem would be too general:
I would like to implement this in python first, to see if it works. If anyone could point me to a resource where I could do this online-learning thing with python, I would be grateful.
I would suggest
http://www.neuroforge.co.uk/index.php/getting-started-with-python-a-opencv
http://docs.opencv.org/doc/tutorials/ml/table_of_content_ml/table_of_content_ml.html
http://blog.damiles.com/2008/11/the-basic-patter-recognition-and-classification-with-opencv/
https://github.com/bytefish/machinelearning-opencv
OpenCV is basically an image processing library, but it also has some amazing helper classes that you can use for almost any task. Its machine learning module is pretty easy to use, and you can go through the source to see explanations and background theory for each function.
You could also use a pure python machine learning library like:
http://scikit-learn.org/stable/
But before you feed the data from your screen (I'm assuming that's in pixels?) to your ANN, SVM, or whatever ML algorithm you choose, you need to perform "feature extraction" on your data (that is, on the objects on the screen).
Feature extraction can be thought of as representing the same data on the screen with fewer numbers, so that there are fewer numbers to give to the ANN. You need to experiment with different features before you find a combination that works well for your particular scenario. A sample one could look something like this:
[x1,y1,x2,y2...,col]
This is basically a list of edge points that represent the area your object is in, a sort of ROI (Region of Interest), followed by a color value. To build it, you perform edge detection and color detection, and also extract any other relevant characteristics. The important thing is that all your objects, with their shape and color information, are now represented by a number of these lists, one for each object detected.
This is the data that can be provided as input to the neural network. But you'll have to define some meaningful output parameters depending on your specific problem statement before you can train and test your system, of course.
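A small sketch of how such a list could be assembled. The point coordinates and the integer color code are invented, and `make_feature_vector` is a hypothetical helper; the actual edge and color detection would come from something like OpenCV.

```python
def make_feature_vector(edge_points, color):
    """Flatten [(x1, y1), (x2, y2), ...] plus a color code into [x1, y1, x2, y2, ..., col]."""
    features = []
    for x, y in edge_points:
        features.extend([x, y])
    features.append(color)
    return features


# Hypothetical detected object: corner points of its region and a color code.
roi_points = [(12, 30), (48, 30), (48, 75), (12, 75)]
color_code = 3
features = make_feature_vector(roi_points, color_code)
```

A feedforward network needs inputs of a fixed length, so you would have to use the same number of points for every object (for example, always the four corners of a bounding box).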
Hope this helps.
This is not entirely correct.
A 3-layer feedforward MLP can theoretically replicate any CONTINUOUS function.
If there are discontinuities, then you need a 4th layer.
Since you are dealing with pixelated screens and such, you probably would need to consider a fourth layer.
Finally, if you are looking at circular shapes, etc., then a radial basis function (RBF) network may be more suitable.
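For reference, the statement about continuous functions is usually given as the universal approximation theorem. A sketch of its common form, with symbols introduced here for illustration: let $f$ be continuous on a compact set $K \subset \mathbb{R}^n$ and let $\sigma$ be a continuous sigmoidal activation. Then a network with a single hidden layer of $N$ units can get arbitrarily close to $f$:

```latex
\[
\forall \varepsilon > 0 \;\; \exists\, N \in \mathbb{N},\ \alpha_i, b_i \in \mathbb{R},\ w_i \in \mathbb{R}^n :
\quad
\sup_{x \in K} \Bigl|\, f(x) - \sum_{i=1}^{N} \alpha_i \, \sigma\!\left( w_i^{\top} x + b_i \right) \Bigr| < \varepsilon
\]
```

Note that this is about approximation rather than exact replication, and it only says that suitable weights exist. It does not say how large $N$ must be or that training will actually find those weights.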