I'm trying to get human pose information from low-resolution images. In particular, I've tried the Keras OpenPose implementation by michalfaber, but the model seems not to perform well on low-resolution images while performing pretty well on higher resolutions. I posted the question as an issue on the GitHub repo, but I thought I'd also try here, as I'm not set on that particular implementation of human pose detection.
My images are about 50-100 pixels in both width and height.
This is an example of the image. I wonder if anyone knows a way to modify the program or network, or knows of a human pose network that performs well on such low-resolution images.
If you are looking for a different human pose estimation network, I would highly recommend the MXNet GluonCV framework (https://gluon-cv.mxnet.io/model_zoo/pose.html). It is very simple to use, and it contains many different pose estimation networks that you can try out to compare the tradeoff between accuracy and speed. For example, to use it you can do the following (taken from the tutorial page):
from matplotlib import pyplot as plt
from gluoncv import model_zoo, data, utils
from gluoncv.data.transforms.pose import detector_to_alpha_pose, heatmap_to_coord_alpha_pose
detector = model_zoo.get_model('yolo3_mobilenet1.0_coco', pretrained=True)
pose_net = model_zoo.get_model('alpha_pose_resnet101_v1b_coco', pretrained=True)
# Note that we can reset the classes of the detector to only include
# human, so that the NMS process is faster.
detector.reset_class(["person"], reuse_weights=['person'])
im_fname = utils.download('https://github.com/dmlc/web-data/blob/master/' +
                          'gluoncv/pose/soccer.png?raw=true',
                          path='soccer.png')
x, img = data.transforms.presets.yolo.load_test(im_fname, short=512)
print('Shape of pre-processed image:', x.shape)
class_IDs, scores, bounding_boxs = detector(x)
pose_input, upscale_bbox = detector_to_alpha_pose(img, class_IDs, scores, bounding_boxs)
predicted_heatmap = pose_net(pose_input)
pred_coords, confidence = heatmap_to_coord_alpha_pose(predicted_heatmap, upscale_bbox)
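If you want to visualize the predicted keypoints, the same tutorial page continues roughly like this (a sketch reusing the variables above; check the tutorial for the exact call):
ax = utils.viz.plot_keypoints(img, pred_coords, confidence,
                              class_IDs, bounding_boxs, scores,
                              box_thresh=0.5, keypoint_thresh=0.2)
plt.show()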
As for accuracy, their AlphaPose with a ResNet-101 backbone, for example, is significantly more accurate than OpenPose (you can find more accuracy benchmarks at the link above). A caveat, however, is understanding the difference between the types of these networks, such as whether they implement a bottom-up or a top-down approach, since this can affect inference speed in different scenarios.
For example, the runtime of top-down approaches is proportional to the number of detected people, so they can be time-consuming if your image contains a crowd.
I'm new to the computer vision world, and I'm trying to create a script to gather data from a dataset of images.
I'm interested in what kinds of objects are in those images, and in getting a summary of them in a JSON file for every image.
I've checked out some YOLO implementations, but the ones I've seen are almost always based on COCO with its 80 classes, or trained on a custom dataset.
I've seen that there are algorithms like InceptionV3 etc. which are capable of classifying 1000 classes. But per my understanding, object classification is different from object recognition.
Is there a way to use those big dataset classification algos for object detection?
Or any other suggestion?
Unfortunately, I do not know where the breaking point is, and of course, it will depend on acceptable evaluation metrics and training data size.
From a technical point of view, there is no hard limit, and if you go to extremes there could be Core ML model size issues and memory issues during inference. However, that will only happen for an extremely large number of classes.
From a modeling perspective (a problem that will appear much earlier than the technical limitation), it is not as clear. As you increase the number of classes, you increase the risk of classification mistakes. However, the severity of many of those mistakes should simultaneously go down, since you will have more and more classes that are naturally similar (breeds of dogs, etc.). The original YOLO9000 paper (https://arxiv.org/pdf/1612.08242.pdf) trained a model using 9000+ classes with reasonable results (lots of mistakes of course, but still impressive). They trained it on a combination of detection and classification data, so if they had actually had detection data for all 9000 classes, the results would presumably be even better.
In your experiment, it sounds like 50-60 classes were OK (thanks for giving us a sample point!). Anything below 100 is definitely tried and true, as long as you have the data. However, will 300 do OK? Will 1000? Theoretically, I would say yes, if you are able to provide enough training data and you adjust your expectation of what a good evaluation metric is, since you know you'll make more mistakes. For instance, for classification with 1000 classes it is common to report top-5 accuracy (that is, the correct label is among your top 5 predicted classes for a sample).
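For illustration, a top-5 accuracy check is only a few lines (a minimal NumPy sketch; the array names are placeholders):
import numpy as np

def top5_accuracy(scores, labels):
    # scores: (n_samples, n_classes) class scores; labels: (n_samples,) true class ids
    top5 = np.argsort(scores, axis=1)[:, -5:]  # indices of the 5 highest-scoring classes
    return np.mean([labels[i] in top5[i] for i in range(len(labels))])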
Here is a useful link - https://github.com/apple/turicreate/issues/968
First, let's level-set on terminology.
Image Classification based neural networks, such as Inception and Resnet, classify an entire image based upon the classes the network was trained on. So if the image has a dog, then the classifier will most likely return the class dog with a higher confidence score as compared to the other classes the network was trained on. To train a network such as this, it's simple enough to group the same class images (all images with a dog) into folders as inputs. ImageNet and Pascal VOC are examples of public labeled datasets for Image Classification.
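As an illustration of that folder-per-class setup, a loader like torchvision's ImageFolder picks the labels up directly from the directory names (a minimal sketch; the paths are hypothetical):
from torchvision import datasets, transforms

# data/train/dog/*.jpg, data/train/cat/*.jpg, ... -> labels come from the folder names
train_data = datasets.ImageFolder(
    'data/train',
    transform=transforms.Compose([transforms.Resize((224, 224)),
                                  transforms.ToTensor()]))
print(train_data.classes)  # e.g. ['cat', 'dog']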
Object Detection based neural networks, on the other hand, such as SSD and YOLO, return a set of coordinates that indicate a bounding box, plus a confidence score, for each class (object) detected, based upon what the network was trained on. To train such a network, each object in an image must be annotated with a set of coordinates that correspond to the bounding box of the class (object). The COCO dataset, for example, is an annotated dataset of 80 classes (objects) with coordinates corresponding to the bounding box around each object. Another popular dataset is Objects365, which contains 365 classes.
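For reference, a COCO-style annotation file boils down to JSON along these lines (a minimal sketch with illustrative values only):
# one image, one annotated object, one category
coco_subset = {
    "images": [{"id": 1, "file_name": "dog.jpg", "width": 640, "height": 480}],
    "annotations": [{
        "id": 1,
        "image_id": 1,
        "category_id": 18,             # "dog" in the COCO category list
        "bbox": [100, 120, 200, 150],  # [x, y, width, height] of the bounding box
        "area": 30000,
        "iscrowd": 0,
    }],
    "categories": [{"id": 18, "name": "dog"}],
}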
Another important type of neural network that the COCO dataset provides annotations for is Instance Segmentation models, such as Mask RCNN. These models provide pixel-level classification and are extremely compute-intensive, but critical for use cases such as self-driving cars. If you search for Detectron2 tutorials, you will find several great learning examples of training a Mask RCNN network on the COCO dataset.
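As a pointer, inference with a COCO-pretrained Mask RCNN in Detectron2 looks roughly like this (a sketch based on the Detectron2 tutorials; the input file name is a placeholder):
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
predictor = DefaultPredictor(cfg)
outputs = predictor(cv2.imread("input.jpg"))  # per-instance boxes, classes, and masks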
So, to answer your question: yes, you can use the COCO dataset (amongst the many other options available publicly on the web) for object detection, or you can create your own dataset with a little effort by annotating your images with bounding boxes around the object classes you want to detect. Try Googling 'using coco to train ssd model' for some easy-to-follow tutorials. SSD stands for single-shot detector and is an alternative neural network architecture to YOLO.
I have a simple question for some of you. I have worked through some image classification tutorials, but only the simpler ones, like the MNIST dataset. Then I noticed that they do this:
train_images = train_images / 255.0
Now I know that every value in the matrix (which is the image) gets divided by 255.0. If I remember correctly, this is called normalization, right? (Please correct me if I am wrong; otherwise, tell me that I am right.)
I'm just curious whether there is a "better way", "another way", or "the best way" to pre-process or clean images before the cleaned images are fed to the network for training.
If you would like to provide some sample source code, please be my guest. I would love to look at code samples.
Thank you!
Pre-processing images prior to image classification can include the following:
normalisation: which you already mentioned; scaling pixel values to [0, 1] as you did, or standardising with the dataset mean and standard deviation, are both common
reshaping into a uniform resolution (image height x image width): higher resolution leads to better learning, while a smaller resolution may lose important features. Some models have a default input size that you can refer to. The average size of all images can be used too.
color channels: 1 refers to gray-scale and 3 refers to RGB. Depending on your application, you can set this.
data augmentation: if your model is overfitting or your dataset is small, you can enlarge your dataset by altering the original images (flipping, rotating, cropping, zooming, ...)
image segmentation: segmentation can be performed to highlight areas or boundaries that may benefit your application. For example, in medical image classification, some parts of the body may be masked to enhance classification performance.
For example, I recently worked on image classification of lung CT scans. For pre-processing, I reshaped the images and made them gray-scale. Then I performed image segmentation to highlight the lungs in the images, and I normalised the image pixels before feeding them into my classification model. Depending on your application, there may be other pre-processing techniques you might want to consider.
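Since you asked for sample code, here is a minimal sketch of such a pipeline (reshape, gray-scale, normalise) using PIL and NumPy; the path and target size are just placeholders:
import numpy as np
from PIL import Image

def preprocess(path, size=(128, 128)):
    img = Image.open(path).convert('L')  # 'L' = single gray-scale channel
    img = img.resize(size)               # uniform resolution
    return np.asarray(img, dtype=np.float32) / 255.0  # normalise pixels to [0, 1]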
This question is almost a duplicate of a post from Cross Validated, but no one has replied to it, and I hope it is okay to ask almost the same question here.
I have been reading and looking at implementations of SRGAN, from Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. I used the PyTorch implementation of SRGAN for 3-channel images, and it makes some decent super-resolution images. However, when I try it with 1-channel images, it fails to generate plausible images.
The grayscale images I use are from the public MSTAR release of high-resolution Synthetic Aperture Radar (SAR) data from SDMS. The dataset contains 2,774 images across 10 classes. Some samples are shown below:
Since SRGAN uses VGG as one of its networks, I had to convert my grayscale images to RGB. I copied the first channel to the other two channels and created a 3-channel image with the following approach:
w, h = hr_image.shape
ret = np.empty((w, h, 3), dtype=np.uint8)
ret[:, :, :] = hr_image[:, :, np.newaxis]  # broadcast the single channel into all three
hr_image = ret
The following are the outputs after 34 epochs (this continues): low resolution, high resolution, super resolution.
What I notice is that the discriminator's loss quickly goes to 0, the generator's loss to 0.08, the generator's score to 0, and the discriminator's score to 1. I assume this means that it is too easy for the discriminator to distinguish between the real and fake images. This presumably causes the generator to stop learning anything new.
I tried to isolate one class in the MSTAR dataset, but that did not change anything. I noticed that others (Super-resolution SAR Image Reconstruction via Generative Adversarial Network) did use SRGAN for SAR images, and it seems to work, but their paper does not explain how they implemented it.
I am wondering if I am using the wrong approach and need to change the loss functions. SRGAN uses MSE, TV loss, and perceptual loss. MSE by itself is not the best loss function, which is explained really well here. But it is probably good to keep the images inside the MSE hypersphere. I also ask myself whether it makes sense to use a network that was trained on low-frequency images on high-frequency images. As I understand it, the loss functions are designed to work really well on low-frequency images (something our eyes like to look at) and not so much on high-frequency images. My questions are therefore:
Should I change the loss functions? Should I pre-train my own network on high-frequency grayscale images so that the network is more suitable for these images?
Why does the generator crash after a few epochs? Why is it so hard for the generator to make plausible transformations? Are the images too "noisy"?
Should I change something in the structure of the generator and discriminator? And should I try to use another pretrained network that is more suitable for grayscale images, and if so, which one?
Should I use a pretrained network that is trained on high frequency images?
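For reference, the perceptual loss discussed above boils down to an MSE between VGG feature maps; a minimal sketch of the idea in PyTorch (the choice of VGG19 slice here is just an example, not what the SRGAN repo necessarily uses):
import torch.nn as nn
from torchvision.models import vgg19

class PerceptualLoss(nn.Module):
    def __init__(self):
        super().__init__()
        # frozen feature extractor: an early slice of an ImageNet-trained VGG19
        self.features = vgg19(pretrained=True).features[:9].eval()
        for p in self.features.parameters():
            p.requires_grad = False
        self.mse = nn.MSELoss()

    def forward(self, sr, hr):
        # compare feature maps of the super-resolved and real high-res images
        return self.mse(self.features(sr), self.features(hr))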
UPDATE, 11 JUNE 2020, WITH NEW RESULTS
Instead of converting all my images to 3 channels, I only did this right before the VGG network (at line 22) with the following commands:
out_images = out_images.repeat(1,3, 1,1)
target_images = target_images.repeat(1, 3, 1,1)
I also made some minor changes to my code, and this changed the output of the model after 250 epochs:
However, as can be seen, it did not get better at producing the super-resolution image. The discriminator quickly learns to distinguish between real and fake images, as can be seen in the loss plot below:
Does anyone have any suggestions on how I can make the generator stronger? I tried adding more layers to the generator and removing some from the discriminator, but with no success.
I'm trying to apply machine learning algorithms available in python's scikit-learn package to predict doodle names from set of doodle images.
Since I'm a complete beginner in machine learning and have no knowledge about how neural networks work yet, I wanted to try scikit-learn's algorithms.
I've downloaded doodles (of cats and guitars) with the help of an API named quickdraw.
Then I load the images with the following code:
import numpy as np
from PIL import Image
import random
#To hold image arrays
images = []
#0-cat, 1-guitar
target = []
#5000 images of cats and guitar each
for i in range(5000):
    #cat images are named like cat0.png, cat1.png ...
    img = Image.open('data/cats/cat'+str(i)+'.png')
    img = np.array(img)
    img = img.flatten()
    images.append(img)
    target.append(0)
    #guitar images are named like guitar0.png, guitar1.png ...
    img = Image.open('data/guitars/guitar'+str(i)+'.png')
    img = np.array(img)
    img = img.flatten()
    images.append(img)
    target.append(1)
random.shuffle(images)
random.shuffle(target)
Then I applied the algorithm:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(images, target, test_size=0.2, random_state=0)
from sklearn.naive_bayes import GaussianNB
GB = GaussianNB()
GB.fit(X_train, y_train)
print(GB.score(X_test, y_test))
Upon running the above code (with other algorithms like SVM and MLP too), my system just freezes. I have to do a force shutdown to get back. I'm not sure why this is happening.
I have tried lowering the number of images to load by changing
for i in range(5000):
to
for i in range(1000):
But then I only get an accuracy of around 50%.
First of all, if I may say so:
> Since I'm a complete beginner in machine learning and have no knowledge about how neural networks work yet, I wanted to try scikit-learn's algorithms.
This is not a good way to approach ML in general. I strongly suggest you at least start studying the basics; otherwise you won't be able to tell what's going on at all (it's not something you can figure out just by trying things).
Back to your problem: applying Naive Bayes methods to raw images is not a good strategy. The problem is that each pixel of your image is a feature, and with images you can easily reach a very high number of dimensions (also, assuming each pixel is independent of its neighbors is not what you want).
NB is commonly used with documents, and looking at this example on Wikipedia might help you understand the algorithm a bit more.
In short, NB boils down to computing joint conditional probabilities, which boils down to counting co-occurrences of features (words in Wikipedia's example, pixels in your case), which in turn boils down to computing a huge matrix of occurrences that you need in order to formulate your NB model.
Now, if your matrix is made of all the words in a set of documents, this can get pretty expensive in both time and space (O(n^2)/2, with n being the number of features); now imagine the matrix being composed of ALL the pixels in your training set, as you're doing in your example... this explodes really fast.
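As a rough back-of-envelope (assuming, say, 256x256 doodles; the actual size may differ):
n_features = 256 * 256                 # one feature per pixel after flatten()
pairs = n_features * (n_features - 1) // 2
print(pairs)                           # ~2.1e9 co-occurrence entries
print(pairs * 8 / 1e9, "GB")           # ~17 GB at 8 bytes per entry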
That's why cutting your dataset down to 1,000 images keeps your PC from running out of memory.
Hope it helps.
I would like to train a CNN for detection and classification of all kinds of signs (mainly laboratory and safety markers) using TensorFlow.
While I can gather enough training data for the classification training set using, e.g., the Bing API, I'm struggling to think of a solution to get enough images for the object detection training set. Since these markers are mostly not publicly available, I thought I could composite a natural scene image with the image of the marker itself to get a training set. Is there any way to do that automatically?
I looked at TensorFlow's data augmentation classes, but they seem to only provide functionality for simpler augmentation tasks.
You can do it with OpenCV as preprocessing.
The algorithm is as follows:
1. Randomly choose a combination of a natural scene image and a sign image.
2. Sample a random position in the natural scene image where the sign image will be pasted.
3. Paste the sign image at that position.
4. Keep the pasted image and the position as part of the training data.
Steps 1 and 2 are done with the Python standard random module or NumPy.
Step 3 is done with opencv-python. See: overlay a smaller image on a larger image python OpenCv.
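A minimal sketch of steps 1-4 (file names are placeholders; a real pipeline would loop over random scene/sign combinations):
import random
import cv2

scene = cv2.imread('scene.jpg')  # natural scene image
sign = cv2.imread('sign.png')    # sign/marker image

sh, sw = sign.shape[:2]
H, W = scene.shape[:2]

# steps 1-2: sample a top-left corner that keeps the sign inside the scene
x = random.randint(0, W - sw)
y = random.randint(0, H - sh)

# step 3: paste the sign (plain overwrite; use alpha blending for transparency)
composite = scene.copy()
composite[y:y+sh, x:x+sw] = sign

# step 4: the box (x, y, x + sw, y + sh) becomes the detection label
bbox = (x, y, x + sw, y + sh)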