how can we apply masked language modelling using multimodal transformer models? - python

It may not be clear from the question, but how can we apply masked language modelling when both text and an image are given, using multimodal models such as VisualBERT or CLIP?
For example, if some text is given and we mask a word in it, how can we apply MLM to predict the masked word as "cat"?
Is it possible to give only the text to the model, without the image?
How can we implement such a thing and get MLM estimates from it using the huggingface library API?
A code snippet explaining this would be great; it would really help me get a better understanding.
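A minimal sketch of what this could look like with the VisualBERT classes in the huggingface transformers library (the checkpoint name, the number of regions and the random visual features are placeholders; in practice the visual features come from a region detector such as Faster R-CNN run on the image):

```python
import torch
from transformers import BertTokenizer, VisualBertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertForPreTraining.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

text = "There is a [MASK] sitting on the mat."
inputs = tokenizer(text, return_tensors="pt")

# Placeholder image features with shape (batch, num_regions, visual_embedding_dim);
# a real pipeline would take these from an object detector run on the image.
visual_embeds = torch.randn(1, 36, 2048)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)

outputs = model(
    **inputs,
    visual_embeds=visual_embeds,
    visual_attention_mask=visual_attention_mask,
    visual_token_type_ids=visual_token_type_ids,
)

# The text tokens come first in prediction_logits, so the [MASK] position can
# be looked up directly from input_ids.
mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = outputs.prediction_logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # ideally "cat" when a cat image is supplied
```

As far as I can tell, the visual inputs are optional keyword arguments, so passing only the text should run; but since the model was pretrained with image regions, text-only predictions may be weaker.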

Related

Python: compare images of a piece of clothing (identification)

As an example, I have two pictures of a particular piece of clothing of a certain brand.
I can download a lot of different images of this same piece of clothing, in the same color.
I want to create a model which can recognize the item based on a picture.
I tried to do it using this example:
https://www.tensorflow.org/tutorials/keras/classification.
This can recognize the type of clothing (e.g. shirt, shoe, trousers, etc.), but not a specific item and color.
My goal is to have a model that can tell me that the person on my first picture is wearing the item of my second picture.
As mentioned, I can upload a few variations of this same item to train my model, if that would be the best approach.
I also tried to use https://pillow.readthedocs.io
This can do something with color recognition but does not solve my initial goal.
I don't think a CNN can help you with your problem; take a look at the SIFT technique (see this for more details). It is used for image matching and I think it's better in your case. If you're not looking to get into too much detail, OpenCV is a Python (and C++, I think) library that has image matching functions that are easy to use (more details).
As mentioned by @nadji mansouri, I would use the SIFT technique as it suits your need. But I just want to correct something: a CNN is also an option in this case. That said, I wouldn't tackle the problem as a classification problem, but rather with distance metric learning, i.e. training a model to generate embeddings that are close in the embedding space when the inputs are similar, and distant otherwise. But to do this you need a large, representative dataset.
In short, I suggest starting with SIFT using OpenCV, or open-source implementations on GitHub, playing around with the parameters to see what fits your case best, and then deciding whether it is really necessary to switch to a neural network; in that case, tackle the problem as a metric learning task, maybe with something like Siamese networks.
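A minimal sketch of the SIFT route with OpenCV (the file names and the ratio-test threshold are only illustrative; query.jpg would be the reference item and scene.jpg the photo of the person wearing it):

```python
import cv2

img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute SIFT descriptors for both images.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Match descriptors and keep only pairs that pass Lowe's ratio test.
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

# The more good matches, the more likely both images show the same item.
print(f"{len(good)} good matches out of {len(matches)}")
```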
Some definitions:
Metric learning is an approach based directly on a distance metric that aims to establish similarity or dissimilarity between data (images in your case). Deep Metric Learning on the other hand uses Neural Networks to automatically learn discriminative features from the data and then compute the metric. source.
The Scale-Invariant Feature Transform (SIFT) is a method used in computer vision to detect and describe local features in images. The algorithm is invariant to image scale and rotation, and robust to changes in illumination and affine distortion. SIFT features are represented by local image gradients, which are calculated at various scales and orientations, and are used to identify keypoints in an image. These keypoints and their associated descriptor vectors can then be used for tasks such as image matching, object recognition, and structure from motion. source, with modification.
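If you do end up going the deep metric learning route, the core idea is a shared embedding network trained with a contrastive or triplet objective. A rough sketch in Keras/TensorFlow (the backbone, embedding size and margin are arbitrary choices for illustration):

```python
import tensorflow as tf

def build_embedding_net(embedding_dim=128):
    # Pretrained backbone with global average pooling, projected to a
    # unit-length embedding vector.
    base = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3), include_top=False, pooling="avg"
    )
    x = tf.keras.layers.Dense(embedding_dim)(base.output)
    x = tf.keras.layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=1))(x)
    return tf.keras.Model(base.input, x)

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Pull the anchor towards the positive (same item) and push it away
    # from the negative (different item) by at least the margin.
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=1)
    return tf.reduce_mean(tf.maximum(pos_dist - neg_dist + margin, 0.0))
```

At inference time you embed the item picture and the picture of the person, and decide a match by thresholding the distance between the two embeddings.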

Keras explainer package like eli5, LIME, SHAP for standard (non-image) classification?

I've been using XGBoost to predict student retention, and have been using eli5 to provide explanations for the individual predictions. For a few different reasons I decided to give deep learning a try, and it performed surprisingly well on the data. However, the explanation bit is a requirement but eli5 only does explanation for Keras on images. Is there a way to get explanations for a Keras model built for regular classification? Having skimmed the LIME paper I don't really see a reason why it shouldn't work, but it seems like it is only available for image and text classification?
My data is just plain old numerical values (gpa, age, activity, test scores etc) and I am trying to predict 0 or 1 (pass/fail).
Thanks
LIME works for Text, Tabular and Image data. So it would work in your case as well.
You should check out LIME's documentation and explore the given examples for better understanding. https://github.com/marcotcr/lime
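A small sketch of how LimeTabularExplainer can be wired up to a Keras model (the feature names and the dummy data below are made up; the key point is that LIME expects the prediction function to return one probability column per class):

```python
import numpy as np
from lime.lime_tabular import LimeTabularExplainer
from tensorflow import keras

# Dummy stand-ins for the tabular training data and the Keras classifier.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
y_train = (X_train[:, 0] + X_train[:, 3] > 0).astype(int)

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X_train, y_train, epochs=5, verbose=0)

def predict_proba(x):
    # A sigmoid model returns a single column; LIME wants one column per class.
    p = model.predict(x)
    return np.hstack([1 - p, p])

explainer = LimeTabularExplainer(
    X_train,
    feature_names=["gpa", "age", "activity", "test_score"],
    class_names=["fail", "pass"],
    mode="classification",
)

exp = explainer.explain_instance(X_train[0], predict_proba, num_features=4)
print(exp.as_list())  # per-feature contributions for this one prediction
```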

Unable to improve the mask RCNN model for document images?

I am training a model to extract all the necessary fields from a resume, for which I am using Mask R-CNN to detect the fields in the image. I have trained my Mask R-CNN model on 1000 training samples with 49 fields to extract. I am unable to improve the accuracy. How can I improve the model? Are there any pretrained weights that may help?
Looks like you want to do text classification/processing: you need to extract details from the text, but you are applying object detection algorithms. I believe you need to use OCR to extract the text (if you have the CV as an image) and then use a text classification model. Check out the links below for more information about text classification:
https://medium.com/@armandj.olivares/a-basic-nlp-tutorial-for-news-multiclass-categorization-82afa6d46aa5
https://www.tensorflow.org/tutorials/tensorflow_text/intro
You can break the problem up in two different ways:
Step 1- OCR seems to be the most direct way to get to your data. But increase the image size, and thus the resolution, otherwise you may lose data.
Step 2- Store the coordinates of each OCRed word. This is valuable information in this context: how words line up has significance.
Step 3- At this point you can try to use basic positional clustering to group words. However, this can easily fail on a columnar vs row-based distribution of related text.
Step 4- See if you can identify which of 49 tags these clusters belong to.
Look at text classification with Hidden Markov Models and the Baum-Welch algorithm, i.e. go for basic models first.
OR
The above ignores the inherent classification opportunity that is the image of a, well, properly formatted CV.
Step 1- Train your model to partition the image into sections without OCR. A good model should not break up sentences, tables, etc. This approach may leverage separator lines, etc. There is also an opportunity to decrease the size of your image, since you are not OCRing yet.
Step 2- OCR the image sections and try to classify them, similar to the above.
Another option is to use neural networks like PixelLink: Detecting Scene Text via Instance Segmentation
https://arxiv.org/pdf/1801.01315.pdf
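To illustrate steps 1-2 of the first approach (OCR that also returns the position of every word), here is a rough sketch using pytesseract; the file name, the upscaling factor and the confidence threshold are only placeholders:

```python
import pytesseract
from pytesseract import Output
from PIL import Image

# Upscale first if the resume scan is low-resolution, then OCR with word boxes.
image = Image.open("resume_page.png")
image = image.resize((image.width * 2, image.height * 2))

data = pytesseract.image_to_data(image, output_type=Output.DICT)

words = []
for i, text in enumerate(data["text"]):
    if text.strip() and float(data["conf"][i]) > 50:
        # Keep each word together with its bounding box for later clustering.
        words.append({
            "text": text,
            "left": data["left"][i],
            "top": data["top"][i],
            "width": data["width"][i],
            "height": data["height"][i],
        })

# `words` can now be grouped by proximity (e.g. rows by similar `top` values)
# before assigning each cluster to one of the 49 fields.
print(words[:10])
```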

Segmentation by map

Good evening,
I have a thermal image of a body and I need to do a segmentation based on a body map.
I am attaching the images.
[body](https://storage.googleapis.com/kaggle-forum-message-attachments/536018/13283/body.jpeg)
[map](https://storage.googleapis.com/kaggle-forum-message-attachments/536018/13284/map.png)
Does anyone have a clue and can help me?
I have tried to overlay the images but, as they are not a perfect fit, it did not work.
I expected to have a series of images, one for each region.
If you want to use a deep learning technique, the approach would be a Generative Adversarial Network (GAN) based method. You can search online; it's all over the place. Search keywords: deep fake, GAN.
The traditional technique is to use a shift-map based method, e.g. object rearrangement using the shift-map technique. OpenCV has a simple implementation for inpainting/retargeting, which you can adapt to the deformable model rearrangement case.
The detailed work can be found at:
http://www.vision.huji.ac.il/shiftmap/inpainting/

Recognition of nipple exposure in an image, and automatically covering the nipple area

I'd like to implement something like the title, but I wonder if it's technically possible.
I know that it is possible to recognize pictures with a CNN,
but I don't know if the nipple area can be covered automatically.
If there is any library or related information,
I would like to get some advice.
CNNs are able to detect whatever you train them for, to varying degrees of accuracy. What you would need are a lot of training samples (i.e. ground-truth pairs of the original image and the labelled image) with which to train your models, and then some new data on which you can test the accuracy of your model. The point is, CNNs are not biased to innately learn a task; you have to tell them what to learn!
I can recommend the machine learning library Keras (https://keras.io/) if you plan to do some machine learning using CNNs, as it's pretty simple and somewhat beginner-friendly. Take some of the tutorials for CNNs, which are quite good.
Essentially, you have what I can only assume is a pretty niche problem. The main issue will come down to how much data you have to train your model. CNNs need a lot of training data, especially for a problem like this which isn't simple. A way which would make this simpler would be to have a model which detects the ahem area of interest and denotes it as such on a per-pixel basis. Then a simple mask could be applied to the source image to censor it. This relates to image segmentation, and there are many academic papers on the topic.
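To make the last step concrete (applying a mask to the source image to censor it), a small sketch with OpenCV; the file names are placeholders and the binary mask is assumed to come from whatever segmentation model you end up training:

```python
import cv2
import numpy as np

# Placeholders: photo.jpg is the input image, mask.png a binary map
# (255 where the region to censor was detected, 0 elsewhere).
image = cv2.imread("photo.jpg")
mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)

# Blur the whole image once, then copy blurred pixels only where the mask is set.
blurred = cv2.GaussianBlur(image, (51, 51), 0)
censored = np.where(mask[..., None] > 0, blurred, image)

cv2.imwrite("censored.jpg", censored)
```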
