I can't extract a name using Duckling in Rasa 2.0 - Python

I want to extract a name using Duckling, but it keeps failing, saying "failed to extract requested slot 'name'". Can anyone explain this to me?

Duckling does not extract names; it specializes in regularly patterned entities like numbers and dates (see their list of supported dimensions). SpaCy offers pretrained models that usually have a PERSON label, which might be what you want. See e.g. the label scheme for the English models.
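As a rough illustration, here is a minimal sketch of pulling PERSON entities out of a message with SpaCy's small English model (the model name and example sentence are assumptions, not part of the original question):
import spacy
# Assumes the model was installed via: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("My name is Sarah Connor and I live in Los Angeles.")
names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
print(names)  # e.g. ['Sarah Connor']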

Related

How to extract unique words and their POS tags into separate columns while working with a dataset

I am working through Indonesian data to use it for NER, and as far as I can tell, there is no pretrained NLTK model for this language. So, to do this manually, I tried to extract all the unique words used in the entire data frame. I still don't know how to apply tags to the words, but this is what I did so far.
the first step,
the second step,
the third step,
the fourth step
Please let me know if there is a more convenient way to do this than what I did in the following code. Also, let me know how to add tags to each row (if possible) and how to do NER for this.
(I am new to coding, which is why I don't know how to ask well, but I am trying my best to provide as much information as possible.)
Depending on what you want to do, if results are all that matters, you could use a pretrained transformer model from Hugging Face instead of NLTK. This will be more computationally heavy but will also give you better performance.
There is one fitting model I could find (I don't speak Indonesian, so excuse any errors in the sample sentence):
https://huggingface.co/cahya/xlm-roberta-large-indonesian-NER?text=Nama+saya+Peter+dan+saya+tinggal+di+Berlin.
The easiest way to use this would probably be either the API or an inference-only pipeline; check out this guide. All you would have to do to get this running for the Indonesian model is to replace the previous model path (dslim/bert-base-NER) with cahya/xlm-roberta-large-indonesian-NER.
Note that this Indonesian model is quite large, so you need some decent hardware. If you don't have it, you could alternatively use a (free) cloud computing service such as Google Colab.
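A minimal sketch of such an inference-only pipeline, assuming the transformers library is installed and the model linked above is still available on the Hub (the aggregation_strategy argument needs a reasonably recent transformers version):
from transformers import pipeline
ner = pipeline(
    "ner",
    model="cahya/xlm-roberta-large-indonesian-NER",
    aggregation_strategy="simple",  # merge word pieces back into whole entities
)
text = "Nama saya Peter dan saya tinggal di Berlin."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))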

How to load our own data set for training

I want to train a model that will tell us the PM2.5 (this value describes AQI) of any image. For this I use a CNN, and I am using TensorFlow for this purpose. I am new to this field. Please tell me how to load our own dataset and separate its names and tags. The format of an image name is "imageName_tag" (e.g. ima01_23.4).
I think we need more information about your case regarding how to load your own dataset.
However, if your dataset is on your computer and you want to access it from Python, I invite you to take a look at the "glob" and "os" libraries.
To split the name (which in your case is "imageName_tag") you can use:
string = "imageName_tag"
name, tag = string.split('_')
As you'll have to do it for all your data, you'll have to use it in a loop and store the extracted information in lists, as in the sketch below.
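Here is a minimal sketch of such a loop, assuming the images sit in a local folder and the file names follow the "imageName_tag" pattern described above (the folder name and file extension are illustrative):
import glob
import os
names, tags = [], []
for path in glob.glob("dataset/*.jpg"):  # adjust the folder and extension to your data
    filename = os.path.splitext(os.path.basename(path))[0]  # e.g. "ima01_23.4"
    name, tag = filename.split('_')
    names.append(name)
    tags.append(float(tag))  # the tag is the PM2.5 value used as the training target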

Methods to extract keywords from large documents that are relevant to a set of predefined guidelines using NLP/semantic similarity

I need suggestions on how to extract keywords from a large document. The keywords should be in line with what we have defined as the intended search results.
For example,
Given a document about a company, I need the owner's name, where the office is situated, and what the operating industry is, and the defined set of words would be,
{owner, director, office, industry...} - (1)
The intended output has to be something like,
{Mr. Smith James, , Main Street, Financial Banking} - (2)
I was looking for a method based on semantic similarity, where sentences containing words similar to the given set (1) would be extracted, and POS tagging would then be used to extract nouns from those sentences.
It would be useful if further resources supporting this approach could be provided.
What you want to do is referred to as Named Entity Recognition.
In Python there is a popular library called SpaCy that can be used for that. The standard models are able to detect 18 different entity types, which is a fairly good amount.
Persons and company names should be extracted easily, while whole addresses and the industry might be more difficult. Maybe you would have to train your own model on these entity types; SpaCy also provides an API for training your own models.
Please note that you need quite a lot of training data to get decent results. Start with 1000 examples per entity type and see if that's sufficient for your needs. POS tags can be used as a feature.
If your data is unstructured, this is probably one of the most suitable approaches. If you have more structured data, you could take advantage of that.
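As a rough sketch of that first step, this pulls the entities out of a company description with SpaCy's pretrained English model (the model name and example text are assumptions; mapping the predicted labels onto your categories such as "owner" or "office" is still up to you):
import spacy
nlp = spacy.load("en_core_web_sm")  # install via: python -m spacy download en_core_web_sm
text = ("The company is led by Mr. Smith James from its office on Main Street "
        "and operates in the financial banking industry.")
for ent in nlp(text).ents:
    print(ent.label_, ent.text)  # prints each detected entity with its label (PERSON, ORG, GPE, ...)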

Machine learning algorithm which gives multiple outputs mapped from single input

I need some help. I am working on a problem where I have the OCR output of an invoice image, and I want to extract certain data from it, like the invoice number, amount, date, etc., which are all present within the OCR output. I tried a classification model where I passed each sentence from the OCR individually to the model to predict whether it is the invoice number, the date, or something else, but this approach takes a lot of time and I don't think it is the right approach.
So, I was wondering whether there is an algorithm where I can have an input string and get outputs mapped from that string, like the invoice number, date, and amount present within the string.
E.g.:
Input string: The invoice 1234 is due on 12 oct 2018 with amount of 287
Output: Invoice Number: 1234, Date: 12 oct 2018, Amount: 287
So, my question is: is there an algorithm which I can train on several invoices and then use to make predictions?
Essentially you are looking for NER (named entity recognition). There are multiple free and paid tools available for intent and entity mapping. You can use Google Dialogflow, MS LUIS, or the open-source Rasa for entity identification in a given text.
If you want to develop your own solution, you can look at OpenNLP too.
Please report back on how these work out for your problem.
What you are searching for is invoice data extraction with ML. There are plenty of ML algorithms available, but none of them is made for your use case out of the box. Why? Because it is a very specific use case. You can't just feed sentences into TensorFlow as-is, even though a model can return multiple outputs.
You could use NLP (natural language processing) approaches to extract the data. This is what Taggun uses to extract data from receipts. In that case you can work with the sentences alone, but you will still need to convert them into a form NLP tools can handle (tokenization).
You could use deep learning (e.g. TensorFlow). In that case you need to turn your sentences into vectors that can be fed into a neural network. This approach needs much more creativity, as there is no standard way to do it; the goal is to describe every sentence as well as possible. But there is still one problem: how to handle dates, amounts, etc. Would it help the NN if you marked sentences with contains_date True/False? Probably yes. A small feature sketch follows.
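A minimal sketch of such hand-crafted sentence features, assuming simple regular expressions are good enough for a first pass (the feature names and patterns are illustrative, not a standard):
import re
def sentence_features(sentence):
    lower = sentence.lower()
    return {
        "contains_date": bool(re.search(r"\b\d{1,2}\s+[a-z]{3,9}\s+\d{4}\b", lower)),  # e.g. "12 oct 2018"
        "contains_amount_word": "amount" in lower,
        "contains_invoice_word": "invoice" in lower,
        "contains_number": bool(re.search(r"\d+", lower)),
    }
print(sentence_features("The invoice 1234 is due on 12 oct 2018 with amount of 287"))
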
A similar approach is used in invoice data extraction services like:
rossum.ai
typless.com
So if you are doing it for fun or research, I suggest starting with a really simple invoice. Try to write a program that will extract the invoice number, issue date, supplier, and total amount with parsing and if statements. It will help you define properties for the feature-vector input of a NN, for example contains_date, contains_total_amount_word, etc. See this tutorial to start with neural networks. A toy rule-based example is sketched below.
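A toy sketch of that rule-based starting point, run on the example sentence from the question above (the patterns are assumptions tuned to that one sentence, not a general solution):
import re
text = "The invoice 1234 is due on 12 oct 2018 with amount of 287"
invoice_number = re.search(r"invoice\s+(\d+)", text, re.IGNORECASE)
date = re.search(r"\b(\d{1,2}\s+\w{3,9}\s+\d{4})\b", text)
amount = re.search(r"amount\s+of\s+([\d.]+)", text, re.IGNORECASE)
print("Invoice Number:", invoice_number.group(1) if invoice_number else None)
print("Date:", date.group(1) if date else None)
print("Amount:", amount.group(1) if amount else None)
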
If you are using it for work, I suggest taking a look at one of the existing services for invoice data extraction.
Disclaimer: I am one of the creators of typless. Feel free to suggest edits.

Extracting personal attributes from text

I'd like to extract personal attributes from a text written by a person, e.g.,
I have always been interested in professional cycling. Being a single mother, it was never easy to find enough time to pursue a sport professionally. The best I could do was to go for short rides along Melbourne's beautiful beaches...
Ideally, I'd want to extract something like cycling: interest, female: gender, sports: interest, Melbourne: location. I think this is called named entity extraction, but I'm not sure. I tried the Stanford Named Entity Recognizer and it didn't give me quite what I wanted. The most important things are personal attributes, such as gender, age, and interests, and it missed most of these on different samples.
Is there any tool/library (preferably in Python) that can help me do this? I know about NLTK, but I don't know how/if I can utilize it here.
The Stanford Named Entity Tagger normally ships with some default classifiers that only cover general tags like 'Name', 'Location', and 'Organization'. If you need other tags, you have to train your own classifier. You can refer to this for creating a new classifier. I have created a custom model and it works fine.
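Once such a custom model has been trained and serialized, a minimal sketch of loading it from Python via NLTK's Stanford interface might look like this (the model and jar paths are placeholders, and this assumes a local Java installation plus the Stanford NER jar):
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize  # may require nltk.download('punkt')
# Placeholder paths for your own trained model and the Stanford NER jar.
tagger = StanfordNERTagger("my-custom-ner-model.ser.gz", "stanford-ner.jar", encoding="utf8")
text = "I have always been interested in professional cycling in Melbourne."
print(tagger.tag(word_tokenize(text)))  # list of (token, tag) pairs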
