I am working on a customer relations chatbot. The user can input either a greeting, an initial_query, or a query related to a product. The initial query is when the user gives their user_id to the chatbot. This is done to filter results from the database.
I created a few training examples to help the chatbot distinguish initial_query from the other intents. The problem is that the chatbot is not able to recognize a user_id as an entity if it is not specified in the training data. For example:
## intent:initial_query
- My name is [Karthik](name) and my user ID is [0234](UserID)
This is one such example for initial_query. Here the user ID specified is 0234, but the database contains many more users, each with a unique user ID, and it is not possible for me to add all the IDs to the training examples.
What should I do to make the bot understand when a user ID is specified? I saw somewhere that lookup tables can be used, but when I tried using lookup tables, the bot still did not recognize IDs that were not part of the training examples.
This is the link I used to try lookup tables in my code.
intent_entity_featurizer_regex does not seem to work for me. I am stuck here, as this is a crucial part of the bot. If lookup tables are not the best solution to this problem, I am also open to other ideas.
Thank you
I'm going to get a bad rap for always saying you need more training data, but I would imagine that is playing a part here as well.
I believe you have a few possible courses of action:
Provide more training data. I've never seen a good intent with fewer than 10 training examples, and that number increases with every possible permutation of an intent as well as with more similar intents.
Use a pre-built entity recognizer like Duckling or spaCy. They won't necessarily know that 1234 is a userId, but they can automatically extract numbers.
If you are using ner_crf with Rasa, it is important to realize that it is actually learning the pattern of utterances and recognizes entities by what surrounds the entity rather than by its actual value.
Also, you could use regexes with Rasa, but the regex featurizer isn't just a lookup tool: it adds a flag to the CRF indicating whether or not a token matches the pattern. Given this, it still needs sufficient training data to learn that the flagged token is important for that entity. A rough sketch of a plain-Python fallback for pulling the ID out of the message is shown below.
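If all you need is to pull the numeric ID out of the message once the intent is classified, a post-processing step outside Rasa's pipeline can also work. Here is a minimal sketch of that fallback using spaCy's number-like tokens with a regex backup; the model name en_core_web_sm and the 2-10 digit pattern are my assumptions, not anything from Rasa:

import re
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_user_id(message):
    # Prefer spaCy's number-like tokens, then fall back to a bare regex.
    doc = nlp(message)
    numbers = [tok.text for tok in doc if tok.like_num]
    if numbers:
        return numbers[0]
    match = re.search(r"\b\d{2,10}\b", message)  # assumed ID shape
    return match.group(0) if match else None

print(extract_user_id("My name is Karthik and my user ID is 0234"))  # -> 0234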
I am working on a Data Quality Monitoring project, which is new to me.
I started with data profiling to analyse my data and get a global view of it.
Next, I thought about defining some data quality rules, but I'm a little bit confused about how to implement these rules.
I'd appreciate it if you could guide me a little, as I'm totally new to this.
This is quite an ambiguous question, but I'll try to offer a few tips on how to start. Since you are new to data quality and already want implementation hints, let's start from that.
Purpose: a data quality monitoring system wants to a) recognize errors and b) trigger the next step in handling them.
First, build a data quality rule for your data set. A rule can be an attribute-, record-, table- or cross-table-level rule. Let's start with an attribute-level rule. Implement a rule that recognizes that an attribute's content does not contain '#'. Run it against the email attribute and create an error record for each row whose email does not contain '#'. An error record should have these attributes (a minimal sketch of such a rule in Python follows the example record below):
ErrorInstanceID; ErrorName; ErrorCategory; ErrorRule; ErrorLevel; ErrorReaction; ErrorScript; SourceSystem; SourceTable; SourceRecord; SourceAttribute; ErrorDate;
"asd2321sa1"; "Email Format Invalid"; "AttributeError"; "Does not contain #"; "Warning|Alert"; "Request new email at next login"; "ScriptID x"; "Excel1"; "Sheet1"; "RowID=34"; "Column=Email"; "1.1.2022"
MONITORING SYSTEM
You need to make the above scripts configurable so that you can easily change systems, tables and columns as well as rules. When run on top of data sets, they will all populate error records into the same structures, resulting in a consistent, historical store of all errors. You should be able to build reports about existing errors in specific systems, trends of errors appearing or getting fixed, and so on. A hedged sketch of such a configuration follows.
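As a rough illustration of that configurability (the structure and rule names here are assumptions, not a standard):

# Each entry binds one rule to one system/table/column. A small runner can
# loop over this list and append every error record it produces to the same
# error store, which gives you one consistent history across systems.
MONITORING_CONFIG = [
    {"source_system": "Excel1", "source_table": "Sheet1",
     "column": "Email", "rule": "contains_hash", "error_level": "Warning"},
    {"source_system": "CRM", "source_table": "Customer",
     "column": "Email", "rule": "contains_hash", "error_level": "Alert"},
]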
Next, you need to start building a full-scale data quality metadata repository with a proper data model, and design suitable historical versioning for the above information. You need to store information such as which rules were run and when, which systems and tables they checked, and so on. This lets you detect which systems have been included in monitoring and also recognize when systems are not being monitored with the correct rules; in practice, it is quality monitoring for the data quality monitoring system itself. You should have statistics on which systems are monitored with which rules, when they were last run, and aggregates of inspected tables, records and errors.
Typically, it's more important to focus on errors that need immediate attention and either "alert" an end user to go fix the issue or trigger a new workflow or flag in the source system. For example, plain invalid emails might be categorized as warnings and end up as aggregate statistics only. We have 2134223 invalid emails. Nobody cares. However, it is more important to recognize the invalid email of a person who has chosen to receive their bills as digital invoices by email. Alert. That kind of error (Invalid Email AND Email Invoicing) should trigger an alert and set a flag in the CRM for end users to try to get the email fixed. There should not be any error records for this error. But this kind of rule should be run on top of all systems that store customer contact and billing preferences.
For a technical person, I can recommend this book. It goes deeper into the technical and logical issues of data quality assessment and monitoring systems, and it also includes a small metadata model for data quality metadata structures. https://www.amazon.com/Data-Quality-Assessment-Arkady-Maydanchik/dp/0977140024/
I am working on automating the task flow of an application using text-based Natural Language Processing.
It is something like a chat application where the user can type in a text area. At the same time, Python code interprets what the user wants and performs the corresponding action.
The application has commands/actions like:
Create Task
Give name to task as t1
Add time to task
Connect t1 to t2
The users can type in chat (natural language). It will be like a general English conversation, for example:
Can you create a task with name t1 and assign time to it. Also, connect t1 to t2
I could write a rule-driven parser, but it would be limited to a few rules only.
Which approach or algorithm can I use to solve this task?
How can I map general English to command or action?
I think the best solution would be to use an external service like API.ai or wit.ai. You can create a free account and then you can map certain texts to so-called 'intents'.
These intents define the main actions of your system. You can also define 'entities' that would capture, for instance, the name of the task. Please have a look at these tools. I'm sure they can handle your use case.
I think your issue is related to rule-based systems (Wiki).
You need two basic components at the core of a project like this (a minimal sketch follows the list below):
1- Rule base:
a list of your rules.
2- Inference engine:
infers information or takes action based on the interaction of input and the rule base.
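A minimal sketch of those two components in Python; the patterns and action names are placeholders I made up, and a real rule base would be far richer:

import re

# 1- Rule base: each rule maps a pattern in the user's text to an action name.
RULE_BASE = [
    (re.compile(r"create .*task.*name (\w+)", re.I), "create_task"),
    (re.compile(r"connect (\w+) to (\w+)", re.I), "connect_tasks"),
    (re.compile(r"assign time", re.I), "add_time"),
]

# 2- Inference engine: matches the input against the rule base and returns
# the actions to take together with the captured arguments.
def infer(text):
    actions = []
    for pattern, action in RULE_BASE:
        match = pattern.search(text)
        if match:
            actions.append((action, match.groups()))
    return actions

print(infer("Can you create a task with name t1 and assign time to it. Also, connect t1 to t2"))
# [('create_task', ('t1',)), ('connect_tasks', ('t1', 't2')), ('add_time', ())]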
spaCy is a Python library that I think will help you (more information).
You may want to try nltk. This is an excellent library for NLP and comes with a handy book to get you started. I think you may find chapter 8 helpful for finding sentence structure, and chapter 7 useful for figuring out what your user is requesting the bot to do. I would recommend you read the entire thing if you have more than a passing interest in NLP, as most of it is quite general and can be applied outside of NLTK.
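For instance, here is a quick look at what NLTK's tagger makes of the example sentence; the download names may vary slightly across NLTK versions:

import nltk

# One-time downloads for the tokenizer and POS tagger models.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "Can you create a task with name t1 and assign time to it."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
print(tagged)
# e.g. [('Can', 'MD'), ('you', 'PRP'), ('create', 'VB'), ('a', 'DT'), ('task', 'NN'), ...]
# Chapter 7's chunking techniques build on tags like these to pull out the
# verb/object pairs ("create" -> "task") that the bot needs to act on.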
What you are describing is a general problem with quite a few possible solutions. Your business requirements, which we do not know, are going to heavily influence the correct approach.
For example, you will need to tokenize the natural language input. Should you use a rules-based approach, or a machine learning one? Maybe both? Let's consider your input string:
Can you create a task with name t1 and assign time to it. Also, connect t1 to t2
Our system might tokenize this input in the following manner:
Can you [create a task] with [name] [t1] and [assign] [time] to it. Also, [connect] [t1] to [t2]
The brackets indicate semantic information, entirely without structure. Does the structure matter? Do you need to know that connect t1 is related to t2 in the text itself, or can we assume that it is because all inputs are going to follow this structure?
If the input will always follow this structure, and will always contain these kinds of semantics, you might be able to get away with parsing it using regular expressions and feeding the results into prebuilt methods.
If the input is instead going to be true natural language (i.e., you are building a Siri or Alexa competitor), then this is going to be wildly more complex, and you aren't going to get a useful answer in an SO post like this. You would instead have a few thousand SO posts ahead of you, assuming you have sufficient familiarity with both linguistics and computer science to approach the problem systematically.
Let's say the text is "Please order a pizza for me" or "May I have a cab booking from Uber".
Use a good library like nltk and parse these sentences. As social English is generally grammatically incorrect, you might have to train your parser with your own corpora of broken English. Next, these are the steps you have to follow to get an idea of what a user wants.
Find the full stops in a paragraph, keeping in mind abbreviations and lingo like "....", "???", etc.
Next, find all the verbs and noun phrases in the individual sentences; this can be done through POS (part-of-speech) tagging with various libraries.
After that the real work starts. My approach would be to create a graph of verbs where similar verbs are close to each other and dissimilar verbs are far apart.
Let's say you have words like arrange, instruct, command, direct, dictate, which are close to order. So if your user writes any one of these verbs in their text, your algorithm will identify that the user really means order. You can also use the edges of that graph to specify the context in which the verb was used.
Now, you have to assign an action to this verb "order" based on the noun phrases that were parsed from the original sentence.
This is just a high-level explanation of the algorithm; it has many problems that need serious consideration, some of which are listed below.
Finding the similarity index between the root verb and a given verb in a very short time.
New words that don't have an entry in the graph. A possible approach is to update your graph by searching Google for the word, finding its context from the pages on which it is mentioned, and finding an appropriate place for the new word in the graph.
Similarity indexes of misspelled words with proper verbs or nouns.
If you want to build a more sophisticated model, you can construct a graph for every part of speech and select appropriate words from each graph to form sentences in response to queries. The graph described above is for the verb part of speech. A rough sketch of the verb-matching step follows.
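The verb graph above would be hand-built; as a hedged stand-in for it, WordNet path similarity (via nltk) can approximate how close an incoming verb is to the root verbs your application knows. The root-verb list and threshold below are assumptions:

from nltk.corpus import wordnet as wn  # requires a one-time nltk.download("wordnet")

# Root verbs the application actually knows how to handle (illustrative).
ROOT_VERBS = ["order", "create", "connect", "cancel"]

def verb_similarity(verb_a, verb_b):
    # Best WordNet path similarity between any verb senses of the two words.
    best = 0.0
    for syn_a in wn.synsets(verb_a, pos=wn.VERB):
        for syn_b in wn.synsets(verb_b, pos=wn.VERB):
            best = max(best, syn_a.path_similarity(syn_b) or 0.0)
    return best

def closest_root_verb(verb, threshold=0.3):
    # Map an arbitrary verb to the nearest known root verb, if any.
    score, root = max((verb_similarity(verb, r), r) for r in ROOT_VERBS)
    return root if score >= threshold else None

print(closest_root_verb("arrange"))  # likely 'order' under these assumptions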
Although, as #whrrgarbl rightly points out, it seems like you do not want to train a bot yourself.
So, to handle language input variations (lexical, semantic, ...) you would need a pre-trained bot which you can customize (or maybe just add rules to, according to your needs).
The easiest business-oriented solution is Amazon Lex. There is a free preview program too.
Another option would be to use Google's Parsey McParseface (a pre-trained English parser; there is support for 40 languages) and integrate it with a chat framework. Here is a link to a Python repo whose author claims to have made the installation and training process convenient.
Lastly, this provides a comparison of various chatbot platforms.
I wrote an app to record user interactions with the website search box.
The query string is saved as an object of the model SearchQuery. Whenever a user enters some data in the search box, I save the search query and some info related to the query in the database.
The idea is to capture search trends.
The fields in my database model are (a minimal model sketch follows the list):
A Character Field (max_length=30)
A PositiveIntegerField
A BooleanField
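For concreteness, here is a minimal sketch of such a model; the field names are my assumptions, since the question only gives the field types:

from django.db import models

class SearchQuery(models.Model):
    # Field names are illustrative; only the types come from the question.
    query = models.CharField(max_length=30)
    hits = models.PositiveIntegerField()
    had_results = models.BooleanField(default=False)
    # created_at is an extra assumption, handy for analysing trends over time.
    created_at = models.DateTimeField(auto_now_add=True)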
My questions are:
How many objects can be instantiated from the model SearchQuery? Is there a limit on their number?
As the objects are not related (no DB relationships), should I use MongoDB or some kind of NoSQL for performance?
Is this a good design or should I do some more work to make it efficient?
Django version 1.6.5
Python version 2.7
How many objects can be instantiated from the model SearchQuery? Is there a limit on their number?
As many as your chosen database can handle; this is probably in the millions. If you are concerned, you can use a scheduler to delete older queries when they are no longer useful.
As the objects are not related (no DB relationships), should I use MongoDB or some kind of NoSQL for performance?
You could, but it's unlikely to give you much (if any) efficiency gain. Because you are doing frequent writes and (presumably) infrequent reads, this is unlikely to hit the database very hard at all.
Is this a good design or should I do some more work to make it efficient?
There are probably two recommendations I'd make.
a. If you are going to be doing frequent reads on the Search log, look at using multiple databases. One for your log, and one for everything else.
b. Consider just using a regular log file for this information. Again, you will probably only be examining this data infrequently, so there are strong arguments for piping it into a log file, probably CSV-like, to make later analysis easier. A small sketch of that approach follows.
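As a minimal sketch of option (b), assuming the search view hands you the query term; the file path and column choices are my assumptions:

import csv
from datetime import datetime

SEARCH_LOG = "/var/log/myapp/search_queries.csv"  # assumed path

def log_search_query(query, num_results, had_results):
    # Append one CSV row per search instead of writing one DB row per search.
    with open(SEARCH_LOG, "a") as f:
        csv.writer(f).writerow([
            datetime.utcnow().isoformat(),
            query[:30],        # mirrors the CharField max_length
            num_results,
            int(had_results),
        ])

# Called from the search view, for example:
# log_search_query(request.GET.get("q", ""), results.count(), bool(results))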
I decided to ask this question after going back and forth hundreds of times trying to place error-handling routines to optimize data integrity while also taking into account speed and efficiency (and wasting hundreds of hours in the process). So here's the setup.
Database (MongoDB) -> Python classes that represent the data -> Python code that serves the pages (Pyramid) -> JavaScript web interface
I want the data to be robust; that is the number one requirement. So right now I validate data on the JavaScript side of the page, but I also validate in the Python classes, which more or less represent the data structures. While most server routines run through the Python classes, that sometimes feels inefficient given that everything has to pass through different levels of error checking.
EDIT: I guess I should clarify. I am not looking to unify validation of client- and server-side code. Sorry for the bad write-up. I'm looking more to figure out where the server-side validation should be done: should it be in the direct interface to the database, or in the web server code where the data is received?
For instance, if I have an object with a barcode, should I validate the barcode in the code that receives the data through AJAX, or should I just call the object's class and validate there?
Again, are there some sort of guidelines on how to do error checking in general? I want to do this professionally and learn, but hopefully without having to go through a whole book.
I am not a software engineer, but I hope those of you who are familiar with complex projects can tell me where I can find a few guidelines on how to model and error-check in a situation like this.
I'm not necessarily looking for a definitive answer, but more for a pointer to a short set of guidelines for creating projects with different layers like this. Hopefully nothing extremely long.
I don't even know what tags to use in the post. HELP!!
Validating on the client and validating on the server serve entirely different purposes. Validating on the server is to make sure your model invariants hold and has to be done to maintain data integrity. Validating on the client is so the user gets a friendly error message telling him that his input would have violated data integrity, instead of having a traceback blow up in his face.
So there's a subtle difference in that when validating on the server you only really care whether or not the data is valid. On the client you also care, on a finer-grained level, why the input could be invalid. (Another thing that has to be handled at the client is an input format error, i.e. entering characters where a number is expected.)
It is possible to meet in the middle a little. If your model validity constraints are specified declaratively, you can use that metadata to generate some of the client validations, but they're not really sufficient. A good example would be user registration. Commonly you want two password fields, and you want the input in both to match, but the model will only contain one attribute for the password. You might also want to check the password complexity, but it's not necessarily a domain model invariant. (That is, your application will function correctly even if users have weak passwords, and the password complexity policy can change over time without the data integrity breaking.)
Another problem specific to client-side validation is that you often need to express a dependency between the validation checks. Say you have a required field that must be a number lower than 100. You need to validate that a) the field has a value; b) the field value is a valid integer; and c) the field value is lower than 100. If any of these checks fails, you want to avoid displaying unnecessary error messages for further checks in the sequence, so you can tell the user what his specific mistake was. The model doesn't need to care about that distinction. (Aside: this is where some frameworks fail miserably; JSF or Spring MVC, or possibly both, first attempt data-type conversion from the input strings to the form object properties, and if that fails, they cannot perform any further validations.)
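As a minimal sketch of that dependent check chain (plain Python for brevity; the field and the limit of 100 come from the example above):

def validate_quantity_field(raw_value):
    # Run the dependent checks in order and report only the first failure.
    if raw_value is None or raw_value.strip() == "":   # a) field has a value
        return "This field is required."
    try:
        value = int(raw_value)                          # b) value is a valid integer
    except ValueError:
        return "Please enter a whole number."
    if value >= 100:                                    # c) value is lower than 100
        return "The value must be lower than 100."
    return None  # valid

print(validate_quantity_field(""))     # This field is required.
print(validate_quantity_field("abc"))  # Please enter a whole number.
print(validate_quantity_field("150"))  # The value must be lower than 100.
print(validate_quantity_field("42"))   # None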
In conclusion, the above implies that if you care about both data integrity and usability, you necessarily have to validate data at least twice, since the validations serve different purposes even if there is some overlap. Client-side validation will have more checks, and finer-grained checks, than the model-layer validation. I wouldn't really try to unify them except where your chosen framework makes it easy. (I don't know about Pyramid; Django keeps these concerns separate in that Forms are a different layer from your Models, both can be validated, and they're joined by ModelForms that let you add validations on top of those performed by the model.)
Not sure I fully understand your question, but error handling in pymongo is documented here:
http://api.mongodb.org/python/current/api/pymongo/errors.html
Not sure if you're using a particular ORM - the docs have links to what's available, and these individually have their own best usages:
http://api.mongodb.org/python/current/tools.html
Do you have a particular ORM that you're using, or are you implementing your own through pymongo?
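In case it helps, a bare-bones sketch of catching pymongo's exceptions around a write, assuming plain pymongo rather than an ODM; the database and collection names are placeholders:

from pymongo import MongoClient
from pymongo.errors import ConnectionFailure, DuplicateKeyError, PyMongoError

client = MongoClient("mongodb://localhost:27017/")
collection = client["mydb"]["barcodes"]  # placeholder names

def save_barcode(doc):
    try:
        collection.insert_one(doc)
    except DuplicateKeyError:
        # A unique index on 'barcode' would raise this for repeats.
        return "duplicate"
    except ConnectionFailure:
        return "database unavailable"
    except PyMongoError as exc:
        # Catch-all for other driver-level errors.
        return "error: %s" % exc
    return "ok"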
In my app, I have the following process:
Get a very long list of people
Create an entity for each person
Send an email to each person (step 2 must be completed before step 3 starts)
Because the list of people is very large, I don't want to put them in the same entity group.
In doing step 3, I can query the list of people like this:
Person.all()
Because of eventual consistency, I might miss some people in step 3. What is a good way to ensure that I am not missing anyone in step 3?
Is there a better solution than this?:
while Person.all().count() < N:
    pass

for p in Person.all():
    # do whatever
EDIT:
Another possible solution came to mind: I could create a linked list of the people. I can store a link to the first one, who links to the second one, and so on. It seems that performance would be poor, however, because you'd be doing each get separately and wouldn't have the efficiency of a query.
UPDATE: I reread your post and saw that you don't want to put them all in the same entity group. I'm not sure how to guarantee strong consistency without doing so. You might want to restructure your data so that you don't have to put them all in one entity group, but rather in several, perhaps depending on some aspect of a group of Person entities (e.g., the mailing list they are on, the type of email being sent, etc.). Does each Person only contain a name and an email address, or are there other properties involved?
Google suggests a few other alternatives:
If your application is likely to encounter heavier write usage, you may need to consider using other means: for example, you might put recent posts in a memcache with an expiration and display a mix of recent posts from the memcache and the Datastore, or you might cache them in a cookie, put some state in the URL, or something else entirely. The goal is to find a caching solution that provides the data for the current user for the period of time in which the user is posting to your application. Remember, if you do a get, a put, or any operation within a transaction, you will always see the most recently written data.
So it looks like you may want to investigate those possibilities, although I'm not sure how well they would translate to what your app needs.
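As a rough sketch of the memcache idea applied to your Person entities; the memcache key and expiry are assumptions, and this only papers over eventual consistency for recently written entities:

from google.appengine.api import memcache
from google.appengine.ext import db

RECENT_KEYS = "recent_person_keys"  # assumed memcache key

def remember_person(person):
    # Call right after person.put(): track the key of the fresh write.
    keys = memcache.get(RECENT_KEYS) or []
    keys.append(str(person.key()))
    memcache.set(RECENT_KEYS, keys, time=600)  # 10-minute expiry (assumption)

def all_people():
    # Merge the (possibly stale) query with recently written entities;
    # gets by key always see the most recently written data.
    # Person is the model from the question.
    people = dict((str(p.key()), p) for p in Person.all())
    for key_str in memcache.get(RECENT_KEYS) or []:
        if key_str not in people:
            person = db.get(db.Key(key_str))
            if person is not None:
                people[key_str] = person
    return people.values()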
ORIGINAL POST: Use ancestor queries.
From Google's "Structuring Data for Strong Consistency":
To obtain strongly consistent query results, you need to use an ancestor query limiting the results to a single entity group. This works because entity groups are a unit of consistency as well as transactionality. All data operations are applied to the entire group; an ancestor query won't return its results until the entire entity group is up to date. If your application relies on strongly consistent results for certain queries, you may need to take this into consideration when designing your data model. This page discusses best practices for structuring your data to support strong consistency.
So when you create a Person entity, set a parent for it. I believe you could even just have a specific entity be the "parent" of all the others, and it should give you strong consistency. (Although I like to structure my data a bit with ancestors anyway.)
# Gives you the ancestor key
def ancestor_key(kind, id_or_name):
    return db.Key.from_path(kind, id_or_name)

# kind is the db model you're using (should be 'Person' in this case) and
# id_or_name should be the key id or name for the parent
new_person = Person(your_params, parent=ancestor_key('Kind', id_or_name))
You could even do queries at that point for all the entities with the same parent, which is nice. But that should help you get more consistent results regardless.