I have the following search:
titles = Title.objects.filter(title__icontains=search)
If this is a search to find:
Thomas: Splish, Splash, Splosh
I can type in something like "Thomas" or "Thomas: Splish, Splash, Splosh" and it will work.
However, if I type in something like "Thomas Splash", it will not work. How would I improve the search to handle that? Also note that if we split on words, the comma and other non-alphanumerics should be ignored -- for example, the split words should not be "Thomas:" and "Splish," but rather "Thomas", "Splish", etc.
This kind of search is starting to push the boundaries of Django and the ORM. Once it gets to this level of complexity, I always switch over to a system that is built entirely for search. I dig Lucene, so I usually go for Elasticsearch or Solr.
Keep in mind that full text searching is a subsystem all unto itself, but can really add a lot of value to your site.
As Django models are backed by database queries, there is not much magic you can do.
You could split your search on non-alphanumeric characters and look for objects containing all of the words, but this will not be particularly smart or efficient.
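If you do want to try that in the ORM, here is a minimal sketch of the split-and-AND idea, assuming the Title model from the question:
import re
from django.db.models import Q

def search_titles(search):
    # Split on non-alphanumeric characters and drop empty tokens,
    # so "Thomas: Splish," yields "Thomas" and "Splish".
    words = [w for w in re.split(r'\W+', search) if w]
    # Require every word to appear in the title (AND semantics).
    query = Q()
    for word in words:
        query &= Q(title__icontains=word)
    return Title.objects.filter(query)
A search for "Thomas Splash" then matches "Thomas: Splish, Splash, Splosh", at the cost of one LIKE clause per word.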
If you want something really smart maybe you should check out haystack:
http://haystacksearch.org/
I have a rather convoluted Mongo collection and I'm trying to implement detailed matching criteria. I have already created a text index across all fields as follows:
db.create_index([("$**", "text")], name='allTextFields')
I am using this for some straightforward search terms in PyMongo (e.g., "immigration") as follows:
db.find({'$text': {'$search': "immigration"}})
However, there are certain terms I need to match that are generic enough to require regex-type specifications. For instance, I want to match all occurrences of "ice" without finding "police" and a variety of other exclusion terms.
Ideally, I could create a regex that would search all fields and subfields (see example below), but I can't figure out how to implement this in PyMongo (or Mongo for that matter).
db.find({all_fields_and_subfields: {'$regex': '^ice\s*', '$options': 'i'}})
Does anyone know how to do so?
One way of doing this is to add another field to the documents which contains a concatenation of all the fields you want to search, and $regex on that.
Note that unless your regexes are anchored to the beginning of input, they won't be using indexes (so you'll be doing collection scans).
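A rough sketch of that idea in PyMongo, assuming hypothetical database/collection names and concatenating only top-level string values for brevity:
from pymongo import MongoClient

coll = MongoClient().mydb.mycollection  # hypothetical names

# Maintain a concatenated field to run the regex against.
for doc in coll.find({}):
    blob = " ".join(v for v in doc.values() if isinstance(v, str))
    coll.update_one({"_id": doc["_id"]}, {"$set": {"search_blob": blob}})

# A word-boundary regex matches "ice" without matching "police".
cursor = coll.find({"search_blob": {"$regex": r"\bice\b", "$options": "i"}})
Keeping search_blob up to date on every write is the main cost of this approach.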
I am surprised that a full-text query for "ice" finds "police"; surely that's a bug somewhere.
You may also consider Atlas search instead of full-text search, which is more powerful but proprietary to Atlas.
I am working on automating task flow of application using text based Natural Language Processing.
It is something like a chat application where the user can type in a text area. At the same time, Python code interprets what the user wants and performs the corresponding action.
Application has commands/actions like:
Create Task
Give name to task as t1
Add time to task
Connect t1 to t2
The users can type in chat (natural language). It will be like a general English conversation, for example:
Can you create a task with name t1 and assign time to it. Also, connect t1 to t2
I could write a rule-driven parser, but it would be limited to only a few rules.
Which approach or algorithm can I use to solve this task?
How can I map general English to command or action?
I think the best solution would be to use an external service like API.ai or wit.ai. You can create a free account and then you can map certain texts to so-called 'intents'.
These intents define the main actions of your system. You can also define 'entities' that would capture, for instance, the name of the task. Please have a look at these tools. I'm sure they can handle your use case.
I think your issue is related to Rule-based system (Wiki).
You need two basic components in the core of a project like this (a minimal sketch follows the list):
1- Rule base:
a list of your rules.
2- Inference engine:
infers information or takes action based on the interaction of the input and the rule base.
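Here is a minimal sketch of those two components in plain Python; the patterns and action names are illustrative, not from any particular library:
import re

# 1- Rule base: pattern -> action name.
RULES = [
    (re.compile(r"\bcreate (?:a )?task\b", re.I), "create_task"),
    (re.compile(r"\bname\s+(\w+)", re.I), "set_name"),
    (re.compile(r"\bconnect (\w+) to (\w+)", re.I), "connect_tasks"),
]

# 2- Inference engine: return the actions triggered by the input text.
def infer(text):
    actions = []
    for pattern, action in RULES:
        match = pattern.search(text)
        if match:
            actions.append((action, match.groups()))
    return actions

print(infer("Can you create a task with name t1 and connect t1 to t2"))
# [('create_task', ()), ('set_name', ('t1',)), ('connect_tasks', ('t1', 't2'))]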
spaCy is a Python library that I think will help you (more information).
You may want to try nltk. This is an excellent library for NLP and comes with a handy book to get you started. I think you may find chapter 8 helpful for finding sentence structure, and chapter 7 useful for figuring out what your user is requesting the bot to do. I would recommend you read the entire thing if you have more than a passing interest in NLP, as most of it is quite general and can be applied outside of NLTK.
What you are describing is a general problem with quite a few possible solutions. Your business requirements, which we do not know, are going to heavily influence the correct approach.
For example, you will need to tokenize the natural language input. Should you use a rules-based approach, or a machine learning one? Maybe both? Let's consider your input string:
Can you create a task with name t1 and assign time to it. Also, connect t1 to t2
Our system might tokenize this input in the following manner:
Can you [create a task] with [name] [t1] and [assign] [time] to it. Also, [connect] [t1] to [t2]
The brackets indicate semantic information, entirely without structure. Does the structure matter? Do you need to know that connect t1 is related to t2 in the text itself, or can we assume that it is because all inputs are going to follow this structure?
If the input will always follow this structure, and will always contain these kinds of semantics, you might be able to get away with parsing it using regular expressions and feeding the results into prebuilt methods.
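For instance, assuming a fixed structure, a couple of regexes could feed hypothetical prebuilt methods directly (a sketch, not a general solution):
import re

def create_task(name):  # hypothetical prebuilt methods
    print("created task", name)

def connect(src, dst):
    print("connected", src, "->", dst)

text = "Can you create a task with name t1 and assign time to it. Also, connect t1 to t2"

task = re.search(r"create a task with name (\w+)", text, re.I)
if task:
    create_task(task.group(1))

link = re.search(r"connect (\w+) to (\w+)", text, re.I)
if link:
    connect(*link.groups())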
If the input is instead going to be true natural language (i.e., you are building a Siri or Alexa competitor) then this is going to be wildly more complex, and you aren't going to get a useful answer in an SO post like this. You would instead have a few thousand SO posts ahead of you, assuming you have sufficient familiarity with both linguistics and computer science to approach the problem systematically.
Let's say the text is "Please order a pizza for me" or "May I have a cab booking from uber".
Use a good library like nltk and parse these sentences. As social English is generally grammatically incorrect, you might have to train your parser with your own broken-English corpora. Next, these are the steps you have to follow to get an idea of what a user wants.
Find the full stops in a paragraph, keeping in mind abbreviations and lingo like "....", "???", etc.
Next, find all the verbs and noun phrases in the individual sentences; this can be done through POS (part-of-speech) tagging with different libraries.
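A minimal sketch of those two steps with nltk (assuming the punkt and averaged_perceptron_tagger data have been downloaded):
import nltk

text = "Please order a pizza for me. May I have a cab booking from uber"

for sentence in nltk.sent_tokenize(text):               # split on sentence ends
    tags = nltk.pos_tag(nltk.word_tokenize(sentence))   # POS tagging
    verbs = [word for word, tag in tags if tag.startswith("VB")]
    nouns = [word for word, tag in tags if tag.startswith("NN")]
    print(verbs, nouns)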
After that the real work starts. My approach would be to create a graph of verbs where similar verbs are close to each other and dissimilar verbs are far apart.
Let's say you have words like arrange, instruction, command, directive, dictate, which are close to order. So if your user writes any one of the above verbs in their text, your algorithm will identify that the user really means order. You can also use the edges of that graph to specify the context in which the verb was used.
Now, you have to assign an action to this verb "order" based on the noun phrases that were parsed from the original sentence.
This is just a high-level explanation of the algorithm; it has many problems that need serious consideration, some of which are listed below.
Finding a similarity index between root_verb and a given verb in a very short time (see the sketch after this list).
New words that don't have an entry in the graph. A possible approach is to update your graph by searching Google for the word, finding context from the pages on which it was mentioned, and finding an appropriate place for the new word in the graph.
Similarity indices between misspelled words and proper verbs or nouns.
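For the similarity-index problem, one cheap approximation (instead of a hand-built graph) is WordNet similarity via nltk, assuming the wordnet corpus has been downloaded:
from nltk.corpus import wordnet as wn

def verb_similarity(verb_a, verb_b):
    # Best Wu-Palmer similarity over all verb senses of both words.
    scores = [a.wup_similarity(b) or 0
              for a in wn.synsets(verb_a, pos=wn.VERB)
              for b in wn.synsets(verb_b, pos=wn.VERB)]
    return max(scores, default=0)

# Compare candidate verbs from the user's text against the root verb "order".
print(verb_similarity("arrange", "order"))
print(verb_similarity("eat", "order"))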
If you want to build a more sophisticated model, you can construct a graph for every part of speech and select appropriate words from each graph to form sentences in response to queries. The graph described above is for the verb part of speech.
#whrrgarbl is right, though. It seems like you do not want to train a bot yourself.
So, to handle language input variations (lexical, semantic, ...) you would need a pre-trained bot which you can customize (or maybe just add rules to, according to your needs).
The easiest business-oriented solution is Amazon Lex. There is a free preview program too.
Another option would be to use Google's Parsey McParseface (a pre-trained English parser, with support for 40 languages) and integrate it with a chat framework. Here is a link to a Python repo where the author claims to have made the installation and training process convenient.
Lastly, this provides a comparison of various chatbot platforms.
I have a MySQL database, working with Python, with some entries for products such as Samsung Television, Samsung Galaxy Phone etc. Now, if a user searches for Samsung T.V or just T.V, is there any way to return the entry for Samsung Television?
Do full text search libraries like Solr or Haystack support such features? If not, then how do I actually proceed with this?
Thanks.
Yes, Solr will surely allow you to do this and much more. You can start here.
And SolrCloud is a really good way to provide high availability to end users.
You should have a look at the SynonymFilterFactory for your analyzer. When reading the documentation you will find a section that sounds rather like the scenario you describe.
Even when you aren't worried about multi-word synonyms, idf differences still make index time synonyms a good idea. Consider the following scenario:
An index with a "text" field, which at query time uses the SynonymFilter with the synonym TV, Television and expand="true"
Many thousands of documents containing the term "text:TV"
A few hundred documents containing the term "text:Television"
You should keep in mind to have separate analyzers for index and query time, as described in this SO question How to make solr synonyms work.
I'm noticing that searches like *something consume huge amounts of CPU. I'm using Whoosh 2.4.1. I suppose this is because I don't have indexes covering this search case. something* works fine; *something doesn't.
How do you deal with these queries? Is there a special way to declare your schemas which makes this kind of queries possible?
Thanks!
That's quite a fundamental problem: prefixes are usually easy to find (as when searching foo*), postfixes are not (as with *foo).
Prefix + wildcard searches get optimized to first do a fast prefix search and then a slow wildcard search on the results of the first step.
You can't do that optimization with wildcard + postfix. But there is a trick:
If you really need that often, you could try indexing a reversed string (and also searching for the reversed search string), so the postfix search becomes a prefix search:
Something like:
add_document(title=title, title_rev=title[::-1])
...
# then query = u"*foo"[::-1], search in title_rev field.
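Roughly, a complete sketch of the reversed-field trick with Whoosh (the field and directory names are just placeholders):
import os
from whoosh.fields import Schema, TEXT
from whoosh.index import create_in
from whoosh.qparser import QueryParser

schema = Schema(title=TEXT(stored=True), title_rev=TEXT)
os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

writer = ix.writer()
title = u"Thomas: Splish, Splash, Splosh"
writer.add_document(title=title, title_rev=title[::-1])
writer.commit()

# "*losh" reversed is "hsol*", so the postfix search becomes a prefix search.
with ix.searcher() as searcher:
    query = QueryParser("title_rev", ix.schema).parse(u"*losh"[::-1])
    for hit in searcher.search(query):
        print(hit["title"])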
I intend to localise my Django application and began reading up on localisation on the Django site. This put a few questions in my mind:
It seems that when you run the 'django-admin.py makemessages' command, it scans the files for embedded strings and generates a message file that contains the translations. These translations are mapped to the strings in the file. For example, if I have a string in HTML that reads "Please enter the recipients name", Django would consider it to be the message id. What would happen if I changed something in the string? Let's say I added the missing apostrophe to the word "recipient". Would this break the translation?
In relation to the above scenario, is it better to use full-fledged sentences in the source (which might change), or would I be better off using a word like "RECIPIENT_NAME", which is less likely to change and easier to map to?
Does the 'django-admin.py makemessages' command scan the Python sources as well?
Thanks.
It very probably would. In some cases 'similar' strings can be detected and your translation will be marked as fuzzy, but it depends on the type of string; I don't know exactly what adding an apostrophe would do. Read the GNU gettext docs for more information about this.
However, an easy workaround for your problem would be: don't fix the typo in the original, but add an English-to-English translation where the translated string is the correct one :). I personally wouldn't recommend this approach, but if you're afraid of breaking tens of translation files, it can be considered.
No, it isn't; it throws away all sense of context. It might look clearer for sites where only a few translation strings are required and you know the exact context by heart. But as soon as you have hundreds of strings in the translation file, short names like that will tell you nothing, and you'll always have to look up the exact context. Even worse, you might end up using the same 'short name' for something that actually has to be translated differently, which forces you into ever weirder short names to handle both cases. Finally, if you use one natural language as the default, you don't need to translate that language explicitly anymore.
Yes, it does; there are multiple functions to mark strings in Python for translation, and an overview can be found here.
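For reference, a minimal sketch of marking strings in Python so that makemessages picks them up (function names as in current Django; older versions used the ugettext variants):
from django.utils.translation import gettext as _
from django.utils.translation import gettext_lazy

def recipient_prompt():
    # Picked up by makemessages; the literal string becomes the msgid.
    return _("Please enter the recipient's name")

class RecipientForm:  # e.g. labels that should be evaluated lazily at render time
    name_label = gettext_lazy("Recipient name")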