I am writing a fairly simple Django application where users can enter string queries. The application will then search through the database for that string.
Entry.objects.filter(headline__contains=query)
This query is pretty straightforward, but not really helpful to someone who isn't 100% sure what they are looking for. So I expanded the search.
from django.utils.stopwords import strip_stopwords

results = Entry.objects.filter(headline__contains=query)
if not results:
    query = strip_stopwords(query)
    for q in query.split(' '):
        results |= Entry.objects.filter(headline__contains=q)
I would like to add some additional functionality to this: searching for misspelled words, plurals, common homophones (words that sound the same but are spelled differently), etc. I was just wondering if any of these things were built into Django's query language. It isn't important enough to warrant a huge algorithm; I am really just looking for something built in.
Thanks in advance for all the answers.
You could try using python's difflib module.
>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']
The problem is that to use difflib you must build a list of words from the database, which can be expensive. You could cache the list of words and only rebuild it once in a while.
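A minimal sketch of that caching idea, using only difflib. The static headline list is a stand-in for a real query (e.g. `Entry.objects.values_list('headline', flat=True)`, which is Django-specific):

```python
import difflib

# Stand-in data: in a real app this would come from the database
# and be rebuilt once in a while.
_word_cache = None

def get_word_cache():
    """Build the word list once and reuse it on later calls."""
    global _word_cache
    if _word_cache is None:
        headlines = ['apple pie recipe', 'grape harvest', 'apple cider press']
        _word_cache = sorted({w for h in headlines for w in h.split()})
    return _word_cache

def fuzzy_matches(query):
    """Close matches for a possibly misspelled query word."""
    return difflib.get_close_matches(query, get_word_cache())
```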
Some database systems support a search method that does what you want, like PostgreSQL's fuzzystrmatch module. If that applies to you, you could try calling it.
edit:
For your new "requirement", you are out of luck: there is nothing built into Django's query language for that.
Django's ORM doesn't have this behavior out of the box, but there are several projects that integrate Django with search services, such as:
Sphinx (django-sphinx)
Solr, a search server built on Lucene (djangosearch)
Lucene (django-search-lucene)
I can't speak to how well options #2 and #3 work, but I've used django-sphinx quite a lot and am very happy with the results.
import difflib

cal_name = request.data['column']['name']
is_sure = request.data.get('sure')  # client confirms with {'sure': 'true'}
words = []
for col in Column.objects.all():
    if cal_name != col.name:
        words.append(col.name)
words = difflib.get_close_matches(cal_name, words)
if len(words) > 0 and is_sure != "true":
    return Response({
        'potential typo': 'Did you mean ' + str(words) + '?',
        "note": "If you think you do not have a typo send {'sure': 'true'} with the data."})
Given a word (English or non-English), how can I construct a list of words (English or non-English) with similar spelling?
For example, given the word 'sira', some similar words are:
sirra
seira
siara
saira
shira
I'd prefer this to be on the verbose side, meaning it should generate as many words as possible.
Preferably in Python, but code in any language is helpful.
The Australian Business Register ABN lookup tool (a tool that finds business registration numbers based on search keywords) does a good job of this.
Thanks
What you are looking for is provided by the ispell family of dictionaries. There is a relatively easy interface via the hunspell library.
You can download the actual data (dictionaries) from here (among other places, such as the OpenOffice plugin pages).
There is an interface to get a number of similar words based on the edit distance suggested in the comment. Going with the example from GitHub:
>>> import hunspell
>>> hobj = hunspell.HunSpell('/usr/share/hunspell/en_US.dic', '/usr/share/hunspell/en_US.aff')
>>> hobj.spell('spookie')
False
>>> hobj.suggest('spookie')
['spookier', 'spookiness', 'spook', 'cookie', 'bookie', 'Spokane', 'spoken']
For searching in databases, use LIKE.
The query you'd want is:
SELECT * FROM `testTable` WHERE name LIKE '%s%i%r%a%'
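That interleaved pattern can be built from any search word with a tiny helper (pass the result as a bound query parameter rather than string-formatting it into the SQL):

```python
def like_pattern(word):
    """Build a SQL LIKE pattern matching the word's letters in order,
    with anything allowed between them: 'sira' -> '%s%i%r%a%'."""
    return '%' + '%'.join(word) + '%'
```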
We have about 9k documents indexed using Haystack 1.2.7 with Whoosh 2.4.1 as the backend. Despite using Haystack, it looks like a Whoosh problem. Take a look at my debug cases:
1) If I just run an exact lookup, Whoosh finds my document (as below):
>>> SearchQuerySet().all().models(PedidoSaida).filter(numero__exact='6210202443/10')
[<SearchResult: logistica.pedidosaida (pk=u'6')>]
2) If I just run a startswith lookup, Whoosh doesn't find my document (as below):
>>> SearchQuerySet().all().models(PedidoSaida).filter(numero__startswith='6210202443/10')
[]
3) If I put all together in a single OR query, Whoosh still doesn't find my document (as below):
>>> SearchQuerySet().all().models(PedidoSaida).filter(SQ(numero__exact='6210202443/10') | SQ(numero__startswith='6210202443/10'))
[]
Taking a look into the queries that Haystack sends to Whoosh, we have:
>>> str(SearchQuerySet().all().models(PedidoSaida).filter(numero__exact='6210202443/10').query)
'(numero:6210202443/10) AND (django_ct:logistica.pedidosaida)'
>>> str(SearchQuerySet().all().models(PedidoSaida).filter(numero__startswith='6210202443/10').query)
'(numero:6210202443/10*) AND (django_ct:logistica.pedidosaida)'
>>> str(SearchQuerySet().all().models(PedidoSaida).filter(SQ(numero__exact='6210202443/10') | SQ(numero__startswith='6210202443/10')).query)
'((numero:6210202443/10 OR numero:6210202443/10*)) AND (django_ct:logistica.pedidosaida)'
As you can observe, the last query is exactly (first OR second). Shouldn't Whoosh find my document? I can't see where my logic is wrong: I'm using OR, yet it finds fewer results than either clause alone.
I also think it is weird that Whoosh finds my document with the first query (numero:6210202443/10) but not with the second one (numero:6210202443/10*). I guess it has to do with the StemmingAnalyzer that Haystack uses in my CharField; I'll take a deeper look into that later.
You can use a QueryParser directly to see how Whoosh is parsing that query:
>>> from whoosh.qparser import QueryParser
>>> QueryParser("content", schema=None).parse('((numero:6210202443/10 OR numero:6210202443/10*)) AND (django_ct:logistica.pedidosaida)')
And([Or([Term('numero', '6210202443/10'), Term('numero', '6210202443/')]), Prefix('content', '10'), Term('django_ct', 'logistica.pedidosaida')])
Let's reformat that last line:
And([
Or([
Term('numero', '6210202443/10'),
Term('numero', '6210202443/'),
]),
Prefix('content', '10'),
Term('django_ct', 'logistica.pedidosaida'),
])
So it looks like * is binding more tightly than the / in your search term. I could see arguing that this is a bug in Whoosh. (I'm sure the maintainer would love your patch ☺)
Workarounds coming to mind:
Build the query yourself instead of round-tripping through Whoosh's fuzzily-defined and human-oriented query language. Of course, that only works if your index is on the same machine and you're reading it with the same process; I don't know much about Haystack.
Avoid using slashes in the numero field. Change them to something less likely to look like query syntax, like underscores.
Avoid including the slash when you do a prefix search; for example, 6210202443* works fine anywhere in a query.
Following @Eevee's ideas, I did some tests. Check this one:
>>> QueryParser("content", schema=None).parse('((numero:6210202443/10 OR (numero:6210202443/10*))) AND (django_ct:logistica.pedidosaida)')
And([
Or([
Term('numero', '6210202443/10'),
And([
Term('numero', '6210202443/'),
Prefix('content', '10')
])
]),
Term('django_ct', 'logistica.pedidosaida')
])
It seems that / takes precedence over OR. Does that make sense? I think logical operators should have the highest precedence. Do you agree?
If this behaviour is correct, then I guess it is a bug in Haystack's query generator, isn't it?
I want to contribute a patch, but I'm not sure whether it is really a bug in the parser; it depends on which precedence makes more sense.
I just got haystack with solr installed and created a custom view:
from django.shortcuts import render_to_response
from django.template import RequestContext
from haystack.query import SearchQuerySet

def post_search(request, template_name='search/search.html'):
    getdata = request.GET.copy()
    try:
        results = SearchQuerySet().filter(title=getdata['search'])[:10]
    except KeyError:  # no 'search' parameter in the request
        results = None
    return render_to_response(template_name, locals(), context_instance=RequestContext(request))
This view only returns exact matches on the title field. How do I do at least something like SQL's LIKE '%string%' (or at least I think that's what I need), so that searching for 'i', 'IN' or 'index' returns the result 'index'?
Also, is most of the search behavior configured through Haystack or through Solr?
What other good practices/search improvements do you suggest (please give an implementation too)?
Thanks a bunch in advance!
When you use Haystack/Solr, the idea is that you have to tell Haystack/Solr what you want indexed for a particular object. Say you wanted to build a find-as-you-type index for a basic dictionary. If you wanted it to just match prefixes, then for the word Boston you'd need to tell it to index B, Bo, Bos, etc.; you'd then issue a query for whatever the current search expression was and return the results. If you wanted to match any part of the word, you'd need to build suffix trees, and Solr would take care of indexing them.
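A rough illustration of that prefix idea in plain Python (a dict-based toy standing in for what an edge-ngram field would store in Solr):

```python
def build_prefix_index(words):
    """Map every prefix of every word to the set of words it completes."""
    index = {}
    for word in words:
        for i in range(1, len(word) + 1):
            index.setdefault(word[:i].lower(), set()).add(word)
    return index

def find_as_you_type(index, typed):
    """Look up whatever the user has typed so far."""
    return sorted(index.get(typed.lower(), set()))
```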
Look at templates in Haystack for more info. http://docs.haystacksearch.org/dev/best_practices.html#well-constructed-templates
The question you're asking is fairly generic; it might help to give specifics about what people are searching for. Then it will be easier to suggest how to index the data. Good luck.
I want to create a SQL interface on top of a non-relational data store. The store is non-relational, but it makes sense to access the data in a relational manner.
I am looking into using ANTLR to produce an AST that represents the SQL as a relational algebra expression. Then return data by evaluating/walking the tree.
I have never implemented a parser before, and I would therefore like some advice on how to best implement a SQL parser and evaluator.
Does the approach described above sound about right?
Are there other tools/libraries I should look into? Like PLY or Pyparsing.
Pointers to articles, books or source code that will help me are appreciated.
Update:
I implemented a simple SQL parser using pyparsing. Combined with Python code that implements the relational operations against my data store, this was fairly simple.
As I said in one of the comments, the point of the exercise was to make the data available to reporting engines. To do this, I probably will need to implement an ODBC driver. This is probably a lot of work.
I have looked into this issue quite extensively. python-sqlparse is a non-validating parser, which is not really what you need. The examples in ANTLR need a lot of work to convert into a nice AST in Python. The SQL standard grammars are here, but it would be a full-time job to convert them yourself, and you would likely only need a subset of them, i.e. no joins. You could also try looking at gadfly (a Python SQL database), but I avoided it as they used their own parsing tool.
In my case, I essentially only needed a WHERE clause. I tried booleneo (a boolean expression parser) written with pyparsing, but ended up using pyparsing from scratch. The first link in the reddit post of Mark Rushakoff gives a SQL example using it. Whoosh, a full-text search engine, also uses it, but I have not looked at the source to see how.
Pyparsing is very easy to use, and you can easily customize it to not be exactly the same as SQL (you will not need most of the syntax). I did not like PLY, as it relies on magic naming conventions.
In short, give pyparsing a try; it will most likely be powerful enough for what you need, and its simple integration with Python (with easy callbacks and error handling) will make the experience fairly painless.
This reddit post suggests python-sqlparse as an existing implementation, among a couple other links.
TwoLaid's Python SQL Parser works very well for my purposes. It's written in C and needs to be compiled. It is robust. It parses out individual elements of each clause.
https://github.com/TwoLaid/python-sqlparser
I'm using it to parse out query column names to use in report headers. Here is an example.
import sqlparser

def get_query_columns(sql):
    '''Return a list of column headers from the given SQL's SELECT clause.'''
    columns = []
    parser = sqlparser.Parser()
    # The parser does not like newlines
    sql2 = sql.replace('\n', ' ')
    # Check for syntax errors
    if parser.check_syntax(sql2) != 0:
        raise Exception('get_query_columns: SQL invalid.')
    stmt = parser.get_statement(0)
    root = stmt.get_root()
    qcolumns = root.__dict__['resultColumnList']
    for qcolumn in qcolumns.list:
        if qcolumn.aliasClause:
            alias = qcolumn.aliasClause.get_text()
            columns.append(alias)
        else:
            name = qcolumn.get_text()
            name = name.split('.')[-1]  # remove table alias
            columns.append(name)
    return columns

sql = '''
SELECT
    a.a,
    replace(coalesce(a.b, 'x'), 'x', 'y') as jim,
    a.bla as sally -- some comment
FROM
    table_a as a
WHERE
    c > 20
'''

print(get_query_columns(sql))
# output: ['a', 'jim', 'sally']
Of course, it may be best to leverage python-sqlparse on Google Code
UPDATE: Now I see that this has been suggested - I concur that this is worthwhile:
I am using python-sqlparse with great success.
In my case I am working with queries that are already validated, so my AST-walking code can make some sane assumptions about the structure.
https://pypi.org/project/sqlparse/
https://sqlparse.readthedocs.io/en/latest/
Usually when we search, we have a list of stories, we provide a search string, and we expect back a list of results where the given search string matches the story.
What I am looking to do is the opposite: given a list of search strings and one story, find out which search strings match that story.
Now, this could be done with re, but I want to use complex search queries as supported by Solr (full details of the query syntax here). Note: I won't use boost.
Basically, I want some pointers for the doesitmatch function in the sample code below.
def doesitmatch(contents, searchstring):
    """
    Returns the result of searching contents for searchstring (True or False).
    """
    ???????
    ???????

story = "big chunk of story 200 to 1000 words long"
searchstrings = ['sajal', 'sajal AND "is a jerk"', 'sajal kayan',
                 'sajal AND (kayan OR bangkok OR Thailand OR (webmaster AND python))',
                 'bangkok']
matches = [[searchstr] for searchstr in searchstrings if doesitmatch(story, searchstr)]
Edit: Additionally, I would also be interested to know if any module exists to convert a Lucene query like the one below into a regex:
sajal AND (kayan OR bangkok OR Thailand OR ( webmaster AND python) OR "is a jerk")
After extensive googling, I realized that what I am looking to do is a boolean search.
I found code that makes regexes boolean-aware: http://code.activestate.com/recipes/252526/
The issue looks solved for now.
Probably slow, but easy solution:
Make a query on the story plus each string to the search engine. If it returns anything, then it matches.
Otherwise you need to implement the search syntax yourself. If that includes things like "title:" fields, it can get rather complex. If it's only the AND and OR from your example, then it's a recursive function that isn't too hairy.
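For the AND/OR-with-parentheses subset (plus quoted phrases), that recursive function could be sketched like this. It is a toy, not full Solr/Lucene syntax: no fields, NOT, wildcards or boosts, and AND/OR are treated with equal precedence, left to right:

```python
import re

def matches(text, query):
    """Evaluate a query of bare words, "quoted phrases", AND, OR and
    parentheses against a text, by recursive descent."""
    tokens = re.findall(r'\(|\)|"[^"]*"|[^\s()"]+', query)
    haystack = text.lower()
    pos = [0]  # parser position, in a list so the closures can mutate it

    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None

    def eat():
        tok = tokens[pos[0]]
        pos[0] += 1
        return tok

    def term():
        tok = eat()
        if tok == '(':
            value = expr()
            eat()  # consume the closing ')'
            return value
        return tok.strip('"').lower() in haystack

    def expr():  # AND and OR, equal precedence, left to right
        value = term()
        while peek() in ('AND', 'OR'):
            op = eat()
            right = term()
            value = (value and right) if op == 'AND' else (value or right)
        return value

    return expr()
```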
Some time ago I looked for a Python implementation of Lucene and I came across Whoosh, which is a pure-Python text search engine. Maybe it will satisfy your needs.
You can also try PyLucene, but I didn't investigate that one.
Here's a suggestion in pseudocode. I'm assuming you store a story identifier with the search terms in the index, so that you can retrieve it with the search results.
def search_strings_matching(story_id_to_match, search_strings):
    result = set()
    for s in search_strings:
        result_story_ids = query_index(s)  # query_index returns an iterable of ids
        if story_id_to_match in result_story_ids:
            result.add(s)
    return result
This is probably less interesting to you now, since you've already solved your problem, but what you're describing sounds like Prospective Search, which is what you call it when you have the query first and you want to match it against documents as they come along.
Lucene's MemoryIndex is a class that was designed specifically for something like this, and in your case it might be efficient enough to run many queries against a single document.
This has nothing to do with Python, though; you'd probably be better off writing something like this in Java.
If you are writing Python on AppEngine, you can use the AppEngine Prospective Search Service to achieve exactly what you are trying to do here. See: http://code.google.com/appengine/docs/python/prospectivesearch/overview.html