Whoosh: shouldn't OR only increase results? - python

We have about 9k documents indexed using Haystack 1.2.7 with Whoosh 2.4.1 as backend. Despite of using Haystack, it looks like a Whoosh problem. Take a look at my debug cases:
1) If I just run an exact lookup, Whoosh finds my document (as below):
>>> SearchQuerySet().all().models(PedidoSaida).filter(numero__exact='6210202443/10')
[<SearchResult: logistica.pedidosaida (pk=u'6')>]
2) If I just run a startswith lookup, Whoosh doesn't find my document (as below):
>>> SearchQuerySet().all().models(PedidoSaida).filter(numero__startswith='6210202443/10')
[]
3) If I put all together in a single OR query, Whoosh still doesn't find my document (as below):
>>> SearchQuerySet().all().models(PedidoSaida).filter(SQ(numero__exact='6210202443/10') | SQ(numero__startswith='6210202443/10'))
[]
Taking a look into the queries that Haystack sends to Whoosh, we have:
>>> str(SearchQuerySet().all().models(PedidoSaida).filter(numero__exact='6210202443/10').query)
'(numero:6210202443/10) AND (django_ct:logistica.pedidosaida)'
>>> str(SearchQuerySet().all().models(PedidoSaida).filter(numero__startswith='6210202443/10').query)
'(numero:6210202443/10*) AND (django_ct:logistica.pedidosaida)'
>>> str(SearchQuerySet().all().models(PedidoSaida).filter(SQ(numero__exact='6210202443/10') | SQ(numero__startswith='6210202443/10')).query)
'((numero:6210202443/10 OR numero:6210202443/10*)) AND (django_ct:logistica.pedidosaida)'
As you can observe, the last query is exactly (first OR second). Shouldn't Whoosh find my document? I can't see where my logic is wrong: I'm using OR and it is finding less than when I use one of the statements.
I also think it is weird that Whoosh finds my document with the first query (numero:6210202443/10), but not with the second (numero:6210202443/10*) one. But I guess it has to do with StemmingAnalyzer that Haystack uses in my CharField. I'll take a deep look into that after.

You can use a QueryParser directly to see how Whoosh is parsing that query:
>>> from whoosh.qparser import QueryParser
>>> QueryParser("content", schema=None).parse('((numero:6210202443/10 OR numero:6210202443/10*)) AND (django_ct:logistica.pedidosaida)')
And([Or([Term('numero', '6210202443/10'), Term('numero', '6210202443/')]), Prefix('content', '10'), Term('django_ct', 'logistica.pedidosaida')])
Let's reformat that last line:
And([
Or([
Term('numero', '6210202443/10'),
Term('numero', '6210202443/'),
]),
Prefix('content', '10'),
Term('django_ct', 'logistica.pedidosaida'),
])
So it looks like * is binding more tightly than the / in your search term. I could see arguing this as a bug in whoosh, sure. (I'm sure the maintainer would love your patch ☺)
Workarounds coming to mind:
Build the query yourself instead of round-tripping through Whoosh's fuzzily-defined and human-oriented query language. Of course, that only works if your index is on the same machine and you're reading it with the same process; I don't know much about Haystack.
Avoid using slashes in the numero field. Change them to something less likely to look like query syntax, like underscores.
Avoid including the slash when you do a prefix search; for example, 6210202443* works fine anywhere in a query.

Following #Eevee ideas, I did some tests. Check this one:
>>> QueryParser("content", schema=None).parse('((numero:6210202443/10 OR (numero:6210202443/10*))) AND (django_ct:logistica.pedidosaida)')
And([
Or([
Term('numero', '6210202443/10'),
And([
Term('numero', '6210202443/'),
Prefix('content', '10')
])
]),
Term('django_ct', 'logistica.pedidosaida')
])
It seems that / has precedence over OR. Does it make sense? I think that logical operators should have highest precedence. Do you agree?
If this behaviour is correct than I guess it is a bug in Haystack query generator. Isn't it?
I want to contribute with a patch but I'm not sure if it is really a bug in the parser. Depends on precedence that makes more sense.

Related

querying all fields in neo4j index using python-embedded bindings

I'm trying to query a node index across all fields. This is what I thought would work:
idx = db.node.indexes.get('myindex')
idx.query('*:search_query')
But this returns no results. However, this works
idx = db.node.indexes.get('myindex')
idx.query('*:*')
And it returns all the nodes in the index as expected. Am I wrong in assuming that the first version should work at all?
I don't expect the first version to work, and am surprised the second does. Neo4j parses those queries using this Lucene syntax- I don't see anything about wildcard fields. Instead, remove the field to search against an implied "all fields".
Plug - for an easier way to build Lucene queries (compatible with Neo4j), check out lucene-querybuilder. It's used by neo4j-rest-client and neo4django.
EDIT:
I can't seem to find support for the "all fields" implicit search I thought existed- sorry! I guess you'll just have to manually include all fields in the query (eg, "name:falmarri OR userType:falmarri").

Is it possible to use "find" method with "javascript" query with pymongo?

With Mongo, it is OK with the following:
> db.posts.find("this.text.indexOf('Hello') > 0")
But with pymongo, when executing the following:
for post in db.posts.find("this.text.indexOf('Hello') > 0"):
print post['text']
the error occurred.
I think Full Text Search in Mongo is better way in this example, but is it possible to use "find" method with "javascript" query with pymongo?
You are correct - you do this with server side javascript by using the $where clause[1]:
db.posts.find({"$where": "this.text.indexOf('Hello') > 0"})
Will work on all but sharded setups but the costs of doing are deemed prohibitive as you will be inspecting all documents in the collection, which is why generally its not considered a great idea.
You could also do a regular expression search:
db.posts.find({'text':{'$regex':'Hello'}})
This will also do a full collection scan as the regular expression isn't anchored (if you anchor a regular expression for example you're checking if a field begins with an value and have an index on that field you can utilise the index).
Given that those two approaches are expensive and won't perform or scale well then the best approach?
Well the full text search approach as described in the link you gave[2] works well. Create a _keywords field which stores the keywords as lowercase in an array, index that field then you can query like so:
db.posts.find({"_keywords": {"$in": "hello"});
That will scale and utilises an index so will be performant.
[1] http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-JavascriptExpressionsand%7B%7B%24where%7D%7D
[2] http://www.mongodb.org/display/DOCS/Full+Text+Search+in+Mongo

Parsing SQL with Python

I want to create a SQL interface on top of a non-relational data store. Non-relational data store, but it makes sense to access the data in a relational manner.
I am looking into using ANTLR to produce an AST that represents the SQL as a relational algebra expression. Then return data by evaluating/walking the tree.
I have never implemented a parser before, and I would therefore like some advice on how to best implement a SQL parser and evaluator.
Does the approach described above sound about right?
Are there other tools/libraries I should look into? Like PLY or Pyparsing.
Pointers to articles, books or source code that will help me is appreciated.
Update:
I implemented a simple SQL parser using pyparsing. Combined with Python code that implement the relational operations against my data store, this was fairly simple.
As I said in one of the comments, the point of the exercise was to make the data available to reporting engines. To do this, I probably will need to implement an ODBC driver. This is probably a lot of work.
I have looked into this issue quite extensively. Python-sqlparse is a non validating parser which is not really what you need. The examples in antlr need a lot of work to convert to a nice ast in python. The sql standard grammars are here, but it would be a full time job to convert them yourself and it is likely that you would only need a subset of them i.e no joins. You could try looking at the gadfly (a Python SQL database) as well, but I avoided it as they used their own parsing tool.
For my case, I only essentially needed a where clause. I tried booleneo (a boolean expression parser) written with pyparsing but ended up using pyparsing from scratch. The first link in the reddit post of Mark Rushakoff gives a SQL example using it. Whoosh a full text search engine also uses it but I have not looked at the source to see how.
Pyparsing is very easy to use and you can very easily customize it to not be exactly the same as SQL (most of the syntax you will not need). I did not like ply as it uses some magic using naming conventions.
In short give pyparsing a try, it will most likely be powerful enough to do what you need and the simple integration with python (with easy callbacks and error handling) will make the experience pretty painless.
This reddit post suggests python-sqlparse as an existing implementation, among a couple other links.
TwoLaid's Python SQL Parser works very well for my purposes. It's written in C and needs to be compiled. It is robust. It parses out individual elements of each clause.
https://github.com/TwoLaid/python-sqlparser
I'm using it to parse out queries column names to use in report headers. Here is an example.
import sqlparser
def get_query_columns(sql):
'''Return a list of column headers from given sqls select clause'''
columns = []
parser = sqlparser.Parser()
# Parser does not like new lines
sql2 = sql.replace('\n', ' ')
# Check for syntax errors
if parser.check_syntax(sql2) != 0:
raise Exception('get_query_columns: SQL invalid.')
stmt = parser.get_statement(0)
root = stmt.get_root()
qcolumns = root.__dict__['resultColumnList']
for qcolumn in qcolumns.list:
if qcolumn.aliasClause:
alias = qcolumn.aliasClause.get_text()
columns.append(alias)
else:
name = qcolumn.get_text()
name = name.split('.')[-1] # remove table alias
columns.append(name)
return columns
sql = '''
SELECT
a.a,
replace(coalesce(a.b, 'x'), 'x', 'y') as jim,
a.bla as sally -- some comment
FROM
table_a as a
WHERE
c > 20
'''
print get_query_columns(sql)
# output: ['a', 'jim', 'sally']
Of course, it may be best to leverage python-sqlparse on Google Code
UPDATE: Now I see that this has been suggested - I concur that this is worthwhile:
I am using python-sqlparse with great success.
In my case I am working with queries that are already validated, my AST-walking code can make some sane assumptions about the structure.
https://pypi.org/project/sqlparse/
https://sqlparse.readthedocs.io/en/latest/

How to match search strings to content in python

Usually when we search, we have a list of stories, we provide a search string, and expect back a list of results where the given search strings matches the story.
What I am looking to do, is the opposite. Give a list of search strings, and one story and find out which search strings match to that story.
Now this could be done with re but the case here is i wanna use complex search queries as supported by solr. Full details of the query syntax here. Note: i wont use boost.
Basically i want to get some pointers for the doesitmatch function in the sample code below.
def doesitmatch(contents, searchstring):
"""
returns result of searching contents for searchstring (True or False)
"""
???????
???????
story = "big chunk of story 200 to 1000 words long"
searchstrings = ['sajal' , 'sajal AND "is a jerk"' , 'sajal kayan' , 'sajal AND (kayan OR bangkok OR Thailand OR ( webmaster AND python))' , 'bangkok']
matches = [[searchstr] for searchstr in searchstrings if doesitmatch(story, searchstr) ]
Edit: Additionally would also be interested to know if any module exists to convert lucene query like below into regex:
sajal AND (kayan OR bangkok OR Thailand OR ( webmaster AND python) OR "is a jerk")
After extensive googling, i realized what i am looking to do is a Boolean search.
Found the code that makes regex boolean aware : http://code.activestate.com/recipes/252526/
Issue looks solved for now.
Probably slow, but easy solution:
Make a query on the story plus each string to the search engine. If it returns anything, then it matches.
Otherwise you need to implement the search syntax yourself. If that includes things like "title:" and stuff this can be rather complex. If it's only the AND and OR from your example, then it's a recursive function that isn't too hairy.
Some time ago I looked for a python implementaion of lucene and I came accross of Woosh which is a pure python text-based research engine. Maybe it will statisfy your needs.
You can also try pyLucene, but i did'nt investigate this one.
Here's a suggestion in pseudocode. I'm assuming you store a story identifier with the search terms in the index, so that you can retrieve it with the search results.
def search_strings_matching(story_id_to_match, search_strings):
result = set()
for s in search_strings:
result_story_ids = query_index(s) # query_index returns an id iterable
if story_id_to_match in result_story_ids:
result.add(s)
return result
This is probably less interesting to you now, since you've already solved your problem, but what you're describing sounds like Prospective Search, which is what you call it when you have the query first and you want to match it against documents as they come along.
Lucene's MemoryIndex is a class that was designed specifically for something like this, and in your case it might be efficient enough to run many queries against a single document.
This has nothing to do with Python, though. You'd probably be better off writing something like this in java.
If you are writing Python on AppEngine, you can use the AppEngine Prospective Search Service to achieve exactly what you are trying to do here. See: http://code.google.com/appengine/docs/python/prospectivesearch/overview.html

Django "Did you mean?" query

I am writing a fairly simple Django application where users can enter string queries. The application will the search through the database for this string.
Entry.objects.filter(headline__contains=query)
This query is pretty strait forward but not really helpful to someone who isn't 100% sure what they are looking for. So I expanded the search.
from django.utils import stopwords
results = Entry.objects.filter(headline__contains=query)
if(!results):
query = strip_stopwords(query)
for(q in query.split(' ')):
results += Entry.objects.filter(headline__contains=q)
I would like to add some additional functionality to this. Searching for miss spelled words, plurals, common homophones (sound the same spelled differently), ect. I was just wondering if any of these things were built into Djangos query language. It isn't important enough for me to write a huge algorithm for I am really just looking for something built in.
Thanks in advance for all the answers.
You could try using python's difflib module.
>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']
Problem is that to use difflib one must build a list of words from the database. That can be expensive. Maybe if you cache the list of words and only rebuild it once in a while.
Some database systems support a search method to do what you want, like PostgreSQL's fuzzystrmatch module. If that is your case you could try calling it.
edit:
For your new "requirement", well, you are out of luck. No, there is nothing built in inside django's query language.
djangos orm doesn't have this behavior out-of-box, but there are several projects that integrate django w/ search services like:
sphinx (django-sphinx)
solr, a lightweight version of lucene (djangosearch)
lucene (django-search-lucene)
I cant speak to how well options #2 and #3 work, but I've used django-sphinx quite a lot, and am very happy with the results.
cal_name = request.data['column']['name']
words = []
for col in Column.objects.all():
if cal_name != col.name:
words.append(col.name)
words = difflib.get_close_matches(cal_name, words)
if len(words) > 0 and is_sure != "true":
return Response({
'potential typo': 'Did you mean ' + str(words) + '?',
"note": "If you think you do not have a typo send {'sure' : 'true'} with the data."})

Categories

Resources