With Mongo, it is OK with the following:
> db.posts.find("this.text.indexOf('Hello') > 0")
But with pymongo, when executing the following:
for post in db.posts.find("this.text.indexOf('Hello') > 0"):
print post['text']
the error occurred.
I think Full Text Search in Mongo is better way in this example, but is it possible to use "find" method with "javascript" query with pymongo?
You are correct - you do this with server side javascript by using the $where clause[1]:
db.posts.find({"$where": "this.text.indexOf('Hello') > 0"})
Will work on all but sharded setups but the costs of doing are deemed prohibitive as you will be inspecting all documents in the collection, which is why generally its not considered a great idea.
You could also do a regular expression search:
db.posts.find({'text':{'$regex':'Hello'}})
This will also do a full collection scan as the regular expression isn't anchored (if you anchor a regular expression for example you're checking if a field begins with an value and have an index on that field you can utilise the index).
Given that those two approaches are expensive and won't perform or scale well then the best approach?
Well the full text search approach as described in the link you gave[2] works well. Create a _keywords field which stores the keywords as lowercase in an array, index that field then you can query like so:
db.posts.find({"_keywords": {"$in": "hello"});
That will scale and utilises an index so will be performant.
[1] http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-JavascriptExpressionsand%7B%7B%24where%7D%7D
[2] http://www.mongodb.org/display/DOCS/Full+Text+Search+in+Mongo
Related
I have a document reference that I am retreiving from a query on my Firestore database. I want to use the DocumentReference as a query parameter for another query. However, when I do that, it says
TypeError: sequence item 1: expected str instance, DocumentReference found
This makes sense, because I am trying to pass a DocumentReference in my update statement:
db.collection("Teams").document(team).update("Dictionary here") # team is a DocumentReference
Is there a way to get the document name from a DocumentReference? Now before you mark this as duplicate: I tried looking at the docs here, and the question here, although the docs were so confusing and the question had no answer.
Any help is appreciated, Thank You in advance!
Yes,split the .refPath. The document "name" is always the last element after the split; something like lodash _.last() can work, or any other technique that identifies the last element in the array.
Note, btw, the refPath is the full path to the document. This is extremely useful (as in: I use it a lot) when you find documents via collectionGroup() - it allows you to parse to find parent document(s)/collection(s) a particular document came from.
Also note: there is a pseudo-field __name__ available. (really an alias of documentID()). In spite of it's name(s), it returns the FULL PATH (i.e. refPath) to the document NOT the documentID by itself.
I think I figured out - by doing team.path.split("/")[1] I could get the document name. Although this might not work for all firestore databases (like subcollections) so if anyone has a better solution, please go ahead. Thanks!
Is it possible to use whoosh as a matcher without building an index?
My situation is that I have subscriptions pre-defined with strings, and documents coming through in a stream. I check each document matches the subscriptions and send them if so. I don't need to store the documents, or recall them later. Once they've been sent to the subscriptions, they can be discarded.
Currently just using simple matching, but as consumers ask for searches based on fields, and/or logic, etc, I'm wondering if it's possible to use a whoosh matcher and allow whoosh query syntax for this.
I could build an index for each document, query it, and then throw it away, but that seems very wasteful, is it possible to directly construct a Matcher? I couldn't find any docs or questions online indicating a way to do this and my attempts haven't worked.
Alternatively, is this just the wrong library for this task, and is there something better suited?
The short answer is no.
Search indices and matchers work quite differently. For example, if searching for the phrase "hello world", a matcher would simply check the document text contains the substring "hello world". A search index cannot do this, it would have to check every document, and that be very slow.
As documents are added, every word in them is added to the index for that word. So the index for "hello" will say that document 1 matches at position 0, and the index for "world" will say that document 1 matches at position 6. And a search for "hello world" will find all document IDs in the "hello" index, then all in the "world" index, and see if any have a position for "world" which is 6 digits after the position for "hello".
So it's a completely orthogonal way of doing things in whoosh vs a matcher.
It is possible to do this with whoosh, using a new index for each document, like so:
def matches_subscription(doc: Document, q: Query) -> bool:
with RamStorage() as store:
ix = store.create_index(schema)
writer = ix.writer()
writer.add_document(
title=doc.title,
description=doc.description,
keywords=doc.keywords
)
writer.commit()
with ix.searcher() as searcher:
results = searcher.search(q)
return bool(results)
This takes about 800 milliseconds per check, which is quite slow.
A better solution is to build a parser with pyparsing, anbd then create your own nested query classes which can do the matching, better fitting your specific search queries. It's quite extendable too that way. That can bring it down to ~40 microseconds, so, 20,000 times faster.
I'm trying to query a node index across all fields. This is what I thought would work:
idx = db.node.indexes.get('myindex')
idx.query('*:search_query')
But this returns no results. However, this works
idx = db.node.indexes.get('myindex')
idx.query('*:*')
And it returns all the nodes in the index as expected. Am I wrong in assuming that the first version should work at all?
I don't expect the first version to work, and am surprised the second does. Neo4j parses those queries using this Lucene syntax- I don't see anything about wildcard fields. Instead, remove the field to search against an implied "all fields".
Plug - for an easier way to build Lucene queries (compatible with Neo4j), check out lucene-querybuilder. It's used by neo4j-rest-client and neo4django.
EDIT:
I can't seem to find support for the "all fields" implicit search I thought existed- sorry! I guess you'll just have to manually include all fields in the query (eg, "name:falmarri OR userType:falmarri").
I would like to implement a search function in a django blogging application. The status quo is that I have a list of strings supplied by the user and the queryset is narrowed down by each string to include only those objects that match the string.
See:
if request.method == "POST":
form = SearchForm(request.POST)
if form.is_valid():
posts = Post.objects.all()
for string in form.cleaned_data['query'].split():
posts = posts.filter(
Q(title__icontains=string) |
Q(text__icontains=string) |
Q(tags__name__exact=string)
)
return archive_index(request, queryset=posts, date_field='date')
Now, what if I didn't want do concatenate each word that is searched for by a logical AND but with a logical OR? How would I do that? Is there a way to do that with Django's own Queryset methods or does one have to fall back to raw SQL queries?
In general, is it a proper solution to do full text search like this or would you recommend using a search engine like Solr, Whoosh or Xapian. What are their benefits?
I suggest you to adopt a search engine.
We've used Haystack search, a modular search application for django supporting many search engines (Solr, Xapian, Whoosh, etc...)
Advantages:
Faster
perform search queries even without querying the database.
Highlight searched terms
"More like this" functionality
Spelling suggestions
Better ranking
etc...
Disadvantages:
Search Indexes can grow in size pretty fast
One of the best search engines (Solr) run as a Java servlet (Xapian does not)
We're pretty happy with this solution and it's pretty easy to implement.
Actually, the query you have posted does use OR rather than AND - you're using \ to separate the Q objects. AND would be &.
In general, I would highly recommend using a proper search engine. We have had good success with Haystack on top of Solr - Haystack manages all the Solr configuration, and exposes a nice API very similar to Django's own ORM.
Answer to your general question: Definitely use a proper application for this.
With your query, you always examine the whole content of the fields (title, text, tags). You gain no benefit from indexes, etc.
With a proper full text search engine (or whatever you call it), text (words) is (are) indexed every time you insert new records. So queries will be a lot faster especially when your database grows.
SOLR is very easy to setup and integrate with Django. Haystack makes it even simpler.
For full text search in Python, look at PyLucene. It allows for very complex queries. The main problem here is that you must find a way to tell your search engine which pages changed and update the index eventually.
Alternatively, you can use Google Sitemaps to tell Google to index your site faster and then embed a custom query field in your site. The advantage here is that you just need to tell Google the changed pages and Google will do all the hard work (indexing, parsing the queries, etc). On top of that, most people are used to use Google to search plus it will keep your site current in the global Google searches, too.
I think full text search on an application level is more a matter of what you have and how you expect it to scale. If you run a small site with low usage I think it might be more affordable to put some time into making an custom full text search rather than installing an application to perform the search for you. And application would create more dependency, maintenance and extra effort when storing data. By making your search yourself and you can build in nice custom features. Like for example, if your text exactly matches one title you can direct the user to that page instead of showing the results. Another would be to allow title: or author: prefixes to keywords.
Here is a method I've used for generating relevant search results from a web query.
import shlex
class WeightedGroup:
def __init__(self):
# using a dictionary will make the results not paginate
# but it will be a lot faster when storing data
self.data = {}
def list(self, max_len=0):
# returns a sorted list of the items with heaviest weight first
res = []
while len(self.data) != 0:
nominated_weight = 0
for item, weight in self.data.iteritems():
if weight > nominated_weight:
nominated = item
nominated_weight = weight
self.data.pop(nominated)
res.append(nominated)
if len(res) == max_len:
return res
return res
def append(self, weight, item):
if item in self.data:
self.data[item] += weight
else:
self.data[item] = weight
def search(searchtext):
candidates = WeightedGroup()
for arg in shlex.split(searchtext): # shlex understand quotes
# Search TITLE
# order by date so we get most recent posts
query = Post.objects.filter_by(title__icontains=arg).order_by('-date')
arg_hits = query.count() # count is cheap
if arg_hits > 1000:
continue # skip keywords which has too many hits
# Each of these are expensive as it would transfer data
# from the db and build a python object,
for post in query[:50]: # so we limit it to 50 for example
# more hits a keyword has the lesser it's relevant
candidates.append(100.0 / arg_hits, post.post_id)
# TODO add searchs for other areas
# Weight might also be adjusted with number of hits within the text
# or perhaps you can find other metrics to value an post higher,
# like number of views
# candidates can contain a lot of stuff now, show most relevant only
sorted_result = Post.objects.filter_by(post_id__in=candidates.list(20))
Usually when we search, we have a list of stories, we provide a search string, and expect back a list of results where the given search strings matches the story.
What I am looking to do, is the opposite. Give a list of search strings, and one story and find out which search strings match to that story.
Now this could be done with re but the case here is i wanna use complex search queries as supported by solr. Full details of the query syntax here. Note: i wont use boost.
Basically i want to get some pointers for the doesitmatch function in the sample code below.
def doesitmatch(contents, searchstring):
"""
returns result of searching contents for searchstring (True or False)
"""
???????
???????
story = "big chunk of story 200 to 1000 words long"
searchstrings = ['sajal' , 'sajal AND "is a jerk"' , 'sajal kayan' , 'sajal AND (kayan OR bangkok OR Thailand OR ( webmaster AND python))' , 'bangkok']
matches = [[searchstr] for searchstr in searchstrings if doesitmatch(story, searchstr) ]
Edit: Additionally would also be interested to know if any module exists to convert lucene query like below into regex:
sajal AND (kayan OR bangkok OR Thailand OR ( webmaster AND python) OR "is a jerk")
After extensive googling, i realized what i am looking to do is a Boolean search.
Found the code that makes regex boolean aware : http://code.activestate.com/recipes/252526/
Issue looks solved for now.
Probably slow, but easy solution:
Make a query on the story plus each string to the search engine. If it returns anything, then it matches.
Otherwise you need to implement the search syntax yourself. If that includes things like "title:" and stuff this can be rather complex. If it's only the AND and OR from your example, then it's a recursive function that isn't too hairy.
Some time ago I looked for a python implementaion of lucene and I came accross of Woosh which is a pure python text-based research engine. Maybe it will statisfy your needs.
You can also try pyLucene, but i did'nt investigate this one.
Here's a suggestion in pseudocode. I'm assuming you store a story identifier with the search terms in the index, so that you can retrieve it with the search results.
def search_strings_matching(story_id_to_match, search_strings):
result = set()
for s in search_strings:
result_story_ids = query_index(s) # query_index returns an id iterable
if story_id_to_match in result_story_ids:
result.add(s)
return result
This is probably less interesting to you now, since you've already solved your problem, but what you're describing sounds like Prospective Search, which is what you call it when you have the query first and you want to match it against documents as they come along.
Lucene's MemoryIndex is a class that was designed specifically for something like this, and in your case it might be efficient enough to run many queries against a single document.
This has nothing to do with Python, though. You'd probably be better off writing something like this in java.
If you are writing Python on AppEngine, you can use the AppEngine Prospective Search Service to achieve exactly what you are trying to do here. See: http://code.google.com/appengine/docs/python/prospectivesearch/overview.html