Is it possible to use whoosh as a matcher without building an index?
My situation is that I have subscriptions pre-defined with strings, and documents coming through in a stream. I check each document matches the subscriptions and send them if so. I don't need to store the documents, or recall them later. Once they've been sent to the subscriptions, they can be discarded.
Currently I'm just using simple string matching, but as consumers ask for searches based on fields, and/or logic, etc., I'm wondering if it's possible to use a whoosh matcher and allow whoosh query syntax for this.
I could build an index for each document, query it, and then throw it away, but that seems very wasteful. Is it possible to construct a Matcher directly? I couldn't find any docs or questions online indicating a way to do this, and my attempts haven't worked.
Alternatively, is this just the wrong library for this task, and is there something better suited?
The short answer is no.
Search indices and matchers work quite differently. For example, when searching for the phrase "hello world", a matcher would simply check that the document text contains the substring "hello world". A search index cannot do this: it would have to check every document, and that would be very slow.
Instead, as documents are added, every word in them is added to the index entry for that word. So the entry for "hello" will say that document 1 matches at position 0, and the entry for "world" will say that document 1 matches at position 6. A search for "hello world" then finds all document IDs in the "hello" entry, then all in the "world" entry, and checks whether any of them have a position for "world" that is 6 characters after the position for "hello".
So it's a completely orthogonal way of doing things in whoosh vs a matcher.
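To make that concrete, here is a toy sketch of the inverted-index idea; this is only an illustration of the principle, not how whoosh actually stores things:

import re

# word -> list of (doc_id, character_offset) pairs
index = {}

def add_document(doc_id, text):
    for match in re.finditer(r'\S+', text.lower()):
        index.setdefault(match.group(), []).append((doc_id, match.start()))

def phrase_search(first, second):
    # Documents where `second` starts right after `first` plus one space.
    first_hits = set(index.get(first, []))
    return {doc_id for (doc_id, offset) in index.get(second, [])
            if (doc_id, offset - len(first) - 1) in first_hits}

add_document(1, "hello world")
print(phrase_search("hello", "world"))  # {1}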
It is possible to do this with whoosh, using a new index for each document, like so:
from whoosh.filedb.filestore import RamStorage
from whoosh.query import Query

# `schema` (the whoosh Schema with title/description/keywords fields) and
# `Document` (the application's own document type) are assumed to be defined elsewhere.
def matches_subscription(doc: Document, q: Query) -> bool:
    with RamStorage() as store:
        ix = store.create_index(schema)
        writer = ix.writer()
        writer.add_document(
            title=doc.title,
            description=doc.description,
            keywords=doc.keywords
        )
        writer.commit()
        with ix.searcher() as searcher:
            results = searcher.search(q)
            return bool(results)
This takes about 800 milliseconds per check, which is quite slow.
A better solution is to build a parser with pyparsing and then create your own nested query classes which can do the matching, tailored to your specific search queries. It's also quite extensible that way. That can bring the time down to ~40 microseconds, i.e. about 20,000 times faster.
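For reference, here is a minimal sketch of that pyparsing approach, assuming queries are limited to bare terms, quoted phrases, AND/OR and parentheses; all class and function names below are illustrative, not from any library:

from pyparsing import CaselessKeyword, QuotedString, Word, alphanums, infixNotation, opAssoc

class Term:
    def __init__(self, tokens):
        self.term = tokens[0].lower()
    def matches(self, text):
        return self.term in text

class And:
    def __init__(self, tokens):
        self.operands = tokens[0][0::2]   # skip the 'AND' tokens
    def matches(self, text):
        return all(op.matches(text) for op in self.operands)

class Or:
    def __init__(self, tokens):
        self.operands = tokens[0][0::2]   # skip the 'OR' tokens
    def matches(self, text):
        return any(op.matches(text) for op in self.operands)

term = (QuotedString('"') | Word(alphanums)).setParseAction(Term)
query = infixNotation(term, [
    (CaselessKeyword("AND"), 2, opAssoc.LEFT, And),
    (CaselessKeyword("OR"), 2, opAssoc.LEFT, Or),
])

def matches_subscription(text, query_string):
    parsed = query.parseString(query_string, parseAll=True)[0]
    return parsed.matches(text.lower())

print(matches_subscription("hello world, this is a test", 'hello AND (world OR universe)'))  # True

Field-based queries (e.g. title:foo) could be added the same way, with another parse rule and a Term variant that checks a specific field.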
I'm using google app engine datastore and have around 1500 blog posts in the datastore.
Using ndb
class BlogPost(ndb.Model):
    title = ndb.StringProperty(required=True)
    content = ndb.TextProperty()
    created = ndb.DateTimeProperty(auto_now_add=True)
So I'm using
words = self.request.get("q")
search_words = words.split()
query = libs.blogs_cache()  # returns a list of blog posts from memcache
search_results = [blog for blog in query for word in search_words
                  if word.lower() in blog.title.lower()]
This is an example of what I use for the time being. Unfortunately, it is extremely slow (it takes around 6 seconds) because it has to go through every single post to find the results, and if you use multiple words it multiplies the number of searches.
So my question is: what are some ways to speed up this search on Google App Engine? Any examples and pointers would be appreciated. Thanks in advance.
I think for this type of search, you should use the Google App Engine Search API.
https://cloud.google.com/appengine/docs/python/search/
Just feed the data into search documents and you can then query through them.
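A rough sketch of what that might look like on the legacy Python runtime; the index name and helper function names here are just examples:

from google.appengine.api import search

index = search.Index(name='blogposts')

def index_blog_post(post):
    # Feed each BlogPost entity into a search document.
    index.put(search.Document(
        doc_id=post.key.urlsafe(),
        fields=[
            search.TextField(name='title', value=post.title),
            search.TextField(name='content', value=post.content),
        ]))

def search_posts(query_string):
    # Full-text query across the indexed fields, e.g. 'title:python'.
    return index.search(query_string)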
If there are not too many words in search_words, you can make an IN query on the title:
search_words = [word.lower() for word in words.split()]
search_results = BlogPost.query(BlogPost.title.IN(search_words)).fetch()
Notice that this matches the title exactly, which might not be what you want, and if you need to query against lowercase blog titles, you probably also have to make a ComputedProperty for that.
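A sketch of that lowercase computed property; the title_lower name is just an example:

class BlogPost(ndb.Model):
    title = ndb.StringProperty(required=True)
    content = ndb.TextProperty()
    created = ndb.DateTimeProperty(auto_now_add=True)
    # Lowercased copy of the title, kept in sync automatically and indexed.
    title_lower = ndb.ComputedProperty(lambda self: self.title.lower())

search_results = BlogPost.query(BlogPost.title_lower.IN(search_words)).fetch()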
I think #omair_77's answer is likely best, but an alternative to consider, if the blog posts and the search lists are small enough, is a computed property:
class BlogPost(ndb.Model):
    title = ndb.StringProperty(required=True)
    content = ndb.TextProperty()
    created = ndb.DateTimeProperty(auto_now_add=True)
    # repeated=True because the computed value is a list of words
    words = ndb.ComputedProperty(lambda self: self.content.lower().split(),
                                 repeated=True)
Now, BlogPost.words.IN(words.lower().split()) will give you the desired semantics -- all blogs containing at least one of the words in space-separated string words, case-insensitive.
If you need to ignore punctuation you'll likely want regular expressions instead (re.findall(r'\w+', whatever.lower()) instead of the simple split calls), but the general ideas in GAE terms are the same: computed properties can be used in queries, and the IN operator locates entities with at least one "hit" -- and it does so rapidly, using indices on the back-end side of things.
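Putting it together, a query against that computed property might look like this, using the punctuation-tolerant re.findall variant mentioned above; `words` is the space-separated search string from the request, as in the question:

import re

search_words = re.findall(r'\w+', words.lower())
search_results = BlogPost.query(BlogPost.words.IN(search_words)).fetch()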
I have this problem where I am using the hostnames of all the URLs I have in my dataset as features. I'm not able to figure out how to use TfidfVectorizer to extract hostnames only from the URLs and calculate their weights.
For instance, I have a dataframe df where the column 'url' has all the URLs I need. I thought I had to do something like:
from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess(t):
    return urlparse(t).hostname

tfv = TfidfVectorizer(preprocessor=preprocess)
tfv.fit_transform([t for t in df['url']])
It doesn't seem to work this way, since it splits the hostnames instead of treating them as whole strings. I think it's to do with analyzer='word' (which it is by default), which splits the string into words.
Any help would be appreciated, thanks!
You are right: analyzer='word' creates a tokeniser that uses the default token pattern '(?u)\b\w\w+\b'. If you want to tokenise the entire URL as a single token, you can change the token pattern:
vect = CountVectorizer(token_pattern='\S+')
This tokenises https://www.pythex.org hello hello.there as ['https://www.pythex.org', 'hello', 'hello.there']. You can then create an analyser to extract the hostname from URLs as shown in this question. You can either extend CountVectorizer to change its build_analyzer method or just monkey patch it:
def my_analyser():
    # magic is a placeholder for a function that extracts the hostname from a URL,
    # among other things; preprocess is the function defined in the question
    return lambda doc: magic(preprocess(vect.decode(doc)))

vect = CountVectorizer(token_pattern=r'\S+')
vect.build_analyzer = my_analyser
vect.fit_transform(...)
Note: tokenisation is not as simple as it appears. The regex I've used has many limitations; e.g. it doesn't split the last token of a sentence from the first token of the next sentence if there isn't a space after the full stop. In general, regex tokenisers get very unwieldy very quickly. I recommend looking at nltk, which offers several different non-regex tokenisers.
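As an alternative sketch: if every entry in df['url'] really is a single URL, you can sidestep tokenisation entirely by passing a callable as analyzer, so each document yields exactly one token, its hostname (assuming Python 3's urllib.parse here):

from urllib.parse import urlparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Each URL becomes a single "token": its hostname.
tfv = TfidfVectorizer(analyzer=lambda url: [urlparse(url).hostname])
X = tfv.fit_transform(df['url'])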
Since the Fall update, GAE now supports partial searching. Per the documentation: "The API supports partial text matching on string fields".
This seems to be a very popular request, per many threads:
Partial matching GAE search API
Does GAE Datastore support 'partial text search'?
So I would assume a search for 'pyt' would now return 'python'
Has anyone gotten this to work? Doesn't work for me. I'm curious if there's some setting required, like the ~ for stemming.
"The API supports partial text matching on string fields" in https://cloud.google.com/appengine/docs/python/search/ refers to matching by tokens. Specifically, see https://cloud.google.com/appengine/docs/python/search/#Python_Tokenizing_string_fields ...:
The string is split into tokens wherever whitespace or special characters (punctuation marks, hash sign, etc.) appear. The index will include an entry for each token. This enables you to search for keywords and phrases comprising only part of a field's value.
Therefore your assumption:
So I would assume a search for 'pyt' would now return 'python'
is ill-founded -- "partial search" means parts of a document (a subset of the tokens in a text field thereof), not parts of each token. That would cause a combinatorial explosion: e.g., the single token python would have to be indexed as each and every one of the entries:
p
py
pyt
pyth
pytho
python
y
yt
yth
ytho
ython
t
th
tho
thon
h
ho
hon
o
on
n
If you want that, it's easy enough to write your own code to produce the explosion (producing a pseudo-document with all of these substrings from a real starting document) -- but, for any non-trivial starting document, you may easily end up either paying for a ridiculous amount of resources, or hitting a hard ceiling of absolute maximum quotas.
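A minimal sketch of that explosion, purely to illustrate the cost; this is my own illustration, not anything the Search API provides:

def explode(text):
    # Every substring of every token, so prefix/infix queries can match.
    # The output grows roughly quadratically with token length.
    substrings = set()
    for token in text.lower().split():
        for start in range(len(token)):
            for end in range(start + 1, len(token) + 1):
                substrings.add(token[start:end])
    return ' '.join(sorted(substrings))

# explode('python') produces exactly the 21 entries listed above.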
Hint: if you do a web search for "pyt", do you find docs containing "python"? Try it -- the former gives 10 million hits (Peninsula Youth Theater, Michael Jackson's "P.Y.T. (Pretty Young Thing)", etc.), the latter 180 million hits (the language, the snake, the comedy group :-).
In the Mongo shell, the following works fine:
> db.posts.find("this.text.indexOf('Hello') > 0")
But with pymongo, when executing the following:
for post in db.posts.find("this.text.indexOf('Hello') > 0"):
    print post['text']
an error occurs.
I think Full Text Search in Mongo is a better way to go for this case, but is it possible to use the "find" method with a "javascript" query in pymongo?
You are correct -- you do this with server-side JavaScript by using the $where clause [1]:
db.posts.find({"$where": "this.text.indexOf('Hello') > 0"})
This will work on all but sharded setups, but the cost is considered prohibitive, as you will be inspecting every document in the collection, which is why it's generally not considered a great idea.
You could also do a regular expression search:
db.posts.find({'text':{'$regex':'Hello'}})
This will also do a full collection scan because the regular expression isn't anchored (if you anchor a regular expression, for example to check whether a field begins with a value, and you have an index on that field, the index can be utilised).
Given that those two approaches are expensive and won't perform or scale well, what is the best approach?
Well, the full text search approach described in the link you gave [2] works well. Create a _keywords field which stores the keywords in lowercase in an array, index that field, and then you can query like so:
db.posts.find({"_keywords": {"$in": ["hello"]}})
That will scale and utilises an index, so it will be performant.
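For completeness, a sketch of the approaches above in pymongo; `db` is assumed to be a pymongo Database object, and the database name is just an example:

import re
from pymongo import MongoClient

db = MongoClient().blog  # example database name

# 1. Server-side JavaScript via $where (full collection scan, slow):
posts = db.posts.find({"$where": "this.text.indexOf('Hello') > 0"})

# 2. Regular expression; unanchored, so also a full scan:
posts = db.posts.find({"text": {"$regex": "Hello"}})
# Anchored to the start of the field, this can use an index on `text`:
posts = db.posts.find({"text": {"$regex": "^Hello"}})
# A compiled pattern works too:
posts = db.posts.find({"text": re.compile("Hello")})

# 3. Keyword array with an index (fast):
db.posts.create_index("_keywords")
posts = db.posts.find({"_keywords": {"$in": ["hello"]}})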
[1] http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-JavascriptExpressionsand%7B%7B%24where%7D%7D
[2] http://www.mongodb.org/display/DOCS/Full+Text+Search+in+Mongo
Usually when we search, we have a list of stories, we provide a search string, and expect back a list of results where the given search string matches the story.
What I am looking to do is the opposite: given a list of search strings and one story, find out which search strings match that story.
Now this could be done with re, but the case here is that I want to use complex search queries as supported by Solr. Full details of the query syntax are here. Note: I won't use boost.
Basically I want to get some pointers for the doesitmatch function in the sample code below.
def doesitmatch(contents, searchstring):
    """
    returns result of searching contents for searchstring (True or False)
    """
    ???????
    ???????

story = "big chunk of story 200 to 1000 words long"

searchstrings = ['sajal', 'sajal AND "is a jerk"', 'sajal kayan',
                 'sajal AND (kayan OR bangkok OR Thailand OR ( webmaster AND python))',
                 'bangkok']

matches = [[searchstr] for searchstr in searchstrings if doesitmatch(story, searchstr)]
Edit: Additionally, I would also be interested to know if any module exists to convert a Lucene query like the one below into a regex:
sajal AND (kayan OR bangkok OR Thailand OR ( webmaster AND python) OR "is a jerk")
After extensive googling, I realized what I am looking to do is a Boolean search.
Found code that makes regexes Boolean-aware: http://code.activestate.com/recipes/252526/
Issue looks solved for now.
Probably slow, but easy solution:
Make a query on the story plus each string to the search engine. If it returns anything, then it matches.
Otherwise you need to implement the search syntax yourself. If that includes things like "title:" and similar, this can be rather complex. If it's only the AND and OR from your example, then it's a recursive function that isn't too hairy.
Some time ago I looked for a Python implementation of Lucene and I came across Whoosh, which is a pure-Python text search engine. Maybe it will satisfy your needs.
You can also try PyLucene, but I didn't investigate that one.
Here's a suggestion in pseudocode. I'm assuming you store a story identifier with the search terms in the index, so that you can retrieve it with the search results.
def search_strings_matching(story_id_to_match, search_strings):
    result = set()
    for s in search_strings:
        result_story_ids = query_index(s)  # query_index returns an id iterable
        if story_id_to_match in result_story_ids:
            result.add(s)
    return result
This is probably less interesting to you now, since you've already solved your problem, but what you're describing sounds like Prospective Search, which is what you call it when you have the query first and you want to match it against documents as they come along.
Lucene's MemoryIndex is a class that was designed specifically for something like this, and in your case it might be efficient enough to run many queries against a single document.
This has nothing to do with Python, though. You'd probably be better off writing something like this in Java.
If you are writing Python on AppEngine, you can use the AppEngine Prospective Search Service to achieve exactly what you are trying to do here. See: http://code.google.com/appengine/docs/python/prospectivesearch/overview.html