Whoosh - performance issues with wildcard searches (*something) - python

I'm noticing that searches like *something consume huge amounts of CPU. I'm using Whoosh 2.4.1. I suppose this is because I don't have indexes covering this search case. something* works fine; *something doesn't.
How do you deal with these queries? Is there a special way to declare your schema that makes this kind of query possible?
Thanks!

That's quite a fundamental problem: prefixes are usually easy to find (like when searching foo*), postfixes are not (like *foo).
Prefix + wildcard searches get optimized to first do a fast prefix search and then a slow wildcard search on the results of that first step.
You can't do that optimization with wildcard + postfix. But there is a trick:
If you really need this often, you could try indexing a reversed copy of the string (and also searching for the reversed search string), so the postfix search becomes a prefix search.
Something like:
add_document(title=title, title_rev=title[::-1])
...
# then query = u"*foo"[::-1], search in title_rev field.
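For reference, a minimal sketch of that idea (the directory, field names, and sample data are made up, not from the question):

import os
from whoosh import index
from whoosh.fields import Schema, TEXT
from whoosh.query import Prefix

# Index each title twice: once as-is, once reversed.
schema = Schema(title=TEXT(stored=True), title_rev=TEXT)
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")
ix = index.create_in("indexdir", schema)

writer = ix.writer()
title = u"seafood platter"
writer.add_document(title=title, title_rev=title[::-1])
writer.commit()

# "*food" turns into a cheap prefix search for "doof" on the reversed field,
# because reversing the title also reverses every indexed word.
with ix.searcher() as searcher:
    for hit in searcher.search(Prefix("title_rev", u"food"[::-1])):
        print(hit["title"])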

Related

PyMongo $regex across all text fields and subfields

I have a rather convoluted Mongo collection and I'm trying to implement detailed matching criteria. I have already created a text index across all fields as follows:
db.create_index([("$**", "text")], name='allTextFields')
I am using this for some straightforward search terms in PyMongo (e.g., "immigration") as follows:
db.find({'$text': {'$search': "immigration"}})
However, there are certain terms I need to match that are generic enough to require regex-type specifications. For instance, I want to match all occurrences of "ice" without finding "police" and a variety of other exclusion terms.
Ideally, I could create a regex that would search all fields and subfields (see example below), but I can't figure out how to implement this in PyMongo (or Mongo for that matter).
db.find({all_fields_and_subfields: {'$regex': '^ice\s*', '$options': 'i'}})
Does anyone know how to do so?
One way of doing this is to add another field to the documents which contains a concatenation of all the fields you want to search, and $regex on that.
Note that unless your regexes are anchored to the beginning of input, they won't be using indexes (so you'll be doing collection scans).
I am surprised that a full text query for "ice" finds "police", surely that's a bug somewhere.
You may also consider Atlas search instead of full-text search, which is more powerful but proprietary to Atlas.
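A rough sketch of the concatenated-field idea from the first paragraph of this answer (the collection name and the title/summary/body/search_blob field names are assumptions, not part of the question):

from pymongo import MongoClient

coll = MongoClient().mydb.mycoll           # assumed database / collection

# Maintain a single concatenated field whenever documents are written or updated.
for doc in coll.find({}, {"title": 1, "summary": 1, "body": 1}):
    blob = " ".join(str(doc.get(f, "")) for f in ("title", "summary", "body"))
    coll.update_one({"_id": doc["_id"]}, {"$set": {"search_blob": blob}})

# A word-boundary regex so "ice" does not also match "police".
# Since the pattern is not anchored to the start, this is a collection scan, as noted above.
matches = list(coll.find({"search_blob": {"$regex": r"\bice\b", "$options": "i"}}))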

How to implement a custom spell check in the search API of GAE

In my Python GAE application, I allow users to query items using the Search API. I initially put in the documents with the exact tags, but the hits are few given the spell correction that needs to be present.
The approach I found was to implement character n-grams via the datastore, as this ensures that the user has typed at least part of the word correctly. On the datastore this takes a lot of time.
For example,
"hello" (is broken into) ["hello", "ello", "hell", "hel", "elo", "llo", "he", "el", "ll", "lo"]
and when i search for "helo"
tags -["hel", "elo", "he", "el", "lo"] ( give a positive match)
I rank them according to the length of the tags matched from a word.
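For illustration, a small sketch of this n-gram generation and length-based ranking, simplified to contiguous substrings only (it will not produce skip-grams such as "elo" from "hello"):

def char_ngrams(word, min_len=2):
    """All contiguous substrings of `word` with at least `min_len` characters."""
    return {word[i:j]
            for i in range(len(word))
            for j in range(i + min_len, len(word) + 1)}

indexed = char_ngrams("hello")      # {'he', 'hel', 'hell', 'hello', 'ell', ...}
query = char_ngrams("helo")         # n-grams of the misspelled query

# Rank shared n-grams by length, longest first, as a rough closeness measure.
matches = sorted(indexed & query, key=len, reverse=True)
print(matches)                      # ['hel', ...] plus 'he', 'el', 'lo' in some order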
On the datastore,
I have to index these character n-grams separately along with the entities they match, and for each word perform the search on every tag in a similar manner, which takes a lot of time.
Is there a way of achieving a similar operation using the Search API?
Does MatchScorer look into multiple fields combined with "OR"?
Looking for ways to design the search documents and perform multiple spell corrected queries in minimal operations.
If I have multiple fields for languages in each document, for example:
([tags - "hello world"] [rank - 2300] [partial tags - "hel", "ell", "llo", "wor", "orl", "rld", "hell", "ello", "worl", "orld"] [English - 1] [Spanish - 0] [French - 0] [German - 0])
Can I perform a MatchScore operation along with sort on the language fields? (each document is associated to only one language)
The Search API is a good service for this and much better suited than the datastore. If your search documents have the correct language set, the Search API will cover certain language-specific variations (e.g. singular / plural). But the Search API only works on words (typically separated by spaces, hyphens, dots, etc.).
UPDATE: Language is defined either in the language property of a field, or in the language property of the entire document. In either case, the value is a two-letter ISO 639-1 language code, for example 'de' for German.
For tokenizing search terms ("hel", "elo",...), you can use the pattern from this answer: https://stackoverflow.com/a/13171181/1549523
Also see my comment on that answer. If you want to enforce a minimum token length (e.g. only 3+ letters) to save storage and frontend instance time, you can use the code I've linked there.
MatchScorer is helpful for weighting the frequency of a given term in a document. Since tags typically occur only once per document, it wouldn't help you with that. But, for example, if you were searching research papers for the term "combustion", MatchScorer would rank the results, showing first the papers that contain the term most often.
Faceted search would add so-called facets to the result of your search query, i.e. (by default) the 10 most frequently occurring facets for the current query are returned, too. This is helpful with tags or categories, so users can drill down their search by applying any of these suggested filters.
If you want to suggest the correctly spelled search term to users, it might make sense to use two indices: one primary index for your actual search documents (e.g. product descriptions with tags), and a second index just for tags or categories (tokenized, and possibly with synonyms). When a user types into the search field, your app first queries the tag index and suggests matching tags. If the user selects one of them, that tag is used to query the primary search index. This would help users pick correct tags.
Those tags could be managed in the datastore of course, including their synonyms, if there are people maintaining such lists. And every time a tag is stored, your app updates the corresponding search document (in the secondary index) including all the character ngrams (tokens).
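A hedged sketch of that two-index setup with the GAE Search API (the index names, field names, and the 3-letter tokenizer are assumptions):

from google.appengine.api import search

TAG_INDEX = search.Index(name='tag_suggestions')   # secondary index: tags only
MAIN_INDEX = search.Index(name='products')         # primary index: real documents


def _tokens(text, min_len=3):
    # Contiguous substrings (3+ chars) of every word, space-separated,
    # so partial input like "hel" matches a stored token exactly.
    return ' '.join(word[i:j]
                    for word in text.lower().split()
                    for i in range(len(word))
                    for j in range(i + min_len, len(word) + 1))


def store_tag(tag):
    TAG_INDEX.put(search.Document(
        fields=[search.TextField(name='tag', value=tag),
                search.TextField(name='tokens', value=_tokens(tag))]))


def suggest_tags(user_input):
    # Each hit carries the original tag; suggest those to the user.
    return [doc.field('tag').value
            for doc in TAG_INDEX.search('tokens:%s' % user_input)]


def search_products(tag):
    # The selected tag then queries the primary index.
    return MAIN_INDEX.search('tags:%s' % tag)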

Python: Array vs Database for storage of key/value

Q: Which is quicker for this scenario?
My scenario: my application will store a list of links in either an array or a PostgreSQL db, so it might look like:
1) mysite.com
a) /users/login
b) /users/registration/
c) /contact/
d) /locate/search
e) /priv/admin-login
For the entries under 1), I will be doing string searches on these URLs to find, for example, any path that contains 'login'.
The letters a) through e) could have anywhere from 5 to 100 more entries for a given domain.
The usage: this data structure can potentially change as often as every day, but only once per day. Some key/values will be removed, others will be modified. An individual set looks like:
dict2 = { 'thesite.com': 123, 98.6: 37 };
Each key will represent 1 and only 1 domain.
I've tried searching a bit on this, but cannot seem to find a really good answer to: when should an array be used, and when should a db like PostgreSQL be used?
I've always used a db to handle data (MySQL, not PostgreSQL), but I'm trying to do better from now on, so I wondered whether an array or another data structure would work better within a loop while trying to match a given string.
As always, thank you!
A full SQL database would probably be overkill. If you can fit everything in memory, put it all in a dict and then use the pickle module to serialize it and write it to the disk.
Another good option would be to use one of the dbm modules (dbm/dbm.ndbm, gdbm or anydbm) to store the data in a disk-bound hash table. It will have O(1) lookup times without the need to connect and form a query like in a bigger database.
edit: If you have multiple values per key and you don't want a full-blown database, SQLite would be a good choice. There is already a built-in module for it, sqlite3 (as mentioned in the comments).
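A quick sketch of the first suggestion, a plain dict pickled to disk (the file name and data are just placeholders):

import pickle

links = {
    'mysite.com': ['/users/login', '/users/registration/', '/contact/',
                   '/locate/search', '/priv/admin-login'],
}

# Persist after the once-a-day update...
with open('links.pickle', 'wb') as f:
    pickle.dump(links, f)

# ...and load everything back into memory at startup.
with open('links.pickle', 'rb') as f:
    links = pickle.load(f)

# The search is then an ordinary in-memory loop.
hits = [path for path in links['mysite.com'] if 'login' in path]
print(hits)   # ['/users/login', '/priv/admin-login']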
Test it. It's your dataset, your hardware, your available disk and network IO, your usage pattern. There's no one true answer here. We don't even know how many queries you are planning - are we talking about one per minute or thousands per second?
If your data fits nicely in memory and doesn't take a massive amount of time to load the first time, sticking it into a dictionary in memory will probably be faster.
If you're always looking for full words (like in the login case), you will gain some speed too from splitting the url into parts and indexing those separately.
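For instance, splitting and indexing the parts could look roughly like this (a hypothetical in-memory index, not a specific library):

from collections import defaultdict

paths = ['/users/login', '/users/registration/', '/contact/',
         '/locate/search', '/priv/admin-login']

# Map each word in a path segment to every path that contains it.
segment_index = defaultdict(set)
for path in paths:
    for segment in path.strip('/').split('/'):
        for word in segment.split('-'):        # 'admin-login' -> 'admin', 'login'
            segment_index[word].add(path)

print(segment_index['login'])   # {'/users/login', '/priv/admin-login'}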

Improving django search

I have the following search:
titles = Title.objects.filter(title__icontains=search)
If this is a search to find:
Thomas: Splish, Splash, Splosh
I can type in something like "Thomas" or "Thomas: Splish, Splash, Splosh" and it will work.
However, if I type in something like "Thomas Splash", it will not work. How would I improve the search to handle something like that? (Also note that if we split on words, the comma and other non-alphanumerics should be ignored - for example, the split words should not be "Thomas:" and "Splish," but rather "Thomas", "Splish", etc.)
This kind of search is starting to push the boundaries of Django and the ORM. Once it gets to this level of complexity, I always switch over to a system that is built entirely for search. I dig Lucene, so I usually go for Elasticsearch or Solr.
Keep in mind that full text searching is a subsystem all unto itself, but can really add a lot of value to your site.
As Django models are built on database queries, there is not much magic you can do.
You could split your search on non-alphanumeric chars and search for objects containing all of the words, but this will not be smart or efficient.
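That approach could look roughly like this (assuming the Title model from the question; the app path in the import is made up):

import operator
import re
from functools import reduce

from django.db.models import Q

from myapp.models import Title   # hypothetical app path; Title is the question's model


def title_search(search):
    """Match titles containing every word of the query, ignoring punctuation."""
    words = [w for w in re.split(r'[^0-9A-Za-z]+', search) if w]
    if not words:
        return Title.objects.none()
    # "Thomas Splash" -> title must icontain "Thomas" AND icontain "Splash".
    condition = reduce(operator.and_, (Q(title__icontains=w) for w in words))
    return Title.objects.filter(condition)

With that, title_search("Thomas Splash") would match "Thomas: Splish, Splash, Splosh", since each word is matched independently.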
If you want something really smart maybe you should check out haystack:
http://haystacksearch.org/

Storing an inverted index

I am working on a project on Info Retrieval.
I have made a Full Inverted Index using Hadoop/Python.
Hadoop outputs the index as (word, documentlist) pairs which are written to a file.
For quick access, I have created a dictionary (hashtable) from the above file.
My question is, how do I store such an index on disk so that it also has quick access time?
At present I am storing the dictionary using the Python pickle module and loading from it,
but that brings the whole index into memory at once (or does it?).
Please suggest an efficient way of storing and searching through the index.
My dictionary structure is as follows (using nested dictionaries)
{word : {doc1:[locations], doc2:[locations], ....}}
so that I can get the documents containing a word by
dictionary[word].keys() ... and so on.
shelve
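That is, something along these lines (the file name is arbitrary); the shelf stays on disk and only the entries you look up are unpickled:

import shelve

# Build (or update) the shelf once from the Hadoop output.
db = shelve.open('inverted_index.shelf')
db['hadoop'] = {'doc1': [3, 17], 'doc2': [8]}   # word -> {doc: [locations]}
db.close()

# Later, look up a single word without loading the whole index.
db = shelve.open('inverted_index.shelf', flag='r')
postings = db.get('hadoop', {})
print(list(postings.keys()))    # documents containing the word
db.close()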
At present I am storing the dictionary using python pickle module and loading from it but it brings the whole of index into memory at once (or does it?).
Yes it does bring it all in.
Is that a problem? If it's not an actual problem, then stick with it.
If it's a problem, what kind of problem do you have? Too slow? Too fast? Too colorful? Too much memory used? What problem do you have?
I would use Lucene. Why reinvent the wheel?
Just store it in a string like this:
<entry1>,<entry2>,<entry3>,...,<entryN>
If <entry*> contains the ',' character, use some other delimiter like '\t'.
This is smaller in size than an equivalent pickled string.
If you want to load it, just do:
L = s.split(delimiter)
You could store the repr() of the dictionary and use that to re-create it.
If it's taking a long time to load or using too much memory, you might need a database. There are many you might use; I would probably start with SQLite. Then your problem is "reduced" ;-) to simply formulating the right query to get what you need out of the database. This way you will only load what you need.
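A rough sketch of that SQLite route (the table layout is just one possibility):

import sqlite3

conn = sqlite3.connect('index.db')
conn.execute('''CREATE TABLE IF NOT EXISTS postings
                (word TEXT, doc TEXT, locations TEXT)''')
conn.execute('CREATE INDEX IF NOT EXISTS idx_word ON postings(word)')

# One row per (word, document) pair; locations kept as a comma-separated string.
conn.execute('INSERT INTO postings VALUES (?, ?, ?)', ('hadoop', 'doc1', '3,17'))
conn.commit()

# Load only the postings for the word you need.
rows = conn.execute('SELECT doc, locations FROM postings WHERE word = ?',
                    ('hadoop',)).fetchall()
print(rows)   # [('doc1', '3,17')]
conn.close()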
I am using anydbm for that purpose. Anydbm provides the same dictionary-like interface, except it allows only strings as keys and values. But this is not a constraint, since you can use cPickle's loads/dumps to store more complex structures in the index.
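A short sketch of that combination, written here with the Python 3 module names (dbm instead of anydbm, pickle instead of cPickle):

import dbm      # `anydbm` on Python 2
import pickle   # `cPickle` on Python 2

db = dbm.open('inverted_index.db', 'c')
# dbm keys and values are plain strings/bytes, so pickle the postings dict.
db['hadoop'] = pickle.dumps({'doc1': [3, 17], 'doc2': [8]})

postings = pickle.loads(db['hadoop'])
print(postings['doc1'])   # [3, 17]
db.close()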
