PyMongo $regex across all text fields and subfields - python

I have a rather convoluted Mongo collection and I'm trying to implement detailed matching criteria. I have already created a text index across all fields as follows:
db.create_index([("$**", "text")], name='allTextFields')
I am using this for some straightforward search terms in PyMongo (e.g., "immigration") as follows:
db.find({'$text': {'$search': "immigration"}})
However, there are certain terms I need to match that are generic enough to require regex-style specifications. For instance, I want to match all occurrences of "ice" without matching "police" and a variety of other terms that contain it.
Ideally, I could create a regex that would search all fields and subfields (see example below), but I can't figure out how to implement this in PyMongo (or Mongo for that matter).
db.find({all_fields_and_subfields: {'$regex': r'^ice\s*', '$options': 'i'}})
Does anyone know how to do so?

One way of doing this is to add another field to the documents which contains a concatenation of all the fields you want to search, and $regex on that.
Note that unless your regexes are anchored to the beginning of input, they won't be using indexes (so you'll be doing collection scans).
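A minimal sketch of that approach in PyMongo, assuming a collection handle coll; the helper flatten_values, the all_text field name, and the database/collection names are all made up for illustration:

from pymongo import MongoClient

coll = MongoClient()["mydb"]["mycoll"]  # hypothetical db/collection names

def flatten_values(value):
    # Recursively collect every string value from all fields and subfields.
    if isinstance(value, dict):
        return [s for v in value.values() for s in flatten_values(v)]
    if isinstance(value, list):
        return [s for v in value for s in flatten_values(v)]
    return [value] if isinstance(value, str) else []

# One-off pass: store the concatenation alongside the original fields.
for doc in coll.find():
    doc.pop("all_text", None)  # don't fold the concatenation into itself on reruns
    coll.update_one({"_id": doc["_id"]},
                    {"$set": {"all_text": " ".join(flatten_values(doc))}})

# Word-boundary regex: matches "ice" but not "police".
matches = coll.find({"all_text": {"$regex": r"\bice\b", "$options": "i"}})

The \b word boundaries are what keep "police" out; note that, unlike ^ice, a pattern starting with \b is not anchored to the beginning of input, so it still cannot use an index.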
I am surprised that a full text query for "ice" finds "police", surely that's a bug somewhere.
You may also consider Atlas search instead of full-text search, which is more powerful but proprietary to Atlas.

Related

How to implement a custom spell check in the search API of GAE

In my Python GAE application, I am allowing users to query items using the Search API. I initially put in the documents with the exact tags, but there are few hits, given the spell correction that needs to be present.
The way I found was to implement character ngrams via the datastore, as this ensures that the user is typing at least part of the word correctly. On the datastore this takes a lot of time.
For example,
"hello" (is broken into) ["hello", "ello", "hell", "hel", "elo", "llo", "he", "el", "ll", "lo"]
and when I search for "helo"
tags - ["hel", "elo", "he", "el", "lo"] (give a positive match)
I rank them according to the length of the tags matched from a word.
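A minimal sketch of that scheme in Python; note it only generates contiguous substrings, while the question's example list also contains tokens like "elo" for "hello" that are not contiguous:

def char_ngrams(word, min_len=2):
    # All contiguous substrings of `word` with length >= min_len.
    return {word[i:j]
            for i in range(len(word))
            for j in range(i + min_len, len(word) + 1)}

def match_score(query, tags):
    # Rank by the total length of the query ngrams found among a word's tags.
    return sum(len(g) for g in char_ngrams(query) if g in tags)

# char_ngrams("hello") -> {"he", "el", "ll", "lo", "hel", "ell", "llo", ...}
# match_score("helo", char_ngrams("hello")) gives a positive score.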
On Datastore,
I have to index these character ngrams separately, along with the entities they match, and for each word perform the search on every tag in a similar manner. This takes a lot of time.
Is there a way of achieving a similar operation using the Search API?
Does MatchScorer look into multiple fields combined with "OR"?
I'm looking for ways to design the search documents and perform multiple spell-corrected queries in minimal operations.
If I have multiple fields for languages in each document, e.g.:
[tags - "hello world"] [rank - 2300] [partial tags - "hel", "ell", "llo", "wor", "orl", "rld", "hell", "ello", "worl", "orld"] [english - 1] [spanish - 0] [french - 0] [german - 0]
Can I perform a MatchScorer operation along with a sort on the language fields? (Each document is associated with only one language.)
Search API is a good service for this and much better suited than datastore. If your search documents have the correct language set, Search API will cover certain language specific variations (e.g. singular / plural). But Search API only works for words (typically separated by spaces, hyphens, dots etc.).
UPDATE: Language is defined either in the language property of a field, or in the language property of the entire document. In either case, the value is a two-letter ISO 639-1 language code, for example 'de' for German.
For tokenizing search terms ("hel", "elo",...), you can use the pattern from this answer: https://stackoverflow.com/a/13171181/1549523
Also see my comment to that answer. If you want to use a minimal token length (e.g. only 3+ letters) to save storage size and frontend instance time, you can use the code I've linked there.
MatchScorer is helpful to weight the frequency of a given term in a document. Since tags typically occur only once per document, it wouldn't help you with that. But for example, if your search is about searching in research papers for the term "combustion", MatchScorer would rank the results, showing first the papers that have the term included most often.
Faceted search would add so-called facets to the result of your search query, i.e. (by default) the 10 most frequently occurring facets for the current query are returned, too. This is helpful with tags or categories, so users can drill down their search by applying any of these suggested filters.
If you want to suggest the correctly spelled search term to users, it might make sense to use two indices. One index, the primary index, is for your actual search documents (e.g. product descriptions with tags); a second index is just for tags or categories (tokenized, and possibly with synonyms). If your user types into a search field, your app first queries the tag index, suggesting matching tags. If the user selects one of them, the tag is used to query the primary search index. This helps users pick correct tags.
Those tags could be managed in the datastore, of course, including their synonyms, if there are people maintaining such lists. And every time a tag is stored, your app updates the corresponding search document (in the secondary index), including all the character ngrams (tokens).
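A rough sketch of that two-index flow with the GAE Search API; the index names, field names, and query strings are made up for illustration, and char_ngrams is the tokenizer sketched above:

from google.appengine.api import search

TAG_INDEX = search.Index(name='tags')          # secondary index: tokenized tags
PRIMARY_INDEX = search.Index(name='products')  # primary index: actual documents

def store_tag(tag):
    # Store a tag with its character-ngram tokens so partial input matches.
    tokens = " ".join(char_ngrams(tag))
    TAG_INDEX.put(search.Document(
        doc_id=tag,
        fields=[search.TextField(name='tag', value=tag),
                search.TextField(name='tokens', value=tokens)]))

def suggest_tags(user_input):
    # First query: suggest correctly spelled tags for what the user typed.
    results = TAG_INDEX.search('tokens: %s' % user_input)
    return [doc.doc_id for doc in results]

def search_products(tag):
    # Second query: the user picked a suggested tag; query the primary index.
    return PRIMARY_INDEX.search('tags: %s' % tag)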

Improving django search

I have the following search:
titles = Title.objects.filter(title__icontains=search)
If this is a search to find:
Thomas: Splish, Splash, Splosh
I can type in something like "Thomas" or "Thomas: Splish, Splash, Splosh" and it will work.
However, if I type in something like "Thomas Splash", it will not work. How would I improve the search to handle that? (Also note that if we split on words, commas and other non-alphanumerics should be ignored: the split words should not be "Thomas:", "Splish," but rather "Thomas", "Splish", etc.)
This kind of search is starting to push the boundaries of Django and the ORM. Once it gets to this level of complexity, I always switch over to a system built entirely for search. I dig Lucene, so I usually go for ElasticSearch or Solr.
Keep in mind that full text searching is a subsystem all unto itself, but can really add a lot of value to your site.
As Django models are backed by database queries, there is not much magic you can do.
You could split your search on non-alphanumeric characters and search for objects containing all of the words (see the sketch below), but this will not be smart or efficient.
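A sketch of that word-splitting approach with Q objects, assuming the Title model from the question:

import operator
import re
from functools import reduce

from django.db.models import Q

def search_titles(search):
    # Split on anything non-alphanumeric, so "Thomas: Splish," yields
    # "Thomas" and "Splish" without the punctuation.
    words = [w for w in re.split(r'\W+', search) if w]
    if not words:
        return Title.objects.none()
    # Require every word to appear somewhere in the title.
    condition = reduce(operator.and_, (Q(title__icontains=w) for w in words))
    return Title.objects.filter(condition)

# search_titles("Thomas Splash") now matches "Thomas: Splish, Splash, Splosh".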
If you want something really smart, maybe you should check out Haystack:
http://haystacksearch.org/

Whoosh - performance issues with wildcard searches (*something)

I'm noticing that searches like *something consume huge amounts of CPU. I'm using Whoosh 2.4.1. I suppose this is because I don't have indexes covering this search case. something* works fine; *something doesn't.
How do you deal with these queries? Is there a special way to declare your schemas which makes this kind of queries possible?
Thanks!
That's quite a fundamental problem: prefixes are usually easy to find (as when searching for foo*), but postfixes are not (as with *foo).
Prefix-plus-wildcard searches are optimized to first do a fast prefix search and then a slow wildcard search on the results from the first step.
You can't do that optimization with wildcard-plus-postfix. But there is a trick:
If you really need that often, you could try indexing a reversed string (and also searching for the reversed search string), so the postfix search becomes a prefix search:
Something like:
add_document(title=title, title_rev=title[::-1])
...
# then query = u"*foo"[::-1], search in title_rev field.
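Fleshed out a little, assuming Whoosh 2.x; the index directory and field names are arbitrary:

import os

from whoosh import index
from whoosh.fields import Schema, TEXT
from whoosh.qparser import QueryParser

schema = Schema(title=TEXT(stored=True), title_rev=TEXT)
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")
ix = index.create_in("indexdir", schema)

writer = ix.writer()
title = u"seafood"
writer.add_document(title=title, title_rev=title[::-1])
writer.commit()

# The slow leading-wildcard search *food becomes a fast prefix search
# doof* on the reversed field.
with ix.searcher() as searcher:
    query = QueryParser("title_rev", ix.schema).parse(u"*food"[::-1])  # "doof*"
    for hit in searcher.search(query):
        print(hit["title"])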

Quicker way of updating subdocuments

My JSON documents (called "i") have subdocuments (called "elements").
I am looping through these subdocuments and updating them one at a time. However, to do so (once the value I need is computed), Mongo has to scan through all the documents in the database, then through all the subdocuments, and then find the subdocument it needs to update.
I am having major time issues, as I have ~3000 documents and this is taking about 4 minutes.
I would like to know if there is a quicker way to do this, without Mongo having to scan all the documents, but by doing it within the loop.
Here is the code:
for i in db.stuff.find():
    for element in i['counts']:
        computed_value = element[a] + element[b]
        db.stuff.update({'id': i['id'], 'counts.timestamp': element['timestamp']},
                        {'$set': {'counts.$.total': computed_value}})
I am identifying the overall document by "id" and then the subdocument by its timestamp (which is unique to each subdocument). I need to find a quicker way than this. Thank you for your help.
What indexes do you have on your collection? This could probably be sped up by creating an index on your embedded documents. You can do this using dot notation -- there's a good explanation and example here.
In your case, the subdocuments live under counts and you filter on their timestamp, so you'd do something like
db.stuff.ensureIndex({ "counts.timestamp": 1 });
This will make your searches through embedded documents run much faster.
Your update is based on id (and I assume it differs from Mongo's default _id), so put an index on your id field.
Do you want to set the new field for all documents in the collection, or only for documents matching given criteria? If only for matching documents, use a query operator (with an index if possible).
Don't fetch the full document; fetch only the fields that are actually used.
What is your average document size? Use explain and mongostat to understand what the actual bottleneck is.
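Putting those points together, a sketch using a projection plus bulk_write to cut the per-subdocument round trips; the 'a'/'b' field names stand in for whatever element[a] + element[b] reads in the question:

from pymongo import UpdateOne

ops = []
# Fetch only the fields actually used, not the full documents.
projection = {'id': 1, 'counts.a': 1, 'counts.b': 1, 'counts.timestamp': 1}
for doc in db.stuff.find({}, projection):
    for element in doc['counts']:
        ops.append(UpdateOne(
            {'id': doc['id'], 'counts.timestamp': element['timestamp']},
            {'$set': {'counts.$.total': element['a'] + element['b']}}))

# One round trip per batch instead of one per subdocument update.
if ops:
    db.stuff.bulk_write(ops, ordered=False)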

Searching across multiple tables (best practices)

I have a property management application consisting of tables:
tenants
landlords
units
properties
vendors-contacts
Basically I want one search field to search them all, rather than having to select which category I am searching. Would this be an acceptable solution (technology-wise)?
Will searching across 5 tables be OK in the long run and not bog down the server? What's the best way of accomplishing this?
Using PostgreSQL
Why not create a view which is a union of the tables which aggregates the columns you want to search on into one, and then search on that aggregated column?
You could do something like this:
select 'tenants:' || t.id::text, <shared fields> from tenants as t union
select 'landlords:' || l.id::text, <shared fields> from landlords as l union
...
This requires some logic to be embedded from the client querying; it has to know how to fabricate the key that it's looking for in order to search on a single field.
That said, it's probably better if you just have a separate column which contains a "type" value (e.g. landlord, tenant) and then filter on both the type and the ID, as it will be computationally less expensive (and can be optimized better).
You want to use the built-in full text search or a separate product like Lucene. This is optimised for unstructured searches over heterogeneous data.
Also, don't forget that normal indices cannot be used for something LIKE '%...%'. Using a full text search engine will also be able to do efficient substring searches.
I would suggest using a specialized full-text indexing tool like Lucene for this. It will probably be easier to get up and running, and the result is faster and more featureful too. Postgres full text indexes will be useful if you also need structured search capability on top of this or transactionality of your search index is important.
If you do want to implement this in the database, something like the following scheme might work, assuming you use surrogate keys:
for each searchable table create a view that has the primary key column of that table, the name of the table and a concatenation of all the searchable fields in that table.
create a functional GIN or GiST index on the underlying table over the to_tsvector() of the exact same concatenation.
create a UNION ALL over all the views to create the searchable view.
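A sketch of those three steps for one table, matching the SQL used below; the column names (name, notes) and the landlords_search view are illustrative, and each searchable table gets the same treatment:

-- 1. One view per searchable table: primary key, table name, and the
--    concatenation of the searchable fields.
CREATE VIEW tenants_search AS
  SELECT id,
         'tenants'::text AS table_name,
         to_tsvector('english', coalesce(name, '') || ' ' || coalesce(notes, '')) AS body
  FROM tenants;

-- 2. A functional GIN index on the underlying table, over the exact same
--    expression, so searches through the view can use it.
CREATE INDEX tenants_fts_idx ON tenants
  USING gin (to_tsvector('english', coalesce(name, '') || ' ' || coalesce(notes, '')));

-- 3. UNION ALL over all the per-table views.
CREATE VIEW search_view AS
  SELECT id, table_name, body FROM tenants_search
  UNION ALL
  SELECT id, table_name, body FROM landlords_search;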
After that you can do the searches like this:
SELECT id, table_name, ts_rank_cd(body, query) AS rank
FROM search_view, to_tsquery('search&words') query
WHERE query @@ body
ORDER BY rank DESC
LIMIT 10;
You should be fine; there's really no other good (easy) way to do this. Just make sure the fields you are searching on are properly indexed.
