Google App Engine (GAE) provides a way to do Full Text Search (FTS) and to store and retrieve documents. The default document ranking is based on a time offset. Is there a way to do a Lucene-style inverted-index look-up and ranking on GAE? If not, what are some other options for doing this?
Use case: FTS and intelligent ranking of results (at least based on search-query frequency) for a bunch of HTML pages.
Both GAE Datastore and GAE Search API can do query-by-index:
Datastore is a NoSQL datastore with user-defined indexes and limited queries. It's a database: fast, distributed, and it has transactions. Queries are quite restricted, however: they can only span one entity kind, so there are no JOINs; only one inequality filter is allowed per query, so geo-point search is not possible; and string matching is exact, so there is no sub-string, regex or LIKE search.
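For instance, a minimal ndb sketch of those restrictions (the entity kind and fields here are made up):

    from google.appengine.ext import ndb

    class Job(ndb.Model):          # hypothetical entity kind
        title = ndb.StringProperty()
        salary = ndb.IntegerProperty()
        years_exp = ndb.IntegerProperty()

    # OK: equality filters plus at most one inequality filter, exact string match only
    jobs = Job.query(Job.title == 'Engineer', Job.salary > 50000).fetch(20)

    # Not allowed: inequality filters on two different properties
    # (raises BadRequestError; this is also why bounding-box geo queries fail)
    # Job.query(Job.salary > 50000, Job.years_exp > 2).fetch()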
Search API is more like Lucene: you store documents and build indexes from parts of the documents. It supports full-text search and geo-point search (e.g. finding geo-points within a certain distance of a given geo-point).
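As a rough sketch of what that looks like (the index name and fields are just examples):

    from google.appengine.api import search

    index = search.Index(name='places')   # example index name

    index.put(search.Document(
        doc_id='cafe-1',
        fields=[
            search.TextField(name='description', value='Cozy coffee shop near the canal'),
            search.GeoField(name='location', value=search.GeoPoint(52.370, 4.895)),
        ]))

    # Full-text search combined with a geo-point query:
    # documents mentioning "coffee" within 1 km of the given point.
    results = index.search(
        'coffee AND distance(location, geopoint(52.372, 4.900)) < 1000')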
If you gave us a more specific use case, we might be able to help you decide which one to use.
Related
I am building an application that identifies people mentioned in free text (on the order of a million names) and stores each name as a key with one or more associated URIs (to handle people who share the same name). Each URI points to a node in a SPARQL-based knowledge graph, which then lets me retrieve additional data about them.
Currently I am storing the first and last names as keys in a Redis DB, but the problem is that I need fuzzy search on those names using the RapidFuzz library (https://github.com/maxbachmann/RapidFuzz), because Redis only provides a simple Levenshtein-distance fuzzy query, which is not enough.
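For reference, the kind of matching I need looks roughly like this (a sketch, with the names already loaded out of Redis into a plain Python list):

    from rapidfuzz import process, fuzz

    names = ['Jon Smith', 'John Smith', 'Joan Smyth']   # sample data

    # Top matches for a misspelled query, with a score cutoff
    matches = process.extract('Jhon Smith', names,
                              scorer=fuzz.WRatio, limit=3, score_cutoff=80)
    # -> list of (name, score, index) tuples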
What architecture would you recommend? Happy 2023!!!
I am using App Engine with Python (version 2.7) for a web application which deals with job listings and job search.
The backend consists of a "Job" table with 20+ fields such as title, date, experience, etc. I have the necessary composite indexes defined for every permutation and combination of filters. As you would have guessed, the number of indexes is high.
The front-end provides an option for users to search for jobs and filter them using the columns.
This works as expected, but with the following drawbacks:
Slow Search Performance
The search is divided into two parts: built-in Datastore filtering and then custom filtering on top of the refined results. The custom filtering is required to apply the more complex filters that are not supported by App Engine.
Exploding composite indexes
Some columns (5, for instance) accept only a fixed set of values, so filtering on them is straightforward, while other fields can have user-defined values, so filtering on them requires custom Python code.
Jinja is the templating engine, which then renders the data into HTML.
Advanced Search + Index References: https://cloud.google.com/appengine/articles/indexselection
Is there a better approach/algorithm for implementing the search and advanced search in App Engine?
You might want to consider using the Full Text Search API available in App Engine. In essence, when entities are created in Cloud Datastore, you would create a Document with the entity ID/Key and all searchable fields and send it to the Search API for indexing. Any updates to the Datastore entities would also need to update the corresponding Search document. Also, when entities are deleted, delete the corresponding Search document.
Modify your application's search code to perform the search on indexed documents instead of Datastore queries. Retrieve a page (e.g. 50) of document IDs, fetch the data for those 50 entities with a Datastore get, and display the results.
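A rough sketch of that flow (the index name, entity kind and fields are assumptions, not something your code must match):

    from google.appengine.api import search
    from google.appengine.ext import ndb

    _INDEX_NAME = 'jobs'   # assumed index name

    def index_job(job):
        """Mirror a Job entity into a Search API document (call on create/update)."""
        search.Index(name=_INDEX_NAME).put(search.Document(
            doc_id=job.key.urlsafe(),
            fields=[
                search.TextField(name='title', value=job.title),
                search.TextField(name='location', value=job.location),
                search.DateField(name='posted', value=job.posted),
            ]))

    def search_jobs(query_string, limit=50):
        """Search the index, then fetch the matching entities from Datastore."""
        results = search.Index(name=_INDEX_NAME).search(search.Query(
            query_string=query_string,
            options=search.QueryOptions(limit=limit, ids_only=True)))
        keys = [ndb.Key(urlsafe=doc.doc_id) for doc in results]
        return ndb.get_multi(keys)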
Per the documentation:
The Search API lets your application perform Google-like full-text searches over structured data, and supports geolocation-based queries. It can be useful in any application domain that benefits from full-text search.
This would definitely give your application's users a better search experience than Datastore queries.
Once you implement this, you might be able to just get rid of the composite indexes from Datastore.
In my Python GAE application, I allow users to query items using the Search API, where I initially put in the documents with their exact tags, but the hits are few given the spell correction that needs to be present.
The approach I found was to implement character n-grams via the Datastore, as this ensures the user only has to type at least part of the word correctly. On the Datastore, however, this takes a lot of time.
For example,
"hello" is broken into ["hello", "ello", "hell", "hel", "elo", "llo", "he", "el", "ll", "lo"]
and when I search for "helo",
the tags ["hel", "elo", "he", "el", "lo"] give a positive match.
I rank them according to the length of the tags matched for a word.
On the Datastore,
I have to index these character n-grams separately along with the entities they match, and for each word perform the search on every tag in a similar manner, which takes a lot of time.
Is there a way of achieving a similar operation using the Search API?
Does the MatchScorer look into multiple fields combined with "OR"?
Looking for ways to design the search documents and perform multiple spell corrected queries in minimal operations.
If I have multiple fields for languages in each document, e.g.
[tags - "hello world"] [rank - 2300] [partial tags - "hel", "ell", "llo", "wor", "orl", "rld", "hell", "ello", "worl", "orld"] [english - 1] [spanish - 0] [french - 0] [german - 0]
Can I perform a MatchScorer operation along with a sort on the language fields? (Each document is associated with only one language.)
The Search API is a good service for this and much better suited than the Datastore. If your search documents have the correct language set, the Search API will cover certain language-specific variations (e.g. singular/plural). But the Search API only works on words (typically separated by spaces, hyphens, dots, etc.).
UPDATE: Language is defined either in the language property of a field, or in the language property of the entire document. In either case, the value is a two-letter ISO 639-1 language code, for example 'de' for German.
For tokenizing search terms ("hel", "elo",...), you can use the pattern from this answer: https://stackoverflow.com/a/13171181/1549523
Also see my comment on that answer. If you want to enforce a minimum token length (e.g. only 3+ letters) to save storage and frontend instance time, you can use the code I've linked there.
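For completeness, a small sketch of such a tokenizer with a minimum token length (one way to do it, not the exact code from the linked answer):

    def ngram_tokens(word, min_len=3):
        """All substrings of `word` that are at least `min_len` characters long."""
        word = word.lower()
        tokens = set()
        for start in range(len(word)):
            for end in range(start + min_len, len(word) + 1):
                tokens.add(word[start:end])
        return tokens

    # ngram_tokens('hello') -> {'hel', 'hell', 'hello', 'ell', 'ello', 'llo'}
    # Store ' '.join(tokens) in a TextField, and tokenize the user's input the
    # same way when building the query.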
MatchScorer is helpful for weighting the frequency of a given term in a document. Since tags typically occur only once per document, it wouldn't help you with that. But, for example, if you were searching research papers for the term "combustion", MatchScorer would rank the results so that the papers that include the term most often are shown first.
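Regarding your last question: you can combine a MatchScorer with ordinary sort expressions via SortOptions. Roughly like this (the index name and the underscore field name are adaptations of your example, so treat them as assumptions):

    from google.appengine.api import search

    sort_opts = search.SortOptions(
        match_scorer=search.MatchScorer(),
        expressions=[
            # English documents first...
            search.SortExpression(expression='english',
                                  direction=search.SortExpression.DESCENDING,
                                  default_value=0),
            # ...then by the match score (_score should be usable in expressions
            # because a scorer is set)
            search.SortExpression(expression='_score',
                                  direction=search.SortExpression.DESCENDING,
                                  default_value=0),
        ])

    results = search.Index(name='tags').search(search.Query(
        query_string='partial_tags:hel OR partial_tags:ello',
        options=search.QueryOptions(limit=20, sort_options=sort_opts)))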
Faceted search would add so-called facets to the result of your search query, i.e. (by default) the 10 most frequently occurring facets for the current query are returned, too. This is helpful with tags or categories, so users can drill down their search by applying any of these suggested filters.
If you want to suggest the correctly spelled search term to users, it might make sense to use two indices. One index, the primary index, is for your actual search documents (e.g. product descriptions with tags), and a second index is just for tags or categories (tokenized, and possibly with synonyms). When a user types into the search field, your app first queries the tag index and suggests matching tags. If the user selects one of them, that tag is used to query the primary search index. This helps users pick correct tags.
These tags could of course be managed in the Datastore, including their synonyms, if there are people maintaining such lists. Every time a tag is stored, your app updates the corresponding search document (in the secondary index), including all the character n-grams (tokens).
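A sketch of that two-index setup (the index names are made up, and ngram_tokens() is the tokenizer sketched above):

    from google.appengine.api import search

    TAG_INDEX = 'tags'           # secondary index: just tags, tokens, synonyms
    PRIMARY_INDEX = 'products'   # primary index: the actual search documents

    def store_tag(tag, synonyms=()):
        """Put a tag with its tokens and synonyms into the secondary index."""
        search.Index(name=TAG_INDEX).put(search.Document(
            doc_id=tag,
            fields=[
                search.TextField(name='tag', value=tag),
                search.TextField(name='tokens', value=' '.join(ngram_tokens(tag))),
                search.TextField(name='synonyms', value=' '.join(synonyms)),
            ]))

    def suggest_tags(user_input, limit=10):
        """Suggest correctly spelled tags for (possibly misspelled) user input."""
        query = ' OR '.join('tokens:%s' % t for t in ngram_tokens(user_input))
        results = search.Index(name=TAG_INDEX).search(search.Query(
            query_string=query,
            options=search.QueryOptions(limit=limit)))
        return [doc.field('tag').value for doc in results]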
Could you explain how search engines like Sphinx, Haystack, etc. fit into a web framework? If you could explain it in a way that someone new to web development could understand, that would help.
One example use case I made up for this question is a book search feature. Let's say I have a NoSQL database that contains book objects, each containing author, title, ISBN, etc.; how does something like Sphinx/Haystack/another search engine fit in with my database to search for a book with a given ISBN?
Firstly, Haystack isn't a search engine; it's a library that provides a Django API to existing search engines like Solr and Whoosh.
That said, your example isn't really a very good one. You wouldn't use a separate search engine to search by ISBN, because your database would already have an index on the Book table that handles that search efficiently. A search engine would instead be useful in two places. Firstly, you could index some or all of the book's contents to search on: databases are not very good at full-text search, but this is an area where search engines shine. Secondly, you could provide a search against multiple fields - say, author, title, publisher and description - in one go.
Also, search engines provide useful functionality like suggestions, faceting and so on that you won't get from a database.
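For a concrete (if minimal) picture of how Haystack plugs into Django, here is a sketch of a search index for the book example; the app path and Book model are assumptions:

    from haystack import indexes
    from books.models import Book   # hypothetical Django app and model

    class BookIndex(indexes.SearchIndex, indexes.Indexable):
        # The "document" field holds the blob of text the engine full-text searches.
        text = indexes.CharField(document=True, use_template=True)
        author = indexes.CharField(model_attr='author')
        title = indexes.CharField(model_attr='title')
        isbn = indexes.CharField(model_attr='isbn')

        def get_model(self):
            return Book

Querying then goes through SearchQuerySet, e.g. SearchQuerySet().models(Book).auto_query('tolkien hobbit'), while a plain ISBN lookup would still be an ordinary database query.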
Do you know the best way to do full-text search on GAE?
Thanks
Read this blog post which details how to add full-text search to App Engine models.
It also details how to make only certain fields searchable, and turn on stemming.
Now we can use the (experimental) Search API:
The Search API allows your application to perform Google-like searches over structured data. You can search across several different types of data (plain text, HTML, Atom, numbers, dates, and geographic locations). Searches return a sorted list of matching text. You can customize the sorting and presentation of results.
Documentation: https://developers.google.com/appengine/docs/python/search/overview
Early presentation: http://www.google.com/events/io/2011/sessions/full-text-search.html
Google App Engine - Full Text Search