I am building an application that identifies people mentioned in free text (on the order of a million mentions) and stores their names as keys, each with one or more associated URIs (to handle people sharing the same name) pointing to nodes in a SPARQL-based knowledge graph, from which I can retrieve additional data about them.
Currently I store the first and last names as keys in a Redis DB, but the problem is that I need fuzzy search over those names using the RapidFuzz library (https://github.com/maxbachmann/RapidFuzz), because Redis only provides a simple Levenshtein-distance fuzzy query, which is not enough.
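To make the matching step concrete, the lookup I have in mind is roughly this (the key pattern and score cutoff are illustrative, and I am assuming the URIs live in a Redis set under each name key):

```python
import redis
from rapidfuzz import process, fuzz

r = redis.Redis()

def fuzzy_lookup(mention, limit=5):
    # Pull all name keys into memory; this full scan is the part that
    # feels wrong at ~1M names, hence the architecture question.
    names = [k.decode() for k in r.scan_iter(match="person:*")]
    # RapidFuzz ranks the candidates; WRatio tolerates word order and
    # partial matches better than plain Levenshtein distance.
    matches = process.extract(mention, names, scorer=fuzz.WRatio,
                              limit=limit, score_cutoff=85)
    # Each match is (name, score, index); resolve each name to its URIs.
    return {name: r.smembers(name) for name, score, _ in matches}
```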
What architecture would you recommend? Happy 2023!!!
We will be building a people database for a specific industry. The info will come from lots of different sources (mostly web scraping and public DBs).
Not every single source will have a usable unique id (such as a tax id), so we are looking for ways to reduce duplicate data.
We were thinking of hashing the person's email and name and using that as a kind of unique key.
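Something along these lines is what we had in mind (a minimal sketch; the normalization rules are just a first guess):

```python
import hashlib
import unicodedata

def person_key(email, name):
    # Normalize aggressively so trivial variations hash to the same key:
    # lowercase, collapse whitespace, fold accents (Jose == José).
    def norm(s):
        s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()
        return " ".join(s.lower().split())
    return hashlib.sha256(
        "{}|{}".format(norm(email), norm(name)).encode()).hexdigest()
```

The idea would then be to put a unique index on that key in MongoDB so duplicate inserts fail fast.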
Any methods/suggestions that will help us reduce duplicates will be appreciated.
We'll be using MongoDB and many different Python scripts, if that is useful.
Cheers!
I am trying to build a question answering system that retrieves answers from DBpedia. At a certain step, I need to find the appropriate resource name to perform the query. For that, I was planning to store all the resource names in a database and retrieve the most relevant ones.
Now I'm stuck at this step. I'm new to this whole DBpedia thing, and I'm not actually sure if there is a way to get the full list of DBpedia resources.
Previously, I used Python's wikipedia library to get resource names, but it does not return any value for certain test cases, so I need to change my approach.
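To illustrate what I am after: something like paging over the public SPARQL endpoint is the best idea I have so far (the owl:Thing typing and the LIMIT/OFFSET paging are my guesses, not a confirmed approach):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)

def fetch_resources(offset, page_size=10000):
    # Page through every resource typed as owl:Thing; the public
    # endpoint caps single result sets, hence LIMIT/OFFSET.
    sparql.setQuery("""
        PREFIX owl: <http://www.w3.org/2002/07/owl#>
        SELECT ?s WHERE { ?s a owl:Thing }
        LIMIT %d OFFSET %d
    """ % (page_size, offset))
    results = sparql.query().convert()
    return [b["s"]["value"] for b in results["results"]["bindings"]]
```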
MS Support recently told me that using a "GET" is much more efficient in RU usage than a SQL query, so I'm wondering if I can (with the azure.cosmos Python package or a custom HTTP request to the REST API) get a document by its unique 'id' field (for which I generated GUIDs) without an SQL query.
Every example I have seen uses the link/path of the doc, which is built from the document's '_rid' metadata rather than the 'id' field set when creating the doc.
I create my new documents with a bulk-upsert stored procedure I wrote and never retrieve the metadata for each of them (I have ~100 million docs), so retrieving the _rid would be as expensive as retrieving the doc itself.
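Ideally I would like to write something like this (a sketch against the v4 azure.cosmos SDK; the account URL, names, and partition-key value are placeholders):

```python
from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com:443/",
                      credential="<master-key>")
container = client.get_database_client("mydb").get_container_client("mycoll")

# Point read: a single GET addressed by id + partition key,
# with no SQL query engine involved.
doc = container.read_item(item="<my-guid>", partition_key="<pk-value>")
```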
The reason the ReadDocument method is so much more efficient than a SQL query is that it uses _rid instead of a user-generated field, even the required id field. The _rid isn't just a unique value; it also encodes information about where the document is physically stored.
To give an example of how this works, let's say you are explaining to someone where a party is this weekend. You could use the name you use for the house ("my friend Ryan's house") or you could use the address ("123 ThatOne Street, Somewhere, WA 11111"). Both are unique identifiers, but for someone trying to get there, one is far more efficient than the other.
Telling someone to go to your friend's house is like using your own id: it does map to a specific house, but the person still needs to find out where that physically is. Using the address is like working with the _rid field: based on that information alone, they can get to the party. Of course, in the real world the person would probably need directions, but data storage in a database is far more organized than most city streets, so the address alone is sufficient to retrieve the document.
If you want to take advantage of this method you will need to find a way to work with the _rid field.
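One hedged sketch of that, using the pre-v4 azure.cosmos client (3.x method names; the one-time id-to-_self mapping pass is my suggestion, not a built-in feature, and whether it is feasible at ~100M docs is a separate question):

```python
import azure.cosmos.cosmos_client as cosmos_client

client = cosmos_client.CosmosClient(
    "https://myaccount.documents.azure.com:443/", {"masterKey": "<key>"})
coll_link = "dbs/mydb/colls/mycoll"

# One-time pass: map the user-set GUID ids to the system-generated
# self-links (which embed _rid), then cache the mapping somewhere cheap.
self_link_by_id = {
    d["id"]: d["_self"]
    for d in client.QueryItems(coll_link,
                               "SELECT c.id, c._self FROM c",
                               {"enableCrossPartitionQuery": True})
}

# Subsequent reads go straight to the document's physical location.
doc = client.ReadItem(self_link_by_id["<my-guid>"],
                      {"partitionKey": "<pk-value>"})
```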
I am using App Engine with Python (version 2.7) for a web application that deals with job listings and job search.
The backend consists of a "Job" entity kind with 20+ fields, such as title, date, and experience. I have the necessary composite indexes defined for every permutation and combination of filters. As you might guess, the number of indexes is high.
The front end gives users the option to search for jobs and filter them using the columns.
This works as expected, but with the following drawbacks:
Slow Search Performance
The search is divided into two parts: built-in Datastore filtering, followed by custom filtering on top of the refined results. The custom filtering is needed to apply the complex filters that App Engine does not support (see the sketch after this list).
Exploding composite indexes
Some columns (5, for instance) accept only a fixed set of values, so filtering on them is straightforward, while other fields can take user-defined values, so filtering on them requires custom Python code.
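Roughly what the two-phase filtering looks like today (model and filters heavily simplified):

```python
from google.appengine.ext import ndb

class Job(ndb.Model):
    title = ndb.StringProperty()
    category = ndb.StringProperty()
    experience = ndb.IntegerProperty()

def search_jobs(category, min_experience, title_keyword):
    # Phase 1: Datastore filtering on the fixed-vocabulary columns.
    # Every filter combination used here needs its own composite index.
    candidates = Job.query(Job.category == category,
                           Job.experience >= min_experience).fetch(500)
    # Phase 2: custom Python filtering for what Datastore can't express,
    # e.g. a substring match on a user-defined field.
    return [j for j in candidates
            if title_keyword.lower() in (j.title or "").lower()]
```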
Jinja is the templating engine, which then renders the data into HTML.
Advanced Search + Index References: https://cloud.google.com/appengine/articles/indexselection
Is there a better approach/algorithm for implementing search and advanced search on App Engine?
You might want to consider using the Full Text Search API available in App Engine. In essence, when entities are created in Cloud Datastore, you would create a Document with the entity ID/Key and all searchable fields and send it to the Search API for indexing. Any updates to the Datastore entities would also need to update the corresponding Search document. Also, when entities are deleted, delete the corresponding Search document.
Modify your application's search code to run searches against the indexed documents instead of Datastore queries. Retrieve a page (e.g., 50) of document IDs, fetch those 50 entities with a Datastore get, and display the results.
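A minimal sketch of that flow (the index name and fields are illustrative):

```python
from google.appengine.api import search
from google.appengine.ext import ndb

JOBS_INDEX = search.Index(name='jobs')

def index_job(job):
    # Mirror the searchable fields of the Datastore entity into a
    # Search API document keyed by the entity's ID.
    JOBS_INDEX.put(search.Document(
        doc_id=str(job.key.id()),
        fields=[search.TextField(name='title', value=job.title),
                search.TextField(name='category', value=job.category),
                search.NumberField(name='experience', value=job.experience)]))

def search_jobs(query_string):
    # ids_only skips field data we would re-read from Datastore anyway.
    results = JOBS_INDEX.search(search.Query(
        query_string,
        options=search.QueryOptions(limit=50, ids_only=True)))
    keys = [ndb.Key('Job', int(doc.doc_id)) for doc in results]
    return ndb.get_multi(keys)
```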
Per the documentation:

"The Search API lets your application perform Google-like full-text searches over structured data, and supports Geolocation-based queries. It can be useful in any application domain that benefits from full-text search."
This would definitely give a better search experience for your application's users compared with Datastore queries.
Once you implement this, you might be able to just get rid of the composite indexes from Datastore.
Google App Engine (GAE) provides a way to do Full Text Search (FTS) and to store and retrieve documents. The default document ranking is based on a time offset. Is there a way to do a Lucene-style inverted-index lookup and ranking on GAE? If not, what are some other options?
Use case: FTS and intelligent ranking of results (at least based on search-query frequency) for a bunch of HTML pages.
Both GAE Datastore and GAE Search API can do query-by-index:
Datastore is a NoSQL datastore with user-defined indexes and limited queries. It's a database: fast, distributed, and transactional. Queries, however, are quite restricted: they can only span one entity kind, so no JOINs; only one inequality filter is allowed per query, so no geo-point search is possible; and string matching is exact, so no substring, regex, or LIKE search is possible.
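To make the inequality restriction concrete (a sketch with the Python ndb client; the model is made up):

```python
from google.appengine.ext import ndb

class Place(ndb.Model):
    lat = ndb.FloatProperty()
    lng = ndb.FloatProperty()

# Fine: both inequality filters are on the same property (lat).
nearby = Place.query(Place.lat > 35.0, Place.lat < 36.0).fetch(10)

# Not allowed: inequality filters on two different properties (lat and
# lng), which is exactly what a bounding-box geo query would need;
# Datastore rejects this with a BadRequestError.
broken = Place.query(Place.lat > 35.0, Place.lng > 40.0).fetch(10)
```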
The Search API is more like Lucene: you store documents and build indexes from parts of those documents. It supports full-text search and geo-point search (e.g., finding geo-points within a certain distance of a given point).
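For instance, a geo-point query in the Search API looks roughly like this (the index and field names are illustrative):

```python
from google.appengine.api import search

places = search.Index(name='places')

# Index a document with a GeoField holding the location.
places.put(search.Document(
    doc_id='cafe-1',
    fields=[search.TextField(name='name', value='Some Cafe'),
            search.GeoField(name='loc',
                            value=search.GeoPoint(37.7790, -122.4100))]))

# Find documents within 1000 meters of a given point.
results = places.search('distance(loc, geopoint(37.7899, -122.4094)) < 1000')
```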
If you gave us a more specific use case, we might be able to help you decide which one to use.