We will be building a people database for a specific industry. The data will come from many different sources (mostly web scraping and public databases).
Not every source will have a usable unique ID (such as a tax ID), so we are looking for ways to reduce duplicate data.
We were thinking of hashing the person's email and name and using that as a kind of unique key.
Any methods/suggestions that would help us reduce duplicates will be appreciated.
We'll be using MongoDB and many different Python scripts, if that is useful.
Cheers!
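A minimal sketch of the hash-key idea from the question, normalizing the email and name before hashing so trivial formatting differences don't produce different keys. The normalization rules here (lowercase, strip accents, collapse whitespace) are assumptions to tune against your actual sources, not a standard:

```python
import hashlib
import unicodedata

def dedup_key(name, email):
    """Build a deterministic dedup key from a normalized name + email.
    The normalization choices below are illustrative assumptions."""
    def norm(s):
        s = unicodedata.normalize("NFKD", s)
        # Drop combining marks so "José" and "Jose" hash the same.
        s = "".join(c for c in s if not unicodedata.combining(c))
        return " ".join(s.lower().split())
    raw = norm(name) + "|" + norm(email)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Same person, different formatting -> same key.
print(dedup_key("José García", "JOSE@example.com ")
      == dedup_key("jose  garcia", "jose@example.com"))  # True
```

In MongoDB you could store this key in its own field with a unique index, so duplicate inserts fail fast instead of requiring a lookup first.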
I am building an application that identifies people mentioned in free text (on the order of a million) and stores their names as keys, with one or more associated URIs (to handle people who share a name) pointing to nodes in a SPARQL-based knowledge graph, which then lets me retrieve additional data about them.
Currently I am storing the first and last names as keys in a Redis DB, but the problem is that I need fuzzy search on those names using the RapidFuzz library (https://github.com/maxbachmann/RapidFuzz), because Redis only provides a simple Levenshtein-distance fuzzy query, which is not enough.
What architecture would you recommend? Happy 2023!!!
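Since Redis alone can't do the fuzzy part, one common shape is: pull a candidate set of names out of Redis, then score them in Python. A sketch using stdlib difflib as a stand-in for RapidFuzz (RapidFuzz's `process.extract` plays the same role, much faster, and would be the real choice at a million names):

```python
from difflib import SequenceMatcher

def best_matches(query, candidates, threshold=0.8):
    """Return (name, score) pairs whose similarity to `query`
    clears the threshold, best match first."""
    scored = [(n, SequenceMatcher(None, query.lower(), n.lower()).ratio())
              for n in candidates]
    return sorted([(n, s) for n, s in scored if s >= threshold],
                  key=lambda t: t[1], reverse=True)

# Hypothetical candidate names fetched from Redis.
names = ["Jon Smith", "John Smith", "Jane Smyth", "Bob Jones"]
print(best_matches("john smith", names))
```

At scale you would also want a cheap blocking key (e.g. first letter + last name) so you only score a small candidate set per query rather than all million names.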
We have a table in Azure Table Storage that is storing a LOT of data (IoT stuff). We are attempting a simple migration away from Azure Table Storage to our own data services.
I'm hoping to get a rough idea of how much data we are migrating exactly.
EG: 2,000,000 records for IoT device #1234.
The problem I am facing is getting a count of all the records present in the table under some constraints (e.g. count all records pertaining to IoT device #1234, etc.).
I did a fair amount of research and found posts saying that this count feature is not implemented in ATS. Those posts, however, were from circa 2010 to 2014.
I assumed (hoped) that this feature has been implemented by now, since it's 2017, and I'm trying to find the docs for it.
I'm using Python to interact with our ATS.
Could someone please post the link to the docs here that show how I can get the count of records using python (or even HTTP / rest etc)?
Or if someone knows for sure that this feature is still unavailable, that would help me move on as well and figure another way to go about things!
Thanks in advance!
Returning the number of entities in a table is for sure not available in the Azure Table Storage SDK and service. You could make a table-scan query to return all entities from your table, but if you have millions of entities the query will probably time out. It is also going to have a pretty big performance impact on your table. Alternatively, you could make segmented queries in a loop until you reach the end of the table.
Or if someone knows for sure that this feature is still unavailable, that would help me move on as well and figure another way to go about things!
This feature is still not available; in other words, as of today there's no API that will give you the total number of rows in a table. You would have to write your own code to do so.
Could someone please post the link to the docs here that show how I can get the count of records using python (or even HTTP / rest etc)?
For this you would need to list all entities in a table. Since you're only interested in the count, you can reduce the size of the response data by making use of Query Projection and fetching just one or two attributes of the entities (maybe PartitionKey and RowKey). Please see my answer here for more details: Count rows within partition in Azure table storage.
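A minimal sketch of the segmented-count loop described above. The `fake_query_segment` function here is a stand-in for the real Table Storage SDK call that returns one page of (projected) entities plus a continuation token; the loop-until-no-token structure is the part that carries over:

```python
def count_entities(query_segment):
    """Count entities by paging through segments until the
    continuation token comes back empty."""
    total = 0
    token = None
    while True:
        entities, token = query_segment(token)  # one segmented query
        total += len(entities)
        if token is None:  # no more segments left
            break
    return total

# Stand-in for the SDK call: three segments of 1000, 1000, 500 entities.
def fake_query_segment(token):
    pages = {None: (["e"] * 1000, "t1"),
             "t1": (["e"] * 1000, "t2"),
             "t2": (["e"] * 500, None)}
    return pages[token]

print(count_entities(fake_query_segment))  # 2500
```

With a real table you would also pass the device filter (e.g. on PartitionKey) into each segmented query, so only the matching records are paged through.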
I have a Python server running on Google App Engine that implements a social network. I am trying to find the best way (best = fast and cheap) to implement interactions on items.
Just like any other social network I have the stream items ("Content") and users can "like" these items.
As for queries, I want to be able to:
Get the list of users who liked the content
Get a total count of the likers.
Get an intersection of the likers with any other users list.
My current implementation includes:
1. An IntegerProperty on the content item which holds the total liker count
2. InteractionModel - an ndb model with a key id equal to the content id (for fast fetches) and a JsonProperty that holds the likers' usernames
Each time a user likes a content I need to update the counter and the list of users. This requires me to run and pay for 4 datastore operations (2 reads, 2 writes).
On top of that, items with lots of likers result in an InteractionModel with a huge JSON blob that takes time to serialize and deserialize when reading/writing (still faster than a RepeatedProperty).
None of the updated fields are indexed (built-in index) or included in a composite index (index.yaml).
Looking for a more efficient and cost effective way to implement the same requirements.
I'm guessing you have two entities in your model: User and Content. Your queries seem to aggregate over multiple Content objects.
What about keeping these aggregated values on the User object? That way you don't need to run any queries, but rather only look up the data stored in the User object.
At some point, though, you might consider not using the Datastore and look at SQL storage instead. It has a higher constant cost, but I'm guessing that at some point (more content/users) it might be worth considering in terms of both cost and performance.
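One way to look at the three required queries (liker list, liker count, intersection with another user list) is that they are all set operations, so the storage format matters more than the query engine. A minimal in-memory sketch of that shape, with a plain dict standing in for whatever persistent store (Datastore, Redis sets, SQL) actually holds the data:

```python
likers = {}  # content_id -> set of usernames (stand-in for persistent storage)

def like(content_id, username):
    """Record a like; re-likes are naturally idempotent with a set."""
    likers.setdefault(content_id, set()).add(username)

def liker_list(content_id):
    return sorted(likers.get(content_id, set()))

def liker_count(content_id):
    return len(likers.get(content_id, set()))

def liker_intersection(content_id, users):
    """Which of `users` liked this content."""
    return likers.get(content_id, set()) & set(users)

like("post1", "alice")
like("post1", "bob")
like("post1", "alice")  # duplicate, absorbed by the set
print(liker_count("post1"))                         # 2
print(liker_intersection("post1", ["bob", "eve"]))  # {'bob'}
```

Redis sets (SADD/SCARD/SINTER) map onto these three operations directly, which is one reason a set-per-content layout can beat a serialized JSON blob as liker counts grow.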
I have developed a search engine for restaurants. I have a social network wherein users can add friends and form groups and recommend restaurants to each other. Each restaurant may serve multiple cuisines. All of this in Python.
So based on which restaurants a user has recommended, we can zero in on the kinds of cuisines the user might like more. At the same time, we will know which price tier the user is more likely to explore (high-end, fast food, cafes, lounges, etc.).
His friends will recommend some places, which will carry more weightage. There are similar non-friend users who have recommended some of the restaurants the user has recommended, and some more which the user hasn't.
The end problem is to recommend restaurants to the user based on:
1) What he has recommended(Other restaurants with similar cuisines) - 50% weightage
2) What his friends have recommended(filtering restaurants which serve the cuisines the user likes the most) - 25% weightage
3) Public recommendation by 'similar' non-friend users - 25% weightage.
I am spending a lot of time reading up on Neo4j, and I think it looks promising. Apart from that, I tried pysuggest, but it didn't suit the above problem. I also tried reco4j, but it is a Java-based solution, whereas I'm looking for a Python-based one. There is also no activity in the Reco4j community, and it is still under development.
Although I've researched quite a lot, I might be missing out on something.
I'd like to know how would you go about implementing the above solution? Could you give any use cases for the same?
I think you won't find any out-of-the-box solution for your problem, as it is quite specific. What you could do with Neo4j is store all the data you use for building recommendations (users, friendship links, users' restaurant recommendations and reviews, etc.) and then build your recommendation engine on that data. From my experience it is quite straightforward to do once you get all your data into Neo4j (either with Cypher or Gremlin).
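The 50/25/25 blend from the question can be sketched as a plain weighted score combination, independent of whether the per-source candidate scores come from Neo4j queries or anywhere else. The score dictionaries below are hypothetical, already-normalized values:

```python
def blend_scores(own, friends, similar, weights=(0.5, 0.25, 0.25)):
    """Combine per-restaurant scores from the three sources into one
    weighted ranking. A restaurant missing from a source scores 0 there."""
    w_own, w_friends, w_similar = weights
    candidates = set(own) | set(friends) | set(similar)
    combined = {
        r: w_own * own.get(r, 0.0)
           + w_friends * friends.get(r, 0.0)
           + w_similar * similar.get(r, 0.0)
        for r in candidates
    }
    return sorted(combined, key=combined.get, reverse=True)

# Hypothetical normalized scores per source.
own = {"cafe_a": 0.9, "diner_b": 0.4}        # similar-cuisine matches
friends = {"diner_b": 1.0, "lounge_c": 0.8}  # friends' recommendations
similar = {"cafe_a": 0.2, "lounge_c": 0.6}   # similar non-friend users
print(blend_scores(own, friends, similar))   # ['cafe_a', 'diner_b', 'lounge_c']
```

The graph database's job is then only to produce the three candidate-score dictionaries (e.g. via Cypher traversals over the friendship and recommendation edges); the weighting itself stays in Python.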
I have the following 2 entities.
class Photo(db.Model):
    name = db.StringProperty()
    registerdate = db.DateTimeProperty()
    iso = db.StringProperty()
    exposure = db.StringProperty()

class PhotoRatings(db.Model):
    ratings = db.IntegerProperty()
I need to do the following.
Get all the photos (Photo) with iso=800 sorted by ratings (PhotoRatings).
I cannot add ratings inside Photo because ratings change all the time, and I would have to write the entire Photo entity every single time. That would cost me more time and money, and the application would take a performance hit.
I read this,
https://developers.google.com/appengine/articles/modeling
But could not get much information from it.
EDIT: I want to avoid fetching too many items and perform the match manually. I need fast and efficient solution.
You're trying to do relational database queries with an explicitly non-relational Datastore.
As you might imagine, this presents problems. If you want the Datastore to sort the results for you, it has to be able to index on what you want to sort by. Indices cannot span multiple entity types, so you can't have an index for Photos that is ordered by PhotoRatings.
Sorry.
Consider, however - which will happen more often? Querying for this ordering of photos, or someone rating a photo? Chances are, you'll have far more views than actions, so storing the rating as part of the Photo entity might not be as big a hit as you fear.
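Assuming the rating is denormalized onto Photo as suggested above, the desired query becomes a single filter-plus-sort over one entity kind (on App Engine, backed by a composite index on iso and rating). In plain Python terms the semantics are:

```python
# Stand-in records; on App Engine these would be Photo entities with a
# denormalized `rating` IntegerProperty and a composite index on (iso, -rating).
photos = [
    {"name": "p1", "iso": "800", "rating": 7},
    {"name": "p2", "iso": "100", "rating": 9},
    {"name": "p3", "iso": "800", "rating": 3},
]

# "All photos with iso=800, sorted by rating, highest first."
iso_800_by_rating = sorted(
    (p for p in photos if p["iso"] == "800"),
    key=lambda p: p["rating"],
    reverse=True,
)
print([p["name"] for p in iso_800_by_rating])  # ['p1', 'p3']
```

With the rating on the entity, the Datastore does this filtering and ordering via the index, so no in-memory matching of two entity kinds is needed.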
If you look at the billing docs you'll notice that an entity write is charged per number of changed properties.
So what you are trying to do will not reduce write costs, but it will definitely increase read costs, as you'll be reading double the number of entities.