Best way to keep web application data in sync with Elasticsearch - Python

I'm developing a web application in Python with Django for scheduling medical services. The database used is MySQL.
The system has several complex searches. For good performance, I'm using Elasticsearch. The idea is to index only searchable data. The document created in ES is not exactly the same as the model in the database; for example, the Patient entity has relationships in the DB that are mapped as attributes in the ES document.
I am unsure at what point to update the record in ES when some relationship is updated.
I am currently updating the record in ES whenever an object is created, changed, or deleted in the DB. In the case of relationships, I need to fetch the parent entity to update its document.
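For reference, this is roughly what the signal-based updating looks like today. A minimal sketch, in which the index name and the flattened fields are assumptions for illustration:

from django.db.models.signals import post_save, post_delete
from django.dispatch import receiver
from elasticsearch import Elasticsearch, NotFoundError
from myapp.models import Patient  # assumed app/model path

es = Elasticsearch()

@receiver(post_save, sender=Patient)
def index_patient(sender, instance, **kwargs):
    # Relationships are flattened into attributes on the ES document
    es.index(
        index="patients",  # assumed index name
        id=instance.pk,
        document={
            "name": instance.name,
            "services": [s.name for s in instance.services.all()],  # assumed fields
        },
    )

@receiver(post_delete, sender=Patient)
def remove_patient(sender, instance, **kwargs):
    try:
        es.delete(index="patients", id=instance.pk)
    except NotFoundError:
        pass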
Another strategy would be to do a full reindex periodically. The problem with this strategy is the delay between a change and its availability for search.
What would be the best strategy in this case?

Related

Django cache real-time data with DRF filtering and sorting

I'm building a web app to manage a fleet of moving vehicles. Each vehicle has a set of fixed data (like their license plate), and another set of data that gets updated 3 times a second via websocket (their state and GNSS coordinates). This is real-time data.
In the frontend, I need a view with a grid showing all the vehicles. The grid must display both the fixed data and the latest known values of the real-time data for each one. The user must be able to sort or filter by any column.
My current stack is Django + Django Rest Framework + Django Channels + PostgreSQL for the backend, and react-admin (React+Redux+FinalForm) for the frontend.
The problem:
Storing the real-time data in the database would easily fulfill all my requirements, as I would benefit from DRF's and the Django ORM's built-in sorting/filtering, but I believe that hitting the DB 3 times/sec for each vehicle could easily become a bottleneck when scaling up.
My current solution keeps the RT data in Python objects that I serialize/deserialize (pickle) into/from the Django cache (Redis-backed) instead of storing them as models in the database. However, I have to manually retrieve the data and DRF-serialize it in my DRF views. Therefore, sorting and filtering don't work, as there are no SQL queries involved.
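For reference, a rough sketch of what this looks like today: the real-time values are read back from the cache and serialized by hand in the view, so there is no queryset for DRF to filter or sort. Serializer fields and cache keys below are assumptions:

from django.core.cache import cache
from rest_framework import serializers
from rest_framework.response import Response
from rest_framework.views import APIView

class VehicleStateSerializer(serializers.Serializer):
    vehicle_id = serializers.IntegerField()
    latitude = serializers.FloatField()
    longitude = serializers.FloatField()
    state = serializers.CharField()

class VehicleGridView(APIView):
    def get(self, request):
        # Latest known values are kept in the cache, keyed per vehicle
        vehicle_ids = cache.get("vehicle_ids", [])
        states = [cache.get(f"vehicle_state:{vid}") for vid in vehicle_ids]
        # No queryset involved, so DRF's filter/order backends cannot help here
        data = VehicleStateSerializer(
            [s for s in states if s is not None], many=True
        ).data
        return Response(data)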
If PostgreSQL had some sort of memory-backed tables with zero disk access that would be great, but according to my research there is no such feature as of today.
So my question is:
What would be the correct/usual approach to my problem?
I think you should separate your service code into smaller services:
Receive data from vehicles
Process data
Separate vehicles into smaller groups, and use a unique socket for each group
Try to update the latest values as a batch (see the sketch after this list).
Use RAID for your database.
And I think that using the cache for real-time data is a waste of server resources.
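A minimal sketch of the batching idea, assuming a Vehicle model with position and state columns: accumulate the incoming websocket messages in memory and flush them once per interval with a single bulk_update, so the database write rate no longer scales with the per-message rate.

from django.utils import timezone
from myapp.models import Vehicle  # assumed model

def flush_positions(buffer):
    """buffer maps vehicle_id -> (lat, lon, state), filled by the websocket
    consumer; call this once per interval, e.g. once per second."""
    vehicles = [
        Vehicle(id=vid, latitude=lat, longitude=lon,
                state=state, updated_at=timezone.now())
        for vid, (lat, lon, state) in buffer.items()
    ]
    # One batched UPDATE instead of one write per incoming message
    Vehicle.objects.bulk_update(
        vehicles, ["latitude", "longitude", "state", "updated_at"]
    )
    buffer.clear()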

Extracting data continuously from RDS MySQL schemas in parallel

I have a requirement to extract data from an Amazon Aurora RDS instance and load it into S3 to make it a data lake for analytics purposes. There are multiple schemas/databases in one instance, and each schema has a similar set of tables. I need to pull selective columns from these tables for all schemas in parallel. This should happen in near real time, capturing the DML operations periodically.
The question may arise of using dedicated services like Data Migration or Copy activity provided by AWS, but I can't use them, since the plan is to make the solution cloud-platform independent, as it could be hosted on Azure down the line.
I was thinking Apache Spark could be used for this, but I learned that it doesn't support JDBC as a source in Structured Streaming. I read about multi-threading and multiprocessing techniques in Python for this, but I have to assess whether they are suitable (the idea is to run the code as daemon threads, each thread fetching data from the tables of a single schema in the background, running continuously in defined cycles, say every 5 minutes). The data synchronization between the RDS tables and S3 is also a crucial aspect to consider.
To say more about the data in the source tables: they have an auto-increment ID field, but the IDs are not sequential and might be missing a few numbers in between as a result of the removal of rows due to the inactivity of the corresponding entity, say customers. It is not necessary to pull every column of a record; only a few are pulled, which would have been predefined in the configuration. The solution must be reliable, sustainable, and automatable.
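To make the multi-threading idea concrete, here is a rough sketch of one daemon thread per schema doing an incremental pull on the auto-increment ID and writing each batch to S3. The host, schema names, table, columns, bucket, and 5-minute cycle are all placeholders, and this only captures new rows, so updates and deletes would still need a separate mechanism (an updated-at column or binlog-based CDC, for example):

import csv
import io
import threading
import time
import boto3
import pymysql

SCHEMAS = ["schema_a", "schema_b"]        # placeholder schema names
COLUMNS = ["id", "name", "updated_at"]    # placeholder selective columns
CYCLE_SECONDS = 300                       # 5-minute cycle

s3 = boto3.client("s3")
watermarks = {schema: 0 for schema in SCHEMAS}  # last auto-increment ID seen

def pull_schema(schema):
    while True:
        conn = pymysql.connect(host="aurora-endpoint", user="user",
                               password="secret", database=schema)
        try:
            with conn.cursor() as cur:
                cur.execute(
                    f"SELECT {', '.join(COLUMNS)} FROM customers WHERE id > %s",
                    (watermarks[schema],),
                )
                rows = cur.fetchall()
        finally:
            conn.close()
        if rows:
            watermarks[schema] = max(row[0] for row in rows)
            buf = io.StringIO()
            csv.writer(buf).writerows(rows)
            s3.put_object(Bucket="my-data-lake",
                          Key=f"{schema}/customers/{int(time.time())}.csv",
                          Body=buf.getvalue())
        time.sleep(CYCLE_SECONDS)

threads = [threading.Thread(target=pull_schema, args=(schema,), daemon=True)
           for schema in SCHEMAS]
for t in threads:
    t.start()
for t in threads:
    t.join()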
Now I'm quite confused about which approach to use and how to implement the solution once decided. Hence, I seek the help of people who have dealt with or know of a solution to this problem. I'm happy to provide more info in case it is required to get to the right solution. Any help on this would be greatly appreciated.

How do I structure a database cache (memcached/Redis) for a Python web app with many different variables for querying?

For my app I am using Flask; however, the question I am asking is more general and can be applied to any Python web framework.
I am building a comparison website where I can update details about products in the database. I want to structure my app so that 99% of users who visit my website will never need to query the database, where information is instead retrieved from the cache (memcached or Redis).
I require my app to be real-time, so any update I make to the database must be instantly available to any visitor to the site. Therefore, I do not want to cache views/routes/HTML.
I want to cache the entire database. However, because there are so many different variables when it comes to querying, I am not sure how to structure this. For example, if I were to cache every query and then later need to update a product in the database, I would basically need to flush the entire cache, which isn't ideal for a large web app.
What I would prefer is to cache individual rows within the database. The problem is, how do I structure this so I can flush the cache appropriately when an update is made to the database? Also, how can I map all of this together from the cache?
I hope this makes sense.
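To make the row-level idea concrete, this is roughly what I have in mind: a per-row cache key that gets deleted whenever that product is updated. The key scheme and the in-memory stand-in for the real database are just for illustration:

import json
import redis

r = redis.Redis()
DATABASE = {}  # stand-in for the real products table

def cache_key(product_id):
    return f"product:{product_id}"

def get_product(product_id):
    cached = r.get(cache_key(product_id))
    if cached is not None:
        return json.loads(cached)            # served from the cache, no DB hit
    product = DATABASE[product_id]           # fall back to the database
    r.set(cache_key(product_id), json.dumps(product))
    return product

def update_product(product_id, fields):
    DATABASE[product_id] = {**DATABASE.get(product_id, {}), **fields}
    r.delete(cache_key(product_id))          # invalidate only this row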
I had this exact question myself, though with a PHP project. My solution was to use Elasticsearch as an intermediate cache between the application and the database.
The trick to this is the ORM. I designed it so that when Entity.save() is called, the entity is first stored in the database, then the complete object (with all references) is pushed to Elasticsearch, and only then is the transaction committed and the flow returned to the caller.
This way I maintained the full functionality of a relational database (atomic changes, transactions, constraints, triggers, etc.) and still had all entities cached with all their references (parent and child relations), together with the ability to invalidate individual cached objects.
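My project was PHP, but in Python the same save() flow would look roughly like this; SQLAlchemy and the Product/index names are just illustrative stand-ins:

from elasticsearch import Elasticsearch
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()
engine = create_engine("mysql+pymysql://user:secret@localhost/shop")
es = Elasticsearch()

class Product(Base):
    __tablename__ = "products"
    id = Column(Integer, primary_key=True)
    name = Column(String(255))
    price = Column(Integer)

    def save(self):
        with Session(engine) as session:
            session.add(self)      # 1. store the row in the database
            session.flush()        #    assigns the primary key, no commit yet
            es.index(              # 2. push the complete document to Elasticsearch
                index="products",
                id=self.id,
                document={"name": self.name, "price": self.price},
            )
            session.commit()       # 3. only then commit and return to the caller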
Hope this helps.
So a free eBook called "Redis in Action" by Josiah Carlson answered all of my questions. It is quite long, but after reading through it, I have a fairly solid understanding of how to structure a caching architecture. It gives real-world examples, such as a social network and a shopping site with tons of traffic. I will need to read through it again once or twice to fully understand it. A great book!
Link: Redis in Action

Caching a static Database table in Django

I have a Django web application that is currently live and receiving a lot of queries. I am looking for ways to optimize its performance and one area that can be improved is how it interacts with its database.
In its current state, each request to a particular view loads an entire database table into a pandas DataFrame, against which queries are done. This table consists of over 55,000 rows of text data (mostly coordinates).
To avoid needless queries, I have been advised to cache the table in memory and have it be cached the first time it's loaded. This would remove some overhead on the DB side of things. I've never used this feature of Django before, so I am a bit lost.
The Django manual does not seem to have a concrete implementation of what I want to do. Would it be a good idea to just store the entire table in memory or would storing it in a file be a better idea?
I had a similar problem and django-cache-machine worked like a charm. It uses Django's caching features to cache the results of your queries. It is very easy to set up (assuming you have already configured Django's cache backend):
pip install django-cache-machine
Then in the model you want to cache:
from django.db import models
from caching.base import CachingManager, CachingMixin

class MyModel(CachingMixin, models.Model):
    objects = CachingManager()
And that's it, your queries will be cached.
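If you would rather cache the loaded DataFrame itself instead of per-query results, Django's low-level cache API also works. A minimal sketch, assuming your cache backend is configured and a Coordinate model backs the 55,000-row table (both names are assumptions):

import pandas as pd
from django.core.cache import cache
from myapp.models import Coordinate  # assumed model behind the 55,000-row table

def get_coordinates_df():
    df = cache.get("coordinates_df")
    if df is None:
        # Hit the database only on the first request (or after the key expires)
        df = pd.DataFrame.from_records(Coordinate.objects.values())
        cache.set("coordinates_df", df, timeout=None)  # None = never expire
    return df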

I want to know about Google NDB data store keys

I am new to GAE. I started working with the NDB datastore service, but its parent key structure is really confusing me. I also watched some tutorials on YouTube, but they just explain the documentation.
I also followed the documentation, but it is still not clear to me. This is the link I explored:
Google App Engine NDB Data Store Service
The NDB datastore is a distributed system. Absolute data consistency is very hard for distributed systems in general. By default, NDB is eventually consistent. This means that by default:
If you add a record, it may not appear immediately in a query
You cannot do transactions across multiple records
If you have stricter requirements, you can define groups of entities by giving them the same parent key and specifying it in queries. You are then able to get consistent behaviour within these groups.
It is often better not to use parent keys at all, since they come with a heavy performance penalty. Most of the time apps do not need parent keys.
Quote from Entities, Properties, and Keys
There is a write throughput limit of about one transaction per second within a single entity group. This limitation exists because Datastore performs masterless, synchronous replication of each entity group over a wide geographic area to provide high reliability and fault tolerance.
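A minimal sketch of what the parent-key pattern looks like in code; the Account and Message kinds are just an example:

from google.appengine.ext import ndb

class Account(ndb.Model):
    pass

class Message(ndb.Model):
    text = ndb.StringProperty()

account_key = ndb.Key(Account, "alice")

# Entities created with the same parent key belong to one entity group
Message(parent=account_key, text="hello").put()

# An ancestor query is strongly consistent within that group
messages = Message.query(ancestor=account_key).fetch()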
