How to send a query or stored procedure execution request to a specific location/region of cosmosdb? - python

I'm trying to multi-thread some tasks against Cosmos DB to optimize ETL time, and I can't find how, using the Python API (though I could fall back to REST if required), to call a stored procedure twice for two partition keys and send the two calls to two different regions (namely 'West Europe' and 'France Central').
I defined those regions as PreferredLocations in the connection policy, but I don't know how to tell an individual query which location it should be routed to.

The only place you could specify that would be the request options object, but there is nothing in it related to regions.
What you can do is initialize multiple clients, each with a different order of preferred locations, and spread the load across regions that way (see the sketch below).
However, unless your apps are deployed in those different regions and the lower latency matters, there is no point in doing so: Cosmos DB can cope with all the requests in a single region as long as you have the RUs it needs.
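A minimal sketch of that multi-client approach, assuming the azure-cosmos v4 SDK; the account URL, database/container names, stored procedure name and partition keys are placeholders. Keep in mind that preferred locations only influence reads: a stored procedure execution is a write and will be routed to the write region unless the account has multi-region writes enabled.

from concurrent.futures import ThreadPoolExecutor
from azure.cosmos import CosmosClient

URL = "https://myaccount.documents.azure.com:443/"   # placeholder account endpoint
KEY = "<account key>"

# Each client gets a different region ordering, so its requests are routed
# to the first available region in its own preferred_locations list.
client_we = CosmosClient(URL, KEY, preferred_locations=["West Europe", "France Central"])
client_fc = CosmosClient(URL, KEY, preferred_locations=["France Central", "West Europe"])

def run_sproc(client, partition_key):
    container = client.get_database_client("mydb").get_container_client("mycontainer")
    return container.scripts.execute_stored_procedure(
        sproc="my_sproc", partition_key=partition_key, params=[]
    )

# One thread per partition key, each using its own client.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(run_sproc, [client_we, client_fc], ["pk1", "pk2"]))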

Related

Firestore read and update is taking 1-12 sec delay to complete in Python

I am using the Firebase Python client to write data to Firestore. Any read/write operation takes at least 1 second to complete. The Firestore DB is in us-central and our server is in Singapore.
Is that what is causing the issue?
During reads, I use a where query with a limit, like below:
collection_ref.where(u"field", u"==", u"field_value").limit(1).get()
During writes, I use set and update(dict).
Sometimes the lag is around 10 to 12 seconds.
Did anyone face similar issues?
Any pointers will be appreciated.
This article on why a Cloud Firestore query is slow lists the usual reasons:
You are downloading a bunch of data you probably don't need. The solution is to limit the amount that comes back.
Your offline cache is too big. Cloud Firestore does some amazing offline caching, but this local cache does not apply the same indexes that the server does. This means that when you query documents in your offline cache, Cloud Firestore needs to look at every document stored locally for the collection being queried and compare it against your query. The solution is to limit how much data is stored in the offline cache.
Without a composite index, Firestore has to do a lot of searching to build the result set. Create a composite index so Firestore can do a quick lookup instead.
You are used to the Realtime Database. The Realtime Database generally has lower latency; you are usually not going to notice the difference, but if your app needs the lowest possible latency you are probably better off using the Realtime Database in those scenarios.
The laws of physics are keeping you down. Your customer might simply be too far away from your Firestore database, and the round trip takes too long. To mitigate this, use real-time listeners, a technique called latency compensation (see the sketch below).
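For the last point, note that latency compensation via the local cache is mainly a feature of the mobile/web SDKs; with the server-side Python client you can still attach a real-time listener so changed results are pushed to you instead of paying a new round trip per read. A minimal sketch with google-cloud-firestore (collection and field names are placeholders):

from google.cloud import firestore

db = firestore.Client()
query = db.collection(u"items").where(u"field", u"==", u"field_value").limit(1)

def on_result(docs, changes, read_time):
    # Called with the current result set and again whenever it changes on the server.
    for doc in docs:
        print(doc.id, doc.to_dict())

watch = query.on_snapshot(on_result)
# ... keep the process alive; call watch.unsubscribe() to stop listening.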

SQLAlchemy session with Celery (Multipart batch writes)

Suppose I have a mobile app which sends filled-out form data (which also contains images) to a commercial software package using its API, and this data should be committed all at once.
Since the mobile device does not have enough memory to send the whole dataset at once, I need to send it as a multipart batch.
I use transactions in cases where I want to perform a bunch of operations on the database, but I kind of need them to be performed all at once, meaning that I don't want the database to change out from under me while I'm in the middle of making my changes. And if I'm making a bunch of changes, I don't want users to be able to read my set of documents in that partially changed state. And I certainly don't want a set of operations failing halfway through, leaving me in a weird and inconsistent state forever. It's got to be all or nothing.
I know that Firebase provides a batch write operation which does exactly what I need. However, I need to do this against a local database (like Redis or Postgres).
The first approach I considered is using POST requests identified by a main session_ID.
- POST /session -> returns new SESSION_ID
- POST [image1] /session/<session_id> -> returns new IMG_ID
- POST [image2] /session/<session_id> -> returns new IMG_ID
- PUT /session/<session_id> -> validate/update metadata
However, this does not seem very robust when it comes to handling errors.
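For reference, a minimal sketch of what that session-based staging could look like with Flask and an in-memory store; a real setup would stage the parts in Redis or a staging table, and all endpoint and variable names here are hypothetical:

import uuid
from flask import Flask, request, jsonify

app = Flask(__name__)
sessions = {}  # session_id -> staged parts; in production this would live in Redis/Postgres

@app.route("/session", methods=["POST"])
def create_session():
    session_id = str(uuid.uuid4())
    sessions[session_id] = {"images": [], "metadata": None}
    return jsonify(session_id=session_id), 201

@app.route("/session/<session_id>", methods=["POST"])
def upload_image(session_id):
    staging = sessions[session_id]
    staging["images"].append(request.get_data())        # raw image bytes from the request body
    return jsonify(img_id=len(staging["images"]) - 1), 201

@app.route("/session/<session_id>", methods=["PUT"])
def finalize_session(session_id):
    sessions[session_id]["metadata"] = request.get_json()
    # validation and the final all-or-nothing commit would happen here
    return jsonify(status="complete")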
The second approach I was considering is combining an SQLAlchemy session with a Celery task, using Flask or FastAPI. I am not sure if this is a common way to solve this problem; I just found this question. What would you recommend for this second approach (sending all the data parts first, then committing all at once)?
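One way that combination is often wired up: the API endpoints only stage parts, and a Celery task later loads everything for the session and commits it in a single SQLAlchemy transaction, so it is all or nothing. A rough sketch, assuming SQLAlchemy 1.4+; load_staged_parts and build_model are hypothetical helpers:

from celery import Celery
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

celery_app = Celery("tasks", broker="redis://localhost:6379/0")
engine = create_engine("postgresql+psycopg2://user:password@localhost/app")  # hypothetical DSN

@celery_app.task
def commit_session(session_id):
    parts = load_staged_parts(session_id)     # hypothetical: read the staged parts from Redis/a staging table
    with Session(engine) as db, db.begin():   # one transaction: commits on success, rolls back on any error
        for part in parts:
            db.add(build_model(part))         # hypothetical: map a staged part to an ORM object

The PUT /session/<session_id> handler would then just enqueue commit_session.delay(session_id) and report the session status.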

Django cache real-time data with DRF filtering and sorting

I'm building a web app to manage a fleet of moving vehicles. Each vehicle has a set of fixed data (like their license plate), and another set of data that gets updated 3 times a second via websocket (their state and GNSS coordinates). This is real-time data.
In the frontend, I need a view with a grid showing all the vehicles. The grid must display both the fixed data and the latest known values of the real-time data for each one. The user must be able to sort or filter by any column.
My current stack is Django + Django Rest Framework + Django Channels + PostgreSQL for the backend, and react-admin (React+Redux+FinalForm) for the frontend.
The problem:
Storing the real-time data in the database would easily fulfill all my requirements, as I would benefit from DRF's and the Django ORM's built-in sorting/filtering, but I believe that hitting the DB 3 times per second for each vehicle could easily become a bottleneck when scaling up.
My current solution keeps the RT data in Python objects that I serialize/deserialize (pickle) into/from the Django cache (Redis-backed) instead of storing them as models in the database. However, I have to retrieve the data manually and DRF-serialize it in my DRF views, so sorting and filtering don't work, as there are no SQL queries involved.
If PostgreSQL had some sort of memory-backed tables with zero disk access that would be great, but according to my research there is no such feature as of today.
So my question is:
What would be the correct/usual approach to my problem?
I think you should split your service into smaller services:
Receive the data from the vehicles.
Process the data.
Split the vehicles into smaller groups and use a dedicated socket for each group.
Update the latest values as a batch (see the sketch below).
Use RAID for your database.
I also think that using the cache for real-time data is a waste of server resources.
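To illustrate the "update latest values as a batch" suggestion: instead of writing every websocket message, keep only the newest values per vehicle in memory and flush them to PostgreSQL once per second with bulk_update. A rough sketch, assuming a hypothetical Vehicle model with lat, lon and state columns:

import threading
import time

from fleet.models import Vehicle   # hypothetical app/model

_pending = {}                      # vehicle_id -> latest field values
_lock = threading.Lock()

def queue_update(vehicle_id, **fields):
    # Called by the websocket consumer ~3 times/sec per vehicle; only the newest values are kept.
    with _lock:
        _pending.setdefault(vehicle_id, {}).update(fields)

def flush_loop(interval=1.0):
    # Background loop: one bulk_update per interval instead of 3 writes/sec per vehicle.
    while True:
        time.sleep(interval)
        with _lock:
            batch = dict(_pending)
            _pending.clear()
        if not batch:
            continue
        vehicles = Vehicle.objects.in_bulk(batch.keys())
        for vehicle_id, fields in batch.items():
            for name, value in fields.items():
                setattr(vehicles[vehicle_id], name, value)
        Vehicle.objects.bulk_update(vehicles.values(), ["lat", "lon", "state"])

With the latest state back in PostgreSQL, DRF's built-in ordering and filtering work unchanged.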

Extracting data continuously from RDS MySQL schemas in parallel

I have a requirement to extract data from an Amazon Aurora RDS instance and load it into S3 to build a data lake for analytics. There are multiple schemas/databases in one instance and each schema has a similar set of tables. I need to pull selected columns from these tables for all schemas in parallel, in near real time, capturing DML operations periodically.
The obvious question is why not use dedicated services like Data Migration or a Copy activity, but I can't use them since the plan is to keep the solution cloud-platform independent, as it could be hosted on Azure down the line.
I was thinking Apache Spark could be used for this, but I learned it doesn't support JDBC as a source in Structured Streaming. I have read about multi-threading and multiprocessing in Python but still have to assess whether they are suitable (the idea is to run the code as daemon threads, each thread fetching data from the tables of a single schema in the background, running continuously in defined cycles, say every 5 minutes; see the sketch below). Data synchronization between the RDS tables and S3 is also a crucial aspect to consider.
To say more about the data in the source tables: they have an auto-increment ID field, but the IDs are not strictly sequential and may skip numbers where rows were removed due to the inactivity of the corresponding entity, say customers. Not all columns of a record need to be pulled, only a few that are predefined in the configuration. The solution must be reliable, sustainable, and automatable.
Now I'm quite unsure which approach to use and how to implement the solution once decided, so I'm looking for help from people who have dealt with or know of a solution to this problem. I'm happy to provide more info if required. Any help would be greatly appreciated.
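As a starting point for the threaded idea, a rough sketch: one daemon thread per schema, each keeping a high-water mark on the auto-increment ID and pushing new rows to S3 every cycle. PyMySQL and boto3 are assumptions, and the schema, table, column and bucket names are placeholders; note that this only captures inserts, so updates and deletes would need an updated_at column or binlog-based CDC on top:

import csv
import io
import threading
import time

import boto3
import pymysql   # assumption: PyMySQL as the MySQL driver

s3 = boto3.client("s3")
SCHEMAS = ["tenant_a", "tenant_b"]        # placeholder schema names
COLUMNS = ["id", "name", "status"]        # the preselected columns from the configuration

def extract_schema(schema, interval=300):
    last_id = 0                           # high-water mark on the auto-increment ID
    while True:
        conn = pymysql.connect(host="aurora-endpoint", user="etl", password="secret", database=schema)
        with conn.cursor() as cur:
            cur.execute(
                f"SELECT {', '.join(COLUMNS)} FROM customers WHERE id > %s ORDER BY id",
                (last_id,),
            )
            rows = cur.fetchall()
        conn.close()
        if rows:
            last_id = rows[-1][0]
            buf = io.StringIO()
            csv.writer(buf).writerows(rows)
            s3.put_object(
                Bucket="my-data-lake",
                Key=f"{schema}/customers/{int(time.time())}.csv",
                Body=buf.getvalue(),
            )
        time.sleep(interval)

threads = [threading.Thread(target=extract_schema, args=(s,), daemon=True) for s in SCHEMAS]
for t in threads:
    t.start()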

Dynamically select database based on request

I'm trying to keep my RESTful site DRY, and I can't come up with a good way to factor out the code for dynamically selecting each "user's" separate database. We have a separate database for each client. The client comes in as part of the URL and is passed into each view as a keyword arg. I want to give each and every view the behavior of accessing the corresponding database WITHOUT having to make sure each programmer writing a view remembers to use
Thing.objects.using(user).all()
and
t = Thing()
t.save(using=user)
every time. It seems like there ought to be some way to intercept the request and set the default database based on the args to the view before it hits the view, allowing us to use the usual
Thing.objects.all()
This would also have the advantage of factoring out all the user resolution code into a more appropriate place.
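For context, one common way to get that interception in plain Django is a database router driven by a thread-local that middleware fills in from the view kwargs. A rough sketch; the "user" URL kwarg and module paths are assumptions, and every client database must be an alias in settings.DATABASES:

import threading

_local = threading.local()

class ClientDatabaseMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        return self.get_response(request)

    def process_view(self, request, view_func, view_args, view_kwargs):
        # The same URL kwarg the views receive, e.g. /<user>/things/
        _local.db_alias = view_kwargs.get("user")

class ClientDatabaseRouter:
    def db_for_read(self, model, **hints):
        return getattr(_local, "db_alias", None)   # None lets Django fall back to the default alias

    db_for_write = db_for_read

With DATABASE_ROUTERS pointing at ClientDatabaseRouter and the middleware installed, Thing.objects.all() resolves against the right client database without each view passing using=.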
We do this with the following technique.
Apache picks off the first part of the path and routes it to a specific mod_wsgi daemon.
Each mod_wsgi daemon is a different customer's installation.
We have many parallel customers, each with (nearly) identical code, all based off a single common installation of the base software.
Each customer has a separate settings.py with their unique configuration.
They don't (actually can't) know about each other because Apache has peeled off the top layer of the path for us.
