I develop a highly loaded application that reads data from a DynamoDB on-demand table. Let's say it constantly performs around 500 reads per second.
From time to time I need to upload a large dataset into the database (100 million records). I use Python, Spark and audienceproject/spark-dynamodb. I set throughput=40k and use BatchWriteItem() for writing the data.
In the beginning, I observe some write throttled requests and write capacity is only 4k but then upscaling takes place, and write capacity goes up.
Questions:
Does intensive writing affect reading in the case of on-demand tables? Does autoscaling work independently for reading/writing?
Is it fine to set large throughput for a short period of time? As far as I see the cost is the same in the case of on-demand tables. What are the potential issues?
I observe some throttled requests but eventually, all the data is successfully uploaded. How can this be explained? I suspect that the client I use has advanced rate-limiting logic, but I haven't managed to find a clear answer so far.
That's a lot of questions in one question, so you'll get a high-level answer.
DynamoDB scales by increasing the number of partitions. Each item is stored on a partition. Each partition can handle:
up to 3000 Read Capacity Units
up to 1000 Write Capacity Units
up to 10 GB of data
As soon as any of these limits is reached, the partition is split into two and the items are redistributed. This happens until there is sufficient capacity available to meet demand. You don't control how that happens, it's a managed service that does this in the background.
The number of partitions only ever grows.
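The per-partition limits above can be turned into a rough estimate of how many partitions a given load needs. This is only a sketch under a simplified model: it assumes each limit is satisfied independently and that load is spread evenly across partitions, which DynamoDB does not guarantee. The function name and the 20 GB table size are made up for illustration; the read and write figures come from the question.

```python
import math

# Per-partition limits as quoted in the answer above.
MAX_RCU_PER_PARTITION = 3000
MAX_WCU_PER_PARTITION = 1000
MAX_GB_PER_PARTITION = 10


def estimate_min_partitions(read_units: float, write_units: float, size_gb: float) -> int:
    """Lower bound on partitions, assuming evenly distributed load (hypothetical helper)."""
    by_reads = math.ceil(read_units / MAX_RCU_PER_PARTITION)
    by_writes = math.ceil(write_units / MAX_WCU_PER_PARTITION)
    by_size = math.ceil(size_gb / MAX_GB_PER_PARTITION)
    # A table always has at least one partition.
    return max(1, by_reads, by_writes, by_size)


# ~500 reads/s, 4k observed vs 40k requested write capacity, 20 GB assumed size.
print(estimate_min_partitions(500, 4_000, 20))
print(estimate_min_partitions(500, 40_000, 20))
```

Under these assumptions the table needs roughly ten times as many partitions to absorb 40k writes per second as it had at 4k, which is consistent with the initial throttling while the splits happen.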
Based on this information we can address your questions:
Does intensive writing affect reading in the case of on-demand tables? Does autoscaling work independently for reading/writing?
The scaling mechanism is the same for read and write activity, but the scaling point differs, as mentioned above. Auto Scaling is not involved in an on-demand table; it applies only to tables with provisioned throughput. You shouldn't notice an impact on your reads here.
Is it fine to set large throughput for a short period of time? As far as I see the cost is the same in the case of on-demand tables. What are the potential issues?
I assume you set the throughput that Spark can use as a budget for writing. It won't have much of an impact on an on-demand table; it's information the connector can use internally to decide how much parallelization is possible.
I observe some throttled requests but eventually, all the data is successfully uploaded. How can this be explained? I suspect that the client I use has advanced rate-limiting logic, but I haven't managed to find a clear answer so far.
If the client uses BatchWriteItem, it will get a list of items that couldn't be written for each request and can enqueue them again. Exponential backoff may be involved but that is an implementation detail. It's not magic, you just have to keep track of which items you've successfully written and enqueue those that you haven't again until the "to-write" queue is empty.
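The re-enqueue loop described above might look roughly like this. It is a minimal sketch, not what spark-dynamodb actually does internally: `write_all`, `max_retries`, `base_delay` and the injectable `sleep` are hypothetical names. The `client` is assumed to behave like a boto3 low-level DynamoDB client, whose `batch_write_item` response carries the rejected requests under `UnprocessedItems`.

```python
import time


def write_all(client, table_name, items, max_retries=8, base_delay=0.05, sleep=time.sleep):
    """Write items via BatchWriteItem, re-enqueueing whatever comes back unprocessed.

    `items` must already be in DynamoDB attribute-value format,
    e.g. {"pk": {"N": "1"}}.
    """
    # BatchWriteItem accepts at most 25 put/delete requests per call.
    queue = [{"PutRequest": {"Item": item}} for item in items]
    attempt = 0  # consecutive batches that came back partially unprocessed
    while queue:
        batch, queue = queue[:25], queue[25:]
        response = client.batch_write_item(RequestItems={table_name: batch})
        leftover = response.get("UnprocessedItems", {}).get(table_name, [])
        if leftover:
            attempt += 1
            if attempt > max_retries:
                raise RuntimeError(f"{len(leftover) + len(queue)} items still unwritten")
            # Exponential backoff before retrying the rejected requests.
            sleep(base_delay * 2 ** (attempt - 1))
            queue = leftover + queue
        else:
            attempt = 0
```

With boto3 installed, a call would look like `write_all(boto3.client("dynamodb"), "my-table", items)`, where the table name is a placeholder. The loop only ends when the queue is empty or the retry budget is exhausted, which is why throttled requests can show up in the metrics even though every item is eventually written.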
I'm working on a GCP Document AI project. First, let me say this - the OCR works fine :-). I'm curious to know whether there is room for improvement.
What happens now
I have a python module written, which will get the OCR done for a tiff file which is uploaded via a portal or collected by an agent in the system. The module is written in a way to avoid local usage of original file content, as the file is readily available in a cloud bucket. But, the price I have to pay is to use the batch_process_documents() API instead of process_document().
An observation
This is an obvious one: a document submitted via the inline API gets its OCR back in less than 5 seconds most of the time. But the batch (with a single document :-|) takes more than 45 seconds almost every time. Sometimes it takes a minute or more.
I'm searching for a way to reduce the OCR call time. The inline API does not support GCS URIs as far as I'm aware, so either I need to download the content, upload it again via the inline API and then run the OCR, or I need to live with the performance reduction.
Has anyone handled a similar case? Are there any ways to tackle this without using the batch API or downloading the content? Any help is appreciated.
As I understand it, your concern is the latency difference between the process and batchProcess method calls of the Document AI API: for a single document, they take about 5 and 45 seconds respectively.
The process_document() method has limits on the number of pages and the file size that can be sent, and it allows only one document file per API call.
The batch_process_documents() method allows asynchronous processing of larger files and batch processing of multiple files.
Single requests are oriented to smaller amounts of data that usually take very little time to process, but they may perform poorly when dealing with a large amount of data. Batch requests, on the other hand, are oriented to larger amounts of data, where they perform better than single requests, but they may perform worse when processing a small amount of data.
Regarding the latency of these two method calls: according to the documentation, for synchronous ("online") operations, i.e. those with an immediate response, the document data is processed in memory and not persisted to disk. In asynchronous (offline) batch operations, the documents are processed on disk, because the files can be significantly larger and might not fit in memory. That's why the asynchronous operations take around 10x as long as the synchronous ones.
Each of these method calls has a particular use case, so the choice of which one to use depends on the trade-off that works better for you. If response time is critical and you would like the response as soon as possible, you could split the files to fit the size limits and make the requests as synchronous operations, keeping in mind the quotas and limitations of the API.
This issue has been raised in this issue tracker. We cannot provide an ETA at this moment, but you can follow the progress there and star the issue to receive automatic updates and give it traction.
Since this was originally posted, the Document AI API added a feature to specify a field_mask in the processing request, which limits the fields returned in the Document object output. This can reduce the latency in some requests since the response will be a smaller size.
https://cloud.google.com/document-ai/docs/send-request#online-processor
I have an API deployed on Cloud Run where each request results in a read + write to Cloud Datastore. A non-trivial amount of the requests are first timers (read from Datastore will return null), so adding caching in front of it may not be too helpful.
Over the past month, the average wall time around calling Datastore and having the data (data = client.get(key, eventual=True)) is 48ms. The payloads are small (a list of dicts, with 10 elements on average, and each dict has two floats).
I'm not sure if I should say that latency is high, but my API has a budget of 100ms to do all that it needs to do and return. If just the data fetching takes ~50% of that time, I'm looking for ways to optimize things.
Questions:
In general, how does 50ms sound for fairly small payloads, fetched by key, from within GCP?
What should I expect (in terms of latency) from Memorystore within GCP?
Assuming you are using Cloud Run and Datastore in the same location, I would say that 50ms is around the expected latency for reads on Datastore. The size of the payload does not matter as much for reads (10 - 1000 document reads do not make a big difference to the processing/propagation time).
Since you have such a small window for your API to operate, this could indeed be a problem if some unexpected delays occur.
I have never used Memorystore so I can't say what you could expect of actual latency but it might be a better option given that every ms to your app counts.
I'm assuming that the more commits I make to my database, the more put requests I make. Would it be less expensive to commit less frequently (but commit larger queries at a time)?
I am assuming you're either using RDS for MySQL or MySQL-Compatible Aurora; in either case, you're charged based on the number of running hours, storage and I/O rate, and data transferred OUT of the service (Aurora Serverless pricing is a different story).
In RDS, you're not charged by PUT requests, and there is no such concept in pymysql.
The frequency of commits should be primarily driven by your application functional requirements, not cost. Let's break it down to give you a better idea of how each cost variable would relate to each approach (commit big batches less frequently vs. commit small batches more frequently).
Running hours: Irrelevant, same for both approaches.
Storage: Irrelevant, you'll probably consume the same amount of storage. The amount of data is constant.
I/O rate: There are many factors involved in how the DB engine consumes/optimizes I/O. I wouldn't get to this level of granularity.
Data transferred IN: Irrelevant, free for both cases.
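The two approaches being compared can be sketched as follows. Since no code was posted, this uses the standard-library sqlite3 module as a stand-in for pymysql (both follow the Python DB-API, but pymysql uses `%s` placeholders instead of `?`), and the `readings` table and `insert_rows` helper are invented for illustration.

```python
import sqlite3


def insert_rows(conn, rows, batch_size):
    """Insert rows, committing once per batch; returns the number of commits."""
    commits = 0
    cur = conn.cursor()
    for start in range(0, len(rows), batch_size):
        cur.executemany(
            "INSERT INTO readings (sensor, value) VALUES (?, ?)",
            rows[start:start + batch_size],
        )
        conn.commit()
        commits += 1
    return commits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
rows = [(f"s{i}", float(i)) for i in range(1000)]

small_batches = insert_rows(conn, rows, batch_size=10)   # many small commits
large_batches = insert_rows(conn, rows, batch_size=500)  # few large commits
total = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(small_batches, large_batches, total)
```

Both calls store the same 1000 rows; only the number of commits differs. Per the breakdown above, the commit count is not itself a billed unit on RDS, so the batch size is better chosen by how much work the application can afford to lose or redo if a transaction fails.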
I followed a Youtube video by Chris Pettus called PostgreSQL Proficiency for Python People to edit some of my postgres.conf settings.
My server has 28 gigs of RAM and prior to making the changes, my system memory was averaging around 3GB. Now it hovers around 10GB.
max_connections = 100
shared_buffers = 7GB
work_mem = 64MB
maintenance_work_mem = 1GB
wal_buffers = 16MB
I am not having any issues right now, but I would like to understand the pros and cons of the changes I made. I assume that there must be some tangible benefits of tripling the average memory being used in my system (measured with Datadog).
My server is used to perform ETL (Airflow) and hosts the database. Airflow has a lot of connections but typically the files are pretty small (a few mb) which are processed with pandas, compared with the database to find new rows, and then loaded.
Shared buffers are used for the postgres memory cache (at a lower level, closer to postgres than the OS cache). Setting it to 7GB means that pg will cache up to 7GB of data. So if you are doing a lot of full table scans or (recursive) CTEs, that may improve performance. Note that the postgres master process will allocate this entire amount at database startup, which is why you are seeing your OS use 10GB of RAM now.
work_mem is memory used for sorts and each concurrent sort allocates a bucket of this size. Therefore this is only bounded by max_connections * concurrent sorts, so effectively it is only bounded by the sort complexity of your queries, so increasing this poses the most risk to system stability. (That is, if you have a single query that the query planner executes with 8 merge sorts, you will use 8*work_mem every time the query is executed).
maintenance_work_mem is the memory used by VACUUM and friends (including ALTER TABLE ADD FOREIGN KEY!). Increasing this may increase VACUUM speed.
wal_buffers has no benefit beyond 16MB, which is the largest WAL chunk the server will write at one time. This can help with slow write i/o.
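The work_mem bound described above can be made concrete with some back-of-the-envelope arithmetic using the settings from the question. This is a pessimistic sketch, not a prediction: it assumes every one of the 100 connections runs a query with the given number of sorts at the same moment, and `worst_case_bytes` and `maintenance_workers` are hypothetical names.

```python
GB = 1024 ** 3
MB = 1024 ** 2

# Settings from the question.
shared_buffers = 7 * GB
work_mem = 64 * MB
maintenance_work_mem = 1 * GB
max_connections = 100


def worst_case_bytes(sorts_per_query: int, maintenance_workers: int = 1) -> int:
    """Upper bound if every connection sorts at once (hypothetical helper)."""
    sort_memory = max_connections * sorts_per_query * work_mem
    return shared_buffers + sort_memory + maintenance_workers * maintenance_work_mem


for sorts in (1, 2, 8):
    print(sorts, worst_case_bytes(sorts) / GB)
```

With one sort per query the bound is 14.25 GB, which fits in 28 GB of RAM. With eight sorts per query it is 58 GB, which does not. In practice most connections are idle most of the time, so the real figure is usually far lower, but this is why work_mem is the riskiest of these settings to raise.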
See also: https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
To analyze a large number of websites or financial data and pull out parametric data, what are the optimal strategies?
I'm classifying the following strategies as either "on-the-fly" or "deferred". Which is best?
1. On-the-fly: Process data on-the-fly and store parametric data into a database
2. Deferred: Store all the source data as ASCII into a file system and post process later, or with a processing-data-daemon
3. Deferred: Store all pages as a BLOB in a database to post-process later, or with a processing-data-daemon
Number 1 is simplest, especially if you only have a single server. Can #2 or #3 be more efficient with a single server, or do you only see the power with multiple servers?
Are there any python projects that are already geared toward this kind of analysis?
Edit: by best, I mean fastest execution to prevent user from waiting with ease of programming as secondary
I'd use celery either on a single or on multiple machines, with the "on-the-fly" strategy. You can have an aggregation Task, that fetches data, and a process Task that analyzes them and stores them in a db. This is a highly scalable approach, and you can tune it according to your computing power.
The "on-the-fly" strategy is more efficient in a sense that you process your data in a single pass. The other two involve an extra step, re-retrieve the data from where you saved them and process them after that.
Of course, everything depends on the nature of your data and the way you process them. If the process phase is slower than the aggregation, the "on-the-fly" strategy will hang and wait until completion of the processing. But again, you can configure celery to be asynchronous, and continue to aggregate while there are data yet unprocessed.
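The idea of continuing to aggregate while earlier data is still being processed can be illustrated without a Celery installation. The sketch below swaps in the standard library's `ThreadPoolExecutor` for Celery workers, so it shows only the shape of the pipeline, not Celery's API. `fetch`, `process` and `run_pipeline` are placeholder names, and the "page" is fabricated from the source string instead of being downloaded.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(source: str) -> str:
    """Stand-in for the aggregation task; a real one would download the page."""
    return f"<html>price={len(source)}</html>"


def process(raw: str) -> int:
    """Stand-in for the processing task; extracts one parametric value."""
    return int(raw.split("price=")[1].split("<")[0])


def run_pipeline(sources, workers=4):
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # submit() returns immediately, so the loop keeps fetching while
        # earlier items are still being processed in the pool.
        pending = {src: pool.submit(process, fetch(src)) for src in sources}
        for src, future in pending.items():
            results[src] = future.result()
    return results


print(run_pipeline(["example.com", "example.org/prices"]))
```

In a Celery setup the two functions would be tasks, and the worker pool could live on other machines. The structure stays the same: the fetching side hands work off and moves on without waiting for each result.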
First: "fastest execution to prevent user from waiting" means some kind of deferred processing. Once you decide to defer the processing -- so the user doesn't see it -- the choice between flat-file and database is essentially irrelevant with respect to end-user-wait time.
Second: databases are slow. Flat files are fast. Since you're going to use celery and avoid end-user-wait time, however, the distinction between flat file and database becomes irrelevant.
Store all the source data as ASCII into a file system and post process later, or with a processing-data-daemon
This is the fastest option. Use Celery to load the flat files.