Understanding some postgresql.conf settings I changed - python

I followed a YouTube video by Chris Pettus called PostgreSQL Proficiency for Python People to edit some of my postgresql.conf settings.
My server has 28 gigs of RAM and prior to making the changes, my system memory was averaging around 3GB. Now it hovers around 10GB.
max_connections = 100
shared_buffers = 7GB
work_mem = 64MB
maintenance_work_mem = 1GB
wal_buffers = 16MB
I am not having any issues right now, but I would like to understand the pros and cons of the changes I made. I assume there must be some tangible benefit to tripling the average memory used on the system (measured with Datadog).
My server runs ETL (Airflow) and hosts the database. Airflow opens a lot of connections, but the files are typically small (a few MB); they are processed with pandas, compared against the database to find new rows, and then loaded.

shared_buffers is PostgreSQL's own page cache (one level below the OS page cache). Setting it to 7GB means Postgres will cache up to 7GB of data itself, so if you do a lot of full table scans or (recursive) CTEs, that may improve performance. Note that the postgres master process allocates this entire amount of shared memory at startup, which is why you now see your OS reporting around 10GB of RAM in use.
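If you want to gauge whether the larger shared_buffers is actually paying off, one common check is the buffer cache hit ratio from pg_statio_user_tables. A minimal sketch, assuming psycopg2 is installed and with a placeholder connection string:

    # Rough check of how often reads are served from shared_buffers.
    # psycopg2 and the DSN below are assumptions, not from the original post.
    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT sum(heap_blks_hit)::float
                   / NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0)
            FROM pg_statio_user_tables
        """)
        hit_ratio = cur.fetchone()[0]
        print(f"buffer cache hit ratio: {hit_ratio:.3f}")
    conn.close()

A ratio close to 1.0 suggests most reads are already served from cache; a low ratio suggests the larger cache may help (or that the workload is scan-heavy).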
work_mem is the memory used for sorts (and hash operations), and each concurrent sort gets its own allocation of this size. Total usage is therefore bounded only by max_connections * concurrent sorts per query, i.e. effectively by the sort complexity of your queries, which makes this the setting that poses the most risk to system stability. (If the planner executes a single query with 8 merge sorts, that query can use 8 * work_mem every time it runs.)
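To see whether a particular query's sorts actually fit in work_mem, you can run it under EXPLAIN ANALYZE and look at the reported sort method. A rough sketch (psycopg2 and the example query are assumptions):

    # Check whether a query's sorts spill to disk, i.e. work_mem is too small for them.
    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        cur.execute("EXPLAIN (ANALYZE, BUFFERS) "
                    "SELECT * FROM some_table ORDER BY some_column")  # placeholder query
        plan = "\n".join(row[0] for row in cur.fetchall())
    print(plan)
    # "Sort Method: external merge  Disk: ..." means the sort spilled to disk;
    # "Sort Method: quicksort  Memory: ..." means it stayed within work_mem.
    conn.close()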
maintenance_work_mem is the memory used by VACUUM and friends (including ALTER TABLE ADD FOREIGN KEY!). Increasing it may speed up VACUUM.
wal_buffers gives no benefit beyond 16MB, which is the largest WAL chunk the server will write at one time. Raising it to that value can help with slow write I/O.
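To confirm the running server actually picked up the values you edited in postgresql.conf, you can query them from Python; a small sketch, again assuming psycopg2:

    # Print the values the server is actually running with.
    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        for name in ("max_connections", "shared_buffers", "work_mem",
                     "maintenance_work_mem", "wal_buffers"):
            cur.execute("SHOW " + name)
            print(name, "=", cur.fetchone()[0])
    conn.close()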
See also: https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

Related

DynamoDB on-demand table: does intensive writing affect reading

I develop a highly loaded application that reads data from DynamoDB on-demand table. Let's say it constantly performs around 500 reads per second.
From time to time I need to upload a large dataset into the database (100 million records). I use python, spark and audienceproject/spark-dynamodb. I set throughput=40k and use BatchWriteItem() for data writing.
In the beginning I observe some throttled write requests and the write capacity is only around 4k, but then scaling kicks in and the write capacity goes up.
Questions:
Does intensive writing affect reading in the case of on-demand tables? Does autoscaling work independently for reading/writing?
Is it fine to set large throughput for a short period of time? As far as I see the cost is the same in the case of on-demand tables. What are the potential issues?
I observe some throttled requests but eventually all the data is successfully uploaded. How can this be explained? I suspect that the client I use has advanced rate-limiting logic, but I haven't managed to find a clear answer so far.
That's a lot of questions in one question, so you'll get a high-level answer.
DynamoDB scales by increasing the number of partitions. Each item is stored on a partition. Each partition can handle:
up to 3000 Read Capacity Units
up to 1000 Write Capacity Units
up to 10 GB of data
As soon as any of these limits is reached, the partition is split into two and the items are redistributed. This happens until there is sufficient capacity available to meet demand. You don't control how that happens; it's a managed service that does this in the background.
The number of partitions only ever grows.
Based on this information we can address your questions:
Does intensive writing affect reading in the case of on-demand tables? Does autoscaling work independently for reading/writing?
The scaling mechanism is the same for read and write activity, but the scaling point differs as mentioned above. In an on-demand table AutoScaling is not involved; that's only for tables with provisioned throughput. You shouldn't notice an impact on your reads here.
Is it fine to set large throughput for a short period of time? As far as I see the cost is the same in the case of on-demand tables. What are the potential issues?
I assume the throughput you set is the budget Spark can use for writing; it won't have much of an impact on on-demand tables. It's information the connector can use internally to decide how much parallelization is possible.
I observe some throttled requests but eventually all the data is successfully uploaded. How can this be explained? I suspect that the client I use has advanced rate-limiting logic, but I haven't managed to find a clear answer so far.
If the client uses BatchWriteItem, it gets back a list of items that couldn't be written for each request and can enqueue them again. Exponential backoff may be involved, but that is an implementation detail. It's not magic: you just have to keep track of which items were successfully written and re-enqueue those that weren't until the "to-write" queue is empty.
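As a rough illustration of that loop, here is a sketch using boto3's batch_write_item directly (not the spark-dynamodb connector from the question; the table name, backoff values, and item format are assumptions):

    # Write a batch, re-enqueue whatever comes back in UnprocessedItems, back off, repeat.
    import time
    import boto3

    client = boto3.client("dynamodb")
    TABLE = "my-table"  # hypothetical table name

    def write_all(items):
        # items are expected in DynamoDB attribute format, e.g. {"pk": {"S": "abc"}}
        pending = [{"PutRequest": {"Item": item}} for item in items]
        delay = 0.05
        while pending:
            batch, pending = pending[:25], pending[25:]   # BatchWriteItem max is 25 items
            resp = client.batch_write_item(RequestItems={TABLE: batch})
            unprocessed = resp.get("UnprocessedItems", {}).get(TABLE, [])
            if unprocessed:
                pending.extend(unprocessed)               # put throttled items back in the queue
                time.sleep(delay)                         # simple exponential backoff
                delay = min(delay * 2, 5)
            else:
                delay = 0.05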

Is using connection.commit() from pymysql more frequently to AWS RDS more expensive?

I'm assuming that the more commits I make to my database, the more put requests I make. Would it be less expensive to commit less frequently (but commit larger queries at a time)?
I am assuming you're either using RDS for MySQL or MySQL-Compatible Aurora; in either case, you're charged based on the number of running hours, storage and I/O rate, and data transferred OUT of the service (Aurora Serverless pricing is a different story).
In RDS you're not charged by PUT requests, and there is no such concept in pymysql.
The frequency of commits should be driven primarily by your application's functional requirements, not cost. Let's break it down to give you a better idea of how each cost variable relates to each approach (commit big batches less frequently vs. commit small batches more frequently); a short batching sketch follows the list.
Running hours: Irrelevant, same for both approaches.
Storage: Irrelevant, you'll probably consume the same amount of storage. The amount of data is constant.
I/O rate: There are many factors involved in how the DB engine consumes/optimizes I/O. I wouldn't get to this level of granularity.
Data transferred IN: Irrelevant, free for both cases.
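If you do decide to batch for latency or lock-time reasons rather than cost, the pymysql side is straightforward. A minimal sketch (the endpoint, table, and column names are placeholders, not from the original question):

    # Batched inserts with one commit per chunk, instead of one commit per row.
    import pymysql

    conn = pymysql.connect(host="my-rds-endpoint", user="me",
                           password="...", database="mydb")
    rows = [("a", 1), ("b", 2), ("c", 3)]            # whatever you are loading
    CHUNK = 1000

    with conn.cursor() as cur:
        for i in range(0, len(rows), CHUNK):
            cur.executemany(
                "INSERT INTO my_table (name, value) VALUES (%s, %s)",
                rows[i:i + CHUNK],
            )
            conn.commit()                            # one commit per chunk, not per row
    conn.close()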

Does the memory usage of SQLite remain static regardless of DB size?

I have a 700MB SQLite3 database that I'm reading/writing to with a simple Python program.
I'm trying to gauge the memory usage of the program as it operates on the database. I've used these methods:
use Python's memory_profiler to measure memory usage for the main loop function which runs all the insert/selects
use Python's psutil to measure peak memory usage during execution
manually watching the memory usage via top/htop
The first two support the conclusion that it uses no more than 20MB at any given time. I can start with an empty database and fill it up with 700MB of data, and it stays under 20MB:
Memory profiler's figure never went above 15.805MiB:
Line #    Mem usage    Increment   Line Contents
================================================
...
   229   13.227 MiB    0.000 MiB   @profile
   230                             def loop(self):
   231                                 """Loop to record DB entries"""
   234   15.805 MiB    2.578 MiB       for ev in range(self.numEvents):
...
psutil said peak usage was 16.22265625MB
Now top/htop is a little weirder. Both said that the python process's memory usage wasn't above 20MB, but I could also clearly see free memory steadily decreasing (via the used number) as the database filled up:
Mem: 4047636k total, 529600k used, 3518036k free, 83636k buffers
My questions:
is there any "hidden" memory usage? Does Python call libsqlite in such a way that it might use memory on its own that isn't reported as belonging to Python either via psutil or top?
is the above method sound for determining the memory usage of a program interacting with the database? Especially top: is top reliable for measuring memory usage of a single process?
is it more or less true that a process interacting with a SQLite database doesn't need to load any sizeable part of it into memory in order to operate on it?
Regarding the last point, my ultimate objective is to use a rather large SQLite database of unknown size on an embedded system with limited RAM and I would like to know if it's true that that memory usage is more or less constant regardless of the size of the database.
SQLite's memory usage doesn't depend on the size of the database; SQLite can handle terabyte-sized databases just fine, and it only loads the parts of the database it needs (plus a small cache of configurable size).
SQLite should be fine on embedded systems; that's what it was originally designed for.
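If you want to see or cap that cache from Python, it is controlled by PRAGMA cache_size. A small sketch with the standard sqlite3 module (the database path is a placeholder):

    # Inspect and cap SQLite's page cache for this connection.
    import sqlite3

    conn = sqlite3.connect("mydata.db")                     # placeholder path
    print(conn.execute("PRAGMA cache_size").fetchone())     # default is -2000, i.e. 2000 KiB
    conn.execute("PRAGMA cache_size = -4096")               # negative value = size in KiB (4 MiB here)
    # From here on, SQLite keeps at most ~4 MiB of pages cached for this connection,
    # no matter how large the database file grows.
    conn.close()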

Get_Multi vs. query in NDB. I.e. reads versus cache

So in my app I have a graph search problem (see my previous questions). One of the annoying parts of the algorithm I use is that I have to read my entire ndb database into memory (about 5500 entities, 1MB in size according to the datastore statistics). Things work OK with
nodeconns=JumpAlt.query().fetch(6000)
but I would prefer it if the cache were checked first. Doing so with
nodeconns=ndb.get_multi(JumpAlt.query().fetch(keys_only=True))
works offline but generates the following error online:
"Exceeded soft private memory limit with 172.891 MB"
Speed-wise the normal query is fine, but I am a bit concerned that every user generating 5500 reads from the datastore is going to eat into my quota quite quickly :)
So, my question is, (1) is such a large memory overhead for get_multi normal? (2) is it stupid to read in the entire database for each user anyway?

Sqlite - how to use more memory and cache, and make it run faster

I'm inserting around 220GB of data into a table in SQLite, and I noticed it does a lot of disk I/O (both reads and writes) but doesn't use the computer's memory in any significant way, even though there is plenty of free memory and I don't commit too often. I think disk I/O is my bottleneck, not CPU or memory.
How can I ask it to use more memory, or insert in bulk, so it runs faster?
Review all the options at http://www.sqlite.org/pragma.html. You can tune a lot of performance-related aspects of SQLite in your application.
That I/O activity is there to protect data integrity; SQLite is very safe by default.
Your filesystem also matters for performance. Not all filesystems play well with fsync and SQLite's (default) internal journaling configuration.
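As a hedged illustration of the kind of pragma and batching changes that typically help a bulk load like this, using Python's built-in sqlite3 module (the file name, table, and exact values are assumptions to adapt; relaxing the journal and synchronous settings trades away some crash safety for speed):

    # Typical bulk-load setup: relax durability, enlarge the page cache,
    # and wrap many inserts in a single transaction.
    import sqlite3

    conn = sqlite3.connect("big.db")                 # placeholder path; table assumed to exist
    conn.execute("PRAGMA journal_mode = WAL")        # fewer fsyncs than the default rollback journal
    conn.execute("PRAGMA synchronous = NORMAL")      # trades some crash safety for speed
    conn.execute("PRAGMA cache_size = -262144")      # ~256 MiB page cache (negative = KiB)

    rows = [(1, "x"), (2, "y")]                      # stand-in for your real data
    with conn:                                       # single transaction, single commit at the end
        conn.executemany("INSERT INTO big_table (a, b) VALUES (?, ?)", rows)
    conn.close()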
