May 29, 2020

How to tune your writes to PostgreSQL when migrating from S3 using Glue

Let us understand how to identify the bottlenecks and what can be done to speed up your writes.

I recently helped a customer with a pipeline that had to migrate data from S3 to RDS PostgreSQL using a Glue job. First, compare the MAX number of connections the database allows with how many are actually being made, known as the ACTIVE connections.

select * from pg_stat_activity where datname = 'mydatabasename';
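To compare the server limit against what the job is actually using, you can also run something like the following (the database name is a placeholder):

```sql
-- Server-wide connection limit
SHOW max_connections;

-- Connections currently held against the target database, by state
SELECT state, count(*)
FROM pg_stat_activity
WHERE datname = 'mydatabasename'
GROUP BY state;
```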

Adding more executors will increase the number of connections during the write, because Spark makes a connection FOR EACH TASK/PARTITION being processed. So say your job has 10 DPUs: (10 DPUs - 1 for the master) x 2 executors per DPU - 1 for the driver = 17 executors. Each executor has 4 vCPU cores, and each vCPU core can process one task/partition at a time, so the number of parallel tasks = the number of vCPU cores available. The more cores, the more partitions/tasks can be processed in parallel. So by increasing the job DPU count, you add more cores. This allows you to repartition or coalesce the data so that it all fits nicely into the number of available cores.
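The executor arithmetic above can be sketched as a small helper. This is just a sketch of the counts described in this post; the per-executor core count follows the numbers above and may differ for other worker types:

```python
def glue_executors(dpu_count: int) -> int:
    """Executors for a Glue 1.0 job with Standard workers:
    one DPU is reserved for the master, each remaining DPU runs
    two executors, and one executor slot is taken by the driver."""
    return (dpu_count - 1) * 2 - 1

def max_parallel_tasks(dpu_count: int, cores_per_executor: int = 4) -> int:
    """Each vCPU core processes one task/partition at a time, so this
    is also roughly the number of simultaneous JDBC connections a
    full-width write will open against the database."""
    return glue_executors(dpu_count) * cores_per_executor

print(glue_executors(10))      # 17 executors, as in the example above
print(max_parallel_tasks(10))  # 68 parallel tasks / connections
```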

  1. If you see that not all the executors are being used, add a repartition() to the code so that the data is spread across all the executors.
  2. While your job runs, check the number of active connections on the PSQL side.
  3. Also make sure that the PSQL DB resources are not hitting any bottlenecks, and that the metrics for the job show sufficient executor usage.
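Assuming the Glue DynamicFrame has already been converted to a Spark DataFrame called df, the repartition-then-write step might look like the sketch below. The URL, table and credentials are placeholders, and this only runs inside a Glue/Spark job:

```python
# Illustrative sketch for a Glue/PySpark job; names and values are placeholders.
num_slots = 68  # executors x cores per executor, see the arithmetic above

(df.repartition(num_slots)  # one partition per available core
   .write
   .format("jdbc")
   .option("url", "jdbc:postgresql://mydb.example.internal:5432/mydatabasename")
   .option("dbtable", "public.my_table")
   .option("user", "myuser")
   .option("password", "mypassword")
   .option("driver", "org.postgresql.Driver")
   .mode("append")
   .save())
```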

The bottleneck on the job operation could also be the PSQL write IOPS, i.e. how fast the ingested data can be written to disk. When I upgraded the PSQL database instance class so that there was more memory, disk and CPU, we observed a considerable improvement in job duration. Write performance to PSQL is a hot topic. The first advice from the Spark side is to ensure that the task/partition distribution inside the job is at its best for parallel writing of nicely sized data chunks. As described previously, a default Glue job splits the data and distributes it between the executors. The distribution and splitting is automatic and should be sufficient up to a certain DPU count.

BUT increasing the number of connections will actually hurt write performance beyond a certain level. The compute on the PSQL side needs to match the incoming connections/writes from the Glue job. Once the executor count is sufficiently parallel for writing to PSQL, the bottleneck moves to the database side, and from my findings it appears to be specifically write IOPS.
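One way to keep the job's connection count matched to the database is to cap the write partitions at whatever fits under the server's max_connections. A minimal sketch; the number of reserved connections is an assumption for illustration, not a Glue or PostgreSQL default:

```python
def capped_write_partitions(parallel_slots: int,
                            db_max_connections: int,
                            reserved: int = 10) -> int:
    """Use all available executor slots, but never open more JDBC
    connections than the database can accept after reserving a few
    connections for monitoring and other clients (assumed value)."""
    budget = max(db_max_connections - reserved, 1)
    return min(parallel_slots, budget)

# e.g. 68 available slots against different max_connections settings:
print(capped_write_partitions(68, 85))  # 68: the job fits as-is
print(capped_write_partitions(68, 50))  # 40: coalesce down before writing
```

The result can then be fed to repartition() or coalesce() before the JDBC write.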

Initial test
- Glue Job: Glue v1.0. 5x Standard Workers. Default Python3 script. 25GB JSON source to PSQL DB in same VPC
- PSQL DB: Engine 10. Class db.t2.small. 30GB GP SSD disk. 
==> Duration: 30 minutes 

Best test
- Glue Job: Glue v1.0. 15x Standard Workers. Default Python3 script. 25GB JSON source to PSQL DB in same VPC
- PSQL DB: Engine 10. Class db.m4.4xlarge. 200GB GP SSD disk. 
==> Duration: 8 minutes 

So to summarize, the recommended steps for increasing performance:

  1. Check the active connections on the PSQL side while the job runs, and compare them against the max connections.
  2. Repartition the data so that all executors are used and the writes happen in parallel.
  3. Increase the job DPU count to add more cores if the executors are saturated.
  4. Scale the PSQL instance class and storage so that write IOPS, memory and CPU are not the bottleneck.
  5. Keep the number of incoming connections balanced against what the database can absorb.