May 29, 2020

How to tune your writes to PostgreSQL when migrating from S3 using Glue

Let us understand how to identify the bottlenecks and what can be done to speed up your writes.

I recently helped a customer with a pipeline that had to migrate data from S3 to RDS PostgreSQL using a Glue job. First, compare the MAX number of connections the database allows with how many are actually being made, known as the ACTIVE connections.

select * from pg_stat_activity where datname = 'mydatabasename';
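To compare the server limit against what the job is actually using, you can also run something like the following (the database name is a placeholder):

```sql
-- Server-wide connection limit
SHOW max_connections;

-- Connections currently held against the target database, by state
SELECT state, count(*)
FROM pg_stat_activity
WHERE datname = 'mydatabasename'
GROUP BY state;
```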

Adding more executors will increase the number of connections during the write, because Spark makes a connection FOR EACH TASK/PARTITION being processed. So say your job has 10 DPUs: (10 DPUs - 1 for the master) x 2 executors per DPU - 1 for the driver = 17 executors. Each executor has 4 vCPU cores, and each vCPU core can process one task/partition at a time, so the number of parallel tasks = the number of vCPU cores available. The more cores, the more partitions/tasks can be processed in parallel. So by increasing the job DPU count, you add more cores. This allows you to repartition or coalesce the data so that it all fits nicely into the number of available cores.
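The executor arithmetic above can be sketched as a small helper. This is just a sketch of the counts described in this post; the per-executor core count follows the numbers above and may differ for other worker types:

```python
def glue_executors(dpu_count: int) -> int:
    """Executors for a Glue 1.0 job with Standard workers:
    one DPU is reserved for the master, each remaining DPU runs
    two executors, and one executor slot is taken by the driver."""
    return (dpu_count - 1) * 2 - 1

def max_parallel_tasks(dpu_count: int, cores_per_executor: int = 4) -> int:
    """Each vCPU core processes one task/partition at a time, so this
    is also roughly the number of simultaneous JDBC connections a
    full-width write will open against the database."""
    return glue_executors(dpu_count) * cores_per_executor

print(glue_executors(10))      # 17 executors, as in the example above
print(max_parallel_tasks(10))  # 68 parallel tasks / connections
```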

  1. If you see that not all the executors are being used, add a repartition() to the code so that the data is spread across all the executors.
  2. While your job runs, check the number of active connections on the PSQL side.
  3. Also make sure that the PSQL DB resources are not hitting any bottlenecks, and that the metrics for the job show sufficient executor usage.
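Assuming the Glue DynamicFrame has already been converted to a Spark DataFrame called df, the repartition-then-write step might look like the sketch below. The URL, table and credentials are placeholders, and this only runs inside a Glue/Spark job:

```python
# Illustrative sketch for a Glue/PySpark job; names and values are placeholders.
num_slots = 68  # executors x cores per executor, see the arithmetic above

(df.repartition(num_slots)  # one partition per available core
   .write
   .format("jdbc")
   .option("url", "jdbc:postgresql://mydb.example.internal:5432/mydatabasename")
   .option("dbtable", "public.my_table")
   .option("user", "myuser")
   .option("password", "mypassword")
   .option("driver", "org.postgresql.Driver")
   .mode("append")
   .save())
```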

The bottleneck on the job operation could also be the PSQL write IOPS, i.e. how fast the ingested data can be written to disk. When I upgraded the PSQL database instance class so that there was more memory, disk and CPU, we observed a considerable improvement in job duration. Write performance to PSQL is a hot topic. The first advice from the Spark side is to ensure that the task/partition distribution inside the job is at its best for parallel writing of nicely sized data chunks. As described previously, a default Glue job splits the data and distributes it between the executors. The distribution and splitting is automatic and should be sufficient up to a certain DPU count.

BUT increasing the number of connections will actually hurt write performance beyond a certain level. The compute on the PSQL side needs to match the incoming connections/writes from the Glue job. Once the executor count is sufficiently parallel for writing to PSQL, the bottleneck moves to the database side, and from my findings it appears to be specifically write IOPS.
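One way to keep the job's connection count matched to the database is to cap the write partitions at whatever fits under the server's max_connections. A minimal sketch; the number of reserved connections is an assumption for illustration, not a Glue or PostgreSQL default:

```python
def capped_write_partitions(parallel_slots: int,
                            db_max_connections: int,
                            reserved: int = 10) -> int:
    """Use all available executor slots, but never open more JDBC
    connections than the database can accept after reserving a few
    connections for monitoring and other clients (assumed value)."""
    budget = max(db_max_connections - reserved, 1)
    return min(parallel_slots, budget)

# e.g. 68 available slots against different max_connections settings:
print(capped_write_partitions(68, 85))  # 68: the job fits as-is
print(capped_write_partitions(68, 50))  # 40: coalesce down before writing
```

The result can then be fed to repartition() or coalesce() before the JDBC write.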

Initial test
- Glue Job: Glue v1.0. 5x Standard Workers. Default Python3 script. 25GB JSON source to PSQL DB in same VPC
- PSQL DB: Engine 10. Class db.t2.small. 30GB GP SSD disk. 
==> Duration: 30 minutes 

Best test
- Glue Job: Glue v1.0. 15x Standard Workers. Default Python3 script. 25GB JSON source to PSQL DB in same VPC
- PSQL DB: Engine 10. Class db.m4.4xlarge. 200GB GP SSD disk. 
==> Duration: 8 minutes 

So to summarize, the recommended steps for increasing performance:

  1. Check the active connections on the PSQL side while the job runs, and compare them against the max connections.
  2. Repartition the data so that all executors are used and the writes happen in parallel.
  3. Increase the job DPU count to add more cores if the executors are saturated.
  4. Scale the PSQL instance class and storage so that write IOPS, memory and CPU are not the bottleneck.
  5. Keep the number of incoming connections balanced against what the database can absorb.