In this blog we are going to learn how to improve the query performance of your AWS Glue ETL job by adding parallelism to your JDBC database connection.
If you run an AWS Glue ETL job against a large database table, you may hit an OOM (Out of Memory) error because all the data is read into a single executor, or the job may simply run slowly. To avoid this, you can optimize the number of Apache Spark partitions and the number of parallel JDBC connections opened during job execution.
Here are the steps you need to perform:
- In the Data Catalog, edit the table (the database table you crawled) and add the partitioning parameter hashexpression or hashfield. For a large table, the job partitions the data based on the column selected for these parameters, as described below.
- hashexpression: If your database table contains a column with numeric values, such as a unique ID, choose that column name as the value of hashexpression. In this example, shipmt_id is a monotonically increasing column and a good candidate for hashexpression.
- hashfield: If no suitable numeric column is available, find a column containing string values with a good, even distribution (high cardinality), and choose that column name as the value of hashfield.
- hashpartitions: Provide a numeric value for hashpartitions. By default, this value is 7. This parameter determines the number of Spark partitions and the resulting number of JDBC connections opened to the target database.
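Besides setting these parameters on the Data Catalog table, they can also be passed per-read via `additional_options` in the job script. The sketch below builds that options dictionary; the database and table names in the commented `from_catalog` call are hypothetical placeholders, and the helper function itself is just an illustration, not part of any AWS API.

```python
def build_jdbc_parallelism_options(hashexpression=None, hashfield=None,
                                   hashpartitions=7):
    """Build the additional_options dict for a parallel Glue JDBC read.

    hashexpression and hashfield are alternatives: use hashexpression for a
    numeric column, hashfield for a well-distributed string column.
    """
    if hashexpression and hashfield:
        raise ValueError("Use either hashexpression or hashfield, not both")
    # Glue option values are passed as strings
    options = {"hashpartitions": str(hashpartitions)}
    if hashexpression:
        options["hashexpression"] = hashexpression
    elif hashfield:
        options["hashfield"] = hashfield
    return options

# Inside a Glue job script, the options would be passed to the catalog read,
# e.g. (names here are placeholders for your own database/table):
#
# dyf = glueContext.create_dynamic_frame.from_catalog(
#     database="my_database",
#     table_name="shipments",
#     additional_options=build_jdbc_parallelism_options(
#         hashexpression="shipmt_id", hashpartitions=10),
# )
```

Values set in `additional_options` apply only to that read, which is convenient when different jobs reading the same table need different degrees of parallelism.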
MSSQL Table –
Before adding parallelism (hashexpression and hashpartitions) to my Glue Data Catalog table, my query ran for 6 min and 15 sec to bring back 57 million records.
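The speedup comes from splitting the scan across hashpartitions parallel JDBC reads, each fetching the rows whose hashed column value lands in its bucket. The toy model below (a deterministic CRC32 hash, not Glue's internal hash function) shows why a column with good, even distribution matters: an evenly hashed column spreads rows roughly equally across all partitions, so no single connection does most of the work.

```python
import zlib
from collections import Counter

def partition_for(value, hashpartitions=7):
    # Route a row to a bucket by hashing the chosen column value and
    # taking it modulo hashpartitions (conceptual model only).
    return zlib.crc32(str(value).encode("utf-8")) % hashpartitions

# e.g. a monotonically increasing numeric ID such as shipmt_id
ids = range(1, 1001)
counts = Counter(partition_for(i, 7) for i in ids)
# With an evenly distributed column, every one of the 7 buckets
# receives a comparable share of the 1000 rows.
```

A skewed column (say, a status flag with three distinct values) would leave most buckets empty and concentrate the load on a few connections, which is why high cardinality is recommended for hashfield.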