AWS Glue
    AWS Glue is a fully managed ETL service. AWS customers can use Glue to prepare data from DoordaHost and load it into the AWS ecosystem, ready for analytical workloads. Glue can also catalog your data inside AWS and make it available to future ETL jobs.

    Glue uses Apache Spark as the engine for its data processing and therefore Glue ETL scripts can be written in either PySpark or Scala.

    The following section details the steps to integrate DoordaHost with AWS Glue in order to extract data from a specific DoordaHost table into Parquet format on AWS Simple Storage Service (S3), in preparation for transformation and/or analytics. The script used in this example can be found in our GitHub repository.
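
    For orientation, the sketch below outlines what such an extraction script might look like in PySpark. It is a minimal illustration only: the job parameter names are hypothetical, the JDBC URL and driver class are placeholders to be taken from the DoordaHost JDBC Driver documentation, and the authoritative version is the script in our GitHub repository.

    ```python
    import sys

    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    # Hypothetical job parameter names -- the actual names are defined by the script in the repository
    args = getResolvedOptions(sys.argv, ["JOB_NAME", "username", "password", "table_name", "output_path"])

    sc = SparkContext()
    glue_context = GlueContext(sc)
    spark = glue_context.spark_session

    # Read the requested table from DoordaHost over JDBC.
    # The URL and driver class below are placeholders; use the values from the DoordaHost JDBC Driver documentation.
    source_df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:<doorda-driver>://<host>:<port>/<catalog>")
        .option("driver", "<doorda.jdbc.DriverClass>")
        .option("dbtable", args["table_name"])
        .option("user", args["username"])
        .option("password", args["password"])
        .load()
    )

    # Write the extracted data to S3 in Parquet format, ready for transformation or analytics
    source_df.write.mode("overwrite").parquet(args["output_path"])
    ```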

     

    Pre-requisites

    • Upload the Glue script to a bucket of your choice under the S3 service. Also create a bucket to store the Parquet-encoded output file(s), as well as a bucket to host any temporary files.
    • In addition, upload the DoordaHost JDBC Driver to a bucket of your choice under the S3 service (a minimal upload sketch follows this list).
    • Create an Identity and Access Management (IAM) role with sufficient privileges to run the Glue job. This typically requires read and write access to the appropriate S3 buckets, as well as read and write access to CloudWatch Logs.
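
    If you prefer to script the uploads, a minimal boto3 sketch is shown below. The local file names, bucket names, and object keys are assumptions; substitute your own.

    ```python
    import boto3

    s3 = boto3.client("s3")

    # Hypothetical file, bucket, and key names -- replace with your own
    s3.upload_file("doorda_glue_script.py", "my-glue-scripts-bucket", "scripts/doorda_glue_script.py")
    s3.upload_file("doorda-jdbc-driver.jar", "my-glue-scripts-bucket", "jars/doorda-jdbc-driver.jar")
    ```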

     

    Create an AWS Glue Job

    On the AWS Glue dashboard, select Jobs in the left-hand menu and then click the Add Job button.

    Configure the job properties as shown in Figure 1. The IAM Role dropdown should show all roles available in your account, including the role that was created in the Pre-requisites section. Please modify the S3 paths to reflect the location of the script and temporary file directories. A programmatic equivalent of these settings is sketched after Figure 2.

     


    Figure 1:  Job Properties

    Under Security configuration, script libraries, and job parameters, add the location of the JDBC driver as well as the desired number of workers and maximum concurrency.

     


    Figure 2:  Security Configuration, Script Libraries
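
    The same configuration can be applied programmatically with boto3, as sketched below. The job name, role name, bucket paths, and worker sizing are placeholders for the values you chose above; --extra-jars is the Glue job parameter that points at the JDBC driver JAR.

    ```python
    import boto3

    glue = boto3.client("glue")

    # Names, paths, and sizing below are placeholders for the values chosen in the console
    glue.create_job(
        Name="doorda-extract-job",
        Role="MyGlueJobRole",  # IAM role created in the pre-requisites
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://my-glue-scripts-bucket/scripts/doorda_glue_script.py",
            "PythonVersion": "3",
        },
        DefaultArguments={
            "--TempDir": "s3://my-glue-temp-bucket/tmp/",  # temporary files bucket
            "--extra-jars": "s3://my-glue-scripts-bucket/jars/doorda-jdbc-driver.jar",  # JDBC driver
        },
        GlueVersion="3.0",
        WorkerType="G.1X",
        NumberOfWorkers=2,                          # desired number of workers
        ExecutionProperty={"MaxConcurrentRuns": 1},  # maximum concurrency
    )
    ```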

     

    A number of job parameters are required to run the script, including highly sensitive account credentials which should not be exposed in the console. It is highly recommended that AWS Systems Manager Parameter Store is used to manage the password securely as a secret on behalf of the Glue service, and that the appropriate permissions are granted to Glue to access this parameter (a minimal retrieval sketch follows Figure 3).

    For testing purposes, these entries can be temporarily added directly to the job parameters as shown below.

     


    Figure 3:   Job Parameters
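
    For production use, the password can instead be read inside the script from Parameter Store, as recommended above. The sketch below assumes a SecureString parameter named /doorda/password exists and that the job's IAM role is permitted to read and decrypt it.

    ```python
    import boto3

    ssm = boto3.client("ssm")

    # "/doorda/password" is an assumed parameter name -- create it as a SecureString in Parameter Store
    response = ssm.get_parameter(Name="/doorda/password", WithDecryption=True)
    doorda_password = response["Parameter"]["Value"]
    ```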


    Running Glue Job

    A Glue job can be configured to run based on a trigger event or ad hoc. An ad hoc Glue job is most appropriate when bootstrapping your system of record with data from the Doorda Catalog. A job triggered on a scheduled time is most appropriate for acquiring new records from the Doorda Catalog, and the schedule can be configured based on the Doorda Update Schedule.

     


    Figure 4:   Running Glue Job.
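
    Outside the console, an ad hoc run can be started, and a scheduled trigger created, with boto3 as sketched below. The job name, argument names, and cron expression are placeholders; align the schedule with the Doorda Update Schedule.

    ```python
    import boto3

    glue = boto3.client("glue")

    # Ad hoc run, e.g. when bootstrapping your system of record.
    # The argument names are placeholders for the job parameters your script expects.
    glue.start_job_run(
        JobName="doorda-extract-job",
        Arguments={"--table_name": "example_table", "--output_path": "s3://my-glue-output-bucket/parquet/"},
    )

    # Scheduled trigger for acquiring new records; align the cron expression with the Doorda Update Schedule
    glue.create_trigger(
        Name="doorda-extract-daily",
        Type="SCHEDULED",
        Schedule="cron(0 6 * * ? *)",  # every day at 06:00 UTC (placeholder)
        Actions=[{"JobName": "doorda-extract-job"}],
        StartOnCreation=True,
    )
    ```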

