AWS Glue
    AWS Glue is a fully managed ETL service. AWS customers can use Glue to prepare data from DoordaHost and load it into the AWS ecosystem, ready for analytical workloads. Glue can also catalog your data inside AWS and make it available to future ETL jobs.

    Glue uses Apache Spark as the engine for its data processing and therefore Glue ETL scripts can be written in either PySpark or Scala.

    The following section details the steps to integrate DoordaHost with AWS Glue in order to extract data from a specific DoordaHost table into Parquet format on AWS Simple Storage Service (S3), in preparation for transformation and/or analytics. The script used in this example can be found in our GitHub repository.
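
    For orientation, the sketch below outlines what such an extraction script might look like in PySpark. It is a minimal illustration only: the job parameter names are hypothetical, the JDBC URL and driver class are placeholders to be taken from the DoordaHost JDBC Driver documentation, and the authoritative version is the script in our GitHub repository.

    ```python
    import sys

    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    # Hypothetical job parameter names -- the actual names are defined by the script in the repository
    args = getResolvedOptions(sys.argv, ["JOB_NAME", "username", "password", "table_name", "output_path"])

    sc = SparkContext()
    glue_context = GlueContext(sc)
    spark = glue_context.spark_session

    # Read the requested table from DoordaHost over JDBC.
    # The URL and driver class below are placeholders; use the values from the DoordaHost JDBC Driver documentation.
    source_df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:<doorda-driver>://<host>:<port>/<catalog>")
        .option("driver", "<doorda.jdbc.DriverClass>")
        .option("dbtable", args["table_name"])
        .option("user", args["username"])
        .option("password", args["password"])
        .load()
    )

    # Write the extracted data to S3 in Parquet format, ready for transformation or analytics
    source_df.write.mode("overwrite").parquet(args["output_path"])
    ```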

     

    Pre-requisites

    • Upload the Glue script to a bucket of your choice under the S3 service. Also create a bucket to store the Parquet-encoded output file(s), as well as a bucket to host any temporary files.
    • In addition, upload the DoordaHost JDBC Driver to a bucket of your choice under the S3 service (a minimal upload sketch follows this list).
    • Create an Identity and Access Management (IAM) role with sufficient privileges to run the Glue job. This typically requires read and write access to the appropriate S3 buckets, as well as read and write access to CloudWatch Logs.
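
    If you prefer to script the uploads, a minimal boto3 sketch is shown below. The local file names, bucket names, and object keys are assumptions; substitute your own.

    ```python
    import boto3

    s3 = boto3.client("s3")

    # Hypothetical file, bucket, and key names -- replace with your own
    s3.upload_file("doorda_glue_script.py", "my-glue-scripts-bucket", "scripts/doorda_glue_script.py")
    s3.upload_file("doorda-jdbc-driver.jar", "my-glue-scripts-bucket", "jars/doorda-jdbc-driver.jar")
    ```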

     

    Create an AWS Glue Job

    On the AWS Glue dashboard, select Jobs in the left-hand menu and then click the Add Job button.

    Configure the job properties as shown in Figure 1. The IAM Role dropdown should show all roles available in your account, including the role that was created in the Pre-requisites section. Please modify the S3 paths to reflect the location of the script and temporary file directories. A programmatic equivalent of these settings is sketched after Figure 2.

     


    Figure 1:  Job Properties

    Under Security configuration, script libraries, and job parameters, add the location of the JDBC driver as well as the desired number of workers and maximum concurrency.

     


    Figure 2:  Security Configuration, Script Libraries
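
    The same configuration can be applied programmatically with boto3, as sketched below. The job name, role name, bucket paths, and worker sizing are placeholders for the values you chose above; --extra-jars is the Glue job parameter that points at the JDBC driver JAR.

    ```python
    import boto3

    glue = boto3.client("glue")

    # Names, paths, and sizing below are placeholders for the values chosen in the console
    glue.create_job(
        Name="doorda-extract-job",
        Role="MyGlueJobRole",  # IAM role created in the pre-requisites
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://my-glue-scripts-bucket/scripts/doorda_glue_script.py",
            "PythonVersion": "3",
        },
        DefaultArguments={
            "--TempDir": "s3://my-glue-temp-bucket/tmp/",  # temporary files bucket
            "--extra-jars": "s3://my-glue-scripts-bucket/jars/doorda-jdbc-driver.jar",  # JDBC driver
        },
        GlueVersion="3.0",
        WorkerType="G.1X",
        NumberOfWorkers=2,                          # desired number of workers
        ExecutionProperty={"MaxConcurrentRuns": 1},  # maximum concurrency
    )
    ```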

     

    A number of job parameters are required to run the script, including highly sensitive account credentials which should not be exposed in the console. It is highly recommended that AWS Systems Manager Parameter Store is used to manage the password securely as a secret on behalf of the Glue service, and that the appropriate permissions are granted to Glue to access this parameter (a minimal retrieval sketch follows Figure 3).

    For testing purposes, these entries can be temporarily added directly to the job parameters as shown below.

     


    Figure 3:   Job Parameters
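
    For production use, the password can instead be read inside the script from Parameter Store, as recommended above. The sketch below assumes a SecureString parameter named /doorda/password exists and that the job's IAM role is permitted to read and decrypt it.

    ```python
    import boto3

    ssm = boto3.client("ssm")

    # "/doorda/password" is an assumed parameter name -- create it as a SecureString in Parameter Store
    response = ssm.get_parameter(Name="/doorda/password", WithDecryption=True)
    doorda_password = response["Parameter"]["Value"]
    ```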


    Running Glue Job

    A Glue job can be configured to run based on a trigger event or ad hoc. An ad hoc Glue job is most appropriate when bootstrapping your system of record with data from the Doorda Catalog. A job triggered on a scheduled time is most appropriate for acquiring new records from the Doorda Catalog, and the schedule can be configured based on the Doorda Update Schedule.

     


    Figure 4:   Running Glue Job.
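
    Outside the console, an ad hoc run can be started, and a scheduled trigger created, with boto3 as sketched below. The job name, argument names, and cron expression are placeholders; align the schedule with the Doorda Update Schedule.

    ```python
    import boto3

    glue = boto3.client("glue")

    # Ad hoc run, e.g. when bootstrapping your system of record.
    # The argument names are placeholders for the job parameters your script expects.
    glue.start_job_run(
        JobName="doorda-extract-job",
        Arguments={"--table_name": "example_table", "--output_path": "s3://my-glue-output-bucket/parquet/"},
    )

    # Scheduled trigger for acquiring new records; align the cron expression with the Doorda Update Schedule
    glue.create_trigger(
        Name="doorda-extract-daily",
        Type="SCHEDULED",
        Schedule="cron(0 6 * * ? *)",  # every day at 06:00 UTC (placeholder)
        Actions=[{"JobName": "doorda-extract-job"}],
        StartOnCreation=True,
    )
    ```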

