Azure Data Factory
Azure Data Factory (ADF) orchestrates and automates the movement, transformation and analysis of data in the cloud. ADF does not perform the transformation or analytical work itself; it provides the pipelines that move data from one service to another.
This document details the steps for integrating DoordaHost with ADF in preparation for transformation and/or analytics. It assumes the reader is familiar with ADF and that a Data Factory object has already been created to host the pipeline(s).
Doorda Data Integration with Azure Data Factory
The diagram below depicts the logical components involved in building an integration pipeline for Doorda.

The Linked Service Object provides connection information to a data source as well as compute resources. In this guide, we will create a Linked Service Object to define a connection to DoordaHost.
The Dataset object defines the inputs and outputs for a pipeline. In this guide, we will create a dataset for the information copied from DoordaHost (input).
An activity is a process that performs computations on the data, such as transformations or analytics. In this guide, we use the out-of-the-box Copy activity to copy data from DoordaHost and write records/files to a data sink of your choice.
Activities are rolled up into a pipeline, which is the object used to schedule and monitor the work.
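To make these relationships concrete outside the portal UI, the sketch below shows how the same objects nest when assembled with the azure-mgmt-datafactory Python SDK. It is a minimal sketch only: the dataset and linked service names are placeholders, and exact model constructor signatures vary slightly between SDK versions.

```python
# Minimal sketch of the ADF object hierarchy (azure-mgmt-datafactory SDK).
# "DoordaHostDataset" and "DataLakeDataset" are placeholder dataset names;
# each dataset in turn references a linked service (connection) definition.
from azure.mgmt.datafactory.models import (
    CopyActivity, DatasetReference, ParquetSink, PipelineResource, PrestoSource,
)

# The Copy activity reads from the source dataset and writes to the sink dataset.
copy_activity = CopyActivity(
    name="CopyFromDoordaHost",
    inputs=[DatasetReference(reference_name="DoordaHostDataset")],
    outputs=[DatasetReference(reference_name="DataLakeDataset")],
    source=PrestoSource(),   # DoordaHost (Presto) side of the copy
    sink=ParquetSink(),      # data sink side of the copy
)

# Activities are rolled up into a pipeline, which is what gets scheduled and monitored.
pipeline = PipelineResource(activities=[copy_activity])
```

The configuration behind each of these objects is covered step by step in the sections that follow.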
Process and Architecture
It is important to understand the business objective first, in order to identify the data sources within the Doorda Data Catalog(s) that can support it. This is the starting point for setting up the appropriate ADF pipeline architecture.
Configuring Azure Portal Copy Data tool
This section walks through the configuration of the Copy Data tool made available inside the Azure Portal. This is the minimum setup required to copy data from DoordaHost to your repository in native format.
Inside Home > Data factories, create a new Data Factory object or select an existing one. Select the Author & Monitor tile to start the ADF user interface (UI) in a separate browser tab.

Next, select the Copy data activity from the “Let’s get started” page. This will take you to the Copy Data wizard, which is detailed in the next section.

Copy Data Tool
Properties
Configure the task cadence or task schedule for this pipeline.
It is recommended that the following approach be adopted when copying data from DoordaHost:
- Execute an ADF pipeline to copy all data from the targeted Doorda Ledger Catalog. This bootstraps your target system of record with all available data and only has to be run once.
- Schedule a separate ADF pipeline to run on a regular basis to query for record updates on the targeted Doorda Ledger Catalog. This can be achieved by filtering records by the date_added column and copying only records with a date after the last successful scheduled copy activity, as sketched below.
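As a sketch of the incremental filter, assuming a table that exposes the date_added column (the schema, table name and watermark value below are placeholders, and the literal should match the column's actual type):

```python
# Sketch of an incremental-load query for DoordaHost (Presto SQL).
# "my_schema.my_table" is a placeholder; the watermark should be persisted
# by you (for example, the time of the last successful pipeline run).
last_successful_copy = "2021-01-01"

incremental_query = f"""
SELECT *
FROM my_schema.my_table
WHERE date_added > DATE '{last_successful_copy}'
"""
```

The bootstrap pipeline simply omits the WHERE clause and copies the table in full.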
Source Connection (Linked Service)
DoordaHost is built on top of Presto, a highly scalable, distributed SQL query engine optimised for Big Data. ADF provides a Presto connector with support for the Copy activity, which negates the need for any bespoke scripting. Configure the source data store to use the Presto connector and populate the requisite fields as shown in the figure on the right.
Please change the Catalog name to reflect your requirements. The Server version should be set to the latest stable version. The User name and Password are supplied in your welcome email.
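For those building the pipeline programmatically rather than through the wizard, the sketch below defines an equivalent Presto linked service with the azure-mgmt-datafactory Python SDK. The subscription, resource group, factory, host, catalog and credential values are placeholders, and exact model signatures vary slightly between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, PrestoLinkedService, SecureString,
)

# Placeholders: substitute your own subscription, resource group and factory.
client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

doorda_ls = LinkedServiceResource(
    properties=PrestoLinkedService(
        host="<doordahost-server>",         # Host from your welcome email
        server_version="<server-version>",  # latest stable version
        catalog="<doorda-catalog>",         # Catalog you are licensed for
        authentication_type="LDAP",         # username/password authentication
        username="<username>",
        password=SecureString(value="<password>"),
        enable_ssl=True,
    )
)

client.linked_services.create_or_update(
    "<resource-group>", "<factory-name>", "DoordaHostLinkedService", doorda_ls
)
```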

If it is disabled, it is recommended that Interactive Authoring be enabled so that the AutoResolveIntegrationRuntime can test the validity of the Presto connection configuration. It can be enabled by clicking the information icon next to the “Interactive authoring disabled” message and following the “Edit interactive authoring” link.

The Presto configuration can be tested for correctness by clicking the “Test Connection” button at the bottom of the configuration page.
Source Dataset
Configure the source dataset with the desired Doorda table(s), or alternatively create a bespoke query. A bespoke query will be required for regularly scheduled jobs to filter records by the date_added column, as described in the Properties section.
The full list of Doorda tables associated with the configured Catalog is presented on screen.
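The programmatic equivalent is a Presto dataset pointing at the linked service created above. The sketch below reuses the client from the linked-service sketch and copies a single, placeholder table; for a bespoke query, the query string is instead supplied on the copy activity's Presto source (for example PrestoSource(query=incremental_query)).

```python
from azure.mgmt.datafactory.models import (
    DatasetResource, LinkedServiceReference, PrestoObjectDataset,
)

# Placeholder table name; choose one of the tables listed for your Catalog.
doorda_ds = DatasetResource(
    properties=PrestoObjectDataset(
        linked_service_name=LinkedServiceReference(
            reference_name="DoordaHostLinkedService"
        ),
        table_name="my_schema.my_table",
    )
)

client.datasets.create_or_update(
    "<resource-group>", "<factory-name>", "DoordaHostDataset", doorda_ds
)
```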

Destination Connection (Linked Service)
Select the data sink of your choice and the appropriate dataset configuration for this sink. There are many options to choose from, ranging from SQL/NoSQL repositories to various file formats. Please refer to the Azure documentation for further information. In this example, we have chosen Azure Data Lake Storage Gen2, which simply copies the data to blob storage in Parquet format.
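The sketch below shows the equivalent sink objects in the Python SDK: an ADLS Gen2 linked service and a Parquet dataset. The storage URL, container and folder are placeholders, account-key authentication is assumed purely for brevity, and the client is reused from the source linked-service sketch.

```python
from azure.mgmt.datafactory.models import (
    AzureBlobFSLinkedService, AzureBlobFSLocation, DatasetResource,
    LinkedServiceReference, LinkedServiceResource, ParquetDataset, SecureString,
)

# ADLS Gen2 linked service (account-key authentication shown for brevity).
adls_ls = LinkedServiceResource(
    properties=AzureBlobFSLinkedService(
        url="https://<storage-account>.dfs.core.windows.net",
        account_key=SecureString(value="<account-key>"),
    )
)
client.linked_services.create_or_update(
    "<resource-group>", "<factory-name>", "DataLakeLinkedService", adls_ls
)

# Parquet dataset pointing at a container/folder in the data lake.
sink_ds = DatasetResource(
    properties=ParquetDataset(
        linked_service_name=LinkedServiceReference(
            reference_name="DataLakeLinkedService"
        ),
        location=AzureBlobFSLocation(
            file_system="<container>", folder_path="doorda"
        ),
    )
)
client.datasets.create_or_update(
    "<resource-group>", "<factory-name>", "DataLakeDataset", sink_ds
)
```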
Deploy
Once everything is configured, the pipeline is ready to be deployed. Depending on the properties of the pipeline, it will either run immediately (most appropriate for bootstrapping your system of record with data from the Doorda Catalog) or on a regular schedule (most appropriate for acquiring new records from the Doorda Catalog).
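For a programmatic deployment, the sketch below registers the pipeline assembled in the overview sketch and shows both options: an immediate one-off run and a daily schedule trigger. The pipeline and trigger names are placeholders, the client and pipeline objects come from the earlier sketches, and trigger method names differ slightly between SDK versions.

```python
from datetime import datetime, timezone
from azure.mgmt.datafactory.models import (
    PipelineReference, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, TriggerResource,
)

# Register the pipeline assembled in the overview sketch; pass the incremental
# query to its Presto source (source=PrestoSource(query=...)) for scheduled runs.
client.pipelines.create_or_update(
    "<resource-group>", "<factory-name>", "DoordaCopyPipeline", pipeline
)

# Option 1: run once immediately to bootstrap from the Doorda Catalog.
client.pipelines.create_run("<resource-group>", "<factory-name>", "DoordaCopyPipeline")

# Option 2: attach a daily schedule trigger for incremental updates.
trigger = ScheduleTrigger(
    recurrence=ScheduleTriggerRecurrence(
        frequency="Day", interval=1, start_time=datetime.now(timezone.utc),
    ),
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(reference_name="DoordaCopyPipeline")
    )],
)
client.triggers.create_or_update(
    "<resource-group>", "<factory-name>", "DailyDoordaCopy",
    TriggerResource(properties=trigger),
)
# The trigger must then be started (triggers.start or triggers.begin_start,
# depending on the SDK version) before the schedule takes effect.
```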