Running a Python Script as an Activity on Azure Data Factory
This article describes the steps required to run a Python script within an Azure Data Factory (ADF) pipeline. This capability allows the execution of custom logic that ADF's built-in activities either cannot perform or perform only sub-optimally.
To invoke a Python script from an ADF pipeline, an ADF Custom Activity needs to be configured. A Custom Activity is a bespoke script or application, written in the language of your choice and wrapped in an Azure platform compute service that ADF can call as part of an orchestration pipeline.
To deploy and run a Custom Activity within ADF, an Azure Batch pool of virtual machines needs to be configured and deployed. This is because ADF cannot execute custom code directly within its own Integration Runtime and therefore requires external compute resources to perform the task.
Azure Batch Service
An Azure Batch workload comprises a configured job running on a dedicated pool of compute nodes. In this case, a pool of one or more nodes needs to be configured to execute the Python script. It is the Batch service's responsibility to execute the job on the defined nodes.
The Batch pool can be configured to access Azure Storage in order to read and/or write files.
Figure 1 illustrates a Batch workflow in which an application or service such as ADF utilises the Batch service to execute a workload.
To correctly configure the Custom Activity, an Azure Batch account must first be created to manage the batch process.
A pool then needs to be defined within the Batch account. A pool is the collection of compute nodes that will execute the Python script.
When defining the pool, the operating system is specified, as well as the size of each VM/node (CPU, RAM) that will perform the computation of the Python code.
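To make the workflow concrete, the sketch below shows how a client such as ADF submits a job and a task to a pool using the azure-batch Python SDK. This is illustrative only: the account name, URL, pool ID and script name are placeholders, and ADF performs the equivalent calls on your behalf when the Custom Activity executes.
```python
# A minimal sketch of the Batch workflow using the azure-batch SDK
# (pip install azure-batch). All names and URLs are placeholders; ADF
# makes equivalent calls for you when the Custom Activity runs.
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

credentials = SharedKeyCredentials("mybatchaccount", "<primary-access-key>")
client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.uksouth.batch.azure.com")

# A job is a container for tasks and is bound to a pool of compute nodes.
client.job.add(batchmodels.JobAddParameter(
    id="adf-python-job",
    pool_info=batchmodels.PoolInformation(pool_id="python-pool")))

# Each task runs a command line on one of the pool's nodes.
client.task.add(
    job_id="adf-python-job",
    task=batchmodels.TaskAddParameter(
        id="run-python-script",
        command_line="python main.py"))
```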
Create Batch Account
- In the Azure Portal, select “Create a resource”, choose the “Compute” category, and in “search services and marketplace” search for “Batch Service”. Select Batch Service > Create.
- In the Resource group field, select Create new and enter a name for your resource group.
- Enter a value for the Account name. This name must be unique within the Azure Location selected.
- Under Storage account, select an existing storage account or create a new one.
- Leave the remaining settings unchanged, select Review + create, then select Create to create the Batch account.
When the “Deployment succeeded” message appears, go to the Batch account that you created.
Further information about creating a Batch account can be found here.
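For scripted or repeatable deployments, the same account can be created with the azure-mgmt-batch package. The sketch below assumes a recent (track 2) version of the SDK and authentication via azure-identity; the subscription, resource group, account and storage names are placeholders.
```python
# A sketch using azure-mgmt-batch (pip install azure-mgmt-batch azure-identity).
# Subscription, resource group, account and storage names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.batch import BatchManagementClient
from azure.mgmt.batch.models import (
    AutoStorageBaseProperties, BatchAccountCreateParameters)

client = BatchManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.batch_account.begin_create(
    resource_group_name="my-resource-group",
    account_name="mybatchaccount",  # must be unique within the region
    parameters=BatchAccountCreateParameters(
        location="uksouth",
        # Link the storage account the Batch account will use for files.
        auto_storage=AutoStorageBaseProperties(
            storage_account_id="/subscriptions/<subscription-id>/resourceGroups/"
                               "my-resource-group/providers/Microsoft.Storage/"
                               "storageAccounts/mystorageaccount")))
account = poller.result()  # blocks until deployment succeeds
```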
Create a Pool of Compute Nodes
- In the Batch account, select Pools > Add.
- Enter a unique Pool ID.
- In Operating System, select one of the Windows/Linux options under the Data science category, as these images have Python pre-installed.
- Scroll down to the Node Size and Scale settings and choose a VM size appropriate to the resources (CPU, RAM) the job requires. Since a standard Python script runs single-threaded, it usually makes sense to size the node by memory rather than CPU count. The target dedicated nodes can be fixed at 1 unless the pipeline is designed to execute the script in parallel.
- If the Python script uses extra Python modules, enable the Start Task and define the required pip install <module> entries. The Start Task installs these external modules while each node initialises (see the sketch at the end of this section).
- Keep the defaults for the remaining settings and select OK to create the pool.
Batch creates the pool immediately, but it takes a few minutes to allocate and start the compute node. During this time, the pool’s Allocation state is Resizing. After a few minutes, the allocation state changes to Steady and the nodes start. To check the state of the nodes, select the pool and then select Nodes. When a node’s state is Idle, it is ready to run tasks.
Further documentation for creating a Batch pool can be found here.
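The pool can also be created programmatically with the azure-batch SDK, combining the node size, fixed scale and Start Task settings described above. This is a sketch only: the VM size and module names are illustrative, and the exact publisher/offer/sku values for a Data science image should be confirmed in the Azure Portal or with `az batch pool supported-images list`.
```python
# A sketch of pool creation with a Start Task that installs extra modules.
# VM size, image values and module names are illustrative only.
import azure.batch.models as batchmodels
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

client = BatchServiceClient(
    SharedKeyCredentials("mybatchaccount", "<primary-access-key>"),
    batch_url="https://mybatchaccount.uksouth.batch.azure.com")

client.pool.add(batchmodels.PoolAddParameter(
    id="python-pool",
    vm_size="STANDARD_D2S_V3",  # size by memory for single-threaded Python
    target_dedicated_nodes=1,   # fixed scale, one node
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="microsoft-dsvm",  # illustrative Data science image;
            offer="ubuntu-2004",         # confirm valid values in the Portal
            sku="2004-gen2",
            version="latest"),
        node_agent_sku_id="batch.node.ubuntu 20.04"),
    # The Start Task runs once per node during initialisation.
    start_task=batchmodels.StartTask(
        command_line='/bin/bash -c "pip install pandas requests"',
        wait_for_success=True,
        user_identity=batchmodels.UserIdentity(
            auto_user=batchmodels.AutoUserSpecification(
                scope=batchmodels.AutoUserScope.pool,
                elevation_level=batchmodels.ElevationLevel.admin)))))
```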
Create a Batch Account Linked Service Connection in ADF
- Within ADF, create a Custom Activity.
- Select Azure Batch and create a new Azure Batch linked service.
- In the Name field, provide a unique value.
- In the Account name field, enter the value for the Batch account name provided previously.
- In the Access key field, enter the Primary access key found under the Keys > Batch Account Credentials section of your Batch account.
- In the Batch URL field, enter the URL found under the same Keys > Batch Account Credentials section of your Batch account.
- In the Pool name, enter the value for Pool ID provided earlier.
- Under Storage linked service name, select isADL (Azure Data Lake)
- Select Test connection and then Create.
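If you prefer to define the linked service in code rather than through the ADF UI, a rough sketch with the azure-mgmt-datafactory package follows. The subscription, resource group and factory names are placeholders; the isADL reference matches the storage linked service selected above.
```python
# A sketch using azure-mgmt-datafactory (pip install azure-mgmt-datafactory
# azure-identity). Resource group, factory and service names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBatchLinkedService, LinkedServiceReference,
    LinkedServiceResource, SecureString)

client = DataFactoryManagementClient(
    DefaultAzureCredential(), "<subscription-id>")

batch_ls = LinkedServiceResource(properties=AzureBatchLinkedService(
    account_name="mybatchaccount",
    access_key=SecureString(value="<primary-access-key>"),
    batch_uri="https://mybatchaccount.uksouth.batch.azure.com",
    pool_name="python-pool",
    # Reference to the storage linked service (isADL in this article).
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="isADL")))

client.linked_services.create_or_update(
    resource_group_name="my-resource-group",
    factory_name="my-data-factory",
    linked_service_name="AzureBatchLinkedService",
    linked_service=batch_ls)
```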
Configure Custom Activity to use Python Script
- Select Settings and enter python <script_name> as the Command (a minimal example script is shown below). If parameters are passed dynamically into the script, select Add dynamic content.
- In Resource linked service, select isADL.
- In Folder path, select the folder containing the Python script.
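For reference, below is a minimal, hypothetical example of the kind of script the Custom Activity can run; the file name main.py and the date parameter are illustrative only.
```python
# main.py - a minimal, hypothetical script for the Custom Activity to run.
import sys

def main() -> None:
    # Values after the script name in the Command field (for example supplied
    # via Add dynamic content) arrive as ordinary command-line arguments.
    run_date = sys.argv[1] if len(sys.argv) > 1 else "unspecified"
    print(f"Processing data for run date: {run_date}")
    # Exit non-zero to mark the Custom Activity as failed in ADF.
    sys.exit(0)

if __name__ == "__main__":
    main()
```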