With Microsoft Build 2021 currently underway, what better time to take a beginner-friendly deep dive into Azure Data Factory? In this post, we’ll talk about what Azure Data Factory is, how to get started with it, and what you might use it for.
Keep up with all things Azure in the ACG original series Azure This Week!
What is Azure Data Factory?
Azure Data Factory (ADF) is a fully managed, serverless data integration service for ingesting, preparing, and transforming all your data at scale. Organizations in every industry use it for a rich variety of use cases: data engineering, migrating on-premises SSIS packages to Azure, operational data integration, analytics, ingesting data into data warehouses, and more.
Figure 1 – Azure Data Factory: Industry-leading Enterprise Data Integration
If you have many SQL Server Integration Services (SSIS) packages for on-premises data integration, these SSIS packages run as-is in Azure Data Factory (including custom SSIS components). This enables any developer to use Azure Data Factory for enterprise data integration needs.
Check out the Visual Guide to Azure Data Factory to get the big picture on ADF.
Enterprise Connectors to Any Data Store – Azure Data Factory enables organizations to ingest data from a rich variety of data sources. Whether the data source is on-premises, multi-cloud, or provided by a Software-as-a-Service (SaaS) vendor, Azure Data Factory connects to all of them at no additional licensing cost. Using the Copy Activity, you can copy data between different data stores.
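To make the Copy Activity concrete, here is a rough sketch of the shape a copy activity takes in ADF's JSON pipeline format, expressed as a Python dict. The activity and dataset names are hypothetical placeholders, and this is a simplified outline rather than the full schema:

```python
# A minimal sketch of a Copy Activity definition in Azure Data Factory's
# JSON pipeline format. Names like "CopyBlobToSql" and the dataset
# references are hypothetical placeholders for illustration only.
copy_activity = {
    "name": "CopyBlobToSql",  # hypothetical activity name
    "type": "Copy",           # the ADF activity type for data movement
    "inputs": [
        {"referenceName": "SourceBlobDataset", "type": "DatasetReference"}
    ],
    "outputs": [
        {"referenceName": "SinkSqlDataset", "type": "DatasetReference"}
    ],
    "typeProperties": {
        "source": {"type": "BlobSource"},  # where the data is read from
        "sink": {"type": "SqlSink"},       # where the data is written to
    },
}
```

In practice, the ADF Studio generates this JSON for you when you configure a copy activity visually, so you rarely author it by hand.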
Azure Data Factory helps you reduce time to insight by making it easy to connect to many business data sources, transform the data at scale, and write the processed data to a data store of your choice. For example, you can use Azure Data Factory to connect to the business applications available with Microsoft Dynamics 365 – Dynamics 365 Marketing, Sales, Customer Service, Field Service, and more. Azure Data Factory can copy data to and from Microsoft Dynamics 365, getting the data, in the right shape, to where it is most needed to support critical business reporting. Beyond Microsoft Dynamics 365, Azure Data Factory supports a rich variety of advertising and marketing data sources: Salesforce, Marketo, Google AdWords, and more.
Get the Cloud Dictionary of Pain
Speaking cloud doesn’t have to be hard. We analyzed millions of responses to ID the top terms and concepts that trip students up. In this cloud guide, you’ll find succinct definitions of some of the most painful cloud terms.
On-premises Data Access – For many organizations, some enterprise data sources remain on-premises. Azure Data Factory connects to these data sources using a Self-Hosted Integration Runtime (we will cover the Integration Runtime concept in the next section). The Self-Hosted Integration Runtime moves data between on-premises and cloud data stores without requiring you to open any incoming network ports. This makes it easy for anyone to install the runtime and enable hybrid cloud data integration.
Code-free Data Flow – Azure Data Factory enables any developer to accelerate the development of data transformations with code-free data flows. Using the ADF Studio, you can design data transformations without writing any code. To design a data flow in Azure Data Factory, you first specify the data sources you want to read from, then apply a rich set of transformations to the data before writing it to a data store. Under the hood, Azure Data Factory runs these data flows for you at scale on a Spark cluster. Whether you are working with megabytes (MB) or terabytes (TB) of data, Azure Data Factory runs the transformation at Spark scale, without you having to set up or tune a Spark cluster. In many ways, the data transformation just works!
Secure Data Integration – Azure Data Factory supports secure data integration by connecting to the private endpoints supported by various Azure data stores. To offload the burden of managing your own virtual network, Azure Data Factory manages the virtual network under the hood. This makes it easy to set up a Data Factory and ensure all data integration happens securely within a virtual network.
CI/CD Support – Azure Data Factory enables any developer to use it as part of a continuous integration and delivery (CI/CD) process. CI/CD with Azure Data Factory enables a developer to move Data Factory assets (pipelines, data flows, linked services, and more) from one environment (development, test, production) to another. Out of the box, Azure Data Factory provides native integration with Azure DevOps and GitHub.
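As a rough sketch of the idea behind environment promotion (a conceptual model, not ADF's actual ARM-template mechanism), the snippet below shows how a Data Factory asset definition stays the same across environments while an environment-specific setting, here a hypothetical storage account name, is swapped via parameters:

```python
# Conceptual model of CI/CD promotion: the asset definition is identical
# across environments; only parameterized settings change. The storage
# account names and asset shape below are hypothetical.
ENV_PARAMS = {
    "dev":  {"storageAccount": "adfdevstore"},
    "test": {"storageAccount": "adfteststore"},
    "prod": {"storageAccount": "adfprodstore"},
}

def promote(linked_service: dict, target_env: str) -> dict:
    """Return a copy of the linked service configured for target_env."""
    promoted = dict(linked_service)  # shallow copy; original is untouched
    promoted["properties"] = {
        **linked_service["properties"],
        "storageAccount": ENV_PARAMS[target_env]["storageAccount"],
    }
    return promoted

dev_service = {
    "name": "BlobStore",
    "properties": {"storageAccount": "adfdevstore"},
}
prod_service = promote(dev_service, "prod")
```

In real ADF CI/CD, this parameter substitution is handled for you by the ARM-template export and the Azure DevOps or GitHub release pipeline.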
Data Integration and Governance
Bringing data integration and data governance together enables organizations to derive tremendous insight into lineage, policy management, and more. Azure Data Factory integrates natively with Azure Purview to provide powerful insights into ETL lineage, a holistic view of how data moves through the organization across various data stores, and more.
For example, a data engineer might want to investigate a data issue where incorrect data has been inserted due to upstream issues. By using Azure Data Factory integration with Azure Purview, the data engineer can now identify the issue easily.
Learn more about how you can integrate and provide Azure Data Factory lineage to Azure Purview.
Figure 2 – Azure Data Factory and Azure Purview Together
The nuts and bolts of Azure Data Factory
Figure 3 provides a quick overview of the Azure Data Factory concepts that can help you get started.
- Triggers – Specify when a data pipeline runs. Azure Data Factory supports several trigger types. A schedule trigger lets you run a pipeline at a specific time of day (including the time zone).
You can also have a trigger fire over a series of fixed-size, non-overlapping, contiguous time intervals using a Tumbling Window Trigger. A tumbling window trigger can also depend on other tumbling window triggers.
Storage event triggers run a pipeline when data is created or deleted in Azure Storage. A Custom Event trigger extends event-based triggering beyond storage events to any custom event pushed to Azure Event Grid.
- Pipelines and Activities – A pipeline is a grouping of activities. An activity in Azure Data Factory can copy data, perform data transformation using Data Flows, or invoke various other compute services on Azure. You can also specify iteration and conditional constructs within a pipeline.
- Integration Runtimes – An Integration Runtime (IR) is the compute infrastructure Azure Data Factory uses to perform data integration tasks (e.g., data movement, data flows, running an SSIS package, or running code on various Azure compute services).
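The tumbling-window behavior described under Triggers can be modeled in a few lines of Python. This is a conceptual sketch of the interval arithmetic only, not the ADF SDK:

```python
from datetime import datetime, timedelta

def tumbling_windows(start: datetime, end: datetime, size: timedelta):
    """Yield the fixed-size, non-overlapping, contiguous intervals a
    tumbling window trigger would fire for between start and end."""
    window_start = start
    while window_start + size <= end:
        yield (window_start, window_start + size)
        window_start += size  # next window begins where this one ended

# Six hours split into 2-hour windows: 00:00-02:00, 02:00-04:00, 04:00-06:00
windows = list(tumbling_windows(
    datetime(2021, 5, 1, 0, 0),
    datetime(2021, 5, 1, 6, 0),
    timedelta(hours=2),
))
```

Because each window starts exactly where the previous one ended, no interval is skipped and none overlap, which is what makes tumbling windows useful for reliable backfills.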
Figure 3 – Azure Data Factory Concepts
Figure 4 – Azure Data Factory Learning Path
Get started with Azure Data Factory
Many resources are available on Azure Docs and the Azure Data Factory YouTube channel to help the technical community get started with Azure Data Factory. In addition, you can follow the Azure Data Factory Learning Path (part of Microsoft Learn). Or check out ACG’s Developing a Pipeline in Azure Data Factory Hands-on Lab.
We can’t wait to see what you can build with Azure Data Factory!