Data is everywhere! Consider any industry, be it healthcare, finance, or education; there is a lot of information to be stored. You can store that data efficiently in the cloud using storage services like Azure Blob Storage, Azure SQL Database, etc., or you may prefer to keep it on-premises.
Whatever the case, a considerable amount of unstructured data is stored every day. Some enterprises also ingest data both in the cloud and on-premises and may need to combine data from both sources to perform better analytics.
What is Azure Data Factory?
In situations like these, it becomes important to move and transform data across different data stores, and this is where Azure Data Factory comes into play!
Data Factory is a cloud-based ETL (Extract-Transform-Load) and data integration service that allows you to automate data movement between various data stores and transform data by creating pipelines.
Where can I use it?
Data Factory helps in the same way as any traditional ETL tool: it extracts raw data from one or more sources, transforms it, and loads it into a destination such as a data warehouse. But Data Factory differs from other ETL tools in that it performs these tasks without requiring you to write any code.
Now, don’t you agree it is a solution that fits perfectly if you are looking to turn all your unstructured data into structured data?
Before getting into the concepts, here is a quick recap of Data Factory’s history.
The version we use today has improved and evolved in numerous ways compared to the first version, which was made generally available in 2015. Back then, you could only build workflows in Visual Studio.
Version 2 (public preview in 2017) was released to overcome the challenges of v1. With Data Factory v2, you can build code-free ETL processes and leverage 90+ built-in connectors to acquire data from any data store of your choice.
Top-level Concepts
Now imagine you are moving a CSV file from Blob Storage to a customer table in an Azure SQL database; all of the concepts mentioned below will get involved.
Here are the six essential components that you must know:
Pipeline
A pipeline is a logical grouping of activities that performs a unit of work. For example, a pipeline can perform a series of tasks such as ingesting data from Blob Storage, transforming it into meaningful data, and then writing it into the SQL database. It involves mapping the activities in a sequential flow.
So, you can automate the ETL process by creating any number of pipelines for a particular Data Factory.
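To make this concrete, here is a minimal sketch of how that CSV-to-SQL pipeline could be defined programmatically, assuming a recent version of the azure-mgmt-datafactory Python SDK with azure-identity for authentication; the same thing can, of course, be built code-free in the authoring UI. All names here (subscription, resource group, factory, datasets) are hypothetical placeholders, and the datasets and linked services the copy activity references must already exist in the factory; they are sketched in the sections that follow.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference,
    DelimitedTextSource, AzureSqlSink,
)

# Hypothetical names: replace with your own subscription, resource group and factory.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
RG, DF = "my-resource-group", "my-data-factory"

# A single copy activity: read the CSV from Blob Storage, write it to the SQL table.
copy_csv_to_sql = CopyActivity(
    name="CopyCsvToCustomerTable",
    inputs=[DatasetReference(reference_name="CustomerCsvDataset", type="DatasetReference")],
    outputs=[DatasetReference(reference_name="CustomerSqlDataset", type="DatasetReference")],
    source=DelimitedTextSource(),
    sink=AzureSqlSink(),
)

# The pipeline is simply a logical grouping of such activities.
pipeline = PipelineResource(activities=[copy_csv_to_sql])
adf_client.pipelines.create_or_update(RG, DF, "CopyCsvToSqlPipeline", pipeline)
```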
Activities
These are the actions that get performed on the data in a pipeline. There are three types: data movement (copy), data transformation, and control flow activities.
But copying and transforming are the two core activities of Azure Data Factory. So here, the Copy data activity gets the CSV file from the blob and loads it into the database, and along the way you can also convert the file format.
Transformation is mainly done with the help of a capability called Data Flows, which allows you to develop data transformation logic without writing code.
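As a rough illustration of how activities surface at run time, the sketch below (same assumed SDK and hypothetical names as above) kicks off an on-demand run of the pipeline and lists its individual activity runs, such as the copy activity.

```python
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
RG, DF = "my-resource-group", "my-data-factory"

# Kick off an on-demand run of the pipeline created earlier.
run = adf_client.pipelines.create_run(RG, DF, "CopyCsvToSqlPipeline")

# List the activity runs (e.g. the copy activity) that belong to this pipeline run.
window = RunFilterParameters(
    last_updated_after=datetime.now(timezone.utc) - timedelta(hours=1),
    last_updated_before=datetime.now(timezone.utc) + timedelta(hours=1),
)
activity_runs = adf_client.activity_runs.query_by_pipeline_run(RG, DF, run.run_id, window)
for activity in activity_runs.value:
    print(activity.activity_name, activity.status)
```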
Datasets
Datasets describe the kind of data you are pulling from or writing to the data stores. A dataset simply points to the data used in an activity as an input or output.
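Continuing the same example, here is a minimal sketch of the two datasets the copy activity points to: a delimited-text (CSV) dataset over the blob and a table dataset for the SQL side. The container, file, and table names are hypothetical, and the linked services they reference are defined in the next section.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatasetResource, DelimitedTextDataset, AzureSqlTableDataset,
    AzureBlobStorageLocation, LinkedServiceReference,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
RG, DF = "my-resource-group", "my-data-factory"

blob_ls = LinkedServiceReference(reference_name="BlobStorageLinkedService", type="LinkedServiceReference")
sql_ls = LinkedServiceReference(reference_name="SqlDatabaseLinkedService", type="LinkedServiceReference")

# Input: the CSV file sitting in Blob Storage.
csv_dataset = DatasetResource(properties=DelimitedTextDataset(
    linked_service_name=blob_ls,
    location=AzureBlobStorageLocation(container="input", file_name="customers.csv"),
    column_delimiter=",",
    first_row_as_header=True,
))

# Output: the customer table in the Azure SQL database.
sql_dataset = DatasetResource(properties=AzureSqlTableDataset(
    linked_service_name=sql_ls,
    table_name="dbo.Customer",
))

adf_client.datasets.create_or_update(RG, DF, "CustomerCsvDataset", csv_dataset)
adf_client.datasets.create_or_update(RG, DF, "CustomerSqlDataset", sql_dataset)
```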
Linked Services
Linked Services connect other resources to Data Factory, acting like connection strings.
In the example above, a Linked Service serves as the definition of connectivity to the Blob Storage account and handles authorization. Similarly, the target (the SQL database) has its own separate Linked Service.
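A rough sketch of those two Linked Services follows, assuming simple connection-string authentication; in practice you would typically keep the secrets in Azure Key Vault rather than inline. The account, server, and database values are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureBlobStorageLinkedService,
    AzureSqlDatabaseLinkedService, SecureString,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
RG, DF = "my-resource-group", "my-data-factory"

# Connection to the source Blob Storage account.
blob_ls = LinkedServiceResource(properties=AzureBlobStorageLinkedService(
    connection_string=SecureString(value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"),
))

# Connection to the target Azure SQL database.
sql_ls = LinkedServiceResource(properties=AzureSqlDatabaseLinkedService(
    connection_string=SecureString(value="Server=tcp:<server>.database.windows.net;Database=<db>;User ID=<user>;Password=<password>"),
))

adf_client.linked_services.create_or_update(RG, DF, "BlobStorageLinkedService", blob_ls)
adf_client.linked_services.create_or_update(RG, DF, "SqlDatabaseLinkedService", sql_ls)
```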
Triggers
Triggers execute your pipelines without any manual intervention; that is, they determine when a pipeline run should be kicked off.
There are three essential trigger types in Data Factory:
- Schedule Trigger: A trigger that invokes a pipeline on a set schedule. You can specify both the date and time on which the trigger should initiate the pipeline run.
- Tumbling Window Trigger: A trigger that fires on a periodic interval from a specified start time, while retaining state.
- Event-based Trigger: This trigger executes whenever an event occurs. For instance, when a file is uploaded or deleted in Blob Storage, the trigger will respond to that event.
Triggers and pipelines have a many-to-many relationship: multiple triggers can kick off a single pipeline, and a single trigger can kick off multiple pipelines. The one exception is the Tumbling Window Trigger, which has a one-to-one relationship with a pipeline.
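As an example, here is a minimal sketch of a Schedule Trigger that kicks off the pipeline from the earlier example once a day, using the same assumed SDK and hypothetical names.

```python
from datetime import datetime, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, PipelineReference,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
RG, DF = "my-resource-group", "my-data-factory"

# Run the copy pipeline once a day, starting now.
schedule = ScheduleTrigger(
    recurrence=ScheduleTriggerRecurrence(
        frequency="Day",
        interval=1,
        start_time=datetime.now(timezone.utc),
        time_zone="UTC",
    ),
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(reference_name="CopyCsvToSqlPipeline", type="PipelineReference"),
    )],
)

adf_client.triggers.create_or_update(RG, DF, "DailyCopyTrigger", TriggerResource(properties=schedule))
# Triggers are created in a stopped state; start it so the schedule takes effect
# (use triggers.start on older SDK versions).
adf_client.triggers.begin_start(RG, DF, "DailyCopyTrigger").result()
```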
Integration Runtime
Integration Runtime is the compute infrastructure Data Factory uses to provide data integration capabilities (Data Flow, data movement, activity dispatch, and SSIS package execution) across different network environments.
It has three types:
- Azure integration runtime: This is preferred when you want to copy and transform data between data stores in the cloud.
- Self-hosted integration runtime: Use this when your activities need to reach on-premises data stores or a private network (see the sketch after this list).
- Azure-SSIS integration runtime: It helps to execute SSIS packages through Data Factory.
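For illustration, here is a rough sketch of registering a self-hosted integration runtime with the same assumed SDK and hypothetical names; after creating it, you would install the runtime on the on-premises machine and register it with one of the returned authentication keys.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource, SelfHostedIntegrationRuntime,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
RG, DF = "my-resource-group", "my-data-factory"

# Register a self-hosted integration runtime for reaching on-premises data stores.
ir = IntegrationRuntimeResource(properties=SelfHostedIntegrationRuntime(
    description="Runtime for the on-premises data sources",
))
adf_client.integration_runtimes.create_or_update(RG, DF, "OnPremisesRuntime", ir)

# Retrieve the keys used to register the runtime software installed on-premises.
keys = adf_client.integration_runtimes.list_auth_keys(RG, DF, "OnPremisesRuntime")
print(keys.auth_key1)
```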
I hope you now understand what a Data Factory is and how it works. Let us move on to its monitoring and management aspects.
Monitoring Azure Data Factory
Azure Monitor supports monitoring your pipeline runs, trigger runs, integration runtimes, and various other metrics. It has an interactive dashboard where you can view statistics for all the runs in your Data Factory. In addition, you can create alerts on these metrics to get notified whenever something goes wrong in the Data Factory.
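The same run history can also be queried programmatically through the Data Factory SDK, which is handy for custom reporting or alerting. Here is a minimal sketch that pulls the failed pipeline runs from the last 24 hours, again with the hypothetical names used earlier.

```python
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters, RunQueryFilter

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
RG, DF = "my-resource-group", "my-data-factory"

# Ask for all pipeline runs in the last 24 hours that ended in failure.
filters = RunFilterParameters(
    last_updated_after=datetime.now(timezone.utc) - timedelta(days=1),
    last_updated_before=datetime.now(timezone.utc),
    filters=[RunQueryFilter(operand="Status", operator="Equals", values=["Failed"])],
)
runs = adf_client.pipeline_runs.query_by_factory(RG, DF, filters)
for run in runs.value:
    print(run.pipeline_name, run.run_id, run.status, run.message)
```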
Azure Monitor does offer the necessities for monitoring and alerting, but real business scenarios often demand much more than that!
Also, the Azure portal becomes difficult to manage when you have Data Factories with different pipelines and data sources spread across various subscriptions, regions, and tenants.
So, here is a third-party tool, Serverless360, that can help your operations and support teams manage and monitor Data Factory much more efficiently.
Capabilities of Serverless360
Serverless360 serves as a complete support tool for managing and monitoring your Azure resources. It provides application-level grouping and extensive monitoring features that are not available in the Azure portal.
A glimpse of what Serverless360 can offer:
- An interactive and vivid dashboard for visualizing complex data metrics.
- Group all your siloed Azure resources using the business application feature to achieve application-level monitoring.
- Get one consolidated monitoring report to know the status of all your Azure resources.
- Monitor the health status and get a report at regular intervals, say every 2 hours.
- Configure threshold monitoring rules to get alerted whenever a resource is not in the expected state; Serverless360 can also automatically correct it and bring it back to the active state.
- Monitor the resources on various metrics like canceled activity runs, failed pipeline runs, succeeded trigger runs, etc., without any additional cost.
Conclusion
In this blog, I gave an overview of an essential ETL tool, Azure Data Factory, and the core features you should be aware of if you plan to use it for your business. Along with that, I also mentioned a third-party Azure support tool that can reduce the pressure of managing and monitoring your resources.
I hope you had a great time reading this article!