Cloud computing is the future, available today. It may seem complicated in the context of data warehousing, but nothing could be further from the truth!
Platform as a service (PaaS) is a cloud computing model that guarantees the availability of all key elements within one platform, directly in the cloud: from development tools, through deployment, monitoring, and management, all the way to the analysis and presentation of data. It is a novel approach that is still gaining popularity, although it sometimes requires a fresh look at concepts that have been proven for years.
When we think about a data warehouse and data analysis on a corporate scale, we usually picture the traditional mechanisms based on on-premises architecture. The four most important platforms (SQL Server, Integration Services, Analysis Services, and Reporting Services) have helped us build great solutions for years. However, they are often not flexible enough to keep up with the needs of constantly evolving companies. In the case of servers, scaling up vertically (adding resources to a machine) or scaling out horizontally (adding machines) sometimes becomes very expensive, takes too long, or is even completely impossible to implement or maintain. The decisive factor is therefore knowledge of alternative solutions that allow us to achieve similar goals (building on the know-how and experience gained from well-known mechanisms) with the help of new, cost-optimized technologies that open up possibilities on an unprecedented scale.
Their potential was recognized by one of our clients, the largest retailer of books, press, and music in Poland. By entrusting us with the task of creating a complete data warehouse solution in the PaaS approach, they knew that they could count on speed in designing and creating processes, flexibility in performance scaling and cost management, and further development of their platform based on a growing range of Azure cloud capabilities.
Thanks to the Integration Runtime (IR) service, we can quickly and easily install software on the client's side that lets us communicate securely and efficiently with servers within their infrastructure. Whether we want to connect to a SQL Server, Oracle, MySQL, or PostgreSQL database, download a file, or call a REST API, all we need to do is configure the security keys, and we can connect from the Azure services. The list of supported databases and formats, although already extensive, is still growing. In addition, if the need arises, we can even use a specially prepared environment in the cloud to run our existing Integration Services (SSIS) packages.
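To give a feel for how this works in practice, here is a minimal sketch of an Azure Data Factory linked service that reaches an on-premises SQL Server database through a self-hosted Integration Runtime. The names (`OnPremSqlServer`, `SelfHostedIR`) and the connection string are illustrative placeholders, not values from the project described here:

```json
{
  "name": "OnPremSqlServer",
  "properties": {
    "type": "SqlServer",
    "typeProperties": {
      "connectionString": "Server=myserver;Database=Sales;Integrated Security=True"
    },
    "connectVia": {
      "referenceName": "SelfHostedIR",
      "type": "IntegrationRuntimeReference"
    }
  }
}
```

The `connectVia` block is the key part: it tells Azure to route traffic through the IR software installed inside the client's network instead of trying to reach the server directly from the cloud.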
Azure Data Factory V2 (ADFv2) is the cloud equivalent of SSIS packages; it enables faster design of data flows and transformations from many sources located both on on-premises servers and in the cloud. Creating, managing, running, and monitoring the so-called "pipelines" is carried out in a web browser, on the dedicated Azure portal.
In comparison with its previous version (v1), preparing solutions is more convenient and intuitive here. The flow-creation process is reminiscent of the well-known BIDS or Visual Studio with Integration Services components. We also find here functionalities such as the use of Azure Databricks, Azure Data Lake, or even HDInsight for data analytics, Azure Batch for custom flows or transformations, and finally support for Git repositories and the related mechanisms of continuous integration and deployment. ADFv2 is a powerful tool that we can additionally manage using the REST API or directly from PowerShell.
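As an illustration of what a pipeline definition looks like under the hood, below is a minimal sketch of an ADFv2 pipeline with a single Copy activity moving data from an on-premises table to a data lake. The pipeline and dataset names (`CopySalesToLake`, `OnPremSalesDataset`, `DataLakeSalesDataset`) are hypothetical examples, not part of the project described above:

```json
{
  "name": "CopySalesToLake",
  "properties": {
    "activities": [
      {
        "name": "CopySalesTable",
        "type": "Copy",
        "inputs": [
          { "referenceName": "OnPremSalesDataset", "type": "DatasetReference" }
        ],
        "outputs": [
          { "referenceName": "DataLakeSalesDataset", "type": "DatasetReference" }
        ],
        "typeProperties": {
          "source": { "type": "SqlServerSource" },
          "sink": { "type": "ParquetSink" }
        }
      }
    ]
  }
}
```

In day-to-day work you rarely write this JSON by hand; the browser-based designer generates it for you, but the same definition can be created or triggered programmatically via the REST API or PowerShell, which is what makes CI/CD for pipelines possible.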
Do you like it and need more?
Don’t worry! The second part of this article will be available next week!
Learn it. Know it. Done.