Reading Delta Lake tables in Power BI

TRAN Ngoc Thach
4 min read · Nov 1, 2022

Connect Power BI to Delta Lake, the easy way!

Introduction

Delta Lake, as the term Lake suggests, is designed to be a unified place for storing structured, semi-structured, and unstructured data, and to be friendly to both streaming and batch workloads. Under the hood, Delta Lake builds on Apache Spark, adding support for ACID transactions (i.e., a write either fully succeeds or fully aborts). Even more appealing, one can save a DataFrame as a Delta Lake table by simply indicating the format:

a_spark_dataframe
  .write
  .format("delta")
  .mode("overwrite")
  .saveAsTable("test_table")

After being saved, the table can be queried right away, e.g. spark.sql("SELECT COUNT(*) FROM test_table"), assuming Hive support is enabled. However, if you have already established Microsoft Power BI as your Business Intelligence platform, it is tempting to fetch the Delta Lake tables, whose data are stored somewhere such as HDFS or Azure Blob Storage, into Power BI and visualize the data, or even transform it beforehand with Power Query, e.g. Table.SelectRows() for filtering.
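
As a taste of that last step, here is a minimal Power Query sketch of such a filter; test_table stands for a Delta Lake table already loaded by either method below, and the amount column is a hypothetical example:

let
    // filter a previously loaded Delta Lake table; [amount] is a hypothetical column
    filtered_rows = Table.SelectRows(test_table, each [amount] > 100)
in
    filtered_rows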

The writer is going to show two recommended ways to load Delta Lake tables into Power BI Desktop, along with the pros and cons of each approach.

From Delta Lake to Power BI

Method 1: via a pure Power Query function and a file-listing Connector

Thanks to this user guide, we first import (New Source > Blank Query > Advanced Editor) the Power Query function fn_ReadDeltaTable() into Power BI, making it re-usable in other Power Query scripts of the same Power BI file. The function then shows up in the Queries panel.
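
If you prefer not to paste the function body into the Advanced Editor by hand, one possible shortcut is to let Power Query download and evaluate the function definition itself. This is only a sketch: the raw URL below is an assumption, so take the actual one from the user guide:

let
    // download the published function definition (URL is an assumption; see the user guide)
    raw_source = Web.Contents("https://raw.githubusercontent.com/delta-io/connectors/master/powerbi/fn_ReadDeltaTable.pq"),
    // evaluate the downloaded M source text into a callable function
    fn_ReadDeltaTable = Expression.Evaluate(Text.FromBinary(raw_source), #shared)
in
    fn_ReadDeltaTable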

Do another import, but this time we actually use the aforementioned function. To simplify, let’s assume the Delta Lake table data are stored in HDFS:

let
    test_table = fn_ReadDeltaTable(Hdfs.Files("[an_ip]:50070/test_table"), [UseFileBuffer = true])
in
    test_table

The test_table output is the desired Delta Lake table. Here are the pros and cons:

  • Pros: simple, because the Power Query function fn_ReadDeltaTable() is lightweight. Relying on a separate file-listing Connector, e.g. Hdfs.Files, is a blessing: the connection complexity is off-loaded to a 3rd-party Connector, which can even be swapped for a different storage system (see the sketch after this list).
  • Cons: relying on a file-listing Connector also brings a hidden limitation: loading the data (before interpreting it as a Delta Lake table) depends on the capabilities of that 3rd-party Connector. For example, there is no Power BI Connector to fetch data from AWS S3!
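
To illustrate the swappability, here is a sketch that feeds fn_ReadDeltaTable() from Azure Data Lake Storage Gen2 instead of HDFS; the account and container names are hypothetical:

let
    // list the Delta Lake table folder on Azure Data Lake Storage Gen2;
    // the account and container names are hypothetical
    folder_content = AzureStorage.DataLake("https://an_account.dfs.core.windows.net/a_container/test_table"),
    test_table = fn_ReadDeltaTable(folder_content, [UseFileBuffer = true])
in
    test_table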

Method 2: via the official Connector for Delta Lake

There is indeed such an official Connector, built by Databricks on top of the Delta Sharing protocol (more on that below). The reader can leverage it directly through Power Query:

let
    // connect to a Delta Sharing endpoint
    Source = DeltaSharing.Contents("[an_ip]:60000/delta-sharing"),
    // navigate share -> schema -> table
    deltalake = Source{[Name = "deltalake"]}[Data],
    schema = deltalake{[Name = "schema"]}[Data],
    test_table = schema{[Name = "test_table"]}[Data]
in
    test_table

Or they may opt for the step-by-step wizard in the Get Data dialog instead.

Pros and cons:

  • Pros: an official Power BI Connector, by Databricks.
  • Cons: this Connector is made possible by Delta Sharing, an open protocol for secure data sharing. The Delta Sharing Reference Server, a reference implementation of the protocol, only supports AWS S3, Azure Blob Storage, Azure Data Lake Storage Gen2, and Google Cloud Storage (not HDFS or a networked/shared disk). The author thinks this is by design: the Server acts as an authentication and management point, while the clients, e.g. Power BI, must then contact the storage provider, e.g. S3, to actually fetch the data through presigned URLs; otherwise, the Server would become a bottleneck. Another limitation: some encryption schemes, e.g. S3 with SSE-C, require that HTTP requests against the presigned URLs carry additional headers with the encryption key and its hash, but the author found no way to specify such headers in the official Connector (see the sketch below).
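
For concreteness, here is a minimal sketch of what fetching one such presigned URL looks like in raw Power Query, with a hypothetical S3 object encrypted under SSE-C; DeltaSharing.Contents() exposes no equivalent of the Headers option used here:

let
    // a hypothetical presigned URL, as returned by the Delta Sharing server
    presigned_url = "https://a-bucket.s3.amazonaws.com/a-file.parquet?X-Amz-Signature=an_opaque_signature",
    // SSE-C requires the customer key and its MD5 hash on every request;
    // Web.Contents() can send such headers, but the official Connector cannot
    response = Web.Contents(presigned_url, [
        Headers = [
            #"x-amz-server-side-encryption-customer-algorithm" = "AES256",
            #"x-amz-server-side-encryption-customer-key" = "a_base64_encoded_key",
            #"x-amz-server-side-encryption-customer-key-MD5" = "a_base64_encoded_md5"
        ]
    ])
in
    response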

Conclusion

We have learned two methods to load Delta Lake tables into Power BI. Depending on the specific use-case, we may prefer one approach over the other.

For example, if using HDFS, the first method is more convenient. If the intent is to share Delta Lake tables with a larger audience over the Internet, the data provider may opt for the Delta Sharing protocol; in that case, we shall use the second, official Connector.
