World’s Fastest Medallion Architecture Load
Imagine you have a streaming dataset with some crucial information you need to report on, but the data is only half the story, and you need to put in an effort to clean, join and prepare the data for analytics use.
Normally, an approach could be to start a Spark Notebook or a Pipeline to get data moving and do the needed changes. From there, sink the data in the medallion architecture to end in the Gold layer for reporting.
This approach is not at all wrong - but if you really need things to be lightning fast - you need a different approach. One that exceeds your expectations and gives you the benefits and flexibility of the Bronze, Silver and Gold layers.
The Real-Time Intelligence service is your answer
With the introduction and release of the Eventhouse service in Fabric, you can now leverage the speed and agility of the KQL engine to make things blazingly fast and deliver data within a few milliseconds from arrival.
Just to be clear on the approach, then we receive data in the Bronze layer as a 1:1 copy of the data coming in from the source. In the silver layer the data might have some new structure and added business value and logic. In the gold layer data is ready to be used in semantic models and analytics. Nothing new there, and we all have our own definition of the different layers. This is just an example.
For this blog post, I’ll use the NYCTaxi dataset, which is available for public use from Microsoft as demo dataset or directly from the source itself right here.
From Bronze to Silver –> Update policies
When a row is landing in the Bronze layer, in this thought case, we need to transform the data and join in a reference table to get the complete dataset.
Remember, that the datamodelling in Real-Time Intelligence (for instance to a star-schema) is not needed, it is quite ok to have one broad table with all the needed data.
The KQL query for the transformation could look something like this - given that I have a reference table named Vendors in my KQL Database:
NYCTaxi
| project VendorID, RatecodeID, PULocationID, DOLocationID, fare_amount, passenger_count
| join kind=inner Vendors on VendorID
| project VendorID, RatecodeID, PULocationID, DOLocationID, fare_amount, passenger_count, VendorName
Above statement adds the Vendorname to the incoming dataset.
This works fine, but we need to move this result set to the silver layer for each row landing in to the bronze layer. This is where the Update Policies kicks in.
Update Policies in KQL is a feature which lets you handle the data landing in a table in what ever manner you need. It could be to offload it to an external data sink, send it onwards to a different service or, as in this case, manipulate it and send it to the next layer in the Medallion architecture. Read more about the update policies right here.
To create a table Update Policy, the best thing to do first is to create a function to handle the query for the data manipulation.
.create function LoadTaxiDataToSilver {
NYCTaxi
| project VendorID, RatecodeID, PULocationID, DOLocationID, fare_amount, passenger_count
| join kind=inner Vendors on VendorID
| project VendorID, RatecodeID, PULocationID, DOLocationID, fare_amount, passenger_count, VendorName
| extend Ingestion_time = ingestion_time()
}
Please notice that I’ve added a new meta-data column called Ingestion_time to keep track of the speed on rows coming in and being handed over to the different layers onwards in the proces.
With above function, we seperate the update policy and the query to handle the data. This is way easier to handle and govern when it is in seperate code sections.
After the function has been created, we need the destination table to be created with the needed columns. This could look something like this:
.create table SilverNYXTaxi (
VendorID:long
,RatecodeID:long
,PULocationID:long
,DOLocationID:long
,fare_amount:real
,passenger_count:real
,Vendorname:string
,Ingestion_time:datetime
)
With the destination table now in place, we can add this function to an update policy like below. Notice that the update policy is added to the target table, not the source table.
.alter table SilverNYCTaxi policy update
@'[{
"IsEnabled": true //yes, we want this to run
,"Source": NYCTaxi //the source table
,"Query": LoadTaxiDataToSilver //the function we just created
,"IsTransactional": true //for each new row in the source table
,"PropagateIngestionProperties": false //this is for debugging
}]'
When above Update Policy is put into the KQL database, the data will now begin to automatically propatage to the Siler layer table with the manipulated data based on the query in the function we created.
All of this happens within a few milliseconds…
From Silver to Gold –> Materialized view
With the data safe and sound loaded to the Silver layer, we can now focus on the load to the Gold layer.
In this scenario, I will show you a “virtual” manipulation of the data in a materialized view. It is a view based on a KQL statement, with an always updated resultset for ultra fast query times. Read more about the Materialized views right here.
To create a materialized view, I have to create a KQL statement with the result set I need for my report. Just like a view from a SQL server, the underlying query is just a normal query to be executed.
It could be something like this:
SilverNYCTaxi
| project VendorName, PULocationID, DOLocationID, fare_amount, Ingestion_time
When I have my KQL Query in place and the resultset is what I want for my analytics, then I just need to create it as a materialized view:
.create materialized-view GoldNYCTaxi on table SilverNYCTaxi {
SilverNYCTaxi
| project VendorName, PULocationID, DOLocationID, fare_amount, Ingestion_time
}
And that’s it!
The materialzed view is now in place, and we have our complete, world’s fastest medallion architecture load from Bronze to Gold. Every row arriving in the Bronze layer is manipulated, loaded to the Silver layer within milliseconds and the materialized view handles the final phase to Gold.
Have you seen this faster
If you have seen any medallion architecture being loaded faster than this, please let me know. Because, then this approach is not the World’s fastest approach.
If you have any comments or ideas, please leave them in the comment section below.