Learn Kusto - clustering your data
This week is going to be very technical. I've also learned something new by writing this post, and I hope you will too.
The Azure Data Explorer team has released a set of new functions to help with clustering data using a Log Reduce approach.
The “old” approach
Before the release described below, the ADX service already had a good handful of features to help with anomaly detection and clustering on semi-structured data.
With functions like basket() and autocluster(), the service can find patterns based on common values across the columns. The problem with these functions is that they cannot parse free-text columns and extract tokens and repeatable patterns.
Yes, you could use the diffpatterns_text() function, but that is not strong enough to cover the real diversity of free-text log records.
The new approach
A set of new features has been released to help cover the above gap.
- log_reduce_fl()
- log_reduce_full_fl()
- log_reduce_train_fl()
- log_reduce_predict_fl()
- log_reduce_predict_full_fl()
The last three functions are good for building and training ML models that predict and classify known patterns or new anomalies. The first two are good for ad-hoc usage when working with scoring models and developing algorithms.
Notice that the Python plugin needs to be enabled on the cluster for this to work, which means that you cannot test this on a free Kusto cluster at this time.
log_reduce_fl()
This function finds common patterns in text fields and outputs a summary of the records with tokenized results.
Example data (the column name is [log]):
081110 215858 15511 INFO dfs.DataNode$DataXceiver: Receiving block blk_-8578644687709935034 src: /10.251.107.227:39600 dest: /10.251.107.227:50010
Code:
<tablename>
| invoke log_reduce_fl(reduce_col="log")
Result:
081110 <NUM> <NUM> INFO dfs.DataNode$DataXceiver: Receiving block blk_<NUM> src: <IP> dest: <IP>
Notice that the variable parts of the text field are replaced with tokens, leaving the repeatable pattern.
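To get a feel for what "replacing values with tokens" means, here is a minimal Python sketch. It is not the Log Reduce algorithm and not how the Kusto function is implemented: the two regular expressions and the names mask() and summarize() are my own assumptions, and the second log line is a made-up variation of the example above. Because the sketch masks every number, it also replaces the leading 081110 with <NUM>, which the real function keeps in the result above.

```python
import re
from collections import Counter

# Hypothetical masking rules, for illustration only. The real function
# learns patterns from the data instead of relying on fixed regexes.
IP_RE = re.compile(r"/?\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?")  # IP with optional port
NUM_RE = re.compile(r"-?\d+")                              # signed integer


def mask(line: str) -> str:
    """Replace IP addresses first, then any remaining numbers, with tokens."""
    line = IP_RE.sub("<IP>", line)
    return NUM_RE.sub("<NUM>", line)


def summarize(lines):
    """Count how many lines share each masked pattern, most common first."""
    return Counter(mask(line) for line in lines).most_common()


logs = [
    # The example line from this post.
    "081110 215858 15511 INFO dfs.DataNode$DataXceiver: Receiving block "
    "blk_-8578644687709935034 src: /10.251.107.227:39600 dest: /10.251.107.227:50010",
    # A made-up second line with different values, for illustration.
    "081110 215859 15512 INFO dfs.DataNode$DataXceiver: Receiving block "
    "blk_-1608999687919862906 src: /10.250.19.102:54106 dest: /10.250.19.102:50010",
]

for pattern, count in summarize(logs):
    print(count, pattern)
```

Both lines collapse into one pattern with a count of 2, which is the idea behind the summarized output.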
log_reduce_full_fl()
This does the same as the function above, but returns a full table of the records instead of a summarized table. A full table means that every row is returned from the query service, and that the variable values are replaced with <parameter_(n)> placeholders instead of the <NUM> or <IP> tokens.
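The difference between a summary and a full result can be sketched in a few lines of Python. Again, this is only an illustration under my own assumptions: the regular expression and the name reduce_full() are hypothetical, and the placeholder naming simply follows the <parameter_(n)> form mentioned above. Every input row produces one output row, with the pattern and the values that were pulled out of it.

```python
import re

# Hypothetical rule: an IP address with optional port, or a signed integer.
# The IP alternative comes first so it wins when both could match.
VAR_RE = re.compile(r"/?\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?|-?\d+")


def reduce_full(line: str):
    """Return (pattern, parameters) for one line; no rows are collapsed."""
    params = []

    def repl(match):
        # Remember the original value and emit a numbered placeholder.
        params.append(match.group(0))
        return f"<parameter_{len(params)}>"

    pattern = VAR_RE.sub(repl, line)
    return pattern, params


line = (
    "081110 215858 15511 INFO dfs.DataNode$DataXceiver: Receiving block "
    "blk_-8578644687709935034 src: /10.251.107.227:39600 dest: /10.251.107.227:50010"
)
pattern, params = reduce_full(line)
print(pattern)
print(params)
```

Keeping the extracted values next to the pattern is what makes a full result useful for drilling into individual records.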
log_reduce_train_fl()
This does the same thing as the log_reduce_fl() function, but it does not return a result set. Instead, the function delivers a serialized trained model based on the input data.
log_reduce_predict_fl()
This function uses a trained model, perhaps the one from the function above, to match records against known patterns and flag anomalies when no matching pattern is found in the model.
The result set is a summary of the anomalies detected. To get a full result set, you need to use the function below.
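The train and predict split can be illustrated with a toy Python sketch. Everything here is an assumption for the sake of the example: the "model" is just a JSON list of masked patterns, the names train() and predict() and the "?" anomaly marker are hypothetical, and the new log lines are made up. The real functions use the Log Reduce algorithm and their own model format.

```python
import json
import re
from collections import Counter

# Hypothetical rule: an IP address with optional port, or a signed integer.
VAR_RE = re.compile(r"/?\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?|-?\d+")


def mask(line: str) -> str:
    """Replace every variable value with a generic wildcard token."""
    return VAR_RE.sub("<*>", line)


def train(lines) -> str:
    """Toy 'training': serialize the set of patterns seen in the input."""
    return json.dumps(sorted({mask(line) for line in lines}))


def predict(model: str, lines, anomaly_str: str = "?") -> Counter:
    """Summarize new lines per known pattern; unknown ones count as anomalies."""
    known = set(json.loads(model))
    return Counter(
        mask(line) if mask(line) in known else anomaly_str for line in lines
    )


training_logs = [
    "081110 215858 15511 INFO dfs.DataNode$DataXceiver: Receiving block "
    "blk_-8578644687709935034 src: /10.251.107.227:39600 dest: /10.251.107.227:50010",
]
# Made-up new records: one matching the trained pattern, one that does not.
new_logs = [
    "081110 220101 15600 INFO dfs.DataNode$DataXceiver: Receiving block "
    "blk_-42 src: /10.0.0.1:1 dest: /10.0.0.2:50010",
    "081110 220102 15601 ERROR something never seen before",
]

model = train(training_logs)
result = predict(model, new_logs)
for pattern, count in result.items():
    print(count, pattern)
```

The point of the split is that training happens once on historical data, while prediction can run repeatedly on new records and only has to check them against the stored patterns.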
log_reduce_predict_full_fl()
As with the other full function, this returns a full result set based on the same functionality as the log_reduce_predict_fl() function.
Interested in reading more
The new functions are based on the Log Reduce algorithm. You can follow this link to read more about how it works, but be aware that it will tax your brain. I know it did for me.
The complete documentation from Microsoft can be found here:
- log_reduce_fl()
- log_reduce_full_fl()
- log_reduce_train_fl()
- log_reduce_predict_fl()
- log_reduce_predict_full_fl()
Happy coding!