Get statistics from OD API in the most performant way

Hi, we’re currently using the /data/{dataSetID}/statistics endpoint, via the Flexible REST API processor in a Workflow, to get some basic statistics for some data tables. The current URL addition looks like this:

data/{dataset_id}/statistics?distinctValuesCount=15&computeExactNullCount=true&computeDistinctSummaryForAllColumns=true
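For reference, outside the Workflow the call is essentially equivalent to the following sketch (the base URL, dataset ID, and bearer-token authentication below are placeholders, not values from our setup):

```python
import requests

# Placeholders -- adjust to your own OD instance (all three are hypothetical)
BASE_URL = "https://your-od-instance.example.com/api"
DATASET_ID = "your-dataset-id"
TOKEN = "your-api-token"

response = requests.get(
    f"{BASE_URL}/data/{DATASET_ID}/statistics",
    params={
        "distinctValuesCount": 15,
        "computeExactNullCount": "true",
        "computeDistinctSummaryForAllColumns": "true",
    },
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=3600,  # the request can take a very long time on billion-row tables
)
response.raise_for_status()
stats = response.json()
```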

As some of the tables we’re getting the statistics from are huge (more than 1 billion rows), is there something we could do to improve performance? The Workflow currently runs for many hours without finishing. Maybe change some query parameters in the endpoint?

Or is this something that needs to be tackled directly in the OD API?

Disabling the computeDistinctSummaryForAllColumns option (setting it to false) helps, at least for tables that have numeric (integer and decimal) columns (OD RepresentationTypes NUMERIC, INT, DOUBLE).
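In the Workflow this just means changing the URL addition to something like the following (same endpoint, only the last parameter flipped):

data/{dataset_id}/statistics?distinctValuesCount=15&computeExactNullCount=true&computeDistinctSummaryForAllColumns=false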
Imagine a table consisting of 1 String column and 50 Double columns (e.g. sensor data). The sensor values will most probably have as many distinct values as there are rows in your data. A distinct-value analysis over such columns is not only time- and memory-intensive, it is also limited to the top k most frequent values and will most probably not add any informative value to your statistics. Other than that, I fear there is not much performance to be gained, since we already use heuristics for larger tables to speed up computation times (note that the values may not be 100% accurate in that case).
