Parsing datatable API response takes exceptionally long time

Hi all,

I am trying to get datatable-related information using the API endpoint /data/{dataSetID} for all datatables, but parsing the response body using XML / JSON PARSING V1.2 takes an exceptionally long time.

The parsing processor alone takes around 20 minutes for 800 tables (for comparison, on another instance parsing 2000 tables takes less than 2 minutes). However, many of these 800 tables are quite large (millions of rows, or several GBs).

Is this long parsing time related to my tables’ size, or is there a way to optimize the execution time?

Hey Aman,

first of all, let me make sure I understand your use case correctly: you’re trying to parse the content of ONE DATA Data Tables that gets “transferred into the workflow” via an API call. Is it just the metadata, or are you really requesting the data itself? If the latter is the case, I’d like to know more about your particular challenge at hand. For the rest of my answer, I assume you are only reading the metadata.

Firstly, when only dealing with metadata, the physical size of the tables (#rows/columns, and size of content) does not matter (unless you include samples/statistics in your requests and these are not yet computed; let’s assume this does not play a role here).
If you want to find out whether the parsing itself takes up the main portion of execution time, consider a Full Debug run on the parsing processor (or any downstream successor of it). My suspicion is that it’s actually the API calls causing the main compute time here (in a Full Debug run, I expect the execution time of the API processor to be about as high as that of the JSON parsing).
If it’s the API calls, you may be facing limitations of your instance concerning its cycle times. There’s not much you can do about this except trying to distribute the API calls, which is only advisable to a certain extent, since distributed requests can easily flood the API and render the instance quite unresponsive during request execution.
But: If it’s really the JSON parsing, try repartitioning after the API processor (add a “Manipulate Partitions” processor, leave the default partition count in the processor and use Repartition - enable the toggle) to make the JSON parsing a distributed task. By default (when API calls are not executed in a distributed way - see above statement and refer to the API processor’s config), the partition count of the output of the API processor is 1, leaving parallel processing capabilities on the table.
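To make the partition logic concrete, here is a minimal Python sketch of what repartitioning does conceptually (the payloads and the partition count of 8 are assumed values, not the platform's actual internals):

```python
# Sketch: the API processor's output arrives as ONE partition by default;
# repartitioning splits it so each chunk can be parsed by a separate executor.
rows = ['{"table": %d}' % i for i in range(800)]   # hypothetical API responses

single_partition = [rows]                          # partition count 1 (the default)
target_partitions = 8                              # assumed target count
size = -(-len(rows) // target_partitions)          # ceiling division -> 100 rows each
repartitioned = [rows[i:i + size] for i in range(0, len(rows), size)]

print(len(single_partition), len(repartitioned))   # 1 8
```

With a single partition, all 800 documents are parsed sequentially by one executor; after repartitioning, each of the 8 chunks can be parsed independently.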
You should also be able to spot individual request roundtrip times in the (success) output of the API processor itself (when using the latest version), or by running a few experiments calling the API from your local machine via Postman or the onedata-client in the browser.
This should give you a rough estimate of how long a single API request takes. Do the math and see whether it matches your 20-minute timing.
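As a back-of-the-envelope version of that math (the 1.5 s roundtrip here is an assumed value for illustration, not a measurement):

```python
# Sequential-time estimate: total = requests * average roundtrip
num_tables = 800
roundtrip_s = 1.5          # assumed average roundtrip per metadata request, seconds
total_min = num_tables * roundtrip_s / 60
print(f"Estimated sequential total: {total_min:.0f} min")  # 20 min at 1.5 s/request
```

If your measured per-request roundtrip multiplied by 800 lands near 20 minutes, the API calls are indeed the bottleneck rather than the parsing.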


Hi Flogge,

thanks for the detailed explanation. Indeed, the bottleneck was the API processors, as far as I can see from the Full Debug run.

I only need the metadata (not the whole data), so I am using api/v1/data/{dataset_id}. However, the response of this API call already contains a sample of the data (25 rows of each column as an attribute, as you can see in the screenshot). Do you think this is the problem here? Or is there a way to just ignore these 25 rows?
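One option is to drop the sample attributes right after parsing so they don't weigh down downstream processing; a minimal sketch, assuming a hypothetical response shape where the attribute is called "sample" (check your actual payload for the real key name). Note this only reduces downstream work, not transfer time; if the endpoint supports excluding samples via a request parameter, that would be preferable:

```python
import json

# Hypothetical response shape; "sample" is an assumed key name
raw = ('{"id": "dt-1", "name": "orders", "columns": '
       '[{"name": "price", "type": "double", "sample": [1.0, 2.0]}]}')
meta = json.loads(raw)
for col in meta.get("columns", []):
    col.pop("sample", None)   # discard the embedded per-column samples
print(meta)
```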

I am doing the API calls with “Do Rest-API Queries distributed” toggled on. The interesting thing is that the same WF (which collects info about >800 tables) runs in 10 minutes on the Markant DEV instance; on the other hand, it takes around 50 minutes on Markant PROD, even though PROD should be much more powerful (but the datatables are much larger there).

It is actually the opposite: all REST processors in non-distributed mode produce one partition per output row, with the exception of the Flexible REST processor v1.1, where this issue was fixed. There it is user-configurable how many rows are put into a partition, with a default of 100 rows.


@amanmyrat.abdullayev two remarks here:

  • The Flexible REST API processor v1.1 outputs contain performance parameters (especially the response times), allowing you to deduce such information without running the WF in debug mode
  • if you execute any of the REST processors in distributed mode and the number of partitions in the input is large enough, it will run with the maximum possible parallelization, which can easily flood the target system and lead to situations where sequential (i.e. non-distributed) execution gives overall better results. You can reduce the degree of parallelism by placing a Manipulate Partitions processor directly before the REST processor and, for example, reducing the input to two partitions
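Capping the degree of parallelism as described can be sketched in plain Python, with `max_workers=2` playing the role of the two input partitions (the `fetch` stub and its simulated latency are assumptions):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(table_id):
    # Stand-in for one REST call; sleep simulates roundtrip latency
    time.sleep(0.01)
    return table_id

ids = range(20)
# max_workers=2 mirrors "reduce the input to two partitions": at most two
# requests are in flight at any moment, so the target system is never flooded
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(fetch, ids))
print(len(results))  # 20
```

The tradeoff is the same as in the workflow: fewer concurrent requests means longer wall-clock time, but a responsive target instance.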