Spark: Auto-Caching of REST API calls

Hi there,
we often use ONE DATA API calls in our workflows. A typical workflow looks like this:

Now, in a normal workflow without API calls in between, Spark re-executes every branch each time a Spark action is triggered (e.g. by a Result Table or a Data Table Save).
So in my example, when I execute this workflow, the top-left branch with the API call is executed at least three times, because there are three result tables along that branch.
Does this mean that Spark triggers the same API call three times, or is there some kind of auto-caching functionality in the Flex API processor?

Thanks in advance
Christoph

Hey there!

I can confirm your suspicion: the processor caches the result of the API call. I just checked the code, and there are two different execution paths here:

  • Distributed API calls:
    Here, the calls to the API are performed inside a Spark job, and the resulting dataframe containing the responses is cached explicitly by the processor itself.
  • Non-distributed API calls (default):
    Here, the API calls are performed by the processor directly, without using Spark for parallelization. In this case, the result is a plain Java List of the response contents, which is then handed back to Spark for parallel computation along the subsequent processors.

In both cases, there is no redundant API call happening.
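To make that concrete, here is a minimal Scala sketch of the two paths (callApi, the URLs, and all other names are illustrative stand-ins, not the actual processor code):

```scala
import org.apache.spark.sql.SparkSession

object FlexApiPathsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("flex-api-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Stand-in for the actual HTTP request.
    def callApi(url: String): String = s"response for $url"

    val urls = Seq("https://example.org/a", "https://example.org/b")

    // Distributed path: the call runs inside a Spark job as a transformation,
    // and the result is cached explicitly so that later actions reuse it.
    val distributed = sc.parallelize(urls).map(callApi).cache()

    // Non-distributed path (default): the calls run in the processor itself,
    // yielding a plain list that is handed back to Spark afterwards.
    val responses = urls.map(callApi)
    val nonDistributed = sc.parallelize(responses)

    // Several downstream actions (think: three result tables) reuse the
    // cached / already-materialized data instead of re-issuing the calls.
    distributed.count(); distributed.collect()
    nonDistributed.count(); nonDistributed.collect()

    spark.stop()
  }
}
```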
If you have further questions, feel free to ask. :slight_smile:

Cheers
Flogge

Hey, thanks Flogge!
Regarding the non-distributed API I have another question, to make sure I understood what you are saying:
Does the Spark driver in this scenario collect the data from all the executors, send the API calls, and afterwards schedule new tasks for the executors?
Whereas in the first (distributed) scenario, the executors themselves send the API calls (each one a share of them all)?

Thanks for your quick answer :slight_smile:

Well, sort of. Except for the “schedule new tasks” part.
You can think of the non-distributed mode as an hourglass. It collects all the URLs from the processor’s input via a Spark action, namely a collect() (not necessarily to the driver, though, but rather via the driver), and performs the API calls sequentially within the processor’s execution thread. The responses are then “scheduled” for parallelization (see the Spark Programming Guide). That is not an immediate Spark action: the collection resides with the processor/driver until an action is triggered by a subsequent processor (e.g. a result table).
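As a sketch of that hourglass (continuing with sc and callApi from the sketch above; urlRdd stands in for the processor’s input):

```scala
// Non-distributed mode, step by step:
val urlRdd = sc.parallelize(Seq("https://example.org/a", "https://example.org/b"))

val collected = urlRdd.collect()         // Spark action: gather the URLs via the driver
val responses = collected.map(callApi)   // sequential calls in the processor's thread
val out = sc.parallelize(responses)      // "scheduled" for parallelization, still lazy

out.count()                              // only a subsequent action (e.g. a result table)
                                         // actually computes the downstream work
```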
In distributed mode, there is no Spark action triggered by the processor. Instead, the API call logic is added to the input RDD as a common transformation (and yes, it’s an RDD from here on, not a dataframe). This means that the API calls are only executed once a Spark action is triggered by a subsequent processor.
Additional info for distributed mode: when not all the data gets materialized by the action (e.g. a Result Table showing 400 rows while 500 different API calls enter the processor), chances are that not all calls are actually executed, but only as many as needed to produce those 400 rows (due to Spark’s lazy materialization). There is no way of controlling which calls are executed in this scenario. Therefore, in distributed mode, one has to make sure that all data gets materialized by an action whenever all calls should be executed. The most extreme example: simply leaving both output ports of the API processor unconnected will not trigger any request in distributed mode, whereas in non-distributed mode all API calls are executed regardless of what happens to the output.
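That partial execution is easy to observe with an accumulator counting the calls (again a sketch continuing from the session above; the exact numbers depend on the partitioning and on take()’s heuristics):

```scala
val calls = sc.longAccumulator("api-calls")

// Counted stand-in for the HTTP request. Note: accumulator updates inside
// transformations are not exactly-once under task retries; fine for a local demo.
def countedCall(url: String): String = {
  calls.add(1)
  s"response for $url"
}

// 500 URLs in 10 partitions; map() is a lazy transformation, nothing runs yet.
val urls500 = sc.parallelize(1 to 500, numSlices = 10)
  .map(i => s"https://example.org/item/$i")
val responses500 = urls500.map(countedCall)

println(calls.value)      // 0 -- no action yet, i.e. the "unconnected outputs" case

responses500.take(10)     // materializes only as many partitions as needed
println(calls.value)      // ~50, not 500: one partition's worth of calls

responses500.count()      // a full action finally runs every call (and, without
                          // cache(), even re-runs the first partition)
```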
