Can the inverted Filter Processor kill a Spark context?

Story
I use the inverted filter processor to select the disjunctive set of rows of two tables.
Our Spark Execution Context (SEC) went down while this workflow (WF) was executing.

Question
Was it my WF's fault that the SEC went down? I switched the SEC for now to be able to display tables. However, I am afraid to execute the WF again.
Any suggestions?

Yes, you can. But I do not know exactly why.

Hey Jonas,

this processor dates back to ancient times when there were no Query processors (and no fully fledged Spark SQL to begin with), so there was no ANTI JOIN available.
Admittedly, the implementation could have been better even back then. The reason it makes your SEC unstable is probably a large number of distinct filter values, which makes the subsequent query very inefficient.
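To illustrate what I mean (this is purely an assumption about how the old processor behaves, with made-up table and column names), the problematic pattern is roughly this: collecting all distinct values and inlining them into a literal filter, which blows up as the number of distinct values grows.

```python
# Hypothetical sketch only: assumes the old processor effectively materializes
# the distinct keys of the second table into a literal NOT IN filter.
# Table/column names (left_tbl, right_tbl, id) are made up for this example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Collect distinct values to the driver and inline them into the query text.
distinct_ids = [row.id for row in spark.table("right_tbl").select("id").distinct().collect()]
in_list = ", ".join(str(v) for v in distinct_ids)  # can become enormous

# A query with huge numbers of literals stresses the driver and the SQL parser,
# which would match the instability described above.
filtered = spark.sql(f"SELECT * FROM left_tbl WHERE id NOT IN ({in_list})")
```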
You should have a look at a (LEFT) ANTI JOIN using a Double Input Query processor. Broadcasting the right-hand-side table can improve performance, but be cautious: the number of distinct values in the join partner also affects the memory footprint and therefore stability (especially in local-mode SECs).
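For illustration, a minimal PySpark sketch of that approach (table and column names are made up; inside a Double Input Query processor the same idea would be plain Spark SQL):

```python
# Minimal sketch: keep only rows of the left table whose key does NOT appear
# in the right table, optionally broadcasting the right-hand side.
# left_tbl, right_tbl and the join column "id" are hypothetical names.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
left = spark.table("left_tbl")
right = spark.table("right_tbl")

# DataFrame API: LEFT ANTI JOIN with an explicit broadcast of the right side.
anti = left.join(broadcast(right), on="id", how="left_anti")

# Equivalent Spark SQL with a broadcast hint, e.g. inside a query processor:
anti_sql = spark.sql("""
    SELECT /*+ BROADCAST(r) */ l.*
    FROM left_tbl l
    LEFT ANTI JOIN right_tbl r ON l.id = r.id
""")
```

Keep in mind that broadcasting only pays off while the right-hand side comfortably fits into executor (and, in local mode, driver) memory.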

Just a minor addition to Flogge's answer: also be careful with broadcasting if the left dataset is highly fragmented. This usually does not occur in the middle or at the end of the workflow, but it can happen very close to the loading processor.
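As a rough sketch of what that can look like (reusing the made-up names from the example above; the partition threshold is arbitrary): a freshly loaded DataFrame can consist of very many small partitions, and reducing that count before the broadcast join is one way to keep things stable.

```python
# Hedged sketch: check how fragmented the freshly loaded left table is and
# merge partitions before the broadcast anti join. Names and the threshold
# (200) are illustrative assumptions, not recommendations from this thread.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
left = spark.table("left_tbl")    # hypothetical table right after loading
right = spark.table("right_tbl")

if left.rdd.getNumPartitions() > 200:   # arbitrary example threshold
    left = left.coalesce(200)           # merge partitions without a full shuffle

anti = left.join(broadcast(right), on="id", how="left_anti")
```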
