Dear Spark experts,
I would like to know whether the following is normal behavior or whether something is going wrong:
In one production line, we have a WF that loads a table with ~50 million rows, then does a LEFT JOIN with another, smaller table and saves the joined data (usually ~20 million rows); a rough sketch of the equivalent Spark code is included below the error message. However, sometimes after the JOIN the data partitions are very uneven: some partitions are substantial (400 MB) while others are very small (1 MB). The Datatable Save processor then fails with the following error:
message: "Task failed while writing rows."
innerError: Object
  message: "Java heap space"
  exceptionName: "OutOfMemoryError"
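For context, this is roughly what the WF does, expressed as Spark code. It is only my approximation of what the platform runs under the hood; the table names, the join key "id", and the per-partition row count check are made up for illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("join-skew-example").getOrCreate()

    val big   = spark.table("big_table")     // ~50 million rows
    val small = spark.table("small_table")   // the smaller lookup table

    // LEFT JOIN on an assumed key column; the real join key may differ
    val joined = big.join(small, Seq("id"), "left")

    // Print the row count per partition, which is how I noticed the skew
    // (some partitions ~400 MB, others ~1 MB)
    joined.rdd
      .mapPartitionsWithIndex { case (i, rows) => Iterator((i, rows.size)) }
      .collect()
      .foreach { case (i, n) => println(s"partition $i: $n rows") }

    // This is the step where the "Java heap space" OutOfMemoryError occurs
    joined.write.mode("overwrite").saveAsTable("joined_result")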
Then Spark restarts itself (or the whole instance restarts; I don't know exactly). In the Spark UI, I can see that Total Uptime corresponds to the time of the failure.
After this failure, the next WF in the PL does not run, even though we have "ignore failure" activated.
I would expect Spark to throw the above error and then continue with the next WF in the queue rather than stopping completely. Is this normal behavior or a bug?