Spark restarts (maybe whole instance) after OutOfMemoryError: Java heap space

Dear Spark experts,
I want to know whether the following is normal behavior or whether something is going wrong:

In one production line, we have a WF which loads a table with ~50 million rows, then does a LEFT JOIN with another, smaller table and saves the joined data (usually ~20 million rows). However, sometimes the data partitions after the JOIN are very uneven: some partitions are large (400 MB) and some are very small (1 MB). The Datatable save processor then fails with the following error:

message: "Task failed while writing rows."
    innerError: Object
        message: "Java heap space"
        exceptionName: "OutOfMemoryError"
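
For reference, the WF step corresponds roughly to the following sketch (simplified; the paths, the join key "id", and the table names are placeholders, not the actual processor configuration):

```scala
import org.apache.spark.sql.SparkSession

object JoinAndSaveSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("join-and-save-sketch")
      .getOrCreate()

    // Placeholder inputs; the real WF uses Datatable load/save processors.
    val large = spark.read.parquet("/data/large_table")  // ~50 million rows
    val small = spark.read.parquet("/data/small_table")  // smaller lookup table

    // LEFT JOIN on an assumed key column named "id"; result is usually ~20 million rows.
    val joined = large.join(small, Seq("id"), "left")

    // After the join the partitions can be very uneven (400 MB vs. 1 MB);
    // writing these skewed partitions is where the OutOfMemoryError appears.
    joined.write.mode("overwrite").parquet("/data/joined_output")
  }
}
```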

Then Spark restarts itself (or the whole instance, I don't know exactly). In the Spark UI I can see that the Total Uptime only counts from the time of the failure.

After this failure, the next WF in the PL does not run, even though we have "ignore failure" activated.

I would expect Spark to throw the above error and then continue with the next WF in the queue rather than stopping completely. Is this normal behavior or a bug?

How is this instance set up? Kubernetes, or Docker with the spark-context running within the onedata-server?

@adrian.berndl It is a docker-compose setup.

In that case (if no stand-alone spark-execution is configured), a crash of Spark will restart the onedata-server and interrupt any running schedules.
On onedata-server versions before 52.5.0 there is the additional problem that you will not see an error in the PL; you will just see unfinished and finished WF jobs (see this ticket for details). You would have to manually navigate to the failed WF job to see an error message stating "the job was in an unfinished state at application startup…"
If you want to prevent Spark from crashing the onedata-server, you need to have a DevOps engineer create a "stand-alone spark-execution context" with its own resources, or migrate the instance to Kubernetes.
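
Independent of the restart behavior, the OutOfMemoryError itself comes from the skewed partitions after the JOIN. A minimal sketch of two common mitigations (assuming Spark 3.x; the option names are standard Spark SQL settings, not onedata-specific configuration, and the paths, key column, and partition count are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object SkewMitigationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("skew-mitigation-sketch")
      // Adaptive Query Execution can split skewed join partitions automatically (Spark 3.x).
      .config("spark.sql.adaptive.enabled", "true")
      .config("spark.sql.adaptive.skewJoin.enabled", "true")
      .getOrCreate()

    val large = spark.read.parquet("/data/large_table")
    val small = spark.read.parquet("/data/small_table")

    val joined = large.join(small, Seq("id"), "left")

    // Alternatively, explicitly rebalance the data before writing so that no single
    // task has to buffer a 400 MB partition; 200 is a placeholder partition count.
    joined
      .repartition(200)
      .write.mode("overwrite")
      .parquet("/data/joined_output")
  }
}
```

Whether this helps depends on the Spark version of the instance and on how the Datatable save processor triggers the write, so treat it as a starting point rather than a guaranteed fix.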


Thanks for the detailed explanation, @adrian.berndl!