Why does Spark think this join is a cross join, and are there any best practices in ONE DATA for configuring 'spark.sql.crossJoin.enabled'?

After a ONE DATA update, a formerly working inner join between a table A (n rows) and a table B (1 row) fails with the error shown below.

Question 1: Why is this a cross join for Spark?

Edit: Is it because it is currently a 1-to-1 join? And even if Table A contained more rows, would the result of any inner/left join be equivalent to a cross join?

Table A currently contains a single row:

| role     | attrA | attrB |
|----------|-------|-------|
| template | 3     | 5     |

Table B always contains a single row (basically a couple of template columns that should be added to each row in Table A):

| role     | attrC | attrD |
|----------|-------|-------|
| template | 8     | 9     |

An inner join is then performed on column “role” using the inner join processor of ONE DATA, leading to the error.

Question 2: Are there any best practices regarding this config? I don’t know whether it was true before the ONE DATA update or whether it changed due to a Spark upgrade (perhaps?), but is it safe to set it to true, or are there downsides such as exploding joins crashing the instance?

The MNWE (minimal non-working example) workflow and error:

"AnalysisException: Detected implicit cartesian product for INNER join between logical plans Filter (isnotnull(role#103153) && (role#103153 = template)) +- LogicalRDD [role#103153, page_template#103154, layout_template#103155, table_element_json#103156, header_template#103157], false and Project [app_template#103004, study_id#103005, version#103006L, Operator_app_id#103007, Manager_app_id#103008, Viewer_app_id#103009] +- Relation[app_template#103004,study_id#103005,version#103006L,Operator_app_id#103007,Manager_app_id#103008,Viewer_app_id#103009,__ROW_ID__#103010] JDBCRelation(frt_acc95ccc_a506_422e_a7d4_717831699dff) [numPartitions=1] Join condition is missing or trivial. Either: use the CROSS JOIN syntax to allow cartesian products between these relations, or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true;"


You are most probably joining on a condition that structurally always evaluates to TRUE for every row.
The reason for this lies somewhere in your data. We can have a look together to find out why this safeguard was triggered (by comparing the parsed and the physical plans). It seems that role = template evaluates to 1=1 for some reason.
Usually, the table content is not considered here, so Spark “shouldn’t know” what’s inside the tables (if it is a CIT, things are different! Here Spark knows what’s coming…). Since you add the role column as a constant, it is known to have a certain value by nature (i.e. structurally).
The reason why implicit CROSS JOINs are disabled is that this is a safeguard feature (of Spark, not ONE DATA). Otherwise, JOIN operations with accidental tautologies as conditions can let your data volume (at least the intermediate one) grow by orders of magnitude.
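To illustrate with hypothetical tables (not your actual workflow): when both sides carry the same constant in the join key, the optimizer can prove the predicate is always TRUE and the join degenerates into a cartesian product.

```sql
-- Hypothetical illustration: "role" is a constant ('template') on both sides,
-- so the predicate a.role = b.role is structurally always TRUE.
SELECT *
FROM table_a a
INNER JOIN table_b b
  ON a.role = b.role;
-- Spark can prove this is equivalent to
--   SELECT * FROM table_a CROSS JOIN table_b;
-- so with n rows on the left and m rows on the right you get n * m rows.
```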

For your example I suspect the left-hand side (get default table element JSON template) to be a CIT in which template is always the same value you are adding in the “add column ‘role’” processor. Most probably, the CIT contains only one row, correct? In such a constellation, the JOIN condition can be reduced to a trivial TRUE (1=1).

By using a Double Input Query with an explicit CROSS JOIN, you should be able to do exactly what you want (if my assumptions above are correct); see the sketch below. Doing so might also make the constant role column superfluous. I’ve done quite a lot of this kind of thing recently if you’re looking for “inspiration”.
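A minimal sketch of such a query, assuming the Double Input Query processor exposes its inputs as inputTable1 (Table A) and inputTable2 (Table B) — adjust the names to your instance — and using the column names from your example:

```sql
-- Explicit CROSS JOIN: the cartesian product is declared intentionally,
-- so Spark's safeguard does not trigger, and no constant "role" column
-- is needed as a join key.
SELECT a.*,
       b.attrC,
       b.attrD
FROM inputTable1 a
CROSS JOIN inputTable2 b
```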
You could also add a dummy row with a different value for template to your left-hand side, removing the implicit cartesian nature of your join.
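For completeness, a sketch of that workaround, assuming a single-input query over Table A exposed as inputTable (column names and the dummy value are just placeholders from your example):

```sql
-- Appending one row whose "role" differs from 'template' makes the join
-- condition non-trivial again; the dummy row never matches and is simply
-- dropped by the inner join.
SELECT role, attrA, attrB FROM inputTable
UNION ALL
SELECT 'dummy' AS role, CAST(NULL AS INT) AS attrA, CAST(NULL AS INT) AS attrB
```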
Setting spark.sql.crossJoin.enabled=true is also an option, but be aware that this switches off the safeguard mentioned above.
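If you do go that route, note that in plain Spark SQL the flag can in principle be set per session rather than instance-wide (whether ONE DATA passes such a session-level SET through from a query processor is an assumption on my part):

```sql
-- Session-scoped: only this Spark session accepts implicit cartesian
-- products; the instance-wide default stays untouched.
SET spark.sql.crossJoin.enabled=true;
```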