Recursive Queries

Hi to all ONE DATA/SPARK SQL-Experts! Has anyone of you ever encountered a situation in which you needed to write a recursive SQL statement? Specific use case: I have a hierarchy table, which contains a lot of connections between products and the components that they are made of. Twist: products can be composed of components which can be composed of components which … you get it, right? The table looks essentially like this:

product_id | part_id
A | B
B | C
B | D
D | E

Now, I want to create a table which resolves the hierarchy to its outer level such that all the intermediary steps are eliminated and only the outer shell of that hierarchy remains:
product_id | part_id
A | C
A | E

According to stackoverflow, this is usually solved with a recursive CTE but also according to stackoverflow it is not possible to write recursive queries in Spark SQL. Any smart workarounds/ solutions with SPARK / ONE DATA?

Hi, I encountered a similar use case when processing BoMs to resolve a hierarchical list of components. The query is not really recursive up to N-levels, but you need to specify one LEFT JOIN for every level you need to resolve:

In our case, we also had a base unit and a quantity, which we take from the deepest level, but I guess you may discard that part for your purpose.


Hierarchy: VERP -> HALB -> HALB -> FERT

--level1 = FERT (Fertigprodukt)
--level2 = HALB (Zwischenprodukt)
--level3 = HALB (Zwischenprodukt)
--level4 = VERP (Verpackung -> Baseprodukt)

--module = parent material description
--product = parent material id
SELECT a.parent_material_description,  a.quantity_level1, a.parent_material_id, a.material_id_level1, 
COALESCE(d.quantity_level4, c.quantity_level3, b.quantity_level2, a.quantity_level1 ) as quantity,
COALESCE(d.material_id_level4, c.material_id_level3, b.material_id_level2, a.material_id_level1 ) as material_id,
COALESCE(d.material_base_unit_level4, c.material_base_unit_level3, b.material_base_unit_level2, a.material_base_unit_level1 ) as material_base_unit,
b.material_id_level2, b.quantity_level2,
c.material_id_level3, c.quantity_level3,
d.material_id_level4, d.quantity_level4


FROM 

(SELECT parent_material_description ,quantity as quantity_level1, parent_material_id,  i.material_id as material_id_level1, material_base_unit as material_base_unit_level1 from  inputTable i WHERE process_step = "FERT") a

LEFT JOIN 

(SELECT quantity as quantity_level2,  i.parent_material_id  as parent_material_id_level2, material_id as material_id_level2,material_base_unit as  material_base_unit_level2  from inputTable i WHERE process_step = "HALB") b
ON a.material_id_level1 = b.parent_material_id_level2

LEFT JOIN 

(SELECT quantity as quantity_level3,  i.parent_material_id  as parent_material_id_level3, material_id as material_id_level3 , material_base_unit as  material_base_unit_level3 from inputTable i WHERE  process_step = "HALB") c
ON b.material_id_level2 = c.parent_material_id_level3 

LEFT JOIN 

(SELECT quantity as quantity_level4,  i.parent_material_id  as parent_material_id_level4, material_id as material_id_level4 , material_base_unit as  material_base_unit_level4 from inputTable i WHERE  process_step = "VERP") d
ON c.material_id_level3 = d.parent_material_id_level4

Credits to @daniel.kempter for the initial version of the query!! :smiley:

1 Like

Thanks @DanyEle for your reply. Did you always know up front how deep your hierarchy was? Additional complication at our side is that we do not know, how many levels there are for one final product. Could be 5, could be 10 or just 1. Also the hierarchy is not equally deep at all branches…

In our case, we know how many levels the hierarchy can have at most, but not all materials share the same hierarchy. Also for us, some materials have just 2 or 3 levels and these materials would still be resolved correctly via the COALESCE statements.

Nevertheless, the logic is the following in setting up the query:

#Number of LEFT JOINs = Deepest level in the hierarchy

Ok great! One last question: Which reasons were against a solution via Python-Processors in your case?

That SQL is just more efficient and scalable than Python in ONE DATA when processing big data. I recall we had a solution in May 2021 with Python, but it was very cumbersome and slow. The same processing could be done with this SQL code much more quickly.

1 Like