Spark: Expansion of rows for count(distinct X)

Dear Spark experts!
In one of our projects we faced the challenge of computing several KPIs by using the count(distinct X) operation.
In one of the examples our input has roughly 230 million rows. We then calculate many different KPIs, most of them have the following pattern:

COUNT(distinct CASE WHEN trx_date >= ADD_MONTHS(X, -3) THEN trx_amount END)

When looking at the Spark execution plan we found out, that when computing these KPIs Spark is doing an EXPAND operation on the data, blowing it up to more than 2 billion rows.

Has somebody experience with these kind of operations and has a clue of why Spark needs to do such an expansion?

Thank you in advance,
best Christoph

Hi, I can’t tell you easily why that happens but did you try approx_count_distinct as a more performant alternative?

1 Like

Since there is no way to measure the
Query execution time on OD SQL editor.
I guess You have to try the following on Your own :confused: :

A) Your code
B) (or similar)

    COUNT(k.my_trx_date ) as count_trx_date
      IF(i.trx_date >= ADD_MONTHS(X, -3), i.trx_date, null) as my_trx_date 
    FROM inputTable i
    GROUP BY my_trx_date
) as k

Group by and then count the values within the group.

@peter.dahlberg Thanks for the tip, unfortunately we cannot use approx_count_distinct due to the high requirements for our KPIs.

@jonas.sperling Thanks for your answer, we came up with a similar solution to your proposal. :+1:

Now, we are interested in the reason why this expansion of the data is happening when doing a count(distinct ). :slight_smile:

What we see in your screenshot is the executed physical plan. Can you do a fast debug run somewhere below your query (if you’re not using a WF, simply recreate a WF doing the exact same thing) and provide me with the Plans from the Spark debug result? They can also be obtained from the SparkUI when the whole query is executed in one single Job.
According to Mastering Spark SQL, the Expand might stem from an implicitly resolved TimeWindow in turn stemming from your T1 >= T2 statement.

BTW: One thing that makes me suspicious here is the missing ELSE statement in your CASE WHEN.