I would like to repartition one of my data tables. To do that, I would load the data table and then use the “Manipulate Partitions” Processor. I am not sure whether I then need to connect the “Manipulate Partitions” Processor to a “Data Table Save” Processor and overwrite the data table to apply the new partitioning to the DT, or does Manipulate Partitions already partition the data table, so there is no need to save it again?
Thanks in advance.
Hey Tim,
The Manipulate Partitions Processor is used to coalesce/repartition data during execution. This means that the arrangement of data across the workers is controlled with this feature. Partitioning your data with this Processor can bring several benefits at runtime of your workflow: it is mainly used to reduce fragmentation of the data into too many small pieces and to get rid of skewed partitions. When you manipulate partitions with this Processor and then save via the Save Processor, Spark will create one file per partition in the same folder. So yes, your assumption that a re-save is needed is correct.
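Under the hood, this corresponds to Spark’s repartition/coalesce operations. Here is a minimal PySpark sketch of the idea; the paths and partition counts are made up for illustration and are not what the Processor does internally:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-demo").getOrCreate()

# Stand-in for the loaded data table (path is hypothetical).
df = spark.read.parquet("/data/my_table")

# Full shuffle into 8 evenly sized partitions, e.g. to resolve skew.
evened = df.repartition(8)

# Merge many small partitions into 4 without a full shuffle,
# e.g. to reduce fragmentation into too many small pieces.
compacted = df.coalesce(4)

# Saving afterwards writes one file per partition into the target folder,
# which is the behavior described above.
compacted.write.mode("overwrite").parquet("/data/my_table_compacted")
```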
However, if you want to partition your data on save (in a physical way, i.e. as folders based on partition values; see the Spark API documentation) to optimize later filter and load performance, this is not the Processor you’re looking for.
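For reference, that physical partitioning is Spark’s partitionBy on the DataFrame writer. A minimal sketch, with a made-up column name and paths:

```python
# Physical partitioning on save: Spark creates one subfolder per distinct
# value of the partition column ("country" is a hypothetical example column).
df.write.partitionBy("country").mode("overwrite").parquet("/data/by_country")

# Later loads benefit from partition pruning: this filter only reads the
# matching subfolder instead of scanning the whole data set.
spark.read.parquet("/data/by_country").filter("country = 'DE'").show()
```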
We’re currently working on an extension of the Save Processor to incorporate this Spark feature as well. Keep an eye on the release notes to see when it becomes available.
Until then, Manipulate Partitions mainly improves performance during execution of the workflow it is placed in. The only case where subsequent steps benefit from this Processor is when too many small or skewed partitions produced by the preceding workflow are resolved by adding it right before the Save Processor. Even then the effect will most probably be small, since Spark already applies some on-load optimizations that we use to avoid reading too many small partitions.
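If you want to check whether such a compaction step is worth it, you can inspect the partition layout directly in Spark. A small sketch using standard PySpark calls, nothing Processor-specific:

```python
# Number of partitions the DataFrame currently has.
print(df.rdd.getNumPartitions())

# Rows per partition as a rough skew indicator (triggers a full pass
# over the data): very uneven counts suggest skewed partitions.
rows_per_partition = (
    df.rdd
      .mapPartitions(lambda rows: [sum(1 for _ in rows)])
      .collect()
)
print(rows_per_partition)
```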
Hope that helps
Flogge