I have the following workflow setup: in the first Extended Mathematical Operation, I access the current datetime with the SQL function current_timestamp(). The result is written as a column to the two tables connected to it.
Unfortunately, the timestamps in the two tables differ by a few seconds. I think this is because Spark executes the current_timestamp() function twice, once for each branch of the tree. So how should I fix this?
Is this a good use case for a caching processor? Or is there a better way to deal with this?
Thanks a lot,
If they should be the same, why do you set the timestamp again after only transforming the data (reducing the columns/rows you write into the meta information table)? Couldn’t you just reuse the first timestamp?
Or is the second EMO processor not adding a timestamp at all?
Even if you set the timestamp only in the upper left Extended Mathematical Operation processor, the values can differ slightly, because the Spark optimization may decide to execute the processor twice (once for the left output, once for the right one).
I think you must use caching in order to get identical timestamps in both tables. I guess your data is relatively small (one row?), so that should be no problem. If your data is huge (maybe > 100k rows), you might create the timestamp in a separate branch, cache it, and join it to your final data afterwards.
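To make the effect described above concrete, here is a minimal plain-Python analogy. It is not Spark or platform code: `fake_now` and `upstream` are hypothetical names, and a counter stands in for the clock so the behaviour is deterministic. It only models the idea that an uncached upstream step is evaluated once per output branch, while a cached result is evaluated once and shared.

```python
import itertools

# Hypothetical stand-in for a clock: every call returns the next tick,
# so two separate evaluations are guaranteed to see different "times".
_ticks = itertools.count(1)


def fake_now():
    return next(_ticks)


def upstream(rows):
    """Models the first EMO step: adds a timestamp column when evaluated."""
    ts = fake_now()  # one value per evaluation of this step
    return [{**row, "ts": ts} for row in rows]


rows = [{"id": 1}, {"id": 2}]

# Without caching: each output branch triggers its own evaluation of upstream.
table_a = upstream(rows)
table_b = [{"id": r["id"], "ts": r["ts"]} for r in upstream(rows)]
uncached_match = table_a[0]["ts"] == table_b[0]["ts"]

# With caching: upstream is evaluated once and both branches reuse the result.
cached = upstream(rows)
table_c = list(cached)
table_d = [{"ts": r["ts"]} for r in cached]
cached_match = table_c[0]["ts"] == table_d[0]["ts"]

print(uncached_match, cached_match)
```

In this model the uncached branches disagree and the cached ones agree, which mirrors what a caching processor placed directly after the timestamp step is expected to achieve.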
Thanks for your response. Actually, that’s what I thought I was doing here. The second Extended Mathematical Operation processor does nothing of interest here, so I take the table from the first EMO processor and transform it. I am not sure whether this is the explanation, but I think Spark executes the first branch, generates the timestamp, and saves the first table. For the second table, it generates the timestamp again a few seconds later, transforms the result in the second branch of the tree, and saves it to the second table.
So at the moment I can’t imagine a setup that will give me identical timestamps in both tables when using the current_timestamp() function.
Interesting, kai! Thanks for the insight.
I will go ahead with caching in a separate branch then.
Out of curiosity: the current_timestamp() in Extended Math will be executed for each row, correct? This would mean that for large datasets, the timestamp might still differ between rows.
If it’s important to have exactly the same timestamp for all rows, the alternative would be to create a single-row dataset and cross-join the timestamp to all rows.
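The single-row cross-join idea can be sketched in plain Python as follows. This is only an illustration under assumptions: `cross_join`, `run_ts`, `data`, and `meta` are hypothetical names, and in Spark the single-row dataset would presumably still have to be cached so that it is not re-evaluated per branch.

```python
from datetime import datetime, timezone
from itertools import product


def cross_join(left, right):
    """Pairs every row of left with every row of right (Cartesian product)."""
    return [{**l, **r} for l, r in product(left, right)]


# Single-row dataset holding the timestamp, created exactly once.
run_ts = [{"ts": datetime.now(timezone.utc)}]

# Hypothetical main data and a one-row meta information table.
data = [{"id": i} for i in range(5)]
meta = [{"name": "row_count", "value": len(data)}]

# Both outputs take their timestamp from the same single row.
data_out = cross_join(data, run_ts)
meta_out = cross_join(meta, run_ts)

distinct_ts = {row["ts"] for row in data_out + meta_out}
print(len(data_out), len(meta_out), len(distinct_ts))
```

Because the right-hand side has exactly one row, the cross join keeps the row count of each table unchanged and only adds the shared `ts` column.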
@adrian.berndl Are you sure that the processor executes it for each row? I currently have a dataset with ~10^6 rows, and the timestamps in the dataset do not differ, even at the millisecond level. Also, from the docs on current_timestamp():
current_timestamp() - Returns the current timestamp at the start of query evaluation. All calls of current_timestamp within the same query return the same value.
So it is basically a question of how the EMO processor resolves the function call internally.
No, I’m not. And the documentation you referenced shows I was wrong.