Fill up timeseries data

Hi all,

Is there a processor that is able to fill NULL values with other values in the same column?

The concrete use case is:
I have a table that contains machine data with measurements of different temperature sensors. Every ten miliseconds the machine writes a value for temperature 1, ten milisecons later a value for temperature 2, ten miliseconds later for temperature 3 and so on. And after lets say 10 seconds it starts again with temperature 1.
The table we have therefore looks like this (the empty fields are NULL values in our case):

I would like to fill up the values and make it look like this (so the value measured for temp. 1 should be valid for the other timestamps as well)

Is there a processor to do that?
Currently we are using a python processor and a function of pandas (pandas.DataFrame.fillna — pandas 1.3.0 documentation) to make it possible, but this could potentially lead to issues once we increase the amount of data being loaded. We will have >1mio data points quickly.
Thanks

There is an easy solution using SparkSQL, that almost achieves what you need. Depending on the use case, the result might still be precise enough.

  • Create an index that spans the period of the measurement of all sensors (in you example this index is the same for every four lines)
  • Group by that index
  • Join the result to the initial index to make your table big again

Depending on how dirty your data is, you might put some more logic in the creation of the index or the grouping…

Using a window function that populates NULL with last/next non null value could help you, like:
LAST(value, TRUE) OVER(PARTITION BY group ORDER BY timestamp ROWS UNBOUNDED PRECEDING)
for

here is an example with data (and FIRST instead of LAST):