Datatable memory and number of partitions

Dear all,

  1. Can we know how much memory a data table in OD occupies?
  2. Can we see how many partitions a data table (in parquet format) has? Preferably via REST API calls or some SQL tricks!

What I want to achieve: one of our customers is interested in seeing all this information in the Apps as a KPI. As far as I know, all of the above can be done by guardians manually logging into a server, but I am wondering whether there are other ways to do it.

Thanks in advance!

Create a new workflow.

Add a “Data Table Load” processor to the workflow to load the data table you would like to inspect.

Click on “Save & Debug” to run the Full Debug Mode, then open up the debugger tab of the Data Table Load processor.

  1. The memory taken up by a data table (in bytes) should be in the “SPARK_DEBUG_INFO” → processorMetrics → inputBytesRead field.

  2. You can see the number of partitions of a data table in the “SPARK_DEBUG_INFO” → rddDependences → partitions field.

No idea how to get this information via the REST API though!

Thanks @DanyEle, but I found that the information obtained from Save & Debug is incorrect. I did several tests. For example:
In SPARK_DEBUG_INFO it says 32 partitions. However, when I checked the corresponding data table on the onedata-server directly, I found that it has 1000 partitions (this table was actually created with the Manipulate Partitions processor).

The memory is not correct either: the Spark debug info says around 4 MB, while the parquet file on the OD server is actually around 45 MB.

I am not sure whether Spark considers all the data in debug mode; maybe it only takes some partitions for debugging purposes.

Maybe @Flogge has some more insights about it!

Hey there!
I did not notice the mention of my name. So sorry for being late to the party.

On the topic:
Partitions on storage do not necessarily correspond to partitions at read time. Spark has a built-in optimization that - when activated, which is the default - scans the physical files (let’s call them storage partitions) and estimates the best (runtime) partitioning to get roughly 64 MB sized partitions and/or to reach the desired level of parallelism. That means when you have 1000 very small storage partitions, Spark will try to combine them into fewer but bigger partitions until it reaches the desired parallelism or partitions of 64/128 MB on average (depending on the config). This does not mean that too many partitions cannot harm performance: the on-load automatic repartitioning requires all storage partitions to be read individually (twice!), and this can hurt. However, after loading, the partition count is drastically decreased, making the subsequent WF execution faster.
What you see in the debugger is the runtime partitioning.
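
To make this difference visible outside of the ONE DATA debugger, here is a minimal standalone PySpark sketch (local session and temporary path are purely illustrative, not ONE DATA internals) that writes 1000 tiny storage partitions and shows how few runtime partitions Spark reads them back into:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Write a small data table split into 1000 very small storage partitions.
path = "/tmp/partition_demo"   # illustrative path
spark.range(0, 100_000).repartition(1000).write.mode("overwrite").parquet(path)

# The on-load coalescing is controlled by these settings (defaults: 128 MB / 4 MB).
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))
print(spark.conf.get("spark.sql.files.openCostInBytes"))

# Reading the table back yields far fewer runtime partitions than the 1000 files on disk.
df = spark.read.parquet(path)
print("runtime partitions:", df.rdd.getNumPartitions())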

On how to get the number of storage partitions: There is no way to do this using Spark(SQL) that I know of, since at the time of loading, the data will already have the runtime partitioning applied. A way to get it programmatically as a developer would be to scan the folder(s) for the number of files (there is one parquet file and one checksum file per storage partition).
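
As a rough sketch of that folder scan in Python (the storage path below is purely hypothetical and stands in for wherever the instance actually keeps its parquet data):

from pathlib import Path

dt_path = Path("/data/onedata/<datatable-id>")   # hypothetical storage location

# One part-*.parquet file per storage partition; checksum and _SUCCESS files are not counted.
parquet_files = list(dt_path.glob("part-*.parquet"))

print("storage partitions:", len(parquet_files))
print("size on disk (bytes):", sum(f.stat().st_size for f in parquet_files))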

On the memory displayed directly at the load processor:
Since Spark here only materializes the first 100 rows and does a count, it will probably not read the whole physical data. The row count stems from the parquet metadata, which is stored separately in the file, so there is no need to unpack and count everything. Therefore fewer bytes are read than the actual size. Also note that the size on disk is compressed: when the data is unpacked during WF execution and fully materialized, the runtime size can be much higher.
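
To illustrate why the row count does not require reading the data, here is a small pyarrow sketch (the file path is a placeholder) that pulls the count straight from the parquet footer:

import pyarrow.parquet as pq

# Placeholder path to one parquet file of the data table.
md = pq.ParquetFile("/data/onedata/<datatable-id>/part-00000.parquet").metadata

# The row count comes from the footer metadata; no column data is decompressed.
print("rows:", md.num_rows)
print("row groups:", md.num_row_groups)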

How to get the size on disk: Again, there needs to be an additional implementation.
How to get the size in memory: Either ensure a full materialization of the loaded data and check the debug info, or cache it and check the Spark UI if available. Since Spark uses an encoded/compressed representation during query execution, the size on disk and the size in memory often do not match.
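
A minimal caching sketch along those lines, again in plain PySpark outside of ONE DATA (session and path are illustrative):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.read.parquet("/tmp/partition_demo")   # illustrative path

df.persist(StorageLevel.MEMORY_ONLY)
df.count()   # forces full materialization into the cache

# The cached in-memory size now shows up in the "Storage" tab of the Spark UI
# and typically differs from the compressed size on disk.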

The short answer to your question unfortunately is: we need a dedicated feature to get this in a convenient way.
There might be workarounds using Python and accessing the raw storage directly to get the size and file count, but this would require giving the Python runtime access to the underlying data storage, which in turn is a security issue and cannot be done on any secured instance.


Heureka!

I’ve found a way to get the storage partitioning and the physical size on storage via Query.
Note that this only works for physical files (not for DB-based data, not for VDTs)!
Directly after the DT Load, add this query:

SELECT DISTINCT 
              input_file_name() as file, 
              input_file_block_length() as length_of_block, 
              input_file_block_start() as block_start 
FROM inputTable i

Explanation:

  • file will hold the physical file name on the attached storage.
  • length_of_block is the size of the file on disk (in bytes; note that the content usually is compressed and will take way more space when extracted into memory at runtime!)
  • block_start (input_file_block_start) is usually 0 for our stored data.

With a COUNT(DISTINCT file) you will get the number of storage partitions.
With a SUM(length_of_block) over those distinct rows you will get the total size of all files combined.
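
For reference, the two aggregations as a small PySpark snippet on top of that query (this assumes a SparkSession where the loaded data table is available as the view inputTable, as the Query processor sees it; outside of ONE DATA you would register that view yourself):

from pyspark.sql import functions as F

# Same query as above: one row per (file, block).
file_info = spark.sql("""
    SELECT DISTINCT
        input_file_name()         AS file,
        input_file_block_length() AS length_of_block,
        input_file_block_start()  AS block_start
    FROM inputTable
""")

file_info.agg(
    F.countDistinct("file").alias("storage_partitions"),
    F.sum("length_of_block").alias("total_bytes_on_disk"),
).show()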

Note that for storages like HDFS, the block information will be important: there can be more than one block in the same file, and a block will most probably be treated as a partition on load. The length and start attributes then give you the “from” and “to” positions within the file. What I want to say is: have a look at the data for larger data tables and on non-local file systems. There might be more to the story than what I described.

If we want this information available in an app, we need a WF in the respective project that contains the DTs to be scanned. With a “Scanner WF” that loads a variable data table by ID, and a “Caller WF” that gets the whole list of data tables in the project and feeds them into the Scanner WF one by one via API call, you can automate this for a whole project.
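
Purely as an illustration of the Caller side in Python: every URL, endpoint and payload field below is a placeholder, since the actual ONE DATA REST API routes are not spelled out here.

import requests

BASE = "https://onedata.example.com/api"        # placeholder base URL
HEADERS = {"Authorization": "Bearer <token>"}   # placeholder authentication

# 1) Fetch all data tables of the project (placeholder endpoint).
tables = requests.get(f"{BASE}/projects/<project-id>/datatables", headers=HEADERS).json()

# 2) Trigger the Scanner WF once per data table, handing over the DT ID as a
#    workflow variable (placeholder endpoint and payload shape).
for dt in tables:
    requests.post(
        f"{BASE}/workflows/<scanner-wf-id>/executions",
        headers=HEADERS,
        json={"variables": {"DATA_TABLE_ID": dt["id"]}},
    )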


Hi guys,

the A-Team is currently building a new feature in ONE DATA that allows you to see the different partitions of a Data Table in the Data Table overview.

We will display the following information about the Data Table:

  • Total size of the Data Table on the filesystem
  • Number of partitions
  • Storage location (the path of the partition)
  • A list of the different partitions, showing the following info for every partition:
    • Partition name, e.g. “Partition 0”
    • Size of the partition
    • Partition name (as it appears in the path)

The feature will be delivered with the Q4 release.

In case you have further questions, please contact @Aleksandra from the A-Team.
