Hi, I have read the documentation about the “Get Schema” debug function, but to be honest I didn’t really understand how it works: Debug Mode - General Explanation : ONE DATA Service Center
Is it supposed to somehow provide information about the schema of the new table, or what should it be used for?
To understand "Get Schema", you need to know a bit about how workflows are built in the client (i.e., connecting processors with lines).
When you build a workflow and, for example, use a "Column Selection" processor after a "Data Table Load" processor, the columns to select are offered in a dropdown.
This information ('which columns are available in the data flow at this position') has to be known by the ONE DATA client (i.e., the website you are using), and it is collected in different ways. For many processors this is straightforward (for example, the output schema of a "Column Selection" processor is simply the list of selected columns).
But there are a couple of processors that let the user write plain text instead of selecting something from a dropdown (for example the "Query" processor or a Python Script processor). For these processors, the schema is only known after an execution of the workflow, which shows the client the columns output by such a processor.
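To make the distinction concrete, here is a minimal sketch in plain Python. These classes are hypothetical and not part of the ONE DATA API; they only model why a "Column Selection" processor's output schema is known from its configuration alone, while a script-based processor's schema is unknown until the script has actually run:

```python
# Hypothetical sketch, not ONE DATA code: models static vs.
# execution-dependent output schemas.

class ColumnSelection:
    """Output schema is known from the configuration alone."""
    def __init__(self, selected_columns):
        self.selected_columns = selected_columns

    def output_schema(self, input_schema):
        # The output schema is simply the configured column list,
        # restricted to columns that exist in the input.
        return [c for c in input_schema if c in self.selected_columns]

class ScriptProcessor:
    """Output schema depends on user-written code."""
    def __init__(self, script):
        self.script = script  # arbitrary user code producing rows

    def output_schema(self, input_schema):
        return None  # unknown without executing the script

    def execute(self, rows):
        result = self.script(rows)
        # Only after execution can the client learn the output columns.
        return sorted(result[0].keys()) if result else []

selection = ColumnSelection(["name", "age"])
print(selection.output_schema(["id", "name", "age", "city"]))  # ['name', 'age']

script = ScriptProcessor(lambda rows: [{"total": sum(r["age"] for r in rows)}])
print(script.output_schema(["id", "name", "age", "city"]))  # None: unknown
print(script.execute([{"age": 30}, {"age": 12}]))  # ['total'], known only now
```

"Get Schema" exists precisely to fill in the `None` case for the second kind of processor without running the whole workflow.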
The "Get Schema" option provides a way to determine the schema of such processors before the full workflow is executed (for example, while building the workflow, so that a "Column Selection" processor can be used after a Query processor).
Another related topic: a Python processor has a configuration called "Manual TI", short for "Manual Type Inference". It was (and sometimes still is) used to tell the client about the output columns of a Python processor in case you need to use them in another processor later on.
@christoph.schober's answer sums it up very well. Nothing to add, except one technical detail that might be nice to know: sometimes, parts of the workflow need to be executed "for real" to determine the resulting schema. Sometimes the schema computation takes only a split second, sometimes longer. The reason is column/schema information that depends on actual data "manifestation", e.g. column names that depend on the values inside cells (as in the Transposition processor), or schemas that depend on external logic like Python scripts or API calls.
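A tiny illustration of the transposition case (a hypothetical sketch, not ONE DATA code): when a table is transposed, cell values become column names, so the output schema literally cannot be known without looking at the data itself.

```python
# Hypothetical sketch (not ONE DATA internals): transposing turns cell
# values into column names, so the output schema depends on the data.

def transpose(rows, key_column, value_column):
    """Turn the values of `key_column` into the columns of one output row."""
    return {row[key_column]: row[value_column] for row in rows}

data = [
    {"metric": "revenue", "value": 100},
    {"metric": "costs", "value": 60},
]
result = transpose(data, "metric", "value")
print(sorted(result))  # ['costs', 'revenue'] -- names came from the data
```

With different rows (say a "profit" metric), the very same processor configuration would yield a different output schema, which is why such schema computations may have to touch real data.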
In most cases, and especially for query processors, we don't have to compute any data at all but can rely on Spark to derive the schema for us without any data manifestation. This makes "Get Schema" debugging the most efficient way of obtaining this metadata, compared to Fast/Full Debug or regular executions, which also compute the data.
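The idea of learning a query's output columns without materializing rows is not Spark-specific. As a rough stdlib analogy (this uses Python's `sqlite3`, not ONE DATA or Spark internals), a `LIMIT 0` query yields column metadata while returning no data:

```python
import sqlite3

# Rough analogy only: like "Get Schema", this learns the output columns
# of a query without computing or returning any rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL)")

cursor = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region LIMIT 0"
)
schema = [desc[0] for desc in cursor.description]
print(schema)             # ['region', 'total'] -- known without any data
print(cursor.fetchall())  # [] -- no rows were materialized
```

Spark goes further and resolves the schema from the query plan alone, but the principle is the same: the metadata is much cheaper than the data.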