Hi, I was reading the documentation for “New Logics” about the architecture of ONE DATA based on Apache Spark, and I was just wondering if ONE DATA can run operations on GPUs?
I reckon that could help with data-parallel computations on large datasets, and may speed up operations involving big matrix computations (as required by neural networks).
There seems to be support for GPUs starting from Apache Spark 3: Deep Dive into GPU Support in Apache Spark 3.x - Databricks
Short answer:
With the current version of ONE DATA, no operations can be run on GPUs.
Long answer:
The current version of ONE DATA is based on Apache Hadoop 2. GPU support was introduced with Apache Hadoop 3 and Apache Spark 3.
The latest version of Apache Spark (3.1.1) supports processing on GPUs with some caveats. Spark-rapids is a set of plugins (developed in cooperation with NVIDIA RAPIDS) that aims to go beyond those caveats and optimize the usage of GPUs when working with Spark.
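For reference, this is roughly what requesting GPUs looks like on a plain Spark 3 cluster with the RAPIDS plugin. It is only a sketch and does not apply to the current ONE DATA version: the helper names, the default values and the discovery script path are assumptions made for illustration, while the configuration keys are the ones documented for Spark 3 resource scheduling and the RAPIDS Accelerator.

```python
def gpu_spark_conf(gpus_per_executor=1, tasks_per_gpu=4,
                   discovery_script="/opt/spark/getGpusResources.sh"):
    """Build Spark 3 settings that request GPUs and enable the RAPIDS SQL plugin.

    The default values and the script path are placeholders, not recommendations.
    """
    return {
        # Load the RAPIDS Accelerator plugin and switch on GPU SQL execution.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.rapids.sql.enabled": "true",
        # Spark 3 resource scheduling: GPUs per executor and GPU share per task.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        "spark.task.resource.gpu.amount": str(1 / tasks_per_gpu),
        # Script that tells Spark which GPU addresses an executor can see.
        "spark.executor.resource.gpu.discoveryScript": discovery_script,
    }


def as_submit_args(conf):
    """Render the settings as a flat list of spark-submit '--conf key=value' arguments."""
    return [arg for key, value in sorted(conf.items())
            for arg in ("--conf", f"{key}={value}")]
```

The resulting list can be appended to a spark-submit call; the RAPIDS plugin jar still has to be on the classpath of the cluster.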
While @anon1396911 is right concerning the Spark part of ONE DATA, there might be a possibility to offload GPU-based computing to a Python runtime.
One of the strengths of ONE DATA is its interoperability with other frameworks. By using a Python processor, you can use any GPU-based parallelization library. Note that to achieve this, GPU support must be available inside the Python execution environment and the respective library must be installed there. Also, when leaving the realm of ONE DATA Core and its Spark-managed computing, memory management and other resource restrictions have to be handled by the Python library or by the glue code you provide. When in doubt, your friendly neighborhood DevOps can help.
Also note that moving data from a Spark-based computation to a Python environment incurs some transmission overhead. You should consider this option only if the task at hand is computationally intensive and benefits from GPU support. Neural networks are a good example where this will probably pay off.
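As a rough illustration of the Python side, the sketch below tries to run a matrix multiplication on the GPU and falls back to the CPU if no GPU library or device is available. CuPy is just one possible choice of library and an assumption on my part, as is the function name matmul; how the data gets in and out of a ONE DATA Python processor is not shown here.

```python
import numpy as np

try:
    # CuPy is an optional GPU array library; it must be installed in the
    # Python execution environment together with a working CUDA setup.
    import cupy as cp
    GPU_AVAILABLE = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    # Covers both a missing library and a missing or unusable GPU driver.
    cp = None
    GPU_AVAILABLE = False


def matmul(a, b):
    """Multiply two matrices on the GPU if possible, otherwise on the CPU.

    Always returns a NumPy array, so the caller does not need to know
    which device did the work.
    """
    if GPU_AVAILABLE:
        # Copy inputs to device memory, compute there, copy the result back.
        result = cp.asarray(a) @ cp.asarray(b)
        return cp.asnumpy(result)
    return np.asarray(a) @ np.asarray(b)
```

The copies to and from device memory come on top of the Spark-to-Python transfer mentioned above, so this only pays off for sufficiently heavy computations.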