Important changes in 2.3:
Version 2.3 is a lightweight image that contains only core components, reducing exposure to Common Vulnerabilities and Exposures (CVEs). For higher security compliance requirements, use image version 2.3 or later when creating a Dataproc cluster.
If you choose to install optional components when creating a Dataproc cluster with a 2.3 image, they are downloaded and installed during cluster creation. This might increase the cluster startup time. To avoid this delay, you can create a custom image with the optional components pre-installed by running generate_custom_image.py with the --optional-components flag.
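The following Python sketch shows one way to assemble a generate_custom_image.py command line that pre-installs optional components. Only the script name and the --optional-components flag come from this page. The other flags are assumed from the custom image tool's usual invocation, the comma-separated upper-case component names are an assumed value format, and every value shown is a placeholder, so check the tool's own help output before relying on it.

```python
import shlex


def build_custom_image_command(image_name, dataproc_version, customization_script,
                               zone, gcs_bucket, optional_components):
    """Return the argument list for a generate_custom_image.py run.

    All flags except --optional-components are assumptions about the tool;
    this function only builds the command and does not run it.
    """
    return [
        "python", "generate_custom_image.py",
        "--image-name", image_name,
        "--dataproc-version", dataproc_version,
        "--customization-script", customization_script,
        "--zone", zone,
        "--gcs-bucket", gcs_bucket,
        # Assumed value format: comma-separated component names.
        "--optional-components", ",".join(optional_components),
    ]


# Placeholder values only; replace each one with a value for your project.
argv = build_custom_image_command(
    image_name="my-custom-image",
    dataproc_version="<2.3-image-version>",
    customization_script="my-customization.sh",
    zone="<zone>",
    gcs_bucket="gs://<bucket>",
    optional_components=["DOCKER", "ZEPPELIN"],
)
print(shlex.join(argv))
```

Building the command as a list keeps each argument separate, so you can pass it to subprocess.run without shell quoting concerns.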
Notes
The following optional components are supported in non-arm 2.3 images:
- Apache Flink
- Apache Hive WebHCat
- Apache Hudi
- Apache Iceberg
- Apache Pig
- Delta Lake
- Docker
- JupyterLab Notebook
- Ranger
- Solr
- Trino
- Zeppelin notebook
- Zookeeper
2.3.x-*-arm images support only the pre-installed components and the following optional components. The other 2.3 optional components and all initialization actions aren't supported:
- Apache Hive WebHCat
- Docker
- Zeppelin notebook
- Zookeeper (installed in high availability clusters; optional component in other clusters)
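The two component lists above can be encoded as a small lookup to check a planned cluster before you create it. This is a minimal sketch: the function name and the use of display names (rather than API component identifiers) are choices made for illustration only.

```python
# Optional components listed on this page for non-arm 2.3 images.
NON_ARM_COMPONENTS = frozenset({
    "Apache Flink", "Apache Hive WebHCat", "Apache Hudi", "Apache Iceberg",
    "Apache Pig", "Delta Lake", "Docker", "JupyterLab Notebook", "Ranger",
    "Solr", "Trino", "Zeppelin notebook", "Zookeeper",
})

# Optional components listed on this page for 2.3.x-*-arm images.
ARM_COMPONENTS = frozenset({
    "Apache Hive WebHCat", "Docker", "Zeppelin notebook", "Zookeeper",
})


def unsupported_components(requested, arm_image):
    """Return the requested components that the image type does not support."""
    supported = ARM_COMPONENTS if arm_image else NON_ARM_COMPONENTS
    return sorted(set(requested) - supported)


print(unsupported_components(["Docker", "Trino"], arm_image=True))   # ['Trino']
print(unsupported_components(["Docker", "Trino"], arm_image=False))  # []
```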
yarn.nodemanager.recovery.enabled and HDFS Audit Logging are enabled by default in 2.3 images.
micromamba, instead of conda in previous image versions, is installed as part of the Python installation.
Docker and Zeppelin installation issues:
- Installation fails if the cluster has no public internet access. As a
workaround, create a cluster that uses a custom image with optional
components pre-installed. You can do this by running
generate_custom_image.py with the --optional-components flag.
- Installation can fail if the cluster is pinned to an older sub-minor image
version: packages are installed on demand from public OSS repositories, and a package
might not be available upstream to support the installation.
As a workaround, create a cluster that uses a custom image with optional
components pre-installed. To do this, run
generate_custom_image.py with the --optional-components flag.
The default resource calculator for YARN has changed from DefaultResourceCalculator to DominantResourceCalculator, which uses the dominant-resource concept to determine allocation across resources such as memory and CPU. This change affects the Autoscaler, which scales based on the dominant resource usage of the cluster.
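To illustrate the difference, the following sketch compares a memory-only view of cluster usage with a dominant-resource view. It is a simplified model of the concept, not YARN's implementation, and the resource names and numbers are made up for the example.

```python
def memory_only_share(used, total):
    """Usage as a memory-only calculator would see it."""
    return used["memory_mb"] / total["memory_mb"]


def dominant_share(used, total):
    """Usage of the most heavily used resource (the dominant resource)."""
    return max(used[name] / total[name] for name in total)


def containers_that_fit(total, container, dominant=True):
    """Count containers that fit, by memory only or by every resource."""
    by_memory = total["memory_mb"] // container["memory_mb"]
    if not dominant:
        return by_memory
    return min(total[name] // container[name] for name in total)


# Hypothetical cluster totals, current usage, and container request.
total = {"memory_mb": 65536, "vcores": 16}
used = {"memory_mb": 16384, "vcores": 12}
container = {"memory_mb": 4096, "vcores": 4}

print(memory_only_share(used, total))                       # 0.25
print(dominant_share(used, total))                          # 0.75
print(containers_that_fit(total, container, dominant=False))  # 16
print(containers_that_fit(total, container, dominant=True))   # 4
```

In this example, a CPU-heavy workload looks 25% utilized by memory alone but 75% utilized by its dominant resource, which is the figure that drives scaling decisions under the new default.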
Image version 2.3 machine learning (ML) components
The Dataproc 2.3-ml-ubuntu image extends the 2.3 base image
with ML-specific software. It supports 2.3 image optional components and other
2.3 features, and adds the component versions listed in the following sections.
GPU-specific libraries
For Dataproc jobs that use GPU VMs,
the following NVIDIA driver and libraries are available in the
2.3-ml-ubuntu image. You can use them to accomplish the following
tasks:
- Accelerate Spark batch workloads with the NVIDIA Spark Rapids library
- Train machine learning workloads
- Run distributed batch inference using Spark
| Package Name | Version |
|---|---|
| Spark Rapids | 25.04.0 |
| NVIDIA Driver | Ubuntu 22.04 LTS Accelerated with NVIDIA driver version 570 |
| CUDA | 12.6.3 |
| cublas | 12.6.4 |
| cusolver | 11.7.1 |
| cupti | 12.6.80 |
| cusparse | 12.5.4 |
| cuDNN | 9.10.1 |
| NCCL | 2.27.5 |
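As a rough illustration of the first task, the following sketch builds a set of Spark properties commonly used to enable the RAPIDS plugin and GPU scheduling. The property keys are standard Spark and RAPIDS settings, but the values are placeholders, and this page doesn't say which of them the 2.3-ml-ubuntu image already sets by default, so treat the whole set as an assumption to verify against your cluster.

```python
def rapids_spark_properties(gpus_per_executor=1, tasks_per_gpu=4):
    """Return Spark properties that enable the RAPIDS SQL plugin.

    The values are illustrative defaults, not recommendations.
    """
    return {
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.rapids.sql.enabled": "true",
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        # Each task claims a fraction of a GPU so several tasks can share it.
        "spark.task.resource.gpu.amount": str(1 / tasks_per_gpu),
    }


properties = rapids_spark_properties()
# Comma-separated key=value pairs, a form that job submission tools often accept.
print(",".join(f"{key}={value}" for key, value in properties.items()))
```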
XGBoost libraries
The following Maven package versions
are available in the 2.3-ml-ubuntu image to let you use
XGBoost with Spark in Java or
Scala.
| Group ID | Package Name | Version |
|---|---|---|
| ml.dmlc | xgboost4j-gpu_2.12 | 2.1.1 |
| ml.dmlc | xgboost4j-spark-gpu_2.12 | 2.1.1 |
Python libraries
The 2.3-ml-ubuntu image contains the following libraries, which support different
stages in the ML lifecycle.
| Package | Version |
|---|---|
| accelerate | 1.8.1 |
| conda | 23.11.0 |
| cookiecutter | 2.5.0 |
| curl | 8.12.1 |
| cython | 3.0.12 |
| dask | 2023.12.1 |
| datasets | 3.6.0 |
| deepspeed | 0.17.2 |
| delta-spark | 3.2.0 |
| evaluate | 0.4.5 |
| fastavro | 1.9.7 |
| fastparquet | 2023.10.1 |
| fiona | 1.10.0 |
| gateway-provisioners[yarn] | 0.4.0 |
| gcsfs | 2023.12.2.post1 |
| google-auth-oauthlib | 1.2.2 |
| google-cloud-aiplatform | 1.88.0 |
| google-cloud-bigquery[pandas] | 3.31.0 |
| google-cloud-bigquery-storage | 2.30.0 |
| google-cloud-bigtable | 2.30.1 |
| google-cloud-container | 2.56.1 |
| google-cloud-datacatalog | 3.26.1 |
| google-cloud-dataproc | 5.18.1 |
| google-cloud-datastore | 2.21.0 |
| google-cloud-language | 2.17.2 |
| google-cloud-logging | 3.11.4 |
| google-cloud-monitoring | 2.27.2 |
| google-cloud-pubsub | 2.29.1 |
| google-cloud-redis | 2.18.1 |
| google-cloud-spanner | 3.53.0 |
| google-cloud-speech | 2.32.0 |
| google-cloud-storage | 2.19.0 |
| google-cloud-texttospeech | 2.25.1 |
| google-cloud-translate | 3.20.3 |
| google-cloud-vision | 3.10.2 |
| huggingface_hub | 0.33.1 |
| httplib2 | 0.22.0 |
| ipyparallel | 8.6.1 |
| ipython-sql | 0.3.9 |
| ipywidgets | 8.1.7 |
| jupyter_contrib_nbextensions | 0.7.0 |
| jupyter_http_over_ws | 0.0.8 |
| jupyter_kernel_gateway | 2.5.2 |
| jupyter_server | 1.24.0 |
| jupyterhub | 4.1.6 |
| jupyterlab | 3.6.8 |
| jupyterlab-git | 0.44.0 |
| jupyterlab_widgets | 3.0.15 |
| koalas | 0.22.0 |
| langchain | 0.3.26 |
| lightgbm | 4.6.0 |
| markdown | 3.5.2 |
| matplotlib | 3.8.4 |
| mlflow | 3.1.1 |
| nbconvert | 7.14.2 |
| nbdime | 3.2.1 |
| nltk | 3.9.1 |
| notebook | 6.5.7 |
| numba | 0.58.1 |
| numpy | 1.26.4 |
| oauth2client | 4.1.3 |
| onnx | 1.17.0 |
| openblas | 0.3.25 |
| opencv | 4.11.0 |
| orc | 2.1.1 |
| pandas | 2.1.4 |
| pandas-profiling | 3.0.0 |
| papermill | 2.4.0 |
| pyarrow | 16.1.0 |
| pydot | 2.0.0 |
| pyhive | 0.7.0 |
| pynvml | 12.0.0 |
| pysal | 23.7 |
| pytables | 3.9.2 |
| python | 3.11 |
| regex | 2023.12.25 |
| requests | 2.32.2 |
| requests-kerberos | 0.12.0 |
| rtree | 1.1.0 |
| scikit-image | 0.22.0 |
| scikit-learn | 1.5.2 |
| scipy | 1.11.4 |
| seaborn | 0.13.2 |
| sentence-transformers | 5.0.0 |
| setuptools | 79.0.1 |
| shap | 0.48.0 |
| shapely | 2.1.1 |
| spacy | 3.8.7 |
| spark-tensorflow-distributor | 1.0.0 |
| spyder | 5.5.6 |
| sqlalchemy | 2.0.41 |
| sympy | 1.13.3 |
| tensorflow | 2.18.0 |
| tokenizers | 0.21.4.dev0 |
| toree | 0.5.0 |
| torch | 2.6.0 |
| torch-model-archiver | 0.11.1 |
| torcheval | 0.0.7 |
| tornado | 6.4.2 |
| torchvision | 0.21.0 |
| traitlets | 5.14.3 |
| transformers | 4.53.1 |
| uritemplate | 4.1.1 |
| virtualenv | 20.26.6 |
| wordcloud | 1.9.4 |
| xgboost | 2.1.4 |
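If you want to confirm that an environment matches the versions in this table, a check along the following lines can help. It is a minimal sketch that uses only the Python standard library. The function name is hypothetical, and the three packages are a sample from the table. Some table entries (for example python, opencv, and pytables) might not match a Python distribution name, so they could be reported as missing.

```python
from importlib import metadata

# A small sample of the versions listed in the table above.
EXPECTED_VERSIONS = {
    "numpy": "1.26.4",
    "pandas": "2.1.4",
    "torch": "2.6.0",
}


def find_mismatches(expected):
    """Map each package whose installed version differs to (expected, installed).

    The installed value is None when the package isn't found.
    """
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


for package, (wanted, installed) in find_mismatches(EXPECTED_VERSIONS).items():
    print(f"{package}: expected {wanted}, found {installed}")
```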
R libraries
The following R library versions are included in the 2.3-ml-ubuntu image.
| Package Name | Version |
|---|---|
| r-ggplot2 | 3.4.4 |
| r-irkernel | 1.3.2 |
| r-rcurl | 1.98-1.16 |
| r-recommended | 4.3 |