Clustered Spark fails to write _delta_log via a Notebook without granting the Notebook data access

I have set up a Jupyter Notebook with PySpark connected to a Spark cluster, where the Spark instance is intended to perform writes to a Delta table. I'm observing that the Spark instance fails to complete the writes if the Jupyter Notebook doesn't have access to the data location.

This behaviour seems counterintuitive to me, as I expect the Spark instance to handle data writes independently of the Jupyter Notebook's access to the data. I've put up a repository that reproduces the issue: https://github.com/caldempsey/docker-notebook-spark-s3

Steps to reproduce

1. Create a Spark Cluster (locally) and connect a Jupyter Notebook to the Cluster.
2. Write a Delta Table to the Spark filesystem (let's say `/out`).
3. Spark will write the Parquet files to `/out` but raise an error unless the notebook is given access to `/out`, despite the notebook being registered as an application and not a worker.
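The steps above can be sketched in PySpark roughly as follows. This is a minimal sketch, not the exact code from the repo: the master URL, the app name, and the sample rows are hypothetical, and it assumes the Delta Lake package is already available to the session. The two Delta settings are the standard ones for enabling Delta Lake on a Spark session.

```python
def delta_session_conf(master_url: str) -> dict:
    """Spark settings for a Delta-enabled session attached to a cluster."""
    return {
        # Assumption: a standalone master reachable at this URL.
        "spark.master": master_url,
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def write_dog_owners(master_url: str, target: str) -> None:
    """Write a tiny Delta table to `target` from the notebook process."""
    # Imported lazily so the helper above can be inspected without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("delta-log-repro")  # hypothetical app name
    for key, value in delta_session_conf(master_url).items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    # Hypothetical sample rows, only there to have something to write.
    df = spark.createDataFrame([("Alice", "Rex"), ("Bob", "Fido")], ["owner", "dog"])
    df.write.format("delta").mode("overwrite").save(target)
    spark.stop()
```

In the notebook this would be called as something like `write_dog_owners("spark://spark-master:7077", "/data/delta_table_of_dog_owners")`, where the hostname is an assumption about the compose setup.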

Via the repo provided:

1. Clone the repo
2. Remove the volume mount `./../../notebook-data-lake/data:/data` at infra-delta-lake/localhost/docker-compose.yml:63. Without it, the notebook can no longer access the `/data` target that the Spark Master and Workers share on their local filesystem.

Observed results

When the notebook has access to `/data` (as a connected application, not a member of the cluster), Delta Tables write successfully, including `_delta_log`.

When the notebook does not have access to `/data`, the write fails because `_delta_log` cannot be created, but the Parquet files still get written:

Py4JJavaError: An error occurred while calling o56.save.
: org.apache.spark.sql.delta.DeltaIOException: [DELTA_CANNOT_CREATE_LOG_PATH] Cannot create file:/data/delta_table_of_dog_owners/_delta_log

I expect the _delta_log to be written regardless of whether the Notebook has access to the target filesystem.
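One thing that may help narrow this down is checking what the `file:/data/...` path from the error looks like from inside the notebook process itself. The sketch below assumes the log directory is being created by the process that hosts the Spark application (here, the notebook) rather than by a worker; that is an assumption about this setup, not something the error message confirms. The helper names are hypothetical and use only the Python standard library.

```python
import os


def nearest_existing_ancestor(path: str) -> str:
    """Walk up from `path` until a filesystem entry that actually exists is found."""
    current = os.path.abspath(path)
    while not os.path.exists(current):
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return current


def can_create_locally(path: str) -> bool:
    """True if this process could plausibly create `path` (and missing parents)."""
    anchor = nearest_existing_ancestor(path)
    # Creating directories needs write and search permission on an existing directory.
    return os.path.isdir(anchor) and os.access(anchor, os.W_OK | os.X_OK)
```

Running `can_create_locally("/data/delta_table_of_dog_owners/_delta_log")` in a notebook cell, with and without the volume mount, would show whether the notebook's own view of `/data` lines up with the failure.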

Makes no sense. Can anyone help? Feel free to clone and play!

