I have set up a Jupyter Notebook with PySpark connected to a Spark cluster, where the Spark instance is intended to perform writes to a Delta table. I'm observing that the Spark instance fails to complete the writes if the Jupyter Notebook doesn't have access to the data location.
This behaviour seems counterintuitive to me, as I expect the Spark instance to handle data writes independently of the Jupyter Notebook's access to the data. I've put up a repository that reproduces the issue: https://github.com/caldempsey/docker-notebook-spark-s3
Steps to reproduce
1. Create a Spark Cluster (locally) and connect a Jupyter Notebook to the Cluster.
2. Write a Delta Table to the Spark filesystem (let's say `/out`).
3. Spark writes the Parquet files to `/out` but raises an error unless the notebook is given access to `/out`, even though the notebook is registered as an application and not a worker.
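For reference, the write in step 2 looks roughly like the sketch below. The master URL, app name, and sample rows are placeholders I've made up for illustration (the real values are in the repo), and it assumes the Delta Lake jars are already available to the session. The two `spark.sql.*` settings are the standard ones for enabling Delta on a Spark session.

```python
# Sketch only: master URL, app name, and sample rows are hypothetical placeholders.
def delta_session_conf(master_url="spark://spark-master:7077"):
    """Return the Spark settings used to enable Delta Lake on a session."""
    return {
        "spark.master": master_url,
        "spark.app.name": "delta-write-repro",
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def write_dog_owners(path="/data/delta_table_of_dog_owners"):
    """Connect to the cluster from the notebook and write a small Delta table."""
    # Imported lazily so the config helper can be inspected without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in delta_session_conf().items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    # Made-up rows; any small DataFrame reproduces the behaviour.
    df = spark.createDataFrame([("Alice", "Rex"), ("Bob", "Fido")], ["owner", "dog"])

    # This save() is the call that fails when the notebook cannot see the path.
    df.write.format("delta").save(path)
```

Calling `write_dog_owners()` from a notebook cell is what triggers the behaviour described in step 3.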
Via the repo provided:
1. Clone the repo
2. Remove line 63 of `infra-delta-lake/localhost/docker-compose.yml` (`./../../notebook-data-lake/data:/data`). This prevents the notebook from accessing the `/data` target shared with the Spark Master and Workers on their local filesystem.
Observed results
When the notebook has access to `/data` (as a connected application, not a member of the cluster), Delta Tables write successfully, including `_delta_log`.
When the notebook does not have access to `/data`, the write fails with an error saying it cannot create `_delta_log`, yet the Parquet files still get written:
Py4JJavaError: An error occurred while calling o56.save.
: org.apache.spark.sql.delta.DeltaIOException: [DELTA_CANNOT_CREATE_LOG_PATH] Cannot create file:/data/delta_table_of_dog_owners/_delta_log
I expect the _delta_log to be written regardless of whether the Notebook has access to the target filesystem.
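To narrow down which process is trying to create the log directory, here is a small check that can be run in a notebook cell. It only reports how the path looks from the notebook's own filesystem. It rests on my guess, which I haven't confirmed, that the `_delta_log` directory is created by the process running the notebook (the Spark driver) rather than by the workers. The function name is my own.

```python
import os


def describe_local_path(path):
    """Report how `path` looks from the filesystem of the current process."""
    path = os.path.abspath(path)

    # Walk upwards until we reach a directory that actually exists here.
    ancestor = path
    while not os.path.exists(ancestor):
        parent = os.path.dirname(ancestor)
        if parent == ancestor:  # reached the filesystem root
            break
        ancestor = parent

    return {
        "path": path,
        "exists": os.path.exists(path),
        "nearest_existing_ancestor": ancestor,
        # Note: os.access reports True for almost everything when running as root.
        "ancestor_writable": os.access(ancestor, os.W_OK),
    }


if __name__ == "__main__":
    print(describe_local_path("/data/delta_table_of_dog_owners/_delta_log"))
```

If the nearest existing ancestor is `/` when the volume mount is removed, the notebook container simply has no `/data`, which would line up with the `file:/data/...` path in the error being resolved on the notebook side.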
Makes no sense. Can anyone help? Feel free to clone and play!