# Clustered Spark fails to write _delta_log via a Notebook without granting the Notebook data access


amber kiln

I have set up a Jupyter Notebook with PySpark connected to a Spark cluster, where the Spark instance is intended to perform writes to a Delta table. I'm observing that the Spark instance fails to complete the writes if the Jupyter Notebook doesn't have access to the data location.

This behaviour seems counterintuitive to me as I expect the Spark instance to handle data writes independently of the Jupyter Notebook's access to the data. I've put up a repository with the issue for clarity. https://github.com/caldempsey/docker-notebook-spark-s3

Steps to reproduce

1. Create a Spark Cluster (locally) and connect a Jupyter Notebook to the Cluster.
2. Write a Delta Table to the Spark filesystem (let's say `/out`).
3. Spark will write the Parquet files to `/out` but error unless the notebook is given access to `/out`, despite the notebook being registered as an application and not a worker.
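
For reference, a minimal sketch of what step 2 looks like from the notebook. The master URL, app name, sample rows, and table name under `/out` are assumptions for illustration, not values taken from the repo, and the Delta Lake package also has to be on the Spark classpath, which isn't shown here.

```python
# Hypothetical values for illustration; adjust to your own cluster.
SPARK_MASTER = "spark://spark-master:7077"
OUT_PATH = "/out/delta_table_of_dog_owners"


def delta_session_conf(master):
    """Return the Spark settings that enable Delta Lake on a session."""
    return {
        "spark.master": master,
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def write_dog_owners(master=SPARK_MASTER, path=OUT_PATH):
    """Connect to the cluster and write a tiny Delta table to `path`."""
    # Imported lazily so the config helper above works without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("delta-log-repro")
    for key, value in delta_session_conf(master).items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    df = spark.createDataFrame([("Alice", "Rex"), ("Bob", "Fido")], ["owner", "dog"])
    # This is the save call that raises the error when the notebook lacks access.
    df.write.format("delta").mode("overwrite").save(path)
```

Calling `write_dog_owners()` in a notebook cell is what triggers the write; nothing runs on import.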

Via the repo provided:

1. Clone the repo
2. Remove the volume mount `./../../notebook-data-lake/data:/data` at `infra-delta-lake/localhost/docker-compose.yml:63`. Removing it prevents the notebook from accessing the `/data` target shared with the Spark Master and Workers on their local filesystem.

Observed results

When the notebook has access to /data (but is a connected application not a member of the cluster), Delta Tables write successfully with _delta_log.

When the notebook does not have access to /data, the save fails with the error below, complaining that it can't create _delta_log, but the Parquet files still get written!

Py4JJavaError: An error occurred while calling o56.save.
: org.apache.spark.sql.delta.DeltaIOException: [DELTA_CANNOT_CREATE_LOG_PATH] Cannot create file:/data/delta_table_of_dog_owners/_delta_log
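
The `file:/` scheme in the error points at a local filesystem path, so it may help to check which process can actually create that path. Below is a minimal, non-destructive sketch that can be run in a notebook cell to see whether the notebook process itself could create the `_delta_log` directory. The helper names are hypothetical, and this only tests the local filesystem of the process it runs in, not the workers.

```python
import os


def nearest_existing_ancestor(path):
    """Walk up from `path` until a location that exists is found."""
    current = os.path.abspath(path)
    while not os.path.exists(current):
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return current


def can_create_locally(path):
    """True if this process could create `path` on its own local filesystem."""
    ancestor = nearest_existing_ancestor(path)
    # Creating entries under a directory needs write and search permission on it.
    return os.path.isdir(ancestor) and os.access(ancestor, os.W_OK | os.X_OK)


# Path taken from the error message above.
print(can_create_locally("/data/delta_table_of_dog_owners/_delta_log"))
```

If this prints `False` in the notebook while the same check passes on the workers, the notebook process is the one that cannot reach the path.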

I expect the _delta_log to be written regardless of whether the Notebook has access to the target filesystem.

Makes no sense. Can anyone help? Feel free to clone and play!

