#Runpod S3 multipart upload error

dawn briar
#

When I use boto3.client.upload_file for a multipart upload, all parts are successfully uploaded to S3, but I encounter an error: "An error occurred (InternalError) when calling the CompleteMultipartUpload operation (reached max retries: 10): Failed to create final object file".
The file is 2.1 GB, and both the storage space (20 GB) and the part sizes (200-300 MB) are within limits.
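
For reference, a minimal sketch of the kind of call described above. The helper names, the 256 MiB part size, and the single-threaded setting are illustrative assumptions, not taken from the original script; the endpoint URL and region are passed in rather than guessed.

```python
MIB = 1024 * 1024


def transfer_settings(part_size_mib=256):
    """Keyword arguments for boto3's TransferConfig (illustrative values)."""
    part_bytes = part_size_mib * MIB
    return {
        # Files at or above this size are sent as multipart uploads.
        "multipart_threshold": part_bytes,
        # Size of each uploaded part.
        "multipart_chunksize": part_bytes,
        # One part at a time, to keep the behaviour easy to follow.
        "max_concurrency": 1,
    }


def upload(local_path, bucket, key, endpoint_url, region, part_size_mib=256):
    """Upload a file with upload_file; bucket is the network volume ID."""
    # Imported here so transfer_settings() can be inspected without boto3.
    import boto3
    from boto3.s3.transfer import TransferConfig

    client = boto3.client("s3", region_name=region, endpoint_url=endpoint_url)
    config = TransferConfig(**transfer_settings(part_size_mib))
    # upload_file switches to multipart above the threshold and issues
    # CompleteMultipartUpload itself after the last part is sent.
    client.upload_file(local_path, bucket, key, Config=config)
```

Because upload_file drives the whole sequence, a failure in the final step surfaces only after every part has already been transferred, which matches the behaviour reported here.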

rancid bobcat
# dawn briar When I use boto3.client.upload_file for a multipart upload, all parts are succes...

I have the same issue when I try to use boto3, the AWS CLI, or s3api.

Command for upload:

aws s3 cp --region DATACENTER --endpoint-url https://s3api-[DATACENTER].runpod.io/ [LOCAL_FILE] s3://NETWORK_VOLUME_ID/[REMOTE_FILE] --debug

or

aws s3api get-object --bucket [NETWORK_VOLUME_ID] \
    --key [REMOTE_FILE] \
    --region [DATACENTER] \
    --endpoint-url https://s3api-[DATACENTER].runpod.io/ \
    [LOCAL_FILE]

./s3_put.sh --endpoint https://s3api-[DATACENTER].runpod.io/ --region [DATACENTER] --bucket [NETWORK_VOLUME_ID] --object [REMOTE_FILE] --file [LOCAL_FILE]

Errors:

Error uploading file 'ComfyUI/models/vae/FLUX1/ae.safetensors' to Network Volume 'NETWORK_VOLUME_ID' as 'models/vae/FLUX1/ae.safetensors': Failed to upload ComfyUI/models/vae/FLUX1/ae.safetensors to NETWORK_VOLUME_ID/models/vae/FLUX1/ae.safetensors: An error occurred (InternalError) when calling the CompleteMultipartUpload operation (reached max retries: 4): Failed to create final object file

botocore.exceptions.ClientError: An error occurred (InternalError) when calling the CompleteMultipartUpload operation (reached max retries: 4): Failed to create final object file
dawn briar
# rancid bobcat I have the same issues when I try use boto3, aws cli or s3api Command for uploa...

I've experienced exactly the same issue you're describing with boto3 multipart uploads to RunPod's S3-compatible storage.

Here's the exact error message I consistently received at the final step (CompleteMultipartUpload):

(InternalError) when calling the CompleteMultipartUpload operation (reached max retries: 10): Failed to create final object file

To confirm, I checked that all individual parts were successfully uploaded. I verified this by observing the temporary files stored in the .s3compat_uploads directory. Here's a direct snapshot from my logs showing that all parts (for a ~2.1 GB file) were successfully uploaded:

2025-07-02 17:55:29 300.0 MiB .s3compat_uploads/.../1
2025-07-02 17:55:19 300.0 MiB .s3compat_uploads/.../2
2025-07-02 17:54:35 300.0 MiB .s3compat_uploads/.../3
2025-07-02 17:55:54 300.0 MiB .s3compat_uploads/.../4
2025-07-02 17:57:38 300.0 MiB .s3compat_uploads/.../5
2025-07-02 17:56:58 300.0 MiB .s3compat_uploads/.../6
2025-07-02 17:57:26 300.0 MiB .s3compat_uploads/.../7
2025-07-02 17:57:12 37.6 MiB .s3compat_uploads/.../8
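
The listing is consistent with a fixed 300 MiB part size: seven full parts plus a remainder, about 2,137.6 MiB in total. A small sketch of that arithmetic follows; the function name is hypothetical, and the byte count is approximated from the rounded sizes in the listing.

```python
MIB = 1024 * 1024


def plan_parts(total_bytes, part_bytes):
    """Return the size of each part when a file is split into fixed-size parts."""
    if total_bytes <= 0 or part_bytes <= 0:
        raise ValueError("sizes must be positive")
    full_parts, remainder = divmod(total_bytes, part_bytes)
    sizes = [part_bytes] * full_parts
    if remainder:
        # The last part carries whatever is left over.
        sizes.append(remainder)
    return sizes


# Roughly the upload in the listing: seven 300 MiB parts plus a ~37.6 MiB tail.
total = 7 * 300 * MIB + int(37.6 * MIB)
parts = plan_parts(total, 300 * MIB)
print(len(parts), [round(p / MIB, 1) for p in parts])
```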
dawn briar
# rancid bobcat I have the same issues when I try use boto3, aws cli or s3api Command for uploa...

When I reached out to RunPod's support team with detailed information, including timestamps and bucket details, they responded as follows:

"A 524 error is generated by Cloudflare, not by RunPod. This error indicates a timeout between Cloudflare and our backend services. One common scenario where this occurs is when a file upload takes too long to complete—this is likely what's happening in your case.

As suggested earlier, using a smaller part size for your uploads can help prevent this timeout. Most S3-compatible clients allow you to set or adjust the part size; we recommend ensuring each part is 500MB or less."

From their explanation, it seems Cloudflare, acting as an intermediary between our client and RunPod's storage backend, triggers this timeout issue specifically during the final object creation step. The suggested workaround from RunPod was explicitly to use smaller chunk sizes, ideally less than or equal to 500MB.

Despite implementing smaller chunks, I've continued to experience intermittent issues, suggesting the underlying network timeout issue remains a problem that isn't fully addressed by chunk size alone.

I hope sharing these detailed experiences and logs helps clarify the underlying issue for you.

still bison
#

Did you update support with that last detail after you switched to a smaller chunk size?

rancid bobcat
#

Following the team's suggestion, I tried modifying the upload chunk size.
I tested 16 MB, 64 MB, and even 500 MB chunks, but unfortunately the error persists.

dawn briar
# still bison Did you update the support with that last detail after you changed the chunk siz...

I've confirmed with the command below that the multipart parts were uploaded successfully, but an error keeps occurring at the CompleteMultipartUpload stage.

aws s3 ls --summarize --human-readable --recursive --region EU-RO-1 --endpoint-url https://s3api-eu-ro-1.runpod.io/ s3://{bucket_id}/

I suspect there is a bug in the RunPod S3 backend's logic for concatenating the multipart data.

still bison
#

Oh, I guess there's no need?

#

Try replying to it again.

rancid bobcat
#

I have opened a new ticket for this issue.

gloomy moth
#

I am having exactly the same issue. When using boto3, even the individual parts are not uploaded. With aws s3 cp the parts are uploaded, but merging them fails with the error above. This happens with 8 MB chunks. It effectively makes the network volume unusable, because I cannot upload any files to it.

aws s3api list-multipart-uploads lists the failed upload.
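
The same check can be sketched with boto3. This assumes the endpoint answers list_multipart_uploads for boto3 the way it does for the CLI; the function names are hypothetical, and the bucket is the network volume ID.

```python
def summarize_uploads(response):
    """Extract (key, upload_id) pairs from a list_multipart_uploads response."""
    return [(u["Key"], u["UploadId"]) for u in response.get("Uploads", [])]


def list_incomplete_uploads(bucket, endpoint_url, region):
    """Return unfinished multipart uploads for a bucket (network volume ID)."""
    # Imported here so summarize_uploads() can be used without boto3 installed.
    import boto3

    client = boto3.client("s3", region_name=region, endpoint_url=endpoint_url)
    # Single page only; a bucket with many uploads would need pagination.
    response = client.list_multipart_uploads(Bucket=bucket)
    return summarize_uploads(response)
```

An upload that still appears in this list was started but never completed or aborted, which is what a failed CompleteMultipartUpload leaves behind.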

put-object works for a 300 MB file, but is much slower. For a 10 GB file I get "An error occurred (524) when calling the PutObject operation: ".

onyx cave
#

I'm experiencing this issue as well and am following this thread for updates on the bug.

gloomy moth
#

Any update?

midnight locust
#

Maybe try smaller and bigger chunks?

rancid bobcat
#

The commands we used are as follows:

# File Size
flux1-fill-dev-fp8.safetensors ~ 11GB

# Command to upload to the root
aws s3 cp --profile default --region EU-RO-1 --endpoint-url https://s3api-eu-ro-1.runpod.io/ --cli-connect-timeout 0 --cli-read-timeout 0 ComfyUI/models/checkpoints/FLUX1/flux1-fill-dev-fp8.safetensors s3://xyz/flux1-fill-dev-fp8.safetensors --debug

# Command to upload to a subfolder
aws s3 cp --profile default --region EU-RO-1 --endpoint-url https://s3api-eu-ro-1.runpod.io/ --cli-connect-timeout 0 --cli-read-timeout 0 ComfyUI/models/checkpoints/FLUX1/flux1-fill-dev-fp8.safetensors s3://xyz/models/checkpoints/FLUX1/flux1-fill-dev-fp8.safetensors --debug

Errors:

  • An error occurred (InternalError) when calling the CompleteMultipartUpload operation (reached max retries: 2): Failed to create final object file
  • An error occurred (504) when calling the CompleteMultipartUpload operation (reached max retries: 2): Gateway Timeout
  • An error occurred (524) when calling the UploadPart operation
    Exception caught when parsing error response body: Traceback (most recent call last):
    File "awscli/botocore/parsers.py", line 537, in _parse_xml_string_to_dom
    xml.etree.ElementTree.ParseError: syntax error: line 1, column 0
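
The three errors fall into two groups: HTTP-level gateway timeouts (504, 524) and the InternalError returned for CompleteMultipartUpload. A rough sketch of telling them apart from a botocore ClientError is below; the labels and function names are hypothetical, and it assumes the error code arrives as the string shown in parentheses in the messages above.

```python
GATEWAY_CODES = {"504", "524"}


def classify_failure(error_code, operation):
    """Roughly sort the failures reported in this thread into groups."""
    if str(error_code) in GATEWAY_CODES:
        # HTTP-level timeout from a proxy in front of the storage API.
        return "gateway-timeout"
    if error_code == "InternalError" and operation == "CompleteMultipartUpload":
        # The backend accepted the parts but could not assemble the object.
        return "final-assembly"
    return "other"


def classify_client_error(exc):
    """Classify a botocore ClientError using its parsed response."""
    code = exc.response.get("Error", {}).get("Code", "")
    return classify_failure(code, exc.operation_name)
```

Note that upload_file wraps the underlying error in S3UploadFailedError, so this kind of inspection applies when calling the low-level client operations directly.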

This consistent failure, irrespective of the destination path or the multipart configuration, strongly reinforces our belief that this is a server-side issue with the final file assembly process.

This problem is preventing us from using the Network Volume service.

wild oyster
#

Hey, sorry for no response here over the long weekend. Let me look through this thread and understand the issue.

wild oyster
#

Thanks for the reminder, I can forget things sometimes.

To reiterate support's request, we recommend multipart uploads with parts under 500 MB because of limits imposed by Cloudflare.

wild oyster
#

For everything else, I have someone looking into the issues you're seeing - I'll update you here when I get a message from him

rancid bobcat
#

Any update from client service?

wild oyster
#

All I have so far is this:

Based on the timestamps of those files, the "Failed to create final object file" errors were occurring before the fix for subdirectories was deployed. So it is possible that they were still hitting that issue.

#

I'm still waiting to hear back on the 524 and 504, but you may just be good to go? Let me know!

dawn briar
#

More than a week after I submitted a ticket for this issue, I received the following response:

Hi there,

Following up on your requests, our engineering team recommends increasing the timeouts in your tools to work around this issue.

For aws s3 and aws s3api:
Setting the read timeout with the CLI flag --cli-read-timeout 7200 has been shown to help. While we haven't thoroughly tested this for every file size, very large files might require even longer timeouts.

Alternatively, you can update your AWS config file (~/.aws/config) with:

[default]
cli_read_timeout = 7200

For boto3:
You can set a larger read_timeout using the Config class:

Option 1:

import boto3
from botocore.config import Config

custom_config = Config(
    read_timeout=7200,
)
s3_client = boto3.client('s3', config=custom_config)

Option 2:

import boto3
from botocore.config import Config

custom_config = Config(
    read_timeout=7200,
)
session = boto3.Session()
s3_client = session.client('s3', config=custom_config)


Please give this a try and let us know if the issue persists after implementing the above suggestions. We’re here to help if you need further guidance.
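
The snippets in the reply create the client without an endpoint, which would target AWS itself. Below is a minimal sketch that combines the suggested read_timeout with a custom endpoint, assuming the endpoint URL and region are passed in (for example the https://s3api-eu-ro-1.runpod.io/ endpoint and EU-RO-1 region used earlier in this thread). The connect_timeout and retry values are illustrative and do not come from support.

```python
def client_settings(read_timeout=7200, connect_timeout=60, max_attempts=3):
    """Keyword arguments for botocore's Config (values are illustrative)."""
    return {
        # Seconds to wait for a response once the request has been sent.
        "read_timeout": read_timeout,
        # Seconds to wait while establishing the connection.
        "connect_timeout": connect_timeout,
        # Retry behaviour; "standard" is one of botocore's built-in modes.
        "retries": {"max_attempts": max_attempts, "mode": "standard"},
    }


def make_client(endpoint_url, region, **overrides):
    """Create an S3 client for a custom endpoint with long timeouts."""
    # Imported here so client_settings() can be inspected without boto3.
    import boto3
    from botocore.config import Config

    return boto3.client(
        "s3",
        region_name=region,
        endpoint_url=endpoint_url,
        config=Config(**client_settings(**overrides)),
    )
```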
#

I have already tried experimenting with a higher read_timeout setting, which resulted in the exact same error.

Let me be clear: the multipart concatenation logic within RunPod S3 itself seems to be flawed. Instead of testing or suggesting configuration-side workarounds, why don't you inspect the internal logic of RunPod S3?

To be frank, it is technically very disappointing to receive such a simple answer after the RunPod team supposedly spent over a week investigating the cause. Our company is currently reviewing the integration of RunPod S3 into our services. However, if the RunPod team continues to provide these kinds of answers and the bug remains because the root cause has not been properly identified, we will be unable to use RunPod.

#

@wild oyster

wild oyster
#

We're aware that Multipart Concat is a little buggy, especially at extremely large file sizes. Admittedly, it can take a very long time for especially technical details to travel through support. You can always ask me here, although after hours like now it could take me a while to reply.

I'm working with our team to make sure users don't get delivered the same suggestions twice. I can't access your ticket while I'm driving, but the engineer working on the S3 API Compatibility was made aware of a potential issue with this code path on Monday and on Tuesday suggested he was still testing for the root cause.

We believe that users with files over 12 GB will continue to see issues, but most users today (even those at that file size) should see the issue remedied with a longer timeout. It's a fairly complicated issue: if we're sloppy, we risk corrupting files, and a better fix would involve patching the file system that powers our network storage clusters (which is what powers the S3 API).

We're definitely not done working on this.