#Finetuning jobs are stuck in queue for days


dry ivy
It’s taking an unusually long time to start a fine-tuning job, even with a small dataset (10–30 examples). I’m using SFT, and the job has been taking over 24 hours to start.

This process used to be much faster—has something changed recently on the platform?

Is there a known issue, or are there any steps to speed up or prioritize job initialization?

quasi python (BOT)
Hi @dry ivy. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:

There have been recent, temporary platform issues in which fine‑tuning jobs validate successfully but then stay in the queued state for many hours or even days, similar to what you’re seeing. In multiple cases, OpenAI staff acknowledged an internal problem and pushed a fix, after which users had to restart their jobs:

  • Fine‑tuning jobs failing or stalling due to internal errors, later fixed by the fine‑tune team, with a request to “take a look and confirm” after retrying. [Fine-tuning internal error]
  • Fine‑tuning jobs that “never move out of queued state” for specific models (e.g., gpt‑4.1‑2025‑04‑14), where engineers identified and fixed an issue and asked users to “restart any impacted jobs.” [Queued fine-tune jobs]

From these sources:

  • Your experience (jobs stuck in queued for 24+ hours, even with tiny datasets) is consistent with known, intermittent platform issues rather than with your data size or configuration.
  • Once the underlying issue is fixed, the recommended action is to restart (recreate) the affected fine‑tuning jobs. [Queued fine-tune jobs]

As for speeding up or prioritizing:

  • The official docs only note that files must first be processed and you may briefly see states like validating_files before training starts; they do not describe any user‑controllable priority mechanism or guaranteed start time. [Chat fine-tuning guide; Vision fine-tuning guide]
  • There’s no documented way in these sources to “prioritize” a fine‑tune job beyond re‑submitting once a platform issue has been resolved.

Given that:

  1. Check the OpenAI status page for any current incidents.
  2. If the status page shows no incident but your job has been queued for more than 24 hours, it’s likely the same kind of issue as in the threads above; once that issue is resolved, cancel and recreate the job.
  3. If it continues after a known fix, you’ll likely need to contact OpenAI support with your fine‑tune job ID so they can investigate, as was done in the referenced threads. [Fine-tuning internal error; Queued fine-tune jobs]
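
The check-then-restart part of these steps could be scripted roughly as below. This is only a sketch under assumptions: it uses the `fine_tuning.jobs.retrieve`, `cancel`, and `create` calls from the official `openai` Python SDK, the 24-hour threshold is taken from step 2 and is not an official limit, and the helper names `is_stuck` and `restart_if_stuck` are hypothetical. It recreates the job with only the original training file and base model, so any custom hyperparameters or validation file would need to be passed again by hand.

```python
import time

# Assumption: 24 hours, the threshold mentioned in step 2 (not an official limit).
STALE_AFTER_SECONDS = 24 * 60 * 60


def is_stuck(status, created_at, now=None, threshold=STALE_AFTER_SECONDS):
    """Return True if a job has not started training after `threshold` seconds.

    `status` is the job's status string and `created_at` its creation time
    as a Unix timestamp in seconds.
    """
    now = time.time() if now is None else now
    still_waiting = status in ("validating_files", "queued")
    return still_waiting and (now - created_at) > threshold


def restart_if_stuck(client, job_id, now=None):
    """Cancel and recreate a stuck job; return the new job ID, or None.

    `client` is expected to behave like `openai.OpenAI()`. Restarting is only
    likely to help after the underlying platform issue has been fixed.
    """
    job = client.fine_tuning.jobs.retrieve(job_id)
    if not is_stuck(job.status, job.created_at, now=now):
        return None  # job is running, finished, or simply not old enough yet
    client.fine_tuning.jobs.cancel(job_id)
    # Only the training file and base model are carried over here.
    new_job = client.fine_tuning.jobs.create(
        training_file=job.training_file,
        model=job.model,
    )
    return new_job.id
```

To try it, you would construct a client with `from openai import OpenAI; client = OpenAI()` (with your API key in the environment) and call `restart_if_stuck(client, your_job_id)`. If the new job also stays queued, keep both job IDs for the support request in step 3.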

The knowledge sources do not mention any additional tuning or configuration you can use to force faster initialization beyond waiting for platform issues to be resolved and restarting affected jobs.

Note: you can ask me a follow-up question by @-mentioning me again :speech_balloon: