#Need help with a Mission Control rerun/reconciliation bug

thorny hornet
#

Helpdesk, I need help with a Mission Control rerun/reconciliation bug.

Context:
We fixed the assignee-default bug so new planning tasks should inherit the project owner correctly.
But retries are still broken for both Aemeath and Shorekeeper procedures.

What I observed:
When I clean stale links, clear nativeTaskId / nativeRunId, and set a planning task back to launch_requested, Mission Control does attempt relaunch.
But reconciliation then reattaches the planning task back to the old failed task id instead of creating or binding to a fresh retry attempt.
This affects:
  • policy-protocol-index-creation-proc-0
  • policy-protocol-index-creation-proc-1
  • policy-protocol-index-creation-proc-2
  • also seen on high-signal-governance-stability-sentinel-proc-1

Important symptoms:
  • stale failed task ids keep getting relinked
  • retry is not truly fresh
  • Shorekeeper procedure rerun also drifted through wrong linkage/owner behavior
  • I preserved retry history in mission-control/data/planning-retry-history.json

Likely fix direction:
  • retries need a fresh attempt identity, not just the same planningTaskId
  • reconciliation should ignore superseded failed runs during relink
  • explicit assignee on retry must be preserved

Can you inspect and patch the rerun/relink logic?

junior vigilBOT
#

I checked the local builderz-labs/mission-control checkout I have here, and the public main branch does not contain the planning/procedure code you’re describing. I couldn’t find launch_requested, nativeTaskId, nativeRunId, or planning-retry-history.json, so I can’t safely produce a real tested patch from this workspace.

That said, your diagnosis looks right. The fix should be:

1. Make retry identity explicit

Do not use planningTaskId as the retry identity.

Add something like:

  • currentAttemptId
  • attemptNumber
  • retryRequestedAt
  • requestedAssigneeId
  • supersededAt / supersededByAttemptId
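
As a rough sketch of how those fields could be typed: the field names follow the list above, but the type names (PlanningTask, RetryHistoryEntry), the nullability choices, and the helper hasAttemptIdentity are my assumptions, since I can't see the real schema. The timestamp field is named retryRequestedAt here to match the guard used later in this message.

```typescript
// Hypothetical shapes, not Mission Control's actual schema.
interface RetryHistoryEntry {
  attemptId: string
  nativeTaskId: string | null
  nativeRunId: string | null
  supersededAt: string | null // ISO-8601 timestamp (assumed format)
  supersededByAttemptId: string | null
}

interface PlanningTask {
  id: string // the planningTaskId
  status: string
  assigneeId: string | null
  nativeTaskId: string | null
  nativeRunId: string | null
  // New fields, nullable so legacy rows stay valid:
  currentAttemptId: string | null
  attemptNumber: number | null
  retryRequestedAt: string | null
  requestedAssigneeId: string | null
}

// True once a task carries an explicit attempt identity.
function hasAttemptIdentity(task: PlanningTask): boolean {
  return task.currentAttemptId !== null
}
```

Keeping the new fields nullable means existing rows don't need a backfill before the code ships.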

When a rerun is requested:

  • generate a new attemptId
  • persist it on the planning task
  • mark the old attempt superseded
  • clear the active linkage (nativeTaskId / nativeRunId) on the planning task, keeping the old ids in retry history

2. Reconciliation must bind by attempt, not just planning task

Right now it sounds like reconciliation does roughly:

find native run where planningTaskId = X

That will always pick up old failed runs.

It needs to become:

find native run where
  planningTaskId = X
  and attemptId = currentAttemptId

If you don’t have attemptId on native runs yet, use a short-term guard:

createdAt >= retryRequestedAt
and nativeTaskId/nativeRunId not in superseded history

3. Ignore superseded failed runs during relink

When retry starts, old failed links should be permanently ineligible for relink.

Something like:

const blockedNativeIds = new Set(
  retryHistory
    .filter(a => a.supersededAt || a.attemptId !== currentAttemptId)
    .flatMap(a => [a.nativeTaskId, a.nativeRunId].filter(Boolean))
)

Then exclude those during reconciliation.

4. Preserve explicit assignee on retry

Do not recompute assignee from project owner during reconciliation.

At retry request time, freeze it:

requestedAssigneeId =
  explicitRetryAssigneeId
  ?? planningTask.assigneeId
  ?? project.ownerId

Then launcher and reconciler should both prefer requestedAssigneeId.

5. Tests I’d add immediately

You want regression tests for:

  • failed attempt → rerun → reconcile does not relink old failed native task
  • retry creates new attemptId
  • explicit assignee survives rerun
  • legacy task without attempt ids still reconciles through the createdAt fallback
  • launch_requested after clearing ids binds only to runs created after retry request
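
To make the first of these concrete, here is a minimal sketch against an in-memory stand-in. pickRunForTask is a hypothetical placeholder for whatever the real reconciler uses to select a native run, and the NativeRun / TaskState shapes are assumptions; swap in your actual functions and test framework.

```typescript
interface NativeRun {
  nativeRunId: string
  planningTaskId: string
  attemptId?: string
  status: string
}

interface TaskState {
  id: string
  currentAttemptId: string
}

// Hypothetical stand-in for the reconciler's candidate selection:
// bind by planning task AND attempt, never by planning task alone.
function pickRunForTask(task: TaskState, runs: NativeRun[]): NativeRun | undefined {
  return runs.find(
    (run) => run.planningTaskId === task.id && run.attemptId === task.currentAttemptId
  )
}

function testRerunDoesNotRelinkOldFailedRun(): void {
  const oldFailed: NativeRun = {
    nativeRunId: 'run-old', planningTaskId: 'proc-0', attemptId: 'a1', status: 'failed',
  }
  // The task has been rerun, so its current attempt is a2.
  const task: TaskState = { id: 'proc-0', currentAttemptId: 'a2' }

  // Before the fresh run exists, reconciliation must find nothing to bind.
  if (pickRunForTask(task, [oldFailed]) !== undefined) {
    throw new Error('old failed run was relinked')
  }

  // Once the fresh run exists, it is the only valid match.
  const fresh: NativeRun = {
    nativeRunId: 'run-new', planningTaskId: 'proc-0', attemptId: 'a2', status: 'running',
  }
  const picked = pickRunForTask(task, [oldFailed, fresh])
  if (!picked || picked.nativeRunId !== 'run-new') {
    throw new Error('fresh run was not selected')
  }
}

testRerunDoesNotRelinkOldFailedRun()
```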

Minimal patch shape

If you want the smallest safe change:

  1. On retry:

    • create attemptId
    • store retryRequestedAt
    • store requestedAssigneeId
  2. On native launch:

    • stamp attemptId onto created task/run
  3. On reconciliation:

    • only relink rows with matching attemptId
    • fallback to createdAt >= retryRequestedAt
    • never reuse ids in superseded retry history

For the already-broken tasks


After patching, I’d repair them by:

  • generating a fresh attemptId
  • marking prior retry-history entries superseded
  • keeping explicit assignee
  • resetting to launch_requested
  • reconciling only against rows newer than that retry event

If you paste the actual launcher/reconciler files, I can turn this into a line-by-line diff pretty quickly.

thorny hornet
#

Helpdesk, I narrowed the Mission Control retry bug to reconciliation identity.

Problem:
Retries are effectively keyed by planningTaskId, so when a planning task is relaunched, reconciliation can reattach it to an older failed native run instead of binding to a fresh retry. This also causes retry-owner drift in some cases.

Required fix:
Make retry identity explicit. Do not use planningTaskId as the retry identity.

Add fields like:
  • currentAttemptId
  • attemptNumber
  • retryRequestedAt
  • requestedAssigneeId
  • retry history entries with attemptId, nativeTaskId, nativeRunId, supersededAt, supersededByAttemptId

On rerun:
  • generate a new attemptId
  • persist it on the planning task
  • increment attemptNumber
  • freeze requestedAssigneeId
  • mark prior attempt/history entries as superseded
  • clear active linkage only for the new attempt

On native launch:
  • stamp attemptId onto the created native task/run

On reconciliation:
  • only relink rows where planningTaskId = X and attemptId = currentAttemptId

If native runs do not carry attemptId yet, short-term fallback should require:
  • createdAt >= retryRequestedAt
  • exclude any nativeTaskId / nativeRunId present in superseded retry history

Important rule:
Superseded failed runs must be permanently ineligible for relink.

Assignee rule:
Preserve explicit retry assignee. Do not recompute from project owner during reconciliation. Launcher and reconciler should both prefer requestedAssigneeId.

Already-broken tasks to repair after patch:
  • policy-protocol-index-creation-proc-0
  • policy-protocol-index-creation-proc-1
  • policy-protocol-index-creation-proc-2
  • high-signal-governance-stability-sentinel-proc-1

Regression tests:
  • failed attempt → rerun does not relink old failed native task
  • rerun creates new attemptId
  • explicit assignee survives rerun
  • launch_requested after clearing ids binds only to runs created after retry request

junior vigilBOT
#

Yes, this is the bug. Your framing is solid.

I still can’t see the branch with that planning/procedure code from here, but if I were patching it, I’d do it in exactly three paths:

1. Rerun mutation

When a planning task is moved back to launch_requested, do not just clear ids.

Do this instead:

// Timestamps are assumed to be ISO-8601 UTC strings, so they compare lexicographically.
const now = new Date().toISOString()
const nextAttemptId = crypto.randomUUID()
const nextAttemptNumber = (task.attemptNumber ?? 0) + 1
// Freeze the assignee at request time; an explicit retry assignee wins.
const requestedAssigneeId =
  explicitRetryAssigneeId
  ?? task.requestedAssigneeId
  ?? task.assigneeId
  ?? project.ownerId

// Retire the old attempt before pointing the task at the new one.
markCurrentAttemptSuperseded(task.id, {
  supersededAt: now,
  supersededByAttemptId: nextAttemptId,
})

updatePlanningTask(task.id, {
  status: 'launch_requested',
  currentAttemptId: nextAttemptId,
  attemptNumber: nextAttemptNumber,
  retryRequestedAt: now,
  requestedAssigneeId,
  nativeTaskId: null,
  nativeRunId: null,
})

And append a retry-history row keyed by attemptId.
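
A minimal sketch of that append, assuming retry history is a flat array of rows per planning task. I don't know the real layout of planning-retry-history.json, so the RetryHistoryRow shape and the appendAttempt helper are hypothetical.

```typescript
interface RetryHistoryRow {
  attemptId: string
  attemptNumber: number
  requestedAt: string // ISO-8601 timestamp (assumed format)
  nativeTaskId: string | null
  nativeRunId: string | null
  supersededAt: string | null
  supersededByAttemptId: string | null
}

// Returns a new history array: every still-open row is marked superseded by
// the new attempt, then a fresh open row is appended. The input is not mutated.
function appendAttempt(
  history: RetryHistoryRow[],
  attemptId: string,
  attemptNumber: number,
  now: string
): RetryHistoryRow[] {
  if (history.some((row) => row.attemptId === attemptId)) {
    throw new Error(`duplicate attemptId ${attemptId}`)
  }
  const closed = history.map((row) =>
    row.supersededAt === null
      ? { ...row, supersededAt: now, supersededByAttemptId: attemptId }
      : row
  )
  const fresh: RetryHistoryRow = {
    attemptId,
    attemptNumber,
    requestedAt: now,
    nativeTaskId: null,
    nativeRunId: null,
    supersededAt: null,
    supersededByAttemptId: null,
  }
  return [...closed, fresh]
}
```

Rejecting a duplicate attemptId keeps the history usable as a lookup keyed by attempt.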


2. Native launch path

Every spawned native task/run needs the attempt stamped onto it:

createNativeRun({
  planningTaskId: task.id,
  attemptId: task.currentAttemptId,
  assigneeId: task.requestedAssigneeId,
  ...
})

If you only stamp planningTaskId, reconciliation will keep drifting.


3. Reconciliation

This is the critical guard:

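const candidates = nativeRuns.filter((run) => {
  if (run.planningTaskId !== task.id) return false
  // Stamped runs: match on attempt identity only.
  if (run.attemptId) {
    return run.attemptId === task.currentAttemptId
  }

  // legacy fallback: unstamped runs must postdate the retry request
  // (check skipped if the task was never retried) and must not be blocked
  return (
    (!task.retryRequestedAt || run.createdAt >= task.retryRequestedAt) &&
    !blockedNativeTaskIds.has(run.nativeTaskId) &&
    !blockedNativeRunIds.has(run.nativeRunId)
  )
})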

Where blocked* comes from superseded retry history, and those ids are permanently ineligible.
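
One way those two sets could be built, as a sketch. The HistoryEntry shape and the buildBlockedIds name are my assumptions; the rule itself (anything superseded, or belonging to a different attempt, is blocked) is the one described earlier in this thread.

```typescript
interface HistoryEntry {
  attemptId: string
  nativeTaskId?: string | null
  nativeRunId?: string | null
  supersededAt?: string | null
}

// Collects native ids that reconciliation must never relink.
function buildBlockedIds(history: HistoryEntry[], currentAttemptId: string) {
  const blockedNativeTaskIds = new Set<string>()
  const blockedNativeRunIds = new Set<string>()
  for (const entry of history) {
    const superseded =
      Boolean(entry.supersededAt) || entry.attemptId !== currentAttemptId
    if (!superseded) continue
    // Skip null/undefined ids so the sets only hold real native ids.
    if (entry.nativeTaskId) blockedNativeTaskIds.add(entry.nativeTaskId)
    if (entry.nativeRunId) blockedNativeRunIds.add(entry.nativeRunId)
  }
  return { blockedNativeTaskIds, blockedNativeRunIds }
}
```

Build the sets once per reconciliation pass for a task, not once per candidate run.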



Assignee rule

Also yes, reconciliation should never recalculate ownership from project owner once a retry exists.

Use:

const effectiveAssigneeId =
  task.requestedAssigneeId
  ?? task.assigneeId
  ?? project.ownerId

and keep that consistent in both launcher + reconciler.


Migration shape


I’d add nullable fields first:

  • currentAttemptId
  • attemptNumber
  • retryRequestedAt
  • requestedAssigneeId

And in retry history:

  • attemptId
  • nativeTaskId
  • nativeRunId
  • supersededAt
  • supersededByAttemptId

Legacy tasks can default to attempt 0 until first rerun.
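
As a sketch of that default applied at read time, without a data backfill: the LegacyTaskRow shape and the withAttemptDefaults helper are hypothetical, and I'm assuming rows are plain objects where the new fields may simply be missing.

```typescript
interface LegacyTaskRow {
  id: string
  assigneeId?: string | null
  currentAttemptId?: string | null
  attemptNumber?: number | null
  retryRequestedAt?: string | null
  requestedAssigneeId?: string | null
}

// Fills in the new fields for rows written before the migration.
// Existing values are preserved; only missing ones get defaults.
function withAttemptDefaults(row: LegacyTaskRow) {
  return {
    ...row,
    currentAttemptId: row.currentAttemptId ?? null,
    attemptNumber: row.attemptNumber ?? 0, // attempt 0 = never rerun
    retryRequestedAt: row.retryRequestedAt ?? null,
    requestedAssigneeId: row.requestedAssigneeId ?? null,
  }
}
```

With attempt 0 as the legacy default, the first rerun becomes attempt 1 under the (attemptNumber ?? 0) + 1 rule from the rerun mutation.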


Repairing the already-broken tasks

For each affected task:

  1. generate a fresh attemptId
  2. freeze requestedAssigneeId
  3. mark prior history entries superseded
  4. clear active linkage
  5. set launch_requested
  6. reconcile only against runs newer than retryRequestedAt or matching attemptId
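
Steps 1 to 5 could be computed as a pure "repair plan" per task before anything is written, which makes the repair easy to review as a dry run. This is only a sketch: BrokenTask, RepairPatch, and planRepair are hypothetical names, and the attempt id and timestamp are passed in so the caller decides how to generate them.

```typescript
interface BrokenTask {
  id: string
  assigneeId: string | null
  requestedAssigneeId: string | null
  attemptNumber: number | null
  nativeTaskId: string | null
  nativeRunId: string | null
}

interface RepairPatch {
  status: 'launch_requested'
  currentAttemptId: string
  attemptNumber: number
  retryRequestedAt: string
  requestedAssigneeId: string | null
  nativeTaskId: null
  nativeRunId: null
  supersededNativeIds: string[] // ids to mark superseded in retry history (step 3)
}

function planRepair(
  task: BrokenTask,
  newAttemptId: string, // step 1: generated by the caller
  now: string,
  projectOwnerId: string | null
): RepairPatch {
  // Step 3: remember the stale linkage so history can block it permanently.
  const supersededNativeIds: string[] = []
  if (task.nativeTaskId) supersededNativeIds.push(task.nativeTaskId)
  if (task.nativeRunId) supersededNativeIds.push(task.nativeRunId)
  return {
    status: 'launch_requested', // step 5
    currentAttemptId: newAttemptId,
    attemptNumber: (task.attemptNumber ?? 0) + 1,
    retryRequestedAt: now,
    // Step 2: freeze the assignee; the project owner is only a last resort.
    requestedAssigneeId:
      task.requestedAssigneeId ?? task.assigneeId ?? projectOwnerId,
    nativeTaskId: null, // step 4
    nativeRunId: null,
    supersededNativeIds,
  }
}
```

Step 6 then falls out of the reconciliation guard: only runs stamped with the new attemptId, or created after retryRequestedAt and not blocked, can bind.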

Regression tests

The four you listed are exactly the right ones. I’d also add: