#General Inquiry - Eval Methods


prisma apex
#

Hi all! I'm curious about the evaluation processes OpenAI uses when training its latest models. Do you typically rely on human evaluations, automated benchmarks, or a mix of both? Are there any resources or papers I could look into to better understand the eval process?

pale frigate BOT
#

Hi @prisma apex. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:


OpenAI uses a mix of human evaluation and automated methods, and strongly recommends that developers do the same for their own applications.

At a high level, OpenAI describes evals as structured tests to measure model performance and reliability, and emphasizes that they should combine numerical metrics with human judgment rather than relying on a single score. [Eval best practices]

Types of evaluation methods

From the available resources, OpenAI highlights three main categories:

  1. Metric-based / automated evals
    • Examples: exact or string match, ROUGE, BLEU, BERTScore, function-call accuracy, executable checks (e.g., does generated SQL run). [Eval best practices; Summarization eval]
    • These are useful for regression testing and large-scale comparisons, but can miss nuance and may not correlate perfectly with human judgments, especially for open‑ended tasks. [Summarization eval]
  2. Human evaluations
    • Human annotators (domain experts, crowd workers, or end‑users) grade outputs, rank alternatives, or give ratings (e.g., thumbs up/down, 1–5 stars). [Eval best practices]
    • Pros: highest reliability and best at capturing real‑world utility and subtle quality differences. [Temporal retrieval eval]
    • Cons: expensive, slower, and subject to disagreement; OpenAI recommends multiple reviewers, clear rubrics, and pass/fail thresholds. [Eval best practices]
  3. LLM‑as‑a‑judge / model‑graded evals
    • A stronger model (e.g., GPT‑4.1 or o‑series) is used to grade outputs: pairwise comparisons, scoring a single answer, or grading against a reference. [Eval best practices; OpenAI eval templates]
    • These methods can reach human‑level agreement (around 80%) when carefully designed, and are much cheaper and more scalable than pure human evals. [Eval best practices]
    • OpenAI also uses this pattern in examples like summarization (G‑Eval) and agent evaluation. [Summarization eval; Self‑evolving agents]
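
To make the categories above concrete, here is a minimal Python sketch of two automated metrics plus a check of how often an LLM judge agrees with human labels. It is not OpenAI's implementation: the function names (`exact_match`, `unigram_f1`, `agreement_rate`) are hypothetical, and `unigram_f1` is only a crude token-overlap stand-in for metrics such as ROUGE, not ROUGE itself.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> bool:
    """True when prediction and reference match after normalization."""
    return normalize(prediction) == normalize(reference)


def unigram_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between prediction and reference.

    Assumption: a simplified stand-in for overlap metrics like ROUGE-1.
    """
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def agreement_rate(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of items where the LLM judge and the human gave the same label."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```

A typical use would be to have humans label a small sample, have the judge model label the same sample, and only trust the judge at scale if `agreement_rate` is acceptably high for your task.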

Overall evaluation workflow

The documents describe a general workflow that OpenAI recommends (and appears to follow internally):

  • Design task‑specific evals: define clear objectives, collect representative datasets (including edge and adversarial cases), and choose metrics aligned with those objectives. [Eval best practices]
  • Combine methods: use automated metrics for scale, LLM‑as‑judge for nuanced but scalable grading, and human evals to calibrate and validate automated methods. [Eval best practices]
  • Continuous evaluation / flywheel: run evals on every change, log failures, and iteratively refine prompts, system design, and (when applicable) fine‑tuning data. [Eval best practices; Model optimization; Resilient prompts flywheel]
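
The workflow above can be sketched as a small eval loop in Python. This is an illustrative assumption, not an OpenAI API: `EvalCase`, `EvalReport`, `run_eval`, and `toy_model` are hypothetical names, and `toy_model` is a stand-in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    expected: str


@dataclass
class EvalReport:
    pass_rate: float
    failures: list[tuple[EvalCase, str]]  # (case, actual output) for later review

    def passed(self, threshold: float) -> bool:
        """Gate a change: True only if the pass rate meets the threshold."""
        return self.pass_rate >= threshold


def run_eval(
    model: Callable[[str], str],
    cases: list[EvalCase],
    grader: Callable[[str, str], bool],
) -> EvalReport:
    """Run every case through the model, grade it, and log the failures."""
    if not cases:
        raise ValueError("eval dataset must not be empty")
    failures = []
    for case in cases:
        output = model(case.prompt)
        if not grader(output, case.expected):
            failures.append((case, output))
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return EvalReport(pass_rate, failures)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


def case_insensitive_match(output: str, expected: str) -> bool:
    return output.strip().lower() == expected.strip().lower()


cases = [
    EvalCase("2+2", "4"),
    EvalCase("capital of France", "paris"),
    EvalCase("capital of Peru", "Lima"),  # edge case the toy model does not cover
]
report = run_eval(toy_model, cases, case_insensitive_match)
```

Rerunning `run_eval` on every prompt or model change and reviewing `report.failures` is one simple way to get the "flywheel" behavior; the grader could be swapped for an automated metric or an LLM judge without changing the loop.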

The sources provided do not give a full, model‑by‑model internal evaluation protocol for “latest models,” but they consistently describe this mixed strategy of human evals, automated metrics, and LLM‑as‑judge, tied together in a continuous eval loop.

Resources to learn more

  • Evaluation best practices (OpenAI docs) – high‑level guidance and examples for summarization and Q&A eval design. [Eval best practices]
  • OpenAI Evals guide & templates – how to build and run evals, including model‑graded templates. [Getting started with evals]
  • Summarization evaluation cookbook (G‑Eval) – concrete example using ROUGE, BERTScore, and GPT‑4 as a judge. [Summarization eval]
  • Model optimization guide – shows how evals, prompt engineering, and fine‑tuning form a feedback loop. [Model optimization]

Note: you can ask me a follow-up question by @ mentioning me again :speech_balloon:


@prisma apex Kapa.ai is still learning and improving. Please let me know how I did by reacting below.

prisma apex
#

Does OpenAI work with outside contractors for evals? If yes, how do we reach out to them?