#General Inquiry - Eval Methods
Hi @prisma apex. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:
OpenAI uses a mix of human evaluation and automated methods, and strongly recommends that developers do the same for their own applications.
At a high level, OpenAI describes evals as structured tests to measure model performance and reliability, and emphasizes that they should combine numerical metrics with human judgment rather than relying on a single score. [Eval best practices]
Types of evaluation methods
From the available resources, OpenAI highlights three main categories:
- Metric-based / automated evals
  - Examples: exact or string match, ROUGE, BLEU, BERTScore, function-call accuracy, executable checks (e.g., does generated SQL run). [Eval best practices; Summarization eval]
  - These are useful for regression testing and large-scale comparisons, but can miss nuance and may not correlate perfectly with human judgments, especially for open‑ended tasks. [Summarization eval]
- Human evaluations
  - Human annotators (domain experts, crowd workers, or end‑users) grade outputs, rank alternatives, or give ratings (e.g., thumbs up/down, 1–5 stars). [Eval best practices]
  - Pros: highest reliability and best at capturing real‑world utility and subtle quality differences. [Temporal retrieval eval]
  - Cons: expensive, slower, and subject to disagreement; OpenAI recommends multiple reviewers, clear rubrics, and pass/fail thresholds. [Eval best practices]
- LLM‑as‑a‑judge / model‑graded evals
  - A stronger model (e.g., GPT‑4.1 or o‑series) is used to grade outputs: pairwise comparisons, scoring a single answer, or grading against a reference. [Eval best practices; OpenAI eval templates]
  - When carefully designed, these methods can agree with human raters about as often as human raters agree with each other (around 80%), and they are much cheaper and more scalable than pure human evals. [Eval best practices]
  - OpenAI also uses this pattern in examples like summarization (G‑Eval) and agent evaluation. [Summarization eval; Self‑evolving agents]
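To make the first and third categories more concrete, here is a minimal Python sketch of an exact-match metric and a model-graded pass/fail check. It is an illustration under stated assumptions, not OpenAI's implementation: the names `exact_match_accuracy`, `model_graded_pass`, and `stub_judge` are hypothetical, the grading prompt is made up for this example, and the judge is passed in as a plain callable so the code runs offline. In practice that callable would wrap a call to a stronger model.

```python
from typing import Callable, Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise is not counted as an error."""
    return " ".join(text.lower().split())


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one example")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical judge interface: takes a grading prompt, returns the judge's raw text reply.
Judge = Callable[[str], str]


def model_graded_pass(question: str, candidate: str, reference: str, judge: Judge) -> bool:
    """Ask a judge to grade one candidate answer against a reference (PASS or FAIL)."""
    prompt = (
        "You are grading an answer. Reply with PASS or FAIL only.\n"
        f"Question: {question}\n"
        f"Reference: {reference}\n"
        f"Candidate: {candidate}\n"
    )
    return judge(prompt).strip().upper().startswith("PASS")


def stub_judge(prompt: str) -> str:
    """Offline stand-in for a real judge model (assumption for this sketch):
    passes when the reference text appears inside the candidate answer."""
    fields = dict(line.split(": ", 1) for line in prompt.splitlines() if ": " in line)
    found = normalize(fields["Reference"]) in normalize(fields["Candidate"])
    return "PASS" if found else "FAIL"


if __name__ == "__main__":
    print(exact_match_accuracy(["Paris", " paris ", "Lyon"], ["Paris", "Paris", "Paris"]))
    print(model_graded_pass("What is the capital of France?", "It is Paris.", "Paris", stub_judge))
```

Note how the two checks disagree on an answer like "It is Paris.": exact match fails it, while the reference-based judge passes it. That gap is the reason the sources suggest calibrating automated metrics against human judgment.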
Overall evaluation workflow
The documents describe a general workflow that OpenAI recommends (and appears to follow internally):
- Design task‑specific evals: define clear objectives, collect representative datasets (including edge and adversarial cases), and choose metrics aligned with those objectives. [Eval best practices]
- Combine methods: use automated metrics for scale, LLM‑as‑judge for nuanced but scalable grading, and human evals to calibrate and validate automated methods. [Eval best practices]
- Continuous evaluation / flywheel: run evals on every change, log failures, and iteratively refine prompts, system design, and (when applicable) fine‑tuning data. [Eval best practices; Model optimization; Resilient prompts flywheel]
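The "run evals on every change, log failures" step could look roughly like the Python sketch below. All names here (`EvalCase`, `EvalReport`, `run_eval`, `passes_gate`) are hypothetical and chosen for illustration; the sources do not prescribe this structure. The system under test and the grader are plain callables, so the grader could be an exact-match check, a model-graded check, or a lookup of human labels.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EvalCase:
    prompt: str
    expected: str


@dataclass
class EvalReport:
    pass_rate: float
    failures: list  # (case, actual_output) pairs kept for later inspection


def run_eval(
    system: Callable[[str], str],
    cases: Sequence[EvalCase],
    grader: Callable[[str, str], bool],
) -> EvalReport:
    """Run every case through the system, grade it, and collect the failures."""
    if not cases:
        raise ValueError("need at least one eval case")
    failures = []
    for case in cases:
        output = system(case.prompt)
        if not grader(output, case.expected):
            failures.append((case, output))
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return EvalReport(pass_rate=pass_rate, failures=failures)


def passes_gate(report: EvalReport, baseline_pass_rate: float, tolerance: float = 0.0) -> bool:
    """True if the new pass rate has not dropped below the baseline by more than the tolerance."""
    return report.pass_rate >= baseline_pass_rate - tolerance


if __name__ == "__main__":
    # Toy system under test: uppercases its input.
    cases = [EvalCase("a", "A"), EvalCase("b", "B"), EvalCase("c", "x")]
    report = run_eval(str.upper, cases, lambda out, exp: out == exp)
    for case, output in report.failures:
        print(f"FAIL prompt={case.prompt!r} expected={case.expected!r} got={output!r}")
    print("gate passed:", passes_gate(report, baseline_pass_rate=0.5))
```

The logged failures are what feed the flywheel: they can be reviewed by a human, turned into new eval cases, or used to refine prompts and fine-tuning data before the next run.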
The sources provided do not give a full, model‑by‑model internal evaluation protocol for “latest models,” but they consistently describe this mixed strategy of human evals, automated metrics, and LLM‑as‑judge, tied together in a continuous eval loop.
Resources to learn more
- Evaluation best practices (OpenAI docs) – high‑level guidance and examples for summarization and Q&A eval design. [Eval best practices]
- OpenAI Evals guide & templates – how to build and run evals, including model‑graded templates. [Getting started with evals]
- Summarization evaluation cookbook (G‑Eval) – concrete example using ROUGE, BERTScore, and GPT‑4 as a judge. [Summarization eval]
- Model optimization guide – shows how evals, prompt engineering, and fine‑tuning form a feedback loop. [Model optimization]
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
@prisma apex Kapa.ai is still learning and improving. Please let me know how I did by reacting below.
Does OpenAI work with outside contractors for evals? If yes, how do we reach out to them?