#LLM response quality
Hi @cedar depot. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:
There are several libraries and tools that can be used to evaluate the quality of responses generated by Large Language Models (LLMs). Here are a few:
- Arthur Shield (https://www.arthur.ai/get-started): A paid product for detecting toxicity, hallucination, prompt injection, etc.
- Baserun (https://baserun.ai/): A paid product for testing, debugging, and monitoring LLM-based apps.
- Guardrails.ai (https://shreyar.github.io/guardrails/): A Python library for validating outputs and retrying failures. It's still in alpha, so expect sharp edges and bugs.
- OpenAI Evals (https://github.com/openai/evals): An open-source library for evaluating task performance of language models and prompts.
- HoneyHive (https://honeyhive.ai): An enterprise platform to evaluate, debug, and monitor LLM apps.
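To make "validating outputs and retrying failures" concrete, here is a minimal plain-Python sketch of that pattern. It does not use the Guardrails API or any other library from the list; `generate_validated`, `generate`, and `required_keys` are hypothetical names chosen for illustration, and `generate` stands in for whatever function calls your model.

```python
import json
from typing import Callable


def generate_validated(
    generate: Callable[[str], str],  # hypothetical: takes a prompt, returns the model's text
    prompt: str,
    required_keys: set[str],
    max_attempts: int = 3,
) -> dict:
    """Call `generate` until the reply is a JSON object containing all required keys."""
    current_prompt = prompt
    last_error = "no attempts made"
    for _ in range(max_attempts):
        raw = generate(current_prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON ({exc.msg})"
        else:
            if isinstance(parsed, dict) and required_keys.issubset(parsed):
                return parsed
            last_error = "reply is not a JSON object with the required keys"
        # Tell the model what went wrong so the next attempt can correct it.
        current_prompt = (
            f"{prompt}\n\nYour previous reply was rejected: {last_error}. "
            "Reply with valid JSON only."
        )
    raise ValueError(f"no valid reply after {max_attempts} attempts: {last_error}")
```

Passing the model call in as a function keeps the retry logic testable without network access; a real setup would likely validate against a fuller schema than a set of key names.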
In addition to these libraries, the OpenAI Cookbook demonstrates G-Eval (https://arxiv.org/pdf/2303.16634.pdf), a method that uses Large Language Models (LLMs) as a reference-free metric for assessing abstractive summaries. In the Cookbook example, gpt-4 scores candidate outputs, on the premise that gpt-4 has learned an internal model of language quality good enough to tell fluent, coherent text from low-quality text.
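As a rough illustration of reference-free scoring with a model as the judge, here is a minimal Python sketch. It is a heavily reduced version of the idea, not the prompt or scoring procedure from the G-Eval paper or the Cookbook. The prompt wording, the 1 to 5 scale, and the names `JUDGE_PROMPT`, `parse_score`, `judge_summary`, and `ask_model` are all assumptions made for this example; `ask_model` stands in for whatever function sends a prompt to gpt-4 and returns its text reply.

```python
import re
from typing import Callable

# Hypothetical prompt template; the wording is illustrative only.
JUDGE_PROMPT = """You will be given a source document and a summary of it.
Rate the summary's {criterion} on a scale of 1 to 5.
Reply with the integer score only.

Source:
{source}

Summary:
{summary}
"""


def parse_score(reply: str, low: int = 1, high: int = 5) -> int:
    """Extract the first integer from the judge's reply and check its range."""
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"no score found in reply: {reply!r}")
    score = int(match.group())
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside {low}..{high}")
    return score


def judge_summary(
    ask_model: Callable[[str], str],  # hypothetical: sends a prompt, returns the reply text
    source: str,
    summary: str,
    criterion: str = "coherence",
) -> int:
    """Ask the judge model to rate one summary on one criterion."""
    prompt = JUDGE_PROMPT.format(criterion=criterion, source=source, summary=summary)
    return parse_score(ask_model(prompt))
```

No reference summary is needed: the judge sees only the source and the candidate. In practice you would likely score each criterion separately and average over several summaries before drawing conclusions.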
For more information, you can refer to the OpenAI Cookbook (https://cookbook.openai.com) and the OpenAI Platform Documentation (https://platform.openai.com/docs/guides/prompt-engineering).
Sources:
- Prompting libraries & tools (https://cookbook.openai.com/articles/related_resources#prompting-libraries--tools-in-alphabetical-order)
- How to evaluate a summarization task (https://cookbook.openai.com/examples/evaluation/how_to_eval_abstractive_summarization)