# 🧵 I did a quick comparison of a couple

hoary egret · 2023-12-20T13:36:51.060Z

LLM Model Showdown: Finding the Best AI for Your Needs

When choosing a language model for your project, the most popular or most advanced option is not automatically the right one; what matters is what fits your specific needs. To help you make an informed decision, we compared six leading language models on a simple but telling task: summarizing a pull request code diff for non-technical users.

The Setup

Using hunch.tools, we presented each model with the same input, a pull request code diff, and the following prompt:

Please summarize in one sentence how the latest software update (as detailed in the attached pull request code diff) will impact the day-to-day experience of our non-technical users, and provide this explanation for inclusion in our upcoming release notes.

Make a number of attempts, headed 'Attempt n'. Follow the attempt with a critique 'Critique n' that will be used as the basis to improve the next attempt.

Keep going until the explanation is excellent.
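The setup above amounts to sending one prompt to several models and recording what comes back and how long it took. A minimal sketch of such a harness follows, assuming each model is wrapped in a plain Python callable that takes a prompt and returns text. `compare_models`, `RunResult`, and the stub models are hypothetical names for illustration; they are not part of hunch.tools or any vendor SDK.

```python
import time
from typing import Callable, Dict, List, NamedTuple


class RunResult(NamedTuple):
    model: str
    output: str
    seconds: float


def compare_models(
    prompt: str, models: Dict[str, Callable[[str], str]]
) -> List[RunResult]:
    """Send the same prompt to every model and time each call."""
    results = []
    for name, call_model in models.items():
        start = time.perf_counter()
        output = call_model(prompt)
        elapsed = time.perf_counter() - start
        results.append(RunResult(name, output, elapsed))
    # Return the fastest run first.
    return sorted(results, key=lambda r: r.seconds)


if __name__ == "__main__":
    # Hypothetical stand-ins; a real run would wrap each provider's client here.
    stub_models = {
        "stub-model-a": lambda prompt: "Attempt 1: ...",
        "stub-model-b": lambda prompt: "Attempt 1: ...",
    }
    for run in compare_models("Summarize this diff ...", stub_models):
        print(f"{run.model}: {run.seconds:.6f}s")
```

Keeping the models behind a simple callable means the timing logic stays the same whichever provider is plugged in.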

The Outputs

Each model produced a unique response, aiming to distill the technical changes into a user-friendly summary:

OpenAI GPT-4

In our latest update, we've added a quality check for all new content. This means you'll only see features and designs that meet our top standards, ensuring a smoother user experience.

OpenAI GPT-4 Turbo

Now, when you explore new tools, you can be confident that every option you see has been thoroughly reviewed and approved for quality and safety, simplifying your choice and saving you time.

OpenAI GPT-3.5 Turbo

The latest software update introduces a new feature that allows administrators to approve or disapprove tools in the system. This means that non-technical users will have a streamlined experience, as they will only see and be able to use tools that have been approved by administrators.

Anthropic Claude 2.1

The latest update introduces curation of publicly listed tools and templates to improve content quality. Users will now only see approved, higher quality items in public listings.

Anthropic Claude Instant 1

This update allows admins to ensure only the best tools are available to users, so they can focus on the most relevant and helpful options for their work.

Google Gemini Pro

With the latest update, non-technical users can now easily find and use the best tools for their projects, as our team of experts has reviewed and approved them.

Execution Time and Cost

The comparison also took into account the execution time and cost for each model, with GPT-4 Turbo set as the baseline:

| Model                       | Duration (hh:mm:ss) | Relative Duration (GPT-4 Turbo = 100) | Cost       | Relative Cost (GPT-4 Turbo = 100) |
|-----------------------------|---------------------|---------------------------------------|------------|-----------------------------------|
| GOOGLE_GEMINI_PRO           | 00:00:10.432363     | 43.87                                 | 0.00441450 | 5.57                              |
| ANTHROPIC_CLAUDE_INSTANT_1  | 00:00:04.59422      | 19.32                                 | 0.00423200 | 5.34                              |
| GPT 3.5                     | 00:00:11.865002     | 49.90                                 | 0.00746200 | 9.42                              |
| ANTHROPIC_CLAUDE_2          | 00:00:30.279665     | 127.34                                | 0.04071200 | 51.40                             |
| GPT 4 TURBO                 | 00:00:23.778624     | 100.00                                | 0.07921000 | 100.00                            |
| GPT 4                       | 00:00:09.100596     | 38.27                                 | 0.20385000 | 257.35                            |
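The relative columns appear to be each model's value divided by the GPT-4 Turbo value, multiplied by 100. The sketch below reproduces them from the raw durations and costs in the table; `parse_duration` and `relative` are hypothetical helper names, and the rounding to two decimals is an assumption that matches the figures shown.

```python
def parse_duration(text: str) -> float:
    """Convert an 'HH:MM:SS.ffffff' string to seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def relative(value: float, baseline: float) -> float:
    """Express value as a percentage of baseline, rounded to two decimals."""
    return round(100 * value / baseline, 2)


# Raw duration and cost per model, copied from the table.
rows = {
    "GOOGLE_GEMINI_PRO": ("00:00:10.432363", 0.00441450),
    "ANTHROPIC_CLAUDE_INSTANT_1": ("00:00:04.59422", 0.00423200),
    "GPT 3.5": ("00:00:11.865002", 0.00746200),
    "ANTHROPIC_CLAUDE_2": ("00:00:30.279665", 0.04071200),
    "GPT 4 TURBO": ("00:00:23.778624", 0.07921000),
    "GPT 4": ("00:00:09.100596", 0.20385000),
}

baseline_duration = parse_duration(rows["GPT 4 TURBO"][0])
baseline_cost = rows["GPT 4 TURBO"][1]

if __name__ == "__main__":
    for model, (duration, cost) in rows.items():
        rel_duration = relative(parse_duration(duration), baseline_duration)
        rel_cost = relative(cost, baseline_cost)
        print(f"{model:<28} {rel_duration:>8.2f} {rel_cost:>8.2f}")
```

Read this way, a relative duration of 19.32 means the call took about a fifth of the time GPT-4 Turbo needed.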

Conclusion

While OpenAI's GPT-4 Turbo is a common choice, it is not the best option for every task. In our experiment, Anthropic's Claude Instant 1 produced what we judged to be the most suitable output, and it was also the fastest and cheapest model tested: roughly 19% of GPT-4 Turbo's execution time and about 5% of its cost. This exercise underscores the importance of considering several models rather than defaulting to the most familiar one. Your project's needs should dictate the model you choose, balancing quality, speed, and cost.

To evaluate multiple LLMs using your own prompt, head to hunch.tools