Comparing available LLMs for non-technical users
How do ChatGPT, Mistral, Gemini, and Llama3 stack up for common tasks like generating sales emails?
If you’re using Large Language Models at work to help automate your job, or even just for personal use, you’ve now got the choice of several different models instead of just ChatGPT. Between Gemini (formerly Bard), Mixtral, Llama3, and ol’ reliable (ChatGPT), which is the best for the kinds of tasks you need it to do?
A few weeks ago, I argued that benchmarks – the standard way companies measure model performance – are bad because they’re completely disconnected from real-world use cases. Who cares if a model can pass the bar exam if it can’t generate a half-decent tweet? Instead, the focus of this post is looking at how models perform on real-world tasks that functional teams like marketing, product, and operations would actually use them for.
The TL;DR, for impatient readers:
Most models are roughly at parity with each other for common chat-oriented tasks
ChatGPT performed significantly worse than I thought it would
Gemini, to the surprise of everyone, was the best performing model by a decent margin
Overall, model responses were usable, but would need a lot of cleanup and work to use practically
The ringer we shall put these models through
I designed three use cases to test each model against, each meant to mimic a real-world task you might have an LLM do in the course of your job. They’re all centered around generating text, even though some of these models are multimodal (can do images as well).