The LLM community is obsessed with benchmarking model performance. Mistral released their new “flagship” model this week, and immediately focused the discussion on how it performs on “commonly used benchmarks” relative to other models:
The entire blog post (I’d recommend reading it) is essentially a read-through of how this model performs relative to other models on benchmarks, from math and coding to multilingual capabilities.
It’s not just Mistral that anchors the value of their offering in benchmark accuracy. HuggingFace’s popular OpenLLM leaderboard compares open source LLMs along similar lines. The entire marketing push from Anthropic for their new model, Claude 3, is about benchmarks:
This tendency to fixate on benchmarks is understandable – right now, it’s basically the only semi-objective way to measure how these models stack up against each other. It’s something vendors in other spaces, like data streaming, do too. But it is dangerous because it misses the point of where this whole AI thing is going, and is a textbook product marketing anti-pattern.
In a trend that we’ve seen hundreds of times in developer tooling, the underlying LLM is not going to matter within a few years. Large Language Model performance is already highly commoditized, and will continue to head in that direction. All that will matter is the experience that you build on top of these models, and what that enables for your customers.
A lot of ChatGPT is about the Chat, not the GPT
I’ve used all of the major LLMs, including any interfaces that ship with them (e.g. Mistral’s new “Le Chat”), and ChatGPT is far and away the superior experience, save for parts of Gemini. Why?
Let’s take a look at the ChatGPT interface. Here’s a common prompt I’ve been using for testing, asking the model to summarize the contents of an external link into a tweet thread. As an unrelated aside, the responses to this prompt are virtually identical across every major LLM.
Which parts of this interface are the underlying model – GPT-4 in this case – and which are an experience built by OpenAI on top of the underlying model?
The text response, minus any formatting, is what the model generated. But the:
Ability of the model to access and scrape content from a web page
Context of the prompt, including the system prompt that frames the model as a helpful assistant
Formatting the response, like changing the numbers to gray
UI for typing the prompt
Filepicker for attaching media to the prompt
Prompt history
Model switcher / picker (this one is meta)
Ability to persist and share the model responses
…and more not shown here
are all not GPT-4 – they’re features built by OpenAI on top of GPT-4 to create an experience that is helpful and worth paying for. Some of these are harder to build than others – OpenAI’s secret sauce obviously isn’t the little arrow that scrolls down to the bottom of the response. ChatGPT would be nothing without GPT-4 – but the reverse may also be true!
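To make the split concrete, here’s roughly what “just the model” looks like – a minimal sketch using the OpenAI Python SDK, assuming an API key in the environment; the model name and prompt are placeholders. The only thing you get back is text; everything else in the screenshot above is application code.

```python
# A minimal sketch of "just the model": call GPT-4 directly through the API.
# The scraping, formatting, file uploads, history, and sharing in the
# ChatGPT screenshot are all product code layered on top of this call.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4",  # placeholder; any chat-capable model works
    messages=[
        # The "helpful assistant" framing is part of the experience, not the
        # model -- it's just text the application prepends to every request.
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize this article as a tweet thread: <pasted article text>"},
    ],
)

# All the model hands back is a string -- no UI, no scraping, no persistence.
print(response.choices[0].message.content)
```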
The retort to this line of reasoning is that these chat interfaces are primarily for non-technical users, while the real money for these model providers comes from developer use cases, building LLMs into user-facing applications. I’ve worked closely with one of the major model compute providers, so this is not foreign to me. But experience matters to developers too!
OpenAI has dedicated significant resources to building a seamless developer experience beyond “docs for the model.” Here’s their playground for prompting GPT models – you can adjust parameters like temperature and penalties, and swap the system prompt for whatever style you want.
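Those playground knobs map directly onto API parameters. As a rough sketch (the values below are illustrative, not recommendations):

```python
from openai import OpenAI

client = OpenAI()

# The playground's sliders correspond to these API parameters; the values
# here are illustrative, not recommendations.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a terse technical copy editor."},  # any style you like
        {"role": "user", "content": "Rewrite this changelog entry for clarity: ..."},
    ],
    temperature=0.2,        # lower = more deterministic sampling
    top_p=1.0,              # nucleus sampling cutoff
    frequency_penalty=0.5,  # penalize repeating the same tokens
    presence_penalty=0.0,   # penalize tokens that have already appeared (when raised)
    max_tokens=300,         # cap the length of the response
)
print(response.choices[0].message.content)
```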
There are similarly dedicated experiences for fine tuning models to your data:
Plus handling API keys, storage, etc. etc.
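Fine tuning has a similar API surface underneath the UI. Here’s a hedged sketch of kicking off a job with the OpenAI SDK – the file path and base model are placeholders, and the training file is expected to be chat-formatted JSONL:

```python
from openai import OpenAI

client = OpenAI()

# Upload a JSONL file of chat-formatted training examples.
# "training_data.jsonl" is a placeholder path.
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Kick off a fine-tuning job against a base model that supports it.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",  # placeholder base model
)
print(job.id, job.status)
```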
The point of all of this isn’t to say that OpenAI is so awesome (although that’s a reasonable conclusion) – it’s that the framing of this open source vs. closed source conversation is missing the point.
This distinction has major implications for open source
For a closed source model provider like OpenAI, the difference between what is model and what is experience is academic – you’re paying for both. They are one thing. But where this really matters is in open source. Does the convergence of open source performance to closed source performance really matter if the experience of using that open source model is bad?
This is why Mistral’s launch of “Le Chat” – their chat interface for the aforementioned Mistral Large model – missed the mark for me. At first glance, the UI looks pretty similar to ChatGPT, or maybe phrased better, the standard chat interface for LLMs.
But here’s the issue: Mistral doesn’t have the capability to scrape the provided link in the prompt. This is an extra-model experience that they haven’t built yet. But instead of telling me that – as ChatGPT used to before this functionality existed – the model hallucinated, making up a summary from whatever information about microservices was in its training set.
This is an innocent enough mistake in this low stakes context, but it easily could have been much worse.
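That missing capability is a good illustration of what “experience on top of the model” means in practice: the application has to fetch the page itself and hand the text to the model. A rough sketch – any chat API would do, and the OpenAI SDK, libraries, and URL here are purely illustrative:

```python
# A rough sketch of the "extra-model" work the application has to do:
# fetch the linked page, strip it down to text, then ask the model to
# summarize. The model itself never touches the network.
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()

def summarize_link(url: str) -> str:
    html = requests.get(url, timeout=10).text
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

    response = client.chat.completions.create(
        model="gpt-4",  # placeholder; swap in whichever model you're building on
        messages=[
            {"role": "system", "content": "Summarize articles as short tweet threads."},
            # Naive truncation so the page fits in the context window.
            {"role": "user", "content": f"Summarize this article:\n\n{text[:8000]}"},
        ],
    )
    return response.choices[0].message.content

print(summarize_link("https://example.com/microservices-post"))  # hypothetical URL
```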
Ironically enough (depending on how you look at it), the response is pretty similar to the ChatGPT response, even though Mistral Large hallucinated it. Which brings us to the important question for open source: underlying models are getting closer and closer to each other. So what will differentiate these companies from one another?
The open source discussion has been too anchored on reaching performance parity with OpenAI models. That is a small piece of the puzzle. For developers looking to build applications with these open source models – and especially for the pro-sumer chat use case – what matters is the holistic experience that model providers offer. Integrating LLMs into your app is almost never going to be the “drop-in” experience you see on marketing sites – and my concern is that the “open source is approaching parity with OpenAI!” narrative is not actually true in a meaningful way.
Folks working in AI can look to previous examples of this phenomenon in developer tools for guidance: a couple of years ago, I wrote about how the underlying performance of production relational databases is becoming commoditized, with vendors focusing much more on developer experience. It’s going to happen here too; the question is just when.
As an aside, I’m a big fan of LMSYS’s Chatbot Arena. It tries to evaluate LLMs not with static benchmarks that have a dubious relationship to real-world performance, but by soliciting feedback from real users about prompt responses. The original paper also explores using LLMs to evaluate LLMs, and it’s worth a read.
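For intuition, Arena-style rankings boil down to aggregating pairwise votes into ratings. The sketch below is generic online Elo over made-up battles, not LMSYS’s exact methodology:

```python
# A simplified sketch of turning pairwise human votes into ratings,
# Chatbot Arena-style. Generic online Elo; the battle data is made up.
from collections import defaultdict

K = 32           # update step size
BASE = 1000.0    # starting rating for every model

ratings = defaultdict(lambda: BASE)

def record_battle(model_a: str, model_b: str, winner: str) -> None:
    ra, rb = ratings[model_a], ratings[model_b]
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    # 1 for a win, 0 for a loss, 0.5 for a tie.
    score_a = 1.0 if winner == model_a else (0.0 if winner == model_b else 0.5)
    ratings[model_a] = ra + K * (score_a - expected_a)
    ratings[model_b] = rb + K * ((1.0 - score_a) - (1.0 - expected_a))

# Hypothetical votes: users pick whichever response they preferred.
record_battle("model-x", "model-y", winner="model-x")
record_battle("model-x", "model-z", winner="model-z")
print(dict(ratings))
```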
Not my insight, but sharing it here as it’s relevant. Buying proprietary content for a model is another way to differentiate, assuming you pay for exclusivity. This is what Gemini did with Reddit.