▷ This week in the stack
The LLM API Landscape Is Finally Interesting Again
For a while, the answer to "which LLM API should I build on?" was basically "GPT-4, obviously," and we all moved on. That's no longer true, and if you're still defaulting to OpenAI out of habit, you may be leaving real performance and cost gains on the table.
I spent a few weeks running the same eval suite across Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro, and a handful of open-source models for tasks that actually come up in production: structured JSON extraction, multi-turn reasoning, RAG summarization, and long-context document parsing.
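For the structured JSON extraction task, the core of the harness is simple: feed each model the same prompts, try to parse the raw output as JSON, and count exact matches. A minimal sketch of that scoring loop is below; `model_fn` stands in for whatever per-provider adapter you write (a hypothetical name, not any SDK's API), and the stub model lets it run without an API key.

```python
import json

def score_json_extraction(model_fn, cases):
    """Score a model on structured JSON extraction.

    model_fn: callable taking a prompt string and returning the model's
    raw text output (a per-provider adapter you supply).
    cases: list of (prompt, expected_dict) pairs.
    Returns the fraction of cases where the output both parses as JSON
    and matches the expected fields exactly.
    """
    passed = 0
    for prompt, expected in cases:
        raw = model_fn(prompt)
        try:
            got = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON counts as a failure
        if got == expected:
            passed += 1
    return passed / len(cases)

# Stub "model" so the harness runs offline; a real adapter would call
# the provider's chat/completions endpoint instead.
def stub_model(prompt):
    return '{"name": "Ada", "year": 1815}'

cases = [
    ('Extract name/year: "Ada, born 1815"', {"name": "Ada", "year": 1815}),
    ('Extract name/year: "Alan, born 1912"', {"name": "Alan", "year": 1912}),
]
print(score_json_extraction(stub_model, cases))  # stub passes 1 of 2
```

Exact-match scoring is deliberately strict: it penalizes models that wrap JSON in prose or code fences, which is exactly the fidelity difference the benchmark is trying to surface.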
Here's what surprised me:
| Model | Verdict |
| --- | --- |
| Claude 3.5 Sonnet | Won on instruction-following + structured output fidelity |
| GPT-4o | Best ecosystem — function calling, Assistants API, fine-tuning |
| Gemini 1.5 Pro | 1M token context is real — but hallucination variance is too |
| Open-source | Shockingly competitive; the cost difference at scale is significant |
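The "cost difference at scale" point is just arithmetic, but it's worth doing once. A sketch with placeholder per-million-token prices (the numbers below are illustrative assumptions, not any vendor's real pricing):

```python
# Hypothetical $/1M-token prices (input, output) -- illustrative only,
# chosen to show how the gap compounds, not to quote real vendors.
PRICES = {
    "hosted-frontier": (3.00, 15.00),
    "self-hosted-open-source": (0.20, 0.80),
}

def monthly_cost(model, requests, in_tokens, out_tokens):
    """Monthly cost of `requests` calls averaging the given token counts."""
    p_in, p_out = PRICES[model]
    return requests * (in_tokens * p_in + out_tokens * p_out) / 1_000_000

# 1M requests/month at ~2k input and ~500 output tokens each:
for model in PRICES:
    print(model, round(monthly_cost(model, 1_000_000, 2000, 500), 2))
# hosted-frontier 13500.0
# self-hosted-open-source 800.0
```

At those (made-up) rates the gap is over 15x per month, which is why "shockingly competitive" open-source quality changes the calculus even before you factor in serving overhead.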
Full benchmark writeup — including eval prompts and cost-per-1k-tokens breakdown — is live on the blog.