// jaymes in the stack
The LLM API comparison nobody
was being honest about
Issue #001  ·  Feb 2026  ·  3 min read

Hey —

I've been heads-down running benchmarks this week. Turned into a rabbit hole. Sharing everything below.

 
▷ This week in the stack

The LLM API Landscape Is Finally Interesting Again

For a while, the answer to "which LLM API should I build on?" was basically "GPT-4, obviously," and we all moved on. That's no longer true. If you're still defaulting to OpenAI out of habit, you might be leaving real performance and cost gains on the table.

I spent a few weeks running the same eval suite across Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro, and a handful of open-source models for tasks that actually come up in production: structured JSON extraction, multi-turn reasoning, RAG summarization, and long-context document parsing.
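For the structured-JSON tasks, the scoring doesn't need to be fancy. Here's a minimal sketch of the kind of pass/fail check I mean (illustrative, not my actual suite; `score_json` and the example schema are made up for this sketch):

```python
import json

def score_json(raw: str, required_keys: set) -> bool:
    """Pass only if the model returned parseable JSON with every required key."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required_keys <= obj.keys()

# Two typical model responses to the same extraction prompt:
clean  = '{"name": "Ada", "email": "ada@example.com"}'
chatty = 'Sure! Here is the JSON: {"name": "Ada"}'

print(score_json(clean,  {"name", "email"}))  # True
print(score_json(chatty, {"name", "email"}))  # False: preamble breaks the parse
```

Run the same check over every model's output for the same prompt set and you get a comparable pass rate per model, which is most of what the benchmark table below boils down to.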

Here's what surprised me:

Model              Verdict
Claude 3.5 Sonnet  Won on instruction-following and structured-output fidelity
GPT-4o             Best ecosystem: function calling, Assistants API, fine-tuning
Gemini 1.5 Pro     The 1M-token context is real, but so is the hallucination variance
Open-source        Shockingly competitive; the cost difference at scale is significant

Full benchmark writeup — including eval prompts and cost-per-1k-tokens breakdown — is live on the blog.
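The cost math behind that breakdown is simple enough to sketch here. The rates below are placeholders, not any provider's real pricing; check each pricing page for current numbers:

```python
def cost_usd(input_tokens, output_tokens, in_per_1k, out_per_1k):
    """Per-call cost from separate input/output per-1k-token rates."""
    return input_tokens / 1000 * in_per_1k + output_tokens / 1000 * out_per_1k

# Placeholder rates only (NOT real pricing): a RAG-style call with a big
# input context and a short generated answer.
per_call = cost_usd(input_tokens=8_000, output_tokens=1_000,
                    in_per_1k=0.003, out_per_1k=0.015)
print(f"${per_call:.3f} per call")  # $0.039 per call
```

The asymmetry matters: for RAG and long-context parsing, input tokens dominate, so a model with cheap input pricing can win on cost even if its output rate looks worse.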

► Read the full breakdown →
 
▷ Tool worth knowing
AI/ML API

I've been routing experimental API calls through AI/ML API because they give unified access to 200+ models — Claude, GPT-4, Gemini, Llama, Mixtral — under one API key. No juggling five billing dashboards. One OpenAI-compatible endpoint.
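"OpenAI-compatible" just means the request shape stays the same and only the base URL and model string change. A dependency-free sketch of that pattern (the base URL and model ID below are assumptions for illustration; in practice you'd point the official `openai` client's `base_url` at the aggregator and take exact model IDs from their docs):

```python
import json
import urllib.request

BASE_URL = "https://api.aimlapi.com/v1"  # assumed endpoint; confirm in provider docs

def build_chat_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build the same OpenAI-style chat POST regardless of which model serves it."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )

# Only the model string changes between providers (IDs here are illustrative):
req = build_chat_request("claude-3-5-sonnet", "Extract the invoice fields.", "sk-demo")
print(req.get_method(), req.full_url)
```

Swapping `"claude-3-5-sonnet"` for a GPT, Gemini, or Llama model ID is the whole migration, which is exactly what makes A/B-ing models across providers cheap to set up.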

← affiliate link  ·  aimlapi.com/?via=jaymes

 
▷ Quick hits
Llama 3.1 405B is matching GPT-4 on several reasoning benchmarks at ~1/10th the cost
Gemini Flash pricing dropped again — worth revisiting for high-volume classification pipelines
Qwen 2.5 72B is underrated for multilingual workloads — quietly excellent
 

Drop me a reply if you want the raw results CSV. Happy to share.

— Jaymes

Berkeley EECS  ·  Developer tools & AI/ML infrastructure
@jaymes_stack on X  ·  jaymesinthestack.com

P.S.  The open-source model results genuinely caught me off guard — one of them outperformed GPT-4o on 3 of my 6 eval categories at roughly 1/10th the cost. Full breakdown in the blog post. You'll want to see the table.

You're receiving this because you subscribed at jaymesinthestack.beehiiv.com
Unsubscribe  ·  Update preferences

JITS #001

 
