// jaymes in the stack
The LLM API comparison nobody
was being honest about
Issue #001  ·  Feb 2026  ·  3 min read

Hey —

I've been heads-down running benchmarks this week. Turned into a rabbit hole. Sharing everything below.

 
▷ This week in the stack

The LLM API Landscape Is Finally Interesting Again

For a while, the answer to "which LLM API should I build on?" was basically "GPT-4, obviously," and we all moved on. That's no longer true. If you're still defaulting to OpenAI out of habit, you might be leaving real performance and cost gains on the table.

I spent a few weeks running the same eval suite across Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro, and a handful of open-source models for tasks that actually come up in production: structured JSON extraction, multi-turn reasoning, RAG summarization, and long-context document parsing.
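For the structured-JSON tasks, the scoring doesn't need to be fancy. Here's a minimal sketch of the kind of pass/fail check I mean (illustrative, not my actual suite; `score_json` and the example schema are made up for this sketch):

```python
import json

def score_json(raw: str, required_keys: set) -> bool:
    """Pass only if the model returned parseable JSON with every required key."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required_keys <= obj.keys()

# Two typical model responses to the same extraction prompt:
clean  = '{"name": "Ada", "email": "ada@example.com"}'
chatty = 'Sure! Here is the JSON: {"name": "Ada"}'

print(score_json(clean,  {"name", "email"}))  # True
print(score_json(chatty, {"name", "email"}))  # False: preamble breaks the parse
```

Run the same check over every model's output for the same prompt set and you get a comparable pass rate per model, which is most of what the benchmark table below boils down to.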

Here's what surprised me:

Model              Verdict
Claude 3.5 Sonnet  Won on instruction-following and structured-output fidelity
GPT-4o             Best ecosystem: function calling, Assistants API, fine-tuning
Gemini 1.5 Pro     The 1M-token context is real, but so is the hallucination variance
Open-source        Shockingly competitive; the cost difference at scale is significant

Full benchmark writeup — including eval prompts and cost-per-1k-tokens breakdown — is live on the blog.
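The cost math behind that breakdown is simple enough to sketch here. The rates below are placeholders, not any provider's real pricing; check each pricing page for current numbers:

```python
def cost_usd(input_tokens, output_tokens, in_per_1k, out_per_1k):
    """Per-call cost from separate input/output per-1k-token rates."""
    return input_tokens / 1000 * in_per_1k + output_tokens / 1000 * out_per_1k

# Placeholder rates only (NOT real pricing): a RAG-style call with a big
# input context and a short generated answer.
per_call = cost_usd(input_tokens=8_000, output_tokens=1_000,
                    in_per_1k=0.003, out_per_1k=0.015)
print(f"${per_call:.3f} per call")  # $0.039 per call
```

The asymmetry matters: for RAG and long-context parsing, input tokens dominate, so a model with cheap input pricing can win on cost even if its output rate looks worse.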

► Read the full breakdown →
 
▷ Tool worth knowing
AI/ML API

I've been routing experimental API calls through AI/ML API because they give unified access to 200+ models — Claude, GPT-4, Gemini, Llama, Mixtral — under one API key. No juggling five billing dashboards. One OpenAI-compatible endpoint.
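"OpenAI-compatible" just means the request shape stays the same and only the base URL and model string change. A dependency-free sketch of that pattern (the base URL and model ID below are assumptions for illustration; in practice you'd point the official `openai` client's `base_url` at the aggregator and take exact model IDs from their docs):

```python
import json
import urllib.request

BASE_URL = "https://api.aimlapi.com/v1"  # assumed endpoint; confirm in provider docs

def build_chat_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build the same OpenAI-style chat POST regardless of which model serves it."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )

# Only the model string changes between providers (IDs here are illustrative):
req = build_chat_request("claude-3-5-sonnet", "Extract the invoice fields.", "sk-demo")
print(req.get_method(), req.full_url)
```

Swapping `"claude-3-5-sonnet"` for a GPT, Gemini, or Llama model ID is the whole migration, which is exactly what makes A/B-ing models across providers cheap to set up.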

← affiliate link  ·  aimlapi.com/?via=jaymes

 
▷ Quick hits
Llama 3.1 405B is matching GPT-4 on several reasoning benchmarks at ~1/10th the cost
Gemini Flash pricing dropped again — worth revisiting for high-volume classification pipelines
Qwen 2.5 72B is underrated for multilingual workloads — quietly excellent
 

Drop me a reply if you want the raw results CSV. Happy to share.

— Jaymes

Berkeley EECS  ·  Developer tools & AI/ML infrastructure
@jaymes_stack on X  ·  jaymesinthestack.com

P.S.  The open-source model results genuinely caught me off guard — one of them outperformed GPT-4o on 3 of my 6 eval categories at roughly 1/10th the cost. Full breakdown in the blog post. You'll want to see the table.

You're receiving this because you subscribed at jaymesinthestack.beehiiv.com
Unsubscribe  ·  Update preferences

JITS #001

 
